linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/5] Shared memory for shared extents
@ 2021-10-22 20:15 Goldwyn Rodrigues
  2021-10-22 20:15 ` [RFC PATCH 1/5] mm: Use file parameter to determine bdi Goldwyn Rodrigues
                   ` (5 more replies)
  0 siblings, 6 replies; 11+ messages in thread
From: Goldwyn Rodrigues @ 2021-10-22 20:15 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-btrfs, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

This is an attempt to reduce the memory footprint by using a shared
page(s) for shared extent(s) in the filesystem. I am hoping to start a
discussion to iron out the details for implementation.

Abstract
If mutiple files share an extent, reads from each of the files would
read individual page copies in the inode pagecache mapping, leading to
waste of memory. Instead add the read pages of the filesystem to
underlying device's bd_inode as opposed to file's inode mapping. The
cost is that file offset to device offset conversion happens early in
the read cycle.

Motivation:
 - Reduce memory usage for shared extents
 - Ease DAX implementation for shared extents
 - Reduce Container memory utilization

Implementation
In the read path, pages are read and added to the block_device's
inode's mapping as opposed to the inode's mapping. This is limited
to reads, while write's (and read before writes) still go through
inode's i_mapping. The path does check the inode's i_mapping before
falling back to block device's i_mapping to read pages which may be
dirty. The cost of the operation is file_to_device_offset() translation
on reads. The translation should return a valid value only in case
the file is CoW.

This also means that page->mapping may not point to file's mapping.

Questions:
1. Are there security implications for performing this for read-only
pages? An alternate idea is to add a "struct fspage", which would be
pointed by file's mapping and would point to the block device's page.
Multiple files with shared extents have their own independent fspage
pointing to the same page mapped to block device's mapping.
Any security parameters, if required, would be in this structure. The
advantage of this approach is it would be more flexible with respect to
CoW when the page is dirtied after reads. With the current approach, a
read for write is an independent operation so we can end up with two
copies of the same page. This implementation got complicated too quickly.

2. Should pages be dropped after writebacks (and clone_range) to avoid
duplicate copies?

Limitations:
1. The filesystem have exactly one underlying device.
2. Page size should be equal to filesystem block size

Goldwyn Rodrigues (5):
  mm: Use file parameter to determine bdi
  mm: Switch mapping to device mapping
  btrfs: Add sharedext mount option
  btrfs: Set s_bdev for btrfs super block
  btrfs: function to convert file offset to device offset

 fs/btrfs/ctree.h   |  1 +
 fs/btrfs/file.c    | 42 ++++++++++++++++++++++++++++++++++++++++--
 fs/btrfs/super.c   |  7 +++++++
 include/linux/fs.h |  7 ++++++-
 mm/filemap.c       | 34 ++++++++++++++++++++++++++--------
 mm/readahead.c     |  3 +++
 6 files changed, 83 insertions(+), 11 deletions(-)

-- 
2.33.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC PATCH 1/5] mm: Use file parameter to determine bdi
  2021-10-22 20:15 [RFC PATCH 0/5] Shared memory for shared extents Goldwyn Rodrigues
@ 2021-10-22 20:15 ` Goldwyn Rodrigues
  2021-10-22 20:15 ` [RFC PATCH 2/5] mm: Switch mapping to device mapping Goldwyn Rodrigues
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Goldwyn Rodrigues @ 2021-10-22 20:15 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-btrfs, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

This is done so the bdi inode is derived correctly when file->f_mapping
is not the same as mapping passed. Set conditionally because some
callee pass NULL for file pointer.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 mm/readahead.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/readahead.c b/mm/readahead.c
index d589f147f4c2..9f303a31f650 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -443,6 +443,9 @@ static void ondemand_readahead(struct readahead_control *ractl,
 	unsigned long index = readahead_index(ractl);
 	pgoff_t prev_index;
 
+	if (ractl->file)
+		bdi = inode_to_bdi(file_inode(ractl->file));
+
 	/*
 	 * If the request exceeds the readahead window, allow the read to
 	 * be up to the optimal hardware IO size
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [RFC PATCH 2/5] mm: Switch mapping to device mapping
  2021-10-22 20:15 [RFC PATCH 0/5] Shared memory for shared extents Goldwyn Rodrigues
  2021-10-22 20:15 ` [RFC PATCH 1/5] mm: Use file parameter to determine bdi Goldwyn Rodrigues
@ 2021-10-22 20:15 ` Goldwyn Rodrigues
  2021-10-23  1:36   ` Matthew Wilcox
  2021-10-22 20:15 ` [RFC PATCH 3/5] btrfs: Add sharedext mount option Goldwyn Rodrigues
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 11+ messages in thread
From: Goldwyn Rodrigues @ 2021-10-22 20:15 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-btrfs, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Get the device offset and last_index from the filesystem and read it
from the device directly. If the device page(s) have been read before,
it can be picked up directly for reads. If not the page is read from the
device. The page would be added to device's mapping instead of the file.

Con:
This moves the read from file's readpage to block device's readpage.
Post read filesystem checks such as data CRC will be ignored.
An option is to change readpage() prototype to include file offset and
call file->f_mapping->a_ops->readpage().

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/file.c    |  2 +-
 include/linux/fs.h |  7 ++++++-
 mm/filemap.c       | 34 ++++++++++++++++++++++++++--------
 3 files changed, 33 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index ba44039071e5..e171d822a05e 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3654,7 +3654,7 @@ static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 			return ret;
 	}
 
-	return filemap_read(iocb, to, ret);
+	return filemap_read(iocb, to, ret, NULL);
 }
 
 const struct file_operations btrfs_file_operations = {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 640574294216..9409f703bcd1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3229,8 +3229,13 @@ extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
 extern int generic_write_check_limits(struct file *file, loff_t pos,
 		loff_t *count);
 extern int generic_file_rw_checks(struct file *file_in, struct file *file_out);
+
+typedef pgoff_t (*file_offset_to_device_t)(struct file *filp, loff_t pos,
+		size_t len, pgoff_t *last_index);
+
 ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *to,
-		ssize_t already_read);
+		ssize_t already_read,
+		file_offset_to_device_t file_offset_to_device);
 extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
diff --git a/mm/filemap.c b/mm/filemap.c
index d1458ecf2f51..21033503b0a1 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2445,14 +2445,17 @@ static int filemap_readahead(struct kiocb *iocb, struct file *file,
 }
 
 static int filemap_get_pages(struct kiocb *iocb, struct iov_iter *iter,
-		struct pagevec *pvec)
+		struct pagevec *pvec,
+		file_offset_to_device_t file_offset_to_device)
 {
 	struct file *filp = iocb->ki_filp;
 	struct address_space *mapping = filp->f_mapping;
 	struct file_ra_state *ra = &filp->f_ra;
+	struct super_block *sb = file_inode(filp)->i_sb;
 	pgoff_t index = iocb->ki_pos >> PAGE_SHIFT;
 	pgoff_t last_index;
 	struct page *page;
+	bool switched = false;
 	int err = 0;
 
 	last_index = DIV_ROUND_UP(iocb->ki_pos + iter->count, PAGE_SIZE);
@@ -2461,6 +2464,19 @@ static int filemap_get_pages(struct kiocb *iocb, struct iov_iter *iter,
 		return -EINTR;
 
 	filemap_get_read_batch(mapping, index, last_index, pvec);
+	if (!pagevec_count(pvec) && !switched && file_offset_to_device) {
+		pgoff_t idx, lidx;
+
+		idx = file_offset_to_device(filp, iocb->ki_pos,
+				iter->count, &lidx);
+		if (idx) {
+			mapping = sb->s_bdev->bd_inode->i_mapping;
+			index = idx;
+			last_index = lidx;
+			switched = true;
+			goto retry;
+		}
+	}
 	if (!pagevec_count(pvec)) {
 		if (iocb->ki_flags & IOCB_NOIO)
 			return -EAGAIN;
@@ -2471,8 +2487,7 @@ static int filemap_get_pages(struct kiocb *iocb, struct iov_iter *iter,
 	if (!pagevec_count(pvec)) {
 		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
 			return -EAGAIN;
-		err = filemap_create_page(filp, mapping,
-				iocb->ki_pos >> PAGE_SHIFT, pvec);
+		err = filemap_create_page(filp, mapping, index, pvec);
 		if (err == AOP_TRUNCATED_PAGE)
 			goto retry;
 		return err;
@@ -2517,12 +2532,13 @@ static int filemap_get_pages(struct kiocb *iocb, struct iov_iter *iter,
  * a negative error number.
  */
 ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
-		ssize_t already_read)
+		ssize_t already_read,
+		file_offset_to_device_t file_offset_to_device)
 {
 	struct file *filp = iocb->ki_filp;
 	struct file_ra_state *ra = &filp->f_ra;
 	struct address_space *mapping = filp->f_mapping;
-	struct inode *inode = mapping->host;
+	struct inode *inode = file_inode(filp);
 	struct pagevec pvec;
 	int i, error = 0;
 	bool writably_mapped;
@@ -2547,7 +2563,8 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 		if ((iocb->ki_flags & IOCB_WAITQ) && already_read)
 			iocb->ki_flags |= IOCB_NOWAIT;
 
-		error = filemap_get_pages(iocb, iter, &pvec);
+		error = filemap_get_pages(iocb, iter, &pvec,
+				file_offset_to_device);
 		if (error < 0)
 			break;
 
@@ -2560,6 +2577,7 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 		 * another truncate extends the file - this is desired though).
 		 */
 		isize = i_size_read(inode);
+
 		if (unlikely(iocb->ki_pos >= isize))
 			goto put_pages;
 		end_offset = min_t(loff_t, isize, iocb->ki_pos + iter->count);
@@ -2586,7 +2604,7 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 					     page_size - offset);
 			size_t copied;
 
-			if (end_offset < page_offset(page))
+			if (!file_offset_to_device && (end_offset < page_offset(page)))
 				break;
 			if (i > 0)
 				mark_page_accessed(page);
@@ -2698,7 +2716,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 			return retval;
 	}
 
-	return filemap_read(iocb, iter, retval);
+	return filemap_read(iocb, iter, retval, NULL);
 }
 EXPORT_SYMBOL(generic_file_read_iter);
 
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [RFC PATCH 3/5] btrfs: Add sharedext mount option
  2021-10-22 20:15 [RFC PATCH 0/5] Shared memory for shared extents Goldwyn Rodrigues
  2021-10-22 20:15 ` [RFC PATCH 1/5] mm: Use file parameter to determine bdi Goldwyn Rodrigues
  2021-10-22 20:15 ` [RFC PATCH 2/5] mm: Switch mapping to device mapping Goldwyn Rodrigues
@ 2021-10-22 20:15 ` Goldwyn Rodrigues
  2021-10-22 20:15 ` [RFC PATCH 4/5] btrfs: Set s_bdev for btrfs super block Goldwyn Rodrigues
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Goldwyn Rodrigues @ 2021-10-22 20:15 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-btrfs, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

The mount option to use shared pages for shared extents. If set, pass
custom file_offset_to_device function to filemap_read().

sharedext may not describe it well. Suggest a better name?

TODO Checks:
Check if the pagesize == blocksize

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h | 1 +
 fs/btrfs/super.c | 6 ++++++
 2 files changed, 7 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4aa4f4760b72..5c97112143d6 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1415,6 +1415,7 @@ enum {
 	BTRFS_MOUNT_DISCARD_ASYNC		= (1UL << 28),
 	BTRFS_MOUNT_IGNOREBADROOTS		= (1UL << 29),
 	BTRFS_MOUNT_IGNOREDATACSUMS		= (1UL << 30),
+	BTRFS_MOUNT_SHAREDEXT			= (1UL << 31),
 };
 
 #define BTRFS_DEFAULT_COMMIT_INTERVAL	(30)
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index d07b18b2b250..432f40f72466 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -397,6 +397,7 @@ enum {
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
 	Opt_ref_verify,
 #endif
+	Opt_sharedext,
 	Opt_err,
 };
 
@@ -471,6 +472,7 @@ static const match_table_t tokens = {
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
 	{Opt_ref_verify, "ref_verify"},
 #endif
+	{Opt_sharedext, "sharedext"},
 	{Opt_err, NULL},
 };
 
@@ -1013,6 +1015,10 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 			btrfs_set_opt(info->mount_opt, REF_VERIFY);
 			break;
 #endif
+		case Opt_sharedext:
+			btrfs_info(info, "enabling shared memory for shared extents");
+			btrfs_set_opt(info->mount_opt, SHAREDEXT);
+			break;
 		case Opt_err:
 			btrfs_err(info, "unrecognized mount option '%s'", p);
 			ret = -EINVAL;
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [RFC PATCH 4/5] btrfs: Set s_bdev for btrfs super block
  2021-10-22 20:15 [RFC PATCH 0/5] Shared memory for shared extents Goldwyn Rodrigues
                   ` (2 preceding siblings ...)
  2021-10-22 20:15 ` [RFC PATCH 3/5] btrfs: Add sharedext mount option Goldwyn Rodrigues
@ 2021-10-22 20:15 ` Goldwyn Rodrigues
  2021-10-22 20:15 ` [RFC PATCH 5/5] btrfs: function to convert file offset to device offset Goldwyn Rodrigues
  2021-10-23  1:43 ` [RFC PATCH 0/5] Shared memory for shared extents Matthew Wilcox
  5 siblings, 0 replies; 11+ messages in thread
From: Goldwyn Rodrigues @ 2021-10-22 20:15 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-btrfs, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

s_bdev is not set. Use the latest bdev to setup s_bdev.
reads are performed on block device directly and derived from super
block ->s_bdev field.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/super.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 432f40f72466..9588a42d7a49 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1749,6 +1749,7 @@ static struct dentry *btrfs_mount_root(struct file_system_type *fs_type,
 		deactivate_locked_super(s);
 		return ERR_PTR(error);
 	}
+	s->s_bdev = bdev;
 
 	return dget(s->s_root);
 
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [RFC PATCH 5/5] btrfs: function to convert file offset to device offset
  2021-10-22 20:15 [RFC PATCH 0/5] Shared memory for shared extents Goldwyn Rodrigues
                   ` (3 preceding siblings ...)
  2021-10-22 20:15 ` [RFC PATCH 4/5] btrfs: Set s_bdev for btrfs super block Goldwyn Rodrigues
@ 2021-10-22 20:15 ` Goldwyn Rodrigues
  2021-10-23  1:43 ` [RFC PATCH 0/5] Shared memory for shared extents Matthew Wilcox
  5 siblings, 0 replies; 11+ messages in thread
From: Goldwyn Rodrigues @ 2021-10-22 20:15 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-btrfs, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

btrfs_file_to_device_offset() converts a file offset to device offset.
It also calculates the last_index which represents the last page in the
range which is within the extent.

btrfs_file_to_device_offset() is only passed conditionally based on if
BTRFS_SHAREDEXT is set in the mount flag.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/file.c | 42 ++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 40 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index e171d822a05e..f0b97d020575 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3643,18 +3643,56 @@ static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
 	return ret;
 }
 
+static pgoff_t btrfs_file_offset_to_device(struct file *filp, loff_t pos,
+		size_t len, pgoff_t *last_index)
+{
+	struct extent_map *em;
+	struct btrfs_inode *inode = BTRFS_I(file_inode(filp));
+	u64 device_offset;
+	u64 device_len;
+
+	if (inode->flags & BTRFS_INODE_NODATACOW)
+		return 0;
+
+	em = btrfs_get_extent(inode, NULL, 0, pos, len);
+
+	device_offset = em->block_start;
+	if (device_offset == EXTENT_MAP_HOLE) {
+		free_extent_map(em);
+		return 0;
+	}
+
+	/* Delalloc should be in file's pagecache */
+	BUG_ON(device_offset == EXTENT_MAP_DELALLOC);
+
+	device_offset = (device_offset + (pos - em->start)) >> PAGE_SHIFT;
+	device_len = (em->len - (pos - em->start)) >> PAGE_SHIFT;
+	*last_index = device_offset + device_len;
+
+	free_extent_map(em);
+
+	return device_offset;
+}
+
 static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
 	ssize_t ret = 0;
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct btrfs_fs_info *fs_info;
+	file_offset_to_device_t file_offset_to_device = NULL;
 
 	if (iocb->ki_flags & IOCB_DIRECT) {
 		ret = btrfs_direct_read(iocb, to);
 		if (ret < 0 || !iov_iter_count(to) ||
-		    iocb->ki_pos >= i_size_read(file_inode(iocb->ki_filp)))
+		    iocb->ki_pos >= i_size_read(inode))
 			return ret;
 	}
 
-	return filemap_read(iocb, to, ret, NULL);
+	fs_info = btrfs_sb(inode->i_sb);
+	if (btrfs_test_opt(fs_info, SHAREDEXT))
+		file_offset_to_device = btrfs_file_offset_to_device;
+
+	return filemap_read(iocb, to, ret, file_offset_to_device);
 }
 
 const struct file_operations btrfs_file_operations = {
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 2/5] mm: Switch mapping to device mapping
  2021-10-22 20:15 ` [RFC PATCH 2/5] mm: Switch mapping to device mapping Goldwyn Rodrigues
@ 2021-10-23  1:36   ` Matthew Wilcox
  0 siblings, 0 replies; 11+ messages in thread
From: Matthew Wilcox @ 2021-10-23  1:36 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: linux-fsdevel, linux-btrfs, Goldwyn Rodrigues

On Fri, Oct 22, 2021 at 03:15:02PM -0500, Goldwyn Rodrigues wrote:
> Get the device offset and last_index from the filesystem and read it
> from the device directly. If the device page(s) have been read before,
> it can be picked up directly for reads. If not the page is read from the
> device. The page would be added to device's mapping instead of the file.

I really don't like this way of doing it.

Why doesn't the filesystem choose where it's going to cache the data and
call filemap_readpage on that inode, instead of having the generic code
call back into the filesystem to decide where to cache the data?

You know that it's a shared extent before you call into filemap_read(),
so you know it shouldn't be cached in the local inode.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 0/5] Shared memory for shared extents
  2021-10-22 20:15 [RFC PATCH 0/5] Shared memory for shared extents Goldwyn Rodrigues
                   ` (4 preceding siblings ...)
  2021-10-22 20:15 ` [RFC PATCH 5/5] btrfs: function to convert file offset to device offset Goldwyn Rodrigues
@ 2021-10-23  1:43 ` Matthew Wilcox
  2021-10-25 14:53   ` Goldwyn Rodrigues
  5 siblings, 1 reply; 11+ messages in thread
From: Matthew Wilcox @ 2021-10-23  1:43 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: linux-fsdevel, linux-btrfs, Goldwyn Rodrigues

On Fri, Oct 22, 2021 at 03:15:00PM -0500, Goldwyn Rodrigues wrote:
> This is an attempt to reduce the memory footprint by using a shared
> page(s) for shared extent(s) in the filesystem. I am hoping to start a
> discussion to iron out the details for implementation.

When you say "Shared extents", you mean reflinks, which are COW, right?
A lot of what you say here confuses me because you talk about dirty
shared pages, and that doesn't make any sense.  A write fault or
call to write() should be intercepted by the filesystem in order to
break the COW.  There's no such thing as a shared page which is dirty,
or has been written back.

You might (or might not!) choose to copy the pages from the shared
extent to the inode when breaking the COW.  But you can't continue
to share them!

> Abstract
> If mutiple files share an extent, reads from each of the files would
> read individual page copies in the inode pagecache mapping, leading to
> waste of memory. Instead add the read pages of the filesystem to
> underlying device's bd_inode as opposed to file's inode mapping. The
> cost is that file offset to device offset conversion happens early in
> the read cycle.
> 
> Motivation:
>  - Reduce memory usage for shared extents
>  - Ease DAX implementation for shared extents
>  - Reduce Container memory utilization
> 
> Implementation
> In the read path, pages are read and added to the block_device's
> inode's mapping as opposed to the inode's mapping. This is limited
> to reads, while write's (and read before writes) still go through
> inode's i_mapping. The path does check the inode's i_mapping before
> falling back to block device's i_mapping to read pages which may be
> dirty. The cost of the operation is file_to_device_offset() translation
> on reads. The translation should return a valid value only in case
> the file is CoW.
> 
> This also means that page->mapping may not point to file's mapping.
> 
> Questions:
> 1. Are there security implications for performing this for read-only
> pages? An alternate idea is to add a "struct fspage", which would be
> pointed by file's mapping and would point to the block device's page.
> Multiple files with shared extents have their own independent fspage
> pointing to the same page mapped to block device's mapping.
> Any security parameters, if required, would be in this structure. The
> advantage of this approach is it would be more flexible with respect to
> CoW when the page is dirtied after reads. With the current approach, a
> read for write is an independent operation so we can end up with two
> copies of the same page. This implementation got complicated too quickly.
> 
> 2. Should pages be dropped after writebacks (and clone_range) to avoid
> duplicate copies?
> 
> Limitations:
> 1. The filesystem have exactly one underlying device.
> 2. Page size should be equal to filesystem block size
> 
> Goldwyn Rodrigues (5):
>   mm: Use file parameter to determine bdi
>   mm: Switch mapping to device mapping
>   btrfs: Add sharedext mount option
>   btrfs: Set s_bdev for btrfs super block
>   btrfs: function to convert file offset to device offset
> 
>  fs/btrfs/ctree.h   |  1 +
>  fs/btrfs/file.c    | 42 ++++++++++++++++++++++++++++++++++++++++--
>  fs/btrfs/super.c   |  7 +++++++
>  include/linux/fs.h |  7 ++++++-
>  mm/filemap.c       | 34 ++++++++++++++++++++++++++--------
>  mm/readahead.c     |  3 +++
>  6 files changed, 83 insertions(+), 11 deletions(-)
> 
> -- 
> 2.33.1
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 0/5] Shared memory for shared extents
  2021-10-23  1:43 ` [RFC PATCH 0/5] Shared memory for shared extents Matthew Wilcox
@ 2021-10-25 14:53   ` Goldwyn Rodrigues
  2021-10-25 15:43     ` Matthew Wilcox
  0 siblings, 1 reply; 11+ messages in thread
From: Goldwyn Rodrigues @ 2021-10-25 14:53 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-btrfs

On  2:43 23/10, Matthew Wilcox wrote:
> On Fri, Oct 22, 2021 at 03:15:00PM -0500, Goldwyn Rodrigues wrote:
> > This is an attempt to reduce the memory footprint by using a shared
> > page(s) for shared extent(s) in the filesystem. I am hoping to start a
> > discussion to iron out the details for implementation.
> 
> When you say "Shared extents", you mean reflinks, which are COW, right?

Yes, shared extents are extents which are shared on disk by two or more
files. Yes, same as reflinks. Just to explain with an example:

If two files, f1 and f2 have shared extent(s), and both files are read. Each
file's mapping->i_pages will hold a copy of the contents of the shared
extent on disk. So, f1->mapping will have one copy and f2->mapping will
have another copy.

For reads (and only reads), if we use underlying device's mapping, we
can save on duplicate copy of the pages.

> A lot of what you say here confuses me because you talk about dirty
> shared pages, and that doesn't make any sense.  A write fault or
> call to write() should be intercepted by the filesystem in order to
> break the COW.  There's no such thing as a shared page which is dirty,
> or has been written back.


I am _not_ talking of writes at all here. However, just to clarify the
air: For a CoW environment,  if a write happens, it becomes the
filesystems responsibility to write it to a new device location. These
pages should still end up in inode's i_mapping, as it is done currently.
This also includes pages which are read before modifying/writing.
The filesystem will writeback these pages to a new device location.
My question in the end was if we should release these  pages
or perhaps move them to device's mapping once they are no
longer dirty - for subsequence reads.

> 
> You might (or might not!) choose to copy the pages from the shared
> extent to the inode when breaking the COW.  But you can't continue
> to share them!

Since a write goes through inode's i_mapping, it will go to a new device
location and will break the share. There is no change in this
phenomenon. What I am changing is only the "pure" reads from CoW
filesystems. IOW, there is an order to look pagecache for reads - 
1. inode->i_mapping->i_pages
2. inode->sb->s_bdev->bd_inode->i_mapping->i_pages (after file offset to
device).

This way reads from dirty pages will go through inode->i_mapping.

> 
> > Abstract
> > If mutiple files share an extent, reads from each of the files would
> > read individual page copies in the inode pagecache mapping, leading to
> > waste of memory. Instead add the read pages of the filesystem to
> > underlying device's bd_inode as opposed to file's inode mapping. The
> > cost is that file offset to device offset conversion happens early in
> > the read cycle.
> > 
> > Motivation:
> >  - Reduce memory usage for shared extents
> >  - Ease DAX implementation for shared extents
> >  - Reduce Container memory utilization
> > 
> > Implementation
> > In the read path, pages are read and added to the block_device's
> > inode's mapping as opposed to the inode's mapping. This is limited
> > to reads, while write's (and read before writes) still go through
> > inode's i_mapping. The path does check the inode's i_mapping before
> > falling back to block device's i_mapping to read pages which may be
> > dirty. The cost of the operation is file_to_device_offset() translation
> > on reads. The translation should return a valid value only in case
> > the file is CoW.
> > 
> > This also means that page->mapping may not point to file's mapping.
> > 
> > Questions:
> > 1. Are there security implications for performing this for read-only
> > pages? An alternate idea is to add a "struct fspage", which would be
> > pointed by file's mapping and would point to the block device's page.
> > Multiple files with shared extents have their own independent fspage
> > pointing to the same page mapped to block device's mapping.
> > Any security parameters, if required, would be in this structure. The
> > advantage of this approach is it would be more flexible with respect to
> > CoW when the page is dirtied after reads. With the current approach, a
> > read for write is an independent operation so we can end up with two
> > copies of the same page. This implementation got complicated too quickly.
> > 
> > 2. Should pages be dropped after writebacks (and clone_range) to avoid
> > duplicate copies?
> > 
> > Limitations:
> > 1. The filesystem have exactly one underlying device.
> > 2. Page size should be equal to filesystem block size
> > 
> > Goldwyn Rodrigues (5):
> >   mm: Use file parameter to determine bdi
> >   mm: Switch mapping to device mapping
> >   btrfs: Add sharedext mount option
> >   btrfs: Set s_bdev for btrfs super block
> >   btrfs: function to convert file offset to device offset
> > 
> >  fs/btrfs/ctree.h   |  1 +
> >  fs/btrfs/file.c    | 42 ++++++++++++++++++++++++++++++++++++++++--
> >  fs/btrfs/super.c   |  7 +++++++
> >  include/linux/fs.h |  7 ++++++-
> >  mm/filemap.c       | 34 ++++++++++++++++++++++++++--------
> >  mm/readahead.c     |  3 +++
> >  6 files changed, 83 insertions(+), 11 deletions(-)
> > 
> > -- 
> > 2.33.1
> > 

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 0/5] Shared memory for shared extents
  2021-10-25 14:53   ` Goldwyn Rodrigues
@ 2021-10-25 15:43     ` Matthew Wilcox
  2021-10-25 16:43       ` Goldwyn Rodrigues
  0 siblings, 1 reply; 11+ messages in thread
From: Matthew Wilcox @ 2021-10-25 15:43 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: linux-fsdevel, linux-btrfs

On Mon, Oct 25, 2021 at 09:53:01AM -0500, Goldwyn Rodrigues wrote:
> On  2:43 23/10, Matthew Wilcox wrote:
> > On Fri, Oct 22, 2021 at 03:15:00PM -0500, Goldwyn Rodrigues wrote:
> > > This is an attempt to reduce the memory footprint by using a shared
> > > page(s) for shared extent(s) in the filesystem. I am hoping to start a
> > > discussion to iron out the details for implementation.
> > 
> > When you say "Shared extents", you mean reflinks, which are COW, right?
> 
> Yes, shared extents are extents which are shared on disk by two or more
> files. Yes, same as reflinks. Just to explain with an example:
> 
> If two files, f1 and f2 have shared extent(s), and both files are read. Each
> file's mapping->i_pages will hold a copy of the contents of the shared
> extent on disk. So, f1->mapping will have one copy and f2->mapping will
> have another copy.
> 
> For reads (and only reads), if we use underlying device's mapping, we
> can save on duplicate copy of the pages.

Yes; I'm familiar with the problem.  Dave Chinner and I had a great
discussion about it at LCA a couple of years ago.

The implementation I've had in mind for a while is that the filesystem
either creates a separate inode for a shared extent, or (as you've
done here) uses the bdev's inode.  We can discuss the pros/cons of
that separately.

To avoid the double-lookup problem, I was intending to generalise DAX
entries into PFN entries.  That way, if the read() (or mmap read fault)
misses in the inode's cache, we can look up the shared extent cache,
and then cache the physical address of the memory in the inode.

That makes reclaim/eviction of the page in the shared extent more
expensive because you have to iterate all the inodes which share the
extent and remove the PFN entries before the page can be reused.

Perhaps we should have a Zoom meeting about this before producing duelling
patch series?  I can host if you're interested.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 0/5] Shared memory for shared extents
  2021-10-25 15:43     ` Matthew Wilcox
@ 2021-10-25 16:43       ` Goldwyn Rodrigues
  0 siblings, 0 replies; 11+ messages in thread
From: Goldwyn Rodrigues @ 2021-10-25 16:43 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-btrfs

On 16:43 25/10, Matthew Wilcox wrote:
> On Mon, Oct 25, 2021 at 09:53:01AM -0500, Goldwyn Rodrigues wrote:
> > On  2:43 23/10, Matthew Wilcox wrote:
> > > On Fri, Oct 22, 2021 at 03:15:00PM -0500, Goldwyn Rodrigues wrote:
> > > > This is an attempt to reduce the memory footprint by using a shared
> > > > page(s) for shared extent(s) in the filesystem. I am hoping to start a
> > > > discussion to iron out the details for implementation.
> > > 
> > > When you say "Shared extents", you mean reflinks, which are COW, right?
> > 
> > Yes, shared extents are extents which are shared on disk by two or more
> > files. Yes, same as reflinks. Just to explain with an example:
> > 
> > If two files, f1 and f2 have shared extent(s), and both files are read. Each
> > file's mapping->i_pages will hold a copy of the contents of the shared
> > extent on disk. So, f1->mapping will have one copy and f2->mapping will
> > have another copy.
> > 
> > For reads (and only reads), if we use underlying device's mapping, we
> > can save on duplicate copy of the pages.
> 
> Yes; I'm familiar with the problem.  Dave Chinner and I had a great
> discussion about it at LCA a couple of years ago.
> 
> The implementation I've had in mind for a while is that the filesystem
> either creates a separate inode for a shared extent, or (as you've
> done here) uses the bdev's inode.  We can discuss the pros/cons of
> that separately.
> 
> To avoid the double-lookup problem, I was intending to generalise DAX
> entries into PFN entries.  That way, if the read() (or mmap read fault)
> misses in the inode's cache, we can look up the shared extent cache,
> and then cache the physical address of the memory in the inode.

I am not sure I understand. Could you provide an example? Would this be
specific to DAX? What about standard block devices?

> 
> That makes reclaim/eviction of the page in the shared extent more
> expensive because you have to iterate all the inodes which share the
> extent and remove the PFN entries before the page can be reused.

Not sure of this, but won't it complicate things if there are different
shared extents in different files? Say shared extent SE1 belongs to f1
and f2, where as SE2 belongs to f2 and f3?

> 
> Perhaps we should have a Zoom meeting about this before producing duelling
> patch series?  I can host if you're interested.

Yes, I think that would be nice. I am in the central US Timezone.
If possible, I would like to add David Disseldorp who is based in
Germany.

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2021-10-25 16:43 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-22 20:15 [RFC PATCH 0/5] Shared memory for shared extents Goldwyn Rodrigues
2021-10-22 20:15 ` [RFC PATCH 1/5] mm: Use file parameter to determine bdi Goldwyn Rodrigues
2021-10-22 20:15 ` [RFC PATCH 2/5] mm: Switch mapping to device mapping Goldwyn Rodrigues
2021-10-23  1:36   ` Matthew Wilcox
2021-10-22 20:15 ` [RFC PATCH 3/5] btrfs: Add sharedext mount option Goldwyn Rodrigues
2021-10-22 20:15 ` [RFC PATCH 4/5] btrfs: Set s_bdev for btrfs super block Goldwyn Rodrigues
2021-10-22 20:15 ` [RFC PATCH 5/5] btrfs: function to convert file offset to device offset Goldwyn Rodrigues
2021-10-23  1:43 ` [RFC PATCH 0/5] Shared memory for shared extents Matthew Wilcox
2021-10-25 14:53   ` Goldwyn Rodrigues
2021-10-25 15:43     ` Matthew Wilcox
2021-10-25 16:43       ` Goldwyn Rodrigues

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).