* [PATCH v3 0/9] erofs: support page cache sharing between EROFS images in fscache mode
@ 2023-02-03  3:01 ` Jingbo Xu
  0 siblings, 0 replies; 22+ messages in thread
From: Jingbo Xu @ 2023-02-03  3:01 UTC (permalink / raw)
  To: xiang, chao, linux-erofs; +Cc: huyue2, linux-kernel, linux-fsdevel

v3:
- the page cache sharing now supports readahead:
    - add patch 1: support readahead in meta routine
    - patch 4: initialize f_ra of anonymous file in prep for readahead
    - patch 5: since now readahead is performed upon the data blob, the
      accurate inode size of data blob needs to be derived
- patch 6: filemap_read() is reused when implementing .read_iter()
- cover letter: ~20% memory usage reduction can be seen when
  distributing tensorflow:2.10.0 and tensorflow:2.10.1 at the same node
  (see "Effect" section of the cover letter)


RFC: https://lore.kernel.org/all/20230106125330.55529-1-jefflexu@linux.alibaba.com/
v2: https://lore.kernel.org/all/20230111083158.23462-1-jefflexu@linux.alibaba.com/


changes since RFC:
- patch 2: allocate an anonymous file (realfile) when file is opened,
  rather than allocate a single anonymous file for each blob at mount
  time
- patch 7: add 'sharecache' mount option to control if page cache
  sharing shall be enabled



[Background]
=============
EROFS has supported chunk deduplication across different images since
v6.1 to minimize disk usage.  Furthermore, inodes in different images
can share the page cache for these deduplicated chunks to reduce memory
usage.  This is especially useful in container scenarios, as
deduplication is a requisite for container images.


[Implementation]
================
This is achieved by managing the page cache of deduplicated chunks in
the blob's address space.  In this way, all inodes sharing a
deduplicated chunk refer to and share the page cache in the blob's
address space.
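
The idea above can be illustrated with a toy userspace model (not kernel
code; every name below is hypothetical).  Cached pages are keyed by an
owner and a page index.  Without sharing the owner is the reading inode;
with sharing both reads are redirected to the blob, so the second reader
hits pages the first one already populated.  The model assumes, for
simplicity, that the chunk starts at page index 0 in both the file and
the blob.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_PAGES 64

struct toy_page_key {
	int owner;		/* inode id or blob id, depending on mode */
	unsigned long index;	/* page index within the owner */
};

struct toy_page_cache {
	struct toy_page_key pages[TOY_MAX_PAGES];
	size_t nr_pages;
};

/* Look up a page and insert it on a miss.  Returns true on a cache hit. */
static bool toy_cache_read(struct toy_page_cache *pc, int owner,
			   unsigned long index)
{
	size_t i;

	for (i = 0; i < pc->nr_pages; i++)
		if (pc->pages[i].owner == owner && pc->pages[i].index == index)
			return true;
	if (pc->nr_pages < TOY_MAX_PAGES) {
		pc->pages[pc->nr_pages].owner = owner;
		pc->pages[pc->nr_pages].index = index;
		pc->nr_pages++;
	}
	return false;
}

/*
 * Two inodes (one per image) read the same deduplicated chunk of
 * nr_pages pages.  Returns how many pages end up resident.
 */
static size_t toy_resident_pages(bool share, unsigned long nr_pages)
{
	struct toy_page_cache pc = { .nr_pages = 0 };
	const int inode_a = 1, inode_b = 2, blob = 100;
	unsigned long i;

	for (i = 0; i < nr_pages; i++) {
		toy_cache_read(&pc, share ? blob : inode_a, i);
		toy_cache_read(&pc, share ? blob : inode_b, i);
	}
	return pc.nr_pages;
}
```

In this model the resident page count halves once sharing is turned on,
which is the effect the blob's address space is meant to provide.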


[Restriction]
==============
The page cache sharing feature also supports .mmap().  The reverse
mapping requires that a vma be linked to exactly one inode; it cannot
be shared among inodes.  As the vma is ultimately linked to the blob's
address space when page cache sharing is enabled, this restriction
means that the mapped file area cannot span multiple blobs.  Thus page
cache sharing can only be enabled for files that map to a single blob.

The chunk-based data layout guarantees that a chunk will not cross the
device (blob) boundary.  Thus in the chunk-based data layout, files
smaller than the chunk size are guaranteed to be mapped to one blob.
As the chunk size is tunable on a per-file basis, this restriction can
be relaxed at image build time.  As long as we ensure that the file
cannot be deduplicated, the file's chunk size can be set to a
reasonable value larger than the file size, so that the file contains
only one chunk, in which case page cache sharing can be enabled on this
file later.
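
As a rough sketch of the restriction above, the eligibility test boils
down to "every chunk of the file lives in the same blob".  The structure
and helpers below are hypothetical and are not the helper added later in
this series; they only show why a file that fits in a single chunk
always qualifies.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one chunk of a file. */
struct toy_chunk {
	unsigned int blob_id;	/* which device (blob) holds this chunk */
};

/* Number of chunks a file of @size bytes occupies with @chunk_size chunks. */
static uint64_t toy_nr_chunks(uint64_t size, uint64_t chunk_size)
{
	return (size + chunk_size - 1) / chunk_size;
}

/* A file is eligible only if all of its chunks are in the same blob. */
static bool toy_can_share(const struct toy_chunk *chunks, size_t nr)
{
	size_t i;

	for (i = 1; i < nr; i++)
		if (chunks[i].blob_id != chunks[0].blob_id)
			return false;
	return true;	/* zero or one chunk trivially maps to one blob */
}
```

With the default 1MB chunk size, a 700KB file occupies one chunk and
passes the check, while a larger file passes only if the image builder
placed all of its chunks in one blob.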


[Effect]
========
The final optimization result of this feature depends on the following
factors:

1. The number of deduplicated (shared) chunks.  Images sharing most of
their layers (e.g. a base image and a v1 image built on the base image)
will achieve better optimization.

2. Given the restriction described above, the number of files for which
page cache sharing can be enabled among the files accessed.


I tested the workload of starting up Tensorflow, which accesses quite
a lot of (~5K) files during the startup phase.  I got the base image of
Tensorflow from [1] and built a new image (e.g. v1 image) on top of this
base image.

Since the image from [1] is in OCI format, I had to convert it to
erofs format with buildkit[2], with the default chunk size of 1MB.

I ran containers from these two images with containerd (base image
first, v1 image second).  The (page cache) memory usage of the rootfs
(container image) is shown below:

                        | page cache sharing | page cache sharing
                        | disabled           | enabled
------------------------+--------------------+-------------------
First container image   |                    |
page cache usage (MB)   | 150                | 150
------------------------+--------------------+-------------------
Second container image  |                    |
page cache usage (MB)   | 150                | 7

It can be seen that most of the memory usage (~95%, 143MB/150MB) is
saved under this workload when starting subsequent containers that
share container image layers.

The remaining 7MB of memory usage is consumed by directories, since page
cache sharing is enabled only for regular files in this RFC
implementation.


I also tested the memory usage reduction between minor versions of the
same container image, using v2.10.0 and v2.10.1 of tensorflow.  It
shows a ~20% memory usage reduction, as most files accessed in the
workload are .pyc files, which are updated during the version bump.

                        | page cache sharing
                        | enabled
------------------------+-------------------
First container image   |
(tensorflow:2.10.0)     |
page cache usage (MB)   | 150
------------------------+-------------------
Second container image  |
(tensorflow:2.10.1)     |
page cache usage (MB)   | 122
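
For reference, the percentages quoted in this section follow from the
table values with simple arithmetic.  The helper below is only an
illustration (its name is made up), and for the minor-version case it
assumes the second container would also use 150MB without sharing, which
the table above does not state.  Under that assumption the saving works
out to roughly 18.7%, which is the "~20%" quoted above; the base/v1 case
works out to roughly 95.3%.

```c
/* Percentage of page cache saved for the second container. */
static double toy_saved_percent(double without_mb, double with_mb)
{
	return (without_mb - with_mb) * 100.0 / without_mb;
}
```

Calling toy_saved_percent(150, 7) covers the base/v1 case, and
toy_saved_percent(150, 122) covers the 2.10.0/2.10.1 case.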



[1] docker.io/tensorflow/tensorflow:2.10.0
[2] https://github.com/moby/buildkit


Jingbo Xu (9):
  erofs: support readahead in meta routine
  erofs: remove unused device mapping in the meta routine
  erofs: unify anonymous inodes for blob
  erofs: allocate anonymous file of blob for page cache sharing
  erofs: set accurate anony inode size for page cache sharing
  erofs: implement .read_iter for page cache sharing
  erofs: implement .mmap for page cache sharing
  erofs: add helper checking if page cache sharing shall be enabled
  erofs: introduce 'sharecache' mount option

 Documentation/filesystems/erofs.rst |   2 +
 fs/erofs/fscache.c                  | 271 +++++++++++++++++++++-------
 fs/erofs/inode.c                    |   4 +
 fs/erofs/internal.h                 |  34 +++-
 fs/erofs/super.c                    |  15 ++
 5 files changed, 256 insertions(+), 70 deletions(-)

-- 
2.19.1.6.gb485710b


* [PATCH v3 1/9] erofs: support readahead in meta routine
  2023-02-03  3:01 ` Jingbo Xu
@ 2023-02-03  3:01   ` Jingbo Xu
  -1 siblings, 0 replies; 22+ messages in thread
From: Jingbo Xu @ 2023-02-03  3:01 UTC (permalink / raw)
  To: xiang, chao, linux-erofs; +Cc: huyue2, linux-kernel, linux-fsdevel

In preparation for readahead support in page cache sharing, add
readahead support to the meta routine.

Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
---
 fs/erofs/fscache.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/fs/erofs/fscache.c b/fs/erofs/fscache.c
index 014e20962376..e2ebe8f7dbe9 100644
--- a/fs/erofs/fscache.c
+++ b/fs/erofs/fscache.c
@@ -193,6 +193,30 @@ static int erofs_fscache_meta_read_folio(struct file *data, struct folio *folio)
 	return ret;
 }
 
+static void erofs_fscache_meta_readahead(struct readahead_control *rac)
+{
+	int ret;
+	struct erofs_fscache *ctx = rac->mapping->host->i_private;
+	struct erofs_fscache_request *req;
+
+	if (!readahead_count(rac))
+		return;
+
+	req = erofs_fscache_req_alloc(rac->mapping,
+			readahead_pos(rac), readahead_length(rac));
+	if (IS_ERR(req))
+		return;
+
+	/* The request completion will drop refs on the folios. */
+	while (readahead_folio(rac))
+		;
+
+	ret = erofs_fscache_read_folios_async(ctx->cookie, req, req->start, req->len);
+	if (ret)
+		req->error = ret;
+	erofs_fscache_req_put(req);
+}
+
 static int erofs_fscache_data_read_slice(struct erofs_fscache_request *primary)
 {
 	struct address_space *mapping = primary->mapping;
@@ -319,6 +343,7 @@ static void erofs_fscache_readahead(struct readahead_control *rac)
 
 static const struct address_space_operations erofs_fscache_meta_aops = {
 	.read_folio = erofs_fscache_meta_read_folio,
+	.readahead  = erofs_fscache_meta_readahead,
 };
 
 const struct address_space_operations erofs_fscache_access_aops = {
-- 
2.19.1.6.gb485710b


* [PATCH v3 2/9] erofs: remove unused device mapping in the meta routine
  2023-02-03  3:01 ` Jingbo Xu
@ 2023-02-03  3:01   ` Jingbo Xu
  -1 siblings, 0 replies; 22+ messages in thread
From: Jingbo Xu @ 2023-02-03  3:01 UTC (permalink / raw)
  To: xiang, chao, linux-erofs; +Cc: huyue2, linux-kernel, linux-fsdevel

Currently two anonymous inodes (inode and anon_inode in struct
erofs_fscache) may be used for each blob.  The former is only for
bootstrap and is used as the address_space of page cache, while the
latter is used for both bootstrap and data blobs when share domain mode
is enabled and behaves as a sentinel in the shared domain.

In prep for the following support for page cache sharing, the following
patch will unify these two anonymous inodes.  That is, the unified
anonymous inode acts not only as the address_space of page cache, but
also as a sentinel in share domain mode.

However, the current meta routine can't work once the above change is
applied.  The current meta routine makes a device mapping, and the
superblock of the filesystem is required to do the device mapping.
Currently the superblock is derived from the input meta folio, which is
reasonable since the anonymous inode (used for the address_space of
page cache) is always allocated from the filesystem's sb.  However,
after the anonymous inodes are unified, that is no longer always true.
For example, in share domain mode, the unified anonymous inode will be
allocated from the pseudo mnt, and the superblock derived from the
folio is actually a pseudo sb, which can't be used for the device
mapping at all.

As for the meta routine itself, metadata is currently always on
bootstrap, which means device mapping is not needed so far.  After
removing the redundant device mapping logic, we can derive the required
fscache_ctx from the anonymous inode's i_private.
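
The back-pointer lookup described above can be sketched in userspace as
follows.  The toy_* types are stand-ins invented for this illustration
and only mirror the pointer chain folio -> mapping -> host inode ->
i_private used in the diff below; they are not the kernel structures.

```c
/* Toy stand-ins for the kernel objects involved; all hypothetical. */
struct toy_meta_cookie {
	int id;
};

struct toy_meta_fscache {
	struct toy_meta_cookie *cookie;
};

struct toy_meta_inode {
	void *i_private;	/* points back to the owning fscache ctx */
};

struct toy_meta_mapping {
	struct toy_meta_inode *host;
};

struct toy_meta_folio {
	struct toy_meta_mapping *mapping;
};

/*
 * Find the cookie to read from without consulting any superblock:
 * follow the folio to its host inode and use the stored back-pointer.
 */
static struct toy_meta_cookie *
toy_meta_folio_cookie(const struct toy_meta_folio *folio)
{
	struct toy_meta_fscache *ctx = folio->mapping->host->i_private;

	return ctx->cookie;
}
```

Because the lookup never touches a superblock, it keeps working when the
anonymous inode is allocated from the pseudo mount.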

Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
---
 fs/erofs/fscache.c | 17 ++++-------------
 1 file changed, 4 insertions(+), 13 deletions(-)

diff --git a/fs/erofs/fscache.c b/fs/erofs/fscache.c
index e2ebe8f7dbe9..14af4ebce54d 100644
--- a/fs/erofs/fscache.c
+++ b/fs/erofs/fscache.c
@@ -164,18 +164,8 @@ static int erofs_fscache_read_folios_async(struct fscache_cookie *cookie,
 static int erofs_fscache_meta_read_folio(struct file *data, struct folio *folio)
 {
 	int ret;
-	struct super_block *sb = folio_mapping(folio)->host->i_sb;
+	struct erofs_fscache *ctx = folio_mapping(folio)->host->i_private;
 	struct erofs_fscache_request *req;
-	struct erofs_map_dev mdev = {
-		.m_deviceid = 0,
-		.m_pa = folio_pos(folio),
-	};
-
-	ret = erofs_map_dev(sb, &mdev);
-	if (ret) {
-		folio_unlock(folio);
-		return ret;
-	}
 
 	req = erofs_fscache_req_alloc(folio_mapping(folio),
 				folio_pos(folio), folio_size(folio));
@@ -184,8 +174,8 @@ static int erofs_fscache_meta_read_folio(struct file *data, struct folio *folio)
 		return PTR_ERR(req);
 	}
 
-	ret = erofs_fscache_read_folios_async(mdev.m_fscache->cookie,
-				req, mdev.m_pa, folio_size(folio));
+	ret = erofs_fscache_read_folios_async(ctx->cookie, req,
+				folio_pos(folio), folio_size(folio));
 	if (ret)
 		req->error = ret;
 
@@ -494,6 +484,7 @@ struct erofs_fscache *erofs_fscache_acquire_cookie(struct super_block *sb,
 		inode->i_size = OFFSET_MAX;
 		inode->i_mapping->a_ops = &erofs_fscache_meta_aops;
 		mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS);
+		inode->i_private = ctx;
 
 		ctx->inode = inode;
 	}
-- 
2.19.1.6.gb485710b


* [PATCH v3 3/9] erofs: unify anonymous inodes for blob
  2023-02-03  3:01 ` Jingbo Xu
@ 2023-02-03  3:01   ` Jingbo Xu
  -1 siblings, 0 replies; 22+ messages in thread
From: Jingbo Xu @ 2023-02-03  3:01 UTC (permalink / raw)
  To: xiang, chao, linux-erofs; +Cc: huyue2, linux-kernel, linux-fsdevel

Currently only bootstrap allocates an anonymous inode for the
address_space of page cache.  In prep for the following support for page
cache sharing, as the first step, always allocate an anonymous inode
for this use for both bootstrap and data blobs.

Since anonymous inodes are now also allocated for data blobs, release
these anonymous inodes when .put_super() is called, or we'll get a
"VFS: Busy inodes after unmount." warning.

Similarly, the fscache contexts for data blobs are initialized prior to
the root inode, thus .kill_sb() shall also contain the cleanup routine,
so that these fscache contexts can be cleaned up when mount fails while
the root inode has not been initialized yet.

Also remove the redundant set_nlink() when initializing the anonymous
inode, since i_nlink has already been initialized to 1 when the inode
gets allocated.

At this point, two anonymous inodes (inode and anon_inode in struct
erofs_fscache) may be used for each blob.  The former is used as the
address_space of page cache, while the latter is used as a sentinel in
the shared domain.

In prep for the following support for page cache sharing, unify these
two anonymous inodes.
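
The reference counting of the unified inode in share domain mode can be
modelled with a small userspace sketch.  All names are hypothetical and
locking is omitted.  The initial reference belongs to the page cache
address_space, each user of the cookie in the domain takes one extra
"sentinel" reference, and the cookie is relinquished once a put leaves
only the initial reference.

```c
#include <stdbool.h>

/* Toy stand-in for the unified anonymous inode of a blob. */
struct toy_anon_inode {
	int i_count;
};

struct toy_ctx {
	struct toy_anon_inode inode;
	bool relinquished;
};

/* Allocation leaves one reference, owned by the address_space. */
static void toy_ctx_init(struct toy_ctx *ctx)
{
	ctx->inode.i_count = 1;
	ctx->relinquished = false;
}

/* Each user of the cookie in the domain takes a sentinel reference. */
static void toy_ctx_get(struct toy_ctx *ctx)
{
	ctx->inode.i_count++;
}

/*
 * Drop one sentinel reference.  If only the initial reference is left,
 * nobody in the domain uses the cookie any more, so relinquish it, which
 * also drops the initial reference.  Returns true if relinquished.
 */
static bool toy_ctx_put(struct toy_ctx *ctx)
{
	ctx->inode.i_count--;
	if (ctx->inode.i_count != 1)
		return false;
	ctx->inode.i_count--;	/* the address_space reference */
	ctx->relinquished = true;
	return true;
}
```

With two users of the same cookie, the first put leaves the context
alive and the second one relinquishes it.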

Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
---
 fs/erofs/fscache.c  | 101 ++++++++++++++++++++------------------------
 fs/erofs/internal.h |   6 +--
 fs/erofs/super.c    |   2 +
 3 files changed, 49 insertions(+), 60 deletions(-)

diff --git a/fs/erofs/fscache.c b/fs/erofs/fscache.c
index 14af4ebce54d..7f2e2d17e8e0 100644
--- a/fs/erofs/fscache.c
+++ b/fs/erofs/fscache.c
@@ -343,8 +343,6 @@ const struct address_space_operations erofs_fscache_access_aops = {
 
 static void erofs_fscache_domain_put(struct erofs_domain *domain)
 {
-	if (!domain)
-		return;
 	mutex_lock(&erofs_domain_list_lock);
 	if (refcount_dec_and_test(&domain->ref)) {
 		list_del(&domain->list);
@@ -448,12 +446,12 @@ static int erofs_fscache_register_domain(struct super_block *sb)
 
 static
 struct erofs_fscache *erofs_fscache_acquire_cookie(struct super_block *sb,
-						   char *name,
-						   unsigned int flags)
+					struct super_block *isb, char *name)
 {
 	struct fscache_volume *volume = EROFS_SB(sb)->volume;
 	struct erofs_fscache *ctx;
 	struct fscache_cookie *cookie;
+	struct inode *inode;
 	int ret;
 
 	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
@@ -467,33 +465,27 @@ struct erofs_fscache *erofs_fscache_acquire_cookie(struct super_block *sb,
 		ret = -EINVAL;
 		goto err;
 	}
-
 	fscache_use_cookie(cookie, false);
-	ctx->cookie = cookie;
-
-	if (flags & EROFS_REG_COOKIE_NEED_INODE) {
-		struct inode *const inode = new_inode(sb);
-
-		if (!inode) {
-			erofs_err(sb, "failed to get anon inode for %s", name);
-			ret = -ENOMEM;
-			goto err_cookie;
-		}
-
-		set_nlink(inode, 1);
-		inode->i_size = OFFSET_MAX;
-		inode->i_mapping->a_ops = &erofs_fscache_meta_aops;
-		mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS);
-		inode->i_private = ctx;
 
-		ctx->inode = inode;
+	inode = new_inode(isb);
+	if (!inode) {
+		erofs_err(sb, "failed to get anon inode for %s", name);
+		ret = -ENOMEM;
+		goto err_cookie;
 	}
 
+	inode->i_size = OFFSET_MAX;
+	inode->i_mapping->a_ops = &erofs_fscache_meta_aops;
+	mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS);
+	inode->i_private = ctx;
+
+	ctx->cookie = cookie;
+	ctx->inode = inode;
 	return ctx;
 
 err_cookie:
-	fscache_unuse_cookie(ctx->cookie, NULL, NULL);
-	fscache_relinquish_cookie(ctx->cookie, false);
+	fscache_unuse_cookie(cookie, NULL, NULL);
+	fscache_relinquish_cookie(cookie, false);
 err:
 	kfree(ctx);
 	return ERR_PTR(ret);
@@ -510,38 +502,34 @@ static void erofs_fscache_relinquish_cookie(struct erofs_fscache *ctx)
 
 static
 struct erofs_fscache *erofs_fscache_domain_init_cookie(struct super_block *sb,
-						       char *name,
-						       unsigned int flags)
+						       char *name)
 {
-	int err;
-	struct inode *inode;
 	struct erofs_fscache *ctx;
 	struct erofs_domain *domain = EROFS_SB(sb)->domain;
 
-	ctx = erofs_fscache_acquire_cookie(sb, name, flags);
+	ctx = erofs_fscache_acquire_cookie(sb, erofs_pseudo_mnt->mnt_sb, name);
 	if (IS_ERR(ctx))
 		return ctx;
 
 	ctx->name = kstrdup(name, GFP_KERNEL);
 	if (!ctx->name) {
-		err = -ENOMEM;
-		goto out;
+		erofs_fscache_relinquish_cookie(ctx);
+		return ERR_PTR(-ENOMEM);
 	}
 
-	inode = new_inode(erofs_pseudo_mnt->mnt_sb);
-	if (!inode) {
-		err = -ENOMEM;
-		goto out;
-	}
+	/*
+	 * In share domain scenarios, the unified anonymous inode is not only
+	 * used as the address_space of shared page cache, but also a sentinel
+	 * in pseudo mount.  The initial refcount is used for the former and
+	 * will be killed when the cookie finally gets relinquished.  For the
+	 * latter, the refcount is increased every time the cookie in the domain
+	 * gets referred to.
+	 */
+	igrab(ctx->inode);
 
 	ctx->domain = domain;
-	ctx->anon_inode = inode;
-	inode->i_private = ctx;
 	refcount_inc(&domain->ref);
 	return ctx;
-out:
-	erofs_fscache_relinquish_cookie(ctx);
-	return ERR_PTR(err);
 }
 
 static
@@ -572,7 +560,7 @@ struct erofs_fscache *erofs_domain_register_cookie(struct super_block *sb,
 		return ctx;
 	}
 	spin_unlock(&psb->s_inode_list_lock);
-	ctx = erofs_fscache_domain_init_cookie(sb, name, flags);
+	ctx = erofs_fscache_domain_init_cookie(sb, name);
 	mutex_unlock(&erofs_domain_cookies_lock);
 	return ctx;
 }
@@ -583,7 +571,7 @@ struct erofs_fscache *erofs_fscache_register_cookie(struct super_block *sb,
 {
 	if (EROFS_SB(sb)->domain_id)
 		return erofs_domain_register_cookie(sb, name, flags);
-	return erofs_fscache_acquire_cookie(sb, name, flags);
+	return erofs_fscache_acquire_cookie(sb, sb, name);
 }
 
 void erofs_fscache_unregister_cookie(struct erofs_fscache *ctx)
@@ -593,18 +581,20 @@ void erofs_fscache_unregister_cookie(struct erofs_fscache *ctx)
 
 	if (!ctx)
 		return;
-	domain = ctx->domain;
-	if (domain) {
-		mutex_lock(&erofs_domain_cookies_lock);
-		drop = atomic_read(&ctx->anon_inode->i_count) == 1;
-		iput(ctx->anon_inode);
-		mutex_unlock(&erofs_domain_cookies_lock);
-		if (!drop)
-			return;
-	}
 
-	erofs_fscache_relinquish_cookie(ctx);
-	erofs_fscache_domain_put(domain);
+	if (!ctx->domain)
+		return erofs_fscache_relinquish_cookie(ctx);
+
+	domain = ctx->domain;
+	mutex_lock(&erofs_domain_cookies_lock);
+	/* drop the ref for the sentinel in pseudo mount */
+	iput(ctx->inode);
+	drop = atomic_read(&ctx->inode->i_count) == 1;
+	if (drop)
+		erofs_fscache_relinquish_cookie(ctx);
+	mutex_unlock(&erofs_domain_cookies_lock);
+	if (drop)
+		erofs_fscache_domain_put(domain);
 }
 
 int erofs_fscache_register_fs(struct super_block *sb)
@@ -612,7 +602,7 @@ int erofs_fscache_register_fs(struct super_block *sb)
 	int ret;
 	struct erofs_sb_info *sbi = EROFS_SB(sb);
 	struct erofs_fscache *fscache;
-	unsigned int flags;
+	unsigned int flags = 0;
 
 	if (sbi->domain_id)
 		ret = erofs_fscache_register_domain(sb);
@@ -631,7 +621,6 @@ int erofs_fscache_register_fs(struct super_block *sb)
 	 *
 	 * Acquired domain/volume will be relinquished in kill_sb() on error.
 	 */
-	flags = EROFS_REG_COOKIE_NEED_INODE;
 	if (sbi->domain_id)
 		flags |= EROFS_REG_COOKIE_NEED_NOEXIST;
 	fscache = erofs_fscache_register_cookie(sb, sbi->fsid, flags);
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index bb8501c0ff5b..b3d04bc2d279 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -108,8 +108,7 @@ struct erofs_domain {
 
 struct erofs_fscache {
 	struct fscache_cookie *cookie;
-	struct inode *inode;
-	struct inode *anon_inode;
+	struct inode *inode;	/* anonymous inode for the blob */
 	struct erofs_domain *domain;
 	char *name;
 };
@@ -604,8 +603,7 @@ static inline int z_erofs_load_lzma_config(struct super_block *sb,
 #endif	/* !CONFIG_EROFS_FS_ZIP */
 
 /* flags for erofs_fscache_register_cookie() */
-#define EROFS_REG_COOKIE_NEED_INODE	1
-#define EROFS_REG_COOKIE_NEED_NOEXIST	2
+#define EROFS_REG_COOKIE_NEED_NOEXIST	1
 
 /* fscache.c */
 #ifdef CONFIG_EROFS_FS_ONDEMAND
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index 626a615dafc2..835b69c9511b 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -969,6 +969,8 @@ static void erofs_put_super(struct super_block *sb)
 	iput(sbi->packed_inode);
 	sbi->packed_inode = NULL;
 #endif
+	erofs_free_dev_context(sbi->devs);
+	sbi->devs = NULL;
 	erofs_fscache_unregister_fs(sb);
 }
 
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v3 4/9] erofs: allocate anonymous file of blob for page cache sharing
  2023-02-03  3:01 ` Jingbo Xu
@ 2023-02-03  3:01   ` Jingbo Xu
  -1 siblings, 0 replies; 22+ messages in thread
From: Jingbo Xu @ 2023-02-03  3:01 UTC (permalink / raw)
  To: xiang, chao, linux-erofs; +Cc: huyue2, linux-kernel, linux-fsdevel

In preparation for mmap support based on page cache sharing, allocate an
anonymous file for the corresponding blob, so that the associated vma
can be linked to the blob later.

Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
---
 fs/erofs/fscache.c  | 75 +++++++++++++++++++++++++++++++++++++++++++++
 fs/erofs/internal.h |  1 +
 2 files changed, 76 insertions(+)

diff --git a/fs/erofs/fscache.c b/fs/erofs/fscache.c
index 7f2e2d17e8e0..bed02b21978a 100644
--- a/fs/erofs/fscache.c
+++ b/fs/erofs/fscache.c
@@ -4,6 +4,8 @@
  * Copyright (C) 2022, Bytedance Inc. All rights reserved.
  */
 #include <linux/fscache.h>
+#include <linux/file.h>
+#include <linux/anon_inodes.h>
 #include "internal.h"
 
 static DEFINE_MUTEX(erofs_domain_list_lock);
@@ -22,6 +24,11 @@ struct erofs_fscache_request {
 	refcount_t		ref;
 };
 
+struct erofs_fscache_finfo {
+	erofs_off_t pa;
+	pgoff_t max_idx;
+};
+
 static struct erofs_fscache_request *erofs_fscache_req_alloc(struct address_space *mapping,
 					     loff_t start, size_t len)
 {
@@ -341,6 +348,74 @@ const struct address_space_operations erofs_fscache_access_aops = {
 	.readahead = erofs_fscache_readahead,
 };
 
+static int erofs_fscache_share_meta_release(struct inode *inode, struct file *filp)
+{
+	kfree(filp->private_data);
+	filp->private_data = NULL;
+	return 0;
+}
+
+static const struct file_operations erofs_fscache_share_meta_fops = {
+	.release	= erofs_fscache_share_meta_release,
+};
+
+static int erofs_fscache_share_file_release(struct inode *inode, struct file *filp)
+{
+	fput(filp->private_data);
+	filp->private_data = NULL;
+	return 0;
+}
+
+static int erofs_fscache_share_file_open(struct inode *inode, struct file *filp)
+{
+	/* since page cache sharing is enabled only when i_size <= chunk_size */
+	struct erofs_map_blocks map = {}; /* .m_la = 0 */
+	struct erofs_map_dev mdev;
+	struct erofs_fscache_finfo *finfo;
+	struct inode *realinode;
+	struct file *realfile;
+	int ret;
+
+	ret = erofs_map_blocks(inode, &map, EROFS_GET_BLOCKS_RAW);
+	if (ret)
+		return ret;
+
+	mdev = (struct erofs_map_dev) {
+		.m_deviceid = map.m_deviceid,
+		.m_pa = map.m_pa,
+	};
+	ret = erofs_map_dev(inode->i_sb, &mdev);
+	if (ret)
+		return ret;
+
+	finfo = kzalloc(sizeof(struct erofs_fscache_finfo), GFP_KERNEL);
+	if (!finfo)
+		return -ENOMEM;
+	finfo->pa = mdev.m_pa;
+	finfo->max_idx = DIV_ROUND_UP(mdev.m_pa + inode->i_size, PAGE_SIZE);
+
+	realinode = mdev.m_fscache->inode;
+	ihold(realinode);
+	realfile = alloc_file_pseudo(realinode, filp->f_path.mnt, "[erofs]",
+				     O_RDONLY, &erofs_fscache_share_meta_fops);
+	if (IS_ERR(realfile)) {
+		iput(realinode);
+		kfree(finfo);
+		return PTR_ERR(realfile);
+	}
+
+	file_ra_state_init(&realfile->f_ra, filp->f_mapping);
+	realfile->private_data = finfo;
+	filp->private_data = realfile;
+	return 0;
+}
+
+const struct file_operations erofs_fscache_share_file_fops = {
+	.llseek		= generic_file_llseek,
+	.open		= erofs_fscache_share_file_open,
+	.release	= erofs_fscache_share_file_release,
+};
+
 static void erofs_fscache_domain_put(struct erofs_domain *domain)
 {
 	mutex_lock(&erofs_domain_list_lock);
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index b3d04bc2d279..7c6a7a2d9acf 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -616,6 +616,7 @@ struct erofs_fscache *erofs_fscache_register_cookie(struct super_block *sb,
 void erofs_fscache_unregister_cookie(struct erofs_fscache *fscache);
 
 extern const struct address_space_operations erofs_fscache_access_aops;
+extern const struct file_operations erofs_fscache_share_file_fops;
 #else
 static inline int erofs_fscache_register_fs(struct super_block *sb)
 {
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v3 5/9] erofs: set accurate anony inode size for page cache sharing
  2023-02-03  3:01 ` Jingbo Xu
@ 2023-02-03  3:01   ` Jingbo Xu
  -1 siblings, 0 replies; 22+ messages in thread
From: Jingbo Xu @ 2023-02-03  3:01 UTC (permalink / raw)
  To: xiang, chao, linux-erofs; +Cc: huyue2, linux-kernel, linux-fsdevel

In preparation for readahead support in page cache sharing, the
anonymous inode needs an accurate size.  Otherwise the readahead
algorithm may read beyond the EOF of the blob, since the inode size
is otherwise left at the OFFSET_MAX magic value.
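
To illustrate why the size matters, here is a minimal userspace sketch
(not kernel code) of how a readahead window gets clamped to the inode
size.  The helper name ra_clamp_pages() and the fixed 4 KiB page size
are assumptions made purely for illustration; the real clamping lives
in the generic readahead code.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL	/* assumed page size for this sketch */

/*
 * Hypothetical helper: return how many pages of the readahead window
 * [start, start + nr) fall inside a file of i_size bytes.
 */
static uint64_t ra_clamp_pages(uint64_t i_size, uint64_t start, uint64_t nr)
{
	/* number of pages needed to cover i_size bytes, rounded up */
	uint64_t end = i_size / SKETCH_PAGE_SIZE +
		       (i_size % SKETCH_PAGE_SIZE != 0);

	if (start >= end)
		return 0;
	return nr < end - start ? nr : end - start;
}
```

With a placeholder size as large as INT64_MAX (standing in for
OFFSET_MAX here) the window is never clamped, so a 32-page readahead
near the end of a 10000-byte blob would ask for pages that do not
exist; with the accurate size it is cut down to the remaining pages.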

Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
---
 fs/erofs/fscache.c  | 9 +++++++++
 fs/erofs/internal.h | 1 +
 2 files changed, 10 insertions(+)

diff --git a/fs/erofs/fscache.c b/fs/erofs/fscache.c
index bed02b21978a..4fe7f23b022e 100644
--- a/fs/erofs/fscache.c
+++ b/fs/erofs/fscache.c
@@ -554,6 +554,15 @@ struct erofs_fscache *erofs_fscache_acquire_cookie(struct super_block *sb,
 	mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS);
 	inode->i_private = ctx;
 
+	if (test_opt(&EROFS_SB(sb)->opt, SHARE_CACHE)) {
+		struct netfs_cache_resources cres;
+		ret = fscache_begin_read_operation(&cres, cookie);
+		if (ret)
+			goto err_cookie;
+		fscache_end_operation(&cres);
+		inode->i_size = cookie->object_size;
+	}
+
 	ctx->cookie = cookie;
 	ctx->inode = inode;
 	return ctx;
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index 7c6a7a2d9acf..60d14561fb46 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -181,6 +181,7 @@ struct erofs_sb_info {
 #define EROFS_MOUNT_POSIX_ACL		0x00000020
 #define EROFS_MOUNT_DAX_ALWAYS		0x00000040
 #define EROFS_MOUNT_DAX_NEVER		0x00000080
+#define EROFS_MOUNT_SHARE_CACHE		0x00000100
 
 #define clear_opt(opt, option)	((opt)->mount_opt &= ~EROFS_MOUNT_##option)
 #define set_opt(opt, option)	((opt)->mount_opt |= EROFS_MOUNT_##option)
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v3 6/9] erofs: implement .read_iter for page cache sharing
  2023-02-03  3:01 ` Jingbo Xu
@ 2023-02-03  3:01   ` Jingbo Xu
  -1 siblings, 0 replies; 22+ messages in thread
From: Jingbo Xu @ 2023-02-03  3:01 UTC (permalink / raw)
  To: xiang, chao, linux-erofs; +Cc: huyue2, linux-kernel, linux-fsdevel

When page cache sharing is enabled, the page cache is managed in the
address space of the blobs rather than in that of the erofs inodes.
All erofs inodes sharing one chunk will refer to and share the page
cache in the blob's address space.
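
The offset translation performed on the read path can be modelled in
userspace as follows.  This is only a sketch of the arithmetic, not the
kernel implementation: the struct and function names are hypothetical,
and unlike the diff below it explicitly returns a zero-length read when
the position is already at or beyond i_size.

```c
#include <stdint.h>

/* Hypothetical result type: where to read in the blob, and how much. */
struct share_read {
	uint64_t blob_pos;	/* position inside the blob */
	uint64_t len;		/* bytes that may be read */
};

/*
 * Map a read of `count` bytes at file position `pos` onto the blob.
 * `pa` is the physical offset of the file's data inside the blob and
 * `i_size` is the size of the erofs file.
 */
static struct share_read share_read_map(uint64_t i_size, uint64_t pa,
					uint64_t pos, uint64_t count)
{
	struct share_read r = { .blob_pos = pa + pos, .len = 0 };

	if (pos >= i_size)
		return r;	/* nothing left to read */

	/* never read past the end of this file into a neighbour's data */
	r.len = count < i_size - pos ? count : i_size - pos;
	return r;
}
```

For a 5000-byte file stored at blob offset 8192, a 10000-byte read at
position 100 is served from blob offset 8292 and limited to 4900 bytes.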

Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
---
 fs/erofs/fscache.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/fs/erofs/fscache.c b/fs/erofs/fscache.c
index 4fe7f23b022e..bdeb048b78b5 100644
--- a/fs/erofs/fscache.c
+++ b/fs/erofs/fscache.c
@@ -410,8 +410,31 @@ static int erofs_fscache_share_file_open(struct inode *inode, struct file *filp)
 	return 0;
 }
 
+static ssize_t erofs_fscache_share_file_read_iter(struct kiocb *iocb,
+						  struct iov_iter *iter)
+{
+	struct file *realfile = iocb->ki_filp->private_data;
+	struct erofs_fscache_finfo *finfo = realfile->private_data;
+	ssize_t res;
+
+	if (!iov_iter_count(iter))
+		return 0;
+
+	if (!is_sync_kiocb(iocb))
+		return filemap_read(iocb, iter, 0);
+
+	iov_iter_truncate(iter, file_inode(iocb->ki_filp)->i_size - iocb->ki_pos);
+	iocb->ki_filp = realfile;
+	iocb->ki_pos += finfo->pa;
+
+	res = filemap_read(iocb, iter, 0);
+	iocb->ki_pos -= finfo->pa;
+	return res;
+}
+
 const struct file_operations erofs_fscache_share_file_fops = {
 	.llseek		= generic_file_llseek,
+	.read_iter	= erofs_fscache_share_file_read_iter,
 	.open		= erofs_fscache_share_file_open,
 	.release	= erofs_fscache_share_file_release,
 };
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v3 7/9] erofs: implement .mmap for page cache sharing
  2023-02-03  3:01 ` Jingbo Xu
@ 2023-02-03  3:01   ` Jingbo Xu
  -1 siblings, 0 replies; 22+ messages in thread
From: Jingbo Xu @ 2023-02-03  3:01 UTC (permalink / raw)
  To: xiang, chao, linux-erofs; +Cc: huyue2, linux-kernel, linux-fsdevel

In mmap(2), replace vma->vm_file with the anonymous file associated with
the blob, so that the vma will be linked to the address_space of the
blob.

One thing worth noting is that we return an error early in mmap(2) if
users attempt to map beyond the file size.  Normally filesystems don't
restrict this in mmap(2): the check is done in the fault handler, and
SIGBUS is delivered to users if they actually attempt to access the
area beyond the end of the file.  However, since vma->vm_file has been
changed to the anonymous file in mmap(2), there is no way to derive the
file size of the original file.  As the file size is immutable in a
read-only filesystem, let's fail early in mmap(2) in this case.
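
The page-offset arithmetic can be sketched in userspace as below.  It
mirrors the expressions used in this series (max_idx computed at open
time, vm_pgoff shifted by the blob offset, and the bound checked against
max_idx on fault), but the function names and the fixed 4 KiB page size
are assumptions for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumed 4 KiB pages */
#define SKETCH_PAGE_SIZE  (1ULL << SKETCH_PAGE_SHIFT)

/* First blob page index that lies beyond the file's data. */
static uint64_t share_max_idx(uint64_t pa, uint64_t i_size)
{
	return (pa + i_size + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
}

/* Translate a page offset in the file into a page offset in the blob. */
static uint64_t share_vm_pgoff(uint64_t pa, uint64_t file_pgoff)
{
	return (pa >> SKETCH_PAGE_SHIFT) + file_pgoff;
}

/* A fault is allowed only below max_idx; otherwise SIGBUS is raised. */
static bool share_fault_ok(uint64_t pa, uint64_t i_size, uint64_t blob_pgoff)
{
	return blob_pgoff < share_max_idx(pa, i_size);
}
```

For a 5000-byte file at blob offset 8192, max_idx is 4: file pages 0
and 1 map to blob pages 2 and 3 and may be faulted in, while file page
2 maps to blob page 4 and is rejected.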

Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
---
 fs/erofs/fscache.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/fs/erofs/fscache.c b/fs/erofs/fscache.c
index bdeb048b78b5..af6ba52bbe8b 100644
--- a/fs/erofs/fscache.c
+++ b/fs/erofs/fscache.c
@@ -432,9 +432,36 @@ static ssize_t erofs_fscache_share_file_read_iter(struct kiocb *iocb,
 	return res;
 }
 
+vm_fault_t erofs_fscache_share_fault(struct vm_fault *vmf)
+{
+	struct erofs_fscache_finfo *finfo = vmf->vma->vm_file->private_data;
+
+	if (unlikely(vmf->pgoff >= finfo->max_idx))
+		return VM_FAULT_SIGBUS;
+	return filemap_fault(vmf);
+}
+
+static const struct vm_operations_struct erofs_fscache_share_file_vm_ops = {
+	.fault = erofs_fscache_share_fault,
+};
+
+static int erofs_fscache_share_file_mmap(struct file *file,
+					 struct vm_area_struct *vma)
+{
+	struct file *realfile = file->private_data;
+	struct erofs_fscache_finfo *finfo = realfile->private_data;
+
+	vma_set_file(vma, realfile);
+	vma->vm_pgoff = (finfo->pa >> PAGE_SHIFT) + vma->vm_pgoff;
+	vma->vm_ops = &erofs_fscache_share_file_vm_ops;
+	file_accessed(file);
+	return 0;
+}
+
 const struct file_operations erofs_fscache_share_file_fops = {
 	.llseek		= generic_file_llseek,
 	.read_iter	= erofs_fscache_share_file_read_iter,
+	.mmap		= erofs_fscache_share_file_mmap,
 	.open		= erofs_fscache_share_file_open,
 	.release	= erofs_fscache_share_file_release,
 };
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v3 8/9] erofs: add helper checking if page cache sharing shall be enabled
  2023-02-03  3:01 ` Jingbo Xu
@ 2023-02-03  3:01   ` Jingbo Xu
  -1 siblings, 0 replies; 22+ messages in thread
From: Jingbo Xu @ 2023-02-03  3:01 UTC (permalink / raw)
  To: xiang, chao, linux-erofs; +Cc: huyue2, linux-kernel, linux-fsdevel

Erofs supports chunk deduplication to reduce disk usage.  Furthermore,
we can make inodes share the page cache of these deduplicated chunks to
reduce memory usage.  This is especially useful in container scenarios,
as deduplication is requisite for container images.

This can be achieved by managing page cache of deduplicated chunks in
blob's address space.  In this way, all inodes sharing the deduplicated
chunk will refer to and share the page cache in the blob's address
space.

So far there are some restrictions for enabling this feature.

The page cache sharing feature also supports .mmap().  The reverse
mapping requires that one vma can not be shared among inodes and can
be linked to only one inode.  As the vma will finally be linked to the
blob's address space when page cache sharing is enabled, the restriction
of the reverse mapping actually requires that the mapped file area can
not be mapped to multiple blobs.  Thus page cache sharing can only be
enabled for those files mapped to one blob.

The chunk based data layout guarantees that a chunk will not cross the
device (blob) boundary.  Thus in chunk based data layout, files no
larger than the chunk size are guaranteed to be mapped to one blob.  As
the chunk size is tunable on a per-file basis, this restriction can be
relaxed at image building phase.  As long as we ensure that the file
can not be deduplicated, the file's chunk size can be set to a
reasonable value larger than the file size, so that the page cache
sharing feature can be enabled on this file later.

The second restriction is that EROFS_BLKSIZ must be a multiple of
PAGE_SIZE to avoid data leakage.  Otherwise unrelated data may be
exposed at the end of the last page, since the file's data is arranged
in units of EROFS_BLKSIZ in the image.

Considering all these restrictions, add a helper checking if page cache
sharing shall be enabled for a specific file.
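
The restrictions above can be restated as a small userspace predicate.
This is a sketch for reasoning about the conditions, not the kernel
helper itself: struct sketch_inode and its fields are hypothetical
stand-ins for the erofs inode and superblock state, and the block size
and page size are passed in as plain parameters.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the inode/superblock state consulted. */
struct sketch_inode {
	bool fscache_mode;	/* mounted in fscache mode */
	bool has_domain;	/* a share domain is configured */
	bool chunk_based;	/* chunk based data layout */
	uint64_t i_size;	/* file size in bytes */
	unsigned int chunkbits;	/* log2 of the chunk size */
};

static bool sketch_can_share_page(const struct sketch_inode *vi,
				  uint64_t blksiz, uint64_t page_size)
{
	/* sharing is only considered in share domain mode */
	if (!vi->fscache_mode || !vi->has_domain)
		return false;

	if (!vi->chunk_based)
		return false;

	/* the file must fit in one chunk so it maps to a single blob */
	if (vi->i_size > (1ULL << vi->chunkbits))
		return false;

	/* block size must be a multiple of the page size */
	if (blksiz % page_size)
		return false;

	return true;
}
```

For example, a 4096-byte chunk based file with chunkbits of 12
qualifies when both the block size and the page size are 4096, but not
when the page size is 65536 or when the file grows to 4097 bytes.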

Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
---
 fs/erofs/internal.h | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index 60d14561fb46..6019b076c625 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -369,6 +369,29 @@ static inline unsigned int erofs_inode_datalayout(unsigned int value)
 			      EROFS_I_DATALAYOUT_BITS);
 }
 
+static inline bool erofs_can_share_page(struct inode *inode)
+{
+	struct erofs_inode *vi = EROFS_I(inode);
+	struct erofs_sb_info *sbi = EROFS_SB(inode->i_sb);
+
+	/* enable page cache sharing only in share domain mode */
+	if (!erofs_is_fscache_mode(inode->i_sb) || !sbi->domain_id)
+		return false;
+
+	if (vi->datalayout != EROFS_INODE_CHUNK_BASED)
+		return false;
+
+	/* avoid crossing multiple devices/blobs */
+	if (inode->i_size > 1UL << vi->chunkbits)
+		return false;
+
+	/* avoid data leakage in mmap routine */
+	if (EROFS_BLKSIZ % PAGE_SIZE)
+		return false;
+
+	return true;
+}
+
 /*
  * Different from grab_cache_page_nowait(), reclaiming is never triggered
  * when allocating new pages.
-- 
2.19.1.6.gb485710b



* [PATCH v3 9/9] erofs: introduce 'sharecache' mount option
  2023-02-03  3:01 ` Jingbo Xu
@ 2023-02-03  3:01   ` Jingbo Xu
  -1 siblings, 0 replies; 22+ messages in thread
From: Jingbo Xu @ 2023-02-03  3:01 UTC (permalink / raw)
  To: xiang, chao, linux-erofs; +Cc: huyue2, linux-kernel, linux-fsdevel

Introduce 'sharecache' mount option to enable page cache sharing in
fscache mode.

Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
---
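How a negatable flag such as (no)sharecache maps onto a single option
bit can be sketched in user space as follows.  This is only an
illustration: MOUNT_SHARE_CACHE, struct mount_opts, parse_sharecache()
and show_sharecache() are hypothetical names, and the bit value is an
assumption rather than the one used by the kernel.

```c
#include <string.h>

#define MOUNT_SHARE_CACHE 0x1u	/* hypothetical bit value */

struct mount_opts {
	unsigned int flags;
};

/* Returns 0 if the token was recognized, -1 otherwise. */
static int parse_sharecache(struct mount_opts *o, const char *tok)
{
	if (strcmp(tok, "sharecache") == 0) {
		o->flags |= MOUNT_SHARE_CACHE;	/* set the option bit */
		return 0;
	}
	if (strcmp(tok, "nosharecache") == 0) {
		o->flags &= ~MOUNT_SHARE_CACHE;	/* clear the option bit */
		return 0;
	}
	return -1;
}

/* Mirrors the show_options side: always report one of the two forms. */
static const char *show_sharecache(const struct mount_opts *o)
{
	return (o->flags & MOUNT_SHARE_CACHE) ? ",sharecache"
					      : ",nosharecache";
}
```

Since the last token wins, passing "sharecache" followed by
"nosharecache" leaves the bit cleared, which matches the usual
behaviour of negatable mount flags.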
 Documentation/filesystems/erofs.rst |  2 ++
 fs/erofs/inode.c                    |  4 ++++
 fs/erofs/internal.h                 |  3 +++
 fs/erofs/super.c                    | 13 +++++++++++++
 4 files changed, 22 insertions(+)

diff --git a/Documentation/filesystems/erofs.rst b/Documentation/filesystems/erofs.rst
index a43aacf1494e..1478c4d168e2 100644
--- a/Documentation/filesystems/erofs.rst
+++ b/Documentation/filesystems/erofs.rst
@@ -122,6 +122,8 @@ device=%s              Specify a path to an extra device to be used together.
 fsid=%s                Specify a filesystem image ID for Fscache back-end.
 domain_id=%s           Specify a domain ID in fscache mode so that different images
                        with the same blobs under a given domain ID can share storage.
+(no)sharecache         Enable page cache sharing among different images in the
+                       same domain.
 ===================    =========================================================
 
 Sysfs Entries
diff --git a/fs/erofs/inode.c b/fs/erofs/inode.c
index d3b8736fa124..31d3ab8443d1 100644
--- a/fs/erofs/inode.c
+++ b/fs/erofs/inode.c
@@ -262,6 +262,10 @@ static int erofs_fill_inode(struct inode *inode)
 		inode->i_op = &erofs_generic_iops;
 		if (erofs_inode_is_data_compressed(vi->datalayout))
 			inode->i_fop = &generic_ro_fops;
+#ifdef CONFIG_EROFS_FS_ONDEMAND
+		else if (erofs_can_share_page(inode))
+			inode->i_fop = &erofs_fscache_share_file_fops;
+#endif
 		else
 			inode->i_fop = &erofs_file_fops;
 		break;
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index 6019b076c625..c3ac6d613eb1 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -374,6 +374,9 @@ static inline bool erofs_can_share_page(struct inode *inode)
 	struct erofs_inode *vi = EROFS_I(inode);
 	struct erofs_sb_info *sbi = EROFS_SB(inode->i_sb);
 
+	if (!test_opt(&sbi->opt, SHARE_CACHE))
+		return false;
+
 	/* enable page cache sharing only in share domain mode */
 	if (!erofs_is_fscache_mode(inode->i_sb) || !sbi->domain_id)
 		return false;
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index 835b69c9511b..d05346d34ed8 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -456,6 +456,7 @@ enum {
 	Opt_device,
 	Opt_fsid,
 	Opt_domain_id,
+	Opt_sharecache,
 	Opt_err
 };
 
@@ -482,6 +483,7 @@ static const struct fs_parameter_spec erofs_fs_parameters[] = {
 	fsparam_string("device",	Opt_device),
 	fsparam_string("fsid",		Opt_fsid),
 	fsparam_string("domain_id",	Opt_domain_id),
+	fsparam_flag_no("sharecache",	Opt_sharecache),
 	{}
 };
 
@@ -590,9 +592,16 @@ static int erofs_fc_parse_param(struct fs_context *fc,
 		if (!ctx->domain_id)
 			return -ENOMEM;
 		break;
+	case Opt_sharecache:
+		if (result.boolean)
+			set_opt(&ctx->opt, SHARE_CACHE);
+		else
+			clear_opt(&ctx->opt, SHARE_CACHE);
+		break;
 #else
 	case Opt_fsid:
 	case Opt_domain_id:
+	case Opt_sharecache:
 		errorfc(fc, "%s option not supported", erofs_fs_parameters[opt].name);
 		break;
 #endif
@@ -1108,6 +1117,10 @@ static int erofs_show_options(struct seq_file *seq, struct dentry *root)
 		seq_printf(seq, ",fsid=%s", sbi->fsid);
 	if (sbi->domain_id)
 		seq_printf(seq, ",domain_id=%s", sbi->domain_id);
+	if (test_opt(opt, SHARE_CACHE))
+		seq_puts(seq, ",sharecache");
+	else
+		seq_puts(seq, ",nosharecache");
 #endif
 	return 0;
 }
-- 
2.19.1.6.gb485710b



* Re: [PATCH v3 7/9] erofs: implement .mmap for page cache sharing
  2023-02-03  3:01   ` Jingbo Xu
@ 2023-02-03 15:12     ` kernel test robot
  -1 siblings, 0 replies; 22+ messages in thread
From: kernel test robot @ 2023-02-03 15:12 UTC (permalink / raw)
  To: Jingbo Xu, xiang, chao, linux-erofs
  Cc: oe-kbuild-all, huyue2, linux-kernel, linux-fsdevel

Hi Jingbo,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on xiang-erofs/dev-test]
[also build test WARNING on xiang-erofs/dev xiang-erofs/fixes linus/master v6.2-rc6 next-20230203]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Jingbo-Xu/erofs-support-readahead-in-meta-routine/20230203-110255
base:   https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs.git dev-test
patch link:    https://lore.kernel.org/r/20230203030143.73105-8-jefflexu%40linux.alibaba.com
patch subject: [PATCH v3 7/9] erofs: implement .mmap for page cache sharing
config: m68k-allyesconfig (https://download.01.org/0day-ci/archive/20230203/202302032301.KaFzWF1g-lkp@intel.com/config)
compiler: m68k-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/499758ba5c442083b32a76a3fd55b734df0c486b
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Jingbo-Xu/erofs-support-readahead-in-meta-routine/20230203-110255
        git checkout 499758ba5c442083b32a76a3fd55b734df0c486b
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=m68k olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=m68k SHELL=/bin/bash fs/erofs/

If you fix the issue, kindly add the following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> fs/erofs/fscache.c:435:12: warning: no previous prototype for 'erofs_fscache_share_fault' [-Wmissing-prototypes]
     435 | vm_fault_t erofs_fscache_share_fault(struct vm_fault *vmf)
         |            ^~~~~~~~~~~~~~~~~~~~~~~~~


vim +/erofs_fscache_share_fault +435 fs/erofs/fscache.c

   434	
 > 435	vm_fault_t erofs_fscache_share_fault(struct vm_fault *vmf)
   436	{
   437		struct erofs_fscache_finfo *finfo = vmf->vma->vm_file->private_data;
   438	
   439		if (unlikely(vmf->pgoff >= finfo->max_idx))
   440			return VM_FAULT_SIGBUS;
   441		return filemap_fault(vmf);
   442	}
   443	
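
One common way to address a -Wmissing-prototypes warning, assuming the
handler is referenced only within the file that defines it, is to give
the function internal linkage with 'static'; the alternative is to
declare a prototype in a shared header.  The user-space sketch below
models the bounds check with a static function.  enum fault_result,
struct finfo and share_fault() are hypothetical stand-ins, not the
kernel types, and the real handler would go on to call filemap_fault()
where this sketch returns FAULT_OK.

```c
#include <stdint.h>

/* Hypothetical stand-ins for vm_fault_t values. */
enum fault_result {
	FAULT_OK = 0,
	FAULT_SIGBUS = 1,
};

/* Hypothetical stand-in for the per-file info holding the page limit. */
struct finfo {
	uint64_t max_idx;	/* first page index past the mapped area */
};

/*
 * 'static' gives the function internal linkage, so the compiler does
 * not expect a previous prototype for it.
 */
static enum fault_result share_fault(const struct finfo *finfo,
				     uint64_t pgoff)
{
	/* faults at or beyond max_idx are rejected */
	if (pgoff >= finfo->max_idx)
		return FAULT_SIGBUS;
	return FAULT_OK;
}
```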

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests


end of thread (newest message: 2023-02-03 15:13 UTC)

Thread overview: 22+ messages
2023-02-03  3:01 [PATCH v3 0/9] erofs: support page cache sharing between EROFS images in fscache mode Jingbo Xu
2023-02-03  3:01 ` Jingbo Xu
2023-02-03  3:01 ` [PATCH v3 1/9] erofs: support readahead in meta routine Jingbo Xu
2023-02-03  3:01   ` Jingbo Xu
2023-02-03  3:01 ` [PATCH v3 2/9] erofs: remove unused device mapping in the " Jingbo Xu
2023-02-03  3:01   ` Jingbo Xu
2023-02-03  3:01 ` [PATCH v3 3/9] erofs: unify anonymous inodes for blob Jingbo Xu
2023-02-03  3:01   ` Jingbo Xu
2023-02-03  3:01 ` [PATCH v3 4/9] erofs: allocate anonymous file of blob for page cache sharing Jingbo Xu
2023-02-03  3:01   ` Jingbo Xu
2023-02-03  3:01 ` [PATCH v3 5/9] erofs: set accurate anony inode size " Jingbo Xu
2023-02-03  3:01   ` Jingbo Xu
2023-02-03  3:01 ` [PATCH v3 6/9] erofs: implement .read_iter " Jingbo Xu
2023-02-03  3:01   ` Jingbo Xu
2023-02-03  3:01 ` [PATCH v3 7/9] erofs: implement .mmap " Jingbo Xu
2023-02-03  3:01   ` Jingbo Xu
2023-02-03 15:12   ` kernel test robot
2023-02-03 15:12     ` kernel test robot
2023-02-03  3:01 ` [PATCH v3 8/9] erofs: add helper checking if page cache sharing shall be enabled Jingbo Xu
2023-02-03  3:01   ` Jingbo Xu
2023-02-03  3:01 ` [PATCH v3 9/9] erofs: introduce 'sharecache' mount option Jingbo Xu
2023-02-03  3:01   ` Jingbo Xu
