* [PATCH 0/8] shmem,tmpfs: general maintenance
@ 2023-09-30  3:23 Hugh Dickins
  2023-09-30  3:25 ` [PATCH 1/8] shmem: shrink shmem_inode_info: dir_offsets in a union Hugh Dickins
                   ` (7 more replies)
  0 siblings, 8 replies; 29+ messages in thread
From: Hugh Dickins @ 2023-09-30  3:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christian Brauner, Carlos Maiolino, Chuck Lever, Jan Kara,
	Matthew Wilcox, Johannes Weiner, Axel Rasmussen, linux-fsdevel,
	linux-kernel, linux-mm

And here is a series of patches based on v6.6-rc3: mostly just cosmetic
mods in mm/shmem.c, but the last two enforce the "size=" limit better.
8/8 goes into percpu counter territory, and could stand alone: I'll add
some more Cc's on that one.

Applies to any v6.6-rc so far, and to next-20230929 and to
mm-everything-2023-09-29-23-51: hah, there's now an 09-30-01-16 which
I haven't tried yet, but this should be good on that too.

1/8 shmem: shrink shmem_inode_info: dir_offsets in a union
2/8 shmem: remove vma arg from shmem_get_folio_gfp()
3/8 shmem: factor shmem_falloc_wait() out of shmem_fault()
4/8 shmem: trivial tidyups, removing extra blank lines, etc
5/8 shmem: shmem_acct_blocks() and shmem_inode_acct_blocks()
6/8 shmem: move memcg charge out of shmem_add_to_page_cache()
7/8 shmem: _add_to_page_cache() before shmem_inode_acct_blocks()
8/8 shmem,percpu_counter: add _limited_add(fbc, limit, amount)

 include/linux/percpu_counter.h |  23 ++
 include/linux/shmem_fs.h       |  16 +-
 lib/percpu_counter.c           |  53 ++++
 mm/shmem.c                     | 500 +++++++++++++++++------------------
 4 files changed, 333 insertions(+), 259 deletions(-)

Hugh


* [PATCH 1/8] shmem: shrink shmem_inode_info: dir_offsets in a union
  2023-09-30  3:23 [PATCH 0/8] shmem,tmpfs: general maintenance Hugh Dickins
@ 2023-09-30  3:25 ` Hugh Dickins
  2023-09-30 16:16   ` Chuck Lever
  2023-10-03 13:06   ` Jan Kara
  2023-09-30  3:26 ` [PATCH 2/8] shmem: remove vma arg from shmem_get_folio_gfp() Hugh Dickins
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 29+ messages in thread
From: Hugh Dickins @ 2023-09-30  3:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christian Brauner, Carlos Maiolino, Chuck Lever, Jan Kara,
	Matthew Wilcox, Johannes Weiner, Axel Rasmussen, linux-fsdevel,
	linux-kernel, linux-mm

Shave 32 bytes off (the 64-bit) shmem_inode_info.  There was a 4-byte
pahole after stop_eviction, better filled by fsflags.  And the 24-byte
dir_offsets can only be used by directories, whereas shrinklist and
swaplist only by shmem_mapping() inodes (regular files or long symlinks):
so put those into a union.  No change in mm/shmem.c is required for this.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/shmem_fs.h | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 6b0c626620f5..2caa6b86106a 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -23,18 +23,22 @@ struct shmem_inode_info {
 	unsigned long		flags;
 	unsigned long		alloced;	/* data pages alloced to file */
 	unsigned long		swapped;	/* subtotal assigned to swap */
-	pgoff_t			fallocend;	/* highest fallocate endindex */
-	struct list_head        shrinklist;     /* shrinkable hpage inodes */
-	struct list_head	swaplist;	/* chain of maybes on swap */
+	union {
+	    struct offset_ctx	dir_offsets;	/* stable directory offsets */
+	    struct {
+		struct list_head shrinklist;	/* shrinkable hpage inodes */
+		struct list_head swaplist;	/* chain of maybes on swap */
+	    };
+	};
+	struct timespec64	i_crtime;	/* file creation time */
 	struct shared_policy	policy;		/* NUMA memory alloc policy */
 	struct simple_xattrs	xattrs;		/* list of xattrs */
+	pgoff_t			fallocend;	/* highest fallocate endindex */
+	unsigned int		fsflags;	/* for FS_IOC_[SG]ETFLAGS */
 	atomic_t		stop_eviction;	/* hold when working on inode */
-	struct timespec64	i_crtime;	/* file creation time */
-	unsigned int		fsflags;	/* flags for FS_IOC_[SG]ETFLAGS */
 #ifdef CONFIG_TMPFS_QUOTA
 	struct dquot		*i_dquot[MAXQUOTAS];
 #endif
-	struct offset_ctx	dir_offsets;	/* stable entry offsets */
 	struct inode		vfs_inode;
 };
 
-- 
2.35.3



* [PATCH 2/8] shmem: remove vma arg from shmem_get_folio_gfp()
  2023-09-30  3:23 [PATCH 0/8] shmem,tmpfs: general maintenance Hugh Dickins
  2023-09-30  3:25 ` [PATCH 1/8] shmem: shrink shmem_inode_info: dir_offsets in a union Hugh Dickins
@ 2023-09-30  3:26 ` Hugh Dickins
  2023-10-03 13:07   ` Jan Kara
  2023-09-30  3:27 ` [PATCH 3/8] shmem: factor shmem_falloc_wait() out of shmem_fault() Hugh Dickins
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 29+ messages in thread
From: Hugh Dickins @ 2023-09-30  3:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christian Brauner, Carlos Maiolino, Chuck Lever, Jan Kara,
	Matthew Wilcox, Johannes Weiner, Axel Rasmussen, linux-fsdevel,
	linux-kernel, linux-mm

The vma is already there in vmf->vma, so no need for a separate arg.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/shmem.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 69595d341882..824eb55671d2 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1921,14 +1921,13 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
  * vm. If we swap it in we mark it dirty since we also free the swap
  * entry since a page cannot live in both the swap and page cache.
  *
- * vma, vmf, and fault_type are only supplied by shmem_fault:
- * otherwise they are NULL.
+ * vmf and fault_type are only supplied by shmem_fault: otherwise they are NULL.
  */
 static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 		struct folio **foliop, enum sgp_type sgp, gfp_t gfp,
-		struct vm_area_struct *vma, struct vm_fault *vmf,
-		vm_fault_t *fault_type)
+		struct vm_fault *vmf, vm_fault_t *fault_type)
 {
+	struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
 	struct address_space *mapping = inode->i_mapping;
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	struct shmem_sb_info *sbinfo;
@@ -2141,7 +2140,7 @@ int shmem_get_folio(struct inode *inode, pgoff_t index, struct folio **foliop,
 		enum sgp_type sgp)
 {
 	return shmem_get_folio_gfp(inode, index, foliop, sgp,
-			mapping_gfp_mask(inode->i_mapping), NULL, NULL, NULL);
+			mapping_gfp_mask(inode->i_mapping), NULL, NULL);
 }
 
 /*
@@ -2225,7 +2224,7 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
 	}
 
 	err = shmem_get_folio_gfp(inode, vmf->pgoff, &folio, SGP_CACHE,
-				  gfp, vma, vmf, &ret);
+				  gfp, vmf, &ret);
 	if (err)
 		return vmf_error(err);
 	if (folio)
@@ -4897,7 +4896,7 @@ struct folio *shmem_read_folio_gfp(struct address_space *mapping,
 
 	BUG_ON(!shmem_mapping(mapping));
 	error = shmem_get_folio_gfp(inode, index, &folio, SGP_CACHE,
-				  gfp, NULL, NULL, NULL);
+				    gfp, NULL, NULL);
 	if (error)
 		return ERR_PTR(error);
 
-- 
2.35.3



* [PATCH 3/8] shmem: factor shmem_falloc_wait() out of shmem_fault()
  2023-09-30  3:23 [PATCH 0/8] shmem,tmpfs: general maintenance Hugh Dickins
  2023-09-30  3:25 ` [PATCH 1/8] shmem: shrink shmem_inode_info: dir_offsets in a union Hugh Dickins
  2023-09-30  3:26 ` [PATCH 2/8] shmem: remove vma arg from shmem_get_folio_gfp() Hugh Dickins
@ 2023-09-30  3:27 ` Hugh Dickins
  2023-10-03 13:18   ` Jan Kara
  2023-09-30  3:28 ` [PATCH 4/8] shmem: trivial tidyups, removing extra blank lines, etc Hugh Dickins
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 29+ messages in thread
From: Hugh Dickins @ 2023-09-30  3:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christian Brauner, Carlos Maiolino, Chuck Lever, Jan Kara,
	Matthew Wilcox, Johannes Weiner, Axel Rasmussen, linux-fsdevel,
	linux-kernel, linux-mm

That Trinity livelock shmem_falloc avoidance block is unlikely, and a
distraction from the proper business of shmem_fault(): separate it out.
(This used to help compilers save stack on the fault path too, but both
gcc and clang nowadays seem to make better choices anyway.)

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/shmem.c | 126 +++++++++++++++++++++++++++++------------------------
 1 file changed, 69 insertions(+), 57 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 824eb55671d2..5501a5bc8d8c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2148,87 +2148,99 @@ int shmem_get_folio(struct inode *inode, pgoff_t index, struct folio **foliop,
  * entry unconditionally - even if something else had already woken the
  * target.
  */
-static int synchronous_wake_function(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
+static int synchronous_wake_function(wait_queue_entry_t *wait,
+			unsigned int mode, int sync, void *key)
 {
 	int ret = default_wake_function(wait, mode, sync, key);
 	list_del_init(&wait->entry);
 	return ret;
 }
 
+/*
+ * Trinity finds that probing a hole which tmpfs is punching can
+ * prevent the hole-punch from ever completing: which in turn
+ * locks writers out with its hold on i_rwsem.  So refrain from
+ * faulting pages into the hole while it's being punched.  Although
+ * shmem_undo_range() does remove the additions, it may be unable to
+ * keep up, as each new page needs its own unmap_mapping_range() call,
+ * and the i_mmap tree grows ever slower to scan if new vmas are added.
+ *
+ * It does not matter if we sometimes reach this check just before the
+ * hole-punch begins, so that one fault then races with the punch:
+ * we just need to make racing faults a rare case.
+ *
+ * The implementation below would be much simpler if we just used a
+ * standard mutex or completion: but we cannot take i_rwsem in fault,
+ * and bloating every shmem inode for this unlikely case would be sad.
+ */
+static vm_fault_t shmem_falloc_wait(struct vm_fault *vmf, struct inode *inode)
+{
+	struct shmem_falloc *shmem_falloc;
+	struct file *fpin = NULL;
+	vm_fault_t ret = 0;
+
+	spin_lock(&inode->i_lock);
+	shmem_falloc = inode->i_private;
+	if (shmem_falloc &&
+	    shmem_falloc->waitq &&
+	    vmf->pgoff >= shmem_falloc->start &&
+	    vmf->pgoff < shmem_falloc->next) {
+		wait_queue_head_t *shmem_falloc_waitq;
+		DEFINE_WAIT_FUNC(shmem_fault_wait, synchronous_wake_function);
+
+		ret = VM_FAULT_NOPAGE;
+		fpin = maybe_unlock_mmap_for_io(vmf, NULL);
+		shmem_falloc_waitq = shmem_falloc->waitq;
+		prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait,
+				TASK_UNINTERRUPTIBLE);
+		spin_unlock(&inode->i_lock);
+		schedule();
+
+		/*
+		 * shmem_falloc_waitq points into the shmem_fallocate()
+		 * stack of the hole-punching task: shmem_falloc_waitq
+		 * is usually invalid by the time we reach here, but
+		 * finish_wait() does not dereference it in that case;
+		 * though i_lock needed lest racing with wake_up_all().
+		 */
+		spin_lock(&inode->i_lock);
+		finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
+	}
+	spin_unlock(&inode->i_lock);
+	if (fpin) {
+		fput(fpin);
+		ret = VM_FAULT_RETRY;
+	}
+	return ret;
+}
+
 static vm_fault_t shmem_fault(struct vm_fault *vmf)
 {
-	struct vm_area_struct *vma = vmf->vma;
-	struct inode *inode = file_inode(vma->vm_file);
+	struct inode *inode = file_inode(vmf->vma->vm_file);
 	gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
 	struct folio *folio = NULL;
+	vm_fault_t ret = 0;
 	int err;
-	vm_fault_t ret = VM_FAULT_LOCKED;
 
 	/*
 	 * Trinity finds that probing a hole which tmpfs is punching can
-	 * prevent the hole-punch from ever completing: which in turn
-	 * locks writers out with its hold on i_rwsem.  So refrain from
-	 * faulting pages into the hole while it's being punched.  Although
-	 * shmem_undo_range() does remove the additions, it may be unable to
-	 * keep up, as each new page needs its own unmap_mapping_range() call,
-	 * and the i_mmap tree grows ever slower to scan if new vmas are added.
-	 *
-	 * It does not matter if we sometimes reach this check just before the
-	 * hole-punch begins, so that one fault then races with the punch:
-	 * we just need to make racing faults a rare case.
-	 *
-	 * The implementation below would be much simpler if we just used a
-	 * standard mutex or completion: but we cannot take i_rwsem in fault,
-	 * and bloating every shmem inode for this unlikely case would be sad.
+	 * prevent the hole-punch from ever completing: noted in i_private.
 	 */
 	if (unlikely(inode->i_private)) {
-		struct shmem_falloc *shmem_falloc;
-
-		spin_lock(&inode->i_lock);
-		shmem_falloc = inode->i_private;
-		if (shmem_falloc &&
-		    shmem_falloc->waitq &&
-		    vmf->pgoff >= shmem_falloc->start &&
-		    vmf->pgoff < shmem_falloc->next) {
-			struct file *fpin;
-			wait_queue_head_t *shmem_falloc_waitq;
-			DEFINE_WAIT_FUNC(shmem_fault_wait, synchronous_wake_function);
-
-			ret = VM_FAULT_NOPAGE;
-			fpin = maybe_unlock_mmap_for_io(vmf, NULL);
-			if (fpin)
-				ret = VM_FAULT_RETRY;
-
-			shmem_falloc_waitq = shmem_falloc->waitq;
-			prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait,
-					TASK_UNINTERRUPTIBLE);
-			spin_unlock(&inode->i_lock);
-			schedule();
-
-			/*
-			 * shmem_falloc_waitq points into the shmem_fallocate()
-			 * stack of the hole-punching task: shmem_falloc_waitq
-			 * is usually invalid by the time we reach here, but
-			 * finish_wait() does not dereference it in that case;
-			 * though i_lock needed lest racing with wake_up_all().
-			 */
-			spin_lock(&inode->i_lock);
-			finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
-			spin_unlock(&inode->i_lock);
-
-			if (fpin)
-				fput(fpin);
+		ret = shmem_falloc_wait(vmf, inode);
+		if (ret)
 			return ret;
-		}
-		spin_unlock(&inode->i_lock);
 	}
 
+	WARN_ON_ONCE(vmf->page != NULL);
 	err = shmem_get_folio_gfp(inode, vmf->pgoff, &folio, SGP_CACHE,
 				  gfp, vmf, &ret);
 	if (err)
 		return vmf_error(err);
-	if (folio)
+	if (folio) {
 		vmf->page = folio_file_page(folio, vmf->pgoff);
+		ret |= VM_FAULT_LOCKED;
+	}
 	return ret;
 }
 
-- 
2.35.3



* [PATCH 4/8] shmem: trivial tidyups, removing extra blank lines, etc
  2023-09-30  3:23 [PATCH 0/8] shmem,tmpfs: general maintenance Hugh Dickins
                   ` (2 preceding siblings ...)
  2023-09-30  3:27 ` [PATCH 3/8] shmem: factor shmem_falloc_wait() out of shmem_fault() Hugh Dickins
@ 2023-09-30  3:28 ` Hugh Dickins
  2023-10-03 13:20   ` Jan Kara
  2023-09-30  3:30 ` [PATCH 5/8] shmem: shmem_acct_blocks() and shmem_inode_acct_blocks() Hugh Dickins
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 29+ messages in thread
From: Hugh Dickins @ 2023-09-30  3:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christian Brauner, Carlos Maiolino, Chuck Lever, Jan Kara,
	Matthew Wilcox, Johannes Weiner, Axel Rasmussen, linux-fsdevel,
	linux-kernel, linux-mm

Mostly removing a few superfluous blank lines, joining short arglines,
imposing some 80-column observance, correcting a couple of comments.
None of it more interesting than deleting a repeated INIT_LIST_HEAD().

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/shmem.c | 56 ++++++++++++++++++++----------------------------------
 1 file changed, 21 insertions(+), 35 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 5501a5bc8d8c..caee8ba841f7 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -756,7 +756,7 @@ static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo,
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 /*
- * Like filemap_add_folio, but error if expected item has gone.
+ * Somewhat like filemap_add_folio, but error if expected item has gone.
  */
 static int shmem_add_to_page_cache(struct folio *folio,
 				   struct address_space *mapping,
@@ -825,7 +825,7 @@ static int shmem_add_to_page_cache(struct folio *folio,
 }
 
 /*
- * Like delete_from_page_cache, but substitutes swap for @folio.
+ * Somewhat like filemap_remove_folio, but substitutes swap for @folio.
  */
 static void shmem_delete_from_page_cache(struct folio *folio, void *radswap)
 {
@@ -887,7 +887,6 @@ unsigned long shmem_partial_swap_usage(struct address_space *mapping,
 			cond_resched_rcu();
 		}
 	}
-
 	rcu_read_unlock();
 
 	return swapped << PAGE_SHIFT;
@@ -1213,7 +1212,6 @@ static int shmem_setattr(struct mnt_idmap *idmap,
 	if (i_uid_needs_update(idmap, attr, inode) ||
 	    i_gid_needs_update(idmap, attr, inode)) {
 		error = dquot_transfer(idmap, inode, attr);
-
 		if (error)
 			return error;
 	}
@@ -2456,7 +2454,6 @@ static struct inode *__shmem_get_inode(struct mnt_idmap *idmap,
 	if (err)
 		return ERR_PTR(err);
 
-
 	inode = new_inode(sb);
 	if (!inode) {
 		shmem_free_inode(sb, 0);
@@ -2481,11 +2478,10 @@ static struct inode *__shmem_get_inode(struct mnt_idmap *idmap,
 		shmem_set_inode_flags(inode, info->fsflags);
 	INIT_LIST_HEAD(&info->shrinklist);
 	INIT_LIST_HEAD(&info->swaplist);
-	INIT_LIST_HEAD(&info->swaplist);
-	if (sbinfo->noswap)
-		mapping_set_unevictable(inode->i_mapping);
 	simple_xattrs_init(&info->xattrs);
 	cache_no_acl(inode);
+	if (sbinfo->noswap)
+		mapping_set_unevictable(inode->i_mapping);
 	mapping_set_large_folios(inode->i_mapping);
 
 	switch (mode & S_IFMT) {
@@ -2697,7 +2693,6 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
 	}
 
 	ret = shmem_get_folio(inode, index, &folio, SGP_WRITE);
-
 	if (ret)
 		return ret;
 
@@ -3229,8 +3224,7 @@ shmem_mknod(struct mnt_idmap *idmap, struct inode *dir,
 	error = simple_acl_create(dir, inode);
 	if (error)
 		goto out_iput;
-	error = security_inode_init_security(inode, dir,
-					     &dentry->d_name,
+	error = security_inode_init_security(inode, dir, &dentry->d_name,
 					     shmem_initxattrs, NULL);
 	if (error && error != -EOPNOTSUPP)
 		goto out_iput;
@@ -3259,14 +3253,11 @@ shmem_tmpfile(struct mnt_idmap *idmap, struct inode *dir,
 	int error;
 
 	inode = shmem_get_inode(idmap, dir->i_sb, dir, mode, 0, VM_NORESERVE);
-
 	if (IS_ERR(inode)) {
 		error = PTR_ERR(inode);
 		goto err_out;
 	}
-
-	error = security_inode_init_security(inode, dir,
-					     NULL,
+	error = security_inode_init_security(inode, dir, NULL,
 					     shmem_initxattrs, NULL);
 	if (error && error != -EOPNOTSUPP)
 		goto out_iput;
@@ -3303,7 +3294,8 @@ static int shmem_create(struct mnt_idmap *idmap, struct inode *dir,
 /*
  * Link a file..
  */
-static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry)
+static int shmem_link(struct dentry *old_dentry, struct inode *dir,
+		      struct dentry *dentry)
 {
 	struct inode *inode = d_inode(old_dentry);
 	int ret = 0;
@@ -3334,7 +3326,7 @@ static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct dentr
 	inode_inc_iversion(dir);
 	inc_nlink(inode);
 	ihold(inode);	/* New dentry reference */
-	dget(dentry);		/* Extra pinning count for the created dentry */
+	dget(dentry);	/* Extra pinning count for the created dentry */
 	d_instantiate(dentry, inode);
 out:
 	return ret;
@@ -3354,7 +3346,7 @@ static int shmem_unlink(struct inode *dir, struct dentry *dentry)
 					     inode_set_ctime_current(inode));
 	inode_inc_iversion(dir);
 	drop_nlink(inode);
-	dput(dentry);	/* Undo the count from "create" - this does all the work */
+	dput(dentry);	/* Undo the count from "create" - does all the work */
 	return 0;
 }
 
@@ -3464,7 +3456,6 @@ static int shmem_symlink(struct mnt_idmap *idmap, struct inode *dir,
 
 	inode = shmem_get_inode(idmap, dir->i_sb, dir, S_IFLNK | 0777, 0,
 				VM_NORESERVE);
-
 	if (IS_ERR(inode))
 		return PTR_ERR(inode);
 
@@ -3518,8 +3509,7 @@ static void shmem_put_link(void *arg)
 	folio_put(arg);
 }
 
-static const char *shmem_get_link(struct dentry *dentry,
-				  struct inode *inode,
+static const char *shmem_get_link(struct dentry *dentry, struct inode *inode,
 				  struct delayed_call *done)
 {
 	struct folio *folio = NULL;
@@ -3593,8 +3583,7 @@ static int shmem_fileattr_set(struct mnt_idmap *idmap,
  * Callback for security_inode_init_security() for acquiring xattrs.
  */
 static int shmem_initxattrs(struct inode *inode,
-			    const struct xattr *xattr_array,
-			    void *fs_info)
+			    const struct xattr *xattr_array, void *fs_info)
 {
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
@@ -3778,7 +3767,6 @@ static struct dentry *shmem_find_alias(struct inode *inode)
 	return alias ?: d_find_any_alias(inode);
 }
 
-
 static struct dentry *shmem_fh_to_dentry(struct super_block *sb,
 		struct fid *fid, int fh_len, int fh_type)
 {
@@ -4362,8 +4350,8 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
 	}
 #endif /* CONFIG_TMPFS_QUOTA */
 
-	inode = shmem_get_inode(&nop_mnt_idmap, sb, NULL, S_IFDIR | sbinfo->mode, 0,
-				VM_NORESERVE);
+	inode = shmem_get_inode(&nop_mnt_idmap, sb, NULL,
+				S_IFDIR | sbinfo->mode, 0, VM_NORESERVE);
 	if (IS_ERR(inode)) {
 		error = PTR_ERR(inode);
 		goto failed;
@@ -4666,11 +4654,9 @@ static ssize_t shmem_enabled_show(struct kobject *kobj,
 
 	for (i = 0; i < ARRAY_SIZE(values); i++) {
 		len += sysfs_emit_at(buf, len,
-				     shmem_huge == values[i] ? "%s[%s]" : "%s%s",
-				     i ? " " : "",
-				     shmem_format_huge(values[i]));
+				shmem_huge == values[i] ? "%s[%s]" : "%s%s",
+				i ? " " : "", shmem_format_huge(values[i]));
 	}
-
 	len += sysfs_emit_at(buf, len, "\n");
 
 	return len;
@@ -4767,8 +4753,9 @@ EXPORT_SYMBOL_GPL(shmem_truncate_range);
 #define shmem_acct_size(flags, size)		0
 #define shmem_unacct_size(flags, size)		do {} while (0)
 
-static inline struct inode *shmem_get_inode(struct mnt_idmap *idmap, struct super_block *sb, struct inode *dir,
-					    umode_t mode, dev_t dev, unsigned long flags)
+static inline struct inode *shmem_get_inode(struct mnt_idmap *idmap,
+				struct super_block *sb, struct inode *dir,
+				umode_t mode, dev_t dev, unsigned long flags)
 {
 	struct inode *inode = ramfs_get_inode(sb, dir, mode, dev);
 	return inode ? inode : ERR_PTR(-ENOSPC);
@@ -4778,8 +4765,8 @@ static inline struct inode *shmem_get_inode(struct mnt_idmap *idmap, struct supe
 
 /* common code */
 
-static struct file *__shmem_file_setup(struct vfsmount *mnt, const char *name, loff_t size,
-				       unsigned long flags, unsigned int i_flags)
+static struct file *__shmem_file_setup(struct vfsmount *mnt, const char *name,
+			loff_t size, unsigned long flags, unsigned int i_flags)
 {
 	struct inode *inode;
 	struct file *res;
@@ -4798,7 +4785,6 @@ static struct file *__shmem_file_setup(struct vfsmount *mnt, const char *name, l
 
 	inode = shmem_get_inode(&nop_mnt_idmap, mnt->mnt_sb, NULL,
 				S_IFREG | S_IRWXUGO, 0, flags);
-
 	if (IS_ERR(inode)) {
 		shmem_unacct_size(flags, size);
 		return ERR_CAST(inode);
-- 
2.35.3



* [PATCH 5/8] shmem: shmem_acct_blocks() and shmem_inode_acct_blocks()
  2023-09-30  3:23 [PATCH 0/8] shmem,tmpfs: general maintenance Hugh Dickins
                   ` (3 preceding siblings ...)
  2023-09-30  3:28 ` [PATCH 4/8] shmem: trivial tidyups, removing extra blank lines, etc Hugh Dickins
@ 2023-09-30  3:30 ` Hugh Dickins
  2023-10-03 13:21   ` Jan Kara
  2023-09-30  3:31 ` [PATCH 6/8] shmem: move memcg charge out of shmem_add_to_page_cache() Hugh Dickins
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 29+ messages in thread
From: Hugh Dickins @ 2023-09-30  3:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christian Brauner, Carlos Maiolino, Chuck Lever, Jan Kara,
	Matthew Wilcox, Johannes Weiner, Axel Rasmussen, linux-fsdevel,
	linux-kernel, linux-mm

By historical accident, shmem_acct_block() and shmem_inode_acct_block()
were never pluralized when the pages argument was added, despite their
complements being shmem_unacct_blocks() and shmem_inode_unacct_blocks()
all along.  It has been an irritation: fix their naming at last.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/shmem.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index caee8ba841f7..63ba6037b23a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -189,10 +189,10 @@ static inline int shmem_reacct_size(unsigned long flags,
 /*
  * ... whereas tmpfs objects are accounted incrementally as
  * pages are allocated, in order to allow large sparse files.
- * shmem_get_folio reports shmem_acct_block failure as -ENOSPC not -ENOMEM,
+ * shmem_get_folio reports shmem_acct_blocks failure as -ENOSPC not -ENOMEM,
  * so that a failure on a sparse tmpfs mapping will give SIGBUS not OOM.
  */
-static inline int shmem_acct_block(unsigned long flags, long pages)
+static inline int shmem_acct_blocks(unsigned long flags, long pages)
 {
 	if (!(flags & VM_NORESERVE))
 		return 0;
@@ -207,13 +207,13 @@ static inline void shmem_unacct_blocks(unsigned long flags, long pages)
 		vm_unacct_memory(pages * VM_ACCT(PAGE_SIZE));
 }
 
-static int shmem_inode_acct_block(struct inode *inode, long pages)
+static int shmem_inode_acct_blocks(struct inode *inode, long pages)
 {
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
 	int err = -ENOSPC;
 
-	if (shmem_acct_block(info->flags, pages))
+	if (shmem_acct_blocks(info->flags, pages))
 		return err;
 
 	might_sleep();	/* when quotas */
@@ -447,7 +447,7 @@ bool shmem_charge(struct inode *inode, long pages)
 {
 	struct address_space *mapping = inode->i_mapping;
 
-	if (shmem_inode_acct_block(inode, pages))
+	if (shmem_inode_acct_blocks(inode, pages))
 		return false;
 
 	/* nrpages adjustment first, then shmem_recalc_inode() when balanced */
@@ -1671,7 +1671,7 @@ static struct folio *shmem_alloc_and_acct_folio(gfp_t gfp, struct inode *inode,
 		huge = false;
 	nr = huge ? HPAGE_PMD_NR : 1;
 
-	err = shmem_inode_acct_block(inode, nr);
+	err = shmem_inode_acct_blocks(inode, nr);
 	if (err)
 		goto failed;
 
@@ -2572,7 +2572,7 @@ int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
 	int ret;
 	pgoff_t max_off;
 
-	if (shmem_inode_acct_block(inode, 1)) {
+	if (shmem_inode_acct_blocks(inode, 1)) {
 		/*
 		 * We may have got a page, returned -ENOENT triggering a retry,
 		 * and now we find ourselves with -ENOMEM. Release the page, to
-- 
2.35.3



* [PATCH 6/8] shmem: move memcg charge out of shmem_add_to_page_cache()
  2023-09-30  3:23 [PATCH 0/8] shmem,tmpfs: general maintenance Hugh Dickins
                   ` (4 preceding siblings ...)
  2023-09-30  3:30 ` [PATCH 5/8] shmem: shmem_acct_blocks() and shmem_inode_acct_blocks() Hugh Dickins
@ 2023-09-30  3:31 ` Hugh Dickins
  2023-10-03 13:28   ` Jan Kara
  2023-09-30  3:32 ` [PATCH 7/8] shmem: _add_to_page_cache() before shmem_inode_acct_blocks() Hugh Dickins
  2023-09-30  3:42 ` [PATCH 8/8] shmem,percpu_counter: add _limited_add(fbc, limit, amount) Hugh Dickins
  7 siblings, 1 reply; 29+ messages in thread
From: Hugh Dickins @ 2023-09-30  3:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christian Brauner, Carlos Maiolino, Chuck Lever, Jan Kara,
	Matthew Wilcox, Johannes Weiner, Axel Rasmussen, linux-fsdevel,
	linux-kernel, linux-mm

Extract shmem's memcg charging out of shmem_add_to_page_cache(): it's
misleadingly done there, because many calls are dealing with a swapcache
page, whose memcg is nowadays always remembered while swapped out, with
the charge then re-levied when it's brought back into swapcache.

Temporarily move it back up to the shmem_get_folio_gfp() level, where
the memcg was charged before v5.8; but the next commit goes on to move
it back down to a new home.

In making this change, it becomes clear that shmem_swapin_folio() does
not need to know the vma, just the fault mm (if any): call it fault_mm
rather than charge_mm - let mem_cgroup_charge() decide whom to charge.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/shmem.c | 68 +++++++++++++++++++++++-------------------------------
 1 file changed, 29 insertions(+), 39 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 63ba6037b23a..0a7f7b567b80 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -146,9 +146,8 @@ static unsigned long shmem_default_max_inodes(void)
 #endif
 
 static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
-			     struct folio **foliop, enum sgp_type sgp,
-			     gfp_t gfp, struct vm_area_struct *vma,
-			     vm_fault_t *fault_type);
+			struct folio **foliop, enum sgp_type sgp, gfp_t gfp,
+			struct mm_struct *fault_mm, vm_fault_t *fault_type);
 
 static inline struct shmem_sb_info *SHMEM_SB(struct super_block *sb)
 {
@@ -760,12 +759,10 @@ static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo,
  */
 static int shmem_add_to_page_cache(struct folio *folio,
 				   struct address_space *mapping,
-				   pgoff_t index, void *expected, gfp_t gfp,
-				   struct mm_struct *charge_mm)
+				   pgoff_t index, void *expected, gfp_t gfp)
 {
 	XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio));
 	long nr = folio_nr_pages(folio);
-	int error;
 
 	VM_BUG_ON_FOLIO(index != round_down(index, nr), folio);
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
@@ -776,16 +773,7 @@ static int shmem_add_to_page_cache(struct folio *folio,
 	folio->mapping = mapping;
 	folio->index = index;
 
-	if (!folio_test_swapcache(folio)) {
-		error = mem_cgroup_charge(folio, charge_mm, gfp);
-		if (error) {
-			if (folio_test_pmd_mappable(folio)) {
-				count_vm_event(THP_FILE_FALLBACK);
-				count_vm_event(THP_FILE_FALLBACK_CHARGE);
-			}
-			goto error;
-		}
-	}
+	gfp &= GFP_RECLAIM_MASK;
 	folio_throttle_swaprate(folio, gfp);
 
 	do {
@@ -813,15 +801,12 @@ static int shmem_add_to_page_cache(struct folio *folio,
 	} while (xas_nomem(&xas, gfp));
 
 	if (xas_error(&xas)) {
-		error = xas_error(&xas);
-		goto error;
+		folio->mapping = NULL;
+		folio_ref_sub(folio, nr);
+		return xas_error(&xas);
 	}
 
 	return 0;
-error:
-	folio->mapping = NULL;
-	folio_ref_sub(folio, nr);
-	return error;
 }
 
 /*
@@ -1324,10 +1309,8 @@ static int shmem_unuse_swap_entries(struct inode *inode,
 
 		if (!xa_is_value(folio))
 			continue;
-		error = shmem_swapin_folio(inode, indices[i],
-					  &folio, SGP_CACHE,
-					  mapping_gfp_mask(mapping),
-					  NULL, NULL);
+		error = shmem_swapin_folio(inode, indices[i], &folio, SGP_CACHE,
+					mapping_gfp_mask(mapping), NULL, NULL);
 		if (error == 0) {
 			folio_unlock(folio);
 			folio_put(folio);
@@ -1810,12 +1793,11 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
  */
 static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 			     struct folio **foliop, enum sgp_type sgp,
-			     gfp_t gfp, struct vm_area_struct *vma,
+			     gfp_t gfp, struct mm_struct *fault_mm,
 			     vm_fault_t *fault_type)
 {
 	struct address_space *mapping = inode->i_mapping;
 	struct shmem_inode_info *info = SHMEM_I(inode);
-	struct mm_struct *charge_mm = vma ? vma->vm_mm : NULL;
 	struct swap_info_struct *si;
 	struct folio *folio = NULL;
 	swp_entry_t swap;
@@ -1843,7 +1825,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 		if (fault_type) {
 			*fault_type |= VM_FAULT_MAJOR;
 			count_vm_event(PGMAJFAULT);
-			count_memcg_event_mm(charge_mm, PGMAJFAULT);
+			count_memcg_event_mm(fault_mm, PGMAJFAULT);
 		}
 		/* Here we actually start the io */
 		folio = shmem_swapin(swap, gfp, info, index);
@@ -1880,8 +1862,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	}
 
 	error = shmem_add_to_page_cache(folio, mapping, index,
-					swp_to_radix_entry(swap), gfp,
-					charge_mm);
+					swp_to_radix_entry(swap), gfp);
 	if (error)
 		goto failed;
 
@@ -1929,7 +1910,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 	struct address_space *mapping = inode->i_mapping;
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	struct shmem_sb_info *sbinfo;
-	struct mm_struct *charge_mm;
+	struct mm_struct *fault_mm;
 	struct folio *folio;
 	pgoff_t hindex;
 	gfp_t huge_gfp;
@@ -1946,7 +1927,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 	}
 
 	sbinfo = SHMEM_SB(inode->i_sb);
-	charge_mm = vma ? vma->vm_mm : NULL;
+	fault_mm = vma ? vma->vm_mm : NULL;
 
 	folio = filemap_get_entry(mapping, index);
 	if (folio && vma && userfaultfd_minor(vma)) {
@@ -1958,7 +1939,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 
 	if (xa_is_value(folio)) {
 		error = shmem_swapin_folio(inode, index, &folio,
-					  sgp, gfp, vma, fault_type);
+					   sgp, gfp, fault_mm, fault_type);
 		if (error == -EEXIST)
 			goto repeat;
 
@@ -2044,9 +2025,16 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 	if (sgp == SGP_WRITE)
 		__folio_set_referenced(folio);
 
-	error = shmem_add_to_page_cache(folio, mapping, hindex,
-					NULL, gfp & GFP_RECLAIM_MASK,
-					charge_mm);
+	error = mem_cgroup_charge(folio, fault_mm, gfp);
+	if (error) {
+		if (folio_test_pmd_mappable(folio)) {
+			count_vm_event(THP_FILE_FALLBACK);
+			count_vm_event(THP_FILE_FALLBACK_CHARGE);
+		}
+		goto unacct;
+	}
+
+	error = shmem_add_to_page_cache(folio, mapping, hindex, NULL, gfp);
 	if (error)
 		goto unacct;
 
@@ -2644,8 +2632,10 @@ int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
 	if (unlikely(pgoff >= max_off))
 		goto out_release;
 
-	ret = shmem_add_to_page_cache(folio, mapping, pgoff, NULL,
-				      gfp & GFP_RECLAIM_MASK, dst_vma->vm_mm);
+	ret = mem_cgroup_charge(folio, dst_vma->vm_mm, gfp);
+	if (ret)
+		goto out_release;
+	ret = shmem_add_to_page_cache(folio, mapping, pgoff, NULL, gfp);
 	if (ret)
 		goto out_release;
 
-- 
2.35.3



* [PATCH 7/8] shmem: _add_to_page_cache() before shmem_inode_acct_blocks()
  2023-09-30  3:23 [PATCH 0/8] shmem,tmpfs: general maintenance Hugh Dickins
                   ` (5 preceding siblings ...)
  2023-09-30  3:31 ` [PATCH 6/8] shmem: move memcg charge out of shmem_add_to_page_cache() Hugh Dickins
@ 2023-09-30  3:32 ` Hugh Dickins
  2023-09-30  3:42 ` [PATCH 8/8] shmem,percpu_counter: add _limited_add(fbc, limit, amount) Hugh Dickins
  7 siblings, 0 replies; 29+ messages in thread
From: Hugh Dickins @ 2023-09-30  3:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christian Brauner, Carlos Maiolino, Chuck Lever, Jan Kara,
	Matthew Wilcox, Johannes Weiner, Axel Rasmussen, linux-fsdevel,
	linux-kernel, linux-mm

There has been a recurring problem, that when a tmpfs volume is being
filled by racing threads, some fail with ENOSPC (or consequent SIGBUS
or EFAULT) even though all allocations were within the permitted size.

This has been a problem since the early days, magnified and complicated
by the addition of huge pages.  We have often worked around it by adding
some slop to the tmpfs size, but it's hard to say how much is needed,
and some users prefer not to do that: e.g. keeping sparse files in a
tightly tailored tmpfs helps to prevent accidental writing to holes.

This comes from the allocation sequence:
1. check page cache for existing folio
2. check and reserve from vm_enough_memory
3. check and account from size of tmpfs
4. if huge, check page cache for overlapping folio
5. allocate physical folio, huge or small
6. check and charge from mem cgroup limit
7. add to page cache (but maybe another folio already got in).

Concurrent tasks allocating at the same position could deplete the size
allowance and fail.  Doing vm_enough_memory and size checks before the
folio allocation was intentional (to limit the load on the page allocator
from this source) and still has some virtue; but memory cgroup never did
that, so I think it's better reordered to favour predictable behaviour.

1. check page cache for existing folio
2. if huge, check page cache for overlapping folio
3. allocate physical folio, huge or small
4. check and charge from mem cgroup limit
5. add to page cache (but maybe another folio already got in)
6. check and reserve from vm_enough_memory
7. check and account from size of tmpfs.

The folio lock held from allocation onwards ensures that the !uptodate
folio cannot be used by others, and can safely be deleted from the cache
if checks 6 or 7 subsequently fail (and those waiting on folio lock
already check that the folio was not truncated once they get the lock);
and the early addition to page cache ensures that racers find it before
they try to duplicate the accounting.
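
In the code below, the reordered steps are gathered into the new
shmem_alloc_and_add_folio(): step 2 is its xa_find() conflict check on
the huge path, and steps 3 to 7 map roughly as in this abridged sketch
(error handling and the huge/small split omitted):

	folio = shmem_alloc_folio(gfp, info, index);		/* 3 */
	error = mem_cgroup_charge(folio, fault_mm, gfp);	/* 4 */
	error = shmem_add_to_page_cache(folio, mapping,
					index, NULL, gfp);	/* 5 */
	error = shmem_inode_acct_blocks(inode, pages);		/* 6 and 7 */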

Seize the opportunity to tidy up shmem_get_folio_gfp()'s ENOSPC retrying,
which can be combined inside the new shmem_alloc_and_add_folio(): doing
2 splits twice (once huge, once nonhuge) is not exactly equivalent to
trying 5 splits (and giving up early on huge), but let's keep it simple
unless more complication proves necessary.

Userfaultfd is a foreign country: they do things differently there,
and for good reason - to avoid mmap_lock deadlock.  Leave ordering in
shmem_mfill_atomic_pte() untouched for now, but I would rather like to
mesh it better with shmem_get_folio_gfp() in the future.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/shmem.c | 235 +++++++++++++++++++++++++++--------------------------
 1 file changed, 121 insertions(+), 114 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 0a7f7b567b80..4f4ab26bc58a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -789,13 +789,11 @@ static int shmem_add_to_page_cache(struct folio *folio,
 		xas_store(&xas, folio);
 		if (xas_error(&xas))
 			goto unlock;
-		if (folio_test_pmd_mappable(folio)) {
-			count_vm_event(THP_FILE_ALLOC);
+		if (folio_test_pmd_mappable(folio))
 			__lruvec_stat_mod_folio(folio, NR_SHMEM_THPS, nr);
-		}
-		mapping->nrpages += nr;
 		__lruvec_stat_mod_folio(folio, NR_FILE_PAGES, nr);
 		__lruvec_stat_mod_folio(folio, NR_SHMEM, nr);
+		mapping->nrpages += nr;
 unlock:
 		xas_unlock_irq(&xas);
 	} while (xas_nomem(&xas, gfp));
@@ -1612,25 +1610,17 @@ static struct folio *shmem_alloc_hugefolio(gfp_t gfp,
 		struct shmem_inode_info *info, pgoff_t index)
 {
 	struct vm_area_struct pvma;
-	struct address_space *mapping = info->vfs_inode.i_mapping;
-	pgoff_t hindex;
 	struct folio *folio;
 
-	hindex = round_down(index, HPAGE_PMD_NR);
-	if (xa_find(&mapping->i_pages, &hindex, hindex + HPAGE_PMD_NR - 1,
-								XA_PRESENT))
-		return NULL;
-
-	shmem_pseudo_vma_init(&pvma, info, hindex);
+	shmem_pseudo_vma_init(&pvma, info, index);
 	folio = vma_alloc_folio(gfp, HPAGE_PMD_ORDER, &pvma, 0, true);
 	shmem_pseudo_vma_destroy(&pvma);
-	if (!folio)
-		count_vm_event(THP_FILE_FALLBACK);
+
 	return folio;
 }
 
 static struct folio *shmem_alloc_folio(gfp_t gfp,
-			struct shmem_inode_info *info, pgoff_t index)
+		struct shmem_inode_info *info, pgoff_t index)
 {
 	struct vm_area_struct pvma;
 	struct folio *folio;
@@ -1642,36 +1632,101 @@ static struct folio *shmem_alloc_folio(gfp_t gfp,
 	return folio;
 }
 
-static struct folio *shmem_alloc_and_acct_folio(gfp_t gfp, struct inode *inode,
-		pgoff_t index, bool huge)
+static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
+		struct inode *inode, pgoff_t index,
+		struct mm_struct *fault_mm, bool huge)
 {
+	struct address_space *mapping = inode->i_mapping;
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	struct folio *folio;
-	int nr;
-	int err;
+	long pages;
+	int error;
 
 	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
 		huge = false;
-	nr = huge ? HPAGE_PMD_NR : 1;
 
-	err = shmem_inode_acct_blocks(inode, nr);
-	if (err)
-		goto failed;
+	if (huge) {
+		pages = HPAGE_PMD_NR;
+		index = round_down(index, HPAGE_PMD_NR);
+
+		/*
+		 * Check for conflict before waiting on a huge allocation.
+		 * Conflict might be that a huge page has just been allocated
+		 * and added to page cache by a racing thread, or that there
+		 * is already at least one small page in the huge extent.
+		 * Be careful to retry when appropriate, but not forever!
+		 * Elsewhere -EEXIST would be the right code, but not here.
+		 */
+		if (xa_find(&mapping->i_pages, &index,
+				index + HPAGE_PMD_NR - 1, XA_PRESENT))
+			return ERR_PTR(-E2BIG);
 
-	if (huge)
 		folio = shmem_alloc_hugefolio(gfp, info, index);
-	else
+		if (!folio)
+			count_vm_event(THP_FILE_FALLBACK);
+	} else {
+		pages = 1;
 		folio = shmem_alloc_folio(gfp, info, index);
-	if (folio) {
-		__folio_set_locked(folio);
-		__folio_set_swapbacked(folio);
-		return folio;
+	}
+	if (!folio)
+		return ERR_PTR(-ENOMEM);
+
+	__folio_set_locked(folio);
+	__folio_set_swapbacked(folio);
+
+	gfp &= GFP_RECLAIM_MASK;
+	error = mem_cgroup_charge(folio, fault_mm, gfp);
+	if (error) {
+		if (xa_find(&mapping->i_pages, &index,
+				index + pages - 1, XA_PRESENT)) {
+			error = -EEXIST;
+		} else if (huge) {
+			count_vm_event(THP_FILE_FALLBACK);
+			count_vm_event(THP_FILE_FALLBACK_CHARGE);
+		}
+		goto unlock;
 	}
 
-	err = -ENOMEM;
-	shmem_inode_unacct_blocks(inode, nr);
-failed:
-	return ERR_PTR(err);
+	error = shmem_add_to_page_cache(folio, mapping, index, NULL, gfp);
+	if (error)
+		goto unlock;
+
+	error = shmem_inode_acct_blocks(inode, pages);
+	if (error) {
+		struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
+		long freed;
+		/*
+		 * Try to reclaim some space by splitting a few
+		 * large folios beyond i_size on the filesystem.
+		 */
+		shmem_unused_huge_shrink(sbinfo, NULL, 2);
+		/*
+		 * And do a shmem_recalc_inode() to account for freed pages:
+		 * except our folio is there in cache, so not quite balanced.
+		 */
+		spin_lock(&info->lock);
+		freed = pages + info->alloced - info->swapped -
+			READ_ONCE(mapping->nrpages);
+		if (freed > 0)
+			info->alloced -= freed;
+		spin_unlock(&info->lock);
+		if (freed > 0)
+			shmem_inode_unacct_blocks(inode, freed);
+		error = shmem_inode_acct_blocks(inode, pages);
+		if (error) {
+			filemap_remove_folio(folio);
+			goto unlock;
+		}
+	}
+
+	shmem_recalc_inode(inode, pages, 0);
+	folio_add_lru(folio);
+	return folio;
+
+unlock:
+	folio_unlock(folio);
+	folio_put(folio);
+	return ERR_PTR(error);
 }
 
 /*
@@ -1907,29 +1962,22 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 		struct vm_fault *vmf, vm_fault_t *fault_type)
 {
 	struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
-	struct address_space *mapping = inode->i_mapping;
-	struct shmem_inode_info *info = SHMEM_I(inode);
-	struct shmem_sb_info *sbinfo;
 	struct mm_struct *fault_mm;
 	struct folio *folio;
-	pgoff_t hindex;
-	gfp_t huge_gfp;
 	int error;
-	int once = 0;
-	int alloced = 0;
+	bool alloced;
 
 	if (index > (MAX_LFS_FILESIZE >> PAGE_SHIFT))
 		return -EFBIG;
 repeat:
 	if (sgp <= SGP_CACHE &&
-	    ((loff_t)index << PAGE_SHIFT) >= i_size_read(inode)) {
+	    ((loff_t)index << PAGE_SHIFT) >= i_size_read(inode))
 		return -EINVAL;
-	}
 
-	sbinfo = SHMEM_SB(inode->i_sb);
+	alloced = false;
 	fault_mm = vma ? vma->vm_mm : NULL;
 
-	folio = filemap_get_entry(mapping, index);
+	folio = filemap_get_entry(inode->i_mapping, index);
 	if (folio && vma && userfaultfd_minor(vma)) {
 		if (!xa_is_value(folio))
 			folio_put(folio);
@@ -1951,7 +1999,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 		folio_lock(folio);
 
 		/* Has the folio been truncated or swapped out? */
-		if (unlikely(folio->mapping != mapping)) {
+		if (unlikely(folio->mapping != inode->i_mapping)) {
 			folio_unlock(folio);
 			folio_put(folio);
 			goto repeat;
@@ -1986,65 +2034,38 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 		return 0;
 	}
 
-	if (!shmem_is_huge(inode, index, false,
-			   vma ? vma->vm_mm : NULL, vma ? vma->vm_flags : 0))
-		goto alloc_nohuge;
+	if (shmem_is_huge(inode, index, false, fault_mm,
+			  vma ? vma->vm_flags : 0)) {
+		gfp_t huge_gfp;
 
-	huge_gfp = vma_thp_gfp_mask(vma);
-	huge_gfp = limit_gfp_mask(huge_gfp, gfp);
-	folio = shmem_alloc_and_acct_folio(huge_gfp, inode, index, true);
-	if (IS_ERR(folio)) {
-alloc_nohuge:
-		folio = shmem_alloc_and_acct_folio(gfp, inode, index, false);
-	}
-	if (IS_ERR(folio)) {
-		int retry = 5;
-
-		error = PTR_ERR(folio);
-		folio = NULL;
-		if (error != -ENOSPC)
-			goto unlock;
-		/*
-		 * Try to reclaim some space by splitting a large folio
-		 * beyond i_size on the filesystem.
-		 */
-		while (retry--) {
-			int ret;
-
-			ret = shmem_unused_huge_shrink(sbinfo, NULL, 1);
-			if (ret == SHRINK_STOP)
-				break;
-			if (ret)
-				goto alloc_nohuge;
+		huge_gfp = vma_thp_gfp_mask(vma);
+		huge_gfp = limit_gfp_mask(huge_gfp, gfp);
+		folio = shmem_alloc_and_add_folio(huge_gfp,
+				inode, index, fault_mm, true);
+		if (!IS_ERR(folio)) {
+			count_vm_event(THP_FILE_ALLOC);
+			goto alloced;
 		}
+		if (PTR_ERR(folio) == -EEXIST)
+			goto repeat;
+	}
+
+	folio = shmem_alloc_and_add_folio(gfp, inode, index, fault_mm, false);
+	if (IS_ERR(folio)) {
+		error = PTR_ERR(folio);
+		if (error == -EEXIST)
+			goto repeat;
+		folio = NULL;
 		goto unlock;
 	}
 
-	hindex = round_down(index, folio_nr_pages(folio));
-
-	if (sgp == SGP_WRITE)
-		__folio_set_referenced(folio);
-
-	error = mem_cgroup_charge(folio, fault_mm, gfp);
-	if (error) {
-		if (folio_test_pmd_mappable(folio)) {
-			count_vm_event(THP_FILE_FALLBACK);
-			count_vm_event(THP_FILE_FALLBACK_CHARGE);
-		}
-		goto unacct;
-	}
-
-	error = shmem_add_to_page_cache(folio, mapping, hindex, NULL, gfp);
-	if (error)
-		goto unacct;
-
-	folio_add_lru(folio);
-	shmem_recalc_inode(inode, folio_nr_pages(folio), 0);
+alloced:
 	alloced = true;
-
 	if (folio_test_pmd_mappable(folio) &&
 	    DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) <
 					folio_next_index(folio) - 1) {
+		struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
+		struct shmem_inode_info *info = SHMEM_I(inode);
 		/*
 		 * Part of the large folio is beyond i_size: subject
 		 * to shrink under memory pressure.
@@ -2062,6 +2083,8 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 		spin_unlock(&sbinfo->shrinklist_lock);
 	}
 
+	if (sgp == SGP_WRITE)
+		folio_set_referenced(folio);
 	/*
 	 * Let SGP_FALLOC use the SGP_WRITE optimization on a new folio.
 	 */
@@ -2085,11 +2108,6 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 	/* Perhaps the file has been truncated since we checked */
 	if (sgp <= SGP_CACHE &&
 	    ((loff_t)index << PAGE_SHIFT) >= i_size_read(inode)) {
-		if (alloced) {
-			folio_clear_dirty(folio);
-			filemap_remove_folio(folio);
-			shmem_recalc_inode(inode, 0, 0);
-		}
 		error = -EINVAL;
 		goto unlock;
 	}
@@ -2100,25 +2118,14 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 	/*
 	 * Error recovery.
 	 */
-unacct:
-	shmem_inode_unacct_blocks(inode, folio_nr_pages(folio));
-
-	if (folio_test_large(folio)) {
-		folio_unlock(folio);
-		folio_put(folio);
-		goto alloc_nohuge;
-	}
 unlock:
+	if (alloced)
+		filemap_remove_folio(folio);
+	shmem_recalc_inode(inode, 0, 0);
 	if (folio) {
 		folio_unlock(folio);
 		folio_put(folio);
 	}
-	if (error == -ENOSPC && !once++) {
-		shmem_recalc_inode(inode, 0, 0);
-		goto repeat;
-	}
-	if (error == -EEXIST)
-		goto repeat;
 	return error;
 }
 
-- 
2.35.3



* [PATCH 8/8] shmem,percpu_counter: add _limited_add(fbc, limit, amount)
  2023-09-30  3:23 [PATCH 0/8] shmem,tmpfs: general maintenance Hugh Dickins
                   ` (6 preceding siblings ...)
  2023-09-30  3:32 ` [PATCH 7/8] shmem: _add_to_page_cache() before shmem_inode_acct_blocks() Hugh Dickins
@ 2023-09-30  3:42 ` Hugh Dickins
  2023-10-04 15:32   ` Jan Kara
                     ` (3 more replies)
  7 siblings, 4 replies; 29+ messages in thread
From: Hugh Dickins @ 2023-09-30  3:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tim Chen, Dave Chinner, Darrick J. Wong, Christian Brauner,
	Carlos Maiolino, Chuck Lever, Jan Kara, Matthew Wilcox,
	Johannes Weiner, Axel Rasmussen, linux-fsdevel, linux-kernel,
	linux-mm

Percpu counter's compare and add are separate functions: without locking
around them (which would defeat their purpose), it has been possible to
overflow the intended limit.  Imagine all the other CPUs fallocating
tmpfs huge pages to the limit, in between this CPU's compare and its add.
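
To illustrate, here is the window in the existing shmem caller
(abridged from the mm/shmem.c hunk below):

	if (sbinfo->max_blocks) {
		if (percpu_counter_compare(&sbinfo->used_blocks,
					   sbinfo->max_blocks - pages) > 0)
			goto unacct;
		err = dquot_alloc_block_nodirty(inode, pages);
		if (err)
			goto unacct;
		/* other CPUs may pass the same compare before this add */
		percpu_counter_add(&sbinfo->used_blocks, pages);
	}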

I have not seen reports of that happening; but tmpfs's recent addition
of dquot_alloc_block_nodirty() in between the compare and the add makes
it even more likely, and I'd be uncomfortable to leave it unfixed.

Introduce percpu_counter_limited_add(fbc, limit, amount) to prevent it.

I believe this implementation is correct, and slightly more efficient
than the combination of compare and add (taking the lock once rather
than twice when nearing full - the last 128MiB of a tmpfs volume on a
machine with 128 CPUs and 4KiB pages); but it does beg for a better
design - when nearing full, there is no new batching, but the costly
percpu counter sum across CPUs still has to be done, while locked.
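
(Spelling out that figure: assuming the default percpu_counter_batch of
twice the number of online CPUs, clamped to at least 32, a 128-CPU
machine has batch 256, so the locked path is taken once the count comes
within 256 * 128 = 32768 pages of the limit, i.e. 128MiB at 4KiB per
page.)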

Follow __percpu_counter_sum()'s example, including cpu_dying_mask as
well as cpu_online_mask: but shouldn't __percpu_counter_compare() and
__percpu_counter_limited_add() then be adding a num_dying_cpus() to
num_online_cpus(), when they calculate the maximum which could be held
across CPUs?  But the times when it matters would be vanishingly rare.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Darrick J. Wong <djwong@kernel.org>
---
Tim, Dave, Darrick: I didn't want to waste your time on patches 1-7,
which are just internal to shmem, and do not affect this patch (which
applies to v6.6-rc and linux-next as is): but want to run this by you.

 include/linux/percpu_counter.h | 23 +++++++++++++++
 lib/percpu_counter.c           | 53 ++++++++++++++++++++++++++++++++++
 mm/shmem.c                     | 10 +++----
 3 files changed, 81 insertions(+), 5 deletions(-)

diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index d01351b1526f..8cb7c071bd5c 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -57,6 +57,8 @@ void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount,
 			      s32 batch);
 s64 __percpu_counter_sum(struct percpu_counter *fbc);
 int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch);
+bool __percpu_counter_limited_add(struct percpu_counter *fbc, s64 limit,
+				  s64 amount, s32 batch);
 void percpu_counter_sync(struct percpu_counter *fbc);
 
 static inline int percpu_counter_compare(struct percpu_counter *fbc, s64 rhs)
@@ -69,6 +71,13 @@ static inline void percpu_counter_add(struct percpu_counter *fbc, s64 amount)
 	percpu_counter_add_batch(fbc, amount, percpu_counter_batch);
 }
 
+static inline bool
+percpu_counter_limited_add(struct percpu_counter *fbc, s64 limit, s64 amount)
+{
+	return __percpu_counter_limited_add(fbc, limit, amount,
+					    percpu_counter_batch);
+}
+
 /*
  * With percpu_counter_add_local() and percpu_counter_sub_local(), counts
  * are accumulated in local per cpu counter and not in fbc->count until
@@ -185,6 +194,20 @@ percpu_counter_add(struct percpu_counter *fbc, s64 amount)
 	local_irq_restore(flags);
 }
 
+static inline bool
+percpu_counter_limited_add(struct percpu_counter *fbc, s64 limit, s64 amount)
+{
+	unsigned long flags;
+	s64 count;
+
+	local_irq_save(flags);
+	count = fbc->count + amount;
+	if (count <= limit)
+		fbc->count = count;
+	local_irq_restore(flags);
+	return count <= limit;
+}
+
 /* non-SMP percpu_counter_add_local is the same with percpu_counter_add */
 static inline void
 percpu_counter_add_local(struct percpu_counter *fbc, s64 amount)
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index 9073430dc865..58a3392f471b 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -278,6 +278,59 @@ int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)
 }
 EXPORT_SYMBOL(__percpu_counter_compare);
 
+/*
+ * Compare counter, and add amount if the total is within limit.
+ * Return true if amount was added, false if it would exceed limit.
+ */
+bool __percpu_counter_limited_add(struct percpu_counter *fbc,
+				  s64 limit, s64 amount, s32 batch)
+{
+	s64 count;
+	s64 unknown;
+	unsigned long flags;
+	bool good;
+
+	if (amount > limit)
+		return false;
+
+	local_irq_save(flags);
+	unknown = batch * num_online_cpus();
+	count = __this_cpu_read(*fbc->counters);
+
+	/* Skip taking the lock when safe */
+	if (abs(count + amount) <= batch &&
+	    fbc->count + unknown <= limit) {
+		this_cpu_add(*fbc->counters, amount);
+		local_irq_restore(flags);
+		return true;
+	}
+
+	raw_spin_lock(&fbc->lock);
+	count = fbc->count + amount;
+
+	/* Skip percpu_counter_sum() when safe */
+	if (count + unknown > limit) {
+		s32 *pcount;
+		int cpu;
+
+		for_each_cpu_or(cpu, cpu_online_mask, cpu_dying_mask) {
+			pcount = per_cpu_ptr(fbc->counters, cpu);
+			count += *pcount;
+		}
+	}
+
+	good = count <= limit;
+	if (good) {
+		count = __this_cpu_read(*fbc->counters);
+		fbc->count += count + amount;
+		__this_cpu_sub(*fbc->counters, count);
+	}
+
+	raw_spin_unlock(&fbc->lock);
+	local_irq_restore(flags);
+	return good;
+}
+
 static int __init percpu_counter_startup(void)
 {
 	int ret;
diff --git a/mm/shmem.c b/mm/shmem.c
index 4f4ab26bc58a..7cb72c747954 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -217,15 +217,15 @@ static int shmem_inode_acct_blocks(struct inode *inode, long pages)
 
 	might_sleep();	/* when quotas */
 	if (sbinfo->max_blocks) {
-		if (percpu_counter_compare(&sbinfo->used_blocks,
-					   sbinfo->max_blocks - pages) > 0)
+		if (!percpu_counter_limited_add(&sbinfo->used_blocks,
+						sbinfo->max_blocks, pages))
 			goto unacct;
 
 		err = dquot_alloc_block_nodirty(inode, pages);
-		if (err)
+		if (err) {
+			percpu_counter_sub(&sbinfo->used_blocks, pages);
 			goto unacct;
-
-		percpu_counter_add(&sbinfo->used_blocks, pages);
+		}
 	} else {
 		err = dquot_alloc_block_nodirty(inode, pages);
 		if (err)
-- 
2.35.3



* Re: [PATCH 1/8] shmem: shrink shmem_inode_info: dir_offsets in a union
  2023-09-30  3:25 ` [PATCH 1/8] shmem: shrink shmem_inode_info: dir_offsets in a union Hugh Dickins
@ 2023-09-30 16:16   ` Chuck Lever
  2023-10-03 13:06   ` Jan Kara
  1 sibling, 0 replies; 29+ messages in thread
From: Chuck Lever @ 2023-09-30 16:16 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Christian Brauner, Carlos Maiolino, Jan Kara,
	Matthew Wilcox, Johannes Weiner, Axel Rasmussen, linux-fsdevel,
	linux-kernel, linux-mm

On Fri, Sep 29, 2023 at 08:25:38PM -0700, Hugh Dickins wrote:
> Shave 32 bytes off (the 64-bit) shmem_inode_info.  There was a 4-byte
> pahole after stop_eviction, better filled by fsflags.  And the 24-byte
> dir_offsets can only be used by directories, whereas shrinklist and
> swaplist only by shmem_mapping() inodes (regular files or long symlinks):
> so put those into a union.  No change in mm/shmem.c is required for this.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>


> ---
>  include/linux/shmem_fs.h | 16 ++++++++++------
>  1 file changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index 6b0c626620f5..2caa6b86106a 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -23,18 +23,22 @@ struct shmem_inode_info {
>  	unsigned long		flags;
>  	unsigned long		alloced;	/* data pages alloced to file */
>  	unsigned long		swapped;	/* subtotal assigned to swap */
> -	pgoff_t			fallocend;	/* highest fallocate endindex */
> -	struct list_head        shrinklist;     /* shrinkable hpage inodes */
> -	struct list_head	swaplist;	/* chain of maybes on swap */
> +	union {
> +	    struct offset_ctx	dir_offsets;	/* stable directory offsets */
> +	    struct {
> +		struct list_head shrinklist;	/* shrinkable hpage inodes */
> +		struct list_head swaplist;	/* chain of maybes on swap */
> +	    };
> +	};
> +	struct timespec64	i_crtime;	/* file creation time */
>  	struct shared_policy	policy;		/* NUMA memory alloc policy */
>  	struct simple_xattrs	xattrs;		/* list of xattrs */
> +	pgoff_t			fallocend;	/* highest fallocate endindex */
> +	unsigned int		fsflags;	/* for FS_IOC_[SG]ETFLAGS */
>  	atomic_t		stop_eviction;	/* hold when working on inode */
> -	struct timespec64	i_crtime;	/* file creation time */
> -	unsigned int		fsflags;	/* flags for FS_IOC_[SG]ETFLAGS */
>  #ifdef CONFIG_TMPFS_QUOTA
>  	struct dquot		*i_dquot[MAXQUOTAS];
>  #endif
> -	struct offset_ctx	dir_offsets;	/* stable entry offsets */
>  	struct inode		vfs_inode;
>  };
>  
> -- 
> 2.35.3
> 

-- 
Chuck Lever


* Re: [PATCH 1/8] shmem: shrink shmem_inode_info: dir_offsets in a union
  2023-09-30  3:25 ` [PATCH 1/8] shmem: shrink shmem_inode_info: dir_offsets in a union Hugh Dickins
  2023-09-30 16:16   ` Chuck Lever
@ 2023-10-03 13:06   ` Jan Kara
  1 sibling, 0 replies; 29+ messages in thread
From: Jan Kara @ 2023-10-03 13:06 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Christian Brauner, Carlos Maiolino, Chuck Lever,
	Jan Kara, Matthew Wilcox, Johannes Weiner, Axel Rasmussen,
	linux-fsdevel, linux-kernel, linux-mm

On Fri 29-09-23 20:25:38, Hugh Dickins wrote:
> Shave 32 bytes off (the 64-bit) shmem_inode_info.  There was a 4-byte
> pahole after stop_eviction, better filled by fsflags.  And the 24-byte
> dir_offsets can only be used by directories, whereas shrinklist and
> swaplist only by shmem_mapping() inodes (regular files or long symlinks):
> so put those into a union.  No change in mm/shmem.c is required for this.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>

Looks good to me. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza


> ---
>  include/linux/shmem_fs.h | 16 ++++++++++------
>  1 file changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index 6b0c626620f5..2caa6b86106a 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -23,18 +23,22 @@ struct shmem_inode_info {
>  	unsigned long		flags;
>  	unsigned long		alloced;	/* data pages alloced to file */
>  	unsigned long		swapped;	/* subtotal assigned to swap */
> -	pgoff_t			fallocend;	/* highest fallocate endindex */
> -	struct list_head        shrinklist;     /* shrinkable hpage inodes */
> -	struct list_head	swaplist;	/* chain of maybes on swap */
> +	union {
> +	    struct offset_ctx	dir_offsets;	/* stable directory offsets */
> +	    struct {
> +		struct list_head shrinklist;	/* shrinkable hpage inodes */
> +		struct list_head swaplist;	/* chain of maybes on swap */
> +	    };
> +	};
> +	struct timespec64	i_crtime;	/* file creation time */
>  	struct shared_policy	policy;		/* NUMA memory alloc policy */
>  	struct simple_xattrs	xattrs;		/* list of xattrs */
> +	pgoff_t			fallocend;	/* highest fallocate endindex */
> +	unsigned int		fsflags;	/* for FS_IOC_[SG]ETFLAGS */
>  	atomic_t		stop_eviction;	/* hold when working on inode */
> -	struct timespec64	i_crtime;	/* file creation time */
> -	unsigned int		fsflags;	/* flags for FS_IOC_[SG]ETFLAGS */
>  #ifdef CONFIG_TMPFS_QUOTA
>  	struct dquot		*i_dquot[MAXQUOTAS];
>  #endif
> -	struct offset_ctx	dir_offsets;	/* stable entry offsets */
>  	struct inode		vfs_inode;
>  };
>  
> -- 
> 2.35.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/8] shmem: remove vma arg from shmem_get_folio_gfp()
  2023-09-30  3:26 ` [PATCH 2/8] shmem: remove vma arg from shmem_get_folio_gfp() Hugh Dickins
@ 2023-10-03 13:07   ` Jan Kara
  0 siblings, 0 replies; 29+ messages in thread
From: Jan Kara @ 2023-10-03 13:07 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Christian Brauner, Carlos Maiolino, Chuck Lever,
	Jan Kara, Matthew Wilcox, Johannes Weiner, Axel Rasmussen,
	linux-fsdevel, linux-kernel, linux-mm

On Fri 29-09-23 20:26:53, Hugh Dickins wrote:
> The vma is already there in vmf->vma, so no need for a separate arg.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>

Sure. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  mm/shmem.c | 13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 69595d341882..824eb55671d2 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1921,14 +1921,13 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>   * vm. If we swap it in we mark it dirty since we also free the swap
>   * entry since a page cannot live in both the swap and page cache.
>   *
> - * vma, vmf, and fault_type are only supplied by shmem_fault:
> - * otherwise they are NULL.
> + * vmf and fault_type are only supplied by shmem_fault: otherwise they are NULL.
>   */
>  static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
>  		struct folio **foliop, enum sgp_type sgp, gfp_t gfp,
> -		struct vm_area_struct *vma, struct vm_fault *vmf,
> -		vm_fault_t *fault_type)
> +		struct vm_fault *vmf, vm_fault_t *fault_type)
>  {
> +	struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
>  	struct address_space *mapping = inode->i_mapping;
>  	struct shmem_inode_info *info = SHMEM_I(inode);
>  	struct shmem_sb_info *sbinfo;
> @@ -2141,7 +2140,7 @@ int shmem_get_folio(struct inode *inode, pgoff_t index, struct folio **foliop,
>  		enum sgp_type sgp)
>  {
>  	return shmem_get_folio_gfp(inode, index, foliop, sgp,
> -			mapping_gfp_mask(inode->i_mapping), NULL, NULL, NULL);
> +			mapping_gfp_mask(inode->i_mapping), NULL, NULL);
>  }
>  
>  /*
> @@ -2225,7 +2224,7 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
>  	}
>  
>  	err = shmem_get_folio_gfp(inode, vmf->pgoff, &folio, SGP_CACHE,
> -				  gfp, vma, vmf, &ret);
> +				  gfp, vmf, &ret);
>  	if (err)
>  		return vmf_error(err);
>  	if (folio)
> @@ -4897,7 +4896,7 @@ struct folio *shmem_read_folio_gfp(struct address_space *mapping,
>  
>  	BUG_ON(!shmem_mapping(mapping));
>  	error = shmem_get_folio_gfp(inode, index, &folio, SGP_CACHE,
> -				  gfp, NULL, NULL, NULL);
> +				    gfp, NULL, NULL);
>  	if (error)
>  		return ERR_PTR(error);
>  
> -- 
> 2.35.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/8] shmem: factor shmem_falloc_wait() out of shmem_fault()
  2023-09-30  3:27 ` [PATCH 3/8] shmem: factor shmem_falloc_wait() out of shmem_fault() Hugh Dickins
@ 2023-10-03 13:18   ` Jan Kara
  2023-10-06  3:48     ` Hugh Dickins
  0 siblings, 1 reply; 29+ messages in thread
From: Jan Kara @ 2023-10-03 13:18 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Christian Brauner, Carlos Maiolino, Chuck Lever,
	Jan Kara, Matthew Wilcox, Johannes Weiner, Axel Rasmussen,
	linux-fsdevel, linux-kernel, linux-mm

On Fri 29-09-23 20:27:53, Hugh Dickins wrote:
> That Trinity livelock shmem_falloc avoidance block is unlikely, and a
> distraction from the proper business of shmem_fault(): separate it out.
> (This used to help compilers save stack on the fault path too, but both
> gcc and clang nowadays seem to make better choices anyway.)
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

Looking at the code I'm just wondering whether the livelock with
shmem_undo_range() couldn't be more easy to avoid by making
shmem_undo_range() always advance the index by 1 after evicting a page and
thus guaranteeing a forward progress... Because the forward progress within
find_get_entries() is guaranteed these days, it should be enough.

								Honza

> ---
>  mm/shmem.c | 126 +++++++++++++++++++++++++++++------------------------
>  1 file changed, 69 insertions(+), 57 deletions(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 824eb55671d2..5501a5bc8d8c 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2148,87 +2148,99 @@ int shmem_get_folio(struct inode *inode, pgoff_t index, struct folio **foliop,
>   * entry unconditionally - even if something else had already woken the
>   * target.
>   */
> -static int synchronous_wake_function(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
> +static int synchronous_wake_function(wait_queue_entry_t *wait,
> +			unsigned int mode, int sync, void *key)
>  {
>  	int ret = default_wake_function(wait, mode, sync, key);
>  	list_del_init(&wait->entry);
>  	return ret;
>  }
>  
> +/*
> + * Trinity finds that probing a hole which tmpfs is punching can
> + * prevent the hole-punch from ever completing: which in turn
> + * locks writers out with its hold on i_rwsem.  So refrain from
> + * faulting pages into the hole while it's being punched.  Although
> + * shmem_undo_range() does remove the additions, it may be unable to
> + * keep up, as each new page needs its own unmap_mapping_range() call,
> + * and the i_mmap tree grows ever slower to scan if new vmas are added.
> + *
> + * It does not matter if we sometimes reach this check just before the
> + * hole-punch begins, so that one fault then races with the punch:
> + * we just need to make racing faults a rare case.
> + *
> + * The implementation below would be much simpler if we just used a
> + * standard mutex or completion: but we cannot take i_rwsem in fault,
> + * and bloating every shmem inode for this unlikely case would be sad.
> + */
> +static vm_fault_t shmem_falloc_wait(struct vm_fault *vmf, struct inode *inode)
> +{
> +	struct shmem_falloc *shmem_falloc;
> +	struct file *fpin = NULL;
> +	vm_fault_t ret = 0;
> +
> +	spin_lock(&inode->i_lock);
> +	shmem_falloc = inode->i_private;
> +	if (shmem_falloc &&
> +	    shmem_falloc->waitq &&
> +	    vmf->pgoff >= shmem_falloc->start &&
> +	    vmf->pgoff < shmem_falloc->next) {
> +		wait_queue_head_t *shmem_falloc_waitq;
> +		DEFINE_WAIT_FUNC(shmem_fault_wait, synchronous_wake_function);
> +
> +		ret = VM_FAULT_NOPAGE;
> +		fpin = maybe_unlock_mmap_for_io(vmf, NULL);
> +		shmem_falloc_waitq = shmem_falloc->waitq;
> +		prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait,
> +				TASK_UNINTERRUPTIBLE);
> +		spin_unlock(&inode->i_lock);
> +		schedule();
> +
> +		/*
> +		 * shmem_falloc_waitq points into the shmem_fallocate()
> +		 * stack of the hole-punching task: shmem_falloc_waitq
> +		 * is usually invalid by the time we reach here, but
> +		 * finish_wait() does not dereference it in that case;
> +		 * though i_lock needed lest racing with wake_up_all().
> +		 */
> +		spin_lock(&inode->i_lock);
> +		finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
> +	}
> +	spin_unlock(&inode->i_lock);
> +	if (fpin) {
> +		fput(fpin);
> +		ret = VM_FAULT_RETRY;
> +	}
> +	return ret;
> +}
> +
>  static vm_fault_t shmem_fault(struct vm_fault *vmf)
>  {
> -	struct vm_area_struct *vma = vmf->vma;
> -	struct inode *inode = file_inode(vma->vm_file);
> +	struct inode *inode = file_inode(vmf->vma->vm_file);
>  	gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
>  	struct folio *folio = NULL;
> +	vm_fault_t ret = 0;
>  	int err;
> -	vm_fault_t ret = VM_FAULT_LOCKED;
>  
>  	/*
>  	 * Trinity finds that probing a hole which tmpfs is punching can
> -	 * prevent the hole-punch from ever completing: which in turn
> -	 * locks writers out with its hold on i_rwsem.  So refrain from
> -	 * faulting pages into the hole while it's being punched.  Although
> -	 * shmem_undo_range() does remove the additions, it may be unable to
> -	 * keep up, as each new page needs its own unmap_mapping_range() call,
> -	 * and the i_mmap tree grows ever slower to scan if new vmas are added.
> -	 *
> -	 * It does not matter if we sometimes reach this check just before the
> -	 * hole-punch begins, so that one fault then races with the punch:
> -	 * we just need to make racing faults a rare case.
> -	 *
> -	 * The implementation below would be much simpler if we just used a
> -	 * standard mutex or completion: but we cannot take i_rwsem in fault,
> -	 * and bloating every shmem inode for this unlikely case would be sad.
> +	 * prevent the hole-punch from ever completing: noted in i_private.
>  	 */
>  	if (unlikely(inode->i_private)) {
> -		struct shmem_falloc *shmem_falloc;
> -
> -		spin_lock(&inode->i_lock);
> -		shmem_falloc = inode->i_private;
> -		if (shmem_falloc &&
> -		    shmem_falloc->waitq &&
> -		    vmf->pgoff >= shmem_falloc->start &&
> -		    vmf->pgoff < shmem_falloc->next) {
> -			struct file *fpin;
> -			wait_queue_head_t *shmem_falloc_waitq;
> -			DEFINE_WAIT_FUNC(shmem_fault_wait, synchronous_wake_function);
> -
> -			ret = VM_FAULT_NOPAGE;
> -			fpin = maybe_unlock_mmap_for_io(vmf, NULL);
> -			if (fpin)
> -				ret = VM_FAULT_RETRY;
> -
> -			shmem_falloc_waitq = shmem_falloc->waitq;
> -			prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait,
> -					TASK_UNINTERRUPTIBLE);
> -			spin_unlock(&inode->i_lock);
> -			schedule();
> -
> -			/*
> -			 * shmem_falloc_waitq points into the shmem_fallocate()
> -			 * stack of the hole-punching task: shmem_falloc_waitq
> -			 * is usually invalid by the time we reach here, but
> -			 * finish_wait() does not dereference it in that case;
> -			 * though i_lock needed lest racing with wake_up_all().
> -			 */
> -			spin_lock(&inode->i_lock);
> -			finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
> -			spin_unlock(&inode->i_lock);
> -
> -			if (fpin)
> -				fput(fpin);
> +		ret = shmem_falloc_wait(vmf, inode);
> +		if (ret)
>  			return ret;
> -		}
> -		spin_unlock(&inode->i_lock);
>  	}
>  
> +	WARN_ON_ONCE(vmf->page != NULL);
>  	err = shmem_get_folio_gfp(inode, vmf->pgoff, &folio, SGP_CACHE,
>  				  gfp, vmf, &ret);
>  	if (err)
>  		return vmf_error(err);
> -	if (folio)
> +	if (folio) {
>  		vmf->page = folio_file_page(folio, vmf->pgoff);
> +		ret |= VM_FAULT_LOCKED;
> +	}
>  	return ret;
>  }
>  
> -- 
> 2.35.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/8] shmem: trivial tidyups, removing extra blank lines, etc
  2023-09-30  3:28 ` [PATCH 4/8] shmem: trivial tidyups, removing extra blank lines, etc Hugh Dickins
@ 2023-10-03 13:20   ` Jan Kara
  0 siblings, 0 replies; 29+ messages in thread
From: Jan Kara @ 2023-10-03 13:20 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Christian Brauner, Carlos Maiolino, Chuck Lever,
	Jan Kara, Matthew Wilcox, Johannes Weiner, Axel Rasmussen,
	linux-fsdevel, linux-kernel, linux-mm

On Fri 29-09-23 20:28:50, Hugh Dickins wrote:
> Mostly removing a few superfluous blank lines, joining short arglines,
> imposing some 80-column observance, correcting a couple of comments.
> None of it more interesting than deleting a repeated INIT_LIST_HEAD().
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>

Autumn cleaning ;). Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  mm/shmem.c | 56 ++++++++++++++++++++----------------------------------
>  1 file changed, 21 insertions(+), 35 deletions(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 5501a5bc8d8c..caee8ba841f7 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -756,7 +756,7 @@ static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo,
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
>  /*
> - * Like filemap_add_folio, but error if expected item has gone.
> + * Somewhat like filemap_add_folio, but error if expected item has gone.
>   */
>  static int shmem_add_to_page_cache(struct folio *folio,
>  				   struct address_space *mapping,
> @@ -825,7 +825,7 @@ static int shmem_add_to_page_cache(struct folio *folio,
>  }
>  
>  /*
> - * Like delete_from_page_cache, but substitutes swap for @folio.
> + * Somewhat like filemap_remove_folio, but substitutes swap for @folio.
>   */
>  static void shmem_delete_from_page_cache(struct folio *folio, void *radswap)
>  {
> @@ -887,7 +887,6 @@ unsigned long shmem_partial_swap_usage(struct address_space *mapping,
>  			cond_resched_rcu();
>  		}
>  	}
> -
>  	rcu_read_unlock();
>  
>  	return swapped << PAGE_SHIFT;
> @@ -1213,7 +1212,6 @@ static int shmem_setattr(struct mnt_idmap *idmap,
>  	if (i_uid_needs_update(idmap, attr, inode) ||
>  	    i_gid_needs_update(idmap, attr, inode)) {
>  		error = dquot_transfer(idmap, inode, attr);
> -
>  		if (error)
>  			return error;
>  	}
> @@ -2456,7 +2454,6 @@ static struct inode *__shmem_get_inode(struct mnt_idmap *idmap,
>  	if (err)
>  		return ERR_PTR(err);
>  
> -
>  	inode = new_inode(sb);
>  	if (!inode) {
>  		shmem_free_inode(sb, 0);
> @@ -2481,11 +2478,10 @@ static struct inode *__shmem_get_inode(struct mnt_idmap *idmap,
>  		shmem_set_inode_flags(inode, info->fsflags);
>  	INIT_LIST_HEAD(&info->shrinklist);
>  	INIT_LIST_HEAD(&info->swaplist);
> -	INIT_LIST_HEAD(&info->swaplist);
> -	if (sbinfo->noswap)
> -		mapping_set_unevictable(inode->i_mapping);
>  	simple_xattrs_init(&info->xattrs);
>  	cache_no_acl(inode);
> +	if (sbinfo->noswap)
> +		mapping_set_unevictable(inode->i_mapping);
>  	mapping_set_large_folios(inode->i_mapping);
>  
>  	switch (mode & S_IFMT) {
> @@ -2697,7 +2693,6 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
>  	}
>  
>  	ret = shmem_get_folio(inode, index, &folio, SGP_WRITE);
> -
>  	if (ret)
>  		return ret;
>  
> @@ -3229,8 +3224,7 @@ shmem_mknod(struct mnt_idmap *idmap, struct inode *dir,
>  	error = simple_acl_create(dir, inode);
>  	if (error)
>  		goto out_iput;
> -	error = security_inode_init_security(inode, dir,
> -					     &dentry->d_name,
> +	error = security_inode_init_security(inode, dir, &dentry->d_name,
>  					     shmem_initxattrs, NULL);
>  	if (error && error != -EOPNOTSUPP)
>  		goto out_iput;
> @@ -3259,14 +3253,11 @@ shmem_tmpfile(struct mnt_idmap *idmap, struct inode *dir,
>  	int error;
>  
>  	inode = shmem_get_inode(idmap, dir->i_sb, dir, mode, 0, VM_NORESERVE);
> -
>  	if (IS_ERR(inode)) {
>  		error = PTR_ERR(inode);
>  		goto err_out;
>  	}
> -
> -	error = security_inode_init_security(inode, dir,
> -					     NULL,
> +	error = security_inode_init_security(inode, dir, NULL,
>  					     shmem_initxattrs, NULL);
>  	if (error && error != -EOPNOTSUPP)
>  		goto out_iput;
> @@ -3303,7 +3294,8 @@ static int shmem_create(struct mnt_idmap *idmap, struct inode *dir,
>  /*
>   * Link a file..
>   */
> -static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry)
> +static int shmem_link(struct dentry *old_dentry, struct inode *dir,
> +		      struct dentry *dentry)
>  {
>  	struct inode *inode = d_inode(old_dentry);
>  	int ret = 0;
> @@ -3334,7 +3326,7 @@ static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct dentr
>  	inode_inc_iversion(dir);
>  	inc_nlink(inode);
>  	ihold(inode);	/* New dentry reference */
> -	dget(dentry);		/* Extra pinning count for the created dentry */
> +	dget(dentry);	/* Extra pinning count for the created dentry */
>  	d_instantiate(dentry, inode);
>  out:
>  	return ret;
> @@ -3354,7 +3346,7 @@ static int shmem_unlink(struct inode *dir, struct dentry *dentry)
>  					     inode_set_ctime_current(inode));
>  	inode_inc_iversion(dir);
>  	drop_nlink(inode);
> -	dput(dentry);	/* Undo the count from "create" - this does all the work */
> +	dput(dentry);	/* Undo the count from "create" - does all the work */
>  	return 0;
>  }
>  
> @@ -3464,7 +3456,6 @@ static int shmem_symlink(struct mnt_idmap *idmap, struct inode *dir,
>  
>  	inode = shmem_get_inode(idmap, dir->i_sb, dir, S_IFLNK | 0777, 0,
>  				VM_NORESERVE);
> -
>  	if (IS_ERR(inode))
>  		return PTR_ERR(inode);
>  
> @@ -3518,8 +3509,7 @@ static void shmem_put_link(void *arg)
>  	folio_put(arg);
>  }
>  
> -static const char *shmem_get_link(struct dentry *dentry,
> -				  struct inode *inode,
> +static const char *shmem_get_link(struct dentry *dentry, struct inode *inode,
>  				  struct delayed_call *done)
>  {
>  	struct folio *folio = NULL;
> @@ -3593,8 +3583,7 @@ static int shmem_fileattr_set(struct mnt_idmap *idmap,
>   * Callback for security_inode_init_security() for acquiring xattrs.
>   */
>  static int shmem_initxattrs(struct inode *inode,
> -			    const struct xattr *xattr_array,
> -			    void *fs_info)
> +			    const struct xattr *xattr_array, void *fs_info)
>  {
>  	struct shmem_inode_info *info = SHMEM_I(inode);
>  	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
> @@ -3778,7 +3767,6 @@ static struct dentry *shmem_find_alias(struct inode *inode)
>  	return alias ?: d_find_any_alias(inode);
>  }
>  
> -
>  static struct dentry *shmem_fh_to_dentry(struct super_block *sb,
>  		struct fid *fid, int fh_len, int fh_type)
>  {
> @@ -4362,8 +4350,8 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
>  	}
>  #endif /* CONFIG_TMPFS_QUOTA */
>  
> -	inode = shmem_get_inode(&nop_mnt_idmap, sb, NULL, S_IFDIR | sbinfo->mode, 0,
> -				VM_NORESERVE);
> +	inode = shmem_get_inode(&nop_mnt_idmap, sb, NULL,
> +				S_IFDIR | sbinfo->mode, 0, VM_NORESERVE);
>  	if (IS_ERR(inode)) {
>  		error = PTR_ERR(inode);
>  		goto failed;
> @@ -4666,11 +4654,9 @@ static ssize_t shmem_enabled_show(struct kobject *kobj,
>  
>  	for (i = 0; i < ARRAY_SIZE(values); i++) {
>  		len += sysfs_emit_at(buf, len,
> -				     shmem_huge == values[i] ? "%s[%s]" : "%s%s",
> -				     i ? " " : "",
> -				     shmem_format_huge(values[i]));
> +				shmem_huge == values[i] ? "%s[%s]" : "%s%s",
> +				i ? " " : "", shmem_format_huge(values[i]));
>  	}
> -
>  	len += sysfs_emit_at(buf, len, "\n");
>  
>  	return len;
> @@ -4767,8 +4753,9 @@ EXPORT_SYMBOL_GPL(shmem_truncate_range);
>  #define shmem_acct_size(flags, size)		0
>  #define shmem_unacct_size(flags, size)		do {} while (0)
>  
> -static inline struct inode *shmem_get_inode(struct mnt_idmap *idmap, struct super_block *sb, struct inode *dir,
> -					    umode_t mode, dev_t dev, unsigned long flags)
> +static inline struct inode *shmem_get_inode(struct mnt_idmap *idmap,
> +				struct super_block *sb, struct inode *dir,
> +				umode_t mode, dev_t dev, unsigned long flags)
>  {
>  	struct inode *inode = ramfs_get_inode(sb, dir, mode, dev);
>  	return inode ? inode : ERR_PTR(-ENOSPC);
> @@ -4778,8 +4765,8 @@ static inline struct inode *shmem_get_inode(struct mnt_idmap *idmap, struct supe
>  
>  /* common code */
>  
> -static struct file *__shmem_file_setup(struct vfsmount *mnt, const char *name, loff_t size,
> -				       unsigned long flags, unsigned int i_flags)
> +static struct file *__shmem_file_setup(struct vfsmount *mnt, const char *name,
> +			loff_t size, unsigned long flags, unsigned int i_flags)
>  {
>  	struct inode *inode;
>  	struct file *res;
> @@ -4798,7 +4785,6 @@ static struct file *__shmem_file_setup(struct vfsmount *mnt, const char *name, l
>  
>  	inode = shmem_get_inode(&nop_mnt_idmap, mnt->mnt_sb, NULL,
>  				S_IFREG | S_IRWXUGO, 0, flags);
> -
>  	if (IS_ERR(inode)) {
>  		shmem_unacct_size(flags, size);
>  		return ERR_CAST(inode);
> -- 
> 2.35.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/8] shmem: shmem_acct_blocks() and shmem_inode_acct_blocks()
  2023-09-30  3:30 ` [PATCH 5/8] shmem: shmem_acct_blocks() and shmem_inode_acct_blocks() Hugh Dickins
@ 2023-10-03 13:21   ` Jan Kara
  0 siblings, 0 replies; 29+ messages in thread
From: Jan Kara @ 2023-10-03 13:21 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Christian Brauner, Carlos Maiolino, Chuck Lever,
	Jan Kara, Matthew Wilcox, Johannes Weiner, Axel Rasmussen,
	linux-fsdevel, linux-kernel, linux-mm

On Fri 29-09-23 20:30:03, Hugh Dickins wrote:
> By historical accident, shmem_acct_block() and shmem_inode_acct_block()
> were never pluralized when the pages argument was added, despite their
> complements being shmem_unacct_blocks() and shmem_inode_unacct_blocks()
> all along.  It has been an irritation: fix their naming at last.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>

OK. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  mm/shmem.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index caee8ba841f7..63ba6037b23a 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -189,10 +189,10 @@ static inline int shmem_reacct_size(unsigned long flags,
>  /*
>   * ... whereas tmpfs objects are accounted incrementally as
>   * pages are allocated, in order to allow large sparse files.
> - * shmem_get_folio reports shmem_acct_block failure as -ENOSPC not -ENOMEM,
> + * shmem_get_folio reports shmem_acct_blocks failure as -ENOSPC not -ENOMEM,
>   * so that a failure on a sparse tmpfs mapping will give SIGBUS not OOM.
>   */
> -static inline int shmem_acct_block(unsigned long flags, long pages)
> +static inline int shmem_acct_blocks(unsigned long flags, long pages)
>  {
>  	if (!(flags & VM_NORESERVE))
>  		return 0;
> @@ -207,13 +207,13 @@ static inline void shmem_unacct_blocks(unsigned long flags, long pages)
>  		vm_unacct_memory(pages * VM_ACCT(PAGE_SIZE));
>  }
>  
> -static int shmem_inode_acct_block(struct inode *inode, long pages)
> +static int shmem_inode_acct_blocks(struct inode *inode, long pages)
>  {
>  	struct shmem_inode_info *info = SHMEM_I(inode);
>  	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
>  	int err = -ENOSPC;
>  
> -	if (shmem_acct_block(info->flags, pages))
> +	if (shmem_acct_blocks(info->flags, pages))
>  		return err;
>  
>  	might_sleep();	/* when quotas */
> @@ -447,7 +447,7 @@ bool shmem_charge(struct inode *inode, long pages)
>  {
>  	struct address_space *mapping = inode->i_mapping;
>  
> -	if (shmem_inode_acct_block(inode, pages))
> +	if (shmem_inode_acct_blocks(inode, pages))
>  		return false;
>  
>  	/* nrpages adjustment first, then shmem_recalc_inode() when balanced */
> @@ -1671,7 +1671,7 @@ static struct folio *shmem_alloc_and_acct_folio(gfp_t gfp, struct inode *inode,
>  		huge = false;
>  	nr = huge ? HPAGE_PMD_NR : 1;
>  
> -	err = shmem_inode_acct_block(inode, nr);
> +	err = shmem_inode_acct_blocks(inode, nr);
>  	if (err)
>  		goto failed;
>  
> @@ -2572,7 +2572,7 @@ int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
>  	int ret;
>  	pgoff_t max_off;
>  
> -	if (shmem_inode_acct_block(inode, 1)) {
> +	if (shmem_inode_acct_blocks(inode, 1)) {
>  		/*
>  		 * We may have got a page, returned -ENOENT triggering a retry,
>  		 * and now we find ourselves with -ENOMEM. Release the page, to
> -- 
> 2.35.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 6/8] shmem: move memcg charge out of shmem_add_to_page_cache()
  2023-09-30  3:31 ` [PATCH 6/8] shmem: move memcg charge out of shmem_add_to_page_cache() Hugh Dickins
@ 2023-10-03 13:28   ` Jan Kara
  0 siblings, 0 replies; 29+ messages in thread
From: Jan Kara @ 2023-10-03 13:28 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Christian Brauner, Carlos Maiolino, Chuck Lever,
	Jan Kara, Matthew Wilcox, Johannes Weiner, Axel Rasmussen,
	linux-fsdevel, linux-kernel, linux-mm

On Fri 29-09-23 20:31:27, Hugh Dickins wrote:
> Extract shmem's memcg charging out of shmem_add_to_page_cache(): it's
> misleading done there, because many calls are dealing with a swapcache
> page, whose memcg is nowadays always remembered while swapped out, then
> the charge re-levied when it's brought back into swapcache.
> 
> Temporarily move it back up to the shmem_get_folio_gfp() level, where
> the memcg was charged before v5.8; but the next commit goes on to move
> it back down to a new home.
> 
> In making this change, it becomes clear that shmem_swapin_folio() does
> not need to know the vma, just the fault mm (if any): call it fault_mm
> rather than charge_mm - let mem_cgroup_charge() decide whom to charge.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  mm/shmem.c | 68 +++++++++++++++++++++++-------------------------------
>  1 file changed, 29 insertions(+), 39 deletions(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 63ba6037b23a..0a7f7b567b80 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -146,9 +146,8 @@ static unsigned long shmem_default_max_inodes(void)
>  #endif
>  
>  static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> -			     struct folio **foliop, enum sgp_type sgp,
> -			     gfp_t gfp, struct vm_area_struct *vma,
> -			     vm_fault_t *fault_type);
> +			struct folio **foliop, enum sgp_type sgp, gfp_t gfp,
> +			struct mm_struct *fault_mm, vm_fault_t *fault_type);
>  
>  static inline struct shmem_sb_info *SHMEM_SB(struct super_block *sb)
>  {
> @@ -760,12 +759,10 @@ static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo,
>   */
>  static int shmem_add_to_page_cache(struct folio *folio,
>  				   struct address_space *mapping,
> -				   pgoff_t index, void *expected, gfp_t gfp,
> -				   struct mm_struct *charge_mm)
> +				   pgoff_t index, void *expected, gfp_t gfp)
>  {
>  	XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio));
>  	long nr = folio_nr_pages(folio);
> -	int error;
>  
>  	VM_BUG_ON_FOLIO(index != round_down(index, nr), folio);
>  	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> @@ -776,16 +773,7 @@ static int shmem_add_to_page_cache(struct folio *folio,
>  	folio->mapping = mapping;
>  	folio->index = index;
>  
> -	if (!folio_test_swapcache(folio)) {
> -		error = mem_cgroup_charge(folio, charge_mm, gfp);
> -		if (error) {
> -			if (folio_test_pmd_mappable(folio)) {
> -				count_vm_event(THP_FILE_FALLBACK);
> -				count_vm_event(THP_FILE_FALLBACK_CHARGE);
> -			}
> -			goto error;
> -		}
> -	}
> +	gfp &= GFP_RECLAIM_MASK;
>  	folio_throttle_swaprate(folio, gfp);
>  
>  	do {
> @@ -813,15 +801,12 @@ static int shmem_add_to_page_cache(struct folio *folio,
>  	} while (xas_nomem(&xas, gfp));
>  
>  	if (xas_error(&xas)) {
> -		error = xas_error(&xas);
> -		goto error;
> +		folio->mapping = NULL;
> +		folio_ref_sub(folio, nr);
> +		return xas_error(&xas);
>  	}
>  
>  	return 0;
> -error:
> -	folio->mapping = NULL;
> -	folio_ref_sub(folio, nr);
> -	return error;
>  }
>  
>  /*
> @@ -1324,10 +1309,8 @@ static int shmem_unuse_swap_entries(struct inode *inode,
>  
>  		if (!xa_is_value(folio))
>  			continue;
> -		error = shmem_swapin_folio(inode, indices[i],
> -					  &folio, SGP_CACHE,
> -					  mapping_gfp_mask(mapping),
> -					  NULL, NULL);
> +		error = shmem_swapin_folio(inode, indices[i], &folio, SGP_CACHE,
> +					mapping_gfp_mask(mapping), NULL, NULL);
>  		if (error == 0) {
>  			folio_unlock(folio);
>  			folio_put(folio);
> @@ -1810,12 +1793,11 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
>   */
>  static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>  			     struct folio **foliop, enum sgp_type sgp,
> -			     gfp_t gfp, struct vm_area_struct *vma,
> +			     gfp_t gfp, struct mm_struct *fault_mm,
>  			     vm_fault_t *fault_type)
>  {
>  	struct address_space *mapping = inode->i_mapping;
>  	struct shmem_inode_info *info = SHMEM_I(inode);
> -	struct mm_struct *charge_mm = vma ? vma->vm_mm : NULL;
>  	struct swap_info_struct *si;
>  	struct folio *folio = NULL;
>  	swp_entry_t swap;
> @@ -1843,7 +1825,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>  		if (fault_type) {
>  			*fault_type |= VM_FAULT_MAJOR;
>  			count_vm_event(PGMAJFAULT);
> -			count_memcg_event_mm(charge_mm, PGMAJFAULT);
> +			count_memcg_event_mm(fault_mm, PGMAJFAULT);
>  		}
>  		/* Here we actually start the io */
>  		folio = shmem_swapin(swap, gfp, info, index);
> @@ -1880,8 +1862,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>  	}
>  
>  	error = shmem_add_to_page_cache(folio, mapping, index,
> -					swp_to_radix_entry(swap), gfp,
> -					charge_mm);
> +					swp_to_radix_entry(swap), gfp);
>  	if (error)
>  		goto failed;
>  
> @@ -1929,7 +1910,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
>  	struct address_space *mapping = inode->i_mapping;
>  	struct shmem_inode_info *info = SHMEM_I(inode);
>  	struct shmem_sb_info *sbinfo;
> -	struct mm_struct *charge_mm;
> +	struct mm_struct *fault_mm;
>  	struct folio *folio;
>  	pgoff_t hindex;
>  	gfp_t huge_gfp;
> @@ -1946,7 +1927,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
>  	}
>  
>  	sbinfo = SHMEM_SB(inode->i_sb);
> -	charge_mm = vma ? vma->vm_mm : NULL;
> +	fault_mm = vma ? vma->vm_mm : NULL;
>  
>  	folio = filemap_get_entry(mapping, index);
>  	if (folio && vma && userfaultfd_minor(vma)) {
> @@ -1958,7 +1939,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
>  
>  	if (xa_is_value(folio)) {
>  		error = shmem_swapin_folio(inode, index, &folio,
> -					  sgp, gfp, vma, fault_type);
> +					   sgp, gfp, fault_mm, fault_type);
>  		if (error == -EEXIST)
>  			goto repeat;
>  
> @@ -2044,9 +2025,16 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
>  	if (sgp == SGP_WRITE)
>  		__folio_set_referenced(folio);
>  
> -	error = shmem_add_to_page_cache(folio, mapping, hindex,
> -					NULL, gfp & GFP_RECLAIM_MASK,
> -					charge_mm);
> +	error = mem_cgroup_charge(folio, fault_mm, gfp);
> +	if (error) {
> +		if (folio_test_pmd_mappable(folio)) {
> +			count_vm_event(THP_FILE_FALLBACK);
> +			count_vm_event(THP_FILE_FALLBACK_CHARGE);
> +		}
> +		goto unacct;
> +	}
> +
> +	error = shmem_add_to_page_cache(folio, mapping, hindex, NULL, gfp);
>  	if (error)
>  		goto unacct;
>  
> @@ -2644,8 +2632,10 @@ int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
>  	if (unlikely(pgoff >= max_off))
>  		goto out_release;
>  
> -	ret = shmem_add_to_page_cache(folio, mapping, pgoff, NULL,
> -				      gfp & GFP_RECLAIM_MASK, dst_vma->vm_mm);
> +	ret = mem_cgroup_charge(folio, dst_vma->vm_mm, gfp);
> +	if (ret)
> +		goto out_release;
> +	ret = shmem_add_to_page_cache(folio, mapping, pgoff, NULL, gfp);
>  	if (ret)
>  		goto out_release;
>  
> -- 
> 2.35.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread
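
Pulling the two relevant hunks above together, the order that results on the allocation path (taken from the patch itself, shown here only as a reading aid): the caller charges the memcg first, and only then inserts into the page cache, so an already-charged swapcache folio no longer funnels through a charge inside shmem_add_to_page_cache(); the userfaultfd copy path in shmem_mfill_atomic_pte() gets the same treatment.

	error = mem_cgroup_charge(folio, fault_mm, gfp);
	if (error) {
		if (folio_test_pmd_mappable(folio)) {
			count_vm_event(THP_FILE_FALLBACK);
			count_vm_event(THP_FILE_FALLBACK_CHARGE);
		}
		goto unacct;
	}

	error = shmem_add_to_page_cache(folio, mapping, hindex, NULL, gfp);
	if (error)
		goto unacct;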

* Re: [PATCH 8/8] shmem,percpu_counter: add _limited_add(fbc, limit, amount)
  2023-09-30  3:42 ` [PATCH 8/8] shmem,percpu_counter: add _limited_add(fbc, limit, amount) Hugh Dickins
@ 2023-10-04 15:32   ` Jan Kara
  2023-10-04 23:10   ` Dave Chinner
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 29+ messages in thread
From: Jan Kara @ 2023-10-04 15:32 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Tim Chen, Dave Chinner, Darrick J. Wong,
	Christian Brauner, Carlos Maiolino, Chuck Lever, Jan Kara,
	Matthew Wilcox, Johannes Weiner, Axel Rasmussen, linux-fsdevel,
	linux-kernel, linux-mm

On Fri 29-09-23 20:42:45, Hugh Dickins wrote:
> Percpu counter's compare and add are separate functions: without locking
> around them (which would defeat their purpose), it has been possible to
> overflow the intended limit.  Imagine all the other CPUs fallocating
> tmpfs huge pages to the limit, in between this CPU's compare and its add.
> 
> I have not seen reports of that happening; but tmpfs's recent addition
> of dquot_alloc_block_nodirty() in between the compare and the add makes
> it even more likely, and I'd be uncomfortable to leave it unfixed.
> 
> Introduce percpu_counter_limited_add(fbc, limit, amount) to prevent it.
> 
> I believe this implementation is correct, and slightly more efficient
> than the combination of compare and add (taking the lock once rather
> than twice when nearing full - the last 128MiB of a tmpfs volume on a
> machine with 128 CPUs and 4KiB pages); but it does beg for a better
> design - when nearing full, there is no new batching, but the costly
> percpu counter sum across CPUs still has to be done, while locked.
> 
> Follow __percpu_counter_sum()'s example, including cpu_dying_mask as
> well as cpu_online_mask: but shouldn't __percpu_counter_compare() and
> __percpu_counter_limited_add() then be adding a num_dying_cpus() to
> num_online_cpus(), when they calculate the maximum which could be held
> across CPUs?  But the times when it matters would be vanishingly rare.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: Tim Chen <tim.c.chen@intel.com>
> Cc: Dave Chinner <dchinner@redhat.com>
> Cc: Darrick J. Wong <djwong@kernel.org>

Looks good to me. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 8/8] shmem,percpu_counter: add _limited_add(fbc, limit, amount)
  2023-09-30  3:42 ` [PATCH 8/8] shmem,percpu_counter: add _limited_add(fbc, limit, amount) Hugh Dickins
  2023-10-04 15:32   ` Jan Kara
@ 2023-10-04 23:10   ` Dave Chinner
  2023-10-06  5:35     ` Hugh Dickins
  2023-10-05 16:50   ` [PATCH 8/8] shmem,percpu_counter: add _limited_add(fbc, limit, amount) Chen, Tim C
  2024-05-25  6:00   ` Mateusz Guzik
  3 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2023-10-04 23:10 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Tim Chen, Dave Chinner, Darrick J. Wong,
	Christian Brauner, Carlos Maiolino, Chuck Lever, Jan Kara,
	Matthew Wilcox, Johannes Weiner, Axel Rasmussen, linux-fsdevel,
	linux-kernel, linux-mm

On Fri, Sep 29, 2023 at 08:42:45PM -0700, Hugh Dickins wrote:
> Percpu counter's compare and add are separate functions: without locking
> around them (which would defeat their purpose), it has been possible to
> overflow the intended limit.  Imagine all the other CPUs fallocating
> tmpfs huge pages to the limit, in between this CPU's compare and its add.
> 
> I have not seen reports of that happening; but tmpfs's recent addition
> of dquot_alloc_block_nodirty() in between the compare and the add makes
> it even more likely, and I'd be uncomfortable to leave it unfixed.
> 
> Introduce percpu_counter_limited_add(fbc, limit, amount) to prevent it.
> 
> I believe this implementation is correct, and slightly more efficient
> than the combination of compare and add (taking the lock once rather
> than twice when nearing full - the last 128MiB of a tmpfs volume on a
> machine with 128 CPUs and 4KiB pages); but it does beg for a better
> design - when nearing full, there is no new batching, but the costly
> percpu counter sum across CPUs still has to be done, while locked.
> 
> Follow __percpu_counter_sum()'s example, including cpu_dying_mask as
> well as cpu_online_mask: but shouldn't __percpu_counter_compare() and
> __percpu_counter_limited_add() then be adding a num_dying_cpus() to
> num_online_cpus(), when they calculate the maximum which could be held
> across CPUs?  But the times when it matters would be vanishingly rare.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: Tim Chen <tim.c.chen@intel.com>
> Cc: Dave Chinner <dchinner@redhat.com>
> Cc: Darrick J. Wong <djwong@kernel.org>
> ---
> Tim, Dave, Darrick: I didn't want to waste your time on patches 1-7,
> which are just internal to shmem, and do not affect this patch (which
> applies to v6.6-rc and linux-next as is): but want to run this by you.

Hmmmm. IIUC, this only works for addition that approaches the limit
from below?

So if we are approaching the limit from above (i.e. add of a
negative amount, limit is zero) then this code doesn't work the same
as the open-coded compare+add operation would?

Hence I think this looks like a "add if result is less than"
operation, which is distinct from then "add if result is greater
than" operation that we use this same pattern for in XFS and ext4.
Perhaps a better name is in order?

I'm also not a great fan of having two
similar-but-not-quite-the-same implementations for the two
comparisons, but unless we decide to convert the XFS slow path to
this, it doesn't matter that much at the moment....

Implementation seems OK at a quick glance, though.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread
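
A sketch of the two shapes being contrasted above (counter and variable names here are hypothetical, not taken from any patch in this thread):

	/* "add if result is less than": climb toward a ceiling from below,
	 * as the tmpfs hunk does with used_blocks against max_blocks */
	if (!percpu_counter_limited_add(&used, ceiling, nr))
		return -ENOSPC;

	/* "add if result is greater than": descend toward a floor (often 0)
	 * from above, e.g. taking nr out of a free-space counter; this is
	 * the open-coded compare+add shape, which the new helper does not
	 * cover as-is */
	if (percpu_counter_compare(&free, nr) < 0)
		return -ENOSPC;
	percpu_counter_add(&free, -nr);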

* RE: [PATCH 8/8] shmem,percpu_counter: add _limited_add(fbc, limit, amount)
  2023-09-30  3:42 ` [PATCH 8/8] shmem,percpu_counter: add _limited_add(fbc, limit, amount) Hugh Dickins
  2023-10-04 15:32   ` Jan Kara
  2023-10-04 23:10   ` Dave Chinner
@ 2023-10-05 16:50   ` Chen, Tim C
  2023-10-06  5:42     ` Hugh Dickins
  2024-05-25  6:00   ` Mateusz Guzik
  3 siblings, 1 reply; 29+ messages in thread
From: Chen, Tim C @ 2023-10-05 16:50 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton
  Cc: Dave Chinner, Darrick J. Wong, Christian Brauner,
	Carlos Maiolino, Chuck Lever, Jan Kara, Matthew Wilcox,
	Johannes Weiner, Axel Rasmussen, linux-fsdevel, linux-kernel,
	linux-mm

>Signed-off-by: Hugh Dickins <hughd@google.com>
>Cc: Tim Chen <tim.c.chen@intel.com>
>Cc: Dave Chinner <dchinner@redhat.com>
>Cc: Darrick J. Wong <djwong@kernel.org>
>---
>Tim, Dave, Darrick: I didn't want to waste your time on patches 1-7, which are
>just internal to shmem, and do not affect this patch (which applies to v6.6-rc
>and linux-next as is): but want to run this by you.
>
> include/linux/percpu_counter.h | 23 +++++++++++++++
> lib/percpu_counter.c           | 53 ++++++++++++++++++++++++++++++++++
> mm/shmem.c                     | 10 +++----
> 3 files changed, 81 insertions(+), 5 deletions(-)
>
>diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
>index d01351b1526f..8cb7c071bd5c 100644
>--- a/include/linux/percpu_counter.h
>+++ b/include/linux/percpu_counter.h
>@@ -57,6 +57,8 @@ void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount,
> 			      s32 batch);
> s64 __percpu_counter_sum(struct percpu_counter *fbc);
> int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch);
>+bool __percpu_counter_limited_add(struct percpu_counter *fbc, s64 limit,
>+				  s64 amount, s32 batch);
> void percpu_counter_sync(struct percpu_counter *fbc);
>
> static inline int percpu_counter_compare(struct percpu_counter *fbc, s64 rhs)
>@@ -69,6 +71,13 @@ static inline void percpu_counter_add(struct percpu_counter *fbc, s64 amount)
> 	percpu_counter_add_batch(fbc, amount, percpu_counter_batch);
> }
>
>+static inline bool
>+percpu_counter_limited_add(struct percpu_counter *fbc, s64 limit, s64 amount)
>+{
>+	return __percpu_counter_limited_add(fbc, limit, amount,
>+					    percpu_counter_batch);
>+}
>+
> /*
>  * With percpu_counter_add_local() and percpu_counter_sub_local(), counts
>  * are accumulated in local per cpu counter and not in fbc->count until
>@@ -185,6 +194,20 @@ percpu_counter_add(struct percpu_counter *fbc, s64 amount)
> 	local_irq_restore(flags);
> }
>
>+static inline bool
>+percpu_counter_limited_add(struct percpu_counter *fbc, s64 limit, s64 amount)
>+{
>+	unsigned long flags;
>+	s64 count;
>+
>+	local_irq_save(flags);
>+	count = fbc->count + amount;
>+	if (count <= limit)
>+		fbc->count = count;
>+	local_irq_restore(flags);
>+	return count <= limit;
>+}
>+
> /* non-SMP percpu_counter_add_local is the same with percpu_counter_add */
> static inline void
> percpu_counter_add_local(struct percpu_counter *fbc, s64 amount)
>diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
>index 9073430dc865..58a3392f471b 100644
>--- a/lib/percpu_counter.c
>+++ b/lib/percpu_counter.c
>@@ -278,6 +278,59 @@ int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)
> }
> EXPORT_SYMBOL(__percpu_counter_compare);
>
>+/*
>+ * Compare counter, and add amount if the total is within limit.
>+ * Return true if amount was added, false if it would exceed limit.
>+ */
>+bool __percpu_counter_limited_add(struct percpu_counter *fbc,
>+				  s64 limit, s64 amount, s32 batch)
>+{
>+	s64 count;
>+	s64 unknown;
>+	unsigned long flags;
>+	bool good;
>+
>+	if (amount > limit)
>+		return false;
>+
>+	local_irq_save(flags);
>+	unknown = batch * num_online_cpus();
>+	count = __this_cpu_read(*fbc->counters);
>+
>+	/* Skip taking the lock when safe */
>+	if (abs(count + amount) <= batch &&
>+	    fbc->count + unknown <= limit) {
>+		this_cpu_add(*fbc->counters, amount);
>+		local_irq_restore(flags);
>+		return true;
>+	}
>+
>+	raw_spin_lock(&fbc->lock);
>+	count = fbc->count + amount;
>+

Perhaps we can fast-path the case where we are sure to
exceed the limit?

if (fbc->count + amount - unknown > limit)
	return false;
 
Tim

>+	/* Skip percpu_counter_sum() when safe */
>+	if (count + unknown > limit) {
>+		s32 *pcount;
>+		int cpu;
>+
>+		for_each_cpu_or(cpu, cpu_online_mask, cpu_dying_mask) {
>+			pcount = per_cpu_ptr(fbc->counters, cpu);
>+			count += *pcount;
>+		}
>+	}
>+
>+	good = count <= limit;
>+	if (good) {
>+		count = __this_cpu_read(*fbc->counters);
>+		fbc->count += count + amount;
>+		__this_cpu_sub(*fbc->counters, count);
>+	}
>+
>+	raw_spin_unlock(&fbc->lock);
>+	local_irq_restore(flags);
>+	return good;
>+}
>+
> static int __init percpu_counter_startup(void)
> {
> 	int ret;
>diff --git a/mm/shmem.c b/mm/shmem.c
>index 4f4ab26bc58a..7cb72c747954 100644
>--- a/mm/shmem.c
>+++ b/mm/shmem.c
>@@ -217,15 +217,15 @@ static int shmem_inode_acct_blocks(struct inode *inode, long pages)
>
> 	might_sleep();	/* when quotas */
> 	if (sbinfo->max_blocks) {
>-		if (percpu_counter_compare(&sbinfo->used_blocks,
>-					   sbinfo->max_blocks - pages) > 0)
>+		if (!percpu_counter_limited_add(&sbinfo->used_blocks,
>+						sbinfo->max_blocks, pages))
> 			goto unacct;
>
> 		err = dquot_alloc_block_nodirty(inode, pages);
>-		if (err)
>+		if (err) {
>+			percpu_counter_sub(&sbinfo->used_blocks, pages);
> 			goto unacct;
>-
>-		percpu_counter_add(&sbinfo->used_blocks, pages);
>+		}
> 	} else {
> 		err = dquot_alloc_block_nodirty(inode, pages);
> 		if (err)
>--
>2.35.3


^ permalink raw reply	[flat|nested] 29+ messages in thread
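
A sketch of how that early-out might slot into the locked section quoted above, with the unlock and IRQ restore that a two-line fragment necessarily leaves out; count and unknown are as already computed in the function, and the same rough batch * num_online_cpus() bound is kept:

	raw_spin_lock(&fbc->lock);
	count = fbc->count + amount;

	/* Fail fast: even if every percpu delta were as negative as the
	 * batch bound allows, the total would still exceed the limit */
	if (count - unknown > limit) {
		raw_spin_unlock(&fbc->lock);
		local_irq_restore(flags);
		return false;
	}

	/* otherwise fall through to the percpu_counter_sum() path */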

* Re: [PATCH 3/8] shmem: factor shmem_falloc_wait() out of shmem_fault()
  2023-10-03 13:18   ` Jan Kara
@ 2023-10-06  3:48     ` Hugh Dickins
  2023-10-06 11:01       ` Jan Kara
  0 siblings, 1 reply; 29+ messages in thread
From: Hugh Dickins @ 2023-10-06  3:48 UTC (permalink / raw)
  To: Jan Kara
  Cc: Hugh Dickins, Andrew Morton, Christian Brauner, Carlos Maiolino,
	Chuck Lever, Matthew Wilcox, Johannes Weiner, Axel Rasmussen,
	Vlastimil Babka, Sasha Levin, linux-fsdevel, linux-kernel,
	linux-mm

On Tue, 3 Oct 2023, Jan Kara wrote:
> On Fri 29-09-23 20:27:53, Hugh Dickins wrote:
> > That Trinity livelock shmem_falloc avoidance block is unlikely, and a
> > distraction from the proper business of shmem_fault(): separate it out.
> > (This used to help compilers save stack on the fault path too, but both
> > gcc and clang nowadays seem to make better choices anyway.)
> > 
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> 
> Looks good. Feel free to add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>

Thanks a lot for all these reviews, Jan.  (And I particularly enjoyed
your "Autumn cleaning" remark: sweeping up the leaves, I've been glad
to have "Autumn Almanac" running through my head since reading that.)

> 
> Looking at the code I'm just wondering whether the livelock with
> shmem_undo_range() couldn't be more easy to avoid by making
> shmem_undo_range() always advance the index by 1 after evicting a page and
> thus guaranteeing a forward progress... Because the forward progress within
> find_get_entries() is guaranteed these days, it should be enough.

I'm not sure that I understand your "advance the index by 1" comment.
Since the "/* If all gone or hole-punch or unfalloc, we're done */"
change went in, I believe shmem_undo_range() does make guaranteed
forward progress; but its forward progress is not enough.

I would love to delete all that shmem_falloc_wait() strangeness;
and your comment excited me to look, hey, can we just delete all that
stuff now, instead of shifting it aside?  That would be much nicer.

And if I'd replied to you yesterday, I'd have been saying yes we can.
But that's because I hadn't got far enough through re-reading the
various July 2014 3.16-rc mail threads.  I had been excited to find
myself posting a revert of the patch; before reaching Sasha's later
livelock which ended up with "belt and braces" retaining the
shmem_falloc wait while adding the "If all gone or hole-punch" mod.

https://marc.info/?l=linux-kernel&m=140487864819409&w=2
though that thread did not resolve, and morphed into lockdep moans.

So I've reverted to my usual position: that it's conceivable that
something has changed meanwhile, to make that Trinity livelock no
longer an issue (in particular, i_mmap_mutex changed to i_mmap_rwsem,
and much later unmap_mapping_range() changed to only take it for read:
but though that change gives hope, I suspect it would turn out to be
ineffectual against the livelock); but that would have to be proved.

If there's someone who can re-demonstrate Sasha's Trinity livelock
on 3.16-with-shmem-falloc-wait-disabled, or re-demonstrate it on any
later release-with-shmem-falloc-wait-disabled, but demonstrate that
the livelock does not occur on 6.6-rc-with-shmem-falloc-wait-disabled
(or that plus some simple adjustment of hacker's choosing): that would
be great news, and we could delete all this - though probably not
without bisecting to satisfy ourselves on what was the crucial change.

But I never got around to running up Trinity to work on it originally,
nor in the years since, nor do I expect to soon.  (Vlastimil had a
good simple technique for demonstrating a part of the problem, but
fixing that part turned out not fix the whole issue with Trinity.)

Hugh

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 8/8] shmem,percpu_counter: add _limited_add(fbc, limit, amount)
  2023-10-04 23:10   ` Dave Chinner
@ 2023-10-06  5:35     ` Hugh Dickins
  2023-10-09  0:15       ` Dave Chinner
  0 siblings, 1 reply; 29+ messages in thread
From: Hugh Dickins @ 2023-10-06  5:35 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Hugh Dickins, Andrew Morton, Tim Chen, Dave Chinner,
	Darrick J. Wong, Christian Brauner, Carlos Maiolino, Chuck Lever,
	Jan Kara, Matthew Wilcox, Johannes Weiner, Axel Rasmussen,
	linux-fsdevel, linux-kernel, linux-mm

On Thu, 5 Oct 2023, Dave Chinner wrote:
> On Fri, Sep 29, 2023 at 08:42:45PM -0700, Hugh Dickins wrote:
> > Percpu counter's compare and add are separate functions: without locking
> > around them (which would defeat their purpose), it has been possible to
> > overflow the intended limit.  Imagine all the other CPUs fallocating
> > tmpfs huge pages to the limit, in between this CPU's compare and its add.
> > 
> > I have not seen reports of that happening; but tmpfs's recent addition
> > of dquot_alloc_block_nodirty() in between the compare and the add makes
> > it even more likely, and I'd be uncomfortable to leave it unfixed.
> > 
> > Introduce percpu_counter_limited_add(fbc, limit, amount) to prevent it.
> > 
> > I believe this implementation is correct, and slightly more efficient
> > than the combination of compare and add (taking the lock once rather
> > than twice when nearing full - the last 128MiB of a tmpfs volume on a
> > machine with 128 CPUs and 4KiB pages); but it does beg for a better
> > design - when nearing full, there is no new batching, but the costly
> > percpu counter sum across CPUs still has to be done, while locked.
> > 
> > Follow __percpu_counter_sum()'s example, including cpu_dying_mask as
> > well as cpu_online_mask: but shouldn't __percpu_counter_compare() and
> > __percpu_counter_limited_add() then be adding a num_dying_cpus() to
> > num_online_cpus(), when they calculate the maximum which could be held
> > across CPUs?  But the times when it matters would be vanishingly rare.
> > 
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > Cc: Tim Chen <tim.c.chen@intel.com>
> > Cc: Dave Chinner <dchinner@redhat.com>
> > Cc: Darrick J. Wong <djwong@kernel.org>
> > ---
> > Tim, Dave, Darrick: I didn't want to waste your time on patches 1-7,
> > which are just internal to shmem, and do not affect this patch (which
> > applies to v6.6-rc and linux-next as is): but want to run this by you.
> 
> Hmmmm. IIUC, this only works for addition that approaches the limit
> from below?

That's certainly how I was thinking about it, and what I need for tmpfs.
Precisely what its limitations (haha) are, I'll have to take care to
spell out.

(IIRC - it's a while since I wrote it - it can be used for subtraction,
but goes the very slow way when it could go the fast way - uncompared
percpu_counter_sub() much better for that.  You might be proposing that
a tweak could adjust it to going the fast way when coming down from the
"limit", but going the slow way as it approaches 0 - that would be neat,
but I've not yet looked into whether it's feasibly done.)

> 
> So if we are approaching the limit from above (i.e. add of a
> negative amount, limit is zero) then this code doesn't work the same
> as the open-coded compare+add operation would?

To it and to me, a limit of 0 means nothing positive can be added
(and it immediately returns false for that case); and adding anything
negative would be an error since the positive would not have been allowed.

Would a negative limit have any use?

It's definitely not allowing all the possibilities that you could arrange
with a separate compare and add; whether it's ruling out some useful
possibilities to which it can easily be generalized, I'm not sure.

Well worth a look - but it'll be easier for me to break it than get
it right, so I might just stick to adding some comments.

I might find that actually I prefer your way round: getting slower
as approaching 0, without any need for specifying a limit??  That the
tmpfs case pushed it in this direction, when it's better reversed?  Or
that might be an embarrassing delusion which I'll regret having mentioned.

> 
> Hence I think this looks like a "add if result is less than"
> operation, which is distinct from then "add if result is greater
> than" operation that we use this same pattern for in XFS and ext4.
> Perhaps a better name is in order?

The name still seems good to me, but a comment above it on its
assumptions/limitations well worth adding.

I didn't find a percpu_counter_compare() in ext4, and haven't got
far yet with understanding the XFS ones: tomorrow...

> 
> I'm also not a great fan of having two
> similar-but-not-quite-the-same implementations for the two
> comparisons, but unless we decide to convert the XFs slow path to
> this it doesn't matter that much at the moment....
> 
> Implementation seems OK at a quick glance, though.

Thanks!

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread
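
To pin the boundary cases discussed above against the implementation quoted earlier (the up-front "if (amount > limit) return false;" and the "count <= limit" test under the lock), a few hypothetical calls, with fbc assumed initialized elsewhere:

	/* limit 0: nothing positive can ever be added */
	percpu_counter_limited_add(&fbc, 0, 1);		/* returns false */

	/* exactly reaching the limit is allowed: total + amount <= limit */
	percpu_counter_limited_add(&fbc, 100, 100);	/* true if total was 0 */

	/* a negative amount is accepted, but near the limit it always takes
	 * the slow summing path: plain percpu_counter_sub() is the better
	 * tool for going downwards */
	percpu_counter_limited_add(&fbc, 100, -1);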

* RE: [PATCH 8/8] shmem,percpu_counter: add _limited_add(fbc, limit, amount)
  2023-10-05 16:50   ` [PATCH 8/8] shmem,percpu_counter: add _limited_add(fbc, limit, amount) Chen, Tim C
@ 2023-10-06  5:42     ` Hugh Dickins
  2023-10-06 22:59       ` Dennis Zhou
  0 siblings, 1 reply; 29+ messages in thread
From: Hugh Dickins @ 2023-10-06  5:42 UTC (permalink / raw)
  To: Chen, Tim C
  Cc: Hugh Dickins, Andrew Morton, Dave Chinner, Darrick J. Wong,
	Christian Brauner, Carlos Maiolino, Chuck Lever, Jan Kara,
	Matthew Wilcox, Johannes Weiner, Axel Rasmussen, linux-fsdevel,
	linux-kernel, linux-mm

On Thu, 5 Oct 2023, Chen, Tim C wrote:

> >--- a/lib/percpu_counter.c
> >+++ b/lib/percpu_counter.c
> >@@ -278,6 +278,59 @@ int __percpu_counter_compare(struct
> >percpu_counter *fbc, s64 rhs, s32 batch)  }
> >EXPORT_SYMBOL(__percpu_counter_compare);
> >
> >+/*
> >+ * Compare counter, and add amount if the total is within limit.
> >+ * Return true if amount was added, false if it would exceed limit.
> >+ */
> >+bool __percpu_counter_limited_add(struct percpu_counter *fbc,
> >+				  s64 limit, s64 amount, s32 batch) {
> >+	s64 count;
> >+	s64 unknown;
> >+	unsigned long flags;
> >+	bool good;
> >+
> >+	if (amount > limit)
> >+		return false;
> >+
> >+	local_irq_save(flags);
> >+	unknown = batch * num_online_cpus();
> >+	count = __this_cpu_read(*fbc->counters);
> >+
> >+	/* Skip taking the lock when safe */
> >+	if (abs(count + amount) <= batch &&
> >+	    fbc->count + unknown <= limit) {
> >+		this_cpu_add(*fbc->counters, amount);
> >+		local_irq_restore(flags);
> >+		return true;
> >+	}
> >+
> >+	raw_spin_lock(&fbc->lock);
> >+	count = fbc->count + amount;
> >+
> 
> Perhaps we can fast path the case where for sure
> we will exceed limit? 
> 
> if (fbc->count + amount - unknown > limit)
> 	return false;

Thanks, that sounds reasonable: I'll try to add something like that -
but haven't thought about it carefully enough yet (too easy for me
to overlook some negative case which messes everything up).

Hugh

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/8] shmem: factor shmem_falloc_wait() out of shmem_fault()
  2023-10-06  3:48     ` Hugh Dickins
@ 2023-10-06 11:01       ` Jan Kara
  0 siblings, 0 replies; 29+ messages in thread
From: Jan Kara @ 2023-10-06 11:01 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Jan Kara, Andrew Morton, Christian Brauner, Carlos Maiolino,
	Chuck Lever, Matthew Wilcox, Johannes Weiner, Axel Rasmussen,
	Vlastimil Babka, Sasha Levin, linux-fsdevel, linux-kernel,
	linux-mm

On Thu 05-10-23 20:48:00, Hugh Dickins wrote:
> On Tue, 3 Oct 2023, Jan Kara wrote:
> > On Fri 29-09-23 20:27:53, Hugh Dickins wrote:
> > > That Trinity livelock shmem_falloc avoidance block is unlikely, and a
> > > distraction from the proper business of shmem_fault(): separate it out.
> > > (This used to help compilers save stack on the fault path too, but both
> > > gcc and clang nowadays seem to make better choices anyway.)
> > > 
> > > Signed-off-by: Hugh Dickins <hughd@google.com>
> > 
> > Looks good. Feel free to add:
> > 
> > Reviewed-by: Jan Kara <jack@suse.cz>
> 
> Thanks a lot for all these reviews, Jan.  (And I particularly enjoyed
> your "Autumn cleaning" remark: sweeping up the leaves, I've been glad
> to have "Autumn Almanac" running through my head since reading that.)

:-) You have widened my musical horizon.

> > Looking at the code I'm just wondering whether the livelock with
> > shmem_undo_range() couldn't be more easily avoided by making
> > shmem_undo_range() always advance the index by 1 after evicting a page and
> > thus guaranteeing forward progress... Because the forward progress within
> > find_get_entries() is guaranteed these days, it should be enough.
> 
> I'm not sure that I understand your "advance the index by 1" comment.
> Since the "/* If all gone or hole-punch or unfalloc, we're done */"
> change went in, I believe shmem_undo_range() does make guaranteed
> forward progress; but its forward progress is not enough.

Right, I have missed that retry when glancing over the code. And the
"advance the index by 1" was also wrong because find_get_entries() already
does it these days.

> I would love to delete all that shmem_falloc_wait() strangeness;
> and your comment excited me to look, hey, can we just delete all that
> stuff now, instead of shifting it aside?  That would be much nicer.

Well, even if you decided to keep the synchronization, what you could do
these days is use inode->i_mapping->invalidate_lock for it instead of
your home-grown solution (like all other filesystems do for this kind of
race). If you don't want to pay the cost of rwsem acquisition in the
fault fast path, you could take it only when a hole-punch is running, as
you do currently.
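
For illustration, the fault side of such a scheme might look something
like the sketch below - shmem_fault_hits_falloc() is a hypothetical
check, and none of this is code from the series:

/*
 * Sketch only, not from this series: assumes the hole-punching path
 * takes filemap_invalidate_lock() for write around its truncation, and
 * that shmem_fault_hits_falloc() (hypothetical) detects a fallocate in
 * progress over the faulting index.  The fault path then only pays for
 * the rwsem when a hole-punch is actually running.
 */
static void shmem_wait_for_hole_punch(struct inode *inode, pgoff_t index)
{
	struct address_space *mapping = inode->i_mapping;

	if (!shmem_fault_hits_falloc(inode, index))
		return;
	/* Hole punch holds invalidate_lock for write: this waits for it */
	filemap_invalidate_lock_shared(mapping);
	filemap_invalidate_unlock_shared(mapping);
}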

> And if I'd replied to you yesterday, I'd have been saying yes we can.
> But that's because I hadn't got far enough through re-reading the
> various July 2014 3.16-rc mail threads.  I had been excited to find
> myself posting a revert of the patch; before reaching Sasha's later
> livelock which ended up with "belt and braces" retaining the
> shmem_falloc wait while adding the "If all gone or hole-punch" mod.
> 
> https://marc.info/?l=linux-kernel&m=140487864819409&w=2
> though that thread did not resolve, and morphed into lockdep moans.
> 
> So I've reverted to my usual position: that it's conceivable that
> something has changed meanwhile, to make that Trinity livelock no
> longer an issue (in particular, i_mmap_mutex changed to i_mmap_rwsem,
> and much later unmap_mapping_range() changed to only take it for read:
> but though that change gives hope, I suspect it would turn out to be
> ineffectual against the livelock); but that would have to be proved.
> 
> If there's someone who can re-demonstrate Sasha's Trinity livelock
> on 3.16-with-shmem-falloc-wait-disabled, or re-demonstrate it on any
> later release-with-shmem-falloc-wait-disabled, but demonstrate that
> the livelock does not occur on 6.6-rc-with-shmem-falloc-wait-disabled
> (or that plus some simple adjustment of hacker's choosing): that would
> be great news, and we could delete all this - though probably not
> without bisecting to satisfy ourselves on what was the crucial change.
> 
> But I never got around to running up Trinity to work on it originally,
> nor in the years since, nor do I expect to soon.  (Vlastimil had a
> good simple technique for demonstrating a part of the problem, but
> fixing that part turned out not fix the whole issue with Trinity.)

Fair enough. I agree that we should do some testing before we actually
remove the serialization because the problem was not well understood even
back then and likely had something to do with unmap_mapping_folio()
inefficiency (e.g. unmapping one page at a time acquiring heavily contended
i_mmap_mutex for each page) rather than page cache eviction itself.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 8/8] shmem,percpu_counter: add _limited_add(fbc, limit, amount)
  2023-10-06  5:42     ` Hugh Dickins
@ 2023-10-06 22:59       ` Dennis Zhou
  2023-10-12  4:09         ` Hugh Dickins
  0 siblings, 1 reply; 29+ messages in thread
From: Dennis Zhou @ 2023-10-06 22:59 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Chen, Tim C, Andrew Morton, Dave Chinner, Darrick J. Wong,
	Christian Brauner, Carlos Maiolino, Chuck Lever, Jan Kara,
	Matthew Wilcox, Johannes Weiner, Axel Rasmussen, linux-fsdevel,
	linux-kernel, linux-mm

Hello,

On Thu, Oct 05, 2023 at 10:42:17PM -0700, Hugh Dickins wrote:
> On Thu, 5 Oct 2023, Chen, Tim C wrote:
> 
> > >--- a/lib/percpu_counter.c
> > >+++ b/lib/percpu_counter.c
> > >@@ -278,6 +278,59 @@ int __percpu_counter_compare(struct
> > >percpu_counter *fbc, s64 rhs, s32 batch)  }
> > >EXPORT_SYMBOL(__percpu_counter_compare);
> > >
> > >+/*
> > >+ * Compare counter, and add amount if the total is within limit.
> > >+ * Return true if amount was added, false if it would exceed limit.
> > >+ */
> > >+bool __percpu_counter_limited_add(struct percpu_counter *fbc,
> > >+				  s64 limit, s64 amount, s32 batch) {
> > >+	s64 count;
> > >+	s64 unknown;
> > >+	unsigned long flags;
> > >+	bool good;
> > >+
> > >+	if (amount > limit)
> > >+		return false;
> > >+
> > >+	local_irq_save(flags);
> > >+	unknown = batch * num_online_cpus();
> > >+	count = __this_cpu_read(*fbc->counters);
> > >+
> > >+	/* Skip taking the lock when safe */
> > >+	if (abs(count + amount) <= batch &&
> > >+	    fbc->count + unknown <= limit) {
> > >+		this_cpu_add(*fbc->counters, amount);
> > >+		local_irq_restore(flags);
> > >+		return true;
> > >+	}
> > >+
> > >+	raw_spin_lock(&fbc->lock);
> > >+	count = fbc->count + amount;
> > >+
> > 
> > Perhaps we can fast path the case where for sure
> > we will exceed limit? 
> > 
> > if (fbc->count + amount - unknown > limit)
> > 	return false;
> 
> Thanks, that sounds reasonable: I'll try to add something like that -
> but haven't thought about it carefully enough yet (too easy for me
> to overlook some negative case which messes everything up).
> 
> Hugh
>

Sorry for the late chime in. I'm traveling right now.

I haven't been super happy lately with percpu_counter as it has had a
few corner cases such as the cpu_dying_mask fiasco which I thought we
fixed with a series from tglx [1]. If not I can resurrect it and pull
it.

I feel like percpu_counter is deviating from its original intended
usecase which, from my perspective, was a thin wrapper around a percpu
variable. At this point we seem to be bolting things onto percpu_counter
instead of giving it a clear focus for what it's supposed to do well.
I think I understand the use case, and ultimately it's kind of a
duality: I think it was xfs using percpu_counters where the value must
be > 0 to make sense, and there was a race condition with cpu dying [2].

At this point, I think it's probably better to wholly think about the
lower bound and upper bound problem of percpu_counter wrt the # of
online cpus.

Thanks,
Dennis

[1] https://lore.kernel.org/lkml/20230414162755.281993820@linutronix.de/
[2] https://lore.kernel.org/lkml/20230406015629.1804722-1-yebin@huaweicloud.com/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 8/8] shmem,percpu_counter: add _limited_add(fbc, limit, amount)
  2023-10-06  5:35     ` Hugh Dickins
@ 2023-10-09  0:15       ` Dave Chinner
  2023-10-12  4:36         ` Hugh Dickins
  0 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2023-10-09  0:15 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Tim Chen, Dave Chinner, Darrick J. Wong,
	Christian Brauner, Carlos Maiolino, Chuck Lever, Jan Kara,
	Matthew Wilcox, Johannes Weiner, Axel Rasmussen, linux-fsdevel,
	linux-kernel, linux-mm

On Thu, Oct 05, 2023 at 10:35:33PM -0700, Hugh Dickins wrote:
> On Thu, 5 Oct 2023, Dave Chinner wrote:
> > On Fri, Sep 29, 2023 at 08:42:45PM -0700, Hugh Dickins wrote:
> > > Percpu counter's compare and add are separate functions: without locking
> > > around them (which would defeat their purpose), it has been possible to
> > > overflow the intended limit.  Imagine all the other CPUs fallocating
> > > tmpfs huge pages to the limit, in between this CPU's compare and its add.
> > > 
> > > I have not seen reports of that happening; but tmpfs's recent addition
> > > of dquot_alloc_block_nodirty() in between the compare and the add makes
> > > it even more likely, and I'd be uncomfortable to leave it unfixed.
> > > 
> > > Introduce percpu_counter_limited_add(fbc, limit, amount) to prevent it.
> > > 
> > > I believe this implementation is correct, and slightly more efficient
> > > than the combination of compare and add (taking the lock once rather
> > > than twice when nearing full - the last 128MiB of a tmpfs volume on a
> > > machine with 128 CPUs and 4KiB pages); but it does beg for a better
> > > design - when nearing full, there is no new batching, but the costly
> > > percpu counter sum across CPUs still has to be done, while locked.
> > > 
> > > Follow __percpu_counter_sum()'s example, including cpu_dying_mask as
> > > well as cpu_online_mask: but shouldn't __percpu_counter_compare() and
> > > __percpu_counter_limited_add() then be adding a num_dying_cpus() to
> > > num_online_cpus(), when they calculate the maximum which could be held
> > > across CPUs?  But the times when it matters would be vanishingly rare.
> > > 
> > > Signed-off-by: Hugh Dickins <hughd@google.com>
> > > Cc: Tim Chen <tim.c.chen@intel.com>
> > > Cc: Dave Chinner <dchinner@redhat.com>
> > > Cc: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > > Tim, Dave, Darrick: I didn't want to waste your time on patches 1-7,
> > > which are just internal to shmem, and do not affect this patch (which
> > > applies to v6.6-rc and linux-next as is): but want to run this by you.
> > 
> > Hmmmm. IIUC, this only works for addition that approaches the limit
> > from below?
> 
> That's certainly how I was thinking about it, and what I need for tmpfs.
> Precisely what its limitations (haha) are, I'll have to take care to
> spell out.
> 
> (IIRC - it's a while since I wrote it - it can be used for subtraction,
> but goes the very slow way when it could go the fast way - uncompared
> percpu_counter_sub() much better for that.  You might be proposing that
> a tweak could adjust it to going the fast way when coming down from the
> "limit", but going the slow way as it approaches 0 - that would be neat,
> but I've not yet looked into whether it's feasibly done.)
> 
> > 
> > So if we are approaching the limit from above (i.e. add of a
> > negative amount, limit is zero) then this code doesn't work the same
> > as the open-coded compare+add operation would?
> 
> To it and to me, a limit of 0 means nothing positive can be added
> (and it immediately returns false for that case); and adding anything
> negative would be an error since the positive would not have been allowed.
> 
> Would a negative limit have any use?

I don't have any use for it, but the XFS case is decrementing free
space to determine if ENOSPC has been hit. It's the opposite
implementation to shmem, which increments used space to determine if
ENOSPC is hit.

> It's definitely not allowing all the possibilities that you could arrange
> with a separate compare and add; whether it's ruling out some useful
> possibilities to which it can easily be generalized, I'm not sure.
> 
> Well worth a look - but it'll be easier for me to break it than get
> it right, so I might just stick to adding some comments.
> 
> I might find that actually I prefer your way round: getting slower
> as approaching 0, without any need for specifying a limit??  That the
> tmpfs case pushed it in this direction, when it's better reversed?  Or
> that might be an embarrassing delusion which I'll regret having mentioned.

I think there are cases for both: approaching an upper limit from
below and a lower limit from above. Both are the same "compare and
add" algorithm, just with minor logic differences...

> > Hence I think this looks like a "add if result is less than"
> > operation, which is distinct from then "add if result is greater
> > than" operation that we use this same pattern for in XFS and ext4.
> > Perhaps a better name is in order?
> 
> The name still seems good to me, but a comment above it on its
> assumptions/limitations well worth adding.
> 
> I didn't find a percpu_counter_compare() in ext4, and haven't got

Go search for EXT4_FREECLUSTERS_WATERMARK....

> far yet with understanding the XFS ones: tomorrow...

XFS detects being near ENOSPC to change the batch update size so
that when near ENOSPC the percpu counter always aggregates to the
global sum on every modification. i.e. it becomes more accurate (but
slower) near the ENOSPC threshold. Then if the result of the
subtraction ends up being less than zero, it takes a lock (i.e. goes
even slower!), undoes the subtraction that took it below zero, and
determines if it can dip into the reserve pool or ENOSPC should be
reported.

Some of that could be optimised, but we need that external "lock and
undo" mechanism to manage the reserve pool space atomically at
ENOSPC...
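
If it helps to see the shape without the XFS details, an illustrative
sketch - every name below is made up, and the reserve handling is
greatly simplified - could look like:

/*
 * Illustration only, not the actual XFS code; all names are made up.
 * Shrink the batch near ENOSPC so the percpu counter stays accurate,
 * subtract, and if the result went below zero take a lock, undo the
 * subtraction, and decide between the reserve pool and -ENOSPC.
 */
static int dec_free_blocks(struct percpu_counter *free, s64 nblocks,
			   s64 *reserve_pool, spinlock_t *reserve_lock,
			   s32 batch)
{
	/* Near ENOSPC: aggregate to the global count on every change */
	if (__percpu_counter_compare(free, 2 * nblocks, batch) < 0)
		batch = 1;

	percpu_counter_add_batch(free, -nblocks, batch);
	if (__percpu_counter_compare(free, 0, batch) >= 0)
		return 0;

	/* Went below zero: lock, undo, then see if the reserve covers it */
	spin_lock(reserve_lock);
	percpu_counter_add(free, nblocks);
	if (*reserve_pool >= nblocks) {
		*reserve_pool -= nblocks;
		spin_unlock(reserve_lock);
		return 0;
	}
	spin_unlock(reserve_lock);
	return -ENOSPC;
}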

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 8/8] shmem,percpu_counter: add _limited_add(fbc, limit, amount)
  2023-10-06 22:59       ` Dennis Zhou
@ 2023-10-12  4:09         ` Hugh Dickins
  0 siblings, 0 replies; 29+ messages in thread
From: Hugh Dickins @ 2023-10-12  4:09 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Hugh Dickins, Chen, Tim C, Andrew Morton, Dave Chinner,
	Darrick J. Wong, Christian Brauner, Carlos Maiolino, Chuck Lever,
	Jan Kara, Matthew Wilcox, Johannes Weiner, Axel Rasmussen,
	linux-fsdevel, linux-kernel, linux-mm

On Fri, 6 Oct 2023, Dennis Zhou wrote:
> 
> Sorry for the late chime in. I'm traveling right now.

No problem at all, thanks for looking.

> 
> I haven't been super happy lately with percpu_counter as it has had a
> few corner cases such as the cpu_dying_mask fiasco which I thought we
> fixed with a series from tglx [1]. If not I can resurrect it and pull
> it.
> 
> I feel like percpu_counter is deviating from its original intended
> usecase which, from my perspective, was a thin wrapper around a percpu
> variable. At this point we seem to be bolting things onto percpu_counter
> instead of giving it a clear focus for what it's supposed to do well.
> I think I understand the use case, and ultimately it's kind of a
> duality: I think it was xfs using percpu_counters where the value must
> be > 0 to make sense, and there was a race condition with cpu dying [2].
> 
> At this point, I think it's probably better to wholly think about the
> lower bound and upper bound problem of percpu_counter wrt the # of
> online cpus.
> 
> Thanks,
> Dennis
> 
> [1] https://lore.kernel.org/lkml/20230414162755.281993820@linutronix.de/
> [2] https://lore.kernel.org/lkml/20230406015629.1804722-1-yebin@huaweicloud.com/

Thanks for the links.  I can see that the current cpu_dying situation
is not ideal, but don't see any need to get any deeper into that for
percpu_counter_limited_add(): I did consider an update to remove its
use of cpu_dying_mask, but that just seemed wrong - it should do the
same as is currently done in __percpu_counter_sum(), then be updated
along with that when cpu_dying is sorted, by tglx's series or otherwise.

I don't think I agree with you about percpu_counter deviating from its
original intended usecase; but I haven't studied the history to see what
that initial usecase was.  Whatever, we've had percpu_counter_add() and
percpu_counter_compare() for many years, and percpu_counter_limited_add()
is just an atomic combination of those: I don't see it as deviating at all.
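
In other words, it just makes atomic the racy open-coded pattern
sketched below (illustration only, not the actual shmem code), where
another CPU can slip its own add in between the compare and the add:

/*
 * Illustration only: the separate compare-then-add which
 * percpu_counter_limited_add() combines, and which can overshoot
 * the limit when another CPU adds in between the two calls.
 */
static bool racy_limited_add(struct percpu_counter *fbc, s64 limit, s64 amount)
{
	if (percpu_counter_compare(fbc, limit - amount) > 0)
		return false;			/* would exceed limit */
	percpu_counter_add(fbc, amount);	/* but may now overshoot */
	return true;
}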

Hugh

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 8/8] shmem,percpu_counter: add _limited_add(fbc, limit, amount)
  2023-10-09  0:15       ` Dave Chinner
@ 2023-10-12  4:36         ` Hugh Dickins
  2023-10-12  4:40           ` [PATCH 9/8] percpu_counter: extend _limited_add() to negative amounts Hugh Dickins
  0 siblings, 1 reply; 29+ messages in thread
From: Hugh Dickins @ 2023-10-12  4:36 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Hugh Dickins, Andrew Morton, Tim Chen, Dave Chinner,
	Darrick J. Wong, Christian Brauner, Carlos Maiolino, Chuck Lever,
	Jan Kara, Matthew Wilcox, Johannes Weiner, Axel Rasmussen,
	Dennis Zhou, linux-fsdevel, linux-kernel, linux-mm

On Mon, 9 Oct 2023, Dave Chinner wrote:
> On Thu, Oct 05, 2023 at 10:35:33PM -0700, Hugh Dickins wrote:
> > On Thu, 5 Oct 2023, Dave Chinner wrote:
> > > 
> > > Hmmmm. IIUC, this only works for addition that approaches the limit
> > > from below?
> > 
> > That's certainly how I was thinking about it, and what I need for tmpfs.
> > Precisely what its limitations (haha) are, I'll have to take care to
> > spell out.
> > 
> > (IIRC - it's a while since I wrote it - it can be used for subtraction,
> > but goes the very slow way when it could go the fast way - uncompared
> > percpu_counter_sub() much better for that.  You might be proposing that
> > a tweak could adjust it to going the fast way when coming down from the
> > "limit", but going the slow way as it approaches 0 - that would be neat,
> > but I've not yet looked into whether it's feasibly done.)

Easily done once I'd looked at it from the right angle.

> > 
> > > 
> > > So if we are approaching the limit from above (i.e. add of a
> > > negative amount, limit is zero) then this code doesn't work the same
> > > as the open-coded compare+add operation would?
> > 
> > To it and to me, a limit of 0 means nothing positive can be added
> > (and it immediately returns false for that case); and adding anything
> > negative would be an error since the positive would not have been allowed.
> > 
> > Would a negative limit have any use?

There was no reason to exclude it, once I was thinking clearly
about the comparisons.

> 
> I don't have any use for it, but the XFS case is decrementing free
> space to determine if ENOSPC has been hit. It's the opposite
> implementation to shmem, which increments used space to determine if
> ENOSPC is hit.

Right.

> 
> > It's definitely not allowing all the possibilities that you could arrange
> > with a separate compare and add; whether it's ruling out some useful
> > possibilities to which it can easily be generalized, I'm not sure.
> > 
> > Well worth a look - but it'll be easier for me to break it than get
> > it right, so I might just stick to adding some comments.
> > 
> > I might find that actually I prefer your way round: getting slower
> > as approaching 0, without any need for specifying a limit??  That the
> > tmpfs case pushed it in this direction, when it's better reversed?  Or
> > that might be an embarrassing delusion which I'll regret having mentioned.
> 
> I think there are cases for both: approaching an upper limit from
> below and a lower limit from above. Both are the same "compare and
> add" algorithm, just with minor logic differences...

Good, thanks, you've saved me: I was getting a bit fundamentalist there,
thinking to offer one simplest primitive from which anything could be
built.  But when it came down to it, I had no enthusiasm for rewriting
tmpfs's used_blocks as free_blocks, just to avoid that limit argument.

> 
> > > Hence I think this looks like a "add if result is less than"
> > > operation, which is distinct from then "add if result is greater
> > > than" operation that we use this same pattern for in XFS and ext4.
> > > Perhaps a better name is in order?
> > 
> > The name still seems good to me, but a comment above it on its
> > assumptions/limitations well worth adding.
> > 
> > I didn't find a percpu_counter_compare() in ext4, and haven't got
> 
> Go search for EXT4_FREECLUSTERS_WATERMARK....

Ah, not a percpu_counter_compare() user, but doing its own thing.
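
The shape of that "own thing", as far as I can tell, is the pattern
sketched below - names invented here, not the ext4 code: read the
approximate percpu value cheaply, and only fall back to the exact sum
when the request comes within a watermark of it.

/* Pattern sketch only, names invented, not the ext4 code */
static bool has_free_units(struct percpu_counter *free, s64 request, s64 watermark)
{
	s64 avail = percpu_counter_read_positive(free);

	/* Only pay for the exact sum when close to the watermark */
	if (avail - request < watermark)
		avail = percpu_counter_sum_positive(free);

	return avail >= request;
}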

> 
> > far yet with understanding the XFS ones: tomorrow...
> 
> XFS detects being near ENOSPC to change the batch update size so
> that when near ENOSPC the percpu counter always aggregates to the
> global sum on every modification. i.e. it becomes more accurate (but
> slower) near the ENOSPC threshold. Then if the result of the
> subtraction ends up being less than zero, it takes a lock (i.e. goes
> even slower!), undoes the subtraction that took it below zero, and
> determines if it can dip into the reserve pool or ENOSPC should be
> reported.
> 
> Some of that could be optimised, but we need that external "lock and
> undo" mechanism to manage the reserve pool space atomically at
> ENOSPC...

Thanks for going above and beyond with the description; but I'll be
honest and admit that I only looked quickly, and did not reach any
conclusion as to whether such usage could or should be converted
to percpu_counter_limited_add() - which would never take any XFS
locks, of course, so might just end up doubling the slow work.

But absolutely I agree with you, and thank you for pointing out,
how stupidly useless percpu_counter_limited_add() was for decrementing -
it was nothing more than a slow way of doing percpu_counter_sub().

I'm about to send in a 9/8, extending it to be more useful: thanks.

Hugh

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 9/8] percpu_counter: extend _limited_add() to negative amounts
  2023-10-12  4:36         ` Hugh Dickins
@ 2023-10-12  4:40           ` Hugh Dickins
  0 siblings, 0 replies; 29+ messages in thread
From: Hugh Dickins @ 2023-10-12  4:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Tim Chen, Dave Chinner, Darrick J. Wong,
	Christian Brauner, Carlos Maiolino, Chuck Lever, Jan Kara,
	Matthew Wilcox, Johannes Weiner, Axel Rasmussen, Dennis Zhou,
	linux-fsdevel, linux-kernel, linux-mm

Though tmpfs does not need it, percpu_counter_limited_add() can be twice
as useful if it works sensibly with negative amounts (subs) - typically
decrements towards a limit of 0 or nearby: as suggested by Dave Chinner.

And in the course of that reworking, skip the percpu counter sum if it is
already obvious that the limit would be passed: as suggested by Tim Chen.

Extend the comment above __percpu_counter_limited_add(), defining the
behaviour with positive and negative amounts, allowing negative limits,
but not bothering about overflow beyond S64_MAX.
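
For instance (usage sketch only - the counters and limits below are
made up, not taken from shmem or XFS):

/* Usage sketch only: made-up counters, not shmem or XFS code */
static int charge_blocks(struct percpu_counter *used, s64 max_blocks, s64 nr)
{
	/* positive amount: approach the upper limit from below */
	if (!percpu_counter_limited_add(used, max_blocks, nr))
		return -ENOSPC;
	return 0;
}

static int take_free_blocks(struct percpu_counter *free, s64 nr)
{
	/* negative amount: approach a lower limit of 0 from above */
	if (!percpu_counter_limited_add(free, 0, -nr))
		return -ENOSPC;
	return 0;
}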

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/percpu_counter.h | 11 +++++--
 lib/percpu_counter.c           | 54 +++++++++++++++++++++++++---------
 2 files changed, 49 insertions(+), 16 deletions(-)

diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index 8cb7c071bd5c..3a44dd1e33d2 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -198,14 +198,21 @@ static inline bool
 percpu_counter_limited_add(struct percpu_counter *fbc, s64 limit, s64 amount)
 {
 	unsigned long flags;
+	bool good = false;
 	s64 count;
 
+	if (amount == 0)
+		return true;
+
 	local_irq_save(flags);
 	count = fbc->count + amount;
-	if (count <= limit)
+	if ((amount > 0 && count <= limit) ||
+	    (amount < 0 && count >= limit)) {
 		fbc->count = count;
+		good = true;
+	}
 	local_irq_restore(flags);
-	return count <= limit;
+	return good;
 }
 
 /* non-SMP percpu_counter_add_local is the same with percpu_counter_add */
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index 58a3392f471b..44dd133594d4 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -279,8 +279,16 @@ int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)
 EXPORT_SYMBOL(__percpu_counter_compare);
 
 /*
- * Compare counter, and add amount if the total is within limit.
- * Return true if amount was added, false if it would exceed limit.
+ * Compare counter, and add amount if total is: less than or equal to limit if
+ * amount is positive, or greater than or equal to limit if amount is negative.
+ * Return true if amount is added, or false if total would be beyond the limit.
+ *
+ * Negative limit is allowed, but unusual.
+ * When negative amounts (subs) are given to percpu_counter_limited_add(),
+ * the limit would most naturally be 0 - but other limits are also allowed.
+ *
+ * Overflow beyond S64_MAX is not allowed for: counter, limit and amount
+ * are all assumed to be sane (far from S64_MIN and S64_MAX).
  */
 bool __percpu_counter_limited_add(struct percpu_counter *fbc,
 				  s64 limit, s64 amount, s32 batch)
@@ -288,10 +296,10 @@ bool __percpu_counter_limited_add(struct percpu_counter *fbc,
 	s64 count;
 	s64 unknown;
 	unsigned long flags;
-	bool good;
+	bool good = false;
 
-	if (amount > limit)
-		return false;
+	if (amount == 0)
+		return true;
 
 	local_irq_save(flags);
 	unknown = batch * num_online_cpus();
@@ -299,7 +307,8 @@ bool __percpu_counter_limited_add(struct percpu_counter *fbc,
 
 	/* Skip taking the lock when safe */
 	if (abs(count + amount) <= batch &&
-	    fbc->count + unknown <= limit) {
+	    ((amount > 0 && fbc->count + unknown <= limit) ||
+	     (amount < 0 && fbc->count - unknown >= limit))) {
 		this_cpu_add(*fbc->counters, amount);
 		local_irq_restore(flags);
 		return true;
@@ -309,7 +318,19 @@ bool __percpu_counter_limited_add(struct percpu_counter *fbc,
 	count = fbc->count + amount;
 
 	/* Skip percpu_counter_sum() when safe */
-	if (count + unknown > limit) {
+	if (amount > 0) {
+		if (count - unknown > limit)
+			goto out;
+		if (count + unknown <= limit)
+			good = true;
+	} else {
+		if (count + unknown < limit)
+			goto out;
+		if (count - unknown >= limit)
+			good = true;
+	}
+
+	if (!good) {
 		s32 *pcount;
 		int cpu;
 
@@ -317,15 +338,20 @@ bool __percpu_counter_limited_add(struct percpu_counter *fbc,
 			pcount = per_cpu_ptr(fbc->counters, cpu);
 			count += *pcount;
 		}
+		if (amount > 0) {
+			if (count > limit)
+				goto out;
+		} else {
+			if (count < limit)
+				goto out;
+		}
+		good = true;
 	}
 
-	good = count <= limit;
-	if (good) {
-		count = __this_cpu_read(*fbc->counters);
-		fbc->count += count + amount;
-		__this_cpu_sub(*fbc->counters, count);
-	}
-
+	count = __this_cpu_read(*fbc->counters);
+	fbc->count += count + amount;
+	__this_cpu_sub(*fbc->counters, count);
+out:
 	raw_spin_unlock(&fbc->lock);
 	local_irq_restore(flags);
 	return good;
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 8/8] shmem,percpu_counter: add _limited_add(fbc, limit, amount)
  2023-09-30  3:42 ` [PATCH 8/8] shmem,percpu_counter: add _limited_add(fbc, limit, amount) Hugh Dickins
                     ` (2 preceding siblings ...)
  2023-10-05 16:50   ` [PATCH 8/8] shmem,percpu_counter: add _limited_add(fbc, limit, amount) Chen, Tim C
@ 2024-05-25  6:00   ` Mateusz Guzik
  3 siblings, 0 replies; 29+ messages in thread
From: Mateusz Guzik @ 2024-05-25  6:00 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Tim Chen, Dave Chinner, Darrick J. Wong,
	Christian Brauner, Carlos Maiolino, Chuck Lever, Jan Kara,
	Matthew Wilcox, Johannes Weiner, Axel Rasmussen, linux-fsdevel,
	linux-kernel, linux-mm

On Fri, Sep 29, 2023 at 08:42:45PM -0700, Hugh Dickins wrote:
> Percpu counter's compare and add are separate functions: without locking
> around them (which would defeat their purpose), it has been possible to
> overflow the intended limit.  Imagine all the other CPUs fallocating
> tmpfs huge pages to the limit, in between this CPU's compare and its add.
> 
> I have not seen reports of that happening; but tmpfs's recent addition
> of dquot_alloc_block_nodirty() in between the compare and the add makes
> it even more likely, and I'd be uncomfortable to leave it unfixed.
> 
> Introduce percpu_counter_limited_add(fbc, limit, amount) to prevent it.
> 

I think the posted (and by now landed) code is racy.

I had seen there was a follow up patch which further augmented the
routine, but it did not alter the issue below so I'm responding to this
thread.

> +/*
> + * Compare counter, and add amount if the total is within limit.
> + * Return true if amount was added, false if it would exceed limit.
> + */
> +bool __percpu_counter_limited_add(struct percpu_counter *fbc,
> +				  s64 limit, s64 amount, s32 batch)
> +{
> +	s64 count;
> +	s64 unknown;
> +	unsigned long flags;
> +	bool good;
> +
> +	if (amount > limit)
> +		return false;
> +
> +	local_irq_save(flags);
> +	unknown = batch * num_online_cpus();
> +	count = __this_cpu_read(*fbc->counters);
> +
> +	/* Skip taking the lock when safe */
> +	if (abs(count + amount) <= batch &&
> +	    fbc->count + unknown <= limit) {
> +		this_cpu_add(*fbc->counters, amount);
> +		local_irq_restore(flags);
> +		return true;
> +	}
> +

Note the value of fbc->count is *not* stabilized.

> +	raw_spin_lock(&fbc->lock);
> +	count = fbc->count + amount;
> +
> +	/* Skip percpu_counter_sum() when safe */
> +	if (count + unknown > limit) {
> +		s32 *pcount;
> +		int cpu;
> +
> +		for_each_cpu_or(cpu, cpu_online_mask, cpu_dying_mask) {
> +			pcount = per_cpu_ptr(fbc->counters, cpu);
> +			count += *pcount;
> +		}
> +	}
> +
> +	good = count <= limit;
> +	if (good) {
> +		count = __this_cpu_read(*fbc->counters);
> +		fbc->count += count + amount;
> +		__this_cpu_sub(*fbc->counters, count);
> +	}
> +
> +	raw_spin_unlock(&fbc->lock);
> +	local_irq_restore(flags);
> +	return good;
> +}
> +

Consider 2 cpus executing the func at the same time, where one is
updating fbc->counter in the slow path while the other is testing it in
the fast path.

cpu0						cpu1
						if (abs(count + amount) <= batch &&				
						    fbc->count + unknown <= limit)
/* gets past the per-cpu traversal */
/* at this point cpu0 decided to bump fbc->count, while cpu1 decided to
 * bump the local counter */
							this_cpu_add(*fbc->counters, amount);
fbc->count += count + amount;

Suppose the fbc->count update puts it close enough to the limit that an
addition from cpu1 would put the entire thing over said limit.

Since the 2 operations are not synchronized, cpu1 can observe fbc->count
prior to the bump and update its local counter, resulting in the
aforementioned overflow.

Am I misreading something here? Is this prevented?

To my grep the only user is quotas in shmem, and I wonder if that use is
even justified. I am aware of the performance realities of atomics.
However, it very well may be that the codepaths which exercise the
counter are suffering multicore issues elsewhere, making an atomic (in a
dedicated cacheline) a perfectly sane choice for the foreseeable future.
Should this be true, there would be no rush to work out a fixed variant
of the routine.
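
For reference, the kind of atomic alternative alluded to could be
sketched as below - names invented here, not proposed code: a single
atomic64_t in its own cacheline, with a cmpxchg loop so the compare and
the add cannot be torn apart.

/* Sketch only, not proposed code: names invented here */
struct limited_counter {
	atomic64_t count ____cacheline_aligned_in_smp;
};

static bool limited_counter_add(struct limited_counter *lc, s64 limit, s64 amount)
{
	s64 old = atomic64_read(&lc->count);

	do {
		if (old + amount > limit)
			return false;
	} while (!atomic64_try_cmpxchg(&lc->count, &old, old + amount));

	return true;
}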

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2024-05-25  6:00 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-30  3:23 [PATCH 0/8] shmem,tmpfs: general maintenance Hugh Dickins
2023-09-30  3:25 ` [PATCH 1/8] shmem: shrink shmem_inode_info: dir_offsets in a union Hugh Dickins
2023-09-30 16:16   ` Chuck Lever
2023-10-03 13:06   ` Jan Kara
2023-09-30  3:26 ` [PATCH 2/8] shmem: remove vma arg from shmem_get_folio_gfp() Hugh Dickins
2023-10-03 13:07   ` Jan Kara
2023-09-30  3:27 ` [PATCH 3/8] shmem: factor shmem_falloc_wait() out of shmem_fault() Hugh Dickins
2023-10-03 13:18   ` Jan Kara
2023-10-06  3:48     ` Hugh Dickins
2023-10-06 11:01       ` Jan Kara
2023-09-30  3:28 ` [PATCH 4/8] shmem: trivial tidyups, removing extra blank lines, etc Hugh Dickins
2023-10-03 13:20   ` Jan Kara
2023-09-30  3:30 ` [PATCH 5/8] shmem: shmem_acct_blocks() and shmem_inode_acct_blocks() Hugh Dickins
2023-10-03 13:21   ` Jan Kara
2023-09-30  3:31 ` [PATCH 6/8] shmem: move memcg charge out of shmem_add_to_page_cache() Hugh Dickins
2023-10-03 13:28   ` Jan Kara
2023-09-30  3:32 ` [PATCH 7/8] shmem: _add_to_page_cache() before shmem_inode_acct_blocks() Hugh Dickins
2023-09-30  3:42 ` [PATCH 8/8] shmem,percpu_counter: add _limited_add(fbc, limit, amount) Hugh Dickins
2023-10-04 15:32   ` Jan Kara
2023-10-04 23:10   ` Dave Chinner
2023-10-06  5:35     ` Hugh Dickins
2023-10-09  0:15       ` Dave Chinner
2023-10-12  4:36         ` Hugh Dickins
2023-10-12  4:40           ` [PATCH 9/8] percpu_counter: extend _limited_add() to negative amounts Hugh Dickins
2023-10-05 16:50   ` [PATCH 8/8] shmem,percpu_counter: add _limited_add(fbc, limit, amount) Chen, Tim C
2023-10-06  5:42     ` Hugh Dickins
2023-10-06 22:59       ` Dennis Zhou
2023-10-12  4:09         ` Hugh Dickins
2024-05-25  6:00   ` Mateusz Guzik
