All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC][PATCH 00/10] On inode::i_count and the usage vs reference count issue
@ 2017-02-24 15:43 Peter Zijlstra
  2017-02-24 15:43 ` [RFC][PATCH 01/10] fs: Use lockdep_assert_held() instead of comments Peter Zijlstra
                   ` (12 more replies)
  0 siblings, 13 replies; 26+ messages in thread
From: Peter Zijlstra @ 2017-02-24 15:43 UTC (permalink / raw)
  To: Al Viro, Linus Torvalds, Chris Mason
  Cc: linux-kernel, linux-fsdevel, David Howells, elena.reshetova,
	ishkamiel, dwindsor, gregkh, peterz

(my appologies if this arrives a second time; I seem to have
fat-fingered my send command the first time and things didn't reach
neither me or the list).

Hi all,

So I'm not entirely happy with these patches; but I don't really know
fs/inode.c as well as some of you and I figured I'd reached a point where I
need feedback (or maybe I'm well past that, we'll see).

So the kernel has recently grown a reference count type, this thing is fairly
strict with semantics; such that it can give 'helpful' warnings when people
'accidentally' violate the rules and create bugs.

The one at the core of this patch set is that refcount_t assumes 0 means 'free'
or 'freeing'.

The problem is that inode::i_count is _not_ a reference count, it is a usage
count (for lack of a better name), it counts how many active users of the inode
are out there. But 0 users is a perfectly fine state for an inode to be in,
it'll just sit in the cache waiting for a new user (or reclaim).

Now refcount_t has no operations to increment once we've hit 0, because if you
assume 0 means 'free', increment from 0 means use-after-free, and that's a bad
thing.

So what this patch-set attempts is doing a +1 bias on the usage-count to turn
it into an actual reference count, where the extra reference is the pointer the
cache itself has to the object.

This then results in the need to do something like: dec_and_lock at the 2->1
transition instead of the usual 1->0; for this purpose we introduce
refcount_dec_unless().

So far, it sounds fairly sensible; _except_ for the wee little problem that a
fair amount of code looks at the value of i_count. Some of this is fine, eg.
the evict path verifies it is indeed 0. But other places look at !0 values and
those are suspect.

To make matters worse; once i_count is a refcount, it appears trivial to avoid
inode_hash_lock for lookups (yay RCU!) and looking at i_count becomes even more
of a problem because then holding i_lock will not in fact stabilize it anymore.

So I've 'ignored' (by assuming they were already broken) the i_count
observers and done that RCU conversion -- even though I have no idea what
workload would hit the global inode_hash_lock hard enough for it to matter
(see, maybe I'm well past the point where I could've used feedback).


There's a number of options here:

 - I'm not completely insane, and these patches can be made to work.

 - We decide usage-counts are useful and try and support them in refcount_t;
   this has the down-side that people can more easily write bad code (by doing
   from 0 increments that should not have happened).

 - We decide usage-counts need their own type (urgh, more...).

 - None of the above, we keep i_count as is and let people hunt and convert
   actual refcounts.


I'm ok with all those; I just figured it'd be 'fun' to convert something
non-trivial. FWIW, this boots and builds a kernel (but that's about all the
testing its had).

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC][PATCH 01/10] fs: Use lockdep_assert_held() instead of comments
  2017-02-24 15:43 [RFC][PATCH 00/10] On inode::i_count and the usage vs reference count issue Peter Zijlstra
@ 2017-02-24 15:43 ` Peter Zijlstra
  2017-02-24 15:43 ` [RFC][PATCH 02/10] fs: Avoid looking at i_count without i_lock held Peter Zijlstra
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2017-02-24 15:43 UTC (permalink / raw)
  To: Al Viro, Linus Torvalds, Chris Mason
  Cc: linux-kernel, linux-fsdevel, David Howells, elena.reshetova,
	ishkamiel, dwindsor, gregkh, peterz

[-- Attachment #1: peterz-fs-inode-1.patch --]
[-- Type: text/plain, Size: 854 bytes --]

Because nobody reads comments...

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 fs/inode.c |    7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

--- a/fs/inode.c
+++ b/fs/inode.c
@@ -384,11 +384,9 @@ static void init_once(void *foo)
 	inode_init_once(inode);
 }
 
-/*
- * inode->i_lock must be held
- */
 void __iget(struct inode *inode)
 {
+	lockdep_assert_held(&inode->i_lock);
 	atomic_inc(&inode->i_count);
 }
 
@@ -409,11 +407,10 @@ static void inode_lru_list_add(struct in
 
 /*
  * Add inode to LRU if needed (inode is unused and clean).
- *
- * Needs inode->i_lock held.
  */
 void inode_add_lru(struct inode *inode)
 {
+	lockdep_assert_held(&inode->i_lock);
 	if (!(inode->i_state & (I_DIRTY_ALL | I_SYNC |
 				I_FREEING | I_WILL_FREE)) &&
 	    !atomic_read(&inode->i_count) && inode->i_sb->s_flags & MS_ACTIVE)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC][PATCH 02/10] fs: Avoid looking at i_count without i_lock held
  2017-02-24 15:43 [RFC][PATCH 00/10] On inode::i_count and the usage vs reference count issue Peter Zijlstra
  2017-02-24 15:43 ` [RFC][PATCH 01/10] fs: Use lockdep_assert_held() instead of comments Peter Zijlstra
@ 2017-02-24 15:43 ` Peter Zijlstra
       [not found]   ` <CA+55aFxLw8FXf61rsGYDjA1tS=joDeaF7OSgaepLWwcz4zt=dg@mail.gmail.com>
  2017-02-24 15:43 ` [RFC][PATCH 03/10] fs: Introduce i_count() Peter Zijlstra
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2017-02-24 15:43 UTC (permalink / raw)
  To: Al Viro, Linus Torvalds, Chris Mason
  Cc: linux-kernel, linux-fsdevel, David Howells, elena.reshetova,
	ishkamiel, dwindsor, gregkh, peterz

[-- Attachment #1: peterz-fs-inode-2.patch --]
[-- Type: text/plain, Size: 825 bytes --]

In general i_count will only stay 0 if we have i_lock held. I realize
this is evict, so having MS_ACTIVE cleared might avoid the race
against find_inode_fast() in other ways.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 fs/inode.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

--- a/fs/inode.c
+++ b/fs/inode.c
@@ -604,10 +604,12 @@ void evict_inodes(struct super_block *sb
 again:
 	spin_lock(&sb->s_inode_list_lock);
 	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
-		if (atomic_read(&inode->i_count))
+		spin_lock(&inode->i_lock);
+		if (atomic_read(&inode->i_count)) {
+			spin_unlock(&inode->i_lock);
 			continue;
+		}
 
-		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
 			spin_unlock(&inode->i_lock);
 			continue;

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC][PATCH 03/10] fs: Introduce i_count()
  2017-02-24 15:43 [RFC][PATCH 00/10] On inode::i_count and the usage vs reference count issue Peter Zijlstra
  2017-02-24 15:43 ` [RFC][PATCH 01/10] fs: Use lockdep_assert_held() instead of comments Peter Zijlstra
  2017-02-24 15:43 ` [RFC][PATCH 02/10] fs: Avoid looking at i_count without i_lock held Peter Zijlstra
@ 2017-02-24 15:43 ` Peter Zijlstra
  2017-02-24 15:43 ` [RFC][PATCH 04/10] fs: Restructure iput() Peter Zijlstra
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2017-02-24 15:43 UTC (permalink / raw)
  To: Al Viro, Linus Torvalds, Chris Mason
  Cc: linux-kernel, linux-fsdevel, David Howells, elena.reshetova,
	ishkamiel, dwindsor, gregkh, peterz

[-- Attachment #1: peterz-fs-inode-3a.patch --]
[-- Type: text/plain, Size: 10557 bytes --]

A preparatory patch; encapsulate i_count reading, modeled after
d_count().

There are a few sites around the kernel that look at i_count and base
decsions on it, some look OK, some really flaky:

 - spufs:
	i_count == 1, no i_lock

 - ext4_free_inode:
	i_count == 0, OK because super_operations::evict_inode

 - writeback_single_inode:
	i_count || I_WILL_FREE, seems OK

 - hpfs_write_inode: no clue

 - fsnotify_unmount_inodes:
	i_count == 0, seem OK, we hold i_lock.
	find_inode_fast() is serialized with i_lock
	futex uses inc_not_zero().

 - btrfs_invalidate_inodes:
	checks for >1 i_count, while holding one, seems racy, nothing
	(obviously) stops another reference from happening. Might be
	fine in that any further reference cannot be an alias,
	definitely needs a comment.

 - check_conflicting_open: no clue

 - maybe_indirect_to_direct: no clue

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/powerpc/platforms/cell/spufs/file.c         |    2 +-
 drivers/staging/lustre/lustre/llite/vvp_object.c |    2 +-
 fs/btrfs/inode.c                                 |    2 +-
 fs/ceph/mds_client.c                             |    2 +-
 fs/cifs/inode.c                                  |    2 +-
 fs/ext4/ialloc.c                                 |    4 ++--
 fs/fs-writeback.c                                |    2 +-
 fs/hpfs/inode.c                                  |    2 +-
 fs/inode.c                                       |   11 ++++++-----
 fs/locks.c                                       |    2 +-
 fs/nfs/inode.c                                   |    4 ++--
 fs/notify/inode_mark.c                           |    2 +-
 fs/orangefs/namei.c                              |    4 ++--
 fs/reiserfs/stree.c                              |    2 +-
 fs/ubifs/super.c                                 |    2 +-
 fs/xfs/xfs_trace.h                               |    2 +-
 include/linux/fs.h                               |    5 +++++
 include/trace/events/filelock.h                  |    2 +-
 18 files changed, 30 insertions(+), 24 deletions(-)

--- a/arch/powerpc/platforms/cell/spufs/file.c
+++ b/arch/powerpc/platforms/cell/spufs/file.c
@@ -1492,7 +1492,7 @@ static int spufs_mfc_open(struct inode *
 	if (ctx->owner != current->mm)
 		return -EINVAL;
 
-	if (atomic_read(&inode->i_count) != 1)
+	if (i_count(inode) != 1) /* XXX */
 		return -EBUSY;
 
 	mutex_lock(&ctx->mapping_lock);
--- a/drivers/staging/lustre/lustre/llite/vvp_object.c
+++ b/drivers/staging/lustre/lustre/llite/vvp_object.c
@@ -72,7 +72,7 @@ static int vvp_object_print(const struct
 		lli = ll_i2info(inode);
 		(*p)(env, cookie, "%lu/%u %o %u %d %p "DFID,
 		     inode->i_ino, inode->i_generation, inode->i_mode,
-		     inode->i_nlink, atomic_read(&inode->i_count),
+		     inode->i_nlink, i_count(inode),
 		     lli->lli_clob, PFID(&lli->lli_fid));
 	}
 	return 0;
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5616,7 +5616,7 @@ void btrfs_invalidate_inodes(struct btrf
 		inode = igrab(&entry->vfs_inode);
 		if (inode) {
 			spin_unlock(&root->inode_lock);
-			if (atomic_read(&inode->i_count) > 1)
+			if (i_count(inode) > 1)
 				d_prune_aliases(inode);
 			/*
 			 * btrfs_drop_inode will have it removed from
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1448,7 +1448,7 @@ static int trim_caps_cb(struct inode *in
 		spin_unlock(&ci->i_ceph_lock);
 		d_prune_aliases(inode);
 		dout("trim_caps_cb %p cap %p  pruned, count now %d\n",
-		     inode, cap, atomic_read(&inode->i_count));
+		     inode, cap, i_count(inode));
 		return 0;
 	}
 
--- a/fs/cifs/inode.c
+++ b/fs/cifs/inode.c
@@ -1950,7 +1950,7 @@ int cifs_revalidate_dentry_attr(struct d
 	}
 
 	cifs_dbg(FYI, "Update attributes: %s inode 0x%p count %d dentry: 0x%p d_time %ld jiffies %ld\n",
-		 full_path, inode, inode->i_count.counter,
+		 full_path, inode, i_count(inode),
 		 dentry, cifs_get_time(dentry), jiffies);
 
 	if (cifs_sb_master_tcon(CIFS_SB(sb))->unix_ext)
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -270,10 +270,10 @@ void ext4_free_inode(handle_t *handle, s
 		       "nonexistent device\n", __func__, __LINE__);
 		return;
 	}
-	if (atomic_read(&inode->i_count) > 1) {
+	if (i_count(inode)) {
 		ext4_msg(sb, KERN_ERR, "%s:%d: inode #%lu: count=%d",
 			 __func__, __LINE__, inode->i_ino,
-			 atomic_read(&inode->i_count));
+			 i_count(inode));
 		return;
 	}
 	if (inode->i_nlink) {
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1384,7 +1384,7 @@ static int writeback_single_inode(struct
 	int ret = 0;
 
 	spin_lock(&inode->i_lock);
-	if (!atomic_read(&inode->i_count))
+	if (!i_count(inode))
 		WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
 	else
 		WARN_ON(inode->i_state & I_WILL_FREE);
--- a/fs/hpfs/inode.c
+++ b/fs/hpfs/inode.c
@@ -183,7 +183,7 @@ void hpfs_write_inode(struct inode *i)
 	struct hpfs_inode_info *hpfs_inode = hpfs_i(i);
 	struct inode *parent;
 	if (i->i_ino == hpfs_sb(i->i_sb)->sb_root) return;
-	if (hpfs_inode->i_rddir_off && !atomic_read(&i->i_count)) {
+	if (hpfs_inode->i_rddir_off && !i_count(i)) { /* XXX */
 		if (*hpfs_inode->i_rddir_off)
 			pr_err("write_inode: some position still there\n");
 		kfree(hpfs_inode->i_rddir_off);
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -411,9 +411,10 @@ static void inode_lru_list_add(struct in
 void inode_add_lru(struct inode *inode)
 {
 	lockdep_assert_held(&inode->i_lock);
+
 	if (!(inode->i_state & (I_DIRTY_ALL | I_SYNC |
 				I_FREEING | I_WILL_FREE)) &&
-	    !atomic_read(&inode->i_count) && inode->i_sb->s_flags & MS_ACTIVE)
+	    !i_count(inode) && inode->i_sb->s_flags & MS_ACTIVE)
 		inode_lru_list_add(inode);
 }
 
@@ -605,7 +606,7 @@ void evict_inodes(struct super_block *sb
 	spin_lock(&sb->s_inode_list_lock);
 	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
 		spin_lock(&inode->i_lock);
-		if (atomic_read(&inode->i_count)) {
+		if (i_count(inode)) {
 			spin_unlock(&inode->i_lock);
 			continue;
 		}
@@ -665,7 +666,7 @@ int invalidate_inodes(struct super_block
 			busy = 1;
 			continue;
 		}
-		if (atomic_read(&inode->i_count)) {
+		if (i_count(inode)) {
 			spin_unlock(&inode->i_lock);
 			busy = 1;
 			continue;
@@ -713,9 +714,9 @@ static enum lru_status inode_lru_isolate
 
 	/*
 	 * Referenced or dirty inodes are still in use. Give them another pass
-	 * through the LRU as we canot reclaim them now.
+	 * through the LRU as we cannot reclaim them now.
 	 */
-	if (atomic_read(&inode->i_count) ||
+	if (i_count(inode) ||
 	    (inode->i_state & ~I_REFERENCED)) {
 		list_lru_isolate(lru, &inode->i_lru);
 		spin_unlock(&inode->i_lock);
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1651,7 +1651,7 @@ check_conflicting_open(const struct dent
 		return -EAGAIN;
 
 	if ((arg == F_WRLCK) && ((d_count(dentry) > 1) ||
-	    (atomic_read(&inode->i_count) > 1)))
+	    (i_count(inode) > 1)))
 		ret = -EAGAIN;
 
 	return ret;
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -531,7 +531,7 @@ nfs_fhget(struct super_block *sb, struct
 		inode->i_sb->s_id,
 		(unsigned long long)NFS_FILEID(inode),
 		nfs_display_fhandle_hash(fh),
-		atomic_read(&inode->i_count));
+		i_count(inode));
 
 out:
 	return inode;
@@ -1696,7 +1696,7 @@ static int nfs_update_inode(struct inode
 	dfprintk(VFS, "NFS: %s(%s/%lu fh_crc=0x%08x ct=%d info=0x%x)\n",
 			__func__, inode->i_sb->s_id, inode->i_ino,
 			nfs_display_fhandle_hash(NFS_FH(inode)),
-			atomic_read(&inode->i_count), fattr->valid);
+			i_count(inode), fattr->valid);
 
 	if (!nfs_fileid_valid(nfsi, fattr)) {
 		printk(KERN_ERR "NFS: server %s error: fileid changed\n"
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -171,7 +171,7 @@ void fsnotify_unmount_inodes(struct supe
 		 * evict all inodes with zero i_count from icache which is
 		 * unnecessarily violent and may in fact be illegal to do.
 		 */
-		if (!atomic_read(&inode->i_count)) {
+		if (!i_count(inode)) {
 			spin_unlock(&inode->i_lock);
 			continue;
 		}
--- a/fs/orangefs/namei.c
+++ b/fs/orangefs/namei.c
@@ -202,14 +202,14 @@ static struct dentry *orangefs_lookup(st
 		     __func__,
 		     __LINE__,
 		     inode->i_ino,
-		     (int)atomic_read(&inode->i_count));
+		     (int)i_count(inode));
 
 	/* update dentry/inode pair into dcache */
 	res = d_splice_alias(inode, dentry);
 
 	gossip_debug(GOSSIP_NAME_DEBUG,
 		     "Lookup success (inode ct = %d)\n",
-		     (int)atomic_read(&inode->i_count));
+		     (int)i_count(inode));
 out:
 	op_release(new_op);
 	return res;
--- a/fs/reiserfs/stree.c
+++ b/fs/reiserfs/stree.c
@@ -1555,7 +1555,7 @@ static int maybe_indirect_to_direct(stru
 	 * reading in the last block.  The user will hit problems trying to
 	 * read the file, but for now we just skip the indirect2direct
 	 */
-	if (atomic_read(&inode->i_count) > 1 ||
+	if (i_count(inode) > 1 || /* XXX */
 	    !tail_has_to_be_packed(inode) ||
 	    !page || (REISERFS_I(inode)->i_flags & i_nopack_mask)) {
 		/* leave tail in an unformatted node */
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -349,7 +349,7 @@ static void ubifs_evict_inode(struct ino
 		goto out;
 
 	dbg_gen("inode %lu, mode %#x", inode->i_ino, (int)inode->i_mode);
-	ubifs_assert(!atomic_read(&inode->i_count));
+	ubifs_assert(!i_count(inode));
 
 	truncate_inode_pages_final(&inode->i_data);
 
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -704,7 +704,7 @@ DECLARE_EVENT_CLASS(xfs_iref_class,
 	TP_fast_assign(
 		__entry->dev = VFS_I(ip)->i_sb->s_dev;
 		__entry->ino = ip->i_ino;
-		__entry->count = atomic_read(&VFS_I(ip)->i_count);
+		__entry->count = i_count(VFS_I(ip));
 		__entry->pincount = atomic_read(&ip->i_pincount);
 		__entry->caller_ip = caller_ip;
 	),
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2711,6 +2711,11 @@ static inline void lockdep_annotate_inod
 extern void unlock_new_inode(struct inode *);
 extern unsigned int get_next_ino(void);
 
+static inline int i_count(struct inode *inode)
+{
+	return atomic_read(&inode->i_count);
+}
+
 extern void __iget(struct inode * inode);
 extern void iget_failed(struct inode *);
 extern void clear_inode(struct inode *);
--- a/include/trace/events/filelock.h
+++ b/include/trace/events/filelock.h
@@ -185,7 +185,7 @@ TRACE_EVENT(generic_add_lease,
 		__entry->i_ino = inode->i_ino;
 		__entry->wcount = atomic_read(&inode->i_writecount);
 		__entry->dcount = d_count(fl->fl_file->f_path.dentry);
-		__entry->icount = atomic_read(&inode->i_count);
+		__entry->icount = i_count(inode);
 		__entry->fl_owner = fl ? fl->fl_owner : NULL;
 		__entry->fl_flags = fl ? fl->fl_flags : 0;
 		__entry->fl_type = fl ? fl->fl_type : 0;

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC][PATCH 04/10] fs: Restructure iput()
  2017-02-24 15:43 [RFC][PATCH 00/10] On inode::i_count and the usage vs reference count issue Peter Zijlstra
                   ` (2 preceding siblings ...)
  2017-02-24 15:43 ` [RFC][PATCH 03/10] fs: Introduce i_count() Peter Zijlstra
@ 2017-02-24 15:43 ` Peter Zijlstra
  2017-02-24 15:43 ` [RFC][PATCH 05/10] fs: Remove iput_final() Peter Zijlstra
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2017-02-24 15:43 UTC (permalink / raw)
  To: Al Viro, Linus Torvalds, Chris Mason
  Cc: linux-kernel, linux-fsdevel, David Howells, elena.reshetova,
	ishkamiel, dwindsor, gregkh, peterz

[-- Attachment #1: peterz-fs-inode-3b.patch --]
[-- Type: text/plain, Size: 1115 bytes --]

Preparatory patch for elimination of iput_final().

No functional change.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 fs/inode.c |   22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1528,19 +1528,21 @@ void iput(struct inode *inode)
 {
 	if (!inode)
 		return;
+
 	BUG_ON(inode->i_state & I_CLEAR);
 retry:
-	if (atomic_dec_and_lock(&inode->i_count, &inode->i_lock)) {
-		if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) {
-			atomic_inc(&inode->i_count);
-			inode->i_state &= ~I_DIRTY_TIME;
-			spin_unlock(&inode->i_lock);
-			trace_writeback_lazytime_iput(inode);
-			mark_inode_dirty_sync(inode);
-			goto retry;
-		}
-		iput_final(inode);
+	if (!atomic_dec_and_lock(&inode->i_count, &inode->i_lock))
+		return;
+
+	if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) {
+		atomic_inc(&inode->i_count);
+		inode->i_state &= ~I_DIRTY_TIME;
+		spin_unlock(&inode->i_lock);
+		trace_writeback_lazytime_iput(inode);
+		mark_inode_dirty_sync(inode);
+		goto retry;
 	}
+	iput_final(inode);
 }
 EXPORT_SYMBOL(iput);
 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC][PATCH 05/10] fs: Remove iput_final()
  2017-02-24 15:43 [RFC][PATCH 00/10] On inode::i_count and the usage vs reference count issue Peter Zijlstra
                   ` (3 preceding siblings ...)
  2017-02-24 15:43 ` [RFC][PATCH 04/10] fs: Restructure iput() Peter Zijlstra
@ 2017-02-24 15:43 ` Peter Zijlstra
  2017-02-24 15:43 ` [RFC][PATCH 06/10] fs: Rework i_count Peter Zijlstra
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2017-02-24 15:43 UTC (permalink / raw)
  To: Al Viro, Linus Torvalds, Chris Mason
  Cc: linux-kernel, linux-fsdevel, David Howells, elena.reshetova,
	ishkamiel, dwindsor, gregkh, peterz

[-- Attachment #1: peterz-fs-inode-3c.patch --]
[-- Type: text/plain, Size: 2905 bytes --]

Fold iput_final() into iput, its only caller. This prepares things for
future patches, also I feel iput_final() is a misnomer, since it can
still be non-final and leave the inode hashed and ready.

No functional change.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 fs/inode.c |   73 +++++++++++++++++++++++++++----------------------------------
 1 file changed, 33 insertions(+), 40 deletions(-)

--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1468,24 +1468,47 @@ int generic_delete_inode(struct inode *i
 }
 EXPORT_SYMBOL(generic_delete_inode);
 
-/*
- * Called when we're dropping the last reference
- * to an inode.
+/**
+ *	iput	- put an inode
+ *	@inode: inode to put
  *
- * Call the FS "drop_inode()" function, defaulting to
- * the legacy UNIX filesystem behaviour.  If it tells
- * us to evict inode, do so.  Otherwise, retain inode
- * in cache if fs is alive, sync and evict if fs is
- * shutting down.
+ *	Puts an inode, dropping its usage count. If the inode use count hits
+ *	zero, the inode is then freed and may also be destroyed.
+ *
+ *	Consequently, iput() can sleep.
  */
-static void iput_final(struct inode *inode)
+void iput(struct inode *inode)
 {
 	struct super_block *sb = inode->i_sb;
-	const struct super_operations *op = inode->i_sb->s_op;
+	const struct super_operations *op = sb->s_op;
 	int drop;
 
+	if (!inode)
+		return;
+
+	BUG_ON(inode->i_state & I_CLEAR);
+retry:
+	if (!atomic_dec_and_lock(&inode->i_count, &inode->i_lock))
+		return;
+
+	if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) {
+		atomic_inc(&inode->i_count);
+		inode->i_state &= ~I_DIRTY_TIME;
+		spin_unlock(&inode->i_lock);
+		trace_writeback_lazytime_iput(inode);
+		mark_inode_dirty_sync(inode);
+		goto retry;
+	}
+
 	WARN_ON(inode->i_state & I_NEW);
 
+	/*
+	 * Call the FS "drop_inode()" function, defaulting to
+	 * the legacy UNIX filesystem behaviour.  If it tells
+	 * us to evict inode, do so.  Otherwise, retain inode
+	 * in cache if fs is alive, sync and evict if fs is
+	 * shutting down.
+	 */
 	if (op->drop_inode)
 		drop = op->drop_inode(inode);
 	else
@@ -1514,36 +1537,6 @@ static void iput_final(struct inode *ino
 
 	evict(inode);
 }
-
-/**
- *	iput	- put an inode
- *	@inode: inode to put
- *
- *	Puts an inode, dropping its usage count. If the inode use count hits
- *	zero, the inode is then freed and may also be destroyed.
- *
- *	Consequently, iput() can sleep.
- */
-void iput(struct inode *inode)
-{
-	if (!inode)
-		return;
-
-	BUG_ON(inode->i_state & I_CLEAR);
-retry:
-	if (!atomic_dec_and_lock(&inode->i_count, &inode->i_lock))
-		return;
-
-	if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) {
-		atomic_inc(&inode->i_count);
-		inode->i_state &= ~I_DIRTY_TIME;
-		spin_unlock(&inode->i_lock);
-		trace_writeback_lazytime_iput(inode);
-		mark_inode_dirty_sync(inode);
-		goto retry;
-	}
-	iput_final(inode);
-}
 EXPORT_SYMBOL(iput);
 
 /**

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC][PATCH 06/10] fs: Rework i_count
  2017-02-24 15:43 [RFC][PATCH 00/10] On inode::i_count and the usage vs reference count issue Peter Zijlstra
                   ` (4 preceding siblings ...)
  2017-02-24 15:43 ` [RFC][PATCH 05/10] fs: Remove iput_final() Peter Zijlstra
@ 2017-02-24 15:43 ` Peter Zijlstra
  2017-02-24 20:49   ` Al Viro
  2017-02-24 15:43 ` [RFC][PATCH 07/10] orangefs: Use RCU for destroy_inode Peter Zijlstra
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2017-02-24 15:43 UTC (permalink / raw)
  To: Al Viro, Linus Torvalds, Chris Mason
  Cc: linux-kernel, linux-fsdevel, David Howells, elena.reshetova,
	ishkamiel, dwindsor, gregkh, peterz

[-- Attachment #1: peterz-fs-inode-3d2.patch --]
[-- Type: text/plain, Size: 4012 bytes --]

The inode count: i_count, does not conform to normal reference
counting semantics, its a usage count. That is, objects can and do
live without any references (i_count == 0).

This is because the reference from the inode hash table is not
accounted. This makes that find_inode_fast() can result in 0->1
transitions and similarly, iput() can do 1->0->1 transitions.

This patch changes things to include the inode hash table's reference
and reworks iput() to avoid spurious drops to 0.

It does mean that we now call super_operations::drop_inode() with
non-zero i_count, but I could not find an implementation where this
mattered. Only once we really decide to fully drop the inode and
remove it from the hash will we do the final dec_and_test.

This basically boils down to pushing part of iput() into
atomic_dec_and_lock().

Also, this would allow an RCU based find_inode*().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 fs/btrfs/inode.c   |    2 +-
 fs/inode.c         |   24 +++++++++++++++++++-----
 include/linux/fs.h |   11 ++++++++++-
 3 files changed, 30 insertions(+), 7 deletions(-)

--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3172,7 +3172,7 @@ void btrfs_add_delayed_iput(struct inode
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct btrfs_inode *binode = BTRFS_I(inode);
 
-	if (atomic_add_unless(&inode->i_count, -1, 1))
+	if (atomic_add_unless(&inode->i_count, -1, 2))
 		return;
 
 	spin_lock(&fs_info->delayed_iput_lock);
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -135,7 +135,7 @@ int inode_init_always(struct super_block
 	inode->i_sb = sb;
 	inode->i_blkbits = sb->s_blocksize_bits;
 	inode->i_flags = 0;
-	atomic_set(&inode->i_count, 1);
+	atomic_set(&inode->i_count, 2); /* hashed and ref */
 	inode->i_op = &empty_iops;
 	inode->i_fop = &no_open_fops;
 	inode->__i_nlink = 1;
@@ -387,7 +387,7 @@ static void init_once(void *foo)
 void __iget(struct inode *inode)
 {
 	lockdep_assert_held(&inode->i_lock);
-	atomic_inc(&inode->i_count);
+	WARN_ON(!atomic_inc_not_zero(&inode->i_count));
 }
 
 /*
@@ -395,7 +395,7 @@ void __iget(struct inode *inode)
  */
 void ihold(struct inode *inode)
 {
-	WARN_ON(atomic_inc_return(&inode->i_count) < 2);
+	WARN_ON(atomic_inc_return(&inode->i_count) < 3);
 }
 EXPORT_SYMBOL(ihold);
 
@@ -814,6 +814,8 @@ static struct inode *find_inode_fast(str
 {
 	struct inode *inode = NULL;
 
+	lockdep_assert_held(&inode_hash_lock);
+
 repeat:
 	hlist_for_each_entry(inode, head, i_hash) {
 		if (inode->i_ino != ino)
@@ -1488,11 +1490,12 @@ void iput(struct inode *inode)
 
 	BUG_ON(inode->i_state & I_CLEAR);
 retry:
-	if (!atomic_dec_and_lock(&inode->i_count, &inode->i_lock))
+	if (atomic_add_unless(&inode->i_count, -1, 2))
 		return;
 
+	spin_lock(&inode->i_lock);
+
 	if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) {
-		atomic_inc(&inode->i_count);
 		inode->i_state &= ~I_DIRTY_TIME;
 		spin_unlock(&inode->i_lock);
 		trace_writeback_lazytime_iput(inode);
@@ -1500,6 +1503,7 @@ void iput(struct inode *inode)
 		goto retry;
 	}
 
+	atomic_dec(&inode->i_count); /* 2 -> 1 */
 	WARN_ON(inode->i_state & I_NEW);
 
 	/*
@@ -1520,6 +1524,16 @@ void iput(struct inode *inode)
 		spin_unlock(&inode->i_lock);
 		return;
 	}
+
+	/*
+	 * If, at this point, only the hashtable has a reference left
+	 * continue to take the inode out, otherwise someone got a ref
+	 * while we weren't looking.
+	 */
+	if (atomic_cmpxchg(&inode->i_count, 1, 0) != 1) {
+		spin_unlock(&inode->i_lock);
+		return;
+	}
 
 	if (!drop) {
 		inode->i_state |= I_WILL_FREE;
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2713,7 +2713,16 @@ extern unsigned int get_next_ino(void);
 
 static inline int i_count(struct inode *inode)
 {
-	return atomic_read(&inode->i_count);
+	int i_count = atomic_read(&inode->i_count);
+
+	/*
+	 * In order to preserve the 'old' usage-count semantics, remove the
+	 * reference that the hash-table has.
+	 */
+	if (i_count)
+		i_count--;
+
+	return i_count;
 }
 
 extern void __iget(struct inode * inode);

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC][PATCH 07/10] orangefs: Use RCU for destroy_inode
  2017-02-24 15:43 [RFC][PATCH 00/10] On inode::i_count and the usage vs reference count issue Peter Zijlstra
                   ` (5 preceding siblings ...)
  2017-02-24 15:43 ` [RFC][PATCH 06/10] fs: Rework i_count Peter Zijlstra
@ 2017-02-24 15:43 ` Peter Zijlstra
  2017-02-24 20:52   ` Al Viro
  2017-02-24 15:43 ` [RFC][PATCH 08/10] fs: Do RCU versions for find_inode() Peter Zijlstra
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2017-02-24 15:43 UTC (permalink / raw)
  To: Al Viro, Linus Torvalds, Chris Mason
  Cc: linux-kernel, linux-fsdevel, David Howells, elena.reshetova,
	ishkamiel, dwindsor, gregkh, peterz

[-- Attachment #1: peterz-fs-inode-6.patch --]
[-- Type: text/plain, Size: 966 bytes --]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 fs/orangefs/super.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

--- a/fs/orangefs/super.c
+++ b/fs/orangefs/super.c
@@ -115,6 +115,13 @@ static struct inode *orangefs_alloc_inod
 	return &orangefs_inode->vfs_inode;
 }
 
+static void orangefs_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	struct orangefs_inode_s *orangefs_inode = ORANGEFS_I(inode);
+	kmem_cache_free(orangefs_inode_cache, orangefs_inode);
+}
+
 static void orangefs_destroy_inode(struct inode *inode)
 {
 	struct orangefs_inode_s *orangefs_inode = ORANGEFS_I(inode);
@@ -123,7 +130,7 @@ static void orangefs_destroy_inode(struc
 			"%s: deallocated %p destroying inode %pU\n",
 			__func__, orangefs_inode, get_khandle_from_ino(inode));
 
-	kmem_cache_free(orangefs_inode_cache, orangefs_inode);
+	call_rcu(&inode->i_rcu, orangefs_i_callback);
 }
 
 /*

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC][PATCH 08/10] fs: Do RCU versions for find_inode()
  2017-02-24 15:43 [RFC][PATCH 00/10] On inode::i_count and the usage vs reference count issue Peter Zijlstra
                   ` (6 preceding siblings ...)
  2017-02-24 15:43 ` [RFC][PATCH 07/10] orangefs: Use RCU for destroy_inode Peter Zijlstra
@ 2017-02-24 15:43 ` Peter Zijlstra
  2017-02-24 15:43 ` [RFC][PATCH 09/10] locking/refcount: Provide refcount_dec_unless() Peter Zijlstra
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2017-02-24 15:43 UTC (permalink / raw)
  To: Al Viro, Linus Torvalds, Chris Mason
  Cc: linux-kernel, linux-fsdevel, David Howells, elena.reshetova,
	ishkamiel, dwindsor, gregkh, peterz

[-- Attachment #1: peterz-fs-inode-5.patch --]
[-- Type: text/plain, Size: 7450 bytes --]

Now that i_count is a proper reference count, such that 0 means free
or freeing, and all .destroy_inode methods use RCU to free inodes, we
can trivially convert the inode hash to RCU and do RCU lookups.

So provide RCU variants of find_inode() and find_inode_fast(), in case
we do hit an inode with i_count==0, we fall back to the old code that
does a __wait_for_freeing_inode().

_However_ this makes the situation with using i_count() for decisions
far worse; those few that are in the evict/free path seem safe, the
rest is up for grabs.

If the rest of the sites were OK, they probably are no longer and need
help.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 fs/inode.c |  104 ++++++++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 76 insertions(+), 28 deletions(-)

--- a/fs/inode.c
+++ b/fs/inode.c
@@ -471,7 +471,7 @@ void __insert_inode_hash(struct inode *i
 
 	spin_lock(&inode_hash_lock);
 	spin_lock(&inode->i_lock);
-	hlist_add_head(&inode->i_hash, b);
+	hlist_add_head_rcu(&inode->i_hash, b);
 	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_hash_lock);
 }
@@ -487,7 +487,7 @@ void __remove_inode_hash(struct inode *i
 {
 	spin_lock(&inode_hash_lock);
 	spin_lock(&inode->i_lock);
-	hlist_del_init(&inode->i_hash);
+	hlist_del_init_rcu(&inode->i_hash);
 	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_hash_lock);
 }
@@ -777,15 +777,15 @@ long prune_icache_sb(struct super_block
 }
 
 static void __wait_on_freeing_inode(struct inode *inode);
-/*
- * Called with the inode lock held.
- */
-static struct inode *find_inode(struct super_block *sb,
+
+static struct inode *__find_inode(struct super_block *sb,
 				struct hlist_head *head,
 				int (*test)(struct inode *, void *),
 				void *data)
 {
-	struct inode *inode = NULL;
+	struct inode *inode;
+
+	lockdep_assert_held(&inode_hash_lock);
 
 repeat:
 	hlist_for_each_entry(inode, head, i_hash) {
@@ -805,14 +805,44 @@ static struct inode *find_inode(struct s
 	return NULL;
 }
 
+static struct inode *find_inode(struct super_block *sb,
+				struct hlist_head *head,
+				int (*test)(struct inode *, void *),
+				void *data)
+{
+	struct inode *inode;
+
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(inode, head, i_hash) {
+		if (inode->i_sb != sb)
+			continue;
+		if (!test(inode, data))
+			continue;
+		if (atomic_inc_not_zero(&inode->i_count))
+			goto out_unlock;
+		goto slow;
+	}
+	inode = NULL;
+out_unlock:
+	rcu_read_unlock();
+	return inode;
+
+slow:
+	rcu_read_unlock();
+	spin_lock(&inode_hash_lock);
+	inode = __find_inode(sb, head, test, data);
+	spin_unlock(&inode_hash_lock);
+	return inode;
+}
+
 /*
- * find_inode_fast is the fast path version of find_inode, see the comment at
+ * __find_inode_fast is the fast path version of __find_inode, see the comment at
  * iget_locked for details.
  */
-static struct inode *find_inode_fast(struct super_block *sb,
-				struct hlist_head *head, unsigned long ino)
+static struct inode *__find_inode_fast(struct super_block *sb,
+		struct hlist_head *head, unsigned long ino)
 {
-	struct inode *inode = NULL;
+	struct inode *inode;
 
 	lockdep_assert_held(&inode_hash_lock);
 
@@ -834,6 +864,34 @@ static struct inode *find_inode_fast(str
 	return NULL;
 }
 
+static struct inode *find_inode_fast(struct super_block *sb,
+		struct hlist_head *head, unsigned long ino)
+{
+	struct inode *inode;
+
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(inode, head, i_hash) {
+		if (inode->i_ino != ino)
+			continue;
+		if (inode->i_sb != sb)
+			continue;
+		if (atomic_inc_not_zero(&inode->i_count))
+			goto out_unlock;
+		goto slow;
+	}
+	inode = NULL;
+out_unlock:
+	rcu_read_unlock();
+	return inode;
+
+slow:
+	rcu_read_unlock();
+	spin_lock(&inode_hash_lock);
+	inode = __find_inode_fast(sb, head, ino);
+	spin_unlock(&inode_hash_lock);
+	return inode;
+}
+
 /*
  * Each cpu owns a range of LAST_INO_BATCH numbers.
  * 'shared_last_ino' is dirtied only once out of LAST_INO_BATCH allocations,
@@ -1026,10 +1084,7 @@ struct inode *iget5_locked(struct super_
 	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
 	struct inode *inode;
 again:
-	spin_lock(&inode_hash_lock);
 	inode = find_inode(sb, head, test, data);
-	spin_unlock(&inode_hash_lock);
-
 	if (inode) {
 		wait_on_inode(inode);
 		if (unlikely(inode_unhashed(inode))) {
@@ -1045,14 +1100,14 @@ struct inode *iget5_locked(struct super_
 
 		spin_lock(&inode_hash_lock);
 		/* We released the lock, so.. */
-		old = find_inode(sb, head, test, data);
+		old = __find_inode(sb, head, test, data);
 		if (!old) {
 			if (set(inode, data))
 				goto set_failed;
 
 			spin_lock(&inode->i_lock);
 			inode->i_state = I_NEW;
-			hlist_add_head(&inode->i_hash, head);
+			hlist_add_head_rcu(&inode->i_hash, head);
 			spin_unlock(&inode->i_lock);
 			inode_sb_list_add(inode);
 			spin_unlock(&inode_hash_lock);
@@ -1104,9 +1159,7 @@ struct inode *iget_locked(struct super_b
 	struct hlist_head *head = inode_hashtable + hash(sb, ino);
 	struct inode *inode;
 again:
-	spin_lock(&inode_hash_lock);
 	inode = find_inode_fast(sb, head, ino);
-	spin_unlock(&inode_hash_lock);
 	if (inode) {
 		wait_on_inode(inode);
 		if (unlikely(inode_unhashed(inode))) {
@@ -1122,12 +1175,12 @@ struct inode *iget_locked(struct super_b
 
 		spin_lock(&inode_hash_lock);
 		/* We released the lock, so.. */
-		old = find_inode_fast(sb, head, ino);
+		old = __find_inode_fast(sb, head, ino);
 		if (!old) {
 			inode->i_ino = ino;
 			spin_lock(&inode->i_lock);
 			inode->i_state = I_NEW;
-			hlist_add_head(&inode->i_hash, head);
+			hlist_add_head_rcu(&inode->i_hash, head);
 			spin_unlock(&inode->i_lock);
 			inode_sb_list_add(inode);
 			spin_unlock(&inode_hash_lock);
@@ -1258,9 +1311,7 @@ struct inode *ilookup5_nowait(struct sup
 	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
 	struct inode *inode;
 
-	spin_lock(&inode_hash_lock);
 	inode = find_inode(sb, head, test, data);
-	spin_unlock(&inode_hash_lock);
 
 	return inode;
 }
@@ -1313,10 +1364,7 @@ struct inode *ilookup(struct super_block
 	struct hlist_head *head = inode_hashtable + hash(sb, ino);
 	struct inode *inode;
 again:
-	spin_lock(&inode_hash_lock);
 	inode = find_inode_fast(sb, head, ino);
-	spin_unlock(&inode_hash_lock);
-
 	if (inode) {
 		wait_on_inode(inode);
 		if (unlikely(inode_unhashed(inode))) {
@@ -1345,7 +1393,7 @@ EXPORT_SYMBOL(ilookup);
  * the inode_hash_lock spinlock held.
  *
  * This is a even more generalized version of ilookup5() when the
- * function must never block --- find_inode() can block in
+ * function must never block --- __find_inode() can block in
  * __wait_on_freeing_inode() --- or when the caller can not increment
  * the reference count because the resulting iput() might cause an
  * inode eviction.  The tradeoff is that the @match funtion must be
@@ -1402,7 +1450,7 @@ int insert_inode_locked(struct inode *in
 		if (likely(!old)) {
 			spin_lock(&inode->i_lock);
 			inode->i_state |= I_NEW;
-			hlist_add_head(&inode->i_hash, head);
+			hlist_add_head_rcu(&inode->i_hash, head);
 			spin_unlock(&inode->i_lock);
 			spin_unlock(&inode_hash_lock);
 			return 0;
@@ -1445,7 +1493,7 @@ int insert_inode_locked4(struct inode *i
 		if (likely(!old)) {
 			spin_lock(&inode->i_lock);
 			inode->i_state |= I_NEW;
-			hlist_add_head(&inode->i_hash, head);
+			hlist_add_head_rcu(&inode->i_hash, head);
 			spin_unlock(&inode->i_lock);
 			spin_unlock(&inode_hash_lock);
 			return 0;

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC][PATCH 09/10] locking/refcount: Provide refcount_dec_unless()
  2017-02-24 15:43 [RFC][PATCH 00/10] On inode::i_count and the usage vs reference count issue Peter Zijlstra
                   ` (7 preceding siblings ...)
  2017-02-24 15:43 ` [RFC][PATCH 08/10] fs: Do RCU versions for find_inode() Peter Zijlstra
@ 2017-02-24 15:43 ` Peter Zijlstra
  2017-02-27  9:28   ` Reshetova, Elena
  2017-02-24 15:43 ` [RFC][PATCH 10/10] fs: Convert i_count over to refcount_t Peter Zijlstra
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2017-02-24 15:43 UTC (permalink / raw)
  To: Al Viro, Linus Torvalds, Chris Mason
  Cc: linux-kernel, linux-fsdevel, David Howells, elena.reshetova,
	ishkamiel, dwindsor, gregkh, peterz

[-- Attachment #1: peterz-ref-9.patch --]
[-- Type: text/plain, Size: 2303 bytes --]

By allowing a different unless value than 1, we can do dec_and_lock
like things on higher values, like 2, which is useful if the data
structure we lock also owns a reference (because in that case we'd
never hit 1).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/refcount.h |    8 +++++++-
 lib/refcount.c           |   14 ++++++++------
 2 files changed, 15 insertions(+), 7 deletions(-)

--- a/include/linux/refcount.h
+++ b/include/linux/refcount.h
@@ -35,7 +35,13 @@ extern __must_check bool refcount_dec_an
 extern void refcount_dec(refcount_t *r);
 
 extern __must_check bool refcount_dec_if_one(refcount_t *r);
-extern __must_check bool refcount_dec_not_one(refcount_t *r);
+extern __must_check bool refcount_dec_unless(refcount_t *r, unsigned int unless);
+
+static inline __must_check bool refcount_dec_not_one(refcount_t *r)
+{
+	return refcount_dec_unless(r, 1);
+}
+
 extern __must_check bool refcount_dec_and_mutex_lock(refcount_t *r, struct mutex *lock);
 extern __must_check bool refcount_dec_and_lock(refcount_t *r, spinlock_t *lock);
 
--- a/lib/refcount.c
+++ b/lib/refcount.c
@@ -174,12 +174,14 @@ bool refcount_dec_if_one(refcount_t *r)
 EXPORT_SYMBOL_GPL(refcount_dec_if_one);
 
 /*
- * No atomic_t counterpart, it decrements unless the value is 1, in which case
- * it will return false.
+ * No atomic_t counterpart, it decrements unless the value is as specified, in
+ * which case it will return false.
  *
- * Was often done like: atomic_add_unless(&var, -1, 1)
+ * Was often done like: atomic_add_unless(&var, -1, unless), where the most
+ * common variant has unless==1 is provided as a convenience wrapper, see
+ * refcount_dec_not_one().
  */
-bool refcount_dec_not_one(refcount_t *r)
+bool refcount_dec_unless(refcount_t *r, unsigned int unless)
 {
 	unsigned int new, val = atomic_read(&r->refs);
 
@@ -187,7 +189,7 @@ bool refcount_dec_not_one(refcount_t *r)
 		if (unlikely(val == UINT_MAX))
 			return true;
 
-		if (val == 1)
+		if (val == unless)
 			return false;
 
 		new = val - 1;
@@ -200,7 +202,7 @@ bool refcount_dec_not_one(refcount_t *r)
 
 	return true;
 }
-EXPORT_SYMBOL_GPL(refcount_dec_not_one);
+EXPORT_SYMBOL_GPL(refcount_dec_unless);
 
 /*
  * Similar to atomic_dec_and_mutex_lock(), it will WARN on underflow and fail

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC][PATCH 10/10] fs: Convert i_count over to refcount_t
  2017-02-24 15:43 [RFC][PATCH 00/10] On inode::i_count and the usage vs reference count issue Peter Zijlstra
                   ` (8 preceding siblings ...)
  2017-02-24 15:43 ` [RFC][PATCH 09/10] locking/refcount: Provide refcount_dec_unless() Peter Zijlstra
@ 2017-02-24 15:43 ` Peter Zijlstra
  2017-02-24 16:43 ` [RFC][PATCH 00/10] On inode::i_count and the usage vs reference count issue Christoph Hellwig
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2017-02-24 15:43 UTC (permalink / raw)
  To: Al Viro, Linus Torvalds, Chris Mason
  Cc: linux-kernel, linux-fsdevel, David Howells, elena.reshetova,
	ishkamiel, dwindsor, gregkh, peterz

[-- Attachment #1: peterz-fs-inode-4.patch --]
[-- Type: text/plain, Size: 5366 bytes --]

Now that i_count is a proper refcount, convert it to refcount_t.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 fs/btrfs/inode.c   |    2 +-
 fs/f2fs/super.c    |    4 ++--
 fs/inode.c         |   16 ++++++++--------
 fs/xfs/xfs_inode.c |    2 +-
 fs/xfs/xfs_inode.h |    2 +-
 include/linux/fs.h |    5 +++--
 kernel/futex.c     |    2 +-
 7 files changed, 17 insertions(+), 16 deletions(-)

--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3172,7 +3172,7 @@ void btrfs_add_delayed_iput(struct inode
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct btrfs_inode *binode = BTRFS_I(inode);
 
-	if (atomic_add_unless(&inode->i_count, -1, 2))
+	if (refcount_dec_unless(&inode->i_count, 2))
 		return;
 
 	spin_lock(&fs_info->delayed_iput_lock);
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -601,7 +601,7 @@ static int f2fs_drop_inode(struct inode
 	if ((!inode_unhashed(inode) && inode->i_state & I_SYNC)) {
 		if (!inode->i_nlink && !is_bad_inode(inode)) {
 			/* to avoid evict_inode call simultaneously */
-			atomic_inc(&inode->i_count);
+			refcount_inc(&inode->i_count);
 			spin_unlock(&inode->i_lock);
 
 			/* some remained atomic pages should discarded */
@@ -621,7 +621,7 @@ static int f2fs_drop_inode(struct inode
 
 			fscrypt_put_encryption_info(inode, NULL);
 			spin_lock(&inode->i_lock);
-			atomic_dec(&inode->i_count);
+			refcount_dec(&inode->i_count);
 		}
 		return 0;
 	}
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -135,7 +135,7 @@ int inode_init_always(struct super_block
 	inode->i_sb = sb;
 	inode->i_blkbits = sb->s_blocksize_bits;
 	inode->i_flags = 0;
-	atomic_set(&inode->i_count, 2); /* hashed and ref */
+	refcount_set(&inode->i_count, 2); /* hashed and ref */
 	inode->i_op = &empty_iops;
 	inode->i_fop = &no_open_fops;
 	inode->__i_nlink = 1;
@@ -387,7 +387,7 @@ static void init_once(void *foo)
 void __iget(struct inode *inode)
 {
 	lockdep_assert_held(&inode->i_lock);
-	WARN_ON(!atomic_inc_not_zero(&inode->i_count));
+	refcount_inc(&inode->i_count); /* must not be 0 */
 }
 
 /*
@@ -395,7 +395,7 @@ void __iget(struct inode *inode)
  */
 void ihold(struct inode *inode)
 {
-	WARN_ON(atomic_inc_return(&inode->i_count) < 3);
+	refcount_inc(&inode->i_count);
 }
 EXPORT_SYMBOL(ihold);
 
@@ -818,7 +818,7 @@ static struct inode *find_inode(struct s
 			continue;
 		if (!test(inode, data))
 			continue;
-		if (atomic_inc_not_zero(&inode->i_count))
+		if (refcount_inc_not_zero(&inode->i_count))
 			goto out_unlock;
 		goto slow;
 	}
@@ -875,7 +875,7 @@ static struct inode *find_inode_fast(str
 			continue;
 		if (inode->i_sb != sb)
 			continue;
-		if (atomic_inc_not_zero(&inode->i_count))
+		if (refcount_inc_not_zero(&inode->i_count))
 			goto out_unlock;
 		goto slow;
 	}
@@ -1538,7 +1538,7 @@ void iput(struct inode *inode)
 
 	BUG_ON(inode->i_state & I_CLEAR);
 retry:
-	if (atomic_add_unless(&inode->i_count, -1, 2))
+	if (refcount_dec_unless(&inode->i_count, 2))
 		return;
 
 	spin_lock(&inode->i_lock);
@@ -1551,7 +1551,7 @@ void iput(struct inode *inode)
 		goto retry;
 	}
 
-	atomic_dec(&inode->i_count); /* 2 -> 1 */
+	refcount_dec(&inode->i_count); /* 2 -> 1 */
 	WARN_ON(inode->i_state & I_NEW);
 
 	/*
@@ -1578,7 +1578,7 @@ void iput(struct inode *inode)
 	 * continue to take the inode out, otherwise someone got a ref
 	 * while we weren't looking.
 	 */
-	if (atomic_cmpxchg(&inode->i_count, 1, 0) != 1) {
+	if (!refcount_dec_if_one(&inode->i_count)) {
 		spin_unlock(&inode->i_lock);
 		return;
 	}
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1563,7 +1563,7 @@ xfs_itruncate_extents(
 	int			done = 0;
 
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
-	ASSERT(!atomic_read(&VFS_I(ip)->i_count) ||
+	ASSERT(!refcount_read(&VFS_I(ip)->i_count) ||
 	       xfs_isilocked(ip, XFS_IOLOCK_EXCL));
 	ASSERT(new_size <= XFS_ISIZE(ip));
 	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -475,7 +475,7 @@ static inline void xfs_setup_existing_in
 
 #define IHOLD(ip) \
 do { \
-	ASSERT(atomic_read(&VFS_I(ip)->i_count) > 0) ; \
+	ASSERT(refcount_read(&VFS_I(ip)->i_count) > 0) ; \
 	ihold(VFS_I(ip)); \
 	trace_xfs_ihold(ip, _THIS_IP_); \
 } while (0)
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -23,6 +23,7 @@
 #include <linux/fiemap.h>
 #include <linux/rculist_bl.h>
 #include <linux/atomic.h>
+#include <linux/refcount.h>
 #include <linux/shrinker.h>
 #include <linux/migrate_mode.h>
 #include <linux/uidgid.h>
@@ -622,7 +623,7 @@ struct inode {
 		struct rcu_head		i_rcu;
 	};
 	u64			i_version;
-	atomic_t		i_count;
+	refcount_t		i_count;
 	atomic_t		i_dio_count;
 	atomic_t		i_writecount;
 #ifdef CONFIG_IMA
@@ -2713,7 +2714,7 @@ extern unsigned int get_next_ino(void);
 
 static inline int i_count(struct inode *inode)
 {
-	int i_count = atomic_read(&inode->i_count);
+	int i_count = refcount_read(&inode->i_count);
 
 	/*
 	 * In order to preserve the 'old' usage-count semantics, remove the
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -676,7 +676,7 @@ get_futex_key(u32 __user *uaddr, int fsh
 		 * cases, therefore a successful atomic_inc return below will
 		 * guarantee that get_futex_key() will still imply smp_mb(); (B).
 		 */
-		if (WARN_ON_ONCE(!atomic_inc_not_zero(&inode->i_count))) {
+		if (WARN_ON_ONCE(!refcount_inc_not_zero(&inode->i_count))) {
 			rcu_read_unlock();
 			put_page(page);
 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 00/10] On inode::i_count and the usage vs reference count issue
  2017-02-24 15:43 [RFC][PATCH 00/10] On inode::i_count and the usage vs reference count issue Peter Zijlstra
                   ` (9 preceding siblings ...)
  2017-02-24 15:43 ` [RFC][PATCH 10/10] fs: Convert i_count over to refcount_t Peter Zijlstra
@ 2017-02-24 16:43 ` Christoph Hellwig
  2017-02-24 17:07   ` Peter Zijlstra
  2017-02-24 20:59   ` David Windsor
       [not found] ` <CA+55aFy1bNbsX_3T-s_EUwTP-r_SmJJMvB3=-2nffehFVP=EdQ@mail.gmail.com>
  2017-02-24 20:42 ` Al Viro
  12 siblings, 2 replies; 26+ messages in thread
From: Christoph Hellwig @ 2017-02-24 16:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Al Viro, Linus Torvalds, Chris Mason, linux-kernel,
	linux-fsdevel, David Howells, elena.reshetova, ishkamiel,
	dwindsor, gregkh

Usage counts are common and useful, so for now they should stay as-is
and if people can came up with a useful primitive for them we can
consider implementing it.

Trying to shoe-horn everything into refcount_t is a horrible idea.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 02/10] fs: Avoid looking at i_count without i_lock held
       [not found]   ` <CA+55aFxLw8FXf61rsGYDjA1tS=joDeaF7OSgaepLWwcz4zt=dg@mail.gmail.com>
@ 2017-02-24 17:06     ` Peter Zijlstra
  0 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2017-02-24 17:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hans Liljestrand, Al Viro, linux-kernel, dwindsor, David Howells,
	linux-fsdevel, Chris Mason, gregkh, elena.reshetova

On Fri, Feb 24, 2017 at 08:56:06AM -0800, Linus Torvalds wrote:
> This one looks very questionable.
> 
> Taking the lock for every single inode, even when we can tell that it's
> pointless, is horrid.
> 
> Even if you really think you found a race, i think it would be better to
> leave the unlocked read around as an optimistic check, and then do a safe
> double check read inside the lock (together with the i_state checks)
> 
> Hmm?

Yeah, pretty dumb that.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 00/10] On inode::i_count and the usage vs reference count issue
  2017-02-24 16:43 ` [RFC][PATCH 00/10] On inode::i_count and the usage vs reference count issue Christoph Hellwig
@ 2017-02-24 17:07   ` Peter Zijlstra
  2017-02-24 20:59   ` David Windsor
  1 sibling, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2017-02-24 17:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Al Viro, Linus Torvalds, Chris Mason, linux-kernel,
	linux-fsdevel, David Howells, elena.reshetova, ishkamiel,
	dwindsor, gregkh

On Fri, Feb 24, 2017 at 08:43:30AM -0800, Christoph Hellwig wrote:
> Usage counts are common and useful, so for now they should stay as-is
> and if people can came up with a useful primitive for them we can
> consider implementing it.
> 
> Trying to shoe-horn everything into refcount_t is a horrible idea.

Sure; and like I said that is a perfectly fine option. I just wanted to
see how horrible this ended up.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Fwd: [RFC][PATCH 00/10] On inode::i_count and the usage vs reference count issue
       [not found]     ` <CA+55aFyR8wkHps5_AqUqzx8MDMNxRZZ7+MYH9g=ZCUi=4Oey8w@mail.gmail.com>
@ 2017-02-24 19:24       ` Linus Torvalds
  0 siblings, 0 replies; 26+ messages in thread
From: Linus Torvalds @ 2017-02-24 19:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hans Liljestrand, Al Viro, Linux Kernel Mailing List,
	David Windsor, David Howells, linux-fsdevel, Chris Mason,
	Greg Kroah-Hartman, Reshetova, Elena

[ Gaah, resending because I was once again bitten by the html email
problem on android and the list rules.  Sorry for duplicates for the
people who didn't bounce the html ]

On Feb 24, 2017 08:39, "Peter Zijlstra" <peterz@infradead.org> wrote:
>
> To make matters worse; once i_count is a refcount, it appears trivial to avoid
> inode_hash_lock for lookups (yay RCU!) and looking at i_count becomes even more
> of a problem because then holding i_lock will not in fact stabilize it anymore.

Note that if this is the main reason for the series, I would argue
against this all.

Inode lookup is simply not an important function. Inodes just aren't
the primary data structure in the vfs layer, and they are seldom
looked up, since the common operation is too look up the dentry (that
then has a direct pointer to the inode).

Inode lookup tends too happen in places like file creation etc, where
the real costs are elsewhere (ie complex locking cost for directories
and for filesystem inode number generation etc).

So you actually have a load where this all is even noticeable? If I
recall correctly, the loads where we have seen inode locking etc
issues have been things like very high rate socket create/destroy, but
those were generally disable by just avoiding all the inode games
entirely (is not putting the inode on any superblock lists, and never
looking anything up, just always allocating a new inode).

Also note that for the same reason, the usual refcount overflow
advantages are rather questionable. There are very few things that
increment the inode use count, and they aren't generally under direct
user control. Users get references to dentries, not inodes, when they
open or map files. So the inode counts tend to be limited by the
number of hard links you can have to things (or number of mounts you
can do) which are much more limited than the refcount anyway.

So I get the feeling that you should worry about other refcount users
long before you worry about inodes. They may be very central in
traditional UNIX, but not so much in Linux (any more).

                    Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 00/10] On inode::i_count and the usage vs reference count issue
  2017-02-24 15:43 [RFC][PATCH 00/10] On inode::i_count and the usage vs reference count issue Peter Zijlstra
                   ` (11 preceding siblings ...)
       [not found] ` <CA+55aFy1bNbsX_3T-s_EUwTP-r_SmJJMvB3=-2nffehFVP=EdQ@mail.gmail.com>
@ 2017-02-24 20:42 ` Al Viro
  12 siblings, 0 replies; 26+ messages in thread
From: Al Viro @ 2017-02-24 20:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Chris Mason, linux-kernel, linux-fsdevel,
	David Howells, elena.reshetova, ishkamiel, dwindsor, gregkh

On Fri, Feb 24, 2017 at 04:43:29PM +0100, Peter Zijlstra wrote:
 
> There's a number of options here:
> 
>  - I'm not completely insane, and these patches can be made to work.
> 
>  - We decide usage-counts are useful and try and support them in refcount_t;
>    this has the down-side that people can more easily write bad code (by doing
>    from 0 increments that should not have happened).
> 
>  - We decide usage-counts need their own type (urgh, more...).
> 
>  - None of the above, we keep i_count as is and let people hunt and convert
>    actual refcounts.

The last one; if some object has non-trivial lifetime rules, don't try to
shoehorn it into refcount_t.  VFS-side the same goes for
struct dentry		(non-trivial lifetime and locking rules)
struct mount		(per-CPU fun, barriers, etc.)
struct super_block	(non-trivial lifecycle and lifetime rules)

I'm not sure if struct file is a good match, BTW - net/unix/garbage.c would
be one place in need of a careful looking into if we went for it.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 06/10] fs: Rework i_count
  2017-02-24 15:43 ` [RFC][PATCH 06/10] fs: Rework i_count Peter Zijlstra
@ 2017-02-24 20:49   ` Al Viro
  0 siblings, 0 replies; 26+ messages in thread
From: Al Viro @ 2017-02-24 20:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Chris Mason, linux-kernel, linux-fsdevel,
	David Howells, elena.reshetova, ishkamiel, dwindsor, gregkh

On Fri, Feb 24, 2017 at 04:43:35PM +0100, Peter Zijlstra wrote:

>  {
> -	return atomic_read(&inode->i_count);
> +	int i_count = atomic_read(&inode->i_count);
> +
> +	/*
> +	 * In order to preserve the 'old' usage-count semantics, remove the
> +	 * reference that the hash-table has.

What does it have to do with hashtable, when you are bumping it for _all_
inodes, hashed or not hashed?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 07/10] orangefs: Use RCU for destroy_inode
  2017-02-24 15:43 ` [RFC][PATCH 07/10] orangefs: Use RCU for destroy_inode Peter Zijlstra
@ 2017-02-24 20:52   ` Al Viro
  2017-02-24 23:00     ` Mike Marshall
  0 siblings, 1 reply; 26+ messages in thread
From: Al Viro @ 2017-02-24 20:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Chris Mason, linux-kernel, linux-fsdevel,
	David Howells, elena.reshetova, ishkamiel, dwindsor, gregkh

That, AFAICS, fixes a real bug.  Applied, and it needs Cc:stable as well.


> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  fs/orangefs/super.c |    9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> --- a/fs/orangefs/super.c
> +++ b/fs/orangefs/super.c
> @@ -115,6 +115,13 @@ static struct inode *orangefs_alloc_inod
>  	return &orangefs_inode->vfs_inode;
>  }
>  
> +static void orangefs_i_callback(struct rcu_head *head)
> +{
> +	struct inode *inode = container_of(head, struct inode, i_rcu);
> +	struct orangefs_inode_s *orangefs_inode = ORANGEFS_I(inode);
> +	kmem_cache_free(orangefs_inode_cache, orangefs_inode);
> +}
> +
>  static void orangefs_destroy_inode(struct inode *inode)
>  {
>  	struct orangefs_inode_s *orangefs_inode = ORANGEFS_I(inode);
> @@ -123,7 +130,7 @@ static void orangefs_destroy_inode(struc
>  			"%s: deallocated %p destroying inode %pU\n",
>  			__func__, orangefs_inode, get_khandle_from_ino(inode));
>  
> -	kmem_cache_free(orangefs_inode_cache, orangefs_inode);
> +	call_rcu(&inode->i_rcu, orangefs_i_callback);
>  }
>  
>  /*
> 
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 00/10] On inode::i_count and the usage vs reference count issue
  2017-02-24 16:43 ` [RFC][PATCH 00/10] On inode::i_count and the usage vs reference count issue Christoph Hellwig
  2017-02-24 17:07   ` Peter Zijlstra
@ 2017-02-24 20:59   ` David Windsor
  1 sibling, 0 replies; 26+ messages in thread
From: David Windsor @ 2017-02-24 20:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Peter Zijlstra, Al Viro, Linus Torvalds, Chris Mason, LKML,
	linux-fsdevel, David Howells, Reshetova, Elena, Hans Liljestrand,
	Greg KH

On Fri, Feb 24, 2017 at 11:43 AM, Christoph Hellwig <hch@infradead.org> wrote:
> Usage counts are common and useful, so for now they should stay as-is
> and if people can came up with a useful primitive for them we can
> consider implementing it.
>

While developing the refcount_t API, we used coccinelle to find areas
where converted atomic_t->refcount_t variables were being initialized
to 0 (rather than 1).

If we're considering a usagecount_t type, this may be a good starting
point for determining just how pervasive usage counts actually are (at
least the ones that we've converted [or attempted to convert] to
refcount_t).  Usage count types not atomic_t before the refcount_t
conversion obviously weren't included in our search.

Note that some of these have already been converted to refcounts,
either by adding a +1 bias, or some other workaround.  Others are
simply too confusing and we haven't decided what to do with them yet;
I've marked these as (not converted to refcount_t) below.

Some preemptive flame retardant: I realize that not all of these are
"usage" counts.  They _do_ all fall into the category of refcount_t
(formerly atomic_t) variables that are initialized to 0, a common
usage count idiom.  This is the same thing that's going on with struct
inode.i_count and therefore relevant to this discussion.

Not (yet) converted to refcount_t:
fs/nfsd/state.c: struct nfsd4_session.se_ref
fs/nfsd/state.c: struct nfsd4_client.cl_refcount
fs/hfsplus/hfsplus_fs.h: struct hfsplus_inode_info.opencnt
fs/xfs/xfs_log_priv.h: struct xlog_in_core.ic_refcnt
sound/usb/usbaudio.h: struct snd_usb_audio.usage_count
tools/perf/util/evlist.h: struct perf_mmap.refcnt
include/net/nf_conntrack.h: struct nf_conntrack.use
include/net/ip_vs.h: struct ip_vs_service.refcnt

Converted to refcount_t:
fs/fuse/fuse_i.h: struct fuse_file.count
fs/btrfs/delayed-inode.h: struct btrfs_delayed_inode.refs
fs/btrfs/compression.c: struct compressed_bio.pending_bios
fs/btrfs/extent_io.h: struct extent_buffer.io_pages
net/packet/internal.h: struct packet_fanout.sk_ref
xfs/xfs_buf.h: struct xfs_buf.b_lru_ref
sound/core/seq/seq_ports.h: snd_seq_subscribers.ref_count
tools/perf/util/comm.c: struct comm_str.refcnt
include/net/bluetooth/hci_core.h: struct hci_conn.refcnt
include/linux/filter.h: struct sk_filter.refcnt
include/linux/skbuff.h struct sk_buff.users

There are 7 more instances in drivers/, 3 of which haven't (yet) been
converted to refcount_t, but I'm too lazy to continue with this list.

> Trying to shoe-horn everything into refcount_t is a horrible idea.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 07/10] orangefs: Use RCU for destroy_inode
  2017-02-24 20:52   ` Al Viro
@ 2017-02-24 23:00     ` Mike Marshall
  2017-02-25 20:31       ` Mike Marshall
  0 siblings, 1 reply; 26+ messages in thread
From: Mike Marshall @ 2017-02-24 23:00 UTC (permalink / raw)
  To: Al Viro
  Cc: Peter Zijlstra, Linus Torvalds, Chris Mason, LKML, linux-fsdevel,
	David Howells, elena.reshetova, ishkamiel, dwindsor,
	Greg Kroah-Hartman

Thanks Al... I was going to try and evaluate that patch next
week, now all I have to do is test it <g> ...

-Mike

On Fri, Feb 24, 2017 at 3:52 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> That, AFAICS, fixes a real bug.  Applied, and it needs Cc:stable as well.
>
>
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> ---
>>  fs/orangefs/super.c |    9 ++++++++-
>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> --- a/fs/orangefs/super.c
>> +++ b/fs/orangefs/super.c
>> @@ -115,6 +115,13 @@ static struct inode *orangefs_alloc_inod
>>       return &orangefs_inode->vfs_inode;
>>  }
>>
>> +static void orangefs_i_callback(struct rcu_head *head)
>> +{
>> +     struct inode *inode = container_of(head, struct inode, i_rcu);
>> +     struct orangefs_inode_s *orangefs_inode = ORANGEFS_I(inode);
>> +     kmem_cache_free(orangefs_inode_cache, orangefs_inode);
>> +}
>> +
>>  static void orangefs_destroy_inode(struct inode *inode)
>>  {
>>       struct orangefs_inode_s *orangefs_inode = ORANGEFS_I(inode);
>> @@ -123,7 +130,7 @@ static void orangefs_destroy_inode(struc
>>                       "%s: deallocated %p destroying inode %pU\n",
>>                       __func__, orangefs_inode, get_khandle_from_ino(inode));
>>
>> -     kmem_cache_free(orangefs_inode_cache, orangefs_inode);
>> +     call_rcu(&inode->i_rcu, orangefs_i_callback);
>>  }
>>
>>  /*
>>
>>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 07/10] orangefs: Use RCU for destroy_inode
  2017-02-24 23:00     ` Mike Marshall
@ 2017-02-25 20:31       ` Mike Marshall
  2017-02-27  0:34         ` Mike Marshall
  2017-02-27  8:44         ` David Howells
  0 siblings, 2 replies; 26+ messages in thread
From: Mike Marshall @ 2017-02-25 20:31 UTC (permalink / raw)
  To: Al Viro
  Cc: Peter Zijlstra, Linus Torvalds, Chris Mason, LKML, linux-fsdevel,
	David Howells, elena.reshetova, Hans Liljestrand, David Windsor,
	Greg Kroah-Hartman

After looking through the code and seeing how some other filesystems
use call_rcu, it seems that call_rcu has to do with consistency and
waiting for stuff to complete before returning an object to the slab cache,
whereas we were just calling kmem_cache_free without worrying about that
kind of stuff...

Is that a "close enough" description of the error that is being
fixed here?

-Mike

On Fri, Feb 24, 2017 at 6:00 PM, Mike Marshall <hubcap@omnibond.com> wrote:
> Thanks Al... I was going to try and evaluate that patch next
> week, now all I have to do is test it <g> ...
>
> -Mike
>
> On Fri, Feb 24, 2017 at 3:52 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> That, AFAICS, fixes a real bug.  Applied, and it needs Cc:stable as well.
>>
>>
>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>>> ---
>>>  fs/orangefs/super.c |    9 ++++++++-
>>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>>
>>> --- a/fs/orangefs/super.c
>>> +++ b/fs/orangefs/super.c
>>> @@ -115,6 +115,13 @@ static struct inode *orangefs_alloc_inod
>>>       return &orangefs_inode->vfs_inode;
>>>  }
>>>
>>> +static void orangefs_i_callback(struct rcu_head *head)
>>> +{
>>> +     struct inode *inode = container_of(head, struct inode, i_rcu);
>>> +     struct orangefs_inode_s *orangefs_inode = ORANGEFS_I(inode);
>>> +     kmem_cache_free(orangefs_inode_cache, orangefs_inode);
>>> +}
>>> +
>>>  static void orangefs_destroy_inode(struct inode *inode)
>>>  {
>>>       struct orangefs_inode_s *orangefs_inode = ORANGEFS_I(inode);
>>> @@ -123,7 +130,7 @@ static void orangefs_destroy_inode(struc
>>>                       "%s: deallocated %p destroying inode %pU\n",
>>>                       __func__, orangefs_inode, get_khandle_from_ino(inode));
>>>
>>> -     kmem_cache_free(orangefs_inode_cache, orangefs_inode);
>>> +     call_rcu(&inode->i_rcu, orangefs_i_callback);
>>>  }
>>>
>>>  /*
>>>
>>>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 07/10] orangefs: Use RCU for destroy_inode
  2017-02-25 20:31       ` Mike Marshall
@ 2017-02-27  0:34         ` Mike Marshall
  2017-02-27  1:20           ` Linus Torvalds
  2017-02-27  8:44         ` David Howells
  1 sibling, 1 reply; 26+ messages in thread
From: Mike Marshall @ 2017-02-27  0:34 UTC (permalink / raw)
  To: Al Viro
  Cc: Peter Zijlstra, Linus Torvalds, Chris Mason, LKML, linux-fsdevel,
	David Howells, elena.reshetova, Hans Liljestrand, David Windsor,
	Greg Kroah-Hartman

Since Orangefs uses ref-walk, not rcu-walk, this patch with call_rcu
has seemed weird to me.

Using the git log, I searched back to where it seems to me call_rcu was
added, a giant patch from 2005 by David Howells which includes tons of
source and a large amount of documentation.

It seems that the call back function in call_rcu is used to destroy objects
only after readers are known to be finished, the mechanism by which
the readers are known to be finished is described in the documentation.

Perhaps I shouldn't think that Orangefs doesn't use rcu-walk, rather
that it switches to ref-walk from rcu-walk when d_revalidate returns
ECHILD (which it does right away).

-Mike

On Sat, Feb 25, 2017 at 3:31 PM, Mike Marshall <hubcap@omnibond.com> wrote:
> After looking through the code and seeing how some other filesystems
> use call_rcu, it seems that call_rcu has to do with consistency and
> waiting for stuff to complete before returning an object to the slab cache,
> whereas we were just calling kmem_cache_free without worrying about that
> kind of stuff...
>
> Is that a "close enough" description of the error that is being
> fixed here?
>
> -Mike
>
> On Fri, Feb 24, 2017 at 6:00 PM, Mike Marshall <hubcap@omnibond.com> wrote:
>> Thanks Al... I was going to try and evaluate that patch next
>> week, now all I have to do is test it <g> ...
>>
>> -Mike
>>
>> On Fri, Feb 24, 2017 at 3:52 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>> That, AFAICS, fixes a real bug.  Applied, and it needs Cc:stable as well.
>>>
>>>
>>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>>>> ---
>>>>  fs/orangefs/super.c |    9 ++++++++-
>>>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>>>
>>>> --- a/fs/orangefs/super.c
>>>> +++ b/fs/orangefs/super.c
>>>> @@ -115,6 +115,13 @@ static struct inode *orangefs_alloc_inod
>>>>       return &orangefs_inode->vfs_inode;
>>>>  }
>>>>
>>>> +static void orangefs_i_callback(struct rcu_head *head)
>>>> +{
>>>> +     struct inode *inode = container_of(head, struct inode, i_rcu);
>>>> +     struct orangefs_inode_s *orangefs_inode = ORANGEFS_I(inode);
>>>> +     kmem_cache_free(orangefs_inode_cache, orangefs_inode);
>>>> +}
>>>> +
>>>>  static void orangefs_destroy_inode(struct inode *inode)
>>>>  {
>>>>       struct orangefs_inode_s *orangefs_inode = ORANGEFS_I(inode);
>>>> @@ -123,7 +130,7 @@ static void orangefs_destroy_inode(struc
>>>>                       "%s: deallocated %p destroying inode %pU\n",
>>>>                       __func__, orangefs_inode, get_khandle_from_ino(inode));
>>>>
>>>> -     kmem_cache_free(orangefs_inode_cache, orangefs_inode);
>>>> +     call_rcu(&inode->i_rcu, orangefs_i_callback);
>>>>  }
>>>>
>>>>  /*
>>>>
>>>>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 07/10] orangefs: Use RCU for destroy_inode
  2017-02-27  0:34         ` Mike Marshall
@ 2017-02-27  1:20           ` Linus Torvalds
  0 siblings, 0 replies; 26+ messages in thread
From: Linus Torvalds @ 2017-02-27  1:20 UTC (permalink / raw)
  To: Mike Marshall
  Cc: Al Viro, Peter Zijlstra, Chris Mason, LKML, linux-fsdevel,
	David Howells, Reshetova, Elena, Hans Liljestrand, David Windsor,
	Greg Kroah-Hartman

On Sun, Feb 26, 2017 at 4:34 PM, Mike Marshall <hubcap@omnibond.com> wrote:
> Since Orangefs uses ref-walk, not rcu-walk, this patch with call_rcu
> has seemed weird to me.

Even if orangefs never really allows RCU walking, the VFS layer will
look up dentries - and look at their inodes - from RCU. It will then
call into the filesystem

So even if orangefs always returned ECHILD (it doesn't - it has a
timeout) - we'd be following the dentry inode pointer in jus a RCU
read region.

           Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 07/10] orangefs: Use RCU for destroy_inode
  2017-02-25 20:31       ` Mike Marshall
  2017-02-27  0:34         ` Mike Marshall
@ 2017-02-27  8:44         ` David Howells
  2017-02-27 14:44           ` Mike Marshall
  1 sibling, 1 reply; 26+ messages in thread
From: David Howells @ 2017-02-27  8:44 UTC (permalink / raw)
  To: Mike Marshall
  Cc: dhowells, Al Viro, Peter Zijlstra, Linus Torvalds, Chris Mason,
	LKML, linux-fsdevel, elena.reshetova, Hans Liljestrand,
	David Windsor, Greg Kroah-Hartman

Mike Marshall <hubcap@omnibond.com> wrote:

> Using the git log, I searched back to where it seems to me call_rcu was
> added, a giant patch from 2005 by David Howells which includes tons of
> source and a large amount of documentation.

I'm pretty sure Paul McKenney added call_rcu(), not me.

David

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [RFC][PATCH 09/10] locking/refcount: Provide refcount_dec_unless()
  2017-02-24 15:43 ` [RFC][PATCH 09/10] locking/refcount: Provide refcount_dec_unless() Peter Zijlstra
@ 2017-02-27  9:28   ` Reshetova, Elena
  0 siblings, 0 replies; 26+ messages in thread
From: Reshetova, Elena @ 2017-02-27  9:28 UTC (permalink / raw)
  To: Peter Zijlstra, Al Viro, Linus Torvalds, Chris Mason
  Cc: linux-kernel, linux-fsdevel, David Howells, ishkamiel, dwindsor, gregkh

> By allowing a different unless value than 1, we can do dec_and_lock
> like things on higher values, like 2, which is useful if the data
> structure we lock also owns a reference (because in that case we'd
> never hit 1).
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/refcount.h |    8 +++++++-
>  lib/refcount.c           |   14 ++++++++------
>  2 files changed, 15 insertions(+), 7 deletions(-)

Even if we decide not to touch inode->i_count because people believe it doesn't make that much sense, 
this API extension would be highly useful in other places. The consequences if an usage counter overflows can be also very much undesirable and I don't think the code that uses them is prepared to handle such cases. 
So we also need a way to address it if we want to get rid of problem in general.  
In some cases you can do a general +1 on counting scheme, benefit from an existing lock and convert them to refcount_t, but sometimes it is really not appropriate to take a new generic lock.

So, there are basically two ways for converting these cases:

- create a new type, usagecounter_t (put whatever name you like) (as we discussed before a number of times) and provide below functions there
- extend refcount_t type to have the below functions 

First approach might be overall cleaner, but extend even further the set of atomic-like primitives we already have. Some people get confused when given too many choices and default to atomic_t as "old, default option". We don't want this to happen.
Second approach is more compact, but its usage needs to be documented more precisely. Personally it is hard for me to think of a case when adding refcount_dec_unless() to API
would result in people misusing refcount_t API more than they would do without it. But maybe I am not creative enough? 

Best Regards,
Elena. 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 07/10] orangefs: Use RCU for destroy_inode
  2017-02-27  8:44         ` David Howells
@ 2017-02-27 14:44           ` Mike Marshall
  0 siblings, 0 replies; 26+ messages in thread
From: Mike Marshall @ 2017-02-27 14:44 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Peter Zijlstra, Linus Torvalds, Chris Mason, LKML,
	linux-fsdevel, elena.reshetova, Hans Liljestrand, David Windsor,
	Greg Kroah-Hartman

Hi... sorry if I got the attribution wrong... the commit I studied
was:

commit 76d8aeabfeb1c42641a81c44280177b9a08670d8
Author: David Howells <dhowells@redhat.com>
Date:   Thu Jun 23 22:00:49 2005 -0700

It was huge <g> ...

-Mike


On Mon, Feb 27, 2017 at 3:44 AM, David Howells <dhowells@redhat.com> wrote:
> Mike Marshall <hubcap@omnibond.com> wrote:
>
>> Using the git log, I searched back to where it seems to me call_rcu was
>> added, a giant patch from 2005 by David Howells which includes tons of
>> source and a large amount of documentation.
>
> I'm pretty sure Paul McKenney added call_rcu(), not me.
>
> David

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2017-02-27 14:45 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-24 15:43 [RFC][PATCH 00/10] On inode::i_count and the usage vs reference count issue Peter Zijlstra
2017-02-24 15:43 ` [RFC][PATCH 01/10] fs: Use lockdep_assert_held() instead of comments Peter Zijlstra
2017-02-24 15:43 ` [RFC][PATCH 02/10] fs: Avoid looking at i_count without i_lock held Peter Zijlstra
     [not found]   ` <CA+55aFxLw8FXf61rsGYDjA1tS=joDeaF7OSgaepLWwcz4zt=dg@mail.gmail.com>
2017-02-24 17:06     ` Peter Zijlstra
2017-02-24 15:43 ` [RFC][PATCH 03/10] fs: Introduce i_count() Peter Zijlstra
2017-02-24 15:43 ` [RFC][PATCH 04/10] fs: Restructure iput() Peter Zijlstra
2017-02-24 15:43 ` [RFC][PATCH 05/10] fs: Remove iput_final() Peter Zijlstra
2017-02-24 15:43 ` [RFC][PATCH 06/10] fs: Rework i_count Peter Zijlstra
2017-02-24 20:49   ` Al Viro
2017-02-24 15:43 ` [RFC][PATCH 07/10] orangefs: Use RCU for destroy_inode Peter Zijlstra
2017-02-24 20:52   ` Al Viro
2017-02-24 23:00     ` Mike Marshall
2017-02-25 20:31       ` Mike Marshall
2017-02-27  0:34         ` Mike Marshall
2017-02-27  1:20           ` Linus Torvalds
2017-02-27  8:44         ` David Howells
2017-02-27 14:44           ` Mike Marshall
2017-02-24 15:43 ` [RFC][PATCH 08/10] fs: Do RCU versions for find_inode() Peter Zijlstra
2017-02-24 15:43 ` [RFC][PATCH 09/10] locking/refcount: Provide refcount_dec_unless() Peter Zijlstra
2017-02-27  9:28   ` Reshetova, Elena
2017-02-24 15:43 ` [RFC][PATCH 10/10] fs: Convert i_count over to refcount_t Peter Zijlstra
2017-02-24 16:43 ` [RFC][PATCH 00/10] On inode::i_count and the usage vs reference count issue Christoph Hellwig
2017-02-24 17:07   ` Peter Zijlstra
2017-02-24 20:59   ` David Windsor
     [not found] ` <CA+55aFy1bNbsX_3T-s_EUwTP-r_SmJJMvB3=-2nffehFVP=EdQ@mail.gmail.com>
     [not found]   ` <CA+55aFz0DbAGZ8gc+s35nm1N5frXjK_NOh7QzuSfZeJbjsT6Sg@mail.gmail.com>
     [not found]     ` <CA+55aFyR8wkHps5_AqUqzx8MDMNxRZZ7+MYH9g=ZCUi=4Oey8w@mail.gmail.com>
2017-02-24 19:24       ` Fwd: " Linus Torvalds
2017-02-24 20:42 ` Al Viro

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.