linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] refcounting improvements in sysfs.
@ 2010-03-24  3:20 NeilBrown
  2010-03-24  3:20 ` [PATCH 3/3] kref: create karef and use for sysfs_dirent->s_active NeilBrown
                   ` (4 more replies)
  0 siblings, 5 replies; 20+ messages in thread
From: NeilBrown @ 2010-03-24  3:20 UTC (permalink / raw)
  To: Greg Kroah-Hartman; +Cc: linux-kernel

This series tidies up the refcount of sysfs_dirents in sysfs,
using kref where appropriate and a new karef for s_active.
This achieves significant code simplification, especially the first
patch.

This is in part inspired by http://lwn.net/Articles/336224/ :-)

NeilBrown

---

NeilBrown (3):
      sysfs: simplify handling for s_active refcount
      sysfs: make s_count a kref
      kref: create karef and use for sysfs_dirent->s_active


 fs/sysfs/dir.c       |   74 ++++++++++++++++++++------------------------------
 fs/sysfs/mount.c     |    3 +-
 fs/sysfs/sysfs.h     |   19 +++++--------
 include/linux/kref.h |   38 ++++++++++++++++++++++++++
 lib/kref.c           |   13 +++++++++
 5 files changed, 90 insertions(+), 57 deletions(-)

-- 
Signature


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/3] sysfs: simplify handling for s_active refcount
  2010-03-24  3:20 [PATCH 0/3] refcounting improvements in sysfs NeilBrown
  2010-03-24  3:20 ` [PATCH 3/3] kref: create karef and use for sysfs_dirent->s_active NeilBrown
  2010-03-24  3:20 ` [PATCH 2/3] sysfs: make s_count a kref NeilBrown
@ 2010-03-24  3:20 ` NeilBrown
  2010-03-26  4:24   ` Eric W. Biederman
  2010-03-26  3:10 ` [PATCH 0/3] refcounting improvements in sysfs Eric W. Biederman
  2010-03-26  4:49 ` Tejun Heo
  4 siblings, 1 reply; 20+ messages in thread
From: NeilBrown @ 2010-03-24  3:20 UTC (permalink / raw)
  To: Greg Kroah-Hartman; +Cc: linux-kernel

s_active counts the number of active references to a 'sysfs_direct'.
When we wish to deactivate a sysfs_direct, we subtract a large
number for the refcount so it will always appear negative.  When
it is negative, new references will not be taken.
After that subtraction, we wait for all the active references to
drain away.

The subtraction of the large number contains exactly the same
information as the setting of the flag SYSFS_FLAG_REMOVED.
(We know this as we already assert that SYSFS_FLAG_REMOVED is set
before adding the large-negative-bias).
So doing both is pointless.

By starting s_active with a value of 1, not 0 (as is typical of
reference counts) and using atomic_inc_not_zero, we can significantly
simplify the code while keeping exactly the same functionality.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/sysfs/dir.c   |   60 ++++++++++++++----------------------------------------
 fs/sysfs/sysfs.h |    2 --
 2 files changed, 16 insertions(+), 46 deletions(-)

diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 5907178..76a2d10 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -95,26 +95,13 @@ static void sysfs_unlink_sibling(struct sysfs_dirent *sd)
  */
 struct sysfs_dirent *sysfs_get_active(struct sysfs_dirent *sd)
 {
-	if (unlikely(!sd))
+	if (likely(sd)
+	    && (sd->s_flags & SYSFS_FLAG_REMOVED) == 0
+	    && atomic_inc_not_zero(&sd->s_active)) {
+		rwsem_acquire_read(&sd->dep_map, 0, 1, _RET_IP_);
+		return sd;
+	} else
 		return NULL;
-
-	while (1) {
-		int v, t;
-
-		v = atomic_read(&sd->s_active);
-		if (unlikely(v < 0))
-			return NULL;
-
-		t = atomic_cmpxchg(&sd->s_active, v, v + 1);
-		if (likely(t == v)) {
-			rwsem_acquire_read(&sd->dep_map, 0, 1, _RET_IP_);
-			return sd;
-		}
-		if (t < 0)
-			return NULL;
-
-		cpu_relax();
-	}
 }
 
 /**
@@ -126,34 +113,26 @@ struct sysfs_dirent *sysfs_get_active(struct sysfs_dirent *sd)
  */
 void sysfs_put_active(struct sysfs_dirent *sd)
 {
-	struct completion *cmpl;
-	int v;
-
 	if (unlikely(!sd))
 		return;
 
 	rwsem_release(&sd->dep_map, 1, _RET_IP_);
-	v = atomic_dec_return(&sd->s_active);
-	if (likely(v != SD_DEACTIVATED_BIAS))
-		return;
-
-	/* atomic_dec_return() is a mb(), we'll always see the updated
-	 * sd->s_sibling.
-	 */
-	cmpl = (void *)sd->s_sibling;
-	complete(cmpl);
+	if (atomic_dec_and_test(&sd->s_active)) {
+		struct completion *cmpl = (void*)sd->s_sibling;
+		complete(cmpl);
+	}
 }
 
 /**
  *	sysfs_deactivate - deactivate sysfs_dirent
  *	@sd: sysfs_dirent to deactivate
  *
- *	Deny new active references and drain existing ones.
+ *	New active references are already denied.
+ *      Drop ours and drain other existing ones.
  */
 static void sysfs_deactivate(struct sysfs_dirent *sd)
 {
 	DECLARE_COMPLETION_ONSTACK(wait);
-	int v;
 
 	BUG_ON(sd->s_sibling || !(sd->s_flags & SYSFS_FLAG_REMOVED));
 
@@ -163,20 +142,13 @@ static void sysfs_deactivate(struct sysfs_dirent *sd)
 	sd->s_sibling = (void *)&wait;
 
 	rwsem_acquire(&sd->dep_map, 0, 0, _RET_IP_);
-	/* atomic_add_return() is a mb(), put_active() will always see
-	 * the updated sd->s_sibling.
-	 */
-	v = atomic_add_return(SD_DEACTIVATED_BIAS, &sd->s_active);
-
-	if (v != SD_DEACTIVATED_BIAS) {
-		lock_contended(&sd->dep_map, _RET_IP_);
-		wait_for_completion(&wait);
-	}
+	sysfs_put_active(sd);
+	lock_contended(&sd->dep_map, _RET_IP_);
+	wait_for_completion(&wait);
 
 	sd->s_sibling = NULL;
 
 	lock_acquired(&sd->dep_map, _RET_IP_);
-	rwsem_release(&sd->dep_map, 1, _RET_IP_);
 }
 
 static int sysfs_alloc_ino(ino_t *pino)
@@ -318,7 +290,7 @@ struct sysfs_dirent *sysfs_new_dirent(const char *name, umode_t mode, int type)
 		goto err_out2;
 
 	atomic_set(&sd->s_count, 1);
-	atomic_set(&sd->s_active, 0);
+	atomic_set(&sd->s_active, 1);
 
 	sd->s_name = name;
 	sd->s_mode = mode;
diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h
index 30f5a44..6a2a60e 100644
--- a/fs/sysfs/sysfs.h
+++ b/fs/sysfs/sysfs.h
@@ -71,8 +71,6 @@ struct sysfs_dirent {
 	struct sysfs_inode_attrs *s_iattr;
 };
 
-#define SD_DEACTIVATED_BIAS		INT_MIN
-
 #define SYSFS_TYPE_MASK			0x00ff
 #define SYSFS_DIR			0x0001
 #define SYSFS_KOBJ_ATTR			0x0002



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 2/3] sysfs: make s_count a kref
  2010-03-24  3:20 [PATCH 0/3] refcounting improvements in sysfs NeilBrown
  2010-03-24  3:20 ` [PATCH 3/3] kref: create karef and use for sysfs_dirent->s_active NeilBrown
@ 2010-03-24  3:20 ` NeilBrown
  2010-03-26  4:29   ` Eric W. Biederman
  2010-03-24  3:20 ` [PATCH 1/3] sysfs: simplify handling for s_active refcount NeilBrown
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 20+ messages in thread
From: NeilBrown @ 2010-03-24  3:20 UTC (permalink / raw)
  To: Greg Kroah-Hartman; +Cc: linux-kernel

s_count in sysfs behaves exactly like a kref, so change it to
be one.
This requires adding a KREF_INIT macro to kref.h

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/sysfs/dir.c       |   12 +++++++++---
 fs/sysfs/mount.c     |    2 +-
 fs/sysfs/sysfs.h     |   15 +++++++--------
 include/linux/kref.h |    1 +
 4 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 76a2d10..63790ac 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -177,8 +177,14 @@ static void sysfs_free_ino(ino_t ino)
 	spin_unlock(&sysfs_ino_lock);
 }
 
-void release_sysfs_dirent(struct sysfs_dirent * sd)
+static void no_recurse(struct kref *kref)
 {
+}
+void release_sysfs_dirent(struct kref *count)
+{
+	struct sysfs_dirent *sd = container_of(count,
+					       struct sysfs_dirent,
+					       s_count);
 	struct sysfs_dirent *parent_sd;
 
  repeat:
@@ -199,7 +205,7 @@ void release_sysfs_dirent(struct sysfs_dirent * sd)
 	kmem_cache_free(sysfs_dir_cachep, sd);
 
 	sd = parent_sd;
-	if (sd && atomic_dec_and_test(&sd->s_count))
+	if (sd && kref_put(&sd->s_count, no_recurse))
 		goto repeat;
 }
 
@@ -289,7 +295,7 @@ struct sysfs_dirent *sysfs_new_dirent(const char *name, umode_t mode, int type)
 	if (sysfs_alloc_ino(&sd->s_ino))
 		goto err_out2;
 
-	atomic_set(&sd->s_count, 1);
+	kref_init(&sd->s_count);
 	atomic_set(&sd->s_active, 1);
 
 	sd->s_name = name;
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index 0cb1088..07bff03 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -33,7 +33,7 @@ static const struct super_operations sysfs_ops = {
 
 struct sysfs_dirent sysfs_root = {
 	.s_name		= "",
-	.s_count	= ATOMIC_INIT(1),
+	.s_count	= KREF_INIT,
 	.s_flags	= SYSFS_DIR,
 	.s_mode		= S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO,
 	.s_ino		= 1,
diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h
index 6a2a60e..f003a88 100644
--- a/fs/sysfs/sysfs.h
+++ b/fs/sysfs/sysfs.h
@@ -49,7 +49,7 @@ struct sysfs_inode_attrs {
  * requires s_active reference.
  */
 struct sysfs_dirent {
-	atomic_t		s_count;
+	struct kref		s_count;
 	atomic_t		s_active;
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 	struct lockdep_map	dep_map;
@@ -140,7 +140,7 @@ struct sysfs_dirent *sysfs_get_dirent(struct sysfs_dirent *parent_sd,
 				      const unsigned char *name);
 struct sysfs_dirent *sysfs_new_dirent(const char *name, umode_t mode, int type);
 
-void release_sysfs_dirent(struct sysfs_dirent *sd);
+void release_sysfs_dirent(struct kref *count);
 
 int sysfs_create_subdir(struct kobject *kobj, const char *name,
 			struct sysfs_dirent **p_sd);
@@ -151,18 +151,17 @@ int sysfs_rename(struct sysfs_dirent *sd,
 
 static inline struct sysfs_dirent *__sysfs_get(struct sysfs_dirent *sd)
 {
-	if (sd) {
-		WARN_ON(!atomic_read(&sd->s_count));
-		atomic_inc(&sd->s_count);
-	}
+	if (sd)
+		kref_get(&sd->s_count);
+
 	return sd;
 }
 #define sysfs_get(sd) __sysfs_get(sd)
 
 static inline void __sysfs_put(struct sysfs_dirent *sd)
 {
-	if (sd && atomic_dec_and_test(&sd->s_count))
-		release_sysfs_dirent(sd);
+	if (sd)
+		kref_put(&sd->s_count, release_sysfs_dirent);
 }
 #define sysfs_put(sd) __sysfs_put(sd)
 
diff --git a/include/linux/kref.h b/include/linux/kref.h
index 13003ee..b006f74 100644
--- a/include/linux/kref.h
+++ b/include/linux/kref.h
@@ -25,4 +25,5 @@ void kref_init(struct kref *kref);
 void kref_get(struct kref *kref);
 int kref_put(struct kref *kref, void (*release) (struct kref *kref));
 
+#define KREF_INIT {ATOMIC_INIT(1)}
 #endif /* _KREF_H_ */



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 3/3] kref: create karef and use for sysfs_dirent->s_active
  2010-03-24  3:20 [PATCH 0/3] refcounting improvements in sysfs NeilBrown
@ 2010-03-24  3:20 ` NeilBrown
  2010-03-26  4:50   ` Eric W. Biederman
  2010-03-24  3:20 ` [PATCH 2/3] sysfs: make s_count a kref NeilBrown
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 20+ messages in thread
From: NeilBrown @ 2010-03-24  3:20 UTC (permalink / raw)
  To: Greg Kroah-Hartman; +Cc: linux-kernel

->s_active is almost a kref, but needs atomic_inc_not_zero which
is not generally appropriate for a kref, as you can only access a
kreffed object if you hold a reference, and in that case the counter
cannot possibly be zero.

So introduce 'karef' - a counter of active references.  An active
reference is separate from an existential reference and normally
implies one.
For a karef, get_not_zero have be appropriate providing a separate
existential reference is held.

Then change sysfs_dirent->s_active to be the first (and so-far only)
karef. super_block->s_active is another candidate to be a karef.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/sysfs/dir.c       |   18 ++++++++++++------
 fs/sysfs/mount.c     |    1 +
 fs/sysfs/sysfs.h     |    2 +-
 include/linux/kref.h |   37 +++++++++++++++++++++++++++++++++++++
 lib/kref.c           |   13 +++++++++++++
 5 files changed, 64 insertions(+), 7 deletions(-)

diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 63790ac..f0e2303 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -97,13 +97,22 @@ struct sysfs_dirent *sysfs_get_active(struct sysfs_dirent *sd)
 {
 	if (likely(sd)
 	    && (sd->s_flags & SYSFS_FLAG_REMOVED) == 0
-	    && atomic_inc_not_zero(&sd->s_active)) {
+	    && karef_get_not_zero(&sd->s_active)) {
 		rwsem_acquire_read(&sd->dep_map, 0, 1, _RET_IP_);
 		return sd;
 	} else
 		return NULL;
 }
 
+
+static void sysfs_release(struct karef *k)
+{
+	struct sysfs_dirent *sd = container_of(k, struct sysfs_dirent, s_active);
+	struct completion *cmpl = (void*)sd->s_sibling;
+
+	complete(cmpl);
+}
+	
 /**
  *	sysfs_put_active - put an active reference to sysfs_dirent
  *	@sd: sysfs_dirent to put an active reference to
@@ -117,10 +126,7 @@ void sysfs_put_active(struct sysfs_dirent *sd)
 		return;
 
 	rwsem_release(&sd->dep_map, 1, _RET_IP_);
-	if (atomic_dec_and_test(&sd->s_active)) {
-		struct completion *cmpl = (void*)sd->s_sibling;
-		complete(cmpl);
-	}
+	karef_put(&sd->s_active, sysfs_release);
 }
 
 /**
@@ -296,7 +302,7 @@ struct sysfs_dirent *sysfs_new_dirent(const char *name, umode_t mode, int type)
 		goto err_out2;
 
 	kref_init(&sd->s_count);
-	atomic_set(&sd->s_active, 1);
+	karef_init(&sd->s_active);
 
 	sd->s_name = name;
 	sd->s_mode = mode;
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index 07bff03..5914f87 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -34,6 +34,7 @@ static const struct super_operations sysfs_ops = {
 struct sysfs_dirent sysfs_root = {
 	.s_name		= "",
 	.s_count	= KREF_INIT,
+	.s_active	= KAREF_INIT,
 	.s_flags	= SYSFS_DIR,
 	.s_mode		= S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO,
 	.s_ino		= 1,
diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h
index f003a88..7eb387b 100644
--- a/fs/sysfs/sysfs.h
+++ b/fs/sysfs/sysfs.h
@@ -50,7 +50,7 @@ struct sysfs_inode_attrs {
  */
 struct sysfs_dirent {
 	struct kref		s_count;
-	atomic_t		s_active;
+	struct karef		s_active;
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 	struct lockdep_map	dep_map;
 #endif
diff --git a/include/linux/kref.h b/include/linux/kref.h
index b006f74..2671cf8 100644
--- a/include/linux/kref.h
+++ b/include/linux/kref.h
@@ -17,6 +17,11 @@
 
 #include <linux/types.h>
 
+/* A kref can be embedded in an object to count all references
+ * the that object.  When the last reference is dropped (kref_put)
+ * the object is destroyed.
+ * See Documentation/kref.txt
+ */
 struct kref {
 	atomic_t refcount;
 };
@@ -26,4 +31,36 @@ void kref_get(struct kref *kref);
 int kref_put(struct kref *kref, void (*release) (struct kref *kref));
 
 #define KREF_INIT {ATOMIC_INIT(1)}
+
+/* A karef is similar to a kref, except that it counts references
+ * which hold the object 'active' rather than 'in existence'.
+ * The object can continue to exist after the karef reaches zero
+ * so it can be safe to access the object, but not possible to
+ * get a new reference (i.e. the object cannot be re-activated).
+ * So get_not_zero makes sense for karef, while it doesn't for
+ * kref.
+ * Normally the fact that a karef is non-zero will imply a held reference
+ * on some kref.  The 'release' function for the karef would then kref_put
+ * that kref.
+ */
+struct karef {
+	struct kref kref;
+};
+
+static inline void karef_init(struct karef *karef)
+{
+	kref_init(&karef->kref);
+}
+static inline void karef_get(struct karef *karef)
+{
+	kref_get(&karef->kref);
+}
+static inline int karef_put(struct karef *karef,
+			    void (*release) (struct karef *kref))
+{
+	return kref_put(&karef->kref, (void(*)(struct kref *kref))release);
+}
+int karef_get_not_zero(struct karef *karef);
+
+#define KAREF_INIT {{ATOMIC_INIT(1)}}
 #endif /* _KREF_H_ */
diff --git a/lib/kref.c b/lib/kref.c
index 69761d3..7aadf3d 100644
--- a/lib/kref.c
+++ b/lib/kref.c
@@ -36,6 +36,19 @@ void kref_get(struct kref *kref)
 }
 
 /**
+ * karef_get_not_zero - increment refcount unless it is already zero
+ * @karef: object
+ *
+ * This is only safe to use if there is something separate from this
+ * karef (such as a lock or a kref) that keeps the object
+ * from being freed.
+ */
+int karef_get_not_zero(struct karef *karef)
+{
+	return atomic_inc_not_zero(&karef->kref.refcount);
+}
+
+/**
  * kref_put - decrement refcount for object.
  * @kref: object.
  * @release: pointer to the function that will clean up the object when the



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/3] refcounting improvements in sysfs.
  2010-03-24  3:20 [PATCH 0/3] refcounting improvements in sysfs NeilBrown
                   ` (2 preceding siblings ...)
  2010-03-24  3:20 ` [PATCH 1/3] sysfs: simplify handling for s_active refcount NeilBrown
@ 2010-03-26  3:10 ` Eric W. Biederman
  2010-03-26  3:28   ` Neil Brown
  2010-03-26  4:49 ` Tejun Heo
  4 siblings, 1 reply; 20+ messages in thread
From: Eric W. Biederman @ 2010-03-26  3:10 UTC (permalink / raw)
  To: NeilBrown; +Cc: Greg Kroah-Hartman, linux-kernel, Tejun Heo

NeilBrown <neilb@suse.de> writes:

> This series tidies up the refcount of sysfs_dirents in sysfs,
> using kref where appropriate and a new karef for s_active.
> This achieves significant code simplification, especially the first
> patch.
>
> This is in part inspired by http://lwn.net/Articles/336224/ :-)

Neil I would appreciate if you would cc' Tejun and myself on
these kind of patches as we wrote the code and the best canidates
to review it.

Thanks,
Eric

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/3] refcounting improvements in sysfs.
  2010-03-26  3:10 ` [PATCH 0/3] refcounting improvements in sysfs Eric W. Biederman
@ 2010-03-26  3:28   ` Neil Brown
  0 siblings, 0 replies; 20+ messages in thread
From: Neil Brown @ 2010-03-26  3:28 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Greg Kroah-Hartman, linux-kernel, Tejun Heo

On Thu, 25 Mar 2010 20:10:15 -0700
ebiederm@xmission.com (Eric W. Biederman) wrote:

> NeilBrown <neilb@suse.de> writes:
> 
> > This series tidies up the refcount of sysfs_dirents in sysfs,
> > using kref where appropriate and a new karef for s_active.
> > This achieves significant code simplification, especially the first
> > patch.
> >
> > This is in part inspired by http://lwn.net/Articles/336224/ :-)
> 
> Neil I would appreciate if you would cc' Tejun and myself on
> these kind of patches as we wrote the code and the best canidates
> to review it.
>

Yes, you are right of course.
I just pulled names out of MAINTAINERS figuring the maintainer would know
what best to do with the patch.

NeilBrown

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/3] sysfs: simplify handling for s_active refcount
  2010-03-24  3:20 ` [PATCH 1/3] sysfs: simplify handling for s_active refcount NeilBrown
@ 2010-03-26  4:24   ` Eric W. Biederman
  2010-03-26  5:32     ` Neil Brown
  0 siblings, 1 reply; 20+ messages in thread
From: Eric W. Biederman @ 2010-03-26  4:24 UTC (permalink / raw)
  To: NeilBrown; +Cc: Greg Kroah-Hartman, linux-kernel

NeilBrown <neilb@suse.de> writes:

> s_active counts the number of active references to a 'sysfs_direct'.
> When we wish to deactivate a sysfs_direct, we subtract a large
> number for the refcount so it will always appear negative.  When
> it is negative, new references will not be taken.
> After that subtraction, we wait for all the active references to
> drain away.
>
> The subtraction of the large number contains exactly the same
> information as the setting of the flag SYSFS_FLAG_REMOVED.
> (We know this as we already assert that SYSFS_FLAG_REMOVED is set
> before adding the large-negative-bias).
> So doing both is pointless.
>
> By starting s_active with a value of 1, not 0 (as is typical of
> reference counts) and using atomic_inc_not_zero, we can significantly
> simplify the code while keeping exactly the same functionality.

Overall your logic appears correct but in detail this patch scares me.

sd->s_flags is protected by the sysfs_mutex, and you aren't
taking it when you read it.  So in general I don't see the new check
if (sd->s_flags & SYSFS_FLAG_REMOVED) == 0 providing any guarantee of
progress whatsoever with user space applications repeated reading from
a sysfs file when that sysfs file is being removed.  They could easily
have the sd->s_flags value cached and never see the new value, given a
crazy enough cache architecture.

So as attractive as this patch is I don't think it is correct.

Eric 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/3] sysfs: make s_count a kref
  2010-03-24  3:20 ` [PATCH 2/3] sysfs: make s_count a kref NeilBrown
@ 2010-03-26  4:29   ` Eric W. Biederman
  0 siblings, 0 replies; 20+ messages in thread
From: Eric W. Biederman @ 2010-03-26  4:29 UTC (permalink / raw)
  To: NeilBrown; +Cc: Greg Kroah-Hartman, linux-kernel

NeilBrown <neilb@suse.de> writes:

> s_count in sysfs behaves exactly like a kref, so change it to
> be one.
> This requires adding a KREF_INIT macro to kref.h

Except where it doesn't.  The whole no_recurse thing is definitely
not idiomatic kref usage.  I could find only 3 instances of someone even
looking at the return value.  I would argue based on that the return
value of kref_put should be removed.

KREF_INIT if we want it should be added in a separate patch.

kref should be kept for the stupid simple cases where so we don't have
to think about refcounting, and just know it works.  kref should not
be where we are getting clever, and it this feels like getting clever
to me.

It isn't like the atomic primitives are any worse than kref.

Eric

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/3] refcounting improvements in sysfs.
  2010-03-24  3:20 [PATCH 0/3] refcounting improvements in sysfs NeilBrown
                   ` (3 preceding siblings ...)
  2010-03-26  3:10 ` [PATCH 0/3] refcounting improvements in sysfs Eric W. Biederman
@ 2010-03-26  4:49 ` Tejun Heo
  2010-03-26  5:10   ` Tejun Heo
  2010-03-26  6:02   ` Neil Brown
  4 siblings, 2 replies; 20+ messages in thread
From: Tejun Heo @ 2010-03-26  4:49 UTC (permalink / raw)
  To: NeilBrown; +Cc: Greg Kroah-Hartman, linux-kernel

Hello,

On 03/24/2010 12:20 PM, NeilBrown wrote:
> This series tidies up the refcount of sysfs_dirents in sysfs,
> using kref where appropriate and a new karef for s_active.
> This achieves significant code simplification, especially the first
> patch.
> 
> This is in part inspired by http://lwn.net/Articles/336224/ :-)

Nice article.  In general, yeap, I agree it would be nice to have a
working reference count abstraction.  However, kref along with kobject
is a good example of obscurity by abstraction anti pattern.  :-)

kobject API incorrectly suggests that it deals with the last put
problem.  There still are large number of code paths which do the
following,

  if (!(kob = kobject_get(kobj)))
	return;

I believe (or at least hope) the actual problem cases are mostly fixed
now but there still are a lot of misconceptions around how stuff built
on kref/kobject is synchronized and they sometimes lead to race
conditions buried deep under several layers of abstractions and it
becomes very hard to see those race conditions when they are buried
deep.

If you want to kill refcounts w/ bias based off switch, please put it
inside an abstraction which at least synchronizes itself properly.
Open coding w/ bias at least warns you that there is some complex
stuff going on and you need to trade carefully.  Putting the switch on
a separate flag - people often forget how bits in a flag field are
synchronized - and the rest of refcount in a nice looking kref bundle
is very likely to lead to subtle race conditions which are *very*
difficult to notice.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/3] kref: create karef and use for sysfs_dirent->s_active
  2010-03-24  3:20 ` [PATCH 3/3] kref: create karef and use for sysfs_dirent->s_active NeilBrown
@ 2010-03-26  4:50   ` Eric W. Biederman
  0 siblings, 0 replies; 20+ messages in thread
From: Eric W. Biederman @ 2010-03-26  4:50 UTC (permalink / raw)
  To: NeilBrown; +Cc: Greg Kroah-Hartman, linux-kernel

NeilBrown <neilb@suse.de> writes:

> ->s_active is almost a kref, but needs atomic_inc_not_zero which
> is not generally appropriate for a kref, as you can only access a
> kreffed object if you hold a reference, and in that case the counter
> cannot possibly be zero.
>
> So introduce 'karef' - a counter of active references.  An active
> reference is separate from an existential reference and normally
> implies one.
> For a karef, get_not_zero have be appropriate providing a separate
> existential reference is held.
>
> Then change sysfs_dirent->s_active to be the first (and so-far only)
> karef. super_block->s_active is another candidate to be a karef.

If this actually captured the semantics of active references I would find
this interesting.  But you currently miss the inconvenient but crucial
part of preventing new references after deletion, and you miss the
lockdep bits.

Given that with the completion we act enough like a lock we can cause
deadlocks I would expect a generic version for dummies (aka a kref flavor)
to include lockdep annotations from the start.

Lock ordering issues combined with ref counts are rare but they really
suck, are hard to find (without lockdep) and hard to reason about.

The logical semantics are:

read_trylock() sysfs_get_active
read_unlock() sysfs_put_active
write_lock() sysfs_deactivate

Maybe your karef stuff is good for something but it models the sysfs
usage poorly.

Eric

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/3] refcounting improvements in sysfs.
  2010-03-26  4:49 ` Tejun Heo
@ 2010-03-26  5:10   ` Tejun Heo
  2010-03-26  6:02   ` Neil Brown
  1 sibling, 0 replies; 20+ messages in thread
From: Tejun Heo @ 2010-03-26  5:10 UTC (permalink / raw)
  To: NeilBrown; +Cc: Greg Kroah-Hartman, linux-kernel

One thing to add.

On 03/26/2010 01:49 PM, Tejun Heo wrote:
> Nice article.  In general, yeap, I agree it would be nice to have a
> working reference count abstraction.  However, kref along with kobject
> is a good example of obscurity by abstraction anti pattern.  :-)

I think one of the reasons why k* hasn't really worked as well as it
was orginally imagined to do is the way we - the kernel programmers -
think and work.  We add abstractions when something is functionally
necessary, so in a lot of cases the functional requirements become the
implementation and the communication among us (ie. serves as implied
documentation).  The k* stuff detracts from this principle.  Those
abstractions are there for the purpose of abstracting and our usual
mindset becomes very susceptible to misinterpretations as no easily
identifiable functional requirements are there - we either end up
imaginging something up or waste time frustrated trying to figure out
why the hell that abstraction is there.

This actualy is a very generic problem.  When a LOT of people are
trying to work together sharing a lot of infrastructures, it is very
deterimental to impose certain paradigm upon them.  People can easily
agree upon functional necessities but one guy's wildest, most
ambitious paradigmatic vision looks like a complete bull to another
gal.

So, let's keep the abstractions to the just necessary level and
communicate at the functional layer.  In this case, mount and sysfs
shares the requirement for a refcount w/ a kill switch.  I'm not sure
it warrants common abstraction at this stage but if the *function* can
be wrapped nicely along with lockdep annotations and all, why not?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/3] sysfs: simplify handling for s_active refcount
  2010-03-26  4:24   ` Eric W. Biederman
@ 2010-03-26  5:32     ` Neil Brown
  2010-03-26  5:42       ` Tejun Heo
  2010-03-26  7:53       ` Eric W. Biederman
  0 siblings, 2 replies; 20+ messages in thread
From: Neil Brown @ 2010-03-26  5:32 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Greg Kroah-Hartman, linux-kernel

On Thu, 25 Mar 2010 21:24:43 -0700
ebiederm@xmission.com (Eric W. Biederman) wrote:

> NeilBrown <neilb@suse.de> writes:
> 
> > s_active counts the number of active references to a 'sysfs_direct'.
> > When we wish to deactivate a sysfs_direct, we subtract a large
> > number for the refcount so it will always appear negative.  When
> > it is negative, new references will not be taken.
> > After that subtraction, we wait for all the active references to
> > drain away.
> >
> > The subtraction of the large number contains exactly the same
> > information as the setting of the flag SYSFS_FLAG_REMOVED.
> > (We know this as we already assert that SYSFS_FLAG_REMOVED is set
> > before adding the large-negative-bias).
> > So doing both is pointless.
> >
> > By starting s_active with a value of 1, not 0 (as is typical of
> > reference counts) and using atomic_inc_not_zero, we can significantly
> > simplify the code while keeping exactly the same functionality.
> 
> Overall your logic appears correct but in detail this patch scares me.
> 
> sd->s_flags is protected by the sysfs_mutex, and you aren't
> taking it when you read it.  So in general I don't see the new check
> if (sd->s_flags & SYSFS_FLAG_REMOVED) == 0 providing any guarantee of
> progress whatsoever with user space applications repeated reading from
> a sysfs file when that sysfs file is being removed.  They could easily
> have the sd->s_flags value cached and never see the new value, given a
> crazy enough cache architecture.

As you say, this is only a liveness issue.  The atomic_inc_not_zero
guarantees that we don't take a new reference after the last one is gone.
The test on SYSFS_FLAG_REMOVED is only there to ensure that the count does
eventually get to zero.
There could only be a problem here if the change to s_flags was not
propagated to all CPUs in some reasonably small time.

I'm not expert on these things but it was my understanding that interesting
cache architectures could arbitrarily re-order accesses, but does not delay
them indefinitely.
Inserting barriers could possibly make this more predictable, but that would
just delay certain loads/stores until a known state was reached - it would
not make the data visible at another CPU any faster (or would it?).

So unless there is no cache-coherency protocol at all, I think that
SYSFS_FLAG_REMOVED will be seen promptly and that s_active will drop to zero
and quickly as it does today.

> 
> So as attractive as this patch is I don't think it is correct.
>

I'm pleased you find it attractive - I certainly think the
"atomic_inc_not_zero" is much more readable than the code it replaces.
Hopefully if there are really problems (maybe I've fundamentally
misunderstood caches) they can be easily resolved (a couple of memory
barriers at worst?).

Thanks for the review,
NeilBrown

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/3] sysfs: simplify handling for s_active refcount
  2010-03-26  5:32     ` Neil Brown
@ 2010-03-26  5:42       ` Tejun Heo
  2010-03-26  7:53       ` Eric W. Biederman
  1 sibling, 0 replies; 20+ messages in thread
From: Tejun Heo @ 2010-03-26  5:42 UTC (permalink / raw)
  To: Neil Brown; +Cc: Eric W. Biederman, Greg Kroah-Hartman, linux-kernel

Hello, Neil.

On 03/26/2010 02:32 PM, Neil Brown wrote:
> Hopefully if there are really problems (maybe I've fundamentally
> misunderstood caches) they can be easily resolved (a couple of memory
> barriers at worst?).

Oh, no, please don't do that.  That will inject a whole lot more of
complexity into the mix.  Now, it's only a problem of logical
complexity.  If you add memory barriers there, it not only complicates
the problem itself beyond recognition but also reduces problem
reproducibility (especially because x86/64 are relatively strongly
ordered) and thus test coverage significantly.  Now the refcounting
can be understood by most people who put some time into it but if you
put memory barriers into it, only Oleg would be able to identify
problems.  :-P

If you really want to kill the bias and in an easy readable way,
please put it inside a struct w/ dedicated spinlock but if you are
gonna do that it might as well be better to simply implement the bias
anti-pattern correctly there once.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/3] refcounting improvements in sysfs.
  2010-03-26  4:49 ` Tejun Heo
  2010-03-26  5:10   ` Tejun Heo
@ 2010-03-26  6:02   ` Neil Brown
  2010-03-26  6:32     ` Tejun Heo
  1 sibling, 1 reply; 20+ messages in thread
From: Neil Brown @ 2010-03-26  6:02 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Greg Kroah-Hartman, linux-kernel

On Fri, 26 Mar 2010 13:49:36 +0900
Tejun Heo <teheo@suse.de> wrote:

> Hello,
> 
> On 03/24/2010 12:20 PM, NeilBrown wrote:
> > This series tidies up the refcount of sysfs_dirents in sysfs,
> > using kref where appropriate and a new karef for s_active.
> > This achieves significant code simplification, especially the first
> > patch.
> > 
> > This is in part inspired by http://lwn.net/Articles/336224/ :-)
> 
> Nice article.  In general, yeap, I agree it would be nice to have a
> working reference count abstraction.  However, kref along with kobject
> is a good example of obscurity by abstraction anti pattern.  :-)

I'm not at all sure that opinion would be universal....

refcounting is something that it is quite easy to get wrong.  There are
several slightly different models for refcounting and if you don't have a
clear understanding of the different use cases it is easy to get confused
about exactly what model is being used and so use a refcount wrongly.
kref certainly doesn't cover all models for refcounting but it does cover one
fairly common one very well and I think that it's use bring clarity rather
than obscurity.
Of course if it is used for a refcount which should really follow a different
model then that can cause confusion...

> 
> kobject API incorrectly suggests that it deals with the last put
> problem.  There still are large number of code paths which do the
> following,
> 
>   if (!(kob = kobject_get(kobj)))
> 	return;

kobject_get *always* returns exactly the argument that was passed to it.
(kref_get doesn't have a return value.)

I don't see how the code above has any bearing on the last-put problem, which
I think kref and thus kobject do handle exactly correctly.

> 
> I believe (or at least hope) the actual problem cases are mostly fixed
> now but there still are a lot of misconceptions around how stuff built
> on kref/kobject is synchronized and they sometimes lead to race
> conditions buried deep under several layers of abstractions and it
> becomes very hard to see those race conditions when they are buried
> deep.

I agree that there probably misconceptions about how kref works and they are
probably based on a lack of appreciation of the subtle differences in
flavours of refcounts.  Hence my desire to create and document different
k*ref types which clarify the different use cases.

> 
> If you want to kill refcounts w/ bias based off switch, please put it
> inside an abstraction which at least synchronizes itself properly.
> Open coding w/ bias at least warns you that there is some complex
> stuff going on and you need to trade carefully.  Putting the switch on
> a separate flag - people often forget how bits in a flag field are
> synchronized - and the rest of refcount in a nice looking kref bundle
> is very likely to lead to subtle race conditions which are *very*
> difficult to notice.

The only other use of a BIAS that I am aware of is in struct super_block, and
Al Viro recently removed that in his bleeding edge tree (two days before I
sent him a patch to do the same thing:-)

It is dangerous to build too much into an abstraction else you will find that
no-one uses it as it is too specific.

The s_count and s_active in struct super_block are very similar to s_count
and s_active in struct sysfs_dirent, however they are also quite different.
super_block uses a non-atomic s_count (because a spinlock is always held
anyway) and has a separate way of preventing new s_active references (s_root
becomes NULL).  The only real similarity is that they both have an 'active'
refcount that *can* become zero and still be visible, which is different to a
kref but still a model worth encapsulating (I think) in karef.

BTW I'd be perfectly happy if the first patch was taken and subsequent ones
not.  I think they are a good idea, but I'm happy to forgo them (for now:-).

NeilBrown

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/3] refcounting improvements in sysfs.
  2010-03-26  6:02   ` Neil Brown
@ 2010-03-26  6:32     ` Tejun Heo
  2010-03-29  5:10       ` Neil Brown
  0 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2010-03-26  6:32 UTC (permalink / raw)
  To: Neil Brown; +Cc: Greg Kroah-Hartman, linux-kernel

Hello, Neil.

On 03/26/2010 03:02 PM, Neil Brown wrote:
>> Nice article.  In general, yeap, I agree it would be nice to have a
>> working reference count abstraction.  However, kref along with kobject
>> is a good example of obscurity by abstraction anti pattern.  :-)
> 
> I'm not at all sure that opinion would be universal....
> 
> refcounting is something that it is quite easy to get wrong.  There are
> several slightly different models for refcounting and if you don't have a
> clear understanding of the different use cases it is easy to get confused
> about exactly what model is being used and so use a refcount wrongly.
> kref certainly doesn't cover all models for refcounting but it does cover one
> fairly common one very well and I think that it's use bring clarity rather
> than obscurity.
> Of course if it is used for a refcount which should really follow a different
> model then that can cause confusion...

I don't know.  After spending some time with k* and device model, I
grew a pretty strong dislike for abstractions without clear functional
necessities.  kref being much simpler than kobject, the abuse is much
less severe but there have been kref misuses (in kobject and SCSI
midlayer) which could have been avoided or at least easily located if
they simply had used atomic_t instead of dreaming up some mystical
properties of kref.

>> kobject API incorrectly suggests that it deals with the last put
>> problem.  There still are large number of code paths which do the
>> following,
>>
>>   if (!(kob = kobject_get(kobj)))
>> 	return;
> 
> kobject_get *always* returns exactly the argument that was passed to it.
> (kref_get doesn't have a return value.)
> 
> I don't see how the code above has any bearing on the last-put
> problem, which I think kref and thus kobject do handle exactly
> correctly.

Oh, I should have been more explicit.  It's not directly related to
kref but just something that always comes to my mind when thinking
about k* abstractions.  The above bogus condition checks used to be
used quite widely.  The programmer for some reason believed the last
kobject_put() somehow will magically make future kobject_get()s return
NULL, which of course doesn't make any sense but hey the
implementation is buried kilometers deep, the API and other usages
seem to suggest that and it's easy to imagine something up when you're
tired.

As an another example, please take a look at the kref API.

  int kref_put(struct kref *kref, void (*release) (struct kref *kref));

The function takes @release callback but under which context is it
called?  If you look at the source code, it's called in-line which
isn't clear from the API at all (why the hell take a callback if
you're gonna call it in-line?) and there have been *several* subtle
bugs which could have been avoided or at least would have been caught
much easier if it were not for that meaningless separate release
callback.  It's just too easy to forget about the executing context
when people write and review stuff over function boundaries.

 void put_something(something)
 {
	 if (kref_put(&something->kref))
		 do something which might dead lock;
 }

is way easier to avoid bugs and review than

 void really_kill_something(struct kref *kref)
 {
	 struct my_something *something = container_of(...);

	 do something which might dead lock;
 }

 void put_something(something)
 {
	 kref_put(&someting->kref);
 }

This is *way* worse than atomic_t not better and the confusion is
caused exactly by superflous abstraction which leads the users of the
API to imagine some non-existing function of the abstraction and
hinders the flow of review.

>> I believe (or at least hope) the actual problem cases are mostly fixed
>> now but there still are a lot of misconceptions around how stuff built
>> on kref/kobject is synchronized and they sometimes lead to race
>> conditions buried deep under several layers of abstractions and it
>> becomes very hard to see those race conditions when they are buried
>> deep.
> 
> I agree that there probably misconceptions about how kref works and they are
> probably based on a lack of appreciation of the subtle differences in
> flavours of refcounts.  Hence my desire to create and document different
> k*ref types which clarify the different use cases.

Oh, I'm not objecting to cleaning up how reference counts are done
per-se but *PLEASE* refrain from introducing abstractions for
abstraction's sake.  The k* stuff, device model and sysfs already
walked down that road and got burned badly.

> BTW I'd be perfectly happy if the first patch was taken and
> subsequent ones not.  I think they are a good idea, but I'm happy to
> forgo them (for now:-).

If it can be done in a way that it doesn't substitute pure logical
complexity with one which involves memory ordering issues, which is
almost always way worse, I have no objection at all.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/3] sysfs: simplify handling for s_active refcount
  2010-03-26  5:32     ` Neil Brown
  2010-03-26  5:42       ` Tejun Heo
@ 2010-03-26  7:53       ` Eric W. Biederman
  2010-03-29  4:43         ` Neil Brown
  1 sibling, 1 reply; 20+ messages in thread
From: Eric W. Biederman @ 2010-03-26  7:53 UTC (permalink / raw)
  To: Neil Brown; +Cc: Greg Kroah-Hartman, linux-kernel

Neil Brown <neilb@suse.de> writes:

> On Thu, 25 Mar 2010 21:24:43 -0700
> ebiederm@xmission.com (Eric W. Biederman) wrote:
>
>> NeilBrown <neilb@suse.de> writes:
>> 
>> > s_active counts the number of active references to a 'sysfs_direct'.
>> > When we wish to deactivate a sysfs_direct, we subtract a large
>> > number for the refcount so it will always appear negative.  When
>> > it is negative, new references will not be taken.
>> > After that subtraction, we wait for all the active references to
>> > drain away.
>> >
>> > The subtraction of the large number contains exactly the same
>> > information as the setting of the flag SYSFS_FLAG_REMOVED.
>> > (We know this as we already assert that SYSFS_FLAG_REMOVED is set
>> > before adding the large-negative-bias).
>> > So doing both is pointless.
>> >
>> > By starting s_active with a value of 1, not 0 (as is typical of
>> > reference counts) and using atomic_inc_not_zero, we can significantly
>> > simplify the code while keeping exactly the same functionality.
>> 
>> Overall your logic appears correct but in detail this patch scares me.
>> 
>> sd->s_flags is protected by the sysfs_mutex, and you aren't
>> taking it when you read it.  So in general I don't see the new check
>> if (sd->s_flags & SYSFS_FLAG_REMOVED) == 0 providing any guarantee of
>> progress whatsoever with user space applications repeated reading from
>> a sysfs file when that sysfs file is being removed.  They could easily
>> have the sd->s_flags value cached and never see the new value, given a
>> crazy enough cache architecture.
>
> As you say, this is only a liveness issue.  The atomic_inc_not_zero
> guarantees that we don't take a new reference after the last one is gone.
> The test on SYSFS_FLAG_REMOVED is only there to ensure that the count does
> eventually get to zero.
> There could only be a problem here if the change to s_flags was not
> propagated to all CPUs in some reasonably small time.

Having this handled correctly is a key part of the abstraction
implemented with sysfs_get_active sysfs_put_active and
sysfs_deactivate.  It is a requirement that userspace can not cuause
denial of service attacks on the rest of the kernel.

> I'm not expert on these things but it was my understanding that interesting
> cache architectures could arbitrarily re-order accesses, but does not delay
> them indefinitely.
> Inserting barriers could possibly make this more predictable, but that would
> just delay certain loads/stores until a known state was reached - it would
> not make the data visible at another CPU any faster (or would it?).

> So unless there is no cache-coherency protocol at all, I think that
> SYSFS_FLAG_REMOVED will be seen promptly and that s_active will drop to zero
> and quickly as it does today.

This is my problem with your patch.  You have replaced something that
used an existing abstraction properly, with something that does not
use any abstraction at all and we are reduced to talking about cpu
memory ordering requirements.

Furthermore this code you are changing is not used raw but it is the
building blocks for a well defined abstraction.  The users of
sysfs_get_active sysfs_put_active and sysfs_deactivate don't have to
know the details of how they are implemented to think about and
reason about it accurately.

I am very much in favor of a general primitive that we can use for
blocking ref counts.  We have them in sysfs, proc, sysctl, and a few
other places with similar requirements.  Each location has a different
implementation, but essentially the same semantics.

What is really needed is something with the semantics of:
read_trylock
read_unlock
write_lock() perhaps synchronize?

With multiple reads in side the read side critical section, and 
biased so that the read side is very cheap, and the data structure
size is very cheap.  We have an ugly but otherwise near optimal
implementation in sysfs.  srcu is close the data structure
costs too much to use on every sysfs dirent.

Improving and getting it right is very attractive.  The code that
exists today is correct and signals that something fishy is going
on so there is no point in doing something less than right.

Eric


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/3] sysfs: simplify handling for s_active refcount
  2010-03-26  7:53       ` Eric W. Biederman
@ 2010-03-29  4:43         ` Neil Brown
  2010-03-29  7:47           ` Neil Brown
  0 siblings, 1 reply; 20+ messages in thread
From: Neil Brown @ 2010-03-29  4:43 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Greg Kroah-Hartman, linux-kernel

On Fri, 26 Mar 2010 00:53:48 -0700
ebiederm@xmission.com (Eric W. Biederman) wrote:

> Neil Brown <neilb@suse.de> writes:
> 
> > On Thu, 25 Mar 2010 21:24:43 -0700
> > ebiederm@xmission.com (Eric W. Biederman) wrote:
> >
> >> NeilBrown <neilb@suse.de> writes:
> >> 
> >> > s_active counts the number of active references to a 'sysfs_direct'.
> >> > When we wish to deactivate a sysfs_direct, we subtract a large
> >> > number for the refcount so it will always appear negative.  When
> >> > it is negative, new references will not be taken.
> >> > After that subtraction, we wait for all the active references to
> >> > drain away.
> >> >
> >> > The subtraction of the large number contains exactly the same
> >> > information as the setting of the flag SYSFS_FLAG_REMOVED.
> >> > (We know this as we already assert that SYSFS_FLAG_REMOVED is set
> >> > before adding the large-negative-bias).
> >> > So doing both is pointless.
> >> >
> >> > By starting s_active with a value of 1, not 0 (as is typical of
> >> > reference counts) and using atomic_inc_not_zero, we can significantly
> >> > simplify the code while keeping exactly the same functionality.
> >> 
> >> Overall your logic appears correct but in detail this patch scares me.
> >> 
> >> sd->s_flags is protected by the sysfs_mutex, and you aren't
> >> taking it when you read it.  So in general I don't see the new check
> >> if (sd->s_flags & SYSFS_FLAG_REMOVED) == 0 providing any guarantee of
> >> progress whatsoever with user space applications repeated reading from
> >> a sysfs file when that sysfs file is being removed.  They could easily
> >> have the sd->s_flags value cached and never see the new value, given a
> >> crazy enough cache architecture.
> >
> > As you say, this is only a liveness issue.  The atomic_inc_not_zero
> > guarantees that we don't take a new reference after the last one is gone.
> > The test on SYSFS_FLAG_REMOVED is only there to ensure that the count does
> > eventually get to zero.
> > There could only be a problem here if the change to s_flags was not
> > propagated to all CPUs in some reasonably small time.
> 
> Having this handled correctly is a key part of the abstraction
> implemented with sysfs_get_active sysfs_put_active and
> sysfs_deactivate.  It is a requirement that userspace can not cuause
> denial of service attacks on the rest of the kernel.
> 
> > I'm not expert on these things but it was my understanding that interesting
> > cache architectures could arbitrarily re-order accesses, but does not delay
> > them indefinitely.
> > Inserting barriers could possibly make this more predictable, but that would
> > just delay certain loads/stores until a known state was reached - it would
> > not make the data visible at another CPU any faster (or would it?).
> 
> > So unless there is no cache-coherency protocol at all, I think that
> > SYSFS_FLAG_REMOVED will be seen promptly and that s_active will drop to zero
> > and quickly as it does today.
> 
> This is my problem with your patch.  You have replaced something that
> used an existing abstraction properly, with something that does not
> use any abstraction at all and we are reduced to talking about cpu
> memory ordering requirements.

We are not talking about memory ordering requirements.  There are no memory
ordering requirements introduced in the patch.  It really doesn't matter if
the setting of SYSFS_FLAG_REMOVED is seen before or after anything else.
All that matters is that it does get seen, and it will be seen long before
you could possibly describe the situation as 'denial of service'.

However if we do consider memory ordering guarantees we can describe a clear
limit to the possibly delay between SYSFS_FLAG_REMOVED being set, and being
seen.  The atomic_inc_not_zero serves as a memory barrier in exactly the same
way that the current code requires atomic_dec_return.  So while

	if (likely(sd)
	    && (sd->s_flags & SYSFS_FLAG_REMOVED) == 0
	    && atomic_inc_not_zero(&sd->s_active)) {

could possibly gain a reference even 'after' SYS_FLAG_REMOVED as been set,
a second call to this on the same processor will see SYSFS_FLAG_REMOVED.
So at the absolute most, we could see NCPUS active references gained and
dropped after SYSFS_FLAG_REMOVED was set - a clear limit which is all we need.
I'm still not sure we even need to argue in terms of memory barriers to be
sure the code is correct, but it seems they are sufficient to give a simple
proof.
The sysfs code already has a dependency on implicit memory barriers to ensure
that ->s_sibling points to a completion.  My change does not significantly
add to that.


> 
> Furthermore this code you are changing is not used raw but it is the
> building blocks for a well defined abstraction.  The users of
> sysfs_get_active sysfs_put_active and sysfs_deactivate don't have to
> know the details of how they are implemented to think about and
> reason about it accurately.

True.  However it is raw in the sense that if some other developer wanted
similar functionality in their own code they might copy the low level stuff
as they wouldn't be able to use the sys_* routines directly.  Copying
complex code is not a good path to maintainability.

> 
> I am very much in favor of a general primitive that we can use for
> blocking ref counts.  We have them in sysfs, proc, sysctl, and a few
> other places with similar requirements.  Each location has a different
> implementation, but essentially the same semantics.
> 
> What is really needed is something with the semantics of:
> read_trylock
> read_unlock
> write_lock() perhaps synchronize?

Yes - I think the synergy between locks an refcounts is really interesting...

As you say, simply using a general abstraction which provides these functions
would incur a space cost that you don't want.  You avoid this cost by
re-using s_sibling to store a reference to a completion.
Finding a primitive that allows such optimisations is an interesting
challenge.
I really think that atomic_inc_not_zero is sufficient to provide what you
need.

NeilBrown


> 
> With multiple reads in side the read side critical section, and 
> biased so that the read side is very cheap, and the data structure
> size is very cheap.  We have an ugly but otherwise near optimal
> implementation in sysfs.  srcu is close the data structure
> costs too much to use on every sysfs dirent.
> 
> Improving and getting it right is very attractive.  The code that
> exists today is correct and signals that something fishy is going
> on so there is no point in doing something less than right.
> 
> Eric
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/3] refcounting improvements in sysfs.
  2010-03-26  6:32     ` Tejun Heo
@ 2010-03-29  5:10       ` Neil Brown
  2010-03-31  3:20         ` Tejun Heo
  0 siblings, 1 reply; 20+ messages in thread
From: Neil Brown @ 2010-03-29  5:10 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Greg Kroah-Hartman, linux-kernel

On Fri, 26 Mar 2010 15:32:51 +0900
Tejun Heo <teheo@suse.de> wrote:

> Hello, Neil.
> 
> On 03/26/2010 03:02 PM, Neil Brown wrote:
> >> Nice article.  In general, yeap, I agree it would be nice to have a
> >> working reference count abstraction.  However, kref along with kobject
> >> is a good example of obscurity by abstraction anti pattern.  :-)
> > 
> > I'm not at all sure that opinion would be universal....
> > 
> > refcounting is something that it is quite easy to get wrong.  There are
> > several slightly different models for refcounting and if you don't have a
> > clear understanding of the different use cases it is easy to get confused
> > about exactly what model is being used and so use a refcount wrongly.
> > kref certainly doesn't cover all models for refcounting but it does cover one
> > fairly common one very well and I think that it's use bring clarity rather
> > than obscurity.
> > Of course if it is used for a refcount which should really follow a different
> > model then that can cause confusion...
> 
> I don't know.  After spending some time with k* and device model, I
> grew a pretty strong dislike for abstractions without clear functional
> necessities.  kref being much simpler than kobject, the abuse is much
> less severe but there have been kref misuses (in kobject and SCSI
> midlayer) which could have been avoided or at least easily located if
> they simply had used atomic_t instead of dreaming up some mystical
> properties of kref.

Hi Tejun,

this strikes me as really valuable experience that it would be great to
share.   While I generally like kref and see value in at least some of
kobject I don't for a moment suppose they are perfect.  Fixing them requires
a good understanding of the problems they cause.  If you have that knowledge,
it would be a great resource for anyone wanting to 'fix' kobject.
Are you interested in writing anything (more) up at all???


> 
> >> kobject API incorrectly suggests that it deals with the last put
> >> problem.  There still are large number of code paths which do the
> >> following,
> >>
> >>   if (!(kob = kobject_get(kobj)))
> >> 	return;
> > 
> > kobject_get *always* returns exactly the argument that was passed to it.
> > (kref_get doesn't have a return value.)
> > 
> > I don't see how the code above has any bearing on the last-put
> > problem, which I think kref and thus kobject do handle exactly
> > correctly.
> 
> Oh, I should have been more explicit.  It's not directly related to
> kref but just something that always comes to my mind when thinking
> about k* abstractions.  The above bogus condition checks used to be
> used quite widely.  The programmer for some reason believed the last
> kobject_put() somehow will magically make future kobject_get()s return
> NULL, which of course doesn't make any sense but hey the
> implementation is buried kilometers deep, the API and other usages
> seem to suggest that and it's easy to imagine something up when you're
> tired.

This would argue that having a return value from kobject_get violates Rusty's
law that interfaces should be hard to misuse.
We could probably change that - it is only used 19 times in the current
kernel.
It probably doesn't help that Documentation/kobject.txt includes the text:

    A successful call to kobject_get() will increment the kobject's reference
    counter and return the pointer to the kobject.

which seems to suggest that an unsuccessful call is possible.


> 
> As an another example, please take a look at the kref API.
> 
>   int kref_put(struct kref *kref, void (*release) (struct kref *kref));
> 
> The function takes @release callback but under which context is it
> called?  If you look at the source code, it's called in-line which
> isn't clear from the API at all (why the hell take a callback if
> you're gonna call it in-line?) and there have been *several* subtle
> bugs which could have been avoided or at least would have been caught
> much easier if it were not for that meaningless separate release
> callback.  It's just too easy to forget about the executing context
> when people write and review stuff over function boundaries.
> 
>  void put_something(something)
>  {
> 	 if (kref_put(&something->kref))
> 		 do something which might dead lock;
>  }
> 
> is way easier to avoid bugs and review than
> 
>  void really_kill_something(struct kref *kref)
>  {
> 	 struct my_something *something = container_of(...);
> 
> 	 do something which might dead lock;
>  }
> 
>  void put_something(something)
>  {
> 	 kref_put(&someting->kref);
>  }
> 
> This is *way* worse than atomic_t not better and the confusion is
> caused exactly by superflous abstraction which leads the users of the
> API to imagine some non-existing function of the abstraction and
> hinders the flow of review.

I'm not immediately convinced by this, though maybe I haven't seen enough
examples yet.

The deadlocks that I have come across would not have been any more obvious in
either of the above - they were caused because sysfs_remove deadlocks when
called from inside a sysfs attribute action...

Also, while this re-write is possible for kref uses it isn't really possible
in kobject as the 'final_put' function must be included in the ktype (though
maybe you don't like that either).

What would be really helpful here is a survey of what sorts of things are
actually done in final_put functions so would could create some guidelines
about how to write the release functions.

Thanks,
NeilBrown


> 
> >> I believe (or at least hope) the actual problem cases are mostly fixed
> >> now but there still are a lot of misconceptions around how stuff built
> >> on kref/kobject is synchronized and they sometimes lead to race
> >> conditions buried deep under several layers of abstractions and it
> >> becomes very hard to see those race conditions when they are buried
> >> deep.
> > 
> > I agree that there probably misconceptions about how kref works and they are
> > probably based on a lack of appreciation of the subtle differences in
> > flavours of refcounts.  Hence my desire to create and document different
> > k*ref types which clarify the different use cases.
> 
> Oh, I'm not objecting to cleaning up how reference counts are done
> per-se but *PLEASE* refrain from introducing abstractions for
> abstraction's sake.  The k* stuff, device model and sysfs already
> walked down that road and got burned badly.
> 
> > BTW I'd be perfectly happy if the first patch was taken and
> > subsequent ones not.  I think they are a good idea, but I'm happy to
> > forgo them (for now:-).
> 
> If it can be done in a way that it doesn't substitute pure logical
> complexity with one which involves memory ordering issues, which is
> almost always way worse, I have no objection at all.
> 
> Thanks.
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/3] sysfs: simplify handling for s_active refcount
  2010-03-29  4:43         ` Neil Brown
@ 2010-03-29  7:47           ` Neil Brown
  0 siblings, 0 replies; 20+ messages in thread
From: Neil Brown @ 2010-03-29  7:47 UTC (permalink / raw)
  To: Neil Brown; +Cc: Eric W. Biederman, Greg Kroah-Hartman, linux-kernel

On Mon, 29 Mar 2010 15:43:25 +1100
Neil Brown <neilb@suse.de> wrote:

> However if we do consider memory ordering guarantees we can describe a clear
> limit to the possibly delay between SYSFS_FLAG_REMOVED being set, and being
> seen.  The atomic_inc_not_zero serves as a memory barrier in exactly the same
> way that the current code requires atomic_dec_return.  So while
> 
> 	if (likely(sd)
> 	    && (sd->s_flags & SYSFS_FLAG_REMOVED) == 0
> 	    && atomic_inc_not_zero(&sd->s_active)) {
> 
> could possibly gain a reference even 'after' SYS_FLAG_REMOVED as been set,
> a second call to this on the same processor will see SYSFS_FLAG_REMOVED.
> So at the absolute most, we could see NCPUS active references gained and
> dropped after SYSFS_FLAG_REMOVED was set - a clear limit which is all we need.

It just occurred to me that this 'proof' isn't quite complete in itself.  I
need to also show that there is a suitable memory barrier after
SYSFS_FLAG_REMOVED is set.  There is as it is always set under sysfs_mutex,
so the mutex_unlock provides a barrier.
So after sysfs_mutex is unlocked, it is conceivable that each CPU could grant
one active reference against the sysfs_dirent before SYSFS_FLAG_REMOVED was
globally visible.

> I'm still not sure we even need to argue in terms of memory barriers to be
> sure the code is correct, but it seems they are sufficient to give a simple
> proof.

NeilBrown

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/3] refcounting improvements in sysfs.
  2010-03-29  5:10       ` Neil Brown
@ 2010-03-31  3:20         ` Tejun Heo
  0 siblings, 0 replies; 20+ messages in thread
From: Tejun Heo @ 2010-03-31  3:20 UTC (permalink / raw)
  To: Neil Brown; +Cc: Greg Kroah-Hartman, linux-kernel

Hello, Neil.

On 03/29/2010 02:10 PM, Neil Brown wrote:
>> I don't know.  After spending some time with k* and device model, I
>> grew a pretty strong dislike for abstractions without clear functional
>> necessities.  kref being much simpler than kobject, the abuse is much
>> less severe but there have been kref misuses (in kobject and SCSI
>> midlayer) which could have been avoided or at least easily located if
>> they simply had used atomic_t instead of dreaming up some mystical
>> properties of kref.
> 
> this strikes me as really valuable experience that it would be great
> to share.  While I generally like kref and see value in at least
> some of kobject I don't for a moment suppose they are perfect.
> Fixing them requires a good understanding of the problems they
> cause.  If you have that knowledge, it would be a great resource for
> anyone wanting to 'fix' kobject.  Are you interested in writing
> anything (more) up at all???

I'm not sure I have enough to say to write something up about. :-) One
thing I've been hoping to do is to hide the k* abstraction behind
device model abstraction so that device model API users only have to
deal with device model abstractions - devices and drivers.  They are
what the driver writers need and have pretty good tangible concrete
concept about.  I think my opinions can be compressed into

* Add abstractions which serve concrete functional necessities as
  doing so makes communication among peer programmers much easier by
  reducing confusion.

* Don't get too attached to software engineering theories.  They're
  not bad in themselves but often end up labeling only certain small
  subset of all the things which need to be considered, with the
  adverse side effect of emphasizing those named ones out of
  proportion harming overall trade-off balance.

> This would argue that having a return value from kobject_get violates Rusty's
> law that interfaces should be hard to misuse.
> We could probably change that - it is only used 19 times in the current
> kernel.
> It probably doesn't help that Documentation/kobject.txt includes the text:
> 
>     A successful call to kobject_get() will increment the kobject's reference
>     counter and return the pointer to the kobject.
> 
> which seems to suggest that an unsuccessful call is possible.

Yeap, the return value is only for convenience and I think it's more
harmful overall, but then again I suppose most people got used to it
now, which is another important factor to consider.  Given that there
are not many users of the return value, I agree that removing it would
be better.

>> This is *way* worse than atomic_t not better and the confusion is
>> caused exactly by superflous abstraction which leads the users of the
>> API to imagine some non-existing function of the abstraction and
>> hinders the flow of review.
> 
> I'm not immediately convinced by this, though maybe I haven't seen enough
> examples yet.
> 
> The deadlocks that I have come across would not have been any more obvious in
> either of the above - they were caused because sysfs_remove deadlocks when
> called from inside a sysfs attribute action...
>
> Also, while this re-write is possible for kref uses it isn't really possible
> in kobject as the 'final_put' function must be included in the ktype (though
> maybe you don't like that either).

Ah, right, I probably was mixing kobject and kref release functions
when talking about the problems I've seen.  I still don't like the
kref API and have chosen atomic_t over it in several cases.  To me,
the separate ->release seems way too heavy compared to the
functionality the abstraction actually provides and hinders writing
and reviewing more than it clarifies things.

> What would be really helpful here is a survey of what sorts of things are
> actually done in final_put functions so would could create some guidelines
> about how to write the release functions.

For kobject, I'm not sure what should be done other than hiding it as
much as possible behind device model abstractions.  For kref, I think
it would be great if it can be made simpler and dumber so that it only
clarifies the intention of the usage instead of forcing separate
->release.  If fancier reference helper is needed, capability to defer
release to a separate context and handle module symbol dereference
problem automatically would be nice provided they can be done in clean
manner, but these are just something I've been vaguely hoping for so
they may be absurd to actually implement.  :-)

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2010-03-31  3:20 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-03-24  3:20 [PATCH 0/3] refcounting improvements in sysfs NeilBrown
2010-03-24  3:20 ` [PATCH 3/3] kref: create karef and use for sysfs_dirent->s_active NeilBrown
2010-03-26  4:50   ` Eric W. Biederman
2010-03-24  3:20 ` [PATCH 2/3] sysfs: make s_count a kref NeilBrown
2010-03-26  4:29   ` Eric W. Biederman
2010-03-24  3:20 ` [PATCH 1/3] sysfs: simplify handling for s_active refcount NeilBrown
2010-03-26  4:24   ` Eric W. Biederman
2010-03-26  5:32     ` Neil Brown
2010-03-26  5:42       ` Tejun Heo
2010-03-26  7:53       ` Eric W. Biederman
2010-03-29  4:43         ` Neil Brown
2010-03-29  7:47           ` Neil Brown
2010-03-26  3:10 ` [PATCH 0/3] refcounting improvements in sysfs Eric W. Biederman
2010-03-26  3:28   ` Neil Brown
2010-03-26  4:49 ` Tejun Heo
2010-03-26  5:10   ` Tejun Heo
2010-03-26  6:02   ` Neil Brown
2010-03-26  6:32     ` Tejun Heo
2010-03-29  5:10       ` Neil Brown
2010-03-31  3:20         ` Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).