linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/10] staging: erofs: decompression stability enhancement
@ 2018-11-20 14:34 Gao Xiang
  2018-11-20 14:34 ` [PATCH 01/10] staging: erofs: fix `trace_erofs_readpage' position Gao Xiang
                   ` (9 more replies)
  0 siblings, 10 replies; 46+ messages in thread
From: Gao Xiang @ 2018-11-20 14:34 UTC (permalink / raw)
  To: Greg Kroah-Hartman, devel
  Cc: LKML, linux-erofs, Chao Yu, Miao Xie, weidu.du, Gao Xiang

Hi,

This patchset mainly focuses on erofs decompression stability; most of
these issues were found in the internal beta test. Some patches that
are still being tested or reviewed aren't included in this patchset and
will be sent after careful testing.

Thanks,
Gao Xiang

Gao Xiang (10):
  staging: erofs: fix `trace_erofs_readpage' position
  staging: erofs: fix race when the managed cache is enabled
  staging: erofs: atomic_cond_read_relaxed on ref-locked workgroup
  staging: erofs: fix `erofs_workgroup_{try_to_freeze, unfreeze}'
  staging: erofs: add a full barrier in erofs_workgroup_unfreeze
  staging: erofs: fix the definition of DBG_BUGON
  staging: erofs: separate into init_once / always
  staging: erofs: locked before registering for all new workgroups
  staging: erofs: decompress asynchronously if PG_readahead page at
    first
  staging: erofs: rename strange variable names in z_erofs_vle_frontend

 drivers/staging/erofs/internal.h  |  69 ++++++++++++--------
 drivers/staging/erofs/unzip_vle.c |  78 ++++++++++++++++-------
 drivers/staging/erofs/utils.c     | 131 ++++++++++++++++++++++++++------------
 3 files changed, 190 insertions(+), 88 deletions(-)

-- 
2.14.4


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 01/10] staging: erofs: fix `trace_erofs_readpage' position
  2018-11-20 14:34 [PATCH 00/10] staging: erofs: decompression stability enhancement Gao Xiang
@ 2018-11-20 14:34 ` Gao Xiang
  2018-11-22 10:19   ` Greg Kroah-Hartman
  2018-11-20 14:34 ` [PATCH 02/10] staging: erofs: fix race when the managed cache is enabled Gao Xiang
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 46+ messages in thread
From: Gao Xiang @ 2018-11-20 14:34 UTC (permalink / raw)
  To: Greg Kroah-Hartman, devel
  Cc: LKML, linux-erofs, Chao Yu, Miao Xie, weidu.du, Gao Xiang

`trace_erofs_readpage' should be placed in .readpage()
rather than in the internal `z_erofs_do_read_page'.

Fixes: 284db12cfda3 ("staging: erofs: add trace points for reading zipped data")

Reviewed-by: Chen Gong <gongchen4@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
---
 drivers/staging/erofs/unzip_vle.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/staging/erofs/unzip_vle.c b/drivers/staging/erofs/unzip_vle.c
index 6a283f618f46..ede3383ac601 100644
--- a/drivers/staging/erofs/unzip_vle.c
+++ b/drivers/staging/erofs/unzip_vle.c
@@ -598,8 +598,6 @@ static int z_erofs_do_read_page(struct z_erofs_vle_frontend *fe,
 	unsigned int cur, end, spiltted, index;
 	int err = 0;
 
-	trace_erofs_readpage(page, false);
-
 	/* register locked file pages as online pages in pack */
 	z_erofs_onlinepage_init(page);
 
@@ -1288,6 +1286,8 @@ static int z_erofs_vle_normalaccess_readpage(struct file *file,
 	int err;
 	LIST_HEAD(pagepool);
 
+	trace_erofs_readpage(page, false);
+
 #if (EROFS_FS_ZIP_CACHE_LVL >= 2)
 	f.cachedzone_la = (erofs_off_t)page->index << PAGE_SHIFT;
 #endif
-- 
2.14.4



* [PATCH 02/10] staging: erofs: fix race when the managed cache is enabled
  2018-11-20 14:34 [PATCH 00/10] staging: erofs: decompression stability enhancement Gao Xiang
  2018-11-20 14:34 ` [PATCH 01/10] staging: erofs: fix `trace_erofs_readpage' position Gao Xiang
@ 2018-11-20 14:34 ` Gao Xiang
  2018-11-22 10:17   ` Greg Kroah-Hartman
  2018-11-22 10:19   ` Greg Kroah-Hartman
  2018-11-20 14:34 ` [PATCH 03/10] staging: erofs: atomic_cond_read_relaxed on ref-locked workgroup Gao Xiang
                   ` (7 subsequent siblings)
  9 siblings, 2 replies; 46+ messages in thread
From: Gao Xiang @ 2018-11-20 14:34 UTC (permalink / raw)
  To: Greg Kroah-Hartman, devel
  Cc: LKML, linux-erofs, Chao Yu, Miao Xie, weidu.du, Gao Xiang

When the managed cache is enabled, the last reference count
of a workgroup must be used for its workstation.

Otherwise, it could lead to incorrect (un)freezes in
the reclaim path, which would be harmful.

A typical race is as follows:

Thread 1 (In the reclaim path)  Thread 2
workgroup_freeze(grp, 1)                                refcnt = 1
...
workgroup_unfreeze(grp, 1)                              refcnt = 1
                                workgroup_get(grp)      refcnt = 2 (x)
workgroup_put(grp)                                      refcnt = 1 (x)
                                ...unexpected behaviors

* grp is detached but still used, which violates the cache-managed
  freeze constraint.
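The invariant can be sketched in userspace with C11 atomics (a model
only; `model_workgroup_get' and the struct below are made up for
illustration, not the kernel API): a lookup may only take a new
reference while the count is still positive, so a group whose last
reference has been handed to reclaim can never be resurrected.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model of the invariant this patch enforces: the last
 * reference of a workgroup belongs to the workstation, so a lookup
 * may only take a new reference while the count is still positive. */
struct model_workgroup {
	atomic_int refcount;
};

static bool model_workgroup_get(struct model_workgroup *grp)
{
	int o = atomic_load(&grp->refcount);

	while (o > 0) {
		/* cmpxchg loop: retry if another thread raced with us */
		if (atomic_compare_exchange_weak(&grp->refcount, &o, o + 1))
			return true;
	}
	/* count <= 0 (or frozen): the group is being reclaimed,
	 * the caller must re-lookup instead of reusing grp */
	return false;
}
```

The failing branch is exactly the "(x)" step of the race diagram above:
thread 2 is refused instead of bumping a detached group to 2.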

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
---
 drivers/staging/erofs/internal.h |   1 +
 drivers/staging/erofs/utils.c    | 131 +++++++++++++++++++++++++++------------
 2 files changed, 93 insertions(+), 39 deletions(-)

diff --git a/drivers/staging/erofs/internal.h b/drivers/staging/erofs/internal.h
index 57575c7f5635..89dbd0888e53 100644
--- a/drivers/staging/erofs/internal.h
+++ b/drivers/staging/erofs/internal.h
@@ -250,6 +250,7 @@ static inline bool erofs_workgroup_get(struct erofs_workgroup *grp, int *ocnt)
 }
 
 #define __erofs_workgroup_get(grp)	atomic_inc(&(grp)->refcount)
+#define __erofs_workgroup_put(grp)	atomic_dec(&(grp)->refcount)
 
 extern int erofs_workgroup_put(struct erofs_workgroup *grp);
 
diff --git a/drivers/staging/erofs/utils.c b/drivers/staging/erofs/utils.c
index ea8a962e5c95..90de8d9195b7 100644
--- a/drivers/staging/erofs/utils.c
+++ b/drivers/staging/erofs/utils.c
@@ -83,12 +83,21 @@ int erofs_register_workgroup(struct super_block *sb,
 
 	grp = xa_tag_pointer(grp, tag);
 
-	err = radix_tree_insert(&sbi->workstn_tree,
-		grp->index, grp);
+	/*
+	 * Bump up reference count before making this workgroup
+	 * visible to other users in order to avoid potential UAF
+	 * without serialized by erofs_workstn_lock.
+	 */
+	__erofs_workgroup_get(grp);
 
-	if (!err) {
-		__erofs_workgroup_get(grp);
-	}
+	err = radix_tree_insert(&sbi->workstn_tree,
+				grp->index, grp);
+	if (unlikely(err))
+		/*
+		 * it's safe to decrease since the workgroup isn't visible
+		 * and refcount >= 2 (cannot be freezed).
+		 */
+		__erofs_workgroup_put(grp);
 
 	erofs_workstn_unlock(sbi);
 	radix_tree_preload_end();
@@ -97,19 +106,91 @@ int erofs_register_workgroup(struct super_block *sb,
 
 extern void erofs_workgroup_free_rcu(struct erofs_workgroup *grp);
 
+static void  __erofs_workgroup_free(struct erofs_workgroup *grp)
+{
+	atomic_long_dec(&erofs_global_shrink_cnt);
+	erofs_workgroup_free_rcu(grp);
+}
+
 int erofs_workgroup_put(struct erofs_workgroup *grp)
 {
 	int count = atomic_dec_return(&grp->refcount);
 
 	if (count == 1)
 		atomic_long_inc(&erofs_global_shrink_cnt);
-	else if (!count) {
-		atomic_long_dec(&erofs_global_shrink_cnt);
-		erofs_workgroup_free_rcu(grp);
-	}
+	else if (!count)
+		__erofs_workgroup_free(grp);
 	return count;
 }
 
+#ifdef EROFS_FS_HAS_MANAGED_CACHE
+/* for cache-managed case, customized reclaim paths exist */
+static void erofs_workgroup_unfreeze_final(struct erofs_workgroup *grp)
+{
+	erofs_workgroup_unfreeze(grp, 0);
+	__erofs_workgroup_free(grp);
+}
+
+bool erofs_try_to_release_workgroup(struct erofs_sb_info *sbi,
+				    struct erofs_workgroup *grp,
+				    bool cleanup)
+{
+	/*
+	 * for managed cache enabled, the refcount of workgroups
+	 * themselves could be < 0 (freezed). So there is no guarantee
+	 * that all refcount > 0 if managed cache is enabled.
+	 */
+	if (!erofs_workgroup_try_to_freeze(grp, 1))
+		return false;
+
+	/*
+	 * note that all cached pages should be unlinked
+	 * before delete it from the radix tree.
+	 * Otherwise some cached pages of an orphan old workgroup
+	 * could be still linked after the new one is available.
+	 */
+	if (erofs_try_to_free_all_cached_pages(sbi, grp)) {
+		erofs_workgroup_unfreeze(grp, 1);
+		return false;
+	}
+
+	/* it is impossible to fail after we freeze the workgroup */
+	BUG_ON(xa_untag_pointer(radix_tree_delete(&sbi->workstn_tree,
+						  grp->index)) != grp);
+
+	/*
+	 * if managed cache is enable, the last refcount
+	 * should indicate the related workstation.
+	 */
+	erofs_workgroup_unfreeze_final(grp);
+	return true;
+}
+
+#else
+/* for nocache case, no customized reclaim path at all */
+bool erofs_try_to_release_workgroup(struct erofs_sb_info *sbi,
+				    struct erofs_workgroup *grp,
+				    bool cleanup)
+{
+	int cnt = atomic_read(&grp->refcount);
+
+	DBG_BUGON(cnt <= 0);
+	DBG_BUGON(cleanup && cnt != 1);
+
+	if (cnt > 1)
+		return false;
+
+	if (xa_untag_pointer(radix_tree_delete(&sbi->workstn_tree,
+					       grp->index)) != grp)
+		return false;
+
+	/* (rarely) could be grabbed again when freeing */
+	erofs_workgroup_put(grp);
+	return true;
+}
+
+#endif
+
 unsigned long erofs_shrink_workstation(struct erofs_sb_info *sbi,
 				       unsigned long nr_shrink,
 				       bool cleanup)
@@ -126,41 +207,13 @@ unsigned long erofs_shrink_workstation(struct erofs_sb_info *sbi,
 		batch, first_index, PAGEVEC_SIZE);
 
 	for (i = 0; i < found; ++i) {
-		int cnt;
 		struct erofs_workgroup *grp = xa_untag_pointer(batch[i]);
 
 		first_index = grp->index + 1;
 
-		cnt = atomic_read(&grp->refcount);
-		BUG_ON(cnt <= 0);
-
-		if (cleanup)
-			BUG_ON(cnt != 1);
-
-#ifndef EROFS_FS_HAS_MANAGED_CACHE
-		else if (cnt > 1)
-#else
-		if (!erofs_workgroup_try_to_freeze(grp, 1))
-#endif
-			continue;
-
-		if (xa_untag_pointer(radix_tree_delete(&sbi->workstn_tree,
-			grp->index)) != grp) {
-#ifdef EROFS_FS_HAS_MANAGED_CACHE
-skip:
-			erofs_workgroup_unfreeze(grp, 1);
-#endif
+		/* try to shrink each valid workgroup */
+		if (!erofs_try_to_release_workgroup(sbi, grp, cleanup))
 			continue;
-		}
-
-#ifdef EROFS_FS_HAS_MANAGED_CACHE
-		if (erofs_try_to_free_all_cached_pages(sbi, grp))
-			goto skip;
-
-		erofs_workgroup_unfreeze(grp, 1);
-#endif
-		/* (rarely) grabbed again when freeing */
-		erofs_workgroup_put(grp);
 
 		++freed;
 		if (unlikely(!--nr_shrink))
-- 
2.14.4



* [PATCH 03/10] staging: erofs: atomic_cond_read_relaxed on ref-locked workgroup
  2018-11-20 14:34 [PATCH 00/10] staging: erofs: decompression stability enhancement Gao Xiang
  2018-11-20 14:34 ` [PATCH 01/10] staging: erofs: fix `trace_erofs_readpage' position Gao Xiang
  2018-11-20 14:34 ` [PATCH 02/10] staging: erofs: fix race when the managed cache is enabled Gao Xiang
@ 2018-11-20 14:34 ` Gao Xiang
  2018-11-22 10:20   ` Greg Kroah-Hartman
  2018-11-20 14:34 ` [PATCH 04/10] staging: erofs: fix `erofs_workgroup_{try_to_freeze, unfreeze}' Gao Xiang
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 46+ messages in thread
From: Gao Xiang @ 2018-11-20 14:34 UTC (permalink / raw)
  To: Greg Kroah-Hartman, devel
  Cc: LKML, linux-erofs, Chao Yu, Miao Xie, weidu.du, Gao Xiang

It's better to use atomic_cond_read_relaxed(), which is currently
implemented with hardware monitor instructions on ARM64 to wait for
a variable to change, instead of open-coded busy waiting.
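A rough userspace analogue of what the helper does (names here are
hypothetical; the real kernel helper additionally maps to wfe-based
event monitoring on arm64 rather than a pure spin):

```c
#include <assert.h>
#include <stdatomic.h>

#define MODEL_LOCKED_MAGIC (-12345)	/* stand-in for EROFS_LOCKED_MAGIC */

/* Rough analogue of atomic_cond_read_relaxed(&refcount,
 * VAL != EROFS_LOCKED_MAGIC): relaxed loads until the value is no
 * longer the lock marker, then return the value observed. */
static int model_cond_read(atomic_int *v)
{
	int val;

	do {
		val = atomic_load_explicit(v, memory_order_relaxed);
	} while (val == MODEL_LOCKED_MAGIC);

	return val;
}
```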

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
---
 drivers/staging/erofs/internal.h | 30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/drivers/staging/erofs/internal.h b/drivers/staging/erofs/internal.h
index 89dbd0888e53..eb80ba44d072 100644
--- a/drivers/staging/erofs/internal.h
+++ b/drivers/staging/erofs/internal.h
@@ -221,23 +221,29 @@ static inline void erofs_workgroup_unfreeze(
 	preempt_enable();
 }
 
+#if defined(CONFIG_SMP)
+static inline int erofs_wait_on_workgroup_freezed(struct erofs_workgroup *grp)
+{
+	return atomic_cond_read_relaxed(&grp->refcount,
+					VAL != EROFS_LOCKED_MAGIC);
+}
+#else
+static inline int erofs_wait_on_workgroup_freezed(struct erofs_workgroup *grp)
+{
+	int v = atomic_read(&grp->refcount);
+
+	/* workgroup is never freezed on uniprocessor systems */
+	DBG_BUGON(v == EROFS_LOCKED_MAGIC);
+	return v;
+}
+#endif
+
 static inline bool erofs_workgroup_get(struct erofs_workgroup *grp, int *ocnt)
 {
-	const int locked = (int)EROFS_LOCKED_MAGIC;
 	int o;
 
 repeat:
-	o = atomic_read(&grp->refcount);
-
-	/* spin if it is temporarily locked at the reclaim path */
-	if (unlikely(o == locked)) {
-#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
-		do
-			cpu_relax();
-		while (atomic_read(&grp->refcount) == locked);
-#endif
-		goto repeat;
-	}
+	o = erofs_wait_on_workgroup_freezed(grp);
 
 	if (unlikely(o <= 0))
 		return -1;
-- 
2.14.4



* [PATCH 04/10] staging: erofs: fix `erofs_workgroup_{try_to_freeze, unfreeze}'
  2018-11-20 14:34 [PATCH 00/10] staging: erofs: decompression stability enhancement Gao Xiang
                   ` (2 preceding siblings ...)
  2018-11-20 14:34 ` [PATCH 03/10] staging: erofs: atomic_cond_read_relaxed on ref-locked workgroup Gao Xiang
@ 2018-11-20 14:34 ` Gao Xiang
  2018-11-22 10:21   ` Greg Kroah-Hartman
  2018-11-20 14:34 ` [PATCH 05/10] staging: erofs: add a full barrier in erofs_workgroup_unfreeze Gao Xiang
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 46+ messages in thread
From: Gao Xiang @ 2018-11-20 14:34 UTC (permalink / raw)
  To: Greg Kroah-Hartman, devel
  Cc: LKML, linux-erofs, Chao Yu, Miao Xie, weidu.du, Gao Xiang

There are two minor issues in the current freeze interface:

   1) The freeze interfaces are unrelated to CONFIG_DEBUG_SPINLOCK,
      therefore fix the incorrect conditions;

   2) For SMP platforms, preemption should also be disabled before
      doing atomic_cmpxchg, in case some high-priority task preempts
      between atomic_cmpxchg and preempt_disable and then spins
      on the locked refcount later.
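The SMP ordering after the fix can be modeled in userspace (the
`model_*' names and the preempt-depth stub are invented for this
sketch; they are not the kernel interfaces): preemption is disabled
first, and the cmpxchg failure path re-enables it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_LOCKED_MAGIC (-12345)

static int model_preempt_depth;	/* stub for the kernel preempt count */
static void model_preempt_disable(void) { ++model_preempt_depth; }
static void model_preempt_enable(void)  { --model_preempt_depth; }

/* SMP variant after the fix: preemption is disabled *before* the
 * cmpxchg, so nothing can preempt in between and later spin on the
 * locked refcount; the failure path must re-enable preemption. */
static bool model_try_to_freeze(atomic_int *refcount, int val)
{
	model_preempt_disable();
	if (!atomic_compare_exchange_strong(refcount, &val,
					    MODEL_LOCKED_MAGIC)) {
		model_preempt_enable();
		return false;
	}
	return true;
}

static void model_unfreeze(atomic_int *refcount, int orig_val)
{
	atomic_store(refcount, orig_val);
	model_preempt_enable();
}
```

On UP platforms the patch drops the cmpxchg entirely: with preemption
disabled, a plain read is enough since nothing else can run.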

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
---
 drivers/staging/erofs/internal.h | 41 ++++++++++++++++++++++++----------------
 1 file changed, 25 insertions(+), 16 deletions(-)

diff --git a/drivers/staging/erofs/internal.h b/drivers/staging/erofs/internal.h
index eb80ba44d072..2e0ef92c138b 100644
--- a/drivers/staging/erofs/internal.h
+++ b/drivers/staging/erofs/internal.h
@@ -194,40 +194,49 @@ struct erofs_workgroup {
 
 #define EROFS_LOCKED_MAGIC     (INT_MIN | 0xE0F510CCL)
 
-static inline bool erofs_workgroup_try_to_freeze(
-	struct erofs_workgroup *grp, int v)
+#if defined(CONFIG_SMP)
+static inline bool erofs_workgroup_try_to_freeze(struct erofs_workgroup *grp,
+						 int val)
 {
-#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
-	if (v != atomic_cmpxchg(&grp->refcount,
-		v, EROFS_LOCKED_MAGIC))
-		return false;
 	preempt_disable();
-#else
-	preempt_disable();
-	if (atomic_read(&grp->refcount) != v) {
+	if (val != atomic_cmpxchg(&grp->refcount, val, EROFS_LOCKED_MAGIC)) {
 		preempt_enable();
 		return false;
 	}
-#endif
 	return true;
 }
 
-static inline void erofs_workgroup_unfreeze(
-	struct erofs_workgroup *grp, int v)
+static inline void erofs_workgroup_unfreeze(struct erofs_workgroup *grp,
+					    int orig_val)
 {
-#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
-	atomic_set(&grp->refcount, v);
-#endif
+	atomic_set(&grp->refcount, orig_val);
 	preempt_enable();
 }
 
-#if defined(CONFIG_SMP)
 static inline int erofs_wait_on_workgroup_freezed(struct erofs_workgroup *grp)
 {
 	return atomic_cond_read_relaxed(&grp->refcount,
 					VAL != EROFS_LOCKED_MAGIC);
 }
 #else
+static inline bool erofs_workgroup_try_to_freeze(struct erofs_workgroup *grp,
+						 int val)
+{
+	preempt_disable();
+	/* no need to spin on UP platforms, let's just disable preemption. */
+	if (val != atomic_read(&grp->refcount)) {
+		preempt_enable();
+		return false;
+	}
+	return true;
+}
+
+static inline void erofs_workgroup_unfreeze(struct erofs_workgroup *grp,
+					    int orig_val)
+{
+	preempt_enable();
+}
+
 static inline int erofs_wait_on_workgroup_freezed(struct erofs_workgroup *grp)
 {
 	int v = atomic_read(&grp->refcount);
-- 
2.14.4



* [PATCH 05/10] staging: erofs: add a full barrier in erofs_workgroup_unfreeze
  2018-11-20 14:34 [PATCH 00/10] staging: erofs: decompression stability enhancement Gao Xiang
                   ` (3 preceding siblings ...)
  2018-11-20 14:34 ` [PATCH 04/10] staging: erofs: fix `erofs_workgroup_{try_to_freeze, unfreeze}' Gao Xiang
@ 2018-11-20 14:34 ` Gao Xiang
  2018-11-22 10:22   ` Greg Kroah-Hartman
  2018-11-20 14:34 ` [PATCH 06/10] staging: erofs: fix the definition of DBG_BUGON Gao Xiang
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 46+ messages in thread
From: Gao Xiang @ 2018-11-20 14:34 UTC (permalink / raw)
  To: Greg Kroah-Hartman, devel
  Cc: LKML, linux-erofs, Chao Yu, Miao Xie, weidu.du, Gao Xiang

Just like other generic locks, insert a full barrier
to prevent memory reordering.
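A userspace rendering of the fixed unfreeze path (a sketch, with a
seq_cst fence standing in for the kernel's smp_mb()): the full fence
orders every access made while the refcount was frozen before the
store that publishes the restored count.

```c
#include <assert.h>
#include <stdatomic.h>

/* Model of erofs_workgroup_unfreeze after this patch: the fence
 * (~ smp_mb() in the kernel) acts as the "unlock" barrier, so the
 * relaxed store releasing the refcount cannot be observed before
 * the work done under the freeze. */
static void model_unfreeze(atomic_int *refcount, int orig_val)
{
	atomic_thread_fence(memory_order_seq_cst);	/* ~ smp_mb() */
	atomic_store_explicit(refcount, orig_val, memory_order_relaxed);
}
```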

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
---
 drivers/staging/erofs/internal.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/staging/erofs/internal.h b/drivers/staging/erofs/internal.h
index 2e0ef92c138b..f77653d33633 100644
--- a/drivers/staging/erofs/internal.h
+++ b/drivers/staging/erofs/internal.h
@@ -209,6 +209,7 @@ static inline bool erofs_workgroup_try_to_freeze(struct erofs_workgroup *grp,
 static inline void erofs_workgroup_unfreeze(struct erofs_workgroup *grp,
 					    int orig_val)
 {
+	smp_mb();
 	atomic_set(&grp->refcount, orig_val);
 	preempt_enable();
 }
-- 
2.14.4



* [PATCH 06/10] staging: erofs: fix the definition of DBG_BUGON
  2018-11-20 14:34 [PATCH 00/10] staging: erofs: decompression stability enhancement Gao Xiang
                   ` (4 preceding siblings ...)
  2018-11-20 14:34 ` [PATCH 05/10] staging: erofs: add a full barrier in erofs_workgroup_unfreeze Gao Xiang
@ 2018-11-20 14:34 ` Gao Xiang
  2018-11-20 14:34 ` [PATCH 07/10] staging: erofs: separate into init_once / always Gao Xiang
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 46+ messages in thread
From: Gao Xiang @ 2018-11-20 14:34 UTC (permalink / raw)
  To: Greg Kroah-Hartman, devel
  Cc: LKML, linux-erofs, Chao Yu, Miao Xie, weidu.du, Gao Xiang

It's better not to BUG_ON the kernel proactively; however, developers
need a way to locate issues as soon as possible.

DBG_BUGON is introduced, which only crashes when EROFS_FS_DEBUG
(an EROFS development feature) is on. It helps developers
find and fix bugs quickly.

Previously, DBG_BUGON was defined as ((void)0) when EROFS_FS_DEBUG is
off, but unused-variable warnings such as the following could occur:

drivers/staging/erofs/unzip_vle.c: In function ‘init_always’:
drivers/staging/erofs/unzip_vle.c:61:33: warning: unused variable ‘work’ [-Wunused-variable]
  struct z_erofs_vle_work *const work =
                                 ^~~~

Fix it to #define DBG_BUGON(x) ((void)(x)).
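The difference between the two definitions can be shown in a
standalone sketch (the MODEL_ names are invented stand-ins): ((void)0)
never evaluates its argument, so a variable referenced only inside
DBG_BUGON looks unused; ((void)(x)) still "uses" it while compiling
to nothing in non-debug builds.

```c
#include <assert.h>

/* Model of the two DBG_BUGON flavors: the debug build asserts, the
 * release build evaluates-and-discards the expression so variables
 * referenced only here no longer trigger -Wunused-variable. */
#ifdef MODEL_EROFS_FS_DEBUG
#define MODEL_DBG_BUGON(x)	assert(!(x))
#else
#define MODEL_DBG_BUGON(x)	((void)(x))
#endif

static int model_check(int nr_pages)
{
	int dbg_only = nr_pages;	/* no warning: "used" just below */

	MODEL_DBG_BUGON(dbg_only < 0);
	return 0;
}
```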

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
---
 drivers/staging/erofs/internal.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/staging/erofs/internal.h b/drivers/staging/erofs/internal.h
index f77653d33633..0aa2a41b9885 100644
--- a/drivers/staging/erofs/internal.h
+++ b/drivers/staging/erofs/internal.h
@@ -39,7 +39,7 @@
 #define debugln(x, ...)         ((void)0)
 
 #define dbg_might_sleep()       ((void)0)
-#define DBG_BUGON(...)          ((void)0)
+#define DBG_BUGON(x)            ((void)(x))
 #endif
 
 enum {
-- 
2.14.4



* [PATCH 07/10] staging: erofs: separate into init_once / always
  2018-11-20 14:34 [PATCH 00/10] staging: erofs: decompression stability enhancement Gao Xiang
                   ` (5 preceding siblings ...)
  2018-11-20 14:34 ` [PATCH 06/10] staging: erofs: fix the definition of DBG_BUGON Gao Xiang
@ 2018-11-20 14:34 ` Gao Xiang
  2018-11-22 10:23   ` Greg Kroah-Hartman
  2018-11-20 14:34 ` [PATCH 08/10] staging: erofs: locked before registering for all new workgroups Gao Xiang
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 46+ messages in thread
From: Gao Xiang @ 2018-11-20 14:34 UTC (permalink / raw)
  To: Greg Kroah-Hartman, devel
  Cc: LKML, linux-erofs, Chao Yu, Miao Xie, weidu.du, Gao Xiang

`z_erofs_vle_workgroup' is allocated frequently during decompression;
for example, 32 bytes are zeroed redundantly on 64-bit platforms even
though only Z_EROFS_VLE_INLINE_PAGEVECS + Z_EROFS_CLUSTER_MAX_PAGES
(4 by default) pages are stored in `z_erofs_vle_workgroup'.

As another example, `struct mutex' takes 72 bytes on our kirin
64-bit platforms; it's unnecessary to zero it first and then
initialize it every time.

Let's avoid filling the whole `z_erofs_vle_workgroup' with 0 at first,
since most fields are reinitialized to meaningful values later
and the pagevec doesn't need to be initialized at all.
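The init_once / init_always split follows the slab constructor
contract: the constructor runs once per object lifetime (objects must
be returned to ctor state on free), while per-allocation init only
touches the cheap fields. A userspace model (struct layout and names
are illustrative, not the real ones):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define MODEL_MAX_PAGES 4	/* Z_EROFS_CLUSTER_MAX_PAGES stand-in */

struct model_work {
	int nr_pages;
	int vcnt;
};

struct model_workgroup {
	atomic_int refcount;
	unsigned int flags;
	void *compressed_pages[MODEL_MAX_PAGES];
	struct model_work primary;
};

/* ctor (kmem_cache init_once): runs once when the slab object is
 * created; freed objects must be left in this state so reuse skips
 * the expensive initialization (mutex_init etc. in the real code) */
static void model_init_once(void *ptr)
{
	struct model_workgroup *grp = ptr;
	unsigned int i;

	grp->primary.nr_pages = 0;
	grp->primary.vcnt = 0;
	for (i = 0; i < MODEL_MAX_PAGES; ++i)
		grp->compressed_pages[i] = NULL;
}

/* init_always: runs on every allocation, only the per-use fields */
static void model_init_always(struct model_workgroup *grp)
{
	atomic_init(&grp->refcount, 1);
	grp->flags = 0;
}
```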

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
---
 drivers/staging/erofs/unzip_vle.c | 34 +++++++++++++++++++++++++++++-----
 1 file changed, 29 insertions(+), 5 deletions(-)

diff --git a/drivers/staging/erofs/unzip_vle.c b/drivers/staging/erofs/unzip_vle.c
index ede3383ac601..4e5843e8ee35 100644
--- a/drivers/staging/erofs/unzip_vle.c
+++ b/drivers/staging/erofs/unzip_vle.c
@@ -43,12 +43,38 @@ static inline int init_unzip_workqueue(void)
 	return z_erofs_workqueue ? 0 : -ENOMEM;
 }
 
+static void init_once(void *ptr)
+{
+	struct z_erofs_vle_workgroup *grp = ptr;
+	struct z_erofs_vle_work *const work =
+		z_erofs_vle_grab_primary_work(grp);
+	unsigned int i;
+
+	mutex_init(&work->lock);
+	work->nr_pages = 0;
+	work->vcnt = 0;
+	for (i = 0; i < Z_EROFS_CLUSTER_MAX_PAGES; ++i)
+		grp->compressed_pages[i] = NULL;
+}
+
+static void init_always(struct z_erofs_vle_workgroup *grp)
+{
+	struct z_erofs_vle_work *const work =
+		z_erofs_vle_grab_primary_work(grp);
+
+	atomic_set(&grp->obj.refcount, 1);
+	grp->flags = 0;
+
+	DBG_BUGON(work->nr_pages);
+	DBG_BUGON(work->vcnt);
+}
+
 int __init z_erofs_init_zip_subsystem(void)
 {
 	z_erofs_workgroup_cachep =
 		kmem_cache_create("erofs_compress",
 				  Z_EROFS_WORKGROUP_SIZE, 0,
-				  SLAB_RECLAIM_ACCOUNT, NULL);
+				  SLAB_RECLAIM_ACCOUNT, init_once);
 
 	if (z_erofs_workgroup_cachep) {
 		if (!init_unzip_workqueue())
@@ -370,10 +396,11 @@ z_erofs_vle_work_register(const struct z_erofs_vle_work_finder *f,
 	BUG_ON(grp);
 
 	/* no available workgroup, let's allocate one */
-	grp = kmem_cache_zalloc(z_erofs_workgroup_cachep, GFP_NOFS);
+	grp = kmem_cache_alloc(z_erofs_workgroup_cachep, GFP_NOFS);
 	if (unlikely(!grp))
 		return ERR_PTR(-ENOMEM);
 
+	init_always(grp);
 	grp->obj.index = f->idx;
 	grp->llen = map->m_llen;
 
@@ -381,7 +408,6 @@ z_erofs_vle_work_register(const struct z_erofs_vle_work_finder *f,
 		(map->m_flags & EROFS_MAP_ZIPPED) ?
 			Z_EROFS_VLE_WORKGRP_FMT_LZ4 :
 			Z_EROFS_VLE_WORKGRP_FMT_PLAIN);
-	atomic_set(&grp->obj.refcount, 1);
 
 	/* new workgrps have been claimed as type 1 */
 	WRITE_ONCE(grp->next, *f->owned_head);
@@ -394,8 +420,6 @@ z_erofs_vle_work_register(const struct z_erofs_vle_work_finder *f,
 	work = z_erofs_vle_grab_primary_work(grp);
 	work->pageofs = f->pageofs;
 
-	mutex_init(&work->lock);
-
 	if (gnew) {
 		int err = erofs_register_workgroup(f->sb, &grp->obj, 0);
 
-- 
2.14.4



* [PATCH 08/10] staging: erofs: locked before registering for all new workgroups
  2018-11-20 14:34 [PATCH 00/10] staging: erofs: decompression stability enhancement Gao Xiang
                   ` (6 preceding siblings ...)
  2018-11-20 14:34 ` [PATCH 07/10] staging: erofs: separate into init_once / always Gao Xiang
@ 2018-11-20 14:34 ` Gao Xiang
  2018-11-22 10:24   ` Greg Kroah-Hartman
  2018-11-20 14:34 ` [PATCH 09/10] staging: erofs: decompress asynchronously if PG_readahead page at first Gao Xiang
  2018-11-20 14:34 ` [PATCH 10/10] staging: erofs: rename strange variable names in z_erofs_vle_frontend Gao Xiang
  9 siblings, 1 reply; 46+ messages in thread
From: Gao Xiang @ 2018-11-20 14:34 UTC (permalink / raw)
  To: Greg Kroah-Hartman, devel
  Cc: LKML, linux-erofs, Chao Yu, Miao Xie, weidu.du, Gao Xiang

Let's make sure that the one registering a workgroup will also
take the primary work lock at first, for two reasons:

  1) There's no need to introduce a race window (and the consequent
     overhead) between registering and locking; other tasks could
     break in by chance, and the race brings no benefit at all;

  2) It's better to hold the primary work lock when a workgroup
     is registered in order to apply the cache-managed policy;
     for example, if some other task breaks in, it could fall back
     to in-place decompression rather than cached decompression.
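The lock-before-publish pattern in miniature (a userspace sketch with
an atomic_flag standing in for the mutex; all `model_*' names are
hypothetical): because the creator locks the object while it is still
private, the trylock can never fail, which is exactly why the patch
wraps it in DBG_BUGON rather than handling failure.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct model_item {
	atomic_flag lock;
	bool published;
};

static bool model_trylock(struct model_item *it)
{
	return !atomic_flag_test_and_set(&it->lock);
}

/* Lock first, publish second: nobody else can see the item before
 * it is marked published, so the trylock must succeed. */
static bool model_register(struct model_item *it)
{
	atomic_flag_clear(&it->lock);
	it->published = false;

	if (!model_trylock(it))
		return false;	/* cannot happen: object still private */

	it->published = true;	/* visible to other tasks only now */
	return true;
}
```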

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
---
 drivers/staging/erofs/unzip_vle.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/staging/erofs/unzip_vle.c b/drivers/staging/erofs/unzip_vle.c
index 4e5843e8ee35..a1376f3c6065 100644
--- a/drivers/staging/erofs/unzip_vle.c
+++ b/drivers/staging/erofs/unzip_vle.c
@@ -420,18 +420,22 @@ z_erofs_vle_work_register(const struct z_erofs_vle_work_finder *f,
 	work = z_erofs_vle_grab_primary_work(grp);
 	work->pageofs = f->pageofs;
 
+	/* lock all primary followed works before visible to others */
+	if (unlikely(!mutex_trylock(&work->lock)))
+		/* for a new workgroup, try_lock *never* fails */
+		DBG_BUGON(1);
+
 	if (gnew) {
 		int err = erofs_register_workgroup(f->sb, &grp->obj, 0);
 
 		if (err) {
+			mutex_unlock(&work->lock);
 			kmem_cache_free(z_erofs_workgroup_cachep, grp);
 			return ERR_PTR(-EAGAIN);
 		}
 	}
 
 	*f->owned_head = *f->grp_ret = grp;
-
-	mutex_lock(&work->lock);
 	return work;
 }
 
-- 
2.14.4



* [PATCH 09/10] staging: erofs: decompress asynchronously if PG_readahead page at first
  2018-11-20 14:34 [PATCH 00/10] staging: erofs: decompression stability enhancement Gao Xiang
                   ` (7 preceding siblings ...)
  2018-11-20 14:34 ` [PATCH 08/10] staging: erofs: locked before registering for all new workgroups Gao Xiang
@ 2018-11-20 14:34 ` Gao Xiang
  2018-11-20 14:34 ` [PATCH 10/10] staging: erofs: rename strange variable names in z_erofs_vle_frontend Gao Xiang
  9 siblings, 0 replies; 46+ messages in thread
From: Gao Xiang @ 2018-11-20 14:34 UTC (permalink / raw)
  To: Greg Kroah-Hartman, devel
  Cc: LKML, linux-erofs, Chao Yu, Miao Xie, weidu.du, Gao Xiang

For the case of nr_to_read == lookahead_size, it is better to
decompress asynchronously as well since no page will be needed immediately.
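The decision reduces to a small predicate, matching the patch's
`sync &= !(PageReadahead(page) && !head)` (function and parameter
names below are invented for the sketch): start from the size-based
heuristic, then force asynchronous decompression when the very first
page of the batch carries PG_readahead.

```c
#include <assert.h>
#include <stdbool.h>

/* Decide whether to decompress synchronously: a PG_readahead page
 * seen at the head of the batch means a pure readahead request, so
 * no page is needed immediately and async wins. */
static bool model_decompress_sync(bool size_heuristic_sync,
				  bool first_page_readahead)
{
	bool sync = size_heuristic_sync;

	sync &= !first_page_readahead;
	return sync;
}
```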

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
---
 drivers/staging/erofs/unzip_vle.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/staging/erofs/unzip_vle.c b/drivers/staging/erofs/unzip_vle.c
index a1376f3c6065..824d2c12c2f3 100644
--- a/drivers/staging/erofs/unzip_vle.c
+++ b/drivers/staging/erofs/unzip_vle.c
@@ -1344,8 +1344,8 @@ static int z_erofs_vle_normalaccess_readpages(struct file *filp,
 {
 	struct inode *const inode = mapping->host;
 	struct erofs_sb_info *const sbi = EROFS_I_SB(inode);
-	const bool sync = __should_decompress_synchronously(sbi, nr_pages);
 
+	bool sync = __should_decompress_synchronously(sbi, nr_pages);
 	struct z_erofs_vle_frontend f = VLE_FRONTEND_INIT(inode);
 	gfp_t gfp = mapping_gfp_constraint(mapping, GFP_KERNEL);
 	struct page *head = NULL;
@@ -1363,6 +1363,13 @@ static int z_erofs_vle_normalaccess_readpages(struct file *filp,
 		prefetchw(&page->flags);
 		list_del(&page->lru);
 
+		/*
+		 * A pure asynchronous readahead is indicated if
+		 * a PG_readahead marked page is hitted at first.
+		 * Let's also do asynchronous decompression for this case.
+		 */
+		sync &= !(PageReadahead(page) && !head);
+
 		if (add_to_page_cache_lru(page, mapping, page->index, gfp)) {
 			list_add(&page->lru, &pagepool);
 			continue;
-- 
2.14.4



* [PATCH 10/10] staging: erofs: rename strange variable names in z_erofs_vle_frontend
  2018-11-20 14:34 [PATCH 00/10] staging: erofs: decompression stability enhancement Gao Xiang
                   ` (8 preceding siblings ...)
  2018-11-20 14:34 ` [PATCH 09/10] staging: erofs: decompress asynchronously if PG_readahead page at first Gao Xiang
@ 2018-11-20 14:34 ` Gao Xiang
  9 siblings, 0 replies; 46+ messages in thread
From: Gao Xiang @ 2018-11-20 14:34 UTC (permalink / raw)
  To: Greg Kroah-Hartman, devel
  Cc: LKML, linux-erofs, Chao Yu, Miao Xie, weidu.du, Gao Xiang

Previously, two members named `initial' and `cachedzone_la' were used
to apply the caching policy (i.e. whether the workgroup is at either
end), which was hard to understand; rename them to `backmost' and
`headoffset'.
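What the two renamed members mean can be sketched as follows (a model
assuming 4KiB pages; `model_*' names are illustrative): `headoffset'
is just the byte offset of the first page of the request, and the
caching decision matches the expression in z_erofs_do_read_page.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12	/* assuming 4KiB pages */

/* headoffset: byte offset of the first page of the request */
static uint64_t model_headoffset(uint64_t first_page_index)
{
	return first_page_index << MODEL_PAGE_SHIFT;
}

/* cache the compressed pages if this frontend is still backmost in
 * the chain, or if the extent starts before the request head */
static int model_should_cache(uint64_t m_la, uint64_t headoffset,
			      int backmost)
{
	return backmost | (m_la < headoffset);
}
```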

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
---
 drivers/staging/erofs/unzip_vle.c | 25 +++++++++++--------------
 1 file changed, 11 insertions(+), 14 deletions(-)

diff --git a/drivers/staging/erofs/unzip_vle.c b/drivers/staging/erofs/unzip_vle.c
index 824d2c12c2f3..1ef178e7ac39 100644
--- a/drivers/staging/erofs/unzip_vle.c
+++ b/drivers/staging/erofs/unzip_vle.c
@@ -586,10 +586,9 @@ struct z_erofs_vle_frontend {
 
 	z_erofs_vle_owned_workgrp_t owned_head;
 
-	bool initial;
-#if (EROFS_FS_ZIP_CACHE_LVL >= 2)
-	erofs_off_t cachedzone_la;
-#endif
+	/* used for applying cache strategy on the fly */
+	bool backmost;
+	erofs_off_t headoffset;
 };
 
 #define VLE_FRONTEND_INIT(__i) { \
@@ -600,7 +599,7 @@ struct z_erofs_vle_frontend {
 	}, \
 	.builder = VLE_WORK_BUILDER_INIT(), \
 	.owned_head = Z_EROFS_VLE_WORKGRP_TAIL, \
-	.initial = true, }
+	.backmost = true, }
 
 static int z_erofs_do_read_page(struct z_erofs_vle_frontend *fe,
 				struct page *page,
@@ -643,7 +642,7 @@ static int z_erofs_do_read_page(struct z_erofs_vle_frontend *fe,
 	debugln("%s: [out-of-range] pos %llu", __func__, offset + cur);
 
 	if (z_erofs_vle_work_iter_end(builder))
-		fe->initial = false;
+		fe->backmost = false;
 
 	map->m_la = offset + cur;
 	map->m_llen = 0;
@@ -669,8 +668,8 @@ static int z_erofs_do_read_page(struct z_erofs_vle_frontend *fe,
 		erofs_blknr(map->m_pa),
 		grp->compressed_pages, erofs_blknr(map->m_plen),
 		/* compressed page caching selection strategy */
-		fe->initial | (EROFS_FS_ZIP_CACHE_LVL >= 2 ?
-			map->m_la < fe->cachedzone_la : 0));
+		fe->backmost | (EROFS_FS_ZIP_CACHE_LVL >= 2 ?
+				map->m_la < fe->headoffset : 0));
 
 	if (noio_outoforder && builder_is_followed(builder))
 		builder->role = Z_EROFS_VLE_WORK_PRIMARY;
@@ -1316,9 +1315,8 @@ static int z_erofs_vle_normalaccess_readpage(struct file *file,
 
 	trace_erofs_readpage(page, false);
 
-#if (EROFS_FS_ZIP_CACHE_LVL >= 2)
-	f.cachedzone_la = (erofs_off_t)page->index << PAGE_SHIFT;
-#endif
+	f.headoffset = (erofs_off_t)page->index << PAGE_SHIFT;
+
 	err = z_erofs_do_read_page(&f, page, &pagepool);
 	(void)z_erofs_vle_work_iter_end(&f.builder);
 
@@ -1354,9 +1352,8 @@ static int z_erofs_vle_normalaccess_readpages(struct file *filp,
 	trace_erofs_readpages(mapping->host, lru_to_page(pages),
 			      nr_pages, false);
 
-#if (EROFS_FS_ZIP_CACHE_LVL >= 2)
-	f.cachedzone_la = (erofs_off_t)lru_to_page(pages)->index << PAGE_SHIFT;
-#endif
+	f.headoffset = (erofs_off_t)lru_to_page(pages)->index << PAGE_SHIFT;
+
 	for (; nr_pages; --nr_pages) {
 		struct page *page = lru_to_page(pages);
 
-- 
2.14.4


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH 02/10] staging: erofs: fix race when the managed cache is enabled
  2018-11-20 14:34 ` [PATCH 02/10] staging: erofs: fix race when the managed cache is enabled Gao Xiang
@ 2018-11-22 10:17   ` Greg Kroah-Hartman
  2018-11-22 10:42     ` Gao Xiang
  2018-11-22 10:19   ` Greg Kroah-Hartman
  1 sibling, 1 reply; 46+ messages in thread
From: Greg Kroah-Hartman @ 2018-11-22 10:17 UTC (permalink / raw)
  To: Gao Xiang; +Cc: devel, linux-erofs, Chao Yu, LKML, weidu.du, Miao Xie

On Tue, Nov 20, 2018 at 10:34:17PM +0800, Gao Xiang wrote:
> When the managed cache is enabled, the last reference count
> of a workgroup must be used for its workstation.
> 
> Otherwise, it could lead to incorrect (un)freezes in
> the reclaim path, and it would be harmful.
> 
> A typical race is as follows:
> 
> Thread 1 (In the reclaim path)  Thread 2
> workgroup_freeze(grp, 1)                                refcnt = 1
> ...
> workgroup_unfreeze(grp, 1)                              refcnt = 1
>                                 workgroup_get(grp)      refcnt = 2 (x)
> workgroup_put(grp)                                      refcnt = 1 (x)
>                                 ...unexpected behaviors
> 
> * grp is detached but still used, which violates cache-managed
>   freeze constraint.
> 
> Reviewed-by: Chao Yu <yuchao0@huawei.com>
> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
> ---
>  drivers/staging/erofs/internal.h |   1 +
>  drivers/staging/erofs/utils.c    | 131 +++++++++++++++++++++++++++------------
>  2 files changed, 93 insertions(+), 39 deletions(-)
> 
> diff --git a/drivers/staging/erofs/internal.h b/drivers/staging/erofs/internal.h
> index 57575c7f5635..89dbd0888e53 100644
> --- a/drivers/staging/erofs/internal.h
> +++ b/drivers/staging/erofs/internal.h
> @@ -250,6 +250,7 @@ static inline bool erofs_workgroup_get(struct erofs_workgroup *grp, int *ocnt)
>  }
>  
>  #define __erofs_workgroup_get(grp)	atomic_inc(&(grp)->refcount)
> +#define __erofs_workgroup_put(grp)	atomic_dec(&(grp)->refcount)

Any specific reason why you are not using the refcount.h api instead of
"doing it yourself" with atomic_inc/dec()?

I'm not rejecting this, just curious.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 02/10] staging: erofs: fix race when the managed cache is enabled
  2018-11-20 14:34 ` [PATCH 02/10] staging: erofs: fix race when the managed cache is enabled Gao Xiang
  2018-11-22 10:17   ` Greg Kroah-Hartman
@ 2018-11-22 10:19   ` Greg Kroah-Hartman
  1 sibling, 0 replies; 46+ messages in thread
From: Greg Kroah-Hartman @ 2018-11-22 10:19 UTC (permalink / raw)
  To: Gao Xiang; +Cc: devel, linux-erofs, Chao Yu, LKML, weidu.du, Miao Xie

On Tue, Nov 20, 2018 at 10:34:17PM +0800, Gao Xiang wrote:
> When the managed cache is enabled, the last reference count
> of a workgroup must be used for its workstation.
> 
> Otherwise, it could lead to incorrect (un)freezes in
> the reclaim path, and it would be harmful.
> 
> A typical race is as follows:
> 
> Thread 1 (In the reclaim path)  Thread 2
> workgroup_freeze(grp, 1)                                refcnt = 1
> ...
> workgroup_unfreeze(grp, 1)                              refcnt = 1
>                                 workgroup_get(grp)      refcnt = 2 (x)
> workgroup_put(grp)                                      refcnt = 1 (x)
>                                 ...unexpected behaviors
> 
> * grp is detached but still used, which violates cache-managed
>   freeze constraint.
> 
> Reviewed-by: Chao Yu <yuchao0@huawei.com>
> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
> ---
>  drivers/staging/erofs/internal.h |   1 +
>  drivers/staging/erofs/utils.c    | 131 +++++++++++++++++++++++++++------------
>  2 files changed, 93 insertions(+), 39 deletions(-)
> 
> diff --git a/drivers/staging/erofs/internal.h b/drivers/staging/erofs/internal.h
> index 57575c7f5635..89dbd0888e53 100644
> --- a/drivers/staging/erofs/internal.h
> +++ b/drivers/staging/erofs/internal.h
> @@ -250,6 +250,7 @@ static inline bool erofs_workgroup_get(struct erofs_workgroup *grp, int *ocnt)
>  }
>  
>  #define __erofs_workgroup_get(grp)	atomic_inc(&(grp)->refcount)
> +#define __erofs_workgroup_put(grp)	atomic_dec(&(grp)->refcount)
>  
>  extern int erofs_workgroup_put(struct erofs_workgroup *grp);
>  
> diff --git a/drivers/staging/erofs/utils.c b/drivers/staging/erofs/utils.c
> index ea8a962e5c95..90de8d9195b7 100644
> --- a/drivers/staging/erofs/utils.c
> +++ b/drivers/staging/erofs/utils.c
> @@ -83,12 +83,21 @@ int erofs_register_workgroup(struct super_block *sb,
>  
>  	grp = xa_tag_pointer(grp, tag);
>  
> -	err = radix_tree_insert(&sbi->workstn_tree,
> -		grp->index, grp);
> +	/*
> +	 * Bump up reference count before making this workgroup
> +	 * visible to other users in order to avoid potential UAF
> +	 * without serialized by erofs_workstn_lock.
> +	 */
> +	__erofs_workgroup_get(grp);
>  
> -	if (!err) {
> -		__erofs_workgroup_get(grp);
> -	}
> +	err = radix_tree_insert(&sbi->workstn_tree,
> +				grp->index, grp);
> +	if (unlikely(err))
> +		/*
> +		 * it's safe to decrease since the workgroup isn't visible
> +		 * and refcount >= 2 (cannot be freezed).
> +		 */
> +		__erofs_workgroup_put(grp);
>  
>  	erofs_workstn_unlock(sbi);
>  	radix_tree_preload_end();
> @@ -97,19 +106,91 @@ int erofs_register_workgroup(struct super_block *sb,
>  
>  extern void erofs_workgroup_free_rcu(struct erofs_workgroup *grp);
>  
> +static void  __erofs_workgroup_free(struct erofs_workgroup *grp)
> +{
> +	atomic_long_dec(&erofs_global_shrink_cnt);
> +	erofs_workgroup_free_rcu(grp);
> +}
> +
>  int erofs_workgroup_put(struct erofs_workgroup *grp)
>  {
>  	int count = atomic_dec_return(&grp->refcount);
>  
>  	if (count == 1)
>  		atomic_long_inc(&erofs_global_shrink_cnt);
> -	else if (!count) {
> -		atomic_long_dec(&erofs_global_shrink_cnt);
> -		erofs_workgroup_free_rcu(grp);
> -	}
> +	else if (!count)
> +		__erofs_workgroup_free(grp);
>  	return count;
>  }
>  
> +#ifdef EROFS_FS_HAS_MANAGED_CACHE
> +/* for cache-managed case, customized reclaim paths exist */
> +static void erofs_workgroup_unfreeze_final(struct erofs_workgroup *grp)
> +{
> +	erofs_workgroup_unfreeze(grp, 0);
> +	__erofs_workgroup_free(grp);
> +}
> +
> +bool erofs_try_to_release_workgroup(struct erofs_sb_info *sbi,
> +				    struct erofs_workgroup *grp,
> +				    bool cleanup)
> +{
> +	/*
> +	 * for managed cache enabled, the refcount of workgroups
> +	 * themselves could be < 0 (freezed). So there is no guarantee
> +	 * that all refcount > 0 if managed cache is enabled.
> +	 */
> +	if (!erofs_workgroup_try_to_freeze(grp, 1))
> +		return false;
> +
> +	/*
> +	 * note that all cached pages should be unlinked
> +	 * before delete it from the radix tree.
> +	 * Otherwise some cached pages of an orphan old workgroup
> +	 * could be still linked after the new one is available.
> +	 */
> +	if (erofs_try_to_free_all_cached_pages(sbi, grp)) {
> +		erofs_workgroup_unfreeze(grp, 1);
> +		return false;
> +	}
> +
> +	/* it is impossible to fail after we freeze the workgroup */
> +	BUG_ON(xa_untag_pointer(radix_tree_delete(&sbi->workstn_tree,
> +						  grp->index)) != grp);
> +

It is not a good idea to crash the system.  Please try to recover from
this; a BUG_ON() implies that the developer has no idea how to handle
this type of error, so either it can never happen (which means that the
BUG_ON can be removed), or you need to fix the logic to handle this type
of issue when it does happen.

I'm not going to take this as-is, sorry.

greg k-h

^ permalink raw reply	[flat|nested] 46+ messages in thread
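
For reference, the ordering the quoted hunk enforces is the general
publish-with-a-reference pattern: bump the reference count before the object
becomes reachable by other threads, and drop it on insertion failure only
while the object is still private. A userspace sketch in C11 atomics (the
name `register_group` and the one-slot table are illustrative stand-ins, not
the kernel API):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Illustrative stand-ins, not the erofs structures themselves. */
struct workgroup {
	atomic_int refcount;
};

static struct workgroup *published_slot;	/* stands in for workstn_tree */

/* Take the reference BEFORE the object is visible; on insertion
 * failure the object was never published, so dropping is safe. */
static int register_group(struct workgroup *grp, int insert_fails)
{
	atomic_fetch_add(&grp->refcount, 1);	/* refcount >= 2: cannot be frozen */

	if (insert_fails) {
		atomic_fetch_sub(&grp->refcount, 1);
		return -1;
	}
	published_slot = grp;			/* now visible to lookups */
	return 0;
}
```

With the old ordering (insert first, then get), a racing reclaimer could
observe the freshly inserted group while it still held only one reference
and freeze it underneath the registering thread.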

* Re: [PATCH 01/10] staging: erofs: fix `trace_erofs_readpage' position
  2018-11-20 14:34 ` [PATCH 01/10] staging: erofs: fix `trace_erofs_readpage' position Gao Xiang
@ 2018-11-22 10:19   ` Greg Kroah-Hartman
  2018-11-22 10:49     ` Gao Xiang
  0 siblings, 1 reply; 46+ messages in thread
From: Greg Kroah-Hartman @ 2018-11-22 10:19 UTC (permalink / raw)
  To: Gao Xiang; +Cc: devel, linux-erofs, Chao Yu, LKML, weidu.du, Miao Xie

On Tue, Nov 20, 2018 at 10:34:16PM +0800, Gao Xiang wrote:
> `trace_erofs_readpage' should be placed in .readpage()
> rather than in the internal `z_erofs_do_read_page'.

Why?  What happens with the code today?

> 
> Fixes: 284db12cfda3 ("staging: erofs: add trace points for reading zipped data")

Should this get into 4.20-final?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 03/10] staging: erofs: atomic_cond_read_relaxed on ref-locked workgroup
  2018-11-20 14:34 ` [PATCH 03/10] staging: erofs: atomic_cond_read_relaxed on ref-locked workgroup Gao Xiang
@ 2018-11-22 10:20   ` Greg Kroah-Hartman
  0 siblings, 0 replies; 46+ messages in thread
From: Greg Kroah-Hartman @ 2018-11-22 10:20 UTC (permalink / raw)
  To: Gao Xiang; +Cc: devel, linux-erofs, Chao Yu, LKML, weidu.du, Miao Xie

On Tue, Nov 20, 2018 at 10:34:18PM +0800, Gao Xiang wrote:
> It's better to use atomic_cond_read_relaxed, which is implemented
> with hardware instructions that monitor a variable for changes
> (currently on ARM64), instead of open-coded busy waiting.
> 
> Reviewed-by: Chao Yu <yuchao0@huawei.com>
> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
> ---
>  drivers/staging/erofs/internal.h | 30 ++++++++++++++++++------------
>  1 file changed, 18 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/staging/erofs/internal.h b/drivers/staging/erofs/internal.h
> index 89dbd0888e53..eb80ba44d072 100644
> --- a/drivers/staging/erofs/internal.h
> +++ b/drivers/staging/erofs/internal.h
> @@ -221,23 +221,29 @@ static inline void erofs_workgroup_unfreeze(
>  	preempt_enable();
>  }
>  
> +#if defined(CONFIG_SMP)
> +static inline int erofs_wait_on_workgroup_freezed(struct erofs_workgroup *grp)
> +{
> +	return atomic_cond_read_relaxed(&grp->refcount,
> +					VAL != EROFS_LOCKED_MAGIC);
> +}
> +#else
> +static inline int erofs_wait_on_workgroup_freezed(struct erofs_workgroup *grp)
> +{
> +	int v = atomic_read(&grp->refcount);

Again, why not use the refcount api instead of doing this yourself?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 46+ messages in thread
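
For readers unfamiliar with the helper under discussion:
atomic_cond_read_relaxed(v, cond) spins with relaxed reads until the
condition holds, and on arm64 it can use event-monitor instructions rather
than pure spinning. A userspace equivalent of the open-coded loop it
replaces (LOCKED_MAGIC is an assumed stand-in for EROFS_LOCKED_MAGIC):

```c
#include <assert.h>
#include <stdatomic.h>

#define LOCKED_MAGIC (-1)	/* assumed sentinel for illustration */

/* Spin with relaxed loads until the refcount leaves the locked
 * sentinel, then return the observed value -- roughly what
 * atomic_cond_read_relaxed(&refcount, VAL != EROFS_LOCKED_MAGIC)
 * does, minus the arm64 wait-for-event optimization. */
static int wait_on_frozen(atomic_int *refcnt)
{
	int v;

	do {
		v = atomic_load_explicit(refcnt, memory_order_relaxed);
	} while (v == LOCKED_MAGIC);
	return v;
}
```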

* Re: [PATCH 04/10] staging: erofs: fix `erofs_workgroup_{try_to_freeze, unfreeze}'
  2018-11-20 14:34 ` [PATCH 04/10] staging: erofs: fix `erofs_workgroup_{try_to_freeze, unfreeze}' Gao Xiang
@ 2018-11-22 10:21   ` Greg Kroah-Hartman
  2018-11-22 10:29     ` Gao Xiang
  0 siblings, 1 reply; 46+ messages in thread
From: Greg Kroah-Hartman @ 2018-11-22 10:21 UTC (permalink / raw)
  To: Gao Xiang; +Cc: devel, linux-erofs, Chao Yu, LKML, weidu.du, Miao Xie

On Tue, Nov 20, 2018 at 10:34:19PM +0800, Gao Xiang wrote:
> There are two minor issues in the current freeze interface:
> 
>    1) The freeze interfaces are unrelated to CONFIG_DEBUG_SPINLOCK;
>       fix the incorrect conditions;
> 
>    2) For SMP platforms, it should also disable preemption before
>       doing atomic_cmpxchg, in case some high-priority task preempts
>       between atomic_cmpxchg and preempt_disable() and then spins on
>       the locked refcount later.

spinning on a refcount implies that you are trying to do your own type
of locking.  Why not use the in-kernel locking api instead?  It will
always do better than trying to do your own logic as the developers
there know locking across all types of cpus better than filesystem
developers :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 05/10] staging: erofs: add a full barrier in erofs_workgroup_unfreeze
  2018-11-20 14:34 ` [PATCH 05/10] staging: erofs: add a full barrier in erofs_workgroup_unfreeze Gao Xiang
@ 2018-11-22 10:22   ` Greg Kroah-Hartman
  2018-11-22 10:56     ` Gao Xiang
  0 siblings, 1 reply; 46+ messages in thread
From: Greg Kroah-Hartman @ 2018-11-22 10:22 UTC (permalink / raw)
  To: Gao Xiang; +Cc: devel, linux-erofs, Chao Yu, LKML, weidu.du, Miao Xie

On Tue, Nov 20, 2018 at 10:34:20PM +0800, Gao Xiang wrote:
> Just like other generic locks, insert a full barrier
> in case of memory reorder.
> 
> Reviewed-by: Chao Yu <yuchao0@huawei.com>
> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
> ---
>  drivers/staging/erofs/internal.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/staging/erofs/internal.h b/drivers/staging/erofs/internal.h
> index 2e0ef92c138b..f77653d33633 100644
> --- a/drivers/staging/erofs/internal.h
> +++ b/drivers/staging/erofs/internal.h
> @@ -209,6 +209,7 @@ static inline bool erofs_workgroup_try_to_freeze(struct erofs_workgroup *grp,
>  static inline void erofs_workgroup_unfreeze(struct erofs_workgroup *grp,
>  					    int orig_val)
>  {
> +	smp_mb();

Please document this memory barrier.  It does not make much sense to
me...

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 07/10] staging: erofs: separate into init_once / always
  2018-11-20 14:34 ` [PATCH 07/10] staging: erofs: separate into init_once / always Gao Xiang
@ 2018-11-22 10:23   ` Greg Kroah-Hartman
  2018-11-22 10:34     ` Gao Xiang
  0 siblings, 1 reply; 46+ messages in thread
From: Greg Kroah-Hartman @ 2018-11-22 10:23 UTC (permalink / raw)
  To: Gao Xiang; +Cc: devel, linux-erofs, Chao Yu, LKML, weidu.du, Miao Xie

On Tue, Nov 20, 2018 at 10:34:22PM +0800, Gao Xiang wrote:
> `z_erofs_vle_workgroup' is heavily allocated during decompression;
> for example, it redundantly resets 32 bytes on 64-bit platforms even
> though only Z_EROFS_VLE_INLINE_PAGEVECS + Z_EROFS_CLUSTER_MAX_PAGES
> (4 by default) pages are stored in `z_erofs_vle_workgroup'.
> 
> As another example, `struct mutex' takes 72 bytes on our kirin
> 64-bit platforms; it's unnecessary to reset it first and then
> initialize it each time.
> 
> Let's avoid filling the whole `z_erofs_vle_workgroup' with 0 at first,
> since most fields are reinitialized to meaningful values later,
> and the pagevec does not need to be initialized at all.
> 
> Reviewed-by: Chao Yu <yuchao0@huawei.com>
> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
> ---
>  drivers/staging/erofs/unzip_vle.c | 34 +++++++++++++++++++++++++++++-----
>  1 file changed, 29 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/staging/erofs/unzip_vle.c b/drivers/staging/erofs/unzip_vle.c
> index ede3383ac601..4e5843e8ee35 100644
> --- a/drivers/staging/erofs/unzip_vle.c
> +++ b/drivers/staging/erofs/unzip_vle.c
> @@ -43,12 +43,38 @@ static inline int init_unzip_workqueue(void)
>  	return z_erofs_workqueue ? 0 : -ENOMEM;
>  }
>  
> +static void init_once(void *ptr)
> +{
> +	struct z_erofs_vle_workgroup *grp = ptr;
> +	struct z_erofs_vle_work *const work =
> +		z_erofs_vle_grab_primary_work(grp);
> +	unsigned int i;
> +
> +	mutex_init(&work->lock);
> +	work->nr_pages = 0;
> +	work->vcnt = 0;
> +	for (i = 0; i < Z_EROFS_CLUSTER_MAX_PAGES; ++i)
> +		grp->compressed_pages[i] = NULL;
> +}
> +
> +static void init_always(struct z_erofs_vle_workgroup *grp)
> +{
> +	struct z_erofs_vle_work *const work =
> +		z_erofs_vle_grab_primary_work(grp);
> +
> +	atomic_set(&grp->obj.refcount, 1);
> +	grp->flags = 0;
> +
> +	DBG_BUGON(work->nr_pages);
> +	DBG_BUGON(work->vcnt);

How can these ever be triggered?  I understand the need for debugging
code when you are writing code, but at this point it shouldn't be needed
anymore, right?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 08/10] staging: erofs: locked before registering for all new workgroups
  2018-11-20 14:34 ` [PATCH 08/10] staging: erofs: locked before registering for all new workgroups Gao Xiang
@ 2018-11-22 10:24   ` Greg Kroah-Hartman
  2018-11-22 10:35     ` Gao Xiang
  0 siblings, 1 reply; 46+ messages in thread
From: Greg Kroah-Hartman @ 2018-11-22 10:24 UTC (permalink / raw)
  To: Gao Xiang; +Cc: devel, linux-erofs, Chao Yu, LKML, weidu.du, Miao Xie

On Tue, Nov 20, 2018 at 10:34:23PM +0800, Gao Xiang wrote:
> Let's make sure that whoever registers a workgroup also takes the
> primary work lock first, for two reasons:
>   1) There's no need to introduce such a race window (and the
>      consequent overhead) between registering and locking; other
>      tasks could break in by chance, and the race brings no benefit
>      at all;
> 
>   2) It's better to hold the primary work lock when a workgroup is
>      registered in order to apply the cache management policy; for
>      example, if some other task breaks in, it could fall back to
>      in-place decompression rather than cached decompression.
> 
> Reviewed-by: Chao Yu <yuchao0@huawei.com>
> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
> ---
>  drivers/staging/erofs/unzip_vle.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/staging/erofs/unzip_vle.c b/drivers/staging/erofs/unzip_vle.c
> index 4e5843e8ee35..a1376f3c6065 100644
> --- a/drivers/staging/erofs/unzip_vle.c
> +++ b/drivers/staging/erofs/unzip_vle.c
> @@ -420,18 +420,22 @@ z_erofs_vle_work_register(const struct z_erofs_vle_work_finder *f,
>  	work = z_erofs_vle_grab_primary_work(grp);
>  	work->pageofs = f->pageofs;
>  
> +	/* lock all primary followed works before visible to others */
> +	if (unlikely(!mutex_trylock(&work->lock)))
> +		/* for a new workgroup, try_lock *never* fails */
> +		DBG_BUGON(1);

Again, drop this, if it never fails, then there's no need for this.  If
it can fail, then properly handle it.

And trylock can fail, so this needs to be fixed.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 04/10] staging: erofs: fix `erofs_workgroup_{try_to_freeze, unfreeze}'
  2018-11-22 10:21   ` Greg Kroah-Hartman
@ 2018-11-22 10:29     ` Gao Xiang
  2018-11-22 11:03       ` Greg Kroah-Hartman
  2018-11-22 11:05       ` Greg Kroah-Hartman
  0 siblings, 2 replies; 46+ messages in thread
From: Gao Xiang @ 2018-11-22 10:29 UTC (permalink / raw)
  To: Greg Kroah-Hartman; +Cc: devel, linux-erofs, Chao Yu, LKML, weidu.du, Miao Xie

Hi Greg,

On 2018/11/22 18:21, Greg Kroah-Hartman wrote:
> On Tue, Nov 20, 2018 at 10:34:19PM +0800, Gao Xiang wrote:
>> There are two minor issues in the current freeze interface:
>>
>>    1) The freeze interfaces are unrelated to CONFIG_DEBUG_SPINLOCK;
>>       fix the incorrect conditions;
>>
>>    2) For SMP platforms, it should also disable preemption before
>>       doing atomic_cmpxchg, in case some high-priority task preempts
>>       between atomic_cmpxchg and preempt_disable() and then spins on
>>       the locked refcount later.
> 
> spinning on a refcount implies that you are trying to do your own type
> of locking.  Why not use the in-kernel locking api instead?  It will
> always do better than trying to do your own logic as the developers
> there know locking across all types of cpus better than filesystem
> developers :)

It is because the refcount also plays the role of a spinlock when it holds
a specific value (== EROFS_LOCKED_MAGIC); there is no need to introduce a
separate lock since the spin window is small.

Thanks,
Gao Xiang

> 
> thanks,
> 
> greg k-h
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread
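
The scheme described above can be sketched in C11: a cmpxchg succeeds only
while the count still holds the expected reference value, and the sentinel
keeps concurrent freezers out until the original value is restored.
Everything below is an illustrative userspace model, not the kernel code
(it omits the preemption-disable step that patch 04 fixes):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define LOCKED_MAGIC (-1)	/* assumed stand-in for EROFS_LOCKED_MAGIC */

/* "Freeze" by swinging the count to the sentinel; fails if the
 * count does not hold the expected value (e.g. already frozen). */
static bool try_to_freeze(atomic_int *refcnt, int expect)
{
	return atomic_compare_exchange_strong(refcnt, &expect, LOCKED_MAGIC);
}

/* "Unfreeze" by restoring the original count. */
static void unfreeze(atomic_int *refcnt, int orig)
{
	atomic_store(refcnt, orig);
}
```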

* Re: [PATCH 07/10] staging: erofs: separate into init_once / always
  2018-11-22 10:23   ` Greg Kroah-Hartman
@ 2018-11-22 10:34     ` Gao Xiang
  2018-11-22 11:05       ` Greg Kroah-Hartman
  0 siblings, 1 reply; 46+ messages in thread
From: Gao Xiang @ 2018-11-22 10:34 UTC (permalink / raw)
  To: Greg Kroah-Hartman; +Cc: devel, linux-erofs, Chao Yu, LKML, weidu.du, Miao Xie

Hi Greg,

On 2018/11/22 18:23, Greg Kroah-Hartman wrote:
>> +
>> +	DBG_BUGON(work->nr_pages);
>> +	DBG_BUGON(work->vcnt);
> How can these ever be triggered?  I understand the need for debugging
> code when you are writing code, but at this point it shouldn't be needed
> anymore, right?

I need to ensure that some fields are 0 when a new workgroup is created
(work->nr_pages and work->vcnt are usually 0 after the previous workgroup
is freed). But that is not obvious; it is only promised by the current logic.

In order not to introduce such an issue in the future, and in case there is
some potential race (work->nr_pages or work->vcnt != 0 left over from the
previous workgroup), it needs to be noticed by developers as early as
possible.

Thanks,
Gao Xiang

> 
> thanks,
> 
> greg k-h

^ permalink raw reply	[flat|nested] 46+ messages in thread
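
The split under discussion mirrors the SLAB constructor pattern: `init_once`
establishes invariants when the backing memory is first constructed,
`init_always` re-arms the cheap per-allocation fields, and freed objects
must be returned in the constructed state. A hypothetical one-slot cache in
plain C to show the contract (none of these names are the erofs functions):

```c
#include <assert.h>

/* Hypothetical model of the ctor split; not the erofs code itself. */
struct work {
	int nr_pages;
	int vcnt;
	int refcount;
};

static struct work slot;	/* one-object "slab" for illustration */
static int constructed;

static void init_once(struct work *w)	/* runs once per backing object */
{
	w->nr_pages = 0;
	w->vcnt = 0;
}

static void init_always(struct work *w)	/* runs on every allocation */
{
	w->refcount = 1;
}

static struct work *cache_alloc(void)
{
	if (!constructed) {
		init_once(&slot);
		constructed = 1;
	}
	init_always(&slot);
	return &slot;
}

/* The caller's contract: free only in the init_once state, so the next
 * allocation can skip re-zeroing -- this is what the DBG_BUGONs assert. */
static void cache_free(struct work *w)
{
	w->nr_pages = 0;
	w->vcnt = 0;
}
```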

* Re: [PATCH 08/10] staging: erofs: locked before registering for all new workgroups
  2018-11-22 10:24   ` Greg Kroah-Hartman
@ 2018-11-22 10:35     ` Gao Xiang
  0 siblings, 0 replies; 46+ messages in thread
From: Gao Xiang @ 2018-11-22 10:35 UTC (permalink / raw)
  To: Greg Kroah-Hartman; +Cc: devel, linux-erofs, Chao Yu, LKML, weidu.du, Miao Xie

Hi Greg,

On 2018/11/22 18:24, Greg Kroah-Hartman wrote:
>> +	/* lock all primary followed works before visible to others */
>> +	if (unlikely(!mutex_trylock(&work->lock)))
>> +		/* for a new workgroup, try_lock *never* fails */
>> +		DBG_BUGON(1);
> Again, drop this, if it never fails, then there's no need for this.  If
> it can fail, then properly handle it.
> 
> And trylock can fail, so this needs to be fixed.

OK, I will drop this.

Thanks,
Gao Xiang

> 
> thanks,

^ permalink raw reply	[flat|nested] 46+ messages in thread
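
The invariant behind the dropped DBG_BUGON is that a trylock on a lock
nobody else can reach yet always succeeds. Modeled with a C11 atomic_flag
standing in for the mutex (purely illustrative):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* A fresh, unpublished lock: the first trylock must succeed because
 * no other thread can see the object yet; only a later attempt on the
 * now-held lock fails. atomic_flag is a stand-in for struct mutex. */
static bool trylock(atomic_flag *lk)
{
	return !atomic_flag_test_and_set(lk);
}
```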

* Re: [PATCH 02/10] staging: erofs: fix race when the managed cache is enabled
  2018-11-22 10:17   ` Greg Kroah-Hartman
@ 2018-11-22 10:42     ` Gao Xiang
  2018-11-22 11:06       ` Greg Kroah-Hartman
  0 siblings, 1 reply; 46+ messages in thread
From: Gao Xiang @ 2018-11-22 10:42 UTC (permalink / raw)
  To: Greg Kroah-Hartman; +Cc: devel, linux-erofs, Chao Yu, LKML, weidu.du, Miao Xie

Hi Greg,

On 2018/11/22 18:17, Greg Kroah-Hartman wrote:
> Any specific reason why you are not using the refcount.h api instead of
> "doing it yourself" with atomic_inc/dec()?
> 
> I'm not rejecting this, just curious.

As I explained in the previous email,
Re: [PATCH 04/10] staging: erofs: fix `erofs_workgroup_{try_to_freeze, unfreeze}'

we need such a function: when the value is >= 0, it plays the role of a
refcount, but when the refcount == EROFS_LOCKED_MAGIC (< 0, which refcount.h
does not allow), it acts as a lock. And there is actually no need to
introduce a separate spinlock_t because we don't care about its locking
performance (it is rarely locked), and the corresponding struct is already
too large for now; we need to decrease its size.

Thanks,
Gao Xiang

> 
> thanks,
> 
> greg k-h

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 01/10] staging: erofs: fix `trace_erofs_readpage' position
  2018-11-22 10:19   ` Greg Kroah-Hartman
@ 2018-11-22 10:49     ` Gao Xiang
  2018-11-22 11:00       ` Greg Kroah-Hartman
  0 siblings, 1 reply; 46+ messages in thread
From: Gao Xiang @ 2018-11-22 10:49 UTC (permalink / raw)
  To: Greg Kroah-Hartman; +Cc: devel, linux-erofs, Chao Yu, LKML, weidu.du, Miao Xie

Hi Greg,

On 2018/11/22 18:19, Greg Kroah-Hartman wrote:
> On Tue, Nov 20, 2018 at 10:34:16PM +0800, Gao Xiang wrote:
>> `trace_erofs_readpage' should be placed in .readpage()
>> rather than in the internal `z_erofs_do_read_page'.
> Why?  What happens with the code today?
trace_erofs_readpage is used to trace the .readpage() interface (it represents sync read)
rather than its internal implementation z_erofs_do_read_page (which both .readpage()
and .readpages() use). Chen Gong placed the tracepoint in the wrong place by mistake,
and we found it in our internal tests using this tracepoint.

> 
>> Fixes: 284db12cfda3 ("staging: erofs: add trace points for reading zipped data")
> Should this get into 4.20-final?
I think so; it is not very important, but I think it should be fixed...

Thanks,
Gao Xiang

> 
> thanks,
> 
> greg k-h

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 05/10] staging: erofs: add a full barrier in erofs_workgroup_unfreeze
  2018-11-22 10:22   ` Greg Kroah-Hartman
@ 2018-11-22 10:56     ` Gao Xiang
  2018-11-22 18:50       ` Andrea Parri
  0 siblings, 1 reply; 46+ messages in thread
From: Gao Xiang @ 2018-11-22 10:56 UTC (permalink / raw)
  To: Greg Kroah-Hartman; +Cc: devel, linux-erofs, Chao Yu, LKML, weidu.du, Miao Xie

Hi Greg,

On 2018/11/22 18:22, Greg Kroah-Hartman wrote:
> Please document this memory barrier.  It does not make much sense to
> me...

Because we need to make the other observers notice the latest values modified
during this locking period before unfreezing the whole workgroup. One way is
to use a full memory barrier and the other is to use ACQUIRE and RELEASE
semantics; we selected the first one.

Hmmm... ok, I will add a short comment to explain this, but I think that is
plain enough for a lock...

Thanks,
Gao Xiang

> 
> thanks,
> 
> greg k-h

^ permalink raw reply	[flat|nested] 46+ messages in thread
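
In C11 terms, the requirement stated above is that writes made while the
group was frozen must not be reordered past the store that makes the count
look unlocked again. A full fence before the restoring store, modeling the
added smp_mb() (the kernel primitives differ in detail; this is a sketch):

```c
#include <assert.h>
#include <stdatomic.h>

static int payload;		/* data modified while "frozen" */
static atomic_int refcnt;

/* Publish all prior writes before the count becomes visible as
 * unlocked; the seq_cst fence models the smp_mb() the patch adds
 * in front of the plain atomic_set(). */
static void unfreeze(int orig)
{
	atomic_thread_fence(memory_order_seq_cst);
	atomic_store_explicit(&refcnt, orig, memory_order_relaxed);
}
```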

* Re: [PATCH 01/10] staging: erofs: fix `trace_erofs_readpage' position
  2018-11-22 10:49     ` Gao Xiang
@ 2018-11-22 11:00       ` Greg Kroah-Hartman
  0 siblings, 0 replies; 46+ messages in thread
From: Greg Kroah-Hartman @ 2018-11-22 11:00 UTC (permalink / raw)
  To: Gao Xiang; +Cc: devel, linux-erofs, Chao Yu, LKML, weidu.du, Miao Xie

On Thu, Nov 22, 2018 at 06:49:53PM +0800, Gao Xiang wrote:
> Hi Greg,
> 
> On 2018/11/22 18:19, Greg Kroah-Hartman wrote:
> > On Tue, Nov 20, 2018 at 10:34:16PM +0800, Gao Xiang wrote:
> >> `trace_erofs_readpage' should be placed in .readpage()
> >> rather than in the internal `z_erofs_do_read_page'.
> > Why?  What happens with the code today?
> trace_erofs_readpage is used to trace the .readpage() interface (it represents sync read)
> rather than its internal implementation z_erofs_do_read_page (which both .readpage()
> and .readpages() use). Chen Gong placed the tracepoint in the wrong place by mistake,
> and we found it in our internal tests using this tracepoint.

Ok, you should put this in the changelog text when you redo this series.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 04/10] staging: erofs: fix `erofs_workgroup_{try_to_freeze, unfreeze}'
  2018-11-22 10:29     ` Gao Xiang
@ 2018-11-22 11:03       ` Greg Kroah-Hartman
  2018-11-22 11:05       ` Greg Kroah-Hartman
  1 sibling, 0 replies; 46+ messages in thread
From: Greg Kroah-Hartman @ 2018-11-22 11:03 UTC (permalink / raw)
  To: Gao Xiang; +Cc: devel, linux-erofs, Chao Yu, LKML, weidu.du, Miao Xie

On Thu, Nov 22, 2018 at 06:29:34PM +0800, Gao Xiang wrote:
> Hi Greg,
> 
> On 2018/11/22 18:21, Greg Kroah-Hartman wrote:
> > On Tue, Nov 20, 2018 at 10:34:19PM +0800, Gao Xiang wrote:
> >> There are two minor issues in the current freeze interface:
> >>
> >>    1) The freeze interfaces are unrelated to CONFIG_DEBUG_SPINLOCK;
> >>       fix the incorrect conditions;
> >>
> >>    2) For SMP platforms, it should also disable preemption before
> >>       doing atomic_cmpxchg, in case some high-priority task preempts
> >>       between atomic_cmpxchg and preempt_disable() and then spins on
> >>       the locked refcount later.
> > 
> > spinning on a refcount implies that you are trying to do your own type
> > of locking.  Why not use the in-kernel locking api instead?  It will
> > always do better than trying to do your own logic as the developers
> > there know locking across all types of cpus better than filesystem
> > developers :)
> 
> It is because the refcount also plays the role of a spinlock when it holds
> a specific value (== EROFS_LOCKED_MAGIC); there is no need to introduce a
> separate lock since the spin window is small.

Do not try to overload a reference count as a spinlock; you will run
into problems and in the end have to implement everything that the core
locking code has done.

Your use of this is highly suspect.  What happens when you use a "real"
spinlock and a "real" reference count instead?  Those are two different
things from a logical point of view, and you are mixing them together,
which is odd.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 07/10] staging: erofs: separate into init_once / always
  2018-11-22 10:34     ` Gao Xiang
@ 2018-11-22 11:05       ` Greg Kroah-Hartman
  2018-11-22 11:11         ` Gao Xiang
  0 siblings, 1 reply; 46+ messages in thread
From: Greg Kroah-Hartman @ 2018-11-22 11:05 UTC (permalink / raw)
  To: Gao Xiang; +Cc: devel, linux-erofs, Chao Yu, LKML, weidu.du, Miao Xie

On Thu, Nov 22, 2018 at 06:34:10PM +0800, Gao Xiang wrote:
> Hi Greg,
> 
> On 2018/11/22 18:23, Greg Kroah-Hartman wrote:
> >> +
> >> +	DBG_BUGON(work->nr_pages);
> >> +	DBG_BUGON(work->vcnt);
> > How can these ever be triggered?  I understand the need for debugging
> > code when you are writing code, but at this point it shouldn't be needed
> > anymore, right?
> 
> I need to ensure that some fields are 0 when a new workgroup is created
> (work->nr_pages and work->vcnt are usually 0 after the previous workgroup
> is freed). But that is not obvious; it is only promised by the current logic.

Then delete these lines if they can never happen :)

> In order not to introduce such an issue in the future, and in case there
> is some potential race (work->nr_pages or work->vcnt != 0 left over from
> the previous workgroup), it needs to be noticed by developers as early as
> possible.

Then make it a real call, do not wrap it in odd macros that do not
really explain why it is "debugging only".  Your code is "real" now,
make the logic real for all developers and users.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 04/10] staging: erofs: fix `erofs_workgroup_{try_to_freeze, unfreeze}'
  2018-11-22 10:29     ` Gao Xiang
  2018-11-22 11:03       ` Greg Kroah-Hartman
@ 2018-11-22 11:05       ` Greg Kroah-Hartman
  2018-11-22 11:22         ` Gao Xiang
  1 sibling, 1 reply; 46+ messages in thread
From: Greg Kroah-Hartman @ 2018-11-22 11:05 UTC (permalink / raw)
  To: Gao Xiang; +Cc: devel, linux-erofs, Chao Yu, LKML, weidu.du, Miao Xie

On Thu, Nov 22, 2018 at 06:29:34PM +0800, Gao Xiang wrote:
> Hi Greg,
> 
> On 2018/11/22 18:21, Greg Kroah-Hartman wrote:
> > On Tue, Nov 20, 2018 at 10:34:19PM +0800, Gao Xiang wrote:
> >> There are two minor issues in the current freeze interface:
> >>
> >>    1) Freeze interfaces have not related with CONFIG_DEBUG_SPINLOCK,
> >>       therefore fix the incorrect conditions;
> >>
> >>    2) For SMP platforms, it should also disable preemption before
> >>       doing atomic_cmpxchg in case that some high priority tasks
> >>       preempt between atomic_cmpxchg and disable_preempt, then spin
> >>       on the locked refcount later.
> > 
> > spinning on a refcount implies that you are trying to do your own type
> > of locking.  Why not use the in-kernel locking api instead?  It will
> > always do better than trying to do your own logic as the developers
> > there know locking across all types of cpus better than filesystem
> > developers :)
> 
> It is because the refcount also plays the role of a spinlock when it holds
> a specific value (== EROFS_LOCKED_MAGIC); there is no need to introduce a
> separate lock since the spin window is small.

As I said in another email, doing two things with one variable is odd as
those are two different types of functions.

greg k-h

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 02/10] staging: erofs: fix race when the managed cache is enabled
  2018-11-22 10:42     ` Gao Xiang
@ 2018-11-22 11:06       ` Greg Kroah-Hartman
  2018-11-22 11:43         ` Gao Xiang
  0 siblings, 1 reply; 46+ messages in thread
From: Greg Kroah-Hartman @ 2018-11-22 11:06 UTC (permalink / raw)
  To: Gao Xiang; +Cc: devel, linux-erofs, Chao Yu, LKML, weidu.du, Miao Xie

On Thu, Nov 22, 2018 at 06:42:52PM +0800, Gao Xiang wrote:
> Hi Greg,
> 
> On 2018/11/22 18:17, Greg Kroah-Hartman wrote:
> > Any specific reason why you are not using the refcount.h api instead of
> > "doing it yourself" with atomic_inc/dec()?
> > 
> > I'm not rejecting this, just curious.
> 
> As I explained in the previous email,
> Re: [PATCH 04/10] staging: erofs: fix `erofs_workgroup_{try_to_freeze, unfreeze}'
> 
> we need such a function when the value is >= 0, it plays as a refcount,
> but when the refcount == EROFS_LOCKED_MAGIC (<0, but not 0 as refcount.h),
> and actually there is no need to introduce a seperate spinlock_t because
> we don't actually care about its performance (rarely locked). and
> the corresponding struct is too large for now, we need to decrease its size.

Why do you need to decrease the size?  How many of these structures are
created?

And you will care about the performance when a lock is being held, as is
evident by your logic to try to fix those issues in this patch series.
Using a "real" lock will solve all of that and keep you from having to
implement it all "by hand".

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 07/10] staging: erofs: separate into init_once / always
  2018-11-22 11:05       ` Greg Kroah-Hartman
@ 2018-11-22 11:11         ` Gao Xiang
  2018-11-22 11:26           ` Greg Kroah-Hartman
  0 siblings, 1 reply; 46+ messages in thread
From: Gao Xiang @ 2018-11-22 11:11 UTC (permalink / raw)
  To: Greg Kroah-Hartman; +Cc: devel, linux-erofs, Chao Yu, LKML, weidu.du, Miao Xie

Hi Greg,

On 2018/11/22 19:05, Greg Kroah-Hartman wrote:
> On Thu, Nov 22, 2018 at 06:34:10PM +0800, Gao Xiang wrote:
>> Hi Greg,
>>
>> On 2018/11/22 18:23, Greg Kroah-Hartman wrote:
>>>> +
>>>> +	DBG_BUGON(work->nr_pages);
>>>> +	DBG_BUGON(work->vcnt);
>>> How can these ever be triggered?  I understand the need for debugging
>>> code when you are writing code, but at this point it shouldn't be needed
>>> anymore, right?
>>
>> I need to avoid some fields is not 0 when the new workgroup is created (because
>> work->nr_pages and work->vcnt == 0 usually after the previous workgroup is freed).
>> But that is not obvious, it is promised by the current logic.
> 
> Then delete these lines if they can never happen :)

I don't know how else to observe such a race in our beta tests and from community users.

Because if the kernel crashes, we can collect the whole kernel dump to observe the memory
and all registers; if we only get a warning, it is not easy to capture the state as early as possible.

Thanks,
Gao Xiang

> 
>> In order to not introduce such a issue in the future, or there are some potential
>> race (work->nr_pages and work->vcnt != 0 when the previous workgroup == 0), it need
>> to be noticed to developpers as early as possible.
> 
> Then make it a real call, do not wrap it in odd macros that do not
> really explain why it is "debugging only".  Your code is "real" now,
> make the logic real for all developers and users.
> 
> thanks,
> 
> greg k-h
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 04/10] staging: erofs: fix `erofs_workgroup_{try_to_freeze, unfreeze}'
  2018-11-22 11:05       ` Greg Kroah-Hartman
@ 2018-11-22 11:22         ` Gao Xiang
  0 siblings, 0 replies; 46+ messages in thread
From: Gao Xiang @ 2018-11-22 11:22 UTC (permalink / raw)
  To: Greg Kroah-Hartman; +Cc: devel, linux-erofs, Chao Yu, LKML, weidu.du, Miao Xie

Hi Greg,

On 2018/11/22 19:05, Greg Kroah-Hartman wrote:
> As I said in another email, doing two things with one variable is odd as
> those are two different types of functions.

I think lockref_put_or_lock does a similar thing; it is heavily used in dcache.c, but
1) 0 is special for us; we need to lock on a < 0 value rather than 0.
2) a spinlock_t is too large for us because every compressed page has this structure,
   yet the locking rarely happens.

Thanks,
Gao Xiang

> 
> greg k-h

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 07/10] staging: erofs: separate into init_once / always
  2018-11-22 11:11         ` Gao Xiang
@ 2018-11-22 11:26           ` Greg Kroah-Hartman
  2018-11-22 11:37             ` Gao Xiang
  2018-11-22 12:00             ` Gao Xiang
  0 siblings, 2 replies; 46+ messages in thread
From: Greg Kroah-Hartman @ 2018-11-22 11:26 UTC (permalink / raw)
  To: Gao Xiang; +Cc: devel, linux-erofs, Chao Yu, LKML, weidu.du, Miao Xie

On Thu, Nov 22, 2018 at 07:11:08PM +0800, Gao Xiang wrote:
> Hi Greg,
> 
> On 2018/11/22 19:05, Greg Kroah-Hartman wrote:
> > On Thu, Nov 22, 2018 at 06:34:10PM +0800, Gao Xiang wrote:
> >> Hi Greg,
> >>
> >> On 2018/11/22 18:23, Greg Kroah-Hartman wrote:
> >>>> +
> >>>> +	DBG_BUGON(work->nr_pages);
> >>>> +	DBG_BUGON(work->vcnt);
> >>> How can these ever be triggered?  I understand the need for debugging
> >>> code when you are writing code, but at this point it shouldn't be needed
> >>> anymore, right?
> >>
> >> I need to avoid some fields is not 0 when the new workgroup is created (because
> >> work->nr_pages and work->vcnt == 0 usually after the previous workgroup is freed).
> >> But that is not obvious, it is promised by the current logic.
> > 
> > Then delete these lines if they can never happen :)
> 
> I don't know how to observe such a race in our beta test and community users.

	/*
	 * Let developers know something went really wrong with their
	 * initialization code
	 */
	if (!work->nr_pages) {
		pr_err("nr_pages == NULL!");
		WARN_ON(1);
	}
	if (!work->vcnt) {
		pr_err("vcnt == NULL!");
		WARN_ON(1);
	}

or something like that.

Don't make people rebuild your code with different options for
debugging.  That will never work in the 'real world' when people start
using the code.  You need to have things enabled for people all the
time, which is why we have dynamic debugging in the kernel now, and not
a zillion different "DRIVER_DEBUG" build options anymore.

> Because if the kernel is crashed, we could collect the whole kernel dump to observe the memory
> and all registers, if we only have some warning, it will be not easy to get the state as early as possible.

When the kernel crashes, geting a dump is hard on almost all hardware.
It is only rare systems that you can get a kernel dump.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 07/10] staging: erofs: separate into init_once / always
  2018-11-22 11:26           ` Greg Kroah-Hartman
@ 2018-11-22 11:37             ` Gao Xiang
  2018-11-22 12:00             ` Gao Xiang
  1 sibling, 0 replies; 46+ messages in thread
From: Gao Xiang @ 2018-11-22 11:37 UTC (permalink / raw)
  To: Greg Kroah-Hartman; +Cc: devel, linux-erofs, Chao Yu, LKML, weidu.du, Miao Xie

Hi Greg,

On 2018/11/22 19:26, Greg Kroah-Hartman wrote:
> On Thu, Nov 22, 2018 at 07:11:08PM +0800, Gao Xiang wrote:
>> Hi Greg,
>>
>> On 2018/11/22 19:05, Greg Kroah-Hartman wrote:
>>> On Thu, Nov 22, 2018 at 06:34:10PM +0800, Gao Xiang wrote:
>>>> Hi Greg,
>>>>
>>>> On 2018/11/22 18:23, Greg Kroah-Hartman wrote:
>>>>>> +
>>>>>> +	DBG_BUGON(work->nr_pages);
>>>>>> +	DBG_BUGON(work->vcnt);
>>>>> How can these ever be triggered?  I understand the need for debugging
>>>>> code when you are writing code, but at this point it shouldn't be needed
>>>>> anymore, right?
>>>>
>>>> I need to avoid some fields is not 0 when the new workgroup is created (because
>>>> work->nr_pages and work->vcnt == 0 usually after the previous workgroup is freed).
>>>> But that is not obvious, it is promised by the current logic.
>>>
>>> Then delete these lines if they can never happen :)
>>
>> I don't know how to observe such a race in our beta test and community users.
> 
> 	/*
> 	 * Let developers know something went really wrong with their
> 	 * initialization code
> 	 */
> 	if (!work->nr_pages) {
> 		pr_err("nr_pages == NULL!");
> 		WARN_ON(1);
> 	}
> 	if (!work->vcnt) {
> 		pr_err("vcnt == NULL!");
> 		WARN_ON(1);
> 	}
> 
> or something like that.
> 
> Don't make people rebuild your code with different options for
> debugging.  That will never work in the 'real world' when people start
> using the code.  You need to have things enabled for people all the
> time, which is why we have dynamic debugging in the kernel now, and not
> a zillion different "DRIVER_DEBUG" build options anymore.
> 
>> Because if the kernel is crashed, we could collect the whole kernel dump to observe the memory
>> and all registers, if we only have some warning, it will be not easy to get the state as early as possible.
> 
> When the kernel crashes, geting a dump is hard on almost all hardware.
> It is only rare systems that you can get a kernel dump.

This piece of code is already used in our HiSilicon beta tests; the whole memory dump,
registers, and stacks can all be collected.

Apart from that, many f2fs bugs were observed in this way, and
many erofs bugs have been collected in the same way... sigh...

Thanks,
Gao Xiang


> 
> thanks,
> 
> greg k-h
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 02/10] staging: erofs: fix race when the managed cache is enabled
  2018-11-22 11:06       ` Greg Kroah-Hartman
@ 2018-11-22 11:43         ` Gao Xiang
  2018-11-22 12:26           ` Greg Kroah-Hartman
  0 siblings, 1 reply; 46+ messages in thread
From: Gao Xiang @ 2018-11-22 11:43 UTC (permalink / raw)
  To: Greg Kroah-Hartman; +Cc: devel, linux-erofs, Chao Yu, LKML, weidu.du, Miao Xie

Hi Greg,

On 2018/11/22 19:06, Greg Kroah-Hartman wrote:
> On Thu, Nov 22, 2018 at 06:42:52PM +0800, Gao Xiang wrote:
>> Hi Greg,
>>
>> On 2018/11/22 18:17, Greg Kroah-Hartman wrote:
>>> Any specific reason why you are not using the refcount.h api instead of
>>> "doing it yourself" with atomic_inc/dec()?
>>>
>>> I'm not rejecting this, just curious.
>> As I explained in the previous email,
>> Re: [PATCH 04/10] staging: erofs: fix `erofs_workgroup_{try_to_freeze, unfreeze}'
>>
>> we need such a function when the value is >= 0, it plays as a refcount,
>> but when the refcount == EROFS_LOCKED_MAGIC (<0, but not 0 as refcount.h),
>> and actually there is no need to introduce a seperate spinlock_t because
>> we don't actually care about its performance (rarely locked). and
>> the corresponding struct is too large for now, we need to decrease its size.
> Why do you need to decrease the size?  How many of these structures are
> created?

As I said in the previous email, every compressed page has a managed structure
called erofs_workgroup, and it is heavily allocated like page/inode/dentry in erofs.

> 
> And you will care about the performance when a lock is being held, as is
> evident by your logic to try to fix those issues in this patch series.
> Using a "real" lock will solve all of that and keep you from having to
> implement it all "by hand".

The function is much like lockref (aligned_u64 lock_count;), with the exceptions explained
in my previous email.

Thanks,
Gao Xiang

> 
> thanks,
> 
> greg k-h

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 07/10] staging: erofs: separate into init_once / always
  2018-11-22 11:26           ` Greg Kroah-Hartman
  2018-11-22 11:37             ` Gao Xiang
@ 2018-11-22 12:00             ` Gao Xiang
  2018-11-22 13:01               ` Gao Xiang
  1 sibling, 1 reply; 46+ messages in thread
From: Gao Xiang @ 2018-11-22 12:00 UTC (permalink / raw)
  To: Greg Kroah-Hartman; +Cc: devel, linux-erofs, Chao Yu, LKML, weidu.du, Miao Xie

Hi Greg,

On 2018/11/22 19:26, Greg Kroah-Hartman wrote:
> Don't make people rebuild your code with different options for
> debugging.  That will never work in the 'real world' when people start
> using the code.  You need to have things enabled for people all the
> time, which is why we have dynamic debugging in the kernel now, and not
> a zillion different "DRIVER_DEBUG" build options anymore.

Actually, erofs currently handles beta users (in eng mode) and commercial users
differently.

CONFIG_EROFS_FS_DEBUG is enabled for all beta users; once an observed
expression turns out to be false, we can get the whole memory dump as early as possible.
But for commercial users there are no such observation points, in order to
preserve kernel stability and performance.

It has helped us find several bugs, and I cannot find an alternative way
to capture the first scene of the accident...
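For context, the DBG_BUGON being discussed is (roughly) a macro that becomes a hard assertion only when the debug option is set and compiles away otherwise. A userspace analogue is sketched below; the option name and the exact no-op form are assumptions for illustration, not the verbatim erofs definition.

```c
#include <stdio.h>
#include <stdlib.h>

/* Flip this to mimic CONFIG_EROFS_FS_DEBUG (eng vs. commercial build). */
#define EROFS_FS_DEBUG 1

#if EROFS_FS_DEBUG
/* Eng build: stop immediately so the full state can be dumped. */
# define DBG_BUGON(cond)						\
	do {								\
		if (cond) {						\
			fprintf(stderr, "BUG at %s:%d\n",		\
				__FILE__, __LINE__);			\
			abort();					\
		}							\
	} while (0)
#else
/* Commercial build: evaluate nothing; the check costs nothing. */
# define DBG_BUGON(cond) ((void)0)
#endif
```

With the debug switch off, checks such as `DBG_BUGON(work->nr_pages)` vanish from the binary, which is the stability/performance trade-off described above.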

Thanks,
Gao Xiang

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 02/10] staging: erofs: fix race when the managed cache is enabled
  2018-11-22 11:43         ` Gao Xiang
@ 2018-11-22 12:26           ` Greg Kroah-Hartman
  2018-11-22 12:41             ` Gao Xiang
  0 siblings, 1 reply; 46+ messages in thread
From: Greg Kroah-Hartman @ 2018-11-22 12:26 UTC (permalink / raw)
  To: Gao Xiang; +Cc: devel, linux-erofs, Chao Yu, LKML, weidu.du, Miao Xie

On Thu, Nov 22, 2018 at 07:43:50PM +0800, Gao Xiang wrote:
> Hi Greg,
> 
> On 2018/11/22 19:06, Greg Kroah-Hartman wrote:
> > On Thu, Nov 22, 2018 at 06:42:52PM +0800, Gao Xiang wrote:
> >> Hi Greg,
> >>
> >> On 2018/11/22 18:17, Greg Kroah-Hartman wrote:
> >>> Any specific reason why you are not using the refcount.h api instead of
> >>> "doing it yourself" with atomic_inc/dec()?
> >>>
> >>> I'm not rejecting this, just curious.
> >> As I explained in the previous email,
> >> Re: [PATCH 04/10] staging: erofs: fix `erofs_workgroup_{try_to_freeze, unfreeze}'
> >>
> >> we need such a function when the value is >= 0, it plays as a refcount,
> >> but when the refcount == EROFS_LOCKED_MAGIC (<0, but not 0 as refcount.h),
> >> and actually there is no need to introduce a seperate spinlock_t because
> >> we don't actually care about its performance (rarely locked). and
> >> the corresponding struct is too large for now, we need to decrease its size.
> > Why do you need to decrease the size?  How many of these structures are
> > created?
> 
> As I said in the previous email, every compressed page will have a managed structure
> called erofs_workgroup, and it is heavily allocated like page/inode/dentry in the erofs.

Ugh, every page?  Ok, nevermind, I take back my objections.  You all are
crazy and need to do crazy things like this :)

greg k-h

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 02/10] staging: erofs: fix race when the managed cache is enabled
  2018-11-22 12:26           ` Greg Kroah-Hartman
@ 2018-11-22 12:41             ` Gao Xiang
  2018-11-22 13:20               ` Greg Kroah-Hartman
  0 siblings, 1 reply; 46+ messages in thread
From: Gao Xiang @ 2018-11-22 12:41 UTC (permalink / raw)
  To: Greg Kroah-Hartman; +Cc: devel, linux-erofs, Chao Yu, LKML, weidu.du, Miao Xie

Hi Greg,

On 2018/11/22 20:26, Greg Kroah-Hartman wrote:
> Ugh, every page?  Ok, nevermind, I take back my objections.  You all are
> crazy and need to do crazy things like this :)

...Do you have some idea about this? ... I think it is fairly normal... :(

We have a large number of managed workgroups: the current use case is a 4k compressed size,
so each compressed page has a workgroup, though erofs also has reclaim paths to free them in low-memory cases.

But since the structure (z_erofs_vle_workgroup, erofs_workgroup) is critical,
I need to make it as small as possible.

Thanks,
Gao Xiang

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 07/10] staging: erofs: separate into init_once / always
  2018-11-22 12:00             ` Gao Xiang
@ 2018-11-22 13:01               ` Gao Xiang
  2018-11-22 13:23                 ` Greg Kroah-Hartman
  0 siblings, 1 reply; 46+ messages in thread
From: Gao Xiang @ 2018-11-22 13:01 UTC (permalink / raw)
  To: Greg Kroah-Hartman; +Cc: devel, linux-erofs, LKML, weidu.du, Miao Xie

Hi Greg,

On 2018/11/22 20:00, Gao Xiang wrote:
> Hi Greg,
> 
> On 2018/11/22 19:26, Greg Kroah-Hartman wrote:
>> Don't make people rebuild your code with different options for
>> debugging.  That will never work in the 'real world' when people start
>> using the code.  You need to have things enabled for people all the
>> time, which is why we have dynamic debugging in the kernel now, and not
>> a zillion different "DRIVER_DEBUG" build options anymore.
> Actually, current erofs handle differently for beta users (in eng mode)
> and commercial users.
> 
> CONFIG_EROFS_FS_DEBUG is enable for all beta users, after an observed
> expression is false, we could get the whole memorydump as early as possible.
> But for commercial users, there are no such observing points to promise
> the kernel stability and performance.
> 
> It has helped us to find several bug, and I cannot find some alternative way
> to get the the first scene of the accident...

I'm about to send v2 of this patchset... I need to get your opinion...

For the current erofs development state, I tend to keep such a development
switch because the code may still be modified frequently;
many potential bugs can be avoided while this debugging mode is on, and
it will be dropped after erofs becomes more stable...

Thanks,
Gao Xiang

> 
> Thanks,
> Gao Xiang

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 02/10] staging: erofs: fix race when the managed cache is enabled
  2018-11-22 12:41             ` Gao Xiang
@ 2018-11-22 13:20               ` Greg Kroah-Hartman
  0 siblings, 0 replies; 46+ messages in thread
From: Greg Kroah-Hartman @ 2018-11-22 13:20 UTC (permalink / raw)
  To: Gao Xiang; +Cc: devel, linux-erofs, Chao Yu, LKML, weidu.du, Miao Xie

On Thu, Nov 22, 2018 at 08:41:15PM +0800, Gao Xiang wrote:
> Hi Greg,
> 
> On 2018/11/22 20:26, Greg Kroah-Hartman wrote:
> > Ugh, every page?  Ok, nevermind, I take back my objections.  You all are
> > crazy and need to do crazy things like this :)
> 
> ...Do you have some idea about this? ... I think it is fairly normal... :(
> 
> We have a large number of managed workgroups, the current use case is the 4k compressed size,
> each compressed page has a workgroup, but erofs also has reclaim paths to free them for low memory cases.
> 
> But since the structure (z_erofs_vle_workgroup, erofs_workgroup) is critical,
> I need to make it as small as possible.

I do not know "real" filesystems (i.e. ones that put stuff on disks),
only virtual ones (like sysfs/kernfs), so I do not know what to suggest
here, sorry.  I thought this was a much higher-level structure; if this
is per-compressed-page, then you are more than welcome to optimize for
this.  Good luck with your "fake" spinlocks though :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 07/10] staging: erofs: separate into init_once / always
  2018-11-22 13:01               ` Gao Xiang
@ 2018-11-22 13:23                 ` Greg Kroah-Hartman
  2018-11-22 13:59                   ` Gao Xiang
  0 siblings, 1 reply; 46+ messages in thread
From: Greg Kroah-Hartman @ 2018-11-22 13:23 UTC (permalink / raw)
  To: Gao Xiang; +Cc: devel, weidu.du, Miao Xie, linux-erofs, LKML

On Thu, Nov 22, 2018 at 09:01:31PM +0800, Gao Xiang wrote:
> Hi Greg,
> 
> On 2018/11/22 20:00, Gao Xiang wrote:
> > Hi Greg,
> > 
> > On 2018/11/22 19:26, Greg Kroah-Hartman wrote:
> >> Don't make people rebuild your code with different options for
> >> debugging.  That will never work in the 'real world' when people start
> >> using the code.  You need to have things enabled for people all the
> >> time, which is why we have dynamic debugging in the kernel now, and not
> >> a zillion different "DRIVER_DEBUG" build options anymore.
> > Actually, current erofs handle differently for beta users (in eng mode)
> > and commercial users.
> > 
> > CONFIG_EROFS_FS_DEBUG is enable for all beta users, after an observed
> > expression is false, we could get the whole memorydump as early as possible.
> > But for commercial users, there are no such observing points to promise
> > the kernel stability and performance.
> > 
> > It has helped us to find several bug, and I cannot find some alternative way
> > to get the the first scene of the accident...
> 
> I'm about to send v2 of this patchset... I need to get your opinion...
> 
> I think for the current erofs development state, I tend to leave such a
> developping switch because the code could be modified frequently,
> many potential bugs could be avoid when this debugging mode is on and
> it will be dropped after erofs becomes more stable...

You have two competing issues here.  You need to have some sort of debug
build that allows you to get feedback from users, and you need to
provide a solid, reliable system for those not wanting to debug
anything.

If you use a kernel build option, then you can do what you are doing
now, just that do not expect for any user that has problems with the
filesystem, they will rebuild the code just to try to help debug it.
That is going to require run-time debugging to be enabled as they will
not usually have built their own kernel/module at all.

For now, keep doing what you are doing, I just started to worry when I
saw the initial BUG_ON() as we had just talked about these at the
Maintainers summit a few weeks ago, and how we need to get rid of them
from the kernel as they are a crutch that we need to solve.

thanks, and sorry for the noise.  Please resend this series again,

greg k-h

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 07/10] staging: erofs: separate into init_once / always
  2018-11-22 13:23                 ` Greg Kroah-Hartman
@ 2018-11-22 13:59                   ` Gao Xiang
  0 siblings, 0 replies; 46+ messages in thread
From: Gao Xiang @ 2018-11-22 13:59 UTC (permalink / raw)
  To: Greg Kroah-Hartman; +Cc: devel, weidu.du, Miao Xie, linux-erofs, LKML

Hi Greg,

On 2018/11/22 21:23, Greg Kroah-Hartman wrote:
> On Thu, Nov 22, 2018 at 09:01:31PM +0800, Gao Xiang wrote:
>> Hi Greg,
>>
>> On 2018/11/22 20:00, Gao Xiang wrote:
>>> Hi Greg,
>>>
>>> On 2018/11/22 19:26, Greg Kroah-Hartman wrote:
>>>> Don't make people rebuild your code with different options for
>>>> debugging.  That will never work in the 'real world' when people start
>>>> using the code.  You need to have things enabled for people all the
>>>> time, which is why we have dynamic debugging in the kernel now, and not
>>>> a zillion different "DRIVER_DEBUG" build options anymore.
>>> Actually, current erofs handle differently for beta users (in eng mode)
>>> and commercial users.
>>>
>>> CONFIG_EROFS_FS_DEBUG is enable for all beta users, after an observed
>>> expression is false, we could get the whole memorydump as early as possible.
>>> But for commercial users, there are no such observing points to promise
>>> the kernel stability and performance.
>>>
>>> It has helped us to find several bug, and I cannot find some alternative way
>>> to get the the first scene of the accident...
>>
>> I'm about to send v2 of this patchset... I need to get your opinion...
>>
>> I think for the current erofs development state, I tend to leave such a
>> developping switch because the code could be modified frequently,
>> many potential bugs could be avoid when this debugging mode is on and
>> it will be dropped after erofs becomes more stable...
> 
> You have two competing issues here.  You need to have some sort of debug
> build that allows you to get feedback from users, and you need to
> provide a solid, reliable system for those not wanting to debug
> anything.

Yes, you are right :)

> 
> If you use a kernel build option, then you can do what you are doing
> now, just that do not expect for any user that has problems with the
> filesystem, they will rebuild the code just to try to help debug it.
> That is going to require run-time debugging to be enabled as they will
> not usually have built their own kernel/module at all.

Yes, actually the current Android system has two modes -- eng and commercial builds --
for all versions.

A large number of beta users receive the eng build, and many bugs were observed
from these users. That is the debugging benefit of what the Android system does now.

I do not expect real users to rebuild kernels for debugging; I think after the Android
use case, erofs will have become solid enough. For commercial builds, it will have enough
error-handling cases found by Android users and developers.

> 
> For now, keep doing what you are doing, I just started to worry when I
> saw the initial BUG_ON() as we had just talked about these at the
> Maintainers summit a few weeks ago, and how we need to get rid of them
> from the kernel as they are a crutch that we need to solve.

Thanks for understanding...

Thanks,
Gao Xiang

> 
> thanks, and sorry for the noise.  Please resend this series again,
> 
> greg k-h
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 05/10] staging: erofs: add a full barrier in erofs_workgroup_unfreeze
  2018-11-22 10:56     ` Gao Xiang
@ 2018-11-22 18:50       ` Andrea Parri
  2018-11-23  2:51         ` Gao Xiang
  0 siblings, 1 reply; 46+ messages in thread
From: Andrea Parri @ 2018-11-22 18:50 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Greg Kroah-Hartman, devel, linux-erofs, Chao Yu, LKML, weidu.du,
	Miao Xie

On Thu, Nov 22, 2018 at 06:56:32PM +0800, Gao Xiang wrote:
> Hi Greg,
> 
> On 2018/11/22 18:22, Greg Kroah-Hartman wrote:
> > Please document this memory barrier.  It does not make much sense to
> > me...
> 
> Because we need to make the other observers noticing the latest values modified
> in this locking period before unfreezing the whole workgroup, one way is to use
> a memory barrier and the other way is to use ACQUIRE and RELEASE. we selected
> the first one.
> 
> Hmmm...ok, I will add a simple message to explain this, but I think that is
> plain enough for a lock...

Sympathizing with Greg's request, let me add some specific suggestions:

  1. It wouldn't hurt to indicate a pair of memory accesses which are
     intended to be "ordered" by the memory barrier in question (yes,
     this pair might not be unique, but you should be able to provide
     an example).

  2. Memory barriers always come matched by other memory barriers, or
     dependencies (it really does not make sense to talk about a full
     barrier "in isolation"): please also indicate (an instance of) a
     matching barrier or the matching barriers.

  3. How do the hardware threads communicate?  In the acquire/release
     pattern you mentioned above, the load-acquire *reads from* a/the
     previous store-release, a memory access that follows the acquire
     somehow communicate with a memory access preceding the release...

  4. It is a good practice to include the above information within an
     (inline) comment accompanying the added memory barrier (in fact,
     IIRC, checkpatch.pl gives you a "memory barrier without comment"
     warning when you omit to do so); not just in the commit message.

Hope this helps.  Please let me know if something I wrote is unclear,

  Andrea


> 
> Thanks,
> Gao Xiang
> 
> > 
> > thanks,
> > 
> > greg k-h

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 05/10] staging: erofs: add a full barrier in erofs_workgroup_unfreeze
  2018-11-22 18:50       ` Andrea Parri
@ 2018-11-23  2:51         ` Gao Xiang
  2018-11-23  9:51           ` Andrea Parri
  0 siblings, 1 reply; 46+ messages in thread
From: Gao Xiang @ 2018-11-23  2:51 UTC (permalink / raw)
  To: Andrea Parri
  Cc: Greg Kroah-Hartman, devel, linux-erofs, Chao Yu, LKML, weidu.du,
	Miao Xie

Hi Andrea,

On 2018/11/23 2:50, Andrea Parri wrote:
> On Thu, Nov 22, 2018 at 06:56:32PM +0800, Gao Xiang wrote:
>> Hi Greg,
>>
>> On 2018/11/22 18:22, Greg Kroah-Hartman wrote:
>>> Please document this memory barrier.  It does not make much sense to
>>> me...
>>
>> Because we need to make the other observers noticing the latest values modified
>> in this locking period before unfreezing the whole workgroup, one way is to use
>> a memory barrier and the other way is to use ACQUIRE and RELEASE. we selected
>> the first one.
>>
>> Hmmm...ok, I will add a simple message to explain this, but I think that is
>> plain enough for a lock...
> 
> Sympathizing with Greg's request, let me add some specific suggestions:
> 
>   1. It wouldn't hurt to indicate a pair of memory accesses which are
>      intended to be "ordered" by the memory barrier in question (yes,
>      this pair might not be unique, but you should be able to provide
>      an example).
> 
>   2. Memory barriers always come matched by other memory barriers, or
>      dependencies (it really does not make sense to talk about a full
>      barrier "in isolation"): please also indicate (an instance of) a
>      matching barrier or the matching barriers.
> 
>   3. How do the hardware threads communicate?  In the acquire/release
>      pattern you mentioned above, the load-acquire *reads from* a/the
>      previous store-release, a memory access that follows the acquire
>      somehow communicate with a memory access preceding the release...
> 
>   4. It is a good practice to include the above information within an
>      (inline) comment accompanying the added memory barrier (in fact,
>      IIRC, checkpatch.pl gives you a "memory barrier without comment"
>      warning when you omit to do so); not just in the commit message.
> 
> Hope this helps.  Please let me know if something I wrote is unclear,

Thanks for taking the time for the detailed explanation; it is really helpful to me. :)
And you are right, barriers should come in pairs, so I think I need to explain more:

255 static inline bool erofs_workgroup_get(struct erofs_workgroup *grp, int *ocnt)
256 {
257         int o;
258
259 repeat:
260         o = erofs_wait_on_workgroup_freezed(grp);
261
262         if (unlikely(o <= 0))
263                 return -1;
264
265         if (unlikely(atomic_cmpxchg(&grp->refcount, o, o + 1) != o)) <- *
266                 goto repeat;
            imply a memory barrier here
267
268         *ocnt = o;
269         return 0;
270 }

I think atomic_cmpxchg implies full memory barrier semantics when the value comparison (*) succeeds...

I don't know whether my understanding is correct. If I am wrong, please correct me; or
do I need to add more detailed inline comments to explain this in the code?
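The pairing under discussion can be modelled in userspace C11, with a seq_cst fence standing in for the kernel's smp_mb() and a seq_cst compare-and-swap standing in for atomic_cmpxchg. All names here are invented for the sketch; it only illustrates the barrier pairing, not the real erofs reclaim logic.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct model_grp {
	atomic_int refcount;
	int payload;		/* plain field, updated while "frozen" */
};

static void model_unfreeze(struct model_grp *g, int orig)
{
	/*
	 * Full barrier: every plain store above (e.g. to ->payload) is
	 * ordered before the refcount becomes visible as unfrozen.
	 * Pairs with the successful seq_cst CAS in model_get() below.
	 */
	atomic_thread_fence(memory_order_seq_cst);
	atomic_store_explicit(&g->refcount, orig, memory_order_relaxed);
}

static bool model_get(struct model_grp *g)
{
	int o = atomic_load(&g->refcount);

	if (o <= 0)
		return false;	/* frozen or being freed */
	/* A successful seq_cst CAS acts as the matching full barrier. */
	return atomic_compare_exchange_strong(&g->refcount, &o, o + 1);
}
```

In this model, a thread whose `model_get()` succeeds is guaranteed to see all `payload` updates made before the corresponding `model_unfreeze()`, which is the "other observers notice the latest values" property mentioned earlier in the thread.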

Thanks,
Gao Xiang


> 
>   Andrea
> 
> 
>>
>> Thanks,
>> Gao Xiang
>>
>>>
>>> thanks,
>>>
>>> greg k-h

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 05/10] staging: erofs: add a full barrier in erofs_workgroup_unfreeze
  2018-11-23  2:51         ` Gao Xiang
@ 2018-11-23  9:51           ` Andrea Parri
  2018-11-23 10:00             ` Gao Xiang
  0 siblings, 1 reply; 46+ messages in thread
From: Andrea Parri @ 2018-11-23  9:51 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Greg Kroah-Hartman, devel, linux-erofs, Chao Yu, LKML, weidu.du,
	Miao Xie

On Fri, Nov 23, 2018 at 10:51:33AM +0800, Gao Xiang wrote:
> Hi Andrea,
> 
> On 2018/11/23 2:50, Andrea Parri wrote:
> > On Thu, Nov 22, 2018 at 06:56:32PM +0800, Gao Xiang wrote:
> >> Hi Greg,
> >>
> >> On 2018/11/22 18:22, Greg Kroah-Hartman wrote:
> >>> Please document this memory barrier.  It does not make much sense to
> >>> me...
> >>
> >> Because we need to make the other observers noticing the latest values modified
> >> in this locking period before unfreezing the whole workgroup, one way is to use
> >> a memory barrier and the other way is to use ACQUIRE and RELEASE. we selected
> >> the first one.
> >>
> >> Hmmm...ok, I will add a simple message to explain this, but I think that is
> >> plain enough for a lock...
> > 
> > Sympathizing with Greg's request, let me add some specific suggestions:
> > 
> >   1. It wouldn't hurt to indicate a pair of memory accesses which are
> >      intended to be "ordered" by the memory barrier in question (yes,
> >      this pair might not be unique, but you should be able to provide
> >      an example).
> > 
> >   2. Memory barriers always come matched by other memory barriers, or
> >      dependencies (it really does not make sense to talk about a full
> >      barrier "in isolation"): please also indicate (an instance of) a
> >      matching barrier or the matching barriers.
> > 
> >   3. How do the hardware threads communicate?  In the acquire/release
> >      pattern you mentioned above, the load-acquire *reads from* a/the
> >      previous store-release, a memory access that follows the acquire
> >      somehow communicate with a memory access preceding the release...
> > 
> >   4. It is a good practice to include the above information within an
> >      (inline) comment accompanying the added memory barrier (in fact,
> >      IIRC, checkpatch.pl gives you a "memory barrier without comment"
> >      warning when you omit to do so); not just in the commit message.
> > 
> > Hope this helps.  Please let me know if something I wrote is unclear,
> 
> Thanks for taking the time on the detailed explanation. I think it is helpful for me. :)
> And you are right, barriers should come in pairs, and I think I need to explain more:
> 
> 255 static inline bool erofs_workgroup_get(struct erofs_workgroup *grp, int *ocnt)
> 256 {
> 257         int o;
> 258
> 259 repeat:
> 260         o = erofs_wait_on_workgroup_freezed(grp);
> 261
> 262         if (unlikely(o <= 0))
> 263                 return -1;
> 264
> 265         if (unlikely(atomic_cmpxchg(&grp->refcount, o, o + 1) != o)) <- *
> 266                 goto repeat;
>             implies a memory barrier here (on success)
> 267
> 268         *ocnt = o;
> 269         return 0;
> 270 }
> 
> I think atomic_cmpxchg implies full memory barrier semantics when the value comparison (*) succeeds...

Correct.  This is informally documented in Documentation/atomic_t.txt
and formalized within tools/memory-model/.


> 
> I don't know whether my understanding is correct; if I am wrong, please correct me. Or do
> I need to add more detailed code comments to explain this in the code?

Yes, please; please review the above points (including 1. and 3.) and
try to address them with inline comments.  Maybe (if that matches the
*behavior*/guarantee you have in mind...) something like:

[in erofs_workgroup_unfreeze()]

	/*
	 * Orders the store/load to/from [???] and the store to
	 * ->refcount performed by the atomic_set() below.
	 *
	 * Matches the atomic_cmpxchg() in erofs_workgroup_get().
	 *
	 * Guarantees that if a successful atomic_cmpxchg() reads
	 * the value stored by the atomic_set() then [???].
	 */
	smp_mb();
	atomic_set(&grp->refcount, v);


[in erofs_workgroup_get()]

	/*
	 * Orders the load from ->refcount performed by the
	 * atomic_cmpxchg() below and the store/load [???].
	 *
	 * See the comment for the smp_mb() in
	 * erofs_workgroup_unfreeze().
	 */
	if (unlikely(atomic_cmpxchg(&grp->refcount, o, o + 1) != o))
		goto repeat;

Thanks,
  Andrea


> 
> Thanks,
> Gao Xiang
> 
> 
> > 
> >   Andrea
> > 
> > 
> >>
> >> Thanks,
> >> Gao Xiang
> >>
> >>>
> >>> thanks,
> >>>
> >>> greg k-h

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 05/10] staging: erofs: add a full barrier in erofs_workgroup_unfreeze
  2018-11-23  9:51           ` Andrea Parri
@ 2018-11-23 10:00             ` Gao Xiang
  0 siblings, 0 replies; 46+ messages in thread
From: Gao Xiang @ 2018-11-23 10:00 UTC (permalink / raw)
  To: Andrea Parri
  Cc: Greg Kroah-Hartman, devel, linux-erofs, Chao Yu, LKML, weidu.du,
	Miao Xie

Hi Andrea,

On 2018/11/23 17:51, Andrea Parri wrote:
> Correct.  This is informally documented in Documentation/atomic_t.txt
> and formalized within tools/memory-model/.
> 
> 
>> I don't know whether my understanding is correct; if I am wrong, please correct me. Or do
>> I need to add more detailed code comments to explain this in the code?
> Yes, please; please review the above points (including 1. and 3.) and
> try to address them with inline comments.  Maybe (if that matches the
> *behavior*/guarantee you have in mind...) something like:
> 
> [in erofs_workgroup_unfreeze()]
> 
> 	/*
> 	 * Orders the store/load to/from [???] and the store to
> 	 * ->refcount performed by the atomic_set() below.
> 	 *
> 	 * Matches the atomic_cmpxchg() in erofs_workgroup_get().
> 	 *
> 	 * Guarantees that if a successful atomic_cmpxchg() reads
> 	 * the value stored by the atomic_set() then [???].
> 	 */
> 	smp_mb();
> 	atomic_set(&grp->refcount, v);
> 
> 
> [in erofs_workgroup_get()]
> 
> 	/*
> 	 * Orders the load from ->refcount performed by the
> 	 * atomic_cmpxchg() below and the store/load [???].
> 	 *
> 	 * See the comment for the smp_mb() in
> 	 * erofs_workgroup_unfreeze().
> 	 */
> 	if (unlikely(atomic_cmpxchg(&grp->refcount, o, o + 1) != o))
> 		goto repeat;
> 

OK, I will add these comments in the next version of the patchset, which will be sent later.
Thanks for your suggestion. :)

Thanks,
Gao Xiang

> Thanks,
>   Andrea
> 
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2018-11-23 10:00 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-20 14:34 [PATCH 00/10] staging: erofs: decompression stability enhancement Gao Xiang
2018-11-20 14:34 ` [PATCH 01/10] staging: erofs: fix `trace_erofs_readpage' position Gao Xiang
2018-11-22 10:19   ` Greg Kroah-Hartman
2018-11-22 10:49     ` Gao Xiang
2018-11-22 11:00       ` Greg Kroah-Hartman
2018-11-20 14:34 ` [PATCH 02/10] staging: erofs: fix race when the managed cache is enabled Gao Xiang
2018-11-22 10:17   ` Greg Kroah-Hartman
2018-11-22 10:42     ` Gao Xiang
2018-11-22 11:06       ` Greg Kroah-Hartman
2018-11-22 11:43         ` Gao Xiang
2018-11-22 12:26           ` Greg Kroah-Hartman
2018-11-22 12:41             ` Gao Xiang
2018-11-22 13:20               ` Greg Kroah-Hartman
2018-11-22 10:19   ` Greg Kroah-Hartman
2018-11-20 14:34 ` [PATCH 03/10] staging: erofs: atomic_cond_read_relaxed on ref-locked workgroup Gao Xiang
2018-11-22 10:20   ` Greg Kroah-Hartman
2018-11-20 14:34 ` [PATCH 04/10] staging: erofs: fix `erofs_workgroup_{try_to_freeze, unfreeze}' Gao Xiang
2018-11-22 10:21   ` Greg Kroah-Hartman
2018-11-22 10:29     ` Gao Xiang
2018-11-22 11:03       ` Greg Kroah-Hartman
2018-11-22 11:05       ` Greg Kroah-Hartman
2018-11-22 11:22         ` Gao Xiang
2018-11-20 14:34 ` [PATCH 05/10] staging: erofs: add a full barrier in erofs_workgroup_unfreeze Gao Xiang
2018-11-22 10:22   ` Greg Kroah-Hartman
2018-11-22 10:56     ` Gao Xiang
2018-11-22 18:50       ` Andrea Parri
2018-11-23  2:51         ` Gao Xiang
2018-11-23  9:51           ` Andrea Parri
2018-11-23 10:00             ` Gao Xiang
2018-11-20 14:34 ` [PATCH 06/10] staging: erofs: fix the definition of DBG_BUGON Gao Xiang
2018-11-20 14:34 ` [PATCH 07/10] staging: erofs: separate into init_once / always Gao Xiang
2018-11-22 10:23   ` Greg Kroah-Hartman
2018-11-22 10:34     ` Gao Xiang
2018-11-22 11:05       ` Greg Kroah-Hartman
2018-11-22 11:11         ` Gao Xiang
2018-11-22 11:26           ` Greg Kroah-Hartman
2018-11-22 11:37             ` Gao Xiang
2018-11-22 12:00             ` Gao Xiang
2018-11-22 13:01               ` Gao Xiang
2018-11-22 13:23                 ` Greg Kroah-Hartman
2018-11-22 13:59                   ` Gao Xiang
2018-11-20 14:34 ` [PATCH 08/10] staging: erofs: locked before registering for all new workgroups Gao Xiang
2018-11-22 10:24   ` Greg Kroah-Hartman
2018-11-22 10:35     ` Gao Xiang
2018-11-20 14:34 ` [PATCH 09/10] staging: erofs: decompress asynchronously if PG_readahead page at first Gao Xiang
2018-11-20 14:34 ` [PATCH 10/10] staging: erofs: rename strange variable names in z_erofs_vle_frontend Gao Xiang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).