* [RFC PATCH 0/2] percpu_counter: Enable switching to global counter
@ 2016-03-05  2:51 ` Waiman Long
  0 siblings, 0 replies; 18+ messages in thread
From: Waiman Long @ 2016-03-05  2:51 UTC (permalink / raw)
  To: Tejun Heo, Christoph Lameter, Dave Chinner
  Cc: xfs, linux-kernel, Ingo Molnar, Peter Zijlstra, Scott J Norton,
	Douglas Hatch, Waiman Long

This patchset allows the degeneration of per-cpu counters back to
global counters when:

 1) The number of CPUs in the system is large, making calls to
    percpu_counter_sum() expensive.
 2) The initial count value is small, so there is a high chance of
    excessive percpu_counter_sum() calls.

When the above 2 conditions are true, this patchset allows the user of
per-cpu counters to selectively degenerate them into global counters
protected by a lock. This is done by calling the new
percpu_counter_set_limit() API after percpu_counter_set(). Without this
call, there is no change in the behavior of the per-cpu counters.
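
For illustration, a minimal usage sketch (the counter name and values
below are made up for this example; only percpu_counter_set() and the
new percpu_counter_set_limit() call are real API):

	/* sbi->free_blocks and initial_free_blocks are made-up names */
	percpu_counter_set(&sbi->free_blocks, initial_free_blocks);

	/*
	 * Arm the limit: if the current count is no larger than
	 * num_possible_cpus() * 256, the counter degenerates into a
	 * lock-protected global counter. Passing 0 instead of 256 would
	 * use the default percpu_counter_batch as the per-cpu limit.
	 */
	percpu_counter_set_limit(&sbi->free_blocks, 256);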

Patch 1 implements the new percpu_counter_set_limit() API.

Patch 2 modifies XFS to call the new API for the m_ifree and m_fdblocks
per-cpu counters.

Waiman Long (2):
  percpu_counter: Allow falling back to global counter on large system
  xfs: Allow degeneration of m_fdblocks/m_ifree to global counters

 fs/xfs/xfs_mount.c             |    1 -
 fs/xfs/xfs_mount.h             |    5 +++
 fs/xfs/xfs_super.c             |    6 +++
 include/linux/percpu_counter.h |   10 +++++
 lib/percpu_counter.c           |   72 +++++++++++++++++++++++++++++++++++++++-
 5 files changed, 92 insertions(+), 2 deletions(-)


* [RFC PATCH 1/2] percpu_counter: Allow falling back to global counter on large system
  2016-03-05  2:51 ` Waiman Long
@ 2016-03-05  2:51   ` Waiman Long
  -1 siblings, 0 replies; 18+ messages in thread
From: Waiman Long @ 2016-03-05  2:51 UTC (permalink / raw)
  To: Tejun Heo, Christoph Lameter, Dave Chinner
  Cc: xfs, linux-kernel, Ingo Molnar, Peter Zijlstra, Scott J Norton,
	Douglas Hatch, Waiman Long

Per-cpu counters are used in quite a number of places within
the kernel.  On large systems with a lot of CPUs, however, doing a
percpu_counter_sum() can be very expensive as nr_cpus cachelines will
need to be read. In __percpu_counter_compare(), the chance of calling
percpu_counter_sum() also increases with an increasing number of CPUs
if the global counter value is relatively small.

On large systems, using a lock-protected global counter may actually be
faster than doing a percpu_counter_sum(), which can be called frequently
from __percpu_counter_compare().

This patch provides a mechanism to selectively degenerate per-cpu
counters to global counters at per-cpu counter initialization time. The
following new API is added:

  percpu_counter_set_limit(struct percpu_counter *fbc,
                           u32 percpu_limit)

The function should be called after percpu_counter_set(). It will
compare the total limit (nr_cpus * percpu_limit) against the current
counter value.  If the limit is not smaller than the current value, it
will disable the per-cpu counters and use only the global counter
instead. At run time, when the counter value grows past the total limit,
the per-cpu counters will be enabled again.

Runtime disabling of per-cpu counters, however, is not currently
supported as it would slow down the per-cpu fast path.
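
A worked example of the intended semantics (numbers purely illustrative):

	/* hypothetical values, for illustration only */
	nr_cpus      = num_possible_cpus();	/* say 128 */
	percpu_limit = 1024;
	limit        = nr_cpus * percpu_limit;	/* 131072 */

	/*
	 * At percpu_counter_set_limit() time:
	 *   abs(count) <= limit -> degenerate to the lock-protected
	 *                          global counter (fbc->limit = limit)
	 *   abs(count) >  limit -> fbc->limit stays 0, per-cpu mode kept
	 *
	 * Later, in __percpu_counter_add(), once abs(fbc->count) grows
	 * past fbc->limit, the limit is cleared and the counter reverts
	 * to normal per-cpu operation.
	 */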

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 include/linux/percpu_counter.h |   10 +++++
 lib/percpu_counter.c           |   72 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 81 insertions(+), 1 deletions(-)

diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index 84a1094..04a3783 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -16,8 +16,14 @@
 
 #ifdef CONFIG_SMP
 
+/*
+ * The per-cpu counter will be degenerated into a global counter when limit
+ * is set at initialization time. It will change back to a real per-cpu
+ * counter once the count exceeds the given limit.
+ */
 struct percpu_counter {
 	raw_spinlock_t lock;
+	u32 limit;
 	s64 count;
 #ifdef CONFIG_HOTPLUG_CPU
 	struct list_head list;	/* All percpu_counters are on a list */
@@ -42,6 +48,7 @@ void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
 void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch);
 s64 __percpu_counter_sum(struct percpu_counter *fbc);
 int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch);
+void percpu_counter_set_limit(struct percpu_counter *fbc, u32 percpu_limit);
 
 static inline int percpu_counter_compare(struct percpu_counter *fbc, s64 rhs)
 {
@@ -170,6 +177,9 @@ static inline int percpu_counter_initialized(struct percpu_counter *fbc)
 	return 1;
 }
 
+static inline void percpu_counter_set_limit(struct percpu_counter *fbc,
+					    u32 percpu_limit) { }
+
 #endif	/* CONFIG_SMP */
 
 static inline void percpu_counter_inc(struct percpu_counter *fbc)
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index f051d69..f101c06 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -75,11 +75,25 @@ EXPORT_SYMBOL(percpu_counter_set);
 void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
 {
 	s64 count;
+	unsigned long flags;
+
+	if (fbc->limit) {
+		raw_spin_lock_irqsave(&fbc->lock, flags);
+		if (unlikely(!fbc->limit)) {
+			raw_spin_unlock_irqrestore(&fbc->lock, flags);
+			goto percpu_add;
+		}
+		fbc->count += amount;
+		if (abs(fbc->count) > fbc->limit)
+			fbc->limit = 0;	/* Revert back to per-cpu counter */
 
+		raw_spin_unlock_irqrestore(&fbc->lock, flags);
+		return;
+	}
+percpu_add:
 	preempt_disable();
 	count = __this_cpu_read(*fbc->counters) + amount;
 	if (count >= batch || count <= -batch) {
-		unsigned long flags;
 		raw_spin_lock_irqsave(&fbc->lock, flags);
 		fbc->count += count;
 		__this_cpu_sub(*fbc->counters, count - amount);
@@ -94,6 +108,8 @@ EXPORT_SYMBOL(__percpu_counter_add);
 /*
  * Add up all the per-cpu counts, return the result.  This is a more accurate
  * but much slower version of percpu_counter_read_positive()
+ *
+ * If a limit is set, the count can be returned directly without locking.
  */
 s64 __percpu_counter_sum(struct percpu_counter *fbc)
 {
@@ -101,6 +117,9 @@ s64 __percpu_counter_sum(struct percpu_counter *fbc)
 	int cpu;
 	unsigned long flags;
 
+	if (READ_ONCE(fbc->limit))
+		return READ_ONCE(fbc->count);
+
 	raw_spin_lock_irqsave(&fbc->lock, flags);
 	ret = fbc->count;
 	for_each_online_cpu(cpu) {
@@ -120,6 +139,7 @@ int __percpu_counter_init(struct percpu_counter *fbc, s64 amount, gfp_t gfp,
 	raw_spin_lock_init(&fbc->lock);
 	lockdep_set_class(&fbc->lock, key);
 	fbc->count = amount;
+	fbc->limit = 0;
 	fbc->counters = alloc_percpu_gfp(s32, gfp);
 	if (!fbc->counters)
 		return -ENOMEM;
@@ -202,6 +222,9 @@ int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)
 	s64	count;
 
 	count = percpu_counter_read(fbc);
+	if (READ_ONCE(fbc->limit))
+		goto compare;
+
 	/* Check to see if rough count will be sufficient for comparison */
 	if (abs(count - rhs) > (batch * num_online_cpus())) {
 		if (count > rhs)
@@ -211,6 +234,7 @@ int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)
 	}
 	/* Need to use precise count */
 	count = percpu_counter_sum(fbc);
+compare:
 	if (count > rhs)
 		return 1;
 	else if (count < rhs)
@@ -220,6 +244,52 @@ int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)
 }
 EXPORT_SYMBOL(__percpu_counter_compare);
 
+/*
+ * Set the limit if the count is no larger than the given per-cpu limit * # of cpus.
+ *
+ * This function should only be called at initialization time right after
+ * percpu_counter_set(). Limit will only be set if there are more than
+ * 32 cpus in the system and the current counter value is not bigger than
+ * the limit. Once it is set, it can be cleared as soon as the counter
+ * value exceeds the given limit and real per-cpu counters are used again.
+ * However, switching from per-cpu counters back to global counter is not
+ * currently supported as that will slow down the per-cpu counter fastpath.
+ *
+ * The magic number 32 is chosen to be a compromise between the cost of
+ * reading all the per-cpu counters and that of locking. It can be changed
+ * if there is a better value.
+ */
+#define PERCPU_SET_LIMIT_CPU_THRESHOLD	32
+void percpu_counter_set_limit(struct percpu_counter *fbc, u32 percpu_limit)
+{
+	unsigned long flags;
+	int nrcpus = num_possible_cpus();
+	u32 limit;
+
+	if (nrcpus <= PERCPU_SET_LIMIT_CPU_THRESHOLD)
+		return;
+
+	if (!fbc->count) {
+		WARN(1, "percpu_counter_set_limit() called without an initial counter value!\n");
+		return;
+	}
+	/*
+	 * Use default batch size if the given percpu limit is 0.
+	 */
+	if (!percpu_limit)
+		percpu_limit = percpu_counter_batch;
+	limit = percpu_limit * nrcpus;
+
+	/*
+	 * Limit will not be set if the count is large enough
+	 */
+	raw_spin_lock_irqsave(&fbc->lock, flags);
+	if (abs(fbc->count) <= limit)
+		fbc->limit = limit;
+	raw_spin_unlock_irqrestore(&fbc->lock, flags);
+}
+EXPORT_SYMBOL(percpu_counter_set_limit);
+
 static int __init percpu_counter_startup(void)
 {
 	compute_batch_value();
-- 
1.7.1


* [RFC PATCH 2/2] xfs: Allow degeneration of m_fdblocks/m_ifree to global counters
  2016-03-05  2:51 ` Waiman Long
@ 2016-03-05  2:51   ` Waiman Long
  -1 siblings, 0 replies; 18+ messages in thread
From: Waiman Long @ 2016-03-05  2:51 UTC (permalink / raw)
  To: Tejun Heo, Christoph Lameter, Dave Chinner
  Cc: xfs, linux-kernel, Ingo Molnar, Peter Zijlstra, Scott J Norton,
	Douglas Hatch, Waiman Long

Small XFS filesystems on systems with a large number of CPUs can incur a
significant overhead due to excessive calls to the percpu_counter_sum()
function, which needs to walk through a large number of different
cachelines.

This patch uses the newly added percpu_counter_set_limit() API to
potentially switch the m_fdblocks and m_ifree per-cpu counters to
lock-protected global counters at filesystem mount time if the
filesystem size is small relative to the number of CPUs available.

A possible use case is the use of NVDIMMs as an application scratch
storage area for log files and other small files. Current battery-backed
NVDIMMs are pretty small in size, e.g. 8G per DIMM. So we cannot create
large filesystems on top of them.

On a 4-socket 80-thread system running a 4.5-rc6 kernel, this patch can
improve the throughput of the AIM7 XFS disk workload by 25%. Before
the patch, the perf profile was:

  18.68%   0.08%  reaim  [k] __percpu_counter_compare
  18.05%   9.11%  reaim  [k] __percpu_counter_sum
   0.37%   0.36%  reaim  [k] __percpu_counter_add

After the patch, the perf profile was:

   0.73%   0.36%  reaim  [k] __percpu_counter_add
   0.27%   0.27%  reaim  [k] __percpu_counter_compare

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 fs/xfs/xfs_mount.c |    1 -
 fs/xfs/xfs_mount.h |    5 +++++
 fs/xfs/xfs_super.c |    6 ++++++
 3 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index bb753b3..fe74b91 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1163,7 +1163,6 @@ xfs_mod_ifree(
  * a large batch count (1024) to minimise global counter updates except when
  * we get near to ENOSPC and we have to be very accurate with our updates.
  */
-#define XFS_FDBLOCKS_BATCH	1024
 int
 xfs_mod_fdblocks(
 	struct xfs_mount	*mp,
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index b570984..d9520f4 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -206,6 +206,11 @@ typedef struct xfs_mount {
 #define	XFS_WSYNC_WRITEIO_LOG	14	/* 16k */
 
 /*
+ * FD blocks batch size for per-cpu compare
+ */
+#define XFS_FDBLOCKS_BATCH	1024
+
+/*
  * Allow large block sizes to be reported to userspace programs if the
  * "largeio" mount option is used.
  *
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 59c9b7b..c0b4f79 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1412,6 +1412,12 @@ xfs_reinit_percpu_counters(
 	percpu_counter_set(&mp->m_icount, mp->m_sb.sb_icount);
 	percpu_counter_set(&mp->m_ifree, mp->m_sb.sb_ifree);
 	percpu_counter_set(&mp->m_fdblocks, mp->m_sb.sb_fdblocks);
+
+	/*
+	 * Use default batch size for m_ifree
+	 */
+	percpu_counter_set_limit(&mp->m_ifree, 0);
+	percpu_counter_set_limit(&mp->m_fdblocks, 4 * XFS_FDBLOCKS_BATCH);
 }
 
 static void
-- 
1.7.1


* Re: [RFC PATCH 0/2] percpu_counter: Enable switching to global counter
  2016-03-05  2:51 ` Waiman Long
@ 2016-03-05  6:34   ` Dave Chinner
  -1 siblings, 0 replies; 18+ messages in thread
From: Dave Chinner @ 2016-03-05  6:34 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Christoph Lameter, xfs, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Scott J Norton, Douglas Hatch

On Fri, Mar 04, 2016 at 09:51:37PM -0500, Waiman Long wrote:
> This patchset allows the degeneration of per-cpu counters back to
> global counters when:
> 
>  1) The number of CPUs in the system is large, hence a high cost for
>     calling percpu_counter_sum().
>  2) The initial count value is small so that it has a high chance of
>     excessive percpu_counter_sum() calls.
> 
> When the above 2 conditions are true, this patchset allows the user of
> per-cpu counters to selectively degenerate them into global counters
> with lock. This is done by calling the new percpu_counter_set_limit()
> API after percpu_counter_set(). Without this call, there is no change
> in the behavior of the per-cpu counters.
> 
> Patch 1 implements the new percpu_counter_set_limit() API.
> 
> Patch 2 modifies XFS to call the new API for the m_ifree and m_fdblocks
> per-cpu counters.
> 
> Waiman Long (2):
>   percpu_counter: Allow falling back to global counter on large system
>   xfs: Allow degeneration of m_fdblocks/m_ifree to global counters

NACK.

This change turns off the per-cpu counters for the XFS free block
counters on >32p systems.  We proved 10 years ago that a global
lock for these counters was a massive scalability limitation for
concurrent buffered writes on 16p machines.

IOWs, this change is going to cause fast path concurrent sequential
write regressions for just about everyone, even on empty
filesystems.

The behaviour you are seeing only occurs when the filesystem is near
to ENOSPC. As I asked you last time - if you want to make this
problem go away, please increase the size of the filesystem you are
running your massively concurrent benchmarks on.

IOWs, please stop trying to optimise a filesystem slow path that:

	a) 99.9% of production workloads never execute,
	b) where we expect performance to degrade as allocation gets
	   computationally expensive as we close in on ENOSPC,
	c) we start to execute blocking data flush operations that
	   slow everything down massively, and
	d) is indicative that the workload is about to suffer
	   from a fatal, unrecoverable error (i.e. ENOSPC)

Cheers,

Dave.
-- 
Dave Chinner
dchinner@redhat.com


* Re: [RFC PATCH 0/2] percpu_counter: Enable switching to global counter
  2016-03-05  6:34   ` Dave Chinner
  (?)
@ 2016-03-07 17:39   ` Waiman Long
  2016-03-07 21:33       ` Dave Chinner
  -1 siblings, 1 reply; 18+ messages in thread
From: Waiman Long @ 2016-03-07 17:39 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tejun Heo, Christoph Lameter, xfs, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Scott J Norton, Douglas Hatch

On 03/05/2016 01:34 AM, Dave Chinner wrote:
> On Fri, Mar 04, 2016 at 09:51:37PM -0500, Waiman Long wrote:
>> This patchset allows the degeneration of per-cpu counters back to
>> global counters when:
>>
>>   1) The number of CPUs in the system is large, hence a high cost for
>>      calling percpu_counter_sum().
>>   2) The initial count value is small so that it has a high chance of
>>      excessive percpu_counter_sum() calls.
>>
>> When the above 2 conditions are true, this patchset allows the user of
>> per-cpu counters to selectively degenerate them into global counters
>> with lock. This is done by calling the new percpu_counter_set_limit()
>> API after percpu_counter_set(). Without this call, there is no change
>> in the behavior of the per-cpu counters.
>>
>> Patch 1 implements the new percpu_counter_set_limit() API.
>>
>> Patch 2 modifies XFS to call the new API for the m_ifree and m_fdblocks
>> per-cpu counters.
>>
>> Waiman Long (2):
>>    percpu_counter: Allow falling back to global counter on large system
>>    xfs: Allow degeneration of m_fdblocks/m_ifree to global counters
> NACK.
>
> This change to turns off per-counter free block counters for 32p for
> the XFS free block counters.  We proved 10 years ago that a global
> lock for these counters was a massive scalability limitation for
> concurrent buffered writes on 16p machines.
>
> IOWs, this change is going to cause fast path concurrent sequential
> write regressions for just about everyone, even on empty
> filesystems.

That is not really the case here. The patch won't change anything if
there are enough free blocks available in the filesystem. It will turn on
the global lock at mount time iff the number of free blocks available is
less than the given limit. In the case of XFS, it is 12MB per CPU. On
the 80-thread system that I used for testing, it will be a bit less than
1GB. Even if the global lock is enabled at the beginning, it will
transition back to per-cpu counters as soon as enough free blocks become
available.

I am aware that if there are enough threads pounding on the lock, it can 
cause a scalability bottleneck. However, the qspinlock used in x86 
should greatly alleviate the scalability impact compared with 10 years 
ago when we used the ticket lock. BTW, what exactly was the 
microbenchmark that you used to exercise concurrent sequential write? I 
would like to try it out on the new hardware and kernel.

The AIM7 microbenchmark that I used was not able to generate more than 
1% CPU time in spinlock contention for __percpu_counter_add() on my 
80-thread test system. On the other hand, the overhead of doing 
percpu_counter_sum() had consumed more than 18% of CPU time with the 
same microbenchmark when the filesystem was small. If the number of 
__percpu_counter_add() calls is large enough to cause significant
spinlock contention, I think the time wasted in percpu_counter_sum()
will be even more for a small filesystem. In the borderline case where the
filesystem is small enough to trigger the use of the global lock with my
patch, but not small enough to trigger excessive percpu_counter_sum()
calls, my patch would cause a degradation in performance.

So I don't think this patch will cause any problem with the free block
count. The other per-cpu count, m_ifree, however, is a problem in the
current code. It used the default batch size, which on my 80-thread
system translates to a comparison threshold of 12800 (2*nr_cpus^2).
However, the number of free inodes in the various XFS filesystems was
less than 2k. So percpu_counter_sum() was called every time
xfs_mod_ifree() was called. This cost about 3% CPU time with my
microbenchmark, which was also eliminated by my patch.
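
For reference, the arithmetic behind the 12800 figure (this is my reading
of the lib/percpu_counter.c defaults, so double-check against the tree):

	batch     = max(32, 2 * num_online_cpus())	/* default percpu_counter_batch */
	          = 2 * 80 = 160			/* on this 80-thread box */

	/*
	 * __percpu_counter_compare() only trusts the rough count when
	 * abs(count - rhs) > batch * num_online_cpus(); otherwise it
	 * falls back to percpu_counter_sum().
	 */
	threshold = 160 * 80 = 12800			/* == 2 * nr_cpus^2 */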

> The behaviour you are seeing only occurs when the filesystem is near
> to ENOSPC. As i asked you last time - if you want to make this
> problem go away, please increase the size of the filesystem you are
> running your massively concurrent benchmarks on.
>
> IOWs, please stop trying to optimise a filesystem slow path that:
>
> 	a) 99.9% of production workloads never execute,
> 	b) where we expect performance to degrade as allocation gets
> 	   computationally expensive as we close in on ENOSPC,
> 	c) we start to execute blocking data flush operations that
> 	   slow everything down massively, and
> 	d) is indicative that the workload is about to suffer
> 	   from a fatal, unrecoverable error (i.e. ENOSPC)
>

I totally agree. I am not trying to optimize a filesystem slowpath. 
There are use cases, however, where we may want to create relatively 
small filesystem. One example that I cited in patch 2 is the battery 
backed NVDIMM that I have played with recently. They can be used for log 
files or other small files. Each dimm is 8 GB. You can have a few of 
those available. So the filesystem size could be 32GB or so. That can 
come close to the the limit where excessive percpu_counter_sum() call 
can happen. What I want to do here is to try to reduce the chance of 
excessive percpu_counter_sum() calls causing a performance problem. For 
a large filesystem that is nowhere near ENOSPC, my patch will have no 
performance impact whatsoever.

Cheers,
Longman


* Re: [RFC PATCH 1/2] percpu_counter: Allow falling back to global counter on large system
  2016-03-05  2:51   ` Waiman Long
@ 2016-03-07 18:24     ` Christoph Lameter
  -1 siblings, 0 replies; 18+ messages in thread
From: Christoph Lameter @ 2016-03-07 18:24 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Dave Chinner, xfs, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Scott J Norton, Douglas Hatch

On Fri, 4 Mar 2016, Waiman Long wrote:

> This patch provides a mechanism to selectively degenerate per-cpu
> counters to global counters at per-cpu counter initialization time. The
> following new API is added:
>
>   percpu_counter_set_limit(struct percpu_counter *fbc,
>                            u32 percpu_limit)
>
> The function should be called after percpu_counter_set(). It will
> compare the total limit (nr_cpu * percpu_limit) against the current
> counter value.  If the limit is not smaller, it will disable per-cpu
> counter and use only the global counter instead. At run time, when
> the counter value grows past the total limit, per-cpu counter will
> be enabled again.

Hmmm... That requires manual setting of a limit. Would it not be
possible to completely automate the switch over? F.e. one could
keep a cpumask of processors that use the per cpu counters.

Then in the fastpath, if the current cpu is a member, increment the per cpu
counter. If not, do the spinlock thing. If there is contention, add the
cpu to the cpumask and use the per cpu counters. Thus automatically
scaling for the processors on which frequent increments are operating.

Then regularly (once per minute or so) degenerate the counter by folding
the per cpu diffs into the global count and zapping the cpumask.

If the cpumask is empty you can use the global count. Otherwise you just
need to add up the counters of the cpus set in the cpumask.
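
For what it's worth, a rough, untested sketch of that idea (all names are
invented for illustration; cpumask allocation, irq safety, batching and
the periodic fold-back worker are omitted):

	#include <linux/cpumask.h>
	#include <linux/percpu.h>
	#include <linux/smp.h>
	#include <linux/spinlock.h>

	struct auto_percpu_counter {
		raw_spinlock_t lock;
		s64 count;			/* global part */
		s32 __percpu *counters;		/* per-cpu deltas */
		cpumask_var_t active;		/* cpus on the per-cpu path */
	};

	static void auto_percpu_counter_add(struct auto_percpu_counter *pc,
					    s64 amount)
	{
		int cpu = get_cpu();		/* also disables preemption */

		if (cpumask_test_cpu(cpu, pc->active)) {
			/* fast path: this cpu already opted in to per-cpu counting */
			this_cpu_add(*pc->counters, amount);
		} else if (raw_spin_trylock(&pc->lock)) {
			/* uncontended: keep using the shared global count */
			pc->count += amount;
			raw_spin_unlock(&pc->lock);
		} else {
			/* contention: opt this cpu in and use its per-cpu delta */
			cpumask_set_cpu(cpu, pc->active);
			this_cpu_add(*pc->counters, amount);
		}
		put_cpu();
	}

	/*
	 * A periodic worker (not shown) would fold the per-cpu deltas back
	 * into pc->count under pc->lock and clear pc->active, so that a
	 * quiescent counter degenerates back to a plain global count.
	 */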


* Re: [RFC PATCH 1/2] percpu_counter: Allow falling back to global counter on large system
  2016-03-07 18:24     ` Christoph Lameter
  (?)
@ 2016-03-07 19:47     ` Waiman Long
  -1 siblings, 0 replies; 18+ messages in thread
From: Waiman Long @ 2016-03-07 19:47 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, Dave Chinner, xfs, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Scott J Norton, Douglas Hatch

On 03/07/2016 01:24 PM, Christoph Lameter wrote:
> On Fri, 4 Mar 2016, Waiman Long wrote:
>
>> This patch provides a mechanism to selectively degenerate per-cpu
>> counters to global counters at per-cpu counter initialization time. The
>> following new API is added:
>>
>>    percpu_counter_set_limit(struct percpu_counter *fbc,
>>                             u32 percpu_limit)
>>
>> The function should be called after percpu_counter_set(). It will
>> compare the total limit (nr_cpu * percpu_limit) against the current
>> counter value.  If the limit is not smaller, it will disable per-cpu
>> counter and use only the global counter instead. At run time, when
>> the counter value grows past the total limit, per-cpu counter will
>> be enabled again.
> Hmmm... That is requiring manual setting of a limit. Would it not be
> possible to completely automatize the switch over? F.e. one could
> keep a cpumask of processors that use the per cpu counters.

The limit is usually the batch size used or a multiple of it.

> Then in the fastpath if the current cpu is a member increment the per cpu
> counter. If not do the spinlock thing. If there is contention add the
> cpu to the cpumask and use the  per cpu counters. Thus automatically
> scaling for the processors on which frequent increments are operating.

That is an interesting idea. I will do some prototyping and see how it 
goes. One of the downsides that I see is the increase in the size of the
percpu_counter structure.

> Then regularly (once per minute or so) degenerate the counter by folding
> the per cpu diffs into the global count and zapping the cpumask.

Actually, I think we need 2 cpumasks - one for deciding whether to use the
global or percpu count, and another recording which percpu counts are in
use, as it is not safe to change a per-cpu count other than your own.

> If the cpumask is empty you can use the global count. Otherwise you just
> need to add up the counters of the cpus set in the cpumask.

Cheers,
Longman


* Re: [RFC PATCH 0/2] percpu_counter: Enable switching to global counter
  2016-03-07 17:39   ` Waiman Long
@ 2016-03-07 21:33       ` Dave Chinner
  0 siblings, 0 replies; 18+ messages in thread
From: Dave Chinner @ 2016-03-07 21:33 UTC (permalink / raw)
  To: Waiman Long
  Cc: Dave Chinner, Tejun Heo, Christoph Lameter, xfs, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Scott J Norton, Douglas Hatch

On Mon, Mar 07, 2016 at 12:39:55PM -0500, Waiman Long wrote:
> On 03/05/2016 01:34 AM, Dave Chinner wrote:
> >On Fri, Mar 04, 2016 at 09:51:37PM -0500, Waiman Long wrote:
> >>This patchset allows the degeneration of per-cpu counters back
> >>to global counters when:
> >>
> >>  1) The number of CPUs in the system is large, hence a high
> >>  cost for calling percpu_counter_sum().  2) The initial count
> >>  value is small so that it has a high chance of excessive
> >>  percpu_counter_sum() calls.
> >>
> >>When the above 2 conditions are true, this patchset allows the
> >>user of per-cpu counters to selectively degenerate them into
> >>global counters with lock. This is done by calling the new
> >>percpu_counter_set_limit() API after percpu_counter_set().
> >>Without this call, there is no change in the behavior of the
> >>per-cpu counters.
> >>
> >>Patch 1 implements the new percpu_counter_set_limit() API.
> >>
> >>Patch 2 modifies XFS to call the new API for the m_ifree and
> >>m_fdblocks per-cpu counters.
> >>
> >>Waiman Long (2): percpu_counter: Allow falling back to global
> >>counter on large system xfs: Allow degeneration of
> >>m_fdblocks/m_ifree to global counters
> >NACK.
> >
> >This change to turns off per-counter free block counters for 32p
> >for the XFS free block counters.  We proved 10 years ago that a
> >global lock for these counters was a massive scalability
> >limitation for concurrent buffered writes on 16p machines.
> >
> >IOWs, this change is going to cause fast path concurrent
> >sequential write regressions for just about everyone, even on
> >empty filesystems.
> 
> That is not really the case here. The patch won't change anything
> if there is enough free blocks available in the filesystem.  It
> will turn on global lock at mount time iff the number of free
> blocks available is less than the given limit. In the case of XFS,
> it is 12MB per CPU. On the 80-thread system that I used for
> testing, it will be a bit less than 1GB. Even if global lock is
> enabled at the beginning, it will be transitioned back to percpu
> lock as soon as enough free blocks become available.

Again: How is this an optimisation that is generally useful? Nobody
runs their production 80-thread workloads on a filesystem with less
than 1GB of free space. This is a situation that most admins would
consider "impending doom".

> I am aware that if there are enough threads pounding on the lock,
> it can cause a scalability bottleneck. However, the qspinlock used
> in x86 should greatly alleviate the scalability impact compared
> with 10 years ago when we used the ticket lock.

Regardless of whether there is less contention, it still brings back
a global serialisation point and a modified cacheline (the free block
counter) in the filesystem that, at some point, will limit
concurrency....

> BTW, what exactly
> was the microbenchmark that you used to exercise concurrent
> sequential write? I would like to try it out on the new hardware
> and kernel.

Just something that HPC apps have been known to do for more than 20
years: concurrent sequential write from every CPU in the system.

http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf

> >near to ENOSPC. As i asked you last time - if you want to make
> >this problem go away, please increase the size of the filesystem
> >you are running your massively concurrent benchmarks on.
> >
> >IOWs, please stop trying to optimise a filesystem slow path that:
> >
> >	a) 99.9% of production workloads never execute, b) where we
> >	expect performance to degrade as allocation gets
> >	computationally expensive as we close in on ENOSPC, c) we
> >	start to execute blocking data flush operations that slow
> >	everything down massively, and d) is indicative that the
> >	workload is about to suffer from a fatal, unrecoverable
> >	error (i.e. ENOSPC)
> >
> 
> I totally agree. I am not trying to optimize a filesystem
> slowpath.

Where else in the kernel is there a requirement for 100%
accurate threshold detection on per-cpu counters? There isn't, is
there?

> There are use cases, however, where we may want to
> create relatively small filesystem. One example that I cited in
> patch 2 is the battery backed NVDIMM that I have played with
> recently. They can be used for log files or other small files.
> Each dimm is 8 GB. You can have a few of those available. So the
> filesystem size could be 32GB or so.  That can come close to the
> the limit where excessive percpu_counter_sum() call can happen.
> What I want to do here is to try to reduce the chance of excessive
> percpu_counter_sum() calls causing a performance problem. For a
> large filesystem that is nowhere near ENOSPC, my patch will have
> no performance impact whatsoever.

Yet your patch won't have any effect on these "small" filesystems
because unless they have less free space than your threshold at
mount time (rare!) they won't ever have this global lock turned on.
Not to mention that if space is freed in the fs, the global lock is
turned off, and will never get turned back on.

Further, anyone using XFS on nvdimms will be enabling DAX, which
goes through the direct IO path rather than the buffered IO path
that is generating all this block accounting pressure. Hence it will
behave differently, and so your solution doesn't obviously apply to
that workload space, either.

When we get production workloads hitting free block accounting
issues near ENOSPC, then we'll look at optimising the XFS accounting
code. Microbenchmarks are great when they have real-world relevance,
but this doesn't right now. Not to mention we've got bigger things
to worry about in XFS right now in terms of ENOSPC accounting (think
reverse mapping, shared blocks and breaking shares via COW right
next to ENOSPC) and getting these working *correctly* takes
precedence over optimisation of the accounting code.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [RFC PATCH 1/2] percpu_counter: Allow falling back to global counter on large system
  2016-03-07 18:24     ` Christoph Lameter
  (?)
  (?)
@ 2016-03-16 19:20     ` Waiman Long
  2016-03-18  1:58         ` Christoph Lameter
  -1 siblings, 1 reply; 18+ messages in thread
From: Waiman Long @ 2016-03-16 19:20 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, Dave Chinner, xfs, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Scott J Norton, Douglas Hatch

On 03/07/2016 01:24 PM, Christoph Lameter wrote:
> On Fri, 4 Mar 2016, Waiman Long wrote:
>
>> This patch provides a mechanism to selectively degenerate per-cpu
>> counters to global counters at per-cpu counter initialization time. The
>> following new API is added:
>>
>>    percpu_counter_set_limit(struct percpu_counter *fbc,
>>                             u32 percpu_limit)
>>
>> The function should be called after percpu_counter_set(). It will
>> compare the total limit (nr_cpu * percpu_limit) against the current
>> counter value.  If the limit is not smaller, it will disable the
>> per-cpu counters and use only the global counter instead. At run
>> time, when the counter value grows past the total limit, the
>> per-cpu counters will be enabled again.
> Hmmm... That requires manual setting of a limit. Would it not be
> possible to completely automate the switch over? E.g. one could
> keep a cpumask of processors that use the per-cpu counters.
>
> Then, in the fast path, if the current cpu is a member, increment the
> per-cpu counter. If not, do the spinlock thing. If there is contention,
> add the cpu to the cpumask and use the per-cpu counters. Thus it
> automatically scales for the processors on which frequent increments
> are operating.
>
> Then regularly (once per minute or so) degenerate the counter by folding
> the per cpu diffs into the global count and zapping the cpumask.
>
> If the cpumask is empty you can use the global count. Otherwise you just
> need to add up the counters of the cpus set in the cpumask.
>
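
For reference, a minimal C sketch of the scheme described above, as I
read it. The active_cpus field and the *_sketch helper names are made
up for illustration; this is not code from either patch:

static void counter_add_sketch(struct percpu_counter *fbc, s64 amount)
{
	int cpu = get_cpu();

	if (cpumask_test_cpu(cpu, &fbc->active_cpus)) {
		/* This cpu already runs in per-cpu mode: fast path. */
		__this_cpu_add(*fbc->counters, amount);
	} else if (raw_spin_trylock(&fbc->lock)) {
		/* Uncontended: update the global count under the lock. */
		fbc->count += amount;
		raw_spin_unlock(&fbc->lock);
	} else {
		/* Contention: promote this cpu to per-cpu mode. */
		cpumask_set_cpu(cpu, &fbc->active_cpus);
		__this_cpu_add(*fbc->counters, amount);
	}
	put_cpu();
}

/* Run periodically (e.g. once a minute) to degenerate the counter. */
static void counter_fold_sketch(struct percpu_counter *fbc)
{
	int cpu;

	raw_spin_lock(&fbc->lock);
	for_each_cpu(cpu, &fbc->active_cpus) {
		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);

		fbc->count += *pcount;
		*pcount = 0;
	}
	cpumask_clear(&fbc->active_cpus);
	raw_spin_unlock(&fbc->lock);
}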

I have modified the patch to try that out. However, it doesn't yield 
much of an improvement in terms of performance, and it slows down the 
percpu fast path a bit. So I am going to focus on my existing patch 
first and think about that idea later.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/2] percpu_counter: Enable switching to global counter
  2016-03-07 21:33       ` Dave Chinner
@ 2016-03-16 20:06       ` Waiman Long
  -1 siblings, 0 replies; 18+ messages in thread
From: Waiman Long @ 2016-03-16 20:06 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Dave Chinner, Tejun Heo, Christoph Lameter, xfs, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Scott J Norton, Douglas Hatch

On 03/07/2016 04:33 PM, Dave Chinner wrote:
> On Mon, Mar 07, 2016 at 12:39:55PM -0500, Waiman Long wrote:
>> On 03/05/2016 01:34 AM, Dave Chinner wrote:
>>> On Fri, Mar 04, 2016 at 09:51:37PM -0500, Waiman Long wrote:
>>>> This patchset allows the degeneration of per-cpu counters back
>>>> to global counters when:
>>>>
>>>>   1) The number of CPUs in the system is large, hence a high
>>>>   cost for calling percpu_counter_sum().  2) The initial count
>>>>   value is small so that it has a high chance of excessive
>>>>   percpu_counter_sum() calls.
>>>>
>>>> When the above 2 conditions are true, this patchset allows the
>>>> user of per-cpu counters to selectively degenerate them into
>>>> global counters with lock. This is done by calling the new
>>>> percpu_counter_set_limit() API after percpu_counter_set().
>>>> Without this call, there is no change in the behavior of the
>>>> per-cpu counters.
>>>>
>>>> Patch 1 implements the new percpu_counter_set_limit() API.
>>>>
>>>> Patch 2 modifies XFS to call the new API for the m_ifree and
>>>> m_fdblocks per-cpu counters.
>>>>
>>>> Waiman Long (2): percpu_counter: Allow falling back to global
>>>> counter on large system xfs: Allow degeneration of
>>>> m_fdblocks/m_ifree to global counters
>>> NACK.
>>>
>>> This change turns off the per-cpu free block counters for 32p
>>> for the XFS free block counters.  We proved 10 years ago that a
>>> global lock for these counters was a massive scalability
>>> limitation for concurrent buffered writes on 16p machines.
>>>
>>> IOWs, this change is going to cause fast path concurrent
>>> sequential write regressions for just about everyone, even on
>>> empty filesystems.
>> That is not really the case here. The patch won't change anything
>> if there are enough free blocks available in the filesystem.  It
>> will turn on the global lock at mount time iff the number of free
>> blocks available is less than the given limit. In the case of XFS,
>> it is 12MB per CPU. On the 80-thread system that I used for
>> testing, it will be a bit less than 1GB. Even if the global lock is
>> enabled at the beginning, it will transition back to the per-cpu
>> counters as soon as enough free blocks become available.
> Again: How is this an optimisation that is generally useful? Nobody
> runs their production 80-thread workloads on a filesystem with less
> than 1GB of free space. This is a situation that most admins would
> consider "impending doom".

In most cases, there will be enough free blocks in m_fdblocks that 
switching to the global count will never happen. However, I found that 
m_ifree is a different story. On the 80-cpu system that I used, the 
percpu slowpath will be activated when there are fewer than 2*80^2 = 
12800 free inodes available, which is usually the case because the code 
uses the default batch size (which scales linearly with the number of 
CPUs). Here, my patch can really help.
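
To put a number on that: the compare path only trusts the approximate
global count when it is more than batch * num_online_cpus() away from
the comparison value, and the default batch scales with the CPU count.
A worked example (the formula is my recollection of lib/percpu_counter.c
and the helper name is made up, so treat both as assumptions):

static s64 compare_slow_zone(int ncpus)
{
	int batch = 2 * ncpus;		/* default percpu_counter batch (min 32) */

	return (s64)batch * ncpus;	/* 80 cpus: 160 * 80 = 12800 inodes */
}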

>
>> I am aware that if there are enough threads pounding on the lock,
>> it can cause a scalability bottleneck. However, the qspinlock used
>> in x86 should greatly alleviate the scalability impact compared
>> with 10 years ago when we used the ticket lock.
> Regardless of whether there is less contention, it still brings back
> a global serialisation point and modified cacheline (the free block
> counter) in the filesystem that, at some point, will limit
> concurrency....

Yes, that is true, but the alternative here is to access all the 
cachelines of the percpu counters and evict quite a number of other 
useful cachelines along the way. My patch activates the global counter 
at mount time only when the current count is too small. It was proven in 
my test case that accessing all those cachelines was worse than taking 
the lock when there is a large number of cpus.

Once the counter increases past the limit, it will disable the global 
counter and fall back to the usual per-cpu mode. The global counter 
won't be reactivated unless you unmount and remount the filesystem 
again. So I don't think this will cause any performance bottleneck that 
is worse than what the existing code already has.
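
A minimal sketch of that switch, using hypothetical pcpu_disabled and
limit fields and *_sketch names (the actual patch differs in detail):

/* Called once after percpu_counter_set() at mount time. */
void percpu_counter_set_limit_sketch(struct percpu_counter *fbc,
				     u32 percpu_limit)
{
	s64 limit = (s64)num_online_cpus() * percpu_limit;

	if (fbc->count <= limit) {
		fbc->limit = limit;		/* switch-back threshold */
		fbc->pcpu_disabled = true;	/* updates take fbc->lock */
	}
}

/* Update path while the counter is in global mode. */
static void global_mode_add_sketch(struct percpu_counter *fbc, s64 amount)
{
	raw_spin_lock(&fbc->lock);
	fbc->count += amount;
	if (fbc->count > fbc->limit)
		fbc->pcpu_disabled = false;	/* back to per-cpu mode for good */
	raw_spin_unlock(&fbc->lock);
}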

>> BTW, what exactly
>> was the microbenchmark that you used to exercise concurrent
>> sequential write? I would like to try it out on the new hardware
>> and kernel.
> Just something that HPC apps have been known to do for more then 20
> years: concurrent sequential write from every CPU in the system.
>
> http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf

Thanks.

>
>>> near to ENOSPC. As I asked you last time - if you want to make
>>> this problem go away, please increase the size of the filesystem
>>> you are running your massively concurrent benchmarks on.
>>>
>>> IOWs, please stop trying to optimise a filesystem slow path that:
>>>
>>> 	a) 99.9% of production workloads never execute, b) where we
>>> 	expect performance to degrade as allocation gets
>>> 	computationally expensive as we close in on ENOSPC, c) we
>>> 	start to execute blocking data flush operations that slow
>>> 	everything down massively, and d) is indicative that the
>>> 	workload is about to suffer from a fatal, unrecoverable
>>> 	error (i.e. ENOSPC)
>>>
>> I totally agree. I am not trying to optimize a filesystem
>> slowpath.
> Where else in the kernel is there a requirement for 100%
> accurate threshold detection on per-cpu counters? There isn't, is
> there?

I don't quite get what you are asking here. The goal of this patch is 
not 100% accurate threshold detection; that is not what I am aiming for. 
Instead, I am trying to reduce the performance impact of excessive 
percpu_counter_sum() calls. Accurate threshold detection is just a side 
effect, not the main purpose of this patch.

>> There are use cases, however, where we may want to
>> create relatively small filesystems. One example that I cited in
>> patch 2 is the battery backed NVDIMM that I have played with
>> recently. They can be used for log files or other small files.
>> Each dimm is 8 GB. You can have a few of those available. So the
>> filesystem size could be 32GB or so.  That can come close to the
>> limit where excessive percpu_counter_sum() calls can happen.
>> What I want to do here is to try to reduce the chance of excessive
>> percpu_counter_sum() calls causing a performance problem. For a
>> large filesystem that is nowhere near ENOSPC, my patch will have
>> no performance impact whatsoever.
> Yet your patch won't have any effect on these "small" filesystems
> because unless they have less free space than your threshold at
> mount time (rare!) they won't ever have this global lock turned on.
> Not to mention that if space is freed in the fs, the global lock is
> turned off, and will never get turned back on.

Yes, as I said above, the m_fdblocks counter isn't an issue in almost 
all cases. However, the patch was found to be useful in reducing the 
performance overhead of the percpu_counter_sum() calls on the m_ifree 
counter on a moderately sized XFS partition. BTW, I made no change in 
the XFS code other than activating the counter limits. If those limits 
are not activated, there is absolutely no change to the current behavior.

Yes, the benchmark that I showed in the changelog isn't a real-life 
scenario. I will try to look for other benchmark results that are more 
realistic.

Cheers,
Longman


> Further, anyone using XFS on nvdimms will be enabling DAX, which
> goes through the direct IO path rather than the buffered IO path
> that is generating all this block accounting pressure. Hence it will
> behave differently, and so your solution doesn't obviously apply to
> that workload space, either.
>
> When we get production workloads hitting free block accounting
> issues near ENOSPC, then we'll look at optimising the XFS accounting
> code. Microbenchmarks are great when they have real-world relevance,
> but this doesn't right now. Not to mention we've got bigger things
> to worry about in XFS right now in terms of ENOSPC accounting (think
> reverse mapping, shared blocks and breaking shares via COW right
> next to ENOSPC) and getting these working *correctly* takes
> precedence over optimisation of the accounting code.
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 1/2] percpu_counter: Allow falling back to global counter on large system
  2016-03-16 19:20     ` Waiman Long
@ 2016-03-18  1:58         ` Christoph Lameter
  0 siblings, 0 replies; 18+ messages in thread
From: Christoph Lameter @ 2016-03-18  1:58 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Dave Chinner, xfs, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Scott J Norton, Douglas Hatch

On Wed, 16 Mar 2016, Waiman Long wrote:

> > If the cpumask is empty you can use the global count. Otherwise you just
> > need to add up the counters of the cpus set in the cpumask.
> >
>
> I have modified the patch to try that out. However, it doesn't yield much
> of an improvement in terms of performance, and it slows down the percpu
> fast path a bit. So I am going to focus on my existing patch first and
> think about that idea later.

Hmmm... Maybe look at the cause of the slowdown first?

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2016-03-18  1:58 UTC | newest]

Thread overview: 18+ messages
2016-03-05  2:51 [RFC PATCH 0/2] percpu_counter: Enable switching to global counter Waiman Long
2016-03-05  2:51 ` Waiman Long
2016-03-05  2:51 ` [RFC PATCH 1/2] percpu_counter: Allow falling back to global counter on large system Waiman Long
2016-03-05  2:51   ` Waiman Long
2016-03-07 18:24   ` Christoph Lameter
2016-03-07 18:24     ` Christoph Lameter
2016-03-07 19:47     ` Waiman Long
2016-03-16 19:20     ` Waiman Long
2016-03-18  1:58       ` Christoph Lameter
2016-03-18  1:58         ` Christoph Lameter
2016-03-05  2:51 ` [RFC PATCH 2/2] xfs: Allow degeneration of m_fdblocks/m_ifree to global counters Waiman Long
2016-03-05  2:51   ` Waiman Long
2016-03-05  6:34 ` [RFC PATCH 0/2] percpu_counter: Enable switching to global counter Dave Chinner
2016-03-05  6:34   ` Dave Chinner
2016-03-07 17:39   ` Waiman Long
2016-03-07 21:33     ` Dave Chinner
2016-03-07 21:33       ` Dave Chinner
2016-03-16 20:06       ` Waiman Long
