* [Cluster-devel] [GFS2 PATCH 0/2] GFS2: Rework rgrp glock congestion functions for intra-node
@ 2018-01-17 14:23 Bob Peterson
  2018-01-17 14:23 ` [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Simplify gfs2_rgrp_used_recently Bob Peterson
  2018-01-17 14:23 ` [Cluster-devel] [GFS2 PATCH 2/2] GFS2: Split gfs2_rgrp_congested into inter-node and intra-node cases Bob Peterson
  0 siblings, 2 replies; 9+ messages in thread
From: Bob Peterson @ 2018-01-17 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

I've gone through a few iterations of this patch set now with mixed reviews,
the previous attempt being a resend on 10 January.

Based on feedback, I've simplified the patch set greatly and broken
it down into two simple patches, given here. The first patch reworks
function gfs2_rgrp_used_recently() to simplify it and use better time
calculations. The second patch adds an "intra-node" congestion check to
gfs2_rgrp_congested(), which previously checked only inter-node (DLM)
congestion; that omission was severely hurting single-node scalability.

This patch set greatly improves the scalability of the GFS2 block
allocator when more than one process wants to allocate blocks at the
same time. Here are the results of a test I ran on a test machine using
iozone with a growing number of concurrent processes:

Before the two patches were applied:
        Children see throughput for  1 initial writers  =  543094.50 kB/sec
        Children see throughput for  2 initial writers  =  631279.31 kB/sec
        Children see throughput for  4 initial writers  =  618569.31 kB/sec
        Children see throughput for  6 initial writers  =  672926.77 kB/sec
        Children see throughput for  8 initial writers  =  620530.25 kB/sec
        Children see throughput for 10 initial writers  =  637743.89 kB/sec
        Children see throughput for 12 initial writers  =  625197.03 kB/sec
        Children see throughput for 14 initial writers  =  627233.04 kB/sec
        Children see throughput for 16 initial writers  =  346880.52 kB/sec

After the two patches are applied:
        Children see throughput for  1 initial writers  =  539514.88 kB/sec
        Children see throughput for  2 initial writers  =  630325.97 kB/sec
        Children see throughput for  4 initial writers  =  820960.05 kB/sec
        Children see throughput for  6 initial writers  =  773291.00 kB/sec
        Children see throughput for  8 initial writers  =  764553.85 kB/sec
        Children see throughput for 10 initial writers  =  837788.38 kB/sec
        Children see throughput for 12 initial writers  =  752443.34 kB/sec
        Children see throughput for 14 initial writers  =  781917.29 kB/sec
        Children see throughput for 16 initial writers  =  816540.99 kB/sec

With one or two processes running, the difference amounts to noise.
But even with only 4 concurrent processes, the block allocator shows
more than 30 percent better throughput with the patches. At 16 concurrent
processes, the overall throughput is more than double, although that
number may be skewed by the fact that I've got 2 sockets, each with 8
cores, and some of the cores are used by the driver and its monitoring.
---
Bob Peterson (2):
  GFS2: Simplify gfs2_rgrp_used_recently
  GFS2: Split gfs2_rgrp_congested into inter-node and intra-node cases

 fs/gfs2/rgrp.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 69 insertions(+), 14 deletions(-)

-- 
2.14.3




* [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Simplify gfs2_rgrp_used_recently
  2018-01-17 14:23 [Cluster-devel] [GFS2 PATCH 0/2] GFS2: Rework rgrp glock congestion functions for intra-node Bob Peterson
@ 2018-01-17 14:23 ` Bob Peterson
  2018-01-18  9:04   ` Steven Whitehouse
  2018-01-17 14:23 ` [Cluster-devel] [GFS2 PATCH 2/2] GFS2: Split gfs2_rgrp_congested into inter-node and intra-node cases Bob Peterson
  1 sibling, 1 reply; 9+ messages in thread
From: Bob Peterson @ 2018-01-17 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

This patch simplifies function gfs2_rgrp_used_recently. Now it uses
ktime functions to do its calculations. Also, it's called with the
rgrp rather than the reservation, and the constant HZ rather than
the hard-coded value 1000.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/rgrp.c | 19 ++++++++-----------
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 6dea72f49316..4c9467bf9ab6 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -1900,21 +1900,18 @@ static bool gfs2_rgrp_congested(const struct gfs2_rgrpd *rgd, int loops)
 }
 
 /**
- * gfs2_rgrp_used_recently
- * @rs: The block reservation with the rgrp to test
+ * gfs2_rgrp_used_recently - check if a rgrp has been used recently
+ * @rgd: The rgrp to test
  * @msecs: The time limit in milliseconds
  *
  * Returns: True if the rgrp glock has been used within the time limit
  */
-static bool gfs2_rgrp_used_recently(const struct gfs2_blkreserv *rs,
-				    u64 msecs)
+static inline bool gfs2_rgrp_used_recently(const struct gfs2_rgrpd *rgd,
+					   u64 msecs)
 {
-	u64 tdiff;
-
-	tdiff = ktime_to_ns(ktime_sub(ktime_get_real(),
-                            rs->rs_rbm.rgd->rd_gl->gl_dstamp));
-
-	return tdiff > (msecs * 1000 * 1000);
+	return (ktime_before(ktime_get_real(),
+			     ktime_add(rgd->rd_gl->gl_dstamp,
+				       ms_to_ktime(msecs))));
 }
 
 static u32 gfs2_orlov_skip(const struct gfs2_inode *ip)
@@ -2014,7 +2011,7 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, struct gfs2_alloc_parms *ap)
 				    !fast_to_acquire(rs->rs_rbm.rgd))
 					goto next_rgrp;
 				if ((loops < 2) &&
-				    gfs2_rgrp_used_recently(rs, 1000) &&
+				    gfs2_rgrp_used_recently(rs->rs_rbm.rgd, HZ) &&
 				    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
 					goto next_rgrp;
 			}
-- 
2.14.3




* [Cluster-devel] [GFS2 PATCH 2/2] GFS2: Split gfs2_rgrp_congested into inter-node and intra-node cases
  2018-01-17 14:23 [Cluster-devel] [GFS2 PATCH 0/2] GFS2: Rework rgrp glock congestion functions for intra-node Bob Peterson
  2018-01-17 14:23 ` [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Simplify gfs2_rgrp_used_recently Bob Peterson
@ 2018-01-17 14:23 ` Bob Peterson
  2018-01-18  9:08   ` Steven Whitehouse
  1 sibling, 1 reply; 9+ messages in thread
From: Bob Peterson @ 2018-01-17 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Before this patch, function gfs2_rgrp_congested only checked the
inter-node (dlm) case. But it only makes sense to check the rgrp
glock statistics if we're using a real locking protocol like dlm.
For lock_nolock, this is just a waste of time and can give false
positives.

This patch adds logic to function gfs2_rgrp_congested to check
inter-node and intra-node congestion. It checks for intra-node
(local node process) congestion first, since that's faster.

If there are other processes actively using this rgrp (i.e. if
they have reservations or are holding its glock) it's considered
congested in the intra-node sense.

If there is not intra-node congestion, and the locking protocol
is lock_dlm, we need to check the locking statistics to see if
there is any inter-node congestion.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/rgrp.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 61 insertions(+), 3 deletions(-)

diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 4c9467bf9ab6..97db13a841f4 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -1826,10 +1826,57 @@ static void try_rgrp_unlink(struct gfs2_rgrpd *rgd, u64 *last_unlinked, u64 skip
 	return;
 }
 
+/**
+ * other_rgrp_users - figure out if this rgrp has other local users
+ * @rgd: The resource group
+ * @locked: true if we've already held the glock
+ *
+ * We're trying to figure out if the given rgrp has anybody competing for
+ * its free space. If other processes have enqueued its glock, there's a
+ * good chance of competition.
+ *
+ * If there are multi-block reservations for this rgrp, there's a good
+ * chance another process will lock the rgrp for block allocations soon.
+ *
+ * If we've already held the glock, we no longer care if there are holders
+ * because that's now a given (rgrp glocks are never shared).
+ */
+static inline bool other_rgrp_users(const struct gfs2_rgrpd *rgd, bool locked)
+{
+	struct gfs2_glock *gl = rgd->rd_gl;
+
+	if (!locked && !list_empty(&gl->gl_holders))
+		return true;
+	if (!RB_EMPTY_ROOT(&rgd->rd_rstree))
+		return true;
+	return false;
+}
+
 /**
  * gfs2_rgrp_congested - Use stats to figure out whether an rgrp is congested
  * @rgd: The rgrp in question
  * @loops: An indication of how picky we can be (0=very, 1=less so)
+ * @locked: Indication of whether we've already held the rgrp glock
+ *
+ * There are two kinds of congestion: inter-node and intra-node.
+ *
+ * Inter-node congestion is where multiple nodes all want to allocate blocks
+ * inside the same rgrp, which means they need to trade the rgrp glock back
+ * and forth, which is a slow process. To mitigate this, we use glock
+ * statistics to predict whether the glock is historically fast to acquire.
+ *
+ * Intra-node congestion is where you have multiple processes on the same
+ * node, all trying to allocate blocks in the same rgrp. There's nothing wrong
+ * with doing so, but each process needs to wait for the other to release the
+ * rgrp glock before it may proceed.
+ *
+ * Both kinds of congestion can hurt performance, but it's faster to check
+ * intra-node, so we do that first. After all, why bother to check if we can
+ * get the glock quickly from DLM if other processes have also used that
+ * same reasoning.
+ *
+ * If we're not using inter-node locking (dlm) it doesn't make sense to check
+ * the glock statistics.
  *
  * This function uses the recently added glock statistics in order to
  * figure out whether a parciular resource group is suffering from
@@ -1853,7 +1900,8 @@ static void try_rgrp_unlink(struct gfs2_rgrpd *rgd, u64 *last_unlinked, u64 skip
  * Returns: A boolean verdict on the congestion status
  */
 
-static bool gfs2_rgrp_congested(const struct gfs2_rgrpd *rgd, int loops)
+static bool gfs2_rgrp_congested(const struct gfs2_rgrpd *rgd, int loops,
+				bool locked)
 {
 	const struct gfs2_glock *gl = rgd->rd_gl;
 	const struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
@@ -1865,6 +1913,15 @@ static bool gfs2_rgrp_congested(const struct gfs2_rgrpd *rgd, int loops)
 	u64 var;
 	int cpu, nonzero = 0;
 
+	/* Intra-node (local) lock congestion checks: */
+	if (loops == 0 && other_rgrp_users(rgd, locked))
+		return true;
+
+	/* If not inter-node (DLM) locking, we don't care about the stats  */
+	if (rgd->rd_sbd->sd_lockstruct.ls_ops->lm_lock == NULL)
+		return false;
+
+	/* Inter-node lock congestion checks: */
 	preempt_disable();
 	for_each_present_cpu(cpu) {
 		st = &per_cpu_ptr(sdp->sd_lkstats, cpu)->lkstats[LM_TYPE_RGRP];
@@ -2012,7 +2069,8 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, struct gfs2_alloc_parms *ap)
 					goto next_rgrp;
 				if ((loops < 2) &&
 				    gfs2_rgrp_used_recently(rs->rs_rbm.rgd, HZ) &&
-				    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
+				    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops,
+							false))
 					goto next_rgrp;
 			}
 			error = gfs2_glock_nq_init(rs->rs_rbm.rgd->rd_gl,
@@ -2021,7 +2079,7 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, struct gfs2_alloc_parms *ap)
 			if (unlikely(error))
 				return error;
 			if (!gfs2_rs_active(rs) && (loops < 2) &&
-			    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
+			    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops, true))
 				goto skip_rgrp;
 			if (sdp->sd_args.ar_rgrplvb) {
 				error = update_rgrp_lvb(rs->rs_rbm.rgd);
-- 
2.14.3




* [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Simplify gfs2_rgrp_used_recently
  2018-01-17 14:23 ` [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Simplify gfs2_rgrp_used_recently Bob Peterson
@ 2018-01-18  9:04   ` Steven Whitehouse
  0 siblings, 0 replies; 9+ messages in thread
From: Steven Whitehouse @ 2018-01-18  9:04 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,


On 17/01/18 14:23, Bob Peterson wrote:
> This patch simplifies function gfs2_rgrp_used_recently. Now it uses
> ktime functions to do its calculations. Also, it's called with the
> rgrp rather than the reservation, and the constant HZ rather than
> the hard-coded value 1000.
>
> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
> ---
>   fs/gfs2/rgrp.c | 19 ++++++++-----------
>   1 file changed, 8 insertions(+), 11 deletions(-)
I'm not sure that this does simplify it. In any case the argument is a 
time specified in msec, and HZ is a constant whose value is the number 
of clock ticks in a second. It makes no sense to assign one to the other.

It would be good to get away from the constant of 1 sec that is used 
here, but that requires some analysis in order to figure out how to 
create a suitable algorithm. If, as a result of testing, we can work out 
a more suitable constant to use here, then that is fine as a stop-gap 
measure too. HZ though makes no sense at all in this case,
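
To illustrate (a sketch only; RGRP_RECENT_MSECS is an invented name, not
an existing constant), keeping the caller in milliseconds would look
something like:

        #define RGRP_RECENT_MSECS 1000  /* time limit in ms, independent of HZ */
        ...
        if ((loops < 2) &&
            gfs2_rgrp_used_recently(rs->rs_rbm.rgd, RGRP_RECENT_MSECS) &&
            gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
                goto next_rgrp;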

Steve.

> diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
> index 6dea72f49316..4c9467bf9ab6 100644
> --- a/fs/gfs2/rgrp.c
> +++ b/fs/gfs2/rgrp.c
> @@ -1900,21 +1900,18 @@ static bool gfs2_rgrp_congested(const struct gfs2_rgrpd *rgd, int loops)
>   }
>   
>   /**
> - * gfs2_rgrp_used_recently
> - * @rs: The block reservation with the rgrp to test
> + * gfs2_rgrp_used_recently - check if a rgrp has been used recently
> + * @rgd: The rgrp to test
>    * @msecs: The time limit in milliseconds
>    *
>    * Returns: True if the rgrp glock has been used within the time limit
>    */
> -static bool gfs2_rgrp_used_recently(const struct gfs2_blkreserv *rs,
> -				    u64 msecs)
> +static inline bool gfs2_rgrp_used_recently(const struct gfs2_rgrpd *rgd,
> +					   u64 msecs)
>   {
> -	u64 tdiff;
> -
> -	tdiff = ktime_to_ns(ktime_sub(ktime_get_real(),
> -                            rs->rs_rbm.rgd->rd_gl->gl_dstamp));
> -
> -	return tdiff > (msecs * 1000 * 1000);
> +	return (ktime_before(ktime_get_real(),
> +			     ktime_add(rgd->rd_gl->gl_dstamp,
> +				       ms_to_ktime(msecs))));
>   }
>   
>   static u32 gfs2_orlov_skip(const struct gfs2_inode *ip)
> @@ -2014,7 +2011,7 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, struct gfs2_alloc_parms *ap)
>   				    !fast_to_acquire(rs->rs_rbm.rgd))
>   					goto next_rgrp;
>   				if ((loops < 2) &&
> -				    gfs2_rgrp_used_recently(rs, 1000) &&
> +				    gfs2_rgrp_used_recently(rs->rs_rbm.rgd, HZ) &&
>   				    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
>   					goto next_rgrp;
>   			}




* [Cluster-devel] [GFS2 PATCH 2/2] GFS2: Split gfs2_rgrp_congested into inter-node and intra-node cases
  2018-01-17 14:23 ` [Cluster-devel] [GFS2 PATCH 2/2] GFS2: Split gfs2_rgrp_congested into inter-node and intra-node cases Bob Peterson
@ 2018-01-18  9:08   ` Steven Whitehouse
  2018-01-19 14:43     ` Bob Peterson
  0 siblings, 1 reply; 9+ messages in thread
From: Steven Whitehouse @ 2018-01-18  9:08 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,


On 17/01/18 14:23, Bob Peterson wrote:
> Before this patch, function gfs2_rgrp_congested only checked the
> inter-node (dlm) case. But it only makes sense to check the rgrp
> glock statistics if we're using a real locking protocol like dlm.
> For lock_nolock, this is just a waste of time and can give false
> positives.
>
> This patch adds logic to function gfs2_rgrp_congested to check
> inter-node and intra-node congestion. It checks for intra-node
> (local node process) congestion first, since that's faster.
>
> If there are other processes actively using this rgrp (i.e. if
> they have reservations or are holding its glock) it's considered
> congested in the intra-node sense.
>
> If there is not intra-node congestion, and the locking protocol
> is lock_dlm, we need to check the locking statistics to see if
> there is any inter-node congestion.
>
> Signed-off-by: Bob Peterson <rpeterso@redhat.com>

My comments in relation to the previous patch still stand. This is still 
the wrong answer to the problem. If we have lock contention here, then 
we should be looking to fix the locking and not trying to avoid the 
issue by choosing a different rgrp.

If we are deallocating blocks, we have no choice over which rgrp we 
lock, since we must lock the rgrp that contains the blocks we want to 
deallocate. The same problem exists there too, so let's fix this properly 
as per my previous comments,

Steve.


> ---
>   fs/gfs2/rgrp.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++++++---
>   1 file changed, 61 insertions(+), 3 deletions(-)
>
> diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
> index 4c9467bf9ab6..97db13a841f4 100644
> --- a/fs/gfs2/rgrp.c
> +++ b/fs/gfs2/rgrp.c
> @@ -1826,10 +1826,57 @@ static void try_rgrp_unlink(struct gfs2_rgrpd *rgd, u64 *last_unlinked, u64 skip
>   	return;
>   }
>   
> +/**
> + * other_rgrp_users - figure out if this rgrp has other local users
> + * @rgd: The resource group
> + * @locked: true if we've already held the glock
> + *
> + * We're trying to figure out if the given rgrp has anybody competing for
> + * its free space. If other processes have enqueued its glock, there's a
> + * good chance of competition.
> + *
> + * If there are multi-block reservations for this rgrp, there's a good
> + * chance another process will lock the rgrp for block allocations soon.
> + *
> + * If we've already held the glock, we no longer care if there are holders
> + * because that's now a given (rgrp glocks are never shared).
> + */
> +static inline bool other_rgrp_users(const struct gfs2_rgrpd *rgd, bool locked)
> +{
> +	struct gfs2_glock *gl = rgd->rd_gl;
> +
> +	if (!locked && !list_empty(&gl->gl_holders))
> +		return true;
> +	if (!RB_EMPTY_ROOT(&rgd->rd_rstree))
> +		return true;
> +	return false;
> +}
> +
>   /**
>    * gfs2_rgrp_congested - Use stats to figure out whether an rgrp is congested
>    * @rgd: The rgrp in question
>    * @loops: An indication of how picky we can be (0=very, 1=less so)
> + * @locked: Indication of whether we've already held the rgrp glock
> + *
> + * There are two kinds of congestion: inter-node and intra-node.
> + *
> + * Inter-node congestion is where multiple nodes all want to allocate blocks
> + * inside the same rgrp, which means they need to trade the rgrp glock back
> + * and forth, which is a slow process. To mitigate this, we use glock
> + * statistics to predict whether the glock is historically fast to acquire.
> + *
> + * Intra-node congestion is where you have multiple processes on the same
> + * node, all trying to allocate blocks in the same rgrp. There's nothing wrong
> + * with doing so, but each process needs to wait for the other to release the
> + * rgrp glock before it may proceed.
> + *
> + * Both kinds of congestion can hurt performance, but it's faster to check
> + * intra-node, so we do that first. After all, why bother to check if we can
> + * get the glock quickly from DLM if other processes have also used that
> + * same reasoning.
> + *
> + * If we're not using inter-node locking (dlm) it doesn't make sense to check
> + * the glock statistics.
>    *
>    * This function uses the recently added glock statistics in order to
>    * figure out whether a parciular resource group is suffering from
> @@ -1853,7 +1900,8 @@ static void try_rgrp_unlink(struct gfs2_rgrpd *rgd, u64 *last_unlinked, u64 skip
>    * Returns: A boolean verdict on the congestion status
>    */
>   
> -static bool gfs2_rgrp_congested(const struct gfs2_rgrpd *rgd, int loops)
> +static bool gfs2_rgrp_congested(const struct gfs2_rgrpd *rgd, int loops,
> +				bool locked)
>   {
>   	const struct gfs2_glock *gl = rgd->rd_gl;
>   	const struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
> @@ -1865,6 +1913,15 @@ static bool gfs2_rgrp_congested(const struct gfs2_rgrpd *rgd, int loops)
>   	u64 var;
>   	int cpu, nonzero = 0;
>   
> +	/* Intra-node (local) lock congestion checks: */
> +	if (loops == 0 && other_rgrp_users(rgd, locked))
> +		return true;
> +
> +	/* If not inter-node (DLM) locking, we don't care about the stats  */
> +	if (rgd->rd_sbd->sd_lockstruct.ls_ops->lm_lock == NULL)
> +		return false;
> +
> +	/* Inter-node lock congestion checks: */
>   	preempt_disable();
>   	for_each_present_cpu(cpu) {
>   		st = &per_cpu_ptr(sdp->sd_lkstats, cpu)->lkstats[LM_TYPE_RGRP];
> @@ -2012,7 +2069,8 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, struct gfs2_alloc_parms *ap)
>   					goto next_rgrp;
>   				if ((loops < 2) &&
>   				    gfs2_rgrp_used_recently(rs->rs_rbm.rgd, HZ) &&
> -				    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
> +				    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops,
> +							false))
>   					goto next_rgrp;
>   			}
>   			error = gfs2_glock_nq_init(rs->rs_rbm.rgd->rd_gl,
> @@ -2021,7 +2079,7 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, struct gfs2_alloc_parms *ap)
>   			if (unlikely(error))
>   				return error;
>   			if (!gfs2_rs_active(rs) && (loops < 2) &&
> -			    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
> +			    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops, true))
>   				goto skip_rgrp;
>   			if (sdp->sd_args.ar_rgrplvb) {
>   				error = update_rgrp_lvb(rs->rs_rbm.rgd);




* [Cluster-devel] [GFS2 PATCH 2/2] GFS2: Split gfs2_rgrp_congested into inter-node and intra-node cases
  2018-01-18  9:08   ` Steven Whitehouse
@ 2018-01-19 14:43     ` Bob Peterson
  2018-01-24 17:04       ` Bob Peterson
  0 siblings, 1 reply; 9+ messages in thread
From: Bob Peterson @ 2018-01-19 14:43 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

----- Original Message -----
| > This patch adds logic to function gfs2_rgrp_congested to check
| > inter-node and intra-node congestion. It checks for intra-node
| > (local node process) congestion first, since that's faster.
| >
| > If there are other processes actively using this rgrp (i.e. if
| > they have reservations or are holding its glock) it's considered
| > congested in the intra-node sense.
| My comments in relation to the previous patch still stand. This is still
| the wrong answer to the problem. If we have lock contention here, then
| we should be looking to fix the locking and not trying to avoid the
| issue by choosing a different rgrp.
| 
| If we are deallocating blocks, we have no choice over which rgrp we
| lock, since we must lock the rgrp that contains the blocks we want to
| deallocate. The same problem exists there too, so lets fix this properly
| as per my previous comments,
| 
| Steve.

Some of the relevant comments from the previous review were:

| This is the wrong solution I think. We should not be choosing other
| rgrps for allocation due to local contention. That is just likely to
| land up creating more fragmentation by spreading things out across the
| disk. The issue is that we can't currently separate the locking into dlm
| (exclusive) and local (shared) but important data structures spinlock
| protected, so that we land up with lock contention on the rgrp and
| everything serialized via the glock. At the moment the glock state
| machine only supports everything (dlm & local) shared or everything
| exclusive.
| 
| If we want to resolve this issue, then we need to deal with that issue
| first and allow multiple local users of the same rgrp to co-exist with
| minimal lock contention. That requires the glock state machine to
| support that particular lock mode first though,
| 
| Steve.

I'd like to hear more about how you envision this working (see below).

If you study the problem with glocktop, you will see that in the
intra-node congestion case (for example, ten processes on one node
all running iozone, dd, or some other heavy user of the block allocator),
most of the processes are stuck waiting for various rgrp glocks.
That's the problem.

My proposed solution was to reduce or eliminate that contention by
letting each process stick to a unique rgrp for block assignments.
This is pretty much what RHEL5 and RHEL6 did when they were using
the "try locks" for rgrps, and that's exactly why those older
releases performed significantly faster with those tests than
RHEL7 and upstream.
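
For reference, that try-lock style was roughly of this shape, expressed
in terms of the current call site (a from-memory sketch, not the actual
RHEL5/6 code; the holder field name is assumed):

        /* Try-lock style: if the rgrp glock is contended, skip this
         * rgrp rather than waiting for it. */
        error = gfs2_glock_nq_init(rs->rs_rbm.rgd->rd_gl, LM_ST_EXCLUSIVE,
                                   LM_FLAG_TRY, &rs->rs_rgd_gh);
        if (error == GLR_TRYFAILED)
                goto next_rgrp;         /* someone else has it; move on */
        if (unlikely(error))
                return error;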

It's true that the block deallocator suffers from the same issue,
and my proposed solution alleviates that too by spreading the blocks
across multiple rgrps. Blocks may be allocated and deallocated
simultaneously because the rgrp glocks involved are distinct.

This is also not too different from how we gained performance
boosts with the Orlov algorithm to assign unique rgrps to groups
of files, which also spread the distance of the writes.

This will not increase file fragmentation because that is all
mitigated by the multi-block allocation algorithms we have
in place.

The file system itself may incur more fragmentation, since files may
now span a greater distance with writes going to different rgrps, but
that is no different from RHEL5, RHEL6, and older releases, which
acquired their rgrp glocks with "try locks" and yielded basically
the same results.

I'm not at all concerned about "head bouncing" performance issues
because those are pretty much absorbed by almost all modern
storage arrays. That's what the performance guys, and my own
testing, tell me. Head-bounce problems are also mitigated by
many modern hard drives as well. And let's face it: very few
people are going to use GFS2 with hard drives that have head-
bounce performance problems; GFS2 is mostly used on storage
arrays.

That's not to say there isn't a better solution. We've talked in
the past about possibly allowing multiple processes to share the
same rgrp for block allocations while the rgrp is exclusively
locked (held) on that node. We can, for example, have unique
reservations, and as long as the reservations don't conflict,
we can safely assign blocks from them simultaneously. However,
we will still get into sticky situations and need some kind of
process-exclusive lock on the rgrp itself. For example, only one
process may assign fields like rg_free, rg_inodes, and so forth.
Maybe we can mitigate that with atomic variables and cpu boundaries.
For the deallocation path, we can, for example, pre-construct a
"multi-block reservation" structure for an existing file, then
deallocate from that with the knowledge that other processes will
avoid blocks touched by the reservations. But at some point, we
will need a process-exclusive lock (spin_lock, rwsem, mutex, etc.)
and that may prove to have a similar contention problem. This
may require a fair amount of design to achieve both correctness
and decent performance.
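
To make the shape of that concrete (purely hypothetical; rd_alloc_lock
is an invented field, not something in today's gfs2_rgrpd):

        /* One possible shape for a process-exclusive rgrp lock: bitmap
         * and summary updates happen together under a per-rgrp lock,
         * while the glock stays held EX by the node as a whole. */
        spin_lock(&rgd->rd_alloc_lock);
        /* ... set the bits for the newly allocated extent ... */
        rgd->rd_free -= nblocks;
        if (dinode)
                rgd->rd_dinodes++;
        spin_unlock(&rgd->rd_alloc_lock);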

I've also thought about assigning a unique glock to each bitmap
of the rgrp rather than to the entire rgrp. We could then assign
each writer a unique bitmap, but that's probably not good enough,
for two reasons: (1) writers to non-zero-index bitmaps still
need to write to the rgrp block itself to adjust its used, free,
and dinode counts; (2) this would only scale to the number of
bitmaps, which is typically 5 or so, so the ten-writer tests
would still contend in a lot of cases. Still, it would help
"large (e.g. 2GB) rgrp" performance where you typically have 13
or so bitmaps.

Since the glocks are kind of like a layer of caching for dlm
locks, the idea of possibly having more than one "EX" state
for a glock which both map to the same DLM state is intriguing.
After all, rgrp glocks (like all glocks) have unique "glops"
that we might be able to leverage for that purpose.
I'll have to give that some thought, although the "multi-block
reservations" concept still seems like a better fit.

Perhaps we should do this in two phases:

(1) Implement my current design as a short-term solution which
has been proven to increase performance by 20-50 percent for
some use cases and gets the block allocator performance back
to where it was in RHEL6. (And by the same token, helps the
deallocator for the same reasons.)

This is a stop-gap measure while we work on a longer-term
solution that's likely to involve some design work and therefore
take more time to perfect.

(2) Design and implement a longer-term solution that allows
for multiple processes to share an rgrp while its glock is held
in EXclusive mode.

As always, I'm open to more ideas on how to improve this.

Regards,

Bob Peterson




* [Cluster-devel] [GFS2 PATCH 2/2] GFS2: Split gfs2_rgrp_congested into inter-node and intra-node cases
  2018-01-19 14:43     ` Bob Peterson
@ 2018-01-24 17:04       ` Bob Peterson
  2018-01-24 20:46         ` Steven Whitehouse
  2018-01-25 11:47         ` Steven Whitehouse
  0 siblings, 2 replies; 9+ messages in thread
From: Bob Peterson @ 2018-01-24 17:04 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

----- Original Message -----
| | My comments in relation to the previous patch still stand. This is still
| | the wrong answer to the problem. If we have lock contention here, then
| | we should be looking to fix the locking and not trying to avoid the
| | issue by choosing a different rgrp.
| This is also not too different from how we gained performance
| boosts with the Orlov algorithm to assign unique rgrps to groups
| of files, which also spread the distance of the writes.
| 
| This will not increase file fragmentation because that is all
| mitigated by the multi-block allocation algorithms we have
| in place.

I've done some more analysis of the scale-out performance with and
without these two patches, with help from Barry in the performance group.

Attached to this email is an output file that compares performance results
from a stock test kernel to the same test kernel plus my previously posted
patches for intra-node congestion. You can see from this summary grep
that the scalability is vastly better with the patch:

[root@perf82 bin]# pr -m -T -w 160 stock829.log iozone16.log | grep 'Parent'
	Parent sees throughput for  1 initial writers	=  543094.03 kB/sec		Parent sees throughput for  1 initial writers	=  539514.33 kB/sec
	Parent sees throughput for  2 initial writers	=  621523.39 kB/sec		Parent sees throughput for  2 initial writers	=  625554.26 kB/sec
	Parent sees throughput for  4 initial writers	=  244177.34 kB/sec		Parent sees throughput for  4 initial writers	=  803140.31 kB/sec
	Parent sees throughput for  6 initial writers	=  196014.09 kB/sec		Parent sees throughput for  6 initial writers	=  765439.86 kB/sec
	Parent sees throughput for  8 initial writers	=  157997.14 kB/sec		Parent sees throughput for  8 initial writers	=  760862.85 kB/sec
	Parent sees throughput for 10 initial writers	=  155714.45 kB/sec		Parent sees throughput for 10 initial writers	=  829598.76 kB/sec
	Parent sees throughput for 12 initial writers	=  133166.40 kB/sec		Parent sees throughput for 12 initial writers	=  744604.51 kB/sec
	Parent sees throughput for 14 initial writers	=  146353.44 kB/sec		Parent sees throughput for 14 initial writers	=  776214.99 kB/sec
	Parent sees throughput for 16 initial writers	=   84699.82 kB/sec		Parent sees throughput for 16 initial writers	=  796142.16 kB/sec

What this means is that performance of 16 iozone threads was 97.5% parallel
with my patches compared to 24.4% on the stock kernel.

Looking at the file fragmentation data in the same file, you can see that
file fragmentation is also noticeably improved in most runs with the patch
in the 16-process case:

Fragmentation Summary								Fragmentation Summary
/saswork/iozonetest1: 8622 extents found					/saswork/iozonetest1: 8435 extents found
/saswork/iozonetest10: 8614 extents found					/saswork/iozonetest10: 8322 extents found
/saswork/iozonetest11: 8617 extents found					/saswork/iozonetest11: 8312 extents found
/saswork/iozonetest12: 8658 extents found					/saswork/iozonetest12: 8319 extents found
/saswork/iozonetest13: 8469 extents found					/saswork/iozonetest13: 8435 extents found
/saswork/iozonetest14: 8650 extents found					/saswork/iozonetest14: 8315 extents found
/saswork/iozonetest15: 8649 extents found					/saswork/iozonetest15: 8489 extents found
/saswork/iozonetest16: 8653 extents found					/saswork/iozonetest16: 8436 extents found
/saswork/iozonetest2: 8464 extents found					/saswork/iozonetest2: 8509 extents found
/saswork/iozonetest3: 8635 extents found					/saswork/iozonetest3: 8448 extents found
/saswork/iozonetest4: 8659 extents found					/saswork/iozonetest4: 9203 extents found
/saswork/iozonetest5: 8647 extents found					/saswork/iozonetest5: 8434 extents found
/saswork/iozonetest6: 8669 extents found					/saswork/iozonetest6: 8435 extents found
/saswork/iozonetest7: 8656 extents found					/saswork/iozonetest7: 8309 extents found
/saswork/iozonetest8: 8660 extents found					/saswork/iozonetest8: 8410 extents found
/saswork/iozonetest9: 8607 extents found					/saswork/iozonetest9: 8435 extents found

See the attachment for more details.

And as I said in my previous email, overall file _system_ fragmentation
is, IMHO, not a big concern in the light of these performance gains.

I'll remove the reference to HZ from the first patch: good call.

I still see no reason to throw away the second patch, at least while we work
on a longer-term solution.

Regards,

Bob Peterson
Red Hat File Systems
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: bob.iozone16.pr.txt
URL: <http://listman.redhat.com/archives/cluster-devel/attachments/20180124/fc2b6568/attachment.txt>


* [Cluster-devel] [GFS2 PATCH 2/2] GFS2: Split gfs2_rgrp_congested into inter-node and intra-node cases
  2018-01-24 17:04       ` Bob Peterson
@ 2018-01-24 20:46         ` Steven Whitehouse
  2018-01-25 11:47         ` Steven Whitehouse
  1 sibling, 0 replies; 9+ messages in thread
From: Steven Whitehouse @ 2018-01-24 20:46 UTC (permalink / raw)
  To: cluster-devel.redhat.com



On 24/01/18 17:04, Bob Peterson wrote:
> Hi,
>
> ----- Original Message -----
> | | My comments in relation to the previous patch still stand. This is still
> | | the wrong answer to the problem. If we have lock contention here, then
> | | we should be looking to fix the locking and not trying to avoid the
> | | issue by choosing a different rgrp.
> | This is also not too different from how we gained performance
> | boosts with the Orlov algorithm to assign unique rgrps to groups
> | of files, which also spread the distance of the writes.
> |
> | This will not increase file fragmentation because that is all
> | mitigated by the multi-block allocation algorithms we have
> | in place.
>
> I've done some more analysis of the scale-out performance with and
> without these two patches, with help from Barry in the performance group.
>
> Attached to this email is an output file that compares performance results
> from a stock test kernel to the same test kernel plus my previously posted
> patches for intra-node congestion. You can see from this summary grep
> that the scalability is vastly better with the patch:
>
> [root@perf82 bin]# pr -m -T -w 160 stock829.log iozone16.log | grep 'Parent'
> 	Parent sees throughput for  1 initial writers	=  543094.03 kB/sec		Parent sees throughput for  1 initial writers	=  539514.33 kB/sec
> 	Parent sees throughput for  2 initial writers	=  621523.39 kB/sec		Parent sees throughput for  2 initial writers	=  625554.26 kB/sec
> 	Parent sees throughput for  4 initial writers	=  244177.34 kB/sec		Parent sees throughput for  4 initial writers	=  803140.31 kB/sec
> 	Parent sees throughput for  6 initial writers	=  196014.09 kB/sec		Parent sees throughput for  6 initial writers	=  765439.86 kB/sec
> 	Parent sees throughput for  8 initial writers	=  157997.14 kB/sec		Parent sees throughput for  8 initial writers	=  760862.85 kB/sec
> 	Parent sees throughput for 10 initial writers	=  155714.45 kB/sec		Parent sees throughput for 10 initial writers	=  829598.76 kB/sec
> 	Parent sees throughput for 12 initial writers	=  133166.40 kB/sec		Parent sees throughput for 12 initial writers	=  744604.51 kB/sec
> 	Parent sees throughput for 14 initial writers	=  146353.44 kB/sec		Parent sees throughput for 14 initial writers	=  776214.99 kB/sec
> 	Parent sees throughput for 16 initial writers	=   84699.82 kB/sec		Parent sees throughput for 16 initial writers	=  796142.16 kB/sec
>
> What this means is that performance of 16 iozone threads was 97.5% parallel
> with my patches compared to 24.4% on the stock kernel.
>
> Looking at the file fragmentation data in the same file, you can see that
> file fragmentation is also noticeably improved in most runs with the patch
> in the 16-process case:
>
> Fragmentation Summary								Fragmentation Summary
> /saswork/iozonetest1: 8622 extents found					/saswork/iozonetest1: 8435 extents found
> /saswork/iozonetest10: 8614 extents found					/saswork/iozonetest10: 8322 extents found
> /saswork/iozonetest11: 8617 extents found					/saswork/iozonetest11: 8312 extents found
> /saswork/iozonetest12: 8658 extents found					/saswork/iozonetest12: 8319 extents found
> /saswork/iozonetest13: 8469 extents found					/saswork/iozonetest13: 8435 extents found
> /saswork/iozonetest14: 8650 extents found					/saswork/iozonetest14: 8315 extents found
> /saswork/iozonetest15: 8649 extents found					/saswork/iozonetest15: 8489 extents found
> /saswork/iozonetest16: 8653 extents found					/saswork/iozonetest16: 8436 extents found
> /saswork/iozonetest2: 8464 extents found					/saswork/iozonetest2: 8509 extents found
> /saswork/iozonetest3: 8635 extents found					/saswork/iozonetest3: 8448 extents found
> /saswork/iozonetest4: 8659 extents found					/saswork/iozonetest4: 9203 extents found
> /saswork/iozonetest5: 8647 extents found					/saswork/iozonetest5: 8434 extents found
> /saswork/iozonetest6: 8669 extents found					/saswork/iozonetest6: 8435 extents found
> /saswork/iozonetest7: 8656 extents found					/saswork/iozonetest7: 8309 extents found
> /saswork/iozonetest8: 8660 extents found					/saswork/iozonetest8: 8410 extents found
> /saswork/iozonetest9: 8607 extents found					/saswork/iozonetest9: 8435 extents found
>
> See the attachment for more details.
>
> And as I said in my previous email, overall file _system_ fragmentation
> is, IMHO, not a big concern in the light of these performance gains.
>
> I'll remove the reference to HZ from the first patch: good call.
>
> I still see no reason to throw away the second patch, at least while we work
> on a longer-term solution.
>
> Regards,
>
> Bob Peterson
> Red Hat File Systems

I've no doubt that this is an issue that we need to address. The 
question is how to do it, rather than if. However, iozone is not a great 
workload for testing this kind of thing, and if the test was run on a 
brand new filesystem, then that will also have a significant effect on 
the fragmentation.

Another issue is whether the (number of nodes x number of I/O threads 
per node) <= (number of rgrps) since in the case of a new filesystem, 
that will make a significant difference in terms of whether contention 
is seen or not. The proposed new algorithm seems to assume that this 
will always be the case. If it is not the case, then we will see 
significant contention on the rgrps. If we resolve this by doing more of 
the allocation/deallocation in parallel on a single rgrp, then we only 
require (number of rgrps) > (number of nodes) which is a less onerous 
requirement.

I'm not proposing that we should throw out the patch either, but we 
should try to do things in a different order. Since the problem is 
really a single-node issue, in that we want to be able to better share 
rgrps between threads on a single node, it should be addressed as 
such. Adding a new flag to allow us to share an exclusive cluster lock 
among local processes should not be tricky to do. I've attached an 
(untested) patch that should allow that. The new semantics are that 
exclusive locks which have the GL_SHARED_EX flag on them will be shared 
locally. They will not be shared with exclusive locks that do not have 
that flag, so that gives us a way to introduce the change gradually: 
each glock can be audited for safety individually and the flag added 
when we know it is safe.
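
Roughly (an illustration of the idea only, not the attached my.diff; the
exact placement in the grant logic is assumed), the holder compatibility
check might gain a clause like:

        /* Two EX holders are compatible if both opted in to local
         * sharing; the lock stays EX toward DLM. gh_head is the first
         * queued holder. */
        if (gh->gh_state == LM_ST_EXCLUSIVE &&
            gh_head->gh_state == LM_ST_EXCLUSIVE &&
            (gh->gh_flags & GL_SHARED_EX) &&
            (gh_head->gh_flags & GL_SHARED_EX))
                return 1;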

It will likely be easier to start with the deallocation paths, since the 
critical paths are shorter and easier to identify. We can probably just 
add a suitable lock to cover the changes to the rgrp as a wrapper, and 
we might also be able to shrink the coverage of that lock over time, 
too. It needs to cover the bitmap changes and the summary info changes 
so that the two are still atomic with respect to other local threads 
working on the same rgrp. If we are careful we might even be able to 
make the lock a spinlock eventually, but it will probably need to be a 
per rgrp mutex or rwsem to start with.

There will need to be some careful audit in order to ensure that the new 
per rgrp lock covers the correct parts of the code, and that we minimise 
those parts, in order to allow maximum parallelism. However, if we can 
achieve that, then we should get much greater gains in performance 
overall. Not least because this applies equally to allocation and 
deallocation, and your proposed new algorithm only addresses the 
allocation side of things.

Once we have resolved the intra-node issues, then we can go back and 
take a second look at whether we can get even more gain from inter-node 
changes. The points I made in the previous emails still stand though, in 
that we should not be trying to solve an intra-node issue with an 
inter-node solution.

The iozone test is not a very good test case for this - the worst case 
would be where we had large numbers of threads on each node, all writing 
out very small files. In that case we definitely want them grouped 
together for future reads and not spread over the disk in many different 
rgrps. Imagine someone running a find over such a filesystem, for 
example. We don't want that to degenerate into random reads, but to 
retain at least some locality. There will never be an algorithm that 
will be able to cope with every workload, but we should be able to get 
the same result (or better!) in terms of performance without having to 
change the allocation algorithm so drastically,

Steve.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: my.diff
Type: text/x-patch
Size: 1329 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/cluster-devel/attachments/20180124/39bee092/attachment.bin>


* [Cluster-devel] [GFS2 PATCH 2/2] GFS2: Split gfs2_rgrp_congested into inter-node and intra-node cases
  2018-01-24 17:04       ` Bob Peterson
  2018-01-24 20:46         ` Steven Whitehouse
@ 2018-01-25 11:47         ` Steven Whitehouse
  1 sibling, 0 replies; 9+ messages in thread
From: Steven Whitehouse @ 2018-01-25 11:47 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

Some further thoughts...

Whenever we find a problem related to a lock, it is a good plan to 
understand where the problem actually lies. In other words whether the 
locking itself is slow, or whether it is some action that is being 
performed under the lock that is the issue. We have the ability to 
easily create histograms of DLM lock times, and almost as easily create 
histograms of the glock times (gfs2_glock_queue -> gfs2_promote). We can 
easily filter on glock type (rgrp) and the lock transitions that we 
care about (any -> EX) too. So it would be interesting to look at this 
in order to get more of an insight into what is really going on.

Taking the raw histogram and multiplying the count by the centre of each 
bucket gives us total time taken for each different lock latency. Then 
it is easy to see which latencies are the ones causing the most delay.
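
As a toy example of that weighting (the bucket centres and counts below
are made up):

        /* Multiply each bucket's count by its centre to see where the
         * wait time actually goes. */
        struct bucket { u64 centre_ns, count; };
        struct bucket hist[] = {
                {     500, 90000 },     /*  sub-1us waits:  ~45 ms total */
                {   50000,  2000 },     /* 10-100us waits: ~100 ms total */
                { 5000000,    40 },     /*  1-10ms waits:  ~200 ms total */
        };
        u64 total_ns = 0;
        int i;
        for (i = 0; i < ARRAY_SIZE(hist); i++)
                total_ns += hist[i].centre_ns * hist[i].count;

Here the 40 multi-millisecond waits contribute more total delay than the
90,000 sub-microsecond ones, so they are the ones worth chasing.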

It would also be interesting to know how long it takes to allocate and 
deallocate a block. What are the operations that take the most time? 
Unfortunately our block allocation tracepoint doesn't give us that info, 
but it is probably not that tricky to alter it, so that it does.

That would give us a much more detailed picture of what is going on I think,

Steve.



