* [Cluster-devel] [GFS2 PATCH] GFS2: Rework rgrp glock congestion functions for intra-node (v2)
       [not found] <630774683.26694578.1509626258810.JavaMail.zimbra@redhat.com>
@ 2017-11-02 12:55 ` Bob Peterson
  2017-11-02 16:16   ` Steven Whitehouse
  0 siblings, 1 reply; 5+ messages in thread
From: Bob Peterson @ 2017-11-02 12:55 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

On 04 August I posted a patch for this, along with a graph
that shows the performance improvements. Since I got no responses,
and it's non-trivial, I did not push it to the GFS2 development
git tree.

This version is identical, except for a new improved version of
the (tiny) function gfs2_rgrp_used_recently(). The function does
the same thing, but is now less archaic.

In a nutshell, the problem is that today's GFS2 does not treat
intra-node block allocations fairly among processes on the same node.

For example, if you start multiple processes running an iozone
test on GFS2, one of them will be highly favored at the expense
of the others. Here is some output from a "stock" kernel without
the patch:

        Child[0] xfer count = 17802752.00 kB, Throughput =   35294.80 kB/sec
        Child[1] xfer count = 17600768.00 kB, Throughput =   34971.67 kB/sec
        Child[2] xfer count = 18067456.00 kB, Throughput =   35819.86 kB/sec
        Child[3] xfer count = 67108864.00 kB, Throughput =  133046.50 kB/sec

In this case, the last child was given priority for its block
allocations, at the expense of all the others.

The problem was that the GFS2 block allocator algorithm detected
only inter-node contention (contention from other nodes), but not
intra-node contention.

This patch reworks the GFS2 block allocator so that it uses a
much more intelligent algorithm to detect intra-node block
allocator contention as well as inter-node. The result is a
fairer distribution of the block allocator resources. Running the
exact same test with this algorithm, I got:

        Child[0] xfer count = 67108864.00 kB, Throughput =   71773.73 kB/sec
        Child[1] xfer count = 67044352.00 kB, Throughput =   71699.45 kB/sec
        Child[2] xfer count = 66049280.00 kB, Throughput =   70635.47 kB/sec
        Child[3] xfer count = 63489024.00 kB, Throughput =   67816.11 kB/sec

Detecting possible contention and distributing the block allocations more
fairly also helps overall performance. In the example above:

Before the patch:
        Children see throughput for  4 initial writers  =  239132.83 kB/sec
After the patch:
        Children see throughput for  4 initial writers  =  281924.77 kB/sec
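
That works out to roughly an 18% improvement in aggregate write
throughput: (281924.77 - 239132.83) / 239132.83 is about 0.18.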

The algorithm scales well; I tested it with up to 24 processes on a
24-core machine.

Regards,

Bob Peterson
Red Hat File Systems
---
commit 88fd88d54658d0db57d942f53732840740830f80
Author: Bob Peterson <rpeterso@redhat.com>
Date:   Wed Apr 19 08:11:03 2017 -0400

GFS2: Rework rgrp glock congestion functions for intra-node

There are two kinds of congestion: inter-node and intra-node.

Inter-node congestion is where multiple nodes want to allocate
blocks inside the same rgrp, which means they need to trade the
rgrp glock back and forth, which is a slow process. To mitigate
this, we use glock statistics to predict whether the glock is
historically fast to acquire. This hasn't really changed.

Intra-node congestion is where you have multiple processes on
the same node, all trying to allocate blocks in the same rgrp.
There's nothing wrong with doing so, but each process needs to
wait for the other to release the rgrp glock before it may
proceed, and thus it's slower than it can be.

If each of those processes operated on separate rgrps, they
wouldn't need to do all this glock waiting, and thus would be
faster.

Both kinds of congestion can hurt performance, but it's faster
to check intra-node, so we do that first. After all, why bother
to check if we can get the glock quickly from DLM if other
processes have also used that same reasoning? They'd just
generate intranode congestion.

So this patch reworks the "congestion" algorithms for resource
group glocks to take into account intra-node demands as well as
inter-node demands.

We can predict whether a rgrp glock is the victim of intranode
congestion based on a set of rules. The rules enforced by this patch:

1. If the current process has a multi-block reservation in the
   rgrp, it needs to use it regardless of the congestion. The
   congestion algorithm should have prevented the reservation
   in the first place.
2. If some process has the rgrp gl_lockref spin_lock locked,
   they are preparing to use it for a reservation. So we take
   this as a clear sign of impending contention.
3. If the rgrp currently has a glock holder, we know we need to
   wait before we can lock it, regardless of whether the holder
   represents an actual holder or a waiter.
4. If the rgrp currently has a multi-block reservation, and we
   already know it's not ours, then intranode contention is
   likely.
5. If none of these conditions are true, we check to see if we
   can acquire (lock / enqueue) the glock relatively quickly.
   If this is lock_dlm protocol, we check if the rgrp is one
   of our preferred rgrps. If so, we treat it as fast. As it
   was before this patch, for lock_nolock protocol, all rgrps
   are considered "preferred."
6. If the rgrp glock is unlocked, it's generally not fast to
   acquire. At the least, we need to push it through the glock
   state machine and read it from the media. Worst case, we also
   need to get dlm's permission to do so. This is ignored for
   preferred rgrps, since we want to set up easy access to them
   anyway.
7. If the DEMOTE or DEMOTE_IN_PROGRESS bits are set, we know we
   have an even longer wait from the glock state machine, as
   any enqueues will need to wait for the glock to be demoted,
   then promoted again.
8. If, after all this, we deem the rgrp to be good enough for
   this attempt (loop), and the locking protocol is lock_dlm,
   we do the normal checks we did before this patch. Namely, we
   check the internode locking statistics kept by lock_dlm to
   see if it's a hot spot for the cluster.

Note that with each "loop" through the list of rgrps, we become
more lax with our requirements. So the first time through, we
can be more picky about using an rgrp no one else is using.
But on subsequent passes through the rgrps, we need to accept
rgrps that are less than ideal.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>

diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 95b2a57ded33..dc478ae53085 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -1806,7 +1806,22 @@ static void try_rgrp_unlink(struct gfs2_rgrpd *rgd, u64 *last_unlinked, u64 skip
 }
 
 /**
- * gfs2_rgrp_congested - Use stats to figure out whether an rgrp is congested
+ * gfs2_rgrp_used_recently
+ * @rs: The block reservation with the rgrp to test
+ * @msecs: The time limit in milliseconds
+ *
+ * Returns: True if the rgrp glock has been used within the time limit
+ */
+static inline bool gfs2_rgrp_used_recently(const struct gfs2_rgrpd *rgd,
+					   u64 msecs)
+{
+	return (ktime_before(ktime_get_real(),
+			     ktime_add(rgd->rd_gl->gl_dstamp,
+				       ms_to_ktime(msecs))));
+}
+
+/**
+ * gfs2_rgrp_congested_dlm - Use stats to figure out if an rgrp is congested
  * @rgd: The rgrp in question
  * @loops: An indication of how picky we can be (0=very, 1=less so)
  *
@@ -1832,7 +1847,7 @@ static void try_rgrp_unlink(struct gfs2_rgrpd *rgd, u64 *last_unlinked, u64 skip
  * Returns: A boolean verdict on the congestion status
  */
 
-static bool gfs2_rgrp_congested(const struct gfs2_rgrpd *rgd, int loops)
+static bool gfs2_rgrp_congested_dlm(const struct gfs2_rgrpd *rgd, int loops)
 {
 	const struct gfs2_glock *gl = rgd->rd_gl;
 	const struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
@@ -1844,6 +1859,11 @@ static bool gfs2_rgrp_congested(const struct gfs2_rgrpd *rgd, int loops)
 	u64 var;
 	int cpu, nonzero = 0;
 
+	/* If it hasn't been used recently we can't judge the statistics, so
+	   assume it's not congested. */
+	if (!gfs2_rgrp_used_recently(rgd, HZ))
+		return false;
+
 	preempt_disable();
 	for_each_present_cpu(cpu) {
 		st = &per_cpu_ptr(sdp->sd_lkstats, cpu)->lkstats[LM_TYPE_RGRP];
@@ -1879,21 +1899,110 @@ static bool gfs2_rgrp_congested(const struct gfs2_rgrpd *rgd, int loops)
 }
 
 /**
- * gfs2_rgrp_used_recently
- * @rs: The block reservation with the rgrp to test
- * @msecs: The time limit in milliseconds
+ * fast_to_acquire - determine if a resource group will be fast to acquire
  *
- * Returns: True if the rgrp glock has been used within the time limit
+ * If this is one of our preferred rgrps, it should be quicker to acquire,
+ * because we tried to set ourselves up as dlm lock master.
+ */
+static inline bool fast_to_acquire(const struct gfs2_rgrpd *rgd)
+{
+	struct gfs2_glock *gl = rgd->rd_gl;
+
+	if (rgd->rd_flags & GFS2_RDF_PREFERRED)
+		return true;
+	if (gl->gl_state == LM_ST_UNLOCKED)
+		return false;
+	if (!list_empty(&gl->gl_holders))
+		return false;
+	if (test_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags) ||
+	    !test_bit(GLF_DEMOTE, &gl->gl_flags))
+		return false;
+	return true;
+}
+
+/**
+ * other_rgrp_users - figure out if this rgrp has other users
+ * @rgd: The resource group
+ * @locked: true if we've already held the glock
+ *
+ * We're trying to figure out if the given rgrp has anybody competing for
+ * its free space. If other processes have enqueued its glock, there's a
+ * good chance of competition.
+ *
+ * If there are multi-block reservations for this rgrp, there's a good
+ * chance another process will lock the rgrp for block allocations soon.
+ *
+ * If we've already held the glock, we no longer care if there are holders
+ * because that's now a given (rgrp glocks are never shared).
+ */
+static inline bool other_rgrp_users(const struct gfs2_rgrpd *rgd, bool locked)
+{
+	struct gfs2_glock *gl = rgd->rd_gl;
+
+	if (spin_is_locked(&gl->gl_lockref.lock)) /* someone preparing to use it. */
+		return true;
+	if (!locked && !list_empty(&gl->gl_holders))
+		return true;
+	if (!RB_EMPTY_ROOT(&rgd->rd_rstree))
+		return true;
+	return false;
+}
+
+/**
+ * gfs2_rgrp_congested - decide whether a rgrp glock is congested
+ * @rgd: The rgrp in question
+ * @loops: An indication of how picky we can be (0=very, 1=less so)
+ * @locked: Indicates if checks are before or after we've enqueued the glock.
+ *
+ * There are two kinds of congestion: inter-node and intra-node.
+ *
+ * Inter-node congestion is where multiple nodes all want to allocate blocks
+ * inside the same rgrp, which means they need to trade the rgrp glock back
+ * and forth, which is a slow process. To mitigate this, we use glock
+ * statistics to predict whether the glock is historically fast to acquire.
+ *
+ * Intra-node congestion is where you have multiple processes on the same
+ * node, all trying to allocate blocks in the same rgrp. There's nothing wrong
+ * with doing so, but each process needs to wait for the other to release the
+ * rgrp glock before it may proceed. We can predict whether a rgrp glock is
+ * congested by how many block reservations are currently attached.
+ *
+ * Both kinds of congestion can hurt performance, but it's faster to check
+ * intra-node, so we do that first. After all, why bother to check if we can
+ * get the glock quickly from DLM if other processes have also used that
+ * same reasoning.
+ *
+ * We know the number of loops we've been around, so we know how desperate we
+ * are to find something. On first loop, call it congested if anyone else has
+ * a block reservation. On second loop, call it congested if it's not fast to
+ * acquire.
  */
-static bool gfs2_rgrp_used_recently(const struct gfs2_blkreserv *rs,
-				    u64 msecs)
+static bool gfs2_rgrp_congested(const struct gfs2_blkreserv *rs, int loops,
+				bool locked)
 {
-	u64 tdiff;
+	const struct gfs2_rgrpd *rgd = rs->rs_rbm.rgd;
+
+	/* We already have a reservation, we need to use it regardless */
+	if (gfs2_rs_active(rs))
+		return false;
+
+	/* If we've rejected all the rgrps a few times, can no longer worry
+	   about whether the rgrp is congested. Fill in blocks where we can. */
+	if (loops >= 2)
+		return false;
+
+	/* Check for intra-node congestion */
+	if (loops == 0 && other_rgrp_users(rgd, locked))
+		return true;
+
+	if (loops <= 1 && !fast_to_acquire(rgd))
+		return true;
 
-	tdiff = ktime_to_ns(ktime_sub(ktime_get_real(),
-                            rs->rs_rbm.rgd->rd_gl->gl_dstamp));
+	/* Check for inter-node congestion */
+	if (rgd->rd_sbd->sd_lockstruct.ls_ops->lm_lock) /* lock_dlm */
+		return gfs2_rgrp_congested_dlm(rgd, loops);
 
-	return tdiff > (msecs * 1000 * 1000);
+	return false;
 }
 
 static u32 gfs2_orlov_skip(const struct gfs2_inode *ip)
@@ -1920,25 +2029,6 @@ static bool gfs2_select_rgrp(struct gfs2_rgrpd **pos, const struct gfs2_rgrpd *b
 }
 
 /**
- * fast_to_acquire - determine if a resource group will be fast to acquire
- *
- * If this is one of our preferred rgrps, it should be quicker to acquire,
- * because we tried to set ourselves up as dlm lock master.
- */
-static inline int fast_to_acquire(struct gfs2_rgrpd *rgd)
-{
-	struct gfs2_glock *gl = rgd->rd_gl;
-
-	if (gl->gl_state != LM_ST_UNLOCKED && list_empty(&gl->gl_holders) &&
-	    !test_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags) &&
-	    !test_bit(GLF_DEMOTE, &gl->gl_flags))
-		return 1;
-	if (rgd->rd_flags & GFS2_RDF_PREFERRED)
-		return 1;
-	return 0;
-}
-
-/**
  * gfs2_inplace_reserve - Reserve space in the filesystem
  * @ip: the inode to reserve space for
  * @ap: the allocation parameters
@@ -1988,22 +2078,14 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, struct gfs2_alloc_parms *ap)
 			rg_locked = 0;
 			if (skip && skip--)
 				goto next_rgrp;
-			if (!gfs2_rs_active(rs)) {
-				if (loops == 0 &&
-				    !fast_to_acquire(rs->rs_rbm.rgd))
+			if (gfs2_rgrp_congested(rs, loops, false))
 					goto next_rgrp;
-				if ((loops < 2) &&
-				    gfs2_rgrp_used_recently(rs, 1000) &&
-				    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
-					goto next_rgrp;
-			}
 			error = gfs2_glock_nq_init(rs->rs_rbm.rgd->rd_gl,
 						   LM_ST_EXCLUSIVE, flags,
 						   &rs->rs_rgd_gh);
 			if (unlikely(error))
 				return error;
-			if (!gfs2_rs_active(rs) && (loops < 2) &&
-			    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
+			if (gfs2_rgrp_congested(rs, loops, true))
 				goto skip_rgrp;
 			if (sdp->sd_args.ar_rgrplvb) {
 				error = update_rgrp_lvb(rs->rs_rbm.rgd);




* [Cluster-devel] [GFS2 PATCH] GFS2: Rework rgrp glock congestion functions for intra-node (v2)
  2017-11-02 12:55 ` [Cluster-devel] [GFS2 PATCH] GFS2: Rework rgrp glock congestion functions for intra-node (v2) Bob Peterson
@ 2017-11-02 16:16   ` Steven Whitehouse
  2017-11-02 16:28     ` Bob Peterson
  2017-11-02 17:06     ` Bob Peterson
  0 siblings, 2 replies; 5+ messages in thread
From: Steven Whitehouse @ 2017-11-02 16:16 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,


On 02/11/17 12:55, Bob Peterson wrote:
> Hi,
>
> On 04 August I posted a patch for this, along with a graph
> that shows the performance improvements. Since I got no responses,
> and it's non-trivial, I did not push it to the GFS2 development
> git tree.
>
> This version is identical, except for a new improved version of
> the (tiny) function gfs2_rgrp_used_recently(). The function does
> the same thing, but is now less archaic.
>
> In a nutshell, the problem is that today's GFS2 does not treat
> intra-node block allocations fairly among processes on the same node.
>
> For example, if you start multiple processes running an iozone
> test on GFS2, one of them will be highly favored at the expense
> of the others. For example, here is some output from a "stock"
> kernel without the patch:
>
>          Child[0] xfer count = 17802752.00 kB, Throughput =   35294.80 kB/sec
>          Child[1] xfer count = 17600768.00 kB, Throughput =   34971.67 kB/sec
>          Child[2] xfer count = 18067456.00 kB, Throughput =   35819.86 kB/sec
>          Child[3] xfer count = 67108864.00 kB, Throughput =  133046.50 kB/sec
>
> In this case, the last child was given priority for its block
> allocations, at the expense of all the others.
>
> The problem was that the GFS2 block allocator algorithm detected
> only inter-node contention (contention from other nodes), but not
> intra-node contention.
>
> This patch reworks the GFS2 block allocator so that it uses a
> much more intelligent algorithm to detect intra-node block
> allocator contention as well as inter-node. The result is more
> fair distribution of the block allocator resources. The same
> exact test using this algorithm, I got:
>
>          Child[0] xfer count = 67108864.00 kB, Throughput =   71773.73 kB/sec
>          Child[1] xfer count = 67044352.00 kB, Throughput =   71699.45 kB/sec
>          Child[2] xfer count = 66049280.00 kB, Throughput =   70635.47 kB/sec
>          Child[3] xfer count = 63489024.00 kB, Throughput =   67816.11 kB/sec
>
> Detecting possible contention and distributing the block allocations more
> fairly also helps overall performance. In the example above:
>
> Before the patch:
>          Children see throughput for  4 initial writers  =  239132.83 kB/sec
> After the patch:
>          Children see throughput for  4 initial writers  =  281924.77 kB/sec
>
> The algorithm scales well, and I tested it up to 24 processes, with a
> 24-core machine.
So about an 18% improvement. That is worth having, but what about the read 
side? What does the fragmentation look like afterwards, as there is no 
point in speeding up writes at the expense of reads?

> Regards,
>
> Bob Peterson
> Red Hat File Systems
> ---
> commit 88fd88d54658d0db57d942f53732840740830f80
> Author: Bob Peterson <rpeterso@redhat.com>
> Date:   Wed Apr 19 08:11:03 2017 -0400
>
> GFS2: Rework rgrp glock congestion functions for intra-node
>
> There are two kinds of congestion: inter-node and intra-node.
>
> Inter-node congestion is where multiple nodes want to allocate
> blocks inside the same rgrp, which means they need to trade the
> rgrp glock back and forth, which is a slow process. To mitigate
> this, we use glock statistics to predict whether the glock is
> historically fast to acquire. This hasn't really changed.
>
> Intra-node congestion is where you have multiple processes on
> the same node, all trying to allocate blocks in the same rgrp.
> There's nothing wrong with doing so, but each process needs to
> wait for the other to release the rgrp glock before it may
> proceed, and thus it's slower than it can be.
>
> If each of those processes operated on separate rgrps, they
> wouldn't need to do all this glock waiting, and thus would be
> faster.
>
> Both kinds of congestion can hurt performance, but it's faster
> to check intra-node, so we do that first. After all, why bother
> to check if we can get the glock quickly from DLM if other
> processes have also used that same reasoning: They'd just
> generate intranode congestion.
>
> So this patch reworks the "congestion" algorithms for resource
> group glocks to take into account intra-node demands as well as
> inter-node demands.
>
> We can predict whether a rgrp glock is the victim of intranode
> congestion based on set of rules. Rules Enforced by this patch:
>
> 1. If the current process has a multi-block reservation in the
>     rgrp, it needs to use it regardless of the congestion. The
>     congestion algorithm should have prevented the reservation
>     in the first place.
> 2. If some process has the rgrp gl_lockref spin_lock locked,
>     they are preparing to use it for a reservation. So we take
>     this as a clear sign of impending contention.
There can be all kinds of reasons why this lock has been taken. It 
should have very low contention on it, so I don't think this is 
useful overall.

> 3. If the rgrp currently has a glock holder, we know we need to
>     wait before we can lock it, regardless of whether the holder
>     represents an actual holder or a waiter.
That is probably wrong. We ought to be able to do multiple allocations 
in an rgrp if the node is holding the rgrp lock, without having to 
constantly queue and requeue locks. In other words, if we hold the rgrp 
in excl mode, then a useful extension would be to allow multiple users 
to perform allocations in parallel. So the exclusive rgrp glock becomes 
locally sharable - that is the real issue here I think.

> 4. If the rgrp currently has a multi-block reservation, and we
>     already know it's not ours, then intranode contention is
>     likely.
What do you mean by "not ours"? In general each node only sees the 
reservations which are taken out on the node itself.
> 5. If none of these conditions are true, we check to see if we
>     can acquire (lock / enqueue) the glock relatively quickly.
>     If this is lock_dlm protocol, we check if the rgrp is one
>     of our preferred rgrps. If so, we treat it as fast. As it
>     was before this patch, for lock_nolock protocol, all rgrps
>     are considered "preferred."
The preferred rgrp is the one which is closest to the previous 
allocation for the inode, so that we reduce fragmentation.

> 6. If the rgrp glock is unlocked, it's generally not fast to
>     acquire. At the least, we need to push it through the glock
>     state machine and read it from the media. Worst case, we also
>     need to get dlm's permission to do so. This is ignored for
>     preferred rgrps, since we want to set up easy access to them
>     anyway.
It should be fast to acquire unless another node is using it... if not 
then we need to look to see what is holding things up. Chances are it is 
due to the wait for a journal commit on a remote node, and that needs to 
be looked at first I think.

> 7. If the DEMOTE or DEMOTE_IN_PROGRESS bits are set, we know we
>     have an even longer wait from the glock state machine, as
>     any enqueues will need to wait for the glock to be demoted,
>     then promoted again.
> 8. If, after all this, we deem the rgrp to be good enough for
>     this attempt (loop), and the locking protocol is lock_dlm,
>     we do the normal checks we did before this patch. Namely, we
>     check the internode locking statistics kept by lock_dlm to
>     see if it's a hot spot for the cluster.
>
> Note that with each "loop" through the list of rgrps, we become
> more lax with our requirements. So the first time through, we
> can be more picky about using an rgrp no one else is using.
> But on subsequent passes through the rgrps, we need to accept
> rgrps that are less than ideal.
>
> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
>
> diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
> index 95b2a57ded33..dc478ae53085 100644
> --- a/fs/gfs2/rgrp.c
> +++ b/fs/gfs2/rgrp.c
> @@ -1806,7 +1806,22 @@ static void try_rgrp_unlink(struct gfs2_rgrpd *rgd, u64 *last_unlinked, u64 skip
>   }
>   
>   /**
> - * gfs2_rgrp_congested - Use stats to figure out whether an rgrp is congested
> + * gfs2_rgrp_used_recently
> + * @rs: The block reservation with the rgrp to test
> + * @msecs: The time limit in milliseconds
> + *
> + * Returns: True if the rgrp glock has been used within the time limit
> + */
> +static inline bool gfs2_rgrp_used_recently(const struct gfs2_rgrpd *rgd,
> +					   u64 msecs)
> +{
> +	return (ktime_before(ktime_get_real(),
> +			     ktime_add(rgd->rd_gl->gl_dstamp,
> +				       ms_to_ktime(msecs))));
> +}
> +
> +/**
> + * gfs2_rgrp_congested_dlm - Use stats to figure out if an rgrp is congested
>    * @rgd: The rgrp in question
>    * @loops: An indication of how picky we can be (0=very, 1=less so)
>    *
> @@ -1832,7 +1847,7 @@ static void try_rgrp_unlink(struct gfs2_rgrpd *rgd, u64 *last_unlinked, u64 skip
>    * Returns: A boolean verdict on the congestion status
>    */
>   
> -static bool gfs2_rgrp_congested(const struct gfs2_rgrpd *rgd, int loops)
> +static bool gfs2_rgrp_congested_dlm(const struct gfs2_rgrpd *rgd, int loops)
>   {
>   	const struct gfs2_glock *gl = rgd->rd_gl;
>   	const struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
> @@ -1844,6 +1859,11 @@ static bool gfs2_rgrp_congested(const struct gfs2_rgrpd *rgd, int loops)
>   	u64 var;
>   	int cpu, nonzero = 0;
>   
> +	/* If it hasn't been used recently we can't judge the statistics, so
> +	   assume it's not congested. */
> +	if (!gfs2_rgrp_used_recently(rgd, HZ))
> +		return false;
> +
>   	preempt_disable();
>   	for_each_present_cpu(cpu) {
>   		st = &per_cpu_ptr(sdp->sd_lkstats, cpu)->lkstats[LM_TYPE_RGRP];
> @@ -1879,21 +1899,110 @@ static bool gfs2_rgrp_congested(const struct gfs2_rgrpd *rgd, int loops)
>   }
>   
>   /**
> - * gfs2_rgrp_used_recently
> - * @rs: The block reservation with the rgrp to test
> - * @msecs: The time limit in milliseconds
> + * fast_to_acquire - determine if a resource group will be fast to acquire
>    *
> - * Returns: True if the rgrp glock has been used within the time limit
> + * If this is one of our preferred rgrps, it should be quicker to acquire,
> + * because we tried to set ourselves up as dlm lock master.
> + */
> +static inline bool fast_to_acquire(const struct gfs2_rgrpd *rgd)
> +{
> +	struct gfs2_glock *gl = rgd->rd_gl;
> +
> +	if (rgd->rd_flags & GFS2_RDF_PREFERRED)
> +		return true;
> +	if (gl->gl_state == LM_ST_UNLOCKED)
> +		return false;
> +	if (!list_empty(&gl->gl_holders))
> +		return false;
> +	if (test_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags) ||
> +	    !test_bit(GLF_DEMOTE, &gl->gl_flags))
> +		return false;
> +	return true;
> +}
> +
> +/**
> + * other_rgrp_users - figure out if this rgrp has other users
> + * @rgd: The resource group
> + * @locked: true if we've already held the glock
> + *
> + * We're trying to figure out if the given rgrp has anybody competing for
> + * its free space. If other processes have enqueued its glock, there's a
> + * good chance of competition.
> + *
> + * If there are multi-block reservations for this rgrp, there's a good
> + * chance another process will lock the rgrp for block allocations soon.
> + *
> + * If we've already held the glock, we no longer care if there are holders
> + * because that's now a given (rgrp glocks are never shared).
> + */
> +static inline bool other_rgrp_users(const struct gfs2_rgrpd *rgd, bool locked)
> +{
> +	struct gfs2_glock *gl = rgd->rd_gl;
> +
> +	if (spin_is_locked(&gl->gl_lockref.lock)) /* someone preparing to use it. */
> +		return true;
> +	if (!locked && !list_empty(&gl->gl_holders))
> +		return true;
> +	if (!RB_EMPTY_ROOT(&rgd->rd_rstree))
> +		return true;
> +	return false;
> +}
> +
> +/**
> + * gfs2_rgrp_congested - decide whether a rgrp glock is congested
> + * @rgd: The rgrp in question
> + * @loops: An indication of how picky we can be (0=very, 1=less so)
> + * @locked: Indicates if checks are before or after we've enqueued the glock.
> + *
> + * There are two kinds of congestion: inter-node and intra-node.
> + *
> + * Inter-node congestion is where multiple nodes all want to allocate blocks
> + * inside the same rgrp, which means they need to trade the rgrp glock back
> + * and forth, which is a slow process. To mitigate this, we use glock
> + * statistics to predict whether the glock is historically fast to acquire.
> + *
> + * Intra-node congestion is where you have multiple processes on the same
> + * node, all trying to allocate blocks in the same rgrp. There's nothing wrong
> + * with doing so, but each process needs to wait for the other to release the
> + * rgrp glock before it may proceed. We can predict whether a rgrp glock is
> + * congested by how many block reservations are currently attached.
> + *
> + * Both kinds of congestion can hurt performance, but it's faster to check
> + * intra-node, so we do that first. After all, why bother to check if we can
> + * get the glock quickly from DLM if other processes have also used that
> + * same reasoning.
> + *
> + * We know the number of loops we've been around, so we know how desperate we
> + * are to find something. On first loop, call it congested if anyone else has
> + * a block reservation. On second loop, call it congested if it's not fast to
> + * acquire.
>    */
> -static bool gfs2_rgrp_used_recently(const struct gfs2_blkreserv *rs,
> -				    u64 msecs)
> +static bool gfs2_rgrp_congested(const struct gfs2_blkreserv *rs, int loops,
> +				bool locked)
>   {
> -	u64 tdiff;
> +	const struct gfs2_rgrpd *rgd = rs->rs_rbm.rgd;
> +
> +	/* We already have a reservation, we need to use it regardless */
> +	if (gfs2_rs_active(rs))
> +		return false;
> +
> +	/* If we've rejected all the rgrps a few times, can no longer worry
> +	   about whether the rgrp is congested. Fill in blocks where we can. */
> +	if (loops >= 2)
> +		return false;
> +
> +	/* Check for intra-node congestion */
> +	if (loops == 0 && other_rgrp_users(rgd, locked))
> +		return true;
> +
> +	if (loops <= 1 && !fast_to_acquire(rgd))
> +		return true;
>   
> -	tdiff = ktime_to_ns(ktime_sub(ktime_get_real(),
> -                            rs->rs_rbm.rgd->rd_gl->gl_dstamp));
> +	/* Check for inter-node congestion */
> +	if (rgd->rd_sbd->sd_lockstruct.ls_ops->lm_lock) /* lock_dlm */
> +		return gfs2_rgrp_congested_dlm(rgd, loops);
>   
> -	return tdiff > (msecs * 1000 * 1000);
> +	return false;
>   }
>   
>   static u32 gfs2_orlov_skip(const struct gfs2_inode *ip)
> @@ -1920,25 +2029,6 @@ static bool gfs2_select_rgrp(struct gfs2_rgrpd **pos, const struct gfs2_rgrpd *b
>   }
>   
>   /**
> - * fast_to_acquire - determine if a resource group will be fast to acquire
> - *
> - * If this is one of our preferred rgrps, it should be quicker to acquire,
> - * because we tried to set ourselves up as dlm lock master.
> - */
> -static inline int fast_to_acquire(struct gfs2_rgrpd *rgd)
> -{
> -	struct gfs2_glock *gl = rgd->rd_gl;
> -
> -	if (gl->gl_state != LM_ST_UNLOCKED && list_empty(&gl->gl_holders) &&
> -	    !test_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags) &&
> -	    !test_bit(GLF_DEMOTE, &gl->gl_flags))
> -		return 1;
> -	if (rgd->rd_flags & GFS2_RDF_PREFERRED)
> -		return 1;
> -	return 0;
> -}
> -
> -/**
>    * gfs2_inplace_reserve - Reserve space in the filesystem
>    * @ip: the inode to reserve space for
>    * @ap: the allocation parameters
> @@ -1988,22 +2078,14 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, struct gfs2_alloc_parms *ap)
>   			rg_locked = 0;
>   			if (skip && skip--)
>   				goto next_rgrp;
> -			if (!gfs2_rs_active(rs)) {
> -				if (loops == 0 &&
> -				    !fast_to_acquire(rs->rs_rbm.rgd))
> +			if (gfs2_rgrp_congested(rs, loops, false))
>   					goto next_rgrp;
> -				if ((loops < 2) &&
> -				    gfs2_rgrp_used_recently(rs, 1000) &&
> -				    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
> -					goto next_rgrp;
> -			}
>   			error = gfs2_glock_nq_init(rs->rs_rbm.rgd->rd_gl,
>   						   LM_ST_EXCLUSIVE, flags,
>   						   &rs->rs_rgd_gh);
>   			if (unlikely(error))
>   				return error;
> -			if (!gfs2_rs_active(rs) && (loops < 2) &&
> -			    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
> +			if (gfs2_rgrp_congested(rs, loops, true))
>   				goto skip_rgrp;
>   			if (sdp->sd_args.ar_rgrplvb) {
>   				error = update_rgrp_lvb(rs->rs_rbm.rgd);
>




* [Cluster-devel] [GFS2 PATCH] GFS2: Rework rgrp glock congestion functions for intra-node (v2)
  2017-11-02 16:16   ` Steven Whitehouse
@ 2017-11-02 16:28     ` Bob Peterson
  2017-11-02 17:06     ` Bob Peterson
  1 sibling, 0 replies; 5+ messages in thread
From: Bob Peterson @ 2017-11-02 16:28 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

----- Original Message -----
| So about 18% improvement. That is worth having, but what about the read
| side? What does the fragmentation look like afterwards as there is no
| point in speeding up writes at the expense of reads.

Good question.

Since the changes are confined to the GFS2 block allocator, which only
affects writes, it should only degrade reads if fragmentation is made
worse by the patch. But the patch actually improves fragmentation.
Here's an example analysis output from 8 simultaneous iozone jobs:

stock kernel:

Fragmentation Summary
/mnt/gfs2/iozonetest1: 17234 extents found
/mnt/gfs2/iozonetest2: 17049 extents found
/mnt/gfs2/iozonetest3: 17063 extents found
/mnt/gfs2/iozonetest4: 17062 extents found
/mnt/gfs2/iozonetest5: 17051 extents found
/mnt/gfs2/iozonetest6: 16988 extents found
/mnt/gfs2/iozonetest7: 17035 extents found
/mnt/gfs2/iozonetest8: 17049 extents found

patched kernel:

Fragmentation Summary
/mnt/gfs2/iozonetest1: 16622 extents found
/mnt/gfs2/iozonetest2: 16613 extents found
/mnt/gfs2/iozonetest3: 16611 extents found
/mnt/gfs2/iozonetest4: 16616 extents found
/mnt/gfs2/iozonetest5: 17315 extents found
/mnt/gfs2/iozonetest6: 16611 extents found
/mnt/gfs2/iozonetest7: 16612 extents found
/mnt/gfs2/iozonetest8: 16656 extents found
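
Averaging those, the stock kernel comes to about 17,066 extents per
file versus roughly 16,707 with the patch, so on average the files
end up about 2% less fragmented, not more.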

Regards,

Bob Peterson
Red Hat File Systems




* [Cluster-devel] [GFS2 PATCH] GFS2: Rework rgrp glock congestion functions for intra-node (v2)
  2017-11-02 16:16   ` Steven Whitehouse
  2017-11-02 16:28     ` Bob Peterson
@ 2017-11-02 17:06     ` Bob Peterson
  2017-11-03 11:20       ` Steven Whitehouse
  1 sibling, 1 reply; 5+ messages in thread
From: Bob Peterson @ 2017-11-02 17:06 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Addressing Steve's other questions:

----- Original Message -----
| > We can predict whether a rgrp glock is the victim of intranode
| > congestion based on set of rules. Rules Enforced by this patch:
| >
| > 1. If the current process has a multi-block reservation in the
| >     rgrp, it needs to use it regardless of the congestion. The
| >     congestion algorithm should have prevented the reservation
| >     in the first place.
| > 2. If some process has the rgrp gl_lockref spin_lock locked,
| >     they are preparing to use it for a reservation. So we take
| >     this as a clear sign of impending contention.
| There can be all kinds of reasons why this lock has been taken. It
| should have very low contention on it, so that I don't think this is
| useful overall.

I'll try some experiments without this check.

| > 3. If the rgrp currently has a glock holder, we know we need to
| >     wait before we can lock it, regardless of whether the holder
| >     represents an actual holder or a waiter.
| That is probably wrong. We ought to be able to to multiple allocations
| in an rgrp if the node is holding the rgrp lock, without having to
| constantly queue and requeue locks. In other words, if we hold the rgrp
| in excl mode, then a useful extension would be to allow multiple users
| to perform allocations in parallel. So the exclusive rgrp glock becomes
| locally sharable - that is the real issue here I think.

That sounds like a really good enhancement, but out of the scope of
this patch.
 
| > 4. If the rgrp currently has a multi-block reservation, and we
| >     already know it's not ours, then intranode contention is
| >     likely.
| What do you mean by "not ours"? In general each node only sees the
| reservations which are taken out on the node itself.

What I mean is: is the reservation associated with the current inode?
If the current inode for which we're allocating blocks has a current
reservation associated with it (gfs2_rs_active(rs) where rs is ip->i_res)
then we're obliged to use it. If not, we go out searching for a new
reservation in the rgrp. If the rgrp has a reservation queued from
a different inode (so its rgrp->rd_rstree is not empty) then another
process on this node has queued a reservation to this rgrp for the
purposes of a different file.
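
In code terms, the distinction is roughly this (a minimal sketch that
reuses the fields the patch already tests; the helper name is made up
here and is not part of the patch):

static bool rgrp_reserved_by_another_inode(struct gfs2_inode *ip,
					   const struct gfs2_rgrpd *rgd)
{
	struct gfs2_blkreserv *rs = &ip->i_res;

	if (gfs2_rs_active(rs))		/* "ours": we must use it */
		return false;
	/* "not ours": another inode on this node reserved space here */
	return !RB_EMPTY_ROOT(&rgd->rd_rstree);
}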

| > 5. If none of these conditions are true, we check to see if we
| >     can acquire (lock / enqueue) the glock relatively quickly.
| >     If this is lock_dlm protocol, we check if the rgrp is one
| >     of our preferred rgrps. If so, we treat it as fast. As it
| >     was before this patch, for lock_nolock protocol, all rgrps
| >     are considered "preferred."
| The preferred rgrp is the one which is closest to the previous
| allocation for the inode, so that we reduce fragmentation.

That's not what I mean. What I mean is: Not too long ago, we
introduced the concept of a "preferred" set of resource groups such
that each node in the cluster has a unique set of rgrps it prefers to
use. When its block allocations stay within the set of
"preferred" rgrps for that node, there is no contention for rgrp
glocks. The only fighting that takes place is if a node goes
outside its preferred set of rgrps, which it sometimes needs to do.
This was a performance enhancement and it did make a big difference.
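
To make the idea concrete, here is a purely illustrative sketch (the
helper and the exact policy are invented for illustration; the real
code marks such rgrps with the GFS2_RDF_PREFERRED flag):

/* Each node "owns" a disjoint stripe of rgrps keyed off its journal id,
 * so nodes that stay inside their own stripe never compete for the same
 * rgrp glocks. */
static bool node_prefers_rgrp(unsigned int rgrp_index,
			      unsigned int journal_id,
			      unsigned int num_journals)
{
	return (rgrp_index % num_journals) == journal_id;
}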

| > 6. If the rgrp glock is unlocked, it's generally not fast to
| >     acquire. At the least, we need to push it through the glock
| >     state machine and read it from the media. Worst case, we also
| >     need to get dlm's permission to do so. This is ignored for
| >     preferred rgrps, since we want to set up easy access to them
| >     anyway.
| It should be fast to acquire unless another node is using it... if not
| then we need to look to see what is holding things up. Chances are it is
| due to the wait for a journal commit on a remote node, and that needs to
| be looked at first I think.

But if multiple processes on the same node all want to do block
allocations, they need to wait for the appropriate rgrp glock enqueue,
which means their allocations may queue up behind others. This patch
is all about intra-node. The inter-node case is already well taken
care of. Of course, if we implement a way for multiple processes to
share the same rgrp glock in EX as you suggested above (taking the
EX lock in the name of, say, a single block allocator rather than per
process) then this all becomes moot. But that could be tricky to
implement. It's definitely worth investigating though.
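
Just to sketch the shape of that idea (entirely hypothetical; neither
this patch nor the current code has anything like it), the node-wide EX
holder could belong to a per-rgrp structure rather than to each
allocating process, with a local lock serializing the bitmap updates:

struct gfs2_rgrp_local_alloc {		/* hypothetical */
	struct gfs2_holder lal_gh;	/* single EX holder for the node */
	struct mutex lal_mutex;		/* serializes local bitmap updates */
	atomic_t lal_users;		/* local processes allocating here */
};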
 
Regards,

Bob Peterson
Red Hat File Systems




* [Cluster-devel] [GFS2 PATCH] GFS2: Rework rgrp glock congestion functions for intra-node (v2)
  2017-11-02 17:06     ` Bob Peterson
@ 2017-11-03 11:20       ` Steven Whitehouse
  0 siblings, 0 replies; 5+ messages in thread
From: Steven Whitehouse @ 2017-11-03 11:20 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,


On 02/11/17 17:06, Bob Peterson wrote:
> Addressing Steve's other questions:
>
> ----- Original Message -----
> | > We can predict whether a rgrp glock is the victim of intranode
> | > congestion based on set of rules. Rules Enforced by this patch:
> | >
> | > 1. If the current process has a multi-block reservation in the
> | >     rgrp, it needs to use it regardless of the congestion. The
> | >     congestion algorithm should have prevented the reservation
> | >     in the first place.
> | > 2. If some process has the rgrp gl_lockref spin_lock locked,
> | >     they are preparing to use it for a reservation. So we take
> | >     this as a clear sign of impending contention.
> | There can be all kinds of reasons why this lock has been taken. It
> | should have very low contention on it, so that I don't think this is
> | useful overall.
>
> I'll try some experiments without this check.
>
> | > 3. If the rgrp currently has a glock holder, we know we need to
> | >     wait before we can lock it, regardless of whether the holder
> | >     represents an actual holder or a waiter.
> | That is probably wrong. We ought to be able to to multiple allocations
> | in an rgrp if the node is holding the rgrp lock, without having to
> | constantly queue and requeue locks. In other words, if we hold the rgrp
> | in excl mode, then a useful extension would be to allow multiple users
> | to perform allocations in parallel. So the exclusive rgrp glock becomes
> | locally sharable - that is the real issue here I think.
>
> That sounds like a really good enhancement, but out of the scope of
> this patch.
Yes, that's true. However, the patch as it stands does a lot of different 
things and those things need to be separated out and looked at 
individually. Which ones really make the performance difference in this 
case?

Doing so many things in a single patch means that it is very tricky if a 
regression is noticed later, since rolling back is either all or 
nothing. These things definitely need splitting up into smaller changes.

>   
> | > 4. If the rgrp currently has a multi-block reservation, and we
> | >     already know it's not ours, then intranode contention is
> | >     likely.
> | What do you mean by "not ours"? In general each node only sees the
> | reservations which are taken out on the node itself.
>
> What I mean is: is the reservation associated with the current inode?
> If the current inode for which we're allocating blocks has a current
> reservation associated with it (gfs2_rs_active(rs) where rs is ip->i_res)
> then we've obliged to use it. If not, we go out searching for a new
> reservation in the rgrp. If the rgrp has a reservation queued from
> a different inode (so its rgrp->rd_rstree is not empty) then another
> process on this node has queued a reservation to this rgrp for the
> purposes of a different file.
The current allocation algorithm tries to put related files (i.e. those 
with the same parent directory) near each other. That is not an 
unreasonable plan, and if there were multiple inodes with the same 
parent directory being written to at the same time, it is not a good 
plan to separate them too far. This is really just a workaround for the 
bigger issue of not being able to have multiple local processes using 
the same rgrp at the same time. I think that needs looking at again,

Steve.



