From: Bob Peterson <rpeterso@redhat.com>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] [GFS2 PATCH] GFS2: Rework rgrp glock congestion functions for intra-node (v2)
Date: Thu, 2 Nov 2017 08:55:38 -0400 (EDT)
Message-ID: <478802130.26702749.1509627338658.JavaMail.zimbra@redhat.com>
In-Reply-To: <630774683.26694578.1509626258810.JavaMail.zimbra@redhat.com>

Hi,

On 04 August I posted a patch for this, along with a graph
showing the performance improvements. Since I got no responses,
and the patch is non-trivial, I did not push it to the GFS2
development git tree.

This version is identical, except for a new, improved version of
the (tiny) function gfs2_rgrp_used_recently(). The function does
the same thing, but the code is now less archaic.
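
(For reference, the new form, shown in the diff below, simply asks
whether the current time still falls within msecs of the glock's
gl_dstamp timestamp:

	return ktime_before(ktime_get_real(),
			    ktime_add(rgd->rd_gl->gl_dstamp,
				      ms_to_ktime(msecs)));

i.e. it is true exactly when now - gl_dstamp < msecs, rather than
converting the delta to nanoseconds and comparing by hand.)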

In a nutshell, the problem is that today's GFS2 does not treat
intra-node block allocations fairly among processes on the same node.

For example, if you start multiple processes running an iozone
test on GFS2, one of them will be highly favored at the expense
of the others. Here is some output from a "stock" kernel without
the patch:

        Child[0] xfer count = 17802752.00 kB, Throughput =   35294.80 kB/sec
        Child[1] xfer count = 17600768.00 kB, Throughput =   34971.67 kB/sec
        Child[2] xfer count = 18067456.00 kB, Throughput =   35819.86 kB/sec
        Child[3] xfer count = 67108864.00 kB, Throughput =  133046.50 kB/sec

In this case, the last child was given priority for its block
allocations, at the expense of all the others.
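
(Output in that format comes from iozone's throughput mode. The
exact invocation isn't recorded in this message, but a run along
these lines, with four writers and 64 GiB files, produces it;
treat the options as an illustration only:

	iozone -i 0 -t 4 -s 64g -r 1m

Here -i 0 selects the initial write test, -t 4 runs four child
processes, -s 64g sets the per-child file size, and -r 1m the
record size.)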

The problem was that the GFS2 block allocator algorithm detected
only inter-node contention (contention from other nodes), but not
intra-node contention.

This patch reworks the GFS2 block allocator so that it uses a
much more intelligent algorithm to detect intra-node block
allocator contention as well as inter-node. The result is a
fairer distribution of the block allocator's resources. Running
the exact same test with this algorithm, I got:

        Child[0] xfer count = 67108864.00 kB, Throughput =   71773.73 kB/sec
        Child[1] xfer count = 67044352.00 kB, Throughput =   71699.45 kB/sec
        Child[2] xfer count = 66049280.00 kB, Throughput =   70635.47 kB/sec
        Child[3] xfer count = 63489024.00 kB, Throughput =   67816.11 kB/sec

Detecting possible contention and distributing the block allocations more
fairly also helps overall performance. In the example above:

Before the patch:
        Children see throughput for  4 initial writers  =  239132.83 kB/sec
After the patch:
        Children see throughput for  4 initial writers  =  281924.77 kB/sec

That is an aggregate improvement of roughly 18 percent.

The algorithm scales well; I tested it with up to 24 processes
on a 24-core machine.

Regards,

Bob Peterson
Red Hat File Systems
---
commit 88fd88d54658d0db57d942f53732840740830f80
Author: Bob Peterson <rpeterso@redhat.com>
Date:   Wed Apr 19 08:11:03 2017 -0400

GFS2: Rework rgrp glock congestion functions for intra-node

There are two kinds of congestion: inter-node and intra-node.

Inter-node congestion is where multiple nodes want to allocate
blocks inside the same rgrp, which means they need to trade the
rgrp glock back and forth, which is a slow process. To mitigate
this, we use glock statistics to predict whether the glock is
historically fast to acquire. This hasn't really changed.

Intra-node congestion is where you have multiple processes on
the same node, all trying to allocate blocks in the same rgrp.
There's nothing wrong with doing so, but each process needs to
wait for the others to release the rgrp glock before it may
proceed, and thus it's slower than it could be.

If each of those processes operated on separate rgrps, they
wouldn't need to do all this glock waiting, and thus would be
faster.

Both kinds of congestion can hurt performance, but it's faster
to check for intra-node congestion, so we do that first. After
all, why bother to check if we can get the glock quickly from
DLM if other processes have also used that same reasoning:
they'd just generate intra-node congestion.

So this patch reworks the "congestion" algorithms for resource
group glocks to take into account intra-node demands as well as
inter-node demands.

We can predict whether a rgrp glock is the victim of intra-node
congestion based on a set of rules. The rules enforced by this
patch are listed below; a condensed sketch of the resulting
decision order follows the list.

1. If the current process has a multi-block reservation in the
   rgrp, it needs to use it regardless of the congestion. The
   congestion algorithm should have prevented the reservation
   in the first place.
2. If some process has the rgrp gl_lockref spin_lock locked, it
   is preparing to use the rgrp for a reservation. So we take
   this as a clear sign of impending contention.
3. If the rgrp currently has a glock holder, we know we need to
   wait before we can lock it, regardless of whether the holder
   represents an actual holder or a waiter.
4. If the rgrp currently has a multi-block reservation, and we
   already know it's not ours, then intranode contention is
   likely.
5. If none of these conditions are true, we check to see if we
   can acquire (lock / enqueue) the glock relatively quickly.
   If the locking protocol is lock_dlm, we check whether the rgrp
   is one of our preferred rgrps. If so, we treat it as fast. As
   before this patch, with the lock_nolock protocol all rgrps
   are considered "preferred."
6. If the rgrp glock is unlocked, it's generally not fast to
   acquire. At the least, we need to push it through the glock
   state machine and read it from the media. Worst case, we also
   need to get dlm's permission to do so. This is ignored for
   preferred rgrps, since we want to set up easy access to them
   anyway.
7. If the DEMOTE or DEMOTE_IN_PROGRESS bits are set, we know we
   have an even longer wait from the glock state machine, as
   any enqueues will need to wait for the glock to be demoted,
   then promoted again.
8. If, after all this, we deem the rgrp to be good enough for
   this attempt (loop), and the locking protocol is lock_dlm,
   we do the normal checks we did before this patch. Namely, we
   check the inter-node locking statistics kept by lock_dlm to
   see whether the rgrp is a hot spot for the cluster.
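
The condensed decision order, then, looks roughly like this. It
is a simplified sketch only, not the patch code itself; the real
implementation is gfs2_rgrp_congested() in the diff below, and
"using_lock_dlm" stands in for the ls_ops->lm_lock check:

	if (gfs2_rs_active(rs))         /* rule 1: our own reservation */
		return false;           /* we must use it regardless */
	if (loops >= 2)                 /* too desperate to be picky */
		return false;
	if (loops == 0 && other_rgrp_users(rgd, locked))
		return true;            /* rules 2-4: intra-node */
	if (!fast_to_acquire(rgd))      /* rules 5-7: glock state */
		return true;
	if (using_lock_dlm)             /* rule 8: inter-node stats */
		return gfs2_rgrp_congested_dlm(rgd, loops);
	return false;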

Note that with each "loop" through the list of rgrps, we become
more lax with our requirements. So the first time through, we
can afford to be picky and insist on an rgrp no one else is
using. But on subsequent passes through the rgrps, we need to
accept rgrps that are less than ideal, as summarized below.
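
Concretely, the pickiness per pass works out to:

	loops == 0: reject the rgrp if other_rgrp_users() sees
	            competition, if it is not fast_to_acquire(),
	            or if the DLM statistics call it congested
	loops == 1: reject it only if it is not fast_to_acquire()
	            or the DLM statistics call it congested
	loops >= 2: accept any rgrp with sufficient free space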

Signed-off-by: Bob Peterson <rpeterso@redhat.com>

diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 95b2a57ded33..dc478ae53085 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -1806,7 +1806,22 @@ static void try_rgrp_unlink(struct gfs2_rgrpd *rgd, u64 *last_unlinked, u64 skip
 }
 
 /**
- * gfs2_rgrp_congested - Use stats to figure out whether an rgrp is congested
+ * gfs2_rgrp_used_recently - check if a rgrp glock was used recently
+ * @rgd: The resource group to test
+ * @msecs: The time limit in milliseconds
+ *
+ * Returns: True if the rgrp glock has been used within the time limit
+ */
+static inline bool gfs2_rgrp_used_recently(const struct gfs2_rgrpd *rgd,
+					   u64 msecs)
+{
+	return (ktime_before(ktime_get_real(),
+			     ktime_add(rgd->rd_gl->gl_dstamp,
+				       ms_to_ktime(msecs))));
+}
+
+/**
+ * gfs2_rgrp_congested_dlm - Use stats to figure out if an rgrp is congested
  * @rgd: The rgrp in question
  * @loops: An indication of how picky we can be (0=very, 1=less so)
  *
@@ -1832,7 +1847,7 @@ static void try_rgrp_unlink(struct gfs2_rgrpd *rgd, u64 *last_unlinked, u64 skip
  * Returns: A boolean verdict on the congestion status
  */
 
-static bool gfs2_rgrp_congested(const struct gfs2_rgrpd *rgd, int loops)
+static bool gfs2_rgrp_congested_dlm(const struct gfs2_rgrpd *rgd, int loops)
 {
 	const struct gfs2_glock *gl = rgd->rd_gl;
 	const struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
@@ -1844,6 +1859,11 @@ static bool gfs2_rgrp_congested(const struct gfs2_rgrpd *rgd, int loops)
 	u64 var;
 	int cpu, nonzero = 0;
 
+	/* If it hasn't been used in the last second, we can't judge the
+	   statistics, so assume it's not congested. */
+	if (!gfs2_rgrp_used_recently(rgd, 1000))
+		return false;
+
 	preempt_disable();
 	for_each_present_cpu(cpu) {
 		st = &per_cpu_ptr(sdp->sd_lkstats, cpu)->lkstats[LM_TYPE_RGRP];
@@ -1879,21 +1899,110 @@ static bool gfs2_rgrp_congested(const struct gfs2_rgrpd *rgd, int loops)
 }
 
 /**
- * gfs2_rgrp_used_recently
- * @rs: The block reservation with the rgrp to test
- * @msecs: The time limit in milliseconds
+ * fast_to_acquire - determine if a resource group will be fast to acquire
  *
- * Returns: True if the rgrp glock has been used within the time limit
+ * If this is one of our preferred rgrps, it should be quicker to acquire,
+ * because we tried to set ourselves up as dlm lock master.
+ */
+static inline bool fast_to_acquire(const struct gfs2_rgrpd *rgd)
+{
+	struct gfs2_glock *gl = rgd->rd_gl;
+
+	if (rgd->rd_flags & GFS2_RDF_PREFERRED)
+		return true;
+	if (gl->gl_state == LM_ST_UNLOCKED)
+		return false;
+	if (!list_empty(&gl->gl_holders))
+		return false;
+	if (test_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags) ||
+	    test_bit(GLF_DEMOTE, &gl->gl_flags))
+		return false;
+	return true;
+}
+
+/**
+ * other_rgrp_users - figure out if this rgrp has other users
+ * @rgd: The resource group
+ * @locked: true if we already hold the glock
+ *
+ * We're trying to figure out if the given rgrp has anybody competing for
+ * its free space. If other processes have enqueued its glock, there's a
+ * good chance of competition.
+ *
+ * If there are multi-block reservations for this rgrp, there's a good
+ * chance another process will lock the rgrp for block allocations soon.
+ *
+ * If we already hold the glock, we no longer care if there are holders
+ * because that's now a given (rgrp glocks are never shared).
+ */
+static inline bool other_rgrp_users(const struct gfs2_rgrpd *rgd, bool locked)
+{
+	struct gfs2_glock *gl = rgd->rd_gl;
+
+	if (spin_is_locked(&gl->gl_lockref.lock)) /* someone preparing to use it. */
+		return true;
+	if (!locked && !list_empty(&gl->gl_holders))
+		return true;
+	if (!RB_EMPTY_ROOT(&rgd->rd_rstree))
+		return true;
+	return false;
+}
+
+/**
+ * gfs2_rgrp_congested - decide whether a rgrp glock is congested
+ * @rgd: The rgrp in question
+ * @loops: An indication of how picky we can be (0=very, 1=less so)
+ * @locked: True if we have already enqueued (locked) the glock.
+ *
+ * There are two kinds of congestion: inter-node and intra-node.
+ *
+ * Inter-node congestion is where multiple nodes all want to allocate blocks
+ * inside the same rgrp, which means they need to trade the rgrp glock back
+ * and forth, which is a slow process. To mitigate this, we use glock
+ * statistics to predict whether the glock is historically fast to acquire.
+ *
+ * Intra-node congestion is where you have multiple processes on the same
+ * node, all trying to allocate blocks in the same rgrp. There's nothing wrong
+ * with doing so, but each process needs to wait for the other to release the
+ * rgrp glock before it may proceed. We can predict whether a rgrp glock is
+ * congested by how many block reservations are currently attached.
+ *
+ * Both kinds of congestion can hurt performance, but it's faster to check
+ * intra-node, so we do that first. After all, why bother to check if we can
+ * get the glock quickly from DLM if other processes have also used that
+ * same reasoning.
+ *
+ * We know the number of loops we've been around, so we know how desperate we
+ * are to find something. On first loop, call it congested if anyone else has
+ * a block reservation. On second loop, call it congested if it's not fast to
+ * acquire.
  */
-static bool gfs2_rgrp_used_recently(const struct gfs2_blkreserv *rs,
-				    u64 msecs)
+static bool gfs2_rgrp_congested(const struct gfs2_blkreserv *rs, int loops,
+				bool locked)
 {
-	u64 tdiff;
+	const struct gfs2_rgrpd *rgd = rs->rs_rbm.rgd;
+
+	/* We already have a reservation, we need to use it regardless */
+	if (gfs2_rs_active(rs))
+		return false;
+
+	/* If we've rejected all the rgrps a few times, we can no longer worry
+	   about whether the rgrp is congested. Fill in blocks where we can. */
+	if (loops >= 2)
+		return false;
+
+	/* Check for intra-node congestion */
+	if (loops == 0 && other_rgrp_users(rgd, locked))
+		return true;
+
+	if (loops <= 1 && !fast_to_acquire(rgd))
+		return true;
 
-	tdiff = ktime_to_ns(ktime_sub(ktime_get_real(),
-                            rs->rs_rbm.rgd->rd_gl->gl_dstamp));
+	/* Check for inter-node congestion */
+	if (rgd->rd_sbd->sd_lockstruct.ls_ops->lm_lock) /* lock_dlm */
+		return gfs2_rgrp_congested_dlm(rgd, loops);
 
-	return tdiff > (msecs * 1000 * 1000);
+	return false;
 }
 
 static u32 gfs2_orlov_skip(const struct gfs2_inode *ip)
@@ -1920,25 +2029,6 @@ static bool gfs2_select_rgrp(struct gfs2_rgrpd **pos, const struct gfs2_rgrpd *b
 }
 
 /**
- * fast_to_acquire - determine if a resource group will be fast to acquire
- *
- * If this is one of our preferred rgrps, it should be quicker to acquire,
- * because we tried to set ourselves up as dlm lock master.
- */
-static inline int fast_to_acquire(struct gfs2_rgrpd *rgd)
-{
-	struct gfs2_glock *gl = rgd->rd_gl;
-
-	if (gl->gl_state != LM_ST_UNLOCKED && list_empty(&gl->gl_holders) &&
-	    !test_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags) &&
-	    !test_bit(GLF_DEMOTE, &gl->gl_flags))
-		return 1;
-	if (rgd->rd_flags & GFS2_RDF_PREFERRED)
-		return 1;
-	return 0;
-}
-
-/**
  * gfs2_inplace_reserve - Reserve space in the filesystem
  * @ip: the inode to reserve space for
  * @ap: the allocation parameters
@@ -1988,22 +2078,14 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, struct gfs2_alloc_parms *ap)
 			rg_locked = 0;
 			if (skip && skip--)
 				goto next_rgrp;
-			if (!gfs2_rs_active(rs)) {
-				if (loops == 0 &&
-				    !fast_to_acquire(rs->rs_rbm.rgd))
-					goto next_rgrp;
-				if ((loops < 2) &&
-				    gfs2_rgrp_used_recently(rs, 1000) &&
-				    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
-					goto next_rgrp;
-			}
+			if (gfs2_rgrp_congested(rs, loops, false))
+				goto next_rgrp;
 			error = gfs2_glock_nq_init(rs->rs_rbm.rgd->rd_gl,
 						   LM_ST_EXCLUSIVE, flags,
 						   &rs->rs_rgd_gh);
 			if (unlikely(error))
 				return error;
-			if (!gfs2_rs_active(rs) && (loops < 2) &&
-			    gfs2_rgrp_congested(rs->rs_rbm.rgd, loops))
+			if (gfs2_rgrp_congested(rs, loops, true))
 				goto skip_rgrp;
 			if (sdp->sd_args.ar_rgrplvb) {
 				error = update_rgrp_lvb(rs->rs_rbm.rgd);


