* [Cluster-devel] [GFS2 PATCH] GFS2: Don't brelse rgrp buffer_heads every allocation
       [not found] <1673564717.11791069.1433515261791.JavaMail.zimbra@redhat.com>
@ 2015-06-05 14:49 ` Bob Peterson
  2015-06-08 12:18   ` Steven Whitehouse
  0 siblings, 1 reply; 11+ messages in thread
From: Bob Peterson @ 2015-06-05 14:49 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

This patch allows the block allocation code to retain the buffers
for the resource groups so they don't need to be re-read from buffer
cache with every request. This is a performance improvement that's
especially noticeable when resource groups are very large. For
example, with 2GB resource groups and 4K blocks, there can be 33
blocks for every resource group. This patch allows those 33 buffers
to be kept around and not read in and thrown away with every
operation. The buffers are released when the resource group is
either synced or invalidated.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/glops.c | 14 +++++++++++---
 fs/gfs2/rgrp.c  | 23 +++++++++++++++++++----
 fs/gfs2/rgrp.h  |  1 +
 3 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index fe91951..c23377f 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -144,6 +144,12 @@ static void rgrp_go_sync(struct gfs2_glock *gl)
 	struct gfs2_rgrpd *rgd;
 	int error;
 
+	spin_lock(&gl->gl_spin);
+	rgd = gl->gl_object;
+	if (rgd)
+		gfs2_rgrp_brelse(rgd);
+	spin_unlock(&gl->gl_spin);
+
 	if (!test_and_clear_bit(GLF_DIRTY, &gl->gl_flags))
 		return;
 	GLOCK_BUG_ON(gl, gl->gl_state != LM_ST_EXCLUSIVE);
@@ -175,15 +181,17 @@ static void rgrp_go_inval(struct gfs2_glock *gl, int flags)
 {
 	struct gfs2_sbd *sdp = gl->gl_sbd;
 	struct address_space *mapping = &sdp->sd_aspace;
+	struct gfs2_rgrpd *rgd = gl->gl_object;
+
+	if (rgd)
+		gfs2_rgrp_brelse(rgd);
 
 	WARN_ON_ONCE(!(flags & DIO_METADATA));
 	gfs2_assert_withdraw(sdp, !atomic_read(&gl->gl_ail_count));
 	truncate_inode_pages_range(mapping, gl->gl_vm.start, gl->gl_vm.end);
 
-	if (gl->gl_object) {
-		struct gfs2_rgrpd *rgd = (struct gfs2_rgrpd *)gl->gl_object;
+	if (rgd)
 		rgd->rd_flags &= ~GFS2_RDF_UPTODATE;
-	}
 }
 
 /**
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index cd53d6e..c6c6232 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -1244,14 +1244,13 @@ int gfs2_rgrp_go_lock(struct gfs2_holder *gh)
 }
 
 /**
- * gfs2_rgrp_go_unlock - Release RG bitmaps read in with gfs2_rgrp_bh_get()
- * @gh: The glock holder for the resource group
+ * gfs2_rgrp_brelse - Release RG bitmaps read in with gfs2_rgrp_bh_get()
+ * @rgd: The resource group
  *
  */
 
-void gfs2_rgrp_go_unlock(struct gfs2_holder *gh)
+void gfs2_rgrp_brelse(struct gfs2_rgrpd *rgd)
 {
-	struct gfs2_rgrpd *rgd = gh->gh_gl->gl_object;
 	int x, length = rgd->rd_length;
 
 	for (x = 0; x < length; x++) {
@@ -1264,6 +1263,22 @@ void gfs2_rgrp_go_unlock(struct gfs2_holder *gh)
 
 }
 
+/**
+ * gfs2_rgrp_go_unlock - Unlock a rgrp glock
+ * @gh: The glock holder for the resource group
+ *
+ */
+
+void gfs2_rgrp_go_unlock(struct gfs2_holder *gh)
+{
+	struct gfs2_rgrpd *rgd = gh->gh_gl->gl_object;
+	int demote_requested = test_bit(GLF_DEMOTE, &gh->gh_gl->gl_flags) |
+		test_bit(GLF_PENDING_DEMOTE, &gh->gh_gl->gl_flags);
+
+	if (rgd && demote_requested)
+		gfs2_rgrp_brelse(rgd);
+}
+
 int gfs2_rgrp_send_discards(struct gfs2_sbd *sdp, u64 offset,
 			     struct buffer_head *bh,
 			     const struct gfs2_bitmap *bi, unsigned minlen, u64 *ptrimmed)
diff --git a/fs/gfs2/rgrp.h b/fs/gfs2/rgrp.h
index 68972ec..c0ab33f 100644
--- a/fs/gfs2/rgrp.h
+++ b/fs/gfs2/rgrp.h
@@ -36,6 +36,7 @@ extern void gfs2_clear_rgrpd(struct gfs2_sbd *sdp);
 extern int gfs2_rindex_update(struct gfs2_sbd *sdp);
 extern void gfs2_free_clones(struct gfs2_rgrpd *rgd);
 extern int gfs2_rgrp_go_lock(struct gfs2_holder *gh);
+extern void gfs2_rgrp_brelse(struct gfs2_rgrpd *rgd);
 extern void gfs2_rgrp_go_unlock(struct gfs2_holder *gh);
 
 extern struct gfs2_alloc *gfs2_alloc_get(struct gfs2_inode *ip);




* [Cluster-devel] [GFS2 PATCH] GFS2: Don't brelse rgrp buffer_heads every allocation
  2015-06-05 14:49 ` [Cluster-devel] [GFS2 PATCH] GFS2: Don't brelse rgrp buffer_heads every allocation Bob Peterson
@ 2015-06-08 12:18   ` Steven Whitehouse
  2015-06-09 14:45     ` Bob Peterson
  0 siblings, 1 reply; 11+ messages in thread
From: Steven Whitehouse @ 2015-06-08 12:18 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,


On 05/06/15 15:49, Bob Peterson wrote:
> Hi,
>
> This patch allows the block allocation code to retain the buffers
> for the resource groups so they don't need to be re-read from buffer
> cache with every request. This is a performance improvement that's
> especially noticeable when resource groups are very large. For
> example, with 2GB resource groups and 4K blocks, there can be 33
> blocks for every resource group. This patch allows those 33 buffers
> to be kept around and not read in and thrown away with every
> operation. The buffers are released when the resource group is
> either synced or invalidated.
The blocks should be cached between operations, so this should only be 
resulting in a skip of the look up of the cached block, and no changes 
to the actual I/O. Does that mean that grab_cache_page() is slow I 
wonder? Or is this an issue of going around the retry loop due to lack 
of memory at some stage?

How does this interact with the rgrplvb support? I'd guess that with 
that turned on, this is no longer an issue, because we'd only read in 
the blocks for the rgrps that we are actually going to use?



Steve.

> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
> ---
>   fs/gfs2/glops.c | 14 +++++++++++---
>   fs/gfs2/rgrp.c  | 23 +++++++++++++++++++----
>   fs/gfs2/rgrp.h  |  1 +
>   3 files changed, 31 insertions(+), 7 deletions(-)
>
> diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
> index fe91951..c23377f 100644
> --- a/fs/gfs2/glops.c
> +++ b/fs/gfs2/glops.c
> @@ -144,6 +144,12 @@ static void rgrp_go_sync(struct gfs2_glock *gl)
>   	struct gfs2_rgrpd *rgd;
>   	int error;
>   
> +	spin_lock(&gl->gl_spin);
> +	rgd = gl->gl_object;
> +	if (rgd)
> +		gfs2_rgrp_brelse(rgd);
> +	spin_unlock(&gl->gl_spin);
> +
>   	if (!test_and_clear_bit(GLF_DIRTY, &gl->gl_flags))
>   		return;
>   	GLOCK_BUG_ON(gl, gl->gl_state != LM_ST_EXCLUSIVE);
> @@ -175,15 +181,17 @@ static void rgrp_go_inval(struct gfs2_glock *gl, int flags)
>   {
>   	struct gfs2_sbd *sdp = gl->gl_sbd;
>   	struct address_space *mapping = &sdp->sd_aspace;
> +	struct gfs2_rgrpd *rgd = gl->gl_object;
> +
> +	if (rgd)
> +		gfs2_rgrp_brelse(rgd);
>   
>   	WARN_ON_ONCE(!(flags & DIO_METADATA));
>   	gfs2_assert_withdraw(sdp, !atomic_read(&gl->gl_ail_count));
>   	truncate_inode_pages_range(mapping, gl->gl_vm.start, gl->gl_vm.end);
>   
> -	if (gl->gl_object) {
> -		struct gfs2_rgrpd *rgd = (struct gfs2_rgrpd *)gl->gl_object;
> +	if (rgd)
>   		rgd->rd_flags &= ~GFS2_RDF_UPTODATE;
> -	}
>   }
>   
>   /**
> diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
> index cd53d6e..c6c6232 100644
> --- a/fs/gfs2/rgrp.c
> +++ b/fs/gfs2/rgrp.c
> @@ -1244,14 +1244,13 @@ int gfs2_rgrp_go_lock(struct gfs2_holder *gh)
>   }
>   
>   /**
> - * gfs2_rgrp_go_unlock - Release RG bitmaps read in with gfs2_rgrp_bh_get()
> - * @gh: The glock holder for the resource group
> + * gfs2_rgrp_brelse - Release RG bitmaps read in with gfs2_rgrp_bh_get()
> + * @rgd: The resource group
>    *
>    */
>   
> -void gfs2_rgrp_go_unlock(struct gfs2_holder *gh)
> +void gfs2_rgrp_brelse(struct gfs2_rgrpd *rgd)
>   {
> -	struct gfs2_rgrpd *rgd = gh->gh_gl->gl_object;
>   	int x, length = rgd->rd_length;
>   
>   	for (x = 0; x < length; x++) {
> @@ -1264,6 +1263,22 @@ void gfs2_rgrp_go_unlock(struct gfs2_holder *gh)
>   
>   }
>   
> +/**
> + * gfs2_rgrp_go_unlock - Unlock a rgrp glock
> + * @gh: The glock holder for the resource group
> + *
> + */
> +
> +void gfs2_rgrp_go_unlock(struct gfs2_holder *gh)
> +{
> +	struct gfs2_rgrpd *rgd = gh->gh_gl->gl_object;
> +	int demote_requested = test_bit(GLF_DEMOTE, &gh->gh_gl->gl_flags) |
> +		test_bit(GLF_PENDING_DEMOTE, &gh->gh_gl->gl_flags);
> +
> +	if (rgd && demote_requested)
> +		gfs2_rgrp_brelse(rgd);
> +}
> +
>   int gfs2_rgrp_send_discards(struct gfs2_sbd *sdp, u64 offset,
>   			     struct buffer_head *bh,
>   			     const struct gfs2_bitmap *bi, unsigned minlen, u64 *ptrimmed)
> diff --git a/fs/gfs2/rgrp.h b/fs/gfs2/rgrp.h
> index 68972ec..c0ab33f 100644
> --- a/fs/gfs2/rgrp.h
> +++ b/fs/gfs2/rgrp.h
> @@ -36,6 +36,7 @@ extern void gfs2_clear_rgrpd(struct gfs2_sbd *sdp);
>   extern int gfs2_rindex_update(struct gfs2_sbd *sdp);
>   extern void gfs2_free_clones(struct gfs2_rgrpd *rgd);
>   extern int gfs2_rgrp_go_lock(struct gfs2_holder *gh);
> +extern void gfs2_rgrp_brelse(struct gfs2_rgrpd *rgd);
>   extern void gfs2_rgrp_go_unlock(struct gfs2_holder *gh);
>   
>   extern struct gfs2_alloc *gfs2_alloc_get(struct gfs2_inode *ip);




* [Cluster-devel] [GFS2 PATCH] GFS2: Don't brelse rgrp buffer_heads every allocation
  2015-06-08 12:18   ` Steven Whitehouse
@ 2015-06-09 14:45     ` Bob Peterson
  2015-06-10 10:30       ` Steven Whitehouse
  0 siblings, 1 reply; 11+ messages in thread
From: Bob Peterson @ 2015-06-09 14:45 UTC (permalink / raw)
  To: cluster-devel.redhat.com

----- Original Message -----
> Hi,
> 
> 
> On 05/06/15 15:49, Bob Peterson wrote:
> > Hi,
> >
> > This patch allows the block allocation code to retain the buffers
> > for the resource groups so they don't need to be re-read from buffer
> > cache with every request. This is a performance improvement that's
> > especially noticeable when resource groups are very large. For
> > example, with 2GB resource groups and 4K blocks, there can be 33
> > blocks for every resource group. This patch allows those 33 buffers
> > to be kept around and not read in and thrown away with every
> > operation. The buffers are released when the resource group is
> > either synced or invalidated.
> The blocks should be cached between operations, so this should only be
> resulting in a skip of the look up of the cached block, and no changes
> to the actual I/O. Does that mean that grab_cache_page() is slow I
> wonder? Or is this an issue of going around the retry loop due to lack
> of memory at some stage?
> 
> How does this interact with the rgrplvb support? I'd guess that with
> that turned on, this is no longer an issue, because we'd only read in
> the blocks for the rgrps that we are actually going to use?
> 
> 
> 
> Steve.

Hi,

If you compare the two vmstat outputs in the bugzilla #1154782, you'll
see no significant difference in memory usage nor cpu usage. So I assume
the page lookup is the "slow" part; not because it's such a slow thing
but because it's done 33 times per read-reference-invalidate (33 pages
to look up per rgrp).
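
As a rough sanity check on that figure, assuming GFS2's two bitmap bits per
block and a small metadata header in each bitmap buffer:

    2GB rgrp / 4K block size              =  524,288 blocks to track
    524,288 blocks x 2 bits per block     =  131,072 bytes of bitmap
    131,072 bytes / ~4K usable per buffer ~= 33 bitmap buffers per rgrp

So every read-reference-release cycle repeats roughly that many page cache
lookups, even though the data never leaves the cache.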

Regards,

Bob Peterson
Red Hat File Systems




* [Cluster-devel] [GFS2 PATCH] GFS2: Don't brelse rgrp buffer_heads every allocation
  2015-06-09 14:45     ` Bob Peterson
@ 2015-06-10 10:30       ` Steven Whitehouse
  2015-06-12 19:50         ` Bob Peterson
  0 siblings, 1 reply; 11+ messages in thread
From: Steven Whitehouse @ 2015-06-10 10:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,


On 09/06/15 15:45, Bob Peterson wrote:
> ----- Original Message -----
>> Hi,
>>
>>
>> On 05/06/15 15:49, Bob Peterson wrote:
>>> Hi,
>>>
>>> This patch allows the block allocation code to retain the buffers
>>> for the resource groups so they don't need to be re-read from buffer
>>> cache with every request. This is a performance improvement that's
>>> especially noticeable when resource groups are very large. For
>>> example, with 2GB resource groups and 4K blocks, there can be 33
>>> blocks for every resource group. This patch allows those 33 buffers
>>> to be kept around and not read in and thrown away with every
>>> operation. The buffers are released when the resource group is
>>> either synced or invalidated.
>> The blocks should be cached between operations, so this should only be
>> resulting in a skip of the look up of the cached block, and no changes
>> to the actual I/O. Does that mean that grab_cache_page() is slow I
>> wonder? Or is this an issue of going around the retry loop due to lack
>> of memory at some stage?
>>
>> How does this interact with the rgrplvb support? I'd guess that with
>> that turned on, this is no longer an issue, because we'd only read in
>> the blocks for the rgrps that we are actually going to use?
>>
>>
>>
>> Steve.
> Hi,
>
> If you compare the two vmstat outputs in the bugzilla #1154782, you'll
> see no significant difference in memory usage nor cpu usage. So I assume
> the page lookup is the "slow" part; not because it's such a slow thing
> but because it's done 33 times per read-reference-invalidate (33 pages
> to look up per rgrp).
>
> Regards,
>
> Bob Peterson
> Red Hat File Systems

That's true, however, as I understand the problem here, the issue is not 
reading in the blocks for the rgrp that is eventually selected to use, 
but the reading in of those blocks for the rgrps that we reject, for 
whatever reason (full, or congested, or whatever). So with rgrplvb 
enabled, we don't then read those rgrps in off disk at all in most cases 
- so I was wondering whether that solves the problem without needing 
this change?

Ideally I'd like to make the rgrplvb setting the default, since it is 
much more efficient. The question is how we can do that and still remain 
backward compatible? Not an easy one to answer :(

Also, if the page lookup is the slow thing, then we should look at using 
pagevec_lookup() to get the pages in chunks rather than doing it 
individually (and indeed, multiple times per page, in case of block size 
less than page size). We know that the blocks will always be contiguous 
on disk, so we should be able to send down large I/Os, rather than 
relying on the block stack to merge them as we do at the moment, which 
should be a further improvement too,
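
A minimal sketch of that kind of batched lookup, assuming the current
pagevec API; the function name is a placeholder and the hand-off of the
pages back to the bitmap code is only hinted at in a comment, so treat it
as an untested illustration rather than a concrete proposal:

#include <linux/pagemap.h>
#include <linux/pagevec.h>

/*
 * Walk the cached pages backing one rgrp's bitmap blocks in batches of
 * PAGEVEC_SIZE instead of doing one page cache lookup per buffer_head.
 * "mapping" would be the rgrp metadata address space and [index, end]
 * the page range covered by the rgrp's bitmap blocks.
 */
static void rgrp_walk_cached_pages(struct address_space *mapping,
				   pgoff_t index, pgoff_t end)
{
	struct pagevec pvec;
	unsigned int i, nr;

	pagevec_init(&pvec, 0);
	while (index <= end) {
		nr = pagevec_lookup(&pvec, mapping, index, PAGEVEC_SIZE);
		if (!nr)
			break;
		for (i = 0; i < nr && pvec.pages[i]->index <= end; i++) {
			/* hand page_buffers(pvec.pages[i]) to the bitmap
			 * code here, rather than doing one gfs2_getbuf()
			 * call per block */
		}
		index = pvec.pages[nr - 1]->index + 1;
		pagevec_release(&pvec);
	}
}

Something along those lines would also make it easy to spot a run of
uncached pages and send down one large read for it.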

Steve.




* [Cluster-devel] [GFS2 PATCH] GFS2: Don't brelse rgrp buffer_heads every allocation
  2015-06-10 10:30       ` Steven Whitehouse
@ 2015-06-12 19:50         ` Bob Peterson
  2015-06-15 11:18           ` Steven Whitehouse
  0 siblings, 1 reply; 11+ messages in thread
From: Bob Peterson @ 2015-06-12 19:50 UTC (permalink / raw)
  To: cluster-devel.redhat.com

----- Original Message -----
> Hi,
> 
> 
> On 09/06/15 15:45, Bob Peterson wrote:
> > ----- Original Message -----
> >> Hi,
> >>
> >>
> >> On 05/06/15 15:49, Bob Peterson wrote:
> >>> Hi,
> >>>
> >>> This patch allows the block allocation code to retain the buffers
> >>> for the resource groups so they don't need to be re-read from buffer
> >>> cache with every request. This is a performance improvement that's
> >>> especially noticeable when resource groups are very large. For
> >>> example, with 2GB resource groups and 4K blocks, there can be 33
> >>> blocks for every resource group. This patch allows those 33 buffers
> >>> to be kept around and not read in and thrown away with every
> >>> operation. The buffers are released when the resource group is
> >>> either synced or invalidated.
> >> The blocks should be cached between operations, so this should only be
> >> resulting in a skip of the look up of the cached block, and no changes
> >> to the actual I/O. Does that mean that grab_cache_page() is slow I
> >> wonder? Or is this an issue of going around the retry loop due to lack
> >> of memory at some stage?
> >>
> >> How does this interact with the rgrplvb support? I'd guess that with
> >> that turned on, this is no longer an issue, because we'd only read in
> >> the blocks for the rgrps that we are actually going to use?
> >>
> >>
> >>
> >> Steve.
> > Hi,
> >
> > If you compare the two vmstat outputs in the bugzilla #1154782, you'll
> > see no significant difference in memory usage nor cpu usage. So I assume
> > the page lookup is the "slow" part; not because it's such a slow thing
> > but because it's done 33 times per read-reference-invalidate (33 pages
> > to look up per rgrp).
> >
> > Regards,
> >
> > Bob Peterson
> > Red Hat File Systems
> 
> Thats true, however, as I understand the problem here, the issue is not
> reading in the blocks for the rgrp that is eventually selected to use,
> but the reading in of those blocks for the rgrps that we reject, for
> whatever reason (full, or congested, or whatever). So with rgrplvb
> enabled, we don't then read those rgrps in off disk at all in most cases
> - so I was wondering whether that solves the problem without needing
> this change?
> 
> Ideally I'd like to make the rgrplvb setting the default, since it is
> much more efficient. The question is how we can do that and still remain
> backward compatible? Not an easy one to answer :(
> 
> Also, if the page lookup is the slow thing, then we should look at using
> pagevec_lookup() to get the pages in chunks rather than doing it
> individually (and indeed, multiple times per page, in case of block size
> less than page size). We know that the blocks will always be contiguous
> on disk, so we should be able to send down large I/Os, rather than
> relying on the block stack to merge them as we do at the moment, which
> should be a further improvement too,
> 
> Steve.

Hi,

The rgrplvb mount option only helps if the file system is using lock_dlm.
For lock_nolock, it's still just as slow because lock_nolock has no knowledge
of lvbs. Now, granted, that's an unusual case because GFS2 is normally used
with lock_dlm.

I like the idea of making rgrplvb the default mount option, and I don't
see a problem doing that.

I think the rgrplvb option should be compatible with this patch, but
I'll set up a test environment in order to test that they work together
harmoniously.

I also like the idea of using a pagevec for reading in multiple pages for
the rgrps, but that's another improvement for another day. If there's
not a bugzilla record open for that, perhaps we should open one.

Regards,

Bob Peterson
Red Hat File Systems




* [Cluster-devel] [GFS2 PATCH] GFS2: Don't brelse rgrp buffer_heads every allocation
  2015-06-12 19:50         ` Bob Peterson
@ 2015-06-15 11:18           ` Steven Whitehouse
  2015-06-15 13:56             ` Bob Peterson
  0 siblings, 1 reply; 11+ messages in thread
From: Steven Whitehouse @ 2015-06-15 11:18 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,


On 12/06/15 20:50, Bob Peterson wrote:
> ----- Original Message -----
>> Hi,
>>
>>
>> On 09/06/15 15:45, Bob Peterson wrote:
>>> ----- Original Message -----
>>>> Hi,
>>>>
>>>>
>>>> On 05/06/15 15:49, Bob Peterson wrote:
>>>>> Hi,
>>>>>
>>>>> This patch allows the block allocation code to retain the buffers
>>>>> for the resource groups so they don't need to be re-read from buffer
>>>>> cache with every request. This is a performance improvement that's
>>>>> especially noticeable when resource groups are very large. For
>>>>> example, with 2GB resource groups and 4K blocks, there can be 33
>>>>> blocks for every resource group. This patch allows those 33 buffers
>>>>> to be kept around and not read in and thrown away with every
>>>>> operation. The buffers are released when the resource group is
>>>>> either synced or invalidated.
>>>> The blocks should be cached between operations, so this should only be
>>>> resulting in a skip of the look up of the cached block, and no changes
>>>> to the actual I/O. Does that mean that grab_cache_page() is slow I
>>>> wonder? Or is this an issue of going around the retry loop due to lack
>>>> of memory at some stage?
>>>>
>>>> How does this interact with the rgrplvb support? I'd guess that with
>>>> that turned on, this is no longer an issue, because we'd only read in
>>>> the blocks for the rgrps that we are actually going to use?
>>>>
>>>>
>>>>
>>>> Steve.
>>> Hi,
>>>
>>> If you compare the two vmstat outputs in the bugzilla #1154782, you'll
>>> see no significant difference in memory usage nor cpu usage. So I assume
>>> the page lookup is the "slow" part; not because it's such a slow thing
>>> but because it's done 33 times per read-reference-invalidate (33 pages
>>> to look up per rgrp).
>>>
>>> Regards,
>>>
>>> Bob Peterson
>>> Red Hat File Systems
>> Thats true, however, as I understand the problem here, the issue is not
>> reading in the blocks for the rgrp that is eventually selected to use,
>> but the reading in of those blocks for the rgrps that we reject, for
>> whatever reason (full, or congested, or whatever). So with rgrplvb
>> enabled, we don't then read those rgrps in off disk at all in most cases
>> - so I was wondering whether that solves the problem without needing
>> this change?
>>
>> Ideally I'd like to make the rgrplvb setting the default, since it is
>> much more efficient. The question is how we can do that and still remain
>> backward compatible? Not an easy one to answer :(
>>
>> Also, if the page lookup is the slow thing, then we should look at using
>> pagevec_lookup() to get the pages in chunks rather than doing it
>> individually (and indeed, multiple times per page, in case of block size
>> less than page size). We know that the blocks will always be contiguous
>> on disk, so we should be able to send down large I/Os, rather than
>> relying on the block stack to merge them as we do at the moment, which
>> should be a further improvement too,
>>
>> Steve.
> Hi,
>
> The rgrplvb mount option only helps if the file system is using lock_dlm.
> For lock_nolock, it's still just as slow because lock_nolock has no knowledge
> of lvbs. Now, granted, that's an unusual case because GFS2 is normally used
> with lock_dlm.
That sounds like a bug... it should work in the same way, even with 
lock_nolock.

> I like the idea of making rgrplvb the default mount option, and I don't
> see a problem doing that.
The issue is that it will not be backwards compatible. We really need a 
way to at least easily detect if someone has a mixed cluster (some nodes 
with rgrplvb enabled, and some without) otherwise it might be confusing 
when we get odd reports of allocations failing, even when there appears 
to be free space.

We need to treat what we put into LVBs in the same way as we treat the 
on-disk format in terms of backwards compatibility.

>
> I think the rgrplvb option should be compatible with this patch, but
> I'll set up a test environment in order to test that they work together
> harmoniously.
>
> I also like the idea of using a pagevec for reading in multiple pages for
> the rgrps, but that's another improvement for another day. If there's
> not a bugzilla record open for that, perhaps we should open one.
>
> Regards,
>
> Bob Peterson
> Red Hat File Systems

If we have rgrplvb, then we may not need this patch, since we will not be 
looking up the rgrp's blocks as often. So we should see the benefit just 
by turning that on I think... at least it would be good to see whether 
there is any performance difference there. In cases where we have nodes 
competing for the rgrps, then the blocks will not be cached anyway, so 
we will gain no benefit from this patch, since we'll have to read the 
blocks anyway, hence my thought that speeding up the lookup is the way 
to go, since it will give the benefit for more different cases - both 
when the rgrps blocks are cached and uncached,


Steve.




* [Cluster-devel] [GFS2 PATCH] GFS2: Don't brelse rgrp buffer_heads every allocation
  2015-06-15 11:18           ` Steven Whitehouse
@ 2015-06-15 13:56             ` Bob Peterson
  2015-06-15 14:26               ` Steven Whitehouse
  2015-06-16 13:54               ` Bob Peterson
  0 siblings, 2 replies; 11+ messages in thread
From: Bob Peterson @ 2015-06-15 13:56 UTC (permalink / raw)
  To: cluster-devel.redhat.com

----- Original Message -----
> >>> If you compare the two vmstat outputs in the bugzilla #1154782, you'll
> >>> see no significant difference in memory usage nor cpu usage. So I assume
> >>> the page lookup is the "slow" part; not because it's such a slow thing
> >>> but because it's done 33 times per read-reference-invalidate (33 pages
> >>> to look up per rgrp).
> >>>
> >>> Regards,
> >>>
> >>> Bob Peterson
> >>> Red Hat File Systems
> >> Thats true, however, as I understand the problem here, the issue is not
> >> reading in the blocks for the rgrp that is eventually selected to use,
> >> but the reading in of those blocks for the rgrps that we reject, for
> >> whatever reason (full, or congested, or whatever). So with rgrplvb
> >> enabled, we don't then read those rgrps in off disk at all in most cases
> >> - so I was wondering whether that solves the problem without needing
> >> this change?

Actually, I believe the problem is reading in the blocks for the rgrps we
use, not the ones we reject. In this case, I think the rejected rgrps are
pretty minimal. 

> > The rgrplvb mount option only helps if the file system is using lock_dlm.
> > For lock_nolock, it's still just as slow because lock_nolock has no
> > knowledge
> > of lvbs. Now, granted, that's an unusual case because GFS2 is normally used
> > with lock_dlm.
> That sounds like a bug... it should work in the same way, even with
> lock_nolock.

Perhaps it is a bug in the rgrplvb code. I'll investigate the possibility.
Until I look into the matter, all I can tell you is that the lvb option doesn't
come near the performance of this patch. Here are some example runs:

Stock kernel with -r128:
              kB  reclen    write
         2097152      32   213428
         2097152      64   199363
         2097152     128   202046
         2097152     256   212355
         2097152     512   228691
         2097152    1024   216815

Stock kernel with -r2048:
              kB  reclen    write
         2097152      32   150471
         2097152      64   166858
         2097152     128   165517
         2097152     256   168206
         2097152     512   163427
         2097152    1024   158296

Stock kernel with -r2048 and -o rgrplvb:
              kB  reclen    write
         2097152      32   167268
         2097152      64   165654
         2097152     128   166783
         2097152     256   164070
         2097152     512   166561
         2097152    1024   166933

With my patch and -r2048:
              kB  reclen    write
         2097152      32   196209
         2097152      64   224383
         2097152     128   223108
         2097152     256   228552
         2097152     512   224295
         2097152    1024   229110

With my patch and -r2048 and -o rgrplvb:
              kB  reclen    write
         2097152      32   214281
         2097152      64   227061
         2097152     128   226949
         2097152     256   229651
         2097152     512   229196
         2097152    1024   226651

I'll see if I can track down why the rgrplvb option isn't performing as well.
I suspect the matter goes back to my first comment above. Namely, that the
slowdown goes back to the slowness of page cache lookup for the buffers of the
rgrps we are using (not rejected ones).

Regards,

Bob Peterson
Red Hat File Systems




* [Cluster-devel] [GFS2 PATCH] GFS2: Don't brelse rgrp buffer_heads every allocation
  2015-06-15 13:56             ` Bob Peterson
@ 2015-06-15 14:26               ` Steven Whitehouse
  2015-06-15 14:43                 ` Bob Peterson
  2015-06-16 13:54               ` Bob Peterson
  1 sibling, 1 reply; 11+ messages in thread
From: Steven Whitehouse @ 2015-06-15 14:26 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

On 15/06/15 14:56, Bob Peterson wrote:
> ----- Original Message -----
>>>>> If you compare the two vmstat outputs in the bugzilla #1154782, you'll
>>>>> see no significant difference in memory usage nor cpu usage. So I assume
>>>>> the page lookup is the "slow" part; not because it's such a slow thing
>>>>> but because it's done 33 times per read-reference-invalidate (33 pages
>>>>> to look up per rgrp).
>>>>>
>>>>> Regards,
>>>>>
>>>>> Bob Peterson
>>>>> Red Hat File Systems
>>>> Thats true, however, as I understand the problem here, the issue is not
>>>> reading in the blocks for the rgrp that is eventually selected to use,
>>>> but the reading in of those blocks for the rgrps that we reject, for
>>>> whatever reason (full, or congested, or whatever). So with rgrplvb
>>>> enabled, we don't then read those rgrps in off disk at all in most cases
>>>> - so I was wondering whether that solves the problem without needing
>>>> this change?
> Actually, I believe the problem is reading in the blocks for the rgrps we
> use, not the ones we reject. In this case, I think the rejected rgrps are
> pretty minimal.
>
>>> The rgrplvb mount option only helps if the file system is using lock_dlm.
>>> For lock_nolock, it's still just as slow because lock_nolock has no
>>> knowledge
>>> of lvbs. Now, granted, that's an unusual case because GFS2 is normally used
>>> with lock_dlm.
>> That sounds like a bug... it should work in the same way, even with
>> lock_nolock.
> Perhaps it is a bug in the rgrplvb code. I'll investigate the possibility.
> Until I look into the matter, all I can tell you is that the lvb option doesn't
> come near the performance of this patch. Here are some example runs:
>
> Stock kernel with -r128:
>                kB  reclen    write
>           2097152      32   213428
>           2097152      64   199363
>           2097152     128   202046
>           2097152     256   212355
>           2097152     512   228691
>           2097152    1024   216815
>
> Stock kernel with -r2048:
>                kB  reclen    write
>           2097152      32   150471
>           2097152      64   166858
>           2097152     128   165517
>           2097152     256   168206
>           2097152     512   163427
>           2097152    1024   158296
>
> Stock kernel with -r2048 and -o rgrplvb:
>                kB  reclen    write
>           2097152      32   167268
>           2097152      64   165654
>           2097152     128   166783
>           2097152     256   164070
>           2097152     512   166561
>           2097152    1024   166933
>
> With my patch and -r2048:
>                kB  reclen    write
>           2097152      32   196209
>           2097152      64   224383
>           2097152     128   223108
>           2097152     256   228552
>           2097152     512   224295
>           2097152    1024   229110
>
> With my patch and -r2048 and -o rgrplvb:
>                kB  reclen    write
>           2097152      32   214281
>           2097152      64   227061
>           2097152     128   226949
>           2097152     256   229651
>           2097152     512   229196
>           2097152    1024   226651
>
> I'll see if I can track down why the rgrplvb option isn't performing as well.
> I suspect the matter goes back to my first comment above. Namely, that the
> slowdown goes back to the slowness of page cache lookup for the buffers of the
> rgrps we are using (not rejected ones).
I'm assuming that these figures are bandwidth rather than times, since 
that appears to show that the patch makes quite a large difference. 
However, the reclen is rather small. In the 32-byte case, that's 128 
writes for each new block that's being allocated, unless of course that 
is 32k?

Steve.

> Regards,
>
> Bob Peterson
> Red Hat File Systems




* [Cluster-devel] [GFS2 PATCH] GFS2: Don't brelse rgrp buffer_heads every allocation
  2015-06-15 14:26               ` Steven Whitehouse
@ 2015-06-15 14:43                 ` Bob Peterson
  2015-06-16 10:19                   ` Steven Whitehouse
  0 siblings, 1 reply; 11+ messages in thread
From: Bob Peterson @ 2015-06-15 14:43 UTC (permalink / raw)
  To: cluster-devel.redhat.com

----- Original Message -----
> I'm assuming that these figures are bandwidth rather than times, since
> that appears to show that the patch makes quite a large difference.
> However the reclen is rather small. In the 32 bytes case, thats 128
> writes for each new block thats being allocated, unless of course that
> is 32k?
> 
> Steve.

Hi,

To do this test, I'm executing this command:
numactl --cpunodebind=0 --membind=0 /home/bob/iozone/iozone3_429/src/current/iozone -az -f /mnt/gfs2/iozone-gfs2 -n 2048m -g 2048m -y 32k -q 1m -e -i 0 -+n &> /home/bob/iozone.out

According to iozone -h, specifying -y this way is 32K, not 32 bytes.
The -q is maximum write size, in KB, so -q 1m is 1MB writes.
The -g is maximum file size, so -g 2048m is a 2GB file.

So the test varies the writes from 32K to 1MB, adjusting the number
of writes to get the file to 2GB.
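
So, for example, at the two extremes the 2GB file is written as:

    2048 MB / 32 KB records = 65,536 writes
    2048 MB /  1 MB records =  2,048 writes

and the "write" column in the tables above is the throughput iozone
reports, which should be in kB/sec by default.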

Regards,

Bob Peterson
Red Hat File Systems




* [Cluster-devel] [GFS2 PATCH] GFS2: Don't brelse rgrp buffer_heads every allocation
  2015-06-15 14:43                 ` Bob Peterson
@ 2015-06-16 10:19                   ` Steven Whitehouse
  0 siblings, 0 replies; 11+ messages in thread
From: Steven Whitehouse @ 2015-06-16 10:19 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

On 15/06/15 15:43, Bob Peterson wrote:
> ----- Original Message -----
>> I'm assuming that these figures are bandwidth rather than times, since
>> that appears to show that the patch makes quite a large difference.
>> However the reclen is rather small. In the 32 bytes case, thats 128
>> writes for each new block thats being allocated, unless of course that
>> is 32k?
>>
>> Steve.
> Hi,
>
> To do this test, I'm executing this command:
> numactl --cpunodebind=0 --membind=0 /home/bob/iozone/iozone3_429/src/current/iozone -az -f /mnt/gfs2/iozone-gfs2 -n 2048m -g 2048m -y 32k -q 1m -e -i 0 -+n &> /home/bob/iozone.out
>
> According to iozone -h, specifying -y this way is 32K, not 32 bytes.
> The -q is maximum write size, in KB, so -q 1m is 1MB writes.
> The -g is maximum file size, so -g 2048m is a 2GB file.
>
> So the test varies the writes from 32K to 1MB, adjusting the number
> of writes to get the file to 2GB.
>
> Regards,
>
> Bob Peterson
> Red Hat File Systems

Ok, that makes sense to me. I guess that there is not a lot of searching 
for the rgrps going on, which is why we are not seeing a big gain in the 
rgrplvb case. If the fs was nearly full perhaps we'd see more of a 
difference in that case.

Either way, provided the rgrplvb code can continue to work, that is 
really the important thing, so I think we should be ok with this. I'd 
still very much like to see what can be done to reduce the page look up 
cost more directly though, since that seems to be the real issue here,

Steve.




* [Cluster-devel] [GFS2 PATCH] GFS2: Don't brelse rgrp buffer_heads every allocation
  2015-06-15 13:56             ` Bob Peterson
  2015-06-15 14:26               ` Steven Whitehouse
@ 2015-06-16 13:54               ` Bob Peterson
  1 sibling, 0 replies; 11+ messages in thread
From: Bob Peterson @ 2015-06-16 13:54 UTC (permalink / raw)
  To: cluster-devel.redhat.com

----- Original Message -----
> I'll see if I can track down why the rgrplvb option isn't performing as well.
> I suspect the matter goes back to my first comment above. Namely, that the
> slowdown goes back to the slowness of page cache lookup for the buffers of
> the
> rgrps we are using (not rejected ones).

Hi,

I did verify that the vast majority of time is spent doing page cache lookups,
in function gfs2_getbuf(). The time spent in gfs2_meta_read() and gfs2_meta_wait()
is minimal--almost nil--presumably because the page is already in cache.

The rgrplvb option isn't improving this because it still calls gfs2_rgrp_bh_get(),
which does all the page cache lookups for the rgrp we need. The rgrplvb
option should, in theory, save us a lot of time when searching for a suitable
rgrp, especially when there are multiple nodes doing allocations. 

I can't think of any reason why my patch would be incompatible with rgrplvb,
but I'd like to ask Ben Marzinski to scrutinize my patch carefully and see
if he can find any flaws in the design.

Regards,

Bob Peterson
Red Hat File Systems



