* [Cluster-devel] [GFS2 PATCH] GFS2: Don't brelse rgrp buffer_heads every allocation
[not found] <1673564717.11791069.1433515261791.JavaMail.zimbra@redhat.com>
@ 2015-06-05 14:49 ` Bob Peterson
2015-06-08 12:18 ` Steven Whitehouse
0 siblings, 1 reply; 11+ messages in thread
From: Bob Peterson @ 2015-06-05 14:49 UTC (permalink / raw)
To: cluster-devel.redhat.com
Hi,
This patch allows the block allocation code to retain the buffers
for the resource groups so they don't need to be re-read from buffer
cache with every request. This is a performance improvement that's
especially noticeable when resource groups are very large. For
example, with 2GB resource groups and 4K blocks, there can be 33
blocks for every resource group. This patch allows those 33 buffers
to be kept around and not read in and thrown away with every
operation. The buffers are released when the resource group is
either synced or invalidated.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
fs/gfs2/glops.c | 14 +++++++++++---
fs/gfs2/rgrp.c | 23 +++++++++++++++++++----
fs/gfs2/rgrp.h | 1 +
3 files changed, 31 insertions(+), 7 deletions(-)
diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index fe91951..c23377f 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -144,6 +144,12 @@ static void rgrp_go_sync(struct gfs2_glock *gl)
struct gfs2_rgrpd *rgd;
int error;
+ spin_lock(&gl->gl_spin);
+ rgd = gl->gl_object;
+ if (rgd)
+ gfs2_rgrp_brelse(rgd);
+ spin_unlock(&gl->gl_spin);
+
if (!test_and_clear_bit(GLF_DIRTY, &gl->gl_flags))
return;
GLOCK_BUG_ON(gl, gl->gl_state != LM_ST_EXCLUSIVE);
@@ -175,15 +181,17 @@ static void rgrp_go_inval(struct gfs2_glock *gl, int flags)
{
struct gfs2_sbd *sdp = gl->gl_sbd;
struct address_space *mapping = &sdp->sd_aspace;
+ struct gfs2_rgrpd *rgd = gl->gl_object;
+
+ if (rgd)
+ gfs2_rgrp_brelse(rgd);
WARN_ON_ONCE(!(flags & DIO_METADATA));
gfs2_assert_withdraw(sdp, !atomic_read(&gl->gl_ail_count));
truncate_inode_pages_range(mapping, gl->gl_vm.start, gl->gl_vm.end);
- if (gl->gl_object) {
- struct gfs2_rgrpd *rgd = (struct gfs2_rgrpd *)gl->gl_object;
+ if (rgd)
rgd->rd_flags &= ~GFS2_RDF_UPTODATE;
- }
}
/**
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index cd53d6e..c6c6232 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -1244,14 +1244,13 @@ int gfs2_rgrp_go_lock(struct gfs2_holder *gh)
}
/**
- * gfs2_rgrp_go_unlock - Release RG bitmaps read in with gfs2_rgrp_bh_get()
- * @gh: The glock holder for the resource group
+ * gfs2_rgrp_brelse - Release RG bitmaps read in with gfs2_rgrp_bh_get()
+ * @rgd: The resource group
*
*/
-void gfs2_rgrp_go_unlock(struct gfs2_holder *gh)
+void gfs2_rgrp_brelse(struct gfs2_rgrpd *rgd)
{
- struct gfs2_rgrpd *rgd = gh->gh_gl->gl_object;
int x, length = rgd->rd_length;
for (x = 0; x < length; x++) {
@@ -1264,6 +1263,22 @@ void gfs2_rgrp_go_unlock(struct gfs2_holder *gh)
}
+/**
+ * gfs2_rgrp_go_unlock - Unlock a rgrp glock
+ * @gh: The glock holder for the resource group
+ *
+ */
+
+void gfs2_rgrp_go_unlock(struct gfs2_holder *gh)
+{
+ struct gfs2_rgrpd *rgd = gh->gh_gl->gl_object;
+ int demote_requested = test_bit(GLF_DEMOTE, &gh->gh_gl->gl_flags) |
+ test_bit(GLF_PENDING_DEMOTE, &gh->gh_gl->gl_flags);
+
+ if (rgd && demote_requested)
+ gfs2_rgrp_brelse(rgd);
+}
+
int gfs2_rgrp_send_discards(struct gfs2_sbd *sdp, u64 offset,
struct buffer_head *bh,
const struct gfs2_bitmap *bi, unsigned minlen, u64 *ptrimmed)
diff --git a/fs/gfs2/rgrp.h b/fs/gfs2/rgrp.h
index 68972ec..c0ab33f 100644
--- a/fs/gfs2/rgrp.h
+++ b/fs/gfs2/rgrp.h
@@ -36,6 +36,7 @@ extern void gfs2_clear_rgrpd(struct gfs2_sbd *sdp);
extern int gfs2_rindex_update(struct gfs2_sbd *sdp);
extern void gfs2_free_clones(struct gfs2_rgrpd *rgd);
extern int gfs2_rgrp_go_lock(struct gfs2_holder *gh);
+extern void gfs2_rgrp_brelse(struct gfs2_rgrpd *rgd);
extern void gfs2_rgrp_go_unlock(struct gfs2_holder *gh);
extern struct gfs2_alloc *gfs2_alloc_get(struct gfs2_inode *ip);
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [Cluster-devel] [GFS2 PATCH] GFS2: Don't brelse rgrp buffer_heads every allocation
2015-06-05 14:49 ` [Cluster-devel] [GFS2 PATCH] GFS2: Don't brelse rgrp buffer_heads every allocation Bob Peterson
@ 2015-06-08 12:18 ` Steven Whitehouse
2015-06-09 14:45 ` Bob Peterson
0 siblings, 1 reply; 11+ messages in thread
From: Steven Whitehouse @ 2015-06-08 12:18 UTC (permalink / raw)
To: cluster-devel.redhat.com
Hi,
On 05/06/15 15:49, Bob Peterson wrote:
> Hi,
>
> This patch allows the block allocation code to retain the buffers
> for the resource groups so they don't need to be re-read from buffer
> cache with every request. This is a performance improvement that's
> especially noticeable when resource groups are very large. For
> example, with 2GB resource groups and 4K blocks, there can be 33
> blocks for every resource group. This patch allows those 33 buffers
> to be kept around and not read in and thrown away with every
> operation. The buffers are released when the resource group is
> either synced or invalidated.
The blocks should be cached between operations, so this should only be
resulting in a skip of the look up of the cached block, and no changes
to the actual I/O. Does that mean that grab_cache_page() is slow I
wonder? Or is this an issue of going around the retry loop due to lack
of memory at some stage?
How does this interact with the rgrplvb support? I'd guess that with
that turned on, this is no longer an issue, because we'd only read in
the blocks for the rgrps that we are actually going to use?
Steve.
> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
> ---
> fs/gfs2/glops.c | 14 +++++++++++---
> fs/gfs2/rgrp.c | 23 +++++++++++++++++++----
> fs/gfs2/rgrp.h | 1 +
> 3 files changed, 31 insertions(+), 7 deletions(-)
>
> diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
> index fe91951..c23377f 100644
> --- a/fs/gfs2/glops.c
> +++ b/fs/gfs2/glops.c
> @@ -144,6 +144,12 @@ static void rgrp_go_sync(struct gfs2_glock *gl)
> struct gfs2_rgrpd *rgd;
> int error;
>
> + spin_lock(&gl->gl_spin);
> + rgd = gl->gl_object;
> + if (rgd)
> + gfs2_rgrp_brelse(rgd);
> + spin_unlock(&gl->gl_spin);
> +
> if (!test_and_clear_bit(GLF_DIRTY, &gl->gl_flags))
> return;
> GLOCK_BUG_ON(gl, gl->gl_state != LM_ST_EXCLUSIVE);
> @@ -175,15 +181,17 @@ static void rgrp_go_inval(struct gfs2_glock *gl, int flags)
> {
> struct gfs2_sbd *sdp = gl->gl_sbd;
> struct address_space *mapping = &sdp->sd_aspace;
> + struct gfs2_rgrpd *rgd = gl->gl_object;
> +
> + if (rgd)
> + gfs2_rgrp_brelse(rgd);
>
> WARN_ON_ONCE(!(flags & DIO_METADATA));
> gfs2_assert_withdraw(sdp, !atomic_read(&gl->gl_ail_count));
> truncate_inode_pages_range(mapping, gl->gl_vm.start, gl->gl_vm.end);
>
> - if (gl->gl_object) {
> - struct gfs2_rgrpd *rgd = (struct gfs2_rgrpd *)gl->gl_object;
> + if (rgd)
> rgd->rd_flags &= ~GFS2_RDF_UPTODATE;
> - }
> }
>
> /**
> diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
> index cd53d6e..c6c6232 100644
> --- a/fs/gfs2/rgrp.c
> +++ b/fs/gfs2/rgrp.c
> @@ -1244,14 +1244,13 @@ int gfs2_rgrp_go_lock(struct gfs2_holder *gh)
> }
>
> /**
> - * gfs2_rgrp_go_unlock - Release RG bitmaps read in with gfs2_rgrp_bh_get()
> - * @gh: The glock holder for the resource group
> + * gfs2_rgrp_brelse - Release RG bitmaps read in with gfs2_rgrp_bh_get()
> + * @rgd: The resource group
> *
> */
>
> -void gfs2_rgrp_go_unlock(struct gfs2_holder *gh)
> +void gfs2_rgrp_brelse(struct gfs2_rgrpd *rgd)
> {
> - struct gfs2_rgrpd *rgd = gh->gh_gl->gl_object;
> int x, length = rgd->rd_length;
>
> for (x = 0; x < length; x++) {
> @@ -1264,6 +1263,22 @@ void gfs2_rgrp_go_unlock(struct gfs2_holder *gh)
>
> }
>
> +/**
> + * gfs2_rgrp_go_unlock - Unlock a rgrp glock
> + * @gh: The glock holder for the resource group
> + *
> + */
> +
> +void gfs2_rgrp_go_unlock(struct gfs2_holder *gh)
> +{
> + struct gfs2_rgrpd *rgd = gh->gh_gl->gl_object;
> + int demote_requested = test_bit(GLF_DEMOTE, &gh->gh_gl->gl_flags) |
> + test_bit(GLF_PENDING_DEMOTE, &gh->gh_gl->gl_flags);
> +
> + if (rgd && demote_requested)
> + gfs2_rgrp_brelse(rgd);
> +}
> +
> int gfs2_rgrp_send_discards(struct gfs2_sbd *sdp, u64 offset,
> struct buffer_head *bh,
> const struct gfs2_bitmap *bi, unsigned minlen, u64 *ptrimmed)
> diff --git a/fs/gfs2/rgrp.h b/fs/gfs2/rgrp.h
> index 68972ec..c0ab33f 100644
> --- a/fs/gfs2/rgrp.h
> +++ b/fs/gfs2/rgrp.h
> @@ -36,6 +36,7 @@ extern void gfs2_clear_rgrpd(struct gfs2_sbd *sdp);
> extern int gfs2_rindex_update(struct gfs2_sbd *sdp);
> extern void gfs2_free_clones(struct gfs2_rgrpd *rgd);
> extern int gfs2_rgrp_go_lock(struct gfs2_holder *gh);
> +extern void gfs2_rgrp_brelse(struct gfs2_rgrpd *rgd);
> extern void gfs2_rgrp_go_unlock(struct gfs2_holder *gh);
>
> extern struct gfs2_alloc *gfs2_alloc_get(struct gfs2_inode *ip);
* [Cluster-devel] [GFS2 PATCH] GFS2: Don't brelse rgrp buffer_heads every allocation
2015-06-08 12:18 ` Steven Whitehouse
@ 2015-06-09 14:45 ` Bob Peterson
2015-06-10 10:30 ` Steven Whitehouse
0 siblings, 1 reply; 11+ messages in thread
From: Bob Peterson @ 2015-06-09 14:45 UTC (permalink / raw)
To: cluster-devel.redhat.com
----- Original Message -----
> Hi,
>
>
> On 05/06/15 15:49, Bob Peterson wrote:
> > Hi,
> >
> > This patch allows the block allocation code to retain the buffers
> > for the resource groups so they don't need to be re-read from buffer
> > cache with every request. This is a performance improvement that's
> > especially noticeable when resource groups are very large. For
> > example, with 2GB resource groups and 4K blocks, there can be 33
> > blocks for every resource group. This patch allows those 33 buffers
> > to be kept around and not read in and thrown away with every
> > operation. The buffers are released when the resource group is
> > either synced or invalidated.
> The blocks should be cached between operations, so this should only be
> resulting in a skip of the look up of the cached block, and no changes
> to the actual I/O. Does that mean that grab_cache_page() is slow I
> wonder? Or is this an issue of going around the retry loop due to lack
> of memory at some stage?
>
> How does this interact with the rgrplvb support? I'd guess that with
> that turned on, this is no longer an issue, because we'd only read in
> the blocks for the rgrps that we are actually going to use?
>
>
>
> Steve.
Hi,
If you compare the two vmstat outputs in bugzilla #1154782, you'll
see no significant difference in memory usage or CPU usage. So I assume
the page lookup is the "slow" part; not because it's such a slow thing
but because it's done 33 times per read-reference-invalidate (33 pages
to look up per rgrp).
Regards,
Bob Peterson
Red Hat File Systems
* [Cluster-devel] [GFS2 PATCH] GFS2: Don't brelse rgrp buffer_heads every allocation
2015-06-09 14:45 ` Bob Peterson
@ 2015-06-10 10:30 ` Steven Whitehouse
2015-06-12 19:50 ` Bob Peterson
0 siblings, 1 reply; 11+ messages in thread
From: Steven Whitehouse @ 2015-06-10 10:30 UTC (permalink / raw)
To: cluster-devel.redhat.com
Hi,
On 09/06/15 15:45, Bob Peterson wrote:
> ----- Original Message -----
>> Hi,
>>
>>
>> On 05/06/15 15:49, Bob Peterson wrote:
>>> Hi,
>>>
>>> This patch allows the block allocation code to retain the buffers
>>> for the resource groups so they don't need to be re-read from buffer
>>> cache with every request. This is a performance improvement that's
>>> especially noticeable when resource groups are very large. For
>>> example, with 2GB resource groups and 4K blocks, there can be 33
>>> blocks for every resource group. This patch allows those 33 buffers
>>> to be kept around and not read in and thrown away with every
>>> operation. The buffers are released when the resource group is
>>> either synced or invalidated.
>> The blocks should be cached between operations, so this should only be
>> resulting in a skip of the look up of the cached block, and no changes
>> to the actual I/O. Does that mean that grab_cache_page() is slow I
>> wonder? Or is this an issue of going around the retry loop due to lack
>> of memory at some stage?
>>
>> How does this interact with the rgrplvb support? I'd guess that with
>> that turned on, this is no longer an issue, because we'd only read in
>> the blocks for the rgrps that we are actually going to use?
>>
>>
>>
>> Steve.
> Hi,
>
> If you compare the two vmstat outputs in the bugzilla #1154782, you'll
> see no significant difference in memory usage nor cpu usage. So I assume
> the page lookup is the "slow" part; not because it's such a slow thing
> but because it's done 33 times per read-reference-invalidate (33 pages
> to look up per rgrp).
>
> Regards,
>
> Bob Peterson
> Red Hat File Systems
That's true; however, as I understand the problem here, the issue is not
reading in the blocks for the rgrp that is eventually selected to use,
but the reading in of those blocks for the rgrps that we reject, for
whatever reason (full, or congested, or whatever). So with rgrplvb
enabled, we don't then read those rgrps in off disk at all in most cases
- so I was wondering whether that solves the problem without needing
this change?
Ideally I'd like to make the rgrplvb setting the default, since it is
much more efficient. The question is how we can do that and still remain
backward compatible? Not an easy one to answer :(
Also, if the page lookup is the slow thing, then we should look at using
pagevec_lookup() to get the pages in chunks rather than doing it
individually (and indeed, multiple times per page, in case of block size
less than page size). We know that the blocks will always be contiguous
on disk, so we should be able to send down large I/Os, rather than
relying on the block stack to merge them as we do at the moment, which
should be a further improvement too,
Steve.
* [Cluster-devel] [GFS2 PATCH] GFS2: Don't brelse rgrp buffer_heads every allocation
2015-06-10 10:30 ` Steven Whitehouse
@ 2015-06-12 19:50 ` Bob Peterson
2015-06-15 11:18 ` Steven Whitehouse
0 siblings, 1 reply; 11+ messages in thread
From: Bob Peterson @ 2015-06-12 19:50 UTC (permalink / raw)
To: cluster-devel.redhat.com
----- Original Message -----
> Hi,
>
>
> On 09/06/15 15:45, Bob Peterson wrote:
> > ----- Original Message -----
> >> Hi,
> >>
> >>
> >> On 05/06/15 15:49, Bob Peterson wrote:
> >>> Hi,
> >>>
> >>> This patch allows the block allocation code to retain the buffers
> >>> for the resource groups so they don't need to be re-read from buffer
> >>> cache with every request. This is a performance improvement that's
> >>> especially noticeable when resource groups are very large. For
> >>> example, with 2GB resource groups and 4K blocks, there can be 33
> >>> blocks for every resource group. This patch allows those 33 buffers
> >>> to be kept around and not read in and thrown away with every
> >>> operation. The buffers are released when the resource group is
> >>> either synced or invalidated.
> >> The blocks should be cached between operations, so this should only be
> >> resulting in a skip of the look up of the cached block, and no changes
> >> to the actual I/O. Does that mean that grab_cache_page() is slow I
> >> wonder? Or is this an issue of going around the retry loop due to lack
> >> of memory at some stage?
> >>
> >> How does this interact with the rgrplvb support? I'd guess that with
> >> that turned on, this is no longer an issue, because we'd only read in
> >> the blocks for the rgrps that we are actually going to use?
> >>
> >>
> >>
> >> Steve.
> > Hi,
> >
> > If you compare the two vmstat outputs in the bugzilla #1154782, you'll
> > see no significant difference in memory usage nor cpu usage. So I assume
> > the page lookup is the "slow" part; not because it's such a slow thing
> > but because it's done 33 times per read-reference-invalidate (33 pages
> > to look up per rgrp).
> >
> > Regards,
> >
> > Bob Peterson
> > Red Hat File Systems
>
> Thats true, however, as I understand the problem here, the issue is not
> reading in the blocks for the rgrp that is eventually selected to use,
> but the reading in of those blocks for the rgrps that we reject, for
> whatever reason (full, or congested, or whatever). So with rgrplvb
> enabled, we don't then read those rgrps in off disk at all in most cases
> - so I was wondering whether that solves the problem without needing
> this change?
>
> Ideally I'd like to make the rgrplvb setting the default, since it is
> much more efficient. The question is how we can do that and still remain
> backward compatible? Not an easy one to answer :(
>
> Also, if the page lookup is the slow thing, then we should look at using
> pagevec_lookup() to get the pages in chunks rather than doing it
> individually (and indeed, multiple times per page, in case of block size
> less than page size). We know that the blocks will always be contiguous
> on disk, so we should be able to send down large I/Os, rather than
> relying on the block stack to merge them as we do at the moment, which
> should be a further improvement too,
>
> Steve.
Hi,
The rgrplvb mount option only helps if the file system is using lock_dlm.
For lock_nolock, it's still just as slow because lock_nolock has no knowledge
of lvbs. Now, granted, that's an unusual case because GFS2 is normally used
with lock_dlm.
I like the idea of making rgrplvb the default mount option, and I don't
see a problem doing that.
I think the rgrplvb option should be compatible with this patch, but
I'll set up a test environment in order to test that they work together
harmoniously.
I also like the idea of using a pagevec for reading in multiple pages for
the rgrps, but that's another improvement for another day. If there's
not a bugzilla record open for that, perhaps we should open one.
Regards,
Bob Peterson
Red Hat File Systems
* [Cluster-devel] [GFS2 PATCH] GFS2: Don't brelse rgrp buffer_heads every allocation
2015-06-12 19:50 ` Bob Peterson
@ 2015-06-15 11:18 ` Steven Whitehouse
2015-06-15 13:56 ` Bob Peterson
0 siblings, 1 reply; 11+ messages in thread
From: Steven Whitehouse @ 2015-06-15 11:18 UTC (permalink / raw)
To: cluster-devel.redhat.com
Hi,
On 12/06/15 20:50, Bob Peterson wrote:
> ----- Original Message -----
>> Hi,
>>
>>
>> On 09/06/15 15:45, Bob Peterson wrote:
>>> ----- Original Message -----
>>>> Hi,
>>>>
>>>>
>>>> On 05/06/15 15:49, Bob Peterson wrote:
>>>>> Hi,
>>>>>
>>>>> This patch allows the block allocation code to retain the buffers
>>>>> for the resource groups so they don't need to be re-read from buffer
>>>>> cache with every request. This is a performance improvement that's
>>>>> especially noticeable when resource groups are very large. For
>>>>> example, with 2GB resource groups and 4K blocks, there can be 33
>>>>> blocks for every resource group. This patch allows those 33 buffers
>>>>> to be kept around and not read in and thrown away with every
>>>>> operation. The buffers are released when the resource group is
>>>>> either synced or invalidated.
>>>> The blocks should be cached between operations, so this should only be
>>>> resulting in a skip of the look up of the cached block, and no changes
>>>> to the actual I/O. Does that mean that grab_cache_page() is slow I
>>>> wonder? Or is this an issue of going around the retry loop due to lack
>>>> of memory at some stage?
>>>>
>>>> How does this interact with the rgrplvb support? I'd guess that with
>>>> that turned on, this is no longer an issue, because we'd only read in
>>>> the blocks for the rgrps that we are actually going to use?
>>>>
>>>>
>>>>
>>>> Steve.
>>> Hi,
>>>
>>> If you compare the two vmstat outputs in the bugzilla #1154782, you'll
>>> see no significant difference in memory usage nor cpu usage. So I assume
>>> the page lookup is the "slow" part; not because it's such a slow thing
>>> but because it's done 33 times per read-reference-invalidate (33 pages
>>> to look up per rgrp).
>>>
>>> Regards,
>>>
>>> Bob Peterson
>>> Red Hat File Systems
>> Thats true, however, as I understand the problem here, the issue is not
>> reading in the blocks for the rgrp that is eventually selected to use,
>> but the reading in of those blocks for the rgrps that we reject, for
>> whatever reason (full, or congested, or whatever). So with rgrplvb
>> enabled, we don't then read those rgrps in off disk at all in most cases
>> - so I was wondering whether that solves the problem without needing
>> this change?
>>
>> Ideally I'd like to make the rgrplvb setting the default, since it is
>> much more efficient. The question is how we can do that and still remain
>> backward compatible? Not an easy one to answer :(
>>
>> Also, if the page lookup is the slow thing, then we should look at using
>> pagevec_lookup() to get the pages in chunks rather than doing it
>> individually (and indeed, multiple times per page, in case of block size
>> less than page size). We know that the blocks will always be contiguous
>> on disk, so we should be able to send down large I/Os, rather than
>> relying on the block stack to merge them as we do at the moment, which
>> should be a further improvement too,
>>
>> Steve.
> Hi,
>
> The rgrplvb mount option only helps if the file system is using lock_dlm.
> For lock_nolock, it's still just as slow because lock_nolock has no knowledge
> of lvbs. Now, granted, that's an unusual case because GFS2 is normally used
> with lock_dlm.
That sounds like a bug... it should work in the same way, even with
lock_nolock.
> I like the idea of making rgrplvb the default mount option, and I don't
> see a problem doing that.
The issue is that it will not be backwards compatible. We really need a
way to at least easily detect if someone has a mixed cluster (some nodes
with rgrplvb enabled, and some without); otherwise it might be confusing
when we get odd reports of allocations failing, even when there appears
to be free space. We need to treat what we put into LVBs in the same way
as we treat the on-disk format in terms of backwards compatibility.
>
> I think the rgrplvb option should be compatible with this patch, but
> I'll set up a test environment in order to test that they work together
> harmoniously.
>
> I also like the idea of using a pagevec for reading in multiple pages for
> the rgrps, but that's another improvement for another day. If there's
> not a bugzilla record open for that, perhaps we should open one.
>
> Regards,
>
> Bob Peterson
> Red Hat File Systems
If we have rgrplvb, then we may not need this patch, since we will not be
looking up the rgrp's blocks as often. So we should see the benefit just
by turning that on I think... at least it would be good to see whether
there is any performance difference there. In cases where we have nodes
competing for the rgrps, then the blocks will not be cached anyway, so
we will gain no benefit from this patch, since we'll have to read the
blocks anyway, hence my thought that speeding up the lookup is the way
to go, since it will give the benefit for more different cases - both
when the rgrps blocks are cached and uncached,
Steve.
* [Cluster-devel] [GFS2 PATCH] GFS2: Don't brelse rgrp buffer_heads every allocation
2015-06-15 11:18 ` Steven Whitehouse
@ 2015-06-15 13:56 ` Bob Peterson
2015-06-15 14:26 ` Steven Whitehouse
2015-06-16 13:54 ` Bob Peterson
0 siblings, 2 replies; 11+ messages in thread
From: Bob Peterson @ 2015-06-15 13:56 UTC (permalink / raw)
To: cluster-devel.redhat.com
----- Original Message -----
> >>> If you compare the two vmstat outputs in the bugzilla #1154782, you'll
> >>> see no significant difference in memory usage nor cpu usage. So I assume
> >>> the page lookup is the "slow" part; not because it's such a slow thing
> >>> but because it's done 33 times per read-reference-invalidate (33 pages
> >>> to look up per rgrp).
> >>>
> >>> Regards,
> >>>
> >>> Bob Peterson
> >>> Red Hat File Systems
> >> Thats true, however, as I understand the problem here, the issue is not
> >> reading in the blocks for the rgrp that is eventually selected to use,
> >> but the reading in of those blocks for the rgrps that we reject, for
> >> whatever reason (full, or congested, or whatever). So with rgrplvb
> >> enabled, we don't then read those rgrps in off disk at all in most cases
> >> - so I was wondering whether that solves the problem without needing
> >> this change?
Actually, I believe the problem is reading in the blocks for the rgrps we
use, not the ones we reject. In this case, I think the rejected rgrps are
pretty minimal.
> > The rgrplvb mount option only helps if the file system is using lock_dlm.
> > For lock_nolock, it's still just as slow because lock_nolock has no
> > knowledge
> > of lvbs. Now, granted, that's an unusual case because GFS2 is normally used
> > with lock_dlm.
> That sounds like a bug... it should work in the same way, even with
> lock_nolock.
Perhaps it is a bug in the rgrplvb code. I'll investigate the possibility.
Until I look into the matter, all I can tell you is that the lvb option doesn't
come near the performance of this patch. Here are some example runs:
Stock kernel with -r128:
kB reclen write
2097152 32 213428
2097152 64 199363
2097152 128 202046
2097152 256 212355
2097152 512 228691
2097152 1024 216815
Stock kernel with -r2048:
kB reclen write
2097152 32 150471
2097152 64 166858
2097152 128 165517
2097152 256 168206
2097152 512 163427
2097152 1024 158296
Stock kernel with -r2048 and -o rgrplvb:
kB reclen write
2097152 32 167268
2097152 64 165654
2097152 128 166783
2097152 256 164070
2097152 512 166561
2097152 1024 166933
With my patch and -r2048:
kB reclen write
2097152 32 196209
2097152 64 224383
2097152 128 223108
2097152 256 228552
2097152 512 224295
2097152 1024 229110
With my patch and -r2048 and -o rgrplvb:
kB reclen write
2097152 32 214281
2097152 64 227061
2097152 128 226949
2097152 256 229651
2097152 512 229196
2097152 1024 226651
I'll see if I can track down why the rgrplvb option isn't performing as well.
I suspect the matter goes back to my first comment above. Namely, that the
slowdown goes back to the slowness of page cache lookup for the buffers of the
rgrps we are using (not rejected ones).
Regards,
Bob Peterson
Red Hat File Systems
* [Cluster-devel] [GFS2 PATCH] GFS2: Don't brelse rgrp buffer_heads every allocation
2015-06-15 13:56 ` Bob Peterson
@ 2015-06-15 14:26 ` Steven Whitehouse
2015-06-15 14:43 ` Bob Peterson
2015-06-16 13:54 ` Bob Peterson
1 sibling, 1 reply; 11+ messages in thread
From: Steven Whitehouse @ 2015-06-15 14:26 UTC (permalink / raw)
To: cluster-devel.redhat.com
Hi,
On 15/06/15 14:56, Bob Peterson wrote:
> ----- Original Message -----
>>>>> If you compare the two vmstat outputs in the bugzilla #1154782, you'll
>>>>> see no significant difference in memory usage nor cpu usage. So I assume
>>>>> the page lookup is the "slow" part; not because it's such a slow thing
>>>>> but because it's done 33 times per read-reference-invalidate (33 pages
>>>>> to look up per rgrp).
>>>>>
>>>>> Regards,
>>>>>
>>>>> Bob Peterson
>>>>> Red Hat File Systems
>>>> Thats true, however, as I understand the problem here, the issue is not
>>>> reading in the blocks for the rgrp that is eventually selected to use,
>>>> but the reading in of those blocks for the rgrps that we reject, for
>>>> whatever reason (full, or congested, or whatever). So with rgrplvb
>>>> enabled, we don't then read those rgrps in off disk at all in most cases
>>>> - so I was wondering whether that solves the problem without needing
>>>> this change?
> Actually, I believe the problem is reading in the blocks for the rgrps we
> use, not the ones we reject. In this case, I think the rejected rgrps are
> pretty minimal.
>
>>> The rgrplvb mount option only helps if the file system is using lock_dlm.
>>> For lock_nolock, it's still just as slow because lock_nolock has no
>>> knowledge
>>> of lvbs. Now, granted, that's an unusual case because GFS2 is normally used
>>> with lock_dlm.
>> That sounds like a bug... it should work in the same way, even with
>> lock_nolock.
> Perhaps it is a bug in the rgrplvb code. I'll investigate the possibility.
> Until I look into the matter, all I can tell you is that the lvb option doesn't
> come near the performance of this patch. Here are some example runs:
>
> Stock kernel with -r128:
> kB reclen write
> 2097152 32 213428
> 2097152 64 199363
> 2097152 128 202046
> 2097152 256 212355
> 2097152 512 228691
> 2097152 1024 216815
>
> Stock kernel with -r2048:
> kB reclen write
> 2097152 32 150471
> 2097152 64 166858
> 2097152 128 165517
> 2097152 256 168206
> 2097152 512 163427
> 2097152 1024 158296
>
> Stock kernel with -r2048 and -o rgrplvb:
> kB reclen write
> 2097152 32 167268
> 2097152 64 165654
> 2097152 128 166783
> 2097152 256 164070
> 2097152 512 166561
> 2097152 1024 166933
>
> With my patch and -r2048:
> kB reclen write
> 2097152 32 196209
> 2097152 64 224383
> 2097152 128 223108
> 2097152 256 228552
> 2097152 512 224295
> 2097152 1024 229110
>
> With my patch and -r2048 and -o rgrplvb:
> kB reclen write
> 2097152 32 214281
> 2097152 64 227061
> 2097152 128 226949
> 2097152 256 229651
> 2097152 512 229196
> 2097152 1024 226651
>
> I'll see if I can track down why the rgrplvb option isn't performing as well.
> I suspect the matter goes back to my first comment above. Namely, that the
> slowdown goes back to the slowness of page cache lookup for the buffers of the
> rgrps we are using (not rejected ones).
I'm assuming that these figures are bandwidth rather than times, since
that appears to show that the patch makes quite a large difference.
However, the reclen is rather small. In the 32-byte case, that's 128
writes for each new block that's being allocated, unless of course that
is 32k?
Steve.
> Regards,
>
> Bob Peterson
> Red Hat File Systems
* [Cluster-devel] [GFS2 PATCH] GFS2: Don't brelse rgrp buffer_heads every allocation
2015-06-15 14:26 ` Steven Whitehouse
@ 2015-06-15 14:43 ` Bob Peterson
2015-06-16 10:19 ` Steven Whitehouse
0 siblings, 1 reply; 11+ messages in thread
From: Bob Peterson @ 2015-06-15 14:43 UTC (permalink / raw)
To: cluster-devel.redhat.com
----- Original Message -----
> I'm assuming that these figures are bandwidth rather than times, since
> that appears to show that the patch makes quite a large difference.
> However the reclen is rather small. In the 32 bytes case, thats 128
> writes for each new block thats being allocated, unless of course that
> is 32k?
>
> Steve.
Hi,
To do this test, I'm executing this command:
numactl --cpunodebind=0 --membind=0 /home/bob/iozone/iozone3_429/src/current/iozone -az -f /mnt/gfs2/iozone-gfs2 -n 2048m -g 2048m -y 32k -q 1m -e -i 0 -+n &> /home/bob/iozone.out
According to iozone -h, specifying -y this way is 32K, not 32 bytes.
The -q is maximum write size, in KB, so -q 1m is 1MB writes.
The -g is the maximum file size; with the m suffix, -g 2048m is a 2GB file
(matching the 2097152 kB shown in the results).
So the test varies the write size from 32K to 1MB, adjusting the number
of writes to fill the 2GB file.
Regards,
Bob Peterson
Red Hat File Systems
* [Cluster-devel] [GFS2 PATCH] GFS2: Don't brelse rgrp buffer_heads every allocation
2015-06-15 14:43 ` Bob Peterson
@ 2015-06-16 10:19 ` Steven Whitehouse
0 siblings, 0 replies; 11+ messages in thread
From: Steven Whitehouse @ 2015-06-16 10:19 UTC (permalink / raw)
To: cluster-devel.redhat.com
Hi,
On 15/06/15 15:43, Bob Peterson wrote:
> ----- Original Message -----
>> I'm assuming that these figures are bandwidth rather than times, since
>> that appears to show that the patch makes quite a large difference.
>> However the reclen is rather small. In the 32 bytes case, thats 128
>> writes for each new block thats being allocated, unless of course that
>> is 32k?
>>
>> Steve.
> Hi,
>
> To do this test, I'm executing this command:
> numactl --cpunodebind=0 --membind=0 /home/bob/iozone/iozone3_429/src/current/iozone -az -f /mnt/gfs2/iozone-gfs2 -n 2048m -g 2048m -y 32k -q 1m -e -i 0 -+n &> /home/bob/iozone.out
>
> According to iozone -h, specifying -y this way is 32K, not 32 bytes.
> The -q is maximum write size, in KB, so -q 1m is 1MB writes.
> The -g is the maximum file size; with the m suffix, -g 2048m is a 2GB file.
>
> So the test varies the write size from 32K to 1MB, adjusting the number
> of writes to fill the 2GB file.
>
> Regards,
>
> Bob Peterson
> Red Hat File Systems
Ok, that makes sense to me. I guess that there is not a lot of searching
for the rgrps going on, which is why we are not seeing a big gain in the
rgrplvb case. If the fs was nearly full perhaps we'd see more of a
difference in that case.
Either way, provided the rgrplvb code can continue to work, that is
really the important thing, so I think we should be ok with this. I'd
still very much like to see what can be done to reduce the page look up
cost more directly though, since that seems to be the real issue here,
Steve.
* [Cluster-devel] [GFS2 PATCH] GFS2: Don't brelse rgrp buffer_heads every allocation
2015-06-15 13:56 ` Bob Peterson
2015-06-15 14:26 ` Steven Whitehouse
@ 2015-06-16 13:54 ` Bob Peterson
1 sibling, 0 replies; 11+ messages in thread
From: Bob Peterson @ 2015-06-16 13:54 UTC (permalink / raw)
To: cluster-devel.redhat.com
----- Original Message -----
> I'll see if I can track down why the rgrplvb option isn't performing as well.
> I suspect the matter goes back to my first comment above. Namely, that the
> slowdown goes back to the slowness of page cache lookup for the buffers of
> the
> rgrps we are using (not rejected ones).
Hi,
I did verify that the vast majority of time is spent doing page cache lookups,
in function gfs2_getbuf(). The time spent in gfs2_meta_read() and gfs2_meta_wait()
is minimal--almost nil--presumably because the page is already in cache.
The rgrplvb option isn't improving this because it still calls gfs2_rgrp_bh_get
which still does all the page cache lookups for the rgrp we need. The rgrplvb
option should, in theory, save us a lot of time when searching for a suitable
rgrp, especially when there are multiple nodes doing allocations.
I can't think of any reason why my patch would be incompatible with rgrplvb,
but I'd like to ask Ben Marzinski to scrutinize my patch carefully and see
if he can find any flaws in the design.
Regards,
Bob Peterson
Red Hat File Systems