* [Lustre-devel] New test results for "ls -Ul"
       [not found] <4DCBA5D4.5010902@whamcloud.com>
@ 2011-05-26 13:01 ` Eric Barton
  2011-05-26 14:36   ` Fan Yong
  2011-05-30  5:51   ` Jinshan Xiong
  0 siblings, 2 replies; 7+ messages in thread
From: Eric Barton @ 2011-05-26 13:01 UTC (permalink / raw)
  To: lustre-devel

Nasf,

Interesting results.  Thank you - especially for graphing the results so thoroughly.
I'm attaching them here and cc-ing lustre-devel since these are of general interest.

I don't think your conclusion number (1), to say CLIO locking is slowing us down,
is as obvious from these results as you imply.  If you just compare the 1.8 and
patched 2.x per-file times and how they scale with #stripes you get this...

[graph: 1.8 and patched 2.x per-file 'ls -Ul' times v. stripe count]
The gradients of these lines should correspond to the additional time per stripe required
to stat each file and I've graphed these times below (ignoring the 0-stripe data for this
calculation because I'm just interested in the incremental per-stripe overhead).

[graph: incremental per-stripe overhead v. stripe count]
They show per-stripe overhead for 1.8 well above patched 2.x for the lower stripe
counts, but whereas 1.8 gets better with more stripes, patched 2.x gets worse.  I'm
guessing that at high stripe counts, 1.8 puts many concurrent glimpses on the wire
and does it quite efficiently.  I'd like to understand better how you control the #
of glimpse-aheads you keep on the wire - is it a single fixed number, or a fixed
number per OST or some other scheme?  In any case, it will be interesting to see
measurements at higher stripe counts.

Cheers,
                   Eric
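
The "incremental per-stripe overhead" graphed above is essentially the slope between
consecutive points of the per-file-time versus stripe-count data. A minimal Python sketch
of that calculation - the timings below are illustrative placeholders only, NOT the
measured data (which is in the attached result_20110512.xls):

# Incremental per-stripe overhead between consecutive stripe counts:
#   (t[i] - t[i-1]) / (n[i] - n[i-1])
def per_stripe_overhead(points):
    """points: list of (stripe_count, per_file_time) pairs sorted by stripe count."""
    out = []
    for (n0, t0), (n1, t1) in zip(points, points[1:]):
        out.append((n1, (t1 - t0) / (n1 - n0)))
    return out

# (stripe count, per-file time in ms) -- placeholder values; 0-stripe point left out
b18         = [(1, 0.90), (2, 1.00), (4, 1.18), (8, 1.50), (16, 2.00), (32, 2.80)]
b2x_patched = [(1, 0.20), (2, 0.23), (4, 0.30), (8, 0.48), (16, 0.95), (32, 2.10)]

for name, data in (("b1_8", b18), ("patched b2_x", b2x_patched)):
    print(name)
    for stripes, overhead in per_stripe_overhead(data):
        print("  up to %2d stripes: %.3f ms/stripe" % (stripes, overhead))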

From: Fan Yong [mailto:yong.fan at whamcloud.com] 
Sent: 12 May 2011 10:18 AM
To: Eric Barton
Cc: Bryon Neitzel; Ian Colle; Liang Zhen
Subject: New test results for "ls -Ul"

 

I have improved the statahead load-balance mechanism to distribute the statahead load
across more CPU units on the client, and adjusted AGL according to the CLIO lock state
machine. After those improvements, 'ls -Ul' runs faster than with the old patches,
especially on a large SMP node.

On the other hand, as the degree of parallelism increases, the lower network scheduler
becomes the performance bottleneck. So I combined my patches with Liang's SMP patches
in the test.

              | client (fat-intel-4, 24 cores) | server (client-xxx, 4 OSSes, 8 OSTs on each OSS)
 b2x_patched  | my patches + SMP patches       | my patches
 b18          | original b1_8                  | share the same server with "b2x_patched"
 b2x_original | original b2_x                  | original b2_x
Some notes:

1) Stripe count strongly affects traversal performance, and the impact is more than
linear. Even with all the patches applied to b2_x, the impact of stripe count is still
larger than on b1_8. This is related to the complex CLIO lock state machine and the
tedious iteration/repeat operations; it is not easy to make it run as efficiently as b1_8.

2) Patched b2_x is much faster than original b2_x: for traversing a 400K * 32-striped
directory it is improved by 100 times or more.

3) Patched b2_x is also faster than b1_8: within our test, patched b2_x is at least 4x
faster than b1_8, which matches the requirement in the ORNL contract.

4) Original b2_x is faster than b1_8 only for small stripe counts, no more than 4 stripes.
For larger stripe counts it is slower than b1_8, which is consistent with the ORNL test
result.

5) The largest stripe count in our test is 32. We do not have enough resources to test
larger stripe counts, and I also wonder whether it is worth testing more widely striped
directories at all - how many customers want to use a large, fully striped directory,
i.e. 1M * 160-striped items in a single directory? If that is a rare case, then spending
lots of time on it is not worthwhile.

We need to confirm with ORNL the final acceptance test cases and environment, including:
a) stripe count
b) item count
c) network latency, with or without an LNET router (we suggest without a router)
d) OST count on each OSS


Cheers,
--
Nasf

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20110526/8ddd386d/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/png
Size: 64417 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20110526/8ddd386d/attachment.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/png
Size: 57471 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20110526/8ddd386d/attachment-0001.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: result_20110512.xls
Type: application/vnd.ms-excel
Size: 61952 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20110526/8ddd386d/attachment.xls>


* [Lustre-devel] New test results for "ls -Ul"
  2011-05-26 13:01 ` [Lustre-devel] New test results for "ls -Ul" Eric Barton
@ 2011-05-26 14:36   ` Fan Yong
  2011-05-26 17:40     ` Eric Barton
  2011-05-30  5:51   ` Jinshan Xiong
  1 sibling, 1 reply; 7+ messages in thread
From: Fan Yong @ 2011-05-26 14:36 UTC (permalink / raw)
  To: lustre-devel

Hi Eric,

Thanks very much for your comparison of the results. I want to give some more
explanation of them:

1) I suspect the complex CLIO lock state machine and tedious iteration/repeat
operations affect the performance of traversing a large-striped directory; that is,
the overhead introduced by those factors is higher than in the original b1_8 I/O
stack. To measure per-stripe overhead, it is not quite fair to compare patched
lustre-2.x against lustre-1.8, because my AGL-related patches turn the operations
into an async pipeline, which hides much of that overhead, whereas b1_8 uses
synchronous glimpses with no pre-fetch. If you compare original lustre-2.x against
lustre-1.8, you will see the overhead difference. In fact, that difference can also
be seen in your second graph, just as you said: "1.8 gets better with more stripes,
patched 2.x gets worse".

2) Currently, the number of AGL RPCs in flight is limited by the statahead window.
Originally that window was only used to control MDS-side statahead; now, as soon as
an item's MDS-side attributes are ready (pre-fetched), the related OSS-side AGL RPC
can be triggered. The default statahead window size is 32, and I just used the
default value in my test. I also tested with a larger window size on Toro, but it
did not help much. I am not sure whether it would do better on more powerful
nodes/network. (A conceptual sketch of this windowed throttling follows after these
notes.)

3) For large-striped directories, the test results may not represent real
deployments, because in my test there are 8 OSTs on each OSS but the OSS CPU has
only 4 cores, which is much slower than the client node (24 cores). I found the OSS
load was quite high in the 32-striped cases. In theory there can be at most 32 * 8
concurrent AGL RPCs per OSS (a statahead window of 32 files, each contributing 8
stripes to every OSS). If we could test large-striped directories on more powerful
OSS nodes, the improvement might be better than the current results.

4) If the OSS is the performance bottleneck, that can also explain, to some degree,
why "1.8 gets better with more stripes, patched 2.x gets worse". In b1_8 the glimpse
RPCs for successive items are synchronous, so there are at most 8 concurrent glimpse
RPCs per OSS, which means less contention and therefore less overhead caused by that
contention. This is just a guess based on my experience studying SMP scaling.
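
A conceptual sketch of the windowed throttling described in note 2, in Python rather
than Lustre code - the names and numbers are illustrative only; the real mechanism
lives in the client's statahead thread:

# Model of a statahead window bounding async glimpse-ahead (AGL) concurrency:
# at most STATAHEAD_WINDOW items may have statahead/AGL work in flight at once.
import random
import threading
import time

STATAHEAD_WINDOW = 32                     # default window size mentioned above
window = threading.Semaphore(STATAHEAD_WINDOW)

def stat_one_file(index):
    # Pretend to pre-fetch MDS attributes and fire the per-stripe glimpses.
    time.sleep(random.uniform(0.001, 0.003))
    window.release()                      # slot freed once this item completes

def ls_minus_Ul(nfiles):
    workers = []
    for i in range(nfiles):
        window.acquire()                  # blocks while 32 items are in flight
        t = threading.Thread(target=stat_one_file, args=(i,))
        t.start()
        workers.append(t)
    for t in workers:
        t.join()

if __name__ == "__main__":
    start = time.time()
    ls_minus_Ul(1000)
    print("listed 1000 files in %.2fs" % (time.time() - start))

In this model the limit is a single count of items in flight for the directory being
listed, not a per-OST quota - which answers the question above about how the number of
glimpse-aheads on the wire is controlled.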


Cheers,
--
Nasf

On 5/26/11 9:01 PM, Eric Barton wrote:
>
> Nasf,
>
> Interesting results.  Thank you - especially for graphing the results 
> so thoroughly.
>
> I'm attaching them here and cc-ing lustre-devel since these are of 
> general interest.
>
> I don't think your conclusion number (1), to say CLIO locking is 
> slowing us down
>
> is as obvious from these results as you imply.  If you just compare 
> the 1.8 and
>
> patched 2.x per-file times and how they scale with #stripes you get 
> this...
>
> The gradients of these lines should correspond to the additional time 
> per stripe required
>
> to stat each file and I've graphed these times below (ignoring the 
> 0-stripe data for this
>
> calculation because I'm just interested in the incremental per-stripe 
> overhead).
>
> They show per-stripe overhead for 1.8 well above patched 2.x for the 
> lower stripe
>
> counts, but whereas 1.8 gets better with more stripes, patched 2.x 
> gets worse.  I'm
>
> guessing that at high stripe counts, 1.8 puts many concurrent glimpses 
> on the wire
>
> and does it quite efficiently.  I'd like to understand better how you 
> control the #
>
> of glimpse-aheads you keep on the wire -- is it a single fixed number, 
> or a fixed
>
> number per OST or some other scheme?  In any case, it will be 
> interesting to see
>
> measurements at higher stripe counts.
>
>     Cheers,
>                        Eric
>
> *From:*Fan Yong [mailto:yong.fan at whamcloud.com]
> *Sent:* 12 May 2011 10:18 AM
> *To:* Eric Barton
> *Cc:* Bryon Neitzel; Ian Colle; Liang Zhen
> *Subject:* New test results for "ls -Ul"
>
> I have improved statahead load balance mechanism to distribute 
> statahead load to more CPU units on client. And adjusted AGL according 
> to CLIO lock state machine. After those improvement, 'ls -Ul' can run 
> more fast than old patches, especially on large SMP node.
>
> On the other hand, as the increasing the degree of parallelism, the 
> lower network scheduler is becoming performance bottleneck. So I 
> combine my patches together with Liang's SMP patches in the test.
>
>
> 	
>
> client (fat-intel-4, 24 cores)
>
> 	
>
> server (client-xxx, 4 OSSes, 8 OSTs on each OSS)
>
> b2x_patched
>
> 	
>
> my patches + SMP patches
>
> 	
>
> my patches
>
> b18
>
> 	
>
> original b1_8
>
> 	
>
> share the same server with "b2x_patched"
>
> b2x_original
>
> 	
>
> original b2_x
>
> 	
>
> original b2_x
>
>
> Some notes:
>
> 1) Stripe count affects traversing performance much, and the impact is 
> more than linear. Even if with all the patches applied on b2_x, the 
> degree of stripe count impact is still larger than b1_8. It is related 
> with the complex CLIO lock state machine and tedious iteration/repeat 
> operations. It is not easy to make it run as efficiently as b1_8.
>
> 2) Patched b2_x is much faster than original b2_x, for traversing 400K 
> * 32-striped directory, it is 100 times or more improved.
>
> 3) Patched b2_x is also faster than b1_8, within our test, patched 
> b2_x is at least 4X faster than b1_8, which matches the requirement in 
> ORNL contract.
>
> 4) Original b2_x is faster than b1_8 only for small striped cases, not 
> more than 4-striped. For large striped cases, slower than b1_8, which 
> is consistent with ORNL test result.
>
> 5) The largest stripe count is 32 in our test. We have not enough 
> resource to test more large striped cases. And I also wonder whether 
> it is worth to test more large striped directory or not. Because how 
> many customers want to use large and full striped directory? means 
> contains 1M * 160-striped items in signal directory. If it is rare 
> case, then wasting lots of time on that is worthless.
>
> We need to confirm with ORNL what is the last acceptance test cases 
> and environment, includes:
> a) stripe count
> b) item count
> c) network latency, w/o lnet router, suggest without router.
> d) OST count on each OSS
>
>
> Cheers,
> --
> Nasf
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20110526/ebf54878/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/png
Size: 64417 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20110526/ebf54878/attachment.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/png
Size: 57471 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20110526/ebf54878/attachment-0001.png>


* [Lustre-devel] New test results for "ls -Ul"
  2011-05-26 14:36   ` Fan Yong
@ 2011-05-26 17:40     ` Eric Barton
  2011-05-26 19:36       ` Andreas Dilger
  0 siblings, 1 reply; 7+ messages in thread
From: Eric Barton @ 2011-05-26 17:40 UTC (permalink / raw)
  To: lustre-devel

Nasf,

I agree that we have to be careful comparing 1.8 and patched 2.x since 1.8 is doing
no RPC pipelining to the MDS or OSSs - however I still think (unless you can show
me the hole in my reasoning) that comparing the slopes of the time v. # stripes graphs
is fair.  These slopes correspond to the additional time it takes to stat a file with
more stripes.  Although total per-file stat times in 1.8 are dominated by RPC
round-trips to the MDS and OSSes - the OSS RPCs are all sent concurrently, so the
incremental time per stripe should be the time it takes to traverse the stack for each
stripe and issue the RPC.  Similarly for 2.x, the incremental time per stripe should
also be the time it takes to traverse the stack for each stripe and queue the async
glimpse.
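
One way to write the argument down (the notation here is introduced for illustration,
not taken from the thread): model the per-file stat time for branch b as an affine
function of the stripe count n,

    \[ t_b(n) \;\approx\; a_b + c_b\,n , \qquad c_b = \frac{\mathrm{d}t_b}{\mathrm{d}n} , \]

where a_b collects the stripe-count-independent costs (the MDS round-trip and, for 1.8,
the single round-trip of the concurrently issued glimpses) and c_b is the incremental
per-stripe cost of traversing the client stack and issuing (1.8) or queueing (2.x) one
glimpse. The claim is that comparing the slopes c_{1.8} and c_{2.x} is fair even though
the intercepts a_{1.8} and a_{2.x} differ greatly.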

 

In any case, I think measurements of higher stripe counts on a larger server cluster
will be revealing.

 

Cheers,
                   Eric 

 

From: Fan Yong [mailto:yong.fan at whamcloud.com] 
Sent: 26 May 2011 3:36 PM
To: Eric Barton
Cc: 'Bryon Neitzel'; 'Ian Colle'; 'Liang Zhen'; lustre-devel at lists.lustre.org
Subject: Re: New test results for "ls -Ul"

 

Hi Eric,

Thanks very much for your comparison of the results. I want to give more explanation for the results:

1) I suspect the complex CLIO lock state machine and tedious iteration/repeat operations affect the performance of traversing
large-striped directory, means the overhead introduced by those factors are higher than original b1_8 I/O stack. To measure
per-stripe overhead, it is unfair that you compare the results between patched lustre-2.x and luster-1.8, because my AGL related
patches are async pipeline operations, they hide much of such overhead. But b1_8 is sync glimpse and non-per-fetched. If compare
between original lustre-2.x and lustre-1.8, you will find the overhead difference. In fact, such overhead difference can be seen in
your second graph also. Just as you said: "1.8 gets better with more stripes, patched 2.x gets worse".

2) Currently, the limitation for AGL #/RPC is statahead window. Originally, such window is only used for controlling MDS-side
statahead. So means, as long as item's MDS-side attributes is ready (per-fetched), then related OSS-side AGL RPC can be triggered.
The default statahead window size is 32. In my test, I just use the default value. I also tested with larger window size on Toro,
but it did not give much help. I am not sure whether it can be better if testing against more powerful nodes/network.

3) For large-striped directory, the test results maybe not represent the real cases, because in my test, there are 8 OSTs on each
OSS, but OSS CPU is 4-cores, which is much slower than client node (24-cores CPU). I found OSS's load was quite high for 32-striped
cases. In theory, there are at most 32 * 8 concurrent AGL RPCs for each OSS. If we can test on more powerful OSS nodes for
large-stripe directory, the improvement may be better than current results.

4) If OSS is the performance bottle neck, it also can explain why "1.8 gets better with more stripes, patched 2.x gets worse" on
some degree. Because for b1_8, the glimpse RPCs between two items are sync, so there are at most 8 concurrent glimpse RPCs for each
OSS, means less contention, so less overhead caused by those contention. I just guess from the experience of studying SMP scaling.


Cheers,
--
Nasf

On 5/26/11 9:01 PM, Eric Barton wrote: 

Nasf,

 

Interesting results.  Thank you - especially for graphing the results so thoroughly.

I'm attaching them here and cc-ing lustre-devel since these are of general interest.

 

I don't think your conclusion number (1), to say CLIO locking is slowing us down

is as obvious from these results as you imply.  If you just compare the 1.8 and

patched 2.x per-file times and how they scale with #stripes you get this.

 



 

The gradients of these lines should correspond to the additional time per stripe required

to stat each file and I've graphed these times below (ignoring the 0-stripe data for this

calculation because I'm just interested in the incremental per-stripe overhead).

 



They show per-stripe overhead for 1.8 well above patched 2.x for the lower stripe

counts, but whereas 1.8 gets better with more stripes, patched 2.x gets worse.  I'm

guessing that at high stripe counts, 1.8 puts many concurrent glimpses on the wire

and does it quite efficiently.  I'd like to understand better how you control the #

of glimpse-aheads you keep on the wire - is it a single fixed number, or a fixed

number per OST or some other scheme?  In any case, it will be interesting to see

measurements at higher stripe counts.

Cheers, 
                   Eric 

From: Fan Yong [mailto:yong.fan at whamcloud.com] 
Sent: 12 May 2011 10:18 AM
To: Eric Barton
Cc: Bryon Neitzel; Ian Colle; Liang Zhen
Subject: New test results for "ls -Ul"

 

I have improved statahead load balance mechanism to distribute statahead load to more CPU units on client. And adjusted AGL
according to CLIO lock state machine. After those improvement, 'ls -Ul' can run more fast than old patches, especially on large SMP
node.

On the other hand, as the increasing the degree of parallelism, the lower network scheduler is becoming performance bottleneck. So I
combine my patches together with Liang's SMP patches in the test.

	
client (fat-intel-4, 24 cores)

server (client-xxx, 4 OSSes, 8 OSTs on each OSS)


b2x_patched

my patches + SMP patches

my patches


b18

original b1_8

share the same server with "b2x_patched"


b2x_original

original b2_x

original b2_x


Some notes:

1) Stripe count affects traversing performance much, and the impact is more than linear. Even if with all the patches applied on
b2_x, the degree of stripe count impact is still larger than b1_8. It is related with the complex CLIO lock state machine and
tedious iteration/repeat operations. It is not easy to make it run as efficiently as b1_8.

2) Patched b2_x is much faster than original b2_x, for traversing 400K * 32-striped directory, it is 100 times or more improved.

3) Patched b2_x is also faster than b1_8, within our test, patched b2_x is at least 4X faster than b1_8, which matches the
requirement in ORNL contract.

4) Original b2_x is faster than b1_8 only for small striped cases, not more than 4-striped. For large striped cases, slower than
b1_8, which is consistent with ORNL test result.

5) The largest stripe count is 32 in our test. We have not enough resource to test more large striped cases. And I also wonder
whether it is worth to test more large striped directory or not. Because how many customers want to use large and full striped
directory? means contains 1M * 160-striped items in signal directory. If it is rare case, then wasting lots of time on that is
worthless.

We need to confirm with ORNL what is the last acceptance test cases and environment, includes:
a) stripe count
b) item count
c) network latency, w/o lnet router, suggest without router.
d) OST count on each OSS


Cheers,
--
Nasf

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20110526/3ee2ea5e/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/png
Size: 64417 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20110526/3ee2ea5e/attachment.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/png
Size: 57471 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20110526/3ee2ea5e/attachment-0001.png>


* [Lustre-devel] New test results for "ls -Ul"
  2011-05-26 17:40     ` Eric Barton
@ 2011-05-26 19:36       ` Andreas Dilger
  2011-05-27  7:58         ` Fan Yong
  0 siblings, 1 reply; 7+ messages in thread
From: Andreas Dilger @ 2011-05-26 19:36 UTC (permalink / raw)
  To: lustre-devel

On May 26, 2011, at 11:40, Eric Barton wrote:
> I agree that we have to be careful comparing 1.8 and patched 2.x since 1.8
> is doing no RPC pipelining to the MDS or OSSs - however I still think
> (unless you can show me the hole in my reasoning) that comparing the slopes
> of the time v. # stripes graphs is fair.  These slopes correspond to the
> additional time it takes to stat a file with more stripes.
> 
> Although total per-file stat times in 1.8 are dominated by RPC round-trips
> to the MDS and OSSes - the OSS RPCs are all sent concurrently, so the
> incremental time per stripe should be the time it takes to traverse the
> stack for each stripe and issue the RPC.  Similarly for 2.x, the incremental
> time per stripe should also be the time it takes to traverse the stack for
> each stripe and queue the async glimpse.

I'm not sure this is correct.  In the 1.8 case, the parallel OST glimpse RPCs are amortizing (per stripe) the higher (more visible) round-trip MDT RPC time.  That means the 1.8 MDT RPC time per OST stripe is shrinking, which makes it appear that 1.8 is more efficient at higher stripe counts, while in 2.x an increase in stripes can only increase the total time, because there is less opportunity for the N-stripe glimpse RPC latency to be hidden by async RPCs.
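
Read quantitatively - and only if the per-stripe figure being graphed is total time
divided by stripe count - the amortization argument is

    \[ \frac{t_{1.8}(n)}{n} \;\approx\; \frac{t_{\mathrm{MDT}} + t_{\mathrm{glimpse}}}{n} + c_{1.8} , \]

which shrinks as n grows because the fixed MDT round-trip (and the single
concurrent-glimpse round-trip) is spread over more stripes, whereas in 2.x those fixed
costs are already hidden by the async pipeline, so extra stripes can only add time.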

I think the good news is that regardless of whether the 2.x client stack is less efficient, the overall improvements made to 2.x are stunning, and it is reasonable to have higher CPU overhead on the client for this substantial performance improvement visible to users/applications.

> In any case, I think measurements of higher stripe counts on a larger server
> cluster will be revealing.
>  
> Cheers,
>                    Eric
>  
> From: Fan Yong [mailto:yong.fan at whamcloud.com] 
> Sent: 26 May 2011 3:36 PM
> To: Eric Barton
> Cc: 'Bryon Neitzel'; 'Ian Colle'; 'Liang Zhen'; lustre-devel at lists.lustre.org
> Subject: Re: New test results for "ls -Ul"
>  
> Hi Eric,
> 
> Thanks very much for your comparison of the results. I want to give more explanation for the results:
> 
> 1) I suspect the complex CLIO lock state machine and tedious iteration/repeat operations affect the performance of traversing large-striped directory, means the overhead introduced by those factors are higher than original b1_8 I/O stack. To measure per-stripe overhead, it is unfair that you compare the results between patched lustre-2.x and luster-1.8, because my AGL related patches are async pipeline operations, they hide much of such overhead. But b1_8 is sync glimpse and non-per-fetched. If compare between original lustre-2.x and lustre-1.8, you will find the overhead difference. In fact, such overhead difference can be seen in your second graph also. Just as you said: "1.8 gets better with more stripes, patched 2.x gets worse".
> 
> 2) Currently, the limitation for AGL #/RPC is statahead window. Originally, such window is only used for controlling MDS-side statahead. So means, as long as item's MDS-side attributes is ready (per-fetched), then related OSS-side AGL RPC can be triggered. The default statahead window size is 32. In my test, I just use the default value. I also tested with larger window size on Toro, but it did not give much help. I am not sure whether it can be better if testing against more powerful nodes/network.
> 
> 3) For large-striped directory, the test results maybe not represent the real cases, because in my test, there are 8 OSTs on each OSS, but OSS CPU is 4-cores, which is much slower than client node (24-cores CPU). I found OSS's load was quite high for 32-striped cases. In theory, there are at most 32 * 8 concurrent AGL RPCs for each OSS. If we can test on more powerful OSS nodes for large-stripe directory, the improvement may be better than current results.
> 
> 4) If OSS is the performance bottle neck, it also can explain why "1.8 gets better with more stripes, patched 2.x gets worse" on some degree. Because for b1_8, the glimpse RPCs between two items are sync, so there are at most 8 concurrent glimpse RPCs for each OSS, means less contention, so less overhead caused by those contention. I just guess from the experience of studying SMP scaling.
> 
> 
> Cheers,
> --
> Nasf
> 
> On 5/26/11 9:01 PM, Eric Barton wrote:
> Nasf,
>  
> Interesting results.  Thank you - especially for graphing the results so thoroughly.
> I'm attaching them here and cc-ing lustre-devel since these are of general interest.
>  
> I don't think your conclusion number (1), to say CLIO locking is slowing us down
> is as obvious from these results as you imply.  If you just compare the 1.8 and
> patched 2.x per-file times and how they scale with #stripes you get this...
>  
> <image001.png>
>  
> The gradients of these lines should correspond to the additional time per stripe required
> to stat each file and I've graphed these times below (ignoring the 0-stripe data for this
> calculation because I'm just interested in the incremental per-stripe overhead).
>  
> <image002.png>
> They show per-stripe overhead for 1.8 well above patched 2.x for the lower stripe
> counts, but whereas 1.8 gets better with more stripes, patched 2.x gets worse.  I'm
> guessing that at high stripe counts, 1.8 puts many concurrent glimpses on the wire
> and does it quite efficiently.  I'd like to understand better how you control the #
> of glimpse-aheads you keep on the wire - is it a single fixed number, or a fixed
> number per OST or some other scheme?  In any case, it will be interesting to see
> measurements at higher stripe counts.
> Cheers, 
>                    Eric
> From: Fan Yong [mailto:yong.fan at whamcloud.com] 
> Sent: 12 May 2011 10:18 AM
> To: Eric Barton
> Cc: Bryon Neitzel; Ian Colle; Liang Zhen
> Subject: New test results for "ls -Ul"
>  
> I have improved statahead load balance mechanism to distribute statahead load to more CPU units on client. And adjusted AGL according to CLIO lock state machine. After those improvement, 'ls -Ul' can run more fast than old patches, especially on large SMP node.
> 
> On the other hand, as the increasing the degree of parallelism, the lower network scheduler is becoming performance bottleneck. So I combine my patches together with Liang's SMP patches in the test.
> 
> client (fat-intel-4, 24 cores)
> server (client-xxx, 4 OSSes, 8 OSTs on each OSS)
> b2x_patched
> my patches + SMP patches
> my patches
> b18
> original b1_8
> share the same server with "b2x_patched"
> b2x_original
> original b2_x
> original b2_x
> 
> Some notes:
> 
> 1) Stripe count affects traversing performance much, and the impact is more than linear. Even if with all the patches applied on b2_x, the degree of stripe count impact is still larger than b1_8. It is related with the complex CLIO lock state machine and tedious iteration/repeat operations. It is not easy to make it run as efficiently as b1_8.
> 
> 2) Patched b2_x is much faster than original b2_x, for traversing 400K * 32-striped directory, it is 100 times or more improved.
> 
> 3) Patched b2_x is also faster than b1_8, within our test, patched b2_x is at least 4X faster than b1_8, which matches the requirement in ORNL contract.
> 
> 4) Original b2_x is faster than b1_8 only for small striped cases, not more than 4-striped. For large striped cases, slower than b1_8, which is consistent with ORNL test result.
> 
> 5) The largest stripe count is 32 in our test. We have not enough resource to test more large striped cases. And I also wonder whether it is worth to test more large striped directory or not. Because how many customers want to use large and full striped directory? means contains 1M * 160-striped items in signal directory. If it is rare case, then wasting lots of time on that is worthless.
> 
> We need to confirm with ORNL what is the last acceptance test cases and environment, includes:
> a) stripe count
> b) item count
> c) network latency, w/o lnet router, suggest without router.
> d) OST count on each OSS
> 
> 
> Cheers,
> --
> Nasf
>  
> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel


Cheers, Andreas
--
Andreas Dilger 
Principal Engineer
Whamcloud, Inc.


* [Lustre-devel] New test results for "ls -Ul"
  2011-05-26 19:36       ` Andreas Dilger
@ 2011-05-27  7:58         ` Fan Yong
  0 siblings, 0 replies; 7+ messages in thread
From: Fan Yong @ 2011-05-27  7:58 UTC (permalink / raw)
  To: lustre-devel

On 5/27/11 3:36 AM, Andreas Dilger wrote:
> On May 26, 2011, at 11:40, Eric Barton wrote:
>> I agree that we have to be careful comparing 1.8 and patched 2.x since 1.8
>> is doing no RPC pipelining to the MDS or OSSs - however I still think
>> (unless you can show me the hole in my reasoning) that comparing the slopes
>> of the time v. # stripes graphs is fair.  These slopes correspond to the
>> additional time it takes to stat a file with more stripes.
>>
>> Although total per-file stat times in 1.8 are dominated by RPC round-trips
>> to the MDS and OSSes - the OSS RPCs are all sent concurrently, so the
>> incremental time per stripe should be the time it takes to traverse the
>> stack for each stripe and issue the RPC.  Similarly for 2.x, the incremental
>> time per stripe should also be the time it takes to traverse the stack for
>> each stripe and queue the async glimpse.
> I'm not sure this is correct.  In the 1.8 case, the parallel OST glimpse RPCs are amortizing (per stripe) the higher (more visible) round-trip MDT RPC time.  That means the 1.8 MDT RPC time per OST stripe is shrinking, which makes it appear that 1.8 is more efficient with higher stripe counts, while in 2.x the increase in stripes can only increase the total time, because there is less opportunity the N-stripe glimpse RPC latency can be hidden by async RPCs.
>
> I think the good news is that regardless of whether the 2.x client stack is less efficient, the overall improvements made to 2.x are stunning, and it is reasonable to have higher CPU overhead on the client for this substantial performance improvement visible to users/applications.

Yes, it has greatly improved traversal performance for stripe counts up to 32,
especially for large directories. For more widely striped directories we have no
direct test results yet. What I worry about is whether the per-stripe overhead will
grow rapidly or not; I think that is also why Eric suggested measuring larger stripe
counts.


Cheers,
--
Nasf

>> In any case, I think measurements of higher stripe counts on a larger server
>> cluster will be revealing.
>>
>> Cheers,
>>                     Eric
>>
>> From: Fan Yong [mailto:yong.fan at whamcloud.com]
>> Sent: 26 May 2011 3:36 PM
>> To: Eric Barton
>> Cc: 'Bryon Neitzel'; 'Ian Colle'; 'Liang Zhen'; lustre-devel at lists.lustre.org
>> Subject: Re: New test results for "ls -Ul"
>>
>> Hi Eric,
>>
>> Thanks very much for your comparison of the results. I want to give more explanation for the results:
>>
>> 1) I suspect the complex CLIO lock state machine and tedious iteration/repeat operations affect the performance of traversing large-striped directory, means the overhead introduced by those factors are higher than original b1_8 I/O stack. To measure per-stripe overhead, it is unfair that you compare the results between patched lustre-2.x and luster-1.8, because my AGL related patches are async pipeline operations, they hide much of such overhead. But b1_8 is sync glimpse and non-per-fetched. If compare between original lustre-2.x and lustre-1.8, you will find the overhead difference. In fact, such overhead difference can be seen in your second graph also. Just as you said: "1.8 gets better with more stripes, patched 2.x gets worse".
>>
>> 2) Currently, the limitation for AGL #/RPC is statahead window. Originally, such window is only used for controlling MDS-side statahead. So means, as long as item's MDS-side attributes is ready (per-fetched), then related OSS-side AGL RPC can be triggered. The default statahead window size is 32. In my test, I just use the default value. I also tested with larger window size on Toro, but it did not give much help. I am not sure whether it can be better if testing against more powerful nodes/network.
>>
>> 3) For large-striped directory, the test results maybe not represent the real cases, because in my test, there are 8 OSTs on each OSS, but OSS CPU is 4-cores, which is much slower than client node (24-cores CPU). I found OSS's load was quite high for 32-striped cases. In theory, there are at most 32 * 8 concurrent AGL RPCs for each OSS. If we can test on more powerful OSS nodes for large-stripe directory, the improvement may be better than current results.
>>
>> 4) If OSS is the performance bottle neck, it also can explain why "1.8 gets better with more stripes, patched 2.x gets worse" on some degree. Because for b1_8, the glimpse RPCs between two items are sync, so there are at most 8 concurrent glimpse RPCs for each OSS, means less contention, so less overhead caused by those contention. I just guess from the experience of studying SMP scaling.
>>
>>
>> Cheers,
>> --
>> Nasf
>>
>> On 5/26/11 9:01 PM, Eric Barton wrote:
>> Nasf,
>>
>> Interesting results.  Thank you - especially for graphing the results so thoroughly.
>> I'm attaching them here and cc-ing lustre-devel since these are of general interest.
>>
>> I don't think your conclusion number (1), to say CLIO locking is slowing us down
>> is as obvious from these results as you imply.  If you just compare the 1.8 and
>> patched 2.x per-file times and how they scale with #stripes you get this...
>>
>> <image001.png>
>>
>> The gradients of these lines should correspond to the additional time per stripe required
>> to stat each file and I've graphed these times below (ignoring the 0-stripe data for this
>> calculation because I'm just interested in the incremental per-stripe overhead).
>>
>> <image002.png>
>> They show per-stripe overhead for 1.8 well above patched 2.x for the lower stripe
>> counts, but whereas 1.8 gets better with more stripes, patched 2.x gets worse.  I'm
>> guessing that at high stripe counts, 1.8 puts many concurrent glimpses on the wire
>> and does it quite efficiently.  I'd like to understand better how you control the #
>> of glimpse-aheads you keep on the wire - is it a single fixed number, or a fixed
>> number per OST or some other scheme?  In any case, it will be interesting to see
>> measurements at higher stripe counts.
>> Cheers,
>>                     Eric
>> From: Fan Yong [mailto:yong.fan at whamcloud.com]
>> Sent: 12 May 2011 10:18 AM
>> To: Eric Barton
>> Cc: Bryon Neitzel; Ian Colle; Liang Zhen
>> Subject: New test results for "ls -Ul"
>>
>> I have improved statahead load balance mechanism to distribute statahead load to more CPU units on client. And adjusted AGL according to CLIO lock state machine. After those improvement, 'ls -Ul' can run more fast than old patches, especially on large SMP node.
>>
>> On the other hand, as the increasing the degree of parallelism, the lower network scheduler is becoming performance bottleneck. So I combine my patches together with Liang's SMP patches in the test.
>>
>> client (fat-intel-4, 24 cores)
>> server (client-xxx, 4 OSSes, 8 OSTs on each OSS)
>> b2x_patched
>> my patches + SMP patches
>> my patches
>> b18
>> original b1_8
>> share the same server with "b2x_patched"
>> b2x_original
>> original b2_x
>> original b2_x
>>
>> Some notes:
>>
>> 1) Stripe count affects traversing performance much, and the impact is more than linear. Even if with all the patches applied on b2_x, the degree of stripe count impact is still larger than b1_8. It is related with the complex CLIO lock state machine and tedious iteration/repeat operations. It is not easy to make it run as efficiently as b1_8.
>>
>> 2) Patched b2_x is much faster than original b2_x, for traversing 400K * 32-striped directory, it is 100 times or more improved.
>>
>> 3) Patched b2_x is also faster than b1_8, within our test, patched b2_x is at least 4X faster than b1_8, which matches the requirement in ORNL contract.
>>
>> 4) Original b2_x is faster than b1_8 only for small striped cases, not more than 4-striped. For large striped cases, slower than b1_8, which is consistent with ORNL test result.
>>
>> 5) The largest stripe count is 32 in our test. We have not enough resource to test more large striped cases. And I also wonder whether it is worth to test more large striped directory or not. Because how many customers want to use large and full striped directory? means contains 1M * 160-striped items in signal directory. If it is rare case, then wasting lots of time on that is worthless.
>>
>> We need to confirm with ORNL what is the last acceptance test cases and environment, includes:
>> a) stripe count
>> b) item count
>> c) network latency, w/o lnet router, suggest without router.
>> d) OST count on each OSS
>>
>>
>> Cheers,
>> --
>> Nasf
>>
>> _______________________________________________
>> Lustre-devel mailing list
>> Lustre-devel at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-devel
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Engineer
> Whamcloud, Inc.
>
>
>


* [Lustre-devel] New test results for "ls -Ul"
  2011-05-26 13:01 ` [Lustre-devel] New test results for "ls -Ul" Eric Barton
  2011-05-26 14:36   ` Fan Yong
@ 2011-05-30  5:51   ` Jinshan Xiong
  2011-05-30  8:11     ` Fan Yong
  1 sibling, 1 reply; 7+ messages in thread
From: Jinshan Xiong @ 2011-05-30  5:51 UTC (permalink / raw)
  To: lustre-devel


On May 26, 2011, at 6:01 AM, Eric Barton wrote:

> Nasf,
>  
> Interesting results.  Thank you - especially for graphing the results so thoroughly.
> I'm attaching them here and cc-ing lustre-devel since these are of general interest.
>  
> I don't think your conclusion number (1), to say CLIO locking is slowing us down
> is as obvious from these results as you imply.  If you just compare the 1.8 and
> patched 2.x per-file times and how they scale with #stripes you get this...
>  
> <image001.png>
>  
> The gradients of these lines should correspond to the additional time per stripe required
> to stat each file and I've graphed these times below (ignoring the 0-stripe data for this
> calculation because I'm just interested in the incremental per-stripe overhead).
>  
> <image004.png>
> They show per-stripe overhead for 1.8 well above patched 2.x for the lower stripe
> counts, but whereas 1.8 gets better with more stripes, patched 2.x gets worse.  I'm
> guessing that at high stripe counts, 1.8 puts many concurrent glimpses on the wire
> and does it quite efficiently.  I'd like to understand better how you control the #
> of glimpse-aheads you keep on the wire - is it a single fixed number, or a fixed
> number per OST or some other scheme?  In any case, it will be interesting to see
> measurements at higher stripe counts.
> Cheers, 
>                    Eric
> From: Fan Yong [mailto:yong.fan at whamcloud.com] 
> Sent: 12 May 2011 10:18 AM
> To: Eric Barton
> Cc: Bryon Neitzel; Ian Colle; Liang Zhen
> Subject: New test results for "ls -Ul"
>  
> I have improved statahead load balance mechanism to distribute statahead load to more CPU units on client. And adjusted AGL according to CLIO lock state machine. After those improvement, 'ls -Ul' can run more fast than old patches, especially on large SMP node.
> 
> On the other hand, as the increasing the degree of parallelism, the lower network scheduler is becoming performance bottleneck. So I combine my patches together with Liang's SMP patches in the test.
> 
> client (fat-intel-4, 24 cores)
> server (client-xxx, 4 OSSes, 8 OSTs on each OSS)
> b2x_patched
> my patches + SMP patches
> my patches
> b18
> original b1_8
> share the same server with "b2x_patched"
> b2x_original
> original b2_x
> original b2_x
> 
> Some notes:
> 
> 1) Stripe count affects traversing performance much, and the impact is more than linear. Even if with all the patches applied on b2_x, the degree of stripe count impact is still larger than b1_8. It is related with the complex CLIO lock state machine and tedious iteration/repeat operations. It is not easy to make it run as efficiently as b1_8.


Hi there,

I did some tests to investigate the overhead of the CLIO lock state machine and glimpse
locks, and I found something new.

Basically I did the same thing Nasf had done, but I only cared about the overhead of the
glimpse locks. For this purpose I ran 'ls -lU' twice for each test: the first run is only
used to populate the cache of IBITS UPDATE locks for the files; then I dropped the
cl_locks and ldlm_locks from the client-side cache by setting the lru_size of the ldlm
namespaces to zero, and ran 'ls -lU' once again. In the second run of 'ls -lU' the
statahead thread always finds a cached IBITS lock (we can check the mdc lock_count to be
sure), so the elapsed time of ls is glimpse-related.

This is what I got from the test:

[graph attached: 'ls -Ul' time for the second run]
Description and test environment:
- `ls -Ul time' means the time to finish the second run; 
- 100K means 100K files under the same directory; 400K means 400K files under the same directory;
- there are two OSSes in my test, and each OSS has 8 OSTs; OSTs are crossed over on two OSSes, i.e., OST0, 2, 4,.. are on OSS0; 1, 3, 5, .. are on OSS1;
- each node has 12G memory, 4 CPU cores;
- latest lustre-master build, b140

and, prorated per-stripe overhead:

[graph attached: prorated per-stripe overhead]
From the above test it is very hard to conclude that cl_lock is what makes the ls time
increase with the stripe count.

Here is the test script I used (attached as test.tgz), and the test output is attached as well. Please let me know if I missed something.
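
The attached script is not reproduced here; purely to illustrate the two-pass procedure
described above, a hypothetical Python sketch - the mount point, directory name and lctl
parameter pattern are assumptions, and on some Lustre versions "lru_size=clear" is used
instead of 0 to drop cached locks:

# Hypothetical reconstruction of the two-pass "ls -lU" measurement -- NOT the
# attached test.tgz.  Paths and parameter names are assumptions.
import subprocess
import time

TESTDIR = "/mnt/lustre/testdir"          # assumed Lustre mount point / test directory

def drop_osc_lock_cache():
    # Drop cached locks on the OSC namespaces so glimpses must be re-issued,
    # while the MDC IBITS locks cached by the first pass are left alone.
    subprocess.check_call(["lctl", "set_param", "ldlm.namespaces.*osc*.lru_size=0"])

def timed_ls():
    start = time.time()
    with open("/dev/null", "w") as devnull:
        subprocess.check_call(["ls", "-lU", TESTDIR], stdout=devnull)
    return time.time() - start

timed_ls()                               # 1st run: populate the IBITS UPDATE lock cache
drop_osc_lock_cache()                    # drop OST-side lock caches in between
print("second 'ls -lU' took %.2fs" % timed_ls())   # 2nd run: glimpse-related time only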





===================
Let's take a step back and reconsider the real cause in Nasf's test. I tend to think the
load on the OSSes might cause that symptom. Async Glimpse Lock obviously puts more stress
on the OSS, especially in his test environment where multiple OSTs actually sit on the
same OSS. This will also make the ls time increase with the stripe count, since the OSS
has to handle more RPCs in a given time as the stripe count grows. The problem may be
mitigated by distributing the OSTs across more OSSes.

Thanks,
Jinshan

> 
> 2) Patched b2_x is much faster than original b2_x, for traversing 400K * 32-striped directory, it is 100 times or more improved.
> 
> 3) Patched b2_x is also faster than b1_8, within our test, patched b2_x is at least 4X faster than b1_8, which matches the requirement in ORNL contract.
> 
> 4) Original b2_x is faster than b1_8 only for small striped cases, not more than 4-striped. For large striped cases, slower than b1_8, which is consistent with ORNL test result.
> 
> 5) The largest stripe count is 32 in our test. We have not enough resource to test more large striped cases. And I also wonder whether it is worth to test more large striped directory or not. Because how many customers want to use large and full striped directory? means contains 1M * 160-striped items in signal directory. If it is rare case, then wasting lots of time on that is worthless.
> 
> We need to confirm with ORNL what is the last acceptance test cases and environment, includes:
> a) stripe count
> b) item count
> c) network latency, w/o lnet router, suggest without router.
> d) OST count on each OSS
> 
> 
> Cheers,
> --
> Nasf
> <result_20110512.xls>_______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20110529/9ad0cc05/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 1.png
Type: image/png
Size: 58740 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20110529/9ad0cc05/attachment.png>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20110529/9ad0cc05/attachment-0001.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 2.png
Type: image/png
Size: 65244 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20110529/9ad0cc05/attachment-0001.png>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20110529/9ad0cc05/attachment-0002.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.tgz
Type: application/octet-stream
Size: 1516 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20110529/9ad0cc05/attachment.obj>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20110529/9ad0cc05/attachment-0003.htm>


* [Lustre-devel] New test results for "ls -Ul"
  2011-05-30  5:51   ` Jinshan Xiong
@ 2011-05-30  8:11     ` Fan Yong
  0 siblings, 0 replies; 7+ messages in thread
From: Fan Yong @ 2011-05-30  8:11 UTC (permalink / raw)
  To: lustre-devel

Inline comments follow:

On 5/30/11 1:51 PM, Jinshan Xiong wrote:
>
> On May 26, 2011, at 6:01 AM, Eric Barton wrote:
>
>> Nasf,
>> Interesting results.  Thank you - especially for graphing the results 
>> so thoroughly.
>> I'm attaching them here and cc-ing lustre-devel since these are of
>> general interest.
>> I don't think your conclusion number (1), to say CLIO locking is
>> slowing us down
>> is as obvious from these results as you imply.  If you just compare 
>> the 1.8 and
>> patched 2.x per-file times and how they scale with #stripes you get this...
>> <image001.png>
>> The gradients of these lines should correspond to the additional time 
>> per stripe required
>> to stat each file and I've graphed these times below (ignoring the
>> 0-stripe data for this
>> calculation because I'm just interested in the incremental per-stripe
>> overhead).
>> <image004.png>
>> They show per-stripe overhead for 1.8 well above patched 2.x for the 
>> lower stripe
>> counts, but whereas 1.8 gets better with more stripes, patched 2.x 
>> gets worse.  I'm
>> guessing that at high stripe counts, 1.8 puts many concurrent 
>> glimpses on the wire
>> and does it quite efficiently.  I'd like to understand better how you
>> control the #
>> of glimpse-aheads you keep on the wire - is it a single fixed number,
>> or a fixed
>> number per OST or some other scheme?  In any case, it will be 
>> interesting to see
>> measurements at higher stripe counts.
>>
>>     Cheers,
>>                        Eric
>>
>> *From:*Fan Yong [mailto:yong.fan at whamcloud.com]
>> *Sent:*12 May 2011 10:18 AM
>> *To:*Eric Barton
>> *Cc:*Bryon Neitzel; Ian Colle; Liang Zhen
>> *Subject:*New test results for "ls -Ul"
>>
>> I have improved statahead load balance mechanism to distribute 
>> statahead load to more CPU units on client. And adjusted AGL 
>> according to CLIO lock state machine. After those improvement, 'ls 
>> -Ul' can run more fast than old patches, especially on large SMP node.
>>
>> On the other hand, as the increasing the degree of parallelism, the 
>> lower network scheduler is becoming performance bottleneck. So I 
>> combine my patches together with Liang's SMP patches in the test.
>>
>>
>> 	
>> client (fat-intel-4, 24 cores)
>> 	
>> server (client-xxx, 4 OSSes, 8 OSTs on each OSS)
>> b2x_patched
>> 	
>> my patches + SMP patches
>> 	
>> my patches
>> b18
>> 	
>> original b1_8
>> 	
>> share the same server with "b2x_patched"
>> b2x_original
>> 	
>> original b2_x
>> 	
>> original b2_x
>>
>>
>> Some notes:
>>
>> 1) Stripe count affects traversing performance much, and the impact 
>> is more than linear. Even if with all the patches applied on b2_x, 
>> the degree of stripe count impact is still larger than b1_8. It is 
>> related with the complex CLIO lock state machine and tedious 
>> iteration/repeat operations. It is not easy to make it run as 
>> efficiently as b1_8.
>
>
> Hi there,
>
> I did some tests to investigate the overhead of clio lock state 
> machine and glimpse lock, and I found something new.
>
> Basically I did the same thing as what Nasf had done, but I only cared 
> about the overhead of glimpse locks. For this purpose, I ran 'ls -lU' 
> twice for each test, and the 1st run is only used to create IBITS 
> UPDATE lock cache for files; then, I dropped cl_locks and ldlm_locks 
> from client side cache by setting zero to lru_size of ldlm namespaces, 
> then do 'ls -lU' once again. In the second run of 'ls -lU', the 
> statahead thread will always find cached IBITS lock(we can check mdc 
> lock_count for sure), so the elapsed time of ls will be glimpse related.
>
> This is what I got from the test:
>
>
>
>
>
> Description and test environment:
> - `ls -Ul time' means the time to finish the second run;
> - 100K means 100K files under the same directory; 400K means 400K 
> files under the same directory;
> - there are two OSSes in my test, and each OSS has 8 OSTs; OSTs are 
> crossed over on two OSSes, i.e., OST0, 2, 4,.. are on OSS0; 1, 3, 5, 
> .. are on OSS1;
> - each node has 12G memory, 4 CPU cores;
> - latest lustre-master build, b140
>
> and, prorated per stripe overhead:
>
>
>
>
>
> From the above test, it's very hard to make the conclusion that 
> cl_lock causes the increase of ls time by the stripe count.
>
> Here is the test script I used to do the test, and test output is 
> attached as well. Please let me know if I missed something.


In theory, the glimpse RPCs for the stripes of the same file should be processed in
parallel, so a higher stripe count should mean a lower average per-stripe overhead - at
least, that is the expectation. A flat line does not by itself show that the overhead is
small enough. I suggest comparing with b1_8 for the same tests.


>
>
>
>
>
>
> ===================
> Let's take a step back to reconsider what's real cause in Nasf's test. 
> I tend to think the load on OSSes might cause that symptom. It's 
> obvious that Async Glimpse Lock produces more stress on OSS, 
> especially in his test env where multiple OSTs are actually on the 
> same OSS. This will make the ls time increased by the stripe count as 
> well - since OSS has to handle more RPCs when the stripe count 
> increases in a specific time. This problem may be mitigated by 
> distributing OSTs to more OSSes.


Basically, I agree with you that the heavy load on the OSS may be the performance
bottleneck; as I said in my earlier email, we found the CPU load on the OSSes was quite
high when running "ls -Ul" in the large-striped cases. It would be easy to verify if we
had sufficiently powerful OSSes, but unfortunately we do not have them at the moment.

Cheers,
--
Nasf


>
> Thanks,
> Jinshan
>
>>
>> 2) Patched b2_x is much faster than original b2_x, for traversing 
>> 400K * 32-striped directory, it is 100 times or more improved.
>>
>> 3) Patched b2_x is also faster than b1_8, within our test, patched 
>> b2_x is at least 4X faster than b1_8, which matches the requirement 
>> in ORNL contract.
>>
>> 4) Original b2_x is faster than b1_8 only for small striped cases, 
>> not more than 4-striped. For large striped cases, slower than b1_8, 
>> which is consistent with ORNL test result.
>>
>> 5) The largest stripe count is 32 in our test. We have not enough 
>> resource to test more large striped cases. And I also wonder whether 
>> it is worth to test more large striped directory or not. Because how 
>> many customers want to use large and full striped directory? means 
>> contains 1M * 160-striped items in signal directory. If it is rare 
>> case, then wasting lots of time on that is worthless.
>>
>> We need to confirm with ORNL what is the last acceptance test cases 
>> and environment, includes:
>> a) stripe count
>> b) item count
>> c) network latency, w/o lnet router, suggest without router.
>> d) OST count on each OSS
>>
>>
>> Cheers,
>> --
>> Nasf
>> <result_20110512.xls>_______________________________________________
>> Lustre-devel mailing list
>> Lustre-devel at lists.lustre.org <mailto:Lustre-devel@lists.lustre.org>
>> http://lists.lustre.org/mailman/listinfo/lustre-devel
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20110530/29c7a5d7/attachment.htm>

