All of lore.kernel.org
 help / color / mirror / Atom feed
* [Lustre-devel] NRS HLD
@ 2011-06-29 19:55 Nathan Rutman
  2011-06-30 15:49 ` Nikitas Angelinas
  0 siblings, 1 reply; 10+ messages in thread
From: Nathan Rutman @ 2011-06-29 19:55 UTC (permalink / raw)
  To: lustre-devel

Sharing with the community.  All comments welcome.
This HLD (high-level design) for a Network Request Scheduler is more about infrastructure than algorithm.





We're actually in the DLD (detailed-level design) stage at the moment (sorry it didn't occur to me to 
post this earlier).  I'll post the DLD after some minor revision.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20110629/8fce916d/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: NRS HLD.pdf
Type: application/pdf
Size: 325187 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20110629/8fce916d/attachment.pdf>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20110629/8fce916d/attachment-0001.htm>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Lustre-devel] NRS HLD
  2011-06-29 19:55 [Lustre-devel] NRS HLD Nathan Rutman
@ 2011-06-30 15:49 ` Nikitas Angelinas
  2011-07-02 10:48   ` Liang Zhen
  0 siblings, 1 reply; 10+ messages in thread
From: Nikitas Angelinas @ 2011-06-30 15:49 UTC (permalink / raw)
  To: lustre-devel

Hi,

There is a slightly more up-to-date version of the HLD, which I am
attaching.


Thanks,
Nikitas

On Wed, 2011-06-29 at 12:55 -0700, Nathan Rutman wrote:
> Sharing with the community.  All comments welcome.
> This HLD (high-level design) for a Network Request Scheduler is more
> about infrastructure than algorithm.
> 
> 
> 
> 
> We're actually in the DLD (detailed-level design) stage at the moment
> (sorry it didn't occur to me to 
> post this earlier).  I'll post the DLD after some minor revision.
> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel


______________________________________________________________________
This email may contain privileged or confidential information, which should only be used for the purpose for which it was sent by Xyratex. No further rights or licenses are granted to use such information. If you are not the intended recipient of this message, please notify the sender by return and delete it. You may not use, copy, disclose or rely on the information contained in it.
 
Internet email is susceptible to data corruption, interception and unauthorised amendment for which Xyratex does not accept liability. While we have taken reasonable precautions to ensure that this email is free of viruses, Xyratex does not accept liability for the presence of any computer viruses in this email, nor for any losses caused as a result of viruses.
 
Xyratex Technology Limited (03134912), Registered in England & Wales, Registered Office, Langstone Road, Havant, Hampshire, PO9 1SA.
 
The Xyratex group of companies also includes, Xyratex Ltd, registered in Bermuda, Xyratex International Inc, registered in California, Xyratex (Malaysia) Sdn Bhd registered in Malaysia, Xyratex Technology (Wuxi) Co Ltd registered in The People's Republic of China and Xyratex Japan Limited registered in Japan.
______________________________________________________________________
 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: HLD_of_Lustre_NRS.pdf
Type: application/pdf
Size: 276401 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20110630/d2dd9e88/attachment.pdf>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Lustre-devel] NRS HLD
  2011-06-30 15:49 ` Nikitas Angelinas
@ 2011-07-02 10:48   ` Liang Zhen
  2011-07-02 15:39     ` Nathan Rutman
  0 siblings, 1 reply; 10+ messages in thread
From: Liang Zhen @ 2011-07-02 10:48 UTC (permalink / raw)
  To: lustre-devel

Hi Nikitas,

It's a very interesting document, thanks for sharing these great ideas.
May I ask whether there are any tests/numbers based on OBRR description
in HLD? We always want to see numbers from these algorithms, unfortunately
there is no reliable testing result from the patch on BZ 13634.

We are actually considering about NRS as well, although we don't have a
formal design document like this HLD, but we'd like to share a few
preliminary ideas for discussion:

- Target Based Round Robin (TBRR) probably is something worth to have

  Briefly, it's just load balancing over OSTs to improve overall server
  performance.

- Fairness for clients and resource control for jobs

  I think they are similar to CBRR and UBRR described in your document
  though I didn't see too much detail about them in the HLD.
  Personally, I think they are important and we probably will do some tests
  for CBRR survey very soon.

  . Client-Based Round Robin (CBRR) can guarantee the server responses
    to all clients fairly, and get whole-cluster load balancing, improve
    concurrency of clients and jobs, and get better overall performance.

  . resource control for jobs, some users complained that a busy job will
    hog all resources on servers, and make the cluster not usable for other
    control command or sysadmin. So it might be helpful to support job
    resource control inside NRS.

- Layering NRS polices

  Of course, OBRR is very important policy for NRS, but it might be
  helpful to have multiple polices working at the same time, i.e:

  . bind OSTs on CPU partitions on NUMA system(please see more detail at
    http://jira.whamcloud.com/browse/LU-56)

  . Service threads on each CPU partition can do TBRR for bound OSTs. If
    there is no CPU partition (like current Lustre) or OSTs are not bound
    on CPU partitions, service threads just do round robin over all OSTs.

  . OBRR inside each OST

  . of course, these layers should be tunable.

- Overhead of layerd polices
  . there definitely will be some overhead for inserting/removing
    request from these queues (or whatever), so we want some very scalable
    algorithms to implement these polices.

Again, these are just some preliminary ideas, so we would appreciate any
comment and suggestion.

Thanks
Liang


On Jun 30, 2011, at 11:49 PM, Nikitas Angelinas wrote:

> Hi,
> 
> There is a slightly more up-to-date version of the HLD, which I am
> attaching.
> 
> 
> Thanks,
> Nikitas
> 
> On Wed, 2011-06-29 at 12:55 -0700, Nathan Rutman wrote:
>> Sharing with the community.  All comments welcome.
>> This HLD (high-level design) for a Network Request Scheduler is more
>> about infrastructure than algorithm.
>> 
>> 
>> 
>> 
>> We're actually in the DLD (detailed-level design) stage at the moment
>> (sorry it didn't occur to me to 
>> post this earlier).  I'll post the DLD after some minor revision.
>> _______________________________________________
>> Lustre-devel mailing list
>> Lustre-devel at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-devel
> 
> 
> ______________________________________________________________________
> This email may contain privileged or confidential information, which should only be used for the purpose for which it was sent by Xyratex. No further rights or licenses are granted to use such information. If you are not the intended recipient of this message, please notify the sender by return and delete it. You may not use, copy, disclose or rely on the information contained in it.
> 
> Internet email is susceptible to data corruption, interception and unauthorised amendment for which Xyratex does not accept liability. While we have taken reasonable precautions to ensure that this email is free of viruses, Xyratex does not accept liability for the presence of any computer viruses in this email, nor for any losses caused as a result of viruses.
> 
> Xyratex Technology Limited (03134912), Registered in England & Wales, Registered Office, Langstone Road, Havant, Hampshire, PO9 1SA.
> 
> The Xyratex group of companies also includes, Xyratex Ltd, registered in Bermuda, Xyratex International Inc, registered in California, Xyratex (Malaysia) Sdn Bhd registered in Malaysia, Xyratex Technology (Wuxi) Co Ltd registered in The People's Republic of China and Xyratex Japan Limited registered in Japan.
> ______________________________________________________________________
> 
> 
> <HLD_of_Lustre_NRS.pdf>_______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Lustre-devel] NRS HLD
  2011-07-02 10:48   ` Liang Zhen
@ 2011-07-02 15:39     ` Nathan Rutman
  2011-07-05  9:20       ` Liang Zhen
  0 siblings, 1 reply; 10+ messages in thread
From: Nathan Rutman @ 2011-07-02 15:39 UTC (permalink / raw)
  To: lustre-devel

I think there is a whole slew of interesting possibilities to try out once an infrastructure for implementing different policies is in place. We're trying to get that infrastructure established first; we will be able to switch between policies dynamically to see their effects and optimize performance. People will be able to easily add new policies to test. 
You raise a good point about the SMP scaling patches; we will have to take a close look to make sure we don't undo any of the work you've done there. 

On Jul 2, 2011, at 3:49 AM, "Liang Zhen" <liang@whamcloud.com> wrote:

> Hi Nikitas,
> 
> It's a very interesting document, thanks for sharing these great ideas.
> May I ask whether there are any tests/numbers based on OBRR description
> in HLD? We always want to see numbers from these algorithms, unfortunately
> there is no reliable testing result from the patch on BZ 13634.
> 
> We are actually considering about NRS as well, although we don't have a
> formal design document like this HLD, but we'd like to share a few
> preliminary ideas for discussion:
> 
> - Target Based Round Robin (TBRR) probably is something worth to have
> 
>  Briefly, it's just load balancing over OSTs to improve overall server
>  performance.
> 
> - Fairness for clients and resource control for jobs
> 
>  I think they are similar to CBRR and UBRR described in your document
>  though I didn't see too much detail about them in the HLD.
>  Personally, I think they are important and we probably will do some tests
>  for CBRR survey very soon.
> 
>  . Client-Based Round Robin (CBRR) can guarantee the server responses
>    to all clients fairly, and get whole-cluster load balancing, improve
>    concurrency of clients and jobs, and get better overall performance.
> 
>  . resource control for jobs, some users complained that a busy job will
>    hog all resources on servers, and make the cluster not usable for other
>    control command or sysadmin. So it might be helpful to support job
>    resource control inside NRS.
> 
> - Layering NRS polices
> 
>  Of course, OBRR is very important policy for NRS, but it might be
>  helpful to have multiple polices working at the same time, i.e:
> 
>  . bind OSTs on CPU partitions on NUMA system(please see more detail at
>    http://jira.whamcloud.com/browse/LU-56)
> 
>  . Service threads on each CPU partition can do TBRR for bound OSTs. If
>    there is no CPU partition (like current Lustre) or OSTs are not bound
>    on CPU partitions, service threads just do round robin over all OSTs.
> 
>  . OBRR inside each OST
> 
>  . of course, these layers should be tunable.
> 
> - Overhead of layerd polices
>  . there definitely will be some overhead for inserting/removing
>    request from these queues (or whatever), so we want some very scalable
>    algorithms to implement these polices.
> 
> Again, these are just some preliminary ideas, so we would appreciate any
> comment and suggestion.
> 
> Thanks
> Liang
> 
> 
> On Jun 30, 2011, at 11:49 PM, Nikitas Angelinas wrote:
> 
>> Hi,
>> 
>> There is a slightly more up-to-date version of the HLD, which I am
>> attaching.
>> 
>> 
>> Thanks,
>> Nikitas
>> 
>> On Wed, 2011-06-29 at 12:55 -0700, Nathan Rutman wrote:
>>> Sharing with the community.  All comments welcome.
>>> This HLD (high-level design) for a Network Request Scheduler is more
>>> about infrastructure than algorithm.
>>> 
>>> 
>>> 
>>> 
>>> We're actually in the DLD (detailed-level design) stage at the moment
>>> (sorry it didn't occur to me to 
>>> post this earlier).  I'll post the DLD after some minor revision.
>>> _______________________________________________
>>> Lustre-devel mailing list
>>> Lustre-devel at lists.lustre.org
>>> http://lists.lustre.org/mailman/listinfo/lustre-devel
>> 
>> 
>> ______________________________________________________________________
>> This email may contain privileged or confidential information, which should only be used for the purpose for which it was sent by Xyratex. No further rights or licenses are granted to use such information. If you are not the intended recipient of this message, please notify the sender by return and delete it. You may not use, copy, disclose or rely on the information contained in it.
>> 
>> Internet email is susceptible to data corruption, interception and unauthorised amendment for which Xyratex does not accept liability. While we have taken reasonable precautions to ensure that this email is free of viruses, Xyratex does not accept liability for the presence of any computer viruses in this email, nor for any losses caused as a result of viruses.
>> 
>> Xyratex Technology Limited (03134912), Registered in England & Wales, Registered Office, Langstone Road, Havant, Hampshire, PO9 1SA.
>> 
>> The Xyratex group of companies also includes, Xyratex Ltd, registered in Bermuda, Xyratex International Inc, registered in California, Xyratex (Malaysia) Sdn Bhd registered in Malaysia, Xyratex Technology (Wuxi) Co Ltd registered in The People's Republic of China and Xyratex Japan Limited registered in Japan.
>> ______________________________________________________________________
>> 
>> 
>> <HLD_of_Lustre_NRS.pdf>_______________________________________________
>> Lustre-devel mailing list
>> Lustre-devel at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-devel
> 
> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel
______________________________________________________________________
This email may contain privileged or confidential information, which should only be used for the purpose for which it was sent by Xyratex. No further rights or licenses are granted to use such information. If you are not the intended recipient of this message, please notify the sender by return and delete it. You may not use, copy, disclose or rely on the information contained in it.
 
Internet email is susceptible to data corruption, interception and unauthorised amendment for which Xyratex does not accept liability. While we have taken reasonable precautions to ensure that this email is free of viruses, Xyratex does not accept liability for the presence of any computer viruses in this email, nor for any losses caused as a result of viruses.
 
Xyratex Technology Limited (03134912), Registered in England & Wales, Registered Office, Langstone Road, Havant, Hampshire, PO9 1SA.
 
The Xyratex group of companies also includes, Xyratex Ltd, registered in Bermuda, Xyratex International Inc, registered in California, Xyratex (Malaysia) Sdn Bhd registered in Malaysia, Xyratex Technology (Wuxi) Co Ltd registered in The People's Republic of China and Xyratex Japan Limited registered in Japan.
______________________________________________________________________
 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Lustre-devel] NRS HLD
  2011-07-02 15:39     ` Nathan Rutman
@ 2011-07-05  9:20       ` Liang Zhen
  2011-07-07 14:34         ` Nikitas Angelinas
  0 siblings, 1 reply; 10+ messages in thread
From: Liang Zhen @ 2011-07-05  9:20 UTC (permalink / raw)
  To: lustre-devel

Hi Nathan,

We also have some ideas about infrastructure of NRS and would like to share
them at here for discussion:

- a common library to support inserting/removing prioritized requests, which 
  should be scalable even for millions of requests

- each policy only needs to provide:
  . private data ("heap", more details below)
  . a sorting callback to prioritize requests

By this way, most common code can be put into libcfs and ptlrpc modules,
instead of having a whole bunch of policy functions to sort requests in
each policy (like the patch on BZ 13634).

Here is the concept of "heap" mentioned above:

(it's quoted from the following link)
A heap is a specialized tree-based data structure that satisfies the heap
property: if B is a child node of A, then key(A) ? key(B). This implies that an
element with the greatest key is always in the root node, and so such a heap is
sometimes called a max-heap. Alternatively, if the comparison is reversed, the
smallest element is always in the root node, which results in a min-heap.

Please see more detail at:
http://en.wikipedia.org/wiki/Heap_%28data_structure%29

Heap can give (log N) for both inserting element and removing element with
greatest/lowest key, so it's very scalable, here are some sample heap APIs
from a simple prototype:

    /* create a heap, @compare is callback for sorting elements */
    cfs_binheap_t *cfs_binheap_create(cfs_binheap_cmp_t compare,
                                      unsigned int flags);

    /* destroy a heap */
    void cfs_binheap_destroy(cfs_binheap_t *h);

    /* insert @element into heap */
    int cfs_binheap_insert(cfs_binheap_t *h, void *elelment);

    /* pop root for the @heap, which has greatest/lowest key in @heap */
    void *cfs_binheap_pop_root(cfs_binheap_t *heap);

and an example of using "heap" to implement Client-Based Round Robin:

  - allocating a "heap" for the service which needs CBRR

  - the service also have a "global round-number" which is always incremental
    . GRN: Global Round Number
    . CRN: Client Round Number
    . RRN: Request Round Number

  Process of inserting a request
  - when there is new request:
    . if the incoming request is the first request for a client, assign
      the GRN to both the RRN of the request and CNR of the client

    . if the incoming request isn't the first request for a client,
      increase the CNR on the client then assign it to RRN of the request.

  - insert the request into the heap, the heap can sort requests based
    on callback like this:
    . primary key is RRN of request

    . secondary key is client ID (i.e: nid of client), secondary keys are
      compared only if primary keys are equal

  Process of dequeue a request:

  - pop the root element from the heap, which should always has the lowest RRN

  - if the RRN of the request is larger than GRN of the service,
    which means the service has already handled all requests with <= GRN and
    should set the GRN of service to the RRN of request

Again, this just an example to demonstrate how to use "heap" to implement CBRR
and show some preliminary ideas, it could be helpful to raise some discussion
at here and to see whether it's possible to improve and apply this to NRS and
make it to be a common library for future implementation of OBRR or any
policy needs to prioritize requests, or for QoS.

Thanks
Liang

On Jul 2, 2011,@11:39 PM, Nathan Rutman wrote:

> I think there is a whole slew of interesting possibilities to try out once an infrastructure for implementing different policies is in place. We're trying to get that infrastructure established first; we will be able to switch between policies dynamically to see their effects and optimize performance. People will be able to easily add new policies to test. 
> You raise a good point about the SMP scaling patches; we will have to take a close look to make sure we don't undo any of the work you've done there. 
> 
> On Jul 2, 2011, at 3:49 AM, "Liang Zhen" <liang@whamcloud.com> wrote:
> 
>> Hi Nikitas,
>> 
>> It's a very interesting document, thanks for sharing these great ideas.
>> May I ask whether there are any tests/numbers based on OBRR description
>> in HLD? We always want to see numbers from these algorithms, unfortunately
>> there is no reliable testing result from the patch on BZ 13634.
>> 
>> We are actually considering about NRS as well, although we don't have a
>> formal design document like this HLD, but we'd like to share a few
>> preliminary ideas for discussion:
>> 
>> - Target Based Round Robin (TBRR) probably is something worth to have
>> 
>> Briefly, it's just load balancing over OSTs to improve overall server
>> performance.
>> 
>> - Fairness for clients and resource control for jobs
>> 
>> I think they are similar to CBRR and UBRR described in your document
>> though I didn't see too much detail about them in the HLD.
>> Personally, I think they are important and we probably will do some tests
>> for CBRR survey very soon.
>> 
>> . Client-Based Round Robin (CBRR) can guarantee the server responses
>>   to all clients fairly, and get whole-cluster load balancing, improve
>>   concurrency of clients and jobs, and get better overall performance.
>> 
>> . resource control for jobs, some users complained that a busy job will
>>   hog all resources on servers, and make the cluster not usable for other
>>   control command or sysadmin. So it might be helpful to support job
>>   resource control inside NRS.
>> 
>> - Layering NRS polices
>> 
>> Of course, OBRR is very important policy for NRS, but it might be
>> helpful to have multiple polices working at the same time, i.e:
>> 
>> . bind OSTs on CPU partitions on NUMA system(please see more detail at
>>   http://jira.whamcloud.com/browse/LU-56)
>> 
>> . Service threads on each CPU partition can do TBRR for bound OSTs. If
>>   there is no CPU partition (like current Lustre) or OSTs are not bound
>>   on CPU partitions, service threads just do round robin over all OSTs.
>> 
>> . OBRR inside each OST
>> 
>> . of course, these layers should be tunable.
>> 
>> - Overhead of layerd polices
>> . there definitely will be some overhead for inserting/removing
>>   request from these queues (or whatever), so we want some very scalable
>>   algorithms to implement these polices.
>> 
>> Again, these are just some preliminary ideas, so we would appreciate any
>> comment and suggestion.
>> 
>> Thanks
>> Liang
>> 
>> 
>> On Jun 30, 2011, at 11:49 PM, Nikitas Angelinas wrote:
>> 
>>> Hi,
>>> 
>>> There is a slightly more up-to-date version of the HLD, which I am
>>> attaching.
>>> 
>>> 
>>> Thanks,
>>> Nikitas
>>> 
>>> On Wed, 2011-06-29 at 12:55 -0700, Nathan Rutman wrote:
>>>> Sharing with the community.  All comments welcome.
>>>> This HLD (high-level design) for a Network Request Scheduler is more
>>>> about infrastructure than algorithm.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> We're actually in the DLD (detailed-level design) stage at the moment
>>>> (sorry it didn't occur to me to 
>>>> post this earlier).  I'll post the DLD after some minor revision.
>>>> _______________________________________________
>>>> Lustre-devel mailing list
>>>> Lustre-devel at lists.lustre.org
>>>> http://lists.lustre.org/mailman/listinfo/lustre-devel
>>> 
>>> 
>>> ______________________________________________________________________
>>> This email may contain privileged or confidential information, which should only be used for the purpose for which it was sent by Xyratex. No further rights or licenses are granted to use such information. If you are not the intended recipient of this message, please notify the sender by return and delete it. You may not use, copy, disclose or rely on the information contained in it.
>>> 
>>> Internet email is susceptible to data corruption, interception and unauthorised amendment for which Xyratex does not accept liability. While we have taken reasonable precautions to ensure that this email is free of viruses, Xyratex does not accept liability for the presence of any computer viruses in this email, nor for any losses caused as a result of viruses.
>>> 
>>> Xyratex Technology Limited (03134912), Registered in England & Wales, Registered Office, Langstone Road, Havant, Hampshire, PO9 1SA.
>>> 
>>> The Xyratex group of companies also includes, Xyratex Ltd, registered in Bermuda, Xyratex International Inc, registered in California, Xyratex (Malaysia) Sdn Bhd registered in Malaysia, Xyratex Technology (Wuxi) Co Ltd registered in The People's Republic of China and Xyratex Japan Limited registered in Japan.
>>> ______________________________________________________________________
>>> 
>>> 
>>> <HLD_of_Lustre_NRS.pdf>_______________________________________________
>>> Lustre-devel mailing list
>>> Lustre-devel at lists.lustre.org
>>> http://lists.lustre.org/mailman/listinfo/lustre-devel
>> 
>> _______________________________________________
>> Lustre-devel mailing list
>> Lustre-devel at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-devel
> ______________________________________________________________________
> This email may contain privileged or confidential information, which should only be used for the purpose for which it was sent by Xyratex. No further rights or licenses are granted to use such information. If you are not the intended recipient of this message, please notify the sender by return and delete it. You may not use, copy, disclose or rely on the information contained in it.
> 
> Internet email is susceptible to data corruption, interception and unauthorised amendment for which Xyratex does not accept liability. While we have taken reasonable precautions to ensure that this email is free of viruses, Xyratex does not accept liability for the presence of any computer viruses in this email, nor for any losses caused as a result of viruses.
> 
> Xyratex Technology Limited (03134912), Registered in England & Wales, Registered Office, Langstone Road, Havant, Hampshire, PO9 1SA.
> 
> The Xyratex group of companies also includes, Xyratex Ltd, registered in Bermuda, Xyratex International Inc, registered in California, Xyratex (Malaysia) Sdn Bhd registered in Malaysia, Xyratex Technology (Wuxi) Co Ltd registered in The People's Republic of China and Xyratex Japan Limited registered in Japan.
> ______________________________________________________________________
> 
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Lustre-devel] NRS HLD
  2011-07-05  9:20       ` Liang Zhen
@ 2011-07-07 14:34         ` Nikitas Angelinas
  2011-07-09  3:34           ` Liang Zhen
  0 siblings, 1 reply; 10+ messages in thread
From: Nikitas Angelinas @ 2011-07-07 14:34 UTC (permalink / raw)
  To: lustre-devel

Hi Liang,


Thank you for sharing these great ideas; I think the task lends itself
well to adding different schemes that performance can benefit from. To
answer an earlier question, we unfortunately don't have any numbers or
have performed any tests related to the NRS task. We'll obviously
publish any such data as soon as we get our first policy developed, and
under test.

Both Nathan and I think you raise a very good point by highlighting the
need for a scalable data structure that can sort elements in libcfs, as
different policies could probably make use of such a facility, and may
actually require it if they need to address scalability-related issues
(there's an older, related topic
http://lists.lustre.org/pipermail/lustre-devel/2008-January/002230.html
which I'm sure you're already aware of).

I think there are some holes in my understanding of the example you gave
on using the binary heap in conjunction with GRN/CRN/RRN to implement a
CBRR policy. Is the GRN incremented with every incoming request? And how
can two requests end up with the same RRN? I'm sure the logic makes
sense, this is probably me just being daft :-).

Can I ask if there is a binary heap implementation living in a branch
somewhere, or you plan on landing one on master at some point? There is
some code attached on the thread I have pasted above, but I have not had
a look through it; going through the thread posts, it seems to be in
working and well-performing state.

For the first policy we aim to develop ("obj_extents" in the HLD)
although OBRR-like, we are not looking at using any data structures that
are out of the norm, at present. The idea is that we 
pick up requests from the service in-queue and add requests in
number-limited (or perhaps size-limited) per-object groupings, sorted by
offset, which are either populated in-place in the normal service
request queue, or use the normal service request queue as an
intermediate queue (just to avoid much processing in what is now the
pre-processing stage for RPCs), before they are added to the sorted
queue, from where they are eventually dispatched. We are looking at
using a libcfs hash for resolving incoming requests to the appropriate
object grouping, much like the OBRR implementation in the prototype NRS
patch does; I think this method will naturally avoid any request
starvation issues, but I am not sure it will perform as well as an OBRR
implementation using a heap. Hope the description above makes some
sense; if you see any problems with it, then please do let us know.

The NRS architectural specification at
http://wiki.lustre.org/index.php/Architecture_-_Network_Request_Scheduler has references to additional aspects to the task. Are you looking at providing some form of POP capability, or reproducing offset stream ordering, at present? These both sound like features that perhaps the framework itself could benefit from, at least in part irrespective of policy, I think. We seem to have at least one use case where we could do with clients passing additional information with each RPC, in QoS, so it seems like resource control, and catering for offset stream consistency may add additional ones.


Regards,
Nikitas






On Tue, 2011-07-05 at 17:20 +0800, Liang Zhen wrote: 
> Hi Nathan,
> 
> We also have some ideas about infrastructure of NRS and would like to share
> them at here for discussion:
> 
> - a common library to support inserting/removing prioritized requests, which 
>   should be scalable even for millions of requests
> 
> - each policy only needs to provide:
>   . private data ("heap", more details below)
>   . a sorting callback to prioritize requests
> 
> By this way, most common code can be put into libcfs and ptlrpc modules,
> instead of having a whole bunch of policy functions to sort requests in
> each policy (like the patch on BZ 13634).
> 
> Here is the concept of "heap" mentioned above:
> 
> (it's quoted from the following link)
> A heap is a specialized tree-based data structure that satisfies the heap
> property: if B is a child node of A, then key(A) ? key(B). This implies that an
> element with the greatest key is always in the root node, and so such a heap is
> sometimes called a max-heap. Alternatively, if the comparison is reversed, the
> smallest element is always in the root node, which results in a min-heap.
> 
> Please see more detail at:
> http://en.wikipedia.org/wiki/Heap_%28data_structure%29
> 
> Heap can give (log N) for both inserting element and removing element with
> greatest/lowest key, so it's very scalable, here are some sample heap APIs
> from a simple prototype:
> 
>     /* create a heap, @compare is callback for sorting elements */
>     cfs_binheap_t *cfs_binheap_create(cfs_binheap_cmp_t compare,
>                                       unsigned int flags);
> 
>     /* destroy a heap */
>     void cfs_binheap_destroy(cfs_binheap_t *h);
> 
>     /* insert @element into heap */
>     int cfs_binheap_insert(cfs_binheap_t *h, void *elelment);
> 
>     /* pop root for the @heap, which has greatest/lowest key in @heap */
>     void *cfs_binheap_pop_root(cfs_binheap_t *heap);
> 
> and an example of using "heap" to implement Client-Based Round Robin:
> 
>   - allocating a "heap" for the service which needs CBRR
> 
>   - the service also have a "global round-number" which is always incremental
>     . GRN: Global Round Number
>     . CRN: Client Round Number
>     . RRN: Request Round Number
> 
>   Process of inserting a request
>   - when there is new request:
>     . if the incoming request is the first request for a client, assign
>       the GRN to both the RRN of the request and CNR of the client
> 
>     . if the incoming request isn't the first request for a client,
>       increase the CNR on the client then assign it to RRN of the request.
> 
>   - insert the request into the heap, the heap can sort requests based
>     on callback like this:
>     . primary key is RRN of request
> 
>     . secondary key is client ID (i.e: nid of client), secondary keys are
>       compared only if primary keys are equal
> 
>   Process of dequeue a request:
> 
>   - pop the root element from the heap, which should always has the lowest RRN
> 
>   - if the RRN of the request is larger than GRN of the service,
>     which means the service has already handled all requests with <= GRN and
>     should set the GRN of service to the RRN of request
> 
> Again, this just an example to demonstrate how to use "heap" to implement CBRR
> and show some preliminary ideas, it could be helpful to raise some discussion
> at here and to see whether it's possible to improve and apply this to NRS and
> make it to be a common library for future implementation of OBRR or any
> policy needs to prioritize requests, or for QoS.
> 
> Thanks
> Liang
> 
> On Jul 2, 2011, at 11:39 PM, Nathan Rutman wrote:
> 
> > I think there is a whole slew of interesting possibilities to try out once an infrastructure for implementing different policies is in place. We're trying to get that infrastructure established first; we will be able to switch between policies dynamically to see their effects and optimize performance. People will be able to easily add new policies to test. 
> > You raise a good point about the SMP scaling patches; we will have to take a close look to make sure we don't undo any of the work you've done there. 
> > 
> > On Jul 2, 2011, at 3:49 AM, "Liang Zhen" <liang@whamcloud.com> wrote:
> > 
> >> Hi Nikitas,
> >> 
> >> It's a very interesting document, thanks for sharing these great ideas.
> >> May I ask whether there are any tests/numbers based on OBRR description
> >> in HLD? We always want to see numbers from these algorithms, unfortunately
> >> there is no reliable testing result from the patch on BZ 13634.
> >> 
> >> We are actually considering about NRS as well, although we don't have a
> >> formal design document like this HLD, but we'd like to share a few
> >> preliminary ideas for discussion:
> >> 
> >> - Target Based Round Robin (TBRR) probably is something worth to have
> >> 
> >> Briefly, it's just load balancing over OSTs to improve overall server
> >> performance.
> >> 
> >> - Fairness for clients and resource control for jobs
> >> 
> >> I think they are similar to CBRR and UBRR described in your document
> >> though I didn't see too much detail about them in the HLD.
> >> Personally, I think they are important and we probably will do some tests
> >> for CBRR survey very soon.
> >> 
> >> . Client-Based Round Robin (CBRR) can guarantee the server responses
> >>   to all clients fairly, and get whole-cluster load balancing, improve
> >>   concurrency of clients and jobs, and get better overall performance.
> >> 
> >> . resource control for jobs, some users complained that a busy job will
> >>   hog all resources on servers, and make the cluster not usable for other
> >>   control command or sysadmin. So it might be helpful to support job
> >>   resource control inside NRS.
> >> 
> >> - Layering NRS polices
> >> 
> >> Of course, OBRR is very important policy for NRS, but it might be
> >> helpful to have multiple polices working at the same time, i.e:
> >> 
> >> . bind OSTs on CPU partitions on NUMA system(please see more detail at
> >>   http://jira.whamcloud.com/browse/LU-56)
> >> 
> >> . Service threads on each CPU partition can do TBRR for bound OSTs. If
> >>   there is no CPU partition (like current Lustre) or OSTs are not bound
> >>   on CPU partitions, service threads just do round robin over all OSTs.
> >> 
> >> . OBRR inside each OST
> >> 
> >> . of course, these layers should be tunable.
> >> 
> >> - Overhead of layerd polices
> >> . there definitely will be some overhead for inserting/removing
> >>   request from these queues (or whatever), so we want some very scalable
> >>   algorithms to implement these polices.
> >> 
> >> Again, these are just some preliminary ideas, so we would appreciate any
> >> comment and suggestion.
> >> 
> >> Thanks
> >> Liang
> >> 
> >> 
> >> On Jun 30, 2011, at 11:49 PM, Nikitas Angelinas wrote:
> >> 
> >>> Hi,
> >>> 
> >>> There is a slightly more up-to-date version of the HLD, which I am
> >>> attaching.
> >>> 
> >>> 
> >>> Thanks,
> >>> Nikitas
> >>> 
> >>> On Wed, 2011-06-29 at 12:55 -0700, Nathan Rutman wrote:
> >>>> Sharing with the community.  All comments welcome.
> >>>> This HLD (high-level design) for a Network Request Scheduler is more
> >>>> about infrastructure than algorithm.
> >>>> 
> >>>> 
> >>>> 
> >>>> 
> >>>> We're actually in the DLD (detailed-level design) stage at the moment
> >>>> (sorry it didn't occur to me to 
> >>>> post this earlier).  I'll post the DLD after some minor revision.
> >>>> _______________________________________________
> >>>> Lustre-devel mailing list
> >>>> Lustre-devel at lists.lustre.org
> >>>> http://lists.lustre.org/mailman/listinfo/lustre-devel
> >>> 
> >>> 
> >>> ______________________________________________________________________
> >>> This email may contain privileged or confidential information, which should only be used for the purpose for which it was sent by Xyratex. No further rights or licenses are granted to use such information. If you are not the intended recipient of this message, please notify the sender by return and delete it. You may not use, copy, disclose or rely on the information contained in it.
> >>> 
> >>> Internet email is susceptible to data corruption, interception and unauthorised amendment for which Xyratex does not accept liability. While we have taken reasonable precautions to ensure that this email is free of viruses, Xyratex does not accept liability for the presence of any computer viruses in this email, nor for any losses caused as a result of viruses.
> >>> 
> >>> Xyratex Technology Limited (03134912), Registered in England & Wales, Registered Office, Langstone Road, Havant, Hampshire, PO9 1SA.
> >>> 
> >>> The Xyratex group of companies also includes, Xyratex Ltd, registered in Bermuda, Xyratex International Inc, registered in California, Xyratex (Malaysia) Sdn Bhd registered in Malaysia, Xyratex Technology (Wuxi) Co Ltd registered in The People's Republic of China and Xyratex Japan Limited registered in Japan.
> >>> ______________________________________________________________________
> >>> 
> >>> 
> >>> <HLD_of_Lustre_NRS.pdf>_______________________________________________
> >>> Lustre-devel mailing list
> >>> Lustre-devel at lists.lustre.org
> >>> http://lists.lustre.org/mailman/listinfo/lustre-devel
> >> 
> >> _______________________________________________
> >> Lustre-devel mailing list
> >> Lustre-devel at lists.lustre.org
> >> http://lists.lustre.org/mailman/listinfo/lustre-devel
> > ______________________________________________________________________
> > This email may contain privileged or confidential information, which should only be used for the purpose for which it was sent by Xyratex. No further rights or licenses are granted to use such information. If you are not the intended recipient of this message, please notify the sender by return and delete it. You may not use, copy, disclose or rely on the information contained in it.
> > 
> > Internet email is susceptible to data corruption, interception and unauthorised amendment for which Xyratex does not accept liability. While we have taken reasonable precautions to ensure that this email is free of viruses, Xyratex does not accept liability for the presence of any computer viruses in this email, nor for any losses caused as a result of viruses.
> > 
> > Xyratex Technology Limited (03134912), Registered in England & Wales, Registered Office, Langstone Road, Havant, Hampshire, PO9 1SA.
> > 
> > The Xyratex group of companies also includes, Xyratex Ltd, registered in Bermuda, Xyratex International Inc, registered in California, Xyratex (Malaysia) Sdn Bhd registered in Malaysia, Xyratex Technology (Wuxi) Co Ltd registered in The People's Republic of China and Xyratex Japan Limited registered in Japan.
> > ______________________________________________________________________
> > 
> > 
> 


______________________________________________________________________
This email may contain privileged or confidential information, which should only be used for the purpose for which it was sent by Xyratex. No further rights or licenses are granted to use such information. If you are not the intended recipient of this message, please notify the sender by return and delete it. You may not use, copy, disclose or rely on the information contained in it.
 
Internet email is susceptible to data corruption, interception and unauthorised amendment for which Xyratex does not accept liability. While we have taken reasonable precautions to ensure that this email is free of viruses, Xyratex does not accept liability for the presence of any computer viruses in this email, nor for any losses caused as a result of viruses.
 
Xyratex Technology Limited (03134912), Registered in England & Wales, Registered Office, Langstone Road, Havant, Hampshire, PO9 1SA.
 
The Xyratex group of companies also includes, Xyratex Ltd, registered in Bermuda, Xyratex International Inc, registered in California, Xyratex (Malaysia) Sdn Bhd registered in Malaysia, Xyratex Technology (Wuxi) Co Ltd registered in The People's Republic of China and Xyratex Japan Limited registered in Japan.
______________________________________________________________________
 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Lustre-devel] NRS HLD
  2011-07-07 14:34         ` Nikitas Angelinas
@ 2011-07-09  3:34           ` Liang Zhen
  2011-07-09  7:53             ` Andreas Dilger
  0 siblings, 1 reply; 10+ messages in thread
From: Liang Zhen @ 2011-07-09  3:34 UTC (permalink / raw)
  To: lustre-devel

Hi Nikitas,

As you can see, I've posted my prototype on our Jira, hope it can help on
explaining how binheap can be used for NRS because it will be too complex
for me to explain all things by English, :-)

http://jira.whamcloud.com/browse/LU-398
http://jira.whamcloud.com/secure/attachment/10300/nrs_liang_v2.patch

I got this prototype a couple of weeks ago and we are looking for a chance
to get some performance data with it.

It's not fully tested and not reviewed, so I posted it just for discussion
and hope it can help us to understand things better and avoid rework.
If I was wrong in the prototype or my points here, please just correct me
either on Jira or at here.

the prototype contains:

  - implementation of binheap

  - a simple framework for NRS

  - client-round-robin implemented by binheap

Because it's just a prototype and I don't have any document (sorry...)
so I'm going to give some introduction at here.

  A few concepts:

  - NRS policy
    the algorithm of adding/sorting/removing requests

  - NRS policy head
    . it's a list of NRS polices

    . a policy head must has a default-policy (i.e: FIFO), default-policy
      should never fail and can't be disable

    . a policy-head can have one active-policy, or no active-policy at all.
      If there is an active-policy, request should firstly be handled by it.
      If the active-policy failed to handle request, the request will
      be delivered to default-policy.

      If no active-policy, request will always be handled by default-policy.

      I think it's similar but a little different with your design(Please
      correct me if I was wrong), the "secondary policy" in your design
      is somehow like default-policy here, I'm wondering why we want to
      specify secondary-policy for each policy, instead of just let all
      polices share a default policy.

    . user can activate/deactivate poilicy at runtime

    . ptlrpc service should always poll over all polices to find pending
      requests (even from inactive polices), so requests wouldn't be forgotten
      forever when user deactivate policy at runtime.

  - NRS object
    target object, i.e: it represents a client (or export) for
    client-round-robin, or it represents an OST object for object-round-robin

  - NRS target
    NRS-target is private data of NRS-policy, A NRS-target can contains
    many NRS-objects, just like storage-target to storage-objects.

  - NRS request
    NRS-request is a stub embedded in ptlrpc_request, each NRS-object can
    have 1-N pending NRS-requests.

  Logic specifications

  - each service has two NRS-policy-heads, one for regular requests,
    another for HP (high priority) requests

  - user can register/unregister a NRS-policy for a service, either for
    for regular NRS-policy-heads, or for HP NRS-policy-heads

  - Two polices so far:

    . FIFO list, which is default

    . client-round-robin based on binheap
      it's actually round-robin over exports for now, user can
      activate/deactivate it by lprocfs.
      We can change it to round-robin over client NIDs in the future,
      or just add a new policy.

  - ptlrpc service threads call active/default NRS-policy to sort incoming
    requests policy::nrs_ops::op_req_add()

    . requests can ben sorted by round & sequence number recorded on
      NRS-target and NRS-object, buffer offset from request can also be
      counted in as sorting key for using it as disk elevator in the future

  - ptlrpc service threads will poll request (round-robin) from all
    NRS-polices on NRS-policy-head by calling policy::nrs_ops::op_req_first(),
    which returns the request with the highest priority in the binheap or
    just the first request if it's FIFO mode, the request can be removed from
    NRS-target by calling policy::nrs_ops::op_req_del().

  - object-round-robin (not in the prototype)

    I agree cfs_hash can be used to find out object, also, I noticed the
    patch on BZ 13634 is using list + rbtree to implement object-round-robin
    and I'm not very clear about your sorting solution but I think binheap
    could be an option. By using binheap,two data structures (list + rbtree,
    or list+list) can be replaced by one (binheap), and all sorting operations
    are invisible to user (user just needs to provide compare callback)

    The major difference between client-round-robin and object-round-robin is:

    . NRS-object of client-round-robin is a client, round-number on NRS-object
      is unique for each request.

    . NRS-object of object-round-robin is an OST object, round-number on
      NRS-object doesn't have to be unique, which means multiple NRS-requests
      for one NRS-object can share the same round-number, and these requests
      are sorted by buffer offset in binheap.
      As you said, we can "consume" the round-number by counting grouped
      RPCs or data size.

Anyway, these are just my preliminary ideas, any comment and suggestions
would be great.

Some information about binheap:

  There are some changes for the binheap posted by Eric:

  - In the original version, user just needs address of element while
    inserting it, users have no idea about where the element is and
    can't remove element unless it's root of binheap.
    In current version, user needs to provide a pointer of
    cfs_binheap_node_t, which should be embedded in element structure.

    example:
        struct foo_type {
            cfs_binheap_node_t  node;
        } foo;
        cfs_binheap_insert(h, &foo.node);

    with this change, user can remove element at any position

    example:
        cfs_binheap_remove(h, &foo.node);

  - there are two options if user wants to have atomically insert:

    . preallocate memory for binheap if caller can estimate heap size,
      cfs_binheap_create(..., count);

    . use flag CBH_FLAG_ATOMIC_GROW
      cfs_binheap_create(..., CBH_FLAG_ATOMIC_GROW, count);

  Here is the list of all APIs

    cfs_binheap_t *cfs_binheap_create(cfs_binheap_ops_t *ops,
                                      unsigned int flags, unsigned count);

    void cfs_binheap_destroy(cfs_binheap_t *h);

    cfs_binheap_node_t *cfs_binheap_find(cfs_binheap_t *h, unsigned int idx);

    int cfs_binheap_insert(cfs_binheap_t *h, cfs_binheap_node_t *e);

    void cfs_binheap_remove(cfs_binheap_t *h, cfs_binheap_node_t *e);

  And some inline wrappers:

    int cfs_binheap_size(cfs_binheap_t *h)

    int cfs_binheap_is_empty(cfs_binheap_t *h)

    cfs_binheap_node_t *cfs_binheap_root(cfs_binheap_t *h)

    cfs_binheap_node_t *cfs_binheap_remove_root(cfs_binheap_t *h)

Anyway, hope these discussion can help us to understand the whole thing better
so we can get agreement about how to move on this project.

Thanks
Liang



On Jul 7, 2011, at 10:34 PM, Nikitas Angelinas wrote:

> Hi Liang,
> 
> 
> Thank you for sharing these great ideas; I think the task lends itself
> well to adding different schemes that performance can benefit from. To
> answer an earlier question, we unfortunately don't have any numbers or
> have performed any tests related to the NRS task. We'll obviously
> publish any such data as soon as we get our first policy developed, and
> under test.
> 
> Both Nathan and I think you raise a very good point by highlighting the
> need for a scalable data structure that can sort elements in libcfs, as
> different policies could probably make use of such a facility, and may
> actually require it if they need to address scalability-related issues
> (there's an older, related topic
> http://lists.lustre.org/pipermail/lustre-devel/2008-January/002230.html
> which I'm sure you're already aware of).
> 
> I think there are some holes in my understanding of the example you gave
> on using the binary heap in conjunction with GRN/CRN/RRN to implement a
> CBRR policy. Is the GRN incremented with every incoming request? And how
> can two requests end up with the same RRN? I'm sure the logic makes
> sense, this is probably me just being daft :-).
> 
> Can I ask if there is a binary heap implementation living in a branch
> somewhere, or you plan on landing one on master at some point? There is
> some code attached on the thread I have pasted above, but I have not had
> a look through it; going through the thread posts, it seems to be in
> working and well-performing state.
> 
> For the first policy we aim to develop ("obj_extents" in the HLD)
> although OBRR-like, we are not looking at using any data structures that
> are out of the norm, at present. The idea is that we 
> pick up requests from the service in-queue and add requests in
> number-limited (or perhaps size-limited) per-object groupings, sorted by
> offset, which are either populated in-place in the normal service
> request queue, or use the normal service request queue as an
> intermediate queue (just to avoid much processing in what is now the
> pre-processing stage for RPCs), before they are added to the sorted
> queue, from where they are eventually dispatched. We are looking at
> using a libcfs hash for resolving incoming requests to the appropriate
> object grouping, much like the OBRR implementation in the prototype NRS
> patch does; I think this method will naturally avoid any request
> starvation issues, but I am not sure it will perform as well as an OBRR
> implementation using a heap. Hope the description above makes some
> sense; if you see any problems with it, then please do let us know.
> 
> The NRS architectural specification at
> http://wiki.lustre.org/index.php/Architecture_-_Network_Request_Scheduler has references to additional aspects to the task. Are you looking at providing some form of POP capability, or reproducing offset stream ordering, at present? These both sound like features that perhaps the framework itself could benefit from, at least in part irrespective of policy, I think. We seem to have at least one use case where we could do with clients passing additional information with each RPC, in QoS, so it seems like resource control, and catering for offset stream consistency may add additional ones.
> 
> 
> Regards,
> Nikitas
> 
> 
> 
> 
> 
> 
> On Tue, 2011-07-05 at 17:20 +0800, Liang Zhen wrote: 
>> Hi Nathan,
>> 
>> We also have some ideas about infrastructure of NRS and would like to share
>> them at here for discussion:
>> 
>> - a common library to support inserting/removing prioritized requests, which 
>>  should be scalable even for millions of requests
>> 
>> - each policy only needs to provide:
>>  . private data ("heap", more details below)
>>  . a sorting callback to prioritize requests
>> 
>> By this way, most common code can be put into libcfs and ptlrpc modules,
>> instead of having a whole bunch of policy functions to sort requests in
>> each policy (like the patch on BZ 13634).
>> 
>> Here is the concept of "heap" mentioned above:
>> 
>> (it's quoted from the following link)
>> A heap is a specialized tree-based data structure that satisfies the heap
>> property: if B is a child node of A, then key(A) ? key(B). This implies that an
>> element with the greatest key is always in the root node, and so such a heap is
>> sometimes called a max-heap. Alternatively, if the comparison is reversed, the
>> smallest element is always in the root node, which results in a min-heap.
>> 
>> Please see more detail at:
>> http://en.wikipedia.org/wiki/Heap_%28data_structure%29
>> 
>> Heap can give (log N) for both inserting element and removing element with
>> greatest/lowest key, so it's very scalable, here are some sample heap APIs
>> from a simple prototype:
>> 
>>    /* create a heap, @compare is callback for sorting elements */
>>    cfs_binheap_t *cfs_binheap_create(cfs_binheap_cmp_t compare,
>>                                      unsigned int flags);
>> 
>>    /* destroy a heap */
>>    void cfs_binheap_destroy(cfs_binheap_t *h);
>> 
>>    /* insert @element into heap */
>>    int cfs_binheap_insert(cfs_binheap_t *h, void *elelment);
>> 
>>    /* pop root for the @heap, which has greatest/lowest key in @heap */
>>    void *cfs_binheap_pop_root(cfs_binheap_t *heap);
>> 
>> and an example of using "heap" to implement Client-Based Round Robin:
>> 
>>  - allocating a "heap" for the service which needs CBRR
>> 
>>  - the service also have a "global round-number" which is always incremental
>>    . GRN: Global Round Number
>>    . CRN: Client Round Number
>>    . RRN: Request Round Number
>> 
>>  Process of inserting a request
>>  - when there is new request:
>>    . if the incoming request is the first request for a client, assign
>>      the GRN to both the RRN of the request and CNR of the client
>> 
>>    . if the incoming request isn't the first request for a client,
>>      increase the CNR on the client then assign it to RRN of the request.
>> 
>>  - insert the request into the heap, the heap can sort requests based
>>    on callback like this:
>>    . primary key is RRN of request
>> 
>>    . secondary key is client ID (i.e: nid of client), secondary keys are
>>      compared only if primary keys are equal
>> 
>>  Process of dequeue a request:
>> 
>>  - pop the root element from the heap, which should always has the lowest RRN
>> 
>>  - if the RRN of the request is larger than GRN of the service,
>>    which means the service has already handled all requests with <= GRN and
>>    should set the GRN of service to the RRN of request
>> 
>> Again, this just an example to demonstrate how to use "heap" to implement CBRR
>> and show some preliminary ideas, it could be helpful to raise some discussion
>> at here and to see whether it's possible to improve and apply this to NRS and
>> make it to be a common library for future implementation of OBRR or any
>> policy needs to prioritize requests, or for QoS.
>> 
>> Thanks
>> Liang
>> 
>> On Jul 2, 2011, at 11:39 PM, Nathan Rutman wrote:
>> 
>>> I think there is a whole slew of interesting possibilities to try out once an infrastructure for implementing different policies is in place. We're trying to get that infrastructure established first; we will be able to switch between policies dynamically to see their effects and optimize performance. People will be able to easily add new policies to test. 
>>> You raise a good point about the SMP scaling patches; we will have to take a close look to make sure we don't undo any of the work you've done there. 
>>> 
>>> On Jul 2, 2011, at 3:49 AM, "Liang Zhen" <liang@whamcloud.com> wrote:
>>> 
>>>> Hi Nikitas,
>>>> 
>>>> It's a very interesting document, thanks for sharing these great ideas.
>>>> May I ask whether there are any tests/numbers based on OBRR description
>>>> in HLD? We always want to see numbers from these algorithms, unfortunately
>>>> there is no reliable testing result from the patch on BZ 13634.
>>>> 
>>>> We are actually considering about NRS as well, although we don't have a
>>>> formal design document like this HLD, but we'd like to share a few
>>>> preliminary ideas for discussion:
>>>> 
>>>> - Target Based Round Robin (TBRR) probably is something worth to have
>>>> 
>>>> Briefly, it's just load balancing over OSTs to improve overall server
>>>> performance.
>>>> 
>>>> - Fairness for clients and resource control for jobs
>>>> 
>>>> I think they are similar to CBRR and UBRR described in your document
>>>> though I didn't see too much detail about them in the HLD.
>>>> Personally, I think they are important and we probably will do some tests
>>>> for CBRR survey very soon.
>>>> 
>>>> . Client-Based Round Robin (CBRR) can guarantee the server responses
>>>>  to all clients fairly, and get whole-cluster load balancing, improve
>>>>  concurrency of clients and jobs, and get better overall performance.
>>>> 
>>>> . resource control for jobs, some users complained that a busy job will
>>>>  hog all resources on servers, and make the cluster not usable for other
>>>>  control command or sysadmin. So it might be helpful to support job
>>>>  resource control inside NRS.
>>>> 
>>>> - Layering NRS polices
>>>> 
>>>> Of course, OBRR is very important policy for NRS, but it might be
>>>> helpful to have multiple polices working at the same time, i.e:
>>>> 
>>>> . bind OSTs on CPU partitions on NUMA system(please see more detail at
>>>>  http://jira.whamcloud.com/browse/LU-56)
>>>> 
>>>> . Service threads on each CPU partition can do TBRR for bound OSTs. If
>>>>  there is no CPU partition (like current Lustre) or OSTs are not bound
>>>>  on CPU partitions, service threads just do round robin over all OSTs.
>>>> 
>>>> . OBRR inside each OST
>>>> 
>>>> . of course, these layers should be tunable.
>>>> 
>>>> - Overhead of layerd polices
>>>> . there definitely will be some overhead for inserting/removing
>>>>  request from these queues (or whatever), so we want some very scalable
>>>>  algorithms to implement these polices.
>>>> 
>>>> Again, these are just some preliminary ideas, so we would appreciate any
>>>> comment and suggestion.
>>>> 
>>>> Thanks
>>>> Liang
>>>> 
>>>> 
>>>> On Jun 30, 2011, at 11:49 PM, Nikitas Angelinas wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> There is a slightly more up-to-date version of the HLD, which I am
>>>>> attaching.
>>>>> 
>>>>> 
>>>>> Thanks,
>>>>> Nikitas
>>>>> 
>>>>> On Wed, 2011-06-29 at 12:55 -0700, Nathan Rutman wrote:
>>>>>> Sharing with the community.  All comments welcome.
>>>>>> This HLD (high-level design) for a Network Request Scheduler is more
>>>>>> about infrastructure than algorithm.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> We're actually in the DLD (detailed-level design) stage at the moment
>>>>>> (sorry it didn't occur to me to 
>>>>>> post this earlier).  I'll post the DLD after some minor revision.
>>>>>> _______________________________________________
>>>>>> Lustre-devel mailing list
>>>>>> Lustre-devel at lists.lustre.org
>>>>>> http://lists.lustre.org/mailman/listinfo/lustre-devel
>>>>> 
>>>>> 
>>>>> ______________________________________________________________________
>>>>> This email may contain privileged or confidential information, which should only be used for the purpose for which it was sent by Xyratex. No further rights or licenses are granted to use such information. If you are not the intended recipient of this message, please notify the sender by return and delete it. You may not use, copy, disclose or rely on the information contained in it.
>>>>> 
>>>>> Internet email is susceptible to data corruption, interception and unauthorised amendment for which Xyratex does not accept liability. While we have taken reasonable precautions to ensure that this email is free of viruses, Xyratex does not accept liability for the presence of any computer viruses in this email, nor for any losses caused as a result of viruses.
>>>>> 
>>>>> Xyratex Technology Limited (03134912), Registered in England & Wales, Registered Office, Langstone Road, Havant, Hampshire, PO9 1SA.
>>>>> 
>>>>> The Xyratex group of companies also includes, Xyratex Ltd, registered in Bermuda, Xyratex International Inc, registered in California, Xyratex (Malaysia) Sdn Bhd registered in Malaysia, Xyratex Technology (Wuxi) Co Ltd registered in The People's Republic of China and Xyratex Japan Limited registered in Japan.
>>>>> ______________________________________________________________________
>>>>> 
>>>>> 
>>>>> <HLD_of_Lustre_NRS.pdf>_______________________________________________
>>>>> Lustre-devel mailing list
>>>>> Lustre-devel at lists.lustre.org
>>>>> http://lists.lustre.org/mailman/listinfo/lustre-devel
>>>> 
>>>> _______________________________________________
>>>> Lustre-devel mailing list
>>>> Lustre-devel at lists.lustre.org
>>>> http://lists.lustre.org/mailman/listinfo/lustre-devel
>>> ______________________________________________________________________
>>> This email may contain privileged or confidential information, which should only be used for the purpose for which it was sent by Xyratex. No further rights or licenses are granted to use such information. If you are not the intended recipient of this message, please notify the sender by return and delete it. You may not use, copy, disclose or rely on the information contained in it.
>>> 
>>> Internet email is susceptible to data corruption, interception and unauthorised amendment for which Xyratex does not accept liability. While we have taken reasonable precautions to ensure that this email is free of viruses, Xyratex does not accept liability for the presence of any computer viruses in this email, nor for any losses caused as a result of viruses.
>>> 
>>> Xyratex Technology Limited (03134912), Registered in England & Wales, Registered Office, Langstone Road, Havant, Hampshire, PO9 1SA.
>>> 
>>> The Xyratex group of companies also includes, Xyratex Ltd, registered in Bermuda, Xyratex International Inc, registered in California, Xyratex (Malaysia) Sdn Bhd registered in Malaysia, Xyratex Technology (Wuxi) Co Ltd registered in The People's Republic of China and Xyratex Japan Limited registered in Japan.
>>> ______________________________________________________________________
>>> 
>>> 
>> 
> 
> 
> ______________________________________________________________________
> This email may contain privileged or confidential information, which should only be used for the purpose for which it was sent by Xyratex. No further rights or licenses are granted to use such information. If you are not the intended recipient of this message, please notify the sender by return and delete it. You may not use, copy, disclose or rely on the information contained in it.
> 
> Internet email is susceptible to data corruption, interception and unauthorised amendment for which Xyratex does not accept liability. While we have taken reasonable precautions to ensure that this email is free of viruses, Xyratex does not accept liability for the presence of any computer viruses in this email, nor for any losses caused as a result of viruses.
> 
> Xyratex Technology Limited (03134912), Registered in England & Wales, Registered Office, Langstone Road, Havant, Hampshire, PO9 1SA.
> 
> The Xyratex group of companies also includes, Xyratex Ltd, registered in Bermuda, Xyratex International Inc, registered in California, Xyratex (Malaysia) Sdn Bhd registered in Malaysia, Xyratex Technology (Wuxi) Co Ltd registered in The People's Republic of China and Xyratex Japan Limited registered in Japan.
> ______________________________________________________________________
> 
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Lustre-devel] NRS HLD
  2011-07-09  3:34           ` Liang Zhen
@ 2011-07-09  7:53             ` Andreas Dilger
  2011-07-11  8:20               ` Liang Zhen
  0 siblings, 1 reply; 10+ messages in thread
From: Andreas Dilger @ 2011-07-09  7:53 UTC (permalink / raw)
  To: lustre-devel

I think another policy type that is important to users is "minimum bandwidth" or "maximum bandwidth". Please ensure that any framework that is implemented can also be used in this manner. Ideally this would be part of the initial implementation. 

Cheers, Andreas

On 2011-07-08, at 9:34 PM, Liang Zhen <liang@whamcloud.com> wrote:

> Hi Nikitas,
> 
> As you can see, I've posted my prototype on our Jira, hope it can help on
> explaining how binheap can be used for NRS because it will be too complex
> for me to explain all things by English, :-)
> 
> http://jira.whamcloud.com/browse/LU-398
> http://jira.whamcloud.com/secure/attachment/10300/nrs_liang_v2.patch
> 
> I got this prototype a couple of weeks ago and we are looking for a chance
> to get some performance data with it.
> 
> It's not fully tested and not reviewed, so I posted it just for discussion
> and hope it can help us to understand things better and avoid rework.
> If I was wrong in the prototype or my points here, please just correct me
> either on Jira or at here.
> 
> the prototype contains:
> 
>  - implementation of binheap
> 
>  - a simple framework for NRS
> 
>  - client-round-robin implemented by binheap
> 
> Because it's just a prototype and I don't have any document (sorry...)
> so I'm going to give some introduction at here.
> 
>  A few concepts:
> 
>  - NRS policy
>    the algorithm of adding/sorting/removing requests
> 
>  - NRS policy head
>    . it's a list of NRS polices
> 
>    . a policy head must has a default-policy (i.e: FIFO), default-policy
>      should never fail and can't be disable
> 
>    . a policy-head can have one active-policy, or no active-policy at all.
>      If there is an active-policy, request should firstly be handled by it.
>      If the active-policy failed to handle request, the request will
>      be delivered to default-policy.
> 
>      If no active-policy, request will always be handled by default-policy.
> 
>      I think it's similar but a little different with your design(Please
>      correct me if I was wrong), the "secondary policy" in your design
>      is somehow like default-policy here, I'm wondering why we want to
>      specify secondary-policy for each policy, instead of just let all
>      polices share a default policy.
> 
>    . user can activate/deactivate poilicy at runtime
> 
>    . ptlrpc service should always poll over all polices to find pending
>      requests (even from inactive polices), so requests wouldn't be forgotten
>      forever when user deactivate policy at runtime.
> 
>  - NRS object
>    target object, i.e: it represents a client (or export) for
>    client-round-robin, or it represents an OST object for object-round-robin
> 
>  - NRS target
>    NRS-target is private data of NRS-policy, A NRS-target can contains
>    many NRS-objects, just like storage-target to storage-objects.
> 
>  - NRS request
>    NRS-request is a stub embedded in ptlrpc_request, each NRS-object can
>    have 1-N pending NRS-requests.
> 
>  Logic specifications
> 
>  - each service has two NRS-policy-heads, one for regular requests,
>    another for HP (high priority) requests
> 
>  - user can register/unregister a NRS-policy for a service, either for
>    for regular NRS-policy-heads, or for HP NRS-policy-heads
> 
>  - Two polices so far:
> 
>    . FIFO list, which is default
> 
>    . client-round-robin based on binheap
>      it's actually round-robin over exports for now, user can
>      activate/deactivate it by lprocfs.
>      We can change it to round-robin over client NIDs in the future,
>      or just add a new policy.
> 
>  - ptlrpc service threads call active/default NRS-policy to sort incoming
>    requests policy::nrs_ops::op_req_add()
> 
>    . requests can ben sorted by round & sequence number recorded on
>      NRS-target and NRS-object, buffer offset from request can also be
>      counted in as sorting key for using it as disk elevator in the future
> 
>  - ptlrpc service threads will poll request (round-robin) from all
>    NRS-polices on NRS-policy-head by calling policy::nrs_ops::op_req_first(),
>    which returns the request with the highest priority in the binheap or
>    just the first request if it's FIFO mode, the request can be removed from
>    NRS-target by calling policy::nrs_ops::op_req_del().
> 
>  - object-round-robin (not in the prototype)
> 
>    I agree cfs_hash can be used to find out object, also, I noticed the
>    patch on BZ 13634 is using list + rbtree to implement object-round-robin
>    and I'm not very clear about your sorting solution but I think binheap
>    could be an option. By using binheap,two data structures (list + rbtree,
>    or list+list) can be replaced by one (binheap), and all sorting operations
>    are invisible to user (user just needs to provide compare callback)
> 
>    The major difference between client-round-robin and object-round-robin is:
> 
>    . NRS-object of client-round-robin is a client, round-number on NRS-object
>      is unique for each request.
> 
>    . NRS-object of object-round-robin is an OST object, round-number on
>      NRS-object doesn't have to be unique, which means multiple NRS-requests
>      for one NRS-object can share the same round-number, and these requests
>      are sorted by buffer offset in binheap.
>      As you said, we can "consume" the round-number by counting grouped
>      RPCs or data size.
> 
> Anyway, these are just my preliminary ideas, any comment and suggestions
> would be great.
> 
> Some information about binheap:
> 
>  There are some changes for the binheap posted by Eric:
> 
>  - In the original version, user just needs address of element while
>    inserting it, users have no idea about where the element is and
>    can't remove element unless it's root of binheap.
>    In current version, user needs to provide a pointer of
>    cfs_binheap_node_t, which should be embedded in element structure.
> 
>    example:
>        struct foo_type {
>            cfs_binheap_node_t  node;
>        } foo;
>        cfs_binheap_insert(h, &foo.node);
> 
>    with this change, user can remove element at any position
> 
>    example:
>        cfs_binheap_remove(h, &foo.node);
> 
>  - there are two options if user wants to have atomically insert:
> 
>    . preallocate memory for binheap if caller can estimate heap size,
>      cfs_binheap_create(..., count);
> 
>    . use flag CBH_FLAG_ATOMIC_GROW
>      cfs_binheap_create(..., CBH_FLAG_ATOMIC_GROW, count);
> 
>  Here is the list of all APIs
> 
>    cfs_binheap_t *cfs_binheap_create(cfs_binheap_ops_t *ops,
>                                      unsigned int flags, unsigned count);
> 
>    void cfs_binheap_destroy(cfs_binheap_t *h);
> 
>    cfs_binheap_node_t *cfs_binheap_find(cfs_binheap_t *h, unsigned int idx);
> 
>    int cfs_binheap_insert(cfs_binheap_t *h, cfs_binheap_node_t *e);
> 
>    void cfs_binheap_remove(cfs_binheap_t *h, cfs_binheap_node_t *e);
> 
>  And some inline wrappers:
> 
>    int cfs_binheap_size(cfs_binheap_t *h)
> 
>    int cfs_binheap_is_empty(cfs_binheap_t *h)
> 
>    cfs_binheap_node_t *cfs_binheap_root(cfs_binheap_t *h)
> 
>    cfs_binheap_node_t *cfs_binheap_remove_root(cfs_binheap_t *h)
> 
> Anyway, hope these discussion can help us to understand the whole thing better
> so we can get agreement about how to move on this project.
> 
> Thanks
> Liang
> 
> 
> 
> On Jul 7, 2011, at 10:34 PM, Nikitas Angelinas wrote:
> 
>> Hi Liang,
>> 
>> 
>> Thank you for sharing these great ideas; I think the task lends itself
>> well to adding different schemes that performance can benefit from. To
>> answer an earlier question, we unfortunately don't have any numbers or
>> have performed any tests related to the NRS task. We'll obviously
>> publish any such data as soon as we get our first policy developed, and
>> under test.
>> 
>> Both Nathan and I think you raise a very good point by highlighting the
>> need for a scalable data structure that can sort elements in libcfs, as
>> different policies could probably make use of such a facility, and may
>> actually require it if they need to address scalability-related issues
>> (there's an older, related topic
>> http://lists.lustre.org/pipermail/lustre-devel/2008-January/002230.html
>> which I'm sure you're already aware of).
>> 
>> I think there are some holes in my understanding of the example you gave
>> on using the binary heap in conjunction with GRN/CRN/RRN to implement a
>> CBRR policy. Is the GRN incremented with every incoming request? And how
>> can two requests end up with the same RRN? I'm sure the logic makes
>> sense, this is probably me just being daft :-).
>> 
>> Can I ask if there is a binary heap implementation living in a branch
>> somewhere, or you plan on landing one on master at some point? There is
>> some code attached on the thread I have pasted above, but I have not had
>> a look through it; going through the thread posts, it seems to be in
>> working and well-performing state.
>> 
>> For the first policy we aim to develop ("obj_extents" in the HLD)
>> although OBRR-like, we are not looking at using any data structures that
>> are out of the norm, at present. The idea is that we 
>> pick up requests from the service in-queue and add requests in
>> number-limited (or perhaps size-limited) per-object groupings, sorted by
>> offset, which are either populated in-place in the normal service
>> request queue, or use the normal service request queue as an
>> intermediate queue (just to avoid much processing in what is now the
>> pre-processing stage for RPCs), before they are added to the sorted
>> queue, from where they are eventually dispatched. We are looking at
>> using a libcfs hash for resolving incoming requests to the appropriate
>> object grouping, much like the OBRR implementation in the prototype NRS
>> patch does; I think this method will naturally avoid any request
>> starvation issues, but I am not sure it will perform as well as an OBRR
>> implementation using a heap. Hope the description above makes some
>> sense; if you see any problems with it, then please do let us know.
>> 
>> The NRS architectural specification at
>> http://wiki.lustre.org/index.php/Architecture_-_Network_Request_Scheduler has references to additional aspects to the task. Are you looking at providing some form of POP capability, or reproducing offset stream ordering, at present? These both sound like features that perhaps the framework itself could benefit from, at least in part irrespective of policy, I think. We seem to have at least one use case where we could do with clients passing additional information with each RPC, in QoS, so it seems like resource control, and catering for offset stream consistency may add additional ones.
>> 
>> 
>> Regards,
>> Nikitas
>> 
>> 
>> 
>> 
>> 
>> 
>> On Tue, 2011-07-05 at 17:20 +0800, Liang Zhen wrote: 
>>> Hi Nathan,
>>> 
>>> We also have some ideas about infrastructure of NRS and would like to share
>>> them at here for discussion:
>>> 
>>> - a common library to support inserting/removing prioritized requests, which 
>>> should be scalable even for millions of requests
>>> 
>>> - each policy only needs to provide:
>>> . private data ("heap", more details below)
>>> . a sorting callback to prioritize requests
>>> 
>>> By this way, most common code can be put into libcfs and ptlrpc modules,
>>> instead of having a whole bunch of policy functions to sort requests in
>>> each policy (like the patch on BZ 13634).
>>> 
>>> Here is the concept of "heap" mentioned above:
>>> 
>>> (it's quoted from the following link)
>>> A heap is a specialized tree-based data structure that satisfies the heap
>>> property: if B is a child node of A, then key(A) ? key(B). This implies that an
>>> element with the greatest key is always in the root node, and so such a heap is
>>> sometimes called a max-heap. Alternatively, if the comparison is reversed, the
>>> smallest element is always in the root node, which results in a min-heap.
>>> 
>>> Please see more detail at:
>>> http://en.wikipedia.org/wiki/Heap_%28data_structure%29
>>> 
>>> Heap can give (log N) for both inserting element and removing element with
>>> greatest/lowest key, so it's very scalable, here are some sample heap APIs
>>> from a simple prototype:
>>> 
>>>   /* create a heap, @compare is callback for sorting elements */
>>>   cfs_binheap_t *cfs_binheap_create(cfs_binheap_cmp_t compare,
>>>                                     unsigned int flags);
>>> 
>>>   /* destroy a heap */
>>>   void cfs_binheap_destroy(cfs_binheap_t *h);
>>> 
>>>   /* insert @element into heap */
>>>   int cfs_binheap_insert(cfs_binheap_t *h, void *elelment);
>>> 
>>>   /* pop root for the @heap, which has greatest/lowest key in @heap */
>>>   void *cfs_binheap_pop_root(cfs_binheap_t *heap);
>>> 
>>> and an example of using "heap" to implement Client-Based Round Robin:
>>> 
>>> - allocating a "heap" for the service which needs CBRR
>>> 
>>> - the service also have a "global round-number" which is always incremental
>>>   . GRN: Global Round Number
>>>   . CRN: Client Round Number
>>>   . RRN: Request Round Number
>>> 
>>> Process of inserting a request
>>> - when there is new request:
>>>   . if the incoming request is the first request for a client, assign
>>>     the GRN to both the RRN of the request and CNR of the client
>>> 
>>>   . if the incoming request isn't the first request for a client,
>>>     increase the CNR on the client then assign it to RRN of the request.
>>> 
>>> - insert the request into the heap, the heap can sort requests based
>>>   on callback like this:
>>>   . primary key is RRN of request
>>> 
>>>   . secondary key is client ID (i.e: nid of client), secondary keys are
>>>     compared only if primary keys are equal
>>> 
>>> Process of dequeue a request:
>>> 
>>> - pop the root element from the heap, which should always has the lowest RRN
>>> 
>>> - if the RRN of the request is larger than GRN of the service,
>>>   which means the service has already handled all requests with <= GRN and
>>>   should set the GRN of service to the RRN of request
>>> 
>>> Again, this just an example to demonstrate how to use "heap" to implement CBRR
>>> and show some preliminary ideas, it could be helpful to raise some discussion
>>> at here and to see whether it's possible to improve and apply this to NRS and
>>> make it to be a common library for future implementation of OBRR or any
>>> policy needs to prioritize requests, or for QoS.
>>> 
>>> Thanks
>>> Liang
>>> 
>>> On Jul 2, 2011, at 11:39 PM, Nathan Rutman wrote:
>>> 
>>>> I think there is a whole slew of interesting possibilities to try out once an infrastructure for implementing different policies is in place. We're trying to get that infrastructure established first; we will be able to switch between policies dynamically to see their effects and optimize performance. People will be able to easily add new policies to test. 
>>>> You raise a good point about the SMP scaling patches; we will have to take a close look to make sure we don't undo any of the work you've done there. 
>>>> 
>>>> On Jul 2, 2011, at 3:49 AM, "Liang Zhen" <liang@whamcloud.com> wrote:
>>>> 
>>>>> Hi Nikitas,
>>>>> 
>>>>> It's a very interesting document, thanks for sharing these great ideas.
>>>>> May I ask whether there are any tests/numbers based on OBRR description
>>>>> in HLD? We always want to see numbers from these algorithms, unfortunately
>>>>> there is no reliable testing result from the patch on BZ 13634.
>>>>> 
>>>>> We are actually considering about NRS as well, although we don't have a
>>>>> formal design document like this HLD, but we'd like to share a few
>>>>> preliminary ideas for discussion:
>>>>> 
>>>>> - Target Based Round Robin (TBRR) probably is something worth to have
>>>>> 
>>>>> Briefly, it's just load balancing over OSTs to improve overall server
>>>>> performance.
>>>>> 
>>>>> - Fairness for clients and resource control for jobs
>>>>> 
>>>>> I think they are similar to CBRR and UBRR described in your document
>>>>> though I didn't see too much detail about them in the HLD.
>>>>> Personally, I think they are important and we probably will do some tests
>>>>> for CBRR survey very soon.
>>>>> 
>>>>> . Client-Based Round Robin (CBRR) can guarantee the server responses
>>>>> to all clients fairly, and get whole-cluster load balancing, improve
>>>>> concurrency of clients and jobs, and get better overall performance.
>>>>> 
>>>>> . resource control for jobs, some users complained that a busy job will
>>>>> hog all resources on servers, and make the cluster not usable for other
>>>>> control command or sysadmin. So it might be helpful to support job
>>>>> resource control inside NRS.
>>>>> 
>>>>> - Layering NRS polices
>>>>> 
>>>>> Of course, OBRR is very important policy for NRS, but it might be
>>>>> helpful to have multiple polices working at the same time, i.e:
>>>>> 
>>>>> . bind OSTs on CPU partitions on NUMA system(please see more detail at
>>>>> http://jira.whamcloud.com/browse/LU-56)
>>>>> 
>>>>> . Service threads on each CPU partition can do TBRR for bound OSTs. If
>>>>> there is no CPU partition (like current Lustre) or OSTs are not bound
>>>>> on CPU partitions, service threads just do round robin over all OSTs.
>>>>> 
>>>>> . OBRR inside each OST
>>>>> 
>>>>> . of course, these layers should be tunable.
>>>>> 
>>>>> - Overhead of layerd polices
>>>>> . there definitely will be some overhead for inserting/removing
>>>>> request from these queues (or whatever), so we want some very scalable
>>>>> algorithms to implement these polices.
>>>>> 
>>>>> Again, these are just some preliminary ideas, so we would appreciate any
>>>>> comment and suggestion.
>>>>> 
>>>>> Thanks
>>>>> Liang
>>>>> 
>>>>> 
>>>>> On Jun 30, 2011, at 11:49 PM, Nikitas Angelinas wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> There is a slightly more up-to-date version of the HLD, which I am
>>>>>> attaching.
>>>>>> 
>>>>>> 
>>>>>> Thanks,
>>>>>> Nikitas
>>>>>> 
>>>>>> On Wed, 2011-06-29 at 12:55 -0700, Nathan Rutman wrote:
>>>>>>> Sharing with the community.  All comments welcome.
>>>>>>> This HLD (high-level design) for a Network Request Scheduler is more
>>>>>>> about infrastructure than algorithm.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> We're actually in the DLD (detailed-level design) stage at the moment
>>>>>>> (sorry it didn't occur to me to 
>>>>>>> post this earlier).  I'll post the DLD after some minor revision.
>>>>>>> _______________________________________________
>>>>>>> Lustre-devel mailing list
>>>>>>> Lustre-devel at lists.lustre.org
>>>>>>> http://lists.lustre.org/mailman/listinfo/lustre-devel
>>>>>> 
>>>>>> 
>>>>>> ______________________________________________________________________
>>>>>> This email may contain privileged or confidential information, which should only be used for the purpose for which it was sent by Xyratex. No further rights or licenses are granted to use such information. If you are not the intended recipient of this message, please notify the sender by return and delete it. You may not use, copy, disclose or rely on the information contained in it.
>>>>>> 
>>>>>> Internet email is susceptible to data corruption, interception and unauthorised amendment for which Xyratex does not accept liability. While we have taken reasonable precautions to ensure that this email is free of viruses, Xyratex does not accept liability for the presence of any computer viruses in this email, nor for any losses caused as a result of viruses.
>>>>>> 
>>>>>> Xyratex Technology Limited (03134912), Registered in England & Wales, Registered Office, Langstone Road, Havant, Hampshire, PO9 1SA.
>>>>>> 
>>>>>> The Xyratex group of companies also includes, Xyratex Ltd, registered in Bermuda, Xyratex International Inc, registered in California, Xyratex (Malaysia) Sdn Bhd registered in Malaysia, Xyratex Technology (Wuxi) Co Ltd registered in The People's Republic of China and Xyratex Japan Limited registered in Japan.
>>>>>> ______________________________________________________________________
>>>>>> 
>>>>>> 
>>>>>> <HLD_of_Lustre_NRS.pdf>_______________________________________________
>>>>>> Lustre-devel mailing list
>>>>>> Lustre-devel at lists.lustre.org
>>>>>> http://lists.lustre.org/mailman/listinfo/lustre-devel
>>>>> 
>>>>> _______________________________________________
>>>>> Lustre-devel mailing list
>>>>> Lustre-devel at lists.lustre.org
>>>>> http://lists.lustre.org/mailman/listinfo/lustre-devel
>>>> ______________________________________________________________________
>>>> This email may contain privileged or confidential information, which should only be used for the purpose for which it was sent by Xyratex. No further rights or licenses are granted to use such information. If you are not the intended recipient of this message, please notify the sender by return and delete it. You may not use, copy, disclose or rely on the information contained in it.
>>>> 
>>>> Internet email is susceptible to data corruption, interception and unauthorised amendment for which Xyratex does not accept liability. While we have taken reasonable precautions to ensure that this email is free of viruses, Xyratex does not accept liability for the presence of any computer viruses in this email, nor for any losses caused as a result of viruses.
>>>> 
>>>> Xyratex Technology Limited (03134912), Registered in England & Wales, Registered Office, Langstone Road, Havant, Hampshire, PO9 1SA.
>>>> 
>>>> The Xyratex group of companies also includes, Xyratex Ltd, registered in Bermuda, Xyratex International Inc, registered in California, Xyratex (Malaysia) Sdn Bhd registered in Malaysia, Xyratex Technology (Wuxi) Co Ltd registered in The People's Republic of China and Xyratex Japan Limited registered in Japan.
>>>> ______________________________________________________________________
>>>> 
>>>> 
>>> 
>> 
>> 
>> ______________________________________________________________________
>> This email may contain privileged or confidential information, which should only be used for the purpose for which it was sent by Xyratex. No further rights or licenses are granted to use such information. If you are not the intended recipient of this message, please notify the sender by return and delete it. You may not use, copy, disclose or rely on the information contained in it.
>> 
>> Internet email is susceptible to data corruption, interception and unauthorised amendment for which Xyratex does not accept liability. While we have taken reasonable precautions to ensure that this email is free of viruses, Xyratex does not accept liability for the presence of any computer viruses in this email, nor for any losses caused as a result of viruses.
>> 
>> Xyratex Technology Limited (03134912), Registered in England & Wales, Registered Office, Langstone Road, Havant, Hampshire, PO9 1SA.
>> 
>> The Xyratex group of companies also includes, Xyratex Ltd, registered in Bermuda, Xyratex International Inc, registered in California, Xyratex (Malaysia) Sdn Bhd registered in Malaysia, Xyratex Technology (Wuxi) Co Ltd registered in The People's Republic of China and Xyratex Japan Limited registered in Japan.
>> ______________________________________________________________________
>> 
>> 
> 
> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Lustre-devel] NRS HLD
  2011-07-09  7:53             ` Andreas Dilger
@ 2011-07-11  8:20               ` Liang Zhen
  2011-07-11 19:46                 ` Andreas Dilger
  0 siblings, 1 reply; 10+ messages in thread
From: Liang Zhen @ 2011-07-11  8:20 UTC (permalink / raw)
  To: lustre-devel

Andreas, could you give a little more information about this, or point me to the link about this?

Thanks
Liang

On Jul 9, 2011, at 3:53 PM, Andreas Dilger wrote:

> I think another policy type that is important to users is "minimum bandwidth" or "maximum bandwidth". Please ensure that any framework that is implemented can also be used in this manner. Ideally this would be part of the initial implementation. 
> 
> Cheers, Andreas
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Lustre-devel] NRS HLD
  2011-07-11  8:20               ` Liang Zhen
@ 2011-07-11 19:46                 ` Andreas Dilger
  0 siblings, 0 replies; 10+ messages in thread
From: Andreas Dilger @ 2011-07-11 19:46 UTC (permalink / raw)
  To: lustre-devel

I don't know if this feature is listed specifically in Qian's NRS design, but the idea is simple. The desire is to guarantee some node or cluster (~= LNET network) a specific amount of bandwidth, so that it is not starved for bandwidth when competing against a large cluster with many thousands of clients. Simple Client-Based RR will uniformly allocate bandwidth to each client, but this will weight the aggregate bandwidth by the ratio of clients in each network.

Having Network-Based RR would be better (allowing uniform bandwidth sharing among the networks) , but the most flexible and usable policy from a user/admin point of view is to ensure that some client/network can always get N MB/s, preferably in relatively uniform increments over the second, if it has any outstanding requests.

A simple extension to this policy is to allow limiting bandwidth for a specific client or network to some maximum over each second. 

Cheers, Andreas

On 2011-07-11, at 2:20 AM, Liang Zhen <liang@whamcloud.com> wrote:

> Andreas, could you give a little more information about this, or point me to the link about this?
> 
> Thanks
> Liang
> 
> On Jul 9, 2011, at 3:53 PM, Andreas Dilger wrote:
> 
>> I think another policy type that is important to users is "minimum bandwidth" or "maximum bandwidth". Please ensure that any framework that is implemented can also be used in this manner. Ideally this would be part of the initial implementation. 
>> 
>> Cheers, Andreas
>> 
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2011-07-11 19:46 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-06-29 19:55 [Lustre-devel] NRS HLD Nathan Rutman
2011-06-30 15:49 ` Nikitas Angelinas
2011-07-02 10:48   ` Liang Zhen
2011-07-02 15:39     ` Nathan Rutman
2011-07-05  9:20       ` Liang Zhen
2011-07-07 14:34         ` Nikitas Angelinas
2011-07-09  3:34           ` Liang Zhen
2011-07-09  7:53             ` Andreas Dilger
2011-07-11  8:20               ` Liang Zhen
2011-07-11 19:46                 ` Andreas Dilger

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.