* [lustre-devel] Quality of Service Planning in Lustre
From: Jürgen Kaiser @ 2017-01-16 15:28 UTC
  To: lustre-devel

Hello everyone,

My name is Jürgen Kaiser and I'm a research assistant at the Johannes Gutenberg University Mainz. As part of an IPCC project, we developed a tool for handling Quality of Service requirements in Lustre. We would like to explain what we did and hear your thoughts about it. We hope that our work will help the Lustre community in the future.

=== What we did ===

HPC centers typically use scheduling systems such as LSF or Slurm to manage both the users' computations and the resources themselves. These schedulers require users to define compute jobs and submit them; in return, the schedulers guarantee that each job will have the compute resources it requires. So far, storage bandwidth is (mostly) excluded from the list of managed resources. Unlike CPU or memory, there is no easy way for schedulers to manage storage bandwidth because they lack knowledge of storage system internals. For example, to reason about the available read throughput for a file, a scheduler must know the placement of the file's data on Lustre's OSTs, the current workload, and the maximum performance of the involved components.

It would be more practical if the storage system provided a generic API for schedulers to set Quality of Service (QoS) configurations while abstracting the internal dependencies. With such an interface, a scheduler could easily request available storage resources. In our project, we're developing a Quality of Service Planner (QoSP) that provides this interface. Its task is to receive storage bandwidth requests (e.g. 100 MB/s read throughput for files X, Y, Z for one hour), check for resource availability and, if available, guarantee the reserved resources by configuring Lustre. The main tool here is a (modified) TBF strategy in Lustre's NRS.
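To make the interface more concrete, here is a minimal sketch (in C) of what such a reservation request could contain. The struct and its field names are purely illustrative and are not the actual qos-planner API:

#include <stdint.h>  /* uint32_t, uint64_t */
#include <time.h>    /* time_t */

/* Illustrative sketch only -- not the real qos-planner interface. */
struct qosp_request {
        const char **paths;      /* files the job will access, e.g. X, Y, Z */
        int          nr_paths;   /* number of entries in paths */
        uint64_t     read_mb_s;  /* requested read throughput in MB/s */
        uint64_t     write_mb_s; /* requested write throughput in MB/s */
        time_t       start;      /* start of the reservation window */
        uint32_t     duration_s; /* window length, e.g. 3600 for one hour */
};

The QoSP would answer such a request either with a reservation or with a rejection (ideally including the earliest time at which the request could be satisfied).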

The QoSP is still under development. You can see the code on github: https://github.com/jkaiser/qos-planner . We see further use cases beyond the HPC scheduling scenario. For example, applications could negotiate I/O phases with the storage backend (e.g. HPC checkpoints). We would like to hear your thoughts about this project in general and about several problems we face in detail. You can find a PDF with a detailed description and a discussion of some core problems here: https://seafile.rlp.net/f/55e4c7b619/?raw=1. Two example issues are:

=== The NRS and the TBF ===

Users require _guaranteed_ storage bandwidth to reason about the run time of their applications so that they can reserve enough time on the computation cluster. In other words: users require minimum bandwidths instead of maximum ones. The current TBF strategy, however, only supports upper thresholds. There are two options here:
1) Implement minimums indirectly. This involves monitoring the actual resource consumption on the OSTs and repeatedly readjusting TBF rules.
2) Modify the TBF so that it supports lower thresholds. Here, the NRS would try to satisfy the minimums first. This has the additional advantage that no bandwidth is left underutilized: each job can use free resources if necessary because there is no upper limit.

We would like to implement option 2. We are in contact with DDN and are discussing this work, including the changes to the TBF strategy.
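To make the discussion concrete: today's TBF rules, as documented for Lustre's NRS, express upper limits in RPCs per second for a class of requests (selected by NID or JobID), along the lines of:

  lctl set_param ost.OSS.ost_io.nrs_policies="tbf jobid"
  lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start job_a jobid={simul.500} rate=100"

Option 2 would add a lower threshold. A hypothetical syntax for such a rule -- this does not exist in Lustre today and is only meant to illustrate the idea -- could be:

  lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start job_a jobid={simul.500} rate_min=100"

where the NRS first schedules enough RPCs to keep every class at its rate_min and only then distributes the remaining capacity freely.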

=== Handling Write Throughput ===

When an application requests write throughput, this usually means that it will create new files. At request time, however, these files do not yet exist, so the QoSP cannot know which OSTs will have to handle the write throughput. Hence, the QoSP somehow must predict or determine the placement of new files on the OSTs within the Lustre system. This requires several modifications in Lustre, including a new interface to query such information. We would like to discuss this issue with the Lustre community. The mentioned PDF file contains further details. We would be happy to hear your thoughts on this.
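As a strawman for such a query interface -- this function does not exist in Lustre; the name and signature are purely illustrative -- the QoSP would need something like:

#include <stdint.h>

/* Hypothetical -- no such function exists in Lustre today. Given a target
 * directory and the expected number/size of new files, report which OSTs
 * the MDS would likely allocate the objects on. */
int llapi_predict_object_placement(const char *parent_dir,
                                   int expected_files,
                                   uint64_t expected_file_size,
                                   uint32_t *ost_indices, int *nr_osts);

With such a call, the QoSP could translate a write reservation into TBF rules on exactly the OSTs that will carry the load.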

Best Regards,
Jürgen Kaiser

* [lustre-devel] Quality of Service Planning in Lustre
From: Dilger, Andreas @ 2017-01-19  0:32 UTC
  To: lustre-devel

On Jan 16, 2017, at 08:28, Jürgen Kaiser <kaiserj@uni-mainz.de> wrote:
> 
> Hello everyone,
> 
> My name is Jürgen Kaiser and I'm a research assistant at the Johannes Gutenberg University Mainz. As part of an IPCC project, we developed a tool for handling Quality of Service requirements in Lustre. We would like to explain what we did and hear your thoughts about it. We hope that our work will help the Lustre community in the future.
> 
> === What we did ===
> 
> HPC centers typically use scheduling systems such as LSF or Slurm to manage both the users' computations and the resources themselves. These schedulers require users to define compute jobs and submit them; in return, the schedulers guarantee that each job will have the compute resources it requires. So far, storage bandwidth is (mostly) excluded from the list of managed resources. Unlike CPU or memory, there is no easy way for schedulers to manage storage bandwidth because they lack knowledge of storage system internals. For example, to reason about the available read throughput for a file, a scheduler must know the placement of the file's data on Lustre's OSTs, the current workload, and the maximum performance of the involved components.
> 
> It would be more practical if the storage system provided a generic API for schedulers to set Quality of Service (QoS) configurations while abstracting the internal dependencies. With such an interface, a scheduler could easily request available storage resources. In our project, we're developing a Quality of Service Planner (QoSP) that provides this interface. Its task is to receive storage bandwidth requests (e.g. 100 MB/s read throughput for files X, Y, Z for one hour), check for resource availability and, if available, guarantee the reserved resources by configuring Lustre. The main tool here is a (modified) TBF strategy in Lustre's NRS.

I think there is an open question about what sort of granularity of I/O bandwidth an application needs in order to do its job.  I think the main goal of the user is to avoid having their job contend with another large job that is consuming all of the bandwidth for a long period of time, but they don't necessarily need real-time I/O guarantees.  At the same time, from process scheduling on CPUs we know the most globally efficient scheduling algorithm is "shortest job first", so that small jobs can complete their I/O quickly and return to computation, while the large job is going to take a long time in either case.

It may be that instead of working on hard bandwidth guarantees (i.e. job A needs 100MB/s, while job B needs 500MB/s), which are hard for users to determine (do they want the maximum bandwidth all of the time?), it might be better to have jobs provide information about how large their I/O is going to be, and how often.  That is something that users can determine quite easily from one run to the next, and would allow a global scheduler to know that job A wants to read 10 GB of files every 10 minutes, while job B wants to write 10TB of data every 60 minutes.  When job A starts its read it should preempt job B (if currently doing I/O) so it can complete all of its reads quickly and go back to work, and job B will still be writing that 10TB long after job A has finished.

> The QoSP is still under development. You can see the code on github: https://github.com/jkaiser/qos-planner . We see further use cases beyond the HPC scheduling scenario. For example, applications could negotiate I/O phases with the storage backend (e.g. HPC checkpoints). We would like to hear your thoughts about this project in general and about several problems we face in detail. You can find a PDF with a detailed description and a discussion of some core problems here: https://seafile.rlp.net/f/55e4c7b619/?raw=1. Two example issues are:
> 
> === The NRS and the TBF ===
> 
> Users require _guaranteed_ storage bandwidth to reason about the run time of their applications so that they can reserve enough time on the computation cluster. In other words: users require minimum bandwidths instead of maximum ones. The current TBF strategy, however, only supports upper thresholds. There are two options here:
> 1) Implement minimums indirectly. This involves monitoring the actual resource consumption on the OSTs and repeatedly readjusting TBF rules.
> 2) Modify the TBF so that it supports lower thresholds. Here, the NRS would try to satisfy the minimums first. This has the additional advantage that no bandwidth is left underutilized: each job can use free resources if necessary because there is no upper limit.
> 
> We would like to implement option 2. We are in contact with DDN and are discussing this work, including the changes to the TBF strategy.

Good to hear that you are in contact with DDN on this, since they are the TBF developers and are already working to improve that code, and can help as needed for global NRS scheduling.

> === Handling Write Throughput ===
> 
> When an application requests write throughput, this usually means that it will create new files. At request time, however, these files do not yet exist, so the QoSP cannot know which OSTs will have to handle the write throughput. Hence, the QoSP somehow must predict or determine the placement of new files on the OSTs within the Lustre system. This requires several modifications in Lustre, including a new interface to query such information. We would like to discuss this issue with the Lustre community. The mentioned PDF file contains further details. We would be happy to hear your thoughts on this.

If the application provides the output directory for the new files, the number of files, and the size, the QoSP can have a very good idea of which OSTs will be used.  In most cases, the number of files exceeds the OST count, or a single file will be striped across all OSTs so the I/O will be evenly balanced across OSTs and it doesn't matter what the exact file placement will be.  If something like an OST pool is set on the directory, this can also be determined by "lfs getstripe" or llapi_layout_get_by_path() or similar.
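For reference, here is a minimal sketch of how a tool could read that placement information through the existing layout API (assuming the lustreapi headers and a mounted Lustre filesystem; error handling trimmed for brevity):

#include <stdio.h>
#include <lustre/lustreapi.h>

/* Print which OST index backs each stripe of an existing file. */
static int print_ost_placement(const char *path)
{
        struct llapi_layout *layout = llapi_layout_get_by_path(path, 0);
        uint64_t count, i, ost_idx;

        if (layout == NULL)
                return -1;
        if (llapi_layout_stripe_count_get(layout, &count) == 0)
                for (i = 0; i < count; i++)
                        if (llapi_layout_ost_index_get(layout, i, &ost_idx) == 0)
                                printf("stripe %llu -> OST index %llu\n",
                                       (unsigned long long)i,
                                       (unsigned long long)ost_idx);
        llapi_layout_free(layout);
        return 0;
}

The same llapi_layout_* calls can be pointed at a directory to read its default layout (stripe count, pool, etc.), which is what matters for predicting the placement of files created later.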

I think fine-grained scheduling of the I/O performance of every file in the filesystem is not necessary to achieve improvements in aggregate I/O performance across jobs.  High-level prioritization of jobs is probably enough to gain the majority of the performance improvements possible.  For example, if the QoSP knows two jobs have similar I/O times during their busy period and are contending for writes on OSTs, then the first job to submit RPCs can have a short-term but significant boost in I/O priority so it can complete all of its I/O before the second job does.  Even if the jobs do I/O at the same frequency (e.g. every 60 minutes) this would naturally offset the two jobs in time to avoid contention in the future.

If the QoSP doesn't get any information about a job, it could potentially generate this dynamically from the job's previous I/O submissions (e.g. steady-state reader/writer of N MB/s, bursty every X seconds for Y GB, etc) to use it for later submissions and/or dump this in the job epilog so the user knows what information to submit for later test runs.
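A sketch of the kind of per-job record the QoSP could accumulate from monitoring (illustrative only; the struct and names are invented for this example):

#include <stdint.h>

/* Hypothetical learned I/O profile for one job. */
struct job_io_profile {
        uint64_t steady_mb_s;    /* steady-state read/write rate (N MB/s) */
        uint64_t burst_bytes;    /* bytes moved per burst (Y GB) */
        uint32_t burst_period_s; /* seconds between bursts (X) */
        uint32_t nr_runs;        /* how many runs the profile is based on */
};

Dumping such a record in the job epilog would give users a concrete template for what to submit with later runs.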

I suspect that even if the user is specifying the job I/O pattern explicitly, this should only be taken as "initial values" and the QoSP should determine what actual I/O pattern the job has (and report large discrepancies in the epilog).  The I/O pattern may change significantly based on input parameters, changes to the system, etc. that make the user-provided data inaccurate.

Cheers, Andreas


* [lustre-devel] Quality of Service Planning in Lustre
From: Jürgen Kaiser @ 2017-01-23  9:28 UTC
  To: lustre-devel

Hi,

> On 19.01.2017 at 04:44, Li Xi <lixi@ddn.com> wrote:
> 
> Hi,
> 
> Glad to hear from you, Andreas.
> 
>> I think there is an open question about what sort of granularity of I/O bandwidth an application needs in order to do its job.  I think the main goal of the user is to avoid having their job contend with another large job that is consuming all of the bandwidth for a long period of time, but they don't necessarily need real-time I/O guarantees.  At the same time, from process scheduling on CPUs we know the most globally efficient scheduling algorithm is "shortest job first", so that small jobs can complete their I/O quickly and return to computation, while the large job is going to take a long time in either case.
> 
> I agree that real-time bandwidth guarantees might not be necessary in some (maybe most) circumstances. But these circumstances have some preconditions or assumptions, namely: 1) jobs have the same priority and can be scheduled as needed; 2) the jobs are queued and waiting for scheduling by QoSP & Slurm; 3) the ultimate goal is to get the highest throughput of jobs, i.e. to maximize the sum of the weights of the finished jobs.
> 
> However, these assumptions might not always be true. For example:
> 
> 1) Jobs have the same priority and can be scheduled as needed -> A job has extremely high priority because the user needs to finish it before a deadline, otherwise he/she will be fired.
> 
> 2) The jobs are queued and waiting for scheduling by QoSP & Slurm -> A user happens to be unable to sleep in the middle of the night and starts a huge job.
> 
> 3) The ultimate goal is to get the highest throughput of jobs -> A user complains a lot because he submits a huge job but needs to wait behind a mass of small jobs.
> 
> Under these circumstances, real-time performance guarantees are more friendly to the users, since the behaviors and results are more predictable. Another advantage is that the administrator could charge the users according to the QoS guarantees. That means, if a user is not satisfied with the current guarantees, he/she can pay more for better ones.
> 

Agreed. However, we see bandwidth guarantees as a long-term (research) goal. As an intermediate step, the QoSP should help to prevent situations like the one Andreas described: no two high-bandwidth applications at the same time.

>> It may be that instead of working on hard bandwidth guarantees (i.e. job A needs 100MB/s, while job B needs 500MB/s), which are hard for users to determine (do they want the maximum bandwidth all of the time?), it might be better to have jobs provide information about how large their I/O is going to be, and how often.  That is something that users can determine quite easily from one run to the next, and would allow a global scheduler to know that job A wants to read 10 GB of files every 10 minutes, while job B wants to write 10TB of data every 60 minutes.  When job A starts its read it should preempt job B (if currently doing I/O) so it can complete all of its reads quickly and go back to work, and job B will still be writing that 10TB long after job A has finished.
> 
> Ideally, users know clearly what the I/O patterns of their jobs look like, and also how much performance the jobs will need. Unfortunately, it seems difficult to understand/abstract/describe/predict the I/O patterns of jobs, especially when the jobs are parallel applications. I think that the I/O pattern information provided by users might be helpful for a rough estimation, but might not be accurate enough to make real-time QoS decisions. So, the QoSP might need to monitor the I/O patterns in real time and make QoS decisions dynamically.
> 
> Also, this gives us another reason why real-time bandwidth guarantees are helpful. If users are charged according to the performance guarantees, they will have much more motivation to understand and optimize their I/O patterns. And hopefully, the bad I/O patterns (for example, an unnecessary 2GB/s write that lasts for one hour) will be eliminated eventually, which is good for everyone.
> 

Besides charging for reserved resources, another incentive is the waiting time until the job scheduler allows the job to start. If a user wants 100% of the resources, he/she must wait until every other job is finished. We already see this effect in the case of CPUs: since we allow multi-day jobs, users typically must wait several days (sometimes even weeks) if they want that many resources.

I agree that an I/O pattern would be better for us. However, I think that it is easier for users to work with bandwidth values instead of access patterns. Most of our users come from non-computer-science fields such as biology, physics, or meteorology, and typically lack knowledge about the access patterns. In the best case, they know the size of the input files and (from experience) the expected size of the output files. They also might know that the input is read only once and, therefore, (roughly) know the total volume of read/written data. With this information, they can estimate the reservation length for the required resources.
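To give a concrete (invented) example of such an estimate: a job that reads 1 TB of input once and writes 4 TB of output moves roughly 5 TB in total; at a reserved 500 MB/s this corresponds to about 5,000,000 MB / 500 MB/s = 10,000 s, i.e. roughly three hours of I/O time to reserve.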

However, in the case of checkpointing, the users should be able to tell Slurm/QoSP enough details.

>> If the application provides the output directory for the new files, the number of files, and the size, the QoSP can have a very good idea of which OSTs will be used.  In most cases, the number of files exceeds the OST count, or a single file will be striped across all OSTs so the I/O will be evenly balanced across OSTs and it doesn't matter what the exact file placement will be.  If something like an OST pool is set on the directory, this can also be determined by "lfs getstripe" or llapi_layout_get_by_path() or similar.
> 
> Yeah, it seems the QoSP needs to do a lot of things, including controlling the striping of the files.
> 
>> I think fine-grained scheduling of the I/O performance of every file in the filesystem is not necessary to achieve improvements in aggregate I/O performance across jobs.  High-level prioritization of jobs is probably enough to gain the majority of the performance improvements possible.  For example, if the QoSP knows two jobs have similar I/O times during their busy period and are contending for writes on OSTs, then the first job to submit RPCs can have a short-term but significant boost in I/O priority so it can complete all of its I/O before the second job does.  Even if the jobs do I/O at the same frequency (e.g. every 60 minutes) this would naturally offset the two jobs in time to avoid contention in the future.
> 
> Agreed. A job might be a better granularity for I/O performance scheduling than a file, since users care more about the time costs of jobs than about writing/reading a single file.

Agreed. We see the fine-grained view as part of a long-term research topic.

> 
>> If the QoSP doesn't get any information about a job, it could potentially generate this dynamically from the job's previous I/O submissions (e.g. steady-state reader/writer of N MB/s, bursty every X seconds for Y GB, etc) to use it for later submissions and/or dump this in the job epilog so the user knows what information to submit for later test runs.
> 
> Agreed. This needs a monitor & analyzer to extract the I/O pattern. Various types of monitoring systems have been developed for Lustre by different vendors; however, we might need to put further work into them for I/O pattern abstraction. A powerful quantitative method to describe the I/O pattern needs to be designed. This method should be understandable by both humans and computers.  For example, something like the following description could be generated automatically after running a job:
> 
> rank: 0, start second: 0, end second: 60, I/O type: exclusive read, total size: 10GB, rank size: 10GB, file number: 1, grain size: 1MB, comment: "Reading the input";
> rank: 1~100, start second: 0, end second: 60, I/O type: none, comment: "Waiting for rank 0";
> rank: 0~100, start second: 61, end second: 600, I/O type: none, comment: "Computing";
> rank: 0~100, start second: 601, end second: 660, I/O type: exclusive write, total size: 1000GB, rank size: 10GB, file number: 100, grain size: 1MB, comment: "Writing the output";
> 
>> I suspect that even if the user is specifying the job I/O pattern explicitly, this should only be taken as "initial values" and the QoSP should determine what actual I/O pattern the job has (and report large discrepancies in the epilog).  The I/O pattern may change significantly based on input parameters, changes to the system, etc. that make the user-provided data inaccurate.
> 
> Yep, totally agreed.

Agreed.

> 
> Cheers, Andreas
> 
