* [LSF/FS TOPIC] I/O performance isolation for shared storage
From: Chad Talbott @ 2011-02-04  1:50 UTC (permalink / raw)
  To: lsf-pc; +Cc: linux-fsdevel

I/O performance is the bottleneck in many systems, from phones to
servers. Knowing which request to schedule at any moment is crucial to
systems that support interactive latencies and high throughput.  When
you're watching a video on your desktop, you don't want it to skip
when you build a kernel.

To address this in our environment Google has now deployed the
blk-cgroup code worldwide, and I'd like to share some of our
experiences. We've made modifications for our purposes, and are in the
process of proposing those upstream:

  - Page tracking for buffered writes
  - Fairness-preserving preemption across cgroups

There is further work to do along the lines of fine-grained accounting
and isolation. For example, many file servers in a Google cluster will
do IO on behalf of hundreds, even thousands of clients. Each client
has different service requirements, and it's inefficient to map them
to (cgroup, task) pairs.


* Re: [LSF/FS TOPIC] I/O performance isolation for shared storage
From: Vivek Goyal @ 2011-02-04  2:31 UTC (permalink / raw)
  To: Chad Talbott; +Cc: lsf-pc, linux-fsdevel

On Thu, Feb 03, 2011 at 05:50:00PM -0800, Chad Talbott wrote:
> I/O performance is the bottleneck in many systems, from phones to
> servers. Knowing which request to schedule at any moment is crucial to
> systems that support interactive latencies and high throughput.  When
> you're watching a video on your desktop, you don't want it to skip
> when you build a kernel.
> 
> To address this in our environment Google has now deployed the
> blk-cgroup code worldwide, and I'd like to share some of our
> experiences. We've made modifications for our purposes, and are in the
> process of proposing those upstream:
> 
>   - Page tracking for buffered writes
>   - Fairness-preserving preemption across cgroups

Chad,

This is definitely of interest to me (though I will not be around, I would
like to read the LWN summary of the discussion later. :-)). I would like to
know more about how Google has deployed this and is using this
infrastructure. I would also like to see all the missing pieces pushed
upstream (especially the buffered WRITE support and the page tracking stuff).

One thing I am curious about is how you get service differentiation while
maintaining high throughput. Idling on a group for fairness is more or less
reasonable on a single SATA disk, but it can very well kill performance
(especially with random IO) on a storage array or on fast SSDs.

I have been thinking of disabling idling altogether and instead changing the
position of a group in the service tree based on its weight when new IO
arrives (CFQ already does something similar for a cfqq with its
slice_offset() logic). I have been thinking of doing something similar while
calculating the vdisktime of a group when it gets enqueued. This might give
us some service differentiation while still getting better throughput.

You also mentioned controlling latencies very tightly, and that probably
means driving shallower queue depths (maybe 1) so that preemption is somewhat
effective and latencies are better. But again, driving a smaller queue depth
can reduce performance, so I am curious how you deal with that.

Also curious to know whether the per-memory-cgroup dirty ratio stuff got in,
and how you deal with the issue of selecting which inode to dispatch writes
from based on the cgroup it belongs to.

> 
> There is further work to do along the lines of fine-grained accounting
> and isolation. For example, many file servers in a Google cluster will
> do IO on behalf of hundreds, even thousands of clients. Each client
> has different service requirements, and it's inefficient to map them
> to (cgroup, task) pairs.

So is it ioprio-based isolation or something else?

Thanks
Vivek


* Re: [LSF/FS TOPIC] I/O performance isolation for shared storage
From: Chad Talbott @ 2011-02-04 23:07 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: lsf-pc, linux-fsdevel

On Thu, Feb 3, 2011 at 6:31 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Thu, Feb 03, 2011 at 05:50:00PM -0800, Chad Talbott wrote:
> This is definitely of interest to me (though I will not be around, I would
> like to read the LWN summary of the discussion later. :-)). I would like to
> know more about how Google has deployed this and is using this
> infrastructure. I would also like to see all the missing pieces pushed
> upstream (especially the buffered WRITE support and the page tracking stuff).

Pushing this all upstream is my current focus, so you'll be getting a
lot of patches in your inbox in the coming weeks.

> One thing I am curious about is how you get service differentiation while
> maintaining high throughput. Idling on a group for fairness is more or less
> reasonable on a single SATA disk, but it can very well kill performance
> (especially with random IO) on a storage array or on fast SSDs.

We've sidestepped that problem by deploying the blk-cgroup scheduler
against single spinning media drives.  Fast SSDs present another set
of problems.  It's not clear to me that CFQ is the right starting
place for a scheduler for SSDs.  Much of the structure of the code
reflects its design for spinning media drives.

> I have been thinking of disabling idling altogether and instead changing the
> position of a group in the service tree based on its weight when new IO
> arrives (CFQ already does something similar for a cfqq with its
> slice_offset() logic). I have been thinking of doing something similar while
> calculating the vdisktime of a group when it gets enqueued. This might give
> us some service differentiation while still getting better throughput.

I'd like to hear more about this.  It's not clear to me that idling
would be necessary for throughput on a device with a deep queue.  In
my mind idling is used only to get better throughput by avoiding seeks
introduced when switching between synchronous tasks.
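
To spell out what I mean by that, here is a toy sketch of the trade-off
(not actual CFQ code; the names and the heuristic are made up):

/*
 * Toy illustration of the idling trade-off -- not actual CFQ code.
 * Idling keeps the head where it is for a short while in the hope that
 * the same synchronous, sequential task sends its next request; seeking
 * away and back would cost more than the brief wait.
 */
#include <stdbool.h>

struct toy_queue {
	bool sync;		/* synchronous IO, e.g. reads */
	bool sequential;	/* recent requests were close together */
};

static bool toy_worth_idling(const struct toy_queue *q,
			     unsigned long seek_cost_us,
			     unsigned long idle_wait_us)
{
	/* Only idle when the expected seek costs more than the wait. */
	return q->sync && q->sequential && seek_cost_us > idle_wait_us;
}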

> You also mentioned controlling latencies very tightly, and that probably
> means driving shallower queue depths (maybe 1) so that preemption is somewhat
> effective and latencies are better. But again, driving a smaller queue depth
> can reduce performance, so I am curious how you deal with that.

We've currently just made the trade-off that you're pointing out.
We've chosen to limit queue depth and then leaned heavily on idling
for sequential, synchronous, well-behaved applications to maintain
throughput.  I think supporting high throughput and low-latency with
many random workloads is still an open area.

> Also curious to know whether the per-memory-cgroup dirty ratio stuff got in,
> and how you deal with the issue of selecting which inode to dispatch writes
> from based on the cgroup it belongs to.

We have some experience with per-cgroup writeback under our fake-NUMA
memory container system. Writeback under memcg will likely face
similar issues.  See Greg Thelen's topic description at
http://article.gmane.org/gmane.linux.kernel.mm/58164 for a request for
discussion.

Per-cgroup dirty ratios is just the beginning, as you mention.  Unless
the IO scheduler can see the deep queues of all the blocked tasks, it
can't make the right decisions.  Also, today writeback is ignorant of
the tasks' debt to the IO scheduler, so it issues the "wrong" inodes.

>> There is further work to do along the lines of fine-grained accounting
>> and isolation. For example, many file servers in a Google cluster will
>> do IO on behalf of hundreds, even thousands of clients. Each client
>> has different service requirements, and it's inefficient to map them
>> to (cgroup, task) pairs.
>
> So is it ioprio-based isolation or something else?

For me that's an open question.  ioprio might be a starting place.
There is interest in accounting for IO time, and ioprio doesn't
provide a notion of "tagging" IO by submitter.

Thanks for your interest.

Chad


* Re: [LSF/FS TOPIC] I/O performance isolation for shared storage
From: Vivek Goyal @ 2011-02-07 18:06 UTC (permalink / raw)
  To: Chad Talbott; +Cc: lsf-pc, linux-fsdevel

On Fri, Feb 04, 2011 at 03:07:15PM -0800, Chad Talbott wrote:
> On Thu, Feb 3, 2011 at 6:31 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Thu, Feb 03, 2011 at 05:50:00PM -0800, Chad Talbott wrote:
> > This is definitely of interest to me (though I will not be around, I would
> > like to read the LWN summary of the discussion later. :-)). I would like to
> > know more about how Google has deployed this and is using this
> > infrastructure. I would also like to see all the missing pieces pushed
> > upstream (especially the buffered WRITE support and the page tracking stuff).
> 
> Pushing this all upstream is my current focus, so you'll be getting a
> lot of patches in your inbox in the coming weeks.
> 
> > One thing I am curious about is how you get service differentiation while
> > maintaining high throughput. Idling on a group for fairness is more or less
> > reasonable on a single SATA disk, but it can very well kill performance
> > (especially with random IO) on a storage array or on fast SSDs.
> 
> We've sidestepped that problem by deploying the blk-cgroup scheduler
> against single spinning media drives.  Fast SSDs present another set
> of problems.  It's not clear to me that CFQ is the right starting
> place for a scheduler for SSDs.  Much of the structure of the code
> reflects its design for spinning media drives.
> 
> > I have been thinking of disabling idling altogether and instead changing the
> > position of a group in the service tree based on its weight when new IO
> > arrives (CFQ already does something similar for a cfqq with its
> > slice_offset() logic). I have been thinking of doing something similar while
> > calculating the vdisktime of a group when it gets enqueued. This might give
> > us some service differentiation while still getting better throughput.
> 
> I'd like to hear more about this.

If a group dispatches some IO and then becomes empty, it is deleted from the
service tree, and when new IO comes in it is put at the end of the service
tree. That way all the groups are served more or less round robin and there
is no service differentiation.

I was thinking that when a group gets backlogged, instead of putting it at
the end of the service tree, we could come up with a new mechanism where it
is placed at a certain offset from st->min_vdisktime. The offset would depend
on the group's priority, so that a high-prio group lands closer to the front
of the tree than a low-prio one. That way, even if a group gets deleted and
comes back again with more IO, there is a chance it gets scheduled ahead of
an already queued low-prio group, and we could see some service
differentiation even with idling disabled.
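
Roughly, the enqueue rule I have in mind would look something like this toy
sketch (not a patch; the names and the scaling are made up):

/*
 * Toy sketch of the proposed enqueue rule -- not actual CFQ code.
 * Instead of queueing a freshly backlogged group at the tail of the
 * service tree, place it at an offset from min_vdisktime so that a
 * higher weight (higher prio) lands closer to the front.
 */
#include <stdint.h>

struct toy_service_tree {
	uint64_t min_vdisktime;		/* monotonically increasing */
};

struct toy_group {
	unsigned int weight;		/* e.g. 100..1000, like blkio.weight */
	uint64_t vdisktime;		/* key in the service tree */
};

#define TOY_OFFSET_SCALE  100000ULL	/* arbitrary scale for the sketch */

static void toy_group_enqueue(struct toy_service_tree *st, struct toy_group *g)
{
	/* Higher weight => smaller offset => served sooner (weight > 0). */
	uint64_t offset = TOY_OFFSET_SCALE / g->weight;

	g->vdisktime = st->min_vdisktime + offset;
	/* ...then insert g into the rbtree keyed on vdisktime... */
}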

But this is theory at this point, and the efficacy of the approach will go
down as we increase queue depth; the service differentiation will also become
less deterministic. Still, this might be our best bet on faster devices with
higher queue depths.

>It's not clear to me that idling
> would be necessary for throughput on a device with a deep queue.  In
> my mind idling is used only to get better throughput by avoiding seeks
> introduced when switching between synchronous tasks.
> 
> > You also mentioned controlling latencies very tightly, and that probably
> > means driving shallower queue depths (maybe 1) so that preemption is somewhat
> > effective and latencies are better. But again, driving a smaller queue depth
> > can reduce performance, so I am curious how you deal with that.
> 
> We've currently just made the trade-off that you're pointing out.
> We've chosen to limit queue depth and then leaned heavily on idling
> for sequential, synchronous, well-behaved applications to maintain
> throughput.  I think supporting high throughput and low-latency with
> many random workloads is still an open area.
> 
> > Also curious to know whether the per-memory-cgroup dirty ratio stuff got in,
> > and how you deal with the issue of selecting which inode to dispatch writes
> > from based on the cgroup it belongs to.
> 
> We have some experience with per-cgroup writeback under our fake-NUMA
> memory container system. Writeback under memcg will likely face
> similar issues.  See Greg Thelen's topic description at
> http://article.gmane.org/gmane.linux.kernel.mm/58164 for a request for
> discussion.
> 
> Per-cgroup dirty ratios is just the beginning, as you mention.  Unless
> the IO scheduler can see the deep queues of all the blocked tasks, it
> can't make the right decisions.  Also, today writeback is ignorant of
> the tasks' debt to the IO scheduler, so it issues the "wrong" inodes.
> 
> >> There is further work to do along the lines of fine-grained accounting
> >> and isolation. For example, many file servers in a Google cluster will
> >> do IO on behalf of hundreds, even thousands of clients. Each client
> >> has different service requirements, and it's inefficient to map them
> >> to (cgroup, task) pairs.
> >
> > So is it ioprio-based isolation or something else?
> 
> For me that's an open question.  ioprio might be a starting place.

[..]
> There is interest in accounting for IO time, and ioprio doesn't
> provide a notion of "tagging" IO by submitter.

I am curious to know how IO time can be accounted for with deep queue
depths: once, say, 32 or more requests are in the driver/device, we just
don't know which request consumed how much of the actual time.

Vivek


* Re: [LSF/FS TOPIC] I/O performance isolation for shared storage
From: Chad Talbott @ 2011-02-07 19:40 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: lsf-pc, linux-fsdevel

On Mon, Feb 7, 2011 at 10:06 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Fri, Feb 04, 2011 at 03:07:15PM -0800, Chad Talbott wrote:
>> I'd like to hear more about this.
>
> If a group dispatches some IO and then becomes empty, it is deleted from the
> service tree, and when new IO comes in it is put at the end of the service
> tree. That way all the groups are served more or less round robin and there
> is no service differentiation.
>
> I was thinking that when a group gets backlogged, instead of putting it at
> the end of the service tree, we could come up with a new mechanism where it
> is placed at a certain offset from st->min_vdisktime. The offset would depend
> on the group's priority, so that a high-prio group lands closer to the front
> of the tree than a low-prio one. That way, even if a group gets deleted and
> comes back again with more IO, there is a chance it gets scheduled ahead of
> an already queued low-prio group, and we could see some service
> differentiation even with idling disabled.

This is interesting.  I think Nauman may have come up with a different
method to address similar concerns.  In his method, we remember a group's
vdisktime even when it is removed from the service tree.  Implemented by
itself this would give fairness over too long a time window, so only when
the disk becomes idle do we "forget" everyone's vdisktime.  We should be
sending that patch out Real Soon Now, along with the rest.
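
In toy form, the idea is roughly the following (illustrative only, not the
actual patch; all names here are made up):

/*
 * Toy sketch of the "remember vdisktime" idea -- not the actual patch.
 * A group keeps its vdisktime across removal from the service tree;
 * the saved values are only thrown away once the disk goes idle.
 */
#include <stdbool.h>
#include <stdint.h>

struct toy_group {
	uint64_t vdisktime;	/* retained while the group is off the tree */
	bool on_tree;
};

struct toy_disk {
	struct toy_group *groups;
	unsigned int nr_groups;
};

static void toy_group_dequeue(struct toy_group *g)
{
	g->on_tree = false;
	/* note: g->vdisktime is deliberately left untouched */
}

static void toy_disk_went_idle(struct toy_disk *d)
{
	/* No IO anywhere: forget history so nobody carries stale credit. */
	for (unsigned int i = 0; i < d->nr_groups; i++)
		d->groups[i].vdisktime = 0;
}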

> [..]
>> There is interest in accounting for IO time, and ioprio doesn't
>> provide a notion of "tagging" IO by submitter.
>
> I am curious to know how IO time can be accounted for with deep queue
> depths: once, say, 32 or more requests are in the driver/device, we just
> don't know which request consumed how much of the actual time.

Yes, that's a large problem.  We would be interested in a partial
solution that works for shallow queues.

Thinking out loud: it occurs to me that without device support, some
sort of device performance model might be hard to avoid when trying to
understand the work incurred by a given request.
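
For instance, a crude cost model might look something like this (purely
illustrative; the names and constants are invented):

/*
 * Purely illustrative per-request cost model -- invented numbers, not a
 * real driver model.  Estimate the time a request "costs" as a seek
 * component plus a transfer component, since the device won't tell us.
 */
#include <stdint.h>
#include <stdlib.h>

#define TOY_SEEK_COST_US	8000ULL		/* avg seek + rotation, usecs */
#define TOY_BYTES_PER_SEC	(100ULL << 20)	/* ~100 MB/s streaming */

struct toy_request {
	uint64_t sector;	/* start sector (512B units) */
	uint32_t bytes;		/* transfer size */
};

/* Estimated service time in microseconds, given the previous request. */
static uint64_t toy_estimate_us(const struct toy_request *prev,
				const struct toy_request *req)
{
	uint64_t xfer_us = (req->bytes * 1000000ULL) / TOY_BYTES_PER_SEC;
	uint64_t seek_us = 0;

	if (prev && llabs((long long)(req->sector - prev->sector)) > 128)
		seek_us = TOY_SEEK_COST_US;	/* treat big jumps as a seek */

	return seek_us + xfer_us;
}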

Chad


* Re: [LSF/FS TOPIC] I/O performance isolation for shared storage
From: Vivek Goyal @ 2011-02-07 20:38 UTC (permalink / raw)
  To: Chad Talbott; +Cc: lsf-pc, linux-fsdevel

On Mon, Feb 07, 2011 at 11:40:26AM -0800, Chad Talbott wrote:
> On Mon, Feb 7, 2011 at 10:06 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Fri, Feb 04, 2011 at 03:07:15PM -0800, Chad Talbott wrote:
> >> I'd like to hear more about this.
> >
> > If a group dispatches some IO and then becomes empty, it is deleted from the
> > service tree, and when new IO comes in it is put at the end of the service
> > tree. That way all the groups are served more or less round robin and there
> > is no service differentiation.
> >
> > I was thinking that when a group gets backlogged, instead of putting it at
> > the end of the service tree, we could come up with a new mechanism where it
> > is placed at a certain offset from st->min_vdisktime. The offset would depend
> > on the group's priority, so that a high-prio group lands closer to the front
> > of the tree than a low-prio one. That way, even if a group gets deleted and
> > comes back again with more IO, there is a chance it gets scheduled ahead of
> > an already queued low-prio group, and we could see some service
> > differentiation even with idling disabled.
> 
> This is interesting.  I think Nauman may have come up with a different
> method to address similar concerns.  In his method, we remember a group's
> vdisktime even when it is removed from the service tree.  Implemented by
> itself this would give fairness over too long a time window, so only when
> the disk becomes idle do we "forget" everyone's vdisktime.  We should be
> sending that patch out Real Soon Now, along with the rest.

I have thought about it in the past. I think there are still a few
concerns there.

- How do we determine which group's vdisktime is still valid, and how do
  we invalidate all the past vdisktimes?

- When idling is disabled, groups will most likely dispatch a bunch of
  requests and go away, so the slice used might be just 1 jiffy or even
  less. In that case all the groups end up with the same vdisktime at
  expiry and there is no service differentiation.

- Even if we reuse the previous vdisktime, it is most likely in the past
  relative to st->min_vdisktime, which is a monotonically increasing
  number. How is that handled?

Thanks
Vivek


* Re: [LSF/FS TOPIC] I/O performance isolation for shared storage
From: Jan Kara @ 2011-02-15 12:54 UTC (permalink / raw)
  To: Chad Talbott; +Cc: Vivek Goyal, lsf-pc, linux-fsdevel

On Fri 04-02-11 15:07:15, Chad Talbott wrote:
> > Also curious to know whether the per-memory-cgroup dirty ratio stuff got in,
> > and how you deal with the issue of selecting which inode to dispatch writes
> > from based on the cgroup it belongs to.
> 
> We have some experience with per-cgroup writeback under our fake-NUMA
> memory container system. Writeback under memcg will likely face
> similar issues.  See Greg Thelen's topic description at
> http://article.gmane.org/gmane.linux.kernel.mm/58164 for a request for
> discussion.
> 
> Per-cgroup dirty ratios is just the beginning, as you mention.  Unless
> the IO scheduler can see the deep queues of all the blocked tasks, it
> can't make the right decisions.  Also, today writeback is ignorant of
> the tasks' debt to the IO scheduler, so it issues the "wrong" inodes.
  I'm curious: Could you elaborate a bit more on this? I'm not sure what a
debt to the IO scheduler is and why the choice of inodes would matter...
Thanks.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [LSF/FS TOPIC] I/O performance isolation for shared storage
From: Chad Talbott @ 2011-02-15 23:15 UTC (permalink / raw)
  To: Jan Kara; +Cc: Vivek Goyal, lsf-pc, linux-fsdevel

On Tue, Feb 15, 2011 at 4:54 AM, Jan Kara <jack@suse.cz> wrote:
> On Fri 04-02-11 15:07:15, Chad Talbott wrote:
>> Per-cgroup dirty ratios is just the beginning, as you mention.  Unless
>> the IO scheduler can see the deep queues of all the blocked tasks, it
>> can't make the right decisions.  Also, today writeback is ignorant of
>> the tasks' debt to the IO scheduler, so it issues the "wrong" inodes.
>  I'm curious: Could you elaborate a bit more on this? I'm not sure what a
> debt to the IO scheduler is and why the choice of inodes would matter...

Sorry, this comment needs more context.  Google's servers typically
operate with both memory-capacity and disk-time isolation via cgroups.
We maintain a set of patches that provide page tracking and buffered
write isolation.  We are working on sending those patches out
alongside the memcg efforts.

A scenario: When a given cgroup reaches its foreground writeback
high-water mark, it invokes the writeback code to send dirty inodes
belonging to that cgroup to the IO scheduler.  CFQ can then see those
requests and schedule them against the other requests in the system.
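
In rough pseudo-C, the trigger looks something like this (the names and
structure here are hypothetical, not our actual patches):

/*
 * Rough sketch of a per-cgroup foreground-writeback trigger -- the
 * names and structure are hypothetical, not our actual patches.
 */
#include <stdbool.h>

struct toy_cgroup {
	unsigned long nr_dirty;		/* dirty pages charged to this cgroup */
	unsigned long dirty_limit;	/* its high-water mark */
};

/* Stub: in the real thing this would push the cgroup's dirty inodes. */
static void toy_writeback_cgroup_inodes(struct toy_cgroup *cg)
{
	(void)cg;
}

/* Called from the write path after dirtying pages on behalf of @cg. */
static void toy_balance_dirty(struct toy_cgroup *cg)
{
	if (cg->nr_dirty <= cg->dirty_limit)
		return;

	/*
	 * Over the high-water mark: hand this cgroup's dirty inodes to
	 * the IO scheduler so CFQ can weigh the resulting requests
	 * against everyone else's.
	 */
	toy_writeback_cgroup_inodes(cg);
}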

If the thread doing the dirtying issues the IO directly, then CFQ can
see all the individual threads waiting on IO.  CFQ schedules between
them and provides the requested disk time isolation.

If the background writeout thread does the IO, the picture is
different.  Since there is only a single flusher thread per disk, the
order in which the inodes are issued to the IO scheduler matters.  The
writeback code issues fairly large chunks from each inode; from the IO
scheduler's point of view it will only see IO from a single group while
the flusher works on that chunk.  So CFQ cannot provide fairness between
the threads doing buffered writes.

Chad

