From mboxrd@z Thu Jan 1 00:00:00 1970
From: Vivek Goyal
Subject: Re: [Qemu-devel] [RFC]QEMU disk I/O limits
Date: Tue, 31 May 2011 13:59:55 -0400
Message-ID: <20110531175955.GI16382@redhat.com>
References: <20110530050923.GF18832@f12.cn.ibm.com> <20110531134537.GE16382@redhat.com> <4DE4F230.2040203@us.ibm.com> <20110531140402.GF16382@redhat.com> <4DE4FA5B.1090804@codemonkey.ws>
In-Reply-To: <4DE4FA5B.1090804@codemonkey.ws>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
To: Anthony Liguori
Cc: kwolf@redhat.com, stefanha@linux.vnet.ibm.com, kvm@vger.kernel.org, guijianfeng@cn.fujitsu.com, qemu-devel@nongnu.org, wuzhy@cn.ibm.com, herbert@gondor.hengli.com.au, Zhi Yong Wu, luowenj@cn.ibm.com, zhanx@cn.ibm.com, zhaoyang@cn.ibm.com, llim@redhat.com, Ryan A Harper, Mike Snitzer, Joe Thornber

On Tue, May 31, 2011 at 09:25:31AM -0500, Anthony Liguori wrote:
> On 05/31/2011 09:04 AM, Vivek Goyal wrote:
> >On Tue, May 31, 2011 at 08:50:40AM -0500, Anthony Liguori wrote:
> >>On 05/31/2011 08:45 AM, Vivek Goyal wrote:
> >>>On Mon, May 30, 2011 at 01:09:23PM +0800, Zhi Yong Wu wrote:
> >>>>Hello, all,
> >>>>
> >>>>    I have prepared to work on a feature called "Disk I/O limits"
> >>>>for the qemu-kvm project.
> >>>>    This feature will enable the user to cap the disk I/O amount
> >>>>performed by a VM. It is important when storage resources are
> >>>>shared among multiple VMs. As you know, if some of the VMs are
> >>>>doing excessive disk I/O, they will hurt the performance of the
> >>>>other VMs.
> >>>>
> >>>
> >>>Hi Zhiyong,
> >>>
> >>>Why not use the kernel blkio controller for this? Why reinvent the
> >>>wheel and implement the feature again in qemu?
> >>
> >>blkio controller only works for block devices. It doesn't work when
> >>using files.
> >
> >So can't we come up with something to easily determine which device
> >backs this file? Though that will still not work for NFS-backed
> >storage.
>
> Right.
>
> Additionally, in QEMU, we can rate limit based on concepts that make
> sense to a guest. We can limit the actual I/O ops visible to the
> guest, which means that we'll get consistent performance regardless
> of whether the backing file is qcow2, raw, LVM, or raw over NFS.
>

Are you referring to merging, which can change the definition of IOPS
as seen by the guest? We do throttling at the bio level, before any
merging has taken place, so the IOPS seen by the guest and by the
throttling logic should be the same. Readahead would be one exception,
though, where any readahead data will be charged to the guest.

Device throttling and its interaction with the file system is still an
issue with the IO controller (things like journalling lead to
serialization), where a faster group can get blocked behind a slower
group. That's why, at the moment, the recommendation is to directly
export devices/partitions to virtual machines if throttling is to be
used, and not to share a file system across VMs.
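(For reference, a minimal sketch of what the kernel-side setup looks
like today with the cgroup v1 blk-throttle interface when a partition
is exported to a VM. The cgroup mount point, group name, device
numbers and pid below are made-up placeholders.)

/*
 * Illustrative only: put a qemu process into a blkio cgroup and cap
 * its reads on one device (major:minor 8:16 here) at 1000 IOPS.
 */
#include <stdio.h>
#include <sys/stat.h>

static int write_str(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");

    if (!f)
        return -1;
    fprintf(f, "%s\n", val);
    return fclose(f);
}

int main(void)
{
    /* Create a per-VM group under the blkio controller. */
    mkdir("/sys/fs/cgroup/blkio/vm1", 0755);

    /* "<major>:<minor> <iops>" caps read IOPS on that device. */
    write_str("/sys/fs/cgroup/blkio/vm1/blkio.throttle.read_iops_device",
              "8:16 1000");

    /* Move the qemu process (placeholder pid) into the group. */
    write_str("/sys/fs/cgroup/blkio/vm1/tasks", "12345");

    return 0;
}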
> The kernel just doesn't have enough information to do a good job
> here.

[CCing a couple of device mapper folks for thoughts on the below]

When I think more about it, this problem is very similar to other
features like snapshotting. Should we implement snapshotting in qemu,
or use some kernel-based solution like dm-snapshot or dm-multisnap?
I don't have a good answer for that.

Has this debate been settled already? I see that development is
happening in the kernel for providing dm snapshot capabilities, and
Mike Snitzer also mentioned the possibility of using dm-loop to cover
the case of files over NFS etc.

Some thoughts in general, though.

- Any kernel-based solution is generic and can be used in other
  contexts as well, like containers or bare metal.

- In some cases the kernel can implement throttling more efficiently.
  For example, if a block device has multiple partitions and these
  partitions are exported to VMs, then the kernel can maintain a
  single queue and a single set of timers to manage all the VMs doing
  IO to that device. In a user-space solution we shall have to manage
  as many queues and timers as there are VMs (see the sketch after
  this list). So a kernel implementation can be more efficient in
  certain cases.

- Things like dm-loop essentially mean introducing another block
  layer on top of the file system layer. I personally think that does
  not sound very clean and might slow things down, though I don't
  have any data. Has there been any discussion/conclusion on this?

- A qemu-based scheme will work well with all kinds of targets. To
  use a kernel-based scheme, one would have to switch to the
  kernel-provided snapshotting schemes (dm-snapshot or dm-multisnap
  etc.). Otherwise a READ might come from a base image which is on
  another device, and we would not have throttled the VM.
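(To make the second point above concrete, here is a rough sketch of
the per-VM state a user-space limiter has to keep: one bucket per VM,
refilled at the configured rate, plus a per-VM timer, not shown, to
resubmit requests queued while the bucket was empty. Names are made
up; this is not actual QEMU code.)

#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

struct io_bucket {
    uint64_t iops_limit;      /* configured IOPS cap for this VM   */
    uint64_t tokens;          /* ops that may still be issued now  */
    uint64_t last_refill_ns;  /* timestamp of the last refill      */
};

void bucket_refill(struct io_bucket *b, uint64_t now_ns)
{
    uint64_t elapsed = now_ns - b->last_refill_ns;

    b->tokens += b->iops_limit * elapsed / NSEC_PER_SEC;
    if (b->tokens > b->iops_limit)
        b->tokens = b->iops_limit;   /* at most 1s worth of burst */
    b->last_refill_ns = now_ns;
}

/* True if the request may be issued now; otherwise the caller queues
 * it and arms this VM's timer for the next refill. */
bool bucket_allow(struct io_bucket *b, uint64_t now_ns)
{
    bucket_refill(b, now_ns);
    if (b->tokens == 0)
        return false;
    b->tokens--;
    return true;
}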
Thanks
Vivek