From mboxrd@z Thu Jan 1 00:00:00 1970
From: Vivek Goyal
Subject: Re: [Qemu-devel] [RFC]QEMU disk I/O limits
Date: Tue, 31 May 2011 13:59:55 -0400
Message-ID: <20110531175955.GI16382@redhat.com>
References: <20110530050923.GF18832@f12.cn.ibm.com> <20110531134537.GE16382@redhat.com> <4DE4F230.2040203@us.ibm.com> <20110531140402.GF16382@redhat.com> <4DE4FA5B.1090804@codemonkey.ws>
In-Reply-To: <4DE4FA5B.1090804@codemonkey.ws>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
To: Anthony Liguori
Cc: kwolf@redhat.com, stefanha@linux.vnet.ibm.com, kvm@vger.kernel.org, guijianfeng@cn.fujitsu.com, qemu-devel@nongnu.org, wuzhy@cn.ibm.com, herbert@gondor.hengli.com.au, Zhi Yong Wu, luowenj@cn.ibm.com, zhanx@cn.ibm.com, zhaoyang@cn.ibm.com, llim@redhat.com, Ryan A Harper, Mike Snitzer, Joe Thornber

On Tue, May 31, 2011 at 09:25:31AM -0500, Anthony Liguori wrote:
> On 05/31/2011 09:04 AM, Vivek Goyal wrote:
> >On Tue, May 31, 2011 at 08:50:40AM -0500, Anthony Liguori wrote:
> >>On 05/31/2011 08:45 AM, Vivek Goyal wrote:
> >>>On Mon, May 30, 2011 at 01:09:23PM +0800, Zhi Yong Wu wrote:
> >>>>Hello, all,
> >>>>
> >>>>    I have prepared to work on a feature called "Disk I/O limits"
> >>>>for the qemu-kvm project.
> >>>>    This feature will enable the user to cap the disk I/O amount
> >>>>performed by a VM. It is important when storage resources are
> >>>>shared among multiple VMs. As you know, if some of the VMs are
> >>>>doing excessive disk I/O, they will hurt the performance of the
> >>>>other VMs.
> >>>>
> >>>
> >>>Hi Zhiyong,
> >>>
> >>>Why not use the kernel blkio controller for this? Why reinvent the
> >>>wheel and implement the feature again in qemu?
> >>
> >>blkio controller only works for block devices. It doesn't work when
> >>using files.
> >
> >So can't we come up with something to easily determine which device
> >backs this file? Though that will still not work for NFS-backed
> >storage.
>
> Right.
>
> Additionally, in QEMU, we can rate limit based on concepts that make
> sense to a guest. We can limit the actual I/O ops visible to the
> guest, which means that we'll get consistent performance regardless
> of whether the backing file is qcow2, raw, LVM, or raw over NFS.
>

Are you referring to merging, which can change the definition of IOPS
as seen by the guest? We do throttling at the bio level, before any
merging has taken place, so the IOPS seen by the guest and by the
throttling logic should be the same. Readahead would be one exception,
though, where any readahead data will be charged to the guest.

Device throttling and its interaction with the file system is still an
issue with the IO controller (things like journalling lead to
serialization), where a faster group can get blocked behind a slower
group. That's why, at the moment, the recommendation is to directly
export devices/partitions to virtual machines if throttling is to be
used, and not to share a file system across VMs.
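(For reference, a minimal sketch of what the kernel-side setup looks
like today with the cgroup v1 blk-throttle interface when a partition
is exported to a VM. The cgroup mount point, group name, device
numbers and pid below are made-up placeholders.)

/*
 * Illustrative only: put a qemu process into a blkio cgroup and cap
 * its reads on one device (major:minor 8:16 here) at 1000 IOPS.
 */
#include <stdio.h>
#include <sys/stat.h>

static int write_str(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");

    if (!f)
        return -1;
    fprintf(f, "%s\n", val);
    return fclose(f);
}

int main(void)
{
    /* Create a per-VM group under the blkio controller. */
    mkdir("/sys/fs/cgroup/blkio/vm1", 0755);

    /* "<major>:<minor> <iops>" caps read IOPS on that device. */
    write_str("/sys/fs/cgroup/blkio/vm1/blkio.throttle.read_iops_device",
              "8:16 1000");

    /* Move the qemu process (placeholder pid) into the group. */
    write_str("/sys/fs/cgroup/blkio/vm1/tasks", "12345");

    return 0;
}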
> The kernel just doesn't have enough information to do a good job
> here.

[CCing a couple of device mapper folks for thoughts on the below]

When I think more about it, this problem is very similar to other
features like snapshotting. Should we implement snapshotting in qemu,
or use some kernel-based solution like dm-snapshot or dm-multisnap?
I don't have a good answer for that.

Has this debate been settled already? I see that development is
happening in the kernel for providing dm snapshot capabilities, and
Mike Snitzer also mentioned the possibility of using dm-loop to cover
the case of files over NFS etc.

Some thoughts in general, though.

- Any kernel-based solution is generic and can be used in other
  contexts as well, like containers or bare metal.

- In some cases the kernel can implement throttling more efficiently.
  For example, if a block device has multiple partitions and these
  partitions are exported to VMs, then the kernel can maintain a
  single queue and a single set of timers to manage all the VMs doing
  IO to that device. In a user-space solution we shall have to manage
  as many queues and timers as there are VMs (see the sketch after
  this list). So a kernel implementation can be more efficient in
  certain cases.

- Things like dm-loop essentially mean introducing another block
  layer on top of the file system layer. I personally think that does
  not sound very clean and might slow things down, though I don't
  have any data. Has there been any discussion/conclusion on this?

- A qemu-based scheme will work well with all kinds of targets. To
  use a kernel-based scheme, one would have to switch to the
  kernel-provided snapshotting schemes (dm-snapshot or dm-multisnap
  etc.). Otherwise a READ might come from a base image which is on
  another device, and we would not have throttled the VM.
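(To make the second point above concrete, here is a rough sketch of
the per-VM state a user-space limiter has to keep: one bucket per VM,
refilled at the configured rate, plus a per-VM timer, not shown, to
resubmit requests queued while the bucket was empty. Names are made
up; this is not actual QEMU code.)

#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

struct io_bucket {
    uint64_t iops_limit;      /* configured IOPS cap for this VM   */
    uint64_t tokens;          /* ops that may still be issued now  */
    uint64_t last_refill_ns;  /* timestamp of the last refill      */
};

void bucket_refill(struct io_bucket *b, uint64_t now_ns)
{
    uint64_t elapsed = now_ns - b->last_refill_ns;

    b->tokens += b->iops_limit * elapsed / NSEC_PER_SEC;
    if (b->tokens > b->iops_limit)
        b->tokens = b->iops_limit;   /* at most 1s worth of burst */
    b->last_refill_ns = now_ns;
}

/* True if the request may be issued now; otherwise the caller queues
 * it and arms this VM's timer for the next refill. */
bool bucket_allow(struct io_bucket *b, uint64_t now_ns)
{
    bucket_refill(b, now_ns);
    if (b->tokens == 0)
        return false;
    b->tokens--;
    return true;
}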
Thanks
Vivek