Date: Thu, 11 Dec 2008 16:53:35 +0100
From: Andrea Arcangeli
To: Gerd Hoffmann
Cc: Anthony Liguori, kvm-devel, qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [RFC] Replace posix-aio with custom thread pool
Message-ID: <20081211155335.GE14908@random.random>
In-Reply-To: <494130B5.2080800@redhat.com>

On Thu, Dec 11, 2008 at 04:24:37PM +0100, Gerd Hoffmann wrote:
> Well, linux kernel aio has its share of problems too:
>
>  * Anthony mentioned it may block on certain circumstances (forgot
>    which ones), and you can't figure beforehand to turn off aio then.

We have worse problems as long as bdrv_read/write are used by qcow2.
And we can fix the host kernel in the long run if this becomes an
issue.

>  * It can't handle block allocation.  Kernel handles that by doing
>    such writes synchronously via VFS layer (instead of the separate
>    aio code paths).  Leads to horrible performance and bug reports
>    such as "installs on sparse files are very slow".

I think here you mean O_DIRECT, regardless of aio vs. sync API. I
doubt aio has any relevance to block allocation at all, so whatever
problem we have with the kernel aio API and O_DIRECT should also be
there with a sync API + userland threads and O_DIRECT.

>  * support for vectored aio isn't that old.  IIRC it was added
>    somewhen around 2.6.20 (newer than current suse/redhat enterprise
>    versions).  Which IMHO means you can't expect it being present
>    unconditionally.

I think this is a false alarm: the whole point of kernel AIO is that
even if O_DIRECT is enabled, all bios are pushed to the disk before
the disk queue is unplugged, which is all we care about to get decent
disk bandwidth with zerocopy dma. Or at least that's the way it's
supposed to work if aio is implemented correctly at the bio level.

So on kernels that don't support IOCB_CMD_READV/WRITEV, we simply have
to pass an array of iocbs through io_submit (i.e. convert the iov into
a vector of iocbs, instead of a single iocb pointing to the iov).
Internally to io_submit a single dma command should be generated and
the same sg list should be built as if we had used READV/WRITEV. In
theory READV/WRITEV should be just a cpu saving feature; it shouldn't
influence disk bandwidth. If it does, it means the bio layer is broken
and needs fixing.

If IOCB_CMD_READV/WRITEV is available, good; if not, we go with
READ/WRITE and more dynamically allocated iocbs. It just needs a
conversion routine from (iovec, file, offset) to an array of iocb
pointers when IOCB_CMD_READV/WRITEV is not available. The iocb array
can be preallocated along with the iovec when we detect that
IOCB_CMD_READV/WRITEV is missing; I have a cache layer that does this,
and I'll just provide an output selectable in iovec or iocb terms,
with iocb selected when the host OS is Linux and IOCB_CMD_READV/WRITEV
is not available.
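To make the fallback concrete, here is a minimal sketch of the
iov-to-iocb conversion described above, written against the libaio
helpers (io_prep_pread/io_prep_preadv/io_submit) and assuming a libaio
new enough to provide io_prep_preadv. The submit_readv() name, the
have_preadv flag and the allocation strategy are purely illustrative,
not the actual qemu/kvm code:

#include <libaio.h>
#include <stdlib.h>
#include <sys/uio.h>

/*
 * Submit one vectored read either as a single preadv-style iocb
 * (have_preadv != 0) or, on older kernels, as one plain read iocb per
 * iovec element, all pushed down in a single io_submit() call so the
 * block layer can still merge them before the queue is unplugged.
 */
static int submit_readv(io_context_t ctx, int fd,
                        const struct iovec *iov, int iovcnt,
                        long long offset, int have_preadv)
{
    int i, nr = have_preadv ? 1 : iovcnt;
    struct iocb *iocbs = calloc(nr, sizeof(*iocbs));
    struct iocb **ptrs = calloc(nr, sizeof(*ptrs));

    if (!iocbs || !ptrs) {
        free(iocbs);
        free(ptrs);
        return -1;
    }

    if (have_preadv) {
        /* one iocb describing the whole iovec */
        io_prep_preadv(&iocbs[0], fd, iov, iovcnt, offset);
    } else {
        /* one iocb per iovec element; they are contiguous on disk */
        for (i = 0; i < iovcnt; i++) {
            io_prep_pread(&iocbs[i], fd, iov[i].iov_base,
                          iov[i].iov_len, offset);
            offset += iov[i].iov_len;
        }
    }

    for (i = 0; i < nr; i++)
        ptrs[i] = &iocbs[i];

    /* completions are reaped elsewhere with io_getevents(); the iocb
     * pointers come back in the events, so they are kept allocated
     * until completion rather than freed here */
    return io_submit(ctx, nr, ptrs);
}

Preallocating the iocb array next to the iovec, as described above,
would avoid the per-request calloc; with O_DIRECT the usual buffer,
length and offset alignment rules still apply to each iocb.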
> Threads will be there anyway for kvm smp.

Yes, I didn't mean those threads ;). I love threads, but I love
threads that are CPU bound and allow us to exploit the whole power of
the system! But for storage, threads are pure overscheduling overhead
as far as I can tell, given we have an async api to use and we already
have to deal with the pain of async programming. So it's worth getting
the full benefit of it (i.e. no thread/overscheduling overhead). If
aio inside the kernel is too complex, then it can use kernel threads
internally; that's still better than user threads.

I mean, if we keep using only threads we should get rid of bdrv_aio*
completely and move the qcow2 code into a separate thread instead of
keeping it running from the io thread. If we stick to threads then
it's worth getting the full benefit of threads (i.e. not having to
deal with the pains of async programming and moving the qcow2
computation to a separate CPU). That is something I tried doing, but I
ended up having to add locks all over qcow2 in order to submit
multiple qcow2 requests in parallel (otherwise the lock would be
global and I couldn't differentiate between a bdrv_read for qcow2
metadata, which must be executed with the qcow2 mutex held, and a
bdrv_aio_readv, which can run lockless from the point of view of the
current qcow2 instance - the qcow2 parent may then take its own locks,
etc.). Basically it breaks all backends, something I'm not comfortable
with right now just to get zerocopy dma working at platter speed.
Hence I stick with async programming for now...

> Well, wait for glibc isn't going to fly.  glibc waits for posix, and
> posix waits for a reference implementation (which will not be glibc).

Agreed.

> > and kernel with preadv/pwritev
>
> With that in place you don't need kernel aio any more, then you can
> really do it in userspace with threads.  But that probably would be
> linux-only ^W^W^W

Waiting for preadv/pwritev is just the 'quicker' version of waiting
for glibc aio_readv. And because it remains linux-only, I prefer
kernel AIO, which fixes cfq and should be the most optimal anyway
(with or without READV/WRITEV support).

So in the end: we either open the file 64 times (which I think is
perfectly coherent with nfs unless the nfs client is broken, but then
Anthony may know nfs better; I'm not a heavy nfs user here), or we go
with kernel AIO... you know my preference. That said, opening the file
64 times is probably simpler, if it has been confirmed that it doesn't
break nfs. Breaking nfs is not an option here; nfs is the ideal shared
storage for migration (we surely want to exploit the fact that the
semantics we need for a safe migration are so weak that it's worth
keeping nfs supported as 100% reliable shared storage for KVM
virtualization).

> > ahem: http://www.daemon-systems.org/man/preadv.2.html
>
> Too bad nobody implemented it yet...