From: Anthony Liguori
Date: Mon, 13 Sep 2010 09:34:35 -0500
Subject: Re: [Qemu-devel] Re: [PATCH 3/3] disk: don't read from disk until the guest starts
To: Kevin Wolf
Cc: qemu-devel@nongnu.org, Stefan Hajnoczi, Juan Quintela

On 09/13/2010 09:13 AM, Kevin Wolf wrote:
>> I think the only real advantage is that we fix NFS migration, right?
>
> That's the one that we know about, yes.
>
> The rest is not a specific scenario, but a strong feeling that having an
> image opened twice at the same time feels dangerous.

We've never really had clear semantics around live migration and block
drivers' life cycles.  At a high level, for live migration to work, we
need the following sequence:

1) src> flush all pending writes to disk
2) <hand the guest off from src to dst>
3) dst> invalidate any cached data
4) dst> start guest

We've gotten away with ignoring (3) because raw disks never cache
anything.  But that assumes we're sitting on top of cache-coherent
storage.

If we don't have fully cache-coherent storage, we need to do more: we
need to extend (3) to also flush the cache of the underlying storage.
There are two ways we can solve this.  We can either ensure that (3) is
a nop by not performing any operation that would cause caching until
after (3), or we can find a way to inject a flush into the underlying
cache.  Since the latter approach requires storage-specific knowledge,
I went with the former.

Of course, a close-open sequence at (3) might achieve the same goal,
but really only for NFS.  If you have something weaker than
close-to-open coherency, you still need to do something special in
step (3).

I don't know that I see a perfect model.  Pushing reads past point (3)
is easy and fixes raw on top of NFS; I think we want to do that because
it's low-hanging fruit.  A block driver hook for (3) also seems
appealing because we can make use of it easily in QED.

That said, I'm open to suggestions for a better model.  Delaying the
open (especially if you open, then close, then open again) seems a bit
hacky.

With respect to the devices, I think the question of when block devices
may begin accessing their drives is orthogonal to this discussion.
Even without delaying the open, we could simply not give them their
BlockDriverStates until realize() or something like that.
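To make the hook idea concrete, here's a rough sketch of what (3) could
look like as a driver callback.  All of the names below are made up for
illustration; nothing like this exists in the tree today:

/* Hypothetical hook -- each format driver would provide a way to drop
 * whatever it cached at open time, so that the first access on the
 * destination rereads from the underlying storage.  QED, for example,
 * could reread its header (and mounted flag) here. */
struct BlockDriver {
    /* ... existing callbacks ... */
    void (*bdrv_invalidate_cache)(BlockDriverState *bs);
};

/* The destination side of migration would then, at step (3), walk the
 * open images and invalidate each one before starting the guest: */
static void bdrv_invalidate_all(void)
{
    BlockDriverState *bs;

    QTAILQ_FOREACH(bs, &bdrv_states, list) {
        if (bs->drv && bs->drv->bdrv_invalidate_cache) {
            bs->drv->bdrv_invalidate_cache(bs);
        }
    }
}

A format sitting on genuinely coherent storage could just leave the
callback NULL and (3) stays a nop for it.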
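On the realize() point, I'm imagining something along these lines
(again, purely hypothetical; bdrv_find() is the existing lookup):

/* Hypothetical: qdev hands the device a lazy reference rather than the
 * BlockDriverState itself, so a device model physically cannot issue
 * I/O before it is realized. */
typedef struct DriveRef {
    char *id;               /* the -drive id */
    BlockDriverState *bs;   /* stays NULL until realize time */
} DriveRef;

BlockDriverState *drive_ref_resolve(DriveRef *ref)
{
    /* Only legal once the device has been realized, i.e. once the
     * guest is actually allowed to touch the disk. */
    if (!ref->bs) {
        ref->bs = bdrv_find(ref->id);
    }
    return ref->bs;
}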
Regards,

Anthony Liguori

> As soon as an
> open/close sequence writes to the image for some format, we probably
> have a bug.  For example, what about this mounted flag that you were
> discussing for QED?
>
>> But if we do invalidate_cache() as you suggested with a close/open of
>> the qcow2 layer, and also acquire and release a lock in the file layer
>> by propagating the invalidate_cache(), that should work robustly with
>> NFS.
>>
>> I think that's a simpler change.  Do you see additional advantages to
>> delaying the open?
>
> Just that it makes it very obvious if a device model is doing bad things
> and accessing the image before it should.  The difference is a failed
> request vs. silently corrupted data.
>
> Kevin
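P.S.  For concreteness, the close/open at the file layer would amount
to roughly this in raw-posix (sketch only, no error handling;
BDRVRawState, fd, and open_flags are the existing raw-posix internals):

/* Reopening the fd is what buys us NFS close-to-open coherency: the
 * first read on the destination then goes back to the server instead
 * of the local client cache. */
static void raw_invalidate_cache(BlockDriverState *bs)
{
    BDRVRawState *s = bs->opaque;

    close(s->fd);
    s->fd = open(bs->filename, s->open_flags);

    /* Acquiring an fcntl() lock here would additionally make it loud
     * and obvious if the image were still open on the source. */
}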