From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.1 required=3.0 tests=DKIM_INVALID,DKIM_SIGNED, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, USER_AGENT_SANE_1 autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id BA706C3F2CD for ; Mon, 23 Mar 2020 18:52:19 +0000 (UTC) Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 899D620637 for ; Mon, 23 Mar 2020 18:52:19 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="hd5ygnx7" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 899D620637 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=redhat.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Received: from localhost ([::1]:38530 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1jGSBO-0002Ou-Of for qemu-devel@archiver.kernel.org; Mon, 23 Mar 2020 14:52:18 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:38442) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1jGSAf-0001xv-5p for qemu-devel@nongnu.org; Mon, 23 Mar 2020 14:51:34 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1jGSAd-0007ae-7b for qemu-devel@nongnu.org; Mon, 23 Mar 2020 14:51:33 -0400 Received: from us-smtp-delivery-74.mimecast.com ([63.128.21.74]:21534) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1jGSAd-0007aL-3T for qemu-devel@nongnu.org; Mon, 23 Mar 2020 14:51:31 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1584989490; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=i5uNgQfCyHlEshudNjrdais8zXu8wZ0IgkAoADjIMfY=; b=hd5ygnx7l5TkPPDW3zAXKxRMVBQxZYiXpaFULhqR3ijZRdmQi9qcIP2Q21hbdAmNjr6aAa PzNUkZSxuslvXXa3FxlPrG/UTytrJOlo6L9i7Ok7EyIXq9wsHyUnz0X9ohqpTgL8bzLovn o+OiCNDTefnYDJf1a6SIwDFDScA1fs0= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-186-zkdOpzCwOWWhyPzdQZ8nFw-1; Mon, 23 Mar 2020 14:51:26 -0400 X-MC-Unique: zkdOpzCwOWWhyPzdQZ8nFw-1 Received: from smtp.corp.redhat.com (int-mx01.intmail.prod.int.phx2.redhat.com [10.5.11.11]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 2DDB5A1365; Mon, 23 Mar 2020 18:51:24 +0000 (UTC) Received: from work-vm (ovpn-112-29.ams2.redhat.com [10.36.112.29]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 965288AC30; Mon, 23 Mar 2020 18:51:16 +0000 (UTC) Date: Mon, 23 Mar 2020 18:51:14 +0000 From: "Dr. David Alan Gilbert" To: Alex Williamson Subject: Re: [PATCH v15 Kernel 4/7] vfio iommu: Implementation of ioctl for dirty pages tracking. Message-ID: <20200323185114.GF3017@work-vm> References: <1584649004-8285-1-git-send-email-kwankhede@nvidia.com> <1584649004-8285-5-git-send-email-kwankhede@nvidia.com> <20200319165704.1f4eb36a@w520.home> <20200320120137.6acd89ee@x1.home> <20200320125910.028d7af5@w520.home> <7062f72a-bf06-a8cd-89f0-9e729699a454@nvidia.com> <20200323124448.2d3bc315@w520.home> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20200323124448.2d3bc315@w520.home> User-Agent: Mutt/1.13.3 (2020-01-12) X-Scanned-By: MIMEDefang 2.79 on 10.5.11.11 X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-Received-From: 63.128.21.74 X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Zhengxiao.zx@alibaba-inc.com, kevin.tian@intel.com, yi.l.liu@intel.com, cjia@nvidia.com, kvm@vger.kernel.org, eskultet@redhat.com, ziye.yang@intel.com, cohuck@redhat.com, shuangtai.tst@alibaba-inc.com, qemu-devel@nongnu.org, zhi.a.wang@intel.com, mlevitsk@redhat.com, pasic@linux.ibm.com, aik@ozlabs.ru, Kirti Wankhede , eauger@redhat.com, felipe@nutanix.com, jonathan.davies@nutanix.com, yan.y.zhao@intel.com, changpeng.liu@intel.com, Ken.Xue@amd.com Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" * Alex Williamson (alex.williamson@redhat.com) wrote: > On Mon, 23 Mar 2020 23:24:37 +0530 > Kirti Wankhede wrote: > > > On 3/21/2020 12:29 AM, Alex Williamson wrote: > > > On Sat, 21 Mar 2020 00:12:04 +0530 > > > Kirti Wankhede wrote: > > > > > >> On 3/20/2020 11:31 PM, Alex Williamson wrote: > > >>> On Fri, 20 Mar 2020 23:19:14 +0530 > > >>> Kirti Wankhede wrote: > > >>> > > >>>> On 3/20/2020 4:27 AM, Alex Williamson wrote: > > >>>>> On Fri, 20 Mar 2020 01:46:41 +0530 > > >>>>> Kirti Wankhede wrote: > > >>>>> > > >> > > >> > > >> > > >>>>>> +static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova, > > >>>>>> + size_t size, uint64_t pgsize, > > >>>>>> + u64 __user *bitmap) > > >>>>>> +{ > > >>>>>> + struct vfio_dma *dma; > > >>>>>> + unsigned long pgshift = __ffs(pgsize); > > >>>>>> + unsigned int npages, bitmap_size; > > >>>>>> + > > >>>>>> + dma = vfio_find_dma(iommu, iova, 1); > > >>>>>> + > > >>>>>> + if (!dma) > > >>>>>> + return -EINVAL; > > >>>>>> + > > >>>>>> + if (dma->iova != iova || dma->size != size) > > >>>>>> + return -EINVAL; > > >>>>>> + > > >>>>>> + npages = dma->size >> pgshift; > > >>>>>> + bitmap_size = DIRTY_BITMAP_BYTES(npages); > > >>>>>> + > > >>>>>> + /* mark all pages dirty if all pages are pinned and mapped. */ > > >>>>>> + if (dma->iommu_mapped) > > >>>>>> + bitmap_set(dma->bitmap, 0, npages); > > >>>>>> + > > >>>>>> + if (copy_to_user((void __user *)bitmap, dma->bitmap, bitmap_size)) > > >>>>>> + return -EFAULT; > > >>>>> > > >>>>> We still need to reset the bitmap here, clearing and re-adding the > > >>>>> pages that are still pinned. > > >>>>> > > >>>>> https://lore.kernel.org/kvm/20200319070635.2ff5db56@x1.home/ > > >>>>> > > >>>> > > >>>> I thought you agreed on my reply to it > > >>>> https://lore.kernel.org/kvm/31621b70-02a9-2ea5-045f-f72b671fe703@nvidia.com/ > > >>>> > > >>>> > Why re-populate when there will be no change since > > >>>> > vfio_iova_dirty_bitmap() is called holding iommu->lock? If there is any > > >>>> > pin request while vfio_iova_dirty_bitmap() is still working, it will > > >>>> > wait till iommu->lock is released. Bitmap will be populated when page is > > >>>> > pinned. > > >>> > > >>> As coded, dirty bits are only ever set in the bitmap, never cleared. > > >>> If a page is unpinned between iterations of the user recording the > > >>> dirty bitmap, it should be marked dirty in the iteration immediately > > >>> after the unpinning and not marked dirty in the following iteration. > > >>> That doesn't happen here. We're reporting cumulative dirty pages since > > >>> logging was enabled, we need to be reporting dirty pages since the user > > >>> last retrieved the dirty bitmap. The bitmap should be cleared and > > >>> currently pinned pages re-added after copying to the user. Thanks, > > >>> > > >> > > >> Does that mean, we have to track every iteration? do we really need that > > >> tracking? > > >> > > >> Generally the flow is: > > >> - vendor driver pin x pages > > >> - Enter pre-copy-phase where vCPUs are running - user starts dirty pages > > >> tracking, then user asks dirty bitmap, x pages reported dirty by > > >> VFIO_IOMMU_DIRTY_PAGES ioctl with _GET flag > > >> - In pre-copy phase, vendor driver pins y more pages, now bitmap > > >> consists of x+y bits set > > >> - In pre-copy phase, vendor driver unpins z pages, but bitmap is not > > >> updated, so again bitmap consists of x+y bits set. > > >> - Enter in stop-and-copy phase, vCPUs are stopped, mdev devices are stopped > > >> - user asks dirty bitmap - Since here vCPU and mdev devices are stopped, > > >> pages should not get dirty by guest driver or the physical device. > > >> Hence, x+y dirty pages would be reported. > > >> > > >> I don't think we need to track every iteration of bitmap reporting. > > > > > > Yes, once a bitmap is read, it's reset. In your example, after > > > unpinning z pages the user should still see a bitmap with x+y pages, > > > but once they've read that bitmap, the next bitmap should be x+y-z. > > > Userspace can make decisions about when to switch from pre-copy to > > > stop-and-copy based on convergence, ie. the slope of the line recording > > > dirty pages per iteration. The implementation here never allows an > > > inflection point, dirty pages reported through vfio would always either > > > be flat or climbing. There might also be a case that an iommu backed > > > device could start pinning pages during the course of a migration, how > > > would the bitmap ever revert from fully populated to only tracking the > > > pinned pages? Thanks, > > > > > > > At KVM forum we discussed this - if guest driver pins say 1024 pages > > before migration starts, during pre-copy phase device can dirty 0 pages > > in best case and 1024 pages in worst case. In that case, user will > > transfer content of 1024 pages during pre-copy phase and in > > stop-and-copy phase also, that will be pages will be copied twice. So we > > decided to only get dirty pages bitmap at stop-and-copy phase. If user > > is going to get dirty pages in stop-and-copy phase only, then that will > > be single iteration. > > There aren't any devices yet that can track sys memory dirty pages. So > > we can go ahead with this patch and support for dirty pages tracking > > during pre-copy phase can be added later when there will be consumers of > > that functionality. > > So if I understand this right, you're expecting the dirty bitmap to > accumulate dirty bits, in perpetuity, so that the user can only > retrieve them once at the end of migration? But if that's the case, > the user could simply choose to not retrieve the bitmap until the end > of migration, the result would be the same. What we have here is that > dirty bits are never cleared, regardless of whether the user has seen > them, which is wrong. Sorry, we had a lot of discussions at KVM forum, > I don't recall this specific one 5 months later and maybe we weren't > considering all aspects. I see the behavior we have here as incorrect, > but it also seems relatively trivial to make correct. I hope the QEMU > code isn't making us go through all this trouble to report a dirty > bitmap that gets thrown away because it expects the final one to be > cumulative since the beginning of dirty logging. Thanks, I remember the discussion that we couldn't track the system memory dirtying with current hardware; so the question then is just to track what has been pinned and then ideally put that memory off until the end. (Which is interesting because I don't think we currently have a way to delay RAM pages till the end in qemu). [I still worry whether migration will be usable with any significant amount of system ram that's pinned in this way; the downside will very easily get above the threshold that people like] Dave > Alex -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK