From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751231AbcFXEGo (ORCPT ); Fri, 24 Jun 2016 00:06:44 -0400 Received: from mga03.intel.com ([134.134.136.65]:39836 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751019AbcFXEGi convert rfc822-to-8bit (ORCPT ); Fri, 24 Jun 2016 00:06:38 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.26,518,1459839600"; d="scan'208";a="724268430" From: "Tian, Kevin" To: Alex Williamson , Yongji Xie CC: "kvm@vger.kernel.org" , "linux-kernel@vger.kernel.org" , "bhelgaas@google.com" , "aik@ozlabs.ru" , "benh@kernel.crashing.org" , "paulus@samba.org" , "mpe@ellerman.id.au" , "warrier@linux.vnet.ibm.com" , "zhong@linux.vnet.ibm.com" , "nikunj@linux.vnet.ibm.com" , "gwshan@linux.vnet.ibm.com" Subject: RE: [PATCH v4] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive Thread-Topic: [PATCH v4] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive Thread-Index: AQHRzWoTpa5cIYkeRkGABrHGGe6H7Z/3ZhYAgAAMTgCAAIgjgA== Date: Fri, 24 Jun 2016 04:06:22 +0000 Message-ID: References: <201605301311.u4UD99he028186@mx0a-001b2d01.pphosted.com> <20160623101245.78e3606b@t450s.home> <20160623213700.2ccfe903@t450s.home> In-Reply-To: <20160623213700.2ccfe903@t450s.home> Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-ctpclassification: CTP_IC x-titus-metadata-40: eyJDYXRlZ29yeUxhYmVscyI6IiIsIk1ldGFkYXRhIjp7Im5zIjoiaHR0cDpcL1wvd3d3LnRpdHVzLmNvbVwvbnNcL0ludGVsMyIsImlkIjoiYzE1ZWU0MjUtZTY4Mi00Yjc1LWE0MGItNjkxNGY3MmJmN2UyIiwicHJvcHMiOlt7Im4iOiJDVFBDbGFzc2lmaWNhdGlvbiIsInZhbHMiOlt7InZhbHVlIjoiQ1RQX0lDIn1dfV19LCJTdWJqZWN0TGFiZWxzIjpbXSwiVE1DVmVyc2lvbiI6IjE1LjkuNi42IiwiVHJ1c3RlZExhYmVsSGFzaCI6IkRwdkZ5K2lNTFFiS0NKaHZ6eGVmbzRFT0tPdndOYUVuM1pZVVZjN0ppWkE9In0= x-originating-ip: [10.239.127.40] Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 8BIT MIME-Version: 1.0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > From: Alex Williamson [mailto:alex.williamson@redhat.com] > Sent: Friday, June 24, 2016 11:37 AM > > On Fri, 24 Jun 2016 10:52:58 +0800 > Yongji Xie wrote: > > On 2016/6/24 0:12, Alex Williamson wrote: > > > On Mon, 30 May 2016 21:06:37 +0800 > > > Yongji Xie wrote: > > >> +static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev) > > >> +{ > > >> + struct resource *res; > > >> + int bar; > > >> + struct vfio_pci_dummy_resource *dummy_res; > > >> + > > >> + INIT_LIST_HEAD(&vdev->dummy_resources_list); > > >> + > > >> + for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) { > > >> + res = vdev->pdev->resource + bar; > > >> + > > >> + if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP)) > > >> + goto no_mmap; > > >> + > > >> + if (!(res->flags & IORESOURCE_MEM)) > > >> + goto no_mmap; > > >> + > > >> + /* > > >> + * The PCI core shouldn't set up a resource with a > > >> + * type but zero size. But there may be bugs that > > >> + * cause us to do that. > > >> + */ > > >> + if (!resource_size(res)) > > >> + goto no_mmap; > > >> + > > >> + if (resource_size(res) >= PAGE_SIZE) { > > >> + vdev->bar_mmap_supported[bar] = true; > > >> + continue; > > >> + } > > >> + > > >> + if (!(res->start & ~PAGE_MASK)) { > > >> + /* > > >> + * Add a dummy resource to reserve the remainder > > >> + * of the exclusive page in case that hot-add > > >> + * device's bar is assigned into it. > > >> + */ > > >> + dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL); > > >> + if (dummy_res == NULL) > > >> + goto no_mmap; > > >> + > > >> + dummy_res->resource.start = res->end + 1; > > >> + dummy_res->resource.end = res->start + PAGE_SIZE - 1; > > >> + dummy_res->resource.flags = res->flags; > > >> + if (request_resource(res->parent, > > >> + &dummy_res->resource)) { > > >> + kfree(dummy_res); > > >> + goto no_mmap; > > >> + } > > > Isn't it true that request_resource() only tells us that at a given > > > point in time, no other drivers have reserved that resource? It seems > > > like it does not guarantee that the resource isn't routed to another > > > device or that another driver won't at some point attempt to request > > > that same resource. So for example if a user constructs their initrd > > > to bind vfio-pci to devices before other modules load, this > > > request_resource() may succeed, at the expense of drivers loaded later > > > now failing. The behavior will depend on driver load order and we're > > > not actually insuring that the overflow resource is unused, just that > > > we got it first. Can we do better? Am I missing something that > > > prevents this? Thanks, > > > > > > Alex > > > > Couldn't PCI resources allocator prevent this, which will find a > > empty slot in the resource tree firstly, then try to request that > > resource in allocate_resource() when a PCI device is probed. > > And I'd like to know why a PCI device driver would attempt to > > call request_resource()? Should this be done in PCI enumeration? > > Hi Yongji, > > Looks like most pci drivers call pci_request_regions(). From there the > call path is: > > pci_request_selected_regions > __pci_request_selected_regions > __pci_request_region > __request_mem_region > __request_region > __request_resource > > We see this driver ordering issue sometimes with users attempting to > blacklist native pci drivers, trying to leave a device free for use by > vfio-pci. If the device is a graphics card, the generic vesa or uefi > driver can request device resources causing a failure when vfio-pci > tries to request those same resources. I expect that unless it's a > boot device, like vga in my example, the resources are not enabled > until the driver opens the device, therefore the request_resource() call > doesn't occur until that point. > > For another trivial example, look at /proc/iomem as you load and unload > a driver, on my laptop with e1000e unloaded I see: > > e1200000-e121ffff : 0000:00:19.0 > e123e000-e123efff : 0000:00:19.0 > > When e1000e is loaded, each of these becomes claimed by the e1000e > driver: > > e1200000-e121ffff : 0000:00:19.0 > e1200000-e121ffff : e1000e > e123e000-e123efff : 0000:00:19.0 > e123e000-e123efff : e1000e > > Clearly pci core knows the resource is associated with the device, but > I don't think we're tapping into that with request_resource(), we're > just potentially stealing resources that another driver might have > claimed otherwise as I described above. That's my suspicion at > least, feel free to show otherwise if it's incorrect. Thanks, > > Alex It's a problem unless there is a way to trigger resource re-assignment (e.g. pci_assign_resource) on devices which if claim on same resource by VFIO. But doing this for every request_resource failure looks an overkill... Thanks Kevin