From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752483AbcF1KKR (ORCPT ); Tue, 28 Jun 2016 06:10:17 -0400 Received: from mx0a-001b2d01.pphosted.com ([148.163.156.1]:33843 "EHLO mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752166AbcF1KKO (ORCPT ); Tue, 28 Jun 2016 06:10:14 -0400 X-IBM-Helo: d28dlp03.in.ibm.com X-IBM-MailFrom: xyjxie@linux.vnet.ibm.com X-IBM-RcptTo: kvm@vger.kernel.org;linux-kernel@vger.kernel.org Subject: Re: [PATCH v4] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive To: Alex Williamson References: <201605301311.u4UD99he028186@mx0a-001b2d01.pphosted.com> <20160623101245.78e3606b@t450s.home> <20160623213700.2ccfe903@t450s.home> <3e0965bc-8c89-4534-6381-5af9469bbd7e@linux.vnet.ibm.com> <20160624104315.79898181@t450s.home> Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, bhelgaas@google.com, aik@ozlabs.ru, benh@kernel.crashing.org, paulus@samba.org, mpe@ellerman.id.au, warrier@linux.vnet.ibm.com, zhong@linux.vnet.ibm.com, nikunj@linux.vnet.ibm.com, gwshan@linux.vnet.ibm.com, kevin.tian@intel.com From: Yongji Xie Date: Tue, 28 Jun 2016 18:09:46 +0800 User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.1.1 MIME-Version: 1.0 In-Reply-To: <20160624104315.79898181@t450s.home> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-TM-AS-MML: disable X-Content-Scanned: Fidelis XPS MAILER x-cbid: 16062810-0004-0000-0000-000002B29275 X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused x-cbparentid: 16062810-0005-0000-0000-00000DB3489F Message-Id: X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:,, definitions=2016-06-28_07:,, signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 spamscore=0 suspectscore=0 malwarescore=0 phishscore=0 adultscore=0 bulkscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1604210000 definitions=main-1606280094 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi, Alex On 2016/6/25 0:43, Alex Williamson wrote: > On Fri, 24 Jun 2016 23:37:02 +0800 > Yongji Xie wrote: > >> Hi, Alex >> >> On 2016/6/24 11:37, Alex Williamson wrote: >> >>> On Fri, 24 Jun 2016 10:52:58 +0800 >>> Yongji Xie wrote: >>>> On 2016/6/24 0:12, Alex Williamson wrote: >>>>> On Mon, 30 May 2016 21:06:37 +0800 >>>>> Yongji Xie wrote: >>>>>> +static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev) >>>>>> +{ >>>>>> + struct resource *res; >>>>>> + int bar; >>>>>> + struct vfio_pci_dummy_resource *dummy_res; >>>>>> + >>>>>> + INIT_LIST_HEAD(&vdev->dummy_resources_list); >>>>>> + >>>>>> + for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) { >>>>>> + res = vdev->pdev->resource + bar; >>>>>> + >>>>>> + if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP)) >>>>>> + goto no_mmap; >>>>>> + >>>>>> + if (!(res->flags & IORESOURCE_MEM)) >>>>>> + goto no_mmap; >>>>>> + >>>>>> + /* >>>>>> + * The PCI core shouldn't set up a resource with a >>>>>> + * type but zero size. But there may be bugs that >>>>>> + * cause us to do that. >>>>>> + */ >>>>>> + if (!resource_size(res)) >>>>>> + goto no_mmap; >>>>>> + >>>>>> + if (resource_size(res) >= PAGE_SIZE) { >>>>>> + vdev->bar_mmap_supported[bar] = true; >>>>>> + continue; >>>>>> + } >>>>>> + >>>>>> + if (!(res->start & ~PAGE_MASK)) { >>>>>> + /* >>>>>> + * Add a dummy resource to reserve the remainder >>>>>> + * of the exclusive page in case that hot-add >>>>>> + * device's bar is assigned into it. >>>>>> + */ >>>>>> + dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL); >>>>>> + if (dummy_res == NULL) >>>>>> + goto no_mmap; >>>>>> + >>>>>> + dummy_res->resource.start = res->end + 1; >>>>>> + dummy_res->resource.end = res->start + PAGE_SIZE - 1; >>>>>> + dummy_res->resource.flags = res->flags; >>>>>> + if (request_resource(res->parent, >>>>>> + &dummy_res->resource)) { >>>>>> + kfree(dummy_res); >>>>>> + goto no_mmap; >>>>>> + } >>>>> Isn't it true that request_resource() only tells us that at a given >>>>> point in time, no other drivers have reserved that resource? It seems >>>>> like it does not guarantee that the resource isn't routed to another >>>>> device or that another driver won't at some point attempt to request >>>>> that same resource. So for example if a user constructs their initrd >>>>> to bind vfio-pci to devices before other modules load, this >>>>> request_resource() may succeed, at the expense of drivers loaded later >>>>> now failing. The behavior will depend on driver load order and we're >>>>> not actually insuring that the overflow resource is unused, just that >>>>> we got it first. Can we do better? Am I missing something that >>>>> prevents this? Thanks, >>>>> >>>>> Alex >>>> Couldn't PCI resources allocator prevent this, which will find a >>>> empty slot in the resource tree firstly, then try to request that >>>> resource in allocate_resource() when a PCI device is probed. >>>> And I'd like to know why a PCI device driver would attempt to >>>> call request_resource()? Should this be done in PCI enumeration? >>> Hi Yongji, >>> >>> Looks like most pci drivers call pci_request_regions(). From there the >>> call path is: >>> >>> pci_request_selected_regions >>> __pci_request_selected_regions >>> __pci_request_region >>> __request_mem_region >>> __request_region >>> __request_resource >>> >>> We see this driver ordering issue sometimes with users attempting to >>> blacklist native pci drivers, trying to leave a device free for use by >>> vfio-pci. If the device is a graphics card, the generic vesa or uefi >>> driver can request device resources causing a failure when vfio-pci >>> tries to request those same resources. I expect that unless it's a >>> boot device, like vga in my example, the resources are not enabled >>> until the driver opens the device, therefore the request_resource() call >>> doesn't occur until that point. >>> >>> For another trivial example, look at /proc/iomem as you load and unload >>> a driver, on my laptop with e1000e unloaded I see: >>> >>> e1200000-e121ffff : 0000:00:19.0 >>> e123e000-e123efff : 0000:00:19.0 >>> >>> When e1000e is loaded, each of these becomes claimed by the e1000e >>> driver: >>> >>> e1200000-e121ffff : 0000:00:19.0 >>> e1200000-e121ffff : e1000e >>> e123e000-e123efff : 0000:00:19.0 >>> e123e000-e123efff : e1000e >>> >>> Clearly pci core knows the resource is associated with the device, but >>> I don't think we're tapping into that with request_resource(), we're >>> just potentially stealing resources that another driver might have >>> claimed otherwise as I described above. That's my suspicion at >>> least, feel free to show otherwise if it's incorrect. Thanks, >>> >>> Alex >>> >> Thanks for your explanation. But I still have one question. >> Shouldn't PCI core have claimed all PCI device's resources >> after probing those devices. If so, request_resource() will fail >> when vfio-pci try to steal resources that another driver might >> request later. Anything I missed here? Some device resources >> would not be claimed in PCI core? > Hi Yongji, > > I don't know what to say, this is not how the interface currently > works. request_resource() is a driver level interface that tries to > prevent drivers from claiming conflicting resources. In this patch > you're trying to use it to probe whether a resource maps to another > device. This is not what it does. request_resource() will happily let > you claim any resource you want, so long as nobody else claimed it > first. So the only case where the assumptions in this patch are valid > is if we can guarantee that any potentially conflicting device has a > driver loaded that has claimed those resources. As it is here, > vfio-pci will happily attempt to request resources that might overlap > with another device and might break drivers that haven't yet had a > chance to probe their devices. I don't think that's acceptable. > Thanks, > > Alex > I'm trying to get your point. Let me give an example here. I'm not sure whether my understanding is right. Please point it out if I'm wrong. We assume that there are two PCI devices like this: 240000000000-24feffffffff : /pciex@3fffe40400000 240000000000-2400ffffffff : PCI Bus 0002:01 240000000000-240000007fff : 0002:01:00.0 240000000000-240000007fff : vfio-pci 240000008000-24000000ffff : 0002:01:01.0 240000008000-24000000ffff : lpfc Do you mean vfio-pci driver will succeed in requesting dummy_res: [240000008000-24000000ffff] (PAGE_SIZE is 64K) if it is loaded before lpfc driver? Like this: 240000000000-24feffffffff : /pciex@3fffe40400000 240000000000-2400ffffffff : PCI Bus 0002:01 240000000000-240000007fff : 0002:01:00.0 240000000000-240000007fff : vfio-pci 240000008000-24000000ffff : 0002:01:01.0 240000008000-24000000ffff : --> vfio-pci call request_resource() Then lpfc driver will fail when it attempts to call pci_request_regions() later. But is it possible that the dummy_res become the child of the res: 0002:01:01.0? Wouldn't request_resource() fail when it found parent res: PCI Bus 0002:01 already have conflict child res: 0002:01:01.0. And I think the case that request_resource() will succeed should like this: 240000000000-24feffffffff : /pciex@3fffe40400000 240000000000-2400ffffffff : PCI Bus 0002:01 240000000000-240000007fff : 0002:01:00.0 240000010000-240000017fff : 0002:01:01.0 There is a mem hole: [240000008000-24000000ffff] after PCI probing. After loading drivers, the resources tree will be: 240000000000-24feffffffff : /pciex@3fffe40400000 240000000000-2400ffffffff : PCI Bus 0002:01 240000000000-240000007fff : 0002:01:00.0 240000000000-240000007fff : vfio-pci 240000008000-24000000ffff : ---> vfio-pci call request_resource() 240000010000-240000017fff : 0002:01:01.0 240000010000-240000017fff : lpfc Thanks, Yongji