From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id ED588C433DB for ; Thu, 14 Jan 2021 21:45:14 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id AD2A023A5C for ; Thu, 14 Jan 2021 21:45:14 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729500AbhANVox (ORCPT ); Thu, 14 Jan 2021 16:44:53 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:46090 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726239AbhANVou (ORCPT ); Thu, 14 Jan 2021 16:44:50 -0500 Received: from mail-io1-xd2b.google.com (mail-io1-xd2b.google.com [IPv6:2607:f8b0:4864:20::d2b]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id DE0B6C061757; Thu, 14 Jan 2021 13:44:09 -0800 (PST) Received: by mail-io1-xd2b.google.com with SMTP id u17so14308935iow.1; Thu, 14 Jan 2021 13:44:09 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=pFRKfXI+geT2Xti5kKSP2qV5qTl25DZTIwSYqP1kwwQ=; b=p+ri3TuIU+me8X0FKf4kJ7zNrAqx5WNbNEs1rUW2BQKAOqWkJqc74aFMEObI5AfIsd qMVdebSUSuGPBgXtL3H218On0W2J36U6pjuO1piPiJDtuZUaKMOSq9PClm2BgNwbaPAo VDdiO+1JcnZEiSbF7fxbiHXFVJoiTZzVLS3Iyk6mud+v87dnJ6tKim14D2VMMyfwKPd4 Bq+h7JHn7r49CjYPgI8RFbaw96k8dJtwXCtv2pCXC9dEsDYJaNoNo8oQ2C+e+6fjAKV6 GUfu8ItwEMUJc/Q3K7D6llwDeiUw20YbrigQxEUjFN30ZY7pDTM3glsuieYIHMo8ov9G TCpg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=pFRKfXI+geT2Xti5kKSP2qV5qTl25DZTIwSYqP1kwwQ=; b=fDNnA9/j7S8wOF+R0X/l3TTpHhFoCufOMENeLnGAkb83aXCHKgmT2ypTyHCYrNp4Nr IBfk5ViwrH55uHM83mpMAqhS556s59nTYsJhVE4CFVx9VknFsFsWRuok4Chj/Qgj6I3h xQLuWFSGbV3Foumsli9qfRzVlmMErAyLc3gswAXMn5W2gYW49hN3KHNDVDm8OZ0BQjOp 1/u6HBj5aPwK0ZjV6aRULcZVtmt3acwxl9FRDv16sO2XVMwxBm+Zg5lT1gk+Dw4oqzQX 41N4qxfbz4N8fNGXP0BAAbZLFfOlMT+/sLDxr+kgSSBiEY9fU8MuOqPBUeclj+FOeNNY 5dNw== X-Gm-Message-State: AOAM532vAbpi/AOM2amdNK0EaxElE0XO8vclPo4y1dAI5lgKSWu+yaWA aZOvqJ7/msTXStqYfXkCQmK7JHEjKgpS8NdWO5M= X-Google-Smtp-Source: ABdhPJzYgu5OLqSb7W7y06Hia0gwsS8lLSCxvAD8zMz7PosHcN0Xs2ZgwtJ1OW5dIp0Qh7jtMvTUHB/JVymfW83OW/A= X-Received: by 2002:a92:d592:: with SMTP id a18mr8389451iln.64.1610660649069; Thu, 14 Jan 2021 13:44:09 -0800 (PST) MIME-Version: 1.0 References: <20210112065601.GD4678@unreal> <20210113061909.GG4678@unreal> <20210114065024.GK4678@unreal> <20210114164857.GN4147@nvidia.com> <20210114182945.GO4147@nvidia.com> <20210114200825.GR4147@nvidia.com> In-Reply-To: <20210114200825.GR4147@nvidia.com> From: Alexander Duyck Date: Thu, 14 Jan 2021 13:43:57 -0800 Message-ID: Subject: Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors To: Jason Gunthorpe Cc: Leon Romanovsky , Bjorn Helgaas , Saeed Mahameed , Jakub Kicinski , linux-pci , linux-rdma@vger.kernel.org, Netdev , Don Dutile , Alex Williamson Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-pci@vger.kernel.org On Thu, Jan 14, 2021 at 12:08 PM Jason Gunthorpe wrote: > > On Thu, Jan 14, 2021 at 11:24:12AM -0800, Alexander Duyck wrote: > > > > As you say BAR and MSI vector table are not so different, it seems > > > perfectly in line with current PCI sig thinking to allow resizing the > > > MSI as well > > > > The resizing of a BAR has an extended capability that goes with it and > > isn't something that the device can just do on a whim. This patch set > > is not based on implementing some ECN for resizable MSI-X tables. I > > view it as arbitrarily rewriting the table size for a device after it > > is created. > > The only difference is resizing the BAR is backed by an ECN, and this > is an extension. The device does not "do it on a whim" the OS tells it > when to change the size, exactly like for BAR resizing. > > > In addition Leon still hasn't answered my question on preventing the > > VF driver from altering entries beyond the ones listed in the table. > > Of course this is blocked, the FW completely revokes the HW resource > backing the vectors. One of the troubles with this is that I am having to take your word for it. The worst case scenario in my mind is that this is just was Leon described it earlier and the firmware interface is doing nothing more than altering the table size in the MSI-X config space so that the value can be picked up by VFIO and advertised to the guest. In such a situation you end up opening up a backdoor as now there are vectors that could be configured by userspace since the protections provided by VFIO are disabled as you could be reporting a size that is smaller than the actual number or vectors. These are the kind of things I am worried about with this interface. It just seems like this is altering the VF PCIe device config to address an issue that would be better handled by the vfio interface ath VM creation time. At least if this was left to vfio it could prevent the VM from being able to access the unused entries and just limit the guest to the ones that we want to have the VM access. Instead we are just being expected to trust the firmware for security from the VF should it decide to try and be malicious. > > From what I can tell, the mlx5 Linux driver never reads the MSI-X > > flags register so it isn't reading the MSI-X size either. > > I don't know why you say that. All Linux drivers call into something > like pci_alloc_irq_vectors() requesting a maximum # of vectors and > that call returns the actual allocated. Drivers can request more > vectors than the system provides, which is what mlx5 does. > > Under the call chain of pci_alloc_irq_vectors() it calls > pci_msix_vec_count() which does > > pci_read_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, &control); > return msix_table_size(control); > > And eventually uses that to return the result to the driver. > > So, yes, it reads the config space and ensures it doesn't allocate > more vectors than that. > > Every driver using MSI does this in Linux. > > Adjusting config space *directly* limits the number of vectors the > driver allocates. > > You should be able to find the call chain in mlx5 based on the above > guidance. Yes, but at the same time you also end up passing a max_vecs into the function as you have multiple limitations that come into account from both the driver, the system, and the table. The MSI-X table size is just one piece. Specifically the bit in the code for the mlx5 driver is: nvec = MLX5_CAP_GEN(dev, num_ports) * num_online_cpus() + MLX5_IRQ_VEC_COMP_BASE; nvec = min_t(int, nvec, num_eqs); So saying that the MSI-X table size is what defines the number of interrupts you can use is only partially true. What it defines is the aperture available in MMIO to define the possible addresses and values to be written to trigger the interrupts. The device itself plays a large role in defining the number of interrupts ultimately requested. > > At a minimum I really think we need to go through and have a clear > > definition on when updating the MSI-X table size is okay and when it > > is not. I am not sure just saying to not update it when a driver isn't > > attached is enough to guarantee all that. > > If you know of a real issue then please state it, other don't fear > monger "maybe" issues that don't exist. Well I don't have visibility into your firmware so I am not sure what is going on in response to this command so forgive me when I do a bit of fear mongering when somebody tells me that all this patch set does is modify the VF configuration space. As I have said my main concern is somebody really screwing something like this up and creating a security vulnerability where they do something exactly like updating just the MSI-X table size without protecting the MMIO region that contains the remaining MSI-X table entries. This is why I would be much more comfortable with something like a vfio ioctl that says that while the device supports the number reported in the MSI-X table size the vfio will only present some subset of those entries to the guest. With that I could at least review the code and verify that it is providing the expected protections. > > What we are talking about is the MSI-X table size. Not the number of > > MSI-X vectors being requested by the device driver. Those are normally > > two seperate things. > > Yes, table size is what is critical. The number of entries in that BAR > memory is what needs to be controlled. That is where we disagree. My past experience is that in the device you have to be able to associate an interrupt cause to an interrupt vector. Normally as a part of that the device itself will place some limit on how many causes and vectors you can associate before you even get to the MSI-X table. The MSI-X table size is usually a formality that defines the upper limit on the number of entries the device might request. The reason why most drivers don't bother asking for it or reading it is because it is defined early as a part of the definition of the device itself. Going back to my earlier example I am not going to size a MSI-X table at 2048 for a device that only has a few interrupt sources. Odds are I would size it such that I will have enough entries in the table to cover all my interrupt sources. Usually the limiting factor for an MSI-X request is the system itself as too many devices requesting a large number of interrupts may end up eating up all the vectors available for a given CPU. > > > The standards based way to communicate that limit is the MSI table cap > > > size. > > > > This only defines the maximum number of entries, not how many have to be used. > > A driver can't use entries beyond the cap. We are not trying to > reclaim vectors that are available but not used by the OS. The MSI-X table consists of a MMIO region in one of the BARs on the device. It is easily possible to code something up in a driver that would go in and be able to access the region. Most sensible devices place it in a separate BAR but even then you have to worry about them possibly sharing the memory region internally among several devices. The big thing I see as an issue with all this is that arbitrarily reducing the size of the MSI-X table doesn't have any actual effect on the MMIO resources. They are still there. So a bad firmware doing something like reducing the table size without also restricting access to the resources in the BAR potentially opens up a security hole as the MSI-X vector is really nothing more than a pre-programmed PCIe write waiting for something to trigger it. Odds are an IOMMU would block it, but still not necessarily a good thing. As such my preference would be to leave the MSI-X table size static at the size of possible vectors that could be modified, and instead have the firmware guarantee that writing to those registers has no effect. Then when you do something like a direct assignment vfio_pci will also guard that region of MMIO instead of only guarding a portion of it. > > I'm not offering up a non-standard way to do this. Just think about > > it. If I have a device that defines an MSI-X table size of 2048 but > > makes use of only one vector how would that be any different than what > > I am suggesting where you size your VF to whatever the maximum is you > > need but only make use of some fixed number from the hardware. > > That is completely different, in the hypervisor there is no idea how > many vectors a guest OS will create. The FW is told to only permit 1 > vector. How is the guest to know this if we don't update the config > space *as the standard requires* ? Doesn't the guest driver talk to the firmware? Last I knew it had to request additional resources such as queues and those come from the firmware don't they? > > I will repeat what I said before. Why can't this be handled via the > > vfio interface? > > 1) The FW has to be told of the limit and everything has to be in sync > If the FW is out of sync with the config space then everything > breaks if the user makes even a small mistake - for instance > forgetting to use the ioctl to override vfio. This is needlessly > frail and complicated. That is also the way I feel about the sysfs solution. I just see VFIO making the call to the device to notify it that it only needs X number of vectors instead of having the VF sysfs doing it. > 2) VFIO needs to know how to tell the FW the limit so it can override > the config space with emulation. This is all device specific and I > don't see that adding an extension to vfio is any better than > adding one here. So it depends on the setup. In my suggestion I was suggesting VFIO defines the maximum, not the minimum. So the guest would automatically know exactly how many it would have as the table size would be specified inside the guest. > 3) VFIO doesn't cover any other driver that binds to the VF, so > this "solution" leaves the normal mlx5_core functionally broken on > VFs which is a major regression. > > Overall the entire idea to have the config space not reflect the > actual current device configuration seems completely wrong to me - why > do this? For what gain? It breaks everything. Your configuration is admittedly foreign to me as well. So if I am understanding correctly your limiting factor isn't the number of interrupt causes in the platform, but the firmware deciding to provide too few interrupt vectors in the MSI-X table. I'm just kind of surprised the firmware itself isn't providing some sort of messaging on the number of interrupt vectors that a given device has since I assume that it is already providing you with information on the number of queues and such since that isn't provided by any other mechanism.