Re: [PATCH 12/12] s390x/pci: let intercept devices have separate PCI groups

From: Matthew Rosato <mjrosato@linux.ibm.com>
To: Pierre Morel <pmorel@linux.ibm.com>, qemu-s390x@nongnu.org
Cc: farman@linux.ibm.com, kvm@vger.kernel.org,
	schnelle@linux.ibm.com, cohuck@redhat.com,
	richard.henderson@linaro.org, thuth@redhat.com,
	qemu-devel@nongnu.org, pasic@linux.ibm.com,
	alex.williamson@redhat.com, mst@redhat.com, pbonzini@redhat.com,
	david@redhat.com, borntraeger@linux.ibm.com
Subject: Re: [PATCH 12/12] s390x/pci: let intercept devices have separate PCI groups
Date: Thu, 16 Dec 2021 10:16:10 -0500
Message-ID: <b445e4e7-21b1-b9bc-3d9f-9f5f94c1d7fa@linux.ibm.com>
In-Reply-To: <599c66a7-6e91-1fd0-ac96-bec7ffe51dfe@linux.ibm.com>

On 12/16/21 3:15 AM, Pierre Morel wrote:
> 
> 
> On 12/7/21 22:04, Matthew Rosato wrote:
>> Let's use the reserved pool of simulated PCI groups to allow intercept
>> devices to have separate groups from interpreted devices as some group
>> values may be different. If we run out of simulated PCI groups, 
>> subsequent
>> intercept devices just get the default group.
>> Furthermore, if we encounter any PCI groups from hostdevs that are marked
>> as simulated, let's just assign them to the default group to avoid
>> conflicts between host simulated groups and our own simulated groups.
> 
> I have a problem here.
> We will have the same hardware viewed by 2 different VFIO implementation 
> (interpretation vs interception) reporting different groups ID.

Yes -- To be clear, this patch proposes that the interpreted device will 
continue to report the passthrough group ID and the intercept device 
will use a simulated group ID.

> 
> The alternative is to have them reporting same group ID with different 
> values.
>

I don't think we can do this.  For starters, we would have to throw out 
the group tracking we do in QEMU; but for all we know the guest could be 
doing similar tracking -- the implication of the group ID is that 
everyone shares the same values so I don't think we can get away with 
reporting different values for 2 members of the same group.

I think the other alternative is rather to always do something like...

1) host reports its value via vfio capabilities as 'this is what an 
interpreted device can use'
2) QEMU must accept those values as-is OR reduce them to some subset of 
what both interpretation and intercept can support, and report only 
those values for all devices in the group.  (More on this further down)

> I fear both are wrong.
> 
> On the other hand, should we have a difference in the QEMU command line 
> between intercepted and interpreted devices for default values.

I'm not sure I follow what you suggest here.  Even if we somehow 
provided a command-line means for specifying some of these values, they 
would still be presented to the guest via clp and if the guest has 2 
devices in the same group the clp results had better be the same.

> If not why not give up the host values so that in an hypothetical future 
> migration we are clean with the GID ?
> 

Well, the interpreted device will use the passthrough group ID so in a 
hypothetical future migration scenario we should be good there.

And simulated devices will still use the default group, so we should 
also be OK there.

This really changes the behavior for 2 other classes of device:

1) Intercept passthrough devices -- Yes, I agree that doing this is a
bit weird.  But my thinking was that these devices should be the
exception case rather than the norm moving forward, and it would clearly
delineate the difference in Q PCI FNGRP values.

2) nested simulated devices -- These aren't using real GIDs anyway and I 
would expect them to also be using the default group already -- forcing 
these to the default group was basically to make sure they didn't 
conflict with the simulated groups being created for intercept devices 
above.

> I am not sure of this, just want to open a little discussion on this.

FWIW, I'm not 100% on this either, so a better idea is welcome.  One 
thing I don't like, for example, is that we only have 16 simulated 
groups to work with, and for example we might find it useful later to 
split simulated devices into different groups based on type.
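
To make the pool-size concern concrete, here is a minimal standalone
sketch of the allocation scheme the patch uses.  Only the two defines
and the next_sim_grp counter come from the patch; struct sim_grp_pool
and the function names are made up for illustration.  If I count right,
with 0xFF reserved for the default group the counter hands out
0xF0..0xFE before it falls back.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ZPCI_DEFAULT_FN_GRP 0xFF
#define ZPCI_SIM_GRP_START  0xF0

/* Hypothetical stand-in for the next_sim_grp field in S390pciState. */
struct sim_grp_pool {
    uint8_t next_sim_grp;
};

void sim_grp_pool_init(struct sim_grp_pool *p)
{
    assert(p != NULL);
    p->next_sim_grp = ZPCI_SIM_GRP_START;
}

/*
 * Hand out the next free simulated group ID, or the default group once
 * the counter has reached ZPCI_DEFAULT_FN_GRP.
 */
uint8_t sim_grp_alloc(struct sim_grp_pool *p)
{
    assert(p != NULL);
    if (p->next_sim_grp == ZPCI_DEFAULT_FN_GRP) {
        return ZPCI_DEFAULT_FN_GRP;
    }
    return p->next_sim_grp++;
}

/* Count how many distinct non-default IDs a fresh pool yields. */
int sim_grp_capacity(void)
{
    struct sim_grp_pool p;
    int n = 0;

    sim_grp_pool_init(&p);
    while (sim_grp_alloc(&p) != ZPCI_DEFAULT_FN_GRP) {
        n++;
    }
    return n;
}
```

Once the counter reaches the default value, every later caller gets the
default group, so splitting simulated devices by type later would have
to share this same small range.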

> 
> For example what could go wrong to keep the host values returned by the 
> CAP?

As-is, we risk advertising the wrong maxstbl and dtsm values for some
devices in the group, depending on which device is plugged first. 
Imagine you have 2 devices on group 5; one will be interpreted and the 
other intercepted.

If the interpreted device plugs first, we will use the passthrough 
maxstbl and dtsm for all devices in the group; so the intercept device 
gets these values too.

If the intercept device plugs first, we will use the QEMU value for DTSM
and the smaller maxstbl required for intercept passthrough.  So the
interpreted device gets these values too.
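
A toy model of that ordering problem, in case it helps.  This is not
the QEMU code: struct grp_vals, struct dev_vals and both functions are
hypothetical, and the numbers used with them are arbitrary.  It only
shows the "first device to plug defines the group" behavior described
above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified per-group values; not ClpRspQueryPciGrp. */
struct grp_vals {
    bool     defined;
    uint16_t maxstbl;
    uint64_t dtsm;
};

/* Values a device would bring along when it plugs. */
struct dev_vals {
    bool     interp;
    uint16_t maxstbl;
    uint64_t dtsm;
};

/* The first device to plug defines the group; later devices inherit. */
void grp_plug(struct grp_vals *g, const struct dev_vals *d)
{
    assert(g != NULL && d != NULL);
    if (!g->defined) {
        g->maxstbl = d->maxstbl;
        g->dtsm = d->dtsm;
        g->defined = true;
    }
}

/* Plug two devices in the given order and return the group's maxstbl. */
uint16_t grp_maxstbl_after(const struct dev_vals *first,
                           const struct dev_vals *second)
{
    struct grp_vals g = { 0 };

    grp_plug(&g, first);
    grp_plug(&g, second);
    return g.maxstbl;
}
```

Calling grp_maxstbl_after() with the same two devices in swapped order
returns two different values, which is exactly what the guest would see
via clp for one and the same group ID.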

Worth noting, we could have more of these differences later -- But if we
want to avoid splitting the group, then I think we have to circle
back to my 'alternative idea' above and provide equivalent support or
toleration for intercept devices so that we can report a single group
value that both types can support.

So insofar as dealing with the differences today...  maxstbl is pretty 
easy, we can just tolerate supporting the larger maxstbl in QEMU by 
adding logic to break up the I/O in pcistb_service_call.  We might have 
to provide 2 different maxstbl values over vfio capabilities however
(what interpretation can support vs what the kernel API supports for
intercept, as this could change between host kernel versions).
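
Roughly what I have in mind for breaking up the I/O, as a standalone
sketch rather than the actual pcistb_service_call change.  The callback
type, stb_write_chunked() and stb_count_bytes() are all hypothetical,
and this ignores any alignment or atomicity requirements the real
instruction handling would have to respect.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback performing one store of at most the backend limit. */
typedef int (*stb_write_fn)(void *opaque, uint64_t offset,
                            const uint8_t *buf, size_t len);

/*
 * Split one guest store block of 'len' bytes into pieces no larger than
 * 'backend_max'.  Returns the number of pieces written, or -1 on the
 * first failed write.
 */
int stb_write_chunked(stb_write_fn do_write, void *opaque, uint64_t offset,
                      const uint8_t *buf, size_t len, size_t backend_max)
{
    int pieces = 0;

    assert(do_write != NULL && buf != NULL && backend_max > 0);
    while (len > 0) {
        size_t n = len < backend_max ? len : backend_max;

        if (do_write(opaque, offset, buf, n) != 0) {
            return -1;
        }
        offset += n;
        buf += n;
        len -= n;
        pieces++;
    }
    return pieces;
}

/* Trivial backend for demonstration: adds up the bytes it was given. */
int stb_count_bytes(void *opaque, uint64_t offset,
                    const uint8_t *buf, size_t len)
{
    (void)offset;
    (void)buf;
    *(size_t *)opaque += len;
    return 0;
}
```

With something like this, the group could advertise the larger
(interpretation) maxstbl while the intercept path writes in pieces no
larger than the limit the kernel API accepts.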

DTSM is a little trickier.  We are actually OK today because both 
intercept and interpreted devices will report the same value anyway, but 
that could change in the future.  Maybe here QEMU must report

dtsm = (QEMU_SUPPORT_MASK & HOST_SUPPORT_MASK);

So basically: ensure that only what both QEMU intercept and passthrough
support is advertised via the clp.  If we want to support a new type
later, then we must either support it in both kvm and QEMU to enable it
for the guest, or disallow intercept devices on that group, or provide
some means of forcing an intercept device to the default group, etc.
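
Spelled out as a small self-checking helper, purely as a sketch: the
function names, the uint64_t width and any bit values passed in are
assumptions for illustration, not the real DTSM layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if every bit set in 'sub' is also set in 'super'. */
bool zpci_mask_is_subset(uint64_t sub, uint64_t super)
{
    return (sub & ~super) == 0;
}

/*
 * Hypothetical helper: advertise only the translation formats that both
 * the QEMU intercept path and the host report as supported.
 */
uint64_t zpci_common_dtsm(uint64_t qemu_mask, uint64_t host_mask)
{
    uint64_t common = qemu_mask & host_mask;

    /* The advertised mask can never exceed what either side supports. */
    assert(zpci_mask_is_subset(common, qemu_mask));
    assert(zpci_mask_is_subset(common, host_mask));
    return common;
}
```

A bit that only the host sets simply drops out of the result, so a new
type stays hidden from the guest until QEMU's mask includes it too.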

If we do the above, then I think we can drop the idea of using simulated
groups for intercept passthrough devices.  What do you think?

> 
> 
>>
>> Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
>> ---
>>   hw/s390x/s390-pci-bus.c         | 19 ++++++++++++++--
>>   hw/s390x/s390-pci-vfio.c        | 40 ++++++++++++++++++++++++++++++---
>>   include/hw/s390x/s390-pci-bus.h |  6 ++++-
>>   3 files changed, 59 insertions(+), 6 deletions(-)
>>
>> diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c
>> index ab442f17fb..8b0f3ef120 100644
>> --- a/hw/s390x/s390-pci-bus.c
>> +++ b/hw/s390x/s390-pci-bus.c
>> @@ -747,13 +747,14 @@ static void s390_pci_iommu_free(S390pciState *s, 
>> PCIBus *bus, int32_t devfn)
>>       object_unref(OBJECT(iommu));
>>   }
>> -S390PCIGroup *s390_group_create(int id)
>> +S390PCIGroup *s390_group_create(int id, int host_id)
>>   {
>>       S390PCIGroup *group;
>>       S390pciState *s = s390_get_phb();
>>       group = g_new0(S390PCIGroup, 1);
>>       group->id = id;
>> +    group->host_id = host_id;
>>       QTAILQ_INSERT_TAIL(&s->zpci_groups, group, link);
>>       return group;
>>   }
>> @@ -771,12 +772,25 @@ S390PCIGroup *s390_group_find(int id)
>>       return NULL;
>>   }
>> +S390PCIGroup *s390_group_find_host_sim(int host_id)
>> +{
>> +    S390PCIGroup *group;
>> +    S390pciState *s = s390_get_phb();
>> +
>> +    QTAILQ_FOREACH(group, &s->zpci_groups, link) {
>> +        if (group->id >= ZPCI_SIM_GRP_START && group->host_id == 
>> host_id) {
>> +            return group;
>> +        }
>> +    }
>> +    return NULL;
>> +}
>> +
>>   static void s390_pci_init_default_group(void)
>>   {
>>       S390PCIGroup *group;
>>       ClpRspQueryPciGrp *resgrp;
>> -    group = s390_group_create(ZPCI_DEFAULT_FN_GRP);
>> +    group = s390_group_create(ZPCI_DEFAULT_FN_GRP, ZPCI_DEFAULT_FN_GRP);
>>       resgrp = &group->zpci_group;
>>       resgrp->fr = 1;
>>       resgrp->dasm = 0;
>> @@ -824,6 +838,7 @@ static void s390_pcihost_realize(DeviceState *dev, 
>> Error **errp)
>>                                              NULL, g_free);
>>       s->zpci_table = g_hash_table_new_full(g_int_hash, g_int_equal, 
>> NULL, NULL);
>>       s->bus_no = 0;
>> +    s->next_sim_grp = ZPCI_SIM_GRP_START;
>>       QTAILQ_INIT(&s->pending_sei);
>>       QTAILQ_INIT(&s->zpci_devs);
>>       QTAILQ_INIT(&s->zpci_dma_limit);
>> diff --git a/hw/s390x/s390-pci-vfio.c b/hw/s390x/s390-pci-vfio.c
>> index c9269683f5..bdc5892287 100644
>> --- a/hw/s390x/s390-pci-vfio.c
>> +++ b/hw/s390x/s390-pci-vfio.c
>> @@ -305,13 +305,17 @@ static void s390_pci_read_group(S390PCIBusDevice 
>> *pbdev,
>>   {
>>       struct vfio_info_cap_header *hdr;
>>       struct vfio_device_info_cap_zpci_group *cap;
>> +    S390pciState *s = s390_get_phb();
>>       ClpRspQueryPciGrp *resgrp;
>>       VFIOPCIDevice *vpci =  container_of(pbdev->pdev, VFIOPCIDevice, 
>> pdev);
>>       hdr = vfio_get_device_info_cap(info, 
>> VFIO_DEVICE_INFO_CAP_ZPCI_GROUP);
>> -    /* If capability not provided, just use the default group */
>> -    if (hdr == NULL) {
>> +    /*
>> +     * If capability not provided or the underlying hostdev is 
>> simulated, just
>> +     * use the default group.
>> +     */
>> +    if (hdr == NULL || pbdev->zpci_fn.pfgid >= ZPCI_SIM_GRP_START) {
>>           trace_s390_pci_clp_cap(vpci->vbasedev.name,
>>                                  VFIO_DEVICE_INFO_CAP_ZPCI_GROUP);
>>           pbdev->zpci_fn.pfgid = ZPCI_DEFAULT_FN_GRP;
>> @@ -320,11 +324,41 @@ static void s390_pci_read_group(S390PCIBusDevice 
>> *pbdev,
>>       }
>>       cap = (void *) hdr;
>> +    /*
>> +     * For an intercept device, let's use an existing simulated group 
>> if one
>> +     * one was already created for other intercept devices in this 
>> group.
>> +     * If not, create a new simulated group if any are still available.
>> +     * If all else fails, just fall back on the default group.
>> +     */
>> +    if (!pbdev->interp) {
>> +        pbdev->pci_group = 
>> s390_group_find_host_sim(pbdev->zpci_fn.pfgid);
>> +        if (pbdev->pci_group) {
>> +            /* Use existing simulated group */
>> +            pbdev->zpci_fn.pfgid = pbdev->pci_group->id;
>> +            return;
>> +        } else {
>> +            if (s->next_sim_grp == ZPCI_DEFAULT_FN_GRP) {
>> +                /* All out of simulated groups, use default */
>> +                trace_s390_pci_clp_cap(vpci->vbasedev.name,
>> +                                       VFIO_DEVICE_INFO_CAP_ZPCI_GROUP);
>> +                pbdev->zpci_fn.pfgid = ZPCI_DEFAULT_FN_GRP;
>> +                pbdev->pci_group = s390_group_find(ZPCI_DEFAULT_FN_GRP);
>> +                return;
>> +            } else {
>> +                /* We can assign a new simulated group */
>> +                pbdev->zpci_fn.pfgid = s->next_sim_grp;
>> +                s->next_sim_grp++;
>> +                /* Fall through to create the new sim group using CLP 
>> info */
>> +            }
>> +        }
>> +    }
>> +
>>       /* See if the PCI group is already defined, create if not */
>>       pbdev->pci_group = s390_group_find(pbdev->zpci_fn.pfgid);
>>       if (!pbdev->pci_group) {
>> -        pbdev->pci_group = s390_group_create(pbdev->zpci_fn.pfgid);
>> +        pbdev->pci_group = s390_group_create(pbdev->zpci_fn.pfgid,
>> +                                             pbdev->zpci_fn.pfgid);
>>           resgrp = &pbdev->pci_group->zpci_group;
>>           if (cap->flags & VFIO_DEVICE_INFO_ZPCI_FLAG_REFRESH) {
>> diff --git a/include/hw/s390x/s390-pci-bus.h 
>> b/include/hw/s390x/s390-pci-bus.h
>> index 9941ca0084..8664023d5d 100644
>> --- a/include/hw/s390x/s390-pci-bus.h
>> +++ b/include/hw/s390x/s390-pci-bus.h
>> @@ -315,13 +315,16 @@ typedef struct ZpciFmb {
>>   QEMU_BUILD_BUG_MSG(offsetof(ZpciFmb, fmt0) != 48, "padding in 
>> ZpciFmb");
>>   #define ZPCI_DEFAULT_FN_GRP 0xFF
>> +#define ZPCI_SIM_GRP_START 0xF0
>>   typedef struct S390PCIGroup {
>>       ClpRspQueryPciGrp zpci_group;
>>       int id;
>> +    int host_id;
>>       QTAILQ_ENTRY(S390PCIGroup) link;
>>   } S390PCIGroup;
>> -S390PCIGroup *s390_group_create(int id);
>> +S390PCIGroup *s390_group_create(int id, int host_id);
>>   S390PCIGroup *s390_group_find(int id);
>> +S390PCIGroup *s390_group_find_host_sim(int host_id);
>>   struct S390PCIBusDevice {
>>       DeviceState qdev;
>> @@ -370,6 +373,7 @@ struct S390pciState {
>>       QTAILQ_HEAD(, S390PCIBusDevice) zpci_devs;
>>       QTAILQ_HEAD(, S390PCIDMACount) zpci_dma_limit;
>>       QTAILQ_HEAD(, S390PCIGroup) zpci_groups;
>> +    uint8_t next_sim_grp;
>>   };
>>   S390pciState *s390_get_phb(void);
>>
>