On 24/07/17 13:28, David Gibson wrote:
> On Fri, Jul 21, 2017 at 06:21:31PM +1000, Benjamin Herrenschmidt wrote:
>> On Fri, 2017-07-21 at 17:50 +1000, David Gibson wrote:
>>> On Wed, Jul 19, 2017 at 02:02:18PM +1000, Benjamin Herrenschmidt wrote:
>>>> On Wed, 2017-07-19 at 13:08 +1000, David Gibson wrote:
>>>>>
>>>>> I'm somewhat uncomfortable with an irq allocator here in the intc
>>>>> code.  As a rule, irq allocation is the responsibility of the machine,
>>>>> not any sub-component.  Furthermore, it should allocate in a way which
>>>>> is repeatable, since they need to stay stable across reboots and
>>>>> migrations.
>>>>>
>>>>> And, yes, we have an allocator of sorts in XICS - it has caused a
>>>>> number of problems in the past.
>>>>
>>>> So....
>>>>
>>>> For a bare metal model (which we don't have yet) of XIVE, the IRQ
>>>> numbering is entirely an artifact of how the HW is configured. There
>>>> should thus be no interrupt numbers visible to qemu.
>>>
>>> Uh.. I don't entirely follow.  Do you mean that during boot the guest
>>> programs the irq numbers into the various components?
>>
>> I said a "bare metal model" but yes. Pretty much. 
> 
> Right, by "guest" I meant the kernel running under qemu, even if it's
> running on a bare-metal equivalent platform.
> 
>>> In that case this allocator stuff definitely doesn't belong on the
>>> xive code.
>>>
>>>> For a PAPR model things are a bit different, but if we want to
>>>> maximize code re-use between the two, we probably need to make sure
>>>> the interrupts "allocated" by the machine for XIVE can be represented
>>>> by the HW model.
>>>>
>>>> That means:
>>>>
>>>>  - Each chip has a range (high bits are the block ID, which maps to a
>>>> chip, low bits, around 512K to 1M interrupts is the per-chip space).
>>>>
>>>>  - Interrupts 0...N of that range (N depends on how much backing
>>>> memory and MMIO space is provisioned for each chip) are "generic IPIs"
>>>> which are somewhat generic interrupt sources that can be triggered with
>>>> an MMIO store and routed to any target. Those are used in PAPR for
>>>> things like IPIs and some types of accelerator interrupts.
>>>>
>>>>  - Portions of that range (which may or may not overlap the 0...N
>>>> above, if they do they "shadow" the generic interrupts) can be
>>>> configured to be the HW sources from the various PCIe bridges and
>>>> the PSI controller.
>>>
>>> Err.. I'm not sure how this relates to spapr.  There are
>>> no chips or PSI there, and the PCI bridges aren't really the same
>>> thing.
>>
>> The above is the HW model, sorry for the confusion. With a few comments
>> about how they are used in PAPR.
>>
>> So yes, in PAPR there's an "allocator" because the hypervisor will
>> create a guest "virtual" (or logical to use PAPR terminology) interrupt
>> number space, in order to represent the various interrupts to the
>> guest.
> 
> Ok, but are each of those logical irqs bound to a specific device/PHB
> line/whatever, or can they be configured by the guest?
> 
>> Those numbers however are just tokens, they don't have to represent any
>> real HW concept. So they can be "allocated" in a rather fixed way, for
>> example, you could have something like a fixed map where you put all
>> the PCI interrupts at a certain number (a factor of the PHB# with room
>> or a fixed number per PHB, maybe 16K or so, the HW does 4K max). Another
>> base would have a chunk of "general purpose" IPIs (for use for actual
>> IPIs and for other things to come). And a range for the virtual device
>> interrupts for example. Or you can just use an allocator.
> 
> Hm.  So what I'm meaning by an "allocator" is something at least
> partially dynamic.  Something you say "give me an irq" and it gives
> you the next available or similar.  As opposed to any mapping from
> devices to (logical) irqs, which the machine will need to supply one
> way or another.

I am probably reading it wrong, but the XIVE allocator allocates IRQ
ranges for interrupt source controllers (which are CPU cores, PHBs, PSI on
bare metal), so each of them allocates just once, at machine creation.
Individual interrupts are still allocated via spapr_ics_alloc_block().
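
To make the "fixed map" alternative quoted above concrete, here is a
minimal sketch, not taken from the patch series. All names and range
sizes are hypothetical, and the only figure borrowed from the thread is
the "maybe 16K or so" per PHB. The point is that each logical irq is a
pure function of (PHB index, source index), so it cannot change across
reboots or migrations.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout of the guest logical irq number space. */
#define LIRQ_IPI_BASE   0x0000u  /* general-purpose IPIs */
#define LIRQ_IPI_COUNT  0x1000u
#define LIRQ_VIO_BASE   0x1000u  /* virtual device interrupts */
#define LIRQ_VIO_COUNT  0x1000u
#define LIRQ_PHB_BASE   0x4000u  /* first PHB window starts here */
#define LIRQ_PHB_COUNT  0x4000u  /* 16K logical irqs reserved per PHB */

/* Logical irq of general-purpose IPI number n. */
static uint32_t ipi_lirq(uint32_t n)
{
    assert(n < LIRQ_IPI_COUNT);
    return LIRQ_IPI_BASE + n;
}

/* Logical irq of virtual device interrupt number n. */
static uint32_t vio_lirq(uint32_t n)
{
    assert(n < LIRQ_VIO_COUNT);
    return LIRQ_VIO_BASE + n;
}

/*
 * Logical irq of source n on PHB number phb. No state is consulted:
 * the result depends only on the arguments, never on the order in
 * which devices were created.
 */
static uint32_t phb_lirq(uint32_t phb, uint32_t n)
{
    assert(n < LIRQ_PHB_COUNT);
    return LIRQ_PHB_BASE + phb * LIRQ_PHB_COUNT + n;
}
```

With a map like this nothing is "next available", which is the
distinction David draws between a mapping and an allocator.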



> 
>> But it's fundamentally an allocator that sits in the hypervisor, so in
>> our case, I would say in the spapr "component" of XIVE, rather than the
>> XIVE HW model itself.
> 
> Maybe..
> 
>> Now what Cedric did, because XIVE is very complex and we need something
>> for PAPR quickly, is not a complete HW model, but a somewhat simplified
>> one that only handles what PAPR exposes. So in that case where the
>> allocator sits is a bit of a TBD...
> 
> Hm, ok.  My concern here is that "dynamic" allocation of irqs at the
> machine type level needs extreme caution, or the irqs may not be
> stable which will generally break migration.
> 


-- 
Alexey