From: Oleksandr Andrushchenko <Oleksandr_Andrushchenko@epam.com>
To: Jan Beulich <jbeulich@suse.com>,
	Oleksandr Andrushchenko <andr2000@gmail.com>
Cc: "julien@xen.org" <julien@xen.org>,
	"sstabellini@kernel.org" <sstabellini@kernel.org>,
	Oleksandr Tyshchenko <Oleksandr_Tyshchenko@epam.com>,
	Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>,
	Artem Mygaiev <Artem_Mygaiev@epam.com>,
	"roger.pau@citrix.com" <roger.pau@citrix.com>,
	Bertrand Marquis <bertrand.marquis@arm.com>,
	Rahul Singh <rahul.singh@arm.com>,
	"xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>
Subject: Re: [PATCH 6/9] vpci/header: Handle p2m range sets per BAR
Date: Thu, 9 Sep 2021 09:12:21 +0000
Message-ID: <dfb66ff2-9c9e-645f-4789-2dc6c21ff751@epam.com>
In-Reply-To: <422a6543-ec2e-0793-3db5-09456e04f65b@suse.com>


On 09.09.21 11:24, Jan Beulich wrote:
> On 09.09.2021 07:22, Oleksandr Andrushchenko wrote:
>> On 08.09.21 18:00, Jan Beulich wrote:
>>> On 08.09.2021 16:31, Oleksandr Andrushchenko wrote:
>>>> On 06.09.21 17:47, Jan Beulich wrote:
>>>>> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>
>>>>>> Instead of handling a single range set, that contains all the memory
>>>>>> regions of all the BARs and ROM, have them per BAR.
>>>>> Without looking at how you carry out this change - this looks wrong (as
>>>>> in: wasteful) to me. Despite ...
>>>>>
>>>>>> This is in preparation of making non-identity mappings in p2m for the
>>>>>> MMIOs/ROM.
>>>>> ... the need for this, every individual BAR is still contiguous in both
>>>>> host and guest address spaces, so can be represented as a single
>>>>> (start,end) tuple (or a pair thereof, to account for both host and guest
>>>>> values). No need to use a rangeset for this.
>>>> First of all this change is in preparation for non-identity mappings,
>>> I'm afraid I continue to not see how this matters in the discussion at
>>> hand. I'm fully aware that this is the goal.
>>>
>>>> e.g. currently we collect all the memory ranges which require mappings
>>>> into a single range set, then we cut off MSI-X regions and then use range set
>>>> functionality to call a callback for every memory range left after MSI-X.
>>>> This works perfectly fine for 1:1 mappings, e.g. what we have as the range
>>>> set's starting address is what we want to be mapped/unmapped.
>>>> Why range sets? Because they allow partial mappings, e.g. you can map part of
>>>> the range and return back and continue from where you stopped. And if I
>>>> understand that correctly that was the initial intention of introducing range sets here.
>>>>
>>>> For non-identity mappings this becomes not that easy. Each individual BAR may be
>>>> mapped differently according to what guest OS has programmed as bar->guest_addr
>>>> (guest view of the BAR start).
>>> I don't see how the rangeset helps here. You have a guest and a host pair
>>> of values for every BAR. Pages with e.g. the MSI-X table may not be mapped
>>> to their host counterpart address, yes, but you need to special-case
>>> these anyway: Accesses to them need to be handled. Hence I'm having a hard
>>> time seeing how a per-BAR rangeset (which will cover at most three distinct
>>> ranges afaict, which is way too little for this kind of data organization
>>> imo) can gain you all this much.
>>>
>>> Overall the 6 BARs of a device will cover up to 8 non-adjacent ranges. IOW
>>> the majority (4 or more) of the rangesets will indeed merely represent a
>>> plain (start,end) pair (or be entirely empty).
>> First of all, let me explain why I decided to move to per-BAR
>> range sets.
>> Before this change all the MMIO regions and MSI-X holes were
>> accounted by a single range set, e.g. we go over all BARs and
>> add MMIOs and then subtract MSI-X from there. When it comes to
>> mapping/unmapping we have an assumption that the starting address of
>> each element in the range set is equal to map/unmap address, e.g.
>> we have identity mapping. Please note, that the range set accepts
>> a single private data parameter which is enough to hold all
>> required data about the pdev in common, but there is no way to provide
>> any per-BAR data.
>>
>> Now, that we want non-identity mappings, we can no longer assume
>> that starting address == mapping address and we need to provide
>> additional information on how to map and which is now per-BAR.
>> This is why I decided to use per-BAR range sets.
>>
>> One of the solutions may be that we form an additional list of
>> structures in a form (I omit some of the fields):
>> struct non_identity {
>>       unsigned long start_mfn;
>>       unsigned long start_gfn;
>>       unsigned long size;
>> };
>> So this way when the range set gets processed we go over the list
>> and find out the corresponding list's element which describes the
>> range set entry being processed (s, e, data):
>>
>> static int map_range(unsigned long s, unsigned long e, void *data,
>>                        unsigned long *c)
>> {
>> [snip]
>>       go over the list elements
>>           if ( list->start_mfn == s )
>>               found, can use list->start_gfn for mapping
>> [snip]
>> }
>> This has some complications as map_range may be called multiple times
>> for the same range: if {unmap|map}_mmio_regions was not able to complete
>> the operation it returns the number of pages it was able to process:
>>           rc = map->map ? map_mmio_regions(map->d, start_gfn,
>>                                            size, _mfn(s))
>>                         : unmap_mmio_regions(map->d, start_gfn,
>>                                              size, _mfn(s));
>> In this case we need to update the list item:
>>       list->start_mfn += rc;
>>       list->start_gfn += rc;
>>       list->size -= rc;
>> and if all the pages of the range were processed delete the list entry.
>>
>> With respect to creating the list, everything is also not so complicated:
>> while processing each BAR create a list entry and fill it with mfn, gfn
>> and size. Then, if MSI-X region is present within this BAR, break the
>> list item into multiple ones with respect to the holes, for example:
>>
>> MMIO 0 list item
>> MSI-X hole 0
>> MMIO 1 list item
>> MSI-X hole 1
>>
>> Here instead of a single BAR description we now have 2 list elements
>> describing the BAR without MSI-X regions.
>>
>> All the above still relies on a single range set per pdev as it is in the
>> original code. We can go this route if we agree this is more acceptable
>> than the range sets per BAR.
> I guess I am now even more confused: I can't spot any "rangeset per pdev"
> either. The rangeset I see being used doesn't get associated with anything
> that's device-related; it gets accumulated as a transient data structure,
> but _all_ devices owned by a domain influence its final content.

You are absolutely right here, sorry for the confusion: in the current
code the range set belongs to struct vpci_vcpu, e.g.

/* Per-vcpu structure to store state while {un}mapping of PCI BARs. */

>
> If you associate rangesets with either a device or a BAR, I'm failing to
> see how you'd deal with multiple BARs living in the same page (see also
> below).

This was exactly the issue I ran into while emulating RTL8139 on QEMU:
the MMIOs are 128 bytes long and Linux put them on the same page.
So, it is a known limitation that we can't deal with [1].
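
To make the limitation concrete, here is a minimal sketch (not Xen code) of
how two sub-page BARs end up sharing a frame. The 4 KiB page size, the
SKETCH_PAGE_SHIFT macro and the helper names are assumptions made only for
this illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12

/* Hypothetical helper: frame number of the first page a BAR touches. */
static uint64_t first_pfn(uint64_t addr)
{
    return addr >> SKETCH_PAGE_SHIFT;
}

/* Hypothetical helper: frame number of the last page a BAR touches.
 * The size is expected to be non-zero. */
static uint64_t last_pfn(uint64_t addr, uint64_t size)
{
    return (addr + size - 1) >> SKETCH_PAGE_SHIFT;
}

/* True if the two BARs touch at least one common page. */
static bool bars_share_page(uint64_t addr_a, uint64_t size_a,
                            uint64_t addr_b, uint64_t size_b)
{
    return first_pfn(addr_a) <= last_pfn(addr_b, size_b) &&
           first_pfn(addr_b) <= last_pfn(addr_a, size_a);
}
```

With two 128-byte BARs placed back to back inside one page the check is
true, so a page-granular mapping cannot tell them apart.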

>
> Considering that a rangeset really is a compressed representation of a
> bitmap, I wonder whether this data structure is suitable at all for what
> you want to express. You have two pieces of information to carry / manage,
> after all: Which ranges need mapping, and what their GFN <-> MFN
> relationship is. Maybe the latter needs expressing differently in the
> first place?

I proposed a list which can be extended to hold all the required information
there, e.g. MFN, GFN, size etc.
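
A rough sketch of what I mean, only to show the bookkeeping. It uses a plain
array instead of a real list to stay short, and find_entry() and
advance_entry() are hypothetical names, not existing Xen functions. The
fields are the ones from the struct I sketched earlier:

```c
#include <stddef.h>

/* Hypothetical entry: one contiguous chunk still to be (un)mapped. */
struct non_identity {
    unsigned long start_mfn;
    unsigned long start_gfn;
    unsigned long size;      /* pages left to process, 0 means done */
};

/* Find the pending entry whose start_mfn matches the range start 's'. */
static struct non_identity *find_entry(struct non_identity *tbl, size_t n,
                                       unsigned long s)
{
    for ( size_t i = 0; i < n; i++ )
        if ( tbl[i].size && tbl[i].start_mfn == s )
            return &tbl[i];

    return NULL;
}

/*
 * Record that 'done' pages of the entry were processed, so that a later
 * call for the remainder of the same range still finds the right GFN.
 * Returns the number of pages left in the entry.
 */
static unsigned long advance_entry(struct non_identity *e, unsigned long done)
{
    if ( done > e->size )
        done = e->size;

    e->start_mfn += done;
    e->start_gfn += done;
    e->size -= done;

    return e->size;
}
```

Once advance_entry() returns 0 the entry is finished and, in a real list,
would be removed.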

>   And then in a way that's ensuring by its organization that
> no conflicting GFN <-> MFN mappings will be possible?

If you mean the use-case above with different device MMIOs living
in the same page then my understanding is that such a use-case is
not supported [1]

>   Isn't this
> precisely what is already getting recorded in the P2M?
>
> I'm also curious what your plan is to deal with BARs overlapping in MFN
> space: In such a case, the guest cannot independently change the GFNs of
> any of the involved BARs. (Same the other way around: overlaps in GFN
> space are only permitted when the same overlap exists in MFN space.) Are
> you excluding (forbidding) this case? If so, did I miss you saying so
> somewhere?
Again [1]
>   Yet if no overlaps are allowed in the first place, what
> modify_bars() does would be far more complicated than necessary in the
> DomU case, so it may be worthwhile considering to deviate more from how
> Dom0 gets taken care of. In the end a guest writing a BAR is merely a
> request to change its P2M. That's very different from Dom0 writing a BAR,
> which means the physical BAR also changes, and hence the P2M changes in
> quite different a way.

So, what is the difference then, besides that hwdom really writes to a BAR?
To me most of the logic remains the same: we need to map/unmap.
The only difference I see here is that for Dom0 we have 1:1 at the moment
and for guest we need GFN <-> MFN.
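
In other words, the only part of the mapping step that differs is how the
target GFN is picked. A minimal sketch of that, where struct bar_map and
mfn_to_gfn() are hypothetical names used only for this illustration and all
values are frame numbers:

```c
/* Hypothetical per-BAR description. */
struct bar_map {
    unsigned long host_mfn;   /* first host frame of the BAR */
    unsigned long guest_gfn;  /* first guest frame, as programmed by the guest */
    unsigned long nr_pages;   /* BAR size in pages */
};

/*
 * Pick the GFN to use for a given MFN inside the BAR:
 * identity (1:1) for the hardware domain, offset into the guest view of
 * the BAR otherwise. The MFN is expected to lie within the BAR.
 */
static unsigned long mfn_to_gfn(const struct bar_map *bar, unsigned long mfn,
                                int is_hwdom)
{
    if ( is_hwdom )
        return mfn;

    return bar->guest_gfn + (mfn - bar->host_mfn);
}
```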


Anyways, I am open to any decision on what would be the right approach here:

1. Use range sets per BAR as in the patch
2. Remove range sets completely and have a per-vCPU list with mapping
   data as I described above
3. Anything else?
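
For option 2, breaking a BAR around an MSI-X hole could look roughly like the
sketch below. This is an illustration under assumptions, not a proposal for
the final code: struct span and carve_hole() are hypothetical, and ranges are
inclusive page ranges:

```c
#include <stddef.h>

/* Hypothetical inclusive page range [start, end]. */
struct span {
    unsigned long start;
    unsigned long end;
};

/*
 * Subtract one hole from a span. The pieces that remain (0, 1 or 2 of
 * them) are written to out[] and their count is returned.
 */
static size_t carve_hole(struct span s, struct span hole, struct span out[2])
{
    size_t n = 0;

    /* No overlap: the span survives unchanged. */
    if ( hole.end < s.start || hole.start > s.end )
    {
        out[n++] = s;
        return n;
    }

    /* Part of the span below the hole. */
    if ( hole.start > s.start )
        out[n++] = (struct span){ s.start, hole.start - 1 };

    /* Part of the span above the hole. */
    if ( hole.end < s.end )
        out[n++] = (struct span){ hole.end + 1, s.end };

    return n;
}
```

Each remaining piece would then become its own list item with the matching
MFN, GFN and size.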

>
> Jan

Thank you,
Oleksandr

[1] https://wiki.xenproject.org/wiki/Xen_PCI_Passthrough#I_get_.22non-page-aligned_MMIO_BAR.22_error_when_trying_to_start_the_guest

