From: Alexey Kardashevskiy <aik@ozlabs.ru>
To: gowrishankar muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
Cc: Jonas Pfefferle1 <JPF@zurich.ibm.com>,
	Gowrishankar Muthukrishnan <gowrishankar.m@in.ibm.com>,
	Adrian Schuepbach <DRI@zurich.ibm.com>,
	"dev@dpdk.org" <dev@dpdk.org>
Subject: Re: [PATCH dpdk 5/5] RFC: vfio/ppc64/spapr: Use correct bus addresses for DMA map
Date: Fri, 21 Apr 2017 18:43:53 +1000
Message-ID: <2354a035-1c97-b9b4-9a15-d62a26d6d160@ozlabs.ru>
In-Reply-To: <4977b4e8-e63a-0621-2375-89066d8de10a@ozlabs.ru>

On 21/04/17 13:42, Alexey Kardashevskiy wrote:
> On 21/04/17 05:16, gowrishankar muthukrishnan wrote:
>> On Thursday 20 April 2017 07:52 PM, Alexey Kardashevskiy wrote:
>>> On 20/04/17 23:25, Alexey Kardashevskiy wrote:
>>>> On 20/04/17 19:04, Jonas Pfefferle1 wrote:
>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote on 20/04/2017 09:24:02:
>>>>>
>>>>>> From: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>> To: dev@dpdk.org
>>>>>> Cc: Alexey Kardashevskiy <aik@ozlabs.ru>, JPF@zurich.ibm.com,
>>>>>> Gowrishankar Muthukrishnan <gowrishankar.m@in.ibm.com>
>>>>>> Date: 20/04/2017 09:24
>>>>>> Subject: [PATCH dpdk 5/5] RFC: vfio/ppc64/spapr: Use correct bus
>>>>>> addresses for DMA map
>>>>>>
>>>>>> VFIO_IOMMU_SPAPR_TCE_CREATE ioctl() returns the actual bus address for
>>>>>> just created DMA window. It happens to start from zero because the
>>>>>> default
>>>>>> window is removed (leaving no windows) and new window starts from zero.
>>>>>> However this is not guaranteed and the new window may start from another
>>>>>> address, this adds an error check.
>>>>>>
>>>>>> Another issue is that IOVA passed to VFIO_IOMMU_MAP_DMA should be a PCI
>>>>>> bus address while in this case a physical address of a user page is used.
>>>>>> This changes IOVA to start from zero in a hope that the rest of DPDK
>>>>>> expects this.
>>>>> This is not the case. DPDK expects a 1:1 mapping PA==IOVA. It will use the
>>>>> phys_addr of the memory segment it got from /proc/self/pagemap cf.
>>>>> librte_eal/linuxapp/eal/eal_memory.c. We could try setting it here to the
>>>>> actual iova which basically makes the whole virtual to physical mapping
>>>>> with pagemap unnecessary which I believe should be the case for VFIO
>>>>> anyway. Pagemap should only be needed when using pci_uio.
>>>>
>>>> Ah, ok, makes sense now. But it sure needs a big fat comment there as it is
>>>> not obvious why host RAM address is used there as DMA window start is not
>>>> guaranteed.
>>> Well, either way there is some bug - ms[i].phys_addr and ms[i].addr_64 both
>>> have exact same value, in my setup it is 3fffb33c0000 which is a userspace
>>> address - at least ms[i].phys_addr must be physical address.
>>
>> This patch breaks i40e_dev_init() in my server.
>>
>> EAL: PCI device 0004:01:00.0 on NUMA socket 1
>> EAL:   probe driver: 8086:1583 net_i40e
>> EAL:   using IOMMU type 7 (sPAPR)
>> eth_i40e_dev_init(): Failed to init adminq: -32
>> EAL: Releasing pci mapped resource for 0004:01:00.0
>> EAL: Calling pci_unmap_resource for 0004:01:00.0 at 0x3fff82aa0000
>> EAL: Requested device 0004:01:00.0 cannot be used
>> EAL: PCI device 0004:01:00.1 on NUMA socket 1
>> EAL:   probe driver: 8086:1583 net_i40e
>> EAL:   using IOMMU type 7 (sPAPR)
>> eth_i40e_dev_init(): Failed to init adminq: -32
>> EAL: Releasing pci mapped resource for 0004:01:00.1
>> EAL: Calling pci_unmap_resource for 0004:01:00.1 at 0x3fff82aa0000
>> EAL: Requested device 0004:01:00.1 cannot be used
>> EAL: No probed ethernet devices
>>
>> I have two memseg each of 1G size. Their mapped PA and VA are also different.
>>
>> (gdb) p /x ms[0]
>> $3 = {phys_addr = 0x1e0b000000, {addr = 0x3effaf000000, addr_64 =
>> 0x3effaf000000},
>>   len = 0x40000000, hugepage_sz = 0x1000000, socket_id = 0x1, nchannel =
>> 0x0, nrank = 0x0}
>> (gdb) p /x ms[1]
>> $4 = {phys_addr = 0xf6d000000, {addr = 0x3efbaf000000, addr_64 =
>> 0x3efbaf000000},
>>   len = 0x40000000, hugepage_sz = 0x1000000, socket_id = 0x0, nchannel =
>> 0x0, nrank = 0x0}
>>
>> Could you please recheck this? Maybe dma_map.iova should be reset to this
>> offset only if the new DMA window does not start from bus address 0?
> 
> As we figured out, it is the effect of --no-huge.
> 
> Another thing - as I read the code - the window size comes from
> rte_eal_get_physmem_size(). On my 512GB machine, DPDK creates only a 16GB
> window, so it is far from the 1:1 mapping which DPDK is believed to
> expect. Looking now for a better version of rte_eal_get_physmem_size()...


I have not found any helper to get the total RAM size or to round up to a
power of two. I could walk the memory segments, find the one with the
highest ending physical address, round that up to a power of two (a
requirement on the POWER8 platform for a DMA window size) and use it as the
DMA window size. Is there an analog of the kernel's order_base_2()?
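
To make the idea concrete, here is a rough, untested-on-hardware sketch of the
calculation. It is not part of the patch. struct memseg is a hypothetical
stand-in for the two struct rte_memseg fields involved, and all three function
names are made up for illustration; none of them is an existing DPDK or kernel
helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two struct rte_memseg fields used below. */
struct memseg {
	uint64_t phys_addr;	/* physical start address of the segment */
	uint64_t len;		/* segment length in bytes */
};

/*
 * Round v up to the next power of two (v itself if it already is one).
 * Returns 1 for v <= 1 and wraps to 0 for v > 2^63.
 */
static uint64_t roundup_pow2_u64(uint64_t v)
{
	if (v <= 1)
		return 1;
	v--;
	/* Smear the highest set bit into every lower bit position. */
	v |= v >> 1;
	v |= v >> 2;
	v |= v >> 4;
	v |= v >> 8;
	v |= v >> 16;
	v |= v >> 32;
	return v + 1;
}

/* Userspace analog of order_base_2(): ceil(log2(v)), 0 for v <= 1. */
static unsigned int order_base_2_u64(uint64_t v)
{
	unsigned int order = 0;

	while (order < 64 && (UINT64_C(1) << order) < v)
		order++;
	return order;
}

/*
 * Find the highest ending physical address among n segments and round it
 * up to a power of two, so that a window starting at bus address 0 can
 * hold a 1:1 PA == IOVA mapping of every segment.
 */
static uint64_t dma_window_size(const struct memseg *ms, size_t n)
{
	uint64_t max_end = 0;
	size_t i;

	assert(ms != NULL || n == 0);
	for (i = 0; i < n; i++) {
		uint64_t end = ms[i].phys_addr + ms[i].len;

		if (end > max_end)
			max_end = end;
	}
	return roundup_pow2_u64(max_end);
}
```

With the two segments from the gdb output quoted above (highest end address
0x1e4b000000), this gives a window of 0x2000000000 bytes, i.e. order 37. It
assumes the window starts at bus address 0, which the kernel does not
guarantee.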


> 
> 
> And another problem - after a few unsuccessful starts of app/testpmd, all
> huge pages are gone:
> 
> aik@stratton2:~$ cat /proc/meminfo
> MemTotal:       535527296 kB
> MemFree:        516662272 kB
> MemAvailable:   515501696 kB
> ...
> HugePages_Total:    1024
> HugePages_Free:        0
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> Hugepagesize:      16384 kB
> 
> 
> How is that possible? What is pinning these pages so that the exit of the
> testpmd process does not free them?

Still not clear, any ideas what might be causing this?



btw what is the correct way of running DPDK with hugepages?

I basically create a folder ~aik/hugepages and do
sudo mount -t hugetlbfs hugetlbfs ~aik/hugepages
sudo sysctl vm.nr_hugepages=4096

This creates a bunch of pages:
aik@stratton2:~$ cat /proc/meminfo | grep HugePage
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
HugePages_Total:    4096
HugePages_Free:     4096
HugePages_Rsvd:        0
HugePages_Surp:        0


And then I watch testpmd detect the hugepages (it does see 4096 16MB
pages) and allocate them:
rte_eal_hugepage_init() calls map_all_hugepages(... orig=1), where all 4096
pages are allocated, and then calls map_all_hugepages(... orig=0), where
I get lots of "EAL: Cannot get a virtual area: Cannot allocate memory" for
an obvious reason: all pages are already allocated. Since you folks have
tested this somehow, what am I doing wrong? :) This is all very confusing.
What is that orig=0/1 business all about?




> 
> 
> 
> 
>>
>>
>> Thanks,
>> Gowrishankar
>>
>>>
>>>>
>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>> ---
>>>>>>   lib/librte_eal/linuxapp/eal/eal_vfio.c | 12 ++++++++++--
>>>>>>   1 file changed, 10 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c
>>>>>> index 46f951f4d..8b8e75c4f 100644
>>>>>> --- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
>>>>>> +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
>>>>>> @@ -658,7 +658,7 @@ vfio_spapr_dma_map(int vfio_container_fd)
>>>>>>   {
>>>>>>      const struct rte_memseg *ms = rte_eal_get_physmem_layout();
>>>>>>      int i, ret;
>>>>>> -
>>>>>> +   phys_addr_t io_offset;
>>>>>>      struct vfio_iommu_spapr_register_memory reg = {
>>>>>>         .argsz = sizeof(reg),
>>>>>>         .flags = 0
>>>>>> @@ -702,6 +702,13 @@ vfio_spapr_dma_map(int vfio_container_fd)
>>>>>>         return -1;
>>>>>>      }
>>>>>> +   io_offset = create.start_addr;
>>>>>> +   if (io_offset) {
>>>>>> +      RTE_LOG(ERR, EAL, "  DMA offsets other than zero is not supported, "
>>>>>> +            "new window is created at %lx\n", io_offset);
>>>>>> +      return -1;
>>>>>> +   }
>>>>>> +
>>>>>>      /* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */
>>>>>>      for (i = 0; i < RTE_MAX_MEMSEG; i++) {
>>>>>>         struct vfio_iommu_type1_dma_map dma_map;
>>>>>> @@ -723,7 +730,7 @@ vfio_spapr_dma_map(int vfio_container_fd)
>>>>>>         dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
>>>>>>         dma_map.vaddr = ms[i].addr_64;
>>>>>>         dma_map.size = ms[i].len;
>>>>>> -      dma_map.iova = ms[i].phys_addr;
>>>>>> +      dma_map.iova = io_offset;
>>>>>>         dma_map.flags = VFIO_DMA_MAP_FLAG_READ |
>>>>>>                VFIO_DMA_MAP_FLAG_WRITE;
>>>>>> @@ -735,6 +742,7 @@ vfio_spapr_dma_map(int vfio_container_fd)
>>>>>>            return -1;
>>>>>>         }
>>>>>> +      io_offset += dma_map.size;
>>>>>>      }
>>>>>>        return 0;
>>>>>> -- 
>>>>>> 2.11.0
>>>>>>
>>>>
>>>
>>
>>
> 
> 


-- 
Alexey
