* Running DPDK as an unprivileged user
@ 2016-12-29 20:41 Walker, Benjamin
  2016-12-30  1:14 ` Stephen Hemminger
  2017-01-04 11:39 ` Tan, Jianfeng
  0 siblings, 2 replies; 18+ messages in thread
From: Walker, Benjamin @ 2016-12-29 20:41 UTC (permalink / raw)
  To: dev

Hi all,

I've been digging in to what it would take to run DPDK as an
unprivileged user and I have some findings that I thought
were worthy of discussion. The assumptions here are that I'm
using a very recent Linux kernel (4.8.15 to be specific) and
I'm using vfio with my IOMMU enabled. I'm only interested in
making it possible to run as an unprivileged user in this
type of environment.

There are a few key things that DPDK needs to do in order to
run as an unprivileged user:

1) Allocate hugepages
2) Map device resources
3) Map hugepage virtual addresses to DMA addresses.

For #1 and #2, DPDK works just fine today. You simply chown
the relevant resources in sysfs to the desired user and
everything is happy. 

The problem is #3. This currently relies on looking up the
mappings in /proc/self/pagemap, but the ability to get
physical addresses in /proc/self/pagemap as an unprivileged
user was removed from the kernel in the 4.x timeframe due to
the Rowhammer vulnerability. At this time, it is not
possible to run DPDK as an unprivileged user on a 4.x Linux
kernel.

There is a way to make this work though, which I'll outline
now. Unfortunately, I think it is going to require some very
significant changes to the initialization flow in the EAL.
One bit of background before I go into how to fix this -
there are three types of memory addresses - virtual
addresses, physical addresses, and DMA addresses. Sometimes
DMA addresses are called bus addresses or I/O addresses, but
I'll call them DMA addresses because I think that's the
clearest name. In a system without an IOMMU, DMA addresses
and physical addresses are equivalent, but in a system with
an IOMMU any arbitrary DMA address can be chosen by the user
to map to a given physical address. For security reasons
(rowhammer), it is no longer considered safe to expose
physical addresses to userspace, but it is perfectly fine to
expose DMA addresses when an IOMMU is present.

DPDK today begins by allocating all of the required
hugepages, then finds all of the physical addresses for
those hugepages using /proc/self/pagemap, sorts the
hugepages by physical address, then remaps the pages to
contiguous virtual addresses. Later on, if vfio is
enabled, it asks vfio to pin the hugepages and to set their
DMA addresses in the IOMMU to be the physical addresses
discovered earlier. Of course, running as an unprivileged
user means all of the physical addresses in
/proc/self/pagemap are just 0, so this doesn't end up
working. Further, there is no real reason to choose the
physical address as the DMA address in the IOMMU - it would
be better to just count up starting at 0. Also, because the
pages are pinned after the virtual to physical mapping is
looked up, there is a window where a page could be moved.
Hugepage mappings can be moved on more recent kernels (at
least 4.x), and the reliability of hugepages having static
mappings decreases with every kernel release. Note that this
probably means that using uio on recent kernels is subtly
broken and cannot be supported going forward because there
is no uio mechanism to pin the memory.

The first open question I have is whether DPDK should allow
uio at all on recent (4.x) kernels. My current understanding
is that there is no way to pin memory and hugepages can now
be moved around, so uio would be unsafe. What does the
community think here?

My second question is whether the user should be allowed to
mix uio and vfio usage simultaneously. For vfio, the
physical addresses are really DMA addresses and are best
when arbitrarily chosen to appear sequential relative to
their virtual addresses. For uio, they are physical
addresses and are not chosen at all. It seems that these two
things are in conflict and that it will be difficult, ugly,
and maybe impossible to resolve the simultaneous use of
both.

Once we agree on the above two things, we can try to talk
through some solutions in the code.

Thanks,
Ben


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Running DPDK as an unprivileged user
  2016-12-29 20:41 Running DPDK as an unprivileged user Walker, Benjamin
@ 2016-12-30  1:14 ` Stephen Hemminger
  2017-01-02 14:32   ` Thomas Monjalon
  2017-01-04 11:39 ` Tan, Jianfeng
  1 sibling, 1 reply; 18+ messages in thread
From: Stephen Hemminger @ 2016-12-30  1:14 UTC (permalink / raw)
  To: Walker, Benjamin; +Cc: dev

On Thu, 29 Dec 2016 20:41:21 +0000
"Walker, Benjamin" <benjamin.walker@intel.com> wrote:

> The first open question I have is whether DPDK should allow
> uio at all on recent (4.x) kernels. My current understanding
> is that there is no way to pin memory and hugepages can now
> be moved around, so uio would be unsafe. What does the
> community think here?

DMA access without an IOMMU (i.e. UIO) is not safe from a security
point of view. A malicious app could program a device (like an Ethernet
NIC) to change its current privilege level in kernel memory.
Therefore ignore UIO as an option if you want to allow unprivileged
access.

But there are many, many systems without a working IOMMU. Not just broken
motherboards, but virtualization environments (Xen, Hyper-V, and KVM until
very recently) where the IOMMU is not going to work. In these environments,
DPDK is still useful where the security risks are known.

If the kernel broke pinning of hugepages, then it is an upstream kernel bug.

> 
> My second question is whether the user should be allowed to
> mix uio and vfio usage simultaneously. For vfio, the
> physical addresses are really DMA addresses and are best
> when arbitrarily chosen to appear sequential relative to
> their virtual addresses. For uio, they are physical
> addresses and are not chosen at all. It seems that these two
> things are in conflict and that it will be difficult, ugly,
> and maybe impossible to resolve the simultaneous use of
> both.

Unless the application is running as a privileged user (i.e. root), UIO
is not going to work. Therefore don't worry about a mixed environment.


* Re: Running DPDK as an unprivileged user
  2016-12-30  1:14 ` Stephen Hemminger
@ 2017-01-02 14:32   ` Thomas Monjalon
  2017-01-02 19:47     ` Stephen Hemminger
  0 siblings, 1 reply; 18+ messages in thread
From: Thomas Monjalon @ 2017-01-02 14:32 UTC (permalink / raw)
  To: Walker, Benjamin; +Cc: dev, Stephen Hemminger

2016-12-29 17:14, Stephen Hemminger:
> On Thu, 29 Dec 2016 20:41:21 +0000
> "Walker, Benjamin" <benjamin.walker@intel.com> wrote:
> > My second question is whether the user should be allowed to
> > mix uio and vfio usage simultaneously. For vfio, the
> > physical addresses are really DMA addresses and are best
> > when arbitrarily chosen to appear sequential relative to
> > their virtual addresses. For uio, they are physical
> > addresses and are not chosen at all. It seems that these two
> > things are in conflict and that it will be difficult, ugly,
> > and maybe impossible to resolve the simultaneous use of
> > both.
> 
> Unless application is running as privileged user (ie root), UIO
> is not going to work. Therefore don't worry about mixed environment.

Yes, mixing UIO and VFIO is possible only as root.
However, what is the benefit of mixing them?


* Re: Running DPDK as an unprivileged user
  2017-01-02 14:32   ` Thomas Monjalon
@ 2017-01-02 19:47     ` Stephen Hemminger
  2017-01-03 22:50       ` Walker, Benjamin
  0 siblings, 1 reply; 18+ messages in thread
From: Stephen Hemminger @ 2017-01-02 19:47 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: Walker, Benjamin, dev

On Mon, 02 Jan 2017 15:32:08 +0100
Thomas Monjalon <thomas.monjalon@6wind.com> wrote:

> 2016-12-29 17:14, Stephen Hemminger:
> > On Thu, 29 Dec 2016 20:41:21 +0000
> > "Walker, Benjamin" <benjamin.walker@intel.com> wrote:  
> > > My second question is whether the user should be allowed to
> > > mix uio and vfio usage simultaneously. For vfio, the
> > > physical addresses are really DMA addresses and are best
> > > when arbitrarily chosen to appear sequential relative to
> > > their virtual addresses. For uio, they are physical
> > > addresses and are not chosen at all. It seems that these two
> > > things are in conflict and that it will be difficult, ugly,
> > > and maybe impossible to resolve the simultaneous use of
> > > both.  
> > 
> > Unless application is running as privileged user (ie root), UIO
> > is not going to work. Therefore don't worry about mixed environment.  
> 
> Yes, mixing UIO and VFIO is possible only as root.
> However, what is the benefit of mixing them?

One possible case where this could be used, Hyper-V/Azure and SR-IOV.
The VF interface will show up on an isolated PCI bus and the virtual NIC
is on VMBUS. It is possible to use VFIO on the PCI to get MSI-X per queue
interrupts, but there is no support for VFIO on VMBUS.


* Re: Running DPDK as an unprivileged user
  2017-01-02 19:47     ` Stephen Hemminger
@ 2017-01-03 22:50       ` Walker, Benjamin
  2017-01-04 10:11         ` Thomas Monjalon
  0 siblings, 1 reply; 18+ messages in thread
From: Walker, Benjamin @ 2017-01-03 22:50 UTC (permalink / raw)
  To: stephen, thomas.monjalon; +Cc: dev

On Thu, 2016-12-29 at 17:14 -0800, Stephen Hemminger wrote:
> If kernel broke pinning of hugepages, then it is an upstream kernel bug.

The kernel, under a myriad of circumstances, will change the mapping of virtual
to physical addresses for hugepages. This behavior began somewhere around kernel
3.16 and with each release more cases where the mapping can change are
introduced. DPDK should not be relying on that mapping staying static, and
instead should be using vfio to explicitly pin the pages. I've consulted the
relevant kernel developers who write the code in this area and they are
universally in agreement that this is not a kernel bug and the mappings will get
less static over time.

On Mon, 2017-01-02 at 11:47 -0800, Stephen Hemminger wrote:
> On Mon, 02 Jan 2017 15:32:08 +0100
> Thomas Monjalon <thomas.monjalon@6wind.com> wrote:
> 
> > 2016-12-29 17:14, Stephen Hemminger:
> > > On Thu, 29 Dec 2016 20:41:21 +0000
> > > "Walker, Benjamin" <benjamin.walker@intel.com> wrote:  
> > > > My second question is whether the user should be allowed to
> > > > mix uio and vfio usage simultaneously. For vfio, the
> > > > physical addresses are really DMA addresses and are best
> > > > when arbitrarily chosen to appear sequential relative to
> > > > their virtual addresses. For uio, they are physical
> > > > addresses and are not chosen at all. It seems that these two
> > > > things are in conflict and that it will be difficult, ugly,
> > > > and maybe impossible to resolve the simultaneous use of
> > > > both.  
> > > 
> > > Unless application is running as privileged user (ie root), UIO
> > > is not going to work. Therefore don't worry about mixed environment.  
> > 
> > Yes, mixing UIO and VFIO is possible only as root.
> > However, what is the benefit of mixing them?
> 
> One possible case where this could be used, Hyper-V/Azure and SR-IOV.
> The VF interface will show up on an isolated PCI bus and the virtual NIC
> is on VMBUS. It is possible to use VFIO on the PCI to get MSI-X per queue
> interrupts, but there is no support for VFIO on VMBUS.

I sent out a patch a little while ago that makes DPDK work when running as an
unprivileged user with an IOMMU. I allow mixing of uio/vfio when root (I choose
the DMA address to be the physical address), but only vfio when unprivileged (I
choose the DMA addresses to start at 0).

Unfortunately, there are a few more wrinkles for systems that do not have an
IOMMU. These systems still need to explicitly pin memory, but they need to use
physical addresses instead of DMA addresses. There are three concerns with this:

1) Physical addresses cannot be exposed to unprivileged users due to security
concerns (the fallout of rowhammer). Therefore, systems without an IOMMU can
only support privileged users. I think this is probably fine.
2) The IOCTL from vfio to pin the memory is tied to specifying the DMA address
and programming the IOMMU. This is unfortunate - systems without an IOMMU still
want to do the pinning, but they need to be given the physical address instead
of specifying a DMA address.
3) Not all device types, particularly in virtualization environments, support
vfio today. These devices have no way to explicitly pin memory.

I think this is going to take a kernel patch or two to resolve, unless someone
has a good idea.


* Re: Running DPDK as an unprivileged user
  2017-01-03 22:50       ` Walker, Benjamin
@ 2017-01-04 10:11         ` Thomas Monjalon
  2017-01-04 21:35           ` Walker, Benjamin
  0 siblings, 1 reply; 18+ messages in thread
From: Thomas Monjalon @ 2017-01-04 10:11 UTC (permalink / raw)
  To: Walker, Benjamin; +Cc: stephen, dev

2017-01-03 22:50, Walker, Benjamin:
> 1) Physical addresses cannot be exposed to unprivileged users due to security
> concerns (the fallout of rowhammer). Therefore, systems without an IOMMU can
> only support privileged users. I think this is probably fine.
> 2) The IOCTL from vfio to pin the memory is tied to specifying the DMA address
> and programming the IOMMU. This is unfortunate - systems without an IOMMU still
> want to do the pinning, but they need to be given the physical address instead
> of specifying a DMA address.
> 3) Not all device types, particularly in virtualization environments, support
> vfio today. These devices have no way to explicitly pin memory.

In VM we can use VFIO-noiommu. Is it helping for mapping?


* Re: Running DPDK as an unprivileged user
  2016-12-29 20:41 Running DPDK as an unprivileged user Walker, Benjamin
  2016-12-30  1:14 ` Stephen Hemminger
@ 2017-01-04 11:39 ` Tan, Jianfeng
  2017-01-04 21:34   ` Walker, Benjamin
  1 sibling, 1 reply; 18+ messages in thread
From: Tan, Jianfeng @ 2017-01-04 11:39 UTC (permalink / raw)
  To: Walker, Benjamin, dev

Hi Benjamin,


On 12/30/2016 4:41 AM, Walker, Benjamin wrote:
> Hi all,
>
> I've been digging in to what it would take to run DPDK as an
> unprivileged user and I have some findings that I thought
> were worthy of discussion. The assumptions here are that I'm
> using a very recent Linux kernel (4.8.15 to be specific) and
> I'm using vfio with my IOMMU enabled. I'm only interested in
> making it possible to run as an unprivileged user in this
> type of environment.
>
> There are a few key things that DPDK needs to do in order to
> run as an unprivileged user:
>
> 1) Allocate hugepages
> 2) Map device resources
> 3) Map hugepage virtual addresses to DMA addresses.
>
> For #1 and #2, DPDK works just fine today. You simply chown
> the relevant resources in sysfs to the desired user and
> everything is happy.
>
> The problem is #3. This currently relies on looking up the
> mappings in /proc/self/pagemap, but the ability to get
> physical addresses in /proc/self/pagemap as an unprivileged
> user was removed from the kernel in the 4.x timeframe due to
> the Rowhammer vulnerability. At this time, it is not
> possible to run DPDK as an unprivileged user on a 4.x Linux
> kernel.
>
> There is a way to make this work though, which I'll outline
> now. Unfortunately, I think it is going to require some very
> significant changes to the initialization flow in the EAL.
> One bit of background before I go into how to fix this -
> there are three types of memory addresses - virtual
> addresses, physical addresses, and DMA addresses. Sometimes
> DMA addresses are called bus addresses or I/O addresses, but
> I'll call them DMA addresses because I think that's the
> clearest name. In a system without an IOMMU, DMA addresses
> and physical addresses are equivalent, but in a system with
> an IOMMU any arbitrary DMA address can be chosen by the user
> to map to a given physical address. For security reasons
> (rowhammer), it is no longer considered safe to expose
> physical addresses to userspace, but it is perfectly fine to
> expose DMA addresses when an IOMMU is present.
>
> DPDK today begins by allocating all of the required
> hugepages, then finds all of the physical addresses for
> those hugepages using /proc/self/pagemap, sorts the
> hugepages by physical address, then remaps the pages to
> contiguous virtual addresses. Later on and if vfio is
> enabled, it asks vfio to pin the hugepages and to set their
> DMA addresses in the IOMMU to be the physical addresses
> discovered earlier. Of course, running as an unprivileged
> user means all of the physical addresses in
> /proc/self/pagemap are just 0, so this doesn't end up
> working. Further, there is no real reason to choose the
> physical address as the DMA address in the IOMMU - it would
> be better to just count up starting at 0.

Why not just use the virtual address as the DMA address in this case, to
avoid maintaining another kind of address?

>   Also, because the
> pages are pinned after the virtual to physical mapping is
> looked up, there is a window where a page could be moved.
> Hugepage mappings can be moved on more recent kernels (at
> least 4.x), and the reliability of hugepages having static
> mappings decreases with every kernel release.

Do you mean the kernel might take back a physical page after mapping it to a 
virtual page (maybe copying the data to another physical page)? Could you 
please share some links or kernel commits?

> Note that this
> probably means that using uio on recent kernels is subtly
> broken and cannot be supported going forward because there
> is no uio mechanism to pin the memory.
>
> The first open question I have is whether DPDK should allow
> uio at all on recent (4.x) kernels. My current understanding
> is that there is no way to pin memory and hugepages can now
> be moved around, so uio would be unsafe. What does the
> community think here?
>
> My second question is whether the user should be allowed to
> mix uio and vfio usage simultaneously. For vfio, the
> physical addresses are really DMA addresses and are best
> when arbitrarily chosen to appear sequential relative to
> their virtual addresses.

Why "sequential relative to their virtual addresses"? The IOMMU table is for 
the DMA addr -> physical addr mapping, so don't we need DMA addresses 
"sequential relative to their physical addresses"? Based on your above 
analysis of how hugepages are initialized, wouldn't virtual addresses be a 
good candidate for DMA addresses?

Thanks,
Jianfeng


* Re: Running DPDK as an unprivileged user
  2017-01-04 11:39 ` Tan, Jianfeng
@ 2017-01-04 21:34   ` Walker, Benjamin
  2017-01-05 10:09     ` Sergio Gonzalez Monroy
  2017-01-05 15:52     ` Tan, Jianfeng
  0 siblings, 2 replies; 18+ messages in thread
From: Walker, Benjamin @ 2017-01-04 21:34 UTC (permalink / raw)
  To: Tan, Jianfeng, dev

On Wed, 2017-01-04 at 19:39 +0800, Tan, Jianfeng wrote:
> Hi Benjamin,
> 
> 
> On 12/30/2016 4:41 AM, Walker, Benjamin wrote:
> > DPDK today begins by allocating all of the required
> > hugepages, then finds all of the physical addresses for
> > those hugepages using /proc/self/pagemap, sorts the
> > hugepages by physical address, then remaps the pages to
> > contiguous virtual addresses. Later on and if vfio is
> > enabled, it asks vfio to pin the hugepages and to set their
> > DMA addresses in the IOMMU to be the physical addresses
> > discovered earlier. Of course, running as an unprivileged
> > user means all of the physical addresses in
> > /proc/self/pagemap are just 0, so this doesn't end up
> > working. Further, there is no real reason to choose the
> > physical address as the DMA address in the IOMMU - it would
> > be better to just count up starting at 0.
> 
> Why not just using virtual address as the DMA address in this case to 
> avoid maintaining another kind of addresses?

That's a valid choice, although I'm just storing the DMA address in the
physical address field that already exists. You either have a physical
address or a DMA address and never both.

> 
> >   Also, because the
> > pages are pinned after the virtual to physical mapping is
> > looked up, there is a window where a page could be moved.
> > Hugepage mappings can be moved on more recent kernels (at
> > least 4.x), and the reliability of hugepages having static
> > mappings decreases with every kernel release.
> 
> Do you mean kernel might take back a physical page after mapping it to a 
> virtual page (maybe copy the data to another physical page)? Could you 
> please show some links or kernel commits?

Yes - the kernel can move a physical page to another physical page
and change the virtual mapping at any time. For a concise example
see 'man migrate_pages(2)', or for a more serious example the code
that performs memory page compaction in the kernel which was
recently extended to support hugepages.

Before we go down the path of me proving that the mapping isn't static,
let me turn that line of thinking around. Do you have any documentation
demonstrating that the mapping is static? It's not static for 4k pages, so
why are we assuming that it is static for 2MB pages? I understand that
it happened to be static for some versions of the kernel, but my understanding
is that this was purely by coincidence and never by intention.

> 
> > Note that this
> > probably means that using uio on recent kernels is subtly
> > broken and cannot be supported going forward because there
> > is no uio mechanism to pin the memory.
> > 
> > The first open question I have is whether DPDK should allow
> > uio at all on recent (4.x) kernels. My current understanding
> > is that there is no way to pin memory and hugepages can now
> > be moved around, so uio would be unsafe. What does the
> > community think here?
> > 
> > My second question is whether the user should be allowed to
> > mix uio and vfio usage simultaneously. For vfio, the
> > physical addresses are really DMA addresses and are best
> > when arbitrarily chosen to appear sequential relative to
> > their virtual addresses.
> 
> Why "sequential relative to their virtual addresses"? IOMMU table is for 
> DMA addr -> physical addr mapping. So we need to DMA addresses 
> "sequential relative to their physical addresses"? Based on your above 
> analysis on how hugepages are initialized, virtual addresses is a good 
> candidate for DMA address?

The code already goes through a separate organizational step on all of
the pages that remaps the virtual addresses such that they're sequential
relative to the physical backing pages, so this mostly ends up as the same
thing.
Choosing to use the virtual address is a totally valid choice, but I worry it
may lead to confusion during debugging or in a multi-process scenario.
I'm open to making this choice instead of starting from zero, though.

> 
> Thanks,
> Jianfeng


* Re: Running DPDK as an unprivileged user
  2017-01-04 10:11         ` Thomas Monjalon
@ 2017-01-04 21:35           ` Walker, Benjamin
  0 siblings, 0 replies; 18+ messages in thread
From: Walker, Benjamin @ 2017-01-04 21:35 UTC (permalink / raw)
  To: thomas.monjalon; +Cc: stephen, dev

On Wed, 2017-01-04 at 11:11 +0100, Thomas Monjalon wrote:
> 2017-01-03 22:50, Walker, Benjamin:
> > 1) Physical addresses cannot be exposed to unprivileged users due to
> > security
> > concerns (the fallout of rowhammer). Therefore, systems without an IOMMU can
> > only support privileged users. I think this is probably fine.
> > 2) The IOCTL from vfio to pin the memory is tied to specifying the DMA
> > address
> > and programming the IOMMU. This is unfortunate - systems without an IOMMU
> > still
> > want to do the pinning, but they need to be given the physical address
> > instead
> > of specifying a DMA address.
> > 3) Not all device types, particularly in virtualization environments,
> > support
> > vfio today. These devices have no way to explicitly pin memory.
> 
> In VM we can use VFIO-noiommu. Is it helping for mapping?

There does not appear to be a vfio IOCTL that pins memory without also
programming the IOMMU, so vfio-noiommu is broken in the same way that uio is for
drivers that require physical memory.


* Re: Running DPDK as an unprivileged user
  2017-01-04 21:34   ` Walker, Benjamin
@ 2017-01-05 10:09     ` Sergio Gonzalez Monroy
  2017-01-05 10:16       ` Sergio Gonzalez Monroy
  2017-01-05 15:52     ` Tan, Jianfeng
  1 sibling, 1 reply; 18+ messages in thread
From: Sergio Gonzalez Monroy @ 2017-01-05 10:09 UTC (permalink / raw)
  To: Walker, Benjamin, Tan, Jianfeng, dev

On 04/01/2017 21:34, Walker, Benjamin wrote:
> On Wed, 2017-01-04 at 19:39 +0800, Tan, Jianfeng wrote:
>> Hi Benjamin,
>>
>>
>> On 12/30/2016 4:41 AM, Walker, Benjamin wrote:
>>> DPDK today begins by allocating all of the required
>>> hugepages, then finds all of the physical addresses for
>>> those hugepages using /proc/self/pagemap, sorts the
>>> hugepages by physical address, then remaps the pages to
>>> contiguous virtual addresses. Later on and if vfio is
>>> enabled, it asks vfio to pin the hugepages and to set their
>>> DMA addresses in the IOMMU to be the physical addresses
>>> discovered earlier. Of course, running as an unprivileged
>>> user means all of the physical addresses in
>>> /proc/self/pagemap are just 0, so this doesn't end up
>>> working. Further, there is no real reason to choose the
>>> physical address as the DMA address in the IOMMU - it would
>>> be better to just count up starting at 0.
>> Why not just using virtual address as the DMA address in this case to
>> avoid maintaining another kind of addresses?
> That's a valid choice, although I'm just storing the DMA address in the
> physical address field that already exists. You either have a physical
> address or a DMA address and never both.
>
>>>    Also, because the
>>> pages are pinned after the virtual to physical mapping is
>>> looked up, there is a window where a page could be moved.
>>> Hugepage mappings can be moved on more recent kernels (at
>>> least 4.x), and the reliability of hugepages having static
>>> mappings decreases with every kernel release.
>> Do you mean kernel might take back a physical page after mapping it to a
>> virtual page (maybe copy the data to another physical page)? Could you
>> please show some links or kernel commits?
> Yes - the kernel can move a physical page to another physical page
> and change the virtual mapping at any time. For a concise example
> see 'man migrate_pages(2)', or for a more serious example the code
> that performs memory page compaction in the kernel which was
> recently extended to support hugepages.
>
> Before we go down the path of me proving that the mapping isn't static,
> let me turn that line of thinking around. Do you have any documentation
> demonstrating that the mapping is static? It's not static for 4k pages, so
> why are we assuming that it is static for 2MB pages? I understand that
> it happened to be static for some versions of the kernel, but my understanding
> is that this was purely by coincidence and never by intention.

It looks to me as if you are talking about transparent hugepages, and 
not hugetlbfs-managed hugepages (the DPDK use case).
AFAIK memory (hugepages) managed by hugetlbfs is not compacted and/or 
moved; they are not part of the kernel memory management.

So again, do you have some references to code/articles where this 
"dynamic" behavior of hugepages managed by hugetlbfs is mentioned?

Sergio

>>> Note that this
>>> probably means that using uio on recent kernels is subtly
>>> broken and cannot be supported going forward because there
>>> is no uio mechanism to pin the memory.
>>>
>>> The first open question I have is whether DPDK should allow
>>> uio at all on recent (4.x) kernels. My current understanding
>>> is that there is no way to pin memory and hugepages can now
>>> be moved around, so uio would be unsafe. What does the
>>> community think here?
>>>
>>> My second question is whether the user should be allowed to
>>> mix uio and vfio usage simultaneously. For vfio, the
>>> physical addresses are really DMA addresses and are best
>>> when arbitrarily chosen to appear sequential relative to
>>> their virtual addresses.
>> Why "sequential relative to their virtual addresses"? IOMMU table is for
>> DMA addr -> physical addr mapping. So we need to DMA addresses
>> "sequential relative to their physical addresses"? Based on your above
>> analysis on how hugepages are initialized, virtual addresses is a good
>> candidate for DMA address?
> The code already goes through a separate organizational step on all of
> the pages that remaps the virtual addresses such that they're sequential
> relative to the physical backing pages, so this mostly ends up as the same
> thing.
> Choosing to use the virtual address is a totally valid choice, but I worry it
> may lead to confusion during debugging or in a multi-process scenario.
> I'm open to making this choice instead of starting from zero, though.
>
>> Thanks,
>> Jianfeng


* Re: Running DPDK as an unprivileged user
  2017-01-05 10:09     ` Sergio Gonzalez Monroy
@ 2017-01-05 10:16       ` Sergio Gonzalez Monroy
  2017-01-05 14:58         ` Tan, Jianfeng
  0 siblings, 1 reply; 18+ messages in thread
From: Sergio Gonzalez Monroy @ 2017-01-05 10:16 UTC (permalink / raw)
  To: Walker, Benjamin, Tan, Jianfeng, dev

On 05/01/2017 10:09, Sergio Gonzalez Monroy wrote:
> On 04/01/2017 21:34, Walker, Benjamin wrote:
>> On Wed, 2017-01-04 at 19:39 +0800, Tan, Jianfeng wrote:
>>> Hi Benjamin,
>>>
>>>
>>> On 12/30/2016 4:41 AM, Walker, Benjamin wrote:
>>>> DPDK today begins by allocating all of the required
>>>> hugepages, then finds all of the physical addresses for
>>>> those hugepages using /proc/self/pagemap, sorts the
>>>> hugepages by physical address, then remaps the pages to
>>>> contiguous virtual addresses. Later on and if vfio is
>>>> enabled, it asks vfio to pin the hugepages and to set their
>>>> DMA addresses in the IOMMU to be the physical addresses
>>>> discovered earlier. Of course, running as an unprivileged
>>>> user means all of the physical addresses in
>>>> /proc/self/pagemap are just 0, so this doesn't end up
>>>> working. Further, there is no real reason to choose the
>>>> physical address as the DMA address in the IOMMU - it would
>>>> be better to just count up starting at 0.
>>> Why not just using virtual address as the DMA address in this case to
>>> avoid maintaining another kind of addresses?
>> That's a valid choice, although I'm just storing the DMA address in the
>> physical address field that already exists. You either have a physical
>> address or a DMA address and never both.
>>
>>>>    Also, because the
>>>> pages are pinned after the virtual to physical mapping is
>>>> looked up, there is a window where a page could be moved.
>>>> Hugepage mappings can be moved on more recent kernels (at
>>>> least 4.x), and the reliability of hugepages having static
>>>> mappings decreases with every kernel release.
>>> Do you mean kernel might take back a physical page after mapping it 
>>> to a
>>> virtual page (maybe copy the data to another physical page)? Could you
>>> please show some links or kernel commits?
>> Yes - the kernel can move a physical page to another physical page
>> and change the virtual mapping at any time. For a concise example
>> see 'man migrate_pages(2)', or for a more serious example the code
>> that performs memory page compaction in the kernel which was
>> recently extended to support hugepages.
>>
>> Before we go down the path of me proving that the mapping isn't static,
>> let me turn that line of thinking around. Do you have any documentation
>> demonstrating that the mapping is static? It's not static for 4k 
>> pages, so
>> why are we assuming that it is static for 2MB pages? I understand that
>> it happened to be static for some versions of the kernel, but my 
>> understanding
>> is that this was purely by coincidence and never by intention.
>
> It looks to me as if you are talking about Transparent hugepages, and 
> not hugetlbfs managed hugepages (DPDK usecase).
> AFAIK memory (hugepages) managed by hugetlbfs is not compacted and/or 
> moved, they are not part of the kernel memory management.
>

Please forgive my loose/poor choice of words when saying that "they 
are not part of the kernel memory management"; I meant that they are 
not part of the kernel memory-management processes you were 
mentioning, i.e. compacting, moving, etc.

Sergio

> So again, do you have some references to code/articles where this 
> "dynamic" behavior of hugepages managed by hugetlbfs is mentioned?
>
> Sergio
>
>>>> Note that this
>>>> probably means that using uio on recent kernels is subtly
>>>> broken and cannot be supported going forward because there
>>>> is no uio mechanism to pin the memory.
>>>>
>>>> The first open question I have is whether DPDK should allow
>>>> uio at all on recent (4.x) kernels. My current understanding
>>>> is that there is no way to pin memory and hugepages can now
>>>> be moved around, so uio would be unsafe. What does the
>>>> community think here?
>>>>
>>>> My second question is whether the user should be allowed to
>>>> mix uio and vfio usage simultaneously. For vfio, the
>>>> physical addresses are really DMA addresses and are best
>>>> when arbitrarily chosen to appear sequential relative to
>>>> their virtual addresses.
>>> Why "sequential relative to their virtual addresses"? IOMMU table is 
>>> for
>>> DMA addr -> physical addr mapping. So we need to DMA addresses
>>> "sequential relative to their physical addresses"? Based on your above
>>> analysis on how hugepages are initialized, virtual addresses is a good
>>> candidate for DMA address?
>> The code already goes through a separate organizational step on all of
>> the pages that remaps the virtual addresses such that they're sequential
>> relative to the physical backing pages, so this mostly ends up as the 
>> same
>> thing.
>> Choosing to use the virtual address is a totally valid choice, but I 
>> worry it
>> may lead to confusion during debugging or in a multi-process scenario.
>> I'm open to making this choice instead of starting from zero, though.
>>
>>> Thanks,
>>> Jianfeng
>
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Running DPDK as an unprivileged user
  2017-01-05 10:16       ` Sergio Gonzalez Monroy
@ 2017-01-05 14:58         ` Tan, Jianfeng
  0 siblings, 0 replies; 18+ messages in thread
From: Tan, Jianfeng @ 2017-01-05 14:58 UTC (permalink / raw)
  To: Sergio Gonzalez Monroy, Walker, Benjamin, dev

Hi,


On 1/5/2017 6:16 PM, Sergio Gonzalez Monroy wrote:
> On 05/01/2017 10:09, Sergio Gonzalez Monroy wrote:
>> On 04/01/2017 21:34, Walker, Benjamin wrote:
>>> On Wed, 2017-01-04 at 19:39 +0800, Tan, Jianfeng wrote:
>>>> Hi Benjamin,
>>>>
>>>>
>>>> On 12/30/2016 4:41 AM, Walker, Benjamin wrote:
>>>>> DPDK today begins by allocating all of the required
>>>>> hugepages, then finds all of the physical addresses for
>>>>> those hugepages using /proc/self/pagemap, sorts the
>>>>> hugepages by physical address, then remaps the pages to
>>>>> contiguous virtual addresses. Later on and if vfio is
>>>>> enabled, it asks vfio to pin the hugepages and to set their
>>>>> DMA addresses in the IOMMU to be the physical addresses
>>>>> discovered earlier. Of course, running as an unprivileged
>>>>> user means all of the physical addresses in
>>>>> /proc/self/pagemap are just 0, so this doesn't end up
>>>>> working. Further, there is no real reason to choose the
>>>>> physical address as the DMA address in the IOMMU - it would
>>>>> be better to just count up starting at 0.
>>>> Why not just using virtual address as the DMA address in this case to
>>>> avoid maintaining another kind of addresses?
>>> That's a valid choice, although I'm just storing the DMA address in the
>>> physical address field that already exists. You either have a physical
>>> address or a DMA address and never both.
>>>
>>>>>    Also, because the
>>>>> pages are pinned after the virtual to physical mapping is
>>>>> looked up, there is a window where a page could be moved.
>>>>> Hugepage mappings can be moved on more recent kernels (at
>>>>> least 4.x), and the reliability of hugepages having static
>>>>> mappings decreases with every kernel release.
>>>> Do you mean kernel might take back a physical page after mapping it 
>>>> to a
>>>> virtual page (maybe copy the data to another physical page)? Could you
>>>> please show some links or kernel commits?
>>> Yes - the kernel can move a physical page to another physical page
>>> and change the virtual mapping at any time. For a concise example
>>> see 'man migrate_pages(2)', or for a more serious example the code
>>> that performs memory page compaction in the kernel which was
>>> recently extended to support hugepages.
>>>
>>> Before we go down the path of me proving that the mapping isn't static,
>>> let me turn that line of thinking around. Do you have any documentation
>>> demonstrating that the mapping is static? It's not static for 4k 
>>> pages, so
>>> why are we assuming that it is static for 2MB pages? I understand that
>>> it happened to be static for some versions of the kernel, but my 
>>> understanding
>>> is that this was purely by coincidence and never by intention.
>>
>> It looks to me as if you are talking about Transparent hugepages, and 
>> not hugetlbfs managed hugepages (DPDK usecase).
>> AFAIK memory (hugepages) managed by hugetlbfs is not compacted and/or 
>> moved, they are not part of the kernel memory management.
>>
>
> Please forgive my loose/poor use of words here when saying that "they 
> are not part of the kernel memory management", I mean to say that
> they are not part of the kernel memory management process you were 
> mentioning, ie. compacting, moving, etc.
>
> Sergio
>
>> So again, do you have some references to code/articles where this 
>> "dynamic" behavior of hugepages managed by hugetlbfs is mentioned?
>>
>> Sergio

According to the information Benjamin provided, I did some homework and 
found this macro in the kernel config, CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION, 
and further the function hugepage_migration_supported().

It seems there are at least three ways to make this behavior happen 
(based on Linux 4.8.1):

a) Through the migrate_pages() syscall;
b) through the move_pages() syscall;
c) since some kernel version, there is a kthread named kcompactd for 
each NUMA node, performing memory compaction.

Thanks,
Jianfeng

>>
>>>>> Note that this
>>>>> probably means that using uio on recent kernels is subtly
>>>>> broken and cannot be supported going forward because there
>>>>> is no uio mechanism to pin the memory.
>>>>>
>>>>> The first open question I have is whether DPDK should allow
>>>>> uio at all on recent (4.x) kernels. My current understanding
>>>>> is that there is no way to pin memory and hugepages can now
>>>>> be moved around, so uio would be unsafe. What does the
>>>>> community think here?
>>>>>
>>>>> My second question is whether the user should be allowed to
>>>>> mix uio and vfio usage simultaneously. For vfio, the
>>>>> physical addresses are really DMA addresses and are best
>>>>> when arbitrarily chosen to appear sequential relative to
>>>>> their virtual addresses.
>>>> Why "sequential relative to their virtual addresses"? IOMMU table 
>>>> is for
>>>> DMA addr -> physical addr mapping. So we need to DMA addresses
>>>> "sequential relative to their physical addresses"? Based on your above
>>>> analysis on how hugepages are initialized, virtual addresses is a good
>>>> candidate for DMA address?
>>> The code already goes through a separate organizational step on all of
>>> the pages that remaps the virtual addresses such that they're 
>>> sequential
>>> relative to the physical backing pages, so this mostly ends up as 
>>> the same
>>> thing.
>>> Choosing to use the virtual address is a totally valid choice, but I 
>>> worry it
>>> may lead to confusion during debugging or in a multi-process scenario.
>>> I'm open to making this choice instead of starting from zero, though.
>>>
>>>> Thanks,
>>>> Jianfeng
>>
>>
>


* Re: Running DPDK as an unprivileged user
  2017-01-04 21:34   ` Walker, Benjamin
  2017-01-05 10:09     ` Sergio Gonzalez Monroy
@ 2017-01-05 15:52     ` Tan, Jianfeng
  2017-11-05  0:17       ` Thomas Monjalon
  1 sibling, 1 reply; 18+ messages in thread
From: Tan, Jianfeng @ 2017-01-05 15:52 UTC (permalink / raw)
  To: Walker, Benjamin, dev

Hi Benjamin,


On 1/5/2017 5:34 AM, Walker, Benjamin wrote:
> On Wed, 2017-01-04 at 19:39 +0800, Tan, Jianfeng wrote:
>> Hi Benjamin,
>>
>>
>> On 12/30/2016 4:41 AM, Walker, Benjamin wrote:
>>> DPDK today begins by allocating all of the required
>>> hugepages, then finds all of the physical addresses for
>>> those hugepages using /proc/self/pagemap, sorts the
>>> hugepages by physical address, then remaps the pages to
>>> contiguous virtual addresses. Later on and if vfio is
>>> enabled, it asks vfio to pin the hugepages and to set their
>>> DMA addresses in the IOMMU to be the physical addresses
>>> discovered earlier. Of course, running as an unprivileged
>>> user means all of the physical addresses in
>>> /proc/self/pagemap are just 0, so this doesn't end up
>>> working. Further, there is no real reason to choose the
>>> physical address as the DMA address in the IOMMU - it would
>>> be better to just count up starting at 0.
>> Why not just using virtual address as the DMA address in this case to
>> avoid maintaining another kind of addresses?
> That's a valid choice, although I'm just storing the DMA address in the
> physical address field that already exists. You either have a physical
> address or a DMA address and never both.

Yes, I understand; that's why you raised the second question below.

>
>>>    Also, because the
>>> pages are pinned after the virtual to physical mapping is
>>> looked up, there is a window where a page could be moved.
>>> Hugepage mappings can be moved on more recent kernels (at
>>> least 4.x), and the reliability of hugepages having static
>>> mappings decreases with every kernel release.
>> Do you mean kernel might take back a physical page after mapping it to a
>> virtual page (maybe copy the data to another physical page)? Could you
>> please show some links or kernel commits?
> Yes - the kernel can move a physical page to another physical page
> and change the virtual mapping at any time. For a concise example
> see 'man migrate_pages(2)', or for a more serious example the code
> that performs memory page compaction in the kernel which was
> recently extended to support hugepages.
>
> Before we go down the path of me proving that the mapping isn't static,
> let me turn that line of thinking around. Do you have any documentation
> demonstrating that the mapping is static? It's not static for 4k pages, so
> why are we assuming that it is static for 2MB pages? I understand that
> it happened to be static for some versions of the kernel, but my understanding
> is that this was purely by coincidence and never by intention.

Thank you for the information. Based on what you provided above, I 
realize this behavior has been possible for a long time.

>
>>> Note that this
>>> probably means that using uio on recent kernels is subtly
>>> broken and cannot be supported going forward because there
>>> is no uio mechanism to pin the memory.
>>>
>>> The first open question I have is whether DPDK should allow
>>> uio at all on recent (4.x) kernels. My current understanding
>>> is that there is no way to pin memory and hugepages can now
>>> be moved around, so uio would be unsafe. What does the
>>> community think here?

Back to this question: removing uio support from DPDK seems a little 
overkill to me. Can we just document it instead? For example, first warn 
users not to invoke migrate_pages() or move_pages() on a DPDK process; as 
for the kcompactd daemon and other cases (e.g. compaction triggered by 
alloc_pages()), could we just recommend disabling CONFIG_COMPACTION?

On another note, how does vfio pin that memory? Through memlock (judging 
from the code in vfio_pin_pages())? If so, why not just mlock those hugepages?

>>>
>>> My second question is whether the user should be allowed to
>>> mix uio and vfio usage simultaneously. For vfio, the
>>> physical addresses are really DMA addresses and are best
>>> when arbitrarily chosen to appear sequential relative to
>>> their virtual addresses.
>> Why "sequential relative to their virtual addresses"? IOMMU table is for
>> DMA addr -> physical addr mapping. So we need to DMA addresses
>> "sequential relative to their physical addresses"? Based on your above
>> analysis on how hugepages are initialized, virtual addresses is a good
>> candidate for DMA address?
> The code already goes through a separate organizational step on all of
> the pages that remaps the virtual addresses such that they're sequential
> relative to the physical backing pages, so this mostly ends up as the same
> thing.

Agreed.

> Choosing to use the virtual address is a totally valid choice, but I worry it
> may lead to confusion during debugging or in a multi-process scenario.

Make sense.

Thanks,
Jianfeng


* Re: Running DPDK as an unprivileged user
  2017-01-05 15:52     ` Tan, Jianfeng
@ 2017-11-05  0:17       ` Thomas Monjalon
  2017-11-27 17:58         ` Walker, Benjamin
  0 siblings, 1 reply; 18+ messages in thread
From: Thomas Monjalon @ 2017-11-05  0:17 UTC (permalink / raw)
  To: Tan, Jianfeng, Walker, Benjamin, sergio.gonzalez.monroy, anatoly.burakov
  Cc: dev

Hi, restarting an old topic,

05/01/2017 16:52, Tan, Jianfeng:
> On 1/5/2017 5:34 AM, Walker, Benjamin wrote:
> >>> Note that this
> >>> probably means that using uio on recent kernels is subtly
> >>> broken and cannot be supported going forward because there
> >>> is no uio mechanism to pin the memory.
> >>>
> >>> The first open question I have is whether DPDK should allow
> >>> uio at all on recent (4.x) kernels. My current understanding
> >>> is that there is no way to pin memory and hugepages can now
> >>> be moved around, so uio would be unsafe. What does the
> >>> community think here?
> 
> Back to this question, removing uio support in DPDK seems a little 
> overkill to me. Can we just document it down? Like, firstly warn users 
> do not invoke migrate_pages() or move_pages() to a DPDK process; as for 
> the kcompactd daemon and some more cases (like compaction could be 
> triggered by alloc_pages()), could we just recommend to disable 
> CONFIG_COMPACTION?

We really need to better document the limitations of UIO.
Could we have some suggestions here?

> Another side, how does vfio pin those memory? Through memlock (from code 
> in vfio_pin_pages())? So why not just mlock those hugepages?

Good question. Why not mlock the hugepages?


* Re: Running DPDK as an unprivileged user
  2017-11-05  0:17       ` Thomas Monjalon
@ 2017-11-27 17:58         ` Walker, Benjamin
  2017-11-28 14:16           ` Alejandro Lucero
  0 siblings, 1 reply; 18+ messages in thread
From: Walker, Benjamin @ 2017-11-27 17:58 UTC (permalink / raw)
  To: thomas, Gonzalez Monroy, Sergio, Burakov, Anatoly, Tan, Jianfeng; +Cc: dev

On Sun, 2017-11-05 at 01:17 +0100, Thomas Monjalon wrote:
> Hi, restarting an old topic,
> 
> 05/01/2017 16:52, Tan, Jianfeng:
> > On 1/5/2017 5:34 AM, Walker, Benjamin wrote:
> > > > > Note that this
> > > > > probably means that using uio on recent kernels is subtly
> > > > > broken and cannot be supported going forward because there
> > > > > is no uio mechanism to pin the memory.
> > > > > 
> > > > > The first open question I have is whether DPDK should allow
> > > > > uio at all on recent (4.x) kernels. My current understanding
> > > > > is that there is no way to pin memory and hugepages can now
> > > > > be moved around, so uio would be unsafe. What does the
> > > > > community think here?
> > 
> > Back to this question, removing uio support in DPDK seems a little 
> > overkill to me. Can we just document it down? Like, firstly warn users 
> > do not invoke migrate_pages() or move_pages() to a DPDK process; as for 
> > the kcompactd daemon and some more cases (like compaction could be 
> > triggered by alloc_pages()), could we just recommend to disable 
> > CONFIG_COMPACTION?
> 
> We really need to better document the limitations of UIO.
> May we have some suggestions here?
> 
> > Another side, how does vfio pin those memory? Through memlock (from code 
> > in vfio_pin_pages())? So why not just mlock those hugepages?
> 
> Good question. Why not mlock the hugepages?

mlock just guarantees that a virtual page is always backed by *some* physical
page of memory. It does not guarantee that over the lifetime of the process a
virtual page is mapped to the *same* physical page. The kernel is free to
transparently move memory around, compress it, dedupe it, etc.

vfio is not pinning the memory, but instead is using the IOMMU (a piece of
hardware) to participate in the memory management on the platform. If a device
begins a DMA transfer to an I/O virtual address, the IOMMU will coordinate with
the main MMU to make sure that the data ends up in the correct location, even as
the virtual to physical mappings are being modified.


* Re: Running DPDK as an unprivileged user
  2017-11-27 17:58         ` Walker, Benjamin
@ 2017-11-28 14:16           ` Alejandro Lucero
  2017-11-28 17:50             ` Walker, Benjamin
  0 siblings, 1 reply; 18+ messages in thread
From: Alejandro Lucero @ 2017-11-28 14:16 UTC (permalink / raw)
  To: Walker, Benjamin
  Cc: thomas, Gonzalez Monroy, Sergio, Burakov, Anatoly, Tan, Jianfeng, dev

On Mon, Nov 27, 2017 at 5:58 PM, Walker, Benjamin <benjamin.walker@intel.com> wrote:

> On Sun, 2017-11-05 at 01:17 +0100, Thomas Monjalon wrote:
> > Hi, restarting an old topic,
> >
> > 05/01/2017 16:52, Tan, Jianfeng:
> > > On 1/5/2017 5:34 AM, Walker, Benjamin wrote:
> > > > > > Note that this
> > > > > > probably means that using uio on recent kernels is subtly
> > > > > > broken and cannot be supported going forward because there
> > > > > > is no uio mechanism to pin the memory.
> > > > > >
> > > > > > The first open question I have is whether DPDK should allow
> > > > > > uio at all on recent (4.x) kernels. My current understanding
> > > > > > is that there is no way to pin memory and hugepages can now
> > > > > > be moved around, so uio would be unsafe. What does the
> > > > > > community think here?
> > >
> > > Back to this question, removing uio support in DPDK seems a little
> > > overkill to me. Can we just document it down? Like, firstly warn users
> > > do not invoke migrate_pages() or move_pages() to a DPDK process; as for
> > > the kcompactd daemon and some more cases (like compaction could be
> > > triggered by alloc_pages()), could we just recommend to disable
> > > CONFIG_COMPACTION?
> >
> > We really need to better document the limitations of UIO.
> > May we have some suggestions here?
> >
> > > Another side, how does vfio pin those memory? Through memlock (from
> code
> > > in vfio_pin_pages())? So why not just mlock those hugepages?
> >
> > Good question. Why not mlock the hugepages?
>
> mlock just guarantees that a virtual page is always backed by *some*
> physical
> page of memory. It does not guarantee that over the lifetime of the
> process a
> virtual page is mapped to the *same* physical page. The kernel is free to
> transparently move memory around, compress it, dedupe it, etc.
>
> vfio is not pinning the memory, but instead is using the IOMMU (a piece of
> hardware) to participate in the memory management on the platform. If a
> device
> begins a DMA transfer to an I/O virtual address, the IOMMU will coordinate
> with
> the main MMU to make sure that the data ends up in the correct location,
> even as
> the virtual to physical mappings are being modified.


This last comment confused me, because you said VFIO did the page pinning in
your first email. I have been looking at the kernel code, and the VFIO driver
does pin the pages, at least for IOMMU type 1.

I can see a problem with adding the same support to UIO, because that implies
a device doing DMA and programmed from user space, which is something the UIO
maintainer is against. But since vfio-noiommu mode was implemented for exactly
this case, I guess it could be added to the VFIO driver. That does not solve
the problem of software not using vfio, though.

Apart from improving the UIO documentation for use with DPDK, maybe some sort
of check could be done, with DPDK requiring an explicit parameter to make the
user aware of the potential risk when UIO is used and kernel page migration is
enabled. I'm not sure whether that last condition can easily be detected from
user space.

On another note, we suffered a similar problem when VMs were using SR-IOV and
memory ballooning. The IOMMU mapping was removed for the ballooned-out memory,
but the kernel inside the VM did not get any event, and the device ended up
performing incorrect DMA operations.


* Re: Running DPDK as an unprivileged user
  2017-11-28 14:16           ` Alejandro Lucero
@ 2017-11-28 17:50             ` Walker, Benjamin
  2017-11-28 19:13               ` Alejandro Lucero
  0 siblings, 1 reply; 18+ messages in thread
From: Walker, Benjamin @ 2017-11-28 17:50 UTC (permalink / raw)
  To: alejandro.lucero
  Cc: thomas, Gonzalez Monroy, Sergio, Burakov, Anatoly, Tan, Jianfeng, dev

On Tue, 2017-11-28 at 14:16 +0000, Alejandro Lucero wrote:
> 
> 
> On Mon, Nov 27, 2017 at 5:58 PM, Walker, Benjamin <benjamin.walker@intel.com>
> wrote:
> > On Sun, 2017-11-05 at 01:17 +0100, Thomas Monjalon wrote:
> > > Hi, restarting an old topic,
> > >
> > > 05/01/2017 16:52, Tan, Jianfeng:
> > > > On 1/5/2017 5:34 AM, Walker, Benjamin wrote:
> > > > > > > Note that this
> > > > > > > probably means that using uio on recent kernels is subtly
> > > > > > > broken and cannot be supported going forward because there
> > > > > > > is no uio mechanism to pin the memory.
> > > > > > >
> > > > > > > The first open question I have is whether DPDK should allow
> > > > > > > uio at all on recent (4.x) kernels. My current understanding
> > > > > > > is that there is no way to pin memory and hugepages can now
> > > > > > > be moved around, so uio would be unsafe. What does the
> > > > > > > community think here?
> > > >
> > > > Back to this question, removing uio support in DPDK seems a little
> > > > overkill to me. Can we just document it down? Like, firstly warn users
> > > > do not invoke migrate_pages() or move_pages() to a DPDK process; as for
> > > > the kcompactd daemon and some more cases (like compaction could be
> > > > triggered by alloc_pages()), could we just recommend to disable
> > > > CONFIG_COMPACTION?
> > >
> > > We really need to better document the limitations of UIO.
> > > May we have some suggestions here?
> > >
> > > > Another side, how does vfio pin those memory? Through memlock (from code
> > > > in vfio_pin_pages())? So why not just mlock those hugepages?
> > >
> > > Good question. Why not mlock the hugepages?
> > 
> > mlock just guarantees that a virtual page is always backed by *some*
> > physical
> > page of memory. It does not guarantee that over the lifetime of the process
> > a
> > virtual page is mapped to the *same* physical page. The kernel is free to
> > transparently move memory around, compress it, dedupe it, etc.
> > 
> > vfio is not pinning the memory, but instead is using the IOMMU (a piece of
> > hardware) to participate in the memory management on the platform. If a
> > device
> > begins a DMA transfer to an I/O virtual address, the IOMMU will coordinate
> > with
> > the main MMU to make sure that the data ends up in the correct location,
> > even as
> > the virtual to physical mappings are being modified.
> 
> This last comment confused me because you said VFIO did the page pinning in
> your first email.
> I have been looking at the kernel code and the VFIO driver does pin the pages,
> at least the iommu type 1.

The vfio driver does flag the page in a way that prevents some types of
movement, so in that sense it is pinning it. I haven't done an audit to
guarantee that it prevents all types of movement - that would be very difficult.
My point was more that vfio is not strictly relying on pinning to function, but
instead relying on the IOMMU. In my previous email I said "pinning" when I
really meant "programs the IOMMU". Of course, with vfio-noiommu you'd be back to
relying on pinning again, in which case you'd really have to do that full audit
of the kernel memory manager to confirm that the flags vfio is setting prevent
all movement for any reason.

> 
> I can see a problem adding support to UIO for doing the same, because that
> implies there is a device
> doing DMAs and programmed from user space, which is something the UIO
> maintainer is against. But because
> vfio-noiommu mode was implemented just for this, I guess that could be added
> to the VFIO driver. This does not
> solve the problem of software not using vfio though.

vfio-noiommu is intended for devices programmed in user space, but primarily for
devices that don't require physical addresses to perform data transfers (like
RDMA NICs). Those devices don't actually require pinned memory and already
participate in the regular memory management on the platform, so putting them
behind an IOMMU is of no additional value.

> 
> Apart from improving the UIO documentation when used with DPDK, maybe some
> sort of check could be done
> and DPDK requiring a explicit parameter for making the user aware of the
> potential risk when UIO is used and the
> kernel page migration is enabled. Not sure if this last thing could be easily
> known from user space.

The challenge is that there are so many reasons for a page to move, and more are
added all the time. It would be really hard to correctly prevent the user from
using uio in every case. Further, if the user is using uio inside of a virtual
machine that happens to be deployed using the IOMMU on the host system, most of
the reasons for a page to move (besides explicit requests to move pages) are
alleviated and it is more or less safe. But the user would have no idea from
within the guest that they're actually protected. I think this case - using uio
from within a guest VM that is protected by the IOMMU - is common.

> 
> On another side, we suffered a similar problem when VMs were using SRIOV and
> memory balloning. The IOMMU was
> removing the mapping for the memory removed, but the kernel inside the VM did
> not get any event and the device
> ended up doing some wrong DMA operation.


* Re: Running DPDK as an unprivileged user
  2017-11-28 17:50             ` Walker, Benjamin
@ 2017-11-28 19:13               ` Alejandro Lucero
  0 siblings, 0 replies; 18+ messages in thread
From: Alejandro Lucero @ 2017-11-28 19:13 UTC (permalink / raw)
  To: Walker, Benjamin
  Cc: thomas, Gonzalez Monroy, Sergio, Burakov, Anatoly, Tan, Jianfeng, dev

On Tue, Nov 28, 2017 at 5:50 PM, Walker, Benjamin <benjamin.walker@intel.com> wrote:

> On Tue, 2017-11-28 at 14:16 +0000, Alejandro Lucero wrote:
> >
> >
> > On Mon, Nov 27, 2017 at 5:58 PM, Walker, Benjamin <
> benjamin.walker@intel.com>
> > wrote:
> > > On Sun, 2017-11-05 at 01:17 +0100, Thomas Monjalon wrote:
> > > > Hi, restarting an old topic,
> > > >
> > > > 05/01/2017 16:52, Tan, Jianfeng:
> > > > > On 1/5/2017 5:34 AM, Walker, Benjamin wrote:
> > > > > > > > Note that this
> > > > > > > > probably means that using uio on recent kernels is subtly
> > > > > > > > broken and cannot be supported going forward because there
> > > > > > > > is no uio mechanism to pin the memory.
> > > > > > > >
> > > > > > > > The first open question I have is whether DPDK should allow
> > > > > > > > uio at all on recent (4.x) kernels. My current understanding
> > > > > > > > is that there is no way to pin memory and hugepages can now
> > > > > > > > be moved around, so uio would be unsafe. What does the
> > > > > > > > community think here?
> > > > >
> > > > > Back to this question, removing uio support in DPDK seems a little
> > > > > overkill to me. Can we just document it down? Like, firstly warn
> users
> > > > > do not invoke migrate_pages() or move_pages() to a DPDK process;
> as for
> > > > > the kcompactd daemon and some more cases (like compaction could be
> > > > > triggered by alloc_pages()), could we just recommend to disable
> > > > > CONFIG_COMPACTION?
> > > >
> > > > We really need to better document the limitations of UIO.
> > > > May we have some suggestions here?
> > > >
> > > > > Another side, how does vfio pin those memory? Through memlock
> (from code
> > > > > in vfio_pin_pages())? So why not just mlock those hugepages?
> > > >
> > > > Good question. Why not mlock the hugepages?
> > >
> > > mlock just guarantees that a virtual page is always backed by *some*
> > > physical
> > > page of memory. It does not guarantee that over the lifetime of the
> process
> > > a
> > > virtual page is mapped to the *same* physical page. The kernel is free
> to
> > > transparently move memory around, compress it, dedupe it, etc.
> > >
> > > vfio is not pinning the memory, but instead is using the IOMMU (a
> piece of
> > > hardware) to participate in the memory management on the platform. If a
> > > device
> > > begins a DMA transfer to an I/O virtual address, the IOMMU will
> coordinate
> > > with
> > > the main MMU to make sure that the data ends up in the correct
> location,
> > > even as
> > > the virtual to physical mappings are being modified.
> >
> > This last comment confused me because you said VFIO did the page pinning
> in
> > your first email.
> > I have been looking at the kernel code and the VFIO driver does pin the
> pages,
> > at least the iommu type 1.
>
> The vfio driver does flag the page in a way that prevents some types of
> movement, so in that sense it is pinning it. I haven't done an audit to
> guarantee that it prevents all types of movement - that would be very
> difficult.
> My point was more that vfio is not strictly relying on pinning to
> function, but
> instead relying on the IOMMU. In my previous email I said "pinning" when I
> really meant "programs the IOMMU". Of course, with vfio-noiommu you'd be
> back to
> relying on pinning again, in which case you'd really have to do that full
> audit
> of the kernel memory manager to confirm that the flags vfio is setting
> prevent
> all movement for any reason.
>
>
If you are saying the kernel's page-migration code knows how to reprogram
the IOMMU, I think that is unlikely. What the VFIO code does is set a flag
on the involved pages marking them as "writable", which tells the memory
manager it is not safe to migrate them. If the mm code were to reprogram
the IOMMU, it would need to know not just the process whose page table is
being modified, but also which device that process has assigned, because
IOMMU mappings are tied to devices, not to processes. So I'm not 100%
sure, but I don't think the kernel does that.
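To make the pinning discussion above concrete: the type 1 VFIO mapping
under debate is established from userspace with a single VFIO_IOMMU_MAP_DMA
ioctl on the container fd, at which point the kernel pins the backing pages
and programs the IOMMU. Below is a minimal Python sketch of that userspace
side. The ioctl number, flags, and struct layout come from linux/vfio.h;
the anonymous test page and the iova value 0 are hypothetical, and on a
machine without a configured VFIO container the call simply fails and is
reported:

```python
import ctypes
import fcntl
import mmap
import os
import struct

# Constants from linux/vfio.h (type 1 IOMMU, x86-64 ioctl encoding).
VFIO_IOMMU_MAP_DMA = 0x3b71        # _IO(';', VFIO_BASE(100) + 13)
VFIO_DMA_MAP_FLAG_READ = 1 << 0
VFIO_DMA_MAP_FLAG_WRITE = 1 << 1

def build_dma_map(vaddr: int, iova: int, size: int) -> bytes:
    """Pack struct vfio_iommu_type1_dma_map: argsz, flags, vaddr, iova, size."""
    flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE
    buf = struct.pack("=IIQQQ", 32, flags, vaddr, iova, size)
    assert len(buf) == 32  # argsz must match the packed struct size
    return buf

def try_map(iova: int = 0, size: int = 4096) -> str:
    """Attempt to pin one anonymous page and map it at `iova` for DMA."""
    mem = mmap.mmap(-1, size)  # anonymous page standing in for a hugepage
    vaddr = ctypes.addressof(ctypes.c_char.from_buffer(mem))
    req = build_dma_map(vaddr, iova, size)
    try:
        fd = os.open("/dev/vfio/vfio", os.O_RDWR)
        try:
            # Kernel pins the pages and programs the IOMMU here.
            fcntl.ioctl(fd, VFIO_IOMMU_MAP_DMA, req)
            return "mapped"
        finally:
            os.close(fd)
    except OSError as exc:
        return "vfio unavailable or container not configured: %s" % exc

if __name__ == "__main__":
    print(try_map())
```

A real caller would first attach an IOMMU group to the container and
select the type 1 backend with VFIO_SET_IOMMU; without that setup the
map ioctl returns an error, which the sketch reports instead of raising.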



> >
> > I can see a problem adding support to UIO for doing the same, because
> > that implies there is a device doing DMAs and programmed from user
> > space, which is something the UIO maintainer is against. But because
> > vfio-noiommu mode was implemented just for this, I guess that could be
> > added to the VFIO driver. This does not solve the problem of software
> > not using vfio, though.
>
> vfio-noiommu is intended for devices programmed in user space, but
> primarily for
> devices that don't require physical addresses to perform data transfers
> (like
> RDMA NICs). Those devices don't actually require pinned memory and already
> participate in the regular memory management on the platform, so putting
> them
> behind an IOMMU is of no additional value.
>
>
AFAIK, noiommu mode was added to VFIO mainly for DPDK: it solves both the
problem of the unupstreamable igb_uio module and the reluctance to add
more features to uio.ko.



> >
> > Apart from improving the UIO documentation when used with DPDK, maybe
> > some sort of check could be done, with DPDK requiring an explicit
> > parameter to make the user aware of the potential risk when UIO is
> > used and kernel page migration is enabled. Not sure if this last
> > thing could be easily known from user space.
>
> The challenge is that there are so many reasons for a page to move, and
> more are
> added all the time. It would be really hard to correctly prevent the user
> from
> using uio in every case. Further, if the user is using uio inside of a
> virtual
> machine that happens to be deployed using the IOMMU on the host system,
> most of
> the reasons for a page to move (besides explicit requests to move pages)
> are
> alleviated and it is more or less safe. But the user would have no idea
> from
> within the guest that they're actually protected. I think this case -
> using uio
> from within a guest VM that is protected by the IOMMU - is common.
>
>
That is true, but a driver can detect whether the system is virtualized,
in which case the explicit flag would not be needed.
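For what it's worth, the guest-side detection is straightforward on x86:
Linux exposes the CPUID hypervisor bit in the flags line of /proc/cpuinfo,
so a minimal sketch looks like the following (the path parameter exists
only so the check can be exercised against a sample file; real callers
would use the default):

```python
def running_in_guest(cpuinfo_path: str = "/proc/cpuinfo") -> bool:
    """Return True if the x86 'hypervisor' CPUID bit is visible, which
    Linux reports in the flags line of /proc/cpuinfo for VM guests."""
    try:
        with open(cpuinfo_path) as f:
            for line in f:
                if line.startswith("flags"):
                    return "hypervisor" in line.split()
    except OSError:
        pass
    return False  # no readable flags line: assume bare metal / unknown

if __name__ == "__main__":
    print("virtualized" if running_in_guest() else "bare metal or unknown")
```

This is x86-specific; on other architectures a driver would have to use a
different mechanism (e.g. device-tree or hypervisor-specific interfaces),
so an in-kernel check would not rely on /proc parsing anyway.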



> >
> > On another side, we suffered a similar problem when VMs were using SRIOV
> and
> > memory balloning. The IOMMU was
> > removing the mapping for the memory removed, but the kernel inside the
> VM did
> > not get any event and the device
> > ended up doing some wrong DMA operation.
>



Thread overview: 18+ messages
2016-12-29 20:41 Running DPDK as an unprivileged user Walker, Benjamin
2016-12-30  1:14 ` Stephen Hemminger
2017-01-02 14:32   ` Thomas Monjalon
2017-01-02 19:47     ` Stephen Hemminger
2017-01-03 22:50       ` Walker, Benjamin
2017-01-04 10:11         ` Thomas Monjalon
2017-01-04 21:35           ` Walker, Benjamin
2017-01-04 11:39 ` Tan, Jianfeng
2017-01-04 21:34   ` Walker, Benjamin
2017-01-05 10:09     ` Sergio Gonzalez Monroy
2017-01-05 10:16       ` Sergio Gonzalez Monroy
2017-01-05 14:58         ` Tan, Jianfeng
2017-01-05 15:52     ` Tan, Jianfeng
2017-11-05  0:17       ` Thomas Monjalon
2017-11-27 17:58         ` Walker, Benjamin
2017-11-28 14:16           ` Alejandro Lucero
2017-11-28 17:50             ` Walker, Benjamin
2017-11-28 19:13               ` Alejandro Lucero
