From: Zhou Wang <wangzhou1@hisilicon.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: <linux-kernel@vger.kernel.org>,
	<iommu@lists.linux-foundation.org>, <linux-mm@kvack.org>,
	<linux-arm-kernel@lists.infradead.org>,
	<linux-api@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	<gregkh@linuxfoundation.org>, <song.bao.hua@hisilicon.com>,
	<jgg@ziepe.ca>, <kevin.tian@intel.com>,
	<jean-philippe@linaro.org>, <eric.auger@redhat.com>,
	<liguozhu@hisilicon.com>, <zhangfei.gao@linaro.org>,
	Sihang Chen <chensihang1@hisilicon.com>
Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin
Date: Tue, 9 Feb 2021 17:02:30 +0800	[thread overview]
Message-ID: <a8f90c52-d51f-173f-7fd0-fb0792ac58e4@hisilicon.com> (raw)
In-Reply-To: <20210207213409.GL308988@casper.infradead.org>

On 2021/2/8 5:34, Matthew Wilcox wrote:
> On Sun, Feb 07, 2021 at 04:18:03PM +0800, Zhou Wang wrote:
>> SVA (shared virtual address) offers a way for a device to share a process's
>> virtual address space safely, which makes user-space device driver coding
>> more convenient. However, I/O page faults may happen during DMA operations.
>> As the latency of an I/O page fault is relatively high, DMA performance is
>> affected severely when I/O page faults occur. In the long term, DMA
>> performance will not be stable.
>>
>> In high-performance I/O cases, accelerators might want to perform I/O on
>> memory without I/O page faults, which can otherwise result in dramatically
>> increased latency. Current memory-related APIs cannot achieve this:
>> mlock() only prevents memory from being swapped out to a backing device,
>> and page migration can still trigger an I/O page fault.
> 
> Well ... we have two requirements.  The application wants to not take
> page faults.  The system wants to move the application to a different
> NUMA node in order to optimise overall performance.  Why should the
> application's desires take precedence over the kernel's desires?  And why
> should it be done this way rather than by the sysadmin using numactl to
> lock the application to a particular node?

As Barry mentioned, there are other cases that can trigger IOPF;
numactl alone is not enough.

> 
>> +struct mem_pin_container {
>> +	struct xarray array;
>> +	struct mutex lock;
>> +};
> 
> I don't understand what the lock actually protects.

This lock protects pin/unpin and record/remove:
 - pinning pages and recording them
 - unpinning pages and removing them
should be mutually exclusive.

> 
>> +struct pin_pages {
>> +	unsigned long first;
>> +	unsigned long nr_pages;
>> +	struct page **pages;
>> +};
> 
> I don't think you need 'first', and I think you can embed the pages
> array into this struct, removing one allocation.

'first' is recorded so it can be used later to unpin: it serves as the
index for looking up the pinned pages when the unpin operation runs.

> 
>> +	xa_for_each(&priv->array, idx, p) {
>> +		unpin_user_pages(p->pages, p->nr_pages);
>> +		xa_erase(&priv->array, p->first);
>> +		vfree(p->pages);
>> +		kfree(p);
>> +	}
>> +
>> +	mutex_destroy(&priv->lock);
>> +	xa_destroy(&priv->array);
> 
> If you just called xa_erase() on every element of the array, you don't need
> to call xa_destroy().

OK.

> 
>> +	if (!can_do_mlock())
>> +		return -EPERM;
>
> You check for can_do_mlock(), but you don't account the pages to this
> rlimit.

Here I just copied the check from ib_umem_get() and do_mlock(). If needed,
we can add accounting for the pages here.

> 
>> +	first = (addr->addr & PAGE_MASK) >> PAGE_SHIFT;
> 
> You don't need to mask off the bits, the shift will remove them.

OK.

> 
>> +	last = ((addr->addr + addr->size - 1) & PAGE_MASK) >> PAGE_SHIFT;
> 
> DIV_ROUND_UP()?

addr->size is the size of the user buffer being pinned; neither it nor
addr->addr need be page-aligned, so the page count depends on both.
It seems we cannot use DIV_ROUND_UP() on the size alone.

> 
>> +	pages = vmalloc(nr_pages * sizeof(struct page *));
> 
> kvmalloc().  vmalloc() always allocates at least a page, so we want to
> use kmalloc if the size is small.  Also, use array_size() -- I know this
> can't overflow, but let's be clear

Yes, will use kvmalloc and array_size here.

> 
>> +	ret = pin_user_pages_fast(addr->addr & PAGE_MASK, nr_pages,
>> +				  flags | FOLL_LONGTERM, pages);
>> +	if (ret != nr_pages) {
>> +		pr_err("mempinfd: Failed to pin page\n");
> 
> No.  You mustn't allow the user to be able to generate messages to syslog,
> just by passing garbage to a syscall.

OK, will remove this.

> 
>> +	ret = xa_insert(&priv->array, p->first, p, GFP_KERNEL);
>> +	if (ret)
>> +		goto unpin_pages;
> 
> Hmm.  So we can't pin two ranges which start at the same address, but we
> can pin two overlapping ranges.  Is that OK?

The design only supports pinning ranges with distinct start addresses.
Two overlapping ranges with different start addresses are not prevented.
If that does not cause problems, let's keep it simple for now.

Best,
Zhou

> 
> 
> .
> 


