From: "Song Bao Hua (Barry Song)" <song.bao.hua@hisilicon.com>
To: David Hildenbrand <david@redhat.com>,
	Matthew Wilcox <willy@infradead.org>
Cc: "Wangzhou (B)" <wangzhou1@hisilicon.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"iommu@lists.linux-foundation.org"
	<iommu@lists.linux-foundation.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-arm-kernel@lists.infradead.org" 
	<linux-arm-kernel@lists.infradead.org>,
	"linux-api@vger.kernel.org" <linux-api@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	"gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>,
	"jgg@ziepe.ca" <jgg@ziepe.ca>,
	"kevin.tian@intel.com" <kevin.tian@intel.com>,
	"jean-philippe@linaro.org" <jean-philippe@linaro.org>,
	"eric.auger@redhat.com" <eric.auger@redhat.com>,
	"Liguozhu (Kenneth)" <liguozhu@hisilicon.com>,
	"zhangfei.gao@linaro.org" <zhangfei.gao@linaro.org>,
	"chensihang (A)" <chensihang1@hisilicon.com>
Subject: RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin
Date: Mon, 8 Feb 2021 20:52:27 +0000	[thread overview]
Message-ID: <62e7a7cbe6ce4f2e8b220032e25a0aab@hisilicon.com> (raw)
In-Reply-To: <bbe18536-7048-d790-11bf-0b0742a59926@redhat.com>



> -----Original Message-----
> From: David Hildenbrand [mailto:david@redhat.com]
> Sent: Monday, February 8, 2021 11:37 PM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>; Matthew Wilcox
> <willy@infradead.org>
> Cc: Wangzhou (B) <wangzhou1@hisilicon.com>; linux-kernel@vger.kernel.org;
> iommu@lists.linux-foundation.org; linux-mm@kvack.org;
> linux-arm-kernel@lists.infradead.org; linux-api@vger.kernel.org; Andrew
> Morton <akpm@linux-foundation.org>; Alexander Viro <viro@zeniv.linux.org.uk>;
> gregkh@linuxfoundation.org; jgg@ziepe.ca; kevin.tian@intel.com;
> jean-philippe@linaro.org; eric.auger@redhat.com; Liguozhu (Kenneth)
> <liguozhu@hisilicon.com>; zhangfei.gao@linaro.org; chensihang (A)
> <chensihang1@hisilicon.com>
> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> pin
> 
> On 08.02.21 11:13, Song Bao Hua (Barry Song) wrote:
> >
> >
> >> -----Original Message-----
> >> From: owner-linux-mm@kvack.org [mailto:owner-linux-mm@kvack.org] On Behalf
> Of
> >> David Hildenbrand
> >> Sent: Monday, February 8, 2021 9:22 PM
> >> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>; Matthew Wilcox
> >> <willy@infradead.org>
> >> Cc: Wangzhou (B) <wangzhou1@hisilicon.com>; linux-kernel@vger.kernel.org;
> >> iommu@lists.linux-foundation.org; linux-mm@kvack.org;
> >> linux-arm-kernel@lists.infradead.org; linux-api@vger.kernel.org; Andrew
> >> Morton <akpm@linux-foundation.org>; Alexander Viro
> <viro@zeniv.linux.org.uk>;
> >> gregkh@linuxfoundation.org; jgg@ziepe.ca; kevin.tian@intel.com;
> >> jean-philippe@linaro.org; eric.auger@redhat.com; Liguozhu (Kenneth)
> >> <liguozhu@hisilicon.com>; zhangfei.gao@linaro.org; chensihang (A)
> >> <chensihang1@hisilicon.com>
> >> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> >> pin
> >>
> >> On 08.02.21 03:27, Song Bao Hua (Barry Song) wrote:
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: owner-linux-mm@kvack.org [mailto:owner-linux-mm@kvack.org] On
> Behalf
> >> Of
> >>>> Matthew Wilcox
> >>>> Sent: Monday, February 8, 2021 2:31 PM
> >>>> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
> >>>> Cc: Wangzhou (B) <wangzhou1@hisilicon.com>;
> linux-kernel@vger.kernel.org;
> >>>> iommu@lists.linux-foundation.org; linux-mm@kvack.org;
> >>>> linux-arm-kernel@lists.infradead.org; linux-api@vger.kernel.org; Andrew
> >>>> Morton <akpm@linux-foundation.org>; Alexander Viro
> >> <viro@zeniv.linux.org.uk>;
> >>>> gregkh@linuxfoundation.org; jgg@ziepe.ca; kevin.tian@intel.com;
> >>>> jean-philippe@linaro.org; eric.auger@redhat.com; Liguozhu (Kenneth)
> >>>> <liguozhu@hisilicon.com>; zhangfei.gao@linaro.org; chensihang (A)
> >>>> <chensihang1@hisilicon.com>
> >>>> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> >>>> pin
> >>>>
> >>>> On Sun, Feb 07, 2021 at 10:24:28PM +0000, Song Bao Hua (Barry Song) wrote:
> >>>>>>> In high-performance I/O cases, accelerators might want to perform
> >>>>>>> I/O on a memory without IO page faults which can result in dramatically
> >>>>>>> increased latency. Current memory related APIs could not achieve this
> >>>>>>> requirement, e.g. mlock can only avoid memory to swap to backup device,
> >>>>>>> page migration can still trigger IO page fault.
> >>>>>>
> >>>>>> Well ... we have two requirements.  The application wants to not take
> >>>>>> page faults.  The system wants to move the application to a different
> >>>>>> NUMA node in order to optimise overall performance.  Why should the
> >>>>>> application's desires take precedence over the kernel's desires?  And
> why
> >>>>>> should it be done this way rather than by the sysadmin using numactl
> to
> >>>>>> lock the application to a particular node?
> >>>>>
> >>>>> NUMA balancer is just one of many reasons for page migration. Even one
> >>>>> simple alloc_pages() can cause memory migration in just single NUMA
> >>>>> node or UMA system.
> >>>>>
> >>>>> The other reasons for page migration include but are not limited to:
> >>>>> * memory move due to CMA
> >>>>> * memory move due to huge pages creation
> >>>>>
> >>>>> Hardly we can ask users to disable the COMPACTION, CMA and Huge Page
> >>>>> in the whole system.
> >>>>
> >>>> You're dodging the question.  Should the CMA allocation fail because
> >>>> another application is using SVA?
> >>>>
> >>>> I would say no.
> >>>
> >>> I would say no as well.
> >>>
> >>> While IOMMU is enabled, CMA almost has one user only: IOMMU driver
> >>> as other drivers will depend on iommu to use non-contiguous memory
> >>> though they are still calling dma_alloc_coherent().
> >>>
> >>> In iommu driver, dma_alloc_coherent is called during initialization
> >>> and there is no new allocation afterwards. So it wouldn't cause
> >>> runtime impact on SVA performance. Even there is new allocations,
> >>> CMA will fall back to general alloc_pages() and iommu drivers are
> >>> almost allocating small memory for command queues.
> >>>
> >>> So I would say general compound pages, huge pages, especially
> >>> transparent huge pages, would be bigger concerns than CMA for
> >>> internal page migration within one NUMA.
> >>>
> >>> Not like CMA, general alloc_pages() can get memory by moving
> >>> pages other than those pinned.
> >>>
> >>> And there is no guarantee we can always bind the memory of
> >>> SVA applications to single one NUMA, so NUMA balancing is
> >>> still a concern.
> >>>
> >>> But I agree we need a way to make CMA success while the userspace
> >>> pages are pinned. Since pin has been viral in many drivers, I
> >>> assume there is a way to handle this. Otherwise, APIs like
> >>> V4L2_MEMORY_USERPTR[1] will possibly make CMA fail as there
> >>> is no guarantee that usersspace will allocate unmovable memory
> >>> and there is no guarantee the fallback path- alloc_pages() can
> >>> succeed while allocating big memory.
> >>>
> >>
> >> Long term pinnings cannot go onto CMA-reserved memory, and there is
> >> similar work to also fix ZONE_MOVABLE in that regard.
> >>
> >>
> https://lkml.kernel.org/r/20210125194751.1275316-1-pasha.tatashin@soleen.c
> >> om
> >>
> >> One of the reasons I detest using long term pinning of pages where it
> >> could be avoided. Take VFIO and RDMA as an example: these things
> >> currently can't work without them.
> >>
> >> What I read here: "DMA performance will be affected severely". That does
> >> not sound like a compelling argument to me for long term pinnings.
> >> Please find another way to achieve the same goal without long term
> >> pinnings controlled by user space - e.g., controlling when migration
> >> actually happens.
> >>
> >> For example, CMA/alloc_contig_range()/memory unplug are corner cases
> >> that happen rarely, you shouldn't have to worry about them messing with
> >> your DMA performance.
> >
> > I agree CMA/alloc_contig_range()/memory unplug would be corner cases,
> > the major cases would be THP, NUMA balancing while we could totally
> > disable them but it seems insensible to do that only because there is
> > a process using SVA in the system.
> 
> Can't you use huge pages in your application that uses SVA and prevent
> THP/NUMA balancing from kicking in?

Yes. That's exactly what we have done in userspace for applications which
can directly call UADK (the user-space accelerator framework based on
uacce) to use accelerators for zip and encryption:

+-------------------------------------------+
|                                           |
| applications using accelerators           |
+-------------------------------------------+


     alloc from pool             free to pool
           +                       +
           |                       |
           |                       |
           |                       |
           |                       |
           |                       |
           |                       |
           |                       |
+----------+-----------------------+---------+
|                                            |
|                                            |
|      HugeTLB memory pool                   |
|                                            |
|                                            |
+--------------------------------------------+

Those applications can get memory from the hugetlb pool and avoid
IO page faults.
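
For reference, here is a minimal sketch of how such an application might
carve a buffer out of a preallocated HugeTLB pool. This is only an
illustrative example, not UADK's actual API, and it assumes huge pages
have already been reserved via /proc/sys/vm/nr_hugepages:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define POOL_SIZE (16UL * 1024 * 1024)	/* 16MB, i.e. 8 x 2MB huge pages */

int main(void)
{
	/*
	 * Back the pool with huge pages so the accelerator never takes
	 * IO page faults on it; MAP_LOCKED also keeps it resident.
	 */
	void *pool = mmap(NULL, POOL_SIZE, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_LOCKED,
			  -1, 0);
	if (pool == MAP_FAILED) {
		perror("mmap(MAP_HUGETLB)");
		return EXIT_FAILURE;
	}

	/* Touch the memory up front (MAP_LOCKED should already have
	 * populated it, this is belt and braces). */
	memset(pool, 0, POOL_SIZE);

	/* ... hand sub-buffers of 'pool' to the accelerator here ... */

	munmap(pool, POOL_SIZE);
	return 0;
}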

The problem is that not every application can do this. Many applications,
such as Nginx and Ceph, are just calling zlib/openssl to use accelerators;
they are not using the HugeTLB-based UADK pool and they have not been
customized.

"vm.compact_unevictable_allowed=0 + mlock + numa_balancing disabling"
which David Rientjes mentioned seems to be a good direction to
investigate, but it would be better if those settings only affected
the specific process using SVA.
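
As a rough illustration of that direction: the per-buffer half (mlock plus
pre-faulting) can be done from the process itself, while the compaction and
NUMA-balancing knobs are still system-wide sysctls today. A sketch under
those assumptions, with alloc_locked_buffer() being a hypothetical helper:

#include <string.h>
#include <sys/mman.h>

/*
 * System-wide knobs mentioned above; today they cannot be scoped to a
 * single process, which is exactly the limitation being discussed:
 *   echo 0 > /proc/sys/vm/compact_unevictable_allowed
 *   echo 0 > /proc/sys/kernel/numa_balancing
 */
static void *alloc_locked_buffer(size_t size)
{
	void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return NULL;

	memset(buf, 0, size);		/* pre-fault the pages */
	if (mlock(buf, size)) {		/* prevents swap-out, but, as noted
					 * above, not page migration */
		munmap(buf, size);
		return NULL;
	}
	return buf;
}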

> 
> --
> Thanks,
> 
> David / dhildenb

Thanks
Barry

Thread overview: 119+ messages
2021-02-07  8:18 [RFC PATCH v3 0/2] mempinfd: Add new syscall to provide memory pin Zhou Wang
2021-02-07  8:18 ` Zhou Wang
2021-02-07  8:18 ` Zhou Wang
2021-02-07  8:18 ` [RFC PATCH v3 1/2] " Zhou Wang
2021-02-07  8:18   ` Zhou Wang
2021-02-07  8:18   ` Zhou Wang
2021-02-07 10:51   ` kernel test robot
2021-02-07 10:59   ` kernel test robot
2021-02-07 21:34   ` Matthew Wilcox
2021-02-07 21:34     ` Matthew Wilcox
2021-02-07 21:34     ` Matthew Wilcox
2021-02-07 22:24     ` Song Bao Hua (Barry Song)
2021-02-07 22:24       ` Song Bao Hua (Barry Song)
2021-02-07 22:24       ` Song Bao Hua (Barry Song)
2021-02-07 22:24       ` Song Bao Hua (Barry Song)
2021-02-08  1:30       ` Matthew Wilcox
2021-02-08  1:30         ` Matthew Wilcox
2021-02-08  1:30         ` Matthew Wilcox
2021-02-08  1:30         ` Matthew Wilcox
2021-02-08  2:27         ` Song Bao Hua (Barry Song)
2021-02-08  2:27           ` Song Bao Hua (Barry Song)
2021-02-08  2:27           ` Song Bao Hua (Barry Song)
2021-02-08  2:27           ` Song Bao Hua (Barry Song)
2021-02-08  3:46           ` Hillf Danton
2021-02-08  8:21           ` David Hildenbrand
2021-02-08  8:21             ` David Hildenbrand
2021-02-08  8:21             ` David Hildenbrand
2021-02-08  8:21             ` David Hildenbrand
2021-02-08 10:13             ` Song Bao Hua (Barry Song)
2021-02-08 10:13               ` Song Bao Hua (Barry Song)
2021-02-08 10:13               ` Song Bao Hua (Barry Song)
2021-02-08 10:13               ` Song Bao Hua (Barry Song)
2021-02-08 10:37               ` David Hildenbrand
2021-02-08 10:37                 ` David Hildenbrand
2021-02-08 10:37                 ` David Hildenbrand
2021-02-08 10:37                 ` David Hildenbrand
2021-02-08 20:52                 ` Song Bao Hua (Barry Song) [this message]
2021-02-08 20:52                   ` Song Bao Hua (Barry Song)
2021-02-08 20:52                   ` Song Bao Hua (Barry Song)
2021-02-08 20:52                   ` Song Bao Hua (Barry Song)
2021-02-08  2:18       ` David Rientjes
2021-02-08  2:18         ` David Rientjes
2021-02-08  2:18         ` David Rientjes via iommu
2021-02-08  2:18         ` David Rientjes
2021-02-08  5:34         ` Song Bao Hua (Barry Song)
2021-02-08  5:34           ` Song Bao Hua (Barry Song)
2021-02-08  5:34           ` Song Bao Hua (Barry Song)
2021-02-08  5:34           ` Song Bao Hua (Barry Song)
2021-02-09  9:02     ` Zhou Wang
2021-02-09  9:02       ` Zhou Wang
2021-02-07 21:51   ` Arnd Bergmann
2021-02-07 21:51     ` Arnd Bergmann
2021-02-07 21:51     ` Arnd Bergmann
2021-02-07 21:51     ` Arnd Bergmann
2021-02-09  9:27     ` Zhou Wang
2021-02-09  9:27       ` Zhou Wang
2021-02-09  9:27       ` Zhou Wang
2021-02-07 22:02   ` Andy Lutomirski
2021-02-07 22:02     ` Andy Lutomirski
2021-02-07 22:02     ` Andy Lutomirski
2021-02-09  9:17     ` Zhou Wang
2021-02-09  9:17       ` Zhou Wang
2021-02-09  9:17       ` Zhou Wang
2021-02-09  9:37       ` Greg KH
2021-02-09  9:37         ` Greg KH
2021-02-09  9:37         ` Greg KH
2021-02-09 11:58         ` Zhou Wang
2021-02-09 11:58           ` Zhou Wang
2021-02-09 11:58           ` Zhou Wang
2021-02-09 12:01           ` Greg KH
2021-02-09 12:01             ` Greg KH
2021-02-09 12:01             ` Greg KH
2021-02-09 12:20             ` Zhou Wang
2021-02-09 12:20               ` Zhou Wang
2021-02-09 12:20               ` Zhou Wang
2021-02-10 18:50               ` Matthew Wilcox
2021-02-10 18:50                 ` Matthew Wilcox
2021-02-10 18:50                 ` Matthew Wilcox
2021-02-08  8:14   ` David Hildenbrand
2021-02-08  8:14     ` David Hildenbrand
2021-02-08  8:14     ` David Hildenbrand
2021-02-08 18:33     ` Jason Gunthorpe
2021-02-08 18:33       ` Jason Gunthorpe
2021-02-08 18:33       ` Jason Gunthorpe
2021-02-08 20:35       ` Song Bao Hua (Barry Song)
2021-02-08 20:35         ` Song Bao Hua (Barry Song)
2021-02-08 20:35         ` Song Bao Hua (Barry Song)
2021-02-08 20:35         ` Song Bao Hua (Barry Song)
2021-02-08 21:30         ` Jason Gunthorpe
2021-02-08 21:30           ` Jason Gunthorpe
2021-02-08 21:30           ` Jason Gunthorpe
2021-02-08 21:30           ` Jason Gunthorpe
2021-02-09  3:01           ` Song Bao Hua (Barry Song)
2021-02-09  3:01             ` Song Bao Hua (Barry Song)
2021-02-09  3:01             ` Song Bao Hua (Barry Song)
2021-02-09  3:01             ` Song Bao Hua (Barry Song)
2021-02-09 13:53             ` Jason Gunthorpe
2021-02-09 13:53               ` Jason Gunthorpe
2021-02-09 13:53               ` Jason Gunthorpe
2021-02-09 13:53               ` Jason Gunthorpe
2021-02-09 22:22               ` Song Bao Hua (Barry Song)
2021-02-09 22:22                 ` Song Bao Hua (Barry Song)
2021-02-09 22:22                 ` Song Bao Hua (Barry Song)
2021-02-09 22:22                 ` Song Bao Hua (Barry Song)
2021-02-10 18:04                 ` Jason Gunthorpe
2021-02-10 18:04                   ` Jason Gunthorpe
2021-02-10 18:04                   ` Jason Gunthorpe
2021-02-10 18:04                   ` Jason Gunthorpe
2021-02-10 21:39                   ` Song Bao Hua (Barry Song)
2021-02-10 21:39                     ` Song Bao Hua (Barry Song)
2021-02-10 21:39                     ` Song Bao Hua (Barry Song)
2021-02-10 21:39                     ` Song Bao Hua (Barry Song)
2021-02-11 10:28                     ` David Hildenbrand
2021-02-11 10:28                       ` David Hildenbrand
2021-02-11 10:28                       ` David Hildenbrand
2021-02-11 10:28                       ` David Hildenbrand
2021-02-07  8:18 ` [RFC PATCH v3 2/2] selftests/vm: add mempinfd test Zhou Wang
2021-02-07  8:18   ` Zhou Wang
2021-02-07  8:18   ` Zhou Wang
