* alloc_pages_bulk() @ 2021-02-08 15:42 Chuck Lever 2021-02-08 17:50 ` Fwd: alloc_pages_bulk() Chuck Lever 0 siblings, 1 reply; 23+ messages in thread From: Chuck Lever @ 2021-02-08 15:42 UTC (permalink / raw) To: mgorman, brouer; +Cc: linux-mm, Linux NFS Mailing List Hi- [ please Cc: me, I'm not subscribed to linux-mm ] We've been discussing how NFSD can more efficiently refill its receive buffers (currently alloc_page() in a loop; see net/sunrpc/svc_xprt.c::svc_alloc_arg()). Neil Brown pointed me to this old thread: https://lore.kernel.org/lkml/20170109163518.6001-1-mgorman@techsingularity.net/ We see that many of the prerequisites are in v5.11-rc, but alloc_page_bulk() is not. I tried forward-porting 4/4 in that series, but enough internal APIs have changed since 2017 that the patch does not come close to applying and compiling. I'm wondering: a) is there a newer version of that work? b) if not, does there exist a preferred API in 5.11 for bulk page allocation? Many thanks for any guidance! -- Chuck Lever ^ permalink raw reply [flat|nested] 23+ messages in thread
* Fwd: alloc_pages_bulk() 2021-02-08 15:42 alloc_pages_bulk() Chuck Lever @ 2021-02-08 17:50 ` Chuck Lever 2021-02-09 10:31 ` alloc_pages_bulk() Jesper Dangaard Brouer 2021-02-09 22:01 ` Fwd: alloc_pages_bulk() Matthew Wilcox 0 siblings, 2 replies; 23+ messages in thread From: Chuck Lever @ 2021-02-08 17:50 UTC (permalink / raw) To: mgorman, brouer; +Cc: Linux NFS Mailing List, linux-mm Sorry for resending. I misremembered the linux-mm address. > Begin forwarded message: > > From: Chuck Lever <chuck.lever@oracle.com> > Subject: alloc_pages_bulk() > Date: February 8, 2021 at 10:42:07 AM EST > To: "mgorman@suse.de" <mgorman@suse.de>, "brouer@redhat.com" <brouer@redhat.com> > Cc: "linux-mm@vger.kernel.org" <linux-mm@vger.kernel.org>, Linux NFS Mailing List <linux-nfs@vger.kernel.org> > > Hi- > > [ please Cc: me, I'm not subscribed to linux-mm ] > > We've been discussing how NFSD can more efficiently refill its > receive buffers (currently alloc_page() in a loop; see > net/sunrpc/svc_xprt.c::svc_alloc_arg()). > > Neil Brown pointed me to this old thread: > > https://lore.kernel.org/lkml/20170109163518.6001-1-mgorman@techsingularity.net/ > > We see that many of the prerequisites are in v5.11-rc, but > alloc_page_bulk() is not. I tried forward-porting 4/4 in that > series, but enough internal APIs have changed since 2017 that > the patch does not come close to applying and compiling. > > I'm wondering: > > a) is there a newer version of that work? > > b) if not, does there exist a preferred API in 5.11 for bulk > page allocation? > > Many thanks for any guidance! > > -- > Chuck Lever > > > -- Chuck Lever ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: alloc_pages_bulk() 2021-02-08 17:50 ` Fwd: alloc_pages_bulk() Chuck Lever @ 2021-02-09 10:31 ` Jesper Dangaard Brouer 2021-02-09 13:37 ` alloc_pages_bulk() Chuck Lever ` (2 more replies) 2021-02-09 22:01 ` Fwd: alloc_pages_bulk() Matthew Wilcox 1 sibling, 3 replies; 23+ messages in thread From: Jesper Dangaard Brouer @ 2021-02-09 10:31 UTC (permalink / raw) To: Chuck Lever; +Cc: mgorman, Linux NFS Mailing List, linux-mm, brouer, Mel Gorman On Mon, 8 Feb 2021 17:50:51 +0000 Chuck Lever <chuck.lever@oracle.com> wrote: > Sorry for resending. I misremembered the linux-mm address. > > > Begin forwarded message: > > > > [ please Cc: me, I'm not subscribed to linux-mm ] > > > > We've been discussing how NFSD can more efficiently refill its > > receive buffers (currently alloc_page() in a loop; see > > net/sunrpc/svc_xprt.c::svc_alloc_arg()). > > It looks like you could also take advantage of bulk free in: svc_free_res_pages() I would like to use the page bulk alloc API here: https://github.com/torvalds/linux/blob/master/net/core/page_pool.c#L201-L209 > > Neil Brown pointed me to this old thread: > > > > https://lore.kernel.org/lkml/20170109163518.6001-1-mgorman@techsingularity.net/ > > > > We see that many of the prerequisites are in v5.11-rc, but > > alloc_page_bulk() is not. I tried forward-porting 4/4 in that > > series, but enough internal APIs have changed since 2017 that > > the patch does not come close to applying and compiling. I forgot that this was never merged. It is sad as Mel showed huge improvement with his work. > > I'm wondering: > > > > a) is there a newer version of that work? > > Mel, why was this work never merged upstream? > > b) if not, does there exist a preferred API in 5.11 for bulk > > page allocation? > > > > Many thanks for any guidance! 
I have a kernel module that micro-benchmarks the API alloc_pages_bulk() here: https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench04_bulk.c#L97 -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: alloc_pages_bulk() 2021-02-09 10:31 ` alloc_pages_bulk() Jesper Dangaard Brouer @ 2021-02-09 13:37 ` Chuck Lever 2021-02-09 17:27 ` alloc_pages_bulk() Vlastimil Babka 2021-02-10 8:41 ` alloc_pages_bulk() Mel Gorman 2 siblings, 0 replies; 23+ messages in thread From: Chuck Lever @ 2021-02-09 13:37 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: mgorman, Linux NFS Mailing List, linux-mm, Mel Gorman Hi Jesper- > On Feb 9, 2021, at 5:31 AM, Jesper Dangaard Brouer <brouer@redhat.com> wrote: > > On Mon, 8 Feb 2021 17:50:51 +0000 > Chuck Lever <chuck.lever@oracle.com> wrote: > >> Sorry for resending. I misremembered the linux-mm address. >> >>> Begin forwarded message: >>> >>> [ please Cc: me, I'm not subscribed to linux-mm ] >>> >>> We've been discussing how NFSD can more efficiently refill its >>> receive buffers (currently alloc_page() in a loop; see >>> net/sunrpc/svc_xprt.c::svc_alloc_arg()). >>> > > It looks like you could also take advantage of bulk free in: > svc_free_res_pages() We started there. Those pages often have a non-zero reference count, so that call site didn't seem to be a candidate for a bulk free. > I would like to use the page bulk alloc API here: > https://github.com/torvalds/linux/blob/master/net/core/page_pool.c#L201-L209 > > >>> Neil Brown pointed me to this old thread: >>> >>> https://lore.kernel.org/lkml/20170109163518.6001-1-mgorman@techsingularity.net/ >>> >>> We see that many of the prerequisites are in v5.11-rc, but >>> alloc_page_bulk() is not. I tried forward-porting 4/4 in that >>> series, but enough internal APIs have changed since 2017 that >>> the patch does not come close to applying and compiling. > > I forgot that this was never merged. It is sad as Mel showed huge > improvement with his work. > >>> I'm wondering: >>> >>> a) is there a newer version of that work? >>> > > Mel, why was this work never merged upstream? > > >>> b) if not, does there exist a preferred API in 5.11 for bulk >>> page allocation? 
>>> >>> Many thanks for any guidance! > > I have a kernel module that micro-benchmarks the API alloc_pages_bulk() here: > https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench04_bulk.c#L97 > > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Principal Kernel Engineer at Red Hat > LinkedIn: http://www.linkedin.com/in/brouer > -- Chuck Lever ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: alloc_pages_bulk() 2021-02-09 10:31 ` alloc_pages_bulk() Jesper Dangaard Brouer 2021-02-09 13:37 ` alloc_pages_bulk() Chuck Lever @ 2021-02-09 17:27 ` Vlastimil Babka 2021-02-10 9:51 ` alloc_pages_bulk() Christoph Hellwig 2021-02-10 8:41 ` alloc_pages_bulk() Mel Gorman 2 siblings, 1 reply; 23+ messages in thread From: Vlastimil Babka @ 2021-02-09 17:27 UTC (permalink / raw) To: Jesper Dangaard Brouer, Chuck Lever Cc: mgorman, Linux NFS Mailing List, linux-mm, Mel Gorman On 2/9/21 11:31 AM, Jesper Dangaard Brouer wrote: > On Mon, 8 Feb 2021 17:50:51 +0000 > Chuck Lever <chuck.lever@oracle.com> wrote: > >> Sorry for resending. I misremembered the linux-mm address. >> >> > Begin forwarded message: >> > >> > [ please Cc: me, I'm not subscribed to linux-mm ] >> > >> > We've been discussing how NFSD can more efficiently refill its >> > receive buffers (currently alloc_page() in a loop; see >> > net/sunrpc/svc_xprt.c::svc_alloc_arg()). >> > > > It looks like you could also take advantage of bulk free in: > svc_free_res_pages() > > I would like to use the page bulk alloc API here: > https://github.com/torvalds/linux/blob/master/net/core/page_pool.c#L201-L209 > > >> > Neil Brown pointed me to this old thread: >> > >> > https://lore.kernel.org/lkml/20170109163518.6001-1-mgorman@techsingularity.net/ >> > >> > We see that many of the prerequisites are in v5.11-rc, but >> > alloc_page_bulk() is not. I tried forward-porting 4/4 in that >> > series, but enough internal APIs have changed since 2017 that >> > the patch does not come close to applying and compiling. > > I forgot that this was never merged. It is sad as Mel showed huge > improvement with his work. > >> > I'm wondering: >> > >> > a) is there a newer version of that work? >> > > > Mel, why was this work never merged upstream? 
Hmm, the cover letter of that work says: > The fourth patch introduces a bulk page allocator with no > in-kernel users as an example for Jesper and others who want to > build a page allocator for DMA-coherent pages. It hopefully is > relatively easy to modify this API and the one core function to get > the semantics they require. So it seems there were no immediate users to finalize the API? >> > b) if not, does there exist a preferred API in 5.11 for bulk >> > page allocation? >> > >> > Many thanks for any guidance! > > I have a kernel module that micro-benchmarks the API alloc_pages_bulk() here: > https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench04_bulk.c#L97 > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: alloc_pages_bulk() 2021-02-09 17:27 ` alloc_pages_bulk() Vlastimil Babka @ 2021-02-10 9:51 ` Christoph Hellwig 0 siblings, 0 replies; 23+ messages in thread From: Christoph Hellwig @ 2021-02-10 9:51 UTC (permalink / raw) To: Vlastimil Babka Cc: Jesper Dangaard Brouer, Chuck Lever, mgorman, Linux NFS Mailing List, linux-mm, Mel Gorman On Tue, Feb 09, 2021 at 06:27:11PM +0100, Vlastimil Babka wrote: > > > The fourth patch introduces a bulk page allocator with no > > in-kernel users as an example for Jesper and others who want to > > build a page allocator for DMA-coherent pages. It hopefully is > > relatively easy to modify this API and the one core function to get the semantics they require. > > So it seems there were no immediate users to finalize the API? __iommu_dma_alloc_pages would be a hot candidate. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: alloc_pages_bulk() 2021-02-09 10:31 ` alloc_pages_bulk() Jesper Dangaard Brouer 2021-02-09 13:37 ` alloc_pages_bulk() Chuck Lever 2021-02-09 17:27 ` alloc_pages_bulk() Vlastimil Babka @ 2021-02-10 8:41 ` Mel Gorman 2021-02-10 11:41 ` alloc_pages_bulk() Jesper Dangaard Brouer 2 siblings, 1 reply; 23+ messages in thread From: Mel Gorman @ 2021-02-10 8:41 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Chuck Lever, mgorman, Linux NFS Mailing List, linux-mm On Tue, Feb 09, 2021 at 11:31:08AM +0100, Jesper Dangaard Brouer wrote: > > > Neil Brown pointed me to this old thread: > > > > > > https://lore.kernel.org/lkml/20170109163518.6001-1-mgorman@techsingularity.net/ > > > > > > We see that many of the prerequisites are in v5.11-rc, but > > > alloc_page_bulk() is not. I tried forward-porting 4/4 in that > > > series, but enough internal APIs have changed since 2017 that > > > the patch does not come close to applying and compiling. > > I forgot that this was never merged. It is sad as Mel showed huge > improvement with his work. > > > > I'm wondering: > > > > > > a) is there a newer version of that work? > > > > > Mel, why was this work never merged upstream? > Lack of realistic consumers to drive it forward, finalise the API and confirm it was working as expected. It eventually died as a result. If it was reintroduced, it would need to be forward ported and then implement at least one user on top. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: alloc_pages_bulk() 2021-02-10 8:41 ` alloc_pages_bulk() Mel Gorman @ 2021-02-10 11:41 ` Jesper Dangaard Brouer 2021-02-10 13:07 ` alloc_pages_bulk() Mel Gorman 0 siblings, 1 reply; 23+ messages in thread From: Jesper Dangaard Brouer @ 2021-02-10 11:41 UTC (permalink / raw) To: Mel Gorman Cc: Chuck Lever, mgorman, Linux NFS Mailing List, linux-mm, brouer, Jakub Kicinski On Wed, 10 Feb 2021 08:41:55 +0000 Mel Gorman <mgorman@techsingularity.net> wrote: > On Tue, Feb 09, 2021 at 11:31:08AM +0100, Jesper Dangaard Brouer wrote: > > > > Neil Brown pointed me to this old thread: > > > > > > > > https://lore.kernel.org/lkml/20170109163518.6001-1-mgorman@techsingularity.net/ > > > > > > > > We see that many of the prerequisites are in v5.11-rc, but > > > > alloc_page_bulk() is not. I tried forward-porting 4/4 in that > > > > series, but enough internal APIs have changed since 2017 that > > > > the patch does not come close to applying and compiling. > > > > I forgot that this was never merged. It is sad as Mel showed huge > > improvement with his work. > > > > > > I'm wondering: > > > > > > > > a) is there a newer version of that work? > > > > > > > > Mel, why was this work never merged upstream? > > > > Lack of realistic consumers to drive it forward, finalise the API and > confirm it was working as expected. It eventually died as a result. If it > was reintroduced, it would need to be forward ported and then implement > at least one user on top. I guess I misunderstood you back in 2017. I thought that I had presented a clear use-case/consumer in page_pool[1]. But you wanted the code as part of the patchset, I guess. I thought I could add it later via the net-next tree. It seems that Chuck now has an NFS use-case, and Hellwig also has a use-case for DMA-iommu in __iommu_dma_alloc_pages. The performance improvement (in the link above) was really impressive!
Quote: "It's roughly a 50-70% reduction of allocation costs and roughly a halving of the overall cost of allocating/freeing batches of pages." Who has time to revive this patchset? I can only sign up for coding the page_pool usage. Chuck, do you have time if Mel doesn't? [1] https://github.com/torvalds/linux/blob/master/net/core/page_pool.c#L201-L209 -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: alloc_pages_bulk() 2021-02-10 11:41 ` alloc_pages_bulk() Jesper Dangaard Brouer @ 2021-02-10 13:07 ` Mel Gorman 2021-02-10 22:58 ` alloc_pages_bulk() Chuck Lever 0 siblings, 1 reply; 23+ messages in thread From: Mel Gorman @ 2021-02-10 13:07 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Mel Gorman, Chuck Lever, Linux NFS Mailing List, linux-mm, Jakub Kicinski On Wed, Feb 10, 2021 at 12:41:03PM +0100, Jesper Dangaard Brouer wrote: > On Wed, 10 Feb 2021 08:41:55 +0000 > Mel Gorman <mgorman@techsingularity.net> wrote: > > > On Tue, Feb 09, 2021 at 11:31:08AM +0100, Jesper Dangaard Brouer wrote: > > > > > Neil Brown pointed me to this old thread: > > > > > > > > > > https://lore.kernel.org/lkml/20170109163518.6001-1-mgorman@techsingularity.net/ > > > > > > > > > > We see that many of the prerequisites are in v5.11-rc, but > > > > > alloc_page_bulk() is not. I tried forward-porting 4/4 in that > > > > > series, but enough internal APIs have changed since 2017 that > > > > > the patch does not come close to applying and compiling. > > > > > > I forgot that this was never merged. It is sad as Mel showed huge > > > improvement with his work. > > > > > > > > I'm wondering: > > > > > > > > > > a) is there a newer version of that work? > > > > > > > > > > > Mel, why was this work never merged upstream? > > > > > > > Lack of realistic consumers to drive it forward, finalise the API and > > confirm it was working as expected. It eventually died as a result. If it > > was reintroduced, it would need to be forward ported and then implement > > at least one user on top. > > I guess I misunderstood you back in 2017. I thought that I had presented > a clear use-case/consumer in page_pool[1]. You did, but it was never integrated and/or tested AFAIK. I see page_pool accepts orders, so even by the original prototype, it would only have seen a benefit for order-0 pages.
It would also have needed some supporting data that it actually helped with drivers using the page_pool interface, which I was not in a position to properly test at the time. > But you wanted the code as > part of the patchset, I guess. I thought I could add it later via the > net-next tree. > Yes, a consumer of the code should go in at the same time with supporting data showing it actually helps, because otherwise it's dead code. > It seems that Chuck now has an NFS use-case, and Hellwig also has a > use-case for DMA-iommu in __iommu_dma_alloc_pages. > > The performance improvement (in the link above) was really impressive! > > Quote: > "It's roughly a 50-70% reduction of allocation costs and roughly a halving of the > overall cost of allocating/freeing batches of pages." > > Who has time to revive this patchset? > Not in the short term due to bug load and other obligations. The original series had "mm, page_allocator: Only use per-cpu allocator for irq-safe requests" but that was ultimately rejected because softirqs were affected, so it would have to be done without that patch. The last patch can be rebased easily enough but it only batch allocates order-0 pages. It's also only build tested and could be completely miserable in practice; as I didn't even boot test it, let alone actually test it, it could be a giant pile of crap. To make high orders work, it would need significant reworking, but if the API showed even partial benefit, it might motivate someone to reimplement the bulk interfaces to perform better.
Rebased diff, build tested only, might not even work diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 6e479e9c48ce..d1b586e5b4b8 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -511,6 +511,29 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order, int preferred_nid) return __alloc_pages_nodemask(gfp_mask, order, preferred_nid, NULL); } +unsigned long +__alloc_pages_bulk_nodemask(gfp_t gfp_mask, unsigned int order, + struct zonelist *zonelist, nodemask_t *nodemask, + unsigned long nr_pages, struct list_head *alloc_list); + +static inline unsigned long +__alloc_pages_bulk(gfp_t gfp_mask, unsigned int order, + struct zonelist *zonelist, unsigned long nr_pages, + struct list_head *list) +{ + return __alloc_pages_bulk_nodemask(gfp_mask, order, zonelist, NULL, + nr_pages, list); +} + +static inline unsigned long +alloc_pages_bulk(gfp_t gfp_mask, unsigned int order, + unsigned long nr_pages, struct list_head *list) +{ + int nid = numa_mem_id(); + return __alloc_pages_bulk(gfp_mask, order, + node_zonelist(nid, gfp_mask), nr_pages, list); +} + /* * Allocate pages, preferring the node given as nid. The node must be valid and * online. For more general interface, see alloc_pages_node(). @@ -580,6 +603,7 @@ void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask); extern void __free_pages(struct page *page, unsigned int order); extern void free_pages(unsigned long addr, unsigned int order); +extern void free_pages_bulk(struct list_head *list); struct page_frag_cache; extern void __page_frag_cache_drain(struct page *page, unsigned int count); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 519a60d5b6f7..f8353ea7b977 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3254,7 +3254,7 @@ void free_unref_page(struct page *page) } /* - * Free a list of 0-order pages + * Free a list of 0-order pages whose reference count is already zero. 
*/ void free_unref_page_list(struct list_head *list) { @@ -4435,6 +4435,21 @@ static void wake_all_kswapds(unsigned int order, gfp_t gfp_mask, } } +/* Drop reference counts and free pages from a list */ +void free_pages_bulk(struct list_head *list) +{ + struct page *page, *next; + + list_for_each_entry_safe(page, next, list, lru) { + trace_mm_page_free_batched(page); + if (put_page_testzero(page)) { + list_del(&page->lru); + __free_pages_ok(page, 0, FPI_NONE); + } + } +} +EXPORT_SYMBOL_GPL(free_pages_bulk); + + static inline unsigned int gfp_to_alloc_flags(gfp_t gfp_mask) { @@ -5818,6 +5833,99 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask) } + +/* + * This is a batched version of the page allocator that attempts to + * allocate nr_pages quickly from the preferred zone and add them to list. + * Note that there is no guarantee that nr_pages will be allocated although + * every effort will be made to allocate at least one. Unlike the core + * allocator, no special effort is made to recover from transient + * failures caused by changes in cpusets. It should only be used from !IRQ + * context. An attempt to allocate a batch of pages from an interrupt + * will allocate a single page.
+ */ +unsigned long +__alloc_pages_bulk_nodemask(gfp_t gfp_mask, unsigned int order, + struct zonelist *zonelist, nodemask_t *nodemask, + unsigned long nr_pages, struct list_head *alloc_list) +{ + struct page *page; + unsigned long alloced = 0; + unsigned int alloc_flags = ALLOC_WMARK_LOW; + unsigned long flags; + struct zone *zone; + struct per_cpu_pages *pcp; + struct list_head *pcp_list; + int migratetype; + gfp_t alloc_mask = gfp_mask; /* The gfp_t that was actually used for allocation */ + struct alloc_context ac = { }; + + /* If there are already pages on the list, don't bother */ + if (!list_empty(alloc_list)) + return 0; + + /* Only order-0 pages can go through the per-cpu lists */ + if (order) + goto failed; + + gfp_mask &= gfp_allowed_mask; + + if (!prepare_alloc_pages(gfp_mask, order, numa_mem_id(), nodemask, &ac, &alloc_mask, &alloc_flags)) + return 0; + + if (!ac.preferred_zoneref) + return 0; + + /* + * Only attempt a batch allocation if watermarks on the preferred zone + * are safe. + */ + zone = ac.preferred_zoneref->zone; + if (!zone_watermark_fast(zone, order, high_wmark_pages(zone) + nr_pages, + zonelist_zone_idx(ac.preferred_zoneref), alloc_flags, gfp_mask)) + goto failed; + + /* Attempt the batch allocation */ + migratetype = ac.migratetype; + + local_irq_save(flags); + pcp = &this_cpu_ptr(zone->pageset)->pcp; + pcp_list = &pcp->lists[migratetype]; + + while (nr_pages) { + page = __rmqueue_pcplist(zone, gfp_mask, migratetype, + pcp, pcp_list); + if (!page) + break; + + prep_new_page(page, order, gfp_mask, 0); + nr_pages--; + alloced++; + list_add(&page->lru, alloc_list); + } + + if (!alloced) { + preempt_enable_no_resched(); + goto failed; + } + + __count_zid_vm_events(PGALLOC, zone_idx(zone), alloced); + zone_statistics(zone, zone); + + local_irq_restore(flags); + + return alloced; + + failed: + page = __alloc_pages_nodemask(gfp_mask, order, numa_node_id(), nodemask); + if (page) { + alloced++; + list_add(&page->lru, alloc_list); + } + + return alloced; +}
+EXPORT_SYMBOL(__alloc_pages_bulk_nodemask); + /* * Build zonelists ordered by node and zones within node. * This results in maximum locality--normal zone overflows into local -- Mel Gorman SUSE Labs ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: alloc_pages_bulk() 2021-02-10 13:07 ` alloc_pages_bulk() Mel Gorman @ 2021-02-10 22:58 ` Chuck Lever 2021-02-11 9:12 ` alloc_pages_bulk() Mel Gorman 0 siblings, 1 reply; 23+ messages in thread From: Chuck Lever @ 2021-02-10 22:58 UTC (permalink / raw) To: Mel Gorman Cc: Jesper Dangaard Brouer, Mel Gorman, Linux NFS Mailing List, linux-mm, Jakub Kicinski > On Feb 10, 2021, at 8:07 AM, Mel Gorman <mgorman@suse.de> wrote: > > On Wed, Feb 10, 2021 at 12:41:03PM +0100, Jesper Dangaard Brouer wrote: >> On Wed, 10 Feb 2021 08:41:55 +0000 >> Mel Gorman <mgorman@techsingularity.net> wrote: >> >>> On Tue, Feb 09, 2021 at 11:31:08AM +0100, Jesper Dangaard Brouer wrote: >>>>>> Neil Brown pointed me to this old thread: >>>>>> >>>>>> https://lore.kernel.org/lkml/20170109163518.6001-1-mgorman@techsingularity.net/ >>>>>> >>>>>> We see that many of the prerequisites are in v5.11-rc, but >>>>>> alloc_page_bulk() is not. I tried forward-porting 4/4 in that >>>>>> series, but enough internal APIs have changed since 2017 that >>>>>> the patch does not come close to applying and compiling. >>>> >>>> I forgot that this was never merged. It is sad as Mel showed huge >>>> improvement with his work. >>>> >>>>>> I'm wondering: >>>>>> >>>>>> a) is there a newer version of that work? >>>>>> >>>> >>>> Mel, why was this work never merged upstream? >>>> >>> >>> Lack of realistic consumers to drive it forward, finalise the API and >>> confirm it was working as expected. It eventually died as a result. If it >>> was reintroduced, it would need to be forward ported and then implement >>> at least one user on top. >> >> I guess I misunderstood you back in 2017. I though that I had presented >> a clear use-case/consumer in page_pool[1]. > > You did but it was never integrated and/or tested AFAIK. I see page_pool > accepts orders so even by the original prototype, it would only have seen > a benefit for order-0 pages. 
It would also have needed some supporting > data that it actually helped with drivers using the page_pool interface > which I was not in a position to properly test at the time. > >> But you wanted the code as >> part of the patchset, I guess. I thought I could add it later via the >> net-next tree. >> > > Yes, a consumer of the code should go in at the same time with supporting > data showing it actually helps because otherwise it's dead code. > >> It seems that Chuck now has an NFS use-case, and Hellwig also has a >> use-case for DMA-iommu in __iommu_dma_alloc_pages. >> >> The performance improvement (in the link above) was really impressive! >> >> Quote: >> "It's roughly a 50-70% reduction of allocation costs and roughly a halving of the >> overall cost of allocating/freeing batches of pages." >> >> Who has time to revive this patchset? >> > > Not in the short term due to bug load and other obligations. > > The original series had "mm, page_allocator: Only use per-cpu allocator > for irq-safe requests" but that was ultimately rejected because softirqs > were affected so it would have to be done without that patch. > > The last patch can be rebased easily enough but it only batch allocates > order-0 pages. It's also only build tested and could be completely > miserable in practice; as I didn't even boot test it, let alone > actually test it, it could be a giant pile of crap. To make high orders > work, it would need significant reworking but if the API showed even > partial benefit, it might motivate someone to reimplement the bulk > interfaces to perform better. > > Rebased diff, build tested only, might not even work Thanks, Mel, for kicking off a forward port. It compiles. I've added a patch to replace the page allocation loop in svc_alloc_arg() with a call to alloc_pages_bulk(). The server system deadlocks pretty quickly with any NFS traffic.
Based on some initial debugging, it appears that a pcplist is getting corrupted and this causes the list_del() in __rmqueue_pcplist() to fail during a call to alloc_pages_bulk(). > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > index 6e479e9c48ce..d1b586e5b4b8 100644 > --- a/include/linux/gfp.h > +++ b/include/linux/gfp.h > @@ -511,6 +511,29 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order, int preferred_nid) > return __alloc_pages_nodemask(gfp_mask, order, preferred_nid, NULL); > } > > +unsigned long > +__alloc_pages_bulk_nodemask(gfp_t gfp_mask, unsigned int order, > + struct zonelist *zonelist, nodemask_t *nodemask, > + unsigned long nr_pages, struct list_head *alloc_list); > + > +static inline unsigned long > +__alloc_pages_bulk(gfp_t gfp_mask, unsigned int order, > + struct zonelist *zonelist, unsigned long nr_pages, > + struct list_head *list) > +{ > + return __alloc_pages_bulk_nodemask(gfp_mask, order, zonelist, NULL, > + nr_pages, list); > +} > + > +static inline unsigned long > +alloc_pages_bulk(gfp_t gfp_mask, unsigned int order, > + unsigned long nr_pages, struct list_head *list) > +{ > + int nid = numa_mem_id(); > + return __alloc_pages_bulk(gfp_mask, order, > + node_zonelist(nid, gfp_mask), nr_pages, list); > +} > + > /* > * Allocate pages, preferring the node given as nid. The node must be valid and > * online. For more general interface, see alloc_pages_node().
> @@ -580,6 +603,7 @@ void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask); > > extern void __free_pages(struct page *page, unsigned int order); > extern void free_pages(unsigned long addr, unsigned int order); > +extern void free_pages_bulk(struct list_head *list); > > struct page_frag_cache; > extern void __page_frag_cache_drain(struct page *page, unsigned int count); > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 519a60d5b6f7..f8353ea7b977 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -3254,7 +3254,7 @@ void free_unref_page(struct page *page) > } > > /* > - * Free a list of 0-order pages > + * Free a list of 0-order pages whose reference count is already zero. > */ > void free_unref_page_list(struct list_head *list) > { > @@ -4435,6 +4435,21 @@ static void wake_all_kswapds(unsigned int order, gfp_t gfp_mask, > } > } > > +/* Drop reference counts and free pages from a list */ > +void free_pages_bulk(struct list_head *list) > +{ > + struct page *page, *next; > + > + list_for_each_entry_safe(page, next, list, lru) { > + trace_mm_page_free_batched(page); > + if (put_page_testzero(page)) { > + list_del(&page->lru); > + __free_pages_ok(page, 0, FPI_NONE); > + } > + } > +} > +EXPORT_SYMBOL_GPL(free_pages_bulk); > + > static inline unsigned int > gfp_to_alloc_flags(gfp_t gfp_mask) > { > @@ -5818,6 +5833,99 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask) > } > > > +/* > + * This is a batched version of the page allocator that attempts to > + * allocate nr_pages quickly from the preferred zone and add them to list. > + * Note that there is no guarantee that nr_pages will be allocated although > + * every effort will be made to allocate at least one. Unlike the core > + * allocator, no special effort is made to recover from transient > + * failures caused by changes in cpusets. It should only be used from !IRQ > + * context. 
An attempt to allocate a batch of patches from an interrupt > + * will allocate a single page. > + */ > +unsigned long > +__alloc_pages_bulk_nodemask(gfp_t gfp_mask, unsigned int order, > + struct zonelist *zonelist, nodemask_t *nodemask, > + unsigned long nr_pages, struct list_head *alloc_list) > +{ > + struct page *page; > + unsigned long alloced = 0; > + unsigned int alloc_flags = ALLOC_WMARK_LOW; > + unsigned long flags; > + struct zone *zone; > + struct per_cpu_pages *pcp; > + struct list_head *pcp_list; > + int migratetype; > + gfp_t alloc_mask = gfp_mask; /* The gfp_t that was actually used for allocation */ > + struct alloc_context ac = { }; > + > + /* If there are already pages on the list, don't bother */ > + if (!list_empty(alloc_list)) > + return 0; > + > + /* Order-0 cannot go through per-cpu lists */ > + if (order) > + goto failed; > + > + gfp_mask &= gfp_allowed_mask; > + > + if (!prepare_alloc_pages(gfp_mask, order, numa_mem_id(), nodemask, &ac, &alloc_mask, &alloc_flags)) > + return 0; > + > + if (!ac.preferred_zoneref) > + return 0; > + > + /* > + * Only attempt a batch allocation if watermarks on the preferred zone > + * are safe. 
> + */ > + zone = ac.preferred_zoneref->zone; > + if (!zone_watermark_fast(zone, order, high_wmark_pages(zone) + nr_pages, > + zonelist_zone_idx(ac.preferred_zoneref), alloc_flags, gfp_mask)) > + goto failed; > + > + /* Attempt the batch allocation */ > + migratetype = ac.migratetype; > + > + local_irq_save(flags); > + pcp = &this_cpu_ptr(zone->pageset)->pcp; > + pcp_list = &pcp->lists[migratetype]; > + > + while (nr_pages) { > + page = __rmqueue_pcplist(zone, gfp_mask, migratetype, > + pcp, pcp_list); > + if (!page) > + break; > + > + prep_new_page(page, order, gfp_mask, 0); > + nr_pages--; > + alloced++; > + list_add(&page->lru, alloc_list); > + } > + > + if (!alloced) { > + preempt_enable_no_resched(); > + goto failed; > + } > + > + __count_zid_vm_events(PGALLOC, zone_idx(zone), alloced); > + zone_statistics(zone, zone); > + > + local_irq_restore(flags); > + > + return alloced; > + > +failed: > + page = __alloc_pages_nodemask(gfp_mask, order, numa_node_id(), nodemask); > + if (page) { > + alloced++; > + list_add(&page->lru, alloc_list); > + } > + > + return alloced; > +} > +EXPORT_SYMBOL(__alloc_pages_bulk_nodemask); > + > /* > * Build zonelists ordered by node and zones within node. > * This results in maximum locality--normal zone overflows into local > > -- > Mel Gorman > SUSE Labs -- Chuck Lever ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: alloc_pages_bulk() 2021-02-10 22:58 ` alloc_pages_bulk() Chuck Lever @ 2021-02-11 9:12 ` Mel Gorman 2021-02-11 12:26 ` alloc_pages_bulk() Jesper Dangaard Brouer 2021-02-11 16:20 ` alloc_pages_bulk() Chuck Lever 0 siblings, 2 replies; 23+ messages in thread From: Mel Gorman @ 2021-02-11 9:12 UTC (permalink / raw) To: Chuck Lever Cc: Mel Gorman, Jesper Dangaard Brouer, Linux NFS Mailing List, linux-mm, Jakub Kicinski On Wed, Feb 10, 2021 at 10:58:37PM +0000, Chuck Lever wrote: > > Not in the short term due to bug load and other obligations. > > > > The original series had "mm, page_allocator: Only use per-cpu allocator > > for irq-safe requests" but that was ultimately rejected because softirqs > > were affected so it would have to be done without that patch. > > > > The last patch can be rebased easily enough but it only batch allocates > > order-0 pages. It's also only build tested and could be completely > > miserable in practice and as I didn't even try boot test let, let alone > > actually test it, it could be a giant pile of crap. To make high orders > > work, it would need significant reworking but if the API showed even > > partial benefit, it might motiviate someone to reimplement the bulk > > interfaces to perform better. > > > > Rebased diff, build tested only, might not even work > > Thanks, Mel, for kicking off a forward port. > > It compiles. I've added a patch to replace the page allocation loop > in svc_alloc_arg() with a call to alloc_pages_bulk(). > > The server system deadlocks pretty quickly with any NFS traffic. Based > on some initial debugging, it appears that a pcplist is getting corrupted > and this causes the list_del() in __rmqueue_pcplist() to fail during a > a call to alloc_pages_bulk(). > Parameters to __rmqueue_pcplist are garbage as the parameter order changed. I'm surprised it didn't blow up in a spectacular fashion. 
Again, this
hasn't been near any testing and passing a list with high orders to
free_pages_bulk() will corrupt lists too. Mostly it's a curiosity to see
if there is justification for reworking the allocator to fundamentally
deal in batches and then feed batches to pcp lists and the bulk allocator
while leaving the normal GFP API as single page "batches". While that
would be ideal, it's relatively high risk for regressions. There is still
some scope for adding a basic bulk allocator before considering a major
refactoring effort.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f8353ea7b977..8f3fe7de2cf7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5892,7 +5892,7 @@ __alloc_pages_bulk_nodemask(gfp_t gfp_mask, unsigned int order,
 	pcp_list = &pcp->lists[migratetype];
 
 	while (nr_pages) {
-		page = __rmqueue_pcplist(zone, gfp_mask, migratetype,
+		page = __rmqueue_pcplist(zone, migratetype, alloc_flags,
 					pcp, pcp_list);
 		if (!page)
 			break;

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 23+ messages in thread
* Re: alloc_pages_bulk() 2021-02-11 9:12 ` alloc_pages_bulk() Mel Gorman @ 2021-02-11 12:26 ` Jesper Dangaard Brouer 2021-02-15 12:00 ` alloc_pages_bulk() Mel Gorman 2021-02-11 16:20 ` alloc_pages_bulk() Chuck Lever 1 sibling, 1 reply; 23+ messages in thread From: Jesper Dangaard Brouer @ 2021-02-11 12:26 UTC (permalink / raw) To: Mel Gorman Cc: Chuck Lever, Mel Gorman, Linux NFS Mailing List, linux-mm, Jakub Kicinski, brouer On Thu, 11 Feb 2021 09:12:35 +0000 Mel Gorman <mgorman@techsingularity.net> wrote: > On Wed, Feb 10, 2021 at 10:58:37PM +0000, Chuck Lever wrote: > > > Not in the short term due to bug load and other obligations. > > > > > > The original series had "mm, page_allocator: Only use per-cpu allocator > > > for irq-safe requests" but that was ultimately rejected because softirqs > > > were affected so it would have to be done without that patch. > > > > > > The last patch can be rebased easily enough but it only batch allocates > > > order-0 pages. It's also only build tested and could be completely > > > miserable in practice and as I didn't even try boot test let, let alone > > > actually test it, it could be a giant pile of crap. To make high orders > > > work, it would need significant reworking but if the API showed even > > > partial benefit, it might motiviate someone to reimplement the bulk > > > interfaces to perform better. > > > > > > Rebased diff, build tested only, might not even work > > > > Thanks, Mel, for kicking off a forward port. > > > > It compiles. I've added a patch to replace the page allocation loop > > in svc_alloc_arg() with a call to alloc_pages_bulk(). > > > > The server system deadlocks pretty quickly with any NFS traffic. Based > > on some initial debugging, it appears that a pcplist is getting corrupted > > and this causes the list_del() in __rmqueue_pcplist() to fail during a > > a call to alloc_pages_bulk(). > > > > Parameters to __rmqueue_pcplist are garbage as the parameter order changed. 
> I'm surprised it didn't blow up in a spectacular fashion. Again, this
> hasn't been near any testing and passing a list with high orders to
> free_pages_bulk() will corrupt lists too. Mostly it's a curiosity to see
> if there is justification for reworking the allocator to fundamentally
> deal in batches and then feed batches to pcp lists and the bulk allocator
> while leaving the normal GFP API as single page "batches". While that
> would be ideal, it's relatively high risk for regressions. There is still
> some scope for adding a basic bulk allocator before considering a major
> refactoring effort.

The alloc_flags reminds me that I have some asks around the semantics
of the API. I'm concerned about the latency impact on preemption. I
want us to avoid creating something that runs for too long with
IRQs/preempt disabled.

(With SLUB kmem_cache_free_bulk() we manage to run most of the time with
preempt and IRQs enabled. So, I'm not worried about large slab bulk
free. For SLUB kmem_cache_alloc_bulk() we run with local_irq_disable(),
so I always recommend users not to do excessive bulk-alloc.)

For this page bulk alloc API, I'm fine with limiting it to only support
order-0 pages. (This will also fit nicely with the PCP system, I think.)

I also suggest the API can return fewer pages than requested, because I
want to "exit"/return if it needs to go into an expensive code path
(like the buddy allocator or compaction). I'm assuming we have flags to
give us this behavior (via gfp_flags or alloc_flags)?

My use-case is in page_pool where I can easily handle not getting the
exact number of pages, and I want to handle low-latency network traffic.
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index f8353ea7b977..8f3fe7de2cf7 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -5892,7 +5892,7 @@ __alloc_pages_bulk_nodemask(gfp_t gfp_mask, unsigned int order, > pcp_list = &pcp->lists[migratetype]; > > while (nr_pages) { > - page = __rmqueue_pcplist(zone, gfp_mask, migratetype, > + page = __rmqueue_pcplist(zone, migratetype, alloc_flags, > pcp, pcp_list); > if (!page) > break; -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 23+ messages in thread
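The caller-side semantics Jesper asks for — the bulk call may hand back fewer pages than requested, and an atomic caller simply accepts the shortfall rather than entering a slow path — can be sketched as follows. All names here (`fake_bulk_alloc`, `refill_cache`) are illustrative stand-ins, not page_pool or page allocator API; the `cheap_pages` parameter models how many pages the allocator can supply without taking an expensive path.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the bulk allocator: supplies at most cheap_pages pages
 * per call and never blocks for more. */
static size_t fake_bulk_alloc(size_t requested, size_t cheap_pages)
{
	return requested < cheap_pages ? requested : cheap_pages;
}

/* Refill a cache toward 'want' entries; returns the new fill level.
 * A short refill is acceptable -- the fast path simply recycles
 * pages sooner instead of waiting on reclaim or compaction. */
static size_t refill_cache(size_t have, size_t want, size_t cheap_pages)
{
	if (have >= want)
		return have;
	return have + fake_bulk_alloc(want - have, cheap_pages);
}
```

The design point is that "how hard to try" lives in the allocator, while the caller's contract is only "I got back some count"; the count can shrink under memory pressure without the API changing shape.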
* Re: alloc_pages_bulk() 2021-02-11 12:26 ` alloc_pages_bulk() Jesper Dangaard Brouer @ 2021-02-15 12:00 ` Mel Gorman 2021-02-15 16:10 ` alloc_pages_bulk() Jesper Dangaard Brouer 0 siblings, 1 reply; 23+ messages in thread From: Mel Gorman @ 2021-02-15 12:00 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Chuck Lever, Mel Gorman, Linux NFS Mailing List, linux-mm, Jakub Kicinski On Thu, Feb 11, 2021 at 01:26:28PM +0100, Jesper Dangaard Brouer wrote: > > Parameters to __rmqueue_pcplist are garbage as the parameter order changed. > > I'm surprised it didn't blow up in a spectacular fashion. Again, this > > hasn't been near any testing and passing a list with high orders to > > free_pages_bulk() will corrupt lists too. Mostly it's a curiousity to see > > if there is justification for reworking the allocator to fundamentally > > deal in batches and then feed batches to pcp lists and the bulk allocator > > while leaving the normal GFP API as single page "batches". While that > > would be ideal, it's relatively high risk for regressions. There is still > > some scope for adding a basic bulk allocator before considering a major > > refactoring effort. > > The alloc_flags reminds me that I have some asks around the semantics > of the API. I'm concerned about the latency impact on preemption. I > want us to avoid creating something that runs for too long with > IRQs/preempt disabled. > > (With SLUB kmem_cache_free_bulk() we manage to run most of the time with > preempt and IRQs enabled. So, I'm not worried about large slab bulk > free. For SLUB kmem_cache_alloc_bulk() we run with local_irq_disable(), > so I always recommend users not to do excessive bulk-alloc.) > The length of time IRQs/preempt disabled are partially related to the bulk API but the deeper concern is how long the IRQs are disabled within the page allocator in general. 
Sometimes they are disabled for list manipulations but the duration is
longer than it has to be for per-cpu stat updates and that may be
unnecessary for all arches and configurations. Some may need IRQs
disabled but others may only need preempt disabled for the counters.
That's a more serious problem overall than just the bulk allocator.

> For this page bulk alloc API, I'm fine with limiting it to only support
> order-0 pages. (This will also fit nicely with the PCP system, I think.)

Ok.

> I also suggest the API can return fewer pages than requested, because I
> want to "exit"/return if it needs to go into an expensive code path
> (like the buddy allocator or compaction). I'm assuming we have flags to
> give us this behavior (via gfp_flags or alloc_flags)?

The API returns the number of pages returned on a list so policies
around how aggressive it should be allocating the requested number of
pages could be adjusted without changing the API. Passing in policy
requests via gfp_flags may be problematic as most (all?) bits are
already used.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: alloc_pages_bulk() 2021-02-15 12:00 ` alloc_pages_bulk() Mel Gorman @ 2021-02-15 16:10 ` Jesper Dangaard Brouer 2021-02-22 9:42 ` alloc_pages_bulk() Mel Gorman 0 siblings, 1 reply; 23+ messages in thread From: Jesper Dangaard Brouer @ 2021-02-15 16:10 UTC (permalink / raw) To: Mel Gorman Cc: Chuck Lever, Mel Gorman, Linux NFS Mailing List, linux-mm, Jakub Kicinski, brouer On Mon, 15 Feb 2021 12:00:56 +0000 Mel Gorman <mgorman@techsingularity.net> wrote: > On Thu, Feb 11, 2021 at 01:26:28PM +0100, Jesper Dangaard Brouer wrote: [...] > > > I also suggest the API can return less pages than requested. Because I > > want to to "exit"/return if it need to go into an expensive code path > > (like buddy allocator or compaction). I'm assuming we have a flags to > > give us this behavior (via gfp_flags or alloc_flags)? > > > > The API returns the number of pages returned on a list so policies > around how aggressive it should be allocating the requested number of > pages could be adjusted without changing the API. Passing in policy > requests via gfp_flags may be problematic as most (all?) bits are > already used. Well, I was just thinking that I would use GFP_ATOMIC instead of GFP_KERNEL to "communicate" that I don't want this call to take too long (like sleeping). I'm not requesting any fancy policy :-) For page_pool use case we use (GFP_ATOMIC | __GFP_NOWARN) flags. static inline struct page *page_pool_dev_alloc_pages(struct page_pool *pool) { gfp_t gfp = (GFP_ATOMIC | __GFP_NOWARN); return page_pool_alloc_pages(pool, gfp); } -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: alloc_pages_bulk() 2021-02-15 16:10 ` alloc_pages_bulk() Jesper Dangaard Brouer @ 2021-02-22 9:42 ` Mel Gorman 2021-02-22 11:42 ` alloc_pages_bulk() Jesper Dangaard Brouer 0 siblings, 1 reply; 23+ messages in thread From: Mel Gorman @ 2021-02-22 9:42 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Chuck Lever, Mel Gorman, Linux NFS Mailing List, linux-mm, Jakub Kicinski

On Mon, Feb 15, 2021 at 05:10:38PM +0100, Jesper Dangaard Brouer wrote:
> On Mon, 15 Feb 2021 12:00:56 +0000
> Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> > On Thu, Feb 11, 2021 at 01:26:28PM +0100, Jesper Dangaard Brouer wrote:
> [...]
> > > I also suggest the API can return fewer pages than requested, because I
> > > want to "exit"/return if it needs to go into an expensive code path
> > > (like the buddy allocator or compaction). I'm assuming we have flags to
> > > give us this behavior (via gfp_flags or alloc_flags)?
> >
> > The API returns the number of pages returned on a list so policies
> > around how aggressive it should be allocating the requested number of
> > pages could be adjusted without changing the API. Passing in policy
> > requests via gfp_flags may be problematic as most (all?) bits are
> > already used.
> 
> Well, I was just thinking that I would use GFP_ATOMIC instead of
> GFP_KERNEL to "communicate" that I don't want this call to take too
> long (like sleeping). I'm not requesting any fancy policy :-)

The NFS use case requires opposite semantics -- it really needs those
allocations to succeed:
https://lore.kernel.org/r/161340498400.7780.962495219428962117.stgit@klimt.1015granger.net

I've asked what code it's based on as it's not 5.11 and I'll iron that
out first.

Then it might be clearer what the "can fail" semantics should look like.
I think it would be best to have pairs of patches where the first patch
adjusts the semantics of the bulk allocator and the second adds a user.
That will limit the amount of code carried in the implementation.
When the initial users are in place then the implementation can be
optimised as the optimisations will require significant refactoring and
I do not want to refactor multiple times.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: alloc_pages_bulk() 2021-02-22 9:42 ` alloc_pages_bulk() Mel Gorman @ 2021-02-22 11:42 ` Jesper Dangaard Brouer 2021-02-22 14:08 ` alloc_pages_bulk() Mel Gorman 0 siblings, 1 reply; 23+ messages in thread From: Jesper Dangaard Brouer @ 2021-02-22 11:42 UTC (permalink / raw) To: Mel Gorman Cc: Chuck Lever, Mel Gorman, Linux NFS Mailing List, linux-mm, Jakub Kicinski, brouer, netdev On Mon, 22 Feb 2021 09:42:56 +0000 Mel Gorman <mgorman@techsingularity.net> wrote: > On Mon, Feb 15, 2021 at 05:10:38PM +0100, Jesper Dangaard Brouer wrote: > > > > On Mon, 15 Feb 2021 12:00:56 +0000 > > Mel Gorman <mgorman@techsingularity.net> wrote: > > > > > On Thu, Feb 11, 2021 at 01:26:28PM +0100, Jesper Dangaard Brouer wrote: > > [...] > > > > > > > I also suggest the API can return less pages than requested. Because I > > > > want to to "exit"/return if it need to go into an expensive code path > > > > (like buddy allocator or compaction). I'm assuming we have a flags to > > > > give us this behavior (via gfp_flags or alloc_flags)? > > > > > > > > > > The API returns the number of pages returned on a list so policies > > > around how aggressive it should be allocating the requested number of > > > pages could be adjusted without changing the API. Passing in policy > > > requests via gfp_flags may be problematic as most (all?) bits are > > > already used. > > > > Well, I was just thinking that I would use GFP_ATOMIC instead of > > GFP_KERNEL to "communicate" that I don't want this call to take too > > long (like sleeping). I'm not requesting any fancy policy :-) > > > > The NFS use case requires opposite semantics > -- it really needs those allocations to succeed > https://lore.kernel.org/r/161340498400.7780.962495219428962117.stgit@klimt.1015granger.net. Sorry, but that is not how I understand the code. The code is doing exactly what I'm requesting. If the alloc_pages_bulk() doesn't return expected number of pages, then check if others need to run. 
The old code did schedule_timeout(msecs_to_jiffies(500)), while
Chuck's patch changed this to ask for cond_resched(). Thus, it tries to
avoid blocking the CPU for too long (when allocating many pages).

And the nfsd code seems to handle that the code can be interrupted (via
return -EINTR) via signal_pending(current). Thus, the nfsd code seems
to be able to handle the case where the page allocations fail.

> I've asked what code it's based on as it's not 5.11 and I'll iron that
> out first.
> 
> Then it might be clearer what the "can fail" semantics should look like.
> I think it would be best to have pairs of patches where the first patch
> adjusts the semantics of the bulk allocator and the second adds a user.
> That will limit the amount of code carried in the implementation.
> When the initial users are in place then the implementation can be
> optimised as the optimisations will require significant refactoring and
> I do not want to refactor multiple times.

I guess I should try to code up the usage in page_pool.

What is the latest patch for adding alloc_pages_bulk()?

The nfsd code (svc_alloc_arg) is called in a context where it can
sleep, and thus uses GFP_KERNEL. In most cases the page_pool will be
called with GFP_ATOMIC. I don't think I/page_pool will retry the call
like Chuck did, as I cannot (re)schedule others to run.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 23+ messages in thread
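The svc_alloc_arg() behaviour described above — call the bulk allocator in a loop, yield between short results, and bail out rather than spin forever — can be modeled in userspace. Everything here is an illustrative stand-in, not SUNRPC code: `fake_pool` plays the allocator, and the attempt budget plays the role of signal_pending() cutting the loop short with -EINTR.

```c
#include <assert.h>
#include <stddef.h>

struct fake_pool {
	size_t per_call;	/* pages the fake allocator yields per attempt */
};

/* Stand-in bulk allocator: returns at most per_call pages. */
static size_t fake_bulk(const struct fake_pool *pool, size_t nr)
{
	return nr < pool->per_call ? nr : pool->per_call;
}

/* Keep calling the bulk allocator until 'needed' pages are gathered.
 * Returns the number of attempts used, or -1 when the attempt budget
 * (standing in for signal_pending()) runs out first. */
static int fill_pages(const struct fake_pool *pool, size_t needed,
		      int max_attempts)
{
	size_t have = 0;
	int attempts = 0;

	while (have < needed) {
		if (attempts >= max_attempts)
			return -1;	/* like returning -EINTR */
		attempts++;
		have += fake_bulk(pool, needed - have);
		/* cond_resched() would sit here between short results */
	}
	return attempts;
}

/* Convenience wrapper for self-checks. */
static int demo_fill(size_t per_call, size_t needed, int max_attempts)
{
	struct fake_pool pool = { per_call };

	return fill_pages(&pool, needed, max_attempts);
}
```

The point of the structure is that a short bulk result is not an error: the caller just tops up on the next pass, yielding the CPU in between, and only a signal ends the loop early.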
* Re: alloc_pages_bulk() 2021-02-22 11:42 ` alloc_pages_bulk() Jesper Dangaard Brouer @ 2021-02-22 14:08 ` Mel Gorman 0 siblings, 0 replies; 23+ messages in thread From: Mel Gorman @ 2021-02-22 14:08 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Chuck Lever, Mel Gorman, Linux NFS Mailing List, linux-mm, Jakub Kicinski, netdev On Mon, Feb 22, 2021 at 12:42:46PM +0100, Jesper Dangaard Brouer wrote: > On Mon, 22 Feb 2021 09:42:56 +0000 > Mel Gorman <mgorman@techsingularity.net> wrote: > > > On Mon, Feb 15, 2021 at 05:10:38PM +0100, Jesper Dangaard Brouer wrote: > > > > > > On Mon, 15 Feb 2021 12:00:56 +0000 > > > Mel Gorman <mgorman@techsingularity.net> wrote: > > > > > > > On Thu, Feb 11, 2021 at 01:26:28PM +0100, Jesper Dangaard Brouer wrote: > > > [...] > > > > > > > > > I also suggest the API can return less pages than requested. Because I > > > > > want to to "exit"/return if it need to go into an expensive code path > > > > > (like buddy allocator or compaction). I'm assuming we have a flags to > > > > > give us this behavior (via gfp_flags or alloc_flags)? > > > > > > > > > > > > > The API returns the number of pages returned on a list so policies > > > > around how aggressive it should be allocating the requested number of > > > > pages could be adjusted without changing the API. Passing in policy > > > > requests via gfp_flags may be problematic as most (all?) bits are > > > > already used. > > > > > > Well, I was just thinking that I would use GFP_ATOMIC instead of > > > GFP_KERNEL to "communicate" that I don't want this call to take too > > > long (like sleeping). I'm not requesting any fancy policy :-) > > > > > > > The NFS use case requires opposite semantics > > -- it really needs those allocations to succeed > > https://lore.kernel.org/r/161340498400.7780.962495219428962117.stgit@klimt.1015granger.net. > > Sorry, but that is not how I understand the code. > > The code is doing exactly what I'm requesting. 
If the alloc_pages_bulk() > doesn't return expected number of pages, then check if others need to > run. The old code did schedule_timeout(msecs_to_jiffies(500)), while > Chuck's patch change this to ask for cond_resched(). Thus, it tries to > avoid blocking the CPU for too long (when allocating many pages). > > And the nfsd code seems to handle that the code can be interrupted (via > return -EINTR) via signal_pending(current). Thus, the nfsd code seems > to be able to handle if the page allocations failed. > I'm waiting to find out exactly what NFSD is currently doing as the code in 5.11 is not the same as what Chuck was coding against so I'm not 100% certain how it currently works. > > > I've asked what code it's based on as it's not 5.11 and I'll iron that > > out first. > > > > Then it might be clearer what the "can fail" semantics should look like. > > I think it would be best to have pairs of patches where the first patch > > adjusts the semantics of the bulk allocator and the second adds a user. > > That will limit the amount of code code carried in the implementation. > > When the initial users are in place then the implementation can be > > optimised as the optimisations will require significant refactoring and > > I not want to refactor multiple times. > > I guess, I should try to code-up the usage in page_pool. > > What is the latest patch for adding alloc_pages_bulk() ? > There isn't a usable latest version until I reconcile the nfsd caller. The only major change in the API right now is dropping order. It handles order-0 only. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: alloc_pages_bulk() 2021-02-11 9:12 ` alloc_pages_bulk() Mel Gorman 2021-02-11 12:26 ` alloc_pages_bulk() Jesper Dangaard Brouer @ 2021-02-11 16:20 ` Chuck Lever 2021-02-15 12:06 ` alloc_pages_bulk() Mel Gorman 1 sibling, 1 reply; 23+ messages in thread From: Chuck Lever @ 2021-02-11 16:20 UTC (permalink / raw) To: Mel Gorman Cc: Mel Gorman, Jesper Dangaard Brouer, Linux NFS Mailing List, linux-mm, Jakub Kicinski > On Feb 11, 2021, at 4:12 AM, Mel Gorman <mgorman@techsingularity.net> wrote: > > On Wed, Feb 10, 2021 at 10:58:37PM +0000, Chuck Lever wrote: >>> Not in the short term due to bug load and other obligations. >>> >>> The original series had "mm, page_allocator: Only use per-cpu allocator >>> for irq-safe requests" but that was ultimately rejected because softirqs >>> were affected so it would have to be done without that patch. >>> >>> The last patch can be rebased easily enough but it only batch allocates >>> order-0 pages. It's also only build tested and could be completely >>> miserable in practice and as I didn't even try boot test let, let alone >>> actually test it, it could be a giant pile of crap. To make high orders >>> work, it would need significant reworking but if the API showed even >>> partial benefit, it might motiviate someone to reimplement the bulk >>> interfaces to perform better. >>> >>> Rebased diff, build tested only, might not even work >> >> Thanks, Mel, for kicking off a forward port. >> >> It compiles. I've added a patch to replace the page allocation loop >> in svc_alloc_arg() with a call to alloc_pages_bulk(). >> >> The server system deadlocks pretty quickly with any NFS traffic. Based >> on some initial debugging, it appears that a pcplist is getting corrupted >> and this causes the list_del() in __rmqueue_pcplist() to fail during a >> a call to alloc_pages_bulk(). >> > > Parameters to __rmqueue_pcplist are garbage as the parameter order changed. > I'm surprised it didn't blow up in a spectacular fashion. 
Again, this > hasn't been near any testing and passing a list with high orders to > free_pages_bulk() will corrupt lists too. Mostly it's a curiousity to see > if there is justification for reworking the allocator to fundamentally > deal in batches and then feed batches to pcp lists and the bulk allocator > while leaving the normal GFP API as single page "batches". While that > would be ideal, it's relatively high risk for regressions. There is still > some scope for adding a basic bulk allocator before considering a major > refactoring effort. > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index f8353ea7b977..8f3fe7de2cf7 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -5892,7 +5892,7 @@ __alloc_pages_bulk_nodemask(gfp_t gfp_mask, unsigned int order, > pcp_list = &pcp->lists[migratetype]; > > while (nr_pages) { > - page = __rmqueue_pcplist(zone, gfp_mask, migratetype, > + page = __rmqueue_pcplist(zone, migratetype, alloc_flags, > pcp, pcp_list); > if (!page) > break; The NFS server is considerably more stable now. Thank you! I confirmed that my patch is requesting and getting multiple pages. The new NFSD code and the API seem to be working as expected. The results are stunning. Each svc_alloc_arg() call here allocates 65 pages to satisfy a 256KB NFS READ request. 
Before:

nfsd-972 [000] 584.513817: funcgraph_entry: + 35.385 us | svc_alloc_arg();
nfsd-979 [002] 584.513870: funcgraph_entry: + 29.051 us | svc_alloc_arg();
nfsd-980 [001] 584.513951: funcgraph_entry: + 29.178 us | svc_alloc_arg();
nfsd-983 [000] 584.514014: funcgraph_entry: + 29.211 us | svc_alloc_arg();
nfsd-976 [002] 584.514059: funcgraph_entry: + 29.315 us | svc_alloc_arg();
nfsd-974 [001] 584.514127: funcgraph_entry: + 29.237 us | svc_alloc_arg();

After:

nfsd-977 [002] 87.049425: funcgraph_entry: 4.293 us | svc_alloc_arg();
nfsd-981 [000] 87.049478: funcgraph_entry: 4.059 us | svc_alloc_arg();
nfsd-988 [001] 87.049549: funcgraph_entry: 4.474 us | svc_alloc_arg();
nfsd-983 [003] 87.049612: funcgraph_entry: 3.819 us | svc_alloc_arg();
nfsd-976 [000] 87.049619: funcgraph_entry: 3.869 us | svc_alloc_arg();
nfsd-980 [002] 87.049738: funcgraph_entry: 4.124 us | svc_alloc_arg();
nfsd-975 [000] 87.049769: funcgraph_entry: 3.734 us | svc_alloc_arg();

There appears to be little cost change for single-page allocations
using the bulk allocator (nr_pages=1):

Before:

nfsd-985 [003] 572.324517: funcgraph_entry: 0.332 us | svc_alloc_arg();
nfsd-986 [001] 572.324531: funcgraph_entry: 0.311 us | svc_alloc_arg();
nfsd-985 [003] 572.324701: funcgraph_entry: 0.311 us | svc_alloc_arg();
nfsd-986 [001] 572.324727: funcgraph_entry: 0.424 us | svc_alloc_arg();
nfsd-985 [003] 572.324760: funcgraph_entry: 0.332 us | svc_alloc_arg();
nfsd-986 [001] 572.324786: funcgraph_entry: 0.390 us | svc_alloc_arg();

After:

nfsd-989 [002] 75.043226: funcgraph_entry: 0.322 us | svc_alloc_arg();
nfsd-988 [001] 75.043436: funcgraph_entry: 0.368 us | svc_alloc_arg();
nfsd-989 [002] 75.043464: funcgraph_entry: 0.424 us | svc_alloc_arg();
nfsd-988 [001] 75.043490: funcgraph_entry: 0.317 us | svc_alloc_arg();
nfsd-989 [002] 75.043517: funcgraph_entry: 0.425 us | svc_alloc_arg();
nfsd-988 [001] 75.050025: funcgraph_entry: 0.407 us | svc_alloc_arg();

-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: alloc_pages_bulk() 2021-02-11 16:20 ` alloc_pages_bulk() Chuck Lever @ 2021-02-15 12:06 ` Mel Gorman 2021-02-15 16:00 ` alloc_pages_bulk() Chuck Lever 2021-02-22 20:44 ` alloc_pages_bulk() Jesper Dangaard Brouer 0 siblings, 2 replies; 23+ messages in thread From: Mel Gorman @ 2021-02-15 12:06 UTC (permalink / raw) To: Chuck Lever Cc: Mel Gorman, Jesper Dangaard Brouer, Linux NFS Mailing List, linux-mm, Jakub Kicinski On Thu, Feb 11, 2021 at 04:20:31PM +0000, Chuck Lever wrote: > > On Feb 11, 2021, at 4:12 AM, Mel Gorman <mgorman@techsingularity.net> wrote: > > > > <SNIP> > > > > Parameters to __rmqueue_pcplist are garbage as the parameter order changed. > > I'm surprised it didn't blow up in a spectacular fashion. Again, this > > hasn't been near any testing and passing a list with high orders to > > free_pages_bulk() will corrupt lists too. Mostly it's a curiousity to see > > if there is justification for reworking the allocator to fundamentally > > deal in batches and then feed batches to pcp lists and the bulk allocator > > while leaving the normal GFP API as single page "batches". While that > > would be ideal, it's relatively high risk for regressions. There is still > > some scope for adding a basic bulk allocator before considering a major > > refactoring effort. > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index f8353ea7b977..8f3fe7de2cf7 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -5892,7 +5892,7 @@ __alloc_pages_bulk_nodemask(gfp_t gfp_mask, unsigned int order, > > pcp_list = &pcp->lists[migratetype]; > > > > while (nr_pages) { > > - page = __rmqueue_pcplist(zone, gfp_mask, migratetype, > > + page = __rmqueue_pcplist(zone, migratetype, alloc_flags, > > pcp, pcp_list); > > if (!page) > > break; > > The NFS server is considerably more stable now. Thank you! > Thanks for testing! > I confirmed that my patch is requesting and getting multiple pages. > The new NFSD code and the API seem to be working as expected. 
> > The results are stunning. Each svc_alloc_arg() call here allocates > 65 pages to satisfy a 256KB NFS READ request. > > Before: > > nfsd-972 [000] 584.513817: funcgraph_entry: + 35.385 us | svc_alloc_arg(); > nfsd-979 [002] 584.513870: funcgraph_entry: + 29.051 us | svc_alloc_arg(); > nfsd-980 [001] 584.513951: funcgraph_entry: + 29.178 us | svc_alloc_arg(); > nfsd-983 [000] 584.514014: funcgraph_entry: + 29.211 us | svc_alloc_arg(); > nfsd-976 [002] 584.514059: funcgraph_entry: + 29.315 us | svc_alloc_arg(); > nfsd-974 [001] 584.514127: funcgraph_entry: + 29.237 us | svc_alloc_arg(); > > After: > > nfsd-977 [002] 87.049425: funcgraph_entry: 4.293 us | svc_alloc_arg(); > nfsd-981 [000] 87.049478: funcgraph_entry: 4.059 us | svc_alloc_arg(); > nfsd-988 [001] 87.049549: funcgraph_entry: 4.474 us | svc_alloc_arg(); > nfsd-983 [003] 87.049612: funcgraph_entry: 3.819 us | svc_alloc_arg(); > nfsd-976 [000] 87.049619: funcgraph_entry: 3.869 us | svc_alloc_arg(); > nfsd-980 [002] 87.049738: funcgraph_entry: 4.124 us | svc_alloc_arg(); > nfsd-975 [000] 87.049769: funcgraph_entry: 3.734 us | svc_alloc_arg(); > Uhhhh, that is much better than I expected given how lame the implementation is. Sure -- it works, but it has more overhead than it should with the downside that reducing it requires fairly deep surgery. It may be enough to tidy this up to handle order-0 pages only to start with and see how far it gets. That's a fairly trivial modification. 
> There appears to be little cost change for single-page allocations > using the bulk allocator (nr_pages=1): > > Before: > > nfsd-985 [003] 572.324517: funcgraph_entry: 0.332 us | svc_alloc_arg(); > nfsd-986 [001] 572.324531: funcgraph_entry: 0.311 us | svc_alloc_arg(); > nfsd-985 [003] 572.324701: funcgraph_entry: 0.311 us | svc_alloc_arg(); > nfsd-986 [001] 572.324727: funcgraph_entry: 0.424 us | svc_alloc_arg(); > nfsd-985 [003] 572.324760: funcgraph_entry: 0.332 us | svc_alloc_arg(); > nfsd-986 [001] 572.324786: funcgraph_entry: 0.390 us | svc_alloc_arg(); > > After: > > nfsd-989 [002] 75.043226: funcgraph_entry: 0.322 us | svc_alloc_arg(); > nfsd-988 [001] 75.043436: funcgraph_entry: 0.368 us | svc_alloc_arg(); > nfsd-989 [002] 75.043464: funcgraph_entry: 0.424 us | svc_alloc_arg(); > nfsd-988 [001] 75.043490: funcgraph_entry: 0.317 us | svc_alloc_arg(); > nfsd-989 [002] 75.043517: funcgraph_entry: 0.425 us | svc_alloc_arg(); > nfsd-988 [001] 75.050025: funcgraph_entry: 0.407 us | svc_alloc_arg(); > That is not too surprising given that there would be some additional overhead to manage a list of 1 page. I would hope that users of the bulk allocator are not routinely calling it with nr_pages == 1. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: alloc_pages_bulk()
  2021-02-15 12:06 ` alloc_pages_bulk() Mel Gorman
@ 2021-02-15 16:00   ` Chuck Lever
  2021-02-22 20:44   ` alloc_pages_bulk() Jesper Dangaard Brouer
  1 sibling, 0 replies; 23+ messages in thread
From: Chuck Lever @ 2021-02-15 16:00 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Mel Gorman, Jesper Dangaard Brouer, Linux NFS Mailing List,
	linux-mm, Jakub Kicinski

> On Feb 15, 2021, at 7:06 AM, Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> On Thu, Feb 11, 2021 at 04:20:31PM +0000, Chuck Lever wrote:
>>> On Feb 11, 2021, at 4:12 AM, Mel Gorman <mgorman@techsingularity.net> wrote:
>>> 
>>> <SNIP>
>>> 
>>> Parameters to __rmqueue_pcplist are garbage as the parameter order changed.
>>> I'm surprised it didn't blow up in a spectacular fashion. Again, this
>>> hasn't been near any testing and passing a list with high orders to
>>> free_pages_bulk() will corrupt lists too. Mostly it's a curiosity to see
>>> if there is justification for reworking the allocator to fundamentally
>>> deal in batches and then feed batches to pcp lists and the bulk allocator
>>> while leaving the normal GFP API as single page "batches". While that
>>> would be ideal, it's relatively high risk for regressions. There is still
>>> some scope for adding a basic bulk allocator before considering a major
>>> refactoring effort.
>>> 
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index f8353ea7b977..8f3fe7de2cf7 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -5892,7 +5892,7 @@ __alloc_pages_bulk_nodemask(gfp_t gfp_mask, unsigned int order,
>>> 		pcp_list = &pcp->lists[migratetype];
>>> 
>>> 		while (nr_pages) {
>>> -			page = __rmqueue_pcplist(zone, gfp_mask, migratetype,
>>> +			page = __rmqueue_pcplist(zone, migratetype, alloc_flags,
>>> 					pcp, pcp_list);
>>> 			if (!page)
>>> 				break;
>> 
>> The NFS server is considerably more stable now. Thank you!
>> 
> 
> Thanks for testing!
> 
>> I confirmed that my patch is requesting and getting multiple pages.
>> The new NFSD code and the API seem to be working as expected.
>> 
>> The results are stunning. Each svc_alloc_arg() call here allocates
>> 65 pages to satisfy a 256KB NFS READ request.
>> 
>> Before:
>> 
>> nfsd-972 [000] 584.513817: funcgraph_entry: + 35.385 us | svc_alloc_arg();
>> nfsd-979 [002] 584.513870: funcgraph_entry: + 29.051 us | svc_alloc_arg();
>> nfsd-980 [001] 584.513951: funcgraph_entry: + 29.178 us | svc_alloc_arg();
>> nfsd-983 [000] 584.514014: funcgraph_entry: + 29.211 us | svc_alloc_arg();
>> nfsd-976 [002] 584.514059: funcgraph_entry: + 29.315 us | svc_alloc_arg();
>> nfsd-974 [001] 584.514127: funcgraph_entry: + 29.237 us | svc_alloc_arg();
>> 
>> After:
>> 
>> nfsd-977 [002] 87.049425: funcgraph_entry: 4.293 us | svc_alloc_arg();
>> nfsd-981 [000] 87.049478: funcgraph_entry: 4.059 us | svc_alloc_arg();
>> nfsd-988 [001] 87.049549: funcgraph_entry: 4.474 us | svc_alloc_arg();
>> nfsd-983 [003] 87.049612: funcgraph_entry: 3.819 us | svc_alloc_arg();
>> nfsd-976 [000] 87.049619: funcgraph_entry: 3.869 us | svc_alloc_arg();
>> nfsd-980 [002] 87.049738: funcgraph_entry: 4.124 us | svc_alloc_arg();
>> nfsd-975 [000] 87.049769: funcgraph_entry: 3.734 us | svc_alloc_arg();
>> 
> 
> Uhhhh, that is much better than I expected given how lame the
> implementation is.

My experience with function tracing is that the entry and exit
timestamping adds significant overhead. I'd bet the actual timing
improvement is still good, but much smaller.

> Sure -- it works, but it has more overhead than it
> should with the downside that reducing it requires fairly deep surgery. It
> may be enough to tidy this up to handle order-0 pages only to start with
> and see how far it gets.

That's a fairly trivial modification. I'd like to see an "order-0 only"
implementation go in soon. The improvement is palpable, and there are
several worthy consumers on deck.

>> There appears to be little cost change for single-page allocations
>> using the bulk allocator (nr_pages=1):
>> 
>> Before:
>> 
>> nfsd-985 [003] 572.324517: funcgraph_entry: 0.332 us | svc_alloc_arg();
>> nfsd-986 [001] 572.324531: funcgraph_entry: 0.311 us | svc_alloc_arg();
>> nfsd-985 [003] 572.324701: funcgraph_entry: 0.311 us | svc_alloc_arg();
>> nfsd-986 [001] 572.324727: funcgraph_entry: 0.424 us | svc_alloc_arg();
>> nfsd-985 [003] 572.324760: funcgraph_entry: 0.332 us | svc_alloc_arg();
>> nfsd-986 [001] 572.324786: funcgraph_entry: 0.390 us | svc_alloc_arg();
>> 
>> After:
>> 
>> nfsd-989 [002] 75.043226: funcgraph_entry: 0.322 us | svc_alloc_arg();
>> nfsd-988 [001] 75.043436: funcgraph_entry: 0.368 us | svc_alloc_arg();
>> nfsd-989 [002] 75.043464: funcgraph_entry: 0.424 us | svc_alloc_arg();
>> nfsd-988 [001] 75.043490: funcgraph_entry: 0.317 us | svc_alloc_arg();
>> nfsd-989 [002] 75.043517: funcgraph_entry: 0.425 us | svc_alloc_arg();
>> nfsd-988 [001] 75.050025: funcgraph_entry: 0.407 us | svc_alloc_arg();
>> 
> 
> That is not too surprising given that there would be some additional
> overhead to manage a list of 1 page. I would hope that users of the bulk
> allocator are not routinely calling it with nr_pages == 1.

The NFSD implementation I did uses alloc_pages_bulk() to fill however
many pages are needed. Often that's just one page, and sometimes it's
zero pages. alloc_pages_bulk() does not behave very well when asked for
zero pages, so NFSD avoids calling it in that case.

I can post the patch for review. I cleaned it up recently but haven't
had a chance to test the clean-ups, so it might not work in its
current state.

-- 
Chuck Lever
* Re: alloc_pages_bulk()
  2021-02-15 12:06 ` alloc_pages_bulk() Mel Gorman
  2021-02-15 16:00   ` alloc_pages_bulk() Chuck Lever
@ 2021-02-22 20:44   ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 23+ messages in thread
From: Jesper Dangaard Brouer @ 2021-02-22 20:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Chuck Lever, Mel Gorman, Linux NFS Mailing List, linux-mm,
	brouer, netdev

On Mon, 15 Feb 2021 12:06:09 +0000
Mel Gorman <mgorman@techsingularity.net> wrote:

> On Thu, Feb 11, 2021 at 04:20:31PM +0000, Chuck Lever wrote:
> > > On Feb 11, 2021, at 4:12 AM, Mel Gorman <mgorman@techsingularity.net> wrote:
> > > 
> > > <SNIP>
> > > 
> > > Parameters to __rmqueue_pcplist are garbage as the parameter order changed.
> > > I'm surprised it didn't blow up in a spectacular fashion. Again, this
> > > hasn't been near any testing and passing a list with high orders to
> > > free_pages_bulk() will corrupt lists too. Mostly it's a curiosity to see
> > > if there is justification for reworking the allocator to fundamentally
> > > deal in batches and then feed batches to pcp lists and the bulk allocator
> > > while leaving the normal GFP API as single page "batches". While that
> > > would be ideal, it's relatively high risk for regressions. There is still
> > > some scope for adding a basic bulk allocator before considering a major
> > > refactoring effort.
> > > 
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index f8353ea7b977..8f3fe7de2cf7 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -5892,7 +5892,7 @@ __alloc_pages_bulk_nodemask(gfp_t gfp_mask, unsigned int order,
> > > 		pcp_list = &pcp->lists[migratetype];
> > > 
> > > 		while (nr_pages) {
> > > -			page = __rmqueue_pcplist(zone, gfp_mask, migratetype,
> > > +			page = __rmqueue_pcplist(zone, migratetype, alloc_flags,
> > > 					pcp, pcp_list);
> > > 			if (!page)
> > > 				break;
> > 
> > The NFS server is considerably more stable now. Thank you!
> > 
> 
> Thanks for testing!

I've done some testing here:
 https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org

Performance summary:
 - Before: 3,677,958 pps
 - After : 4,066,028 pps

I'll describe/show the page_pool changes tomorrow.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
* Re: Fwd: alloc_pages_bulk()
  2021-02-08 17:50 ` Fwd: alloc_pages_bulk() Chuck Lever
  2021-02-09 10:31   ` alloc_pages_bulk() Jesper Dangaard Brouer
@ 2021-02-09 22:01   ` Matthew Wilcox
  2021-02-09 22:55     ` alloc_pages_bulk() Chuck Lever
  1 sibling, 1 reply; 23+ messages in thread
From: Matthew Wilcox @ 2021-02-09 22:01 UTC (permalink / raw)
  To: Chuck Lever; +Cc: mgorman, brouer, Linux NFS Mailing List, linux-mm

On Mon, Feb 08, 2021 at 05:50:51PM +0000, Chuck Lever wrote:
> > We've been discussing how NFSD can more efficiently refill its
> > receive buffers (currently alloc_page() in a loop; see
> > net/sunrpc/svc_xprt.c::svc_alloc_arg()).

I'm not familiar with the sunrpc architecture, but this feels like you're
trying to optimise something that shouldn't exist. Ideally a write
would ask the page cache for the pages that correspond to the portion
of the file which is being written to. I appreciate that doesn't work
well for, eg, NFS-over-TCP, but for NFS over any kind of RDMA, that
should be possible, right?
* Re: alloc_pages_bulk()
  2021-02-09 22:01 ` Fwd: alloc_pages_bulk() Matthew Wilcox
@ 2021-02-09 22:55   ` Chuck Lever
  0 siblings, 0 replies; 23+ messages in thread
From: Chuck Lever @ 2021-02-09 22:55 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: mgorman, brouer, Linux NFS Mailing List, linux-mm

> On Feb 9, 2021, at 5:01 PM, Matthew Wilcox <willy@infradead.org> wrote:
> 
> On Mon, Feb 08, 2021 at 05:50:51PM +0000, Chuck Lever wrote:
>>> We've been discussing how NFSD can more efficiently refill its
>>> receive buffers (currently alloc_page() in a loop; see
>>> net/sunrpc/svc_xprt.c::svc_alloc_arg()).
> 
> I'm not familiar with the sunrpc architecture, but this feels like you're
> trying to optimise something that shouldn't exist. Ideally a write
> would ask the page cache for the pages that correspond to the portion
> of the file which is being written to. I appreciate that doesn't work
> well for, eg, NFS-over-TCP, but for NFS over any kind of RDMA, that
> should be possible, right?

(Note there is room for improvement for both transport types.)

Since you asked ;-) there are four broad categories of NFSD I/O:

1. Receive an ingress RPC message (typically a Call)
2. Read from a file
3. Write to a file
4. Send an egress RPC message (typically a Reply)

A server RPC transaction is some combination of these, usually 1, 2,
and 4; or 1, 3, and 4.

To do 1, the server allocates a set of order-0 pages to form a receive
buffer and a set of order-0 pages to form a send buffer. We want to
handle this with bulk allocation. The Call is then received into the
receive buffer pages.

The receive buffer pages typically stay with the nfsd thread for its
lifetime, but the send buffer pages do not. We want to use a buffer
page size that matches the page cache size (see below) and also a size
small enough that makes allocation very unlikely to fail. The largest
transactions (NFS READ and WRITE) use up to 1MB worth of pages.

Category 2 can be done by copying the file's pages into the send
buffer pages, or it can be done via a splice. When a splice is done,
the send buffer pages allocated above are released first before being
replaced in the buffer with the file's pages.

Category 3 is currently done only by copying receive buffer pages to
file pages. Pages are neither allocated nor released by this category
of I/O. There are various reasons for this, but it's an area that
could stand some attention.

Sending (category 4) passes the send buffer pages to kernel_sendpage(),
which bumps the page count on them. When sendpage() returns, the server
does a put_page() on all of those pages, then goes back to category 1
to replace the consumed send buffer pages. When the network layer is
finished with the pages, it releases them.

There are two reasons I can see for this:

1. A network send isn't complete until the server gets an ACK from the
   client. This can take a while. I'm not aware of a TCP-provided
   mechanism to indicate when the ACK has arrived, so the server can't
   re-use those pages. (RDMA has an affirmative send completion event
   that we can use to reduce send buffer churn.)

2. If a splice was done, the send buffer pages that are also file pages
   can't be re-used for the next RPC send buffer because overwriting
   their content would corrupt the file. Thus they must also be
   released and replaced.

-- 
Chuck Lever