* [merged] mm-page_alloc-only-use-per-cpu-allocator-for-irq-safe-requests.patch removed from -mm tree
@ 2017-02-27 20:25 akpm
  2017-03-01 13:48 ` Page allocator order-0 optimizations merged Jesper Dangaard Brouer
  0 siblings, 1 reply; 63+ messages in thread
From: akpm @ 2017-02-27 20:25 UTC (permalink / raw)
  To: mgorman, brouer, hillf.zj, tglx, vbabka, mm-commits


The patch titled
     Subject: mm, page_alloc: only use per-cpu allocator for irq-safe requests
has been removed from the -mm tree.  Its filename was
     mm-page_alloc-only-use-per-cpu-allocator-for-irq-safe-requests.patch

This patch was dropped because it was merged into mainline or a subsystem tree

------------------------------------------------------
From: Mel Gorman <mgorman@techsingularity.net>
Subject: mm, page_alloc: only use per-cpu allocator for irq-safe requests

Many workloads that allocate pages are not handling an interrupt at the
time of allocation.  However, because allocation requests may come from IRQ
context, IRQs are disabled and re-enabled around every page allocation.
This cost is the bulk of the free path and a significant percentage of the
allocation path.

This patch alters the locking and checks such that only irq-safe
allocation requests use the per-cpu allocator.  All others acquire the
irq-safe zone->lock and allocate from the buddy allocator.  It relies on
disabling preemption to safely access the per-cpu structures.  It could be
modified slightly to prevent soft IRQs from using it, but it's not clear
that would be worthwhile.
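
Condensed into a sketch, the split between the two paths looks roughly like
this (control flow only, paraphrased from the diff below; not the literal
kernel code):

  /*
   * Order-0 fast path after this patch.  The per-cpu (pcp) lists are only
   * touched from !in_interrupt() context, so preempt_disable() is enough
   * to protect them; every other request falls back to the buddy
   * allocator under the now irq-safe zone->lock.
   */
  struct page *rmqueue(struct zone *preferred_zone, struct zone *zone,
                       unsigned int order, gfp_t gfp_flags, int migratetype)
  {
          unsigned long flags;
          struct page *page = NULL;

          if (likely(order == 0) && !in_interrupt()) {
                  preempt_disable();              /* was local_irq_save() */
                  /* ... __rmqueue_pcplist() takes a page off this CPU's pcp list ... */
                  preempt_enable();               /* was local_irq_restore() */
                  return page;
          }

          spin_lock_irqsave(&zone->lock, flags);  /* zone->lock users are irq-safe */
          /* ... __rmqueue() allocates from the buddy free lists ... */
          spin_unlock_irqrestore(&zone->lock, flags);
          return page;
  }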

This modification may slow allocations from IRQ context slightly but the
main gain from the per-cpu allocator is that it scales better for
allocations from multiple contexts.  There is an implicit assumption that
intensive allocations from IRQ contexts on multiple CPUs from a single
NUMA node are rare and that the vast majority of scaling issues are
encountered in !IRQ contexts such as page faulting.  It's worth noting
that this patch is not required for a bulk page allocator but it
significantly reduces the overhead.

The following are results from a page allocator micro-benchmark.  Only
order-0 is interesting, as higher orders do not use the per-cpu allocator.

                                          4.10.0-rc2                 4.10.0-rc2
                                             vanilla               irqsafe-v1r5
Amean    alloc-odr0-1               287.15 (  0.00%)           219.00 ( 23.73%)
Amean    alloc-odr0-2               221.23 (  0.00%)           183.23 ( 17.18%)
Amean    alloc-odr0-4               187.00 (  0.00%)           151.38 ( 19.05%)
Amean    alloc-odr0-8               167.54 (  0.00%)           132.77 ( 20.75%)
Amean    alloc-odr0-16              156.00 (  0.00%)           123.00 ( 21.15%)
Amean    alloc-odr0-32              149.00 (  0.00%)           118.31 ( 20.60%)
Amean    alloc-odr0-64              138.77 (  0.00%)           116.00 ( 16.41%)
Amean    alloc-odr0-128             145.00 (  0.00%)           118.00 ( 18.62%)
Amean    alloc-odr0-256             136.15 (  0.00%)           125.00 (  8.19%)
Amean    alloc-odr0-512             147.92 (  0.00%)           121.77 ( 17.68%)
Amean    alloc-odr0-1024            147.23 (  0.00%)           126.15 ( 14.32%)
Amean    alloc-odr0-2048            155.15 (  0.00%)           129.92 ( 16.26%)
Amean    alloc-odr0-4096            164.00 (  0.00%)           136.77 ( 16.60%)
Amean    alloc-odr0-8192            166.92 (  0.00%)           138.08 ( 17.28%)
Amean    alloc-odr0-16384           159.00 (  0.00%)           138.00 ( 13.21%)
Amean    free-odr0-1                165.00 (  0.00%)            89.00 ( 46.06%)
Amean    free-odr0-2                113.00 (  0.00%)            63.00 ( 44.25%)
Amean    free-odr0-4                 99.00 (  0.00%)            54.00 ( 45.45%)
Amean    free-odr0-8                 88.00 (  0.00%)            47.38 ( 46.15%)
Amean    free-odr0-16                83.00 (  0.00%)            46.00 ( 44.58%)
Amean    free-odr0-32                80.00 (  0.00%)            44.38 ( 44.52%)
Amean    free-odr0-64                72.62 (  0.00%)            43.00 ( 40.78%)
Amean    free-odr0-128               78.00 (  0.00%)            42.00 ( 46.15%)
Amean    free-odr0-256               80.46 (  0.00%)            57.00 ( 29.16%)
Amean    free-odr0-512               96.38 (  0.00%)            64.69 ( 32.88%)
Amean    free-odr0-1024             107.31 (  0.00%)            72.54 ( 32.40%)
Amean    free-odr0-2048             108.92 (  0.00%)            78.08 ( 28.32%)
Amean    free-odr0-4096             113.38 (  0.00%)            82.23 ( 27.48%)
Amean    free-odr0-8192             112.08 (  0.00%)            82.85 ( 26.08%)
Amean    free-odr0-16384            110.38 (  0.00%)            81.92 ( 25.78%)
Amean    total-odr0-1               452.15 (  0.00%)           308.00 ( 31.88%)
Amean    total-odr0-2               334.23 (  0.00%)           246.23 ( 26.33%)
Amean    total-odr0-4               286.00 (  0.00%)           205.38 ( 28.19%)
Amean    total-odr0-8               255.54 (  0.00%)           180.15 ( 29.50%)
Amean    total-odr0-16              239.00 (  0.00%)           169.00 ( 29.29%)
Amean    total-odr0-32              229.00 (  0.00%)           162.69 ( 28.96%)
Amean    total-odr0-64              211.38 (  0.00%)           159.00 ( 24.78%)
Amean    total-odr0-128             223.00 (  0.00%)           160.00 ( 28.25%)
Amean    total-odr0-256             216.62 (  0.00%)           182.00 ( 15.98%)
Amean    total-odr0-512             244.31 (  0.00%)           186.46 ( 23.68%)
Amean    total-odr0-1024            254.54 (  0.00%)           198.69 ( 21.94%)
Amean    total-odr0-2048            264.08 (  0.00%)           208.00 ( 21.24%)
Amean    total-odr0-4096            277.38 (  0.00%)           219.00 ( 21.05%)
Amean    total-odr0-8192            279.00 (  0.00%)           220.92 ( 20.82%)
Amean    total-odr0-16384           269.38 (  0.00%)           219.92 ( 18.36%)

This is the alloc, free and total overhead of allocating order-0 pages in
batches of 1 page up to 16384 pages.  Avoiding the IRQ disable/enable cost
massively reduces overhead.  Alloc overhead is reduced by roughly 14-20% in
most cases, the free path by 26-46%, and the total reduction is
significant.

Many users require pages from the page allocator to be zeroed, and that
zeroing is the bulk of the allocation cost.  Hence, the impact on a basic
page-faulting benchmark is not that significant.

                              4.10.0-rc2            4.10.0-rc2
                                 vanilla          irqsafe-v1r5
Hmean    page_test   656632.98 (  0.00%)   675536.13 (  2.88%)
Hmean    brk_test   3845502.67 (  0.00%)  3867186.94 (  0.56%)
Stddev   page_test    10543.29 (  0.00%)     4104.07 ( 61.07%)
Stddev   brk_test     33472.36 (  0.00%)    15538.39 ( 53.58%)
CoeffVar page_test        1.61 (  0.00%)        0.61 ( 62.15%)
CoeffVar brk_test         0.87 (  0.00%)        0.40 ( 53.84%)
Max      page_test   666513.33 (  0.00%)   678640.00 (  1.82%)
Max      brk_test   3882800.00 (  0.00%)  3887008.66 (  0.11%)

This is from aim9, and the most notable outcome is that fault variability
is reduced by the patch.  The headline improvement is small because the
overall fault cost (zeroing, page table insertion, etc.) dominates relative
to disabling/enabling IRQs in the per-cpu allocator.

Similarly, little benefit was seen on networking benchmarks, both localhost
and between physical servers/clients, where other costs dominate.  It's
possible that this will only be noticeable on very high-speed networks.

Jesper Dangaard Brouer independently tested
this with a separate microbenchmark from
https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench

Micro-benchmarked with [1] page_bench02:
 modprobe page_bench02 page_order=0 run_flags=$((2#010)) loops=$((10**8)); \
  rmmod page_bench02 ; dmesg --notime | tail -n 4

Compared to baseline: 213 cycles(tsc) 53.417 ns
 - against this     : 184 cycles(tsc) 46.056 ns
 - Saving           : -29 cycles
 - Very close to expected 27 cycles saving [see below [2]]

Micro-benchmarking via time_bench_sample[3] gives the cost of these
operations:

 time_bench: Type:for_loop                 Per elem: 0 cycles(tsc) 0.232 ns (step:0)
 time_bench: Type:spin_lock_unlock         Per elem: 33 cycles(tsc) 8.334 ns (step:0)
 time_bench: Type:spin_lock_unlock_irqsave Per elem: 62 cycles(tsc) 15.607 ns (step:0)
 time_bench: Type:irqsave_before_lock      Per elem: 57 cycles(tsc) 14.344 ns (step:0)
 time_bench: Type:spin_lock_unlock_irq     Per elem: 34 cycles(tsc) 8.560 ns (step:0)
 time_bench: Type:simple_irq_disable_before_lock Per elem: 37 cycles(tsc) 9.289 ns (step:0)
 time_bench: Type:local_BH_disable_enable  Per elem: 19 cycles(tsc) 4.920 ns (step:0)
 time_bench: Type:local_IRQ_disable_enable Per elem: 7 cycles(tsc) 1.864 ns (step:0)
 time_bench: Type:local_irq_save_restore   Per elem: 38 cycles(tsc) 9.665 ns (step:0)
 [Mel's patch removes a ^^^^^^^^^^^^^^^^]            ^^^^^^^^^ expected saving - preempt cost
 time_bench: Type:preempt_disable_enable   Per elem: 11 cycles(tsc) 2.794 ns (step:0)
 [adds a preempt  ^^^^^^^^^^^^^^^^^^^^^^]            ^^^^^^^^^ adds this cost
 time_bench: Type:funcion_call_cost        Per elem: 6 cycles(tsc) 1.689 ns (step:0)
 time_bench: Type:func_ptr_call_cost       Per elem: 11 cycles(tsc) 2.767 ns (step:0)
 time_bench: Type:page_alloc_put           Per elem: 211 cycles(tsc) 52.803 ns (step:0)

Thus, the expected improvement is 38 - 11 = 27 cycles.

[mgorman@techsingularity.net: s/preempt_enable_no_resched/preempt_enable/]
  Link: http://lkml.kernel.org/r/20170208143128.25ahymqlyspjcixu@techsingularity.net
Link: http://lkml.kernel.org/r/20170123153906.3122-5-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |   43 +++++++++++++++++++++++--------------------
 1 file changed, 23 insertions(+), 20 deletions(-)

diff -puN mm/page_alloc.c~mm-page_alloc-only-use-per-cpu-allocator-for-irq-safe-requests mm/page_alloc.c
--- a/mm/page_alloc.c~mm-page_alloc-only-use-per-cpu-allocator-for-irq-safe-requests
+++ a/mm/page_alloc.c
@@ -1085,10 +1085,10 @@ static void free_pcppages_bulk(struct zo
 {
 	int migratetype = 0;
 	int batch_free = 0;
-	unsigned long nr_scanned;
+	unsigned long nr_scanned, flags;
 	bool isolated_pageblocks;
 
-	spin_lock(&zone->lock);
+	spin_lock_irqsave(&zone->lock, flags);
 	isolated_pageblocks = has_isolate_pageblock(zone);
 	nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
 	if (nr_scanned)
@@ -1137,7 +1137,7 @@ static void free_pcppages_bulk(struct zo
 			trace_mm_page_pcpu_drain(page, 0, mt);
 		} while (--count && --batch_free && !list_empty(list));
 	}
-	spin_unlock(&zone->lock);
+	spin_unlock_irqrestore(&zone->lock, flags);
 }
 
 static void free_one_page(struct zone *zone,
@@ -1145,8 +1145,9 @@ static void free_one_page(struct zone *z
 				unsigned int order,
 				int migratetype)
 {
-	unsigned long nr_scanned;
-	spin_lock(&zone->lock);
+	unsigned long nr_scanned, flags;
+	spin_lock_irqsave(&zone->lock, flags);
+	__count_vm_events(PGFREE, 1 << order);
 	nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
 	if (nr_scanned)
 		__mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
@@ -1156,7 +1157,7 @@ static void free_one_page(struct zone *z
 		migratetype = get_pfnblock_migratetype(page, pfn);
 	}
 	__free_one_page(page, pfn, zone, order, migratetype);
-	spin_unlock(&zone->lock);
+	spin_unlock_irqrestore(&zone->lock, flags);
 }
 
 static void __meminit __init_single_page(struct page *page, unsigned long pfn,
@@ -1234,7 +1235,6 @@ void __meminit reserve_bootmem_region(ph
 
 static void __free_pages_ok(struct page *page, unsigned int order)
 {
-	unsigned long flags;
 	int migratetype;
 	unsigned long pfn = page_to_pfn(page);
 
@@ -1242,10 +1242,7 @@ static void __free_pages_ok(struct page
 		return;
 
 	migratetype = get_pfnblock_migratetype(page, pfn);
-	local_irq_save(flags);
-	__count_vm_events(PGFREE, 1 << order);
 	free_one_page(page_zone(page), page, pfn, order, migratetype);
-	local_irq_restore(flags);
 }
 
 static void __init __free_pages_boot_core(struct page *page, unsigned int order)
@@ -2217,8 +2214,9 @@ static int rmqueue_bulk(struct zone *zon
 			int migratetype, bool cold)
 {
 	int i, alloced = 0;
+	unsigned long flags;
 
-	spin_lock(&zone->lock);
+	spin_lock_irqsave(&zone->lock, flags);
 	for (i = 0; i < count; ++i) {
 		struct page *page = __rmqueue(zone, order, migratetype);
 		if (unlikely(page == NULL))
@@ -2254,7 +2252,7 @@ static int rmqueue_bulk(struct zone *zon
 	 * pages added to the pcp list.
 	 */
 	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
-	spin_unlock(&zone->lock);
+	spin_unlock_irqrestore(&zone->lock, flags);
 	return alloced;
 }
 
@@ -2475,17 +2473,20 @@ void free_hot_cold_page(struct page *pag
 {
 	struct zone *zone = page_zone(page);
 	struct per_cpu_pages *pcp;
-	unsigned long flags;
 	unsigned long pfn = page_to_pfn(page);
 	int migratetype;
 
+	if (in_interrupt()) {
+		__free_pages_ok(page, 0);
+		return;
+	}
+
 	if (!free_pcp_prepare(page))
 		return;
 
 	migratetype = get_pfnblock_migratetype(page, pfn);
 	set_pcppage_migratetype(page, migratetype);
-	local_irq_save(flags);
-	__count_vm_event(PGFREE);
+	preempt_disable();
 
 	/*
 	 * We only track unmovable, reclaimable and movable on pcp lists.
@@ -2502,6 +2503,7 @@ void free_hot_cold_page(struct page *pag
 		migratetype = MIGRATE_MOVABLE;
 	}
 
+	__count_vm_event(PGFREE);
 	pcp = &this_cpu_ptr(zone->pageset)->pcp;
 	if (!cold)
 		list_add(&page->lru, &pcp->lists[migratetype]);
@@ -2515,7 +2517,7 @@ void free_hot_cold_page(struct page *pag
 	}
 
 out:
-	local_irq_restore(flags);
+	preempt_enable();
 }
 
 /*
@@ -2640,6 +2642,8 @@ static struct page *__rmqueue_pcplist(st
 {
 	struct page *page;
 
+	VM_BUG_ON(in_interrupt());
+
 	do {
 		if (list_empty(list)) {
 			pcp->count += rmqueue_bulk(zone, 0,
@@ -2670,9 +2674,8 @@ static struct page *rmqueue_pcplist(stru
 	struct list_head *list;
 	bool cold = ((gfp_flags & __GFP_COLD) != 0);
 	struct page *page;
-	unsigned long flags;
 
-	local_irq_save(flags);
+	preempt_disable();
 	pcp = &this_cpu_ptr(zone->pageset)->pcp;
 	list = &pcp->lists[migratetype];
 	page = __rmqueue_pcplist(zone,  migratetype, cold, pcp, list);
@@ -2680,7 +2683,7 @@ static struct page *rmqueue_pcplist(stru
 		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
 		zone_statistics(preferred_zone, zone);
 	}
-	local_irq_restore(flags);
+	preempt_enable();
 	return page;
 }
 
@@ -2696,7 +2699,7 @@ struct page *rmqueue(struct zone *prefer
 	unsigned long flags;
 	struct page *page;
 
-	if (likely(order == 0)) {
+	if (likely(order == 0) && !in_interrupt()) {
 		page = rmqueue_pcplist(preferred_zone, zone, order,
 				gfp_flags, migratetype);
 		goto out;
_

Patches currently in -mm which might be from mgorman@techsingularity.net are




* Page allocator order-0 optimizations merged
  2017-02-27 20:25 [merged] mm-page_alloc-only-use-per-cpu-allocator-for-irq-safe-requests.patch removed from -mm tree akpm
@ 2017-03-01 13:48 ` Jesper Dangaard Brouer
  2017-03-01 17:36     ` Tariq Toukan
  2017-04-10 14:31     ` zhong jiang
  0 siblings, 2 replies; 63+ messages in thread
From: Jesper Dangaard Brouer @ 2017-03-01 13:48 UTC (permalink / raw)
  To: netdev; +Cc: akpm, Mel Gorman, linux-mm, Saeed Mahameed, Tariq Toukan, brouer


Hi NetDev community,

I just wanted to make net driver people aware that this MM commit[1] got
merged and is available in net-next.

 commit 374ad05ab64d ("mm, page_alloc: only use per-cpu allocator for irq-safe requests")
 [1] https://git.kernel.org/davem/net-next/c/374ad05ab64d696

It provides an approx. 14% speedup of order-0 page allocations.  I do know
most drivers do their own page recycling.  Thus, this gain will only be
seen when that page recycling is insufficient, which is what affected
Tariq, AFAIK.

We are also playing with a bulk page allocator facility[2], which I've
benchmarked[3][4].  While I'm seeing between 34%-46% improvements from
bulking, I believe we actually need to do better before it reaches our
performance target for high-speed networking.
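
For illustration, a driver RX-ring refill loop built on such a bulk
interface could look roughly like the sketch below.  The helper name, its
signature and the my_driver_*() functions are made up for this example;
they are not the API proposed in [2]:

  /* Hypothetical sketch only -- not the interface from [2]. */
  static unsigned int my_driver_rx_refill(struct my_rx_ring *ring,
                                          unsigned int budget)
  {
          struct page *pages[64];
          unsigned int i, n;

          if (budget > ARRAY_SIZE(pages))
                  budget = ARRAY_SIZE(pages);

          /* One call amortises allocator entry/exit cost over many order-0 pages. */
          n = alloc_pages_bulk_sketch(GFP_ATOMIC, budget, pages);

          for (i = 0; i < n; i++)
                  my_driver_post_rx_buffer(ring, pages[i]);

          return n;
  }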

--Jesper

[2] http://lkml.kernel.org/r/20170109163518.6001-5-mgorman%40techsingularity.net
[3] http://lkml.kernel.org/r/20170116152518.5519dc1e%40redhat.com
[4] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench04_bulk.c


On Mon, 27 Feb 2017 12:25:03 -0800 akpm@linux-foundation.org wrote:

> The patch titled
>      Subject: mm, page_alloc: only use per-cpu allocator for irq-safe requests
> has been removed from the -mm tree.  Its filename was
>      mm-page_alloc-only-use-per-cpu-allocator-for-irq-safe-requests.patch
> 
> This patch was dropped because it was merged into mainline or a subsystem tree
> 
> [...]



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



* Re: Page allocator order-0 optimizations merged
  2017-03-01 13:48 ` Page allocator order-0 optimizations merged Jesper Dangaard Brouer
@ 2017-03-01 17:36     ` Tariq Toukan
  2017-04-10 14:31     ` zhong jiang
  1 sibling, 0 replies; 63+ messages in thread
From: Tariq Toukan @ 2017-03-01 17:36 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev; +Cc: akpm, Mel Gorman, linux-mm, Saeed Mahameed


On 01/03/2017 3:48 PM, Jesper Dangaard Brouer wrote:
> Hi NetDev community,
>
> I just wanted to make net driver people aware that this MM commit[1] got
> merged and is available in net-next.
>
>   commit 374ad05ab64d ("mm, page_alloc: only use per-cpu allocator for irq-safe requests")
>   [1] https://git.kernel.org/davem/net-next/c/374ad05ab64d696
>
> It provides approx 14% speedup of order-0 page allocations.  I do know
> most driver do their own page-recycling.  Thus, this gain will only be
> seen when this page recycling is insufficient, which Tariq was affected
> by AFAIK.
Thanks Jesper, this is great news!
I will start perf testing this tomorrow.
>
> We are also playing with a bulk page allocator facility[2], that I've
> benchmarked[3][4].  While I'm seeing between 34%-46% improvements by
> bulking, I believe we actually need to do better, before it reach our
> performance target for high-speed networking.
Very promising!
This fits perfectly with our Striding RQ feature (Multi-Packet WQE),
where we allocate fragmented buffers (of order-0 pages) totaling 256KB.
Big like :)

Thanks,
Tariq
> --Jesper
>
> [2] http://lkml.kernel.org/r/20170109163518.6001-5-mgorman%40techsingularity.net
> [3] http://lkml.kernel.org/r/20170116152518.5519dc1e%40redhat.com
> [4] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench04_bulk.c
>
>
> On Mon, 27 Feb 2017 12:25:03 -0800 akpm@linux-foundation.org wrote:
>
>> The patch titled
>>       Subject: mm, page_alloc: only use per-cpu allocator for irq-safe requests
>> [...]



* Re: Page allocator order-0 optimizations merged
@ 2017-03-01 17:36     ` Tariq Toukan
  0 siblings, 0 replies; 63+ messages in thread
From: Tariq Toukan @ 2017-03-01 17:36 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev; +Cc: akpm, Mel Gorman, linux-mm, Saeed Mahameed


On 01/03/2017 3:48 PM, Jesper Dangaard Brouer wrote:
> Hi NetDev community,
>
> I just wanted to make net driver people aware that this MM commit[1] got
> merged and is available in net-next.
>
>   commit 374ad05ab64d ("mm, page_alloc: only use per-cpu allocator for irq-safe requests")
>   [1] https://git.kernel.org/davem/net-next/c/374ad05ab64d696
>
> It provides approx 14% speedup of order-0 page allocations.  I do know
> most driver do their own page-recycling.  Thus, this gain will only be
> seen when this page recycling is insufficient, which Tariq was affected
> by AFAIK.
Thanks Jesper, this is great news!
I will start perf testing this tomorrow.
>
> We are also playing with a bulk page allocator facility[2], that I've
> benchmarked[3][4].  While I'm seeing between 34%-46% improvements by
> bulking, I believe we actually need to do better, before it reach our
> performance target for high-speed networking.
Very promising!
This fits perfectly in our Striding RQ feature (Multi-Packet WQE)
where we allocate fragmented buffers (of order-0 pages) of 256KB total.
Big like :)

Thanks,
Tariq
> --Jesper
>
> [2] http://lkml.kernel.org/r/20170109163518.6001-5-mgorman%40techsingularity.net
> [3] http://lkml.kernel.org/r/20170116152518.5519dc1e%40redhat.com
> [4] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench04_bulk.c
>
>
> On Mon, 27 Feb 2017 12:25:03 -0800 akpm@linux-foundation.org wrote:
>
>> The patch titled
>>       Subject: mm, page_alloc: only use per-cpu allocator for irq-safe requests
>> has been removed from the -mm tree.  Its filename was
>>       mm-page_alloc-only-use-per-cpu-allocator-for-irq-safe-requests.patch
>>
>> This patch was dropped because it was merged into mainline or a subsystem tree
>>
>> ------------------------------------------------------
>> From: Mel Gorman <mgorman@techsingularity.net>
>> Subject: mm, page_alloc: only use per-cpu allocator for irq-safe requests
>>
>> Many workloads that allocate pages are not handling an interrupt at a
>> time.  As allocation requests may be from IRQ context, it's necessary to
>> disable/enable IRQs for every page allocation.  This cost is the bulk of
>> the free path but also a significant percentage of the allocation path.
>>
>> This patch alters the locking and checks such that only irq-safe
>> allocation requests use the per-cpu allocator.  All others acquire the
>> irq-safe zone->lock and allocate from the buddy allocator.  It relies on
>> disabling preemption to safely access the per-cpu structures.  It could be
>> slightly modified to avoid soft IRQs using it but it's not clear it's
>> worthwhile.
>>
>> This modification may slow allocations from IRQ context slightly but the
>> main gain from the per-cpu allocator is that it scales better for
>> allocations from multiple contexts.  There is an implicit assumption that
>> intensive allocations from IRQ contexts on multiple CPUs from a single
>> NUMA node are rare and that the fast majority of scaling issues are
>> encountered in !IRQ contexts such as page faulting.  It's worth noting
>> that this patch is not required for a bulk page allocator but it
>> significantly reduces the overhead.
>>
>> The following is results from a page allocator micro-benchmark.  Only
>> order-0 is interesting as higher orders do not use the per-cpu allocator
>>
>>                                            4.10.0-rc2                 4.10.0-rc2
>>                                               vanilla               irqsafe-v1r5
>> Amean    alloc-odr0-1               287.15 (  0.00%)           219.00 ( 23.73%)
>> Amean    alloc-odr0-2               221.23 (  0.00%)           183.23 ( 17.18%)
>> Amean    alloc-odr0-4               187.00 (  0.00%)           151.38 ( 19.05%)
>> Amean    alloc-odr0-8               167.54 (  0.00%)           132.77 ( 20.75%)
>> Amean    alloc-odr0-16              156.00 (  0.00%)           123.00 ( 21.15%)
>> Amean    alloc-odr0-32              149.00 (  0.00%)           118.31 ( 20.60%)
>> Amean    alloc-odr0-64              138.77 (  0.00%)           116.00 ( 16.41%)
>> Amean    alloc-odr0-128             145.00 (  0.00%)           118.00 ( 18.62%)
>> Amean    alloc-odr0-256             136.15 (  0.00%)           125.00 (  8.19%)
>> Amean    alloc-odr0-512             147.92 (  0.00%)           121.77 ( 17.68%)
>> Amean    alloc-odr0-1024            147.23 (  0.00%)           126.15 ( 14.32%)
>> Amean    alloc-odr0-2048            155.15 (  0.00%)           129.92 ( 16.26%)
>> Amean    alloc-odr0-4096            164.00 (  0.00%)           136.77 ( 16.60%)
>> Amean    alloc-odr0-8192            166.92 (  0.00%)           138.08 ( 17.28%)
>> Amean    alloc-odr0-16384           159.00 (  0.00%)           138.00 ( 13.21%)
>> Amean    free-odr0-1                165.00 (  0.00%)            89.00 ( 46.06%)
>> Amean    free-odr0-2                113.00 (  0.00%)            63.00 ( 44.25%)
>> Amean    free-odr0-4                 99.00 (  0.00%)            54.00 ( 45.45%)
>> Amean    free-odr0-8                 88.00 (  0.00%)            47.38 ( 46.15%)
>> Amean    free-odr0-16                83.00 (  0.00%)            46.00 ( 44.58%)
>> Amean    free-odr0-32                80.00 (  0.00%)            44.38 ( 44.52%)
>> Amean    free-odr0-64                72.62 (  0.00%)            43.00 ( 40.78%)
>> Amean    free-odr0-128               78.00 (  0.00%)            42.00 ( 46.15%)
>> Amean    free-odr0-256               80.46 (  0.00%)            57.00 ( 29.16%)
>> Amean    free-odr0-512               96.38 (  0.00%)            64.69 ( 32.88%)
>> Amean    free-odr0-1024             107.31 (  0.00%)            72.54 ( 32.40%)
>> Amean    free-odr0-2048             108.92 (  0.00%)            78.08 ( 28.32%)
>> Amean    free-odr0-4096             113.38 (  0.00%)            82.23 ( 27.48%)
>> Amean    free-odr0-8192             112.08 (  0.00%)            82.85 ( 26.08%)
>> Amean    free-odr0-16384            110.38 (  0.00%)            81.92 ( 25.78%)
>> Amean    total-odr0-1               452.15 (  0.00%)           308.00 ( 31.88%)
>> Amean    total-odr0-2               334.23 (  0.00%)           246.23 ( 26.33%)
>> Amean    total-odr0-4               286.00 (  0.00%)           205.38 ( 28.19%)
>> Amean    total-odr0-8               255.54 (  0.00%)           180.15 ( 29.50%)
>> Amean    total-odr0-16              239.00 (  0.00%)           169.00 ( 29.29%)
>> Amean    total-odr0-32              229.00 (  0.00%)           162.69 ( 28.96%)
>> Amean    total-odr0-64              211.38 (  0.00%)           159.00 ( 24.78%)
>> Amean    total-odr0-128             223.00 (  0.00%)           160.00 ( 28.25%)
>> Amean    total-odr0-256             216.62 (  0.00%)           182.00 ( 15.98%)
>> Amean    total-odr0-512             244.31 (  0.00%)           186.46 ( 23.68%)
>> Amean    total-odr0-1024            254.54 (  0.00%)           198.69 ( 21.94%)
>> Amean    total-odr0-2048            264.08 (  0.00%)           208.00 ( 21.24%)
>> Amean    total-odr0-4096            277.38 (  0.00%)           219.00 ( 21.05%)
>> Amean    total-odr0-8192            279.00 (  0.00%)           220.92 ( 20.82%)
>> Amean    total-odr0-16384           269.38 (  0.00%)           219.92 ( 18.36%)
>>
>> This is the alloc, free and total overhead of allocating order-0 pages in
>> batches of 1 page up to 16384 pages.  Avoiding disabling/enabling overhead
>> massively reduces overhead.  Alloc overhead is roughly reduced by 14-20%
>> in most cases.  The free path is reduced by 26-46% and the total reduction
>> is significant.
>>
>> Many users require zeroing of pages from the page allocator which is the
>> vast cost of allocation.  Hence, the impact on a basic page faulting
>> benchmark is not that significant
>>
>>                                4.10.0-rc2            4.10.0-rc2
>>                                   vanilla          irqsafe-v1r5
>> Hmean    page_test   656632.98 (  0.00%)   675536.13 (  2.88%)
>> Hmean    brk_test   3845502.67 (  0.00%)  3867186.94 (  0.56%)
>> Stddev   page_test    10543.29 (  0.00%)     4104.07 ( 61.07%)
>> Stddev   brk_test     33472.36 (  0.00%)    15538.39 ( 53.58%)
>> CoeffVar page_test        1.61 (  0.00%)        0.61 ( 62.15%)
>> CoeffVar brk_test         0.87 (  0.00%)        0.40 ( 53.84%)
>> Max      page_test   666513.33 (  0.00%)   678640.00 (  1.82%)
>> Max      brk_test   3882800.00 (  0.00%)  3887008.66 (  0.11%)
>>
>> This is from aim9 and the most notable outcome is that fault variability
>> is reduced by the patch.  The headline improvement is small as the overall
>> fault cost, zeroing, page table insertion etc dominate relative to
>> disabling/enabling IRQs in the per-cpu allocator.
>>
>> Similarly, little benefit was seen on networking benchmarks both localhost
>> and between physical server/clients where other costs dominate.  It's
>> possible that this will only be noticeable on very high speed networks.
>>
>> Jesper Dangaard Brouer independently tested
>> this with a separate microbenchmark from
>> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
>>
>> Micro-benchmarked with [1] page_bench02:
>>   modprobe page_bench02 page_order=0 run_flags=$((2#010)) loops=$((10**8)); \
>>    rmmod page_bench02 ; dmesg --notime | tail -n 4
>>
>> Compared to baseline: 213 cycles(tsc) 53.417 ns
>>   - against this     : 184 cycles(tsc) 46.056 ns
>>   - Saving           : -29 cycles
>>   - Very close to expected 27 cycles saving [see below [2]]
>>
>> Micro benchmarking via time_bench_sample[3], we get the cost of these
>> operations:
>>
>>   time_bench: Type:for_loop                 Per elem: 0 cycles(tsc) 0.232 ns (step:0)
>>   time_bench: Type:spin_lock_unlock         Per elem: 33 cycles(tsc) 8.334 ns (step:0)
>>   time_bench: Type:spin_lock_unlock_irqsave Per elem: 62 cycles(tsc) 15.607 ns (step:0)
>>   time_bench: Type:irqsave_before_lock      Per elem: 57 cycles(tsc) 14.344 ns (step:0)
>>   time_bench: Type:spin_lock_unlock_irq     Per elem: 34 cycles(tsc) 8.560 ns (step:0)
>>   time_bench: Type:simple_irq_disable_before_lock Per elem: 37 cycles(tsc) 9.289 ns (step:0)
>>   time_bench: Type:local_BH_disable_enable  Per elem: 19 cycles(tsc) 4.920 ns (step:0)
>>   time_bench: Type:local_IRQ_disable_enable Per elem: 7 cycles(tsc) 1.864 ns (step:0)
>>   time_bench: Type:local_irq_save_restore   Per elem: 38 cycles(tsc) 9.665 ns (step:0)
>>   [Mel's patch removes a ^^^^^^^^^^^^^^^^]            ^^^^^^^^^ expected saving - preempt cost
>>   time_bench: Type:preempt_disable_enable   Per elem: 11 cycles(tsc) 2.794 ns (step:0)
>>   [adds a preempt  ^^^^^^^^^^^^^^^^^^^^^^]            ^^^^^^^^^ adds this cost
>>   time_bench: Type:funcion_call_cost        Per elem: 6 cycles(tsc) 1.689 ns (step:0)
>>   time_bench: Type:func_ptr_call_cost       Per elem: 11 cycles(tsc) 2.767 ns (step:0)
>>   time_bench: Type:page_alloc_put           Per elem: 211 cycles(tsc) 52.803 ns (step:0)
>>
>> Thus, expected improvement is: 38-11 = 27 cycles.
>>
>> [mgorman@techsingularity.net: s/preempt_enable_no_resched/preempt_enable/]
>>    Link: http://lkml.kernel.org/r/20170208143128.25ahymqlyspjcixu@techsingularity.net
>> Link: http://lkml.kernel.org/r/20170123153906.3122-5-mgorman@techsingularity.net
>> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
>> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
>> Acked-by: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>> ---
>>
>>   mm/page_alloc.c |   43 +++++++++++++++++++++++--------------------
>>   1 file changed, 23 insertions(+), 20 deletions(-)
>>
>> diff -puN mm/page_alloc.c~mm-page_alloc-only-use-per-cpu-allocator-for-irq-safe-requests mm/page_alloc.c
>> --- a/mm/page_alloc.c~mm-page_alloc-only-use-per-cpu-allocator-for-irq-safe-requests
>> +++ a/mm/page_alloc.c
>> @@ -1085,10 +1085,10 @@ static void free_pcppages_bulk(struct zo
>>   {
>>   	int migratetype = 0;
>>   	int batch_free = 0;
>> -	unsigned long nr_scanned;
>> +	unsigned long nr_scanned, flags;
>>   	bool isolated_pageblocks;
>>   
>> -	spin_lock(&zone->lock);
>> +	spin_lock_irqsave(&zone->lock, flags);
>>   	isolated_pageblocks = has_isolate_pageblock(zone);
>>   	nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
>>   	if (nr_scanned)
>> @@ -1137,7 +1137,7 @@ static void free_pcppages_bulk(struct zo
>>   			trace_mm_page_pcpu_drain(page, 0, mt);
>>   		} while (--count && --batch_free && !list_empty(list));
>>   	}
>> -	spin_unlock(&zone->lock);
>> +	spin_unlock_irqrestore(&zone->lock, flags);
>>   }
>>   
>>   static void free_one_page(struct zone *zone,
>> @@ -1145,8 +1145,9 @@ static void free_one_page(struct zone *z
>>   				unsigned int order,
>>   				int migratetype)
>>   {
>> -	unsigned long nr_scanned;
>> -	spin_lock(&zone->lock);
>> +	unsigned long nr_scanned, flags;
>> +	spin_lock_irqsave(&zone->lock, flags);
>> +	__count_vm_events(PGFREE, 1 << order);
>>   	nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
>>   	if (nr_scanned)
>>   		__mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
>> @@ -1156,7 +1157,7 @@ static void free_one_page(struct zone *z
>>   		migratetype = get_pfnblock_migratetype(page, pfn);
>>   	}
>>   	__free_one_page(page, pfn, zone, order, migratetype);
>> -	spin_unlock(&zone->lock);
>> +	spin_unlock_irqrestore(&zone->lock, flags);
>>   }
>>   
>>   static void __meminit __init_single_page(struct page *page, unsigned long pfn,
>> @@ -1234,7 +1235,6 @@ void __meminit reserve_bootmem_region(ph
>>   
>>   static void __free_pages_ok(struct page *page, unsigned int order)
>>   {
>> -	unsigned long flags;
>>   	int migratetype;
>>   	unsigned long pfn = page_to_pfn(page);
>>   
>> @@ -1242,10 +1242,7 @@ static void __free_pages_ok(struct page
>>   		return;
>>   
>>   	migratetype = get_pfnblock_migratetype(page, pfn);
>> -	local_irq_save(flags);
>> -	__count_vm_events(PGFREE, 1 << order);
>>   	free_one_page(page_zone(page), page, pfn, order, migratetype);
>> -	local_irq_restore(flags);
>>   }
>>   
>>   static void __init __free_pages_boot_core(struct page *page, unsigned int order)
>> @@ -2217,8 +2214,9 @@ static int rmqueue_bulk(struct zone *zon
>>   			int migratetype, bool cold)
>>   {
>>   	int i, alloced = 0;
>> +	unsigned long flags;
>>   
>> -	spin_lock(&zone->lock);
>> +	spin_lock_irqsave(&zone->lock, flags);
>>   	for (i = 0; i < count; ++i) {
>>   		struct page *page = __rmqueue(zone, order, migratetype);
>>   		if (unlikely(page == NULL))
>> @@ -2254,7 +2252,7 @@ static int rmqueue_bulk(struct zone *zon
>>   	 * pages added to the pcp list.
>>   	 */
>>   	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
>> -	spin_unlock(&zone->lock);
>> +	spin_unlock_irqrestore(&zone->lock, flags);
>>   	return alloced;
>>   }
>>   
>> @@ -2475,17 +2473,20 @@ void free_hot_cold_page(struct page *pag
>>   {
>>   	struct zone *zone = page_zone(page);
>>   	struct per_cpu_pages *pcp;
>> -	unsigned long flags;
>>   	unsigned long pfn = page_to_pfn(page);
>>   	int migratetype;
>>   
>> +	if (in_interrupt()) {
>> +		__free_pages_ok(page, 0);
>> +		return;
>> +	}
>> +
>>   	if (!free_pcp_prepare(page))
>>   		return;
>>   
>>   	migratetype = get_pfnblock_migratetype(page, pfn);
>>   	set_pcppage_migratetype(page, migratetype);
>> -	local_irq_save(flags);
>> -	__count_vm_event(PGFREE);
>> +	preempt_disable();
>>   
>>   	/*
>>   	 * We only track unmovable, reclaimable and movable on pcp lists.
>> @@ -2502,6 +2503,7 @@ void free_hot_cold_page(struct page *pag
>>   		migratetype = MIGRATE_MOVABLE;
>>   	}
>>   
>> +	__count_vm_event(PGFREE);
>>   	pcp = &this_cpu_ptr(zone->pageset)->pcp;
>>   	if (!cold)
>>   		list_add(&page->lru, &pcp->lists[migratetype]);
>> @@ -2515,7 +2517,7 @@ void free_hot_cold_page(struct page *pag
>>   	}
>>   
>>   out:
>> -	local_irq_restore(flags);
>> +	preempt_enable();
>>   }
>>   
>>   /*
>> @@ -2640,6 +2642,8 @@ static struct page *__rmqueue_pcplist(st
>>   {
>>   	struct page *page;
>>   
>> +	VM_BUG_ON(in_interrupt());
>> +
>>   	do {
>>   		if (list_empty(list)) {
>>   			pcp->count += rmqueue_bulk(zone, 0,
>> @@ -2670,9 +2674,8 @@ static struct page *rmqueue_pcplist(stru
>>   	struct list_head *list;
>>   	bool cold = ((gfp_flags & __GFP_COLD) != 0);
>>   	struct page *page;
>> -	unsigned long flags;
>>   
>> -	local_irq_save(flags);
>> +	preempt_disable();
>>   	pcp = &this_cpu_ptr(zone->pageset)->pcp;
>>   	list = &pcp->lists[migratetype];
>>   	page = __rmqueue_pcplist(zone,  migratetype, cold, pcp, list);
>> @@ -2680,7 +2683,7 @@ static struct page *rmqueue_pcplist(stru
>>   		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
>>   		zone_statistics(preferred_zone, zone);
>>   	}
>> -	local_irq_restore(flags);
>> +	preempt_enable();
>>   	return page;
>>   }
>>   
>> @@ -2696,7 +2699,7 @@ struct page *rmqueue(struct zone *prefer
>>   	unsigned long flags;
>>   	struct page *page;
>>   
>> -	if (likely(order == 0)) {
>> +	if (likely(order == 0) && !in_interrupt()) {
>>   		page = rmqueue_pcplist(preferred_zone, zone, order,
>>   				gfp_flags, migratetype);
>>   		goto out;
>> _
>>
>> Patches currently in -mm which might be from mgorman@techsingularity.net are
>>
>>
>
>


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Page allocator order-0 optimizations merged
  2017-03-01 17:36     ` Tariq Toukan
@ 2017-03-22 17:39       ` Tariq Toukan
  -1 siblings, 0 replies; 63+ messages in thread
From: Tariq Toukan @ 2017-03-22 17:39 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, Mel Gorman; +Cc: akpm, linux-mm, Saeed Mahameed



On 01/03/2017 7:36 PM, Tariq Toukan wrote:
>
> On 01/03/2017 3:48 PM, Jesper Dangaard Brouer wrote:
>> Hi NetDev community,
>>
>> I just wanted to make net driver people aware that this MM commit[1] got
>> merged and is available in net-next.
>>
>>   commit 374ad05ab64d ("mm, page_alloc: only use per-cpu allocator 
>> for irq-safe requests")
>>   [1] https://git.kernel.org/davem/net-next/c/374ad05ab64d696
>>
>> It provides approx 14% speedup of order-0 page allocations.  I do know
>> most drivers do their own page-recycling.  Thus, this gain will only be
>> seen when this page recycling is insufficient, which AFAIK was the case
>> for Tariq.
> Thanks Jesper, this is great news!
> I will start perf testing this tomorrow.
>>
>> We are also playing with a bulk page allocator facility[2], which I've
>> benchmarked[3][4].  While I'm seeing between 34%-46% improvements by
>> bulking, I believe we actually need to do better before it reaches our
>> performance target for high-speed networking.
> Very promising!
> This fits perfectly in our Striding RQ feature (Multi-Packet WQE)
> where we allocate fragmented buffers (of order-0 pages) of 256KB total.
> Big like :)
>
> Thanks,
> Tariq
>> --Jesper
>>
>> [2] 
>> http://lkml.kernel.org/r/20170109163518.6001-5-mgorman%40techsingularity.net
>> [3] http://lkml.kernel.org/r/20170116152518.5519dc1e%40redhat.com
>> [4] 
>> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench04_bulk.c
>>
>>
>> On Mon, 27 Feb 2017 12:25:03 -0800 akpm@linux-foundation.org wrote:
>>
>>> The patch titled
>>>       Subject: mm, page_alloc: only use per-cpu allocator for 
>>> irq-safe requests
>>> has been removed from the -mm tree.  Its filename was
>>> mm-page_alloc-only-use-per-cpu-allocator-for-irq-safe-requests.patch
>>>
>>> This patch was dropped because it was merged into mainline or a 
>>> subsystem tree
>>>
>>> ------------------------------------------------------
>>> From: Mel Gorman <mgorman@techsingularity.net>
>>> Subject: mm, page_alloc: only use per-cpu allocator for irq-safe 
>>> requests
>>>
>>> Many workloads that allocate pages are not handling an interrupt at a
>>> time.  As allocation requests may be from IRQ context, it's 
>>> necessary to
>>> disable/enable IRQs for every page allocation.  This cost is the 
>>> bulk of
>>> the free path but also a significant percentage of the allocation path.
>>>
>>> This patch alters the locking and checks such that only irq-safe
>>> allocation requests use the per-cpu allocator.  All others acquire the
>>> irq-safe zone->lock and allocate from the buddy allocator. It relies on
>>> disabling preemption to safely access the per-cpu structures. It 
>>> could be
>>> slightly modified to avoid soft IRQs using it but it's not clear it's
>>> worthwhile.
>>>
>>> This modification may slow allocations from IRQ context slightly but 
>>> the
>>> main gain from the per-cpu allocator is that it scales better for
>>> allocations from multiple contexts.  There is an implicit assumption 
>>> that
>>> intensive allocations from IRQ contexts on multiple CPUs from a single
>>> NUMA node are rare 
Hi Mel, Jesper, and all.

This assumption contradicts regular multi-stream traffic that is naturally
handled over close NUMA cores.  I compared iperf TCP multistream (8 streams)
over CX4 (mlx5 driver) with kernel v4.10 (before this series) vs
kernel v4.11-rc1 (with this series).
I disabled the page-cache (recycle) mechanism to stress the page allocator,
and saw a drastic degradation in BW, from 47.5 Gbps in v4.10 to 31.4 Gbps in
v4.11-rc1 (a 34% drop).
I noticed queued_spin_lock_slowpath occupies 62.87% of CPU time.

Best,
Tariq


^ permalink raw reply	[flat|nested] 63+ messages in thread


* Re: Page allocator order-0 optimizations merged
  2017-03-22 17:39       ` Tariq Toukan
@ 2017-03-22 23:40       ` Mel Gorman
  2017-03-23 13:43         ` Jesper Dangaard Brouer
  -1 siblings, 1 reply; 63+ messages in thread
From: Mel Gorman @ 2017-03-22 23:40 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Jesper Dangaard Brouer, netdev, akpm, linux-mm, Saeed Mahameed

On Wed, Mar 22, 2017 at 07:39:17PM +0200, Tariq Toukan wrote:
> > > > This modification may slow allocations from IRQ context slightly
> > > > but the
> > > > main gain from the per-cpu allocator is that it scales better for
> > > > allocations from multiple contexts.  There is an implicit
> > > > assumption that
> > > > intensive allocations from IRQ contexts on multiple CPUs from a single
> > > > NUMA node are rare
> Hi Mel, Jesper, and all.
> 
> This assumption contradicts regular multi-stream traffic that is naturally
> handled
> over close numa cores.  I compared iperf TCP multistream (8 streams)
> over CX4 (mlx5 driver) with kernels v4.10 (before this series) vs
> kernel v4.11-rc1 (with this series).
> I disabled the page-cache (recycle) mechanism to stress the page allocator,
> and see a drastic degradation in BW, from 47.5 G in v4.10 to 31.4 G in
> v4.11-rc1 (34% drop).
> I noticed queued_spin_lock_slowpath occupies 62.87% of CPU time.

Can you get the stack trace for the spin lock slowpath to confirm it's
from IRQ context?

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Page allocator order-0 optimizations merged
  2017-03-22 23:40       ` Mel Gorman
@ 2017-03-23 13:43         ` Jesper Dangaard Brouer
  2017-03-23 14:51           ` Mel Gorman
  0 siblings, 1 reply; 63+ messages in thread
From: Jesper Dangaard Brouer @ 2017-03-23 13:43 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Tariq Toukan, netdev, akpm, linux-mm, Saeed Mahameed, brouer

On Wed, 22 Mar 2017 23:40:04 +0000
Mel Gorman <mgorman@techsingularity.net> wrote:

> On Wed, Mar 22, 2017 at 07:39:17PM +0200, Tariq Toukan wrote:
> > > > > This modification may slow allocations from IRQ context slightly
> > > > > but the
> > > > > main gain from the per-cpu allocator is that it scales better for
> > > > > allocations from multiple contexts.  There is an implicit
> > > > > assumption that
> > > > > intensive allocations from IRQ contexts on multiple CPUs from a single
> > > > > NUMA node are rare  
> > Hi Mel, Jesper, and all.
> > 
> > This assumption contradicts regular multi-stream traffic that is naturally
> > handled
> > over close numa cores.  I compared iperf TCP multistream (8 streams)
> > over CX4 (mlx5 driver) with kernels v4.10 (before this series) vs
> > kernel v4.11-rc1 (with this series).
> > I disabled the page-cache (recycle) mechanism to stress the page allocator,
> > and see a drastic degradation in BW, from 47.5 G in v4.10 to 31.4 G in
> > v4.11-rc1 (34% drop).
> > I noticed queued_spin_lock_slowpath occupies 62.87% of CPU time.  
> 
> Can you get the stack trace for the spin lock slowpath to confirm it's
> from IRQ context?

AFAIK allocations happen in softirq.  Argh, during review I missed
that in_interrupt() also covers softirq.  To Mel, can we use an in_irq()
check instead?

(p.s. just landed and got home)
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Page allocator order-0 optimizations merged
  2017-03-23 13:43         ` Jesper Dangaard Brouer
@ 2017-03-23 14:51           ` Mel Gorman
  2017-03-26  8:21             ` Tariq Toukan
  0 siblings, 1 reply; 63+ messages in thread
From: Mel Gorman @ 2017-03-23 14:51 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Tariq Toukan, netdev, akpm, linux-mm, Saeed Mahameed

On Thu, Mar 23, 2017 at 02:43:47PM +0100, Jesper Dangaard Brouer wrote:
> On Wed, 22 Mar 2017 23:40:04 +0000
> Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> > On Wed, Mar 22, 2017 at 07:39:17PM +0200, Tariq Toukan wrote:
> > > > > > This modification may slow allocations from IRQ context slightly
> > > > > > but the
> > > > > > main gain from the per-cpu allocator is that it scales better for
> > > > > > allocations from multiple contexts.  There is an implicit
> > > > > > assumption that
> > > > > > intensive allocations from IRQ contexts on multiple CPUs from a single
> > > > > > NUMA node are rare  
> > > Hi Mel, Jesper, and all.
> > > 
> > > This assumption contradicts regular multi-stream traffic that is naturally
> > > handled
> > > over close numa cores.  I compared iperf TCP multistream (8 streams)
> > > over CX4 (mlx5 driver) with kernels v4.10 (before this series) vs
> > > kernel v4.11-rc1 (with this series).
> > > I disabled the page-cache (recycle) mechanism to stress the page allocator,
> > > and see a drastic degradation in BW, from 47.5 G in v4.10 to 31.4 G in
> > > v4.11-rc1 (34% drop).
> > > I noticed queued_spin_lock_slowpath occupies 62.87% of CPU time.  
> > 
> > Can you get the stack trace for the spin lock slowpath to confirm it's
> > from IRQ context?
> 
> AFAIK allocations happen in softirq.  Argh and during review I missed
> that in_interrupt() also covers softirq.  To Mel, can we use a in_irq()
> check instead?
> 
> (p.s. just landed and got home)

Not built or even boot tested. I'm unable to run tests at the moment

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6cbde310abed..f82225725bc1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2481,7 +2481,7 @@ void free_hot_cold_page(struct page *page, bool cold)
 	unsigned long pfn = page_to_pfn(page);
 	int migratetype;
 
-	if (in_interrupt()) {
+	if (in_irq()) {
 		__free_pages_ok(page, 0);
 		return;
 	}
@@ -2647,7 +2647,7 @@ static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
 {
 	struct page *page;
 
-	VM_BUG_ON(in_interrupt());
+	VM_BUG_ON(in_irq());
 
 	do {
 		if (list_empty(list)) {
@@ -2704,7 +2704,7 @@ struct page *rmqueue(struct zone *preferred_zone,
 	unsigned long flags;
 	struct page *page;
 
-	if (likely(order == 0) && !in_interrupt()) {
+	if (likely(order == 0) && !in_irq()) {
 		page = rmqueue_pcplist(preferred_zone, zone, order,
 				gfp_flags, migratetype);
 		goto out;

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: Page allocator order-0 optimizations merged
  2017-03-23 14:51           ` Mel Gorman
@ 2017-03-26  8:21             ` Tariq Toukan
  2017-03-26 10:17               ` Tariq Toukan
  0 siblings, 1 reply; 63+ messages in thread
From: Tariq Toukan @ 2017-03-26  8:21 UTC (permalink / raw)
  To: Mel Gorman, Jesper Dangaard Brouer
  Cc: Tariq Toukan, netdev, akpm, linux-mm, Saeed Mahameed



On 23/03/2017 4:51 PM, Mel Gorman wrote:
> On Thu, Mar 23, 2017 at 02:43:47PM +0100, Jesper Dangaard Brouer wrote:
>> On Wed, 22 Mar 2017 23:40:04 +0000
>> Mel Gorman <mgorman@techsingularity.net> wrote:
>>
>>> On Wed, Mar 22, 2017 at 07:39:17PM +0200, Tariq Toukan wrote:
>>>>>>> This modification may slow allocations from IRQ context slightly
>>>>>>> but the
>>>>>>> main gain from the per-cpu allocator is that it scales better for
>>>>>>> allocations from multiple contexts.  There is an implicit
>>>>>>> assumption that
>>>>>>> intensive allocations from IRQ contexts on multiple CPUs from a single
>>>>>>> NUMA node are rare
>>>> Hi Mel, Jesper, and all.
>>>>
>>>> This assumption contradicts regular multi-stream traffic that is naturally
>>>> handled
>>>> over close numa cores.  I compared iperf TCP multistream (8 streams)
>>>> over CX4 (mlx5 driver) with kernels v4.10 (before this series) vs
>>>> kernel v4.11-rc1 (with this series).
>>>> I disabled the page-cache (recycle) mechanism to stress the page allocator,
>>>> and see a drastic degradation in BW, from 47.5 G in v4.10 to 31.4 G in
>>>> v4.11-rc1 (34% drop).
>>>> I noticed queued_spin_lock_slowpath occupies 62.87% of CPU time.
>>>
>>> Can you get the stack trace for the spin lock slowpath to confirm it's
>>> from IRQ context?
>>
>> AFAIK allocations happen in softirq.  Argh and during review I missed
>> that in_interrupt() also covers softirq.  To Mel, can we use a in_irq()
>> check instead?
>>
>> (p.s. just landed and got home)

Glad to hear. Thanks for your suggestion.

>
> Not built or even boot tested. I'm unable to run tests at the moment

Thanks Mel, I will test it soon.

>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6cbde310abed..f82225725bc1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2481,7 +2481,7 @@ void free_hot_cold_page(struct page *page, bool cold)
>  	unsigned long pfn = page_to_pfn(page);
>  	int migratetype;
>
> -	if (in_interrupt()) {
> +	if (in_irq()) {
>  		__free_pages_ok(page, 0);
>  		return;
>  	}
> @@ -2647,7 +2647,7 @@ static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
>  {
>  	struct page *page;
>
> -	VM_BUG_ON(in_interrupt());
> +	VM_BUG_ON(in_irq());
>
>  	do {
>  		if (list_empty(list)) {
> @@ -2704,7 +2704,7 @@ struct page *rmqueue(struct zone *preferred_zone,
>  	unsigned long flags;
>  	struct page *page;
>
> -	if (likely(order == 0) && !in_interrupt()) {
> +	if (likely(order == 0) && !in_irq()) {
>  		page = rmqueue_pcplist(preferred_zone, zone, order,
>  				gfp_flags, migratetype);
>  		goto out;
>


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Page allocator order-0 optimizations merged
  2017-03-26  8:21             ` Tariq Toukan
@ 2017-03-26 10:17               ` Tariq Toukan
  2017-03-27  7:32                 ` Pankaj Gupta
  0 siblings, 1 reply; 63+ messages in thread
From: Tariq Toukan @ 2017-03-26 10:17 UTC (permalink / raw)
  To: Mel Gorman, Jesper Dangaard Brouer
  Cc: Tariq Toukan, netdev, akpm, linux-mm, Saeed Mahameed



On 26/03/2017 11:21 AM, Tariq Toukan wrote:
>
>
> On 23/03/2017 4:51 PM, Mel Gorman wrote:
>> On Thu, Mar 23, 2017 at 02:43:47PM +0100, Jesper Dangaard Brouer wrote:
>>> On Wed, 22 Mar 2017 23:40:04 +0000
>>> Mel Gorman <mgorman@techsingularity.net> wrote:
>>>
>>>> On Wed, Mar 22, 2017 at 07:39:17PM +0200, Tariq Toukan wrote:
>>>>>>>> This modification may slow allocations from IRQ context slightly
>>>>>>>> but the
>>>>>>>> main gain from the per-cpu allocator is that it scales better for
>>>>>>>> allocations from multiple contexts.  There is an implicit
>>>>>>>> assumption that
>>>>>>>> intensive allocations from IRQ contexts on multiple CPUs from a
>>>>>>>> single
>>>>>>>> NUMA node are rare
>>>>> Hi Mel, Jesper, and all.
>>>>>
>>>>> This assumption contradicts regular multi-stream traffic that is
>>>>> naturally
>>>>> handled
>>>>> over close numa cores.  I compared iperf TCP multistream (8 streams)
>>>>> over CX4 (mlx5 driver) with kernels v4.10 (before this series) vs
>>>>> kernel v4.11-rc1 (with this series).
>>>>> I disabled the page-cache (recycle) mechanism to stress the page
>>>>> allocator,
>>>>> and see a drastic degradation in BW, from 47.5 G in v4.10 to 31.4 G in
>>>>> v4.11-rc1 (34% drop).
>>>>> I noticed queued_spin_lock_slowpath occupies 62.87% of CPU time.
>>>>
>>>> Can you get the stack trace for the spin lock slowpath to confirm it's
>>>> from IRQ context?
>>>
>>> AFAIK allocations happen in softirq.  Argh and during review I missed
>>> that in_interrupt() also covers softirq.  To Mel, can we use a in_irq()
>>> check instead?
>>>
>>> (p.s. just landed and got home)
>
> Glad to hear. Thanks for your suggestion.
>
>>
>> Not built or even boot tested. I'm unable to run tests at the moment
>
> Thanks Mel, I will test it soon.
>
Crashed in iperf single stream test:

[ 3974.123386] ------------[ cut here ]------------
[ 3974.128778] WARNING: CPU: 2 PID: 8754 at lib/list_debug.c:53 
__list_del_entry_valid+0xa3/0xd0
[ 3974.138751] list_del corruption. prev->next should be 
ffffea0040369c60, but was dead000000000100
[ 3974.149016] Modules linked in: netconsole nfsv3 nfs fscache dm_mirror 
dm_region_hash dm_log dm_mod sb_edac edac_core x86_pkg_temp_thermal 
coretemp i2c_diolan_u2c kvm irqbypass ipmi_si ipmi_devintf crc32_pclmul 
iTCO_wdt ghash_clmulni_intel ipmi_msghandler dcdbas iTCO_vendor_support 
sg pcspkr lpc_ich shpchp wmi mfd_core acpi_power_meter nfsd auth_rpcgss 
nfs_acl lockd grace sunrpc binfmt_misc ip_tables mlx4_en sr_mod cdrom 
sd_mod i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt 
fb_sys_fops mlx5_core ttm tg3 ahci libahci mlx4_core drm libata ptp 
megaraid_sas crc32c_intel i2c_core pps_core [last unloaded: netconsole]
[ 3974.212743] CPU: 2 PID: 8754 Comm: iperf Not tainted 4.11.0-rc2+ #30
[ 3974.220073] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 
1.5.4 10/002/2015
[ 3974.228974] Call Trace:
[ 3974.231925]  <IRQ>
[ 3974.234405]  dump_stack+0x63/0x8c
[ 3974.238355]  __warn+0xd1/0xf0
[ 3974.241891]  warn_slowpath_fmt+0x4f/0x60
[ 3974.246494]  __list_del_entry_valid+0xa3/0xd0
[ 3974.251583]  get_page_from_freelist+0x84c/0xb40
[ 3974.256868]  ? napi_gro_receive+0x38/0x140
[ 3974.261666]  __alloc_pages_nodemask+0xca/0x200
[ 3974.266866]  mlx5e_alloc_rx_wqe+0x49/0x130 [mlx5_core]
[ 3974.272862]  mlx5e_post_rx_wqes+0x84/0xc0 [mlx5_core]
[ 3974.278725]  mlx5e_napi_poll+0xc7/0x450 [mlx5_core]
[ 3974.284409]  net_rx_action+0x23d/0x3a0
[ 3974.288819]  __do_softirq+0xd1/0x2a2
[ 3974.293054]  irq_exit+0xb5/0xc0
[ 3974.296783]  do_IRQ+0x51/0xd0
[ 3974.300353]  common_interrupt+0x89/0x89
[ 3974.304859] RIP: 0010:free_hot_cold_page+0x228/0x280
[ 3974.310629] RSP: 0018:ffffc9000ea07c90 EFLAGS: 00000202 ORIG_RAX: 
ffffffffffffffa8
[ 3974.319565] RAX: 0000000000000001 RBX: ffff88103f85f158 RCX: 
ffffea0040369c60
[ 3974.327764] RDX: ffffea0040369c60 RSI: ffff88103f85f168 RDI: 
ffffea0040369ca0
[ 3974.335961] RBP: ffffc9000ea07cc0 R08: ffff88103f85f168 R09: 
00000000000005a8
[ 3974.344178] R10: 00000000000005a8 R11: 0000000000010468 R12: 
ffffea0040369c80
[ 3974.352387] R13: ffff88103f85f168 R14: ffff88107ffdeb80 R15: 
ffffea0040369ca0
[ 3974.360577]  </IRQ>
[ 3974.363145]  __put_page+0x34/0x40
[ 3974.367068]  skb_release_data+0xca/0xe0
[ 3974.371575]  skb_release_all+0x24/0x30
[ 3974.375984]  __kfree_skb+0x12/0x20
[ 3974.380003]  tcp_recvmsg+0x6ac/0xaf0
[ 3974.384251]  inet_recvmsg+0x3c/0xa0
[ 3974.388394]  sock_recvmsg+0x3d/0x50
[ 3974.392511]  SYSC_recvfrom+0xd3/0x140
[ 3974.396826]  ? handle_mm_fault+0xce/0x240
[ 3974.401535]  ? SyS_futex+0x71/0x150
[ 3974.405653]  SyS_recvfrom+0xe/0x10
[ 3974.409673]  entry_SYSCALL_64_fastpath+0x1a/0xa9
[ 3974.415056] RIP: 0033:0x7f04ca9315bb
[ 3974.419309] RSP: 002b:00007f04c955de70 EFLAGS: 00000246 ORIG_RAX: 
000000000000002d
[ 3974.428243] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 
00007f04ca9315bb
[ 3974.436450] RDX: 0000000000020000 RSI: 00007f04bc0008f0 RDI: 
0000000000000004
[ 3974.444653] RBP: 0000000000000000 R08: 0000000000000000 R09: 
0000000000000000
[ 3974.452851] R10: 0000000000000000 R11: 0000000000000246 R12: 
00007f04bc0008f0
[ 3974.461051] R13: 0000000000034ac8 R14: 00007f04bc020910 R15: 
000000000001c480
[ 3974.469297] ---[ end trace 6fd472c9e1973d53 ]---


>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 6cbde310abed..f82225725bc1 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -2481,7 +2481,7 @@ void free_hot_cold_page(struct page *page, bool
>> cold)
>>      unsigned long pfn = page_to_pfn(page);
>>      int migratetype;
>>
>> -    if (in_interrupt()) {
>> +    if (in_irq()) {
>>          __free_pages_ok(page, 0);
>>          return;
>>      }
>> @@ -2647,7 +2647,7 @@ static struct page *__rmqueue_pcplist(struct
>> zone *zone, int migratetype,
>>  {
>>      struct page *page;
>>
>> -    VM_BUG_ON(in_interrupt());
>> +    VM_BUG_ON(in_irq());
>>
>>      do {
>>          if (list_empty(list)) {
>> @@ -2704,7 +2704,7 @@ struct page *rmqueue(struct zone *preferred_zone,
>>      unsigned long flags;
>>      struct page *page;
>>
>> -    if (likely(order == 0) && !in_interrupt()) {
>> +    if (likely(order == 0) && !in_irq()) {
>>          page = rmqueue_pcplist(preferred_zone, zone, order,
>>                  gfp_flags, migratetype);
>>          goto out;
>>


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Page allocator order-0 optimizations merged
  2017-03-26 10:17               ` Tariq Toukan
@ 2017-03-27  7:32                 ` Pankaj Gupta
  2017-03-27  8:55                   ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 63+ messages in thread
From: Pankaj Gupta @ 2017-03-27  7:32 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Mel Gorman, Jesper Dangaard Brouer, Tariq Toukan, netdev, akpm,
	linux-mm, Saeed Mahameed


Hello,

It looks like a race between softirq and normal process context.

Just thinking: do we really want allocations from softirqs to be done using
the per-cpu lists?  Alternatively, we could add a check in
'free_hot_cold_page' so that, if we are on the path of returning from a
hard interrupt, we don't allocate from the per-cpu list.
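
For illustration, the race implied by the crash trace above looks roughly
like the sketch below.  The function names come from the stack trace; the
exact interleaving is an assumption, not something stated in the thread.

/*
 * CPU0, process context                CPU0, hard-IRQ -> softirq (NAPI)
 * ---------------------                --------------------------------
 * free_hot_cold_page()
 *   preempt_disable()
 *   list_add(&page->lru, PCP list) <-- hard IRQ arrives mid-update
 *                                      irq_exit() -> __do_softirq()
 *                                        mlx5e_post_rx_wqes()
 *                                          rmqueue_pcplist()
 *                                            list_del() on the same PCP list
 *                                              => list corruption
 *   preempt_enable()
 *
 * preempt_disable() does not mask interrupts, so the softirq allocation can
 * run on the same CPU in the middle of the process-context PCP update.
 */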

Thanks,
Pankaj

> 
> 
> 
> On 26/03/2017 11:21 AM, Tariq Toukan wrote:
> >
> >
> > On 23/03/2017 4:51 PM, Mel Gorman wrote:
> >> On Thu, Mar 23, 2017 at 02:43:47PM +0100, Jesper Dangaard Brouer wrote:
> >>> On Wed, 22 Mar 2017 23:40:04 +0000
> >>> Mel Gorman <mgorman@techsingularity.net> wrote:
> >>>
> >>>> On Wed, Mar 22, 2017 at 07:39:17PM +0200, Tariq Toukan wrote:
> >>>>>>>> This modification may slow allocations from IRQ context slightly
> >>>>>>>> but the
> >>>>>>>> main gain from the per-cpu allocator is that it scales better for
> >>>>>>>> allocations from multiple contexts.  There is an implicit
> >>>>>>>> assumption that
> >>>>>>>> intensive allocations from IRQ contexts on multiple CPUs from a
> >>>>>>>> single
> >>>>>>>> NUMA node are rare
> >>>>> Hi Mel, Jesper, and all.
> >>>>>
> >>>>> This assumption contradicts regular multi-stream traffic that is
> >>>>> naturally
> >>>>> handled
> >>>>> over close numa cores.  I compared iperf TCP multistream (8 streams)
> >>>>> over CX4 (mlx5 driver) with kernels v4.10 (before this series) vs
> >>>>> kernel v4.11-rc1 (with this series).
> >>>>> I disabled the page-cache (recycle) mechanism to stress the page
> >>>>> allocator,
> >>>>> and see a drastic degradation in BW, from 47.5 G in v4.10 to 31.4 G in
> >>>>> v4.11-rc1 (34% drop).
> >>>>> I noticed queued_spin_lock_slowpath occupies 62.87% of CPU time.
> >>>>
> >>>> Can you get the stack trace for the spin lock slowpath to confirm it's
> >>>> from IRQ context?
> >>>
> >>> AFAIK allocations happen in softirq.  Argh and during review I missed
> >>> that in_interrupt() also covers softirq.  To Mel, can we use a in_irq()
> >>> check instead?
> >>>
> >>> (p.s. just landed and got home)
> >
> > Glad to hear. Thanks for your suggestion.
> >
> >>
> >> Not built or even boot tested. I'm unable to run tests at the moment
> >
> > Thanks Mel, I will test it soon.
> >
> Crashed in iperf single stream test:
> 
> [ 3974.123386] ------------[ cut here ]------------
> [ 3974.128778] WARNING: CPU: 2 PID: 8754 at lib/list_debug.c:53
> __list_del_entry_valid+0xa3/0xd0
> [ 3974.138751] list_del corruption. prev->next should be
> ffffea0040369c60, but was dead000000000100
> [ 3974.149016] Modules linked in: netconsole nfsv3 nfs fscache dm_mirror
> dm_region_hash dm_log dm_mod sb_edac edac_core x86_pkg_temp_thermal
> coretemp i2c_diolan_u2c kvm irqbypass ipmi_si ipmi_devintf crc32_pclmul
> iTCO_wdt ghash_clmulni_intel ipmi_msghandler dcdbas iTCO_vendor_support
> sg pcspkr lpc_ich shpchp wmi mfd_core acpi_power_meter nfsd auth_rpcgss
> nfs_acl lockd grace sunrpc binfmt_misc ip_tables mlx4_en sr_mod cdrom
> sd_mod i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt
> fb_sys_fops mlx5_core ttm tg3 ahci libahci mlx4_core drm libata ptp
> megaraid_sas crc32c_intel i2c_core pps_core [last unloaded: netconsole]
> [ 3974.212743] CPU: 2 PID: 8754 Comm: iperf Not tainted 4.11.0-rc2+ #30
> [ 3974.220073] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS
> 1.5.4 10/002/2015
> [ 3974.228974] Call Trace:
> [ 3974.231925]  <IRQ>
> [ 3974.234405]  dump_stack+0x63/0x8c
> [ 3974.238355]  __warn+0xd1/0xf0
> [ 3974.241891]  warn_slowpath_fmt+0x4f/0x60
> [ 3974.246494]  __list_del_entry_valid+0xa3/0xd0
> [ 3974.251583]  get_page_from_freelist+0x84c/0xb40
> [ 3974.256868]  ? napi_gro_receive+0x38/0x140
> [ 3974.261666]  __alloc_pages_nodemask+0xca/0x200
> [ 3974.266866]  mlx5e_alloc_rx_wqe+0x49/0x130 [mlx5_core]
> [ 3974.272862]  mlx5e_post_rx_wqes+0x84/0xc0 [mlx5_core]
> [ 3974.278725]  mlx5e_napi_poll+0xc7/0x450 [mlx5_core]
> [ 3974.284409]  net_rx_action+0x23d/0x3a0
> [ 3974.288819]  __do_softirq+0xd1/0x2a2
> [ 3974.293054]  irq_exit+0xb5/0xc0
> [ 3974.296783]  do_IRQ+0x51/0xd0
> [ 3974.300353]  common_interrupt+0x89/0x89
> [ 3974.304859] RIP: 0010:free_hot_cold_page+0x228/0x280
> [ 3974.310629] RSP: 0018:ffffc9000ea07c90 EFLAGS: 00000202 ORIG_RAX:
> ffffffffffffffa8
> [ 3974.319565] RAX: 0000000000000001 RBX: ffff88103f85f158 RCX:
> ffffea0040369c60
> [ 3974.327764] RDX: ffffea0040369c60 RSI: ffff88103f85f168 RDI:
> ffffea0040369ca0
> [ 3974.335961] RBP: ffffc9000ea07cc0 R08: ffff88103f85f168 R09:
> 00000000000005a8
> [ 3974.344178] R10: 00000000000005a8 R11: 0000000000010468 R12:
> ffffea0040369c80
> [ 3974.352387] R13: ffff88103f85f168 R14: ffff88107ffdeb80 R15:
> ffffea0040369ca0
> [ 3974.360577]  </IRQ>
> [ 3974.363145]  __put_page+0x34/0x40
> [ 3974.367068]  skb_release_data+0xca/0xe0
> [ 3974.371575]  skb_release_all+0x24/0x30
> [ 3974.375984]  __kfree_skb+0x12/0x20
> [ 3974.380003]  tcp_recvmsg+0x6ac/0xaf0
> [ 3974.384251]  inet_recvmsg+0x3c/0xa0
> [ 3974.388394]  sock_recvmsg+0x3d/0x50
> [ 3974.392511]  SYSC_recvfrom+0xd3/0x140
> [ 3974.396826]  ? handle_mm_fault+0xce/0x240
> [ 3974.401535]  ? SyS_futex+0x71/0x150
> [ 3974.405653]  SyS_recvfrom+0xe/0x10
> [ 3974.409673]  entry_SYSCALL_64_fastpath+0x1a/0xa9
> [ 3974.415056] RIP: 0033:0x7f04ca9315bb
> [ 3974.419309] RSP: 002b:00007f04c955de70 EFLAGS: 00000246 ORIG_RAX:
> 000000000000002d
> [ 3974.428243] RAX: ffffffffffffffda RBX: 0000000000020000 RCX:
> 00007f04ca9315bb
> [ 3974.436450] RDX: 0000000000020000 RSI: 00007f04bc0008f0 RDI:
> 0000000000000004
> [ 3974.444653] RBP: 0000000000000000 R08: 0000000000000000 R09:
> 0000000000000000
> [ 3974.452851] R10: 0000000000000000 R11: 0000000000000246 R12:
> 00007f04bc0008f0
> [ 3974.461051] R13: 0000000000034ac8 R14: 00007f04bc020910 R15:
> 000000000001c480
> [ 3974.469297] ---[ end trace 6fd472c9e1973d53 ]---
> 
> 
> >>
> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> index 6cbde310abed..f82225725bc1 100644
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -2481,7 +2481,7 @@ void free_hot_cold_page(struct page *page, bool
> >> cold)
> >>      unsigned long pfn = page_to_pfn(page);
> >>      int migratetype;
> >>
> >> -    if (in_interrupt()) {
> >> +    if (in_irq()) {
> >>          __free_pages_ok(page, 0);
> >>          return;
> >>      }
> >> @@ -2647,7 +2647,7 @@ static struct page *__rmqueue_pcplist(struct
> >> zone *zone, int migratetype,
> >>  {
> >>      struct page *page;
> >>
> >> -    VM_BUG_ON(in_interrupt());
> >> +    VM_BUG_ON(in_irq());
> >>
> >>      do {
> >>          if (list_empty(list)) {
> >> @@ -2704,7 +2704,7 @@ struct page *rmqueue(struct zone *preferred_zone,
> >>      unsigned long flags;
> >>      struct page *page;
> >>
> >> -    if (likely(order == 0) && !in_interrupt()) {
> >> +    if (likely(order == 0) && !in_irq()) {
> >>          page = rmqueue_pcplist(preferred_zone, zone, order,
> >>                  gfp_flags, migratetype);
> >>          goto out;
> >>
> 


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Page allocator order-0 optimizations merged
  2017-03-27  7:32                 ` Pankaj Gupta
@ 2017-03-27  8:55                   ` Jesper Dangaard Brouer
  2017-03-27 12:28                     ` Mel Gorman
  2017-03-27 12:39                     ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 63+ messages in thread
From: Jesper Dangaard Brouer @ 2017-03-27  8:55 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: Tariq Toukan, Mel Gorman, Tariq Toukan, netdev, akpm, linux-mm,
	Saeed Mahameed, brouer

On Mon, 27 Mar 2017 03:32:47 -0400 (EDT)
Pankaj Gupta <pagupta@redhat.com> wrote:

> Hello,
> 
> It looks like a race with softirq and normal process context.
> 
> Just thinking if we really want allocations from 'softirqs' to be
> done using per cpu list? 

Yes, softirqs need fast page allocs.  The softirq use-case is refilling
the DMA RX rings, which is time-critical, especially for NIC drivers.
For this reason most drivers implement their own page-recycling tricks.
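
To make that concrete, below is a rough sketch of the kind of refill loop
that runs from NAPI poll (softirq) context.  struct my_rx_ring and
rx_desc_attach_page() are made-up names for the sketch; dev_alloc_page()
is the usual order-0 atomic allocation helper.

/* Rough sketch of a NAPI RX-ring refill loop (not from any real driver).
 * It runs in softirq context, so every miss in the driver's own page
 * recycler becomes an order-0 page allocation from softirq. */
static int my_rx_ring_refill(struct my_rx_ring *ring, int budget)
{
	int filled = 0;

	while (filled < budget) {
		struct page *page = dev_alloc_page();	/* order-0, atomic */

		if (unlikely(!page))
			break;				/* retry on the next poll */
		rx_desc_attach_page(ring, page);	/* made-up: map + post to HW */
		filled++;
	}
	return filled;
}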

> Or we can have some check in  'free_hot_cold_page' for softirqs 
> to check if we are on a path of returning from hard interrupt don't
> allocate from per cpu list.

A possible solution would be to use local_bh_{disable,enable} instead
of the preempt_{disable,enable} calls.  But it is slower; using the numbers
from [1] (19 vs 11 cycles), the expected cycles saving drops to 38-19=19.

The problematic part of using local_bh_enable is that it adds a
softirq/bottom-halves rescheduling point (as it checks for pending
BHs).  Thus, this might affect real workloads.
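
A rough sketch of that variant, based on the free_hot_cold_page() structure
in the hunk quoted earlier (illustration only, not the final patch; the
in_irq() || in_nmi() condition is my assumption for "hard-IRQ and NMI keep
using the buddy path"):

/* Sketch: free path that lets softirq use the PCP list again. */
void free_hot_cold_page(struct page *page, bool cold)
{
	struct zone *zone = page_zone(page);
	unsigned long pfn = page_to_pfn(page);
	int migratetype;

	if (in_irq() || in_nmi()) {	/* hard-IRQ and NMI bypass the PCP lists */
		__free_pages_ok(page, 0);
		return;
	}

	if (!free_pcp_prepare(page))
		return;

	migratetype = get_pfnblock_migratetype(page, pfn);
	set_pcppage_migratetype(page, migratetype);

	local_bh_disable();	/* was: preempt_disable() */
	/* ... existing PCP handling via this_cpu_ptr(zone->pageset)->pcp ... */
	local_bh_enable();	/* was: preempt_enable(); adds a BH resched point */
}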


I'm unsure what the best option is.  I'm leaning towards partly
reverting [1] and going back to the slower local_irq_save +
local_irq_restore as before.

Afterwards we can add a bulk page alloc+free call that can amortize
this 38-cycle cost (of local_irq_{save,restore}).  Or add a function
call that MUST only be called from contexts with IRQs enabled, which
allows using the unconditional local_irq_{disable,enable}, as that only
costs 7 cycles.


[1] commit 374ad05ab64d ("mm, page_alloc: only use per-cpu allocator for irq-safe requests")
 https://git.kernel.org/torvalds/c/374ad05ab64d
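
As a strawman, the bulk interface could look roughly like the sketch below.
The name, signature and the my_rx_refill_bulk()/rx_desc_attach_page()
helpers are illustrative assumptions, not the actual API of the
bulk-allocator prototype mentioned earlier in the thread; the point is only
that the IRQ-disable and zone->lock costs are paid once per call instead of
once per page.

/* Strawman declaration: fill page_array with up to nr_pages order-0 pages,
 * returning how many were actually allocated. */
unsigned int alloc_pages_bulk(gfp_t gfp, unsigned int nr_pages,
			      struct page **page_array);

/* Driver-side usage sketch: refill an RX ring in one call from softirq. */
static int my_rx_refill_bulk(struct my_rx_ring *ring, unsigned int budget)
{
	struct page *pages[64];
	unsigned int i, n;

	if (budget > ARRAY_SIZE(pages))
		budget = ARRAY_SIZE(pages);

	n = alloc_pages_bulk(GFP_ATOMIC, budget, pages);
	for (i = 0; i < n; i++)
		rx_desc_attach_page(ring, pages[i]);	/* made-up helper, as above */

	return n;
}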


> > On 26/03/2017 11:21 AM, Tariq Toukan wrote:  
> > >
> > >
> > > On 23/03/2017 4:51 PM, Mel Gorman wrote:  
> > >> On Thu, Mar 23, 2017 at 02:43:47PM +0100, Jesper Dangaard Brouer wrote:  
> > >>> On Wed, 22 Mar 2017 23:40:04 +0000
> > >>> Mel Gorman <mgorman@techsingularity.net> wrote:
> > >>>  
> > >>>> On Wed, Mar 22, 2017 at 07:39:17PM +0200, Tariq Toukan wrote:  
> > >>>>>>>> This modification may slow allocations from IRQ context slightly
> > >>>>>>>> but the
> > >>>>>>>> main gain from the per-cpu allocator is that it scales better for
> > >>>>>>>> allocations from multiple contexts.  There is an implicit
> > >>>>>>>> assumption that
> > >>>>>>>> intensive allocations from IRQ contexts on multiple CPUs from a
> > >>>>>>>> single
> > >>>>>>>> NUMA node are rare  
> > >>>>> Hi Mel, Jesper, and all.
> > >>>>>
> > >>>>> This assumption contradicts regular multi-stream traffic that is
> > >>>>> naturally
> > >>>>> handled
> > >>>>> over close numa cores.  I compared iperf TCP multistream (8 streams)
> > >>>>> over CX4 (mlx5 driver) with kernels v4.10 (before this series) vs
> > >>>>> kernel v4.11-rc1 (with this series).
> > >>>>> I disabled the page-cache (recycle) mechanism to stress the page
> > >>>>> allocator,
> > >>>>> and see a drastic degradation in BW, from 47.5 G in v4.10 to 31.4 G in
> > >>>>> v4.11-rc1 (34% drop).
> > >>>>> I noticed queued_spin_lock_slowpath occupies 62.87% of CPU time.  
> > >>>>
> > >>>> Can you get the stack trace for the spin lock slowpath to confirm it's
> > >>>> from IRQ context?  
> > >>>
> > >>> AFAIK allocations happen in softirq.  Argh and during review I missed
> > >>> that in_interrupt() also covers softirq.  To Mel, can we use a in_irq()
> > >>> check instead?
> > >>>
> > >>> (p.s. just landed and got home)  
> > >
> > > Glad to hear. Thanks for your suggestion.
> > >  
> > >>
> > >> Not built or even boot tested. I'm unable to run tests at the moment  
> > >
> > > Thanks Mel, I will test it soon.
> > >  
> > Crashed in iperf single stream test:
> > 
> > [ 3974.123386] ------------[ cut here ]------------
> > [ 3974.128778] WARNING: CPU: 2 PID: 8754 at lib/list_debug.c:53
> > __list_del_entry_valid+0xa3/0xd0
> > [ 3974.138751] list_del corruption. prev->next should be
> > ffffea0040369c60, but was dead000000000100
> > [ 3974.149016] Modules linked in: netconsole nfsv3 nfs fscache dm_mirror
> > dm_region_hash dm_log dm_mod sb_edac edac_core x86_pkg_temp_thermal
> > coretemp i2c_diolan_u2c kvm irqbypass ipmi_si ipmi_devintf crc32_pclmul
> > iTCO_wdt ghash_clmulni_intel ipmi_msghandler dcdbas iTCO_vendor_support
> > sg pcspkr lpc_ich shpchp wmi mfd_core acpi_power_meter nfsd auth_rpcgss
> > nfs_acl lockd grace sunrpc binfmt_misc ip_tables mlx4_en sr_mod cdrom
> > sd_mod i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt
> > fb_sys_fops mlx5_core ttm tg3 ahci libahci mlx4_core drm libata ptp
> > megaraid_sas crc32c_intel i2c_core pps_core [last unloaded: netconsole]
> > [ 3974.212743] CPU: 2 PID: 8754 Comm: iperf Not tainted 4.11.0-rc2+ #30
> > [ 3974.220073] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS
> > 1.5.4 10/002/2015
> > [ 3974.228974] Call Trace:
> > [ 3974.231925]  <IRQ>
> > [ 3974.234405]  dump_stack+0x63/0x8c
> > [ 3974.238355]  __warn+0xd1/0xf0
> > [ 3974.241891]  warn_slowpath_fmt+0x4f/0x60
> > [ 3974.246494]  __list_del_entry_valid+0xa3/0xd0
> > [ 3974.251583]  get_page_from_freelist+0x84c/0xb40
> > [ 3974.256868]  ? napi_gro_receive+0x38/0x140
> > [ 3974.261666]  __alloc_pages_nodemask+0xca/0x200
> > [ 3974.266866]  mlx5e_alloc_rx_wqe+0x49/0x130 [mlx5_core]
> > [ 3974.272862]  mlx5e_post_rx_wqes+0x84/0xc0 [mlx5_core]
> > [ 3974.278725]  mlx5e_napi_poll+0xc7/0x450 [mlx5_core]
> > [ 3974.284409]  net_rx_action+0x23d/0x3a0
> > [ 3974.288819]  __do_softirq+0xd1/0x2a2
> > [ 3974.293054]  irq_exit+0xb5/0xc0
> > [ 3974.296783]  do_IRQ+0x51/0xd0
> > [ 3974.300353]  common_interrupt+0x89/0x89
> > [ 3974.304859] RIP: 0010:free_hot_cold_page+0x228/0x280
> > [ 3974.310629] RSP: 0018:ffffc9000ea07c90 EFLAGS: 00000202 ORIG_RAX:
> > ffffffffffffffa8
> > [ 3974.319565] RAX: 0000000000000001 RBX: ffff88103f85f158 RCX:
> > ffffea0040369c60
> > [ 3974.327764] RDX: ffffea0040369c60 RSI: ffff88103f85f168 RDI:
> > ffffea0040369ca0
> > [ 3974.335961] RBP: ffffc9000ea07cc0 R08: ffff88103f85f168 R09:
> > 00000000000005a8
> > [ 3974.344178] R10: 00000000000005a8 R11: 0000000000010468 R12:
> > ffffea0040369c80
> > [ 3974.352387] R13: ffff88103f85f168 R14: ffff88107ffdeb80 R15:
> > ffffea0040369ca0
> > [ 3974.360577]  </IRQ>
> > [ 3974.363145]  __put_page+0x34/0x40
> > [ 3974.367068]  skb_release_data+0xca/0xe0
> > [ 3974.371575]  skb_release_all+0x24/0x30
> > [ 3974.375984]  __kfree_skb+0x12/0x20
> > [ 3974.380003]  tcp_recvmsg+0x6ac/0xaf0
> > [ 3974.384251]  inet_recvmsg+0x3c/0xa0
> > [ 3974.388394]  sock_recvmsg+0x3d/0x50
> > [ 3974.392511]  SYSC_recvfrom+0xd3/0x140
> > [ 3974.396826]  ? handle_mm_fault+0xce/0x240
> > [ 3974.401535]  ? SyS_futex+0x71/0x150
> > [ 3974.405653]  SyS_recvfrom+0xe/0x10
> > [ 3974.409673]  entry_SYSCALL_64_fastpath+0x1a/0xa9
> > [ 3974.415056] RIP: 0033:0x7f04ca9315bb
> > [ 3974.419309] RSP: 002b:00007f04c955de70 EFLAGS: 00000246 ORIG_RAX:
> > 000000000000002d
> > [ 3974.428243] RAX: ffffffffffffffda RBX: 0000000000020000 RCX:
> > 00007f04ca9315bb
> > [ 3974.436450] RDX: 0000000000020000 RSI: 00007f04bc0008f0 RDI:
> > 0000000000000004
> > [ 3974.444653] RBP: 0000000000000000 R08: 0000000000000000 R09:
> > 0000000000000000
> > [ 3974.452851] R10: 0000000000000000 R11: 0000000000000246 R12:
> > 00007f04bc0008f0
> > [ 3974.461051] R13: 0000000000034ac8 R14: 00007f04bc020910 R15:
> > 000000000001c480
> > [ 3974.469297] ---[ end trace 6fd472c9e1973d53 ]---
> > 
> >   
> > >>
> > >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > >> index 6cbde310abed..f82225725bc1 100644
> > >> --- a/mm/page_alloc.c
> > >> +++ b/mm/page_alloc.c
> > >> @@ -2481,7 +2481,7 @@ void free_hot_cold_page(struct page *page, bool
> > >> cold)
> > >>      unsigned long pfn = page_to_pfn(page);
> > >>      int migratetype;
> > >>
> > >> -    if (in_interrupt()) {
> > >> +    if (in_irq()) {
> > >>          __free_pages_ok(page, 0);
> > >>          return;
> > >>      }
> > >> @@ -2647,7 +2647,7 @@ static struct page *__rmqueue_pcplist(struct
> > >> zone *zone, int migratetype,
> > >>  {
> > >>      struct page *page;
> > >>
> > >> -    VM_BUG_ON(in_interrupt());
> > >> +    VM_BUG_ON(in_irq());
> > >>
> > >>      do {
> > >>          if (list_empty(list)) {
> > >> @@ -2704,7 +2704,7 @@ struct page *rmqueue(struct zone *preferred_zone,
> > >>      unsigned long flags;
> > >>      struct page *page;
> > >>
> > >> -    if (likely(order == 0) && !in_interrupt()) {
> > >> +    if (likely(order == 0) && !in_irq()) {
> > >>          page = rmqueue_pcplist(preferred_zone, zone, order,
> > >>                  gfp_flags, migratetype);
> > >>          goto out;
> > >>  
> >   



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Page allocator order-0 optimizations merged
  2017-03-27  8:55                   ` Jesper Dangaard Brouer
@ 2017-03-27 12:28                     ` Mel Gorman
  2017-03-27 12:39                     ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 63+ messages in thread
From: Mel Gorman @ 2017-03-27 12:28 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Pankaj Gupta, Tariq Toukan, Tariq Toukan, netdev, akpm, linux-mm,
	Saeed Mahameed

On Mon, Mar 27, 2017 at 10:55:14AM +0200, Jesper Dangaard Brouer wrote:
> On Mon, 27 Mar 2017 03:32:47 -0400 (EDT)
> Pankaj Gupta <pagupta@redhat.com> wrote:
> 
> > Hello,
> > 
> > It looks like a race with softirq and normal process context.
> > 
> > Just thinking if we really want allocations from 'softirqs' to be
> > done using per cpu list? 
> 
> Yes, softirq need fast page allocs. The softirq use-case is refilling
> the DMA RX rings, which is time critical, especially for NIC drivers.
> For this reason most drivers implement different page recycling tricks.
> 
> > Or we can have some check in  'free_hot_cold_page' for softirqs 
> > to check if we are on a path of returning from hard interrupt don't
> > allocate from per cpu list.
> 
> A possible solution, would be use the local_bh_{disable,enable} instead
> of the {preempt_disable,enable} calls.  But it is slower, using numbers
> from [1] (19 vs 11 cycles), thus the expected cycles saving is 38-19=19.
> 
> The problematic part of using local_bh_enable is that this adds a
> softirq/bottom-halves rescheduling point (as it checks for pending
> BHs).  Thus, this might affects real workloads.
> 
> 
> I'm unsure what the best option is.  I'm leaning towards partly
> reverting[1] and go back to doing the slower local_irq_save +
> local_irq_restore as before.
> 
> Afterwards we can add a bulk page alloc+free call, that can amortize
> this 38 cycles cost (of local_irq_{save,restore}).  Or add a function
> call that MUST only be called from contexts with IRQs enabled, which
> allow using the unconditionally local_irq_{disable,enable} as it only
> costs 7 cycles.
> 

It's possible to have a separate, protected list for hard/soft IRQ,
although great care is needed to drain it properly. I have a partial
prototype lying around marked as "interesting if we ever need it", but it
needs more work. It's sufficiently complex that I couldn't rush it as a
fix with the time I currently have available. For 4.11, it's safer to
revert and try again later, bearing in mind that softirqs are in the
critical allocation path for some drivers.

I'll prepare a patch.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Page allocator order-0 optimizations merged
  2017-03-27  8:55                   ` Jesper Dangaard Brouer
  2017-03-27 12:28                     ` Mel Gorman
@ 2017-03-27 12:39                     ` Jesper Dangaard Brouer
  2017-03-27 13:32                       ` Mel Gorman
  2017-03-27 14:15                         ` Matthew Wilcox
  1 sibling, 2 replies; 63+ messages in thread
From: Jesper Dangaard Brouer @ 2017-03-27 12:39 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: Tariq Toukan, Mel Gorman, Tariq Toukan, netdev, akpm, linux-mm,
	Saeed Mahameed, brouer

On Mon, 27 Mar 2017 10:55:14 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> A possible solution, would be use the local_bh_{disable,enable} instead
> of the {preempt_disable,enable} calls.  But it is slower, using numbers
> from [1] (19 vs 11 cycles), thus the expected cycles saving is 38-19=19.
> 
> The problematic part of using local_bh_enable is that this adds a
> softirq/bottom-halves rescheduling point (as it checks for pending
> BHs).  Thus, this might affects real workloads.

I implemented this solution in the patch below... and tested it on mlx5 at
50G with driver page-recycling manually disabled.  It works for me.

To Mel, which do you prefer... a partial revert or something like this?


[PATCH] mm, page_alloc: re-enable softirq use of per-cpu page allocator

From: Jesper Dangaard Brouer <brouer@redhat.com>

IRQ context were excluded from using the Per-Cpu-Pages (PCP) lists
caching of order-0 pages in commit 374ad05ab64d ("mm, page_alloc: only
use per-cpu allocator for irq-safe requests").

This unfortunately also excluded SoftIRQ.  This hurt performance for
the use-case of refilling DMA RX rings in softirq context.

This patch re-allows softirq context, which should be safe because
BH/softirq is disabled while accessing the lists.  It also makes sure
to avoid PCP-list access from both hard-IRQ and NMI context.

One concern with this change is adding a BH (enable) scheduling point
at both PCP alloc and free.

Fixes: 374ad05ab64d ("mm, page_alloc: only use per-cpu allocator for irq-safe requests")
---
 include/trace/events/kmem.h |    2 ++
 mm/page_alloc.c             |   41 ++++++++++++++++++++++++++++++++++-------
 2 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 6b2e154fd23a..ad412ad1b092 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -244,6 +244,8 @@ DECLARE_EVENT_CLASS(mm_page,
 		__entry->order,
 		__entry->migratetype,
 		__entry->order == 0)
+// WARNING: percpu_refill check not 100% correct after commit
+// 374ad05ab64d ("mm, page_alloc: only use per-cpu allocator for irq-safe requests")
 );
 
 DEFINE_EVENT(mm_page, mm_page_alloc_zone_locked,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6cbde310abed..db9ffc8ac538 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2470,6 +2470,25 @@ void mark_free_pages(struct zone *zone)
 }
 #endif /* CONFIG_PM */
 
+static __always_inline int in_irq_or_nmi(void)
+{
+	return in_irq() || in_nmi();
+// XXX: hoping compiler will optimize this (todo verify) into:
+// #define in_irq_or_nmi()	(preempt_count() & (HARDIRQ_MASK | NMI_MASK))
+
+	/* compiler was smart enough to only read __preempt_count once
+	 * but added two branches
+asm code:
+ │       mov    __preempt_count,%eax
+ │       test   $0xf0000,%eax    // HARDIRQ_MASK: 0x000f0000
+ │    ┌──jne    2a
+ │    │  test   $0x100000,%eax   // NMI_MASK:     0x00100000
+ │    │↓ je     3f
+ │ 2a:└─→mov    %rbx,%rdi
+
+	 */
+}
+
 /*
  * Free a 0-order page
  * cold == true ? free a cold page : free a hot page
@@ -2481,7 +2500,11 @@ void free_hot_cold_page(struct page *page, bool cold)
 	unsigned long pfn = page_to_pfn(page);
 	int migratetype;
 
-	if (in_interrupt()) {
+	/*
+	 * Exclude (hard) IRQ and NMI context from using the pcplists.
+	 * But allow softirq context, via disabling BH.
+	 */
+	if (in_irq_or_nmi()) {
 		__free_pages_ok(page, 0);
 		return;
 	}
@@ -2491,7 +2514,7 @@ void free_hot_cold_page(struct page *page, bool cold)
 
 	migratetype = get_pfnblock_migratetype(page, pfn);
 	set_pcppage_migratetype(page, migratetype);
-	preempt_disable();
+	local_bh_disable();
 
 	/*
 	 * We only track unmovable, reclaimable and movable on pcp lists.
@@ -2522,7 +2545,7 @@ void free_hot_cold_page(struct page *page, bool cold)
 	}
 
 out:
-	preempt_enable();
+	local_bh_enable();
 }
 
 /*
@@ -2647,7 +2670,7 @@ static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
 {
 	struct page *page;
 
-	VM_BUG_ON(in_interrupt());
+	VM_BUG_ON(in_irq());
 
 	do {
 		if (list_empty(list)) {
@@ -2680,7 +2703,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	bool cold = ((gfp_flags & __GFP_COLD) != 0);
 	struct page *page;
 
-	preempt_disable();
+	local_bh_disable();
 	pcp = &this_cpu_ptr(zone->pageset)->pcp;
 	list = &pcp->lists[migratetype];
 	page = __rmqueue_pcplist(zone,  migratetype, cold, pcp, list);
@@ -2688,7 +2711,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
 		zone_statistics(preferred_zone, zone);
 	}
-	preempt_enable();
+	local_bh_enable();
 	return page;
 }
 
@@ -2704,7 +2727,11 @@ struct page *rmqueue(struct zone *preferred_zone,
 	unsigned long flags;
 	struct page *page;
 
-	if (likely(order == 0) && !in_interrupt()) {
+	/*
+	 * Exclude (hard) IRQ and NMI context from using the pcplists.
+	 * But allow softirq context, via disabling BH.
+	 */
+	if (likely(order == 0) && !in_irq_or_nmi() ) {
 		page = rmqueue_pcplist(preferred_zone, zone, order,
 				gfp_flags, migratetype);
 		goto out;


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: Page allocator order-0 optimizations merged
  2017-03-27 12:39                     ` Jesper Dangaard Brouer
@ 2017-03-27 13:32                       ` Mel Gorman
  2017-03-28  7:32                         ` Tariq Toukan
  2017-03-28  8:28                         ` Pankaj Gupta
  2017-03-27 14:15                         ` Matthew Wilcox
  1 sibling, 2 replies; 63+ messages in thread
From: Mel Gorman @ 2017-03-27 13:32 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Pankaj Gupta, Tariq Toukan, Tariq Toukan, netdev, akpm, linux-mm,
	Saeed Mahameed

On Mon, Mar 27, 2017 at 02:39:47PM +0200, Jesper Dangaard Brouer wrote:
> On Mon, 27 Mar 2017 10:55:14 +0200
> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> 
> > A possible solution, would be use the local_bh_{disable,enable} instead
> > of the {preempt_disable,enable} calls.  But it is slower, using numbers
> > from [1] (19 vs 11 cycles), thus the expected cycles saving is 38-19=19.
> > 
> > The problematic part of using local_bh_enable is that this adds a
> > softirq/bottom-halves rescheduling point (as it checks for pending
> > BHs).  Thus, this might affects real workloads.
> 
> I implemented this solution in patch below... and tested it on mlx5 at
> 50G with manually disabled driver-page-recycling.  It works for me.
> 
> To Mel, that do you prefer... a partial-revert or something like this?
> 

If Tariq confirms it works for him as well, this looks like a far safer
patch than having a dedicated IRQ-safe queue. Your concern about the BH
scheduling point is valid, but if it proves to be a problem, there is
still the option of a partial revert.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Page allocator order-0 optimizations merged
  2017-03-27 12:39                     ` Jesper Dangaard Brouer
@ 2017-03-27 14:15                         ` Matthew Wilcox
  2017-03-27 14:15                         ` Matthew Wilcox
  1 sibling, 0 replies; 63+ messages in thread
From: Matthew Wilcox @ 2017-03-27 14:15 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Pankaj Gupta, Tariq Toukan, Mel Gorman, Tariq Toukan, netdev,
	akpm, linux-mm, Saeed Mahameed

On Mon, Mar 27, 2017 at 02:39:47PM +0200, Jesper Dangaard Brouer wrote:
>  
> +static __always_inline int in_irq_or_nmi(void)
> +{
> +	return in_irq() || in_nmi();
> +// XXX: hoping compiler will optimize this (todo verify) into:
> +// #define in_irq_or_nmi()	(preempt_count() & (HARDIRQ_MASK | NMI_MASK))
> +
> +	/* compiler was smart enough to only read __preempt_count once
> +	 * but added two branches
> +asm code:
> + │       mov    __preempt_count,%eax
> + │       test   $0xf0000,%eax    // HARDIRQ_MASK: 0x000f0000
> + │    ┌──jne    2a
> + │    │  test   $0x100000,%eax   // NMI_MASK:     0x00100000
> + │    │↓ je     3f
> + │ 2a:└─→mov    %rbx,%rdi
> +
> +	 */
> +}

To be fair, you told the compiler to do that with your use of fancy-pants ||
instead of optimisable |.  Try this instead:

static __always_inline int in_irq_or_nmi(void)
{
	return in_irq() | in_nmi();
}

0000000000001770 <test_fn>:
    1770:       65 8b 05 00 00 00 00    mov    %gs:0x0(%rip),%eax        # 1777 <test_fn+0x7>
                        1773: R_X86_64_PC32     __preempt_count-0x4
#define in_nmi()                (preempt_count() & NMI_MASK)
#define in_task()               (!(preempt_count() & \
                                   (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET)))
static __always_inline int in_irq_or_nmi(void)
{
        return in_irq() | in_nmi();
    1777:       25 00 00 1f 00          and    $0x1f0000,%eax
}
    177c:       c3                      retq   
    177d:       0f 1f 00                nopl   (%rax)


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Page allocator order-0 optimizations merged
  2017-03-27 14:15                         ` Matthew Wilcox
  (?)
@ 2017-03-27 15:15                         ` Jesper Dangaard Brouer
  2017-03-27 16:58                             ` in_irq_or_nmi() Matthew Wilcox
  -1 siblings, 1 reply; 63+ messages in thread
From: Jesper Dangaard Brouer @ 2017-03-27 15:15 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Pankaj Gupta, Tariq Toukan, Mel Gorman, Tariq Toukan, netdev,
	akpm, linux-mm, Saeed Mahameed, brouer

On Mon, 27 Mar 2017 07:15:18 -0700
Matthew Wilcox <willy@infradead.org> wrote:

> On Mon, Mar 27, 2017 at 02:39:47PM +0200, Jesper Dangaard Brouer wrote:
> >  
> > +static __always_inline int in_irq_or_nmi(void)
> > +{
> > +	return in_irq() || in_nmi();
> > +// XXX: hoping compiler will optimize this (todo verify) into:
> > +// #define in_irq_or_nmi()	(preempt_count() & (HARDIRQ_MASK | NMI_MASK))
> > +
> > +	/* compiler was smart enough to only read __preempt_count once
> > +	 * but added two branches
> > +asm code:
> > + │       mov    __preempt_count,%eax
> > + │       test   $0xf0000,%eax    // HARDIRQ_MASK: 0x000f0000
> > + │    ┌──jne    2a
> > + │    │  test   $0x100000,%eax   // NMI_MASK:     0x00100000
> > + │    │↓ je     3f
> > + │ 2a:└─→mov    %rbx,%rdi
> > +
> > +	 */
> > +}  
> 
> To be fair, you told the compiler to do that with your use of fancy-pants ||
> instead of optimisable |.  Try this instead:

Thank you! -- good point! :-)

> static __always_inline int in_irq_or_nmi(void)
> {
> 	return in_irq() | in_nmi();
> }
> 
> 0000000000001770 <test_fn>:
>     1770:       65 8b 05 00 00 00 00    mov    %gs:0x0(%rip),%eax        # 1777 <test_fn+0x7>
>                         1773: R_X86_64_PC32     __preempt_count-0x4
> #define in_nmi()                (preempt_count() & NMI_MASK)
> #define in_task()               (!(preempt_count() & \
>                                    (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET)))
> static __always_inline int in_irq_or_nmi(void)
> {
>         return in_irq() | in_nmi();
>     1777:       25 00 00 1f 00          and    $0x1f0000,%eax
> }
>     177c:       c3                      retq   
>     177d:       0f 1f 00                nopl   (%rax)

And I also verified it worked:

  0.63 │       mov    __preempt_count,%eax
       │     free_hot_cold_page():
  1.25 │       test   $0x1f0000,%eax
       │     ↓ jne    1e4

And this simplification also made the compiler change this into an
unlikely branch, which is a micro-optimization (that I will leave up to
the compiler).
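
For completeness, making that hint explicit instead of relying on the
compiler would just be the usual annotation (sketch only, not part of
the patch):

	if (unlikely(in_irq_or_nmi())) {
		__free_pages_ok(page, 0);
		return;
	}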

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 63+ messages in thread

* in_irq_or_nmi()
  2017-03-27 15:15                         ` Jesper Dangaard Brouer
  2017-03-27 16:58                             ` in_irq_or_nmi() Matthew Wilcox
@ 2017-03-27 16:58                             ` Matthew Wilcox
  0 siblings, 0 replies; 63+ messages in thread
From: Matthew Wilcox @ 2017-03-27 16:58 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Pankaj Gupta, Tariq Toukan, Mel Gorman, Tariq Toukan, netdev,
	akpm, linux-mm, Saeed Mahameed, linux-kernel

On Mon, Mar 27, 2017 at 05:15:00PM +0200, Jesper Dangaard Brouer wrote:
> And I also verified it worked:
> 
>   0.63 │       mov    __preempt_count,%eax
>        │     free_hot_cold_page():
>   1.25 │       test   $0x1f0000,%eax
>        │     ↓ jne    1e4
> 
> And this simplification also made the compiler change this into a
> unlikely branch, which is a micro-optimization (that I will leave up to
> the compiler).

Excellent!  That said, I think we should define in_irq_or_nmi() in
preempt.h, rather than hiding it in the memory allocator.  And since we're
doing that, we might as well make it look like the other definitions:

diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index 7eeceac52dea..af98c29abd9d 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -81,6 +81,7 @@
 #define in_interrupt()		(irq_count())
 #define in_serving_softirq()	(softirq_count() & SOFTIRQ_OFFSET)
 #define in_nmi()		(preempt_count() & NMI_MASK)
+#define in_irq_or_nmi()		(preempt_count() & (HARDIRQ_MASK | NMI_MASK))
 #define in_task()		(!(preempt_count() & \
 				   (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET)))
 

I think there are some genuine questions to be asked about the other
users of in_irq(), namely whether they really want to use in_irq_or_nmi().
There are fewer than a hundred of them, so somebody sufficiently motivated
could take a look in a few days.

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: Page allocator order-0 optimizations merged
  2017-03-27 13:32                       ` Mel Gorman
@ 2017-03-28  7:32                         ` Tariq Toukan
  2017-03-28  8:29                           ` Jesper Dangaard Brouer
  2017-03-28 16:05                           ` Tariq Toukan
  2017-03-28  8:28                         ` Pankaj Gupta
  1 sibling, 2 replies; 63+ messages in thread
From: Tariq Toukan @ 2017-03-28  7:32 UTC (permalink / raw)
  To: Mel Gorman, Jesper Dangaard Brouer
  Cc: Pankaj Gupta, Tariq Toukan, netdev, akpm, linux-mm, Saeed Mahameed



On 27/03/2017 4:32 PM, Mel Gorman wrote:
> On Mon, Mar 27, 2017 at 02:39:47PM +0200, Jesper Dangaard Brouer wrote:
>> On Mon, 27 Mar 2017 10:55:14 +0200
>> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>>
>>> A possible solution, would be use the local_bh_{disable,enable} instead
>>> of the {preempt_disable,enable} calls.  But it is slower, using numbers
>>> from [1] (19 vs 11 cycles), thus the expected cycles saving is 38-19=19.
>>>
>>> The problematic part of using local_bh_enable is that this adds a
>>> softirq/bottom-halves rescheduling point (as it checks for pending
>>> BHs).  Thus, this might affects real workloads.
>>
>> I implemented this solution in patch below... and tested it on mlx5 at
>> 50G with manually disabled driver-page-recycling.  It works for me.
>>
>> To Mel, that do you prefer... a partial-revert or something like this?
>>
>
> If Tariq confirms it works for him as well, this looks far safer patch

Great.
I will test Jesper's patch today in the afternoon.

> than having a dedicate IRQ-safe queue. Your concern about the BH
> scheduling point is valid but if it's proven to be a problem, there is
> still the option of a partial revert.
>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Page allocator order-0 optimizations merged
  2017-03-27 13:32                       ` Mel Gorman
  2017-03-28  7:32                         ` Tariq Toukan
@ 2017-03-28  8:28                         ` Pankaj Gupta
  1 sibling, 0 replies; 63+ messages in thread
From: Pankaj Gupta @ 2017-03-28  8:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Jesper Dangaard Brouer, Tariq Toukan, Tariq Toukan, netdev, akpm,
	linux-mm, Saeed Mahameed


> 
> On Mon, Mar 27, 2017 at 02:39:47PM +0200, Jesper Dangaard Brouer wrote:
> > On Mon, 27 Mar 2017 10:55:14 +0200
> > Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> > 
> > > A possible solution, would be use the local_bh_{disable,enable} instead
> > > of the {preempt_disable,enable} calls.  But it is slower, using numbers
> > > from [1] (19 vs 11 cycles), thus the expected cycles saving is 38-19=19.
> > > 
> > > The problematic part of using local_bh_enable is that this adds a
> > > softirq/bottom-halves rescheduling point (as it checks for pending
> > > BHs).  Thus, this might affects real workloads.
> > 
> > I implemented this solution in patch below... and tested it on mlx5 at
> > 50G with manually disabled driver-page-recycling.  It works for me.
> > 
> > To Mel, that do you prefer... a partial-revert or something like this?
> > 
> 
> If Tariq confirms it works for him as well, this looks far safer patch
> than having a dedicate IRQ-safe queue. Your concern about the BH
> scheduling point is valid but if it's proven to be a problem, there is
> still the option of a partial revert.

I also feel the same.

Thanks,
Pankaj

> 
> --
> Mel Gorman
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Page allocator order-0 optimizations merged
  2017-03-28  7:32                         ` Tariq Toukan
@ 2017-03-28  8:29                           ` Jesper Dangaard Brouer
  2017-03-28 16:05                           ` Tariq Toukan
  1 sibling, 0 replies; 63+ messages in thread
From: Jesper Dangaard Brouer @ 2017-03-28  8:29 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Mel Gorman, Pankaj Gupta, Tariq Toukan, netdev, akpm, linux-mm,
	Saeed Mahameed, brouer

On Tue, 28 Mar 2017 10:32:19 +0300
Tariq Toukan <ttoukan.linux@gmail.com> wrote:

> On 27/03/2017 4:32 PM, Mel Gorman wrote:
> > On Mon, Mar 27, 2017 at 02:39:47PM +0200, Jesper Dangaard Brouer wrote:  
> >> On Mon, 27 Mar 2017 10:55:14 +0200
> >> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> >>  
> >>> A possible solution, would be use the local_bh_{disable,enable} instead
> >>> of the {preempt_disable,enable} calls.  But it is slower, using numbers
> >>> from [1] (19 vs 11 cycles), thus the expected cycles saving is 38-19=19.
> >>>
> >>> The problematic part of using local_bh_enable is that this adds a
> >>> softirq/bottom-halves rescheduling point (as it checks for pending
> >>> BHs).  Thus, this might affects real workloads.  
> >>
> >> I implemented this solution in patch below... and tested it on mlx5 at
> >> 50G with manually disabled driver-page-recycling.  It works for me.
> >>
> >> To Mel, that do you prefer... a partial-revert or something like this?
> >>  
> >
> > If Tariq confirms it works for him as well, this looks far safer patch  
> 
> Great.
> I will test Jesper's patch today in the afternoon.

Good to hear :-)

> > than having a dedicate IRQ-safe queue. Your concern about the BH
> > scheduling point is valid but if it's proven to be a problem, there is
> > still the option of a partial revert.

I wanted to evaluate my own BH scheduling point concern, but I could
not, because I ran into a softirq accounting regression (which I
bisected, see [1]).  AFAIK this should not affect Tariq's
multi-TCP-stream test (netperf TCP stream testing works fine on my
testlab).

[1] http://lkml.kernel.org/r/20170328101403.34a82fbf@redhat.com
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Page allocator order-0 optimizations merged
  2017-03-28  7:32                         ` Tariq Toukan
  2017-03-28  8:29                           ` Jesper Dangaard Brouer
@ 2017-03-28 16:05                           ` Tariq Toukan
  2017-03-28 18:24                             ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 63+ messages in thread
From: Tariq Toukan @ 2017-03-28 16:05 UTC (permalink / raw)
  To: Mel Gorman, Jesper Dangaard Brouer
  Cc: Pankaj Gupta, Tariq Toukan, netdev, akpm, linux-mm, Saeed Mahameed



On 28/03/2017 10:32 AM, Tariq Toukan wrote:
>
>
> On 27/03/2017 4:32 PM, Mel Gorman wrote:
>> On Mon, Mar 27, 2017 at 02:39:47PM +0200, Jesper Dangaard Brouer wrote:
>>> On Mon, 27 Mar 2017 10:55:14 +0200
>>> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>>>
>>>> A possible solution, would be use the local_bh_{disable,enable} instead
>>>> of the {preempt_disable,enable} calls.  But it is slower, using numbers
>>>> from [1] (19 vs 11 cycles), thus the expected cycles saving is
>>>> 38-19=19.
>>>>
>>>> The problematic part of using local_bh_enable is that this adds a
>>>> softirq/bottom-halves rescheduling point (as it checks for pending
>>>> BHs).  Thus, this might affects real workloads.
>>>
>>> I implemented this solution in patch below... and tested it on mlx5 at
>>> 50G with manually disabled driver-page-recycling.  It works for me.
>>>
>>> To Mel, that do you prefer... a partial-revert or something like this?
>>>
>>
>> If Tariq confirms it works for him as well, this looks far safer patch
>
> Great.
> I will test Jesper's patch today in the afternoon.
>

It looks very good!
I get line-rate (94Gbits/sec) with 8 streams, in comparison to less than 
55Gbits/sec before.

Many thanks guys.

>> than having a dedicate IRQ-safe queue. Your concern about the BH
>> scheduling point is valid but if it's proven to be a problem, there is
>> still the option of a partial revert.
>>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Page allocator order-0 optimizations merged
  2017-03-28 16:05                           ` Tariq Toukan
@ 2017-03-28 18:24                             ` Jesper Dangaard Brouer
  2017-03-29  7:13                               ` Tariq Toukan
  0 siblings, 1 reply; 63+ messages in thread
From: Jesper Dangaard Brouer @ 2017-03-28 18:24 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Mel Gorman, Pankaj Gupta, Tariq Toukan, netdev, akpm, linux-mm,
	Saeed Mahameed, brouer

On Tue, 28 Mar 2017 19:05:12 +0300
Tariq Toukan <ttoukan.linux@gmail.com> wrote:

> On 28/03/2017 10:32 AM, Tariq Toukan wrote:
> >
> >
> > On 27/03/2017 4:32 PM, Mel Gorman wrote:  
> >> On Mon, Mar 27, 2017 at 02:39:47PM +0200, Jesper Dangaard Brouer wrote:  
> >>> On Mon, 27 Mar 2017 10:55:14 +0200
> >>> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> >>>  
> >>>> A possible solution, would be use the local_bh_{disable,enable} instead
> >>>> of the {preempt_disable,enable} calls.  But it is slower, using numbers
> >>>> from [1] (19 vs 11 cycles), thus the expected cycles saving is
> >>>> 38-19=19.
> >>>>
> >>>> The problematic part of using local_bh_enable is that this adds a
> >>>> softirq/bottom-halves rescheduling point (as it checks for pending
> >>>> BHs).  Thus, this might affects real workloads.  
> >>>
> >>> I implemented this solution in patch below... and tested it on mlx5 at
> >>> 50G with manually disabled driver-page-recycling.  It works for me.
> >>>
> >>> To Mel, that do you prefer... a partial-revert or something like this?
> >>>  
> >>
> >> If Tariq confirms it works for him as well, this looks far safer patch  
> >
> > Great.
> > I will test Jesper's patch today in the afternoon.
> >  
> 
> It looks very good!
> I get line-rate (94Gbits/sec) with 8 streams, in comparison to less than 
> 55Gbits/sec before.

Just confirming, this is when you have disabled mlx5 driver
page-recycling, right?


> >> than having a dedicate IRQ-safe queue. Your concern about the BH
> >> scheduling point is valid but if it's proven to be a problem, there is
> >> still the option of a partial revert.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Page allocator order-0 optimizations merged
  2017-03-28 18:24                             ` Jesper Dangaard Brouer
@ 2017-03-29  7:13                               ` Tariq Toukan
  0 siblings, 0 replies; 63+ messages in thread
From: Tariq Toukan @ 2017-03-29  7:13 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Mel Gorman, Pankaj Gupta, Tariq Toukan, netdev, akpm, linux-mm,
	Saeed Mahameed



On 28/03/2017 9:24 PM, Jesper Dangaard Brouer wrote:
> On Tue, 28 Mar 2017 19:05:12 +0300
> Tariq Toukan <ttoukan.linux@gmail.com> wrote:
>
>> On 28/03/2017 10:32 AM, Tariq Toukan wrote:
>>>
>>>
>>> On 27/03/2017 4:32 PM, Mel Gorman wrote:
>>>> On Mon, Mar 27, 2017 at 02:39:47PM +0200, Jesper Dangaard Brouer wrote:
>>>>> On Mon, 27 Mar 2017 10:55:14 +0200
>>>>> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>>>>>
>>>>>> A possible solution, would be use the local_bh_{disable,enable} instead
>>>>>> of the {preempt_disable,enable} calls.  But it is slower, using numbers
>>>>>> from [1] (19 vs 11 cycles), thus the expected cycles saving is
>>>>>> 38-19=19.
>>>>>>
>>>>>> The problematic part of using local_bh_enable is that this adds a
>>>>>> softirq/bottom-halves rescheduling point (as it checks for pending
>>>>>> BHs).  Thus, this might affects real workloads.
>>>>>
>>>>> I implemented this solution in patch below... and tested it on mlx5 at
>>>>> 50G with manually disabled driver-page-recycling.  It works for me.
>>>>>
>>>>> To Mel, that do you prefer... a partial-revert or something like this?
>>>>>
>>>>
>>>> If Tariq confirms it works for him as well, this looks far safer patch
>>>
>>> Great.
>>> I will test Jesper's patch today in the afternoon.
>>>
>>
>> It looks very good!
>> I get line-rate (94Gbits/sec) with 8 streams, in comparison to less than
>> 55Gbits/sec before.
>
> Just confirming, this is when you have disabled mlx5 driver
> page-recycling, right?
>
>
Right.
This is a great result!

>>>> than having a dedicate IRQ-safe queue. Your concern about the BH
>>>> scheduling point is valid but if it's proven to be a problem, there is
>>>> still the option of a partial revert.
>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: in_irq_or_nmi()
  2017-03-27 16:58                             ` in_irq_or_nmi() Matthew Wilcox
  (?)
@ 2017-03-29  8:12                               ` Peter Zijlstra
  -1 siblings, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2017-03-29  8:12 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jesper Dangaard Brouer, Pankaj Gupta, Tariq Toukan, Mel Gorman,
	Tariq Toukan, netdev, akpm, linux-mm, Saeed Mahameed,
	linux-kernel

On Mon, Mar 27, 2017 at 09:58:17AM -0700, Matthew Wilcox wrote:
> On Mon, Mar 27, 2017 at 05:15:00PM +0200, Jesper Dangaard Brouer wrote:
> > And I also verified it worked:
> > 
> >   0.63 │       mov    __preempt_count,%eax
> >        │     free_hot_cold_page():
> >   1.25 │       test   $0x1f0000,%eax
> >        │     ↓ jne    1e4
> > 
> > And this simplification also made the compiler change this into a
> > unlikely branch, which is a micro-optimization (that I will leave up to
> > the compiler).
> 
> Excellent!  That said, I think we should define in_irq_or_nmi() in
> preempt.h, rather than hiding it in the memory allocator.  And since we're
> doing that, we might as well make it look like the other definitions:
> 
> diff --git a/include/linux/preempt.h b/include/linux/preempt.h
> index 7eeceac52dea..af98c29abd9d 100644
> --- a/include/linux/preempt.h
> +++ b/include/linux/preempt.h
> @@ -81,6 +81,7 @@
>  #define in_interrupt()		(irq_count())
>  #define in_serving_softirq()	(softirq_count() & SOFTIRQ_OFFSET)
>  #define in_nmi()		(preempt_count() & NMI_MASK)
> +#define in_irq_or_nmi()		(preempt_count() & (HARDIRQ_MASK | NMI_MASK))
>  #define in_task()		(!(preempt_count() & \
>  				   (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET)))
>  

No, that's horrible. Also, wth is this about? A memory allocator that
needs in_nmi()? That sounds beyond broken.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: in_irq_or_nmi()
  2017-03-29  8:12                               ` in_irq_or_nmi() Peter Zijlstra
@ 2017-03-29  8:59                                 ` Jesper Dangaard Brouer
  -1 siblings, 0 replies; 63+ messages in thread
From: Jesper Dangaard Brouer @ 2017-03-29  8:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Matthew Wilcox, Pankaj Gupta, Tariq Toukan, Mel Gorman,
	Tariq Toukan, netdev, akpm, linux-mm, Saeed Mahameed,
	linux-kernel, brouer

On Wed, 29 Mar 2017 10:12:19 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Mar 27, 2017 at 09:58:17AM -0700, Matthew Wilcox wrote:
> > On Mon, Mar 27, 2017 at 05:15:00PM +0200, Jesper Dangaard Brouer wrote:  
> > > And I also verified it worked:
> > > 
> > >   0.63 │       mov    __preempt_count,%eax
> > >        │     free_hot_cold_page():
> > >   1.25 │       test   $0x1f0000,%eax
> > >        │     ↓ jne    1e4
> > > 
> > > And this simplification also made the compiler change this into a
> > > unlikely branch, which is a micro-optimization (that I will leave up to
> > > the compiler).  
> > 
> > Excellent!  That said, I think we should define in_irq_or_nmi() in
> > preempt.h, rather than hiding it in the memory allocator.  And since we're
> > doing that, we might as well make it look like the other definitions:
> > 
> > diff --git a/include/linux/preempt.h b/include/linux/preempt.h
> > index 7eeceac52dea..af98c29abd9d 100644
> > --- a/include/linux/preempt.h
> > +++ b/include/linux/preempt.h
> > @@ -81,6 +81,7 @@
> >  #define in_interrupt()		(irq_count())
> >  #define in_serving_softirq()	(softirq_count() & SOFTIRQ_OFFSET)
> >  #define in_nmi()		(preempt_count() & NMI_MASK)
> > +#define in_irq_or_nmi()		(preempt_count() & (HARDIRQ_MASK | NMI_MASK))
> >  #define in_task()		(!(preempt_count() & \
> >  				   (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET)))
> >    
> 
> No, that's horrible. Also, wth is this about? A memory allocator that
> needs in_nmi()? That sounds beyond broken.

It is the other way around. We want to exclude NMI and HARDIRQ from
using the per-cpu-pages (pcp) lists "order-0 cache" (they will
fall through to the normal buddy allocator path).
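
To make that concrete, here is a small stand-alone sketch (plain C, not
kernel code) of the classification the patch does.  The HARDIRQ/NMI mask
values are the ones quoted in the asm comment earlier in the thread; the
SOFTIRQ_OFFSET value and the helper name are just assumptions for the
example:

#include <stdbool.h>
#include <stdio.h>

#define SOFTIRQ_OFFSET 0x00000100u  /* assumed value, example only */
#define HARDIRQ_MASK   0x000f0000u  /* as in the asm comment above */
#define NMI_MASK       0x00100000u  /* as in the asm comment above */

/* Task and softirq context may use the pcp lists; hard-IRQ and NMI
 * fall through to the buddy allocator. */
static bool may_use_pcp_lists(unsigned int preempt_count)
{
	return !(preempt_count & (HARDIRQ_MASK | NMI_MASK));
}

int main(void)
{
	printf("task    -> %d\n", may_use_pcp_lists(0));              /* 1 */
	printf("softirq -> %d\n", may_use_pcp_lists(SOFTIRQ_OFFSET)); /* 1 */
	printf("hardirq -> %d\n", may_use_pcp_lists(HARDIRQ_MASK));   /* 0 */
	printf("NMI     -> %d\n", may_use_pcp_lists(NMI_MASK));       /* 0 */
	return 0;
}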

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: in_irq_or_nmi()
  2017-03-29  8:59                                 ` in_irq_or_nmi() Jesper Dangaard Brouer
  (?)
@ 2017-03-29  9:19                                   ` Peter Zijlstra
  -1 siblings, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2017-03-29  9:19 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Matthew Wilcox, Pankaj Gupta, Tariq Toukan, Mel Gorman,
	Tariq Toukan, netdev, akpm, linux-mm, Saeed Mahameed,
	linux-kernel

On Wed, Mar 29, 2017 at 10:59:28AM +0200, Jesper Dangaard Brouer wrote:
> On Wed, 29 Mar 2017 10:12:19 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Mon, Mar 27, 2017 at 09:58:17AM -0700, Matthew Wilcox wrote:
> > > On Mon, Mar 27, 2017 at 05:15:00PM +0200, Jesper Dangaard Brouer wrote:  
> > > > And I also verified it worked:
> > > > 
> > > >   0.63 │       mov    __preempt_count,%eax
> > > >        │     free_hot_cold_page():
> > > >   1.25 │       test   $0x1f0000,%eax
> > > >        │     ↓ jne    1e4
> > > > 
> > > > And this simplification also made the compiler change this into a
> > > > unlikely branch, which is a micro-optimization (that I will leave up to
> > > > the compiler).  
> > > 
> > > Excellent!  That said, I think we should define in_irq_or_nmi() in
> > > preempt.h, rather than hiding it in the memory allocator.  And since we're
> > > doing that, we might as well make it look like the other definitions:
> > > 
> > > diff --git a/include/linux/preempt.h b/include/linux/preempt.h
> > > index 7eeceac52dea..af98c29abd9d 100644
> > > --- a/include/linux/preempt.h
> > > +++ b/include/linux/preempt.h
> > > @@ -81,6 +81,7 @@
> > >  #define in_interrupt()		(irq_count())
> > >  #define in_serving_softirq()	(softirq_count() & SOFTIRQ_OFFSET)
> > >  #define in_nmi()		(preempt_count() & NMI_MASK)
> > > +#define in_irq_or_nmi()		(preempt_count() & (HARDIRQ_MASK | NMI_MASK))
> > >  #define in_task()		(!(preempt_count() & \
> > >  				   (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET)))
> > >    
> > 
> > No, that's horrible. Also, wth is this about? A memory allocator that
> > needs in_nmi()? That sounds beyond broken.
> 
> It is the other way around. We want to exclude NMI and HARDIRQ from
> using the per-cpu-pages (pcp) lists "order-0 cache" (they will
> fall-through using the normal buddy allocator path).

Any in_nmi() code arriving at the allocator is broken. No need to fix
the allocator.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: in_irq_or_nmi()
  2017-03-29  9:19                                   ` in_irq_or_nmi() Peter Zijlstra
@ 2017-03-29 18:12                                     ` Matthew Wilcox
  -1 siblings, 0 replies; 63+ messages in thread
From: Matthew Wilcox @ 2017-03-29 18:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jesper Dangaard Brouer, Pankaj Gupta, Tariq Toukan, Mel Gorman,
	Tariq Toukan, netdev, akpm, linux-mm, Saeed Mahameed,
	linux-kernel

On Wed, Mar 29, 2017 at 11:19:49AM +0200, Peter Zijlstra wrote:
> On Wed, Mar 29, 2017 at 10:59:28AM +0200, Jesper Dangaard Brouer wrote:
> > On Wed, 29 Mar 2017 10:12:19 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:
> > > No, that's horrible. Also, wth is this about? A memory allocator that
> > > needs in_nmi()? That sounds beyond broken.
> > 
> > It is the other way around. We want to exclude NMI and HARDIRQ from
> > using the per-cpu-pages (pcp) lists "order-0 cache" (they will
> > fall-through using the normal buddy allocator path).
> 
> Any in_nmi() code arriving at the allocator is broken. No need to fix
> the allocator.

That's demonstrably true.  You can't grab a spinlock in NMI code and
the first thing that happens if this in_irq_or_nmi() check fails is ...
        spin_lock_irqsave(&zone->lock, flags);
so this patch should just use in_irq().

(the concept of NMI code needing to allocate memory was blowing my mind
a little bit)
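
For reference, with that change the free-path gate would go back to
looking like the earlier in_irq() version posted in this thread, keeping
only the BH protection (a sketch, not a tested patch):

	/* Only hard-IRQ bypasses the pcp lists; NMI is assumed never
	 * to reach the page allocator at all. */
	if (in_irq()) {
		__free_pages_ok(page, 0);
		return;
	}

	local_bh_disable();
	/* ... put the page on this CPU's pcp list, as in the patch ... */
	local_bh_enable();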

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: in_irq_or_nmi()
  2017-03-29 18:12                                     ` in_irq_or_nmi() Matthew Wilcox
@ 2017-03-29 19:11                                       ` Jesper Dangaard Brouer
  -1 siblings, 0 replies; 63+ messages in thread
From: Jesper Dangaard Brouer @ 2017-03-29 19:11 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Peter Zijlstra, Pankaj Gupta, Tariq Toukan, Mel Gorman,
	Tariq Toukan, netdev, akpm, linux-mm, Saeed Mahameed,
	linux-kernel, brouer


On Wed, 29 Mar 2017 11:12:26 -0700 Matthew Wilcox <willy@infradead.org> wrote:

> On Wed, Mar 29, 2017 at 11:19:49AM +0200, Peter Zijlstra wrote:
> > On Wed, Mar 29, 2017 at 10:59:28AM +0200, Jesper Dangaard Brouer wrote:  
> > > On Wed, 29 Mar 2017 10:12:19 +0200
> > > Peter Zijlstra <peterz@infradead.org> wrote:  
> > > > No, that's horrible. Also, wth is this about? A memory allocator that
> > > > needs in_nmi()? That sounds beyond broken.  
> > > 
> > > It is the other way around. We want to exclude NMI and HARDIRQ from
> > > using the per-cpu-pages (pcp) lists "order-0 cache" (they will
> > > fall-through using the normal buddy allocator path).  
> > 
> > Any in_nmi() code arriving at the allocator is broken. No need to fix
> > the allocator.  
> 
> That's demonstrably true.  You can't grab a spinlock in NMI code and
> the first thing that happens if this in_irq_or_nmi() check fails is ...
>         spin_lock_irqsave(&zone->lock, flags);
> so this patch should just use in_irq().
> 
> (the concept of NMI code needing to allocate memory was blowing my mind
> a little bit)

Regardless of using in_irq() (or in combination with in_nmi()), I get the
following warning below:

[    0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-4.11.0-rc3-net-next-page-alloc-softirq+ root=UUID=2e8451ff-6797-49b5-8d3a-eed5a42d7dc9 ro rhgb quiet LANG=en_DK.UTF
-8
[    0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
[    0.000000] ------------[ cut here ]------------
[    0.000000] WARNING: CPU: 0 PID: 0 at kernel/softirq.c:161 __local_bh_enable_ip+0x70/0x90
[    0.000000] Modules linked in:
[    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.11.0-rc3-net-next-page-alloc-softirq+ #235
[    0.000000] Hardware name: MSI MS-7984/Z170A GAMING PRO (MS-7984), BIOS 1.60 12/16/2015
[    0.000000] Call Trace:
[    0.000000]  dump_stack+0x4f/0x73
[    0.000000]  __warn+0xcb/0xf0
[    0.000000]  warn_slowpath_null+0x1d/0x20
[    0.000000]  __local_bh_enable_ip+0x70/0x90
[    0.000000]  free_hot_cold_page+0x1a4/0x2f0
[    0.000000]  __free_pages+0x1f/0x30
[    0.000000]  __free_pages_bootmem+0xab/0xb8
[    0.000000]  __free_memory_core+0x79/0x91
[    0.000000]  free_all_bootmem+0xaa/0x122
[    0.000000]  mem_init+0x71/0xa4
[    0.000000]  start_kernel+0x1e5/0x3f1
[    0.000000]  x86_64_start_reservations+0x2a/0x2c
[    0.000000]  x86_64_start_kernel+0x178/0x18b
[    0.000000]  start_cpu+0x14/0x14
[    0.000000]  ? start_cpu+0x14/0x14
[    0.000000] ---[ end trace a57944bec8fc985c ]---
[    0.000000] Memory: 32739472K/33439416K available (7624K kernel code, 1528K rwdata, 3168K rodata, 1860K init, 2260K bss, 699944K reserved, 0K cma-reserved)

And kernel/softirq.c:161 contains:

 WARN_ON_ONCE(in_irq() || irqs_disabled());

Thus, I don't think the change in my RFC patch [1] is safe, i.e.
changing [2] to support softirq allocations by replacing
preempt_disable() with local_bh_disable().

[1] http://lkml.kernel.org/r/20170327143947.4c237e54@redhat.com

[2] commit 374ad05ab64d ("mm, page_alloc: only use per-cpu allocator for irq-safe requests")
 https://git.kernel.org/torvalds/c/374ad05ab64d

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: in_irq_or_nmi() and RFC patch
  2017-03-29 19:11                                       ` in_irq_or_nmi() Jesper Dangaard Brouer
@ 2017-03-29 19:44                                         ` Jesper Dangaard Brouer
  -1 siblings, 0 replies; 63+ messages in thread
From: Jesper Dangaard Brouer @ 2017-03-29 19:44 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Peter Zijlstra, Pankaj Gupta, Tariq Toukan, Mel Gorman,
	Tariq Toukan, netdev, akpm, linux-mm, Saeed Mahameed,
	linux-kernel, brouer

On Wed, 29 Mar 2017 21:11:44 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> On Wed, 29 Mar 2017 11:12:26 -0700 Matthew Wilcox <willy@infradead.org> wrote:
> 
> > On Wed, Mar 29, 2017 at 11:19:49AM +0200, Peter Zijlstra wrote:  
> > > On Wed, Mar 29, 2017 at 10:59:28AM +0200, Jesper Dangaard Brouer wrote:    
> > > > On Wed, 29 Mar 2017 10:12:19 +0200
> > > > Peter Zijlstra <peterz@infradead.org> wrote:    
> > > > > No, that's horrible. Also, wth is this about? A memory allocator that
> > > > > needs in_nmi()? That sounds beyond broken.    
> > > > 
> > > > It is the other way around. We want to exclude NMI and HARDIRQ from
> > > > using the per-cpu-pages (pcp) lists "order-0 cache" (they will
> > > > fall-through using the normal buddy allocator path).    
> > > 
> > > Any in_nmi() code arriving at the allocator is broken. No need to fix
> > > the allocator.    
> > 
> > That's demonstrably true.  You can't grab a spinlock in NMI code and
> > the first thing that happens if this in_irq_or_nmi() check fails is ...
> >         spin_lock_irqsave(&zone->lock, flags);
> > so this patch should just use in_irq().
> > 
> > (the concept of NMI code needing to allocate memory was blowing my mind
> > a little bit)  
> 
> Regardless or using in_irq() (or in combi with in_nmi()) I get the
> following warning below:
> 
> [    0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-4.11.0-rc3-net-next-page-alloc-softirq+ root=UUID=2e8451ff-6797-49b5-8d3a-eed5a42d7dc9 ro rhgb quiet LANG=en_DK.UTF
> -8
> [    0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
> [    0.000000] ------------[ cut here ]------------
> [    0.000000] WARNING: CPU: 0 PID: 0 at kernel/softirq.c:161 __local_bh_enable_ip+0x70/0x90
> [    0.000000] Modules linked in:
> [    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.11.0-rc3-net-next-page-alloc-softirq+ #235
> [    0.000000] Hardware name: MSI MS-7984/Z170A GAMING PRO (MS-7984), BIOS 1.60 12/16/2015
> [    0.000000] Call Trace:
> [    0.000000]  dump_stack+0x4f/0x73
> [    0.000000]  __warn+0xcb/0xf0
> [    0.000000]  warn_slowpath_null+0x1d/0x20
> [    0.000000]  __local_bh_enable_ip+0x70/0x90
> [    0.000000]  free_hot_cold_page+0x1a4/0x2f0
> [    0.000000]  __free_pages+0x1f/0x30
> [    0.000000]  __free_pages_bootmem+0xab/0xb8
> [    0.000000]  __free_memory_core+0x79/0x91
> [    0.000000]  free_all_bootmem+0xaa/0x122
> [    0.000000]  mem_init+0x71/0xa4
> [    0.000000]  start_kernel+0x1e5/0x3f1
> [    0.000000]  x86_64_start_reservations+0x2a/0x2c
> [    0.000000]  x86_64_start_kernel+0x178/0x18b
> [    0.000000]  start_cpu+0x14/0x14
> [    0.000000]  ? start_cpu+0x14/0x14
> [    0.000000] ---[ end trace a57944bec8fc985c ]---
> [    0.000000] Memory: 32739472K/33439416K available (7624K kernel code, 1528K rwdata, 3168K rodata, 1860K init, 2260K bss, 699944K reserved, 0K cma-reserved)
> 
> And kernel/softirq.c:161 contains:
> 
>  WARN_ON_ONCE(in_irq() || irqs_disabled());
> 
> Thus, I don't think the change in my RFC-patch[1] is safe.
> Of changing[2] to support softirq allocations by replacing
> preempt_disable() with local_bh_disable().
> 
> [1] http://lkml.kernel.org/r/20170327143947.4c237e54@redhat.com
> 
> [2] commit 374ad05ab64d ("mm, page_alloc: only use per-cpu allocator for irq-safe requests")
>  https://git.kernel.org/torvalds/c/374ad05ab64d

A patch that avoids the above warning is inlined below, but I'm not
sure if this is the best direction.  Or should we rather consider
reverting part of commit 374ad05ab64d to avoid the softirq performance
regression?
 

[PATCH] mm, page_alloc: re-enable softirq use of per-cpu page allocator

From: Jesper Dangaard Brouer <brouer@redhat.com>

IRQ context was excluded from using the Per-Cpu-Pages (PCP) lists
caching of order-0 pages in commit 374ad05ab64d ("mm, page_alloc: only
use per-cpu allocator for irq-safe requests").

This unfortunately also excluded SoftIRQ, which hurt performance for
the use-case of refilling DMA RX rings in softirq context.

This patch re-allows softirq context, which should be safe, by
disabling BH/softirq while accessing the lists.  PCP-lists access from
hard-IRQ and NMI context must still not be allowed.  Peter Zijlstra
says in_nmi() code never accesses the page allocator, thus it should
be sufficient to only test for !in_irq().

One concern with this change is adding a BH (enable) scheduling point
at both PCP alloc and free.

Fixes: 374ad05ab64d ("mm, page_alloc: only use per-cpu allocator for irq-safe requests")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 mm/page_alloc.c |   26 +++++++++++++++++---------
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6cbde310abed..d7e986967910 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2351,9 +2351,9 @@ static void drain_local_pages_wq(struct work_struct *work)
 	 * cpu which is allright but we also have to make sure to not move to
 	 * a different one.
 	 */
-	preempt_disable();
+	local_bh_disable();
 	drain_local_pages(NULL);
-	preempt_enable();
+	local_bh_enable();
 }
 
 /*
@@ -2481,7 +2481,11 @@ void free_hot_cold_page(struct page *page, bool cold)
 	unsigned long pfn = page_to_pfn(page);
 	int migratetype;
 
-	if (in_interrupt()) {
+	/*
+	 * Exclude (hard) IRQ and NMI context from using the pcplists.
+	 * But allow softirq context, via disabling BH.
+	 */
+	if (in_irq() || irqs_disabled()) {
 		__free_pages_ok(page, 0);
 		return;
 	}
@@ -2491,7 +2495,7 @@ void free_hot_cold_page(struct page *page, bool cold)
 
 	migratetype = get_pfnblock_migratetype(page, pfn);
 	set_pcppage_migratetype(page, migratetype);
-	preempt_disable();
+	local_bh_disable();
 
 	/*
 	 * We only track unmovable, reclaimable and movable on pcp lists.
@@ -2522,7 +2526,7 @@ void free_hot_cold_page(struct page *page, bool cold)
 	}
 
 out:
-	preempt_enable();
+	local_bh_enable();
 }
 
 /*
@@ -2647,7 +2651,7 @@ static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
 {
 	struct page *page;
 
-	VM_BUG_ON(in_interrupt());
+	VM_BUG_ON(in_irq() || irqs_disabled());
 
 	do {
 		if (list_empty(list)) {
@@ -2680,7 +2684,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	bool cold = ((gfp_flags & __GFP_COLD) != 0);
 	struct page *page;
 
-	preempt_disable();
+	local_bh_disable();
 	pcp = &this_cpu_ptr(zone->pageset)->pcp;
 	list = &pcp->lists[migratetype];
 	page = __rmqueue_pcplist(zone,  migratetype, cold, pcp, list);
@@ -2688,7 +2692,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
 		zone_statistics(preferred_zone, zone);
 	}
-	preempt_enable();
+	local_bh_enable();
 	return page;
 }
 
@@ -2704,7 +2708,11 @@ struct page *rmqueue(struct zone *preferred_zone,
 	unsigned long flags;
 	struct page *page;
 
-	if (likely(order == 0) && !in_interrupt()) {
+	/*
+	 * Exclude (hard) IRQ and NMI context from using the pcplists.
+	 * But allow softirq context, via disabling BH.
+	 */
+	if (likely(order == 0) && !(in_irq() || irqs_disabled()) ) {
 		page = rmqueue_pcplist(preferred_zone, zone, order,
 				gfp_flags, migratetype);
 		goto out;


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: in_irq_or_nmi() and RFC patch
  2017-03-29 19:44                                         ` Jesper Dangaard Brouer
@ 2017-03-30  6:49                                           ` Peter Zijlstra
  -1 siblings, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2017-03-30  6:49 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Matthew Wilcox, Pankaj Gupta, Tariq Toukan, Mel Gorman,
	Tariq Toukan, netdev, akpm, linux-mm, Saeed Mahameed,
	linux-kernel

On Wed, Mar 29, 2017 at 09:44:41PM +0200, Jesper Dangaard Brouer wrote:
> @@ -2481,7 +2481,11 @@ void free_hot_cold_page(struct page *page, bool cold)
>  	unsigned long pfn = page_to_pfn(page);
>  	int migratetype;
>  
> -	if (in_interrupt()) {
> +	/*
> +	 * Exclude (hard) IRQ and NMI context from using the pcplists.
> +	 * But allow softirq context, via disabling BH.
> +	 */
> +	if (in_irq() || irqs_disabled()) {

Why do you need irqs_disabled()?  Also, your comment is stale; it still
refers to NMI context.

>  		__free_pages_ok(page, 0);
>  		return;
>  	}

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: in_irq_or_nmi() and RFC patch
  2017-03-30  6:49                                           ` Peter Zijlstra
@ 2017-03-30  7:12                                             ` Jesper Dangaard Brouer
  -1 siblings, 0 replies; 63+ messages in thread
From: Jesper Dangaard Brouer @ 2017-03-30  7:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Matthew Wilcox, Pankaj Gupta, Tariq Toukan, Mel Gorman,
	Tariq Toukan, netdev, akpm, linux-mm, Saeed Mahameed,
	linux-kernel, brouer

On Thu, 30 Mar 2017 08:49:58 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Mar 29, 2017 at 09:44:41PM +0200, Jesper Dangaard Brouer wrote:
> > @@ -2481,7 +2481,11 @@ void free_hot_cold_page(struct page *page, bool cold)
> >  	unsigned long pfn = page_to_pfn(page);
> >  	int migratetype;
> >  
> > -	if (in_interrupt()) {
> > +	/*
> > +	 * Exclude (hard) IRQ and NMI context from using the pcplists.
> > +	 * But allow softirq context, via disabling BH.
> > +	 */
> > +	if (in_irq() || irqs_disabled()) {  
> 
> Why do you need irqs_disabled() ? 

Because further down I call local_bh_enable(), which calls
__local_bh_enable_ip(), which triggers a warning during early boot on:

  WARN_ON_ONCE(in_irq() || irqs_disabled());

It looks like it is for supporting CONFIG_TRACE_IRQFLAGS.


> Also, your comment is stale, it still refers to NMI context.

True; as you told me, excluding NMI is implicit, as it cannot occur here.

> >  		__free_pages_ok(page, 0);
> >  		return;
> >  	}  

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: in_irq_or_nmi() and RFC patch
  2017-03-30  7:12                                             ` Jesper Dangaard Brouer
@ 2017-03-30  7:35                                               ` Peter Zijlstra
  -1 siblings, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2017-03-30  7:35 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Matthew Wilcox, Pankaj Gupta, Tariq Toukan, Mel Gorman,
	Tariq Toukan, netdev, akpm, linux-mm, Saeed Mahameed,
	linux-kernel, Thomas Gleixner

On Thu, Mar 30, 2017 at 09:12:23AM +0200, Jesper Dangaard Brouer wrote:
> On Thu, 30 Mar 2017 08:49:58 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Wed, Mar 29, 2017 at 09:44:41PM +0200, Jesper Dangaard Brouer wrote:
> > > @@ -2481,7 +2481,11 @@ void free_hot_cold_page(struct page *page, bool cold)
> > >  	unsigned long pfn = page_to_pfn(page);
> > >  	int migratetype;
> > >  
> > > -	if (in_interrupt()) {
> > > +	/*
> > > +	 * Exclude (hard) IRQ and NMI context from using the pcplists.
> > > +	 * But allow softirq context, via disabling BH.
> > > +	 */
> > > +	if (in_irq() || irqs_disabled()) {  
> > 
> > Why do you need irqs_disabled() ? 
> 
> Because further down I call local_bh_enable(), which calls
> __local_bh_enable_ip() which triggers a warning during early boot on:
> 
>   WARN_ON_ONCE(in_irq() || irqs_disabled());
> 
> It looks like it is for supporting CONFIG_TRACE_IRQFLAGS.

Ah, no. It's because when you do things like:

	local_irq_disable();
	local_bh_enable();
	local_irq_enable();

you can lose a pending softirq.

Bugger.. that irqs_disabled() is something we could do without.

I'm thinking that when tglx finishes his soft irq disable patches for
x86 (same thing ppc also does) we can go revert all these patches.

Thomas, see:

  https://lkml.kernel.org/r/20170301144845.783f8cad@redhat.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: in_irq_or_nmi() and RFC patch
  2017-03-30  7:35                                               ` Peter Zijlstra
@ 2017-03-30  9:46                                                 ` Jesper Dangaard Brouer
  -1 siblings, 0 replies; 63+ messages in thread
From: Jesper Dangaard Brouer @ 2017-03-30  9:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Matthew Wilcox, Pankaj Gupta, Tariq Toukan, Mel Gorman,
	Tariq Toukan, netdev, akpm, linux-mm, Saeed Mahameed,
	linux-kernel, Thomas Gleixner, brouer

On Thu, 30 Mar 2017 09:35:02 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Mar 30, 2017 at 09:12:23AM +0200, Jesper Dangaard Brouer wrote:
> > On Thu, 30 Mar 2017 08:49:58 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:
> >   
> > > On Wed, Mar 29, 2017 at 09:44:41PM +0200, Jesper Dangaard Brouer wrote:  
> > > > @@ -2481,7 +2481,11 @@ void free_hot_cold_page(struct page *page, bool cold)
> > > >  	unsigned long pfn = page_to_pfn(page);
> > > >  	int migratetype;
> > > >  
> > > > -	if (in_interrupt()) {
> > > > +	/*
> > > > +	 * Exclude (hard) IRQ and NMI context from using the pcplists.
> > > > +	 * But allow softirq context, via disabling BH.
> > > > +	 */
> > > > +	if (in_irq() || irqs_disabled()) {    
> > > 
> > > Why do you need irqs_disabled() ?   
> > 
> > Because further down I call local_bh_enable(), which calls
> > __local_bh_enable_ip() which triggers a warning during early boot on:
> > 
> >   WARN_ON_ONCE(in_irq() || irqs_disabled());
> > 
> > It looks like it is for supporting CONFIG_TRACE_IRQFLAGS.  
> 
> Ah, no. Its because when you do things like:
> 
> 	local_irq_disable();
> 	local_bh_enable();
> 	local_irq_enable();
> 
> you can loose a pending softirq.
> 
> Bugger.. that irqs_disabled() is something we could do without.

Yes, I really don't like adding this irqs_disabled() check here.

> I'm thinking that when tglx finishes his soft irq disable patches for
> x86 (same thing ppc also does) we can go revert all these patches.
> 
> Thomas, see:
> 
>   https://lkml.kernel.org/r/20170301144845.783f8cad@redhat.com

The summary is that Mel and I found a way to optimize the page
allocator by avoiding a local_irq_{save,restore} operation, see commit
374ad05ab64d ("mm, page_alloc: only use per-cpu allocator for irq-safe
requests")  [1] https://git.kernel.org/davem/net-next/c/374ad05ab64d696

But Tariq discovered that this caused a regression for 100Gbit/s NICs,
as the patch excluded softirq from using the per-cpu-pages (PCP)
lists, and DMA RX page-refill happens in softirq context.

Now we are trying to re-allow softirq to use the PCP lists.
My proposal is: https://lkml.kernel.org/r/20170329214441.08332799@redhat.com
The alternative is to revert this optimization.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: in_irq_or_nmi() and RFC patch
  2017-03-29 19:44                                         ` Jesper Dangaard Brouer
@ 2017-03-30 13:04                                           ` Mel Gorman
  -1 siblings, 0 replies; 63+ messages in thread
From: Mel Gorman @ 2017-03-30 13:04 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Matthew Wilcox, Peter Zijlstra, Pankaj Gupta, Tariq Toukan,
	Tariq Toukan, netdev, akpm, linux-mm, Saeed Mahameed,
	linux-kernel

On Wed, Mar 29, 2017 at 09:44:41PM +0200, Jesper Dangaard Brouer wrote:
> > Regardless or using in_irq() (or in combi with in_nmi()) I get the
> > following warning below:
> > 
> > [    0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-4.11.0-rc3-net-next-page-alloc-softirq+ root=UUID=2e8451ff-6797-49b5-8d3a-eed5a42d7dc9 ro rhgb quiet LANG=en_DK.UTF
> > -8
> > [    0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
> > [    0.000000] ------------[ cut here ]------------
> > [    0.000000] WARNING: CPU: 0 PID: 0 at kernel/softirq.c:161 __local_bh_enable_ip+0x70/0x90
> > [    0.000000] Modules linked in:
> > [    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.11.0-rc3-net-next-page-alloc-softirq+ #235
> > [    0.000000] Hardware name: MSI MS-7984/Z170A GAMING PRO (MS-7984), BIOS 1.60 12/16/2015
> > [    0.000000] Call Trace:
> > [    0.000000]  dump_stack+0x4f/0x73
> > [    0.000000]  __warn+0xcb/0xf0
> > [    0.000000]  warn_slowpath_null+0x1d/0x20
> > [    0.000000]  __local_bh_enable_ip+0x70/0x90
> > [    0.000000]  free_hot_cold_page+0x1a4/0x2f0
> > [    0.000000]  __free_pages+0x1f/0x30
> > [    0.000000]  __free_pages_bootmem+0xab/0xb8
> > [    0.000000]  __free_memory_core+0x79/0x91
> > [    0.000000]  free_all_bootmem+0xaa/0x122
> > [    0.000000]  mem_init+0x71/0xa4
> > [    0.000000]  start_kernel+0x1e5/0x3f1
> > [    0.000000]  x86_64_start_reservations+0x2a/0x2c
> > [    0.000000]  x86_64_start_kernel+0x178/0x18b
> > [    0.000000]  start_cpu+0x14/0x14
> > [    0.000000]  ? start_cpu+0x14/0x14
> > [    0.000000] ---[ end trace a57944bec8fc985c ]---
> > [    0.000000] Memory: 32739472K/33439416K available (7624K kernel code, 1528K rwdata, 3168K rodata, 1860K init, 2260K bss, 699944K reserved, 0K cma-reserved)
> > 
> > And kernel/softirq.c:161 contains:
> > 
> >  WARN_ON_ONCE(in_irq() || irqs_disabled());
> > 
> > Thus, I don't think the change in my RFC-patch[1] is safe.
> > Of changing[2] to support softirq allocations by replacing
> > preempt_disable() with local_bh_disable().
> > 
> > [1] http://lkml.kernel.org/r/20170327143947.4c237e54@redhat.com
> > 
> > [2] commit 374ad05ab64d ("mm, page_alloc: only use per-cpu allocator for irq-safe requests")
> >  https://git.kernel.org/torvalds/c/374ad05ab64d
> 
> A patch that avoids the above warning is inlined below, but I'm not
> sure if this is best direction.  Or we should rather consider reverting
> part of commit 374ad05ab64d to avoid the softirq performance regression?
>  

At the moment, I'm not seeing a better alternative. If this works, I
think it would still be far superior in terms of performance to a
revert. As before, if there are bad consequences to adding a BH
rescheduling point then we'll have to revert. However, I don't like a
revert being the first option as it'll keep encouraging drivers to build
sub-allocators to avoid the page allocator.

> [PATCH] mm, page_alloc: re-enable softirq use of per-cpu page allocator
> 
> From: Jesper Dangaard Brouer <brouer@redhat.com>
> 

Other than the slightly misleading comments about NMI, which could
explain "this potentially misses an NMI but an NMI allocating pages is
brain damaged", I don't see a problem. The irqs_disabled() check is
subtle but it's not earth shattering, and it still helps the 100Gbit/s
cases with the limited cycle budget to process packets.
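
For context, the kind of per-packet budget being referred to, as a
back-of-envelope calculation (assuming minimum-size 64-byte frames at
100 Gbit/s line rate and a 3 GHz CPU; illustrative numbers, not taken
from this thread):

#include <stdio.h>

int main(void)
{
	/* 64B frame + 7B preamble + 1B SFD + 12B inter-frame gap */
	double wire_bytes = 64 + 7 + 1 + 12;
	double pps     = 100e9 / (wire_bytes * 8);	/* ~148.8 Mpps */
	double ns_pkt  = 1e9 / pps;			/* ~6.7 ns     */
	double cycles  = ns_pkt * 3.0;			/* at 3 GHz    */

	printf("%.1f Mpps -> %.2f ns/packet -> ~%.0f cycles/packet\n",
	       pps / 1e6, ns_pkt, cycles);
	return 0;
}

That works out to roughly 20 cycles per packet per core, which is why
shaving tens of cycles off the allocator fast path matters at these
rates.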

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: in_irq_or_nmi() and RFC patch
  2017-03-30 13:04                                           ` Mel Gorman
@ 2017-03-30 15:07                                             ` Jesper Dangaard Brouer
  -1 siblings, 0 replies; 63+ messages in thread
From: Jesper Dangaard Brouer @ 2017-03-30 15:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Matthew Wilcox, Peter Zijlstra, Pankaj Gupta, Tariq Toukan,
	Tariq Toukan, netdev, akpm, linux-mm, Saeed Mahameed,
	linux-kernel, brouer

On Thu, 30 Mar 2017 14:04:36 +0100
Mel Gorman <mgorman@techsingularity.net> wrote:

> On Wed, Mar 29, 2017 at 09:44:41PM +0200, Jesper Dangaard Brouer wrote:
> > > Regardless or using in_irq() (or in combi with in_nmi()) I get the
> > > following warning below:
> > > 
> > > [    0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-4.11.0-rc3-net-next-page-alloc-softirq+ root=UUID=2e8451ff-6797-49b5-8d3a-eed5a42d7dc9 ro rhgb quiet LANG=en_DK.UTF
> > > -8
> > > [    0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
> > > [    0.000000] ------------[ cut here ]------------
> > > [    0.000000] WARNING: CPU: 0 PID: 0 at kernel/softirq.c:161 __local_bh_enable_ip+0x70/0x90
> > > [    0.000000] Modules linked in:
> > > [    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.11.0-rc3-net-next-page-alloc-softirq+ #235
> > > [    0.000000] Hardware name: MSI MS-7984/Z170A GAMING PRO (MS-7984), BIOS 1.60 12/16/2015
> > > [    0.000000] Call Trace:
> > > [    0.000000]  dump_stack+0x4f/0x73
> > > [    0.000000]  __warn+0xcb/0xf0
> > > [    0.000000]  warn_slowpath_null+0x1d/0x20
> > > [    0.000000]  __local_bh_enable_ip+0x70/0x90
> > > [    0.000000]  free_hot_cold_page+0x1a4/0x2f0
> > > [    0.000000]  __free_pages+0x1f/0x30
> > > [    0.000000]  __free_pages_bootmem+0xab/0xb8
> > > [    0.000000]  __free_memory_core+0x79/0x91
> > > [    0.000000]  free_all_bootmem+0xaa/0x122
> > > [    0.000000]  mem_init+0x71/0xa4
> > > [    0.000000]  start_kernel+0x1e5/0x3f1
> > > [    0.000000]  x86_64_start_reservations+0x2a/0x2c
> > > [    0.000000]  x86_64_start_kernel+0x178/0x18b
> > > [    0.000000]  start_cpu+0x14/0x14
> > > [    0.000000]  ? start_cpu+0x14/0x14
> > > [    0.000000] ---[ end trace a57944bec8fc985c ]---
> > > [    0.000000] Memory: 32739472K/33439416K available (7624K kernel code, 1528K rwdata, 3168K rodata, 1860K init, 2260K bss, 699944K reserved, 0K cma-reserved)
> > > 
> > > And kernel/softirq.c:161 contains:
> > > 
> > >  WARN_ON_ONCE(in_irq() || irqs_disabled());
> > > 
> > > Thus, I don't think the change in my RFC-patch[1] is safe.
> > > Of changing[2] to support softirq allocations by replacing
> > > preempt_disable() with local_bh_disable().
> > > 
> > > [1] http://lkml.kernel.org/r/20170327143947.4c237e54@redhat.com
> > > 
> > > [2] commit 374ad05ab64d ("mm, page_alloc: only use per-cpu allocator for irq-safe requests")
> > >  https://git.kernel.org/torvalds/c/374ad05ab64d  
> > 
> > A patch that avoids the above warning is inlined below, but I'm not
> > sure if this is best direction.  Or we should rather consider reverting
> > part of commit 374ad05ab64d to avoid the softirq performance regression?
> >    
> 
> At the moment, I'm not seeing a better alternative. If this works, I
> think it would still be far superior in terms of performance than a
> revert. 

Started performance benchmarking:
 163 cycles = current state
 183 cycles = with BH disable + in_irq
 218 cycles = with BH disable + in_irq + irqs_disabled

Thus, the performance numbers unfortunately look bad once we add the
test for irqs_disabled().  The slowdown from replacing preempt_disable
with BH-disable is still a win (we saved 29 cycles before and lose
20; I was expecting the regression to be only 10 cycles).

Bad things happen when adding the test for irqs_disabled().  This is
likely because it uses "pushfq + pop" to read the CPU flags.  I wonder
if x86 experts know whether e.g. using "lahf" would be faster (and
whether it also loads the interrupt flag X86_EFLAGS_IF)?

We basically lost more (163-218=-55) than we gained (29) :-(
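
For reference, on a non-paravirt x86-64 build irqs_disabled() boils
down to reading RFLAGS with pushfq/pop and testing the IF bit, roughly
as in the userspace sketch below (illustrative only; in userspace IF
always reads as set).  Note that "lahf" only copies the low FLAGS byte
(SF/ZF/AF/PF/CF) into AH, so it cannot observe X86_EFLAGS_IF at all.

#include <stdio.h>

#define X86_EFLAGS_IF	0x200UL		/* interrupt enable flag, bit 9 */

/* Roughly what native_save_fl() compiles to. */
static inline unsigned long save_flags(void)
{
	unsigned long flags;

	asm volatile("pushfq\n\tpopq %0" : "=rm" (flags) : : "memory");
	return flags;
}

static inline int my_irqs_disabled(void)
{
	return !(save_flags() & X86_EFLAGS_IF);
}

int main(void)
{
	printf("RFLAGS=%#lx, interrupts %s\n", save_flags(),
	       my_irqs_disabled() ? "disabled" : "enabled");
	return 0;
}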


> As before, if there are bad consequences to adding a BH
> rescheduling point then we'll have to revert. However, I don't like a
> revert being the first option as it'll keep encouraging drivers to build
> sub-allocators to avoid the page allocator.

I'm also motivated by speeding up the page allocator to avoid this
happening in all the drivers.

> > [PATCH] mm, page_alloc: re-enable softirq use of per-cpu page allocator
> > 
> > From: Jesper Dangaard Brouer <brouer@redhat.com>
> >   
> 
> Other than the slightly misleading comments about NMI which could
> explain "this potentially misses an NMI but an NMI allocating pages is
> brain damaged", I don't see a problem. The irqs_disabled() check is a
> subtle but it's not earth shattering and it still helps the 100GiB cases
> with the limited cycle budget to process packets.


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: in_irq_or_nmi() and RFC patch
@ 2017-03-30 15:07                                             ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 63+ messages in thread
From: Jesper Dangaard Brouer @ 2017-03-30 15:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Matthew Wilcox, Peter Zijlstra, Pankaj Gupta, Tariq Toukan,
	Tariq Toukan, netdev, akpm, linux-mm, Saeed Mahameed,
	linux-kernel, brouer

On Thu, 30 Mar 2017 14:04:36 +0100
Mel Gorman <mgorman@techsingularity.net> wrote:

> On Wed, Mar 29, 2017 at 09:44:41PM +0200, Jesper Dangaard Brouer wrote:
> > > Regardless or using in_irq() (or in combi with in_nmi()) I get the
> > > following warning below:
> > > 
> > > [    0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-4.11.0-rc3-net-next-page-alloc-softirq+ root=UUID=2e8451ff-6797-49b5-8d3a-eed5a42d7dc9 ro rhgb quiet LANG=en_DK.UTF
> > > -8
> > > [    0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
> > > [    0.000000] ------------[ cut here ]------------
> > > [    0.000000] WARNING: CPU: 0 PID: 0 at kernel/softirq.c:161 __local_bh_enable_ip+0x70/0x90
> > > [    0.000000] Modules linked in:
> > > [    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.11.0-rc3-net-next-page-alloc-softirq+ #235
> > > [    0.000000] Hardware name: MSI MS-7984/Z170A GAMING PRO (MS-7984), BIOS 1.60 12/16/2015
> > > [    0.000000] Call Trace:
> > > [    0.000000]  dump_stack+0x4f/0x73
> > > [    0.000000]  __warn+0xcb/0xf0
> > > [    0.000000]  warn_slowpath_null+0x1d/0x20
> > > [    0.000000]  __local_bh_enable_ip+0x70/0x90
> > > [    0.000000]  free_hot_cold_page+0x1a4/0x2f0
> > > [    0.000000]  __free_pages+0x1f/0x30
> > > [    0.000000]  __free_pages_bootmem+0xab/0xb8
> > > [    0.000000]  __free_memory_core+0x79/0x91
> > > [    0.000000]  free_all_bootmem+0xaa/0x122
> > > [    0.000000]  mem_init+0x71/0xa4
> > > [    0.000000]  start_kernel+0x1e5/0x3f1
> > > [    0.000000]  x86_64_start_reservations+0x2a/0x2c
> > > [    0.000000]  x86_64_start_kernel+0x178/0x18b
> > > [    0.000000]  start_cpu+0x14/0x14
> > > [    0.000000]  ? start_cpu+0x14/0x14
> > > [    0.000000] ---[ end trace a57944bec8fc985c ]---
> > > [    0.000000] Memory: 32739472K/33439416K available (7624K kernel code, 1528K rwdata, 3168K rodata, 1860K init, 2260K bss, 699944K reserved, 0K cma-reserved)
> > > 
> > > And kernel/softirq.c:161 contains:
> > > 
> > >  WARN_ON_ONCE(in_irq() || irqs_disabled());
> > > 
> > > Thus, I don't think the change in my RFC-patch[1] is safe.
> > > Of changing[2] to support softirq allocations by replacing
> > > preempt_disable() with local_bh_disable().
> > > 
> > > [1] http://lkml.kernel.org/r/20170327143947.4c237e54@redhat.com
> > > 
> > > [2] commit 374ad05ab64d ("mm, page_alloc: only use per-cpu allocator for irq-safe requests")
> > >  https://git.kernel.org/torvalds/c/374ad05ab64d  
> > 
> > A patch that avoids the above warning is inlined below, but I'm not
> > sure if this is best direction.  Or we should rather consider reverting
> > part of commit 374ad05ab64d to avoid the softirq performance regression?
> >    
> 
> At the moment, I'm not seeing a better alternative. If this works, I
> think it would still be far superior in terms of performance than a
> revert. 

Started performance benchmarking:
 163 cycles = current state
 183 cycles = with BH disable + in_irq
 218 cycles = with BH disable + in_irq + irqs_disabled

Thus, the performance numbers unfortunately look bad once we add the
test for irqs_disabled().  The slowdown from replacing preempt_disable
with BH-disable is still a win (we saved 29 cycles before and lose 20
now; I was expecting the regression to be only 10 cycles).

Bad things happen when adding the test for irqs_disabled().  This is
likely because it uses "pushfq + pop" to read the CPU flags.  I wonder
whether x86 experts know if e.g. using "lahf" would be faster (and
whether it also loads the interrupt flag X86_EFLAGS_IF)?
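
For reference, here is a rough sketch of what the extra test boils down to
on x86 (paraphrased from arch/x86/include/asm/irqflags.h; details may differ
between kernel versions):

  /* read EFLAGS via the pushf/pop pair mentioned above */
  static inline unsigned long native_save_fl(void)
  {
          unsigned long flags;

          asm volatile("pushf ; pop %0" : "=rm" (flags) : : "memory");
          return flags;
  }

  /* irqs_disabled() then essentially tests the interrupt-enable bit:
   *   !(native_save_fl() & X86_EFLAGS_IF)
   *
   * "lahf", by contrast, only loads the low byte of FLAGS (SF/ZF/AF/PF/CF)
   * into AH, which does not include the interrupt flag (bit 9,
   * X86_EFLAGS_IF). */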

We basically lost more (163-218=-55) than we gained (29) :-(
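
To make those three numbers concrete, here is a rough sketch of the gating
variants being compared (illustration only; the helper names are invented
and this is not the literal code from the RFC patch):

  /* "163 cycles = current state": upstream 374ad05ab64d */
  static void pcp_free_current(struct page *page)
  {
          if (in_interrupt()) {
                  __free_pages_ok(page, 0);       /* bypass the pcp lists */
                  return;
          }
          preempt_disable();
          /* ... put the page on the this_cpu pcp list ... */
          preempt_enable();
  }

  /* "183 cycles = with BH disable + in_irq": allow softirq users */
  static void pcp_free_bh(struct page *page)
  {
          if (in_irq() || in_nmi()) {
                  __free_pages_ok(page, 0);
                  return;
          }
          local_bh_disable();
          /* ... put the page on the this_cpu pcp list ... */
          local_bh_enable();
  }

  /* "218 cycles = with BH disable + in_irq + irqs_disabled": also bail out
   * when IRQs are off, which is what the boot-time
   * WARN_ON_ONCE(in_irq() || irqs_disabled()) demands */
  static void pcp_free_bh_irqs_disabled(struct page *page)
  {
          if (in_irq() || in_nmi() || irqs_disabled()) {
                  __free_pages_ok(page, 0);
                  return;
          }
          local_bh_disable();
          /* ... put the page on the this_cpu pcp list ... */
          local_bh_enable();
  }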


> As before, if there are bad consequences to adding a BH
> rescheduling point then we'll have to revert. However, I don't like a
> revert being the first option as it'll keep encouraging drivers to build
> sub-allocators to avoid the page allocator.

I'm also motivated by speeding up the page allocator to avoid this
happening in all the drivers.

> > [PATCH] mm, page_alloc: re-enable softirq use of per-cpu page allocator
> > 
> > From: Jesper Dangaard Brouer <brouer@redhat.com>
> >   
> 
> Other than the slightly misleading comments about NMI which could
> explain "this potentially misses an NMI but an NMI allocating pages is
> brain damaged", I don't see a problem. The irqs_disabled() check is
> subtle but it's not earth-shattering and it still helps the 100GiB cases
> with the limited cycle budget to process packets.


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: in_irq_or_nmi() and RFC patch
  2017-03-30 15:07                                             ` Jesper Dangaard Brouer
@ 2017-04-03 12:05                                               ` Mel Gorman
  -1 siblings, 0 replies; 63+ messages in thread
From: Mel Gorman @ 2017-04-03 12:05 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Matthew Wilcox, Peter Zijlstra, Pankaj Gupta, Tariq Toukan,
	Tariq Toukan, netdev, akpm, linux-mm, Saeed Mahameed,
	linux-kernel

On Thu, Mar 30, 2017 at 05:07:08PM +0200, Jesper Dangaard Brouer wrote:
> On Thu, 30 Mar 2017 14:04:36 +0100
> Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> > On Wed, Mar 29, 2017 at 09:44:41PM +0200, Jesper Dangaard Brouer wrote:
> > > > Regardless of using in_irq() (or in combination with in_nmi()) I get the
> > > > following warning below:
> > > > 
> > > > [    0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-4.11.0-rc3-net-next-page-alloc-softirq+ root=UUID=2e8451ff-6797-49b5-8d3a-eed5a42d7dc9 ro rhgb quiet LANG=en_DK.UTF
> > > > -8
> > > > [    0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
> > > > [    0.000000] ------------[ cut here ]------------
> > > > [    0.000000] WARNING: CPU: 0 PID: 0 at kernel/softirq.c:161 __local_bh_enable_ip+0x70/0x90
> > > > [    0.000000] Modules linked in:
> > > > [    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.11.0-rc3-net-next-page-alloc-softirq+ #235
> > > > [    0.000000] Hardware name: MSI MS-7984/Z170A GAMING PRO (MS-7984), BIOS 1.60 12/16/2015
> > > > [    0.000000] Call Trace:
> > > > [    0.000000]  dump_stack+0x4f/0x73
> > > > [    0.000000]  __warn+0xcb/0xf0
> > > > [    0.000000]  warn_slowpath_null+0x1d/0x20
> > > > [    0.000000]  __local_bh_enable_ip+0x70/0x90
> > > > [    0.000000]  free_hot_cold_page+0x1a4/0x2f0
> > > > [    0.000000]  __free_pages+0x1f/0x30
> > > > [    0.000000]  __free_pages_bootmem+0xab/0xb8
> > > > [    0.000000]  __free_memory_core+0x79/0x91
> > > > [    0.000000]  free_all_bootmem+0xaa/0x122
> > > > [    0.000000]  mem_init+0x71/0xa4
> > > > [    0.000000]  start_kernel+0x1e5/0x3f1
> > > > [    0.000000]  x86_64_start_reservations+0x2a/0x2c
> > > > [    0.000000]  x86_64_start_kernel+0x178/0x18b
> > > > [    0.000000]  start_cpu+0x14/0x14
> > > > [    0.000000]  ? start_cpu+0x14/0x14
> > > > [    0.000000] ---[ end trace a57944bec8fc985c ]---
> > > > [    0.000000] Memory: 32739472K/33439416K available (7624K kernel code, 1528K rwdata, 3168K rodata, 1860K init, 2260K bss, 699944K reserved, 0K cma-reserved)
> > > > 
> > > > And kernel/softirq.c:161 contains:
> > > > 
> > > >  WARN_ON_ONCE(in_irq() || irqs_disabled());
> > > > 
> > > > Thus, I don't think the change in my RFC-patch[1] is safe, that is,
> > > > changing [2] to support softirq allocations by replacing
> > > > preempt_disable() with local_bh_disable().
> > > > 
> > > > [1] http://lkml.kernel.org/r/20170327143947.4c237e54@redhat.com
> > > > 
> > > > [2] commit 374ad05ab64d ("mm, page_alloc: only use per-cpu allocator for irq-safe requests")
> > > >  https://git.kernel.org/torvalds/c/374ad05ab64d  
> > > 
> > > A patch that avoids the above warning is inlined below, but I'm not
> > > sure if this is best direction.  Or we should rather consider reverting
> > > part of commit 374ad05ab64d to avoid the softirq performance regression?
> > >    
> > 
> > At the moment, I'm not seeing a better alternative. If this works, I
> > think it would still be far superior in terms of performance than a
> > revert. 
> 
> Started performance benchmarking:
>  163 cycles = current state
>  183 cycles = with BH disable + in_irq
>  218 cycles = with BH disable + in_irq + irqs_disabled
> 
> Thus, the performance numbers unfortunately look bad once we add the
> test for irqs_disabled().  The slowdown from replacing preempt_disable
> with BH-disable is still a win (we saved 29 cycles before and lose 20
> now; I was expecting the regression to be only 10 cycles).
> 

This surprises me because I'm not seeing the same severity of problems
with irqs_disabled. Your patch is slower than what's currently upstream
but it's still far better than a revert. The softirq column in the
middle is your patch versus a full revert, which is the last column.

                                          4.11.0-rc5                 4.11.0-rc5                 4.11.0-rc5
                                             vanilla               softirq-v2r1                revert-v2r1
Amean    alloc-odr0-1               217.00 (  0.00%)           223.00 ( -2.76%)           280.54 (-29.28%)
Amean    alloc-odr0-2               162.23 (  0.00%)           174.46 ( -7.54%)           210.54 (-29.78%)
Amean    alloc-odr0-4               144.15 (  0.00%)           150.38 ( -4.32%)           182.38 (-26.52%)
Amean    alloc-odr0-8               126.00 (  0.00%)           132.15 ( -4.88%)           282.08 (-123.87%)
Amean    alloc-odr0-16              117.00 (  0.00%)           122.00 ( -4.27%)           253.00 (-116.24%)
Amean    alloc-odr0-32              113.00 (  0.00%)           118.00 ( -4.42%)           145.00 (-28.32%)
Amean    alloc-odr0-64              110.77 (  0.00%)           114.31 ( -3.19%)           143.00 (-29.10%)
Amean    alloc-odr0-128             109.00 (  0.00%)           107.69 (  1.20%)           179.54 (-64.71%)
Amean    alloc-odr0-256             121.00 (  0.00%)           125.00 ( -3.31%)           232.23 (-91.93%)
Amean    alloc-odr0-512             123.46 (  0.00%)           129.46 ( -4.86%)           148.08 (-19.94%)
Amean    alloc-odr0-1024            123.23 (  0.00%)           128.92 ( -4.62%)           142.46 (-15.61%)
Amean    alloc-odr0-2048            125.92 (  0.00%)           129.62 ( -2.93%)           147.46 (-17.10%)
Amean    alloc-odr0-4096            133.85 (  0.00%)           139.77 ( -4.43%)           155.69 (-16.32%)
Amean    alloc-odr0-8192            138.08 (  0.00%)           142.92 ( -3.51%)           159.00 (-15.15%)
Amean    alloc-odr0-16384           133.08 (  0.00%)           140.08 ( -5.26%)           157.38 (-18.27%)
Amean    alloc-odr1-1               390.27 (  0.00%)           401.53 ( -2.89%)           389.73 (  0.14%)
Amean    alloc-odr1-2               306.33 (  0.00%)           311.07 ( -1.55%)           304.07 (  0.74%)
Amean    alloc-odr1-4               250.87 (  0.00%)           258.00 ( -2.84%)           256.53 ( -2.26%)
Amean    alloc-odr1-8               221.00 (  0.00%)           231.07 ( -4.56%)           221.20 ( -0.09%)
Amean    alloc-odr1-16              212.07 (  0.00%)           223.07 ( -5.19%)           208.00 (  1.92%)
Amean    alloc-odr1-32              210.07 (  0.00%)           215.20 ( -2.44%)           208.20 (  0.89%)
Amean    alloc-odr1-64              197.00 (  0.00%)           203.00 ( -3.05%)           203.00 ( -3.05%)
Amean    alloc-odr1-128             204.07 (  0.00%)           189.27 (  7.25%)           200.00 (  1.99%)
Amean    alloc-odr1-256             193.33 (  0.00%)           190.53 (  1.45%)           193.80 ( -0.24%)
Amean    alloc-odr1-512             180.60 (  0.00%)           190.33 ( -5.39%)           183.13 ( -1.40%)
Amean    alloc-odr1-1024            176.93 (  0.00%)           182.40 ( -3.09%)           176.33 (  0.34%)
Amean    alloc-odr1-2048            184.60 (  0.00%)           191.33 ( -3.65%)           180.60 (  2.17%)
Amean    alloc-odr1-4096            184.80 (  0.00%)           182.60 (  1.19%)           182.27 (  1.37%)
Amean    alloc-odr1-8192            183.60 (  0.00%)           180.93 (  1.45%)           181.07 (  1.38%)

I revisited having an irq-safe list, but it's excessively complex and
there are significant problems where it's not clear it can be handled
safely, so it's not a short-term option.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: in_irq_or_nmi() and RFC patch
  2017-04-03 12:05                                               ` Mel Gorman
@ 2017-04-05  8:53                                                 ` Mel Gorman
  -1 siblings, 0 replies; 63+ messages in thread
From: Mel Gorman @ 2017-04-05  8:53 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Matthew Wilcox, Peter Zijlstra, Pankaj Gupta, Tariq Toukan,
	Tariq Toukan, netdev, akpm, linux-mm, Saeed Mahameed,
	linux-kernel

On Mon, Apr 03, 2017 at 01:05:06PM +0100, Mel Gorman wrote:
> > Started performance benchmarking:
> >  163 cycles = current state
> >  183 cycles = with BH disable + in_irq
> >  218 cycles = with BH disable + in_irq + irqs_disabled
> > 
> > Thus, the performance numbers unfortunately look bad once we add the
> > test for irqs_disabled().  The slowdown from replacing preempt_disable
> > with BH-disable is still a win (we saved 29 cycles before and lose 20
> > now; I was expecting the regression to be only 10 cycles).
> > 
> 
> This surprises me because I'm not seeing the same severity of problems
> with irqs_disabled. Your patch is slower than what's currently upstream
> but it's still far better than a revert. The softirq column in the
> middle is your patch versus a full revert, which is the last column.
> 

Any objection to resending the local_bh_enable/disable patch with the
in_interrupt() check based on this data, or should I post the revert and
go back to the drawing board?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Page allocator order-0 optimizations merged
  2017-03-01 13:48 ` Page allocator order-0 optimizations merged Jesper Dangaard Brouer
@ 2017-04-10 14:31     ` zhong jiang
  2017-04-10 14:31     ` zhong jiang
  1 sibling, 0 replies; 63+ messages in thread
From: zhong jiang @ 2017-04-10 14:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Jesper Dangaard Brouer, netdev, akpm, linux-mm, Saeed Mahameed,
	Tariq Toukan

On 2017/3/1 21:48, Jesper Dangaard Brouer wrote:
> Hi NetDev community,
>
> I just wanted to make net driver people aware that this MM commit[1] got
> merged and is available in net-next.
>
>  commit 374ad05ab64d ("mm, page_alloc: only use per-cpu allocator for irq-safe requests")
>  [1] https://git.kernel.org/davem/net-next/c/374ad05ab64d696
>
> It provides approx 14% speedup of order-0 page allocations.  I do know
> most drivers do their own page recycling.  Thus, this gain will only be
> seen when that page recycling is insufficient, which is what affected
> Tariq AFAIK.
>
> We are also playing with a bulk page allocator facility[2], which I've
> benchmarked[3][4].  While I'm seeing between 34%-46% improvements by
> bulking, I believe we actually need to do better before it reaches our
> performance target for high-speed networking.
>
> --Jesper
>
> [2] http://lkml.kernel.org/r/20170109163518.6001-5-mgorman%40techsingularity.net
> [3] http://lkml.kernel.org/r/20170116152518.5519dc1e%40redhat.com
> [4] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench04_bulk.c
>
>
> On Mon, 27 Feb 2017 12:25:03 -0800 akpm@linux-foundation.org wrote:
>
>> The patch titled
>>      Subject: mm, page_alloc: only use per-cpu allocator for irq-safe requests
>> has been removed from the -mm tree.  Its filename was
>>      mm-page_alloc-only-use-per-cpu-allocator-for-irq-safe-requests.patch
>>
>> This patch was dropped because it was merged into mainline or a subsystem tree
>>
>> ------------------------------------------------------
>> From: Mel Gorman <mgorman@techsingularity.net>
>> Subject: mm, page_alloc: only use per-cpu allocator for irq-safe requests
>>
>> Many workloads that allocate pages are not handling an interrupt at a
>> time.  As allocation requests may be from IRQ context, it's necessary to
>> disable/enable IRQs for every page allocation.  This cost is the bulk of
>> the free path but also a significant percentage of the allocation path.
>>
>> This patch alters the locking and checks such that only irq-safe
>> allocation requests use the per-cpu allocator.  All others acquire the
>> irq-safe zone->lock and allocate from the buddy allocator.  It relies on
>> disabling preemption to safely access the per-cpu structures.  It could be
>> slightly modified to avoid soft IRQs using it but it's not clear it's
>> worthwhile.
>>
>> This modification may slow allocations from IRQ context slightly but the
>> main gain from the per-cpu allocator is that it scales better for
>> allocations from multiple contexts.  There is an implicit assumption that
>> intensive allocations from IRQ contexts on multiple CPUs from a single
>> NUMA node are rare and that the fast majority of scaling issues are
>> encountered in !IRQ contexts such as page faulting.  It's worth noting
>> that this patch is not required for a bulk page allocator but it
>> significantly reduces the overhead.
>>
>> The following is results from a page allocator micro-benchmark.  Only
>> order-0 is interesting as higher orders do not use the per-cpu allocator
>>
>>                                           4.10.0-rc2                 4.10.0-rc2
>>                                              vanilla               irqsafe-v1r5
>> Amean    alloc-odr0-1               287.15 (  0.00%)           219.00 ( 23.73%)
>> Amean    alloc-odr0-2               221.23 (  0.00%)           183.23 ( 17.18%)
>> Amean    alloc-odr0-4               187.00 (  0.00%)           151.38 ( 19.05%)
>> Amean    alloc-odr0-8               167.54 (  0.00%)           132.77 ( 20.75%)
>> Amean    alloc-odr0-16              156.00 (  0.00%)           123.00 ( 21.15%)
>> Amean    alloc-odr0-32              149.00 (  0.00%)           118.31 ( 20.60%)
>> Amean    alloc-odr0-64              138.77 (  0.00%)           116.00 ( 16.41%)
>> Amean    alloc-odr0-128             145.00 (  0.00%)           118.00 ( 18.62%)
>> Amean    alloc-odr0-256             136.15 (  0.00%)           125.00 (  8.19%)
>> Amean    alloc-odr0-512             147.92 (  0.00%)           121.77 ( 17.68%)
>> Amean    alloc-odr0-1024            147.23 (  0.00%)           126.15 ( 14.32%)
>> Amean    alloc-odr0-2048            155.15 (  0.00%)           129.92 ( 16.26%)
>> Amean    alloc-odr0-4096            164.00 (  0.00%)           136.77 ( 16.60%)
>> Amean    alloc-odr0-8192            166.92 (  0.00%)           138.08 ( 17.28%)
>> Amean    alloc-odr0-16384           159.00 (  0.00%)           138.00 ( 13.21%)
>> Amean    free-odr0-1                165.00 (  0.00%)            89.00 ( 46.06%)
>> Amean    free-odr0-2                113.00 (  0.00%)            63.00 ( 44.25%)
>> Amean    free-odr0-4                 99.00 (  0.00%)            54.00 ( 45.45%)
>> Amean    free-odr0-8                 88.00 (  0.00%)            47.38 ( 46.15%)
>> Amean    free-odr0-16                83.00 (  0.00%)            46.00 ( 44.58%)
>> Amean    free-odr0-32                80.00 (  0.00%)            44.38 ( 44.52%)
>> Amean    free-odr0-64                72.62 (  0.00%)            43.00 ( 40.78%)
>> Amean    free-odr0-128               78.00 (  0.00%)            42.00 ( 46.15%)
>> Amean    free-odr0-256               80.46 (  0.00%)            57.00 ( 29.16%)
>> Amean    free-odr0-512               96.38 (  0.00%)            64.69 ( 32.88%)
>> Amean    free-odr0-1024             107.31 (  0.00%)            72.54 ( 32.40%)
>> Amean    free-odr0-2048             108.92 (  0.00%)            78.08 ( 28.32%)
>> Amean    free-odr0-4096             113.38 (  0.00%)            82.23 ( 27.48%)
>> Amean    free-odr0-8192             112.08 (  0.00%)            82.85 ( 26.08%)
>> Amean    free-odr0-16384            110.38 (  0.00%)            81.92 ( 25.78%)
>> Amean    total-odr0-1               452.15 (  0.00%)           308.00 ( 31.88%)
>> Amean    total-odr0-2               334.23 (  0.00%)           246.23 ( 26.33%)
>> Amean    total-odr0-4               286.00 (  0.00%)           205.38 ( 28.19%)
>> Amean    total-odr0-8               255.54 (  0.00%)           180.15 ( 29.50%)
>> Amean    total-odr0-16              239.00 (  0.00%)           169.00 ( 29.29%)
>> Amean    total-odr0-32              229.00 (  0.00%)           162.69 ( 28.96%)
>> Amean    total-odr0-64              211.38 (  0.00%)           159.00 ( 24.78%)
>> Amean    total-odr0-128             223.00 (  0.00%)           160.00 ( 28.25%)
>> Amean    total-odr0-256             216.62 (  0.00%)           182.00 ( 15.98%)
>> Amean    total-odr0-512             244.31 (  0.00%)           186.46 ( 23.68%)
>> Amean    total-odr0-1024            254.54 (  0.00%)           198.69 ( 21.94%)
>> Amean    total-odr0-2048            264.08 (  0.00%)           208.00 ( 21.24%)
>> Amean    total-odr0-4096            277.38 (  0.00%)           219.00 ( 21.05%)
>> Amean    total-odr0-8192            279.00 (  0.00%)           220.92 ( 20.82%)
>> Amean    total-odr0-16384           269.38 (  0.00%)           219.92 ( 18.36%)
>>
>> This is the alloc, free and total overhead of allocating order-0 pages in
>> batches of 1 page up to 16384 pages.  Avoiding disabling/enabling overhead
>> massively reduces overhead.  Alloc overhead is roughly reduced by 14-20%
>> in most cases.  The free path is reduced by 26-46% and the total reduction
>> is significant.
>>
>> Many users require zeroing of pages from the page allocator which is the
>> vast cost of allocation.  Hence, the impact on a basic page faulting
>> benchmark is not that significant
>>
>>                               4.10.0-rc2            4.10.0-rc2
>>                                  vanilla          irqsafe-v1r5
>> Hmean    page_test   656632.98 (  0.00%)   675536.13 (  2.88%)
>> Hmean    brk_test   3845502.67 (  0.00%)  3867186.94 (  0.56%)
>> Stddev   page_test    10543.29 (  0.00%)     4104.07 ( 61.07%)
>> Stddev   brk_test     33472.36 (  0.00%)    15538.39 ( 53.58%)
>> CoeffVar page_test        1.61 (  0.00%)        0.61 ( 62.15%)
>> CoeffVar brk_test         0.87 (  0.00%)        0.40 ( 53.84%)
>> Max      page_test   666513.33 (  0.00%)   678640.00 (  1.82%)
>> Max      brk_test   3882800.00 (  0.00%)  3887008.66 (  0.11%)
>>
>> This is from aim9 and the most notable outcome is that fault variability
>> is reduced by the patch.  The headline improvement is small as the overall
>> fault cost, zeroing, page table insertion etc dominate relative to
>> disabling/enabling IRQs in the per-cpu allocator.
>>
>> Similarly, little benefit was seen on networking benchmarks both localhost
>> and between physical server/clients where other costs dominate.  It's
>> possible that this will only be noticable on very high speed networks.
>>
>> Jesper Dangaard Brouer independently tested
>> this with a separate microbenchmark from
>> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
>>
>> Micro-benchmarked with [1] page_bench02:
>>  modprobe page_bench02 page_order=0 run_flags=$((2#010)) loops=$((10**8)); \
>>   rmmod page_bench02 ; dmesg --notime | tail -n 4
>>
>> Compared to baseline: 213 cycles(tsc) 53.417 ns
>>  - against this     : 184 cycles(tsc) 46.056 ns
>>  - Saving           : -29 cycles
>>  - Very close to expected 27 cycles saving [see below [2]]
>>
>> Micro benchmarking via time_bench_sample[3], we get the cost of these
>> operations:
>>
>>  time_bench: Type:for_loop                 Per elem: 0 cycles(tsc) 0.232 ns (step:0)
>>  time_bench: Type:spin_lock_unlock         Per elem: 33 cycles(tsc) 8.334 ns (step:0)
>>  time_bench: Type:spin_lock_unlock_irqsave Per elem: 62 cycles(tsc) 15.607 ns (step:0)
>>  time_bench: Type:irqsave_before_lock      Per elem: 57 cycles(tsc) 14.344 ns (step:0)
>>  time_bench: Type:spin_lock_unlock_irq     Per elem: 34 cycles(tsc) 8.560 ns (step:0)
>>  time_bench: Type:simple_irq_disable_before_lock Per elem: 37 cycles(tsc) 9.289 ns (step:0)
>>  time_bench: Type:local_BH_disable_enable  Per elem: 19 cycles(tsc) 4.920 ns (step:0)
>>  time_bench: Type:local_IRQ_disable_enable Per elem: 7 cycles(tsc) 1.864 ns (step:0)
>>  time_bench: Type:local_irq_save_restore   Per elem: 38 cycles(tsc) 9.665 ns (step:0)
>>  [Mel's patch removes a ^^^^^^^^^^^^^^^^]            ^^^^^^^^^ expected saving - preempt cost
>>  time_bench: Type:preempt_disable_enable   Per elem: 11 cycles(tsc) 2.794 ns (step:0)
>>  [adds a preempt  ^^^^^^^^^^^^^^^^^^^^^^]            ^^^^^^^^^ adds this cost
>>  time_bench: Type:funcion_call_cost        Per elem: 6 cycles(tsc) 1.689 ns (step:0)
>>  time_bench: Type:func_ptr_call_cost       Per elem: 11 cycles(tsc) 2.767 ns (step:0)
>>  time_bench: Type:page_alloc_put           Per elem: 211 cycles(tsc) 52.803 ns (step:0)
>>
>> Thus, expected improvement is: 38-11 = 27 cycles.
>>
>> [mgorman@techsingularity.net: s/preempt_enable_no_resched/preempt_enable/]
>>   Link: http://lkml.kernel.org/r/20170208143128.25ahymqlyspjcixu@techsingularity.net
>> Link: http://lkml.kernel.org/r/20170123153906.3122-5-mgorman@techsingularity.net
>> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
>> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
>> Acked-by: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>> ---
>>
>>  mm/page_alloc.c |   43 +++++++++++++++++++++++--------------------
>>  1 file changed, 23 insertions(+), 20 deletions(-)
>>
>> diff -puN mm/page_alloc.c~mm-page_alloc-only-use-per-cpu-allocator-for-irq-safe-requests mm/page_alloc.c
>> --- a/mm/page_alloc.c~mm-page_alloc-only-use-per-cpu-allocator-for-irq-safe-requests
>> +++ a/mm/page_alloc.c
>> @@ -1085,10 +1085,10 @@ static void free_pcppages_bulk(struct zo
>>  {
>>  	int migratetype = 0;
>>  	int batch_free = 0;
>> -	unsigned long nr_scanned;
>> +	unsigned long nr_scanned, flags;
>>  	bool isolated_pageblocks;
>>  
>> -	spin_lock(&zone->lock);
>> +	spin_lock_irqsave(&zone->lock, flags);
>>  	isolated_pageblocks = has_isolate_pageblock(zone);
>>  	nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
>>  	if (nr_scanned)
>> @@ -1137,7 +1137,7 @@ static void free_pcppages_bulk(struct zo
>>  			trace_mm_page_pcpu_drain(page, 0, mt);
>>  		} while (--count && --batch_free && !list_empty(list));
>>  	}
>> -	spin_unlock(&zone->lock);
>> +	spin_unlock_irqrestore(&zone->lock, flags);
>>  }
>>  
>>  static void free_one_page(struct zone *zone,
>> @@ -1145,8 +1145,9 @@ static void free_one_page(struct zone *z
>>  				unsigned int order,
>>  				int migratetype)
>>  {
>> -	unsigned long nr_scanned;
>> -	spin_lock(&zone->lock);
>> +	unsigned long nr_scanned, flags;
>> +	spin_lock_irqsave(&zone->lock, flags);
>> +	__count_vm_events(PGFREE, 1 << order);
>>  	nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
>>  	if (nr_scanned)
>>  		__mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
>> @@ -1156,7 +1157,7 @@ static void free_one_page(struct zone *z
>>  		migratetype = get_pfnblock_migratetype(page, pfn);
>>  	}
>>  	__free_one_page(page, pfn, zone, order, migratetype);
>> -	spin_unlock(&zone->lock);
>> +	spin_unlock_irqrestore(&zone->lock, flags);
>>  }
>>  
>>  static void __meminit __init_single_page(struct page *page, unsigned long pfn,
>> @@ -1234,7 +1235,6 @@ void __meminit reserve_bootmem_region(ph
>>  
>>  static void __free_pages_ok(struct page *page, unsigned int order)
>>  {
>> -	unsigned long flags;
>>  	int migratetype;
>>  	unsigned long pfn = page_to_pfn(page);
>>  
>> @@ -1242,10 +1242,7 @@ static void __free_pages_ok(struct page
>>  		return;
>>  
>>  	migratetype = get_pfnblock_migratetype(page, pfn);
>> -	local_irq_save(flags);
>> -	__count_vm_events(PGFREE, 1 << order);
>>  	free_one_page(page_zone(page), page, pfn, order, migratetype);
>> -	local_irq_restore(flags);
>>  }
>>  
>>  static void __init __free_pages_boot_core(struct page *page, unsigned int order)
>> @@ -2217,8 +2214,9 @@ static int rmqueue_bulk(struct zone *zon
>>  			int migratetype, bool cold)
>>  {
>>  	int i, alloced = 0;
>> +	unsigned long flags;
>>  
>> -	spin_lock(&zone->lock);
>> +	spin_lock_irqsave(&zone->lock, flags);
>>  	for (i = 0; i < count; ++i) {
>>  		struct page *page = __rmqueue(zone, order, migratetype);
>>  		if (unlikely(page == NULL))
>> @@ -2254,7 +2252,7 @@ static int rmqueue_bulk(struct zone *zon
>>  	 * pages added to the pcp list.
>>  	 */
>>  	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
>> -	spin_unlock(&zone->lock);
>> +	spin_unlock_irqrestore(&zone->lock, flags);
>>  	return alloced;
>>  }
>>  
>> @@ -2475,17 +2473,20 @@ void free_hot_cold_page(struct page *pag
>>  {
>>  	struct zone *zone = page_zone(page);
>>  	struct per_cpu_pages *pcp;
>> -	unsigned long flags;
>>  	unsigned long pfn = page_to_pfn(page);
>>  	int migratetype;
>>  
>> +	if (in_interrupt()) {
>> +		__free_pages_ok(page, 0);
>> +		return;
>> +	}
>> +
>>  	if (!free_pcp_prepare(page))
>>  		return;
>>  
>>  	migratetype = get_pfnblock_migratetype(page, pfn);
>>  	set_pcppage_migratetype(page, migratetype);
>> -	local_irq_save(flags);
>> -	__count_vm_event(PGFREE);
>> +	preempt_disable();
>>  
>>  	/*
>>  	 * We only track unmovable, reclaimable and movable on pcp lists.
>> @@ -2502,6 +2503,7 @@ void free_hot_cold_page(struct page *pag
>>  		migratetype = MIGRATE_MOVABLE;
>>  	}
>>  
>> +	__count_vm_event(PGFREE);
>>  	pcp = &this_cpu_ptr(zone->pageset)->pcp;
>>  	if (!cold)
>>  		list_add(&page->lru, &pcp->lists[migratetype]);
>> @@ -2515,7 +2517,7 @@ void free_hot_cold_page(struct page *pag
>>  	}
>>  
>>  out:
>> -	local_irq_restore(flags);
>> +	preempt_enable();
>>  }
>>  
>>  /*
>> @@ -2640,6 +2642,8 @@ static struct page *__rmqueue_pcplist(st
>>  {
>>  	struct page *page;
>>  
>> +	VM_BUG_ON(in_interrupt());
>> +
>>  	do {
>>  		if (list_empty(list)) {
>>  			pcp->count += rmqueue_bulk(zone, 0,
>> @@ -2670,9 +2674,8 @@ static struct page *rmqueue_pcplist(stru
>>  	struct list_head *list;
>>  	bool cold = ((gfp_flags & __GFP_COLD) != 0);
>>  	struct page *page;
>> -	unsigned long flags;
>>  
>> -	local_irq_save(flags);
>> +	preempt_disable();
>>  	pcp = &this_cpu_ptr(zone->pageset)->pcp;
>>  	list = &pcp->lists[migratetype];
>>  	page = __rmqueue_pcplist(zone,  migratetype, cold, pcp, list);
>> @@ -2680,7 +2683,7 @@ static struct page *rmqueue_pcplist(stru
>>  		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
>>  		zone_statistics(preferred_zone, zone);
>>  	}
>> -	local_irq_restore(flags);
>> +	preempt_enable();
>>  	return page;
>>  }
>>  
>> @@ -2696,7 +2699,7 @@ struct page *rmqueue(struct zone *prefer
>>  	unsigned long flags;
>>  	struct page *page;
>>  
>> -	if (likely(order == 0)) {
>> +	if (likely(order == 0) && !in_interrupt()) {
>>  		page = rmqueue_pcplist(preferred_zone, zone, order,
>>  				gfp_flags, migratetype);
>>  		goto out;
>> _
>>
>> Patches currently in -mm which might be from mgorman@techsingularity.net are
>>
>>
>>
Hi, Mel

     I have tested the patch on arm64 and I see a large degradation, measured with a micro-benchmark.
    The partial data is as follows; it is stable, and it shows the allocation and free times.

    before applying the patch:
    order 0 batch 1    alloc 477   free 251    (unit: ns)
    order 0 batch 1    alloc 475   free 250

    after applying the patch:
    order 0 batch 1    alloc 601   free 369    (unit: ns)
    order 0 batch 1    alloc 600   free 370


Thanks
zhongjiang

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Page allocator order-0 optimizations merged
  2017-04-10 14:31     ` zhong jiang
  (?)
@ 2017-04-10 15:10     ` Mel Gorman
  2017-04-11  1:54         ` zhong jiang
  -1 siblings, 1 reply; 63+ messages in thread
From: Mel Gorman @ 2017-04-10 15:10 UTC (permalink / raw)
  To: zhong jiang
  Cc: Jesper Dangaard Brouer, netdev, akpm, linux-mm, Saeed Mahameed,
	Tariq Toukan

On Mon, Apr 10, 2017 at 10:31:48PM +0800, zhong jiang wrote:
> Hi, Mel
> 
>      I have tested the patch on arm64 and see a large degradation in a micro-benchmark.
>      The partial data is below and it is stable; the numbers are the allocate and free times.
>     

What type of allocations is the benchmark doing? In particular, what context
is the microbenchmark allocating from? Lastly, how did you isolate the
patch, did you test two specific commits in mainline or are you comparing
4.10 with 4.11-rcX?

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: Page allocator order-0 optimizations merged
  2017-04-10 15:10     ` Mel Gorman
@ 2017-04-11  1:54         ` zhong jiang
  0 siblings, 0 replies; 63+ messages in thread
From: zhong jiang @ 2017-04-11  1:54 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Jesper Dangaard Brouer, netdev, akpm, linux-mm, Saeed Mahameed,
	Tariq Toukan

On 2017/4/10 23:10, Mel Gorman wrote:
> On Mon, Apr 10, 2017 at 10:31:48PM +0800, zhong jiang wrote:
>> Hi, Mel
>>
>>      I have tested the patch on arm64 and see a large degradation in a micro-benchmark.
>>      The partial data is below and it is stable; the numbers are the allocate and free times.
>>     
> What type of allocations is the benchmark doing? In particular, what context
> is the microbenchmark allocating from? Lastly, how did you isolate the
> patch, did you test two specific commits in mainline or are you comparing
> 4.10 with 4.11-rcX?
>
 Hi, Mel

   The benchmark does order-0 allocations: a module loaded with insmod allocates memory
   via alloc_pages(), so it is not interrupt context.  I tested the patch on the linux 4.1
   stable kernel.  On x86 it gives about a 10% improvement, but on arm64 it shows a large
   degradation.
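
   For reference, a minimal sketch of such a timing module (illustrative only, not the
   actual benchmark; the batch size and names here are made up) could look like this:

	#include <linux/module.h>
	#include <linux/gfp.h>		/* alloc_pages(), __free_pages() */
	#include <linux/ktime.h>	/* ktime_get(), ktime_to_ns() */
	#include <linux/mm.h>

	#define BATCH 1024		/* illustrative batch size */

	static struct page *pages[BATCH];

	static int __init pcp_bench_init(void)
	{
		ktime_t start;
		s64 alloc_ns, free_ns;
		int i, n;

		/* Time order-0 allocations from process context. */
		start = ktime_get();
		for (n = 0; n < BATCH; n++) {
			pages[n] = alloc_pages(GFP_KERNEL, 0);
			if (!pages[n])
				break;
		}
		alloc_ns = ktime_to_ns(ktime_sub(ktime_get(), start));

		/* Time freeing the same pages. */
		start = ktime_get();
		for (i = 0; i < n; i++)
			__free_pages(pages[i], 0);
		free_ns = ktime_to_ns(ktime_sub(ktime_get(), start));

		if (n)
			pr_info("order 0: alloc %lld ns/page, free %lld ns/page\n",
				alloc_ns / n, free_ns / n);

		/* Fail init on purpose so the module does not stay loaded. */
		return -EAGAIN;
	}
	module_init(pcp_bench_init);

	MODULE_LICENSE("GPL");

   Loading it with insmod runs the measurement once in process context, the deliberate
   init failure unloads it again, and dmesg then shows the per-page times.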

  Thanks
  zhongjiang

^ permalink raw reply	[flat|nested] 63+ messages in thread


end of thread, other threads:[~2017-04-11  2:00 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-27 20:25 [merged] mm-page_alloc-only-use-per-cpu-allocator-for-irq-safe-requests.patch removed from -mm tree akpm
2017-03-01 13:48 ` Page allocator order-0 optimizations merged Jesper Dangaard Brouer
2017-03-01 17:36   ` Tariq Toukan
2017-03-01 17:36     ` Tariq Toukan
2017-03-22 17:39     ` Tariq Toukan
2017-03-22 17:39       ` Tariq Toukan
2017-03-22 23:40       ` Mel Gorman
2017-03-23 13:43         ` Jesper Dangaard Brouer
2017-03-23 14:51           ` Mel Gorman
2017-03-26  8:21             ` Tariq Toukan
2017-03-26 10:17               ` Tariq Toukan
2017-03-27  7:32                 ` Pankaj Gupta
2017-03-27  8:55                   ` Jesper Dangaard Brouer
2017-03-27 12:28                     ` Mel Gorman
2017-03-27 12:39                     ` Jesper Dangaard Brouer
2017-03-27 13:32                       ` Mel Gorman
2017-03-28  7:32                         ` Tariq Toukan
2017-03-28  8:29                           ` Jesper Dangaard Brouer
2017-03-28 16:05                           ` Tariq Toukan
2017-03-28 18:24                             ` Jesper Dangaard Brouer
2017-03-29  7:13                               ` Tariq Toukan
2017-03-28  8:28                         ` Pankaj Gupta
2017-03-27 14:15                       ` Matthew Wilcox
2017-03-27 14:15                         ` Matthew Wilcox
2017-03-27 15:15                         ` Jesper Dangaard Brouer
2017-03-27 16:58                           ` in_irq_or_nmi() Matthew Wilcox
2017-03-27 16:58                             ` in_irq_or_nmi() Matthew Wilcox
2017-03-27 16:58                             ` in_irq_or_nmi() Matthew Wilcox
2017-03-29  8:12                             ` in_irq_or_nmi() Peter Zijlstra
2017-03-29  8:12                               ` in_irq_or_nmi() Peter Zijlstra
2017-03-29  8:12                               ` in_irq_or_nmi() Peter Zijlstra
2017-03-29  8:59                               ` in_irq_or_nmi() Jesper Dangaard Brouer
2017-03-29  8:59                                 ` in_irq_or_nmi() Jesper Dangaard Brouer
2017-03-29  9:19                                 ` in_irq_or_nmi() Peter Zijlstra
2017-03-29  9:19                                   ` in_irq_or_nmi() Peter Zijlstra
2017-03-29  9:19                                   ` in_irq_or_nmi() Peter Zijlstra
2017-03-29 18:12                                   ` in_irq_or_nmi() Matthew Wilcox
2017-03-29 18:12                                     ` in_irq_or_nmi() Matthew Wilcox
2017-03-29 19:11                                     ` in_irq_or_nmi() Jesper Dangaard Brouer
2017-03-29 19:11                                       ` in_irq_or_nmi() Jesper Dangaard Brouer
2017-03-29 19:44                                       ` in_irq_or_nmi() and RFC patch Jesper Dangaard Brouer
2017-03-29 19:44                                         ` Jesper Dangaard Brouer
2017-03-30  6:49                                         ` Peter Zijlstra
2017-03-30  6:49                                           ` Peter Zijlstra
2017-03-30  7:12                                           ` Jesper Dangaard Brouer
2017-03-30  7:12                                             ` Jesper Dangaard Brouer
2017-03-30  7:35                                             ` Peter Zijlstra
2017-03-30  7:35                                               ` Peter Zijlstra
2017-03-30  9:46                                               ` Jesper Dangaard Brouer
2017-03-30  9:46                                                 ` Jesper Dangaard Brouer
2017-03-30 13:04                                         ` Mel Gorman
2017-03-30 13:04                                           ` Mel Gorman
2017-03-30 15:07                                           ` Jesper Dangaard Brouer
2017-03-30 15:07                                             ` Jesper Dangaard Brouer
2017-04-03 12:05                                             ` Mel Gorman
2017-04-03 12:05                                               ` Mel Gorman
2017-04-05  8:53                                               ` Mel Gorman
2017-04-05  8:53                                                 ` Mel Gorman
2017-04-10 14:31   ` Page allocator order-0 optimizations merged zhong jiang
2017-04-10 14:31     ` zhong jiang
2017-04-10 15:10     ` Mel Gorman
2017-04-11  1:54       ` zhong jiang
2017-04-11  1:54         ` zhong jiang
