* [RFC PATCH 0/4] Fast noirq bulk page allocator
@ 2017-01-04 11:10 ` Mel Gorman
  0 siblings, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2017-01-04 11:10 UTC
  To: Jesper Dangaard Brouer; +Cc: Linux Kernel, Linux-MM, Mel Gorman

This series is motivated by a conversation led by Jesper Dangaard Brouer at
the last LSF/MM proposing a generic page pool for DMA-coherent pages. Part of
his motivation was the overhead of allocating multiple order-0 pages, which
led some drivers to use high-order allocations and split them. That can be
very slow if high-order pages are unavailable. This long-overdue series aims
to show that raw bulk page allocation can be achieved relatively easily
without introducing a completely new allocator. A new generic page pool
allocator would then ideally focus on just the DMA-coherent part.

The first two patches in the series restructure the allocator such that
it's relatively easy to build a bulk page allocator. The third patch
alters the per-cpu allocator to make it exclusive to !irq requests. This
cuts allocation/free overhead by roughly 30% but it may not be noticeable
to anyone other than users of high-speed networks (I'm not one). The
fourth patch introduces a bulk page allocator with no in-kernel users as
an example for Jesper and others who want to build a page allocator for
DMA-coherent pages. Hopefully it is relatively easy to modify this API
and the one core function to get the semantics they require. Note that
patch 3 is not required for patch 4 but it may be desirable if the bulk
allocations happen from !IRQ context.
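
As a rough illustration of why a bulk interface can be cheaper than
repeated single-page calls, the following userspace toy counts how often
the lock protecting a free list would be taken. It is only a sketch, not
the code from patch 4: struct toy_page, struct toy_pool and the toy_*
functions are hypothetical names invented for this example, and the "lock"
is just a counter.

```c
#include <stddef.h>

#define POOL_PAGES 64

/* Toy stand-in for a page; this is NOT the kernel's struct page. */
struct toy_page {
	struct toy_page *next;
};

/*
 * Toy free list. lock_acquisitions counts how often the (imaginary)
 * lock protecting free_list would have been taken.
 */
struct toy_pool {
	struct toy_page pages[POOL_PAGES];
	struct toy_page *free_list;
	unsigned long lock_acquisitions;
};

/* Put every page of the pool on the free list. */
void toy_pool_init(struct toy_pool *pool)
{
	size_t i;

	pool->free_list = NULL;
	pool->lock_acquisitions = 0;
	for (i = 0; i < POOL_PAGES; i++) {
		pool->pages[i].next = pool->free_list;
		pool->free_list = &pool->pages[i];
	}
}

/* Allocate one page: one lock round trip per page. */
struct toy_page *toy_alloc_page(struct toy_pool *pool)
{
	struct toy_page *page;

	pool->lock_acquisitions++;	/* "lock" */
	page = pool->free_list;
	if (page)
		pool->free_list = page->next;
	/* "unlock" */
	return page;
}

/*
 * Allocate up to nr pages into out[]: one lock round trip for the
 * whole batch. Returns the number of pages actually handed out.
 */
size_t toy_alloc_pages_bulk(struct toy_pool *pool, size_t nr,
			    struct toy_page **out)
{
	size_t got = 0;

	pool->lock_acquisitions++;	/* "lock" once for the batch */
	while (got < nr && pool->free_list) {
		out[got] = pool->free_list;
		pool->free_list = out[got]->next;
		got++;
	}
	/* "unlock" */
	return got;
}
```

Requesting 16 pages one at a time bumps the counter 16 times, while one
toy_alloc_pages_bulk() call for 16 pages bumps it once. The real allocator
has more fixed per-call costs than a single lock, so this shows only the
amortisation idea.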

 include/linux/gfp.h |  23 ++++
 mm/page_alloc.c     | 329 ++++++++++++++++++++++++++++++++++++----------------
 2 files changed, 254 insertions(+), 98 deletions(-)

-- 
2.11.0

* [RFC PATCH 0/4] Fast noirq bulk page allocator v2r7
@ 2017-01-09 16:35 Mel Gorman
  2017-01-09 16:35   ` Mel Gorman
  0 siblings, 1 reply; 47+ messages in thread
From: Mel Gorman @ 2017-01-09 16:35 UTC
  To: Jesper Dangaard Brouer; +Cc: Linux Kernel, Linux-MM, Hillf Danton, Mel Gorman

The biggest changes are in the final patch. In v1, it was a rough untested
prototype. This version corrects a number of issues, has been tested and
includes a comparison between bulk allocating pages and allocating them one
at a time.
While there are still no in-kernel users, it is hoped that the bulk API
would convince network drivers to avoid using high-order allocations. One
slight caveat is that there still may be an advantage to doing the coherent
setup on a high-order page instead of a list of order-0 pages. If that is the
case, it would need to be covered by Jesper's generic page pool allocator.

Changelog since v1
o Remove a scheduler point from the allocation path
o Finalise the bulk allocator and test it

This series is motivated by a conversation led by Jesper Dangaard Brouer at
the last LSF/MM proposing a generic page pool for DMA-coherent pages. Part of
his motivation was the overhead of allocating multiple order-0 pages, which
led some drivers to use high-order allocations and split them. That can be
very slow if high-order pages are unavailable. This long-overdue series aims
to show that raw bulk page allocation can be achieved relatively easily
without introducing a completely new allocator. A new generic page pool
allocator would then ideally focus on just the DMA-coherent part.

The first two patches in the series restructure the allocator such that
it's relatively easy to build a bulk page allocator. The third patch
alters the per-cpu allocator to make it exclusive to !irq requests. This
cuts allocation/free overhead by roughly 30% but it may not be noticeable
to anyone other than users of high-speed networks (I'm not one). The
fourth patch introduces a bulk page allocator with no in-kernel users as
an example for Jesper and others who want to build a page allocator for
DMA-coherent pages. Hopefully it is relatively easy to modify this API
and the one core function to get the semantics they require. Note that
patch 3 is not required for patch 4 but it may be desirable if the bulk
allocations happen from !IRQ context.

A comparison of costs of allocating one page at a time on the vanilla
kernel vs the bulk allocator that forces the per-cpu allocator to be
used from a !irq context is as follows

pagealloc
                                          4.10.0-rc2                 4.10.0-rc2
                                             vanilla                  bulk-v2r7
Amean    alloc-odr0-1               302.85 (  0.00%)           106.62 ( 64.80%)
Amean    alloc-odr0-2               227.85 (  0.00%)            76.38 ( 66.48%)
Amean    alloc-odr0-4               191.23 (  0.00%)            57.23 ( 70.07%)
Amean    alloc-odr0-8               167.54 (  0.00%)            48.77 ( 70.89%)
Amean    alloc-odr0-16              158.54 (  0.00%)            45.38 ( 71.37%)
Amean    alloc-odr0-32              150.46 (  0.00%)            42.77 ( 71.57%)
Amean    alloc-odr0-64              148.23 (  0.00%)            41.00 ( 72.34%)
Amean    alloc-odr0-128             145.00 (  0.00%)            40.08 ( 72.36%)
Amean    alloc-odr0-256             157.00 (  0.00%)            56.00 ( 64.33%)
Amean    alloc-odr0-512             170.00 (  0.00%)            69.00 ( 59.41%)
Amean    alloc-odr0-1024            181.00 (  0.00%)            76.23 ( 57.88%)
Amean    alloc-odr0-2048            186.00 (  0.00%)            81.15 ( 56.37%)
Amean    alloc-odr0-4096            192.92 (  0.00%)            85.92 ( 55.46%)
Amean    alloc-odr0-8192            194.00 (  0.00%)            88.00 ( 54.64%)
Amean    alloc-odr0-16384           202.15 (  0.00%)            89.00 ( 55.97%)
Amean    free-odr0-1                154.92 (  0.00%)            55.69 ( 64.05%)
Amean    free-odr0-2                115.31 (  0.00%)            49.38 ( 57.17%)
Amean    free-odr0-4                 93.31 (  0.00%)            45.38 ( 51.36%)
Amean    free-odr0-8                 82.62 (  0.00%)            44.23 ( 46.46%)
Amean    free-odr0-16                79.00 (  0.00%)            45.00 ( 43.04%)
Amean    free-odr0-32                75.15 (  0.00%)            43.92 ( 41.56%)
Amean    free-odr0-64                74.00 (  0.00%)            43.00 ( 41.89%)
Amean    free-odr0-128               73.00 (  0.00%)            43.00 ( 41.10%)
Amean    free-odr0-256               91.00 (  0.00%)            60.46 ( 33.56%)
Amean    free-odr0-512              108.00 (  0.00%)            76.00 ( 29.63%)
Amean    free-odr0-1024             119.00 (  0.00%)            85.38 ( 28.25%)
Amean    free-odr0-2048             125.08 (  0.00%)            91.23 ( 27.06%)
Amean    free-odr0-4096             130.00 (  0.00%)            95.62 ( 26.45%)
Amean    free-odr0-8192             130.00 (  0.00%)            97.00 ( 25.38%)
Amean    free-odr0-16384            134.46 (  0.00%)            97.46 ( 27.52%)
Amean    total-odr0-1               457.77 (  0.00%)           162.31 ( 64.54%)
Amean    total-odr0-2               343.15 (  0.00%)           125.77 ( 63.35%)
Amean    total-odr0-4               284.54 (  0.00%)           102.62 ( 63.94%)
Amean    total-odr0-8               250.15 (  0.00%)            93.00 ( 62.82%)
Amean    total-odr0-16              237.54 (  0.00%)            90.38 ( 61.95%)
Amean    total-odr0-32              225.62 (  0.00%)            86.69 ( 61.58%)
Amean    total-odr0-64              222.23 (  0.00%)            84.00 ( 62.20%)
Amean    total-odr0-128             218.00 (  0.00%)            83.08 ( 61.89%)
Amean    total-odr0-256             248.00 (  0.00%)           116.46 ( 53.04%)
Amean    total-odr0-512             278.00 (  0.00%)           145.00 ( 47.84%)
Amean    total-odr0-1024            300.00 (  0.00%)           161.62 ( 46.13%)
Amean    total-odr0-2048            311.08 (  0.00%)           172.38 ( 44.58%)
Amean    total-odr0-4096            322.92 (  0.00%)           181.54 ( 43.78%)
Amean    total-odr0-8192            324.00 (  0.00%)           185.00 ( 42.90%)
Amean    total-odr0-16384           336.62 (  0.00%)           186.46 ( 44.61%)

It's roughly a 50-70% reduction of allocation costs and roughly a halving of the
overall cost of allocating/freeing batches of pages.
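
The bracketed percentages in the table appear to be the reduction relative
to the vanilla column, i.e. (vanilla - patched) / vanilla * 100. The
following is a minimal sketch for rechecking rows under that assumption.
pct_reduction() and print_row() are names made up for this example. Because
the table reports rounded means, a recomputed value may differ from the
printed one in the last digit.

```c
#include <stdio.h>

/*
 * Percentage reduction of `patched` relative to `baseline`.
 * Returns 0.0 for a non-positive baseline to avoid dividing by zero.
 */
double pct_reduction(double baseline, double patched)
{
	if (baseline <= 0.0)
		return 0.0;
	return (baseline - patched) / baseline * 100.0;
}

/* Print one row in a layout loosely modelled on the table above. */
void print_row(const char *name, double baseline, double patched)
{
	printf("Amean    %-20s %10.2f (  0.00%%) %16.2f (%6.2f%%)\n",
	       name, baseline, patched, pct_reduction(baseline, patched));
}
```

For example, print_row("alloc-odr0-32", 150.46, 42.77) recomputes the
alloc-odr0-32 row from its two means.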

 include/linux/gfp.h |  24 ++++
 mm/page_alloc.c     | 353 +++++++++++++++++++++++++++++++++++++---------------
 2 files changed, 278 insertions(+), 99 deletions(-)

-- 
2.11.0

* [PATCH 0/4] Use per-cpu allocator for !irq requests and prepare for a bulk allocator v4
@ 2017-01-17  9:29 Mel Gorman
  2017-01-17  9:29   ` Mel Gorman
  0 siblings, 1 reply; 47+ messages in thread
From: Mel Gorman @ 2017-01-17  9:29 UTC
  To: Andrew Morton
  Cc: Linux Kernel, Linux-MM, Vlastimil Babka, Hillf Danton,
	Jesper Dangaard Brouer, Mel Gorman

For Vlastimil, this version passed a few tests with full debugging on
without triggering the additional !in_interrupt() checks. The biggest change
is patch 3 which avoids draining the per-cpu lists from IPI context.

Changelog since v3
o Debugging check in allocation path
o Make it harder to use the free path incorrectly
o Use preempt-safe stats counter
o Do not use IPIs to drain the per-cpu allocator

Changelog since v2
o Add ack's and benchmark data
o Rebase to 4.10-rc3

Changelog since v1
o Remove a scheduler point from the allocation path
o Finalise the bulk allocator and test it

This series is motivated by a conversation led by Jesper Dangaard Brouer at
the last LSF/MM proposing a generic page pool for DMA-coherent pages. Part
of his motivation was the overhead of allocating multiple order-0 pages,
which led some drivers to use high-order allocations and split them. This
is very slow in some cases.

The first two patches in this series restructure the page allocator such
that it is relatively easy to introduce an order-0 bulk page allocator.
A patch exists to do that and has been handed over to Jesper until an
in-kernel user is created. The third patch prevents the per-cpu allocator
being drained from IPI context as that can potentially corrupt the list
after patch four is merged. The final patch alters the per-cpu allocator
to make it exclusive to !irq requests. This cuts allocation/free overhead
by roughly 30%.

Performance tests from both Jesper and me are included in the patch.

 mm/page_alloc.c | 284 ++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 181 insertions(+), 103 deletions(-)

-- 
2.11.0

* [PATCH 0/4] Use per-cpu allocator for !irq requests and prepare for a bulk allocator v5
@ 2017-01-23 15:39 Mel Gorman
  2017-01-23 15:39   ` Mel Gorman
  0 siblings, 1 reply; 47+ messages in thread
From: Mel Gorman @ 2017-01-23 15:39 UTC
  To: Andrew Morton
  Cc: Linux Kernel, Linux-MM, Vlastimil Babka, Hillf Danton,
	Jesper Dangaard Brouer, Mel Gorman

This is rebased on top of mmotm to handle collisions with Vlastimil's
series on cpusets and premature OOMs.

Changelog since v4
o Protect drain with get_online_cpus
o Micro-optimisation of stat updates
o Avoid double preparing a page free

Changelog since v3
o Debugging check in allocation path
o Make it harder to use the free path incorrectly
o Use preempt-safe stats counter
o Do not use IPIs to drain the per-cpu allocator

Changelog since v2
o Add ack's and benchmark data
o Rebase to 4.10-rc3

Changelog since v1
o Remove a scheduler point from the allocation path
o Finalise the bulk allocator and test it

This series is motivated by a conversation led by Jesper Dangaard Brouer at
the last LSF/MM proposing a generic page pool for DMA-coherent pages. Part
of his motivation was the overhead of allocating multiple order-0 pages,
which led some drivers to use high-order allocations and split them. This
is very slow in some cases.

The first two patches in this series restructure the page allocator such
that it is relatively easy to introduce an order-0 bulk page allocator.
A patch exists to do that and has been handed over to Jesper until an
in-kernel user is created. The third patch prevents the per-cpu allocator
being drained from IPI context as that can potentially corrupt the list
after patch four is merged. The final patch alters the per-cpu allocator
to make it exclusive to !irq requests. This cuts allocation/free overhead
by roughly 30%.

Performance tests from both Jesper and me are included in the patch.

 mm/page_alloc.c | 282 ++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 181 insertions(+), 101 deletions(-)

-- 
2.11.0


end of thread, other threads:[~2017-01-24 10:23 UTC | newest]

Thread overview: 47+ messages
2017-01-04 11:10 [RFC PATCH 0/4] Fast noirq bulk page allocator Mel Gorman
2017-01-04 11:10 ` Mel Gorman
2017-01-04 11:10 ` [PATCH 1/4] mm, page_alloc: Split buffered_rmqueue Mel Gorman
2017-01-04 11:10   ` Mel Gorman
2017-01-04 11:10 ` [PATCH 2/4] mm, page_alloc: Split alloc_pages_nodemask Mel Gorman
2017-01-04 11:10   ` Mel Gorman
2017-01-04 11:10 ` [PATCH 3/4] mm, page_allocator: Only use per-cpu allocator for irq-safe requests Mel Gorman
2017-01-04 11:10   ` Mel Gorman
2017-01-04 14:20   ` Jesper Dangaard Brouer
2017-01-04 14:20     ` Jesper Dangaard Brouer
2017-01-06  3:26   ` Hillf Danton
2017-01-06  3:26     ` Hillf Danton
2017-01-06 10:15     ` Mel Gorman
2017-01-06 10:15       ` Mel Gorman
2017-01-09  3:14       ` Hillf Danton
2017-01-09  3:14         ` Hillf Danton
2017-01-09  9:48         ` Mel Gorman
2017-01-09  9:48           ` Mel Gorman
2017-01-09  9:55           ` Hillf Danton
2017-01-09  9:55             ` Hillf Danton
2017-01-04 11:10 ` [PATCH 4/4] mm, page_alloc: Add a bulk page allocator Mel Gorman
2017-01-04 11:10   ` Mel Gorman
2017-01-04 13:48   ` Jesper Dangaard Brouer
2017-01-04 13:48     ` Jesper Dangaard Brouer
2017-01-04 14:03     ` Mel Gorman
2017-01-04 14:03       ` Mel Gorman
2017-01-09 16:35 [RFC PATCH 0/4] Fast noirq bulk page allocator v2r7 Mel Gorman
2017-01-09 16:35 ` [PATCH 1/4] mm, page_alloc: Split buffered_rmqueue Mel Gorman
2017-01-09 16:35   ` Mel Gorman
2017-01-11 12:31   ` Jesper Dangaard Brouer
2017-01-11 12:31     ` Jesper Dangaard Brouer
2017-01-12  3:09   ` Hillf Danton
2017-01-12  3:09     ` Hillf Danton
2017-01-17  9:29 [PATCH 0/4] Use per-cpu allocator for !irq requests and prepare for a bulk allocator v4 Mel Gorman
2017-01-17  9:29 ` [PATCH 1/4] mm, page_alloc: Split buffered_rmqueue Mel Gorman
2017-01-17  9:29   ` Mel Gorman
2017-01-17 18:07   ` Jesper Dangaard Brouer
2017-01-17 18:07     ` Jesper Dangaard Brouer
2017-01-17 18:17     ` Vlastimil Babka
2017-01-17 20:20       ` Mel Gorman
2017-01-17 20:20         ` Mel Gorman
2017-01-17 21:07         ` Mel Gorman
2017-01-17 21:07           ` Mel Gorman
2017-01-17 21:24           ` Vlastimil Babka
2017-01-17 21:24             ` Vlastimil Babka
2017-01-23 15:39 [PATCH 0/4] Use per-cpu allocator for !irq requests and prepare for a bulk allocator v5 Mel Gorman
2017-01-23 15:39 ` [PATCH 1/4] mm, page_alloc: Split buffered_rmqueue Mel Gorman
2017-01-23 15:39   ` Mel Gorman
2017-01-24 10:23   ` Vlastimil Babka
2017-01-24 10:23     ` Vlastimil Babka
