* [PATCH 0/5] Fragmentation avoidance improvements v5
@ 2018-11-23 11:45 Mel Gorman
  2018-11-23 11:45 ` [PATCH 1/5] mm, page_alloc: Spread allocations across zones before introducing fragmentation Mel Gorman
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Mel Gorman @ 2018-11-23 11:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Rientjes, Andrea Arcangeli, Zi Yan,
	Michal Hocko, LKML, Linux-MM, Mel Gorman

There are some big changes due to both Vlastimil's review feedback on v4 and
some oddities spotted while answering his review.  In some respects, the
series is slightly less effective but the approach is more consistent and
logical overall. The overhead from the first patch is also lower and
stalls are less harmful in the last patch, so overall I think it is
much improved.

Changelog since v4
o Clarified changelogs in response to review
o Add a compile-time check on where Normal and DMA32 are	(vbabka)
o Restart zone iteration properly in get_page_from_freelist	(vbabka)
o Reduce overhead in the page allocation fast path		(mel)
o Do not over-boost due to a fragmentation event		(vbabka)
o Correct documentation of sysctl				(mel)
o Really do not wake kswapd if the calling context forbids it	(vbabka,mel)
o Do not shrink slab if boosting watermarks as premature
  reclaim of slab can lead to regressions in IO benchmarks	(mel)
o Take zone lock when boosting watermarks if necessary		(vbabka)

Changelog since v3
o Rebase to 4.20-rc3
o Remove a stupid warning from the last patch

Changelog since v2
o Drop patch 5 as it was borderline
o Decrease timeout when stalling on fragmentation events

Changelog since v1
o Rebase to v4.20-rc1 for the THP __GFP_THISNODE patch in particular
o Add tracepoint to record fragmentation stall durations
o Add vmstat event to record that a fragmentation stall occurred
o Stalls now alter watermark boosting
o Stalls occur only when the allocation is about to fail

It has been noted before that fragmentation avoidance (aka
anti-fragmentation) is not perfect. Given sufficient time or an adverse
workload, memory gets fragmented and the long-term success of high-order
allocations degrades. This series defines an adverse workload, defines
external fragmentation events (including serious ones), and reduces the
level of those fragmentation events.

The details of the workload and the consequences are described in more
detail in the changelogs. However, from patch 1, this is a high-level
summary of the adverse workload. The exact details are found in the
mmtests implementation.

The broad details of the workload are as follows:

1. Create an XFS filesystem (not specified in the configuration but done
   as part of the testing for this patch)
2. Start 4 fio threads that write a number of 64K files inefficiently.
   Inefficiently means that files are created on first access and not
   created in advance (fio parameter create_on_open=1) and fallocate
   is not used (fallocate=none); a job-file sketch follows this list.
   With multiple IO issuers this creates a mix of slab and page cache
   allocations over time. The total size of the files is 150% of
   physical memory so that the slabs and page cache pages get mixed.
3. Warm up a number of fio read-only threads accessing the same files
   created in step 2. This part runs for the same length of time it
   took to create the files. It'll fault back in old data and further
   interleave slab and page cache allocations. As it's now low on
   memory due to step 2, fragmentation occurs as pageblocks get
   stolen.
4. While step 3 is still running, start a process that tries to allocate
   75% of memory as huge pages with a number of threads. The number of
   threads is based on (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
   threads contending with fio, any other threads or forcing cross-NUMA
   scheduling. Note that the test has not been used on a machine with
   fewer than 8 cores. The benchmark records whether huge pages were
   allocated and what the fault latency was in microseconds.
5. Measure the number of events potentially causing external fragmentation,
   the fault latency and the huge page allocation success rate.
6. Cleanup
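
To make step 2 concrete, the flavour of the writer job is sketched below
in fio's job-file format. Only create_on_open and fallocate come from
the description above; the directory, job name and sizing values are
illustrative stand-ins, not taken from the mmtests configuration.

  ; hypothetical sketch of the step 2 writer job
  [global]
  directory=/mnt/xfs     ; the XFS filesystem from step 1
  rw=write
  filesize=64k           ; many small files rather than one large file
  nrfiles=1024           ; scaled in practice so the total is ~150% of RAM
  create_on_open=1       ; files created on first access, not in advance
  fallocate=none         ; no preallocation

  [writers]
  numjobs=4              ; the 4 inefficient writer threads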

Overall the series reduces events that cause external fragmentation by
over 94% on 1 and 2 socket machines, which in turn impacts high-order
allocation success rates over the long term. There are differences in
latencies and high-order allocation success rates. Latencies are a mixed
bag as they are vulnerable to the exact system state and whether
allocations succeeded, so they are treated as a secondary metric.

Patch 1 uses lower zones if they are populated and have free memory
	instead of fragmenting a higher zone. It's special cased to
	handle a Normal->DMA32 fallback with the reasons explained
	in the changelog.
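
	For readers unfamiliar with the fallback idea, a minimal
	user-space C model follows. pick_zone() and the struct fields
	are hypothetical simplifications for illustration only; the
	actual patch works within get_page_from_freelist().

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical model of a zone, for illustration only. */
    struct zone {
        bool populated;
        long free_pages;
        bool alloc_would_fragment;  /* would steal a pageblock */
    };

    /*
     * Sketch of patch 1's idea: prefer a populated lower zone with
     * free memory over an allocation that fragments a higher zone.
     * Zones are ordered from most preferred (highest) downwards.
     */
    static struct zone *pick_zone(struct zone *zones, size_t nr)
    {
        size_t i;

        /* First pass: any zone that can satisfy the request cleanly. */
        for (i = 0; i < nr; i++) {
            if (zones[i].populated && zones[i].free_pages > 0 &&
                !zones[i].alloc_would_fragment)
                return &zones[i];
        }

        /* Second pass: accept fragmentation rather than fail outright. */
        for (i = 0; i < nr; i++) {
            if (zones[i].populated && zones[i].free_pages > 0)
                return &zones[i];
        }

        return NULL;
    }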

Patches 2-4 boost watermarks temporarily when an external fragmentation
	event occurs. kswapd wakes to reclaim a small amount of old memory
	and then wakes kcompactd on completion to recover the system
	slightly. This introduces some overhead in the slowpath. The level
	of boosting can be tuned or disabled depending on the tolerance
	for fragmentation vs allocation latency.
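
	The boosting mechanics can be modelled in miniature as below.
	The names and numbers are illustrative assumptions, not the
	kernel's tunables, but the clamp reflects the v5 change to
	avoid over-boosting on repeated events.

    #include <stdio.h>

    #define PAGEBLOCK_PAGES 512  /* 2MB pageblocks with 4K pages */

    /* Hypothetical miniature of a zone's watermark state. */
    struct zone_wm {
        long high;      /* base high watermark, in pages */
        long boost;     /* temporary boost, dropped after reclaim */
        long boost_max; /* cap so repeated events cannot over-boost */
    };

    /* On an external fragmentation event: raise the effective
     * watermark so kswapd reclaims a small amount of extra memory;
     * in the kernel, kswapd is then woken and kcompactd follows. */
    static void boost_watermark(struct zone_wm *wm)
    {
        wm->boost += PAGEBLOCK_PAGES;
        if (wm->boost > wm->boost_max)
            wm->boost = wm->boost_max;
    }

    int main(void)
    {
        struct zone_wm wm = { .high = 4096, .boost = 0, .boost_max = 2048 };

        boost_watermark(&wm);
        printf("effective high watermark: %ld pages\n", wm.high + wm.boost);

        wm.boost = 0;  /* kswapd met the boosted watermark; boost drops */
        return 0;
    }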

Patch 5 stalls some movable allocation requests to let kswapd from patch 4
	make some progress. The duration of the stalls is very short but it
	is possible to tune the system to avoid fragmentation events if
	larger stalls can be tolerated.
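
	A rough user-space analogue of the bounded stall is sketched
	below, with a pthread condition variable standing in for the
	kernel waitqueue; the timeout handling and names are
	assumptions for illustration, not the patch's implementation.

    #include <pthread.h>
    #include <stdbool.h>
    #include <time.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t progress_cv = PTHREAD_COND_INITIALIZER;
    static bool kswapd_made_progress;

    /* Stall briefly for reclaim progress before retrying an
     * allocation. Returns true if progress was made, false if the
     * timeout expired; either way the caller retries, so the stall
     * is strictly bounded. */
    static bool stall_for_progress(long timeout_ms)
    {
        struct timespec deadline;
        bool progressed;

        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += timeout_ms / 1000;
        deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
        if (deadline.tv_nsec >= 1000000000L) {
            deadline.tv_sec++;
            deadline.tv_nsec -= 1000000000L;
        }

        pthread_mutex_lock(&lock);
        while (!kswapd_made_progress) {
            if (pthread_cond_timedwait(&progress_cv, &lock,
                                       &deadline) != 0)
                break;  /* ETIMEDOUT: give up waiting */
        }
        progressed = kswapd_made_progress;
        pthread_mutex_unlock(&lock);
        return progressed;
    }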

The bulk of the improvement in fragmentation avoidance is from patches
1-4 but patch 5 can deal with a rare corner case and provides the option
of tuning a system for THP allocation success rates in exchange for
some stalls to control fragmentation.

 Documentation/sysctl/vm.txt   |  44 +++++++
 include/linux/mm.h            |   2 +
 include/linux/mmzone.h        |  14 ++-
 include/linux/vm_event_item.h |   1 +
 include/trace/events/kmem.h   |  21 ++++
 kernel/sysctl.c               |  18 +++
 mm/compaction.c               |   2 +-
 mm/internal.h                 |  15 ++-
 mm/page_alloc.c               | 263 ++++++++++++++++++++++++++++++++++++++----
 mm/vmscan.c                   | 136 ++++++++++++++++++++--
 mm/vmstat.c                   |   1 +
 11 files changed, 473 insertions(+), 44 deletions(-)

-- 
2.16.4

* [PATCH 0/5] Fragmentation avoidance improvements v2
@ 2018-11-07 18:38 Mel Gorman
  2018-11-07 18:38 ` [PATCH 1/5] mm, page_alloc: Spread allocations across zones before introducing fragmentation Mel Gorman
  0 siblings, 1 reply; 15+ messages in thread
From: Mel Gorman @ 2018-11-07 18:38 UTC (permalink / raw)
  To: Linux-MM
  Cc: Andrew Morton, Vlastimil Babka, David Rientjes, Andrea Arcangeli,
	Zi Yan, LKML, Mel Gorman

The 1-socket machine is different to the one used in v1, so some of the
results change on that basis. The baseline has moved to 4.20-rc1, so the
__GFP_THISNODE removal for THP is in effect, which alters the behaviour
on 2-socket machines in particular. The biggest changes are in the
fourth patch, both in terms of functional changes and the fact that it
adds a vmstat event and a tracepoint for measuring stall latency.

Changelog since v1
o Rebase to v4.20-rc1 for the THP __GFP_THISNODE patch in particular
o Add tracepoint to record fragmentation stall durations
o Add vmstat event to record that a fragmentation stall occurred
o Stalls now alter watermark boosting
o Stalls occur only when the allocation is about to fail

It has been noted before that fragmentation avoidance (aka
anti-fragmentation) is not perfect. Given sufficient time or an adverse
workload, memory gets fragmented and the long-term success of high-order
allocations degrades. This series defines an adverse workload, defines
external fragmentation events (including serious ones), and reduces the
level of those fragmentation events.

The details of the workload and the consequences are described in more
detail in the changelogs. However, from patch 1, this is a high-level
summary of the adverse workload. The exact details are found in the
mmtests implementation.

The broad details of the workload are as follows:

1. Create an XFS filesystem (not specified in the configuration but done
   as part of the testing for this patch)
2. Start 4 fio threads that write a number of 64K files inefficiently.
   Inefficiently means that files are created on first access and not
   created in advance (fio parameter create_on_open=1) and fallocate
   is not used (fallocate=none). With multiple IO issuers this creates
   a mix of slab and page cache allocations over time. The total size
   of the files is 150% of physical memory so that the slabs and page
   cache pages get mixed.
3. Warm up a number of fio read-only threads accessing the same files
   created in step 2. This part runs for the same length of time it
   took to create the files. It'll fault back in old data and further
   interleave slab and page cache allocations. As it's now low on
   memory due to step 2, fragmentation occurs as pageblocks get
   stolen.
4. While step 3 is still running, start a process that tries to allocate
   75% of memory as huge pages with a number of threads. The number of
   threads is based on (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
   threads contending with fio, any other threads or forcing cross-NUMA
   scheduling. Note that the test has not been used on a machine with
   fewer than 8 cores. The benchmark records whether huge pages were
   allocated and what the fault latency was in microseconds.
5. Measure the number of events potentially causing external fragmentation,
   the fault latency and the huge page allocation success rate.
6. Cleanup

Overall the series reduces events that cause external fragmentation by
over 95% on 1 and 2 socket machines, which in turn impacts high-order
allocation success rates over the long term. There are differences in
latencies and high-order allocation success rates. Latencies are a mixed
bag as they are vulnerable to the exact system state and whether
allocations succeeded, so they are treated as a secondary metric.

Patch 1 uses lower zones if they are populated and have free memory
	instead of fragmenting a higher zone. It's special cased to
	handle a Normal->DMA32 fallback with the reasons explained
	in the changelog.

Patches 2+3 boost watermarks temporarily when an external fragmentation
	event occurs. kswapd wakes to reclaim a small amount of old memory
	and then wakes kcompactd on completion to recover the system
	slightly. This introduces some overhead in the slowpath. The level
	of boosting can be tuned or disabled depending on the tolerance
	for fragmentation vs allocation latency.

Patch 4 is more heavy-handed. In the event of a movable allocation
	request that can stall, it'll wake kswapd as in patch 3. However,
	if the expected fragmentation event is serious then the request
	will stall briefly on pfmemalloc_wait until kswapd completes
	light reclaim work, then retry the allocation without stalling.
	This can avoid the fragmentation event entirely in some cases.
	The definition of a serious fragmentation event can be tuned
	or disabled.
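
	The control flow of patch 4 can be summarised in a short
	sketch. Every helper below is a hypothetical stub standing in
	for kernel internals; only the ordering mirrors the
	description above.

    #include <stdbool.h>

    /* Hypothetical stubs for illustration only. */
    static bool alloc_attempt(void) { return false; }       /* try once */
    static bool alloc_would_fragment(void) { return true; }
    static bool expected_event_is_serious(void) { return true; }
    static void wake_kswapd(void) { }
    static void stall_on_pfmemalloc_wait(long ms) { (void)ms; }

    /* Movable allocation that may stall: wake kswapd as in patch 3
     * and, only for a serious expected fragmentation event, stall
     * briefly for kswapd's light reclaim before one stall-free
     * retry. */
    static bool movable_alloc(void)
    {
        if (!alloc_would_fragment())
            return alloc_attempt();

        wake_kswapd();                        /* as in patch 3 */
        if (expected_event_is_serious())
            stall_on_pfmemalloc_wait(10);     /* brief, bounded */

        /* Retry without stalling again; the stall may have avoided
         * the fragmentation event entirely. */
        return alloc_attempt();
    }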

Patch 5's benefit is the hardest to prove. In the event
	that fragmentation was unavoidable, it'll queue a pageblock for
	kcompactd to clean. It's a fixed-length queue that is neither
	guaranteed to have a slot available nor to successfully clean a
	pageblock.
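
	A minimal model of the fixed-length queue is below; the
	capacity and names are assumptions, but the lossy behaviour
	matches the caveat that neither a slot nor a cleaned
	pageblock is guaranteed.

    #include <stdbool.h>

    #define QUEUE_SLOTS 8  /* illustrative capacity, not the kernel's */

    /* Hypothetical ring of pageblocks awaiting cleaning by kcompactd. */
    struct pageblock_queue {
        unsigned long pfn[QUEUE_SLOTS];  /* start pfn of each pageblock */
        unsigned int head, tail;
    };

    /* Allocator side: queue a pageblock after an unavoidable
     * fragmentation event. A full queue simply drops the request. */
    static bool queue_pageblock(struct pageblock_queue *q, unsigned long pfn)
    {
        unsigned int next = (q->tail + 1) % QUEUE_SLOTS;

        if (next == q->head)
            return false;  /* no slot available */
        q->pfn[q->tail] = pfn;
        q->tail = next;
        return true;
    }

    /* kcompactd side: take one pageblock to migrate pages out of.
     * Cleaning may still fail; nothing here guarantees success. */
    static bool dequeue_pageblock(struct pageblock_queue *q, unsigned long *pfn)
    {
        if (q->head == q->tail)
            return false;  /* empty */
        *pfn = q->pfn[q->head];
        q->head = (q->head + 1) % QUEUE_SLOTS;
        return true;
    }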

Patches 4 and 5 can be treated independently or dropped if necessary. This
is particularly true of patch 5 as the benefit is difficult to detect
given the impact of the first 4 patches. The bulk of the improvement
in fragmentation avoidance is from patches 1-3 (94-97% reduction in
fragmentation events for an adverse workload on both a 1-socket and
2-socket machine). The primary benefit of patch 4 is the increase in
THP success rates and the fact that it reduces fragmentation events to
almost negligible levels, with the option of eliminating them.

 Documentation/sysctl/vm.txt       |  42 +++++++
 include/linux/compaction.h        |   4 +
 include/linux/migrate.h           |   7 +-
 include/linux/mm.h                |   2 +
 include/linux/mmzone.h            |  18 ++-
 include/linux/vm_event_item.h     |   1 +
 include/trace/events/compaction.h |  62 ++++++++++
 include/trace/events/kmem.h       |  21 ++++
 kernel/sysctl.c                   |  18 +++
 mm/compaction.c                   | 147 +++++++++++++++++++++--
 mm/internal.h                     |  14 ++-
 mm/migrate.c                      |   6 +-
 mm/page_alloc.c                   | 246 ++++++++++++++++++++++++++++++++++----
 mm/vmscan.c                       | 123 +++++++++++++++++--
 mm/vmstat.c                       |   1 +
 15 files changed, 661 insertions(+), 51 deletions(-)

-- 
2.16.4

* [PATCH 0/5] Fragmentation avoidance improvements
@ 2018-10-31 16:06 Mel Gorman
  2018-10-31 16:06 ` [PATCH 1/5] mm, page_alloc: Spread allocations across zones before introducing fragmentation Mel Gorman
  0 siblings, 1 reply; 15+ messages in thread
From: Mel Gorman @ 2018-10-31 16:06 UTC (permalink / raw)
  To: Linux-MM
  Cc: Andrew Morton, Vlastimil Babka, David Rientjes, Andrea Arcangeli,
	Zi Yan, LKML, Mel Gorman

Warning: This is a long intro with long changelogs and this is not a
	trivial area to either analyse or fix. TLDR -- 95% reduction in
	fragmentation events; patches 1-3 should be relatively ok. Patches
	4 and 5 need scrutiny but can be treated independently or dropped.

It has been noted before that fragmentation avoidance (aka
anti-fragmentation) is far from perfect. Given a long enough time or an
adverse enough workload, memory still gets fragmented and the long-term
success of high-order allocations degrades. This series defines an
adverse workload, defines external fragmentation events (including
serious ones), and reduces the level of those fragmentation events.

This series is *not* directly related to the recent __GFP_THISNODE
discussion and has no impact on the trivial test cases that were discussed
there. This series was also evaluated without the candidate fixes from
that discussion. The series does, however, have consequences for
high-order and THP allocations that are important to consider, so the
same people are cc'd. It's also far from a complete solution, but
side-issues such as
compaction, usability and other factors would require different series. It's
also extremely important to note that this is analysed in the context of
one adverse workload. While other patterns of fragmentation are possible
(and workloads that are mostly slab allocations have a completely different
solution space), they would need test cases to be properly considered.

The details of the workload and the consequences are described in more
detail in the changelogs. However, from patch 1, this is a high-level
summary of the adverse workload. The exact details are found in the
mmtests implementation.

The broad details of the workload are as follows:

1. Create an XFS filesystem (not specified in the configuration but done
   as part of the testing for this patch)
2. Start 4 fio threads that write a number of 64K files inefficiently.
   Inefficiently means that files are created on first access and not
   created in advance (fio parameter create_on_open=1) and fallocate
   is not used (fallocate=none). With multiple IO issuers this creates
   a mix of slab and page cache allocations over time. The total size
   of the files is 150% of physical memory so that the slabs and page
   cache pages get mixed.
3. Warm up a number of fio read-only threads accessing the same files
   created in step 2. This part runs for the same length of time it
   took to create the files. It'll fault back in old data and further
   interleave slab and page cache allocations. As it's now low on
   memory due to step 2, fragmentation occurs as pageblocks get
   stolen.
4. While step 3 is still running, start a process that tries to allocate
   75% of memory as huge pages with a number of threads. The number of
   threads is based on (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
   threads contending with fio, any other threads or forcing cross-NUMA
   scheduling. Note that the test has not been used on a machine with
   fewer than 8 cores. The benchmark records whether huge pages were
   allocated and what the fault latency was in microseconds.
5. Measure the number of events potentially causing external fragmentation,
   the fault latency and the huge page allocation success rate.
6. Cleanup

Overall the series reduces events that cause external fragmentation by
over 95% on 1 and 2 socket machines, which in turn impacts high-order
allocation success rates over the long term. There are differences in
latencies and high-order allocation success rates. Latencies are a mixed
bag as they are vulnerable to the exact system state and whether
allocations succeeded, so they are treated as a secondary metric.

Patch 1 uses lower zones if they are populated and have free memory
	instead of fragmenting a higher zone. It's special cased to
	handle a Normal->DMA32 fallback with the reasons explained
	in the changelog.

Patches 2+3 boost watermarks temporarily when an external fragmentation
	event occurs. kswapd wakes to reclaim a small amount of old memory
	and then wakes kcompactd on completion to recover the system
	slightly. This introduces some overhead in the slowpath. The level
	of boosting can be tuned or disabled depending on the tolerance
	for fragmentation vs allocation latency.

Patch 4 is more heavy-handed. In the event of a movable allocation
	request that can stall, it'll wake kswapd as in patch 3. However,
	if the expected fragmentation event is serious then the request
	will stall briefly on pfmemalloc_wait until kswapd completes
	light reclaim work, then retry the allocation without stalling.
	This can avoid the fragmentation event entirely in some cases.
	The definition of a serious fragmentation event can be tuned
	or disabled.

Patch 5's benefit is the hardest to prove. In the event
	that fragmentation was unavoidable, it'll queue a pageblock for
	kcompactd to clean. It's a fixed-length queue that is neither
	guaranteed to have a slot available nor to successfully clean a
	pageblock.

Patches 4 and 5 can be treated independently or dropped. The bulk of
the improvement in fragmentation avoidance is from patches 1-3 (94-97%
reduction in fragmentation events for an adverse workload on both a
1-socket and 2-socket machine).

 Documentation/sysctl/vm.txt       |  42 +++++++
 include/linux/compaction.h        |   4 +
 include/linux/migrate.h           |   7 +-
 include/linux/mm.h                |   2 +
 include/linux/mmzone.h            |  18 ++-
 include/trace/events/compaction.h |  62 +++++++++++
 kernel/sysctl.c                   |  18 +++
 mm/compaction.c                   | 148 +++++++++++++++++++++++--
 mm/internal.h                     |  14 ++-
 mm/migrate.c                      |   6 +-
 mm/page_alloc.c                   | 228 ++++++++++++++++++++++++++++++++++----
 mm/vmscan.c                       | 123 ++++++++++++++++++--
 12 files changed, 621 insertions(+), 51 deletions(-)

-- 
2.16.4


Thread overview: 15+ messages
2018-11-23 11:45 [PATCH 0/5] Fragmentation avoidance improvements v5 Mel Gorman
2018-11-23 11:45 ` [PATCH 1/5] mm, page_alloc: Spread allocations across zones before introducing fragmentation Mel Gorman
2018-11-26 12:36   ` Vlastimil Babka
2018-11-23 11:45 ` [PATCH 2/5] mm: Move zone watermark accesses behind an accessor Mel Gorman
2018-11-23 11:45 ` [PATCH 3/5] mm: Use alloc_flags to record if kswapd can wake Mel Gorman
2018-11-26 13:38   ` Vlastimil Babka
2018-11-26 14:35     ` [PATCH] mm: Use alloc_flags to record if kswapd can wake -fix Mel Gorman
2018-11-23 11:45 ` [PATCH 4/5] mm: Reclaim small amounts of memory when an external fragmentation event occurs Mel Gorman
2018-11-27  9:23   ` Vlastimil Babka
2018-11-23 11:45 ` [PATCH 5/5] mm: Stall movable allocations until kswapd progresses during serious external fragmentation event Mel Gorman
2018-11-27 13:20   ` Vlastimil Babka
2018-11-27 17:51     ` Mel Gorman
2018-12-05  8:06     ` Mel Gorman
  -- strict thread matches above, loose matches on Subject: below --
2018-11-07 18:38 [PATCH 0/5] Fragmentation avoidance improvements v2 Mel Gorman
2018-11-07 18:38 ` [PATCH 1/5] mm, page_alloc: Spread allocations across zones before introducing fragmentation Mel Gorman
2018-10-31 16:06 [PATCH 0/5] Fragmentation avoidance improvements Mel Gorman
2018-10-31 16:06 ` [PATCH 1/5] mm, page_alloc: Spread allocations across zones before introducing fragmentation Mel Gorman
