* [PATCH 0/11] Reduce compaction-related stalls and improve asynchronous migration of dirty pages v6
From: Mel Gorman @ 2011-12-14 15:41 UTC
To: Andrew Morton
Cc: Andrea Arcangeli, Minchan Kim, Dave Jones, Jan Kara,
Andy Isaacson, Johannes Weiner, Mel Gorman, Rik van Riel,
Nai Xia, Linux-MM, LKML
Short summary: There are severe stalls when a USB stick using VFAT
is used with THP enabled that are reduced by this series. If you are
experiencing this problem, please test and report back. Considering
I have seen complaints from openSUSE and Fedora users on this as well
as a few private mails, I'm guessing it's a widespread issue. This
is a new type of USB-related stall because it is due to synchronous
compaction writing pages, whereas in the past the big problem was dirty
pages reaching the end of the LRU and being written by reclaim.
Am cc'ing Andrew this time and this series would replace
mm-do-not-stall-in-synchronous-compaction-for-thp-allocations.patch.
I'm also cc'ing Dave Jones as he might have merged that patch to Fedora
for wider testing and ideally it would be reverted and replaced by
this series.
That said, the later patches could really do with some review. If this
series is not the answer then a new direction needs to be discussed
because as it is, the stalls are unacceptable as the results in this
leader show.
For testers that try backporting this to 3.1, it won't work because
there is a non-obvious dependency on not writing back pages in direct
reclaim so you need those patches too.
Changelog since V5
o Rebase to 3.2-rc5
o Tidy up the changelogs a bit
Changelog since V4
o Added reviewed-bys, credited Andrea properly for sync-light
o Allow dirty pages without mappings to be considered for migration
o Bound the number of pages freed for compaction
o Isolate PageReclaim pages on their own LRU list
This is against 3.2-rc5 and follows on from discussions on "mm: Do
not stall in synchronous compaction for THP allocations" and "[RFC
PATCH 0/5] Reduce compaction-related stalls". Initially, the proposed
patch eliminated stalls due to compaction which sometimes resulted in
user-visible interactivity problems on browsers by simply never using
sync compaction. The downside was that THP allocation success rates
were lower because dirty pages were not being migrated, as reported by
Andrea. His approach to fixing this was nacked on the grounds that
it reverted fixes from Rik that reduced the number of pages
reclaimed, as excessive reclaim severely impacted his workload's
performance.
This series attempts to reconcile the requirements of maximising THP
usage, without stalling in a user-visible fashion due to compaction
or cheating by reclaiming an excessive number of pages.
Patch 1 partially reverts commit 39deaf85 to allow migration to isolate
dirty pages. This is because migration can move some dirty
pages without blocking.
Patch 2 notes that the /proc/sys/vm/compact_memory handler is not using
synchronous compaction when it should be. This is unrelated
to the reported stalls but is worth fixing.
Patch 3 checks if we isolated a compound page during lumpy scan and
accounts for it properly. For the most part, this affects
tracing so it's unrelated to the stalls but worth fixing.
Patch 4 notes that it is possible to abort reclaim early for compaction
and return 0 to the page allocator potentially entering the
"may oom" path. This has not been observed in practice but
the rest of the series potentially makes it easier to happen.
Patch 5 adds a sync parameter to the migratepage callback and gives
the callback responsibility for migrating the page without
blocking if sync==false. For example, fallback_migrate_page
will not call writepage if sync==false. This increases the
number of pages that can be handled by asynchronous compaction
thereby reducing stalls.
Patch 6 restores filter-awareness to isolate_lru_page for migration.
In practice, it means that pages under writeback and pages
without a ->migratepage callback will not be isolated
for migration.
Patch 7 avoids calling direct reclaim if compaction is deferred but
makes sure that compaction is only deferred if sync
compaction was used.
Patch 8 introduces a sync-light migration mechanism that sync compaction
uses. The objective is to allow some stalls but to not call
->writepage which can lead to significant user-visible stalls.
Patch 9 notes that while we want to abort reclaim ASAP to allow
compaction to go ahead, we leave a very small window of
opportunity for compaction to run. This patch allows more pages
to be freed by reclaim but bounds the number to a reasonable
level based on the high watermark on each zone.
Patch 10 allows slabs to be shrunk even after compaction_ready() is
true for one zone. This is to avoid a problem whereby a single
small zone can abort reclaim even though no pages have been
reclaimed and no suitably large zone is in a usable state.
Patch 11 fixes a problem with the rate of page scanning. As reclaim is
rarely stalling on pages under writeback it means that scan
rates are very high. This is particularly true for direct
reclaim which is not calling writepage. The vmstat figures
implied that much of this was busy work with PageReclaim pages
marked for immediate reclaim. This patch is a prototype that
moves these pages to their own LRU list.
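As a rough userspace illustration of the sync semantics described for
patches 5 and 8, the sketch below models when a fallback migration path
would be allowed to write a dirty page out. It is a mock, not the kernel
code: mock_page, mock_migrate_mode and mock_fallback_migrate are
hypothetical names made up for this example, and treating async,
sync-light and full sync as three levels of one mode is an assumption
about how the pieces fit together.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; not the kernel's types. */
enum mock_migrate_mode {
	MOCK_ASYNC,		/* must never block */
	MOCK_SYNC_LIGHT,	/* may block (not modelled), never calls writepage */
	MOCK_SYNC,		/* may block and may call writepage */
};

struct mock_page {
	bool dirty;
	int writepage_calls;	/* counts simulated ->writepage invocations */
};

/*
 * Decide what a fallback migration path does with a page.
 * Returns 0 if the page can be migrated, -EBUSY if it has to be skipped.
 */
int mock_fallback_migrate(struct mock_page *page, enum mock_migrate_mode mode)
{
	assert(page != NULL);

	if (page->dirty) {
		/* Only full sync migration is allowed to write the page out. */
		if (mode != MOCK_SYNC)
			return -EBUSY;
		page->writepage_calls++;
		page->dirty = false;
	}

	return 0;
}
```

A dirty page is skipped with -EBUSY in both the async and sync-light
modes of this mock, so only the full sync caller pays for the write.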
This has been tested and other than 2 USB keys getting trashed,
nothing horrible fell out. That said, I am a bit unhappy with the
rescue logic in patch 11 but did not find a better way around it. It
does significantly reduce scan rates and System CPU time indicating
it is the right direction to take.
What is of critical importance is that stalls due to compaction
are massively reduced even though sync compaction was still
allowed. Testing from people complaining about stalls copying to USBs
with THP enabled is particularly welcome.
The following tests all involve THP usage and USB keys in some
way. Each test follows this type of pattern
1. Read from some fast storage, be it raw device or file. Each time
the copy finishes, start again until the test ends
2. Write a large file to a filesystem on a USB stick. Each time the copy
finishes, start again until the test ends
3. When memory is low, start an alloc process that creates a mapping
the size of physical memory to stress THP allocation. This is the
"real" part of the test and the part that is meant to trigger
stalls when THP is enabled. Copying continues in the background.
4. Record the CPU usage and time to execute of the alloc process
5. Record the number of THP allocs and fallbacks as well as the number of THP
pages in use at the end of the test just before alloc exited
6. Run the test 5 times to get an idea of variability
7. Between each run, sync is run and caches dropped and the test
waits until nr_dirty is a small number to avoid interference
or caching between iterations that would skew the figures.
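The alloc process in step 3 is not shown in this mail. A minimal sketch
of what such a program might do is below, assuming it maps anonymous
memory, asks for huge pages and touches every base page. The function
name touch_anon_mapping is hypothetical, and the real test sizes the
mapping to physical memory whereas this sketch takes the size as a
parameter.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map `bytes` of anonymous memory, hint that huge pages are wanted and
 * touch every base page so the kernel has to back the whole range.
 * Returns the number of base pages touched, or 0 if the mapping failed.
 */
size_t touch_anon_mapping(size_t bytes)
{
	long page_size = sysconf(_SC_PAGESIZE);
	size_t touched = 0;
	unsigned char *buf;

	assert(page_size > 0);

	buf = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 0;

#ifdef MADV_HUGEPAGE
	/* Advisory only: a failure (e.g. THP not available) is ignored. */
	(void)madvise(buf, bytes, MADV_HUGEPAGE);
#endif

	/* One write per base page is enough to fault that page in. */
	for (size_t off = 0; off < bytes; off += (size_t)page_size) {
		buf[off] = 1;
		touched++;
	}

	munmap(buf, bytes);
	return touched;
}
```

Timing a call like this while the copies run in the background would
give the CPU usage and elapsed time recorded in step 4.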
The individual tests were then
writebackCPDeviceBasevfat
Disable THP, read from a raw device (sda), vfat on USB stick
writebackCPDeviceBaseext4
Disable THP, read from a raw device (sda), ext4 on USB stick
writebackCPDevicevfat
THP enabled, read from a raw device (sda), vfat on USB stick
writebackCPDeviceext4
THP enabled, read from a raw device (sda), ext4 on USB stick
writebackCPFilevfat
THP enabled, read from a file on fast storage and USB, both vfat
writebackCPFileext4
THP enabled, read from a file on fast storage and USB, both ext4
The kernels tested were
3.1 3.1
vanilla 3.2-rc5
freemore Patches 1-10
immediate Patches 1-11
andrea The 8 patches Andrea posted as a basis of comparison
The results are very long unfortunately. I'll start with the case
where we are not using THP at all
writebackCPDeviceBasevfat
                        3.1.0-vanilla           rc5-vanilla         freemore-v6r1          isolate-v6r1           andrea-v2r1
System Time         1.28 (     0.00%)    54.49 ( -4143.46%)    48.63 ( -3687.69%)     4.69 (  -265.11%)    51.88 ( -3940.81%)
+/-                 0.06 (     0.00%)     2.45 ( -4305.55%)     4.75 ( -8430.57%)     7.46 (-13282.76%)     4.76 ( -8440.70%)
User Time           0.09 (     0.00%)     0.05 (    40.91%)     0.06 (    29.55%)     0.07 (    15.91%)     0.06 (    27.27%)
+/-                 0.02 (     0.00%)     0.01 (    45.39%)     0.02 (    25.07%)     0.00 (    77.06%)     0.01 (    52.24%)
Elapsed Time      110.27 (     0.00%)    56.38 (    48.87%)    49.95 (    54.70%)    11.77 (    89.33%)    53.43 (    51.54%)
+/-                 7.33 (     0.00%)     3.77 (    48.61%)     4.94 (    32.63%)     6.71 (     8.50%)     4.76 (    35.03%)
THP Active          0.00 (     0.00%)     0.00 (     0.00%)     0.00 (     0.00%)     0.00 (     0.00%)     0.00 (     0.00%)
+/-                 0.00 (     0.00%)     0.00 (     0.00%)     0.00 (     0.00%)     0.00 (     0.00%)     0.00 (     0.00%)
Fault Alloc         0.00 (     0.00%)     0.00 (     0.00%)     0.00 (     0.00%)     0.00 (     0.00%)     0.00 (     0.00%)
+/-                 0.00 (     0.00%)     0.00 (     0.00%)     0.00 (     0.00%)     0.00 (     0.00%)     0.00 (     0.00%)
Fault Fallback      0.00 (     0.00%)     0.00 (     0.00%)     0.00 (     0.00%)     0.00 (     0.00%)     0.00 (     0.00%)
+/-                 0.00 (     0.00%)     0.00 (     0.00%)     0.00 (     0.00%)     0.00 (     0.00%)     0.00 (     0.00%)
The THP figures are obviously all 0 because THP was disabled. The
main thing to watch is the elapsed times and how they compare to
times when THP is enabled later. It's also important to note that
elapsed time is improved by this series as System CPU time is much
reduced.
writebackCPDevicevfat
                        3.1.0-vanilla           rc5-vanilla         freemore-v6r1          isolate-v6r1           andrea-v2r1
System Time         1.22 (     0.00%)    13.89 ( -1040.72%)    46.40 ( -3709.20%)     4.44 (  -264.37%)    47.37 ( -3789.33%)
+/-                 0.06 (     0.00%)    22.82 (-37635.56%)     3.84 ( -6249.44%)     6.48 (-10618.92%)     6.60 (-10818.53%)
User Time           0.06 (     0.00%)     0.06 (    -6.90%)     0.05 (    17.24%)     0.05 (    13.79%)     0.04 (    31.03%)
+/-                 0.01 (     0.00%)     0.01 (    33.33%)     0.01 (    33.33%)     0.01 (    39.14%)     0.01 (    25.46%)
Elapsed Time    10445.54 (     0.00%)  2249.92 (    78.46%)    70.06 (    99.33%)    16.59 (    99.84%)   472.43 (    95.48%)
+/-               643.98 (     0.00%)   811.62 (   -26.03%)    10.02 (    98.44%)     7.03 (    98.91%)    59.99 (    90.68%)
THP Active         15.60 (     0.00%)    35.20 (   225.64%)    65.00 (   416.67%)    70.80 (   453.85%)    62.20 (   398.72%)
+/-                18.48 (     0.00%)    51.29 (   277.59%)    15.99 (    86.52%)    37.91 (   205.18%)    22.02 (   119.18%)
Fault Alloc       121.80 (     0.00%)    76.60 (    62.89%)   155.40 (   127.59%)   181.20 (   148.77%)   286.60 (   235.30%)
+/-                73.51 (     0.00%)    61.11 (    83.12%)    34.89 (    47.46%)    31.88 (    43.36%)    68.13 (    92.68%)
Fault Fallback    881.20 (     0.00%)   926.60 (    -5.15%)   847.60 (     3.81%)   822.00 (     6.72%)   716.60 (    18.68%)
+/-                73.51 (     0.00%)    61.26 (    16.67%)    34.89 (    52.54%)    31.65 (    56.94%)    67.75 (     7.84%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds)    3540.88   1945.37    716.04     64.97   1937.03
Total Elapsed Time (seconds)           52417.33  11425.90    501.02    230.95   2520.28
The first thing to note is the "Elapsed Time" for vanilla 3.2-rc5
of 2249 seconds versus 56 with THP disabled, which might explain the
reports of USB stalls with THP enabled. Applying the patches brings
performance in line with THP-disabled performance while isolating
pages for immediate reclaim from the LRU cuts down System CPU time.
The "Fault Alloc" success rate figures are also improved. The vanilla
kernel only managed to allocate 76.6 pages on average over the course
of 5 iterations whereas applying the series allocated 181.20 on
average, albeit that is well within variance. It's worth noting that
applying the series at least decreases the amount of variance, which
implies an improvement.
Andrea's series had a higher success rate for THP allocations but
at a severe cost to elapsed time which is still better than vanilla
but still much worse than disabling THP altogether. One can bring my
series close to Andrea's by removing this check
	/*
	 * If compaction is deferred for high-order allocations, it is because
	 * sync compaction recently failed. If this is the case and the caller
	 * has requested the system not be heavily disrupted, fail the
	 * allocation now instead of entering direct reclaim
	 */
	if (deferred_compaction && (gfp_mask & __GFP_NO_KSWAPD))
		goto nopage;
I didn't include a patch that removed the above check because hurting
overall performance to improve the THP figure is not what the average
user wants. It's something to consider though if someone really wants
to maximise THP usage no matter what it does to the workload initially.
This is a summary of vmstat figures from the same test.
                          3.1.0-vanilla   rc5-vanilla freemore-v6r1  isolate-v6r1   andrea-v2r1
Page Ins                     3257266139    1111844061      17263623      10901575     161423219
Page Outs                      81054922      30364312       3626530       3657687       8753730
Swap Ins                           3294          2851          6560          4964          4592
Swap Outs                        390073        528094        620197        790912        698285
Direct pages scanned         1077581700    3024951463    1764930052     115140570    5901188831
Kswapd pages scanned           34826043       7112868       2131265       1686942       1893966
Kswapd pages reclaimed         28950067       4911036       1246044        966475       1497726
Direct pages reclaimed        805148398     280167837       3623473       2215044      40809360
Kswapd efficiency                   83%           69%           58%           57%           79%
Kswapd velocity                 664.399       622.521      4253.852      7304.360       751.490
Direct efficiency                   74%            9%            0%            1%            0%
Direct velocity               20557.737    264745.137   3522673.849    498551.938   2341481.435
Percentage direct scans             96%           99%           99%           98%           99%
Page writes by reclaim           722646        529174        620319        791018        699198
Page writes file                 332573          1080           122           106           913
Page writes anon                 390073        528094        620197        790912        698285
Page reclaim immediate                0    2552514720    1635858848     111281140    5478375032
Page rescued immediate                0             0             0         87848             0
Slabs scanned                     23552         23552          9216          8192          9216
Direct inode steals                 231             0             0             0             0
Kswapd inode steals                   0             0             0             0             0
Kswapd skipped wait               28076           786             0            61             6
THP fault alloc                     609           383           753           906          1433
THP collapse alloc                   12             6             0             0             6
THP splits                          536           211           456           593          1136
THP fault fallback                 4406          4633          4263          4110          3583
THP collapse fail                   120           127             0             0             4
Compaction stalls                  1810           728           623           779          3200
Compaction success                  196            53            60            80           123
Compaction failures                1614           675           563           699          3077
Compaction pages moved           193158         53545        243185        333457        226688
Compaction move failure            9952          9396         16424         23676         45070
The main things to look at are
1. Page In/out figures are much reduced by the series.
2. Direct page scanning is incredibly high (264745.137 pages scanned
per second on the vanilla kernel) but isolating PageReclaim pages
on their own list reduces the number of pages scanned significantly.
3. The fact that "Page rescued immediate" is a positive number implies
that we sometimes race removing pages from the LRU_IMMEDIATE list
that need to be put back on a normal LRU but it happens only for
0.07% of the pages marked for immediate reclaim.
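The derived rows in the vmstat summary (efficiency, velocity and
percentage direct scans) can be reproduced from the raw counters. The
sketch below shows the arithmetic as I read it from the numbers: the
formulas and the truncation to whole percents are inferred from the
figures rather than taken from the reporting scripts, and the function
names are hypothetical.

```c
#include <assert.h>

/* Pages reclaimed as a whole-number percentage of pages scanned. */
unsigned int reclaim_efficiency_pct(unsigned long long scanned,
				    unsigned long long reclaimed)
{
	assert(scanned > 0);
	return (unsigned int)(reclaimed * 100 / scanned);
}

/* Pages scanned per second of total elapsed test time. */
double scan_velocity(unsigned long long scanned, double elapsed_seconds)
{
	assert(elapsed_seconds > 0.0);
	return (double)scanned / elapsed_seconds;
}

/* Share of all scanning done by direct reclaim, truncated to a percent. */
unsigned int direct_scan_pct(unsigned long long direct,
			     unsigned long long kswapd)
{
	assert(direct + kswapd > 0);
	return (unsigned int)(direct * 100 / (direct + kswapd));
}
```

For example, feeding in the rc5-vanilla "Direct pages scanned" counter
and its "Total Elapsed Time" from the duration table gives the direct
velocity quoted in point 2.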
writebackCPDeviceext4
                        3.1.0-vanilla           rc5-vanilla         freemore-v6r1          isolate-v6r1           andrea-v2r1
System Time         1.51 (     0.00%)     1.77 (   -17.66%)     1.46 (     2.92%)     1.15 (    23.77%)     1.89 (   -25.63%)
+/-                 0.27 (     0.00%)     0.67 (  -148.52%)     0.33 (   -22.76%)     0.30 (   -11.15%)     0.19 (    30.16%)
User Time           0.03 (     0.00%)     0.04 (   -37.50%)     0.05 (   -62.50%)     0.07 (  -112.50%)     0.04 (   -18.75%)
+/-                 0.01 (     0.00%)     0.02 (  -146.64%)     0.02 (   -97.91%)     0.02 (   -75.59%)     0.02 (   -63.30%)
Elapsed Time      124.93 (     0.00%)   114.49 (     8.36%)    96.77 (    22.55%)    27.48 (    78.00%)   205.70 (   -64.65%)
+/-                20.20 (     0.00%)    74.39 (  -268.34%)    59.88 (  -196.48%)     7.72 (    61.79%)    25.03 (   -23.95%)
THP Active        161.80 (     0.00%)    83.60 (    51.67%)   141.20 (    87.27%)    84.60 (    52.29%)    82.60 (    51.05%)
+/-                71.95 (     0.00%)    43.80 (    60.88%)    26.91 (    37.40%)    59.02 (    82.03%)    52.13 (    72.45%)
Fault Alloc       471.40 (     0.00%)   228.60 (    48.49%)   282.20 (    59.86%)   225.20 (    47.77%)   388.40 (    82.39%)
+/-                88.07 (     0.00%)    87.42 (    99.26%)    73.79 (    83.78%)   109.62 (   124.47%)    82.62 (    93.81%)
Fault Fallback    531.60 (     0.00%)   774.60 (   -45.71%)   720.80 (   -35.59%)   777.80 (   -46.31%)   614.80 (   -15.65%)
+/-                88.07 (     0.00%)    87.26 (     0.92%)    73.79 (    16.22%)   109.62 (   -24.47%)    82.29 (     6.56%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds)      50.22     33.76     30.65     24.14    128.45
Total Elapsed Time (seconds)            1113.73   1132.19   1029.45    759.49   1707.26
Similar test but the USB stick is using ext4 instead of vfat. As
ext4 does not use writepage for migration, the large stalls due to
compaction when THP is enabled are not observed. Still, isolating
PageReclaim pages on their own list helped completion time largely
by reducing the number of pages scanned by direct reclaim although
time spent in congestion_wait could also be a factor.
Again, Andrea's series had far higher success rates for THP allocation
at the cost of elapsed time. I didn't look too closely but a quick
look at the vmstat figures tells me kswapd reclaimed 8 times more pages
than the patch series and direct reclaim reclaimed roughly three times
as many pages. It follows that if memory is aggressively reclaimed,
there will be more available for THP.
writebackCPFilevfat
                        3.1.0-vanilla           rc5-vanilla         freemore-v6r1          isolate-v6r1           andrea-v2r1
System Time         1.76 (     0.00%)    29.10 ( -1555.52%)    46.01 ( -2517.18%)     4.79 (  -172.35%)    54.89 ( -3022.53%)
+/-                 0.14 (     0.00%)    25.61 (-18185.17%)     2.15 ( -1434.83%)     6.60 ( -4610.03%)     9.75 ( -6863.76%)
User Time           0.05 (     0.00%)     0.07 (   -45.83%)     0.05 (    -4.17%)     0.06 (   -29.17%)     0.06 (   -16.67%)
+/-                 0.02 (     0.00%)     0.02 (    20.11%)     0.02 (    -3.14%)     0.01 (    31.58%)     0.01 (    47.41%)
Elapsed Time    22520.79 (     0.00%)  1082.85 (    95.19%)    73.30 (    99.67%)    32.43 (    99.86%)   291.84 (    98.70%)
+/-              7277.23 (     0.00%)   706.29 (    90.29%)    19.05 (    99.74%)    17.05 (    99.77%)   125.55 (    98.27%)
THP Active         83.80 (     0.00%)    12.80 (    15.27%)    15.60 (    18.62%)    13.00 (    15.51%)     0.80 (     0.95%)
+/-                66.81 (     0.00%)    20.19 (    30.22%)     5.92 (     8.86%)    15.06 (    22.54%)     1.17 (     1.75%)
Fault Alloc       171.00 (     0.00%)    67.80 (    39.65%)    97.40 (    56.96%)   125.60 (    73.45%)   133.00 (    77.78%)
+/-                82.91 (     0.00%)    30.69 (    37.02%)    53.91 (    65.02%)    55.05 (    66.40%)    21.19 (    25.56%)
Fault Fallback    832.00 (     0.00%)   935.20 (   -12.40%)   906.00 (    -8.89%)   877.40 (    -5.46%)   870.20 (    -4.59%)
+/-                82.91 (     0.00%)    30.69 (    62.98%)    54.01 (    34.86%)    55.05 (    33.60%)    20.91 (    74.78%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds)    7229.81    928.42    704.52     80.68   1330.76
Total Elapsed Time (seconds)          112849.04   5618.69    571.11    360.54   1664.28
In this case, the test is reading/writing only from filesystems but as
it's vfat, it's slow due to calling writepage during compaction. Little
to observe really - the time to complete the test goes way down
with the series applied and THP allocation success rates go up in
comparison to 3.2-rc5. The success rates are lower than 3.1.0 but
the elapsed time for that kernel is abysmal so it is not really a
sensible comparison.
As before, Andrea's series allocates more THPs at the cost of overall
performance.
writebackCPFileext4
                        3.1.0-vanilla           rc5-vanilla         freemore-v6r1          isolate-v6r1           andrea-v2r1
System Time         1.51 (     0.00%)     1.77 (   -17.66%)     1.46 (     2.92%)     1.15 (    23.77%)     1.89 (   -25.63%)
+/-                 0.27 (     0.00%)     0.67 (  -148.52%)     0.33 (   -22.76%)     0.30 (   -11.15%)     0.19 (    30.16%)
User Time           0.03 (     0.00%)     0.04 (   -37.50%)     0.05 (   -62.50%)     0.07 (  -112.50%)     0.04 (   -18.75%)
+/-                 0.01 (     0.00%)     0.02 (  -146.64%)     0.02 (   -97.91%)     0.02 (   -75.59%)     0.02 (   -63.30%)
Elapsed Time      124.93 (     0.00%)   114.49 (     8.36%)    96.77 (    22.55%)    27.48 (    78.00%)   205.70 (   -64.65%)
+/-                20.20 (     0.00%)    74.39 (  -268.34%)    59.88 (  -196.48%)     7.72 (    61.79%)    25.03 (   -23.95%)
THP Active        161.80 (     0.00%)    83.60 (    51.67%)   141.20 (    87.27%)    84.60 (    52.29%)    82.60 (    51.05%)
+/-                71.95 (     0.00%)    43.80 (    60.88%)    26.91 (    37.40%)    59.02 (    82.03%)    52.13 (    72.45%)
Fault Alloc       471.40 (     0.00%)   228.60 (    48.49%)   282.20 (    59.86%)   225.20 (    47.77%)   388.40 (    82.39%)
+/-                88.07 (     0.00%)    87.42 (    99.26%)    73.79 (    83.78%)   109.62 (   124.47%)    82.62 (    93.81%)
Fault Fallback    531.60 (     0.00%)   774.60 (   -45.71%)   720.80 (   -35.59%)   777.80 (   -46.31%)   614.80 (   -15.65%)
+/-                88.07 (     0.00%)    87.26 (     0.92%)    73.79 (    16.22%)   109.62 (   -24.47%)    82.29 (     6.56%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds)      50.22     33.76     30.65     24.14    128.45
Total Elapsed Time (seconds)            1113.73   1132.19   1029.45    759.49   1707.26
Same type of story - elapsed times go down. In this case, allocation
success rates are roughly the same. As before, Andrea's has higher
success rates but takes a lot longer.
Overall the series does reduce latencies and while the tests are
inherently racy as alloc competes with the cp processes, the variability
was included. The THP allocation rates are not as high as they could
be but that is because we would have to be more aggressive about
reclaim and compaction, impacting overall performance.
Comments?
fs/btrfs/disk-io.c | 5 +-
fs/hugetlbfs/inode.c | 3 +-
fs/nfs/internal.h | 2 +-
fs/nfs/write.c | 4 +-
include/linux/fs.h | 11 ++-
include/linux/migrate.h | 23 +++++-
include/linux/mmzone.h | 4 +
include/linux/vm_event_item.h | 1 +
mm/compaction.c | 5 +-
mm/memory-failure.c | 2 +-
mm/memory_hotplug.c | 2 +-
mm/mempolicy.c | 2 +-
mm/migrate.c | 171 ++++++++++++++++++++++++++++-------------
mm/page_alloc.c | 50 +++++++++---
mm/swap.c | 74 +++++++++++++++++-
mm/vmscan.c | 114 ++++++++++++++++++++++++----
mm/vmstat.c | 2 +
17 files changed, 371 insertions(+), 104 deletions(-)
--
1.7.3.4
^ permalink raw reply [flat|nested] 100+ messages in thread
* [PATCH 0/11] Reduce compaction-related stalls and improve asynchronous migration of dirty pages v6
@ 2011-12-14 15:41 ` Mel Gorman
0 siblings, 0 replies; 100+ messages in thread
From: Mel Gorman @ 2011-12-14 15:41 UTC (permalink / raw)
To: Andrew Morton
Cc: Andrea Arcangeli, Minchan Kim, Dave Jones, Jan Kara,
Andy Isaacson, Johannes Weiner, Mel Gorman, Rik van Riel,
Nai Xia, Linux-MM, LKML
Short summary: There are severe stalls when a USB stick using VFAT
is used with THP enabled that are reduced by this series. If you are
experiencing this problem, please test and report back and considering
I have seen complaints from openSUSE and Fedora users on this as well
as a few private mails, I'm guessing it's a widespread issue. This
is a new type of USB-related stall because it is due to synchronous
compaction writing where as in the past the big problem was dirty
pages reaching the end of the LRU and being written by reclaim.
Am cc'ing Andrew this time and this series would replace
mm-do-not-stall-in-synchronous-compaction-for-thp-allocations.patch.
I'm also cc'ing Dave Jones as he might have merged that patch to Fedora
for wider testing and ideally it would be reverted and replaced by
this series.
That said, the later patches could really do with some review. If this
series is not the answer then a new direction needs to be discussed
because as it is, the stalls are unacceptable as the results in this
leader show.
For testers that try backporting this to 3.1, it won't work because
there is a non-obvious dependency on not writing back pages in direct
reclaim so you need those patches too.
Changelog since V5
o Rebase to 3.2-rc5
o Tidy up the changelogs a bit
Changelog since V4
o Added reviewed-bys, credited Andrea properly for sync-light
o Allow dirty pages without mappings to be considered for migration
o Bound the number of pages freed for compaction
o Isolate PageReclaim pages on their own LRU list
This is against 3.2-rc5 and follows on from discussions on "mm: Do
not stall in synchronous compaction for THP allocations" and "[RFC
PATCH 0/5] Reduce compaction-related stalls". Initially, the proposed
patch eliminated stalls due to compaction which sometimes resulted in
user-visible interactivity problems on browsers by simply never using
sync compaction. The downside was that THP success allocation rates
were lower because dirty pages were not being migrated as reported by
Andrea. His approach at fixing this was nacked on the grounds that
it reverted fixes from Rik merged that reduced the amount of pages
reclaimed as it severely impacted his workloads performance.
This series attempts to reconcile the requirements of maximising THP
usage, without stalling in a user-visible fashion due to compaction
or cheating by reclaiming an excessive number of pages.
Patch 1 partially reverts commit 39deaf85 to allow migration to isolate
dirty pages. This is because migration can move some dirty
pages without blocking.
Patch 2 notes that the /proc/sys/vm/compact_memory handler is not using
synchronous compaction when it should be. This is unrelated
to the reported stalls but is worth fixing.
Patch 3 checks if we isolated a compound page during lumpy scan and
account for it properly. For the most part, this affects
tracing so it's unrelated to the stalls but worth fixing.
Patch 4 notes that it is possible to abort reclaim early for compaction
and return 0 to the page allocator potentially entering the
"may oom" path. This has not been observed in practice but
the rest of the series potentially makes it easier to happen.
Patch 5 adds a sync parameter to the migratepage callback and gives
the callback responsibility for migrating the page without
blocking if sync==false. For example, fallback_migrate_page
will not call writepage if sync==false. This increases the
number of pages that can be handled by asynchronous compaction
thereby reducing stalls.
Patch 6 restores filter-awareness to isolate_lru_page for migration.
In practice, it means that pages under writeback and pages
without a ->migratepage callback will not be isolated
for migration.
Patch 7 avoids calling direct reclaim if compaction is deferred but
makes sure that compaction is only deferred if sync
compaction was used.
Patch 8 introduces a sync-light migration mechanism that sync compaction
uses. The objective is to allow some stalls but to not call
->writepage which can lead to significant user-visible stalls.
Patch 9 notes that while we want to abort reclaim ASAP to allow
compation to go ahead that we leave a very small window of
opportunity for compaction to run. This patch allows more pages
to be freed by reclaim but bounds the number to a reasonable
level based on the high watermark on each zone.
Patch 10 allows slabs to be shrunk even after compaction_ready() is
true for one zone. This is to avoid a problem whereby a single
small zone can abort reclaim even though no pages have been
reclaimed and no suitably large zone is in a usable state.
Patch 11 fixes a problem with the rate of page scanning. As reclaim is
rarely stalling on pages under writeback it means that scan
rates are very high. This is particularly true for direct
reclaim which is not calling writepage. The vmstat figures
implied that much of this was busy work with PageReclaim pages
marked for immediate reclaim. This patch is a prototype that
moves these pages to their own LRU list.
This has been tested and other than 2 USB keys getting trashed,
nothing horrible fell out. That said, I am a bit unhappy with the
rescue logic in patch 11 but did not find a better way around it. It
does significantly reduce scan rates and System CPU time indicating
it is the right direction to take.
What is of critical importance is that stalls due to compaction
are massively reduced even though sync compaction was still
allowed. Testing from people complaining about stalls copying to USBs
with THP enabled are particularly welcome.
The following tests all involve THP usage and USB keys in some
way. Each test follows this type of pattern
1. Read from some fast fast storage, be it raw device or file. Each time
the copy finishes, start again until the test ends
2. Write a large file to a filesystem on a USB stick. Each time the copy
finishes, start again until the test ends
3. When memory is low, start an alloc process that creates a mapping
the size of physical memory to stress THP allocation. This is the
"real" part of the test and the part that is meant to trigger
stalls when THP is enabled. Copying continues in the background.
4. Record the CPU usage and time to execute of the alloc process
5. Record the number of THP allocs and fallbacks as well as the number of THP
pages in use a the end of the test just before alloc exited
6. Run the test 5 times to get an idea of variability
7. Between each run, sync is run and caches dropped and the test
waits until nr_dirty is a small number to avoid interference
or caching between iterations that would skew the figures.
The individual tests were then
writebackCPDeviceBasevfat
Disable THP, read from a raw device (sda), vfat on USB stick
writebackCPDeviceBaseext4
Disable THP, read from a raw device (sda), ext4 on USB stick
writebackCPDevicevfat
THP enabled, read from a raw device (sda), vfat on USB stick
writebackCPDeviceext4
THP enabled, read from a raw device (sda), ext4 on USB stick
writebackCPFilevfat
THP enabled, read from a file on fast storage and USB, both vfat
writebackCPFileext4
THP enabled, read from a file on fast storage and USB, both ext4
The kernels tested were
3.1 3.1
vanilla 3.2-rc5
freemore Patches 1-10
immediate Patches 1-11
andrea The 8 patches Andrea posted as a basis of comparison
The results are very long unfortunately. I'll start with the case
where we are not using THP at all
writebackCPDeviceBasevfat
3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
System Time 1.28 ( 0.00%) 54.49 (-4143.46%) 48.63 (-3687.69%) 4.69 ( -265.11%) 51.88 (-3940.81%)
+/- 0.06 ( 0.00%) 2.45 (-4305.55%) 4.75 (-8430.57%) 7.46 (-13282.76%) 4.76 (-8440.70%)
User Time 0.09 ( 0.00%) 0.05 ( 40.91%) 0.06 ( 29.55%) 0.07 ( 15.91%) 0.06 ( 27.27%)
+/- 0.02 ( 0.00%) 0.01 ( 45.39%) 0.02 ( 25.07%) 0.00 ( 77.06%) 0.01 ( 52.24%)
Elapsed Time 110.27 ( 0.00%) 56.38 ( 48.87%) 49.95 ( 54.70%) 11.77 ( 89.33%) 53.43 ( 51.54%)
+/- 7.33 ( 0.00%) 3.77 ( 48.61%) 4.94 ( 32.63%) 6.71 ( 8.50%) 4.76 ( 35.03%)
THP Active 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
+/- 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Fault Alloc 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
+/- 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Fault Fallback 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
+/- 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
The THP figures are obviously all 0 because THP was enabled. The
main thing to watch is the elapsed times and how they compare to
times when THP is enabled later. It's also important to note that
elapsed time is improved by this series as System CPu time is much
reduced.
writebackCPDevicevfat
3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
System Time 1.22 ( 0.00%) 13.89 (-1040.72%) 46.40 (-3709.20%) 4.44 ( -264.37%) 47.37 (-3789.33%)
+/- 0.06 ( 0.00%) 22.82 (-37635.56%) 3.84 (-6249.44%) 6.48 (-10618.92%) 6.60
(-10818.53%)
User Time 0.06 ( 0.00%) 0.06 ( -6.90%) 0.05 ( 17.24%) 0.05 ( 13.79%) 0.04 ( 31.03%)
+/- 0.01 ( 0.00%) 0.01 ( 33.33%) 0.01 ( 33.33%) 0.01 ( 39.14%) 0.01 ( 25.46%)
Elapsed Time 10445.54 ( 0.00%) 2249.92 ( 78.46%) 70.06 ( 99.33%) 16.59 ( 99.84%) 472.43 (
95.48%)
+/- 643.98 ( 0.00%) 811.62 ( -26.03%) 10.02 ( 98.44%) 7.03 ( 98.91%) 59.99 ( 90.68%)
THP Active 15.60 ( 0.00%) 35.20 ( 225.64%) 65.00 ( 416.67%) 70.80 ( 453.85%) 62.20 ( 398.72%)
+/- 18.48 ( 0.00%) 51.29 ( 277.59%) 15.99 ( 86.52%) 37.91 ( 205.18%) 22.02 ( 119.18%)
Fault Alloc 121.80 ( 0.00%) 76.60 ( 62.89%) 155.40 ( 127.59%) 181.20 ( 148.77%) 286.60 ( 235.30%)
+/- 73.51 ( 0.00%) 61.11 ( 83.12%) 34.89 ( 47.46%) 31.88 ( 43.36%) 68.13 ( 92.68%)
Fault Fallback 881.20 ( 0.00%) 926.60 ( -5.15%) 847.60 ( 3.81%) 822.00 ( 6.72%) 716.60 ( 18.68%)
+/- 73.51 ( 0.00%) 61.26 ( 16.67%) 34.89 ( 52.54%) 31.65 ( 56.94%) 67.75 ( 7.84%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 3540.88 1945.37 716.04 64.97 1937.03
Total Elapsed Time (seconds) 52417.33 11425.90 501.02 230.95 2520.28
The first thing to note is the "Elapsed Time" for the vanilla kernels
of 2249 seconds versus 56 with THP disabled which might explain the
reports of USB stalls with THP enabled. Applying the patches brings
performance in line with THP-disabled performance while isolating
pages for immediate reclaim from the LRU cuts down System CPU time.
The "Fault Alloc" success rate figures are also improved. The vanilla
kernel only managed to allocate 76.6 pages on average over the course
of 5 iterations where as applying the series allocated 181.20 on
average albeit it is well within variance. It's worth noting that
applies the series at least descreases the amount of variance which
implies an improvement.
Andrea's series had a higher success rate for THP allocations but
at a severe cost to elapsed time which is still better than vanilla
but still much worse than disabling THP altogether. One can bring my
series close to Andrea's by removing this check
/*
* If compaction is deferred for high-order allocations, it is because
* sync compaction recently failed. In this is the case and the caller
* has requested the system not be heavily disrupted, fail the
* allocation now instead of entering direct reclaim
*/
if (deferred_compaction && (gfp_mask & __GFP_NO_KSWAPD))
goto nopage;
I didn't include a patch that removed the above check because hurting
overall performance to improve the THP figure is not what the average
user wants. It's something to consider though if someone really wants
to maximise THP usage no matter what it does to the workload initially.
This is a summary of vmstat figures from the same test.
3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
Page Ins 3257266139 1111844061 17263623 10901575 161423219
Page Outs 81054922 30364312 3626530 3657687 8753730
Swap Ins 3294 2851 6560 4964 4592
Swap Outs 390073 528094 620197 790912 698285
Direct pages scanned 1077581700 3024951463 1764930052 115140570 5901188831
Kswapd pages scanned 34826043 7112868 2131265 1686942 1893966
Kswapd pages reclaimed 28950067 4911036 1246044 966475 1497726
Direct pages reclaimed 805148398 280167837 3623473 2215044 40809360
Kswapd efficiency 83% 69% 58% 57% 79%
Kswapd velocity 664.399 622.521 4253.852 7304.360 751.490
Direct efficiency 74% 9% 0% 1% 0%
Direct velocity 20557.737 264745.137 3522673.849 498551.938 2341481.435
Percentage direct scans 96% 99% 99% 98% 99%
Page writes by reclaim 722646 529174 620319 791018 699198
Page writes file 332573 1080 122 106 913
Page writes anon 390073 528094 620197 790912 698285
Page reclaim immediate 0 2552514720 1635858848 111281140 5478375032
Page rescued immediate 0 0 0 87848 0
Slabs scanned 23552 23552 9216 8192 9216
Direct inode steals 231 0 0 0 0
Kswapd inode steals 0 0 0 0 0
Kswapd skipped wait 28076 786 0 61 6
THP fault alloc 609 383 753 906 1433
THP collapse alloc 12 6 0 0 6
THP splits 536 211 456 593 1136
THP fault fallback 4406 4633 4263 4110 3583
THP collapse fail 120 127 0 0 4
Compaction stalls 1810 728 623 779 3200
Compaction success 196 53 60 80 123
Compaction failures 1614 675 563 699 3077
Compaction pages moved 193158 53545 243185 333457 226688
Compaction move failure 9952 9396 16424 23676 45070
The main things to look at are
1. Page In/out figures are much reduced by the series.
2. Direct page scanning is incredibly high (264745.137 pages scanned
per second on the vanilla kernel) but isolating PageReclaim pages
on their own list reduces the number of pages scanned significantly.
3. The fact that "Page rescued immediate" is a positive number implies
that we sometimes race removing pages from the LRU_IMMEDIATE list
that need to be put back on a normal LRU but it happens only for
0.07% of the pages marked for immediate reclaim.
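The derived rows in the vmstat table appear to follow from the raw counters. The formulas in this sketch are inferred from the rc5-vanilla and isolate-v6r1 columns rather than taken from the MMTests source, so treat them and the function names as assumptions.

```c
#include <assert.h>

/* Pages scanned per second of total elapsed test time (assumed formula). */
static double scan_velocity(double pages_scanned, double elapsed_seconds)
{
	assert(elapsed_seconds > 0.0);
	return pages_scanned / elapsed_seconds;
}

/* Percentage of scanned pages that were reclaimed (assumed formula). */
static double reclaim_efficiency(double pages_reclaimed, double pages_scanned)
{
	assert(pages_scanned > 0.0);
	return 100.0 * pages_reclaimed / pages_scanned;
}

/* Percentage of immediate-reclaim pages that were later rescued. */
static double rescued_percent(double rescued, double marked_immediate)
{
	assert(marked_immediate > 0.0);
	return 100.0 * rescued / marked_immediate;
}
```

For rc5-vanilla, scan_velocity(3024951463.0, 11425.90) comes to roughly 264745 pages per second, which matches the "Direct velocity" row. For isolate-v6r1, rescued_percent(87848.0, 111281140.0) lands between 0.07% and 0.08%.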
writebackCPDeviceext4
3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
System Time 1.51 ( 0.00%) 1.77 ( -17.66%) 1.46 ( 2.92%) 1.15 ( 23.77%) 1.89 ( -25.63%)
+/- 0.27 ( 0.00%) 0.67 ( -148.52%) 0.33 ( -22.76%) 0.30 ( -11.15%) 0.19 ( 30.16%)
User Time 0.03 ( 0.00%) 0.04 ( -37.50%) 0.05 ( -62.50%) 0.07 ( -112.50%) 0.04 ( -18.75%)
+/- 0.01 ( 0.00%) 0.02 ( -146.64%) 0.02 ( -97.91%) 0.02 ( -75.59%) 0.02 ( -63.30%)
Elapsed Time 124.93 ( 0.00%) 114.49 ( 8.36%) 96.77 ( 22.55%) 27.48 ( 78.00%) 205.70 ( -64.65%)
+/- 20.20 ( 0.00%) 74.39 ( -268.34%) 59.88 ( -196.48%) 7.72 ( 61.79%) 25.03 ( -23.95%)
THP Active 161.80 ( 0.00%) 83.60 ( 51.67%) 141.20 ( 87.27%) 84.60 ( 52.29%) 82.60 ( 51.05%)
+/- 71.95 ( 0.00%) 43.80 ( 60.88%) 26.91 ( 37.40%) 59.02 ( 82.03%) 52.13 ( 72.45%)
Fault Alloc 471.40 ( 0.00%) 228.60 ( 48.49%) 282.20 ( 59.86%) 225.20 ( 47.77%) 388.40 ( 82.39%)
+/- 88.07 ( 0.00%) 87.42 ( 99.26%) 73.79 ( 83.78%) 109.62 ( 124.47%) 82.62 ( 93.81%)
Fault Fallback 531.60 ( 0.00%) 774.60 ( -45.71%) 720.80 ( -35.59%) 777.80 ( -46.31%) 614.80 ( -15.65%)
+/- 88.07 ( 0.00%) 87.26 ( 0.92%) 73.79 ( 16.22%) 109.62 ( -24.47%) 82.29 ( 6.56%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 50.22 33.76 30.65 24.14 128.45
Total Elapsed Time (seconds) 1113.73 1132.19 1029.45 759.49 1707.26
Similar test but the USB stick is using ext4 instead of vfat. As
ext4 does not use writepage for migration, the large stalls due to
compaction when THP is enabled are not observed. Still, isolating
PageReclaim pages on their own list helped completion time largely
by reducing the number of pages scanned by direct reclaim although
time spent in congestion_wait could also be a factor.
Again, Andrea's series had far higher success rates for THP allocation
at the cost of elapsed time. I didn't look too closely but a quick
look at the vmstat figures tells me kswapd reclaimed 8 times more pages
than the patch series and direct reclaim reclaimed roughly three times
as many pages. It follows that if memory is aggressively reclaimed,
there will be more available for THP.
writebackCPFilevfat
3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
System Time 1.76 ( 0.00%) 29.10 (-1555.52%) 46.01 (-2517.18%) 4.79 ( -172.35%) 54.89 (-3022.53%)
+/- 0.14 ( 0.00%) 25.61 (-18185.17%) 2.15 (-1434.83%) 6.60 (-4610.03%) 9.75 (-6863.76%)
User Time 0.05 ( 0.00%) 0.07 ( -45.83%) 0.05 ( -4.17%) 0.06 ( -29.17%) 0.06 ( -16.67%)
+/- 0.02 ( 0.00%) 0.02 ( 20.11%) 0.02 ( -3.14%) 0.01 ( 31.58%) 0.01 ( 47.41%)
Elapsed Time 22520.79 ( 0.00%) 1082.85 ( 95.19%) 73.30 ( 99.67%) 32.43 ( 99.86%) 291.84 ( 98.70%)
+/- 7277.23 ( 0.00%) 706.29 ( 90.29%) 19.05 ( 99.74%) 17.05 ( 99.77%) 125.55 ( 98.27%)
THP Active 83.80 ( 0.00%) 12.80 ( 15.27%) 15.60 ( 18.62%) 13.00 ( 15.51%) 0.80 ( 0.95%)
+/- 66.81 ( 0.00%) 20.19 ( 30.22%) 5.92 ( 8.86%) 15.06 ( 22.54%) 1.17 ( 1.75%)
Fault Alloc 171.00 ( 0.00%) 67.80 ( 39.65%) 97.40 ( 56.96%) 125.60 ( 73.45%) 133.00 ( 77.78%)
+/- 82.91 ( 0.00%) 30.69 ( 37.02%) 53.91 ( 65.02%) 55.05 ( 66.40%) 21.19 ( 25.56%)
Fault Fallback 832.00 ( 0.00%) 935.20 ( -12.40%) 906.00 ( -8.89%) 877.40 ( -5.46%) 870.20 ( -4.59%)
+/- 82.91 ( 0.00%) 30.69 ( 62.98%) 54.01 ( 34.86%) 55.05 ( 33.60%) 20.91 ( 74.78%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 7229.81 928.42 704.52 80.68 1330.76
Total Elapsed Time (seconds) 112849.04 5618.69 571.11 360.54 1664.28
In this case, the test is reading/writing only from filesystems but as
it's vfat, it's slow due to calling writepage during compaction. Little
to observe really - the time to complete the test goes way down
with the series applied and THP allocation success rates go up in
comparison to 3.2-rc5. The success rates are lower than 3.1.0 but
the elapsed time for that kernel is abysmal so it is not really a
sensible comparison.
As before, Andrea's series allocates more THPs at the cost of overall
performance.
writebackCPFileext4
3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
System Time 1.51 ( 0.00%) 1.77 ( -17.66%) 1.46 ( 2.92%) 1.15 ( 23.77%) 1.89 ( -25.63%)
+/- 0.27 ( 0.00%) 0.67 ( -148.52%) 0.33 ( -22.76%) 0.30 ( -11.15%) 0.19 ( 30.16%)
User Time 0.03 ( 0.00%) 0.04 ( -37.50%) 0.05 ( -62.50%) 0.07 ( -112.50%) 0.04 ( -18.75%)
+/- 0.01 ( 0.00%) 0.02 ( -146.64%) 0.02 ( -97.91%) 0.02 ( -75.59%) 0.02 ( -63.30%)
Elapsed Time 124.93 ( 0.00%) 114.49 ( 8.36%) 96.77 ( 22.55%) 27.48 ( 78.00%) 205.70 ( -64.65%)
+/- 20.20 ( 0.00%) 74.39 ( -268.34%) 59.88 ( -196.48%) 7.72 ( 61.79%) 25.03 ( -23.95%)
THP Active 161.80 ( 0.00%) 83.60 ( 51.67%) 141.20 ( 87.27%) 84.60 ( 52.29%) 82.60 ( 51.05%)
+/- 71.95 ( 0.00%) 43.80 ( 60.88%) 26.91 ( 37.40%) 59.02 ( 82.03%) 52.13 ( 72.45%)
Fault Alloc 471.40 ( 0.00%) 228.60 ( 48.49%) 282.20 ( 59.86%) 225.20 ( 47.77%) 388.40 ( 82.39%)
+/- 88.07 ( 0.00%) 87.42 ( 99.26%) 73.79 ( 83.78%) 109.62 ( 124.47%) 82.62 ( 93.81%)
Fault Fallback 531.60 ( 0.00%) 774.60 ( -45.71%) 720.80 ( -35.59%) 777.80 ( -46.31%) 614.80 ( -15.65%)
+/- 88.07 ( 0.00%) 87.26 ( 0.92%) 73.79 ( 16.22%) 109.62 ( -24.47%) 82.29 ( 6.56%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 50.22 33.76 30.65 24.14 128.45
Total Elapsed Time (seconds) 1113.73 1132.19 1029.45 759.49 1707.26
Same type of story - elapsed times go down. In this case, allocation
success rates are roughly the same. As before, Andrea's series has higher
success rates but takes a lot longer.
Overall the series does reduce latencies and while the tests are
inherently racy as alloc competes with the cp processes, the variability
was included. The THP allocation rates are not as high as they could
be but that is because we would have to be more aggressive about
reclaim and compaction, impacting overall performance.
Comments?
fs/btrfs/disk-io.c | 5 +-
fs/hugetlbfs/inode.c | 3 +-
fs/nfs/internal.h | 2 +-
fs/nfs/write.c | 4 +-
include/linux/fs.h | 11 ++-
include/linux/migrate.h | 23 +++++-
include/linux/mmzone.h | 4 +
include/linux/vm_event_item.h | 1 +
mm/compaction.c | 5 +-
mm/memory-failure.c | 2 +-
mm/memory_hotplug.c | 2 +-
mm/mempolicy.c | 2 +-
mm/migrate.c | 171 ++++++++++++++++++++++++++++-------------
mm/page_alloc.c | 50 +++++++++---
mm/swap.c | 74 +++++++++++++++++-
mm/vmscan.c | 114 ++++++++++++++++++++++++----
mm/vmstat.c | 2 +
17 files changed, 371 insertions(+), 104 deletions(-)
--
1.7.3.4
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 100+ messages in thread
* [PATCH 01/11] mm: compaction: Allow compaction to isolate dirty pages
2011-12-14 15:41 ` Mel Gorman
@ 2011-12-14 15:41 ` Mel Gorman
-1 siblings, 0 replies; 100+ messages in thread
From: Mel Gorman @ 2011-12-14 15:41 UTC (permalink / raw)
To: Andrew Morton
Cc: Andrea Arcangeli, Minchan Kim, Dave Jones, Jan Kara,
Andy Isaacson, Johannes Weiner, Mel Gorman, Rik van Riel,
Nai Xia, Linux-MM, LKML
Commit [39deaf85: mm: compaction: make isolate_lru_page() filter-aware]
noted that compaction does not migrate dirty or writeback pages and
that it was meaningless to pick the page and re-add it to the LRU list.
What was missed during review is that asynchronous migration moves
dirty pages if their ->migratepage callback is migrate_page() because
these can be moved without blocking. This potentially impacted
hugepage allocation success rates by a factor depending on how many
dirty pages are in the system.
This patch partially reverts 39deaf85 to allow migration to isolate
dirty pages again. This increases how much compaction disrupts the
LRU but that is addressed later in the series.
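As a rough illustration only (not part of the patch), the effect of dropping the ISOLATE_CLEAN filter for asynchronous compaction can be modelled in user space. The names below are hypothetical stand-ins, and the filter rule (skip dirty or writeback pages) is assumed from the description of 39deaf85 above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; not the kernel's definitions. */
enum { MODEL_ISOLATE_CLEAN = 0x1 };

struct model_page {
	bool dirty;
	bool writeback;
};

/* Returns true if the page would be isolated under the given mode. */
static bool model_isolate(const struct model_page *page, unsigned int mode)
{
	assert(page != NULL);
	if ((mode & MODEL_ISOLATE_CLEAN) && (page->dirty || page->writeback))
		return false;
	return true;
}

/* Isolation mode for a compaction pass, before and after this patch. */
static unsigned int model_mode(bool sync, bool patched)
{
	unsigned int mode = 0;

	/* Before the patch, async compaction only took clean pages. */
	if (!patched && !sync)
		mode |= MODEL_ISOLATE_CLEAN;
	return mode;
}
```

In this model a dirty page is rejected by an unpatched asynchronous pass and accepted by a patched one. Whether it can then be migrated without blocking is a separate question decided at migration time.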
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
---
mm/compaction.c | 3 ---
1 files changed, 0 insertions(+), 3 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index 899d956..237560e 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -349,9 +349,6 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
continue;
}
- if (!cc->sync)
- mode |= ISOLATE_CLEAN;
-
/* Try isolate the page */
if (__isolate_lru_page(page, mode, 0) != 0)
continue;
--
1.7.3.4
^ permalink raw reply related [flat|nested] 100+ messages in thread
* [PATCH 02/11] mm: compaction: Use synchronous compaction for /proc/sys/vm/compact_memory
2011-12-14 15:41 ` Mel Gorman
@ 2011-12-14 15:41 ` Mel Gorman
-1 siblings, 0 replies; 100+ messages in thread
From: Mel Gorman @ 2011-12-14 15:41 UTC (permalink / raw)
To: Andrew Morton
Cc: Andrea Arcangeli, Minchan Kim, Dave Jones, Jan Kara,
Andy Isaacson, Johannes Weiner, Mel Gorman, Rik van Riel,
Nai Xia, Linux-MM, LKML
When asynchronous compaction was introduced, the
/proc/sys/vm/compact_memory handler should have been updated to always
use synchronous compaction. This did not happen so this patch addresses
it. The assumption is if a user writes to /proc/sys/vm/compact_memory,
they are willing for that process to stall.
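For reference, a minimal user-space sketch of triggering compaction through this interface. The helper name is hypothetical, and the path is a parameter so the function can be tried against an ordinary file; on a real system it would be /proc/sys/vm/compact_memory and the write needs root.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write "1" to the compaction trigger file.
 * Returns 0 on success, -1 on failure.
 */
static int trigger_compaction(const char *path)
{
	FILE *f;

	assert(path != NULL);
	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fputs("1\n", f) == EOF) {
		fclose(f);
		return -1;
	}
	/* Write errors may only surface when the stream is flushed. */
	return fclose(f) == 0 ? 0 : -1;
}
```

With this patch applied, a caller of trigger_compaction("/proc/sys/vm/compact_memory") should expect the write to take as long as synchronous compaction needs.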
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
---
mm/compaction.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index 237560e..615502b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -666,6 +666,7 @@ static int compact_node(int nid)
.nr_freepages = 0,
.nr_migratepages = 0,
.order = -1,
+ .sync = true,
};
zone = &pgdat->node_zones[zoneid];
--
1.7.3.4
^ permalink raw reply related [flat|nested] 100+ messages in thread
* [PATCH 03/11] mm: vmscan: Check if we isolated a compound page during lumpy scan
2011-12-14 15:41 ` Mel Gorman
@ 2011-12-14 15:41 ` Mel Gorman
-1 siblings, 0 replies; 100+ messages in thread
From: Mel Gorman @ 2011-12-14 15:41 UTC (permalink / raw)
To: Andrew Morton
Cc: Andrea Arcangeli, Minchan Kim, Dave Jones, Jan Kara,
Andy Isaacson, Johannes Weiner, Mel Gorman, Rik van Riel,
Nai Xia, Linux-MM, LKML
From: Andrea Arcangeli <aarcange@redhat.com>
Properly take into account if we isolated a compound page during the
lumpy scan in reclaim and skip over the tail pages when encountered.
This corrects the values given to the tracepoint for number of lumpy
pages isolated and will avoid breaking the loop early if compound
pages smaller than the requested allocation size are requested.
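A simplified user-space model of the accounting change, for illustration only. The structure and function names are hypothetical, and the model assumes every head page it meets is isolated successfully.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model: nr_pages is 1 for a base page, N for the head of an
 * N-page compound page, and 0 for a tail page.
 */
struct lumpy_page {
	unsigned int nr_pages;
};

/*
 * Walk pfns [start, end), count the base pages isolated and step over the
 * tail pages of any compound page that was taken.
 */
static unsigned long model_lumpy_scan(const struct lumpy_page *pages,
				      size_t start, size_t end,
				      unsigned long *heads_visited)
{
	unsigned long nr_taken = 0;
	size_t pfn;

	assert(pages != NULL && heads_visited != NULL);
	*heads_visited = 0;
	for (pfn = start; pfn < end; pfn++) {
		unsigned int isolated_pages = pages[pfn].nr_pages;

		if (isolated_pages == 0)
			continue;	/* stray tail page, nothing to isolate */
		nr_taken += isolated_pages;	/* count base pages, not heads */
		(*heads_visited)++;
		pfn += isolated_pages - 1;	/* skip the tail pages */
	}
	return nr_taken;
}
```

For a range holding one base page, a 4-page compound page, another base page and a 2-page compound page, the model visits 4 heads and reports 8 pages taken, where a per-head increment would have reported 4.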
[mgorman@suse.de: Updated changelog]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
---
mm/vmscan.c | 9 ++++++---
1 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f54a05b..faf88b8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1183,13 +1183,16 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
break;
if (__isolate_lru_page(cursor_page, mode, file) == 0) {
+ unsigned int isolated_pages;
list_move(&cursor_page->lru, dst);
mem_cgroup_del_lru(cursor_page);
- nr_taken += hpage_nr_pages(page);
- nr_lumpy_taken++;
+ isolated_pages = hpage_nr_pages(page);
+ nr_taken += isolated_pages;
+ nr_lumpy_taken += isolated_pages;
if (PageDirty(cursor_page))
- nr_lumpy_dirty++;
+ nr_lumpy_dirty += isolated_pages;
scan++;
+ pfn += isolated_pages-1;
} else {
/*
* Check if the page is freed already.
--
1.7.3.4
^ permalink raw reply related [flat|nested] 100+ messages in thread
* [PATCH 04/11] mm: vmscan: Do not OOM if aborting reclaim to start compaction
2011-12-14 15:41 ` Mel Gorman
@ 2011-12-14 15:41 ` Mel Gorman
-1 siblings, 0 replies; 100+ messages in thread
From: Mel Gorman @ 2011-12-14 15:41 UTC (permalink / raw)
To: Andrew Morton
Cc: Andrea Arcangeli, Minchan Kim, Dave Jones, Jan Kara,
Andy Isaacson, Johannes Weiner, Mel Gorman, Rik van Riel,
Nai Xia, Linux-MM, LKML
During direct reclaim it is possible that reclaim will be aborted so
that compaction can be attempted to satisfy a high-order allocation. If
this decision is made before any pages are reclaimed, it is possible
that 0 is returned to the page allocator potentially triggering an
OOM. This has not been observed but it is a possibility so this patch
addresses it.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
mm/vmscan.c | 8 +++++++-
1 files changed, 7 insertions(+), 1 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index faf88b8..69057b5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2222,6 +2222,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
struct zoneref *z;
struct zone *zone;
unsigned long writeback_threshold;
+ bool should_abort_reclaim;
get_mems_allowed();
delayacct_freepages_start();
@@ -2233,7 +2234,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
sc->nr_scanned = 0;
if (!priority)
disable_swap_token(sc->mem_cgroup);
- if (shrink_zones(priority, zonelist, sc))
+ should_abort_reclaim = shrink_zones(priority, zonelist, sc);
+ if (should_abort_reclaim)
break;
/*
@@ -2301,6 +2303,10 @@ out:
if (oom_killer_disabled)
return 0;
+ /* Aborting reclaim to try compaction? don't OOM, then */
+ if (should_abort_reclaim)
+ return 1;
+
/* top priority shrink_zones still had more to do? don't OOM, then */
if (scanning_global_lru(sc) && !all_unreclaimable(zonelist, sc))
return 1;
--
1.7.3.4
^ permalink raw reply related [flat|nested] 100+ messages in thread
* [PATCH 05/11] mm: compaction: Determine if dirty pages can be migrated without blocking within ->migratepage
2011-12-14 15:41 ` Mel Gorman
@ 2011-12-14 15:41 ` Mel Gorman
-1 siblings, 0 replies; 100+ messages in thread
From: Mel Gorman @ 2011-12-14 15:41 UTC (permalink / raw)
To: Andrew Morton
Cc: Andrea Arcangeli, Minchan Kim, Dave Jones, Jan Kara,
Andy Isaacson, Johannes Weiner, Mel Gorman, Rik van Riel,
Nai Xia, Linux-MM, LKML
Asynchronous compaction is used when allocating transparent hugepages
to avoid blocking for long periods of time. Due to reports of
stalling, there was a debate on disabling synchronous compaction
but this severely impacted allocation success rates. Part of the
reason was that many dirty pages are skipped in asynchronous compaction
by the following check:
if (PageDirty(page) && !sync &&
mapping->a_ops->migratepage != migrate_page)
rc = -EBUSY;
This skips over all mapping aops using buffer_migrate_page()
even though it is possible to migrate some of these pages without
blocking. This patch updates the ->migratepage callback with a "sync"
parameter. It is the responsibility of the callback to fail gracefully
if migration would block.
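The "fail gracefully" requirement amounts to an all-or-nothing trylock over the page's buffers. The sketch below models that idea in user space with C11 atomic flags; it is not the kernel code, all names are hypothetical, and reference counting (get_bh/put_bh) is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffer_head on a page's circular list. */
struct model_buffer {
	atomic_flag locked;
	struct model_buffer *next;	/* circular, like b_this_page */
};

/* Link n buffers into a ring with every lock released. */
static void model_ring_init(struct model_buffer *bufs, size_t n)
{
	size_t i;

	assert(bufs != NULL && n > 0);
	for (i = 0; i < n; i++) {
		atomic_flag_clear(&bufs[i].locked);
		bufs[i].next = &bufs[(i + 1) % n];
	}
}

static bool model_trylock(struct model_buffer *b)
{
	/* test_and_set returns the previous value: false means we got it */
	return !atomic_flag_test_and_set(&b->locked);
}

static void model_unlock(struct model_buffer *b)
{
	atomic_flag_clear(&b->locked);
}

/*
 * Lock every buffer on the ring. In the async case (sync == false) one
 * contended buffer fails the whole attempt and the locks taken so far are
 * dropped, so the caller never waits while holding a partial set.
 */
static bool model_lock_buffers(struct model_buffer *head, bool sync)
{
	struct model_buffer *bh = head;

	assert(head != NULL);
	do {
		if (sync) {
			while (!model_trylock(bh))
				;	/* sync callers may wait (a spin in this model) */
		} else if (!model_trylock(bh)) {
			struct model_buffer *failed = bh;

			for (bh = head; bh != failed; bh = bh->next)
				model_unlock(bh);
			return false;
		}
		bh = bh->next;
	} while (bh != head);
	return true;
}
```

If another holder owns the last buffer of a three-buffer ring, model_lock_buffers(head, false) returns false and leaves the first two buffers unlocked, so the caller can report a retryable failure instead of stalling.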
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
fs/btrfs/disk-io.c | 4 +-
fs/hugetlbfs/inode.c | 3 +-
fs/nfs/internal.h | 2 +-
fs/nfs/write.c | 4 +-
include/linux/fs.h | 9 ++-
include/linux/migrate.h | 2 +-
mm/migrate.c | 129 +++++++++++++++++++++++++++++++++-------------
7 files changed, 106 insertions(+), 47 deletions(-)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 632f8f3..896b87a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -872,7 +872,7 @@ static int btree_submit_bio_hook(struct inode *inode, int rw, struct bio *bio,
#ifdef CONFIG_MIGRATION
static int btree_migratepage(struct address_space *mapping,
- struct page *newpage, struct page *page)
+ struct page *newpage, struct page *page, bool sync)
{
/*
* we can't safely write a btree page from here,
@@ -887,7 +887,7 @@ static int btree_migratepage(struct address_space *mapping,
if (page_has_private(page) &&
!try_to_release_page(page, GFP_KERNEL))
return -EAGAIN;
- return migrate_page(mapping, newpage, page);
+ return migrate_page(mapping, newpage, page, sync);
}
#endif
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 0be5a78..10b9883 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -576,7 +576,8 @@ static int hugetlbfs_set_page_dirty(struct page *page)
}
static int hugetlbfs_migrate_page(struct address_space *mapping,
- struct page *newpage, struct page *page)
+ struct page *newpage, struct page *page,
+ bool sync)
{
int rc;
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 3f4d957..8d96ed6 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -330,7 +330,7 @@ void nfs_commit_release_pages(struct nfs_write_data *data);
#ifdef CONFIG_MIGRATION
extern int nfs_migrate_page(struct address_space *,
- struct page *, struct page *);
+ struct page *, struct page *, bool);
#else
#define nfs_migrate_page NULL
#endif
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 1dda78d..33475df 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1711,7 +1711,7 @@ out_error:
#ifdef CONFIG_MIGRATION
int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
- struct page *page)
+ struct page *page, bool sync)
{
/*
* If PagePrivate is set, then the page is currently associated with
@@ -1726,7 +1726,7 @@ int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
nfs_fscache_release_page(page, GFP_KERNEL);
- return migrate_page(mapping, newpage, page);
+ return migrate_page(mapping, newpage, page, sync);
}
#endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e0bc4ff..5f3089c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -609,9 +609,12 @@ struct address_space_operations {
loff_t offset, unsigned long nr_segs);
int (*get_xip_mem)(struct address_space *, pgoff_t, int,
void **, unsigned long *);
- /* migrate the contents of a page to the specified target */
+ /*
+ * migrate the contents of a page to the specified target. If sync
+ * is false, it must not block.
+ */
int (*migratepage) (struct address_space *,
- struct page *, struct page *);
+ struct page *, struct page *, bool);
int (*launder_page) (struct page *);
int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
unsigned long);
@@ -2579,7 +2582,7 @@ extern int generic_check_addressable(unsigned, u64);
#ifdef CONFIG_MIGRATION
extern int buffer_migrate_page(struct address_space *,
- struct page *, struct page *);
+ struct page *, struct page *, bool);
#else
#define buffer_migrate_page NULL
#endif
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index e39aeec..14e6d2a 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -11,7 +11,7 @@ typedef struct page *new_page_t(struct page *, unsigned long private, int **);
extern void putback_lru_pages(struct list_head *l);
extern int migrate_page(struct address_space *,
- struct page *, struct page *);
+ struct page *, struct page *, bool);
extern int migrate_pages(struct list_head *l, new_page_t x,
unsigned long private, bool offlining,
bool sync);
diff --git a/mm/migrate.c b/mm/migrate.c
index 177aca4..65c12d2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -220,6 +220,55 @@ out:
pte_unmap_unlock(ptep, ptl);
}
+#ifdef CONFIG_BLOCK
+/* Returns true if all buffers are successfully locked */
+static bool buffer_migrate_lock_buffers(struct buffer_head *head, bool sync)
+{
+ struct buffer_head *bh = head;
+
+ /* Simple case, sync compaction */
+ if (sync) {
+ do {
+ get_bh(bh);
+ lock_buffer(bh);
+ bh = bh->b_this_page;
+
+ } while (bh != head);
+
+ return true;
+ }
+
+ /* async case, we cannot block on lock_buffer so use trylock_buffer */
+ do {
+ get_bh(bh);
+ if (!trylock_buffer(bh)) {
+ /*
+ * We failed to lock the buffer and cannot stall in
+ * async migration. Release the taken locks
+ */
+ struct buffer_head *failed_bh = bh;
+ put_bh(failed_bh);
+ bh = head;
+ while (bh != failed_bh) {
+ unlock_buffer(bh);
+ put_bh(bh);
+ bh = bh->b_this_page;
+ }
+ return false;
+ }
+
+ bh = bh->b_this_page;
+ } while (bh != head);
+ return true;
+}
+#else
+static inline bool buffer_migrate_lock_buffers(struct buffer_head *head,
+ bool sync)
+{
+ return true;
+}
+#endif /* CONFIG_BLOCK */
+
/*
* Replace the page in the mapping.
*
@@ -229,7 +278,8 @@ out:
* 3 for pages with a mapping and PagePrivate/PagePrivate2 set.
*/
static int migrate_page_move_mapping(struct address_space *mapping,
- struct page *newpage, struct page *page)
+ struct page *newpage, struct page *page,
+ struct buffer_head *head, bool sync)
{
int expected_count;
void **pslot;
@@ -259,6 +309,19 @@ static int migrate_page_move_mapping(struct address_space *mapping,
}
/*
+ * In the async migration case of moving a page with buffers, lock the
+ * buffers using trylock before the mapping is moved. If the mapping
+ * was moved, we later failed to lock the buffers and could not move
+ * the mapping back due to an elevated page count, we would have to
+ * block waiting on other references to be dropped.
+ */
+ if (!sync && head && !buffer_migrate_lock_buffers(head, sync)) {
+ page_unfreeze_refs(page, expected_count);
+ spin_unlock_irq(&mapping->tree_lock);
+ return -EAGAIN;
+ }
+
+ /*
* Now we know that no one else is looking at the page.
*/
get_page(newpage); /* add cache reference */
@@ -415,13 +478,13 @@ EXPORT_SYMBOL(fail_migrate_page);
* Pages are locked upon entry and exit.
*/
int migrate_page(struct address_space *mapping,
- struct page *newpage, struct page *page)
+ struct page *newpage, struct page *page, bool sync)
{
int rc;
BUG_ON(PageWriteback(page)); /* Writeback must be complete */
- rc = migrate_page_move_mapping(mapping, newpage, page);
+ rc = migrate_page_move_mapping(mapping, newpage, page, NULL, sync);
if (rc)
return rc;
@@ -438,28 +501,28 @@ EXPORT_SYMBOL(migrate_page);
* exist.
*/
int buffer_migrate_page(struct address_space *mapping,
- struct page *newpage, struct page *page)
+ struct page *newpage, struct page *page, bool sync)
{
struct buffer_head *bh, *head;
int rc;
if (!page_has_buffers(page))
- return migrate_page(mapping, newpage, page);
+ return migrate_page(mapping, newpage, page, sync);
head = page_buffers(page);
- rc = migrate_page_move_mapping(mapping, newpage, page);
+ rc = migrate_page_move_mapping(mapping, newpage, page, head, sync);
if (rc)
return rc;
- bh = head;
- do {
- get_bh(bh);
- lock_buffer(bh);
- bh = bh->b_this_page;
-
- } while (bh != head);
+ /*
+ * In the async case, migrate_page_move_mapping locked the buffers
+ * with an IRQ-safe spinlock held. In the sync case, the buffers
+ * need to be locked now
+ */
+ if (sync)
+ BUG_ON(!buffer_migrate_lock_buffers(head, sync));
ClearPagePrivate(page);
set_page_private(newpage, page_private(page));
@@ -536,10 +599,13 @@ static int writeout(struct address_space *mapping, struct page *page)
* Default handling if a filesystem does not provide a migration function.
*/
static int fallback_migrate_page(struct address_space *mapping,
- struct page *newpage, struct page *page)
+ struct page *newpage, struct page *page, bool sync)
{
- if (PageDirty(page))
+ if (PageDirty(page)) {
+ if (!sync)
+ return -EBUSY;
return writeout(mapping, page);
+ }
/*
* Buffers may be managed in a filesystem specific way.
@@ -549,7 +615,7 @@ static int fallback_migrate_page(struct address_space *mapping,
!try_to_release_page(page, GFP_KERNEL))
return -EAGAIN;
- return migrate_page(mapping, newpage, page);
+ return migrate_page(mapping, newpage, page, sync);
}
/*
@@ -585,29 +651,18 @@ static int move_to_new_page(struct page *newpage, struct page *page,
mapping = page_mapping(page);
if (!mapping)
- rc = migrate_page(mapping, newpage, page);
- else {
+ rc = migrate_page(mapping, newpage, page, sync);
+ else if (mapping->a_ops->migratepage)
/*
- * Do not writeback pages if !sync and migratepage is
- * not pointing to migrate_page() which is nonblocking
- * (swapcache/tmpfs uses migratepage = migrate_page).
+ * Most pages have a mapping and most filesystems provide a
+ * migratepage callback. Anonymous pages are part of swap
+ * space which also has its own migratepage callback. This
+ * is the most common path for page migration.
*/
- if (PageDirty(page) && !sync &&
- mapping->a_ops->migratepage != migrate_page)
- rc = -EBUSY;
- else if (mapping->a_ops->migratepage)
- /*
- * Most pages have a mapping and most filesystems
- * should provide a migration function. Anonymous
- * pages are part of swap space which also has its
- * own migration function. This is the most common
- * path for page migration.
- */
- rc = mapping->a_ops->migratepage(mapping,
- newpage, page);
- else
- rc = fallback_migrate_page(mapping, newpage, page);
- }
+ rc = mapping->a_ops->migratepage(mapping,
+ newpage, page, sync);
+ else
+ rc = fallback_migrate_page(mapping, newpage, page, sync);
if (rc) {
newpage->mapping = NULL;
--
1.7.3.4
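
For readers who want to experiment with the all-or-nothing locking that
buffer_migrate_lock_buffers() performs in the patch above, the following is
a minimal userspace sketch, not kernel code. It assumes a C11 atomic_flag
stands in for the buffer lock, and every demo_* name is hypothetical. The
get_bh()/put_bh() reference counting from the patch is deliberately left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct buffer_head: one node on a circular list. */
struct demo_buf {
	atomic_flag locked;
	struct demo_buf *next;	/* plays the role of b_this_page */
};

/* Returns true if the lock was free and is now held by the caller. */
static bool demo_trylock(struct demo_buf *b)
{
	return !atomic_flag_test_and_set_explicit(&b->locked,
						  memory_order_acquire);
}

static void demo_unlock(struct demo_buf *b)
{
	atomic_flag_clear_explicit(&b->locked, memory_order_release);
}

/*
 * Lock every node on the ring that starts at head. With sync == true the
 * caller is allowed to wait (here: spin). With sync == false the first
 * failed trylock releases every lock taken so far and returns false.
 */
static bool demo_lock_ring(struct demo_buf *head, bool sync)
{
	struct demo_buf *b = head;

	assert(head != NULL);
	do {
		if (sync) {
			while (!demo_trylock(b))
				;	/* spin until the lock is free */
		} else if (!demo_trylock(b)) {
			struct demo_buf *failed = b;

			/* Roll back: unlock head up to, not including, failed */
			for (b = head; b != failed; b = b->next)
				demo_unlock(b);
			return false;
		}
		b = b->next;
	} while (b != head);
	return true;
}

static void demo_unlock_ring(struct demo_buf *head)
{
	struct demo_buf *b = head;

	do {
		demo_unlock(b);
		b = b->next;
	} while (b != head);
}

/* Link n caller-provided nodes into a ring with all locks free. */
static void demo_init_ring(struct demo_buf *bufs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		atomic_flag_clear(&bufs[i].locked);
		bufs[i].next = &bufs[(i + 1) % n];
	}
}
```

The point of the rollback is that a failed async attempt leaves no buffer
locked, so the caller can return -EAGAIN without anything to clean up.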
^ permalink raw reply related [flat|nested] 100+ messages in thread
* [PATCH 06/11] mm: compaction: make isolate_lru_page() filter-aware again
2011-12-14 15:41 ` Mel Gorman
@ 2011-12-14 15:41 ` Mel Gorman
-1 siblings, 0 replies; 100+ messages in thread
From: Mel Gorman @ 2011-12-14 15:41 UTC (permalink / raw)
To: Andrew Morton
Cc: Andrea Arcangeli, Minchan Kim, Dave Jones, Jan Kara,
Andy Isaacson, Johannes Weiner, Mel Gorman, Rik van Riel,
Nai Xia, Linux-MM, LKML
Commit [39deaf85: mm: compaction: make isolate_lru_page() filter-aware]
noted that compaction does not migrate dirty or writeback pages and
that it was meaningless to pick the page and re-add it to the LRU list.
This had to be partially reverted because some dirty pages can be
migrated by compaction without blocking.
This patch updates "mm: compaction: make isolate_lru_page" by skipping
over pages that migration has no possibility of migrating to minimise
LRU disruption.
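
As a rough illustration of the filter described above, the sketch below
models only the dirty/writeback part of the isolation check in plain
userspace C. The flag values mirror the ISOLATE_* defines in this patch, but
struct demo_page and every DEMO_/demo_ name are hypothetical, and the other
checks that __isolate_lru_page() performs are not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Same bit values as ISOLATE_CLEAN and ISOLATE_ASYNC_MIGRATE in the patch. */
#define DEMO_ISOLATE_CLEAN		0x4u
#define DEMO_ISOLATE_ASYNC_MIGRATE	0x10u

/* Hypothetical flattened view of the page state the check looks at. */
struct demo_page {
	bool dirty;
	bool writeback;
	bool has_mapping;	/* page_mapping(page) != NULL */
	bool has_migratepage;	/* mapping->a_ops->migratepage != NULL */
};

/* Returns true when the page should be skipped (the kernel returns -EBUSY). */
static bool demo_skip_page(const struct demo_page *p, unsigned int mode)
{
	assert(p != NULL);

	if (!(mode & (DEMO_ISOLATE_CLEAN | DEMO_ISOLATE_ASYNC_MIGRATE)))
		return false;

	/* All the caller could do with a page under writeback is block */
	if (p->writeback)
		return true;

	if (p->dirty) {
		/* ISOLATE_CLEAN means only clean pages */
		if (mode & DEMO_ISOLATE_CLEAN)
			return true;

		/* Dirty, mapped, and no ->migratepage: cannot be moved async */
		if (p->has_mapping && !p->has_migratepage)
			return true;
	}
	return false;
}
```

With this model, a dirty page whose mapping has a ->migratepage callback is
still isolated for async migration but skipped under ISOLATE_CLEAN.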
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
include/linux/mmzone.h | 2 ++
mm/compaction.c | 3 +++
mm/vmscan.c | 35 +++++++++++++++++++++++++++++++++--
3 files changed, 38 insertions(+), 2 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 188cb2f..ac5b522 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -173,6 +173,8 @@ static inline int is_unevictable_lru(enum lru_list l)
#define ISOLATE_CLEAN ((__force isolate_mode_t)0x4)
/* Isolate unmapped file */
#define ISOLATE_UNMAPPED ((__force isolate_mode_t)0x8)
+/* Isolate for asynchronous migration */
+#define ISOLATE_ASYNC_MIGRATE ((__force isolate_mode_t)0x10)
/* LRU Isolation modes. */
typedef unsigned __bitwise__ isolate_mode_t;
diff --git a/mm/compaction.c b/mm/compaction.c
index 615502b..0379263 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -349,6 +349,9 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
continue;
}
+ if (!cc->sync)
+ mode |= ISOLATE_ASYNC_MIGRATE;
+
/* Try isolate the page */
if (__isolate_lru_page(page, mode, 0) != 0)
continue;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 69057b5..16fb177 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1061,8 +1061,39 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode, int file)
ret = -EBUSY;
- if ((mode & ISOLATE_CLEAN) && (PageDirty(page) || PageWriteback(page)))
- return ret;
+ /*
+ * To minimise LRU disruption, the caller can indicate that it only
+ * wants to isolate pages it will be able to operate on without
+ * blocking - clean pages for the most part.
+ *
+ * ISOLATE_CLEAN means that only clean pages should be isolated. This
+ * is used by reclaim when it cannot write to backing storage
+ *
+ * ISOLATE_ASYNC_MIGRATE is used to indicate that it only wants pages
+ * that it is possible to migrate without blocking
+ */
+ if (mode & (ISOLATE_CLEAN|ISOLATE_ASYNC_MIGRATE)) {
+ /* All the caller can do on PageWriteback is block */
+ if (PageWriteback(page))
+ return ret;
+
+ if (PageDirty(page)) {
+ struct address_space *mapping;
+
+ /* ISOLATE_CLEAN means only clean pages */
+ if (mode & ISOLATE_CLEAN)
+ return ret;
+
+ /*
+ * Only pages without mappings or that have a
+ * ->migratepage callback are possible to migrate
+ * without blocking
+ */
+ mapping = page_mapping(page);
+ if (mapping && !mapping->a_ops->migratepage)
+ return ret;
+ }
+ }
if ((mode & ISOLATE_UNMAPPED) && page_mapped(page))
return ret;
--
1.7.3.4
^ permalink raw reply related [flat|nested] 100+ messages in thread
* [PATCH 07/11] mm: page allocator: Do not call direct reclaim for THP allocations while compaction is deferred
2011-12-14 15:41 ` Mel Gorman
@ 2011-12-14 15:41 ` Mel Gorman
-1 siblings, 0 replies; 100+ messages in thread
From: Mel Gorman @ 2011-12-14 15:41 UTC (permalink / raw)
To: Andrew Morton
Cc: Andrea Arcangeli, Minchan Kim, Dave Jones, Jan Kara,
Andy Isaacson, Johannes Weiner, Mel Gorman, Rik van Riel,
Nai Xia, Linux-MM, LKML
If compaction is deferred, direct reclaim is used to try to free enough
pages for the allocation to succeed. For small high-orders, this has
a reasonable chance of success. However, if the caller has specified
__GFP_NO_KSWAPD to limit the disruption to the system, it makes more
sense to fail the allocation rather than stall the caller in direct
reclaim. This patch skips direct reclaim if compaction is deferred
and the caller specifies __GFP_NO_KSWAPD.
Async compaction only considers a subset of pages so it is possible for
compaction to be deferred prematurely and not enter direct reclaim even
in cases where it should. To compensate for this, this patch also defers
compaction only if sync compaction failed.
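
The two decisions described above can be condensed into a small userspace
sketch. This is only a model of the control flow in the slow path: struct
demo_ctx, the enum, and the DEMO_GFP_NO_KSWAPD bit value are hypothetical
and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit value; the real __GFP_NO_KSWAPD is defined by the kernel. */
#define DEMO_GFP_NO_KSWAPD	0x1u

enum demo_next_step {
	DEMO_DIRECT_RECLAIM,	/* fall through to direct reclaim */
	DEMO_FAIL_ALLOCATION,	/* the "goto nopage" case */
};

/* Hypothetical bundle of the slow-path state the two decisions depend on. */
struct demo_ctx {
	unsigned int gfp_mask;
	bool sync_migration;		/* false on the first pass, true on retry */
	bool deferred_compaction;	/* compaction was skipped as deferred */
};

/* What to do after a compaction attempt that did not return a page. */
static enum demo_next_step demo_after_compaction(const struct demo_ctx *ctx)
{
	assert(ctx != NULL);

	if (ctx->deferred_compaction && (ctx->gfp_mask & DEMO_GFP_NO_KSWAPD))
		return DEMO_FAIL_ALLOCATION;
	return DEMO_DIRECT_RECLAIM;
}

/* Whether a failed compaction pass should defer future compaction. */
static bool demo_should_defer(const struct demo_ctx *ctx)
{
	assert(ctx != NULL);

	/* Async compaction looks at a subset only, so its failure is weak evidence */
	return ctx->sync_migration;
}
```

In this model only callers that set the no-kswapd flag, such as THP
allocations, take the early failure path. Everyone else still enters direct
reclaim when compaction is deferred.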
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Minchan Kim <minchan.kim@gmail.com>
---
mm/page_alloc.c | 45 +++++++++++++++++++++++++++++++++++----------
1 files changed, 35 insertions(+), 10 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2b8ba3a..ecaba97 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1886,14 +1886,20 @@ static struct page *
__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
- int migratetype, unsigned long *did_some_progress,
- bool sync_migration)
+ int migratetype, bool sync_migration,
+ bool *deferred_compaction,
+ unsigned long *did_some_progress)
{
struct page *page;
- if (!order || compaction_deferred(preferred_zone))
+ if (!order)
return NULL;
+ if (compaction_deferred(preferred_zone)) {
+ *deferred_compaction = true;
+ return NULL;
+ }
+
current->flags |= PF_MEMALLOC;
*did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
nodemask, sync_migration);
@@ -1921,7 +1927,13 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
* but not enough to satisfy watermarks.
*/
count_vm_event(COMPACTFAIL);
- defer_compaction(preferred_zone);
+
+ /*
+ * As async compaction considers a subset of pageblocks, only
+ * defer if the failure was a sync compaction failure.
+ */
+ if (sync_migration)
+ defer_compaction(preferred_zone);
cond_resched();
}
@@ -1933,8 +1945,9 @@ static inline struct page *
__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
- int migratetype, unsigned long *did_some_progress,
- bool sync_migration)
+ int migratetype, bool sync_migration,
+ bool *deferred_compaction,
+ unsigned long *did_some_progress)
{
return NULL;
}
@@ -2084,6 +2097,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
unsigned long pages_reclaimed = 0;
unsigned long did_some_progress;
bool sync_migration = false;
+ bool deferred_compaction = false;
/*
* In the slowpath, we sanity check order to avoid ever trying to
@@ -2164,12 +2178,22 @@ rebalance:
zonelist, high_zoneidx,
nodemask,
alloc_flags, preferred_zone,
- migratetype, &did_some_progress,
- sync_migration);
+ migratetype, sync_migration,
+ &deferred_compaction,
+ &did_some_progress);
if (page)
goto got_pg;
sync_migration = true;
+ /*
+ * If compaction is deferred for high-order allocations, it is because
+ * sync compaction recently failed. If this is the case and the caller
+ * has requested the system not be heavily disrupted, fail the
+ * allocation now instead of entering direct reclaim
+ */
+ if (deferred_compaction && (gfp_mask & __GFP_NO_KSWAPD))
+ goto nopage;
+
/* Try direct reclaim and then allocating */
page = __alloc_pages_direct_reclaim(gfp_mask, order,
zonelist, high_zoneidx,
@@ -2232,8 +2256,9 @@ rebalance:
zonelist, high_zoneidx,
nodemask,
alloc_flags, preferred_zone,
- migratetype, &did_some_progress,
- sync_migration);
+ migratetype, sync_migration,
+ &deferred_compaction,
+ &did_some_progress);
if (page)
goto got_pg;
}
--
1.7.3.4
^ permalink raw reply related [flat|nested] 100+ messages in thread
* [PATCH 08/11] mm: compaction: Introduce sync-light migration for use by compaction
2011-12-14 15:41 ` Mel Gorman
@ 2011-12-14 15:41 ` Mel Gorman
-1 siblings, 0 replies; 100+ messages in thread
From: Mel Gorman @ 2011-12-14 15:41 UTC (permalink / raw)
To: Andrew Morton
Cc: Andrea Arcangeli, Minchan Kim, Dave Jones, Jan Kara,
Andy Isaacson, Johannes Weiner, Mel Gorman, Rik van Riel,
Nai Xia, Linux-MM, LKML
This patch adds a lightweight sync migration mode, MIGRATE_SYNC_LIGHT,
that avoids writing back pages to backing storage. Async
compaction maps to MIGRATE_ASYNC while sync compaction maps to
MIGRATE_SYNC_LIGHT. For other migrate_pages users such as memory
hotplug, MIGRATE_SYNC is used.
This avoids sync compaction stalling for an excessive length of time,
particularly when copying files to a USB stick where there might be
a large number of dirty pages backed by a filesystem that does not
support ->writepages.
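
The mapping from callers to modes described above could be sketched as
follows. This is a userspace illustration only: the DEMO_ enumerators
mirror the three mode names in this changelog, but the helper functions are
hypothetical and the blocking rule for the sync-light mode is an assumption
drawn from the description, not from the code.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the three modes named in the changelog; DEMO_ prefix is hypothetical. */
enum demo_migrate_mode {
	DEMO_MIGRATE_ASYNC,		/* never block */
	DEMO_MIGRATE_SYNC_LIGHT,	/* assumed: may block, but no writeback */
	DEMO_MIGRATE_SYNC,		/* may block and may write pages back */
};

/* Compaction never asks for full sync migration. */
static enum demo_migrate_mode demo_compaction_mode(bool sync)
{
	return sync ? DEMO_MIGRATE_SYNC_LIGHT : DEMO_MIGRATE_ASYNC;
}

/* Only full sync migration is allowed to write a dirty page back. */
static bool demo_may_writeback(enum demo_migrate_mode mode)
{
	assert(mode <= DEMO_MIGRATE_SYNC);
	return mode == DEMO_MIGRATE_SYNC;
}

/* Anything other than async is allowed to wait. */
static bool demo_may_block(enum demo_migrate_mode mode)
{
	assert(mode <= DEMO_MIGRATE_SYNC);
	return mode != DEMO_MIGRATE_ASYNC;
}
```

Other migrate_pages users such as memory hotplug would pass the full sync
mode directly rather than going through demo_compaction_mode().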
[aarcange@redhat.com: This patch is heavily based on Andrea's work]
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
fs/btrfs/disk-io.c | 3 +-
fs/hugetlbfs/inode.c | 2 +-
fs/nfs/internal.h | 2 +-
fs/nfs/write.c | 2 +-
include/linux/fs.h | 6 ++-
include/linux/migrate.h | 23 +++++++++++---
mm/compaction.c | 2 +-
mm/memory-failure.c | 2 +-
mm/memory_hotplug.c | 2 +-
mm/mempolicy.c | 2 +-
mm/migrate.c | 78 ++++++++++++++++++++++++++---------------------
11 files changed, 74 insertions(+), 50 deletions(-)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 896b87a..dbe9518 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -872,7 +872,8 @@ static int btree_submit_bio_hook(struct inode *inode, int rw, struct bio *bio,
#ifdef CONFIG_MIGRATION
static int btree_migratepage(struct address_space *mapping,
- struct page *newpage, struct page *page, bool sync)
+ struct page *newpage, struct page *page,
+ enum migrate_mode sync)
{
/*
* we can't safely write a btree page from here,
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 10b9883..6b80537 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -577,7 +577,7 @@ static int hugetlbfs_set_page_dirty(struct page *page)
static int hugetlbfs_migrate_page(struct address_space *mapping,
struct page *newpage, struct page *page,
- bool sync)
+ enum migrate_mode mode)
{
int rc;
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 8d96ed6..68b3f20 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -330,7 +330,7 @@ void nfs_commit_release_pages(struct nfs_write_data *data);
#ifdef CONFIG_MIGRATION
extern int nfs_migrate_page(struct address_space *,
- struct page *, struct page *, bool);
+ struct page *, struct page *, enum migrate_mode);
#else
#define nfs_migrate_page NULL
#endif
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 33475df..adb87d9 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1711,7 +1711,7 @@ out_error:
#ifdef CONFIG_MIGRATION
int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
- struct page *page, bool sync)
+ struct page *page, enum migrate_mode sync)
{
/*
* If PagePrivate is set, then the page is currently associated with
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5f3089c..95ede31 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -525,6 +525,7 @@ enum positive_aop_returns {
struct page;
struct address_space;
struct writeback_control;
+enum migrate_mode;
struct iov_iter {
const struct iovec *iov;
@@ -614,7 +615,7 @@ struct address_space_operations {
* is false, it must not block.
*/
int (*migratepage) (struct address_space *,
- struct page *, struct page *, bool);
+ struct page *, struct page *, enum migrate_mode);
int (*launder_page) (struct page *);
int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
unsigned long);
@@ -2582,7 +2583,8 @@ extern int generic_check_addressable(unsigned, u64);
#ifdef CONFIG_MIGRATION
extern int buffer_migrate_page(struct address_space *,
- struct page *, struct page *, bool);
+ struct page *, struct page *,
+ enum migrate_mode);
#else
#define buffer_migrate_page NULL
#endif
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 14e6d2a..775787c 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -6,18 +6,31 @@
typedef struct page *new_page_t(struct page *, unsigned long private, int **);
+/*
+ * MIGRATE_ASYNC means never block
+ * MIGRATE_SYNC_LIGHT in the current implementation means to allow blocking
+ * on most operations but not ->writepage as the potential stall time
+ * is too significant
+ * MIGRATE_SYNC will block when migrating pages
+ */
+enum migrate_mode {
+ MIGRATE_ASYNC,
+ MIGRATE_SYNC_LIGHT,
+ MIGRATE_SYNC,
+};
+
#ifdef CONFIG_MIGRATION
#define PAGE_MIGRATION 1
extern void putback_lru_pages(struct list_head *l);
extern int migrate_page(struct address_space *,
- struct page *, struct page *, bool);
+ struct page *, struct page *, enum migrate_mode);
extern int migrate_pages(struct list_head *l, new_page_t x,
unsigned long private, bool offlining,
- bool sync);
+ enum migrate_mode sync);
extern int migrate_huge_pages(struct list_head *l, new_page_t x,
unsigned long private, bool offlining,
- bool sync);
+ enum migrate_mode sync);
extern int fail_migrate_page(struct address_space *,
struct page *, struct page *);
@@ -36,10 +49,10 @@ extern int migrate_huge_page_move_mapping(struct address_space *mapping,
static inline void putback_lru_pages(struct list_head *l) {}
static inline int migrate_pages(struct list_head *l, new_page_t x,
unsigned long private, bool offlining,
- bool sync) { return -ENOSYS; }
+ enum migrate_mode sync) { return -ENOSYS; }
static inline int migrate_huge_pages(struct list_head *l, new_page_t x,
unsigned long private, bool offlining,
- bool sync) { return -ENOSYS; }
+ enum migrate_mode sync) { return -ENOSYS; }
static inline int migrate_prep(void) { return -ENOSYS; }
static inline int migrate_prep_local(void) { return -ENOSYS; }
diff --git a/mm/compaction.c b/mm/compaction.c
index 0379263..dbe1da0 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -555,7 +555,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
nr_migrate = cc->nr_migratepages;
err = migrate_pages(&cc->migratepages, compaction_alloc,
(unsigned long)cc, false,
- cc->sync);
+ cc->sync ? MIGRATE_SYNC_LIGHT : MIGRATE_ASYNC);
update_nr_listpages(cc);
nr_remaining = cc->nr_migratepages;
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 06d3479..56080ea 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1557,7 +1557,7 @@ int soft_offline_page(struct page *page, int flags)
page_is_file_cache(page));
list_add(&page->lru, &pagelist);
ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
- 0, true);
+ 0, MIGRATE_SYNC);
if (ret) {
putback_lru_pages(&pagelist);
pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 2168489..6629faf 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -809,7 +809,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
}
/* this function returns # of failed pages */
ret = migrate_pages(&source, hotremove_migrate_alloc, 0,
- true, true);
+ true, MIGRATE_SYNC);
if (ret)
putback_lru_pages(&source);
}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index adc3954..97009a4 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -933,7 +933,7 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
if (!list_empty(&pagelist)) {
err = migrate_pages(&pagelist, new_node_page, dest,
- false, true);
+ false, MIGRATE_SYNC);
if (err)
putback_lru_pages(&pagelist);
}
diff --git a/mm/migrate.c b/mm/migrate.c
index 65c12d2..180d97f 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -222,12 +222,13 @@ out:
#ifdef CONFIG_BLOCK
/* Returns true if all buffers are successfully locked */
-static bool buffer_migrate_lock_buffers(struct buffer_head *head, bool sync)
+static bool buffer_migrate_lock_buffers(struct buffer_head *head,
+ enum migrate_mode mode)
{
struct buffer_head *bh = head;
/* Simple case, sync compaction */
- if (sync) {
+ if (mode != MIGRATE_ASYNC) {
do {
get_bh(bh);
lock_buffer(bh);
@@ -263,7 +264,7 @@ static bool buffer_migrate_lock_buffers(struct buffer_head *head, bool sync)
}
#else
static inline bool buffer_migrate_lock_buffers(struct buffer_head *head,
- bool sync)
+ enum migrate_mode mode)
{
return true;
}
@@ -279,7 +280,7 @@ static inline bool buffer_migrate_lock_buffers(struct buffer_head *head,
*/
static int migrate_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page,
- struct buffer_head *head, bool sync)
+ struct buffer_head *head, enum migrate_mode mode)
{
int expected_count;
void **pslot;
@@ -315,7 +316,8 @@ static int migrate_page_move_mapping(struct address_space *mapping,
* the mapping back due to an elevated page count, we would have to
* block waiting on other references to be dropped.
*/
- if (!sync && head && !buffer_migrate_lock_buffers(head, sync)) {
+ if (mode == MIGRATE_ASYNC && head &&
+ !buffer_migrate_lock_buffers(head, mode)) {
page_unfreeze_refs(page, expected_count);
spin_unlock_irq(&mapping->tree_lock);
return -EAGAIN;
@@ -478,13 +480,14 @@ EXPORT_SYMBOL(fail_migrate_page);
* Pages are locked upon entry and exit.
*/
int migrate_page(struct address_space *mapping,
- struct page *newpage, struct page *page, bool sync)
+ struct page *newpage, struct page *page,
+ enum migrate_mode mode)
{
int rc;
BUG_ON(PageWriteback(page)); /* Writeback must be complete */
- rc = migrate_page_move_mapping(mapping, newpage, page, NULL, sync);
+ rc = migrate_page_move_mapping(mapping, newpage, page, NULL, mode);
if (rc)
return rc;
@@ -501,17 +504,17 @@ EXPORT_SYMBOL(migrate_page);
* exist.
*/
int buffer_migrate_page(struct address_space *mapping,
- struct page *newpage, struct page *page, bool sync)
+ struct page *newpage, struct page *page, enum migrate_mode mode)
{
struct buffer_head *bh, *head;
int rc;
if (!page_has_buffers(page))
- return migrate_page(mapping, newpage, page, sync);
+ return migrate_page(mapping, newpage, page, mode);
head = page_buffers(page);
- rc = migrate_page_move_mapping(mapping, newpage, page, head, sync);
+ rc = migrate_page_move_mapping(mapping, newpage, page, head, mode);
if (rc)
return rc;
@@ -521,8 +524,8 @@ int buffer_migrate_page(struct address_space *mapping,
* with an IRQ-safe spinlock held. In the sync case, the buffers
* need to be locked now
*/
- if (sync)
- BUG_ON(!buffer_migrate_lock_buffers(head, sync));
+ if (mode != MIGRATE_ASYNC)
+ BUG_ON(!buffer_migrate_lock_buffers(head, mode));
ClearPagePrivate(page);
set_page_private(newpage, page_private(page));
@@ -599,10 +602,11 @@ static int writeout(struct address_space *mapping, struct page *page)
* Default handling if a filesystem does not provide a migration function.
*/
static int fallback_migrate_page(struct address_space *mapping,
- struct page *newpage, struct page *page, bool sync)
+ struct page *newpage, struct page *page, enum migrate_mode mode)
{
if (PageDirty(page)) {
- if (!sync)
+ /* Only writeback pages in full synchronous migration */
+ if (mode != MIGRATE_SYNC)
return -EBUSY;
return writeout(mapping, page);
}
@@ -615,7 +619,7 @@ static int fallback_migrate_page(struct address_space *mapping,
!try_to_release_page(page, GFP_KERNEL))
return -EAGAIN;
- return migrate_page(mapping, newpage, page, sync);
+ return migrate_page(mapping, newpage, page, mode);
}
/*
@@ -630,7 +634,7 @@ static int fallback_migrate_page(struct address_space *mapping,
* == 0 - success
*/
static int move_to_new_page(struct page *newpage, struct page *page,
- int remap_swapcache, bool sync)
+ int remap_swapcache, enum migrate_mode mode)
{
struct address_space *mapping;
int rc;
@@ -651,7 +655,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
mapping = page_mapping(page);
if (!mapping)
- rc = migrate_page(mapping, newpage, page, sync);
+ rc = migrate_page(mapping, newpage, page, mode);
else if (mapping->a_ops->migratepage)
/*
* Most pages have a mapping and most filesystems provide a
@@ -660,9 +664,9 @@ static int move_to_new_page(struct page *newpage, struct page *page,
* is the most common path for page migration.
*/
rc = mapping->a_ops->migratepage(mapping,
- newpage, page, sync);
+ newpage, page, mode);
else
- rc = fallback_migrate_page(mapping, newpage, page, sync);
+ rc = fallback_migrate_page(mapping, newpage, page, mode);
if (rc) {
newpage->mapping = NULL;
@@ -677,7 +681,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
}
static int __unmap_and_move(struct page *page, struct page *newpage,
- int force, bool offlining, bool sync)
+ int force, bool offlining, enum migrate_mode mode)
{
int rc = -EAGAIN;
int remap_swapcache = 1;
@@ -686,7 +690,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
struct anon_vma *anon_vma = NULL;
if (!trylock_page(page)) {
- if (!force || !sync)
+ if (!force || mode == MIGRATE_ASYNC)
goto out;
/*
@@ -732,10 +736,12 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
if (PageWriteback(page)) {
/*
- * For !sync, there is no point retrying as the retry loop
- * is expected to be too short for PageWriteback to be cleared
+ * Only in the case of a full synchronous migration is it
+ * necessary to wait for PageWriteback. In the async case,
+ * the retry loop is too short and in the sync-light case,
+ * the overhead of stalling is too much
*/
- if (!sync) {
+ if (mode != MIGRATE_SYNC) {
rc = -EBUSY;
goto uncharge;
}
@@ -806,7 +812,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
skip_unmap:
if (!page_mapped(page))
- rc = move_to_new_page(newpage, page, remap_swapcache, sync);
+ rc = move_to_new_page(newpage, page, remap_swapcache, mode);
if (rc && remap_swapcache)
remove_migration_ptes(page, page);
@@ -829,7 +835,8 @@ out:
* to the newly allocated page in newpage.
*/
static int unmap_and_move(new_page_t get_new_page, unsigned long private,
- struct page *page, int force, bool offlining, bool sync)
+ struct page *page, int force, bool offlining,
+ enum migrate_mode mode)
{
int rc = 0;
int *result = NULL;
@@ -847,7 +854,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
if (unlikely(split_huge_page(page)))
goto out;
- rc = __unmap_and_move(page, newpage, force, offlining, sync);
+ rc = __unmap_and_move(page, newpage, force, offlining, mode);
out:
if (rc != -EAGAIN) {
/*
@@ -895,7 +902,8 @@ out:
*/
static int unmap_and_move_huge_page(new_page_t get_new_page,
unsigned long private, struct page *hpage,
- int force, bool offlining, bool sync)
+ int force, bool offlining,
+ enum migrate_mode mode)
{
int rc = 0;
int *result = NULL;
@@ -908,7 +916,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
rc = -EAGAIN;
if (!trylock_page(hpage)) {
- if (!force || !sync)
+ if (!force || mode != MIGRATE_SYNC)
goto out;
lock_page(hpage);
}
@@ -919,7 +927,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
if (!page_mapped(hpage))
- rc = move_to_new_page(new_hpage, hpage, 1, sync);
+ rc = move_to_new_page(new_hpage, hpage, 1, mode);
if (rc)
remove_migration_ptes(hpage, hpage);
@@ -962,7 +970,7 @@ out:
*/
int migrate_pages(struct list_head *from,
new_page_t get_new_page, unsigned long private, bool offlining,
- bool sync)
+ enum migrate_mode mode)
{
int retry = 1;
int nr_failed = 0;
@@ -983,7 +991,7 @@ int migrate_pages(struct list_head *from,
rc = unmap_and_move(get_new_page, private,
page, pass > 2, offlining,
- sync);
+ mode);
switch(rc) {
case -ENOMEM:
@@ -1013,7 +1021,7 @@ out:
int migrate_huge_pages(struct list_head *from,
new_page_t get_new_page, unsigned long private, bool offlining,
- bool sync)
+ enum migrate_mode mode)
{
int retry = 1;
int nr_failed = 0;
@@ -1030,7 +1038,7 @@ int migrate_huge_pages(struct list_head *from,
rc = unmap_and_move_huge_page(get_new_page,
private, page, pass > 2, offlining,
- sync);
+ mode);
switch(rc) {
case -ENOMEM:
@@ -1159,7 +1167,7 @@ set_status:
err = 0;
if (!list_empty(&pagelist)) {
err = migrate_pages(&pagelist, new_page_node,
- (unsigned long)pm, 0, true);
+ (unsigned long)pm, 0, MIGRATE_SYNC);
if (err)
putback_lru_pages(&pagelist);
}
--
1.7.3.4
^ permalink raw reply related [flat|nested] 100+ messages in thread
* [PATCH 09/11] mm: vmscan: When reclaiming for compaction, ensure there are sufficient free pages available
2011-12-14 15:41 ` Mel Gorman
@ 2011-12-14 15:41 ` Mel Gorman
-1 siblings, 0 replies; 100+ messages in thread
From: Mel Gorman @ 2011-12-14 15:41 UTC (permalink / raw)
To: Andrew Morton
Cc: Andrea Arcangeli, Minchan Kim, Dave Jones, Jan Kara,
Andy Isaacson, Johannes Weiner, Mel Gorman, Rik van Riel,
Nai Xia, Linux-MM, LKML
In commit [e0887c19: vmscan: limit direct reclaim for higher order
allocations], Rik noted that reclaim was too aggressive when THP was
enabled. In his initial patch he used the number of free pages to
decide if reclaim should abort for compaction. My feedback was that
reclaim and compaction should be using the same logic when deciding if
reclaim should be aborted.
Unfortunately, this had the effect of reducing THP success rates when
the workload included something like streaming reads that continually
allocated pages. The window during which compaction could run and return
a THP was too small.
This patch combines Rik's two patches together. compaction_suitable()
is still used to decide if reclaim should be aborted to allow
compaction to run. However, it will also ensure that there is a
reasonable buffer of free pages available. This improves upon the
THP allocation success rates but bounds the number of pages that are
freed for compaction.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
mm/vmscan.c | 44 +++++++++++++++++++++++++++++++++++++++-----
1 files changed, 39 insertions(+), 5 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 16fb177..d497248 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2122,6 +2122,42 @@ restart:
throttle_vm_writeout(sc->gfp_mask);
}
+/* Returns true if compaction should go ahead for a high-order request */
+static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
+{
+ unsigned long balance_gap, watermark;
+ bool watermark_ok;
+
+ /* Do not consider compaction for orders reclaim is meant to satisfy */
+ if (sc->order <= PAGE_ALLOC_COSTLY_ORDER)
+ return false;
+
+ /*
+ * Compaction takes time to run and there are potentially other
+ * callers using the pages just freed. Continue reclaiming until
+ * there is a buffer of free pages available to give compaction
+ * a reasonable chance of completing and allocating the page
+ */
+ balance_gap = min(low_wmark_pages(zone),
+ (zone->present_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
+ KSWAPD_ZONE_BALANCE_GAP_RATIO);
+ watermark = high_wmark_pages(zone) + balance_gap + (2UL << sc->order);
+ watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, 0, 0);
+
+ /*
+ * If compaction is deferred, reclaim up to a point where
+ * compaction will have a chance of success when re-enabled
+ */
+ if (compaction_deferred(zone))
+ return watermark_ok;
+
+ /* If compaction is not ready to start, keep reclaiming */
+ if (!compaction_suitable(zone, sc->order))
+ return false;
+
+ return watermark_ok;
+}
+
/*
* This is the direct reclaim path, for page-allocating processes. We only
* try to reclaim pages from zones which will satisfy the caller's allocation
@@ -2139,8 +2175,8 @@ restart:
* scan then give up on it.
*
* This function returns true if a zone is being reclaimed for a costly
- * high-order allocation and compaction is either ready to begin or deferred.
- * This indicates to the caller that it should retry the allocation or fail.
+ * high-order allocation and compaction is ready to begin. This indicates to
+ * the caller that it should retry the allocation or fail.
*/
static bool shrink_zones(int priority, struct zonelist *zonelist,
struct scan_control *sc)
@@ -2174,9 +2210,7 @@ static bool shrink_zones(int priority, struct zonelist *zonelist,
* noticable problem, like transparent huge page
* allocations.
*/
- if (sc->order > PAGE_ALLOC_COSTLY_ORDER &&
- (compaction_suitable(zone, sc->order) ||
- compaction_deferred(zone))) {
+ if (compaction_ready(zone, sc)) {
should_abort_reclaim = true;
continue;
}
--
1.7.3.4
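For readers following the arithmetic in compaction_ready() above, here is a minimal userspace sketch of the free-page target it computes. It is not kernel code and makes several assumptions: struct zone_model, min_ul(), compaction_target() and buffer_ok() are hypothetical names, BALANCE_GAP_RATIO stands in for KSWAPD_ZONE_BALANCE_GAP_RATIO with an assumed value of 100, and the final comparison is a simplification of zone_watermark_ok_safe() that ignores lowmem reserves and per-cpu counter drift.

```c
#include <assert.h>
#include <limits.h>

/* Assumption: stands in for KSWAPD_ZONE_BALANCE_GAP_RATIO; 100 is illustrative. */
#define BALANCE_GAP_RATIO 100UL

/* Hypothetical stand-in for the few struct zone fields the check reads. */
struct zone_model {
	unsigned long present_pages;
	unsigned long low_wmark;
	unsigned long high_wmark;
	unsigned long free_pages;
};

unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Free-page target reclaim should reach before stepping aside for compaction. */
unsigned long compaction_target(const struct zone_model *z, unsigned int order)
{
	unsigned long gap;

	/* Keep the shift below well defined for any unsigned long width. */
	assert(order < sizeof(unsigned long) * CHAR_BIT - 1);

	/* Smaller of the low watermark and ceil(present_pages / ratio). */
	gap = min_ul(z->low_wmark,
		     (z->present_pages + BALANCE_GAP_RATIO - 1) / BALANCE_GAP_RATIO);

	/* High watermark, plus the gap, plus twice the size of the request. */
	return z->high_wmark + gap + (2UL << order);
}

/* Simplified order-0 watermark test: strictly more free pages than the target. */
int buffer_ok(const struct zone_model *z, unsigned int order)
{
	return z->free_pages > compaction_target(z, order);
}
```

As a worked case under these assumptions: for a zone of 262144 pages with low and high watermarks of 1500 and 1800 pages and an order-9 request, the gap is min(1500, 2622) = 1500 and the target is 1800 + 1500 + 1024 = 4324 free pages, so reclaim would continue until more than 4324 pages are free.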
^ permalink raw reply related [flat|nested] 100+ messages in thread
* [PATCH 10/11] mm: vmscan: Check if reclaim should really abort even if compaction_ready() is true for one zone
2011-12-14 15:41 ` Mel Gorman
@ 2011-12-14 15:41 ` Mel Gorman
-1 siblings, 0 replies; 100+ messages in thread
From: Mel Gorman @ 2011-12-14 15:41 UTC (permalink / raw)
To: Andrew Morton
Cc: Andrea Arcangeli, Minchan Kim, Dave Jones, Jan Kara,
Andy Isaacson, Johannes Weiner, Mel Gorman, Rik van Riel,
Nai Xia, Linux-MM, LKML
If compaction can proceed for a given zone, shrink_zones() does not
reclaim any more pages from it. After commit [e0c2327: vmscan: abort
reclaim/compaction if compaction can proceed], do_try_to_free_pages()
tries to finish as soon as possible once one zone can compact.
This was intended to prevent slabs being shrunk unnecessarily but
there are side-effects. One is that a small zone that is ready for
compaction will abort reclaim even if the chances of successfully
allocating a THP from that zone are small. It also means that reclaim
can return too early even though sc->nr_to_reclaim pages were not
reclaimed.
This partially reverts the commit until it is proven that slabs are
really being shrunk unnecessarily but preserves the check to return
1 to avoid OOM if reclaim was aborted prematurely.
[aarcange@redhat.com: This patch replaces a revert from Andrea]
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
mm/vmscan.c | 19 +++++++++----------
1 files changed, 9 insertions(+), 10 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d497248..298ceb8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2176,7 +2176,8 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
*
* This function returns true if a zone is being reclaimed for a costly
* high-order allocation and compaction is ready to begin. This indicates to
- * the caller that it should retry the allocation or fail.
+ * the caller that it should consider retrying the allocation instead of
+ * further reclaim.
*/
static bool shrink_zones(int priority, struct zonelist *zonelist,
struct scan_control *sc)
@@ -2185,7 +2186,7 @@ static bool shrink_zones(int priority, struct zonelist *zonelist,
struct zone *zone;
unsigned long nr_soft_reclaimed;
unsigned long nr_soft_scanned;
- bool should_abort_reclaim = false;
+ bool aborted_reclaim = false;
for_each_zone_zonelist_nodemask(zone, z, zonelist,
gfp_zone(sc->gfp_mask), sc->nodemask) {
@@ -2211,7 +2212,7 @@ static bool shrink_zones(int priority, struct zonelist *zonelist,
* allocations.
*/
if (compaction_ready(zone, sc)) {
- should_abort_reclaim = true;
+ aborted_reclaim = true;
continue;
}
}
@@ -2233,7 +2234,7 @@ static bool shrink_zones(int priority, struct zonelist *zonelist,
shrink_zone(priority, zone, sc);
}
- return should_abort_reclaim;
+ return aborted_reclaim;
}
static bool zone_reclaimable(struct zone *zone)
@@ -2287,7 +2288,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
struct zoneref *z;
struct zone *zone;
unsigned long writeback_threshold;
- bool should_abort_reclaim;
+ bool aborted_reclaim;
get_mems_allowed();
delayacct_freepages_start();
@@ -2299,9 +2300,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
sc->nr_scanned = 0;
if (!priority)
disable_swap_token(sc->mem_cgroup);
- should_abort_reclaim = shrink_zones(priority, zonelist, sc);
- if (should_abort_reclaim)
- break;
+ aborted_reclaim = shrink_zones(priority, zonelist, sc);
/*
* Don't shrink slabs when reclaiming memory from
@@ -2368,8 +2367,8 @@ out:
if (oom_killer_disabled)
return 0;
- /* Aborting reclaim to try compaction? don't OOM, then */
- if (should_abort_reclaim)
+ /* Aborted reclaim to try compaction? don't OOM, then */
+ if (aborted_reclaim)
return 1;
/* top priority shrink_zones still had more to do? don't OOM, then */
--
1.7.3.4
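The behavioural change in do_try_to_free_pages() above can be sketched as a small userspace model. This is an illustration only: struct pass_result, shrink_fn, try_to_free() and the two stub callbacks are hypothetical names, the priority loop is heavily simplified (no slab shrinking, writeback throttling or oom_killer_disabled check), and the break_on_abort flag exists purely to compare the behaviour before and after the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEF_PRIORITY 12	/* starting scan priority, as in the kernel */

/* Hypothetical result of one shrink_zones() pass. */
struct pass_result {
	unsigned long reclaimed;	/* pages freed in this pass */
	bool aborted;			/* some zone was ready for compaction */
};

typedef struct pass_result (*shrink_fn)(int priority);

/* Stub: each pass frees 8 pages and reports a compaction-ready zone. */
struct pass_result stub_some_progress(int priority)
{
	(void)priority;
	return (struct pass_result){ .reclaimed = 8, .aborted = true };
}

/* Stub: nothing is freed but a zone is ready for compaction. */
struct pass_result stub_no_progress(int priority)
{
	(void)priority;
	return (struct pass_result){ .reclaimed = 0, .aborted = true };
}

/*
 * break_on_abort = true models the behaviour before the patch (leave the
 * priority loop as soon as one zone can compact); false models the
 * behaviour after it (record the abort and keep going).
 */
unsigned long try_to_free(shrink_fn shrink, unsigned long nr_to_reclaim,
			  bool break_on_abort)
{
	unsigned long nr_reclaimed = 0;
	bool aborted = false;
	int priority;

	assert(shrink != NULL);

	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
		struct pass_result r = shrink(priority);

		nr_reclaimed += r.reclaimed;
		aborted = r.aborted;
		if (break_on_abort && aborted)
			break;
		if (nr_reclaimed >= nr_to_reclaim)
			break;
	}

	if (nr_reclaimed)
		return nr_reclaimed;
	/* Aborted reclaim to try compaction? Report progress to avoid OOM. */
	return aborted ? 1 : 0;
}
```

With stub_some_progress and a target of 32 pages, the pre-patch mode returns after a single pass with 8 pages, while the post-patch mode keeps going until 32 pages are reclaimed. With stub_no_progress both modes return 1, which is the check the patch preserves so that an aborted reclaim does not trigger the OOM killer.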
^ permalink raw reply related [flat|nested] 100+ messages in thread
* [PATCH 11/11] mm: Isolate pages for immediate reclaim on their own LRU
2011-12-14 15:41 ` Mel Gorman
@ 2011-12-14 15:41 ` Mel Gorman
-1 siblings, 0 replies; 100+ messages in thread
From: Mel Gorman @ 2011-12-14 15:41 UTC (permalink / raw)
To: Andrew Morton
Cc: Andrea Arcangeli, Minchan Kim, Dave Jones, Jan Kara,
Andy Isaacson, Johannes Weiner, Mel Gorman, Rik van Riel,
Nai Xia, Linux-MM, LKML
It was observed that scan rates from direct reclaim during tests
writing to both fast and slow storage were extraordinarily high. The
problem was that while pages were being marked for immediate reclaim
when writeback completed, the same pages were being encountered over
and over again during LRU scanning.
This patch isolates file-backed pages that are to be reclaimed once
they are clean onto their own LRU list, so that LRU scanning does not
keep encountering them.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
include/linux/mmzone.h | 2 +
include/linux/vm_event_item.h | 1 +
mm/page_alloc.c | 5 ++-
mm/swap.c | 74 ++++++++++++++++++++++++++++++++++++++---
mm/vmscan.c | 11 ++++++
mm/vmstat.c | 2 +
6 files changed, 89 insertions(+), 6 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ac5b522..80834eb 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -84,6 +84,7 @@ enum zone_stat_item {
NR_ACTIVE_ANON, /* " " " " " */
NR_INACTIVE_FILE, /* " " " " " */
NR_ACTIVE_FILE, /* " " " " " */
+ NR_IMMEDIATE, /* " " " " " */
NR_UNEVICTABLE, /* " " " " " */
NR_MLOCK, /* mlock()ed pages found and moved off LRU */
NR_ANON_PAGES, /* Mapped anonymous pages */
@@ -136,6 +137,7 @@ enum lru_list {
LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
+ LRU_IMMEDIATE,
LRU_UNEVICTABLE,
NR_LRU_LISTS
};
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 03b90cdc..9696fda 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -36,6 +36,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
KSWAPD_SKIP_CONGESTION_WAIT,
PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+ PGRESCUED,
#ifdef CONFIG_COMPACTION
COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ecaba97..5cf9077 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2590,7 +2590,7 @@ void show_free_areas(unsigned int filter)
printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n"
" active_file:%lu inactive_file:%lu isolated_file:%lu\n"
- " unevictable:%lu"
+ " immediate:%lu unevictable:%lu"
" dirty:%lu writeback:%lu unstable:%lu\n"
" free:%lu slab_reclaimable:%lu slab_unreclaimable:%lu\n"
" mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n",
@@ -2600,6 +2600,7 @@ void show_free_areas(unsigned int filter)
global_page_state(NR_ACTIVE_FILE),
global_page_state(NR_INACTIVE_FILE),
global_page_state(NR_ISOLATED_FILE),
+ global_page_state(NR_IMMEDIATE),
global_page_state(NR_UNEVICTABLE),
global_page_state(NR_FILE_DIRTY),
global_page_state(NR_WRITEBACK),
@@ -2627,6 +2628,7 @@ void show_free_areas(unsigned int filter)
" inactive_anon:%lukB"
" active_file:%lukB"
" inactive_file:%lukB"
+ " immediate:%lukB"
" unevictable:%lukB"
" isolated(anon):%lukB"
" isolated(file):%lukB"
@@ -2655,6 +2657,7 @@ void show_free_areas(unsigned int filter)
K(zone_page_state(zone, NR_INACTIVE_ANON)),
K(zone_page_state(zone, NR_ACTIVE_FILE)),
K(zone_page_state(zone, NR_INACTIVE_FILE)),
+ K(zone_page_state(zone, NR_IMMEDIATE)),
K(zone_page_state(zone, NR_UNEVICTABLE)),
K(zone_page_state(zone, NR_ISOLATED_ANON)),
K(zone_page_state(zone, NR_ISOLATED_FILE)),
diff --git a/mm/swap.c b/mm/swap.c
index a91caf7..9973975 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -39,6 +39,7 @@ int page_cluster;
static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_putback_immediate_pvecs);
static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
/*
@@ -255,24 +256,80 @@ static void pagevec_move_tail(struct pagevec *pvec)
}
/*
+ * Similar pair of functions to pagevec_move_tail except it is called when
+ * moving a page from the LRU_IMMEDIATE to one of the [in]active_[file|anon]
+ * lists
+ */
+static void pagevec_putback_immediate_fn(struct page *page, void *arg)
+{
+ struct zone *zone = page_zone(page);
+
+ if (PageLRU(page)) {
+ enum lru_list lru = page_lru(page);
+ list_move(&page->lru, &zone->lru[lru].list);
+ }
+}
+
+static void pagevec_putback_immediate(struct pagevec *pvec)
+{
+ pagevec_lru_move_fn(pvec, pagevec_putback_immediate_fn, NULL);
+}
+
+/*
* Writeback is about to end against a page which has been marked for immediate
* reclaim. If it still appears to be reclaimable, move it to the tail of the
* inactive list.
*/
void rotate_reclaimable_page(struct page *page)
{
+ struct zone *zone = page_zone(page);
+ struct list_head *page_list;
+ struct pagevec *pvec;
+ unsigned long flags;
+
+ page_cache_get(page);
+ local_irq_save(flags);
+ __mod_zone_page_state(zone, NR_IMMEDIATE, -1);
+
if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
!PageUnevictable(page) && PageLRU(page)) {
- struct pagevec *pvec;
- unsigned long flags;
- page_cache_get(page);
- local_irq_save(flags);
pvec = &__get_cpu_var(lru_rotate_pvecs);
if (!pagevec_add(pvec, page))
pagevec_move_tail(pvec);
- local_irq_restore(flags);
+ } else {
+ pvec = &__get_cpu_var(lru_putback_immediate_pvecs);
+ if (!pagevec_add(pvec, page))
+ pagevec_putback_immediate(pvec);
+ }
+
+ /*
+ * There is a potential race that if a page is set PageReclaim
+ * and moved to the LRU_IMMEDIATE list after writeback completed,
+ * it can be left on the LRU_IMMEDIATE list with no way for
+ * reclaim to find it.
+ *
+ * This race should be very rare but count how often it happens.
+ * If it is a continual race, then it's very unsatisfactory as there
+ * is no guarantee that rotate_reclaimable_page() will be called
+ * to rescue these pages but finding them in page reclaim is also
+ * problematic due to the problem of deciding when the right time
+ * to scan this list is.
+ */
+ page_list = &zone->lru[LRU_IMMEDIATE].list;
+ if (!zone_page_state(zone, NR_IMMEDIATE) && !list_empty(page_list)) {
+ struct page *page;
+
+ spin_lock(&zone->lru_lock);
+ while (!list_empty(page_list)) {
+ page = list_entry(page_list->prev, struct page, lru);
+ list_move(&page->lru, &zone->lru[page_lru(page)].list);
+ __count_vm_event(PGRESCUED);
+ }
+ spin_unlock(&zone->lru_lock);
}
+
+ local_irq_restore(flags);
}
static void update_page_reclaim_stat(struct zone *zone, struct page *page,
@@ -475,6 +532,13 @@ static void lru_deactivate_fn(struct page *page, void *arg)
* is _really_ small and it's non-critical problem.
*/
SetPageReclaim(page);
+
+ /*
+ * Move to the LRU_IMMEDIATE list to avoid being scanned
+ * by page reclaim uselessly.
+ */
+ list_move_tail(&page->lru, &zone->lru[LRU_IMMEDIATE].list);
+ __mod_zone_page_state(zone, NR_IMMEDIATE, 1);
} else {
/*
* The page's writeback ends up during pagevec
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 298ceb8..cb28a07 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1404,6 +1404,17 @@ putback_lru_pages(struct zone *zone, struct scan_control *sc,
}
SetPageLRU(page);
lru = page_lru(page);
+
+ /*
+ * If reclaim has tagged a file page for reclaim, move it to
+ * a separate LRU list to avoid it being scanned by other
+ * users. It is expected that as writeback completes that
+ * they are taken back off and moved to the normal LRU
+ */
+ if (lru == LRU_INACTIVE_FILE &&
+ PageReclaim(page) && PageWriteback(page))
+ lru = LRU_IMMEDIATE;
+
add_page_to_lru_list(zone, page, lru);
if (is_active_lru(lru)) {
int file = is_file_lru(lru);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 8fd603b..dbfec4c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -688,6 +688,7 @@ const char * const vmstat_text[] = {
"nr_active_anon",
"nr_inactive_file",
"nr_active_file",
+ "nr_immediate",
"nr_unevictable",
"nr_mlock",
"nr_anon_pages",
@@ -756,6 +757,7 @@ const char * const vmstat_text[] = {
"allocstall",
"pgrotated",
+ "pgrescued",
#ifdef CONFIG_COMPACTION
"compact_blocks_moved",
--
1.7.3.4
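The effect of parking pages on LRU_IMMEDIATE can be illustrated with a toy userspace model of the putback decision made in putback_lru_pages() above. Everything here is an assumption made for illustration: struct page_model, enum lru_model, putback_target() and pages_scanned() are hypothetical names, every page is assumed to start on the inactive file list, and the model ignores locking, pagevecs and the rescue path in rotate_reclaimable_page().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical two-list model; the kernel has more LRU lists than this. */
enum lru_model { MODEL_INACTIVE_FILE, MODEL_IMMEDIATE };

/* Hypothetical stand-in for the two page flags the putback check reads. */
struct page_model {
	bool reclaim;	/* PageReclaim: reclaim as soon as writeback ends */
	bool writeback;	/* PageWriteback: I/O still in flight */
};

/* Pages tagged for reclaim and still under writeback are parked separately. */
enum lru_model putback_target(const struct page_model *p)
{
	if (p->reclaim && p->writeback)
		return MODEL_IMMEDIATE;
	return MODEL_INACTIVE_FILE;
}

/*
 * Count how many pages a scan of the inactive file list would visit,
 * with and without the separate list.
 */
unsigned long pages_scanned(const struct page_model *pages, size_t n,
			    bool use_immediate)
{
	unsigned long visited = 0;
	size_t i;

	assert(pages != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (use_immediate &&
		    putback_target(&pages[i]) == MODEL_IMMEDIATE)
			continue;	/* parked, so the scanner never sees it */
		visited++;
	}
	return visited;
}
```

In this model, a set of four pages of which two are tagged for reclaim and still under writeback costs four visits per scan without the separate list and two with it. Those repeated visits are the kind of overhead the changelog describes as extraordinarily high scan rates.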
^ permalink raw reply related [flat|nested] 100+ messages in thread
* Re: [PATCH 03/11] mm: vmscan: Check if we isolated a compound page during lumpy scan
2011-12-14 15:41 ` Mel Gorman
@ 2011-12-15 23:21 ` Rik van Riel
-1 siblings, 0 replies; 100+ messages in thread
From: Rik van Riel @ 2011-12-15 23:21 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Andrea Arcangeli, Minchan Kim, Dave Jones,
Jan Kara, Andy Isaacson, Johannes Weiner, Nai Xia, Linux-MM,
LKML
On 12/14/2011 10:41 AM, Mel Gorman wrote:
> From: Andrea Arcangeli<aarcange@redhat.com>
>
> Properly take into account if we isolated a compound page during the
> lumpy scan in reclaim and skip over the tail pages when encountered.
> This corrects the values given to the tracepoint for number of lumpy
> pages isolated and will avoid breaking the loop early if compound
> pages smaller than the requested allocation size are requested.
>
> [mgorman@suse.de: Updated changelog]
> Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>
> Signed-off-by: Mel Gorman<mgorman@suse.de>
> Reviewed-by: Minchan Kim<minchan.kim@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH 04/11] mm: vmscan: Do not OOM if aborting reclaim to start compaction
2011-12-14 15:41 ` Mel Gorman
@ 2011-12-15 23:36 ` Rik van Riel
-1 siblings, 0 replies; 100+ messages in thread
From: Rik van Riel @ 2011-12-15 23:36 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Andrea Arcangeli, Minchan Kim, Dave Jones,
Jan Kara, Andy Isaacson, Johannes Weiner, Nai Xia, Linux-MM,
LKML
On 12/14/2011 10:41 AM, Mel Gorman wrote:
> During direct reclaim it is possible that reclaim will be aborted so
> that compaction can be attempted to satisfy a high-order allocation. If
> this decision is made before any pages are reclaimed, it is possible
> that 0 is returned to the page allocator potentially triggering an
> OOM. This has not been observed but it is a possibility so this patch
> addresses it.
>
> Signed-off-by: Mel Gorman<mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 100+ messages in thread
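The changelog quoted in the message above can be pictured with a small userspace model. This is only a sketch under stated assumptions: model_try_to_free_pages() and model_should_consider_oom() are hypothetical names, not kernel functions, and the model assumes the caller treats a return of 0 as "no progress". It shows why an early abort that returns 0 could be mistaken for a failed reclaim, and how reporting 1 avoids that.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical userspace model, not kernel code. The return value is
 * what the page allocator would see as "progress" from direct reclaim.
 */
static unsigned long model_try_to_free_pages(unsigned long nr_reclaimed,
                                             bool aborted_for_compaction)
{
    if (nr_reclaimed)
        return nr_reclaimed;

    /* Reclaim stopped early on purpose: report progress so the caller retries. */
    if (aborted_for_compaction)
        return 1;

    /* Nothing reclaimed and no deliberate abort: genuine lack of progress. */
    return 0;
}

/* Caller-side view in this model: only a genuine zero leads towards OOM. */
static bool model_should_consider_oom(unsigned long progress)
{
    return progress == 0;
}
```

In this model an abort before any page is reclaimed yields 1, so the caller goes on to try compaction rather than heading towards OOM.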
* Re: [PATCH 05/11] mm: compaction: Determine if dirty pages can be migrated without blocking within ->migratepage
2011-12-14 15:41 ` Mel Gorman
@ 2011-12-16 3:32 ` Rik van Riel
-1 siblings, 0 replies; 100+ messages in thread
From: Rik van Riel @ 2011-12-16 3:32 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Andrea Arcangeli, Minchan Kim, Dave Jones,
Jan Kara, Andy Isaacson, Johannes Weiner, Nai Xia, Linux-MM,
LKML
On 12/14/2011 10:41 AM, Mel Gorman wrote:
> Asynchronous compaction is used when allocating transparent hugepages
> to avoid blocking for long periods of time. Due to reports of
> stalling, there was a debate on disabling synchronous compaction
> but this severely impacted allocation success rates. Part of the
> reason was that many dirty pages are skipped in asynchronous compaction
> by the following check:
>
> if (PageDirty(page) && !sync &&
> mapping->a_ops->migratepage != migrate_page)
> rc = -EBUSY;
>
> This skips over all mapping aops using buffer_migrate_page()
> even though it is possible to migrate some of these pages without
> blocking. This patch updates the ->migratepage callback with a "sync"
> parameter. It is the responsibility of the callback to fail gracefully
> if migration would block.
>
> Signed-off-by: Mel Gorman<mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 100+ messages in thread
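The idea in the changelog quoted above, a callback that is told whether it may block and fails gracefully if it may not, can be sketched in userspace. Everything here is an assumption made for illustration: struct model_page, its fields and model_migratepage() are hypothetical, and -EBUSY is used only because the quoted check uses it; the patch itself may return a different code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a page with attached buffers. */
struct model_page {
    bool dirty;
    bool buffers_locked;   /* another context currently holds the buffer lock */
};

/*
 * Model of a ->migratepage style callback that takes a "sync" flag.
 * Returns 0 on success, or -EBUSY if it would have to wait and the
 * caller asked for asynchronous behaviour.
 */
static int model_migratepage(struct model_page *page, bool sync)
{
    if (page->buffers_locked) {
        if (!sync)
            return -EBUSY;            /* fail gracefully instead of waiting */
        page->buffers_locked = false; /* models a sync caller waiting for release */
    }

    /* A dirty page whose buffers can be taken without waiting is migratable. */
    return 0;
}
```

The point of the design is that the decision moves from a blanket check in the caller to the callback, which is the only place that knows whether a particular dirty page can be moved without blocking.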
* Re: [PATCH 06/11] mm: compaction: make isolate_lru_page() filter-aware again
2011-12-14 15:41 ` Mel Gorman
@ 2011-12-16 3:34 ` Rik van Riel
-1 siblings, 0 replies; 100+ messages in thread
From: Rik van Riel @ 2011-12-16 3:34 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Andrea Arcangeli, Minchan Kim, Dave Jones,
Jan Kara, Andy Isaacson, Johannes Weiner, Nai Xia, Linux-MM,
LKML
On 12/14/2011 10:41 AM, Mel Gorman wrote:
> Commit [39deaf85: mm: compaction: make isolate_lru_page() filter-aware]
> noted that compaction does not migrate dirty or writeback pages and
> that it was meaningless to pick the page and re-add it to the LRU list.
> This had to be partially reverted because some dirty pages can be
> migrated by compaction without blocking.
>
> This patch updates "mm: compaction: make isolate_lru_page" by skipping
> over pages that migration has no possibility of migrating to minimise
> LRU disruption.
>
> Signed-off-by: Mel Gorman<mgorman@suse.de>
Reviewed-by: Rik van Riel<riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH 07/11] mm: page allocator: Do not call direct reclaim for THP allocations while compaction is deferred
2011-12-14 15:41 ` Mel Gorman
@ 2011-12-16 4:10 ` Rik van Riel
-1 siblings, 0 replies; 100+ messages in thread
From: Rik van Riel @ 2011-12-16 4:10 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Andrea Arcangeli, Minchan Kim, Dave Jones,
Jan Kara, Andy Isaacson, Johannes Weiner, Nai Xia, Linux-MM,
LKML
On 12/14/2011 10:41 AM, Mel Gorman wrote:
> If compaction is deferred, direct reclaim is used to try to free enough
> pages for the allocation to succeed. For small high-orders, this has
> a reasonable chance of success. However, if the caller has specified
> __GFP_NO_KSWAPD to limit the disruption to the system, it makes more
> sense to fail the allocation rather than stall the caller in direct
> reclaim. This patch skips direct reclaim if compaction is deferred
> and the caller specifies __GFP_NO_KSWAPD.
>
> Async compaction only considers a subset of pages so it is possible for
> compaction to be deferred prematurely and not enter direct reclaim even
> in cases where it should. To compensate for this, this patch also defers
> compaction only if sync compaction failed.
>
> Signed-off-by: Mel Gorman<mgorman@suse.de>
> Acked-by: Minchan Kim<minchan.kim@gmail.com>
Reviewed-by: Rik van Riel<riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 100+ messages in thread
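The two decisions described in the changelog quoted above can be written as a small userspace sketch. The names are hypothetical: MODEL_GFP_NO_KSWAPD merely stands in for __GFP_NO_KSWAPD, and neither function is the page allocator's real code path.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bit standing in for __GFP_NO_KSWAPD. */
#define MODEL_GFP_NO_KSWAPD 0x1u

enum model_next_step { MODEL_DIRECT_RECLAIM, MODEL_FAIL_ALLOCATION };

/*
 * First paragraph of the changelog: if compaction is deferred and the
 * caller asked for minimal disruption, fail rather than stall in reclaim.
 */
static enum model_next_step model_after_compaction(unsigned int gfp_flags,
                                                   bool compaction_deferred)
{
    if (compaction_deferred && (gfp_flags & MODEL_GFP_NO_KSWAPD))
        return MODEL_FAIL_ALLOCATION;
    return MODEL_DIRECT_RECLAIM;
}

/*
 * Second paragraph: async compaction only looks at a subset of pages,
 * so only a failed sync attempt counts as grounds for deferring.
 */
static bool model_should_defer(bool sync_compaction, bool compaction_failed)
{
    return sync_compaction && compaction_failed;
}
```

Callers without the flag still fall through to direct reclaim in this model, which matches the changelog's remark that reclaim has a reasonable chance of success for small high-order requests.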
* Re: [PATCH 08/11] mm: compaction: Introduce sync-light migration for use by compaction
2011-12-14 15:41 ` Mel Gorman
@ 2011-12-16 4:31 ` Rik van Riel
-1 siblings, 0 replies; 100+ messages in thread
From: Rik van Riel @ 2011-12-16 4:31 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Andrea Arcangeli, Minchan Kim, Dave Jones,
Jan Kara, Andy Isaacson, Johannes Weiner, Nai Xia, Linux-MM,
LKML
On 12/14/2011 10:41 AM, Mel Gorman wrote:
> This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
> mode that avoids writing back pages to backing storage. Async
> compaction maps to MIGRATE_ASYNC while sync compaction maps to
> MIGRATE_SYNC_LIGHT. For other migrate_pages users such as memory
> hotplug, MIGRATE_SYNC is used.
>
> This avoids sync compaction stalling for an excessive length of time,
> particularly when copying files to a USB stick where there might be
> a large number of dirty pages backed by a filesystem that does not
> support ->writepages.
>
> [aarcange@redhat.com: This patch is heavily based on Andrea's work]
> Signed-off-by: Mel Gorman<mgorman@suse.de>
Reviewed-by: Rik van Riel<riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 100+ messages in thread
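The mapping of callers to migration modes described in the changelog quoted above can be summarised in a short userspace sketch. The three mode names come from the changelog; enum model_caller and the helper functions are hypothetical, and the two predicates reflect one reading of the changelog (async never blocks, only full sync may write pages back), not the patch's actual code.

```c
#include <assert.h>
#include <stdbool.h>

/* Mode names as given in the changelog; the ordering is an assumption. */
enum migrate_mode { MIGRATE_ASYNC, MIGRATE_SYNC_LIGHT, MIGRATE_SYNC };

/* Hypothetical labels for the users of migration named in the changelog. */
enum model_caller {
    MODEL_ASYNC_COMPACTION,
    MODEL_SYNC_COMPACTION,
    MODEL_MEMORY_HOTPLUG
};

static enum migrate_mode model_mode_for(enum model_caller caller)
{
    switch (caller) {
    case MODEL_ASYNC_COMPACTION:
        return MIGRATE_ASYNC;
    case MODEL_SYNC_COMPACTION:
        return MIGRATE_SYNC_LIGHT;
    default:
        return MIGRATE_SYNC;   /* other migrate_pages users, e.g. hotplug */
    }
}

/* Assumed reading: anything other than async is allowed to wait. */
static bool model_may_block(enum migrate_mode mode)
{
    return mode != MIGRATE_ASYNC;
}

/* Assumed reading: only full sync may write pages to backing storage. */
static bool model_may_writeback(enum migrate_mode mode)
{
    return mode == MIGRATE_SYNC;
}
```

Under this reading sync compaction may still wait on locks, but it never pays for writeback to a slow device such as a USB stick.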
* Re: [PATCH 09/11] mm: vmscan: When reclaiming for compaction, ensure there are sufficient free pages available
2011-12-14 15:41 ` Mel Gorman
@ 2011-12-16 4:35 ` Rik van Riel
-1 siblings, 0 replies; 100+ messages in thread
From: Rik van Riel @ 2011-12-16 4:35 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Andrea Arcangeli, Minchan Kim, Dave Jones,
Jan Kara, Andy Isaacson, Johannes Weiner, Nai Xia, Linux-MM,
LKML
On 12/14/2011 10:41 AM, Mel Gorman wrote:
> In commit [e0887c19: vmscan: limit direct reclaim for higher order
> allocations], Rik noted that reclaim was too aggressive when THP was
> enabled. In his initial patch he used the number of free pages to
> decide if reclaim should abort for compaction. My feedback was that
> reclaim and compaction should be using the same logic when deciding if
> reclaim should be aborted.
>
> Unfortunately, this had the effect of reducing THP success rates when
> the workload included something like streaming reads that continually
> allocated pages. The window during which compaction could run and return
> a THP was too small.
>
> This patch combines Rik's two patches together. compaction_suitable()
> is still used to decide if reclaim should be aborted to allow
> compaction to be used. However, it will also ensure that there is a
> reasonable buffer of free pages available. This improves upon the
> THP allocation success rates but bounds the number of pages that are
> freed for compaction.
>
> Signed-off-by: Mel Gorman<mgorman@suse.de>
Reviewed-by: Rik van Riel<riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 100+ messages in thread
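The combined test described in the changelog quoted above, compaction_suitable() plus a buffer of free pages, can be illustrated with a userspace sketch. The changelog does not say how large the buffer is, so the sizing used here (high watermark plus twice the requested block size) is purely an assumption, and struct model_zone and both functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of a zone; field names are illustrative only. */
struct model_zone {
    unsigned long free_pages;
    unsigned long high_watermark;
    bool compaction_suitable;   /* stands in for the result of compaction_suitable() */
};

/*
 * Assumed buffer: room for two blocks of the requested order on top of
 * the high watermark. The real patch may size this differently.
 */
static unsigned long model_required_free(const struct model_zone *z,
                                         unsigned int order)
{
    return z->high_watermark + (2UL << order);
}

/* Abort reclaim only if compaction looks viable AND the buffer exists. */
static bool model_abort_reclaim_for_compaction(const struct model_zone *z,
                                               unsigned int order)
{
    if (!z->compaction_suitable)
        return false;
    return z->free_pages >= model_required_free(z, order);
}
```

Requiring both conditions keeps reclaim going a little longer than compaction_suitable() alone would, which widens the window for compaction while still bounding how many pages are freed.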
* Re: [PATCH 10/11] mm: vmscan: Check if reclaim should really abort even if compaction_ready() is true for one zone
2011-12-14 15:41 ` Mel Gorman
@ 2011-12-16 4:38 ` Rik van Riel
-1 siblings, 0 replies; 100+ messages in thread
From: Rik van Riel @ 2011-12-16 4:38 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Andrea Arcangeli, Minchan Kim, Dave Jones,
Jan Kara, Andy Isaacson, Johannes Weiner, Nai Xia, Linux-MM,
LKML
On 12/14/2011 10:41 AM, Mel Gorman wrote:
> If compaction can proceed for a given zone, shrink_zones() does not
> reclaim any more pages from it. After commit [e0c2327: vmscan: abort
> reclaim/compaction if compaction can proceed], do_try_to_free_pages()
> tries to finish as soon as possible once one zone can compact.
>
> This was intended to prevent slabs being shrunk unnecessarily but
> there are side-effects. One is that a small zone that is ready for
> compaction will abort reclaim even if the chances of successfully
> allocating a THP from that zone are small. It also means that reclaim
> can return too early even though sc->nr_to_reclaim pages were not
> reclaimed.
Having slabs shrunk "too much" might actually be good,
because it does result in more memory blocks where
compaction can be successful.
If we end up frequently evicting frequently accessed
data from the slab cache, chances are the buffer cache
will cache that data (since we reload it often).
If we end up evicting infrequently used data, chances
are it won't really matter for performance.
> This partially reverts the commit until it is proven that slabs are
> really being shrunk unnecessarily but preserves the check to return
> 1 to avoid OOM if reclaim was aborted prematurely.
>
> [aarcange@redhat.com: This patch replaces a revert from Andrea]
> Signed-off-by: Mel Gorman<mgorman@suse.de>
Reviewed-by: Rik van Riel<riel@redhat.com>
--
All rights reversed
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH 11/11] mm: Isolate pages for immediate reclaim on their own LRU
2011-12-14 15:41 ` Mel Gorman
@ 2011-12-16 4:47 ` Rik van Riel
-1 siblings, 0 replies; 100+ messages in thread
From: Rik van Riel @ 2011-12-16 4:47 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Andrea Arcangeli, Minchan Kim, Dave Jones,
Jan Kara, Andy Isaacson, Johannes Weiner, Nai Xia, Linux-MM,
LKML
On 12/14/2011 10:41 AM, Mel Gorman wrote:
> It was observed that scan rates from direct reclaim during tests
> writing to both fast and slow storage were extraordinarily high. The
> problem was that while pages were being marked for immediate reclaim
> when writeback completed, the same pages were being encountered over
> and over again during LRU scanning.
>
> This patch isolates file-backed pages that are to be reclaimed when
> clean on their own LRU list.
The idea makes total sense to me. This is very similar
to the inactive_laundry list in the early 2.4 kernel.
One potential issue is that the page cannot be moved
back to the active list by mark_page_accessed(), which
would have to be taught about the immediate LRU.
> @@ -255,24 +256,80 @@ static void pagevec_move_tail(struct pagevec *pvec)
> }
>
> /*
> + * Similar pair of functions to pagevec_move_tail except it is called when
> + * moving a page from the LRU_IMMEDIATE to one of the [in]active_[file|anon]
> + * lists
> + */
> +static void pagevec_putback_immediate_fn(struct page *page, void *arg)
> +{
> + struct zone *zone = page_zone(page);
> +
> + if (PageLRU(page)) {
> + enum lru_list lru = page_lru(page);
> + list_move(&page->lru, &zone->lru[lru].list);
> + }
> +}
Should this not put the page at the reclaim end of the
inactive list, since we want to try evicting it?
> + /*
> + * There is a potential race that if a page is set PageReclaim
> + * and moved to the LRU_IMMEDIATE list after writeback completed,
> + * it can be left on the LRU_IMMEDIATE list with no way for
> + * reclaim to find it.
> + *
> + * This race should be very rare but count how often it happens.
> + * If it is a continual race, then it's very unsatisfactory as there
> + * is no guarantee that rotate_reclaimable_page() will be called
> + * to rescue these pages but finding them in page reclaim is also
> + * problematic due to the problem of deciding when the right time
> + * to scan this list is.
> + */
Would it be an idea for the pageout code to check whether the
page at the head of the LRU_IMMEDIATE list is freeable, and
then take that page?
Of course, that does mean adding a check to rotate_reclaimable_page
to make sure the page is still on the LRU_IMMEDIATE list, and did
not get moved by somebody else...
Also, it looks like your debugging check can trigger even when the
bug does not happen (on the last LRU_IMMEDIATE page), because you
decrement NR_IMMEDIATE before you get to this check.
--
All rights reversed
^ permalink raw reply [flat|nested] 100+ messages in thread
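The question raised in the message above about which end of the inactive list the page lands on can be made concrete with a userspace model of a circular doubly linked list. This is a sketch, not the kernel's list.h: the helpers are reimplemented here with simplified bodies, and model_next_reclaim_candidate() is a hypothetical name that assumes reclaim takes pages from the tail (head->prev), as the rescue loop in the patch does.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h; h->prev = h; }

/* Unlink an entry from whatever list it is on (no-op if self-linked). */
static void list_del_entry(struct list_head *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
}

/* Link an entry between two known neighbours. */
static void list_insert(struct list_head *e, struct list_head *prev,
                        struct list_head *next)
{
    next->prev = e;
    e->next = next;
    e->prev = prev;
    prev->next = e;
}

/* Move to the head of the list: the end furthest from reclaim. */
static void list_move(struct list_head *e, struct list_head *head)
{
    list_del_entry(e);
    list_insert(e, head, head->next);
}

/* Move to the tail of the list: the end reclaim looks at first. */
static void list_move_tail(struct list_head *e, struct list_head *head)
{
    list_del_entry(e);
    list_insert(e, head->prev, head);
}

/* Assumed reclaim behaviour: the next candidate is the tail entry. */
static struct list_head *model_next_reclaim_candidate(struct list_head *lru)
{
    return lru->prev == lru ? NULL : lru->prev;
}
```

With two pages already on the inactive list, moving a third with list_move() leaves the oldest page as the next candidate, whereas list_move_tail() makes the moved page the next candidate, which is the behaviour the review comment asks about.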
* Re: [PATCH 10/11] mm: vmscan: Check if reclaim should really abort even if compaction_ready() is true for one zone
2011-12-16 4:38 ` Rik van Riel
@ 2011-12-16 11:29 ` Mel Gorman
-1 siblings, 0 replies; 100+ messages in thread
From: Mel Gorman @ 2011-12-16 11:29 UTC (permalink / raw)
To: Rik van Riel
Cc: Andrew Morton, Andrea Arcangeli, Minchan Kim, Dave Jones,
Jan Kara, Andy Isaacson, Johannes Weiner, Nai Xia, Linux-MM,
LKML
On Thu, Dec 15, 2011 at 11:38:43PM -0500, Rik van Riel wrote:
> On 12/14/2011 10:41 AM, Mel Gorman wrote:
> >If compaction can proceed for a given zone, shrink_zones() does not
> >reclaim any more pages from it. After commit [e0c2327: vmscan: abort
> >reclaim/compaction if compaction can proceed], do_try_to_free_pages()
> >tries to finish as soon as possible once one zone can compact.
> >
> >This was intended to prevent slabs being shrunk unnecessarily but
> >there are side-effects. One is that a small zone that is ready for
> >compaction will abort reclaim even if the chances of successfully
> >allocating a THP from that zone is small. It also means that reclaim
> >can return too early even though sc->nr_to_reclaim pages were not
> >reclaimed.
>
> Having slabs shrunk "too much" might actually be good,
> because it does result in more memory blocks where
> compaction can be successful.
>
> If we end up frequently evicting frequently accessed
> data from the slab cache, chances are the buffer cache
> will cache that data (since we reload it often).
>
> If we end up evicting infrequently used data, chances
> are it won't really matter for performance.
>
True, but I was being mindful of Dave Chinner's recent work on
preventing the slab cache from being dumped entirely. There may still
be an impact on metadata-intensive workloads, although I did not spot
any problems myself.
> >This partially reverts the commit until it is proven that slabs are
> >really being shrunk unnecessarily but preserves the check to return
> >1 to avoid OOM if reclaim was aborted prematurely.
> >
> >[aarcange@redhat.com: This patch replaces a revert from Andrea]
> >Signed-off-by: Mel Gorman<mgorman@suse.de>
>
> Reviewed-by: Rik van Riel<riel@redhat.com>
>
Thanks.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH 11/11] mm: Isolate pages for immediate reclaim on their own LRU
2011-12-16 4:47 ` Rik van Riel
@ 2011-12-16 12:26 ` Mel Gorman
-1 siblings, 0 replies; 100+ messages in thread
From: Mel Gorman @ 2011-12-16 12:26 UTC (permalink / raw)
To: Rik van Riel
Cc: Andrew Morton, Andrea Arcangeli, Minchan Kim, Dave Jones,
Jan Kara, Andy Isaacson, Johannes Weiner, Nai Xia, Linux-MM,
LKML
On Thu, Dec 15, 2011 at 11:47:37PM -0500, Rik van Riel wrote:
> On 12/14/2011 10:41 AM, Mel Gorman wrote:
> >It was observed that scan rates from direct reclaim during tests
> >writing to both fast and slow storage were extraordinarily high. The
> >problem was that while pages were being marked for immediate reclaim
> >when writeback completed, the same pages were being encountered over
> >and over again during LRU scanning.
> >
> >This patch isolates file-backed pages that are to be reclaimed when
> >clean on their own LRU list.
>
> The idea makes total sense to me. This is very similar
> to the inactive_laundry list in the early 2.4 kernel.
>
Just to clarify, do you mean the inactive_dirty_list? It was before
my time so, out of curiosity, do you recall why it was removed? I
would guess that, based on how the LRUs were aged at the time,
adding pages to the inactive_dirty list would lose too much aging
information. If this was the case, it would not apply today as pages
moving to the "immediate reclaim" list have already been selected for
reclaim so we expect them to be old.
> One potential issue is that the page cannot be moved
> back to the active list by mark_page_accessed(), which
> would have to be taught about the immediate LRU.
>
Do you mean it *shouldn't* be moved back to the active list
by mark_page_accessed as opposed to "cannot"? As it is, if
mark_page_accessed() is called on a page on the immediate reclaim list,
it should get moved to the active list if it was previously inactive.
I'll admit this is odd but as it is we cannot tell for sure if the
page is on the inactive or immediate LRU list. Using PageReclaim is
not really an option because PG_reclaim is also used for readahead and
it seems overkill to try using a pageflag for this.
> >@@ -255,24 +256,80 @@ static void pagevec_move_tail(struct pagevec *pvec)
> > }
> >
> > /*
> >+ * Similar pair of functions to pagevec_move_tail except it is called when
> >+ * moving a page from the LRU_IMMEDIATE to one of the [in]active_[file|anon]
> >+ * lists
> >+ */
> >+static void pagevec_putback_immediate_fn(struct page *page, void *arg)
> >+{
> >+ struct zone *zone = page_zone(page);
> >+
> >+ if (PageLRU(page)) {
> >+ enum lru_list lru = page_lru(page);
> >+ list_move(&page->lru,&zone->lru[lru].list);
> >+ }
> >+}
>
> Should this not put the page at the reclaim end of the
> inactive list, since we want to try evicting it?
>
I don't think so. pagevec_putback_immediate() is used by
rotate_reclaimable_page() when the page is *not* immediately reclaimable
because it is locked, still dirty, activated or unevictable. I expected
that the most likely reason it was not reclaimable was that it had been
redirtied, in which case it should do another lap through the LRU list
to give the flushers a chance. Putting it at the tail of the list could
mean that reclaim keeps finding these pages that are being moved from
the immediate list and raising the priority unnecessarily to skip them.
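The head-versus-tail distinction above can be illustrated with a small
user-space model. This is a minimal sketch that mimics the semantics of
the kernel's list_move() and list_move_tail() on a circular list; it is
not the kernel's list.h nor code from the patch. It assumes reclaim
scans a list starting from its tail, and list_init(), list_unlink(),
list_insert() and scan_distance() are hypothetical helpers added only
for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* User-space model of a circular doubly-linked list (illustration only). */
struct list_head {
	struct list_head *next, *prev;
};

/* An empty list, or an unlinked entry, points at itself. */
static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Detach entry from whatever list it is on. */
static void list_unlink(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
}

/* Link entry between two adjacent nodes. */
static void list_insert(struct list_head *entry,
			struct list_head *prev, struct list_head *next)
{
	next->prev = entry;
	entry->next = next;
	entry->prev = prev;
	prev->next = entry;
}

/* Move entry to the head: the last place a tail-first scan reaches. */
static void list_move(struct list_head *entry, struct list_head *head)
{
	list_unlink(entry);
	list_insert(entry, head, head->next);
}

/* Move entry to the tail: the first place a tail-first scan reaches. */
static void list_move_tail(struct list_head *entry, struct list_head *head)
{
	list_unlink(entry);
	list_insert(entry, head->prev, head);
}

/*
 * Number of entries a tail-first scan visits before it reaches entry,
 * or -1 if entry is not on the list.
 */
static int scan_distance(const struct list_head *head,
			 const struct list_head *entry)
{
	const struct list_head *pos;
	int distance = 0;

	assert(head != NULL && entry != NULL);
	for (pos = head->prev; pos != head; pos = pos->prev) {
		if (pos == entry)
			return distance;
		distance++;
	}
	return -1;
}
```

With three pages already on the modelled inactive list, a page put back
with list_move() sits behind all three from the scanner's point of view
and gets a full lap, whereas list_move_tail() would make it the very
next page the scanner sees.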
> >+ /*
> >+ * There is a potential race that if a page is set PageReclaim
> >+ * and moved to the LRU_IMMEDIATE list after writeback completed,
> >+ * it can be left on the LRU_IMMEDATE list with no way for
> >+ * reclaim to find it.
> >+ *
> >+ * This race should be very rare but count how often it happens.
> >+ * If it is a continual race, then it's very unsatisfactory as there
> >+ * is no guarantee that rotate_reclaimable_page() will be called
> >+ * to rescue these pages but finding them in page reclaim is also
> >+ * problematic due to the problem of deciding when the right time
> >+ * to scan this list is.
> >+ */
>
> Would it be an idea for the pageout code to check whether the
> page at the head of the LRU_IMMEDIATE list is freeable, and
> then take that page?
>
That is one possibility.
> Of course, that does mean adding a check to rotate_reclaimable_page
> to make sure the page is still on the LRU_IMMEDIATE list, and did
> not get moved by somebody else...
>
This goes back to the problem of not being sure if the page is on the
inactive list or the immediate list and I don't want to introduce a
flag for this. While I think this could work, is it overcomplicating
things for what should be a rare occurrence (see more on this later)?
Ironically, the biggest complexity with solutions in this general
direction is getting the accounting right!
> Also, it looks like your debugging check can trigger even when the
> bug does not happen (on the last LRU_IMMEDIATE page), because you
> decrement NR_IMMEDIATE before you get to this check.
>
When NR_IMMEDIATE goes to 0, one more page is taken from the list and
moved back to an appropriate LRU list so the counts should match up.
When that counter is 0, the LRU lock is only taken if there are pages on
the list. It's racy because we are calling list_empty() outside the LRU
lock but that should not matter. Did I misunderstand you?
Also, this is not a debugging check per se. This "rescue" logic
is currently needed because it does happen. In the tests I ran, 0.05%
to 0.1% of the pages moved to the immediate reclaim list had to be
rescued from it using this logic. That was so low that I did not think a
more complex solution was justified.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH 11/11] mm: Isolate pages for immediate reclaim on their own LRU
2011-12-14 15:41 ` Mel Gorman
@ 2011-12-16 15:17 ` Johannes Weiner
-1 siblings, 0 replies; 100+ messages in thread
From: Johannes Weiner @ 2011-12-16 15:17 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Andrea Arcangeli, Minchan Kim, Dave Jones,
Jan Kara, Andy Isaacson, Rik van Riel, Nai Xia, Linux-MM, LKML
On Wed, Dec 14, 2011 at 03:41:33PM +0000, Mel Gorman wrote:
> It was observed that scan rates from direct reclaim during tests
> writing to both fast and slow storage were extraordinarily high. The
> problem was that while pages were being marked for immediate reclaim
> when writeback completed, the same pages were being encountered over
> and over again during LRU scanning.
>
> This patch isolates file-backed pages that are to be reclaimed when
> clean on their own LRU list.
Excuse me if I sound like a broken record, but have those observations
of high scan rates persisted with the per-zone dirty limits patchset?
In my tests with pzd, the scan rates went down considerably together
with the immediate reclaim / vmscan writes.
Our dirty limits are pretty low - if reclaim keeps shuffling through
dirty pages, where are the 80% reclaimable pages?! To me, this sounds
like the unfair distribution of dirty pages among zones again. Is
there a different explanation that I missed?
PS: It also seems a bit out of place in this series...?
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH 11/11] mm: Isolate pages for immediate reclaim on their own LRU
2011-12-16 15:17 ` Johannes Weiner
@ 2011-12-16 16:07 ` Mel Gorman
-1 siblings, 0 replies; 100+ messages in thread
From: Mel Gorman @ 2011-12-16 16:07 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Andrea Arcangeli, Minchan Kim, Dave Jones,
Jan Kara, Andy Isaacson, Rik van Riel, Nai Xia, Linux-MM, LKML
On Fri, Dec 16, 2011 at 04:17:31PM +0100, Johannes Weiner wrote:
> On Wed, Dec 14, 2011 at 03:41:33PM +0000, Mel Gorman wrote:
> > It was observed that scan rates from direct reclaim during tests
> > writing to both fast and slow storage were extraordinarily high. The
> > problem was that while pages were being marked for immediate reclaim
> > when writeback completed, the same pages were being encountered over
> > and over again during LRU scanning.
> >
> > This patch isolates file-backed pages that are to be reclaimed when
> > clean on their own LRU list.
>
> Excuse me if I sound like a broken record, but have those observations
> of high scan rates persisted with the per-zone dirty limits patchset?
>
Unfortunately I wasn't testing that series. The focus of this series
was primarily on THP-related stalls incurred by compaction which
did not have a dependency on that series. Even with dirty balancing,
similar stalls would be observed once dirty pages were in the zone
at all.
> In my tests with pzd, the scan rates went down considerably together
> with the immediate reclaim / vmscan writes.
>
I probably should know but what is pzd?
> Our dirty limits are pretty low - if reclaim keeps shuffling through
> dirty pages, where are the 80% reclaimable pages?! To me, this sounds
> like the unfair distribution of dirty pages among zones again. Is
> there a different explanation that I missed?
>
The alternative explanation is that the 20% dirty pages are all
long-lived, at the end of the highest zone which is always scanned first
so we continually have to scan over these dirty pages for prolonged
periods of time.
> PS: It also seems a bit out of place in this series...?
Without the last patch, the System CPU time was stupidly high. In part,
this is because we are no longer calling ->writepage from direct
reclaim. If we were, the CPU usage would be far lower but it would
be a lot slower too. It seemed remiss to leave system CPU usage that
high without some explanation or patch dealing with it.
The following replaces this patch with your series. dirtybalance-v7r1 is
yours.
                        3.1.0-vanilla            rc5-vanilla          freemore-v6r1           isolate-v6r1      dirtybalance-v7r1
System Time         1.22 (     0.00%)     13.89 ( -1040.72%)     46.40 ( -3709.20%)      4.44 (  -264.37%)     43.05 ( -3434.81%)
+/-                 0.06 (     0.00%)     22.82 (-37635.56%)      3.84 ( -6249.44%)      6.48 (-10618.92%)      4.04 ( -6581.33%)
User Time           0.06 (     0.00%)      0.06 (    -6.90%)      0.05 (    17.24%)      0.05 (    13.79%)      0.05 (    20.69%)
+/-                 0.01 (     0.00%)      0.01 (    33.33%)      0.01 (    33.33%)      0.01 (    39.14%)      0.01 (    -1.84%)
Elapsed Time    10445.54 (     0.00%)   2249.92 (    78.46%)     70.06 (    99.33%)     16.59 (    99.84%)     73.71 (    99.29%)
+/-               643.98 (     0.00%)    811.62 (   -26.03%)     10.02 (    98.44%)      7.03 (    98.91%)     17.90 (    97.22%)
THP Active         15.60 (     0.00%)     35.20 (   225.64%)     65.00 (   416.67%)     70.80 (   453.85%)    102.60 (   657.69%)
+/-                18.48 (     0.00%)     51.29 (   277.59%)     15.99 (    86.52%)     37.91 (   205.18%)     26.06 (   141.02%)
Fault Alloc       121.80 (     0.00%)     76.60 (    62.89%)    155.40 (   127.59%)    181.20 (   148.77%)    214.80 (   176.35%)
+/-                73.51 (     0.00%)     61.11 (    83.12%)     34.89 (    47.46%)     31.88 (    43.36%)     53.21 (    72.39%)
Fault Fallback    881.20 (     0.00%)    926.60 (    -5.15%)    847.60 (     3.81%)    822.00 (     6.72%)    788.40 (    10.53%)
+/-                73.51 (     0.00%)     61.26 (    16.67%)     34.89 (    52.54%)     31.65 (    56.94%)     53.41 (    27.35%)

MMTests Statistics: duration
User/Sys Time Running Test (seconds)   3540.88   1945.37    716.04     64.97    715.04
Total Elapsed Time (seconds)          52417.33  11425.90    501.02    230.95    549.64
Your series does help the System CPU time, bringing it down from 46.4
seconds to 43.05 seconds. That is within the noise but towards the edge
of one standard deviation. With such a small reduction, elapsed time was
not helped. However, it did help THP allocation success rates - still
within the noise but again at the edge of the noise which indicates
a solid improvement.
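The "within the noise" judgement above can be made concrete with a
little arithmetic. The sketch below assumes the +/- rows are one
standard deviation and treats two results as within the noise when
their one-sigma intervals overlap. Both that criterion and the helper
name sigma_intervals_overlap() are assumptions made for illustration,
not how MMTests reports significance.

```c
#include <assert.h>

/*
 * Return 1 if [mean_a - sd_a, mean_a + sd_a] overlaps
 * [mean_b - sd_b, mean_b + sd_b], otherwise 0.
 */
static int sigma_intervals_overlap(double mean_a, double sd_a,
				   double mean_b, double sd_b)
{
	double diff = mean_a > mean_b ? mean_a - mean_b : mean_b - mean_a;

	assert(sd_a >= 0.0 && sd_b >= 0.0);
	return diff < sd_a + sd_b;
}
```

Under that assumption, System Time for freemore-v6r1 (46.40 +/- 3.84)
and dirtybalance-v7r1 (43.05 +/- 4.04) overlap, as do the THP Active
figures (65.00 +/- 15.99 and 102.60 +/- 26.06), while Elapsed Time for
freemore-v6r1 (70.06 +/- 10.02) and isolate-v6r1 (16.59 +/- 7.03) does
not.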
MMTests Statistics: vmstat
                              3.1.0-vanilla        rc5-vanilla      freemore-v6r1       isolate-v6r1  dirtybalance-v7r1
Page Ins                         3257266139         1111844061           17263623           10901575           20870385
Page Outs                          81054922           30364312            3626530            3657687            3665499
Swap Ins                               3294               2851               6560               4964               6598
Swap Outs                            390073             528094             620197             790912             604228
Direct pages scanned             1077581700         3024951463         1764930052          115140570         1796314840
Kswapd pages scanned               34826043            7112868            2131265            1686942            2093637
Kswapd pages reclaimed             28950067            4911036            1246044             966475            1319662
Direct pages reclaimed            805148398          280167837            3623473            2215044            4182274
Kswapd efficiency                       83%                69%                58%                57%                63%
Kswapd velocity                     664.399            622.521           4253.852           7304.360           3809.106
Direct efficiency                       74%                 9%                 0%                 1%                 0%
Direct velocity                   20557.737         264745.137        3522673.849         498551.938        3268166.145
Percentage direct scans                 96%                99%                99%                98%                99%
Page writes by reclaim               722646             529174             620319             791018             604368
Page writes file                     332573               1080                122                106                140
Page writes anon                     390073             528094             620197             790912             604228
Page reclaim immediate                    0         2552514720         1635858848          111281140         1661416934
Page rescued immediate                    0                  0                  0              87848                  0
Slabs scanned                         23552              23552               9216               8192               8192
Direct inode steals                     231                  0                  0                  0                  0
Kswapd inode steals                       0                  0                  0                  0                  0
Kswapd skipped wait                   28076                786                  0                 61                  1
THP fault alloc                         609                383                753                906               1074
THP collapse alloc                       12                  6                  0                  0                  0
THP splits                              536                211                456                593                561
THP fault fallback                     4406               4633               4263               4110               3942
THP collapse fail                       120                127                  0                  0                  0
Compaction stalls                      1810                728                623                779                869
Compaction success                      196                 53                 60                 80                 99
Compaction failures                    1614                675                563                699                770
Compaction pages moved               193158              53545             243185             333457             409585
Compaction move failure                9952               9396              16424              23676              30668
The direct page scanned figure with your patch is still very high
unfortunately.
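The derived rows in the vmstat table appear to follow directly from the
raw counters, which makes the figures easier to sanity-check. The
sketch below is an illustration only: the formulas are inferred from
the numbers in this mail (efficiency as reclaimed over scanned,
velocity as pages scanned per second of Total Elapsed Time, with
percentages apparently truncated rather than rounded), and the function
names are hypothetical rather than taken from MMTests.

```c
#include <assert.h>

/* Percentage of scanned pages that were reclaimed, truncated. */
static int reclaim_efficiency_pct(unsigned long long reclaimed,
				  unsigned long long scanned)
{
	assert(scanned > 0);
	return (int)(reclaimed * 100 / scanned);
}

/* Pages scanned per second of total elapsed test time. */
static double scan_velocity(unsigned long long scanned,
			    double elapsed_seconds)
{
	assert(elapsed_seconds > 0.0);
	return (double)scanned / elapsed_seconds;
}

/* Share of all page scanning done by direct reclaim, truncated. */
static int direct_scan_share_pct(unsigned long long direct_scanned,
				 unsigned long long kswapd_scanned)
{
	assert(direct_scanned + kswapd_scanned > 0);
	return (int)(direct_scanned * 100 /
		     (direct_scanned + kswapd_scanned));
}

/* Percentage of immediate-reclaim pages that had to be rescued. */
static double rescue_rate_pct(unsigned long long rescued,
			      unsigned long long immediate)
{
	assert(immediate > 0);
	return 100.0 * (double)rescued / (double)immediate;
}
```

Fed the isolate-v6r1 column (115140570 direct pages scanned, 2215044
direct pages reclaimed, 1686942 kswapd pages scanned, 230.95 seconds
elapsed), these helpers give a direct efficiency of 1%, a direct
velocity of roughly 498552 pages per second and a direct scan share of
98%, matching the table. The rescue rate for the same column, 87848 out
of 111281140, comes to just under 0.08%.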
Overall, I would say that your series is not a replacement for the last
patch in this series.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH 0/11] Reduce compaction-related stalls and improve asynchronous migration of dirty pages v6
2011-12-14 15:41 ` Mel Gorman
@ 2011-12-16 22:56 ` Andrew Morton
-1 siblings, 0 replies; 100+ messages in thread
From: Andrew Morton @ 2011-12-16 22:56 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Minchan Kim, Dave Jones, Jan Kara,
Andy Isaacson, Johannes Weiner, Rik van Riel, Nai Xia, Linux-MM,
LKML
On Wed, 14 Dec 2011 15:41:22 +0000
Mel Gorman <mgorman@suse.de> wrote:
> Short summary: There are severe stalls when a USB stick using VFAT
> is used with THP enabled that are reduced by this series. If you are
> experiencing this problem, please test and report back and considering
> I have seen complaints from openSUSE and Fedora users on this as well
> as a few private mails, I'm guessing it's a widespread issue. This
> is a new type of USB-related stall because it is due to synchronous
> compaction writing where as in the past the big problem was dirty
> pages reaching the end of the LRU and being written by reclaim.
>
> Am cc'ing Andrew this time and this series would replace
> mm-do-not-stall-in-synchronous-compaction-for-thp-allocations.patch.
> I'm also cc'ing Dave Jones as he might have merged that patch to Fedora
> for wider testing and ideally it would be reverted and replaced by
> this series.
So it appears that the problem is painful for distros and users and
that we won't have this fixed until 3.2 at best, and that fix will be a
difficult backport for distributors of earlier kernels.
To serve those people better, I'm wondering if we should merge
mm-do-not-stall-in-synchronous-compaction-for-thp-allocations now, make
it available for -stable backport and then revert it as part of this
series? ie: give people a stopgap while we fix it properly?
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH 05/11] mm: compaction: Determine if dirty pages can be migrated without blocking within ->migratepage
2011-12-14 15:41 ` Mel Gorman
@ 2011-12-16 23:20 ` Andrew Morton
-1 siblings, 0 replies; 100+ messages in thread
From: Andrew Morton @ 2011-12-16 23:20 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Minchan Kim, Dave Jones, Jan Kara,
Andy Isaacson, Johannes Weiner, Rik van Riel, Nai Xia, Linux-MM,
LKML
On Wed, 14 Dec 2011 15:41:27 +0000
Mel Gorman <mgorman@suse.de> wrote:
> Asynchronous compaction is used when allocating transparent hugepages
> to avoid blocking for long periods of time. Due to reports of
> stalling, there was a debate on disabling synchronous compaction
> but this severely impacted allocation success rates. Part of the
> reason was that many dirty pages are skipped in asynchronous compaction
> by the following check;
>
> if (PageDirty(page) && !sync &&
> mapping->a_ops->migratepage != migrate_page)
> rc = -EBUSY;
>
> This skips over all mapping aops using buffer_migrate_page()
> even though it is possible to migrate some of these pages without
> blocking. This patch updates the ->migratepage callback with a "sync"
> parameter. It is the responsibility of the callback to fail gracefully
> if migration would block.
>
> ...
>
> @@ -259,6 +309,19 @@ static int migrate_page_move_mapping(struct address_space *mapping,
> }
>
> /*
> + * In the async migration case of moving a page with buffers, lock the
> + * buffers using trylock before the mapping is moved. If the mapping
> + * was moved, we later failed to lock the buffers and could not move
> + * the mapping back due to an elevated page count, we would have to
> + * block waiting on other references to be dropped.
> + */
> + if (!sync && head && !buffer_migrate_lock_buffers(head, sync)) {
Once it has been established that "sync" is false, I find it clearer to
pass in plain old "false" to buffer_migrate_lock_buffers(). Minor point.
I hadn't paid a lot of attention to buffer_migrate_page() before.
Scary function. I'm rather worried about its interactions with ext3
journal commit which locks buffers then plays with them while leaving
the page unlocked. How vigorously has this been whitebox-tested?
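The ordering described in the quoted comment (take every buffer lock with
a trylock before the mapping is moved, so that a failure leaves nothing to
undo) can be sketched in userspace C. This is only an illustration under
stated assumptions: the demo_* names and the boolean lock flag are
hypothetical stand-ins, not the kernel's buffer_head API and not the code
in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffer_head on a page's circular buffer ring. */
struct demo_buffer {
	bool locked;
	struct demo_buffer *next;	/* circular list, last entry points at head */
};

/* Non-blocking lock attempt; returns true only if the lock was taken. */
static bool demo_trylock(struct demo_buffer *b)
{
	if (b->locked)
		return false;
	b->locked = true;
	return true;
}

static void demo_unlock(struct demo_buffer *b)
{
	b->locked = false;
}

/*
 * Try to lock every buffer in the ring without blocking. On the first
 * failure, release the locks taken so far and report failure.
 */
static bool demo_trylock_all(struct demo_buffer *head)
{
	struct demo_buffer *b = head;

	assert(head != NULL);
	do {
		if (!demo_trylock(b)) {
			struct demo_buffer *undo = head;

			while (undo != b) {
				demo_unlock(undo);
				undo = undo->next;
			}
			return false;
		}
		b = b->next;
	} while (b != head);

	return true;
}

/*
 * Async ordering: lock the buffers first, then commit the step that is
 * hard to undo. Returns 0 on success and -1 if the caller should give up
 * on this page for now (assumed convention for this sketch).
 */
static int demo_migrate_async(struct demo_buffer *head, bool *mapping_moved)
{
	if (head && !demo_trylock_all(head))
		return -1;	/* nothing has been changed yet */

	*mapping_moved = true;	/* point of no return in this sketch */
	return 0;
}
```

Because the lock attempt happens before the mapping is touched, the
failure path never needs to move the mapping back, which is the blocking
case the quoted comment is trying to avoid.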
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH 0/11] Reduce compaction-related stalls and improve asynchronous migration of dirty pages v6
2011-12-14 15:41 ` Mel Gorman
@ 2011-12-16 23:37 ` Andrew Morton
-1 siblings, 0 replies; 100+ messages in thread
From: Andrew Morton @ 2011-12-16 23:37 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Minchan Kim, Dave Jones, Jan Kara,
Andy Isaacson, Johannes Weiner, Rik van Riel, Nai Xia, Linux-MM,
LKML
On Wed, 14 Dec 2011 15:41:22 +0000
Mel Gorman <mgorman@suse.de> wrote:
> Short summary: There are severe stalls when a USB stick using VFAT
> is used with THP enabled that are reduced by this series. If you are
> experiencing this problem, please test and report back and considering
> I have seen complaints from openSUSE and Fedora users on this as well
> as a few private mails, I'm guessing it's a widespread issue. This
> is a new type of USB-related stall because it is due to synchronous
> compaction writing where as in the past the big problem was dirty
> pages reaching the end of the LRU and being written by reclaim.
Overall footprint:
fs/btrfs/disk-io.c | 5
fs/hugetlbfs/inode.c | 3
fs/nfs/internal.h | 2
fs/nfs/write.c | 4
include/linux/fs.h | 11 +-
include/linux/migrate.h | 23 +++-
include/linux/mmzone.h | 4
include/linux/vm_event_item.h | 1
mm/compaction.c | 5
mm/memory-failure.c | 2
mm/memory_hotplug.c | 2
mm/mempolicy.c | 2
mm/migrate.c | 171 +++++++++++++++++++++-----------
mm/page_alloc.c | 50 +++++++--
mm/swap.c | 74 ++++++++++++-
mm/vmscan.c | 114 ++++++++++++++++++---
mm/vmstat.c | 2
17 files changed, 371 insertions(+), 104 deletions(-)
The line count belies the increase in complexity.
Sigh, this whole hugetlb page thing is just killing us.
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH 05/11] mm: compaction: Determine if dirty pages can be migrated without blocking within ->migratepage
2011-12-16 23:20 ` Andrew Morton
@ 2011-12-17 3:03 ` Nai Xia
-1 siblings, 0 replies; 100+ messages in thread
From: Nai Xia @ 2011-12-17 3:03 UTC (permalink / raw)
To: Andrew Morton
Cc: Mel Gorman, Andrea Arcangeli, Minchan Kim, Dave Jones, Jan Kara,
Andy Isaacson, Johannes Weiner, Rik van Riel, Linux-MM, LKML
On Saturday 17 December 2011 07:20:54 Andrew Morton wrote:
> On Wed, 14 Dec 2011 15:41:27 +0000
> Mel Gorman <mgorman@suse.de> wrote:
>
> > Asynchronous compaction is used when allocating transparent hugepages
> > to avoid blocking for long periods of time. Due to reports of
> > stalling, there was a debate on disabling synchronous compaction
> > but this severely impacted allocation success rates. Part of the
> > reason was that many dirty pages are skipped in asynchronous compaction
> > by the following check;
> >
> > if (PageDirty(page) && !sync &&
> > mapping->a_ops->migratepage != migrate_page)
> > rc = -EBUSY;
> >
> > This skips over all mapping aops using buffer_migrate_page()
> > even though it is possible to migrate some of these pages without
> > blocking. This patch updates the ->migratepage callback with a "sync"
> > parameter. It is the responsibility of the callback to fail gracefully
> > if migration would block.
> >
> > ...
> >
> > @@ -259,6 +309,19 @@ static int migrate_page_move_mapping(struct address_space *mapping,
> > }
> >
> > /*
> > + * In the async migration case of moving a page with buffers, lock the
> > + * buffers using trylock before the mapping is moved. If the mapping
> > + * was moved, we later failed to lock the buffers and could not move
> > + * the mapping back due to an elevated page count, we would have to
> > + * block waiting on other references to be dropped.
> > + */
> > + if (!sync && head && !buffer_migrate_lock_buffers(head, sync)) {
>
> Once it has been established that "sync" is false, I find it clearer to
> pass in plain old "false" to buffer_migrate_lock_buffers(). Minor point.
>
>
>
> I hadn't paid a lot of attention to buffer_migrate_page() before.
> Scary function. I'm rather worried about its interactions with ext3
> journal commit which locks buffers then plays with them while leaving
> the page unlocked. How vigorously has this been whitebox-tested?
buffer_migrate_page() is done under the page lock and the buffer head locks.
I had assumed that anyone who has locked the buffer_heads should
also have a stable relationship between buffer_head <---> page;
otherwise, the buffer_head locking semantics would themselves be broken?
I am actually using similar logic for some other stuff, so
it will make me cry if it can really crash ext3....
Thanks,
Nai
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH 05/11] mm: compaction: Determine if dirty pages can be migrated without blocking within ->migratepage
2011-12-17 3:03 ` Nai Xia
@ 2011-12-17 3:26 ` Andrew Morton
-1 siblings, 0 replies; 100+ messages in thread
From: Andrew Morton @ 2011-12-17 3:26 UTC (permalink / raw)
To: nai.xia
Cc: Mel Gorman, Andrea Arcangeli, Minchan Kim, Dave Jones, Jan Kara,
Andy Isaacson, Johannes Weiner, Rik van Riel, Linux-MM, LKML
On Sat, 17 Dec 2011 11:03:01 +0800 Nai Xia <nai.xia@gmail.com> wrote:
> On Saturday 17 December 2011 07:20:54 Andrew Morton wrote:
> >
> > I hadn't paid a lot of attention to buffer_migrate_page() before.
> > Scary function. I'm rather worried about its interactions with ext3
> > journal commit which locks buffers then plays with them while leaving
> > the page unlocked. How vigorously has this been whitebox-tested?
>
> buffer_migrate_page() is done under the page lock and the buffer head locks.
>
> I had assumed that anyone who has locked the buffer_heads should
> also have a stable relationship between buffer_head <---> page;
> otherwise, the buffer_head locking semantics would themselves be broken?
>
> I am actually using similar logic for some other stuff, so
> it will make me cry if it can really crash ext3....
It's complicated ;) JBD attaches a journal_head to the buffer_head and
thereby greatly increases the amount of metadata in the buffer_head.
Locking the buffer_head isn't considered to have locked the
journal_head, although it might often work out that way.
I don't see anything in the journal_head which refers to the page
contents (b_committed_data points to a JBD-private copy of the data),
and buffer_migrate_page() moves the existing buffers to the new page,
rather than attaching new buffers to the new page.
We should check that the b_committed_data copy is taken under
lock_buffer() (surely true).
The core writeback code will initiate writeback against buffer_heads
and will then unlock the page. But in that case the buffer_heads are
locked and come unlocked after writeback has completed. So that should
be OK.
set_page_dirty() and friends can sometimes play with an unlocked page
and even unlocked buffers, from IRQ context iirc. If there are
problems around this, taking ->private_lock in buffer_migrate_page()
will help...
It's just ... scary. Whether there are gremlins in there (or in other
filesystems!) I just don't know.
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH 11/11] mm: Isolate pages for immediate reclaim on their own LRU
2011-12-14 15:41 ` Mel Gorman
@ 2011-12-17 16:08 ` Minchan Kim
-1 siblings, 0 replies; 100+ messages in thread
From: Minchan Kim @ 2011-12-17 16:08 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Andrea Arcangeli, Minchan Kim, Dave Jones,
Jan Kara, Andy Isaacson, Johannes Weiner, Rik van Riel, Nai Xia,
Linux-MM, LKML
On Wed, Dec 14, 2011 at 03:41:33PM +0000, Mel Gorman wrote:
> It was observed that scan rates from direct reclaim during tests
> writing to both fast and slow storage were extraordinarily high. The
> problem was that while pages were being marked for immediate reclaim
> when writeback completed, the same pages were being encountered over
> and over again during LRU scanning.
>
> This patch isolates file-backed pages that are to be reclaimed when
> clean on their own LRU list.
Please include your test results showing the reduction in CPU usage.
They would show how valuable this separate LRU list is.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
> include/linux/mmzone.h | 2 +
> include/linux/vm_event_item.h | 1 +
> mm/page_alloc.c | 5 ++-
> mm/swap.c | 74 ++++++++++++++++++++++++++++++++++++++---
> mm/vmscan.c | 11 ++++++
> mm/vmstat.c | 2 +
> 6 files changed, 89 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index ac5b522..80834eb 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -84,6 +84,7 @@ enum zone_stat_item {
> NR_ACTIVE_ANON, /* " " " " " */
> NR_INACTIVE_FILE, /* " " " " " */
> NR_ACTIVE_FILE, /* " " " " " */
> + NR_IMMEDIATE, /* " " " " " */
> NR_UNEVICTABLE, /* " " " " " */
> NR_MLOCK, /* mlock()ed pages found and moved off LRU */
> NR_ANON_PAGES, /* Mapped anonymous pages */
> @@ -136,6 +137,7 @@ enum lru_list {
> LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
> LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
> LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
> + LRU_IMMEDIATE,
> LRU_UNEVICTABLE,
> NR_LRU_LISTS
> };
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 03b90cdc..9696fda 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -36,6 +36,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
> KSWAPD_SKIP_CONGESTION_WAIT,
> PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> + PGRESCUED,
> #ifdef CONFIG_COMPACTION
> COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
> COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ecaba97..5cf9077 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2590,7 +2590,7 @@ void show_free_areas(unsigned int filter)
>
> printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n"
> " active_file:%lu inactive_file:%lu isolated_file:%lu\n"
> - " unevictable:%lu"
> + " immediate:%lu unevictable:%lu"
> " dirty:%lu writeback:%lu unstable:%lu\n"
> " free:%lu slab_reclaimable:%lu slab_unreclaimable:%lu\n"
> " mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n",
> @@ -2600,6 +2600,7 @@ void show_free_areas(unsigned int filter)
> global_page_state(NR_ACTIVE_FILE),
> global_page_state(NR_INACTIVE_FILE),
> global_page_state(NR_ISOLATED_FILE),
> + global_page_state(NR_IMMEDIATE),
> global_page_state(NR_UNEVICTABLE),
> global_page_state(NR_FILE_DIRTY),
> global_page_state(NR_WRITEBACK),
> @@ -2627,6 +2628,7 @@ void show_free_areas(unsigned int filter)
> " inactive_anon:%lukB"
> " active_file:%lukB"
> " inactive_file:%lukB"
> + " immediate:%lukB"
> " unevictable:%lukB"
> " isolated(anon):%lukB"
> " isolated(file):%lukB"
> @@ -2655,6 +2657,7 @@ void show_free_areas(unsigned int filter)
> K(zone_page_state(zone, NR_INACTIVE_ANON)),
> K(zone_page_state(zone, NR_ACTIVE_FILE)),
> K(zone_page_state(zone, NR_INACTIVE_FILE)),
> + K(zone_page_state(zone, NR_IMMEDIATE)),
> K(zone_page_state(zone, NR_UNEVICTABLE)),
> K(zone_page_state(zone, NR_ISOLATED_ANON)),
> K(zone_page_state(zone, NR_ISOLATED_FILE)),
> diff --git a/mm/swap.c b/mm/swap.c
> index a91caf7..9973975 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -39,6 +39,7 @@ int page_cluster;
>
> static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
> static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
> +static DEFINE_PER_CPU(struct pagevec, lru_putback_immediate_pvecs);
> static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
>
> /*
> @@ -255,24 +256,80 @@ static void pagevec_move_tail(struct pagevec *pvec)
> }
>
> /*
> + * Similar pair of functions to pagevec_move_tail except it is called when
> + * moving a page from the LRU_IMMEDIATE to one of the [in]active_[file|anon]
> + * lists
> + */
> +static void pagevec_putback_immediate_fn(struct page *page, void *arg)
> +{
> + struct zone *zone = page_zone(page);
> +
> + if (PageLRU(page)) {
> + enum lru_list lru = page_lru(page);
> + list_move(&page->lru, &zone->lru[lru].list);
> + }
> +}
> +
> +static void pagevec_putback_immediate(struct pagevec *pvec)
> +{
> + pagevec_lru_move_fn(pvec, pagevec_putback_immediate_fn, NULL);
> +}
> +
> +/*
> * Writeback is about to end against a page which has been marked for immediate
> * reclaim. If it still appears to be reclaimable, move it to the tail of the
> * inactive list.
> */
> void rotate_reclaimable_page(struct page *page)
> {
> + struct zone *zone = page_zone(page);
> + struct list_head *page_list;
> + struct pagevec *pvec;
> + unsigned long flags;
> +
> + page_cache_get(page);
> + local_irq_save(flags);
> + __mod_zone_page_state(zone, NR_IMMEDIATE, -1);
> +
I am not sure that underflow can never happen.
We do SetPageReclaim in several places but don't increase NR_IMMEDIATE.
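The underflow concern can be illustrated with a small userspace sketch.
Everything here is hypothetical (the demo_* types, the on_immediate flag
and the guarded variant are assumptions made for illustration, not part of
the patch): one path tags a page and bumps the counter, another path only
tags it, and an unconditional decrement on the rotate side then drives the
counter negative.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the zone counter and the page flags. */
struct demo_zone {
	long nr_immediate;
};

struct demo_page {
	bool reclaim;		/* models PageReclaim */
	bool on_immediate;	/* assumed extra state: page was accounted */
};

/* Path that tags the page and accounts for it. */
static void demo_mark_accounted(struct demo_zone *z, struct demo_page *p)
{
	p->reclaim = true;
	p->on_immediate = true;
	z->nr_immediate++;
}

/* Path that only tags the page, with no counter update. */
static void demo_mark_unaccounted(struct demo_page *p)
{
	p->reclaim = true;
}

/* Decrement on every rotation, whether or not the page was accounted. */
static void demo_rotate_unconditional(struct demo_zone *z, struct demo_page *p)
{
	z->nr_immediate--;
	p->reclaim = false;
	p->on_immediate = false;
}

/* Decrement only for pages that were actually accounted. */
static void demo_rotate_guarded(struct demo_zone *z, struct demo_page *p)
{
	if (p->on_immediate) {
		assert(z->nr_immediate > 0);
		z->nr_immediate--;
		p->on_immediate = false;
	}
	p->reclaim = false;
}
```

With an unaccounted page, demo_rotate_unconditional() leaves the counter
at -1, while demo_rotate_guarded() leaves it at 0. Whether the real code
can reach the unaccounted case is exactly the open question above.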
> if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
> !PageUnevictable(page) && PageLRU(page)) {
> - struct pagevec *pvec;
> - unsigned long flags;
>
> - page_cache_get(page);
> - local_irq_save(flags);
> pvec = &__get_cpu_var(lru_rotate_pvecs);
> if (!pagevec_add(pvec, page))
> pagevec_move_tail(pvec);
> - local_irq_restore(flags);
> + } else {
> + pvec = &__get_cpu_var(lru_putback_immediate_pvecs);
> + if (!pagevec_add(pvec, page))
> + pagevec_putback_immediate(pvec);
A nitpick about naming.
The name doesn't say whether the immediate list is the source or the
destination, so I was confused about which it is. I know the function
comment already says so, but a good name can reduce the need for comments.
How about pagevec_putback_from_immediate_list?
> + }
> +
> + /*
> + * There is a potential race that if a page is set PageReclaim
> + * and moved to the LRU_IMMEDIATE list after writeback completed,
> + * it can be left on the LRU_IMMEDATE list with no way for
> + * reclaim to find it.
> + *
> + * This race should be very rare but count how often it happens.
> + * If it is a continual race, then it's very unsatisfactory as there
> + * is no guarantee that rotate_reclaimable_page() will be called
> + * to rescue these pages but finding them in page reclaim is also
> + * problematic due to the problem of deciding when the right time
> + * to scan this list is.
> + */
> + page_list = &zone->lru[LRU_IMMEDIATE].list;
> + if (!zone_page_state(zone, NR_IMMEDIATE) && !list_empty(page_list)) {
How about this?
	if (zone_page_state(zone, NR_IMMEDIATE)) {
		page_list = &zone->lru[LRU_IMMEDIATE].list;
		if (!list_empty(page_list))
			...
		...
	}
It avoids an unnecessary reference.
> + struct page *page;
> +
> + spin_lock(&zone->lru_lock);
> + while (!list_empty(page_list)) {
> + page = list_entry(page_list->prev, struct page, lru);
> + list_move(&page->lru, &zone->lru[page_lru(page)].list);
> + __count_vm_event(PGRESCUED);
> + }
> + spin_unlock(&zone->lru_lock);
> }
> +
> + local_irq_restore(flags);
> }
>
> static void update_page_reclaim_stat(struct zone *zone, struct page *page,
> @@ -475,6 +532,13 @@ static void lru_deactivate_fn(struct page *page, void *arg)
> * is _really_ small and it's non-critical problem.
> */
> SetPageReclaim(page);
> +
> + /*
> + * Move to the LRU_IMMEDIATE list to avoid being scanned
> + * by page reclaim uselessly.
> + */
> + list_move_tail(&page->lru, &zone->lru[LRU_IMMEDIATE].list);
> + __mod_zone_page_state(zone, NR_IMMEDIATE, 1);
This makes the PGDEACTIVATE count below wrong in lru_deactivate_fn.
Before this patch, every move was from active to inactive, so it was right.
But with this patch, a move can be from active to immediate.
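One possible way to keep the event meaningful, sketched in userspace C
under stated assumptions, is to count a deactivation only when the page
really lands on the inactive list. The demo_* names and the two-counter
struct are hypothetical; this is not what the patch does and not
necessarily how it should be fixed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical destinations for a page leaving the active list. */
enum demo_list { DEMO_INACTIVE, DEMO_IMMEDIATE };

struct demo_counters {
	unsigned long pgdeactivate;	/* active -> inactive moves only */
	unsigned long nr_immediate;	/* pages parked on the immediate list */
};

/*
 * Pick the destination list and update only the counter that matches it.
 * A page still under writeback is parked on the immediate list and is
 * not counted as a deactivation.
 */
static enum demo_list demo_deactivate(struct demo_counters *c,
				      bool was_active, bool under_writeback)
{
	assert(c != NULL);

	if (under_writeback) {
		c->nr_immediate++;
		return DEMO_IMMEDIATE;
	}

	if (was_active)
		c->pgdeactivate++;
	return DEMO_INACTIVE;
}
```

An alternative would be to keep counting both cases and document that
PGDEACTIVATE now means "left the active list".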
> } else {
> /*
> * The page's writeback ends up during pagevec
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 298ceb8..cb28a07 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1404,6 +1404,17 @@ putback_lru_pages(struct zone *zone, struct scan_control *sc,
> }
> SetPageLRU(page);
> lru = page_lru(page);
> +
> + /*
> + * If reclaim has tagged a file page reclaim, move it to
> + * a separate LRU lists to avoid it being scanned by other
> + * users. It is expected that as writeback completes that
> + * they are taken back off and moved to the normal LRU
> + */
> + if (lru == LRU_INACTIVE_FILE &&
> + PageReclaim(page) && PageWriteback(page))
> + lru = LRU_IMMEDIATE;
> +
> add_page_to_lru_list(zone, page, lru);
> if (is_active_lru(lru)) {
> int file = is_file_lru(lru);
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 8fd603b..dbfec4c 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -688,6 +688,7 @@ const char * const vmstat_text[] = {
> "nr_active_anon",
> "nr_inactive_file",
> "nr_active_file",
> + "nr_immediate",
> "nr_unevictable",
> "nr_mlock",
> "nr_anon_pages",
> @@ -756,6 +757,7 @@ const char * const vmstat_text[] = {
> "allocstall",
>
> "pgrotated",
> + "pgrescued",
>
> #ifdef CONFIG_COMPACTION
> "compact_blocks_moved",
> --
> 1.7.3.4
>
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH 06/11] mm: compaction: make isolate_lru_page() filter-aware again
2011-12-14 15:41 ` Mel Gorman
@ 2011-12-18 1:53 ` Minchan Kim
-1 siblings, 0 replies; 100+ messages in thread
From: Minchan Kim @ 2011-12-18 1:53 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Andrea Arcangeli, Minchan Kim, Dave Jones,
Jan Kara, Andy Isaacson, Johannes Weiner, Rik van Riel, Nai Xia,
Linux-MM, LKML
On Wed, Dec 14, 2011 at 03:41:28PM +0000, Mel Gorman wrote:
> Commit [39deaf85: mm: compaction: make isolate_lru_page() filter-aware]
> noted that compaction does not migrate dirty or writeback pages and
> that it was meaningless to pick the page and re-add it to the LRU list.
> This had to be partially reverted because some dirty pages can be
> migrated by compaction without blocking.
>
> This patch updates "mm: compaction: make isolate_lru_page" by skipping
> over pages that migration has no possibility of migrating to minimise
> LRU disruption.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan@kernel.org>
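A rough sketch of the idea in the changelog quoted above: skip pages that
migration has no chance of moving, rather than isolating them and putting
them straight back. The struct, the can_migrate_dirty flag and the toy_*
helpers are assumptions for illustration, not the kernel's actual checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page model; the flags are illustrative assumptions. */
struct toy_page {
	bool dirty;
	bool writeback;
	bool can_migrate_dirty;	/* assumed: mapping can move dirty pages without blocking */
};

/* Return true if a non-blocking migration pass should isolate this page. */
static bool toy_worth_isolating(const struct toy_page *page)
{
	assert(page != NULL);

	if (page->writeback)
		return false;	/* would have to wait for I/O */
	if (page->dirty && !page->can_migrate_dirty)
		return false;	/* migration could not move it anyway */
	return true;
}

/* Count the pages a scan would isolate; skipped pages stay on their LRU. */
static size_t toy_count_candidates(const struct toy_page *pages, size_t n)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (toy_worth_isolating(&pages[i]))
			count++;
	return count;
}
```

Pages rejected by the filter are never removed from their list in this
model, which is the LRU disruption the changelog wants to minimise.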
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH 08/11] mm: compaction: Introduce sync-light migration for use by compaction
2011-12-14 15:41 ` Mel Gorman
@ 2011-12-18 2:05 ` Minchan Kim
-1 siblings, 0 replies; 100+ messages in thread
From: Minchan Kim @ 2011-12-18 2:05 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Andrea Arcangeli, Minchan Kim, Dave Jones,
Jan Kara, Andy Isaacson, Johannes Weiner, Rik van Riel, Nai Xia,
Linux-MM, LKML
On Wed, Dec 14, 2011 at 03:41:30PM +0000, Mel Gorman wrote:
> This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
> mode that avoids writing back pages to backing storage. Async
> compaction maps to MIGRATE_ASYNC while sync compaction maps to
> MIGRATE_SYNC_LIGHT. For other migrate_pages users such as memory
> hotplug, MIGRATE_SYNC is used.
>
> This avoids sync compaction stalling for an excessive length of time,
> particularly when copying files to a USB stick where there might be
> a large number of dirty pages backed by a filesystem that does not
> support ->writepages.
>
> [aarcange@redhat.com: This patch is heavily based on Andrea's work]
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Minchan Kim <minchan@kernel.org>
> ---
> fs/btrfs/disk-io.c | 3 +-
> fs/hugetlbfs/inode.c | 2 +-
> fs/nfs/internal.h | 2 +-
> fs/nfs/write.c | 2 +-
> include/linux/fs.h | 6 ++-
> include/linux/migrate.h | 23 +++++++++++---
> mm/compaction.c | 2 +-
> mm/memory-failure.c | 2 +-
> mm/memory_hotplug.c | 2 +-
> mm/mempolicy.c | 2 +-
> mm/migrate.c | 78 ++++++++++++++++++++++++++---------------------
> 11 files changed, 74 insertions(+), 50 deletions(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 896b87a..dbe9518 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -872,7 +872,8 @@ static int btree_submit_bio_hook(struct inode *inode, int rw, struct bio *bio,
>
> #ifdef CONFIG_MIGRATION
> static int btree_migratepage(struct address_space *mapping,
> - struct page *newpage, struct page *page, bool sync)
> + struct page *newpage, struct page *page,
> + enum migrate_mode sync)
> {
> /*
> * we can't safely write a btree page from here,
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 10b9883..6b80537 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -577,7 +577,7 @@ static int hugetlbfs_set_page_dirty(struct page *page)
>
> static int hugetlbfs_migrate_page(struct address_space *mapping,
> struct page *newpage, struct page *page,
> - bool sync)
> + enum migrate_mode mode)
Nitpick: everywhere except here, the parameter is declared as "enum migrate_mode sync".
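For reference, the three modes described in the changelog quoted above can
be sketched as follows. The enum and constant names come from the patch
description; the ordering of the constants, the toy_caller enum and both
helpers are assumptions made only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Names taken from the changelog; the ordering is an assumption. */
enum migrate_mode {
	MIGRATE_ASYNC,		/* never block */
	MIGRATE_SYNC_LIGHT,	/* may block, but does not write pages back */
	MIGRATE_SYNC		/* may block and may write pages back */
};

/* Hypothetical callers, named after the users listed in the changelog. */
enum toy_caller {
	TOY_ASYNC_COMPACTION,
	TOY_SYNC_COMPACTION,
	TOY_MEMORY_HOTPLUG
};

/* Map each caller to the mode the changelog says it uses. */
static enum migrate_mode toy_mode_for(enum toy_caller caller)
{
	switch (caller) {
	case TOY_ASYNC_COMPACTION:
		return MIGRATE_ASYNC;
	case TOY_SYNC_COMPACTION:
		return MIGRATE_SYNC_LIGHT;
	case TOY_MEMORY_HOTPLUG:
		return MIGRATE_SYNC;
	}
	assert(!"unknown caller");
	return MIGRATE_SYNC;
}

/* Only full sync migration may write pages back to backing storage. */
static bool toy_may_writeback(enum migrate_mode mode)
{
	return mode == MIGRATE_SYNC;
}
```

A three-valued enum is also why a parameter name like "sync" reads oddly:
it no longer carries a yes/no meaning.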
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH 05/11] mm: compaction: Determine if dirty pages can be migrated without blocking within ->migratepage
2011-12-16 23:20 ` Andrew Morton
@ 2011-12-19 11:05 ` Mel Gorman
-1 siblings, 0 replies; 100+ messages in thread
From: Mel Gorman @ 2011-12-19 11:05 UTC (permalink / raw)
To: Andrew Morton
Cc: Andrea Arcangeli, Minchan Kim, Dave Jones, Jan Kara,
Andy Isaacson, Johannes Weiner, Rik van Riel, Nai Xia, Linux-MM,
LKML
On Fri, Dec 16, 2011 at 03:20:54PM -0800, Andrew Morton wrote:
> On Wed, 14 Dec 2011 15:41:27 +0000
> Mel Gorman <mgorman@suse.de> wrote:
>
> > Asynchronous compaction is used when allocating transparent hugepages
> > to avoid blocking for long periods of time. Due to reports of
> > stalling, there was a debate on disabling synchronous compaction
> > but this severely impacted allocation success rates. Part of the
> > reason was that many dirty pages are skipped in asynchronous compaction
> > by the following check;
> >
> > if (PageDirty(page) && !sync &&
> > mapping->a_ops->migratepage != migrate_page)
> > rc = -EBUSY;
> >
> > This skips over all mapping aops using buffer_migrate_page()
> > even though it is possible to migrate some of these pages without
> > blocking. This patch updates the ->migratepage callback with a "sync"
> > parameter. It is the responsibility of the callback to fail gracefully
> > if migration would block.
> >
> > ...
> >
> > @@ -259,6 +309,19 @@ static int migrate_page_move_mapping(struct address_space *mapping,
> > }
> >
> > /*
> > + * In the async migration case of moving a page with buffers, lock the
> > + * buffers using trylock before the mapping is moved. If the mapping
> > + * was moved, we later failed to lock the buffers and could not move
> > + * the mapping back due to an elevated page count, we would have to
> > + * block waiting on other references to be dropped.
> > + */
> > + if (!sync && head && !buffer_migrate_lock_buffers(head, sync)) {
>
> Once it has been established that "sync" is true, I find it clearer to
> pass in plain old "true" to buffer_migrate_lock_buffers(). Minor point.
>
Later in the series, sync changes to "mode" to distinguish between
async, sync-light and sync compaction. At that point, this becomes
if (mode == MIGRATE_ASYNC && head &&
!buffer_migrate_lock_buffers(head, mode)) {
Passing true in here would be fine, but it would just end up being
changed back later in the series so it can be left alone.
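To illustrate the async case, here is a minimal userspace sketch of the
"trylock every buffer, back out on failure" pattern. It is not the kernel's
buffer_migrate_lock_buffers(); the toy_buffer type and helpers are
hypothetical, and the blocking path used by the sync modes is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical buffer with a single lock bit. */
struct toy_buffer {
	bool locked;
};

/* Take the lock only if it is free; never wait. */
static bool toy_trylock(struct toy_buffer *b)
{
	if (b->locked)
		return false;
	b->locked = true;
	return true;
}

/*
 * Try to lock all n buffers without blocking. On the first failure, drop
 * the locks taken so far so the caller can bail out with nothing held.
 */
static bool toy_trylock_all(struct toy_buffer *bufs, size_t n)
{
	size_t i;

	assert(bufs != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (!toy_trylock(&bufs[i])) {
			/* Undo only the locks this call acquired. */
			while (i-- > 0)
				bufs[i].locked = false;
			return false;
		}
	}
	return true;
}
```

Doing this before the mapping is moved means a failed trylock leaves
nothing to undo, which is the point the patch comment quoted above makes.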
> I hadn't paid a lot of attention to buffer_migrate_page() before.
> Scary function. I'm rather worried about its interactions with ext3
> journal commit which locks buffers then plays with them while leaving
> the page unlocked. How vigorously has this been whitebox-tested?
>
Blackbox testing only AFAIK. This has been tested recently with ext3
and nothing unusual was reported. The list of events for migration
looks like
isolate page from LRU
migrate_pages
  unmap_and_move
    lock_page(src_page)
    if page under writeback, either bail or wait on writeback
    try_to_unmap
    move_to_new_page
      lock_page(dst_page)
      buffer_migrate_page
        migrate_page_move_mapping
          spin_lock_irq(&mapping->tree_lock)
          lookup in radix tree
          check reference counts to make sure no one else has references
          lock buffers if async mode
          replace page in radix tree with new page
          spin_unlock_irq
        lock buffers if !async mode
        copy buffers
        unlock buffers
      unlock_page(dst_page)
The critical part is that the copying of buffer data happens with
both page and buffer locks held and while no other references to the page
exist - it has already been unmapped, for example.
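The ordering above can be modelled roughly as below. This is a toy,
single-threaded sketch: the toy_* types, the result codes and the way
"waiting" is modelled are all assumptions, and none of it is the kernel's
implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_mode { TOY_ASYNC, TOY_SYNC };

/* Hypothetical outcomes; the kernel returns errno values instead. */
enum toy_result {
	TOY_MIGRATED,
	TOY_BAIL_WRITEBACK,
	TOY_BAIL_REFS,
	TOY_BAIL_BUFFERS
};

struct toy_page {
	bool writeback;		/* page is under writeback */
	int extra_refs;		/* references beyond those migration expects */
	bool buffers_contended;	/* someone else holds a buffer lock */
};

static enum toy_result toy_migrate(struct toy_page *p, enum toy_mode mode)
{
	assert(p != NULL);

	/* lock_page(src_page) */
	if (p->writeback) {
		if (mode == TOY_ASYNC)
			return TOY_BAIL_WRITEBACK;
		p->writeback = false;	/* sync: wait for writeback to end */
	}

	/* try_to_unmap, lock_page(dst_page), take tree_lock */
	if (p->extra_refs > 0)
		return TOY_BAIL_REFS;

	/* async: trylock buffers while the mapping is still untouched */
	if (mode == TOY_ASYNC && p->buffers_contended)
		return TOY_BAIL_BUFFERS;

	/* replace the radix tree entry, drop tree_lock */
	if (mode == TOY_SYNC)
		p->buffers_contended = false;	/* blocking lock outwaits the holder */

	/* copy buffers, unlock buffers, unlock pages */
	return TOY_MIGRATED;
}
```

Every bail-out in the async path happens before the radix tree entry is
replaced, so there is never a moved mapping to roll back.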
Journal commit minimally acquires the buffer lock. If migration is
in the process of copying the buffers, the buffer lock will prevent
journal commit starting at the same time buffers are being copied.
block_write_full_page and friends should be taking the buffer lock so
they should also be ok.
For other accessors, the mapping tree_lock should prevent other users
looking up the page in the radix tree in the first place while the radix
tree replacement is taking place.
Racing against try_to_free_buffers should also not be a problem.
According to buffer.c, exclusion from try_to_free_buffers "may
be obtained by either locking the page or holding the mapping's
private_lock". Migration is holding the page lock.
Taking private_lock would give additional protection but I haven't heard
or seen a case where it is necessary.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH 08/11] mm: compaction: Introduce sync-light migration for use by compaction
2011-12-18 2:05 ` Minchan Kim
@ 2011-12-19 11:45 ` Mel Gorman
-1 siblings, 0 replies; 100+ messages in thread
From: Mel Gorman @ 2011-12-19 11:45 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, Andrea Arcangeli, Minchan Kim, Dave Jones,
Jan Kara, Andy Isaacson, Johannes Weiner, Rik van Riel, Nai Xia,
Linux-MM, LKML
On Sun, Dec 18, 2011 at 11:05:52AM +0900, Minchan Kim wrote:
> On Wed, Dec 14, 2011 at 03:41:30PM +0000, Mel Gorman wrote:
> > This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
> > mode that avoids writing back pages to backing storage. Async
> > compaction maps to MIGRATE_ASYNC while sync compaction maps to
> > MIGRATE_SYNC_LIGHT. For other migrate_pages users such as memory
> > hotplug, MIGRATE_SYNC is used.
> >
> > This avoids sync compaction stalling for an excessive length of time,
> > particularly when copying files to a USB stick where there might be
> > a large number of dirty pages backed by a filesystem that does not
> > support ->writepages.
> >
> > [aarcange@redhat.com: This patch is heavily based on Andrea's work]
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
>
> Acked-by: Minchan Kim <minchan@kernel.org>
>
Thanks.
> > <SNIP>
> > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > index 10b9883..6b80537 100644
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -577,7 +577,7 @@ static int hugetlbfs_set_page_dirty(struct page *page)
> >
> > static int hugetlbfs_migrate_page(struct address_space *mapping,
> > struct page *newpage, struct page *page,
> > - bool sync)
> > + enum migrate_mode mode)
>
> > Nitpick: everywhere except here, the parameter is declared as "enum migrate_mode sync".
>
Actually, in all the core code, I used "mode" but I was inconsistent in
the headers and some of the filesystems. I should have converted every
use of "sync", which was a boolean, to "mode", which has three possible
values after this patch.
==== CUT HERE ====
mm: compaction: Introduce sync-light migration for use by compaction fix
Consistently name enum migrate_mode parameters "mode" instead of "sync".
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
fs/btrfs/disk-io.c | 2 +-
fs/nfs/write.c | 2 +-
include/linux/migrate.h | 8 ++++----
3 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index dbe9518..ff45cdf 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -873,7 +873,7 @@ static int btree_submit_bio_hook(struct inode *inode, int rw, struct bio *bio,
#ifdef CONFIG_MIGRATION
static int btree_migratepage(struct address_space *mapping,
struct page *newpage, struct page *page,
- enum migrate_mode sync)
+ enum migrate_mode mode)
{
/*
* we can't safely write a btree page from here,
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index adb87d9..1f4f18f9 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1711,7 +1711,7 @@ out_error:
#ifdef CONFIG_MIGRATION
int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
- struct page *page, enum migrate_mode sync)
+ struct page *page, enum migrate_mode mode)
{
/*
* If PagePrivate is set, then the page is currently associated with
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 775787c..eaf8674 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -27,10 +27,10 @@ extern int migrate_page(struct address_space *,
struct page *, struct page *, enum migrate_mode);
extern int migrate_pages(struct list_head *l, new_page_t x,
unsigned long private, bool offlining,
- enum migrate_mode sync);
+ enum migrate_mode mode);
extern int migrate_huge_pages(struct list_head *l, new_page_t x,
unsigned long private, bool offlining,
- enum migrate_mode sync);
+ enum migrate_mode mode);
extern int fail_migrate_page(struct address_space *,
struct page *, struct page *);
@@ -49,10 +49,10 @@ extern int migrate_huge_page_move_mapping(struct address_space *mapping,
static inline void putback_lru_pages(struct list_head *l) {}
static inline int migrate_pages(struct list_head *l, new_page_t x,
unsigned long private, bool offlining,
- enum migrate_mode sync) { return -ENOSYS; }
+ enum migrate_mode mode) { return -ENOSYS; }
static inline int migrate_huge_pages(struct list_head *l, new_page_t x,
unsigned long private, bool offlining,
- enum migrate_mode sync) { return -ENOSYS; }
+ enum migrate_mode mode) { return -ENOSYS; }
static inline int migrate_prep(void) { return -ENOSYS; }
static inline int migrate_prep_local(void) { return -ENOSYS; }
^ permalink raw reply related [flat|nested] 100+ messages in thread
* Re: [PATCH 05/11] mm: compaction: Determine if dirty pages can be migrated without blocking within ->migratepage
2011-12-19 11:05 ` Mel Gorman
@ 2011-12-19 13:12 ` nai.xia
-1 siblings, 0 replies; 100+ messages in thread
From: nai.xia @ 2011-12-19 13:12 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Andrea Arcangeli, Minchan Kim, Dave Jones,
Jan Kara, Andy Isaacson, Johannes Weiner, Rik van Riel, Linux-MM,
LKML
On 2011-12-19 19:05, Mel Gorman wrote:
> On Fri, Dec 16, 2011 at 03:20:54PM -0800, Andrew Morton wrote:
>> On Wed, 14 Dec 2011 15:41:27 +0000
>> Mel Gorman<mgorman@suse.de> wrote:
>>
>>> Asynchronous compaction is used when allocating transparent hugepages
>>> to avoid blocking for long periods of time. Due to reports of
>>> stalling, there was a debate on disabling synchronous compaction
>>> but this severely impacted allocation success rates. Part of the
>>> reason was that many dirty pages are skipped in asynchronous compaction
>>> by the following check;
>>>
>>> if (PageDirty(page) && !sync &&
>>> mapping->a_ops->migratepage != migrate_page)
>>> rc = -EBUSY;
>>>
>>> This skips over all mapping aops using buffer_migrate_page()
>>> even though it is possible to migrate some of these pages without
>>> blocking. This patch updates the ->migratepage callback with a "sync"
>>> parameter. It is the responsibility of the callback to fail gracefully
>>> if migration would block.
>>>
>>> ...
>>>
>>> @@ -259,6 +309,19 @@ static int migrate_page_move_mapping(struct address_space *mapping,
>>> }
>>>
>>> /*
>>> + * In the async migration case of moving a page with buffers, lock the
>>> + * buffers using trylock before the mapping is moved. If the mapping
>>> + * was moved, we later failed to lock the buffers and could not move
>>> + * the mapping back due to an elevated page count, we would have to
>>> + * block waiting on other references to be dropped.
>>> + */
>>> + if (!sync && head && !buffer_migrate_lock_buffers(head, sync)) {
>>
>> Once it has been established that "sync" is true, I find it clearer to
>> pass in plain old "true" to buffer_migrate_lock_buffers(). Minor point.
>>
>
> Later in the series, sync changes to "mode" to distinguish between
> async, sync-light and sync compaction. At that point, this becomes
>
> if (mode == MIGRATE_ASYNC && head &&
> !buffer_migrate_lock_buffers(head, mode)) {
>
> Passing true in here would be fine, but it would just end up being
> changed back later in the series so it can be left alone.
>
>> I hadn't paid a lot of attention to buffer_migrate_page() before.
>> Scary function. I'm rather worried about its interactions with ext3
>> journal commit which locks buffers then plays with them while leaving
>> the page unlocked. How vigorously has this been whitebox-tested?
>>
>
> Blackbox testing only AFAIK. This has been tested recently with ext3
> and nothing unusual was reported. The list of events for migration
> looks like
>
> isolate page from LRU
> migrate_pages
> unmap_and_move
> lock_page(src_page)
> if page under writeback, either bail or wait on writeback
> try_to_unmap
> move_to_new_page
> lock_page(dst_page)
> buffer_migrate_page
> migrate_page_move_mapping
> spin_lock_irq(&mapping->tree_lock)
> lookup in radix tree
> check reference counts to make sure no one else has references
> lock buffers if async mode
> replace page in radix tree with new page
> spin_unlock_irq
> lock buffers if !async mode
> copy buffers
> unlock buffers
> unlock_page(dst_page)
>
> The critical part is that the copying of buffer data is happening with
> both page and buffer locks held and no other references to the page
> exists - it has already been unmapped for example.
>
> Journal commit minimally acquires the buffer lock. If migration is
> in the process of copying the buffers, the buffer lock will prevent
> journal commit starting at the same time buffers are being copied.
>
> block_write_full_page and friends should be taking the buffer lock so
> they should also be ok.
>
> For other accessors, the mapping tree_lock should prevent other users
> looking up the page in the radix tree in the first place while the radix
> tree replacement is taking place.
>
> Racing against try_to_free_buffers should also not be a problem.
> According to buffer.c, exclusion from try_to_free_buffers "may
> be obtained by either locking the page or holding the mapping's
> private_lock". Migration is holding the page lock.
>
> Taking private_lock would give additional protection but I haven't heard
> or seen a case where it is necessary.
>
Checking path by path that there is no risk is good, but maybe it is
time to define an explicit locking protocol here. I think the only possible
threat is that we changed the buffer head ==> page relationship. Before
buffer_migrate_page() existed, the weak assumption that "if a bh is
valid then the page it points to is also valid, even without locking"
simply held, although, as you said above, nobody seems to really rely
on it.
That weak assumption is no longer true, so maybe it is good to document
the rule explicitly, like this:
Anyone who wants to reference a page should either take a reference
directly with get_page() or, when going through the buffer heads to
reach the page, take at least the buffer lock.
If there really are "gremlins" somewhere, now or in the future, just burn
them under the supreme holy light of buffer locks!
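As an illustration of the trylock-before-move rule quoted above, here is
a minimal user-space sketch. It is not the kernel code: struct fake_bh,
the fake_* helpers and the use of a C11 atomic_flag as the lock bit are
all assumptions made for the example. Only the three mode names come
from the series. In async mode every buffer is taken with trylock and,
if one is busy, the locks already taken are dropped and the caller is
told to give up. In the other modes the sketch simply spins where a
real implementation would sleep.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Mode names as used in the series; everything below is hypothetical. */
enum migrate_mode { MIGRATE_ASYNC, MIGRATE_SYNC_LIGHT, MIGRATE_SYNC };

/* Stand-in for a buffer head: one lock bit plus a link to the next buffer. */
struct fake_bh {
	atomic_flag locked;
	struct fake_bh *next;	/* NULL terminates the chain */
};

static void fake_bh_init(struct fake_bh *bh, struct fake_bh *next)
{
	atomic_flag_clear(&bh->locked);
	bh->next = next;
}

/* Returns true if the lock bit was clear and is now owned by the caller. */
static bool fake_trylock(struct fake_bh *bh)
{
	return !atomic_flag_test_and_set(&bh->locked);
}

static void fake_unlock(struct fake_bh *bh)
{
	atomic_flag_clear(&bh->locked);
}

/*
 * Lock every buffer in the chain. Async callers must never wait, so a
 * single busy buffer makes the whole attempt fail and any locks already
 * taken are released before returning false.
 */
static bool fake_lock_buffers(struct fake_bh *head, enum migrate_mode mode)
{
	struct fake_bh *bh, *undo;

	for (bh = head; bh; bh = bh->next) {
		if (mode != MIGRATE_ASYNC) {
			while (!fake_trylock(bh))
				;	/* a real implementation would sleep here */
			continue;
		}
		if (!fake_trylock(bh)) {
			for (undo = head; undo != bh; undo = undo->next)
				fake_unlock(undo);
			return false;
		}
	}
	return true;
}

static void fake_unlock_buffers(struct fake_bh *head)
{
	struct fake_bh *bh;

	for (bh = head; bh; bh = bh->next)
		fake_unlock(bh);
}
```

With a two-buffer chain whose second buffer is already held by someone
else (think of journal commit), fake_lock_buffers(head, MIGRATE_ASYNC)
returns false and leaves the first buffer unlocked, so nothing has to
be undone after the mapping has been moved.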
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH 11/11] mm: Isolate pages for immediate reclaim on their own LRU
2011-12-17 16:08 ` Minchan Kim
@ 2011-12-19 13:26 ` Mel Gorman
-1 siblings, 0 replies; 100+ messages in thread
From: Mel Gorman @ 2011-12-19 13:26 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, Andrea Arcangeli, Minchan Kim, Dave Jones,
Jan Kara, Andy Isaacson, Johannes Weiner, Rik van Riel, Nai Xia,
Linux-MM, LKML
On Sun, Dec 18, 2011 at 01:08:22AM +0900, Minchan Kim wrote:
> On Wed, Dec 14, 2011 at 03:41:33PM +0000, Mel Gorman wrote:
> > It was observed that scan rates from direct reclaim during tests
> > writing to both fast and slow storage were extraordinarily high. The
> > problem was that while pages were being marked for immediate reclaim
> > when writeback completed, the same pages were being encountered over
> > and over again during LRU scanning.
> >
> > This patch isolates file-backed pages that are to be reclaimed when
> > clean on their own LRU list.
>
> Please include your test results showing the reduction in CPU usage.
> They would show how valuable this separate LRU list is.
>
It's in the leader. The writebackCPDevicevfat tests show that System
CPU goes from 46.40 seconds to 4.44 seconds with this patch applied.
> > <SNIP>
> >
> > diff --git a/mm/swap.c b/mm/swap.c
> > index a91caf7..9973975 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -39,6 +39,7 @@ int page_cluster;
> >
> > static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
> > static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
> > +static DEFINE_PER_CPU(struct pagevec, lru_putback_immediate_pvecs);
> > static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
> >
> > /*
> > @@ -255,24 +256,80 @@ static void pagevec_move_tail(struct pagevec *pvec)
> > }
> >
> > /*
> > + * Similar pair of functions to pagevec_move_tail except it is called when
> > + * moving a page from the LRU_IMMEDIATE to one of the [in]active_[file|anon]
> > + * lists
> > + */
> > +static void pagevec_putback_immediate_fn(struct page *page, void *arg)
> > +{
> > + struct zone *zone = page_zone(page);
> > +
> > + if (PageLRU(page)) {
> > + enum lru_list lru = page_lru(page);
> > + list_move(&page->lru, &zone->lru[lru].list);
> > + }
> > +}
> > +
> > +static void pagevec_putback_immediate(struct pagevec *pvec)
> > +{
> > + pagevec_lru_move_fn(pvec, pagevec_putback_immediate_fn, NULL);
> > +}
> > +
> > +/*
> > * Writeback is about to end against a page which has been marked for immediate
> > * reclaim. If it still appears to be reclaimable, move it to the tail of the
> > * inactive list.
> > */
> > void rotate_reclaimable_page(struct page *page)
> > {
> > + struct zone *zone = page_zone(page);
> > + struct list_head *page_list;
> > + struct pagevec *pvec;
> > + unsigned long flags;
> > +
> > + page_cache_get(page);
> > + local_irq_save(flags);
> > + __mod_zone_page_state(zone, NR_IMMEDIATE, -1);
> > +
>
> I am not sure that underflow can never happen.
> We do SetPageReclaim in several places but don't increase NR_IMMEDIATE.
>
In those cases, we do not move the page to the immediate list either.
During one test I was recording /proc/vmstat every 10 seconds and never
saw an underflow.
> > if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
> > !PageUnevictable(page) && PageLRU(page)) {
> > - struct pagevec *pvec;
> > - unsigned long flags;
> >
> > - page_cache_get(page);
> > - local_irq_save(flags);
> > pvec = &__get_cpu_var(lru_rotate_pvecs);
> > if (!pagevec_add(pvec, page))
> > pagevec_move_tail(pvec);
> > - local_irq_restore(flags);
> > + } else {
> > + pvec = &__get_cpu_var(lru_putback_immediate_pvecs);
> > + if (!pagevec_add(pvec, page))
> > + pagevec_putback_immediate(pvec);
>
> Nitpick about naming.
Naming is important.
> It doesn't say whether immediate is the source or the destination, so
> I got confused. I know the function comment already says it, but good
> naming can reduce unnecessary comments.
> How about pagevec_putback_from_immediate_list?
>
Sure. Done.
> > + }
> > +
> > + /*
> > + * There is a potential race that if a page is set PageReclaim
> > + * and moved to the LRU_IMMEDIATE list after writeback completed,
> > + * it can be left on the LRU_IMMEDATE list with no way for
> > + * reclaim to find it.
> > + *
> > + * This race should be very rare but count how often it happens.
> > + * If it is a continual race, then it's very unsatisfactory as there
> > + * is no guarantee that rotate_reclaimable_page() will be called
> > + * to rescue these pages but finding them in page reclaim is also
> > + * problematic due to the problem of deciding when the right time
> > + * to scan this list is.
> > + */
> > + page_list = &zone->lru[LRU_IMMEDIATE].list;
> > + if (!zone_page_state(zone, NR_IMMEDIATE) && !list_empty(page_list)) {
>
> How about this
>
> if (zone_page_state(zone, NR_IMMEDIATE)) {
> page_list = &zone->lru[LRU_IMMEDIATE].list;
> if (!list_empty(page_list))
> ...
> ...
> }
>
> It can avoid an unnecessary reference.
>
Ok, it mucks up the indentation a bit but with some renaming it looks
reasonable.
> > + struct page *page;
> > +
> > + spin_lock(&zone->lru_lock);
> > + while (!list_empty(page_list)) {
> > + page = list_entry(page_list->prev, struct page, lru);
> > + list_move(&page->lru, &zone->lru[page_lru(page)].list);
> > + __count_vm_event(PGRESCUED);
> > + }
> > + spin_unlock(&zone->lru_lock);
> > }
> > +
> > + local_irq_restore(flags);
> > }
> >
> > static void update_page_reclaim_stat(struct zone *zone, struct page *page,
> > @@ -475,6 +532,13 @@ static void lru_deactivate_fn(struct page *page, void *arg)
> > * is _really_ small and it's non-critical problem.
> > */
> > SetPageReclaim(page);
> > +
> > + /*
> > + * Move to the LRU_IMMEDIATE list to avoid being scanned
> > + * by page reclaim uselessly.
> > + */
> > + list_move_tail(&page->lru, &zone->lru[LRU_IMMEDIATE].list);
> > + __mod_zone_page_state(zone, NR_IMMEDIATE, 1);
>
> It makes the count of PGDEACTIVATE below wrong in lru_deactivate_fn.
> Before this patch, every move was from active to inactive so it was right.
> But with this patch, it can be from active to immediate.
>
I do not quite understand. PGDEACTIVATE is incremented if the page was
active and this is checked before the move to the immediate LRU. Whether
it moves to the immediate LRU or the end of the inactive list, it is
still a deactivation. What's wrong with incrementing the count if it
moves from active to immediate?
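To make the accounting argument concrete, here is a small stand-alone
model. The types, the needs_writeback field and the fake_* names are
invented for the example and are not the kernel's. The only point it
shows is that the active state is sampled before the page is moved, so
the deactivation is counted once whichever list the page lands on.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical list identifiers, not the kernel's enum lru_list. */
enum fake_lru { FAKE_LRU_INACTIVE, FAKE_LRU_ACTIVE, FAKE_LRU_IMMEDIATE };

struct fake_page {
	bool active;
	bool needs_writeback;	/* assumed: dirty or under writeback */
	enum fake_lru lru;
};

static unsigned long fake_pgdeactivate;	/* models the PGDEACTIVATE event */
static long fake_nr_immediate;		/* models the NR_IMMEDIATE counter */

static void fake_deactivate(struct fake_page *page)
{
	/* Sample the active state before the page changes lists. */
	bool was_active = page->active;

	page->active = false;
	if (page->needs_writeback) {
		/* Park the page where reclaim will not keep rescanning it. */
		page->lru = FAKE_LRU_IMMEDIATE;
		fake_nr_immediate++;
	} else {
		page->lru = FAKE_LRU_INACTIVE;
	}

	/* Counted as a deactivation regardless of the destination list. */
	if (was_active)
		fake_pgdeactivate++;
}
```

An active page that needs writeback ends up on the immediate list and
an active clean page on the inactive list; both bump the event counter
by one, while a page that was never active does not.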
==== CUT HERE ====
mm: Isolate pages for immediate reclaim on their own LRU fix
Rename pagevec_putback_immediate_fn to pagevec_putback_from_immediate_fn
for clarity and alter flow of rotate_reclaimable_page() slightly to
avoid an unnecessary list reference.
This is a fix to the patch
mm-isolate-pages-for-immediate-reclaim-on-their-own-lru.patch in mmotm.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
mm/swap.c | 32 ++++++++++++++++++--------------
1 files changed, 18 insertions(+), 14 deletions(-)
diff --git a/mm/swap.c b/mm/swap.c
index 9973975..dfe67eb 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -260,7 +260,7 @@ static void pagevec_move_tail(struct pagevec *pvec)
* moving a page from the LRU_IMMEDIATE to one of the [in]active_[file|anon]
* lists
*/
-static void pagevec_putback_immediate_fn(struct page *page, void *arg)
+static void pagevec_putback_from_immediate_fn(struct page *page, void *arg)
{
struct zone *zone = page_zone(page);
@@ -270,9 +270,9 @@ static void pagevec_putback_immediate_fn(struct page *page, void *arg)
}
}
-static void pagevec_putback_immediate(struct pagevec *pvec)
+static void pagevec_putback_from_immediate(struct pagevec *pvec)
{
- pagevec_lru_move_fn(pvec, pagevec_putback_immediate_fn, NULL);
+ pagevec_lru_move_fn(pvec, pagevec_putback_from_immediate_fn, NULL);
}
/*
@@ -283,7 +283,7 @@ static void pagevec_putback_immediate(struct pagevec *pvec)
void rotate_reclaimable_page(struct page *page)
{
struct zone *zone = page_zone(page);
- struct list_head *page_list;
+ struct list_head *list;
struct pagevec *pvec;
unsigned long flags;
@@ -300,7 +300,7 @@ void rotate_reclaimable_page(struct page *page)
} else {
pvec = &__get_cpu_var(lru_putback_immediate_pvecs);
if (!pagevec_add(pvec, page))
- pagevec_putback_immediate(pvec);
+ pagevec_putback_from_immediate(pvec);
}
/*
@@ -316,17 +316,21 @@ void rotate_reclaimable_page(struct page *page)
* problematic due to the problem of deciding when the right time
* to scan this list is.
*/
- page_list = &zone->lru[LRU_IMMEDIATE].list;
- if (!zone_page_state(zone, NR_IMMEDIATE) && !list_empty(page_list)) {
+ if (!zone_page_state(zone, NR_IMMEDIATE)) {
struct page *page;
-
- spin_lock(&zone->lru_lock);
- while (!list_empty(page_list)) {
- page = list_entry(page_list->prev, struct page, lru);
- list_move(&page->lru, &zone->lru[page_lru(page)].list);
- __count_vm_event(PGRESCUED);
+ list = &zone->lru[LRU_IMMEDIATE].list;
+
+ if (!list_empty(list)) {
+ spin_lock(&zone->lru_lock);
+ while (!list_empty(list)) {
+ int lru;
+ page = list_entry(list->prev, struct page, lru);
+ lru = page_lru(page);
+ list_move(&page->lru, &zone->lru[lru].list);
+ __count_vm_event(PGRESCUED);
+ }
+ spin_unlock(&zone->lru_lock);
}
- spin_unlock(&zone->lru_lock);
}
local_irq_restore(flags);
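For readers following the control flow of the rescue path, the sketch
below models it in user space. It assumes a simplified zone with one
immediate list and one inactive list, whereas the patch moves each page
to the list chosen by page_lru(). The list helpers, struct fake_zone
and the function name are hypothetical. The counter is tested first,
and the list is only examined and drained when the counter says it
should be empty.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, in the spirit of list_head. */
struct fake_list { struct fake_list *prev, *next; };

static void fake_list_init(struct fake_list *l) { l->prev = l->next = l; }
static int fake_list_empty(const struct fake_list *l) { return l->next == l; }

static void fake_list_add(struct fake_list *e, struct fake_list *head)
{
	e->next = head->next;
	e->prev = head;
	head->next->prev = e;
	head->next = e;
}

static void fake_list_move(struct fake_list *e, struct fake_list *head)
{
	/* Unlink from the current list, then add at the front of head. */
	e->prev->next = e->next;
	e->next->prev = e->prev;
	fake_list_add(e, head);
}

/* Hypothetical zone: one immediate list, one destination list. */
struct fake_zone {
	struct fake_list immediate;
	struct fake_list inactive;
	long nr_immediate;		/* models NR_IMMEDIATE */
	unsigned long pgrescued;	/* models the PGRESCUED event */
};

static void fake_zone_init(struct fake_zone *zone)
{
	fake_list_init(&zone->immediate);
	fake_list_init(&zone->inactive);
	zone->nr_immediate = 0;
	zone->pgrescued = 0;
}

/*
 * If the counter says nothing is pending but pages are still parked on
 * the immediate list, they were stranded by the race and are put back.
 */
static void fake_rescue_stranded(struct fake_zone *zone)
{
	if (zone->nr_immediate)
		return;		/* cheap test first, list not touched */
	if (fake_list_empty(&zone->immediate))
		return;

	/* The patch holds zone->lru_lock around the equivalent loop. */
	while (!fake_list_empty(&zone->immediate)) {
		fake_list_move(zone->immediate.prev, &zone->inactive);
		zone->pgrescued++;
	}
}
```

Calling fake_rescue_stranded() while nr_immediate is non-zero does
nothing. Once the counter has dropped to zero with entries still on the
immediate list, every entry is moved and counted as rescued.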
^ permalink raw reply related [flat|nested] 100+ messages in thread
+ int lru;
+ page = list_entry(list->prev, struct page, lru);
+ lru = page_lru(page);
+ list_move(&page->lru, &zone->lru[lru].list);
+ __count_vm_event(PGRESCUED);
+ }
+ spin_unlock(&zone->lru_lock);
}
- spin_unlock(&zone->lru_lock);
}
local_irq_restore(flags);
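
For anyone who wants to experiment with the rescue loop outside the
kernel, the following is a minimal userspace sketch of the reworked
flow. The assumptions are significant: the list helpers are a cut-down
imitation of list_head, model_zone and its fields are hypothetical,
there is no locking, and every rescued entry goes to one "inactive"
list rather than to the list page_lru() would pick. It only illustrates
the control flow of checking the counter first, then the list, then
draining from the tail.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled loosely on list_head. */
struct node {
	struct node *prev, *next;
};

static void node_init(struct node *n)
{
	n->prev = n->next = n;
}

static int node_empty(const struct node *head)
{
	return head->next == head;
}

static void node_del(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

/* Insert n right after head. */
static void node_add(struct node *n, struct node *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

static void node_move(struct node *n, struct node *head)
{
	node_del(n);
	node_add(n, head);
}

/* Hypothetical zone model: two lists and two counters. */
struct model_zone {
	struct node immediate;
	struct node inactive;
	long nr_immediate;		/* models NR_IMMEDIATE */
	unsigned long pgrescued;	/* models the PGRESCUED event count */
};

static void model_zone_init(struct model_zone *zone)
{
	node_init(&zone->immediate);
	node_init(&zone->inactive);
	zone->nr_immediate = 0;
	zone->pgrescued = 0;
}

/*
 * If the counter says nothing should be on the immediate list but the
 * list is not empty, move every stranded entry back. Returns the number
 * of entries rescued.
 */
static unsigned long model_rescue(struct model_zone *zone)
{
	unsigned long moved = 0;

	assert(zone != NULL);

	if (zone->nr_immediate != 0)
		return 0;

	while (!node_empty(&zone->immediate)) {
		struct node *n = zone->immediate.prev;	/* take from the tail */

		node_move(n, &zone->inactive);
		zone->pgrescued++;
		moved++;
	}
	return moved;
}
```

With two entries stranded on the immediate list and the counter at zero,
model_rescue() moves both and leaves the immediate list empty; with a
non-zero counter it does nothing.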
^ permalink raw reply related [flat|nested] 100+ messages in thread
* Re: [PATCH 0/11] Reduce compaction-related stalls and improve asynchronous migration of dirty pages v6
2011-12-16 23:37 ` Andrew Morton
@ 2011-12-19 14:20 ` Mel Gorman
-1 siblings, 0 replies; 100+ messages in thread
From: Mel Gorman @ 2011-12-19 14:20 UTC (permalink / raw)
To: Andrew Morton
Cc: Andrea Arcangeli, Minchan Kim, Dave Jones, Jan Kara,
Andy Isaacson, Johannes Weiner, Rik van Riel, Nai Xia, Linux-MM,
LKML
On Fri, Dec 16, 2011 at 03:37:16PM -0800, Andrew Morton wrote:
> On Wed, 14 Dec 2011 15:41:22 +0000
> Mel Gorman <mgorman@suse.de> wrote:
>
> > Short summary: There are severe stalls when a USB stick using VFAT
> > is used with THP enabled that are reduced by this series. If you are
> > experiencing this problem, please test and report back and considering
> > I have seen complaints from openSUSE and Fedora users on this as well
> > as a few private mails, I'm guessing it's a widespread issue. This
> > is a new type of USB-related stall because it is due to synchronous
> > compaction writing whereas in the past the big problem was dirty
> > pages reaching the end of the LRU and being written by reclaim.
>
> Overall footprint:
>
> fs/btrfs/disk-io.c | 5
> fs/hugetlbfs/inode.c | 3
> fs/nfs/internal.h | 2
> fs/nfs/write.c | 4
> include/linux/fs.h | 11 +-
> include/linux/migrate.h | 23 +++-
> include/linux/mmzone.h | 4
> include/linux/vm_event_item.h | 1
> mm/compaction.c | 5
> mm/memory-failure.c | 2
> mm/memory_hotplug.c | 2
> mm/mempolicy.c | 2
> mm/migrate.c | 171 +++++++++++++++++++++-----------
> mm/page_alloc.c | 50 +++++++--
> mm/swap.c | 74 ++++++++++++-
> mm/vmscan.c | 114 ++++++++++++++++++---
> mm/vmstat.c | 2
> 17 files changed, 371 insertions(+), 104 deletions(-)
>
> The line count belies the increase in complexity.
>
I know and I regret that. Unfortunately while I considered other
solutions that were less complex, they were also nowhere near as
effective. The theme is at least consistent in that we are continuing
to move away from calling writepage in reclaim context.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH 0/11] Reduce compaction-related stalls and improve asynchronous migration of dirty pages v6
2011-12-16 22:56 ` Andrew Morton
@ 2011-12-19 14:40 ` Mel Gorman
-1 siblings, 0 replies; 100+ messages in thread
From: Mel Gorman @ 2011-12-19 14:40 UTC (permalink / raw)
To: Andrew Morton
Cc: Andrea Arcangeli, Minchan Kim, Dave Jones, Jan Kara,
Andy Isaacson, Johannes Weiner, David Rientjes, Rik van Riel,
Nai Xia, Linux-MM, LKML
On Fri, Dec 16, 2011 at 02:56:00PM -0800, Andrew Morton wrote:
> On Wed, 14 Dec 2011 15:41:22 +0000
> Mel Gorman <mgorman@suse.de> wrote:
>
> > Short summary: There are severe stalls when a USB stick using VFAT
> > is used with THP enabled that are reduced by this series. If you are
> > experiencing this problem, please test and report back and considering
> > I have seen complaints from openSUSE and Fedora users on this as well
> > as a few private mails, I'm guessing it's a widespread issue. This
> > is a new type of USB-related stall because it is due to synchronous
> > compaction writing whereas in the past the big problem was dirty
> > pages reaching the end of the LRU and being written by reclaim.
> >
> > Am cc'ing Andrew this time and this series would replace
> > mm-do-not-stall-in-synchronous-compaction-for-thp-allocations.patch.
> > I'm also cc'ing Dave Jones as he might have merged that patch to Fedora
> > for wider testing and ideally it would be reverted and replaced by
> > this series.
>
> So it appears that the problem is painful for distros and users and
> that we won't have this fixed until 3.2 at best, and that fix will be a
> difficult backport for distributors of earlier kernels.
>
It is only difficult because the series "Do not call ->writepage[s]
from direct reclaim and use a_ops->writepages() where possible"
is also required. If both are put into -stable, then the backport
is straightforward, but I was skeptical that -stable would take two
series that are this far reaching for a performance problem.
> To serve those people better, I'm wondering if we should merge
> mm-do-not-stall-in-synchronous-compaction-for-thp-allocations now, make
> it available for -stable backport and then revert it as part of this
> series? ie: give people a stopgap while we fix it properly?
If -stable cannot take both series then this is probably the
only realistic option. I'd be ok with this but it will hurt THP
allocation success rates on those kernels so that will hurt other
people like Andrea and David Rientjes. It's between a rock and a hard
place. Another realistic option might be for distros to disable THP
by default on 3.0 and 3.1.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH 11/11] mm: Isolate pages for immediate reclaim on their own LRU
2011-12-16 16:07 ` Mel Gorman
@ 2011-12-19 16:14 ` Johannes Weiner
-1 siblings, 0 replies; 100+ messages in thread
From: Johannes Weiner @ 2011-12-19 16:14 UTC (permalink / raw)
To: Mel Gorman
Cc: Johannes Weiner, Andrew Morton, Andrea Arcangeli, Minchan Kim,
Dave Jones, Jan Kara, Andy Isaacson, Rik van Riel, Nai Xia,
Linux-MM, LKML
On Fri, Dec 16, 2011 at 04:07:28PM +0000, Mel Gorman wrote:
> On Fri, Dec 16, 2011 at 04:17:31PM +0100, Johannes Weiner wrote:
> > On Wed, Dec 14, 2011 at 03:41:33PM +0000, Mel Gorman wrote:
> > > It was observed that scan rates from direct reclaim during tests
> > > writing to both fast and slow storage were extraordinarily high. The
> > > problem was that while pages were being marked for immediate reclaim
> > > when writeback completed, the same pages were being encountered over
> > > and over again during LRU scanning.
> > >
> > > This patch isolates file-backed pages that are to be reclaimed when
> > > clean on their own LRU list.
> >
> > Excuse me if I sound like a broken record, but have those observations
> > of high scan rates persisted with the per-zone dirty limits patchset?
> >
>
> Unfortunately I wasn't testing that series. The focus of this series
> was primarily on THP-related stalls incurred by compaction which
> did not have a dependency on that series. Even with dirty balancing,
> similar stalls would be observed once dirty pages were in the zone
> at all.
>
> > In my tests with pzd, the scan rates went down considerably together
> > with the immediate reclaim / vmscan writes.
> >
>
> I probably should know but what is pzd?
Oops. Per-Zone Dirty limits.
> > Our dirty limits are pretty low - if reclaim keeps shuffling through
> > dirty pages, where are the 80% reclaimable pages?! To me, this sounds
> > like the unfair distribution of dirty pages among zones again. Is
> > there a different explanation that I missed?
> >
>
> The alternative explanation is that the 20% dirty pages are all
> long-lived, at the end of the highest zone which is always scanned first
> so we continually have to scan over these dirty pages for prolonged
> periods of time.
That certainly makes sense to me and is consistent with your test case
having a fast producer of clean cache while the dirty cache is against
a slow backing device, so it may survive multiple full inactive cycles
before writeback finishes.
> > PS: It also seems a bit out of place in this series...?
>
> Without the last patch, the System CPU time was stupidly high. In part,
> this is because we are no longer calling ->writepage from direct
> reclaim. If we were, the CPU usage would be far lower but it would
> be a lot slower too. It seemed remiss to leave system CPU usage that
> high without some explanation or patch dealing with it.
>
> The following replaces this patch with your series. dirtybalance-v7r1 is
> yours.
>
>                          3.1.0-vanilla             rc5-vanilla           freemore-v6r1            isolate-v6r1       dirtybalance-v7r1
> System Time          1.22 (     0.00%)      13.89 ( -1040.72%)      46.40 ( -3709.20%)       4.44 (  -264.37%)      43.05 ( -3434.81%)
> +/-                  0.06 (     0.00%)      22.82 (-37635.56%)       3.84 ( -6249.44%)       6.48 (-10618.92%)       4.04 ( -6581.33%)
> User Time            0.06 (     0.00%)       0.06 (    -6.90%)       0.05 (    17.24%)       0.05 (    13.79%)       0.05 (    20.69%)
> +/-                  0.01 (     0.00%)       0.01 (    33.33%)       0.01 (    33.33%)       0.01 (    39.14%)       0.01 (    -1.84%)
> Elapsed Time     10445.54 (     0.00%)    2249.92 (    78.46%)      70.06 (    99.33%)      16.59 (    99.84%)      73.71 (    99.29%)
> +/-                643.98 (     0.00%)     811.62 (   -26.03%)      10.02 (    98.44%)       7.03 (    98.91%)      17.90 (    97.22%)
> THP Active          15.60 (     0.00%)      35.20 (   225.64%)      65.00 (   416.67%)      70.80 (   453.85%)     102.60 (   657.69%)
> +/-                 18.48 (     0.00%)      51.29 (   277.59%)      15.99 (    86.52%)      37.91 (   205.18%)      26.06 (   141.02%)
> Fault Alloc        121.80 (     0.00%)      76.60 (    62.89%)     155.40 (   127.59%)     181.20 (   148.77%)     214.80 (   176.35%)
> +/-                 73.51 (     0.00%)      61.11 (    83.12%)      34.89 (    47.46%)      31.88 (    43.36%)      53.21 (    72.39%)
> Fault Fallback     881.20 (     0.00%)     926.60 (    -5.15%)     847.60 (     3.81%)     822.00 (     6.72%)     788.40 (    10.53%)
> +/-                 73.51 (     0.00%)      61.26 (    16.67%)      34.89 (    52.54%)      31.65 (    56.94%)      53.41 (    27.35%)
> MMTests Statistics: duration
> User/Sys Time Running Test (seconds)   3540.88   1945.37    716.04     64.97    715.04
> Total Elapsed Time (seconds)          52417.33  11425.90    501.02    230.95    549.64
>
> Your series does help the System CPU time, bringing it down from 46.4 seconds
> to 43.05 seconds. That is within the noise but towards the edge of
> one standard deviation. With such a small reduction, elapsed time was
> not helped. However, it did help THP allocation success rates - still
> within the noise but again at the edge of the noise which indicates
> a solid improvement.
>
> MMTests Statistics: vmstat
> Page Ins                   3257266139   1111844061     17263623     10901575     20870385
> Page Outs                    81054922     30364312      3626530      3657687      3665499
> Swap Ins                         3294         2851         6560         4964         6598
> Swap Outs                      390073       528094       620197       790912       604228
> Direct pages scanned       1077581700   3024951463   1764930052    115140570   1796314840
> Kswapd pages scanned         34826043      7112868      2131265      1686942      2093637
> Kswapd pages reclaimed       28950067      4911036      1246044       966475      1319662
> Direct pages reclaimed      805148398    280167837      3623473      2215044      4182274
> Kswapd efficiency                 83%          69%          58%          57%          63%
> Kswapd velocity               664.399      622.521     4253.852     7304.360     3809.106
> Direct efficiency                 74%           9%           0%           1%           0%
> Direct velocity             20557.737   264745.137  3522673.849   498551.938  3268166.145
> Percentage direct scans           96%          99%          99%          98%          99%
> Page writes by reclaim         722646       529174       620319       791018       604368
> Page writes file               332573         1080          122          106          140
> Page writes anon               390073       528094       620197       790912       604228
> Page reclaim immediate              0   2552514720   1635858848    111281140   1661416934
> Page rescued immediate              0            0            0        87848            0
> Slabs scanned                   23552        23552         9216         8192         8192
> Direct inode steals               231            0            0            0            0
> Kswapd inode steals                 0            0            0            0            0
> Kswapd skipped wait             28076          786            0           61            1
> THP fault alloc                   609          383          753          906         1074
> THP collapse alloc                 12            6            0            0            0
> THP splits                        536          211          456          593          561
> THP fault fallback               4406         4633         4263         4110         3942
> THP collapse fail                 120          127            0            0            0
> Compaction stalls                1810          728          623          779          869
> Compaction success                196           53           60           80           99
> Compaction failures              1614          675          563          699          770
> Compaction pages moved         193158        53545       243185       333457       409585
> Compaction move failure          9952         9396        16424        23676        30668
>
> The direct page scanned figure with your patch is still very high
> unfortunately.
>
> Overall, I would say that your series is not a replacement for the last
> patch in this series.
Agreed, thanks for clearing this up.
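
For readers checking the vmstat figures quoted above: the efficiency and
velocity rows appear to be derived from the raw counters, with
efficiency being pages reclaimed as a truncated percentage of pages
scanned, and velocity being pages scanned per second of total elapsed
time. The sketch below works under that assumption. The function names
are made up for illustration and are not taken from MMTests, and the
value returned when nothing was scanned is an arbitrary choice.

```c
#include <assert.h>

/*
 * Reclaim efficiency as a whole percentage, truncated rather than
 * rounded. Returning 0 for an empty scan is an arbitrary choice made
 * for this sketch.
 */
static unsigned int reclaim_efficiency_pct(unsigned long long reclaimed,
					   unsigned long long scanned)
{
	if (scanned == 0)
		return 0;
	return (unsigned int)(reclaimed * 100ULL / scanned);
}

/* Scan velocity in pages per second of elapsed wall-clock time. */
static double scan_velocity(unsigned long long scanned, double elapsed_seconds)
{
	assert(elapsed_seconds > 0.0);
	return (double)scanned / elapsed_seconds;
}
```

Plugging in the 3.1.0-vanilla column (805148398 direct pages reclaimed,
1077581700 direct pages scanned, 52417.33 seconds elapsed) reproduces
the reported 74% direct efficiency and a direct velocity of roughly
20557.7 pages per second. The isolate-v6r1 column (2215044 reclaimed out
of 115140570 scanned) gives 1%, which is why truncation rather than
rounding is assumed.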
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH 11/11] mm: Isolate pages for immediate reclaim on their own LRU
2011-12-19 13:26 ` Mel Gorman
@ 2011-12-20 7:10 ` Minchan Kim
-1 siblings, 0 replies; 100+ messages in thread
From: Minchan Kim @ 2011-12-20 7:10 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Andrea Arcangeli, Minchan Kim, Dave Jones,
Jan Kara, Andy Isaacson, Johannes Weiner, Rik van Riel, Nai Xia,
Linux-MM, LKML
On Mon, Dec 19, 2011 at 01:26:15PM +0000, Mel Gorman wrote:
> On Sun, Dec 18, 2011 at 01:08:22AM +0900, Minchan Kim wrote:
> > On Wed, Dec 14, 2011 at 03:41:33PM +0000, Mel Gorman wrote:
> > > It was observed that scan rates from direct reclaim during tests
> > > writing to both fast and slow storage were extraordinarily high. The
> > > problem was that while pages were being marked for immediate reclaim
> > > when writeback completed, the same pages were being encountered over
> > > and over again during LRU scanning.
> > >
> > > This patch isolates file-backed pages that are to be reclaimed when
> > > clean on their own LRU list.
> >
> > Please include your test results showing the reduced CPU usage.
> > They would show how valuable this separate LRU list is.
> >
>
> It's in the leader. The writebackCPDevicevfat tests show that System
> CPU goes from 46.40 seconds to 4.44 seconds with this patch applied.
Sorry, I didn't read the cover letter.
Looks great.
>
> > > <SNIP>
> > >
> > > diff --git a/mm/swap.c b/mm/swap.c
> > > index a91caf7..9973975 100644
> > > --- a/mm/swap.c
> > > +++ b/mm/swap.c
> > > @@ -39,6 +39,7 @@ int page_cluster;
> > >
> > > static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
> > > static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
> > > +static DEFINE_PER_CPU(struct pagevec, lru_putback_immediate_pvecs);
> > > static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
> > >
> > > /*
> > > @@ -255,24 +256,80 @@ static void pagevec_move_tail(struct pagevec *pvec)
> > > }
> > >
> > > /*
> > > + * Similar pair of functions to pagevec_move_tail except it is called when
> > > + * moving a page from the LRU_IMMEDIATE to one of the [in]active_[file|anon]
> > > + * lists
> > > + */
> > > +static void pagevec_putback_immediate_fn(struct page *page, void *arg)
> > > +{
> > > + struct zone *zone = page_zone(page);
> > > +
> > > + if (PageLRU(page)) {
> > > + enum lru_list lru = page_lru(page);
> > > + list_move(&page->lru, &zone->lru[lru].list);
> > > + }
> > > +}
> > > +
> > > +static void pagevec_putback_immediate(struct pagevec *pvec)
> > > +{
> > > + pagevec_lru_move_fn(pvec, pagevec_putback_immediate_fn, NULL);
> > > +}
> > > +
> > > +/*
> > > * Writeback is about to end against a page which has been marked for immediate
> > > * reclaim. If it still appears to be reclaimable, move it to the tail of the
> > > * inactive list.
> > > */
> > > void rotate_reclaimable_page(struct page *page)
> > > {
> > > + struct zone *zone = page_zone(page);
> > > + struct list_head *page_list;
> > > + struct pagevec *pvec;
> > > + unsigned long flags;
> > > +
> > > + page_cache_get(page);
> > > + local_irq_save(flags);
> > > + __mod_zone_page_state(zone, NR_IMMEDIATE, -1);
> > > +
> >
> > I am not sure that an underflow can never happen.
> > We call SetPageReclaim in several places but don't increase NR_IMMEDIATE.
> >
>
> In those cases, we do not move the page to the immediate list either.
That's my concern.
We didn't move the page to the immediate list but did call SetPageReclaim,
which means we didn't increase NR_IMMEDIATE.
If end_page_writeback is called on that page, rotate_reclaimable_page would be called.
Eventually, __mod_zone_page_state(zone, NR_IMMEDIATE, -1) is called.
But I haven't looked into the code yet to confirm whether it's possible or not.
> During one test I was recording /proc/vmstat every 10 seconds and never
> saw an underflow.
If it's very rare, it would be very hard to see it.
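
The sequence being described can be modelled in userspace. This is only
a sketch of the accounting concern and not kernel code: the model_*
structs and functions are hypothetical, and it says nothing about
whether the kernel can actually reach this state, which is the open
question here. It just shows what happens to the counter if a page has
its reclaim flag set without the matching increment and then goes
through the unconditional decrement at the end of writeback.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical page model: only the state relevant to the counter. */
struct model_page {
	bool reclaim;		/* stands in for PageReclaim */
	bool on_immediate;	/* page sits on the immediate list */
};

/* Hypothetical zone model holding the NR_IMMEDIATE-style counter. */
struct model_zone {
	long nr_immediate;
};

/* Deactivation path: flag the page, move it, and count it. */
static void model_deactivate_to_immediate(struct model_zone *zone,
					  struct model_page *page)
{
	page->reclaim = true;
	page->on_immediate = true;
	zone->nr_immediate++;
}

/* Any other path that only sets the flag and leaves the counter alone. */
static void model_set_reclaim_only(struct model_page *page)
{
	page->reclaim = true;
}

/* End of writeback on a flagged page: decrement unconditionally. */
static void model_writeback_end(struct model_zone *zone,
				struct model_page *page)
{
	assert(page->reclaim);
	zone->nr_immediate--;
	page->reclaim = false;
	page->on_immediate = false;
}
```

A page that went through model_deactivate_to_immediate() leaves the
counter balanced at zero after writeback ends, while a page that only
went through model_set_reclaim_only() drives it to -1.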
>
> > > if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
> > > !PageUnevictable(page) && PageLRU(page)) {
> > > - struct pagevec *pvec;
> > > - unsigned long flags;
> > >
> > > - page_cache_get(page);
> > > - local_irq_save(flags);
> > > pvec = &__get_cpu_var(lru_rotate_pvecs);
> > > if (!pagevec_add(pvec, page))
> > > pagevec_move_tail(pvec);
> > > - local_irq_restore(flags);
> > > + } else {
> > > + pvec = &__get_cpu_var(lru_putback_immediate_pvecs);
> > > + if (!pagevec_add(pvec, page))
> > > + pagevec_putback_immediate(pvec);
> >
> > Nitpick about naming.
>
> Naming is important.
>
> > It doesn't say immediate is from or to. So I got confused
> > which is source. I know comment of function already say it
> > but good naming can reduce unnecessary comment.
> > How about pagevec_putback_from_immediate_list?
> >
>
> Sure. Done.
>
> > > + }
> > > +
> > > + /*
> > > + * There is a potential race that if a page is set PageReclaim
> > > + * and moved to the LRU_IMMEDIATE list after writeback completed,
> > > + * it can be left on the LRU_IMMEDATE list with no way for
> > > + * reclaim to find it.
> > > + *
> > > + * This race should be very rare but count how often it happens.
> > > + * If it is a continual race, then it's very unsatisfactory as there
> > > + * is no guarantee that rotate_reclaimable_page() will be called
> > > + * to rescue these pages but finding them in page reclaim is also
> > > + * problematic due to the problem of deciding when the right time
> > > + * to scan this list is.
> > > + */
> > > + page_list = &zone->lru[LRU_IMMEDIATE].list;
> > > + if (!zone_page_state(zone, NR_IMMEDIATE) && !list_empty(page_list)) {
> >
> > How about this
> >
> > if (zone_page_state(zone, NR_IMMEDIATE)) {
> > page_list = &zone->lru[LRU_IMMEDIATE].list;
> > if (!list_empty(page_list))
> > ...
> > ...
> > }
> >
> > It can avoid an unnecessary reference.
> >
>
> Ok, it mucks up the indentation a bit but with some renaming it looks
> reasonable.
>
> > > + struct page *page;
> > > +
> > > + spin_lock(&zone->lru_lock);
> > > + while (!list_empty(page_list)) {
> > > + page = list_entry(page_list->prev, struct page, lru);
> > > + list_move(&page->lru, &zone->lru[page_lru(page)].list);
> > > + __count_vm_event(PGRESCUED);
> > > + }
> > > + spin_unlock(&zone->lru_lock);
> > > }
> > > +
> > > + local_irq_restore(flags);
> > > }
> > >
> > > static void update_page_reclaim_stat(struct zone *zone, struct page *page,
> > > @@ -475,6 +532,13 @@ static void lru_deactivate_fn(struct page *page, void *arg)
> > > * is _really_ small and it's non-critical problem.
> > > */
> > > SetPageReclaim(page);
> > > +
> > > + /*
> > > + * Move to the LRU_IMMEDIATE list to avoid being scanned
> > > + * by page reclaim uselessly.
> > > + */
> > > + list_move_tail(&page->lru, &zone->lru[LRU_IMMEDIATE].list);
> > > + __mod_zone_page_state(zone, NR_IMMEDIATE, 1);
> >
> > > It makes the PGDEACTIVATE count below wrong in lru_deactivate_fn.
> > > Before this patch, every move was from active to inactive so it was right.
> > > But with this patch, it can be from active to immediate.
> >
>
> I do not quite understand. PGDEACTIVATE is incremented if the page was
> active and this is checked before the move to the immediate LRU. Whether
> it moves to the immediate LRU or the end of the inactive list, it is
> still a deactivation. What's wrong with incrementing the count if it
Hmm, I had thought deactivation was only from active to inactive.
I might be wrong, but if we move a page from the active to the unevictable list,
is that deactivation, too?
Maybe we need a consistent count.
> moves from active to immediate?
>
> ==== CUT HERE ====
> mm: Isolate pages for immediate reclaim on their own LRU fix
>
> Rename pagevec_putback_immediate_fn to pagevec_putback_from_immediate_fn
> for clarity and alter flow of rotate_reclaimable_page() slightly to
> avoid an unnecessary list reference.
>
> This is a fix to the patch
> mm-isolate-pages-for-immediate-reclaim-on-their-own-lru.patch in mmotm.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Minchan Kim <minchan@kernel.org>
Thanks.
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH 08/11] mm: compaction: Introduce sync-light migration for use by compaction
2011-12-19 11:45 ` Mel Gorman
@ 2011-12-20 7:18 ` Minchan Kim
-1 siblings, 0 replies; 100+ messages in thread
From: Minchan Kim @ 2011-12-20 7:18 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Andrea Arcangeli, Minchan Kim, Dave Jones,
Jan Kara, Andy Isaacson, Johannes Weiner, Rik van Riel, Nai Xia,
Linux-MM, LKML
On Mon, Dec 19, 2011 at 11:45:22AM +0000, Mel Gorman wrote:
> On Sun, Dec 18, 2011 at 11:05:52AM +0900, Minchan Kim wrote:
> > On Wed, Dec 14, 2011 at 03:41:30PM +0000, Mel Gorman wrote:
> > > This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
> > > mode that avoids writing back pages to backing storage. Async
> > > compaction maps to MIGRATE_ASYNC while sync compaction maps to
> > > MIGRATE_SYNC_LIGHT. For other migrate_pages users such as memory
> > > hotplug, MIGRATE_SYNC is used.
> > >
> > > This avoids sync compaction stalling for an excessive length of time,
> > > particularly when copying files to a USB stick where there might be
> > > a large number of dirty pages backed by a filesystem that does not
> > > support ->writepages.
> > >
> > > [aarcange@redhat.com: This patch is heavily based on Andrea's work]
> > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> >
> > Acked-by: Minchan Kim <minchan@kernel.org>
> >
>
> Thanks.
>
> > > <SNIP>
> > > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > > index 10b9883..6b80537 100644
> > > --- a/fs/hugetlbfs/inode.c
> > > +++ b/fs/hugetlbfs/inode.c
> > > @@ -577,7 +577,7 @@ static int hugetlbfs_set_page_dirty(struct page *page)
> > >
> > > static int hugetlbfs_migrate_page(struct address_space *mapping,
> > > struct page *newpage, struct page *page,
> > > - bool sync)
> > > + enum migrate_mode mode)
> >
> > Nitpick, except this one, we use enum migrate_mode sync.
> >
>
> Actually, in all the core code, I used "mode" but I was inconsistent in
> the headers and some of the filesystems. I should have converted all use
> of "sync" which was a boolean to a mode which has three possible values
> after this patch.
>
> ==== CUT HERE ====
> mm: compaction: Introduce sync-light migration for use by compaction fix
>
> Consistently name enum migrate_mode parameters "mode" instead of "sync".
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Minchan Kim <minchan@kernel.org>
Thanks.
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 100+ messages in thread
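[Editor's sketch, not part of the original thread] The mode mapping
described in the changelog quoted above can be illustrated in userspace C.
The three enum names come from the changelog. The helpers
compaction_migrate_mode(), may_writeback() and check_mode_mapping() are
hypothetical names used only for illustration and are not claimed to exist
in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Mode names as given in the patch changelog. */
enum migrate_mode {
	MIGRATE_ASYNC,		/* asynchronous migration */
	MIGRATE_SYNC_LIGHT,	/* sync, but avoids writing pages back */
	MIGRATE_SYNC,		/* full sync, e.g. memory hotplug */
};

/* Hypothetical helper: async compaction maps to MIGRATE_ASYNC,
 * sync compaction maps to MIGRATE_SYNC_LIGHT. */
enum migrate_mode compaction_migrate_mode(bool sync_compaction)
{
	return sync_compaction ? MIGRATE_SYNC_LIGHT : MIGRATE_ASYNC;
}

/* Hypothetical helper: in this sketch only full sync migration is
 * allowed to write dirty pages to backing storage. */
bool may_writeback(enum migrate_mode mode)
{
	return mode == MIGRATE_SYNC;
}

/* Usage example: sync compaction no longer implies writeback. */
void check_mode_mapping(void)
{
	enum migrate_mode mode = compaction_migrate_mode(true);

	assert(mode == MIGRATE_SYNC_LIGHT);
	assert(!may_writeback(mode));
}
```

A three-valued enum replaces the old "bool sync" parameter, which is why
the follow-up fix renames the remaining "sync" parameters to "mode".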
* Re: [PATCH 11/11] mm: Isolate pages for immediate reclaim on their own LRU
2011-12-20 7:10 ` Minchan Kim
@ 2011-12-20 9:55 ` Mel Gorman
-1 siblings, 0 replies; 100+ messages in thread
From: Mel Gorman @ 2011-12-20 9:55 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, Andrea Arcangeli, Minchan Kim, Dave Jones,
Jan Kara, Andy Isaacson, Johannes Weiner, Rik van Riel, Nai Xia,
Linux-MM, LKML
On Tue, Dec 20, 2011 at 04:10:26PM +0900, Minchan Kim wrote:
> > > > * Writeback is about to end against a page which has been marked for immediate
> > > > * reclaim. If it still appears to be reclaimable, move it to the tail of the
> > > > * inactive list.
> > > > */
> > > > void rotate_reclaimable_page(struct page *page)
> > > > {
> > > > + struct zone *zone = page_zone(page);
> > > > + struct list_head *page_list;
> > > > + struct pagevec *pvec;
> > > > + unsigned long flags;
> > > > +
> > > > + page_cache_get(page);
> > > > + local_irq_save(flags);
> > > > + __mod_zone_page_state(zone, NR_IMMEDIATE, -1);
> > > > +
> > >
> > > > I am not sure underflow can never happen.
> > > > We do SetPageReclaim at several places but don't increase NR_IMMEDIATE.
> > > >
> > >
> > > In those cases, we do not move the page to the immediate list either.
>
> That's my concern.
> We didn't move the page to the immediate list but did SetPageReclaim. It means
> we don't increase NR_IMMEDIATE.
> If end_page_writeback is called on that page, rotate_reclaimable_page would be called.
> Eventually, __mod_zone_page_state(zone, NR_IMMEDIATE, -1) is called.
> But I haven't looked into the code yet to confirm whether it's possible or not.
>
Ah, now I see your concern. The key is that they get moved to the
immediate LRU later although it is not obvious. This should be double
checked but when I was implementing this, I looked at the different
places that called SetPageReclaim.

mm/swap.c:lru_deactivate_fn() calls SetPageReclaim but also moves the
	page to the immediate LRU list so no problem with accounting
	there.

mm/vmscan.c:pageout() calls SetPageReclaim but does not move the page
	explicitly as such. Instead, it gets picked up by
	putback_lru_pages() later which checks for inactive LRU pages
	that are marked PageReclaim and selects the immediate LRU in
	this case. The counter gets incremented for the appropriate
	LRU list by __add_page_to_lru_list(). Even if we do have
	an active page with PageReclaim set, it should not cause an
	accounting difficulty.

mm/vmscan.c:shrink_page_list() calls SetPageReclaim but like pageout(),
	it gets picked up by putback_lru_pages() later.

Did I miss anything?
> > During one test I was recording /proc/vmstat every 10 seconds and never
> > saw an underflow.
>
> If it's very rare, it would be very hard to see it.
>
But once it happened, I would not expect it to recover. The nr_immediate
value usually reads as 0.
> > > > <SNIP>
> > > > static void update_page_reclaim_stat(struct zone *zone, struct page *page,
> > > > @@ -475,6 +532,13 @@ static void lru_deactivate_fn(struct page *page, void *arg)
> > > > * is _really_ small and it's non-critical problem.
> > > > */
> > > > SetPageReclaim(page);
> > > > +
> > > > + /*
> > > > + * Move to the LRU_IMMEDIATE list to avoid being scanned
> > > > + * by page reclaim uselessly.
> > > > + */
> > > > + list_move_tail(&page->lru, &zone->lru[LRU_IMMEDIATE].list);
> > > > + __mod_zone_page_state(zone, NR_IMMEDIATE, 1);
> > >
> > > > It makes the PGDEACTIVATE count below wrong in lru_deactivate_fn.
> > > > Before this patch, every move was from active to inactive so it was right.
> > > > But with this patch, it can be from active to immediate.
> > >
> >
> > I do not quite understand. PGDEACTIVATE is incremented if the page was
> > active and this is checked before the move to the immediate LRU. Whether
> > it moves to the immediate LRU or the end of the inactive list, it is
> > still a deactivation. What's wrong with incrementing the count if it
>
> Hmm, I had thought deactivation was only from active to inactive.
This is a matter of definition really. The page is going from active
to inactive. The immediate list is similar to the inactive list in
this case, at least from a deactivation point of view.
> I might be wrong but if we perhaps move page from active to unevictable list,
> is it deactivation, too?
I would consider it a deactivate if PageActive got cleared. Here we are
talking about the lru_deactivate_fn function. Whether it moves to the
immediate list or the end of the inactive list, the page is being
deactivated.
> Maybe we need consistent count.
>
In this case, I think we are being consistent. The page is deactivated,
so we increase the PGDEACTIVATE counter.
Thanks very much for reviewing this closely, I appreciate it.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 100+ messages in thread
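[Editor's sketch, not part of the original thread] The accounting argument
in the message above is that every page marked PageReclaim is eventually
put back on the immediate list with the counter incremented, so the
unconditional decrement in rotate_reclaimable_page() balances out. The
userspace C below is a simplified model of that claim under stated
assumptions: struct page_sketch, struct zone_sketch, putback_target(),
putback_page() and writeback_done() are hypothetical stand-ins, not the
kernel's types or functions, and locking is ignored.

```c
#include <assert.h>
#include <stdbool.h>

enum lru_list {
	LRU_INACTIVE_FILE,
	LRU_ACTIVE_FILE,
	LRU_IMMEDIATE,
	NR_LRU_LISTS
};

/* Hypothetical stand-ins for the two page flags that matter here. */
struct page_sketch {
	bool active;	/* models PageActive */
	bool reclaim;	/* models PageReclaim */
};

struct zone_sketch {
	long nr[NR_LRU_LISTS];	/* per-list page counts */
};

/* Inactive pages marked for reclaim go to the immediate list. */
enum lru_list putback_target(const struct page_sketch *page)
{
	if (page->active)
		return LRU_ACTIVE_FILE;
	return page->reclaim ? LRU_IMMEDIATE : LRU_INACTIVE_FILE;
}

/* The counter is incremented for whichever list was selected. */
enum lru_list putback_page(struct zone_sketch *zone,
			   const struct page_sketch *page)
{
	enum lru_list lru = putback_target(page);

	zone->nr[lru]++;
	return lru;
}

/* Models the end of writeback: leave the immediate list and move
 * to the inactive list, decrementing the immediate counter. */
void writeback_done(struct zone_sketch *zone, struct page_sketch *page)
{
	zone->nr[LRU_IMMEDIATE]--;
	page->reclaim = false;
	zone->nr[LRU_INACTIVE_FILE]++;
}

/* Usage example: one increment is matched by one decrement. */
void check_putback_accounting(void)
{
	struct zone_sketch zone = { { 0 } };
	struct page_sketch page = { .active = false, .reclaim = true };
	enum lru_list lru = putback_page(&zone, &page);

	assert(lru == LRU_IMMEDIATE);
	writeback_done(&zone, &page);
	assert(zone.nr[LRU_IMMEDIATE] == 0);
}
```

The model only balances when the decrement path is reached for a page
that was actually counted on the immediate list; any path that removes
the page without knowing which list it was on is outside this sketch.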
* Re: [PATCH 11/11] mm: Isolate pages for immediate reclaim on their own LRU
2011-12-20 9:55 ` Mel Gorman
@ 2011-12-23 19:08 ` Hugh Dickins
-1 siblings, 0 replies; 100+ messages in thread
From: Hugh Dickins @ 2011-12-23 19:08 UTC (permalink / raw)
To: Mel Gorman
Cc: Minchan Kim, Andrew Morton, Andrea Arcangeli, Minchan Kim,
Dave Jones, Jan Kara, Andy Isaacson, Johannes Weiner,
Rik van Riel, Nai Xia, Linux-MM, LKML
Sorry, Mel, I've had to revert this patch (and its two little children)
from my 3.2.0-rc6-next-20111222 testing: you really do need a page flag
(or substitute) for your "immediate" lru.
How else can a del_page_from_lru[_list]() know whether to decrement
the count of the immediate or the inactive list? page_lru() says to
decrement the count of the inactive list, so in due course that wraps
to a gigantic number, and then page reclaim livelocks trying to wring
pages out of an empty list. It's the memcg case I've been hitting,
but presumably the same happens with global counts.
There is another such accounting bug in -next, been there longer and
not so easy to hit: I'm fairly sure it will turn out to be memcg
misaccounting a THPage somewhere, I'll have a look around shortly.
Hugh
p.s. Immediate? Isn't that an odd name for a list of pages which are
not immediately freeable? Maybe Rik's launder/laundry name would be
better: pages which are currently being cleaned.
^ permalink raw reply [flat|nested] 100+ messages in thread
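[Editor's sketch, not part of the original thread] The failure reported
above can be modelled in a few lines of userspace C. The assumptions are
loud ones: counters are plain unsigned long values, and the only
information available when deleting a page is an active/inactive flag, so
the lookup can never answer "immediate". All names ending in _sketch, plus
add_to_immediate() and demo_underflow(), are hypothetical and stand in for
the kernel helpers named in the message.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

enum lru_list {
	LRU_INACTIVE_FILE,
	LRU_ACTIVE_FILE,
	LRU_IMMEDIATE,
	NR_LRU_LISTS
};

/* No flag records membership of the immediate list. */
struct page_sketch {
	bool active;
};

struct zone_sketch {
	unsigned long nr[NR_LRU_LISTS];
};

/* Derives the list from flags alone, so it cannot return
 * LRU_IMMEDIATE. */
enum lru_list page_lru_sketch(const struct page_sketch *page)
{
	return page->active ? LRU_ACTIVE_FILE : LRU_INACTIVE_FILE;
}

/* The page is counted on the immediate list when it is added. */
void add_to_immediate(struct zone_sketch *zone)
{
	zone->nr[LRU_IMMEDIATE]++;
}

/* Deletion trusts the flag-derived list and so decrements the
 * wrong counter for a page that sits on the immediate list. */
void del_page_from_lru_sketch(struct zone_sketch *zone,
			      const struct page_sketch *page)
{
	zone->nr[page_lru_sketch(page)]--;
}

/* Usage example: one add plus one delete corrupts two counters. */
void demo_underflow(void)
{
	struct zone_sketch zone = { { 0 } };
	struct page_sketch page = { .active = false };

	add_to_immediate(&zone);
	del_page_from_lru_sketch(&zone, &page);

	assert(zone.nr[LRU_IMMEDIATE] == 1);		  /* never decremented */
	assert(zone.nr[LRU_INACTIVE_FILE] == ULONG_MAX); /* wrapped around */
}
```

In the sketch a single mismatched delete wraps the empty inactive count.
In the report the wrap happens "in due course" because the inactive list
normally holds other pages that absorb the first bad decrements.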
* Re: [PATCH 11/11] mm: Isolate pages for immediate reclaim on their own LRU
2011-12-23 19:08 ` Hugh Dickins
@ 2011-12-29 16:59 ` Mel Gorman
-1 siblings, 0 replies; 100+ messages in thread
From: Mel Gorman @ 2011-12-29 16:59 UTC (permalink / raw)
To: Hugh Dickins
Cc: Minchan Kim, Andrew Morton, Andrea Arcangeli, Minchan Kim,
Dave Jones, Jan Kara, Andy Isaacson, Johannes Weiner,
Rik van Riel, Nai Xia, Linux-MM, LKML
I was offline for several days for the holidays and I'm not back
online properly until Jan 4th, hence the delay in responding.
On Fri, Dec 23, 2011 at 11:08:19AM -0800, Hugh Dickins wrote:
> Sorry, Mel, I've had to revert this patch (and its two little children)
> from my 3.2.0-rc6-next-20111222 testing: you really do need a page flag
> (or substitute) for your "immediate" lru.
>
Don't be sorry at all. I prefer that this was caught before merging
to mainline and thanks for catching this.
> How else can a del_page_from_lru[_list]() know whether to decrement
> the count of the immediate or the inactive list?
You are right, it cannot and because pages are removed from the
LRU list in contexts such as invalidating a mapping, we cannot be
sure whether a page is on the immediate LRU or inactive_file in all
cases. It is further complicated by the fact that PageReclaim and
PageReadahead use the same page flag.
> page_lru() says to
> decrement the count of the inactive list, so in due course that wraps
> to a gigantic number, and then page reclaim livelocks trying to wring
> pages out of an empty list. It's the memcg case I've been hitting,
> but presumably the same happens with global counts.
>
I've verified that the accounting can break. I did not see it wrap
negative because in my testing it was rare the problem occurred but it
would happen eventually.
I considered a few ways of fixing this. The obvious one is to add a
new page flag but that is difficult to justify as the high-cpu-usage
problem should only occur when there is a lot of writeback to slow
storage which I believe is a rare case. It is not a suitable use for
an extended page flag.
The second was to keep these PageReclaim pages off the LRU but this
leads to complications of its own.
The third was to use a combination of flags to mark pages that
are on the immediate LRU such as how PG_compound and PG_reclaim in
combination mark tail pages. This would not be free of races and would
eventually cause corruption. There is also the problem that we cannot
atomically set multiple bits so setting the bits in contexts such as
set_page_dirty() may be problematic.
Andrew, as there is not an easy uncontroversial fix can you remove
the following patches from mmotm please?
mm-isolate-pages-for-immediate-reclaim-on-their-own-lru.patch
mm-isolate-pages-for-immediate-reclaim-on-their-own-lru-fix.patch
mm-isolate-pages-for-immediate-reclaim-on-their-own-lru-fix-2.patch
The impact is that users writing to slow storage may see higher CPU usage
as the pages under writeback have to be skipped by scanning once the
dirty pages move to the end of the LRU list. I'm assuming once they
are removed from mmotm that they also get removed from linux-next.
> There is another such accounting bug in -next, been there longer and
> not so easy to hit: I'm fairly sure it will turn out to be memcg
> misaccounting a THPage somewhere, I'll have a look around shortly.
>
> p.s. Immediate? Isn't that an odd name for a list of pages which are
> not immediately freeable? Maybe Rik's launder/laundry name would be
> better: pages which are currently being cleaned.
That is potentially very misleading as not all pages being laundered are
on that list. reclaim_writeback might be a better name.
Thanks.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH 11/11] mm: Isolate pages for immediate reclaim on their own LRU
2011-12-29 16:59 ` Mel Gorman
@ 2011-12-29 19:31 ` Rik van Riel
-1 siblings, 0 replies; 100+ messages in thread
From: Rik van Riel @ 2011-12-29 19:31 UTC (permalink / raw)
To: Mel Gorman
Cc: Hugh Dickins, Minchan Kim, Andrew Morton, Andrea Arcangeli,
Minchan Kim, Dave Jones, Jan Kara, Andy Isaacson,
Johannes Weiner, Nai Xia, Linux-MM, LKML
On 12/29/2011 11:59 AM, Mel Gorman wrote:
> I considered a few ways of fixing this. The obvious one is to add a
> new page flag but that is difficult to justify as the high-cpu-usage
> problem should only occur when there is a lot of writeback to slow
> storage which I believe is a rare case. It is not a suitable use for
> an extended page flag.
Actually, don't we already have three LRU related
bits in the page flags?
We could stop using those as bit flags, and use
them as a number instead. That way we could encode
up to 7 or 8 (depending on how we use all-zeroes)
LRU lists with the number of bits we have now.
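A rough sketch of that encoding, as userspace C rather than kernel code.
The assumptions are mine: the three bits are treated as adjacent at a
made-up shift (the real page flag bits need not be adjacent), all-zeroes
is reserved to mean "not on any LRU", and every sketch_* name is
hypothetical.

```c
#include <assert.h>

/* Hypothetical layout: three adjacent bits of a flags word hold an index. */
#define SKETCH_LRU_SHIFT 0
#define SKETCH_LRU_MASK  (0x7UL << SKETCH_LRU_SHIFT)

/* All-zeroes is reserved here, which leaves 7 usable list indexes. */
enum sketch_lru_index {
	SKETCH_LRU_NONE = 0,
	SKETCH_LRU_INACTIVE_ANON,
	SKETCH_LRU_ACTIVE_ANON,
	SKETCH_LRU_INACTIVE_FILE,
	SKETCH_LRU_ACTIVE_FILE,
	SKETCH_LRU_IMMEDIATE,
	SKETCH_LRU_UNEVICTABLE,
};

/* Return flags with the 3-bit field replaced by idx; other bits are kept. */
static unsigned long sketch_set_lru(unsigned long flags,
				    enum sketch_lru_index idx)
{
	assert((unsigned long)idx <= 7);
	return (flags & ~SKETCH_LRU_MASK) |
	       ((unsigned long)idx << SKETCH_LRU_SHIFT);
}

/* Decode the 3-bit field back into a list index. */
static enum sketch_lru_index sketch_get_lru(unsigned long flags)
{
	return (enum sketch_lru_index)
		((flags & SKETCH_LRU_MASK) >> SKETCH_LRU_SHIFT);
}
```

Note that the set operation is a read-modify-write of the whole field,
not a single bit operation, so this sketch says nothing about how the
update would be made safe against concurrent changes.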
--
All rights reversed
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH 11/11] mm: Isolate pages for immediate reclaim on their own LRU
2011-12-29 19:31 ` Rik van Riel
@ 2011-12-30 11:27 ` Mel Gorman
-1 siblings, 0 replies; 100+ messages in thread
From: Mel Gorman @ 2011-12-30 11:27 UTC (permalink / raw)
To: Rik van Riel
Cc: Hugh Dickins, Minchan Kim, Andrew Morton, Andrea Arcangeli,
Minchan Kim, Dave Jones, Jan Kara, Andy Isaacson,
Johannes Weiner, Nai Xia, Linux-MM, LKML
On Thu, Dec 29, 2011 at 02:31:20PM -0500, Rik van Riel wrote:
> On 12/29/2011 11:59 AM, Mel Gorman wrote:
>
> >I considered a few ways of fixing this. The obvious one is to add a
> >new page flag but that is difficult to justify as the high-cpu-usage
> >problem should only occur when there is a lot of writeback to slow
> >storage which I believe is a rare case. It is not a suitable use for
> >an extended page flag.
>
> Actually, don't we already have three LRU related
> bits in the page flags?
>
Yes - PG_active, PG_unevictable and PG_swapbacked
> We could stop using those as bit flags, and use
> them as a number instead. That way we could encode
> up to 7 or 8 (depending on how we use all-zeroes)
> LRU lists with the number of bits we have now.
>
I wondered about this and I felt there were two problems.
One was reading and updating them atomically. To do this safely,
the page would either need to be locked, be isolated from the LRU
without any other references, or be protected by zone->lru_lock.
For the most part we are accessing these bits under the page lock
and in cases such as rotate_reclaimable_page()[1] or truncation that do
not necessarily hold the page lock, we would depend on zone->lru_lock to
prevent parallel changes (particularly updating PageActive). I did not
spot a case where we were not protected by some combination of the page
lock and zone->lru_lock so it should be fine but there might be a corner
case I missed. Can you think of one? If a case is missed, it means
that it is possible to get an invalid LRU index leading to corruption.
The other problem is that certain operations become more expensive. We
can no longer check one bit for PageActive for example. We'd have
to read the LRU index and see if it corresponds to an activated
page or not. This is not insurmountable but there would be a small
hit for any path that currently checks PageSwapBacked, PageActive
or PageUnevictable.
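To make the extra cost concrete, here is a small userspace sketch of the
two styles of check. It is an illustration only: the bit position, the
index values and all sketch_* names are hypothetical and do not match the
real page flag layout.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit-flag style: a single bit answers "is this page active?". */
#define SKETCH_PG_ACTIVE (1UL << 0)

static bool sketch_active_bitflag(unsigned long flags)
{
	return (flags & SKETCH_PG_ACTIVE) != 0;
}

/* Index style: a 3-bit field names the list the page is on. */
#define SKETCH_LRU_MASK 0x7UL

enum {
	SKETCH_IDX_INACTIVE_ANON = 1,
	SKETCH_IDX_ACTIVE_ANON,
	SKETCH_IDX_INACTIVE_FILE,
	SKETCH_IDX_ACTIVE_FILE,
};

static bool sketch_active_indexed(unsigned long flags)
{
	unsigned long idx = flags & SKETCH_LRU_MASK;

	/* An index outside the defined lists would indicate corruption. */
	assert(idx <= SKETCH_IDX_ACTIVE_FILE);

	/* Mask, then compare against every list that counts as active. */
	return idx == SKETCH_IDX_ACTIVE_ANON || idx == SKETCH_IDX_ACTIVE_FILE;
}
```

The indexed variant needs a mask plus one comparison per active list
where the bit-flag variant needs one test, which is the small hit
described above.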
[1] I noticed another bug in the LRU immediate patch. It's possible
to call pagevec_putback_from_immediate on a page isolated for reclaim
because the check for PageLRU is wrong.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH 08/11] mm: compaction: Introduce sync-light migration for use by compaction
2011-12-14 15:41 ` Mel Gorman
@ 2012-01-13 21:25 ` Andrew Morton
-1 siblings, 0 replies; 100+ messages in thread
From: Andrew Morton @ 2012-01-13 21:25 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Minchan Kim, Dave Jones, Jan Kara,
Andy Isaacson, Johannes Weiner, Rik van Riel, Nai Xia, Linux-MM,
LKML
On Wed, 14 Dec 2011 15:41:30 +0000
Mel Gorman <mgorman@suse.de> wrote:
> This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
> mode that avoids writing back pages to backing storage. Async
> compaction maps to MIGRATE_ASYNC while sync compaction maps to
> MIGRATE_SYNC_LIGHT. For other migrate_pages users such as memory
> hotplug, MIGRATE_SYNC is used.
>
> This avoids sync compaction stalling for an excessive length of time,
> particularly when copying files to a USB stick where there might be
> a large number of dirty pages backed by a filesystem that does not
> support ->writepages.
>
> ...
>
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -525,6 +525,7 @@ enum positive_aop_returns {
> struct page;
> struct address_space;
> struct writeback_control;
> +enum migrate_mode;
>
> struct iov_iter {
> const struct iovec *iov;
> @@ -614,7 +615,7 @@ struct address_space_operations {
> * is false, it must not block.
> */
> int (*migratepage) (struct address_space *,
> - struct page *, struct page *, bool);
> + struct page *, struct page *, enum migrate_mode);
I'm getting a huge warning spew from this with my sparc64 gcc-3.4.5.
I'm not sure why, really.
Forward-declaring an enum in this fashion is problematic because some
compilers (I'm unsure about gcc) use different sizeofs for enums,
depending on the enum's value range. For example, an enum which only
has values 0...255 can fit into a byte. (iirc, the compiler actually
put it in a 16-bit storage).
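As a small illustration of the portable alternative (a sketch with
hypothetical sketch_* names, not the kernel declarations): ISO C only
permits "enum tag" without an enumerator list once the type is complete,
so the full definition is made visible before the first prototype that
takes the enum by value.

```c
#include <assert.h>

/*
 * Complete definition first. A bare "enum sketch_mode;" at this point
 * would be a forward declaration of an incomplete type, which ISO C
 * does not allow and which some compilers warn about when the type is
 * then used as a parameter.
 */
enum sketch_mode {
	SKETCH_ASYNC,
	SKETCH_SYNC_LIGHT,
	SKETCH_SYNC,
};

/* The size of the enum is now known at every use that follows. */
static int sketch_may_block(enum sketch_mode mode)
{
	assert(mode == SKETCH_ASYNC || mode == SKETCH_SYNC_LIGHT ||
	       mode == SKETCH_SYNC);
	return mode != SKETCH_ASYNC;
}

/* A callback table taking the enum by value, as an ops struct would. */
struct sketch_ops {
	int (*may_block)(enum sketch_mode mode);
};

static const struct sketch_ops sketch_default_ops = {
	.may_block = sketch_may_block,
};
```

In a multi-header setting the same effect comes from putting the enum
in a small header of its own and including it wherever the prototype
lives.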
So I propose:
From: Andrew Morton <akpm@linux-foundation.org>
Subject: mm: fix warnings regarding enum migrate_mode
sparc64 allmodconfig:
In file included from include/linux/compat.h:15,
from /usr/src/25/arch/sparc/include/asm/siginfo.h:19,
from include/linux/signal.h:5,
from include/linux/sched.h:73,
from arch/sparc/kernel/asm-offsets.c:13:
include/linux/fs.h:618: warning: parameter has incomplete type
It seems that my sparc64 compiler (gcc-3.4.5) doesn't like the forward
declaration of enums.
Fix this by moving the "enum migrate_mode" definition into its own header
file.
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
include/linux/fs.h | 2 +-
include/linux/migrate.h | 14 +-------------
include/linux/migrate_mode.h | 16 ++++++++++++++++
3 files changed, 18 insertions(+), 14 deletions(-)
diff -puN include/linux/fs.h~mm-fix-warnings-regarding-enum-migrate_mode include/linux/fs.h
--- a/include/linux/fs.h~mm-fix-warnings-regarding-enum-migrate_mode
+++ a/include/linux/fs.h
@@ -10,6 +10,7 @@
#include <linux/ioctl.h>
#include <linux/blk_types.h>
#include <linux/types.h>
+#include <linux/migrate_mode.h>
/*
* It's silly to have NR_OPEN bigger than NR_FILE, but you can change
@@ -525,7 +526,6 @@ enum positive_aop_returns {
struct page;
struct address_space;
struct writeback_control;
-enum migrate_mode;
struct iov_iter {
const struct iovec *iov;
diff -puN include/linux/migrate.h~mm-fix-warnings-regarding-enum-migrate_mode include/linux/migrate.h
--- a/include/linux/migrate.h~mm-fix-warnings-regarding-enum-migrate_mode
+++ a/include/linux/migrate.h
@@ -3,22 +3,10 @@
#include <linux/mm.h>
#include <linux/mempolicy.h>
+#include <linux/migrate_mode.h>
typedef struct page *new_page_t(struct page *, unsigned long private, int **);
-/*
- * MIGRATE_ASYNC means never block
- * MIGRATE_SYNC_LIGHT in the current implementation means to allow blocking
- * on most operations but not ->writepage as the potential stall time
- * is too significant
- * MIGRATE_SYNC will block when migrating pages
- */
-enum migrate_mode {
- MIGRATE_ASYNC,
- MIGRATE_SYNC_LIGHT,
- MIGRATE_SYNC,
-};
-
#ifdef CONFIG_MIGRATION
#define PAGE_MIGRATION 1
diff -puN /dev/null include/linux/migrate_mode.h
--- /dev/null
+++ a/include/linux/migrate_mode.h
@@ -0,0 +1,16 @@
+#ifndef MIGRATE_MODE_H_INCLUDED
+#define MIGRATE_MODE_H_INCLUDED
+/*
+ * MIGRATE_ASYNC means never block
+ * MIGRATE_SYNC_LIGHT in the current implementation means to allow blocking
+ * on most operations but not ->writepage as the potential stall time
+ * is too significant
+ * MIGRATE_SYNC will block when migrating pages
+ */
+enum migrate_mode {
+ MIGRATE_ASYNC,
+ MIGRATE_SYNC_LIGHT,
+ MIGRATE_SYNC,
+};
+
+#endif /* MIGRATE_MODE_H_INCLUDED */
_
^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH 08/11] mm: compaction: Introduce sync-light migration for use by compaction
2012-01-13 21:25 ` Andrew Morton
@ 2012-01-16 11:33 ` Mel Gorman
-1 siblings, 0 replies; 100+ messages in thread
From: Mel Gorman @ 2012-01-16 11:33 UTC (permalink / raw)
To: Andrew Morton
Cc: Andrea Arcangeli, Minchan Kim, Dave Jones, Jan Kara,
Andy Isaacson, Johannes Weiner, Rik van Riel, Nai Xia,
Tetsuo Handa, Linux-MM, LKML
On Fri, Jan 13, 2012 at 01:25:40PM -0800, Andrew Morton wrote:
> On Wed, 14 Dec 2011 15:41:30 +0000
> Mel Gorman <mgorman@suse.de> wrote:
>
> > This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
> > mode that avoids writing back pages to backing storage. Async
> > compaction maps to MIGRATE_ASYNC while sync compaction maps to
> > MIGRATE_SYNC_LIGHT. For other migrate_pages users such as memory
> > hotplug, MIGRATE_SYNC is used.
> >
> > This avoids sync compaction stalling for an excessive length of time,
> > particularly when copying files to a USB stick where there might be
> > a large number of dirty pages backed by a filesystem that does not
> > support ->writepages.
> >
> > ...
> >
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -525,6 +525,7 @@ enum positive_aop_returns {
> > struct page;
> > struct address_space;
> > struct writeback_control;
> > +enum migrate_mode;
> >
> > struct iov_iter {
> > const struct iovec *iov;
> > @@ -614,7 +615,7 @@ struct address_space_operations {
> > * is false, it must not block.
> > */
> > int (*migratepage) (struct address_space *,
> > - struct page *, struct page *, bool);
> > + struct page *, struct page *, enum migrate_mode);
>
> I'm getting a huge warning spew from this with my sparc64 gcc-3.4.5.
> I'm not sure why, really.
>
Tetsuo Handa complained about the same thing using gcc 3.3 (added
to cc).
> Forward-declaring an enum in this fashion is problematic because some
> compilers (I'm unsure about gcc) use different sizeofs for enums,
> depending on the enum's value range. For example, an enum which only
> has values 0...255 can fit into a byte. (iirc, the compiler actually
> put it in a 16-bit storage).
>
Ok, I was not aware of this. Thanks for the heads-up.
> So I propose:
>
> From: Andrew Morton <akpm@linux-foundation.org>
> Subject: mm: fix warnings regarding enum migrate_mode
>
> sparc64 allmodconfig:
>
> In file included from include/linux/compat.h:15,
> from /usr/src/25/arch/sparc/include/asm/siginfo.h:19,
> from include/linux/signal.h:5,
> from include/linux/sched.h:73,
> from arch/sparc/kernel/asm-offsets.c:13:
> include/linux/fs.h:618: warning: parameter has incomplete type
>
> It seems that my sparc64 compiler (gcc-3.4.5) doesn't like the forward
> declaration of enums.
>
> Fix this by moving the "enum migrate_mode" definition into its own header
> file.
>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> Cc: Dave Jones <davej@redhat.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Andy Isaacson <adi@hexapodia.org>
> Cc: Nai Xia <nai.xia@gmail.com>
> Cc: Johannes Weiner <jweiner@redhat.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mel Gorman <mgorman@suse.de>
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 100+ messages in thread
* [PATCH 11/11] mm: Isolate pages for immediate reclaim on their own LRU
2011-12-01 17:36 [PATCH 0/11] Reduce compaction-related stalls and improve asynchronous migration of dirty pages v5 Mel Gorman
@ 2011-12-01 17:36 ` Mel Gorman
0 siblings, 0 replies; 100+ messages in thread
From: Mel Gorman @ 2011-12-01 17:36 UTC (permalink / raw)
To: Linux-MM
Cc: Andrea Arcangeli, Minchan Kim, Jan Kara, Andy Isaacson,
Johannes Weiner, Mel Gorman, Rik van Riel, Nai Xia, LKML
It was observed that scan rates from direct reclaim during tests
writing to both fast and slow storage were extraordinarily high. The
problem was that while pages were being marked for immediate reclaim
when writeback completed, the same pages were being encountered over
and over again during LRU scanning.
This patch isolates file-backed pages that are to be reclaimed when
clean on their own LRU list.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
include/linux/mmzone.h | 2 +
include/linux/vm_event_item.h | 1 +
mm/page_alloc.c | 5 ++-
mm/swap.c | 74 ++++++++++++++++++++++++++++++++++++++---
mm/vmscan.c | 11 ++++++
mm/vmstat.c | 2 +
6 files changed, 89 insertions(+), 6 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ac5b522..80834eb 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -84,6 +84,7 @@ enum zone_stat_item {
NR_ACTIVE_ANON, /* " " " " " */
NR_INACTIVE_FILE, /* " " " " " */
NR_ACTIVE_FILE, /* " " " " " */
+ NR_IMMEDIATE, /* " " " " " */
NR_UNEVICTABLE, /* " " " " " */
NR_MLOCK, /* mlock()ed pages found and moved off LRU */
NR_ANON_PAGES, /* Mapped anonymous pages */
@@ -136,6 +137,7 @@ enum lru_list {
LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
+ LRU_IMMEDIATE,
LRU_UNEVICTABLE,
NR_LRU_LISTS
};
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 03b90cdc..9696fda 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -36,6 +36,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
KSWAPD_SKIP_CONGESTION_WAIT,
PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+ PGRESCUED,
#ifdef CONFIG_COMPACTION
COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d979376..9e3cd8d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2590,7 +2590,7 @@ void show_free_areas(unsigned int filter)
printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n"
" active_file:%lu inactive_file:%lu isolated_file:%lu\n"
- " unevictable:%lu"
+ " immediate:%lu unevictable:%lu"
" dirty:%lu writeback:%lu unstable:%lu\n"
" free:%lu slab_reclaimable:%lu slab_unreclaimable:%lu\n"
" mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n",
@@ -2600,6 +2600,7 @@ void show_free_areas(unsigned int filter)
global_page_state(NR_ACTIVE_FILE),
global_page_state(NR_INACTIVE_FILE),
global_page_state(NR_ISOLATED_FILE),
+ global_page_state(NR_IMMEDIATE),
global_page_state(NR_UNEVICTABLE),
global_page_state(NR_FILE_DIRTY),
global_page_state(NR_WRITEBACK),
@@ -2627,6 +2628,7 @@ void show_free_areas(unsigned int filter)
" inactive_anon:%lukB"
" active_file:%lukB"
" inactive_file:%lukB"
+ " immediate:%lukB"
" unevictable:%lukB"
" isolated(anon):%lukB"
" isolated(file):%lukB"
@@ -2655,6 +2657,7 @@ void show_free_areas(unsigned int filter)
K(zone_page_state(zone, NR_INACTIVE_ANON)),
K(zone_page_state(zone, NR_ACTIVE_FILE)),
K(zone_page_state(zone, NR_INACTIVE_FILE)),
+ K(zone_page_state(zone, NR_IMMEDIATE)),
K(zone_page_state(zone, NR_UNEVICTABLE)),
K(zone_page_state(zone, NR_ISOLATED_ANON)),
K(zone_page_state(zone, NR_ISOLATED_FILE)),
diff --git a/mm/swap.c b/mm/swap.c
index a91caf7..9973975 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -39,6 +39,7 @@ int page_cluster;
static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_putback_immediate_pvecs);
static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
/*
@@ -255,24 +256,80 @@ static void pagevec_move_tail(struct pagevec *pvec)
}
/*
+ * Similar pair of functions to pagevec_move_tail except it is called when
+ * moving a page from the LRU_IMMEDIATE to one of the [in]active_[file|anon]
+ * lists
+ */
+static void pagevec_putback_immediate_fn(struct page *page, void *arg)
+{
+ struct zone *zone = page_zone(page);
+
+ if (PageLRU(page)) {
+ enum lru_list lru = page_lru(page);
+ list_move(&page->lru, &zone->lru[lru].list);
+ }
+}
+
+static void pagevec_putback_immediate(struct pagevec *pvec)
+{
+ pagevec_lru_move_fn(pvec, pagevec_putback_immediate_fn, NULL);
+}
+
+/*
* Writeback is about to end against a page which has been marked for immediate
* reclaim. If it still appears to be reclaimable, move it to the tail of the
* inactive list.
*/
void rotate_reclaimable_page(struct page *page)
{
+ struct zone *zone = page_zone(page);
+ struct list_head *page_list;
+ struct pagevec *pvec;
+ unsigned long flags;
+
+ page_cache_get(page);
+ local_irq_save(flags);
+ __mod_zone_page_state(zone, NR_IMMEDIATE, -1);
+
if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
!PageUnevictable(page) && PageLRU(page)) {
- struct pagevec *pvec;
- unsigned long flags;
- page_cache_get(page);
- local_irq_save(flags);
pvec = &__get_cpu_var(lru_rotate_pvecs);
if (!pagevec_add(pvec, page))
pagevec_move_tail(pvec);
- local_irq_restore(flags);
+ } else {
+ pvec = &__get_cpu_var(lru_putback_immediate_pvecs);
+ if (!pagevec_add(pvec, page))
+ pagevec_putback_immediate(pvec);
+ }
+
+ /*
+ * There is a potential race: if a page is set PageReclaim
+ * and moved to the LRU_IMMEDIATE list after writeback completed,
+ * it can be left on the LRU_IMMEDIATE list with no way for
+ * reclaim to find it.
+ *
+ * This race should be very rare but count how often it happens.
+ * If it is a continual race, then it's very unsatisfactory as there
+ * is no guarantee that rotate_reclaimable_page() will be called
+ * to rescue these pages but finding them in page reclaim is also
+ * problematic due to the problem of deciding when the right time
+ * to scan this list is.
+ */
+ page_list = &zone->lru[LRU_IMMEDIATE].list;
+ if (!zone_page_state(zone, NR_IMMEDIATE) && !list_empty(page_list)) {
+ struct page *page;
+
+ spin_lock(&zone->lru_lock);
+ while (!list_empty(page_list)) {
+ page = list_entry(page_list->prev, struct page, lru);
+ list_move(&page->lru, &zone->lru[page_lru(page)].list);
+ __count_vm_event(PGRESCUED);
+ }
+ spin_unlock(&zone->lru_lock);
}
+
+ local_irq_restore(flags);
}
static void update_page_reclaim_stat(struct zone *zone, struct page *page,
@@ -475,6 +532,13 @@ static void lru_deactivate_fn(struct page *page, void *arg)
* is _really_ small and it's non-critical problem.
*/
SetPageReclaim(page);
+
+ /*
+ * Move to the LRU_IMMEDIATE list to avoid being scanned
+ * by page reclaim uselessly.
+ */
+ list_move_tail(&page->lru, &zone->lru[LRU_IMMEDIATE].list);
+ __mod_zone_page_state(zone, NR_IMMEDIATE, 1);
} else {
/*
* The page's writeback ends up during pagevec
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b0eeec7..9879ae5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1404,6 +1404,17 @@ putback_lru_pages(struct zone *zone, struct scan_control *sc,
}
SetPageLRU(page);
lru = page_lru(page);
+
+ /*
+ * If reclaim has tagged a file page for reclaim, move it to
+ * a separate LRU list to avoid it being scanned by other
+ * users. It is expected that as writeback completes, such
+ * pages are taken back off and moved to the normal LRU.
+ */
+ if (lru == LRU_INACTIVE_FILE &&
+ PageReclaim(page) && PageWriteback(page))
+ lru = LRU_IMMEDIATE;
+
add_page_to_lru_list(zone, page, lru);
if (is_active_lru(lru)) {
int file = is_file_lru(lru);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 8fd603b..dbfec4c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -688,6 +688,7 @@ const char * const vmstat_text[] = {
"nr_active_anon",
"nr_inactive_file",
"nr_active_file",
+ "nr_immediate",
"nr_unevictable",
"nr_mlock",
"nr_anon_pages",
@@ -756,6 +757,7 @@ const char * const vmstat_text[] = {
"allocstall",
"pgrotated",
+ "pgrescued",
#ifdef CONFIG_COMPACTION
"compact_blocks_moved",
--
1.7.3.4
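The rescue loop added to rotate_reclaimable_page() repeatedly takes the tail entry of the LRU_IMMEDIATE list and moves it to another list, counting one PGRESCUED event per page. A minimal user-space sketch of that drain pattern follows. Everything in it is an assumption made for illustration: `struct list_node` and its helpers are a hand-written imitation of the kernel's circular doubly linked list, not the kernel's own list.h, and `drain_from_tail()` and `demo_rescue()` are hypothetical names. Locking and the NR_IMMEDIATE check that triggers the loop are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_node {
	struct list_node *prev, *next;
};

static void list_init(struct list_node *head)
{
	head->prev = head->next = head;
}

static int list_is_empty(const struct list_node *head)
{
	return head->next == head;
}

/* Unlink `node` from the list it is on and add it at the head of `to`. */
static void list_move_to_head(struct list_node *node, struct list_node *to)
{
	node->prev->next = node->next;
	node->next->prev = node->prev;

	node->next = to->next;
	node->prev = to;
	to->next->prev = node;
	to->next = node;
}

/*
 * Drain `from` onto `to`, always taking the tail entry, and return how
 * many entries were moved (the patch counts each move as PGRESCUED).
 */
size_t drain_from_tail(struct list_node *from, struct list_node *to)
{
	size_t moved = 0;

	while (!list_is_empty(from)) {
		list_move_to_head(from->prev, to);
		moved++;
	}
	return moved;
}

/* Hypothetical demo: strand three entries, then rescue them. */
size_t demo_rescue(void)
{
	struct list_node immediate, inactive, pages[3];
	size_t moved;

	list_init(&immediate);
	list_init(&inactive);
	for (size_t i = 0; i < 3; i++) {
		list_init(&pages[i]);
		list_move_to_head(&pages[i], &immediate);
	}

	moved = drain_from_tail(&immediate, &inactive);
	assert(list_is_empty(&immediate));
	/*
	 * pages[0] was the tail of the source, so it is moved first and
	 * ends up at the tail of the destination: relative order is kept.
	 */
	assert(inactive.prev == &pages[0]);
	return moved;
}
```

Taking entries from the tail and inserting each at the head of the destination preserves their relative order, which is why the sketch checks that the oldest entry is still at the tail afterwards. demo_rescue() returns 3 for the three stranded entries.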
end of thread, other threads:[~2012-01-16 11:33 UTC | newest]
Thread overview: 100+ messages
2011-12-14 15:41 [PATCH 0/11] Reduce compaction-related stalls and improve asynchronous migration of dirty pages v6 Mel Gorman
2011-12-14 15:41 ` Mel Gorman
2011-12-14 15:41 ` [PATCH 01/11] mm: compaction: Allow compaction to isolate dirty pages Mel Gorman
2011-12-14 15:41 ` Mel Gorman
2011-12-14 15:41 ` [PATCH 02/11] mm: compaction: Use synchronous compaction for /proc/sys/vm/compact_memory Mel Gorman
2011-12-14 15:41 ` Mel Gorman
2011-12-14 15:41 ` [PATCH 03/11] mm: vmscan: Check if we isolated a compound page during lumpy scan Mel Gorman
2011-12-14 15:41 ` Mel Gorman
2011-12-15 23:21 ` Rik van Riel
2011-12-15 23:21 ` Rik van Riel
2011-12-14 15:41 ` [PATCH 04/11] mm: vmscan: Do not OOM if aborting reclaim to start compaction Mel Gorman
2011-12-14 15:41 ` Mel Gorman
2011-12-15 23:36 ` Rik van Riel
2011-12-15 23:36 ` Rik van Riel
2011-12-14 15:41 ` [PATCH 05/11] mm: compaction: Determine if dirty pages can be migrated without blocking within ->migratepage Mel Gorman
2011-12-14 15:41 ` Mel Gorman
2011-12-16 3:32 ` Rik van Riel
2011-12-16 3:32 ` Rik van Riel
2011-12-16 23:20 ` Andrew Morton
2011-12-16 23:20 ` Andrew Morton
2011-12-17 3:03 ` Nai Xia
2011-12-17 3:03 ` Nai Xia
2011-12-17 3:26 ` Andrew Morton
2011-12-17 3:26 ` Andrew Morton
2011-12-19 11:05 ` Mel Gorman
2011-12-19 11:05 ` Mel Gorman
2011-12-19 13:12 ` nai.xia
2011-12-19 13:12 ` nai.xia
2011-12-14 15:41 ` [PATCH 06/11] mm: compaction: make isolate_lru_page() filter-aware again Mel Gorman
2011-12-14 15:41 ` Mel Gorman
2011-12-16 3:34 ` Rik van Riel
2011-12-16 3:34 ` Rik van Riel
2011-12-18 1:53 ` Minchan Kim
2011-12-18 1:53 ` Minchan Kim
2011-12-14 15:41 ` [PATCH 07/11] mm: page allocator: Do not call direct reclaim for THP allocations while compaction is deferred Mel Gorman
2011-12-14 15:41 ` Mel Gorman
2011-12-16 4:10 ` Rik van Riel
2011-12-16 4:10 ` Rik van Riel
2011-12-14 15:41 ` [PATCH 08/11] mm: compaction: Introduce sync-light migration for use by compaction Mel Gorman
2011-12-14 15:41 ` Mel Gorman
2011-12-16 4:31 ` Rik van Riel
2011-12-16 4:31 ` Rik van Riel
2011-12-18 2:05 ` Minchan Kim
2011-12-18 2:05 ` Minchan Kim
2011-12-19 11:45 ` Mel Gorman
2011-12-19 11:45 ` Mel Gorman
2011-12-20 7:18 ` Minchan Kim
2011-12-20 7:18 ` Minchan Kim
2012-01-13 21:25 ` Andrew Morton
2012-01-13 21:25 ` Andrew Morton
2012-01-16 11:33 ` Mel Gorman
2012-01-16 11:33 ` Mel Gorman
2011-12-14 15:41 ` [PATCH 09/11] mm: vmscan: When reclaiming for compaction, ensure there are sufficient free pages available Mel Gorman
2011-12-14 15:41 ` Mel Gorman
2011-12-16 4:35 ` Rik van Riel
2011-12-16 4:35 ` Rik van Riel
2011-12-14 15:41 ` [PATCH 10/11] mm: vmscan: Check if reclaim should really abort even if compaction_ready() is true for one zone Mel Gorman
2011-12-14 15:41 ` Mel Gorman
2011-12-16 4:38 ` Rik van Riel
2011-12-16 4:38 ` Rik van Riel
2011-12-16 11:29 ` Mel Gorman
2011-12-16 11:29 ` Mel Gorman
2011-12-14 15:41 ` [PATCH 11/11] mm: Isolate pages for immediate reclaim on their own LRU Mel Gorman
2011-12-14 15:41 ` Mel Gorman
2011-12-16 4:47 ` Rik van Riel
2011-12-16 4:47 ` Rik van Riel
2011-12-16 12:26 ` Mel Gorman
2011-12-16 12:26 ` Mel Gorman
2011-12-16 15:17 ` Johannes Weiner
2011-12-16 15:17 ` Johannes Weiner
2011-12-16 16:07 ` Mel Gorman
2011-12-16 16:07 ` Mel Gorman
2011-12-19 16:14 ` Johannes Weiner
2011-12-19 16:14 ` Johannes Weiner
2011-12-17 16:08 ` Minchan Kim
2011-12-17 16:08 ` Minchan Kim
2011-12-19 13:26 ` Mel Gorman
2011-12-19 13:26 ` Mel Gorman
2011-12-20 7:10 ` Minchan Kim
2011-12-20 7:10 ` Minchan Kim
2011-12-20 9:55 ` Mel Gorman
2011-12-20 9:55 ` Mel Gorman
2011-12-23 19:08 ` Hugh Dickins
2011-12-23 19:08 ` Hugh Dickins
2011-12-29 16:59 ` Mel Gorman
2011-12-29 16:59 ` Mel Gorman
2011-12-29 19:31 ` Rik van Riel
2011-12-29 19:31 ` Rik van Riel
2011-12-30 11:27 ` Mel Gorman
2011-12-30 11:27 ` Mel Gorman
2011-12-16 22:56 ` [PATCH 0/11] Reduce compaction-related stalls and improve asynchronous migration of dirty pages v6 Andrew Morton
2011-12-16 22:56 ` Andrew Morton
2011-12-19 14:40 ` Mel Gorman
2011-12-19 14:40 ` Mel Gorman
2011-12-16 23:37 ` Andrew Morton
2011-12-16 23:37 ` Andrew Morton
2011-12-19 14:20 ` Mel Gorman
2011-12-19 14:20 ` Mel Gorman
-- strict thread matches above, loose matches on Subject: below --
2011-12-01 17:36 [PATCH 0/11] Reduce compaction-related stalls and improve asynchronous migration of dirty pages v5 Mel Gorman
2011-12-01 17:36 ` [PATCH 11/11] mm: Isolate pages for immediate reclaim on their own LRU Mel Gorman
2011-12-01 17:36 ` Mel Gorman