linux-kernel.vger.kernel.org archive mirror
* [3.6 regression?] THP + migration/compaction livelock (I think)
@ 2012-11-13 22:13 Andy Lutomirski
  2012-11-13 23:11 ` David Rientjes
  0 siblings, 1 reply; 16+ messages in thread
From: Andy Lutomirski @ 2012-11-13 22:13 UTC (permalink / raw)
  To: linux-kernel, linux-mm

I've seen an odd problem three times in the past two weeks.  I suspect
a Linux 3.6 regression.  I'm on 3.6.3-1.fc17.x86_64.  When I run a parallel
compilation, it sometimes stops making progress: all CPUs are pegged at 100%
system time by the respective cc1plus processes.  Reading
/proc/<pid>/stack for those processes shows either

[<ffffffff8108e01a>] __cond_resched+0x2a/0x40
[<ffffffff8114e432>] isolate_migratepages_range+0xb2/0x620
[<ffffffff8114eba4>] compact_zone+0x144/0x410
[<ffffffff8114f152>] compact_zone_order+0x82/0xc0
[<ffffffff8114f271>] try_to_compact_pages+0xe1/0x130
[<ffffffff816143db>] __alloc_pages_direct_compact+0xaa/0x190
[<ffffffff81133d26>] __alloc_pages_nodemask+0x526/0x990
[<ffffffff81171496>] alloc_pages_vma+0xb6/0x190
[<ffffffff81182683>] do_huge_pmd_anonymous_page+0x143/0x340
[<ffffffff811549fd>] handle_mm_fault+0x27d/0x320
[<ffffffff81620adc>] do_page_fault+0x15c/0x4b0
[<ffffffff8161d625>] page_fault+0x25/0x30
[<ffffffffffffffff>] 0xffffffffffffffff

or

[<ffffffffffffffff>] 0xffffffffffffffff

seemingly at random (i.e. if I read that file twice in a row, I might
see different results).  If I had to guess, I'd say that

perf shows no 'faults'.  The livelock resolved after several minutes
(and before I got far enough with perf to get more useful results).
Every time this happens, firefox hangs but everything else keeps
working.

If I trigger it again, I'll try to grab /proc/zoneinfo and /proc/meminfo.
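
Something like this (an untested sketch; it assumes the spinning processes
are still the cc1plus ones) is roughly what I plan to run next time it
happens, so everything gets captured in one go:

	# snapshot the kernel stack of every cc1plus process,
	# plus zoneinfo and meminfo, while the livelock is in progress
	for pid in $(pgrep cc1plus); do
		echo "== $pid =="
		cat /proc/$pid/stack
	done > stacks.txt
	cat /proc/zoneinfo > zoneinfo.txt
	cat /proc/meminfo > meminfo.txt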

--Andy


* Re: [3.6 regression?] THP + migration/compaction livelock (I think)
  2012-11-13 22:13 [3.6 regression?] THP + migration/compaction livelock (I think) Andy Lutomirski
@ 2012-11-13 23:11 ` David Rientjes
  2012-11-13 23:25   ` Andy Lutomirski
  0 siblings, 1 reply; 16+ messages in thread
From: David Rientjes @ 2012-11-13 23:11 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Marc Duponcheel, Mel Gorman, linux-kernel, linux-mm

On Tue, 13 Nov 2012, Andy Lutomirski wrote:

> I've seen an odd problem three times in the past two weeks.  I suspect
> a Linux 3.6 regression.  I"m on 3.6.3-1.fc17.x86_64.  I run a parallel
> compilation, and no progress is made.  All cpus are pegged at 100%
> system time by the respective cc1plus processes.  Reading
> /proc/<pid>/stack shows either
> 
> [<ffffffff8108e01a>] __cond_resched+0x2a/0x40
> [<ffffffff8114e432>] isolate_migratepages_range+0xb2/0x620
> [<ffffffff8114eba4>] compact_zone+0x144/0x410
> [<ffffffff8114f152>] compact_zone_order+0x82/0xc0
> [<ffffffff8114f271>] try_to_compact_pages+0xe1/0x130
> [<ffffffff816143db>] __alloc_pages_direct_compact+0xaa/0x190
> [<ffffffff81133d26>] __alloc_pages_nodemask+0x526/0x990
> [<ffffffff81171496>] alloc_pages_vma+0xb6/0x190
> [<ffffffff81182683>] do_huge_pmd_anonymous_page+0x143/0x340
> [<ffffffff811549fd>] handle_mm_fault+0x27d/0x320
> [<ffffffff81620adc>] do_page_fault+0x15c/0x4b0
> [<ffffffff8161d625>] page_fault+0x25/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> or
> 
> [<ffffffffffffffff>] 0xffffffffffffffff
> 

This reminds me of the thread at http://marc.info/?t=135102111800004, where 
Marc's system reportedly went unresponsive much like yours, although in his 
case it also caused a reboot.  If your system is still running (or, even 
better, if you're able to capture this happening in real time), could you 
try to capture

	grep -E "compact_|thp_" /proc/vmstat

as well while it is in progress?  (Even if it's not happening right now, 
the data might still be useful if you have knowledge that it has occurred 
since the last reboot.)
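
If it's easier, a simple loop along these lines (just a sketch; adjust the
interval and filename to taste) would log the counters every few seconds so
we can see whether they are still advancing during the stall:

	while true; do
		date
		grep -E "compact_|thp_" /proc/vmstat
		sleep 5
	done >> compaction-vmstat.log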


* Re: [3.6 regression?] THP + migration/compaction livelock (I think)
  2012-11-13 23:11 ` David Rientjes
@ 2012-11-13 23:25   ` Andy Lutomirski
  2012-11-13 23:41     ` David Rientjes
  0 siblings, 1 reply; 16+ messages in thread
From: Andy Lutomirski @ 2012-11-13 23:25 UTC (permalink / raw)
  To: David Rientjes; +Cc: Marc Duponcheel, Mel Gorman, linux-kernel, linux-mm

On Tue, Nov 13, 2012 at 3:11 PM, David Rientjes <rientjes@google.com> wrote:
> On Tue, 13 Nov 2012, Andy Lutomirski wrote:
>
>> I've seen an odd problem three times in the past two weeks.  I suspect
>> a Linux 3.6 regression.  I"m on 3.6.3-1.fc17.x86_64.  I run a parallel
>> compilation, and no progress is made.  All cpus are pegged at 100%
>> system time by the respective cc1plus processes.  Reading
>> /proc/<pid>/stack shows either
>>
>> [<ffffffff8108e01a>] __cond_resched+0x2a/0x40
>> [<ffffffff8114e432>] isolate_migratepages_range+0xb2/0x620
>> [<ffffffff8114eba4>] compact_zone+0x144/0x410
>> [<ffffffff8114f152>] compact_zone_order+0x82/0xc0
>> [<ffffffff8114f271>] try_to_compact_pages+0xe1/0x130
>> [<ffffffff816143db>] __alloc_pages_direct_compact+0xaa/0x190
>> [<ffffffff81133d26>] __alloc_pages_nodemask+0x526/0x990
>> [<ffffffff81171496>] alloc_pages_vma+0xb6/0x190
>> [<ffffffff81182683>] do_huge_pmd_anonymous_page+0x143/0x340
>> [<ffffffff811549fd>] handle_mm_fault+0x27d/0x320
>> [<ffffffff81620adc>] do_page_fault+0x15c/0x4b0
>> [<ffffffff8161d625>] page_fault+0x25/0x30
>> [<ffffffffffffffff>] 0xffffffffffffffff
>>
>> or
>>
>> [<ffffffffffffffff>] 0xffffffffffffffff
>>
>
> This reminds me of the thread at http://marc.info/?t=135102111800004 which
> caused Marc's system to reportedly go unresponsive like your report but in
> his case it also caused a reboot.  If your system is still running (or,
> even better, if you're able to capture this happening in realtime), could
> you try to capture
>
>         grep -E "compact_|thp_" /proc/vmstat
>
> as well while it is in progress?  (Even if it's not happening right now,
> the data might still be useful if you have knowledge that it has occurred
> since the last reboot.)

It just happened again.

$ grep -E "compact_|thp_" /proc/vmstat
compact_blocks_moved 8332448774
compact_pages_moved 21831286
compact_pagemigrate_failed 211260
compact_stall 13484
compact_fail 6717
compact_success 6755
thp_fault_alloc 150665
thp_fault_fallback 4270
thp_collapse_alloc 19771
thp_collapse_alloc_failed 2188
thp_split 19600


/proc/meminfo:

MemTotal:       16388116 kB
MemFree:         6684372 kB
Buffers:           34960 kB
Cached:          6233588 kB
SwapCached:        29500 kB
Active:          4881396 kB
Inactive:        3824296 kB
Active(anon):    1687576 kB
Inactive(anon):   764852 kB
Active(file):    3193820 kB
Inactive(file):  3059444 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:      16777212 kB
SwapFree:       16643864 kB
Dirty:               184 kB
Writeback:             0 kB
AnonPages:       2408692 kB
Mapped:           126964 kB
Shmem:             15272 kB
Slab:             635496 kB
SReclaimable:     528924 kB
SUnreclaim:       106572 kB
KernelStack:        3600 kB
PageTables:        39460 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    24971268 kB
Committed_AS:    5688448 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      614952 kB
VmallocChunk:   34359109524 kB
HardwareCorrupted:     0 kB
AnonHugePages:   1050624 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:     3600384 kB
DirectMap2M:    11038720 kB
DirectMap1G:     1048576 kB

$ sudo ./perf stat -p 11764 -e \
    compaction:mm_compaction_isolate_migratepages,task-clock,vmscan:mm_vmscan_direct_reclaim_begin,vmscan:mm_vmscan_lru_isolate,vmscan:mm_vmscan_memcg_isolate
[sudo] password for luto:
^C
 Performance counter stats for process id '11764':

          1,638,009 compaction:mm_compaction_isolate_migratepages  #  0.716 M/sec          [100.00%]
        2286.993046 task-clock                                     #  0.872 CPUs utilized  [100.00%]
                  0 vmscan:mm_vmscan_direct_reclaim_begin          #  0.000 M/sec          [100.00%]
                  0 vmscan:mm_vmscan_lru_isolate                   #  0.000 M/sec          [100.00%]
                  0 vmscan:mm_vmscan_memcg_isolate                 #  0.000 M/sec

        2.623626878 seconds time elapsed

/proc/zoneinfo:
Node 0, zone      DMA
  pages free     3972
        min      16
        low      20
        high     24
        scanned  0
        spanned  4080
        present  3911
    nr_free_pages 3972
    nr_inactive_anon 0
    nr_active_anon 0
    nr_inactive_file 0
    nr_active_file 0
    nr_unevictable 0
    nr_mlock     0
    nr_anon_pages 0
    nr_mapped    0
    nr_file_pages 0
    nr_dirty     0
    nr_writeback 0
    nr_slab_reclaimable 0
    nr_slab_unreclaimable 4
    nr_page_table_pages 0
    nr_kernel_stack 0
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 0
    nr_vmscan_immediate_reclaim 0
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem     0
    nr_dirtied   0
    nr_written   0
    numa_hit     1
    numa_miss    0
    numa_foreign 0
    numa_interleave 0
    numa_local   1
    numa_other   0
    nr_anon_transparent_hugepages 0
        protection: (0, 2434, 16042, 16042)
  pagesets
    cpu: 0
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 8
    cpu: 1
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 8
    cpu: 2
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 8
    cpu: 3
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 8
    cpu: 4
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 8
    cpu: 5
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 8
    cpu: 6
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 8
    cpu: 7
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 8
    cpu: 8
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 8
    cpu: 9
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 8
    cpu: 10
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 8
    cpu: 11
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 8
  all_unreclaimable: 1
  start_pfn:         16
  inactive_ratio:    1
Node 0, zone    DMA32
  pages free     321075
        min      2561
        low      3201
        high     3841
        scanned  0
        spanned  1044480
        present  623163
    nr_free_pages 321075
    nr_inactive_anon 43450
    nr_active_anon 203472
    nr_inactive_file 5416
    nr_active_file 39568
    nr_unevictable 0
    nr_mlock     0
    nr_anon_pages 86455
    nr_mapped    156
    nr_file_pages 45195
    nr_dirty     0
    nr_writeback 0
    nr_slab_reclaimable 6679
    nr_slab_unreclaimable 419
    nr_page_table_pages 2
    nr_kernel_stack 0
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 9994
    nr_vmscan_immediate_reclaim 1
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem     1
    nr_dirtied   1765256
    nr_written   1763392
    numa_hit     53134489
    numa_miss    0
    numa_foreign 0
    numa_interleave 0
    numa_local   53134489
    numa_other   0
    nr_anon_transparent_hugepages 313
        protection: (0, 0, 13608, 13608)
  pagesets
    cpu: 0
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 48
    cpu: 1
              count: 4
              high:  186
              batch: 31
  vm stats threshold: 48
    cpu: 2
              count: 4
              high:  186
              batch: 31
  vm stats threshold: 48
    cpu: 3
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 48
    cpu: 4
              count: 4
              high:  186
              batch: 31
  vm stats threshold: 48
    cpu: 5
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 48
    cpu: 6
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 48
    cpu: 7
              count: 11
              high:  186
              batch: 31
  vm stats threshold: 48
    cpu: 8
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 48
    cpu: 9
              count: 4
              high:  186
              batch: 31
  vm stats threshold: 48
    cpu: 10
              count: 13
              high:  186
              batch: 31
  vm stats threshold: 48
    cpu: 11
              count: 4
              high:  186
              batch: 31
  vm stats threshold: 48
  all_unreclaimable: 0
  start_pfn:         4096
  inactive_ratio:    4
Node 0, zone   Normal
  pages free     1343098
        min      14318
        low      17897
        high     21477
        scanned  0
        spanned  3538944
        present  3483648
    nr_free_pages 1343098
    nr_inactive_anon 147925
    nr_active_anon 221736
    nr_inactive_file 759336
    nr_active_file 758833
    nr_unevictable 0
    nr_mlock     0
    nr_anon_pages 257074
    nr_mapped    31632
    nr_file_pages 1529150
    nr_dirty     25
    nr_writeback 0
    nr_slab_reclaimable 125552
    nr_slab_unreclaimable 26176
    nr_page_table_pages 9844
    nr_kernel_stack 456
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 36224
    nr_vmscan_immediate_reclaim 117
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem     3815
    nr_dirtied   51415788
    nr_written   48993658
    numa_hit     1081691700
    numa_miss    0
    numa_foreign 0
    numa_interleave 25195
    numa_local   1081691700
    numa_other   0
    nr_anon_transparent_hugepages 199
        protection: (0, 0, 0, 0)
  pagesets
    cpu: 0
              count: 156
              high:  186
              batch: 31
  vm stats threshold: 64
    cpu: 1
              count: 177
              high:  186
              batch: 31
  vm stats threshold: 64
    cpu: 2
              count: 159
              high:  186
              batch: 31
  vm stats threshold: 64
    cpu: 3
              count: 161
              high:  186
              batch: 31
  vm stats threshold: 64
    cpu: 4
              count: 146
              high:  186
              batch: 31
  vm stats threshold: 64
    cpu: 5
              count: 98
              high:  186
              batch: 31
  vm stats threshold: 64
    cpu: 6
              count: 59
              high:  186
              batch: 31
  vm stats threshold: 64
    cpu: 7
              count: 54
              high:  186
              batch: 31
  vm stats threshold: 64
    cpu: 8
              count: 40
              high:  186
              batch: 31
  vm stats threshold: 64
    cpu: 9
              count: 32
              high:  186
              batch: 31
  vm stats threshold: 64
    cpu: 10
              count: 46
              high:  186
              batch: 31
  vm stats threshold: 64
    cpu: 11
              count: 57
              high:  186
              batch: 31
  vm stats threshold: 64
  all_unreclaimable: 0
  start_pfn:         1048576
  inactive_ratio:    11


--Andy


* Re: [3.6 regression?] THP + migration/compaction livelock (I think)
  2012-11-13 23:25   ` Andy Lutomirski
@ 2012-11-13 23:41     ` David Rientjes
  2012-11-13 23:45       ` Andy Lutomirski
  2012-11-14 10:01       ` Mel Gorman
  0 siblings, 2 replies; 16+ messages in thread
From: David Rientjes @ 2012-11-13 23:41 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Marc Duponcheel, Mel Gorman, linux-kernel, linux-mm

On Tue, 13 Nov 2012, Andy Lutomirski wrote:

> It just happened again.
> 
> $ grep -E "compact_|thp_" /proc/vmstat
> compact_blocks_moved 8332448774
> compact_pages_moved 21831286
> compact_pagemigrate_failed 211260
> compact_stall 13484
> compact_fail 6717
> compact_success 6755
> thp_fault_alloc 150665
> thp_fault_fallback 4270
> thp_collapse_alloc 19771
> thp_collapse_alloc_failed 2188
> thp_split 19600
> 

Two of the patches from the list provided at
http://marc.info/?l=linux-mm&m=135179005510688 are already in your 3.6.3 
kernel:

	mm: compaction: abort compaction loop if lock is contended or run too long
	mm: compaction: acquire the zone->lock as late as possible

but the full series has not made it to the 3.6 stable kernel yet, so would it 
be possible to try with 3.7-rc5 to see if it fixes the issue?  If so, it will 
indicate that the entire series is a candidate to backport to 3.6.
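
For reference, assuming a mainline git checkout and reusing your current
config, the rough recipe would be something like (adjust to your usual
build procedure):

	git checkout v3.7-rc5
	cp /boot/config-$(uname -r) .config
	make oldconfig
	make -j8 && sudo make modules_install install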


* Re: [3.6 regression?] THP + migration/compaction livelock (I think)
  2012-11-13 23:41     ` David Rientjes
@ 2012-11-13 23:45       ` Andy Lutomirski
  2012-11-13 23:54         ` David Rientjes
  2012-11-14 10:01       ` Mel Gorman
  1 sibling, 1 reply; 16+ messages in thread
From: Andy Lutomirski @ 2012-11-13 23:45 UTC (permalink / raw)
  To: David Rientjes; +Cc: Marc Duponcheel, Mel Gorman, linux-kernel, linux-mm

On Tue, Nov 13, 2012 at 3:41 PM, David Rientjes <rientjes@google.com> wrote:
> On Tue, 13 Nov 2012, Andy Lutomirski wrote:
>
>> It just happened again.
>>
>> $ grep -E "compact_|thp_" /proc/vmstat
>> compact_blocks_moved 8332448774
>> compact_pages_moved 21831286
>> compact_pagemigrate_failed 211260
>> compact_stall 13484
>> compact_fail 6717
>> compact_success 6755
>> thp_fault_alloc 150665
>> thp_fault_fallback 4270
>> thp_collapse_alloc 19771
>> thp_collapse_alloc_failed 2188
>> thp_split 19600
>>
>
> Two of the patches from the list provided at
> http://marc.info/?l=linux-mm&m=135179005510688 are already in your 3.6.3
> kernel:
>
>         mm: compaction: abort compaction loop if lock is contended or run too long
>         mm: compaction: acquire the zone->lock as late as possible
>
> and all have not made it to the 3.6 stable kernel yet, so would it be
> possible to try with 3.7-rc5 to see if it fixes the issue?  If so, it will
> indicate that the entire series is a candidate to backport to 3.6.

I'll try later on.  The last time I tried to boot 3.7 on this box, it
failed impressively (presumably due to a localmodconfig bug, but I
haven't tracked it down yet).

I'm also not sure how reliably I can reproduce this.

--Andy


-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: [3.6 regression?] THP + migration/compaction livelock (I think)
  2012-11-13 23:45       ` Andy Lutomirski
@ 2012-11-13 23:54         ` David Rientjes
  2012-11-14  1:22           ` Marc Duponcheel
  0 siblings, 1 reply; 16+ messages in thread
From: David Rientjes @ 2012-11-13 23:54 UTC (permalink / raw)
  To: Andy Lutomirski, Marc Duponcheel; +Cc: Mel Gorman, linux-kernel, linux-mm

On Tue, 13 Nov 2012, Andy Lutomirski wrote:

> >> $ grep -E "compact_|thp_" /proc/vmstat
> >> compact_blocks_moved 8332448774
> >> compact_pages_moved 21831286
> >> compact_pagemigrate_failed 211260
> >> compact_stall 13484
> >> compact_fail 6717
> >> compact_success 6755
> >> thp_fault_alloc 150665
> >> thp_fault_fallback 4270
> >> thp_collapse_alloc 19771
> >> thp_collapse_alloc_failed 2188
> >> thp_split 19600
> >>
> >
> > Two of the patches from the list provided at
> > http://marc.info/?l=linux-mm&m=135179005510688 are already in your 3.6.3
> > kernel:
> >
> >         mm: compaction: abort compaction loop if lock is contended or run too long
> >         mm: compaction: acquire the zone->lock as late as possible
> >
> > and all have not made it to the 3.6 stable kernel yet, so would it be
> > possible to try with 3.7-rc5 to see if it fixes the issue?  If so, it will
> > indicate that the entire series is a candidate to backport to 3.6.
> 
> I'll try later on.  The last time I tried to boot 3.7 on this box, it
> failed impressively (presumably due to a localmodconfig bug, but I
> haven't tracked it down yet).
> 
> I'm also not sure how reliably I can reproduce this.
> 

The challenge goes out to Marc too since he reported this issue on 3.6.2 
but we haven't heard back yet on the success of the backport (although 
it's probably easier to try 3.7-rc5 since there are some conflicts to 
resolve).


* Re: [3.6 regression?] THP + migration/compaction livelock (I think)
  2012-11-13 23:54         ` David Rientjes
@ 2012-11-14  1:22           ` Marc Duponcheel
  2012-11-14  1:51             ` David Rientjes
  0 siblings, 1 reply; 16+ messages in thread
From: Marc Duponcheel @ 2012-11-14  1:22 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andy Lutomirski, Mel Gorman, linux-kernel, linux-mm, Marc Duponcheel

 Hi all, please let me know if there are patches you want me to try.

 FWIW, time did not stand still and I am running 3.6.6 now.


On 2012 Nov 13, David Rientjes wrote:
> On Tue, 13 Nov 2012, Andy Lutomirski wrote:
> 
> > >> $ grep -E "compact_|thp_" /proc/vmstat
> > >> compact_blocks_moved 8332448774
> > >> compact_pages_moved 21831286
> > >> compact_pagemigrate_failed 211260
> > >> compact_stall 13484
> > >> compact_fail 6717
> > >> compact_success 6755
> > >> thp_fault_alloc 150665
> > >> thp_fault_fallback 4270
> > >> thp_collapse_alloc 19771
> > >> thp_collapse_alloc_failed 2188
> > >> thp_split 19600
> > >>
> > >
> > > Two of the patches from the list provided at
> > > http://marc.info/?l=linux-mm&m=135179005510688 are already in your 3.6.3
> > > kernel:
> > >
> > >         mm: compaction: abort compaction loop if lock is contended or run too long
> > >         mm: compaction: acquire the zone->lock as late as possible
> > >
> > > and all have not made it to the 3.6 stable kernel yet, so would it be
> > > possible to try with 3.7-rc5 to see if it fixes the issue?  If so, it will
> > > indicate that the entire series is a candidate to backport to 3.6.
> > 
> > I'll try later on.  The last time I tried to boot 3.7 on this box, it
> > failed impressively (presumably due to a localmodconfig bug, but I
> > haven't tracked it down yet).
> > 
> > I'm also not sure how reliably I can reproduce this.
> > 
> 
> The challenge goes out to Marc too since he reported this issue on 3.6.2 
> but we haven't heard back yet on the success of the backport (although 
> it's probably easier to try 3.7-rc5 since there are some conflicts to 
> resolve).

--
 Marc Duponcheel
 Velodroomstraat 74 - 2600 Berchem - Belgium
 +32 (0)478 68.10.91 - marc@offline.be


* Re: [3.6 regression?] THP + migration/compaction livelock (I think)
  2012-11-14  1:22           ` Marc Duponcheel
@ 2012-11-14  1:51             ` David Rientjes
  2012-11-14 13:21               ` Marc Duponcheel
  0 siblings, 1 reply; 16+ messages in thread
From: David Rientjes @ 2012-11-14  1:51 UTC (permalink / raw)
  To: Marc Duponcheel; +Cc: Andy Lutomirski, Mel Gorman, linux-kernel, linux-mm

On Wed, 14 Nov 2012, Marc Duponcheel wrote:

>  Hi all, please let me know if there is are patches you want me to try.
> 
>  FWIW time did not stand still and I run 3.6.6 now.
> 

Hmm, interesting since there are no core VM changes between 3.6.2, the 
kernel you ran into problems with, and 3.6.6.


* Re: [3.6 regression?] THP + migration/compaction livelock (I think)
  2012-11-13 23:41     ` David Rientjes
  2012-11-13 23:45       ` Andy Lutomirski
@ 2012-11-14 10:01       ` Mel Gorman
  2012-11-14 13:29         ` Marc Duponcheel
  1 sibling, 1 reply; 16+ messages in thread
From: Mel Gorman @ 2012-11-14 10:01 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andy Lutomirski, Marc Duponcheel, linux-kernel, linux-mm

On Tue, Nov 13, 2012 at 03:41:02PM -0800, David Rientjes wrote:
> On Tue, 13 Nov 2012, Andy Lutomirski wrote:
> 
> > It just happened again.
> > 
> > $ grep -E "compact_|thp_" /proc/vmstat
> > compact_blocks_moved 8332448774
> > compact_pages_moved 21831286
> > compact_pagemigrate_failed 211260
> > compact_stall 13484
> > compact_fail 6717
> > compact_success 6755
> > thp_fault_alloc 150665
> > thp_fault_fallback 4270
> > thp_collapse_alloc 19771
> > thp_collapse_alloc_failed 2188
> > thp_split 19600
> > 
> 
> Two of the patches from the list provided at
> http://marc.info/?l=linux-mm&m=135179005510688 are already in your 3.6.3 
> kernel:
> 
> 	mm: compaction: abort compaction loop if lock is contended or run too long
> 	mm: compaction: acquire the zone->lock as late as possible
> 
> and all have not made it to the 3.6 stable kernel yet, so would it be 
> possible to try with 3.7-rc5 to see if it fixes the issue?  If so, it will 
> indicate that the entire series is a candidate to backport to 3.6.

Thanks David once again.

The full list of compaction-related patches that I believe are necessary for
this particular problem is:

e64c5237cf6ff474cb2f3f832f48f2b441dd9979 mm: compaction: abort compaction loop if lock is contended or run too long
3cc668f4e30fbd97b3c0574d8cac7a83903c9bc7 mm: compaction: move fatal signal check out of compact_checklock_irqsave
661c4cb9b829110cb68c18ea05a56be39f75a4d2 mm: compaction: Update try_to_compact_pages()kerneldoc comment
2a1402aa044b55c2d30ab0ed9405693ef06fb07c mm: compaction: acquire the zone->lru_lock as late as possible
f40d1e42bb988d2a26e8e111ea4c4c7bac819b7e mm: compaction: acquire the zone->lock as late as possible
753341a4b85ff337487b9959c71c529f522004f4 revert "mm: have order > 0 compaction start off where it left"
bb13ffeb9f6bfeb301443994dfbf29f91117dfb3 mm: compaction: cache if a pageblock was scanned and no pages were isolated
c89511ab2f8fe2b47585e60da8af7fd213ec877e mm: compaction: Restart compaction from near where it left off
62997027ca5b3d4618198ed8b1aba40b61b1137b mm: compaction: clear PG_migrate_skip based on compaction and reclaim activity
0db63d7e25f96e2c6da925c002badf6f144ddf30 mm: compaction: correct the nr_strict va isolated check for CMA

If we can get confirmation that these fix the problem in 3.6 kernels then
I can backport them to -stable. This fixes the problem where "many processes
stall, all in an isolation-related function". That started happening after
lumpy reclaim was removed, because we depended on lumpy reclaim to reclaim
aggressively with less compaction; now compaction is depended upon more.
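
For the git side of things, assuming a tree that has both Linus' tree and
the -stable tree available as remotes, the rough (untested) recipe would be:

	# start from the stable release being tested
	git checkout -b compaction-fixes v3.6.6
	# apply each commit from the list above, oldest first;
	# expect to resolve the occasional conflict by hand
	git cherry-pick <commit id>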

The full 3.7-rc5 kernel has a different problem on top of this, and it's
important that the two problems do not get conflated. It has these fixes *but*
GFP_NO_KSWAPD has been removed and there is a patch that scales reclaim
with THP failures that is causing problems. With those changes, kswapd can get
stuck in a 100% CPU loop where it is neither reclaiming nor reaching its exit
conditions. The correct fix would be to identify why this happens, but I have
not got around to it yet. To test with 3.7-rc5, apply either

1) https://lkml.org/lkml/2012/11/5/308
2) https://lkml.org/lkml/2012/11/12/113

or

1) https://lkml.org/lkml/2012/11/5/308
3) https://lkml.org/lkml/2012/11/12/151

on top of 3.7-rc5. So it's a lot of work, but there are three tests I'm
interested in hearing about. The results of each determine what happens
in -stable or mainline:

Test 1: 3.6 + the list of commits above (should fix processes stuck in isolation)
Test 2: 3.7-rc5 + (1+2) above (should fix kswapd stuck at 100%)
Test 3: 3.7-rc5 + (1+3) above (should fix kswapd stuck at 100%, but better)

Thanks.

-- 
Mel Gorman
SUSE Labs


* Re: [3.6 regression?] THP + migration/compaction livelock (I think)
  2012-11-14  1:51             ` David Rientjes
@ 2012-11-14 13:21               ` Marc Duponcheel
  0 siblings, 0 replies; 16+ messages in thread
From: Marc Duponcheel @ 2012-11-14 13:21 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andy Lutomirski, Mel Gorman, linux-kernel, linux-mm, Marc Duponcheel

On 2012 Nov 13, David Rientjes wrote:
> On Wed, 14 Nov 2012, Marc Duponcheel wrote:
> 
> >  Hi all, please let me know if there is are patches you want me to try.
> > 
> >  FWIW time did not stand still and I run 3.6.6 now.
> 
> Hmm, interesting since there are no core VM changes between 3.6.2, the 
> kernel you ran into problems with, and 3.6.6.

 Hi David

 I have not yet tried to reproduce #49361 on 3.6.6 but, as you say, if
there are no core VM changes, I am confident I can do so just by doing

# echo always > /sys/kernel/mm/transparent_hugepage/enabled

 I am at your disposal to test further, and, if there are patches, to
try them out.

 Note that I only once experienced a crash for which I could not find
relevant info in logs. But the hanging processes issue could always be
reproduced consistently.

 have a nice day

--
 Marc Duponcheel
 Velodroomstraat 74 - 2600 Berchem - Belgium
 +32 (0)478 68.10.91 - marc@offline.be


* Re: [3.6 regression?] THP + migration/compaction livelock (I think)
  2012-11-14 10:01       ` Mel Gorman
@ 2012-11-14 13:29         ` Marc Duponcheel
  2012-11-14 21:50           ` David Rientjes
  0 siblings, 1 reply; 16+ messages in thread
From: Marc Duponcheel @ 2012-11-14 13:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Andy Lutomirski, linux-kernel, linux-mm, Marc Duponcheel

 Hi all

 If someone can provide the patches (or teach me how to get them with
git; I apologise for not being git savvy) then, this weekend, I can apply
them to 3.6.6 and compare before/after to check whether they fix #49361.

 Thanks

On 2012 Nov 14, Mel Gorman wrote:
> On Tue, Nov 13, 2012 at 03:41:02PM -0800, David Rientjes wrote:
> > On Tue, 13 Nov 2012, Andy Lutomirski wrote:
> > 
> > > It just happened again.
> > > 
> > > $ grep -E "compact_|thp_" /proc/vmstat
> > > compact_blocks_moved 8332448774
> > > compact_pages_moved 21831286
> > > compact_pagemigrate_failed 211260
> > > compact_stall 13484
> > > compact_fail 6717
> > > compact_success 6755
> > > thp_fault_alloc 150665
> > > thp_fault_fallback 4270
> > > thp_collapse_alloc 19771
> > > thp_collapse_alloc_failed 2188
> > > thp_split 19600
> > > 
> > 
> > Two of the patches from the list provided at
> > http://marc.info/?l=linux-mm&m=135179005510688 are already in your 3.6.3 
> > kernel:
> > 
> > 	mm: compaction: abort compaction loop if lock is contended or run too long
> > 	mm: compaction: acquire the zone->lock as late as possible
> > 
> > and all have not made it to the 3.6 stable kernel yet, so would it be 
> > possible to try with 3.7-rc5 to see if it fixes the issue?  If so, it will 
> > indicate that the entire series is a candidate to backport to 3.6.
> 
> Thanks David once again.
> 
> The full list of compaction-related patches I believe are necessary for
> this particular problem are
> 
> e64c5237cf6ff474cb2f3f832f48f2b441dd9979 mm: compaction: abort compaction loop if lock is contended or run too long
> 3cc668f4e30fbd97b3c0574d8cac7a83903c9bc7 mm: compaction: move fatal signal check out of compact_checklock_irqsave
> 661c4cb9b829110cb68c18ea05a56be39f75a4d2 mm: compaction: Update try_to_compact_pages()kerneldoc comment
> 2a1402aa044b55c2d30ab0ed9405693ef06fb07c mm: compaction: acquire the zone->lru_lock as late as possible
> f40d1e42bb988d2a26e8e111ea4c4c7bac819b7e mm: compaction: acquire the zone->lock as late as possible
> 753341a4b85ff337487b9959c71c529f522004f4 revert "mm: have order > 0 compaction start off where it left"
> bb13ffeb9f6bfeb301443994dfbf29f91117dfb3 mm: compaction: cache if a pageblock was scanned and no pages were isolated
> c89511ab2f8fe2b47585e60da8af7fd213ec877e mm: compaction: Restart compaction from near where it left off
> 62997027ca5b3d4618198ed8b1aba40b61b1137b mm: compaction: clear PG_migrate_skip based on compaction and reclaim activity
> 0db63d7e25f96e2c6da925c002badf6f144ddf30 mm: compaction: correct the nr_strict va isolated check for CMA
> 
> If we can get confirmation that these fix the problem in 3.6 kernels then
> I can backport them to -stable. This fixing a problem where "many processes
> stall, all in an isolation-related function". This started happening after
> lumpy reclaim was removed because we depended on that to aggressively
> reclaim with less compaction. Now compaction is depended upon more.
> 
> The full 3.7-rc5 kernel has a different problem on top of this and it's
> important the problems do not get conflacted. It has these fixes *but*
> GFP_NO_KSWAPD has been removed and there is a patch that scales reclaim
> with THP failures that is causing problem. With them, kswapd can get
> stuck in a 100% loop where it is neither reclaiming nor reaching its exit
> conditions. The correct fix would be to identify why this happens but I
> have not got around to it yet. To test with 3.7-rc5 then apply either
> 
> 1) https://lkml.org/lkml/2012/11/5/308
> 2) https://lkml.org/lkml/2012/11/12/113
> 
> or
> 
> 1) https://lkml.org/lkml/2012/11/5/308
> 3) https://lkml.org/lkml/2012/11/12/151
> 
> on top of 3.7-rc5. So it's a lot of work but there are three tests I'm
> interested in hearing about. The results of each determine what happens
> in -stable or mainline
> 
> Test 1: 3.6 + the last of commits above	(should fix processes stick in isolate)
> Test 2: 3.7-rc5 + (1+2) above (should fix kswapd stuck at 100%)
> Test 3: 3.7-rc5 + (1+3) above (should fix kswapd stuck at 100% but better)
> 
> Thanks.
> 
> -- 
> Mel Gorman
> SUSE Labs
> 

--
 Marc Duponcheel
 Velodroomstraat 74 - 2600 Berchem - Belgium
 +32 (0)478 68.10.91 - marc@offline.be


* Re: [3.6 regression?] THP + migration/compaction livelock (I think)
  2012-11-14 13:29         ` Marc Duponcheel
@ 2012-11-14 21:50           ` David Rientjes
  2012-11-15  1:14             ` Marc Duponcheel
  0 siblings, 1 reply; 16+ messages in thread
From: David Rientjes @ 2012-11-14 21:50 UTC (permalink / raw)
  To: Marc Duponcheel; +Cc: Mel Gorman, Andy Lutomirski, linux-kernel, linux-mm

On Wed, 14 Nov 2012, Marc Duponcheel wrote:

>  Hi all
> 
>  If someone can provide the patches (or learn me how to get them with
> git (I apologise to not be git savy)) then, this weekend, I can apply
> them to 3.6.6 and compare before/after to check if they fix #49361.
> 

I've backported all the commits that Mel quoted to 3.6.6 and appended them 
to this email as one big patch.  It should apply cleanly to your kernel.

Now we are only missing these commits that weren't quoted:

 - 1fb3f8ca0e92 ("mm: compaction: capture a suitable high-order page 
                  immediately when it is made available"), and

 - 83fde0f22872 ("mm: vmscan: scale number of pages reclaimed by 
                  reclaim/compaction based on failures").

Since your regression is easily reproducible, would it be possible to try 
to reproduce the issue FIRST with 3.6.6 and, if still present as it was in 
3.6.2, then try reproducing it with the appended patch?

You earlier reported that khugepaged was taking the second-most CPU time 
when this was happening, which initially pointed you to THP, so presumably 
this isn't an issue of kswapd spinning at 100%.  If both 3.6.6 kernels fail 
(the one with and the one without the following patch), would it be possible 
to try Mel's suggestion of patching with

 - https://lkml.org/lkml/2012/11/5/308 +
   https://lkml.org/lkml/2012/11/12/113

to see if it helps and, if not, reverting the latter and trying

 - https://lkml.org/lkml/2012/11/5/308 +
   https://lkml.org/lkml/2012/11/12/151

as the final test?  This will certainly help us to find out what needs to 
be backported to 3.6 stable to prevent this issue for other users.
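
To apply the appended patch, roughly: save everything below the "---"
marker to a file and, from the top of your 3.6.6 source tree, run something
like the following (the file and directory names are only placeholders),
then rebuild and reinstall as usual:

	cd linux-3.6.6
	patch -p1 < compaction-backport.patch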

Thanks!
---
 include/linux/compaction.h      |   15 ++
 include/linux/mmzone.h          |    6 +-
 include/linux/pageblock-flags.h |   19 +-
 mm/compaction.c                 |  450 +++++++++++++++++++++++++--------------
 mm/internal.h                   |   16 +-
 mm/page_alloc.c                 |   42 ++--
 mm/vmscan.c                     |    8 +
 7 files changed, 366 insertions(+), 190 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -24,6 +24,7 @@ extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *mask,
 			bool sync, bool *contended);
 extern int compact_pgdat(pg_data_t *pgdat, int order);
+extern void reset_isolation_suitable(pg_data_t *pgdat);
 extern unsigned long compaction_suitable(struct zone *zone, int order);
 
 /* Do not skip compaction more than 64 times */
@@ -61,6 +62,16 @@ static inline bool compaction_deferred(struct zone *zone, int order)
 	return zone->compact_considered < defer_limit;
 }
 
+/* Returns true if restarting compaction after many failures */
+static inline bool compaction_restarting(struct zone *zone, int order)
+{
+	if (order < zone->compact_order_failed)
+		return false;
+
+	return zone->compact_defer_shift == COMPACT_MAX_DEFER_SHIFT &&
+		zone->compact_considered >= 1UL << zone->compact_defer_shift;
+}
+
 #else
 static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *nodemask,
@@ -74,6 +85,10 @@ static inline int compact_pgdat(pg_data_t *pgdat, int order)
 	return COMPACT_CONTINUE;
 }
 
+static inline void reset_isolation_suitable(pg_data_t *pgdat)
+{
+}
+
 static inline unsigned long compaction_suitable(struct zone *zone, int order)
 {
 	return COMPACT_SKIPPED;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -369,8 +369,12 @@ struct zone {
 	spinlock_t		lock;
 	int                     all_unreclaimable; /* All pages pinned */
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
-	/* pfn where the last incremental compaction isolated free pages */
+	/* Set to true when the PG_migrate_skip bits should be cleared */
+	bool			compact_blockskip_flush;
+
+	/* pfns where compaction scanners should start */
 	unsigned long		compact_cached_free_pfn;
+	unsigned long		compact_cached_migrate_pfn;
 #endif
 #ifdef CONFIG_MEMORY_HOTPLUG
 	/* see spanned/present_pages for more description */
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -30,6 +30,9 @@ enum pageblock_bits {
 	PB_migrate,
 	PB_migrate_end = PB_migrate + 3 - 1,
 			/* 3 bits required for migrate types */
+#ifdef CONFIG_COMPACTION
+	PB_migrate_skip,/* If set the block is skipped by compaction */
+#endif /* CONFIG_COMPACTION */
 	NR_PAGEBLOCK_BITS
 };
 
@@ -65,10 +68,22 @@ unsigned long get_pageblock_flags_group(struct page *page,
 void set_pageblock_flags_group(struct page *page, unsigned long flags,
 					int start_bitidx, int end_bitidx);
 
+#ifdef CONFIG_COMPACTION
+#define get_pageblock_skip(page) \
+			get_pageblock_flags_group(page, PB_migrate_skip,     \
+							PB_migrate_skip + 1)
+#define clear_pageblock_skip(page) \
+			set_pageblock_flags_group(page, 0, PB_migrate_skip,  \
+							PB_migrate_skip + 1)
+#define set_pageblock_skip(page) \
+			set_pageblock_flags_group(page, 1, PB_migrate_skip,  \
+							PB_migrate_skip + 1)
+#endif /* CONFIG_COMPACTION */
+
 #define get_pageblock_flags(page) \
-			get_pageblock_flags_group(page, 0, NR_PAGEBLOCK_BITS-1)
+			get_pageblock_flags_group(page, 0, PB_migrate_end)
 #define set_pageblock_flags(page, flags) \
 			set_pageblock_flags_group(page, flags,	\
-						  0, NR_PAGEBLOCK_BITS-1)
+						  0, PB_migrate_end)
 
 #endif	/* PAGEBLOCK_FLAGS_H */
diff --git a/mm/compaction.c b/mm/compaction.c
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -50,6 +50,111 @@ static inline bool migrate_async_suitable(int migratetype)
 	return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE;
 }
 
+#ifdef CONFIG_COMPACTION
+/* Returns true if the pageblock should be scanned for pages to isolate. */
+static inline bool isolation_suitable(struct compact_control *cc,
+					struct page *page)
+{
+	if (cc->ignore_skip_hint)
+		return true;
+
+	return !get_pageblock_skip(page);
+}
+
+/*
+ * This function is called to clear all cached information on pageblocks that
+ * should be skipped for page isolation when the migrate and free page scanner
+ * meet.
+ */
+static void __reset_isolation_suitable(struct zone *zone)
+{
+	unsigned long start_pfn = zone->zone_start_pfn;
+	unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
+	unsigned long pfn;
+
+	zone->compact_cached_migrate_pfn = start_pfn;
+	zone->compact_cached_free_pfn = end_pfn;
+	zone->compact_blockskip_flush = false;
+
+	/* Walk the zone and mark every pageblock as suitable for isolation */
+	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
+		struct page *page;
+
+		cond_resched();
+
+		if (!pfn_valid(pfn))
+			continue;
+
+		page = pfn_to_page(pfn);
+		if (zone != page_zone(page))
+			continue;
+
+		clear_pageblock_skip(page);
+	}
+}
+
+void reset_isolation_suitable(pg_data_t *pgdat)
+{
+	int zoneid;
+
+	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
+		struct zone *zone = &pgdat->node_zones[zoneid];
+		if (!populated_zone(zone))
+			continue;
+
+		/* Only flush if a full compaction finished recently */
+		if (zone->compact_blockskip_flush)
+			__reset_isolation_suitable(zone);
+	}
+}
+
+/*
+ * If no pages were isolated then mark this pageblock to be skipped in the
+ * future. The information is later cleared by __reset_isolation_suitable().
+ */
+static void update_pageblock_skip(struct compact_control *cc,
+			struct page *page, unsigned long nr_isolated,
+			bool migrate_scanner)
+{
+	struct zone *zone = cc->zone;
+	if (!page)
+		return;
+
+	if (!nr_isolated) {
+		unsigned long pfn = page_to_pfn(page);
+		set_pageblock_skip(page);
+
+		/* Update where compaction should restart */
+		if (migrate_scanner) {
+			if (!cc->finished_update_migrate &&
+			    pfn > zone->compact_cached_migrate_pfn)
+				zone->compact_cached_migrate_pfn = pfn;
+		} else {
+			if (!cc->finished_update_free &&
+			    pfn < zone->compact_cached_free_pfn)
+				zone->compact_cached_free_pfn = pfn;
+		}
+	}
+}
+#else
+static inline bool isolation_suitable(struct compact_control *cc,
+					struct page *page)
+{
+	return true;
+}
+
+static void update_pageblock_skip(struct compact_control *cc,
+			struct page *page, unsigned long nr_isolated,
+			bool migrate_scanner)
+{
+}
+#endif /* CONFIG_COMPACTION */
+
+static inline bool should_release_lock(spinlock_t *lock)
+{
+	return need_resched() || spin_is_contended(lock);
+}
+
 /*
  * Compaction requires the taking of some coarse locks that are potentially
  * very heavily contended. Check if the process needs to be scheduled or
@@ -62,7 +167,7 @@ static inline bool migrate_async_suitable(int migratetype)
 static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
 				      bool locked, struct compact_control *cc)
 {
-	if (need_resched() || spin_is_contended(lock)) {
+	if (should_release_lock(lock)) {
 		if (locked) {
 			spin_unlock_irqrestore(lock, *flags);
 			locked = false;
@@ -70,14 +175,11 @@ static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
 
 		/* async aborts if taking too long or contended */
 		if (!cc->sync) {
-			if (cc->contended)
-				*cc->contended = true;
+			cc->contended = true;
 			return false;
 		}
 
 		cond_resched();
-		if (fatal_signal_pending(current))
-			return false;
 	}
 
 	if (!locked)
@@ -91,44 +193,85 @@ static inline bool compact_trylock_irqsave(spinlock_t *lock,
 	return compact_checklock_irqsave(lock, flags, false, cc);
 }
 
+/* Returns true if the page is within a block suitable for migration to */
+static bool suitable_migration_target(struct page *page)
+{
+	int migratetype = get_pageblock_migratetype(page);
+
+	/* Don't interfere with memory hot-remove or the min_free_kbytes blocks */
+	if (migratetype == MIGRATE_ISOLATE || migratetype == MIGRATE_RESERVE)
+		return false;
+
+	/* If the page is a large free page, then allow migration */
+	if (PageBuddy(page) && page_order(page) >= pageblock_order)
+		return true;
+
+	/* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
+	if (migrate_async_suitable(migratetype))
+		return true;
+
+	/* Otherwise skip the block */
+	return false;
+}
+
 /*
  * Isolate free pages onto a private freelist. Caller must hold zone->lock.
  * If @strict is true, will abort returning 0 on any invalid PFNs or non-free
  * pages inside of the pageblock (even though it may still end up isolating
  * some pages).
  */
-static unsigned long isolate_freepages_block(unsigned long blockpfn,
+static unsigned long isolate_freepages_block(struct compact_control *cc,
+				unsigned long blockpfn,
 				unsigned long end_pfn,
 				struct list_head *freelist,
 				bool strict)
 {
 	int nr_scanned = 0, total_isolated = 0;
-	struct page *cursor;
+	struct page *cursor, *valid_page = NULL;
+	unsigned long nr_strict_required = end_pfn - blockpfn;
+	unsigned long flags;
+	bool locked = false;
 
 	cursor = pfn_to_page(blockpfn);
 
-	/* Isolate free pages. This assumes the block is valid */
+	/* Isolate free pages. */
 	for (; blockpfn < end_pfn; blockpfn++, cursor++) {
 		int isolated, i;
 		struct page *page = cursor;
 
-		if (!pfn_valid_within(blockpfn)) {
-			if (strict)
-				return 0;
-			continue;
-		}
 		nr_scanned++;
+		if (!pfn_valid_within(blockpfn))
+			continue;
+		if (!valid_page)
+			valid_page = page;
+		if (!PageBuddy(page))
+			continue;
+
+		/*
+		 * The zone lock must be held to isolate freepages.
+		 * Unfortunately this is a very coarse lock and can be
+		 * heavily contended if there are parallel allocations
+		 * or parallel compactions. For async compaction do not
+		 * spin on the lock and we acquire the lock as late as
+		 * possible.
+		 */
+		locked = compact_checklock_irqsave(&cc->zone->lock, &flags,
+								locked, cc);
+		if (!locked)
+			break;
+
+		/* Recheck this is a suitable migration target under lock */
+		if (!strict && !suitable_migration_target(page))
+			break;
 
-		if (!PageBuddy(page)) {
-			if (strict)
-				return 0;
+		/* Recheck this is a buddy page under lock */
+		if (!PageBuddy(page))
 			continue;
-		}
 
 		/* Found a free page, break it into order-0 pages */
 		isolated = split_free_page(page);
 		if (!isolated && strict)
-			return 0;
+			break;
 		total_isolated += isolated;
 		for (i = 0; i < isolated; i++) {
 			list_add(&page->lru, freelist);
@@ -143,6 +286,22 @@ static unsigned long isolate_freepages_block(unsigned long blockpfn,
 	}
 
 	trace_mm_compaction_isolate_freepages(nr_scanned, total_isolated);
+
+	/*
+	 * If strict isolation is requested by CMA then check that all the
+	 * pages requested were isolated. If there were any failures, 0 is
+	 * returned and CMA will fail.
+	 */
+	if (strict && nr_strict_required > total_isolated)
+		total_isolated = 0;
+
+	if (locked)
+		spin_unlock_irqrestore(&cc->zone->lock, flags);
+
+	/* Update the pageblock-skip if the whole pageblock was scanned */
+	if (blockpfn == end_pfn)
+		update_pageblock_skip(cc, valid_page, total_isolated, false);
+
 	return total_isolated;
 }
 
@@ -160,17 +319,14 @@ static unsigned long isolate_freepages_block(unsigned long blockpfn,
  * a free page).
  */
 unsigned long
-isolate_freepages_range(unsigned long start_pfn, unsigned long end_pfn)
+isolate_freepages_range(struct compact_control *cc,
+			unsigned long start_pfn, unsigned long end_pfn)
 {
-	unsigned long isolated, pfn, block_end_pfn, flags;
-	struct zone *zone = NULL;
+	unsigned long isolated, pfn, block_end_pfn;
 	LIST_HEAD(freelist);
 
-	if (pfn_valid(start_pfn))
-		zone = page_zone(pfn_to_page(start_pfn));
-
 	for (pfn = start_pfn; pfn < end_pfn; pfn += isolated) {
-		if (!pfn_valid(pfn) || zone != page_zone(pfn_to_page(pfn)))
+		if (!pfn_valid(pfn) || cc->zone != page_zone(pfn_to_page(pfn)))
 			break;
 
 		/*
@@ -180,10 +336,8 @@ isolate_freepages_range(unsigned long start_pfn, unsigned long end_pfn)
 		block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages);
 		block_end_pfn = min(block_end_pfn, end_pfn);
 
-		spin_lock_irqsave(&zone->lock, flags);
-		isolated = isolate_freepages_block(pfn, block_end_pfn,
+		isolated = isolate_freepages_block(cc, pfn, block_end_pfn,
 						   &freelist, true);
-		spin_unlock_irqrestore(&zone->lock, flags);
 
 		/*
 		 * In strict mode, isolate_freepages_block() returns 0 if
@@ -276,7 +430,8 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 	isolate_mode_t mode = 0;
 	struct lruvec *lruvec;
 	unsigned long flags;
-	bool locked;
+	bool locked = false;
+	struct page *page = NULL, *valid_page = NULL;
 
 	/*
 	 * Ensure that there are not too many pages isolated from the LRU
@@ -296,23 +451,15 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 
 	/* Time to isolate some pages for migration */
 	cond_resched();
-	spin_lock_irqsave(&zone->lru_lock, flags);
-	locked = true;
 	for (; low_pfn < end_pfn; low_pfn++) {
-		struct page *page;
-
 		/* give a chance to irqs before checking need_resched() */
-		if (!((low_pfn+1) % SWAP_CLUSTER_MAX)) {
-			spin_unlock_irqrestore(&zone->lru_lock, flags);
-			locked = false;
+		if (locked && !((low_pfn+1) % SWAP_CLUSTER_MAX)) {
+			if (should_release_lock(&zone->lru_lock)) {
+				spin_unlock_irqrestore(&zone->lru_lock, flags);
+				locked = false;
+			}
 		}
 
-		/* Check if it is ok to still hold the lock */
-		locked = compact_checklock_irqsave(&zone->lru_lock, &flags,
-								locked, cc);
-		if (!locked)
-			break;
-
 		/*
 		 * migrate_pfn does not necessarily start aligned to a
 		 * pageblock. Ensure that pfn_valid is called when moving
@@ -340,6 +487,14 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 		if (page_zone(page) != zone)
 			continue;
 
+		if (!valid_page)
+			valid_page = page;
+
+		/* If isolation recently failed, do not retry */
+		pageblock_nr = low_pfn >> pageblock_order;
+		if (!isolation_suitable(cc, page))
+			goto next_pageblock;
+
 		/* Skip if free */
 		if (PageBuddy(page))
 			continue;
@@ -349,24 +504,43 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 		 * migration is optimistic to see if the minimum amount of work
 		 * satisfies the allocation
 		 */
-		pageblock_nr = low_pfn >> pageblock_order;
 		if (!cc->sync && last_pageblock_nr != pageblock_nr &&
 		    !migrate_async_suitable(get_pageblock_migratetype(page))) {
-			low_pfn += pageblock_nr_pages;
-			low_pfn = ALIGN(low_pfn, pageblock_nr_pages) - 1;
-			last_pageblock_nr = pageblock_nr;
-			continue;
+			cc->finished_update_migrate = true;
+			goto next_pageblock;
 		}
 
+		/* Check may be lockless but that's ok as we recheck later */
 		if (!PageLRU(page))
 			continue;
 
 		/*
-		 * PageLRU is set, and lru_lock excludes isolation,
-		 * splitting and collapsing (collapsing has already
-		 * happened if PageLRU is set).
+		 * PageLRU is set. lru_lock normally excludes isolation
+		 * splitting and collapsing (collapsing has already happened
+		 * if PageLRU is set) but the lock is not necessarily taken
+		 * here and it is wasteful to take it just to check transhuge.
+		 * Check TransHuge without lock and skip the whole pageblock if
+		 * it's either a transhuge or hugetlbfs page, as calling
+		 * compound_order() without preventing THP from splitting the
+		 * page underneath us may return surprising results.
 		 */
 		if (PageTransHuge(page)) {
+			if (!locked)
+				goto next_pageblock;
+			low_pfn += (1 << compound_order(page)) - 1;
+			continue;
+		}
+
+		/* Check if it is ok to still hold the lock */
+		locked = compact_checklock_irqsave(&zone->lru_lock, &flags,
+								locked, cc);
+		if (!locked || fatal_signal_pending(current))
+			break;
+
+		/* Recheck PageLRU and PageTransHuge under lock */
+		if (!PageLRU(page))
+			continue;
+		if (PageTransHuge(page)) {
 			low_pfn += (1 << compound_order(page)) - 1;
 			continue;
 		}
@@ -383,6 +557,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 		VM_BUG_ON(PageTransCompound(page));
 
 		/* Successfully isolated */
+		cc->finished_update_migrate = true;
 		del_page_from_lru_list(page, lruvec, page_lru(page));
 		list_add(&page->lru, migratelist);
 		cc->nr_migratepages++;
@@ -393,6 +568,13 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 			++low_pfn;
 			break;
 		}
+
+		continue;
+
+next_pageblock:
+		low_pfn += pageblock_nr_pages;
+		low_pfn = ALIGN(low_pfn, pageblock_nr_pages) - 1;
+		last_pageblock_nr = pageblock_nr;
 	}
 
 	acct_isolated(zone, locked, cc);
@@ -400,6 +582,10 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 	if (locked)
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 
+	/* Update the pageblock-skip if the whole pageblock was scanned */
+	if (low_pfn == end_pfn)
+		update_pageblock_skip(cc, valid_page, nr_isolated, true);
+
 	trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);
 
 	return low_pfn;
@@ -407,43 +593,6 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 
 #endif /* CONFIG_COMPACTION || CONFIG_CMA */
 #ifdef CONFIG_COMPACTION
-
-/* Returns true if the page is within a block suitable for migration to */
-static bool suitable_migration_target(struct page *page)
-{
-
-	int migratetype = get_pageblock_migratetype(page);
-
-	/* Don't interfere with memory hot-remove or the min_free_kbytes blocks */
-	if (migratetype == MIGRATE_ISOLATE || migratetype == MIGRATE_RESERVE)
-		return false;
-
-	/* If the page is a large free page, then allow migration */
-	if (PageBuddy(page) && page_order(page) >= pageblock_order)
-		return true;
-
-	/* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
-	if (migrate_async_suitable(migratetype))
-		return true;
-
-	/* Otherwise skip the block */
-	return false;
-}
-
-/*
- * Returns the start pfn of the last page block in a zone.  This is the starting
- * point for full compaction of a zone.  Compaction searches for free pages from
- * the end of each zone, while isolate_freepages_block scans forward inside each
- * page block.
- */
-static unsigned long start_free_pfn(struct zone *zone)
-{
-	unsigned long free_pfn;
-	free_pfn = zone->zone_start_pfn + zone->spanned_pages;
-	free_pfn &= ~(pageblock_nr_pages-1);
-	return free_pfn;
-}
-
 /*
  * Based on information in the current compact_control, find blocks
  * suitable for isolating free pages from and then isolate them.
@@ -453,7 +602,6 @@ static void isolate_freepages(struct zone *zone,
 {
 	struct page *page;
 	unsigned long high_pfn, low_pfn, pfn, zone_end_pfn, end_pfn;
-	unsigned long flags;
 	int nr_freepages = cc->nr_freepages;
 	struct list_head *freelist = &cc->freepages;
 
@@ -501,30 +649,16 @@ static void isolate_freepages(struct zone *zone,
 		if (!suitable_migration_target(page))
 			continue;
 
-		/*
-		 * Found a block suitable for isolating free pages from. Now
-		 * we disabled interrupts, double check things are ok and
-		 * isolate the pages. This is to minimise the time IRQs
-		 * are disabled
-		 */
-		isolated = 0;
+		/* If isolation recently failed, do not retry */
+		if (!isolation_suitable(cc, page))
+			continue;
 
-		/*
-		 * The zone lock must be held to isolate freepages. This
-		 * unfortunately this is a very coarse lock and can be
-		 * heavily contended if there are parallel allocations
-		 * or parallel compactions. For async compaction do not
-		 * spin on the lock
-		 */
-		if (!compact_trylock_irqsave(&zone->lock, &flags, cc))
-			break;
-		if (suitable_migration_target(page)) {
-			end_pfn = min(pfn + pageblock_nr_pages, zone_end_pfn);
-			isolated = isolate_freepages_block(pfn, end_pfn,
-							   freelist, false);
-			nr_freepages += isolated;
-		}
-		spin_unlock_irqrestore(&zone->lock, flags);
+		/* Found a block suitable for isolating free pages from */
+		isolated = 0;
+		end_pfn = min(pfn + pageblock_nr_pages, zone_end_pfn);
+		isolated = isolate_freepages_block(cc, pfn, end_pfn,
+						   freelist, false);
+		nr_freepages += isolated;
 
 		/*
 		 * Record the highest PFN we isolated pages from. When next
@@ -532,17 +666,8 @@ static void isolate_freepages(struct zone *zone,
 		 * page migration may have returned some pages to the allocator
 		 */
 		if (isolated) {
+			cc->finished_update_free = true;
 			high_pfn = max(high_pfn, pfn);
-
-			/*
-			 * If the free scanner has wrapped, update
-			 * compact_cached_free_pfn to point to the highest
-			 * pageblock with free pages. This reduces excessive
-			 * scanning of full pageblocks near the end of the
-			 * zone
-			 */
-			if (cc->order > 0 && cc->wrapped)
-				zone->compact_cached_free_pfn = high_pfn;
 		}
 	}
 
@@ -551,11 +676,6 @@ static void isolate_freepages(struct zone *zone,
 
 	cc->free_pfn = high_pfn;
 	cc->nr_freepages = nr_freepages;
-
-	/* If compact_cached_free_pfn is reset then set it now */
-	if (cc->order > 0 && !cc->wrapped &&
-			zone->compact_cached_free_pfn == start_free_pfn(zone))
-		zone->compact_cached_free_pfn = high_pfn;
 }
 
 /*
@@ -634,7 +754,7 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 
 	/* Perform the isolation */
 	low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn);
-	if (!low_pfn)
+	if (!low_pfn || cc->contended)
 		return ISOLATE_ABORT;
 
 	cc->migrate_pfn = low_pfn;
@@ -651,27 +771,19 @@ static int compact_finished(struct zone *zone,
 	if (fatal_signal_pending(current))
 		return COMPACT_PARTIAL;
 
-	/*
-	 * A full (order == -1) compaction run starts at the beginning and
-	 * end of a zone; it completes when the migrate and free scanner meet.
-	 * A partial (order > 0) compaction can start with the free scanner
-	 * at a random point in the zone, and may have to restart.
-	 */
+	/* Compaction run completes if the migrate and free scanner meet */
 	if (cc->free_pfn <= cc->migrate_pfn) {
-		if (cc->order > 0 && !cc->wrapped) {
-			/* We started partway through; restart at the end. */
-			unsigned long free_pfn = start_free_pfn(zone);
-			zone->compact_cached_free_pfn = free_pfn;
-			cc->free_pfn = free_pfn;
-			cc->wrapped = 1;
-			return COMPACT_CONTINUE;
-		}
-		return COMPACT_COMPLETE;
-	}
+		/*
+		 * Mark that the PG_migrate_skip information should be cleared
+		 * by kswapd when it goes to sleep. kswapd does not set the
+		 * flag itself as the decision to clear it should be based
+		 * directly on an allocation request.
+		 */
+		if (!current_is_kswapd())
+			zone->compact_blockskip_flush = true;
 
-	/* We wrapped around and ended up where we started. */
-	if (cc->wrapped && cc->free_pfn <= cc->start_free_pfn)
 		return COMPACT_COMPLETE;
+	}
 
 	/*
 	 * order == -1 is expected when compacting via
@@ -754,6 +866,8 @@ unsigned long compaction_suitable(struct zone *zone, int order)
 static int compact_zone(struct zone *zone, struct compact_control *cc)
 {
 	int ret;
+	unsigned long start_pfn = zone->zone_start_pfn;
+	unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
 
 	ret = compaction_suitable(zone, cc->order);
 	switch (ret) {
@@ -766,17 +880,29 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 		;
 	}
 
-	/* Setup to move all movable pages to the end of the zone */
-	cc->migrate_pfn = zone->zone_start_pfn;
-
-	if (cc->order > 0) {
-		/* Incremental compaction. Start where the last one stopped. */
-		cc->free_pfn = zone->compact_cached_free_pfn;
-		cc->start_free_pfn = cc->free_pfn;
-	} else {
-		/* Order == -1 starts at the end of the zone. */
-		cc->free_pfn = start_free_pfn(zone);
+	/*
+	 * Setup to move all movable pages to the end of the zone. Use cached
+	 * information on where the scanners should start but check that it
+	 * is initialised by ensuring the values are within zone boundaries.
+	 */
+	cc->migrate_pfn = zone->compact_cached_migrate_pfn;
+	cc->free_pfn = zone->compact_cached_free_pfn;
+	if (cc->free_pfn < start_pfn || cc->free_pfn > end_pfn) {
+		cc->free_pfn = end_pfn & ~(pageblock_nr_pages-1);
+		zone->compact_cached_free_pfn = cc->free_pfn;
 	}
+	if (cc->migrate_pfn < start_pfn || cc->migrate_pfn > end_pfn) {
+		cc->migrate_pfn = start_pfn;
+		zone->compact_cached_migrate_pfn = cc->migrate_pfn;
+	}
+
+	/*
+	 * Clear pageblock skip if there were failures recently and compaction
+	 * is about to be retried after being deferred. kswapd does not do
+	 * this reset as it'll reset the cached information when going to sleep.
+	 */
+	if (compaction_restarting(zone, cc->order) && !current_is_kswapd())
+		__reset_isolation_suitable(zone);
 
 	migrate_prep_local();
 
@@ -787,6 +913,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 		switch (isolate_migratepages(zone, cc)) {
 		case ISOLATE_ABORT:
 			ret = COMPACT_PARTIAL;
+			putback_lru_pages(&cc->migratepages);
+			cc->nr_migratepages = 0;
 			goto out;
 		case ISOLATE_NONE:
 			continue;
@@ -831,6 +959,7 @@ static unsigned long compact_zone_order(struct zone *zone,
 				 int order, gfp_t gfp_mask,
 				 bool sync, bool *contended)
 {
+	unsigned long ret;
 	struct compact_control cc = {
 		.nr_freepages = 0,
 		.nr_migratepages = 0,
@@ -838,12 +967,17 @@ static unsigned long compact_zone_order(struct zone *zone,
 		.migratetype = allocflags_to_migratetype(gfp_mask),
 		.zone = zone,
 		.sync = sync,
-		.contended = contended,
 	};
 	INIT_LIST_HEAD(&cc.freepages);
 	INIT_LIST_HEAD(&cc.migratepages);
 
-	return compact_zone(zone, &cc);
+	ret = compact_zone(zone, &cc);
+
+	VM_BUG_ON(!list_empty(&cc.freepages));
+	VM_BUG_ON(!list_empty(&cc.migratepages));
+
+	*contended = cc.contended;
+	return ret;
 }
 
 int sysctl_extfrag_threshold = 500;
@@ -855,6 +989,8 @@ int sysctl_extfrag_threshold = 500;
  * @gfp_mask: The GFP mask of the current allocation
  * @nodemask: The allowed nodes to allocate from
  * @sync: Whether migration is synchronous or not
+ * @contended: Return value that is true if compaction was aborted due to lock contention
+ * @page: Optionally capture a free page of the requested order during compaction
  *
  * This is the main entry point for direct page compaction.
  */
diff --git a/mm/internal.h b/mm/internal.h
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -118,23 +118,23 @@ struct compact_control {
 	unsigned long nr_freepages;	/* Number of isolated free pages */
 	unsigned long nr_migratepages;	/* Number of pages to migrate */
 	unsigned long free_pfn;		/* isolate_freepages search base */
-	unsigned long start_free_pfn;	/* where we started the search */
 	unsigned long migrate_pfn;	/* isolate_migratepages search base */
 	bool sync;			/* Synchronous migration */
-	bool wrapped;			/* Order > 0 compactions are
-					   incremental, once free_pfn
-					   and migrate_pfn meet, we restart
-					   from the top of the zone;
-					   remember we wrapped around. */
+	bool ignore_skip_hint;		/* Scan blocks even if marked skip */
+	bool finished_update_free;	/* True when the zone cached pfns are
+					 * no longer being updated
+					 */
+	bool finished_update_migrate;
 
 	int order;			/* order a direct compactor needs */
 	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
 	struct zone *zone;
-	bool *contended;		/* True if a lock was contended */
+	bool contended;			/* True if a lock was contended */
 };
 
 unsigned long
-isolate_freepages_range(unsigned long start_pfn, unsigned long end_pfn);
+isolate_freepages_range(struct compact_control *cc,
+			unsigned long start_pfn, unsigned long end_pfn);
 unsigned long
 isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 			   unsigned long low_pfn, unsigned long end_pfn);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2131,6 +2131,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 				alloc_flags & ~ALLOC_NO_WATERMARKS,
 				preferred_zone, migratetype);
 		if (page) {
+			preferred_zone->compact_blockskip_flush = false;
 			preferred_zone->compact_considered = 0;
 			preferred_zone->compact_defer_shift = 0;
 			if (order >= preferred_zone->compact_order_failed)
@@ -4438,11 +4439,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 
 		zone->spanned_pages = size;
 		zone->present_pages = realsize;
-#if defined CONFIG_COMPACTION || defined CONFIG_CMA
-		zone->compact_cached_free_pfn = zone->zone_start_pfn +
-						zone->spanned_pages;
-		zone->compact_cached_free_pfn &= ~(pageblock_nr_pages-1);
-#endif
 #ifdef CONFIG_NUMA
 		zone->node = nid;
 		zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio)
@@ -5632,7 +5628,8 @@ __alloc_contig_migrate_alloc(struct page *page, unsigned long private,
 }
 
 /* [start, end) must belong to a single zone. */
-static int __alloc_contig_migrate_range(unsigned long start, unsigned long end)
+static int __alloc_contig_migrate_range(struct compact_control *cc,
+					unsigned long start, unsigned long end)
 {
 	/* This function is based on compact_zone() from compaction.c. */
 
@@ -5640,25 +5637,17 @@ static int __alloc_contig_migrate_range(unsigned long start, unsigned long end)
 	unsigned int tries = 0;
 	int ret = 0;
 
-	struct compact_control cc = {
-		.nr_migratepages = 0,
-		.order = -1,
-		.zone = page_zone(pfn_to_page(start)),
-		.sync = true,
-	};
-	INIT_LIST_HEAD(&cc.migratepages);
-
 	migrate_prep_local();
 
-	while (pfn < end || !list_empty(&cc.migratepages)) {
+	while (pfn < end || !list_empty(&cc->migratepages)) {
 		if (fatal_signal_pending(current)) {
 			ret = -EINTR;
 			break;
 		}
 
-		if (list_empty(&cc.migratepages)) {
-			cc.nr_migratepages = 0;
-			pfn = isolate_migratepages_range(cc.zone, &cc,
+		if (list_empty(&cc->migratepages)) {
+			cc->nr_migratepages = 0;
+			pfn = isolate_migratepages_range(cc->zone, cc,
 							 pfn, end);
 			if (!pfn) {
 				ret = -EINTR;
@@ -5670,12 +5659,12 @@ static int __alloc_contig_migrate_range(unsigned long start, unsigned long end)
 			break;
 		}
 
-		ret = migrate_pages(&cc.migratepages,
+		ret = migrate_pages(&cc->migratepages,
 				    __alloc_contig_migrate_alloc,
 				    0, false, MIGRATE_SYNC);
 	}
 
-	putback_lru_pages(&cc.migratepages);
+	putback_lru_pages(&cc->migratepages);
 	return ret > 0 ? 0 : ret;
 }
 
@@ -5754,6 +5743,15 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 	unsigned long outer_start, outer_end;
 	int ret = 0, order;
 
+	struct compact_control cc = {
+		.nr_migratepages = 0,
+		.order = -1,
+		.zone = page_zone(pfn_to_page(start)),
+		.sync = true,
+		.ignore_skip_hint = true,
+	};
+	INIT_LIST_HEAD(&cc.migratepages);
+
 	/*
 	 * What we do here is we mark all pageblocks in range as
 	 * MIGRATE_ISOLATE.  Because pageblock and max order pages may
@@ -5783,7 +5781,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 	if (ret)
 		goto done;
 
-	ret = __alloc_contig_migrate_range(start, end);
+	ret = __alloc_contig_migrate_range(&cc, start, end);
 	if (ret)
 		goto done;
 
@@ -5832,7 +5830,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 	__reclaim_pages(zone, GFP_HIGHUSER_MOVABLE, end-start);
 
 	/* Grab isolated pages from freelists. */
-	outer_end = isolate_freepages_range(outer_start, end);
+	outer_end = isolate_freepages_range(&cc, outer_start, end);
 	if (!outer_end) {
 		ret = -EBUSY;
 		goto done;
diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2839,6 +2839,14 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 		 */
 		set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
 
+		/*
+		 * Compaction records what page blocks it recently failed to
+		 * isolate pages from and skips them in the future scanning.
+		 * When kswapd is going to sleep, it is reasonable to assume
+		 * that pages and compaction may succeed so reset the cache.
+		 */
+		reset_isolation_suitable(pgdat);
+
 		if (!kthread_should_stop())
 			schedule();
 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [3.6 regression?] THP + migration/compaction livelock (I think)
  2012-11-14 21:50           ` David Rientjes
@ 2012-11-15  1:14             ` Marc Duponcheel
  2012-11-17  0:18               ` Marc Duponcheel
  0 siblings, 1 reply; 16+ messages in thread
From: Marc Duponcheel @ 2012-11-15  1:14 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mel Gorman, Andy Lutomirski, linux-kernel, linux-mm, Marc Duponcheel

 Hi David

Thanks for the changeset

I will test 3.6.6 without and with the patch this weekend.
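
For reference, this is roughly how I plan to apply it (just a sketch; I am
assuming the patch is saved as compaction-backport.patch next to a vanilla
3.6.6 tree, and the --dry-run pass is only there to confirm it applies
cleanly before touching the tree):

# cd linux-3.6.6
# patch -p1 --dry-run < ../compaction-backport.patch
# patch -p1 < ../compaction-backport.patch
# make oldconfig && make -j13
# make modules_install install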

 Have a nice day

On 2012 Nov 14, David Rientjes wrote:
> On Wed, 14 Nov 2012, Marc Duponcheel wrote:
> 
> >  Hi all
> > 
> >  If someone can provide the patches (or teach me how to get them with
> > git; I apologise for not being git-savvy) then, this weekend, I can apply
> > them to 3.6.6 and compare before/after to check whether they fix #49361.
> > 
> 
> I've backported all the commits that Mel quoted to 3.6.6 and appended them 
> to this email as one big patch.  It should apply cleanly to your kernel.
> 
> Now we are only missing these commits that weren't quoted:
> 
>  - 1fb3f8ca0e92 ("mm: compaction: capture a suitable high-order page 
>                   immediately when it is made available"), and
> 
>  - 83fde0f22872 ("mm: vmscan: scale number of pages reclaimed by 
>                   reclaim/compaction based on failures").
> 
> Since your regression is easily reproducible, would it be possible to try 
> to reproduce the issue FIRST with 3.6.6 and, if still present as it was in 
> 3.6.2, then try reproducing it with the appended patch?
> 
> You earlier reported that khugepaged was taking the second-most cpu time 
> when this was happening, which initially pointed you to thp, so presumably 
> this isn't a kswapd issue running at 100%.  If both 3.6.6 kernels fail 
> (the one with and without the following patch), would it be possible to 
> try Mel's suggestion of patching with
> 
>  - https://lkml.org/lkml/2012/11/5/308 +
>    https://lkml.org/lkml/2012/11/12/113
> 
> to see if it helps and, if not, reverting the latter and trying
> 
>  - https://lkml.org/lkml/2012/11/5/308 +
>    https://lkml.org/lkml/2012/11/12/151
> 
> as the final test?  This will certainly help us to find out what needs to 
> be backported to 3.6 stable to prevent this issue for other users.
> 
> Thanks!
> ---
>  include/linux/compaction.h      |   15 ++
>  include/linux/mmzone.h          |    6 +-
>  include/linux/pageblock-flags.h |   19 +-
>  mm/compaction.c                 |  450 +++++++++++++++++++++++++--------------
>  mm/internal.h                   |   16 +-
>  mm/page_alloc.c                 |   42 ++--
>  mm/vmscan.c                     |    8 +
>  7 files changed, 366 insertions(+), 190 deletions(-)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -24,6 +24,7 @@ extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
>  			int order, gfp_t gfp_mask, nodemask_t *mask,
>  			bool sync, bool *contended);
>  extern int compact_pgdat(pg_data_t *pgdat, int order);
> +extern void reset_isolation_suitable(pg_data_t *pgdat);
>  extern unsigned long compaction_suitable(struct zone *zone, int order);
>  
>  /* Do not skip compaction more than 64 times */
> @@ -61,6 +62,16 @@ static inline bool compaction_deferred(struct zone *zone, int order)
>  	return zone->compact_considered < defer_limit;
>  }
>  
> +/* Returns true if restarting compaction after many failures */
> +static inline bool compaction_restarting(struct zone *zone, int order)
> +{
> +	if (order < zone->compact_order_failed)
> +		return false;
> +
> +	return zone->compact_defer_shift == COMPACT_MAX_DEFER_SHIFT &&
> +		zone->compact_considered >= 1UL << zone->compact_defer_shift;
> +}
> +
>  #else
>  static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
>  			int order, gfp_t gfp_mask, nodemask_t *nodemask,
> @@ -74,6 +85,10 @@ static inline int compact_pgdat(pg_data_t *pgdat, int order)
>  	return COMPACT_CONTINUE;
>  }
>  
> +static inline void reset_isolation_suitable(pg_data_t *pgdat)
> +{
> +}
> +
>  static inline unsigned long compaction_suitable(struct zone *zone, int order)
>  {
>  	return COMPACT_SKIPPED;
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -369,8 +369,12 @@ struct zone {
>  	spinlock_t		lock;
>  	int                     all_unreclaimable; /* All pages pinned */
>  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
> -	/* pfn where the last incremental compaction isolated free pages */
> +	/* Set to true when the PG_migrate_skip bits should be cleared */
> +	bool			compact_blockskip_flush;
> +
> +	/* pfns where compaction scanners should start */
>  	unsigned long		compact_cached_free_pfn;
> +	unsigned long		compact_cached_migrate_pfn;
>  #endif
>  #ifdef CONFIG_MEMORY_HOTPLUG
>  	/* see spanned/present_pages for more description */
> diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
> --- a/include/linux/pageblock-flags.h
> +++ b/include/linux/pageblock-flags.h
> @@ -30,6 +30,9 @@ enum pageblock_bits {
>  	PB_migrate,
>  	PB_migrate_end = PB_migrate + 3 - 1,
>  			/* 3 bits required for migrate types */
> +#ifdef CONFIG_COMPACTION
> +	PB_migrate_skip,/* If set the block is skipped by compaction */
> +#endif /* CONFIG_COMPACTION */
>  	NR_PAGEBLOCK_BITS
>  };
>  
> @@ -65,10 +68,22 @@ unsigned long get_pageblock_flags_group(struct page *page,
>  void set_pageblock_flags_group(struct page *page, unsigned long flags,
>  					int start_bitidx, int end_bitidx);
>  
> +#ifdef CONFIG_COMPACTION
> +#define get_pageblock_skip(page) \
> +			get_pageblock_flags_group(page, PB_migrate_skip,     \
> +							PB_migrate_skip + 1)
> +#define clear_pageblock_skip(page) \
> +			set_pageblock_flags_group(page, 0, PB_migrate_skip,  \
> +							PB_migrate_skip + 1)
> +#define set_pageblock_skip(page) \
> +			set_pageblock_flags_group(page, 1, PB_migrate_skip,  \
> +							PB_migrate_skip + 1)
> +#endif /* CONFIG_COMPACTION */
> +
>  #define get_pageblock_flags(page) \
> -			get_pageblock_flags_group(page, 0, NR_PAGEBLOCK_BITS-1)
> +			get_pageblock_flags_group(page, 0, PB_migrate_end)
>  #define set_pageblock_flags(page, flags) \
>  			set_pageblock_flags_group(page, flags,	\
> -						  0, NR_PAGEBLOCK_BITS-1)
> +						  0, PB_migrate_end)
>  
>  #endif	/* PAGEBLOCK_FLAGS_H */
> diff --git a/mm/compaction.c b/mm/compaction.c
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -50,6 +50,111 @@ static inline bool migrate_async_suitable(int migratetype)
>  	return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE;
>  }
>  
> +#ifdef CONFIG_COMPACTION
> +/* Returns true if the pageblock should be scanned for pages to isolate. */
> +static inline bool isolation_suitable(struct compact_control *cc,
> +					struct page *page)
> +{
> +	if (cc->ignore_skip_hint)
> +		return true;
> +
> +	return !get_pageblock_skip(page);
> +}
> +
> +/*
> + * This function is called to clear all cached information on pageblocks that
> + * should be skipped for page isolation when the migrate and free page scanner
> + * meet.
> + */
> +static void __reset_isolation_suitable(struct zone *zone)
> +{
> +	unsigned long start_pfn = zone->zone_start_pfn;
> +	unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
> +	unsigned long pfn;
> +
> +	zone->compact_cached_migrate_pfn = start_pfn;
> +	zone->compact_cached_free_pfn = end_pfn;
> +	zone->compact_blockskip_flush = false;
> +
> +	/* Walk the zone and mark every pageblock as suitable for isolation */
> +	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
> +		struct page *page;
> +
> +		cond_resched();
> +
> +		if (!pfn_valid(pfn))
> +			continue;
> +
> +		page = pfn_to_page(pfn);
> +		if (zone != page_zone(page))
> +			continue;
> +
> +		clear_pageblock_skip(page);
> +	}
> +}
> +
> +void reset_isolation_suitable(pg_data_t *pgdat)
> +{
> +	int zoneid;
> +
> +	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
> +		struct zone *zone = &pgdat->node_zones[zoneid];
> +		if (!populated_zone(zone))
> +			continue;
> +
> +		/* Only flush if a full compaction finished recently */
> +		if (zone->compact_blockskip_flush)
> +			__reset_isolation_suitable(zone);
> +	}
> +}
> +
> +/*
> + * If no pages were isolated then mark this pageblock to be skipped in the
> + * future. The information is later cleared by __reset_isolation_suitable().
> + */
> +static void update_pageblock_skip(struct compact_control *cc,
> +			struct page *page, unsigned long nr_isolated,
> +			bool migrate_scanner)
> +{
> +	struct zone *zone = cc->zone;
> +	if (!page)
> +		return;
> +
> +	if (!nr_isolated) {
> +		unsigned long pfn = page_to_pfn(page);
> +		set_pageblock_skip(page);
> +
> +		/* Update where compaction should restart */
> +		if (migrate_scanner) {
> +			if (!cc->finished_update_migrate &&
> +			    pfn > zone->compact_cached_migrate_pfn)
> +				zone->compact_cached_migrate_pfn = pfn;
> +		} else {
> +			if (!cc->finished_update_free &&
> +			    pfn < zone->compact_cached_free_pfn)
> +				zone->compact_cached_free_pfn = pfn;
> +		}
> +	}
> +}
> +#else
> +static inline bool isolation_suitable(struct compact_control *cc,
> +					struct page *page)
> +{
> +	return true;
> +}
> +
> +static void update_pageblock_skip(struct compact_control *cc,
> +			struct page *page, unsigned long nr_isolated,
> +			bool migrate_scanner)
> +{
> +}
> +#endif /* CONFIG_COMPACTION */
> +
> +static inline bool should_release_lock(spinlock_t *lock)
> +{
> +	return need_resched() || spin_is_contended(lock);
> +}
> +
>  /*
>   * Compaction requires the taking of some coarse locks that are potentially
>   * very heavily contended. Check if the process needs to be scheduled or
> @@ -62,7 +167,7 @@ static inline bool migrate_async_suitable(int migratetype)
>  static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
>  				      bool locked, struct compact_control *cc)
>  {
> -	if (need_resched() || spin_is_contended(lock)) {
> +	if (should_release_lock(lock)) {
>  		if (locked) {
>  			spin_unlock_irqrestore(lock, *flags);
>  			locked = false;
> @@ -70,14 +175,11 @@ static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
>  
>  		/* async aborts if taking too long or contended */
>  		if (!cc->sync) {
> -			if (cc->contended)
> -				*cc->contended = true;
> +			cc->contended = true;
>  			return false;
>  		}
>  
>  		cond_resched();
> -		if (fatal_signal_pending(current))
> -			return false;
>  	}
>  
>  	if (!locked)
> @@ -91,44 +193,85 @@ static inline bool compact_trylock_irqsave(spinlock_t *lock,
>  	return compact_checklock_irqsave(lock, flags, false, cc);
>  }
>  
> +/* Returns true if the page is within a block suitable for migration to */
> +static bool suitable_migration_target(struct page *page)
> +{
> +	int migratetype = get_pageblock_migratetype(page);
> +
> +	/* Don't interfere with memory hot-remove or the min_free_kbytes blocks */
> +	if (migratetype == MIGRATE_ISOLATE || migratetype == MIGRATE_RESERVE)
> +		return false;
> +
> +	/* If the page is a large free page, then allow migration */
> +	if (PageBuddy(page) && page_order(page) >= pageblock_order)
> +		return true;
> +
> +	/* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
> +	if (migrate_async_suitable(migratetype))
> +		return true;
> +
> +	/* Otherwise skip the block */
> +	return false;
> +}
> +
>  /*
>   * Isolate free pages onto a private freelist. Caller must hold zone->lock.
>   * If @strict is true, will abort returning 0 on any invalid PFNs or non-free
>   * pages inside of the pageblock (even though it may still end up isolating
>   * some pages).
>   */
> -static unsigned long isolate_freepages_block(unsigned long blockpfn,
> +static unsigned long isolate_freepages_block(struct compact_control *cc,
> +				unsigned long blockpfn,
>  				unsigned long end_pfn,
>  				struct list_head *freelist,
>  				bool strict)
>  {
>  	int nr_scanned = 0, total_isolated = 0;
> -	struct page *cursor;
> +	struct page *cursor, *valid_page = NULL;
> +	unsigned long nr_strict_required = end_pfn - blockpfn;
> +	unsigned long flags;
> +	bool locked = false;
>  
>  	cursor = pfn_to_page(blockpfn);
>  
> -	/* Isolate free pages. This assumes the block is valid */
> +	/* Isolate free pages. */
>  	for (; blockpfn < end_pfn; blockpfn++, cursor++) {
>  		int isolated, i;
>  		struct page *page = cursor;
>  
> -		if (!pfn_valid_within(blockpfn)) {
> -			if (strict)
> -				return 0;
> -			continue;
> -		}
>  		nr_scanned++;
> +		if (!pfn_valid_within(blockpfn))
> +			continue;
> +		if (!valid_page)
> +			valid_page = page;
> +		if (!PageBuddy(page))
> +			continue;
> +
> +		/*
> +		 * The zone lock must be held to isolate freepages.
> +		 * Unfortunately this is a very coarse lock and can be
> +		 * heavily contended if there are parallel allocations
> +		 * or parallel compactions. For async compaction do not
> +		 * spin on the lock and we acquire the lock as late as
> +		 * possible.
> +		 */
> +		locked = compact_checklock_irqsave(&cc->zone->lock, &flags,
> +								locked, cc);
> +		if (!locked)
> +			break;
> +
> +		/* Recheck this is a suitable migration target under lock */
> +		if (!strict && !suitable_migration_target(page))
> +			break;
>  
> -		if (!PageBuddy(page)) {
> -			if (strict)
> -				return 0;
> +		/* Recheck this is a buddy page under lock */
> +		if (!PageBuddy(page))
>  			continue;
> -		}
>  
>  		/* Found a free page, break it into order-0 pages */
>  		isolated = split_free_page(page);
>  		if (!isolated && strict)
> -			return 0;
> +			break;
>  		total_isolated += isolated;
>  		for (i = 0; i < isolated; i++) {
>  			list_add(&page->lru, freelist);
> @@ -143,6 +286,22 @@ static unsigned long isolate_freepages_block(unsigned long blockpfn,
>  	}
>  
>  	trace_mm_compaction_isolate_freepages(nr_scanned, total_isolated);
> +
> +	/*
> +	 * If strict isolation is requested by CMA then check that all the
> +	 * pages requested were isolated. If there were any failures, 0 is
> +	 * returned and CMA will fail.
> +	 */
> +	if (strict && nr_strict_required > total_isolated)
> +		total_isolated = 0;
> +
> +	if (locked)
> +		spin_unlock_irqrestore(&cc->zone->lock, flags);
> +
> +	/* Update the pageblock-skip if the whole pageblock was scanned */
> +	if (blockpfn == end_pfn)
> +		update_pageblock_skip(cc, valid_page, total_isolated, false);
> +
>  	return total_isolated;
>  }
>  
> @@ -160,17 +319,14 @@ static unsigned long isolate_freepages_block(unsigned long blockpfn,
>   * a free page).
>   */
>  unsigned long
> -isolate_freepages_range(unsigned long start_pfn, unsigned long end_pfn)
> +isolate_freepages_range(struct compact_control *cc,
> +			unsigned long start_pfn, unsigned long end_pfn)
>  {
> -	unsigned long isolated, pfn, block_end_pfn, flags;
> -	struct zone *zone = NULL;
> +	unsigned long isolated, pfn, block_end_pfn;
>  	LIST_HEAD(freelist);
>  
> -	if (pfn_valid(start_pfn))
> -		zone = page_zone(pfn_to_page(start_pfn));
> -
>  	for (pfn = start_pfn; pfn < end_pfn; pfn += isolated) {
> -		if (!pfn_valid(pfn) || zone != page_zone(pfn_to_page(pfn)))
> +		if (!pfn_valid(pfn) || cc->zone != page_zone(pfn_to_page(pfn)))
>  			break;
>  
>  		/*
> @@ -180,10 +336,8 @@ isolate_freepages_range(unsigned long start_pfn, unsigned long end_pfn)
>  		block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages);
>  		block_end_pfn = min(block_end_pfn, end_pfn);
>  
> -		spin_lock_irqsave(&zone->lock, flags);
> -		isolated = isolate_freepages_block(pfn, block_end_pfn,
> +		isolated = isolate_freepages_block(cc, pfn, block_end_pfn,
>  						   &freelist, true);
> -		spin_unlock_irqrestore(&zone->lock, flags);
>  
>  		/*
>  		 * In strict mode, isolate_freepages_block() returns 0 if
> @@ -276,7 +430,8 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>  	isolate_mode_t mode = 0;
>  	struct lruvec *lruvec;
>  	unsigned long flags;
> -	bool locked;
> +	bool locked = false;
> +	struct page *page = NULL, *valid_page = NULL;
>  
>  	/*
>  	 * Ensure that there are not too many pages isolated from the LRU
> @@ -296,23 +451,15 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>  
>  	/* Time to isolate some pages for migration */
>  	cond_resched();
> -	spin_lock_irqsave(&zone->lru_lock, flags);
> -	locked = true;
>  	for (; low_pfn < end_pfn; low_pfn++) {
> -		struct page *page;
> -
>  		/* give a chance to irqs before checking need_resched() */
> -		if (!((low_pfn+1) % SWAP_CLUSTER_MAX)) {
> -			spin_unlock_irqrestore(&zone->lru_lock, flags);
> -			locked = false;
> +		if (locked && !((low_pfn+1) % SWAP_CLUSTER_MAX)) {
> +			if (should_release_lock(&zone->lru_lock)) {
> +				spin_unlock_irqrestore(&zone->lru_lock, flags);
> +				locked = false;
> +			}
>  		}
>  
> -		/* Check if it is ok to still hold the lock */
> -		locked = compact_checklock_irqsave(&zone->lru_lock, &flags,
> -								locked, cc);
> -		if (!locked)
> -			break;
> -
>  		/*
>  		 * migrate_pfn does not necessarily start aligned to a
>  		 * pageblock. Ensure that pfn_valid is called when moving
> @@ -340,6 +487,14 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>  		if (page_zone(page) != zone)
>  			continue;
>  
> +		if (!valid_page)
> +			valid_page = page;
> +
> +		/* If isolation recently failed, do not retry */
> +		pageblock_nr = low_pfn >> pageblock_order;
> +		if (!isolation_suitable(cc, page))
> +			goto next_pageblock;
> +
>  		/* Skip if free */
>  		if (PageBuddy(page))
>  			continue;
> @@ -349,24 +504,43 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>  		 * migration is optimistic to see if the minimum amount of work
>  		 * satisfies the allocation
>  		 */
> -		pageblock_nr = low_pfn >> pageblock_order;
>  		if (!cc->sync && last_pageblock_nr != pageblock_nr &&
>  		    !migrate_async_suitable(get_pageblock_migratetype(page))) {
> -			low_pfn += pageblock_nr_pages;
> -			low_pfn = ALIGN(low_pfn, pageblock_nr_pages) - 1;
> -			last_pageblock_nr = pageblock_nr;
> -			continue;
> +			cc->finished_update_migrate = true;
> +			goto next_pageblock;
>  		}
>  
> +		/* Check may be lockless but that's ok as we recheck later */
>  		if (!PageLRU(page))
>  			continue;
>  
>  		/*
> -		 * PageLRU is set, and lru_lock excludes isolation,
> -		 * splitting and collapsing (collapsing has already
> -		 * happened if PageLRU is set).
> +		 * PageLRU is set. lru_lock normally excludes isolation
> +		 * splitting and collapsing (collapsing has already happened
> +		 * if PageLRU is set) but the lock is not necessarily taken
> +		 * here and it is wasteful to take it just to check transhuge.
> +		 * Check TransHuge without lock and skip the whole pageblock if
> +		 * it's either a transhuge or hugetlbfs page, as calling
> +		 * compound_order() without preventing THP from splitting the
> +		 * page underneath us may return surprising results.
>  		 */
>  		if (PageTransHuge(page)) {
> +			if (!locked)
> +				goto next_pageblock;
> +			low_pfn += (1 << compound_order(page)) - 1;
> +			continue;
> +		}
> +
> +		/* Check if it is ok to still hold the lock */
> +		locked = compact_checklock_irqsave(&zone->lru_lock, &flags,
> +								locked, cc);
> +		if (!locked || fatal_signal_pending(current))
> +			break;
> +
> +		/* Recheck PageLRU and PageTransHuge under lock */
> +		if (!PageLRU(page))
> +			continue;
> +		if (PageTransHuge(page)) {
>  			low_pfn += (1 << compound_order(page)) - 1;
>  			continue;
>  		}
> @@ -383,6 +557,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>  		VM_BUG_ON(PageTransCompound(page));
>  
>  		/* Successfully isolated */
> +		cc->finished_update_migrate = true;
>  		del_page_from_lru_list(page, lruvec, page_lru(page));
>  		list_add(&page->lru, migratelist);
>  		cc->nr_migratepages++;
> @@ -393,6 +568,13 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>  			++low_pfn;
>  			break;
>  		}
> +
> +		continue;
> +
> +next_pageblock:
> +		low_pfn += pageblock_nr_pages;
> +		low_pfn = ALIGN(low_pfn, pageblock_nr_pages) - 1;
> +		last_pageblock_nr = pageblock_nr;
>  	}
>  
>  	acct_isolated(zone, locked, cc);
> @@ -400,6 +582,10 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>  	if (locked)
>  		spin_unlock_irqrestore(&zone->lru_lock, flags);
>  
> +	/* Update the pageblock-skip if the whole pageblock was scanned */
> +	if (low_pfn == end_pfn)
> +		update_pageblock_skip(cc, valid_page, nr_isolated, true);
> +
>  	trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);
>  
>  	return low_pfn;
> @@ -407,43 +593,6 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>  
>  #endif /* CONFIG_COMPACTION || CONFIG_CMA */
>  #ifdef CONFIG_COMPACTION
> -
> -/* Returns true if the page is within a block suitable for migration to */
> -static bool suitable_migration_target(struct page *page)
> -{
> -
> -	int migratetype = get_pageblock_migratetype(page);
> -
> -	/* Don't interfere with memory hot-remove or the min_free_kbytes blocks */
> -	if (migratetype == MIGRATE_ISOLATE || migratetype == MIGRATE_RESERVE)
> -		return false;
> -
> -	/* If the page is a large free page, then allow migration */
> -	if (PageBuddy(page) && page_order(page) >= pageblock_order)
> -		return true;
> -
> -	/* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
> -	if (migrate_async_suitable(migratetype))
> -		return true;
> -
> -	/* Otherwise skip the block */
> -	return false;
> -}
> -
> -/*
> - * Returns the start pfn of the last page block in a zone.  This is the starting
> - * point for full compaction of a zone.  Compaction searches for free pages from
> - * the end of each zone, while isolate_freepages_block scans forward inside each
> - * page block.
> - */
> -static unsigned long start_free_pfn(struct zone *zone)
> -{
> -	unsigned long free_pfn;
> -	free_pfn = zone->zone_start_pfn + zone->spanned_pages;
> -	free_pfn &= ~(pageblock_nr_pages-1);
> -	return free_pfn;
> -}
> -
>  /*
>   * Based on information in the current compact_control, find blocks
>   * suitable for isolating free pages from and then isolate them.
> @@ -453,7 +602,6 @@ static void isolate_freepages(struct zone *zone,
>  {
>  	struct page *page;
>  	unsigned long high_pfn, low_pfn, pfn, zone_end_pfn, end_pfn;
> -	unsigned long flags;
>  	int nr_freepages = cc->nr_freepages;
>  	struct list_head *freelist = &cc->freepages;
>  
> @@ -501,30 +649,16 @@ static void isolate_freepages(struct zone *zone,
>  		if (!suitable_migration_target(page))
>  			continue;
>  
> -		/*
> -		 * Found a block suitable for isolating free pages from. Now
> -		 * we disabled interrupts, double check things are ok and
> -		 * isolate the pages. This is to minimise the time IRQs
> -		 * are disabled
> -		 */
> -		isolated = 0;
> +		/* If isolation recently failed, do not retry */
> +		if (!isolation_suitable(cc, page))
> +			continue;
>  
> -		/*
> -		 * The zone lock must be held to isolate freepages. This
> -		 * unfortunately this is a very coarse lock and can be
> -		 * heavily contended if there are parallel allocations
> -		 * or parallel compactions. For async compaction do not
> -		 * spin on the lock
> -		 */
> -		if (!compact_trylock_irqsave(&zone->lock, &flags, cc))
> -			break;
> -		if (suitable_migration_target(page)) {
> -			end_pfn = min(pfn + pageblock_nr_pages, zone_end_pfn);
> -			isolated = isolate_freepages_block(pfn, end_pfn,
> -							   freelist, false);
> -			nr_freepages += isolated;
> -		}
> -		spin_unlock_irqrestore(&zone->lock, flags);
> +		/* Found a block suitable for isolating free pages from */
> +		isolated = 0;
> +		end_pfn = min(pfn + pageblock_nr_pages, zone_end_pfn);
> +		isolated = isolate_freepages_block(cc, pfn, end_pfn,
> +						   freelist, false);
> +		nr_freepages += isolated;
>  
>  		/*
>  		 * Record the highest PFN we isolated pages from. When next
> @@ -532,17 +666,8 @@ static void isolate_freepages(struct zone *zone,
>  		 * page migration may have returned some pages to the allocator
>  		 */
>  		if (isolated) {
> +			cc->finished_update_free = true;
>  			high_pfn = max(high_pfn, pfn);
> -
> -			/*
> -			 * If the free scanner has wrapped, update
> -			 * compact_cached_free_pfn to point to the highest
> -			 * pageblock with free pages. This reduces excessive
> -			 * scanning of full pageblocks near the end of the
> -			 * zone
> -			 */
> -			if (cc->order > 0 && cc->wrapped)
> -				zone->compact_cached_free_pfn = high_pfn;
>  		}
>  	}
>  
> @@ -551,11 +676,6 @@ static void isolate_freepages(struct zone *zone,
>  
>  	cc->free_pfn = high_pfn;
>  	cc->nr_freepages = nr_freepages;
> -
> -	/* If compact_cached_free_pfn is reset then set it now */
> -	if (cc->order > 0 && !cc->wrapped &&
> -			zone->compact_cached_free_pfn == start_free_pfn(zone))
> -		zone->compact_cached_free_pfn = high_pfn;
>  }
>  
>  /*
> @@ -634,7 +754,7 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
>  
>  	/* Perform the isolation */
>  	low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn);
> -	if (!low_pfn)
> +	if (!low_pfn || cc->contended)
>  		return ISOLATE_ABORT;
>  
>  	cc->migrate_pfn = low_pfn;
> @@ -651,27 +771,19 @@ static int compact_finished(struct zone *zone,
>  	if (fatal_signal_pending(current))
>  		return COMPACT_PARTIAL;
>  
> -	/*
> -	 * A full (order == -1) compaction run starts at the beginning and
> -	 * end of a zone; it completes when the migrate and free scanner meet.
> -	 * A partial (order > 0) compaction can start with the free scanner
> -	 * at a random point in the zone, and may have to restart.
> -	 */
> +	/* Compaction run completes if the migrate and free scanner meet */
>  	if (cc->free_pfn <= cc->migrate_pfn) {
> -		if (cc->order > 0 && !cc->wrapped) {
> -			/* We started partway through; restart at the end. */
> -			unsigned long free_pfn = start_free_pfn(zone);
> -			zone->compact_cached_free_pfn = free_pfn;
> -			cc->free_pfn = free_pfn;
> -			cc->wrapped = 1;
> -			return COMPACT_CONTINUE;
> -		}
> -		return COMPACT_COMPLETE;
> -	}
> +		/*
> +		 * Mark that the PG_migrate_skip information should be cleared
> +		 * by kswapd when it goes to sleep. kswapd does not set the
> +		 * flag itself as the decision to clear it should be based
> +		 * directly on an allocation request.
> +		 */
> +		if (!current_is_kswapd())
> +			zone->compact_blockskip_flush = true;
>  
> -	/* We wrapped around and ended up where we started. */
> -	if (cc->wrapped && cc->free_pfn <= cc->start_free_pfn)
>  		return COMPACT_COMPLETE;
> +	}
>  
>  	/*
>  	 * order == -1 is expected when compacting via
> @@ -754,6 +866,8 @@ unsigned long compaction_suitable(struct zone *zone, int order)
>  static int compact_zone(struct zone *zone, struct compact_control *cc)
>  {
>  	int ret;
> +	unsigned long start_pfn = zone->zone_start_pfn;
> +	unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
>  
>  	ret = compaction_suitable(zone, cc->order);
>  	switch (ret) {
> @@ -766,17 +880,29 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
>  		;
>  	}
>  
> -	/* Setup to move all movable pages to the end of the zone */
> -	cc->migrate_pfn = zone->zone_start_pfn;
> -
> -	if (cc->order > 0) {
> -		/* Incremental compaction. Start where the last one stopped. */
> -		cc->free_pfn = zone->compact_cached_free_pfn;
> -		cc->start_free_pfn = cc->free_pfn;
> -	} else {
> -		/* Order == -1 starts at the end of the zone. */
> -		cc->free_pfn = start_free_pfn(zone);
> +	/*
> +	 * Setup to move all movable pages to the end of the zone. Use cached
> +	 * information on where the scanners should start but check that it
> +	 * is initialised by ensuring the values are within zone boundaries.
> +	 */
> +	cc->migrate_pfn = zone->compact_cached_migrate_pfn;
> +	cc->free_pfn = zone->compact_cached_free_pfn;
> +	if (cc->free_pfn < start_pfn || cc->free_pfn > end_pfn) {
> +		cc->free_pfn = end_pfn & ~(pageblock_nr_pages-1);
> +		zone->compact_cached_free_pfn = cc->free_pfn;
>  	}
> +	if (cc->migrate_pfn < start_pfn || cc->migrate_pfn > end_pfn) {
> +		cc->migrate_pfn = start_pfn;
> +		zone->compact_cached_migrate_pfn = cc->migrate_pfn;
> +	}
> +
> +	/*
> +	 * Clear pageblock skip if there were failures recently and compaction
> +	 * is about to be retried after being deferred. kswapd does not do
> +	 * this reset as it'll reset the cached information when going to sleep.
> +	 */
> +	if (compaction_restarting(zone, cc->order) && !current_is_kswapd())
> +		__reset_isolation_suitable(zone);
>  
>  	migrate_prep_local();
>  
> @@ -787,6 +913,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
>  		switch (isolate_migratepages(zone, cc)) {
>  		case ISOLATE_ABORT:
>  			ret = COMPACT_PARTIAL;
> +			putback_lru_pages(&cc->migratepages);
> +			cc->nr_migratepages = 0;
>  			goto out;
>  		case ISOLATE_NONE:
>  			continue;
> @@ -831,6 +959,7 @@ static unsigned long compact_zone_order(struct zone *zone,
>  				 int order, gfp_t gfp_mask,
>  				 bool sync, bool *contended)
>  {
> +	unsigned long ret;
>  	struct compact_control cc = {
>  		.nr_freepages = 0,
>  		.nr_migratepages = 0,
> @@ -838,12 +967,17 @@ static unsigned long compact_zone_order(struct zone *zone,
>  		.migratetype = allocflags_to_migratetype(gfp_mask),
>  		.zone = zone,
>  		.sync = sync,
> -		.contended = contended,
>  	};
>  	INIT_LIST_HEAD(&cc.freepages);
>  	INIT_LIST_HEAD(&cc.migratepages);
>  
> -	return compact_zone(zone, &cc);
> +	ret = compact_zone(zone, &cc);
> +
> +	VM_BUG_ON(!list_empty(&cc.freepages));
> +	VM_BUG_ON(!list_empty(&cc.migratepages));
> +
> +	*contended = cc.contended;
> +	return ret;
>  }
>  
>  int sysctl_extfrag_threshold = 500;
> @@ -855,6 +989,8 @@ int sysctl_extfrag_threshold = 500;
>   * @gfp_mask: The GFP mask of the current allocation
>   * @nodemask: The allowed nodes to allocate from
>   * @sync: Whether migration is synchronous or not
> + * @contended: Return value that is true if compaction was aborted due to lock contention
> + * @page: Optionally capture a free page of the requested order during compaction
>   *
>   * This is the main entry point for direct page compaction.
>   */
> diff --git a/mm/internal.h b/mm/internal.h
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -118,23 +118,23 @@ struct compact_control {
>  	unsigned long nr_freepages;	/* Number of isolated free pages */
>  	unsigned long nr_migratepages;	/* Number of pages to migrate */
>  	unsigned long free_pfn;		/* isolate_freepages search base */
> -	unsigned long start_free_pfn;	/* where we started the search */
>  	unsigned long migrate_pfn;	/* isolate_migratepages search base */
>  	bool sync;			/* Synchronous migration */
> -	bool wrapped;			/* Order > 0 compactions are
> -					   incremental, once free_pfn
> -					   and migrate_pfn meet, we restart
> -					   from the top of the zone;
> -					   remember we wrapped around. */
> +	bool ignore_skip_hint;		/* Scan blocks even if marked skip */
> +	bool finished_update_free;	/* True when the zone cached pfns are
> +					 * no longer being updated
> +					 */
> +	bool finished_update_migrate;
>  
>  	int order;			/* order a direct compactor needs */
>  	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
>  	struct zone *zone;
> -	bool *contended;		/* True if a lock was contended */
> +	bool contended;			/* True if a lock was contended */
>  };
>  
>  unsigned long
> -isolate_freepages_range(unsigned long start_pfn, unsigned long end_pfn);
> +isolate_freepages_range(struct compact_control *cc,
> +			unsigned long start_pfn, unsigned long end_pfn);
>  unsigned long
>  isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>  			   unsigned long low_pfn, unsigned long end_pfn);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2131,6 +2131,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  				alloc_flags & ~ALLOC_NO_WATERMARKS,
>  				preferred_zone, migratetype);
>  		if (page) {
> +			preferred_zone->compact_blockskip_flush = false;
>  			preferred_zone->compact_considered = 0;
>  			preferred_zone->compact_defer_shift = 0;
>  			if (order >= preferred_zone->compact_order_failed)
> @@ -4438,11 +4439,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
>  
>  		zone->spanned_pages = size;
>  		zone->present_pages = realsize;
> -#if defined CONFIG_COMPACTION || defined CONFIG_CMA
> -		zone->compact_cached_free_pfn = zone->zone_start_pfn +
> -						zone->spanned_pages;
> -		zone->compact_cached_free_pfn &= ~(pageblock_nr_pages-1);
> -#endif
>  #ifdef CONFIG_NUMA
>  		zone->node = nid;
>  		zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio)
> @@ -5632,7 +5628,8 @@ __alloc_contig_migrate_alloc(struct page *page, unsigned long private,
>  }
>  
>  /* [start, end) must belong to a single zone. */
> -static int __alloc_contig_migrate_range(unsigned long start, unsigned long end)
> +static int __alloc_contig_migrate_range(struct compact_control *cc,
> +					unsigned long start, unsigned long end)
>  {
>  	/* This function is based on compact_zone() from compaction.c. */
>  
> @@ -5640,25 +5637,17 @@ static int __alloc_contig_migrate_range(unsigned long start, unsigned long end)
>  	unsigned int tries = 0;
>  	int ret = 0;
>  
> -	struct compact_control cc = {
> -		.nr_migratepages = 0,
> -		.order = -1,
> -		.zone = page_zone(pfn_to_page(start)),
> -		.sync = true,
> -	};
> -	INIT_LIST_HEAD(&cc.migratepages);
> -
>  	migrate_prep_local();
>  
> -	while (pfn < end || !list_empty(&cc.migratepages)) {
> +	while (pfn < end || !list_empty(&cc->migratepages)) {
>  		if (fatal_signal_pending(current)) {
>  			ret = -EINTR;
>  			break;
>  		}
>  
> -		if (list_empty(&cc.migratepages)) {
> -			cc.nr_migratepages = 0;
> -			pfn = isolate_migratepages_range(cc.zone, &cc,
> +		if (list_empty(&cc->migratepages)) {
> +			cc->nr_migratepages = 0;
> +			pfn = isolate_migratepages_range(cc->zone, cc,
>  							 pfn, end);
>  			if (!pfn) {
>  				ret = -EINTR;
> @@ -5670,12 +5659,12 @@ static int __alloc_contig_migrate_range(unsigned long start, unsigned long end)
>  			break;
>  		}
>  
> -		ret = migrate_pages(&cc.migratepages,
> +		ret = migrate_pages(&cc->migratepages,
>  				    __alloc_contig_migrate_alloc,
>  				    0, false, MIGRATE_SYNC);
>  	}
>  
> -	putback_lru_pages(&cc.migratepages);
> +	putback_lru_pages(&cc->migratepages);
>  	return ret > 0 ? 0 : ret;
>  }
>  
> @@ -5754,6 +5743,15 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>  	unsigned long outer_start, outer_end;
>  	int ret = 0, order;
>  
> +	struct compact_control cc = {
> +		.nr_migratepages = 0,
> +		.order = -1,
> +		.zone = page_zone(pfn_to_page(start)),
> +		.sync = true,
> +		.ignore_skip_hint = true,
> +	};
> +	INIT_LIST_HEAD(&cc.migratepages);
> +
>  	/*
>  	 * What we do here is we mark all pageblocks in range as
>  	 * MIGRATE_ISOLATE.  Because pageblock and max order pages may
> @@ -5783,7 +5781,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>  	if (ret)
>  		goto done;
>  
> -	ret = __alloc_contig_migrate_range(start, end);
> +	ret = __alloc_contig_migrate_range(&cc, start, end);
>  	if (ret)
>  		goto done;
>  
> @@ -5832,7 +5830,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>  	__reclaim_pages(zone, GFP_HIGHUSER_MOVABLE, end-start);
>  
>  	/* Grab isolated pages from freelists. */
> -	outer_end = isolate_freepages_range(outer_start, end);
> +	outer_end = isolate_freepages_range(&cc, outer_start, end);
>  	if (!outer_end) {
>  		ret = -EBUSY;
>  		goto done;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2839,6 +2839,14 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
>  		 */
>  		set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
>  
> +		/*
> +		 * Compaction records what page blocks it recently failed to
> +		 * isolate pages from and skips them in the future scanning.
> +		 * When kswapd is going to sleep, it is reasonable to assume
> +		 * that pages and compaction may succeed so reset the cache.
> +		 */
> +		reset_isolation_suitable(pgdat);
> +
>  		if (!kthread_should_stop())
>  			schedule();
>  
> 

--
 Marc Duponcheel
 Velodroomstraat 74 - 2600 Berchem - Belgium
 +32 (0)478 68.10.91 - marc@offline.be

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [3.6 regression?] THP + migration/compaction livelock (I think)
  2012-11-15  1:14             ` Marc Duponcheel
@ 2012-11-17  0:18               ` Marc Duponcheel
  2012-11-18 22:55                 ` David Rientjes
  0 siblings, 1 reply; 16+ messages in thread
From: Marc Duponcheel @ 2012-11-17  0:18 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mel Gorman, Andy Lutomirski, linux-kernel, linux-mm, Marc Duponcheel

[-- Attachment #1: Type: text/plain, Size: 1433 bytes --]

 Hi David, others

Results seem OK

 recap: I have two 6-core 64-bit Opterons and I build with make -j13

I do

# echo always >/sys/kernel/mm/transparent_hugepage/enabled
# while [ 1 ]
  do
   sleep 10
   date
   echo = vmstat
   egrep "(thp|compact)" /proc/vmstat
   echo = khugepaged stack
   cat /proc/501/stack
 done > /tmp/49361.xxxx
# emerge icedtea
(where 501 = pidof khugepaged)
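
A more robust form of the stack line, in case khugepaged comes up with a
different pid after a reboot, would be something like this (just a sketch,
the loop above is what I actually ran):

   cat /proc/$(pidof khugepaged)/stack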

for xxxx = base = 3.6.6
and xxxx = test = 3.6.6 + diff you provided

I attach 
 /tmp/49361.base.gz
and
 /tmp/49361.test.gz

Note:

 with xxx=base, I could see
  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM     TIME+ COMMAND
 8617 root      20   0 3620m  41m  10m S 988.3  0.5   6:19.06 javac
    1 root      20   0  4208  588  556 S   0.0  0.0   0:03.25 init
 already during configure and I needed to kill -9 javac

 with xxx=test, I could see
  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM     TIME+ COMMAND
9275 root      20   0 2067m 474m  10m S 304.2  5.9   0:32.81 javac
 710 root       0 -20     0    0    0 S   0.3  0.0   0:01.07 kworker/0:1H
 later when processing >700 java files

Also note that with xxx=test compact_blocks_moved stays 0
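
In case it helps, the final counter values can be pulled straight out of the
attached logs with something like this (a rough sketch; the logs are simply
the output of the loop above):

# zcat 49361.base.gz | grep compact_blocks_moved | tail -n1
# zcat 49361.test.gz | grep compact_blocks_moved | tail -n1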

hope this helps

 Thanks

have a nice day

On 2012 Nov 15, Marc Duponcheel wrote:
>  Hi David
> 
> Thanks for the changeset
> 
> I will test 3.6.6 without and with the patch this weekend.
> 
>  Have a nice day

--
 Marc Duponcheel
 Velodroomstraat 74 - 2600 Berchem - Belgium
 +32 (0)478 68.10.91 - marc@offline.be

[-- Attachment #2: 49361.base.gz --]
[-- Type: application/octet-stream, Size: 861 bytes --]

[-- Attachment #3: 49361.test.gz --]
[-- Type: application/octet-stream, Size: 691 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [3.6 regression?] THP + migration/compaction livelock (I think)
  2012-11-17  0:18               ` Marc Duponcheel
@ 2012-11-18 22:55                 ` David Rientjes
  2012-12-05 19:23                   ` Andy Lutomirski
  0 siblings, 1 reply; 16+ messages in thread
From: David Rientjes @ 2012-11-18 22:55 UTC (permalink / raw)
  To: Marc Duponcheel; +Cc: Mel Gorman, Andy Lutomirski, linux-kernel, linux-mm

On Sat, 17 Nov 2012, Marc Duponcheel wrote:

> # echo always >/sys/kernel/mm/transparent_hugepage/enabled
> # while [ 1 ]
>   do
>    sleep 10
>    date
>    echo = vmstat
>    egrep "(thp|compact)" /proc/vmstat
>    echo = khugepaged stack
>    cat /proc/501/stack
>  done > /tmp/49361.xxxx
> # emerge icedtea
> (where 501 = pidof khugepaged)
> 
> for xxxx = base = 3.6.6
> and xxxx = test = 3.6.6 + diff you provided
> 
> I attach 
>  /tmp/49361.base.gz
> and
>  /tmp/49361.test.gz
> 
> Note:
> 
>  with xxx=base, I could see
>   PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM     TIME+ COMMAND
>  8617 root      20   0 3620m  41m  10m S 988.3  0.5   6:19.06 javac
>     1 root      20   0  4208  588  556 S   0.0  0.0   0:03.25 init
>  already during configure and I needed to kill -9 javac
> 
>  with xxx=test, I could see
>   PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM     TIME+ COMMAND
> 9275 root      20   0 2067m 474m  10m S 304.2  5.9   0:32.81 javac
>  710 root       0 -20     0    0    0 S   0.3  0.0   0:01.07 kworker/0:1H
>  later when processing >700 java files
> 
> Also note that with xxx=test compact_blocks_moved stays 0
> 

Sounds good!  Andy, have you had the opportunity to try to reproduce your 
issue with the backports that Mel listed?  I think he'll be considering 
asking for some of these to be backported to a future stable release, so
any input you can provide would certainly be helpful.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [3.6 regression?] THP + migration/compaction livelock (I think)
  2012-11-18 22:55                 ` David Rientjes
@ 2012-12-05 19:23                   ` Andy Lutomirski
  0 siblings, 0 replies; 16+ messages in thread
From: Andy Lutomirski @ 2012-12-05 19:23 UTC (permalink / raw)
  To: David Rientjes; +Cc: Marc Duponcheel, Mel Gorman, linux-kernel, linux-mm

On Sun, Nov 18, 2012 at 2:55 PM, David Rientjes <rientjes@google.com> wrote:
> On Sat, 17 Nov 2012, Marc Duponcheel wrote:
>
>> # echo always >/sys/kernel/mm/transparent_hugepage/enabled
>> # while [ 1 ]
>>   do
>>    sleep 10
>>    date
>>    echo = vmstat
>>    egrep "(thp|compact)" /proc/vmstat
>>    echo = khugepaged stack
>>    cat /proc/501/stack
>>  done > /tmp/49361.xxxx
>> # emerge icedtea
>> (where 501 = pidof khugepaged)
>>
>> for xxxx = base = 3.6.6
>> and xxxx = test = 3.6.6 + diff you provided
>>
>> I attach
>>  /tmp/49361.base.gz
>> and
>>  /tmp/49361.test.gz
>>
>> Note:
>>
>>  with xxx=base, I could see
>>   PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM     TIME+ COMMAND
>>  8617 root      20   0 3620m  41m  10m S 988.3  0.5   6:19.06 javac
>>     1 root      20   0  4208  588  556 S   0.0  0.0   0:03.25 init
>>  already during configure and I needed to kill -9 javac
>>
>>  with xxx=test, I could see
>>   PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM     TIME+ COMMAND
>> 9275 root      20   0 2067m 474m  10m S 304.2  5.9   0:32.81 javac
>>  710 root       0 -20     0    0    0 S   0.3  0.0   0:01.07 kworker/0:1H
>>  later when processing >700 java files
>>
>> Also note that with xxx=test compact_blocks_moved stays 0
>>
>
> Sounds good!  Andy, have you had the opportunity to try to reproduce your
> issue with the backports that Mel listed?  I think he'll be considering
> asking for some of these to be backported for a future stable release so
> any input you can provide would certainly be helpful.

I've had an impressive amount of trouble even reproducing it on 3.6.
Apparently I haven't hit the magic combination yet.  I'll give it
another try soon.

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2012-12-05 19:23 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-11-13 22:13 [3.6 regression?] THP + migration/compaction livelock (I think) Andy Lutomirski
2012-11-13 23:11 ` David Rientjes
2012-11-13 23:25   ` Andy Lutomirski
2012-11-13 23:41     ` David Rientjes
2012-11-13 23:45       ` Andy Lutomirski
2012-11-13 23:54         ` David Rientjes
2012-11-14  1:22           ` Marc Duponcheel
2012-11-14  1:51             ` David Rientjes
2012-11-14 13:21               ` Marc Duponcheel
2012-11-14 10:01       ` Mel Gorman
2012-11-14 13:29         ` Marc Duponcheel
2012-11-14 21:50           ` David Rientjes
2012-11-15  1:14             ` Marc Duponcheel
2012-11-17  0:18               ` Marc Duponcheel
2012-11-18 22:55                 ` David Rientjes
2012-12-05 19:23                   ` Andy Lutomirski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).