* excessive CPU utilization by isolate_freepages?
@ 2012-06-27 21:59 ` Jim Schutt
  0 siblings, 0 replies; 24+ messages in thread
From: Jim Schutt @ 2012-06-27 21:59 UTC (permalink / raw)
  To: Andrew Morton, Mel Gorman, KAMEZAWA Hiroyuki, Rik van Riel, Minchan Kim
  Cc: linux-mm, linux-kernel, ceph-devel

Hi,

I'm running into trouble with systems going unresponsive,
and perf suggests it's excessive CPU usage by isolate_freepages().
I'm currently testing 3.5-rc4, but I think this problem may have
first shown up in 3.4.  I'm only just learning how to use perf,
so for now I only have results to report for 3.5-rc4.

(FWIW I'm running my distro's version of perf; please let me know
if I need to compile the tools/perf version to match my kernel.)

The systems in question have 24 SAS drives spread across 3 HBAs and
run 24 Ceph OSD instances, one per drive.  FWIW these servers are
dual-socket Intel Xeon 5675 boxes w/ 48 GB memory.  I've got ~160
Linux Ceph clients simultaneously running dd against a Ceph file
system backed by 12 of these servers.
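
(For reference, each client is just doing a large streaming write,
along these lines; the mount point and sizes below are placeholders,
not the exact command:)

  # one sequential writer per client (illustrative only)
  dd if=/dev/zero of=/mnt/ceph/$(hostname).out bs=4M count=16384 conv=fdatasync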

In the early phase of such a test, when things are running well,
here's what vmstat reports for the state of one of these servers:

2012-06-27 13:56:58.356-06:00
vmstat -w 4 16
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
  r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
31 15          0     287216        576   38606628    0    0     2  1158    2   14   1  3  95  0  0
27 15          0     225288        576   38583384    0    0    18 2222016 203357 134876  11 56  17 15  0
28 17          0     219256        576   38544736    0    0    11 2305932 203141 146296  11 49  23 17  0
  6 18          0     215596        576   38552872    0    0     7 2363207 215264 166502  12 45  22 20  0
22 18          0     226984        576   38596404    0    0     3 2445741 223114 179527  12 43  23 22  0
30 12          0     230844        576   38461648    0    0    14 2298537 216580 166661  12 45  23 20  0
29 19          0     237856        576   38422884    0    0     5 2332741 209408 157138  12 42  25 22  0
17 11          0     222156        576   38483992    0    0     4 2380495 210312 173121  12 39  28 22  0
11 13          0     216152        576   38463872    0    0    44 2362186 215236 176454  12 39  27 22  0
12 14          0     223704        576   38546720    0    0     2 2395530 214684 177518  12 39  28 22  0
39 17          0     219932        576   38598184    0    0     4 2428231 223284 179095  12 42  24 21  0
11 10          0     219956        576   38521048    0    0     4 2323853 207472 166665  12 39  28 21  0
23 15          0     216580        576   38451904    0    0     3 2241800 201049 163496  11 37  31 21  0
  9 13          0     225792        576   38451916    0    0    13 2281900 204869 171814  11 36  30 23  0
14 12          0     233820        576   38492728    0    0     4 2293423 207686 173019  11 37  31 21  0
11 20          0     213796        576   38533208    0    0     3 2288636 205605 168118  11 37  31 21  0


The system begins to struggle over the next several
minutes; here's what vmstat has to say:

2012-06-27 13:57:58.831-06:00
vmstat -w 4 16
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
  r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
21 16          0     224628        576   38526872    0    0     2  1233    9   19   1  4  95  0  0
12 19          0     232060        576   38501020    0    0     4 2366769 221418 159890  12 48  20 19  0
124  8          0     218548        576   38379656    0    0    13 2103075 199660 108618  11 65  13 12  0
24 10          0     300476        576   38230288    0    0    31 1966282 177472 84572  10 76   7  7  0
20 16          0     217584        576   38296700    0    0     9 2062571 195936 128810  10 55  20 15  0
53 12          0     235720        576   38247968    0    0    30 2035407 196973 133921  10 52  23 15  0
20 16          0     360340        576   38067992    0    0     6 2192179 208692 136784  11 54  19 15  0
26 10          0     310800        576   38093884    0    0    43 2138772 207105 118718  11 64  12 13  0
24 15          0     261108        576   38030828    0    0    68 2174015 205793 135302  11 56  18 15  0
  9 17          0     241816        576   37982072    0    0    20 2076145 194971 120285  10 60  16 14  0
37 15          0     255972        576   37892868    0    0    14 2225076 205694 126954  11 59  16 13  0
27 16          0     243212        576   37872704    0    0     6 2249476 210885 134684  12 60  15 14  0
30 10          0     217572        576   37795388    0    0     3 2128688 205027 118319  11 66  12 11  0
26 11          0     236420        576   37740164    0    0    23 2109709 205105 133925  10 56  19 14  0
45 15          0     330056        576   37619896    0    0    15 1948311 196188 119330  10 62  15 13  0
54 15          0     242696        576   37631500    0    0     4 2159530 202595 132588  11 59  16 15  0

2012-06-27 13:58:59.569-06:00
vmstat -w 4 16
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
  r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
14 15          0     274932        576   37621548    0    0     2  1301   15    0   1  4  95  0  0
44 16          0     278748        576   37509516    0    0     6 2097643 196722 112697  10 66  12 11  0
88 14          0     228088        576   37412008    0    0     6 2089559 202206 116146  10 66  13 11  0
94 11          0     348348        576   37270624    0    0     7 1906390 181488 87333   9 76   9  6  0
42 11          0     215996        576   37288556    0    0     6 1782459 184375 95901   9 74   9  7  0
45 11          0     323112        576   37146500    0    0    12 1868376 187814 103947   9 71  11  9  0
51  7          0     244560        576   37100124    0    0    35 1767496 181383 107259   9 68  12 11  0
74 12          0     221584        576   37016420    0    0    18 1884986 183376 93425   9 75   9  7  0
45 10          0     275564        576   36985324    0    0    23 1683688 167223 97036   8 75   9  8  0
19 10          0     322176        576   36813176    0    0    14 1747378 177594 97218   8 72  12  8  0
122  7          0     225256        576   36838084    0    0    26 1730643 177915 92621   8 75   9  8  0
243 10          0     223464        576   36765460    0    0    18 1730158 173059 79373   8 80   6  5  0
100 10          0     307528        576   36598456    0    0     4 1738567 174077 79585   9 82   6  4  0
243  6          0     370064        576   36358576    0    0     9 1586528 174680 85353   8 81   6  5  0
267  2          0     322640        576   36254044    0    0    40 1011650 129389 42277   5 93   1  1  0
210  4          0     505092        576   35865460    0    0    25 720825 116356 32422   3 96   1  0  0

2012-06-27 14:00:03.219-06:00
vmstat -w 4 16
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
  r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
75  1          0     566988        576   35664800    0    0     2  1355   21    3   1  4  95  0  0
433  1          0     964052        576   35069112    0    0     7 456359 102256 20901   2 98   0  0  0
547  3          0     820116        576   34893932    0    0    57 560507 114878 28115   3 96   0  0  0
806  2          0     606992        576   34848180    0    0   339 309668 101230 21056   2 98   0  0  0
708  1          0     529624        576   34708000    0    0   248 370886 101327 20062   2 97   0  0  0
231  5          0     504772        576   34663880    0    0   305 334824 95045 20407   2 97   1  1  0
158  6          0    1063088        576   33518536    0    0   531 847435 130696 47140   4 92   1  2  0
193  0          0    1449156        576   33035572    0    0   363 371279 94470 18955   2 96   1  1  0
266  6          0    1623512        576   32728164    0    0    77 241114 95730 15483   2 98   0  0  0
243  8          0    1629504        576   32653080    0    0    81 471018 100223 20920   3 96   0  1  0
70 11          0    1342140        576   33084020    0    0   100 925869 139876 56599   6 88   3  3  0
211  7          0    1130316        576   33470432    0    0   290 1008984 150699 74320   6 83   6  5  0
365  3          0     776736        576   34072772    0    0   182 747167 139436 67135   5 88   4  3  0
29  1          0    1528412        576   34110640    0    0    50 612181 137403 77609   4 87   6  3  0
266  5          0    1657688        576   34105696    0    0     3 258307 62879 38508   2 93   3  2  0
1159  2          0    2002256        576   33775476    0    0    19 88554 42112 14230   1 98   0  0  0


Right around 14:00 I was able to capture a "perf record -a -g"; here's
the beginning of what "perf report --sort symbol --call-graph fractal,5"
had to say:

#
     64.86%  [k] _raw_spin_lock_irqsave
             |
             |--97.94%-- isolate_freepages
             |          compaction_alloc
             |          unmap_and_move
             |          migrate_pages
             |          compact_zone
             |          |
             |          |--99.56%-- try_to_compact_pages
             |          |          __alloc_pages_direct_compact
             |          |          __alloc_pages_slowpath
             |          |          __alloc_pages_nodemask
             |          |          alloc_pages_vma
             |          |          do_huge_pmd_anonymous_page
             |          |          handle_mm_fault
             |          |          do_page_fault
             |          |          page_fault
             |          |          |
             |          |          |--53.53%-- skb_copy_datagram_iovec
             |          |          |          tcp_recvmsg
             |          |          |          inet_recvmsg
             |          |          |          sock_recvmsg
             |          |          |          sys_recvfrom
             |          |          |          system_call_fastpath
             |          |          |          __recv
             |          |          |          |
             |          |          |           --100.00%-- (nil)
             |          |          |
             |          |          |--27.80%-- __pthread_create_2_1
             |          |          |          (nil)
             |          |          |
             |          |           --18.67%-- memcpy
             |          |                     |
             |          |                     |--57.38%-- 0x50d000005
             |          |                     |
             |          |                     |--34.52%-- 0x3b300bf271940a35
             |          |                     |
             |          |                      --8.10%-- 0x1500000000000009
             |           --0.44%-- [...]
              --2.06%-- [...]

      6.15%  [k] isolate_freepages_block
             |
             |--99.95%-- isolate_freepages
             |          compaction_alloc
             |          unmap_and_move
             |          migrate_pages
             |          compact_zone
             |          |
             |          |--99.54%-- try_to_compact_pages
             |          |          __alloc_pages_direct_compact
             |          |          __alloc_pages_slowpath
             |          |          __alloc_pages_nodemask
             |          |          alloc_pages_vma
             |          |          do_huge_pmd_anonymous_page
             |          |          handle_mm_fault
             |          |          do_page_fault
             |          |          page_fault
             |          |          |
             |          |          |--54.40%-- skb_copy_datagram_iovec
             |          |          |          tcp_recvmsg
             |          |          |          inet_recvmsg
             |          |          |          sock_recvmsg
             |          |          |          sys_recvfrom
             |          |          |          system_call_fastpath
             |          |          |          __recv
             |          |          |          |
             |          |          |           --100.00%-- (nil)
             |          |          |
             |          |          |--25.19%-- __pthread_create_2_1
             |          |          |          (nil)
             |          |          |
             |          |           --20.41%-- memcpy
             |          |                     |
             |          |                     |--40.24%-- 0x3b300bf271940a35
             |          |                     |
             |          |                     |--38.29%-- 0x1500000000000009
             |          |                     |
             |          |                      --21.47%-- 0x50d000005
             |           --0.46%-- [...]
              --0.05%-- [...]

      3.96%  [.] ceph_crc32c_le
             |
             |--99.99%-- 0xb8057558d0065990
              --0.01%-- [...]
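
(For completeness, the capture/report sequence for the profile above is
roughly the following; the length of the profiling window is arbitrary:)

  # system-wide profile with call graphs; 30s window chosen arbitrarily
  perf record -a -g -- sleep 30
  # report sorted by symbol with the fractal call-graph view
  perf report --sort symbol --call-graph fractal,5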

A different system in the same test had a slightly different
call tree, but isolate_freepages() still seems to show up
prominently:

#
     32.32%  [k] _raw_spin_lock_irqsave
             |
             |--97.64%-- isolate_freepages
             |          compaction_alloc
             |          unmap_and_move
             |          migrate_pages
             |          compact_zone
             |          try_to_compact_pages
             |          __alloc_pages_direct_compact
             |          __alloc_pages_slowpath
             |          __alloc_pages_nodemask
             |          alloc_pages_vma
             |          do_huge_pmd_anonymous_page
             |          handle_mm_fault
             |          do_page_fault
             |          page_fault
             |          |
             |          |--65.31%-- skb_copy_datagram_iovec
             |          |          tcp_recvmsg
             |          |          inet_recvmsg
             |          |          sock_recvmsg
             |          |          sys_recvfrom
             |          |          system_call_fastpath
             |          |          __recv
             |          |          |
             |          |           --100.00%-- (nil)
             |          |
             |          |--30.98%-- memcpy
             |          |          |
             |          |          |--50.60%-- 0x50d0000
             |          |          |
             |          |           --49.40%-- 0x50d000005
             |           --3.70%-- [...]
              --2.36%-- [...]

     17.10%  [k] _raw_spin_lock_irq
             |
             |--98.27%-- isolate_migratepages_range
             |          compact_zone
             |          try_to_compact_pages
             |          __alloc_pages_direct_compact
             |          __alloc_pages_slowpath
             |          __alloc_pages_nodemask
             |          alloc_pages_vma
             |          do_huge_pmd_anonymous_page
             |          handle_mm_fault
             |          do_page_fault
             |          page_fault
             |          |
             |          |--99.85%-- __pthread_create_2_1
             |          |          (nil)
             |           --0.15%-- [...]
              --1.73%-- [...]

      4.59%  [k] mutex_spin_on_owner
             |
             --- __mutex_lock_slowpath
                 mutex_lock
                |
                |--50.14%-- page_lock_anon_vma
                |          |
                |          |--99.99%-- try_to_unmap_anon
                |          |          try_to_unmap
                |          |          __unmap_and_move
                |          |          unmap_and_move
                |          |          migrate_pages
                |          |          compact_zone
                |          |          try_to_compact_pages
                |          |          __alloc_pages_direct_compact
                |          |          __alloc_pages_slowpath
                |          |          __alloc_pages_nodemask
                |          |          alloc_pages_vma
                |          |          do_huge_pmd_anonymous_page
                |          |          handle_mm_fault
                |          |          do_page_fault
                |          |          page_fault
                |          |          |
                |          |          |--99.91%-- __pthread_create_2_1
                |          |          |          (nil)
                |          |           --0.09%-- [...]
                |           --0.01%-- [...]
                |
                |--49.67%-- rmap_walk
                |          move_to_new_page
                |          __unmap_and_move
                |          unmap_and_move
                |          migrate_pages
                |          compact_zone
                |          try_to_compact_pages
                |          __alloc_pages_direct_compact
                |          __alloc_pages_slowpath
                |          __alloc_pages_nodemask
                |          alloc_pages_vma
                |          do_huge_pmd_anonymous_page
                |          handle_mm_fault
                |          do_page_fault
                |          page_fault
                |          |
                |          |--99.69%-- __pthread_create_2_1
                |          |          (nil)
                |           --0.31%-- [...]
                 --0.20%-- [...]

      4.10%  [k] isolate_freepages_block
             |
             |--99.95%-- isolate_freepages
             |          compaction_alloc
             |          unmap_and_move
             |          migrate_pages
             |          compact_zone
             |          try_to_compact_pages
             |          __alloc_pages_direct_compact
             |          __alloc_pages_slowpath
             |          __alloc_pages_nodemask
             |          alloc_pages_vma
             |          do_huge_pmd_anonymous_page
             |          handle_mm_fault
             |          do_page_fault
             |          page_fault
             |          |
             |          |--46.97%-- skb_copy_datagram_iovec
             |          |          tcp_recvmsg
             |          |          inet_recvmsg
             |          |          sock_recvmsg
             |          |          sys_recvfrom
             |          |          system_call_fastpath
             |          |          __recv
             |          |          |
             |          |           --100.00%-- (nil)
             |          |
             |          |--31.79%-- __pthread_create_2_1
             |          |          (nil)
             |          |
             |           --21.24%-- memcpy
             |                     |
             |                     |--61.90%-- 0x50d000005
             |                     |
             |                      --38.10%-- 0x50d0000
              --0.05%-- [...]

      3.65%  [.] ceph_crc32c_le
             |
             |--99.86%-- 0xb8057558d0065990
              --0.14%-- [...]



I seem to be able to recreate this issue at will, so please
let me know what I can do to help figure out what is going on.

Thanks -- Jim

P.S. I got the recipients list via "scripts/get_maintainer.pl -f mm/compaction.c";
      please let me know if I should have done something else.


* Re: excessive CPU utilization by isolate_freepages?
  2012-06-27 21:59 ` Jim Schutt
@ 2012-06-27 22:35   ` Mark Nelson
  -1 siblings, 0 replies; 24+ messages in thread
From: Mark Nelson @ 2012-06-27 22:35 UTC (permalink / raw)
  To: Jim Schutt
  Cc: Andrew Morton, Mel Gorman, KAMEZAWA Hiroyuki, Rik van Riel,
	Minchan Kim, linux-mm, linux-kernel, ceph-devel

On 06/27/2012 04:59 PM, Jim Schutt wrote:
> Hi,
>
> I'm running into trouble with systems going unresponsive,
> and perf suggests it's excessive CPU usage by isolate_freepages().
> I'm currently testing 3.5-rc4, but I think this problem may have
> first shown up in 3.4.
>
> [... vmstat output and perf call graphs snipped; see the full message above ...]
>
> I seem to be able to recreate this issue at will, so please
> let me know what I can do to help figure out what is going on.

Ok, I looked around a bit and found this:

https://lkml.org/lkml/2011/11/9/252

I think the same thing is happening here, i.e. transparent huge pages
are being allocated on page faults, but due to memory fragmentation no
contiguous blocks are available.  try_to_compact_pages is getting
called to assemble a contiguous set of pages, but apparently it can't
move the pages it needs because they are stuck waiting (presumably on
writeback).
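
If you want to double-check that's what's going on, the compaction and
THP counters in /proc/vmstat should be climbing while the box is
struggling (exact counter names vary a bit by kernel version):

  # watch compaction / transparent hugepage activity, and the THP policy
  grep -E 'compact|thp' /proc/vmstat
  cat /sys/kernel/mm/transparent_hugepage/enabled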

According to that post, /sys/block/<device>/bdi/max_ratio can be changed 
to specify what percentage of dirty cache each of your OSDs can use. 
Apparently the default is 20% for each device.

Might be worth a try!

Mark

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: excessive CPU utilization by isolate_freepages?
@ 2012-06-27 22:35   ` Mark Nelson
  0 siblings, 0 replies; 24+ messages in thread
From: Mark Nelson @ 2012-06-27 22:35 UTC (permalink / raw)
  To: Jim Schutt
  Cc: Andrew Morton, Mel Gorman, KAMEZAWA Hiroyuki, Rik van Riel,
	Minchan Kim, linux-mm, linux-kernel, ceph-devel

On 06/27/2012 04:59 PM, Jim Schutt wrote:
> Hi,
>
> I'm running into trouble with systems going unresponsive,
> and perf suggests it's excessive CPU usage by isolate_freepages().
> I'm currently testing 3.5-rc4, but I think this problem may have
> first shown up in 3.4. I'm only just learning how to use perf,
> so I only currently have results to report for 3.5-rc4.
>
> (FWIW I'm running the my distro version of perf; please let me know
> if I need to compile the tools/perf version to match my kernel.)
>
> The systems in question have 24 SAS drives spread across 3 HBAs,
> running 24 Ceph OSD instances, one per drive. FWIW these servers
> are dual-socket Intel 5675 Xeons w/48 GB memory. I've got ~160
> Ceph Linux clients doing dd simultaneously to a Ceph file system
> backed by 12 of these servers.
>
> In the early phase of such a test, when things are running well,
> here's what vmstat reports for the state of one of these servers:
>
> 2012-06-27 13:56:58.356-06:00
> vmstat -w 4 16
> procs -------------------memory------------------ ---swap-- -----io----
> --system-- -----cpu-------
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 31 15 0 287216 576 38606628 0 0 2 1158 2 14 1 3 95 0 0
> 27 15 0 225288 576 38583384 0 0 18 2222016 203357 134876 11 56 17 15 0
> 28 17 0 219256 576 38544736 0 0 11 2305932 203141 146296 11 49 23 17 0
> 6 18 0 215596 576 38552872 0 0 7 2363207 215264 166502 12 45 22 20 0
> 22 18 0 226984 576 38596404 0 0 3 2445741 223114 179527 12 43 23 22 0
> 30 12 0 230844 576 38461648 0 0 14 2298537 216580 166661 12 45 23 20 0
> 29 19 0 237856 576 38422884 0 0 5 2332741 209408 157138 12 42 25 22 0
> 17 11 0 222156 576 38483992 0 0 4 2380495 210312 173121 12 39 28 22 0
> 11 13 0 216152 576 38463872 0 0 44 2362186 215236 176454 12 39 27 22 0
> 12 14 0 223704 576 38546720 0 0 2 2395530 214684 177518 12 39 28 22 0
> 39 17 0 219932 576 38598184 0 0 4 2428231 223284 179095 12 42 24 21 0
> 11 10 0 219956 576 38521048 0 0 4 2323853 207472 166665 12 39 28 21 0
> 23 15 0 216580 576 38451904 0 0 3 2241800 201049 163496 11 37 31 21 0
> 9 13 0 225792 576 38451916 0 0 13 2281900 204869 171814 11 36 30 23 0
> 14 12 0 233820 576 38492728 0 0 4 2293423 207686 173019 11 37 31 21 0
> 11 20 0 213796 576 38533208 0 0 3 2288636 205605 168118 11 37 31 21 0
>
>
> The system begins to struggle over the next several
> minutes; here's what vmstat has to say:
>
> 2012-06-27 13:57:58.831-06:00
> vmstat -w 4 16
> procs -------------------memory------------------ ---swap-- -----io----
> --system-- -----cpu-------
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 21 16 0 224628 576 38526872 0 0 2 1233 9 19 1 4 95 0 0
> 12 19 0 232060 576 38501020 0 0 4 2366769 221418 159890 12 48 20 19 0
> 124 8 0 218548 576 38379656 0 0 13 2103075 199660 108618 11 65 13 12 0
> 24 10 0 300476 576 38230288 0 0 31 1966282 177472 84572 10 76 7 7 0
> 20 16 0 217584 576 38296700 0 0 9 2062571 195936 128810 10 55 20 15 0
> 53 12 0 235720 576 38247968 0 0 30 2035407 196973 133921 10 52 23 15 0
> 20 16 0 360340 576 38067992 0 0 6 2192179 208692 136784 11 54 19 15 0
> 26 10 0 310800 576 38093884 0 0 43 2138772 207105 118718 11 64 12 13 0
> 24 15 0 261108 576 38030828 0 0 68 2174015 205793 135302 11 56 18 15 0
> 9 17 0 241816 576 37982072 0 0 20 2076145 194971 120285 10 60 16 14 0
> 37 15 0 255972 576 37892868 0 0 14 2225076 205694 126954 11 59 16 13 0
> 27 16 0 243212 576 37872704 0 0 6 2249476 210885 134684 12 60 15 14 0
> 30 10 0 217572 576 37795388 0 0 3 2128688 205027 118319 11 66 12 11 0
> 26 11 0 236420 576 37740164 0 0 23 2109709 205105 133925 10 56 19 14 0
> 45 15 0 330056 576 37619896 0 0 15 1948311 196188 119330 10 62 15 13 0
> 54 15 0 242696 576 37631500 0 0 4 2159530 202595 132588 11 59 16 15 0
>
> 2012-06-27 13:58:59.569-06:00
> vmstat -w 4 16
> procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 14 15 0 274932 576 37621548 0 0 2 1301 15 0 1 4 95 0 0
> 44 16 0 278748 576 37509516 0 0 6 2097643 196722 112697 10 66 12 11 0
> 88 14 0 228088 576 37412008 0 0 6 2089559 202206 116146 10 66 13 11 0
> 94 11 0 348348 576 37270624 0 0 7 1906390 181488 87333 9 76 9 6 0
> 42 11 0 215996 576 37288556 0 0 6 1782459 184375 95901 9 74 9 7 0
> 45 11 0 323112 576 37146500 0 0 12 1868376 187814 103947 9 71 11 9 0
> 51 7 0 244560 576 37100124 0 0 35 1767496 181383 107259 9 68 12 11 0
> 74 12 0 221584 576 37016420 0 0 18 1884986 183376 93425 9 75 9 7 0
> 45 10 0 275564 576 36985324 0 0 23 1683688 167223 97036 8 75 9 8 0
> 19 10 0 322176 576 36813176 0 0 14 1747378 177594 97218 8 72 12 8 0
> 122 7 0 225256 576 36838084 0 0 26 1730643 177915 92621 8 75 9 8 0
> 243 10 0 223464 576 36765460 0 0 18 1730158 173059 79373 8 80 6 5 0
> 100 10 0 307528 576 36598456 0 0 4 1738567 174077 79585 9 82 6 4 0
> 243 6 0 370064 576 36358576 0 0 9 1586528 174680 85353 8 81 6 5 0
> 267 2 0 322640 576 36254044 0 0 40 1011650 129389 42277 5 93 1 1 0
> 210 4 0 505092 576 35865460 0 0 25 720825 116356 32422 3 96 1 0 0
>
> 2012-06-27 14:00:03.219-06:00
> vmstat -w 4 16
> procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 75 1 0 566988 576 35664800 0 0 2 1355 21 3 1 4 95 0 0
> 433 1 0 964052 576 35069112 0 0 7 456359 102256 20901 2 98 0 0 0
> 547 3 0 820116 576 34893932 0 0 57 560507 114878 28115 3 96 0 0 0
> 806 2 0 606992 576 34848180 0 0 339 309668 101230 21056 2 98 0 0 0
> 708 1 0 529624 576 34708000 0 0 248 370886 101327 20062 2 97 0 0 0
> 231 5 0 504772 576 34663880 0 0 305 334824 95045 20407 2 97 1 1 0
> 158 6 0 1063088 576 33518536 0 0 531 847435 130696 47140 4 92 1 2 0
> 193 0 0 1449156 576 33035572 0 0 363 371279 94470 18955 2 96 1 1 0
> 266 6 0 1623512 576 32728164 0 0 77 241114 95730 15483 2 98 0 0 0
> 243 8 0 1629504 576 32653080 0 0 81 471018 100223 20920 3 96 0 1 0
> 70 11 0 1342140 576 33084020 0 0 100 925869 139876 56599 6 88 3 3 0
> 211 7 0 1130316 576 33470432 0 0 290 1008984 150699 74320 6 83 6 5 0
> 365 3 0 776736 576 34072772 0 0 182 747167 139436 67135 5 88 4 3 0
> 29 1 0 1528412 576 34110640 0 0 50 612181 137403 77609 4 87 6 3 0
> 266 5 0 1657688 576 34105696 0 0 3 258307 62879 38508 2 93 3 2 0
> 1159 2 0 2002256 576 33775476 0 0 19 88554 42112 14230 1 98 0 0 0
>
>
> Right around 14:00 I was able to get a "perf -a -g"; here's the
> beginning of what "perf report --sort symbol --call-graph fractal,5"
> had to say:
>
> #
> 64.86% [k] _raw_spin_lock_irqsave
> |
> |--97.94%-- isolate_freepages
> | compaction_alloc
> | unmap_and_move
> | migrate_pages
> | compact_zone
> | |
> | |--99.56%-- try_to_compact_pages
> | | __alloc_pages_direct_compact
> | | __alloc_pages_slowpath
> | | __alloc_pages_nodemask
> | | alloc_pages_vma
> | | do_huge_pmd_anonymous_page
> | | handle_mm_fault
> | | do_page_fault
> | | page_fault
> | | |
> | | |--53.53%-- skb_copy_datagram_iovec
> | | | tcp_recvmsg
> | | | inet_recvmsg
> | | | sock_recvmsg
> | | | sys_recvfrom
> | | | system_call_fastpath
> | | | __recv
> | | | |
> | | | --100.00%-- (nil)
> | | |
> | | |--27.80%-- __pthread_create_2_1
> | | | (nil)
> | | |
> | | --18.67%-- memcpy
> | | |
> | | |--57.38%-- 0x50d000005
> | | |
> | | |--34.52%-- 0x3b300bf271940a35
> | | |
> | | --8.10%-- 0x1500000000000009
> | --0.44%-- [...]
> --2.06%-- [...]
>
> 6.15% [k] isolate_freepages_block
> |
> |--99.95%-- isolate_freepages
> | compaction_alloc
> | unmap_and_move
> | migrate_pages
> | compact_zone
> | |
> | |--99.54%-- try_to_compact_pages
> | | __alloc_pages_direct_compact
> | | __alloc_pages_slowpath
> | | __alloc_pages_nodemask
> | | alloc_pages_vma
> | | do_huge_pmd_anonymous_page
> | | handle_mm_fault
> | | do_page_fault
> | | page_fault
> | | |
> | | |--54.40%-- skb_copy_datagram_iovec
> | | | tcp_recvmsg
> | | | inet_recvmsg
> | | | sock_recvmsg
> | | | sys_recvfrom
> | | | system_call_fastpath
> | | | __recv
> | | | |
> | | | --100.00%-- (nil)
> | | |
> | | |--25.19%-- __pthread_create_2_1
> | | | (nil)
> | | |
> | | --20.41%-- memcpy
> | | |
> | | |--40.24%-- 0x3b300bf271940a35
> | | |
> | | |--38.29%-- 0x1500000000000009
> | | |
> | | --21.47%-- 0x50d000005
> | --0.46%-- [...]
> --0.05%-- [...]
>
> 3.96% [.] ceph_crc32c_le
> |
> |--99.99%-- 0xb8057558d0065990
> --0.01%-- [...]
>
> A different system in the same test had a slightly different
> call tree, but isolate_freepages() still seems to show up
> prominently:
>
> #
> 32.32% [k] _raw_spin_lock_irqsave
> |
> |--97.64%-- isolate_freepages
> | compaction_alloc
> | unmap_and_move
> | migrate_pages
> | compact_zone
> | try_to_compact_pages
> | __alloc_pages_direct_compact
> | __alloc_pages_slowpath
> | __alloc_pages_nodemask
> | alloc_pages_vma
> | do_huge_pmd_anonymous_page
> | handle_mm_fault
> | do_page_fault
> | page_fault
> | |
> | |--65.31%-- skb_copy_datagram_iovec
> | | tcp_recvmsg
> | | inet_recvmsg
> | | sock_recvmsg
> | | sys_recvfrom
> | | system_call_fastpath
> | | __recv
> | | |
> | | --100.00%-- (nil)
> | |
> | |--30.98%-- memcpy
> | | |
> | | |--50.60%-- 0x50d0000
> | | |
> | | --49.40%-- 0x50d000005
> | --3.70%-- [...]
> --2.36%-- [...]
>
> 17.10% [k] _raw_spin_lock_irq
> |
> |--98.27%-- isolate_migratepages_range
> | compact_zone
> | try_to_compact_pages
> | __alloc_pages_direct_compact
> | __alloc_pages_slowpath
> | __alloc_pages_nodemask
> | alloc_pages_vma
> | do_huge_pmd_anonymous_page
> | handle_mm_fault
> | do_page_fault
> | page_fault
> | |
> | |--99.85%-- __pthread_create_2_1
> | | (nil)
> | --0.15%-- [...]
> --1.73%-- [...]
>
> 4.59% [k] mutex_spin_on_owner
> |
> --- __mutex_lock_slowpath
> mutex_lock
> |
> |--50.14%-- page_lock_anon_vma
> | |
> | |--99.99%-- try_to_unmap_anon
> | | try_to_unmap
> | | __unmap_and_move
> | | unmap_and_move
> | | migrate_pages
> | | compact_zone
> | | try_to_compact_pages
> | | __alloc_pages_direct_compact
> | | __alloc_pages_slowpath
> | | __alloc_pages_nodemask
> | | alloc_pages_vma
> | | do_huge_pmd_anonymous_page
> | | handle_mm_fault
> | | do_page_fault
> | | page_fault
> | | |
> | | |--99.91%-- __pthread_create_2_1
> | | | (nil)
> | | --0.09%-- [...]
> | --0.01%-- [...]
> |
> |--49.67%-- rmap_walk
> | move_to_new_page
> | __unmap_and_move
> | unmap_and_move
> | migrate_pages
> | compact_zone
> | try_to_compact_pages
> | __alloc_pages_direct_compact
> | __alloc_pages_slowpath
> | __alloc_pages_nodemask
> | alloc_pages_vma
> | do_huge_pmd_anonymous_page
> | handle_mm_fault
> | do_page_fault
> | page_fault
> | |
> | |--99.69%-- __pthread_create_2_1
> | | (nil)
> | --0.31%-- [...]
> --0.20%-- [...]
>
> 4.10% [k] isolate_freepages_block
> |
> |--99.95%-- isolate_freepages
> | compaction_alloc
> | unmap_and_move
> | migrate_pages
> | compact_zone
> | try_to_compact_pages
> | __alloc_pages_direct_compact
> | __alloc_pages_slowpath
> | __alloc_pages_nodemask
> | alloc_pages_vma
> | do_huge_pmd_anonymous_page
> | handle_mm_fault
> | do_page_fault
> | page_fault
> | |
> | |--46.97%-- skb_copy_datagram_iovec
> | | tcp_recvmsg
> | | inet_recvmsg
> | | sock_recvmsg
> | | sys_recvfrom
> | | system_call_fastpath
> | | __recv
> | | |
> | | --100.00%-- (nil)
> | |
> | |--31.79%-- __pthread_create_2_1
> | | (nil)
> | |
> | --21.24%-- memcpy
> | |
> | |--61.90%-- 0x50d000005
> | |
> | --38.10%-- 0x50d0000
> --0.05%-- [...]
>
> 3.65% [.] ceph_crc32c_le
> |
> |--99.86%-- 0xb8057558d0065990
> --0.14%-- [...]
>
>
>
> I seem to be able to recreate this issue at will, so please
> let me know what I can do to help learn what is going on.
>
> Thanks -- Jim
>
> P.S. I got the recipients list via "scripts/get_maintainer.pl -f
> mm/compaction.c";
> please let me know if I should have done something else.
>

Ok, I looked around a bit and found this:

https://lkml.org/lkml/2011/11/9/252

I think the same thing is happening here, i.e. transparent huge pages are
being requested on page faults, but due to memory fragmentation no
contiguous blocks that large are free.  try_to_compact_pages() then gets
called to compact one together, but apparently it can't move the pages it
needs because they are stuck waiting (presumably dirty pages waiting on
writeback).

According to that post, /sys/block/<device>/bdi/max_ratio can be lowered
to limit the percentage of the dirty page cache each of your OSD devices
is allowed to consume.  Apparently the default is 20% for each device.
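
If you want to try it across all 24 OSD disks, something like this quick
hack would do it.  This is just an untested sketch: the device names and
the value written (4, i.e. roughly 100/24, so all the disks together
can't pin much more than the whole dirty limit) are placeholders for
whatever fits your setup.

/* cap per-device dirty cache via /sys/block/<dev>/bdi/max_ratio */
#include <stdio.h>

int main(void)
{
        /* placeholder device names; substitute your 24 OSD disks */
        const char *devs[] = { "sdb", "sdc", "sdd" };
        char path[64];
        size_t i;

        for (i = 0; i < sizeof(devs) / sizeof(devs[0]); i++) {
                FILE *f;

                snprintf(path, sizeof(path),
                         "/sys/block/%s/bdi/max_ratio", devs[i]);
                f = fopen(path, "w");
                if (!f) {
                        perror(path);
                        continue;
                }
                fprintf(f, "4\n");      /* percent of dirtyable memory */
                fclose(f);
        }
        return 0;
}

A shell loop echoing into each max_ratio file obviously works just as
well; the point is only that the per-device limit is a sysfs knob.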

Might be worth a try!

Mark

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: excessive CPU utilization by isolate_freepages?
  2012-06-27 21:59 ` Jim Schutt
@ 2012-06-27 23:59   ` Minchan Kim
  -1 siblings, 0 replies; 24+ messages in thread
From: Minchan Kim @ 2012-06-27 23:59 UTC (permalink / raw)
  To: Jim Schutt
  Cc: Andrew Morton, Mel Gorman, KAMEZAWA Hiroyuki, Rik van Riel,
	linux-mm, linux-kernel, ceph-devel

On 06/28/2012 06:59 AM, Jim Schutt wrote:

> Hi,
> 
> I'm running into trouble with systems going unresponsive,
> and perf suggests it's excessive CPU usage by isolate_freepages().
> I'm currently testing 3.5-rc4, but I think this problem may have
> first shown up in 3.4.  I'm only just learning how to use perf,
> so I only currently have results to report for 3.5-rc4.
> 
> (FWIW I'm running the my distro version of perf; please let me know
> if I need to compile the tools/perf version to match my kernel.)
> 
> The systems in question have 24 SAS drives spread across 3 HBAs,
> running 24 Ceph OSD instances, one per drive.  FWIW these servers
> are dual-socket Intel 5675 Xeons w/48 GB memory.  I've got ~160
> Ceph Linux clients doing dd simultaneously to a Ceph file system
> backed by 12 of these servers.
> 
> In the early phase of such a test, when things are running well,
> here's what vmstat reports for the state of one of these servers:
> 
> 2012-06-27 13:56:58.356-06:00
> vmstat -w 4 16
> procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
>  r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
> 31 15          0     287216        576   38606628    0    0     2  1158    2   14   1  3  95  0  0
> 27 15          0     225288        576   38583384    0    0    18 2222016 203357 134876  11 56  17 15  0
> 28 17          0     219256        576   38544736    0    0    11 2305932 203141 146296  11 49  23 17  0
>  6 18          0     215596        576   38552872    0    0     7 2363207 215264 166502  12 45  22 20  0
> 22 18          0     226984        576   38596404    0    0     3 2445741 223114 179527  12 43  23 22  0
> 30 12          0     230844        576   38461648    0    0    14 2298537 216580 166661  12 45  23 20  0
> 29 19          0     237856        576   38422884    0    0     5 2332741 209408 157138  12 42  25 22  0
> 17 11          0     222156        576   38483992    0    0     4 2380495 210312 173121  12 39  28 22  0
> 11 13          0     216152        576   38463872    0    0    44 2362186 215236 176454  12 39  27 22  0
> 12 14          0     223704        576   38546720    0    0     2 2395530 214684 177518  12 39  28 22  0
> 39 17          0     219932        576   38598184    0    0     4 2428231 223284 179095  12 42  24 21  0
> 11 10          0     219956        576   38521048    0    0     4 2323853 207472 166665  12 39  28 21  0
> 23 15          0     216580        576   38451904    0    0     3 2241800 201049 163496  11 37  31 21  0
>  9 13          0     225792        576   38451916    0    0    13 2281900 204869 171814  11 36  30 23  0
> 14 12          0     233820        576   38492728    0    0     4 2293423 207686 173019  11 37  31 21  0
> 11 20          0     213796        576   38533208    0    0     3 2288636 205605 168118  11 37  31 21  0
> 
> 
> The system begins to struggle over the next several
> minutes; here's what vmstat has to say:
> 
> 2012-06-27 13:57:58.831-06:00
> vmstat -w 4 16
> procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
>  r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
> 21 16          0     224628        576   38526872    0    0     2  1233    9   19   1  4  95  0  0
> 12 19          0     232060        576   38501020    0    0     4 2366769 221418 159890  12 48  20 19  0
> 124  8          0     218548        576   38379656    0    0    13 2103075 199660 108618  11 65  13 12  0
> 24 10          0     300476        576   38230288    0    0    31 1966282 177472 84572  10 76   7  7  0
> 20 16          0     217584        576   38296700    0    0     9 2062571 195936 128810  10 55  20 15  0
> 53 12          0     235720        576   38247968    0    0    30 2035407 196973 133921  10 52  23 15  0
> 20 16          0     360340        576   38067992    0    0     6 2192179 208692 136784  11 54  19 15  0
> 26 10          0     310800        576   38093884    0    0    43 2138772 207105 118718  11 64  12 13  0
> 24 15          0     261108        576   38030828    0    0    68 2174015 205793 135302  11 56  18 15  0
>  9 17          0     241816        576   37982072    0    0    20 2076145 194971 120285  10 60  16 14  0
> 37 15          0     255972        576   37892868    0    0    14 2225076 205694 126954  11 59  16 13  0
> 27 16          0     243212        576   37872704    0    0     6 2249476 210885 134684  12 60  15 14  0
> 30 10          0     217572        576   37795388    0    0     3 2128688 205027 118319  11 66  12 11  0
> 26 11          0     236420        576   37740164    0    0    23 2109709 205105 133925  10 56  19 14  0
> 45 15          0     330056        576   37619896    0    0    15 1948311 196188 119330  10 62  15 13  0
> 54 15          0     242696        576   37631500    0    0     4 2159530 202595 132588  11 59  16 15  0
> 
> 2012-06-27 13:58:59.569-06:00
> vmstat -w 4 16
> procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
>  r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
> 14 15          0     274932        576   37621548    0    0     2  1301   15    0   1  4  95  0  0
> 44 16          0     278748        576   37509516    0    0     6 2097643 196722 112697  10 66  12 11  0
> 88 14          0     228088        576   37412008    0    0     6 2089559 202206 116146  10 66  13 11  0
> 94 11          0     348348        576   37270624    0    0     7 1906390 181488 87333   9 76   9  6  0
> 42 11          0     215996        576   37288556    0    0     6 1782459 184375 95901   9 74   9  7  0
> 45 11          0     323112        576   37146500    0    0    12 1868376 187814 103947   9 71  11  9  0
> 51  7          0     244560        576   37100124    0    0    35 1767496 181383 107259   9 68  12 11  0
> 74 12          0     221584        576   37016420    0    0    18 1884986 183376 93425   9 75   9  7  0
> 45 10          0     275564        576   36985324    0    0    23 1683688 167223 97036   8 75   9  8  0
> 19 10          0     322176        576   36813176    0    0    14 1747378 177594 97218   8 72  12  8  0
> 122  7          0     225256        576   36838084    0    0    26 1730643 177915 92621   8 75   9  8  0
> 243 10          0     223464        576   36765460    0    0    18 1730158 173059 79373   8 80   6  5  0
> 100 10          0     307528        576   36598456    0    0     4 1738567 174077 79585   9 82   6  4  0
> 243  6          0     370064        576   36358576    0    0     9 1586528 174680 85353   8 81   6  5  0
> 267  2          0     322640        576   36254044    0    0    40 1011650 129389 42277   5 93   1  1  0
> 210  4          0     505092        576   35865460    0    0    25 720825 116356 32422   3 96   1  0  0
> 
> 2012-06-27 14:00:03.219-06:00
> vmstat -w 4 16
> procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
>  r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
> 75  1          0     566988        576   35664800    0    0     2  1355   21    3   1  4  95  0  0
> 433  1          0     964052        576   35069112    0    0     7 456359 102256 20901   2 98   0  0  0
> 547  3          0     820116        576   34893932    0    0    57 560507 114878 28115   3 96   0  0  0
> 806  2          0     606992        576   34848180    0    0   339 309668 101230 21056   2 98   0  0  0
> 708  1          0     529624        576   34708000    0    0   248 370886 101327 20062   2 97   0  0  0
> 231  5          0     504772        576   34663880    0    0   305 334824 95045 20407   2 97   1  1  0
> 158  6          0    1063088        576   33518536    0    0   531 847435 130696 47140   4 92   1  2  0
> 193  0          0    1449156        576   33035572    0    0   363 371279 94470 18955   2 96   1  1  0
> 266  6          0    1623512        576   32728164    0    0    77 241114 95730 15483   2 98   0  0  0
> 243  8          0    1629504        576   32653080    0    0    81 471018 100223 20920   3 96   0  1  0
> 70 11          0    1342140        576   33084020    0    0   100 925869 139876 56599   6 88   3  3  0
> 211  7          0    1130316        576   33470432    0    0   290 1008984 150699 74320   6 83   6  5  0
> 365  3          0     776736        576   34072772    0    0   182 747167 139436 67135   5 88   4  3  0
> 29  1          0    1528412        576   34110640    0    0    50 612181 137403 77609   4 87   6  3  0
> 266  5          0    1657688        576   34105696    0    0     3 258307 62879 38508   2 93   3  2  0
> 1159  2          0    2002256        576   33775476    0    0    19 88554 42112 14230   1 98   0  0  0
> 
> 
> Right around 14:00 I was able to get a "perf -a -g"; here's the
> beginning of what "perf report --sort symbol --call-graph fractal,5"
> had to say:
> 
> #
>     64.86%  [k] _raw_spin_lock_irqsave
>             |
>             |--97.94%-- isolate_freepages
>             |          compaction_alloc
>             |          unmap_and_move
>             |          migrate_pages
>             |          compact_zone
>             |          |
>             |          |--99.56%-- try_to_compact_pages
>             |          |          __alloc_pages_direct_compact
>             |          |          __alloc_pages_slowpath
>             |          |          __alloc_pages_nodemask
>             |          |          alloc_pages_vma
>             |          |          do_huge_pmd_anonymous_page
>             |          |          handle_mm_fault
>             |          |          do_page_fault
>             |          |          page_fault
>             |          |          |
>             |          |          |--53.53%-- skb_copy_datagram_iovec
>             |          |          |          tcp_recvmsg
>             |          |          |          inet_recvmsg
>             |          |          |          sock_recvmsg
>             |          |          |          sys_recvfrom
>             |          |          |          system_call_fastpath
>             |          |          |          __recv
>             |          |          |          |
>             |          |          |           --100.00%-- (nil)
>             |          |          |
>             |          |          |--27.80%-- __pthread_create_2_1
>             |          |          |          (nil)
>             |          |          |
>             |          |           --18.67%-- memcpy
>             |          |                     |
>             |          |                     |--57.38%-- 0x50d000005
>             |          |                     |
>             |          |                     |--34.52%-- 0x3b300bf271940a35
>             |          |                     |
>             |          |                      --8.10%-- 0x1500000000000009
>             |           --0.44%-- [...]
>              --2.06%-- [...]
> 
>      6.15%  [k] isolate_freepages_block
>             |
>             |--99.95%-- isolate_freepages
>             |          compaction_alloc
>             |          unmap_and_move
>             |          migrate_pages
>             |          compact_zone
>             |          |
>             |          |--99.54%-- try_to_compact_pages
>             |          |          __alloc_pages_direct_compact
>             |          |          __alloc_pages_slowpath
>             |          |          __alloc_pages_nodemask
>             |          |          alloc_pages_vma
>             |          |          do_huge_pmd_anonymous_page
>             |          |          handle_mm_fault
>             |          |          do_page_fault
>             |          |          page_fault
>             |          |          |
>             |          |          |--54.40%-- skb_copy_datagram_iovec
>             |          |          |          tcp_recvmsg
>             |          |          |          inet_recvmsg
>             |          |          |          sock_recvmsg
>             |          |          |          sys_recvfrom
>             |          |          |          system_call_fastpath
>             |          |          |          __recv
>             |          |          |          |
>             |          |          |           --100.00%-- (nil)
>             |          |          |
>             |          |          |--25.19%-- __pthread_create_2_1
>             |          |          |          (nil)
>             |          |          |
>             |          |           --20.41%-- memcpy
>             |          |                     |
>             |          |                     |--40.24%-- 0x3b300bf271940a35
>             |          |                     |
>             |          |                     |--38.29%-- 0x1500000000000009
>             |          |                     |
>             |          |                      --21.47%-- 0x50d000005
>             |           --0.46%-- [...]
>              --0.05%-- [...]
> 
>      3.96%  [.] ceph_crc32c_le
>             |
>             |--99.99%-- 0xb8057558d0065990
>              --0.01%-- [...]
> 
> A different system in the same test had a slightly different
> call tree, but isolate_freepages() still seems to show up
> prominently:
> 
> #
>     32.32%  [k] _raw_spin_lock_irqsave
>             |
>             |--97.64%-- isolate_freepages
>             |          compaction_alloc
>             |          unmap_and_move
>             |          migrate_pages
>             |          compact_zone
>             |          try_to_compact_pages
>             |          __alloc_pages_direct_compact
>             |          __alloc_pages_slowpath
>             |          __alloc_pages_nodemask
>             |          alloc_pages_vma
>             |          do_huge_pmd_anonymous_page
>             |          handle_mm_fault
>             |          do_page_fault
>             |          page_fault
>             |          |
>             |          |--65.31%-- skb_copy_datagram_iovec
>             |          |          tcp_recvmsg
>             |          |          inet_recvmsg
>             |          |          sock_recvmsg
>             |          |          sys_recvfrom
>             |          |          system_call_fastpath
>             |          |          __recv
>             |          |          |
>             |          |           --100.00%-- (nil)
>             |          |
>             |          |--30.98%-- memcpy
>             |          |          |
>             |          |          |--50.60%-- 0x50d0000
>             |          |          |
>             |          |           --49.40%-- 0x50d000005
>             |           --3.70%-- [...]
>              --2.36%-- [...]
> 
>     17.10%  [k] _raw_spin_lock_irq
>             |
>             |--98.27%-- isolate_migratepages_range
>             |          compact_zone
>             |          try_to_compact_pages
>             |          __alloc_pages_direct_compact
>             |          __alloc_pages_slowpath
>             |          __alloc_pages_nodemask
>             |          alloc_pages_vma
>             |          do_huge_pmd_anonymous_page
>             |          handle_mm_fault
>             |          do_page_fault
>             |          page_fault
>             |          |
>             |          |--99.85%-- __pthread_create_2_1
>             |          |          (nil)
>             |           --0.15%-- [...]
>              --1.73%-- [...]
> 
>      4.59%  [k] mutex_spin_on_owner
>             |
>             --- __mutex_lock_slowpath
>                 mutex_lock
>                |
>                |--50.14%-- page_lock_anon_vma
>                |          |
>                |          |--99.99%-- try_to_unmap_anon
>                |          |          try_to_unmap
>                |          |          __unmap_and_move
>                |          |          unmap_and_move
>                |          |          migrate_pages
>                |          |          compact_zone
>                |          |          try_to_compact_pages
>                |          |          __alloc_pages_direct_compact
>                |          |          __alloc_pages_slowpath
>                |          |          __alloc_pages_nodemask
>                |          |          alloc_pages_vma
>                |          |          do_huge_pmd_anonymous_page
>                |          |          handle_mm_fault
>                |          |          do_page_fault
>                |          |          page_fault
>                |          |          |
>                |          |          |--99.91%-- __pthread_create_2_1
>                |          |          |          (nil)
>                |          |           --0.09%-- [...]
>                |           --0.01%-- [...]
>                |
>                |--49.67%-- rmap_walk
>                |          move_to_new_page
>                |          __unmap_and_move
>                |          unmap_and_move
>                |          migrate_pages
>                |          compact_zone
>                |          try_to_compact_pages
>                |          __alloc_pages_direct_compact
>                |          __alloc_pages_slowpath
>                |          __alloc_pages_nodemask
>                |          alloc_pages_vma
>                |          do_huge_pmd_anonymous_page
>                |          handle_mm_fault
>                |          do_page_fault
>                |          page_fault
>                |          |
>                |          |--99.69%-- __pthread_create_2_1
>                |          |          (nil)
>                |           --0.31%-- [...]
>                 --0.20%-- [...]
> 
>      4.10%  [k] isolate_freepages_block
>             |
>             |--99.95%-- isolate_freepages
>             |          compaction_alloc
>             |          unmap_and_move
>             |          migrate_pages
>             |          compact_zone
>             |          try_to_compact_pages
>             |          __alloc_pages_direct_compact
>             |          __alloc_pages_slowpath
>             |          __alloc_pages_nodemask
>             |          alloc_pages_vma
>             |          do_huge_pmd_anonymous_page
>             |          handle_mm_fault
>             |          do_page_fault
>             |          page_fault
>             |          |
>             |          |--46.97%-- skb_copy_datagram_iovec
>             |          |          tcp_recvmsg
>             |          |          inet_recvmsg
>             |          |          sock_recvmsg
>             |          |          sys_recvfrom
>             |          |          system_call_fastpath
>             |          |          __recv
>             |          |          |
>             |          |           --100.00%-- (nil)
>             |          |
>             |          |--31.79%-- __pthread_create_2_1
>             |          |          (nil)
>             |          |
>             |           --21.24%-- memcpy
>             |                     |
>             |                     |--61.90%-- 0x50d000005
>             |                     |
>             |                      --38.10%-- 0x50d0000
>              --0.05%-- [...]
> 
>      3.65%  [.] ceph_crc32c_le
>             |
>             |--99.86%-- 0xb8057558d0065990
>              --0.14%-- [...]
> 
> 
> 
> I seem to be able to recreate this issue at will, so please
> let me know what I can do to help learn what is going on.
> 
> Thanks -- Jim



I suspect compaction keeps trying to migrate pages even though we have no
free memory to migrate them into.
Could you apply this patch and retest?

https://lkml.org/lkml/2012/6/21/30

Thanks.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: excessive CPU utilization by isolate_freepages?
  2012-06-27 23:59   ` Minchan Kim
@ 2012-06-28  0:28     ` Rik van Riel
  -1 siblings, 0 replies; 24+ messages in thread
From: Rik van Riel @ 2012-06-28  0:28 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Jim Schutt, Andrew Morton, Mel Gorman, KAMEZAWA Hiroyuki,
	linux-mm, linux-kernel, ceph-devel

On 06/27/2012 07:59 PM, Minchan Kim wrote:

> I doubt compaction try to migrate continuously although we have no free memory.
> Could you apply this patch and retest?
>
> https://lkml.org/lkml/2012/6/21/30

Another possibility is that compaction is succeeding every time,
but since we always start scanning all the way at the beginning
and end of each zone, we waste a lot of CPU time rescanning the
same pages (that we just filled up with moved pages) to see if
any are free.

In short, due to the way compaction behaves right now,
compaction + isolate_freepages are essentially quadratic.

What we need to do is remember where we left off after a
successful compaction, so we can continue the search there
at the next invocation.
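
Something along these lines, roughly (untested sketch; the struct and
function names below are made up for illustration and are simplified
stand-ins, not the actual mm/compaction.c data structures):

/* per-zone cache of where the two compaction scanners stopped */
struct scanner_cache {
        unsigned long start_pfn;                /* zone boundaries */
        unsigned long end_pfn;
        unsigned long cached_migrate_pfn;       /* bottom-up scanner */
        unsigned long cached_free_pfn;          /* top-down scanner */
};

/* pick up where the previous compaction run left off */
static void scanners_resume(struct scanner_cache *sc,
                            unsigned long *migrate_pfn,
                            unsigned long *free_pfn)
{
        *migrate_pfn = sc->cached_migrate_pfn;
        *free_pfn = sc->cached_free_pfn;

        /* only restart from scratch once the scanners have already met */
        if (*migrate_pfn >= *free_pfn) {
                *migrate_pfn = sc->start_pfn;
                *free_pfn = sc->end_pfn;
        }
}

/* remember the positions when this run finishes or gives up */
static void scanners_remember(struct scanner_cache *sc,
                              unsigned long migrate_pfn,
                              unsigned long free_pfn)
{
        sc->cached_migrate_pfn = migrate_pfn;
        sc->cached_free_pfn = free_pfn;
}

That way a burst of THP faults pays for one full pass over the zone
instead of one pass per fault.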

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: excessive CPU utilization by isolate_freepages?
  2012-06-28  0:28     ` Rik van Riel
@ 2012-06-28  0:52       ` David Rientjes
  -1 siblings, 0 replies; 24+ messages in thread
From: David Rientjes @ 2012-06-28  0:52 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Minchan Kim, Jim Schutt, Andrew Morton, Mel Gorman,
	KAMEZAWA Hiroyuki, linux-mm, linux-kernel, ceph-devel

On Wed, 27 Jun 2012, Rik van Riel wrote:

> > I doubt compaction try to migrate continuously although we have no free
> > memory.
> > Could you apply this patch and retest?
> > 
> > https://lkml.org/lkml/2012/6/21/30
> 

Not sure if Jim is using memcg; if not, then this won't be helpful.

> Another possibility is that compaction is succeeding every time,
> but since we always start scanning all the way at the beginning
> and end of each zone, we waste a lot of CPU time rescanning the
> same pages (that we just filled up with moved pages) to see if
> any are free.
> 
> In short, due to the way compaction behaves right now,
> compaction + isolate_freepages are essentially quadratic.
> 
> What we need to do is remember where we left off after a
> successful compaction, so we can continue the search there
> at the next invocation.
> 

So when the free and migration scanners meet and compact_finished() ==
COMPACT_CONTINUE, loop around to the start of the zone and continue until
you reach the pfn this invocation started at?  Seems appropriate.
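
I.e. the scan order would look like this toy, self-contained illustration
(no real kernel structures; the pfn numbers are made up):

#include <stdio.h>

int main(void)
{
        /* pretend the zone spans pfns [0, 16); this run starts at pfn 10 */
        unsigned long zone_start = 0, zone_end = 16, first = 10;
        unsigned long pfn = first;

        do {
                printf("scan pageblock at pfn %lu\n", pfn);
                if (++pfn == zone_end)
                        pfn = zone_start;       /* loop around to the zone start */
        } while (pfn != first);                 /* stop where this run began */

        return 0;
}

Every pageblock still gets looked at exactly once per invocation, just
starting from the cached position instead of the zone boundary.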

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: excessive CPU utilization by isolate_freepages?
  2012-06-28  0:28     ` Rik van Riel
@ 2012-06-28  0:55       ` Minchan Kim
  -1 siblings, 0 replies; 24+ messages in thread
From: Minchan Kim @ 2012-06-28  0:55 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Jim Schutt, Andrew Morton, Mel Gorman, KAMEZAWA Hiroyuki,
	linux-mm, linux-kernel, ceph-devel

On 06/28/2012 09:28 AM, Rik van Riel wrote:

> On 06/27/2012 07:59 PM, Minchan Kim wrote:
> 
>> I doubt compaction try to migrate continuously although we have no
>> free memory.
>> Could you apply this patch and retest?
>>
>> https://lkml.org/lkml/2012/6/21/30
> 
> Another possibility is that compaction is succeeding every time,
> but since we always start scanning all the way at the beginning
> and end of each zone, we waste a lot of CPU time rescanning the
> same pages (that we just filled up with moved pages) to see if
> any are free.


It does make sense.

> 
> In short, due to the way compaction behaves right now,
> compaction + isolate_freepages are essentially quadratic.
> 
> What we need to do is remember where we left off after a
> successful compaction, so we can continue the search there
> at the next invocation.
> 


Good idea.
It could enhance parallel compaction, too.
Of course, if we can't meet the goal, we need to loop around from the start/end of the zone.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: excessive CPU utilization by isolate_freepages?
  2012-06-28  0:52       ` David Rientjes
@ 2012-06-28  0:58         ` Minchan Kim
  -1 siblings, 0 replies; 24+ messages in thread
From: Minchan Kim @ 2012-06-28  0:58 UTC (permalink / raw)
  To: David Rientjes
  Cc: Rik van Riel, Jim Schutt, Andrew Morton, Mel Gorman,
	KAMEZAWA Hiroyuki, linux-mm, linux-kernel, ceph-devel

On 06/28/2012 09:52 AM, David Rientjes wrote:

> On Wed, 27 Jun 2012, Rik van Riel wrote:
> 
>>> > > I doubt compaction try to migrate continuously although we have no free
>>> > > memory.
>>> > > Could you apply this patch and retest?
>>> > > 
>>> > > https://lkml.org/lkml/2012/6/21/30
>> > 
> Not sure if Jim is using memcg; if not, then this won't be helpful.
> 


It isn't related to memcg.
If compaction_alloc() can't find a suitable migration target, it returns NULL.
Then migrate_pages() should exit.
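
For context, that flow is roughly the following (a paraphrase from memory of
compaction_alloc() in mm/compaction.c, trimmed; see the source for the real
version):

    static struct page *compaction_alloc(struct page *migratepage,
                                         unsigned long data, int **result)
    {
        struct compact_control *cc = (struct compact_control *)data;
        struct page *freepage;

        /* Refill the private free list by scanning for free pages */
        if (list_empty(&cc->freepages)) {
            isolate_freepages(cc->zone, cc);

            /* Nothing could be isolated: tell migrate_pages() to stop */
            if (list_empty(&cc->freepages))
                return NULL;
        }

        freepage = list_entry(cc->freepages.next, struct page, lru);
        list_del(&freepage->lru);
        cc->nr_freepages--;

        return freepage;
    }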

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: excessive CPU utilization by isolate_freepages?
  2012-06-28  0:58         ` Minchan Kim
@ 2012-06-28  1:06           ` David Rientjes
  -1 siblings, 0 replies; 24+ messages in thread
From: David Rientjes @ 2012-06-28  1:06 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, Jim Schutt, Andrew Morton, Mel Gorman,
	KAMEZAWA Hiroyuki, linux-mm, linux-kernel, ceph-devel

On Thu, 28 Jun 2012, Minchan Kim wrote:

> >>> > > https://lkml.org/lkml/2012/6/21/30
> >> > 
> > Not sure if Jim is using memcg; if not, then this won't be helpful.
> > 
> 
> 
> It isn't related to memcg.
> If compaction_alloc() can't find a suitable migration target, it returns NULL.
> Then migrate_pages() should exit.
> 

If isolate_freepages() is going to fail, then this zone should have been 
skipped when checking for compaction_suitable().  In Jim's perf output, 
compaction_suitable() returns COMPACT_CONTINUE for a transparent hugepage.  
Why is zone_watermark_ok(zone, 0, low_wmark + 1024, 0, 0) succeeding if 
isolate_freepages() is going to fail?
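
For reference, the check in question is roughly (a paraphrase from memory of
compaction_suitable(); see mm/compaction.c for the exact code):

    /* Order-0 watermark plus headroom for the pages compaction itself needs;
     * for a THP allocation (order 9), 2UL << 9 is the 1024 mentioned above. */
    watermark = low_wmark_pages(zone) + (2UL << order);
    if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
        return COMPACT_SKIPPED;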

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: excessive CPU utilization by isolate_freepages?
  2012-06-28  0:52       ` David Rientjes
@ 2012-06-28  1:13         ` Rik van Riel
  -1 siblings, 0 replies; 24+ messages in thread
From: Rik van Riel @ 2012-06-28  1:13 UTC (permalink / raw)
  To: David Rientjes
  Cc: Minchan Kim, Jim Schutt, Andrew Morton, Mel Gorman,
	KAMEZAWA Hiroyuki, linux-mm, linux-kernel, ceph-devel

On 06/27/2012 08:52 PM, David Rientjes wrote:
> On Wed, 27 Jun 2012, Rik van Riel wrote:

>> Another possibility is that compaction is succeeding every time,
>> but since we always start scanning all the way at the beginning
>> and end of each zone, we waste a lot of CPU time rescanning the
>> same pages (that we just filled up with moved pages) to see if
>> any are free.
>>
>> In short, due to the way compaction behaves right now,
>> compaction + isolate_freepages are essentially quadratic.
>>
>> What we need to do is remember where we left off after a
>> successful compaction, so we can continue the search there
>> at the next invocation.
>>
>
> So when the free and migration scanners meet and compact_finished() ==
> COMPACT_CONTINUE, loop around to the start of the zone and continue until
> you reach the pfn that it was started at?  Seems appropriate.

Exactly.

It would entail changes to struct compact_control, where
we have to remember whether we started at the top of the
zone or not (for a full compaction, i.e. order == -1, we might).

For a compaction of order >0, we would remember the last
pfn where isolate_freepages isolated a page, and start
isolating below that.

If we fail to isolate pages when cc->free_pfn and cc->migrate_pfn
meet, we may want to restart from the top for a second round.
If they meet again after the second round, we have really failed,
and compaction will be deferred.

As long as compaction succeeds, we will slowly move through the
zone, with each invocation of compaction scanning a little more.

For cc->migrate_pfn it is probably fine to start from the
beginning of the zone each time.
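
A sketch of the bookkeeping this would need; the field and helper names below
are invented for illustration, and nothing like them exists in 3.5-rc4:

    /* Hypothetical additions to struct compact_control */
    struct compact_control {
        /* ... existing fields: freepages, migratepages, nr_freepages,
         * free_pfn, migrate_pfn, order, sync, zone, ... */
        unsigned long start_free_pfn;   /* where this scan cycle began */
        bool wrapped;                   /* already looped back to the top? */
    };

    /* Hypothetical helper, called when cc->free_pfn and cc->migrate_pfn
     * meet without compaction having succeeded. */
    static bool compact_restart_free_scanner(struct zone *zone,
                                             struct compact_control *cc)
    {
        if (cc->wrapped)
            return false;               /* second failure: defer compaction */

        cc->wrapped = true;
        cc->free_pfn = zone->zone_start_pfn + zone->spanned_pages;
        cc->free_pfn &= ~(pageblock_nr_pages - 1);
        return true;                    /* retry once from the top of the zone */
    }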

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: excessive CPU utilization by isolate_freepages?
  2012-06-28  1:06           ` David Rientjes
@ 2012-06-28  1:18             ` Minchan Kim
  -1 siblings, 0 replies; 24+ messages in thread
From: Minchan Kim @ 2012-06-28  1:18 UTC (permalink / raw)
  To: David Rientjes
  Cc: Rik van Riel, Jim Schutt, Andrew Morton, Mel Gorman,
	KAMEZAWA Hiroyuki, linux-mm, linux-kernel, ceph-devel

On 06/28/2012 10:06 AM, David Rientjes wrote:

> On Thu, 28 Jun 2012, Minchan Kim wrote:
> 
>>>>>>> https://lkml.org/lkml/2012/6/21/30
>>>>>
>>> Not sure if Jim is using memcg; if not, then this won't be helpful.
>>>
>>
>>
>> It isn't related to memcg.
>> If compaction_alloc() can't find a suitable migration target, it returns NULL.
>> Then migrate_pages() should exit.
>>
> 
> If isolate_freepages() is going to fail, then this zone should have been 
> skipped when checking for compaction_suitable().  In Jim's perf output, 
> compaction_suitable() returns COMPACT_CONTINUE for a transparent hugepage.  
> Why is zone_watermark_ok(zone, 0, low_wmark + 1024, 0, 0) succeeding if 
> isolate_freepages() is going to fail?
> 


zone_watermark_ok() doesn't consider the migratetype, but
suitable_migration_target() does.
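
Roughly, the distinction is the following (a simplified paraphrase; the real
suitable_migration_target() in mm/compaction.c has more cases):

    /* The free scanner only harvests pages from pageblocks it considers safe
     * to steal from, while zone_watermark_ok() just compares the total number
     * of free pages against the watermark, regardless of migratetype. */
    static bool suitable_migration_target(struct page *page)
    {
        int migratetype = get_pageblock_migratetype(page);

        /* Leave isolated and reserve pageblocks alone */
        if (migratetype == MIGRATE_ISOLATE || migratetype == MIGRATE_RESERVE)
            return false;

        /* A large free (buddy) page is always a fine target */
        if (PageBuddy(page) && page_order(page) >= pageblock_order)
            return true;

        /* Otherwise only movable pageblocks (simplified here) */
        return migratetype == MIGRATE_MOVABLE;
    }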

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: excessive CPU utilization by isolate_freepages?
  2012-06-27 21:59 ` Jim Schutt
@ 2012-06-28 11:36   ` Mel Gorman
  -1 siblings, 0 replies; 24+ messages in thread
From: Mel Gorman @ 2012-06-28 11:36 UTC (permalink / raw)
  To: Jim Schutt
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Rik van Riel, Minchan Kim,
	linux-mm, linux-kernel, ceph-devel

On Wed, Jun 27, 2012 at 03:59:19PM -0600, Jim Schutt wrote:
> Hi,
> 
> I'm running into trouble with systems going unresponsive,
> and perf suggests it's excessive CPU usage by isolate_freepages().
> I'm currently testing 3.5-rc4, but I think this problem may have
> first shown up in 3.4.  I'm only just learning how to use perf,
> so I only currently have results to report for 3.5-rc4.
> 

Out of curiosity, why do you think it showed up in 3.4? It's not
surprising as such if it did show up there but I'm wondering what you
are basing it on.

It's not a surprise because that's also where reclaim/compaction stopped
depending on lumpy reclaim. In the past we would have reclaimed more
pages, but now we rely on compaction more. It's plausible that with many
parallel compactions there would be higher CPU usage now.

> <SNIP>
> 2012-06-27 14:00:03.219-06:00
> vmstat -w 4 16
> procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
>  r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
> 75  1          0     566988        576   35664800    0    0     2  1355   21    3   1  4  95  0  0
> 433  1          0     964052        576   35069112    0    0     7 456359 102256 20901   2 98   0  0  0
> 547  3          0     820116        576   34893932    0    0    57 560507 114878 28115   3 96   0  0  0
> 806  2          0     606992        576   34848180    0    0   339 309668 101230 21056   2 98   0  0  0
> 708  1          0     529624        576   34708000    0    0   248 370886 101327 20062   2 97   0  0  0
> 231  5          0     504772        576   34663880    0    0   305 334824 95045 20407   2 97   1  1  0
> 158  6          0    1063088        576   33518536    0    0   531 847435 130696 47140   4 92   1  2  0
> 193  0          0    1449156        576   33035572    0    0   363 371279 94470 18955   2 96   1  1  0
> 266  6          0    1623512        576   32728164    0    0    77 241114 95730 15483   2 98   0  0  0
> 243  8          0    1629504        576   32653080    0    0    81 471018 100223 20920   3 96   0  1  0
> 70 11          0    1342140        576   33084020    0    0   100 925869 139876 56599   6 88   3  3  0
> 211  7          0    1130316        576   33470432    0    0   290 1008984 150699 74320   6 83   6  5  0
> 365  3          0     776736        576   34072772    0    0   182 747167 139436 67135   5 88   4  3  0
> 29  1          0    1528412        576   34110640    0    0    50 612181 137403 77609   4 87   6  3  0
> 266  5          0    1657688        576   34105696    0    0     3 258307 62879 38508   2 93   3  2  0
> 1159  2          0    2002256        576   33775476    0    0    19 88554 42112 14230   1 98   0  0  0
> 

ok, so System CPU usage through the roof.

> 
> Right around 14:00 I was able to get a "perf -a -g"; here's the
> beginning of what "perf report --sort symbol --call-graph fractal,5"
> had to say:
> 
> #
>     64.86%  [k] _raw_spin_lock_irqsave
>             |
>             |--97.94%-- isolate_freepages
>             |          compaction_alloc
>             |          unmap_and_move
>             |          migrate_pages
>             |          compact_zone
>             |          |
>             |          |--99.56%-- try_to_compact_pages
>             |          |          __alloc_pages_direct_compact
>             |          |          __alloc_pages_slowpath
>             |          |          __alloc_pages_nodemask
>             |          |          alloc_pages_vma
>             |          |          do_huge_pmd_anonymous_page
>             |          |          handle_mm_fault
>             |          |          do_page_fault
>             |          |          page_fault
>             |          |          |
>             |          |          |--53.53%-- skb_copy_datagram_iovec
>             |          |          |          tcp_recvmsg
>             |          |          |          inet_recvmsg
>             |          |          |          sock_recvmsg
>             |          |          |          sys_recvfrom
>             |          |          |          system_call_fastpath
>             |          |          |          __recv
>             |          |          |          |
>             |          |          |           --100.00%-- (nil)
>             |          |          |
>             |          |          |--27.80%-- __pthread_create_2_1
>             |          |          |          (nil)
>             |          |          |
>             |          |           --18.67%-- memcpy
>             |          |                     |
>             |          |                     |--57.38%-- 0x50d000005
>             |          |                     |
>             |          |                     |--34.52%-- 0x3b300bf271940a35
>             |          |                     |
>             |          |                      --8.10%-- 0x1500000000000009
>             |           --0.44%-- [...]
>              --2.06%-- [...]
> 

This looks like lock contention to me on zone->lock which
isolate_freepages takes and releases frequently. You do not describe the
exact memory layout but it's likely that there are two very large zones
with 12 CPUs each. If they all were running compaction they would pound
zone->lock pretty heavily.

> <SNIP>

The other call traces also look like they are pounding zone->lock
heavily.

Rik's patch has the potential to reduce contention by virtue of the fact
that less scanning is required. I'd be interested in hearing how much of
an impact that patch has so please test that first.

If that approach does not work, I'll put together a patch that backs
off compaction when zone->lock is contended.
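
Purely as an illustration of that idea (the helper name and the sync/async
plumbing are invented for this sketch, not an existing API):

    /* Hypothetical: take zone->lock, but let an async compaction attempt
     * bail out instead of spinning when the lock is contended. */
    static bool compact_trylock_irqsave(spinlock_t *lock,
                                        unsigned long *flags, bool sync)
    {
        if (spin_trylock_irqsave(lock, *flags))
            return true;

        if (!sync)
            return false;   /* caller aborts this compaction attempt */

        spin_lock_irqsave(lock, *flags);
        return true;
    }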

> I seem to be able to recreate this issue at will, so please
> let me know what I can do to help learn what is going on.
> 

Thanks very much for testing.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: excessive CPU utilization by isolate_freepages?
  2012-06-28 11:36   ` Mel Gorman
@ 2012-06-28 15:30     ` Jim Schutt
  -1 siblings, 0 replies; 24+ messages in thread
From: Jim Schutt @ 2012-06-28 15:30 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Rik van Riel, Minchan Kim,
	linux-mm, linux-kernel, ceph-devel

On 06/28/2012 05:36 AM, Mel Gorman wrote:
> On Wed, Jun 27, 2012 at 03:59:19PM -0600, Jim Schutt wrote:
>> Hi,
>>
>> I'm running into trouble with systems going unresponsive,
>> and perf suggests it's excessive CPU usage by isolate_freepages().
>> I'm currently testing 3.5-rc4, but I think this problem may have
>> first shown up in 3.4.  I'm only just learning how to use perf,
>> so I only currently have results to report for 3.5-rc4.
>>
>
> Out of curiosity, why do you think it showed up in 3.4? It's not
> surprising as such if it did show up there but I'm wondering what you
> are basing it on.

If I remember correctly, testing this workload on 3.4 was when I first
saw hundreds of runnable threads reported by vmstat.
At that time I couldn't reproduce quite as reliably, and I didn't
know how to get perf to give me call chains, so I didn't follow up :(

>
> It's not a surprise because that's also where reclaim/compaction stopped
> depending on lumpy reclaim. In the past we would have reclaimed more
> pages, but now we rely on compaction more. It's plausible that with many
> parallel compactions there would be higher CPU usage now.
>
>> <SNIP>
>> 2012-06-27 14:00:03.219-06:00
>> vmstat -w 4 16
>> procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
>>   r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
>> 75  1          0     566988        576   35664800    0    0     2  1355   21    3   1  4  95  0  0
>> 433  1          0     964052        576   35069112    0    0     7 456359 102256 20901   2 98   0  0  0
>> 547  3          0     820116        576   34893932    0    0    57 560507 114878 28115   3 96   0  0  0
>> 806  2          0     606992        576   34848180    0    0   339 309668 101230 21056   2 98   0  0  0
>> 708  1          0     529624        576   34708000    0    0   248 370886 101327 20062   2 97   0  0  0
>> 231  5          0     504772        576   34663880    0    0   305 334824 95045 20407   2 97   1  1  0
>> 158  6          0    1063088        576   33518536    0    0   531 847435 130696 47140   4 92   1  2  0
>> 193  0          0    1449156        576   33035572    0    0   363 371279 94470 18955   2 96   1  1  0
>> 266  6          0    1623512        576   32728164    0    0    77 241114 95730 15483   2 98   0  0  0
>> 243  8          0    1629504        576   32653080    0    0    81 471018 100223 20920   3 96   0  1  0
>> 70 11          0    1342140        576   33084020    0    0   100 925869 139876 56599   6 88   3  3  0
>> 211  7          0    1130316        576   33470432    0    0   290 1008984 150699 74320   6 83   6  5  0
>> 365  3          0     776736        576   34072772    0    0   182 747167 139436 67135   5 88   4  3  0
>> 29  1          0    1528412        576   34110640    0    0    50 612181 137403 77609   4 87   6  3  0
>> 266  5          0    1657688        576   34105696    0    0     3 258307 62879 38508   2 93   3  2  0
>> 1159  2          0    2002256        576   33775476    0    0    19 88554 42112 14230   1 98   0  0  0
>>
>
> ok, so System CPU usage through the roof.
>
>>
>> Right around 14:00 I was able to get a "perf -a -g"; here's the
>> beginning of what "perf report --sort symbol --call-graph fractal,5"
>> had to say:
>>
>> #
>>      64.86%  [k] _raw_spin_lock_irqsave
>>              |
>>              |--97.94%-- isolate_freepages
>>              |          compaction_alloc
>>              |          unmap_and_move
>>              |          migrate_pages
>>              |          compact_zone
>>              |          |
>>              |          |--99.56%-- try_to_compact_pages
>>              |          |          __alloc_pages_direct_compact
>>              |          |          __alloc_pages_slowpath
>>              |          |          __alloc_pages_nodemask
>>              |          |          alloc_pages_vma
>>              |          |          do_huge_pmd_anonymous_page
>>              |          |          handle_mm_fault
>>              |          |          do_page_fault
>>              |          |          page_fault
>>              |          |          |
>>              |          |          |--53.53%-- skb_copy_datagram_iovec
>>              |          |          |          tcp_recvmsg
>>              |          |          |          inet_recvmsg
>>              |          |          |          sock_recvmsg
>>              |          |          |          sys_recvfrom
>>              |          |          |          system_call_fastpath
>>              |          |          |          __recv
>>              |          |          |          |
>>              |          |          |           --100.00%-- (nil)
>>              |          |          |
>>              |          |          |--27.80%-- __pthread_create_2_1
>>              |          |          |          (nil)
>>              |          |          |
>>              |          |           --18.67%-- memcpy
>>              |          |                     |
>>              |          |                     |--57.38%-- 0x50d000005
>>              |          |                     |
>>              |          |                     |--34.52%-- 0x3b300bf271940a35
>>              |          |                     |
>>              |          |                      --8.10%-- 0x1500000000000009
>>              |           --0.44%-- [...]
>>               --2.06%-- [...]
>>
>
> This looks like lock contention to me on zone->lock which
> isolate_freepages takes and releases frequently. You do not describe the
> exact memory layout but it's likely that there are two very large zones
> with 12 CPUs each. If they all were running compaction they would pound
> zone->lock pretty heavily.

Does this tell you what you need to know?
If not, let me know what you'd like to see.

# grep -H MemTotal /sys/devices/system/node/node*/meminfo
/sys/devices/system/node/node0/meminfo:Node 0 MemTotal:       25156644 kB
/sys/devices/system/node/node1/meminfo:Node 1 MemTotal:       25165824 kB

>
>> <SNIP>
>
> The other call traces also look like they are pounding zone->lock
> heavily.
>
> Rik's patch has the potential to reduce contention by virtue of the fact
> that less scanning is required. I'd be interested in hearing how much of
> an impact that patch has so please test that first.

Working on that now....

>
> If that approach does not work, I'll put together a patch that backs
> off compaction when zone->lock is contended.
>
>> I seem to be able to recreate this issue at will, so please
>> let me know what I can do to help learn what is going on.
>>
>
> Thanks very much for testing.
>

No problem.  Thanks to you all for jumping on this so quickly.

FWIW, perf rocks!  Thanks to all who've helped make it so great!

-- Jim



^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2012-06-28 15:30 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-06-27 21:59 excessive CPU utilization by isolate_freepages? Jim Schutt
2012-06-27 21:59 ` Jim Schutt
2012-06-27 22:35 ` Mark Nelson
2012-06-27 22:35   ` Mark Nelson
2012-06-27 23:59 ` Minchan Kim
2012-06-27 23:59   ` Minchan Kim
2012-06-28  0:28   ` Rik van Riel
2012-06-28  0:28     ` Rik van Riel
2012-06-28  0:52     ` David Rientjes
2012-06-28  0:52       ` David Rientjes
2012-06-28  0:58       ` Minchan Kim
2012-06-28  0:58         ` Minchan Kim
2012-06-28  1:06         ` David Rientjes
2012-06-28  1:06           ` David Rientjes
2012-06-28  1:18           ` Minchan Kim
2012-06-28  1:18             ` Minchan Kim
2012-06-28  1:13       ` Rik van Riel
2012-06-28  1:13         ` Rik van Riel
2012-06-28  0:55     ` Minchan Kim
2012-06-28  0:55       ` Minchan Kim
2012-06-28 11:36 ` Mel Gorman
2012-06-28 11:36   ` Mel Gorman
2012-06-28 15:30   ` Jim Schutt
2012-06-28 15:30     ` Jim Schutt
