* [BUG] infinite loop in find_get_pages()
@ 2011-09-13 19:23 Eric Dumazet
  2011-09-13 23:53 ` Andrew Morton
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Eric Dumazet @ 2011-09-13 19:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Linus,

It seems current kernels (3.1.0-rc6) are really unreliable, or maybe I
expect too much from them.

On my 4GB x86_64 machine (2 quad-core cpus, 2 threads per core), I can
have a cpu locked in

 find_get_pages -> radix_tree_gang_lookup_slot -> __lookup 


Problem is : A bisection will be very hard, since a lot of kernels
simply destroy my disk (the PCI MRRS horror stuff).

Messages at console :
 
INFO: rcu_preempt_state detected stalls on CPUs/tasks: {} (detected by
11 t=60002 jiffies)

perf top -C 1

Events: 3K cycles                                                                                                                                             
+     43,08%  bash  [kernel.kallsyms]  [k] __lookup
+     41,51%  bash  [kernel.kallsyms]  [k] find_get_pages
+     15,31%  bash  [kernel.kallsyms]  [k] radix_tree_gang_lookup_slot

    43.08%     bash  [kernel.kallsyms]  [k] __lookup
               |
               --- __lookup
                  |          
                  |--97.09%-- radix_tree_gang_lookup_slot
                  |          find_get_pages
                  |          pagevec_lookup
                  |          invalidate_mapping_pages
                  |          drop_pagecache_sb
                  |          iterate_supers
                  |          drop_caches_sysctl_handler
                  |          proc_sys_call_handler.isra.3
                  |          proc_sys_write
                  |          vfs_write
                  |          sys_write
                  |          system_call_fastpath
                  |          __write
                  |          


Steps to reproduce :

In one terminal, kernel builds in a loop (defconfig + hpsa driver)

cd /usr/src/linux
while :
do
 make clean
 make -j128
done


In another term :

while :
do
 echo 3 >/proc/sys/vm/drop_caches
 sleep 20
done
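
For reference (per Documentation/sysctl/vm.txt), the value written selects
what gets dropped:

echo 1 >/proc/sys/vm/drop_caches   # free pagecache only
echo 2 >/proc/sys/vm/drop_caches   # free reclaimable slab (dentries, inodes)
echo 3 >/proc/sys/vm/drop_caches   # free both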


Before the lockup, I can see some swapping activity in another terminal.

$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  2  17728 3443924  11520 328020    0    0   256 12076 16250  554  0  6 82 12
 1  1  17728 3444776  11584 328072    0    0   100  2868 16223  267  0  6 86  7
 1  1  17728 3442200  12100 328348    0    0   868     0 16600 1778  0  7 88  6
 1  1  17728 3438032  13036 329048    0    0  1628     0 16862 2480  0  7 87  5
 1  1  17728 3546864  13988 220256    0    0  1000     0 16313  931  0  7 87  6
 1  1  17728 3544260  16024 220256    0    0  2036     0 16513 1531  0  6 88  6
 1  1  17728 3542896  17196 220256    0    0  1160   556 16324  893  0  6 88  6
 1  1  17728 3540748  18756 220256    0    0  1560     0 16398 1172  0  6 88  6
 1  1  17728 3538692  20168 220256    0    0  1412     0 16544 1088  0  6 88  6
 2  0  17728 3536676  21816 220248    0    0  1648     0 16447 1246  0  6 88  6
 1  1  17728 3535052  22544 220256    0    0   728     0 16215  605  1  6 87  5
 1  1  17728 3533672  23404 220244    0    0   860  4240 16264  705  0  6 88  6
 1  1  17728 3532688  24232 220244    0    0   828     0 16272  685  0  6 87  6
 1  1  17728 3531552  25080 220244    0    0   848     0 16294  700  0  6 88  6
 1  1  17728 3529584  26532 220256    0    0  1452     0 16376 1104  0  6 87  6
 1  2  17728 3545232  27848 199176    0    0  1312    52 16392  911  0  7 85  8
 1  2  17728 3659060  29576  84420    0    0  1736    40 16570  959  0  7 81 12
38  3  17728 3640652  29984  69976    0    0   688     0 16885 2987  3  8 80  9
 5  2  17728 3601716  30208  75628    0    0  4676     4 18080 5727 11 10 66 12
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
103 27  17728 2286372  30376  78952    0    0  3044     8 17772 6803 49 16 34  1
128  1  17728 1337588  30416  79952    0    0   732  4080 16389 4874 91  9  0  0
122  7  17728 730264  30472  81056    0    0   540  1300 16535 5451 91  9  0  0
99 16  17728 996308  30544  83136    0    0   492   452 16951 6629 92  8  0  0
89 23  17728 1150640  30592  88288    0    0  3232   224 17286 7312 91  9  0  0
114  7  17728 1344768  30660  92104    0    0  1668   228 17395 7297 89 11  0  0
99  3  17728 848716  30696  93684    0    0   688  2072 16947 6368 92  8  0  0
112  9  17728 609908  30748  96036    0    0   620   272 17221 7640 90 10  0  0
111  8  17728 480244  30808  98268    0    0   788   320 17227 7391 92  8  0  0
115  7  17728 549564  30852 100552    0    0   656   232 17583 7807 92  9  0  0
107  9  17728 666776  30888 102904    0    0   716     0 17406 7781 91  9  0  0
124  5  17728 685368  30960 105544    0    0  1056   944 17281 7713 90 10  0  0
130  1  17728 538832  31000 108080    0    0   776     0 16943 7347 91  9  0  0
130  0  17728 364476  31032 110252    0    0   676     0 16767 6948 91  9  0  0
129  0  17728 149332  31064 111848    0    0   540    32 16673 6272 92  8  0  0
129  0  17728 274664  31096 114052    0    0   628     0 17207 7694 92  8  0  0
128  3  17728 589736  31160 117420    0    0   816   996 17381 8443 90 10  0  0
126  5  17728 485300  31172 119544    0    0   416     0 17024 7186 91  9  0  0
130  0  17728 349500  31216 122344    0    0   492     0 17046 7358 91  9  0  0
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
130  2  17728 416972  31248 125404    0    0   496   120 17112 7124 91  9  0  0
125  5  17308 188608  29444 106888    0  576  1436   612 40020 9430 91  8  0  0
113 16  17308 218700  29528 110336   32    0  1908     0 17210 7214 92  8  0  0
 1 145  20292  15688  26884 108200   40 3020   188  4660 27003 3664 30  7  0 63
 1 145  21128  15920  24212 107420    0  836     0  3824 16813  430  1  6  0 93
 2 144  22904  16020  20780 106780    0 1776     0  6548 16611  505  1  6  0 93
 1 146  23496  15788  17476 106160   32  596    60  3620 16610  308  1  6  0 93
 1 147  23924  16216  16028 105852   32  432    32  5012 16477  156  0  6  0 93
 1 145  24428  15904  14744 103452   20  504    20  3112 16776  125  1  6  0 93
 1 146  25304  16184  14688  97712    0  876    16  3352 16759  447  2  6  0 92
 1 147  26984  15908  14588  88348   96 1680    96  6352 17006  235  1  6  0 93
 1 146  28724  16112  14152  77132   32 1740    44  3536 16739  375  2  6  0 92
 1 151  29900  15896  12072  68484  156 1184   192  2068 16860  576  2  6  0 91
 2 152  33724  33908   9536  58616  184 3856   512  6764 16536  492  2  6  3 88
 1 142  33276 427352   8964  58988 1096  120  2624   120 16730 1129  6  7  8 79
 2 142  33000 421512   8988  60944 1560    0  3512     0 16771 1220  1  6  9 84
 2 143  32604 392952   9012  62308 1176    0  2436     0 16690 1173  2  7 10 82
 8 134  32400 255348   9044  64696  688    0  2584     0 17105 2181 16  8 14 62
 6 136  31796 142068   9092  66024 1060    0  1828     0 17040 2226 37 10 12 41
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2 143  31664  15844   9152  67452  580   64  1324   292 16973 2066 37 10 11 43
 4 141  31876  56160   9052  67528   48  328   140  1724 16476  696  6  7  0 87
 4 141  32420 176260   8896  68808  108  732   760  2280 17449 3081 24  9  0 67
11 134  32540 119868   8484  70568  108  852  1140  1408 17436 3788 45 12  1 43
17 129  32880  57044   8256  73008    0  364  1212   364 17489 4000 59 13  0 28
11 135  33468 107128   7660  73124  200 1044   888  2076 17043 1956 23  9  0 69
 1 144  34788  16076   6948  71572  180 1524   276  1908 16787  967 13  7  0 79
 1 145  35472  16188   5868  70348  112  768   120  1284 16696  561  1  6  0 93
 1 145  36056  16696   5492  68240   16  596    16  3356 16456  202  0  6  0 93
 1 143  38200  15952   3168  63968   32 2168    52  6460 16834  423  1  7  0 92
 9 131  40128 139084   3064  61060  172 2144   644  2192 17701 2250 19  9  0 72
 9 133  40548 110308   3092  60492  468  620   900  1852 17516 1983 35  9  0 55
10 132  40448  79476   3132  61808 1020    0  1480     0 17505 3254 35 10  0 55
12 132  40532 139396   3156  63204  776  260  1272   892 17457 3179 44 11  0 45
11 132  40392  66336   3256  65264  788    0  1536     0 17551 3860 46 11  0 43
 1 142  41112  15796   3296  65680 1176  812  1636  2568 17026 1798 28  9  0 63
 1 140  41500  15960   3244  64828   92  472   116  4008 16445  443  4  7  0 90
 1 140  42252  16740   3232  64356    0  764     0  1500 16403  185  0  6  0 94
 1 139  49636  16024   2928  60652   52 7376    52  7376 17507 1236  0  7  0 93
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2 140  55780  16444   2548  55948  176 6200   332  6260 17160  592  1  7  0 92
 3 145  59800 358088   2404  55468  100 4108  1092  4132 18514 3864 17  8  0 75
 3 143  60712  27028   2416  57184  816  964  2392  1288 18089 3476 43 10  0 47
 4 141  61296 154136   2516  58024  424  980  1312   980 17298 2489 28  9  0 62
21 122  62544  83120   2528  58372  100 1456   788  1456 17717 2738 64 12  0 24
24 120  62780  53328   2580  62216   16  292  2528   292 17163 4076 85 14  0  1
 1 143  65088  16096   2492  61524  152 2708   764  2712 16734 1474 16  8  0 76
 3 141  65672  34232   2476  60536   56  672   240  3208 16726  661  4  7  0 89
 1 144  65584  16044   2488  60440  808   68   948  1532 17187 1353 10  8  0 82
 4 141  70836  17216   2444  58024   64 5272    64  6968 16957  437  0  6  0 93
 6 134  73728  31940   2424  56880  436 3092   748  3188 16950 1269  8  7  0 85
 2 139  76036 107996   2408  56404   92 2420   476  2784 16869  690  6  7  0 87
 6 135  76112  82792   2436  57884 1108  476  1632   724 16999 1711 18  8  0 73
 1 139  77184  17872   2444  57860  996 1084  1168  2320 16644  748 11  8  0 81
 1 141  91136  15952   2300  51868  100 14088   128 14152 17494 1284  1  7  5 87
 1 143  98356 204144   2256  48168  640 7496  1148  7580 17471 1840  6  7 12 74
 3 139  97344 174272   2276  48968 2636    0  3216     0 16962 1499 13  8 11 69
 9 133  97220 123464   2352  50584 1348    0  2320   500 17100 2255 27  9  8 56
 9 134  97092  33672   2396  51780 1292  108  2028   108 16821 1547 27  8  8 57
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
11 134  95068  75744   2448  53444  852    0  1696     0 17318 2630 34 10  2 54
 1 143  95104  15972   2504  54544  116   44   696    44 16545 1209 20  8  5 67
^C





* Re: [BUG] infinite loop in find_get_pages()
  2011-09-13 19:23 [BUG] infinite loop in find_get_pages() Eric Dumazet
@ 2011-09-13 23:53 ` Andrew Morton
  2011-09-14  0:21   ` Eric Dumazet
  2011-09-14  0:34   ` Lin Ming
       [not found] ` <CA+55aFyG3-3_gqGjqUmsTAHWfmNLMdQVf4XqUZrDAGMBxgur=Q@mail.gmail.com>
       [not found] ` <CA+55aFx41_Z4TjjJwPuE21Q8oD3aGWtQwh45DUiCjPVD-wCJXw@mail.gmail.com>
  2 siblings, 2 replies; 24+ messages in thread
From: Andrew Morton @ 2011-09-13 23:53 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Linus Torvalds, linux-kernel, Andrew Morton, Toshiyuki Okajima,
	Dave Chinner, Hugh Dickins

On Tue, 13 Sep 2011 21:23:21 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Linus,
> 
> It seems current kernels (3.1.0-rc6) are really unreliable, or maybe I
> expect too much from them.
> 
> On my 4GB x86_64 machine (2 quad-core cpus, 2 threads per core), I can
> have a cpu locked in
> 
>  find_get_pages -> radix_tree_gang_lookup_slot -> __lookup 
> 
> 
> Problem is : A bisection will be very hard, since a lot of kernels
> simply destroy my disk (the PCI MRRS horror stuff).

Yes, that's hard.  Quite often my bisection efforts involve moving to a
new bisection point, then hand-applying a few patches to make the
thing compile and/or work.

There have only been three commits to radix-tree.c this year, so a bit
of manual searching through those would be practical?
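
For reference, those commits can be listed with something like:

git log --oneline v3.0..v3.1-rc6 -- lib/radix-tree.c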

> Messages at console :
>  
> INFO: rcu_preempt_state detected stalls on CPUs/tasks: {} (detected by
> 11 t=60002 jiffies)
> 
> perf top -C 1
> 
> Events: 3K cycles                                                                                                                                             
> +     43,08%  bash  [kernel.kallsyms]  [k] __lookup
> +     41,51%  bash  [kernel.kallsyms]  [k] find_get_pages
> +     15,31%  bash  [kernel.kallsyms]  [k] radix_tree_gang_lookup_slot
> 
>     43.08%     bash  [kernel.kallsyms]  [k] __lookup
>                |
>                --- __lookup
>                   |          
>                   |--97.09%-- radix_tree_gang_lookup_slot
>                   |          find_get_pages
>                   |          pagevec_lookup
>                   |          invalidate_mapping_pages
>                   |          drop_pagecache_sb
>                   |          iterate_supers
>                   |          drop_caches_sysctl_handler
>                   |          proc_sys_call_handler.isra.3
>                   |          proc_sys_write
>                   |          vfs_write
>                   |          sys_write
>                   |          system_call_fastpath
>                   |          __write
>                   |          
> 
> 
> Steps to reproduce :
> 
> In one terminal, kernel builds in a loop (defconfig + hpsa driver)
> 
> cd /usr/src/linux
> while :
> do
>  make clean
>  make -j128
> done
> 
> 
> In another term :
> 
> while :
> do
>  echo 3 >/proc/sys/vm/drop_caches
>  sleep 20
> done
> 

This is a regression?  3.0 is OK?

Also, do you know that the hang is happening at the radix-tree level? 
It might be at the filemap.c level or at the superblock level and we
just end up spending most cycles at the lower levels because they're
called so often?  The iterate_supers/drop_pagecache_sb code is fairly
recent.
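
For anyone tracing it, the call chain in the perf output corresponds roughly
to the shape below. This is a paraphrased sketch of the 3.1-era
fs/drop_caches.c, not the verbatim source, and inode_should_be_skipped() is
a hypothetical stand-in for the real i_state/nrpages checks.

/* Sketch: writing "1" or "3" to /proc/sys/vm/drop_caches ends up in
 * iterate_supers(drop_pagecache_sb, NULL), which walks every inode
 * that has pagecache and invalidates it. */
static void drop_pagecache_sb(struct super_block *sb, void *unused)
{
        struct inode *inode;

        list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
                /* the real code takes i_lock, skips I_FREEING/I_WILL_FREE/
                 * I_NEW inodes and those with no pages, and pins the inode */
                if (inode_should_be_skipped(inode))     /* hypothetical helper */
                        continue;
                invalidate_mapping_pages(inode->i_mapping, 0, -1);
        }
}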




* Re: [BUG] infinite loop in find_get_pages()
  2011-09-13 23:53 ` Andrew Morton
@ 2011-09-14  0:21   ` Eric Dumazet
  2011-09-14  0:34   ` Lin Ming
  1 sibling, 0 replies; 24+ messages in thread
From: Eric Dumazet @ 2011-09-14  0:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, linux-kernel, Andrew Morton, Toshiyuki Okajima,
	Dave Chinner, Hugh Dickins

On Tuesday 13 September 2011 at 16:53 -0700, Andrew Morton wrote:
> On Tue, 13 Sep 2011 21:23:21 +0200
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > Linus,
> > 
> > It seems current kernels (3.1.0-rc6) are really unreliable, or maybe I
> > expect too much from them.
> > 
> > On my 4GB x86_64 machine (2 quad-core cpus, 2 threads per core), I can
> > have a cpu locked in
> > 
> >  find_get_pages -> radix_tree_gang_lookup_slot -> __lookup 
> > 
> > 
> > Problem is : A bisection will be very hard, since a lot of kernels
> > simply destroy my disk (the PCI MRRS horror stuff).
> 
> Yes, that's hard.  Quite often my bisection efforts involve moving to a
> new bisection point, then hand-applying a few patches to make the
> thing compile and/or work.
> 
> There have only been three commits to radix-tree.c this year, so a bit
> of manual searching through those would be practical?
> 
> > Messages at console :
> >  
> > INFO: rcu_preempt_state detected stalls on CPUs/tasks: {} (detected by
> > 11 t=60002 jiffies)
> > 
> > perf top -C 1
> > 
> > Events: 3K cycles                                                                                                                                             
> > +     43,08%  bash  [kernel.kallsyms]  [k] __lookup
> > +     41,51%  bash  [kernel.kallsyms]  [k] find_get_pages
> > +     15,31%  bash  [kernel.kallsyms]  [k] radix_tree_gang_lookup_slot
> > 
> >     43.08%     bash  [kernel.kallsyms]  [k] __lookup
> >                |
> >                --- __lookup
> >                   |          
> >                   |--97.09%-- radix_tree_gang_lookup_slot
> >                   |          find_get_pages
> >                   |          pagevec_lookup
> >                   |          invalidate_mapping_pages
> >                   |          drop_pagecache_sb
> >                   |          iterate_supers
> >                   |          drop_caches_sysctl_handler
> >                   |          proc_sys_call_handler.isra.3
> >                   |          proc_sys_write
> >                   |          vfs_write
> >                   |          sys_write
> >                   |          system_call_fastpath
> >                   |          __write
> >                   |          
> > 
> > 
> > Steps to reproduce :
> > 
> > In one terminal, kernel builds in a loop (defconfig + hpsa driver)
> > 
> > cd /usr/src/linux
> > while :
> > do
> >  make clean
> >  make -j128
> > done
> > 
> > 
> > In another term :
> > 
> > while :
> > do
> >  echo 3 >/proc/sys/vm/drop_caches
> >  sleep 20
> > done
> > 
> 
> This is a regression?  3.0 is OK?
> 

3.0 seems OK, and the first bisection point seems OK too.

# git bisect log
git bisect start
# bad: [003f6c9df54970d8b19578d195b3e2b398cdbde2] lib/sha1.c: quiet
sparse noise about symbol not declared
git bisect bad 003f6c9df54970d8b19578d195b3e2b398cdbde2
# good: [02f8c6aee8df3cdc935e9bdd4f2d020306035dbe] Linux 3.0
git bisect good 02f8c6aee8df3cdc935e9bdd4f2d020306035dbe


(I let the machine run for one hour or so before concluding it's a good/bad
point.)


> Also, do you know that the hang is happening at the radix-tree level? 
> It might be at the filemap.c level or at the superblock level and we
> just end up spending most cycles at the lower levels because they're
> called so often?  The iterate_supers/drop_pagecache_sb code is fairly
> recent.
> 
> 

No idea yet, but I'll take a look after a bit of sleep ;)

Thanks !




* Re: [BUG] infinite loop in find_get_pages()
  2011-09-13 23:53 ` Andrew Morton
  2011-09-14  0:21   ` Eric Dumazet
@ 2011-09-14  0:34   ` Lin Ming
  2011-09-15 10:47     ` Pawel Sikora
  1 sibling, 1 reply; 24+ messages in thread
From: Lin Ming @ 2011-09-14  0:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Eric Dumazet, Linus Torvalds, linux-kernel, Andrew Morton,
	Toshiyuki Okajima, Dave Chinner, Hugh Dickins, Pawel Sikora,
	Justin Piszcz

On Wed, Sep 14, 2011 at 7:53 AM, Andrew Morton <akpm@google.com> wrote:
> On Tue, 13 Sep 2011 21:23:21 +0200
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>> Linus,
>>
>> It seems current kernels (3.1.0-rc6) are really unreliable, or maybe I
>> expect too much from them.
>>
>> On my 4GB x86_64 machine (2 quad-core cpus, 2 threads per core), I can
>> have a cpu locked in
>>
>>  find_get_pages -> radix_tree_gang_lookup_slot -> __lookup
>>
>>
>> Problem is : A bisection will be very hard, since a lot of kernels
>> simply destroy my disk (the PCI MRRS horror stuff).
>
> Yes, that's hard.  Quite often my bisection efforts involve moving to a
> new bisection point, then hand-applying a few patches to make the
> thing compile and/or work.
>
> There have only been three commits to radix-tree.c this year, so a bit
> of manual searching through those would be practical?
>
>> Messages at console :
>>
>> INFO: rcu_preempt_state detected stalls on CPUs/tasks: {} (detected by
>> 11 t=60002 jiffies)
>>
>> perf top -C 1
>>
>> Events: 3K cycles
>> +     43,08%  bash  [kernel.kallsyms]  [k] __lookup
>> +     41,51%  bash  [kernel.kallsyms]  [k] find_get_pages
>> +     15,31%  bash  [kernel.kallsyms]  [k] radix_tree_gang_lookup_slot
>>
>>     43.08%     bash  [kernel.kallsyms]  [k] __lookup
>>                |
>>                --- __lookup
>>                   |
>>                   |--97.09%-- radix_tree_gang_lookup_slot
>>                   |          find_get_pages
>>                   |          pagevec_lookup
>>                   |          invalidate_mapping_pages
>>                   |          drop_pagecache_sb
>>                   |          iterate_supers
>>                   |          drop_caches_sysctl_handler
>>                   |          proc_sys_call_handler.isra.3
>>                   |          proc_sys_write
>>                   |          vfs_write
>>                   |          sys_write
>>                   |          system_call_fastpath
>>                   |          __write
>>                   |
>>
>>
>> Steps to reproduce :
>>
>> In one terminal, kernel builds in a loop (defconfig + hpsa driver)
>>
>> cd /usr/src/linux
>> while :
>> do
>>  make clean
>>  make -j128
>> done
>>
>>
>> In another term :
>>
>> while :
>> do
>>  echo 3 >/proc/sys/vm/drop_caches
>>  sleep 20
>> done
>>
>
> This is a regression?  3.0 is OK?

FYI, other guys have reported similar bugs for 3.0.

kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
http://marc.info/?l=linux-kernel&m=131342662028153&w=2

[3.0.2-stable] BUG: soft lockup - CPU#13 stuck for 22s! [kswapd2:1092]
http://marc.info/?l=linux-kernel&m=131469584117857&w=2

kernel 3.1-rc4: BUG soft lockup (w/frame pointers enabled)
http://marc.info/?l=linux-kernel&m=131566383719422&w=2

Lin Ming

>
> Also, do you know that the hang is happening at the radix-tree level?
> It might be at the filemap.c level or at the superblock level and we
> just end up spending most cycles at the lower levels because they're
> called so often?  The iterate_supers/drop_pagecache_sb code is fairly
> recent.
>


* Re: [BUG] infinite loop in find_get_pages()
       [not found] ` <CA+55aFyG3-3_gqGjqUmsTAHWfmNLMdQVf4XqUZrDAGMBxgur=Q@mail.gmail.com>
@ 2011-09-14  6:48   ` Linus Torvalds
  2011-09-14  6:53     ` Eric Dumazet
  0 siblings, 1 reply; 24+ messages in thread
From: Linus Torvalds @ 2011-09-14  6:48 UTC (permalink / raw)
  To: Eric Dumazet, Hugh Dickins, Andrew Morton; +Cc: linux-kernel, Rik van Riel

Re-sending, because apparently none of my emails in the last few days
have actually gone out due to LF problems..

                       Linus

On Tue, Sep 13, 2011 at 12:48 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Sep 13, 2011 at 12:23 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>> It seems current kernels (3.1.0-rc6) are really unreliable, or maybe I
>> expect too much from them.
>
> No, by now, they should be damn well reliable.
>
>> On my 4GB x86_64 machine (2 quad-core cpus, 2 threads per core), I can
>> have a cpu locked in
>>
>>  find_get_pages -> radix_tree_gang_lookup_slot -> __lookup
>
> Hmm. There haven't been many changes in this area, so the few changes
> that *do* exist are obviously very suspicious.
>
> In particular, the only real changes to that whole setup are the changes
> by Hugh to make the swap entries use the radix tree. So I'm bringing
> Hugh and Andrew to the discussion (and Rik, since he acked a few of
> those changes).
>
> The fact that some light swapping activity seems to accompany the
> problem just makes me more certain it's Hugh's swap/radix tree work.
>
> We're talking only a handful of patches, so maybe Hugh could create a
> revert patch just to confirm that yes, that's the problem.
>
> Hugh?
>
>                      Linus
>
> --- quoting the rest of the email for Hugh/Andrew ---
>> Problem is : A bisection will be very hard, since a lot of kernels
>> simply destroy my disk (the PCI MRRS horror stuff).
>>
>> Messages at console :
>>
>> INFO: rcu_preempt_state detected stalls on CPUs/tasks: {} (detected by
>> 11 t=60002 jiffies)
>>
>> perf top -C 1
>>
>> Events: 3K cycles
>> +     43,08%  bash  [kernel.kallsyms]  [k] __lookup
>> +     41,51%  bash  [kernel.kallsyms]  [k] find_get_pages
>> +     15,31%  bash  [kernel.kallsyms]  [k] radix_tree_gang_lookup_slot
>>
>>    43.08%     bash  [kernel.kallsyms]  [k] __lookup
>>               |
>>               --- __lookup
>>                  |
>>                  |--97.09%-- radix_tree_gang_lookup_slot
>>                  |          find_get_pages
>>                  |          pagevec_lookup
>>                  |          invalidate_mapping_pages
>>                  |          drop_pagecache_sb
>>                  |          iterate_supers
>>                  |          drop_caches_sysctl_handler
>>                  |          proc_sys_call_handler.isra.3
>>                  |          proc_sys_write
>>                  |          vfs_write
>>                  |          sys_write
>>                  |          system_call_fastpath
>>                  |          __write
>>                  |
>>
>>
>> Steps to reproduce :
>>
>> In one terminal, kernel builds in a loop (defconfig + hpsa driver)
>>
>> cd /usr/src/linux
>> while :
>> do
>>  make clean
>>  make -j128
>> done
>>
>>
>> In another term :
>>
>> while :
>> do
>>  echo 3 >/proc/sys/vm/drop_caches
>>  sleep 20
>> done
>>
>>
>> Before the lockup, I can see some swapping activity in another terminal.
>>
>> $ vmstat 1
>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>>  2  2  17728 3443924  11520 328020    0    0   256 12076 16250  554  0  6 82 12
>>  1  1  17728 3444776  11584 328072    0    0   100  2868 16223  267  0  6 86  7
>>  1  1  17728 3442200  12100 328348    0    0   868     0 16600 1778  0  7 88  6
>>  1  1  17728 3438032  13036 329048    0    0  1628     0 16862 2480  0  7 87  5
>>  1  1  17728 3546864  13988 220256    0    0  1000     0 16313  931  0  7 87  6
>>  1  1  17728 3544260  16024 220256    0    0  2036     0 16513 1531  0  6 88  6
>>  1  1  17728 3542896  17196 220256    0    0  1160   556 16324  893  0  6 88  6
>>  1  1  17728 3540748  18756 220256    0    0  1560     0 16398 1172  0  6 88  6
>>  1  1  17728 3538692  20168 220256    0    0  1412     0 16544 1088  0  6 88  6
>>  2  0  17728 3536676  21816 220248    0    0  1648     0 16447 1246  0  6 88  6
>>  1  1  17728 3535052  22544 220256    0    0   728     0 16215  605  1  6 87  5
>>  1  1  17728 3533672  23404 220244    0    0   860  4240 16264  705  0  6 88  6
>>  1  1  17728 3532688  24232 220244    0    0   828     0 16272  685  0  6 87  6
>>  1  1  17728 3531552  25080 220244    0    0   848     0 16294  700  0  6 88  6
>>  1  1  17728 3529584  26532 220256    0    0  1452     0 16376 1104  0  6 87  6
>>  1  2  17728 3545232  27848 199176    0    0  1312    52 16392  911  0  7 85  8
>>  1  2  17728 3659060  29576  84420    0    0  1736    40 16570  959  0  7 81 12
>> 38  3  17728 3640652  29984  69976    0    0   688     0 16885 2987  3  8 80  9
>>  5  2  17728 3601716  30208  75628    0    0  4676     4 18080 5727 11 10 66 12
>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>> 103 27  17728 2286372  30376  78952    0    0  3044     8 17772 6803 49 16 34  1
>> 128  1  17728 1337588  30416  79952    0    0   732  4080 16389 4874 91  9  0  0
>> 122  7  17728 730264  30472  81056    0    0   540  1300 16535 5451 91  9  0  0
>> 99 16  17728 996308  30544  83136    0    0   492   452 16951 6629 92  8  0  0
>> 89 23  17728 1150640  30592  88288    0    0  3232   224 17286 7312 91  9  0  0
>> 114  7  17728 1344768  30660  92104    0    0  1668   228 17395 7297 89 11  0  0
>> 99  3  17728 848716  30696  93684    0    0   688  2072 16947 6368 92  8  0  0
>> 112  9  17728 609908  30748  96036    0    0   620   272 17221 7640 90 10  0  0
>> 111  8  17728 480244  30808  98268    0    0   788   320 17227 7391 92  8  0  0
>> 115  7  17728 549564  30852 100552    0    0   656   232 17583 7807 92  9  0  0
>> 107  9  17728 666776  30888 102904    0    0   716     0 17406 7781 91  9  0  0
>> 124  5  17728 685368  30960 105544    0    0  1056   944 17281 7713 90 10  0  0
>> 130  1  17728 538832  31000 108080    0    0   776     0 16943 7347 91  9  0  0
>> 130  0  17728 364476  31032 110252    0    0   676     0 16767 6948 91  9  0  0
>> 129  0  17728 149332  31064 111848    0    0   540    32 16673 6272 92  8  0  0
>> 129  0  17728 274664  31096 114052    0    0   628     0 17207 7694 92  8  0  0
>> 128  3  17728 589736  31160 117420    0    0   816   996 17381 8443 90 10  0  0
>> 126  5  17728 485300  31172 119544    0    0   416     0 17024 7186 91  9  0  0
>> 130  0  17728 349500  31216 122344    0    0   492     0 17046 7358 91  9  0  0
>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>> 130  2  17728 416972  31248 125404    0    0   496   120 17112 7124 91  9  0  0
>> 125  5  17308 188608  29444 106888    0  576  1436   612 40020 9430 91  8  0  0
>> 113 16  17308 218700  29528 110336   32    0  1908     0 17210 7214 92  8  0  0
>>  1 145  20292  15688  26884 108200   40 3020   188  4660 27003 3664 30  7  0 63
>>  1 145  21128  15920  24212 107420    0  836     0  3824 16813  430  1  6  0 93
>>  2 144  22904  16020  20780 106780    0 1776     0  6548 16611  505  1  6  0 93
>>  1 146  23496  15788  17476 106160   32  596    60  3620 16610  308  1  6  0 93
>>  1 147  23924  16216  16028 105852   32  432    32  5012 16477  156  0  6  0 93
>>  1 145  24428  15904  14744 103452   20  504    20  3112 16776  125  1  6  0 93
>>  1 146  25304  16184  14688  97712    0  876    16  3352 16759  447  2  6  0 92
>>  1 147  26984  15908  14588  88348   96 1680    96  6352 17006  235  1  6  0 93
>>  1 146  28724  16112  14152  77132   32 1740    44  3536 16739  375  2  6  0 92
>>  1 151  29900  15896  12072  68484  156 1184   192  2068 16860  576  2  6  0 91
>>  2 152  33724  33908   9536  58616  184 3856   512  6764 16536  492  2  6  3 88
>>  1 142  33276 427352   8964  58988 1096  120  2624   120 16730 1129  6  7  8 79
>>  2 142  33000 421512   8988  60944 1560    0  3512     0 16771 1220  1  6  9 84
>>  2 143  32604 392952   9012  62308 1176    0  2436     0 16690 1173  2  7 10 82
>>  8 134  32400 255348   9044  64696  688    0  2584     0 17105 2181 16  8 14 62
>>  6 136  31796 142068   9092  66024 1060    0  1828     0 17040 2226 37 10 12 41
>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>>  2 143  31664  15844   9152  67452  580   64  1324   292 16973 2066 37 10 11 43
>>  4 141  31876  56160   9052  67528   48  328   140  1724 16476  696  6  7  0 87
>>  4 141  32420 176260   8896  68808  108  732   760  2280 17449 3081 24  9  0 67
>> 11 134  32540 119868   8484  70568  108  852  1140  1408 17436 3788 45 12  1 43
>> 17 129  32880  57044   8256  73008    0  364  1212   364 17489 4000 59 13  0 28
>> 11 135  33468 107128   7660  73124  200 1044   888  2076 17043 1956 23  9  0 69
>>  1 144  34788  16076   6948  71572  180 1524   276  1908 16787  967 13  7  0 79
>>  1 145  35472  16188   5868  70348  112  768   120  1284 16696  561  1  6  0 93
>>  1 145  36056  16696   5492  68240   16  596    16  3356 16456  202  0  6  0 93
>>  1 143  38200  15952   3168  63968   32 2168    52  6460 16834  423  1  7  0 92
>>  9 131  40128 139084   3064  61060  172 2144   644  2192 17701 2250 19  9  0 72
>>  9 133  40548 110308   3092  60492  468  620   900  1852 17516 1983 35  9  0 55
>> 10 132  40448  79476   3132  61808 1020    0  1480     0 17505 3254 35 10  0 55
>> 12 132  40532 139396   3156  63204  776  260  1272   892 17457 3179 44 11  0 45
>> 11 132  40392  66336   3256  65264  788    0  1536     0 17551 3860 46 11  0 43
>>  1 142  41112  15796   3296  65680 1176  812  1636  2568 17026 1798 28  9  0 63
>>  1 140  41500  15960   3244  64828   92  472   116  4008 16445  443  4  7  0 90
>>  1 140  42252  16740   3232  64356    0  764     0  1500 16403  185  0  6  0 94
>>  1 139  49636  16024   2928  60652   52 7376    52  7376 17507 1236  0  7  0 93
>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>>  2 140  55780  16444   2548  55948  176 6200   332  6260 17160  592  1  7  0 92
>>  3 145  59800 358088   2404  55468  100 4108  1092  4132 18514 3864 17  8  0 75
>>  3 143  60712  27028   2416  57184  816  964  2392  1288 18089 3476 43 10  0 47
>>  4 141  61296 154136   2516  58024  424  980  1312   980 17298 2489 28  9  0 62
>> 21 122  62544  83120   2528  58372  100 1456   788  1456 17717 2738 64 12  0 24
>> 24 120  62780  53328   2580  62216   16  292  2528   292 17163 4076 85 14  0  1
>>  1 143  65088  16096   2492  61524  152 2708   764  2712 16734 1474 16  8  0 76
>>  3 141  65672  34232   2476  60536   56  672   240  3208 16726  661  4  7  0 89
>>  1 144  65584  16044   2488  60440  808   68   948  1532 17187 1353 10  8  0 82
>>  4 141  70836  17216   2444  58024   64 5272    64  6968 16957  437  0  6  0 93
>>  6 134  73728  31940   2424  56880  436 3092   748  3188 16950 1269  8  7  0 85
>>  2 139  76036 107996   2408  56404   92 2420   476  2784 16869  690  6  7  0 87
>>  6 135  76112  82792   2436  57884 1108  476  1632   724 16999 1711 18  8  0 73
>>  1 139  77184  17872   2444  57860  996 1084  1168  2320 16644  748 11  8  0 81
>>  1 141  91136  15952   2300  51868  100 14088   128 14152 17494 1284  1  7  5 87
>>  1 143  98356 204144   2256  48168  640 7496  1148  7580 17471 1840  6  7 12 74
>>  3 139  97344 174272   2276  48968 2636    0  3216     0 16962 1499 13  8 11 69
>>  9 133  97220 123464   2352  50584 1348    0  2320   500 17100 2255 27  9  8 56
>>  9 134  97092  33672   2396  51780 1292  108  2028   108 16821 1547 27  8  8 57
>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>> 11 134  95068  75744   2448  53444  852    0  1696     0 17318 2630 34 10  2 54
>>  1 143  95104  15972   2504  54544  116   44   696    44 16545 1209 20  8  5 67
>> ^C
>>
>>
>>
>>
>
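
For context on the change suspected above: with shmem's swap entries stored
directly in the pagecache radix tree, lookups can now return non-page
"exceptional" entries. The encoding looks roughly like this (paraphrased
from the 3.1-era <linux/radix-tree.h>, not verbatim):

/* Slot values are pointers with at least 4-byte alignment, so the low
 * bits are free for tagging.  Bit 1 set marks an "exceptional" entry,
 * e.g. a swp_entry_t stashed where a struct page * would normally be. */
#define RADIX_TREE_EXCEPTIONAL_ENTRY    2

static inline int radix_tree_exceptional_entry(void *arg)
{
        return (unsigned long)arg & RADIX_TREE_EXCEPTIONAL_ENTRY;
}

Callers such as find_get_pages() now have to recognize and skip these
entries instead of treating every non-NULL slot as a page pointer.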


* Re: [BUG] infinite loop in find_get_pages()
       [not found] ` <CA+55aFx41_Z4TjjJwPuE21Q8oD3aGWtQwh45DUiCjPVD-wCJXw@mail.gmail.com>
@ 2011-09-14  6:48   ` Linus Torvalds
  0 siblings, 0 replies; 24+ messages in thread
From: Linus Torvalds @ 2011-09-14  6:48 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel

Again, sorry for the possible duplicates - but it looks like my email
hasn't been going out.

                      Linus

On Tue, Sep 13, 2011 at 6:26 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Sep 13, 2011 at 12:23 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>> while :
>> do
>>  echo 3 >/proc/sys/vm/drop_caches
>>  sleep 20
>> done
>
> Btw, do you actually have problems without this?
>
> The drop_caches thing could potentially result in a livelock, where
> we're dropping stuff as we are reading it in, and the reader just
> never makes progress (because dropping things is always faster than
> reading).
>
> So it may not be a "true lockup", it may just be really *really* slow,
> and wasting tons of CPU.
>
> It is possible that we should look at modifying the drop_caches code
> so that it always makes forward progress..
>
>                            Linus
>


* Re: [BUG] infinite loop in find_get_pages()
  2011-09-14  6:48   ` Linus Torvalds
@ 2011-09-14  6:53     ` Eric Dumazet
  2011-09-14  7:32       ` Shaohua Li
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2011-09-14  6:53 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Hugh Dickins, Andrew Morton, linux-kernel, Rik van Riel

On Tuesday 13 September 2011 at 23:48 -0700, Linus Torvalds wrote:
> Re-sending, because apparently none of my emails in the last few days
> have actually gone out due to LF problems..
> 
>                        Linus
> 
> On Tue, Sep 13, 2011 at 12:48 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Tue, Sep 13, 2011 at 12:23 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >>
> >> It seems current kernels (3.1.0-rc6) are really unreliable, or maybe I
> >> expect too much from them.
> >
> > No, by now, they should be damn well reliable.
> >
> >> On my 4GB x86_64 machine (2 quad-core cpus, 2 threads per core), I can
> >> have a cpu locked in
> >>
> >>  find_get_pages -> radix_tree_gang_lookup_slot -> __lookup
> >
> > Hmm. There haven't been many changes in this area, so the few changes
> > that *do* exist are obviously very suspicious.
> >
> > In particular, the only real changes to that whole setup are the changes
> > by Hugh to make the swap entries use the radix tree. So I'm bringing
> > Hugh and Andrew to the discussion (and Rik, since he acked a few of
> > those changes).
> >
> > The fact that some light swapping activity seems to accompany the
> > problem just makes me more certain it's Hugh's swap/radix tree work.
> >
> > We're talking only a handful of patches, so maybe Hugh could create a
> > revert patch just to confirm that yes, that's the problem.
> >
> > Hugh?
> >
> >                      Linus
> >
> > --- quoting the rest of the email for Hugh/Andrew ---
> >> Problem is : A bisection will be very hard, since a lot of kernels
> >> simply destroy my disk (the PCI MRRS horror stuff).
> >>
> >> Messages at console :
> >>
> >> INFO: rcu_preempt_state detected stalls on CPUs/tasks: {} (detected by
> >> 11 t=60002 jiffies)
> >>
> >> perf top -C 1
> >>
> >> Events: 3K cycles
> >> +     43,08%  bash  [kernel.kallsyms]  [k] __lookup
> >> +     41,51%  bash  [kernel.kallsyms]  [k] find_get_pages
> >> +     15,31%  bash  [kernel.kallsyms]  [k] radix_tree_gang_lookup_slot
> >>
> >>    43.08%     bash  [kernel.kallsyms]  [k] __lookup
> >>               |
> >>               --- __lookup
> >>                  |
> >>                  |--97.09%-- radix_tree_gang_lookup_slot
> >>                  |          find_get_pages
> >>                  |          pagevec_lookup
> >>                  |          invalidate_mapping_pages
> >>                  |          drop_pagecache_sb
> >>                  |          iterate_supers
> >>                  |          drop_caches_sysctl_handler
> >>                  |          proc_sys_call_handler.isra.3
> >>                  |          proc_sys_write
> >>                  |          vfs_write
> >>                  |          sys_write
> >>                  |          system_call_fastpath
> >>                  |          __write
> >>                  |
> >>
> >>
> >> Steps to reproduce :
> >>
> >> In one terminal, kernel builds in a loop (defconfig + hpsa driver)
> >>
> >> cd /usr/src/linux
> >> while :
> >> do
> >>  make clean
> >>  make -j128
> >> done
> >>
> >>
> >> In another term :
> >>
> >> while :
> >> do
> >>  echo 3 >/proc/sys/vm/drop_caches
> >>  sleep 20
> >> done
> >>
> >>
> >> Before the lockup, I can see some swapping activity in another terminal.
> >>
> >> $ vmstat 1
> >> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> >>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
> >>  2  2  17728 3443924  11520 328020    0    0   256 12076 16250  554  0  6 82 12
> >>  1  1  17728 3444776  11584 328072    0    0   100  2868 16223  267  0  6 86  7
> >>  1  1  17728 3442200  12100 328348    0    0   868     0 16600 1778  0  7 88  6
> >>  1  1  17728 3438032  13036 329048    0    0  1628     0 16862 2480  0  7 87  5
> >>  1  1  17728 3546864  13988 220256    0    0  1000     0 16313  931  0  7 87  6
> >>  1  1  17728 3544260  16024 220256    0    0  2036     0 16513 1531  0  6 88  6
> >>  1  1  17728 3542896  17196 220256    0    0  1160   556 16324  893  0  6 88  6
> >>  1  1  17728 3540748  18756 220256    0    0  1560     0 16398 1172  0  6 88  6
> >>  1  1  17728 3538692  20168 220256    0    0  1412     0 16544 1088  0  6 88  6
> >>  2  0  17728 3536676  21816 220248    0    0  1648     0 16447 1246  0  6 88  6
> >>  1  1  17728 3535052  22544 220256    0    0   728     0 16215  605  1  6 87  5
> >>  1  1  17728 3533672  23404 220244    0    0   860  4240 16264  705  0  6 88  6
> >>  1  1  17728 3532688  24232 220244    0    0   828     0 16272  685  0  6 87  6
> >>  1  1  17728 3531552  25080 220244    0    0   848     0 16294  700  0  6 88  6
> >>  1  1  17728 3529584  26532 220256    0    0  1452     0 16376 1104  0  6 87  6
> >>  1  2  17728 3545232  27848 199176    0    0  1312    52 16392  911  0  7 85  8
> >>  1  2  17728 3659060  29576  84420    0    0  1736    40 16570  959  0  7 81 12
> >> 38  3  17728 3640652  29984  69976    0    0   688     0 16885 2987  3  8 80  9
> >>  5  2  17728 3601716  30208  75628    0    0  4676     4 18080 5727 11 10 66 12
> >> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> >>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
> >> 103 27  17728 2286372  30376  78952    0    0  3044     8 17772 6803 49 16 34  1
> >> 128  1  17728 1337588  30416  79952    0    0   732  4080 16389 4874 91  9  0  0
> >> 122  7  17728 730264  30472  81056    0    0   540  1300 16535 5451 91  9  0  0
> >> 99 16  17728 996308  30544  83136    0    0   492   452 16951 6629 92  8  0  0
> >> 89 23  17728 1150640  30592  88288    0    0  3232   224 17286 7312 91  9  0  0
> >> 114  7  17728 1344768  30660  92104    0    0  1668   228 17395 7297 89 11  0  0
> >> 99  3  17728 848716  30696  93684    0    0   688  2072 16947 6368 92  8  0  0
> >> 112  9  17728 609908  30748  96036    0    0   620   272 17221 7640 90 10  0  0
> >> 111  8  17728 480244  30808  98268    0    0   788   320 17227 7391 92  8  0  0
> >> 115  7  17728 549564  30852 100552    0    0   656   232 17583 7807 92  9  0  0
> >> 107  9  17728 666776  30888 102904    0    0   716     0 17406 7781 91  9  0  0
> >> 124  5  17728 685368  30960 105544    0    0  1056   944 17281 7713 90 10  0  0
> >> 130  1  17728 538832  31000 108080    0    0   776     0 16943 7347 91  9  0  0
> >> 130  0  17728 364476  31032 110252    0    0   676     0 16767 6948 91  9  0  0
> >> 129  0  17728 149332  31064 111848    0    0   540    32 16673 6272 92  8  0  0
> >> 129  0  17728 274664  31096 114052    0    0   628     0 17207 7694 92  8  0  0
> >> 128  3  17728 589736  31160 117420    0    0   816   996 17381 8443 90 10  0  0
> >> 126  5  17728 485300  31172 119544    0    0   416     0 17024 7186 91  9  0  0
> >> 130  0  17728 349500  31216 122344    0    0   492     0 17046 7358 91  9  0  0
> >> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> >>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
> >> 130  2  17728 416972  31248 125404    0    0   496   120 17112 7124 91  9  0  0
> >> 125  5  17308 188608  29444 106888    0  576  1436   612 40020 9430 91  8  0  0
> >> 113 16  17308 218700  29528 110336   32    0  1908     0 17210 7214 92  8  0  0
> >>  1 145  20292  15688  26884 108200   40 3020   188  4660 27003 3664 30  7  0 63
> >>  1 145  21128  15920  24212 107420    0  836     0  3824 16813  430  1  6  0 93
> >>  2 144  22904  16020  20780 106780    0 1776     0  6548 16611  505  1  6  0 93
> >>  1 146  23496  15788  17476 106160   32  596    60  3620 16610  308  1  6  0 93
> >>  1 147  23924  16216  16028 105852   32  432    32  5012 16477  156  0  6  0 93
> >>  1 145  24428  15904  14744 103452   20  504    20  3112 16776  125  1  6  0 93
> >>  1 146  25304  16184  14688  97712    0  876    16  3352 16759  447  2  6  0 92
> >>  1 147  26984  15908  14588  88348   96 1680    96  6352 17006  235  1  6  0 93
> >>  1 146  28724  16112  14152  77132   32 1740    44  3536 16739  375  2  6  0 92
> >>  1 151  29900  15896  12072  68484  156 1184   192  2068 16860  576  2  6  0 91
> >>  2 152  33724  33908   9536  58616  184 3856   512  6764 16536  492  2  6  3 88
> >>  1 142  33276 427352   8964  58988 1096  120  2624   120 16730 1129  6  7  8 79
> >>  2 142  33000 421512   8988  60944 1560    0  3512     0 16771 1220  1  6  9 84
> >>  2 143  32604 392952   9012  62308 1176    0  2436     0 16690 1173  2  7 10 82
> >>  8 134  32400 255348   9044  64696  688    0  2584     0 17105 2181 16  8 14 62
> >>  6 136  31796 142068   9092  66024 1060    0  1828     0 17040 2226 37 10 12 41
> >> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> >>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
> >>  2 143  31664  15844   9152  67452  580   64  1324   292 16973 2066 37 10 11 43
> >>  4 141  31876  56160   9052  67528   48  328   140  1724 16476  696  6  7  0 87
> >>  4 141  32420 176260   8896  68808  108  732   760  2280 17449 3081 24  9  0 67
> >> 11 134  32540 119868   8484  70568  108  852  1140  1408 17436 3788 45 12  1 43
> >> 17 129  32880  57044   8256  73008    0  364  1212   364 17489 4000 59 13  0 28
> >> 11 135  33468 107128   7660  73124  200 1044   888  2076 17043 1956 23  9  0 69
> >>  1 144  34788  16076   6948  71572  180 1524   276  1908 16787  967 13  7  0 79
> >>  1 145  35472  16188   5868  70348  112  768   120  1284 16696  561  1  6  0 93
> >>  1 145  36056  16696   5492  68240   16  596    16  3356 16456  202  0  6  0 93
> >>  1 143  38200  15952   3168  63968   32 2168    52  6460 16834  423  1  7  0 92
> >>  9 131  40128 139084   3064  61060  172 2144   644  2192 17701 2250 19  9  0 72
> >>  9 133  40548 110308   3092  60492  468  620   900  1852 17516 1983 35  9  0 55
> >> 10 132  40448  79476   3132  61808 1020    0  1480     0 17505 3254 35 10  0 55
> >> 12 132  40532 139396   3156  63204  776  260  1272   892 17457 3179 44 11  0 45
> >> 11 132  40392  66336   3256  65264  788    0  1536     0 17551 3860 46 11  0 43
> >>  1 142  41112  15796   3296  65680 1176  812  1636  2568 17026 1798 28  9  0 63
> >>  1 140  41500  15960   3244  64828   92  472   116  4008 16445  443  4  7  0 90
> >>  1 140  42252  16740   3232  64356    0  764     0  1500 16403  185  0  6  0 94
> >>  1 139  49636  16024   2928  60652   52 7376    52  7376 17507 1236  0  7  0 93
> >> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> >>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
> >>  2 140  55780  16444   2548  55948  176 6200   332  6260 17160  592  1  7  0 92
> >>  3 145  59800 358088   2404  55468  100 4108  1092  4132 18514 3864 17  8  0 75
> >>  3 143  60712  27028   2416  57184  816  964  2392  1288 18089 3476 43 10  0 47
> >>  4 141  61296 154136   2516  58024  424  980  1312   980 17298 2489 28  9  0 62
> >> 21 122  62544  83120   2528  58372  100 1456   788  1456 17717 2738 64 12  0 24
> >> 24 120  62780  53328   2580  62216   16  292  2528   292 17163 4076 85 14  0  1
> >>  1 143  65088  16096   2492  61524  152 2708   764  2712 16734 1474 16  8  0 76
> >>  3 141  65672  34232   2476  60536   56  672   240  3208 16726  661  4  7  0 89
> >>  1 144  65584  16044   2488  60440  808   68   948  1532 17187 1353 10  8  0 82
> >>  4 141  70836  17216   2444  58024   64 5272    64  6968 16957  437  0  6  0 93
> >>  6 134  73728  31940   2424  56880  436 3092   748  3188 16950 1269  8  7  0 85
> >>  2 139  76036 107996   2408  56404   92 2420   476  2784 16869  690  6  7  0 87
> >>  6 135  76112  82792   2436  57884 1108  476  1632   724 16999 1711 18  8  0 73
> >>  1 139  77184  17872   2444  57860  996 1084  1168  2320 16644  748 11  8  0 81
> >>  1 141  91136  15952   2300  51868  100 14088   128 14152 17494 1284  1  7  5 87
> >>  1 143  98356 204144   2256  48168  640 7496  1148  7580 17471 1840  6  7 12 74
> >>  3 139  97344 174272   2276  48968 2636    0  3216     0 16962 1499 13  8 11 69
> >>  9 133  97220 123464   2352  50584 1348    0  2320   500 17100 2255 27  9  8 56
> >>  9 134  97092  33672   2396  51780 1292  108  2028   108 16821 1547 27  8  8 57
> >> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> >>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
> >> 11 134  95068  75744   2448  53444  852    0  1696     0 17318 2630 34 10  2 54
> >>  1 143  95104  15972   2504  54544  116   44   696    44 16545 1209 20  8  5 67
> >> ^C
> >>
> >>
> >>
> >>
> >

It appears the bisection was not so horrific (I was out of the PCI MRRS bug
window); it will complete shortly:

# git bisect bad
Bisecting: 33 revisions left to test after this (roughly 5 steps)
[c299eba3c5a801657f275d33be588b34831cd30e] Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6
# git bisect log
git bisect start
# bad: [003f6c9df54970d8b19578d195b3e2b398cdbde2] lib/sha1.c: quiet sparse noise about symbol not declared
git bisect bad 003f6c9df54970d8b19578d195b3e2b398cdbde2
# good: [02f8c6aee8df3cdc935e9bdd4f2d020306035dbe] Linux 3.0
git bisect good 02f8c6aee8df3cdc935e9bdd4f2d020306035dbe
# good: [d5ef642355bdd9b383ff5c18cbc6102a06eecbaf] Merge branch 'tty-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6
git bisect good d5ef642355bdd9b383ff5c18cbc6102a06eecbaf
# good: [664a41b8a91bf78a01a751e15175e0008977685a] Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6
git bisect good 664a41b8a91bf78a01a751e15175e0008977685a
# bad: [585df1d90cb07a02ca6c7a7d339e56e46d50dafb] xhci: Remove TDs from TD lists when URBs are canceled.
git bisect bad 585df1d90cb07a02ca6c7a7d339e56e46d50dafb
# good: [60ad4466821a96913a9b567115e194ed1087c2d7] Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
git bisect good 60ad4466821a96913a9b567115e194ed1087c2d7
# bad: [7f3bf7cd348cead84f8027b32aa30ea49fa64df5] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx
git bisect bad 7f3bf7cd348cead84f8027b32aa30ea49fa64df5
# good: [9e8ed3ae924b65ab5f088fe63ee6f4326f04590f] [S390] signal: use set_restore_sigmask() helper
git bisect good 9e8ed3ae924b65ab5f088fe63ee6f4326f04590f
# bad: [31475dd611209413bace21651a400afb91d0bd9d] mm: a few small updates for radix-swap
git bisect bad 31475dd611209413bace21651a400afb91d0bd9d
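
The commit implicated by the bisection can be inspected with, e.g.:

git show --stat 31475dd611209413bace21651a400afb91d0bd9d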





* Re: [BUG] infinite loop in find_get_pages()
  2011-09-14  6:53     ` Eric Dumazet
@ 2011-09-14  7:32       ` Shaohua Li
  2011-09-14  8:20         ` Shaohua Li
  0 siblings, 1 reply; 24+ messages in thread
From: Shaohua Li @ 2011-09-14  7:32 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, linux-kernel, Rik van Riel

[-- Attachment #1: Type: text/plain, Size: 15399 bytes --]

It appears we didn't account for skipped swap entries in find_get_pages().
Does the attached patch help?
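
The attachment itself is not inlined in this archive view. As a rough
userspace model of the idea (hypothetical code, not the actual patch): if
every slot a lookup returns is a skipped swap entry, the "empty batch means
retry" heuristic never terminates unless the skips are counted.

#include <stdio.h>
#include <stdint.h>

#define NSLOTS 4
#define EXCEPTIONAL 2UL         /* bit 1 set => swap entry, not a page */

static void *slots[NSLOTS];

static unsigned scan(int account_skips)
{
        unsigned ret, nr_found, nr_skip, restarts = 0;
restart:
        ret = nr_skip = 0;
        nr_found = NSLOTS;      /* pretend the gang lookup filled the batch */
        for (unsigned i = 0; i < nr_found; i++) {
                if ((uintptr_t)slots[i] & EXCEPTIONAL) {
                        nr_skip++;      /* swap entry: skip it */
                        continue;
                }
                ret++;          /* a page: would take a reference here */
        }
        /*
         * Restart heuristic: an empty batch from a non-empty lookup means
         * pages raced away, so retry.  With only swap entries in the range
         * this spins forever, unless the skips are accounted for.
         */
        if (!ret && nr_found > (account_skips ? nr_skip : 0)) {
                if (++restarts >= 10)   /* cap so the demo terminates */
                        return restarts;
                goto restart;
        }
        return restarts;
}

int main(void)
{
        for (unsigned i = 0; i < NSLOTS; i++)   /* only swap entries */
                slots[i] = (void *)(uintptr_t)((i << 2) | EXCEPTIONAL);
        printf("without accounting: %u restarts (would spin)\n", scan(0));
        printf("with accounting:    %u restarts\n", scan(1));
        return 0;
}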

Thanks,
Shaohua
2011/9/14 Eric Dumazet <eric.dumazet@gmail.com>:
> On Tuesday 13 September 2011 at 23:48 -0700, Linus Torvalds wrote:
>> Re-sending, because apparently none of my emails in the last few days
>> have actually gone out due to LF problems..
>>
>>                        Linus
>>
>> On Tue, Sep 13, 2011 at 12:48 PM, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>> > On Tue, Sep 13, 2011 at 12:23 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> >>
>> >> It seems current kernels (3.1.0-rc6) are really unreliable, or maybe I
>> >> expect too much from them.
>> >
>> > No, by now, they should be damn well reliable.
>> >
>> >> On my 4GB x86_64 machine (2 quad-core cpus, 2 threads per core), I can
>> >> have a cpu locked in
>> >>
>> >>  find_get_pages -> radix_tree_gang_lookup_slot -> __lookup
>> >
>> > Hmm. There haven't been many changes in this area, so the few changes
>> > that *do* exist are obviously very suspicious.
>> >
>> > In particular, the only real changes to that whole setup are the changes
>> > by Hugh to make the swap entries use the radix tree. So I'm bringing
>> > Hugh and Andrew to the discussion (and Rik, since he acked a few of
>> > those changes).
>> >
>> > The fact that some light swapping activity seems to accompany the
>> > problem just makes me more certain it's Hugh's swap/radix tree work.
>> >
>> > We're talking only a handful of patches, so maybe Hugh could create a
>> > revert patch just to confirm that yes, that's the problem.
>> >
>> > Hugh?
>> >
>> >                      Linus
>> >
>> > --- quoting the rest of the email for Hugh/Andrew ---
>> >> Problem is : A bisection will be very hard, since a lot of kernels
>> >> simply destroy my disk (the PCI MRRS horror stuff).
>> >>
>> >> Messages at console :
>> >>
>> >> INFO: rcu_preempt_state detected stalls on CPUs/tasks: {} (detected by
>> >> 11 t=60002 jiffies)
>> >>
>> >> perf top -C 1
>> >>
>> >> Events: 3K cycles
>> >> +     43,08%  bash  [kernel.kallsyms]  [k] __lookup
>> >> +     41,51%  bash  [kernel.kallsyms]  [k] find_get_pages
>> >> +     15,31%  bash  [kernel.kallsyms]  [k] radix_tree_gang_lookup_slot
>> >>
>> >>    43.08%     bash  [kernel.kallsyms]  [k] __lookup
>> >>               |
>> >>               --- __lookup
>> >>                  |
>> >>                  |--97.09%-- radix_tree_gang_lookup_slot
>> >>                  |          find_get_pages
>> >>                  |          pagevec_lookup
>> >>                  |          invalidate_mapping_pages
>> >>                  |          drop_pagecache_sb
>> >>                  |          iterate_supers
>> >>                  |          drop_caches_sysctl_handler
>> >>                  |          proc_sys_call_handler.isra.3
>> >>                  |          proc_sys_write
>> >>                  |          vfs_write
>> >>                  |          sys_write
>> >>                  |          system_call_fastpath
>> >>                  |          __write
>> >>                  |
>> >>
>> >>
>> >> Steps to reproduce :
>> >>
>> >> In one terminal, kernel builds in a loop (defconfig + hpsa driver)
>> >>
>> >> cd /usr/src/linux
>> >> while :
>> >> do
>> >>  make clean
>> >>  make -j128
>> >> done
>> >>
>> >>
>> >> In another term :
>> >>
>> >> while :
>> >> do
>> >>  echo 3 >/proc/sys/vm/drop_caches
>> >>  sleep 20
>> >> done
>> >>
>> >>
>> >> Before the lockup, I can see some swapping activity in another terminal.
>> >>
>> >> $ vmstat 1
>> >> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>> >>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>> >>  2  2  17728 3443924  11520 328020    0    0   256 12076 16250  554  0  6 82 12
>> >>  1  1  17728 3444776  11584 328072    0    0   100  2868 16223  267  0  6 86  7
>> >>  1  1  17728 3442200  12100 328348    0    0   868     0 16600 1778  0  7 88  6
>> >>  1  1  17728 3438032  13036 329048    0    0  1628     0 16862 2480  0  7 87  5
>> >>  1  1  17728 3546864  13988 220256    0    0  1000     0 16313  931  0  7 87  6
>> >>  1  1  17728 3544260  16024 220256    0    0  2036     0 16513 1531  0  6 88  6
>> >>  1  1  17728 3542896  17196 220256    0    0  1160   556 16324  893  0  6 88  6
>> >>  1  1  17728 3540748  18756 220256    0    0  1560     0 16398 1172  0  6 88  6
>> >>  1  1  17728 3538692  20168 220256    0    0  1412     0 16544 1088  0  6 88  6
>> >>  2  0  17728 3536676  21816 220248    0    0  1648     0 16447 1246  0  6 88  6
>> >>  1  1  17728 3535052  22544 220256    0    0   728     0 16215  605  1  6 87  5
>> >>  1  1  17728 3533672  23404 220244    0    0   860  4240 16264  705  0  6 88  6
>> >>  1  1  17728 3532688  24232 220244    0    0   828     0 16272  685  0  6 87  6
>> >>  1  1  17728 3531552  25080 220244    0    0   848     0 16294  700  0  6 88  6
>> >>  1  1  17728 3529584  26532 220256    0    0  1452     0 16376 1104  0  6 87  6
>> >>  1  2  17728 3545232  27848 199176    0    0  1312    52 16392  911  0  7 85  8
>> >>  1  2  17728 3659060  29576  84420    0    0  1736    40 16570  959  0  7 81 12
>> >> 38  3  17728 3640652  29984  69976    0    0   688     0 16885 2987  3  8 80  9
>> >>  5  2  17728 3601716  30208  75628    0    0  4676     4 18080 5727 11 10 66 12
>> >> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>> >>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>> >> 103 27  17728 2286372  30376  78952    0    0  3044     8 17772 6803 49 16 34  1
>> >> 128  1  17728 1337588  30416  79952    0    0   732  4080 16389 4874 91  9  0  0
>> >> 122  7  17728 730264  30472  81056    0    0   540  1300 16535 5451 91  9  0  0
>> >> 99 16  17728 996308  30544  83136    0    0   492   452 16951 6629 92  8  0  0
>> >> 89 23  17728 1150640  30592  88288    0    0  3232   224 17286 7312 91  9  0  0
>> >> 114  7  17728 1344768  30660  92104    0    0  1668   228 17395 7297 89 11  0  0
>> >> 99  3  17728 848716  30696  93684    0    0   688  2072 16947 6368 92  8  0  0
>> >> 112  9  17728 609908  30748  96036    0    0   620   272 17221 7640 90 10  0  0
>> >> 111  8  17728 480244  30808  98268    0    0   788   320 17227 7391 92  8  0  0
>> >> 115  7  17728 549564  30852 100552    0    0   656   232 17583 7807 92  9  0  0
>> >> 107  9  17728 666776  30888 102904    0    0   716     0 17406 7781 91  9  0  0
>> >> 124  5  17728 685368  30960 105544    0    0  1056   944 17281 7713 90 10  0  0
>> >> 130  1  17728 538832  31000 108080    0    0   776     0 16943 7347 91  9  0  0
>> >> 130  0  17728 364476  31032 110252    0    0   676     0 16767 6948 91  9  0  0
>> >> 129  0  17728 149332  31064 111848    0    0   540    32 16673 6272 92  8  0  0
>> >> 129  0  17728 274664  31096 114052    0    0   628     0 17207 7694 92  8  0  0
>> >> 128  3  17728 589736  31160 117420    0    0   816   996 17381 8443 90 10  0  0
>> >> 126  5  17728 485300  31172 119544    0    0   416     0 17024 7186 91  9  0  0
>> >> 130  0  17728 349500  31216 122344    0    0   492     0 17046 7358 91  9  0  0
>> >> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>> >>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>> >> 130  2  17728 416972  31248 125404    0    0   496   120 17112 7124 91  9  0  0
>> >> 125  5  17308 188608  29444 106888    0  576  1436   612 40020 9430 91  8  0  0
>> >> 113 16  17308 218700  29528 110336   32    0  1908     0 17210 7214 92  8  0  0
>> >>  1 145  20292  15688  26884 108200   40 3020   188  4660 27003 3664 30  7  0 63
>> >>  1 145  21128  15920  24212 107420    0  836     0  3824 16813  430  1  6  0 93
>> >>  2 144  22904  16020  20780 106780    0 1776     0  6548 16611  505  1  6  0 93
>> >>  1 146  23496  15788  17476 106160   32  596    60  3620 16610  308  1  6  0 93
>> >>  1 147  23924  16216  16028 105852   32  432    32  5012 16477  156  0  6  0 93
>> >>  1 145  24428  15904  14744 103452   20  504    20  3112 16776  125  1  6  0 93
>> >>  1 146  25304  16184  14688  97712    0  876    16  3352 16759  447  2  6  0 92
>> >>  1 147  26984  15908  14588  88348   96 1680    96  6352 17006  235  1  6  0 93
>> >>  1 146  28724  16112  14152  77132   32 1740    44  3536 16739  375  2  6  0 92
>> >>  1 151  29900  15896  12072  68484  156 1184   192  2068 16860  576  2  6  0 91
>> >>  2 152  33724  33908   9536  58616  184 3856   512  6764 16536  492  2  6  3 88
>> >>  1 142  33276 427352   8964  58988 1096  120  2624   120 16730 1129  6  7  8 79
>> >>  2 142  33000 421512   8988  60944 1560    0  3512     0 16771 1220  1  6  9 84
>> >>  2 143  32604 392952   9012  62308 1176    0  2436     0 16690 1173  2  7 10 82
>> >>  8 134  32400 255348   9044  64696  688    0  2584     0 17105 2181 16  8 14 62
>> >>  6 136  31796 142068   9092  66024 1060    0  1828     0 17040 2226 37 10 12 41
>> >> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>> >>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>> >>  2 143  31664  15844   9152  67452  580   64  1324   292 16973 2066 37 10 11 43
>> >>  4 141  31876  56160   9052  67528   48  328   140  1724 16476  696  6  7  0 87
>> >>  4 141  32420 176260   8896  68808  108  732   760  2280 17449 3081 24  9  0 67
>> >> 11 134  32540 119868   8484  70568  108  852  1140  1408 17436 3788 45 12  1 43
>> >> 17 129  32880  57044   8256  73008    0  364  1212   364 17489 4000 59 13  0 28
>> >> 11 135  33468 107128   7660  73124  200 1044   888  2076 17043 1956 23  9  0 69
>> >>  1 144  34788  16076   6948  71572  180 1524   276  1908 16787  967 13  7  0 79
>> >>  1 145  35472  16188   5868  70348  112  768   120  1284 16696  561  1  6  0 93
>> >>  1 145  36056  16696   5492  68240   16  596    16  3356 16456  202  0  6  0 93
>> >>  1 143  38200  15952   3168  63968   32 2168    52  6460 16834  423  1  7  0 92
>> >>  9 131  40128 139084   3064  61060  172 2144   644  2192 17701 2250 19  9  0 72
>> >>  9 133  40548 110308   3092  60492  468  620   900  1852 17516 1983 35  9  0 55
>> >> 10 132  40448  79476   3132  61808 1020    0  1480     0 17505 3254 35 10  0 55
>> >> 12 132  40532 139396   3156  63204  776  260  1272   892 17457 3179 44 11  0 45
>> >> 11 132  40392  66336   3256  65264  788    0  1536     0 17551 3860 46 11  0 43
>> >>  1 142  41112  15796   3296  65680 1176  812  1636  2568 17026 1798 28  9  0 63
>> >>  1 140  41500  15960   3244  64828   92  472   116  4008 16445  443  4  7  0 90
>> >>  1 140  42252  16740   3232  64356    0  764     0  1500 16403  185  0  6  0 94
>> >>  1 139  49636  16024   2928  60652   52 7376    52  7376 17507 1236  0  7  0 93
>> >> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>> >>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>> >>  2 140  55780  16444   2548  55948  176 6200   332  6260 17160  592  1  7  0 92
>> >>  3 145  59800 358088   2404  55468  100 4108  1092  4132 18514 3864 17  8  0 75
>> >>  3 143  60712  27028   2416  57184  816  964  2392  1288 18089 3476 43 10  0 47
>> >>  4 141  61296 154136   2516  58024  424  980  1312   980 17298 2489 28  9  0 62
>> >> 21 122  62544  83120   2528  58372  100 1456   788  1456 17717 2738 64 12  0 24
>> >> 24 120  62780  53328   2580  62216   16  292  2528   292 17163 4076 85 14  0  1
>> >>  1 143  65088  16096   2492  61524  152 2708   764  2712 16734 1474 16  8  0 76
>> >>  3 141  65672  34232   2476  60536   56  672   240  3208 16726  661  4  7  0 89
>> >>  1 144  65584  16044   2488  60440  808   68   948  1532 17187 1353 10  8  0 82
>> >>  4 141  70836  17216   2444  58024   64 5272    64  6968 16957  437  0  6  0 93
>> >>  6 134  73728  31940   2424  56880  436 3092   748  3188 16950 1269  8  7  0 85
>> >>  2 139  76036 107996   2408  56404   92 2420   476  2784 16869  690  6  7  0 87
>> >>  6 135  76112  82792   2436  57884 1108  476  1632   724 16999 1711 18  8  0 73
>> >>  1 139  77184  17872   2444  57860  996 1084  1168  2320 16644  748 11  8  0 81
>> >>  1 141  91136  15952   2300  51868  100 14088   128 14152 17494 1284  1  7  5 87
>> >>  1 143  98356 204144   2256  48168  640 7496  1148  7580 17471 1840  6  7 12 74
>> >>  3 139  97344 174272   2276  48968 2636    0  3216     0 16962 1499 13  8 11 69
>> >>  9 133  97220 123464   2352  50584 1348    0  2320   500 17100 2255 27  9  8 56
>> >>  9 134  97092  33672   2396  51780 1292  108  2028   108 16821 1547 27  8  8 57
>> >> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>> >>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>> >> 11 134  95068  75744   2448  53444  852    0  1696     0 17318 2630 34 10  2 54
>> >>  1 143  95104  15972   2504  54544  116   44   696    44 16545 1209 20  8  5 67
>> >> ^C
>> >>
>> >>
>> >>
>> >>
>> >
>
> It appears the bisection was not so horrific (I was out of the PCI/MRRS bug
> window); it will complete shortly:
>
> # git bisect bad
> Bisecting: 33 revisions left to test after this (roughly 5 steps)
> [c299eba3c5a801657f275d33be588b34831cd30e] Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6
> # git bisect log
> git bisect start
> # bad: [003f6c9df54970d8b19578d195b3e2b398cdbde2] lib/sha1.c: quiet sparse noise about symbol not declared
> git bisect bad 003f6c9df54970d8b19578d195b3e2b398cdbde2
> # good: [02f8c6aee8df3cdc935e9bdd4f2d020306035dbe] Linux 3.0
> git bisect good 02f8c6aee8df3cdc935e9bdd4f2d020306035dbe
> # good: [d5ef642355bdd9b383ff5c18cbc6102a06eecbaf] Merge branch 'tty-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6
> git bisect good d5ef642355bdd9b383ff5c18cbc6102a06eecbaf
> # good: [664a41b8a91bf78a01a751e15175e0008977685a] Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6
> git bisect good 664a41b8a91bf78a01a751e15175e0008977685a
> # bad: [585df1d90cb07a02ca6c7a7d339e56e46d50dafb] xhci: Remove TDs from TD lists when URBs are canceled.
> git bisect bad 585df1d90cb07a02ca6c7a7d339e56e46d50dafb
> # good: [60ad4466821a96913a9b567115e194ed1087c2d7] Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
> git bisect good 60ad4466821a96913a9b567115e194ed1087c2d7
> # bad: [7f3bf7cd348cead84f8027b32aa30ea49fa64df5] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx
> git bisect bad 7f3bf7cd348cead84f8027b32aa30ea49fa64df5
> # good: [9e8ed3ae924b65ab5f088fe63ee6f4326f04590f] [S390] signal: use set_restore_sigmask() helper
> git bisect good 9e8ed3ae924b65ab5f088fe63ee6f4326f04590f
> # bad: [31475dd611209413bace21651a400afb91d0bd9d] mm: a few small updates for radix-swap
> git bisect bad 31475dd611209413bace21651a400afb91d0bd9d
>
>
>

[-- Attachment #2: filemap-dbg.patch --]
[-- Type: text/x-patch, Size: 1023 bytes --]

diff --git a/mm/filemap.c b/mm/filemap.c
index 645a080..f177e96 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -827,13 +827,14 @@ unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
 {
 	unsigned int i;
 	unsigned int ret;
-	unsigned int nr_found;
+	unsigned int nr_found, nr_skip;
 
 	rcu_read_lock();
 restart:
 	nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree,
 				(void ***)pages, NULL, start, nr_pages);
 	ret = 0;
+	nr_skip = 0;
 	for (i = 0; i < nr_found; i++) {
 		struct page *page;
 repeat:
@@ -856,6 +857,7 @@ repeat:
 			 * here as an exceptional entry: so skip over it -
 			 * we only reach this from invalidate_mapping_pages().
 			 */
+			nr_skip++;
 			continue;
 		}
 
@@ -876,7 +878,7 @@ repeat:
 	 * If all entries were removed before we could secure them,
 	 * try again, because callers stop trying once 0 is returned.
 	 */
-	if (unlikely(!ret && nr_found))
+	if (unlikely(!ret && nr_found && (nr_found != nr_skip)))
 		goto restart;
 	rcu_read_unlock();
 	return ret;
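
A user-space sketch of the failure mode this patch addresses (an
illustration for the reader, not code from the thread): when a lookup
batch consists entirely of shmem swap entries, ret stays 0 while
nr_found is nonzero, so the unpatched check restarts the identical
lookup forever; the nr_skip accounting breaks that cycle.

#include <stdio.h>

int main(void)
{
	/* values mirror the traces reported later in the thread */
	unsigned int nr_found = 14;	/* gang lookup returned 14 slots */
	unsigned int nr_skip = 14;	/* ...all of them shmem swap entries */
	unsigned int ret = 0;		/* so no real page was secured */

	if (!ret && nr_found)
		printf("old check: goto restart -- same entries again, forever\n");

	if (!ret && nr_found && (nr_found != nr_skip))
		printf("patched check: goto restart\n");
	else
		printf("patched check: return 0, the caller stops\n");
	return 0;
}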


* Re: [BUG] infinite loop in find_get_pages()
  2011-09-14  7:32       ` Shaohua Li
@ 2011-09-14  8:20         ` Shaohua Li
  2011-09-14  8:43           ` Eric Dumazet
  0 siblings, 1 reply; 24+ messages in thread
From: Shaohua Li @ 2011-09-14  8:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, linux-kernel, Rik van Riel

2011/9/14 Shaohua Li <shli@kernel.org>:
> it appears we didn't account for skipped swap entries in find_get_pages().
> does the attached patch help?
I can easily reproduce the issue: just cp files in tmpfs, trigger swap, and
drop caches. The debug patch fixes it on my side.
Eric, please try it.

Thanks,
Shaohua

> 2011/9/14 Eric Dumazet <eric.dumazet@gmail.com>:
>> On Tuesday, 13 September 2011 at 23:48 -0700, Linus Torvalds wrote:
>>> Re-sending, because apparently none of my email in the last few days
>>> have actually gone out due to LF problems..
>>>
>>>                        Linus
>>>
>>> On Tue, Sep 13, 2011 at 12:48 PM, Linus Torvalds
>>> <torvalds@linux-foundation.org> wrote:
>>> > On Tue, Sep 13, 2011 at 12:23 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> >>
>>> >> It seems current kernels (3.1.0-rc6) are really unreliable, or maybe I
>>> >> expect too much from them.
>>> >
>>> > No, by now, they should be damn well reliable.
>>> >
>>> >> On my 4GB x86_64 machine (2 quad-core cpus, 2 threads per core), I can
>>> >> have a cpu locked in
>>> >>
>>> >>  find_get_pages -> radix_tree_gang_lookup_slot -> __lookup
>>> >
>>> > Hmm. There haven't been many changes in this area, so the few changes
>>> > that *do* exist are obviously very suspicious.
>>> >
>>> > In particular, the only real change to that whole setup is the changes
>>> > by Hugh to make the swap entries use the radix tree. So I'm bringing
>>> > Hugh and Andrew to the discussion (and Rik, since he acked a few of
>>> > those changes).
>>> >
>>> > The fact that some light swapping activity seems to accompany the
>>> > problem just makes me more certain it's Hugh's swap/radix tree work.
>>> >
>>> > We're talking only a handful of patches, so maybe Hugh could create a
>>> > revert patch just to confirm that yes, that's the problem.
>>> >
>>> > Hugh?
>>> >
>>> >                      Linus
>>> >
>>> > --- quoting the rest of the email for Hugh/Andrew ---
>>> >> [original report snipped: bisection note, console message, perf output,
>>> >> reproduction steps and vmstat trace -- quoted in full earlier in this thread]
>>> >
>>
>> It appears the bisection was not so horrific (I was out of the PCI/MRRS bug
>> window); it will complete shortly:
>>
>> # git bisect bad
>> Bisecting: 33 revisions left to test after this (roughly 5 steps)
>> [c299eba3c5a801657f275d33be588b34831cd30e] Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6
>> # git bisect log
>> git bisect start
>> # bad: [003f6c9df54970d8b19578d195b3e2b398cdbde2] lib/sha1.c: quiet sparse noise about symbol not declared
>> git bisect bad 003f6c9df54970d8b19578d195b3e2b398cdbde2
>> # good: [02f8c6aee8df3cdc935e9bdd4f2d020306035dbe] Linux 3.0
>> git bisect good 02f8c6aee8df3cdc935e9bdd4f2d020306035dbe
>> # good: [d5ef642355bdd9b383ff5c18cbc6102a06eecbaf] Merge branch 'tty-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6
>> git bisect good d5ef642355bdd9b383ff5c18cbc6102a06eecbaf
>> # good: [664a41b8a91bf78a01a751e15175e0008977685a] Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6
>> git bisect good 664a41b8a91bf78a01a751e15175e0008977685a
>> # bad: [585df1d90cb07a02ca6c7a7d339e56e46d50dafb] xhci: Remove TDs from TD lists when URBs are canceled.
>> git bisect bad 585df1d90cb07a02ca6c7a7d339e56e46d50dafb
>> # good: [60ad4466821a96913a9b567115e194ed1087c2d7] Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
>> git bisect good 60ad4466821a96913a9b567115e194ed1087c2d7
>> # bad: [7f3bf7cd348cead84f8027b32aa30ea49fa64df5] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx
>> git bisect bad 7f3bf7cd348cead84f8027b32aa30ea49fa64df5
>> # good: [9e8ed3ae924b65ab5f088fe63ee6f4326f04590f] [S390] signal: use set_restore_sigmask() helper
>> git bisect good 9e8ed3ae924b65ab5f088fe63ee6f4326f04590f
>> # bad: [31475dd611209413bace21651a400afb91d0bd9d] mm: a few small updates for radix-swap
>> git bisect bad 31475dd611209413bace21651a400afb91d0bd9d
>>
>>
>>
>


* Re: [BUG] infinite loop in find_get_pages()
  2011-09-14  8:20         ` Shaohua Li
@ 2011-09-14  8:43           ` Eric Dumazet
  2011-09-14  8:55             ` Shaohua Li
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2011-09-14  8:43 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, linux-kernel, Rik van Riel

On Wednesday, 14 September 2011 at 16:20 +0800, Shaohua Li wrote:
> 2011/9/14 Shaohua Li <shli@kernel.org>:
> > it appears we didn't account for skipped swap entries in find_get_pages().
> > does the attached patch help?
> > I can easily reproduce the issue: just cp files in tmpfs, trigger swap, and
> > drop caches. The debug patch fixes it on my side.
> Eric, please try it.
> 

Hello Shaohua

I tried it with added traces:


[  277.077855] mv used greatest stack depth: 3336 bytes left
[  310.558012] nr_found=2 nr_skip=2
[  310.558139] nr_found=14 nr_skip=14
[  332.195162] nr_found=2 nr_skip=2
[  332.195274] nr_found=14 nr_skip=14
[  352.315273] nr_found=14 nr_skip=14
[  372.673575] nr_found=14 nr_skip=14
[  397.115463] nr_found=14 nr_skip=14
[  403.391694] cc1 used greatest stack depth: 3184 bytes left
[  404.761194] cc1 used greatest stack depth: 2640 bytes left
[  417.306510] nr_found=14 nr_skip=14
[  440.198051] nr_found=14 nr_skip=14
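
(The printk that produced the lines above is not shown in the thread; a
minimal sketch of instrumentation that would emit them -- an assumption,
placed just before the restart check in find_get_pages() -- could be:)

	/* hypothetical debug output, not the actual patch tested here */
	if (!ret && nr_found)
		printk(KERN_INFO "nr_found=%u nr_skip=%u\n", nr_found, nr_skip);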

I also used:

-	if (unlikely(!ret && nr_found))
+	if (unlikely(!ret && nr_found > nr_skip))
 		goto restart;

It seems to fix the bug. I suspect it also aborts
invalidate_mapping_pages() if we skip 14 pages, but the existing comment
states it's OK:

        /*
         * Note: this function may get called on a shmem/tmpfs mapping:
         * pagevec_lookup() might then return 0 prematurely (because it
         * got a gangful of swap entries); but it's hardly worth worrying
         * about - it can rarely have anything to free from such a mapping
         * (most pages are dirty), and already skips over any difficulties.
         */
 
Thanks!




* Re: [BUG] infinite loop in find_get_pages()
  2011-09-14  8:43           ` Eric Dumazet
@ 2011-09-14  8:55             ` Shaohua Li
  2011-09-14 20:38               ` Hugh Dickins
  0 siblings, 1 reply; 24+ messages in thread
From: Shaohua Li @ 2011-09-14  8:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, linux-kernel, Rik van Riel

On Wed, 2011-09-14 at 16:43 +0800, Eric Dumazet wrote:
> On Wednesday, 14 September 2011 at 16:20 +0800, Shaohua Li wrote:
> > 2011/9/14 Shaohua Li <shli@kernel.org>:
> > > it appears we didn't account for skipped swap entries in find_get_pages().
> > > does the attached patch help?
> > I can easily reproduce the issue: just cp files in tmpfs, trigger swap, and
> > drop caches. The debug patch fixes it on my side.
> > Eric, please try it.
> > 
> 
> Hello Shaohua
> 
> I tried it with added traces :
> 
> 
> [  277.077855] mv used greatest stack depth: 3336 bytes left
> [  310.558012] nr_found=2 nr_skip=2
> [  310.558139] nr_found=14 nr_skip=14
> [  332.195162] nr_found=2 nr_skip=2
> [  332.195274] nr_found=14 nr_skip=14
> [  352.315273] nr_found=14 nr_skip=14
> [  372.673575] nr_found=14 nr_skip=14
> [  397.115463] nr_found=14 nr_skip=14
> [  403.391694] cc1 used greatest stack depth: 3184 bytes left
> [  404.761194] cc1 used greatest stack depth: 2640 bytes left
> [  417.306510] nr_found=14 nr_skip=14
> [  440.198051] nr_found=14 nr_skip=14
> 
> I also used :
> 
> -	if (unlikely(!ret && nr_found))
> +	if (unlikely(!ret && nr_found > nr_skip))
>  		goto restart;
nr_found > nr_skip is better

> It seems to fix the bug. I suspect it also aborts
> invalidate_mapping_pages() if we skip 14 pages, but existing comment
> states its OK :
> 
>         /*
>          * Note: this function may get called on a shmem/tmpfs mapping:
>          * pagevec_lookup() might then return 0 prematurely (because it
>          * got a gangful of swap entries); but it's hardly worth worrying
>          * about - it can rarely have anything to free from such a mapping
>          * (most pages are dirty), and already skips over any difficulties.
>          */
That might be a problem; let Hugh answer whether it is.

Thanks,
Shaohua



* Re: [BUG] infinite loop in find_get_pages()
  2011-09-14  8:55             ` Shaohua Li
@ 2011-09-14 20:38               ` Hugh Dickins
  2011-09-14 20:55                 ` Eric Dumazet
  0 siblings, 1 reply; 24+ messages in thread
From: Hugh Dickins @ 2011-09-14 20:38 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Eric Dumazet, Linus Torvalds, Andrew Morton, linux-kernel,
	Rik van Riel, Lin Ming, Justin Piszcz, Pawel Sikora


On Wed, 14 Sep 2011, Shaohua Li wrote:
> On Wed, 2011-09-14 at 16:43 +0800, Eric Dumazet wrote:
> > On Wednesday, 14 September 2011 at 16:20 +0800, Shaohua Li wrote:
> > > 2011/9/14 Shaohua Li <shli@kernel.org>:
> > > > it appears we didn't account for skipped swap entries in find_get_pages().
> > > > does the attached patch help?
> > > I can easily reproduce the issue: just cp files in tmpfs, trigger swap, and
> > > drop caches. The debug patch fixes it on my side.
> > > Eric, please try it.
> > > 
> > 
> > Hello Shaohua
> > 
> > I tried it with added traces :
> > 
> > 
> > [  277.077855] mv used greatest stack depth: 3336 bytes left
> > [  310.558012] nr_found=2 nr_skip=2
> > [  310.558139] nr_found=14 nr_skip=14
> > [  332.195162] nr_found=2 nr_skip=2
> > [  332.195274] nr_found=14 nr_skip=14
> > [  352.315273] nr_found=14 nr_skip=14
> > [  372.673575] nr_found=14 nr_skip=14
> > [  397.115463] nr_found=14 nr_skip=14
> > [  403.391694] cc1 used greatest stack depth: 3184 bytes left
> > [  404.761194] cc1 used greatest stack depth: 2640 bytes left
> > [  417.306510] nr_found=14 nr_skip=14
> > [  440.198051] nr_found=14 nr_skip=14
> > 
> > I also used :
> > 
> > -	if (unlikely(!ret && nr_found))
> > +	if (unlikely(!ret && nr_found > nr_skip))
> >  		goto restart;
> nr_found > nr_skip is better
> 
> > It seems to fix the bug. I suspect it also aborts
> > invalidate_mapping_pages() if we skip 14 pages, but existing comment
> > states its OK :
> > 
> >         /*
> >          * Note: this function may get called on a shmem/tmpfs mapping:
> >          * pagevec_lookup() might then return 0 prematurely (because it
> >          * got a gangful of swap entries); but it's hardly worth worrying
> >          * about - it can rarely have anything to free from such a mapping
> >          * (most pages are dirty), and already skips over any difficulties.
> >          */
> that might be a problem, let Hugh answer if it is.

Thanks to you all for suffering, reporting and investigating this.
Yes, in 3.1-rc I have converted an extremely rare try-again-once
into a too-easily stumbled-upon endless loop.

Would it be a problem to give up early on a shmem/tmpfs mapping in
invalidate_mapping_pages()?  No, not really: it's rare for it to find
anything it can throw away from tmpfs, because it cannot recognize the
clean swapcache pages (getting it to work on those would be nice, and
something I did look into once, but it's not a job for today), and
entirely clean pages (readonly mmap'ed zeroes never touched) are uncommon.

However, I did independently run across scan_mapping_unevictable_pages()
a few days ago: that uses pagevec_lookup() on shmem when doing SHM_UNLOCK,
and although the normal case would be that everything then is in memory,
I think it's not impossible for some to be swapped out (already swapped
out at SHM_LOCK time, and not touched since), which should not stop it
from doing its work on unswapped pages beyond.
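
(For context, that SHM_LOCK/SHM_UNLOCK path is driven from user space
through shmctl(); a minimal sketch of the sequence described above --
hypothetical, the segment size and flow are illustrative assumptions:)

#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
	/* 16MB segment; SHM_LOCK needs CAP_IPC_LOCK or RLIMIT_MEMLOCK room */
	int id = shmget(IPC_PRIVATE, 16 << 20, IPC_CREAT | 0600);

	/* pages of this segment already swapped out here stay in swap */
	shmctl(id, SHM_LOCK, NULL);

	/* ... locked period; the swapped-out pages are never touched ... */

	/* SHM_UNLOCK is the call that ends up scanning the mapping */
	shmctl(id, SHM_UNLOCK, NULL);

	shmctl(id, IPC_RMID, NULL);
	return 0;
}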

My preferred patch is below: but it does add a cond_resched() into
find_get_pages(), which is really below the level at which we usually
do cond_resched().  All callers appear fine with it, and in practice
it would be very^14 rare on anything other than shmem/tmpfs: so this
being rc6 I'm reluctant to make matters worse with a might_sleep().

But I'm not signing this off yet, because I'm still mystified by the
several reports of seemingly the same problem on 3.0.1 and 3.0.2,
which I fear the patch below (even if adjusted to apply) will do
nothing to help - there are no swap entries in radix_tree in 3.0.

My suspicion is that there's some path by which a page gets trapped
in the radix_tree with page count 0.  While it's easy to imagine that
THP's use of compaction and compaction's use of migration could have
made a bug there more common, I do not see it.

I'd like to think about that a little more before finalizing the
patch below - does it work, and does it look acceptable so far?
Of course, the mods to truncate.c and vmscan.c are not essential
parts of this fix, just things to tidy up while on the subject.
Right now I must attend to some other stuff, will return tomorrow.

Hugh

---

 mm/filemap.c  |   14 ++++++++++----
 mm/truncate.c |    8 --------
 mm/vmscan.c   |    2 +-
 3 files changed, 11 insertions(+), 13 deletions(-)

--- 3.1-rc6/mm/filemap.c	2011-08-07 23:44:41.231928061 -0700
+++ linux/mm/filemap.c	2011-09-14 12:24:26.431242155 -0700
@@ -829,8 +829,8 @@ unsigned find_get_pages(struct address_s
 	unsigned int ret;
 	unsigned int nr_found;
 
-	rcu_read_lock();
 restart:
+	rcu_read_lock();
 	nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree,
 				(void ***)pages, NULL, start, nr_pages);
 	ret = 0;
@@ -849,12 +849,15 @@ repeat:
 				 * to root: none yet gotten, safe to restart.
 				 */
 				WARN_ON(start | i);
+				rcu_read_unlock();
 				goto restart;
 			}
 			/*
 			 * Otherwise, shmem/tmpfs must be storing a swap entry
 			 * here as an exceptional entry: so skip over it -
-			 * we only reach this from invalidate_mapping_pages().
+			 * we only reach this from invalidate_mapping_pages(),
+			 * or SHM_UNLOCK's scan_mapping_unevictable_pages() -
+			 * in each case it's correct to skip a swapped entry.
 			 */
 			continue;
 		}
@@ -871,14 +874,17 @@ repeat:
 		pages[ret] = page;
 		ret++;
 	}
+	rcu_read_unlock();
 
 	/*
 	 * If all entries were removed before we could secure them,
 	 * try again, because callers stop trying once 0 is returned.
 	 */
-	if (unlikely(!ret && nr_found))
+	if (unlikely(!ret && nr_found)) {
+		cond_resched();
+		start += nr_found;
 		goto restart;
-	rcu_read_unlock();
+	}
 	return ret;
 }
 
--- 3.1-rc6/mm/truncate.c	2011-08-07 23:44:41.299928402 -0700
+++ linux/mm/truncate.c	2011-09-14 11:23:19.513059010 -0700
@@ -336,14 +336,6 @@ unsigned long invalidate_mapping_pages(s
 	unsigned long count = 0;
 	int i;
 
-	/*
-	 * Note: this function may get called on a shmem/tmpfs mapping:
-	 * pagevec_lookup() might then return 0 prematurely (because it
-	 * got a gangful of swap entries); but it's hardly worth worrying
-	 * about - it can rarely have anything to free from such a mapping
-	 * (most pages are dirty), and already skips over any difficulties.
-	 */
-
 	pagevec_init(&pvec, 0);
 	while (index <= end && pagevec_lookup(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
--- 3.1-rc6/mm/vmscan.c	2011-08-28 22:10:26.516791859 -0700
+++ linux/mm/vmscan.c	2011-09-14 11:25:27.701694661 -0700
@@ -3375,8 +3375,8 @@ void scan_mapping_unevictable_pages(stru
 		pagevec_release(&pvec);
 
 		count_vm_events(UNEVICTABLE_PGSCANNED, pg_scanned);
+		cond_resched();
 	}
-
 }
 
 /**


* Re: [BUG] infinite loop in find_get_pages()
  2011-09-14 20:38               ` Hugh Dickins
@ 2011-09-14 20:55                 ` Eric Dumazet
  2011-09-14 21:53                   ` Hugh Dickins
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2011-09-14 20:55 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Shaohua Li, Linus Torvalds, Andrew Morton, linux-kernel,
	Rik van Riel, Lin Ming, Justin Piszcz, Pawel Sikora

On Wednesday, 14 September 2011 at 13:38 -0700, Hugh Dickins wrote:
> [quoted exchange with Shaohua and Eric snipped -- see the two messages above]
> 
> Thanks to you all for suffering, reporting and investigating this.
> Yes, in 3.1-rc I have converted an extremely rare try-again-once
> into a too-easily stumbled-upon endless loop.
> 
> Would it be a problem to give up early on a shmem/tmpfs mapping in
> invalidate_mapping_pages()?  No, not really: it's rare for it to find
> anything it can throw away from tmpfs, because it cannot recognize the
> clean swapcache pages (getting it to work on those would be nice, and
> something I did look into once, but it's not a job for today), and
> entirely clean pages (readonly mmap'ed zeroes never touched) are uncommon.
> 
> However, I did independently run across scan_mapping_unevictable_pages()
> a few days ago: that uses pagevec_lookup() on shmem when doing SHM_UNLOCK,
> and although the normal case would be that everything then is in memory,
> I think it's not impossible for some to be swapped out (already swapped
> out at SHM_LOCK time, and not touched since), which should not stop it
> from doing its work on unswapped pages beyond.
> 
> My preferred patch is below: but it does add a cond_resched() into
> find_get_pages(), which is really below the level at which we usually
> do cond_resched().  All callers appear fine with it, and in practice
> it would be very^14 rare on anything other than shmem/tmpfs: so this
> being rc6 I'm reluctant to make matters worse with a might_sleep().
> 
> But I'm not signing this off yet, because I'm still mystified by the
> several reports of seemingly the same problem on 3.0.1 and 3.0.2,
> which I fear the patch below (even if adjusted to apply) will do
> nothing to help - there are no swap entries in radix_tree in 3.0.
> 
> My suspicion is that there's some path by which a page gets trapped
> in the radix_tree with page count 0.  While it's easy to imagine that
> THP's use of compaction and compaction's use of migration could have
> made a bug there more common, I do not see it.
> 
> I'd like to think about that a little more before finalizing the
> patch below - does it work, and does it look acceptable so far?
> Of course, the mods to truncate.c and vmscan.c are not essential
> parts of this fix, just things to tidy up while on the subject.
> Right now I must attend to some other stuff, will return tomorrow.
> 
> Hugh
> 

Hello Hugh

I am going to test this ASAP, but have one question below:

> ---
> 
>  mm/filemap.c  |   14 ++++++++++----
>  mm/truncate.c |    8 --------
>  mm/vmscan.c   |    2 +-
>  3 files changed, 11 insertions(+), 13 deletions(-)
> 
> --- 3.1-rc6/mm/filemap.c	2011-08-07 23:44:41.231928061 -0700
> +++ linux/mm/filemap.c	2011-09-14 12:24:26.431242155 -0700
> @@ -829,8 +829,8 @@ unsigned find_get_pages(struct address_s
>  	unsigned int ret;
>  	unsigned int nr_found;
>  
> -	rcu_read_lock();
>  restart:
> +	rcu_read_lock();
>  	nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree,
>  				(void ***)pages, NULL, start, nr_pages);
>  	ret = 0;
> @@ -849,12 +849,15 @@ repeat:
>  				 * to root: none yet gotten, safe to restart.
>  				 */
>  				WARN_ON(start | i);
> +				rcu_read_unlock();
>  				goto restart;
>  			}
>  			/*
>  			 * Otherwise, shmem/tmpfs must be storing a swap entry
>  			 * here as an exceptional entry: so skip over it -
> -			 * we only reach this from invalidate_mapping_pages().
> +			 * we only reach this from invalidate_mapping_pages(),
> +			 * or SHM_UNLOCK's scan_mapping_unevictable_pages() -
> +			 * in each case it's correct to skip a swapped entry.
>  			 */
>  			continue;
>  		}
> @@ -871,14 +874,17 @@ repeat:
>  		pages[ret] = page;
>  		ret++;
>  	}
> +	rcu_read_unlock();
>  
>  	/*
>  	 * If all entries were removed before we could secure them,
>  	 * try again, because callers stop trying once 0 is returned.
>  	 */
> -	if (unlikely(!ret && nr_found))
> +	if (unlikely(!ret && nr_found)) {
> +		cond_resched();
> +		start += nr_found;

Isn't it possible to go out of the initial window?
start could be greater than 'end'?

invalidate_mapping_pages() does some capping (end - index):


>  	pagevec_init(&pvec, 0);
>  	while (index <= end && pagevec_lookup(&pvec, mapping, index,
>  			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {





* Re: [BUG] infinite loop in find_get_pages()
  2011-09-14 20:55                 ` Eric Dumazet
@ 2011-09-14 21:53                   ` Hugh Dickins
  2011-09-14 22:08                     ` Eric Dumazet
  2011-09-14 22:37                     ` Linus Torvalds
  0 siblings, 2 replies; 24+ messages in thread
From: Hugh Dickins @ 2011-09-14 21:53 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Shaohua Li, Linus Torvalds, Andrew Morton, linux-kernel,
	Rik van Riel, Lin Ming, Justin Piszcz, Pawel Sikora


On Wed, 14 Sep 2011, Eric Dumazet wrote:
> On Wednesday, 14 September 2011 at 13:38 -0700, Hugh Dickins wrote:
> > 
> > I'd like to think about that a little more before finalizing the
> > patch below - does it work, and does it look acceptable so far?
> > Of course, the mods to truncate.c and vmscan.c are not essential
> > parts of this fix, just things to tidy up while on the subject.
> > Right now I must attend to some other stuff, will return tomorrow.
> 
> Hello Hugh
> 
> I am going to test this ASAP,

Thanks, Eric, though it may not be worth spending your time on it.
It occurred to me over lunch that it may take painfully longer than
expected to invalidate_mapping_pages() on a single-swapped-out-page
1TB sparse tmpfs file - all those "start += 1" restarts until it
reaches the end.
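
(Rough scale, assuming 4KB pages: a 1TB file spans 1TB / 4KB = 2^28 =
268,435,456 page indices. When each restarted lookup finds only that
single swap entry, start advances by nr_found = 1 per pass, so an entry
near the end of the file could cost on the order of 2^28 restarted gang
lookups.)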

I might decide to leave invalidate_mapping_pages() giving up early
(unsatisfying, but no worse than before), and convert scan_mapping_
unevictable_pages() (which is used on nothing but shmem) to pass
index vector to radix_tree_gang_whatever().

Dunno, I'll think about it more later.
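
(A sketch of the index-vector idea mentioned above, based only on the
radix_tree_gang_lookup_slot() signature visible in this thread's patches,
where find_get_pages() passes NULL as the third argument; 'slots' is a
caller-provided array, and none of this is Hugh's eventual patch:)

	pgoff_t indices[PAGEVEC_SIZE];
	unsigned int nr_found;

	nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree,
				(void ***)slots, indices, start, PAGEVEC_SIZE);
	/*
	 * indices[i] reports where slot i sits in the tree even when it
	 * holds a swap entry, so the next scan can resume after the last
	 * entry actually seen instead of rescanning from 'start'.
	 */
	if (nr_found)
		start = indices[nr_found - 1] + 1;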

> but have one question below :
> 
> >  	/*
> >  	 * If all entries were removed before we could secure them,
> >  	 * try again, because callers stop trying once 0 is returned.
> >  	 */
> > -	if (unlikely(!ret && nr_found))
> > +	if (unlikely(!ret && nr_found)) {
> > +		cond_resched();
> > +		start += nr_found;
> 
> Isn't it possible to go out of the initial window?
> start could be greater than 'end'?
> 
> invalidate_mapping_pages() does some capping (end - index):

Good question, but even before the change (or any of my changes here)
it's perfectly possible to go out of the initial window - the radix_tree
gang interfaces allow you to specify the maximum you want back (i.e. size
of buffer), but they do not actually allow you to specify end of range.

There are a few places where we trim the maximum to match our end of range,
but that's just a slight optimization in the face of an arguably incomplete
interface.  But the radix_tree is not too inefficient this way, because of
how empty nodes get removed immediately - there's a limit to the number
of nodes it will have to look through before it fills the buffer.
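
For reference, the gang-lookup prototype behind that observation, as
called in the 3.1 patches earlier in this thread (a start index and a
maximum count, but no end index):

unsigned int
radix_tree_gang_lookup_slot(struct radix_tree_root *root,
			void ***results, unsigned long *indices,
			unsigned long first_index, unsigned int max_items);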

> 
> 
> >  	pagevec_init(&pvec, 0);
> >  	while (index <= end && pagevec_lookup(&pvec, mapping, index,
> >  			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {

It does cap by "end - index", but it has already checked "index <= end",
and it is only this minor optimization, nothing essential.

Hugh


* Re: [BUG] infinite loop in find_get_pages()
  2011-09-14 21:53                   ` Hugh Dickins
@ 2011-09-14 22:08                     ` Eric Dumazet
  2011-09-14 22:37                     ` Linus Torvalds
  1 sibling, 0 replies; 24+ messages in thread
From: Eric Dumazet @ 2011-09-14 22:08 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Shaohua Li, Linus Torvalds, Andrew Morton, linux-kernel,
	Rik van Riel, Lin Ming, Justin Piszcz, Pawel Sikora

On Wednesday, 14 September 2011 at 14:53 -0700, Hugh Dickins wrote:
> On Wed, 14 Sep 2011, Eric Dumazet wrote:
> > On Wednesday, 14 September 2011 at 13:38 -0700, Hugh Dickins wrote:
> > > 
> > > I'd like to think about that a little more before finalizing the
> > > patch below - does it work, and does it look acceptable so far?
> > > Of course, the mods to truncate.c and vmscan.c are not essential
> > > parts of this fix, just things to tidy up while on the subject.
> > > Right now I must attend to some other stuff, will return tomorrow.
> > 
> > Hello Hugh
> > 
> > I am going to test this ASAP,
> 
> Thanks, Eric, though it may not be worth spending your time on it.
> It occurred to me over lunch that it may take painfully longer than
> expected to invalidate_mapping_pages() on a single-swapped-out-page
> 1TB sparse tmpfs file - all those "start += 1" restarts until it
> reaches the end.
> 
> I might decide to leave invalidate_mapping_pages() giving up early
> (unsatisfying, but no worse than before), and convert scan_mapping_
> unevictable_pages() (which is used on nothing but shmem) to pass
> index vector to radix_tree_gang_whatever().
> 
> Dunno, I'll think about it more later.
> 

I tested your patch as is on my machine, and everything seems fine.

I'll let the stress test continue while I sleep :)

See you




* Re: [BUG] infinite loop in find_get_pages()
  2011-09-14 21:53                   ` Hugh Dickins
  2011-09-14 22:08                     ` Eric Dumazet
@ 2011-09-14 22:37                     ` Linus Torvalds
  2011-09-15  0:45                       ` Shaohua Li
  1 sibling, 1 reply; 24+ messages in thread
From: Linus Torvalds @ 2011-09-14 22:37 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Eric Dumazet, Shaohua Li, Andrew Morton, linux-kernel,
	Rik van Riel, Lin Ming, Justin Piszcz, Pawel Sikora

On Wed, Sep 14, 2011 at 2:53 PM, Hugh Dickins <hughd@google.com> wrote:
>
> Thanks, Eric, though it may not be worth spending your time on it.
> It occurred to me over lunch that it may take painfully longer than
> expected to invalidate_mapping_pages() on a single-swapped-out-page
> 1TB sparse tmpfs file - all those "start += 1" restarts until it
> reaches the end.

So can we have a stop-gap patch that just fixes it for now? I assume
that would be Shaohua's patch with the "nr_found > nr_skip" change?

Can you guys send whatever patch is appropriate for now with a nice
changelog and the appropriate sign-offs, please? So that we can at
least close the issue...

                       Linus


* Re: [BUG] infinite loop in find_get_pages()
  2011-09-14 22:37                     ` Linus Torvalds
@ 2011-09-15  0:45                       ` Shaohua Li
  2011-09-15  2:00                         ` Hugh Dickins
  2011-09-15  4:02                         ` Eric Dumazet
  0 siblings, 2 replies; 24+ messages in thread
From: Shaohua Li @ 2011-09-15  0:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Eric Dumazet, Andrew Morton, linux-kernel,
	Rik van Riel, Lin Ming, Justin Piszcz, Pawel Sikora

On Thu, 2011-09-15 at 06:37 +0800, Linus Torvalds wrote:
> On Wed, Sep 14, 2011 at 2:53 PM, Hugh Dickins <hughd@google.com> wrote:
> >
> > Thanks, Eric, though it may not be worth spending your time on it.
> > It occurred to me over lunch that it may take painfully longer than
> > expected to invalidate_mapping_pages() on a single-swapped-out-page
> > 1TB sparse tmpfs file - all those "start += 1" restarts until it
> > reaches the end.
> 
> So can we have a stop-gap patch that just fixes it for now? I assume
> that would be Shaohua's patch with the "nr_found > nr_skip" change?
> 
> Can you guys send whatever patch is appropriate for now with a nice
> changelog and the appropriate sign-offs, please? So that we can at
> least close the issue...
Here is my patch if you want to close the issue at hand.

Subject: mm: account skipped entries to avoid looping in find_get_pages

The entries found by find_get_pages() could all be swap entries. In
that case we skip them, but we must account for the skipped entries so
that we don't keep looping.
Use nr_found > nr_skip to simplify the code, as suggested by Eric.

Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>

diff --git a/mm/filemap.c b/mm/filemap.c
index 645a080..7771871 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -827,13 +827,14 @@ unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
 {
 	unsigned int i;
 	unsigned int ret;
-	unsigned int nr_found;
+	unsigned int nr_found, nr_skip;
 
 	rcu_read_lock();
 restart:
 	nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree,
 				(void ***)pages, NULL, start, nr_pages);
 	ret = 0;
+	nr_skip = 0;
 	for (i = 0; i < nr_found; i++) {
 		struct page *page;
 repeat:
@@ -856,6 +857,7 @@ repeat:
 			 * here as an exceptional entry: so skip over it -
 			 * we only reach this from invalidate_mapping_pages().
 			 */
+			nr_skip++;
 			continue;
 		}
 
@@ -876,7 +878,7 @@ repeat:
 	 * If all entries were removed before we could secure them,
 	 * try again, because callers stop trying once 0 is returned.
 	 */
-	if (unlikely(!ret && nr_found))
+	if (unlikely(!ret && nr_found > nr_skip))
 		goto restart;
 	rcu_read_unlock();
 	return ret;




* Re: [BUG] infinite loop in find_get_pages()
  2011-09-15  0:45                       ` Shaohua Li
@ 2011-09-15  2:00                         ` Hugh Dickins
  2011-09-15  4:02                         ` Eric Dumazet
  1 sibling, 0 replies; 24+ messages in thread
From: Hugh Dickins @ 2011-09-15  2:00 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Linus Torvalds, Eric Dumazet, Andrew Morton, linux-kernel,
	Rik van Riel, Lin Ming, Justin Piszcz, Pawel Sikora

On Thu, 15 Sep 2011, Shaohua Li wrote:
> On Thu, 2011-09-15 at 06:37 +0800, Linus Torvalds wrote:
> > On Wed, Sep 14, 2011 at 2:53 PM, Hugh Dickins <hughd@google.com> wrote:
> > >
> > > Thanks, Eric, though it may not be worth spending your time on it.
> > > It occurred to me over lunch that it may take painfully longer than
> > > expected to invalidate_mapping_pages() on a single-swapped-out-page
> > > 1TB sparse tmpfs file - all those "start += 1" restarts until it
> > > reaches the end.
> > 
> > So can we have a stop-gap patch that just fixes it for now? I assume
> > that would be Shaohua's patch with the "nr_found > nr_skip" change?
> > 
> > Can you guys send whatever patch is appropriate for now with a nice
> > changelog and the appropriate sign-offs, please? So that we can at
> > least close the issue...
> here is my patch if you want to close the issue at hand.

Right, it closes one of the hangs, but not whatever the 3.0 hang is,
and not the unlikely SHM_UNLOCK issue I factored in.  I cannot consider
those issues closed, but I am happy to be let off the hook of providing
another fix tomorrow - thanks!

> 
> Subject: mm: account skipped entries to avoid looping in find_get_pages
> 
> The entries found by find_get_pages() could all be swap entries. In
> that case we skip them, but we must account for the skipped entries so
> that we don't keep looping.
> Use nr_found > nr_skip to simplify the code, as suggested by Eric.
> 
> Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
> Signed-off-by: Shaohua Li <shaohua.li@intel.com>

Acked-by: Hugh Dickins <hughd@google.com>

> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 645a080..7771871 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -827,13 +827,14 @@ unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
>  {
>  	unsigned int i;
>  	unsigned int ret;
> -	unsigned int nr_found;
> +	unsigned int nr_found, nr_skip;
>  
>  	rcu_read_lock();
>  restart:
>  	nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree,
>  				(void ***)pages, NULL, start, nr_pages);
>  	ret = 0;
> +	nr_skip = 0;
>  	for (i = 0; i < nr_found; i++) {
>  		struct page *page;
>  repeat:
> @@ -856,6 +857,7 @@ repeat:
>  			 * here as an exceptional entry: so skip over it -
>  			 * we only reach this from invalidate_mapping_pages().
>  			 */
> +			nr_skip++;
>  			continue;
>  		}
>  
> @@ -876,7 +878,7 @@ repeat:
>  	 * If all entries were removed before we could secure them,
>  	 * try again, because callers stop trying once 0 is returned.
>  	 */
> -	if (unlikely(!ret && nr_found))
> +	if (unlikely(!ret && nr_found > nr_skip))
>  		goto restart;
>  	rcu_read_unlock();
>  	return ret;


* Re: [BUG] infinite loop in find_get_pages()
  2011-09-15  0:45                       ` Shaohua Li
  2011-09-15  2:00                         ` Hugh Dickins
@ 2011-09-15  4:02                         ` Eric Dumazet
  1 sibling, 0 replies; 24+ messages in thread
From: Eric Dumazet @ 2011-09-15  4:02 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, linux-kernel,
	Rik van Riel, Lin Ming, Justin Piszcz, Pawel Sikora

On Thursday 15 September 2011 at 08:45 +0800, Shaohua Li wrote:

> here is my patch if you want to close the issue at hand.
> 
> Subject: mm: account skipped entries to avoid looping in find_get_pages
> 
> The entries found by find_get_pages() could all be swap entries. In
> that case we skip them, but we make sure the skipped entries are
> accounted for, so we don't keep looping.
> Use nr_found > nr_skip to simplify the code, as suggested by Eric.
> 
> Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
> Signed-off-by: Shaohua Li <shaohua.li@intel.com>
> 

Yep, I guess Hugh can refine it later.

I'm pulling the latest Linus tree (including this patch) and will redo a
stress session, including transparent hugepage games.

Thanks!




* Re: [BUG] infinite loop in find_get_pages()
  2011-09-14  0:34   ` Lin Ming
@ 2011-09-15 10:47     ` Pawel Sikora
  2011-09-15 11:11       ` Justin Piszcz
  0 siblings, 1 reply; 24+ messages in thread
From: Pawel Sikora @ 2011-09-15 10:47 UTC (permalink / raw)
  To: Lin Ming
  Cc: Andrew Morton, Eric Dumazet, Linus Torvalds, linux-kernel,
	Andrew Morton, Toshiyuki Okajima, Dave Chinner, Hugh Dickins,
	Justin Piszcz

On Wednesday 14 September 2011 at 08:34:21, Lin Ming wrote:

> [3.0.2-stable] BUG: soft lockup - CPU#13 stuck for 22s! [kswapd2:1092]
> http://marc.info/?l=linux-kernel&m=131469584117857&w=2

Hi,

I'm not sure this is fully related to this thread, but I've found
new warnings about memory pages in dmesg today:

[650697.716481] ------------[ cut here ]------------
[650697.716498] WARNING: at mm/page-writeback.c:1176 __set_page_dirty_nobuffers+0x10a/0x140()
[650697.716501] Hardware name: H8DGU
[650697.716502] Modules linked in: nfs fscache binfmt_misc nfsd lockd nfs_acl auth_rpcgss sunrpc ipmi_si ipmi_devintf ipmi_msghandler sch_sfq iptable_nat nf_nat nf_conntrack_ipv4 
nf_conntrack nf_defrag_ipv4 iptable_filter xt_TCPMSS xt_tcpudp iptable_mangle ip_tables ip6table_filter ip6_tables x_tables ext4 jbd2 crc16 raid10 raid0 dm_mod uvesafb autofs4 
dummy aoe joydev usbhid hid ide_cd_mod cdrom ata_generic pata_acpi pata_atiixp sp5100_tco ohci_hcd ide_pci_generic ssb ehci_hcd pcmcia igb pcmcia_core psmouse mmc_core evdev 
i2c_piix4 atiixp ide_core k10temp usbcore amd64_edac_mod edac_core i2c_core dca hwmon edac_mce_amd ghes serio_raw button hed processor pcspkr sg sd_mod crc_t10dif raid1 md_mod ext3 
jbd mbcache ahci libahci libata scsi_mod [last unloaded: scsi_wait_scan]
[650697.716569] Pid: 16806, comm: m_xilinx Not tainted 3.0.4 #5
[650697.716572] Call Trace:
[650697.716582]  [<ffffffff810470da>] warn_slowpath_common+0x7a/0xb0
[650697.716586]  [<ffffffff81047125>] warn_slowpath_null+0x15/0x20
[650697.716590]  [<ffffffff810e71ba>] __set_page_dirty_nobuffers+0x10a/0x140
[650697.716596]  [<ffffffff81127eb8>] migrate_page_copy+0x1c8/0x1d0
[650697.716600]  [<ffffffff81127ef5>] migrate_page+0x35/0x50
[650697.716623]  [<ffffffffa04b6f19>] nfs_migrate_page+0x59/0xf0 [nfs]
[650697.716627]  [<ffffffff81127fb9>] move_to_new_page+0xa9/0x260
[650697.716630]  [<ffffffff811286bd>] migrate_pages+0x3fd/0x4c0
[650697.716635]  [<ffffffff8142988e>] ? apic_timer_interrupt+0xe/0x20
[650697.716641]  [<ffffffff8111cbf0>] ? ftrace_define_fields_mm_compaction_isolate_template+0x70/0x70
[650697.716645]  [<ffffffff8111d5da>] compact_zone+0x52a/0x8c0
[650697.716649]  [<ffffffff8111dade>] compact_zone_order+0x7e/0xb0
[650697.716653]  [<ffffffff8111dbcd>] try_to_compact_pages+0xbd/0xf0
[650697.716657]  [<ffffffff810e5148>] __alloc_pages_direct_compact+0xa8/0x180
[650697.716661]  [<ffffffff810e588d>] __alloc_pages_nodemask+0x66d/0x7f0
[650697.716667]  [<ffffffff8110a92d>] ? page_add_new_anon_rmap+0x9d/0xb0
[650697.716671]  [<ffffffff8111b865>] alloc_pages_vma+0x95/0x180
[650697.716676]  [<ffffffff8112c2f8>] do_huge_pmd_anonymous_page+0x138/0x310
[650697.716680]  [<ffffffff81102ace>] handle_mm_fault+0x21e/0x310
[650697.716685]  [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
[650697.716688]  [<ffffffff811077a7>] ? do_mmap_pgoff+0x357/0x370
[650697.716692]  [<ffffffff8110790d>] ? sys_mmap_pgoff+0x14d/0x220
[650697.716697]  [<ffffffff811371b8>] ? do_sys_open+0x168/0x1d0
[650697.716701]  [<ffffffff81421d5f>] page_fault+0x1f/0x30
[650697.716704] ---[ end trace 4255de435c6def21 ]---

BR,
Paweł.



* Re: [BUG] infinite loop in find_get_pages()
  2011-09-15 10:47     ` Pawel Sikora
@ 2011-09-15 11:11       ` Justin Piszcz
  2011-09-15 12:04         ` Eric Dumazet
  0 siblings, 1 reply; 24+ messages in thread
From: Justin Piszcz @ 2011-09-15 11:11 UTC (permalink / raw)
  To: Pawel Sikora
  Cc: Lin Ming, Andrew Morton, Eric Dumazet, Linus Torvalds,
	linux-kernel, Andrew Morton, Toshiyuki Okajima, Dave Chinner,
	Hugh Dickins, Alan Piszcz




On Thu, 15 Sep 2011, Pawel Sikora wrote:

> On Wednesday 14 September 2011 at 08:34:21, Lin Ming wrote:
>
>> [3.0.2-stable] BUG: soft lockup - CPU#13 stuck for 22s! [kswapd2:1092]
>> http://marc.info/?l=linux-kernel&m=131469584117857&w=2
>
> Hi,
>
> I'm not sure this is fully related to this thread, but I've found
> new warnings about memory pages in dmesg today:
>
> [650697.716481] ------------[ cut here ]------------
> [650697.716498] WARNING: at mm/page-writeback.c:1176 __set_page_dirty_nobuffers+0x10a/0x140()
> [650697.716501] Hardware name: H8DGU
> [650697.716502] Modules linked in: nfs fscache binfmt_misc nfsd lockd nfs_acl auth_rpcgss sunrpc ipmi_si ipmi_devintf ipmi_msghandler sch_sfq iptable_nat nf_nat nf_conntrack_ipv4
> nf_conntrack nf_defrag_ipv4 iptable_filter xt_TCPMSS xt_tcpudp iptable_mangle ip_tables ip6table_filter ip6_tables x_tables ext4 jbd2 crc16 raid10 raid0 dm_mod uvesafb autofs4
> dummy aoe joydev usbhid hid ide_cd_mod cdrom ata_generic pata_acpi pata_atiixp sp5100_tco ohci_hcd ide_pci_generic ssb ehci_hcd pcmcia igb pcmcia_core psmouse mmc_core evdev
> i2c_piix4 atiixp ide_core k10temp usbcore amd64_edac_mod edac_core i2c_core dca hwmon edac_mce_amd ghes serio_raw button hed processor pcspkr sg sd_mod crc_t10dif raid1 md_mod ext3
> jbd mbcache ahci libahci libata scsi_mod [last unloaded: scsi_wait_scan]
> [650697.716569] Pid: 16806, comm: m_xilinx Not tainted 3.0.4 #5
> [650697.716572] Call Trace:
> [650697.716582]  [<ffffffff810470da>] warn_slowpath_common+0x7a/0xb0
> [650697.716586]  [<ffffffff81047125>] warn_slowpath_null+0x15/0x20
> [650697.716590]  [<ffffffff810e71ba>] __set_page_dirty_nobuffers+0x10a/0x140
> [650697.716596]  [<ffffffff81127eb8>] migrate_page_copy+0x1c8/0x1d0
> [650697.716600]  [<ffffffff81127ef5>] migrate_page+0x35/0x50
> [650697.716623]  [<ffffffffa04b6f19>] nfs_migrate_page+0x59/0xf0 [nfs]
> [650697.716627]  [<ffffffff81127fb9>] move_to_new_page+0xa9/0x260
> [650697.716630]  [<ffffffff811286bd>] migrate_pages+0x3fd/0x4c0
> [650697.716635]  [<ffffffff8142988e>] ? apic_timer_interrupt+0xe/0x20
> [650697.716641]  [<ffffffff8111cbf0>] ? ftrace_define_fields_mm_compaction_isolate_template+0x70/0x70
> [650697.716645]  [<ffffffff8111d5da>] compact_zone+0x52a/0x8c0
> [650697.716649]  [<ffffffff8111dade>] compact_zone_order+0x7e/0xb0
> [650697.716653]  [<ffffffff8111dbcd>] try_to_compact_pages+0xbd/0xf0
> [650697.716657]  [<ffffffff810e5148>] __alloc_pages_direct_compact+0xa8/0x180
> [650697.716661]  [<ffffffff810e588d>] __alloc_pages_nodemask+0x66d/0x7f0
> [650697.716667]  [<ffffffff8110a92d>] ? page_add_new_anon_rmap+0x9d/0xb0
> [650697.716671]  [<ffffffff8111b865>] alloc_pages_vma+0x95/0x180
> [650697.716676]  [<ffffffff8112c2f8>] do_huge_pmd_anonymous_page+0x138/0x310
> [650697.716680]  [<ffffffff81102ace>] handle_mm_fault+0x21e/0x310
> [650697.716685]  [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
> [650697.716688]  [<ffffffff811077a7>] ? do_mmap_pgoff+0x357/0x370
> [650697.716692]  [<ffffffff8110790d>] ? sys_mmap_pgoff+0x14d/0x220
> [650697.716697]  [<ffffffff811371b8>] ? do_sys_open+0x168/0x1d0
> [650697.716701]  [<ffffffff81421d5f>] page_fault+0x1f/0x30
> [650697.716704] ---[ end trace 4255de435c6def21 ]---
>
> BR,
> Paweł.
>

Hi Pawel,

I had the same issues. Either try the latest patch that was recommended,
or try the older ones (I am using these three and have not had a memory
error/oops/etc. in 24 hrs).

Before patches:
Aug 30 05:00:48 p34 kernel: [122150.720173]  [<ffffffff8103798a>] warn_slowpath_common+0x7a/0xb0
Sep 10 20:59:39 p34 kernel: [531189.671424]  [<ffffffff810379ba>] warn_slowpath_common+0x7a/0xb0

After patches:
(no errors)

Patches you need (against 3.1-rc4):

(for the igb problem/memory allocation issue)
0001-Fix-pointer-dereference-before-call-to-pcie_bus_conf.patch
0002-PCI-Remove-MRRS-modification-from-MPS-setting-code.patch

(for the RCU/memory errors)
0003-filemap.patch

I've attached them to this e-mail, they seem to have fixed all of my 
problems so far.

Justin.

[-- Attachment #2: Type: TEXT/x-diff; name=0003-filemap.patch, Size: 2108 bytes --]

From eric.dumazet@gmail.com Wed Sep 14 06:20:11 2011
Date: Wed, 14 Sep 2011 06:20:08
From: Eric Dumazet <eric.dumazet@gmail.com>
To: Justin Piszcz <jpiszcz@lucidpixels.com>
Cc: Lin Ming <mlin@ss.pku.edu.cn>, linux-kernel@vger.kernel.org, Alan Piszcz <ap@solarrain.com>, "Li, Shaohua" <shaohua.li@intel.com>, Andrew Morton <akpm@google.com>
Subject: Re: 3.0.1: pagevec_lookup+0x1d/0x30, SLAB issues?

On Wednesday 14 September 2011 at 05:47 -0400, Justin Piszcz wrote:
> 
> On Wed, 14 Sep 2011, Lin Ming wrote:
> 
> > On Mon, Sep 12, 2011 at 6:44 AM, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
> > Hi, Justin
> >
> > There is a similar bug report at:
> > http://marc.info/?t=131594190600005&r=1&w=2
> >
> > The attached patch from Shaohua fixed the bug.
> >
> > Could you give it a try?
> >
> 
> Hi Lin/LKML,
> 
> Can you please provide text patch files for what you want me to apply?
> I did read that e-mail thread and that could be the culprit; I will patch
> and apply as soon as someone points me to the patch locations :)

diff --git a/mm/filemap.c b/mm/filemap.c
index 645a080..7771871 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -827,13 +827,14 @@ unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
 {
 	unsigned int i;
 	unsigned int ret;
-	unsigned int nr_found;
+	unsigned int nr_found, nr_skip;
 
 	rcu_read_lock();
 restart:
 	nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree,
 				(void ***)pages, NULL, start, nr_pages);
 	ret = 0;
+	nr_skip = 0;
 	for (i = 0; i < nr_found; i++) {
 		struct page *page;
 repeat:
@@ -856,6 +857,7 @@ repeat:
 			 * here as an exceptional entry: so skip over it -
 			 * we only reach this from invalidate_mapping_pages().
 			 */
+			nr_skip++;
 			continue;
 		}
 
@@ -876,7 +878,7 @@ repeat:
 	 * If all entries were removed before we could secure them,
 	 * try again, because callers stop trying once 0 is returned.
 	 */
-	if (unlikely(!ret && nr_found))
+	if (unlikely(!ret && nr_found > nr_skip))
 		goto restart;
 	rcu_read_unlock();
 	return ret;


[-- Attachment #3: Type: TEXT/x-diff; name=0002-PCI-Remove-MRRS-modification-from-MPS-setting-code.patch, Size: 4518 bytes --]

From 74d81235f8e4bd60859d539a27e51d3a09d183cf Mon Sep 17 00:00:00 2001
From: Jon Mason <mason@myri.com>
Date: Thu, 8 Sep 2011 12:59:00 -0500
Subject: [PATCH 2/2] PCI: Remove MRRS modification from MPS setting code

Modifying the Maximum Read Request Size to 0 (a value of 128 bytes) has
massive negative ramifications on some devices.  Without knowing which
devices have this issue, do not modify from the default value when
walking the PCI-E bus in pcie_bus_safe mode.  Also, make pcie_bus_safe
the default procedure.

Tested-by: Sven Schnelle <svens@stackframe.org>
Tested-by: Simon Kirby <sim@hostway.ca>
Tested-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Reported-and-tested-by: Niels Ole Salscheider <niels_ole@salscheider-online.de>
References: https://bugzilla.kernel.org/show_bug.cgi?id=42162
Signed-off-by: Jon Mason <mason@myri.com>
---
 drivers/pci/pci.c   |    2 +-
 drivers/pci/probe.c |   41 ++++++++++++++++++++++-------------------
 2 files changed, 23 insertions(+), 20 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 0ce6742..4e84fd4 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -77,7 +77,7 @@ unsigned long pci_cardbus_mem_size = DEFAULT_CARDBUS_MEM_SIZE;
 unsigned long pci_hotplug_io_size  = DEFAULT_HOTPLUG_IO_SIZE;
 unsigned long pci_hotplug_mem_size = DEFAULT_HOTPLUG_MEM_SIZE;
 
-enum pcie_bus_config_types pcie_bus_config = PCIE_BUS_PERFORMANCE;
+enum pcie_bus_config_types pcie_bus_config = PCIE_BUS_SAFE;
 
 /*
  * The default CLS is used if arch didn't set CLS explicitly and not
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 0820fc1..b1187ff 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1396,34 +1396,37 @@ static void pcie_write_mps(struct pci_dev *dev, int mps)
 
 static void pcie_write_mrrs(struct pci_dev *dev, int mps)
 {
-	int rc, mrrs;
+	int rc, mrrs, dev_mpss;
 
-	if (pcie_bus_config == PCIE_BUS_PERFORMANCE) {
-		int dev_mpss = 128 << dev->pcie_mpss;
+	/* In the "safe" case, do not configure the MRRS.  There appear to be
+	 * issues with setting MRRS to 0 on a number of devices.
+	 */
 
-		/* For Max performance, the MRRS must be set to the largest
-		 * supported value.  However, it cannot be configured larger
-		 * than the MPS the device or the bus can support.  This assumes
-		 * that the largest MRRS available on the device cannot be
-		 * smaller than the device MPSS.
-		 */
-		mrrs = mps < dev_mpss ? mps : dev_mpss;
-	} else
-		/* In the "safe" case, configure the MRRS for fairness on the
-		 * bus by making all devices have the same size
-		 */
-		mrrs = mps;
+	if (pcie_bus_config != PCIE_BUS_PERFORMANCE)
+		return;
+
+	dev_mpss = 128 << dev->pcie_mpss;
 
+	/* For Max performance, the MRRS must be set to the largest supported
+	 * value.  However, it cannot be configured larger than the MPS the
+	 * device or the bus can support.  This assumes that the largest MRRS
+	 * available on the device cannot be smaller than the device MPSS.
+	 */
+	mrrs = min(mps, dev_mpss);
 
 	/* MRRS is a R/W register.  Invalid values can be written, but a
-	 * subsiquent read will verify if the value is acceptable or not.
+	 * subsequent read will verify if the value is acceptable or not.
 	 * If the MRRS value provided is not acceptable (e.g., too large),
 	 * shrink the value until it is acceptable to the HW.
  	 */
 	while (mrrs != pcie_get_readrq(dev) && mrrs >= 128) {
+		dev_warn(&dev->dev, "Attempting to modify the PCI-E MRRS value"
+			 " to %d.  If any issues are encountered, please try "
+			 "running with pci=pcie_bus_safe\n", mrrs);
 		rc = pcie_set_readrq(dev, mrrs);
 		if (rc)
-			dev_err(&dev->dev, "Failed attempting to set the MRRS\n");
+			dev_err(&dev->dev,
+				"Failed attempting to set the MRRS\n");
 
 		mrrs /= 2;
 	}
@@ -1436,13 +1439,13 @@ static int pcie_bus_configure_set(struct pci_dev *dev, void *data)
 	if (!pci_is_pcie(dev))
 		return 0;
 
-	dev_info(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n",
+	dev_dbg(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n",
 		 pcie_get_mps(dev), 128<<dev->pcie_mpss, pcie_get_readrq(dev));
 
 	pcie_write_mps(dev, mps);
 	pcie_write_mrrs(dev, mps);
 
-	dev_info(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n",
+	dev_dbg(&dev->dev, "Dev MPS %d MPSS %d MRRS %d\n",
 		 pcie_get_mps(dev), 128<<dev->pcie_mpss, pcie_get_readrq(dev));
 
 	return 0;
-- 
1.7.6
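
(A worked example of the clamping above, with illustrative numbers: a
device advertising pcie_mpss = 1 supports dev_mpss = 128 << 1 = 256
bytes, so with a bus mps of 512 the code picks mrrs = min(512, 256) =
256; the loop then only shrinks mrrs further if the hardware refuses
the write.)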


[-- Attachment #4: Type: TEXT/x-diff; name=0001-Fix-pointer-dereference-before-call-to-pcie_bus_conf.patch, Size: 2416 bytes --]

From cf822aed99fd8851d82ae5f2df11c29b79e316c8 Mon Sep 17 00:00:00 2001
From: Shyam Iyer <shyam.iyer.t@gmail.com>
Date: Wed, 31 Aug 2011 12:21:42 -0400
Subject: [PATCH 1/2] Fix pointer dereference before call to
 pcie_bus_configure_settings

There is a potential NULL pointer dereference in calls to
pcie_bus_configure_settings due to attempts to access pci_bus self
variables when the self pointer is NULL.  To correct this, verify that
the self pointer in pci_bus is non-NULL before dereferencing it.

Reported-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Shyam Iyer <shyam_iyer@dell.com>
Signed-off-by: Jon Mason <mason@myri.com>
---
 arch/x86/pci/acpi.c              |    9 +++++++--
 drivers/pci/hotplug/pcihp_slot.c |    4 +++-
 drivers/pci/probe.c              |    3 ---
 3 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/x86/pci/acpi.c b/arch/x86/pci/acpi.c
index c953302..039d913 100644
--- a/arch/x86/pci/acpi.c
+++ b/arch/x86/pci/acpi.c
@@ -365,8 +365,13 @@ struct pci_bus * __devinit pci_acpi_scan_root(struct acpi_pci_root *root)
 	 */
 	if (bus) {
 		struct pci_bus *child;
-		list_for_each_entry(child, &bus->children, node)
-			pcie_bus_configure_settings(child, child->self->pcie_mpss);
+		list_for_each_entry(child, &bus->children, node) {
+			struct pci_dev *self = child->self;
+			if (!self)
+				continue;
+
+			pcie_bus_configure_settings(child, self->pcie_mpss);
+		}
 	}
 
 	if (!bus)
diff --git a/drivers/pci/hotplug/pcihp_slot.c b/drivers/pci/hotplug/pcihp_slot.c
index 753b21a..3ffd9c1 100644
--- a/drivers/pci/hotplug/pcihp_slot.c
+++ b/drivers/pci/hotplug/pcihp_slot.c
@@ -169,7 +169,9 @@ void pci_configure_slot(struct pci_dev *dev)
 			(dev->class >> 8) == PCI_CLASS_BRIDGE_PCI)))
 		return;
 
-	pcie_bus_configure_settings(dev->bus, dev->bus->self->pcie_mpss);
+	if (dev->bus && dev->bus->self)
+		pcie_bus_configure_settings(dev->bus,
+					    dev->bus->self->pcie_mpss);
 
 	memset(&hpp, 0, sizeof(hpp));
 	ret = pci_get_hp_params(dev, &hpp);
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 8473727..0820fc1 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1456,9 +1456,6 @@ void pcie_bus_configure_settings(struct pci_bus *bus, u8 mpss)
 {
 	u8 smpss = mpss;
 
-	if (!bus->self)
-		return;
-
 	if (!pci_is_pcie(bus->self))
 		return;
 
-- 
1.7.6



* Re: [BUG] infinite loop in find_get_pages()
  2011-09-15 11:11       ` Justin Piszcz
@ 2011-09-15 12:04         ` Eric Dumazet
  2011-09-15 15:00           ` Paweł Sikora
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2011-09-15 12:04 UTC (permalink / raw)
  To: Justin Piszcz
  Cc: Pawel Sikora, Lin Ming, Andrew Morton, Linus Torvalds,
	linux-kernel, Andrew Morton, Toshiyuki Okajima, Dave Chinner,
	Hugh Dickins, Alan Piszcz

On Thursday 15 September 2011 at 07:11 -0400, Justin Piszcz wrote:
> 

> Before patches:
> Aug 30 05:00:48 p34 kernel: [122150.720173]  [<ffffffff8103798a>] warn_slowpath_common+0x7a/0xb0
> Sep 10 20:59:39 p34 kernel: [531189.671424]  [<ffffffff810379ba>] warn_slowpath_common+0x7a/0xb0
> 
> After patches:
> (no errors)
> 
> Patches you need (against 3.1-rc4):
> 
> (for the igb problem/memory allocation issue)
> 0001-Fix-pointer-dereference-before-call-to-pcie_bus_conf.patch
> 0002-PCI-Remove-MRRS-modification-from-MPS-setting-code.patch
> 
> (for the RCU/memory errors)
> 0003-filemap.patch
> 
> I've attached them to this e-mail, they seem to have fixed all of my 
> problems so far.
> 

Or just pull the latest Linus tree. No need to repost those patches over
and over ;)

From your local copy, do:

git pull https://github.com/torvalds/linux.git




* Re: [BUG] infinite loop in find_get_pages()
  2011-09-15 12:04         ` Eric Dumazet
@ 2011-09-15 15:00           ` Paweł Sikora
  2011-09-15 15:15             ` Eric Dumazet
  0 siblings, 1 reply; 24+ messages in thread
From: Paweł Sikora @ 2011-09-15 15:00 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Justin Piszcz, Lin Ming, Andrew Morton, Linus Torvalds,
	linux-kernel, Andrew Morton, Toshiyuki Okajima, Dave Chinner,
	Hugh Dickins, Alan Piszcz

 On Thu, 15 Sep 2011 14:04:15 +0200, Eric Dumazet wrote:
 
> Or just pull latest Linus tree. No need to repost those patches over 
> and
> over ;)
>
> from your local copy, do :
>
> git pull https://github.com/torvalds/linux.git

 I'm using the 3.0.x line and the mentioned patch won't be helpful
 (https://lkml.org/lkml/2011/9/14/271).



* Re: [BUG] infinite loop in find_get_pages()
  2011-09-15 15:00           ` Paweł Sikora
@ 2011-09-15 15:15             ` Eric Dumazet
  0 siblings, 0 replies; 24+ messages in thread
From: Eric Dumazet @ 2011-09-15 15:15 UTC (permalink / raw)
  To: Paweł Sikora
  Cc: Justin Piszcz, Lin Ming, Andrew Morton, Linus Torvalds,
	linux-kernel, Andrew Morton, Toshiyuki Okajima, Dave Chinner,
	Hugh Dickins, Alan Piszcz

On Thursday 15 September 2011 at 17:00 +0200, Paweł Sikora wrote:
> On Thu, 15 Sep 2011 14:04:15 +0200, Eric Dumazet wrote:
>  
> > Or just pull latest Linus tree. No need to repost those patches over 
> > and
> > over ;)
> >
> > from your local copy, do :
> >
> > git pull https://github.com/torvalds/linux.git
> 
>  I'm using the 3.0.x line and the mentioned patch won't be helpful
>  (https://lkml.org/lkml/2011/9/14/271).
> 

All mentioned patches are for 3.1 only.





end of thread, other threads:[~2011-09-15 15:15 UTC | newest]

Thread overview: 24+ messages
2011-09-13 19:23 [BUG] infinite loop in find_get_pages() Eric Dumazet
2011-09-13 23:53 ` Andrew Morton
2011-09-14  0:21   ` Eric Dumazet
2011-09-14  0:34   ` Lin Ming
2011-09-15 10:47     ` Pawel Sikora
2011-09-15 11:11       ` Justin Piszcz
2011-09-15 12:04         ` Eric Dumazet
2011-09-15 15:00           ` Paweł Sikora
2011-09-15 15:15             ` Eric Dumazet
     [not found] ` <CA+55aFyG3-3_gqGjqUmsTAHWfmNLMdQVf4XqUZrDAGMBxgur=Q@mail.gmail.com>
2011-09-14  6:48   ` Linus Torvalds
2011-09-14  6:53     ` Eric Dumazet
2011-09-14  7:32       ` Shaohua Li
2011-09-14  8:20         ` Shaohua Li
2011-09-14  8:43           ` Eric Dumazet
2011-09-14  8:55             ` Shaohua Li
2011-09-14 20:38               ` Hugh Dickins
2011-09-14 20:55                 ` Eric Dumazet
2011-09-14 21:53                   ` Hugh Dickins
2011-09-14 22:08                     ` Eric Dumazet
2011-09-14 22:37                     ` Linus Torvalds
2011-09-15  0:45                       ` Shaohua Li
2011-09-15  2:00                         ` Hugh Dickins
2011-09-15  4:02                         ` Eric Dumazet
     [not found] ` <CA+55aFx41_Z4TjjJwPuE21Q8oD3aGWtQwh45DUiCjPVD-wCJXw@mail.gmail.com>
2011-09-14  6:48   ` Linus Torvalds
