linux-kernel.vger.kernel.org archive mirror
* 2.4.20: Proccess stuck in __lock_page ...
@ 2003-05-27  3:41 manish
  2003-05-27  4:03 ` Marcelo Tosatti
  0 siblings, 1 reply; 142+ messages in thread
From: manish @ 2003-05-27  3:41 UTC (permalink / raw)
  To: linux-kernel, manish

Hello !

I am running the 2.4.20 kernel on a dual-CPU system with 3.5 GB RAM. 
I am running bonnie across four drives in parallel:

bonnie -s 1000 -d /<dir-name>

bdflush settings on this system:

[root@dyn-10-123-130-235 vm]# cat bdflush
2       50      32      100     50      300     1       0       0

All the bonnie processes and any other processes (like df, ps -ef, etc.) are 
hung in __lock_page. Breaking into kdb, I observe the following stack for one 
such bonnie process:

schedule(..)
__lock_page(..)
lock_page(..)
do_generic_file_read(..)
generic_file_read(..)

After this, the processes never leave the hung state. At times, a couple of 
bonnie processes complete, but the remaining bonnie processes and the other 
processes still hang.

I tried out the 2.5.33 kernel (one of the 2.5 series) and observed that 
the hang does not occur. If I run two bonnie processes, they never get 
stuck. Actually, if I run 4 parallel mke2fs, they too get stuck.

Any clues where this could be happening?

Thanks
-Manish



* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27  3:41 2.4.20: Proccess stuck in __lock_page manish
@ 2003-05-27  4:03 ` Marcelo Tosatti
  2003-05-27  4:25   ` manish
                     ` (2 more replies)
  0 siblings, 3 replies; 142+ messages in thread
From: Marcelo Tosatti @ 2003-05-27  4:03 UTC (permalink / raw)
  To: manish; +Cc: linux-kernel



On Mon, 26 May 2003, manish wrote:

> Hello !
>
> I am running the 2.4.20 kernel on a system with 3.5 GB RAM and dual CPU.
> I am running bonnie across four drives in parallel:
>
> bonnie -s 1000 -d /<dir-name>
>
> bdflush settings on this system:
>
> [root@dyn-10-123-130-235 vm]# cat bdflush
> 2       50      32      100     50      300     1       0       0
>
> All the bonnie process and any other process (like df, ps -ef etc.) are
> hung in __lock_page. Breaking into kdb, I observe the following for one
> such bonnie process:
>
> schedule(..)
> __lock_page(..)
> lock_page(..)
> do_generic_file_read(..)
> generic_file_read(..)
>
> After this, the processes never exit the hang. At times, a couple of
> bonnie processes complete but the hang still occurs with the remaining
> processes and with the other processes.
>
> I tried out the 2.5.33 kernel (one of the 2.5 series) and observed that
> the hang does not occur. If I run, two bonnie processes, they never get
> stuck. Actually, if I run 4 parallel mke2fs, they too get stuck.
>
> Any clues where this could be happening?

Hi,

Are you sure there is no disk activity ?

Run vmstat and check that, please.


* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27  4:03 ` Marcelo Tosatti
@ 2003-05-27  4:25   ` manish
  2003-05-27  4:59     ` Marcelo Tosatti
  2003-05-27  4:31   ` manish
  2003-05-27 14:14   ` Carl-Daniel Hailfinger
  2 siblings, 1 reply; 142+ messages in thread
From: manish @ 2003-05-27  4:25 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel

Marcelo Tosatti wrote:

>
>On Mon, 26 May 2003, manish wrote:
>
>>Hello !
>>
>>I am running the 2.4.20 kernel on a system with 3.5 GB RAM and dual CPU.
>>I am running bonnie across four drives in parallel:
>>
>>bonnie -s 1000 -d /<dir-name>
>>
>>bdflush settings on this system:
>>
>>[root@dyn-10-123-130-235 vm]# cat bdflush
>>2       50      32      100     50      300     1       0       0
>>
>>All the bonnie process and any other process (like df, ps -ef etc.) are
>>hung in __lock_page. Breaking into kdb, I observe the following for one
>>such bonnie process:
>>
>>schedule(..)
>>__lock_page(..)
>>lock_page(..)
>>do_generic_file_read(..)
>>generic_file_read(..)
>>
>>After this, the processes never exit the hang. At times, a couple of
>>bonnie processes complete but the hang still occurs with the remaining
>>processes and with the other processes.
>>
>>I tried out the 2.5.33 kernel (one of the 2.5 series) and observed that
>>the hang does not occur. If I run, two bonnie processes, they never get
>>stuck. Actually, if I run 4 parallel mke2fs, they too get stuck.
>>
>>Any clues where this could be happening?
>>
>
>Hi,
>
>Are you sure there is no disk activity ?
>
>Run vmstat and check that, please.
>
Hello !

Thanks for the response.

The light on the controller does not blink at all. Initially, it does 
blink. However, after this hang, it does not blink at all.

vmstat after the hang

 1  1  0    780 2056892   5784 1415324   0   0     0     4  102     7  49   1  50
 1  1  0    780 2056892   5784 1415324   0   0     0     4  102     9  49   1  50
 1  1  0    780 2056892   5784 1415324   0   0     0     5  104    10  29  21  50
 0  1  0    780 2056708   5784 1415324   0   0     0     1  104    12   0  13  86
 1  1  0    780 2222904   5784 1249396   0   0     0   172  126    25   0   4  96
 0  1  0    780 3081052   5784  391324   0   0     0   403  161    43   0  12  88
   procs                       memory    swap          io     system         cpu
 r  b  w   swpd    free   buff   cache  si  so    bi    bo   in    cs  us  sy  id
 0  1  0    780 3080952   5788  391408   0   0    29     9  120    72   0   0 100
 0  1  0    780 3080952   5788  391408   0   0     0     0  111    19   0   0 100
 0  1  0    780 3080952   5788  391408   0   0     0     1  103     9   0   0 100
 0  1  0    780 3080952   5788  391408   0   0     0     0  101     9   0   0 100
 0  1  0    780 3080952   5788  391408   0   0     0     0  101     7   0   0 100
 0  1  0    780 3080952   5788  391408   0   0     0     0  101     9   0   0 100
 0  1  0    780 3080952   5788  391408   0   0     0     0  102     9   0   0 100
 0  1  0    780 3080952   5788  391408   0   0     0     1  101     8   0   0 100
 0  1  0    780 3081308   5788  391420   0   0     0   231  150    92   3   0  97
 0  1  0    780 3081308   5788  391420   0   0     0     0  102     7   0   0 100
 0  1  0    780 3081308   5788  391420   0   0     0     0  102     7   0   0 100
 0  1  0    780 3081304   5788  391420   0   0     0     0  101     9   0   0 100
 0  1  0    780 3081304   5788  391420   0   0     0     0  102     8   0   0 100
 0  1  0    780 3081300   5788  391420   0   0     0     0  101     8   0   0 100
 0  1  0    780 3081300   5788  391420   0   0     0     0  101     9   0   0 100
 0  1  0    780 3081296   5788  391420   0   0     0     0  101     7   0   0 100
 0  1  0    780 3081296   5788  391420   0   0     0     0  101     9   0   0 100
 0  1  0    780 3081292   5788  391420   0   0     0     0  102     9   0   0 100
 0  1  0    780 3081292   5788  391420   0   0     0     0  101     8   0   0 100
 0  1  0    780 3081288   5788  391420   0   0     0     0  102     9   0   0 100
 0  1  0    780 3081288   5788  391420   0   0     0     0  102     7   0   0 100
 0  1  0    780 3081284   5788  391420   0   0     0     0  102     9   0   0 100
 0  1  0    780 3081284   5788  391420   0   0     0     0  102     8   0   0 100
 0  1  0    780 3081280   5788  391420   0   0     0     0  101     8   0   0 100
 0  1  0    780 3081276   5788  391420   0   0     0     0  102     9   0   0 100
 0  1  0    780 3081260   5788  391420   0   0     0     0  235    30   0   0 100
 0  1  0    780 3081260   5788  391420   0   0     0     0  101     9   0   0 100
 0  1  0    780 3081256   5788  391420   0   0     0     0  101     7   0   0 100
 0  1  0    780 3081248   5788  391424   0   0     0   169  137    54   3   1  97
 0  1  0    780 3081248   5788  391424   0   0     0     0  101     9   0   0 100
 0  1  0    780 3081248   5788  391424   0   0     0     0  101     8   0   0 100
 0  1  0    780 3081248   5788  391424   0   0     0     0  101     9   0   0 100

One bonnie process is hung.
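
The "is there any disk activity?" question can also be checked mechanically by looking at the bi/bo columns of each sample. What follows is only a minimal sketch: it assumes the 16-column vmstat layout shown above (including the r/b/w process columns) and that each sample sits on a single line. The helper names parse_vmstat and idle_io_fraction are made up for this illustration and are not part of vmstat or any other tool.

```python
# Column names assumed from the vmstat header shown in this message.
COLUMNS = ["r", "b", "w", "swpd", "free", "buff", "cache",
           "si", "so", "bi", "bo", "in", "cs", "us", "sy", "id"]

# Three samples copied from the output above, plus the header lines.
SAMPLE = """\
   procs                       memory    swap          io     system         cpu
 r  b  w   swpd    free   buff   cache  si  so    bi    bo   in    cs  us  sy  id
 0  1  0    780 3080952   5788  391408   0   0    29     9  120    72   0   0 100
 0  1  0    780 3080952   5788  391408   0   0     0     0  111    19   0   0 100
 0  1  0    780 3081308   5788  391420   0   0     0   231  150    92   3   0  97
"""


def parse_vmstat(text):
    """Return one dict per numeric sample line; header lines are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Keep only lines that have exactly one integer per known column.
        if len(fields) != len(COLUMNS) or not all(f.isdigit() for f in fields):
            continue
        rows.append(dict(zip(COLUMNS, map(int, fields))))
    return rows


def idle_io_fraction(rows):
    """Fraction of samples with no blocks read in (bi) or written out (bo)."""
    if not rows:
        return 0.0
    idle = sum(1 for row in rows if row["bi"] == 0 and row["bo"] == 0)
    return idle / len(rows)


rows = parse_vmstat(SAMPLE)
print(f"{len(rows)} samples, {idle_io_fraction(rows):.0%} with no block I/O")
```

A fraction close to 1 while a process stays in the blocked (b) column is consistent with a task waiting on I/O that is never issued, but it does not prove a deadlock on its own.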









* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27  4:03 ` Marcelo Tosatti
  2003-05-27  4:25   ` manish
@ 2003-05-27  4:31   ` manish
  2003-05-27 14:14   ` Carl-Daniel Hailfinger
  2 siblings, 0 replies; 142+ messages in thread
From: manish @ 2003-05-27  4:31 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel

Marcelo Tosatti wrote:

>
>On Mon, 26 May 2003, manish wrote:
>
>>Hello !
>>
>>I am running the 2.4.20 kernel on a system with 3.5 GB RAM and dual CPU.
>>I am running bonnie across four drives in parallel:
>>
>>bonnie -s 1000 -d /<dir-name>
>>
>>bdflush settings on this system:
>>
>>[root@dyn-10-123-130-235 vm]# cat bdflush
>>2       50      32      100     50      300     1       0       0
>>
>>All the bonnie process and any other process (like df, ps -ef etc.) are
>>hung in __lock_page. Breaking into kdb, I observe the following for one
>>such bonnie process:
>>
>>schedule(..)
>>__lock_page(..)
>>lock_page(..)
>>do_generic_file_read(..)
>>generic_file_read(..)
>>
>>After this, the processes never exit the hang. At times, a couple of
>>bonnie processes complete but the hang still occurs with the remaining
>>processes and with the other processes.
>>
>>I tried out the 2.5.33 kernel (one of the 2.5 series) and observed that
>>the hang does not occur. If I run, two bonnie processes, they never get
>>stuck. Actually, if I run 4 parallel mke2fs, they too get stuck.
>>
>>Any clues where this could be happening?
>>
>
>Hi,
>
>Are you sure there is no disk activity ?
>
>Run vmstat and check that, please.
>
Hello !

My bad. This is not a stock kernel: it is one in which the I/O subsystem was 
modified to replace io_request_lock with a finer-grained host_lock and queue_lock.

I also noticed that the hang occurs when the settings of bdflush are the 
following:

[root@dyn-10-123-130-235 vm]# cat bdflush
30      50      32      100     50      300     60      0       0
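
To see exactly which tunables differ between the two bdflush settings reported in this thread, the two lines can be compared position by position. This is a rough sketch only. The field names in KNOWN_FIELDS are an assumption based on how the 2.4-series bdflush parameters are usually described; positions without a name here are deliberately left unnamed, and all of it should be verified against the kernel tree actually in use. parse_bdflush and diff_settings are hypothetical helpers.

```python
# ASSUMPTION: names for the positions that 2.4-era documentation commonly
# describes. Positions not listed are left as "fieldN" on purpose.
KNOWN_FIELDS = {
    0: "nfract",       # % of buffer cache dirty before bdflush activates
    1: "ndirty",       # max dirty blocks written per wake-up
    4: "interval",     # delay between kupdate flushes
    5: "age_buffer",   # age of a dirty buffer before it is flushed
    6: "nfract_sync",  # % dirty at which writers flush synchronously
}

# The two settings quoted in this thread.
FIRST = "2       50      32      100     50      300     1       0       0"
SECOND = "30      50      32      100     50      300     60      0       0"


def parse_bdflush(line):
    """Split one line of /proc/sys/vm/bdflush into a list of integers."""
    return [int(value) for value in line.split()]


def diff_settings(old, new):
    """Return (name, old_value, new_value) for every position that differs."""
    changes = []
    for index, (a, b) in enumerate(zip(old, new)):
        if a != b:
            name = KNOWN_FIELDS.get(index, f"field{index + 1}")
            changes.append((name, a, b))
    return changes


for name, old, new in diff_settings(parse_bdflush(FIRST), parse_bdflush(SECOND)):
    print(f"{name}: {old} -> {new}")
```

Only the first and seventh values differ between the two settings, which suggests the hang does not depend on an aggressive flushing threshold, although that is an inference rather than something established in the thread.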

Thanks
-Manish







* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27  4:25   ` manish
@ 2003-05-27  4:59     ` Marcelo Tosatti
  2003-05-27 15:29       ` manish
  0 siblings, 1 reply; 142+ messages in thread
From: Marcelo Tosatti @ 2003-05-27  4:59 UTC (permalink / raw)
  To: manish; +Cc: linux-kernel



On Mon, 26 May 2003, manish wrote:

> Marcelo Tosatti wrote:
>
> >
> >On Mon, 26 May 2003, manish wrote:
> >
> >>Hello !
> >>
> >>I am running the 2.4.20 kernel on a system with 3.5 GB RAM and dual CPU.
> >>I am running bonnie across four drives in parallel:
> >>
> >>bonnie -s 1000 -d /<dir-name>
> >>
> >>bdflush settings on this system:
> >>
> >>[root@dyn-10-123-130-235 vm]# cat bdflush
> >>2       50      32      100     50      300     1       0       0
> >>
> >>All the bonnie process and any other process (like df, ps -ef etc.) are
> >>hung in __lock_page. Breaking into kdb, I observe the following for one
> >>such bonnie process:
> >>
> >>schedule(..)
> >>__lock_page(..)
> >>lock_page(..)
> >>do_generic_file_read(..)
> >>generic_file_read(..)
> >>
> >>After this, the processes never exit the hang. At times, a couple of
> >>bonnie processes complete but the hang still occurs with the remaining
> >>processes and with the other processes.
> >>
> >>I tried out the 2.5.33 kernel (one of the 2.5 series) and observed that
> >>the hang does not occur. If I run, two bonnie processes, they never get
> >>stuck. Actually, if I run 4 parallel mke2fs, they too get stuck.
> >>
> >>Any clues where this could be happening?
> >>
> >
> >Hi,
> >
> >Are you sure there is no disk activity ?
> >
> >Run vmstat and check that, please.
> >
> Hello !
>
> Thanks for the response.
>
>  The light on the controller does not blink at all. Initially, it does
> blink. However, after this hang, it does not at all.
>

Ok, and does it happen with the stock kernel?


* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27  4:03 ` Marcelo Tosatti
  2003-05-27  4:25   ` manish
  2003-05-27  4:31   ` manish
@ 2003-05-27 14:14   ` Carl-Daniel Hailfinger
  2003-05-27 14:28     ` William Lee Irwin III
  2003-05-27 17:27     ` Marcelo Tosatti
  2 siblings, 2 replies; 142+ messages in thread
From: Carl-Daniel Hailfinger @ 2003-05-27 14:14 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: manish, linux-kernel, Christian Klose, Marc-Christian Petersen,
	William Lee Irwin III

Christian,

this looks suspiciously like the problem you have been experiencing since
2.4.19-pre. Maybe we can fix this for good.

Marcelo Tosatti wrote:
> 
> On Mon, 26 May 2003, manish wrote:
> 
> 
>>Hello !
>>
>>I am running the 2.4.20 kernel on a system with 3.5 GB RAM and dual CPU.
>>I am running bonnie across four drives in parallel:
>>
>>bonnie -s 1000 -d /<dir-name>
>>
>>bdflush settings on this system:
>>
>>[root@dyn-10-123-130-235 vm]# cat bdflush
>>2       50      32      100     50      300     1       0       0
>>
>>All the bonnie process and any other process (like df, ps -ef etc.) are
>>hung in __lock_page. Breaking into kdb, I observe the following for one

Following is SysRq-T output for stuck processes during such a pause from
Christian Klose. Only processes in D state are listed for brevity.
Especially the last two call traces are interesting.

 kjournald     D C15C7240     4   122      1           123   120 (L-TLB)
 Call Trace:    [__get_request_wait+197/208] [__make_request+392/1472]
[generic_make_request+226/304] [submit_bh+80/112] [ll_rw_block+263/432]
[journal_commit_transaction+4017/4416] [kjournald+277/464]
[commit_timeout+0/16] [kernel_thread+46/64] [kjournald+0/464]
 kmail         D D73E9360  2656  1960      1                1978 (NOTLB)
 Call Trace:    [sleep_on+56/96] [log_wait_commit+56/80]
[journal_stop+345/480] [journal_force_commit+60/64]
[ext3_force_commit+35/48] [ext3_sync_file+132/176]
[ext3_writepage+0/672] [sys_fsync+151/208] [system_call+51/56]
 mc            D C016B338     0  2177   2152  2179               (NOTLB)
 Call Trace:    [journal_stop+328/480] [__lock_page+149/192]
[lock_page+26/32] [do_generic_file_read+653/1104]
[file_read_actor+0/160] [generic_file_read+178/368]
[file_read_actor+0/160] [sys_read+163/320] [system_call+51/56]
 kmail         D 00200282  2656  1960      1                1978 (NOTLB)
 Call Trace:    [sleep_on+56/96] [log_wait_commit+56/80]
[journal_stop+345/480] [journal_force_commit+60/64]
[ext3_force_commit+35/48] [ext3_sync_file+132/176]
[ext3_writepage+0/672] [sys_fsync+151/208] [system_call+51/56]
 mc            D C016B338     0  2177   2152  2179               (NOTLB)
 Call Trace:    [journal_stop+328/480] [__lock_page+149/192]
[lock_page+26/32] [do_generic_file_read+653/1104]
[file_read_actor+0/160] [generic_file_read+178/368]
[file_read_actor+0/160] [sys_read+163/320] [system_call+51/56]
 grep          D DFD7E120     0  3243   1470          3244       (NOTLB)
 Call Trace:    [__wait_on_buffer+93/144] [bread+123/144]
[ext3_get_branch+106/240] [ext3_get_block_handle+120/688]
[create_buffers+107/224] [ext3_get_block+74/144]
[block_read_full_page+541/624] [__alloc_pages+75/400]
[page_cache_read+173/208] [ext3_get_block+0/144]
[read_cluster_nonblocking+57/80] [filemap_nopage+285/560]
[do_no_page+137/480] [do_page_fault+376/1246] [handle_mm_fault+119/256]
[do_page_fault+376/1246] [rb_insert_color+210/240]
[do_page_fault+0/1246] [error_code+52/60] [clear_user+51/80]
[do_page_fault+0/1246] [error_code+52/60] [clear_user+51/80]
[padzero+40/48] [load_elf_binary+1179/2848] [load_elf_binary+0/2848]
[search_binary_handler+269/400] [copy_strings+440/560]
[do_execve+365/544] [sys_execve+66/128] [system_call+51/56]
 grep          D C02508D4     0  3244   1470          3245  3243 (NOTLB)
 Call Trace:    [__lock_page+149/192] [lock_page+26/32]
[filemap_nopage+305/560] [do_no_page+137/480] [do_page_fault+376/1246]
[handle_mm_fault+119/256] [do_page_fault+376/1246]
[rb_insert_color+210/240] [do_page_fault+0/1246] [error_code+52/60]
[clear_user+51/80] [do_page_fault+0/1246] [error_code+52/60]
[clear_user+51/80] [padzero+40/48] [load_elf_binary+1179/2848]
[__lock_page+175/192] [file_read_actor+0/160] [load_elf_binary+0/2848]
[search_binary_handler+269/400] [copy_strings+440/560]
[do_execve+365/544] [sys_execve+66/128] [system_call+51/56]
 grep          D C02508D4     0  3245   1470                3244 (NOTLB)
 Call Trace:    [__lock_page+149/192] [lock_page+26/32]
[filemap_nopage+305/560] [do_no_page+137/480] [do_page_fault+376/1246]
[handle_mm_fault+119/256] [do_page_fault+376/1246]
[rb_insert_color+210/240] [do_page_fault+0/1246] [error_code+52/60]
[clear_user+51/80] [do_page_fault+0/1246] [error_code+52/60]
[clear_user+51/80] [padzero+40/48] [load_elf_binary+1179/2848]
[__lock_page+175/192] [file_read_actor+0/160] [load_elf_binary+0/2848]
[search_binary_handler+269/400] [copy_strings+440/560]
[do_execve+365/544] [sys_execve+66/128] [system_call+51/56]
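
For longer SysRq-T dumps, picking out the D-state tasks that are waiting in a given function can be scripted. The sketch below assumes the 2.4-style layout shown above (a task line with name, state, address, free stack, pid, and so on, followed by bracketed [symbol+offset/size] frames) and uses abbreviated copies of three of the traces. parse_tasks and waiting_in are hypothetical helper names, and the regular expressions are a guess at the format rather than a specification of it.

```python
import re

# ASSUMPTION: task line is "name state address free-stack pid ..." as above.
TASK_RE = re.compile(r"^\s*(\S+)\s+([RSDTZW])\s+[0-9A-Fa-f]{8}\s+\d+\s+(\d+)\s")
# One stack frame, e.g. [__lock_page+149/192]; captures the symbol name.
FRAME_RE = re.compile(r"\[([A-Za-z_]\w*)\+\d+/\d+\]")

# Abbreviated copies of three traces from the output above.
SAMPLE = """\
 kjournald     D C15C7240     4   122      1           123   120 (L-TLB)
 Call Trace:    [__get_request_wait+197/208] [__make_request+392/1472]
[generic_make_request+226/304] [submit_bh+80/112] [ll_rw_block+263/432]
 mc            D C016B338     0  2177   2152  2179               (NOTLB)
 Call Trace:    [journal_stop+328/480] [__lock_page+149/192]
[lock_page+26/32] [do_generic_file_read+653/1104]
 grep          D C02508D4     0  3244   1470          3245  3243 (NOTLB)
 Call Trace:    [__lock_page+149/192] [lock_page+26/32]
[filemap_nopage+305/560] [do_no_page+137/480]
"""


def parse_tasks(text):
    """Group stack frames under the task line that precedes them."""
    tasks = []
    for line in text.splitlines():
        match = TASK_RE.match(line)
        if match:
            tasks.append({"name": match.group(1), "state": match.group(2),
                          "pid": int(match.group(3)), "frames": []})
        elif tasks:
            # Any other line contributes frames to the most recent task.
            tasks[-1]["frames"].extend(FRAME_RE.findall(line))
    return tasks


def waiting_in(tasks, symbol):
    """Tasks in uninterruptible sleep whose trace contains the symbol."""
    return [t for t in tasks if t["state"] == "D" and symbol in t["frames"]]


tasks = parse_tasks(SAMPLE)
for task in waiting_in(tasks, "__lock_page"):
    print(task["name"], task["pid"])
```

On the abbreviated sample this separates the tasks waiting on a page lock (mc, grep) from the one waiting for a free request (kjournald), which is the kind of split that helps decide where to look next.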


Regards,
Carl-Daniel



* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 14:14   ` Carl-Daniel Hailfinger
@ 2003-05-27 14:28     ` William Lee Irwin III
  2003-05-27 17:27     ` Marcelo Tosatti
  1 sibling, 0 replies; 142+ messages in thread
From: William Lee Irwin III @ 2003-05-27 14:28 UTC (permalink / raw)
  To: Carl-Daniel Hailfinger
  Cc: Marcelo Tosatti, manish, linux-kernel, Christian Klose,
	Marc-Christian Petersen

On Tue, May 27, 2003 at 04:14:51PM +0200, Carl-Daniel Hailfinger wrote:
> Christian,
> this looks suspiciously like the problem you are experiencing since
> 2.4.19-pre. Maybe we can fix this for good.

The most I know of this is that someone made it go away by backing out
some ll_rw_blk.c cset.


-- wli


* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27  4:59     ` Marcelo Tosatti
@ 2003-05-27 15:29       ` manish
  2003-05-27 16:59         ` Marcelo Tosatti
  0 siblings, 1 reply; 142+ messages in thread
From: manish @ 2003-05-27 15:29 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel

Marcelo Tosatti wrote:

>
>On Mon, 26 May 2003, manish wrote:
>
>>Marcelo Tosatti wrote:
>>
>>>On Mon, 26 May 2003, manish wrote:
>>>
>>>>Hello !
>>>>
>>>>I am running the 2.4.20 kernel on a system with 3.5 GB RAM and dual CPU.
>>>>I am running bonnie across four drives in parallel:
>>>>
>>>>bonnie -s 1000 -d /<dir-name>
>>>>
>>>>bdflush settings on this system:
>>>>
>>>>[root@dyn-10-123-130-235 vm]# cat bdflush
>>>>2       50      32      100     50      300     1       0       0
>>>>
>>>>All the bonnie process and any other process (like df, ps -ef etc.) are
>>>>hung in __lock_page. Breaking into kdb, I observe the following for one
>>>>such bonnie process:
>>>>
>>>>schedule(..)
>>>>__lock_page(..)
>>>>lock_page(..)
>>>>do_generic_file_read(..)
>>>>generic_file_read(..)
>>>>
>>>>After this, the processes never exit the hang. At times, a couple of
>>>>bonnie processes complete but the hang still occurs with the remaining
>>>>processes and with the other processes.
>>>>
>>>>I tried out the 2.5.33 kernel (one of the 2.5 series) and observed that
>>>>the hang does not occur. If I run, two bonnie processes, they never get
>>>>stuck. Actually, if I run 4 parallel mke2fs, they too get stuck.
>>>>
>>>>Any clues where this could be happening?
>>>>
>>>Hi,
>>>
>>>Are you sure there is no disk activity ?
>>>
>>>Run vmstat and check that, please.
>>>
>>Hello !
>>
>>Thanks for the response.
>>
>> The light on the controller does not blink at all. Initially, it does
>>blink. However, after this hang, it does not at all.
>>
>>
>
>Ok, and does it happen with the stock kernel?
>
Yes, it happens with the stock kernel too, but only after long hours of runtime.

Thanks
-Manish





* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 15:29       ` manish
@ 2003-05-27 16:59         ` Marcelo Tosatti
  0 siblings, 0 replies; 142+ messages in thread
From: Marcelo Tosatti @ 2003-05-27 16:59 UTC (permalink / raw)
  To: manish; +Cc: lkml



On Tue, 27 May 2003, manish wrote:

> >Ok, and does it happen with the stock kernel?
> Yes, with the stock kernel too but after long hrs of runtime ..

Could you try Alt+SysRq+T and send us the output on the locked STOCK
kernel please?


* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 14:14   ` Carl-Daniel Hailfinger
  2003-05-27 14:28     ` William Lee Irwin III
@ 2003-05-27 17:27     ` Marcelo Tosatti
  2003-05-27 17:36       ` Marc-Christian Petersen
                         ` (2 more replies)
  1 sibling, 3 replies; 142+ messages in thread
From: Marcelo Tosatti @ 2003-05-27 17:27 UTC (permalink / raw)
  To: Carl-Daniel Hailfinger
  Cc: manish, linux-kernel, Christian Klose, Marc-Christian Petersen,
	William Lee Irwin III



On Tue, 27 May 2003, Carl-Daniel Hailfinger wrote:

> Christian,
>
> this looks suspiciously like the problem you are experiencing since
> 2.4.19-pre. Maybe we can fix this for good.
>
> Marcelo Tosatti wrote:
> >
> > On Mon, 26 May 2003, manish wrote:
> >
> >
> >>Hello !
> >>
> >>I am running the 2.4.20 kernel on a system with 3.5 GB RAM and dual CPU.
> >>I am running bonnie across four drives in parallel:
> >>
> >>bonnie -s 1000 -d /<dir-name>
> >>
> >>bdflush settings on this system:
> >>
> >>[root@dyn-10-123-130-235 vm]# cat bdflush
> >>2       50      32      100     50      300     1       0       0
> >>
> >>All the bonnie process and any other process (like df, ps -ef etc.) are
> >>hung in __lock_page. Breaking into kdb, I observe the following for one
>
> Following is SysRq-T output for stuck processes during such a pause from
> Christian Klose. Only processes in D state are listed for brevity.
> Especially the last two call traces are interesting.

A "pause" is perfectly fine (to some extent, of course), but a hang is
not. Is this backtrace from a hung, unusable kernel, or not?


* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 17:27     ` Marcelo Tosatti
@ 2003-05-27 17:36       ` Marc-Christian Petersen
  2003-05-27 17:47         ` Marcelo Tosatti
  2003-05-27 17:36       ` William Lee Irwin III
  2003-05-27 17:38       ` Carl-Daniel Hailfinger
  2 siblings, 1 reply; 142+ messages in thread
From: Marc-Christian Petersen @ 2003-05-27 17:36 UTC (permalink / raw)
  To: linux-kernel, Marcelo Tosatti, Carl-Daniel Hailfinger
  Cc: manish, Christian Klose, William Lee Irwin III

On Tuesday 27 May 2003 19:27, Marcelo Tosatti wrote:

Hi Marcelo,

> > Following is SysRq-T output for stuck processes during such a pause from
> > Christian Klose. Only processes in D state are listed for brevity.
> > Especially the last two call traces are interesting.
> A "pause" is perfectly fine (to some extent, of course), now a hang is
> not. Is this backtrace from a hanged, unusable kernel or ?
A pause is _not_ perfectly fine, not even to some extent. The pause we are 
discussing is a pause of the _whole_ machine, not just a disk I/O pause. 
Mouse stops, keyboard stops, everything stops, who knows wtf.

That behaviour is absolutely bullshit for desktop users. For server usage you 
may not notice it to this degree (mostly no X, so no mouse), but for a 
server environment this may also be very bad.

ciao, Marc



* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 17:27     ` Marcelo Tosatti
  2003-05-27 17:36       ` Marc-Christian Petersen
@ 2003-05-27 17:36       ` William Lee Irwin III
  2003-05-27 17:38       ` Carl-Daniel Hailfinger
  2 siblings, 0 replies; 142+ messages in thread
From: William Lee Irwin III @ 2003-05-27 17:36 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Carl-Daniel Hailfinger, manish, linux-kernel, Christian Klose,
	Marc-Christian Petersen

On Tue, 27 May 2003, Carl-Daniel Hailfinger wrote:
>> Following is SysRq-T output for stuck processes during such a pause from
>> Christian Klose. Only processes in D state are listed for brevity.
>> Especially the last two call traces are interesting.

On Tue, May 27, 2003 at 02:27:00PM -0300, Marcelo Tosatti wrote:
> A "pause" is perfectly fine (to some extent, of course), now a hang is
> not. Is this backtrace from a hanged, unusable kernel or ?

This sounds like deadlocked processes, but not a whole-system hang.
Sounds like a correctness issue, not a performance issue.


-- wli


* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 17:27     ` Marcelo Tosatti
  2003-05-27 17:36       ` Marc-Christian Petersen
  2003-05-27 17:36       ` William Lee Irwin III
@ 2003-05-27 17:38       ` Carl-Daniel Hailfinger
  2003-05-27 17:50         ` manish
  2 siblings, 1 reply; 142+ messages in thread
From: Carl-Daniel Hailfinger @ 2003-05-27 17:38 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: manish, linux-kernel, Christian Klose, Marc-Christian Petersen,
	William Lee Irwin III

Marcelo Tosatti wrote:
> 
> On Tue, 27 May 2003, Carl-Daniel Hailfinger wrote:
> 
>>Marcelo Tosatti wrote:
>>
>>>On Mon, 26 May 2003, manish wrote:
>>>>All the bonnie process and any other process (like df, ps -ef etc.) are
>>>>hung in __lock_page. Breaking into kdb, I observe the following for one
>>
>>Following is SysRq-T output for stuck processes during such a pause from
>>Christian Klose. Only processes in D state are listed for brevity.
>>Especially the last two call traces are interesting.
> 
> A "pause" is perfectly fine (to some extent, of course), now a hang is
> not. Is this backtrace from a hanged, unusable kernel or ?

AFAIK, the kernel is not unusable, but a 20 second pause with no disk
access at all is not nice either.


Regards,
Carl-Daniel



* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 17:36       ` Marc-Christian Petersen
@ 2003-05-27 17:47         ` Marcelo Tosatti
  2003-05-27 17:52           ` Marc-Christian Petersen
                             ` (2 more replies)
  0 siblings, 3 replies; 142+ messages in thread
From: Marcelo Tosatti @ 2003-05-27 17:47 UTC (permalink / raw)
  To: Marc-Christian Petersen
  Cc: linux-kernel, Carl-Daniel Hailfinger, manish, Christian Klose,
	William Lee Irwin III


On Tue, 27 May 2003, Marc-Christian Petersen wrote:

> On Tuesday 27 May 2003 19:27, Marcelo Tosatti wrote:
>
> Hi Marcelo,
>
> > > Following is SysRq-T output for stuck processes during such a pause from
> > > Christian Klose. Only processes in D state are listed for brevity.
> > > Especially the last two call traces are interesting.
> > A "pause" is perfectly fine (to some extent, of course), now a hang is
> > not. Is this backtrace from a hanged, unusable kernel or ?
> A pause is _not_ perfectly fine, even not to some extent. That pause we are
> discussing about is a pause of the _whole_ machine, not just disk i/o pauses.
> Mouse stops, keyboard stops, everything stops, who knows wtf.

Do you also notice them?


> That behaviour is absolutely bullshit for desktop users. For serverusage you
> may not notice it in this dimension (mostly no X so no mouse), but also for a
> server environment this may be very bad.

Agreed.


* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 17:38       ` Carl-Daniel Hailfinger
@ 2003-05-27 17:50         ` manish
  2003-05-27 18:04           ` Marc-Christian Petersen
  0 siblings, 1 reply; 142+ messages in thread
From: manish @ 2003-05-27 17:50 UTC (permalink / raw)
  To: Carl-Daniel Hailfinger
  Cc: Marcelo Tosatti, linux-kernel, Christian Klose,
	Marc-Christian Petersen, William Lee Irwin III

Carl-Daniel Hailfinger wrote:

>Marcelo Tosatti wrote:
>
>>On Tue, 27 May 2003, Carl-Daniel Hailfinger wrote:
>>
>>>Marcelo Tosatti wrote:
>>>
>>>>On Mon, 26 May 2003, manish wrote:
>>>>
>>>>>All the bonnie process and any other process (like df, ps -ef etc.) are
>>>>>hung in __lock_page. Breaking into kdb, I observe the following for one
>>>>>
>>>Following is SysRq-T output for stuck processes during such a pause from
>>>Christian Klose. Only processes in D state are listed for brevity.
>>>Especially the last two call traces are interesting.
>>>
>>A "pause" is perfectly fine (to some extent, of course), now a hang is
>>not. Is this backtrace from a hanged, unusable kernel or ?
>>
>
>AFAIK, the kernel is not unusable, but a 20 second pause with no disk
>access at all is not nice either.
>
>
>Regards,
>Carl-Daniel
>
Hello !

It is not a system hang; rather, the processes hang, all showing the same
stack trace. This is certainly not a pause, since the bonnie processes that
were hung (or deadlocked) never completed even after several hours. The
stack trace was the same.

Thanks
Manish





^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 17:47         ` Marcelo Tosatti
@ 2003-05-27 17:52           ` Marc-Christian Petersen
  2003-05-27 17:57             ` Marcelo Tosatti
  2003-05-27 17:53           ` manish
  2003-05-27 18:12           ` Matthias Mueller
  2 siblings, 1 reply; 142+ messages in thread
From: Marc-Christian Petersen @ 2003-05-27 17:52 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, Carl-Daniel Hailfinger, manish, Christian Klose,
	William Lee Irwin III

On Tuesday 27 May 2003 19:47, Marcelo Tosatti wrote:

Hi Marcelo,

> > A pause is _not_ perfectly fine, even not to some extent. That pause we
> > are discussing about is a pause of the _whole_ machine, not just disk i/o
> > pauses. Mouse stops, keyboard stops, everything stops, who knows wtf.
> Do you also notice them?
I do, and so do people I know; the number of affected people that _I_ alone
know is about 30. I reported this problem over a year ago, in 2.4.19-pre time.

> > That behaviour is absolutely bullshit for desktop users. For serverusage
> > you may not notice it in this dimension (mostly no X so no mouse), but
> > also for a server environment this may be very bad.
> Agreed.
thanks =)

ciao, Marc


^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 17:47         ` Marcelo Tosatti
  2003-05-27 17:52           ` Marc-Christian Petersen
@ 2003-05-27 17:53           ` manish
  2003-05-27 18:01             ` Marc-Christian Petersen
  2003-05-27 18:12           ` Matthias Mueller
  2 siblings, 1 reply; 142+ messages in thread
From: manish @ 2003-05-27 17:53 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Marc-Christian Petersen, linux-kernel, Carl-Daniel Hailfinger,
	Christian Klose, William Lee Irwin III

Marcelo Tosatti wrote:

>On Tue, 27 May 2003, Marc-Christian Petersen wrote:
>
>>On Tuesday 27 May 2003 19:27, Marcelo Tosatti wrote:
>>
>>Hi Marcelo,
>>
>>>>Following is SysRq-T output for stuck processes during such a pause from
>>>>Christian Klose. Only processes in D state are listed for brevity.
>>>>Especially the last two call traces are interesting.
>>>>
>>>A "pause" is perfectly fine (to some extent, of course), now a hang is
>>>not. Is this backtrace from a hanged, unusable kernel or ?
>>>
>>A pause is _not_ perfectly fine, even not to some extent. That pause we are
>>discussing about is a pause of the _whole_ machine, not just disk i/o pauses.
>>Mouse stops, keyboard stops, everything stops, who knows wtf.
>>
>
>Do you also notice them?
>
>
>>That behaviour is absolutely bullshit for desktop users. For serverusage you
>>may not notice it in this dimension (mostly no X so no mouse), but also for a
>>server environment this may be very bad.
>>
>
>Agreed.
>
Hi Marc,

With respect to the hangs that you noticed, did the processes complete 
after a "pause" or did they stay hung (deadlocked)?

Thanks
Manish




^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 17:52           ` Marc-Christian Petersen
@ 2003-05-27 17:57             ` Marcelo Tosatti
  2003-05-27 18:08               ` Marc-Christian Petersen
  2003-05-27 18:09               ` manish
  0 siblings, 2 replies; 142+ messages in thread
From: Marcelo Tosatti @ 2003-05-27 17:57 UTC (permalink / raw)
  To: Marc-Christian Petersen
  Cc: linux-kernel, Carl-Daniel Hailfinger, manish, Christian Klose,
	William Lee Irwin III



On Tue, 27 May 2003, Marc-Christian Petersen wrote:

> On Tuesday 27 May 2003 19:47, Marcelo Tosatti wrote:
>
> Hi Marcelo,
>
> > > A pause is _not_ perfectly fine, even not to some extent. That pause we
> > > are discussing about is a pause of the _whole_ machine, not just disk i/o
> > > pauses. Mouse stops, keyboard stops, everything stops, who knows wtf.
> > Do you also notice them?
> I do, people I know do also, numbers of those people only _I_ know are about
> ~30. I've reported this problem over a year ago while 2.4.19-pre time.

Can you please try to reproduce it with -aa?

> > > That behaviour is absolutely bullshit for desktop users. For serverusage
> > > you may not notice it in this dimension (mostly no X so no mouse), but
> > > also for a server environment this may be very bad.
> > Agreed.

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 17:53           ` manish
@ 2003-05-27 18:01             ` Marc-Christian Petersen
  2003-05-27 18:16               ` Marcelo Tosatti
  0 siblings, 1 reply; 142+ messages in thread
From: Marc-Christian Petersen @ 2003-05-27 18:01 UTC (permalink / raw)
  To: manish, Marcelo Tosatti
  Cc: linux-kernel, Carl-Daniel Hailfinger, Christian Klose,
	William Lee Irwin III

On Tuesday 27 May 2003 19:53, manish wrote:

Hi Manish,

> With respect to the hangs that you noticed, did the processes complete
> after a "pause" or did they stay hung (deadlocked)?
Yes, they complete. No process ever gets deadlocked or anything like that. The
whole system just does _nothing_ for a while (1-15 seconds, it varies).
_Sometimes_ (not always) even a ping is stopped for as long as the machine
does nothing but pause.

Also not a hardware problem. I made this clear before reporting this bug. 
Tested tons of different hardware, different drivers for the network card 
etc.

I repeat this now for the $high_number'th time ;):
- 2.4.18 worked perfect
- 2.4.19-pre not

ciao, Marc



^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 17:50         ` manish
@ 2003-05-27 18:04           ` Marc-Christian Petersen
  2003-05-27 23:06             ` Georg Nikodym
                               ` (3 more replies)
  0 siblings, 4 replies; 142+ messages in thread
From: Marc-Christian Petersen @ 2003-05-27 18:04 UTC (permalink / raw)
  To: manish, Carl-Daniel Hailfinger, Andrea Arcangeli
  Cc: Marcelo Tosatti, linux-kernel, Christian Klose, William Lee Irwin III

On Tuesday 27 May 2003 19:50, manish wrote:

Hi Manish,

> It is not a system hang but the processes hang showing the same stack
> trace. This is certainly not a pause since the bonnie processes that
> were hung (or deadlocked) never completed after several hrs. The stack
> trace  was the same.
Then you are hitting a different bug, or a bug related to the issues Christian
Klose, I and $tons of others were complaining about.

The bug you are hitting might be the "process stuck in D state" problem that
Andrea Arcangeli fixed, let me guess, over half a year ago or so.

If you feel like trying to address your issue, you might want to try out the
patch you can find here:

http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.21rc2aa1/9980_fix-pausing-2

ALL: Anyone who has this kind of pauses/stops/mouse is dead/keyboard is dead/:
     speak _NOW_ please, doesn't matter who you are!

I've added Andrea into CC.

ciao, Marc



^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 17:57             ` Marcelo Tosatti
@ 2003-05-27 18:08               ` Marc-Christian Petersen
  2003-05-27 18:25                 ` Andrea Arcangeli
  2003-05-27 18:09               ` manish
  1 sibling, 1 reply; 142+ messages in thread
From: Marc-Christian Petersen @ 2003-05-27 18:08 UTC (permalink / raw)
  To: Marcelo Tosatti, Andrea Arcangeli
  Cc: linux-kernel, Carl-Daniel Hailfinger, manish, Christian Klose,
	William Lee Irwin III

On Tuesday 27 May 2003 19:57, Marcelo Tosatti wrote:

Hi Marcelo,

> > I do, people I know do also, numbers of those people only _I_ know are
> > about ~30. I've reported this problem over a year ago while 2.4.19-pre
> > time.
> Can you please try to reproduce it with -aa?
not again ;)

I've tried almost all known kernel trees around, and every kernel has the same
effect. I even tried SuSE and Redhat kernels.

I've 'wasted' tons of time just to find a solution for it.

Andrea introduced his lowlatency elevator to address _exactly_ this problem
(pauses, stops, mouse is dead etc.). Side effect: it decreases i/o throughput,
and the "pauses/stops" are still there. Much less, but not gone.

The _only_ workaround yet (known to the public) is to change nr_requests in 
drivers/block/ll_rw_blk.c from 128 to 4 which gives a performance hit of 
about 40% (not acceptable in any way).

.oO( I am quite sure I've mailed you all this stuff privately in response to 
      your private mail to me ;) )Oo.

ciao, Marc


^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 17:57             ` Marcelo Tosatti
  2003-05-27 18:08               ` Marc-Christian Petersen
@ 2003-05-27 18:09               ` manish
  1 sibling, 0 replies; 142+ messages in thread
From: manish @ 2003-05-27 18:09 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Marc-Christian Petersen, linux-kernel, Carl-Daniel Hailfinger,
	Christian Klose, William Lee Irwin III

Marcelo Tosatti wrote:

>
>On Tue, 27 May 2003, Marc-Christian Petersen wrote:
>
>>On Tuesday 27 May 2003 19:47, Marcelo Tosatti wrote:
>>
>>Hi Marcelo,
>>
>>>>A pause is _not_ perfectly fine, even not to some extent. That pause we
>>>>are discussing about is a pause of the _whole_ machine, not just disk i/o
>>>>pauses. Mouse stops, keyboard stops, everything stops, who knows wtf.
>>>>
>>>Do you also notice them?
>>>
>>I do, people I know do also, numbers of those people only _I_ know are about
>>~30. I've reported this problem over a year ago while 2.4.19-pre time.
>>
>
>Can you please try to reproduce it with -aa?
>
>>>>That behaviour is absolutely bullshit for desktop users. For serverusage
>>>>you may not notice it in this dimension (mostly no X so no mouse), but
>>>>also for a server environment this may be very bad.
>>>>
>>>Agreed.
>>>
Hello !

After several tests, I have noticed that I can reproduce this problem
easily when my bdflush settings are:

30   50      32      100     50      300   60       0       0

and it occurs much less frequently when my settings are:

2       50      32      100     50      300     1       0       0
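
To see which positions actually differ between the two bdflush lines above,
a minimal sketch in Python follows. The helper name diff_fields is
hypothetical, and the fields are deliberately identified only by position,
since the mail does not name them.

```python
def diff_fields(a: str, b: str):
    """Return (1-based position, value_a, value_b) for each differing field."""
    fa, fb = a.split(), b.split()
    if len(fa) != len(fb):
        raise ValueError("the two settings lines have different field counts")
    return [(i, x, y)
            for i, (x, y) in enumerate(zip(fa, fb), start=1)
            if x != y]

# The two settings lines quoted in the mail above.
hang_prone = "30   50      32      100     50      300   60       0       0"
less_prone = "2       50      32      100     50      300     1       0       0"

for pos, x, y in diff_fields(hang_prone, less_prone):
    print(f"field {pos}: {x} -> {y}")
```

Only the first and seventh fields differ between the two configurations, so
those two are the natural candidates to vary one at a time when narrowing
down the trigger.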


Right now, I noticed the following stack trace for one such stuck process:

sys_read
generic_file_read
do_generic_file_read
page_cache_read
__alloc_pages
balance_classzone
try_to_free_pages
shrink_caches
shrink_cache
try_to_release_page
try_to_free_buffer
sync_page_buffers
wait_on_buffer
__wait_on_buffer
schedule

Thanks
-Manish










^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 17:47         ` Marcelo Tosatti
  2003-05-27 17:52           ` Marc-Christian Petersen
  2003-05-27 17:53           ` manish
@ 2003-05-27 18:12           ` Matthias Mueller
  2 siblings, 0 replies; 142+ messages in thread
From: Matthias Mueller @ 2003-05-27 18:12 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Marc-Christian Petersen, linux-kernel, Carl-Daniel Hailfinger,
	manish, Christian Klose, William Lee Irwin III

Hi,

On Tue, May 27, 2003 at 02:47:24PM -0300, Marcelo Tosatti wrote:
> On Tue, 27 May 2003, Marc-Christian Petersen wrote:
> > > A "pause" is perfectly fine (to some extent, of course), now a hang is
> > > not. Is this backtrace from a hanged, unusable kernel or ?
> > A pause is _not_ perfectly fine, even not to some extent. That pause we are
> > discussing about is a pause of the _whole_ machine, not just disk i/o pauses.
> > Mouse stops, keyboard stops, everything stops, who knows wtf.
> 
> Do you also notice them?

Since 2.4.19 I have noticed a lot of pauses with interactive work (desktop
usage). If I copy a big file over the network or on local disk, some of my
desktop machines simply don't respond to user requests anymore (e.g. I
start copying a large file over nfs to local disk and start mozilla;
mozilla won't start until the copy is finished).
My current testcase is: dd if=/dev/zero of=blubber bs=4096 count=65000 and
moving the mouse during this operation. With 2.4.18 everything is ok, the
mouse runs smoothly the whole time. With 2.4.19 and later I get mouse hangs:
it won't move for a second, sometimes longer. wolk reduces this problem, but
doesn't solve it.
On my servers (mostly IBM xseries 345 and 335) it's ok with a
vanilla-kernel, but there is no interactive work, mostly routing or
network monitoring.
I hope, I can run a vanilla 2.4 kernel again on my machines, at the moment
that isn't possible.
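
For scale, the dd test case above writes bs * count bytes of zeros. A
minimal sketch of that arithmetic in Python (the variable names are
illustrative assumptions, not part of the original test case):

```python
# Parameters taken from the dd command line quoted above.
bs = 4096        # block size in bytes
count = 65000    # number of blocks written

total_bytes = bs * count
total_mib = total_bytes / 2**20  # 1 MiB = 2**20 bytes

print(f"{total_bytes} bytes, about {total_mib:.1f} MiB")
```

So the test writes roughly a quarter of a gigabyte in one sequential stream.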

Bye,
Matthias
-- 
Matthias.Mueller@rz.uni-karlsruhe.de
Rechenzentrum Universitaet Karlsruhe
Abteilung Netze

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 18:01             ` Marc-Christian Petersen
@ 2003-05-27 18:16               ` Marcelo Tosatti
  2003-05-27 18:25                 ` Marc-Christian Petersen
  0 siblings, 1 reply; 142+ messages in thread
From: Marcelo Tosatti @ 2003-05-27 18:16 UTC (permalink / raw)
  To: Marc-Christian Petersen
  Cc: manish, linux-kernel, Carl-Daniel Hailfinger, Christian Klose,
	William Lee Irwin III



On Tue, 27 May 2003, Marc-Christian Petersen wrote:

> On Tuesday 27 May 2003 19:53, manish wrote:
>
> Hi Manish,
>
> > With respect to the hangs that you noticed, did the processes complete
> > after a "pause" or did they stay hung (deadlocked)?
> yes, no processes get ever deadlocked nor anything else in this area. The
> whole system just does _nothing_ for an amount of time (1-15 seconds,
> depends). _Sometimes_ (not always) even a ping is stoped for the amount of
> time the machine does nothing but pausing.
>
> Also not a hardware problem. I made this clear before reporting this bug.
> Tested tons of different hardware, different drivers for the network card
> etc.
>
> I repeat this now for the $high_number'th time ;):
> - 2.4.18 worked perfect
> - 2.4.19-pre not

That's very useful information. Can you track down which -pre introduced
the hangs?

Thanks!

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 18:16               ` Marcelo Tosatti
@ 2003-05-27 18:25                 ` Marc-Christian Petersen
  0 siblings, 0 replies; 142+ messages in thread
From: Marc-Christian Petersen @ 2003-05-27 18:25 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: manish, linux-kernel, Carl-Daniel Hailfinger, Christian Klose,
	William Lee Irwin III

On Tuesday 27 May 2003 20:16, Marcelo Tosatti wrote:

Hi Marcelo,

> > I repeat this now for the $high_number'th time ;):
> > - 2.4.18 worked perfect
> > - 2.4.19-pre not
> Thats very useful information. Can you track down which -pre introduced
> the hangs?
If I am not on drugs and was not on drugs during my last test, the patch
causing it is this one:

http://linux.bkbits.net:8080/linux-2.4/diffs/drivers/block/ll_rw_blk.c@1.29?nav=index.html|ChangeSet@-2y|cset@1.160|hist/drivers/block/ll_rw_blk.c

ciao, Marc


^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 18:08               ` Marc-Christian Petersen
@ 2003-05-27 18:25                 ` Andrea Arcangeli
  2003-05-27 18:33                   ` Marcelo Tosatti
  2003-05-27 18:35                   ` Marc-Christian Petersen
  0 siblings, 2 replies; 142+ messages in thread
From: Andrea Arcangeli @ 2003-05-27 18:25 UTC (permalink / raw)
  To: Marc-Christian Petersen
  Cc: Marcelo Tosatti, linux-kernel, Carl-Daniel Hailfinger, manish,
	Christian Klose, William Lee Irwin III

On Tue, May 27, 2003 at 08:08:43PM +0200, Marc-Christian Petersen wrote:
> On Tuesday 27 May 2003 19:57, Marcelo Tosatti wrote:
> 
> Hi Marcelo,
> 
> > > I do, people I know do also, numbers of those people only _I_ know are
> > > about ~30. I've reported this problem over a year ago while 2.4.19-pre
> > > time.
> > Can you please try to reproduce it with -aa?
> not again ;)
> 
> I've tried almost all known kernel tree's around, every kernel has the same 
> effect. I even tried SuSE and Redhat Kernels.
> 
> I've 'wasted' tons of time just find a solution for it.
> 
> Andrea introduced, to address _exact_ this problem (pauses, stops, mouse is 
> dead etc.), his lowlatency elevator. Side effect: decreases i/o throughput, 

It does not exactly decrease I/O throughput: the latest I/O benchmarks I have
seen from Randy (dbench/tiotest/bonnie/etc..) were still the fastest, and they
included the lowlatency elevator patch. So it may not help latency, but
it doesn't hurt in the numbers, at least not in the high end (that in
theory is the one that needs the overkill length in the I/O queue most).

However it definitely helps latency for me and I had a number of
positive reports.

Also make sure that you run elvtune -r 0 -w 0 /dev/hda. The journaling
may also affect the latency, so you can try with plain ext2 to be sure it's
not a fs issue.

The lowlatency elevator patch may not be perfect, but it definitely seems
to work better here. Especially since there's no apparent throughput
loss, it makes lots of sense to keep it applied; otherwise it would waste
lots of ram for apparently no gain.

> and the "pauses/stops" are still there. Much less but not gone.
> 
> The _only_ workaround yet (known to the public) is to change nr_requests in 
> drivers/block/ll_rw_blk.c from 128 to 4 which gives a performance hit of 
> about 40% (not acceptable in any way).
> 
> .oO( I am quite sure I've mailed you all this stuff privately in response to 
>       your private mail to me ;) )Oo.
> 
> ciao, Marc
> 


Andrea

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 18:25                 ` Andrea Arcangeli
@ 2003-05-27 18:33                   ` Marcelo Tosatti
  2003-05-27 18:39                     ` Marc-Christian Petersen
                                       ` (2 more replies)
  2003-05-27 18:35                   ` Marc-Christian Petersen
  1 sibling, 3 replies; 142+ messages in thread
From: Marcelo Tosatti @ 2003-05-27 18:33 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Marc-Christian Petersen, linux-kernel, Carl-Daniel Hailfinger,
	manish, Christian Klose, William Lee Irwin III


On Tue, 27 May 2003, Andrea Arcangeli wrote:

> On Tue, May 27, 2003 at 08:08:43PM +0200, Marc-Christian Petersen wrote:
> > On Tuesday 27 May 2003 19:57, Marcelo Tosatti wrote:
> >
> > Hi Marcelo,
> >
> > > > I do, people I know do also, numbers of those people only _I_ know are
> > > > about ~30. I've reported this problem over a year ago while 2.4.19-pre
> > > > time.
> > > Can you please try to reproduce it with -aa?
> > not again ;)
> >
> > I've tried almost all known kernel tree's around, every kernel has the same
> > effect. I even tried SuSE and Redhat Kernels.
> >
> > I've 'wasted' tons of time just find a solution for it.
> >
> > Andrea introduced, to address _exact_ this problem (pauses, stops, mouse is
> > dead etc.), his lowlatency elevator. Side effect: decreases i/o throughput,
>
> not exactly decreases I/O throughput, the latest I/O benchmarks I seen
> from Randy (dbench/tiotest/bonnie/etc..) were still the fastest and it
> included the lowlatency elevator patch. So it may not help latency but
> it doesn't hurt in the numbers, at least not in the high end (that in
> theory is the one that needs the overkill length in the I/O queue most).
>
> However it definitely helps latency for me and I had a number of
> positive reports.
>
> Also make sure that you elvtune -r 0 -w 0 /dev/hda, also the journaling
> may affect the latency so you can try with plain ext2 to be sure it's
> not a fs issue.
>
> the lowlatency elevator patch may not be perfect but it definitely seems
> to work better here. especially since there's no apparent throughput
> loss, it makes lots of sense to keep it applied, or it would waste lots
> of ram for apparently no gain.

Andrea,

It seems your "fix-pausing" patch is fixing a potential wakeup
miss, right? (I looked quickly through it). Could you explain to me the
problem it's trying to fix, and how?

It's too late to fix that in 2.4.21 (rc5 is going out in hours).

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 18:25                 ` Andrea Arcangeli
  2003-05-27 18:33                   ` Marcelo Tosatti
@ 2003-05-27 18:35                   ` Marc-Christian Petersen
  2003-05-27 20:10                     ` Andrea Arcangeli
  1 sibling, 1 reply; 142+ messages in thread
From: Marc-Christian Petersen @ 2003-05-27 18:35 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Marcelo Tosatti, linux-kernel, Carl-Daniel Hailfinger, manish,
	Christian Klose, William Lee Irwin III

On Tuesday 27 May 2003 20:25, Andrea Arcangeli wrote:

Hi Andrea,

> not exactly decreases I/O throughput, the latest I/O benchmarks I seen
It decreases performance. I've seen this, and Con also saw this. (Well, it's
better than the 'nr_requests = 4' change ;) But mouse stops are still there.

> from Randy (dbench/tiotest/bonnie/etc..) were still the fastest and it
> included the lowlatency elevator patch. So it may not help latency but
> it doesn't hurt in the numbers, at least not in the high end (that in
> theory is the one that needs the overkill length in the I/O queue most).
I agree with the last sentence, in theory, but practice showed something 
different (about 10% to 15% performance decrease)

But I am quite sure that this depends on your machine/hardware. Using IDE 
instead of SCSI for example.

> However it definitely helps latency for me and I had a number of
> positive reports.
It helps but it's not as good as 2.4.18 stock.

> Also make sure that you elvtune -r 0 -w 0 /dev/hda, also the journaling
I also tried that.

> may affect the latency so you can try with plain ext2 to be sure it's
> not a fs issue.
Sure, I did this too. It is FS independent, although ReiserFS is still the
best for this scenario, with fewer pauses than any other FS (ext2, ext3, ...)

But for desktop usage: not acceptable! No way, No go!

> the lowlatency elevator patch may not be perfect but it definitely seems
> to work better here. especially since there's no apparent throughput
> loss, it makes lots of sense to keep it applied, or it would waste lots
> of ram for apparently no gain.
hehe, well, wasting RAM for no gain is the next item on my todo ;) (caching
everything even if there is no RAM, for example; but this is not the
point of this thread)

ciao, Marc


^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 18:33                   ` Marcelo Tosatti
@ 2003-05-27 18:39                     ` Marc-Christian Petersen
  2003-05-27 19:00                       ` manish
  2003-05-27 20:03                     ` Andrea Arcangeli
  2003-05-27 20:08                     ` Chris Mason
  2 siblings, 1 reply; 142+ messages in thread
From: Marc-Christian Petersen @ 2003-05-27 18:39 UTC (permalink / raw)
  To: Marcelo Tosatti, Andrea Arcangeli
  Cc: linux-kernel, Carl-Daniel Hailfinger, manish, Christian Klose,
	William Lee Irwin III

On Tuesday 27 May 2003 20:33, Marcelo Tosatti wrote:

Hi Marcelo,

> It seems your "fix-pausing" patch is fixing a potential wakeup
> miss, right? (I looked quickly throught it). Could you explain me the
> problem its trying to fix and how?
Please have also a look here:

http://hypermail.idiosynkrasia.net/linux-kernel/archived/2002/week45/0305.html

ciao, Marc


^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 18:39                     ` Marc-Christian Petersen
@ 2003-05-27 19:00                       ` manish
  2003-05-27 19:01                         ` Marcelo Tosatti
  0 siblings, 1 reply; 142+ messages in thread
From: manish @ 2003-05-27 19:00 UTC (permalink / raw)
  To: Marc-Christian Petersen
  Cc: Marcelo Tosatti, Andrea Arcangeli, linux-kernel,
	Carl-Daniel Hailfinger, Christian Klose, William Lee Irwin III

Marc-Christian Petersen wrote:

>On Tuesday 27 May 2003 20:33, Marcelo Tosatti wrote:
>
>Hi Marcelo,
>
>>It seems your "fix-pausing" patch is fixing a potential wakeup
>>miss, right? (I looked quickly throught it). Could you explain me the
>>problem its trying to fix and how?
>>
>Please have also a look here:
>
>http://hypermail.idiosynkrasia.net/linux-kernel/archived/2002/week45/0305.html
>
>ciao, Marc
>
Hello !

I applied the fix-pausing-2 patch to the 2.4.20 kernel. This time, the
stack trace is:

sys_write
generic_file_write
ext2_get_group_desc
bread
__wait_on_buffer
schedule





^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 19:00                       ` manish
@ 2003-05-27 19:01                         ` Marcelo Tosatti
  2003-05-27 19:09                           ` manish
  2003-05-27 19:12                           ` manish
  0 siblings, 2 replies; 142+ messages in thread
From: Marcelo Tosatti @ 2003-05-27 19:01 UTC (permalink / raw)
  To: manish
  Cc: Marc-Christian Petersen, Andrea Arcangeli, linux-kernel,
	Carl-Daniel Hailfinger, Christian Klose, William Lee Irwin III



On Tue, 27 May 2003, manish wrote:

> Marc-Christian Petersen wrote:
>
> >On Tuesday 27 May 2003 20:33, Marcelo Tosatti wrote:
> >
> >Hi Marcelo,
> >
> >>It seems your "fix-pausing" patch is fixing a potential wakeup
> >>miss, right? (I looked quickly throught it). Could you explain me the
> >>problem its trying to fix and how?
> >>
> >Please have also a look here:
> >
> >http://hypermail.idiosynkrasia.net/linux-kernel/archived/2002/week45/0305.html
> >
> >ciao, Marc
> >
> Hello !
>
> I applied the fix-pausing-2 patch to the 2.4.20 kernel. This time on,
> the stack trace:
>
> sys_write
> generic_file_write
> ext2_get_group_desc
> bread
> __wait_on_buffer
> schedule

Huh? You mean bonnie still deadlocks or ?

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 19:01                         ` Marcelo Tosatti
@ 2003-05-27 19:09                           ` manish
  2003-05-27 19:12                           ` manish
  1 sibling, 0 replies; 142+ messages in thread
From: manish @ 2003-05-27 19:09 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Marc-Christian Petersen, Andrea Arcangeli, linux-kernel,
	Carl-Daniel Hailfinger, Christian Klose, William Lee Irwin III

Marcelo Tosatti wrote:

>
>On Tue, 27 May 2003, manish wrote:
>
>>Marc-Christian Petersen wrote:
>>
>>>On Tuesday 27 May 2003 20:33, Marcelo Tosatti wrote:
>>>
>>>Hi Marcelo,
>>>
>>>>It seems your "fix-pausing" patch is fixing a potential wakeup
>>>>miss, right? (I looked quickly throught it). Could you explain me the
>>>>problem its trying to fix and how?
>>>>
>>>Please have also a look here:
>>>
>>>http://hypermail.idiosynkrasia.net/linux-kernel/archived/2002/week45/0305.html
>>>
>>>ciao, Marc
>>>
>>Hello !
>>
>>I applied the fix-pausing-2 patch to the 2.4.20 kernel. This time on,
>>the stack trace:
>>
>>sys_write
>>generic_file_write
>>ext2_get_group_desc
>>bread
>>__wait_on_buffer
>>schedule
>>
>
>Huh? You mean bonnie still deadlocks or ?
>
Well, this is with the kernel that has the io_request_lock removed. The
stock kernel (with the fix-pausing-2 patch) is running fine up to now.
However, we will have to give it a few hours of runtime.

Thanks
Manish




^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 19:01                         ` Marcelo Tosatti
  2003-05-27 19:09                           ` manish
@ 2003-05-27 19:12                           ` manish
  2003-05-27 19:28                             ` Marcelo Tosatti
  1 sibling, 1 reply; 142+ messages in thread
From: manish @ 2003-05-27 19:12 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Marc-Christian Petersen, Andrea Arcangeli, linux-kernel,
	Carl-Daniel Hailfinger, Christian Klose, William Lee Irwin III

Marcelo Tosatti wrote:

>
>On Tue, 27 May 2003, manish wrote:
>
>>Marc-Christian Petersen wrote:
>>
>>>On Tuesday 27 May 2003 20:33, Marcelo Tosatti wrote:
>>>
>>>Hi Marcelo,
>>>
>>>>It seems your "fix-pausing" patch is fixing a potential wakeup
>>>>miss, right? (I looked quickly throught it). Could you explain me the
>>>>problem its trying to fix and how?
>>>>
>>>Please have also a look here:
>>>
>>>http://hypermail.idiosynkrasia.net/linux-kernel/archived/2002/week45/0305.html
>>>
>>>ciao, Marc
>>>
>>Hello !
>>
>>I applied the fix-pausing-2 patch to the 2.4.20 kernel. This time on,
>>the stack trace:
>>
>>sys_write
>>generic_file_write
>>ext2_get_group_desc
>>bread
>>__wait_on_buffer
>>schedule
>>
>
>Huh? You mean bonnie still deadlocks or ?
>
At the time the processes get stuck:


[root@dyn-10-123-130-235 vm]# more /proc/meminfo
        total:    used:    free:  shared: buffers:  cached:
Mem:  3709870080 3699126272 10743808        0 18313216 3531255808
Swap: 1077501952        0 1077501952
MemTotal:      3622920 kB
MemFree:         10492 kB
MemShared:           0 kB
Buffers:         17884 kB
Cached:        3448492 kB
SwapCached:          0 kB
Active:          25252 kB
Inactive:      3445344 kB
HighTotal:     2752512 kB
HighFree:         2120 kB
LowTotal:       870408 kB
LowFree:          8372 kB
SwapTotal:     1052248 kB
SwapFree:      1052248 kB
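
A minimal sketch, in Python, of how the kB figures in the dump above can be
summarized. The helper name parse_meminfo and the two derived percentages
are illustrative assumptions; the sample text is copied from the output
above.

```python
import re

# The "Name: value kB" part of the /proc/meminfo dump quoted above.
MEMINFO = """\
MemTotal:      3622920 kB
MemFree:         10492 kB
MemShared:           0 kB
Buffers:         17884 kB
Cached:        3448492 kB
SwapCached:          0 kB
Active:          25252 kB
Inactive:      3445344 kB
HighTotal:     2752512 kB
HighFree:         2120 kB
LowTotal:       870408 kB
LowFree:          8372 kB
SwapTotal:     1052248 kB
SwapFree:      1052248 kB
"""

def parse_meminfo(text: str) -> dict:
    """Map each 'Name: value kB' line to an integer number of kB."""
    fields = {}
    for line in text.splitlines():
        m = re.match(r"(\w+):\s+(\d+) kB$", line.strip())
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields

info = parse_meminfo(MEMINFO)
cached_pct = 100 * info["Cached"] / info["MemTotal"]
low_free_pct = 100 * info["LowFree"] / info["LowTotal"]

print(f"page cache: {cached_pct:.1f}% of MemTotal")
print(f"free low memory: {low_free_pct:.2f}% of LowTotal")
print(f"swap in use: {info['SwapTotal'] - info['SwapFree']} kB")
```

Read this way, the dump shows almost all memory held in the page cache, less
than one percent of low memory free, and no swap in use at the moment the
processes got stuck.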





^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 19:12                           ` manish
@ 2003-05-27 19:28                             ` Marcelo Tosatti
  2003-05-27 19:34                               ` manish
  0 siblings, 1 reply; 142+ messages in thread
From: Marcelo Tosatti @ 2003-05-27 19:28 UTC (permalink / raw)
  To: manish
  Cc: Marc-Christian Petersen, Andrea Arcangeli, linux-kernel,
	Carl-Daniel Hailfinger, Christian Klose, William Lee Irwin III



On Tue, 27 May 2003, manish wrote:

> Marcelo Tosatti wrote:
>
> >
> >On Tue, 27 May 2003, manish wrote:
> >
> >>Marc-Christian Petersen wrote:
> >>
> >>>On Tuesday 27 May 2003 20:33, Marcelo Tosatti wrote:
> >>>
> >>>Hi Marcelo,
> >>>
> >>>>It seems your "fix-pausing" patch is fixing a potential wakeup
> >>>>miss, right? (I looked quickly throught it). Could you explain me the
> >>>>problem its trying to fix and how?
> >>>>
> >>>Please have also a look here:
> >>>
> >>>http://hypermail.idiosynkrasia.net/linux-kernel/archived/2002/week45/0305.html
> >>>
> >>>ciao, Marc
> >>>
> >>Hello !
> >>
> >>I applied the fix-pausing-2 patch to the 2.4.20 kernel. This time on,
> >>the stack trace:
> >>
> >>sys_write
> >>generic_file_write
> >>ext2_get_group_desc
> >>bread
> >>__wait_on_buffer
> >>schedule
> >>
> >
> >Huh? You mean bonnie still deadlocks or ?
> >
> At the time the processes get stuck:
>
>
> [root@dyn-10-123-130-235 vm]# more /proc/meminfo
>         total:    used:    free:  shared: buffers:  cached:
> Mem:  3709870080 3699126272 10743808        0 18313216 3531255808
> Swap: 1077501952        0 1077501952
> MemTotal:      3622920 kB
> MemFree:         10492 kB
> MemShared:           0 kB
> Buffers:         17884 kB
> Cached:        3448492 kB
> SwapCached:          0 kB
> Active:          25252 kB
> Inactive:      3445344 kB
> HighTotal:     2752512 kB
> HighFree:         2120 kB
> LowTotal:       870408 kB
> LowFree:          8372 kB
> SwapTotal:     1052248 kB
> SwapFree:      1052248 kB
>

Ok, so just to confirm: You're still getting pauses with Andrea's patches
but no hangs anymore?

Correct?


* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 19:28                             ` Marcelo Tosatti
@ 2003-05-27 19:34                               ` manish
  2003-05-27 20:20                                 ` Andrea Arcangeli
  0 siblings, 1 reply; 142+ messages in thread
From: manish @ 2003-05-27 19:34 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Marc-Christian Petersen, Andrea Arcangeli, linux-kernel,
	Carl-Daniel Hailfinger, Christian Klose, William Lee Irwin III

Marcelo Tosatti wrote:

>
>On Tue, 27 May 2003, manish wrote:
>
>>Marcelo Tosatti wrote:
>>
>>>On Tue, 27 May 2003, manish wrote:
>>>
>>>>Marc-Christian Petersen wrote:
>>>>
>>>>>On Tuesday 27 May 2003 20:33, Marcelo Tosatti wrote:
>>>>>
>>>>>Hi Marcelo,
>>>>>
>>>>>>It seems your "fix-pausing" patch is fixing a potential wakeup
>>>>>>miss, right? (I looked quickly throught it). Could you explain me the
>>>>>>problem its trying to fix and how?
>>>>>>
>>>>>Please have also a look here:
>>>>>
>>>>>http://hypermail.idiosynkrasia.net/linux-kernel/archived/2002/week45/0305.html
>>>>>
>>>>>ciao, Marc
>>>>>
>>>>Hello !
>>>>
>>>>I applied the fix-pausing-2 patch to the 2.4.20 kernel. This time on,
>>>>the stack trace:
>>>>
>>>>sys_write
>>>>generic_file_write
>>>>ext2_get_group_desc
>>>>bread
>>>>__wait_on_buffer
>>>>schedule
>>>>
>>>Huh? You mean bonnie still deadlocks or ?
>>>
>>At the time the processes get stuck:
>>
>>
>>[root@dyn-10-123-130-235 vm]# more /proc/meminfo
>>        total:    used:    free:  shared: buffers:  cached:
>>Mem:  3709870080 3699126272 10743808        0 18313216 3531255808
>>Swap: 1077501952        0 1077501952
>>MemTotal:      3622920 kB
>>MemFree:         10492 kB
>>MemShared:           0 kB
>>Buffers:         17884 kB
>>Cached:        3448492 kB
>>SwapCached:          0 kB
>>Active:          25252 kB
>>Inactive:      3445344 kB
>>HighTotal:     2752512 kB
>>HighFree:         2120 kB
>>LowTotal:       870408 kB
>>LowFree:          8372 kB
>>SwapTotal:     1052248 kB
>>SwapFree:      1052248 kB
>>
>
>Ok, so just to confirm: You're still getting pauses with Andrea's patches
>but no hangs anymore?
>
>Correct?
>
Hi Marcelo,

I have applied Andrea's patch to two kernels:

1. Stock 2.4.20
2. 2.4.20 with the io_request_lock removed.

The tests on the first one are still going. The tests on the second one 
showed processes getting stuck for long times (> 5 minutes) and not 
paused ...

Thanks
Manish





* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 18:33                   ` Marcelo Tosatti
  2003-05-27 18:39                     ` Marc-Christian Petersen
@ 2003-05-27 20:03                     ` Andrea Arcangeli
  2003-05-27 20:08                       ` Marcelo Tosatti
  2003-05-27 20:08                     ` Chris Mason
  2 siblings, 1 reply; 142+ messages in thread
From: Andrea Arcangeli @ 2003-05-27 20:03 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Marc-Christian Petersen, linux-kernel, Carl-Daniel Hailfinger,
	manish, Christian Klose, William Lee Irwin III

[-- Attachment #1: Type: text/plain, Size: 2662 bytes --]

On Tue, May 27, 2003 at 03:33:14PM -0300, Marcelo Tosatti wrote:
>
> 
> On Tue, 27 May 2003, Andrea Arcangeli wrote:
> 
> > On Tue, May 27, 2003 at 08:08:43PM +0200, Marc-Christian Petersen wrote:
> > > On Tuesday 27 May 2003 19:57, Marcelo Tosatti wrote:
> > >
> > > Hi Marcelo,
> > >
> > > > > I do, people I know do also, numbers of those people only _I_ know are
> > > > > about ~30. I've reported this problem over a year ago while 2.4.19-pre
> > > > > time.
> > > > Can you please try to reproduce it with -aa?
> > > not again ;)
> > >
> > > I've tried almost all known kernel tree's around, every kernel has the same
> > > effect. I even tried SuSE and Redhat Kernels.
> > >
> > > I've 'wasted' tons of time just find a solution for it.
> > >
> > > Andrea introduced, to address _exact_ this problem (pauses, stops, mouse is
> > > dead etc.), his lowlatency elevator. Side effect: decreases i/o throughput,
> >
> > not exactly decreases I/O throughput, the latest I/O benchmarks I seen
> > from Randy (dbench/tiotest/bonnie/etc..) were still the fastest and it
> > included the lowlatency elevator patch. So it may not help latency but
> > it doesn't hurt in the numbers, at least not in the high end (that in
> > theory is the one that needs the overkill length in the I/O queue most).
> >
> > However it definitely helps latency for me and I had a number of
> > positive reports.
> >
> > Also make sure that you elvtune -r 0 -w 0 /dev/hda, also the journaling
> > may affect the latency so you can try with plain ext2 to be sure it's
> > not a fs issue.
> >
> > the lowlatency elevator patch may not be perfect but it definitely seems
> > to work better here. especially since there's no apparent throughput
> > loss, it makes lots of sense to keep it applied, or it would waste lots
> > of ram for apparently no gain.
> 
> Andrea,
> 
> It seems your "fix-pausing" patch is fixing a potential wakeup
> miss, right? (I looked quickly throught it). Could you explain me the

yes, not just one but multiple of them, all similar. lots of boxes were
hanging in a weird manner until I found and fixed this glitch.

> problem its trying to fix and how?

I'm attaching the old email, it should have all the explanations.

but don't use that old patch (that was the first revision and it missed
one last race in wait_for_request noticed by Chris or Andrew [or
both?]), use this one instead (seems just the second revision, should be
that one plus that last race fix):

	http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.21rc2aa1/9980_fix-pausing-2

thanks,

> 
> Its too late to fix that in 2.4.21 (rc5 is going out in hours).


Andrea

[-- Attachment #2: Type: message/rfc822, Size: 18863 bytes --]

[-- Attachment #2.1.1: Type: text/plain, Size: 7288 bytes --]

I recently found and fixed a mysterious hang that could leave the
kernel with tasks in D state while the disk is idle.  We could reproduce
very long hangs (several hours) with tasks in D state on reiserfs
after some hours of intensive load (not cerberus, see below why), but
it wasn't a reiserfs specific problem.  reiserfs just happens to take
the lock_super while doing the fsync_buffers_list, and this leads
kupdate to get stuck in lock_super waiting for a wait_on_buffer to
return.  With kupdate stuck, the background run_task_queue() doesn't
run every 5 seconds anymore, and in turn a missing unplug somewhere
will leave the machine hanging in wait_on_buffer for an
indefinite/infinite time (a kind of deadlock, unless somebody triggers
a readpage or something else that unplugs the disk queue; usually
logging in with ssh fixed the problem).  Significantly increasing the
kupdate interval would potentially lead to the same indefinite hang on
ext2 while running fsync.

For some time I didn't even consider the possibility of wait_on_buffer
being the problem, there are over 700 patches applied in the kernel
where we could reproduce so for some time I was looking at everything
but the buggy place.  After ruling out various other bits
(scheduler fixes/compiler/fsync corruption fixes etc..) I actually
realized the problem is a longstanding design locking problem in
wait_on_buffer (then I found the same problem in wait_on_page and
yesterday Chris found a similar problem in get_request_wait too, the
get_request_wait is not exactly the same issue, but it's quite similar
and it could lead to exactly the same hangs).

Probably nobody noticed this yet because normally with ext2/ext3 these
hangs happen on all machines but are resolved after a disk idle time of
2.5 seconds on average, and they happen only once in a while.  Normally
people would see mean delays of 2.5 sec coming from the datacenter and
think it's normal I/O congestion or the elevator or something during
the fsync on ext2.  Furthermore, as Chris pointed out, with very
intensive load bdflush would usually be running in the background; this
race can trigger only with mid writepage loads when bdflush/pdflush has
no reason to run.

We also have the lowlatency fixes (they're fixes) inside submit_bh so we
probably opened a larger window for the race to trigger than mainline.
Chris also double checked the bug we were facing was really this race by
introducing a delay in submit_bh to make it reproducible in a reasonable
amount of time.

the race looks like this:

        CPU0                    CPU1 
        -----------------       ------------------------ 
        reiserfs_writepage 
        lock_buffer() 
                                fsync_buffers_list() under lock_super() 
                                wait_on_buffer() 
                                run_task_queue(&tq_disk) -> noop 
                                schedule() <- hang with lock_super acquired 
        submit_bh() 
        /* don't unplug here */ 
 
This example is reiserfs specific but any wait_on_buffer can definitely
hang indefinitely against any concurrent ll_rw_block or submit_bh (even
on UP since submit_bh is a blocking operation and in particular with the
lowlat fixes). There's no big kernel lock anymore serializing
wait_on_buffer/ll_rw_block.  This design locking problem was introduced
with the removal of the BKL from wait_on_buffer/wait_on_page/ll_rw_blk
during one of the 2.3 scalability efforts. So any 2.4 kernel out there
is affected by this race.

in short the problem here is that the wait_on_"something" has no clue if
the locked "something" is just inserted in the I/O queue and visible to the
device, so it has no clue if the run_task_queue may become a noop or if
it may affect the "something". And the writer side that executes the
submit_bh won't unplug the queue rightfully (to allow merging and boost
performance until somebody actually asks for the I/O completed ASAP).

I fixed the race by simply doing a wakeup of any waiter after any
submit_bh/submit_bio that left stuff pending in the I/O queue. So if the
race triggers now the wait_on_something will get a wakeup and in turn it
will trigger the unplug again closing the window for the race. This is
fixing the problem in practice and it seems the best fix at least for
2.4, and I don't see any potential performance regression, so I don't
feel the need of anything more complicated than this, the race triggers
once every several hours only under some special workload. You may try
to avoid loading the waitqueue head cacheline during submit_bh, but at
least for 2.4 I don't think it's worth the complexity and it's an I/O path
anyways so it's certainly not critical.
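
The race and the wakeup fix described above can be replayed in a toy,
single-threaded model. This is only a sketch: it is not kernel code,
every name in it (toy_state, toy_unplug, toy_wait_pass, toy_race_hangs)
is hypothetical, and it hard-codes the one interleaving from the
CPU0/CPU1 diagram instead of running real threads.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy state; every name here is hypothetical, not kernel API. */
struct toy_state {
	bool buffer_locked;  /* the bh stays locked until its I/O completes */
	int  queued;         /* requests sitting in the plugged disk queue */
	bool waiter_asleep;  /* the waiter went into schedule() */
};

/* Model of run_task_queue(&tq_disk): a no-op when nothing is queued. */
static void toy_unplug(struct toy_state *s)
{
	if (s->queued > 0) {
		s->queued = 0;
		s->buffer_locked = false;  /* I/O completion unlocks the bh */
		s->waiter_asleep = false;  /* ... and wakes the waiter */
	}
}

/* Model of one pass of the wait_on_buffer loop: unplug, then sleep. */
static void toy_wait_pass(struct toy_state *s)
{
	toy_unplug(s);
	if (s->buffer_locked)
		s->waiter_asleep = true;   /* schedule() */
}

/* Replays the interleaving; returns true if the waiter ends up stuck. */
bool toy_race_hangs(bool wake_after_submit)
{
	struct toy_state s = { .buffer_locked = true };  /* CPU0: lock_buffer() */

	toy_wait_pass(&s);   /* CPU1: unplug is a no-op, then it sleeps */
	s.queued++;          /* CPU0: submit_bh(), queue stays plugged */

	if (wake_after_submit && s.waiter_asleep) {
		s.waiter_asleep = false;  /* the fix: wake_up() after submit */
		toy_wait_pass(&s);        /* the waiter loops and unplugs again */
	}
	return s.waiter_asleep && s.buffer_locked;
}

/* Without the wakeup the waiter hangs; with it the second unplug runs. */
void toy_self_check(void)
{
	assert(toy_race_hangs(false));
	assert(!toy_race_hangs(true));
}
```

In the model, as in the description, the extra wakeup does no I/O by
itself; it only sends the waiter around its loop once more, so the
unplug runs at a point where the request is already in the queue.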

The problem noticed by Chris with get_request_wait is similar, the
unplugging was run before adding the task to the waitqueue, so the
unplug could free the requests and somebody else could allocate the
freed requests without unplugging the queue afterwards. I fixed it simply
by unplugging the queue just before schedule(). That was really a more
genuine bug than the other subtle ones. With get_request_wait the fix is
so simple because we deal with entities that are guaranteed to be
affected by the queue unplug always (they're the I/O requests), this
isn't the case with the locked bh or locked/writeback pages, that was
in fact the wrong assumption that allowed the other races to trigger in the
first place.
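
The get_request_wait ordering problem can be sketched the same way.
Again this is a toy model under stated assumptions, not kernel code:
toy_queue and the toy_* functions are hypothetical names, the "other
task" is assumed to grab every free request, and the only thing the
model varies is whether the unplug happens before the waiter is on the
waitqueue (old code) or just before schedule() (the fix).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy request queue; all names are hypothetical. */
struct toy_queue {
	int free_requests;  /* plays the role of q->rq[rw].count */
	int plugged;        /* requests queued but not yet sent to the disk */
};

/* Unplug: the disk completes the plugged requests and frees them. */
static void toy_unplug_requests(struct toy_queue *q)
{
	q->free_requests += q->plugged;
	q->plugged = 0;
}

/* Another task takes every free request and queues it, no unplug. */
static void toy_other_task_allocates(struct toy_queue *q)
{
	q->plugged += q->free_requests;
	q->free_requests = 0;
}

/*
 * Returns true if the waiter finds a free request, false if it would
 * sleep with the queue plugged and nobody left to unplug it.
 */
bool toy_get_request_wait(bool unplug_before_schedule)
{
	struct toy_queue q = { .free_requests = 0, .plugged = 2 };

	if (!unplug_before_schedule)
		toy_unplug_requests(&q);   /* old code: unplug first */

	toy_other_task_allocates(&q);      /* race window */

	/* add_wait_queue_exclusive() + set_current_state() happen here */
	if (q.free_requests == 0) {
		if (unplug_before_schedule)
			toy_unplug_requests(&q);  /* fix: unplug right here */
		/* schedule() would go here */
	}
	return q.free_requests > 0;
}

void toy_request_self_check(void)
{
	assert(!toy_get_request_wait(false));  /* old ordering stalls */
	assert(toy_get_request_wait(true));    /* fixed ordering proceeds */
}
```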

while doing these fixes I noticed various other bits:

1) in general the blk_run_queues()/run_task_queue()/sync_page should
always run just before schedule(), it's pointless to unplug anything if
we don't run schedule (ultramicrooptimization)
2) in 2.4 the run_task_queue() in wait_on_buffer could have its
TQ_ACTIVE executed inside the add_wait_queue critical section since
spin_unlock has inclusive semantics (literally speculative reads can
pass the spin_unlock even on x86/x86-64)
3) the __blk_put_request/blkdev_release_request was increasing the count
and reading the waitqueue contents without even a barrier() for the asm
layer, it needs an smp_mb() in between to serialize against
get_request_wait that runs locklessly

I did two patches one for 2.4.20rc1 and one for 2.5.47 (sorry no bk
tree here, I will try to make bitdropper.py available shortly so I can
access the new info encoded in proprietary form too) that will address
all these races.  However 2.5 should be analysed further, I didn't
search too hard for all the possible places that could have this race in
2.5, I searched hard in 2.4 and I only addressed all the same problems
in 2.5. The only bit I think could be problematic in 2.4 is the nfs
speculative I/O, the reason nfs is implementing a sync_page in the first
place. That may have the same race; I have in fact heard of some reports
of nfs hung in wait_on_page, and I wonder if this could explain them too.
I assume the fs maintainers will take care of checking their fs for
missing wakeups of page waiters in 2.4 and 2.5 now that the problem is
well known.

	http://www.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.20rc1/fix-pausing-1
	http://www.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.5/2.5.47/fix-pausing-1

they're attached to this email too since they're tiny.

Andrea

[-- Attachment #2.1.2: fix-pausing-1 --]
[-- Type: text/plain, Size: 5359 bytes --]

diff -urNp 2.4.20rc1/drivers/block/ll_rw_blk.c hangs-2.4/drivers/block/ll_rw_blk.c
--- 2.4.20rc1/drivers/block/ll_rw_blk.c	Sat Nov  2 19:45:33 2002
+++ hangs-2.4/drivers/block/ll_rw_blk.c	Tue Nov 12 02:18:35 2002
@@ -590,12 +590,20 @@ static struct request *__get_request_wai
 	register struct request *rq;
 	DECLARE_WAITQUEUE(wait, current);
 
-	generic_unplug_device(q);
 	add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
 	do {
 		set_current_state(TASK_UNINTERRUPTIBLE);
-		if (q->rq[rw].count == 0)
+		if (q->rq[rw].count == 0) {
+			/*
+			 * All we care about is not to stall if any request
+			 * is been released after we set TASK_UNINTERRUPTIBLE.
+			 * This is the most efficient place to unplug the queue
+			 * in case we hit the race and we can get the request
+			 * without waiting.
+			 */
+			generic_unplug_device(q);
 			schedule();
+		}
 		spin_lock_irq(&io_request_lock);
 		rq = get_request(q, rw);
 		spin_unlock_irq(&io_request_lock);
@@ -829,9 +837,11 @@ void blkdev_release_request(struct reque
 	 */
 	if (q) {
 		list_add(&req->queue, &q->rq[rw].free);
-		if (++q->rq[rw].count >= q->batch_requests &&
-				waitqueue_active(&q->wait_for_requests[rw]))
-			wake_up(&q->wait_for_requests[rw]);
+		if (++q->rq[rw].count >= q->batch_requests) {
+			smp_mb();
+			if (waitqueue_active(&q->wait_for_requests[rw]))
+				wake_up(&q->wait_for_requests[rw]);
+		}
 	}
 }
 
@@ -1200,6 +1210,11 @@ void submit_bh(int rw, struct buffer_hea
 
 	generic_make_request(rw, bh);
 
+	/* fix race condition with wait_on_buffer() */
+	smp_mb(); /* spin_unlock may have inclusive semantics */
+	if (waitqueue_active(&bh->b_wait))
+		wake_up(&bh->b_wait);
+
 	switch (rw) {
 		case WRITE:
 			kstat.pgpgout += count;
diff -urNp 2.4.20rc1/fs/buffer.c hangs-2.4/fs/buffer.c
--- 2.4.20rc1/fs/buffer.c	Sat Nov  2 19:45:40 2002
+++ hangs-2.4/fs/buffer.c	Tue Nov 12 02:17:56 2002
@@ -153,10 +153,23 @@ void __wait_on_buffer(struct buffer_head
 	get_bh(bh);
 	add_wait_queue(&bh->b_wait, &wait);
 	do {
-		run_task_queue(&tq_disk);
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 		if (!buffer_locked(bh))
 			break;
+		/*
+		 * We must read tq_disk in TQ_ACTIVE after the
+		 * add_wait_queue effect is visible to other cpus.
+		 * We could unplug some line above it wouldn't matter
+		 * but we can't do that right after add_wait_queue
+		 * without an smp_mb() in between because spin_unlock
+		 * has inclusive semantics.
+		 * Doing it here is the most efficient place so we
+		 * don't do a suprious unplug if we get a racy
+		 * wakeup that make buffer_locked to return 0, and
+		 * doing it here avoids an explicit smp_mb() we
+		 * rely on the implicit one in set_task_state.
+		 */
+		run_task_queue(&tq_disk);
 		schedule();
 	} while (buffer_locked(bh));
 	tsk->state = TASK_RUNNING;
@@ -1508,6 +1521,9 @@ static int __block_write_full_page(struc
 
 	/* Done - end_buffer_io_async will unlock */
 	SetPageUptodate(page);
+
+	wakeup_page_waiters(page);
+
 	return 0;
 
 out:
@@ -1539,6 +1555,7 @@ out:
 	} while (bh != head);
 	if (need_unlock)
 		UnlockPage(page);
+	wakeup_page_waiters(page);
 	return err;
 }
 
@@ -1755,6 +1772,8 @@ int block_read_full_page(struct page *pa
 		else
 			submit_bh(READ, bh);
 	}
+
+	wakeup_page_waiters(page);
 	
 	return 0;
 }
@@ -2368,6 +2387,7 @@ int brw_page(int rw, struct page *page, 
 		submit_bh(rw, bh);
 		bh = next;
 	} while (bh != head);
+	wakeup_page_waiters(page);
 	return 0;
 }
 
diff -urNp 2.4.20rc1/fs/reiserfs/inode.c hangs-2.4/fs/reiserfs/inode.c
--- 2.4.20rc1/fs/reiserfs/inode.c	Sat Nov  2 19:45:46 2002
+++ hangs-2.4/fs/reiserfs/inode.c	Tue Nov 12 02:17:56 2002
@@ -1993,6 +1993,7 @@ static int reiserfs_write_full_page(stru
     */
     if (nr) {
         submit_bh_for_writepage(arr, nr) ;
+	wakeup_page_waiters(page);
     } else {
         UnlockPage(page) ;
     }
diff -urNp 2.4.20rc1/include/linux/pagemap.h hangs-2.4/include/linux/pagemap.h
--- 2.4.20rc1/include/linux/pagemap.h	Sat Nov  2 19:45:48 2002
+++ hangs-2.4/include/linux/pagemap.h	Tue Nov 12 04:35:52 2002
@@ -97,6 +97,8 @@ static inline void wait_on_page(struct p
 		___wait_on_page(page);
 }
 
+extern void wakeup_page_waiters(struct page * page);
+
 /*
  * Returns locked page at given index in given cache, creating it if needed.
  */
diff -urNp 2.4.20rc1/kernel/ksyms.c hangs-2.4/kernel/ksyms.c
--- 2.4.20rc1/kernel/ksyms.c	Sat Nov  2 19:45:48 2002
+++ hangs-2.4/kernel/ksyms.c	Tue Nov 12 04:36:25 2002
@@ -293,6 +293,7 @@ EXPORT_SYMBOL(filemap_fdatasync);
 EXPORT_SYMBOL(filemap_fdatawait);
 EXPORT_SYMBOL(lock_page);
 EXPORT_SYMBOL(unlock_page);
+EXPORT_SYMBOL(wakeup_page_waiters);
 
 /* device registration */
 EXPORT_SYMBOL(register_chrdev);
diff -urNp 2.4.20rc1/mm/filemap.c hangs-2.4/mm/filemap.c
--- 2.4.20rc1/mm/filemap.c	Sat Nov  2 19:45:48 2002
+++ hangs-2.4/mm/filemap.c	Tue Nov 12 04:35:40 2002
@@ -909,6 +909,20 @@ void lock_page(struct page *page)
 }
 
 /*
+ * This must be called after every submit_bh with end_io
+ * callbacks that would result into the blkdev layer waking
+ * up the page after a queue unplug.
+ */
+void wakeup_page_waiters(struct page * page)
+{
+	wait_queue_head_t * head;
+
+	head = page_waitqueue(page);
+	if (waitqueue_active(head))
+		wake_up(head);
+}
+
+/*
  * a rather lightweight function, finding and getting a reference to a
  * hashed page atomically.
  */

[-- Attachment #2.1.3: fix-pausing-1 --]
[-- Type: text/plain, Size: 5331 bytes --]

diff -urNp 2.5.47/drivers/block/ll_rw_blk.c hangs-2.5/drivers/block/ll_rw_blk.c
--- 2.5.47/drivers/block/ll_rw_blk.c	Tue Nov 12 01:59:41 2002
+++ hangs-2.5/drivers/block/ll_rw_blk.c	Tue Nov 12 02:37:42 2002
@@ -1281,12 +1281,13 @@ static struct request *get_request_wait(
 
 	spin_lock_prefetch(q->queue_lock);
 
-	generic_unplug_device(q);
 	do {
 		prepare_to_wait_exclusive(&rl->wait, &wait,
 					TASK_UNINTERRUPTIBLE);
-		if (!rl->count)
+		if (!rl->count){
+			generic_unplug_device(q);
 			io_schedule();
+		}
 		finish_wait(&rl->wait, &wait);
 		spin_lock_irq(q->queue_lock);
 		rq = get_request(q, rw);
@@ -1487,8 +1488,11 @@ void __blk_put_request(request_queue_t *
 		rl->count++;
 		if (rl->count >= queue_congestion_off_threshold())
 			clear_queue_congested(q, rw);
-		if (rl->count >= batch_requests && waitqueue_active(&rl->wait))
-			wake_up(&rl->wait);
+		if (rl->count >= batch_requests) {
+			smp_mb();
+			if (waitqueue_active(&rl->wait))
+				wake_up(&rl->wait);
+		}
 	}
 }
 
diff -urNp 2.5.47/fs/buffer.c hangs-2.5/fs/buffer.c
--- 2.5.47/fs/buffer.c	Tue Nov 12 01:59:42 2002
+++ hangs-2.5/fs/buffer.c	Tue Nov 12 02:47:46 2002
@@ -135,9 +135,10 @@ void __wait_on_buffer(struct buffer_head
 	get_bh(bh);
 	do {
 		prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-		blk_run_queues();
-		if (buffer_locked(bh))
+		if (buffer_locked(bh)) {
+			blk_run_queues();
 			schedule();
+		}
 	} while (buffer_locked(bh));
 	put_bh(bh);
 	finish_wait(wqh, &wait);
@@ -1727,7 +1728,8 @@ done:
 		if (uptodate)
 			SetPageUptodate(page);
 		end_page_writeback(page);
-	}
+	} else
+		wakeup_page_waiters(page);
 	if (err == 0)
 		return ret;
 	return err;
@@ -2011,6 +2013,7 @@ int block_read_full_page(struct page *pa
 		else
 			submit_bh(READ, bh);
 	}
+	wakeup_page_waiters(page);
 	return 0;
 }
 
@@ -2315,6 +2318,8 @@ static int end_bio_bh_io_sync(struct bio
 int submit_bh(int rw, struct buffer_head * bh)
 {
 	struct bio *bio;
+	int ret;
+	wait_queue_head_t *wqh = bh_waitq_head(bh);
 
 	BUG_ON(!buffer_locked(bh));
 	BUG_ON(!buffer_mapped(bh));
@@ -2348,7 +2353,13 @@ int submit_bh(int rw, struct buffer_head
 	bio->bi_end_io = end_bio_bh_io_sync;
 	bio->bi_private = bh;
 
-	return submit_bio(rw, bio);
+	ret =  submit_bio(rw, bio);
+
+	smp_mb(); /* spin_unlock may have inclusive semantics */
+	if (waitqueue_active(wqh))
+		wake_up(wqh);
+
+	return ret;
 }
 
 /**
diff -urNp 2.5.47/fs/reiserfs/inode.c hangs-2.5/fs/reiserfs/inode.c
--- 2.5.47/fs/reiserfs/inode.c	Thu Oct 31 01:42:25 2002
+++ hangs-2.5/fs/reiserfs/inode.c	Tue Nov 12 02:50:47 2002
@@ -1987,6 +1987,7 @@ static int reiserfs_write_full_page(stru
     */
     if (nr) {
         submit_bh_for_writepage(arr, nr) ;
+	wakeup_page_waiters(page);
     } else {
         end_page_writeback(page) ;
     }
diff -urNp 2.5.47/include/linux/pagemap.h hangs-2.5/include/linux/pagemap.h
--- 2.5.47/include/linux/pagemap.h	Tue Nov 12 01:59:43 2002
+++ hangs-2.5/include/linux/pagemap.h	Tue Nov 12 02:45:27 2002
@@ -122,4 +122,7 @@ static inline void wait_on_page_writebac
 }
 
 extern void end_page_writeback(struct page *page);
+
+extern void wakeup_page_waiters(struct page * page);
+
 #endif /* _LINUX_PAGEMAP_H */
diff -urNp 2.5.47/mm/filemap.c hangs-2.5/mm/filemap.c
--- 2.5.47/mm/filemap.c	Tue Nov 12 01:59:43 2002
+++ hangs-2.5/mm/filemap.c	Tue Nov 12 02:44:59 2002
@@ -272,9 +272,10 @@ void wait_on_page_bit(struct page *page,
 
 	do {
 		prepare_to_wait(waitqueue, &wait, TASK_UNINTERRUPTIBLE);
-		sync_page(page);
-		if (test_bit(bit_nr, &page->flags))
+		if (test_bit(bit_nr, &page->flags)) {
+			sync_page(page);
 			io_schedule();
+		}
 	} while (test_bit(bit_nr, &page->flags));
 	finish_wait(waitqueue, &wait);
 }
@@ -336,15 +337,30 @@ void __lock_page(struct page *page)
 
 	while (TestSetPageLocked(page)) {
 		prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-		sync_page(page);
-		if (PageLocked(page))
+		if (PageLocked(page)) {
+			sync_page(page);
 			io_schedule();
+		}
 	}
 	finish_wait(wqh, &wait);
 }
 EXPORT_SYMBOL(__lock_page);
 
 /*
+ * This must be called after every submit_bh with end_io
+ * callbacks that would result into the blkdev layer waking
+ * up the page after a queue unplug.
+ */
+void wakeup_page_waiters(struct page * page)
+{
+	wait_queue_head_t * wqh;
+
+	wqh = page_waitqueue(page);
+	if (waitqueue_active(wqh))
+		wake_up(wqh);
+}
+
+/*
  * a rather lightweight function, finding and getting a reference to a
  * hashed page atomically.
  */
diff -urNp 2.5.47/mm/page_io.c hangs-2.5/mm/page_io.c
--- 2.5.47/mm/page_io.c	Thu Oct 31 01:41:56 2002
+++ hangs-2.5/mm/page_io.c	Tue Nov 12 02:50:12 2002
@@ -104,6 +104,7 @@ int swap_writepage(struct page *page)
 	SetPageWriteback(page);
 	unlock_page(page);
 	submit_bio(WRITE, bio);
+	wakeup_page_waiters(page);
 out:
 	return ret;
 }
@@ -121,6 +122,7 @@ int swap_readpage(struct file *file, str
 	}
 	inc_page_state(pswpin);
 	submit_bio(READ, bio);
+	wakeup_page_waiters(page);
 out:
 	return ret;
 }
--- hangs-2.5/kernel/ksyms.c.~1~	Tue Nov 12 01:59:43 2002
+++ hangs-2.5/kernel/ksyms.c	Tue Nov 12 04:36:37 2002
@@ -336,6 +336,7 @@ EXPORT_SYMBOL(filemap_fdatawrite);
 EXPORT_SYMBOL(filemap_fdatawait);
 EXPORT_SYMBOL(lock_page);
 EXPORT_SYMBOL(unlock_page);
+EXPORT_SYMBOL(wakeup_page_waiters);
 
 /* device registration */
 EXPORT_SYMBOL(register_chrdev);


* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 20:03                     ` Andrea Arcangeli
@ 2003-05-27 20:08                       ` Marcelo Tosatti
  2003-05-27 20:25                         ` Andrea Arcangeli
  0 siblings, 1 reply; 142+ messages in thread
From: Marcelo Tosatti @ 2003-05-27 20:08 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Marc-Christian Petersen, linux-kernel, Carl-Daniel Hailfinger,
	manish, Christian Klose, William Lee Irwin III



On Tue, 27 May 2003, Andrea Arcangeli wrote:

> > It seems your "fix-pausing" patch is fixing a potential wakeup
> > miss, right? (I looked quickly throught it). Could you explain me the
>
> yes, not just one but multiple of them, all similar. lots of boxes were
> hanging in a weird manner until I found and fixed this glitch.
>
> > problem its trying to fix and how?
>
> I'm attaching the old email, it should have all the explanataions.
>
> but don't use that old patch (that was the first revision and it missed
> one last race in wait_for_request noticed by Chris or Andrew [or
> both?]), use this one instead (seems just the second revision, should be
> that one plus that last race fix):
>
> 	http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.21rc2aa1/9980_fix-pausing-2

I wonder if the additional wakeups result in performance degradation (not
that it matters much in case there is no other way to fix the problem).

But anyway I would like to have some numbers with/without the patch.

Do you have them ?


* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 18:33                   ` Marcelo Tosatti
  2003-05-27 18:39                     ` Marc-Christian Petersen
  2003-05-27 20:03                     ` Andrea Arcangeli
@ 2003-05-27 20:08                     ` Chris Mason
  2 siblings, 0 replies; 142+ messages in thread
From: Chris Mason @ 2003-05-27 20:08 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Andrea Arcangeli, Marc-Christian Petersen, linux-kernel,
	Carl-Daniel Hailfinger, manish, Christian Klose,
	William Lee Irwin III

On Tue, 2003-05-27 at 14:33, Marcelo Tosatti wrote:

> Andrea,
> 
> It seems your "fix-pausing" patch is fixing a potential wakeup
> miss, right? (I looked quickly throught it). Could you explain me the
> problem its trying to fix and how?
> 
> Its too late to fix that in 2.4.21 (rc5 is going out in hours).

The bug report seems to be on ext2, and on a box with 3.5GB of ram and
4G of dirty data.  So, I don't think he is hitting the fix-pausing bug,
which needs just the right set of conditions to miss unplugs:  

1) bdflush can't be awake, so the percentage of dirty buffers has to be
somewhat low.  Otherwise bdflush will trigger unplugs.

2) kupdate needs to be stuck waiting on the super lock, otherwise
kupdate would be triggering unplugs

2a) Some process needs to be calling wait_on_buffer() with the super
lock held.  This makes it pretty much impossible to trigger on ext2
without using O_SYNC mode.

3) You've got to race in __wait_on_buffer (cut n' paste from an old mail
from Andrea)

       CPU0                    CPU1 
       -----------------       ------------------------ 
       reiserfs_writepage 
       lock_buffer() 
                               fsync_buffers_list() under lock_super() 
                               wait_on_buffer() 
                               run_task_queue(&tq_disk) -> noop 
                               schedule() <- hang with lock_super acquired 
       submit_bh() 
       /* don't unplug here */ 


With ext3, you can trigger with two procs, it gets much easier if you
toss a schedule() into submit_bh(), right before generic_make_request. 
reiserfs + the data logging patches is easier to trigger and produces
longer pauses.

For ext3:
A: while(1) sync
B: while(1) write(fd, 8k); fsync(fd); ftruncate(fd, 0);

The idea behind proc B is to increase the chances the
sync and the fsync are trying to write and wait on the same buffer. 
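
A bounded userspace sketch of proc B, for illustration only. The names
churn_fd and churn_tmpfile, the /tmp path, and the iteration count are
assumptions, and the loop is finite here where the original is
while(1). Using pwrite() at offset 0 is also an assumption: with a
plain write() the file offset would stay at 8k after the ftruncate, so
later writes would land past a hole instead of at the start of the
file. Proc A is simply sync() called in a loop from a second process.

```c
#define _XOPEN_SOURCE 700  /* for pwrite, fsync, ftruncate, mkstemp */

#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define CHUNK 8192  /* the "8k" from the description above */

/*
 * Repeat "write 8k, fsync, truncate to zero" on an open file.
 * Returns the final file size (0 on success) or -1 on any error.
 */
long churn_fd(int fd, int iterations)
{
	char buf[CHUNK];
	struct stat st;

	assert(iterations >= 0);
	memset(buf, 'x', sizeof(buf));

	for (int i = 0; i < iterations; i++) {
		/* always write at offset 0 so the same blocks get reallocated */
		if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf))
			return -1;
		if (fsync(fd) != 0)
			return -1;
		if (ftruncate(fd, 0) != 0)
			return -1;
	}
	if (fstat(fd, &st) != 0)
		return -1;
	return (long)st.st_size;
}

/* Convenience wrapper: run the loop on a throwaway temporary file. */
long churn_tmpfile(int iterations)
{
	char path[] = "/tmp/churn-XXXXXX";
	int fd = mkstemp(path);
	long size;

	if (fd < 0)
		return -1;
	size = churn_fd(fd, iterations);
	close(fd);
	unlink(path);
	return size;
}
```

To approximate the test described here, churn_fd() would be pointed at
a file on the filesystem under test with a very large iteration count,
while a second process calls sync() in a loop.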

ext3 is hung on a metadata block, while it tries to get write access to
the block before logging it.  This ends up calling wait_on_buffer with
the super held while in proc B, while proc A is in sync flushing the
metadata block.  

I triggered the hang in ext3 during block allocation, so the ftruncate
makes sure ext3 is constantly allocating blocks (and always dirtying the
same bitmap/direct block).

It isn't a perfect reproduction of the hang, because in ext3 kjournald
wakes up every once in a while (~30 seconds or more) and kicks the
transaction.  But, with more procs running, someone could be waiting
with the journal lock held, which would keep kjournald from fixing
things.





* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 18:35                   ` Marc-Christian Petersen
@ 2003-05-27 20:10                     ` Andrea Arcangeli
  2003-05-27 20:24                       ` Marc-Christian Petersen
  0 siblings, 1 reply; 142+ messages in thread
From: Andrea Arcangeli @ 2003-05-27 20:10 UTC (permalink / raw)
  To: Marc-Christian Petersen
  Cc: Marcelo Tosatti, linux-kernel, Carl-Daniel Hailfinger, manish,
	Christian Klose, William Lee Irwin III

Hi,

On Tue, May 27, 2003 at 08:35:33PM +0200, Marc-Christian Petersen wrote:
> On Tuesday 27 May 2003 20:25, Andrea Arcangeli wrote:
> 
> Hi Andrea,
> 
> > not exactly decreases I/O throughput, the latest I/O benchmarks I seen
> it decreases performance. I've seen this, Con also saw this (well it's better 
> than the 'nr_requests = 4' change ;) but mouse stops are still there.
> 
> > from Randy (dbench/tiotest/bonnie/etc..) were still the fastest and it
> > included the lowlatency elevator patch. So it may not help latency but
> > it doesn't hurt in the numbers, at least not in the high end (that in
> > theory is the one that needs the overkill length in the I/O queue most).
> I agree with the last sentence, in theory, but practice showed something 
> different (about 10% to 15% performance decrease)
> 
> But I am quite sure that this depends on your machine/hardware. Using IDE 
> instead of SCSI for example.

A 10-15% performance drop doesn't sound good, no matter what hardware ;).

However in contest I recall there was quite an improvement in latency at
least (I mean, it had some positive effect too)

Getting the best throughput and latency at the same time is normally not
possible, however evaluating if it's losing excessive throughput given a
certain latency improvement is difficult.


> 
> > However it definitely helps latency for me and I had a number of
> > positive reports.
> It helps but it's not as good as 2.4.18 stock.

I'll try to find what's the precise reason of the interactivity drop
with the 2.4.18->2.4.19 blkdev changes on Thu. I think I shortly looked
into it once but there was no definitive answer, or anyways going back
to the 2.4.18 code didn't appeal or make much sense.

However I suspect this responsiveness issue could be storage hardware
dependent.

The remark by Linus in the last few days, while talking with Jens,
about storage that reorders stuff and starves requests at the two ends of
the platter was very scary; maybe you're really bitten by something like
that. Linux does the right thing but your hardware keeps posting stuff
under the OS and mine doesn't.


> 
> > Also make sure that you elvtune -r 0 -w 0 /dev/hda, also the journaling
> I also tried that.
> 
> > may affect the latency so you can try with plain ext2 to be sure it's
> > not a fs issue.
> Sure, I did this too. FS independent, where ReiserFS is still the best for 
> this scenario with the most few pauses than any other FS (ext2, ext3, ...)
> 
> But for desktop usage: not acceptable! No way, No go!
> 
> > the lowlatency elevator patch may not be perfect but it definitely seems
> > to work better here. especially since there's no apparent throughput
> > loss, it makes lots of sense to keep it applied, or it would waste lots
> > of ram for apparently no gain.
> hehe, well wasting RAM for no gain is my next part on my todo ;) (cache 
> everything even if there is no RAM for example, well but this is not the 
> point in this thread)
> 
> ciao, Marc
> 


Andrea


* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 19:34                               ` manish
@ 2003-05-27 20:20                                 ` Andrea Arcangeli
  2003-05-27 20:25                                   ` Marc-Christian Petersen
  0 siblings, 1 reply; 142+ messages in thread
From: Andrea Arcangeli @ 2003-05-27 20:20 UTC (permalink / raw)
  To: manish
  Cc: Marcelo Tosatti, Marc-Christian Petersen, linux-kernel,
	Carl-Daniel Hailfinger, Christian Klose, William Lee Irwin III

On Tue, May 27, 2003 at 12:34:38PM -0700, manish wrote:
> Marcelo Tosatti wrote:
> 
> >
> >On Tue, 27 May 2003, manish wrote:
> >
> >>Marcelo Tosatti wrote:
> >>
> >>>On Tue, 27 May 2003, manish wrote:
> >>>
> >>>>Marc-Christian Petersen wrote:
> >>>>
> >>>>>On Tuesday 27 May 2003 20:33, Marcelo Tosatti wrote:
> >>>>>
> >>>>>Hi Marcelo,
> >>>>>
> >>>>>>It seems your "fix-pausing" patch is fixing a potential wakeup
> >>>>>>miss, right? (I looked quickly throught it). Could you explain me the
> >>>>>>problem its trying to fix and how?
> >>>>>>
> >>>>>Please have also a look here:
> >>>>>
> >>>>>http://hypermail.idiosynkrasia.net/linux-kernel/archived/2002/week45/0305.html
> >>>>>
> >>>>>ciao, Marc
> >>>>>
> >>>>Hello !
> >>>>
> >>>>I applied the fix-pausing-2 patch to the 2.4.20 kernel. This time on,
> >>>>the stack trace:
> >>>>
> >>>>sys_write
> >>>>generic_file_write
> >>>>ext2_get_group_desc
> >>>>bread
> >>>>__wait_on_buffer
> >>>>schedule
> >>>>
> >>>Huh? You mean bonnie still deadlocks or ?
> >>>
> >>At the time the processes get stuck:
> >>
> >>
> >>[root@dyn-10-123-130-235 vm]# more /proc/meminfo
> >>       total:    used:    free:  shared: buffers:  cached:
> >>Mem:  3709870080 3699126272 10743808        0 18313216 3531255808
> >>Swap: 1077501952        0 1077501952
> >>MemTotal:      3622920 kB
> >>MemFree:         10492 kB
> >>MemShared:           0 kB
> >>Buffers:         17884 kB
> >>Cached:        3448492 kB
> >>SwapCached:          0 kB
> >>Active:          25252 kB
> >>Inactive:      3445344 kB
> >>HighTotal:     2752512 kB
> >>HighFree:         2120 kB
> >>LowTotal:       870408 kB
> >>LowFree:          8372 kB
> >>SwapTotal:     1052248 kB
> >>SwapFree:      1052248 kB
> >>
> >
> >Ok, so just to confirm: You're still getting pauses with Andrea's patches
> >but no hangs anymore?
> >
> >Correct?
> >
> Hi Marcelo,
> 
> I have applied Andrea's patch to two kernels:
> 
> 1. Stock 2.4.20
> 2. 2.4.20 with the io_request_lock removed.
> 
> The tests on the first one are still going. The tests on the second one 
> showed processes getting stuck for long times (> 5 minutes) and not 
> paused ...

Sorry if it's a dumb question, but what is the "io_request_lock removed"
thing? I hope you didn't delete any io_request_lock; if you did, you can
get worse things than crashes (i.e. mm/fs corruption). The pausing bug
was a genuine race (quite innocent: if you could trigger a disk unplug
you could recover from it).
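
To make the kind of race concrete, here is a minimal single-threaded
model of a missed wakeup. It is only a sketch: the toy_* names are
hypothetical, nothing here is taken from the fix-pausing patch, and the
"waker running in the gap" is simulated with a flag rather than a second
CPU. It shows why testing the condition before registering on the wait
queue can lose the wakeup, and why a later wakeup (such as a disk unplug)
recovers the sleeper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one sleeper and one waker; NOT kernel code. */
struct toy_waitq {
    bool condition;  /* e.g. "the buffer is unlocked" */
    bool on_queue;   /* sleeper is registered on the wait queue */
    bool asleep;     /* sleeper is blocked waiting for a wakeup */
};

/* Waker: make the condition true and wake a registered sleeper. */
void toy_wake_up(struct toy_waitq *q)
{
    assert(q != NULL);
    q->condition = true;
    if (q->on_queue)
        q->asleep = false;
}

/* Racy ordering: test the condition, then register and sleep.
 * wake_between simulates the waker running in the gap.
 * Returns true if the sleeper ends up blocked. */
bool toy_sleep_racy(struct toy_waitq *q, bool wake_between)
{
    assert(q != NULL);
    if (q->condition)
        return false;        /* nothing to wait for */
    if (wake_between)
        toy_wake_up(q);      /* nobody registered yet: wakeup is lost */
    q->on_queue = true;
    q->asleep = true;        /* sleeps although the condition is true */
    return q->asleep;
}

/* Safe ordering: register first, then re-test before sleeping. */
bool toy_sleep_safe(struct toy_waitq *q, bool wake_between)
{
    assert(q != NULL);
    q->on_queue = true;
    if (wake_between)
        toy_wake_up(q);
    if (!q->condition)
        q->asleep = true;
    q->on_queue = q->asleep; /* leave the queue if we never slept */
    return q->asleep;
}
```

In the racy ordering the sleeper blocks even though the condition is
already true, and it stays blocked until something calls toy_wake_up()
again, which is the "recover on unplug" behaviour in this toy model.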

Andrea


* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 20:10                     ` Andrea Arcangeli
@ 2003-05-27 20:24                       ` Marc-Christian Petersen
  2003-05-27 20:45                         ` Andrea Arcangeli
  2003-05-27 20:55                         ` Jens Axboe
  0 siblings, 2 replies; 142+ messages in thread
From: Marc-Christian Petersen @ 2003-05-27 20:24 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Marcelo Tosatti, linux-kernel, Carl-Daniel Hailfinger, manish,
	Christian Klose, William Lee Irwin III

On Tuesday 27 May 2003 22:10, Andrea Arcangeli wrote:

Hi Andrea,

> 10/15 performance drop doesn't sound good, no matter what hardware ;).
lol, well. YES ;)

> However in contest I recall there was quite an improvement in latency at
> least (I mean, it had some positive effect too)
Yeah, but latency != throughput ;)

> Getting the best throughput and latency at the same time is normally not
> possible, however evaluating if it's losing excessive throughput given a
> certain latency improvement is difficult.
It is possible. I use 2.5 (preferably the -mm tree) now more than any 2.4*.
I use the AS (Anticipatory IO Scheduler) which AKPM included in his tree.
This scheduler is kicking ass. Everything is rock fast, I can thrash my HD
as much as I want and I still get no mouse stops, keyboard stops or anything
like that. Even starting up multiple programs is possible while thrashing the
HD. Sure, it takes longer but it works :)

I have been trying to backport BIO and then AS for well over 2 weeks now, but
it seems, at least for me, that it's an impossible mission ;(


> I'll try to find what's the precise reason of the interactivity drop
cool. Thanks.

> with the 2.4.18->2.4.19 blkdev changes on Thu. I think I shortly looked
> into it once but there was no definitive answer, or anyways going back
> to the 2.4.18 code didn't appeal or make much sense.
Yeah, that's not an option. The throughput has been increased in 2.4.19 
compared to 2.4.18.

> However I suspect this responsiveness issue could be storage hardware
> dependent.
Hmm, I am quite sure that it isn't. I have tons of mostly totally different
hardware in my company, plus test machines for WOLK at freenet.de (the
biggest I had was a QUAD Xeon 1GHz with 16GB memory and hardware RAID, a Compaq
ML570 to be exact, f*cking nice machine btw. ;) and I even hit it on that
machine. Friends of mine with different hardware than mine are also hitting
that bug. _If_ it's a matter of storage hardware, then a lot of storage
hardware is affected ;)

> The sentence by Linus in the last few days while talking with Jens,
> about storage that reorders stuff and starve requests at the two ends of
> the platter was very scary, maybe you're really bitten by something like
> that. Linux does the right thing but your hardware keeps posting stuff
> under the os and mine doesn't.
Oh, did I miss something at lkml or was it privately?

ciao, Marc



* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 20:08                       ` Marcelo Tosatti
@ 2003-05-27 20:25                         ` Andrea Arcangeli
  2003-05-27 22:18                           ` Andrew Morton
  0 siblings, 1 reply; 142+ messages in thread
From: Andrea Arcangeli @ 2003-05-27 20:25 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Marc-Christian Petersen, linux-kernel, Carl-Daniel Hailfinger,
	manish, Christian Klose, William Lee Irwin III

On Tue, May 27, 2003 at 05:08:38PM -0300, Marcelo Tosatti wrote:
> 
> 
> On Tue, 27 May 2003, Andrea Arcangeli wrote:
> 
> > > It seems your "fix-pausing" patch is fixing a potential wakeup
> > > miss, right? (I looked quickly throught it). Could you explain me the
> >
> > yes, not just one but multiple of them, all similar. lots of boxes were
> > hanging in a weird manner until I found and fixed this glitch.
> >
> > > problem its trying to fix and how?
> >
> > I'm attaching the old email, it should have all the explanataions.
> >
> > but don't use that old patch (that was the first revision and it missed
> > one last race in wait_for_request noticed by Chris or Andrew [or
> > both?]), use this one instead (seems just the second revision, should be
> > that one plus that last race fix):
> >
> > 	http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.21rc2aa1/9980_fix-pausing-2
> 
> I wonder if the additional wakeups result in performance degradation (not
> that it matters much in case there is no other way to fix the problem).

in theory yes.

> 
> But anyway I would like to have some numbers with/without the patch.
> 
> Do you have them ?

Hmm, in bigbox.html we should find the difference of the timings
before/after, and I recall it wasn't measurable. I can search for it on
Thu if you want the exact numbers.

However the last numbers from Randy showed my tree going faster than 2.5
with bonnie and tiotest so I think we don't need to worry and I would
probably not fix it in a different way in 2.4 even if it would mean a 1%
degradation. When it was shipped there was no time to measure any
degradation but the problem it fixes is so severe that we never had any
doubt about whether to include it or not ;).

Andrea


* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 20:20                                 ` Andrea Arcangeli
@ 2003-05-27 20:25                                   ` Marc-Christian Petersen
  2003-05-27 20:42                                     ` manish
  0 siblings, 1 reply; 142+ messages in thread
From: Marc-Christian Petersen @ 2003-05-27 20:25 UTC (permalink / raw)
  To: Andrea Arcangeli, manish
  Cc: Marcelo Tosatti, linux-kernel, Carl-Daniel Hailfinger,
	Christian Klose, William Lee Irwin III

On Tuesday 27 May 2003 22:20, Andrea Arcangeli wrote:

Hi Andrea,


> > 1. Stock 2.4.20
> > 2. 2.4.20 with the io_request_lock removed.
> > The tests on the first one are still going. The tests on the second one
> > showed processes getting stuck for long times (> 5 minutes) and not
> > paused ...
> sorry if it's a dumb question but what is the "io_request_lock removed"
> thing? Hope you didn't delete any io_request_lock, if you did you can
> get worse things than crashes (i.e. mm/fs corruption). the pausing bug
> was a genuine race (quite innocent, if you could trigger a disk unplug
> you could recover from it)
>
> Andrea
funny. I asked him the same ;)

see his response:

-----------------------------------------------------------------------
>what is this io_request_lock patch you are talking about?
>
>ciao, Marc
>
We made some changes to the 2.4.20 kernel to remove the io_request_lock 
and replace with queue_lock and host_lock.
-----------------------------------------------------------------------

ciao, Marc



* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 20:25                                   ` Marc-Christian Petersen
@ 2003-05-27 20:42                                     ` manish
  2003-05-27 20:47                                       ` Andrea Arcangeli
  0 siblings, 1 reply; 142+ messages in thread
From: manish @ 2003-05-27 20:42 UTC (permalink / raw)
  To: Marc-Christian Petersen
  Cc: Andrea Arcangeli, Marcelo Tosatti, linux-kernel,
	Carl-Daniel Hailfinger, Christian Klose, William Lee Irwin III

Marc-Christian Petersen wrote:

>On Tuesday 27 May 2003 22:20, Andrea Arcangeli wrote:
>
>Hi Andrea,
>
>
>>>1. Stock 2.4.20
>>>2. 2.4.20 with the io_request_lock removed.
>>>The tests on the first one are still going. The tests on the second one
>>>showed processes getting stuck for long times (> 5 minutes) and not
>>>paused ...
>>>
>>sorry if it's a dumb question but what is the "io_request_lock removed"
>>thing? Hope you didn't delete any io_request_lock, if you did you can
>>get worse things than crashes (i.e. mm/fs corruption). the pausing bug
>>was a genuine race (quite innocent, if you could trigger a disk unplug
>>you could recover from it)
>>
>>Andrea
>>
>funny. I asked him the same ;)
>
>see his response:
>
>-----------------------------------------------------------------------
>
>>what is this io_request_lock patch you are talking about?
>>
>>ciao, Marc
>>
>We made some changes to the 2.4.20 kernel to remove the io_request_lock 
>and replace with queue_lock and host_lock.
>-----------------------------------------------------------------------
>
>ciao, Marc
>
We made a change in the 2.4.20 kernel to remove the io_request_lock and
replace it with the host_lock and the queue_lock. Probably not the right
thing to do.

Thanks
Manish





* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 20:24                       ` Marc-Christian Petersen
@ 2003-05-27 20:45                         ` Andrea Arcangeli
  2003-05-27 20:53                           ` Marc-Christian Petersen
  2003-05-27 20:55                         ` Jens Axboe
  1 sibling, 1 reply; 142+ messages in thread
From: Andrea Arcangeli @ 2003-05-27 20:45 UTC (permalink / raw)
  To: Marc-Christian Petersen
  Cc: Marcelo Tosatti, linux-kernel, Carl-Daniel Hailfinger, manish,
	Christian Klose, William Lee Irwin III

On Tue, May 27, 2003 at 10:24:22PM +0200, Marc-Christian Petersen wrote:
> I try to backport BIO and then AS for quite over 2 weeks now, but it seems, at 
> least for me, that it's an impossible mission ;(

bio breaks all drivers, not a good idea to backport ;)

note that the anticipatory scheduler generates very bad results with the
winmark. it certainly has merits but it has large downsides too.

I would also be curious if you could compare anticipatory with CFQ.
CFQ was designed to provide the highest possible degree of fairness.

> > I'll try to find what's the precise reason of the interactivity drop
> cool. Thanks.
> 
> > with the 2.4.18->2.4.19 blkdev changes on Thu. I think I shortly looked
> > into it once but there was no definitive answer, or anyways going back
> > to the 2.4.18 code didn't appeal or make much sense.
> Yeah, that's not an option. The throughput has been increased in 2.4.19 
> compared to 2.4.18.

agreed.

> 
> > However I suspect this responsiveness issue could be storage hardware
> > dependent.
> Hmm, I am quite sure that it isn't. I have ton's of mostly totally different 
> hardware in my company, also test machines for WOLK at freenet.de (the 
> biggest I had was a QUAD Xeon 1GHz with 16GB memory and hardware RAID (Compaq 
> ML570 to be exact (f*cking nice machine btw. ;) and I even hit it on that 
> machine. Friends of mine having also different hardware then me, also hitting 
> that bug. _If_ it's the case of storage hardware, then many storage hardware 
> is affected ;)

;)

> > The sentence by Linus in the last few days while talking with Jens,
> > about storage that reorders stuff and starve requests at the two ends of
> > the platter was very scary, maybe you're really bitten by something like
> > that. Linux does the right thing but your hardware keeps posting stuff
> > under the os and mine doesn't.
> Oh, did I miss something at lkml or was it privately?

I read it on l-k a few days ago; search for emails from Linus with
Jens somewhere in CC and you should find it.

Andrea


* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 20:42                                     ` manish
@ 2003-05-27 20:47                                       ` Andrea Arcangeli
  2003-05-27 20:50                                         ` manish
  0 siblings, 1 reply; 142+ messages in thread
From: Andrea Arcangeli @ 2003-05-27 20:47 UTC (permalink / raw)
  To: manish
  Cc: Marc-Christian Petersen, Marcelo Tosatti, linux-kernel,
	Carl-Daniel Hailfinger, Christian Klose, William Lee Irwin III

On Tue, May 27, 2003 at 01:42:32PM -0700, manish wrote:
> Marc-Christian Petersen wrote:
> 
> >On Tuesday 27 May 2003 22:20, Andrea Arcangeli wrote:
> >
> >Hi Andrea,
> >
> >
> >>>1. Stock 2.4.20
> >>>2. 2.4.20 with the io_request_lock removed.
> >>>The tests on the first one are still going. The tests on the second one
> >>>showed processes getting stuck for long times (> 5 minutes) and not
> >>>paused ...
> >>>
> >>sorry if it's a dumb question but what is the "io_request_lock removed"
> >>thing? Hope you didn't delete any io_request_lock, if you did you can
> >>get worse things than crashes (i.e. mm/fs corruption). the pausing bug
> >>was a genuine race (quite innocent, if you could trigger a disk unplug
> >>you could recover from it)
> >>
> >>Andrea
> >>
> >funny. I asked him the same ;)
> >
> >see his response:
> >
> >-----------------------------------------------------------------------
> >
> >>what is this io_request_lock patch you are talking about?
> >>
> >>ciao, Marc
> >>
> >We made some changes to the 2.4.20 kernel to remove the io_request_lock 
> >and replace with queue_lock and host_lock.
> >-----------------------------------------------------------------------
> >
> >ciao, Marc
> >
> We made a change in the 2.4.20 kernel to remove the io_request_lock and 
> replace with the host_lock and the queue_lock.  Probably, not a right 
> thing to do

right you are, but never mind, just remember to e2fsck the fs before
booting the box so you don't risk fs corruption later with the solid
kernels.

Andrea


* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 20:47                                       ` Andrea Arcangeli
@ 2003-05-27 20:50                                         ` manish
  2003-05-27 21:05                                           ` Andrea Arcangeli
  0 siblings, 1 reply; 142+ messages in thread
From: manish @ 2003-05-27 20:50 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Marc-Christian Petersen, Marcelo Tosatti, linux-kernel,
	Carl-Daniel Hailfinger, Christian Klose, William Lee Irwin III

Andrea Arcangeli wrote:

>On Tue, May 27, 2003 at 01:42:32PM -0700, manish wrote:
>
>>Marc-Christian Petersen wrote:
>>
>>>On Tuesday 27 May 2003 22:20, Andrea Arcangeli wrote:
>>>
>>>Hi Andrea,
>>>
>>>
>>>>>1. Stock 2.4.20
>>>>>2. 2.4.20 with the io_request_lock removed.
>>>>>The tests on the first one are still going. The tests on the second one
>>>>>showed processes getting stuck for long times (> 5 minutes) and not
>>>>>paused ...
>>>>>
>>>>sorry if it's a dumb question but what is the "io_request_lock removed"
>>>>thing? Hope you didn't delete any io_request_lock, if you did you can
>>>>get worse things than crashes (i.e. mm/fs corruption). the pausing bug
>>>>was a genuine race (quite innocent, if you could trigger a disk unplug
>>>>you could recover from it)
>>>>
>>>>Andrea
>>>>
>>>funny. I asked him the same ;)
>>>
>>>see his response:
>>>
>>>-----------------------------------------------------------------------
>>>
>>>>what is this io_request_lock patch you are talking about?
>>>>
>>>>ciao, Marc
>>>>
>>>We made some changes to the 2.4.20 kernel to remove the io_request_lock 
>>>and replace with queue_lock and host_lock.
>>>-----------------------------------------------------------------------
>>>
>>>ciao, Marc
>>>
>>We made a change in the 2.4.20 kernel to remove the io_request_lock and 
>>replace with the host_lock and the queue_lock.  Probably, not a right 
>>thing to do
>>
>
>right you are, but never mind, only remeber e2fsck the fs before
>booting the box so you don't risk fs corruption later with the solid
>kernels.
>
>Andrea
>
So, does it imply that we cannot remove the io_request_lock in 2.4 at all?

Thanks
Manish





* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 20:45                         ` Andrea Arcangeli
@ 2003-05-27 20:53                           ` Marc-Christian Petersen
  2003-05-27 21:00                             ` Jens Axboe
  0 siblings, 1 reply; 142+ messages in thread
From: Marc-Christian Petersen @ 2003-05-27 20:53 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Marcelo Tosatti, linux-kernel, Carl-Daniel Hailfinger, manish,
	Christian Klose, William Lee Irwin III

On Tuesday 27 May 2003 22:45, Andrea Arcangeli wrote:

Hi Andrea,

> > I try to backport BIO and then AS for quite over 2 weeks now, but it
> > seems, at least for me, that it's an impossible mission ;(
> bio breaks all drivers, not a good idea to backport ;)
HAHAHAH. Another wasted 2 weeks in my life ;-)

But why does it break all drivers? Could you please elaborate a bit?

> note that the anticipatory scheduler generates very bad results with the
> winmark. it certainly has merits but it has large downsides too.
hmm, I am not aware of it, or rather I _was_ not aware of it till now.

> I would be also curious if you could compare anticipatory with CFQ. The
> CFQ was designed to provide the highest possible degree of fariness.
I can bench it, sure. I used CFQ before I switched to AS because I was
curious about AS, and as I didn't see a real difference in latency but AS gave
me more throughput, I use AS from now on.

> I read it on l-k yesterday a few days ago, search emails from Linus with
> Jens somewhere in CC and you should find it.
Already found it :) thank you.

ciao, Marc



* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 20:24                       ` Marc-Christian Petersen
  2003-05-27 20:45                         ` Andrea Arcangeli
@ 2003-05-27 20:55                         ` Jens Axboe
  2003-05-27 21:05                           ` William Lee Irwin III
  1 sibling, 1 reply; 142+ messages in thread
From: Jens Axboe @ 2003-05-27 20:55 UTC (permalink / raw)
  To: Marc-Christian Petersen
  Cc: Andrea Arcangeli, Marcelo Tosatti, linux-kernel,
	Carl-Daniel Hailfinger, manish, Christian Klose,
	William Lee Irwin III

On Tue, May 27 2003, Marc-Christian Petersen wrote:
> I try to backport BIO and then AS for quite over 2 weeks now, but it
> seems, at least for me, that it's an impossible mission ;(

You're nuts, that's not only incredibly silly, it's not even needed for
what you want.

What you want is the proper io scheduler abstraction interface. With
that in place, you can port the 2.5 io schedulers without too much
trouble. They have very few dependencies on bio itself ('bio' has
become one of the most abused terms in 2.5. I use it only to describe the
io structure).

You basically need to pin down users that directly manipulate the queue
to extract/insert requests. So step one is doing elv_add_request(),
elv_next_request(), and elv_remove_request(). That is a 1:1 mapping to
what 2.4 has right now, so you should be able to accomplish this change
without changing how the code works.
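
As a rough illustration of that abstraction, the sketch below routes all
queue manipulation through three entry points backed by a table of
function pointers, so a scheduler can be swapped without touching the
callers. Everything here is an assumption made for illustration: the
toy_* types and signatures are hypothetical and are not the 2.4 or 2.5
kernel interfaces; only the three operation names mirror the ones
mentioned above. The example scheduler simply keeps requests sorted by
sector.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request and queue types; NOT the kernel's structures. */
struct toy_request {
    unsigned long sector;
    struct toy_request *next;
};

struct toy_queue;

/* Operations a pluggable io scheduler has to provide. */
struct toy_elevator_ops {
    void (*add_request)(struct toy_queue *q, struct toy_request *rq);
    struct toy_request *(*next_request)(struct toy_queue *q);
    void (*remove_request)(struct toy_queue *q, struct toy_request *rq);
};

struct toy_queue {
    struct toy_request *head;
    const struct toy_elevator_ops *ops;
};

/* Callers use only these wrappers and never walk q->head themselves. */
void toy_elv_add_request(struct toy_queue *q, struct toy_request *rq)
{
    assert(q != NULL && q->ops != NULL);
    q->ops->add_request(q, rq);
}

struct toy_request *toy_elv_next_request(struct toy_queue *q)
{
    assert(q != NULL && q->ops != NULL);
    return q->ops->next_request(q);
}

void toy_elv_remove_request(struct toy_queue *q, struct toy_request *rq)
{
    assert(q != NULL && q->ops != NULL);
    q->ops->remove_request(q, rq);
}

/* Example scheduler: keep the list sorted by ascending sector. */
static void sorted_add(struct toy_queue *q, struct toy_request *rq)
{
    struct toy_request **pp = &q->head;

    while (*pp != NULL && (*pp)->sector <= rq->sector)
        pp = &(*pp)->next;
    rq->next = *pp;
    *pp = rq;
}

static struct toy_request *sorted_next(struct toy_queue *q)
{
    return q->head;  /* lowest sector first; NULL when empty */
}

static void sorted_remove(struct toy_queue *q, struct toy_request *rq)
{
    struct toy_request **pp = &q->head;

    while (*pp != NULL && *pp != rq)
        pp = &(*pp)->next;
    if (*pp != NULL) {
        *pp = rq->next;
        rq->next = NULL;
    }
}

const struct toy_elevator_ops toy_sorted_ops = {
    sorted_add, sorted_next, sorted_remove
};
```

A different scheduler would supply another ops table; code that only
calls the three toy_elv_*() wrappers would not need to change.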

But still, why on earth waste your time with something like this now
when we are so close to 2.6? 2.4 is a stable code base, it should stay
that way. I'm really not interested in more esoteric 2.4 backports, the
vendor kernels are bad enough as it is.

-- 
Jens Axboe



* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 20:53                           ` Marc-Christian Petersen
@ 2003-05-27 21:00                             ` Jens Axboe
  2003-05-27 21:11                               ` Marc-Christian Petersen
  0 siblings, 1 reply; 142+ messages in thread
From: Jens Axboe @ 2003-05-27 21:00 UTC (permalink / raw)
  To: Marc-Christian Petersen
  Cc: Andrea Arcangeli, Marcelo Tosatti, linux-kernel,
	Carl-Daniel Hailfinger, manish, Christian Klose,
	William Lee Irwin III

On Tue, May 27 2003, Marc-Christian Petersen wrote:
> On Tuesday 27 May 2003 22:45, Andrea Arcangeli wrote:
> 
> Hi Andrea,
> 
> > > I try to backport BIO and then AS for quite over 2 weeks now, but it
> > > seems, at least for me, that it's an impossible mission ;(
> > bio breaks all drivers, not a good idea to backport ;)
> HAHAHAH. Another wasted 2 weeks in my life ;-)
> 
> But why does it brake all drivers? Could you please elaborate a bit?

Are you serious? Please tell me you haven't spent two weeks on the
project without realising this?

I think the problem here is that you are saying 'bio' when you really
mean something else. bio is the 2.5 io structure. What _exactly_ do you
mean with 'backporting bio'? I don't think you have the slightest idea
of the nastiness involved with doing something like that.

-- 
Jens Axboe



* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 20:55                         ` Jens Axboe
@ 2003-05-27 21:05                           ` William Lee Irwin III
  2003-05-27 21:18                             ` Jens Axboe
  2003-05-27 21:33                             ` Andrea Arcangeli
  0 siblings, 2 replies; 142+ messages in thread
From: William Lee Irwin III @ 2003-05-27 21:05 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Marc-Christian Petersen, Andrea Arcangeli, Marcelo Tosatti,
	linux-kernel, Carl-Daniel Hailfinger, manish, Christian Klose

On Tue, May 27, 2003 at 10:55:16PM +0200, Jens Axboe wrote:
> But still, why on earth waste your time with something like this now
> when we are so close to 2.6? 2.4 is a stable code base, it should stay
> that way. I'm really not interested in more esoteric 2.4 backports, the
> vendor kernels are bad enough as it is.

They've backported everything else, so I guess it stood to reason it'd
happen eventually.

I, for one, got a good laugh out of it. =) Makes me wonder if the 2.4
distro backport trees' diffs are bigger than 2.4 itself yet.


-- wli


* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 20:50                                         ` manish
@ 2003-05-27 21:05                                           ` Andrea Arcangeli
  0 siblings, 0 replies; 142+ messages in thread
From: Andrea Arcangeli @ 2003-05-27 21:05 UTC (permalink / raw)
  To: manish
  Cc: Marc-Christian Petersen, Marcelo Tosatti, linux-kernel,
	Carl-Daniel Hailfinger, Christian Klose, William Lee Irwin III

On Tue, May 27, 2003 at 01:50:55PM -0700, manish wrote:
> Andrea Arcangeli wrote:
> 
> >On Tue, May 27, 2003 at 01:42:32PM -0700, manish wrote:
> >
> >>Marc-Christian Petersen wrote:
> >>
> >>>On Tuesday 27 May 2003 22:20, Andrea Arcangeli wrote:
> >>>
> >>>Hi Andrea,
> >>>
> >>>
> >>>>>1. Stock 2.4.20
> >>>>>2. 2.4.20 with the io_request_lock removed.
> >>>>>The tests on the first one are still going. The tests on the second one
> >>>>>showed processes getting stuck for long times (> 5 minutes) and not
> >>>>>paused ...
> >>>>>
> >>>>sorry if it's a dumb question but what is the "io_request_lock removed"
> >>>>thing? Hope you didn't delete any io_request_lock, if you did you can
> >>>>get worse things than crashes (i.e. mm/fs corruption). the pausing bug
> >>>>was a genuine race (quite innocent, if you could trigger a disk unplug
> >>>>you could recover from it)
> >>>>
> >>>>Andrea
> >>>>
> >>>funny. I asked him the same ;)
> >>>
> >>>see his response:
> >>>
> >>>-----------------------------------------------------------------------
> >>>
> >>>>what is this io_request_lock patch you are talking about?
> >>>>
> >>>>ciao, Marc
> >>>>
> >>>We made some changes to the 2.4.20 kernel to remove the io_request_lock 
> >>>and replace with queue_lock and host_lock.
> >>>-----------------------------------------------------------------------
> >>>
> >>>ciao, Marc
> >>>
> >>We made a change in the 2.4.20 kernel to remove the io_request_lock and 
> >>replace with the host_lock and the queue_lock.  Probably, not a right 
> >>thing to do
> >>
> >
> >right you are, but never mind, only remeber e2fsck the fs before
> >booting the box so you don't risk fs corruption later with the solid
> >kernels.
> >
> >Andrea
> >
> So, does it imply that we cannot remove the io_request_lock in 2.4 at all?

io_request_lock can at most be made per-device in 2.4; this is already the
case in my tree, for instance. Locks are there for a reason: unless you
redesign the code to scale better, you can't just drop them and
expect stuff to work. But the io_request_lock has nothing to do with
either the hangs or the delays; it only hurts scalability if you have lots
of devices and lots of cpus.
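
The scalability point can be sketched with a toy model. This is only an
illustration under stated assumptions: the toy_* names are hypothetical,
the "lock" is a plain flag with a trylock rather than a real spinlock,
and no threads are involved. It shows that with one global lock a
request for device B is blocked while device A's queue is being worked
on, whereas with per-device locks only users of the same device contend.
In both cases every queue is still protected by some lock; none is
simply dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy lock: a flag with trylock semantics; NOT a real spinlock. */
struct toy_lock {
    bool held;
};

bool toy_trylock(struct toy_lock *l)
{
    assert(l != NULL);
    if (l->held)
        return false;
    l->held = true;
    return true;
}

void toy_unlock(struct toy_lock *l)
{
    assert(l != NULL && l->held);
    l->held = false;
}

/* Hypothetical device: its queue is guarded by whatever lock
 * queue_lock points at (one global lock, or one lock per device). */
struct toy_device {
    struct toy_lock *queue_lock;
    int queued;
};

/* Queue one request; returns false if the guarding lock is busy. */
bool toy_submit(struct toy_device *dev)
{
    assert(dev != NULL);
    if (!toy_trylock(dev->queue_lock))
        return false;
    dev->queued++;               /* critical section */
    toy_unlock(dev->queue_lock);
    return true;
}
```

Pointing both devices at the same toy_lock models the global
io_request_lock; giving each device its own toy_lock models the
per-device variant.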

Andrea


* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 21:00                             ` Jens Axboe
@ 2003-05-27 21:11                               ` Marc-Christian Petersen
  2003-05-27 21:19                                 ` Jens Axboe
  0 siblings, 1 reply; 142+ messages in thread
From: Marc-Christian Petersen @ 2003-05-27 21:11 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Andrea Arcangeli, Marcelo Tosatti, linux-kernel,
	Carl-Daniel Hailfinger, manish, Christian Klose,
	William Lee Irwin III

On Tuesday 27 May 2003 23:00, Jens Axboe wrote:

Hi Jens,

> Are you serious? Please tell me you haven't spend two weeks on the
> project not realising this?
Well, 2 weeks means no more than 5 or 6 hours in total, just spread over many days.

And it was more about getting deeper into the code than a real attempt to
backport it. NM.

ciao, Marc



* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 21:05                           ` William Lee Irwin III
@ 2003-05-27 21:18                             ` Jens Axboe
  2003-05-27 21:33                             ` Andrea Arcangeli
  1 sibling, 0 replies; 142+ messages in thread
From: Jens Axboe @ 2003-05-27 21:18 UTC (permalink / raw)
  To: William Lee Irwin III, Marc-Christian Petersen, Andrea Arcangeli,
	Marcelo Tosatti, linux-kernel, Carl-Daniel Hailfinger, manish,
	Christian Klose

On Tue, May 27 2003, William Lee Irwin III wrote:
> I, for one, got a good laugh out of it. =) Makes me wonder if the 2.4
> distro backport trees' diffs are bigger than 2.4 itself yet.

Heh, well they're open for inspection, it's probably not far off :)

-- 
Jens Axboe



* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 21:11                               ` Marc-Christian Petersen
@ 2003-05-27 21:19                                 ` Jens Axboe
  0 siblings, 0 replies; 142+ messages in thread
From: Jens Axboe @ 2003-05-27 21:19 UTC (permalink / raw)
  To: Marc-Christian Petersen
  Cc: Andrea Arcangeli, Marcelo Tosatti, linux-kernel,
	Carl-Daniel Hailfinger, manish, Christian Klose,
	William Lee Irwin III

On Tue, May 27 2003, Marc-Christian Petersen wrote:
> On Tuesday 27 May 2003 23:00, Jens Axboe wrote:
> 
> Hi Jens,
> 
> > Are you serious? Please tell me you haven't spend two weeks on the
> > project not realising this?
> Well, 2 weeks means in hours not more than 5 or 6 just delayed over many days.
> 
> And it was further just to go deeper into the code, not a real attempt to 
> backport it. NM.

A bigger analysis of the problem before starting mindless (and useless)
porting would have brought you a lot farther :)

If you're just looking to port some io schedulers, the explanation I
left you in the previous mail should be plenty to get you started.

-- 
Jens Axboe



* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 21:05                           ` William Lee Irwin III
  2003-05-27 21:18                             ` Jens Axboe
@ 2003-05-27 21:33                             ` Andrea Arcangeli
  1 sibling, 0 replies; 142+ messages in thread
From: Andrea Arcangeli @ 2003-05-27 21:33 UTC (permalink / raw)
  To: William Lee Irwin III, Jens Axboe, Marc-Christian Petersen,
	Marcelo Tosatti, linux-kernel, Carl-Daniel Hailfinger, manish,
	Christian Klose

On Tue, May 27, 2003 at 02:05:18PM -0700, William Lee Irwin III wrote:
> They've backported everything else, so I guess it stood to reason it'd
> happen eventually.

you probably forgot we have varyio in 2.4 due to the lack of bio ;)

Andrea

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 20:25                         ` Andrea Arcangeli
@ 2003-05-27 22:18                           ` Andrew Morton
  2003-05-27 22:38                             ` Andrea Arcangeli
  0 siblings, 1 reply; 142+ messages in thread
From: Andrew Morton @ 2003-05-27 22:18 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: marcelo, m.c.p, linux-kernel, c-d.hailfinger.kernel.2003, manish,
	christian.klose, wli

Andrea Arcangeli <andrea@suse.de> wrote:
>
> However the last numbers from Randy showed my tree going faster than 2.5
> with bonnie and tiotest so I think we don't need to worry and I would
> probably not fix it in a different way in 2.4 even if it would mean a 1%
> degradation.

That could be because -aa quadruples the size of the VM readahead window.

Changes such as that should be removed when assessing the performance
impact of this particular patch.



^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 22:18                           ` Andrew Morton
@ 2003-05-27 22:38                             ` Andrea Arcangeli
  2003-05-27 22:40                               ` Andrew Morton
  0 siblings, 1 reply; 142+ messages in thread
From: Andrea Arcangeli @ 2003-05-27 22:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: marcelo, m.c.p, linux-kernel, c-d.hailfinger.kernel.2003, manish,
	christian.klose, wli

On Tue, May 27, 2003 at 03:18:30PM -0700, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> > However the last numbers from Randy showed my tree going faster than 2.5
> > with bonnie and tiotest so I think we don't need to worry and I would
> > probably not fix it in a different way in 2.4 even if it would mean a 1%
> > degradation.
> 
> That could be because -aa quadruples the size of the VM readahead window.
> 
> Changes such as that should be removed when assessing the performance
> impact of this particular patch.

I understand that was a generic benchmark against 2.5, not meant to
evaluate the effect of the fixed readahead (see the name of the patch
"readahead-got-broken-somehwere"). I don't see any good reason why
Randy should cripple my tree before benchmarking against 2.5. If
anything, it's ok to apply some of my patches to 2.5, that's great; the
other way around is not, IMHO.

Andrea

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 22:38                             ` Andrea Arcangeli
@ 2003-05-27 22:40                               ` Andrew Morton
  2003-05-27 22:58                                 ` Andrea Arcangeli
  0 siblings, 1 reply; 142+ messages in thread
From: Andrew Morton @ 2003-05-27 22:40 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: marcelo, m.c.p, linux-kernel, c-d.hailfinger.kernel.2003, manish,
	christian.klose, wli

Andrea Arcangeli <andrea@suse.de> wrote:
>
> On Tue, May 27, 2003 at 03:18:30PM -0700, Andrew Morton wrote:
> > Andrea Arcangeli <andrea@suse.de> wrote:
> > >
> > > However the last numbers from Randy showed my tree going faster than 2.5
> > > with bonnie and tiotest so I think we don't need to worry and I would
> > > probably not fix it in a different way in 2.4 even if it would mean a 1%
> > > degradation.
> > 
> > That could be because -aa quadruples the size of the VM readahead window.
> > 
> > Changes such as that should be removed when assessing the performance
> > impact of this particular patch.
> 
> I understand that was a generic benchmark against 2.5, not meant to
> evaluate the effect of the fixed readahead (see the name of the patch
> "readahead-got-broken-somehwere"). I don't see any good reason why
> should Randy cripple down my tree before benchmarking against 2.5? if
> something it's ok to apply some of my patches to 2.5, that's great, the
> other way around not IMHO.
> 

No.

What I am saying is that evaluation of the effect of an IO scheduler change
cannot be performed when there is a 4:1 change in the readahead window present
in the same tree.

ie: we cannot conclude anything about the effect of the IO scheduler change
from Randy's numbers.  Too many variables.



^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 22:40                               ` Andrew Morton
@ 2003-05-27 22:58                                 ` Andrea Arcangeli
  0 siblings, 0 replies; 142+ messages in thread
From: Andrea Arcangeli @ 2003-05-27 22:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: marcelo, m.c.p, linux-kernel, c-d.hailfinger.kernel.2003, manish,
	christian.klose, wli

On Tue, May 27, 2003 at 03:40:49PM -0700, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> > On Tue, May 27, 2003 at 03:18:30PM -0700, Andrew Morton wrote:
> > > Andrea Arcangeli <andrea@suse.de> wrote:
> > > >
> > > > However the last numbers from Randy showed my tree going faster than 2.5
> > > > with bonnie and tiotest so I think we don't need to worry and I would
> > > > probably not fix it in a different way in 2.4 even if it would mean a 1%
> > > > degradation.
> > > 
> > > That could be because -aa quadruples the size of the VM readahead window.
> > > 
> > > Changes such as that should be removed when assessing the performance
> > > impact of this particular patch.
> > 
> > I understand that was a generic benchmark against 2.5, not meant to
> > evaluate the effect of the fixed readahead (see the name of the patch
> > "readahead-got-broken-somehwere"). I don't see any good reason why
> > should Randy cripple down my tree before benchmarking against 2.5? if
> > something it's ok to apply some of my patches to 2.5, that's great, the
> > other way around not IMHO.
> > 
> 
> No.
> 
> What I am saying is that evaluation of the effect of an IO scheduler change
> cannot be performed when there is a 4:1 change in the readhead window present
> in the same tree.
> 
> ie: we cannot conclude anything about the effect of the IO scheduler change
> from Randy's numbers.  Too many variables.

An accurate evaluation can't be made from such a comparison, but I never
claimed it was an accurate evaluation. I just said we don't need to
worry, i.e. it "can't be too bad".

And that is true: you even admit that a readahead change surely has more
impact than whatever change fix-pausing generated. That's all I meant.
The fact that mainline doesn't do readahead properly is a much worse
thing than whatever slowdown fix-pausing can generate.

Furthermore, I said we can deduce accurate numbers from bigbox.html,
which compares trees with very minor changes (not 2.4 vs 2.5), and that
too shows the fix for the deadlock is not measurable as far as I can tell.

Andrea

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 18:04           ` Marc-Christian Petersen
@ 2003-05-27 23:06             ` Georg Nikodym
  2003-05-27 23:26               ` Christopher S. Aker
  2003-05-28  5:33             ` Con Kolivas
                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 142+ messages in thread
From: Georg Nikodym @ 2003-05-27 23:06 UTC (permalink / raw)
  To: Marc-Christian Petersen
  Cc: manish, Carl-Daniel Hailfinger, Andrea Arcangeli,
	Marcelo Tosatti, linux-kernel, Christian Klose,
	William Lee Irwin III

[-- Attachment #1: Type: text/plain, Size: 322 bytes --]

On Tue, 27 May 2003 20:04:49 +0200
Marc-Christian Petersen <m.c.p@wolk-project.de> wrote:

> ALL: Anyone who has this kind of pauses/stops/mouse is dead/keyboard
> is dead/:
>      speak _NOW_ please, doesn't matter who you are!

Uh, ok.  These pauses have kept me from using anything newer than riel's
2.4.19-rmap15a

-g

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 23:06             ` Georg Nikodym
@ 2003-05-27 23:26               ` Christopher S. Aker
  0 siblings, 0 replies; 142+ messages in thread
From: Christopher S. Aker @ 2003-05-27 23:26 UTC (permalink / raw)
  To: Marc-Christian Petersen
  Cc: linux-kernel, manish, Carl-Daniel Hailfinger, Andrea Arcangeli,
	Marcelo Tosatti, Christian Klose, William Lee Irwin III,
	Georg Nikodym

> ALL: Anyone who has this kind of pauses/stops/mouse is dead/keyboard
> is dead/:  speak _NOW_ please, doesn't matter who you are!

I've been able to reproduce the pauses on two machines with different
motherboards and processors, although each machine has >= 2.5GB RAM.  I can
reproduce this in 2.4.19, 2.4.20, and 2.4.21-rc1/rc2/rc3.

After the machine un-pauses, everything completes/returns to normal.  I don't
experience deadlocked processes.

Both my machines are IDE, using UDMA, with the hdparm settings maxed out;
messing with bdflush and elvtune doesn't make any difference.  Limiting the RAM
on the machines didn't help.

Pauses have lasted anywhere from a few seconds to a few minutes. Anything later
than 2.4.18 is unusable for me because of this.

-Chris



^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 18:04           ` Marc-Christian Petersen
  2003-05-27 23:06             ` Georg Nikodym
@ 2003-05-28  5:33             ` Con Kolivas
  2003-05-28  6:04               ` Jens Axboe
  2003-05-28  7:16             ` Marc Wilson
  2003-05-28  9:36             ` Ragnar Hojland Espinosa
  3 siblings, 1 reply; 142+ messages in thread
From: Con Kolivas @ 2003-05-28  5:33 UTC (permalink / raw)
  To: Marc-Christian Petersen, manish, Carl-Daniel Hailfinger,
	Andrea Arcangeli
  Cc: Marcelo Tosatti, linux-kernel, Christian Klose, William Lee Irwin III

On Wed, 28 May 2003 04:04, Marc-Christian Petersen wrote:
> On Tuesday 27 May 2003 19:50, manish wrote:
>
> Hi Manish,
>
> > It is not a system hang but the processes hang showing the same stack
> > trace. This is certainly not a pause since the bonnie processes that
> > were hung (or deadlocked) never completed after several hrs. The stack
> > trace  was the same.
>
> then you are hitting a different bug or a bug related to the issues
> Christian Klose and me and $tons of others were complaining.
>
> The bug you are hitting might be the problem with "process stuck in D
> state" Andrea Arcangeli fixed, let me guess, over half a year ago or so.
>
> In case you have a good mind to try to address your issue, you might want
> to try out the patch you can find here:
>
> http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.21rc2aa1/9980_fix-pausing-2
>
> ALL: Anyone who has this kind of pauses/stops/mouse is dead/keyboard is
> dead/: speak _NOW_ please, doesn't matter who you are!

Yo!

I'll throw my babushka into the ring too. I think it's obvious from MCP's
comments that I've been involved in testing this problem. I've spent hours,
possibly days, trying to find a way to fix the pauses introduced since
2.4.19pre1. I agree with MCP's description: the machine can come to a
standstill under any sort of disk i/o and is unusable for a variable length
of time. I've been playing with all sorts of numbers in my patchset to try
and limit it, with only mild success. The best results I've had without a
major decrease in throughput came from using akpm's read latency 2 patch and
significantly reducing nr_requests. While changing the number of requests
I discovered that dropping them to 4 fixed the problem but destroyed
write throughput. I was pleased to see AA acknowledge the problem after
my contest results on his kernel, but disappointed that the problem was only
reduced, not fixed.

I have seen it on every piece of desktop hardware on which I have run a
2.4.19+ kernel. I have no idea what the real problem is, but I firmly
believe, with MCP, that it is the biggest flaw in 2.4 on the desktop (no idea
what it does to servers). We've tried over and over again fiddling with the
numbers and patches, and only going back to a kernel older than 2.4.19 fixes
it completely.

Con

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28  5:33             ` Con Kolivas
@ 2003-05-28  6:04               ` Jens Axboe
  2003-05-28  7:13                 ` Con Kolivas
  0 siblings, 1 reply; 142+ messages in thread
From: Jens Axboe @ 2003-05-28  6:04 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Marc-Christian Petersen, manish, Carl-Daniel Hailfinger,
	Andrea Arcangeli, Marcelo Tosatti, linux-kernel, Christian Klose,
	William Lee Irwin III

On Wed, May 28 2003, Con Kolivas wrote:
> On Wed, 28 May 2003 04:04, Marc-Christian Petersen wrote:
> > On Tuesday 27 May 2003 19:50, manish wrote:
> >
> > Hi Manish,
> >
> > > It is not a system hang but the processes hang showing the same stack
> > > trace. This is certainly not a pause since the bonnie processes that
> > > were hung (or deadlocked) never completed after several hrs. The stack
> > > trace  was the same.
> >
> > then you are hitting a different bug or a bug related to the issues
> > Christian Klose and me and $tons of others were complaining.
> >
> > The bug you are hitting might be the problem with "process stuck in D
> > state" Andrea Arcangeli fixed, let me guess, over half a year ago or so.
> >
> > In case you have a good mind to try to address your issue, you might want
> > to try out the patch you can find here:
> >
> > http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.21rc2aa1/9980_fix-pausing-2
> >
> > ALL: Anyone who has this kind of pauses/stops/mouse is dead/keyboard is
> > dead/: speak _NOW_ please, doesn't matter who you are!
> 
> Yo!
> 
> I'll throw my babushka into the ring too. I think it's obvious from MCP's 
> comments that I've been involved in testing this problem. I've spent hours, 
> possibly days trying to find a way to fix the pauses introduced since 
> 2.4.19pre1. I agree with what MCP describes that the machine can come to a 
> standstill under any sort of disk i/o and is unusable for a variable length 
> of time. I've been playing with all sorts of numbers in my patchset to try 
> and limit it with only mild success. The best results I've had without a 
> major decrease in throughput was using akpm's read latency 2 patch but by 
> significantly reducing the nr_requests. It was changing the number of 
> requests that I discovered dropping them to 4 fixed the problem but destroyed 
> write throughput. I was pleased to see AA give the problem recognition after 
> my contest results on his kernel but disappointed that the problem only was 
> reduced, not fixed.

Does the problem change at all if you force batch_requests to 0?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28  6:04               ` Jens Axboe
@ 2003-05-28  7:13                 ` Con Kolivas
  2003-05-28  7:13                   ` Jens Axboe
  0 siblings, 1 reply; 142+ messages in thread
From: Con Kolivas @ 2003-05-28  7:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Marc-Christian Petersen, manish, Carl-Daniel Hailfinger,
	Andrea Arcangeli, Marcelo Tosatti, linux-kernel, Christian Klose,
	William Lee Irwin III

On Wed, 28 May 2003 16:04, Jens Axboe wrote:
> On Wed, May 28 2003, Con Kolivas wrote:
> > On Wed, 28 May 2003 04:04, Marc-Christian Petersen wrote:
> > > On Tuesday 27 May 2003 19:50, manish wrote:
> > >
> > > Hi Manish,
> > >
> > > > It is not a system hang but the processes hang showing the same stack
> > > > trace. This is certainly not a pause since the bonnie processes that
> > > > were hung (or deadlocked) never completed after several hrs. The
> > > > stack trace  was the same.
> > >
> > > then you are hitting a different bug or a bug related to the issues
> > > Christian Klose and me and $tons of others were complaining.
> > >
> > > The bug you are hitting might be the problem with "process stuck in D
> > > state" Andrea Arcangeli fixed, let me guess, over half a year ago or
> > > so.
> > >
> > > In case you have a good mind to try to address your issue, you might
> > > want to try out the patch you can find here:
> > >
> > > > http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.21rc2aa1/9980_fix-pausing-2
> > >
> > > ALL: Anyone who has this kind of pauses/stops/mouse is dead/keyboard is
> > > dead/: speak _NOW_ please, doesn't matter who you are!
> >
> > Yo!
> >
> > I'll throw my babushka into the ring too. I think it's obvious from MCP's
> > comments that I've been involved in testing this problem. I've spent
> > hours, possibly days trying to find a way to fix the pauses introduced
> > since 2.4.19pre1. I agree with what MCP describes that the machine can
> > come to a standstill under any sort of disk i/o and is unusable for a
> > variable length of time. I've been playing with all sorts of numbers in
> > my patchset to try and limit it with only mild success. The best results
> > I've had without a major decrease in throughput was using akpm's read
> > latency 2 patch but by significantly reducing the nr_requests. It was
> > changing the number of requests that I discovered dropping them to 4
> > fixed the problem but destroyed write throughput. I was pleased to see AA
> > give the problem recognition after my contest results on his kernel but
> > disappointed that the problem only was reduced, not fixed.
>
> Does the problem change at all if you force batch_requests to 0?

I've tried setting batch_requests to 1 by itself (without changing
nr_requests), and that didn't fix it. I recall that dropping nr_requests to 2
(which would make batch_requests == 0) made the machine fail to boot, so I
haven't tried batch_requests 0 by itself. Should it boot with it == 0?

Con
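
A minimal sketch, not from the thread, of the arithmetic behind the remark
above. Con's statement that nr_requests == 2 gives batch_requests == 0 is
consistent with batch_requests being derived as nr_requests / 4 with integer
division. That ratio is an assumption inferred from his remark, not quoted
from the 2.4 source, and the helper name batch_requests_for() is hypothetical.

```c
#include <assert.h>

/*
 * Hypothetical helper, for illustration only.
 * ASSUMPTION: batch_requests is nr_requests / 4 using integer division,
 * inferred from "nr_requests to 2 (which would make batch requests==0)".
 */
static int batch_requests_for(int nr_requests)
{
	assert(nr_requests >= 0);	/* a negative queue depth makes no sense */
	return nr_requests / 4;		/* integer division truncates toward zero */
}
```

Under that assumption, nr_requests values of 2, 4 and 128 would map to
batch_requests values of 0, 1 and 32 respectively, which matches the two
data points Con reports (0 for nr_requests == 2, and a very small batch once
nr_requests is dropped to 4).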

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28  7:13                 ` Con Kolivas
@ 2003-05-28  7:13                   ` Jens Axboe
  2003-05-28  7:32                     ` Marc-Christian Petersen
  0 siblings, 1 reply; 142+ messages in thread
From: Jens Axboe @ 2003-05-28  7:13 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Marc-Christian Petersen, manish, Carl-Daniel Hailfinger,
	Andrea Arcangeli, Marcelo Tosatti, linux-kernel, Christian Klose,
	William Lee Irwin III

On Wed, May 28 2003, Con Kolivas wrote:
> On Wed, 28 May 2003 16:04, Jens Axboe wrote:
> > On Wed, May 28 2003, Con Kolivas wrote:
> > > On Wed, 28 May 2003 04:04, Marc-Christian Petersen wrote:
> > > > On Tuesday 27 May 2003 19:50, manish wrote:
> > > >
> > > > Hi Manish,
> > > >
> > > > > It is not a system hang but the processes hang showing the same stack
> > > > > trace. This is certainly not a pause since the bonnie processes that
> > > > > were hung (or deadlocked) never completed after several hrs. The
> > > > > stack trace  was the same.
> > > >
> > > > then you are hitting a different bug or a bug related to the issues
> > > > Christian Klose and me and $tons of others were complaining.
> > > >
> > > > The bug you are hitting might be the problem with "process stuck in D
> > > > state" Andrea Arcangeli fixed, let me guess, over half a year ago or
> > > > so.
> > > >
> > > > In case you have a good mind to try to address your issue, you might
> > > > want to try out the patch you can find here:
> > > >
> > > > > http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.21rc2aa1/9980_fix-pausing-2
> > > >
> > > > ALL: Anyone who has this kind of pauses/stops/mouse is dead/keyboard is
> > > > dead/: speak _NOW_ please, doesn't matter who you are!
> > >
> > > Yo!
> > >
> > > I'll throw my babushka into the ring too. I think it's obvious from MCP's
> > > comments that I've been involved in testing this problem. I've spent
> > > hours, possibly days trying to find a way to fix the pauses introduced
> > > since 2.4.19pre1. I agree with what MCP describes that the machine can
> > > come to a standstill under any sort of disk i/o and is unusable for a
> > > variable length of time. I've been playing with all sorts of numbers in
> > > my patchset to try and limit it with only mild success. The best results
> > > I've had without a major decrease in throughput was using akpm's read
> > > latency 2 patch but by significantly reducing the nr_requests. It was
> > > changing the number of requests that I discovered dropping them to 4
> > > fixed the problem but destroyed write throughput. I was pleased to see AA
> > > give the problem recognition after my contest results on his kernel but
> > > disappointed that the problem only was reduced, not fixed.
> >
> > Does the problem change at all if you force batch_requests to 0?
> 
> I've tried batch_requests to 1 by itself (without changing the
> nr_request) and that didn't fix it, but recall dropping nr_requests to
> 2 (which would make batch requests==0) made the machine fail to boot
> so I haven't tried batch requests 0 by itself. Should it boot with it
> == 0?

If you leave nr_requests as it is, I don't see why it should not boot
with batch_requests == 0.

I can't see in all of these mails whether backing out akpm's starvation
patch makes the problem go away. Does it?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 18:04           ` Marc-Christian Petersen
  2003-05-27 23:06             ` Georg Nikodym
  2003-05-28  5:33             ` Con Kolivas
@ 2003-05-28  7:16             ` Marc Wilson
  2003-05-28 19:53               ` David Ford
  2003-05-28  9:36             ` Ragnar Hojland Espinosa
  3 siblings, 1 reply; 142+ messages in thread
From: Marc Wilson @ 2003-05-28  7:16 UTC (permalink / raw)
  To: linux-kernel

On Tue, May 27, 2003 at 08:04:49PM +0200, Marc-Christian Petersen wrote:
> ALL: Anyone who has this kind of pauses/stops/mouse is dead/keyboard is dead/:
>      speak _NOW_ please, doesn't matter who you are!

Ok, add my box to the list.  Variety of post 2.4.18 kernels, -ac's, -rc's,
etc... all demonstrate it to one degree or another.

Lately it's gotten REALLY bad.

Currently I'm using 21-rc2-ac2 and it freezes for upwards of 15 sec
regularly when I'm exercising the HD (three simultaneous brag threads
downloading from various newsgroups).  The mouse moves, but other than
that, X is entirely unresponsive.  An xterm with continually scrolling
text, for example, will appear to stop scrolling until the kernel comes
back.

The HD light is on solid the whole time.

21-rc2 does it too.  I haven't tried anything later than that yet. Well, I
tried 20-ck7 and it ate my RAID0 due to a DMA-ism and I've not tested
anything else since. :(

-- 
 Marc Wilson |     Nothing in life is to be feared.  It is only to
 msw@cox.net |     be understood.

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28  7:13                   ` Jens Axboe
@ 2003-05-28  7:32                     ` Marc-Christian Petersen
  2003-05-28  7:35                       ` Jens Axboe
  0 siblings, 1 reply; 142+ messages in thread
From: Marc-Christian Petersen @ 2003-05-28  7:32 UTC (permalink / raw)
  To: Jens Axboe, Con Kolivas
  Cc: manish, Carl-Daniel Hailfinger, Andrea Arcangeli,
	Marcelo Tosatti, linux-kernel, Christian Klose,
	William Lee Irwin III

On Wednesday 28 May 2003 09:13, Jens Axboe wrote:

Hi Jens,

> If you leave nr_requests as it is, I don't see why it should not boot
> with batch_requests == 0.
> I can't see in all of these mails whether backing out akpm's starvation
> patch makes the problem go away. Does it?
If you mean 
"http://linux.bkbits.net:8080/linux-2.4/diffs/drivers/block/ll_rw_blk.c@1.29?nav=index.html|ChangeSet@-2y|cset@1.160|hist/drivers/block/ll_rw_blk.c"

that one, the answer is YES.

ciao, Marc



^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28  7:32                     ` Marc-Christian Petersen
@ 2003-05-28  7:35                       ` Jens Axboe
  2003-05-28  7:51                         ` Andrew Morton
  0 siblings, 1 reply; 142+ messages in thread
From: Jens Axboe @ 2003-05-28  7:35 UTC (permalink / raw)
  To: Marc-Christian Petersen
  Cc: Con Kolivas, manish, Andrea Arcangeli, Marcelo Tosatti,
	linux-kernel, Andrew Morton

On Wed, May 28 2003, Marc-Christian Petersen wrote:
> On Wednesday 28 May 2003 09:13, Jens Axboe wrote:
> 
> Hi Jens,
> 
> > If you leave nr_requests as it is, I don't see why it should not boot
> > with batch_requests == 0.
> > I can't see in all of these mails whether backing out akpm's starvation
> > patch makes the problem go away. Does it?
> If you mean
> "http://linux.bkbits.net:8080/linux-2.4/diffs/drivers/block/ll_rw_blk.c@1.29?nav=index.html|ChangeSet@-2y|cset@1.160|hist/drivers/block/ll_rw_blk.c"
> 
> that one, the answer is YES.

That's the one, yes. Andrew, looks like your patch brought out some
really bad behaviour.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28  7:35                       ` Jens Axboe
@ 2003-05-28  7:51                         ` Andrew Morton
  2003-05-28  8:30                           ` Jens Axboe
                                             ` (2 more replies)
  0 siblings, 3 replies; 142+ messages in thread
From: Andrew Morton @ 2003-05-28  7:51 UTC (permalink / raw)
  To: Jens Axboe; +Cc: m.c.p, kernel, manish, andrea, marcelo, linux-kernel

Jens Axboe <axboe@suse.de> wrote:
>
> > that one, the answer is YES.
> 
>  That's the one, yes. Andrew, looks like your patch brought out some
>  really bad behaviour.

Yes, but why?

It'd be interesting if any of these changes make a difference.


 drivers/block/ll_rw_blk.c |    7 
 fs/buffer.c               | 3030 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 3033 insertions(+), 4 deletions(-)

diff -puN drivers/block/ll_rw_blk.c~a drivers/block/ll_rw_blk.c
--- 24/drivers/block/ll_rw_blk.c~a	2003-05-28 00:48:09.000000000 -0700
+++ 24-akpm/drivers/block/ll_rw_blk.c	2003-05-28 00:50:02.000000000 -0700
@@ -590,10 +590,10 @@ static struct request *__get_request_wai
 	register struct request *rq;
 	DECLARE_WAITQUEUE(wait, current);
 
-	generic_unplug_device(q);
-	add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
+	add_wait_queue(&q->wait_for_requests[rw], &wait);
 	do {
 		set_current_state(TASK_UNINTERRUPTIBLE);
+		generic_unplug_device(q);
 		if (q->rq[rw].count == 0)
 			schedule();
 		spin_lock_irq(&io_request_lock);
@@ -829,8 +829,7 @@ void blkdev_release_request(struct reque
 	 */
 	if (q) {
 		list_add(&req->queue, &q->rq[rw].free);
-		if (++q->rq[rw].count >= q->batch_requests &&
-				waitqueue_active(&q->wait_for_requests[rw]))
+		if (++q->rq[rw].count >= q->batch_requests)
 			wake_up(&q->wait_for_requests[rw]);
 	}
 }

_
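
A minimal userspace sketch, not from the thread, of what the two hunks above
change on the wakeup side. It models only the condition tested in
blkdev_release_request() and the number of sleepers a single wake_up() makes
runnable. The struct and function names are hypothetical, locking is left
out, and the waiter count assumes every sleeper on the queue was added the
same way (all exclusive, or all non-exclusive).

```c
#include <stdbool.h>

/* Hypothetical stand-in for one direction (read or write) of a request queue. */
struct rq_model {
	int count;		/* free requests, like q->rq[rw].count */
	int batch_requests;	/* wake threshold, like q->batch_requests */
	int waiters;		/* tasks sleeping on wait_for_requests[rw] */
};

/* Pre-patch condition: wake only at the threshold AND if someone is waiting. */
static bool release_wakes_before(struct rq_model *q)
{
	return ++q->count >= q->batch_requests && q->waiters > 0;
}

/* Post-patch condition: the waitqueue_active() check is gone. */
static bool release_wakes_after(struct rq_model *q)
{
	return ++q->count >= q->batch_requests;
}

/*
 * Sleepers made runnable by one wake_up(): a single exclusive waiter,
 * or every waiter when they were queued with plain add_wait_queue().
 */
static int woken_by_one_wakeup(int waiters, bool exclusive)
{
	if (waiters <= 0)
		return 0;
	return exclusive ? 1 : waiters;
}
```

Waking an empty wait queue does nothing, so dropping the waitqueue_active()
test mainly removes an unlocked shortcut. The larger behavioural change in
this model is the switch from add_wait_queue_exclusive() to add_wait_queue(),
which turns a one-at-a-time wakeup into waking every sleeper.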


^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28  7:51                         ` Andrew Morton
@ 2003-05-28  8:30                           ` Jens Axboe
  2003-05-28  8:43                             ` Marc-Christian Petersen
  2003-05-28  8:40                           ` Marc-Christian Petersen
  2003-05-28 10:13                           ` Matthias Mueller
  2 siblings, 1 reply; 142+ messages in thread
From: Jens Axboe @ 2003-05-28  8:30 UTC (permalink / raw)
  To: Andrew Morton; +Cc: m.c.p, kernel, manish, andrea, marcelo, linux-kernel

On Wed, May 28 2003, Andrew Morton wrote:
> Jens Axboe <axboe@suse.de> wrote:
> >
> > > that one, the answer is YES.
> > 
> >  That's the one, yes. Andrew, looks like your patch brought out some
> >  really bad behaviour.
> 
> Yes, but why?
> 
> It'd be interesting if any of these changes make a difference.
> 
> 
>  drivers/block/ll_rw_blk.c |    7 
>  fs/buffer.c               | 3030 ++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 3033 insertions(+), 4 deletions(-)
> 
> diff -puN drivers/block/ll_rw_blk.c~a drivers/block/ll_rw_blk.c
> --- 24/drivers/block/ll_rw_blk.c~a	2003-05-28 00:48:09.000000000 -0700
> +++ 24-akpm/drivers/block/ll_rw_blk.c	2003-05-28 00:50:02.000000000 -0700
> @@ -590,10 +590,10 @@ static struct request *__get_request_wai
>  	register struct request *rq;
>  	DECLARE_WAITQUEUE(wait, current);
>  
> -	generic_unplug_device(q);
> -	add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
> +	add_wait_queue(&q->wait_for_requests[rw], &wait);
>  	do {
>  		set_current_state(TASK_UNINTERRUPTIBLE);
> +		generic_unplug_device(q);
>  		if (q->rq[rw].count == 0)
>  			schedule();
>  		spin_lock_irq(&io_request_lock);
> @@ -829,8 +829,7 @@ void blkdev_release_request(struct reque
>  	 */
>  	if (q) {
>  		list_add(&req->queue, &q->rq[rw].free);
> -		if (++q->rq[rw].count >= q->batch_requests &&
> -				waitqueue_active(&q->wait_for_requests[rw]))
> +		if (++q->rq[rw].count >= q->batch_requests)
>  			wake_up(&q->wait_for_requests[rw]);
>  	}
>  }

The unplug() move could be the key, in theory we could end up having to
unplug the queue again.

Question to the ones seeing the stalls - does a sysrq-s make things go
again?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28  7:51                         ` Andrew Morton
  2003-05-28  8:30                           ` Jens Axboe
@ 2003-05-28  8:40                           ` Marc-Christian Petersen
  2003-05-28 10:13                           ` Matthias Mueller
  2 siblings, 0 replies; 142+ messages in thread
From: Marc-Christian Petersen @ 2003-05-28  8:40 UTC (permalink / raw)
  To: Andrew Morton, Jens Axboe; +Cc: kernel, manish, andrea, marcelo, linux-kernel

On Wednesday 28 May 2003 09:51, Andrew Morton wrote:

Hi Andrew,

> Yes, but why?
I don't know :(

> It'd be interesting if any of these changes make a difference.
I'll check it this evening! Many thanks.

ciao, Marc



^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28  8:30                           ` Jens Axboe
@ 2003-05-28  8:43                             ` Marc-Christian Petersen
  0 siblings, 0 replies; 142+ messages in thread
From: Marc-Christian Petersen @ 2003-05-28  8:43 UTC (permalink / raw)
  To: Jens Axboe, Andrew Morton; +Cc: kernel, manish, andrea, marcelo, linux-kernel

On Wednesday 28 May 2003 10:30, Jens Axboe wrote:

Hi Jens,

> The unplug() move could be the key, in theory we could end up having to
> unplug the queue again.
Hmm, afaik the fix-pausing-2 patch does something similar, moving
unplug_device() to the same place.

> Question to the ones seeing the stalls - does a sysrq-s make things go
> again?
no (at least not for me)

ciao, Marc



^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-27 18:04           ` Marc-Christian Petersen
                               ` (2 preceding siblings ...)
  2003-05-28  7:16             ` Marc Wilson
@ 2003-05-28  9:36             ` Ragnar Hojland Espinosa
  2003-05-28  9:45               ` Jens Axboe
                                 ` (2 more replies)
  3 siblings, 3 replies; 142+ messages in thread
From: Ragnar Hojland Espinosa @ 2003-05-28  9:36 UTC (permalink / raw)
  To: Marc-Christian Petersen
  Cc: manish, Carl-Daniel Hailfinger, Andrea Arcangeli,
	Marcelo Tosatti, linux-kernel, Christian Klose,
	William Lee Irwin III

On Tue, May 27, 2003 at 08:04:49PM +0200, Marc-Christian Petersen wrote:
> 
> ALL: Anyone who has this kind of pauses/stops/mouse is dead/keyboard is dead/:
>      speak _NOW_ please, doesn't matter who you are!

FWIW, me too.

Actually it just happens in the fixing stage when burning prebuilt iso
images from the hard disk (same IDE channel as the burner, 2.4.20)
Having a completely frozen machine under X was quite panic inducing ;)

A friend told me they also get regular "pauses" when quitting from
vmware.
-- 
Ragnar Hojland - Project Manager
Linalco "Especialistas Linux y en Software Libre"
http://www.linalco.com Tel: +34-91-5970074 Fax: +34-91-5970083


* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28  9:36             ` Ragnar Hojland Espinosa
@ 2003-05-28  9:45               ` Jens Axboe
  2003-05-28  9:53               ` Marc-Christian Petersen
  2003-05-28 10:58               ` Alan Cox
  2 siblings, 0 replies; 142+ messages in thread
From: Jens Axboe @ 2003-05-28  9:45 UTC (permalink / raw)
  To: Ragnar Hojland Espinosa
  Cc: Marc-Christian Petersen, manish, Carl-Daniel Hailfinger,
	Andrea Arcangeli, Marcelo Tosatti, linux-kernel, Christian Klose,
	William Lee Irwin III

On Wed, May 28 2003, Ragnar Hojland Espinosa wrote:
> On Tue, May 27, 2003 at 08:04:49PM +0200, Marc-Christian Petersen wrote:
> > 
> > ALL: Anyone who has this kind of pauses/stops/mouse is dead/keyboard is dead/:
> >      speak _NOW_ please, doesn't matter who you are!
> 
> FWIW, me too.
> 
> Actually it just happens in the fixing stage when burning prebuilt iso
> images from the hard disk (same IDE channel as the burner, 2.4.20)
> Having a completely frozen machine under X was quite panic inducing ;)
> 
> A friend told me they also get regular "pauses" when quitting from
> vmware.

Lemme guess, hard drive on the same channel as the burner? There's
nothing we can do about that, hardware limitation. The reason you see it
during fixation is because that's one long single command, and we cannot
preempt the channel and service requests while that is going on.

-- 
Jens Axboe



* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28  9:36             ` Ragnar Hojland Espinosa
  2003-05-28  9:45               ` Jens Axboe
@ 2003-05-28  9:53               ` Marc-Christian Petersen
  2003-05-28 10:01                 ` Jens Axboe
  2003-05-28 10:58               ` Alan Cox
  2 siblings, 1 reply; 142+ messages in thread
From: Marc-Christian Petersen @ 2003-05-28  9:53 UTC (permalink / raw)
  To: Ragnar Hojland Espinosa
  Cc: manish, Carl-Daniel Hailfinger, Andrea Arcangeli,
	Marcelo Tosatti, linux-kernel, Christian Klose,
	William Lee Irwin III

On Wednesday 28 May 2003 11:36, Ragnar Hojland Espinosa wrote:

Hi Ragnar,

> Actually it just happens in the fixing stage when burning prebuilt iso
> images from the hard disk (same IDE channel as the burner, 2.4.20)
> Having a completely frozen machine under X was quite panic inducing ;)
That's a problem of IDE itself. I still say IDE is broken by design ;-)

> A friend told me they also get regular "pauses" when quitting from
> vmware.
Yep, occurs also with my machines.

ciao, Marc



* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28  9:53               ` Marc-Christian Petersen
@ 2003-05-28 10:01                 ` Jens Axboe
  0 siblings, 0 replies; 142+ messages in thread
From: Jens Axboe @ 2003-05-28 10:01 UTC (permalink / raw)
  To: Marc-Christian Petersen
  Cc: Ragnar Hojland Espinosa, manish, Carl-Daniel Hailfinger,
	Andrea Arcangeli, Marcelo Tosatti, linux-kernel, Christian Klose,
	William Lee Irwin III

On Wed, May 28 2003, Marc-Christian Petersen wrote:
> On Wednesday 28 May 2003 11:36, Ragnar Hojland Espinosa wrote:
> 
> Hi Ragnar,
> 
> > Actually it just happens in the fixing stage when burning prebuilt iso
> > images from the hard disk (same IDE channel as the burner, 2.4.20)
> > Having a completely frozen machine under X was quite panic inducing ;)
> That's a problem of IDE itself. I still say IDE is broken by design ;-)

It is actually possible to use the IMMED bit of the CLOSE_TRACK command
to get around this. In that case the cd-r will return the command as
completed and the drive on the same channel can service requests.
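
A minimal sketch of what such a command block could look like, assuming the
standard MMC layout of CLOSE TRACK/SESSION (opcode 0x5b, IMMED as bit 0 of
byte 1, close function in the low three bits of byte 2, track number in
bytes 4-5). The helper name build_close_track_cdb() is hypothetical and this
is user-space illustration only, not the code any burning tool actually uses:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Opcode of the MMC CLOSE TRACK/SESSION command. */
#define CLOSE_TRACK_OPCODE 0x5b
/* IMMED: ask the drive to report completion before the close has finished. */
#define CLOSE_TRACK_IMMED  0x01

/*
 * Hypothetical helper: fill a 10-byte command block for CLOSE TRACK/SESSION.
 * close_func selects what to close (assumed: 1 = track, 2 = session).
 */
static void build_close_track_cdb(uint8_t cdb[10], int immed,
                                  uint8_t close_func, uint16_t track)
{
    assert(cdb != NULL);
    memset(cdb, 0, 10);
    cdb[0] = CLOSE_TRACK_OPCODE;
    if (immed)
        cdb[1] |= CLOSE_TRACK_IMMED;      /* return at once, finish in background */
    cdb[2] = close_func & 0x07;           /* only the low three bits are used */
    cdb[4] = (uint8_t)(track >> 8);       /* track number, big-endian */
    cdb[5] = (uint8_t)(track & 0xff);
}
```

With IMMED set the caller has to poll the drive for readiness afterwards,
since the command returning no longer means the fixation is done.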

-- 
Jens Axboe



* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28  7:51                         ` Andrew Morton
  2003-05-28  8:30                           ` Jens Axboe
  2003-05-28  8:40                           ` Marc-Christian Petersen
@ 2003-05-28 10:13                           ` Matthias Mueller
  2003-05-28 10:18                             ` Jens Axboe
                                               ` (2 more replies)
  2 siblings, 3 replies; 142+ messages in thread
From: Matthias Mueller @ 2003-05-28 10:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, m.c.p, kernel, manish, andrea, marcelo, linux-kernel

On Wed, May 28, 2003 at 12:51:56AM -0700, Andrew Morton wrote:
> It'd be interesting if any of these changes make a difference.
> 
> 
>  drivers/block/ll_rw_blk.c |    7 
>  fs/buffer.c               | 3030 ++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 3033 insertions(+), 4 deletions(-)
> 
> diff -puN drivers/block/ll_rw_blk.c~a drivers/block/ll_rw_blk.c
> --- 24/drivers/block/ll_rw_blk.c~a	2003-05-28 00:48:09.000000000 -0700
> +++ 24-akpm/drivers/block/ll_rw_blk.c	2003-05-28 00:50:02.000000000 -0700
> @@ -590,10 +590,10 @@ static struct request *__get_request_wai
>  	register struct request *rq;
>  	DECLARE_WAITQUEUE(wait, current);
>  
> -	generic_unplug_device(q);
> -	add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
> +	add_wait_queue(&q->wait_for_requests[rw], &wait);
>  	do {
>  		set_current_state(TASK_UNINTERRUPTIBLE);
> +		generic_unplug_device(q);
>  		if (q->rq[rw].count == 0)
>  			schedule();
>  		spin_lock_irq(&io_request_lock);
> @@ -829,8 +829,7 @@ void blkdev_release_request(struct reque
>  	 */
>  	if (q) {
>  		list_add(&req->queue, &q->rq[rw].free);
> -		if (++q->rq[rw].count >= q->batch_requests &&
> -				waitqueue_active(&q->wait_for_requests[rw]))
> +		if (++q->rq[rw].count >= q->batch_requests)
>  			wake_up(&q->wait_for_requests[rw]);
>  	}
>  }
> 

Works fine on my notebook. Good throughput and no mouse hangs anymore.

Thanks,
Matthias
-- 
Matthias.Mueller@rz.uni-karlsruhe.de
Rechenzentrum Universitaet Karlsruhe
Abteilung Netze


* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 10:13                           ` Matthias Mueller
@ 2003-05-28 10:18                             ` Jens Axboe
  2003-05-28 10:23                             ` Andrew Morton
  2003-05-28 10:24                             ` Marc-Christian Petersen
  2 siblings, 0 replies; 142+ messages in thread
From: Jens Axboe @ 2003-05-28 10:18 UTC (permalink / raw)
  To: Andrew Morton, m.c.p, kernel, manish, andrea, marcelo, linux-kernel

On Wed, May 28 2003, Matthias Mueller wrote:
> On Wed, May 28, 2003 at 12:51:56AM -0700, Andrew Morton wrote:
> > It'd be interesting if any of these changes make a difference.
> > 
> > 
> >  drivers/block/ll_rw_blk.c |    7 
> >  fs/buffer.c               | 3030 ++++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 3033 insertions(+), 4 deletions(-)
> > 
> > diff -puN drivers/block/ll_rw_blk.c~a drivers/block/ll_rw_blk.c
> > --- 24/drivers/block/ll_rw_blk.c~a	2003-05-28 00:48:09.000000000 -0700
> > +++ 24-akpm/drivers/block/ll_rw_blk.c	2003-05-28 00:50:02.000000000 -0700
> > @@ -590,10 +590,10 @@ static struct request *__get_request_wai
> >  	register struct request *rq;
> >  	DECLARE_WAITQUEUE(wait, current);
> >  
> > -	generic_unplug_device(q);
> > -	add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
> > +	add_wait_queue(&q->wait_for_requests[rw], &wait);
> >  	do {
> >  		set_current_state(TASK_UNINTERRUPTIBLE);
> > +		generic_unplug_device(q);
> >  		if (q->rq[rw].count == 0)
> >  			schedule();
> >  		spin_lock_irq(&io_request_lock);
> > @@ -829,8 +829,7 @@ void blkdev_release_request(struct reque
> >  	 */
> >  	if (q) {
> >  		list_add(&req->queue, &q->rq[rw].free);
> > -		if (++q->rq[rw].count >= q->batch_requests &&
> > -				waitqueue_active(&q->wait_for_requests[rw]))
> > +		if (++q->rq[rw].count >= q->batch_requests)
> >  			wake_up(&q->wait_for_requests[rw]);
> >  	}
> >  }
> > 
> 
> Works fine on my notebook. Good throughput and no mouse hangs anymore.

Could you possibly try just the last hunk of the patch, then? Ie just
remove the waitqueue_active(&q->wait_for_requests[rw]) check, leave the
rest as-is.

-- 
Jens Axboe



* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 10:13                           ` Matthias Mueller
  2003-05-28 10:18                             ` Jens Axboe
@ 2003-05-28 10:23                             ` Andrew Morton
  2003-05-28 10:25                               ` Jens Axboe
                                                 ` (4 more replies)
  2003-05-28 10:24                             ` Marc-Christian Petersen
  2 siblings, 5 replies; 142+ messages in thread
From: Andrew Morton @ 2003-05-28 10:23 UTC (permalink / raw)
  To: Matthias Mueller
  Cc: axboe, m.c.p, kernel, manish, andrea, marcelo, linux-kernel

Matthias Mueller <matthias.mueller@rz.uni-karlsruhe.de> wrote:
>
> Works fine on my notebook. Good throughput and no mouse hangs anymore.

Interesting.

Could you please work out which change caused it?  Go back to stock 2.4 and
then apply this:


diff -puN drivers/block/ll_rw_blk.c~1 drivers/block/ll_rw_blk.c
--- 24/drivers/block/ll_rw_blk.c~1	2003-05-28 03:20:42.000000000 -0700
+++ 24-akpm/drivers/block/ll_rw_blk.c	2003-05-28 03:20:57.000000000 -0700
@@ -590,10 +590,10 @@ static struct request *__get_request_wai
 	register struct request *rq;
 	DECLARE_WAITQUEUE(wait, current);
 
-	generic_unplug_device(q);
 	add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
 	do {
 		set_current_state(TASK_UNINTERRUPTIBLE);
+		generic_unplug_device(q);
 		if (q->rq[rw].count == 0)
 			schedule();
 		spin_lock_irq(&io_request_lock);



then this:

diff -puN drivers/block/ll_rw_blk.c~2 drivers/block/ll_rw_blk.c
--- 24/drivers/block/ll_rw_blk.c~2	2003-05-28 03:21:03.000000000 -0700
+++ 24-akpm/drivers/block/ll_rw_blk.c	2003-05-28 03:21:09.000000000 -0700
@@ -590,7 +590,7 @@ static struct request *__get_request_wai
 	register struct request *rq;
 	DECLARE_WAITQUEUE(wait, current);
 
-	add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
+	add_wait_queue(&q->wait_for_requests[rw], &wait);
 	do {
 		set_current_state(TASK_UNINTERRUPTIBLE);
 		generic_unplug_device(q);


Then this (totally unlikely, don't bother):

diff -puN drivers/block/ll_rw_blk.c~3 drivers/block/ll_rw_blk.c
--- 24/drivers/block/ll_rw_blk.c~3	2003-05-28 03:21:15.000000000 -0700
+++ 24-akpm/drivers/block/ll_rw_blk.c	2003-05-28 03:21:39.000000000 -0700
@@ -829,8 +829,7 @@ void blkdev_release_request(struct reque
 	 */
 	if (q) {
 		list_add(&req->queue, &q->rq[rw].free);
-		if (++q->rq[rw].count >= q->batch_requests &&
-				waitqueue_active(&q->wait_for_requests[rw]))
+		if (++q->rq[rw].count >= q->batch_requests)
 			wake_up(&q->wait_for_requests[rw]);
 	}
 }

_



* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 10:13                           ` Matthias Mueller
  2003-05-28 10:18                             ` Jens Axboe
  2003-05-28 10:23                             ` Andrew Morton
@ 2003-05-28 10:24                             ` Marc-Christian Petersen
  2 siblings, 0 replies; 142+ messages in thread
From: Marc-Christian Petersen @ 2003-05-28 10:24 UTC (permalink / raw)
  To: Matthias Mueller, Andrew Morton
  Cc: Jens Axboe, kernel, manish, andrea, marcelo, linux-kernel

On Wednesday 28 May 2003 12:13, Matthias Mueller wrote:

Hi Matthias, Andrew,

> > It'd be interesting if any of these changes make a difference.
> Works fine on my notebook. Good throughput and no mouse hangs anymore.
Damn, I *KNEW* Andrew would be able to fix this. I've known that for over a year!! ;)

ciao, Marc



* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 10:23                             ` Andrew Morton
@ 2003-05-28 10:25                               ` Jens Axboe
  2003-05-28 10:48                                 ` Con Kolivas
  2003-05-28 10:29                               ` Con Kolivas
                                                 ` (3 subsequent siblings)
  4 siblings, 1 reply; 142+ messages in thread
From: Jens Axboe @ 2003-05-28 10:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthias Mueller, m.c.p, kernel, manish, andrea, marcelo, linux-kernel

On Wed, May 28 2003, Andrew Morton wrote:
> Matthias Mueller <matthias.mueller@rz.uni-karlsruhe.de> wrote:
> >
> > Works fine on my notebook. Good throughput and no mouse hangs anymore.
> 
> Interesting.
> 
> Could you please work out which change caused it?  Go back to stock 2.4 and
> then apply this:
> 
> 
> diff -puN drivers/block/ll_rw_blk.c~1 drivers/block/ll_rw_blk.c
> --- 24/drivers/block/ll_rw_blk.c~1	2003-05-28 03:20:42.000000000 -0700
> +++ 24-akpm/drivers/block/ll_rw_blk.c	2003-05-28 03:20:57.000000000 -0700
> @@ -590,10 +590,10 @@ static struct request *__get_request_wai
>  	register struct request *rq;
>  	DECLARE_WAITQUEUE(wait, current);
>  
> -	generic_unplug_device(q);
>  	add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
>  	do {
>  		set_current_state(TASK_UNINTERRUPTIBLE);
> +		generic_unplug_device(q);
>  		if (q->rq[rw].count == 0)
>  			schedule();
>  		spin_lock_irq(&io_request_lock);

I think it was already established that this wasn't the reason. Was my
first suspect too, though...

> then this:
> 
> diff -puN drivers/block/ll_rw_blk.c~2 drivers/block/ll_rw_blk.c
> --- 24/drivers/block/ll_rw_blk.c~2	2003-05-28 03:21:03.000000000 -0700
> +++ 24-akpm/drivers/block/ll_rw_blk.c	2003-05-28 03:21:09.000000000 -0700
> @@ -590,7 +590,7 @@ static struct request *__get_request_wai
>  	register struct request *rq;
>  	DECLARE_WAITQUEUE(wait, current);
>  
> -	add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
> +	add_wait_queue(&q->wait_for_requests[rw], &wait);
>  	do {
>  		set_current_state(TASK_UNINTERRUPTIBLE);
>  		generic_unplug_device(q);

Since we do a general wake_up(), only the order of wakeups matter here
right (lifo vs fifo). Given that, the _exclusive() should be more fair
possibly at the cost of a bit of throughput.
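
To make the lifo vs fifo point concrete, here is a toy user-space model. It
assumes that a plain add puts the new waiter at the head of the queue, that
the _exclusive() variant puts it at the tail, and that a wake-up walks the
queue from head to tail. All toy_* names are hypothetical and nothing here is
kernel code:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_WAITERS 8

/* Toy wait queue: ids[0] is the head, ids[len - 1] is the tail. */
struct toy_waitqueue {
    int ids[TOY_MAX_WAITERS];
    size_t len;
};

/* Models a plain add: the newest waiter goes to the head. */
static void toy_add_head(struct toy_waitqueue *q, int id)
{
    assert(q->len < TOY_MAX_WAITERS);
    for (size_t i = q->len; i > 0; i--)
        q->ids[i] = q->ids[i - 1];
    q->ids[0] = id;
    q->len++;
}

/* Models an _exclusive() add: the newest waiter goes to the tail. */
static void toy_add_tail(struct toy_waitqueue *q, int id)
{
    assert(q->len < TOY_MAX_WAITERS);
    q->ids[q->len++] = id;
}

/* Wake everybody, head first; record the wake order in out[]. */
static size_t toy_wake_all(struct toy_waitqueue *q, int *out)
{
    size_t n = q->len;

    for (size_t i = 0; i < n; i++)
        out[i] = q->ids[i];
    q->len = 0;
    return n;
}
```

Queueing waiters 1, 2, 3 with toy_add_head() wakes them as 3, 2, 1 (the
latest arrival runs first), while toy_add_tail() wakes them as 1, 2, 3, which
is the fairer ordering referred to above.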

> Then this (totally unlikely, don't bother):
> 
> diff -puN drivers/block/ll_rw_blk.c~3 drivers/block/ll_rw_blk.c
> --- 24/drivers/block/ll_rw_blk.c~3	2003-05-28 03:21:15.000000000 -0700
> +++ 24-akpm/drivers/block/ll_rw_blk.c	2003-05-28 03:21:39.000000000 -0700
> @@ -829,8 +829,7 @@ void blkdev_release_request(struct reque
>  	 */
>  	if (q) {
>  		list_add(&req->queue, &q->rq[rw].free);
> -		if (++q->rq[rw].count >= q->batch_requests &&
> -				waitqueue_active(&q->wait_for_requests[rw]))
> +		if (++q->rq[rw].count >= q->batch_requests)
>  			wake_up(&q->wait_for_requests[rw]);
>  	}
>  }

Well it's the only one left :). But you are right, try one of them at
the time, establishing the effect of each of them.

-- 
Jens Axboe



* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 10:23                             ` Andrew Morton
  2003-05-28 10:25                               ` Jens Axboe
@ 2003-05-28 10:29                               ` Con Kolivas
  2003-05-28 10:29                                 ` Marc-Christian Petersen
  2003-05-28 12:10                               ` Matthias Mueller
                                                 ` (2 subsequent siblings)
  4 siblings, 1 reply; 142+ messages in thread
From: Con Kolivas @ 2003-05-28 10:29 UTC (permalink / raw)
  To: Andrew Morton, Matthias Mueller
  Cc: axboe, m.c.p, manish, andrea, marcelo, linux-kernel

On Wed, 28 May 2003 20:23, Andrew Morton wrote:
> Matthias Mueller <matthias.mueller@rz.uni-karlsruhe.de> wrote:
> > Works fine on my notebook. Good throughput and no mouse hangs anymore.
>
> Interesting.
>
> Could you please work out which change caused it?  Go back to stock 2.4 and
> then apply this:
>
>
> diff -puN drivers/block/ll_rw_blk.c~1 drivers/block/ll_rw_blk.c
> --- 24/drivers/block/ll_rw_blk.c~1	2003-05-28 03:20:42.000000000 -0700
> +++ 24-akpm/drivers/block/ll_rw_blk.c	2003-05-28 03:20:57.000000000 -0700
> @@ -590,10 +590,10 @@ static struct request *__get_request_wai
>  	register struct request *rq;
>  	DECLARE_WAITQUEUE(wait, current);
>
> -	generic_unplug_device(q);
>  	add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
>  	do {
>  		set_current_state(TASK_UNINTERRUPTIBLE);
> +		generic_unplug_device(q);
>  		if (q->rq[rw].count == 0)
>  			schedule();
>  		spin_lock_irq(&io_request_lock);

It's not this because this is the layout in my -ck* and it still exhibits the 
pauses.




* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 10:29                               ` Con Kolivas
@ 2003-05-28 10:29                                 ` Marc-Christian Petersen
  0 siblings, 0 replies; 142+ messages in thread
From: Marc-Christian Petersen @ 2003-05-28 10:29 UTC (permalink / raw)
  To: Con Kolivas, Andrew Morton, Matthias Mueller
  Cc: axboe, manish, andrea, marcelo, linux-kernel

On Wednesday 28 May 2003 12:29, Con Kolivas wrote:

Hi Con, AKPM, Jens,

> > diff -puN drivers/block/ll_rw_blk.c~1 drivers/block/ll_rw_blk.c
> > --- 24/drivers/block/ll_rw_blk.c~1	2003-05-28 03:20:42.000000000 -0700
> > +++ 24-akpm/drivers/block/ll_rw_blk.c	2003-05-28 03:20:57.000000000 -0700
> > @@ -590,10 +590,10 @@ static struct request *__get_request_wai
> >  	register struct request *rq;
> >  	DECLARE_WAITQUEUE(wait, current);
> >
> > -	generic_unplug_device(q);
> >  	add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
> >  	do {
> >  		set_current_state(TASK_UNINTERRUPTIBLE);
> > +		generic_unplug_device(q);
> >  		if (q->rq[rw].count == 0)
> >  			schedule();
> >  		spin_lock_irq(&io_request_lock);
> It's not this because this is the layout in my -ck* and it still exhibits
> the pauses.
Same for -WOLK*

ciao, Marc



* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 10:25                               ` Jens Axboe
@ 2003-05-28 10:48                                 ` Con Kolivas
  2003-05-28 10:50                                   ` Jens Axboe
  2003-05-28 11:03                                   ` Nick Piggin
  0 siblings, 2 replies; 142+ messages in thread
From: Con Kolivas @ 2003-05-28 10:48 UTC (permalink / raw)
  To: Jens Axboe, Andrew Morton
  Cc: Matthias Mueller, m.c.p, manish, andrea, marcelo, linux-kernel

On Wed, 28 May 2003 20:25, Jens Axboe wrote:
> On Wed, May 28 2003, Andrew Morton wrote:
> > Matthias Mueller <matthias.mueller@rz.uni-karlsruhe.de> wrote:
> > > Works fine on my notebook. Good throughput and no mouse hangs anymore.
> >
> > Interesting.
> >
> > Could you please work out which change caused it?  Go back to stock 2.4
> > and then apply this:
> >
> >
> > diff -puN drivers/block/ll_rw_blk.c~1 drivers/block/ll_rw_blk.c
> > --- 24/drivers/block/ll_rw_blk.c~1	2003-05-28 03:20:42.000000000 -0700
> > +++ 24-akpm/drivers/block/ll_rw_blk.c	2003-05-28 03:20:57.000000000 -0700
> > @@ -590,10 +590,10 @@ static struct request *__get_request_wai
> >  	register struct request *rq;
> >  	DECLARE_WAITQUEUE(wait, current);
> >
> > -	generic_unplug_device(q);
> >  	add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
> >  	do {
> >  		set_current_state(TASK_UNINTERRUPTIBLE);
> > +		generic_unplug_device(q);
> >  		if (q->rq[rw].count == 0)
> >  			schedule();
> >  		spin_lock_irq(&io_request_lock);
>
> I think it was already established that this wasn't the reason. Was my
> first suspect too, though...
>
> > then this:
> >
> > diff -puN drivers/block/ll_rw_blk.c~2 drivers/block/ll_rw_blk.c
> > --- 24/drivers/block/ll_rw_blk.c~2	2003-05-28 03:21:03.000000000 -0700
> > +++ 24-akpm/drivers/block/ll_rw_blk.c	2003-05-28 03:21:09.000000000 -0700
> > @@ -590,7 +590,7 @@ static struct request *__get_request_wai
> >  	register struct request *rq;
> >  	DECLARE_WAITQUEUE(wait, current);
> >
> > -	add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
> > +	add_wait_queue(&q->wait_for_requests[rw], &wait);
> >  	do {
> >  		set_current_state(TASK_UNINTERRUPTIBLE);
> >  		generic_unplug_device(q);
>
> Since we do a general wake_up(), only the order of wakeups matter here
> right (lifo vs fifo). Given that, the _exclusive() should be more fair
> possibly at the cost of a bit of throughput.
>
> > Then this (totally unlikely, don't bother):
> >
> > diff -puN drivers/block/ll_rw_blk.c~3 drivers/block/ll_rw_blk.c
> > --- 24/drivers/block/ll_rw_blk.c~3	2003-05-28 03:21:15.000000000 -0700
> > +++ 24-akpm/drivers/block/ll_rw_blk.c	2003-05-28 03:21:39.000000000 -0700
> > @@ -829,8 +829,7 @@ void blkdev_release_request(struct reque
> >  	 */
> >  	if (q) {
> >  		list_add(&req->queue, &q->rq[rw].free);
> > -		if (++q->rq[rw].count >= q->batch_requests &&
> > -				waitqueue_active(&q->wait_for_requests[rw]))
> > +		if (++q->rq[rw].count >= q->batch_requests)
> >  			wake_up(&q->wait_for_requests[rw]);
> >  	}
> >  }
>
> Well it's the only one left :). But you are right, try one of them at
> the time, establishing the effect of each of them.

THIS IS IT! The last one. No pauses writing a 2Gb file now unless I do a read 
midstream.

Con


* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 10:48                                 ` Con Kolivas
@ 2003-05-28 10:50                                   ` Jens Axboe
  2003-05-28 10:59                                     ` Andrew Morton
  2003-05-28 11:03                                   ` Nick Piggin
  1 sibling, 1 reply; 142+ messages in thread
From: Jens Axboe @ 2003-05-28 10:50 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Andrew Morton, Matthias Mueller, m.c.p, manish, andrea, marcelo,
	linux-kernel

On Wed, May 28 2003, Con Kolivas wrote:
> On Wed, 28 May 2003 20:25, Jens Axboe wrote:
> > On Wed, May 28 2003, Andrew Morton wrote:
> > > Matthias Mueller <matthias.mueller@rz.uni-karlsruhe.de> wrote:
> > > > Works fine on my notebook. Good throughput and no mouse hangs anymore.
> > >
> > > Interesting.
> > >
> > > Could you please work out which change caused it?  Go back to stock 2.4
> > > and then apply this:
> > >
> > >
> > > diff -puN drivers/block/ll_rw_blk.c~1 drivers/block/ll_rw_blk.c
> > > --- 24/drivers/block/ll_rw_blk.c~1	2003-05-28 03:20:42.000000000 -0700
> > > +++ 24-akpm/drivers/block/ll_rw_blk.c	2003-05-28 03:20:57.000000000 -0700
> > > @@ -590,10 +590,10 @@ static struct request *__get_request_wai
> > >  	register struct request *rq;
> > >  	DECLARE_WAITQUEUE(wait, current);
> > >
> > > -	generic_unplug_device(q);
> > >  	add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
> > >  	do {
> > >  		set_current_state(TASK_UNINTERRUPTIBLE);
> > > +		generic_unplug_device(q);
> > >  		if (q->rq[rw].count == 0)
> > >  			schedule();
> > >  		spin_lock_irq(&io_request_lock);
> >
> > I think it was already established that this wasn't the reason. Was my
> > first suspect too, though...
> >
> > > then this:
> > >
> > > diff -puN drivers/block/ll_rw_blk.c~2 drivers/block/ll_rw_blk.c
> > > --- 24/drivers/block/ll_rw_blk.c~2	2003-05-28 03:21:03.000000000 -0700
> > > +++ 24-akpm/drivers/block/ll_rw_blk.c	2003-05-28 03:21:09.000000000 -0700
> > > @@ -590,7 +590,7 @@ static struct request *__get_request_wai
> > >  	register struct request *rq;
> > >  	DECLARE_WAITQUEUE(wait, current);
> > >
> > > -	add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
> > > +	add_wait_queue(&q->wait_for_requests[rw], &wait);
> > >  	do {
> > >  		set_current_state(TASK_UNINTERRUPTIBLE);
> > >  		generic_unplug_device(q);
> >
> > Since we do a general wake_up(), only the order of wakeups matter here
> > right (lifo vs fifo). Given that, the _exclusive() should be more fair
> > possibly at the cost of a bit of throughput.
> >
> > > Then this (totally unlikely, don't bother):
> > >
> > > diff -puN drivers/block/ll_rw_blk.c~3 drivers/block/ll_rw_blk.c
> > > --- 24/drivers/block/ll_rw_blk.c~3	2003-05-28 03:21:15.000000000 -0700
> > > +++ 24-akpm/drivers/block/ll_rw_blk.c	2003-05-28 03:21:39.000000000 -0700
> > > @@ -829,8 +829,7 @@ void blkdev_release_request(struct reque
> > >  	 */
> > >  	if (q) {
> > >  		list_add(&req->queue, &q->rq[rw].free);
> > > -		if (++q->rq[rw].count >= q->batch_requests &&
> > > -				waitqueue_active(&q->wait_for_requests[rw]))
> > > +		if (++q->rq[rw].count >= q->batch_requests)
> > >  			wake_up(&q->wait_for_requests[rw]);
> > >  	}
> > >  }
> >
> > Well it's the only one left :). But you are right, try one of them at
> > the time, establishing the effect of each of them.
> 
> THIS IS IT! The last one. No pauses writing a 2Gb file now unless I do a read 
> midstream.

Cool, especially since we can easily apply this to -rc5 without any
worries. Marcelo, if you please...?

===== drivers/block/ll_rw_blk.c 1.44 vs edited =====
--- 1.44/drivers/block/ll_rw_blk.c	Mon Apr 14 12:53:03 2003
+++ edited/drivers/block/ll_rw_blk.c	Wed May 28 12:49:30 2003
@@ -829,8 +829,7 @@
 	 */
 	if (q) {
 		list_add(&req->queue, &q->rq[rw].free);
-		if (++q->rq[rw].count >= q->batch_requests &&
-				waitqueue_active(&q->wait_for_requests[rw]))
+		if (++q->rq[rw].count >= q->batch_requests)
 			wake_up(&q->wait_for_requests[rw]);
 	}
 }

-- 
Jens Axboe



* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28  9:36             ` Ragnar Hojland Espinosa
  2003-05-28  9:45               ` Jens Axboe
  2003-05-28  9:53               ` Marc-Christian Petersen
@ 2003-05-28 10:58               ` Alan Cox
  2003-05-29  8:34                 ` Ragnar Hojland Espinosa
  2 siblings, 1 reply; 142+ messages in thread
From: Alan Cox @ 2003-05-28 10:58 UTC (permalink / raw)
  To: Ragnar Hojland Espinosa
  Cc: Marc-Christian Petersen, manish, Carl-Daniel Hailfinger,
	Andrea Arcangeli, Marcelo Tosatti, Linux Kernel Mailing List,
	Christian Klose, William Lee Irwin III

On Mer, 2003-05-28 at 10:36, Ragnar Hojland Espinosa wrote:
> Actually it just happens in the fixing stage when burning prebuilt iso
> images from the hard disk (same IDE channel as the burner, 2.4.20)
> Having a completely frozen machine under X was quite panic inducing ;)

If you have a disk and the burner on the same channel this is quite
normal. The fixate is a single ATAPI command and, like all ATA commands,
locks the bus to both master and slave for the duration of its execution.

It's an IDE limitation.



* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 10:50                                   ` Jens Axboe
@ 2003-05-28 10:59                                     ` Andrew Morton
  2003-05-28 11:17                                       ` Marc-Christian Petersen
  0 siblings, 1 reply; 142+ messages in thread
From: Andrew Morton @ 2003-05-28 10:59 UTC (permalink / raw)
  To: Jens Axboe
  Cc: kernel, matthias.mueller, m.c.p, manish, andrea, marcelo, linux-kernel

Jens Axboe <axboe@suse.de> wrote:
>
> > THIS IS IT! The last one. No pauses writing a 2Gb file now unless I do a read 
>  > midstream.
> 
>  Cool, especially since we can easily apply this to -rc5 without any
>  worries. Marcelo, if you please...?
> 
>  ===== drivers/block/ll_rw_blk.c 1.44 vs edited =====
>  --- 1.44/drivers/block/ll_rw_blk.c	Mon Apr 14 12:53:03 2003
>  +++ edited/drivers/block/ll_rw_blk.c	Wed May 28 12:49:30 2003
>  @@ -829,8 +829,7 @@
>   	 */
>   	if (q) {
>   		list_add(&req->queue, &q->rq[rw].free);
>  -		if (++q->rq[rw].count >= q->batch_requests &&
>  -				waitqueue_active(&q->wait_for_requests[rw]))
>  +		if (++q->rq[rw].count >= q->batch_requests)
>   			wake_up(&q->wait_for_requests[rw]);
>   	}
>   }

umm, I'd like confirmation of that.

The waitqueue_active() test is wrong because of a missing barrier, but only
on SMP.  And if it does make a mistake it will surely correct itself when the
next request is put back. (That's why I left it there...)

More testing, please.
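
For readers unfamiliar with the barrier issue, here is a user-space analogy
in C11 atomics, not kernel code. It assumes the usual shape of the problem:
the releasing side stores "a request is free" and then loads "is anybody
waiting?", while the sleeping side stores "I am waiting" and then loads "is
anything free?". With a full fence between the store and the load on both
sides, at least one of them is guaranteed to see the other's store. All toy_*
names are hypothetical:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int  toy_free_count;     /* stands in for q->rq[rw].count    */
static atomic_bool toy_waiter_queued;  /* stands in for waitqueue_active() */

/* Releasing side: returns true if a wake-up should be issued. */
static bool toy_release_one(void)
{
    int old = atomic_fetch_add_explicit(&toy_free_count, 1,
                                        memory_order_relaxed);
    assert(old >= 0);
    /* Full fence: the count update is ordered before the waiter check. */
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&toy_waiter_queued, memory_order_relaxed);
}

/* Sleeping side: returns true if the caller really has to sleep. */
static bool toy_prepare_to_wait(void)
{
    atomic_store_explicit(&toy_waiter_queued, true, memory_order_relaxed);
    /* Full fence: queueing ourselves is ordered before the count check. */
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&toy_free_count, memory_order_relaxed) == 0;
}
```

Without the fence in toy_release_one(), a weakly ordered SMP machine could
let the releasing side read a stale "nobody waiting" while the sleeper reads a
stale zero count, so the wake-up is skipped until the next release comes
along.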


* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 10:48                                 ` Con Kolivas
  2003-05-28 10:50                                   ` Jens Axboe
@ 2003-05-28 11:03                                   ` Nick Piggin
  1 sibling, 0 replies; 142+ messages in thread
From: Nick Piggin @ 2003-05-28 11:03 UTC (permalink / raw)
  To: Con Kolivas, Jens Axboe, Andrew Morton
  Cc: Matthias Mueller, m.c.p, manish, andrea, marcelo, linux-kernel



Con Kolivas wrote:

>On Wed, 28 May 2003 20:25, Jens Axboe wrote:
>
>>On Wed, May 28 2003, Andrew Morton wrote:
>>
>>>Then this (totally unlikely, don't bother):
>>>
>>>diff -puN drivers/block/ll_rw_blk.c~3 drivers/block/ll_rw_blk.c
>>>--- 24/drivers/block/ll_rw_blk.c~3	2003-05-28 03:21:15.000000000 -0700
>>>+++ 24-akpm/drivers/block/ll_rw_blk.c	2003-05-28 03:21:39.000000000 -0700
>>>@@ -829,8 +829,7 @@ void blkdev_release_request(struct reque
>>> 	 */
>>> 	if (q) {
>>> 		list_add(&req->queue, &q->rq[rw].free);
>>>-		if (++q->rq[rw].count >= q->batch_requests &&
>>>-				waitqueue_active(&q->wait_for_requests[rw]))
>>>+		if (++q->rq[rw].count >= q->batch_requests)
>>> 			wake_up(&q->wait_for_requests[rw]);
>>> 	}
>>> }
>>>
>>Well it's the only one left :). But you are right, try one of them at
>>the time, establishing the effect of each of them.
>>
>
>THIS IS IT! The last one. No pauses writing a 2Gb file now unless I do a read 
>midstream.
>
>
OK, I can't see how this would make a difference, but there
is similar (batch_requests) code in the mm tree, so it would
be nice if someone would work out what is going on.




* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 10:59                                     ` Andrew Morton
@ 2003-05-28 11:17                                       ` Marc-Christian Petersen
  2003-05-28 11:27                                         ` Andrew Morton
  2003-05-29 12:52                                         ` Andrea Arcangeli
  0 siblings, 2 replies; 142+ messages in thread
From: Marc-Christian Petersen @ 2003-05-28 11:17 UTC (permalink / raw)
  To: Andrew Morton, Jens Axboe
  Cc: kernel, matthias.mueller, manish, andrea, marcelo, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 396 bytes --]

On Wednesday 28 May 2003 12:59, Andrew Morton wrote:

Hi Andrew,

> umm, I'd like confirmation of that.
>
> The waitqueue_active() test is wrong because of a missing barrier, but only
> on SMP.  And if it does make a mistake it will surely correct itself when
> the next request is put back. (That's why I left it there...)
> More testing, please.
Does the attached one make sense?

ciao, Marc



[-- Attachment #2: llrwblk.patch --]
[-- Type: text/x-diff, Size: 478 bytes --]

--- old/drivers/block/ll_rw_blk.c	2003-05-14 23:11:08.000000000 +0200
+++ new/drivers/block/ll_rw_blk.c	2003-05-28 13:04:34.000000000 +0200
@@ -829,9 +829,10 @@ void blkdev_release_request(struct reque
 	 */
 	if (q) {
 		list_add(&req->queue, &q->rq[rw].free);
-		if (++q->rq[rw].count >= q->batch_requests &&
-				waitqueue_active(&q->wait_for_requests[rw]))
+		if (++q->rq[rw].count >= q->batch_requests) {
+			smp_mb();
 			wake_up(&q->wait_for_requests[rw]);
+		}
 	}
 }
 

^ permalink raw reply	[flat|nested] 142+ messages in thread
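
A minimal userspace sketch of the ordering issue Andrew describes in the
message above. This is not kernel code: the struct and function names are
made up for illustration, and the fields only model q->rq[rw].count,
q->batch_requests and the result of waitqueue_active(). C11 sequentially
consistent atomics stand in for the barrier that Andrew says is missing
on SMP.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for one direction (rw) of a 2.4 request queue. */
struct model_queue {
    atomic_int  free_count;     /* models q->rq[rw].count */
    atomic_bool waiter_queued;  /* models waitqueue_active(...) */
    int         batch_requests; /* models q->batch_requests */
};

static void model_queue_init(struct model_queue *q, int batch_requests)
{
    atomic_init(&q->free_count, 0);
    atomic_init(&q->waiter_queued, false);
    q->batch_requests = batch_requests;
}

/*
 * Waiter side, loosely following __get_request_wait(): announce ourselves
 * first, then look at the free count. Returns true if the caller would go
 * on to schedule().
 */
static bool waiter_should_sleep(struct model_queue *q)
{
    atomic_store(&q->waiter_queued, true);      /* seq_cst store */
    return atomic_load(&q->free_count) == 0;    /* seq_cst load  */
}

/*
 * Release side, loosely following blkdev_release_request() with the
 * waitqueue_active() test kept: bump the count first, then look for a
 * waiter. Returns true if wake_up() would be called.
 */
static bool release_should_wake(struct model_queue *q)
{
    int count = atomic_fetch_add(&q->free_count, 1) + 1; /* full barrier */
    assert(count > 0);
    return count >= q->batch_requests && atomic_load(&q->waiter_queued);
}
```

With both sides ordered like this, a waiter that saw a count of zero cannot
also be invisible to the releaser. Without the ordering, each side may read
the other's old value and the wakeup is skipped until the next request is
put back. Chunk 3 sidesteps the question by dropping the waiter test and
always calling wake_up() once the batch threshold is reached.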

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 11:17                                       ` Marc-Christian Petersen
@ 2003-05-28 11:27                                         ` Andrew Morton
  2003-05-28 11:31                                           ` Marc-Christian Petersen
  2003-05-28 11:41                                           ` Con Kolivas
  2003-05-29 12:52                                         ` Andrea Arcangeli
  1 sibling, 2 replies; 142+ messages in thread
From: Andrew Morton @ 2003-05-28 11:27 UTC (permalink / raw)
  To: Marc-Christian Petersen
  Cc: axboe, kernel, matthias.mueller, manish, andrea, marcelo, linux-kernel

Marc-Christian Petersen <m.c.p@wolk-project.de> wrote:
>
> Does the attached one make sense?

Nope.

Guys, you're the ones who can reproduce this.  Please spend more time
working out which chunk (or combination thereof) actually fixes the
problem.  If indeed any of them do.  

I'm suspecting that Con's fingers slipped.



^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 11:27                                         ` Andrew Morton
@ 2003-05-28 11:31                                           ` Marc-Christian Petersen
  2003-05-28 12:53                                             ` Jens Axboe
  2003-05-29 16:23                                             ` Marc-Christian Petersen
  2003-05-28 11:41                                           ` Con Kolivas
  1 sibling, 2 replies; 142+ messages in thread
From: Marc-Christian Petersen @ 2003-05-28 11:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: axboe, kernel, matthias.mueller, manish, andrea, marcelo, linux-kernel

On Wednesday 28 May 2003 13:27, Andrew Morton wrote:

Hi Akpm,

> > Does the attached one make sense?
> Nope.
nm.

> Guys, you're the ones who can reproduce this.  Please spend more time
> working out which chunk (or combination thereof) actually fixes the
> problem.  If indeed any of them do.
As I said, I will test it this evening. ATM I don't have time to recompile and 
reboot. This evening I will test extensively, even on SMP, SCSI, IDE and so 
on.

ciao, Marc


^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 11:27                                         ` Andrew Morton
  2003-05-28 11:31                                           ` Marc-Christian Petersen
@ 2003-05-28 11:41                                           ` Con Kolivas
  1 sibling, 0 replies; 142+ messages in thread
From: Con Kolivas @ 2003-05-28 11:41 UTC (permalink / raw)
  To: Andrew Morton, Marc-Christian Petersen
  Cc: axboe, matthias.mueller, manish, andrea, marcelo, linux-kernel

On Wed, 28 May 2003 21:27, Andrew Morton wrote:
> Marc-Christian Petersen <m.c.p@wolk-project.de> wrote:
> > Does the attached one make sense?
>
> Nope.
>
> Guys, you're the ones who can reproduce this.  Please spend more time
> working out which chunk (or combination thereof) actually fixes the
> problem.  If indeed any of them do.
>
> I'm suspecting that Con's fingers slipped.

I've been known to be email trigger happy in the past but a serious thrashing 
with just this one change made massive improvements. 

However - 
One test case does not a fix give.

Others please test this. It's extremely important.

If you're interested the best test for me is:
dd if=/dev/zero of=dump bs=4096 count=512000

Con

^ permalink raw reply	[flat|nested] 142+ messages in thread
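
For anyone who wants numbers rather than a frozen mouse, here is a rough
sketch of a writer that mimics the dd test above and records the longest
single block write. Everything in it (struct write_probe, probe_write(),
now_sec()) is hypothetical and written for illustration. It uses stdio and
the C11 wall clock, so it only approximates what dd does, and it measures
stalls seen by the writer itself, not by other processes.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Result of one run; the names are illustrative only. */
struct write_probe {
    long long bytes_written;   /* total payload written */
    double    worst_block_sec; /* longest single fwrite()+fflush() */
};

/* Wall-clock time in seconds (C11 timespec_get, not monotonic). */
static double now_sec(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/*
 * Write `count` zero-filled blocks of `block_size` bytes to `path`,
 * roughly like: dd if=/dev/zero of=path bs=block_size count=count
 * Each block is timed. Returns 0 on success, -1 on any I/O error.
 */
static int probe_write(const char *path, size_t block_size, long count,
                       struct write_probe *out)
{
    static char block[65536];  /* static storage, so already zeroed */
    FILE *f;
    long i;

    assert(block_size > 0 && block_size <= sizeof block);
    out->bytes_written = 0;
    out->worst_block_sec = 0.0;

    f = fopen(path, "wb");
    if (!f)
        return -1;

    for (i = 0; i < count; i++) {
        double t0 = now_sec();
        double dt;

        if (fwrite(block, 1, block_size, f) != block_size ||
            fflush(f) != 0) {
            fclose(f);
            return -1;
        }
        dt = now_sec() - t0;
        if (dt > out->worst_block_sec)
            out->worst_block_sec = dt;
        out->bytes_written += (long long)block_size;
    }
    return fclose(f) == 0 ? 0 : -1;
}
```

Calling probe_write("dump", 4096, 512000, &p) would match Con's command:
4096 * 512000 = 2,097,152,000 bytes, i.e. the roughly 2 GB file mentioned
earlier in the thread.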

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 10:23                             ` Andrew Morton
  2003-05-28 10:25                               ` Jens Axboe
  2003-05-28 10:29                               ` Con Kolivas
@ 2003-05-28 12:10                               ` Matthias Mueller
  2003-05-28 12:14                                 ` Matthias Mueller
  2003-05-29 13:19                                 ` Andrea Arcangeli
  2003-05-28 14:00                               ` Con Kolivas
  2003-05-29  1:32                               ` manish
  4 siblings, 2 replies; 142+ messages in thread
From: Matthias Mueller @ 2003-05-28 12:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: axboe, m.c.p, kernel, manish, andrea, marcelo, linux-kernel

On Wed, May 28, 2003 at 03:23:15AM -0700, Andrew Morton wrote:
> Could you please work out which change caused it?  Go back to stock 2.4 and
> then apply this:
> 
> 
> diff -puN drivers/block/ll_rw_blk.c~1 drivers/block/ll_rw_blk.c
> --- 24/drivers/block/ll_rw_blk.c~1	2003-05-28 03:20:42.000000000 -0700
> +++ 24-akpm/drivers/block/ll_rw_blk.c	2003-05-28 03:20:57.000000000 -0700
> @@ -590,10 +590,10 @@ static struct request *__get_request_wai
>  	register struct request *rq;
>  	DECLARE_WAITQUEUE(wait, current);
>  
> -	generic_unplug_device(q);
>  	add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
>  	do {
>  		set_current_state(TASK_UNINTERRUPTIBLE);
> +		generic_unplug_device(q);
>  		if (q->rq[rw].count == 0)
>  			schedule();
>  		spin_lock_irq(&io_request_lock);
> 
> 
> 
> then this:
> 
> diff -puN drivers/block/ll_rw_blk.c~2 drivers/block/ll_rw_blk.c
> --- 24/drivers/block/ll_rw_blk.c~2	2003-05-28 03:21:03.000000000 -0700
> +++ 24-akpm/drivers/block/ll_rw_blk.c	2003-05-28 03:21:09.000000000 -0700
> @@ -590,7 +590,7 @@ static struct request *__get_request_wai
>  	register struct request *rq;
>  	DECLARE_WAITQUEUE(wait, current);
>  
> -	add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
> +	add_wait_queue(&q->wait_for_requests[rw], &wait);
>  	do {
>  		set_current_state(TASK_UNINTERRUPTIBLE);
>  		generic_unplug_device(q);
> 
> 
> Then this (totally unlikely, don't bother):
> 
> diff -puN drivers/block/ll_rw_blk.c~3 drivers/block/ll_rw_blk.c
> --- 24/drivers/block/ll_rw_blk.c~3	2003-05-28 03:21:15.000000000 -0700
> +++ 24-akpm/drivers/block/ll_rw_blk.c	2003-05-28 03:21:39.000000000 -0700
> @@ -829,8 +829,7 @@ void blkdev_release_request(struct reque
>  	 */
>  	if (q) {
>  		list_add(&req->queue, &q->rq[rw].free);
> -		if (++q->rq[rw].count >= q->batch_requests &&
> -				waitqueue_active(&q->wait_for_requests[rw]))
> +		if (++q->rq[rw].count >= q->batch_requests)
>  			wake_up(&q->wait_for_requests[rw]);
>  	}
>  }
> 
> _

Tested all of them and some combinations:
patch 1 alone: still mouse hangs
patch 2 alone: still mouse hangs
patch 3 alone: no hangs, but I get some zombie processes (starting a lot of
               xterms results in zombie xterms, not noticed with vanilla
               and the other patches)
patch 1+2: no mouse hangs
patch 1+2+3: no mouse hangs, no zombies

Bye,
Matthias
-- 
Matthias.Mueller@rz.uni-karlsruhe.de
Rechenzentrum Universitaet Karlsruhe
Abteilung Netze

^ permalink raw reply	[flat|nested] 142+ messages in thread
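
Patch 2 in the message above swaps add_wait_queue_exclusive() for
add_wait_queue(). As a rough illustration of what that changes, here is a
toy model of a single wake_up() call: it wakes waiters in queue order and
stops after the first exclusive one. The types and the function name are
invented for this sketch, and it ignores task states and locking entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy waiter; not a real wait_queue_t. */
struct model_waiter {
    bool exclusive; /* queued with add_wait_queue_exclusive() */
    bool awake;     /* already woken by an earlier wake_up() */
};

/*
 * Model of one wake_up(): walk the queue in order, wake every sleeping
 * waiter, and stop once one exclusive waiter has been woken.
 * Returns the number of waiters woken by this call.
 */
static size_t model_wake_up(struct model_waiter *w, size_t n)
{
    size_t i, woken = 0;

    assert(w != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (w[i].awake)
            continue;       /* nothing to do for a running task */
        w[i].awake = true;
        woken++;
        if (w[i].exclusive)
            break;          /* only one exclusive wakeup per call */
    }
    return woken;
}
```

In this model, three exclusive waiters need three separate wake_up() calls,
while three non-exclusive waiters are all woken by the first one. That
would fit the observation that patch 2 helps in combination with patch 1,
but the model says nothing about why it does not help on its own.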

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 12:10                               ` Matthias Mueller
@ 2003-05-28 12:14                                 ` Matthias Mueller
  2003-05-28 12:21                                   ` Carl-Daniel Hailfinger
  2003-05-29 13:19                                 ` Andrea Arcangeli
  1 sibling, 1 reply; 142+ messages in thread
From: Matthias Mueller @ 2003-05-28 12:14 UTC (permalink / raw)
  To: Andrew Morton, axboe, m.c.p, kernel, manish, andrea, marcelo,
	linux-kernel

On Wed, May 28, 2003 at 02:10:40PM +0200, Matthias Mueller wrote:
> Tested all of them and some combinations:
> patch 1 alone: still mouse hangs
> patch 2 alone: still mouse hangs
> patch 3 alone: no hangs, but I get some zombie process (starting a lot of
>                xterms results in zombie xterms, not noticed with vanilla
>                and the other patches)
> patch 1+2: no mouse hangs
> patch 1+2+3: no mouse hangs, no zombies

Forgot to mention: no zombies with patch 1 or 2

Matthias
-- 
Matthias.Mueller@rz.uni-karlsruhe.de
Rechenzentrum Universitaet Karlsruhe
Abteilung Netze

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 12:14                                 ` Matthias Mueller
@ 2003-05-28 12:21                                   ` Carl-Daniel Hailfinger
  2003-05-28 12:23                                     ` Matthias Mueller
  0 siblings, 1 reply; 142+ messages in thread
From: Carl-Daniel Hailfinger @ 2003-05-28 12:21 UTC (permalink / raw)
  To: Matthias Mueller
  Cc: Andrew Morton, axboe, m.c.p, kernel, manish, andrea, marcelo,
	linux-kernel

Matthias Mueller wrote:
> On Wed, May 28, 2003 at 02:10:40PM +0200, Matthias Mueller wrote:
> 
>>Tested all of them and some combinations:
>>patch 1 alone: still mouse hangs
>>patch 2 alone: still mouse hangs
>>patch 3 alone: no hangs, but I get some zombie process (starting a lot of
>>               xterms results in zombie xterms, not noticed with vanilla
>>               and the other patches)
>>patch 1+2: no mouse hangs
>>patch 1+2+3: no mouse hangs, no zombies
> 
> 
> Forgot to mention: no zombies with patch 1 or 2

So 1+2 gives you zombies?


Carl-Daniel


^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 12:21                                   ` Carl-Daniel Hailfinger
@ 2003-05-28 12:23                                     ` Matthias Mueller
  2003-05-28 12:28                                       ` Carl-Daniel Hailfinger
  0 siblings, 1 reply; 142+ messages in thread
From: Matthias Mueller @ 2003-05-28 12:23 UTC (permalink / raw)
  To: Carl-Daniel Hailfinger
  Cc: Andrew Morton, axboe, m.c.p, kernel, manish, andrea, marcelo,
	linux-kernel

On Wed, May 28, 2003 at 02:21:08PM +0200, Carl-Daniel Hailfinger wrote:
> Matthias Mueller wrote:
> > On Wed, May 28, 2003 at 02:10:40PM +0200, Matthias Mueller wrote:
> > 
> >>Tested all of them and some combinations:
> >>patch 1 alone: still mouse hangs
> >>patch 2 alone: still mouse hangs
> >>patch 3 alone: no hangs, but I get some zombie process (starting a lot of
> >>               xterms results in zombie xterms, not noticed with vanilla
> >>               and the other patches)
> >>patch 1+2: no mouse hangs
> >>patch 1+2+3: no mouse hangs, no zombies
> > 
> > 
> > Forgot to mention: no zombies with patch 1 or 2
> 
> So 1+2 gives you zombies?

No, 1+2 works OK; I just forgot to mention that, too. I think I should go to
sleep...

Matthias

-- 
Matthias.Mueller@rz.uni-karlsruhe.de
Rechenzentrum Universitaet Karlsruhe
Abteilung Netze

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 12:23                                     ` Matthias Mueller
@ 2003-05-28 12:28                                       ` Carl-Daniel Hailfinger
  2003-05-28 12:38                                         ` Matthias Mueller
  0 siblings, 1 reply; 142+ messages in thread
From: Carl-Daniel Hailfinger @ 2003-05-28 12:28 UTC (permalink / raw)
  To: Matthias Mueller
  Cc: Andrew Morton, axboe, m.c.p, kernel, manish, andrea, marcelo,
	linux-kernel

Matthias Mueller wrote:
> On Wed, May 28, 2003 at 02:21:08PM +0200, Carl-Daniel Hailfinger wrote:
> 
>>Matthias Mueller wrote:
>>
>>>On Wed, May 28, 2003 at 02:10:40PM +0200, Matthias Mueller wrote:
>>>
>>>
>>>>Tested all of them and some combinations:
>>>>patch 1 alone:    hangs, no zombies
>>>>patch 2 alone:    hangs, no zombies
>>>>patch 3 alone: no hangs,    zombies
>>>>patch 1+2:     no hangs, no zombies
>>>>patch 1+2+3:   no hangs, no zombies

Right?


^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 12:28                                       ` Carl-Daniel Hailfinger
@ 2003-05-28 12:38                                         ` Matthias Mueller
  0 siblings, 0 replies; 142+ messages in thread
From: Matthias Mueller @ 2003-05-28 12:38 UTC (permalink / raw)
  To: Carl-Daniel Hailfinger
  Cc: Andrew Morton, axboe, m.c.p, kernel, manish, andrea, marcelo,
	linux-kernel

On Wed, May 28, 2003 at 02:28:10PM +0200, Carl-Daniel Hailfinger wrote:
> Matthias Mueller wrote:
> > On Wed, May 28, 2003 at 02:21:08PM +0200, Carl-Daniel Hailfinger wrote:
> > 
> >>Matthias Mueller wrote:
> >>
> >>>On Wed, May 28, 2003 at 02:10:40PM +0200, Matthias Mueller wrote:
> >>>
> >>>
> >>>>Tested all of them and some combinations:
> >>>>patch 1 alone:    hangs, no zombies
> >>>>patch 2 alone:    hangs, no zombies
> >>>>patch 3 alone: no hangs,    zombies
> >>>>patch 1+2:     no hangs, no zombies
> >>>>patch 1+2+3:   no hangs, no zombies
> 
> Right?
Yes.

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 11:31                                           ` Marc-Christian Petersen
@ 2003-05-28 12:53                                             ` Jens Axboe
  2003-05-28 12:58                                               ` Matthias Mueller
                                                                 ` (4 more replies)
  2003-05-29 16:23                                             ` Marc-Christian Petersen
  1 sibling, 5 replies; 142+ messages in thread
From: Jens Axboe @ 2003-05-28 12:53 UTC (permalink / raw)
  To: Marc-Christian Petersen
  Cc: Andrew Morton, kernel, matthias.mueller, manish, andrea, marcelo,
	linux-kernel

On Wed, May 28 2003, Marc-Christian Petersen wrote:
> On Wednesday 28 May 2003 13:27, Andrew Morton wrote:
> 
> Hi Akpm,
> 
> > > Does the attached one make sense?
> > Nope.
> nm.
> 
> > Guys, you're the ones who can reproduce this.  Please spend more time
> > working out which chunk (or combination thereof) actually fixes the
> > problem.  If indeed any of them do.
> As I said, I will test it this evening. ATM I don't have time to
> recompile and reboot. This evening I will test extensively, even on
> SMP, SCSI, IDE and so on.

May I ask how you are reproducing the bad results? I'm trying in vain
here...

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 12:53                                             ` Jens Axboe
@ 2003-05-28 12:58                                               ` Matthias Mueller
  2003-05-28 13:07                                               ` Carl-Daniel Hailfinger
                                                                 ` (3 subsequent siblings)
  4 siblings, 0 replies; 142+ messages in thread
From: Matthias Mueller @ 2003-05-28 12:58 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Marc-Christian Petersen, Andrew Morton, kernel, manish, andrea,
	marcelo, linux-kernel

On Wed, May 28, 2003 at 02:53:12PM +0200, Jens Axboe wrote:
> May I ask how you are reproducing the bad results? I'm trying in vain
> here...

I can reproduce it with dd if=/dev/zero of=trash bs=4096 count=65000 on my
notebook (probably a slower harddisk makes it easier to see the mouse
hangs).

Matthias

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 12:53                                             ` Jens Axboe
  2003-05-28 12:58                                               ` Matthias Mueller
@ 2003-05-28 13:07                                               ` Carl-Daniel Hailfinger
  2003-05-28 13:08                                                 ` Jens Axboe
  2003-05-28 13:25                                               ` Stefan Foerster
                                                                 ` (2 subsequent siblings)
  4 siblings, 1 reply; 142+ messages in thread
From: Carl-Daniel Hailfinger @ 2003-05-28 13:07 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Marc-Christian Petersen, Andrew Morton, kernel, matthias.mueller,
	manish, andrea, marcelo, linux-kernel

Jens Axboe wrote:
> On Wed, May 28 2003, Marc-Christian Petersen wrote:
> 
>>On Wednesday 28 May 2003 13:27, Andrew Morton wrote:
>>
>>>Guys, you're the ones who can reproduce this.  Please spend more time
>>>working out which chunk (or combination thereof) actually fixes the
>>>problem.  If indeed any of them do.
>>
>>As I said, I will test it this evening. ATM I don't have time to
>>recompile and reboot. This evening I will test extensively, even on
>>SMP, SCSI, IDE and so on.
> 
> May I ask how you are reproducing the bad results? I'm trying in vain
> here...

Quoting Con Kolivas:

dd if=/dev/zero of=dump bs=4096 count=512000


HTH,
Carl-Daniel


^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 13:07                                               ` Carl-Daniel Hailfinger
@ 2003-05-28 13:08                                                 ` Jens Axboe
  2003-05-28 13:16                                                   ` Matthias Mueller
                                                                     ` (3 more replies)
  0 siblings, 4 replies; 142+ messages in thread
From: Jens Axboe @ 2003-05-28 13:08 UTC (permalink / raw)
  To: Carl-Daniel Hailfinger
  Cc: Marc-Christian Petersen, Andrew Morton, kernel, matthias.mueller,
	manish, andrea, marcelo, linux-kernel

On Wed, May 28 2003, Carl-Daniel Hailfinger wrote:
> Jens Axboe wrote:
> > On Wed, May 28 2003, Marc-Christian Petersen wrote:
> > 
> >>On Wednesday 28 May 2003 13:27, Andrew Morton wrote:
> >>
> >>>Guys, you're the ones who can reproduce this.  Please spend more time
> >>>working out which chunk (or combination thereof) actually fixes the
> >>>problem.  If indeed any of them do.
> >>
> >>As I said, I will test it this evening. ATM I don't have time to
> >>recompile and reboot. This evening I will test extensively, even on
> >>SMP, SCSI, IDE and so on.
> > 
> > May I ask how you are reproducing the bad results? I'm trying in vain
> > here...
> 
> Quoting Con Kolivas:
> 
> dd if=/dev/zero of=dump bs=4096 count=512000

already tried that, no go. on ide/scsi? what filesystem? how much ram?
anything else running? smp/up?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 13:08                                                 ` Jens Axboe
@ 2003-05-28 13:16                                                   ` Matthias Mueller
  2003-05-28 13:21                                                   ` Con Kolivas
                                                                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 142+ messages in thread
From: Matthias Mueller @ 2003-05-28 13:16 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Carl-Daniel Hailfinger, Marc-Christian Petersen, Andrew Morton,
	kernel, manish, andrea, marcelo, linux-kernel

On Wed, May 28, 2003 at 03:08:39PM +0200, Jens Axboe wrote:
> > > May I ask how you are reproducing the bad results? I'm trying in vain
> > > here...
> > 
> > Quoting Con Kolivas:
> > 
> > dd if=/dev/zero of=dump bs=4096 count=512000
> 
> already tried that, no go. on ide/scsi? what filesystem? how much ram?
> anything else running? smp/up?

IDE notebook hard drive, tested with ext2 and ext3. 256MB RAM, X11 started,
idle bind9 and idle postgresql. Tested directly after a reboot, ~85MB RAM
used without buffers/cache.

Matthias

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 13:08                                                 ` Jens Axboe
  2003-05-28 13:16                                                   ` Matthias Mueller
@ 2003-05-28 13:21                                                   ` Con Kolivas
  2003-05-28 13:30                                                     ` Carl-Daniel Hailfinger
  2003-05-28 13:27                                                   ` Stefan Foerster
  2003-05-28 14:28                                                   ` Chris Mason
  3 siblings, 1 reply; 142+ messages in thread
From: Con Kolivas @ 2003-05-28 13:21 UTC (permalink / raw)
  To: Jens Axboe, Carl-Daniel Hailfinger
  Cc: Marc-Christian Petersen, Andrew Morton, matthias.mueller, manish,
	andrea, marcelo, linux-kernel

On Wed, 28 May 2003 23:08, Jens Axboe wrote:
> On Wed, May 28 2003, Carl-Daniel Hailfinger wrote:
> > Jens Axboe wrote:
> > > On Wed, May 28 2003, Marc-Christian Petersen wrote:
> > >>On Wednesday 28 May 2003 13:27, Andrew Morton wrote:
> > >>>Guys, you're the ones who can reproduce this.  Please spend more time
> > >>>working out which chunk (or combination thereof) actually fixes the
> > >>>problem.  If indeed any of them do.
> > >>
> > >>As I said, I will test it this evening. ATM I don't have time to
> > >>recompile and reboot. This evening I will test extensively, even on
> > >>SMP, SCSI, IDE and so on.
> > >
> > > May I ask how you are reproducing the bad results? I'm trying in vain
> > > here...
> >
> > Quoting Con Kolivas:
> >
> > dd if=/dev/zero of=dump bs=4096 count=512000
>
> already tried that, no go. on ide/scsi? what filesystem? how much ram?
> anything else running? smp/up?

I'm using UP on IDE. I reproduce it easily on a P3 256MB laptop with a 5400rpm 
drive, and less easily (but it still occurs) on a P4 2.53 512MB PC with 
2x7200rpm software RAID 0 IDE drives. Even if the only thing you try to do is 
move the mouse, the mouse will freeze for up to 30 seconds. When you first 
start the write, no disk activity happens for up to a few seconds; then it 
starts writing madly and the machine comes to a standstill for a variable 
length of time. Then it comes back to life for a few seconds, only to die 
again for a few seconds, and so on until the write is complete.

Still testing combinations to see which is the best, but 1+2 seems better than 
3 alone, as doing reads midstream in the write doesn't cause hangs. I have 
never seen zombie processes.

Con

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 12:53                                             ` Jens Axboe
  2003-05-28 12:58                                               ` Matthias Mueller
  2003-05-28 13:07                                               ` Carl-Daniel Hailfinger
@ 2003-05-28 13:25                                               ` Stefan Foerster
  2003-05-28 18:19                                               ` Zwane Mwaikambo
  2003-05-28 18:47                                               ` Elladan
  4 siblings, 0 replies; 142+ messages in thread
From: Stefan Foerster @ 2003-05-28 13:25 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Marc-Christian Petersen, Andrew Morton, kernel, matthias.mueller,
	manish, andrea, marcelo, linux-kernel

* Jens Axboe <axboe@suse.de> wrote:
> On Wed, May 28 2003, Marc-Christian Petersen wrote:
>>> Guys, you're the ones who can reproduce this.  Please spend more time
>>> working out which chunk (or combination thereof) actually fixes the
>>> problem.  If indeed any of them do.
>> As I said, I will test it this evening. ATM I don't have time to
>> recompile and reboot. This evening I will test extensively, even on
>> SMP, SCSI, IDE and so on.
> 
> May I ask how you are reproducing the bad results? I'm trying in vain
> here...

It is easily reproducible by using dd with an appropriate blocksize
reading from /dev/zero.

With chunk #3 from Andrew, I do not get pauses, but I noticed text
scrolling in an xterm stopping for like a second.

I did not get any zombie processes.

Ciao
Stefan


^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 13:08                                                 ` Jens Axboe
  2003-05-28 13:16                                                   ` Matthias Mueller
  2003-05-28 13:21                                                   ` Con Kolivas
@ 2003-05-28 13:27                                                   ` Stefan Foerster
  2003-05-28 13:37                                                     ` Stefan Foerster
  2003-05-28 14:28                                                   ` Chris Mason
  3 siblings, 1 reply; 142+ messages in thread
From: Stefan Foerster @ 2003-05-28 13:27 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Carl-Daniel Hailfinger, Marc-Christian Petersen, Andrew Morton,
	kernel, matthias.mueller, manish, andrea, marcelo, linux-kernel

* Jens Axboe <axboe@suse.de> wrote:
> On Wed, May 28 2003, Carl-Daniel Hailfinger wrote:
>> dd if=/dev/zero of=dump bs=4096 count=512000
> 
> already tried that, no go. on ide/scsi? what filesystem? how much ram?
> anything else running? smp/up?

It doesn't matter whether it's IDE or SCSI; to be honest, SCSI with the old
aic7xxx from vanilla 2.4.20 is even worse than IDE.

My box is UP (uniprocessor) and had only my window manager with some open
xterms running, nothing which should create any load.


Ciao
Stefan
-- 
Stefan Förster                                  Public Key: 0xBBE2A9E9
FdI #122: Updateritis - Softwarebulemie (Frank Klemm)


^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 13:21                                                   ` Con Kolivas
@ 2003-05-28 13:30                                                     ` Carl-Daniel Hailfinger
  2003-05-28 13:33                                                       ` Con Kolivas
  0 siblings, 1 reply; 142+ messages in thread
From: Carl-Daniel Hailfinger @ 2003-05-28 13:30 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Jens Axboe, Marc-Christian Petersen, Andrew Morton,
	matthias.mueller, manish, andrea, marcelo, linux-kernel

Con Kolivas wrote:
> On Wed, 28 May 2003 23:08, Jens Axboe wrote:
> 
>>On Wed, May 28 2003, Carl-Daniel Hailfinger wrote:
>>
>>>Jens Axboe wrote:
>>>
>>>>On Wed, May 28 2003, Marc-Christian Petersen wrote:
>>>>
>>>>>On Wednesday 28 May 2003 13:27, Andrew Morton wrote:
>>>>>
>>>>>>Guys, you're the ones who can reproduce this.  Please spend more time
>>>>>>working out which chunk (or combination thereof) actually fixes the
>>>>>>problem.  If indeed any of them do.
>>>>>
>>>>>As I said, I will test it this evening. ATM I don't have time to
>>>>>recompile and reboot. This evening I will test extensively, even on
>>>>>SMP, SCSI, IDE and so on.
>>>>
>>>>May I ask how you are reproducing the bad results? I'm trying in vain
>>>>here...
>>>
>>>Quoting Con Kolivas:
>>>
>>>dd if=/dev/zero of=dump bs=4096 count=512000
>>
>>already tried that, no go. on ide/scsi? what filesystem? how much ram?
>>anything else running? smp/up?
> 
> 
> I'm using UP on IDE. I reproduce it easily on a P3 256Mb laptop with 5400rpm 
> drive, and less easily but still occurs on a P4 2.53 512Mb pc with 2x7200rpm 
> software raid 0 IDE drives. Even if the only thing you try to do is move the 
> mouse, the mouse will freeze for up to 30secs. When you first start the write 
> no disk activity happens for up to a few seconds, then it will start writing 
> madly and the machine will come to a standstill for a variable length of 
> time. Then it will come back to life for a few seconds only to die again for 
> a few seconds and so on till the write is complete.
> 
> Still testing combinations to see which is the best, but 1+2 seems better than 
> 3 alone as doing reads midstream in the write don't cause hangs. I haven't 
> seen zombie processes ever.

Just curious - which compiler did you use?


Carl-Daniel


^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 13:30                                                     ` Carl-Daniel Hailfinger
@ 2003-05-28 13:33                                                       ` Con Kolivas
  0 siblings, 0 replies; 142+ messages in thread
From: Con Kolivas @ 2003-05-28 13:33 UTC (permalink / raw)
  To: Carl-Daniel Hailfinger
  Cc: Jens Axboe, Marc-Christian Petersen, Andrew Morton,
	matthias.mueller, manish, andrea, marcelo, linux-kernel

On Wed, 28 May 2003 23:30, Carl-Daniel Hailfinger wrote:
> Con Kolivas wrote:
> > On Wed, 28 May 2003 23:08, Jens Axboe wrote:
> >>On Wed, May 28 2003, Carl-Daniel Hailfinger wrote:
> >>>Jens Axboe wrote:
> >>>>On Wed, May 28 2003, Marc-Christian Petersen wrote:
> >>>>>On Wednesday 28 May 2003 13:27, Andrew Morton wrote:
> >>>>>>Guys, you're the ones who can reproduce this.  Please spend more time
> >>>>>>working out which chunk (or combination thereof) actually fixes the
> >>>>>>problem.  If indeed any of them do.
> >>>>>
> >>>>>As I said, I will test it this evening. ATM I don't have time to
> >>>>>recompile and reboot. This evening I will test extensively, even on
> >>>>>SMP, SCSI, IDE and so on.
> >>>>
> >>>>May I ask how you are reproducing the bad results? I'm trying in vain
> >>>>here...
> >>>
> >>>Quoting Con Kolivas:
> >>>
> >>>dd if=/dev/zero of=dump bs=4096 count=512000
> >>
> >>already tried that, no go. on ide/scsi? what filesystem? how much ram?
> >>anything else running? smp/up?
> >
> > I'm using UP on IDE. I reproduce it easily on a P3 256Mb laptop with
> > 5400rpm drive, and less easily but still occurs on a P4 2.53 512Mb pc
> > with 2x7200rpm software raid 0 IDE drives. Even if the only thing you try
> > to do is move the mouse, the mouse will freeze for up to 30secs. When you
> > first start the write no disk activity happens for up to a few seconds,
> > then it will start writing madly and the machine will come to a
> > standstill for a variable length of time. Then it will come back to life
> > for a few seconds only to die again for a few seconds and so on till the
> > write is complete.
> >
> > Still testing combinations to see which is the best, but 1+2 seems better
> > than 3 alone as doing reads midstream in the write don't cause hangs. I
> > haven't seen zombie processes ever.
>
> Just curious - which compiler did you use?

For this latest testing gcc 3.2.2

The hangs predate this; I was already getting them back when I was using 
2.95.3.

Con

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 13:27                                                   ` Stefan Foerster
@ 2003-05-28 13:37                                                     ` Stefan Foerster
  0 siblings, 0 replies; 142+ messages in thread
From: Stefan Foerster @ 2003-05-28 13:37 UTC (permalink / raw)
  To: linux-kernel

* Stefan Foerster <stefan@stefan-foerster.de> wrote:
[...]
> Doesn't matter if IDE or SCSI, to be honest, SCSI with the old aic7xxx
> from vanilla 2.4.20 is even worse than IDE.
> 
> My box is up, had only my window manager with some open xterms
> running, nothing which should create any load.

Oh silly me, forgot to include that info: I have 512MB of RAM and an Athlon XP.

Filesystems didn't seem to matter much in my tests, got hangs with
ext2, ext3 and XFS.


Ciao
Stefan
-- 
Stefan Förster                                  Public Key: 0xBBE2A9E9
FdI #44: Verdeckter Fehler - Siemens hat mitentwickelt. (Jörg Pechau)


^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 10:23                             ` Andrew Morton
                                                 ` (2 preceding siblings ...)
  2003-05-28 12:10                               ` Matthias Mueller
@ 2003-05-28 14:00                               ` Con Kolivas
  2003-05-29 13:24                                 ` Andrea Arcangeli
  2003-05-29  1:32                               ` manish
  4 siblings, 1 reply; 142+ messages in thread
From: Con Kolivas @ 2003-05-28 14:00 UTC (permalink / raw)
  To: Andrew Morton, Matthias Mueller
  Cc: axboe, m.c.p, manish, andrea, marcelo, linux-kernel

On Wed, 28 May 2003 20:23, Andrew Morton wrote:
> Could you please work out which change caused it?  Go back to stock 2.4 and
> then apply this:
>
[snip] 1

> then this:
[snip] 2

> Then this (totally unlikely, don't bother):
[snip] 3

Ok patch combination final score for me is as follows in the presence of a 
large continuous write:
1 No change
2 No change
3 improvement++; minor hangs with reads
1+2 improvement+++; minor pauses with switching applications
1+2+3 improvement++++; no pauses

Applications may start up slowly, but that's fine. With 1+2+3 the mouse cursor
keeps spinning and responding at all times, which it hasn't done in 2.4 for
a year or so.

Con

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 13:08                                                 ` Jens Axboe
                                                                     ` (2 preceding siblings ...)
  2003-05-28 13:27                                                   ` Stefan Foerster
@ 2003-05-28 14:28                                                   ` Chris Mason
  2003-05-28 14:33                                                     ` Jens Axboe
  3 siblings, 1 reply; 142+ messages in thread
From: Chris Mason @ 2003-05-28 14:28 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Carl-Daniel Hailfinger, Marc-Christian Petersen, Andrew Morton,
	kernel, matthias.mueller, manish, andrea, marcelo, linux-kernel

On Wed, 2003-05-28 at 09:08, Jens Axboe wrote:
>  
> > > May I ask how you are reproducing the bad results? I'm trying in vain
> > > here...
> > 
> > Quoting Con Kolivas:
> > 
> > dd if=/dev/zero of=dump bs=4096 count=512000
> 
> already tried that, no go. on ide/scsi? what filesystem? how much ram?
> anything else running? smp/up?

I think we've got a few different problems.  On SMP boxes, you need to
have the fix-pausing patch from andrea applied to catch all the corner
cases.

On UP boxes it's possible the requests are starving in the drive. SCSI
users should try with the max tags set down to something sensible,
between 8 and 32.

IDE people can try lowering the max_kb_per_request parameter in
/proc/ide/<drive>/settings, but this should only affect starvation with
the writeback cache on.

I made a patch a while ago that timed how long people spent waiting in
__get_request_wait, it might help us figure out where the starvation is
really happening.
 
-chris



^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 14:28                                                   ` Chris Mason
@ 2003-05-28 14:33                                                     ` Jens Axboe
  2003-05-28 14:58                                                       ` Chris Mason
  0 siblings, 1 reply; 142+ messages in thread
From: Jens Axboe @ 2003-05-28 14:33 UTC (permalink / raw)
  To: Chris Mason
  Cc: Carl-Daniel Hailfinger, Marc-Christian Petersen, Andrew Morton,
	kernel, matthias.mueller, manish, andrea, marcelo, linux-kernel

On Wed, May 28 2003, Chris Mason wrote:
> On Wed, 2003-05-28 at 09:08, Jens Axboe wrote:
> >  
> > > > May I ask how you are reproducing the bad results? I'm trying in vain
> > > > here...
> > > 
> > > Quoting Con Kolivas:
> > > 
> > > dd if=/dev/zero of=dump bs=4096 count=512000
> > 
> > already tried that, no go. on ide/scsi? what filesystem? how much ram?
> > anything else running? smp/up?
> 
> I think we've got a few different problems.  On SMP boxes, you need to
> have the fix-pausing patch from andrea applied to catch all the corner
> cases.

Agree

> 
> On UP boxes it's possible the requests are starving in the drive, SCSI
> users should try with the max tags set down to something sensible,
> between 8 and 32.
> 
> IDE people can try lowering the max_kb_per_request parameter in
> /proc/ide/<drive>/settings, but this should only affect starvation with
> the writeback cache on.
> 
> I made a patch a while ago that timed how long people spent waiting in
> __get_request_wait, it might help us figure out where the starvation is
> really happening.

But this seems totally unrelated to the reported problems, we are
talking about complete stalls of the mouse. No amount of io starvation
should provoke something like that.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 14:33                                                     ` Jens Axboe
@ 2003-05-28 14:58                                                       ` Chris Mason
  2003-05-28 15:39                                                         ` Jens Axboe
  0 siblings, 1 reply; 142+ messages in thread
From: Chris Mason @ 2003-05-28 14:58 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Carl-Daniel Hailfinger, Marc-Christian Petersen, Andrew Morton,
	kernel, matthias.mueller, manish, andrea, marcelo, linux-kernel

On Wed, 2003-05-28 at 10:33, Jens Axboe wrote:

> > On UP boxes it's possible the requests are starving in the drive, SCSI
> > users should try with the max tags set down to something sensible,
> > between 8 and 32.
> > 
> > IDE people can try lowering the max_kb_per_request parameter in
> > /proc/ide/<drive>/settings, but this should only affect starvation with
> > the writeback cache on.
> > 
> > I made a patch a while ago that timed how long people spent waiting in
> > __get_request_wait, it might help us figure out where the starvation is
> > really happening.
> 
> But this seems totally unrelated to the reported problems, we are
> talking about complete stalls of the mouse. No amount of io starvation
> should provoke something like that.

Well, if it wasn't io related starvation, andrew's batch requests patch
wouldn't change things.  I'm hoping the stats patch will get us some
numbers to go along with the perceived stalls, almost done merging.

-chris



^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 14:58                                                       ` Chris Mason
@ 2003-05-28 15:39                                                         ` Jens Axboe
  2003-05-28 23:38                                                           ` Chris Mason
  0 siblings, 1 reply; 142+ messages in thread
From: Jens Axboe @ 2003-05-28 15:39 UTC (permalink / raw)
  To: Chris Mason
  Cc: Carl-Daniel Hailfinger, Marc-Christian Petersen, Andrew Morton,
	kernel, matthias.mueller, manish, andrea, marcelo, linux-kernel

On Wed, May 28 2003, Chris Mason wrote:
> On Wed, 2003-05-28 at 10:33, Jens Axboe wrote:
> 
> > > On UP boxes it's possible the requests are starving in the drive, SCSI
> > > users should try with the max tags set down to something sensible,
> > > between 8 and 32.
> > > 
> > > IDE people can try lowering the max_kb_per_request parameter in
> > > /proc/ide/<drive>/settings, but this should only affect starvation with
> > > the writeback cache on.
> > > 
> > > I made a patch a while ago that timed how long people spent waiting in
> > > __get_request_wait, it might help us figure out where the starvation is
> > > really happening.
> > 
> > But this seems totally unrelated to the reported problems, we are
> > talking about complete stalls of the mouse. No amount of io starvation
> > should provoke something like that.
> 
> Well, if it wasn't io related starvation, andrew's batch requests patch
> wouldn't change things.  I'm hoping the stats patch will get us some
> numbers to go along with the perceived stalls, almost done merging.

Correction then, it doesn't appear to be starvation in the usual sense.
But you are right, pulling some stats out of the situation would be
nice. I still can't reproduce here.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 12:53                                             ` Jens Axboe
                                                                 ` (2 preceding siblings ...)
  2003-05-28 13:25                                               ` Stefan Foerster
@ 2003-05-28 18:19                                               ` Zwane Mwaikambo
  2003-05-28 18:32                                                 ` Zwane Mwaikambo
  2003-05-28 18:47                                               ` Elladan
  4 siblings, 1 reply; 142+ messages in thread
From: Zwane Mwaikambo @ 2003-05-28 18:19 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Marc-Christian Petersen, Andrew Morton, kernel, matthias.mueller,
	manish, andrea, marcelo, linux-kernel

On Wed, 28 May 2003, Jens Axboe wrote:

> > > Guys, you're the ones who can reproduce this.  Please spend more time
> > > working out which chunk (or combination thereof) actually fixes the
> > > problem.  If indeed any of them do.
> > As I said, I will test it this evening. ATM I don't have time to
> > recompile and reboot. This evening I will test extensively, even on
> > SMP, SCSI, IDE and so on.
> 
> May I ask how you are reproducing the bad results? I'm trying in vain
> here...

I can reproduce it across spindles by cvs import'ing a kernel tree.
Make sure you're running X11 and try to do things in it, e.g. scrolling
windows, dragging etc.

	Zwane
 -- 
function.linuxpower.ca

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 18:19                                               ` Zwane Mwaikambo
@ 2003-05-28 18:32                                                 ` Zwane Mwaikambo
  0 siblings, 0 replies; 142+ messages in thread
From: Zwane Mwaikambo @ 2003-05-28 18:32 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Marc-Christian Petersen, Andrew Morton, kernel, matthias.mueller,
	manish, andrea, marcelo, linux-kernel

On Wed, 28 May 2003, Zwane Mwaikambo wrote:

> I can reproduce across spindles with cvs import'ing a kernel tree,
> make sure you're running X11 and try and do things in it, e.g. scrolling 
> windows, dragging etc.

Forgot to mention, 2x 400MHz/512MB RAM, read is from UW2/7200 write to 
UDMA33/5400 (w/ 2MB cache).

	Zwane
-- 
function.linuxpower.ca

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 12:53                                             ` Jens Axboe
                                                                 ` (3 preceding siblings ...)
  2003-05-28 18:19                                               ` Zwane Mwaikambo
@ 2003-05-28 18:47                                               ` Elladan
  2003-05-28 23:03                                                 ` Con Kolivas
  4 siblings, 1 reply; 142+ messages in thread
From: Elladan @ 2003-05-28 18:47 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Marc-Christian Petersen, Andrew Morton, kernel, matthias.mueller,
	manish, andrea, marcelo, linux-kernel

On Wed, May 28, 2003 at 02:53:12PM +0200, Jens Axboe wrote:
> On Wed, May 28 2003, Marc-Christian Petersen wrote:
> > On Wednesday 28 May 2003 13:27, Andrew Morton wrote:
> > 
> > Hi Akpm,
> > 
> > > > Does the attached one make sense?
> > > Nope.
> > nm.
> > 
> > > Guys, you're the ones who can reproduce this.  Please spend more time
> > > working out which chunk (or combination thereof) actually fixes the
> > > problem.  If indeed any of them do.
> > As I said, I will test it this evening. ATM I don't have time to
> > recompile and reboot. This evening I will test extensively, even on
> > SMP, SCSI, IDE and so on.
> 
> May I ask how you are reproducing the bad results? I'm trying in vain
> here...

It might be useful to check what video hardware and X servers people are
using here.  If the behavior is just mouse freeze-ups, the "silken mouse"
feature of XFree might have some effect, since it involves XFree binding
a signal to mouse device events.

-J

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28  7:16             ` Marc Wilson
@ 2003-05-28 19:53               ` David Ford
  0 siblings, 0 replies; 142+ messages in thread
From: David Ford @ 2003-05-28 19:53 UTC (permalink / raw)
  To: linux-kernel

Hmm, odd.  I see similar dead time in 2.5.x, it is annoying but I 
haven't had any time to track it down.  I'm currently on .69 and 
planning on putting .70 on this evening.

David

Marc Wilson wrote:

>On Tue, May 27, 2003 at 08:04:49PM +0200, Marc-Christian Petersen wrote:
>  
>
>>ALL: Anyone who has this kind of pauses/stops/mouse is dead/keyboard is dead/:
>>     speak _NOW_ please, doesn't matter who you are!
>>    
>>
>
>Ok, add my box to the list.  Variety of post 2.4.18 kernels, -ac's, -rc's,
>etc... all demonstrate it to one degree or another.
>
>Lately it's gotten REALLY bad.
>
>Currently I'm using 21-rc2-ac2 and it freezes for upwards of 15 sec
>regularly when I'm exercising the HD (three simultaneous brag threads
>downloading from various newsgroups).  The mouse moves, but other than
>that, X is entirely unresponsive.  An xterm with continually scrolling
>text, for example, will appear to stop scrolling until the kernel comes
>back.
>
>The HD light is on solid the whole time.
>
>21-rc2 does it too.  I haven't tried anything later than that yet. Well, I
>tried 20-ck7 and it ate my RAID0 due to a DMA-ism and I've not tested
>anything else since. :(
>



^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 18:47                                               ` Elladan
@ 2003-05-28 23:03                                                 ` Con Kolivas
  2003-05-29 13:09                                                   ` Andrea Arcangeli
  0 siblings, 1 reply; 142+ messages in thread
From: Con Kolivas @ 2003-05-28 23:03 UTC (permalink / raw)
  To: Elladan, Jens Axboe
  Cc: Marc-Christian Petersen, Andrew Morton, matthias.mueller, manish,
	andrea, marcelo, linux-kernel

On Thu, 29 May 2003 04:47, Elladan wrote:
> On Wed, May 28, 2003 at 02:53:12PM +0200, Jens Axboe wrote:
> > On Wed, May 28 2003, Marc-Christian Petersen wrote:
> > > On Wednesday 28 May 2003 13:27, Andrew Morton wrote:
> > >
> > > Hi Akpm,
> > >
> > > > > Does the attached one make sense?
> > > >
> > > > Nope.
> > >
> > > nm.
> > >
> > > > Guys, you're the ones who can reproduce this.  Please spend more time
> > > > working out which chunk (or combination thereof) actually fixes the
> > > > problem.  If indeed any of them do.
> > >
> > > As I said, I will test it this evening. ATM I don't have time to
> > > recompile and reboot. This evening I will test extensively, even on
> > > SMP, SCSI, IDE and so on.
> >
> > May I ask how you are reproducing the bad results? I'm trying in vain
> > here...
>
> It might be useful to check what video hardware and X servers people are
> using here.  If the behavior is just mouse freeze-ups, the "silken mouse"
> feature of XFree might have some effect, since it involves XFree binding
> a signal to mouse device events.

XFree 3.3.6, 4.2, 4.3
Drivers nvidia, nv, sis, sisfb, vesa, vesafb

are the versions and drivers on the machines where I've seen it happen so
far, i.e. without discrimination.

Con

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 15:39                                                         ` Jens Axboe
@ 2003-05-28 23:38                                                           ` Chris Mason
  0 siblings, 0 replies; 142+ messages in thread
From: Chris Mason @ 2003-05-28 23:38 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Carl-Daniel Hailfinger, Marc-Christian Petersen, Andrew Morton,
	kernel, matthias.mueller, manish, andrea, marcelo, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1578 bytes --]

On Wed, 2003-05-28 at 11:39, Jens Axboe wrote:

> Correction then, it doesn't appear to be starvation in the usual sense.
> But you are right, pulling some stats out of the situation would be
> nice. I still can't reproduce here.

Well, it's not pretty but it gets some numbers out there.  This patch
only calculates the time spent waiting in __get_request_wait; it isn't
interested in any other metrics.  Stats are per-queue and are reset when
you mount the FS. You get a printout either when you unmount the FS or
when you run elvtune /dev/xxx (no other args, just enough to trigger the
read ioctl).

The output looks like this (after a dbench 50 run on 2.4.21-rc6):

device 03:04: num_req 12248, total jiffies waited 26729
        417 forced to wait
        1 min wait, 432 max wait
        64 average wait
        314 < 100, 62 < 200, 20 < 300, 20 < 400, 1 < 500
        0 waits longer than 500 jiffies

It tells us there were 12248 total requests (merges don't count), and
that we spent 26,729 jiffies waiting in __get_request_wait.  We had to
wait 417 times, the minimum was 1 and the max was 432 jiffies.  The line
with the < signs is a simple way to get the deviations: 314 requests
waited < 100 jiffies, 62 requests waited at least 100 but less than 200
jiffies, etc.
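
To make the bucket arithmetic concrete, here is a minimal user-space
sketch of the same bookkeeping. It is not the kernel code: struct
wait_stats and the stats_*() helpers are made-up names for illustration,
and the HZ value of 100 is an assumption (check your own kernel config
before converting jiffies to wall-clock time).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NBUCKETS   5        /* <100, <200, <300, <400, <500 jiffies */
#define ASSUMED_HZ 100UL    /* assumption: timer ticks per second */

/* Hypothetical user-space mirror of the counters the patch adds. */
struct wait_stats {
	unsigned long max_wait;
	unsigned long min_wait;
	unsigned long total_wait;
	unsigned long num_wait;
	unsigned long deviation[NBUCKETS];
};

static void stats_reset(struct wait_stats *s)
{
	assert(s != NULL);
	memset(s, 0, sizeof(*s));
	s->min_wait = ~0UL;	/* sentinel: no non-zero wait seen yet */
}

/* Record one wait of time_waited jiffies, using the same rules as the patch. */
static void stats_record(struct wait_stats *s, unsigned long time_waited)
{
	if (time_waited > s->max_wait)
		s->max_wait = time_waited;
	if (time_waited && time_waited < s->min_wait)
		s->min_wait = time_waited;
	s->total_wait += time_waited;
	s->num_wait++;
	if (time_waited < 500)
		s->deviation[time_waited / 100]++;
}

/* Integer average, 0 when nothing has waited yet. */
static unsigned long stats_average(const struct wait_stats *s)
{
	return s->num_wait ? s->total_wait / s->num_wait : 0;
}

/* Waits of 500 jiffies or more are whatever the buckets did not count. */
static unsigned long stats_over_500(const struct wait_stats *s)
{
	unsigned long counted = 0;
	size_t i;

	for (i = 0; i < NBUCKETS; i++)
		counted += s->deviation[i];
	return s->num_wait - counted;
}

/* Convert jiffies to milliseconds under the ASSUMED_HZ assumption. */
static unsigned long jiffies_to_msecs_assumed(unsigned long j)
{
	return j * 1000UL / ASSUMED_HZ;
}
```

Checking this against the dbench numbers: 26729 / 417 = 64 with integer
division, and 314 + 62 + 20 + 20 + 1 = 417, so nothing waited 500 jiffies
or more. If HZ really is 100 on the test box, the 432 jiffy maximum is
roughly 4.3 seconds.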

People who see stalls on UP machines and have seen improvements by
playing with code in drivers/block/ll_rw_blk.c are encouraged to try
getting numbers with this patch applied.  It will make it easier to
figure things out.

I haven't tried Andrea's fix-pausing on top of this yet, any rejects
should be minor.

-chris


[-- Attachment #2: lat-stat-3.diff --]
[-- Type: text/plain, Size: 4770 bytes --]

===== drivers/block/blkpg.c 1.9 vs edited =====
--- 1.9/drivers/block/blkpg.c	Sat Mar 30 06:58:05 2002
+++ edited/drivers/block/blkpg.c	Wed May 28 19:33:16 2003
@@ -261,6 +261,7 @@
 			return blkpg_ioctl(dev, (struct blkpg_ioctl_arg *) arg);
 			
 		case BLKELVGET:
+			blk_print_stats(dev);
 			return blkelvget_ioctl(&blk_get_queue(dev)->elevator,
 					       (blkelv_ioctl_arg_t *) arg);
 		case BLKELVSET:
===== drivers/block/ll_rw_blk.c 1.44 vs edited =====
--- 1.44/drivers/block/ll_rw_blk.c	Mon Apr 14 06:53:03 2003
+++ edited/drivers/block/ll_rw_blk.c	Wed May 28 19:34:10 2003
@@ -442,6 +442,56 @@
 	spin_lock_init(&q->queue_lock);
 }
 
+void blk_print_stats(kdev_t dev) 
+{
+	request_queue_t *q;
+	unsigned long avg_wait;
+	unsigned long min_wait;
+	unsigned long high_wait;
+	unsigned long *d;
+
+	q = blk_get_queue(dev);
+	if (!q)
+		return;
+
+	min_wait = q->min_wait;
+	if (min_wait == ~0UL)
+		min_wait = 0;
+	if (q->num_wait) 
+		avg_wait = q->total_wait / q->num_wait;
+	else
+		avg_wait = 0;
+	printk("device %s: num_req %lu, total jiffies waited %lu\n", 
+	       kdevname(dev), q->num_req, q->total_wait);
+	printk("\t%lu forced to wait\n", q->num_wait);
+	printk("\t%lu min wait, %lu max wait\n", min_wait, q->max_wait);
+	printk("\t%lu average wait\n", avg_wait);
+	d = q->deviation;
+	printk("\t%lu < 100, %lu < 200, %lu < 300, %lu < 400, %lu < 500\n",
+               d[0], d[1], d[2], d[3], d[4]);
+	high_wait = d[0] + d[1] + d[2] + d[3] + d[4];
+	high_wait = q->num_wait - high_wait;
+	printk("\t%lu waits longer than 500 jiffies\n", high_wait);
+}
+
+static void reset_stats(request_queue_t *q)
+{
+	q->max_wait		= 0;
+	q->min_wait		= ~0UL;
+	q->total_wait		= 0;
+	q->num_req		= 0;
+	q->num_wait		= 0;
+	memset(q->deviation, 0, sizeof(q->deviation));
+}
+void blk_reset_stats(kdev_t dev) 
+{
+	request_queue_t *q;
+	q = blk_get_queue(dev);
+	if (!q)
+	    return;
+	printk("reset latency stats on device %s\n", kdevname(dev));
+	reset_stats(q);
+}
 static int __make_request(request_queue_t * q, int rw, struct buffer_head * bh);
 
 /**
@@ -491,6 +541,9 @@
 	q->plug_tq.routine	= &generic_unplug_device;
 	q->plug_tq.data		= q;
 	q->plugged        	= 0;
+
+	reset_stats(q);
+
 	/*
 	 * These booleans describe the queue properties.  We set the
 	 * default (and most common) values here.  Other drivers can
@@ -588,6 +641,8 @@
 static struct request *__get_request_wait(request_queue_t *q, int rw)
 {
 	register struct request *rq;
+	unsigned long wait_start = jiffies;
+	unsigned long time_waited;
 	DECLARE_WAITQUEUE(wait, current);
 
 	generic_unplug_device(q);
@@ -602,6 +657,18 @@
 	} while (rq == NULL);
 	remove_wait_queue(&q->wait_for_requests[rw], &wait);
 	current->state = TASK_RUNNING;
+
+	time_waited = jiffies - wait_start;
+	if (time_waited > q->max_wait)
+		q->max_wait = time_waited;
+	if (time_waited && time_waited < q->min_wait)
+		q->min_wait = time_waited;
+	q->total_wait += time_waited;
+	q->num_wait++;
+	if (time_waited < 500) {
+		q->deviation[time_waited/100]++;
+	}
+
 	return rq;
 }
 
@@ -1064,6 +1131,7 @@
 	req->rq_dev = bh->b_rdev;
 	req->start_time = jiffies;
 	req_new_io(req, 0, count);
+	q->num_req++;
 	blk_started_io(count);
 	add_request(q, req, insert_here);
 out:
===== fs/super.c 1.49 vs edited =====
--- 1.49/fs/super.c	Wed Dec 18 21:34:24 2002
+++ edited/fs/super.c	Wed May 28 19:29:26 2003
@@ -404,6 +404,7 @@
 	up_write(&s->s_umount);
 	put_super(s);
 	put_filesystem(fs);
+	blk_print_stats(dev);
 	if (bdev)
 		blkdev_put(bdev, BDEV_FS);
 	else
@@ -726,6 +727,7 @@
 	if (!fs_type->read_super(s, data, flags & MS_VERBOSE ? 1 : 0))
 		goto Einval;
 	s->s_flags |= MS_ACTIVE;
+	blk_reset_stats(dev);
 	path_release(&nd);
 	return s;
 
===== include/linux/blkdev.h 1.23 vs edited =====
--- 1.23/include/linux/blkdev.h	Fri Nov 29 17:03:01 2002
+++ edited/include/linux/blkdev.h	Wed May 28 19:27:18 2003
@@ -138,8 +138,17 @@
 	 * Tasks wait here for free read and write requests
 	 */
 	wait_queue_head_t	wait_for_requests[2];
+	unsigned long           max_wait;
+	unsigned long           min_wait;
+	unsigned long           total_wait;
+	unsigned long           num_req;
+	unsigned long           num_wait;
+	unsigned long           deviation[5];
 };
 
+void blk_reset_stats(kdev_t dev);
+void blk_print_stats(kdev_t dev);
+
 #define blk_queue_plugged(q)	(q)->plugged
 #define blk_fs_request(rq)	((rq)->cmd == READ || (rq)->cmd == WRITE)
 #define blk_queue_empty(q)	list_empty(&(q)->queue_head)
@@ -217,6 +226,7 @@
 extern void generic_make_request(int rw, struct buffer_head * bh);
 extern inline request_queue_t *blk_get_queue(kdev_t dev);
 extern void blkdev_release_request(struct request *);
+extern void blk_print_stats(kdev_t dev);
 
 /*
  * Access functions for manipulating queue properties

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 10:23                             ` Andrew Morton
                                                 ` (3 preceding siblings ...)
  2003-05-28 14:00                               ` Con Kolivas
@ 2003-05-29  1:32                               ` manish
  4 siblings, 0 replies; 142+ messages in thread
From: manish @ 2003-05-29  1:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthias Mueller, axboe, m.c.p, kernel, andrea, marcelo, linux-kernel

Andrew Morton wrote:

>Matthias Mueller <matthias.mueller@rz.uni-karlsruhe.de> wrote:
>
>>Works fine on my notebook. Good throughput and no mouse hangs anymore.
>>
>
>Interesting.
>
>Could you please work out which change caused it?  Go back to stock 2.4 and
>then apply this:
>
>
>diff -puN drivers/block/ll_rw_blk.c~1 drivers/block/ll_rw_blk.c
>--- 24/drivers/block/ll_rw_blk.c~1	2003-05-28 03:20:42.000000000 -0700
>+++ 24-akpm/drivers/block/ll_rw_blk.c	2003-05-28 03:20:57.000000000 -0700
>@@ -590,10 +590,10 @@ static struct request *__get_request_wai
> 	register struct request *rq;
> 	DECLARE_WAITQUEUE(wait, current);
> 
>-	generic_unplug_device(q);
> 	add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
> 	do {
> 		set_current_state(TASK_UNINTERRUPTIBLE);
>+		generic_unplug_device(q);
> 		if (q->rq[rw].count == 0)
> 			schedule();
> 		spin_lock_irq(&io_request_lock);
>
>
>
>then this:
>
>diff -puN drivers/block/ll_rw_blk.c~2 drivers/block/ll_rw_blk.c
>--- 24/drivers/block/ll_rw_blk.c~2	2003-05-28 03:21:03.000000000 -0700
>+++ 24-akpm/drivers/block/ll_rw_blk.c	2003-05-28 03:21:09.000000000 -0700
>@@ -590,7 +590,7 @@ static struct request *__get_request_wai
> 	register struct request *rq;
> 	DECLARE_WAITQUEUE(wait, current);
> 
>-	add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
>+	add_wait_queue(&q->wait_for_requests[rw], &wait);
> 	do {
> 		set_current_state(TASK_UNINTERRUPTIBLE);
> 		generic_unplug_device(q);
>
>
>Then this (totally unlikely, don't bother):
>
>diff -puN drivers/block/ll_rw_blk.c~3 drivers/block/ll_rw_blk.c
>--- 24/drivers/block/ll_rw_blk.c~3	2003-05-28 03:21:15.000000000 -0700
>+++ 24-akpm/drivers/block/ll_rw_blk.c	2003-05-28 03:21:39.000000000 -0700
>@@ -829,8 +829,7 @@ void blkdev_release_request(struct reque
> 	 */
> 	if (q) {
> 		list_add(&req->queue, &q->rq[rw].free);
>-		if (++q->rq[rw].count >= q->batch_requests &&
>-				waitqueue_active(&q->wait_for_requests[rw]))
>+		if (++q->rq[rw].count >= q->batch_requests)
> 			wake_up(&q->wait_for_requests[rw]);
> 	}
> }
>
>_
>
Hello !

I have applied patches 1+2+3 and they seem to have solved the
stalls/pauses that I was seeing with the stock kernel after long hours of
testing with bonnie.

Thanks much
Manish





^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 10:58               ` Alan Cox
@ 2003-05-29  8:34                 ` Ragnar Hojland Espinosa
  0 siblings, 0 replies; 142+ messages in thread
From: Ragnar Hojland Espinosa @ 2003-05-29  8:34 UTC (permalink / raw)
  To: Alan Cox
  Cc: Marc-Christian Petersen, manish, Carl-Daniel Hailfinger,
	Andrea Arcangeli, Marcelo Tosatti, Linux Kernel Mailing List,
	Christian Klose, William Lee Irwin III

On Wed, May 28, 2003 at 11:58:43AM +0100, Alan Cox wrote:
> On Mer, 2003-05-28 at 10:36, Ragnar Hojland Espinosa wrote:
> > Actually it just happens in the fixing stage when burning prebuilt iso
> > images from the hard disk (same IDE channel as the burner, 2.4.20)
> > Having a completely frozen machine under X was quite panic inducing ;)
> 
> If you have a disk and the burner on the same channel this is quite
> normal. The fixate is a single ATAPI command and like all ATA commands
> locks the bus to both master/slave for its duration of execution.
> 
> It's an IDE limitation

That's what you get for cheap hardware ;)  Anyway, I do have two
questions regarding pauses when fixating, in case someone knows..

- Why doesn't the freeze always happen? (I think it doesn't.)
- Why doesn't the complete computer freeze always happen?

-- 
Ragnar Hojland - Project Manager
Linalco "Especialistas Linux y en Software Libre"
http://www.linalco.com Tel: +34-91-5970074 Fax: +34-91-5970083

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 11:17                                       ` Marc-Christian Petersen
  2003-05-28 11:27                                         ` Andrew Morton
@ 2003-05-29 12:52                                         ` Andrea Arcangeli
  1 sibling, 0 replies; 142+ messages in thread
From: Andrea Arcangeli @ 2003-05-29 12:52 UTC (permalink / raw)
  To: Marc-Christian Petersen
  Cc: Andrew Morton, Jens Axboe, kernel, matthias.mueller, manish,
	marcelo, linux-kernel

On Wed, May 28, 2003 at 01:17:59PM +0200, Marc-Christian Petersen wrote:
> On Wednesday 28 May 2003 12:59, Andrew Morton wrote:
> 
> Hi Andrew,
> 
> > umm, I'd like confirmation of that.
> >
> > The waitqueue_active() test is wrong because of a missing barrier, but only
> > on SMP.  And if it does make a mistake it will surely correct itself when
> > the next request is put back. (That's why I left it there...)
> > More testing, please.
> Does the attached one make sense?

btw, I already fixed this race in my tree:

void blkdev_release_request(struct request *req)
{
	request_queue_t *q = req->q;

	req->rq_status = RQ_INACTIVE;
	req->q = NULL;

	/*
	 * Request may not have originated from ll_rw_blk. if not,
	 * assume it has free buffers and check waiters
	 */
	if (q) {
		list_add(&req->queue, &q->rq.free);
		if (++q->rq.count >= q->batch_requests && !blk_oversized_queue_batch(q)) {
			smp_mb();
			if (waitqueue_active(&q->wait_for_requests))
				wake_up(&q->wait_for_requests);


so if it was this race, my tree wouldn't exhibit it (and it would
trigger on smp only).
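
For readers who want to see why the barrier matters, here is a rough
user-space sketch of the pattern using C11 atomics. It is only an
illustration, not kernel code: struct toy_queue, toy_release_request()
and toy_must_sleep() are hypothetical names, has_waiter stands in for
waitqueue_active(), and the seq_cst fence plays the role of smp_mb().
The point is that the releaser stores the count and then loads the
waiter state, while the sleeper stores its waiter state and then loads
the count; without a full barrier on both sides, each can miss the
other's store.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the request queue; not a kernel structure. */
struct toy_queue {
	atomic_int  free_count;		/* free requests available */
	atomic_bool has_waiter;		/* stands in for waitqueue_active() */
	int batch_requests;
	int wakeups;			/* times we would have called wake_up() */
};

/*
 * Release side: publish the freed request, then look for sleepers.
 * The full fence keeps the load of has_waiter from being satisfied
 * before the store to free_count is visible to other CPUs.
 * Returns true if a wake-up would be issued.
 */
static bool toy_release_request(struct toy_queue *q)
{
	int count;

	assert(q != NULL);
	count = atomic_fetch_add_explicit(&q->free_count, 1,
					  memory_order_relaxed) + 1;
	if (count < q->batch_requests)
		return false;

	atomic_thread_fence(memory_order_seq_cst);	/* role of smp_mb() */
	if (atomic_load_explicit(&q->has_waiter, memory_order_relaxed)) {
		q->wakeups++;
		return true;
	}
	return false;
}

/*
 * Wait side: announce ourselves first, then re-check the count.
 * Returns true if the caller would have to sleep.
 */
static bool toy_must_sleep(struct toy_queue *q)
{
	atomic_store_explicit(&q->has_waiter, true, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&q->free_count,
				    memory_order_relaxed) == 0;
}
```

With fences on both sides, at least one of the two parties is guaranteed
to observe the other's store, so a sleeper cannot be left behind with
free requests available and nobody waking it.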

> 
> ciao, Marc
> 
> 




Andrea

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 23:03                                                 ` Con Kolivas
@ 2003-05-29 13:09                                                   ` Andrea Arcangeli
  2003-05-29 15:04                                                     ` Con Kolivas
  0 siblings, 1 reply; 142+ messages in thread
From: Andrea Arcangeli @ 2003-05-29 13:09 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Elladan, Jens Axboe, Marc-Christian Petersen, Andrew Morton,
	matthias.mueller, manish, marcelo, linux-kernel

On Thu, May 29, 2003 at 09:03:42AM +1000, Con Kolivas wrote:
> On Thu, 29 May 2003 04:47, Elladan wrote:
> > On Wed, May 28, 2003 at 02:53:12PM +0200, Jens Axboe wrote:
> > > On Wed, May 28 2003, Marc-Christian Petersen wrote:
> > > > On Wednesday 28 May 2003 13:27, Andrew Morton wrote:
> > > >
> > > > Hi Akpm,
> > > >
> > > > > > Does the attached one make sense?
> > > > >
> > > > > Nope.
> > > >
> > > > nm.
> > > >
> > > > > Guys, you're the ones who can reproduce this.  Please spend more time
> > > > > working out which chunk (or combination thereof) actually fixes the
> > > > > problem.  If indeed any of them do.
> > > >
> > > > As I said, I will test it this evening. ATM I don't have time to
> > > > recompile and reboot. This evening I will test extensively, even on
> > > > SMP, SCSI, IDE and so on.
> > >
> > > May I ask how you are reproducing the bad results? I'm trying in vain
> > > here...
> >
> > It might be useful to check what video hardware and X servers people are
> > using here.  If the behavior is just mouse freeze-ups, the "silken mouse"
> > feature of XFree might have some effect, since it involves XFree binding
> > a signal to mouse device events.
> 
> Xfree 3.3.6, 4.2,4.3
> Drivers nvidia, nv, sis, sisfb, vesa, vesafb
> 
> are the drivers on the machines where I've seen it happen so far - ie without 
> discrimination.

What about the window manager? Do you use focus-follows-mouse? Just
trying to find a pattern. For the record, KDE 3.1 + focus-follows-mouse
and X 4.3.0 here; I guess Jens uses the same software combination. The
mouse for me is always perfectly fluid no matter how fast and how long I
write, no matter if I don't touch the mouse for minutes; ALT+TAB as
well. I definitely can't reproduce the mouse stalls in any way (I'm
using cp /dev/zero .  on an ext3 fs in ordered mode). Hardware is 1G of
RAM, SMP, IDE, single spindle on primary master, Matrox GS450. I would
barely notice the background write flood if I only increased the
xmms buffer (in fact I thought it had stopped writing for a dozen seconds,
out of space, and instead it was still writing).  (Kernel is 2.4.21rc4aa1,
of course.)

Andrea

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 12:10                               ` Matthias Mueller
  2003-05-28 12:14                                 ` Matthias Mueller
@ 2003-05-29 13:19                                 ` Andrea Arcangeli
  2003-05-29 14:10                                   ` Matthias Mueller
  1 sibling, 1 reply; 142+ messages in thread
From: Andrea Arcangeli @ 2003-05-29 13:19 UTC (permalink / raw)
  To: Andrew Morton, axboe, m.c.p, kernel, manish, marcelo, linux-kernel

On Wed, May 28, 2003 at 02:10:40PM +0200, Matthias Mueller wrote:
> Tested all of them and some combinations:
> patch 1 alone: still mouse hangs
> patch 2 alone: still mouse hangs
> patch 3 alone: no hangs, but I get some zombie process (starting a lot of
>                xterms results in zombie xterms, not noticed with vanilla
>                and the other patches)
> patch 1+2: no mouse hangs
> patch 1+2+3: no mouse hangs, no zombies

I can't make sense of the zombie thing: how can you generate zombies at
all from xterms? That sounds like your userspace is terribly broken and
may have race conditions or whatever. In no way can those patches
generate or not generate zombies from xterms. I have never seen a zombie
xterm in my whole linux experience.

Either that, or the GUI is intentionally trying to reduce the number of
wait4 syscalls to the minimum by coalescing them, but that would be a
very bad design for GUI software, since you're not going to start an
xterm (or any other window) every millisecond, so it would be pointless
and confusing; I certainly wouldn't like it. (I don't love the wait4
coalescing even in servers, where it might be accepted as a
micro-optimization.)

It's impossible to trust the rest of the report while hearing about such
a fundamental breakage in the core of your GUI; the mouse hangs could
just be a userspace bug that triggers when some timing changes in the
presence of writes, or whatever. So please install a userspace that never
generates zombie xterms, and see if you can still reproduce.

Andrea

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 14:00                               ` Con Kolivas
@ 2003-05-29 13:24                                 ` Andrea Arcangeli
  2003-05-29 13:55                                   ` Willy Tarreau
  0 siblings, 1 reply; 142+ messages in thread
From: Andrea Arcangeli @ 2003-05-29 13:24 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Andrew Morton, Matthias Mueller, axboe, m.c.p, manish, marcelo,
	linux-kernel

On Thu, May 29, 2003 at 12:00:11AM +1000, Con Kolivas wrote:
> On Wed, 28 May 2003 20:23, Andrew Morton wrote:
> > Could you please work out which change caused it?  Go back to stock 2.4 and
> > then apply this:
> >
> [snip] 1
> 
> > then this:
> [snip] 2
> 
> > Then this (totally unlikely, don't bother):
> [snip] 3
> 
> Ok patch combination final score for me is as follows in the presence of a 
> large continuous write:
> 1 No change
> 2 No change
> 3 improvement++; minor hangs with reads
> 1+2 improvement+++; minor pauses with switching applications
> 1+2+3 improvement++++; no pauses

Then please try 1+2 alone too (i.e. w/o 3), because it's not obvious to me
that you're really hitting the race in 3 in a single write (I spotted and
fixed such a race in my tree some months ago, but thought it was only a
theoretical one, I mean on x86).

The improvement++ might just be an emotional feeling if you didn't
generate numbers to measure it (I know from my own experience that when
you try a new patch, everything seems faster until you really measure
it ;).

> Applications may start up slowly that's fine. The mouse cursor keeps spinning 
> and responding at all times though with 1+2+3 which it hasn't done in 2.4 for 

the mouse cursor always worked and still works fine for me (and I was
just running with 3 applied, just to get the theoretical bit correct).

> a year or so.
> 
> Con


Andrea

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-29 13:24                                 ` Andrea Arcangeli
@ 2003-05-29 13:55                                   ` Willy Tarreau
  2003-05-29 14:09                                     ` Con Kolivas
                                                       ` (3 more replies)
  0 siblings, 4 replies; 142+ messages in thread
From: Willy Tarreau @ 2003-05-29 13:55 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Con Kolivas, Andrew Morton, Matthias Mueller, axboe, m.c.p,
	manish, marcelo, linux-kernel

Hello !

I've done a few tests with -rc6 on my dev machine (dual xp 1.5G, 512 MB, scsi).
It's the *FIRST* time I have ever seen my mouse cursor hang (just a little bit
however, and totally acceptable) ! Usually, my kernels include the -aa VM and
lowlat patches, and I've never encountered this behaviour on this machine with
such a configuration. However, with the stock kernel, I admit that during the
2 minutes it takes to write the 2G file, I see the mouse stick two or three
times for about 1 second each, which is quite acceptable IMHO. Opening an xterm
may take 10s to get to the prompt (more annoying). The same goes for launching
'ps'.

I use a fairly simple window manager (ctwm), which doesn't access the disk once
it's launched. It never gets stuck during all the operation if I disable the
swap. If I enable the swap, it sometimes takes one or two seconds to draw a
menu. The swap is used up to about 4 MB.

I then tried -rc6 with ll_rw_blk from -rc5, and it's worse, even with swap
disabled. The hangs happen more often, but are about the same durations. So I
confirm that -rc6 is better here than -rc5.

I retried with rc4aa1, and everything went very smoothly again ; it takes at
most 1 second to get an xterm with the prompt ready, and ps responds
immediately. So I think that there are two things here:
  - those who experience very long hangs may use a heavy window manager
    which does continuous disk accesses (I mean it accesses the disk for any
    simple operation).
  - a hungry WM may also be swapped during such operations, rendering it
    totally unusable, particularly if the swap is on the same physical disk
    as the file being written to.

So, could the people who report long hangs retry with swap disabled ?
Can we limit the amount of memory consumed by the cache during such a write ?

Regards,
Willy


^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-29 13:55                                   ` Willy Tarreau
@ 2003-05-29 14:09                                     ` Con Kolivas
  2003-05-29 14:38                                     ` Matthias Mueller
                                                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 142+ messages in thread
From: Con Kolivas @ 2003-05-29 14:09 UTC (permalink / raw)
  To: Willy Tarreau, Andrea Arcangeli
  Cc: Andrew Morton, Matthias Mueller, axboe, m.c.p, manish, marcelo,
	linux-kernel

On Thu, 29 May 2003 23:55, Willy Tarreau wrote:
> Hello !
>
> I've done a few tests with -rc6 on my dev machine (dual xp 1.5G, 512 MB,
> scsi). It's the *FIRST* time I have ever seen my mouse cursor hang (just a
> little bit however, and totally acceptable) ! Usually, my kernel include
> -aa VM and lowlat patches, and I've never encountered this behaviour on
> this machine with such a configuration. However, with stock kernel, I admit
> that during the 2 minutes it takes to write the 2G file, I see the mouse
> stick two or three times during about 1 second, which is quite acceptable
> IMHO. Opening an xterm may take 10s to get to the prompt (more annoying).
> Same to launch 'ps'.
>
> I use a fairly simple window manager (ctwm), which doesn't access the disk
> once it's launched. It never gets stuck during all the operation if I
> disable the swap. If I enable the swap, it sometimes takes one or two
> seconds to draw a menu. The swap is used up to about 4 MB.
>
> I then tried -rc6 with ll_rw_blk from -rc5, and it's worse, even with swap
> disabled. The hangs happen more often, but are about the same durations. So
> I confirm that -rc6 is better here than -rc5.
>
> I retried with rc4aa1, and everything went very smooth again ; it takes at
> most 1 second to get an xterm with the prompt ready, and ps responds
> immediately. So I think that there are two things here:
>   - those who experience very long hangs may use a heavy window manager
>     which does continuous disk accesses (I mean it accesses the disk for
> any simple operation).
>   - a hungry WM may also be swapped during such operations, rendering it
>     totally unusable, particularly if the swap is on the same physical disk
>     as the file being written to.
>
> So, could the people who report long hangs retry with swap disabled ?
> Can we limit the amount of memory consummed by the cache during such a
> write ?

I still get hangs with rc6 with massive writeouts to swap. The problem was 
that I was getting hangs without writeouts to swap with 2.4.19pre1 
->2.4.21pre5. I didn't expect the patch backout to suddenly make writing to 
swap occur for free (although that would be nice).

Con

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-29 13:19                                 ` Andrea Arcangeli
@ 2003-05-29 14:10                                   ` Matthias Mueller
  2003-05-29 16:22                                     ` Andrea Arcangeli
  0 siblings, 1 reply; 142+ messages in thread
From: Matthias Mueller @ 2003-05-29 14:10 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, axboe, m.c.p, kernel, manish, marcelo, linux-kernel

On Thu, May 29, 2003 at 03:19:37PM +0200, Andrea Arcangeli wrote:
> On Wed, May 28, 2003 at 02:10:40PM +0200, Matthias Mueller wrote:
> > Tested all of them and some combinations:
> > patch 1 alone: still mouse hangs
> > patch 2 alone: still mouse hangs
> > patch 3 alone: no hangs, but I get some zombie process (starting a lot of
> >                xterms results in zombie xterms, not noticed with vanilla
> >                and the other patches)
> > patch 1+2: no mouse hangs
> > patch 1+2+3: no mouse hangs, no zombies
> 
> I can't find a sense in the zombie thing, how can you generate zombie at
> all from xterms? That sounds like your userspace is terribly broken and
> it may have race conditions or whatever. In no way those patches can
> generate or not-generate zombies from xterms. I never ever seen a zombie
> xterm in my whole linux experience.

I rechecked everything and noticed that it wasn't an xterm, but a wrapper
script that executed rxvt. I changed that to plain xterm and the zombies
were gone. So I think there was probably a bug in rxvt triggered there.
After that I redid the tests, with the same result (and no zombies).
I can feel no difference between 1+2 and 1+2+3.

Matthias
-- 
Matthias.Mueller@rz.uni-karlsruhe.de
Rechenzentrum Universitaet Karlsruhe
Abteilung Netze

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-29 13:55                                   ` Willy Tarreau
  2003-05-29 14:09                                     ` Con Kolivas
@ 2003-05-29 14:38                                     ` Matthias Mueller
  2003-05-29 16:10                                       ` Willy TARREAU
  2003-05-29 14:45                                     ` Marc-Christian Petersen
  2003-05-29 16:19                                     ` Andrea Arcangeli
  3 siblings, 1 reply; 142+ messages in thread
From: Matthias Mueller @ 2003-05-29 14:38 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Andrea Arcangeli, Con Kolivas, Andrew Morton, axboe, m.c.p,
	manish, marcelo, linux-kernel

On Thu, May 29, 2003 at 03:55:08PM +0200, Willy Tarreau wrote:
> Hello !
> 
> I've done a few tests with -rc6 on my dev machine (dual xp 1.5G, 512 MB, scsi).
> It's the *FIRST* time I have ever seen my mouse cursor hang (just a little bit
> however, and totally acceptable) ! Usually, my kernel include -aa VM and lowlat
> patches, and I've never encountered this behaviour on this machine with such a
> configuration. However, with stock kernel, I admit that during the 2 minutes it
> takes to write the 2G file, I see the mouse stick two or three times during
> about 1 second, which is quite acceptable IMHO. Opening an xterm may take 10s
> to get to the prompt (more annoying). Same to launch 'ps'.
> 
> I use a fairly simple window manager (ctwm), which doesn't access the disk once
> it's launched. It never gets stuck during all the operation if I disable the
> swap. If I enable the swap, it sometimes takes one or two seconds to draw a
> menu. The swap is used up to about 4 MB.
> 
> I then tried -rc6 with ll_rw_blk from -rc5, and it's worse, even with swap
> disabled. The hangs happen more often, but are about the same durations. So I
> confirm that -rc6 is better here than -rc5.
> 
> I retried with rc4aa1, and everything went very smooth again ; it takes at most
> 1 second to get an xterm with the prompt ready, and ps responds immediately. So
> I think that there are two things here:
>   - those who experience very long hangs may use a heavy window manager
>     which does continuous disk accesses (I mean it accesses the disk for any
>     simple operation).
>   - a hungry WM may also be swapped during such operations, rendering it
>     totally unusable, particularly if the swap is on the same physical disk
>     as the file being written to.
> 
> So, could the people who report long hangs retry with swap disabled ?
> Can we limit the amount of memory consummed by the cache during such a write ?

I run fluxbox, not a very heavy window manager, but I installed ctwm and
tried again with vanilla 2.4.20. With swap disabled, the short hangs (1s)
are gone, but the long mouse hangs (10s) are still there.

Matthias
-- 
Matthias.Mueller@rz.uni-karlsruhe.de
Rechenzentrum Universitaet Karlsruhe
Abteilung Netze

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-29 13:55                                   ` Willy Tarreau
  2003-05-29 14:09                                     ` Con Kolivas
  2003-05-29 14:38                                     ` Matthias Mueller
@ 2003-05-29 14:45                                     ` Marc-Christian Petersen
  2003-05-29 16:06                                       ` Willy TARREAU
  2003-05-29 16:19                                     ` Andrea Arcangeli
  3 siblings, 1 reply; 142+ messages in thread
From: Marc-Christian Petersen @ 2003-05-29 14:45 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Andrea Arcangeli, Con Kolivas, Andrew Morton, Matthias Mueller,
	axboe, marcelo, linux-kernel

On Thursday 29 May 2003 15:55, Willy Tarreau wrote:

Hi Willy,

> I've done a few tests with -rc6 on my dev machine (dual xp 1.5G, 512 MB,
> scsi). It's the *FIRST* time I have ever seen my mouse cursor hang (just a
> little bit however, and totally acceptable) ! Usually, my kernel include -aa
> VM and lowlat patches, and I've never encountered this behaviour on this
> machine with such a configuration. However, with stock kernel, I admit that
> during the 2 minutes it takes to write the 2G file, I see the mouse stick
> two or three times during about 1 second, which is quite acceptable IMHO.
WRONG. A mouse stick is not acceptable in _any_ way. Other OS' can handle this 
pretty well, and if Linux has problems with mouse sticks, this has to be 
fixed! Either in kernel space or in userspace (XFree86).

> Opening an xterm may take 10s to get to the prompt (more annoying). Same to
> launch 'ps'.
ACK!

> I retried with rc4aa1, and everything went very smooth again ; it takes at
> most 1 second to get an xterm with the prompt ready, and ps responds
> immediately. So I think that there are two things here:
>   - those who experience very long hangs may use a heavy window manager
>     which does continuous disk accesses (I mean it accesses the disk for
>     any simple operation).
>   - a hungry WM may also be swapped during such operations, rendering it
>     totally unusable, particularly if the swap is on the same physical disk
>     as the file being written to.
Well, sorry, but: no!

The pauses/stops occur no matter which window manager is used (KDE2/3,
WindowMaker, fvwm, gnome etc. foobar). The reason you are not seeing such
things with -aa is its Lowlatency Elevator, the lowlatency-fixes and some
important fixes which are not in the stock kernel yet.

I reproduced mouse sticks and the keyboard not accepting any input for
$seconds with _every_ kernel which is based on 2.4.19/2.4.20/2.4.21*. This
also includes -AA (well, not as braindead bad as mainline was before the
fix), but that is because of the lowlat elevator from Andrea. And as I said
yesterday (or 2 days ago? dunno), the lowlat elevator drops throughput
(Andrea, it _does_ ;).

It's not only mouse hangs (as I've reported tons of times): the keyboard
also does not accept any input (the delay varies between 1 and 15 seconds),
and this also applies if you don't run X at all.

Another fine example is:

- Start a screen session, not running X at all.
- Trash your HD with tons of writes.
- Press Ctrl-A-C for a new screen session.

You will see that it takes as long as starting up an xterm or calling ps,
as you wrote above. It does _not_ happen with 2.4.18!

> So, could the people who report long hangs retry with swap disabled ?
It's somewhat better but not acceptable.

> Can we limit the amount of memory consummed by the cache during such a
> write ?
I have been asking for such a feature for years ;)

Well, my summary: The bug has been there for over 15 months ( I won't
mention again that I reported the bug 15 months ago ;-) ... It _may_ take
some very obscure hardware problem to be able to reproduce this bug, but as
this thread shows, there are tons of people who can reproduce it with
different hardware, starting with 2.4.19-pre1.

ciao, Marc


^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-29 13:09                                                   ` Andrea Arcangeli
@ 2003-05-29 15:04                                                     ` Con Kolivas
  0 siblings, 0 replies; 142+ messages in thread
From: Con Kolivas @ 2003-05-29 15:04 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Elladan, Jens Axboe, Marc-Christian Petersen, Andrew Morton,
	matthias.mueller, manish, marcelo, linux-kernel

On Thu, 29 May 2003 23:09, Andrea Arcangeli wrote:
> On Thu, May 29, 2003 at 09:03:42AM +1000, Con Kolivas wrote:
> > On Thu, 29 May 2003 04:47, Elladan wrote:
> > > On Wed, May 28, 2003 at 02:53:12PM +0200, Jens Axboe wrote:
> > > > On Wed, May 28 2003, Marc-Christian Petersen wrote:
> > > > > On Wednesday 28 May 2003 13:27, Andrew Morton wrote:
> > > > >
> > > > > Hi Akpm,
> > > > >
> > > > > > > Does the attached one make sense?
> > > > > >
> > > > > > Nope.
> > > > >
> > > > > nm.
> > > > >
> > > > > > Guys, you're the ones who can reproduce this.  Please spend more
> > > > > > time working out which chunk (or combination thereof) actually
> > > > > > fixes the problem.  If indeed any of them do.
> > > > >
> > > > > As I said, I will test it this evening. ATM I don't have time to
> > > > > recompile and reboot. This evening I will test extensively, even on
> > > > > SMP, SCSI, IDE and so on.
> > > >
> > > > May I ask how you are reproducing the bad results? I'm trying in vain
> > > > here...
> > >
> > > It might be useful to check what video hardware and X servers people
> > > are using here.  If the behavior is just mouse freezups, the "silken
> > > mouse" feature of XFree might have some effect, since it involves XFree
> > > binding a signal to mouse device events.
> >
> > Xfree 3.3.6, 4.2,4.3
> > Drivers nvidia, nv, sis, sisfb, vesa, vesafb
> >
> > are the drivers on the machines where I've seen it happen so far - ie
> > without discrimination.
>
> what about the window manager? do you use focus follow mouse? Just
> trying to find a pattern. For the record KDE 3.1 + focus follow mouse
> and X 4.3.0 here, I guess Jens uses the same software combination. the
> mouse for me is always perfectly fluid no matter how fast and how long I
> write, no matter if I don't touch the mouse for minutes, ALT+TAB as
> well. I definitely can't reproduce in any way the mouse stalls (I'm
> using cp /dev/zero .  on a ext3 fs in ordered mode). hardware is 1G of
> ram smp IDE single spindle primary master matrox GS450. I almost
> couldn't notice the background write flood if I only would increase the
> xmms buffer (infact I thought it stopped writing for a dozen seconds out
> of space, and instead it was still writing).  (kernel is 2.4.21rc4aa1 of
> course)

Why should it matter what wm I use if the pauses were there before and not 
there now?

Con

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-29 14:45                                     ` Marc-Christian Petersen
@ 2003-05-29 16:06                                       ` Willy TARREAU
  2003-05-29 16:49                                         ` Andrea Arcangeli
  0 siblings, 1 reply; 142+ messages in thread
From: Willy TARREAU @ 2003-05-29 16:06 UTC (permalink / raw)
  To: Marc-Christian Petersen
  Cc: Willy Tarreau, Andrea Arcangeli, Con Kolivas, Andrew Morton,
	Matthias Mueller, axboe, marcelo, linux-kernel

On Thu, May 29, 2003 at 04:45:26PM +0200, Marc-Christian Petersen wrote:
> > machine with such a configuration. However, with stock kernel, I admit that
> > during the 2 minutes it takes to write the 2G file, I see the mouse stick
> > two or three times during about 1 second, which is quite acceptable IMHO.
> WRONG. A mouse stick is not acceptable in _any_ way. Other OS' can handle this 
Excuse me, Marc, I didn't mean it was normally acceptable, but quite acceptable
compared to what other people report.

> > Opening an xterm may take 10s to get to the prompt (more annoying). Same to
> > launch 'ps'.
> ACK!

The problem is specifically due to the cache, and only related to I/O, not
to other subsystems : if I start 50 xterms during that write, they take the
same time to respond as when there's only one. And they all respond
simultaneously, showing that they were all waiting for the files to be read
from the disk. But I cannot hang anything which doesn't need disk access.
Perhaps some people have their X server swapped out !

> The pauses/stops occurs no matter of what WindowManager (KDE2/3, WindowMaker, 
> fvwm, gnome etc. foobar). The point why you are not seeing such things with 
> -aa is his Lowlatency Elevator and lowlatency-fixes and some important fixes 
> which are not in stock kernel yet.

Do you agree that if the WM does no disk access and the mouse/keyboard freezes,
it means that X and/or the WM is being swapped ? And if that's not the case,
then it's related to something else, and I don't see how playing with
elevators can help!

> I reproduced mouse sticks and keyboard does not accept anything problems for 
> $seconds with _every_ kernel which is based on 2.4.19/2.4.20/2.4.21*. This 
> also includes -AA (well, not that braindead bad like mainline did before the 
> fix) but this is because of lowlat elevator from Andrea. And as I told 
> yesterday (or 2 days ago? dunno) lowlat elevator drops throughput (Andrea, it 
> _does_ ;).

I also confirm it does ; it takes 122 seconds to write this file in -rc6, and
142 seconds in -aa. But I don't think that desktop people would notice anyway.

> It's not just only mouse hangs (as I've reported tons of times) but also 
> keyboard does not accept any input (delay varies between 1 to 15 seconds) and 
> this also applies if you don't run X at all.

In fact, we don't know whether the keyboard doesn't accept input or whether
the process bound to the TTY is stuck ! If Alt-SysRq replies immediately, the
problem is on the user process side.

> - Start a screen session, not running X at all.
> - Trash your HD with tons of writes.
> - Press Ctrl-A-C for a new screen session.
> 
> You will see, it takes as long as, you wrote above, with starting up an Xterm 
> or calling ps. It does _not_ happen with 2.4.18!

I think that for this, screen will need to allocate some memory, which may take
some time under these conditions. I don't have screen right here, so I won't
try it, but I suspect that a program which uses pre-allocated memory will have
no problem at all.

> > So, could the people who report long hangs retry with swap disabled ?
> It's somewhat better but not acceptable.

OK

> > Can we limit the amount of memory consummed by the cache during such a
> > write ?
> I ask for such a feature since years ;)

Another solution would be to be able to specify that a process should use
pre-allocated memory.

Cheers,
Willy


^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-29 14:38                                     ` Matthias Mueller
@ 2003-05-29 16:10                                       ` Willy TARREAU
  0 siblings, 0 replies; 142+ messages in thread
From: Willy TARREAU @ 2003-05-29 16:10 UTC (permalink / raw)
  To: Willy Tarreau, Andrea Arcangeli, Con Kolivas, Andrew Morton,
	axboe, m.c.p, manish, marcelo, linux-kernel

On Thu, May 29, 2003 at 04:38:28PM +0200, Matthias Mueller wrote:
 
> I run fluxbox, not a very heavy window manager, but I installed ctwm and
> tried again with vanilla 2.4.20. If I disabled swap the short hangs (1s) are
> gone, but the long mouse hangs (10s) are still there.

Thanks for the test, but I find it really amazing that the mouse hangs while
it has nothing to do with any block device at all !

Cheers,
Willy


^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-29 13:55                                   ` Willy Tarreau
                                                       ` (2 preceding siblings ...)
  2003-05-29 14:45                                     ` Marc-Christian Petersen
@ 2003-05-29 16:19                                     ` Andrea Arcangeli
  3 siblings, 0 replies; 142+ messages in thread
From: Andrea Arcangeli @ 2003-05-29 16:19 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Con Kolivas, Andrew Morton, Matthias Mueller, axboe, m.c.p,
	manish, marcelo, linux-kernel

On Thu, May 29, 2003 at 03:55:08PM +0200, Willy Tarreau wrote:
> So, could the people who report long hangs retry with swap disabled ?
> Can we limit the amount of memory consummed by the cache during such a write ?

the vm should be (i.e. is supposed to be) smart enough not to unmap
anything significant just because of large writes. I'm sure it's not
swapping anything on my desktop during write flood (and certainly not
the mouse pointer) but checking with swapoff is certainly a good hint to
be sure.

Andrea

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-29 14:10                                   ` Matthias Mueller
@ 2003-05-29 16:22                                     ` Andrea Arcangeli
  0 siblings, 0 replies; 142+ messages in thread
From: Andrea Arcangeli @ 2003-05-29 16:22 UTC (permalink / raw)
  To: Andrew Morton, axboe, m.c.p, kernel, manish, marcelo, linux-kernel

On Thu, May 29, 2003 at 04:10:34PM +0200, Matthias Mueller wrote:
> On Thu, May 29, 2003 at 03:19:37PM +0200, Andrea Arcangeli wrote:
> > On Wed, May 28, 2003 at 02:10:40PM +0200, Matthias Mueller wrote:
> > > Tested all of them and some combinations:
> > > patch 1 alone: still mouse hangs
> > > patch 2 alone: still mouse hangs
> > > patch 3 alone: no hangs, but I get some zombie process (starting a lot of
> > >                xterms results in zombie xterms, not noticed with vanilla
> > >                and the other patches)
> > > patch 1+2: no mouse hangs
> > > patch 1+2+3: no mouse hangs, no zombies
> > 
> > I can't find a sense in the zombie thing, how can you generate zombie at
> > all from xterms? That sounds like your userspace is terribly broken and
> > it may have race conditions or whatever. In no way those patches can
> > generate or not-generate zombies from xterms. I never ever seen a zombie
> > xterm in my whole linux experience.
> 
> I rechecked everything an noticed, that it wasn't a xterm, but a wrapper
> script, that executed rxvt. I changed that to plain xterm and the zombies
> were gone. So I think there was probably a bug in rxvt triggered there.
> After that I redid the tests, with the same result (and no zombies).
> I can feel no difference between 1+2 or 1+2+3.

this sounds very sane now thanks for fixing the issues with the zombies!

it also makes sense to me that 1+2 is the same as 1+2+3, because I'd be
very surprised if the (purely smp) race condition in 3 made a whole lot
of difference for interactivity of a large write.

Andrea

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 11:31                                           ` Marc-Christian Petersen
  2003-05-28 12:53                                             ` Jens Axboe
@ 2003-05-29 16:23                                             ` Marc-Christian Petersen
  1 sibling, 0 replies; 142+ messages in thread
From: Marc-Christian Petersen @ 2003-05-29 16:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: axboe, kernel, matthias.mueller, andrea, marcelo, linux-kernel

On Wednesday 28 May 2003 13:31, Marc-Christian Petersen wrote:

Hi Andrew,

> > Guys, you're the ones who can reproduce this.  Please spend more time
> > working out which chunk (or combination thereof) actually fixes the
> > problem.  If indeed any of them do.
> As I said, I will test it this evening. ATM I don't have time to recompile
> and reboot. This evening I will test extensively, even on SMP, SCSI, IDE
> and so on.
Sorry, I didn't have any time yesterday.

So my 10¢ comment for the patches (like the ones in -rc6).

1. Braindead pausings are GONE (mouse is not sticky as w/o the patch).
2. Mouse sticks are still there rarely (short ones, max. 1 second)
    (If one can say 1 second is short ...).
3. all three patches are needed.

No side effects yet tho. Works with SCSI, IDE and SMP.

ciao, Marc


^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-29 16:06                                       ` Willy TARREAU
@ 2003-05-29 16:49                                         ` Andrea Arcangeli
  2003-05-29 17:46                                           ` Willy Tarreau
  0 siblings, 1 reply; 142+ messages in thread
From: Andrea Arcangeli @ 2003-05-29 16:49 UTC (permalink / raw)
  To: Willy TARREAU
  Cc: Marc-Christian Petersen, Con Kolivas, Andrew Morton,
	Matthias Mueller, axboe, marcelo, linux-kernel

On Thu, May 29, 2003 at 06:06:04PM +0200, Willy TARREAU wrote:
> I also confirm it does ; it takes 122 seconds to write this file in -rc6, and
> 142 seconds in -aa. But I don't think that desktop people would notice anyway.

btw, were you running parallel reads or writes at the same time? (i.e.
launching xterms or ps etc.. in parallel?) I ask because if xterm starts
up quickly, it is because the write workload is getting more seeks in
its way.

I'd be very interested if you can measure a bonnie performance change in
contiguous reads and writes on an otherwise completely idle machine; the
size of the queue has to be big enough to keep the I/O pipeline full
during contiguous writes at full speed.  Saying that throughput decreases
is not enough on its own to evaluate the reason for this drop.

you can also try with:

	echo 20 500 0 0 500 3000 30 10 >/proc/sys/vm/bdflush

just in case.

Andrea

^ permalink raw reply	[flat|nested] 142+ messages in thread

* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-29 16:49                                         ` Andrea Arcangeli
@ 2003-05-29 17:46                                           ` Willy Tarreau
  0 siblings, 0 replies; 142+ messages in thread
From: Willy Tarreau @ 2003-05-29 17:46 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Willy TARREAU, Marc-Christian Petersen, Con Kolivas,
	Andrew Morton, Matthias Mueller, axboe, marcelo, linux-kernel

On Thu, May 29, 2003 at 06:49:40PM +0200, Andrea Arcangeli wrote:
> btw, were you running parallel reads or writes at the same time? (i.e.
> launching xterms or ps etc.. in parallel?) I ask because if xterm
> startups quick is because the write workload is getting more seeks in
> its way.

Well, you're right, I was starting some xterms, but not that many, perhaps
ten or so during the whole test.
 
> I'd be very interested if you can measure a bonnie performance change in
> contiguous reads and writes on an otherwise completely idle machine, the
> size of the queue has to be big enough to keep the I/O pipeline full
> during contiguous writes at full speed.

for this I'll have to install bonnie, I won't do it right now.

> you can also try with:
> 
> 	echo 20 500 0 0 500 3000 30 10 >/proc/sys/vm/bdflush

Interestingly, it seems that the lower the last 2 values, the longer it takes.
I retried without opening any xterm, and it took 130 seconds. With the above
changes to bdflush, 135 s. With '80 50', 118s.
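
A minimal sketch of how this kind of comparison could be scripted so that
the old settings are always put back afterwards. The helper name
with_bdflush and the BDFLUSH variable are made up for illustration; the
proc path is the one from the echo command quoted above, and writing to it
needs root on a 2.4 kernel.

```shell
#!/bin/sh
# Sketch: run a command under temporary bdflush settings, then restore.
# BDFLUSH defaults to the 2.4 proc file; point it at a scratch file to
# dry-run the helper on a machine that has no such file.
BDFLUSH=${BDFLUSH:-/proc/sys/vm/bdflush}

# with_bdflush "VALUES" COMMAND [ARGS...]   (hypothetical helper)
with_bdflush() {
    new_values=$1
    shift
    old_values=$(cat "$BDFLUSH") || return 1    # remember current settings
    echo "$new_values" > "$BDFLUSH" || return 1 # apply the values under test
    "$@"                                        # run the workload
    status=$?
    echo "$old_values" > "$BDFLUSH"             # put the old settings back
    return $status
}
```

For example, with_bdflush "20 500 0 0 500 3000 30 10" some_write_test
would run a (hypothetical) some_write_test under the values suggested
above and then restore whatever was configured before.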

vmstat also shows me that the test begins at a sustained 16-19 MB/s write
throughput during about the first minute. Then it starts to show regular drops
to 5-7 MB/s for 6-7s, and goes back to full speed. Since this is on reiserfs,
I wonder if this activity is not related to the journal.
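
To make such drops easier to compare between runs, the per-second write
figures could be filtered with something like the sketch below. It assumes
the figures (in MB/s) have already been extracted, one per line; how to cut
them out of the vmstat output depends on the vmstat version and is left
out here. The helper name find_dips is hypothetical.

```shell
#!/bin/sh
# find_dips THRESHOLD   (hypothetical helper)
# Reads one throughput sample per line on stdin and reports every run of
# consecutive samples that fall below THRESHOLD.
find_dips() {
    awk -v limit="$1" '
        $1 < limit { if (!len) start = NR; len++; next }
        len        { printf "dip: samples %d-%d (%d samples)\n", start, NR - 1, len; len = 0 }
        END        { if (len) printf "dip: samples %d-%d (%d samples)\n", start, NR, len }
    '
}
```

With one sample per second, the length of each reported run is the
duration of the drop in seconds, which can then be lined up against
journal activity or against different bdflush settings.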

Moreover, the disk still writes during about 10s after the end of the dd, so
I don't think that measuring the time dd takes to complete is a good indicator
of anything (or I should try with a final sync).
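
One way the "final sync" variant could be timed, as a rough sketch. The dd
arguments (reading from /dev/zero, 1 MB blocks) and the helper name
timed_write are assumptions, not the command used in the test above;
date +%s only has one-second resolution, which should be enough for runs
that take minutes.

```shell
#!/bin/sh
# timed_write FILE COUNT_MB   (hypothetical helper)
# Writes COUNT_MB megabytes of zeroes to FILE and prints the elapsed time
# both when dd returns and after a final sync has completed.
timed_write() {
    file=$1
    count=$2
    start=$(date +%s)
    dd if=/dev/zero of="$file" bs=1024k count="$count" 2>/dev/null || return 1
    dd_done=$(date +%s)
    sync                              # wait for dirty buffers to be flushed
    sync_done=$(date +%s)
    echo "dd: $((dd_done - start)) s, dd+sync: $((sync_done - start)) s"
}
```

The difference between the two figures gives a rough idea of how much
data was still sitting in memory when dd returned.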

If I write simultaneously to two 1G files, wait a little while and then read from
them while still writing, I begin to wait a few seconds for xterm to give me
the prompt. But when writes finish and there are only concurrent reads,
everything gets smooth again, even though the disk emits a terrible seek sound !

Cheers,
Willy



* Re: 2.4.20: Proccess stuck in __lock_page ...
  2003-05-28 18:55           ` Thomas Tonino
@ 2003-06-02 10:43             ` Jens Axboe
  0 siblings, 0 replies; 142+ messages in thread
From: Jens Axboe @ 2003-06-02 10:43 UTC (permalink / raw)
  To: Thomas Tonino; +Cc: linux-kernel

On Wed, May 28 2003, Thomas Tonino wrote:
> Jens Axboe wrote:
> 
> >Lemme guess, hard drive on the same channel as the burner? There's
> >nothing we can do about that, hardware limitation.
> 
> hmmm... most drives these days have a command to read free buffer capacity, 
> so there is no need to send more than the drive can swallow - and no need 
> to tie up the channel.

As we cannot do more than 128kb in a single request (cdrecord uses 63kb
for writing), there's no problem there. I think you are misunderstanding
me. This is not a problem with the ide layer starving the hard drive by
continually sending writes to the cd-r, it's a problem with not being
able to preempt service for a single command duration.

> >The reason you see it
> >during fixation is because that's one long single command, and we cannot
> >preempt the channel and service requests while that is going on.
> 
> But this may be the exception that breaks the rule. Bah.

No, that is the entire problem.

-- 
Jens Axboe



* Re: 2.4.20: Proccess stuck in __lock_page ...
       [not found]         ` <20030528095014$7b21@gated-at.bofh.it>
@ 2003-05-28 18:55           ` Thomas Tonino
  2003-06-02 10:43             ` Jens Axboe
  0 siblings, 1 reply; 142+ messages in thread
From: Thomas Tonino @ 2003-05-28 18:55 UTC (permalink / raw)
  To: linux-kernel

Jens Axboe wrote:

> Lemme guess, hard drive on the same channel as the burner? There's
> nothing we can do about that, hardware limitation.

hmmm... most drives these days have a command to read free buffer capacity, so 
there is no need to send more than the drive can swallow - and no need to tie up 
the channel.

> The reason you see it
> during fixation is because that's one long single command, and we cannot
> preempt the channel and service requests while that is going on.

But this may be the exception that breaks the rule. Bah.


Thomas



end of thread

Thread overview: 142+ messages
2003-05-27  3:41 2.4.20: Proccess stuck in __lock_page manish
2003-05-27  4:03 ` Marcelo Tosatti
2003-05-27  4:25   ` manish
2003-05-27  4:59     ` Marcelo Tosatti
2003-05-27 15:29       ` manish
2003-05-27 16:59         ` Marcelo Tosatti
2003-05-27  4:31   ` manish
2003-05-27 14:14   ` Carl-Daniel Hailfinger
2003-05-27 14:28     ` William Lee Irwin III
2003-05-27 17:27     ` Marcelo Tosatti
2003-05-27 17:36       ` Marc-Christian Petersen
2003-05-27 17:47         ` Marcelo Tosatti
2003-05-27 17:52           ` Marc-Christian Petersen
2003-05-27 17:57             ` Marcelo Tosatti
2003-05-27 18:08               ` Marc-Christian Petersen
2003-05-27 18:25                 ` Andrea Arcangeli
2003-05-27 18:33                   ` Marcelo Tosatti
2003-05-27 18:39                     ` Marc-Christian Petersen
2003-05-27 19:00                       ` manish
2003-05-27 19:01                         ` Marcelo Tosatti
2003-05-27 19:09                           ` manish
2003-05-27 19:12                           ` manish
2003-05-27 19:28                             ` Marcelo Tosatti
2003-05-27 19:34                               ` manish
2003-05-27 20:20                                 ` Andrea Arcangeli
2003-05-27 20:25                                   ` Marc-Christian Petersen
2003-05-27 20:42                                     ` manish
2003-05-27 20:47                                       ` Andrea Arcangeli
2003-05-27 20:50                                         ` manish
2003-05-27 21:05                                           ` Andrea Arcangeli
2003-05-27 20:03                     ` Andrea Arcangeli
2003-05-27 20:08                       ` Marcelo Tosatti
2003-05-27 20:25                         ` Andrea Arcangeli
2003-05-27 22:18                           ` Andrew Morton
2003-05-27 22:38                             ` Andrea Arcangeli
2003-05-27 22:40                               ` Andrew Morton
2003-05-27 22:58                                 ` Andrea Arcangeli
2003-05-27 20:08                     ` Chris Mason
2003-05-27 18:35                   ` Marc-Christian Petersen
2003-05-27 20:10                     ` Andrea Arcangeli
2003-05-27 20:24                       ` Marc-Christian Petersen
2003-05-27 20:45                         ` Andrea Arcangeli
2003-05-27 20:53                           ` Marc-Christian Petersen
2003-05-27 21:00                             ` Jens Axboe
2003-05-27 21:11                               ` Marc-Christian Petersen
2003-05-27 21:19                                 ` Jens Axboe
2003-05-27 20:55                         ` Jens Axboe
2003-05-27 21:05                           ` William Lee Irwin III
2003-05-27 21:18                             ` Jens Axboe
2003-05-27 21:33                             ` Andrea Arcangeli
2003-05-27 18:09               ` manish
2003-05-27 17:53           ` manish
2003-05-27 18:01             ` Marc-Christian Petersen
2003-05-27 18:16               ` Marcelo Tosatti
2003-05-27 18:25                 ` Marc-Christian Petersen
2003-05-27 18:12           ` Matthias Mueller
2003-05-27 17:36       ` William Lee Irwin III
2003-05-27 17:38       ` Carl-Daniel Hailfinger
2003-05-27 17:50         ` manish
2003-05-27 18:04           ` Marc-Christian Petersen
2003-05-27 23:06             ` Georg Nikodym
2003-05-27 23:26               ` Christopher S. Aker
2003-05-28  5:33             ` Con Kolivas
2003-05-28  6:04               ` Jens Axboe
2003-05-28  7:13                 ` Con Kolivas
2003-05-28  7:13                   ` Jens Axboe
2003-05-28  7:32                     ` Marc-Christian Petersen
2003-05-28  7:35                       ` Jens Axboe
2003-05-28  7:51                         ` Andrew Morton
2003-05-28  8:30                           ` Jens Axboe
2003-05-28  8:43                             ` Marc-Christian Petersen
2003-05-28  8:40                           ` Marc-Christian Petersen
2003-05-28 10:13                           ` Matthias Mueller
2003-05-28 10:18                             ` Jens Axboe
2003-05-28 10:23                             ` Andrew Morton
2003-05-28 10:25                               ` Jens Axboe
2003-05-28 10:48                                 ` Con Kolivas
2003-05-28 10:50                                   ` Jens Axboe
2003-05-28 10:59                                     ` Andrew Morton
2003-05-28 11:17                                       ` Marc-Christian Petersen
2003-05-28 11:27                                         ` Andrew Morton
2003-05-28 11:31                                           ` Marc-Christian Petersen
2003-05-28 12:53                                             ` Jens Axboe
2003-05-28 12:58                                               ` Matthias Mueller
2003-05-28 13:07                                               ` Carl-Daniel Hailfinger
2003-05-28 13:08                                                 ` Jens Axboe
2003-05-28 13:16                                                   ` Matthias Mueller
2003-05-28 13:21                                                   ` Con Kolivas
2003-05-28 13:30                                                     ` Carl-Daniel Hailfinger
2003-05-28 13:33                                                       ` Con Kolivas
2003-05-28 13:27                                                   ` Stefan Foerster
2003-05-28 13:37                                                     ` Stefan Foerster
2003-05-28 14:28                                                   ` Chris Mason
2003-05-28 14:33                                                     ` Jens Axboe
2003-05-28 14:58                                                       ` Chris Mason
2003-05-28 15:39                                                         ` Jens Axboe
2003-05-28 23:38                                                           ` Chris Mason
2003-05-28 13:25                                               ` Stefan Foerster
2003-05-28 18:19                                               ` Zwane Mwaikambo
2003-05-28 18:32                                                 ` Zwane Mwaikambo
2003-05-28 18:47                                               ` Elladan
2003-05-28 23:03                                                 ` Con Kolivas
2003-05-29 13:09                                                   ` Andrea Arcangeli
2003-05-29 15:04                                                     ` Con Kolivas
2003-05-29 16:23                                             ` Marc-Christian Petersen
2003-05-28 11:41                                           ` Con Kolivas
2003-05-29 12:52                                         ` Andrea Arcangeli
2003-05-28 11:03                                   ` Nick Piggin
2003-05-28 10:29                               ` Con Kolivas
2003-05-28 10:29                                 ` Marc-Christian Petersen
2003-05-28 12:10                               ` Matthias Mueller
2003-05-28 12:14                                 ` Matthias Mueller
2003-05-28 12:21                                   ` Carl-Daniel Hailfinger
2003-05-28 12:23                                     ` Matthias Mueller
2003-05-28 12:28                                       ` Carl-Daniel Hailfinger
2003-05-28 12:38                                         ` Matthias Mueller
2003-05-29 13:19                                 ` Andrea Arcangeli
2003-05-29 14:10                                   ` Matthias Mueller
2003-05-29 16:22                                     ` Andrea Arcangeli
2003-05-28 14:00                               ` Con Kolivas
2003-05-29 13:24                                 ` Andrea Arcangeli
2003-05-29 13:55                                   ` Willy Tarreau
2003-05-29 14:09                                     ` Con Kolivas
2003-05-29 14:38                                     ` Matthias Mueller
2003-05-29 16:10                                       ` Willy TARREAU
2003-05-29 14:45                                     ` Marc-Christian Petersen
2003-05-29 16:06                                       ` Willy TARREAU
2003-05-29 16:49                                         ` Andrea Arcangeli
2003-05-29 17:46                                           ` Willy Tarreau
2003-05-29 16:19                                     ` Andrea Arcangeli
2003-05-29  1:32                               ` manish
2003-05-28 10:24                             ` Marc-Christian Petersen
2003-05-28  7:16             ` Marc Wilson
2003-05-28 19:53               ` David Ford
2003-05-28  9:36             ` Ragnar Hojland Espinosa
2003-05-28  9:45               ` Jens Axboe
2003-05-28  9:53               ` Marc-Christian Petersen
2003-05-28 10:01                 ` Jens Axboe
2003-05-28 10:58               ` Alan Cox
2003-05-29  8:34                 ` Ragnar Hojland Espinosa
     [not found] <20030527035006$5339@gated-at.bofh.it>
     [not found] ` <20030527175008$3573@gated-at.bofh.it>
     [not found]   ` <20030527180016$418c@gated-at.bofh.it>
     [not found]     ` <20030527182011$4acb@gated-at.bofh.it>
     [not found]       ` <20030528094008$1500@gated-at.bofh.it>
     [not found]         ` <20030528095014$7b21@gated-at.bofh.it>
2003-05-28 18:55           ` Thomas Tonino
2003-06-02 10:43             ` Jens Axboe
