* [Cluster-devel] Problem in ops_address.c :: gfs_writepage ?
@ 2007-02-19 15:00 Mathieu Avila
  2007-02-19 22:00 ` Wendy Cheng
  0 siblings, 1 reply; 4+ messages in thread
From: Mathieu Avila @ 2007-02-19 15:00 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hello all,

I need advice about a bug in GFS that may also affect other filesystems
(like ext3).

The problem:
It is possible for "ops_address.c :: gfs_writepage" not to write the
page it is asked to write, because the transaction lock is taken. This
is valid, and in such a case it should return an error code so that the
kernel knows the page could not be written. But this function does not
return an error code; instead it returns 0.
I've looked at ext3, and it does the same. This is valid and there is
no corruption, as the page is "redirtied" so that it will be flushed
later. Returning an error code is not a solution, because it is
possible that no page is flushable, and 'sync' would also misinterpret
the error code as an I/O error. There may be other implications, too.
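
For reference, the behaviour I'm describing corresponds roughly to the
following (a sketch pieced together from the code and the patch below,
not an exact copy; the get_block callback name is only illustrative):

  static int gfs_writepage(struct page *page, struct writeback_control *wbc)
  {
          /* ... glock/journal sanity checks ... */

          if (get_transaction) {
                  /* We are already inside a transaction, so we must not
                   * start I/O here: put the page back on the dirty list
                   * and report success to the VM. */
                  redirty_page_for_writepage(wbc, page);
                  unlock_page(page);
                  return 0;   /* "success", but the page was not written */
          }

          /* normal path: hand the page to the generic block layer */
          return block_write_full_page(page, gfs_get_block, wbc);
  }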

The problem shows up when the filesystem is under heavy stress.
I've written a test program that opens one file, writes 1 GB to it
(in any case, more than the system's total memory), then opens a 2nd
file and writes as much data as it can.
When the number of dirty pages goes beyond /proc/sys/vm/dirty_ratio,
some pages must be flushed synchronously, so that the writer is blocked
and the system does not starve for free clean pages.
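
For illustration, the test program amounts to something like the
following (hypothetical sketch, not the exact program we used; the
mount point and file names are made up):

  /* Fill the page cache with one large file, then keep writing a second
   * file so that /proc/sys/vm/dirty_ratio is the only remaining brake. */
  #include <fcntl.h>
  #include <string.h>
  #include <unistd.h>

  #define CHUNK (1 << 20)                 /* 1 MB per write() */

  static void fill(int fd, long long bytes)
  {
          static char buf[CHUNK];

          memset(buf, 'x', sizeof(buf));
          while (bytes > 0) {
                  ssize_t n = write(fd, buf, sizeof(buf));
                  if (n <= 0)
                          break;          /* error or no space left */
                  bytes -= n;
          }
  }

  int main(void)
  {
          int fd1 = open("/mnt/gfs/file1", O_CREAT | O_WRONLY | O_TRUNC, 0644);
          int fd2;

          fill(fd1, 1LL << 30);           /* 1 GB, more than the node's RAM */

          fd2 = open("/mnt/gfs/file2", O_CREAT | O_WRONLY | O_TRUNC, 0644);
          for (;;)
                  fill(fd2, 1LL << 30);   /* write as much data as possible */
          return 0;
  }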

But precisely in that situation, there are many times when
gfs_writepage cannot do its job because of the transaction lock.
The pages we need to flush cannot be flushed, yet they are accounted
for as if they had been flushed. The process is therefore allowed to
keep writing, and more and more pages are marked dirty.

Fortunately, in most cases pdflush comes to the rescue: it starts
flushing pages so that the program gets free pages to dirty again.
Sometimes the system stays stable, if the pages submitted by pdflush
are written quickly enough by the block device. But if we make pdflush
less aggressive and raise dirty_ratio, it is possible to drive a
machine into the OOM killer and then have it block completely. In
practice we have hit this with the test program "bonnie++", whose
purpose is to test FS performance. Bonnie++ creates multiple 1 GB
files when it is asked to run long multi-GB writes. There is no
problem with 5 GB (5 files of 1 GB), but many machines in the cluster
are OOM-killed with 10 GB bonnies.
Setting more aggressive parameters for dirty_ratio and pdflush is not
a complete solution (although the problem then happens much later or
not at all), and it kills performance.

Proposed solution:

Keep a counter in gfs_inode of the pages that were not written in
gfs_writepage, and at the end of do_do_write_buf call
"balance_dirty_pages_ratelimited(file->f_mapping);" that many times.
The counter may be shared by multiple processes, but we are sure that
there is no transaction at that moment, so the pages can be flushed if
"balance_dirty_pages_ratelimited" decides it must reclaim dirty pages.
Otherwise performance is not affected.

About ext3:

On a normal local disk with ext3 the problem does not happen, although
it should be affected as well. I suppose that is because ext3 does not
hold the transaction lock as long, and a local disk is fast enough
that pdflush keeps the dirty page count low.

Patch below. Beware, it's quite ugly. Credits:
Oliver.Cozette at seanodes.com and Mathieu.Avila at seanodes.com

--
Mathieu Avila 

Index: cluster/gfs-kernel/src/gfs/ops_file.c
===================================================================
--- cluster/gfs-kernel/src/gfs/ops_file.c~       
+++ cluster/gfs-kernel/src/gfs/ops_file.c       
@@ -25,7 +25,7 @@
 #include <linux/mm.h>
 #include <asm/uaccess.h>
 #include <linux/gfs_ioctl.h>
-
+#include <linux/writeback.h>
 #include "gfs.h"
 #include "bmap.h"
 #include "dio.h"
@@ -824,6 +824,22 @@
                        goto fail_ipres;
        }

+       /* Throttle this writer for each page that gfs_writepage could
+        * not flush because we were inside a transaction. */
+       do {
+               int pages_not_written = atomic_read(&ip->pages_not_written);
+               int result;
+
+               if (pages_not_written <= 0)
+                       break;
+
+               result = cmpxchg(&ip->pages_not_written.counter,
+                                pages_not_written, pages_not_written - 1);
+               if (result == pages_not_written)
+                       balance_dirty_pages_ratelimited(file->f_mapping);
+       } while (1);
+

Index: cluster/gfs-kernel/src/gfs/incore.h
===================================================================
--- cluster/gfs-kernel/src/gfs/incore.h~
+++ cluster/gfs-kernel/src/gfs/incore.h
@@ -617,6 +617,8 @@

        unsigned int i_greedy; /* The amount of time to be greedy */
        unsigned long i_last_pfault; /* The time of the last page fault */
+
+       atomic_t pages_not_written; /* Due to journal locking, how many pages were not written when "gfs_writepage" was called */
 };

 /*

Index: cluster/gfs-kernel/src/gfs/ops_address.c
===================================================================
--- cluster/gfs-kernel/src/gfs/ops_address.c~
+++ cluster/gfs-kernel/src/gfs/ops_address.c
@@ -179,6 +183,7 @@
                return -EIO;
        }
        if (get_transaction) {
+               atomic_inc(&ip->pages_not_written);
                redirty_page_for_writepage(wbc, page);
                unlock_page(page);
                return 0;




* [Cluster-devel] Problem in ops_address.c :: gfs_writepage ?
  2007-02-19 15:00 [Cluster-devel] Problem in ops_address.c :: gfs_writepage ? Mathieu Avila
@ 2007-02-19 22:00 ` Wendy Cheng
  2007-02-20 10:59   ` Mathieu Avila
  0 siblings, 1 reply; 4+ messages in thread
From: Wendy Cheng @ 2007-02-19 22:00 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Mathieu Avila wrote:
> Hello all,
>
> I need advice about a bug in GFS that may also affect other filesystems
> (like ext3).
>
> The problem:
> It is possible for "ops_address.c :: gfs_writepage" not to write the
> page it is asked to write, because the transaction lock is taken. This
> is valid, and in such a case it should return an error code so that the
> kernel knows the page could not be written. But this function does not
> return an error code; instead it returns 0.
> I've looked at ext3, and it does the same. This is valid and there is
> no corruption, as the page is "redirtied" so that it will be flushed
> later. Returning an error code is not a solution, because it is
> possible that no page is flushable, and 'sync' would also misinterpret
> the error code as an I/O error. There may be other implications, too.
>
> The problem shows up when the filesystem is under heavy stress.
> I've written a test program that opens one file, writes 1 GB to it
> (in any case, more than the system's total memory), then opens a 2nd
> file and writes as much data as it can.
> When the number of dirty pages goes beyond /proc/sys/vm/dirty_ratio,
> some pages must be flushed synchronously, so that the writer is blocked
> and the system does not starve for free clean pages.
>
> But precisely in that situation, there are many times when
> gfs_writepage cannot do its job because of the transaction lock.
>   
Yes, we did have this problem in the past with direct IO and SYNC flag.
> [snip]
> we have hit this with the test program "bonnie++", whose purpose is
> to test FS performance. Bonnie++ creates multiple 1 GB files when it
> is asked to run long multi-GB writes. There is no problem with 5 GB
> (5 files of 1 GB), but many machines in the cluster are OOM-killed
> with 10 GB bonnies.
>   
I would like to know more about your experiments. So these bonnie++
runs are done on each cluster node with independent file sets?

> Setting more aggressive parameters for dirty_ratio and pdflush is not
> a complete solution (although the problem then happens much later or
> not at all), and it kills performance.
>
> Proposed solution:
>
> Keep a counter in gfs_inode of the pages that were not written in
> gfs_writepage, and at the end of do_do_write_buf call
> "balance_dirty_pages_ratelimited(file->f_mapping);" that many times.
> The counter may be shared by multiple processes, but we are sure that
> there is no transaction at that moment, so the pages can be flushed if
> "balance_dirty_pages_ratelimited" decides it must reclaim dirty pages.
> Otherwise performance is not affected.
>   
In general, this approach looks OK if we do have this flushing problem.
However, the GFS flush code has been embedded in the glock code, so I
would think it would be better to do this within the glock code. My
cluster nodes happen to be down at the moment. Will look into this when
the cluster is re-assembled.

-- Wendy




* [Cluster-devel] Problem in ops_address.c :: gfs_writepage ?
  2007-02-19 22:00 ` Wendy Cheng
@ 2007-02-20 10:59   ` Mathieu Avila
  2007-02-21  5:04     ` Wendy Cheng
  0 siblings, 1 reply; 4+ messages in thread
From: Mathieu Avila @ 2007-02-20 10:59 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Mon, 19 Feb 2007 17:00:10 -0500,
Wendy Cheng <wcheng@redhat.com> wrote:

> Mathieu Avila wrote:
> > Hello all,
> > But precisely in that situation, there are many times when
> > gfs_writepage cannot do its job because of the transaction lock.

> Yes, we did have this problem in the past with direct IO and SYNC
> flag.

I understand that in that case, data are written with the
get_transaction lock taken, so gfs_writepage never writes the pages,
and you go beyond the dirty_ratio limit if you write too many pages.
How did you get rid of it?

> > [snip]
> > we have hit this with the test program "bonnie++", whose purpose
> > is to test FS performance. Bonnie++ creates multiple 1 GB files
> > when it is asked to run long multi-GB writes. There is no problem
> > with 5 GB (5 files of 1 GB), but many machines in the cluster are
> > OOM-killed with 10 GB bonnies.
> >   

> I would like to know more about your experiments. So these bonnie++
> runs are done on each cluster node with independent file sets?
> 

Exactly. I run "bonnie++ -s 10G" on each node, in different directories
of the same GFS file system. To make the problem happen reliably and
more quickly, I effectively disable pdflush by tuning it:
echo 300000 > /proc/sys/vm/dirty_expire_centisecs
echo 50000 > /proc/sys/vm/dirty_writeback_centisecs
..., so that only /proc/sys/vm/dirty_ratio comes into play. Not doing
this makes the problem harder to reproduce (but it is still
reproducible).

The problem happens only when the writeback of the dirty pages of the
2nd file starts, once it is done with the 1st file. We still have to
determine why. So you can get the same problem with a program that:
- opens 2 files,
- writes 1 GB to the 1st one,
- writes indefinitely to the second one.

We use a particular block device that doesn't read/write at the same
speed on all nodes. I don't know if this is relevant.

> > Keep a counter in gfs_inode of the pages that were not written in
> > gfs_writepage, and at the end of do_do_write_buf call
> > "balance_dirty_pages_ratelimited(file->f_mapping);" that many
> > times. The counter may be shared by multiple processes, but we are
> > sure that there is no transaction at that moment, so the pages can
> > be flushed if "balance_dirty_pages_ratelimited" decides it must
> > reclaim dirty pages. Otherwise performance is not affected.

> In general, this approach looks OK if we do have this flushing
> problem. However, the GFS flush code has been embedded in the glock
> code, so I would think it would be better to do this within the
> glock code.

There are some points we do not understand:

- I understand that the flushing is done in the glock code (when a lock
is demoted or taken by another node, isn't it?). Why isn't it possible
to let the kernel decide which pages to flush when it needs to? In this
particular case, it is not a good idea to flush the pages only when the
lock is lost; the kernel itself needs to flush pages.

- Why doesn't gfs_writepage return an error when the page cannot be
flushed?

- The balance_dirty_pages_ratelimited function is called inside
get_transaction/set_transaction (i.e. between gfs_trans_begin and
gfs_trans_end), therefore gfs_writepage should never work. Do you know
if there is some kind of asynchronous (kiocb?) write path so that
gfs_writepage is called later in the same process? If not, what could
make gfs_writepage happen sometimes inside a transaction and sometimes
not?

- Ext3 should be affected as well, but it isn't. Is that because its
transaction lock is held for a much shorter time, so that dirty pages
that are not flushed while the lock is held will be successfully
flushed later?

- Some other filesystems in the kernel, NTFS and ReiserFS, make
explicit calls to balance_dirty_pages_ratelimited:
http://lxr.linux.no/source/fs/ntfs/file.c?v=2.6.18;a=x86_64#L284
http://lxr.linux.no/source/fs/ntfs/file.c?v=2.6.18;a=x86_64#L2101
http://lxr.linux.no/source/fs/reiserfs/file.c?v=2.6.11;a=x86_64#L1351
But they also redefine some generic kernel functions. Maybe they have a
strong reason to do so?
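
As far as I can tell, the pattern they use in their own buffered-write
paths is roughly the following (a generic sketch, not copied from
either driver, with the surrounding write loop omitted):

          /* after a page has been dirtied and released by the
           * filesystem's own write loop ... */
          set_page_dirty(page);
          unlock_page(page);
          page_cache_release(page);

          /* ... let the VM throttle this writer if too many pages
           * in the system are dirty */
          balance_dirty_pages_ratelimited(mapping);
          cond_resched();

so the throttling happens while no transaction is held.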

> -- Wendy

Thank you for your answer,

--
Mathieu




* [Cluster-devel] Problem in ops_address.c :: gfs_writepage ?
  2007-02-20 10:59   ` Mathieu Avila
@ 2007-02-21  5:04     ` Wendy Cheng
  0 siblings, 0 replies; 4+ messages in thread
From: Wendy Cheng @ 2007-02-21  5:04 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Mathieu Avila wrote:

>On Mon, 19 Feb 2007 17:00:10 -0500,
>Wendy Cheng <wcheng@redhat.com> wrote:
>
>  
>
>>Mathieu Avila wrote:
>>    
>>
>>>Hello all,
>>>But precisely in that situation, there are many times when
>>>gfs_writepage cannot do its job because of the transaction lock.
>>>      
>>>
>
>  
>
>>Yes, we did have this problem in the past with direct IO and SYNC
>>flag.
>>    
>>
>
>I understand that in that case, data are written with the
>get_transaction lock taken, so gfs_writepage never writes the pages,
>and you go beyond the dirty_ratio limit if you write too many pages.
>How did you get rid of it?
>  
>
GFS doesn't allow a process to do writepage if it is already in the
middle of a transaction. In the Direct IO and O_SYNC cases, we did
something similar to (and simpler than) what you've proposed. Right
after the transaction is marked completed, we do filemap_fdatawrite()
and wait for it to complete. The VM dirty_ratio is not relevant there,
since both O_DIRECT and O_SYNC require the data to be (completely)
flushed anyway. In default buffered write mode (your case), the design
is to take advantage of the VM's writeback, so if flushing can be
avoided, it is delayed. I guess that's why we get into the issue you
described in your previous post.
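
Roughly, and ignoring error handling, that path does something like
the following right after gfs_trans_end(); this is a simplified sketch
rather than the exact code:

          struct address_space *mapping = file->f_mapping;

          filemap_fdatawrite(mapping);    /* start writeback of dirty pages */
          filemap_fdatawait(mapping);     /* wait for the I/O to complete */

so the data is on disk before the write returns to the caller.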

>  
>
>>>[snip]
>>>we have hit this with the test program "bonnie++", whose purpose
>>>is to test FS performance. Bonnie++ creates multiple 1 GB files
>>>when it is asked to run long multi-GB writes. There is no problem
>>>with 5 GB (5 files of 1 GB), but many machines in the cluster are
>>>OOM-killed with 10 GB bonnies.
>>>  
>>>      
>>>
>
>  
>
>>I would like to know more about your experiments. So these bonnie++
>>runs are done on each cluster node with independent file sets?
>>
>>    
>>
>
>Exactly. I run "bonnie++ -s 10G" on each node, in different
>directories of the same GFS file system. To make the problem happen
>reliably and more quickly, I effectively disable pdflush by tuning it:
>echo 300000 > /proc/sys/vm/dirty_expire_centisecs
>echo 50000 > /proc/sys/vm/dirty_writeback_centisecs
>..., so that only /proc/sys/vm/dirty_ratio comes into play. Not doing
>this makes the problem harder to reproduce (but it is still
>reproducible).
>  
>
OK ... so this is a corner case, since 10 GB is a very large number...
Did you have real applications get into an OOM state, or is this issue
only seen via this bonnie++ experiment?

>The problem happens only when the writeback of the dirty pages of the
>2nd file starts, once it is done with the 1st file. We still have to
>determine why. So you can get the same problem with a program that:
>- opens 2 files,
>- writes 1 GB to the 1st one,
>- writes indefinitely to the second one.
>  
>
GFS actually chunks one big write into 4 MB pieces. I would expect the
indefinite write to cause the journal to wrap around and a flush to
occur. Will need some time to look into this issue.

>We use a particular block device that doesn't read/write at the same
>speed on all nodes. I don't know if this is relevant.
>  
>
I doubt it is relevant.

>  
>
>>>Keep a counter in gfs_inode of the pages that were not written in
>>>gfs_writepage, and at the end of do_do_write_buf call
>>>"balance_dirty_pages_ratelimited(file->f_mapping);" that many
>>>times. The counter may be shared by multiple processes, but we are
>>>sure that there is no transaction at that moment, so the pages can
>>>be flushed if "balance_dirty_pages_ratelimited" decides it must
>>>reclaim dirty pages. Otherwise performance is not affected.
>>>      
>>>
>
>  
>
>>In general, this approach looks OK if we do have this flushing
>>problem. However, the GFS flush code has been embedded in the glock
>>code, so I would think it would be better to do this within the
>>glock code.
>>    
>>
>
>There are some points we do not understand:
>
>- I understand that the flushing is done in the glock code (when a
>lock is demoted or taken by another node, isn't it?). Why isn't it
>possible to let the kernel decide which pages to flush when it needs
>to? In this particular case, it is not a good idea to flush the pages
>only when the lock is lost; the kernel itself needs to flush pages.
>  
>
pdflush is allowed to flush GFS pages freely, as long as it can get
the individual page lock. The only thing not allowed is writepage from
a process that is already in the middle of a transaction; in that case
writepage simply returns. The original intention (my guess, though;
the original author left the team two years ago) was that there would
be plenty of chances for the pages to get flushed (by lock demotes,
journal flushes, etc.). It is apparently an optimization effort to
take advantage of the VM writeback.

>- Why doesn't gfs_writepage return an error when the page cannot be
>flushed?
>  
>
Well ... no comment :) ... but really, think about it: what else can
we do other than return 0? We don't want the VM to misinterpret the
return code as an I/O error.

>- The balance_dirty_pages_ratelimited function is called inside
>get_transaction/set_transaction (i.e. between gfs_trans_begin and
>gfs_trans_end), therefore gfs_writepage should never work. Do you know
>if there is some kind of asynchronous (kiocb?) write path so that
>gfs_writepage is called later in the same process? If not, what could
>make gfs_writepage happen sometimes inside a transaction and sometimes
>not?
>  
>
Unless the transaction is marked completed, gfs_writepage is a no-op.

>- Ext3 should be affected as well, but it isn't. Is that because its
>transaction lock is held for a much shorter time, so that dirty pages
>that are not flushed while the lock is held will be successfully
>flushed later?
>  
>
In the ext3 case it is a journal lock, so it could be unlocked and
locked again somewhere. In the GFS case it is simply a check to see
whether this process is already in the middle of a transaction. So
between entering and leaving do_do_write_buf, the transaction state
doesn't change.

>- Some other filesystems in the kernel, NTFS and ReiserFS, make
>explicit calls to balance_dirty_pages_ratelimited:
>http://lxr.linux.no/source/fs/ntfs/file.c?v=2.6.18;a=x86_64#L284
>http://lxr.linux.no/source/fs/ntfs/file.c?v=2.6.18;a=x86_64#L2101
>http://lxr.linux.no/source/fs/reiserfs/file.c?v=2.6.11;a=x86_64#L1351
>But they also redefine some generic kernel functions. Maybe they have
>a strong reason to do so?
>
>  
>
Again, the proposal looks OK and we are really grateful for the work.
However the issue is finally settled, we'll definitely ack the
contribution. But since GFS1 has been a stable filesystem, run by many
production systems, we would like to be careful about new code. We
would need more time to look into the run-time behavior. Ping us next
Tuesday if we haven't responded by then?

-- Wendy

 



