linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Linux 2.4.30-rc3
@ 2005-03-26 16:28 Marcelo Tosatti
  2005-03-28  7:34 ` Ville Herva
  0 siblings, 1 reply; 22+ messages in thread
From: Marcelo Tosatti @ 2005-03-26 16:28 UTC (permalink / raw)
  To: linux-kernel


Hi, 

Here goes -rc3.

A nasty typo happened while merging v2.6 load_elf_library() DoS fix,
which could leap to oopses.

Summary of changes from v2.4.30-rc2 to v2.4.30-rc3
============================================

Marcelo Tosatti:
  o Andreas Arens: Fix deadly mismerge of binfmt_elf DoS fix
  o Change VERSION to 2.4.30-rc3



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Linux 2.4.30-rc3
  2005-03-26 16:28 Linux 2.4.30-rc3 Marcelo Tosatti
@ 2005-03-28  7:34 ` Ville Herva
  2005-03-28 16:55   ` Linux 2.4.30-rc3 md/ext3 problems Ville Herva
  0 siblings, 1 reply; 22+ messages in thread
From: Ville Herva @ 2005-03-28  7:34 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel

On Sat, Mar 26, 2005 at 01:28:01PM -0300, you [Marcelo Tosatti] wrote:
> 
> Hi, 
> 
> Here goes -rc3.
> 
> A nasty typo happened while merging v2.6 load_elf_library() DoS fix,
> which could leap to oopses.
> 
> Summary of changes from v2.4.30-rc2 to v2.4.30-rc3
> ============================================
> 
> Marcelo Tosatti:
>   o Andreas Arens: Fix deadly mismerge of binfmt_elf DoS fix
>   o Change VERSION to 2.4.30-rc3

I just upgraded from linux-2.4.21 + vserser 0.17 to 2.4.30rc3 + vserver
1.2.10. The box has been running stable with 2.4.21 + vserver 0.17/0.16 for
a few years (uptime before reboot was nearly 400 days.)

The boot went fine, but after few hours I got 
Message from syslogd@box at Sun Mar 27 22:07:00 2005 ...
turing kernel: journal commit I/O error

and dmesg is filled with 
--8<-----------------------------------------------------------------------
EXT3-fs error (device md(9,3)) in start_transaction: Journal has aborted
EXT3-fs error (device md(9,3)) in start_transaction: Journal has aborted
EXT3-fs error (device md(9,3)) in start_transaction: Journal has aborted
EXT3-fs error (device md(9,3)) in start_transaction: Journal has aborted
--8<-----------------------------------------------------------------------

This is roofs, on top software raid1 and two ide disks. mdstat claims it's
healthy:

--8<-----------------------------------------------------------------------
md3 : active raid1 hdc3[1] hda3[0]
      37955648 blocks [2/2] [UU]
--8<-----------------------------------------------------------------------

While dmesg has filled up and /var/log/messages is read-only - I can't see
all the kernel messages - there appears to be no IO errors from the
underlying devices (md, ide). smartctl -a does not report errors for hda nor
hdc.

During reboot, fsck was run for md3, and it was clean. Now I get

--8<-----------------------------------------------------------------------
Block bitmap differences:  -(7800660--7801060) -(7801934--7802030) -(7802370--7802602) -(7802604--7802613) -(7802681--7802700) -(7802715--7802716) -(7802726--7802732) -(7802744--7802750)-(7802914--7802927) -(7802934--7802937) -(7802946--7802964)  -(7803392--7803417) -(7805060--7808825) -(7808976--7809608) 
Fix? no

Inode bitmap differences:  -3899400
Fix? no
--8<-----------------------------------------------------------------------

No errors from the badblocks part of the fsck, though.

Running fsck triggers the "journal commit I/O error" messages again, and
still no IO errors from either md or ide.

This _could_ have something to do with the vserver patch but it doesn't
appear so. Also, it doesn't immediately look like hardware problem. 

Any ideas?


-- v -- 

v@iki.fi


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Linux 2.4.30-rc3 md/ext3 problems
  2005-03-28  7:34 ` Ville Herva
@ 2005-03-28 16:55   ` Ville Herva
  2005-03-28 17:25     ` Willy Tarreau
  2005-03-29  0:10     ` Neil Brown
  0 siblings, 2 replies; 22+ messages in thread
From: Ville Herva @ 2005-03-28 16:55 UTC (permalink / raw)
  To: Marcelo Tosatti, linux-kernel

On Mon, Mar 28, 2005 at 10:34:05AM +0300, [Ville Herva] wrote:
> 
> I just upgraded from linux-2.4.21 + vserser 0.17 to 2.4.30rc3 + vserver
> 1.2.10. The box has been running stable with 2.4.21 + vserver 0.17/0.16 for
> a few years (uptime before reboot was nearly 400 days.)
> 
> The boot went fine, but after few hours I got 
> Message from syslogd@box at Sun Mar 27 22:07:00 2005 ...
> kernel: journal commit I/O error
> 
> and dmesg is filled with 
> --8<-----------------------------------------------------------------------
> EXT3-fs error (device md(9,3)) in start_transaction: Journal has aborted
> EXT3-fs error (device md(9,3)) in start_transaction: Journal has aborted
> EXT3-fs error (device md(9,3)) in start_transaction: Journal has aborted
> EXT3-fs error (device md(9,3)) in start_transaction: Journal has aborted
> --8<-----------------------------------------------------------------------
> 
> This is roofs, on top software raid1 and two ide disks. mdstat claims it's
> healthy:
> 
> --8<-----------------------------------------------------------------------
> md3 : active raid1 hdc3[1] hda3[0]
>       37955648 blocks [2/2] [UU]
> --8<-----------------------------------------------------------------------
> 
> While dmesg has filled up and /var/log/messages is read-only - I can't see
> all the kernel messages - there appears to be no IO errors from the
> underlying devices (md, ide). smartctl -a does not report errors for hda nor
> hdc.
> 
> During reboot, fsck was run for md3, and it was clean. Now I get
> 
> --8<-----------------------------------------------------------------------
> Block bitmap differences:  -(7800660--7801060) -(7801934--7802030) -(7802370--7802602) -(7802604--7802613) -(7802681--7802700) -(7802715--7802716) -(7802726--7802732) -(7802744--7802750)-(7802914--7802927) -(7802934--7802937) -(7802946--7802964)  -(7803392--7803417) -(7805060--7808825) -(7808976--7809608) 
> Fix? no
> 
> Inode bitmap differences:  -3899400
> Fix? no
> --8<-----------------------------------------------------------------------
> 
> No errors from the badblocks part of the fsck, though.
> 
> Running fsck triggers the "journal commit I/O error" messages again, and
> still no IO errors from either md or ide.
> 
> This _could_ have something to do with the vserver patch but it doesn't
> appear so. Also, it doesn't immediately look like hardware problem. 

I rebooted (fsck took the fs errors away, no big offenders), and after a few
minutes, I got the same error ("journal commit I/O error"). So it doesn't
appear all that random memory corruption. The error happened right when I
logged out, but that might have been a coincidence. No ide nor md errors
this time either. 

I don't know what to suspect. What I gather from changelogs, there haven't
been any critical looking ext3 changes in 2.4 lately, but then again,
vserver doesn't mess with block layer / ext3 journalling either.

Any ideas?


-- v -- 

v@iki.fi


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Linux 2.4.30-rc3 md/ext3 problems
  2005-03-28 16:55   ` Linux 2.4.30-rc3 md/ext3 problems Ville Herva
@ 2005-03-28 17:25     ` Willy Tarreau
  2005-03-28 19:29       ` Ville Herva
  2005-03-29  0:10     ` Neil Brown
  1 sibling, 1 reply; 22+ messages in thread
From: Willy Tarreau @ 2005-03-28 17:25 UTC (permalink / raw)
  To: vherva; +Cc: Marcelo Tosatti, linux-kernel

Hi Ville,

On Mon, Mar 28, 2005 at 07:55:01PM +0300, Ville Herva wrote:
(...) 
> I rebooted (fsck took the fs errors away, no big offenders), and after a few
> minutes, I got the same error ("journal commit I/O error"). So it doesn't
> appear all that random memory corruption. The error happened right when I
> logged out, but that might have been a coincidence. No ide nor md errors
> this time either. 
> 
> I don't know what to suspect. What I gather from changelogs, there haven't
> been any critical looking ext3 changes in 2.4 lately, but then again,
> vserver doesn't mess with block layer / ext3 journalling either.
> 
> Any ideas?

Since you don't seem to be willing to remove vserver, I guess you really
need it on this machine, and to be honnest, I too don't see what trouble
it could cause in this area. However, could you try removing the journal,
or simply mount the FS as ext2 ? It would help to narrow the problem down.

To resume, you have your root on ext3 on top of soft raid1 consisting in
two IDE disks, which works in 2.4.21 but not on 2.4.30-rc3, that's
correct ? There was a fix last week by Neil Brown about RAID1 rebuild
process (degraded array of 3 disks, etc...), unless it obviously does
not come from there, you might want to try reverting it first ? The
next one is from Doug Ledford on 2004/09/18 and should only affect SMP.

My different raid machines run either reiserfs or xfs on soft raid5 on
top of scsi and with kernel 2.4.27, so there's not much to compare...
Perhaps someone on the list has a setup similar to yours and could test
the kernel ?

Cheers,
Willy


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Linux 2.4.30-rc3 md/ext3 problems
  2005-03-28 17:25     ` Willy Tarreau
@ 2005-03-28 19:29       ` Ville Herva
  0 siblings, 0 replies; 22+ messages in thread
From: Ville Herva @ 2005-03-28 19:29 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Marcelo Tosatti, linux-kernel

On Mon, Mar 28, 2005 at 07:25:58PM +0200, you [Willy Tarreau] wrote:
> 
> Since you don't seem to be willing to remove vserver, I guess you really
> need it on this machine, and to be honnest, 

Yes, the machine is in production, and for that it, it needs vserver. The
fact it is in production also makes it a tad hankala awkward to test
different options, at least the most experimental ones.

> I too don't see what trouble it could cause in this area. 

Neither do I. Of course it could be memory corruption caused by vserver in
different part of kernel, but the fact that the symptoms are so consistent,
make me think that is unlikely. But not impossible.

> However, could you try removing the journal, or simply mount the FS as
> ext2 ? It would help to narrow the problem down.

That is a good idea, if only I could boot the box at will and reliably
reproduce the problem. While it took less than ten minutes to trigger it the
first time, it took more than five hours the first time. 

I will try to reproduce the problem on different setup first, and if I can
do that, I'll try your suggestion.
 
> To resume, you have your root on ext3 on top of soft raid1 consisting in
> two IDE disks, which works in 2.4.21 but not on 2.4.30-rc3, that's
> correct ? 

Correct. This is PII 266MHz, no SMP.

> There was a fix last week by Neil Brown about RAID1 rebuild process
> (degraded array of 3 disks, etc...), unless it obviously does not come
> from there, you might want to try reverting it first ? 

Sounds sane, although the raid array was not in degraded state at any stage
and no raid rebuild aver triggered.

> The next one is from Doug Ledford on 2004/09/18 and should only affect
> SMP.

Ok, as said, this is UP.
 
> My different raid machines run either reiserfs or xfs on soft raid5 on
> top of scsi and with kernel 2.4.27, so there's not much to compare...
> Perhaps someone on the list has a setup similar to yours and could test
> the kernel ?

I will try to contruct a similar setup on another machine.


thanks for your insights,

-- v -- 

v@iki.fi


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Linux 2.4.30-rc3 md/ext3 problems
  2005-03-28 16:55   ` Linux 2.4.30-rc3 md/ext3 problems Ville Herva
  2005-03-28 17:25     ` Willy Tarreau
@ 2005-03-29  0:10     ` Neil Brown
  2005-03-29 21:52       ` Marcelo Tosatti
  1 sibling, 1 reply; 22+ messages in thread
From: Neil Brown @ 2005-03-29  0:10 UTC (permalink / raw)
  To: vherva; +Cc: Marcelo Tosatti, linux-kernel

On Monday March 28, vherva@vianova.fi wrote:
> On Mon, Mar 28, 2005 at 10:34:05AM +0300, [Ville Herva] wrote:
> > 
> > I just upgraded from linux-2.4.21 + vserser 0.17 to 2.4.30rc3 + vserver
> > 1.2.10. The box has been running stable with 2.4.21 + vserver 0.17/0.16 for
> > a few years (uptime before reboot was nearly 400 days.)
> > 
> > The boot went fine, but after few hours I got 
> > Message from syslogd@box at Sun Mar 27 22:07:00 2005 ...
> > kernel: journal commit I/O error

I got that error on 2.4.30-rc1 a couple of times, and now cannot
reproduce it :-(
But if you got it too, then it wasn't just bad luck.

The ext3 code in 2.4.30-rc does have a few more checks for IO errors
which will cause the journal to be aborted and produce this error, so
I suspect that change which caused the problem is a change in ext3.
However that doesn't mean the bug is there.

The extra code in ext3 seems to just check if buffer_uptodate is false
after it has waited on a locked buffer, and triggers a journal abort
if it isn't.  This should be perfectly safe, and I cannot find any
logic error near by.  But nor can I find any errors that would cause a
buffer returned from raid1 to not be uptodate (unless there really was
an IO error).


NeilBrown

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Linux 2.4.30-rc3 md/ext3 problems
  2005-03-29  0:10     ` Neil Brown
@ 2005-03-29 21:52       ` Marcelo Tosatti
  2005-03-30  4:06         ` Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check) Neil Brown
  0 siblings, 1 reply; 22+ messages in thread
From: Marcelo Tosatti @ 2005-03-29 21:52 UTC (permalink / raw)
  To: Neil Brown; +Cc: vherva, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1398 bytes --]

On Tue, Mar 29, 2005 at 10:10:34AM +1000, Neil Brown wrote:
> On Monday March 28, vherva@vianova.fi wrote:
> > On Mon, Mar 28, 2005 at 10:34:05AM +0300, [Ville Herva] wrote:
> > > 
> > > I just upgraded from linux-2.4.21 + vserser 0.17 to 2.4.30rc3 + vserver
> > > 1.2.10. The box has been running stable with 2.4.21 + vserver 0.17/0.16 for
> > > a few years (uptime before reboot was nearly 400 days.)
> > > 
> > > The boot went fine, but after few hours I got 
> > > Message from syslogd@box at Sun Mar 27 22:07:00 2005 ...
> > > kernel: journal commit I/O error
> 
> I got that error on 2.4.30-rc1 a couple of times, and now cannot
> reproduce it :-(
> But if you got it too, then it wasn't just bad luck.
> 
> The ext3 code in 2.4.30-rc does have a few more checks for IO errors
> which will cause the journal to be aborted and produce this error, so
> I suspect that change which caused the problem is a change in ext3.
> However that doesn't mean the bug is there.
> 
> The extra code in ext3 seems to just check if buffer_uptodate is false
> after it has waited on a locked buffer, and triggers a journal abort
> if it isn't.  This should be perfectly safe, and I cannot find any
> logic error near by.  But nor can I find any errors that would cause a
> buffer returned from raid1 to not be uptodate (unless there really was
> an IO error).

Attached is the backout patch, for convenience.

[-- Attachment #2: revert-ext3-eh.patch --]
[-- Type: text/plain, Size: 6310 bytes --]

# This is a BitKeeper generated diff -Nru style patch.
#
# ChangeSet
#   2005/03/29 18:49:25-03:00 marcelo@logos.cnet 
#   Cset exclude: hifumi.hisashi@lab.ntt.co.jp|ChangeSet|20050226095914|25750
# 
# mm/filemap.c
#   2005/03/29 18:49:22-03:00 marcelo@logos.cnet +0 -0
#   Exclude
# 
# include/linux/jbd.h
#   2005/03/29 18:49:22-03:00 marcelo@logos.cnet +0 -0
#   Exclude
# 
# fs/jbd/transaction.c
#   2005/03/29 18:49:21-03:00 marcelo@logos.cnet +0 -0
#   Exclude
# 
# fs/jbd/journal.c
#   2005/03/29 18:49:21-03:00 marcelo@logos.cnet +0 -0
#   Exclude
# 
# fs/jbd/commit.c
#   2005/03/29 18:49:21-03:00 marcelo@logos.cnet +0 -0
#   Exclude
# 
# fs/ext3/super.c
#   2005/03/29 18:49:21-03:00 marcelo@logos.cnet +0 -0
#   Exclude
# 
# fs/ext3/fsync.c
#   2005/03/29 18:49:21-03:00 marcelo@logos.cnet +0 -0
#   Exclude
# 
diff -Nru a/fs/ext3/fsync.c b/fs/ext3/fsync.c
--- a/fs/ext3/fsync.c	2005-03-29 18:50:56 -03:00
+++ b/fs/ext3/fsync.c	2005-03-29 18:50:56 -03:00
@@ -69,7 +69,7 @@
 	if (test_opt(inode->i_sb, DATA_FLAGS) == EXT3_MOUNT_WRITEBACK_DATA)
 		ret |= fsync_inode_data_buffers(inode);
 
-	ret |= ext3_force_commit(inode->i_sb);
+	ext3_force_commit(inode->i_sb);
 
 	return ret;
 }
diff -Nru a/fs/ext3/super.c b/fs/ext3/super.c
--- a/fs/ext3/super.c	2005-03-29 18:50:56 -03:00
+++ b/fs/ext3/super.c	2005-03-29 18:50:56 -03:00
@@ -1608,13 +1608,12 @@
 
 static int ext3_sync_fs(struct super_block *sb)
 {
-	int err;
 	tid_t target;
 	
 	sb->s_dirt = 0;
 	target = log_start_commit(EXT3_SB(sb)->s_journal, NULL);
-	err = log_wait_commit(EXT3_SB(sb)->s_journal, target);
-	return err;
+	log_wait_commit(EXT3_SB(sb)->s_journal, target);
+	return 0;
 }
 
 /*
diff -Nru a/fs/jbd/commit.c b/fs/jbd/commit.c
--- a/fs/jbd/commit.c	2005-03-29 18:50:55 -03:00
+++ b/fs/jbd/commit.c	2005-03-29 18:50:55 -03:00
@@ -92,7 +92,7 @@
 	struct buffer_head *wbuf[64];
 	int bufs;
 	int flags;
-	int err = 0;
+	int err;
 	unsigned long blocknr;
 	char *tagp = NULL;
 	journal_header_t *header;
@@ -299,8 +299,6 @@
 			spin_unlock(&journal_datalist_lock);
 			unlock_journal(journal);
 			wait_on_buffer(bh);
-			if (unlikely(!buffer_uptodate(bh)))
-				err = -EIO;
 			/* the journal_head may have been removed now */
 			lock_journal(journal);
 			goto write_out_data;
@@ -328,8 +326,6 @@
 			spin_unlock(&journal_datalist_lock);
 			unlock_journal(journal);
 			wait_on_buffer(bh);
-			if (unlikely(!buffer_uptodate(bh)))
-				err = -EIO;
 			lock_journal(journal);
 			spin_lock(&journal_datalist_lock);
 			continue;	/* List may have changed */
@@ -355,9 +351,6 @@
 	}
 	spin_unlock(&journal_datalist_lock);
 
-	if (err)
-		__journal_abort_hard(journal);
-
 	/*
 	 * If we found any dirty or locked buffers, then we should have
 	 * looped back up to the write_out_data label.  If there weren't
@@ -548,8 +541,6 @@
 		if (buffer_locked(bh)) {
 			unlock_journal(journal);
 			wait_on_buffer(bh);
-			if (unlikely(!buffer_uptodate(bh)))
-				err = -EIO;
 			lock_journal(journal);
 			goto wait_for_iobuf;
 		}
@@ -611,8 +602,6 @@
 		if (buffer_locked(bh)) {
 			unlock_journal(journal);
 			wait_on_buffer(bh);
-			if (unlikely(!buffer_uptodate(bh)))
-				err = -EIO;
 			lock_journal(journal);
 			goto wait_for_ctlbuf;
 		}
@@ -661,8 +650,6 @@
 		bh->b_end_io = journal_end_buffer_io_sync;
 		submit_bh(WRITE, bh);
 		wait_on_buffer(bh);
-		if (unlikely(!buffer_uptodate(bh)))
-			err = -EIO;
 		put_bh(bh);		/* One for getblk() */
 		journal_unlock_journal_head(descriptor);
 	}
@@ -674,9 +661,6 @@
 
 skip_commit: /* The journal should be unlocked by now. */
 
-	if (err)
-		__journal_abort_hard(journal);
-	
 	/* Call any callbacks that had been registered for handles in this
 	 * transaction.  It is up to the callback to free any allocated
 	 * memory.
diff -Nru a/fs/jbd/journal.c b/fs/jbd/journal.c
--- a/fs/jbd/journal.c	2005-03-29 18:50:55 -03:00
+++ b/fs/jbd/journal.c	2005-03-29 18:50:55 -03:00
@@ -582,10 +582,8 @@
  * Wait for a specified commit to complete.
  * The caller may not hold the journal lock.
  */
-int log_wait_commit (journal_t *journal, tid_t tid)
+void log_wait_commit (journal_t *journal, tid_t tid)
 {
-	int err = 0;
-
 	lock_kernel();
 #ifdef CONFIG_JBD_DEBUG
 	lock_journal(journal);
@@ -602,12 +600,6 @@
 		sleep_on(&journal->j_wait_done_commit);
 	}
 	unlock_kernel();
-
-	if (unlikely(is_journal_aborted(journal))) {
-		printk(KERN_EMERG "journal commit I/O error\n");
-		err = -EIO;
-	}
-	return err;
 }
 
 /*
@@ -1334,7 +1326,7 @@
 
 	/* Wait for the log commit to complete... */
 	if (transaction)
-		err = log_wait_commit(journal, transaction->t_tid);
+		log_wait_commit(journal, transaction->t_tid);
 
 	/* ...and flush everything in the log out to disk. */
 	lock_journal(journal);
diff -Nru a/fs/jbd/transaction.c b/fs/jbd/transaction.c
--- a/fs/jbd/transaction.c	2005-03-29 18:50:55 -03:00
+++ b/fs/jbd/transaction.c	2005-03-29 18:50:55 -03:00
@@ -1484,7 +1484,7 @@
 		 * to wait for the commit to complete.  
 		 */
 		if (handle->h_sync && !(current->flags & PF_MEMALLOC))
-			err = log_wait_commit(journal, tid);
+			log_wait_commit(journal, tid);
 	}
 	kfree(handle);
 	return err;
@@ -1509,7 +1509,7 @@
 		goto out;
 	}
 	handle->h_sync = 1;
-	ret = journal_stop(handle);
+	journal_stop(handle);
 out:
 	unlock_kernel();
 	return ret;
diff -Nru a/include/linux/jbd.h b/include/linux/jbd.h
--- a/include/linux/jbd.h	2005-03-29 18:50:55 -03:00
+++ b/include/linux/jbd.h	2005-03-29 18:50:55 -03:00
@@ -848,7 +848,7 @@
 
 extern int	log_space_left (journal_t *); /* Called with journal locked */
 extern tid_t	log_start_commit (journal_t *, transaction_t *);
-extern int	log_wait_commit (journal_t *, tid_t);
+extern void	log_wait_commit (journal_t *, tid_t);
 extern int	log_do_checkpoint (journal_t *, int);
 
 extern void	log_wait_for_space(journal_t *, int nblocks);
diff -Nru a/mm/filemap.c b/mm/filemap.c
--- a/mm/filemap.c	2005-03-29 18:50:55 -03:00
+++ b/mm/filemap.c	2005-03-29 18:50:55 -03:00
@@ -3261,12 +3261,7 @@
 			status = generic_osync_inode(inode, OSYNC_METADATA|OSYNC_DATA);
 	}
 	
-	/*
-	 * generic_osync_inode always returns 0 or negative value.
-	 * So 'status < written' is always true, and written should
-	 * be returned if status >= 0.
-	 */
-	err = (status < 0) ? status : written;
+	err = written ? written : status;
 out:
 
 	return err;

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)
  2005-03-29 21:52       ` Marcelo Tosatti
@ 2005-03-30  4:06         ` Neil Brown
  2005-03-30 11:59           ` Marcelo Tosatti
  0 siblings, 1 reply; 22+ messages in thread
From: Neil Brown @ 2005-03-30  4:06 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: vherva, linux-kernel

On Tuesday March 29, marcelo.tosatti@cyclades.com wrote:
> 
> Attached is the backout patch, for convenience.

Thanks.  I had another look, and think I may be able to see the
problem.  If I'm right, it is a problem with this patch.

> diff -Nru a/fs/jbd/commit.c b/fs/jbd/commit.c
> --- a/fs/jbd/commit.c	2005-03-29 18:50:55 -03:00
> +++ b/fs/jbd/commit.c	2005-03-29 18:50:55 -03:00
> @@ -92,7 +92,7 @@
>  	struct buffer_head *wbuf[64];
>  	int bufs;
>  	int flags;
> -	int err = 0;
> +	int err;
>  	unsigned long blocknr;
>  	char *tagp = NULL;
>  	journal_header_t *header;
> @@ -299,8 +299,6 @@
>  			spin_unlock(&journal_datalist_lock);
>  			unlock_journal(journal);
>  			wait_on_buffer(bh);
> -			if (unlikely(!buffer_uptodate(bh)))
> -				err = -EIO;
>  			/* the journal_head may have been removed now */
>  			lock_journal(journal);
>  			goto write_out_data;


I think the "!buffer_update(bh)" test is not safe at this point as,
after the wait_on_buffer which could cause a schedule, the bh may
no longer exist, or be for the same block.  There doesn't seem to be
any locking or refcounting that would keep it valid.

Note the comment "the journal_head may have been removed now".
If the journal_head is gone, the associated buffer_head is likely gone
as well. 

I'm not certain that this is right, but it seems possible and would
explain the symptoms.  Maybe Stephen or Andrew could comments?


> --- a/mm/filemap.c	2005-03-29 18:50:55 -03:00
> +++ b/mm/filemap.c	2005-03-29 18:50:55 -03:00
> @@ -3261,12 +3261,7 @@
>  			status = generic_osync_inode(inode, OSYNC_METADATA|OSYNC_DATA);
>  	}
>  	
> -	/*
> -	 * generic_osync_inode always returns 0 or negative value.
> -	 * So 'status < written' is always true, and written should
> -	 * be returned if status >= 0.
> -	 */
> -	err = (status < 0) ? status : written;
> +	err = written ? written : status;
>  out:
>  
>  	return err;

As an aside, this looks extremely dubious to me.

There is a loop earlier in this routine (do_generic_file_write) that
passes a piece-at-a-time of the write request to prepare_write /
commit_write.
Successes are counted in 'written'.  A failure causes the loop to
abort with a status in 'status'.

If some of the write succeeded and some failed, then I believe the
correct behaviour is to return the number of bytes that succeeded.
However this change to the return status (remember the above patch is
a reversal) causes any failure to over-ride any success. This, I
think, is wrong.

NeilBrown

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)
  2005-03-30  4:06         ` Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check) Neil Brown
@ 2005-03-30 11:59           ` Marcelo Tosatti
  2005-04-05 22:40             ` Stephen C. Tweedie
  0 siblings, 1 reply; 22+ messages in thread
From: Marcelo Tosatti @ 2005-03-30 11:59 UTC (permalink / raw)
  To: Neil Brown, sct, akpm; +Cc: vherva, linux-kernel, hifumi.hisashi

On Wed, Mar 30, 2005 at 02:06:39PM +1000, Neil Brown wrote:
> On Tuesday March 29, marcelo.tosatti@cyclades.com wrote:
> > 
> > Attached is the backout patch, for convenience.
> 
> Thanks.  I had another look, and think I may be able to see the
> problem.  If I'm right, it is a problem with this patch.
> 
> > diff -Nru a/fs/jbd/commit.c b/fs/jbd/commit.c
> > --- a/fs/jbd/commit.c	2005-03-29 18:50:55 -03:00
> > +++ b/fs/jbd/commit.c	2005-03-29 18:50:55 -03:00
> > @@ -92,7 +92,7 @@
> >  	struct buffer_head *wbuf[64];
> >  	int bufs;
> >  	int flags;
> > -	int err = 0;
> > +	int err;
> >  	unsigned long blocknr;
> >  	char *tagp = NULL;
> >  	journal_header_t *header;
> > @@ -299,8 +299,6 @@
> >  			spin_unlock(&journal_datalist_lock);
> >  			unlock_journal(journal);
> >  			wait_on_buffer(bh);
> > -			if (unlikely(!buffer_uptodate(bh)))
> > -				err = -EIO;
> >  			/* the journal_head may have been removed now */
> >  			lock_journal(journal);
> >  			goto write_out_data;
> 
> 
> I think the "!buffer_update(bh)" test is not safe at this point as,
> after the wait_on_buffer which could cause a schedule, the bh may
> no longer exist, or be for the same block.  There doesn't seem to be
> any locking or refcounting that would keep it valid.
> 
> Note the comment "the journal_head may have been removed now".
> If the journal_head is gone, the associated buffer_head is likely gone
> as well. 

Seems to be possible, yes.

> I'm not certain that this is right, but it seems possible and would
> explain the symptoms.  Maybe Stephen or Andrew could comments?

Andrew, Stephen?

> > --- a/mm/filemap.c	2005-03-29 18:50:55 -03:00
> > +++ b/mm/filemap.c	2005-03-29 18:50:55 -03:00
> > @@ -3261,12 +3261,7 @@
> >  			status = generic_osync_inode(inode, OSYNC_METADATA|OSYNC_DATA);
> >  	}
> >  	
> > -	/*
> > -	 * generic_osync_inode always returns 0 or negative value.
> > -	 * So 'status < written' is always true, and written should
> > -	 * be returned if status >= 0.
> > -	 */
> > -	err = (status < 0) ? status : written;
> > +	err = written ? written : status;
> >  out:
> >  
> >  	return err;
> 
> As an aside, this looks extremely dubious to me.
> 
> There is a loop earlier in this routine (do_generic_file_write) that
> passes a piece-at-a-time of the write request to prepare_write /
> commit_write.
> Successes are counted in 'written'.  A failure causes the loop to
> abort with a status in 'status'.
> 
> If some of the write succeeded and some failed, then I believe the
> correct behaviour is to return the number of bytes that succeeded.
> However this change to the return status (remember the above patch is
> a reversal) causes any failure to over-ride any success. This, I
> think, is wrong.

Yeap, that part also looks wrong.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)
  2005-03-30 11:59           ` Marcelo Tosatti
@ 2005-04-05 22:40             ` Stephen C. Tweedie
       [not found]               ` <1112740856.4148.145.camel@sisko.sctweedie.blueyonder.co.uk >
  0 siblings, 1 reply; 22+ messages in thread
From: Stephen C. Tweedie @ 2005-04-05 22:40 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Neil Brown, Andrew Morton, vherva, linux-kernel, hifumi.hisashi,
	Stephen Tweedie

Hi,

On Wed, 2005-03-30 at 12:59, Marcelo Tosatti wrote:

> > I'm not certain that this is right, but it seems possible and would
> > explain the symptoms.  Maybe Stephen or Andrew could comments?
> 
> Andrew, Stephen?

Sorry, was offline for a week last week; I'll try to look at this more
closely tomorrow.  Checking the buffer_uptodate() without either a
refcount or a lock certainly looks unsafe at first glance.

There are lots of ways to pin the bh in that particular bit of the
code.  The important thing will be to do so without causing leaks if
we're truly finished with the buffer after this flush.


> > If some of the write succeeded and some failed, then I believe the
> > correct behaviour is to return the number of bytes that succeeded.
> > However this change to the return status (remember the above patch is
> > a reversal) causes any failure to over-ride any success. This, I
> > think, is wrong.
> 
> Yeap, that part also looks wrong.

Certainly it's normal for a short read/write to imply either error or
EOF, without the error necessarily needing to be returned explicitly. 
I'm not convinced that the Singleunix language actually requires that,
but it seems the most obvious and consistent behaviour.

--Stephen


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)
       [not found]               ` <1112740856.4148.145.camel@sisko.sctweedie.blueyonder.co.uk >
@ 2005-04-06 10:01                 ` Hifumi Hisashi
  2005-04-06 14:20                   ` Stephen C. Tweedie
                                     ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Hifumi Hisashi @ 2005-04-06 10:01 UTC (permalink / raw)
  To: Stephen C. Tweedie, Marcelo Tosatti
  Cc: Neil Brown, Andrew Morton, vherva, linux-kernel

Hi.

At 07:40 05/04/06, Stephen C. Tweedie wrote:
 >Sorry, was offline for a week last week; I'll try to look at this more
 >closely tomorrow.  Checking the buffer_uptodate() without either a
 >refcount or a lock certainly looks unsafe at first glance.
 >
 >There are lots of ways to pin the bh in that particular bit of the
 >code.  The important thing will be to do so without causing leaks if
 >we're truly finished with the buffer after this flush.
 >

I have measured the bh refcount before the buffer_uptodate() for a few days.
I found out that the bh refcount sometimes reached to 0 .
So, I think following modifications are effective.

diff -Nru 2.4.30-rc3/fs/jbd/commit.c 2.4.30-rc3_patch/fs/jbd/commit.c
--- 2.4.30-rc3/fs/jbd/commit.c	2005-04-06 17:14:47.000000000 +0900
+++ 2.4.30-rc3_patch/fs/jbd/commit.c	2005-04-06 17:18:49.000000000 +0900
@@ -295,6 +295,7 @@
  		struct buffer_head *bh;
  		jh = jh->b_tprev;	/* Wait on the last written */
  		bh = jh2bh(jh);
+		get_bh(bh);
  		if (buffer_locked(bh)) {
  			spin_unlock(&journal_datalist_lock);
  			unlock_journal(journal);
@@ -302,11 +303,14 @@
  			if (unlikely(!buffer_uptodate(bh)))
  				err = -EIO;
  			/* the journal_head may have been removed now */
+			put_bh(bh);
  			lock_journal(journal);
  			goto write_out_data;
  		} else if (buffer_dirty(bh)) {
+			put_bh(bh);
  			goto write_out_data_locked;
  		}
+		put_bh(bh);
  	} while (jh != commit_transaction->t_sync_datalist);
  	goto write_out_data_locked;



 >
 >> > If some of the write succeeded and some failed, then I believe the
 >> > correct behaviour is to return the number of bytes that succeeded.
 >> > However this change to the return status (remember the above patch is
 >> > a reversal) causes any failure to over-ride any success. This, I
 >> > think, is wrong.
 >>
 >> Yeap, that part also looks wrong.
 >
 >Certainly it's normal for a short read/write to imply either error or
 >EOF, without the error necessarily needing to be returned explicitly.
 >I'm not convinced that the Singleunix language actually requires that,
 >but it seems the most obvious and consistent behaviour.
 >
 >--Stephen

When an O_SYNC flag is set , if commit_write() succeed but 
generic_osync_inode() return
error due to I/O failure, write() must fail .

I think that following error handling code is rational in 
do_generic_file_write() .

	if (file->f_flags & O_SYNC)
		err = (status < 0) ? status : written;
	else
		err = written ? written : status;
	out:

	return err;


Thanks. 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)
  2005-04-06 10:01                 ` Hifumi Hisashi
@ 2005-04-06 14:20                   ` Stephen C. Tweedie
  2005-04-07  3:00                     ` Hifumi Hisashi
  2005-04-06 20:10                   ` Stephen C. Tweedie
  2005-04-13 14:34                   ` Stephen C. Tweedie
  2 siblings, 1 reply; 22+ messages in thread
From: Stephen C. Tweedie @ 2005-04-06 14:20 UTC (permalink / raw)
  To: Hifumi Hisashi
  Cc: Marcelo Tosatti, Neil Brown, Andrew Morton, vherva, linux-kernel,
	Stephen Tweedie

Hi,

On Wed, 2005-04-06 at 11:01, Hifumi Hisashi wrote:

>  >Certainly it's normal for a short read/write to imply either error or
>  >EOF, without the error necessarily needing to be returned explicitly.
>  >I'm not convinced that the Singleunix language actually requires that,
>  >but it seems the most obvious and consistent behaviour.

> When an O_SYNC flag is set , if commit_write() succeed but 
> generic_osync_inode() return
> error due to I/O failure, write() must fail .

Yes.  But it is conventional to interpret a short write as being a
failure.  Returning less bytes than were requested in the write
indicates that the rest failed.  It just doesn't give the exact nature
of the failure (EIO vs ENOSPC etc.)  For regular files, a short write is
never permitted unless there are errors of some description.

--Stephen


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)
  2005-04-06 10:01                 ` Hifumi Hisashi
  2005-04-06 14:20                   ` Stephen C. Tweedie
@ 2005-04-06 20:10                   ` Stephen C. Tweedie
  2005-04-07 15:51                     ` Stephen C. Tweedie
  2005-04-13 14:34                   ` Stephen C. Tweedie
  2 siblings, 1 reply; 22+ messages in thread
From: Stephen C. Tweedie @ 2005-04-06 20:10 UTC (permalink / raw)
  To: Hifumi Hisashi
  Cc: Marcelo Tosatti, Neil Brown, Andrew Morton, vherva, linux-kernel,
	Stephen Tweedie

Hi,

On Wed, 2005-04-06 at 11:01, Hifumi Hisashi wrote:

> I have measured the bh refcount before the buffer_uptodate() for a few days.
> I found out that the bh refcount sometimes reached to 0 .
> So, I think following modifications are effective.

Thanks --- it certainly looks like this should fix the dereferencing of
invalid bh's, and this code mirrors what 2.6 does around that area.

However, 2.6 is suspected of still having leaks in ext3.  To be certain
that we're not just backporting one of those to 2.4, we need to
understand who exactly is going to clean up these bh's if they are in
fact unused once we complete this code and do the final put_bh().  

I'll give this patch a spin with Andrew's fsx-based leak stress and see
if anything unpleasant appears.  

Cheers,
 Stephen


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)
  2005-04-06 14:20                   ` Stephen C. Tweedie
@ 2005-04-07  3:00                     ` Hifumi Hisashi
  0 siblings, 0 replies; 22+ messages in thread
From: Hifumi Hisashi @ 2005-04-07  3:00 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Marcelo Tosatti, Neil Brown, Andrew Morton, vherva, linux-kernel

Hi,

At 23:20 05/04/06, Stephen C. Tweedie wrote:
 >Yes.  But it is conventional to interpret a short write as being a
 >failure.  Returning less bytes than were requested in the write
 >indicates that the rest failed.  It just doesn't give the exact nature
 >of the failure (EIO vs ENOSPC etc.)  For regular files, a short write is
 >never permitted unless there are errors of some description.

When commit_write() FULLY succeed (requested bytes == returned bytes) but
generic_osync_inode() return error due to I/O failure, current 
do_generic_file_write() cannot
return error. I encountered above situation a lot under an I/O trouble 
condition .

In ver 2.6.11, the return value of generic_osync_inode() is returned 
directry to user
when I/O failure occur.


thanks.

   


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)
  2005-04-06 20:10                   ` Stephen C. Tweedie
@ 2005-04-07 15:51                     ` Stephen C. Tweedie
  2005-04-11 12:55                       ` Stephen C. Tweedie
  0 siblings, 1 reply; 22+ messages in thread
From: Stephen C. Tweedie @ 2005-04-07 15:51 UTC (permalink / raw)
  To: Hifumi Hisashi
  Cc: Marcelo Tosatti, Neil Brown, Andrew Morton, vherva, linux-kernel,
	Stephen Tweedie

Hi,

On Wed, 2005-04-06 at 21:10, Stephen C. Tweedie wrote:

> However, 2.6 is suspected of still having leaks in ext3.  To be certain
> that we're not just backporting one of those to 2.4, we need to
> understand who exactly is going to clean up these bh's if they are in
> fact unused once we complete this code and do the final put_bh().  
> 
> I'll give this patch a spin with Andrew's fsx-based leak stress and see
> if anything unpleasant appears.  

I'm currently running with the buffer-trace debug patch, on 2.4, with a
custom patch to put every buffer jbd ever sees onto a per-superblock
list, and remove it only when the bh is destroyed in
put_unused_buffer_head().  At unmount time, we can walk that list to
find stale buffers attached to data pages (invalidate_buffers() already
does such a walk for metadata.)

I just ran it with a quick test and it found 300,000 buffers still
present at unmount.  Whoops, I guess I need to move the check to _after_
the final invalidate_inodes() call. :-)

But this method ought to allow us to test for this sort of leak a lot
more reliably, and to report via the usual buffer history tracing the
most recent activity on any bh that does leak.

Andrew, I'll give this a try on 2.6 once I've got this 2.4 patch tested
for Marcelo.  I've got uptodate buffer_trace for 2.6 anyway, so it
shouldn't be too hard.

--Stephen


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)
  2005-04-07 15:51                     ` Stephen C. Tweedie
@ 2005-04-11 12:55                       ` Stephen C. Tweedie
  2005-04-11 20:46                         ` Andrew Morton
  0 siblings, 1 reply; 22+ messages in thread
From: Stephen C. Tweedie @ 2005-04-11 12:55 UTC (permalink / raw)
  To: Hifumi Hisashi
  Cc: Marcelo Tosatti, Neil Brown, Andrew Morton, vherva, linux-kernel,
	Stephen Tweedie

Hi,

On Thu, 2005-04-07 at 16:51, Stephen C. Tweedie wrote:

> I'm currently running with the buffer-trace debug patch, on 2.4, with a
> custom patch to put every buffer jbd ever sees onto a per-superblock
> list, and remove it only when the bh is destroyed in
> put_unused_buffer_head().  At unmount time, we can walk that list to
> find stale buffers attached to data pages (invalidate_buffers() already
> does such a walk for metadata.)

After a 50-process dbench run, unmount is showing ~300 buffers
unclaimed; that seems to be OK, it's a large stress test doing _lots_ of
create/unlink and we expect to see a few buffers around at the end where
they were truncated during commit and couldn't be instantly reclaimed.

The main thing now is to test these buffers to make 100% sure that they
are in a state where the VM can reclaim them.  That looks fine: the
buffers I see are unjournaled, have no jh, and are clean and on the
BUF_CLEAN lru.

Andrew, what was the exact illegal state of the pages you were seeing
when fixing that recent leak?  It looks like it's nothing more complex
than dirty buffers on an anon page.  I think that simply calling
try_to_release_page() for all the remaining buffers at umount time will
be enough to catch these; if that function fails, it tells us that the
VM can't reclaim these pages.  The only thing that would be required on
top of that would be a check that the page is also on the VM LRU lists.

--Stephen


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)
  2005-04-11 12:55                       ` Stephen C. Tweedie
@ 2005-04-11 20:46                         ` Andrew Morton
  2005-04-11 22:24                           ` Stephen C. Tweedie
  0 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2005-04-11 20:46 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: hifumi.hisashi, marcelo.tosatti, neilb, vherva, linux-kernel, sct

"Stephen C. Tweedie" <sct@redhat.com> wrote:
>
> Andrew, what was the exact illegal state of the pages you were seeing
>  when fixing that recent leak?  It looks like it's nothing more complex
>  than dirty buffers on an anon page.

Correct.

>  I think that simply calling
>  try_to_release_page() for all the remaining buffers at umount time will

Presumably these pages have no ->mapping, so try_to_release_page() will
call try_to_free_buffers().

>  be enough to catch these; if that function fails, it tells us that the
>  VM can't reclaim these pages.

Yes, if the buffers are dirty then 2.4's try_to_free_buffers() won't free
them.

>  The only thing that would be required on
>  top of that would be a check that the page is also on the VM LRU lists.

Why do we have dirty buffers left over at umount time?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)
  2005-04-11 20:46                         ` Andrew Morton
@ 2005-04-11 22:24                           ` Stephen C. Tweedie
  0 siblings, 0 replies; 22+ messages in thread
From: Stephen C. Tweedie @ 2005-04-11 22:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: hifumi.hisashi, Marcelo Tosatti, Neil Brown, vherva,
	linux-kernel, Stephen Tweedie

Hi,

On Mon, 2005-04-11 at 21:46, Andrew Morton wrote:
> "Stephen C. Tweedie" <sct@redhat.com> wrote:
> >
> > Andrew, what was the exact illegal state of the pages you were seeing
> >  when fixing that recent leak?  It looks like it's nothing more complex
> >  than dirty buffers on an anon page.
> 
> Correct.

Good --- I think I've got the debugging solid against 2.4 for that case,
patching against 2.6 now.

> >  I think that simply calling
> >  try_to_release_page() for all the remaining buffers at umount time will
> 
> Presumably these pages have no ->mapping, so try_to_release_page() will
> call try_to_free_buffers().

Exactly.

> >  The only thing that would be required on
> >  top of that would be a check that the page is also on the VM LRU lists.
> 
> Why do we have dirty buffers left over at umount time?

In the leak case we're worried about, the buffers are dirty but the page
is anon so there's nothing to clean them up.  The buffers _will_ be
destroyed by unmount, sure; invalidate_bdev() should see to that.  But
I'm doing the bh leak check before we get to the final
invalidate_bdev(), so they should still be available for testing at that
point.

--Stephen


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)
  2005-04-06 10:01                 ` Hifumi Hisashi
  2005-04-06 14:20                   ` Stephen C. Tweedie
  2005-04-06 20:10                   ` Stephen C. Tweedie
@ 2005-04-13 14:34                   ` Stephen C. Tweedie
  2005-04-27 22:22                     ` Stephen C. Tweedie
  2 siblings, 1 reply; 22+ messages in thread
From: Stephen C. Tweedie @ 2005-04-13 14:34 UTC (permalink / raw)
  To: Hifumi Hisashi
  Cc: Marcelo Tosatti, Neil Brown, Andrew Morton, vherva, linux-kernel,
	Stephen Tweedie

Hi,

On Wed, 2005-04-06 at 11:01, Hifumi Hisashi wrote:

> I have measured the bh refcount before the buffer_uptodate() for a few days.
> I found out that the bh refcount sometimes reached to 0 .
> So, I think following modifications are effective.
> 
> diff -Nru 2.4.30-rc3/fs/jbd/commit.c 2.4.30-rc3_patch/fs/jbd/commit.c
> --- 2.4.30-rc3/fs/jbd/commit.c	2005-04-06 17:14:47.000000000 +0900
> +++ 2.4.30-rc3_patch/fs/jbd/commit.c	2005-04-06 17:18:49.000000000 +0900
> @@ -302,11 +303,14 @@
>   			if (unlikely(!buffer_uptodate(bh)))
>   				err = -EIO;
>   			/* the journal_head may have been removed now */
> +			put_bh(bh);
>   			lock_journal(journal);
>   			goto write_out_data;

...

In further testing, this chunk is causing some problems.  It is entirely
legal for a buffer to be !uptodate here, although the path is somewhat
convoluted.  The trouble is, once the IO has completed,
journal_dirty_data() can steal the buffer from the committing
transaction.  And once that happens, journal_unmap_buffer() can then
release the buffer and clear the uptodate bit.  

I've run this under buffer tracing and can show you this exact path on a
trace.  It only takes a few seconds to reproduce under fsx.

For this to have happened, though, journal_dirty_data needs to have
waited on the data, so it has an opportunity to see the !uptodate state
at that point.

I think it's safe enough to sidestep this by double-checking to see
whether the bh is still on the committing transaction after we've waited
for the buffer and retaken journaling locks, but before testing
buffer_uptodate().

--Stephen


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)
  2005-04-13 14:34                   ` Stephen C. Tweedie
@ 2005-04-27 22:22                     ` Stephen C. Tweedie
  2005-04-28  8:54                       ` Hifumi Hisashi
  0 siblings, 1 reply; 22+ messages in thread
From: Stephen C. Tweedie @ 2005-04-27 22:22 UTC (permalink / raw)
  To: Hifumi Hisashi
  Cc: Marcelo Tosatti, Neil Brown, Andrew Morton, vherva, linux-kernel,
	Stephen Tweedie

Hi,

On Wed, 2005-04-13 at 15:34, Stephen C. Tweedie wrote:

> >   			if (unlikely(!buffer_uptodate(bh)))
> >   				err = -EIO;
> >   			/* the journal_head may have been removed now */
> > +			put_bh(bh);
> >   			lock_journal(journal);
> >   			goto write_out_data;

> In further testing, this chunk is causing some problems.  It is entirely
> legal for a buffer to be !uptodate here, although the path is somewhat
> convoluted.  The trouble is, once the IO has completed,
> journal_dirty_data() can steal the buffer from the committing
> transaction.  And once that happens, journal_unmap_buffer() can then
> release the buffer and clear the uptodate bit.  

It turns out that simply testing for 

	(!buffer_uptodate(bh) && buffer_mapped(bh))

is enough to fix this: the situation where the uptodate bit is cleared
manually here also clears the buffer_mapped bit, and that is done under
bh lock so we don't even have to worry about the order in which the
bitops occur.  There's another path later on in commit.c where the same
test encounters the same problem; that gets cured by the same fix.  With
the EIO errors turned into BUG()s for debugging, I've been able to run
under severe load, with 50msec commit intervals, for a day or so on SMP
without running into any false positives.

In checking out the patch one last time, though, I found one anomaly. 
The patch that was submitted to 2.6 for this problem also changed
write_inode_now() to return an error value.  The patch that got
submitted to 2.4 had no such change.  Was there a reason for this
difference between the two versions?

Cheers,
 Stephen


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)
  2005-04-27 22:22                     ` Stephen C. Tweedie
@ 2005-04-28  8:54                       ` Hifumi Hisashi
  2005-04-28 11:15                         ` Stephen C. Tweedie
  0 siblings, 1 reply; 22+ messages in thread
From: Hifumi Hisashi @ 2005-04-28  8:54 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Marcelo Tosatti, Neil Brown, Andrew Morton, vherva, linux-kernel


At 07:22 05/04/28, Stephen C. Tweedie wrote:
 >In checking out the patch one last time, though, I found one anomaly.
 >The patch that was submitted to 2.6 for this problem also changed
 >write_inode_now() to return an error value.  The patch that got
 >submitted to 2.4 had no such change.  Was there a reason for this
 >difference between the two versions?

Right. Also in ver 2.4, I know write_inode_now() has to be changed for perfect
I/O error detection during synchronous writing.

In do_generic_file_write(mm/filemap.c), does the current return value handling is
unchanged?

Thanks, 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)
  2005-04-28  8:54                       ` Hifumi Hisashi
@ 2005-04-28 11:15                         ` Stephen C. Tweedie
  0 siblings, 0 replies; 22+ messages in thread
From: Stephen C. Tweedie @ 2005-04-28 11:15 UTC (permalink / raw)
  To: Hifumi Hisashi
  Cc: Marcelo Tosatti, Neil Brown, Andrew Morton, vherva, linux-kernel,
	Stephen Tweedie

Hi,

On Thu, 2005-04-28 at 09:54, Hifumi Hisashi wrote:
> At 07:22 05/04/28, Stephen C. Tweedie wrote:
>  >In checking out the patch one last time, though, I found one anomaly.
>  >The patch that was submitted to 2.6 for this problem also changed
>  >write_inode_now() to return an error value.  The patch that got
>  >submitted to 2.4 had no such change.  Was there a reason for this
>  >difference between the two versions?
> 
> Right. Also in ver 2.4, I know write_inode_now() has to be changed for perfect
> I/O error detection during synchronous writing.
> 
> In do_generic_file_write(mm/filemap.c), does the current return value handling is
> unchanged?

At present I'm using your changed version, but that is wrong --- in most
cases, we want to return short writes on errors.  It's only in the
synchronous case that we want to do otherwise, and even then only on EIO
on the inode.  For example, if we run out of quota halfway through an
O_SYNC write beyond EOF, then the bulk of the IO *has* completed
successfully, and we want to indicate that in the error.  So I'll be
re-jigging that bit of the code.

Cheers,
 Stephen


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2005-04-28 11:16 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-03-26 16:28 Linux 2.4.30-rc3 Marcelo Tosatti
2005-03-28  7:34 ` Ville Herva
2005-03-28 16:55   ` Linux 2.4.30-rc3 md/ext3 problems Ville Herva
2005-03-28 17:25     ` Willy Tarreau
2005-03-28 19:29       ` Ville Herva
2005-03-29  0:10     ` Neil Brown
2005-03-29 21:52       ` Marcelo Tosatti
2005-03-30  4:06         ` Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check) Neil Brown
2005-03-30 11:59           ` Marcelo Tosatti
2005-04-05 22:40             ` Stephen C. Tweedie
     [not found]               ` <1112740856.4148.145.camel@sisko.sctweedie.blueyonder.co.uk >
2005-04-06 10:01                 ` Hifumi Hisashi
2005-04-06 14:20                   ` Stephen C. Tweedie
2005-04-07  3:00                     ` Hifumi Hisashi
2005-04-06 20:10                   ` Stephen C. Tweedie
2005-04-07 15:51                     ` Stephen C. Tweedie
2005-04-11 12:55                       ` Stephen C. Tweedie
2005-04-11 20:46                         ` Andrew Morton
2005-04-11 22:24                           ` Stephen C. Tweedie
2005-04-13 14:34                   ` Stephen C. Tweedie
2005-04-27 22:22                     ` Stephen C. Tweedie
2005-04-28  8:54                       ` Hifumi Hisashi
2005-04-28 11:15                         ` Stephen C. Tweedie

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).