linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: test12-pre5
       [not found] <00120522275601.09076@gimli>
@ 2000-12-05 21:58 ` Linus Torvalds
  0 siblings, 0 replies; 19+ messages in thread
From: Linus Torvalds @ 2000-12-05 21:58 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Kernel Mailing List



On Tue, 5 Dec 2000, Daniel Phillips wrote:
> 
> OK, I see - this isn't easy at all.  You start the io if necessary, and
> some time later it completes.

Right. You don't know when. Once completed, it will unlock the page and
wake up waiters. It will also set PG_Uptodate if the read was successful
(and obviously it might not have been).

If you only care about the contents of the page, you can just test the
Uptodate flag - if the page is up-to-date you may not care whether IO is
outstanding on the page, or whether somebody is modifying page state
(removing page buffers etc). So it's entirely legal to look up a page in
the page cache and only look at uptodate.

(Of course, if it _isn't_ uptodate, you will still have to get the page
lock and re-test and possibly start the IO if it still isn't up-to-date
after you got the page lock. This is what all the generic_file_read()
stuff does for you).

>				  The locking state is therefore
> indeterminate after ->readpage or ->writepage; all we know is that
> after some finite amount of time the lock bit will go down.

Yes. It might have completed immediately (ramdisks or what not), or be
fast enough that by the time you get back it's already done. But the
likely case is that the page will be locked, and you'd be better off going
away and doing something else in the meantime (this is certainly important
for the VM layer - it needs to continue doing swap-outs so that it
doesn't just dribble the pages out one by one).

The main reason for the page locking changes for the writepage() case were
really:

 - the swapping code serializes accesses by using the page lock, and
   doesn't have any other underlying serializing primitives. So we needed
   to let the brw_page() code leave the lock set until the write has
   physically completed.

 - the VM code really wants to have a notion of "I have this many pages in
   flight", so that it can sanely make a decision on when to start waiting
   on page writeout completion, instead of just writing as much as it can.
   The block device layer does some of this, of course, but the VM layer
   can do more by just using the "nr_async_pages" thing. However, before
   the unlock changes, this could not work for shared file mappings.

I hope that explains the change,

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: test12-pre5
  2000-12-05  3:20 test12-pre5 Linus Torvalds
                   ` (3 preceding siblings ...)
  2000-12-05 15:48 ` test12-pre5 Daniel Phillips
@ 2000-12-06 13:18 ` Panu Matilainen
  4 siblings, 0 replies; 19+ messages in thread
From: Panu Matilainen @ 2000-12-06 13:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kernel Mailing List, Alexander Viro, Andrew Morton,
	Stephen C. Tweedie, Alan Cox, Christoph Rohland, Rik van Riel,
	MOLNAR Ingo, urban

On Mon, 4 Dec 2000, Linus Torvalds wrote:

>
> Ok, this contains one of the fixes for the dirty inode buffer list (the
> other fix is pending, simply because I still want to understand why it
> would be needed at all). Al?
>
> Also, it has the final installment of the PageDirty handling, and now
> officially direct IO can work by just marking the physical page dirty and
> be done with it. NFS along with all filesystems have been converted, the
> one hold-out still being smbfs.
>
> Who works on smbfs these days? I see two ways of fixing smbfs now that
> "writepage()" only gets an anonymous page and no "struct file" information
> any more (this also fixes the double page unlock that Andrew saw).

Urban Widmark is the smbfs maintainer nowadays. BTW Urban thanks again for
the NetApp-fix, it has saved my behind quite a few times by now :)

        - Panu -


>
>  - disable shared mmap over smbfs (very easily done by just setting
>    writepage to NULL)
>
>  - fetch the dentry that writepage needs by just looking at the
>    inode->i_dentry list and/or just make the smbfs page cache host be the
>    dentry instead of the inode like other filesystems. The first approach
>    assumes that all paths are equal for writeout, the second one assumes
>    that there are no hard linking going on in smbfs.
>
> Somebody more knowledgeable than I will have to make the decision
> (otherwise I'll just end up disabling shared mmap - I doubt anybody really
> uses it anyway, but it would be more polite to just support it).
>
> NOTE! There's another change to "writepage()" semantics than just dropping
> the "struct file": the new writepage() is supposed to mirror the logic of
> readpage(), and unlock the page when it is done with it. This allows the
> VM system more visibility into what IO is pending (which the VM doesn't
> take advantage of yet, but now it can _truly_ use the same logic for both
> swapout and for dirty file writeback).
>
> The other change is that I forward-ported the ymfpci driver from 2.2.18,
> as it works better than the ALSA one on my now-to-be-main-laptop ;)
>
> [ Alan - I ahve your patches in my incoming queue still, I wanted to get
>   an interim version out to check with Al on the block list and the VM
>   stuff with Rik and people. ]
>
> 		Linus
>
> ----
>  - pre5:
>     - Jaroslav Kysela: ymfpci driver
>     - me: get rid of bogus MS_INVALIDATE semantics
>     - me: final part of the PageDirty() saga
>     - Rusty Russell: 4-way SMP iptables fix
>     - Al Viro: oops - bad ext2 inode dirty block bug
>
>  - pre4:
>     - Andries Brouwer: final isofs pieces.
>     - Kai Germaschewski: ISDN
>     - play CD audio correctly, don't stop after 12 minutes.
>     - Anton Altaparmakov: disable NTFS mmap for now, as it doesn't work.
>     - Stephen Tweedie: fix inode dirty block handling
>     - Bill Hartner: reschedule_idle - prefer right cpu
>     - Johannes Erdfelt: USB updates
>     - Alan Cox: synchronize
>     - Richard Henderson: alpha updates and optimizations
>     - Geert Uytterhoeven: fbdev could be fooled into crashing fix
>     - Trond Myklebust: NFS filehandles in inode rather than dentry
>
>  - pre3:
>     - me: more PageDirty / swapcache handling
>     - Neil Brown: raid and md init fixes
>     - David Brownell: pci hotplug sanitization.
>     - Kanoj Sarcar: mips64 update
>     - Kai Germaschewski: ISDN sync
>     - Andreas Bombe: ieee1394 cleanups and fixes
>     - Johannes Erdfelt: USB update
>     - David Miller: Sparc and net update
>     - Trond Myklebust: RPC layer SMP fixes
>     - Thomas Sailer: mixed sound driver fixes
>     - Tigran Aivazian: use atomic_dec_and_lock() for free_uid()
>
>  - pre2:
>     - Peter Anvin: more P4 configuration parsing
>     - Stephen Tweedie: O_SYNC patches. Make O_SYNC/fsync/fdatasync
>       do the right thing.
>     - Keith Owens: make mdule loading use the right struct module size
>     - Boszormenyi Zoltan: get MTRR's right for the >32-bit case
>     - Alan Cox: various random documentation etc
>     - Dario Ballabio: EATA and u14-34f update
>     - Ivan Kokshaysky: unbreak alpha ruffian
>     - Richard Henderson: PCI bridge initialization on alpha
>     - Zach Brown: correct locking in Maestro driver
>     - Geert Uytterhoeven: more m68k updates
>     - Andrey Savochkin: eepro100 update
>     - Dag Brattli: irda update
>     - Johannes Erdfelt: USB update
>
>  - pre1: (for ISDN synchronization _ONLY_! Not complete!)
>     - Byron Stanoszek: correct decimal precision for CPU MHz in
>       /proc/cpuinfo
>     - Ollie Lho: SiS pirq routing.
>     - Andries Brouwer: isofs cleanups
>     - Matt Kraai: /proc read() on directories should return EISDIR, not EINVAL
>     - me: be stricter about what we accept as a PCI bridge setup.
>     - me: always set PCI interrupts to be level-triggered when we enable them.
>     - me: updated PageDirty and swap cache handling
>     - Peter Anvin: update A20 code to work without keyboard controller
>     - Kai Germaschewski: ISDN updates
>     - Russell King: ARM updates
>     - Geert Uytterhoeven: m68k updates
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> Please read the FAQ at http://www.tux.org/lkml/
>

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: test12-pre5
  2000-12-05 20:17                 ` test12-pre5 Alexander Viro
@ 2000-12-05 23:15                   ` Stephen C. Tweedie
  0 siblings, 0 replies; 19+ messages in thread
From: Stephen C. Tweedie @ 2000-12-05 23:15 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Linus Torvalds, Stephen C. Tweedie, Kernel Mailing List,
	Alexander Viro, Andrew Morton, Alan Cox, Christoph Rohland,
	Rik van Riel, MOLNAR Ingo

Hi,

On Tue, Dec 05, 2000 at 03:17:07PM -0500, Alexander Viro wrote:
> 
> 
> On Tue, 5 Dec 2000, Linus Torvalds wrote:
> 
> > And this is not just a "it happens to be like this" kind of thing. It
> > _has_ to be like this, because every time we call clear_inode() we are
> > going to physically free the memory associated with the inode very soon
> > afterwards. Which means that _any_ use of the inode had better be long
> > gone. Dirty buffers included.
> 
> Urgh. Linus, AFAICS we _all_ agree on that. The only real question is
> whether we consider calling clear_inode() with droppable dirty buffers
> to be OK. It can't happen on the dispose_list() path and I'ld rather
> see it _not_ happening on the delete_inode() one. It's a policy question,
> not the correctness one.

Right, because if we get this wrong, the kernel won't complain, but
we'll have a rare, impossible-to-diagnose potential data corrupter for
any applications which do recovery on reboot.  

If we're not going to change the code, then at the very least a huge
warning comment above clear_inode() is necessary to make it explicit
that you shouldn't ever pass in an inode with dirty buffers unless
nlink==0, and I'd rather see the BUG there instead.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: test12-pre5
  2000-12-05 19:48               ` test12-pre5 Linus Torvalds
@ 2000-12-05 20:17                 ` Alexander Viro
  2000-12-05 23:15                   ` test12-pre5 Stephen C. Tweedie
  0 siblings, 1 reply; 19+ messages in thread
From: Alexander Viro @ 2000-12-05 20:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, Kernel Mailing List, Alexander Viro,
	Andrew Morton, Alan Cox, Christoph Rohland, Rik van Riel,
	MOLNAR Ingo



On Tue, 5 Dec 2000, Linus Torvalds wrote:

> And this is not just a "it happens to be like this" kind of thing. It
> _has_ to be like this, because every time we call clear_inode() we are
> going to physically free the memory associated with the inode very soon
> afterwards. Which means that _any_ use of the inode had better be long
> gone. Dirty buffers included.

Urgh. Linus, AFAICS we _all_ agree on that. The only real question is
whether we consider calling clear_inode() with droppable dirty buffers
to be OK. It can't happen on the dispose_list() path and I'ld rather
see it _not_ happening on the delete_inode() one. It's a policy question,
not the correctness one.

IOW, I would prefer to have BUG() instead of invalidate_inode_buffers()
and let the ->delete_inode() make sure that list is empty. I'm not saying
that current code doesn't work. However, "let's clean after the
truncate_inode_page()/foo_delete_inode(), they might leave some junk
on the list" looks like a wrong thing.

Notice that policy wrt pages is already of that kind - clear_inode()
expects the callers to make sure that ->i_data.nr_pages is zero instead of
trying to clean after them. I think that we will be better off with
similar rules wrt dirty buffers list.

Comments?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: test12-pre5
  2000-12-05 18:59             ` test12-pre5 Alexander Viro
@ 2000-12-05 19:48               ` Linus Torvalds
  2000-12-05 20:17                 ` test12-pre5 Alexander Viro
  0 siblings, 1 reply; 19+ messages in thread
From: Linus Torvalds @ 2000-12-05 19:48 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Stephen C. Tweedie, Kernel Mailing List, Alexander Viro,
	Andrew Morton, Alan Cox, Christoph Rohland, Rik van Riel,
	MOLNAR Ingo



On Tue, 5 Dec 2000, Alexander Viro wrote:
> > 
> > Stephen is _wrong_ wrt fsync().
> > 
> > Why?
> > 
> > Think about it for a second. How the hell could you even _call_ fsync() on
> > a file that no longer exists, and has no file handles open to it?
> 		^^^^^^^^^^^^^^
> clear_inode() <- dispose_list() <- prune_icache().

No. prune_icache() will absolutely _refuse_ to prune an inode that is
still in use. Where "in use" is defined to be an aggregate of many things,
including the fact that the inode is dirty (even if there are no actual
references to it any more) _or_ the fact that the inode has cached data
associated with it.

Page cache counts as cached data.

As does dirty buffers.

So clean_inode() is basically always called only for "dead" objects. 

And this is not just a "it happens to be like this" kind of thing. It
_has_ to be like this, because every time we call clear_inode() we are
going to physically free the memory associated with the inode very soon
afterwards. Which means that _any_ use of the inode had better be long
gone. Dirty buffers included.

So it's definitely ok to say that "once we get to clean_inode(), there is
no way in h*ll that dirty buffers can be valid any more". Because either
we have checked that by hand (prune_icache), or we're killing the inode
outright (no more users and the inode has been removed).

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: test12-pre5
  2000-12-05 18:33           ` test12-pre5 Linus Torvalds
@ 2000-12-05 18:59             ` Alexander Viro
  2000-12-05 19:48               ` test12-pre5 Linus Torvalds
  0 siblings, 1 reply; 19+ messages in thread
From: Alexander Viro @ 2000-12-05 18:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, Kernel Mailing List, Alexander Viro,
	Andrew Morton, Alan Cox, Christoph Rohland, Rik van Riel,
	MOLNAR Ingo



On Tue, 5 Dec 2000, Linus Torvalds wrote:

> > So Stephen is right wrt fsync() (it will not get that stuff on disk).
> > However, it's not a bug - if that crap will not end up on disk we
> > will only win.
> 
> Stephen is _wrong_ wrt fsync().
> 
> Why?
> 
> Think about it for a second. How the hell could you even _call_ fsync() on
> a file that no longer exists, and has no file handles open to it?
		^^^^^^^^^^^^^^
clear_inode() <- dispose_list() <- prune_icache().

IOW, if file still exists, but is closed by everyone, etc. you _can_
get clear_inode() on it. <thinks> Ah, I see your point. OK, how about that:
	* clear_inode() _can_ be called for still-alive objects.
	* no matter how it is called, we don't give a damn for the stuff
on the list.
	* moreover, if it gets called for object that is still alive the
list is just empty. It doesn't contain anything valuable (as in every
case) _and_ it doesn't contain random crap.

If that's what you were talking about - fine with me.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: test12-pre5
  2000-12-05 17:48       ` test12-pre5 Linus Torvalds
  2000-12-05 18:14         ` test12-pre5 Alexander Viro
@ 2000-12-05 18:50         ` Stephen C. Tweedie
  1 sibling, 0 replies; 19+ messages in thread
From: Stephen C. Tweedie @ 2000-12-05 18:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, Alexander Viro, Kernel Mailing List,
	Alexander Viro, Andrew Morton, Alan Cox, Christoph Rohland,
	Rik van Riel, MOLNAR Ingo

Hi,

On Tue, Dec 05, 2000 at 09:48:51AM -0800, Linus Torvalds wrote:
> 
> On Tue, 5 Dec 2000, Stephen C. Tweedie wrote:
> > 
> > That is still buggy.  We MUST NOT invalidate the inode buffers unless
> > i_nlink == 0, because otherwise a subsequent open() and fsync() will
> > have forgotten what buffers are dirty, and hence will fail to
> > synchronise properly with the disk.
> 
> Are you all on drugs?
> 
> Look at where clear_inode() is called. It's called by
> ext2_delete_inode(). It's not called by close(). Never has, never will.

It is also called during prune_icache().  Yes, I know that the
CAN_UNUSE() macro should make sure we never try to do this on a
count==0 inode which still has dirty buffers, but I'd feel happier if
we had the safety net to guard against any callers ever doing a
clear_inode while we have dirty buffers around.

We have two completely different paths to clear_inode() here: one in
which there had better not be any dirty buffers or we've got a data
corrupter on our hands, and a second in which dirty buffers are
irrelevant because we're doing a delete.

That's the problem --- doing the invalidate in clear_inode() feels
like the wrong place to do it, because our treatment of the dirty
buffers list differs depending on the caller.  I'd be much happer with
the delete case doing the invalidate, and leave clear_inode BUG()ing
if there are any dirty buffers left, and that's basically what the
test with i_nlink==0 did.

> So I repeat: are there known bugs in this area left in pre5? And with
> "bugs", I don't mean fever-induced rants like the above (*).

No, I don't think so.

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: test12-pre5
  2000-12-05 18:14         ` test12-pre5 Alexander Viro
@ 2000-12-05 18:33           ` Linus Torvalds
  2000-12-05 18:59             ` test12-pre5 Alexander Viro
  0 siblings, 1 reply; 19+ messages in thread
From: Linus Torvalds @ 2000-12-05 18:33 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Stephen C. Tweedie, Kernel Mailing List, Alexander Viro,
	Andrew Morton, Alan Cox, Christoph Rohland, Rik van Riel,
	MOLNAR Ingo



On Tue, 5 Dec 2000, Alexander Viro wrote:
> 
> Sigh. OK, let me put it that way:
> 
> 	* we _can_ have dirty blocks on the list when inode gets freed.

Agreed.

> 	* no, CAN_UNUSE will not see them.

CAN_UNUSE() is not used at all for the final forcible removal of an inode
that has no users.

> 	* no, we do not give a damn about forgetting them.

Indeed. We'd be better off marking them clean, to make sure they never hit
the disk.

> So Stephen is right wrt fsync() (it will not get that stuff on disk).
> However, it's not a bug - if that crap will not end up on disk we
> will only win.

Stephen is _wrong_ wrt fsync().

Why?

Think about it for a second. How the hell could you even _call_ fsync() on
a file that no longer exists, and has no file handles open to it?

By the time we call clear_inode(), fsync() had better not be an issue any
more. The only reason fscyn() could be an issue is if somebody starts
calling clear_inode() left and right, but hey, if that happens your
filesystem would be so FUBAR'ed that you really wouldn't care any more.
Data integrity of "fsync()" is the least of your worries.

			Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: test12-pre5
  2000-12-05 17:48       ` test12-pre5 Linus Torvalds
@ 2000-12-05 18:14         ` Alexander Viro
  2000-12-05 18:33           ` test12-pre5 Linus Torvalds
  2000-12-05 18:50         ` test12-pre5 Stephen C. Tweedie
  1 sibling, 1 reply; 19+ messages in thread
From: Alexander Viro @ 2000-12-05 18:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, Kernel Mailing List, Alexander Viro,
	Andrew Morton, Alan Cox, Christoph Rohland, Rik van Riel,
	MOLNAR Ingo



On Tue, 5 Dec 2000, Linus Torvalds wrote:

> Get your acts together, guys. Stop blathering and frothing at the mouth.
> The only time clear_inode() should be called is (a) when we prune the
> inode cache - and we CLEARLY cannot prune an inode if it still has dirty
> blocks associated with it and CAN_UNUSE() already enforces this or (b)
> when we're getting rid of the very last occurrence of the inode. In one
> case the dirty list is already empty. In the other, invalidating it is
> clearly the right thing and the _required_ thing to do.


Sigh. OK, let me put it that way:

	* we _can_ have dirty blocks on the list when inode gets freed.
	* no, CAN_UNUSE will not see them.
	* no, we do not give a damn about forgetting them.

So Stephen is right wrt fsync() (it will not get that stuff on disk).
However, it's not a bug - if that crap will not end up on disk we
will only win.

However, I think that things would be cleaner if we wouldn't have that
call in clear_inode(). Why? Because filesystems that had ordered writing
the blocks would better tell us when they are no longer interested in it.
It does not matter for the current code. It may matter a lot for any
fs that does write ordering of any kind.

IOW, while with the current tree invalidate_inode_buffers() is not a band-aid
(just a bad misnomer) it may easily become such.

> So I repeat: are there known bugs in this area left in pre5?

AFAIK, none. I _really_ don't like the way we handle that stuff now (at
the very least let's rename the bloody thing - it doesn't invalidate
anything), but as far as I can see the code is technically correct.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: test12-pre5
  2000-12-05 17:09     ` test12-pre5 Stephen C. Tweedie
  2000-12-05 17:28       ` test12-pre5 Alexander Viro
@ 2000-12-05 17:48       ` Linus Torvalds
  2000-12-05 18:14         ` test12-pre5 Alexander Viro
  2000-12-05 18:50         ` test12-pre5 Stephen C. Tweedie
  1 sibling, 2 replies; 19+ messages in thread
From: Linus Torvalds @ 2000-12-05 17:48 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Alexander Viro, Kernel Mailing List, Alexander Viro,
	Andrew Morton, Alan Cox, Christoph Rohland, Rik van Riel,
	MOLNAR Ingo



On Tue, 5 Dec 2000, Stephen C. Tweedie wrote:
> 
> On Mon, Dec 04, 2000 at 08:00:03PM -0800, Linus Torvalds wrote:
> > 
> > On Mon, 4 Dec 2000, Alexander Viro wrote:
> > > 
> > This _is_ what clear_inode() does in pre5 (and in pre4, for that matter):
> > 
> > 	void clear_inode(struct inode *inode)
> > 	{  
> > 	        if (!list_empty(&inode->i_dirty_buffers))
> > 	                invalidate_inode_buffers(inode);
> 
> That is still buggy.  We MUST NOT invalidate the inode buffers unless
> i_nlink == 0, because otherwise a subsequent open() and fsync() will
> have forgotten what buffers are dirty, and hence will fail to
> synchronise properly with the disk.

Are you all on drugs?

Look at where clear_inode() is called. It's called by
ext2_delete_inode(). It's not called by close(). Never has, never will.

clear_inode() _destroys_ the inode. WE HAVE TO GET RID OF BUFFERS at that
point. Because there won't be anything left to index to. We're going to
free the memory in RAM held by the inode. Blathering on about
"i_nlink" etc is a complete and utter waste of time, because even if nlink
is a million, there's no way we could leave the buffers indexed on the
inode. We _will_ call destroy_inode() soon afterwards.

The thing that protects the inode while it is still truly in use, is the
CAN_UNUSE() macro, which explicitly tests that the inode is not used by
anybody and has no dirty pages. If that macro doesn't work, then we have a
damn more serious problem than nlink to worry about.

Get your acts together, guys. Stop blathering and frothing at the mouth.
The only time clear_inode() should be called is (a) when we prune the
inode cache - and we CLEARLY cannot prune an inode if it still has dirty
blocks associated with it and CAN_UNUSE() already enforces this or (b)
when we're getting rid of the very last occurrence of the inode. In one
case the dirty list is already empty. In the other, invalidating it is
clearly the right thing and the _required_ thing to do.

So I repeat: are there known bugs in this area left in pre5? And with
"bugs", I don't mean fever-induced rants like the above (*).

I don't see any. Andrew cannot re-create any. But I won't rest easy until
people agree that the code is ok as-is. After that I can consider applying
optimizations, because right now my personal conviction is that the
patches from Al are cleanups and optimizations, but _not_ bug-fixes. And
they will absolutely not get applied before there is some consensus on
these issues.

		Linus

(*) And yes, you can smack me on the head for that outburst if it turns
out that I just didn't see anything. I'll apologize. But right now I'm
irritated.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: test12-pre5
  2000-12-05 17:09     ` test12-pre5 Stephen C. Tweedie
@ 2000-12-05 17:28       ` Alexander Viro
  2000-12-05 17:48       ` test12-pre5 Linus Torvalds
  1 sibling, 0 replies; 19+ messages in thread
From: Alexander Viro @ 2000-12-05 17:28 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Linus Torvalds, Kernel Mailing List, Alexander Viro,
	Andrew Morton, Alan Cox, Christoph Rohland, Rik van Riel,
	MOLNAR Ingo



On Tue, 5 Dec 2000, Stephen C. Tweedie wrote:

> That is still buggy.  We MUST NOT invalidate the inode buffers unless
> i_nlink == 0, because otherwise a subsequent open() and fsync() will
> have forgotten what buffers are dirty, and hence will fail to
> synchronise properly with the disk.

Correction: they _will_ eventually end up on disk, but yes, fsync() may
miss them.

> Al, I agreed with your observation on bforget() needing the
> remove_inode_queue() call.  Is there anywhere else we need it?

unmap_buffer(). Same story, but on the data side. I don't see anything else
right now.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: test12-pre5
  2000-12-05  4:00   ` test12-pre5 Linus Torvalds
  2000-12-05  4:25     ` test12-pre5 Alexander Viro
@ 2000-12-05 17:09     ` Stephen C. Tweedie
  2000-12-05 17:28       ` test12-pre5 Alexander Viro
  2000-12-05 17:48       ` test12-pre5 Linus Torvalds
  1 sibling, 2 replies; 19+ messages in thread
From: Stephen C. Tweedie @ 2000-12-05 17:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alexander Viro, Kernel Mailing List, Alexander Viro,
	Andrew Morton, Stephen C. Tweedie, Alan Cox, Christoph Rohland,
	Rik van Riel, MOLNAR Ingo

Hi,

On Mon, Dec 04, 2000 at 08:00:03PM -0800, Linus Torvalds wrote:
> 
> On Mon, 4 Dec 2000, Alexander Viro wrote:
> > 
> This _is_ what clear_inode() does in pre5 (and in pre4, for that matter):
> 
> 	void clear_inode(struct inode *inode)
> 	{  
> 	        if (!list_empty(&inode->i_dirty_buffers))
> 	                invalidate_inode_buffers(inode);

That is still buggy.  We MUST NOT invalidate the inode buffers unless
i_nlink == 0, because otherwise a subsequent open() and fsync() will
have forgotten what buffers are dirty, and hence will fail to
synchronise properly with the disk.

Al, I agreed with your observation on bforget() needing the
remove_inode_queue() call.  Is there anywhere else we need it?

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: test12-pre5
  2000-12-05  3:20 test12-pre5 Linus Torvalds
                   ` (2 preceding siblings ...)
  2000-12-05  7:51 ` test12-pre5 Andrew Morton
@ 2000-12-05 15:48 ` Daniel Phillips
  2000-12-06 13:18 ` test12-pre5 Panu Matilainen
  4 siblings, 0 replies; 19+ messages in thread
From: Daniel Phillips @ 2000-12-05 15:48 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel

Linus Torvalds wrote:
> NOTE! There's another change to "writepage()" semantics than just dropping
> the "struct file": the new writepage() is supposed to mirror the logic of
> readpage(), and unlock the page when it is done with it. This allows the
> VM system more visibility into what IO is pending (which the VM doesn't
> take advantage of yet, but now it can _truly_ use the same logic for both
> swapout and for dirty file writeback).

Or maybe readpage should *not* unlock the page.  What if we wanted to
follow the writepage immediately by traversing the page's buffers?  We'd
have to lock the page again, and we wouldn't know what happened in the
interim.

Thanks for fixing the (struct file *)'s!  (major wart gone)

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: test12-pre5
  2000-12-05  3:20 test12-pre5 Linus Torvalds
  2000-12-05  3:42 ` test12-pre5 Alexander Viro
  2000-12-05  5:30 ` test12-pre5 Mohammad A. Haque
@ 2000-12-05  7:51 ` Andrew Morton
  2000-12-05 15:48 ` test12-pre5 Daniel Phillips
  2000-12-06 13:18 ` test12-pre5 Panu Matilainen
  4 siblings, 0 replies; 19+ messages in thread
From: Andrew Morton @ 2000-12-05  7:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kernel Mailing List, Alexander Viro, Stephen C. Tweedie,
	Alan Cox, Christoph Rohland, Rik van Riel, MOLNAR Ingo

Linus Torvalds wrote:
> 
> Ok, this contains one of the fixes for the dirty inode buffer list (the
> other fix is pending, simply because I still want to understand why it
> would be needed at all). Al?

I've run the same test suite against vanilla test12-pre5 on two machines for
five hours. On ext2/IDE/SMP+UP it's solid.

I'll test Al's latest bforget_inode patch overnight, but that's already
been through the wringer once.

> Also, it has the final installment of the PageDirty handling, and now
> officially direct IO can work by just marking the physical page dirty and
> be done with it. NFS along with all filesystems have been converted, the
> one hold-out still being smbfs.
> 
> Who works on smbfs these days? I see two ways of fixing smbfs now that
> "writepage()" only gets an anonymous page and no "struct file" information
> any more (this also fixes the double page unlock that Andrew saw).
                                                        ^^^^^^
                                                        Mike Galbraith.

-
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: test12-pre5
  2000-12-05  3:20 test12-pre5 Linus Torvalds
  2000-12-05  3:42 ` test12-pre5 Alexander Viro
@ 2000-12-05  5:30 ` Mohammad A. Haque
  2000-12-05  7:51 ` test12-pre5 Andrew Morton
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 19+ messages in thread
From: Mohammad A. Haque @ 2000-12-05  5:30 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List

The following fixes to many arguments error in fs/udf/inode.c for
test12-pre5

--- fs/udf/inode.c.orig Mon Dec  4 23:34:23 2000
+++ fs/udf/inode.c      Tue Dec  5 00:26:59 2000
@@ -202,7 +202,7 @@
        mark_buffer_dirty(bh);
        udf_release_data(bh);
 
-       inode->i_data.a_ops->writepage(NULL, page);
+       inode->i_data.a_ops->writepage(page);
        UnlockPage(page);
        page_cache_release(page);
 

-- 

=====================================================================
Mohammad A. Haque                              http://www.haque.net/ 
                                               mhaque@haque.net

  "Alcohol and calculus don't mix.             Project Lead
   Don't drink and derive." --Unknown          http://wm.themes.org/
                                               batmanppc@themes.org
=====================================================================
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: test12-pre5
  2000-12-05  4:00   ` test12-pre5 Linus Torvalds
@ 2000-12-05  4:25     ` Alexander Viro
  2000-12-05 17:09     ` test12-pre5 Stephen C. Tweedie
  1 sibling, 0 replies; 19+ messages in thread
From: Alexander Viro @ 2000-12-05  4:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kernel Mailing List, Alexander Viro, Andrew Morton,
	Stephen C. Tweedie, Alan Cox, Christoph Rohland, Rik van Riel,
	MOLNAR Ingo



On Mon, 4 Dec 2000, Linus Torvalds wrote:

> So? Why wouldn't clear_inode() get rid of it?

It will. Mea culpa. However, other reasons for taking the bh of freed
block from the list still stand. IOW, consider that part as an
optimization.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: test12-pre5
  2000-12-05  3:42 ` test12-pre5 Alexander Viro
@ 2000-12-05  4:00   ` Linus Torvalds
  2000-12-05  4:25     ` test12-pre5 Alexander Viro
  2000-12-05 17:09     ` test12-pre5 Stephen C. Tweedie
  0 siblings, 2 replies; 19+ messages in thread
From: Linus Torvalds @ 2000-12-05  4:00 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Kernel Mailing List, Alexander Viro, Andrew Morton,
	Stephen C. Tweedie, Alan Cox, Christoph Rohland, Rik van Riel,
	MOLNAR Ingo



On Mon, 4 Dec 2000, Alexander Viro wrote:
> 
> On Mon, 4 Dec 2000, Linus Torvalds wrote:
> > 
> > Ok, this contains one of the fixes for the dirty inode buffer list (the
> > other fix is pending, simply because I still want to understand why it
> > would be needed at all). Al?
> 
> See previous posting. BTW, -pre5 doesn't do the right thing in clear_inode().
> 
> Scenario: bh of indirect block is busy (whatever reason, flush_dirty_buffers(),
> anything that can bump ->b_count for a while). ext2_truncate() frees the
> thing and does bforget(). bh is left on the inode's list. Woops...

So? Why wouldn't clear_inode() get rid of it?

> The minimal fix would be to make clear_inode() empty the list. IMO it's
> worse than preventing the freed stuff from being on that list...

This _is_ what clear_inode() does in pre5 (and in pre4, for that matter):

	void clear_inode(struct inode *inode)
	{  
	        if (!list_empty(&inode->i_dirty_buffers))
	                invalidate_inode_buffers(inode);

		...

which I find perfectly readable. And should mean that no dirty buffers
should be associated with the inode any more. ext2 calls clear_inode()
from ext2_free_inode(), and as far as I can tell the only thing that can
happen after that is that the inode is still scheduled for write-out
(which explains how the bug you fixed would cause a dirty block to be
attached to the inode _after_ we did a clear_inode() on it).

Or are you thinking of something else?

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: test12-pre5
  2000-12-05  3:20 test12-pre5 Linus Torvalds
@ 2000-12-05  3:42 ` Alexander Viro
  2000-12-05  4:00   ` test12-pre5 Linus Torvalds
  2000-12-05  5:30 ` test12-pre5 Mohammad A. Haque
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Alexander Viro @ 2000-12-05  3:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kernel Mailing List, Alexander Viro, Andrew Morton,
	Stephen C. Tweedie, Alan Cox, Christoph Rohland, Rik van Riel,
	MOLNAR Ingo



On Mon, 4 Dec 2000, Linus Torvalds wrote:

> 
> Ok, this contains one of the fixes for the dirty inode buffer list (the
> other fix is pending, simply because I still want to understand why it
> would be needed at all). Al?

See previous posting. BTW, -pre5 doesn't do the right thing in clear_inode().

Scenario: bh of indirect block is busy (whatever reason, flush_dirty_buffers(),
anything that can bump ->b_count for a while). ext2_truncate() frees the
thing and does bforget(). bh is left on the inode's list. Woops...

The minimal fix would be to make clear_inode() empty the list. IMO it's
worse than preventing the freed stuff from being on that list...

Comments?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* test12-pre5
@ 2000-12-05  3:20 Linus Torvalds
  2000-12-05  3:42 ` test12-pre5 Alexander Viro
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Linus Torvalds @ 2000-12-05  3:20 UTC (permalink / raw)
  To: Kernel Mailing List
  Cc: Alexander Viro, Andrew Morton, Stephen C. Tweedie, Alan Cox,
	Christoph Rohland, Rik van Riel, MOLNAR Ingo


Ok, this contains one of the fixes for the dirty inode buffer list (the
other fix is pending, simply because I still want to understand why it
would be needed at all). Al?

Also, it has the final installment of the PageDirty handling, and now
officially direct IO can work by just marking the physical page dirty and
be done with it. NFS along with all filesystems have been converted, the
one hold-out still being smbfs.

Who works on smbfs these days? I see two ways of fixing smbfs now that
"writepage()" only gets an anonymous page and no "struct file" information
any more (this also fixes the double page unlock that Andrew saw).

 - disable shared mmap over smbfs (very easily done by just setting
   writepage to NULL)

 - fetch the dentry that writepage needs by just looking at the
   inode->i_dentry list and/or just make the smbfs page cache host be the
   dentry instead of the inode like other filesystems. The first approach
   assumes that all paths are equal for writeout, the second one assumes
   that there are no hard linking going on in smbfs.

Somebody more knowledgeable than I will have to make the decision
(otherwise I'll just end up disabling shared mmap - I doubt anybody really
uses it anyway, but it would be more polite to just support it).

NOTE! There's another change to "writepage()" semantics than just dropping
the "struct file": the new writepage() is supposed to mirror the logic of
readpage(), and unlock the page when it is done with it. This allows the
VM system more visibility into what IO is pending (which the VM doesn't
take advantage of yet, but now it can _truly_ use the same logic for both
swapout and for dirty file writeback).

The other change is that I forward-ported the ymfpci driver from 2.2.18,
as it works better than the ALSA one on my now-to-be-main-laptop ;)

[ Alan - I ahve your patches in my incoming queue still, I wanted to get
  an interim version out to check with Al on the block list and the VM
  stuff with Rik and people. ]

		Linus

----
 - pre5:
    - Jaroslav Kysela: ymfpci driver
    - me: get rid of bogus MS_INVALIDATE semantics
    - me: final part of the PageDirty() saga
    - Rusty Russell: 4-way SMP iptables fix
    - Al Viro: oops - bad ext2 inode dirty block bug

 - pre4:
    - Andries Brouwer: final isofs pieces.
    - Kai Germaschewski: ISDN
    - play CD audio correctly, don't stop after 12 minutes.
    - Anton Altaparmakov: disable NTFS mmap for now, as it doesn't work. 
    - Stephen Tweedie: fix inode dirty block handling
    - Bill Hartner: reschedule_idle - prefer right cpu
    - Johannes Erdfelt: USB updates
    - Alan Cox: synchronize
    - Richard Henderson: alpha updates and optimizations
    - Geert Uytterhoeven: fbdev could be fooled into crashing fix
    - Trond Myklebust: NFS filehandles in inode rather than dentry

 - pre3:
    - me: more PageDirty / swapcache handling
    - Neil Brown: raid and md init fixes
    - David Brownell: pci hotplug sanitization.
    - Kanoj Sarcar: mips64 update
    - Kai Germaschewski: ISDN sync
    - Andreas Bombe: ieee1394 cleanups and fixes
    - Johannes Erdfelt: USB update
    - David Miller: Sparc and net update
    - Trond Myklebust: RPC layer SMP fixes
    - Thomas Sailer: mixed sound driver fixes
    - Tigran Aivazian: use atomic_dec_and_lock() for free_uid()

 - pre2:
    - Peter Anvin: more P4 configuration parsing
    - Stephen Tweedie: O_SYNC patches. Make O_SYNC/fsync/fdatasync
      do the right thing.
    - Keith Owens: make mdule loading use the right struct module size
    - Boszormenyi Zoltan: get MTRR's right for the >32-bit case
    - Alan Cox: various random documentation etc
    - Dario Ballabio: EATA and u14-34f update
    - Ivan Kokshaysky: unbreak alpha ruffian
    - Richard Henderson: PCI bridge initialization on alpha
    - Zach Brown: correct locking in Maestro driver
    - Geert Uytterhoeven: more m68k updates
    - Andrey Savochkin: eepro100 update
    - Dag Brattli: irda update
    - Johannes Erdfelt: USB update

 - pre1: (for ISDN synchronization _ONLY_! Not complete!)
    - Byron Stanoszek: correct decimal precision for CPU MHz in
      /proc/cpuinfo
    - Ollie Lho: SiS pirq routing.
    - Andries Brouwer: isofs cleanups
    - Matt Kraai: /proc read() on directories should return EISDIR, not EINVAL
    - me: be stricter about what we accept as a PCI bridge setup.
    - me: always set PCI interrupts to be level-triggered when we enable them.
    - me: updated PageDirty and swap cache handling
    - Peter Anvin: update A20 code to work without keyboard controller
    - Kai Germaschewski: ISDN updates
    - Russell King: ARM updates
    - Geert Uytterhoeven: m68k updates

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2000-12-06 13:50 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <00120522275601.09076@gimli>
2000-12-05 21:58 ` test12-pre5 Linus Torvalds
2000-12-05  3:20 test12-pre5 Linus Torvalds
2000-12-05  3:42 ` test12-pre5 Alexander Viro
2000-12-05  4:00   ` test12-pre5 Linus Torvalds
2000-12-05  4:25     ` test12-pre5 Alexander Viro
2000-12-05 17:09     ` test12-pre5 Stephen C. Tweedie
2000-12-05 17:28       ` test12-pre5 Alexander Viro
2000-12-05 17:48       ` test12-pre5 Linus Torvalds
2000-12-05 18:14         ` test12-pre5 Alexander Viro
2000-12-05 18:33           ` test12-pre5 Linus Torvalds
2000-12-05 18:59             ` test12-pre5 Alexander Viro
2000-12-05 19:48               ` test12-pre5 Linus Torvalds
2000-12-05 20:17                 ` test12-pre5 Alexander Viro
2000-12-05 23:15                   ` test12-pre5 Stephen C. Tweedie
2000-12-05 18:50         ` test12-pre5 Stephen C. Tweedie
2000-12-05  5:30 ` test12-pre5 Mohammad A. Haque
2000-12-05  7:51 ` test12-pre5 Andrew Morton
2000-12-05 15:48 ` test12-pre5 Daniel Phillips
2000-12-06 13:18 ` test12-pre5 Panu Matilainen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).