linux-kernel.vger.kernel.org archive mirror
* hundreds of mount --bind mountpoints?
@ 2001-04-22 16:32 David L. Parsley
  2001-04-22 16:41 ` Alexander Viro
  2001-04-23 11:43 ` Christoph Rohland
  0 siblings, 2 replies; 42+ messages in thread
From: David L. Parsley @ 2001-04-22 16:32 UTC (permalink / raw)
  To: linux-kernel; +Cc: ingo.oeser, viro

Hi,

I'm still working on a packaging system for diskless (quasi-embedded)
devices.  The root filesystem is all tmpfs, and I attach packages inside
it.  Since symlinks in a tmpfs filesystem cost 4k each (ouch!), I'm
considering using mount --bind for everything.  This appears to use very
little memory, but I'm wondering if I'll run into problems when I start
having many hundreds of bind mountings.  Any feel for this?

regards,
	David

--
David L. Parsley
Roanoke College Network Administrator

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-22 16:32 hundreds of mount --bind mountpoints? David L. Parsley
@ 2001-04-22 16:41 ` Alexander Viro
  2001-04-23 11:43 ` Christoph Rohland
  1 sibling, 0 replies; 42+ messages in thread
From: Alexander Viro @ 2001-04-22 16:41 UTC (permalink / raw)
  To: David L. Parsley; +Cc: linux-kernel, ingo.oeser



On Sun, 22 Apr 2001, David L. Parsley wrote:

> Hi,
> 
> I'm still working on a packaging system for diskless (quasi-embedded)
> devices.  The root filesystem is all tmpfs, and I attach packages inside
> it.  Since symlinks in a tmpfs filesystem cost 4k each (ouch!), I'm
> considering using mount --bind for everything.  This appears to use very
> little memory, but I'm wondering if I'll run into problems when I start
> having many hundreds of bind mountings.  Any feel for this?

Memory use is sizeof(struct vfsmount) per binding. In principle, you can get
in trouble when the size of /proc/mounts grows past 4Kb - you'll get only the
first 4 (actually 3, IIRC) kilobytes, so stuff that relies on the contents
of that file may get unhappy. It's fixable, though.
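
For anyone curious how close a given box is to that limit, a quick userspace
check along these lines (illustrative only, not something posted in this
thread) will do:

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/mounts", "r");
	size_t total = 0;
	int c;

	if (!f) {
		perror("/proc/mounts");
		return 1;
	}
	while ((c = fgetc(f)) != EOF)
		total++;
	fclose(f);

	printf("/proc/mounts: %zu bytes (old procfs interface truncates near 4096)\n",
	       total);
	return 0;
}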


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-22 16:32 hundreds of mount --bind mountpoints? David L. Parsley
  2001-04-22 16:41 ` Alexander Viro
@ 2001-04-23 11:43 ` Christoph Rohland
  2001-04-23 13:17   ` Ingo Oeser
  2001-04-23 14:08   ` David L. Parsley
  1 sibling, 2 replies; 42+ messages in thread
From: Christoph Rohland @ 2001-04-23 11:43 UTC (permalink / raw)
  To: David L. Parsley; +Cc: linux-kernel, ingo.oeser, viro

Hi David,

On Sun, 22 Apr 2001, David L. Parsley wrote:
> I'm still working on a packaging system for diskless
> (quasi-embedded) devices.  The root filesystem is all tmpfs, and I
> attach packages inside it.  Since symlinks in a tmpfs filesystem
> cost 4k each (ouch!), I'm considering using mount --bind for
> everything.

What about fixing tmpfs instead?

Greetings
		Christoph



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-23 11:43 ` Christoph Rohland
@ 2001-04-23 13:17   ` Ingo Oeser
  2001-04-23 14:54     ` Christoph Rohland
  2001-04-23 14:08   ` David L. Parsley
  1 sibling, 1 reply; 42+ messages in thread
From: Ingo Oeser @ 2001-04-23 13:17 UTC (permalink / raw)
  To: Christoph Rohland; +Cc: David L. Parsley, linux-kernel, viro

On Mon, Apr 23, 2001 at 01:43:27PM +0200, Christoph Rohland wrote:
> On Sun, 22 Apr 2001, David L. Parsley wrote:
> > attach packages inside it.  Since symlinks in a tmpfs filesystem
> > cost 4k each (ouch!), I'm considering using mount --bind for
> > everything.
> 
> What about fixing tmpfs instead?

The question is: How? If you do it like ramfs, you cannot swap
these symlinks and this is effectively a mlock(symlink) operation
allowed for normal users. -> BAD!

One idea is to only use a page if the entry is about to be pushed into
swap, thus wasting only swap (of which we have more), not memory.

But allocating a page on memory pressure is also not a bright
idea.

OTOH we could force this entry to swap immediately after copying it from
the dentry. Then we can do a GFP_ATOMIC allocation, do not do too much
harm to memory pressure, and only make the IO a bit stormier.

I think there are a lot of races here which I don't see yet.

So please don't beat me too much if this is a completely stupid
idea, ok?  ;-)


Regards

Ingo Oeser
-- 
10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag>
         <<<<<<<<<<<<     been there and had much fun   >>>>>>>>>>>>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-23 11:43 ` Christoph Rohland
  2001-04-23 13:17   ` Ingo Oeser
@ 2001-04-23 14:08   ` David L. Parsley
  2001-04-23 14:14     ` Alexander Viro
  1 sibling, 1 reply; 42+ messages in thread
From: David L. Parsley @ 2001-04-23 14:08 UTC (permalink / raw)
  To: Christoph Rohland; +Cc: linux-kernel, ingo.oeser, viro

Christoph Rohland wrote:
> 
> Hi David,
> 
> On Sun, 22 Apr 2001, David L. Parsley wrote:
> > I'm still working on a packaging system for diskless
> > (quasi-embedded) devices.  The root filesystem is all tmpfs, and I
> > attach packages inside it.  Since symlinks in a tmpfs filesystem
> > cost 4k each (ouch!), I'm considering using mount --bind for
> > everything.
> 
> What about fixing tmpfs instead?

That would be great - are you volunteering? ;-)  Seriously - I might be
able to look at what ramfs does and port that to tmpfs for my needs, but
that's about the extent of my kernel hacking skills.  For now, mount
--bind looks like it'll work just fine.  If somebody wants to fix tmpfs,
I'll be happy to test patches; it'll just change a couple of lines in my
package loading logic (mount --bind x y -> ln -s x y).

What I'm not sure of is which solution is actually 'better' - I'm
guessing that performance-wise, neither will make a noticeable
difference, so memory usage would be the deciding factor.  If I
can get a lot closer to the size of a symlink (10-20 bytes), that would
be best.  The issue with /proc/mounts really shouldn't hurt anything - I
could almost get by without mounting /proc anyway; it's mainly a
convenience.

regards,
	David

-- 
David L. Parsley
Network Administrator
Roanoke College

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-23 14:08   ` David L. Parsley
@ 2001-04-23 14:14     ` Alexander Viro
  0 siblings, 0 replies; 42+ messages in thread
From: Alexander Viro @ 2001-04-23 14:14 UTC (permalink / raw)
  To: David L. Parsley; +Cc: Christoph Rohland, linux-kernel, ingo.oeser



On Mon, 23 Apr 2001, David L. Parsley wrote:

> What I'm not sure of is which solution is actually 'better' - I'm
> guessing that performance-wise, neither will make a noticable
> difference, so I guess memory usage would be the deciding factor.  If I

Bindings are faster on lookup. For obvious reasons - in case of symlinks
you do name resolution every time you traverse the link; in case of
bindings it is done when you create them.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-23 13:17   ` Ingo Oeser
@ 2001-04-23 14:54     ` Christoph Rohland
  2001-04-23 15:23       ` Ingo Oeser
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Rohland @ 2001-04-23 14:54 UTC (permalink / raw)
  To: Ingo Oeser; +Cc: David L. Parsley, linux-kernel, viro

Hi Ingo,

On Mon, 23 Apr 2001, Ingo Oeser wrote:
> On Mon, Apr 23, 2001 at 01:43:27PM +0200, Christoph Rohland wrote:
>> On Sun, 22 Apr 2001, David L. Parsley wrote:
>> > attach packages inside it.  Since symlinks in a tmpfs filesystem
>> > cost 4k each (ouch!), I'm considering using mount --bind for
>> > everything.
>> 
>> What about fixing tmpfs instead?
> 
> The question is: How? If you do it like ramfs, you cannot swap
> these symlinks and this is effectively a mlock(symlink) operation
> allowed for normal users. -> BAD!

How about storing it in the inode structure if it fits into the
fs-private union? If it is too big, we allocate the page as we do
now. The union has 192 bytes, which should be sufficient for most
cases.
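
As a rough sketch of that policy (userspace-only illustration; the 192-byte
figure is the union size quoted above, everything else is made up for the
example):

#include <stdio.h>
#include <string.h>

#define UNION_SIZE	192	/* sizeof(inode->u) as quoted above */
#define PAGE_SIZE	4096

/* Extra memory a symlink would cost under the proposed scheme. */
static size_t symlink_cost(const char *target)
{
	size_t len = strlen(target) + 1;	/* target plus NUL */

	if (len <= UNION_SIZE)
		return 0;	/* inlined into the inode, no page needed */
	return PAGE_SIZE;	/* too long: fall back to a page, as today */
}

int main(void)
{
	const char *t[] = { "/packages/base/bin/busybox",
			    "../../../some/very/deep/target/path" };
	int i;

	for (i = 0; i < 2; i++)
		printf("%-40s %zu extra bytes\n", t[i], symlink_cost(t[i]));
	return 0;
}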

Greetings
		Christoph



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-23 14:54     ` Christoph Rohland
@ 2001-04-23 15:23       ` Ingo Oeser
  2001-04-23 15:36         ` Alexander Viro
  0 siblings, 1 reply; 42+ messages in thread
From: Ingo Oeser @ 2001-04-23 15:23 UTC (permalink / raw)
  To: Christoph Rohland; +Cc: David L. Parsley, linux-kernel, viro

Hi Chris,

On Mon, Apr 23, 2001 at 04:54:02PM +0200, Christoph Rohland wrote:
> > The question is: How? If you do it like ramfs, you cannot swap
> > these symlinks and this is effectively a mlock(symlink) operation
> > allowed for normal users. -> BAD!
> 
> How about storing it into the inode structure if it fits into the
> fs-private union? If it is too big we allocate the page as we do it
> now. The union has 192 bytes. This should be sufficient for most
> cases.

Great idea. We allocate this space anyway. And we don't have to
care about the internals of this union, because we never have to use
it outside the kernel ;-)

I like it. ext2fs does the same, so there should be no VFS
hassles involved. Al?

Regards

Ingo Oeser
-- 
10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag>
         <<<<<<<<<<<<     been there and had much fun   >>>>>>>>>>>>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-23 15:23       ` Ingo Oeser
@ 2001-04-23 15:36         ` Alexander Viro
  2001-04-23 20:45           ` Ingo Oeser
                             ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Alexander Viro @ 2001-04-23 15:36 UTC (permalink / raw)
  To: Ingo Oeser; +Cc: Christoph Rohland, David L. Parsley, linux-kernel



On Mon, 23 Apr 2001, Ingo Oeser wrote:

> Hi Chris,
> 
> On Mon, Apr 23, 2001 at 04:54:02PM +0200, Christoph Rohland wrote:
> > > The question is: How? If you do it like ramfs, you cannot swap
> > > these symlinks and this is effectively a mlock(symlink) operation
> > > allowed for normal users. -> BAD!
> > 
> > How about storing it into the inode structure if it fits into the
> > fs-private union? If it is too big we allocate the page as we do it
> > now. The union has 192 bytes. This should be sufficient for most
> > cases.
> 
> Great idea. We allocate this space anyway. And we don't have to
> care about the internals of this union, because never have to use
> it outside the kernel ;-)
> 
> I like it. ext2fs does the same, so there should be no VFS
> hassles involved. Al?

We should get ext2 and friends to move the sucker _out_ of struct inode.
As it is, sizeof(struct inode) is way too large. This is 2.5 stuff, but
it really has to be done. More filesystems adding stuff into the union
is a Bad Thing(tm). If you want to allocate space - allocate it yourself;
->clear_inode() is the right place for freeing it.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-23 15:36         ` Alexander Viro
@ 2001-04-23 20:45           ` Ingo Oeser
  2001-04-23 20:56             ` Christoph Hellwig
                               ` (2 more replies)
  2001-04-23 21:19           ` Richard Gooch
  2001-04-24  6:33           ` Christoph Rohland
  2 siblings, 3 replies; 42+ messages in thread
From: Ingo Oeser @ 2001-04-23 20:45 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Christoph Rohland, David L. Parsley, linux-kernel

On Mon, Apr 23, 2001 at 11:36:24AM -0400, Alexander Viro wrote:
> > Great idea. We allocate this space anyway. And we don't have to
> > care about the internals of this union, because never have to use
> > it outside the kernel ;-)
> > 
> > I like it. ext2fs does the same, so there should be no VFS
> > hassles involved. Al?
> 
> We should get ext2 and friends to move the sucker _out_ of struct inode.
> As it is, sizeof(struct inode) is way too large. This is 2.5 stuff, but
> it really has to be done. More filesystems adding stuff into the union
> is a Bad Thing(tm). If you want to allocate space - allocate it yourself;
> ->clear_inode() is the right place for freeing it.

You need an inode anyway. So why not use the space in it? tmpfs
would only use the spare sizeof(inode->u) - sizeof(struct shmem_inode_info)
bytes for this kind of symlink.

Last time we suggested this, people ended up with some OS trying
it and getting worse performance. 

Why? You need to allocate the VFS-inode (vnode in other OSs) and
the on-disk-inode anyway at the same time. You get better
performance and less fragmentation, if you allocate them both
together[1].

So keeping that struct inode around is ok.

BTW: Is it still less than one page? Then it doesn't make me
   nervous. Why? Guess what granularity we allocate at if we
   just store pointers instead of the inode.u. Or do you like
   every FS creating its own slab cache?

Regards

Ingo Oeser

[1] Which is true for other allocations, too.
-- 
10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag>
         <<<<<<<<<<<<     been there and had much fun   >>>>>>>>>>>>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-23 20:45           ` Ingo Oeser
@ 2001-04-23 20:56             ` Christoph Hellwig
  2001-04-23 22:00               ` Ingo Oeser
  2001-04-23 22:47             ` Andreas Dilger
  2001-04-24  1:37             ` Jan Harkes
  2 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2001-04-23 20:56 UTC (permalink / raw)
  To: Ingo Oeser
  Cc: Alexander Viro, Christoph Rohland, David L. Parsley, linux-kernel

In article <20010423224505.H719@nightmaster.csn.tu-chemnitz.de> you wrote:
> Last time we suggested this, people ended up with some OS trying
> it and getting worse performance. 

Which OS? Neither BSD nor SVR4/SVR5 (or even SVR3) do that.

> Why? You need to allocate the VFS-inode (vnode in other OSs) and
> the on-disk-inode anyway at the same time. You get better
> performance and less fragmentation, if you allocate them both
> together[1].

Because having a union in generic code that includes filesystem-specific
members is ugly? It's one of those "a little more performance for a lot of
bad style" optimizations.

	Christoph


-- 
Of course it doesn't work. We've performed a software upgrade.
Whip me.  Beat me.  Make me maintain AIX.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-23 15:36         ` Alexander Viro
  2001-04-23 20:45           ` Ingo Oeser
@ 2001-04-23 21:19           ` Richard Gooch
  2001-04-23 22:42             ` Albert D. Cahalan
  2001-04-23 22:49             ` Richard Gooch
  2001-04-24  6:33           ` Christoph Rohland
  2 siblings, 2 replies; 42+ messages in thread
From: Richard Gooch @ 2001-04-23 21:19 UTC (permalink / raw)
  To: Ingo Oeser
  Cc: Alexander Viro, Christoph Rohland, David L. Parsley, linux-kernel

Ingo Oeser writes:
> On Mon, Apr 23, 2001 at 11:36:24AM -0400, Alexander Viro wrote:
> > > Great idea. We allocate this space anyway. And we don't have to
> > > care about the internals of this union, because never have to use
> > > it outside the kernel ;-)
> > > 
> > > I like it. ext2fs does the same, so there should be no VFS
> > > hassles involved. Al?
> > 
> > We should get ext2 and friends to move the sucker _out_ of struct inode.
> > As it is, sizeof(struct inode) is way too large. This is 2.5 stuff, but
> > it really has to be done. More filesystems adding stuff into the union
> > is a Bad Thing(tm). If you want to allocate space - allocate it yourself;
> > ->clear_inode() is the right place for freeing it.
> 
> You need an inode anyway. So why not using the space in it? tmpfs
> would only use sizeof(*inode.u)-sizeof(struct shmem_inode_info) for
> this kind of symlinks.
> 
> Last time we suggested this, people ended up with some OS trying
> it and getting worse performance. 
> 
> Why? You need to allocate the VFS-inode (vnode in other OSs) and
> the on-disk-inode anyway at the same time. You get better
> performance and less fragmentation, if you allocate them both
> together[1].

We want to take out that union because it sucks for virtual
filesystems. Besides, it's ugly.

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-23 20:56             ` Christoph Hellwig
@ 2001-04-23 22:00               ` Ingo Oeser
  2001-04-23 22:10                 ` Alexander Viro
  0 siblings, 1 reply; 42+ messages in thread
From: Ingo Oeser @ 2001-04-23 22:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Alexander Viro, Christoph Rohland, David L. Parsley, lm, linux-kernel

On Mon, Apr 23, 2001 at 10:56:16PM +0200, Christoph Hellwig wrote:
> In article <20010423224505.H719@nightmaster.csn.tu-chemnitz.de> you wrote:
> > Last time we suggested this, people ended up with some OS trying
> > it and getting worse performance. 
> 
> Which OS? Neither BSD nor SVR4/SVR5 (or even SVR3) do that.

Don't remember. I think Larry McVoy told the story, so I cc'ed
him ;-)

> Because having an union in generic code that includes filesystem-specific
> memebers is ugly? It's one of those a little more performance for a lot of
> bad style optimizations.

We have this kind of stuff all over the place. If we allocate
some small amount of memory and also need some small amount
associated with this memory, there is no problem with a little
waste.

Waste is better than fragmentation. This is the lesson people
learned from segments in the ia32.

Objects are easier to manage if they are the same size.

Regards

Ingo Oeser
-- 
10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag>
         <<<<<<<<<<<<     been there and had much fun   >>>>>>>>>>>>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-23 22:00               ` Ingo Oeser
@ 2001-04-23 22:10                 ` Alexander Viro
  0 siblings, 0 replies; 42+ messages in thread
From: Alexander Viro @ 2001-04-23 22:10 UTC (permalink / raw)
  To: Ingo Oeser
  Cc: Christoph Hellwig, Christoph Rohland, David L. Parsley, lm, linux-kernel



On Tue, 24 Apr 2001, Ingo Oeser wrote:

> We have this kind of stuff all over the place. If we allocate
> some small amount of memory and and need some small amount
> associated with this memory, there is no problem with a little
> waste.

Little? How about a quarter of a kilobyte per inode? sizeof(struct inode)
is nearly half a kilobyte. And the icache can easily get to ~100000 elements.

> Waste is better than fragmentation. This is the lesson people
> learned from segments in the ia32.
> 
> Objects are easier to manage, if they are the same size.

So don't keep them in the same cache. Notice that quite a few systems
keep the vnode separate from the fs-specific data. For a very good reason.

								Al


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-23 21:19           ` Richard Gooch
@ 2001-04-23 22:42             ` Albert D. Cahalan
  2001-04-23 22:49             ` Richard Gooch
  1 sibling, 0 replies; 42+ messages in thread
From: Albert D. Cahalan @ 2001-04-23 22:42 UTC (permalink / raw)
  To: Richard Gooch
  Cc: Ingo Oeser, Alexander Viro, Christoph Rohland, David L. Parsley,
	linux-kernel

Richard Gooch writes:

> We want to take out that union because it sucks for virtual
> filesystems. Besides, it's ugly.

I hope you won't mind if people trash this with benchmarks.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-23 20:45           ` Ingo Oeser
  2001-04-23 20:56             ` Christoph Hellwig
@ 2001-04-23 22:47             ` Andreas Dilger
  2001-04-24  1:37             ` Jan Harkes
  2 siblings, 0 replies; 42+ messages in thread
From: Andreas Dilger @ 2001-04-23 22:47 UTC (permalink / raw)
  To: Ingo Oeser
  Cc: Alexander Viro, Christoph Rohland, David L. Parsley, linux-kernel

Ingo Oeser writes:
> > We should get ext2 and friends to move the sucker _out_ of struct inode.
> > As it is, sizeof(struct inode) is way too large. This is 2.5 stuff, but
> > it really has to be done. More filesystems adding stuff into the union
> > is a Bad Thing(tm). If you want to allocate space - allocate it yourself;
> > ->clear_inode() is the right place for freeing it.
> 
> BTW: Is it still less than one page? Then it doesn't make me
>    nervous. Why? Guess what granularity we allocate at, if we
>    just store pointers instead of the inode.u. Or do you like
>    every FS creating his own slab cache?

I would much rather we allocate a slab cache for each fs type (it
would be trivial at register_fs time).  Most people have only a limited
number of filesystems active at a single time, yet tens or hundreds of
thousands of inodes in the inode slab cache.  Making the per-fs private
inode data as small as possible would reduce memory wastage considerably,
and not impact performance (AFAICS) if we use a per-fs type slab cache
for fs private data.

Consider: when I was doing an fs benchmark, my inode slab cache was
over 120k items on a 128MB machine.  At 480 bytes per inode, this is
almost 58 MB, close to half of RAM.  Reducing this to exactly ext2-sized
inodes would save (50 - 27) * 4 * 120k = 11MB of memory (on 32-bit
systems)!!! (This assumes nfs_inode_info is the largest.)
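
Spelling that arithmetic out (same figures as above, nothing new):

#include <stdio.h>

int main(void)
{
	double inodes = 120000;		/* inode slab entries observed    */
	double inode_size = 480;	/* bytes per struct inode         */
	double union_words = 50;	/* 4-byte words in the largest .u */
	double ext2_words = 27;		/* words ext2 actually needs      */

	printf("icache footprint: %.1f MB\n",
	       inodes * inode_size / 1e6);
	printf("saved with ext2-sized inodes: %.1f MB\n",
	       inodes * (union_words - ext2_words) * 4 / 1e6);
	return 0;
}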

This also makes it possible to safely (and efficiently) use external
filesystem modules without the need to recompile the kernel.  Granted,
if the external filesystem doesn't use more than the largest .u struct,
then it is currently possible as well, but that number changes, so it
is not safe.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-23 21:19           ` Richard Gooch
  2001-04-23 22:42             ` Albert D. Cahalan
@ 2001-04-23 22:49             ` Richard Gooch
  2001-04-23 23:07               ` Alexander Viro
  2001-04-23 23:13               ` Richard Gooch
  1 sibling, 2 replies; 42+ messages in thread
From: Richard Gooch @ 2001-04-23 22:49 UTC (permalink / raw)
  To: Albert D. Cahalan
  Cc: Ingo Oeser, Alexander Viro, Christoph Rohland, David L. Parsley,
	linux-kernel

Albert D. Cahalan writes:
> Richard Gooch writes:
> 
> > We want to take out that union because it sucks for virtual
> > filesystems. Besides, it's ugly.
> 
> I hope you won't mind if people trash this with benchmarks.

But they can't. At least, not for a well designed patch. If there is a
real issue of fragmentation, then there are ways to fix that without
using a bloated union structure. Don't punish some filesystems just
because others have a problem.

Solutions to avoid fragmentation:

- keep a separate VFSinode and FSinode slab cache
- allocate an enlarged VFSinode that contains the FSinode at the end,
  with the generic pointer in the VFSinode part pointing to FSinode
  part.

It's simply wrong to bloat everyone because some random FS found it
easier to throw in a union.
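
A toy model of the second option above (one allocation carrying both parts;
all struct names here are made up for illustration, this is not kernel code):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct vfs_inode {			/* stand-in for the generic inode  */
	unsigned long i_ino;
	void *generic_ip;		/* points at the fs-private tail   */
};

struct foofs_inode_info {		/* stand-in for fs-private data    */
	char symlink[64];
};

struct foofs_inode {			/* the "enlarged" inode            */
	struct vfs_inode vfs;
	struct foofs_inode_info fs;
};

static struct vfs_inode *foofs_alloc_inode(void)
{
	struct foofs_inode *big = calloc(1, sizeof(*big));

	if (!big)
		return NULL;
	big->vfs.generic_ip = &big->fs;	/* generic pointer -> private part */
	return &big->vfs;
}

int main(void)
{
	struct vfs_inode *inode = foofs_alloc_inode();
	struct foofs_inode_info *priv;

	if (!inode)
		return 1;
	priv = inode->generic_ip;
	strcpy(priv->symlink, "/packages/base");
	printf("one %zu-byte allocation instead of two\n",
	       sizeof(struct foofs_inode));
	free(inode);	/* vfs part comes first, so this frees the whole thing */
	return 0;
}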

Besides, for every benchmark that shows how fragmentation hurts, I can
generate a benchmark showing how inode bloat hurts. Lies, damn lies
and benchmarks.

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-23 22:49             ` Richard Gooch
@ 2001-04-23 23:07               ` Alexander Viro
  2001-04-23 23:24                 ` Rik van Riel
  2001-04-23 23:13               ` Richard Gooch
  1 sibling, 1 reply; 42+ messages in thread
From: Alexander Viro @ 2001-04-23 23:07 UTC (permalink / raw)
  To: Richard Gooch
  Cc: Albert D. Cahalan, Ingo Oeser, Christoph Rohland,
	David L. Parsley, linux-kernel



On Mon, 23 Apr 2001, Richard Gooch wrote:

> - keep a separate VFSinode and FSinode slab cache

Yup.

> - allocate an enlarged VFSinode that contains the FSinode at the end,
>   with the generic pointer in the VFSinode part pointing to FSinode
>   part.

Please, don't. It would help with bloat only if you allocated these
beasts separately for each fs, and then you end up with _many_ allocators
that can generate a pointer to struct inode.

"One type - one allocator" is a good rule - violating it turns into major
PITA couple of years down the road 9 times out of 10.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-23 22:49             ` Richard Gooch
  2001-04-23 23:07               ` Alexander Viro
@ 2001-04-23 23:13               ` Richard Gooch
  1 sibling, 0 replies; 42+ messages in thread
From: Richard Gooch @ 2001-04-23 23:13 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Albert D. Cahalan, Ingo Oeser, Christoph Rohland,
	David L. Parsley, linux-kernel

Alexander Viro writes:
> 
> 
> On Mon, 23 Apr 2001, Richard Gooch wrote:
> 
> > - keep a separate VFSinode and FSinode slab cache
> 
> Yup.
> 
> > - allocate an enlarged VFSinode that contains the FSinode at the end,
> >   with the generic pointer in the VFSinode part pointing to FSinode
> >   part.
> 
> Please, don't. It would help with bloat only if you allocated these
> beasts separately for each fs and then you end up with _many_ allocators
> that can generate pointer to struct inode. 
> 
> "One type - one allocator" is a good rule - violating it turns into
> major PITA couple of years down the road 9 times out of 10.

Agreed. The better option is the separate VFSinode and FSinode caches.
The enlarged inode scheme is also ugly, like the unions. It's just
less bloated :-)

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-23 23:07               ` Alexander Viro
@ 2001-04-23 23:24                 ` Rik van Riel
  0 siblings, 0 replies; 42+ messages in thread
From: Rik van Riel @ 2001-04-23 23:24 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Richard Gooch, Albert D. Cahalan, Ingo Oeser, Christoph Rohland,
	David L. Parsley, linux-kernel

On Mon, 23 Apr 2001, Alexander Viro wrote:
> On Mon, 23 Apr 2001, Richard Gooch wrote:
>
> > - keep a separate VFSinode and FSinode slab cache
>
> Yup.

Would it make sense to unify these with the struct
address_space ?

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-23 20:45           ` Ingo Oeser
  2001-04-23 20:56             ` Christoph Hellwig
  2001-04-23 22:47             ` Andreas Dilger
@ 2001-04-24  1:37             ` Jan Harkes
  2001-04-24  2:53               ` Alexander Viro
  2 siblings, 1 reply; 42+ messages in thread
From: Jan Harkes @ 2001-04-24  1:37 UTC (permalink / raw)
  To: Ingo Oeser
  Cc: Alexander Viro, Christoph Rohland, David L. Parsley, linux-kernel

On Mon, Apr 23, 2001 at 10:45:05PM +0200, Ingo Oeser wrote:
> Last time we suggested this, people ended up with some OS trying
> it and getting worse performance. 
> 
> Why? You need to allocate the VFS-inode (vnode in other OSs) and
> the on-disk-inode anyway at the same time. You get better
> performance and less fragmentation, if you allocate them both
> together[1].
> 
> So that struct inode around is ok.
> 
> BTW: Is it still less than one page? Then it doesn't make me
>    nervous. Why? Guess what granularity we allocate at, if we
>    just store pointers instead of the inode.u. Or do you like
>    every FS creating his own slab cache?

I've actually got the coda_inode_info (inode->u.u_coda_fs_i) split out
of the union in my development kernel. It doesn't shrink the size of the
struct inode yet, since Coda isn't the biggest user at the moment.

But it forced me to do several cleanups in the code, and it even
resulted in fixing a 'leak'. Not a real memory-loss leak, but we
left uninitialized inodes around in the icache for no good reason. Also,
changing a bit in a Coda-specific header file triggers an almost
complete rebuild of the whole kernel (coda.h -> coda_fs_i.h -> fs.h ->
everything?)

The allocation overhead really isn't that bad. kmalloc/kfree are also
using the slabcache, and a trivial variant using a 'private' slabcache
gave me the counters to find the 'leak' I mentioned before.

I can't really evaluate performance impacts. The struct inode is still
the same size, so for now there even is a little bit of additional
memory pressure. Also, Coda wasn't really developed to achieve high
performance but more to explore novel features.

Jan


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-24  1:37             ` Jan Harkes
@ 2001-04-24  2:53               ` Alexander Viro
  2001-04-24  9:58                 ` David Woodhouse
  2001-04-24 21:59                 ` Trond Myklebust
  0 siblings, 2 replies; 42+ messages in thread
From: Alexander Viro @ 2001-04-24  2:53 UTC (permalink / raw)
  To: Jan Harkes; +Cc: Ingo Oeser, Christoph Rohland, David L. Parsley, linux-kernel



On Mon, 23 Apr 2001, Jan Harkes wrote:

> On Mon, Apr 23, 2001 at 10:45:05PM +0200, Ingo Oeser wrote:

> > BTW: Is it still less than one page? Then it doesn't make me
> >    nervous. Why? Guess what granularity we allocate at, if we
> >    just store pointers instead of the inode.u. Or do you like
> >    every FS creating his own slab cache?

Oh, for crying out loud. All it takes is half an hour per filesystem.
Here is a completely untested patch that does it for NFS. It took about
that long. Absolutely straightforward, very easy to verify correctness.

Some stuff may need tweaking, but not much (e.g. some functions
should take nfs_inode_info instead of inodes, etc.). From the look
of the flushd cache it seems that we would be better off with cyclic
lists instead of singly-linked ones for the hash, but I didn't look
deeply enough.

So consider the patch below as proof-of-concept. Enjoy:

diff -urN S4-pre6/fs/nfs/flushd.c S4-pre6-nfs/fs/nfs/flushd.c
--- S4-pre6/fs/nfs/flushd.c	Sat Apr 21 14:35:21 2001
+++ S4-pre6-nfs/fs/nfs/flushd.c	Mon Apr 23 22:23:11 2001
@@ -162,11 +162,11 @@
 
 	if (NFS_FLAGS(inode) & NFS_INO_FLUSH)
 		goto out;
-	inode->u.nfs_i.hash_next = NULL;
+	NFS_I(inode)->hash_next = NULL;
 
 	q = &cache->inodes;
 	while (*q)
-		q = &(*q)->u.nfs_i.hash_next;
+		q = &NFS_I(*q)->hash_next;
 	*q = inode;
 
 	/* Note: we increase the inode i_count in order to prevent
@@ -188,9 +188,9 @@
 
 	q = &cache->inodes;
 	while (*q && *q != inode)
-		q = &(*q)->u.nfs_i.hash_next;
+		q = &NFS_I(*q)->hash_next;
 	if (*q) {
-		*q = inode->u.nfs_i.hash_next;
+		*q = NFS_I(inode)->hash_next;
 		NFS_FLAGS(inode) &= ~NFS_INO_FLUSH;
 		iput(inode);
 	}
@@ -238,8 +238,8 @@
 	cache->inodes = NULL;
 
 	while ((inode = next) != NULL) {
-		next = next->u.nfs_i.hash_next;
-		inode->u.nfs_i.hash_next = NULL;
+		next = NFS_I(next)->hash_next;
+		NFS_I(inode)->hash_next = NULL;
 		NFS_FLAGS(inode) &= ~NFS_INO_FLUSH;
 
 		if (flush) {
diff -urN S4-pre6/fs/nfs/inode.c S4-pre6-nfs/fs/nfs/inode.c
--- S4-pre6/fs/nfs/inode.c	Sat Apr 21 14:35:21 2001
+++ S4-pre6-nfs/fs/nfs/inode.c	Mon Apr 23 22:43:45 2001
@@ -40,11 +40,14 @@
 #define NFSDBG_FACILITY		NFSDBG_VFS
 #define NFS_PARANOIA 1
 
+static kmem_cache_t *nfs_inode_cachep;
+
 static struct inode * __nfs_fhget(struct super_block *, struct nfs_fh *, struct nfs_fattr *);
 void nfs_zap_caches(struct inode *);
 static void nfs_invalidate_inode(struct inode *);
 
 static void nfs_read_inode(struct inode *);
+static void nfs_clear_inode(struct inode *);
 static void nfs_delete_inode(struct inode *);
 static void nfs_put_super(struct super_block *);
 static void nfs_umount_begin(struct super_block *);
@@ -52,6 +55,7 @@
 
 static struct super_operations nfs_sops = { 
 	read_inode:	nfs_read_inode,
+	clear_inode:	nfs_clear_inode,
 	put_inode:	force_delete,
 	delete_inode:	nfs_delete_inode,
 	put_super:	nfs_put_super,
@@ -96,23 +100,44 @@
 static void
 nfs_read_inode(struct inode * inode)
 {
+	struct nfs_inode_info *nfsi;
+
+	nfsi = kmem_cache_alloc(nfs_inode_cachep, GFP_KERNEL);
+	if (!nfsi)
+		goto Enomem;
+
 	inode->i_blksize = inode->i_sb->s_blocksize;
 	inode->i_mode = 0;
 	inode->i_rdev = 0;
+	inode->u.generic_ip = nfsi;
 	NFS_FILEID(inode) = 0;
 	NFS_FSID(inode) = 0;
 	NFS_FLAGS(inode) = 0;
-	INIT_LIST_HEAD(&inode->u.nfs_i.read);
-	INIT_LIST_HEAD(&inode->u.nfs_i.dirty);
-	INIT_LIST_HEAD(&inode->u.nfs_i.commit);
-	INIT_LIST_HEAD(&inode->u.nfs_i.writeback);
-	inode->u.nfs_i.nread = 0;
-	inode->u.nfs_i.ndirty = 0;
-	inode->u.nfs_i.ncommit = 0;
-	inode->u.nfs_i.npages = 0;
+	INIT_LIST_HEAD(&nfsi->read);
+	INIT_LIST_HEAD(&nfsi->dirty);
+	INIT_LIST_HEAD(&nfsi->commit);
+	INIT_LIST_HEAD(&nfsi->writeback);
+	nfsi->nread = 0;
+	nfsi->ndirty = 0;
+	nfsi->ncommit = 0;
+	nfsi->npages = 0;
 	NFS_CACHEINV(inode);
 	NFS_ATTRTIMEO(inode) = NFS_MINATTRTIMEO(inode);
 	NFS_ATTRTIMEO_UPDATE(inode) = jiffies;
+	return;
+
+Enomem:
+	make_bad_inode(inode);
+	return;
+}
+
+static void
+nfs_clear_inode(struct inode * inode)
+{
+	struct nfs_inode_info *p = NFS_I(inode);
+	inode->u.generic_ip = NULL;
+	if (p)
+		kmem_cache_free(nfs_inode_cachep, p);
 }
 
 static void
@@ -594,7 +619,7 @@
 		NFS_CACHE_ISIZE(inode) = fattr->size;
 		NFS_ATTRTIMEO(inode) = NFS_MINATTRTIMEO(inode);
 		NFS_ATTRTIMEO_UPDATE(inode) = jiffies;
-		memcpy(&inode->u.nfs_i.fh, fh, sizeof(inode->u.nfs_i.fh));
+		memcpy(NFS_FH(inode), fh, sizeof(struct nfs_fh));
 	}
 	nfs_refresh_inode(inode, fattr);
 }
@@ -621,7 +646,7 @@
 		return 0;
 	if (NFS_FILEID(inode) != fattr->fileid)
 		return 0;
-	if (memcmp(&inode->u.nfs_i.fh, fh, sizeof(inode->u.nfs_i.fh)) != 0)
+	if (memcmp(NFS_FH(inode), fh, sizeof(struct nfs_fh)) != 0)
 		return 0;
 	return 1;
 }
@@ -640,7 +665,7 @@
 		return 1;
 
 	/* Has the filehandle changed? If so is the old one stale? */
-	if (memcmp(&inode->u.nfs_i.fh, fh, sizeof(inode->u.nfs_i.fh)) != 0 &&
+	if (memcmp(NFS_FH(inode), fh, sizeof(struct nfs_fh)) != 0 &&
 	    __nfs_revalidate_inode(NFS_SERVER(inode),inode) == -ESTALE)
 		return 1;
 
@@ -1056,6 +1081,24 @@
 extern int nfs_init_readpagecache(void);
 extern int nfs_destroy_readpagecache(void);
 
+int nfs_init_inodecache(void)
+{
+	nfs_inode_cachep = kmem_cache_create("nfs_inode_cache",
+					     sizeof(struct nfs_inode_info),
+					     0, SLAB_HWCACHE_ALIGN,
+					     NULL, NULL);
+	if (nfs_inode_cachep == NULL)
+		return -ENOMEM;
+
+	return 0;
+}
+
+void nfs_destroy_inodecache(void)
+{
+	if (kmem_cache_destroy(nfs_inode_cachep))
+		printk(KERN_INFO "nfs_inode_cache: not all structures were freed\n");
+}
+
 /*
  * Initialize NFS
  */
@@ -1067,6 +1110,10 @@
 	if (err)
 		return err;
 
+	err = nfs_init_inodecache();
+	if (err)
+		return err;
+
 	err = nfs_init_readpagecache();
 	if (err)
 		return err;
@@ -1080,6 +1127,7 @@
 static void __exit exit_nfs_fs(void)
 {
 	nfs_destroy_readpagecache();
+	nfs_destroy_inodecache();
 	nfs_destroy_nfspagecache();
 #ifdef CONFIG_PROC_FS
 	rpc_proc_unregister("nfs");
diff -urN S4-pre6/fs/nfs/read.c S4-pre6-nfs/fs/nfs/read.c
--- S4-pre6/fs/nfs/read.c	Sat Apr 21 14:35:21 2001
+++ S4-pre6-nfs/fs/nfs/read.c	Mon Apr 23 22:18:25 2001
@@ -153,7 +153,7 @@
 {
 	struct list_head	*head, *next;
 
-	head = &inode->u.nfs_i.read;
+	head = &NFS_I(inode)->read;
 	next = head->next;
 	while (next != head) {
 		struct nfs_page *req = nfs_list_entry(next);
@@ -183,11 +183,12 @@
 nfs_mark_request_read(struct nfs_page *req)
 {
 	struct inode *inode = req->wb_inode;
+	struct nfs_inode_info *nfsi = NFS_I(inode);
 
 	spin_lock(&nfs_wreq_lock);
 	if (list_empty(&req->wb_list)) {
-		nfs_list_add_request(req, &inode->u.nfs_i.read);
-		inode->u.nfs_i.nread++;
+		nfs_list_add_request(req, &nfsi->read);
+		nfsi->nread++;
 	}
 	spin_unlock(&nfs_wreq_lock);
 	/*
@@ -234,7 +235,7 @@
 			break;
 	}
 
-	if (inode->u.nfs_i.nread >= NFS_SERVER(inode)->rpages ||
+	if (NFS_I(inode)->nread >= NFS_SERVER(inode)->rpages ||
 	    page_index(page) == (inode->i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT)
 		nfs_pagein_inode(inode, 0, 0);
 	if (new)
@@ -372,10 +373,11 @@
 nfs_scan_read_timeout(struct inode *inode, struct list_head *dst)
 {
 	int	pages;
+	struct nfs_inode_info *nfsi = NFS_I(inode);
 	spin_lock(&nfs_wreq_lock);
-	pages = nfs_scan_list_timeout(&inode->u.nfs_i.read, dst, inode);
-	inode->u.nfs_i.nread -= pages;
-	if ((inode->u.nfs_i.nread == 0) != list_empty(&inode->u.nfs_i.read))
+	pages = nfs_scan_list_timeout(&nfsi->read, dst, inode);
+	nfsi->nread -= pages;
+	if ((nfsi->nread == 0) != list_empty(&nfsi->read))
 		printk(KERN_ERR "NFS: desynchronized value of nfs_i.nread.\n");
 	spin_unlock(&nfs_wreq_lock);
 	return pages;
@@ -385,10 +387,11 @@
 nfs_scan_read(struct inode *inode, struct list_head *dst, unsigned long idx_start, unsigned int npages)
 {
 	int	res;
+	struct nfs_inode_info *nfsi = NFS_I(inode);
 	spin_lock(&nfs_wreq_lock);
-	res = nfs_scan_list(&inode->u.nfs_i.read, dst, NULL, idx_start, npages);
-	inode->u.nfs_i.nread -= res;
-	if ((inode->u.nfs_i.nread == 0) != list_empty(&inode->u.nfs_i.read))
+	res = nfs_scan_list(&nfsi->read, dst, NULL, idx_start, npages);
+	nfsi->nread -= res;
+	if ((nfsi->nread == 0) != list_empty(&nfsi->read))
 		printk(KERN_ERR "NFS: desynchronized value of nfs_i.nread.\n");
 	spin_unlock(&nfs_wreq_lock);
 	return res;
diff -urN S4-pre6/fs/nfs/write.c S4-pre6-nfs/fs/nfs/write.c
--- S4-pre6/fs/nfs/write.c	Sat Apr 21 14:35:21 2001
+++ S4-pre6-nfs/fs/nfs/write.c	Mon Apr 23 22:15:06 2001
@@ -329,14 +329,15 @@
 static inline void
 nfs_inode_add_request(struct inode *inode, struct nfs_page *req)
 {
+	struct nfs_inode_info *nfsi = NFS_I(inode);
 	if (!list_empty(&req->wb_hash))
 		return;
 	if (!NFS_WBACK_BUSY(req))
 		printk(KERN_ERR "NFS: unlocked request attempted hashed!\n");
-	if (list_empty(&inode->u.nfs_i.writeback))
+	if (list_empty(&nfsi->writeback))
 		atomic_inc(&inode->i_count);
-	inode->u.nfs_i.npages++;
-	list_add(&req->wb_hash, &inode->u.nfs_i.writeback);
+	nfsi->npages++;
+	list_add(&req->wb_hash, &nfsi->writeback);
 	req->wb_count++;
 }
 
@@ -347,6 +348,7 @@
 nfs_inode_remove_request(struct nfs_page *req)
 {
 	struct inode *inode;
+	struct nfs_inode_info *nfsi;
 	spin_lock(&nfs_wreq_lock);
 	if (list_empty(&req->wb_hash)) {
 		spin_unlock(&nfs_wreq_lock);
@@ -355,12 +357,13 @@
 	if (!NFS_WBACK_BUSY(req))
 		printk(KERN_ERR "NFS: unlocked request attempted unhashed!\n");
 	inode = req->wb_inode;
+	nfsi = NFS_I(inode);
 	list_del(&req->wb_hash);
 	INIT_LIST_HEAD(&req->wb_hash);
-	inode->u.nfs_i.npages--;
-	if ((inode->u.nfs_i.npages == 0) != list_empty(&inode->u.nfs_i.writeback))
+	nfsi->npages--;
+	if ((nfsi->npages == 0) != list_empty(&nfsi->writeback))
 		printk(KERN_ERR "NFS: desynchronized value of nfs_i.npages.\n");
-	if (list_empty(&inode->u.nfs_i.writeback))
+	if (list_empty(&nfsi->writeback))
 		iput(inode);
 	if (!nfs_have_writebacks(inode) && !nfs_have_read(inode))
 		inode_remove_flushd(inode);
@@ -376,7 +379,7 @@
 {
 	struct list_head	*head, *next;
 
-	head = &inode->u.nfs_i.writeback;
+	head = &NFS_I(inode)->writeback;
 	next = head->next;
 	while (next != head) {
 		struct nfs_page *req = nfs_inode_wb_entry(next);
@@ -448,8 +451,9 @@
 
 	spin_lock(&nfs_wreq_lock);
 	if (list_empty(&req->wb_list)) {
-		nfs_list_add_request(req, &inode->u.nfs_i.dirty);
-		inode->u.nfs_i.ndirty++;
+		struct nfs_inode_info *nfsi = NFS_I(inode);
+		nfs_list_add_request(req, &nfsi->dirty);
+		nfsi->ndirty++;
 	}
 	spin_unlock(&nfs_wreq_lock);
 	/*
@@ -466,7 +470,7 @@
 nfs_dirty_request(struct nfs_page *req)
 {
 	struct inode *inode = req->wb_inode;
-	return !list_empty(&req->wb_list) && req->wb_list_head == &inode->u.nfs_i.dirty;
+	return !list_empty(&req->wb_list) && req->wb_list_head == &NFS_I(inode)->dirty;
 }
 
 #ifdef CONFIG_NFS_V3
@@ -480,8 +484,9 @@
 
 	spin_lock(&nfs_wreq_lock);
 	if (list_empty(&req->wb_list)) {
-		nfs_list_add_request(req, &inode->u.nfs_i.commit);
-		inode->u.nfs_i.ncommit++;
+		struct nfs_inode_info *nfsi = NFS_I(inode);
+		nfs_list_add_request(req, &nfsi->commit);
+		nfsi->ncommit++;
 	}
 	spin_unlock(&nfs_wreq_lock);
 	/*
@@ -657,7 +662,7 @@
 		idx_end = idx_start + npages - 1;
 
 	spin_lock(&nfs_wreq_lock);
-	head = &inode->u.nfs_i.writeback;
+	head = &NFS_I(inode)->writeback;
 	p = head->next;
 	while (p != head) {
 		unsigned long pg_idx;
@@ -720,10 +725,11 @@
 nfs_scan_dirty_timeout(struct inode *inode, struct list_head *dst)
 {
 	int	pages;
+	struct nfs_inode_info *nfsi = NFS_I(inode);
 	spin_lock(&nfs_wreq_lock);
-	pages = nfs_scan_list_timeout(&inode->u.nfs_i.dirty, dst, inode);
-	inode->u.nfs_i.ndirty -= pages;
-	if ((inode->u.nfs_i.ndirty == 0) != list_empty(&inode->u.nfs_i.dirty))
+	pages = nfs_scan_list_timeout(&nfsi->dirty, dst, inode);
+	nfsi->ndirty -= pages;
+	if ((nfsi->ndirty == 0) != list_empty(&nfsi->dirty))
 		printk(KERN_ERR "NFS: desynchronized value of nfs_i.ndirty.\n");
 	spin_unlock(&nfs_wreq_lock);
 	return pages;
@@ -734,10 +740,11 @@
 nfs_scan_commit_timeout(struct inode *inode, struct list_head *dst)
 {
 	int	pages;
+	struct nfs_inode_info *nfsi = NFS_I(inode);
 	spin_lock(&nfs_wreq_lock);
-	pages = nfs_scan_list_timeout(&inode->u.nfs_i.commit, dst, inode);
-	inode->u.nfs_i.ncommit -= pages;
-	if ((inode->u.nfs_i.ncommit == 0) != list_empty(&inode->u.nfs_i.commit))
+	pages = nfs_scan_list_timeout(&nfsi->commit, dst, inode);
+	nfsi->ncommit -= pages;
+	if ((nfsi->ncommit == 0) != list_empty(&nfsi->commit))
 		printk(KERN_ERR "NFS: desynchronized value of nfs_i.ncommit.\n");
 	spin_unlock(&nfs_wreq_lock);
 	return pages;
@@ -783,10 +790,11 @@
 nfs_scan_dirty(struct inode *inode, struct list_head *dst, struct file *file, unsigned long idx_start, unsigned int npages)
 {
 	int	res;
+	struct nfs_inode_info *nfsi = NFS_I(inode);
 	spin_lock(&nfs_wreq_lock);
-	res = nfs_scan_list(&inode->u.nfs_i.dirty, dst, file, idx_start, npages);
-	inode->u.nfs_i.ndirty -= res;
-	if ((inode->u.nfs_i.ndirty == 0) != list_empty(&inode->u.nfs_i.dirty))
+	res = nfs_scan_list(&nfsi->dirty, dst, file, idx_start, npages);
+	nfsi->ndirty -= res;
+	if ((nfsi->ndirty == 0) != list_empty(&nfsi->dirty))
 		printk(KERN_ERR "NFS: desynchronized value of nfs_i.ndirty.\n");
 	spin_unlock(&nfs_wreq_lock);
 	return res;
@@ -797,10 +805,11 @@
 nfs_scan_commit(struct inode *inode, struct list_head *dst, struct file *file, unsigned long idx_start, unsigned int npages)
 {
 	int	res;
+	struct nfs_inode_info *nfsi = NFS_I(inode);
 	spin_lock(&nfs_wreq_lock);
-	res = nfs_scan_list(&inode->u.nfs_i.commit, dst, file, idx_start, npages);
-	inode->u.nfs_i.ncommit -= res;
-	if ((inode->u.nfs_i.ncommit == 0) != list_empty(&inode->u.nfs_i.commit))
+	res = nfs_scan_list(&nfsi->commit, dst, file, idx_start, npages);
+	nfsi->ncommit -= res;
+	if ((nfsi->ncommit == 0) != list_empty(&nfsi->commit))
 		printk(KERN_ERR "NFS: desynchronized value of nfs_i.ncommit.\n");
 	spin_unlock(&nfs_wreq_lock);
 	return res;
@@ -885,7 +894,7 @@
 		/*
 		 * If we're over the soft limit, flush out old requests
 		 */
-		if (inode->u.nfs_i.npages >= MAX_REQUEST_SOFT)
+		if (NFS_I(inode)->npages >= MAX_REQUEST_SOFT)
 			nfs_wb_file(inode, file);
 		new = nfs_create_request(file, inode, page, offset, bytes);
 		if (!new)
@@ -952,8 +961,9 @@
 nfs_strategy(struct inode *inode)
 {
 	unsigned int	dirty, wpages;
+	struct nfs_inode_info *nfsi = NFS_I(inode);
 
-	dirty  = inode->u.nfs_i.ndirty;
+	dirty  = nfsi->ndirty;
 	wpages = NFS_SERVER(inode)->wpages;
 #ifdef CONFIG_NFS_V3
 	if (NFS_PROTO(inode)->version == 2) {
@@ -962,7 +972,7 @@
 	} else {
 		if (dirty >= wpages)
 			nfs_flush_file(inode, NULL, 0, 0, 0);
-		if (inode->u.nfs_i.ncommit > NFS_STRATEGY_PAGES * wpages &&
+		if (nfsi->ncommit > NFS_STRATEGY_PAGES * wpages &&
 		    atomic_read(&nfs_nr_requests) > MAX_REQUEST_SOFT)
 			nfs_commit_file(inode, NULL, 0, 0, 0);
 	}
@@ -974,7 +984,7 @@
 	 * If we're running out of free requests, flush out everything
 	 * in order to reduce memory useage...
 	 */
-	if (inode->u.nfs_i.npages > MAX_REQUEST_SOFT)
+	if (nfsi->npages > MAX_REQUEST_SOFT)
 		nfs_wb_all(inode);
 }
 
@@ -1141,7 +1151,7 @@
 	/* Set up the argument struct */
 	nfs_write_rpcsetup(head, data);
 	if (stable) {
-		if (!inode->u.nfs_i.ncommit)
+		if (!NFS_I(inode)->ncommit)
 			data->args.stable = NFS_FILE_SYNC;
 		else
 			data->args.stable = NFS_DATA_SYNC;
diff -urN S4-pre6/include/linux/fs.h S4-pre6-nfs/include/linux/fs.h
--- S4-pre6/include/linux/fs.h	Sat Apr 21 14:35:32 2001
+++ S4-pre6-nfs/include/linux/fs.h	Mon Apr 23 22:40:10 2001
@@ -448,7 +448,6 @@
 		struct msdos_inode_info		msdos_i;
 		struct umsdos_inode_info	umsdos_i;
 		struct iso_inode_info		isofs_i;
-		struct nfs_inode_info		nfs_i;
 		struct sysv_inode_info		sysv_i;
 		struct affs_inode_info		affs_i;
 		struct ufs_inode_info		ufs_i;
diff -urN S4-pre6/include/linux/nfs_fs.h S4-pre6-nfs/include/linux/nfs_fs.h
--- S4-pre6/include/linux/nfs_fs.h	Sat Apr 21 14:35:32 2001
+++ S4-pre6-nfs/include/linux/nfs_fs.h	Mon Apr 23 22:40:17 2001
@@ -63,39 +63,44 @@
  */
 #define NFS_SUPER_MAGIC			0x6969
 
-#define NFS_FH(inode)			(&(inode)->u.nfs_i.fh)
+static inline struct nfs_inode_info *NFS_I(struct inode *inode)
+{
+	return (struct nfs_inode_info *)inode->u.generic_ip;
+}
+
+#define NFS_FH(inode)			(&NFS_I(inode)->fh)
 #define NFS_SERVER(inode)		(&(inode)->i_sb->u.nfs_sb.s_server)
 #define NFS_CLIENT(inode)		(NFS_SERVER(inode)->client)
 #define NFS_PROTO(inode)		(NFS_SERVER(inode)->rpc_ops)
 #define NFS_REQUESTLIST(inode)		(NFS_SERVER(inode)->rw_requests)
 #define NFS_ADDR(inode)			(RPC_PEERADDR(NFS_CLIENT(inode)))
 #define NFS_CONGESTED(inode)		(RPC_CONGESTED(NFS_CLIENT(inode)))
-#define NFS_COOKIEVERF(inode)		((inode)->u.nfs_i.cookieverf)
-#define NFS_READTIME(inode)		((inode)->u.nfs_i.read_cache_jiffies)
-#define NFS_CACHE_CTIME(inode)		((inode)->u.nfs_i.read_cache_ctime)
-#define NFS_CACHE_MTIME(inode)		((inode)->u.nfs_i.read_cache_mtime)
-#define NFS_CACHE_ATIME(inode)		((inode)->u.nfs_i.read_cache_atime)
-#define NFS_CACHE_ISIZE(inode)		((inode)->u.nfs_i.read_cache_isize)
-#define NFS_NEXTSCAN(inode)		((inode)->u.nfs_i.nextscan)
+#define NFS_COOKIEVERF(inode)		(NFS_I(inode)->cookieverf)
+#define NFS_READTIME(inode)		(NFS_I(inode)->read_cache_jiffies)
+#define NFS_CACHE_CTIME(inode)		(NFS_I(inode)->read_cache_ctime)
+#define NFS_CACHE_MTIME(inode)		(NFS_I(inode)->read_cache_mtime)
+#define NFS_CACHE_ATIME(inode)		(NFS_I(inode)->read_cache_atime)
+#define NFS_CACHE_ISIZE(inode)		(NFS_I(inode)->read_cache_isize)
+#define NFS_NEXTSCAN(inode)		(NFS_I(inode)->nextscan)
 #define NFS_CACHEINV(inode) \
 do { \
 	NFS_READTIME(inode) = jiffies - NFS_MAXATTRTIMEO(inode) - 1; \
 } while (0)
-#define NFS_ATTRTIMEO(inode)		((inode)->u.nfs_i.attrtimeo)
+#define NFS_ATTRTIMEO(inode)		(NFS_I(inode)->attrtimeo)
 #define NFS_MINATTRTIMEO(inode) \
 	(S_ISDIR(inode->i_mode)? NFS_SERVER(inode)->acdirmin \
 			       : NFS_SERVER(inode)->acregmin)
 #define NFS_MAXATTRTIMEO(inode) \
 	(S_ISDIR(inode->i_mode)? NFS_SERVER(inode)->acdirmax \
 			       : NFS_SERVER(inode)->acregmax)
-#define NFS_ATTRTIMEO_UPDATE(inode)	((inode)->u.nfs_i.attrtimeo_timestamp)
+#define NFS_ATTRTIMEO_UPDATE(inode)	(NFS_I(inode)->attrtimeo_timestamp)
 
-#define NFS_FLAGS(inode)		((inode)->u.nfs_i.flags)
+#define NFS_FLAGS(inode)		(NFS_I(inode)->flags)
 #define NFS_REVALIDATING(inode)		(NFS_FLAGS(inode) & NFS_INO_REVALIDATING)
 #define NFS_STALE(inode)		(NFS_FLAGS(inode) & NFS_INO_STALE)
 
-#define NFS_FILEID(inode)		((inode)->u.nfs_i.fileid)
-#define NFS_FSID(inode)			((inode)->u.nfs_i.fsid)
+#define NFS_FILEID(inode)		(NFS_I(inode)->fileid)
+#define NFS_FSID(inode)			(NFS_I(inode)->fsid)
 
 /* Inode Flags */
 #define NFS_USE_READDIRPLUS(inode)	((NFS_FLAGS(inode) & NFS_INO_ADVISE_RDPLUS) ? 1 : 0)
@@ -212,13 +217,13 @@
 static inline int
 nfs_have_read(struct inode *inode)
 {
-	return !list_empty(&inode->u.nfs_i.read);
+	return !list_empty(&NFS_I(inode)->read);
 }
 
 static inline int
 nfs_have_writebacks(struct inode *inode)
 {
-	return !list_empty(&inode->u.nfs_i.writeback);
+	return !list_empty(&NFS_I(inode)->writeback);
 }
 
 static inline int





^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-23 15:36         ` Alexander Viro
  2001-04-23 20:45           ` Ingo Oeser
  2001-04-23 21:19           ` Richard Gooch
@ 2001-04-24  6:33           ` Christoph Rohland
  2 siblings, 0 replies; 42+ messages in thread
From: Christoph Rohland @ 2001-04-24  6:33 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Ingo Oeser, David L. Parsley, linux-kernel

Hi Alexander,

On Mon, 23 Apr 2001, Alexander Viro wrote:
>> I like it. ext2fs does the same, so there should be no VFS
>> hassles involved. Al?
> 
> We should get ext2 and friends to move the sucker _out_ of struct
> inode.  As it is, sizeof(struct inode) is way too large. This is 2.5
> stuff, but it really has to be done. More filesystems adding stuff
> into the union is a Bad Thing(tm). If you want to allocate space -
> allocate it yourself; ->clear_inode() is the right place for freeing
> it.

Yes, I agree that the union is way too large and I did not plan to
extend it but simply use the size it has.

if (strlen(path) < sizeof(inode->u))
        inline the symlink;
else
        put it into the page cache;

So if somebody really cleans up the private inode structures, it will
not trigger that often any more, and we will perhaps have to rethink the
idea.

But even if we use struct shmem_inode_info, which is 92 bytes right now,
we would inline all symlinks on my machine.

If we reduced its size to 32 (which could be easily done) we would
still inline 6642 out of 9317 symlinks on my machine. That's not bad.
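
For what it's worth, a count like that is easy to reproduce with a small
userspace walker (the starting path and the 32-byte limit below are just
example parameters, not taken from the mail):

#define _XOPEN_SOURCE 500
#include <stdio.h>
#include <stdlib.h>
#include <ftw.h>

static long fits, total;
static size_t limit = 32;	/* candidate inline size in bytes */

static int visit(const char *path, const struct stat *sb,
		 int type, struct FTW *ftwbuf)
{
	if (type == FTW_SL) {		/* symlink (FTW_PHYS: not followed) */
		total++;
		if ((size_t)sb->st_size + 1 <= limit)	/* target + NUL */
			fits++;
	}
	return 0;
}

int main(int argc, char **argv)
{
	if (argc > 2)
		limit = (size_t)atol(argv[2]);
	nftw(argc > 1 ? argv[1] : "/", visit, 64, FTW_PHYS);
	printf("%ld of %ld symlinks would be inlined at %zu bytes\n",
	       fits, total, limit);
	return 0;
}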

Greetings
		Christoph



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-24  2:53               ` Alexander Viro
@ 2001-04-24  9:58                 ` David Woodhouse
  2001-04-24 10:07                   ` Alexander Viro
  2001-04-24 21:59                 ` Trond Myklebust
  1 sibling, 1 reply; 42+ messages in thread
From: David Woodhouse @ 2001-04-24  9:58 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Jan Harkes, Ingo Oeser, Christoph Rohland, David L. Parsley,
	linux-kernel


viro@math.psu.edu said:
>  Oh, for crying out loud. All it takes is half an hour per filesystem.

Half an hour? If it takes more than about 5 minutes for JFFS2 I'd be very
surprised.

--
dwmw2



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-24  9:58                 ` David Woodhouse
@ 2001-04-24 10:07                   ` Alexander Viro
  2001-04-24 10:19                     ` David Woodhouse
  2001-04-24 10:35                     ` Christoph Rohland
  0 siblings, 2 replies; 42+ messages in thread
From: Alexander Viro @ 2001-04-24 10:07 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Jan Harkes, Ingo Oeser, Christoph Rohland, David L. Parsley,
	linux-kernel



On Tue, 24 Apr 2001, David Woodhouse wrote:

> 
> viro@math.psu.edu said:
> >  Oh, for crying out loud. All it takes is half an hour per filesystem.
> 
> Half an hour? If it takes more than about 5 minutes for JFFS2 I'd be very
> surprised.

<tone polite> What's stopping you? </tone>
You _are_ JFFS maintainer, aren't you?


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-24 10:07                   ` Alexander Viro
@ 2001-04-24 10:19                     ` David Woodhouse
  2001-04-24 10:35                     ` Christoph Rohland
  1 sibling, 0 replies; 42+ messages in thread
From: David Woodhouse @ 2001-04-24 10:19 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Jan Harkes, Ingo Oeser, Christoph Rohland, David L. Parsley,
	linux-kernel


viro@math.psu.edu said:
>  <tone polite> What's stopping you? </tone> You _are_ JFFS maintainer,
> aren't you?

It already uses...

#define JFFS2_INODE_INFO(i) (&i->u.jffs2_i) 

It's trivial to switch over when the size of the inode union goes below the 
size of struct jffs2_inode_info. Until then, I'd just be wasting space.

--
dwmw2



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-24 10:07                   ` Alexander Viro
  2001-04-24 10:19                     ` David Woodhouse
@ 2001-04-24 10:35                     ` Christoph Rohland
  2001-04-24 10:54                       ` Alexander Viro
  1 sibling, 1 reply; 42+ messages in thread
From: Christoph Rohland @ 2001-04-24 10:35 UTC (permalink / raw)
  To: Alexander Viro
  Cc: David Woodhouse, Jan Harkes, Ingo Oeser, David L. Parsley, linux-kernel

Hi Al,

On Tue, 24 Apr 2001, Alexander Viro wrote:
>> Half an hour? If it takes more than about 5 minutes for JFFS2 I'd
>> be very surprised.
> 
> <tone polite> What's stopping you? </tone>
> You _are_ JFFS maintainer, aren't you?

So is this the start of changing all filesystems in 2.4? I am not sure
we should do that.

Greetings
		Christoph



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-24 10:35                     ` Christoph Rohland
@ 2001-04-24 10:54                       ` Alexander Viro
  2001-04-24 12:51                         ` David Woodhouse
                                           ` (3 more replies)
  0 siblings, 4 replies; 42+ messages in thread
From: Alexander Viro @ 2001-04-24 10:54 UTC (permalink / raw)
  To: Christoph Rohland
  Cc: David Woodhouse, Jan Harkes, Ingo Oeser, David L. Parsley, linux-kernel



On 24 Apr 2001, Christoph Rohland wrote:

> Hi Al,
> 
> On Tue, 24 Apr 2001, Alexander Viro wrote:
> >> Half an hour? If it takes more than about 5 minutes for JFFS2 I'd
> >> be very surprised.
> > 
> > <tone polite> What's stopping you? </tone>
> > You _are_ JFFS maintainer, aren't you?
> 
> So is this the start to change all filesystems in 2.4? I am not sure
> we should do that. 

The encapsulation part is definitely worth doing - it cleans the code up
and doesn't change the result of the compile. Adding allocation/freeing/
cache initialization/cache removal and changing the FOOFS_I() definition -
well, it's probably worth keeping such patches around, but whether
to switch any individual filesystem during 2.4 is a policy decision.
Up to the maintainer, indeed. Notice that these patches (separate allocation
per se) are going to be within 3-4Kb per filesystem _and_ completely
straightforward.

What I would like to avoid is a scenario like this:

Maintainers of filesystems with large private inodes: Why would we separate
them? We would only waste memory, since the other filesystems stay in ->u
and keep it large.

Maintainers of the rest of the filesystems: Since there are no patches that
would take the large stuff out of ->u, why would we bother?

So yes, IMO having such patches available _is_ a good thing. And in 2.5
we definitely want them in the tree. If the encapsulation part gets there
during 2.4 and separate allocation is available for all of them, it will
be easier to do without PITA in the process.
								Al



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-24 10:54                       ` Alexander Viro
@ 2001-04-24 12:51                         ` David Woodhouse
  2001-04-24 13:34                         ` Christoph Rohland
                                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 42+ messages in thread
From: David Woodhouse @ 2001-04-24 12:51 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Christoph Rohland, Jan Harkes, Ingo Oeser, David L. Parsley,
	linux-kernel


viro@math.psu.edu said:
>  What I would like to avoid is scenario like
> Maintainers of filesystems with large private inodes: Why would we
> separate them? We would only waste memory, since the other filesystems
> stay in ->u and keep it large.

> Maintainers of the rest of filesystems: Since there's no patches that
> would take large stuff out of ->u, why would we bother?

> So yes, IMO having such patches available _is_ a good thing. And in
> 2.5 we definitely want them in the tree. If encapsulation part gets
> there during 2.4 and separate allocation is available for all of them
> it will be easier to do without PITA in process.

JFFS2 has the encapsulation part already. I'll make it do separate 
allocation in 2.5, when it's actually a gain.

--
dwmw2



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-24 10:54                       ` Alexander Viro
  2001-04-24 12:51                         ` David Woodhouse
@ 2001-04-24 13:34                         ` Christoph Rohland
  2001-04-24 16:52                           ` David L. Parsley
  2001-05-04 19:17                           ` [Patch] encapsulate shmem access to shmem_inode_info Christoph Rohland
  2001-04-24 16:04                         ` Can't read SCSI TAPE Masaki Tsuji
  2001-04-24 18:36                         ` hundreds of mount --bind mountpoints? Andreas Dilger
  3 siblings, 2 replies; 42+ messages in thread
From: Christoph Rohland @ 2001-04-24 13:34 UTC (permalink / raw)
  To: Alexander Viro
  Cc: David Woodhouse, Jan Harkes, Ingo Oeser, David L. Parsley, linux-kernel

Hi Al,

On Tue, 24 Apr 2001, Alexander Viro wrote:
> So yes, IMO having such patches available _is_ a good thing. And in
> 2.5 we definitely want them in the tree. If encapsulation part gets
> there during 2.4 and separate allocation is available for all of
> them it will be easier to do without PITA in process.

OK I will do that for tmpfs soon. And I will do the symlink inlining
with that patch.

Greetings
		Christoph



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Can't read SCSI TAPE
  2001-04-24 10:54                       ` Alexander Viro
  2001-04-24 12:51                         ` David Woodhouse
  2001-04-24 13:34                         ` Christoph Rohland
@ 2001-04-24 16:04                         ` Masaki Tsuji
  2001-04-24 18:36                         ` hundreds of mount --bind mountpoints? Andreas Dilger
  3 siblings, 0 replies; 42+ messages in thread
From: Masaki Tsuji @ 2001-04-24 16:04 UTC (permalink / raw)
  To: linux-kernel

Dear sirs,

Although 'tar' can write to the SCSI-TAPE, it can't read from it.
'tar' reports ....

......
-rw-r--r-- root/root    xxxxx 2001-xx-xx 01:23 usr/bin/xxxxxx
tar: Skipping to next file header                            <------"A"
-rw-r--r-- root/root    xxxxx 2001-xx-xx 01:23 usr/bin/xxxxxxx
......


"A" means written data is wrong, doesn't it???


Thanks for any help.

------------------------------------------
Detailed ->

System...
Kernel : 2.2.11 + raid0145-19990824-2.2.11.gz
          or
         2.2.11
tar    : GNU tar 1.12
mt     : mt-st v. 0.4
glibc2 : glibc-2.0.7pre6

Hardware...
Mother   : Intel Celeron x2 (SMP)
TAPE drv : SONY SDT-9000
TAPE     : DDS1 DDS2 DDS3
SCSI card: AHA-1542
Cable    : SCSI-2 Hi-impedance, length 0.5m
------------------------------------------

-- 
Masaki Tsuji

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-24 13:34                         ` Christoph Rohland
@ 2001-04-24 16:52                           ` David L. Parsley
  2001-05-05  7:48                             ` [Patch] inline symlinks for tmpfs Christoph Rohland
  2001-05-04 19:17                           ` [Patch] encapsulate shmem access to shmem_inode_info Christoph Rohland
  1 sibling, 1 reply; 42+ messages in thread
From: David L. Parsley @ 2001-04-24 16:52 UTC (permalink / raw)
  To: Christoph Rohland; +Cc: Alexander Viro, Ingo Oeser, linux-kernel

Christoph Rohland wrote:
> 
> OK I will do that for tmpfs soon. And I will do the symlink inlining
> with that patch.

Wow, this thread really exploded, eh?  But thanks, Christoph, I look
forward to seeing your patch.  4k symlinks really suck for embedders
who never swap out pages. ;-)

regards,
	David

-- 
David L. Parsley
Network Administrator
Roanoke College

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-24 10:54                       ` Alexander Viro
                                           ` (2 preceding siblings ...)
  2001-04-24 16:04                         ` Can't read SCSI TAPE Masaki Tsuji
@ 2001-04-24 18:36                         ` Andreas Dilger
  2001-04-24 18:49                           ` Alexander Viro
  3 siblings, 1 reply; 42+ messages in thread
From: Andreas Dilger @ 2001-04-24 18:36 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Christoph Rohland, David Woodhouse, Jan Harkes, Ingo Oeser,
	David L. Parsley, linux-kernel

Al writes:
> Encapsulation part is definitely worth doing - it cleans the code up
> and doesn't change the result of compile. Adding allocation/freeing/
> cache initialization/cache removal and chaninging FOOFS_I() definition -
> well, it's probably worth to keep such patches around, but whether
> to switch any individual filesystem during 2.4 is a policy decision.
> Up to maintainer, indeed. Notice that these patches (separate allocation
> per se) are going to be within 3-4Kb per filesystem _and_ completely
> straightforward.

One thing to watch out for is that the current code zeros the u. struct
for us (as you pointed out to me previously), but allocating from the
slab cache will not...  This could be an interesting source of bugs for
some filesystems that assume zero'd inode_info structs.

Fortunately, it doesn't appear that my patch to clean out all of the
"duplicate" zero initializers in the fs-specific code was accepted...

> What I would like to avoid is a scenario like:
> 
> Maintainers of filesystems with large private inodes: Why would we separate
> them? We would only waste memory, since the other filesystems stay in ->u
> and keep it large.

Well, if we get rid of NFS (50 x __u32) and HFS (44 * __u32) (sizes are
approximate for 32-bit arches - I was just counting by hand and not
strictly checking alignment), then almost all other filesystems are below
25 * __u32 (i.e. half of the previous size).

For large-private-inode filesystems, we are wasting memory in EVERY inode
in the slab cache, not just ones in use with the large private inode.  If
it were the most common filesystem (ext2, maybe reiser, msdos next) then
it wouldn't make much difference.

At some point, shrinking the union further by giving filesystems separate
slab allocations stops being a win from a memory usage standpoint.
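
As a rough illustration (the inode count here is invented purely for
the arithmetic): a union sized for NFS costs 50 x 4 = 200 bytes in
every cached inode, while an ext2 inode only uses 27 x 4 = 108 of
them, i.e. roughly 92 bytes wasted per cached ext2 inode.  With 50,000
cached inodes that is already about 4.4 MB, which is roughly what
separate slabs would win back (minus the slab overhead).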

The remaining info structs are (approx. for 32-bit arch) (size in __u32):

ext2	27
affs	26
ufs	25
socket	24
shmem	22
coda	20
qnx4	18

minix	16
umsdos	15
hpfs	15
efs	14
sysv	13
reiser	12
udf	12
ntfs	11
ncp	10
msdos	9
adfs	7
smb	6
usbdev	5
proc	4
iso	4
bfs	3
romfs	2


> Maintainers of the rest of filesystems: Since there's no patches that would
> take large stuff out of ->u, why would we bother?

Maybe the size of the union can depend on CONFIG_*_FS?  There should be
an absolute minimum size (16 * __u32 or so), but then people who want
reiserfs as their primary fs do not need to pay the memory penalty of ext2.
For ext2 (the next largest and most common fs), we could make it part of
the union if it is compiled in, and on a slab cache if it is a module?

Should uncommon-but-widely-used things like socket and shmem have their
own slab cache, or should they just allocate from the generic size-32 slab?

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-24 18:36                         ` hundreds of mount --bind mountpoints? Andreas Dilger
@ 2001-04-24 18:49                           ` Alexander Viro
  2001-04-24 19:11                             ` Andreas Dilger
  2001-04-24 22:01                             ` Ingo Oeser
  0 siblings, 2 replies; 42+ messages in thread
From: Alexander Viro @ 2001-04-24 18:49 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Christoph Rohland, David Woodhouse, Jan Harkes, Ingo Oeser,
	David L. Parsley, linux-kernel



On Tue, 24 Apr 2001, Andreas Dilger wrote:

> One thing to watch out for is that the current code zeros the u. struct
> for us (as you pointed out to me previously), but allocating from the
> slab cache will not...  This could be an interesting source of bugs for
> some filesystems that assume zero'd inode_info structs.

True, but easy to catch.
 
> Well, if we get rid of NFS (50 x __u32) and HFS (44 * __u32) (sizes are
> approximate for 32-bit arches - I was just counting by hand and not
> strictly checking alignment), then almost all other filesystems are below
> 25 * __u32 (i.e. half of the previous size).

Yeah, but NFS suddenly takes 25+50 words... That's the type of complaints
I'm thinking about.
 
> Maybe the size of the union can depend on CONFIG_*_FS?  There should be
> an absolute minimum size (16 * __u32 or so), but then people who want
> reiserfs as their primary fs do not need to pay the memory penalty of ext2.
> For ext2 (the next largest and most common fs), we could make it part of
> the union if it is compiled in, and on a slab cache if it is a module?

NO. Sorry about shouting, but that's the way to madness. I can understand
code depending on SMP vs. UP and similar beasts, but the presence of specific
filesystems.... <shudder>

> Should uncommon-but-widely-used things like socket and shmem have their
> own slab cache, or should they just allocate from the generic size-32 slab?

That's pretty interesting - especially for sockets. I wonder whether
we would get problems with separate allocation of these - we don't
go from inode to socket all that often, but...


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-24 18:49                           ` Alexander Viro
@ 2001-04-24 19:11                             ` Andreas Dilger
  2001-04-24 22:01                             ` Ingo Oeser
  1 sibling, 0 replies; 42+ messages in thread
From: Andreas Dilger @ 2001-04-24 19:11 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Andreas Dilger, Christoph Rohland, David Woodhouse, Jan Harkes,
	Ingo Oeser, David L. Parsley, linux-kernel

Al writes:
> > Well, if we get rid of NFS (50 x __u32) and HFS (44 * __u32) (sizes are
> > approximate for 32-bit arches - I was just counting by hand and not
> > strictly checking alignment), then almost all other filesystems are below
> > 25 * __u32 (i.e. half of the previous size).
> 
> Yeah, but NFS suddenly takes 25+50 words... That's the type of complaints
> I'm thinking about.

But then again, you are saving 50-25 words for every non-NFS inode, and I
think _most_ systems will have more local inodes than NFS inodes.  Even
NFS servers will have local inodes, only clients (AFAIK) use nfs_inode_info.

> > Maybe the size of the union can depend on CONFIG_*_FS?  There should be
> > an absolute minimum size (16 * __u32 or so), but then people who want
> > reiserfs as their primary fs do not need to pay the memory penalty of ext2.
> > For ext2 (the next largest and most common fs), we could make it part of
> > the union if it is compiled in, and on a slab cache if it is a module?
> 
> NO. Sorry about shouting, but that's the way to madness. I can understand
> code depending on SMP vs. UP and similar beasts, but the presence of specific
> filesystems.... <shudder>

But then again, if the size of nfs_inode_info changes, it is the same
problem... sizeof(struct inode) may have changed (depending on whether the
slab has some padding between inodes or not).  If we stick to a minimum size
(16 words or maybe even 8), then it will never change anymore, and we
do not have overhead for small inode_info structs.

> > Should uncommon-but-widely-used things like socket and shmem have their
> > own slab cache, or should they just allocate from the generic size-32 slab?
> 
> That's pretty interesting - especially for sockets. I wonder whether
> we would get problems with separate allocation of these - we don't
> go from inode to socket all that often, but...

I never thought of that.  I guess the socket code does not know which
fs the inode_info was allocated from, so it cannot free it from the slab
(even if it had access to these slabs, which it does not).  In that case,
each fs would have struct socket as the minimum size allocatable, which
is unfortunately one of the largest inode_info sizes.  It is smaller
than ext2, but...

Any ideas?  Do we ever get back into fs-specific clear_inode() from
a socket?  In that case, the socket would just hold a pointer to the
fs-specific inode_info inside its own struct socket until the inode
is dropped.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-24  2:53               ` Alexander Viro
  2001-04-24  9:58                 ` David Woodhouse
@ 2001-04-24 21:59                 ` Trond Myklebust
  2001-04-24 22:09                   ` Alexander Viro
  2001-04-24 22:31                   ` Trond Myklebust
  1 sibling, 2 replies; 42+ messages in thread
From: Trond Myklebust @ 2001-04-24 21:59 UTC (permalink / raw)
  To: Alexander Viro; +Cc: linux-kernel

>>>>> " " == Alexander Viro <viro@math.psu.edu> writes:

     > On Mon, 23 Apr 2001, Jan Harkes wrote:

    >> On Mon, Apr 23, 2001 at 10:45:05PM +0200, Ingo Oeser wrote:

    >> > BTW: Is it still less than one page? Then it doesn't make me
    >> >    nervous. Why? Guess what granularity we allocate at, if we
    >> >    just store pointers instead of the inode.u. Or do you like
    >> >    every FS creating his own slab cache?

     > Oh, for crying out loud. All it takes is half an hour per
     > filesystem.  Here - completely untested patch that does it for
     > NFS. Took about that long. Absolutely straightforward, very
     > easy to verify correctness.

     > Some stuff may need tweaking, but not much (e.g. some functions
     > should take nfs_inode_info instead of inodes, etc.). From the
     > look of flushd cache it seems that we would be better off with
     > cyclic lists instead of single-linked ones for the hash, but I
     > didn't look deep enough.

     > So consider the patch below as proof-of-concept. Enjoy:

Hi Al,

  I believe your patch introduces a race for the NFS case. The problem
lies in the fact that nfs_find_actor() needs to read several of the
fields from nfs_inode_info. By adding an allocation after the inode
has been hashed, you are creating a window during which the inode can
be found by find_inode(), but during which you aren't even guaranteed
that the nfs_inode_info exists let alone that it's been initialized
by nfs_fill_inode().

One solution could be to have find_inode sleep on encountering a
locked inode. It would have to be something along the lines of

static struct inode * find_inode(struct super_block * sb, unsigned long ino, struct list_head *head, find_inode_t find_actor, void *opaque)
{
        struct list_head *tmp;
        struct inode * inode;

        tmp = head;
        for (;;) {
                tmp = tmp->next;
                inode = NULL;
                if (tmp == head)
                        break;
                inode = list_entry(tmp, struct inode, i_hash);
                if (inode->i_ino != ino)
                        continue;
                if (inode->i_sb != sb)
                        continue;
                if (find_actor) {
                        if (inode->i_state & I_LOCK) {
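                                /* Inode is still being set up: drop
                                 * inode_lock, wait for I_LOCK to clear,
                                 * then rescan the chain from the head. */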
                                spin_unlock(&inode_lock);
                                __wait_on_inode(inode);
                                spin_lock(&inode_lock);
                                tmp = head;
                                continue;
                        }
                        if (!find_actor(inode, ino, opaque))
                                continue;
                }
                break;
        }
        return inode;
}


Cheers,
   Trond

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-24 18:49                           ` Alexander Viro
  2001-04-24 19:11                             ` Andreas Dilger
@ 2001-04-24 22:01                             ` Ingo Oeser
  1 sibling, 0 replies; 42+ messages in thread
From: Ingo Oeser @ 2001-04-24 22:01 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Andreas Dilger, Christoph Rohland, David Woodhouse, Jan Harkes,
	David L. Parsley, linux-kernel

On Tue, Apr 24, 2001 at 02:49:23PM -0400, Alexander Viro wrote:
> On Tue, 24 Apr 2001, Andreas Dilger wrote:
> > One thing to watch out for is that the current code zeros the u. struct
> > for us (as you pointed out to me previously), but allocating from the
> > slab cache will not...  This could be an interesting source of bugs for
> > some filesystems that assume zero'd inode_info structs.
> True, but easy to catch.

Jepp. Just request SLAB_ZERO (still to be implemented) instead of
SLAB_POISON or provide a constructor.

A nice set of macros for this would make it quite easy. The ctor
is the way to handle it. Maybe we could even put all the fs-specific
initializers into it (e.g. magics, zeroes).
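
(A sketch of that constructor route, with invented names: since
SLAB_ZERO does not exist, the cache gets a ctor that zeroes and stamps
the object.  The slab allocator runs the ctor only when it populates a
fresh slab page, so objects have to be put back into this state before
kmem_cache_free():)

#include <linux/slab.h>
#include <linux/string.h>
#include <linux/errno.h>
#include <linux/init.h>

#define FOOFS_INODE_MAGIC	0xf00f0001	/* invented for the example */

struct foofs_inode_info {
	unsigned long	magic;
	unsigned long	i_block[8];
};

static void foofs_inode_ctor(void *obj, kmem_cache_t *cachep, unsigned long flags)
{
	struct foofs_inode_info *fi = obj;

	memset(fi, 0, sizeof(*fi));
	fi->magic = FOOFS_INODE_MAGIC;
}

static kmem_cache_t *foofs_inode_cachep;

static int __init foofs_init_inodecache(void)
{
	foofs_inode_cachep = kmem_cache_create("foofs_inode_cache",
					       sizeof(struct foofs_inode_info),
					       0, SLAB_HWCACHE_ALIGN,
					       foofs_inode_ctor, NULL);
	return foofs_inode_cachep ? 0 : -ENOMEM;
}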

Regards

Ingo Oeser
-- 
10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag>
         <<<<<<<<<<<<     been there and had much fun   >>>>>>>>>>>>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-24 21:59                 ` Trond Myklebust
@ 2001-04-24 22:09                   ` Alexander Viro
  2001-04-24 22:31                   ` Trond Myklebust
  1 sibling, 0 replies; 42+ messages in thread
From: Alexander Viro @ 2001-04-24 22:09 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-kernel



On 24 Apr 2001, Trond Myklebust wrote:

> Hi Al,
> 
>   I believe your patch introduces a race for the NFS case. The problem
> lies in the fact that nfs_find_actor() needs to read several of the
> fields from nfs_inode_info. By adding an allocation after the inode
> has been hashed, you are creating a window during which the inode can
> be found by find_inode(), but during which you aren't even guaranteed
> that the nfs_inode_info exists let alone that it's been initialized
> by nfs_fill_inode().

_Ouch_. So what are you going to do if another iget4() comes between
the moment when you hash the inode and set these fields? You are
filling them only after you drop inode_lock, so AFAICS the current
code has the same problem.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: hundreds of mount --bind mountpoints?
  2001-04-24 21:59                 ` Trond Myklebust
  2001-04-24 22:09                   ` Alexander Viro
@ 2001-04-24 22:31                   ` Trond Myklebust
  1 sibling, 0 replies; 42+ messages in thread
From: Trond Myklebust @ 2001-04-24 22:31 UTC (permalink / raw)
  To: Alexander Viro; +Cc: linux-kernel

>>>>> " " == Alexander Viro <viro@math.psu.edu> writes:

     > _Ouch_. So what are you going to do if another iget4() comes
     > between the moment when you hash the inode and set these
     > fields? You are filling them only after you drop inode_lock, so
     > AFAICS the current code has the same problem.

The entire call to iget4() is protected by the BKL in all relevant
instances. As long as we don't sleep between find_inode() and
nfs_fill_inode(), we're safe.

In fact the BKL protection is also needed for another reason: we don't
actually initialize the inode in the I_LOCK-protected read_inode() but
instead rely on the caller of iget4 to do it for us. The reason is
that we would need to pass the struct nfs_fattr to read_inode(),
and this wasn't possible until the ReiserFS people introduced
read_inode2().

Cheers,
   Trond

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [Patch] encapsulate shmem access to shmem_inode_info
  2001-04-24 13:34                         ` Christoph Rohland
  2001-04-24 16:52                           ` David L. Parsley
@ 2001-05-04 19:17                           ` Christoph Rohland
       [not found]                             ` <3AF405FC.11D37FEB@bellsouth.net>
  1 sibling, 1 reply; 42+ messages in thread
From: Christoph Rohland @ 2001-05-04 19:17 UTC (permalink / raw)
  To: Alexander Viro, Linus Torvalds
  Cc: David Woodhouse, Jan Harkes, Ingo Oeser, David L. Parsley, linux-kernel

Hi,

On 24 Apr 2001, Christoph Rohland wrote:
> Hi Al,
> 
> On Tue, 24 Apr 2001, Alexander Viro wrote:
>> So yes, IMO having such patches available _is_ a good thing. And in
>> 2.5 we definitely want them in the tree. If encapsulation part gets
>> there during 2.4 and separate allocation is available for all of
>> them it will be easier to do without PITA in process.
> 
> OK I will do that for tmpfs soon. And I will do the symlink inlining
> with that patch.

Here comes the patch to encapsulate all accesses to struct
shmem_inode_info into a macro. It is now trivial to allocate the
private part independently from the inode.

Greetings
		Christoph

P.S: The symlink inlining will come in a separate patch

diff -uNr 2.4.4-mmap_write/include/linux/shmem_fs.h 2.4.4-mmap_write-SHMEM_I/include/linux/shmem_fs.h
--- 2.4.4-mmap_write/include/linux/shmem_fs.h	Tue May  1 20:02:00 2001
+++ 2.4.4-mmap_write-SHMEM_I/include/linux/shmem_fs.h	Tue May  1 20:06:10 2001
@@ -18,14 +18,15 @@
 } swp_entry_t;
 
 struct shmem_inode_info {
-	spinlock_t	lock;
-	struct semaphore sem;
-	unsigned long	max_index;
-	swp_entry_t	i_direct[SHMEM_NR_DIRECT]; /* for the first blocks */
-	swp_entry_t   **i_indirect; /* doubly indirect blocks */
-	unsigned long	swapped;
-	int		locked;     /* into memory */
+	spinlock_t		lock;
+	struct semaphore 	sem;
+	unsigned long		max_index;
+	swp_entry_t		i_direct[SHMEM_NR_DIRECT]; /* for the first blocks */
+	swp_entry_t           **i_indirect; /* doubly indirect blocks */
+	unsigned long		swapped;
+	int			locked;     /* into memory */
 	struct list_head	list;
+	struct inode	       *inode;
 };
 
 struct shmem_sb_info {
@@ -35,5 +36,7 @@
 	unsigned long free_inodes;  /* How many are left for allocation */
 	spinlock_t    stat_lock;
 };
+
+#define SHMEM_I(inode)  (&inode->u.shmem_i)
 
 #endif
diff -uNr 2.4.4-mmap_write/ipc/shm.c 2.4.4-mmap_write-SHMEM_I/ipc/shm.c
--- 2.4.4-mmap_write/ipc/shm.c	Wed Apr 11 12:36:47 2001
+++ 2.4.4-mmap_write-SHMEM_I/ipc/shm.c	Tue May  1 20:06:10 2001
@@ -348,6 +348,7 @@
 
 static void shm_get_stat (unsigned long *rss, unsigned long *swp) 
 {
+	struct shmem_inode_info *info;
 	int i;
 
 	*rss = 0;
@@ -361,10 +362,11 @@
 		if(shp == NULL)
 			continue;
 		inode = shp->shm_file->f_dentry->d_inode;
-		spin_lock (&inode->u.shmem_i.lock);
+		info = SHMEM_I(inode);
+		spin_lock (&info->lock);
 		*rss += inode->i_mapping->nrpages;
-		*swp += inode->u.shmem_i.swapped;
-		spin_unlock (&inode->u.shmem_i.lock);
+		*swp += info->swapped;
+		spin_unlock (&info->lock);
 	}
 }
 
diff -uNr 2.4.4-mmap_write/mm/shmem.c 2.4.4-mmap_write-SHMEM_I/mm/shmem.c
--- 2.4.4-mmap_write/mm/shmem.c	Tue May  1 20:02:00 2001
+++ 2.4.4-mmap_write-SHMEM_I/mm/shmem.c	Wed May  2 16:46:00 2001
@@ -73,7 +73,7 @@
 	unsigned long freed;
 
 	freed = (inode->i_blocks/BLOCKS_PER_PAGE) -
-		(inode->i_mapping->nrpages + inode->u.shmem_i.swapped);
+		(inode->i_mapping->nrpages + SHMEM_I(inode)->swapped);
 	if (freed){
 		struct shmem_sb_info * info = &inode->i_sb->u.shmem_sb;
 		inode->i_blocks -= freed*BLOCKS_PER_PAGE;
@@ -159,7 +159,7 @@
 	unsigned long index, start;
 	unsigned long freed = 0;
 	swp_entry_t **base, **ptr, **last;
-	struct shmem_inode_info * info = &inode->u.shmem_i;
+	struct shmem_inode_info * info = SHMEM_I(inode);
 
 	down(&info->sem);
 	inode->i_ctime = inode->i_mtime = CURRENT_TIME;
@@ -206,7 +206,7 @@
 	struct shmem_sb_info *info = &inode->i_sb->u.shmem_sb;
 
 	spin_lock (&shmem_ilock);
-	list_del (&inode->u.shmem_i.list);
+	list_del (&SHMEM_I(inode)->list);
 	spin_unlock (&shmem_ilock);
 	inode->i_size = 0;
 	shmem_truncate (inode);
@@ -239,7 +239,7 @@
 		goto out;
 	
 	inode = page->mapping->host;
-	info = &inode->u.shmem_i;
+	info = SHMEM_I(inode);
 	swap = __get_swap_page(2);
 	error = -ENOMEM;
 	if (!swap.val)
@@ -407,7 +407,7 @@
 		page_cache_release(*ptr);
 	}
 
-	info = &inode->u.shmem_i;
+	info = SHMEM_I(inode);
 	down (&info->sem);
 	/* retest we may have slept */  	
 
@@ -415,7 +415,7 @@
 	if (inode->i_size < (loff_t) idx * PAGE_CACHE_SIZE)
 		goto failed;
 
-	*ptr = shmem_getpage_locked(&inode->u.shmem_i, inode, idx);
+	*ptr = shmem_getpage_locked(info, inode, idx);
 	if (IS_ERR (*ptr))
 		goto failed;
 
@@ -462,7 +462,7 @@
 void shmem_lock(struct file * file, int lock)
 {
 	struct inode * inode = file->f_dentry->d_inode;
-	struct shmem_inode_info * info = &inode->u.shmem_i;
+	struct shmem_inode_info * info = SHMEM_I(inode);
 	struct page * page;
 	unsigned long idx, size;
 
@@ -521,7 +521,8 @@
 		inode->i_rdev = to_kdev_t(dev);
 		inode->i_mapping->a_ops = &shmem_aops;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
-		info = &inode->u.shmem_i;
+		info = SHMEM_I(inode);
+		info->inode = inode;
 		spin_lock_init (&info->lock);
 		sema_init (&info->sem, 1);
 		switch (mode & S_IFMT) {
@@ -542,7 +543,7 @@
 			break;
 		}
 		spin_lock (&shmem_ilock);
-		list_add (&inode->u.shmem_i.list, &shmem_inodes);
+		list_add (&SHMEM_I(inode)->list, &shmem_inodes);
 		spin_unlock (&shmem_ilock);
 	}
 	return inode;
@@ -629,7 +630,7 @@
 			__get_user(dummy, buf+bytes-1);
 		}
 
-		info = &inode->u.shmem_i;
+		info = SHMEM_I(inode);
 		down (&info->sem);
 		page = shmem_getpage_locked(info, inode, index);
 		up (&info->sem);
@@ -658,8 +659,8 @@
 			buf += bytes;
 			if (pos > inode->i_size) 
 				inode->i_size = pos;
-			if (inode->u.shmem_i.max_index <= index)
-				inode->u.shmem_i.max_index = index+1;
+			if (info->max_index <= index)
+				info->max_index = index+1;
 
 		}
 unlock:
@@ -940,7 +941,7 @@
 		
 	inode = dentry->d_inode;
 	down(&inode->i_sem);
-	page = shmem_getpage_locked(&inode->u.shmem_i, inode, 0);
+	page = shmem_getpage_locked(SHMEM_I(inode), inode, 0);
 	if (IS_ERR(page))
 		goto fail;
 	kaddr = kmap(page);
@@ -1220,12 +1221,11 @@
 	return -1;
 }
 
-static int shmem_unuse_inode (struct inode *inode, swp_entry_t entry, struct page *page)
+static int shmem_unuse_inode (struct shmem_inode_info *info, swp_entry_t entry, struct page *page)
 {
 	swp_entry_t **base, **ptr;
 	unsigned long idx;
 	int offset;
-	struct shmem_inode_info *info = &inode->u.shmem_i;
 	
 	idx = 0;
 	spin_lock (&info->lock);
@@ -1246,7 +1246,7 @@
 	spin_unlock (&info->lock);
 	return 0;
 found:
-	add_to_page_cache(page, inode->i_mapping, offset + idx);
+	add_to_page_cache(page, info->inode->i_mapping, offset + idx);
 	set_page_dirty(page);
 	SetPageUptodate(page);
 	UnlockPage(page);
@@ -1261,13 +1261,13 @@
 void shmem_unuse(swp_entry_t entry, struct page *page)
 {
 	struct list_head *p;
-	struct inode * inode;
+	struct shmem_inode_info * info;
 
 	spin_lock (&shmem_ilock);
 	list_for_each(p, &shmem_inodes) {
-		inode = list_entry(p, struct inode, u.shmem_i.list);
+		info = list_entry(p, struct shmem_inode_info, list);
 
-		if (shmem_unuse_inode(inode, entry, page))
+		if (shmem_unuse_inode(info, entry, page))
 			break;
 	}
 	spin_unlock (&shmem_ilock);


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [Patch] inline symlinks for tmpfs
  2001-04-24 16:52                           ` David L. Parsley
@ 2001-05-05  7:48                             ` Christoph Rohland
  0 siblings, 0 replies; 42+ messages in thread
From: Christoph Rohland @ 2001-05-05  7:48 UTC (permalink / raw)
  To: David L. Parsley; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 307 bytes --]

Hi David,

On Tue, 24 Apr 2001, David L. Parsley wrote:
>> OK I will do that for tmpfs soon. And I will do the symlink
>> inlining with that patch.

OK, here comes the patch for the symlink inlining. It is on top of my
previous patch to encapsulate access to the private inode info.

Greetings
		Christoph


[-- Attachment #2: patch-inline_symlink --]
[-- Type: text/plain, Size: 4567 bytes --]

diff -uNr 2.4.4-mmap_write-SHMEM_I/mm/shmem.c 2.4.4-mmap_write-SHMEM_I-symlink/mm/shmem.c
--- 2.4.4-mmap_write-SHMEM_I/mm/shmem.c	Fri May  4 21:32:22 2001
+++ 2.4.4-mmap_write-SHMEM_I-symlink/mm/shmem.c	Fri May  4 21:37:34 2001
@@ -41,7 +41,6 @@
 static struct inode_operations shmem_inode_operations;
 static struct file_operations shmem_dir_operations;
 static struct inode_operations shmem_dir_inode_operations;
-static struct inode_operations shmem_symlink_inode_operations;
 static struct vm_operations_struct shmem_vm_ops;
 
 LIST_HEAD (shmem_inodes);
@@ -205,11 +204,13 @@
 {
 	struct shmem_sb_info *info = &inode->i_sb->u.shmem_sb;
 
-	spin_lock (&shmem_ilock);
-	list_del (&SHMEM_I(inode)->list);
-	spin_unlock (&shmem_ilock);
 	inode->i_size = 0;
-	shmem_truncate (inode);
+	if (inode->i_op->truncate == shmem_truncate){ 
+		spin_lock (&shmem_ilock);
+		list_del (&SHMEM_I(inode)->list);
+		spin_unlock (&shmem_ilock);
+		shmem_truncate(inode);
+	}
 	spin_lock (&info->stat_lock);
 	info->free_inodes++;
 	spin_unlock (&info->stat_lock);
@@ -532,6 +533,9 @@
 		case S_IFREG:
 			inode->i_op = &shmem_inode_operations;
 			inode->i_fop = &shmem_file_operations;
+			spin_lock (&shmem_ilock);
+			list_add (&SHMEM_I(inode)->list, &shmem_inodes);
+			spin_unlock (&shmem_ilock);
 			break;
 		case S_IFDIR:
 			inode->i_nlink++;
@@ -539,17 +543,17 @@
 			inode->i_fop = &shmem_dir_operations;
 			break;
 		case S_IFLNK:
-			inode->i_op = &shmem_symlink_inode_operations;
 			break;
 		}
-		spin_lock (&shmem_ilock);
-		list_add (&SHMEM_I(inode)->list, &shmem_inodes);
-		spin_unlock (&shmem_ilock);
 	}
 	return inode;
 }
 
 #ifdef CONFIG_TMPFS
+
+static struct inode_operations shmem_symlink_inode_operations;
+static struct inode_operations shmem_symlink_inline_operations;
+
 static ssize_t
 shmem_file_write(struct file *file,const char *buf,size_t count,loff_t *ppos)
 {
@@ -930,33 +934,54 @@
 	struct inode *inode;
 	struct page *page;
 	char *kaddr;
+	struct shmem_inode_info * info;
 
 	error = shmem_mknod(dir, dentry, S_IFLNK | S_IRWXUGO, 0);
 	if (error)
 		return error;
 
-	len = strlen(symname);
+	len = strlen(symname) + 1;
 	if (len > PAGE_SIZE)
 		return -ENAMETOOLONG;
-		
+
 	inode = dentry->d_inode;
-	down(&inode->i_sem);
-	page = shmem_getpage_locked(SHMEM_I(inode), inode, 0);
-	if (IS_ERR(page))
-		goto fail;
-	kaddr = kmap(page);
-	memcpy(kaddr, symname, len);
-	kunmap(page);
+	info = SHMEM_I(inode);
 	inode->i_size = len;
-	SetPageDirty(page);
-	UnlockPage(page);
-	page_cache_release(page);
-	up(&inode->i_sem);
+	if (len <= sizeof(struct shmem_inode_info)) {
+		/* do it inline */
+		memcpy(info, symname, len);
+		inode->i_op = &shmem_symlink_inline_operations;
+	} else {
+		spin_lock (&shmem_ilock);
+		list_add (&info->list, &shmem_inodes);
+		spin_unlock (&shmem_ilock);
+		down(&inode->i_sem);
+		page = shmem_getpage_locked(info, inode, 0);
+		if (IS_ERR(page)) {
+			up(&inode->i_sem);
+			return PTR_ERR(page);
+		}
+		kaddr = kmap(page);
+		memcpy(kaddr, symname, len);
+		kunmap(page);
+		SetPageDirty(page);
+		UnlockPage(page);
+		page_cache_release(page);
+		up(&inode->i_sem);
+		inode->i_op = &shmem_symlink_inode_operations;
+	}
 	dir->i_ctime = dir->i_mtime = CURRENT_TIME;
 	return 0;
-fail:
-	up(&inode->i_sem);
-	return PTR_ERR(page);
+}
+
+static int shmem_readlink_inline(struct dentry *dentry, char *buffer, int buflen)
+{
+	return vfs_readlink(dentry,buffer,buflen, (const char *)SHMEM_I(dentry->d_inode));
+}
+
+static int shmem_follow_link_inline(struct dentry *dentry, struct nameidata *nd)
+{
+	return vfs_follow_link(nd, (const char *)SHMEM_I(dentry->d_inode));
 }
 
 static int shmem_readlink(struct dentry *dentry, char *buffer, int buflen)
@@ -986,6 +1011,17 @@
 	return res;
 }
 
+static struct inode_operations shmem_symlink_inline_operations = {
+	readlink:	shmem_readlink_inline,
+	follow_link:	shmem_follow_link_inline,
+};
+
+static struct inode_operations shmem_symlink_inode_operations = {
+	truncate:	shmem_truncate,
+	readlink:	shmem_readlink,
+	follow_link:	shmem_follow_link,
+};
+
 static int shmem_parse_options(char *options, int *mode, unsigned long * blocks, unsigned long *inodes)
 {
 	char *this_char, *value;
@@ -1118,14 +1154,6 @@
 
 static struct inode_operations shmem_inode_operations = {
 	truncate:	shmem_truncate,
-};
-
-static struct inode_operations shmem_symlink_inode_operations = {
-	truncate:	shmem_truncate,
-#ifdef CONFIG_TMPFS
-	readlink:	shmem_readlink,
-	follow_link:	shmem_follow_link,
-#endif
 };
 
 static struct file_operations shmem_dir_operations = {

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [Resend] Collection of tmpfs patches
       [not found]                             ` <3AF405FC.11D37FEB@bellsouth.net>
@ 2001-05-06 13:58                               ` Christoph Rohland
  0 siblings, 0 replies; 42+ messages in thread
From: Christoph Rohland @ 2001-05-06 13:58 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: Albert Cranford

[-- Attachment #1: Type: text/plain, Size: 384 bytes --]

Hi,

There is some confusion about my latest tmpfs fixes. There were three
patches which are cumulative against 2.4.4:

1) deadlock fix for write out of mmap regions. (AFAIK this is
   integrated in the -ac kernels)
2) encapsulate access to shmem_inode_info
3) Do inline symlinks

I attach all these patches to this mail in case somebody missed one.

Greetings
		Christoph


[-- Attachment #2: deadlock fix --]
[-- Type: text/plain, Size: 4338 bytes --]

diff -uNr 2.4.4/include/linux/shmem_fs.h c/include/linux/shmem_fs.h
--- 2.4.4/include/linux/shmem_fs.h	Sun Apr 29 20:33:00 2001
+++ c/include/linux/shmem_fs.h	Sun Apr 29 22:43:56 2001
@@ -19,6 +19,7 @@
 
 struct shmem_inode_info {
 	spinlock_t	lock;
+	struct semaphore sem;
 	unsigned long	max_index;
 	swp_entry_t	i_direct[SHMEM_NR_DIRECT]; /* for the first blocks */
 	swp_entry_t   **i_indirect; /* doubly indirect blocks */
diff -uNr 2.4.4/mm/shmem.c c/mm/shmem.c
--- 2.4.4/mm/shmem.c	Mon Apr 30 09:45:39 2001
+++ c/mm/shmem.c	Tue May  1 15:15:38 2001
@@ -161,6 +161,7 @@
 	swp_entry_t **base, **ptr, **last;
 	struct shmem_inode_info * info = &inode->u.shmem_i;
 
+	down(&info->sem);
 	inode->i_ctime = inode->i_mtime = CURRENT_TIME;
 	spin_lock (&info->lock);
 	index = (inode->i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
@@ -197,6 +198,7 @@
 	info->swapped -= freed;
 	shmem_recalc_inode(inode);
 	spin_unlock (&info->lock);
+	up(&info->sem);
 }
 
 static void shmem_delete_inode(struct inode * inode)
@@ -281,15 +283,12 @@
  * still need to guard against racing with shm_writepage(), which might
  * be trying to move the page to the swap cache as we run.
  */
-static struct page * shmem_getpage_locked(struct inode * inode, unsigned long idx)
+static struct page * shmem_getpage_locked(struct shmem_inode_info *info, struct inode * inode, unsigned long idx)
 {
 	struct address_space * mapping = inode->i_mapping;
-	struct shmem_inode_info *info;
 	struct page * page;
 	swp_entry_t *entry;
 
-	info = &inode->u.shmem_i;
-
 repeat:
 	page = find_lock_page(mapping, idx);
 	if (page)
@@ -393,6 +392,7 @@
 
 static int shmem_getpage(struct inode * inode, unsigned long idx, struct page **ptr)
 {
+	struct shmem_inode_info *info;
 	struct address_space * mapping = inode->i_mapping;
 	int error;
 
@@ -407,27 +407,28 @@
 		page_cache_release(*ptr);
 	}
 
-	down (&inode->i_sem);
-	/* retest we may have slept */
+	info = &inode->u.shmem_i;
+	down (&info->sem);
+	/* retest we may have slept */  	
+
+	*ptr = ERR_PTR(-EFAULT);
 	if (inode->i_size < (loff_t) idx * PAGE_CACHE_SIZE)
-		goto sigbus;
-	*ptr = shmem_getpage_locked(inode, idx);
+		goto failed;
+
+	*ptr = shmem_getpage_locked(&inode->u.shmem_i, inode, idx);
 	if (IS_ERR (*ptr))
 		goto failed;
+
 	UnlockPage(*ptr);
-	up (&inode->i_sem);
+	up (&info->sem);
 	return 0;
 failed:
-	up (&inode->i_sem);
+	up (&info->sem);
 	error = PTR_ERR(*ptr);
-	*ptr = NOPAGE_OOM;
-	if (error != -EFBIG)
-		*ptr = NOPAGE_SIGBUS;
-	return error;
-sigbus:
-	up (&inode->i_sem);
 	*ptr = NOPAGE_SIGBUS;
-	return -EFAULT;
+	if (error == -ENOMEM)
+		*ptr = NOPAGE_OOM;
+	return error;
 }
 
 struct page * shmem_nopage(struct vm_area_struct * vma, unsigned long address, int no_share)
@@ -500,6 +501,7 @@
 struct inode *shmem_get_inode(struct super_block *sb, int mode, int dev)
 {
 	struct inode * inode;
+	struct shmem_inode_info *info;
 
 	spin_lock (&sb->u.shmem_sb.stat_lock);
 	if (!sb->u.shmem_sb.free_inodes) {
@@ -519,7 +521,9 @@
 		inode->i_rdev = to_kdev_t(dev);
 		inode->i_mapping->a_ops = &shmem_aops;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
-		spin_lock_init (&inode->u.shmem_i.lock);
+		info = &inode->u.shmem_i;
+		spin_lock_init (&info->lock);
+		sema_init (&info->sem, 1);
 		switch (mode & S_IFMT) {
 		default:
 			init_special_inode(inode, mode, dev);
@@ -549,6 +553,7 @@
 shmem_file_write(struct file *file,const char *buf,size_t count,loff_t *ppos)
 {
 	struct inode	*inode = file->f_dentry->d_inode; 
+	struct shmem_inode_info *info;
 	unsigned long	limit = current->rlim[RLIMIT_FSIZE].rlim_cur;
 	loff_t		pos;
 	struct page	*page;
@@ -624,7 +629,11 @@
 			__get_user(dummy, buf+bytes-1);
 		}
 
-		page = shmem_getpage_locked(inode, index);
+		info = &inode->u.shmem_i;
+		down (&info->sem);
+		page = shmem_getpage_locked(info, inode, index);
+		up (&info->sem);
+
 		status = PTR_ERR(page);
 		if (IS_ERR(page))
 			break;
@@ -635,7 +644,6 @@
 		}
 
 		kaddr = kmap(page);
-// can this do a truncated write? cr
 		status = copy_from_user(kaddr+offset, buf, bytes);
 		kunmap(page);
 		if (status)
@@ -932,7 +940,7 @@
 		
 	inode = dentry->d_inode;
 	down(&inode->i_sem);
-	page = shmem_getpage_locked(inode, 0);
+	page = shmem_getpage_locked(&inode->u.shmem_i, inode, 0);
 	if (IS_ERR(page))
 		goto fail;
 	kaddr = kmap(page);

[-- Attachment #3: encapsulate access to shmem_inode_info --]
[-- Type: text/plain, Size: 6178 bytes --]

diff -uNr 2.4.4-mmap_write/include/linux/shmem_fs.h 2.4.4-mmap_write-SHMEM_I/include/linux/shmem_fs.h
--- 2.4.4-mmap_write/include/linux/shmem_fs.h	Tue May  1 20:02:00 2001
+++ 2.4.4-mmap_write-SHMEM_I/include/linux/shmem_fs.h	Tue May  1 20:06:10 2001
@@ -18,14 +18,15 @@
 } swp_entry_t;
 
 struct shmem_inode_info {
-	spinlock_t	lock;
-	struct semaphore sem;
-	unsigned long	max_index;
-	swp_entry_t	i_direct[SHMEM_NR_DIRECT]; /* for the first blocks */
-	swp_entry_t   **i_indirect; /* doubly indirect blocks */
-	unsigned long	swapped;
-	int		locked;     /* into memory */
+	spinlock_t		lock;
+	struct semaphore 	sem;
+	unsigned long		max_index;
+	swp_entry_t		i_direct[SHMEM_NR_DIRECT]; /* for the first blocks */
+	swp_entry_t           **i_indirect; /* doubly indirect blocks */
+	unsigned long		swapped;
+	int			locked;     /* into memory */
 	struct list_head	list;
+	struct inode	       *inode;
 };
 
 struct shmem_sb_info {
@@ -35,5 +36,7 @@
 	unsigned long free_inodes;  /* How many are left for allocation */
 	spinlock_t    stat_lock;
 };
+
+#define SHMEM_I(inode)  (&inode->u.shmem_i)
 
 #endif
diff -uNr 2.4.4-mmap_write/ipc/shm.c 2.4.4-mmap_write-SHMEM_I/ipc/shm.c
--- 2.4.4-mmap_write/ipc/shm.c	Wed Apr 11 12:36:47 2001
+++ 2.4.4-mmap_write-SHMEM_I/ipc/shm.c	Tue May  1 20:06:10 2001
@@ -348,6 +348,7 @@
 
 static void shm_get_stat (unsigned long *rss, unsigned long *swp) 
 {
+	struct shmem_inode_info *info;
 	int i;
 
 	*rss = 0;
@@ -361,10 +362,11 @@
 		if(shp == NULL)
 			continue;
 		inode = shp->shm_file->f_dentry->d_inode;
-		spin_lock (&inode->u.shmem_i.lock);
+		info = SHMEM_I(inode);
+		spin_lock (&info->lock);
 		*rss += inode->i_mapping->nrpages;
-		*swp += inode->u.shmem_i.swapped;
-		spin_unlock (&inode->u.shmem_i.lock);
+		*swp += info->swapped;
+		spin_unlock (&info->lock);
 	}
 }
 
diff -uNr 2.4.4-mmap_write/mm/shmem.c 2.4.4-mmap_write-SHMEM_I/mm/shmem.c
--- 2.4.4-mmap_write/mm/shmem.c	Tue May  1 20:02:00 2001
+++ 2.4.4-mmap_write-SHMEM_I/mm/shmem.c	Wed May  2 16:46:00 2001
@@ -73,7 +73,7 @@
 	unsigned long freed;
 
 	freed = (inode->i_blocks/BLOCKS_PER_PAGE) -
-		(inode->i_mapping->nrpages + inode->u.shmem_i.swapped);
+		(inode->i_mapping->nrpages + SHMEM_I(inode)->swapped);
 	if (freed){
 		struct shmem_sb_info * info = &inode->i_sb->u.shmem_sb;
 		inode->i_blocks -= freed*BLOCKS_PER_PAGE;
@@ -159,7 +159,7 @@
 	unsigned long index, start;
 	unsigned long freed = 0;
 	swp_entry_t **base, **ptr, **last;
-	struct shmem_inode_info * info = &inode->u.shmem_i;
+	struct shmem_inode_info * info = SHMEM_I(inode);
 
 	down(&info->sem);
 	inode->i_ctime = inode->i_mtime = CURRENT_TIME;
@@ -206,7 +206,7 @@
 	struct shmem_sb_info *info = &inode->i_sb->u.shmem_sb;
 
 	spin_lock (&shmem_ilock);
-	list_del (&inode->u.shmem_i.list);
+	list_del (&SHMEM_I(inode)->list);
 	spin_unlock (&shmem_ilock);
 	inode->i_size = 0;
 	shmem_truncate (inode);
@@ -239,7 +239,7 @@
 		goto out;
 	
 	inode = page->mapping->host;
-	info = &inode->u.shmem_i;
+	info = SHMEM_I(inode);
 	swap = __get_swap_page(2);
 	error = -ENOMEM;
 	if (!swap.val)
@@ -407,7 +407,7 @@
 		page_cache_release(*ptr);
 	}
 
-	info = &inode->u.shmem_i;
+	info = SHMEM_I(inode);
 	down (&info->sem);
 	/* retest we may have slept */  	
 
@@ -415,7 +415,7 @@
 	if (inode->i_size < (loff_t) idx * PAGE_CACHE_SIZE)
 		goto failed;
 
-	*ptr = shmem_getpage_locked(&inode->u.shmem_i, inode, idx);
+	*ptr = shmem_getpage_locked(info, inode, idx);
 	if (IS_ERR (*ptr))
 		goto failed;
 
@@ -462,7 +462,7 @@
 void shmem_lock(struct file * file, int lock)
 {
 	struct inode * inode = file->f_dentry->d_inode;
-	struct shmem_inode_info * info = &inode->u.shmem_i;
+	struct shmem_inode_info * info = SHMEM_I(inode);
 	struct page * page;
 	unsigned long idx, size;
 
@@ -521,7 +521,8 @@
 		inode->i_rdev = to_kdev_t(dev);
 		inode->i_mapping->a_ops = &shmem_aops;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
-		info = &inode->u.shmem_i;
+		info = SHMEM_I(inode);
+		info->inode = inode;
 		spin_lock_init (&info->lock);
 		sema_init (&info->sem, 1);
 		switch (mode & S_IFMT) {
@@ -542,7 +543,7 @@
 			break;
 		}
 		spin_lock (&shmem_ilock);
-		list_add (&inode->u.shmem_i.list, &shmem_inodes);
+		list_add (&SHMEM_I(inode)->list, &shmem_inodes);
 		spin_unlock (&shmem_ilock);
 	}
 	return inode;
@@ -629,7 +630,7 @@
 			__get_user(dummy, buf+bytes-1);
 		}
 
-		info = &inode->u.shmem_i;
+		info = SHMEM_I(inode);
 		down (&info->sem);
 		page = shmem_getpage_locked(info, inode, index);
 		up (&info->sem);
@@ -658,8 +659,8 @@
 			buf += bytes;
 			if (pos > inode->i_size) 
 				inode->i_size = pos;
-			if (inode->u.shmem_i.max_index <= index)
-				inode->u.shmem_i.max_index = index+1;
+			if (info->max_index <= index)
+				info->max_index = index+1;
 
 		}
 unlock:
@@ -940,7 +941,7 @@
 		
 	inode = dentry->d_inode;
 	down(&inode->i_sem);
-	page = shmem_getpage_locked(&inode->u.shmem_i, inode, 0);
+	page = shmem_getpage_locked(SHMEM_I(inode), inode, 0);
 	if (IS_ERR(page))
 		goto fail;
 	kaddr = kmap(page);
@@ -1220,12 +1221,11 @@
 	return -1;
 }
 
-static int shmem_unuse_inode (struct inode *inode, swp_entry_t entry, struct page *page)
+static int shmem_unuse_inode (struct shmem_inode_info *info, swp_entry_t entry, struct page *page)
 {
 	swp_entry_t **base, **ptr;
 	unsigned long idx;
 	int offset;
-	struct shmem_inode_info *info = &inode->u.shmem_i;
 	
 	idx = 0;
 	spin_lock (&info->lock);
@@ -1246,7 +1246,7 @@
 	spin_unlock (&info->lock);
 	return 0;
 found:
-	add_to_page_cache(page, inode->i_mapping, offset + idx);
+	add_to_page_cache(page, info->inode->i_mapping, offset + idx);
 	set_page_dirty(page);
 	SetPageUptodate(page);
 	UnlockPage(page);
@@ -1261,13 +1261,13 @@
 void shmem_unuse(swp_entry_t entry, struct page *page)
 {
 	struct list_head *p;
-	struct inode * inode;
+	struct shmem_inode_info * info;
 
 	spin_lock (&shmem_ilock);
 	list_for_each(p, &shmem_inodes) {
-		inode = list_entry(p, struct inode, u.shmem_i.list);
+		info = list_entry(p, struct shmem_inode_info, list);
 
-		if (shmem_unuse_inode(inode, entry, page))
+		if (shmem_unuse_inode(info, entry, page))
 			break;
 	}
 	spin_unlock (&shmem_ilock);

[-- Attachment #4: inline symlinks --]
[-- Type: text/plain, Size: 4567 bytes --]

diff -uNr 2.4.4-mmap_write-SHMEM_I/mm/shmem.c 2.4.4-mmap_write-SHMEM_I-symlink/mm/shmem.c
--- 2.4.4-mmap_write-SHMEM_I/mm/shmem.c	Fri May  4 21:32:22 2001
+++ 2.4.4-mmap_write-SHMEM_I-symlink/mm/shmem.c	Fri May  4 21:37:34 2001
@@ -41,7 +41,6 @@
 static struct inode_operations shmem_inode_operations;
 static struct file_operations shmem_dir_operations;
 static struct inode_operations shmem_dir_inode_operations;
-static struct inode_operations shmem_symlink_inode_operations;
 static struct vm_operations_struct shmem_vm_ops;
 
 LIST_HEAD (shmem_inodes);
@@ -205,11 +204,13 @@
 {
 	struct shmem_sb_info *info = &inode->i_sb->u.shmem_sb;
 
-	spin_lock (&shmem_ilock);
-	list_del (&SHMEM_I(inode)->list);
-	spin_unlock (&shmem_ilock);
 	inode->i_size = 0;
-	shmem_truncate (inode);
+	if (inode->i_op->truncate == shmem_truncate){ 
+		spin_lock (&shmem_ilock);
+		list_del (&SHMEM_I(inode)->list);
+		spin_unlock (&shmem_ilock);
+		shmem_truncate(inode);
+	}
 	spin_lock (&info->stat_lock);
 	info->free_inodes++;
 	spin_unlock (&info->stat_lock);
@@ -532,6 +533,9 @@
 		case S_IFREG:
 			inode->i_op = &shmem_inode_operations;
 			inode->i_fop = &shmem_file_operations;
+			spin_lock (&shmem_ilock);
+			list_add (&SHMEM_I(inode)->list, &shmem_inodes);
+			spin_unlock (&shmem_ilock);
 			break;
 		case S_IFDIR:
 			inode->i_nlink++;
@@ -539,17 +543,17 @@
 			inode->i_fop = &shmem_dir_operations;
 			break;
 		case S_IFLNK:
-			inode->i_op = &shmem_symlink_inode_operations;
 			break;
 		}
-		spin_lock (&shmem_ilock);
-		list_add (&SHMEM_I(inode)->list, &shmem_inodes);
-		spin_unlock (&shmem_ilock);
 	}
 	return inode;
 }
 
 #ifdef CONFIG_TMPFS
+
+static struct inode_operations shmem_symlink_inode_operations;
+static struct inode_operations shmem_symlink_inline_operations;
+
 static ssize_t
 shmem_file_write(struct file *file,const char *buf,size_t count,loff_t *ppos)
 {
@@ -930,33 +934,54 @@
 	struct inode *inode;
 	struct page *page;
 	char *kaddr;
+	struct shmem_inode_info * info;
 
 	error = shmem_mknod(dir, dentry, S_IFLNK | S_IRWXUGO, 0);
 	if (error)
 		return error;
 
-	len = strlen(symname);
+	len = strlen(symname) + 1;
 	if (len > PAGE_SIZE)
 		return -ENAMETOOLONG;
-		
+
 	inode = dentry->d_inode;
-	down(&inode->i_sem);
-	page = shmem_getpage_locked(SHMEM_I(inode), inode, 0);
-	if (IS_ERR(page))
-		goto fail;
-	kaddr = kmap(page);
-	memcpy(kaddr, symname, len);
-	kunmap(page);
+	info = SHMEM_I(inode);
 	inode->i_size = len;
-	SetPageDirty(page);
-	UnlockPage(page);
-	page_cache_release(page);
-	up(&inode->i_sem);
+	if (len <= sizeof(struct shmem_inode_info)) {
+		/* do it inline */
+		memcpy(info, symname, len);
+		inode->i_op = &shmem_symlink_inline_operations;
+	} else {
+		spin_lock (&shmem_ilock);
+		list_add (&info->list, &shmem_inodes);
+		spin_unlock (&shmem_ilock);
+		down(&inode->i_sem);
+		page = shmem_getpage_locked(info, inode, 0);
+		if (IS_ERR(page)) {
+			up(&inode->i_sem);
+			return PTR_ERR(page);
+		}
+		kaddr = kmap(page);
+		memcpy(kaddr, symname, len);
+		kunmap(page);
+		SetPageDirty(page);
+		UnlockPage(page);
+		page_cache_release(page);
+		up(&inode->i_sem);
+		inode->i_op = &shmem_symlink_inode_operations;
+	}
 	dir->i_ctime = dir->i_mtime = CURRENT_TIME;
 	return 0;
-fail:
-	up(&inode->i_sem);
-	return PTR_ERR(page);
+}
+
+static int shmem_readlink_inline(struct dentry *dentry, char *buffer, int buflen)
+{
+	return vfs_readlink(dentry,buffer,buflen, (const char *)SHMEM_I(dentry->d_inode));
+}
+
+static int shmem_follow_link_inline(struct dentry *dentry, struct nameidata *nd)
+{
+	return vfs_follow_link(nd, (const char *)SHMEM_I(dentry->d_inode));
 }
 
 static int shmem_readlink(struct dentry *dentry, char *buffer, int buflen)
@@ -986,6 +1011,17 @@
 	return res;
 }
 
+static struct inode_operations shmem_symlink_inline_operations = {
+	readlink:	shmem_readlink_inline,
+	follow_link:	shmem_follow_link_inline,
+};
+
+static struct inode_operations shmem_symlink_inode_operations = {
+	truncate:	shmem_truncate,
+	readlink:	shmem_readlink,
+	follow_link:	shmem_follow_link,
+};
+
 static int shmem_parse_options(char *options, int *mode, unsigned long * blocks, unsigned long *inodes)
 {
 	char *this_char, *value;
@@ -1118,14 +1154,6 @@
 
 static struct inode_operations shmem_inode_operations = {
 	truncate:	shmem_truncate,
-};
-
-static struct inode_operations shmem_symlink_inode_operations = {
-	truncate:	shmem_truncate,
-#ifdef CONFIG_TMPFS
-	readlink:	shmem_readlink,
-	follow_link:	shmem_follow_link,
-#endif
 };
 
 static struct file_operations shmem_dir_operations = {

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2001-05-06 13:57 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-04-22 16:32 hundreds of mount --bind mountpoints? David L. Parsley
2001-04-22 16:41 ` Alexander Viro
2001-04-23 11:43 ` Christoph Rohland
2001-04-23 13:17   ` Ingo Oeser
2001-04-23 14:54     ` Christoph Rohland
2001-04-23 15:23       ` Ingo Oeser
2001-04-23 15:36         ` Alexander Viro
2001-04-23 20:45           ` Ingo Oeser
2001-04-23 20:56             ` Christoph Hellwig
2001-04-23 22:00               ` Ingo Oeser
2001-04-23 22:10                 ` Alexander Viro
2001-04-23 22:47             ` Andreas Dilger
2001-04-24  1:37             ` Jan Harkes
2001-04-24  2:53               ` Alexander Viro
2001-04-24  9:58                 ` David Woodhouse
2001-04-24 10:07                   ` Alexander Viro
2001-04-24 10:19                     ` David Woodhouse
2001-04-24 10:35                     ` Christoph Rohland
2001-04-24 10:54                       ` Alexander Viro
2001-04-24 12:51                         ` David Woodhouse
2001-04-24 13:34                         ` Christoph Rohland
2001-04-24 16:52                           ` David L. Parsley
2001-05-05  7:48                             ` [Patch] inline symlinks for tmpfs Christoph Rohland
2001-05-04 19:17                           ` [Patch] encapsulate shmem access to shmem_inode_info Christoph Rohland
     [not found]                             ` <3AF405FC.11D37FEB@bellsouth.net>
2001-05-06 13:58                               ` [Resend] Collection of tmpfs patches Christoph Rohland
2001-04-24 16:04                         ` Can't read SCSI TAPE Masaki Tsuji
2001-04-24 18:36                         ` hundreds of mount --bind mountpoints? Andreas Dilger
2001-04-24 18:49                           ` Alexander Viro
2001-04-24 19:11                             ` Andreas Dilger
2001-04-24 22:01                             ` Ingo Oeser
2001-04-24 21:59                 ` Trond Myklebust
2001-04-24 22:09                   ` Alexander Viro
2001-04-24 22:31                   ` Trond Myklebust
2001-04-23 21:19           ` Richard Gooch
2001-04-23 22:42             ` Albert D. Cahalan
2001-04-23 22:49             ` Richard Gooch
2001-04-23 23:07               ` Alexander Viro
2001-04-23 23:24                 ` Rik van Riel
2001-04-23 23:13               ` Richard Gooch
2001-04-24  6:33           ` Christoph Rohland
2001-04-23 14:08   ` David L. Parsley
2001-04-23 14:14     ` Alexander Viro
