* JFFS3 memory consumption
@ 2005-01-26  8:44 Artem B. Bityuckiy
  2005-01-26  9:10 ` David Woodhouse
  0 siblings, 1 reply; 29+ messages in thread
From: Artem B. Bityuckiy @ 2005-01-26  8:44 UTC (permalink / raw)
  To: MTD List

Hi, how about discussing one more problem?

JFFS2 stores everything on flash in special container data structures
called nodes. Nodes have different types and different lengths, and may
be situated anywhere within a block.

As David Woodhouse said, JFFS2 has no structure on flash. There are no
specific addresses one may access to get specific JFFS2 information.
For example, most file systems have a superblock data structure which
may be found at a specified address, and from which information about
the whole file system may be read. JFFS2 has no superblock. All it has
is the set of nodes. Nodes may be placed at any offset within the flash
device, so they may have any position. JFFS2 just writes nodes
sequentially.

The above design implies two things:

1. When mounting JFFS2, the flash medium must be scanned. Since we have
no superblock, we must scan the flash device to identify the offsets of
the nodes, read their header information, and only then "build" the
file system tree. This is why the mount process is slow in JFFS2, and
why, roughly speaking, the mount time grows linearly with the file
system size. But that is not the subject of this message.

2. Nodes... They are the only JFFS2 data containers. They have no
fixed position on the flash chip: they are moved by the Garbage
Collector from time to time, so their positions change dynamically.
This is a problem. To handle it, JFFS2 needs to keep track of all its
nodes in memory, since it is obviously unacceptable to scan the whole
flash device to find the positions of all of an inode's nodes when the
user issues, say, a file read operation. JFFS2 needs to find all the
needed nodes on flash quickly.

This is why JFFS2 keeps a small object in RAM for each node on flash.
This means the more nodes we have on flash (i.e., the more data in the
JFFS2 file system), the more RAM JFFS2 needs to keep track of them. The
memory occupied by these objects is called "in-core" memory. The RAM
objects which correspond to nodes are called "node_ref" objects and
have the type "struct jffs2_raw_node_ref".

Roughly speaking, the JFFS2 in-core memory consumption is a linear
function of the file system size.

This is the problem I want to discuss.

The JFFS2 design is ideal from the wear-leveling viewpoint, it is
simple and clean, and it is very robust across power losses. But it has
one major problem: scalability. Both the mount time and the RAM usage
depend linearly on the file system size. Given the tendency of flash
sizes to grow, this may become a real problem.

Unfortunately, I believe this is a fundamental JFFS2 problem, caused by
its "pure" log-structured design. Denote the above dependencies as
Tm = K1*N and M = K2*N, where
        - Tm is the mount time;
        - M is the required in-core memory;
        - K1, K2 are some coefficients;
        - N is the number of nodes on flash.

I think with the JFFS2 design we cannot have any dependency other than
linear; we can only make K1 and K2 smaller.
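
For a sense of scale (illustrative numbers only, not measurements): if
a node_ref costs 16 bytes and an average node carries 512 bytes of
data, a 128MiB file system holds about 256K nodes, i.e. about 4MiB of
in-core memory just for node_refs - and doubling the amount of data on
flash doubles that.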

K1 is decreased well by Ferenc Havasi's "summary" patch. It would be
nice to decrease K2 as well. Any ideas how to do so?

--
Best Regards,
Artem B. Bityuckiy,
St.-Petersburg, Russia.


* Re: JFFS3 memory consumption
  2005-01-26  8:44 JFFS3 memory consumption Artem B. Bityuckiy
@ 2005-01-26  9:10 ` David Woodhouse
  2005-01-26 15:57   ` Artem B. Bityuckiy
  2005-01-27 17:35   ` Artem B. Bityuckiy
  0 siblings, 2 replies; 29+ messages in thread
From: David Woodhouse @ 2005-01-26  9:10 UTC (permalink / raw)
  To: Artem B. Bityuckiy; +Cc: MTD List

On Wed, 2005-01-26 at 08:44 +0000, Artem B. Bityuckiy wrote:
> K1 is decreased well by Ferenc Havasi's "summary" patch. It would be
> nice to decrease K2 as well. Any ideas how to do so?

For a start, we can increase the maximum node size. Limiting it to
PAGE_SIZE means we have a larger number of nodes than we'd otherwise
have. If we quadruple that, we'll probably end up with about a third as
many nodes. 

We can also look at true write-behind caching, if we're a _lot_ more
careful about free space than we are now. To have data outstanding in
the page cache, we need to be able to guarantee that we have enough free
space to write it out, and we need to be able to write it out _without_
allocating memory. That includes any memory which might be required for
GC. Doing that would allow us to coalesce writes so we fill a page (or a
full node size) more often.
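
Roughly the accounting that needs, as a user-space sketch with invented
names - reserve worst-case space at write() time, consume it at
writeout time:

#include <stdint.h>
#include <errno.h>

/* Hypothetical write-behind space accounting; none of these names are
 * real JFFS2/JFFS3 API. */
struct wb_space {
	uint32_t free;      /* flash bytes not yet promised to anyone */
	uint32_t reserved;  /* bytes promised to dirty page-cache data + GC */
};

/* Called from write(): promise worst-case flash space for one dirty
 * page before letting it sit in the page cache. */
static int wb_reserve(struct wb_space *s, uint32_t worst_case_len)
{
	if (s->free < worst_case_len)
		return -ENOSPC;          /* caller must flush or fail */
	s->free -= worst_case_len;
	s->reserved += worst_case_len;
	return 0;
}

/* Called from writeout: consume the reservation; the writeout path
 * itself uses preallocated buffers, so it never allocates memory. */
static void wb_commit(struct wb_space *s, uint32_t reserved_len,
		      uint32_t actual_len)
{
	s->reserved -= reserved_len;
	s->free += reserved_len - actual_len; /* return the over-estimate */
}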

Removing the __totlen field from the node ref is also a 25% improvement
in its RAM usage.

Also, perhaps we could use _arrays_ of nodes instead of lists? That
would allow us to drop one of the list pointers. Probably it's easiest
to do this for the next_phys list -- make it an array in physical order.
I'm not sure how workable that would be. You'd need to deal with
reallocation, and it'd be hard to remove obsolete elements from the
middle; you'd probably just have to leave them there.

Perhaps we can use values other than pointers in our next_phys and
next_in_ino lists. These objects are in slab caches, and are quite
neatly laid out. We could find a way of referencing them which doesn't
require a whole 32 bits (or 64 on some machines).

Another idea which I don't think is workable but which someone may be
able to get something from... can we reduce the 'offset' to something
smaller than 32 bits? Make it an offset within the block instead of from
the start of the medium? Probably hard because it's not that quick to
_find_ the block given just a raw_node_ref*. But if we've already put
all the nodes for a given block into an _array_ rather than a linked
list then maybe...
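
To make the last two ideas concrete, a sketch (invented names, not real
JFFS2 code) of a per-block array holding 16-bit in-block offsets - with
64KiB eraseblocks the offset within the block needs only 16 bits:

#include <stdint.h>

/* Sketch only: per-block array instead of the next_phys list, with
 * offsets stored relative to the block.  Obsolete entries stay in
 * place, marked, as suggested above. */
struct nref {
	uint16_t in_block_offset; /* offset within the eraseblock */
	uint16_t flags;           /* e.g. an "obsolete" bit, left in place */
};

struct block_refs {
	uint32_t phys_base;  /* block's start address on the medium */
	uint16_t used;       /* slots filled, kept in physical order */
	uint16_t alloc;      /* slots allocated (array may need realloc) */
	struct nref *refs;
};

/* The full 32-bit flash address is reconstructed on demand. */
static inline uint32_t nref_addr(const struct block_refs *b,
				 const struct nref *r)
{
	return b->phys_base + r->in_block_offset;
}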

There are some implementable ideas in there, and some more random ones. 
It's all still linear though.

-- 
dwmw2


* Re: JFFS3 memory consumption
  2005-01-26  9:10 ` David Woodhouse
@ 2005-01-26 15:57   ` Artem B. Bityuckiy
  2005-01-26 17:06     ` Josh Boyer
  2005-01-27 17:35   ` Artem B. Bityuckiy
  1 sibling, 1 reply; 29+ messages in thread
From: Artem B. Bityuckiy @ 2005-01-26 15:57 UTC (permalink / raw)
  To: David Woodhouse; +Cc: MTD List

On Wed, 26 Jan 2005, David Woodhouse wrote:

> On Wed, 2005-01-26 at 08:44 +0000, Artem B. Bityuckiy wrote:
> > K1 is decreased well by Ferenc Havasi's "summary" patch. It would
> > be nice to decrease K2 as well. Any ideas how to do so?
> 
> For a start, we can increase the maximum node size. Limiting it to
> PAGE_SIZE means we have a larger number of nodes than we'd otherwise
> have. If we quadruple that, we'll probably end up with about a third as
> many nodes. 
I thought about having several 4K chunks within one node. Thus we keep
reads fast but have fewer nodes. Did you mean this?

> We can also look at true write-behind caching, if we're a _lot_ more
> careful about free space than we are now. To have data outstanding in
> the page cache, we need to be able to guarantee that we have enough free
> space to write it out, and we need to be able to write it out _without_
> allocating memory. That includes any memory which might be required for
> GC. Doing that would allow us to coalesce writes so we fill a page (or a
> full node size) more often.
Yes, I agree - good idea! This would be a very good improvement. But I
suppose this is a subject for a separate big discussion.

> Removing the __totlen field from the node ref is also a 25% improvement
> in its ram usage.
By the way, I have already done this in JFFS3.

> 
> Also, perhaps we could use _arrays_ of nodes instead of lists? That
> would allow us to drop one of the list pointers. Probably it's easiest
> to do this for the next_phys list -- make it an array in physical order.
> I'm not sure how workable that would be. You'd need to deal with
> reallocation, and it'd be hard to remove obsolete elements from the
> middle; you'd probably just have to leave them there.
Possibly. But IMHO this is not the best approach. How would we remove
one node from the list? I don't know.

I suppose we do not really need the per-block node_ref list at all. We
now have summaries and may read the summary when the per-block list is
needed. What do you think? Do you see any potential problems?

> Perhaps we can use values other than pointers in our next_phys and
> next_in_ino lists. These objects are in slab caches, and are quite
> neatly laid out. We could find a way of referencing them which doesn't
> require a whole 32 bits (or 64 on some machines).
No, this is not always true. Imagine objects from different slabs:
they are not "close" to each other. Moreover, we would need to take
redzoning into account, which may be switched on or off. This is not
JFFS3's business, IMHO.


> Another idea which I don't think is workable but which someone may be
> able to get something from... can we reduce the 'offset' to something
> smaller than 32 bits? Make it an offset within the block instead of from
> the start of the medium? Probably hard because it's not that quick to
> _find_ the block given just a raw_node_ref*. But if we've already put
> all the nodes for a given block into an _array_ rather than a linked
> list then maybe...
Hmm, maybe. But I suspect this needs too much effort, while the result
would not be very noticeable.
 

--
Best Regards,
Artem B. Bityuckiy,
St.-Petersburg, Russia.


* Re: JFFS3 memory consumption
  2005-01-26 15:57   ` Artem B. Bityuckiy
@ 2005-01-26 17:06     ` Josh Boyer
  2005-01-26 17:11       ` Artem B. Bityuckiy
  0 siblings, 1 reply; 29+ messages in thread
From: Josh Boyer @ 2005-01-26 17:06 UTC (permalink / raw)
  To: Artem B. Bityuckiy; +Cc: MTD List, David Woodhouse

On Wed, 2005-01-26 at 09:57, Artem B. Bityuckiy wrote:
> On Wed, 26 Jan 2005, David Woodhouse wrote:
> 
> > On Wed, 2005-01-26 at 08:44 +0000, Artem B. Bityuckiy wrote:
> > > K1 is decreased well by Ferenc Havasi's "summary" patch. It
> > > would be nice to decrease K2 as well. Any ideas how to do so?
> > 
> > For a start, we can increase the maximum node size. Limiting it to
> > PAGE_SIZE means we have a larger number of nodes than we'd otherwise
> > have. If we quadruple that, we'll probably end up with about a third as
> > many nodes. 
> I thought about having several 4K chunks within one node. Thus we
> keep reads fast but have fewer nodes. Did you mean this?

If you use larger "chunks" you'll get better overall compression.  E.g.
a 64KiB node should compress better than a 4KiB node.  This is partially
how squashfs achieves better compression than cramfs.
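
This is easy to see with plain zlib in user space; a quick sketch
(synthetic data, so the exact numbers mean little; link with -lz):

#include <stdio.h>
#include <string.h>
#include <zlib.h>

/* Deflate one 64KiB buffer, then the same data as sixteen 4KiB
 * pieces, and compare the totals. */
int main(void)
{
	enum { BIG = 64 * 1024, SMALL = 4 * 1024 };
	static unsigned char in[BIG], out[BIG * 2];
	uLongf outlen;
	uLong total_small = 0;
	int i;

	for (i = 0; i < BIG; i++)          /* mildly compressible test data */
		in[i] = (unsigned char)(i / 7);

	outlen = sizeof(out);
	compress(out, &outlen, in, BIG);
	printf("one 64KiB node : %lu bytes\n", (unsigned long)outlen);

	for (i = 0; i < BIG / SMALL; i++) {
		outlen = sizeof(out);
		compress(out, &outlen, in + (uLong)i * SMALL, SMALL);
		total_small += outlen;
	}
	printf("16 x 4KiB nodes: %lu bytes\n", (unsigned long)total_small);
	return 0;
}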

So the end result is better compression and fewer nodes.

josh


* Re: JFFS3 memory consumption
  2005-01-26 17:06     ` Josh Boyer
@ 2005-01-26 17:11       ` Artem B. Bityuckiy
  2005-01-26 18:15         ` David Woodhouse
  0 siblings, 1 reply; 29+ messages in thread
From: Artem B. Bityuckiy @ 2005-01-26 17:11 UTC (permalink / raw)
  To: Josh Boyer; +Cc: MTD List, David Woodhouse

On Wed, 26 Jan 2005, Josh Boyer wrote:

> On Wed, 2005-01-26 at 09:57, Artem B. Bityuckiy wrote:
> > On Wed, 26 Jan 2005, David Woodhouse wrote:
> > 
> > > On Wed, 2005-01-26 at 08:44 +0000, Artem B. Bityuckiy wrote:
> > > > K1 is decreased well by Ferenc Havasi's "summary" patch. It
> > > > would be nice to decrease K2 as well. Any ideas how to do so?
> > > 
> > > For a start, we can increase the maximum node size. Limiting it to
> > > PAGE_SIZE means we have a larger number of nodes than we'd otherwise
> > > have. If we quadruple that, we'll probably end up with about a third as
> > > many nodes. 
> > I thought about having several 4K chunks within one node. Thus we
> > keep reads fast but have fewer nodes. Did you mean this?
> 
> If you use larger "chunks" you'll get better overall compression.
> E.g. a 64KiB node should compress better than a 4KiB node. This is
> partially how squashfs achieves better compression than cramfs.
Surely, but what about read degradation?

> 
> So the end result is better compression and fewer nodes.
> 
> josh
> 
> 

--
Best Regards,
Artem B. Bityuckiy,
St.-Petersburg, Russia.


* Re: JFFS3 memory consumption
  2005-01-26 17:11       ` Artem B. Bityuckiy
@ 2005-01-26 18:15         ` David Woodhouse
  2005-01-26 18:33           ` Artem B. Bityuckiy
  0 siblings, 1 reply; 29+ messages in thread
From: David Woodhouse @ 2005-01-26 18:15 UTC (permalink / raw)
  To: Artem B. Bityuckiy; +Cc: MTD List

On Wed, 2005-01-26 at 17:11 +0000, Artem B. Bityuckiy wrote:
> > If you use larger "chunks" you'll get better overall compression.
> > E.g. a 64KiB node should compress better than a 4KiB node. This is
> > partially how squashfs achieves better compression than cramfs.

> Surely, but what about read degradation?

See what zisofs does. If you populate adjacent pages when decompressing,
rather than just throwing away the extra data you had to decompress
anyway, then it's not so bad.
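
Very roughly, ignoring locking, highmem corner cases and error
handling, something like this - the struct and helper names here are
invented, only the page-cache calls are real:

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/highmem.h>
#include <linux/string.h>

struct big_node {                 /* hypothetical multi-page node */
	loff_t file_offset;       /* where its data starts in the file */
	int pages;                /* how many pages of data it carries */
};

/* After decompressing the whole node for one readpage(), hand the
 * neighbouring pages to the page cache instead of discarding them. */
static int fill_pages_from_big_node(struct inode *inode, struct page *page,
				    struct big_node *bn, unsigned char *decomp)
{
	pgoff_t first = bn->file_offset >> PAGE_SHIFT;
	int i;

	for (i = 0; i < bn->pages; i++) {
		struct page *p;
		void *dst;

		if (first + i == page->index)
			p = page;  /* the page readpage() was asked for */
		else
			p = grab_cache_page_nowait(inode->i_mapping, first + i);
		if (!p)
			continue;  /* neighbour busy or unavailable: skip it */

		dst = kmap(p);
		memcpy(dst, decomp + i * PAGE_SIZE, PAGE_SIZE);
		kunmap(p);
		SetPageUptodate(p);
		if (p != page) {
			unlock_page(p);
			page_cache_release(p);
		}
	}
	return 0;
}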

-- 
dwmw2


* Re: JFFS3 memory consumption
  2005-01-26 18:15         ` David Woodhouse
@ 2005-01-26 18:33           ` Artem B. Bityuckiy
  2005-01-26 18:55             ` David Woodhouse
  0 siblings, 1 reply; 29+ messages in thread
From: Artem B. Bityuckiy @ 2005-01-26 18:33 UTC (permalink / raw)
  To: David Woodhouse; +Cc: MTD List

On Wed, 26 Jan 2005, David Woodhouse wrote:

> On Wed, 2005-01-26 at 17:11 +0000, Artem B. Bityuckiy wrote:
> > > If you use larger "chunks" you'll get better overall compression.
> > > E.g. a 64KiB node should compress better than a 4KiB node. This
> > > is partially how squashfs achieves better compression than cramfs.
> 
> > Surely, but what about read degradation?
> 
> See what zisofs does. If you populate adjacent pages when decompressing,
> rather than just throwing away the extra data you had to decompress
> anyway, then it's not so bad.
Ah! I looked at isofs/compress.c:zisofs_readpage().
Looks like a good idea!

OK, we need to select how many pages we will have in one node. I
suppose 4-8 pages is enough. What number do you have in mind?
 

--
Best Regards,
Artem B. Bityuckiy,
St.-Petersburg, Russia.


* Re: JFFS3 memory consumption
  2005-01-26 18:33           ` Artem B. Bityuckiy
@ 2005-01-26 18:55             ` David Woodhouse
  0 siblings, 0 replies; 29+ messages in thread
From: David Woodhouse @ 2005-01-26 18:55 UTC (permalink / raw)
  To: Artem B. Bityuckiy; +Cc: MTD List

On Wed, 2005-01-26 at 18:33 +0000, Artem B. Bityuckiy wrote:
> OK, we need to select how many pages we will have in one node. I
> suppose 4-8 pages is enough. What number do you have in mind?

I think we relax the spec so that it's unlimited. I was imagining 4 or
8; we can see what works best.

-- 
dwmw2


* Re: JFFS3 memory consumption
  2005-01-26  9:10 ` David Woodhouse
  2005-01-26 15:57   ` Artem B. Bityuckiy
@ 2005-01-27 17:35   ` Artem B. Bityuckiy
  2005-01-27 20:23     ` Thomas Gleixner
  1 sibling, 1 reply; 29+ messages in thread
From: Artem B. Bityuckiy @ 2005-01-27 17:35 UTC (permalink / raw)
  To: David Woodhouse; +Cc: MTD List

On Wed, 26 Jan 2005, David Woodhouse wrote:
> We can also look at true write-behind caching, if we're a _lot_ more
> careful about free space than we are now. To have data outstanding in
> the page cache, we need to be able to guarantee that we have enough free
> space to write it out, and we need to be able to write it out _without_
> allocating memory. That includes any memory which might be required for
> GC. Doing that would allow us to coalesce writes so we fill a page (or a
> full node size) more often.

What do you think: if we implement write-back caching (I cannot yet
imagine how), will we still need to distinguish between "pristine" and
"normal" nodes?

Since JFFS2 is now write-through, we write small data chunks when there
are few changes in the cache. This is for space efficiency, right?

With a write-back cache, a lot of small changes will be merged into the
page and we may possibly just write it in full. Perhaps we need not
bother writing only part of a page.

The advantage of this is that we will not need to overcomplicate the GC
by making it merge "normal" nodes to produce "pristine" ones. All our
nodes will be pristine.

On a commit_write request we may also write several neighbouring pages
if they are marked dirty, thus having nodes with several pages.

If we have only 4K-multiple data chunks in nodes, we will even save
some RAM (both in-core and in the fragtree).

Comments?

--
Best Regards,
Artem B. Bityuckiy,
St.-Petersburg, Russia.


* Re: JFFS3 memory consumption
  2005-01-27 17:35   ` Artem B. Bityuckiy
@ 2005-01-27 20:23     ` Thomas Gleixner
  2005-01-27 22:35       ` David Woodhouse
  2005-01-28  8:50       ` Artem B. Bityuckiy
  0 siblings, 2 replies; 29+ messages in thread
From: Thomas Gleixner @ 2005-01-27 20:23 UTC (permalink / raw)
  To: Artem B. Bityuckiy; +Cc: MTD List, David Woodhouse

On Thu, 2005-01-27 at 17:35 +0000, Artem B. Bityuckiy wrote:
> On Wed, 26 Jan 2005, David Woodhouse wrote:
> > We can also look at true write-behind caching, if we're a _lot_ more
> > careful about free space than we are now. To have data outstanding in
> > the page cache, we need to be able to guarantee that we have enough free
> > space to write it out, and we need to be able to write it out _without_
> > allocating memory. That includes any memory which might be required for
> > GC. Doing that would allow us to coalesce writes so we fill a page (or a
> > full node size) more often.
> 
> What do you think: if we implement write-back caching (I cannot yet
> imagine how), will we still need to distinguish between "pristine"
> and "normal" nodes?
> 
> Since JFFS2 is now write-through, we write small data chunks when
> there are few changes in the cache. This is for space efficiency,
> right?

Keep it write-through!

Write caching is contrary to power-fail robustness.

It has always worked that way and many embedded users rely on this. If
you want writes to be cached, then make it an option and not the
default behaviour. But then you still have to think about a solution
for the non-cached case.

We discussed the two-writeblock approach at length. I think it is the
happy medium.

Writeblock A is used for GC and for direct writes of pristine nodes.

Writeblock B is used for direct write-through of normal nodes and all
the noise which happens when files are written in small chunks.

When B becomes full, it is immediately recycled into A, so the GC can
merge the tiny nodes and remove the obsolete ones.

This will also reduce memory consumption, as you have those noisy nodes
in only two blocks (the active B block and the previous B block, which
is actually being recycled).

And you still preserve the power-fail robustness.
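
As a sketch of the routing decision (names invented for illustration):

#include <stdint.h>

#define PAGE_SIZE 4096  /* illustration only */

enum wb_dest { WB_A, WB_B };

/* Pristine/GC output goes to writeblock A; small write-through nodes
 * and other noise go to writeblock B, which is recycled into A as
 * soon as it fills up. */
static enum wb_dest pick_writeblock(uint32_t data_len, int from_gc)
{
	if (from_gc || (data_len > 0 && data_len % PAGE_SIZE == 0))
		return WB_A;
	return WB_B;
}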

> With a write-back cache, a lot of small changes will be merged into
> the page and we may possibly just write it in full. Perhaps we need
> not bother writing only part of a page.
>
> The advantage of this is that we will not need to overcomplicate the
> GC by making it merge "normal" nodes to produce "pristine" ones. All
> our nodes will be pristine.
> 
> On a commit_write request we may also write several neighbouring
> pages if they are marked dirty, thus having nodes with several pages.
> 
> If we have only 4K-multiple data chunks in nodes, we will even save
> some RAM (both in-core and in the fragtree).

That's just imagination. Analyse a couple of JFFS2 images to figure
out how many 4K-multiple data chunks you have. Those which exist
usually live in a part of the filesystem which is almost never updated
(/bin, /lib, ...) on embedded systems.

The parts where data are written frequently are usually logging, data
acquisition and databases. All of them have one thing in common: small
writes. And most of them are sensitive to data loss.

tglx


* Re: JFFS3 memory consumption
  2005-01-27 20:23     ` Thomas Gleixner
@ 2005-01-27 22:35       ` David Woodhouse
  2005-01-28  0:19         ` Thomas Gleixner
  2005-01-28  8:50       ` Artem B. Bityuckiy
  1 sibling, 1 reply; 29+ messages in thread
From: David Woodhouse @ 2005-01-27 22:35 UTC (permalink / raw)
  To: tglx; +Cc: MTD List

On Thu, 2005-01-27 at 21:23 +0100, Thomas Gleixner wrote:
> Keep it write-through!
> 
> Write caching is contrary to power-fail robustness.

No. O_SYNC, fsync() and sync() are there for a reason, and even on NAND
today you have to use them correctly according to POSIX if you want sane
results. We're not pandering to stupidity.

-- 
dwmw2


* Re: JFFS3 memory consumption
  2005-01-27 22:35       ` David Woodhouse
@ 2005-01-28  0:19         ` Thomas Gleixner
  2005-01-28  0:33           ` David Woodhouse
  0 siblings, 1 reply; 29+ messages in thread
From: Thomas Gleixner @ 2005-01-28  0:19 UTC (permalink / raw)
  To: David Woodhouse; +Cc: MTD List

On Thu, 2005-01-27 at 22:35 +0000, David Woodhouse wrote:
> On Thu, 2005-01-27 at 21:23 +0100, Thomas Gleixner wrote:
> > Keep it write-through!
> > 
> > Write caching is contrary to power-fail robustness.
> 
> No. O_SYNC, fsync() and sync() are there for a reason, and even on NAND
> today you have to use them correctly according to POSIX if you want sane
> results. We're not pandering to stupidity.

Sure. I'm aware of O_SYNC, fsync() and sync().

Please reread http://sources.redhat.com/jffs2/jffs2.pdf

tglx


* Re: JFFS3 memory consumption
  2005-01-28  0:19         ` Thomas Gleixner
@ 2005-01-28  0:33           ` David Woodhouse
  2005-01-28  0:44             ` Thomas Gleixner
  2005-01-28  1:53             ` Josh Boyer
  0 siblings, 2 replies; 29+ messages in thread
From: David Woodhouse @ 2005-01-28  0:33 UTC (permalink / raw)
  To: tglx; +Cc: MTD List

On Fri, 2005-01-28 at 01:19 +0100, Thomas Gleixner wrote:
> Sure. I'm aware of O_SYNC, fsync() and sync().
> 
> Please reread http://sources.redhat.com/jffs2/jffs2.pdf

That isn't a promise that I'll continue to guarantee anything that POSIX
doesn't guarantee, especially in JFFS3.

I'm not going to object to a mount option which forces O_SYNC behaviour
for all files. But there's no need for that to be the default.

-- 
dwmw2


* Re: JFFS3 memory consumption
  2005-01-28  0:33           ` David Woodhouse
@ 2005-01-28  0:44             ` Thomas Gleixner
  2005-01-28  8:45               ` Artem B. Bityuckiy
  2005-01-28  1:53             ` Josh Boyer
  1 sibling, 1 reply; 29+ messages in thread
From: Thomas Gleixner @ 2005-01-28  0:44 UTC (permalink / raw)
  To: David Woodhouse; +Cc: MTD List

On Fri, 2005-01-28 at 00:33 +0000, David Woodhouse wrote:
> On Fri, 2005-01-28 at 01:19 +0100, Thomas Gleixner wrote:
> > Sure. I'm aware of O_SYNC, fsync() and sync().
> > 
> > Please reread http://sources.redhat.com/jffs2/jffs2.pdf
> 
> That isn't a promise that I'll continue to guarantee anything that POSIX
> doesn't guarantee, especially in JFFS3.
> 
> I'm not going to object to a mount option which forces O_SYNC behaviour
> for all files. But there's no need for that to be the default.

Agreed. 

If it is JFFS3 it can have O_NOSYNC as the default, but the O_SYNC
option still requires solving the underlying problem.

tglx


* Re: JFFS3 memory consumption
  2005-01-28  0:33           ` David Woodhouse
  2005-01-28  0:44             ` Thomas Gleixner
@ 2005-01-28  1:53             ` Josh Boyer
  1 sibling, 0 replies; 29+ messages in thread
From: Josh Boyer @ 2005-01-28  1:53 UTC (permalink / raw)
  To: David Woodhouse; +Cc: tglx, MTD List

On Fri, 2005-01-28 at 00:33 +0000, David Woodhouse wrote:
> On Fri, 2005-01-28 at 01:19 +0100, Thomas Gleixner wrote:
> > Sure. I'm aware of O_SYNC, fsync() and sync().
> > 
> > Please reread http://sources.redhat.com/jffs2/jffs2.pdf
> 
> That isn't a promise that I'll continue to guarantee anything that POSIX
> doesn't guarantee, especially in JFFS3.
> 
> I'm not going to object to a mount option which forces O_SYNC behaviour
> for all files. But there's no need for that to be the default.

Isn't that what the MS_SYNCHRONOUS flag is for?  I seem to recall using
that in a different filesystem once.  Something like:

mount -t <fstype> -o sync <dev> <mnt>

josh


* Re: JFFS3 memory consumption
  2005-01-28  0:44             ` Thomas Gleixner
@ 2005-01-28  8:45               ` Artem B. Bityuckiy
  2005-01-28  9:03                 ` Thomas Gleixner
  0 siblings, 1 reply; 29+ messages in thread
From: Artem B. Bityuckiy @ 2005-01-28  8:45 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: MTD List, David Woodhouse

> If it is JFFS3 it can have O_NOSYNC as the default, but the O_SYNC
> option still requires solving the underlying problem.

OK guys, it seems we have agreed that:
1. A write-back cache is a good idea.
2. For files opened with O_SYNC, and if the file system was mounted
with the -o sync option, we have to do write-through caching.

But my original proposal assumed we agree on that and was about a
slightly different thing. The question was: do we really want to
distinguish between pristine and non-pristine nodes?

A recap for those who do not remember JFFS2 internals: a pristine node
is a node which contains PAGE_SIZE (4K) bytes of data. These nodes may
quickly satisfy a read request - just unpack the node's data into the
page provided by the upper layer. Non-pristine nodes are nodes which
contain fewer than PAGE_SIZE bytes of data, or which contain 4K of data
but are partially overlapped by a newer node with more current data.

My idea is that with a write-back cache a lot of writes will first go
to the cache, and we may just always write in 4K fractions. No more
non-pristine nodes. All nodes are then 4K or a multiple of 4K.

Advantages:
Simpler GC. We no longer need to merge non-pristine nodes. When we
garbage collect one block, we do not produce dirt in another block
(because of merging). GC is much faster. And we no longer need to
iget() when we meet a non-pristine node in GC.

Drawbacks:
Yes, if we are forced to do synchronous IO, we will waste more space.
For example, if the user does 1-byte transactions, we write 4K every
time instead. But is this so bad?

Comments?

--
Best Regards,
Artem B. Bityuckiy,
St.-Petersburg, Russia.


* Re: JFFS3 memory consumption
  2005-01-27 20:23     ` Thomas Gleixner
  2005-01-27 22:35       ` David Woodhouse
@ 2005-01-28  8:50       ` Artem B. Bityuckiy
  1 sibling, 0 replies; 29+ messages in thread
From: Artem B. Bityuckiy @ 2005-01-28  8:50 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: MTD List, David Woodhouse

> That's just imagination. Analyse a couple of JFFS2 images to figure
> out how many 4K-multiple data chunks you have. Those which exist
> usually live in a part of the filesystem which is almost never
> updated (/bin, /lib, ...) on embedded systems.
Hmm, I tend to agree with you.

--
Best Regards,
Artem B. Bityuckiy,
St.-Petersburg, Russia.


* Re: JFFS3 memory consumption
  2005-01-28  8:45               ` Artem B. Bityuckiy
@ 2005-01-28  9:03                 ` Thomas Gleixner
  2005-01-28  9:23                   ` Artem B. Bityuckiy
  0 siblings, 1 reply; 29+ messages in thread
From: Thomas Gleixner @ 2005-01-28  9:03 UTC (permalink / raw)
  To: Artem B. Bityuckiy; +Cc: MTD List, David Woodhouse

On Fri, 2005-01-28 at 08:45 +0000, Artem B. Bityuckiy wrote:
> But my original proposal assumed we agree on that and was about a
> slightly different thing. The question was: do we really want to
> distinguish between pristine and non-pristine nodes?

Yes.

> A recap for those who do not remember JFFS2 internals: a pristine
> node is a node which contains PAGE_SIZE (4K) bytes of data. These
> nodes may quickly satisfy a read request - just unpack the node's
> data into the page provided by the upper layer. Non-pristine nodes
> are nodes which contain fewer than PAGE_SIZE bytes of data, or which
> contain 4K of data but are partially overlapped by a newer node with
> more current data.
> 
> My idea is that with a write-back cache a lot of writes will first go
> to the cache, and we may just always write in 4K fractions. No more
> non-pristine nodes. All nodes are then 4K or a multiple of 4K.

I disagree. Are you really trying to tell me that _ALL_ nodes are 4K
size aligned? Just look at your filesystem.

> Advantages:
> Simpler GC. We no longer need to merge non-pristine nodes. When we
> garbage collect one block, we do not produce dirt in another block
> (because of merging). GC is much faster. And we no longer need to
> iget() when we meet a non-pristine node in GC.

It's not simpler, because you have to take care of synchronous
transactions.

> Drawbacks:
> Yes, if we are forced to do synchronous IO, we will waste more space.
> For example, if the user does 1-byte transactions, we write 4K every
> time instead. But is this so bad?

Yes, because it requires many more write/erase cycles and the
performance will be worse than the current JFFS2.

tglx


* Re: JFFS3 memory consumption
  2005-01-28  9:03                 ` Thomas Gleixner
@ 2005-01-28  9:23                   ` Artem B. Bityuckiy
  2005-01-28 18:57                     ` Artem B. Bityuckiy
  0 siblings, 1 reply; 29+ messages in thread
From: Artem B. Bityuckiy @ 2005-01-28  9:23 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: MTD List, David Woodhouse

On Fri, 28 Jan 2005, Thomas Gleixner wrote:
> On Fri, 2005-01-28 at 08:45 +0000, Artem B. Bityuckiy wrote:
> > But my original proposal assumed we agree on that and was about a
> > slightly different thing. The question was: do we really want to
> > distinguish between pristine and non-pristine nodes?
> 
> Yes.
> 
> > My idea is that with a write-back cache a lot of writes will first
> > go to the cache, and we may just always write in 4K fractions. No
> > more non-pristine nodes. All nodes are then 4K or a multiple of 4K.
> 
> I disagree. Are you really trying to tell me that _ALL_ nodes are 4K
> size aligned? Just look at your filesystem.
:-) I did not pretend to be correct.

> > Advantages:
> > Simpler GC. We no longer need to merge non-pristine nodes. [...]
> 
> It's not simpler, because you have to take care of synchronous
> transactions.
I haven't got what you mean.

> > Drawbacks:
> > Yes, if we are forced to do synchronous IO, we will waste more
> > space. For example, if the user does 1-byte transactions, we write
> > 4K every time instead. But is this so bad?
> 
> Yes, because it requires many more write/erase cycles and the
> performance will be worse than the current JFFS2.

OK, convinced. We need to have two flavours of nodes - pristine and
non-pristine. Thanks.

--
Best Regards,
Artem B. Bityuckiy,
St.-Petersburg, Russia.


* Re: JFFS3 memory consumption
  2005-01-28  9:23                   ` Artem B. Bityuckiy
@ 2005-01-28 18:57                     ` Artem B. Bityuckiy
  2005-01-30 18:23                       ` Artem B. Bityuckiy
  2005-01-30 18:58                       ` Artem B. Bityuckiy
  0 siblings, 2 replies; 29+ messages in thread
From: Artem B. Bityuckiy @ 2005-01-28 18:57 UTC (permalink / raw)
  To: Thomas Gleixner, Josh Boyer, David Woodhouse; +Cc: MTD List

Hi,

I have recently described the JFFS2 memory consumption problem, and
David suggested several ideas.

The best idea is the write-back cache, I think. But it touches the
memory consumption only by implication; it is really an IO speed
improvement. So I think we may discuss how to implement this WB cache
in another thread. It would be a *very* good JFFS3 attribute, I think.

David's other idea, increasing the maximum inode node size, is also
pretty attractive to me, and the idea of implementing zisofs-like reads
is very good. I've noted both of these and hope they will be in the
JFFS3 design.

But I have another, more radical idea, which decreases the memory
consumption further in most cases. While we currently have a linear
dependency on the number of nodes, my idea changes this to a dependency
on the number of inodes.

Most often the number of nodes is much larger than the number of
inodes, but there are exceptions, and in those cases my proposal will
not help much.

Before describing the idea, some definitions:

Summary node
~~~~~~~~~~~~
The node which describes all the other nodes in one block. Summaries
are placed at the end of each block. For each other node in the block,
the summary contains the following information:

node offset within the block
node length
node type
for direntry nodes:
        direntry name
        direntry target/parent inode
for inode nodes:
        inode number
        data offset
...... and maybe something else.

Thus, if we want to know a block's layout, we may just read its summary
node.

Inode checkpoint node (ICP)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
A checkpoint node describes all the nodes related to some inode.
Checkpoints may be placed anywhere on flash, like any other node. For
each node related to the inode, the ICP contains:

node offset within the flash
node length
node type
for direntry nodes:
        direntry name
        direntry target/parent inode
for inode nodes:
        inode number
        data offset
...... and maybe something else.

Almost the same as a summary, but per-inode information.
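
To illustrate, the two kinds of entries could look roughly like this
(a sketch only - field names and widths are invented, not a real JFFS3
layout):

#include <stdint.h>

struct sum_entry {             /* one per node, in a block's summary */
	uint32_t offset;       /* node offset within this block */
	uint32_t totlen;       /* node length */
	uint16_t nodetype;     /* inode data, direntry, ... */
	uint32_t ino;          /* owning inode */
	uint32_t data_offset;  /* for inode nodes: data offset in the file */
	/* direntry entries would add name + target/parent inode */
};

struct icp_entry {             /* one per node, in an inode's ICP */
	uint32_t flash_offset; /* node offset within the whole medium */
	uint32_t totlen;
	uint16_t nodetype;
	uint32_t data_offset;
	/* same idea as the summary entry; ino is implied by the ICP */
};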


The Idea
~~~~~~~~
Imagine we have a file system where each inode has an ICP and each
block has a summary. Then we may keep in-core only the position of the
ICP for each inode.

In this case:
1. when we need the per-inode node list, we read the ICP and we have it!
2. when we need the per-block node list, we read the summary and we
have it!

No need to keep the per-block and per-inode node_ref lists in-core
anymore!

The per-inode list is only needed on the iget() call. After the iget()
call we will have the per-inode list in memory (say, in the form of a
fragtree or a dirent list).

The per-block list is only needed for garbage collection.

When a file is opened and changed, we build the ICP of this file in
RAM. We flush this ICP on the iput() call. Summaries are also formed in
RAM and flushed when the block is almost full.
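
A rough sketch of the intended per-inode lifecycle (all names invented;
the comments carry the idea):

#include <stdint.h>

/* Hypothetical: the only per-inode in-core state is the ICP position;
 * the node lists are materialized on demand. */
struct incore_inode {
	uint32_t ino;
	uint32_t icp_offset;  /* where the inode's current ICP lives on flash */
};

/* On iget(): read the ICP node at icp_offset and build the fragtree or
 * dirent list from its entries - no per-node node_refs in between.
 * On iput(): serialize the (possibly updated) node list into a fresh
 * ICP, write it out, and store its new offset in icp_offset. */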

So far so good. But there is a bundle of problems we must solve. Just
to count some of them:

1. Unclean reboots. In this case we will have no up-to-date ICPs and
summaries. I hope that in this case we will still be able to use the
old JFFS2 algorithms. I suspect this case also covers situations where
we have no flash space to flush ICPs, or we run out of memory while
forming them; we then just do not write them, which is similar to an
unclean reboot.

2. GC tends to merge non-pristine nodes, so it changes ICPs, and the GC
process will be slower.

3. Since we form the ICPs of opened files in memory, we will possibly
need to reserve some flash space to be able to flush these ICPs on
unmount. This requires very accurate flash space accounting.

I believe these are not all the problems, but I think they are the
biggest ones. At first glance, all of them are solvable.

Thanks for your comments.

--
Best Regards,
Artem B. Bityuckiy,
St.-Petersburg, Russia.


* Re: JFFS3 memory consumption
  2005-01-28 18:57                     ` Artem B. Bityuckiy
@ 2005-01-30 18:23                       ` Artem B. Bityuckiy
  2005-01-30 18:58                       ` Artem B. Bityuckiy
  1 sibling, 0 replies; 29+ messages in thread
From: Artem B. Bityuckiy @ 2005-01-30 18:23 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: MTD List, David Woodhouse

On Fri, 2005-01-28 at 18:57 +0000, Artem B. Bityuckiy wrote:
> [...]
> Inode checkpoint node (ICP)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~
> A checkpoint node describes all the nodes related to some inode.
> [...]
> Almost the same as a summary, but per-inode information.

Alternatively, in order to keep ICPs small, we may store in an ICP only
references to the summary nodes which contain at least one reference to
one of the inode's nodes. Thus we will save some space but lose some
speed.

This means that if, say, some inode has its nodes scattered over blocks
1, 100 and 1003, then the ICP's data will contain only these three
numbers.



* Re: JFFS3 memory consumption
  2005-01-28 18:57                     ` Artem B. Bityuckiy
  2005-01-30 18:23                       ` Artem B. Bityuckiy
@ 2005-01-30 18:58                       ` Artem B. Bityuckiy
  2005-01-31 16:46                         ` Jared Hulbert
  1 sibling, 1 reply; 29+ messages in thread
From: Artem B. Bityuckiy @ 2005-01-30 18:58 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: MTD List, David Woodhouse

Another interpretation of my idea.

We may look at the summary nodes as the file system's superblock.
Indeed:
1. The summary nodes altogether (let's call this the "summary
superblock") give us the full file system map;
2. The summary nodes have a "fixed position" - at least we can find
their locations quickly, without scanning the file system fully.

Unlike "classic" super blocks, the "summary super block" is distributed
entity. It is scattered over different JFFS3 blocks. But this property
will make JFFS3 unclean reboot-tolerant, since only small part of our
"summary super block" may be lost due to unclean reboot. And we always
may scan the summary-less block and build the block's nodes map.

Since we consider all summary nodes as one single, united entity, we
they should be tightly coupled. But the are not. Summaries are
independent of each other. This is the problem.

For example, imagine we want to build the track of an inode's nodes.
For this, we must scan all the summary nodes - an unacceptably long
operation. For it I see two possible solutions:

1. For each inode we keep in-core the list of block numbers where this
inode's nodes are present. Thus, to build the inode's node track, we
need to read only those summary nodes.

2. Use ICPs.

Comments?



* Re: JFFS3 memory consumption
  2005-01-30 18:58                       ` Artem B. Bityuckiy
@ 2005-01-31 16:46                         ` Jared Hulbert
  2005-01-31 18:05                           ` Artem B. Bityuckiy
  0 siblings, 1 reply; 29+ messages in thread
From: Jared Hulbert @ 2005-01-31 16:46 UTC (permalink / raw)
  To: dedekind; +Cc: David Woodhouse, Thomas Gleixner, MTD List

> 1. For each inode we keep in-core the list of block numbers where
> this inode's nodes are present. Thus, to build the inode's node
> track, we need to read only those summary nodes.
> 
> 2. Use ICPs.


I'm not sure I understand the difference.  In case 1) are you
suggesting keeping the ICP-like data in the summary nodes?

I think the ICPs are a great idea.


* Re: JFFS3 memory consumption
  2005-01-31 16:46                         ` Jared Hulbert
@ 2005-01-31 18:05                           ` Artem B. Bityuckiy
  2005-02-01 16:49                             ` Thomas Gleixner
  0 siblings, 1 reply; 29+ messages in thread
From: Artem B. Bityuckiy @ 2005-01-31 18:05 UTC (permalink / raw)
  To: Jared Hulbert; +Cc: David Woodhouse, Thomas Gleixner, MTD List

> I'm not sure I understand the difference.  In case 1) are you
> suggesting keeping the ICP-like data in the summary nodes?
>
> I think the ICPs are a great idea.

I will try to express what I think.

So, we have summaries at the end of each block. Imagine a user has
issued the open() call for some file. This means that the VFS will
invoke the JFFS3 iget() inode operation. The iget() is supposed to
build the file's fragtree in order to be able to quickly read any of
the file's data.

Obviously, to build the inode in iget() we must know the positions of
all the inode's nodes. In JFFS2 this is simply done by keeping the
in-core objects for each node and having the per-inode lists. We do not
want this; we want to save RAM.

How may we build the inode in the iget() operation?

1. For each inode we keep in-core a list of block numbers (let's call
this list the "inode_blocks list"). These are the blocks which contain
at least one node belonging to our inode (the inode we are trying to
build). Thus, to build the inode, we do:

for each block in the "inode_blocks list"
{
	read the block's summary node;
	find all nodes belonging to our inode in the summary;
}

If the inode's nodes are scattered over many blocks, this will be quite
slow.

But we may apply the following optimizations:

A. Have a special GC strategy which puts (or tries to put) nodes
belonging to the same inode into the same blocks. Thus we may keep the
number of blocks over which an inode's nodes are scattered relatively
small.

B. The summary entries in the summary nodes are sorted by node offset.
But when we are searching for the nodes which belong to a specific
inode, we would like the summary entries sorted by inode number (then
we may use a quick binary search). So we may store in the summary both
physical-offset-sorted entries and inode-number-sorted entries.
Possibly the physical-offset sorting is not really needed (the GC will
just move nodes in a different order).

With both of these optimizations, we may keep the iget() operation fast
enough.
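
Optimization B is cheap to sketch in user space with plain bsearch()
(the types here are illustrative only):

#include <stdint.h>
#include <stdlib.h>

struct sum_entry { uint32_t ino, offset, totlen; };

static int cmp_ino(const void *key, const void *elem)
{
	uint32_t ino = *(const uint32_t *)key;
	const struct sum_entry *e = elem;
	if (ino < e->ino) return -1;
	if (ino > e->ino) return 1;
	return 0;
}

/* entries[] sorted by inode number; returns some entry of this inode,
 * or NULL.  Other entries of the same inode sit adjacent to it. */
static struct sum_entry *find_ino(struct sum_entry *entries, size_t n,
				  uint32_t ino)
{
	return bsearch(&ino, entries, n, sizeof(*entries), cmp_ino);
}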

2. ICP... The alternative is using ICPs. An ICP contains ICP entries
in its body; these entries describe all the nodes of our inode. Thus,
for each inode, we may keep in-core only the inode's ICP offset. And
when we need to build the inode in the iget() call, we just read the
corresponding ICP, and that's all.

The problem here is that an ICP contains the physical offsets of nodes.
But the GC keeps moving nodes from one block to another, changing their
offsets. So we would need to update ICPs often - even for static
read-only files.

Thoughts?

--
Best Regards,
Artem B. Bityuckiy,
St.-Petersburg, Russia.


* Re: JFFS3 memory consumption
  2005-01-31 18:05                           ` Artem B. Bityuckiy
@ 2005-02-01 16:49                             ` Thomas Gleixner
  2005-02-02 15:44                               ` Artem B. Bityuckiy
  0 siblings, 1 reply; 29+ messages in thread
From: Thomas Gleixner @ 2005-02-01 16:49 UTC (permalink / raw)
  To: Artem B. Bityuckiy; +Cc: MTD List, David Woodhouse

On Mon, 2005-01-31 at 18:05 +0000, Artem B. Bityuckiy wrote:
> I will try to express what I think.
> 
> If the inode's nodes are scattered over many blocks, this will be
> quite slow.
> 
> But we may apply the following optimizations:
> 
> A. Have a special GC strategy which puts (or tries to put) nodes
> belonging to the same inode into the same blocks. Thus we may keep
> the number of blocks over which an inode's nodes are scattered
> relatively small.

This might turn out to be a GC nightmare. Every update to a file might
result in shuffling tons of nodes around, or at least trying to do so.

Keep it simple.

> 2. ICP... The alternative is using ICPs. An ICP contains ICP entries
> in its body; these entries describe all the nodes of our inode. Thus,
> for each inode, we may keep in-core only the inode's ICP offset. And
> when we need to build the inode in the iget() call, we just read the
> corresponding ICP, and that's all.
> 
> The problem here is that an ICP contains the physical offsets of
> nodes. But the GC keeps moving nodes from one block to another,
> changing their offsets. So we would need to update ICPs often - even
> for static read-only files.

Not only GC makes this necessary; every write to a file does, too.

> Thoughts?

I'm not sure which of the mechanisms will provide us more horror. My
guess is ICP. The summary method is definitely less performant, but I
think it is therefore less complicated to handle. So we trade
performance for simplicity, which makes more sense to me than the other
way round.

tglx


* Re: JFFS3 memory consumption
  2005-02-01 16:49                             ` Thomas Gleixner
@ 2005-02-02 15:44                               ` Artem B. Bityuckiy
  2005-02-02 18:34                                 ` Jared Hulbert
  0 siblings, 1 reply; 29+ messages in thread
From: Artem B. Bityuckiy @ 2005-02-02 15:44 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: MTD List, David Woodhouse

On Tue, 1 Feb 2005, Thomas Gleixner wrote:
> This might turn out to be a GC nightmare. Every update to a file
> might result in shuffling tons of nodes around, or at least trying to
> do so.
>
> Keep it simple.
Agreed, this might be too complicated an approach.

> Not only GC makes this necessary; every write to a file does, too.
Agreed.

> I'm not sure which of the mechanisms will provide us more horror. My
> guess is ICP. The summary method is definitely less performant, but I
> think it is therefore less complicated to handle. So we trade
> performance for simplicity, which makes more sense to me than the
> other way round.
>

I'll try to describe the ICP approach and the horror I foresee.

Some terms and definitions:
~~~~~~~~~~~~~~~~~~~~~~~~~~~
An ICP describes nodes and contains *ICP entries*. Each ICP entry
corresponds to one node.

An ICP entry is *valid* if the node it refers to is valid and is at the
correct position (i.e., the GC didn't move it).

An ICP entry is *invalid* if it is not valid.

An ICP node is *obsolete* if:
1. the number of its valid ICP entries is less than a certain watermark
(MIN_ICP_ENTRIES); or
2. the percentage of its valid ICP entries is less than a certain
watermark (MIN_ICP_ENTRIES_PERC).

Next, ICP nodes have versions. Each new ICP node gets the version of
the previous one + 1.

Consider some use-cases for better understanding. Suppose we have 3 ICP
nodes for inode #4, with versions 1, 2 and 3. Suppose inode #4 has 3000
nodes (versions 1-3000) if it is a regular file inode, or 3000
direntries if it is a directory inode.

Case 1.
	ICP node v1 describes nodes v1-1000.
	ICP node v2 describes nodes v1001-2500.
	ICP node v3 describes nodes v2501-3000.
All ICP nodes are valid, everybody is happy.
To build inode #4 we need to read all three ICPs.

Case 2.
	ICP node v1 describes nodes v1-3000.
	ICP node v2 describes nodes v10-3000.
	ICP node v3 describes nodes v1-2000.
ICP node v1 is obsolete, since the newer ICP node v2 describes almost
all of the nodes of inode #4. ICP node v2 is valid, but contains only
1000 valid ICP entries, since its entries for nodes v10-2000 are
overridden by the next ICP node. ICP node v3 is valid; it partially
overrides ICP node v2.
To build inode #4, we need to read ICP node v2 and ICP node v3.

The analogy with ordinary inode nodes is obvious.

-------------------------------------------------------------

So, how do I propose to operate?
Basically, I propose to keep in-core references to each inode's ICPs.
Suppose we have a flash full of data and each inode has valid ICP(s).
To avoid being too vague, I'll just describe the behaviour I expect in
different cases.

1.
If nodes are moved by the GC, some ICP entries become invalid. For
those nodes which have been moved, we keep in-core node_refs. Thus, we
will have two things in-core:
a) ICP reference(s);
b) node_refs for all the invalid ICP entries.

Here we may again introduce a watermark - the maximum number of
node_refs which we may keep in-core. If we exceed this watermark, we
write a new ICP node to flash (see the sketch below). If the new ICP
node makes the previous one obsolete, we include the ICP entries for
all the nodes in our new ICP; otherwise, only those which actually
changed.
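
A sketch of this bookkeeping (names and the watermark value are purely
illustrative):

#include <stdint.h>

#define MAX_INCORE_REFS 32   /* the watermark; value is arbitrary here */

struct moved_ref { uint32_t new_offset, totlen; };

struct icp_state {
	uint32_t icp_offset;                      /* latest ICP on flash */
	unsigned nr_moved;                        /* invalid ICP entries */
	struct moved_ref moved[MAX_INCORE_REFS];  /* corrected positions */
};

/* Called when the GC moves one of this inode's nodes. */
static int note_moved_node(struct icp_state *st, uint32_t new_off,
			   uint32_t len)
{
	st->moved[st->nr_moved].new_offset = new_off;
	st->moved[st->nr_moved].totlen = len;
	if (++st->nr_moved == MAX_INCORE_REFS) {
		/* caller writes a new ICP merging the old entries with
		 * these corrections, then we drop the in-core refs */
		st->nr_moved = 0;
		return 1;
	}
	return 0;
}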

2.
When the file is changed (not appended), the situation is the same as
in the first case.

3. If the file is appended, old ICPs are not changed, so we just write
new ICPs which describe the newer nodes only.

Several notes:
1. I did not consider the unclean reboot situation. I think this
problem is handleable, but it may make the mount process very
complicated.
2. Small inodes, which consist of few nodes, may not have an ICP. I
think it is reasonable to use the MIN_ICP_ENTRIES watermark here
(described above).

Comments?

--
Best Regards,
Artem B. Bityuckiy,
St.-Petersburg, Russia.


* Re: JFFS3 memory consumption
  2005-02-02 15:44                               ` Artem B. Bityuckiy
@ 2005-02-02 18:34                                 ` Jared Hulbert
  2005-02-02 19:02                                   ` Thomas Gleixner
  2005-02-03 10:54                                   ` Artem B. Bityuckiy
  0 siblings, 2 replies; 29+ messages in thread
From: Jared Hulbert @ 2005-02-02 18:34 UTC (permalink / raw)
  To: Artem B. Bityuckiy; +Cc: MTD List, Thomas Gleixner, David Woodhouse

What do you mean by in-core?


* Re: JFFS3 memory consumption
  2005-02-02 18:34                                 ` Jared Hulbert
@ 2005-02-02 19:02                                   ` Thomas Gleixner
  2005-02-03 10:54                                   ` Artem B. Bityuckiy
  1 sibling, 0 replies; 29+ messages in thread
From: Thomas Gleixner @ 2005-02-02 19:02 UTC (permalink / raw)
  To: Jared Hulbert; +Cc: David Woodhouse, MTD List

On Wed, 2005-02-02 at 10:34 -0800, Jared Hulbert wrote:
> What do you mean by in-core?

RAM

tglx


* Re: JFFS3 memory consumption
  2005-02-02 18:34                                 ` Jared Hulbert
  2005-02-02 19:02                                   ` Thomas Gleixner
@ 2005-02-03 10:54                                   ` Artem B. Bityuckiy
  1 sibling, 0 replies; 29+ messages in thread
From: Artem B. Bityuckiy @ 2005-02-03 10:54 UTC (permalink / raw)
  To: Jared Hulbert; +Cc: MTD List, Thomas Gleixner, David Woodhouse

Jared Hulbert wrote:
> What do you mean by in-core?

This is a term used in JFFS2. All the objects which JFFS2 allocates in
RAM are divided into two groups: in-core objects and the rest.

In-core objects are those which must be allocated in order to mount
JFFS2 and work with it. So, the in-core memory is the memory which
JFFS2 consumes while it is mounted, irrespective of whether anyone
works with it or not.

For example, to mount a 1GiB JFFS2 partition full of data one needs,
say, 5 MiB of RAM. Once you have mounted JFFS2, these 5 MiB are
consumed forever (until you unmount it). To open files and look up
directories JFFS2 needs more RAM, but that RAM is freed when, roughly
speaking, you close your files.

So, the major JFFS2 drawback is that its in-core RAM consumption
depends linearly on the amount of data you store *on flash*. And I'm
trying to find a way to relax this.

--
Best Regards,
Artem B. Bityuckiy,
St.-Petersburg, Russia.

