From: Phillip Lougher <phillip@lougher.demon.co.uk>
To: "J. R. Okajima" <hooanon05@yahoo.co.jp>
Cc: linux-fsdevel@vger.kernel.org
Subject: Re: Q. cache in squashfs?
Date: Fri, 09 Jul 2010 11:32:17 +0100
Message-ID: <4C36FAB1.6010506@lougher.demon.co.uk>
In-Reply-To: <15323.1278662033@jrobl>

J. R. Okajima wrote:
>> Phillip Lougher:
>>> What I think you're seeing here is the negative effect of fragment
>>> blocks (tail-end packing) in the native squashfs example and the
>>> positive effect of vfs/loop block caching in the ext3 on squashfs example.
>> Thank you very much for your explanation.
>> I think the number of cached decompressed fragment blocks is related
>> too. I thought it was much larger, but I found it is 3 by default. I will
>> try a larger value with/without -no-fragments, which you pointed out.
> 
> The -no-fragments option gives better performance, but the improvement is
> very small. It doesn't seem that the number of fragment blocks is large in
> my test environment.

That is *very* surprising.  How many fragments do you have?

> 
> Next, I tried increasing the number of cache entries in squashfs.
> squashfs_fill_super()
>         /* Allocate read_page block */
> -       msblk->read_page = squashfs_cache_init("data", 1, msblk->block_size);
> +       msblk->read_page = squashfs_cache_init("data", 100, msblk->block_size);

That is the *wrong* cache.  Read_page isn't really a cache (it is merely
allocated as a cache to re-use code).  It is used to store the data block in
the read_page() routine, and its entire contents are explicitly pushed into
the page cache.  Because everything ends up in the page cache, it is *very*
unlikely the VFS is calling Squashfs to re-read *this* data.  If it is,
then something fundamental is broken, or you're seeing page cache shrinkage.

> and
> CONFIG_SQUASHFS_FRAGMENT_CACHE_SIZE=100 (it was 3)
> which is for msblk->fragment_cache.

and which should make *no* difference if you've used the -no-fragments option to
build an image without fragments.
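For reference, an image without fragments is built with something like the
following (the source directory and image name here are just placeholders):

  mksquashfs /path/to/source a.img -no-fragments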

Squashfs has three types of compressed block, each with different
caching behaviour:

1. Data blocks.  Once read, the entire contents are pushed into the page cache.
    They are not cached by Squashfs.  If you've got repeated reads of *these*
    blocks then you're seeing page cache shrinkage or flushing.

2. Fragment blocks.  These are large data blocks which have multiple small
    files packed together.  In read_page() the file for which the fragment has
    been read is pushed into the page cache.  The other contents of the fragment
    block (the other files) are not, so they're temporarily cached in the
    squashfs fragment cache in the belief they'll be requested soon (locality of
    reference and all that stuff).

3. Metadata blocks (always 8K).  These store inode and directory metadata, and
    are (unsurprisingly) read when inodes are looked-up/instantiated and when
    directory look-up takes place. These blocks tend to store multiple inodes
    and directories packed together (for greater compression).  As such they're
    temporarily cached in the squashfs metadata cache in the belief they'll be re-used
    soon (again locality of reference).

It is fragments and metadata blocks which show the potential for
repeated re-reading on random access patterns.
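To make that concrete, both of those block types are read through the same
get/put helpers (a simplified sketch, assuming the squashfs_cache_get()/
squashfs_cache_put() interface in fs/squashfs/cache.c; the surrounding
variables are illustrative):

  struct squashfs_cache_entry *entry;

  /* metadata: 8K blocks, cached in msblk->block_cache */
  entry = squashfs_cache_get(sb, msblk->block_cache, block, 0);
  /* ... copy the inode/dirent bytes out of the entry ... */
  squashfs_cache_put(entry);

  /* fragments: block_size blocks, cached in msblk->fragment_cache */
  entry = squashfs_cache_get(sb, msblk->fragment_cache, frag_blk, frag_size);
  /* ... copy the file's tail-end out of the entry ... */
  squashfs_cache_put(entry);

A miss in either lookup is what costs a fresh block read and decompression;
data blocks never come through here.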

As you've presumably eliminated fragments from your image, that leaves
metadata blocks as the *only* cause of repeated re-reading/decompression.

You should have modified the size of the metadata cache, from 8 to something
larger, i.e.

  msblk->block_cache = squashfs_cache_init("metadata",
                         SQUASHFS_CACHED_BLKS, SQUASHFS_METADATA_SIZE);


As a rough guide to how much to increase the cache so that it holds the
entire metadata in your image, add up the uncompressed sizes of the inode
and directory tables reported by mksquashfs.
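For example, the equivalent experiment to the diff quoted above would be
something like this (a sketch; SQUASHFS_CACHED_BLKS is the constant in
fs/squashfs/squashfs_fs.h behind the existing 8-entry cache, and 64 is just
an arbitrary larger test value):

  squashfs_fs.h
  -#define SQUASHFS_CACHED_BLKS		8
  +#define SQUASHFS_CACHED_BLKS		64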

But there's a mystery here: I'll be very surprised if your test image has
more than 64K of metadata, which would fit into the existing 8-entry metadata
cache.

> Of course, these numbers are not a generic solution, but they are large
> enough to keep all blocks for my test.

> 
> It shows much better performance. 

If you've done as you said, it should have made no difference whatsoever, unless
the page pushing into the page cache is broken.

So there's a big mystery here.

> All blocks are cached, and the number of decompressions for native squashfs
> (a.img) is almost equivalent to the nested ext3 case (b.img). But a.img
> consumes much more CPU than b.img.
> My guess is that the CPU goes into the cache search.
> squashfs_cache_get()
> 		for (i = 0; i < cache->entries; i++)
> 			if (cache->entry[i].block == block)
> 				break;
> The value of cache->entries grows, and its search cost grows too.

Are you seriously suggesting that scanning a 100-entry table on a modern CPU
makes any noticeable difference?
> 
> Before I introduce a hash table or something similar to reduce the search
> cost, I think it is better to convert the squashfs cache into a generic
> system cache. The hash index would be based on the block number. I don't
> know whether it can be combined with the page cache, but at least it would
> be able to use kmem_cache_create() and register_shrinker().
> 
> Phillip, what do you think about converting the cache system?
> 

That was discussed on this list back in 2008, and there are pros and cons
to doing it.  You can look at the list archives for that discussion, so I
won't repeat it here.  At the moment I see this as a red herring, because
your results suggest something more fundamental is wrong.  Doing what you
did above with the size of the read_page cache should not have made any
difference, and if it did, it suggests pages which *should* be in the page
cache (explicitly pushed there by the read_page() routine) are not there.
In short, it's not a question of whether Squashfs should be using the page
cache; for the pages in question it already does.

I'll try to reproduce your results, as they are, to be frank, significantly
at variance with my previous experience.  Maybe there's a bug, or VFS changes
mean the page pushing into the page cache isn't working, but I cannot see
where your repeated block reading/decompression results are coming from.

Phillip

> 
> J. R. Okajima
> 

