From mboxrd@z Thu Jan  1 00:00:00 1970
From: Tom Marshall <tom@cyngn.com>
Subject: Re: fs compression
Date: Wed, 20 May 2015 15:46:30 -0700
Message-ID: <20150520224630.GA10927@eden.sea.cyngn.com>
References: <1431145253-2019-1-git-send-email-jaegeuk@kernel.org>
 <1431145253-2019-3-git-send-email-jaegeuk@kernel.org>
 <20150513020208.GK15721@dastard>
 <20150513064802.GA48682@jaegeuk-mac02.hsd1.ca.comcast.net>
 <20150514003721.GN15721@dastard>
 <20150516132403.GA2998@thunk.org>
 <20150516171326.GA24795@eden.sea.cyngn.com>
 <20150520174635.GA17651@eden.sea.cyngn.com>
 <20150520213641.GM2871@thunk.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Jaegeuk Kim <jaegeuk@kernel.org>, linux-fsdevel@vger.kernel.org
To: Theodore Ts'o <tytso@mit.edu>
Content-Disposition: inline
In-Reply-To: <20150520213641.GM2871@thunk.org>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>

So I have this all working as described.  I haven't implemented readahead
(readpages) yet, so reads are slow.  I'll be doing that next.

The other thing to note is that since the uncompressed size is stored inside
the file data, stat() requires reading both the inode and the first page of
the file.  That's not optimal, but I don't know if other generic out-of-band
solutions (such as xattrs) would be any better.  I suppose it depends on
whether the xattr info is read in with the inode or not.
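
To make the cost concrete, here is a rough sketch of pulling the
uncompressed size out of the first bytes of the file, using the header
layout quoted further down.  It assumes the fields are packed in the listed
order with no padding (12 bytes total) and that the magic is ASCII "zzzz".
The names zhdr and zhdr_parse are made up for illustration; this is
userspace C, not what the kernel code looks like.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed packed layout: 4 magic + 1 param1 + 1 blocksize + 6 orig_size. */
#define ZHDR_LEN 12

/* Hypothetical in-memory form of the header. */
struct zhdr {
	uint8_t  method;       /* param1 bits 0..3 */
	uint8_t  flags;        /* param1 bits 4..7 */
	uint8_t  cluster_bits; /* log2 of the compression cluster size */
	uint64_t orig_size;    /* le48 original uncompressed file size */
};

/*
 * Decode the header from the start of the file data.
 * Returns 0 on success, -1 if the buffer is too short, the magic does
 * not match, or the cluster size exponent exceeds the stated max of 31.
 */
static int zhdr_parse(const uint8_t *buf, size_t len, struct zhdr *out)
{
	assert(buf != NULL && out != NULL);

	if (len < ZHDR_LEN || memcmp(buf, "zzzz", 4) != 0)
		return -1;

	out->method = buf[4] & 0x0f;
	out->flags = buf[4] >> 4;
	out->cluster_bits = buf[5];
	if (out->cluster_bits > 31)
		return -1;

	/* le48: least significant byte first, so fold from the top down. */
	out->orig_size = 0;
	for (int i = 5; i >= 0; i--)
		out->orig_size = (out->orig_size << 8) | buf[6 + i];

	return 0;
}
```

A stat() path would have to read at least these 12 bytes from the first
page to report orig_size instead of the on-disk size.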

Also, on the subject of file size, I'm currently swapping out the i_size for
the compressed size before calling down into the filesystem's readpage.
Yes, that's a nasty hack that I'll need to address.

On Wed, May 20, 2015 at 05:36:41PM -0400, Theodore Ts'o wrote:
> On Wed, May 20, 2015 at 10:46:35AM -0700, Tom Marshall wrote:
> > So I've been playing around a bit and I have a basic strategy laid out.
> > Please let me know if I'm on the right track.
> > 
> > Compressed file attributes
> > ==========================
> > 
> > The filesystem is responsible for detecting whether a file is compressed and
> > hooking into the compression lib.  This may be done with an inode flag,
> > xattr, or any other applicable method.  No other special attributes are
> > necessary.
> 
> So I assume what you are implementing is read-only compression; that
> is, once the file is written, and the attribute set indicating that
> this is a compressed file, it is now immutable.

That is TBD.  Our use case right now only requires read-only, but I think
read-write would be a nice thing if it's not too convoluted.

fallocate() is supported on the major filesystems, and I imagine the same
mechanisms could be used to provide rewriting of the "compression clusters".

> > Compressed file format
> > ======================
> > 
> > Compressed files shall have header, block map, and data sections.
> > 
> > Header:
> > 
> > byte[4]		magic		'zzzz' (not strictly needed)
> > byte		param1		method and flags
> > 	bits 0..3 = compression method (1=zlib, 2=lz4, etc.)
> > 	bits 4..7 = flags (none defined yet)
> > byte		blocksize	log2 of blocksize (max 31)
> 
> I suggest using the term "compression cluster" to distinguish this
> from the file system block size.

Sure, just a name...

> > le48		orig_size	original uncompressed file size
> > 
> > 
> > Block map:
> > 
> > Vector of le16 (if blocksize <= 16) or le32 (if blocksize > 16).  Each entry
> > is the compressed size of the block.  Zero indicates that the block is
> > stored uncompressed, in case compression expanded the block.
> 
> What I would store instead is list of 32 or 64-bit offsets, where the
> nth entry in the array indicates the starting offset of the nth
> compression cluster.

Why?  This would both increase the space requirements and require some other
mechanism to indicate uncompressed compression clusters (e.g. setting the
high bit or something).
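
For reference, a sketch of how a reader would locate cluster n with the
size vector.  It sums the stored sizes of the preceding clusters, so a
lookup is linear in n unless the sums are cached.  Assumptions that are not
part of the proposal above: the map entries have already been decoded into
host-order uint32_t, the data section starts at a caller-supplied
data_start, and a zero (uncompressed) entry occupies a full cluster on disk
except for the final partial cluster.  The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Bytes occupied on disk by cluster idx.  A zero map entry means the
 * cluster is stored uncompressed, so its stored length is the cluster
 * size, or the remaining tail of the file for the last cluster
 * (assumption).  idx must refer to a cluster inside the file.
 */
static uint64_t zmap_stored_len(const uint32_t *sizes, size_t idx,
				unsigned cluster_bits, uint64_t orig_size)
{
	uint64_t csize = (uint64_t)1 << cluster_bits;
	uint64_t start = (uint64_t)idx * csize;
	uint64_t left;

	if (sizes[idx] != 0)
		return sizes[idx];

	left = orig_size - start;
	return left < csize ? left : csize;
}

/* File offset of cluster idx: data_start plus all earlier stored sizes. */
static uint64_t zmap_cluster_offset(const uint32_t *sizes, size_t idx,
				    unsigned cluster_bits,
				    uint64_t orig_size, uint64_t data_start)
{
	uint64_t off = data_start;

	assert(cluster_bits <= 31);
	for (size_t i = 0; i < idx; i++)
		off += zmap_stored_len(sizes, i, cluster_bits, orig_size);
	return off;
}
```

With an offset vector the lookup would be a single array read, at the cost
of wider entries and a separate way to mark uncompressed clusters.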

> > Questions and issues
> > ====================
> > 
> > Should there be any padding for the data blocks?  For example, if writing is
> > to be supported, padding the compressed data to the filesystem block size
> > would allow for easy rewriting of individual blocks without disturbing the
> > surrounding blocks.  Perhaps padding could be indicated by a flag.
> 
> If you add padding then you defeat the whole point of adding
> compression.  What if the initial contents of a 64k cluster was all
> zeros, so it trivially compresses down to a few dozen bytes; but then
> it gets replaced by completely uncompressible data?  If you add 64k
> worth of padding to each block, then you're not saving any space, so
> what's the point?

Sorry, I meant padding to the filesystem block size.  So, for example, if a
64KB compression cluster is compressed to 31KB, it would use 8 * 4KB blocks
and the next compression cluster would start on a new block.
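
The arithmetic is just a round-up to the filesystem block size.  A minimal
sketch, with hypothetical helper names and the block size passed in by the
caller:

```c
#include <assert.h>
#include <stdint.h>

/* Number of filesystem blocks needed to hold comp_len bytes. */
static uint64_t fs_blocks_for(uint64_t comp_len, uint32_t fs_block_size)
{
	assert(fs_block_size != 0);
	return (comp_len + fs_block_size - 1) / fs_block_size;
}

/* On-disk length of a cluster once padded to a block boundary. */
static uint64_t padded_len(uint64_t comp_len, uint32_t fs_block_size)
{
	return fs_blocks_for(comp_len, fs_block_size) * fs_block_size;
}
```

For the example above, fs_blocks_for(31 * 1024, 4096) gives 8 and
padded_len(31 * 1024, 4096) gives 32768, so the padding costs 1KB for that
cluster rather than a full 64KB.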

> > The compression code must be able to read pages from the underlying
> > filesystem.  This involves using the pagecache.  But the uncompressed data
> > is what ultimately should end up in the pagecache.  This is where I'm
> > currently stuck.  How do I implement the code such that the underlying
> > compressed data may be read (using the pagecache or not) while not
> > disturbing the pagecache for the uncompressed data?  I'm wondering if I need
> > to create an internal address_space to pass down into the underlying
> > readpage?  Or is there another way to do this?
> 
> So I would *not* reference the compressed data via the page cache.  If
> you do that, then you end up wasting space in the page cache, since
> the page cache will contain both the compressed and decompressed data
> --- and once the data has been decompressed, the compressed version is
> completely useless.  So it's better to have the file system supply the
> physical location on disk, and then to read in the compressed data to
> a scratched set of page which is freed immediately after you are done
> decompressing things.

I'm currently using an internal address_space to pass down into the
underlying filesystem to make things easy.  I'm too inexperienced in
filesystem development to unravel how to plumb in anything else (but I'm
learning quickly!)

If you have ideas about how to do the underlying readpage without the
pagecache, please enlighten me.

Or perhaps I could manually release the page from the private cache after
the uncompressed data has been extracted?

> 
> This is why compression is so very different from encryption.  The
> constraints make it quite different.
> 
> Regards,
> 
> 						- Ted
> 						
>