Date: Fri, 17 Oct 2003 18:15:34 +0100
From: John Bradford
Message-Id: <200310171715.h9HHFYUi001735@81-2-122-30.bradfords.org.uk>
To: Eli Carter
Cc: jw schultz, linux-kernel@vger.kernel.org
In-Reply-To: <3F90175B.2000502@inet.com>
References: <1066163449.4286.4.camel@Borogove>
	<20031015133305.GF24799@bitwizard.nl>
	<3F8D6417.8050409@pobox.com>
	<20031016162926.GF1663@velociraptor.random>
	<3F8ECA3E.4030208@draigBrady.com>
	<20031016231235.GB29279@pegasys.ws>
	<200310170803.h9H83ahx000164@81-2-122-30.bradfords.org.uk>
	<3F90025A.7070504@inet.com>
	<200310171527.h9HFRvU2001448@81-2-122-30.bradfords.org.uk>
	<3F90175B.2000502@inet.com>
Subject: Re: Transparent compression in the FS
X-Mailing-List: linux-kernel@vger.kernel.org

Quote from Eli Carter:
> John Bradford wrote:
> >>>The upshot of all that would be that if you needed space, it would be
> >>>there, (just overwrite the uncompressed versions of files), but until
> >>>you do, you can access the uncompressed data quickly.
> >>>
> >>>You could even take it one step further, and compress files with gzip
> >>>by default, and re-compress them with bzip2 after long periods of
> >>>inactivity.
> >>
> >>Note that a file compressed with bzip2 is not necessarily smaller than
> >>the same file compressed with gzip.  (It can be quite a bit larger in fact.)
> >
> >
> > Have you noticed that with real-life data, or only test cases?
>
> Real-life data.
> I don't remember the exact details for certain, but as
> best as I can recall: I was dealing with copies of output from build
> logs, telnet sessions, messages files, or the like (i.e. text) that were
> (many,) many MB in size (and probably highly repetitititititive).  I
> wound up with a loop that compressed each file into a gzip and a bzip2,
> compared the sizes, and killed the larger.  There were a number of .gz's
> that won.  (I have also read that gzip is better at text compression
> whereas bzip2 is better at binary compression.  No, I don't remember the
> source.)

Wow, I'm really surprised; I've always had good results with text,
although quite possibly not as repetitive as yours.  I have noticed
that incompressible data such as /dev/random is almost always expanded
to a greater extent with bzip2, which is why I asked.

> But that is immaterial... You have to deal with the case where the
> 'better' algorithm gives 'worse' results (by size).  Keep in mind that
> some data won't compress at all (for a given algorithm), and winds up
> needing more space in the compressed form.  (In which case we add a byte
> to say "this is not compressed" and keep the original form.)
>
> uncompressed -> gzip; gzip -> bzip2 would be by far the normal case.
> But, sometimes gzip can't get it any smaller, or would increase the
> size.  (Keep in mind we may be storing a file that is already compressed...)
>
> So your scheme needs to note when compression fails so it doesn't try
> again, so we see:
>
> uncompressed -> gzip  or  uncompressed(gzip failed)
> gzip -> bzip2  or  gzip(bzip2 failed)
> uncompressed(gzip failed) -> bzip2  or  uncompressed(bzip2 failed)

It might also be worth only using a much slower compressor if we get
at least N% better results.  Literally saving 0.1% of the size at the
expense of 5x worse decompression time is possibly not worthwhile in
most cases.
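
To make the N% idea concrete, here is a rough userspace sketch, not
filesystem code.  It assumes Python's standard gzip and bz2 modules as
stand-ins for whatever compressors would really be used; the function
name, the tags and the 5% default are all made up for illustration.

```python
import bz2
import gzip

# Hypothetical tags recording how a piece of data ends up stored.
RAW, GZIP, BZIP2 = "raw", "gzip", "bzip2"


def choose_encoding(data: bytes, min_gain: float = 0.05):
    """Return (tag, payload) for the form that seems worth storing.

    min_gain is the hypothetical "N%" threshold: the slower compressor
    (bzip2 here) is only chosen if its output is at least this fraction
    smaller than the best cheaper form found so far.
    """
    best_tag, best = RAW, data

    # Cheap compressor: accept it only if it actually shrinks the data.
    gz = gzip.compress(data)
    if len(gz) < len(best):
        best_tag, best = GZIP, gz

    # Slow compressor: must beat the current best by the required margin.
    bz = bz2.compress(data)
    if len(bz) <= len(best) * (1.0 - min_gain):
        best_tag, best = BZIP2, bz

    return best_tag, best
```

With something like os.urandom(4096) as input, both compressors expand
the data, so this falls back to the raw form, which is the "this is not
compressed" case Eli describes.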
> If it were me, I'd do it with one compression algorithm as a
> proof-of-concept, then add a second, and then generalize it to N cases
> (which would not be hard once the 2 cases was done).
>
> But I must say, I like your idea of keeping the uncompressed form around
> until we need the space.  (I'd also want to track reads separately from
> writes.)

Yes - one write would be worth a lot of reads in terms of keeping it
uncompressed.

Heh, it might also help if you get a bad sector on one of the copies :-).

John.
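
For what it's worth, the "note when compression fails so it doesn't try
again" bookkeeping, generalized to N compressors, could look roughly
like the sketch below.  It is only an illustration under assumptions:
Python's gzip and bz2 modules stand in for the real compressors, the
Stored class, LADDER list and step() function are invented names, and
the original data is kept in memory purely to keep the example short.

```python
import bz2
import gzip
from dataclasses import dataclass, field

# Hypothetical ladder of compressors, cheapest first; append more for N cases.
LADDER = [("gzip", gzip.compress), ("bzip2", bz2.compress)]


@dataclass
class Stored:
    original: bytes                      # uncompressed copy (a simplification)
    encoding: str = "raw"                # current stored form
    payload: bytes = field(init=False)   # bytes as currently stored
    tried: int = 0                       # how many ladder stages were attempted
    failed: list = field(default_factory=list)  # stages that did not help

    def __post_init__(self):
        self.payload = self.original


def step(item: Stored) -> bool:
    """Attempt the next untried stage once; return True if the form changed."""
    if item.tried >= len(LADDER):
        return False                     # nothing left to try
    name, compress = LADDER[item.tried]
    item.tried += 1
    candidate = compress(item.original)
    if len(candidate) < len(item.payload):
        item.encoding, item.payload = name, candidate
        return True
    item.failed.append(name)             # remember the failure, never retry
    return False
```

Calling step() once per "long period of inactivity" walks through the
transitions listed above, e.g. uncompressed -> gzip, then gzip -> bzip2
or gzip(bzip2 failed).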