* Re: [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview
  [not found] <20100528173510.GA12166@ca-server1.us.oracle.com>
@ 2010-06-02  6:03 ` Minchan Kim
  2010-06-02 15:27   ` Dan Magenheimer
  2010-06-02 13:00 ` Jamie Lokier
  2010-06-02 13:24 ` Christoph Hellwig
  2 siblings, 1 reply; 15+ messages in thread
From: Minchan Kim @ 2010-06-02  6:03 UTC (permalink / raw)
To: Dan Magenheimer
Cc: chris.mason, viro, akpm, adilger, tytso, mfasheh, joel.becker,
    matthew, linux-btrfs, linux-kernel, linux-fsdevel, linux-ext4,
    ocfs2-devel, linux-mm, ngupta, jeremy, JBeulich, kurt.hackel,
    npiggin, dave.mccracken, riel, avi, konrad.wilk

Hello.

I think the cleancache approach is cool. :)
I have some suggestions and questions.

On Sat, May 29, 2010 at 2:35 AM, Dan Magenheimer
<dan.magenheimer@oracle.com> wrote:
> [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview
>
> Changes since V1:
> - Rebased to 2.6.34 (no functional changes)
> - Convert to sane types (Al Viro)
> - Define some raw constants (Konrad Wilk)
> - Add ack from Andreas Dilger
>
> In previous patch postings, cleancache was part of the Transcendent
> Memory ("tmem") patchset.  This patchset refocuses not on the underlying
> technology (tmem) but instead on the useful functionality provided for
> Linux, and provides a clean API so that cleancache can provide this very
> useful functionality either via a Xen tmem driver OR completely
> independent of tmem.  For example: Nitin Gupta (of compcache and
> ramzswap fame) is implementing an in-kernel compression "backend" for
> cleancache; some believe cleancache will be a very nice interface for
> building RAM-like functionality for pseudo-RAM devices such as SSD or
> phase-change memory; and a Pune University team is looking at a backend
> for virtio (see OLS'2010).
>
> A more complete description of cleancache can be found in the
> introductory comment in mm/cleancache.c (in PATCH 2/7) which is included
> below for convenience.
>
> Note that an earlier version of this patch is now shipping in OpenSuSE
> 11.2 and will soon ship in a release of Oracle Enterprise Linux.
> Underlying tmem technology is now shipping in Oracle VM 2.2 and was
> just released in Xen 4.0 on April 15, 2010.  (Search news.google.com
> for Transcendent Memory)
>
> Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
> Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org>
>
>  fs/btrfs/extent_io.c       |    9 +
>  fs/btrfs/super.c           |    2
>  fs/buffer.c                |    5 +
>  fs/ext3/super.c            |    2
>  fs/ext4/super.c            |    2
>  fs/mpage.c                 |    7 +
>  fs/ocfs2/super.c           |    3
>  fs/super.c                 |    8 +
>  include/linux/cleancache.h |   90 +++++++++++++++++++
>  include/linux/fs.h         |    5 +
>  mm/Kconfig                 |   22 ++++
>  mm/Makefile                |    1
>  mm/cleancache.c            |  203 +++++++++++++++++++++++++++++++++++++++++++++
>  mm/filemap.c               |   11 ++
>  mm/truncate.c              |   10 ++
>  15 files changed, 380 insertions(+)
>
> Cleancache can be thought of as a page-granularity victim cache for clean
> pages that the kernel's pageframe replacement algorithm (PFRA) would like
> to keep around, but can't since there
> isn't enough memory.  So when the PFRA "evicts" a page, it first
> attempts to put it into a synchronous concurrency-safe page-oriented
> pseudo-RAM device (such as Xen's Transcendent Memory, aka "tmem", or
> in-kernel compressed memory, aka "zmem", or other RAM-like devices)
> which is not directly accessible or addressable by the kernel and is
> of unknown and possibly time-varying size.  And when a
> cleancache-enabled filesystem wishes to access a page in a file on
> disk, it first checks cleancache to see if it already contains it; if
> it does, the page is copied into the kernel and a disk access is
> avoided.
> This pseudo-RAM device links itself to cleancache by setting the
> cleancache_ops pointer appropriately and the functions it provides
> must conform to certain semantics as follows:
>
> Most important, cleancache is "ephemeral".  Pages which are copied
> into cleancache have an indefinite lifetime which is completely
> unknowable by the kernel and so may or may not still be in cleancache
> at any later time.  Thus, as its name implies, cleancache is not
> suitable for dirty pages.  The pseudo-RAM has complete discretion over
> what pages to preserve and what pages to discard and when.
>
> A filesystem calls "init_fs" to obtain a pool id which, if positive,
> must be saved in the filesystem's superblock; a negative return value
> indicates failure.  A "put_page" will copy a (presumably
> about-to-be-evicted) page into pseudo-RAM and associate it with the
> pool id, the file inode, and a page index into the file.  (The
> combination of a pool id, an inode, and an index is called a
> "handle".)  A "get_page" will copy the page, if found, from pseudo-RAM
> into kernel memory.
> A "flush_page" will ensure the page is no longer present in
> pseudo-RAM; a "flush_inode" will flush all pages associated with the
> specified inode; and a "flush_fs" will flush all pages in all inodes
> specified by the given pool id.
>
> An "init_shared_fs", like init, obtains a pool id but tells the
> pseudo-RAM to treat the pool as shared, using a 128-bit UUID as a key.
> On systems that may run multiple kernels (such as hard partitioned or
> virtualized systems) that may share a clustered filesystem, and where
> the pseudo-RAM may be shared among those kernels, calls to
> init_shared_fs that specify the same UUID will receive the same pool
> id, thus allowing the pages to be shared.  Note that any security
> requirements must be imposed outside of the kernel (e.g. by "tools"
> that control the pseudo-RAM).  Or a pseudo-RAM implementation can
> simply disable shared_init by always returning a negative value.
>
> If a get_page is successful on a non-shared pool, the page is flushed
> (thus making cleancache an "exclusive" cache).  On a shared pool, the page

Is there any reason to force "exclusive" on a non-shared pool?
To free memory in pseudo-RAM?
I want to make it "inclusive" for some reason, but unfortunately I
can't say why just yet.

While you mention it is "exclusive", cleancache_get_page doesn't flush
the page in the code below.
Is that the job of whoever implements cleancache_ops->get_page?

+int __cleancache_get_page(struct page *page)
+{
+	int ret = 0;
+	int pool_id = page->mapping->host->i_sb->cleancache_poolid;
+
+	if (pool_id >= 0) {
+		ret = (*cleancache_ops->get_page)(pool_id,
+						  page->mapping->host->i_ino,
+						  page->index,
+						  page);
+		if (ret == CLEANCACHE_GET_PAGE_SUCCESS)
+			succ_gets++;
+		else
+			failed_gets++;
+	}
+	return ret;
+}
+EXPORT_SYMBOL(__cleancache_get_page);

If the backing device is RAM (e.g. brd), could we _move_ the pages from
page cache to cleancache?
I mean, I don't want to copy pages on get/put operations; we could just
move the page when the backing device is RAM.  Is it possible?

You sent the patches for the core of cleancache, but I don't see any
use case.
Could you send use-case patches with this series?
It would help people understand cleancache's benefit.

--
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 15+ messages in thread
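The pool/handle contract described in the overview above (init_fs, put_page, get_page with exclusive semantics, flush_inode) can be modeled in user space. The sketch below is purely illustrative: the `PseudoRAM` class and its dict-backed store are invented stand-ins for the opaque backend, not kernel code.

```python
# Toy user-space model of the cleancache contract described above.
# PseudoRAM and the dict-based store are invented for illustration; the
# real backend is opaque memory that the kernel cannot address directly.

class PseudoRAM:
    def __init__(self):
        self.pools = {}      # pool_id -> {(inode, index): page data}
        self.next_pool = 0

    def init_fs(self):
        # A filesystem saves the returned pool id in its superblock;
        # a negative value would indicate failure.
        pool_id = self.next_pool
        self.next_pool += 1
        self.pools[pool_id] = {}
        return pool_id

    def put_page(self, pool_id, inode, index, data):
        # The backend has complete discretion; it may silently drop pages.
        self.pools[pool_id][(inode, index)] = bytes(data)

    def get_page(self, pool_id, inode, index):
        # Exclusive semantics: a successful get also flushes the page.
        return self.pools[pool_id].pop((inode, index), None)

    def flush_inode(self, pool_id, inode):
        pool = self.pools[pool_id]
        for handle in [h for h in pool if h[0] == inode]:
            del pool[handle]

backend = PseudoRAM()
pool = backend.init_fs()
backend.put_page(pool, inode=42, index=0, data=b"clean page contents")

first = backend.get_page(pool, 42, 0)    # hit, and the page is flushed
second = backend.get_page(pool, 42, 0)   # miss: the cache is "exclusive"
print(first, second)
```

The (pool id, inode, index) tuple plays the role of the "handle"; the exclusive get is why Minchan's question below about flushing arises at all.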
* RE: [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview
  2010-06-02  6:03 ` [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview Minchan Kim
@ 2010-06-02 15:27   ` Dan Magenheimer
  2010-06-02 16:38     ` Minchan Kim
  0 siblings, 1 reply; 15+ messages in thread
From: Dan Magenheimer @ 2010-06-02 15:27 UTC (permalink / raw)
To: Minchan Kim
Cc: chris.mason, viro, akpm, adilger, tytso, mfasheh, joel.becker,
    matthew, linux-btrfs, linux-kernel, linux-fsdevel, linux-ext4,
    ocfs2-devel, linux-mm, ngupta, jeremy, JBeulich, kurt.hackel,
    npiggin, dave.mccracken, riel, avi, konrad.wilk

Hi Minchan --

> I think the cleancache approach is cool. :)
> I have some suggestions and questions.

Thanks for your interest!

> > If a get_page is successful on a non-shared pool, the page is flushed
> > (thus making cleancache an "exclusive" cache).  On a shared pool, the page
>
> Is there any reason to force "exclusive" on a non-shared pool?
> To free memory in pseudo-RAM?
> I want to make it "inclusive" for some reason, but unfortunately I
> can't say why just yet.

The main reason is to free up memory in pseudo-RAM and to avoid
unnecessary cleancache_flush calls.  If you want inclusive, the page
can be put immediately following the get.  If put-after-get for
inclusive becomes common, the interface could easily be extended to
add a "get_no_flush" call.

> While you mention it is "exclusive", cleancache_get_page doesn't flush
> the page in the code below.
> Is that the job of whoever implements cleancache_ops->get_page?

Yes, the flush is done by the cleancache implementation.

> If the backing device is RAM (e.g. brd), could we _move_ the pages
> from page cache to cleancache?
> I mean, I don't want to copy pages on get/put operations; we could
> just move the page when the backing device is RAM.  Is it possible?

By "move", do you mean changing the virtual mappings?
Yes, this could be done as long as the source and destination are both
directly addressable (that is, true physical RAM), but it requires TLB
manipulation and has some complicated corner cases.  The copy
semantics simplifies the implementation on both the "frontend" and the
"backend" and also allows the backend to do fancy things on-the-fly
like page compression and page deduplication.

> You sent the patches for the core of cleancache, but I don't see any
> use case.
> Could you send use-case patches with this series?
> It would help people understand cleancache's benefit.

Do you mean the Xen Transcendent Memory ("tmem") implementation?
If so, this is four files in the Xen source tree (common/tmem.c,
common/tmem_xen.c, include/xen/tmem.h, include/xen/tmem_xen.h).
There is also an html document in the Xen source tree, which can be
viewed here:
http://oss.oracle.com/projects/tmem/dist/documentation/internals/xen4-internals-v01.html

Or did you mean a cleancache_ops "backend"?  For tmem, there is one
file linux/drivers/xen/tmem.c and it interfaces between the
cleancache_ops calls and Xen hypercalls.  It should be in a Xenlinux
pv_ops tree soon, or I can email it sooner.

I am also eagerly awaiting Nitin Gupta's cleancache backend and
implementation to do in-kernel page cache compression.

Thanks,
Dan
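Dan's proposed interface extension can be sketched the same toy way. Note that "get_no_flush" was only suggested in the exchange above and did not exist in the posted patches; the flat dict is an invented stand-in for the opaque backend.

```python
# Sketch of the hypothetical "get_no_flush" extension discussed above.
# The dict is a stand-in for opaque pseudo-RAM; nothing here is kernel API.

store = {}  # (pool_id, inode, index) -> page data

def put_page(pool_id, inode, index, data):
    store[(pool_id, inode, index)] = bytes(data)

def get_page(pool_id, inode, index):
    # Default exclusive semantics: a hit also flushes the page.
    return store.pop((pool_id, inode, index), None)

def get_no_flush(pool_id, inode, index):
    # Inclusive variant: the page stays in the backend after a hit,
    # avoiding the put-after-get round trip described above.
    return store.get((pool_id, inode, index))

put_page(0, 42, 0, b"page")
assert get_no_flush(0, 42, 0) == b"page"   # still cached afterwards
assert get_page(0, 42, 0) == b"page"       # hit, and flushed
assert get_page(0, 42, 0) is None          # exclusive: second get misses
```

Put-after-get reproduces the inclusive behavior without a new entry point, at the cost of one extra backend call per hit; get_no_flush trades a wider interface for fewer calls.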
* Re: [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview
  2010-06-02 15:27 ` Dan Magenheimer
@ 2010-06-02 16:38   ` Minchan Kim
  2010-06-02 23:02     ` Dan Magenheimer
  0 siblings, 1 reply; 15+ messages in thread
From: Minchan Kim @ 2010-06-02 16:38 UTC (permalink / raw)
To: Dan Magenheimer
Cc: chris.mason, viro, akpm, adilger, tytso, mfasheh, joel.becker,
    matthew, linux-btrfs, linux-kernel, linux-fsdevel, linux-ext4,
    ocfs2-devel, linux-mm, ngupta, jeremy, JBeulich, kurt.hackel,
    npiggin, dave.mccracken, riel, avi, konrad.wilk

Hi, Dan.

On Wed, Jun 02, 2010 at 08:27:48AM -0700, Dan Magenheimer wrote:
> Hi Minchan --
>
> > I think the cleancache approach is cool. :)
> > I have some suggestions and questions.
>
> Thanks for your interest!
>
> > > If a get_page is successful on a non-shared pool, the page is flushed
> > > (thus making cleancache an "exclusive" cache).  On a shared pool, the page
> >
> > Is there any reason to force "exclusive" on a non-shared pool?
> > To free memory in pseudo-RAM?
> > I want to make it "inclusive" for some reason, but unfortunately I
> > can't say why just yet.
>
> The main reason is to free up memory in pseudo-RAM and to
> avoid unnecessary cleancache_flush calls.  If you want
> inclusive, the page can be put immediately following
> the get.  If put-after-get for inclusive becomes common,
> the interface could easily be extended to add a "get_no_flush"
> call.

Sounds good to me.

> > While you mention it is "exclusive", cleancache_get_page doesn't
> > flush the page in the code below.
> > Is that the job of whoever implements cleancache_ops->get_page?
>
> Yes, the flush is done by the cleancache implementation.
>
> > If the backing device is RAM (e.g. brd), could we _move_ the pages
> > from page cache to cleancache?
> > I mean, I don't want to copy pages on get/put operations; we could
> > just move the page when the backing device is RAM.  Is it possible?
>
> By "move", do you mean changing the virtual mappings?

Yes.

> this could be done as long as the source and destination are
> both directly addressable (that is, true physical RAM), but
> requires TLB manipulation and has some complicated corner
> cases.  The copy semantics simplifies the implementation on
> both the "frontend" and the "backend" and also allows the
> backend to do fancy things on-the-fly like page compression
> and page deduplication.

Agree.  But that isn't what I mean.
If I use brd as a backend, I want to do the following:

put_page :
	remove_from_page_cache(page);
	brd_insert_page(page);

get_page :
	brd_lookup_page(page);
	add_to_page_cache(page);

Of course, I know it's impossible without new metadata and modification
of page cache handling, and that the copy semantics is what keeps the
frontend/backend design cleanly layered.

What I want is to remove the copy overhead when the backend is RAM and
is also part of main memory (i.e., we have page descriptors for it).

Do you have an idea?

> > You sent the patches for the core of cleancache, but I don't see any
> > use case.
> > Could you send use-case patches with this series?
> > It would help people understand cleancache's benefit.
>
> Do you mean the Xen Transcendent Memory ("tmem") implementation?
> If so, this is four files in the Xen source tree (common/tmem.c,
> common/tmem_xen.c, include/xen/tmem.h, include/xen/tmem_xen.h).
> There is also an html document in the Xen source tree, which can
> be viewed here:
> http://oss.oracle.com/projects/tmem/dist/documentation/internals/xen4-internals-v01.html
>
> Or did you mean a cleancache_ops "backend"?  For tmem, there
> is one file linux/drivers/xen/tmem.c and it interfaces between
> the cleancache_ops calls and Xen hypercalls.  It should be in
> a Xenlinux pv_ops tree soon, or I can email it sooner.

I mean "backend". :)

> I am also eagerly awaiting Nitin Gupta's cleancache backend
> and implementation to do in-kernel page cache compression.
Did Nitin say he will make a cleancache backend for page cache
compression?

It would be a good feature.
I'm interested, too. :)

Thanks, Dan.

--
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 15+ messages in thread
* RE: [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview
  2010-06-02 16:38 ` Minchan Kim
@ 2010-06-02 23:02   ` Dan Magenheimer
  2010-06-03  2:46     ` Nitin Gupta
  0 siblings, 1 reply; 15+ messages in thread
From: Dan Magenheimer @ 2010-06-02 23:02 UTC (permalink / raw)
To: Minchan Kim
Cc: chris.mason, viro, akpm, adilger, tytso, mfasheh, joel.becker,
    matthew, linux-btrfs, linux-kernel, linux-fsdevel, linux-ext4,
    ocfs2-devel, linux-mm, ngupta, jeremy, JBeulich, kurt.hackel,
    npiggin, dave.mccracken, riel, avi, konrad.wilk

> From: Minchan Kim [mailto:minchan.kim@gmail.com]

> > I am also eagerly awaiting Nitin Gupta's cleancache backend
> > and implementation to do in-kernel page cache compression.
>
> Did Nitin say he will make a cleancache backend for page cache
> compression?
>
> It would be a good feature.
> I'm interested, too. :)

That was Nitin's plan for his GSOC project when we last discussed
this.  Nitin is on the cc list and can comment if this has changed.

> > By "move", do you mean changing the virtual mappings?  Yes,
> > this could be done as long as the source and destination are
> > both directly addressable (that is, true physical RAM), but
> > requires TLB manipulation and has some complicated corner
> > cases.  The copy semantics simplifies the implementation on
> > both the "frontend" and the "backend" and also allows the
> > backend to do fancy things on-the-fly like page compression
> > and page deduplication.
>
> Agree.  But that isn't what I mean.
> If I use brd as a backend, I want to do the following:
>
> <snip>
>
> Of course, I know it's impossible without new metadata and
> modification of page cache handling, and that the copy semantics
> is what keeps the frontend/backend design cleanly layered.
>
> What I want is to remove the copy overhead when the backend is RAM
> and is also part of main memory (i.e., we have page descriptors).
>
> Do you have an idea?

Copy overhead on modern processors is very low now due to very wide
memory buses.
The additional metadata and code to handle coherency and concurrency,
plus the existing overhead for batching and asynchronous access to brd,
is likely much higher than the cost avoided by not copying.  But if you
did implement this without copying, I think you might need a different
set of hooks in various places.  I don't know.

> > Or did you mean a cleancache_ops "backend"?  For tmem, there
> > is one file linux/drivers/xen/tmem.c and it interfaces between
> > the cleancache_ops calls and Xen hypercalls.  It should be in
> > a Xenlinux pv_ops tree soon, or I can email it sooner.
>
> I mean "backend". :)

I dropped the code used for a RHEL6beta Xen tmem driver here:
http://oss.oracle.com/projects/tmem/dist/files/RHEL6beta/tmem-backend.patch

Thanks,
Dan

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview
  2010-06-02 23:02 ` Dan Magenheimer
@ 2010-06-03  2:46   ` Nitin Gupta
  2010-06-03  4:53     ` Andreas Dilger
  0 siblings, 1 reply; 15+ messages in thread
From: Nitin Gupta @ 2010-06-03  2:46 UTC (permalink / raw)
To: Dan Magenheimer
Cc: Minchan Kim, chris.mason, viro, akpm, adilger, tytso, mfasheh,
    joel.becker, matthew, linux-btrfs, linux-kernel, linux-fsdevel,
    linux-ext4, ocfs2-devel, linux-mm, jeremy, JBeulich, kurt.hackel,
    npiggin, dave.mccracken, riel, avi, konrad.wilk

On 06/03/2010 04:32 AM, Dan Magenheimer wrote:
>> From: Minchan Kim [mailto:minchan.kim@gmail.com]
>
>>> I am also eagerly awaiting Nitin Gupta's cleancache backend
>>> and implementation to do in-kernel page cache compression.
>>
>> Did Nitin say he will make a cleancache backend for page cache
>> compression?
>>
>> It would be a good feature.
>> I'm interested, too. :)
>
> That was Nitin's plan for his GSOC project when we last discussed
> this.  Nitin is on the cc list and can comment if this has changed.
>

Yes, I have just started work on an in-kernel page cache compression
backend for cleancache :)

Thanks,
Nitin

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview
  2010-06-03  2:46 ` Nitin Gupta
@ 2010-06-03  4:53   ` Andreas Dilger
  2010-06-03  6:25     ` Nitin Gupta
  0 siblings, 1 reply; 15+ messages in thread
From: Andreas Dilger @ 2010-06-03  4:53 UTC (permalink / raw)
To: ngupta
Cc: Dan Magenheimer, Minchan Kim, chris.mason, viro, akpm, adilger,
    tytso, mfasheh, joel.becker, matthew, linux-btrfs, linux-kernel,
    linux-fsdevel, linux-ext4, ocfs2-devel, linux-mm, jeremy, JBeulich,
    kurt.hackel, npiggin, dave.mccracken, riel, avi, konrad.wilk

On 2010-06-02, at 20:46, Nitin Gupta wrote:
> On 06/03/2010 04:32 AM, Dan Magenheimer wrote:
>>> From: Minchan Kim [mailto:minchan.kim@gmail.com]
>>
>>>> I am also eagerly awaiting Nitin Gupta's cleancache backend
>>>> and implementation to do in-kernel page cache compression.
>>>
>>> Did Nitin say he will make a cleancache backend for page cache
>>> compression?
>>>
>>> It would be a good feature.
>>> I'm interested, too. :)
>>
>> That was Nitin's plan for his GSOC project when we last discussed
>> this.  Nitin is on the cc list and can comment if this has changed.
>
> Yes, I have just started work on an in-kernel page cache compression
> backend for cleancache :)

Is there a design doc for this implementation?  I was thinking it would
be quite clever to do compression in, say, 64kB or 128kB chunks in a
mapping (to get decent compression) and then write these compressed
chunks directly from the page cache to disk in btrfs and/or a revived
compressed ext4.

That would mean that the on-disk compression algorithm needs to match
the in-memory algorithm, which implies that the in-memory compression
algorithm should be selectable on a per-mapping basis.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview
  2010-06-03  4:53 ` Andreas Dilger
@ 2010-06-03  6:25   ` Nitin Gupta
  2010-06-03 15:43     ` Dan Magenheimer
  0 siblings, 1 reply; 15+ messages in thread
From: Nitin Gupta @ 2010-06-03  6:25 UTC (permalink / raw)
To: Andreas Dilger
Cc: Dan Magenheimer, Minchan Kim, chris.mason, viro, akpm, adilger,
    tytso, mfasheh, joel.becker, matthew, linux-btrfs, linux-kernel,
    linux-fsdevel, linux-ext4, ocfs2-devel, linux-mm, jeremy, JBeulich,
    kurt.hackel, npiggin, dave.mccracken, riel, avi, konrad.wilk

On 06/03/2010 10:23 AM, Andreas Dilger wrote:
> On 2010-06-02, at 20:46, Nitin Gupta wrote:
>> On 06/03/2010 04:32 AM, Dan Magenheimer wrote:
>>>> From: Minchan Kim [mailto:minchan.kim@gmail.com]
>>>
>>>>> I am also eagerly awaiting Nitin Gupta's cleancache backend
>>>>> and implementation to do in-kernel page cache compression.
>>>>
>>>> Did Nitin say he will make a cleancache backend for page cache
>>>> compression?
>>>>
>>>> It would be a good feature.
>>>> I'm interested, too. :)
>>>
>>> That was Nitin's plan for his GSOC project when we last discussed
>>> this.  Nitin is on the cc list and can comment if this has changed.
>>
>> Yes, I have just started work on an in-kernel page cache compression
>> backend for cleancache :)
>
> Is there a design doc for this implementation?

It's all on physical paper :)  Anyway, the design is quite simple, as
it just has to act on cleancache callbacks.

> I was thinking it would be quite clever to do compression in, say,
> 64kB or 128kB chunks in a mapping (to get decent compression) and then
> write these compressed chunks directly from the page cache to disk in
> btrfs and/or a revived compressed ext4.

Batching of pages to get a good compression ratio seems doable.
However, writing this compressed data (with or without batching) to
disk seems quite difficult.  Pages given out to cleancache are not part
of the page cache, and the disk might also contain an uncompressed
version of the same data.
There is also the problem of an efficient on-disk structure for storing
variable-sized compressed chunks.  I'm not sure how we can deal with
all these issues.

Thanks,
Nitin

^ permalink raw reply	[flat|nested] 15+ messages in thread
* RE: [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview
  2010-06-03  6:25 ` Nitin Gupta
@ 2010-06-03 15:43   ` Dan Magenheimer
  2010-06-04  9:36     ` Nitin Gupta
  0 siblings, 1 reply; 15+ messages in thread
From: Dan Magenheimer @ 2010-06-03 15:43 UTC (permalink / raw)
To: ngupta, andreas.dilger
Cc: Minchan Kim, chris.mason, viro, akpm, adilger, tytso, mfasheh,
    joel.becker, matthew, linux-btrfs, linux-kernel, linux-fsdevel,
    linux-ext4, ocfs2-devel, linux-mm, jeremy, JBeulich, kurt.hackel,
    npiggin, dave.mccracken, riel, avi, konrad.wilk

> On 06/03/2010 10:23 AM, Andreas Dilger wrote:
> > On 2010-06-02, at 20:46, Nitin Gupta wrote:
>
> > I was thinking it would be quite clever to do compression in, say,
> > 64kB or 128kB chunks in a mapping (to get decent compression) and
> > then write these compressed chunks directly from the page cache
> > to disk in btrfs and/or a revived compressed ext4.
>
> Batching of pages to get a good compression ratio seems doable.

Is there evidence that batching a set of random individual 4K pages
will have a significantly better compression ratio than compressing
the pages separately?  I certainly understand that if the pages are
from the same file, compression is likely to be better, but pages
evicted from the page cache (which is the source for all
cleancache_puts) are likely to be quite a bit more random than that,
aren't they?

^ permalink raw reply	[flat|nested] 15+ messages in thread
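Dan's question lends itself to a quick user-space experiment. The sketch below is only illustrative: the pages are synthetic (same-file style, sharing a vocabulary across pages) and zlib stands in for whatever in-kernel compressor would actually be used, so the numbers say nothing definitive about real page-cache data.

```python
# Rough experiment for the batching question above: compress 4K pages
# one at a time vs. as a single batch.  Synthetic same-file-style pages
# and zlib are stand-ins; real page-cache data may behave differently.
import zlib

PAGE = 4096

def make_pages(n):
    # Pages from the "same file": a shared vocabulary, so redundancy
    # spans page boundaries and a batch can exploit it.
    words = [b"inode", b"page", b"cache", b"clean", b"flush", b"pool"]
    pages = []
    for i in range(n):
        body = b" ".join(words[(i + j) % len(words)] for j in range(800))
        pages.append(body[:PAGE].ljust(PAGE, b"."))
    return pages

pages = make_pages(32)

# Per-page compression: each page pays its own header and cannot
# reference redundancy in neighboring pages.
separate = sum(len(zlib.compress(p)) for p in pages)

# Batched compression: one stream over all 32 pages (128 KB), so
# cross-page matches fall inside zlib's 32 KB window.
batched = len(zlib.compress(b"".join(pages)))

print("separate: %d bytes, batched: %d bytes" % (separate, batched))
```

With pages this self-similar the batch wins clearly; for genuinely random evicted pages, as Dan suspects, the gap should shrink toward just the saved per-page headers.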
* Re: [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview
  2010-06-03 15:43 ` Dan Magenheimer
@ 2010-06-04  9:36   ` Nitin Gupta
  2010-06-04 13:45     ` Minchan Kim
  0 siblings, 1 reply; 15+ messages in thread
From: Nitin Gupta @ 2010-06-04  9:36 UTC (permalink / raw)
To: Dan Magenheimer
Cc: andreas.dilger, Minchan Kim, chris.mason, viro, akpm, adilger,
    tytso, mfasheh, joel.becker, matthew, linux-btrfs, linux-kernel,
    linux-fsdevel, linux-ext4, ocfs2-devel, linux-mm, jeremy, JBeulich,
    kurt.hackel, npiggin, dave.mccracken, riel, avi, konrad.wilk

On 06/03/2010 09:13 PM, Dan Magenheimer wrote:
>> On 06/03/2010 10:23 AM, Andreas Dilger wrote:
>>> On 2010-06-02, at 20:46, Nitin Gupta wrote:
>>
>>> I was thinking it would be quite clever to do compression in, say,
>>> 64kB or 128kB chunks in a mapping (to get decent compression) and
>>> then write these compressed chunks directly from the page cache
>>> to disk in btrfs and/or a revived compressed ext4.
>>
>> Batching of pages to get a good compression ratio seems doable.
>
> Is there evidence that batching a set of random individual 4K
> pages will have a significantly better compression ratio than
> compressing the pages separately?  I certainly understand that
> if the pages are from the same file, compression is likely to
> be better, but pages evicted from the page cache (which is
> the source for all cleancache_puts) are likely to be quite a
> bit more random than that, aren't they?
>

Batching of pages from random files may not be so effective, but it
would be interesting to collect some data on this.  Still, per-inode
batching of pages seems doable, and that should help us get over this
problem.

Thanks,
Nitin

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview
  2010-06-04  9:36 ` Nitin Gupta
@ 2010-06-04 13:45   ` Minchan Kim
  0 siblings, 0 replies; 15+ messages in thread
From: Minchan Kim @ 2010-06-04 13:45 UTC (permalink / raw)
To: Nitin Gupta
Cc: Dan Magenheimer, andreas.dilger, chris.mason, viro, akpm, adilger,
    tytso, mfasheh, joel.becker, matthew, linux-btrfs, linux-kernel,
    linux-fsdevel, linux-ext4, ocfs2-devel, linux-mm, jeremy, JBeulich,
    kurt.hackel, npiggin, dave.mccracken, riel, avi, konrad.wilk

Hi, Nitin.

I am happy to hear you started this work.

On Fri, Jun 04, 2010 at 03:06:49PM +0530, Nitin Gupta wrote:
> On 06/03/2010 09:13 PM, Dan Magenheimer wrote:
> >> On 06/03/2010 10:23 AM, Andreas Dilger wrote:
> >>> On 2010-06-02, at 20:46, Nitin Gupta wrote:
> >>
> >>> I was thinking it would be quite clever to do compression in, say,
> >>> 64kB or 128kB chunks in a mapping (to get decent compression) and
> >>> then write these compressed chunks directly from the page cache
> >>> to disk in btrfs and/or a revived compressed ext4.
> >>
> >> Batching of pages to get a good compression ratio seems doable.
> >
> > Is there evidence that batching a set of random individual 4K
> > pages will have a significantly better compression ratio than
> > compressing the pages separately?  I certainly understand that
> > if the pages are from the same file, compression is likely to
> > be better, but pages evicted from the page cache (which is
> > the source for all cleancache_puts) are likely to be quite a
> > bit more random than that, aren't they?
>
> Batching of pages from random files may not be so effective, but it
> would be interesting to collect some data on this.  Still, per-inode
> batching of pages seems doable, and that should help us get over this
> problem.

1) Please consider the system memory pressure case.  In that case we
have to release compressed cache pages, or it may be better to discard
pages that compress poorly at put time.
2) This work is related to page reclaim.  Page reclaim exists to free
memory, but this work might free less memory than before.  I admit the
concept is good in terms of I/O cost, but we might discard more clean
pages than before if we batch pages for good compression.

3) Test case.  As I mentioned, this could be good in terms of I/O cost,
but it could change the system's behavior due to the backend's page
consumption, so much more page scanning/reclaim could happen.  That
means hot pages could be discarded with this patch.  But that's just a
guess, so we need numbers from a test case in which we can measure I/O
and system responsiveness.

> Thanks,
> Nitin

--
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 15+ messages in thread
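Minchan's point (1) — discard pages that compress poorly at put time instead of caching them — can be sketched simply. The 75% size threshold and the use of zlib below are illustrative assumptions, not anything from the posted patches.

```python
# Sketch of the suggestion above: at put time, reject pages that do not
# compress well, so the backend never holds pages that barely save memory.
# The 75% threshold and zlib are illustrative assumptions only.
import os
import zlib

PAGE = 4096
MAX_COMPRESSED = PAGE * 3 // 4   # keep a page only if it saves >= 25%

cache = {}  # handle -> compressed page

def put_page(handle, data):
    comp = zlib.compress(data)
    if len(comp) > MAX_COMPRESSED:
        return False             # not worth keeping under memory pressure
    cache[handle] = comp
    return True

def get_page(handle):
    comp = cache.pop(handle, None)   # exclusive get, as in cleancache
    return None if comp is None else zlib.decompress(comp)

text_page = (b"clean cache page " * 256)[:PAGE]   # very compressible
random_page = os.urandom(PAGE)                    # incompressible

print(put_page(("pool", 1, 0), text_page),
      put_page(("pool", 1, 1), random_page))      # → True False
```

Since cleancache puts are allowed to fail silently (the backend has complete discretion), rejecting a poorly compressing page needs no frontend change at all.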
* Re: [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview
  [not found] <20100528173510.GA12166@ca-server1.us.oracle.com>
  2010-06-02  6:03 ` [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview Minchan Kim
@ 2010-06-02 13:00 ` Jamie Lokier
  2010-06-02 15:35   ` Dan Magenheimer
  2010-06-02 13:24 ` Christoph Hellwig
  2 siblings, 1 reply; 15+ messages in thread
From: Jamie Lokier @ 2010-06-02 13:00 UTC (permalink / raw)
To: Dan Magenheimer
Cc: chris.mason, viro, akpm, adilger, tytso, mfasheh, joel.becker,
    matthew, linux-btrfs, linux-kernel, linux-fsdevel, linux-ext4,
    ocfs2-devel, linux-mm, ngupta, jeremy, JBeulich, kurt.hackel,
    npiggin, dave.mccracken, riel, avi, konrad.wilk

Dan Magenheimer wrote:
> Most important, cleancache is "ephemeral".  Pages which are copied
> into cleancache have an indefinite lifetime which is completely
> unknowable by the kernel and so may or may not still be in cleancache
> at any later time.  Thus, as its name implies, cleancache is not
> suitable for dirty pages.  The pseudo-RAM has complete discretion
> over what pages to preserve and what pages to discard and when.

Fwiw, the feature sounds useful to userspace too, for those things with
memory-hungry caches like web browsers.  Any plans to make it available
to userspace?

Thanks,
-- Jamie

^ permalink raw reply	[flat|nested] 15+ messages in thread
* RE: [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview 2010-06-02 13:00 ` Jamie Lokier @ 2010-06-02 15:35 ` Dan Magenheimer 0 siblings, 0 replies; 15+ messages in thread From: Dan Magenheimer @ 2010-06-02 15:35 UTC (permalink / raw) To: Jamie Lokier Cc: chris.mason, viro, akpm, adilger, tytso, mfasheh, joel.becker, matthew, linux-btrfs, linux-kernel, linux-fsdevel, linux-ext4, ocfs2-devel, linux-mm, ngupta, jeremy, JBeulich, kurt.hackel, npiggin, dave.mccracken, riel, avi, konrad.wilk > From: Jamie Lokier [mailto:jamie@shareable.org] > Subject: Re: [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview > > Dan Magenheimer wrote: > > Most important, cleancache is "ephemeral". Pages which are copied into > > cleancache have an indefinite lifetime which is completely unknowable > > by the kernel and so may or may not still be in cleancache at any later time. > > Thus, as its name implies, cleancache is not suitable for dirty pages. The > > pseudo-RAM has complete discretion over what pages to preserve and what > > pages to discard and when. > > Fwiw, the feature sounds useful to userspace too, for those things > with memory hungry caches like web browsers. Any plans to make it > available to userspace? No plans yet, though we agree it sounds useful, at least for apps that bypass the page cache (e.g. O_DIRECT). If you have time and interest to investigate this further, I'd be happy to help. Send email offlist. Thanks, Dan ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview [not found] <20100528173510.GA12166@ca-server1.us.oracle.com> 2010-06-02 6:03 ` [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview Minchan Kim 2010-06-02 13:00 ` Jamie Lokier @ 2010-06-02 13:24 ` Christoph Hellwig 2010-06-02 16:07 ` Dan Magenheimer 2 siblings, 1 reply; 15+ messages in thread From: Christoph Hellwig @ 2010-06-02 13:24 UTC (permalink / raw) To: Dan Magenheimer Cc: chris.mason, viro, akpm, adilger, tytso, mfasheh, joel.becker, matthew, linux-btrfs, linux-kernel, linux-fsdevel, linux-ext4, ocfs2-devel, linux-mm, ngupta, jeremy, JBeulich, kurt.hackel, npiggin, dave.mccracken, riel, avi, konrad.wilk Please give your patches some semi-reasonable subject line. > fs/btrfs/super.c | 2 > fs/buffer.c | 5 + > fs/ext3/super.c | 2 > fs/ext4/super.c | 2 > fs/mpage.c | 7 + > fs/ocfs2/super.c | 3 > fs/super.c | 8 + This is missing out a whole lot of filesystems. Even more so, why the hell do you need hooks into the filesystem? ^ permalink raw reply [flat|nested] 15+ messages in thread
* RE: [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview 2010-06-02 13:24 ` Christoph Hellwig @ 2010-06-02 16:07 ` Dan Magenheimer 0 siblings, 0 replies; 15+ messages in thread From: Dan Magenheimer @ 2010-06-02 16:07 UTC (permalink / raw) To: Christoph Hellwig Cc: chris.mason, viro, akpm, adilger, tytso, mfasheh, joel.becker, matthew, linux-btrfs, linux-kernel, linux-fsdevel, linux-ext4, ocfs2-devel, linux-mm, ngupta, jeremy, JBeulich, kurt.hackel, npiggin, dave.mccracken, riel, avi, konrad.wilk > From: Christoph Hellwig [mailto:hch@infradead.org] > Subject: Re: [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview Hi Christoph -- Thanks for your feedback! > > fs/btrfs/super.c | 2 > > fs/buffer.c | 5 + > > fs/ext3/super.c | 2 > > fs/ext4/super.c | 2 > > fs/mpage.c | 7 + > > fs/ocfs2/super.c | 3 > > fs/super.c | 8 + > > This is missing out a whole lot of filesystems. Even more so why the > hell do you need hooks into the filesystem? Let me rephrase/regroup your question. Let me know if I missed anything... 1) Why is the VFS layer involved at all? VFS hooks are necessary to avoid a disk read when a page is already in cleancache and to maintain coherency (via cleancache_flush operations) between cleancache, the page cache, and disk. This very small, very clean set of hooks (placed by Chris Mason) all compile into nothingness if cleancache is config'ed off, and turn into "if (*p == NULL)" if config'ed on but no "backend" claims cleancache_ops or if an fs doesn't opt-in (see below). 2) Why do the individual filesystems need to be modified? Some filesystems are built entirely on top of VFS and the hooks in VFS are sufficient, so don't require an fs "cleancache_init" hook; the initial implementation of cleancache didn't provide this hook. But for some fs (such as btrfs) the VFS hooks are incomplete and one or more hooks in the fs-specific code are required. For some other fs's (such as tmpfs), cleancache may even be counterproductive. 
So it seemed prudent to require an fs to "opt in" to use cleancache, which requires at least one hook in any fs. 3) Why are filesystems missing? Only because they haven't been tested. The existence proof of four fs's (ext3/ext4/ocfs2/btrfs) should be sufficient to validate the concept, the opt-in approach means that untested filesystems are not affected, and the hooks in the four fs's should serve as examples to show that it should be very easy to add more fs's in the future. > Please give your patches some semi-reasonable subject line. Not sure what you mean... are the subject lines too short? Or should I leave off the back-reference to Transcendent Memory? Or please suggest something you think is more reasonable? Thanks, Dan ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview @ 2010-05-28 17:35 Dan Magenheimer 0 siblings, 0 replies; 15+ messages in thread From: Dan Magenheimer @ 2010-05-28 17:35 UTC (permalink / raw) To: chris.mason, viro, akpm, adilger, tytso, mfasheh, joel.becker, matthew, linux-btrfs [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview Changes since V1: - Rebased to 2.6.34 (no functional changes) - Convert to sane types (Al Viro) - Define some raw constants (Konrad Wilk) - Add ack from Andreas Dilger In previous patch postings, cleancache was part of the Transcendent Memory ("tmem") patchset. This patchset refocuses not on the underlying technology (tmem) but instead on the useful functionality provided for Linux, and provides a clean API so that cleancache can provide this very useful functionality either via a Xen tmem driver OR completely independent of tmem. For example: Nitin Gupta (of compcache and ramzswap fame) is implementing an in-kernel compression "backend" for cleancache; some believe cleancache will be a very nice interface for building RAM-like functionality for pseudo-RAM devices such as SSD or phase-change memory; and a Pune University team is looking at a backend for virtio (see OLS'2010). A more complete description of cleancache can be found in the introductory comment in mm/cleancache.c (in PATCH 2/7) which is included below for convenience. Note that an earlier version of this patch is now shipping in OpenSuSE 11.2 and will soon ship in a release of Oracle Enterprise Linux. Underlying tmem technology is now shipping in Oracle VM 2.2 and was just released in Xen 4.0 on April 15, 2010. 
(Search news.google.com for Transcendent Memory) Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com> Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org> fs/btrfs/extent_io.c | 9 + fs/btrfs/super.c | 2 fs/buffer.c | 5 + fs/ext3/super.c | 2 fs/ext4/super.c | 2 fs/mpage.c | 7 + fs/ocfs2/super.c | 3 fs/super.c | 8 + include/linux/cleancache.h | 90 +++++++++++++++++++ include/linux/fs.h | 5 + mm/Kconfig | 22 ++++ mm/Makefile | 1 mm/cleancache.c | 203 +++++++++++++++++++++++++++++++++++++++++++++ mm/filemap.c | 11 ++ mm/truncate.c | 10 ++ 15 files changed, 380 insertions(+) Cleancache can be thought of as a page-granularity victim cache for clean pages that the kernel's pageframe replacement algorithm (PFRA) would like to keep around, but can't since there isn't enough memory. So when the PFRA "evicts" a page, it first attempts to put it into a synchronous concurrency-safe page-oriented pseudo-RAM device (such as Xen's Transcendent Memory, aka "tmem", or in-kernel compressed memory, aka "zmem", or other RAM-like devices) which is not directly accessible or addressable by the kernel and is of unknown and possibly time-varying size. And when a cleancache-enabled filesystem wishes to access a page in a file on disk, it first checks cleancache to see if it already contains it; if it does, the page is copied into the kernel and a disk access is avoided. This pseudo-RAM device links itself to cleancache by setting the cleancache_ops pointer appropriately and the functions it provides must conform to certain semantics as follows: Most important, cleancache is "ephemeral". Pages which are copied into cleancache have an indefinite lifetime which is completely unknowable by the kernel and so may or may not still be in cleancache at any later time. Thus, as its name implies, cleancache is not suitable for dirty pages. The pseudo-RAM has complete discretion over what pages to preserve and what pages to discard and when. 
A filesystem calls "init_fs" to obtain a pool id which, if positive, must be saved in the filesystem's superblock; a negative return value indicates failure. A "put_page" will copy a (presumably about-to-be-evicted) page into pseudo-RAM and associate it with the pool id, the file inode, and a page index into the file. (The combination of a pool id, an inode, and an index is called a "handle".) A "get_page" will copy the page, if found, from pseudo-RAM into kernel memory. A "flush_page" will ensure the page is no longer present in pseudo-RAM; a "flush_inode" will flush all pages associated with the specified inode; and a "flush_fs" will flush all pages in all inodes specified by the given pool id. An "init_shared_fs", like "init_fs", obtains a pool id but tells the pseudo-RAM to treat the pool as shared using a 128-bit UUID as a key. On systems that may run multiple kernels (such as hard partitioned or virtualized systems) that may share a clustered filesystem, and where the pseudo-RAM may be shared among those kernels, calls to init_shared_fs that specify the same UUID will receive the same pool id, thus allowing the pages to be shared. Note that any security requirements must be imposed outside of the kernel (e.g. by "tools" that control the pseudo-RAM). Or a pseudo-RAM implementation can simply disable shared_init by always returning a negative value. If a get_page is successful on a non-shared pool, the page is flushed (thus making cleancache an "exclusive" cache). On a shared pool, the page is NOT flushed on a successful get_page so that it remains accessible to other sharers. The kernel is responsible for ensuring coherency between cleancache (shared or not), the page cache, and the filesystem, using cleancache flush operations as required. Note that the pseudo-RAM must enforce put-put-get coherency and get-get coherency. 
For the former, if two puts are made to the same handle but with different data, say AAA by the first put and BBB by the second, a subsequent get can never return the stale data (AAA). For get-get coherency, if a get for a given handle fails, subsequent gets for that handle will never succeed unless preceded by a successful put with that handle. Last, pseudo-RAM provides no SMP serialization guarantees; if two different Linux threads are putting and flushing a page with the same handle, the results are indeterminate. ^ permalink raw reply [flat|nested] 15+ messages in thread
[parent not found: <20100528173510.GA12166@ca-server1.us.oracle.comAANLkTilV-4_QaNq5O0WSplDx1Oq7JvkgVrEiR1rgf1up@mail.gmail.com>]
end of thread, other threads:[~2010-06-04 13:45 UTC | newest] Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <20100528173510.GA12166@ca-server1.us.oracle.com> 2010-06-02 6:03 ` [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview Minchan Kim 2010-06-02 15:27 ` Dan Magenheimer 2010-06-02 16:38 ` Minchan Kim 2010-06-02 23:02 ` Dan Magenheimer 2010-06-03 2:46 ` Nitin Gupta 2010-06-03 4:53 ` Andreas Dilger 2010-06-03 6:25 ` Nitin Gupta 2010-06-03 15:43 ` Dan Magenheimer 2010-06-04 9:36 ` Nitin Gupta 2010-06-04 13:45 ` Minchan Kim 2010-06-02 13:00 ` Jamie Lokier 2010-06-02 15:35 ` Dan Magenheimer 2010-06-02 13:24 ` Christoph Hellwig 2010-06-02 16:07 ` Dan Magenheimer 2010-05-28 17:35 Dan Magenheimer [not found] <20100528173510.GA12166@ca-server1.us.oracle.comAANLkTilV-4_QaNq5O0WSplDx1Oq7JvkgVrEiR1rgf1up@mail.gmail.com>
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).