Subject: [GIT PULL] mm: frontswap (for 3.2 window)
From: Dan Magenheimer
Date: 2011-10-27 18:52 UTC
To: Linus Torvalds
Cc: linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge,
    Seth Jennings, ngupta, levinsasha928, Chris Mason, JBeulich,
    Dave Hansen, Jonathan Corbet, Neo Jia

Hi Linus --

Frontswap now has FOUR users: two already merged in-tree (zcache
and Xen) and two still in development but in public git trees
(RAMster and KVM).  Frontswap is part 2 of 2 of the core kernel
changes required to support transcendent memory; part 1 was
cleancache, which you merged at 3.0 (and which now has FIVE users).

Frontswap patches have been in linux-next since June 3 (with zero
changes since Sep 22).  First posted to lkml in June 2009, frontswap
is now at version 11 and has incorporated feedback from a wide range
of kernel developers.  For a good overview, see
http://lwn.net/Articles/454795.  If further rationale is needed,
please see the end of this email for more info.

SO... Please pull:

  git://oss.oracle.com/git/djm/tmem.git #tmem

since git commit b6fd41e29dea9c6753b1843a77e50433e6123bcb

  Linus Torvalds (1):
        Linux 3.1-rc6

(identical commits have been pulled by sfr into linux-next since Sep 22)

Note that in addition to frontswap, this commit series includes some
minor changes to cleancache, necessary for consistency with the changes
to frontswap required by Andrew Morton (e.g. the flush->invalidate name
change; using debugfs instead of sysfs).  As a result, a handful of
cleancache-related VFS files incur only a very small change.

Dan Magenheimer (8):
      mm: frontswap: add frontswap header file
      mm: frontswap: core swap subsystem hooks and headers
      mm: frontswap: core frontswap functionality
      mm: frontswap: config and doc files
      mm: cleancache: s/flush/invalidate/
      mm: frontswap/cleancache: s/flush/invalidate/
      mm: cleancache: report statistics via debugfs instead of sysfs.
      mm: cleancache: Use __read_mostly as appropiate.
Diffstat:
 .../ABI/testing/sysfs-kernel-mm-cleancache |  11 -
 Documentation/vm/cleancache.txt            |  41 ++--
 Documentation/vm/frontswap.txt             | 210 +++++++++++++++
 drivers/staging/zcache/zcache-main.c       |  10 +-
 drivers/xen/tmem.c                         |  10 +-
 fs/buffer.c                                |   2 +-
 fs/super.c                                 |   2 +-
 include/linux/cleancache.h                 |  24 +-
 include/linux/frontswap.h                  |   9 +-
 include/linux/swap.h                       |   4 +
 include/linux/swapfile.h                   |  13 +
 mm/Kconfig                                 |  17 ++
 mm/Makefile                                |   1 +
 mm/cleancache.c                            |  98 +++-----
 mm/filemap.c                               |   2 +-
 mm/frontswap.c                             | 273 ++++++++++++++++++++
 mm/page_io.c                               |  12 +
 mm/swapfile.c                              |  64 ++++-
 mm/truncate.c                              |  10 +-
 19 files changed, 672 insertions(+), 141 deletions(-)

====

FURTHER RATIONALE, INFORMATION, AND LINKS:

In-kernel users (grep for CONFIG_FRONTSWAP):
 - drivers/staging/zcache (since 2.6.39)
 - drivers/xen/tmem.c (since 3.1)
 - drivers/xen/xen-selfballoon.c (since 3.1)

Users in development in public git trees:
 - "RAMster" driver, see ramster branch of
   git://oss.oracle.com/git/djm/tmem.git
 - KVM port now underway, see:
   https://github.com/sashalevin/kvm-tmem/commits/tmem

History of frontswap code:
 - code first written in Dec 2008
 - previously known as "hswap" and "preswap"
 - first public posting in Feb 2009
 - first LKML posting on June 19, 2009
 - renamed frontswap, posted on May 28, 2010
 - in linux-next since June 3, 2011
 - incorporated feedback from: (partial list)
   Andrew Morton, Jan Beulich, Konrad Wilk, Jeremy Fitzhardinge,
   Kamezawa Hiroyuki, Seth Jennings (IBM)

Linux kernel distros incorporating frontswap:
 - Oracle UEK 2.6.39 Beta:
   http://oss.oracle.com/git/?p=linux-2.6-unbreakable-beta.git;a=summary
 - OpenSuSE since 11.2 (2009) [see mm/tmem-xen.c]
   http://kernel.opensuse.org/cgit/kernel/
 - a popular Gentoo distro
   http://forums.gentoo.org/viewtopic-t-862105.html

Xen distros supporting Linux guests with frontswap:
 - Xen hypervisor backend since Xen 4.0 (2009)
   http://www.xen.org/files/Xen_4_0_Datasheet.pdf
 - OracleVM since 2.2 (2009)
   http://twitter.com/#!/Djelibeybi/status/113876514688352256

Public visibility for frontswap (as part of transcendent memory):
 - presented at OSDI'08, OLS'09, LCA'10, LPC'10, LinuxCon NA '11,
   Oracle Open World 2011, two LSF/MM Summits (2010, 2011), and
   three Xen Summits (2009, 2010, 2011)
 - http://lwn.net/Articles/454795 (current overview)
 - http://lwn.net/Articles/386090 (2010)
 - http://lwn.net/Articles/340080 (2009)
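The interface at the center of this thread is small: a backend (zcache,
Xen tmem, RAMster, a future KVM host) fills in an ops table and registers
it with the frontswap core, which then offers every page headed for swap
to that backend before any disk I/O is issued.  The sketch below is a
reconstruction for illustration only -- the exact struct layout, member
names, and registration signature in the v11 series may differ; the
put/get/invalidate naming follows the s/flush/invalidate/ commits in the
pull above, and the registration call site is hypothetical.

/*
 * Illustrative sketch of a frontswap backend, NOT a verbatim excerpt
 * from the v11 patchset; member names and signatures are assumptions
 * based on the naming used in this thread.
 */
#include <linux/frontswap.h>
#include <linux/swap.h>
#include <linux/module.h>

static void example_init(unsigned type)
{
	/* a swap area ("type") was swapon'd; set up per-area state */
}

static int example_put_page(unsigned type, pgoff_t offset,
			    struct page *page)
{
	/*
	 * Copy the page into backend-owned memory.  Returning 0 means
	 * "accepted": the page will not be written to the swap device.
	 * Any other return rejects the page and it goes to disk.
	 */
	return -1;
}

static int example_get_page(unsigned type, pgoff_t offset,
			    struct page *page)
{
	/* refill @page from a previous successful put; 0 on success */
	return -1;
}

static void example_invalidate_page(unsigned type, pgoff_t offset)
{
	/* the swap slot was freed; drop any copy the backend holds */
}

static void example_invalidate_area(unsigned type)
{
	/* swapoff: drop everything associated with this swap area */
}

static struct frontswap_ops example_frontswap_ops = {
	.init			= example_init,
	.put_page		= example_put_page,
	.get_page		= example_get_page,
	.invalidate_page	= example_invalidate_page,
	.invalidate_area	= example_invalidate_area,
};

static int __init example_frontswap_init(void)
{
	/* hand the ops to the frontswap core (hypothetical call site) */
	frontswap_register_ops(&example_frontswap_ops);
	return 0;
}
module_init(example_frontswap_init);

The key design choice visible here is that a put may be rejected page by
page, at any time: the backend is under no obligation to accept anything,
which is what lets the same handful of hooks sit in front of compressed
in-kernel memory, a peer machine's RAM, or hypervisor-owned memory.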
Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
From: Kurt Hackel
Date: 2011-10-27 19:30 UTC
To: Dan Magenheimer
Cc: Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk,
    Jeremy Fitzhardinge, Seth Jennings, ngupta, levinsasha928,
    Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet, Neo Jia

Hi,

As the dev manager for OracleVM (x86), I'd like to express my interest
in seeing frontswap get merged upstream.  The OracleVM product has been
capable of working with frontswap for over a year now, and we'd very
much like to see the complete cleancache+frontswap feature set fully
upstreamed.  Oracle is also fully committed to the ongoing maintenance
of frontswap.

thanks
kurt

Kurt C. Hackel
Development Director, Oracle VM
kurt.hackel@oracle.com

On 10/27/2011 11:52 AM, Dan Magenheimer wrote:
> Hi Linus --
>
> Frontswap now has FOUR users: two already merged in-tree (zcache
> and Xen) and two still in development but in public git trees
> (RAMster and KVM).  Frontswap is part 2 of 2 of the core kernel
> changes required to support transcendent memory; part 1 was
> cleancache, which you merged at 3.0 (and which now has FIVE users).
> [...]
Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
From: David Rientjes
Date: 2011-10-27 20:18 UTC
To: Dan Magenheimer
Cc: Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk,
    Jeremy Fitzhardinge, Seth Jennings, ngupta, levinsasha928,
    Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet, Neo Jia

On Thu, 27 Oct 2011, Dan Magenheimer wrote:
> Hi Linus --
>
> [...]
>
> SO... Please pull:
>
>   git://oss.oracle.com/git/djm/tmem.git #tmem
>

Isn't this something that should go through the -mm tree?
Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
From: Christoph Hellwig
Date: 2011-10-27 21:11 UTC
To: David Rientjes
Cc: Dan Magenheimer, Linus Torvalds, linux-mm, LKML, Andrew Morton,
    Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta,
    levinsasha928, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet,
    Neo Jia

On Thu, Oct 27, 2011 at 01:18:40PM -0700, David Rientjes wrote:
> Isn't this something that should go through the -mm tree?

It should have.  It should also have ACKs from the core VM developers,
and at least the few I talked to about it really didn't seem to like it.
Subject: RE: [GIT PULL] mm: frontswap (for 3.2 window)
From: Dan Magenheimer
Date: 2011-10-27 21:49 UTC
To: Christoph Hellwig, David Rientjes
Cc: Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk,
    Jeremy Fitzhardinge, Seth Jennings, ngupta, levinsasha928,
    Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet, Neo Jia

> From: Christoph Hellwig [mailto:hch@infradead.org]
> Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
>
> On Thu, Oct 27, 2011 at 01:18:40PM -0700, David Rientjes wrote:
> > Isn't this something that should go through the -mm tree?
>
> It should have.  It should also have ACKs from the core VM developers,
> and at least the few I talked to about it really didn't seem to like it.

Yes, it would have been nice to have it go through the -mm tree.

But, *sigh*, I guess it will be up to Linus again to decide whether
"didn't seem to like it" is sufficient to block functionality that has
found use by a number of in-kernel users and by real shipping
products... and continues to grow in usefulness.

If Linux truly subscribes to the "code rules" mantra: no core VM
developer has proposed anything -- even a design, let alone working
code -- that comes close to providing the functionality and flexibility
that frontswap (and cleancache) provides, and frontswap provides it
with a very VERY small impact on existing kernel code AND has been
posted and working for 2+ years.  (And during those 2+ years, excellent
feedback has improved the "kernel-ness" of the code, but NONE of the
core frontswap design/hooks have changed... because frontswap
_just works_!)

Perhaps other frontswap users would be so kind as to reply on this
thread with their opinions...

Dan
Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
From: Christoph Hellwig
Date: 2011-10-27 21:52 UTC
To: Dan Magenheimer
Cc: Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML,
    Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings,
    ngupta, levinsasha928, Chris Mason, JBeulich, Dave Hansen,
    Jonathan Corbet, Neo Jia

On Thu, Oct 27, 2011 at 02:49:31PM -0700, Dan Magenheimer wrote:
> If Linux truly subscribes to the "code rules" mantra: no core VM
> developer has proposed anything -- even a design, let alone working
> code -- that comes close to providing the functionality and flexibility
> that frontswap (and cleancache) provides, and frontswap provides it
> with a very VERY small impact on existing kernel code AND has been
> posted and working for 2+ years.  (And during those 2+ years, excellent
> feedback has improved the "kernel-ness" of the code, but NONE of the
> core frontswap design/hooks have changed... because frontswap
> _just works_!)

It might work for whatever definition of work, but you certainly
couldn't convince anyone that matters that it's actually sexy and that
we'd actually need it.  Only actually working on Xen of course doesn't
help.

In the end it's a bunch of really ugly hooks over core code, without a
clear definition of how they work or a killer use case.
Subject: RE: [GIT PULL] mm: frontswap (for 3.2 window)
From: Dan Magenheimer
Date: 2011-10-27 22:21 UTC
To: Christoph Hellwig
Cc: David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton,
    Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta,
    levinsasha928, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet,
    Neo Jia

> From: Christoph Hellwig [mailto:hch@infradead.org]
> Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
>
> It might work for whatever definition of work, but you certainly
> couldn't convince anyone that matters that it's actually sexy and that
> we'd actually need it.  Only actually working on Xen of course doesn't
> help.
>
> In the end it's a bunch of really ugly hooks over core code, without a
> clear definition of how they work or a killer use case.

Hi Christoph --

You might find it useful to read the whole base email and/or the lwn
article referenced.  Frontswap and cleancache have now gone far beyond
X-e-n** and even beyond virtualization.  That's why my talk at LinuxCon
was titled "Transcendent Memory: Not Just for Virtualization Anymore".
(And I stated at that talk that I have personally not written a line of
X-e-n code in over a year now.)  The same frontswap hooks _just work_
for zcache, RAMster and (soon) KVM too... and there are more uses
coming.  Those that take the time to understand its use model DO find
frontswap useful.

Is "sexy" or "killer use case" a requirement for Linus to merge code
now?  If so, he can plan to spend a lot more time diving, as I'll bet
there isn't much code that measures up.

Thanks,
Dan

** /me suspects that Christoph has a /dev/null filter for email
containing that word, so has cleverly spelled it out to defeat that
filter :-)
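For readers wondering what the "hooks" in this exchange look like
concretely: the diffstat in the pull request adds about a dozen lines to
mm/page_io.c, which is where the swap-out path would consult the backend.
The following is a hand-drawn sketch of that placement, patterned on the
descriptions in this thread rather than copied from the patches; the
hook's name and the elided sections are assumptions.

/*
 * Sketch of the swap-out hook placement (NOT the actual patch).
 * The "..." comments stand in for the existing swap_writepage()
 * logic in mm/page_io.c.
 */
int swap_writepage(struct page *page, struct writeback_control *wbc)
{
	int ret = 0;

	/* ... existing page re-use/validity checks ... */

	if (frontswap_put_page(page) == 0) {	/* hypothetical hook name */
		/*
		 * The backend kept a copy, so the page counts as
		 * "written" without ever touching the swap device.
		 */
		set_page_writeback(page);
		unlock_page(page);
		end_page_writeback(page);
		goto out;
	}

	/* ... existing bio allocation and submission to the swap device ... */
out:
	return ret;
}

Because the hook sits before bio submission and falls straight through on
rejection, a kernel with no registered backend pays essentially one branch
per swap-out, which is the basis for the "very small impact" claim above.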
Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
From: Sasha Levin
Date: 2011-10-28 7:12 UTC
To: Christoph Hellwig
Cc: Dan Magenheimer, David Rientjes, Linus Torvalds, linux-mm, LKML,
    Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings,
    ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet, Neo Jia

On Thu, 2011-10-27 at 17:52 -0400, Christoph Hellwig wrote:
> It might work for whatever definition of work, but you certainly
> couldn't convince anyone that matters that it's actually sexy and that
> we'd actually need it.  Only actually working on Xen of course doesn't
> help.

There's a working POC of it on KVM, mostly based on reusing in-kernel
Xen code.

I felt it would be difficult to try and merge any tmem KVM patches
until both frontswap and cleancache are in the kernel; that's why the
development is currently paused at the POC level.

--
Sasha.
Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
From: Cyclonus J
Date: 2011-10-28 7:30 UTC
To: Sasha Levin
Cc: Christoph Hellwig, Dan Magenheimer, David Rientjes, Linus Torvalds,
    linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge,
    Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen,
    Jonathan Corbet

On Fri, Oct 28, 2011 at 12:12 AM, Sasha Levin <levinsasha928@gmail.com> wrote:
> There's a working POC of it on KVM, mostly based on reusing in-kernel
> Xen code.
>
> I felt it would be difficult to try and merge any tmem KVM patches
> until both frontswap and cleancache are in the kernel; that's why the
> development is currently paused at the POC level.

Same here.  I am working on KVM support for Transcendent Memory as
well.  It would be nice to see this in the mainline.

Thanks,
CJ
Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
From: Pekka Enberg
Date: 2011-10-28 14:26 UTC
To: Cyclonus J
Cc: Sasha Levin, Christoph Hellwig, Dan Magenheimer, David Rientjes,
    Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk,
    Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich,
    Dave Hansen, Jonathan Corbet

On Fri, Oct 28, 2011 at 10:30 AM, Cyclonus J <cyclonusj@gmail.com> wrote:
>> I felt it would be difficult to try and merge any tmem KVM patches
>> until both frontswap and cleancache are in the kernel; that's why the
>> development is currently paused at the POC level.
>
> Same here.  I am working on KVM support for Transcendent Memory as
> well.  It would be nice to see this in the mainline.

We don't really merge code for future projects - especially when it
touches the core kernel.

As for the frontswap patches, there are pretty much no ACKs from MM
people, apart from one Reviewed-by from Andrew.  I really don't see why
the pull request is sent directly to Linus...

Pekka
Subject: RE: [GIT PULL] mm: frontswap (for 3.2 window)
From: Dan Magenheimer
Date: 2011-10-28 15:21 UTC
To: Pekka Enberg, Cyclonus J
Cc: Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds,
    linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge,
    Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen,
    Jonathan Corbet

> From: Pekka Enberg [mailto:penberg@kernel.org]
>
> On Fri, Oct 28, 2011 at 10:30 AM, Cyclonus J <cyclonusj@gmail.com> wrote:
> > Same here.  I am working on KVM support for Transcendent Memory as
> > well.  It would be nice to see this in the mainline.
>
> We don't really merge code for future projects - especially when it
> touches the core kernel.

Hi Pekka --

If you grep the 3.1 source for CONFIG_FRONTSWAP, you will find two
users already in-kernel waiting for frontswap to be merged.  I think
Sasha and Neo (and Brian and Nitin and ...) are simply indicating that
there can be more, but there is a chicken-and-egg problem that can best
be resolved by merging the (really very small and barely invasive)
frontswap patchset.

> As for the frontswap patches, there are pretty much no ACKs from MM
> people, apart from one Reviewed-by from Andrew.  I really don't see why
> the pull request is sent directly to Linus...

Has there not been ample opportunity (in 2-1/2 years) for other MM
people to contribute?  I'm certainly not trying to subvert any useful
technical discussion, and if there is some documented MM process I am
failing to follow, please point me to it.  But there are real users and
real distros and real products waiting, so if there are any real
issues, let's get them resolved.

Thanks,
Dan

P.S. Before commenting further, I suggest that you read the background
material at http://lwn.net/Articles/454795/ (with an open mind :-).
Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
From: Pekka Enberg
Date: 2011-10-28 15:36 UTC
To: Dan Magenheimer
Cc: Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes,
    Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk,
    Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich,
    Dave Hansen, Jonathan Corbet

Hi Dan,

On Fri, Oct 28, 2011 at 6:21 PM, Dan Magenheimer
<dan.magenheimer@oracle.com> wrote:
> If you grep the 3.1 source for CONFIG_FRONTSWAP, you will find two
> users already in-kernel waiting for frontswap to be merged.  I think
> Sasha and Neo (and Brian and Nitin and ...) are simply indicating that
> there can be more, but there is a chicken-and-egg problem that can best
> be resolved by merging the (really very small and barely invasive)
> frontswap patchset.

Yup, I was referring to the two external projects.  I also happen to
think that only Xen matters, because zcache is in staging.  So that's
one user in the tree.

On Fri, Oct 28, 2011 at 6:21 PM, Dan Magenheimer
<dan.magenheimer@oracle.com> wrote:
> Has there not been ample opportunity (in 2-1/2 years) for other MM
> people to contribute?  I'm certainly not trying to subvert any useful
> technical discussion, and if there is some documented MM process I am
> failing to follow, please point me to it.  But there are real users and
> real distros and real products waiting, so if there are any real
> issues, let's get them resolved.

You are changing core kernel code without ACKs from the relevant
maintainers.  That's very unfortunate.  Existing users certainly
matter, but that doesn't mean you get to merge code without maintainers
even looking at it.

Looking at your patches, there's no trace that anyone outside your own
development team even looked at them.  Why do you feel that it's OK to
ask Linus to pull them?

> P.S. Before commenting further, I suggest that you read the background
> material at http://lwn.net/Articles/454795/ (with an open mind :-).

I'm not for or against frontswap.  I assume we need something like it,
since the Xen and KVM folks are interested.  That doesn't mean you get
a free pass to add more complexity to the VM.

So really, why don't you just use scripts/get_maintainer.pl and simply
ask the relevant people for their ACKs?

Pekka
* Re: [GIT PULL] mm: frontswap (for 3.2 window) 2011-10-28 15:36 ` Pekka Enberg @ 2011-10-28 16:30 ` Johannes Weiner -1 siblings, 0 replies; 175+ messages in thread From: Johannes Weiner @ 2011-10-28 16:30 UTC (permalink / raw) To: Pekka Enberg Cc: Dan Magenheimer, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet On Fri, Oct 28, 2011 at 06:36:03PM +0300, Pekka Enberg wrote: > On Fri, Oct 28, 2011 at 6:21 PM, Dan Magenheimer > <dan.magenheimer@oracle.com> wrote: > >> As for the frontswap patches, there are pretty much no ACKs from MM people > >> apart from one Reviewed-by from Andrew. I really don't see why the > >> pull request is sent directly to Linus... > > > > Has there not been ample opportunity (in 2-1/2 years) for other > > MM people to contribute? I'm certainly not trying to subvert any > > useful technical discussion and if there is some documented MM process > > I am failing to follow, please point me to it. But there are > > real users and real distros and real products waiting, so if there > > are any real issues, let's get them resolved. > > You are changing core kernel code without ACKs from relevant > maintainers. That's very unfortunate. Existing users certainly matter > but that doesn't mean you get to merge code without maintainers even > looking at it. > > Looking at your patches, there's no trace that anyone outside your own > development team even looked at the patches. Why do you feel that it's > OK to ask Linus to pull them? People did look at it. In my case, the handwavy benefits did not convince me. The handwavy 'this is useful' from just more people of the same company does not help, either. I want to see a usecase that tangibly gains from this, not just more marketing material. Then we can talk about boring infrastructure and adding hooks to the VM. Convincing the development community of the problem you are trying to solve is the undocumented part of the process you fail to follow. ^ permalink raw reply [flat|nested] 175+ messages in thread
* Re: [GIT PULL] mm: frontswap (for 3.2 window) 2011-10-28 16:30 ` Johannes Weiner @ 2011-10-28 17:01 ` Pekka Enberg -1 siblings, 0 replies; 175+ messages in thread From: Pekka Enberg @ 2011-10-28 17:01 UTC (permalink / raw) To: Johannes Weiner Cc: Dan Magenheimer, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet On Fri, Oct 28, 2011 at 7:30 PM, Johannes Weiner <jweiner@redhat.com> wrote: > People did look at it. > > In my case, the handwavy benefits did not convince me. The handwavy > 'this is useful' from just more people of the same company does not > help, either. > > I want to see a usecase that tangibly gains from this, not just more > marketing material. Then we can talk about boring infrastructure and > adding hooks to the VM. > > Convincing the development community of the problem you are trying to > solve is the undocumented part of the process you fail to follow. Indeed. I also don't understand why this is useful, nor am I convinced enough to actually try to figure out how to do the swapfile hooks cleanly. Pekka ^ permalink raw reply [flat|nested] 175+ messages in thread
* RE: [GIT PULL] mm: frontswap (for 3.2 window) 2011-10-28 16:30 ` Johannes Weiner @ 2011-10-28 17:07 ` Dan Magenheimer -1 siblings, 0 replies; 175+ messages in thread From: Dan Magenheimer @ 2011-10-28 17:07 UTC (permalink / raw) To: Johannes Weiner, Pekka Enberg Cc: Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet > From: Johannes Weiner [mailto:jweiner@redhat.com] > Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window) > > On Fri, Oct 28, 2011 at 06:36:03PM +0300, Pekka Enberg wrote: > > On Fri, Oct 28, 2011 at 6:21 PM, Dan Magenheimer > > <dan.magenheimer@oracle.com> wrote: > > Looking at your patches, there's no trace that anyone outside your own > > development team even looked at the patches. Why do you feel that it's > > OK to ask Linus to pull them? > > People did look at it. > > In my case, the handwavy benefits did not convince me. The handwavy > 'this is useful' from just more people of the same company does not > help, either. > > I want to see a usecase that tangibly gains from this, not just more > marketing material. Then we can talk about boring infrastructure and > adding hooks to the VM. > > Convincing the development community of the problem you are trying to > solve is the undocumented part of the process you fail to follow. Hi Johannes -- First, there are several companies and several unaffiliated kernel developers contributing here, building on top of frontswap. I happen to be spearheading it, and my company is backing me up. (It might be more appropriate to note that much of the resistance comes from people of your company... but please let's keep our open-source developer hats on and have a technical discussion rather than one which pleases our respective corporate overlords.) Second, have you read http://lwn.net/Articles/454795/ ? If not, please do. If yes, please explain what you don't see as convincing or tangible or documented. All of this exists today as working publicly available code... it's not marketing material. Dan ^ permalink raw reply [flat|nested] 175+ messages in thread
* RE: [GIT PULL] mm: frontswap (for 3.2 window) 2011-10-28 17:07 ` Dan Magenheimer @ 2011-10-28 18:28 ` John Stoffel -1 siblings, 0 replies; 175+ messages in thread From: John Stoffel @ 2011-10-28 18:28 UTC (permalink / raw) To: Dan Magenheimer Cc: Johannes Weiner, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet >>>>> "Dan" == Dan Magenheimer <dan.magenheimer@oracle.com> writes: Dan> Second, have you read http://lwn.net/Articles/454795/ ? Dan> If not, please do. If yes, please explain what you don't Dan> see as convincing or tangible or documented. All of this Dan> exists today as working publicly available code... it's Dan> not marketing material. I was vaguely interested, so I went and read the LWN article, and it didn't really provide any useful information on *why* this is such a good idea. Particularly, I didn't see any before/after numbers which compared the kernel running various loads both with and without these transcendent memory patches applied. And of course I'd like to see numbers when the patches are applied, but there's no TM (Transcendent Memory) in actual use, so as to quantify the overhead. Your article would also be helped with a couple of diagrams showing how this really helps. Esp in the cases where the system just endlessly says "no" to all TM requests and the kernel or apps need to then fall back to the regular paths. In my case, $WORK is using Linux with large memory to run EDA simulations, so if we swap, performance tanks and we're out of luck. So for my needs, I don't see how this helps. For my home system, I run an 8GB RAM box with a couple of KVM VMs, NFS file service to two or three clients (not counting the VMs which mount home dirs from there as well) as well as some light WWW development and service. How would TM benefit me? I don't use Xen, don't want to play with it honestly because I'm busy enough as it is, and I just don't see the hard benefits. So the onus falls on *you* and the other TM developers to sell this code and its benefits (and to acknowledge its costs) to the rest of the Kernel developers, esp those who hack on the VM. If you can't come up with hard numbers and good examples with good numbers, then you're out of luck. Thanks, John ^ permalink raw reply [flat|nested] 175+ messages in thread
* RE: [GIT PULL] mm: frontswap (for 3.2 window) 2011-10-28 18:28 ` John Stoffel @ 2011-10-28 20:19 ` Dan Magenheimer -1 siblings, 0 replies; 175+ messages in thread From: Dan Magenheimer @ 2011-10-28 20:19 UTC (permalink / raw) To: John Stoffel Cc: Johannes Weiner, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet > From: John Stoffel [mailto:john@stoffel.org] > Subject: RE: [GIT PULL] mm: frontswap (for 3.2 window) > > >>>>> "Dan" == Dan Magenheimer <dan.magenheimer@oracle.com> writes: > > Dan> Second, have you read http://lwn.net/Articles/454795/ ? > Dan> If not, please do. If yes, please explain what you don't > Dan> see as convincing or tangible or documented. All of this > Dan> exists today as working publicly available code... it's > Dan> not marketing material. > > I was vaguely interested, so I went and read the LWN article, and it > didn't really provide any useful information on *why* this is such a > good idea. Hi John -- Thanks for taking the time to read the LWN article and sending some feedback. I admit that, after being immersed in the topic for three years, it's difficult to see it from the perspective of a new reader, so I apologize if I may have left out important stuff. I hope you'll take the time to read this long reply. "WHY" this is such a good idea is the same as WHY it is useful to add RAM to your systems. Tmem expands the amount of useful "space" available to a memory-constrained kernel either via compression (transparent to the rest of the kernel except for the handful of hooks for cleancache and frontswap, using zcache) or via memory that was otherwise not visible to the kernel (hypervisor memory from Xen or KVM, or physical RAM on another clustered system using RAMster). Since a kernel always eats memory until it runs out (and then does its best to balance that maximum fixed amount), this is actually much harder than it sounds. So I'm asking: Is that not clear from the LWN article? Or do you not believe that more "space" is a good idea? Or do you not believe that tmem mitigates that problem? Clearly if you always cram enough RAM into your system so that you never have a paging/swapping problem (i.e. your RAM is always greater than your "working set"), tmem's NOT a good idea. So the built-in assumption is that RAM is a constrained resource. Increasingly (especially in virtual machines, but elsewhere as well), this is true. > Particularly, I didn't see any before/after numbers which compared the > kernel running various loads both with and without these > transcendent memory patches applied. And of course I'd like to see > numbers when the patches are applied, but there's no TM > (Transcendent Memory) in actual use, so as to quantify the overhead. Actually there is. But the only serious performance analysis has been on Xen, and I get reamed every time I use that word, so I'm a bit gun-shy. If you are seriously interested and willing to ignore that X-word, see the last few slides of: http://oss.oracle.com/projects/tmem/dist/documentation/presentations/TranscendentMemoryXenSummit2010.pdf There's some argument about whether the value will be as high for KVM, but that obviously can't be measured until there is a complete KVM implementation, which requires frontswap. It would be nice to also have some numbers for zcache, I agree.
> Your article would also be helped with a couple of diagrams showing > how this really helps. Esp in the cases where the system just > endlessly says "no" to all TM requests and the kernel or apps need to > then fall back to the regular paths. The "no" cases occur whenever there is NO additional memory, so obviously it doesn't help for those cases; the appropriate question for those cases is "how much does it hurt" and the answer is (usually) effectively zero. Again if you know you've always got enough RAM to exceed your working set, don't enable tmem/frontswap/cleancache. For the "does really help" cases, I apologize, but I just can't think how to diagrammatically show clearly that having more RAM is a good thing. > In my case, $WORK is using Linux with large memory to run EDA > simulations, so if we swap, performance tanks and we're out of luck. > So for my needs, I don't see how this helps. Do you know what percent of your total system cost is spent on RAM, including variable expense such as power/cooling? Is reducing that cost relevant to your $WORK? Or have you ever run into a "buy more RAM" situation where you couldn't expand because your machine RAM slots were maxed out? > For my home system, I run an 8GB RAM box with a couple of KVM VMs, NFS > file service to two or three clients (not counting the VMs which mount > home dirs from there as well) as well as some light WWW development > and service. How would TM benefit me? I don't use Xen, don't want to > play with it honestly because I'm busy enough as it is, and I just > don't see the hard benefits. (I use "tmem" since TM means "trademark" to many people.) Does 8GB always cover the sum of the working sets of all your KVM VMs? If so, tmem won't help. If a VM in your workload sometimes spikes, tmem allows that spike to be statistically "load balanced" across RAM claimed by other VMs which may be idle or have a temporarily lower working set. This means less paging/swapping and better sum-over-all-VMs performance. > So the onus falls on *you* and the other TM developers to sell this > code and its benefits (and to acknowledge its costs) to the rest of > the Kernel developers, esp those who hack on the VM. If you can't > come up with hard numbers and good examples with good numbers, then Clearly there's a bit of a chicken-and-egg problem. Frontswap (and cleancache) are the foundation, and it's hard to build anything solid without a foundation. For those who "hack on the VM", I can't imagine why the handful of lines in the swap subsystem, which is probably the most stable and barely touched subsystem in Linux or any OS on the planet, is going to be a burden or much of a cost. > you're out of luck. Another way of looking at it is that the open source community is out of luck. Tmem IS going into real shipping distros, but it (and Xen support and zcache and KVM support and cool things like RAMster) probably won't be in the distro "you" care about because this handful of nearly innocuous frontswap hooks didn't get merged. I'm trying to be a good kernel citizen but I can't make people listen who don't want to. Frontswap is the last missing piece. Why so much resistance? Thanks, Dan ^ permalink raw reply [flat|nested] 175+ messages in thread
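To make the hook placement being argued about here concrete, a minimal sketch of the swap-out side follows. Every name prefixed "sketch_" is hypothetical, invented for illustration; this is not the actual frontswap API, though the real patchset places hooks of this shape in mm/page_io.c and mm/swapfile.c:

    #include <linux/mm.h>
    #include <linux/pagemap.h>      /* unlock_page(), end_page_writeback() */
    #include <linux/writeback.h>    /* struct writeback_control */

    /* Hypothetical backend call: returns 0 if a tmem backend took the page. */
    int sketch_frontswap_store(struct page *page);
    /* Hypothetical stand-in for the ordinary bio-based swap writeout. */
    int sketch_swap_writepage_to_disk(struct page *page,
                                      struct writeback_control *wbc);

    static int sketch_swap_writepage(struct page *page,
                                     struct writeback_control *wbc)
    {
            if (sketch_frontswap_store(page) == 0) {
                    /* Backend kept a copy: complete the "write" with no disk I/O. */
                    set_page_writeback(page);
                    unlock_page(page);
                    end_page_writeback(page);
                    return 0;
            }
            /* Backend declined (full, or none registered): the normal
             * writeout to the swap device proceeds unchanged. */
            return sketch_swap_writepage_to_disk(page, wbc);
    }

The success branch is what lets a hit replace the cost of a disk swap: the page is captured synchronously in RAM (compressed by zcache, or placed in hypervisor or remote memory) and the swap device is never touched; the failure branch is the unmodified swap path.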
* RE: [GIT PULL] mm: frontswap (for 3.2 window) 2011-10-28 20:19 ` Dan Magenheimer @ 2011-10-28 20:52 ` John Stoffel -1 siblings, 0 replies; 175+ messages in thread From: John Stoffel @ 2011-10-28 20:52 UTC (permalink / raw) To: Dan Magenheimer Cc: John Stoffel, Johannes Weiner, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet >>>>> "Dan" == Dan Magenheimer <dan.magenheimer@oracle.com> writes: >> From: John Stoffel [mailto:john@stoffel.org] >> Subject: RE: [GIT PULL] mm: frontswap (for 3.2 window) >> >> >>>>> "Dan" == Dan Magenheimer <dan.magenheimer@oracle.com> writes: >> Dan> Second, have you read http://lwn.net/Articles/454795/ ? Dan> If not, please do. If yes, please explain what you don't Dan> see as convincing or tangible or documented. All of this Dan> exists today as working publicly available code... it's Dan> not marketing material. >> >> I was vaguely interested, so I went and read the LWN article, and it >> didn't really provide any useful information on *why* this is such a >> good idea. Dan> Thanks for taking the time to read the LWN article and sending Dan> some feedback. I admit that, after being immersed in the topic Dan> for three years, it's difficult to see it from the perspective of Dan> a new reader, so I apologize if I may have left out important Dan> stuff. I hope you'll take the time to read this long reply. Will do. But I'm not the person you need to convince here about the usefulness of this code and approach, it's the core VM developers, since they're the ones who will have to understand this stuff and know how to maintain it. And keeping this maintainable is a key goal. Dan> "WHY" this is such a good idea is the same as WHY it is useful to Dan> add RAM to your systems. So why would I use this instead of increasing the physical RAM? Yes, it's an easier thing to do by just installing a new kernel and flipping on the switch, but give me numbers showing an improvement. Dan> Tmem expands the amount of useful "space" available to a Dan> memory-constrained kernel either via compression (transparent to Dan> the rest of the kernel except for the handful of hooks for Dan> cleancache and frontswap, using zcache) Ok, so why not just a targeted swap compression function instead? Why is your method superior? Dan> or via memory that was otherwise not visible to the kernel Dan> (hypervisor memory from Xen or KVM, or physical RAM on another Dan> clustered system using RAMster). This needs more explaining, because I'm not sure I get your assumptions here. For example, from reading your LWN article, I see that one idea of RAMster is to use another system's memory if you run low. Ideally when hooked up via something like Myrinet or some other high-speed/low-latency connection. And you do say it works over plain ethernet. Great, show me the numbers! Show me the speedup of the application(s) you've been testing. Dan> Since a kernel always eats memory until it runs out (and then Dan> does its best to balance that maximum fixed amount), this is Dan> actually much harder than it sounds. Yes, it is. I've been running into this issue myself on RHEL5.5 VNC servers which are loaded down with lots of user sessions. If someone kicks off a cp of a large multi-gig file on an NFS mount point, the box slams to a halt. This is the kind of thing I think you need to address and make sure you don't slow down.
Dan> So I'm asking: Is that not clear from the LWN article? Or Dan> do you not believe that more "space" is a good idea? Or Dan> do you not believe that tmem mitigates that problem? The article doesn't give me a good diagram showing the memory layouts and how you optimize/compress/share memory. And it also doesn't compare performance to just increasing physical memory instead of your approach. Dan> Clearly if you always cram enough RAM into your system so that Dan> you never have a paging/swapping problem (i.e. your RAM is always Dan> greater than your "working set"), tmem's NOT a good idea. This is a statement that you should be making right up front. And explaining why this is still a good idea to implement. I can see that if I've got a large system which cannot physically use any more memory, then it might be worth my while to use TMEM to get more performance out of this expensive hardware. But if I've got the room, why is your method better than just adding RAM? Dan> So the built-in assumption is that RAM is a constrained resource. Dan> Increasingly (especially in virtual machines, but elsewhere as Dan> well), this is true. Here's another place where you didn't explain yourself well, and where a diagram would help. If you have a VM server with 16GB of RAM, does TMEM allow you to run more guests (each of which takes 2GB of RAM, say) versus before? And what's the performance gain/loss/tradeoff? >> Particularly, I didn't see any before/after numbers which compared the >> kernel running various loads both with and without these >> transcendent memory patches applied. And of course I'd like to see >> numbers when the patches are applied, but there's no TM >> (Transcendent Memory) in actual use, so as to quantify the overhead. Dan> Actually there is. But the only serious performance analysis has Dan> been on Xen, and I get reamed every time I use that word, so I'm Dan> a bit gun-shy. If you are seriously interested and willing to Dan> ignore that X-word, see the last few slides of: I'm not that interested in Xen myself for various reasons, mostly because it's not something I use at $WORK, and it's not something I've spent any time playing with at $HOME in my free time. Dan> http://oss.oracle.com/projects/tmem/dist/documentation/presentations/TranscendentMemoryXenSummit2010.pdf Dan> There's some argument about whether the value will be as Dan> high for KVM, but that obviously can't be measured until Dan> there is a complete KVM implementation, which requires Dan> frontswap. Dan> It would be nice to also have some numbers for zcache, I agree. It's not nice, it's REQUIRED. If you can't show numbers which give an improvement, then why would it be accepted?
>> In my case, $WORK is using Linux with large memory to run EDA >> simulations, so if we swap, performance tanks and we're out of luck. >> So for my needs, I don't see how this helps. Dan> Do you know what percent of your total system cost is spent on Dan> RAM, including variable expense such as power/cooling? Nope, can't quantify it unfortunately. Dan> Is reducing that cost relevant to your $WORK? Or have you ever Dan> run into a "buy more RAM" situation where you couldn't expand Dan> because your machine RAM slots were maxed out? Generally, my engineers can and will take all the RAM they can, since EDA simulations almost always work better with more RAM, esp as the designs grow in size. But it's also not a hard and fast rule. If a 144GB box with dual CPUs and 4 cores each costs me $20k or so, then the power/cooling costs aren't as big a concern, because my engineers' *time* is where the real cost comes from. And my customers' turnaround time to get a design done is another big $$$ center. The hardware is cheap. Have you priced EDA licenses from Cadence, Synopsys, or other vendors? But that's beside the point. How much overhead does TMEM incur when it's not being used, but when it's available? >> For my home system, I run an 8GB RAM box with a couple of KVM VMs, NFS >> file service to two or three clients (not counting the VMs which mount >> home dirs from there as well) as well as some light WWW development >> and service. How would TM benefit me? I don't use Xen, don't want to >> play with it honestly because I'm busy enough as it is, and I just >> don't see the hard benefits. Dan> (I use "tmem" since TM means "trademark" to many people.) Yeah, I like your phrase better too, I just got tired of typing the full thing. Dan> Does 8GB always cover the sum of the working sets of all your KVM Dan> VMs? If so, tmem won't help. If a VM in your workload sometimes Dan> spikes, tmem allows that spike to be statistically "load Dan> balanced" across RAM claimed by other VMs which may be idle or Dan> have a temporarily lower working set. This means less Dan> paging/swapping and better sum-over-all-VMs performance. So this is a good thing to show and get hard numbers on. >> So the onus falls on *you* and the other TM developers to sell this >> code and its benefits (and to acknowledge its costs) to the rest of >> the Kernel developers, esp those who hack on the VM. If you can't >> come up with hard numbers and good examples with good numbers, then Dan> Clearly there's a bit of a chicken-and-egg problem. Frontswap Dan> (and cleancache) are the foundation, and it's hard to build Dan> anything solid without a foundation. No one is stopping you from building your own house using the Linux foundation, showing that it's a great house and then allowing you to come and re-work the foundations and walls, etc. to build the better house. Dan> For those who "hack on the VM", I can't imagine why the handful Dan> of lines in the swap subsystem, which is probably the most stable Dan> and barely touched subsystem in Linux or any OS on the planet, Dan> is going to be a burden or much of a cost. It's the performance and cleanliness aspects that people worry about. >> you're out of luck. Dan> Another way of looking at it is that the open source community is Dan> out of luck.
Dan> Tmem IS going into real shipping distros, but it Dan> (and Xen support and zcache and KVM support and cool things like Dan> RAMster) probably won't be in the distro "you" care about because Dan> this handful of nearly innocuous frontswap hooks didn't get Dan> merged. I'm trying to be a good kernel citizen but I can't make Dan> people listen who don't want to. No real skin off my nose, because I haven't seen a compelling reason to use TMEM. And if I do run a large Oracle system, with lots of DBs and table spaces, I don't see how TMEM helps me either, because the hardware is such a small part of the cost of a large Oracle deployment. Adding RAM is cheap. TMEM... well it could be useful in an emergency, but unless it's stressed and used a lot, it could end up causing more problems than it solves. Dan> Frontswap is the last missing piece. Why so much resistance? Because you haven't sold it well with numbers to show how much overhead it has? I'm being negative because I see no reason to use it. And because I think you can do a better job of selling it and showing the benefits with real numbers. Load up a XEN box, have a VM spike its memory usage and show how TMEM helps. Compare it to a non-TMEM setup with the same load. John ^ permalink raw reply [flat|nested] 175+ messages in thread
* RE: [GIT PULL] mm: frontswap (for 3.2 window) 2011-10-28 20:52 ` John Stoffel @ 2011-10-30 19:18 ` Dan Magenheimer -1 siblings, 0 replies; 175+ messages in thread From: Dan Magenheimer @ 2011-10-30 19:18 UTC (permalink / raw) To: John Stoffel Cc: Johannes Weiner, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet > From: John Stoffel [mailto:john@stoffel.org] > Dan> Thanks for taking the time to read the LWN article and sending > Dan> some feedback. I admit that, after being immersed in the topic > Dan> for three years, it's difficult to see it from the perspective of > Dan> a new reader, so I apologize if I may have left out important > Dan> stuff. I hope you'll take the time to read this long reply. > > Will do. But I'm not the person you need to convince here about the > usefulness of this code and approach, it's the core VM developers, True, but you are the one providing useful suggestions while the core VM developers are mostly silent (except for saying things like "don't like it much"). So thank you for your feedback and for taking the time to provide it and for indulging my replies. I/we will need to act on your suggestions, but I need to answer a couple of points/questions you've raised. > since they're the ones who will have to understand this stuff and know > how to maintain it. And keeping this maintainable is a key goal. Absolutely agree. Count the number of frontswap lines that affect the current VM core code and note also how they are very clearly identified. It really is a very VERY small impact to the core VM code (e.g. in the files swapfile.c and page_io.c). (And it's worth noting, and I'm not arguing that it is conclusive, just relevant, that my company has stood up and claimed responsibility to maintain it.) > Ok, so why not just a targeted swap compression function instead? > Why is your method superior? The designer/implementor of zram (which is the closest thing to "targeted swap compression" in the kernel today) has stated elsewhere on this thread that frontswap has advantages over his own zram code. And the frontswap patchset (did I mention how small the impact is?) provides a lot more than just a foundation for compression (zcache). > But that's beside the point. How much overhead does TMEM incur when > it's not being used, but when it's available? This is answered in frontswap.txt in the patchset, but: ZERO overhead if CONFIG_FRONTSWAP=n. All the hooks compile into no-ops. If CONFIG_FRONTSWAP=y and no "tmem backend" registers to use it at runtime, the overhead is one "compare pointer against NULL" for every page actually swapped in or out, which is about as close to ZERO overhead as any code can be. If CONFIG_FRONTSWAP=y AND a "tmem backend" does register, the answer depends on which tmem backend and what it is doing (and yes I agree more numbers are needed), but the overhead is incurred only in the case where a page would otherwise have actually been swapped in or out and can replace the horrible cost of swapping pages. > Dan> Frontswap is the last missing piece. Why so much resistance? > > Because you haven't sold it well with numbers to show how much > overhead it has? > > I'm being negative because I see no reason to use it. And because I > think you can do a better job of selling it and showing the benefits > with real numbers.
In your environment where RAM is essentially infinite, and swapping never occurs, I agree there would be no reason for you to enable it. In which case there is no overhead to you. Received loud and clear on the "need more real numbers", though personally I don't have any machines with more than 4GB RAM so I won't personally be testing any EDA environments with 144GB :-} So, in the context of "costs nothing if you don't need it and has very VERY small core code impact", and given that various kernel developers and real users and real distros and real products say on this thread that they DO need it, and given that there are "some" real numbers (for one user, Xen, and agree that some are needed for zcache)... and assuming that the core VM developers bother to read the documentation already provided that addresses the above, let me ask again... Why so much resistance? Thanks, Dan Oops, one more (but I have to use the X-word)... > Load up a XEN box, have a VM spike its memory usage and show how TMEM > helps. Compare it to a non-TMEM setup with the same load. Yep, that's what the presentation URL I provided (for Xen) measures. Overcommitment (more VMs than otherwise could fit in the physical RAM) AND about an 8% performance improvement on all VMs doing a kernel compile simultaneously. Pretty impressive. ^ permalink raw reply [flat|nested] 175+ messages in thread
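To make the overhead claims above concrete, here is a minimal sketch of the hook pattern Dan describes ("compiles into no-ops" when configured out, one NULL compare when configured in but no backend is registered). All identifiers are illustrative stand-ins, not the actual symbols from the frontswap patchset:

    #ifdef CONFIG_FRONTSWAP
    /* NULL until a tmem backend (e.g. zcache or Xen tmem) registers at runtime */
    extern struct frontswap_ops *frontswap_ops;

    static inline int frontswap_put_page(struct page *page)
    {
            if (frontswap_ops == NULL)      /* the entire cost when no backend registers */
                    return -1;              /* caller falls through to normal swap I/O */
            return frontswap_ops->put_page(page);
    }
    #else
    static inline int frontswap_put_page(struct page *page)
    {
            return -1;      /* CONFIG_FRONTSWAP=n: the compiler discards the hook */
    }
    #endif

    /* Hypothetical call site in the swap-out path (cf. page_io.c); note that
     * only pages that would actually be swapped ever reach this test.
     */
    if (frontswap_put_page(page) == 0)
            return 0;       /* a backend captured the page; no disk I/O needed */
    /* ...otherwise continue to the normal block I/O path... */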
* RE: [GIT PULL] mm: frontswap (for 3.2 window) 2011-10-30 19:18 ` Dan Magenheimer @ 2011-10-30 20:06 ` Dave Hansen -1 siblings, 0 replies; 175+ messages in thread From: Dave Hansen @ 2011-10-30 20:06 UTC (permalink / raw) To: Dan Magenheimer Cc: John Stoffel, Johannes Weiner, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Jonathan Corbet On Sun, 2011-10-30 at 12:18 -0700, Dan Magenheimer wrote: > > since they're the ones who will have to understand this stuff and know > > how to maintain it. And keeping this maintainable is a key goal. > > Absolutely agree. Count the number of frontswap lines that affect > the current VM core code and note also how they are very clearly > identified. It really is a very VERY small impact to the core VM > code (e.g. in the files swapfile.c and page_io.c). Granted, the impact on the core VM in lines of code is small. But, I think the behavioral impact is potentially huge since tmem's hooks add non-trivial amounts of framework underneath the VM in core paths. In zcache's case, this means a bunch of allocations and an entirely new memory allocator being used in the swap paths. We're certainly still shaking bugs out of the interactions there like with zcache_direct_reclaim_lock. Granted, that's not a tmem/frontswap/cleancache bug, but it does speak to the difficulty and subtlety of writing one of those frameworks underneath the tmem API. -- Dave ^ permalink raw reply [flat|nested] 175+ messages in thread
* RE: [GIT PULL] mm: frontswap (for 3.2 window) 2011-10-30 20:06 ` Dave Hansen @ 2011-10-30 21:50 ` Dan Magenheimer -1 siblings, 0 replies; 175+ messages in thread From: Dan Magenheimer @ 2011-10-30 21:50 UTC (permalink / raw) To: Dave Hansen Cc: John Stoffel, Johannes Weiner, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Jonathan Corbet > From: Dave Hansen [mailto:dave@linux.vnet.ibm.com] > Subject: RE: [GIT PULL] mm: frontswap (for 3.2 window) Thanks Dave (I think ;-) for chiming in. > On Sun, 2011-10-30 at 12:18 -0700, Dan Magenheimer wrote: > > > since they're the ones who will have to understand this stuff and know > > > how to maintain it. And keeping this maintainable is a key goal. > > > > Absolutely agree. Count the number of frontswap lines that affect > > the current VM core code and note also how they are very clearly > > identified. It really is a very VERY small impact to the core VM > > code (e.g. in the files swapfile.c and page_io.c). > > Granted, the impact on the core VM in lines of code is small. But, I > think the behavioral impact is potentially huge since tmem's hooks add > non-trivial amounts of framework underneath the VM in core paths. In > zcache's case, this means a bunch of allocations and an entirely new > memory allocator being used in the swap paths. True BUT (and this is a big BUT) it ONLY affects the core VM path if both CONFIG_FRONTSWAP=y AND if a "tmem backend" such as zcache registers it. So not only is the code maintenance impact very VERY small (which you granted), but there is no impact on users or distros or products that don't turn it on. I also should repeat that the core VM changes introduced by frontswap have remained essentially identical since first proposed circa 2.6.18... the impacted swap code is NOT frequently-changing code. My point in my "Absolutely agree" above is that the maintenance burden to core VM developers is low. > We're certainly still shaking bugs out of the interactions there like > with zcache_direct_reclaim_lock. Granted, that's not a > tmem/frontswap/cleancache bug, but it does speak to the difficulty and > subtlety of writing one of those frameworks underneath the tmem API. IMHO, that's coming perilously close to saying "we don't accept code that has bugs in it". How many significant pieces of functionality have been added to the kernel EVER where there were NO bugs found in the next few months? How much MERGED functionality (such as new filesystems) has gone into the kernel years before it was broadly deployed? Zcache is currently a staging driver for a reason... I admit it... I wrote zcache in a couple of months (and mostly over the holidays) and it was really the first major Linux kernel driver I'd done. I was surprised as hell when GregKH took it into staging. But it works pretty darn well. Why? Because it is built on the foundation of cleancache and frontswap, which _just work_!! And Seth Jennings (also of IBM for those that don't know) has been doing a great job of finding and fixing bottlenecks, as well as looking at some interesting enhancements. I think he found ONE bug so far... because I hadn't tested on 32-bit highmem machines. Clearly, Seth and IBM see some value in zcache (perhaps, as Ed Tomlinson pointed out, because AIX has a similar capability?)
But let's not forget that there would be no zcache for Seth or IBM to work on if you hadn't already taken the frontswap patchset into your tree. Frontswap is an ENABLER for zcache, as well as for Xen tmem, for RAMster, and (soon, according to two kernel developers) possibly also for KVM. Given the tiny maintenance cost, why not merge it? So if you are saying that frontswap is not quite ready to be merged, fine, I can accept that. But there are now a number of features, developers, distros, and products depending on it, so there are a few of us who would like to hear CONCRETE STEPS we need to achieve to make it ready. (John Stoffel is the only one to suggest any... not counting the documentation he didn't read, the big one is getting some measurements to show zcache is valuable. Hoping Seth can help with that?) Got any suggestions? Thanks, Dan ^ permalink raw reply [flat|nested] 175+ messages in thread
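For readers trying to picture the runtime gating referred to above (a backend must register before the hooks do anything at all), a rough sketch of how a backend might opt in, reusing the same illustrative names as before; the real patchset's registration interface may differ:

    /* Operations a tmem backend supplies; hypothetical shape. */
    struct frontswap_ops {
            int (*put_page)(struct page *page);     /* store a page on swap-out */
            int (*get_page)(struct page *page);     /* retrieve it on swap-in */
    };

    /* Stays NULL unless a backend opts in, so the hooks stay inert. */
    struct frontswap_ops *frontswap_ops;

    void frontswap_register_ops(struct frontswap_ops *ops)
    {
            frontswap_ops = ops;    /* from this point on the hooks are live */
    }

    /* A backend such as zcache would register at its own init time: */
    static struct frontswap_ops zcache_ops = {      /* hypothetical */
            .put_page = zcache_put_page,
            .get_page = zcache_get_page,
    };
    /* ... frontswap_register_ops(&zcache_ops); ... */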
* Re: [GIT PULL] mm: frontswap (for 3.2 window) 2011-10-30 20:06 ` Dave Hansen @ 2011-11-02 19:45 ` Rik van Riel -1 siblings, 0 replies; 175+ messages in thread From: Rik van Riel @ 2011-11-02 19:45 UTC (permalink / raw) To: Dave Hansen Cc: Dan Magenheimer, John Stoffel, Johannes Weiner, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Jonathan Corbet On 10/30/2011 04:06 PM, Dave Hansen wrote: > On Sun, 2011-10-30 at 12:18 -0700, Dan Magenheimer wrote: >>> since they're the ones who will have to understand this stuff and know >>> how to maintain it. And keeping this maintainable is a key goal. >> >> Absolutely agree. Count the number of frontswap lines that affect >> the current VM core code and note also how they are very clearly >> identified. It really is a very VERY small impact to the core VM >> code (e.g. in the files swapfile.c and page_io.c). > > Granted, the impact on the core VM in lines of code is small. But, I > think the behavioral impact is potentially huge since tmem's hooks add > non-trivial amounts of framework underneath the VM in core paths. In > zcache's case, this means a bunch of allocations and an entirely new > memory allocator being used in the swap paths. My only real behaviour concern with tmem is that /proc/sys/vm/overcommit_memory will no longer be able to do anything useful, since we'll never know in advance how much memory is available. That may be outweighed by the benefits of having more memory available than before, and may be a reasonable tradeoff to make for the users. That leaves us with having the code cleaned up to reasonable standards. To be honest, I would rather have larger hooks in the existing mm code than exported variables and having the hooks live elsewhere (where people changing the "normal" mm code won't see it, and are more likely to break it). ^ permalink raw reply [flat|nested] 175+ messages in thread
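To illustrate the two styles Rik contrasts, a purely hypothetical sketch (neither fragment is code from the patchset):

    /* Style A: export mm internals so the hook logic can live outside mm/,
     * where people changing the "normal" mm code won't see it
     * (cf. the new include/linux/swapfile.h in the diffstat).
     */
    extern struct swap_info_struct *swap_info[];

    /* Style B (Rik's preference): a larger hook whose body sits in the mm
     * file itself, in plain sight of anyone editing the swap code.
     */
    /* in mm/swapfile.c */
    static void frontswap_swapoff_hook(unsigned type)
    {
            /* backend teardown written out here, inside mm code, rather
             * than behind an exported-variable interface */
    }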
* RE: [GIT PULL] mm: frontswap (for 3.2 window) 2011-11-02 19:45 ` Rik van Riel @ 2011-11-02 20:45 ` Dan Magenheimer -1 siblings, 0 replies; 175+ messages in thread From: Dan Magenheimer @ 2011-11-02 20:45 UTC (permalink / raw) To: Rik van Riel, Dave Hansen Cc: John Stoffel, Johannes Weiner, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Jonathan Corbet > From: Rik van Riel [mailto:riel@redhat.com] > Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window) > > On 10/30/2011 04:06 PM, Dave Hansen wrote: > > On Sun, 2011-10-30 at 12:18 -0700, Dan Magenheimer wrote: > >>> since they're the ones who will have to understand this stuff and know > >>> how to maintain it. And keeping this maintainable is a key goal. > >> > >> Absolutely agree. Count the number of frontswap lines that affect > >> the current VM core code and note also how they are very clearly > >> identified. It really is a very VERY small impact to the core VM > >> code (e.g. in the files swapfile.c and page_io.c). > > > > Granted, the impact on the core VM in lines of code is small. But, I > > think the behavioral impact is potentially huge since tmem's hooks add > > non-trivial amounts of framework underneath the VM in core paths. In > > zcache's case, this means a bunch of allocations and an entirely new > > memory allocator being used in the swap paths. > > My only real behaviour concern with tmem is that > /proc/sys/vm/overcommit_memory will no longer be able > to do anything useful, since we'll never know in > advance how much memory is available. True, for Case C (as defined in the James Bottomley subthread). For Case A and Case B (ie. no tmem backend enabled), end-users can still rely on that existing mechanism, so they have a choice. > That may be outweighed by the benefits of having > more memory available than before, and may be a reasonable > tradeoff to make for the users. > > That leaves us with having the code cleaned up to > reasonable standards. To be honest, I would rather > have larger hooks in the existing mm code than > exported variables and having the hooks live elsewhere > (where people changing the "normal" mm code won't see > it, and are more likely to break it). Hmmm... the original hooks in 2009 were larger, but there was lots of feedback to hide the ugly details as much as possible. As a side effect, higher level info is passed via the hooks, e.g. a "struct page *" rather than swaptype/entry, so backends have more flexibility (and IIUC it looks like Andrea's proposed changes to zcache may need the higher level info). But if you want to propose some code showing what you mean by "larger" hooks and they result in the same information available in the backends, and if others agree your hooks are more maintainable, I am certainly open to changing them and re-posting. Note that this could happen post-frontswap-merge too, though, which would, naturally, be my preference ;-) Dan ^ permalink raw reply [flat|nested] 175+ messages in thread
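A sketch of the "higher level info" point, again with hypothetical names: the VM-side hook hands over only a struct page *, and a backend that does want the swap coordinates can still recover them, since for a page in the swap cache page_private() holds the swp_entry_t value:

    /* VM-side hook: just a page; no swap-map details leak through. */
    int frontswap_put_page(struct page *page);

    /* Backend side: recover (type, offset) only where actually needed. */
    static void backend_put(struct page *page)      /* hypothetical */
    {
            swp_entry_t entry = { .val = page_private(page) };
            unsigned type = swp_type(entry);        /* which swap device */
            pgoff_t offset = swp_offset(entry);     /* slot on that device */
            /* ...store a compressed/deduplicated copy keyed by (type, offset)... */
    }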
* Re: [GIT PULL] mm: frontswap (for 3.2 window) 2011-10-28 20:52 ` John Stoffel @ 2011-11-06 22:32 ` Valdis.Kletnieks 2011-11-08 12:15 ` Ed Tomlinson -1 siblings, 1 reply; 175+ messages in thread From: Valdis.Kletnieks @ 2011-11-06 22:32 UTC (permalink / raw) To: John Stoffel Cc: Dan Magenheimer, Johannes Weiner, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet On Fri, 28 Oct 2011 16:52:28 EDT, John Stoffel said: > Dan> "WHY" this is such a good idea is the same as WHY it is useful to > Dan> add RAM to your systems. > > So why would I use this instead of increasing the physical RAM? You're welcome to buy me a new laptop that has a third DIMM slot. :) There's a lot of people running hardware that already has the max amount of supported RAM, and who for budget or legacy-support reasons can't easily do a forklift upgrade to a new machine. > if I've got a large system which cannot physically use any more > memory, then it might be worth my while to use TMEM to get more > performance out of this expensive hardware. It's not always a large system.... ^ permalink raw reply [flat|nested] 175+ messages in thread
* Re: [GIT PULL] mm: frontswap (for 3.2 window) 2011-11-06 22:32 ` Valdis.Kletnieks @ 2011-11-08 12:15 ` Ed Tomlinson 0 siblings, 0 replies; 175+ messages in thread From: Ed Tomlinson @ 2011-11-08 12:15 UTC (permalink / raw) To: Valdis.Kletnieks Cc: John Stoffel, Dan Magenheimer, Johannes Weiner, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet On Sunday 06 November 2011 17:32:54 Valdis.Kletnieks@vt.edu wrote: > On Fri, 28 Oct 2011 16:52:28 EDT, John Stoffel said: > > Dan> "WHY" this is such a good idea is the same as WHY it is useful to > > Dan> add RAM to your systems. > > > > So why would I use this instead of increasing the physical RAM? > > You're welcome to buy me a new laptop that has a third DIMM slot. :) > > There's a lot of people running hardware that already has the max amount of > supported RAM, and who for budget or legacy-support reasons can't easily do a > forklift upgrade to a new machine. I've got three boxes with this problem here. Hence my support for frontswap/cleancache. Ed > > if I've got a large system which cannot physically use any more > > memory, then it might be worth my while to use TMEM to get more > > performance out of this expensive hardware. > > It's not always a large system.... ^ permalink raw reply [flat|nested] 175+ messages in thread
* RE: [GIT PULL] mm: frontswap (for 3.2 window) 2011-10-28 20:19 ` Dan Magenheimer @ 2011-10-31 8:12 ` James Bottomley -1 siblings, 0 replies; 175+ messages in thread From: James Bottomley @ 2011-10-31 8:12 UTC (permalink / raw) To: Dan Magenheimer Cc: John Stoffel, Johannes Weiner, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet On Fri, 2011-10-28 at 13:19 -0700, Dan Magenheimer wrote: > For those who "hack on the VM", I can't imagine why the handful > of lines in the swap subsystem, which is probably the most stable > and barely touched subsystem in Linux or any OS on the planet, > is going to be a burden or much of a cost. Saying things like this doesn't encourage anyone to trust you. The whole of the MM is a complex, highly interacting system. The recent issues we've had with kswapd and the shrinker code give a nice demonstration of this ... and that was caused by well tested code updates. You can't hand wave away the need for benchmarks and performance tests. You have also answered all questions about inactive cost by saying "the code has zero cost when it's compiled out." This also is a non-starter. For the few use cases it has, this code has to be compiled in. I suspect even Oracle isn't going to ship separate frontswap and non-frontswap kernels in its distro. So you have to quantify what the performance impact is when this code is compiled in but not used. Please do so. James ^ permalink raw reply [flat|nested] 175+ messages in thread
* RE: [GIT PULL] mm: frontswap (for 3.2 window) 2011-10-31 8:12 ` James Bottomley @ 2011-10-31 15:39 ` Dan Magenheimer -1 siblings, 0 replies; 175+ messages in thread From: Dan Magenheimer @ 2011-10-31 15:39 UTC (permalink / raw) To: James Bottomley Cc: John Stoffel, Johannes Weiner, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet > From: James Bottomley [mailto:James.Bottomley@HansenPartnership.com] > Subject: RE: [GIT PULL] mm: frontswap (for 3.2 window) Hi James -- Thanks for the reply. You raise some good points but I hope you will read what I believe are reasonable though long-winded answers. > On Fri, 2011-10-28 at 13:19 -0700, Dan Magenheimer wrote: > > For those who "hack on the VM", I can't imagine why the handful > > of lines in the swap subsystem, which is probably the most stable > > and barely touched subsystem in Linux or any OS on the planet, > > is going to be a burden or much of a cost. > > Saying things like this doesn't encourage anyone to trust you. The > whole of the MM is a complex, highly interacting system. The recent > issues we've had with kswapd and the shrinker code gives a nice > demonstration of this ... and that was caused by well tested code > updates. I do understand that. My point was that the hooks are placed _statically_ in largely stable code so it's not going to constantly get in the way of VM developers adding new features and fixing bugs, particularly any developers that don't care about whether frontswap works or not. I do think that is a very relevant point about maintenance... do you disagree? Runtime interactions can only occur if the code is config'ed and, if config'ed, only if a tmem backend (e.g. Xen or zcache) enables it also at runtime. When both are enabled, runtime interactions do occur and absolutely must be fully tested. My point was that any _users_ who don't care about whether frontswap works or not don't need to have any concerns about VM system runtime interactions. I think this is also a very relevant point about maintenance... do you disagree? > You can't hand wave away the need for benchmarks and > performance tests. I'm not. Conclusive benchmarks are available for one user (Xen) but not (yet) for other users. I've already acknowledged the feedback desiring benchmarking for zcache, but zcache is already merged (albeit in staging), and Xen tmem is already merged in both Linux and the Xen hypervisor, and cleancache (the alter ego of frontswap) is already merged. So the question is not whether benchmarks are waived, but whether one accepts (1) conclusive benchmarks for Xen; PLUS (2) insufficiently benchmarked zcache; PLUS (3) at least two other interesting-but-not-yet-benchmarkable users; as sufficient for adding this small set of hooks into swap code. I understand that some kernel developers (mostly from one company) continue to completely discount Xen, and thus won't even look at the Xen results. IMHO that is mudslinging. > You have also answered all questions about inactive cost by saying "the > code has zero cost when it's compiled out" This also is a non starter. > For the few use cases it has, this code has to be compiled in. I > suspect even Oracle isn't going to ship separate frontswap and > non-frontswap kernels in its distro. So you have to quantify what the > performance impact is when this code is compiled in but not used. > Please do so. 
First, no, Oracle is not going to ship separate frontswap and non-frontswap kernels. It IS going to ship a frontswap-enabled kernel and this can be seen in Oracle's publicly-available kernel git tree (the next release, now in Beta). Frontswap is compiled in, but still must be enabled at runtime (e.g. for a Xen guest, either manually by the guest's administrator or automagically by the Oracle VM product's management layer). I did fully quantify the performance impact elsewhere in this thread. The performance impact with CONFIG_FRONTSWAP=n (which is ZERO) is relevant for distros which choose to ignore it entirely. The performance impact for CONFIG_FRONTSWAP=y but not-enabled-at-runtime is one compare-pointer-against-NULL per page actually swapped in or out (essentially ZERO); this is relevant for distros which choose to configure it enabled in case they wish to enable it at runtime in the future. So the remaining question is the performance impact when compile-time AND runtime enabled; this is in the published Xen presentation I've referenced -- the impact is much, much less than the performance gain. IMHO benchmark results can be easily manipulated so I prefer to discuss the theoretical underpinnings which, in short, is that just about anything a tmem backend does (hypercall, compression, deduplication, even moving data across a fast network) is a helluva lot faster than swapping a page to disk. Are there corner cases and probably even real workloads where the cost exceeds the benefits? Probably... though less likely for frontswap than for cleancache because ONLY pages that would actually be swapped out/in use frontswap. But I have never suggested that every kernel should always unconditionally compile-time-enable and run-time-enable frontswap... simply that it should be in-tree so those who wish to enable it are able to enable it. Thanks, Dan ^ permalink raw reply [flat|nested] 175+ messages in thread
* RE: [GIT PULL] mm: frontswap (for 3.2 window) 2011-10-31 15:39 ` Dan Magenheimer @ 2011-11-01 10:13 ` James Bottomley -1 siblings, 0 replies; 175+ messages in thread From: James Bottomley @ 2011-11-01 10:13 UTC (permalink / raw) To: Dan Magenheimer Cc: John Stoffel, Johannes Weiner, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet On Mon, 2011-10-31 at 08:39 -0700, Dan Magenheimer wrote: > > From: James Bottomley [mailto:James.Bottomley@HansenPartnership.com] > > Subject: RE: [GIT PULL] mm: frontswap (for 3.2 window) > > Hi James -- > > Thanks for the reply. You raise some good points but > I hope you will read what I believe are reasonable though > long-winded answers. > > > On Fri, 2011-10-28 at 13:19 -0700, Dan Magenheimer wrote: > > > For those who "hack on the VM", I can't imagine why the handful > > > of lines in the swap subsystem, which is probably the most stable > > > and barely touched subsystem in Linux or any OS on the planet, > > > is going to be a burden or much of a cost. > > > > Saying things like this doesn't encourage anyone to trust you. The > > whole of the MM is a complex, highly interacting system. The recent > > issues we've had with kswapd and the shrinker code give a nice > > demonstration of this ... and that was caused by well tested code > > updates. > > I do understand that. My point was that the hooks are > placed _statically_ in largely stable code so it's not > going to constantly get in the way of VM developers > adding new features and fixing bugs, particularly > any developers that don't care about whether frontswap > works or not. I do think that is a very relevant > point about maintenance... do you disagree? Well, as I've said, all the mm code is highly interacting, so I don't really see it as "stable" in the way you suggest. What I'm saying is that you need to test a variety of workloads to demonstrate there aren't any nasty interactions. > Runtime interactions can only occur if the code is > config'ed and, if config'ed, only if a tmem backend (e.g. > Xen or zcache) enables it also at runtime. So this, I don't accept without proof ... that's what we initially said about the last set of shrinker updates that caused kswapd to hang Sandy Bridge systems ... > When > both are enabled, runtime interactions do occur > and absolutely must be fully tested. My point was > that any _users_ who don't care about whether frontswap > works or not don't need to have any concerns about > VM system runtime interactions. I think this is also > a very relevant point about maintenance... do you > disagree? I'm sorry, what point about maintenance? > > You can't hand wave away the need for benchmarks and > > performance tests. > > I'm not. Conclusive benchmarks are available for one user > (Xen) but not (yet) for other users. I've already acknowledged > the feedback desiring benchmarking for zcache, but zcache > is already merged (albeit in staging), and Xen tmem > is already merged in both Linux and the Xen hypervisor, > and cleancache (the alter ego of frontswap) is already > merged. The test results for Xen I've seen are simply that "we're faster than swapping to disk, and we can be even better if you use self ballooning". There's no indication (at least in the Xen Summit presentation) of what the actual workloads were.
> So the question is not whether benchmarks are waived, > but whether one accepts (1) conclusive benchmarks for Xen; > PLUS (2) insufficiently benchmarked zcache; PLUS (3) at > least two other interesting-but-not-yet-benchmarkable users; > as sufficient for adding this small set of hooks into > swap code. That's the point: even for Xen, the benchmarks aren't "conclusive". There may be a workload for which transcendent memory works better, but make -j8 isn't enough of a variety of workloads. > I understand that some kernel developers (mostly from one > company) continue to completely discount Xen, and > thus won't even look at the Xen results. IMHO > that is mudslinging. OK, so let's look at this another way: one of the signs of a good ABI is generic applicability. Any good virtualisation ABI should thus work for all virtualisation systems (including VMware should they choose to take advantage of it). The fact that transcendent memory only seems to work well for Xen is a red flag in this regard. > > You have also answered all questions about inactive cost by saying "the > > code has zero cost when it's compiled out." This also is a non-starter. > > For the few use cases it has, this code has to be compiled in. I > > suspect even Oracle isn't going to ship separate frontswap and > > non-frontswap kernels in its distro. So you have to quantify what the > > performance impact is when this code is compiled in but not used. > > Please do so. > > First, no, Oracle is not going to ship separate frontswap and > non-frontswap kernels. It IS going to ship a frontswap-enabled > kernel and this can be seen in Oracle's publicly-available > kernel git tree (the next release, now in Beta). Frontswap is > compiled in, but still must be enabled at runtime (e.g. for > a Xen guest, either manually by the guest's administrator > or automagically by the Oracle VM product's management layer). > > I did fully quantify the performance impact elsewhere in > this thread. The performance impact with CONFIG_FRONTSWAP=n > (which is ZERO) is relevant for distros which choose to > ignore it entirely. The performance impact for CONFIG_FRONTSWAP=y > but not-enabled-at-runtime is one compare-pointer-against-NULL > per page actually swapped in or out (essentially ZERO); > this is relevant for distros which choose to configure it > enabled in case they wish to enable it at runtime in > the future. So what I don't like about this style of argument is the sleight of hand: I would expect the inactive but configured case to show mostly in the shrinker paths, which is where our major problems have been, so that would be cleancache, not frontswap, wouldn't it? > So the remaining question is the performance impact when > compile-time AND runtime enabled; this is in the published > Xen presentation I've referenced -- the impact is much much > less than the performance gain. IMHO benchmark results can > be easily manipulated so I prefer to discuss the theoretical > underpinnings which, in short, is that just about anything > a tmem backend does (hypercall, compression, deduplication, > even moving data across a fast network) is a helluva lot > faster than swapping a page to disk. > > Are there corner cases and probably even real workloads > where the cost exceeds the benefits? Probably... though > less likely for frontswap than for cleancache because ONLY > pages that would actually be swapped out/in use frontswap.
> But I have never suggested that every kernel should always > unconditionally compile-time-enable and run-time-enable > frontswap... simply that it should be in-tree so those > who wish to enable it are able to enable it. In practice, most useful ABIs end up being compiled in ... and useful basically means useful to any constituency, however small. If your ABI is useless, then fine, we don't have to worry about the configured but inactive case (but then again, we wouldn't have to worry about the ABI at all). If it has a use, then kernels will end up shipping with it configured in, which is why the inactive performance impact is so important to quantify. James ^ permalink raw reply [flat|nested] 175+ messages in thread
* RE: [GIT PULL] mm: frontswap (for 3.2 window) 2011-11-01 10:13 ` James Bottomley @ 2011-11-01 18:10 ` Dan Magenheimer -1 siblings, 0 replies; 175+ messages in thread From: Dan Magenheimer @ 2011-11-01 18:10 UTC (permalink / raw) To: James Bottomley Cc: John Stoffel, Johannes Weiner, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet > From: James Bottomley [mailto:James.Bottomley@HansenPartnership.com] > Subject: RE: [GIT PULL] mm: frontswap (for 3.2 window) > > On Mon, 2011-10-31 at 08:39 -0700, Dan Magenheimer wrote: > > > From: James Bottomley [mailto:James.Bottomley@HansenPartnership.com] > > > Subject: RE: [GIT PULL] mm: frontswap (for 3.2 window) > > > > > On Fri, 2011-10-28 at 13:19 -0700, Dan Magenheimer wrote: > > > > For those who "hack on the VM", I can't imagine why the handful > > > > of lines in the swap subsystem, which is probably the most stable > > > > and barely touched subsystem in Linux or any OS on the planet, > > > > is going to be a burden or much of a cost. > > > > > > Saying things like this doesn't encourage anyone to trust you. The > > > whole of the MM is a complex, highly interacting system. The recent > > > issues we've had with kswapd and the shrinker code gives a nice > > > demonstration of this ... and that was caused by well tested code > > > updates. > > > > I do understand that. My point was that the hooks are > > placed _statically_ in largely stable code so it's not > > going to constantly get in the way of VM developers > > adding new features and fixing bugs, particularly > > any developers that don't care about whether frontswap > > works or not. I do think that is a very relevant > > point about maintenance... do you disagree? > > Well, as I've said, all the mm code is highly interacting, so I don't > really see it as "stable" in the way you suggest. What I'm saying is > that you need to test a variety of workloads to demonstrate there aren't > any nasty interactions. I guess I don't understand how there can be any interactions at all, let alone _nasty_ interactions when there is no code to interact with? For clarity and brevity, let's call the three cases: Case A) CONFIG_FRONTSWAP=n Case B) CONFIG_FRONTSWAP=y and no tmem backend registers Case C) CONFIG_FRONTSWAP=y and a tmem backend DOES register There are no interactions in Case A, agreed? I'm not sure if it is clear, but in Case B every hook checks to see if a tmem backend is registered... if not, the hook is a no-op except for the addition of a compare-pointer-against-NULL op, so there is no interaction there either. So the only case where interactions are possible is Case C, which currently only can occur if a user specifies a kernel boot parameter of "tmem" or "zcache". (I know, a bit ugly, but there's a reason for doing it this way, at least for now.) > > Runtime interactions can only occur if the code is > > config'ed and, if config'ed, only if a tmem backend (e.g. > > Xen or zcache) enables it also at runtime. > > So this, I don't accept without proof ... that's what we initially said > about the last set of shrinker updates that caused kswapd to hang > sandybridge systems ... This makes me think that you didn't understand the code underlying Case B above, true? > > When > > both are enabled, runtime interactions do occur > > and absolutely must be fully tested. 
My point was > > that any _users_ who don't care about whether frontswap > > works or not don't need to have any concerns about > > VM system runtime interactions. I think this is also > > a very relevant point about maintenance... do you > > disagree? > > I'm sorry, what point about maintenance? The point is that only Case C has possible interactions so Case A and Case B end-users and kernel developers need not worry about the maintenance. IOW, if Johannes merges some super major swap subsystem rewrite and he doesn't have a clue if/how to move the frontswap hooks, his patch doesn't affect any Case A or Case B users and not even any Case C users that aren't using latest upstream. That seems relevant to me when we are discussing how much maintenance cost frontswap requires, which, I think, was where this subthread started several emails ago :-) > > > You can't hand wave away the need for benchmarks and > > > performance tests. > > > > I'm not. Conclusive benchmarks are available for one user > > (Xen) but not (yet) for other users. I've already acknowledged > > the feedback desiring benchmarking for zcache, but zcache > > is already merged (albeit in staging), and Xen tmem > > is already merged in both Linux and the Xen hypervisor, > > and cleancache (the alter ego of frontswap) is already > > merged. > > The test results for Xen I've seen are simply that "we're faster than > swapping to disk, and we can be even better if you use self ballooning". > There's no indication (at least in the Xen Summit presentation) what the > actual workloads were. > > > So the question is not whether benchmarks are waived, > > but whether one accepts (1) conclusive benchmarks for Xen; > > PLUS (2) insufficiently benchmarked zcache; PLUS (3) at > > least two other interesting-but-not-yet-benchmarkable users; > > as sufficient for adding this small set of hooks into > > swap code. > > That's the point: even for Xen, the benchmarks aren't "conclusive". > There may be a workload for which transcendent memory works better, but > make -j8 isn't enough of a variety of workloads. OK, you got me, I guess "conclusive" is too strong a word. It would be more accurate to say that the theoretical basis for improvement, which some people were very skeptical about, measures to be even better than expected. I agree that one workload isn't enough... I can assure you that there have been others. But I really don't think you are asking for more _positive_ data, you are asking if there is _negative_ data. As you point out, "we're faster than swapping" is not a hard bar to clear. IOW, comparing any workload that swaps a lot against the same workload swapping a lot less doesn't really prove anything. OR DOES IT? Considering that reducing swapping is the WHOLE POINT of frontswap, I would argue that it does. Can we agree that if frontswap is doing its job properly on any "normal" workload that is swapping, it is improving on a bad situation? Then let's get back to your implied question about _negative_ data. As described above, there is NO impact for Case A and Case B. (The zealot will point out that a pointer-compare-against-NULL per page-swapped-in/out is not "NO" impact, but let's ignore him for now.) In Case C, there are demonstrated benefits for SOME workloads... will frontswap HARM some workloads? I have openly admitted that for _cleancache_ on _zcache_, sometimes the cost can exceed the benefits, and this was actually demonstrated by one user on lkml.
For _frontswap_ it's really hard to imagine even a very contrived workload where frontswap fails to provide an advantage. I suppose maybe if your swap disk lives on a PCI SSD and your CPU is an ancient single-core that does extremely slow copying and compression? IOW, I feel like you are giving me busywork, and any additional evidence I present you will wave away anyway. > > I understand that some kernel developers (mostly from one > > company) continue to completely discount Xen, and > > thus won't even look at the Xen results. IMHO > > that is mudslinging. > > OK, so let's look at this another way: one of the signs of a good ABI is > generic applicability. Any good virtualisation ABI should thus work for > all virtualisation systems (including VMware should they choose to take > advantage of it). The fact that transcendent memory only seems to work > well for Xen is a red flag in this regard. I think the tmem ABI will work fine with any virtualization system, and particularly frontswap will. There are some theoretical arguments that KVM will get little or no benefit, but those arguments pertain primarily to cleancache. And I've noted that the ABI was designed to be very extensible, so if KVM wants a batching interface, they can add one. To repeat from the LWN KS2011 report: "[Linus] stated that, simply, code that actually is used is code that is actually worth something... code aimed at solving the same problem is just a vague idea that is worthless by comparison... Even if it truly is crap, we've had crap in the kernel before. The code does not get better out of tree." AND the API/ABI clearly supports other non-virtualization uses as well. The in-kernel hooks are very simple and the layering is very clean. The ABI is extensible, has been published for nearly three years, and successfully rev'ed once (to accommodate 192-bit exportfs handles for cleancache). Your arguments are on very thin ice here. It sounds like you are saying that unless/until KVM has a complete, measurable implementation... and maybe VMware and Hyper-V as well... you don't think the tiny set of hooks that are frontswap should be merged. If so, that "red flag" sounds self-serving, not what I would expect from someone like you. Sorry. > So what I don't like about this style of argument is the sleight of > hand: I would expect the inactive but configured case to show mostly in > the shrinker paths, which is where our major problems have been, so that > would be cleancache, not frontswap, wouldn't it? Yes, this is cleancache (already merged). As described above, frontswap executes no code in Case A or Case B, so it can't possibly interact with the shrinker path. > > So the remaining question is the performance impact when > > compile-time AND runtime enabled; this is in the published > > Xen presentation I've referenced -- the impact is much, much > > less than the performance gain. IMHO benchmark results can > > be easily manipulated so I prefer to discuss the theoretical > > underpinnings which, in short, are that just about anything > > a tmem backend does (hypercall, compression, deduplication, > > even moving data across a fast network) is a helluva lot > > faster than swapping a page to disk. > > > > Are there corner cases and probably even real workloads > > where the cost exceeds the benefits? Probably... though > > less likely for frontswap than for cleancache because ONLY > > pages that would actually be swapped out/in use frontswap.
> > > > But I have never suggested that every kernel should always > > unconditionally compile-time-enable and run-time-enable > > frontswap... simply that it should be in-tree so those > > who wish to enable it are able to enable it. > > In practice, most useful ABIs end up being compiled in ... and useful > basically means useful to any constituency, however small. If your ABI > is useless, then fine, we don't have to worry about the configured but > inactive case (but then again, we wouldn't have to worry about the ABI > at all). If it has a use, then kernels will end up shipping with it > configured in, which is why the inactive performance impact is so > important to quantify. So do you now understand/agree that the inactive performance impact is zero and the interaction of an inactive configuration with the remainder of the MM subsystem is zero? And that you and your users will be completely unaffected unless you/they intentionally turn it on, not only compiled in, but explicitly at runtime as well? So... understanding your preference for more workloads and your preference that KVM should be demonstrated as a profitable user first... is there anything else that you think should stand in the way of merging frontswap so that existing and planned kernel developers can build on top of it in-tree? Thanks, Dan ^ permalink raw reply [flat|nested] 175+ messages in thread
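For concreteness, the difference between Case B and Case C is only whether anything has filled in that ops pointer. A sketch of how a backend flips a running kernel from Case B to Case C, again with illustrative names (frontswap_register_ops() is the registration entry point in the posted series, with its signature simplified here; all backend-side names are assumptions):

	struct page;

	struct frontswap_ops {	/* as in the earlier sketch */
		int (*store)(unsigned type, unsigned long offset, struct page *page);
		int (*load)(unsigned type, unsigned long offset, struct page *page);
	};

	/* Provided by the frontswap core (signature simplified). */
	extern struct frontswap_ops *frontswap_register_ops(struct frontswap_ops *ops);

	/* Stub callbacks standing in for a real backend such as zcache,
	 * which would compress and store the page here. */
	static int my_backend_store(unsigned type, unsigned long offset,
				    struct page *page)
	{
		return -1;	/* stub: accept nothing in this sketch */
	}

	static int my_backend_load(unsigned type, unsigned long offset,
				   struct page *page)
	{
		return -1;	/* stub: find nothing in this sketch */
	}

	static struct frontswap_ops my_backend_ops = {
		.store = my_backend_store,
		.load  = my_backend_load,
	};

	/* Set only when the user passes the backend's boot parameter
	 * ("tmem" or "zcache" in the series under discussion). */
	static int my_backend_enabled;

	static int my_backend_init(void)
	{
		if (!my_backend_enabled)
			return 0;	/* stay in Case B: hooks remain no-ops */
		frontswap_register_ops(&my_backend_ops);	/* enter Case C */
		return 0;
	}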
* RE: [GIT PULL] mm: frontswap (for 3.2 window) 2011-11-01 18:10 ` Dan Magenheimer @ 2011-11-01 18:48 ` Dave Hansen -1 siblings, 0 replies; 175+ messages in thread From: Dave Hansen @ 2011-11-01 18:48 UTC (permalink / raw) To: Dan Magenheimer Cc: James Bottomley, John Stoffel, Johannes Weiner, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Jonathan Corbet On Tue, 2011-11-01 at 11:10 -0700, Dan Magenheimer wrote: > Case A) CONFIG_FRONTSWAP=n > Case B) CONFIG_FRONTSWAP=y and no tmem backend registers > Case C) CONFIG_FRONTSWAP=y and a tmem backend DOES register ... > The point is that only Case C has possible interactions > so Case A and Case B end-users and kernel developers need > not worry about the maintenance. I'm personally evaluating this as if all the distributions would turn it on. I'm evaluating as if every one of my employer's systems ships with it and as if it is =y on my laptop. Basically, I'm evaluating A/B/C and only looking at the worst-case maintenance cost (C). In other words, I'm ignoring A/B and assuming wide use. I'm curious where you expect to see the code get turned on and used since we might be looking at this from different angles. -- Dave ^ permalink raw reply [flat|nested] 175+ messages in thread
* RE: [GIT PULL] mm: frontswap (for 3.2 window) 2011-11-01 18:48 ` Dave Hansen @ 2011-11-01 21:32 ` Dan Magenheimer -1 siblings, 0 replies; 175+ messages in thread From: Dan Magenheimer @ 2011-11-01 21:32 UTC (permalink / raw) To: Dave Hansen Cc: James Bottomley, John Stoffel, Johannes Weiner, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Jonathan Corbet > From: Dave Hansen [mailto:dave@linux.vnet.ibm.com] > Subject: RE: [GIT PULL] mm: frontswap (for 3.2 window) > > On Tue, 2011-11-01 at 11:10 -0700, Dan Magenheimer wrote: > > Case A) CONFIG_FRONTSWAP=n > > Case B) CONFIG_FRONTSWAP=y and no tmem backend registers > > Case C) CONFIG_FRONTSWAP=y and a tmem backend DOES register > ... > > The point is that only Case C has possible interactions > > so Case A and Case B end-users and kernel developers need > > not worry about the maintenance. > > I'm personally evaluating this as if all the distributions would turn it > on. I'm evaluating as if every one of my employer's systems ships with > it and as if it is =y on my laptop. Basically, I'm evaluating A/B/C and > only looking at the worst-case maintenance cost (C). In other words, > I'm ignoring A/B and assuming wide use. Good. Me too. I was just saying that the-company-that-must-not-be-named (from which most of the non-technical objections are coming) can choose A or B as they wish without any impact to their developers or users. > I'm curious where you expect to see the code get turned on and used > since we might be looking at this from different angles. I think we are on the same page. Oracle is turning it on (Case B) in the default UEK kernel, for which the Beta git tree is published. Corporate policy keeps me from saying anything in detail about pre-released products, but you saw that our Oracle VM manager responded to this thread, so I'll leave that to your imagination. I think we agreed offlist that zcache is not ready for prime-time and a good measure of when it _will_ be ready is when it is promoted out of staging. I'm really hoping you guys at IBM will drive that (and am willing to get out of the way if you prefer). There's a lot of interest in Oracle in RAMster (which I personally think is very sexy), but I haven't been able to make forward progress in nearly three months now due to other fires and commitments. :-( So are we on the same page? Thanks, Dan ^ permalink raw reply [flat|nested] 175+ messages in thread
* RE: [GIT PULL] mm: frontswap (for 3.2 window) 2011-11-01 18:10 ` Dan Magenheimer @ 2011-11-02 7:44 ` James Bottomley -1 siblings, 0 replies; 175+ messages in thread From: James Bottomley @ 2011-11-02 7:44 UTC (permalink / raw) To: Dan Magenheimer Cc: John Stoffel, Johannes Weiner, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet On Tue, 2011-11-01 at 11:10 -0700, Dan Magenheimer wrote: [...] > For clarity and brevity, let's call the three cases: > > Case A) CONFIG_FRONTSWAP=n > Case B) CONFIG_FRONTSWAP=y and no tmem backend registers > Case C) CONFIG_FRONTSWAP=y and a tmem backend DOES register > > There are no interactions in Case A, agreed? I'm not sure > if it is clear, but in Case B every hook checks to > see if a tmem backend is registered... if not, the > hook is a no-op except for the addition of a > compare-pointer-against-NULL op, so there is no > interaction there either. > > So the only case where interactions are possible is > Case C, which currently only can occur if a user > specifies a kernel boot parameter of "tmem" or "zcache". > (I know, a bit ugly, but there's a reason for doing > it this way, at least for now.) OK, so what I'd like to see is benchmarks for B and C. B should confirm your contention of no cost (which is the ideal anyway) and C quantifies the passive cost to users. [...] > Can we agree that if frontswap is doing its job properly on > any "normal" workload that is swapping, it is improving on a > bad situation? No, not without a set of benchmarks ... that's rather the point of doing them. > Then let's get back to your implied question about _negative_ > data. As described above, there is NO impact for Case A > and Case B. (The zealot will point out that a pointer-compare- > against-NULL per page-swapped-in/out is not "NO" impact, > but let's ignore him for now.) In Case C, there are > demonstrated benefits for SOME workloads... will frontswap > HARM some workloads? > > I have openly admitted that for _cleancache_ on _zcache_, > sometimes the cost can exceed the benefits, and this was > actually demonstrated by one user on lkml. For _frontswap_ > it's really hard to imagine even a very contrived workload > where frontswap fails to provide an advantage. I suppose > maybe if your swap disk lives on a PCI SSD and your CPU > is an ancient single-core that does extremely slow copying > and compression? > > IOW, I feel like you are giving me busywork, and any additional > evidence I present you will wave away anyway. Well, OK, so there's a performance issue in some workloads; what the above is basically asking is how bad it is and how widespread. > > > I understand that some kernel developers (mostly from one > > > company) continue to completely discount Xen, and > > > thus won't even look at the Xen results. IMHO > > > that is mudslinging. > > > > OK, so let's look at this another way: one of the signs of a good ABI is > > generic applicability. Any good virtualisation ABI should thus work for > > all virtualisation systems (including VMware should they choose to take > > advantage of it). The fact that transcendent memory only seems to work > > well for Xen is a red flag in this regard. > > I think the tmem ABI will work fine with any virtualization system, > and particularly frontswap will.
There are some theoretical arguments > that KVM will get little or no benefit, but those arguments > pertain primarily to cleancache. And I've noted that the ABI > was designed to be very extensible, so if KVM wants a batching > interface, they can add one. To repeat from the LWN KS2011 report: > > "[Linus] stated that, simply, code that actually is used is > code that is actually worth something... code aimed at > solving the same problem is just a vague idea that is > worthless by comparison... Even if it truly is crap, > we've had crap in the kernel before. The code does not > get better out of tree." > > AND the API/ABI clearly supports other non-virtualization uses > as well. The in-kernel hooks are very simple and the layering > is very clean. The ABI is extensible, has been published for > nearly three years, and successfully rev'ed once (to accommodate > 192-bit exportfs handles for cleancache). Your arguments are on > very thin ice here. > > It sounds like you are saying that unless/until KVM has a complete, > measurable implementation... and maybe VMware and Hyper-V as well... > you don't think the tiny set of hooks that are frontswap should > be merged. If so, that "red flag" sounds self-serving, not what I > would expect from someone like you. Sorry. Hm, straw man and ad hominem. What I said was "one of the signs of a good ABI is generic applicability". That doesn't mean you have to apply an ABI to every situation by coming up with a demonstration for the use case. It does mean that people should know how to do it. I'm not particularly interested in the hypervisor wars, but it does seem to me that there are legitimate questions about the applicability of this to KVM. [...] > > > But I have never suggested that every kernel should always > > > unconditionally compile-time-enable and run-time-enable > > > frontswap... simply that it should be in-tree so those > > > who wish to enable it are able to enable it. > > > > In practice, most useful ABIs end up being compiled in ... and useful > > basically means useful to any constituency, however small. If your ABI > > is useless, then fine, we don't have to worry about the configured but > > inactive case (but then again, we wouldn't have to worry about the ABI > > at all). If it has a use, then kernels will end up shipping with it > > configured in, which is why the inactive performance impact is so > > important to quantify. > > So do you now understand/agree that the inactive performance impact is zero > and the interaction of an inactive configuration with the remainder > of the MM subsystem is zero? And that you and your users will be > completely unaffected unless you/they intentionally turn it on, > not only compiled in, but explicitly at runtime as well? As I said above, just benchmark it for B and C. As long as nothing nasty is happening, I'm fine with it. > So... understanding your preference for more workloads and your > preference that KVM should be demonstrated as a profitable user > first... is there anything else that you think should stand > in the way of merging frontswap so that existing and planned > kernel developers can build on top of it in-tree? No, I think that's my list. The confusion over a KVM interface is solely because you keep saying it's not a Xen-only ABI ... if it were, I'd be fine for it living in the xen tree. James ^ permalink raw reply [flat|nested] 175+ messages in thread
* RE: [GIT PULL] mm: frontswap (for 3.2 window) 2011-11-02 7:44 ` James Bottomley @ 2011-11-02 19:39 ` Dan Magenheimer -1 siblings, 0 replies; 175+ messages in thread From: Dan Magenheimer @ 2011-11-02 19:39 UTC (permalink / raw) To: James Bottomley Cc: John Stoffel, Johannes Weiner, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet > From: James Bottomley [mailto:James.Bottomley@HansenPartnership.com] > Subject: RE: [GIT PULL] mm: frontswap (for 3.2 window) > Hm, straw man and ad hominem.... Let me apologize to you also for being sarcastic and disrespectful yesterday. I'm very sorry, I really do appreciate your time and effort, and will try to focus on the core of your excellent feedback, rather than write another long rant. > > Case A) CONFIG_FRONTSWAP=n > > Case B) CONFIG_FRONTSWAP=y and no tmem backend registers > > Case C) CONFIG_FRONTSWAP=y and a tmem backend DOES register > > OK, so what I'd like to see is benchmarks for B and C. B should confirm > your contention of no cost (which is the ideal anyway) and C quantifies > the passive cost to users. OK, we'll see what we can do. For B, given the natural variance in any workload that is doing heavy swapping, I'm not sure that I can prove anything, but I suppose it will at least reveal if there are any horrible glaring bugs. However, in turn, I'd ask you to at least confirm by code examination that, not counting swapon and swapoff, the only change to the swapping path is comparing a function pointer in struct frontswap_ops against NULL. (And, for Case B, it is NULL, so no function call ever occurs.) OK? For C, understood, benchmarks for zcache needed. > Well, OK, so there's a performance issue in some workloads; what the > above is basically asking is how bad it is and how widespread. Just to clarify, the performance issue observed is with cleancache with zcache, not frontswap. That issue has been observed on high-throughput old-single-core-CPU machines; see https://lkml.org/lkml/2011/8/29/225. That issue is because cleancache (like the pagecache) has to speculate on what pages might be needed in the future. Frontswap with zcache ONLY compresses pages that would otherwise be physically swapped to a swap device. So I don't see a performance issue with frontswap. (But, yes, I will still provide some benchmarks.) > What I said was "one of the signs of a > good ABI is generic applicability". That doesn't mean you have to apply > an ABI to every situation by coming up with a demonstration for the use > case. It does mean that people should know how to do it. I'm not > particularly interested in the hypervisor wars, but it does seem to me > that there are legitimate questions about the applicability of this to > KVM. The guest->host ABI does work with KVM, and is in Sasha's git tree. It is a very simple shim, very similar to what Xen uses, and will feed the same "opportunities" for swapping to host memory for KVM as for Xen. The arguments regarding KVM are about whether, when the ABI is used, there is a sufficient performance gain, because each page requires a costly vmexit/vmenter sequence. It seems obvious to me, but I've done what I can to facilitate Sasha's and Neo's tmem-on-KVM work... their code is just not finished yet.
As I've discussed with Andrea, the ABI is very extensible so if it makes a huge difference to add "batching" for KVM, the ABI won't get in the way. > As I said above, just benchmark it for B and C. As long as nothing nasty > is happening, I'm fine with it. > > > So... understanding your preference for more workloads and your > > preference that KVM should be demonstrated as a profitable user > > first... is there anything else that you think should stand > > in the way of merging frontswap so that existing and planned > > kernel developers can build on top of it in-tree? > > No, I think that's my list. The confusion over a KVM interface is > solely because you keep saying it's not a Xen only ABI ... if it were, > I'd be fine for it living in the xen tree. OK, thanks! But the core frontswap hooks are in routines in mm/swapfile.c and mm/page_io.c so can't live in the xen tree. And the Xen-specific stuff already does. Sorry, getting long-winded again, but at least not ranting :-} Dan ^ permalink raw reply [flat|nested] 175+ messages in thread
* Re: [GIT PULL] mm: frontswap (for 3.2 window) 2011-10-28 18:28 ` John Stoffel @ 2011-10-31 18:44 ` Andrea Arcangeli -1 siblings, 0 replies; 175+ messages in thread From: Andrea Arcangeli @ 2011-10-31 18:44 UTC (permalink / raw) To: John Stoffel Cc: Dan Magenheimer, Johannes Weiner, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet On Fri, Oct 28, 2011 at 02:28:20PM -0400, John Stoffel wrote: > and service. How would TM benefit me? I don't use Xen, don't want to > play with it honestly because I'm busy enough as it is, and I just > don't see the hard benefits. If you used Xen, tmem would be more or less the equivalent of cache=writethrough/writeback. For us tmem is the linux host pagecache running on the baremetal, in short. But at least when we vmexit for a read we read 128-512k of it (depending on if=virtio or others and the guest kernel readahead decision), not just a fixed, absolutely worst case 4k unit like tmem would do... (At 4k per exit, filling 1MB of guest pagecache costs on the order of 256 vmexits, versus a handful with 128-512k reads.) Without tmem Xen can only work like KVM cache=off. If at least it dropped us a copy, but no, it still does the bounce buffer, so I'd rather bounce in the host kernel function file_read_actor than in some superfluous (as far as KVM is concerned) tmem code; plus we normally read orders of magnitude more than 4k in each vmexit, so our default cache=writeback/writethrough may already be more efficient than if we'd use tmem for that. We could only consider it for swap compression, but for swap compression I've no idea why we still need to do a copy, instead of just compressing from the userland page in zerocopy (worst case using any mechanism introduced to provide stable pages). And when the host linux pagecache goes hugepage we'll get a >4k copy in one go while the tmem bounce will still be stuck at 4k... ^ permalink raw reply [flat|nested] 175+ messages in thread
* Re: [GIT PULL] mm: frontswap (for 3.2 window) 2011-10-28 17:07 ` Dan Magenheimer @ 2011-10-30 21:47 ` Johannes Weiner -1 siblings, 0 replies; 175+ messages in thread From: Johannes Weiner @ 2011-10-30 21:47 UTC (permalink / raw) To: Dan Magenheimer Cc: Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet On Fri, Oct 28, 2011 at 10:07:12AM -0700, Dan Magenheimer wrote: > > > From: Johannes Weiner [mailto:jweiner@redhat.com] > > Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window) > > > > On Fri, Oct 28, 2011 at 06:36:03PM +0300, Pekka Enberg wrote: > > > On Fri, Oct 28, 2011 at 6:21 PM, Dan Magenheimer > > > <dan.magenheimer@oracle.com> wrote: > > > Looking at your patches, there's no trace that anyone outside your own > > > development team even looked at the patches. Why do you feel that it's > > > OK to ask Linus to pull them? > > > > People did look at it. > > > > In my case, the handwavy benefits did not convince me. The handwavy > > 'this is useful' from just more people of the same company does not > > help, either. > > > > I want to see a usecase that tangibly gains from this, not just more > > marketing material. Then we can talk about boring infrastructure and > > adding hooks to the VM. > > > > Convincing the development community of the problem you are trying to > > solve is the undocumented part of the process you fail to follow. > > Hi Johannes -- > > First, there are several companies and several unaffiliated kernel > developers contributing here, building on top of frontswap. I happen > to be spearheading it, and my company is backing me up. (It > might be more appropriate to note that much of the resistance comes > from people of your company... but please let's keep our open-source > developer hats on and have a technical discussion rather than one > which pleases our respective corporate overlords.) I didn't mean to start a mud fight about this, I only mentioned the part about your company because I already assume it sees value in tmem - it probably wouldn't fund its development otherwise. I just tend to not care too much about Acks from the same company as the patch itself and I believe other people do the same. > Second, have you read http://lwn.net/Articles/454795/ ? > If not, please do. If yes, please explain what you don't > see as convincing or tangible or documented. All of this > exists today as working publicly available code... it's > not marketing material. I remember answering this to you in private already some time ago when discussing frontswap. You keep proposing a bridge and I keep asking for proof that this is not a bridge to nowhere. Unless that question is answered, I am not interested in discussing the bridge's design. According to the LWN article, there are the following backends: 1. Zcache: allow swapping into compressed memory This sets aside a portion of memory which the kernel will swap compressed pages into upon pressure. Now, obviously, reserving memory from the system for this increases the pressure in the first place, eating away at what space we have for anonymous memory and page cache. Do you auto-size that region depending on workload? If so, how? If not, is it documented how to size it manually?
Where are the performance numbers for various workloads, including both those that benefit from every bit of page cache and those that would fit into memory without zcache occupying space? However, looking at the zcache code, it seems it wants to allocate storage pages only when already trying to swap out. Are you sure this works in reality? 2. RAMster: allow swapping between machines in a cluster Are there people using it? It, too, sounds like a good idea but I don't see any proof it actually works as intended. 3. Xen: allow guests to swap into the host. The article mentions that there is code to put the guests under pressure and let them swap to host memory when the pressure is too high. This sounds useful. Where is the code that controls the amount of pressure put on the guests? Where are the performance numbers? Surely you can construct a case where the initial machine sizes are not quite right and then collect data that demonstrates the machines are rebalancing as expected? 4. kvm: same as Xen Apart from the questions that already apply to Xen, I remember KVM people in particular complaining about the synchronous single-page interface that results in a hypercall per swapped page. What happened to this concern? --- I would really appreciate it if you could pick one of those backends and present it as a real and practical solution to real and practical problems. With documentation on configuration and performance data of real workloads. We can discuss implementation details like how memory is exchanged between source and destination when we come to it. I am not asking for just more code that uses your interface, I want to know the real value for real people of the combination of all that stuff. With proof, not just explanations of how it's supposed to work. Until you can accept that, please include Nacked-by: Johannes Weiner <hannes@cmpxchg.org> on all further stand-alone submissions of tmem core code and/or hooks in the VM. Thanks. ^ permalink raw reply [flat|nested] 175+ messages in thread
* RE: [GIT PULL] mm: frontswap (for 3.2 window) 2011-10-30 21:47 ` Johannes Weiner @ 2011-10-30 23:19 ` Dan Magenheimer -1 siblings, 0 replies; 175+ messages in thread From: Dan Magenheimer @ 2011-10-30 23:19 UTC (permalink / raw) To: Johannes Weiner Cc: Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet > From: Johannes Weiner [mailto:jweiner@redhat.com] > Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window) Hi Johannes -- Thanks for taking the time for some real technical discussion (below). > On Fri, Oct 28, 2011 at 10:07:12AM -0700, Dan Magenheimer wrote: > > > > > From: Johannes Weiner [mailto:jweiner@redhat.com] > > > Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window) > > > > > > On Fri, Oct 28, 2011 at 06:36:03PM +0300, Pekka Enberg wrote: > > > > On Fri, Oct 28, 2011 at 6:21 PM, Dan Magenheimer > > > > <dan.magenheimer@oracle.com> wrote: > > > > Looking at your patches, there's no trace that anyone outside your own > > > > development team even looked at the patches. Why do you feel that it's > > > > OK to ask Linus to pull them? > > > > > > People did look at it. > > > > > > In my case, the handwavy benefits did not convince me. The handwavy > > > 'this is useful' from just more people of the same company does not > > > help, either. > > > > > > I want to see a usecase that tangibly gains from this, not just more > > > marketing material. Then we can talk about boring infrastructure and > > > adding hooks to the VM. > > > > > > Convincing the development community of the problem you are trying to > > > solve is the undocumented part of the process you fail to follow. > > > > Hi Johannes -- > > > > First, there are several companies and several unaffiliated kernel > > developers contributing here, building on top of frontswap. I happen > > to be spearheading it, and my company is backing me up. (It > > might be more appropriate to note that much of the resistance comes > > from people of your company... but please let's keep our open-source > > developer hats on and have a technical discussion rather than one > > which pleases our respective corporate overlords.) > > I didn't mean to start a mud fight about this, I only mentioned the > part about your company because I already assume it sees value in tmem > - it probably wouldn't fund its development otherwise. I just tend to > not care too much about Acks from the same company as the patch itself > and I believe other people do the same. Oops, sorry for mudslinging if none was intended. Although I understand your position about Acks from the same company, isn't that challenging the integrity of the individual's ack/review, implying that they are not really reviewing the code with the same intensity as if it came from another company? Especially with something like tmem, maybe the review is just as valid, and people from the same company have just had more incentive to truly understand the intent and potential of the functionality, as well as the syntax in the code? And maybe, on some patches, reviewers who ARE from different companies are "good buddies" who watch each others' backs, and those reviews are not really complete? So perhaps this default assumption about code review is flawed? > > Second, have you read http://lwn.net/Articles/454795/ ? > > If not, please do. If yes, please explain what you don't > > see as convincing or tangible or documented.
All of this > > exists today as working publicly available code... it's > > not marketing material. > > I remember answering this to you in private already some time ago when > discussing frontswap. Yes, reading ahead, all the questions sound familiar and I thought they were all answered (albeit some offlist). I think the conversation ended at that point, so I assumed any issues were resolved. > You keep proposing a bridge and I keep asking for proof that this is > not a bridge to nowhere. Unless that question is answered, I am not > interested in discussing the bridge's design. > > According to the LWN article, there are the following backends: > > 1. Zcache: allow swapping into compressed memory > > This sets aside a portion of memory which the kernel will swap > compressed pages into upon pressure. Now, obviously, reserving memory > from the system for this increases the pressure in the first place, > eating away on what space we have for anonymous memory and page cache. > > Do you auto-size that region depending on workload? Yes. A key value of the whole transcendent memory design is that everything is done dynamically. That's one reason that Nitin Gupta (author of zram) supports zcache. > If so, how? If not, is it documented how to size it manually? See above. There are some zcache policy parameters that can be adjusted manually (currently through sysfs) so we can adjust the defaults as necessary over time. > Where are the performance numbers for various workloads, including > both those that benefit from every bit of page cache and those that > would fit into memory without zcache occupying space? I have agreed already that more zcache measurement is warranted (though I maintain it will get a lot more measurement merged than it will unmerged). So I can only answer theoretically, though I would appreciate your comment if you disagree. Space used for page cache is almost always opportunistic; it is a "guess" that the page will be needed again in the future. Frontswap only stores pages that MUST otherwise be swapped. Swapping occurs only if the clean list is empty (or if the MM system is too slow to respond to changes in workload). In fact some of the pages-to-be-swapped that end up in frontswap can be dirty page cache pages. All of this is handled dynamically. The kernel is still deciding which pages to keep and which to reclaim and which to swap. The hooks simply grab pages as they are going by. That's why the frontswap patch can be so simple and can have many "users" built on top of it. > However, looking at the zcache code, it seems it wants to allocate > storage pages only when already trying to swap out. Are you sure this > works in reality? Yes. I'd encourage you to try it. I'd be a fool if I tried to guarantee that there are no bugs of course. > 2. RAMster: allow swapping between machines in a cluster > > Are there people using it? It, too, sounds like a good idea but I > don't see any proof it actually works as intended. No. I've posted the code publicly but it's still a godawful mess and I'd be embarrassed if anyone looked at it. But the code does work and I've got some ideas on how to make it more upstreamable. If anybody seriously wants to work on it right now, I could do that, but I'd prefer some more time alone with it first. Conceptually, it's just a matter of moving pages to a different machine instead of across a hypercall interface. All the "magic" is in the frontswap and cleancache hooks. 
They run on both machines, both dynamically managing space (and compressing it too). The code uses ocfs2 for "cluster" discovery and is built on top of a modified zcache. > 3. Xen: allow guests to swap into the host. > > The article mentions that there is code to put the guests under > pressure and let them swap to host memory when the pressure is too > high. This sounds useful. > > Where is the code that controls the amount of pressure put on the > guests? See drivers/xen/xen-selfballoon.c, which was just merged at 3.1, though there have been versions of it floating around for 2+ years. Note there's a bug fix pending that makes the pressure a little less aggressive. I think it is/was submitted for the open 3.2 window. (Note the same file manipulates the number of pages in frontswap.) > Where are the performance numbers? Surely you can construct a case > where the initial machine sizes are not quite right and then collect > data that demonstrates the machines are rebalancing as expected? Yes, I can. It just works and with the right tools running, it's even fun to watch. Some interesting performance numbers were published at Xen Summit 2010. See the last few pages of: http://oss.oracle.com/projects/tmem/dist/documentation/presentations/TranscendentMemoryXenSummit2010.pdf The speaker's notes (so you can follow the presentation without video) are in the same dir. > 4. kvm: same as Xen > > Apart from the questions that already apply to Xen, I remember KVM > people in particular complaining about the synchronous single-page > interface that results in a hypercall per swapped page. What happened > to this concern? I think we (me and the KVM people) agreed that the best way to determine if this is a concern is to just measure it. Sasha and Neo are working on a KVM implementation which should make this possible (but neither wants to invest a lot of time if frontswap isn't merged or has a clear path to merging). So, again, theoretically, and please argue if you disagree... (and yes I know real measurements are better, but I think we all know how easy it is to manipulate benchmarks so IMHO a theoretical understanding is useful too). What is the cost of a KVM hypercall (vmexit/vmenter) vs the cost of swapping a page? (Roughly, with assumed ballpark figures: a vmexit/vmenter round trip is on the order of microseconds, while a disk swap access is on the order of milliseconds, a gap of about three orders of magnitude.) Clearly, reading/writing a disk is a very slow operation, but has very little CPU overhead (though preparing a page to be swapped via blkio is NOT inexpensive). But if you are swapping, it is almost never the case that the CPU is busy, especially on a multicore CPU. I expect that on old, slow processors (e.g. first-gen single-core VT-x) this might sometimes be measurable, but rarely an issue. On modern processors, I don't expect it to be significant. BTW, it occurs to me that this is now measurable on Xen too, since Xen tmem works now for fully-virtualized guests. I don't have the machines to reproduce the same experiment, but if you look at the graphs in the Xen presentation, you can see that CPU utilization goes up substantially, but throughput still improves. I am almost positive that the CPU cost of compression/decompression plus the cost of deduplication insert/fetch exceeds the cost of a vmexit/vmenter, so the additional cost of vmexit/vmenter will at most increase the CPU utilization. The real performance gain comes from avoiding (waiting for) disk accesses. > I would really appreciate it if you could pick one of those backends and > present it as a real and practical solution to real and practical > problems. With documentation on configuration and performance data of > real workloads.
We can discuss implementation details like how memory > is exchanged between source and destination when we come to it. > > I am not asking for just more code that uses your interface, I want to > know the real value for real people of the combination of all that > stuff. With proof, not just explanations of how it's supposed to > work. Well, the Xen implementation is by far the most mature and the Xen presentation above is reasonably conclusive though, as always, more measurements of more workloads would be good. Not to get back into the mudslinging, but certain people from certain companies try to ignore or minimize the value of Xen, so I've been trying to emphasize the other (non-Xen, non-virtualization) code. Personally, I think the Xen use case is sufficient by itself as it solves a problem nobody else has ever solved (or, more precisely, that VMware attempted to solve but, as real VMware customers will attest, did so very poorly). To be a good Linux kernel citizen, I've encouraged my company to hold off on widespread support for Xen tmem until all the parts are upstream in Linux, so there isn't a wide existing body of "proof" data. And releasing customer data from my employer requires an act of God. But private emails to Linus for cleancache seemed to convince him that there was enough justification for cleancache. I thought frontswap was simpler and would be the easy part, but was clearly mistaken :-( We are now proceeding fully with Xen tmem with both frontswap and cleancache in the kernel. > Until you can accept that, please include > > Nacked-by: Johannes Weiner <hannes@cmpxchg.org> > > on all further stand-alone submissions of tmem core code and/or hooks > in the VM. Thanks. If you are willing to accept that Xen is a valid use case, I think I have provided that (although I agree that more data would be good and would be happy to take suggestions for what data to provide). If not, I would call that a form of mudslinging but will add your Nack. Please let me know. Dan ^ permalink raw reply [flat|nested] 175+ messages in thread
* Re: [GIT PULL] mm: frontswap (for 3.2 window) 2011-10-28 17:07 ` Dan Magenheimer @ 2011-10-31 18:34 ` Andrea Arcangeli -1 siblings, 0 replies; 175+ messages in thread From: Andrea Arcangeli @ 2011-10-31 18:34 UTC (permalink / raw) To: Dan Magenheimer Cc: Johannes Weiner, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet, Hugh Dickins On Fri, Oct 28, 2011 at 10:07:12AM -0700, Dan Magenheimer wrote: > First, there are several companies and several unaffiliated kernel > developers contributing here, building on top of frontswap. I happen > to be spearheading it, and my company is backing me up. (It > might be more appropriate to note that much of the resistance comes > from people of your company... but please let's keep our open-source > developer hats on and have a technical discussion rather than one > which pleases our respective corporate overlords.) Fair enough to want an independent review, but it'd be interesting to also know how many of the several companies and unaffiliated kernel developers contributing to it aren't using tmem with Xen. Obviously bounce-buffered 4k vmexits are still faster than Xen paravirt I/O on the disk platter... Note, Hugh is working for another company... and they're using cgroups, not KVM nor Xen, so I suggest he'd be a fair reviewer from a non-virt standpoint, if he hopefully has the time to weigh in. However, keep in mind that if we saw something that could allow KVM to run even faster, we'd be quite silly not to take advantage of it too, to beat our own SPECvirt record. The whole design idea of KVM (unlike Xen) is to reuse kernel improvements as much as possible, so when the guest runs faster the hypervisor also runs faster with the exact same code. Problem is, a vmexit doing a bounce buffer every 4k doesn't mix well with SPECvirt in my view, and that is probably what has kept us from making any attempt to use the tmem API anywhere. ^ permalink raw reply [flat|nested] 175+ messages in thread
* RE: [GIT PULL] mm: frontswap (for 3.2 window) 2011-10-31 18:34 ` Andrea Arcangeli @ 2011-10-31 21:45 ` Dan Magenheimer -1 siblings, 0 replies; 175+ messages in thread From: Dan Magenheimer @ 2011-10-31 21:45 UTC (permalink / raw) To: Andrea Arcangeli Cc: Johannes Weiner, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet, Hugh Dickins > From: Andrea Arcangeli [mailto:aarcange@redhat.com] > Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window) > > On Fri, Oct 28, 2011 at 10:07:12AM -0700, Dan Magenheimer wrote: > > First, there are several companies and several unaffiliated kernel > > developers contributing here, building on top of frontswap. I happen > > to be spearheading it, and my company is backing me up. (It > > might be more appropriate to note that much of the resistance comes > > from people of your company... but please let's keep our open-source > > developer hats on and have a technical discussion rather than one > > which pleases our respective corporate overlords.) > > Fair enough to want an independent review, but it'd be interesting to > also know how many of the several companies and unaffiliated kernel > developers contributing to it aren't using tmem with Xen. Well, just to summarize the non-Oracle, non-Xen-tmem supportive responses so far to this frontswap thread: Nitin Gupta, for zcache Brian King (IBM), for Linux on Power Sasha Levin and Neo Jia, affiliation unspecified, working on tmem for KVM Ed Tomlinson, affiliation unspecified, end-user of zcache This doesn't count those that replied offlist to Linus to support the merging of cleancache earlier this year, and doesn't count the fair number of people who have offlist asked me about zcache or if KVM supports tmem or when RAMster will be ready. I suppose I could do a better job advertising others' interest... > Note, Hugh is working for another company... and they're using cgroups, > not KVM nor Xen, so I suggest he'd be a fair reviewer from a non-virt > standpoint, if he hopefully has the time to weigh in. I spent an hour with Hugh at Google this summer, and he (like you) expressed some dislike of the ABI/API and the hooks but he has since told both me and Andrew he doesn't have time to pursue this. Others in Google have shown vague interest in tmem for cgroups but I've been too busy myself to even think about that. > However, keep in mind that if we saw something that could allow KVM to run > even faster, we'd be quite silly not to take advantage of it too, to > beat our own SPECvirt record. The whole design idea of KVM (unlike > Xen) is to reuse kernel improvements as much as possible, so when > the guest runs faster the hypervisor also runs faster with the exact > same code. Problem is, a vmexit doing a bounce buffer every 4k doesn't mix > well with SPECvirt in my view, and that is probably what has kept us > from making any attempt to use the tmem API anywhere. If SPECvirt does any swapping that actually goes to disk (doubtful?), frontswap will help. Personally, I think SPECvirt was hand-designed by VMware to favor their platform, but they were chagrined to find that you and KVM cleverly re-implemented transparent content-based page sharing, which was the feature for which they were designing SPECvirt. IOW, SPECvirt is benchmarketing, not benchmarking... but I know that's important too. :-) Sorry for the topic drift...
Thanks, Dan ^ permalink raw reply [flat|nested] 175+ messages in thread
* RE: [GIT PULL] mm: frontswap (for 3.2 window) @ 2011-10-31 21:45 ` Dan Magenheimer 0 siblings, 0 replies; 175+ messages in thread From: Dan Magenheimer @ 2011-10-31 21:45 UTC (permalink / raw) To: Andrea Arcangeli Cc: Johannes Weiner, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet, Hugh Dickins > From: Andrea Arcangeli [mailto:aarcange@redhat.com] > Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window) > > On Fri, Oct 28, 2011 at 10:07:12AM -0700, Dan Magenheimer wrote: > > First, there are several companies and several unaffiliated kernel > > developers contributing here, building on top of frontswap. I happen > > to be spearheading it, and my company is backing me up. (It > > might be more appropriate to note that much of the resistance comes > > from people of your company... but please let's keep our open-source > > developer hats on and have a technical discussion rather than one > > which pleases our respective corporate overlords.) > > Fair enough to want an independent review but I'd be interesting to > also know how many of the several companies and unaffiliated kernel > developers are contributing to it that aren't using tmem with Xen. Well just to summarize the non-Oracle-non-tmem supportive responses so far to this frontswap thread: Nitin Gupta, for zcache Brian King (IBM), for Linux on Power Sasha Levin and Neo Jia, affiliation unspecified, working on tmem for KVM Ed Tomlinson, affiliation unspecified, end-user of zcache This doesn't count those that replied offlist to Linus to support the merging of cleancache earlier this year, and doesn't count the fair number of people who have offlist asked me about zcache or if KVM supports tmem or when RAMster will be ready. I suppose I could do a better job advertising others' interest... > Note, Hugh is working for another company... and they're using cgroups > not KVM nor Xen, so I suggests he'd be a fair reviewer from a non-virt > standpoint, if he hopefully has the time to weight in. I spent an hour with Hugh at Google this summer, and he (like you) expressed some dislike of the ABI/API and the hooks but he has since told both me and Andrew he doesn't have time to pursue this. Others in Google have shown vague interest in tmem for cgroups but I've been too busy myself to even think about that. > However keep in mind if we'd see something that can allow KVM to run > even faster, we'd be quite silly in not taking advantage of it too, to > beat our own SPECvirt record. The whole design idea of KVM (unlike > Xen) is to reuse the kernel improvements as much as possible so when > the guest runs faster the hypervisor also runs faster with the exact > same code. Problem a vmexit doing a bounce buffer every 4k doesn't mix > well into SPECvirt in my view and that probably is what has kept us > from making any attempt to use tmem API anywhere. If SPECvirt does any swapping that actually goes to disk (doubtful?), frontswap will help. Personally, I think SPECvirt was hand-designed by VMware to favor their platform, but they were chagrined to find that you and KVM cleverly re-implemented transparent content-based page sharing which was the feature for which they were designing SPECvirt. IOW, SPECvirt is benchmarketing not benchmarking... but I know that's important too. :-) Sorry for the topic drift... 
Thanks, Dan

^ permalink raw reply [flat|nested] 175+ messages in thread
* RE: [GIT PULL] mm: frontswap (for 3.2 window)
2011-10-28 15:36 ` Pekka Enberg
@ 2011-10-28 16:37 ` Dan Magenheimer
  0 siblings, 0 replies; 175+ messages in thread
From: Dan Magenheimer @ 2011-10-28 16:37 UTC (permalink / raw)
To: Pekka Enberg
Cc: Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet

> You are changing core kernel code without ACKs from relevant
> maintainers. That's very unfortunate. Existing users certainly matter,
> but that doesn't mean you get to merge code without maintainers even
> looking at it.
>
> So really, why don't you just use scripts/get_maintainer.pl and simply
> ask the relevant people for their ACK?

Actually, I had done that before posting the patches and, doing it now again, I *do* have many of the relevant people on the ack list, and nearly all on the cc list of the patch postings. (I apologize that I see I missed you on my list.) I think every relevant maintainer has had the chance to review and acknowledge, but some have, for whatever reason, chosen not to.

> Looking at your patches, there's no trace that anyone outside your own
> development team even looked at the patches.

Hmmm... I have reviews/acks from IBM, Fujitsu, and Citrix (and a long list of documented Cc's) in the git comments, so I'm not sure what you are seeing. Ah, perhaps you are referring to the naming changes in the cleancache hooks? Akpm required me to rename various frontswap hooks to use "invalidate" in the function name instead of "flush". I took the opportunity to rename the cleancache hooks for consistency in this same patchset, and this occurred only in the most recent version of the patchset. It is true that I didn't ask for ACKs from those maintainers, though these changes would probably have gone through the trivial patch monkey later anyway.

> Why do you feel that it's OK to ask Linus to pull them?

Frontswap is essentially the second half of the cleancache patchset (or, more accurately, both are halves of the transcendent memory patchset). They are similar in that the hooks in core MM code are fairly trivial and the real value/functionality lies outside of the core kernel; as a result, core MM maintainers don't have much interest, I guess. Linus personally merged cleancache for 3.0 (quoting from his offlist email: "I've looked through it, and it seems simple enough, with a pretty minimal support burden"); I was assuming a similar path for frontswap.

I repeat that I'm not trying to subvert any process. There just doesn't seem to be much of a process in place for this kind of a patchset, and I'm not letting silence or indifference or "don't like it much" get in the way.

Thanks, Dan

^ permalink raw reply [flat|nested] 175+ messages in thread
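(For reference on the rename discussed above: after the s/flush/invalidate/ change, the backend-facing frontswap ops table looks roughly like the sketch below. This is a from-memory reconstruction, not a quote from the patchset; the exact member names and signatures are assumptions to be checked against include/linux/frontswap.h in the tree being pulled.)

	/* From-memory sketch of the backend ops table after the
	 * s/flush/invalidate/ rename; member names are assumptions,
	 * not quoted from include/linux/frontswap.h. */
	struct frontswap_ops {
		void (*init)(unsigned);	/* new swap area ("type") activated */
		int (*put_page)(unsigned, pgoff_t, struct page *);
		int (*get_page)(unsigned, pgoff_t, struct page *);
		void (*invalidate_page)(unsigned, pgoff_t);	/* was flush_page */
		void (*invalidate_area)(unsigned);		/* was flush_area */
	};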
* Re: [GIT PULL] mm: frontswap (for 3.2 window)
2011-10-28 16:37 ` Dan Magenheimer
@ 2011-10-28 16:59 ` Pekka Enberg
  0 siblings, 0 replies; 175+ messages in thread
From: Pekka Enberg @ 2011-10-28 16:59 UTC (permalink / raw)
To: Dan Magenheimer
Cc: Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet

On Fri, Oct 28, 2011 at 7:37 PM, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
>> Why do you feel that it's OK to ask Linus to pull them?
>
> Frontswap is essentially the second half of the cleancache
> patchset (or, more accurately, both are halves of the
> transcendent memory patchset). They are similar in that
> the hooks in core MM code are fairly trivial and the
> real value/functionality lies outside of the core kernel;
> as a result core MM maintainers don't have much interest
> I guess.

I would not call this commit trivial:

http://oss.oracle.com/git/djm/tmem.git/?p=djm/tmem.git;a=commitdiff;h=6ce5607c1edf80f168d1e1f22dc7a85290cf094a

You are exporting a bunch of mm/swapfile.c variables (including locks) and adding hooks to mm/page_io.c and mm/swapfile.c. Furthermore, code like this:

> +	if (frontswap) {
> +		if (frontswap_test(si, i))
> +			break;
> +		else
> +			continue;
> +	}

does not really help your case.

Pekka

^ permalink raw reply [flat|nested] 175+ messages in thread
* RE: [GIT PULL] mm: frontswap (for 3.2 window)
2011-10-28 16:59 ` Pekka Enberg
@ 2011-10-28 17:20 ` Dan Magenheimer
  0 siblings, 0 replies; 175+ messages in thread
From: Dan Magenheimer @ 2011-10-28 17:20 UTC (permalink / raw)
To: Pekka Enberg
Cc: Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet

> From: Pekka Enberg [mailto:penberg@kernel.org]
> Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
>
> On Fri, Oct 28, 2011 at 7:37 PM, Dan Magenheimer
> <dan.magenheimer@oracle.com> wrote:
> >> Why do you feel that it's OK to ask Linus to pull them?
> >
> > Frontswap is essentially the second half of the cleancache
> > patchset (or, more accurately, both are halves of the
> > transcendent memory patchset). They are similar in that
> > the hooks in core MM code are fairly trivial and the
> > real value/functionality lies outside of the core kernel;
> > as a result core MM maintainers don't have much interest
> > I guess.
>
> I would not call this commit trivial:
>
> http://oss.oracle.com/git/djm/tmem.git/?p=djm/tmem.git;a=commitdiff;h=6ce5607c1edf80f168d1e1f22dc7a85290cf094a
>
> You are exporting a bunch of mm/swapfile.c variables (including locks)
> and adding hooks to mm/page_io.c and mm/swapfile.c.

Oh, good, some real patch discussion! :-)

You'll note that those exports previously were global and were made static in the recent past. The rationale for this is discussed in the FAQ in frontswap.txt, which is part of the patchset. The swapfile.c changes are really the meat of the patch. The page_io.c hooks ARE trivial, don't you think?

> Furthermore, code like this:
>
> > +	if (frontswap) {
> > +		if (frontswap_test(si, i))
> > +			break;
> > +		else
> > +			continue;
> > +	}
>
> does not really help your case.

I don't like that much either, but I didn't see a better way to write it without duplicating a bunch of rather obtuse code. Suggestions welcome.

Thanks, Dan

^ permalink raw reply [flat|nested] 175+ messages in thread
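(Since suggestions are invited: one behavior-preserving restructuring of the quoted hunk drops the else arm by inverting the test. This is only a sketch; it assumes the hunk sits inside scan_swap_map()'s offset-scanning loop as the context implies, with si, i, and frontswap taken from the hunk above, and is not code from the actual patchset.)

	/* Minimal sketch of an equivalent form: when scanning on behalf
	 * of frontswap, skip any offset not already in frontswap and
	 * stop at the first match. */
	if (frontswap) {
		if (!frontswap_test(si, i))
			continue;	/* not a frontswap page, keep scanning */
		break;			/* found one, reuse this offset */
	}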
* Re: [GIT PULL] mm: frontswap (for 3.2 window)
2011-10-28 15:21 ` Dan Magenheimer
@ 2011-10-31 18:16 ` Andrea Arcangeli
  0 siblings, 0 replies; 175+ messages in thread
From: Andrea Arcangeli @ 2011-10-31 18:16 UTC (permalink / raw)
To: Dan Magenheimer
Cc: Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet

On Fri, Oct 28, 2011 at 08:21:31AM -0700, Dan Magenheimer wrote:
> real users and real distros and real products waiting, so if there
> are any real issues, let's get them resolved.

We already told you what the real issues are, and you have done nothing so far to address them; so much has been built on top of the flawed API that I guess an earthquake of massive scale would have to hit to actually convince Xen to change any of the huge amount of code built on it.

I don't know the exact Xen details (it's possible the Xen design doesn't allow the four issues below to be fixed, I've no idea), but for all the other non-virt usages (compressed swap/compressed pagecache, RAMster) I doubt it is impossible to change the design of the tmem API to address at least one of the basic, huge troubles that such an API imposes:

1) 4k page limit (no way to handle hugepages)

   Ok, swapcache and pagecache are always 4k, but that may change. Plus, it's generally flawed these days to add a new API that people will build code on which can't handle hugepages; at least hugetlbfs should be handled. Especially considering it was born for virt: in virt space we only work with hugepages.

2) synchronous

3) not zerocopy: requires one bounce buffer for every get and one bounce buffer again for every put (like highmem I/O with 32bit pci)

   In my view, point 3 is definitely fixable for swapcache compression and pagecache compression: there's no way we can accept a copy before starting to compress the data. The source for the compression algorithm must be the _userland_ page, but instead you copy first and compress from the copy destination; correct me if I'm wrong.

4) can't handle batched requests

   This requires one vmexit for each 4k page accessed if the KVM hypervisor wants to access tmem. There's no way we would want to use this in KVM; at most we could consider exiting every 2M page. It is impossible to vmexit every 4k, or performance is destroyed and we'd run as slow as no-EPT/NPT.

Address these 4 points (or at least the ones that are solvable) and it'll become appealing. Or at least try to explain why it's impossible to solve all 4 points, to convince us this API is the best we can get for the non-virt usages (let's ignore Xen/KVM for the sake of this discussion, as Xen may have legitimate reasons why those 4 points are impossible to fix).

At the moment it still looks to me like a legacy-compatibility API meant to make life easier for Xen users: a limited API (at least it's simpler, I'd agree it is simpler this way) to share cache across different guests, which tries to impose those 4 limits above (and horrendous performance in accessing tmem from a Xen guest, but still faster than I/O, isn't it? :) even on the non-virt usages.

Even for frontswap, there is no way we can accept doing synchronous bounce buffers for every single 4k page that is going to hit swap. That's worse than HIGHMEM 32bit... Obviously you must be mlocking all Oracle db memory, so you won't ever hit that bounce buffering with Oracle.
Also note, historically there's nobody that hated bounce buffers more than Oracle (at least I remember the highmem issues with pci32 cards :). Also, Oracle was the biggest user of hugetlbfs. So it sounds weird that you like an API that forces cache-destroying bounce buffering and 4k page units on everything that passes through it.

If I'm wrong please correct me, I haven't had a lot of time to check the code. But we already raised these points before without much answer.

Thanks,
Andrea

^ permalink raw reply [flat|nested] 175+ messages in thread
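(Concretely, what points 3 and 4 amount to is a hook that takes struct page pointers in batches rather than one synchronous per-page call. Nothing like the following exists in the frontswap patchset; the name, signature, and semantics are purely illustrative assumptions.)

	/* Purely hypothetical batched, page-pointer store hook
	 * illustrating points 3 and 4; not part of the patchset.
	 * A backend could compress or transmit straight from the
	 * source pages (no bounce buffer), and a hypervisor backend
	 * would pay one exit per batch of nr pages instead of one
	 * exit per page.  Returns the number of pages accepted. */
	int frontswap_put_pages(unsigned type, pgoff_t *offsets,
				struct page **pages, unsigned nr);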
* RE: [GIT PULL] mm: frontswap (for 3.2 window)
2011-10-31 18:16 ` Andrea Arcangeli
@ 2011-10-31 20:58 ` Dan Magenheimer
  0 siblings, 0 replies; 175+ messages in thread
From: Dan Magenheimer @ 2011-10-31 20:58 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet

> From: Andrea Arcangeli [mailto:aarcange@redhat.com]
> Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)

Hi Andrea --

Thanks for your input. It's good to have some real technical discussion about the core of tmem. I hope you will take the time to read and consider my reply, and comment on any disagreements.

OK, let's go over your concerns about the "flawed API."

> 1) 4k page limit (no way to handle hugepages)

FALSE. The API/ABI was designed from the beginning to handle different pagesizes. It can even handle more than one page size dynamically, though a different "pool" must be created on the kernel side for each pagesize. (At the risk of derision, remember I used to code for IA64, so I am very familiar with different pagesizes.) It is true that the current tmem _backends_ (Xen and zcache) reject pagesizes other than 4K, but if there are "frontends" with a different pagesize, the API/ABI supports it.

For hugepages, I agree copying 2M seems odd. But talking about hugepages in the swap subsystem, I think we are talking about a very remote future. (Remember, cleancache is _already_ merged, so I'm limiting this to swap.) Perhaps in that far future, Intel will have an optimized "copy2M" instruction that can circumvent cache pollution?

> 2) synchronous

TRUE. (Well, mostly... RAMster is exploiting some asynchrony, but that's all still experimental.)

Remember, the whole point of tmem/cleancache/frontswap is environments where memory is scarce and CPU is plentiful, which is increasingly common (especially in virtualization). We all cut our teeth on kernel work in an environment where saving every CPU cycle was important, but in these new memory-constrained many-core environments the majority of CPU cycles are idle. So does it really matter whether the CPU is idle because it is waiting on the disk, versus being used for synchronous copying/compression/dedup? See the published Xen benchmarks: CPU utilization goes up, but throughput goes up too. Why? Because physical memory is being used more efficiently.

Also, IMHO the reason the frontswap hooks and the cleancache hooks can be so simple and elegant, and can support many different users, is precisely because the API/ABI is synchronous. If you change that, I think you will introduce all sorts of special cases and races and bugs on both sides of the ABI/API. And (IMHO) the end result is that most CPUs are still mostly sitting idle waiting for work to do.

> 3) not zerocopy: requires one bounce buffer for every get and one
> bounce buffer again for every put (like highmem I/O with 32bit pci)

Hmmm... not sure I understand this one. It IS copy-based, so it is not zerocopy; the page of data actually moves out of memory controlled/directly-addressable by the kernel into memory that is not controlled/directly-addressable by the kernel. But neither the Xen implementation nor the zcache implementation uses any bounce buffers, even when compressing or dedup'ing. So unless I misunderstand, this one is FALSE.

> 4) can't handle batched requests

TRUE.
Tell me again why a vmexit/vmenter per 4K page is "impossible"? Again, you are assuming (1) that the CPU had some real work to do instead and (2) that vmexit/vmenter is horribly slow. Even if vmexit/vmenter is thousands of cycles, it is still orders of magnitude faster than a disk access. And vmexit/vmenter is about the same order of magnitude as a page copy, and much faster than compression/decompression, both of which still result in a nice win.

You are also assuming that frontswap puts/gets are highly frequent. By definition they are not, because they are replacing single-page disk reads/writes due to swapping.

That said, the API/ABI is very extensible, so if it were proven that batching was sufficiently valuable, it could be added later... but I don't see it as a showstopper. Really, do you?

> worse than HIGHMEM 32bit... Obviously you must be mlocking all Oracle
> db memory, so you won't ever hit that bounce buffering with Oracle.
> Also note, historically there's nobody that hated bounce buffers more
> than Oracle (at least I remember the highmem issues with pci32 cards :).
> Also, Oracle was the biggest user of hugetlbfs.

I already noted that there are no bounce buffers. And Oracle is not pursuing this because of the Oracle _database_ (though it does work on single-node databases); while "Oracle" is often used to mean its eponymous database, tmem works on lots of workloads, and Oracle (even pre-Sun-merger) sells tons of non-DB software. In fact, I personally take some heat for putting more emphasis on getting tmem into Linux than on using it to proprietarily improve other Oracle products.

> If I'm wrong please correct me, I haven't had a lot of time to check
> the code. But we already raised these points before without much answer.

OK, so you're wrong on two of the points and I've corrected you. On the other two points, synchrony and non-batchability, you claim (1) that these are bad and (2) that there is a better way to achieve the same results with asynchrony and batchability. I do agree you've raised the points before, but I am pretty sure I've always given the same answers, so you shouldn't say that you haven't gotten "much answer", but rather that you disagree with the answer you got.

I've got working code, it's going into real distros and products, and it has growing usage by (non-Oracle) kernel developers as well as real users clamoring for it or already using it. You claim that making it asynchronous would be better, while I claim that it would make it impossibly complicated. (We'd essentially be rewriting, or creating a parallel, blkio subsystem.) You claim that a batch interface is necessary, while I claim that, if it is proven to be needed, it could be added later.

We've been talking about this since July 2009, right? If you can do it better, where's your code? I have the highest degree of respect for your abilities, and I have no doubt that you could do something similar for KVM over a long weekend... but can you also make it work for Xen, for in-kernel compression, and for cross-kernel clustering (not to mention for other "users" in my queue)? The foundation tmem code in the core kernel (frontswap and cleancache) is elegant in its simplicity and _it works_.

REALLY no disrespect intended, and I'm sorry if I am flaming, so let me calm down by quoting Linus from the LWN KS2011 article: "[Linus] stated that, simply, code that actually is used is code that is actually worth something... code aimed at solving the same problem is just a vague idea that is worthless by comparison...
Even if it truly is crap, we've had crap in the kernel before. The code does not get better out of tree."

So, please: all the other parts necessary for tmem are already in-tree, why all the resistance to frontswap?

Thanks, Dan

^ permalink raw reply [flat|nested] 175+ messages in thread
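(On the multiple-pagesize claim above: the tmem pool-creation flags word reserves bits for the page size, roughly as sketched below. This is a loose, from-memory reconstruction in the spirit of drivers/xen/tmem.c; the exact names, shift value, and encoding are assumptions to be checked against that file.)

	/* Loose sketch of how a tmem pool's flags word can carry a page
	 * size; names and values are assumptions, not quoted from the
	 * tree.  The size travels as log2(pagesize) - 12, so a 4K pool
	 * encodes 0 and a hypothetical 2M pool would encode 9. */
	#define TMEM_POOL_PAGESIZE_SHIFT	4

	static u32 tmem_pool_flags(u32 base_flags, unsigned long pagesize)
	{
		int pageshift = 0;

		while (pagesize > 1) {		/* pageshift = log2(pagesize) */
			pagesize >>= 1;
			pageshift++;
		}
		return base_flags | ((pageshift - 12) << TMEM_POOL_PAGESIZE_SHIFT);
	}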
* Re: [GIT PULL] mm: frontswap (for 3.2 window)
2011-10-31 20:58 ` Dan Magenheimer
@ 2011-10-31 22:37 ` Andrea Arcangeli
  0 siblings, 0 replies; 175+ messages in thread
From: Andrea Arcangeli @ 2011-10-31 22:37 UTC (permalink / raw)
To: Dan Magenheimer
Cc: Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet

On Mon, Oct 31, 2011 at 01:58:39PM -0700, Dan Magenheimer wrote:
> Hmmm... not sure I understand this one. It IS copy-based, so it is
> not zerocopy; the page of data actually moves out

Copy-based is my main problem; being synchronous is no big deal, I agree.

I mean, I don't see why you have to make one copy before you start compressing, and then write the output of the compression algorithm to disk. To me it looks like this API forces one more copy than necessary on zcache.

I can't see why this copy is necessary, and why zcache isn't working on "struct page" and core kernel structures instead of moving the memory off to a memory object invisible to the core VM.

> TRUE. Tell me again why a vmexit/vmenter per 4K page is
> "impossible"? Again, you are assuming (1) that the CPU had some

It's sure not impossible; it's just impossible that we would want it, as it'd be too slow.

> real work to do instead and (2) that vmexit/vmenter is horribly

Sure the CPU has another 1000 VMs to schedule. This is like saying virtio-blk isn't needed on desktop virt because the desktop isn't doing much I/O. Absurd argument: there are another 1000 desktops doing I/O at the same time, of course.

> slow. Even if vmexit/vmenter is thousands of cycles, it is still
> orders of magnitude faster than a disk access. And vmexit/vmenter

I fully agree tmem is faster for Xen than no tmem. That's not the point. We don't need such an elaborate hack hiding pages from the guest OS in order to share pagecache; our hypervisor is just a bit more powerful and has a function called file_read_actor that does what your tmem copy does...

> is about the same order of magnitude as a page copy, and much
> faster than compression/decompression, both of which still
> result in a nice win.

Saying it's a small overhead is not the same as saying it is _needed_. Why not add a udelay(1) in it too? Sure, it won't be noticeable.

> You are also assuming that frontswap puts/gets are highly
> frequent. By definition they are not, because they are
> replacing single-page disk reads/writes due to swapping.

They'll be as frequent as the highmem bounce buffers...

> That said, the API/ABI is very extensible, so if it were
> proven that batching was sufficiently valuable, it could
> be added later... but I don't see it as a showstopper.
> Really, do you?

That's fine with me... but like ->writepages, it'll take ages for the fs to switch from writepage to writepages. Considering this is a new API, I don't think it's unreasonable to ask that it at least handle zerocopy behavior immediately: show the userland mapping to the tmem layer so it can avoid the copy and read from the userland address. Xen will badly choke if it ever tries to do that, but zcache should be ok with it.

Now, there may be algorithms where the page must be stable, but others will be perfectly fine even if the page is changing under the compression; in that case the page won't be discarded and will be marked dirty again. So even if wrong data goes to disk, we'll rewrite it later.
I see no reason why there always has to be a copy before starting any compression/encryption, as long as the algorithm will not crash when its input data changes under it.

The ideal API would send down page pointers (handling compound pages too), not copies. Maybe with a flag where you can also specify offsets, so you can send down partial pages too, down to a byte granularity. The "copy input data before anything else can happen" design looks flawed to me. It is not flawed for Xen, because Xen has no knowledge of the guest "struct page", but here I'm talking about the non-virt usages.

> So, please: all the other parts necessary for tmem are
> already in-tree, why all the resistance to frontswap?

Well, my comments are generic, not specific to frontswap.

^ permalink raw reply [flat|nested] 175+ messages in thread
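(The interface sketched in prose above would look something like the following. Again purely illustrative: every name, parameter, and flag here is an assumption, not anything that exists in the patchset.)

	/* Hypothetical zerocopy put along the lines described above:
	 * the caller passes a (possibly compound) page plus a byte
	 * range, and the backend reads directly from the page.  With
	 * TMEM_PUT_UNSTABLE set, the backend tolerates the input
	 * changing during the put; the page stays dirty, so any stale
	 * data would be rewritten later.  Nothing here is in the
	 * patchset. */
	#define TMEM_PUT_UNSTABLE	0x1	/* input may change during put */

	int tmem_put_range(struct tmem_pool *pool, struct tmem_oid *oidp,
			   uint32_t index, struct page *page,
			   unsigned int offset, unsigned int len, u32 flags);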
* RE: [GIT PULL] mm: frontswap (for 3.2 window)
2011-10-31 22:37 ` Andrea Arcangeli
@ 2011-10-31 23:36 ` Dan Magenheimer
  0 siblings, 0 replies; 175+ messages in thread
From: Dan Magenheimer @ 2011-10-31 23:36 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet

> From: Andrea Arcangeli [mailto:aarcange@redhat.com]
> Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
>
> On Mon, Oct 31, 2011 at 01:58:39PM -0700, Dan Magenheimer wrote:
> > Hmmm... not sure I understand this one. It IS copy-based, so it is
> > not zerocopy; the page of data actually moves out
>
> Copy-based is my main problem; being synchronous is no big deal, I
> agree.
>
> I mean, I don't see why you have to make one copy before you start
> compressing, and then write the output of the compression algorithm
> to disk. To me it looks like this API forces one more copy than
> necessary on zcache.
>
> I can't see why this copy is necessary, and why zcache isn't working on
> "struct page" and core kernel structures instead of moving the memory
> off to a memory object invisible to the core VM.

Do you see code doing this? I am pretty sure zcache is NOT doing an extra copy; it is compressing from the source page. And I am pretty sure Xen tmem is not doing the extra copy either. Seth and I had discussed ADDING an extra copy in zcache, to make the synchronous/irq-disabled time shorter for puts and do the compression in a separate thread, but I don't think I have seen any patch to implement that. So if this is true (no extra copy), are you happy?

Maybe you are saying that the extra copy would be necessary in a KVM implementation of tmem? If so, I haven't thought about a KVM+tmem design enough to comment on that.

> > TRUE. Tell me again why a vmexit/vmenter per 4K page is
> > "impossible"? Again, you are assuming (1) that the CPU had some
>
> It's sure not impossible; it's just impossible that we would want it,
> as it'd be too slow.

You are clearly speculating here. Wouldn't it be nice to try it and find out?

> > real work to do instead and (2) that vmexit/vmenter is horribly
>
> Sure the CPU has another 1000 VMs to schedule. This is like saying
> virtio-blk isn't needed on desktop virt because the desktop isn't
> doing much I/O. Absurd argument: there are another 1000 desktops doing
> I/O at the same time, of course.

But this is truly different, I think, at least for the most common cases, because the guest is essentially out of physical memory if it is swapping. And the vmexit/vmenter (I assume, I don't really know KVM) gives the KVM scheduler the opportunity to schedule another of those 1000 VMs if it wishes. Also, I'll venture to guess (without any proof) that the path through the blkio subsystem to deal with any swap page and set up the disk I/O is not much shorter than the cost of a vmexit/vmenter on modern systems ;-) Now we are both speculating. :-)

> > slow. Even if vmexit/vmenter is thousands of cycles, it is still
> > orders of magnitude faster than a disk access. And vmexit/vmenter
>
> I fully agree tmem is faster for Xen than no tmem. That's not the
> point. We don't need such an elaborate hack hiding pages from the
> guest OS in order to share pagecache; our hypervisor is just a bit
> more powerful and has a function called file_read_actor that does what
> your tmem copy does...
Well then, either KVM doesn't need frontswap at all, and need not interfere with a patch that works fine for the other users, or Sasha and Neo will implement it and find that frontswap does (sometimes?) provide some benefits. In either case, I'm not sure why you would be objecting to merging frontswap.

> > is about the same order of magnitude as a page copy, and much
> > faster than compression/decompression, both of which still
> > result in a nice win.
>
> Saying it's a small overhead is not the same as saying it is _needed_.
> Why not add a udelay(1) in it too? Sure, it won't be noticeable.

Actually, the current implementation of RAMster over LAN adds quite a bit more than udelay(1). But that's all still experimental. It might be interesting to try adding udelay(1) in zcache to see if there is any noticeable effect.

> > You are also assuming that frontswap puts/gets are highly
> > frequent. By definition they are not, because they are
> > replacing single-page disk reads/writes due to swapping.
>
> They'll be as frequent as the highmem bounce buffers...

I don't understand. Sorry, I really am ignorant of highmem systems, as I grew up on PA-RISC and IA-64.

> > That said, the API/ABI is very extensible, so if it were
> > proven that batching was sufficiently valuable, it could
> > be added later... but I don't see it as a showstopper.
> > Really, do you?
>
> That's fine with me... but like ->writepages, it'll take ages for the
> fs to switch from writepage to writepages. Considering this is a new
> API, I don't think it's unreasonable to ask that it at least handle
> zerocopy behavior immediately: show the userland mapping to the
> tmem layer so it can avoid the copy and read from the userland
> address. Xen will badly choke if it ever tries to do that, but zcache
> should be ok with it.
>
> Now, there may be algorithms where the page must be stable, but others
> will be perfectly fine even if the page is changing under the
> compression; in that case the page won't be discarded and will be
> marked dirty again. So even if wrong data goes to disk, we'll
> rewrite it later. I see no reason why there always has to be a copy
> before starting any compression/encryption, as long as the algorithm
> will not crash when its input data changes under it.
>
> The ideal API would send down page pointers (handling
> compound pages too), not copies. Maybe with a flag where you can also
> specify offsets, so you can send down partial pages too, down to a byte
> granularity. The "copy input data before anything else can happen"
> design looks flawed to me. It is not flawed for Xen, because Xen has no
> knowledge of the guest "struct page", but here I'm talking about the
> non-virt usages.

Again, I think you are assuming things work differently than I think they do. I don't think there is an extra copy before the compression. And Xen isn't choking, nor is zcache. (Note that the Xen tmem implementation, like all of Xen soon, is 64-bit only... Seth recently fixed a bug keeping zcache from working on 32-bit highmem systems, so I know 32-bit works for zcache.) So if this is true (no extra copy), are you happy?

> > So, please: all the other parts necessary for tmem are
> > already in-tree, why all the resistance to frontswap?
>
> Well, my comments are generic, not specific to frontswap.

OK, but cleancache is already in-tree and open to any improvement ideas you may have. Frontswap is only using the existing ABI/API that cleancache already uses.
Thanks, Dan ^ permalink raw reply [flat|nested] 175+ messages in thread
* RE: [GIT PULL] mm: frontswap (for 3.2 window) @ 2011-10-31 23:36 ` Dan Magenheimer 0 siblings, 0 replies; 175+ messages in thread From: Dan Magenheimer @ 2011-10-31 23:36 UTC (permalink / raw) To: Andrea Arcangeli Cc: Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet > From: Andrea Arcangeli [mailto:aarcange@redhat.com] > Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window) > > On Mon, Oct 31, 2011 at 01:58:39PM -0700, Dan Magenheimer wrote: > > Hmmm... not sure I understand this one. It IS copy-based > > so is not zerocopy; the page of data is actually moving out > > copy-based is my main problem, being synchronous is no big deal I > agree. > > I mean, I don't see why you have to make one copy before you start > compressing and then you write to disk the output of the compression > algorithm. To me it looks like this API forces on zcache one more copy > than necessary. > > I can't see why this copy is necessary and why zcache isn't working on > "struct page" on core kernel structures instead of moving the memory > off to a memory object invisible to the core VM. Do you see code doing this? I am pretty sure zcache is NOT doing an extra copy, it is compressing from the source page. And I am pretty sure Xen tmem is not doing the extra copy either. Seth and I had discussed ADDING the extra copy in zcache to make the synchronous/irq-disabled time shorter for puts and doing the compression as a separate thread, but I don't think I have seen any patch to implement that. So if this is true (no extra copy), are you happy? Maybe you are saying that the extra copy would be necessary in a KVM implementation of tmem? If so, I haven't thought about a KVM+tmem design enough to comment on that. > > TRUE. Tell me again why a vmexit/vmenter per 4K page is > > "impossible"? Again you are assuming (1) the CPU had some > > It's sure not impossible, it's just impossible we want it as it'd be > too slow. You are clearly speculating here. Wouldn't it be nice to try it and find out? > > real work to do instead and (2) that vmexit/vmenter is horribly > > Sure the CPU has another 1000 VM to schedule. This is like saying > virtio-blk isn't needed on desktop virt becauase the desktop isn't > doing much I/O. Absurd argument, there are another 1000 desktops doing > I/O at the same time of course. But this is truly different, I think at least for the most common cases, because the guest is essentially out of physical memory if it is swapping. And the vmexit/vmenter (I assume, I don't really know KVM) gives the KVM scheduler the opportunity to schedule another of those 1000 VMs if it wishes. Also I'll venture to guess (without any proof) that the path through the blkio subsystem to deal with any swap page and set up the disk I/O is not much shorter than the cost of a vmexit/vmenter on modern systems ;-) Now we are both speculating. :-) > > slow. Even if vmexit/vmenter is thousands of cycles, it is still > > orders of magnitude faster than a disk access. And vmexit/vmenter > > I fully agree tmem is faster for Xen than no tmem. That's not the > point, we don't need such an articulate hack hiding pages from the > guest OS in order to share pagecache, our hypervisor is just a bit > more powerful and has a function called file_read_actor that does what > your tmem copy does... 
Well either then KVM doesn't need frontswap at all and need not be interfering with a patch that works fine for the other users, or Sasha and Neo will implement it and find that frontswap does (sometimes?) provide some benefits. In either case, I'm not sure why you would be objecting to merging frontswap. > > is about the same order of magnitude as page copy, and much > > faster than compression/decompression, both of which still > > result in a nice win. > > Saying it's a small overhead, is not like saying it is _needed_. Why > not add a udelay(1) in it too? Sure it won't be noticeable. Actually the current implementation of RAMster over LAN adds quite a bit more than udelay(1). But that's all still experimental. It might be interesting to try adding udelay(1) in zcache to see if there is any noticeable effect. > > You are also assuming that frontswap puts/gets are highly > > frequent. By definition they are not, because they are > > replacing single-page disk reads/writes due to swapping. > > They'll be as frequent as the highmem bounce buffers... I don't understand. Sorry, I really am ignorant of highmem systems as I grew up on PA-RISC and IA-64. > > That said, the API/ABI is very extensible, so if it were > > proven that batching was sufficiently valuable, it could > > be added later... but I don't see it as a showstopper. > > Really do you? > > That's fine with me... but like ->writepages it'll take ages for the > fs to switch from writepage to writepages. Considering this is a new > API I don't think it's unreasonable to ask at least it to handle > immediately zerocopy behavior. So showing the userland mapping to the > tmem layer so it can avoid the copy and read from the userland > address. Xen will badly choke if ever tries to do that, but zcache > should be ok with that. > > Now there may be algorithms where the page must be stable, but others > will be perfectly fine even if the page is changing under the > compression, and in that case the page won't be discarded and it'll be > marked dirty again. So even if a wrong data goes on disk, we'll > rewrite later. I see no reason why there has always to be a copy > before starting any compression/encryption as long as the algorithm > will not crash its input data isn't changing under it. > > The ideal API would be to send down page pointers (and handling > compound pages too), not to copy. Maybe with a flag where you can also > specify offsets so you can send down partial pages too down to a byte > granularity. The "copy input data before anything else can happen" > looks flawed to me. It is not flawed for Xen because Xen has no > knowledge of the guest "struct page" but her I'm talking about the > not-virt usages. Again, I think you are assuming things work differently than I think they do. I don't think there is an extra copy before the compression. And Xen isn't choking, nor is zcache. (Note that the Xen tmem implementation, as all of Xen will be soon, is 64-bit only... Seth recently fixed a bug keeping zcache from working in 32-bit highmem systems, so I know 32-bit works for zcache.) So if this is true (no extra copy), are you happy? > > So, please, all the other parts necessary for tmem are > > already in-tree, why all the resistance about frontswap? > > Well my comments are generic not specific to frontswap. OK, but cleancache is already in-tree and open to any improvement ideas you may have. Frontswap is only using the existing ABI/API that cleancache already uses. 
Thanks,
Dan

^ permalink raw reply [flat|nested] 175+ messages in thread
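On the "ideal API" point in Andrea's message above (sending down page pointers with offsets rather than copying), an interface sketch might look like the following. The signature is hypothetical; it only illustrates the shape of the change he proposes concretely later in the thread:

	/*
	 * Interface sketch only: the caller passes the struct page itself,
	 * plus offset/len for partial pages.  The backend can then
	 * kmap_atomic() the page (so 32-bit highmem works) and compress
	 * straight from the mapping, with no copy and no char * cast.
	 */
	int tmem_put(struct tmem_pool *pool, struct tmem_oid *oidp,
		     uint32_t index, struct page *page,
		     unsigned int offset, unsigned int len,
		     bool raw, bool ephemeral);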
* Re: [GIT PULL] mm: frontswap (for 3.2 window) 2011-10-31 23:36 ` Dan Magenheimer @ 2011-11-01 1:20 ` Andrea Arcangeli 0 siblings, 0 replies; 175+ messages in thread From: Andrea Arcangeli @ 2011-11-01 1:20 UTC (permalink / raw) To: Dan Magenheimer Cc: Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet

On Mon, Oct 31, 2011 at 04:36:04PM -0700, Dan Magenheimer wrote:
> Do you see code doing this? I am pretty sure zcache is
> NOT doing an extra copy, it is compressing from the source
> page. And I am pretty sure Xen tmem is not doing the
> extra copy either.

So below you describe put as a copy of a page from the kernel into the newly allocated PAM space... I guess there's some improvement needed in the documentation at least; a compression is sometimes done instead of a copy... I thought you always had to copy first, sorry.

/*
 * "Put" a page, e.g. copy a page from the kernel into newly allocated
 * PAM space (if such space is available).  Tmem_put is complicated by
 * a corner case: What if a page with matching handle already exists in
 * tmem?  To guarantee coherency, one of two actions is necessary: Either
 * the data for the page must be overwritten, or the page must be
 * "flushed" so that the data is not accessible to a subsequent "get".
 * Since these "duplicate puts" are relatively rare, this implementation
 * always flushes for simplicity.
 */
int tmem_put(struct tmem_pool *pool, struct tmem_oid *oidp, uint32_t index,
		char *data, size_t size, bool raw, bool ephemeral)
{
	struct tmem_obj *obj = NULL, *objfound = NULL, *objnew = NULL;
	void *pampd = NULL, *pampd_del = NULL;
	int ret = -ENOMEM;
	struct tmem_hashbucket *hb;

	hb = &pool->hashbucket[tmem_oid_hash(oidp)];
	spin_lock(&hb->lock);
	obj = objfound = tmem_obj_find(hb, oidp);
	if (obj != NULL) {
		pampd = tmem_pampd_lookup_in_obj(objfound, index);
		if (pampd != NULL) {
			/* if found, is a dup put, flush the old one */
			pampd_del = tmem_pampd_delete_from_obj(obj, index);
			BUG_ON(pampd_del != pampd);
			(*tmem_pamops.free)(pampd, pool, oidp, index);
			if (obj->pampd_count == 0) {
				objnew = obj;
				objfound = NULL;
			}
			pampd = NULL;
		}
	} else {
		obj = objnew = (*tmem_hostops.obj_alloc)(pool);
		if (unlikely(obj == NULL)) {
			ret = -ENOMEM;
			goto out;
		}
		tmem_obj_init(obj, hb, pool, oidp);
	}
	BUG_ON(obj == NULL);
	BUG_ON(((objnew != obj) && (objfound != obj)) || (objnew == objfound));
	pampd = (*tmem_pamops.create)(data, size, raw, ephemeral,
					obj->pool, &obj->oid, index);

So then .create calls zcache_pampd_create...
static void *zcache_pampd_create(char *data, size_t size, bool raw, int eph,
                                 ^^^^^^^^^^
				struct tmem_pool *pool, struct tmem_oid *oid,
				uint32_t index)
{
	void *pampd = NULL, *cdata;
	size_t clen;
	int ret;
	unsigned long count;
	struct page *page = (struct page *)(data);
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	struct zcache_client *cli = pool->client;
	uint16_t client_id = get_client_id_from_client(cli);
	unsigned long zv_mean_zsize;
	unsigned long curr_pers_pampd_count;
	u64 total_zsize;

	if (eph) {
		ret = zcache_compress(page, &cdata, &clen);

zcache_compress then does:

static int zcache_compress(struct page *from, void **out_va, size_t *out_len)
{
	int ret = 0;
	unsigned char *dmem = __get_cpu_var(zcache_dstmem);
	unsigned char *wmem = __get_cpu_var(zcache_workmem);
	char *from_va;

	BUG_ON(!irqs_disabled());
	if (unlikely(dmem == NULL || wmem == NULL))
		goto out;	/* no buffer, so can't compress */
	from_va = kmap_atomic(from, KM_USER0);
	mb();
	ret = lzo1x_1_compress(from_va, PAGE_SIZE, dmem, out_len, wmem);
	      ^^^^^^^^^

tmem is called from frontswap_put_page:

+int __frontswap_put_page(struct page *page)
+{
+	int ret = -1, dup = 0;
+	swp_entry_t entry = { .val = page_private(page), };
+	int type = swp_type(entry);
+	struct swap_info_struct *sis = swap_info[type];
+	pgoff_t offset = swp_offset(entry);
+
+	BUG_ON(!PageLocked(page));
+	BUG_ON(sis == NULL);
+	if (frontswap_test(sis, offset))
+		dup = 1;
+	ret = (*frontswap_ops.put_page)(type, offset, page);

In turn called by swap_writepage:

@@ -98,6 +99,12 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
 		unlock_page(page);
 		goto out;
 	}
+	if (frontswap_put_page(page) == 0) {
+		set_page_writeback(page);
+		unlock_page(page);
+		end_page_writeback(page);
+		goto out;
+	}
 	bio = get_swap_bio(GFP_NOIO, page, end_swap_bio_write);
 	if (bio == NULL) {
 		set_page_dirty(page);

And zcache-main.c has #ifdef for both frontswap and cleancache, and the above frontswap_ops.put_page points to the below zcache_frontswap_put_page, which even shows a local_irq_save() for the whole time of the compression... did you ever check irq latency with zcache+frontswap? Wonder what the RT folks will say about zcache+frontswap, considering local_irq_save is a blocker for preempt-RT.

#ifdef CONFIG_CLEANCACHE
#include <linux/cleancache.h>
#endif
#ifdef CONFIG_FRONTSWAP
#include <linux/frontswap.h>
#endif

#ifdef CONFIG_FRONTSWAP
/* a single tmem poolid is used for all frontswap "types" (swapfiles) */
static int zcache_frontswap_poolid = -1;

/*
 * Swizzling increases objects per swaptype, increasing tmem concurrency
 * for heavy swaploads.  Later, larger nr_cpus -> larger SWIZ_BITS
 */
#define SWIZ_BITS		4
#define SWIZ_MASK		((1 << SWIZ_BITS) - 1)
#define _oswiz(_type, _ind)	((_type << SWIZ_BITS) | (_ind & SWIZ_MASK))
#define iswiz(_ind)		(_ind >> SWIZ_BITS)

static inline struct tmem_oid oswiz(unsigned type, u32 ind)
{
	struct tmem_oid oid = { .oid = { 0 } };
	oid.oid[0] = _oswiz(type, ind);
	return oid;
}

static int zcache_frontswap_put_page(unsigned type, pgoff_t offset,
				struct page *page)
{
	u64 ind64 = (u64)offset;
	u32 ind = (u32)offset;
	struct tmem_oid oid = oswiz(type, ind);
	int ret = -1;
	unsigned long flags;

	BUG_ON(!PageLocked(page));
	if (likely(ind64 == ind)) {
		local_irq_save(flags);
		ret = zcache_put_page(LOCAL_CLIENT, zcache_frontswap_poolid,
					&oid, iswiz(ind), page);
		local_irq_restore(flags);
	}
	return ret;
}

/* returns 0 if the page was successfully gotten from frontswap, -1 if
 * was not present (should never happen!)
 */
static int zcache_frontswap_get_page(unsigned type, pgoff_t offset,
				struct page *page)
{
	u64 ind64 = (u64)offset;
	u32 ind = (u32)offset;
	struct tmem_oid oid = oswiz(type, ind);
	int ret = -1;

	BUG_ON(!PageLocked(page));
	if (likely(ind64 == ind))
		ret = zcache_get_page(LOCAL_CLIENT, zcache_frontswap_poolid,
					&oid, iswiz(ind), page);
	return ret;
}

/* flush a single page from frontswap */
static void zcache_frontswap_flush_page(unsigned type, pgoff_t offset)
{
	u64 ind64 = (u64)offset;
	u32 ind = (u32)offset;
	struct tmem_oid oid = oswiz(type, ind);

	if (likely(ind64 == ind))
		(void)zcache_flush_page(LOCAL_CLIENT, zcache_frontswap_poolid,
					&oid, iswiz(ind));
}

/* flush all pages from the passed swaptype */
static void zcache_frontswap_flush_area(unsigned type)
{
	struct tmem_oid oid;
	int ind;

	for (ind = SWIZ_MASK; ind >= 0; ind--) {
		oid = oswiz(type, ind);
		(void)zcache_flush_object(LOCAL_CLIENT,
					zcache_frontswap_poolid, &oid);
	}
}

static void zcache_frontswap_init(unsigned ignored)
{
	/* a single tmem poolid is used for all frontswap "types" (swapfiles) */
	if (zcache_frontswap_poolid < 0)
		zcache_frontswap_poolid =
			zcache_new_pool(LOCAL_CLIENT, TMEM_POOL_PERSIST);
}

static struct frontswap_ops zcache_frontswap_ops = {
	.put_page = zcache_frontswap_put_page,
	.get_page = zcache_frontswap_get_page,
	.flush_page = zcache_frontswap_flush_page,
	.flush_area = zcache_frontswap_flush_area,
	.init = zcache_frontswap_init
};

struct frontswap_ops zcache_frontswap_register_ops(void)
{
	struct frontswap_ops old_ops =
		frontswap_register_ops(&zcache_frontswap_ops);

	return old_ops;
}
#endif

#ifdef CONFIG_CLEANCACHE
static void zcache_cleancache_put_page(int pool_id,
				struct cleancache_filekey key,
				pgoff_t index, struct page *page)
{
	u32 ind = (u32) index;
	struct tmem_oid oid = *(struct tmem_oid *)&key;

	if (likely(ind == index))
		(void)zcache_put_page(LOCAL_CLIENT, pool_id, &oid, index, page);
}

static int zcache_cleancache_get_page(int pool_id,
				struct cleancache_filekey key,
				pgoff_t index, struct page *page)
{
	u32 ind = (u32) index;
	struct tmem_oid oid = *(struct tmem_oid *)&key;
	int ret = -1;

	if (likely(ind == index))
		ret = zcache_get_page(LOCAL_CLIENT, pool_id, &oid, index, page);
	return ret;
}

static void zcache_cleancache_flush_page(int pool_id,
				struct cleancache_filekey key,
				pgoff_t index)
{
	u32 ind = (u32) index;
	struct tmem_oid oid = *(struct tmem_oid *)&key;

	if (likely(ind == index))
		(void)zcache_flush_page(LOCAL_CLIENT, pool_id, &oid, ind);
}

static void zcache_cleancache_flush_inode(int pool_id,
				struct cleancache_filekey key)
{
	struct tmem_oid oid = *(struct tmem_oid *)&key;

	(void)zcache_flush_object(LOCAL_CLIENT, pool_id, &oid);
}

static void zcache_cleancache_flush_fs(int pool_id)
{
	if (pool_id >= 0)
		(void)zcache_destroy_pool(LOCAL_CLIENT, pool_id);
}

static int zcache_cleancache_init_fs(size_t pagesize)
{
	BUG_ON(sizeof(struct cleancache_filekey) != sizeof(struct tmem_oid));
	BUG_ON(pagesize != PAGE_SIZE);
	return zcache_new_pool(LOCAL_CLIENT, 0);
}

static int zcache_cleancache_init_shared_fs(char *uuid, size_t pagesize)
{
	/* shared pools are unsupported and map to private */
	BUG_ON(sizeof(struct cleancache_filekey) != sizeof(struct tmem_oid));
	BUG_ON(pagesize != PAGE_SIZE);
	return zcache_new_pool(LOCAL_CLIENT, 0);
}

static struct cleancache_ops zcache_cleancache_ops = {
	.put_page = zcache_cleancache_put_page,
	.get_page = zcache_cleancache_get_page,
	.flush_page = zcache_cleancache_flush_page,
	.flush_inode = zcache_cleancache_flush_inode,
	.flush_fs = zcache_cleancache_flush_fs,
	.init_shared_fs = zcache_cleancache_init_shared_fs,
	.init_fs = zcache_cleancache_init_fs
};

struct cleancache_ops zcache_cleancache_register_ops(void)
{
	struct cleancache_ops old_ops =
		cleancache_register_ops(&zcache_cleancache_ops);

	return old_ops;
}
#endif

This zcache functionality is all but pluggable if you have to create a new, slightly different zcache implementation for each user (frontswap/cleancache etc...).

And the cast of the page to char * when it enters tmem:

static int zcache_put_page(int cli_id, int pool_id, struct tmem_oid *oidp,
				uint32_t index, struct page *page)
{
	struct tmem_pool *pool;
	int ret = -1;

	BUG_ON(!irqs_disabled());
	pool = zcache_get_pool_by_id(cli_id, pool_id);
	if (unlikely(pool == NULL))
		goto out;
	if (!zcache_freeze && zcache_do_preload(pool) == 0) {
		/* preload does preempt_disable on success */
		ret = tmem_put(pool, oidp, index, (char *)(page),
				PAGE_SIZE, 0, is_ephemeral(pool));

is so weird... and then it returns to being a page when it exits tmem and enters zcache again in zcache_pampd_create. And the "len" gets lost at some point inside zcache, but I guess that's fixable and not part of the API at least... but the whole thing looks like an exercise in passing through tmem. I don't really understand why one page must become a char * at some point and what benefit it would ever provide.

I also don't understand how you plan to ever swap out the compressed data, considering it's held outside of the kernel, no longer in a struct page. If swap compression were done right, the on-disk data would be stored in the compressed format in a compact way, so you spend the CPU once and you also gain disk speed by writing less. How do you plan to achieve this with this design?

I like the failing when the size of the compressed data is bigger than the uncompressed one; only in that case should the data go to swap uncompressed, of course. That's something we can handle in software that hardware can't handle so well, and that's why some older hardware compression for RAM probably didn't take off.

I have a hard time being convinced this is the best way to do swap compression, especially not seeing how it will ever reach swap on disk. But yes, it's not doing an additional copy, unlike what the tmem_put comment would imply (it's disabling irqs for the whole duration of the compression though).

^ permalink raw reply [flat|nested] 175+ messages in thread
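As a side note on the swizzling macros quoted above, a small worked example (ordinary userspace C, for illustration only; the macros are copied verbatim from the zcache excerpt) shows how a (type, offset) pair is split into an object id and an intra-object index, and that the pair recovers the offset:

	#include <stdio.h>
	#include <stdint.h>

	/* copied from the zcache excerpt above */
	#define SWIZ_BITS 4
	#define SWIZ_MASK ((1 << SWIZ_BITS) - 1)
	#define _oswiz(_type, _ind) ((_type << SWIZ_BITS) | (_ind & SWIZ_MASK))
	#define iswiz(_ind) (_ind >> SWIZ_BITS)

	int main(void)
	{
		unsigned type = 1;		/* swap "type" (swapfile number) */
		uint32_t offset = 0x123;	/* page offset within that swapfile */
		uint32_t oid0 = _oswiz(type, offset);	/* -> 0x13 */
		uint32_t index = iswiz(offset);		/* -> 0x12 */
		uint32_t back = (index << SWIZ_BITS) | (oid0 & SWIZ_MASK);

		printf("oid[0]=0x%x index=0x%x offset=0x%x\n", oid0, index, back);
		return 0;	/* prints oid[0]=0x13 index=0x12 offset=0x123 */
	}

So with SWIZ_BITS == 4, consecutive swap offsets are spread across 16 tmem objects per swaptype, which is what increases hash-bucket concurrency under heavy swap load.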
* RE: [GIT PULL] mm: frontswap (for 3.2 window) 2011-11-01 1:20 ` Andrea Arcangeli @ 2011-11-01 16:41 ` Dan Magenheimer 0 siblings, 0 replies; 175+ messages in thread From: Dan Magenheimer @ 2011-11-01 16:41 UTC (permalink / raw) To: Andrea Arcangeli Cc: Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet

> From: Andrea Arcangeli [mailto:aarcange@redhat.com]
> Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
>
> On Mon, Oct 31, 2011 at 04:36:04PM -0700, Dan Magenheimer wrote:
> > Do you see code doing this? I am pretty sure zcache is
> > NOT doing an extra copy, it is compressing from the source
> > page. And I am pretty sure Xen tmem is not doing the
> > extra copy either.
>
> So below you describe put as a copy of a page from the kernel into the
> newly allocated PAM space... I guess there's some improvement needed
> in the documentation at least; a compression is sometimes done instead
> of a copy... I thought you always had to copy first, sorry.

I suppose this documentation (note, it is in drivers/staging/zcache, not in the proposed frontswap patchset) could be misleading. It is really tough, in a short comment, to balance between describing the general concept to readers trying to understand the big picture and the high level of detail needed if you are trying to really understand what is going on in the code. But one can always read the code.

> zcache_compress then does:
>
>	ret = lzo1x_1_compress(from_va, PAGE_SIZE, dmem, out_len, wmem);
>	      ^^^^^^^^
>
> tmem is called from frontswap_put_page.
>
> In turn called by swap_writepage:
>
> the above frontswap_ops.put_page points to the below
> zcache_frontswap_put_page which even shows a local_irq_save() for the
> whole time of the compression... did you ever check irq latency with
> zcache+frontswap? Wonder what the RT folks will say about
> zcache+frontswap considering local_irq_save is a blocker for preempt-RT.

This is a known problem: zcache is currently not very good for high-response RT environments because it currently compresses a page of data with interrupts disabled, which takes (IIRC) about 20000 cycles. (I suspect, though without proof, that this is not the worst irq-disabled path in the kernel.)

As noted earlier, this is fixable at the cost of the extra copy, which could be implemented as an option later if needed. Or, as always, the RT folks can just not enable zcache. Or maybe smarter developers than me will find a solution that will work even better.

Also, yes, as I said, zcache currently is written to assume 4k pagesize, but the tmem.c code/API (see below for more on that file) is pagesize-independent.

> And zcache-main.c has #ifdef for both frontswap and cleancache
>
> #ifdef CONFIG_CLEANCACHE
> #include <linux/cleancache.h>
> #endif
> #ifdef CONFIG_FRONTSWAP
> #include <linux/frontswap.h>
> #endif

Yeah, remember zcache was merged before either cleancache or frontswap, so this ugliness was necessary to get around the chicken-and-egg problem. Zcache will definitely need some work before it is ready to move out of staging, and your feedback here is useful for that, but I don't see that as condemning frontswap, do you?

> This zcache functionality is all but pluggable if you have to create a
> new, slightly different zcache implementation for each user
> (frontswap/cleancache etc...).
Not quite sure what you are saying here, but IIUC, the alternative was to push the tmem semantics up into the hooks (e.g. into swapfile.c). This is what the very first tmem patch did, before I was advised to (1) split cleancache and frontswap so that they could be reviewed separately and (2) move the details of tmem into a different "layer" (cleancache.c/h and frontswap.c/h). So in order to move ugliness out of the core VM, a bit more ugliness is required in the tmem shim/backend.

>	struct page *page = (struct page *)(data);
>	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> And the cast of the page to char * when it enters tmem:
>
>	ret = tmem_put(pool, oidp, index, (char *)(page),
>			PAGE_SIZE, 0, is_ephemeral(pool));
>
> is so weird... and then it returns to being a page when it exits tmem
> and enters zcache again in zcache_pampd_create.
>
> And the "len" gets lost at some point inside zcache, but I guess that's
> fixable and not part of the API at least... but the whole thing looks
> like an exercise in passing through tmem. I don't really understand why
> one page must become a char * at some point and what benefit it would
> ever provide.

This is the "fix highmem" bug fix from Seth Jennings. The file tmem.c in zcache is an attempt to separate out the core tmem functionality and data structures so that it can (eventually) be in the lib/ directory and be used by multiple backends. (RAMster uses tmem.c unchanged.) The code in tmem.c reflects my "highmem-blindness", in that a single pointer is assumed to be able to address the "PAMPD" (as opposed to a struct page * and an offset, necessary for a 32-bit highmem system). Seth cleverly discovered this ugly two-line fix that (at least for now) avoided major mods to tmem.c.

> I also don't understand how you plan to ever swap out the compressed
> data, considering it's held outside of the kernel, no longer in a
> struct page. If swap compression were done right, the on-disk data
> would be stored in the compressed format in a compact way, so you
> spend the CPU once and you also gain disk speed by writing less. How
> do you plan to achieve this with this design?

First, ignoring frontswap, there is currently no way to move a page of swap data from one swap device to another swap device except by moving it first into RAM (in the swap cache), right? Frontswap doesn't solve that problem either, though it would be cool if it could. The "partial swapoff" functionality in the patch, added so that it can be called from frontswap_shrink, enables pages to be pulled out of frontswap into swap cache so that they can be moved, if desired/necessary, onto a real swap device.

The selfballooning code in drivers/xen calls frontswap_shrink to pull swap pages out of the Xen hypervisor when memory pressure is reduced. Frontswap_shrink is not yet called from zcache.

Note, however, that unlike swap-disks, compressed pages in frontswap CAN be silently moved to another "device". This is the foundation of RAMster, which moves those compressed pages to the RAM of another machine. The device _could_ be some special type of real-swap-disk, I suppose.

> I like the failing when the size of the compressed data is bigger than
> the uncompressed one; only in that case should the data go to swap
> uncompressed, of course. That's something we can handle in software
> that hardware can't handle so well, and that's why some older hardware
> compression for RAM probably didn't take off.
Yes, this is a good example of the most important feature of tmem/frontswap: Every frontswap_put can be rejected, for whatever reason the tmem backend chooses, entirely dynamically. Not only is it true that hardware can't handle this well, but the Linux block I/O subsystem can't handle it either. I've suggested in the frontswap documentation that this is also a key to allowing "mixed RAM + phase-change RAM" systems to be useful.

I think this is also why many linux vm/vfs/fs/bio developers "don't like it much" (where "it" is cleancache or frontswap). They are not used to losing control of data to some other non-kernel-controlled entity, and not used to being told "NO" when they are trying to move data somewhere. IOW, they are control freaks, and tmem is out of their control, so it must be defeated ;-)

> I have a hard time being convinced this is the best way to do swap
> compression, especially not seeing how it will ever reach swap on
> disk. But yes, it's not doing an additional copy, unlike what the
> tmem_put comment would imply (it's disabling irqs for the whole
> duration of the compression though).

I hope the earlier explanation about frontswap_shrink helps. It's also good to note that the only other successful Linux implementation of swap compression is zram, and zram's creator fully supports frontswap (https://lkml.org/lkml/2011/10/28/8).

So where are we now? Are you now supportive of merging frontswap? If not, can you suggest any concrete steps that will gain your support?

Thanks,
Dan

^ permalink raw reply [flat|nested] 175+ messages in thread
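To spell out the backend side of the "every put may be rejected" contract Dan describes above, here is an illustrative sketch (not code from the patchset; all toy_* helpers are hypothetical) of a backend refusing work it cannot usefully do, with the caller then simply writing the page to the real swap device via the swap_writepage() fallback quoted earlier in the thread:

	/* Illustrative sketch only; the toy_* names are hypothetical. */
	static int toy_frontswap_put_page(unsigned type, pgoff_t offset,
					  struct page *page)
	{
		void *cdata;
		size_t clen;

		if (toy_pool_full())		/* backend low on space */
			return -1;		/* caller falls back to disk */
		if (toy_compress(page, &cdata, &clen) != 0)
			return -1;		/* no per-cpu buffer, etc. */
		if (clen >= PAGE_SIZE)		/* compression didn't help */
			return -1;		/* better stored raw on disk */
		toy_store(type, offset, cdata, clen);
		return 0;			/* accepted: no disk I/O needed */
	}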
* Re: [GIT PULL] mm: frontswap (for 3.2 window) 2011-11-01 16:41 ` Dan Magenheimer @ 2011-11-01 18:07 ` Andrea Arcangeli 0 siblings, 0 replies; 175+ messages in thread From: Andrea Arcangeli @ 2011-11-01 18:07 UTC (permalink / raw) To: Dan Magenheimer Cc: Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet

On Tue, Nov 01, 2011 at 09:41:38AM -0700, Dan Magenheimer wrote:
> I suppose this documentation (note, it is in drivers/staging/zcache,
> not in the proposed frontswap patchset) could be misleading. It is

Yep, I got the comment from tmem.c in staging, and the lwn link I read before reading the tmem_put comment also only mentioned tmem_put doing a copy. So I erroneously assumed that all memory passing through tmem was being copied and that you lost the reference to the "struct page" when it entered zcache. But instead there is this obscure cast of a "struct page *" to a "char *", which is cast back to a struct page * from a char * in zcache code, and kmap() runs on the page, to avoid the unnecessary copy.

So far so good; now the question is why you have that cast at all. I mean, it's hard to be convinced of the sanity of an API that requires the caller to cast a "struct page *" to a "char *" to run zerocopy. And that is the very core tmem_put API I'm talking about.

I assume the explanation of the cast is: before, it was passing page_address(page) to tmem, but that breaks with highmem because highmem requires kmap(page). So then you cast the page. This basically proves the API must be fixed. In the kernel we work with _pages_, not char *, exactly for this reason, and tmem_put must be fixed to take a page structure. (In fact, better would be an array of pages and start/end ranges for each entry in the array, but hey, at least a page+len would be sane.) A char * is flawed, and the cast of the page to char * and back to struct page kind of proves it. So I think that must be fixed in tmem_put. Unfortunately it's already merged with this cast back and forth in the upstream kernel.

About the rest of zcache: I think it's interesting, but because it works inside tmem I'm unsure how we're going to write it to disk.

It would also be nice to understand why local_irq_save is needed for frontswap but not for pagecache. All that VM code never runs from irqs, so it's hard to see how the irq disabling is relevant. A big fat comment on why local_irq_save is needed in zcache code (in staging already) would be helpful. Maybe it's tmem that can run from irq? The only thing running from irqs is the tlb flush and the I/O completion handlers; everything else in the VM isn't irq/softirq driven, so we never have to clear irqs.

My feeling is that this zcache should be based on a memory pool abstraction that we can write to disk with a bio, working with "pages". I'm also not sure how you balance the pressure in the tmem pool, when you fail the allocation and swap to disk, or when you keep moving to compressed swap.

> This is a known problem: zcache is currently not very
> good for high-response RT environments because it currently
> compresses a page of data with interrupts disabled, which
> takes (IIRC) about 20000 cycles. (I suspect though, without proof,
> that this is not the worst irq-disabled path in the kernel.)
That's certainly more than the irq latency, so it's probably something the rt folks don't want, and yes, they should keep it in mind and not use frontswap+zcache in embedded RT environments. Besides, there was no benchmark comparing zram performance to zcache performance, so latency aside, we miss a lot of info.

> As noted earlier, this is fixable at the cost of the extra copy
> which could be implemented as an option later if needed.
> Or, as always, the RT folks can just not enable zcache.
> Or maybe smarter developers than me will find a solution
> that will work even better.

And what is the exact reason for the local_irq_save if it is doing zerocopy?

> Yeah, remember zcache was merged before either cleancache or
> frontswap, so this ugliness was necessary to get around the
> chicken-and-egg problem. Zcache will definitely need some
> work before it is ready to move out of staging, and your
> feedback here is useful for that, but I don't see that as
> condemning frontswap, do you?

What I'd like is a mechanism where you:

1) add swapcache to zcache (with fallback to swap immediately if zcache
   allocation fails)

2) when some threshold is hit or zcache allocation fails, write the
   compressed data in a compact way to swap (freeing zcache memory), or
   write swapcache directly to swap if no zcache is present

3) newly added swapcache is added to zcache (old zcache was written to
   the swap device compressed and freed)

(A rough sketch of this flow appears below.) Once we have already done the compression, it's silly to write the uncompressed data to disk. OK, initially it's ok, because compacting the stuff on disk is super tricky, but we want a design that will allow writing the zcache to disk and adding new swapcache to zcache, instead of the current way of swapping the new swapcache to disk uncompressed and not being able to write out the compressed zcache. If nobody called zcache_get and uncompressed it, it means it's probably less likely to be used than the newly added swapcache that wants to be compressed.

I'm afraid adding frontswap in this form will still get us stuck in the wrong model, and most of it will have to be dropped and rewritten to do just the above 3 points I described to do proper swap compression. Also, I'm skeptical we need to pass through tmem at all to do that. I mean, done right, the swap compression could be a feature to enable across the board without needing tmem at all. Then, if you want to add ramster, just add a frontswap on the already compressed swapcache... before it goes to the hard swap device.

The final swap design must also include the pre-swapout from Avi, writing data to swapcache in advance and relying on the dirty bit to rewrite it. And the pre-swapin as well (original idea from Con). The pre-swapout would need to stop before compressing. The pre-swapin should stop before decompressing. I mean, I see a huge potential for improvement in the swap space; I just guess most are busy with more pressing issues. Like James said, most data centers don't use swap, desktop is irrelevant, and android (as relevant as the data center) doesn't use swap.

But your frontswap improvements don't look like the right direction if you really want to improve swap for the long term. It may be better than nothing, but I don't see it going the way it should go, and I prefer to remove the tmem dependency from zcache altogether. Zcache alone would be way more interesting. And tmem_put must be fixed to take a page; that cast of a page to char * to avoid crashing on highmem is not allowed.
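A rough sketch of the three-step flow proposed above (hedged pseudo-C illustrating the proposal; every zpool_* helper and swap_writepage_uncompressed are hypothetical names, nothing here exists in the patchset):

	static int compressed_swapout(struct page *page)
	{
		/* 1) try to add the new swapcache page to the compressed pool */
		if (zpool_add(page) == 0)
			return 0;		/* stays in RAM, compressed */

		/*
		 * 2) threshold hit or allocation failed: write the coldest
		 * compressed pages to the swap device, still compressed and
		 * packed together, freeing pool memory and shrinking the I/O
		 */
		while (zpool_over_threshold())
			zpool_writeback_compressed_to_swap();

		/* 3) retry; if it still fails, swap this page out uncompressed */
		if (zpool_add(page) == 0)
			return 0;
		return swap_writepage_uncompressed(page);
	}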
Of course I didn't have the time to read 100% of the code, so please correct me again if I misunderstood something.

> This is the "fix highmem" bug fix from Seth Jennings. The file
> tmem.c in zcache is an attempt to separate out the core tmem
> functionality and data structures so that it can (eventually)
> be in the lib/ directory and be used by multiple backends.
> (RAMster uses tmem.c unchanged.) The code in tmem.c reflects
> my "highmem-blindness" in that a single pointer is assumed to
> be able to address the "PAMPD" (as opposed to a struct page *
> and an offset, necessary for a 32-bit highmem system). Seth
> cleverly discovered this ugly two-line fix that (at least for now)
> avoided major mods to tmem.c.

Well, you need to do the major mods; it's not ok to do that cast, and passing pages is correct instead. Let's fix the tmem_put API before people can use it wrong. Maybe then I'll dislike passing through tmem less? Dunno.

 int tmem_put(struct tmem_pool *pool, struct tmem_oid *oidp, uint32_t index,
-		char *data, size_t size, bool raw, bool ephemeral)
+		struct page *page, size_t size, bool raw, bool ephemeral)

> First ignoring frontswap, there is currently no way to move a
> page of swap data from one swap device to another swap device
> except by moving it first into RAM (in the swap cache), right?

Yes.

> Frontswap doesn't solve that problem either, though it would
> be cool if it could. The "partial swapoff" functionality
> in the patch, added so that it can be called from frontswap_shrink,
> enables pages to be pulled out of frontswap into swap cache
> so that they can be moved if desired/necessary onto a real
> swap device.

The whole logic deciding the size of the frontswap zcache is going to be messy. But to do the real swapout you should not pull the memory out of the frontswap zcache; you should write it to disk compacted and compressed, compared to how it was inserted in frontswap... That would be the ideal.

> The selfballooning code in drivers/xen calls frontswap_shrink
> to pull swap pages out of the Xen hypervisor when memory pressure
> is reduced. Frontswap_shrink is not yet called from zcache.

So I wonder how zcache is dealing with the dynamic size. Or does it have a fixed size? How do you pull pages out of zcache to max out the real RAM availability?

> Note, however, that unlike swap-disks, compressed pages in
> frontswap CAN be silently moved to another "device". This is
> the foundation of RAMster, which moves those compressed pages
> to the RAM of another machine. The device _could_ be some
> special type of real-swap-disk, I suppose.

Yeah, you can do ramster with frontswap+zcache, but not while writing the zcache to disk into the swap device. Writing to disk doesn't require new allocations. Migrating to another node does, and you must deal with OOM conditions there, or it'll deadlock. So the basic should be to write compressed data to disk (which at least can be done reliably for swapcache, unlike ramster, which has the same issues as nfs swapping and nbd swapping and iscsi swapping) before wondering whether to send it to another node.

> Yes, this is a good example of the most important feature of
> tmem/frontswap: Every frontswap_put can be rejected for whatever reason
> the tmem backend chooses, entirely dynamically. Not only is it true
> that hardware can't handle this well, but the Linux block I/O subsystem
> can't handle it either. I've suggested in the frontswap documentation
> that this is also a key to allowing "mixed RAM + phase-change RAM"
> systems to be useful.
Yes, what is not clear is how the size of the zcache is chosen.

> Also I think this is also why many linux vm/vfs/fs/bio developers
> "don't like it much" (where "it" is cleancache or frontswap).
> They are not used to losing control of data to some other
> non-kernel-controlled entity and not used to being told "NO"
> when they are trying to move data somewhere. IOW, they are
> control freaks and tmem is out of their control so it must
> be defeated ;-)

Either tmem works on something that is a core MM structure and is compatible with all bios and operations we may want to do on memory, or I have a hard time thinking it's a good thing to try to make the memory it handles not kernel-controlled. To me, this non-kernel-controlled approach looks exactly like a requirement coming from Xen, not really something useful. There is no reason why a kernel abstraction should stay away from using kernel data structures like "struct page", just to cast it back from char * to struct page * when it needs to handle highmem in zcache. Something seriously wrong is going on there in API terms, so you can start by fixing that bit.

> I hope the earlier explanation about frontswap_shrink helps.
> It's also good to note that the only other successful Linux
> implementation of swap compression is zram, and zram's
> creator fully supports frontswap (https://lkml.org/lkml/2011/10/28/8)
>
> So where are we now? Are you now supportive of merging
> frontswap? If not, can you suggest any concrete steps
> that will gain your support?

My problem is that this is like zram: as mentioned, it only solves the compression. There is no way it can store the compressed data on disk. And this is way more complex than zram, and it only makes the pool size not fixed at swapon time... so a very, very small gain and huge complexity added (again, compared to zram). zram in fact required absolutely zero changes to the VM. So it's hard to see how this is overall better than zram. If we deal with that amount of complexity, we should at least be a little better than zram at runtime, while this is the same.

^ permalink raw reply [flat|nested] 175+ messages in thread
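For reference on the shrinking question raised above: a pressure-driven caller could use the frontswap_shrink() interface Dan described earlier in the thread to pull pages back into the swap cache when RAM frees up. A hedged sketch follows, assuming the frontswap_shrink(target_pages) and frontswap_curr_pages() interfaces described in the patchset's documentation; the 10% policy and the toy_high_watermark() helper are invented purely for illustration:

	static void maybe_shrink_frontswap(void)
	{
		unsigned long cur = frontswap_curr_pages();

		if (cur == 0)
			return;
		/* toy policy: when free memory is ample, ask the backend
		 * to give back 10% of its pages via the swap cache */
		if (global_page_state(NR_FREE_PAGES) > toy_high_watermark())
			frontswap_shrink(cur - cur / 10);
	}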
* Re: [GIT PULL] mm: frontswap (for 3.2 window) @ 2011-11-01 18:07 ` Andrea Arcangeli 0 siblings, 0 replies; 175+ messages in thread From: Andrea Arcangeli @ 2011-11-01 18:07 UTC (permalink / raw) To: Dan Magenheimer Cc: Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet On Tue, Nov 01, 2011 at 09:41:38AM -0700, Dan Magenheimer wrote: > I suppose this documentation (note, it is in drivers/staging/zcache, > not in the proposed frontswap patchset) could be misleading. It is Yep I gotten the comment from tmem.c in staging, and the lwn link I read before reading the tmem_put comment also only mentioned about tmem_put doing a copy. So I erroneously assumed that all memory passing through tmem was being copied and you lost reference of the "struct page" when it entered zcache. But instead there is this obscure cast of a "struct page *" to a "char *", that is casted back to a struct page * from a char * in zcache code, and kmap() runs on the page, to avoid the unnecessary copy. So far so good, now the question is why do you have that cast at all? I mean it's hard to be convinced on the sanity of on a API that requires the caller to cast a "struct page *" to a "char *" to run zerocopy. And well that is the very core tmem_put API I'm talking about. I assume the explanation of the cast is: before it was passing page_address(page) to tmem, but that breaks with highmem because highmem requires kmap(page). So then you casted the page. This basically proofs the API must be fixed. In the kernel we work with _pages_ not char *, exactly for this reason, and tmem_put must be fixed to take a page structure. (in fact better would be an array of pages and ranges start/end for each entry in the array but hey at least a page+len would be sane). A char * is flawed and the cast of the page to char * and back to struct page, kind of proofs it. So I think that must be fixed in tmem_put. Unfortunately it's already merged with this cast back and forth in the upstream kernel. About the rest of zcache I think it's interesting but because it works inside tmem I'm unsure how we're going to write it to disk. The local_irq_save would be nice to understand why it's needed for frontswap but not for pagecache. All that VM code never runs from irqs, so it's hard to see how the irq disabling is relevant. A bit fat comment on why local_irq_save is needed in zcache code (in staging already) would be helpful. Maybe it's tmem that can run from irq? The only thing running from irqs is the tlb flush and I/O completion handlers, everything else in the VM isn't irq/softirq driven so we never have to clear irqs. My feeling is this zcache should be based on a memory pool abstraction that we can write to disk with a bio and working with "pages". I'm also not sure how you balance the pressure in the tmem pool, when you fail the allocation and swap to disk, or when you keep moving to compressed swap. > This is a known problem: zcache is currently not very > good for high-response RT environments because it currently > compresses a page of data with interrupts disabled, which > takes (IIRC) about 20000 cycles. (I suspect though, without proof, > that this is not the worst irq-disabled path in the kernel.) 
That's certainly more than the irq latency, so it's probably something the RT folks don't want, and yes, they should keep it in mind and not use frontswap+zcache in embedded RT environments. Besides, there was no benchmark comparing zram performance to zcache performance, so latency aside we're missing a lot of info.

> As noted earlier, this is fixable at the cost of the extra copy
> which could be implemented as an option later if needed.
> Or, as always, the RT folks can just not enable zcache.
> Or maybe smarter developers than me will find a solution
> that will work even better.

And what is the exact reason for the local_irq_save when doing it zerocopy?

> Yeah, remember zcache was merged before either cleancache or
> frontswap, so this ugliness was necessary to get around the
> chicken-and-egg problem. Zcache will definitely need some
> work before it is ready to move out of staging, and your
> feedback here is useful for that, but I don't see that as
> condemning frontswap, do you?

What I'd like is a mechanism where you (a rough sketch follows below):

1) add swapcache to zcache (with fallback to swap immediately if the zcache allocation fails)

2) when some threshold is hit or the zcache allocation fails, write the compressed data in a compact way to swap (freeing zcache memory), or the swapcache directly to swap if no zcache is present

3) newly added swapcache is added to zcache (old zcache was written to the swap device compressed and freed)

Once we've already done the compression, it's silly to write the uncompressed data to disk. OK, initially it's fine, because compacting the stuff on disk is super tricky, but we want a design that will allow writing the zcache to disk and adding new swapcache to zcache, instead of the current way of swapping the new swapcache to disk uncompressed and not being able to write out the compressed zcache. If nobody called zcache_get and decompressed it, it's probably less likely to be used than the newly added swapcache that wants to be compressed.

I'm afraid adding frontswap in this form will still get us stuck in the wrong model, and most of it will have to be dropped and rewritten to do just the three points I described, to do proper swap compression. Also, I'm skeptical we need to pass through tmem at all to do that. I mean, done right, swap compression could be a feature to enable across the board without needing tmem at all. Then if you want to add RAMster, just add a frontswap on the already compressed swapcache... before it goes to the hard swap device.

The final swap design must also include the pre-swapout from Avi, writing data to swapcache in advance and relying on the dirty bit to rewrite it. And the pre-swapin as well (original idea from Con). The pre-swapout would need to stop before compressing. The pre-swapin should stop before decompressing.

I mean, I see a huge potential for improvement in the swap space, but I guess most are busy with more pressing issues; like James said, most data centers don't use swap, desktop is irrelevant, and android (as relevant as the data center) doesn't use swap. But your frontswap improvements don't look like the right direction if you really want to improve swap for the long term. It may be better than nothing, but I don't see it going the way it should go, and I'd prefer to remove the tmem dependency on zcache altogether. Zcache alone would be way more interesting. And tmem_put must be fixed to take a page; that cast of a page to char *, to avoid crashing on highmem, is not allowed.
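Concretely, the three steps above might look like this minimal sketch; every helper name here (zcache_store_compressed, zcache_full, zcache_writeback_oldest, swap_writepage_raw) is hypothetical, not an existing kernel or zcache function:

	/* Hypothetical swapout path: compress into zcache first, push
	 * old compressed pages to the swap device when the pool fills
	 * up, and fall back to a plain uncompressed writeout when the
	 * compressed store fails.  All helpers are made up. */
	static int swap_out_page(struct page *page, swp_entry_t entry)
	{
		/* 1) try to keep the page compressed in RAM */
		if (zcache_store_compressed(entry, page) == 0) {
			/* 2) over threshold: write the oldest compressed
			 *    data to the swap device in compact form,
			 *    freeing zcache memory for newer swapcache */
			if (zcache_full())
				zcache_writeback_oldest();
			/* 3) the newly added swapcache stays in zcache */
			return 0;
		}
		/* allocation failed: write this page uncompressed */
		return swap_writepage_raw(page, entry);
	}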
Of course I didn't have the time to read 100% of the code, so please correct me again if I misunderstood something.

> This is the "fix highmem" bug fix from Seth Jennings. The file
> tmem.c in zcache is an attempt to separate out the core tmem
> functionality and data structures so that it can (eventually)
> be in the lib/ directory and be used by multiple backends.
> (RAMster uses tmem.c unchanged.) The code in tmem.c reflects
> my "highmem-blindness" in that a single pointer is assumed to
> be able to address the "PAMPD" (as opposed to a struct page *
> and an offset, necessary for a 32-bit highmem system). Seth
> cleverly discovered this ugly two-line fix that (at least for now)
> avoided major mods to tmem.c.

Well, you need to do the major mods; it's not OK to do that cast, and passing pages is correct instead. Let's fix the tmem_put API before people can use it wrong. Maybe then I'll dislike passing through tmem less? Dunno.

 int tmem_put(struct tmem_pool *pool, struct tmem_oid *oidp, uint32_t index,
-	     char *data, size_t size, bool raw, bool ephemeral)
+	     struct page *page, size_t size, bool raw, bool ephemeral)

> First ignoring frontswap, there is currently no way to move a
> page of swap data from one swap device to another swap device
> except by moving it first into RAM (in the swap cache), right?

Yes.

> Frontswap doesn't solve that problem either, though it would
> be cool if it could. The "partial swapoff" functionality
> in the patch, added so that it can be called from frontswap_shrink,
> enables pages to be pulled out of frontswap into swap cache
> so that they can be moved if desired/necessary onto a real
> swap device.

The whole logic deciding the size of the frontswap zcache is going to be messy. But to do the real swapout you should not pull the memory out of the frontswap zcache: you should write it to disk compacted and still compressed, as it was when inserted into frontswap... That would be the ideal.

> The selfballooning code in drivers/xen calls frontswap_shrink
> to pull swap pages out of the Xen hypervisor when memory pressure
> is reduced. Frontswap_shrink is not yet called from zcache.

So I wonder how zcache is dealing with the dynamic size. Or does it have a fixed size? How do you pull pages out of zcache to max out the real RAM availability?

> Note, however, that unlike swap-disks, compressed pages in
> frontswap CAN be silently moved to another "device". This is
> the foundation of RAMster, which moves those compressed pages
> to the RAM of another machine. The device _could_ be some
> special type of real-swap-disk, I suppose.

Yeah, you can do RAMster with frontswap+zcache, but not write the zcache to disk into the swap device. Writing to disk doesn't require new allocations. Migrating to another node does. And you must deal with OOM conditions there, or it'll deadlock. So the basics should be to write the compressed data to disk (which at least can be done reliably for swapcache, unlike RAMster, which has the same issues as nfs swapping, nbd swapping and iscsi swapping) before wondering whether to send it to another node.

> Yes, this is a good example of the most important feature of
> tmem/frontswap: Every frontswap_put can be rejected for whatever reason
> the tmem backend chooses, entirely dynamically. Not only is it true
> that hardware can't handle this well, but the Linux block I/O subsystem
> can't handle it either. I've suggested in the frontswap documentation
> that this is also a key to allowing "mixed RAM + phase-change RAM"
> systems to be useful.
Yes, what is not clear is how the size of the zcache is chosen.

> Also I think this is also why many linux vm/vfs/fs/bio developers
> "don't like it much" (where "it" is cleancache or frontswap).
> They are not used to losing control of data to some other
> non-kernel-controlled entity and not used to being told "NO"
> when they are trying to move data somewhere. IOW, they are
> control freaks and tmem is out of their control so it must
> be defeated ;-)

Either tmem works on something that is a core MM structure and is compatible with all the bios and operations we may want to do on memory, or I have a hard time believing it's a good thing to try to make the memory it handles not kernel-controlled. This non-kernel-controlled approach to me looks exactly like a requirement coming from Xen, not really something generally useful.

There is no reason why a kernel abstraction should stay away from kernel data structures like "struct page", only to cast back from char * to struct page * when it needs to handle highmem in zcache. Something seriously wrong is going on there in API terms, so you can start by fixing that bit.

> I hope the earlier explanation about frontswap_shrink helps.
> It's also good to note that the only other successful Linux
> implementation of swap compression is zram, and zram's
> creator fully supports frontswap (https://lkml.org/lkml/2011/10/28/8)
>
> So where are we now? Are you now supportive of merging
> frontswap? If not, can you suggest any concrete steps
> that will gain your support?

My problem is that this is like zram: as mentioned, it only solves the compression. There is no way it can store the compressed data on disk. And this is way more complex than zram; its only addition is that the pool size isn't fixed at swapon time... so a very, very small gain for huge added complexity (again, compared to zram). zram in fact required absolutely zero changes to the VM. So it's hard to see how this is overall better than zram. If we take on that amount of complexity we should at least be a little better than zram at runtime, while this is the same.

^ permalink raw reply	[flat|nested] 175+ messages in thread
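The "every frontswap_put can be rejected" contract quoted above reduces, in code, to roughly the following sketch. frontswap_store() stands in for the patchset's put hook and __swap_writepage_raw() is a made-up name for the existing uncompressed writeout path, so neither signature is authoritative:

	/* Sketch of the rejection contract: the backend may refuse any
	 * page, for any reason, at any time; on refusal the swap path
	 * must fall back to ordinary block I/O to the swap device. */
	int swap_writepage_sketch(struct page *page,
				  struct writeback_control *wbc)
	{
		if (frontswap_store(page) == 0) {
			/* accepted: the page now lives in the tmem
			 * backend, no disk I/O happens at all */
			set_page_writeback(page);
			unlock_page(page);
			end_page_writeback(page);
			return 0;
		}
		/* rejected: write the page to the swap device as usual */
		return __swap_writepage_raw(page, wbc);
	}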
* RE: [GIT PULL] mm: frontswap (for 3.2 window)
  2011-11-01 18:07 ` Andrea Arcangeli
@ 2011-11-01 21:00 ` Dan Magenheimer
  -1 siblings, 0 replies; 175+ messages in thread
From: Dan Magenheimer @ 2011-11-01 21:00 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet

> From: Andrea Arcangeli [mailto:aarcange@redhat.com]
> Sent: Tuesday, November 01, 2011 12:07 PM
> To: Dan Magenheimer
> Cc: Pekka Enberg; Cyclonus J; Sasha Levin; Christoph Hellwig; David Rientjes; Linus Torvalds; linux-mm@kvack.org; LKML; Andrew Morton; Konrad Wilk; Jeremy Fitzhardinge; Seth Jennings; ngupta@vflare.org; Chris Mason; JBeulich@novell.com; Dave Hansen; Jonathan Corbet
> Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)

Hi Andrea --

Pardon me for complaining about my typing fingers, but it seems like you are making statements and asking questions as if you are not reading the whole reply before you start responding to the first parts. So it's going to be hard to answer each sub-thread in order. Let me hit a couple of the high points first.

> This basically proves the API must be fixed....

Let me emphasize and repeat: Many of your comments here are addressing zcache, which is a staging driver. You are commenting on intra-zcache APIs only, _not_ on the tmem ABI. I realize there is some potential confusion here since the file in the zcache directory is called tmem.c, but it is NOT defining or implementing the tmem ABI/API used by the kernel. The ONLY kernel API that needs to be debated here is the code in the frontswap patchset, which provides registration for a set of function pointers (see the struct frontswap_ops in frontswap.h in the patch) and provides the function calls (API) between the frontswap (and cleancache) "frontends" and the "backends" in the driver directory. The zcache file "tmem.c" is simply a very early attempt to tease out core operations and data structures that are likely to be common to multiple tmem users.

Everything in zcache (including tmem.c) is completely open to evolution as needed (by KVM or other users), and this will need to happen before zcache is promoted out of staging. So your comments will be very useful when "we" work on that promotion process.

So, I'm going to attempt to ignore the portions of your reply that comment specifically on zcache coding issues and reply to the parts that potentially affect acceptance of the frontswap patchset, but if I miss anything important to you, please let me know.

> About the rest of zcache, I think it's interesting, but because it works
> inside tmem I'm unsure how we're going to write it to disk.
>
> It would also be nice to understand why the local_irq_save is needed
> for frontswap but not for pagecache.

It's because the already-merged cleancache hooks that call cleancache_put are invoked from mm/vfs code where irqs are already disabled. This is not true for the hook calling frontswap_get, but since there's a lot of shared code, I disabled irqs for frontswap_get also.

> All that VM code never runs from
> irqs, so it's hard to see how the irq disabling is relevant. A big fat
> comment on why local_irq_save is needed in the zcache code (in staging
> already) would be helpful. Maybe it's tmem that can run from irq?
> The only thing running from irqs is the tlb flush and the I/O completion
> handlers; everything else in the VM isn't irq/softirq driven, so we
> never have to clear irqs.

Other than the fact that cleancache_put is called with irqs disabled (and, IIRC, sometimes cleancache_flush?) and the coding complications that causes, you are correct. Preemption does need to be disabled, though, and, IIRC, in some cases softirqs.

> My feeling is this zcache should be based on a memory pool abstraction
> that we can write to disk with a bio, working with "pages".

Possible, I suppose. But unless you can teach bio to deal with dynamically time-and-size-varying devices, you are not implementing the most important value of the tmem concept, you are just re-implementing zram. And, as I said, Nitin supports frontswap because it is better than zram for exactly this (dynamicity) reason: https://lkml.org/lkml/2011/10/28/8

> I'm also not sure how you balance the pressure in the tmem pool, when
> you fail the allocation and swap to disk, or when you keep moving to
> compressed swap.

Just like all existing memory management code, zcache depends on some heuristics, which can be improved as necessary over time. Some of the metrics that feed into the heuristics are in debugfs so they can be manipulated as zcache continues to develop. (See zv_page_count_policy_percent for example... yes, this is still a bit primitive. And before you start up about dynamic sizing, this is only a maximum.)

For Xen, there is a "memmax" for each guest, and Xen tmem disallows a guest from using a page for swap (tmem calls it a "persistent" pool) if it has reached its memmax. Thus unless a tmem-enabled guest is "giving", it can never expect to "get". For KVM, you can overcommit in the host, so you could choose a different heuristic... if you are willing to accept host swapping (which I think is evil :-)

> > This is a known problem: zcache is currently not very
> > good for high-response RT environments because it currently
> > compresses a page of data with interrupts disabled, which
> > takes (IIRC) about 20000 cycles. (I suspect though, without proof,
> > that this is not the worst irq-disabled path in the kernel.)
>
> That's certainly more than the irq latency, so it's probably something
> the RT folks don't want, and yes, they should keep it in mind and not
> use frontswap+zcache in embedded RT environments.

Well, you have yet to convince me that an extra copy is so damning, especially on a modern many-core CPU where it can be done in 256 cycles, and especially when the cache pollution from the copy is necessary for the subsequent compression anyway. But for now, yes, don't turn on zcache in embedded RT.

> Besides, there was no benchmark comparing zram performance to zcache
> performance, so latency aside we're missing a lot of info.

Think of zcache as zram PLUS dynamicity PLUS the ability to dynamically trade off memory utilization against compressed page cache.

> And what is the exact reason for the local_irq_save when doing it
> zerocopy?

(Answered above, I think? If not, let me know.)

> What I'd like is a mechanism where you:
>
> 1) add swapcache to zcache (with fallback to swap immediately if the
> zcache allocation fails)

Current swap code pre-selects the swap device several layers higher in the call chain, so this requires fairly major surgery on the swap subsystem... and the long bug-tail that implies.
> 2) when some threshold is hit or the zcache allocation fails, write the
> compressed data in a compact way to swap (freeing zcache memory),
> or the swapcache directly to swap if no zcache is present

Has efficient writing (and reading) of smaller-than-page chunks through blkio ever been implemented? I know compression can be done "behind the curtain" of many I/O devices, but I am unaware that the same functionality exists in the kernel. If it doesn't exist, this requires fairly major surgery on the blkio subsystem. If it does exist, I doubt the swap subsystem is capable of using it without major surgery.

> 3) newly added swapcache is added to zcache (old zcache was written to
> the swap device compressed and freed)
>
> Once we've already done the compression, it's silly to write the
> uncompressed data to disk. OK, initially it's fine, because compacting
> the stuff on disk is super tricky, but we want a design that will allow
> writing the zcache to disk and adding new swapcache to zcache, instead
> of the current way of swapping the new swapcache to disk uncompressed
> and not being able to write out the compressed zcache.
>
> If nobody called zcache_get and decompressed it, it's probably less
> likely to be used than the newly added swapcache that wants to be
> compressed.

Yeah, I agree that sounds like a cool high-level design for a swap subsystem rewrite. The problem is it doesn't provide the dynamicity that frontswap provides for virtualization and multiple physical machines (RAMster). It's just not as flexible. And do you really want to rewrite the swap subsystem anyway when a handful of frontswap hooks do the same thing (and more)?

> I'm afraid adding frontswap in this form will still get us stuck in
> the wrong model, and most of it will have to be dropped and rewritten
> to do just the three points I described, to do proper swap
> compression.

This is a red herring. I translate it as "your handful of hooks might interfere with some major effort that I've barely begun to design". And even if you DO code that major effort... the frontswap hooks are almost trivial and clearly separated from most of the core swap code... how do you know those hooks will interfere with your grand plan anyway? Do I have to quote Linus's statement from the KS2011 minutes again? :-)

> The final swap design must also include the pre-swapout from Avi,
> writing data to swapcache in advance and relying on the dirty bit to
> rewrite it. And the pre-swapin as well (original idea from Con). The
> pre-swapout would need to stop before compressing. The pre-swapin
> should stop before decompressing.

IIUC, you're talking about improvements to host-swapping here. That is (IMHO) putting lipstick on a pig. And, in any case, you are talking about significant swap subsystem changes that only help a single user, KVM. You seem to be measuring non-existent KVM patches by a different and easier standard than you are applying to a simple frontswap patchset that's been public for nearly three years.

> I mean, I see a huge potential for improvement in the swap space, but
> I guess most are busy with more pressing issues; like James said, most
> data centers don't use swap, desktop is irrelevant, and android (as
> relevant as the data center) doesn't use swap.

Yep. I agree that it is unlikely to get done. But James' data centers are running cgroups, not Xen, not KVM. And there is a proposed solution that exists today for Xen, and that KVM can at least attempt, if not heavily leverage.
> But your frontswap improvements don't look like the right direction if
> you really want to improve swap for the long term. It may be better
> than nothing, but I don't see it going the way it should go, and I'd
> prefer to remove the tmem dependency on zcache altogether. Zcache
> alone would be way more interesting.

There is no tmem dependency on zcache. Feel free to rewrite zcache entirely. It still needs the hooks in the frontswap patch, or something at least very similar.

> And tmem_put must be fixed to take a page; that cast of a page to
> char *, to avoid crashing on highmem, is not allowed.
>
> Of course I didn't have the time to read 100% of the code, so please
> correct me again if I misunderstood something.

Then feel free to rewrite that code... or wait until it gets fixed. I agree that it's unlikely that zcache will be promoted out of staging with that hack. That's all still unrelated to merging frontswap.

> > This is the "fix highmem" bug fix from Seth Jennings. The file
> > tmem.c in zcache is an attempt to separate out the core tmem
> > functionality and data structures so that it can (eventually)
> > be in the lib/ directory and be used by multiple backends.
> > (RAMster uses tmem.c unchanged.) The code in tmem.c reflects
> > my "highmem-blindness" in that a single pointer is assumed to
> > be able to address the "PAMPD" (as opposed to a struct page *
> > and an offset, necessary for a 32-bit highmem system). Seth
> > cleverly discovered this ugly two-line fix that (at least for now)
> > avoided major mods to tmem.c.
>
> Well, you need to do the major mods; it's not OK to do that cast, and
> passing pages is correct instead. Let's fix the tmem_put API before
> people can use it wrong. Maybe then I'll dislike passing through tmem
> less? Dunno.

Zcache doesn't need to pass through tmem.c. RAMster is using tmem.c but isn't even in staging yet.

> The whole logic deciding the size of the frontswap zcache is going to
> be messy.

It's not messy, and it is entirely dynamic. Finding the ideal heuristics for the maximum size, and for when and how much to decompress pages out of zcache back into the swap cache, I agree, is messy and will take some time. I'm still not sure how this is related to the proposed frontswap patch (which just provides some mechanism for the heuristics to drive).

> But to do the real swapout you should not pull the memory
> out of the frontswap zcache: you should write it to disk compacted and
> still compressed, as it was when inserted into frontswap... That would
> be the ideal.

Agreed, that would be cool... and very difficult to implement.

> > The selfballooning code in drivers/xen calls frontswap_shrink
> > to pull swap pages out of the Xen hypervisor when memory pressure
> > is reduced. Frontswap_shrink is not yet called from zcache.
>
> So I wonder how zcache is dealing with the dynamic size. Or does it
> have a fixed size? How do you pull pages out of zcache to max out the
> real RAM availability?

Dynamic. Pulled out with frontswap_shrink, see above.

> > Note, however, that unlike swap-disks, compressed pages in
> > frontswap CAN be silently moved to another "device". This is
> > the foundation of RAMster, which moves those compressed pages
> > to the RAM of another machine. The device _could_ be some
> > special type of real-swap-disk, I suppose.
>
> Yeah, you can do RAMster with frontswap+zcache, but not write the
> zcache to disk into the swap device. Writing to disk doesn't require
> new allocations. Migrating to another node does.
> And you must deal with OOM conditions there, or it'll deadlock. So the
> basics should be to write the compressed data to disk (which at least
> can be done reliably for swapcache, unlike RAMster, which has the same
> issues as nfs swapping, nbd swapping and iscsi swapping) before
> wondering whether to send it to another node.

I guess you are missing the key magic for RAMster, or really for tmem. Because everything in tmem is entirely dynamic (e.g. any attempt to put a page can be rejected), the "remote" machine has complete control over how many pages to accept from whom, and can manage its own needs as a higher priority. Think of a machine in RAMster as a KVM/Xen "host" for a bunch of virtual-machines-that-are-really-physical-machines. And it is all peer-to-peer, so each machine can act as a host when necessary.

None of this is possible through anything that exists today in the swap subsystem or the blkio subsystem. And RAMster runs on the same cleancache and frontswap hooks as Xen and zcache and, potentially, KVM. Yeah, the heuristics may be even harder for RAMster. But the first response to this thread (from Christoph) said that this stuff isn't sexy. Personally, I can't think of anything sexier than the first CROSS-MACHINE memory management subsystem in a mainstream OS. Again... NO additional core VM changes.

> > Yes, this is a good example of the most important feature of
> > tmem/frontswap: Every frontswap_put can be rejected for whatever reason
> > the tmem backend chooses, entirely dynamically. Not only is it true
> > that hardware can't handle this well, but the Linux block I/O subsystem
> > can't handle it either. I've suggested in the frontswap documentation
> > that this is also a key to allowing "mixed RAM + phase-change RAM"
> > systems to be useful.
>
> Yes, what is not clear is how the size of the zcache is chosen.

Is that answered clearly now?

> > Also I think this is also why many linux vm/vfs/fs/bio developers
> > "don't like it much" (where "it" is cleancache or frontswap).
> > They are not used to losing control of data to some other
> > non-kernel-controlled entity and not used to being told "NO"
> > when they are trying to move data somewhere. IOW, they are
> > control freaks and tmem is out of their control so it must
> > be defeated ;-)
>
> Either tmem works on something that is a core MM structure and is
> compatible with all the bios and operations we may want to do on
> memory, or I have a hard time believing it's a good thing to try to
> make the memory it handles not kernel-controlled.
>
> This non-kernel-controlled approach to me looks exactly like a
> requirement coming from Xen, not really something generally useful.

C'mon Andrea. You're an extremely creative guy and you are disappointing me. Think RAMster. Think a version of RAMster with a "memory server" (where the RAM expandability is in one server in a rack). Think fast SSDs that can be attached to one machine and shared by other machines. Think phase-change (or other future limited-write-cycle) RAM without a separate processor counting how many times a cell has been written. This WAS all about Xen a year or two ago. I haven't written a line of Xen code in over a year, because I am excited about the FULL value of tmem.

> There is no reason why a kernel abstraction should stay away from
> kernel data structures like "struct page", only to cast back
> from char * to struct page * when it needs to handle highmem in
> zcache. Something seriously wrong is going on there in API terms, so
> you can start by fixing that bit.
Yep, let's fix that problem in zcache. That is a stupid coding error by me, and irrelevant to frontswap and the bigger transcendent memory picture.

> > I hope the earlier explanation about frontswap_shrink helps.
> > It's also good to note that the only other successful Linux
> > implementation of swap compression is zram, and zram's
> > creator fully supports frontswap (https://lkml.org/lkml/2011/10/28/8)
> >
> > So where are we now? Are you now supportive of merging
> > frontswap? If not, can you suggest any concrete steps
> > that will gain your support?
>
> My problem is that this is like zram: as mentioned, it only solves the
> compression. There is no way it can store the compressed data on
> disk. And this is way more complex than zram; its only addition is
> that the pool size isn't fixed at swapon time... so a very, very small
> gain for huge added complexity (again, compared to zram). zram in fact
> required absolutely zero changes to the VM. So it's hard to see how
> this is overall better than zram. If we take on that amount of
> complexity we should at least be a little better than zram at runtime,
> while this is the same.

Zram required exactly ONE change to the VM, and Nitin placed it there AFTER he looked at how frontswap worked. Then he was forced down the "gotta do it as a device" path, which lost a lot of the value. Then, when he wanted to do compression on page cache, he found that the cleancache interface was perfect for it. Why does everyone keep telling me to "do it like zram" when the author of zram has seen the light? Did I mention Nitin's support for frontswap already? https://lkml.org/lkml/2011/10/28/8

So, I repeat, where are we now? Have I sufficiently answered your concerns and questions? Or are you going to go start coding to prove me wrong with a swap subsystem rewrite? :-)

Dan

^ permalink raw reply	[flat|nested] 175+ messages in thread
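The struct frontswap_ops registration interface Dan points to above has roughly the following shape. This is a sketch reconstructed from the discussion; the exact member names and signatures are those of frontswap.h in the posted patchset and may differ:

	/* Rough shape of the frontswap frontend/backend contract: a
	 * backend (zcache, Xen tmem, RAMster, ...) registers a set of
	 * function pointers, the swap path calls through them, and any
	 * nonzero return from put_page means "rejected, fall back to
	 * the swap device".  Names/signatures are approximate. */
	struct frontswap_ops {
		void (*init)(unsigned type);		/* at swapon */
		int (*put_page)(unsigned type, pgoff_t offset,
				struct page *page);	/* may refuse */
		int (*get_page)(unsigned type, pgoff_t offset,
				struct page *page);
		void (*invalidate_page)(unsigned type, pgoff_t offset);
		void (*invalidate_area)(unsigned type);	/* at swapoff */
	};

	/* Registration; returns any previously registered ops. */
	struct frontswap_ops frontswap_register_ops(struct frontswap_ops *ops);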
* Re: [GIT PULL] mm: frontswap (for 3.2 window)
  2011-11-01 21:00 ` Dan Magenheimer
@ 2011-11-02  1:31 ` Andrea Arcangeli
  -1 siblings, 0 replies; 175+ messages in thread
From: Andrea Arcangeli @ 2011-11-02 1:31 UTC (permalink / raw)
To: Dan Magenheimer
Cc: Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet

Hi Dan.

On Tue, Nov 01, 2011 at 02:00:34PM -0700, Dan Magenheimer wrote:
> Pardon me for complaining about my typing fingers, but it seems
> like you are making statements and asking questions as if you
> are not reading the whole reply before you start responding
> to the first parts. So it's going to be hard to answer each
> sub-thread in order. Let me hit a couple of the high points first.

I'm actually reading all of your reply; if I skip some part it may be because the email is too long already :). I'm just trying to understand it, and I wish I had more time to dedicate to this, but I have other pending stuff too.

> Let me emphasize and repeat: Many of your comments
> here are addressing zcache, which is a staging driver.
> You are commenting on intra-zcache APIs only, _not_ on the
> tmem ABI. I realize there is some potential confusion here
> since the file in the zcache directory is called tmem.c,
> but it is NOT defining or implementing the tmem ABI/API used

So where exactly is the tmem API, if it's not in tmem.c in staging/zcache? I read what is in the kernel and I comment on it.

> by the kernel. The ONLY kernel API that needs to be debated here
> is the code in the frontswap patchset, which provides

That code calls into zcache, which calls into tmem... so I'm not sure how you can expect us to focus only on frontswap. Also, I don't care that zcache is already merged in staging; I may still like to see changes happening there.

> registration for a set of function pointers (see the
> struct frontswap_ops in frontswap.h in the patch) and
> provides the function calls (API) between the frontswap
> (and cleancache) "frontends" and the "backends" in
> the driver directory. The zcache file "tmem.c" is simply
> a very early attempt to tease out core operations and
> data structures that are likely to be common to multiple
> tmem users.

It's a little hard to follow: so that tmem.c is not the real tmem, and it tries to mirror the real tmem that people will actually use, staying compatible for when the real tmem.c is used instead? It looks even weirder than I thought. Why isn't the real tmem.c in the zcache directory, instead of an attempt to tease out core operations and data structures? Maybe I just misunderstood what tmem.c in zcache is about.

> Everything in zcache (including tmem.c) is completely open
> to evolution as needed (by KVM or other users), and this
> will need to happen before zcache is promoted out of staging.
> So your comments will be very useful when "we" work
> on that promotion process.

Again, I don't care whether it's frontswap out-of-tree, or zcache-in-tree, or tmem-of-zcache-not-real-tmem-in-tree; I will comment on any of those. I understand your focus is to get frontswap merged, but on my side, to review frontswap I am forced to read zcache, and if I don't like something there, it's unlikely I can like the _caller_ of the zcache code either (i.e. frontswap_put). I don't see this as a showstopper on your side; if you agree, why don't you start fixing what is _in_tree_ first, and then submit frontswap again?
I mean, if you prove that what you're pushing in (and what you already pushed into staging) is the way to go, you shouldn't be so worried about frontswap being merged immediately. I think the merge of frontswap is going to work much better if you convince the whole VM camp that what is in tree (zcache/tmem.c) is the way to go; then there won't be much opposition to merging frontswap and making the core VM add hooks for something already proven _worthwhile_.

I believe that if all the reviewers and commenters thought the zcache directory was the way to go to store swapcache before it hits swap, you wouldn't have much trouble adding changes to the core VM to put swapcache into it. But when things get merged they never, or rarely, go out of tree, and the maintenance is then upon us and the whole community. So before adding core VM dependencies for zcache/tmem.c, it'd be nicer to be sure it's the way to go... I hope this explains why I am "forced" to look into tmem.c while commenting on frontswap.

> So, I'm going to attempt to ignore the portions of your
> reply that comment specifically on zcache coding
> issues and reply to the parts that potentially affect
> acceptance of the frontswap patchset, but if I miss anything
> important to you, please let me know.

So you call tmem_put taking a char * and not a page "zcache coding", but to me it's not even clear whether tmem would be equally happy with a page structure. Would that remain compatible with what you call above "multiple tmem users", or not? It's a little hard to see where Xen starts and where the kernel ends here. Can Xen make any use of the kernel code you pushed into staging yet? Where does the Xen API start there? I'd like to compare against the real tmem.c, if zcache/tmem.c isn't it.

And no, I don't imply the cast of the page to char * is a big problem at all; I assume you're perfectly right that it's a coding issue, and it may very well be a few-line fix. But then why not prove the point and fix it, instead of insisting we review frontswap as a standalone entity, as if it weren't calling the very code I am commenting on? And why defer the (albeit minor) fixing of the tmem API until after we're already calling tmem_put from the core VM? Or do you disagree with my proposed changes?

I don't think it's unreasonable to ask you to clean up and make more proper what is in tree already, before adding more "stuff" that depends on it and would have to be maintained _forever_ in the VM. I don't ask for perfection, but it'd be easier if you cleaned up what looks "fishy".

> Other than the fact that cleancache_put is called with
> irqs disabled (and, IIRC, sometimes cleancache_flush?)
> and the coding complications that causes, you are correct.
>
> Preemption does need to be disabled, though, and, IIRC,
> in some cases softirqs.

What code runs in softirqs that could race with zcache+frontswap?

BTW, I wonder if the tree_lock is disabling irqs only to avoid being interrupted during the critical section, as a performance optimization. Normally __delete_from_swap_cache would be way faster than a page compression. With page compression in, unless there's real code running from irqs, it's unlikely we want to insist on keeping irqs disabled: that is only a good optimization for fast code, and delaying irqs 20 times longer than normal isn't so good. It would be better if those hooks ran outside the tree_lock.

> Possible, I suppose.
> But unless you can teach bio to deal
> with dynamically time-and-size-varying devices, you are
> not implementing the most important value of the tmem concept,
> you are just re-implementing zram. And, as I said, Nitin
> supports frontswap because it is better than zram for
> exactly this (dynamicity) reason: https://lkml.org/lkml/2011/10/28/8

I don't think the bio should deal with that; the bios can write at hard-blocksize granularity. It should be up to an upper layer to find stuff to compact into tmem and write it out in a compact way, such that the "cookie" returned to the swapcache code can still read from it when tmem is asked for a .get operation.

BTW, this put/get naming is a bit misleading alongside get_page/put_page; maybe tmem_store/tmem_load would be more appropriate names for the API. Normally we call get on a page when we take a refcount on it, and put when we release it.

> Just like all existing memory management code, zcache depends
> on some heuristics, which can be improved as necessary over time.
> Some of the metrics that feed into the heuristics are in
> debugfs so they can be manipulated as zcache continues to
> develop. (See zv_page_count_policy_percent for example...
> yes, this is still a bit primitive. And before you start up
> about dynamic sizing, this is only a maximum.)

I wonder how you can tune it without any direct feedback from the VM pressure; VM pressure changes too fast to poll it. The zcache pool should shrink fast and be pushed to the real disk (ideally without requiring decompression and a regular swapout, but by compacting it and writing it out in a compact way) if, for example, there's mlocked memory growing.

> For Xen, there is a "memmax" for each guest, and Xen tmem disallows
> a guest from using a page for swap (tmem calls it a "persistent" pool)
> if it has reached its memmax. Thus unless a tmem-enabled guest
> is "giving", it can never expect to "get".
>
> For KVM, you can overcommit in the host, so you could choose a
> different heuristic... if you are willing to accept host swapping
> (which I think is evil :-)

Well, host swapping with zcache working on the host is theoretically (modulo implementation issues) going to be faster than anything else, because it won't run into any vmexits. Swapping in the guest is forced to go through vmexits to page out to disk (and now apparently even tmem 4k calls). Plus, we keep the whole system balanced in the host VM without having to try to mix what has been collected by each guest VM. It's like comparing host I/O vs guest I/O: host I/O is always faster by definition, and we have host swapping fully functional. The mmu notifier overhead is nothing compared to the regular pte tlb flushing and VM overhead (which you have to pay in the guest anyway to overcommit in the guest).

> Well, you have yet to convince me that an extra copy is
> so damning, especially on a modern many-core CPU where it
> can be done in 256 cycles, and especially when the cache
> pollution from the copy is necessary for the subsequent
> compression anyway.

I thought we agreed that the extra copy (like the highmem bounce buffers) destroying CPU caches was the worst possible thing; I'm not sure why you bring it back. To remove the irq disabling you just have to check which code could run from irq, and it's a bit hard to see what that would be... maybe it's fixable.

> Think of zcache as zram PLUS dynamicity PLUS the ability to dynamically
> trade off memory utilization against compressed page cache.
> Think of zcache as zram PLUS dynamicity PLUS ability to dynamically
> trade off memory utilization against compressed page cache.

It's hard to see how you can obtain dynamicity with no risk of going OOM prematurely if a mlockall() program suddenly tries to allocate all RAM in the system. You need some more hooks in the VM than the ones you have today, and that applies to cleancache too, I think. It starts to look like a big hack that works for VMs and can fall apart if exposed to the wrong workload, one that uses mlockall or similar. I understand these are corner cases, but we can't add cruft to the VM. You've no idea how many times I hear of people adding hooks here and there; last time it was to make mremap run with a binary-only module and move 2M pages allocated at boot and not visible to the VM. There is no way we can add hooks all over the place every time somebody invents something that helps a specific workload. What we add must work 100% for everything. So mlockall() using all RAM and triggering an OOM with zcache+cleancache enabled is not ok in my view. I think it can be fixed, so I don't mean it's no good, but somebody should work on fixing it, not just leave the code unchanged and keep pushing. I mean, this thing is presented as more capable than the current implementation actually is if it's claimed to be fully dynamic, and it looks like it can backfire on the wrong workload.

> > And what is the exact reason of the local_irq_save for doing it
> > zerocopy?
>
> (Answered above I think? If not, let me know.)

No, you didn't actually point to the exact line of code that runs from irq context, races with these hooks, and therefore requires disabling irqs. I still have no clue why irqs must be disabled. You now mentioned softirqs: what is the code running in softirqs that requires disabling irqs?

> > What I'd like is a mechanism where you:
> >
> > 1) add swapcache to zcache (with fallback to swap immediately if zcache
> > allocation fails)
>
> Current swap code pre-selects the swap device several layers
> higher in the call chain, so this requires fairly major surgery
> on the swap subsystem... and the long bug-tail that implies.

Well, you can still release the swapcache once tmem_put (better: tmem_store) succeeds. Then it's up to the zcache layer to allocate more swap entries and store the data in the swap in a compact way.

> > 2) when some threshold is hit or zcache allocation fails, we write the
> > compressed data in a compact way to swap (freeing zcache memory),
> > or swapcache directly to swap if no zcache is present
>
> Has efficient writing (and reading) of smaller-than-page chunks
> through blkio ever been implemented? I know compression can be
> done "behind the curtain" of many I/O devices, but am unaware
> that the same functionality exists in the kernel. If it doesn't
> exist, this requires fairly major surgery on the blkio subsystem.
> If it does exist, I doubt the swap subsystem is capable of using
> it without major surgery.

Again, I don't think compacting is the task of the I/O subsystem. Quite obviously not. Even reiser3 writes tails to disk compacted, and it surely doesn't require changes to the storage layer. The algorithm belongs to tmem, or whatever abstraction we stored the swapcache into.
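In concrete terms, the kind of upper-layer compaction I mean could look roughly like this (a hypothetical sketch only: the tmem_compact_* helpers do not exist, most error handling is omitted, and the sector mapping is naive -- real code would have to honor swap extents, cf. map_swap_page()):

    #include <linux/bio.h>
    #include <linux/swap.h>
    #include <linux/fs.h>

    /* Hypothetical helpers, for illustration only: */
    extern void tmem_compact_fill_page(struct page *page);
    extern struct block_device *tmem_compact_swap_bdev(void);
    extern void tmem_compact_end_write(struct bio *bio, int err);

    static int tmem_writeback_compacted(void)
    {
            struct page *page = alloc_page(GFP_KERNEL);
            swp_entry_t entry;
            struct bio *bio;

            if (!page)
                    return -ENOMEM;

            /* Pack several compressed objects into the page, recording
               a per-object cookie so a later .get can locate each one. */
            tmem_compact_fill_page(page);

            entry = get_swap_page();        /* allocate a swap slot */
            if (!entry.val) {
                    __free_page(page);
                    return -ENOMEM;
            }

            bio = bio_alloc(GFP_KERNEL, 1);
            bio->bi_sector = swp_offset(entry) << (PAGE_SHIFT - 9);
            bio->bi_bdev = tmem_compact_swap_bdev();
            bio_add_page(bio, page, PAGE_SIZE, 0);
            bio->bi_end_io = tmem_compact_end_write;
            submit_bio(WRITE, bio);
            return 0;
    }

The point being: only ordinary page-sized bios ever reach the block layer; the sub-page packing lives entirely in the tmem/zcache layer, just like tail packing lives entirely in the filesystem.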
> Yeah, I agree that sounds like a cool high-level design for a
> swap subsystem rewrite. Problem is it doesn't replace the dynamicity
> to do what frontswap does for virtualization and multiple physical
> machines (RAMster). Just not as flexible.
>
> And do you really want to rewrite the swap subsystem anyway
> when a handful of frontswap hooks do the same thing (and more)?

I have no plan to change the swap subsystem, and I don't think it requires a rewrite just to be improved. You are the one improving the swap subsystem by adding frontswap, not me, so I'd like the improvement to go in the right direction. If we add frontswap with tmem today, it must be possible tomorrow to write the compressed data, compacted, to the swap device without having to nuke frontswap. I mean, incremental steps are totally fine; it doesn't need to do this now, but it must be able to do it later. Which means tmem must somehow be able to attach its memory to bios, allocate swap entries with get_swap_page, and write the tmem memory there. I simply wouldn't like something that adds more work to do later when we want swap to improve further.

> > I'm afraid adding frontswap in this form will still get us stuck in
> > the wrong model and most of it will have to be dropped and rewritten
> > to do just the above 3 points I described to do proper swap
> > compression.
>
> This is a red herring. I translate this as "your handful of hooks
> might interfere with some major effort that I've barely begun to
> design". And even if you DO code that major effort... the
> frontswap hooks are almost trivial and clearly separated from
> most of the core swap code... how do you know those hooks will
> interfere with your grand plan anyway?

Hey, this is what I'm asking... I'm asking whether these hooks will interfere or not. Whether they're tailored for Xen, or whether they can make the Linux kernel VM better for the long run so that we can go ahead and swap the tmem memory to disk, compacted, later. I guess everything is possible, but the simpler the design the better, and I've no clue whether this is the simplest design.

> Do I have to quote Linus's statement from the KS2011 minutes
> again? :-)

Well, don't worry, it's not my decision whether things go in or not, and I tend to agree it's not huge work to remove frontswap later if needed. But it is quite apparent that you don't want to make changes at all and need it merged in this form. Which makes me wonder whether that is because of hidden Xen ABI issues in tmem.c or similar Xen constraints, or just because you think the code should change later, after it's all upstream, including frontswap.

> IIUC, you're talking about improvements to host-swapping here.
> That is (IMHO) putting lipstick on a pig. And, in any case, you
> are talking about significant swap subsystem changes that only help
> a single user, KVM. You seem to be already measuring non-existent

A single user, KVM? You've got to be kidding me. The whole basis of the KVM design, and why I refuse Xen, is: we never improve KVM, we improve the kernel for non-virt usages! And by improving the kernel for non-virt usages, we also improve KVM. KVM is just like firefox or apache or anything else that uses anonymous memory. I never thought of KVM in the context of the changes to the swap logic. Sure, they'll improve KVM too if we do them, but that'd be a side effect of having improved desktop swap behavior in general. We improve the kernel for non-virt usage to benchmark-beat Xen/vmware etc... There's nothing I'm doing in the VM that improves only KVM (even the mmu notifiers are used by GRU and such; more recently a new pci card doing remote DMA from AMD is using mmu notifiers too, and infiniband could as well). In fact I can't actually recall a single kernel change I did over the last few years that would improve only KVM :).
> KVM patches by a different/easier standard than you are applying
> to a simple frontswap patchset that's been public for nearly
> three years.

I'm perfectly fine if frontswap gets in... as long as it is the way to go for the future of the Linux VM. Ignore virt here, please: no KVM, no Xen (even no cgroups, just in case they could matter). Just tmem and bare metal.

> There is no tmem dependency on zcache. Feel free to rewrite
> zcache entirely. It still needs the hooks in the frontswap
> patch, or something at least very similar.

That I agree with; the hooks would probably be similar.

> Then feel free to rewrite that code.. or wait until it gets
> fixed. I agree that it's unlikely that zcache will be promoted
> out of staging with that hack. That's all still unrelated to
> merging frontswap.

frontswap doesn't go into staging, so the moment you add a dependency from the core VM code on staging/zcache code, we have to look into what is being called too... Otherwise we'll get hook requests every year from whatever new user does something a little weird. I'm not saying this is the case, but just reading the hooks and declaring them non-intrusive and quite similar to what would have to be done anyway isn't a convincing method. I will like this if I'm convinced that the tmem being called is the future way for the VM to handle compression dynamically, under the direct control of the Linux VM (which is needed, or it can't shrink when an mlockall program grows), and not some Xen hack that can't be modified or the Xen ABI breaks. You see, there's a whole lot of difference... Once it's proven that tmem is the future way for the VM to do dynamic compression and compaction of the data, plus writing it to disk when VM pressure increases, I don't think anybody will argue about the frontswap hooks.

> Zcache doesn't need to pass through tmem.c. RAMster is using tmem.c
> but isn't even in staging yet.

That's the feeling I had, in fact: it looked like zcache could work on its own without calling into tmem. But I guess tmem is still needed to manage the memory pooling used by zcache?

> > The whole logic deciding the size of the frontswap zcache is going to
> > be messy.
>
> It's not messy, and is entirely dynamic. Finding the ideal
> heuristics for the maximum size, and when and how much to
> decompress pages back out of zcache back into the swap cache,
> I agree, is messy and will take some time.

That's what I meant by messy... the heuristic to find the maximum zcache size. And that requires feedback from the VM to shrink fast if we're squeezed by mlocked RAM. And yes, it's better than zram without any doubt; there's no way to squeeze zram out... :) But the tradeoff is that with zram you lose a fixed amount of RAM, overall it should still help, and it's non-intrusive: it doesn't require a magic heuristic to size itself dynamically etc... The major benefits of zcache should be:

1) dynamic sizing (but adding complexity)

2) the ability later to compact the compressed memory and write it to disk compacted, when a shrink is requested by VM pressure (and by the core VM code)

> Still not sure how this is related to the proposed frontswap
> patch now (which just provides some mechanism for the heuristics
> to drive).
>
> > But to do the real swapout you should not pull the memory
> > out of frontswap zcache, you should write it to disk compacted and
> > compressed compared to how it was inserted in frontswap... That would
> > be the ideal.
>
> Agreed, that would be cool... and very difficult to implement.

Glad we agree :).
> Dynamic. Pulled out with frontswap_shrink, see above.

Got it now.

> I guess you are missing the key magic for RAMster, or really
> for tmem. Because everything in tmem is entirely dynamic (e.g.
> any attempt to put a page can be rejected), the "remote" machine
> has complete control over how many pages to accept from whom,
> and can manage its own needs as higher priority. Think of
> a machine in RAMster as a KVM/Xen "host" for a bunch of
> virtual-machines-that-are-really-physical-machines. And it
> is all peer-to-peer, so each machine can act as a host when
> necessary. None of this is possible through anything that
> exists today in the swap subsystem or blkio subsystem.
> And RAMster runs on the same cleancache and frontswap hooks
> as Xen and zcache and, potentially, KVM.
>
> Yeah, the heuristics may be even harder for RAMster. But
> the first response to this thread (from Christoph) said
> that this stuff isn't sexy. Personally I can't think of
> anything sexier than the first CROSS-MACHINE memory management
> subsystem in a mainstream OS. Again... NO additional core
> VM changes.

I see the point, and this discussion certainly helps to clarify it further (at least for me). Another question is whether you can stack these things on top of each other, like RAMster over zcache. Because if that's possible, you'd need to write a backend to write the tmem memory out to disk and allow tmem to swap that way. And then you could also use RAMster on compressed pagecache: a system with little RAM might want compression and, if we run out of room for pagecache, to share it through RAMster; and once we have it compressed, why not send it compressed to other tmem nodes in the cloud?

> Is that answered clearly now?

Yep :).

> Think RAMster. Think a version of RAMster with a "memory server"
> (where the RAM expandability is in one server in a rack). Think
> fast SSDs that can be attached to one machine and shared by other
> machines. Think phase-change (or other future limited-write-cycle)
> RAM without a separate processor counting how many times a cell
> has been written. This WAS all about Xen a year or two ago.
> I haven't written a line of Xen in over a year because I am
> excited about the FULL value of tmem.

I understand this. I'd just like to know how hackable this is, or whether the Xen dependency (which still remains) is a limitation on future extension or development. I mean, the value of the Xen part is zero in my view, so if we add something like this it should be hackable and free to evolve for the benefit of the core VM, regardless of whatever Xen API/ABI. That, in short, is my concern. Not much else... it doesn't need to be perfect as long as it's hackable and there is no resistance to fixing things like the below:

> Yep, let's fix that problem in zcache. That is a stupid
> coding error by me and irrelevant to frontswap and the bigger
> transcendent memory picture.

Ok! Glad to hear it.

> Zram required exactly ONE change to the VM, and Nitin placed it
> there AFTER he looked at how frontswap worked. Then he was forced
> down the "gotta do it as a device" path which lost a lot of the
> value. Then, when he wanted to do compression on page cache, he
> found that the cleancache interface was perfect for it. Why
> does everyone keep telling me to "do it like zram" when the author
> of zram has seen the light? Did I mention Nitin's support for
> frontswap already? https://lkml.org/lkml/2011/10/28/8
> So, I repeat, where are we now? Have I sufficiently answered
> your concerns and questions? Or are you going to go start
> coding to prove me wrong with a swap subsystem rewrite? :-)

My argument about zram is that currently frontswap+zcache adds very little value over zram (considering you said before there's no shrinker being called, and whatever heuristic you're using today won't be able to react in a timely fashion to an mlockall program growing fast). So if we add hooks in the core VM that depend on it, we need to be sure it's hackable and allowed to improve without worrying about breaking Xen later. I mean, Xen may still work if modified for it, but there shall be no such thing as an API or ABI that cannot be broken; otherwise it's better you add a few Xen-specific hacks and we evolve tmem separately from them. And the real value I would see from zcache+frontswap is if we can add into zcache/tmem the code to compact the fragments and write them into swap pages, kind of like tail packing in the filesystems in fs/ (absolutely unrelated to the blkdev layer).

If you confirm it's free to evolve and there's no ABI/API we get stuck with, I'm fairly positive about it. It's clearly "alpha" feature behavior (almost no improvement over zram today), but it could very well be in the right direction and give huge benefit compared to zram in the future. I definitely don't demand that things be perfect... but they must be in the right design direction for me to be sold on them. Just like KVM in virt space.

^ permalink raw reply [flat|nested] 175+ messages in thread
* RE: [GIT PULL] mm: frontswap (for 3.2 window) 2011-11-02 1:31 ` Andrea Arcangeli @ 2011-11-02 19:06 ` Dan Magenheimer 0 siblings, 0 replies; 175+ messages in thread From: Dan Magenheimer @ 2011-11-02 19:06 UTC (permalink / raw) To: Andrea Arcangeli Cc: Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet

> From: Andrea Arcangeli [mailto:aarcange@redhat.com]
> Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
>
> Hi Dan.
>
> On Tue, Nov 01, 2011 at 02:00:34PM -0700, Dan Magenheimer wrote:
> > Pardon me for complaining about my typing fingers, but it seems
> > like you are making statements and asking questions as if you
> > are not reading the whole reply before you start responding
> > to the first parts. So it's going to be hard to answer each
> > sub-thread in order. So let me hit a couple of the high
> > points first.
>
> I'm actually reading all of your reply; if I skip some part it may be
> because the email is too long already :). I'm just trying to
> understand it, and I wish I had more time to dedicate to this too, but
> I have other pending stuff as well.

Hi Andrea --

First, let me apologize for yesterday. I was unnecessarily sarcastic and disrespectful, and I am sorry. I very much appreciate your time and discussion, and the good hard technical questions that have allowed me to clarify some of the design and implementation under discussion.

I agree this email is too long, though it has been very useful. You've given some great feedback and insights for improving zcache, so let me be the first to cry "uncle" (surrender) and cut to the end....

> If you confirm it's free to evolve and there's no ABI/API we get stuck
> with, I'm fairly positive about it. It's clearly "alpha" feature
> behavior (almost no improvement over zram today), but it could very
> well be in the right direction and give huge benefit compared to zram
> in the future. I definitely don't demand that things be perfect... but
> they must be in the right design direction for me to be sold on
> them. Just like KVM in virt space.

Confirmed. Anything below the "struct frontswap_ops" (and "struct cleancache_ops"), that is, anything in the staging/zcache directory, is wide open for your ideas and improvement. In fact, I would very much welcome your contribution, and I think IBM and Nitin would also.

Thanks, Dan

^ permalink raw reply [flat|nested] 175+ messages in thread
* Re: [GIT PULL] mm: frontswap (for 3.2 window) 2011-11-02 19:06 ` Dan Magenheimer @ 2011-11-03 0:32 ` Andrea Arcangeli 0 siblings, 0 replies; 175+ messages in thread From: Andrea Arcangeli @ 2011-11-03 0:32 UTC (permalink / raw) To: Dan Magenheimer Cc: Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet

On Wed, Nov 02, 2011 at 12:06:02PM -0700, Dan Magenheimer wrote:
> First, let me apologize for yesterday. I was unnecessarily
> sarcastic and disrespectful, and I am sorry. I very much appreciate
> your time and discussion, and good hard technical questions
> that have allowed me to clarify some of the design and
> implementation under discussion.

No problem, I know it must be frustrating to wait so long to get something merged. Like somebody already pointed out (and I agree), it'd be nice to get the patches posted to the mailing list (with git send-email/hg email/quilt) and get them merged into -mm first.

About the subject, git is a super powerful tool; its design saved our day with kernel.org too. Awesome backend design (I have to admit, way better than the mercurial backend in the end, well, after packs were introduced) [even though the user interface is still horrible in my view, it's well worth the pain to learn in order to take advantage of the backend]. Pulls are an extremely scalable way to merge stuff, but they tend to hide things, and the VM/MM is such a critical piece of the kernel that in my view it's probably better to go through the pain of patchbombing linux-mm (maybe not lkml) and pass through -mm for merging. It's a less scalable approach, but it will get more eyes on the code, and if even a single bug is noticed that way, we all win. So I think you could try to submit origin/master..origin/tmem with Andrew and Hugh in CC and see if more comments show up.

> I agree this email is too long, though it has been very useful.

Sure, it was useful to me. I think it's normal and healthy if it gets down to lower-level issues and long emails... There are still a couple of unanswered issues left in that mail, but they're not major if they can be fixed.

> Confirmed. Anything below the "struct frontswap_ops" (and
> "struct cleancache_ops"), that is anything in the staging/zcache
> directory, is wide open for your ideas and improvement.
> In fact, I would very much welcome your contribution and
> I think IBM and Nitin would also.

Thanks. So this overall sounds fairly positive (or at least better than neutral) to me. The VM camp is large, so it'd be nice to get comments from others too, especially if they had time to read our exchange to see if their concerns were similar to mine. Hugh's knowledge of the swap path would really help (last time he added swapping to KSM).

On my side I hope it gets improved over time to get the best out of it. I've not been hugely impressed so far because, at this point in time, it doesn't seem a vast improvement in runtime behavior compared to what zram could provide; like Rik said, there's no iov/SG/vectored input to tmem_put (which I'd find more intuitive renamed to tmem_store), and like Avi said, RAMster is synchronous, and having to wait a long time is not good. But if we can make these plugins stackable and we can put a storage backend at the end, we could do storage+zcache+frontswap.
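Something along these lines is the shape I have in mind for a vectored store (a purely hypothetical sketch, not code from the patchset; the names are illustrative only):

    /*
     * Hypothetical vectored/SG store entry point of the kind
     * discussed above, with store/load naming instead of put/get.
     * Illustrative only; not part of the posted patchset.
     */
    struct tmem_io_vec {
            struct page *page;      /* source page */
            unsigned int offset;    /* byte offset within the page */
            unsigned int len;       /* bytes to store */
    };

    int tmem_store_sg(unsigned pool, pgoff_t index,
                      struct tmem_io_vec *vec, int nr_vecs);
    int tmem_load(unsigned pool, pgoff_t index, struct page *page);

A backend could then compress and pack all the fragments of one store in a single pass, instead of being handed exactly one page per call.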
It needs to have future potential to be worthwhile, considering it's not self-contained and actively modifies the core VM in a way that must be maintained over time. I think I already explained well enough in the previous long email what the reasons are that would make me like it or not. And if I don't like it, that wouldn't mean it won't get merged; like I wrote in the previous mail, it's not my decision, and I understand the distro issues you pointed out. Now that you've made clear there is no API/ABI in the staging/zcache directory to worry about, frankly I'm a lot happier; I thought at some point Xen would enter the equation in the tmem code. So I certainly don't want to take the slightest risk of stifling innovation by saying no to something that makes sense and is free to evolve :). ^ permalink raw reply [flat|nested] 175+ messages in thread
* RE: [GIT PULL] mm: frontswap (for 3.2 window) 2011-11-03 0:32 ` Andrea Arcangeli @ 2011-11-03 22:29 ` Dan Magenheimer 0 siblings, 0 replies; 175+ messages in thread From: Dan Magenheimer @ 2011-11-03 22:29 UTC (permalink / raw) To: Andrea Arcangeli Cc: Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet

> From: Andrea Arcangeli [mailto:aarcange@redhat.com]
> Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)

Hi Andrea --

Sorry for the delayed response... and for continuing this thread further, but I want to ensure I answer your points.

First, did you see my reply to Rik that suggested a design for how KVM could do batching with no change to the hooks or the frontswap_ops API? (Basically a guest-side cache, plus a batching op added to the KVM-tmem ABI.) I think it resolves your last remaining concern (too many vmexits), so I am eager to see if you agree.
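Roughly, the guest side of that design could look like this (a hypothetical sketch only: the names, the batch size, and the batched hypercall are illustrative, not actual patchset or KVM code; locking and the load/invalidate paths are omitted):

    #include <linux/slab.h>
    #include <linux/mm.h>

    #define TMEM_BATCH 32

    static struct {
            int nr;
            struct {
                    unsigned type;
                    pgoff_t offset;
                    void *data;     /* guest-side copy of the page */
            } ent[TMEM_BATCH];
    } tmem_batch;

    /* Hypothetical batched op in the KVM-tmem ABI. */
    extern void kvm_tmem_put_batch_hcall(void *batch);

    static int kvm_tmem_put(unsigned type, pgoff_t offset,
                            struct page *page)
    {
            void *buf = kmalloc(PAGE_SIZE, GFP_ATOMIC);

            if (!buf)
                    return -1;  /* reject: page goes to the swap device */

            /* Copy into the guest-side cache; assumes a lowmem page
               for brevity (a real version would kmap highmem). */
            memcpy(buf, page_address(page), PAGE_SIZE);

            tmem_batch.ent[tmem_batch.nr].type = type;
            tmem_batch.ent[tmem_batch.nr].offset = offset;
            tmem_batch.ent[tmem_batch.nr].data = buf;
            if (++tmem_batch.nr == TMEM_BATCH) {
                    int i;

                    kvm_tmem_put_batch_hcall(&tmem_batch); /* one vmexit */
                    for (i = 0; i < TMEM_BATCH; i++)
                            kfree(tmem_batch.ent[i].data);
                    tmem_batch.nr = 0;
            }
            return 0;
    }

So only one vmexit is paid per TMEM_BATCH puts instead of one per page, which is the whole point of the guest-side cache.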
RAMster already does "stacking", but by incorporating a copy of the zcache code. (I think that's just a code organization issue that can be resolved if/when RAMster goes into staging.) With these in mind, I hope you will now be even a "lot more happy now" with frontswap and MUCH better than neutral. :-) :-) Dan ^ permalink raw reply [flat|nested] 175+ messages in thread
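[To make the "stacking" remark above concrete, here is a minimal sketch of how chained registration could let backends stack, e.g. zcache in front of a remote RAMster-style pool. All names and signatures are illustrative assumptions, not the posted frontswap patches; locking and kernel details are omitted.]

/*
 * Hypothetical chain-of-responsibility registration for stacked
 * tmem backends.  Plain C sketch; not the actual frontswap_ops API.
 */
struct frontswap_ops {
    int (*put_page)(unsigned type, unsigned long offset, void *page);
    int (*get_page)(unsigned type, unsigned long offset, void *page);
    struct frontswap_ops *next;    /* previous head, for fall-through */
};

static struct frontswap_ops *frontswap_head;

/* Returns the previous head so the new backend can chain to it. */
static struct frontswap_ops *frontswap_register_ops(struct frontswap_ops *ops)
{
    struct frontswap_ops *old = frontswap_head;

    ops->next = old;
    frontswap_head = ops;
    return old;
}

/* A put walks the chain until some backend accepts the page. */
static int frontswap_put(unsigned type, unsigned long offset, void *page)
{
    struct frontswap_ops *ops;

    for (ops = frontswap_head; ops; ops = ops->next)
        if (ops->put_page(type, offset, page) == 0)
            return 0;
    return -1;    /* nobody took it: fall back to the real swap device */
}

[A put rejected by a full compressed pool would then fall through to, say, a remote pool, without the swap-subsystem hooks knowing anything about the chain.]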
* Re: [GIT PULL] mm: frontswap (for 3.2 window) 2011-10-31 23:36 ` Dan Magenheimer @ 2011-11-02 20:51 ` Rik van Riel -1 siblings, 0 replies; 175+ messages in thread From: Rik van Riel @ 2011-11-02 20:51 UTC (permalink / raw) To: Dan Magenheimer Cc: Andrea Arcangeli, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet On 10/31/2011 07:36 PM, Dan Magenheimer wrote: >> From: Andrea Arcangeli [mailto:aarcange@redhat.com] >>> real work to do instead and (2) that vmexit/vmenter is horribly >> >> Sure the CPU has another 1000 VM to schedule. This is like saying >> virtio-blk isn't needed on desktop virt because the desktop isn't >> doing much I/O. Absurd argument, there are another 1000 desktops doing >> I/O at the same time of course. > > But this is truly different, I think at least for the most common > cases, because the guest is essentially out of physical memory if it > is swapping. And the vmexit/vmenter (I assume, I don't really > know KVM) gives the KVM scheduler the opportunity to schedule > another of those 1000 VMs if it wishes. I believe the problem Andrea is trying to point out here is that the proposed API cannot handle a batch of pages to be pushed into frontswap/cleancache at one time. Even if the current back-end implementations are synchronous and can only do one page at a time, I believe it would still be a good idea to have the API able to handle a vector with a bunch of pages all at once. That way we can optimize the back-ends as required, at some later point in time. If enough people start using tmem, such bottlenecks will show up at some point :) ^ permalink raw reply [flat|nested] 175+ messages in thread
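[For readers wondering what Rik's vectored hook might look like, here is one possible shape. This is a hypothetical extension sketched by analogy with the single-page hooks; nothing like it was posted, and backend_put_page is a stand-in stub, not a real function.]

/*
 * Hypothetical vectored variant of the single-page put hook.
 * A simple backend just loops; a smarter backend (or a future
 * hypervisor ABI) could move the whole batch in one call.
 */
struct frontswap_page_req {
    unsigned type;           /* swap device number */
    unsigned long offset;    /* page slot within that device */
    void *page;              /* page contents */
};

/* Stub stand-in for whatever single-page op a backend provides today. */
static int backend_put_page(unsigned type, unsigned long offset, void *page)
{
    (void)type; (void)offset; (void)page;
    return 0;    /* pretend the backend accepted the page */
}

/*
 * Returns how many leading entries were stored; the caller writes
 * the remainder to the real swap device as usual.
 */
static int frontswap_put_vec(struct frontswap_page_req *vec, int nr)
{
    int i;

    for (i = 0; i < nr; i++)
        if (backend_put_page(vec[i].type, vec[i].offset, vec[i].page))
            break;    /* backend full or failing: stop here */
    return i;
}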
* RE: [GIT PULL] mm: frontswap (for 3.2 window) 2011-11-02 20:51 ` Rik van Riel @ 2011-11-02 21:14 ` Dan Magenheimer -1 siblings, 0 replies; 175+ messages in thread From: Dan Magenheimer @ 2011-11-02 21:14 UTC (permalink / raw) To: Rik van Riel Cc: Andrea Arcangeli, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet > From: Rik van Riel [mailto:riel@redhat.com] > Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window) > > On 10/31/2011 07:36 PM, Dan Magenheimer wrote: > >> From: Andrea Arcangeli [mailto:aarcange@redhat.com] > > >>> real work to do instead and (2) that vmexit/vmenter is horribly > >> > >> Sure the CPU has another 1000 VM to schedule. This is like saying > >> virtio-blk isn't needed on desktop virt because the desktop isn't > >> doing much I/O. Absurd argument, there are another 1000 desktops doing > >> I/O at the same time of course. > > > > But this is truly different, I think at least for the most common > > cases, because the guest is essentially out of physical memory if it > > is swapping. And the vmexit/vmenter (I assume, I don't really > > know KVM) gives the KVM scheduler the opportunity to schedule > > another of those 1000 VMs if it wishes. > > I believe the problem Andrea is trying to point out here is > that the proposed API cannot handle a batch of pages to be > pushed into frontswap/cleancache at one time. That wasn't the part of Andrea's discussion I meant, but I am getting foggy now, so let's address your point rather than mine. > Even if the current back-end implementations are synchronous > and can only do one page at a time, I believe it would still > be a good idea to have the API able to handle a vector with > a bunch of pages all at once. > > That way we can optimize the back-ends as required, at some > later point in time. > > If enough people start using tmem, such bottlenecks will show > up at some point :) It occurs to me that batching could be done locally without changing the in-kernel "API" (i.e. frontswap_ops)... the guest-side KVM tmem-backend-driver could do the compression into guest-side memory and make a single hypercall=vmexit/vmenter whenever it has collected enough for a batch. The "get" and "flush" would have to search this guest-side local cache and, if not local, make a hypercall. This is more or less what RAMster does, except it (currently) still transmits the "batch" one (pre-compressed) page at a time. And, when I think about it deeper (with my currently admittedly fried brain), this may even be the best way to do batching anyway. I can't think offhand where else you would put a "put batch" hook in the swap subsystem because I think the current swap subsystem batching code only works with adjacent "entry" numbers. And, one more thing occurs to me then... this shows the KVM "ABI" (hypercall) is not constrained by the existing Xen ABI. It can be arbitrarily more functional. /me gets hand slapped remotely from Oracle HQ ;-) ^ permalink raw reply [flat|nested] 175+ messages in thread
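[A minimal sketch of the guest-side batching Dan describes above, with hypothetical names throughout; this is not code from the tmem tree. Puts are staged in guest memory and pushed to the host with one hypercall per batch, and gets must check the staged-but-unsubmitted pages first, which is exactly the property Jeremy and Konrad confirm below. Compression, locking, flushes and error paths are omitted; the hypercalls are stubs.]

/*
 * Hypothetical guest-side batching front end for a KVM tmem backend.
 * Staged pages would really be compressed; this stores them raw.
 */
#include <string.h>

#define PAGE_SZ   4096
#define BATCH_MAX 64

struct staged_page {
    unsigned long offset;
    char data[PAGE_SZ];
};

static struct staged_page batch[BATCH_MAX];
static int batch_n;

/* Stub stand-ins for the real guest->host transitions (one vmexit each). */
static void hypercall_put_batch(struct staged_page *b, int n)
{
    (void)b; (void)n;    /* would hand the whole batch to the host */
}
static int hypercall_get_page(unsigned long offset, void *page)
{
    (void)offset; (void)page;
    return -1;           /* would ask the host; here, always a miss */
}

static void flush_batch_to_host(void)
{
    if (batch_n) {
        hypercall_put_batch(batch, batch_n);    /* 1 vmexit per batch */
        batch_n = 0;
    }
}

static int tmem_put(unsigned long offset, const void *page)
{
    if (batch_n == BATCH_MAX)
        flush_batch_to_host();    /* amortize: 1 vmexit per BATCH_MAX puts */
    batch[batch_n].offset = offset;
    memcpy(batch[batch_n].data, page, PAGE_SZ);
    batch_n++;
    return 0;    /* synchronous: the page has been "dealt with" */
}

static int tmem_get(unsigned long offset, void *page)
{
    int i;

    /* Staged but not yet submitted?  Must be served locally. */
    for (i = 0; i < batch_n; i++)
        if (batch[i].offset == offset) {
            memcpy(page, batch[i].data, PAGE_SZ);
            return 0;
        }
    return hypercall_get_page(offset, page);    /* otherwise ask the host */
}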
* Re: [GIT PULL] mm: frontswap (for 3.2 window) 2011-11-02 21:14 ` Dan Magenheimer @ 2011-11-15 16:29 ` Rik van Riel -1 siblings, 0 replies; 175+ messages in thread From: Rik van Riel @ 2011-11-15 16:29 UTC (permalink / raw) To: Dan Magenheimer Cc: Andrea Arcangeli, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet On 11/02/2011 05:14 PM, Dan Magenheimer wrote: > It occurs to me that batching could be done locally without > changing the in-kernel "API" (i.e. frontswap_ops)... the > guest-side KVM tmem-backend-driver could do the compression > into guest-side memory and make a single > hypercall=vmexit/vmenter whenever it has collected enough for > a batch. That seems like the best way to do it, indeed. Do the current hooks allow that mode of operation, or do the hooks only return after the entire operation has completed? ^ permalink raw reply [flat|nested] 175+ messages in thread
* Re: [GIT PULL] mm: frontswap (for 3.2 window) 2011-11-15 16:29 ` Rik van Riel @ 2011-11-15 17:33 ` Jeremy Fitzhardinge -1 siblings, 0 replies; 175+ messages in thread From: Jeremy Fitzhardinge @ 2011-11-15 17:33 UTC (permalink / raw) To: Rik van Riel Cc: Dan Magenheimer, Andrea Arcangeli, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet On 11/15/2011 08:29 AM, Rik van Riel wrote: > On 11/02/2011 05:14 PM, Dan Magenheimer wrote: > >> It occurs to me that batching could be done locally without >> changing the in-kernel "API" (i.e. frontswap_ops)... the >> guest-side KVM tmem-backend-driver could do the compression >> into guest-side memory and make a single >> hypercall=vmexit/vmenter whenever it has collected enough for >> a batch. > > That seems like the best way to do it, indeed. > > Do the current hooks allow that mode of operation, > or do the hooks only return after the entire operation > has completed? The APIs are synchronous, but need only return once the memory has been dealt with in some way. If you were batching before making a hypercall, then the implementation would just have to make a copy into its private memory and you'd have to make sure that lookups on batched but unsubmitted pages work. (It's been a while since I've looked at these patches, but I'm assuming nothing fundamental has changed about them lately.) J ^ permalink raw reply [flat|nested] 175+ messages in thread
* Re: [GIT PULL] mm: frontswap (for 3.2 window) 2011-11-15 17:33 ` Jeremy Fitzhardinge @ 2011-11-16 14:49 ` Konrad Rzeszutek Wilk -1 siblings, 0 replies; 175+ messages in thread From: Konrad Rzeszutek Wilk @ 2011-11-16 14:49 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Rik van Riel, Dan Magenheimer, Andrea Arcangeli, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet On Tue, Nov 15, 2011 at 09:33:40AM -0800, Jeremy Fitzhardinge wrote: > On 11/15/2011 08:29 AM, Rik van Riel wrote: > > On 11/02/2011 05:14 PM, Dan Magenheimer wrote: > > > >> It occurs to me that batching could be done locally without > >> changing the in-kernel "API" (i.e. frontswap_ops)... the > >> guest-side KVM tmem-backend-driver could do the compression > >> into guest-side memory and make a single > >> hypercall=vmexit/vmenter whenever it has collected enough for > >> a batch. > > > > That seems like the best way to do it, indeed. > > > > Do the current hooks allow that mode of operation, > > or do the hooks only return after the entire operation > > has completed? > > The APIs are synchronous, but need only return once the memory has been > dealt with in some way. If you were batching before making a hypercall, > then the implementation would just have to make a copy into its private > memory and you'd have to make sure that lookups on batched but > unsubmitted pages work. > > (It's been a while since I've looked at these patches, but I'm assuming > nothing fundamental has changed about them lately.) Yup, what you describe is possible, and nothing fundamental has changed about them. ^ permalink raw reply [flat|nested] 175+ messages in thread
* Re: [GIT PULL] mm: frontswap (for 3.2 window) 2011-10-31 18:16 ` Andrea Arcangeli @ 2011-11-01 10:16 ` James Bottomley -1 siblings, 0 replies; 175+ messages in thread From: James Bottomley @ 2011-11-01 10:16 UTC (permalink / raw) To: Andrea Arcangeli Cc: Dan Magenheimer, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet On Mon, 2011-10-31 at 19:16 +0100, Andrea Arcangeli wrote: > On Fri, Oct 28, 2011 at 08:21:31AM -0700, Dan Magenheimer wrote: > > real users and real distros and real products waiting, so if there > > are any real issues, let's get them resolved. > > We already told you the real issues there are and you did nothing so > far to address those, so much was built on top of a flawed API that I > guess an earthquake of massive scale has to come in to actually > convince Xen to change any of the huge amount of code built on the > flawed API. > > I don't know the exact Xen details (it's possible Xen design doesn't > allow these below 4 issues to be fixed, I've no idea) but for all > other non-virt usages (compressed-swap/compressed-pagecache, ramster) > I doubt it is impossible to change the design of the tmem API to > address at least one of those basic huge troubles that such an API > imposes: Actually, I think there's an unexpressed fifth requirement: 5. The optimised use case should be for non-paging situations. The problem here is that almost every data centre person tries very hard to make sure their systems never tip into the swap zone. A lot of hosting datacentres use tons of cgroup controllers for this and deliberately never configure swap, which makes transcendent memory useless to them under the current API. I'm not sure this is fixable, but it's the reason why a large swathe of users would never be interested in the patches, because they by design never operate in the region transcendent memory is currently looking to address. This isn't an inherent design flaw, but it does ask the question "is your design scope too narrow?" James ^ permalink raw reply [flat|nested] 175+ messages in thread
* RE: [GIT PULL] mm: frontswap (for 3.2 window) 2011-11-01 10:16 ` James Bottomley @ 2011-11-01 18:21 ` Dan Magenheimer -1 siblings, 0 replies; 175+ messages in thread From: Dan Magenheimer @ 2011-11-01 18:21 UTC (permalink / raw) To: James Bottomley, Andrea Arcangeli Cc: Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet > From: James Bottomley [mailto:James.Bottomley@HansenPartnership.com] > Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window) > > Actually, I think there's an unexpressed fifth requirement: > > 5. The optimised use case should be for non-paging situations. Not quite sure what you mean here (especially for frontswap)... > The problem here is that almost every data centre person tries very hard > to make sure their systems never tip into the swap zone. A lot of > hosting datacentres use tons of cgroup controllers for this and > deliberately never configure swap, which makes transcendent memory > useless to them under the current API. I'm not sure this is fixable, I can't speak for cgroups, but the generic "state-of-the-art" that you describe is a big part of what frontswap DOES try to fix, or at least ameliorate. Tipping "into the swap zone" is currently very bad. Very very bad. Frontswap doesn't "solve" swapping, but it is the foundation for some of the first things in a long time that aren't just "add more RAM." > but it's the reason why a large swathe of users would never be > interested in the patches, because they by design never operate in the > region transcendent memory is currently looking to address. It's true, those that are memory-rich and can spend nearly infinite amounts on more RAM (and on high-end platforms that can expand to hold massive amounts of RAM) are not tmem's target audience. > This isn't an inherent design flaw, but it does ask the question "is > your design scope too narrow?" Considering all the hazing that I've gone through to get this far, you think I should _expand_ my design scope?!? :-) Thanks, I guess I'll pass. :-) Dan ^ permalink raw reply [flat|nested] 175+ messages in thread
* RE: [GIT PULL] mm: frontswap (for 3.2 window) 2011-11-01 18:21 ` Dan Magenheimer @ 2011-11-02 8:14 ` James Bottomley -1 siblings, 0 replies; 175+ messages in thread From: James Bottomley @ 2011-11-02 8:14 UTC (permalink / raw) To: Dan Magenheimer Cc: Andrea Arcangeli, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet On Tue, 2011-11-01 at 11:21 -0700, Dan Magenheimer wrote: > > From: James Bottomley [mailto:James.Bottomley@HansenPartnership.com] > > Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window) > > > > Actually, I think there's an unexpressed fifth requirement: > > > > 5. The optimised use case should be for non-paging situations. > > Not quite sure what you mean here (especially for frontswap)... I mean could it be used in a more controlled situation than an alternative to swap? > > The problem here is that almost every data centre person tries very hard > > to make sure their systems never tip into the swap zone. A lot of > > hosting datacentres use tons of cgroup controllers for this and > > deliberately never configure swap which makes transcendent memory > > useless to them under the current API. I'm not sure this is fixable, > > I can't speak for cgroups, but the generic "state-of-the-art" > that you describe is a big part of what frontswap DOES try > to fix, or at least ameliorate. Tipping "into the swap zone" > is currently very bad. Very very bad. Frontswap doesn't > "solve" swapping, but it is the foundation for some of the > first things in a long time that aren't just "add more RAM." OK, I still don't think you understand what I'm saying. Machines in a Data Centre tend to be provisioned to criticality. What this means is that the Data Centre has a bunch of mandatory work and a bunch of Best Effort work (and grades in between). We load up the mandatory work according to the resource limits, being careful not to overprovision the capacity, then we look at the spare capacity and slot in the Best Effort stuff. We want the machine to run at capacity, not over it; plus we need to respond instantly to demands of the mandatory work, which usually involves either dialling down or pushing away best effort work. In this situation, action is taken long before the swap paths become active, because if they activate, the entire machine bogs and you've just blown the SLA on the mandatory work. This is why a lot of data centres simply never configure swap. Putting frontswap in the swap paths means that the data centre job scheduler has taken action long before frontswap ever activates, so it can never be used, which is why I wrote the above. > > but it's the reason why a large swathe of users would never be > > interested in the patches, because they by design never operate in the > > region transcendent memory is currently looking to address. > > It's true, those that are memory-rich and can spend nearly > infinite amounts on more RAM (and on high-end platforms that > can expand to hold massive amounts of RAM) are not tmem's > target audience. Where do you get the infinite RAM idea from? The most concrete example of what I said above are Lean Data Centres, which are highly resource constrained but want to run at (or just below) criticality so that they get through all of the Mandatory and as much of the Best Effort work as they can.
> > This isn't an inherent design flaw, but it does ask the question "is > > your design scope too narrow?" > > Considering all the hazing that I've gone through to get > this far, you think I should _expand_ my design scope?!? :-) > Thanks, I guess I'll pass. :-) Sure, I think the conclusion that Transcendent Memory has no applicability to a lean Data Centre isn't unreasonable; I was just probing to see if it was the only conclusion. James ^ permalink raw reply [flat|nested] 175+ messages in thread
* RE: [GIT PULL] mm: frontswap (for 3.2 window) 2011-11-02 8:14 ` James Bottomley @ 2011-11-02 20:08 ` Dan Magenheimer -1 siblings, 0 replies; 175+ messages in thread From: Dan Magenheimer @ 2011-11-02 20:08 UTC (permalink / raw) To: James Bottomley Cc: Andrea Arcangeli, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet > From: James Bottomley [mailto:James.Bottomley@HansenPartnership.com] > Subject: RE: [GIT PULL] mm: frontswap (for 3.2 window) > > > Not quite sure what you mean here (especially for frontswap)... > > I mean could it be used in a more controlled situation than an > alternative to swap? I think it could, but have focused on the cases which reduce disk I/O: cleancache, which replaces refaults, and frontswap, which replaces swap in/outs. Did you have some other kernel data in mind? > OK, I still don't think you understand what I'm saying. Machines in a > Data Centre tend to be provisioned to criticality. What this means is > that the Data Centre has a bunch of mandatory work and a bunch of Best > Effort work (and grades in between). We load up the mandatory work > according to the resource limits being careful not to overprovision the > capacity then we look at the spare capacity and slot in the Best effort > stuff. We want the machine to run at capacity, not over it; plus we > need to respond instantly for demands of the mandatory work, which > usually involves either dialling down or pushing away best effort work. > In this situation, action is taken long before the swap paths become > active because if they activate, the entire machine bogs and you've just > blown the SLA on the mandatory work. > > > It's true, those that are memory-rich and can spend nearly > > infinite amounts on more RAM (and on high-end platforms that > > can expand to hold massive amounts of RAM) are not tmem's > > target audience. > > Where do you get the infinite RAM idea from? The most concrete example > of what I said above are Lean Data Centres, which are highly resource > constrained but they want to run at (or just below) criticality so that > they get through all of the Mandatory and as much of the best effort > work as they can. OK, I think you are asking the same question as I answered for Kame earlier today. By "infinite" I am glibly describing any environment where the data centre administrator positively knows the maximum working set of every machine (physical or virtual) and can ensure in advance that the physical RAM always exceeds that maximum working set. As you say, these machines need not be configured with a swap device as they, by definition, will never swap. The point of tmem is to use RAM more efficiently by taking advantage of all the unused RAM when the current working set size is less than the maximum working set size. This is very common in many data centers too, especially virtualized. It turned out that an identical set of hooks made pagecache compression possible, and swappage compression more flexible than zram, and that became the single-kernel user, zcache. RAM optimization and QoS guarantees are generally mutually exclusive, so this doesn't seem like a good test case for tmem (but see below). > > > This isn't an inherent design flaw, but it does ask the question "is > > > your design scope too narrow?" 
> > > > Considering all the hazing that I've gone through to get > > this far, you think I should _expand_ my design scope?!? :-) > > Thanks, I guess I'll pass. :-) (Sorry again for the sarcasm :-( > Sure, I think the conclusion that Transcendent Memory has no > applicability to a lean Data Centre isn't unreasonable; I was just > probing to see if it was the only conclusion. Now that I understand it better, I think it does have a limited application for your Lean Data Centre... but only to optimize the "best effort" part of the data centre workload. That would probably be a relatively easy enhancement... but, please, my brain is full now and my typing fingers hurt, so can we consider it post-merge? Thanks, Dan ^ permalink raw reply [flat|nested] 175+ messages in thread
* Re: [GIT PULL] mm: frontswap (for 3.2 window) 2011-11-02 20:08 ` Dan Magenheimer @ 2011-11-03 10:30 ` Theodore Tso -1 siblings, 0 replies; 175+ messages in thread From: Theodore Tso @ 2011-11-03 10:30 UTC (permalink / raw) To: Dan Magenheimer Cc: Theodore Tso, James Bottomley, Andrea Arcangeli, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet On Nov 2, 2011, at 4:08 PM, Dan Magenheimer wrote: > By "infinite" I am glibly describing any environment where the > data centre administrator positively knows the maximum working > set of every machine (physical or virtual) and can ensure in > advance that the physical RAM always exceeds that maximum > working set. As you say, these machines need not be configured > with a swap device as they, by definition, will never swap. > > The point of tmem is to use RAM more efficiently by taking > advantage of all the unused RAM when the current working set > size is less than the maximum working set size. This is very > common in many data centers too, especially virtualized. That doesn't match with my experience, especially with "cloud" deployments, where in order to make the business plans work, the machines tend to be memory constrained, since you want to pack a large number of jobs/VM's onto a single machine, and high density memory is expensive and/or you are DIMM slot constrained. Of course, if you are running multiple Java runtimes in each guest OS (i.e., a J2EE server, and another Java VM for management, and yet another Java VM for the backup manager, etc. --- really, I've seen cloud architectures that work that way), things get worse even faster... -- Ted ^ permalink raw reply [flat|nested] 175+ messages in thread
* RE: [GIT PULL] mm: frontswap (for 3.2 window) 2011-11-03 10:30 ` Theodore Tso @ 2011-11-03 14:59 ` Dan Magenheimer -1 siblings, 0 replies; 175+ messages in thread From: Dan Magenheimer @ 2011-11-03 14:59 UTC (permalink / raw) To: Theodore Tso Cc: James Bottomley, Andrea Arcangeli, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet > From: Theodore Tso [mailto:tytso@mit.edu] > Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window) Hi Ted -- Thanks for your reply! > On Nov 2, 2011, at 4:08 PM, Dan Magenheimer wrote: > > > By "infinite" I am glibly describing any environment where the > > data centre administrator positively knows the maximum working > > set of every machine (physical or virtual) and can ensure in > > advance that the physical RAM always exceeds that maximum > > working set. As you say, these machines need not be configured > > with a swap device as they, by definition, will never swap. > > > > The point of tmem is to use RAM more efficiently by taking > > advantage of all the unused RAM when the current working set > > size is less than the maximum working set size. This is very > > common in many data centers too, especially virtualized. > > That doesn't match with my experience, especially with "cloud" deployments, where in order to make the > business plans work, the machines tend to be memory constrained, since you want to pack a large number > of jobs/VM's onto a single machine, and high density memory is expensive and/or you are DIMM slot > constrained. Of course, if you are running multiple Java runtimes in each guest OS (i.e., a J2EE > server, and another Java VM for management, and yet another Java VM for the backup manager, etc. --- > really, I've seen cloud architectures that work that way), things get worse even faster... Hmmm... since your memory-constrained example is highly similar to one I use in my presentations, I _think_ we are in total agreement, but I am confused by "doesn't match with my experience", or maybe you are countering James' lean data centre example? To clarify, for a multi-tenancy environment (such as virtualization or RAMster), tmem enables the ability to redistribute the constrained RAM resource, i.e. "steal from the rich and give to the poor," which is otherwise very difficult because each kernel is a memory hog. Frontswap's role is really to announce "I'm overconstrained and am about to swap to disk, which would be embarrassing for my performance... can someone hold this swap page for me, please?" Dan ^ permalink raw reply [flat|nested] 175+ messages in thread
* Re: [GIT PULL] mm: frontswap (for 3.2 window) 2011-11-01 10:16 ` James Bottomley @ 2011-11-02 15:44 ` Avi Kivity -1 siblings, 0 replies; 175+ messages in thread From: Avi Kivity @ 2011-11-02 15:44 UTC (permalink / raw) To: James Bottomley Cc: Andrea Arcangeli, Dan Magenheimer, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet On 11/01/2011 12:16 PM, James Bottomley wrote: > Actually, I think there's an unexpressed fifth requirement: > > 5. The optimised use case should be for non-paging situations. > > The problem here is that almost every data centre person tries very hard > to make sure their systems never tip into the swap zone. A lot of > hosting datacentres use tons of cgroup controllers for this and > deliberately never configure swap which makes transcendent memory > useless to them under the current API. I'm not sure this is fixable, > but it's the reason why a large swathe of users would never be > interested in the patches, because they by design never operate in the > region transcendent memory is currently looking to address. > > This isn't an inherent design flaw, but it does ask the question "is > your design scope too narrow?" If you look at cleancache, then it addresses this concern - it extends pagecache through host memory. When dropping a page from the tail of the LRU it first goes into tmem, and when reading in a page from disk you first try to read it from tmem. However in many workloads, cleancache is actually detrimental. If you have a lot of cache misses, then every one of them causes a pointless vmexit; considering that servers today can chew hundreds of megabytes per second, this adds up. On the other side, if you have a use-once workload, then every page that falls off the tail of the LRU causes a vmexit and a pointless page copy. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 175+ messages in thread
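The put/get flow Avi describes can be sketched in a few lines. cleancache_put_page() and cleancache_get_page() are the hook names from the cleancache series; the two surrounding functions are simplified stand-ins for the reclaim and read paths, not real kernel functions:

#include <linux/cleancache.h>

/* Tail of the LRU: offer the clean page to tmem before dropping it.
 * In a guest, this put is where the vmexit and page copy happen,
 * pointlessly so for a use-once workload. */
static void drop_clean_page(struct page *page)
{
        cleancache_put_page(page);
        /* ...then delete the page from the page cache as usual. */
}

/* Page-cache miss: ask tmem before issuing real disk I/O. A tmem
 * miss here is the "pointless vmexit" cost described above. */
static int fill_page(struct page *page)
{
        if (cleancache_get_page(page) == 0)
                return 0;                  /* tmem hit: no disk read */
        return read_page_from_disk(page);  /* hypothetical normal path */
}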
* Re: [GIT PULL] mm: frontswap (for 3.2 window) 2011-11-02 15:44 ` Avi Kivity @ 2011-11-02 16:02 ` Andrea Arcangeli -1 siblings, 0 replies; 175+ messages in thread From: Andrea Arcangeli @ 2011-11-02 16:02 UTC (permalink / raw) To: Avi Kivity Cc: James Bottomley, Dan Magenheimer, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet Hi Avi, On Wed, Nov 02, 2011 at 05:44:50PM +0200, Avi Kivity wrote: > If you look at cleancache, then it addresses this concern - it extends > pagecache through host memory. When dropping a page from the tail of > the LRU it first goes into tmem, and when reading in a page from disk > you first try to read it from tmem. However in many workloads, > cleancache is actually detrimental. If you have a lot of cache misses, > then every one of them causes a pointless vmexit; considering that > servers today can chew hundreds of megabytes per second, this adds up. > On the other side, if you have a use-once workload, then every page that > falls off the tail of the LRU causes a vmexit and a pointless page copy. I also think it's bad design for Virt usage, but hey, without this they can't even run with cache=writeback/writethrough and they're forced to cache=off, and then they claim specvirt is marketing, so for Xen it's better than nothing I guess. I'm trying right now to evaluate it as a pure zcache host side optimization. If it can drive us in the right long-term direction and we're free to modify it as we wish to boost swapping I/O too using compressed data, then it may be viable. Otherwise it's better they add some Xen-specific hook and leave whatever zcache infrastructure "free to be modified as the VM needs" "as Xen needs not". I currently don't know exactly where the Xen ABI starts and the kernel stops in tmem, so it's hard to tell how hackable it is and whether it is actually a complication to try to hide things away from the VM or not. Certainly the highly advertised automatic dynamic sizing of the tmem pools is an OOM timebomb without proper VM control on it. So it just can't stay away from the VM too much. Currently it's unlikely to be safe in all workloads (i.e. mlockall growing fast). Whatever happens in tmem, it must still be "owned by the kernel" so it can be written out to disk with bios. That doesn't need to happen immediately and doesn't need to be perfect, but it must definitely be possible to add it later without the Xen folks complaining at whatever change we do in tmem. The fact that not a line of Xen code was written over the last two years doesn't mean there aren't dependencies on the code; maybe those just never broke, so Xen never needed to be modified either, because they kept the tmem ABI/API fixed while adding the other backends of tmem (zcache etc.). I mean, just the fact that I read the word "ABI" in those emails signals something is wrong. There can't be any ABI there, only an API, and even the API is a kernel-internal one, so it must be allowed to break freely, or we can't innovate. Again, if we can't change whatever ABI/API without first talking with the Xen folks, I think it's better they split the two projects and just submit the Xen hooks separately. That wouldn't remove value from tmem (assuming it's the way to go, which I'm not entirely convinced of yet).
In any case, starting to fix up the zcache layer sounds good to me; the first things that come to mind are to document with a comment why it disables irqs and exactly which code races with the compression that runs from irqs or softirqs, fix the casts in tmem_put, rename tmem_put to tmem_store, etc. Then we see whether the Xen side complains about just those small needed cleanups. Ideally the API should also be stackable, so you can do ramster on top of zcache on top of cleancache/frontswap; then we can write a swap driver for the zcache and do swapper -> zcache -> frontswap, and we could even write compressed pagecache to disk that way. And the whole thing should handle all allocation failures with a fallback all the way up to the top layer (which for swap would mean going to the regular swapout path if OOM happens within those calls, and for pagecache would mean really freeing the page, not compressing it into some tmem memory), as sketched below. That is a design that may be good. I haven't had a huge amount of time to think about it, but if you remove virt from the equation it looks less bad. ^ permalink raw reply [flat|nested] 175+ messages in thread
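A sketch of the fallback behaviour Andrea is asking for, with the layers stacked swapper -> zcache -> frontswap. Every name below is illustrative rather than taken from the patches; the point is only that an allocation failure at any layer must fall through to the regular swapout path:

/* Illustrative only: layer names and signatures are hypothetical. */
static int swap_out(struct page *page, swp_entry_t entry)
{
        if (zcache_store(entry, page) == 0)     /* compressed in-RAM copy */
                return 0;
        if (ramster_store(entry, page) == 0)    /* remote-RAM copy */
                return 0;
        /* OOM inside either store lands here: write the page to the
         * swap device with a normal bio, exactly as without tmem. */
        return write_to_swap_device(page, entry);
}

For pagecache the equivalent fallback is to really free the page rather than compress it into tmem memory.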
* Re: [GIT PULL] mm: frontswap (for 3.2 window) 2011-11-02 16:02 ` Andrea Arcangeli @ 2011-11-02 16:13 ` Avi Kivity -1 siblings, 0 replies; 175+ messages in thread From: Avi Kivity @ 2011-11-02 16:13 UTC (permalink / raw) To: Andrea Arcangeli Cc: James Bottomley, Dan Magenheimer, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet On 11/02/2011 06:02 PM, Andrea Arcangeli wrote: > Hi Avi, > > On Wed, Nov 02, 2011 at 05:44:50PM +0200, Avi Kivity wrote: > > If you look at cleancache, then it addresses this concern - it extends > > pagecache through host memory. When dropping a page from the tail of > > the LRU it first goes into tmem, and when reading in a page from disk > > you first try to read it from tmem. However in many workloads, > > cleancache is actually detrimental. If you have a lot of cache misses, > > then every one of them causes a pointless vmexit; considering that > > servers today can chew hundreds of megabytes per second, this adds up. > > On the other side, if you have a use-once workload, then every page that > > falls off the tail of the LRU causes a vmexit and a pointless page copy. > > I also think it's bad design for Virt usage, but hey, without this > they can't even run with cache=writeback/writethrough and they're > forced to cache=off, and then they claim specvirt is marketing, so for > Xen it's better than nothing I guess. Surely Xen can use the pagecache; it uses Linux for I/O just like kvm. > I'm trying right now to evaluate it as a pure zcache host side > optimization. zcache-style usage is fine. It's purely internal, so no ABI constraints, and no hypercalls either. It's still synchronous though, so RAMster-like approaches will not work well. <snip> -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 175+ messages in thread
* RE: [GIT PULL] mm: frontswap (for 3.2 window) 2011-11-02 16:13 ` Avi Kivity @ 2011-11-02 20:27 ` Dan Magenheimer -1 siblings, 0 replies; 175+ messages in thread From: Dan Magenheimer @ 2011-11-02 20:27 UTC (permalink / raw) To: Avi Kivity, Andrea Arcangeli Cc: James Bottomley, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet > From: Avi Kivity [mailto:avi@redhat.com] > Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window) > > On 11/02/2011 06:02 PM, Andrea Arcangeli wrote: > > Hi Avi, > > > > On Wed, Nov 02, 2011 at 05:44:50PM +0200, Avi Kivity wrote: > > > If you look at cleancache, then it addresses this concern - it extends > > > pagecache through host memory. When dropping a page from the tail of > > > the LRU it first goes into tmem, and when reading in a page from disk > > > you first try to read it from tmem. However in many workloads, > > > cleancache is actually detrimental. If you have a lot of cache misses, > > > then every one of them causes a pointless vmexit; considering that > > > servers today can chew hundreds of megabytes per second, this adds up. > > > On the other side, if you have a use-once workload, then every page that > > > falls off the tail of the LRU causes a vmexit and a pointless page copy. > > > > I also think it's bad design for Virt usage, but hey, without this > > they can't even run with cache=writeback/writethrough and they're > > forced to cache=off, and then they claim specvirt is marketing, so for > > Xen it's better than nothing I guess. > > Surely Xen can use the pagecache; it uses Linux for I/O just like kvm. > > > I'm trying right now to evaluate it as a pure zcache host side > > optimization. > > zcache-style usage is fine. It's purely internal, so no ABI constraints, > and no hypercalls either. It's still synchronous though, so RAMster-like > approaches will not work well. Still experimental, but only the initial local put must be synchronous. RAMster uses a separate thread to "remotify" pre-compressed pages. The "get" still needs to be synchronous, but (if I ever have time to get back to coding it) I've got some ideas on how to fix that. If I manage to get that working, perhaps it could be used for Andrea's write-precompressed-zcache-pages-to-disk idea. Dan ^ permalink raw reply [flat|nested] 175+ messages in thread
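The "remotify" scheme Dan describes can be sketched with a workqueue: the put stores the compressed page locally and returns (the synchronous part), and a worker later pushes queued pages to a remote node, off the critical path. All names here are illustrative; the real code lives in the ramster branch of the tmem tree:

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

struct zpage {                          /* hypothetical compressed page */
        struct list_head list;
        /* compressed payload, tmem handle, etc. elided */
};

static LIST_HEAD(remotify_list);
static DEFINE_SPINLOCK(remotify_lock);

static void remotify_fn(struct work_struct *work)
{
        spin_lock(&remotify_lock);
        while (!list_empty(&remotify_list)) {
                struct zpage *zp = list_first_entry(&remotify_list,
                                                    struct zpage, list);
                list_del(&zp->list);
                spin_unlock(&remotify_lock);
                send_to_remote_node(zp);  /* hypothetical; may sleep */
                spin_lock(&remotify_lock);
        }
        spin_unlock(&remotify_lock);
}
static DECLARE_WORK(remotify_work, remotify_fn);

/* The synchronous part of the frontswap put: queue and return. */
static int ramster_put(struct zpage *zp)
{
        spin_lock(&remotify_lock);
        list_add_tail(&zp->list, &remotify_list);
        spin_unlock(&remotify_lock);
        schedule_work(&remotify_work);
        return 0;
}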
* RE: [GIT PULL] mm: frontswap (for 3.2 window) 2011-11-02 15:44 ` Avi Kivity @ 2011-11-02 20:19 ` Dan Magenheimer -1 siblings, 0 replies; 175+ messages in thread From: Dan Magenheimer @ 2011-11-02 20:19 UTC (permalink / raw) To: Avi Kivity, James Bottomley Cc: Andrea Arcangeli, Pekka Enberg, Cyclonus J, Sasha Levin, Christoph Hellwig, David Rientjes, Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet > From: Avi Kivity [mailto:avi@redhat.com] > Sent: Wednesday, November 02, 2011 9:45 AM > To: James Bottomley > Cc: Andrea Arcangeli; Dan Magenheimer; Pekka Enberg; Cyclonus J; Sasha Levin; Christoph Hellwig; David > Rientjes; Linus Torvalds; linux-mm@kvack.org; LKML; Andrew Morton; Konrad Wilk; Jeremy Fitzhardinge; > Seth Jennings; ngupta@vflare.org; Chris Mason; JBeulich@novell.com; Dave Hansen; Jonathan Corbet > Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window) > > On 11/01/2011 12:16 PM, James Bottomley wrote: > > Actually, I think there's an unexpressed fifth requirement: > > > > 5. The optimised use case should be for non-paging situations. > > > > The problem here is that almost every data centre person tries very hard > > to make sure their systems never tip into the swap zone. A lot of > > hosting datacentres use tons of cgroup controllers for this and > > deliberately never configure swap which makes transcendent memory > > useless to them under the current API. I'm not sure this is fixable, > > but it's the reason why a large swathe of users would never be > > interested in the patches, because they by design never operate in the > > region transcendent memory is currently looking to address. > > > > This isn't an inherent design flaw, but it does ask the question "is > > your design scope too narrow?" > > If you look at cleancache, then it addresses this concern - it extends > pagecache through host memory. When dropping a page from the tail of > the LRU it first goes into tmem, and when reading in a page from disk > you first try to read it from tmem. However in many workloads, > cleancache is actually detrimental. If you have a lot of cache misses, > then every one of them causes a pointless vmexit; considering that > servers today can chew hundreds of megabytes per second, this adds up. > On the other side, if you have a use-once workload, then every page that > falls off the tail of the LRU causes a vmexit and a pointless page copy. I agree with everything you've said except "_many_ workloads". I would characterize this as "some" workloads, and increasingly fewer machines... because core-counts are increasing faster than the ability to attach RAM to them (according to published research). I did code a horrible hack to fix this, but haven't gotten back to RFC'ing it to see if there were better, less horrible, ideas. It essentially only puts into tmem pages that are being reclaimed but previously had the PageActive bit set... a smaller but higher-hit-ratio source of pages, I think. Anyway, I've been very open about this (see https://lkml.org/lkml/2011/8/29/225), but it affects cleancache. Frontswap ONLY deals with pages that would otherwise have been swapped in/out to a physical swap device. Dan ^ permalink raw reply [flat|nested] 175+ messages in thread
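The "horrible hack" Dan mentions amounts to a filter in front of the cleancache put. A sketch, assuming a hypothetical page flag (PG_was_active here, with TestClearPageWasActive as its accessor) that would be set when a page is demoted from the active LRU list; neither exists in the kernel:

/* Only offer previously-active pages to tmem: a smaller set with a
 * higher expected hit ratio. */
static void maybe_cleancache_put(struct page *page)
{
        if (TestClearPageWasActive(page))
                cleancache_put_page(page);
        /* Use-once pages that never reached the active list are just
         * dropped, avoiding the per-page vmexit and copy. */
}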
* Re: [GIT PULL] mm: frontswap (for 3.2 window) 2011-10-27 18:52 ` Dan Magenheimer @ 2011-10-27 21:44 ` Avi Miller -1 siblings, 0 replies; 175+ messages in thread From: Avi Miller @ 2011-10-27 21:44 UTC (permalink / raw) To: Dan Magenheimer Cc: Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, levinsasha928, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet, Neo Jia Hi Linus et al, If further support is required: On 28/10/2011, at 5:52 AM, Dan Magenheimer wrote: > Linux kernel distros incorporating frontswap: > - Oracle UEK 2.6.39 Beta: I have been testing this kernel for a while now as well, and it is performing well. I have tested Xen HVM, PVHVM and PVM guests, all with tmem enabled. Automated testing is scheduled to go into our test farm (which runs ~80,000 hours of QA testing of Oracle products on Oracle Linux per day) soon. > - OracleVM since 2.2 (2009) Likewise. We are planning to incorporate Transcendent Memory support into future Oracle VM 3.0 releases as supported functionality, i.e. it will be enabled on a per-server/per-guest basis so that guests are capable of reducing their memory footprint. We see this as a critical feature to compete with other hypervisors' memory-sharing/de-duplication functionality. Thanks, Avi --- Oracle <http://www.oracle.com> Avi Miller | Principal Program Manager | +61 (412) 229 687 Oracle Linux and Virtualization 417 St Kilda Road, Melbourne, Victoria 3004 Australia ^ permalink raw reply [flat|nested] 175+ messages in thread
* Re: [GIT PULL] mm: frontswap (for 3.2 window) 2011-10-27 18:52 ` Dan Magenheimer @ 2011-10-27 22:33 ` Brian King -1 siblings, 0 replies; 175+ messages in thread From: Brian King @ 2011-10-27 22:33 UTC (permalink / raw) To: Dan Magenheimer Cc: Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, levinsasha928, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet, Neo Jia On 10/27/2011 01:52 PM, Dan Magenheimer wrote: > Hi Linus -- > > Frontswap now has FOUR users: Two already merged in-tree (zcache > and Xen) and two still in development but in public git trees > (RAMster and KVM). Frontswap is part 2 of 2 of the core kernel > changes required to support transcendent memory; part 1 was cleancache > which you merged at 3.0 (and which now has FIVE users). We are also actively looking at utilizing frontswap for IBM Power and would welcome its inclusion in mainline. Thanks, Brian -- Brian King Linux on Power Virtualization IBM Linux Technology Center ^ permalink raw reply [flat|nested] 175+ messages in thread
* Re: [GIT PULL] mm: frontswap (for 3.2 window) 2011-10-27 18:52 ` Dan Magenheimer @ 2011-10-28 5:17 ` Nitin Gupta -1 siblings, 0 replies; 175+ messages in thread From: Nitin Gupta @ 2011-10-28 5:17 UTC (permalink / raw) To: Dan Magenheimer Cc: Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, levinsasha928, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet, Neo Jia Hi Dan, On 10/27/2011 02:52 PM, Dan Magenheimer wrote: > Hi Linus -- > > Frontswap now has FOUR users: Two already merged in-tree (zcache > and Xen) and two still in development but in public git trees > (RAMster and KVM). Frontswap is part 2 of 2 of the core kernel > changes required to support transcendent memory; part 1 was cleancache > which you merged at 3.0 (and which now has FIVE users). > I think frontswap would be really useful. Without it, zcache would be limited to compressed caching of just the page cache pages; with frontswap, we can balance compressed memory usage between swap cache and page cache pages. It also provides many advantages over existing solutions like zram, which presents a fixed-size virtual (compressed) block device interface. Since frontswap doesn't have to "pretend" to be a block device, it can incorporate many dynamic resizing policies, a critical factor for compressed caching. Thanks, Nitin ^ permalink raw reply [flat|nested] 175+ messages in thread
* Re: [GIT PULL] mm: frontswap (for 3.2 window) 2011-10-27 18:52 ` Dan Magenheimer @ 2011-10-29 13:43 ` Ed Tomlinson -1 siblings, 0 replies; 175+ messages in thread From: Ed Tomlinson @ 2011-10-29 13:43 UTC (permalink / raw) To: Dan Magenheimer Cc: Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, levinsasha928, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet, Neo Jia On Thursday 27 October 2011 11:52:22 Dan Magenheimer wrote: > Hi Linus -- > SO... Please pull: > > git://oss.oracle.com/git/djm/tmem.git #tmem > My wife has an old PC that's short on memory. It's got Ubuntu running on it, with cleancache and zram enabled. The box works better when using these. Frontswap would improve things further: it will balance tmem vs. physical memory dynamically, making it a better solution than zram. I'd love to see this in the kernel. Thanks Ed Tomlinson PS. At work we use AIX with memory compression. With the workloads we run, compression lets the OS act like it has 30% more memory. It works. It would be nice to have a similar facility in Linux. ^ permalink raw reply [flat|nested] 175+ messages in thread
* Re: [GIT PULL] mm: frontswap (for 3.2 window) 2011-10-27 18:52 ` Dan Magenheimer @ 2011-10-31 8:13 ` KAMEZAWA Hiroyuki -1 siblings, 0 replies; 175+ messages in thread From: KAMEZAWA Hiroyuki @ 2011-10-31 8:13 UTC (permalink / raw) To: Dan Magenheimer Cc: Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, levinsasha928, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet, Neo Jia On Thu, 27 Oct 2011 11:52:22 -0700 (PDT) Dan Magenheimer <dan.magenheimer@oracle.com> wrote: > Hi Linus -- > > Frontswap now has FOUR users: Two already merged in-tree (zcache > and Xen) and two still in development but in public git trees > (RAMster and KVM). Frontswap is part 2 of 2 of the core kernel > changes required to support transcendent memory; part 1 was cleancache > which you merged at 3.0 (and which now has FIVE users). > > Frontswap patches have been in linux-next since June 3 (with zero > changes since Sep 22). First posted to lkml in June 2009, frontswap > is now at version 11 and has incorporated feedback from a wide range > of kernel developers. For a good overview, see > http://lwn.net/Articles/454795. > If further rationale is needed, please see the end of this email > for more info. > > SO... Please pull: > > git://oss.oracle.com/git/djm/tmem.git #tmem > > since git commit b6fd41e29dea9c6753b1843a77e50433e6123bcb > Linus Torvalds (1): > Why bypass the -mm tree? I think you planned to merge this via the -mm tree and so posted patches to linux-mm with the -mm guys CC'ed. I think your last posting, v10, was on 2011/09/16. But there was no further submission to gather acks/reviews from Mel, Johannes, Andrew, Hugh etc., and no inclusion request to -mm or -next. _AND_, IIUC, at v10 the number of posted patches was 6. Why now 8? Just because they're simple changes? I don't have heavy concerns about the code itself, but this process of bypassing -mm or linux-next seems ugly. Thanks, -Kame ^ permalink raw reply [flat|nested] 175+ messages in thread
* RE: [GIT PULL] mm: frontswap (for 3.2 window) 2011-10-31 8:13 ` KAMEZAWA Hiroyuki @ 2011-10-31 16:38 ` Dan Magenheimer -1 siblings, 0 replies; 175+ messages in thread From: Dan Magenheimer @ 2011-10-31 16:38 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, levinsasha928, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet, Neo Jia > From: KAMEZAWA Hiroyuki [mailto:kamezawa.hiroyu@jp.fujitsu.com] > Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window) Hi Kame -- Thanks for your reply and for your earlier reviews of frontswap, and my apologies that I accidentally left you off the Cc list for the basenote of this git-pull request. > I don't have heavy concerns about the code itself, but this process of bypassing -mm > or linux-next seems ugly. First, frontswap IS in linux-next and has been since June 3, and v11 has been in linux-next since September 23. This is stated in the base git-pull request. > Why bypass the -mm tree? > > I think you planned to merge this via the -mm tree and so posted patches > to linux-mm with the -mm guys CC'ed. Hmmm... the mm process is not clear or well-documented. I am a relative newbie here. Linus has repeatedly spoken of ensuring that code is in linux-next, and there is no (last I checked) current -mm git tree. I was aware that the mm tree still existed, but thought it was for shaking out major features, not for adding a handful of hooks. I was aware that akpm's blessing was highly desirable, but his (offlist) reply was essentially "I'm not interested, I don't have time to deal with this, and I don't think anyone will use it." I explained about all the users (many of whom have replied to this thread to support frontswap), but got no further reply. I was advised by several people that, in the case of disagreement, Linus will decide, so I pushed forward. This is the same as the process I used for cleancache, which Linus merged. I have been instructed offlist and onlist that this was a big mistake, that it appears that I am subverting the process, and that I am probably insulting akpm. If so, I am truly sorry and would be happy to take instruction on how to proceed correctly. However, in turn, I hope that those driving the process aren't blocking useful code simply due to lack of time. > I think your last posting, v10, was on 2011/09/16. But there was no further submission > to gather acks/reviews from Mel, Johannes, Andrew, Hugh etc., and no inclusion > request to -mm or -next. _AND_, IIUC, at v10 the number of posted patches was 6. > Why now 8? Just because they're simple changes? See https://lkml.org/lkml/2011/9/21/373. Konrad Wilk helped me to reorganize the patches (closer to what you suggested, I think), but there were no code changes between v10 and v11, just dividing up the patches differently, as Konrad thought there should be more, smaller commits. So no code change between v10 and v11, but the number of patches went from 6 to 8. My last line in that post should also make it clear that I thought I was done and ready for the 3.2 window, so there was no evil intent on my part to subvert a process. It would have been nice if someone had told me there were uncompleted steps in the -mm process or, even better, pointed me to a (non-existent?) document where I could see for myself if I was missing steps! So... now what? Thanks, Dan P.S. It appears that this excerpt from the LWN KS2011 report might be related to the problem?
"Andrew complained about the acceptance of entirely new features into the kernel. Those features often land on his doorstep without much justification, forcing him to ask the developers to explain their motivations. The kernel community, he complained, is not supporting him well. Who can tell him if a given patch makes sense? Mistakes have been made in the past; bad features have been merged and good stuff has been lost. How, he asked, can he find people who know better about the desirability of specific patches?" ^ permalink raw reply [flat|nested] 175+ messages in thread
* Re: [GIT PULL] mm: frontswap (for 3.2 window)
  2011-10-31 16:38 ` Dan Magenheimer
@ 2011-11-01  0:50 ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 175+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-11-01 0:50 UTC (permalink / raw)
To: Dan Magenheimer
Cc: Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, levinsasha928, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet, Neo Jia

On Mon, 31 Oct 2011 09:38:12 -0700 (PDT)
Dan Magenheimer <dan.magenheimer@oracle.com> wrote:

> > From: KAMEZAWA Hiroyuki [mailto:kamezawa.hiroyu@jp.fujitsu.com]
> > Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
>
> Hi Kame --
>
> Thanks for your reply and for your earlier reviews of frontswap,
> and my apologies that I accidentally left you off of the Cc list
> for the base note of this git-pull request.
>
> > I don't have heavy concerns about the code itself, but this process of
> > bypassing -mm or linux-next seems ugly.
>
> First, frontswap IS in linux-next and has been since June 3, and
> v11 has been in linux-next since September 23. This is stated in
> the base git-pull request.

Ok, I'm sorry. I found frontswap.c in my tree.

> > Why bypass the -mm tree?
> >
> > I think you planned to merge this via the -mm tree and, then, posted
> > patches to linux-mm with CC to the -mm guys.
>
> Hmmm... the mm process is not clear or well-documented.

It's not complicated to me: post -> akpm's -mm tree -> mainline.
But your tree seems to be in -mm via linux-next. Hmm, complicated ;(
I'm sorry I didn't notice frontswap.c was there....

> > I think you last posted on 2011/09/16, at v10. But there was no further
> > submission to gather acks/reviews from Mel, Johannes, Andrew, Hugh etc.,
> > and no inclusion request to -mm or -next. _AND_, IIUC, at v10, the number
> > of posted patches was 6. Why now 8? Just because they are simple changes?
>
> See https://lkml.org/lkml/2011/9/21/373. Konrad Wilk helped me to
> reorganize the patches (closer to what you suggested, I think), but
> there were no code changes between v10 and v11, just a different
> division of the patches, as Konrad thought there should be more,
> smaller commits. So: no code change between v10 and v11, but the
> number of patches went from 6 to 8.
>
> My last line in that post should also make it clear that I thought
> I was done and ready for the 3.2 window, so there was no evil intent
> on my part to subvert a process. It would have been nice if someone
> had told me there were uncompleted steps in the -mm process or, even
> better, pointed me to a (non-existent?) document where I could see
> for myself whether I was missing steps!
>
> So... now what?

As far as I know, patches for memory management should go through
akpm's tree, and most of the developers in that area watch that tree.
Now, your tree goes through linux-next. That complicates the problem.

When a patch goes through the -mm tree, its justification has already
been checked by, at least, akpm. And while it is in the -mm tree,
other developers check it and some improvements are made there.

Now, you are trying to push patches via linux-next, and the
justification for your patches is being checked _now_. That's what is
happening. It's not complicated; I think other linux-next patches
have their justification checked at pull-request time. So, all your
work here will be to convince people that this feature is necessary
and non-intrusive.

From my point of view:

- I have no concerns about the performance cost. But, at the same
  time, I want to see performance improvement numbers.

- While discussing this with a Fujitsu user-support guy (just now),
  he asked "why is it not designed as a device driver?" I couldn't
  answer. So, I have small concerns about the frontswap.ops ABI
  design. Do we need an ABI, and should other modules be pluggable?
  Can frontswap be implemented as something like

  # setup frontswap via device-mapper or some such
  # swapon /dev/frontswap

  ? It seems the required hooks are just before/after reads/writes
  of the swap device; the other hooks could be implemented via a
  notifier... no?

Thanks,
-Kame

^ permalink raw reply [flat|nested] 175+ messages in thread
* RE: [GIT PULL] mm: frontswap (for 3.2 window)
  2011-11-01  0:50 ` KAMEZAWA Hiroyuki
@ 2011-11-01 15:25 ` Dan Magenheimer
  0 siblings, 0 replies; 175+ messages in thread
From: Dan Magenheimer @ 2011-11-01 15:25 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, levinsasha928, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet, Neo Jia

> From: KAMEZAWA Hiroyuki [mailto:kamezawa.hiroyu@jp.fujitsu.com]
> Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
>
> On Mon, 31 Oct 2011 09:38:12 -0700 (PDT)
> Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
>
> > > I think you planned to merge this via the -mm tree and, then, posted
> > > patches to linux-mm with CC to the -mm guys.
> >
> > Hmmm... the mm process is not clear or well-documented.
>
> It's not complicated to me: post -> akpm's -mm tree -> mainline.
> But your tree seems to be in -mm via linux-next. Hmm, complicated ;(
> I'm sorry I didn't notice frontswap.c was there....

Am I correct that the "post -> akpm's -mm tree" part requires akpm to
personally merge the posted linux-mm patches into his -mm tree? So no
git tree? I guess I didn't understand that, which is why I never
posted v11 and just put it into my git tree, which was being pulled
into linux-next. Anyway, I am learning now... thanks.

> > > I think you last posted on 2011/09/16, at v10. But there was no further
> > > submission to gather acks/reviews from Mel, Johannes, Andrew, Hugh etc.,
> > > and no inclusion request to -mm or -next. _AND_, IIUC, at v10, the
> > > number of posted patches was 6. Why now 8? Just because they are
> > > simple changes?
> >
> > See https://lkml.org/lkml/2011/9/21/373. Konrad Wilk helped me to
> > reorganize the patches (closer to what you suggested, I think), but
> > there were no code changes between v10 and v11, just a different
> > division of the patches, as Konrad thought there should be more,
> > smaller commits. So: no code change between v10 and v11, but the
> > number of patches went from 6 to 8.
> >
> > So... now what?
>
> As far as I know, patches for memory management should go through
> akpm's tree, and most of the developers in that area watch that tree.
> Now, your tree goes through linux-next. That complicates the problem.
>
> When a patch goes through the -mm tree, its justification has already
> been checked by, at least, akpm. And while it is in the -mm tree,
> other developers check it and some improvements are made there.
>
> Now, you are trying to push patches via linux-next, and the
> justification for your patches is being checked _now_. That's what is
> happening. It's not complicated; I think other linux-next patches
> have their justification checked at pull-request time.

OK, I will then coordinate with sfr to remove it from the linux-next
tree when (if?) akpm puts the patchset into the -mm tree. But since
very few linux-mm experts had responded to previous postings of the
frontswap patchset, I am glad to have a much wider audience to discuss
it now because of the lkml git-pull request.

> So, all your work here will be to convince people that this feature
> is necessary and non-intrusive.
>
> From my point of view:
>
> - I have no concerns about the performance cost. But, at the same
>   time, I want to see performance improvement numbers.

There are numbers published for Xen. I have received the feedback
that benchmarks are needed for zcache also.

> - While discussing this with a Fujitsu user-support guy (just now),
>   he asked "why is it not designed as a device driver?" I couldn't
>   answer. So, I have small concerns about the frontswap.ops ABI
>   design. Do we need an ABI, and should other modules be pluggable?
>   Can frontswap be implemented as something like
>
>   # setup frontswap via device-mapper or some such
>   # swapon /dev/frontswap
>
>   ? It seems the required hooks are just before/after reads/writes
>   of the swap device; the other hooks could be implemented via a
>   notifier... no?

A good question, and it is answered in FAQ #4 included in the patchset
(Documentation/vm/frontswap.txt). The short answer is that the tmem
ABI/API used by frontswap is intentionally very, very dynamic -- ANY
attempt to put a page into it can be rejected by the backend. This is
not possible with block I/O or swap, at least without a massive
rewrite. And this dynamic capability is the key to supporting the many
users that frontswap supports.

By the way, what your Fujitsu user-support guy suggests is exactly
what zram does. The author of zram (Nitin Gupta) agrees that frontswap
has many advantages over zram, see https://lkml.org/lkml/2011/10/28/8,
and he supports merging frontswap. And Ed Tomlinson, a current user of
zram, says that he would use frontswap instead of zram:
https://lkml.org/lkml/2011/10/29/53

Kame, can I add you to the list of people who support merging
frontswap, assuming more good performance numbers are posted?

Thanks,
Dan

^ permalink raw reply [flat|nested] 175+ messages in thread
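For readers following the ABI discussion above: the backend interface
in question has roughly the shape sketched below. This is a sketch
reconstructed from the patch descriptions in this thread (op names and
semantics per the v11 frontswap patchset, after the flush->invalidate
rename); the exact signatures in the posted code may differ slightly.

	/*
	 * Sketch of the frontswap backend interface under discussion.
	 * The key property Dan describes: put_page may return nonzero
	 * to reject ANY page at ANY time, and the swap subsystem then
	 * simply falls back to writing the page to the real swap
	 * device. A block device or ordinary swap cannot refuse I/O
	 * this way.
	 */
	#include <linux/types.h>
	#include <linux/mm_types.h>

	struct frontswap_ops {
		void (*init)(unsigned type);		/* called at swapon */
		int (*put_page)(unsigned type, pgoff_t offset,
				struct page *page);	/* nonzero = rejected */
		int (*get_page)(unsigned type, pgoff_t offset,
				struct page *page);	/* succeeds only if put did */
		void (*invalidate_page)(unsigned type, pgoff_t offset);
		void (*invalidate_area)(unsigned type);	/* called at swapoff */
	};

	/*
	 * A backend (zcache, Xen tmem, RAMster, KVM port, ...)
	 * registers its ops; the previous ops are returned.
	 */
	extern struct frontswap_ops
		frontswap_register_ops(struct frontswap_ops *ops);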
* Re: [GIT PULL] mm: frontswap (for 3.2 window)
  2011-11-01 15:25 ` Dan Magenheimer
@ 2011-11-01 21:43 ` Andrew Morton
  0 siblings, 0 replies; 175+ messages in thread
From: Andrew Morton @ 2011-11-01 21:43 UTC (permalink / raw)
To: Dan Magenheimer
Cc: KAMEZAWA Hiroyuki, Linus Torvalds, linux-mm, LKML, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, levinsasha928, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet, Neo Jia

On Tue, 1 Nov 2011 08:25:38 -0700 (PDT)
Dan Magenheimer <dan.magenheimer@oracle.com> wrote:

> OK, I will then coordinate with sfr to remove it from the linux-next
> tree when (if?) akpm puts the patchset into the -mm tree.

No, that's not necessary. The current process (you maintain a git
tree, it gets included in -next, later it gets pulled by Linus) is
good. The only reason I see for putting such code through -mm would be
if there were significant interactions with other core MM work. It
doesn't matter which route is taken, as long as the code is
appropriately reviewed and tested.

> But since very few linux-mm experts had responded to previous
> postings of the frontswap patchset, I am glad to have a much wider
> audience to discuss it now because of the lkml git-pull request.

At kernel summit there was discussion and overall agreement that we've
been paying insufficient attention to the big-picture "should we
include this feature at all" issues. We resolved to look more
intensely and critically at new features with a view to deciding
whether their usefulness justified their maintenance burden. It seems
that you're our crash-test dummy ;) (Now I'm wondering how to get
"cgroups: add a task counter subsystem" put through the same wringer.)

I will confess to and apologise for dropping the ball on cleancache
and frontswap. I was never really able to convince myself that it met
the (very vague) cost/benefit test, but nor was I able to present
convincing arguments that it failed that test. So I very badly went
into hiding, to wait and see what happened. What we needed all those
months ago was to have the discussion we're having now.

This is a difficult discussion and a difficult decision. But it is
important that we get it right. Thanks for your patience.

^ permalink raw reply [flat|nested] 175+ messages in thread
* RE: [GIT PULL] mm: frontswap (for 3.2 window)
  2011-11-01 21:43 ` Andrew Morton
@ 2011-11-01 22:25 ` Dan Magenheimer
  0 siblings, 0 replies; 175+ messages in thread
From: Dan Magenheimer @ 2011-11-01 22:25 UTC (permalink / raw)
To: Andrew Morton
Cc: KAMEZAWA Hiroyuki, Linus Torvalds, linux-mm, LKML, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, levinsasha928, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet, Neo Jia

> From: Andrew Morton [mailto:akpm@linux-foundation.org]
> Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
>
> At kernel summit there was discussion and overall agreement that
> we've been paying insufficient attention to the big-picture "should
> we include this feature at all" issues. We resolved to look more
> intensely and critically at new features with a view to deciding
> whether their usefulness justified their maintenance burden. It
> seems that you're our crash-test dummy ;) (Now I'm wondering how to
> get "cgroups: add a task counter subsystem" put through the same
> wringer.)
>
> I will confess to and apologise for dropping the ball on cleancache
> and frontswap. I was never really able to convince myself that it
> met the (very vague) cost/benefit test, but nor was I able to
> present convincing arguments that it failed that test. So I very
> badly went into hiding, to wait and see what happened. What we
> needed all those months ago was to have the discussion we're having
> now.
>
> This is a difficult discussion and a difficult decision. But it is
> important that we get it right. Thanks for your patience.

Thanks very much for your very kind response.

Let me know if I can do anything else to help the process, other than
continuing the discussion, of course. I'll be happy to help as soon as
I return from the crash-test-dummy hospital ;-)

Dan

^ permalink raw reply [flat|nested] 175+ messages in thread
* Re: [GIT PULL] mm: frontswap (for 3.2 window)
  2011-11-01 21:43 ` Andrew Morton
@ 2011-11-02 21:03 ` Rik van Riel
  0 siblings, 0 replies; 175+ messages in thread
From: Rik van Riel @ 2011-11-02 21:03 UTC (permalink / raw)
To: Andrew Morton
Cc: Dan Magenheimer, KAMEZAWA Hiroyuki, Linus Torvalds, linux-mm, LKML, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, levinsasha928, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet, Neo Jia

On 11/01/2011 05:43 PM, Andrew Morton wrote:

> I will confess to and apologise for dropping the ball on cleancache
> and frontswap. I was never really able to convince myself that it met
> the (very vague) cost/benefit test,

I believe that it can, but if it does, we also have to operate under
the assumption that the major distros will enable it. This means that
"no overhead when not compiled in" is not going to apply to the
majority of the users out there, and we need clear numbers on what the
overhead is when it is enabled, but not used.

We also need an API that can handle arbitrarily heavy workloads, since
that is what people will throw at it if it is enabled everywhere. I
believe that means addressing some of Andrea's concerns, specifically
that the API should be able to handle vectors of pages and handle them
asynchronously. Even if the current back-ends do not handle that
today, chances are that (if tmem were to be enabled everywhere) people
will end up throwing workloads at tmem that pretty much require such a
thing. An asynchronous interface would probably be a requirement for
something as high-latency as encrypted RAMster :)

API concerns like this are things that should be solved before a
merge, IMHO, since afterwards we would end up in the "we cannot change
the API, because that breaks users" scenario that we always find
ourselves in.

^ permalink raw reply [flat|nested] 175+ messages in thread
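A vectored, asynchronous variant of the kind Rik and Andrea are asking
for might look something like the sketch below. Everything here is
hypothetical: nothing like it exists in the posted frontswap patches,
and the type and function names are invented purely to make the
request concrete.

	/*
	 * HYPOTHETICAL sketch of a batched, asynchronous put, as
	 * requested in this subthread. Not part of any posted
	 * patchset; names and signatures are illustrative only.
	 */
	#include <linux/types.h>
	#include <linux/mm_types.h>

	/* Completion callback: reports how many pages were accepted. */
	typedef void (*tmem_done_fn)(void *ctx, int nr_accepted);

	struct frontswap_batch_ops {
		/*
		 * Offer nr pages at once; the backend may accept any
		 * subset and completes asynchronously via 'done'. A
		 * high-latency backend (e.g. encrypted RAMster) could
		 * overlap compression/encryption/network I/O here.
		 */
		int (*put_pages)(unsigned type, pgoff_t *offsets,
				 struct page **pages, int nr,
				 tmem_done_fn done, void *ctx);
	};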
* RE: [GIT PULL] mm: frontswap (for 3.2 window)
  2011-11-02 21:03 ` Rik van Riel
@ 2011-11-02 21:42 ` Dan Magenheimer
  0 siblings, 0 replies; 175+ messages in thread
From: Dan Magenheimer @ 2011-11-02 21:42 UTC (permalink / raw)
To: Rik van Riel, Andrew Morton
Cc: KAMEZAWA Hiroyuki, Linus Torvalds, linux-mm, LKML, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, levinsasha928, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet, Neo Jia

> From: Rik van Riel [mailto:riel@redhat.com]
> Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)
>
> I believe that it can, but if it does, we also have to operate under
> the assumption that the major distros will enable it. This means
> that "no overhead when not compiled in" is not going to apply to the
> majority of the users out there, and we need clear numbers on what
> the overhead is when it is enabled, but not used.

Right. That's Case B (see the James Bottomley subthread), and the
overhead is one pointer comparison against NULL per page physically
swapped in from, or out to, a swap device (i.e., essentially zero).
Rik, would you be willing to examine the code to confirm that
statement?

> We also need an API that can handle arbitrarily heavy workloads,
> since that is what people will throw at it if it is enabled
> everywhere. I believe that means addressing some of Andrea's
> concerns, specifically that the API should be able to handle vectors
> of pages and handle them asynchronously. Even if the current
> back-ends do not handle that today, chances are that (if tmem were
> to be enabled everywhere) people will end up throwing workloads at
> tmem that pretty much require such a thing.

Wish I'd been a little faster typing the previous message. Rik, could
you respond to yourself here if you are happy with my proposed
batching design for the batching that you and Andrea want? (And if you
are not happy, provide code to show where you would place a new
batch-put hook?)

> An asynchronous interface would probably be a requirement for
> something as high-latency as encrypted RAMster :)

Pure asynchrony is a show-stopper for me, but the only synchrony
required is to move/transform the data locally. Asynchronous things
can still be done, but as a separate thread AFTER the data has been
"put" to tmem (which is exactly what RAMster does). If asynchrony at
frontswap_ops is demanded (and I think Andrea has already retracted
that), I would have to ask you to present alternate code, both hooks
and driver, that works successfully, because my claim is that it can't
be done, certainly not without massive changes to the swap subsystem
(and likely corresponding massive changes to VFS for cleancache).

> API concerns like this are things that should be solved before a
> merge, IMHO, since afterwards we would end up in the "we cannot
> change the API, because that breaks users" scenario that we always
> find ourselves in.

I think the above points amply demonstrate that the API is minimal and
extensible. Much of Andrea's concerns were due to a misunderstanding
of the code in staging/zcache, thinking it was part of the API; the
only "API" being considered here is defined by frontswap_ops.

Also, the API for frontswap_ops is almost identical to the API for
cleancache_ops and uses a much simpler, much more isolated set of
hooks. Frontswap "finishes" tmem; cleancache is already merged.
Leaving tmem unfinished is worse than not having it at all (and I can
already hear Christoph cackling and jumping to his keyboard ;-)

OK, I really need to discontinue my participation in this for a couple
of days for personal/health reasons, so I hope I've made my case.

Thanks,
Dan

^ permalink raw reply [flat|nested] 175+ messages in thread
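The "essentially zero" overhead claimed above for the
enabled-but-unused case refers to a guard of roughly the following
shape on the swap I/O path. This is a minimal sketch following the
hooks described in the pull request (mm/page_io.c and the
frontswap.h inline wrappers in the v11 patches); whether the posted
code tests a NULL ops pointer or a __read_mostly flag, the per-page
cost is a single well-predicted comparison.

	/*
	 * Sketch of the swap-path hook cost when frontswap is
	 * compiled in but no backend is registered (shape per the
	 * v11 patches; illustrative). swap_writepage() calls this
	 * and, on failure, falls through to normal block I/O.
	 */
	#include <linux/mm_types.h>

	extern bool frontswap_enabled;	/* set at backend registration */
	extern int __frontswap_put_page(struct page *page);

	static inline int frontswap_put_page(struct page *page)
	{
		if (!frontswap_enabled)	/* the single test in Case B */
			return -1;	/* fall back to real swap I/O */
		return __frontswap_put_page(page);
	}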
* Re: [GIT PULL] mm: frontswap (for 3.2 window)
  2011-11-01 15:25 ` Dan Magenheimer
@ 2011-11-02  1:14 ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 175+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-11-02 1:14 UTC (permalink / raw)
To: Dan Magenheimer
Cc: Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, levinsasha928, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet, Neo Jia

On Tue, 1 Nov 2011 08:25:38 -0700 (PDT)
Dan Magenheimer <dan.magenheimer@oracle.com> wrote:

> > - While discussing this with a Fujitsu user-support guy (just now),
> >   he asked "why is it not designed as a device driver?" I couldn't
> >   answer. So, I have small concerns about the frontswap.ops ABI
> >   design. Do we need an ABI, and should other modules be pluggable?
> >   Can frontswap be implemented as something like
> >
> >   # setup frontswap via device-mapper or some such
> >   # swapon /dev/frontswap
> >
> >   ? It seems the required hooks are just before/after reads/writes
> >   of the swap device; the other hooks could be implemented via a
> >   notifier... no?
>
> A good question, and it is answered in FAQ #4 included in the
> patchset (Documentation/vm/frontswap.txt). The short answer is that
> the tmem ABI/API used by frontswap is intentionally very, very
> dynamic -- ANY attempt to put a page into it can be rejected by the
> backend. This is not possible with block I/O or swap, at least
> without a massive rewrite. And this dynamic capability is the key to
> supporting the many users that frontswap supports.

Hmm.

> By the way, what your Fujitsu user-support guy suggests is exactly
> what zram does. The author of zram (Nitin Gupta) agrees that
> frontswap has many advantages over zram, see
> https://lkml.org/lkml/2011/10/28/8, and he supports merging
> frontswap. And Ed Tomlinson, a current user of zram, says that he
> would use frontswap instead of zram:
> https://lkml.org/lkml/2011/10/29/53
>
> Kame, can I add you to the list of people who support merging
> frontswap, assuming more good performance numbers are posted?

Before answering, let me explain my attitude toward this project.

As a hobby, I like this kind of work, which lets me imagine what kinds
of fancy new features it will allow us; that is why I reviewed the
patches.

But as someone who sells enterprise systems and support, I can't
recommend this to our customers. IIUC, cleancache/frontswap/zcache
hides its available resources from the user's view, making system
performance invisible and unpredictable. That's one of the reasons I
asked whether or not you have plans to make frontswap (and cleancache)
cgroup-aware. (Hmm, for a product which offers best-effort performance
to customers, this project may make sense. But I am not very
interested in best-effort service.)

I wonder if there could be a "static-size simple victim cache per
cgroup" project built on frontswap/cleancache; it would help every
user's workload isolation even where there is no VM, zcache, or tmem.
That would sound wonderful.

So, I'd like to ask whether you have any enhancement plans for the
future, rather than about "current" performance. The reason I hesitate
to say "Okay!" is that I can't see enterprise usage of this: a feature
which cannot be controlled by admins and which makes performance
prediction difficult on a busy system.

Thanks,
-Kame

^ permalink raw reply [flat|nested] 175+ messages in thread
* RE: [GIT PULL] mm: frontswap (for 3.2 window)
  2011-11-02  1:14 ` KAMEZAWA Hiroyuki
@ 2011-11-02 15:12 ` Dan Magenheimer
  0 siblings, 0 replies; 175+ messages in thread
From: Dan Magenheimer @ 2011-11-02 15:12 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, levinsasha928, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet, Neo Jia

> From: KAMEZAWA Hiroyuki [mailto:kamezawa.hiroyu@jp.fujitsu.com]

Hi Kame --

> > By the way, what your Fujitsu user-support guy suggests is exactly
> > what zram does. The author of zram (Nitin Gupta) agrees that
> > frontswap has many advantages over zram, see
> > https://lkml.org/lkml/2011/10/28/8, and he supports merging
> > frontswap. And Ed Tomlinson, a current user of zram, says that he
> > would use frontswap instead of zram:
> > https://lkml.org/lkml/2011/10/29/53
> >
> > Kame, can I add you to the list of people who support merging
> > frontswap, assuming more good performance numbers are posted?
>
> Before answering, let me explain my attitude toward this project.
>
> As a hobby, I like this kind of work, which lets me imagine what
> kinds of fancy new features it will allow us; that is why I reviewed
> the patches.
>
> But as someone who sells enterprise systems and support, I can't
> recommend this to our customers. IIUC, cleancache/frontswap/zcache
> hides its available resources from the user's view, making system
> performance invisible and unpredictable. That's one of the reasons I
> asked whether or not you have plans to make frontswap (and
> cleancache) cgroup-aware. (Hmm, for a product which offers
> best-effort performance to customers, this project may make sense.
> But I am not very interested in best-effort service.)

I agree that zcache is not a good choice for enterprise customers
trying to achieve predictable QoS. Tmem works to improve memory
efficiency (with the zcache backend) and/or to take advantage of
statistical variations in working sets across multiple virtual (Xen
backend and work-in-progress KVM backend) or physical (RAMster
backend) machines. So, you are correct: there will be some non-visible
and non-predictable effects of tmem.

In a strict QoS environment, the data center must ensure that all
resources are overprovisioned, including RAM. RAM on each machine must
exceed the peak working set on that machine or QoS guarantees won't be
met. Tmem has no value when RAM is "infinite", that is, when RAM can
be increased arbitrarily to ensure that it always exceeds the peak
working set.

Tmem has great value when RAM is sometimes less than the working set.
This is most obvious today in consolidated virtualization
environments, but (as shown in my presentations) it is increasingly
true of other system topologies as well. For example: resource
optimization across a broad set of users with unknown and time-varying
workloads (and thus working sets) is necessary for "cloud providers"
to profit. In many such environments, RAM is becoming the bottleneck,
and cloud providers can't ensure that RAM is "infinite". Cloud users
that require absolute control over their performance are instructed to
pay a much higher price to "rent" a physical server.

In some parts of the US (and I think in other countries as well),
electricity providers offer a discount to customers who are willing to
let the provider remotely disable their air conditioning units when
electricity demand peaks across the entire grid. Tmem allows cloud
providers to offer a similar feature to their users. This is neither
guaranteed QoS nor "best effort", but it allows the provider to expand
the capabilities of their data center as needed, rather than
predicting peak demand and pre-provisioning for it.

I agree that, IMHO, zcache is more for small single machines (possibly
mobile devices) where RAM is limited or at capacity and the workload
is bumping into that limit (resulting in swapping). Ed Tomlinson
presents a good example: https://lkml.org/lkml/2011/10/29/53
But IBM seems to be _very_ interested in zcache and is not in the
desktop business, so it is probably working on some cool server use
model that I've never thought of.

> I wonder if there could be a "static-size simple victim cache per
> cgroup" project built on frontswap/cleancache; it would help every
> user's workload isolation even where there is no VM, zcache, or
> tmem. That would sound wonderful.
>
> So, I'd like to ask whether you have any enhancement plans for the
> future, rather than about "current" performance. The reason I
> hesitate to say "Okay!" is that I can't see enterprise usage of
> this: a feature which cannot be controlled by admins and which makes
> performance prediction difficult on a busy system.

Personally, my only enhancement plan is to work on RAMster until it is
ready for the staging tree. But once the foundations of tmem
(frontswap and cleancache) are in-tree, I hope that you and other
developers will find other clever ways to exploit it. For example,
Larry Bassel's postings on linux-mm uncovered a new use for cleancache
that I had not considered (so I think cleancache now has five users).

> > Kame, can I add you to the list of people who support merging
> > frontswap, assuming more good performance numbers are posted?

So I'm not asking you whether Fujitsu's enterprise QoS-guarantee
customers will use zcache.... Andrew said yesterday:

"At kernel summit there was discussion and overall agreement that
we've been paying insufficient attention to the big-picture 'should we
include this feature at all' issues. We resolved to look more
intensely and critically at new features with a view to deciding
whether their usefulness justified their maintenance burden."

I am asking you, as an open-source Linux developer and a respected -mm
developer: do you think the usefulness of frontswap justifies the
maintenance burden, and that frontswap should be merged?

Dan

^ permalink raw reply [flat|nested] 175+ messages in thread
* Re: [GIT PULL] mm: frontswap (for 3.2 window)
  2011-11-02 15:12 ` Dan Magenheimer
@ 2011-11-04  4:19 ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 175+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-11-04 4:19 UTC (permalink / raw)
To: Dan Magenheimer
Cc: Linus Torvalds, linux-mm, LKML, Andrew Morton, Konrad Wilk, Jeremy Fitzhardinge, Seth Jennings, ngupta, levinsasha928, Chris Mason, JBeulich, Dave Hansen, Jonathan Corbet, Neo Jia

On Wed, 2 Nov 2011 08:12:01 -0700 (PDT) Dan Magenheimer <dan.magenheimer@oracle.com> wrote:

> > > Kame, can I add you to the list of people who support
> > > merging frontswap, assuming more good performance numbers
> > > are posted?
>
> So I'm not asking you if Fujitsu enterprise QoS-guarantee
> customers will use zcache.... Andrew said yesterday:
>
> "At kernel summit there was discussion and overall agreement
> that we've been paying insufficient attention to the
> big-picture "should we include this feature at all" issues.
> We resolved to look more intensely and critically at new
> features with a view to deciding whether their usefulness
> justified their maintenance burden."
>
> I am asking you, as an open source Linux developer and
> a respected -mm developer: do you think the usefulness
> of frontswap justifies the maintenance burden, and should
> frontswap be merged?

I will, when you convince the other guys that the design is good.

Reading the whole thread, it seems the other developers raise two problems:
1. justification of usage
2. API design

For 1, you'll need to show performance numbers and benefits. I think you have tried, and will again. But please take care of 2; it seems some guys (Rik and Andrea) have concerns.

Please CC me; I'd like to join the code review process, at least. I'd like to think of a new usage for frontswap/cleancache beneficial to enterprise users.

Thanks,
-Kame

^ permalink raw reply [flat|nested] 175+ messages in thread
* Re: [GIT PULL] mm: frontswap (for 3.2 window)
  2011-10-27 18:52 ` Dan Magenheimer
@ 2011-11-03 16:49 ` Jan Beulich
  -1 siblings, 0 replies; 175+ messages in thread
From: Jan Beulich @ 2011-11-03 16:49 UTC (permalink / raw)
To: Linus Torvalds, Dan Magenheimer
Cc: Neo Jia, levinsasha928, Jeremy Fitzhardinge, linux-mm, Andrew Morton, Dave Hansen, Seth Jennings, Jonathan Corbet, Chris Mason, Konrad Wilk, ngupta, LKML

>>> On 27.10.11 at 20:52, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> Hi Linus --
>
> Frontswap now has FOUR users: Two already merged in-tree (zcache
> and Xen) and two still in development but in public git trees
> (RAMster and KVM). Frontswap is part 2 of 2 of the core kernel
> changes required to support transcendent memory; part 1 was cleancache
> which you merged at 3.0 (and which now has FIVE users).
>
> Frontswap patches have been in linux-next since June 3 (with zero
> changes since Sep 22). First posted to lkml in June 2009, frontswap
> is now at version 11 and has incorporated feedback from a wide range
> of kernel developers. For a good overview, see
> http://lwn.net/Articles/454795.
> If further rationale is needed, please see the end of this email
> for more info.
>
> SO... Please pull:
>
> git://oss.oracle.com/git/djm/tmem.git #tmem
>
>...
> Linux kernel distros incorporating frontswap:
> - Oracle UEK 2.6.39 Beta:
>   http://oss.oracle.com/git/?p=linux-2.6-unbreakable-beta.git;a=summary
> - OpenSuSE since 11.2 (2009) [see mm/tmem-xen.c]
>   http://kernel.opensuse.org/cgit/kernel/

I've been away so I am too far behind to read this entire very long thread, but wanted to confirm that we've been carrying an earlier version of this code as indicated above and it would simplify our kernel maintenance if frontswap got merged. So please count me as supporting frontswap.

Thanks, Jan

> - a popular Gentoo distro
>   http://forums.gentoo.org/viewtopic-t-862105.html
>
> Xen distros supporting Linux guests with frontswap:
> - Xen hypervisor backend since Xen 4.0 (2009)
>   http://www.xen.org/files/Xen_4_0_Datasheet.pdf
> - OracleVM since 2.2 (2009)
>   http://twitter.com/#!/Djelibeybi/status/113876514688352256
>
> Public visibility for frontswap (as part of transcendent memory):
> - presented at OSDI'08, OLS'09, LCA'10, LPC'10, LinuxCon NA 11, Oracle
>   Open World 2011, two LSF/MM Summits (2010,2011), and three
>   Xen Summits (2009,2010,2011)
> - http://lwn.net/Articles/454795 (current overview)
> - http://lwn.net/Articles/386090 (2010)
> - http://lwn.net/Articles/340080 (2009)

^ permalink raw reply [flat|nested] 175+ messages in thread
* Re: [GIT PULL] mm: frontswap (for 3.2 window)
  2011-11-03 16:49 ` Jan Beulich
@ 2011-11-04  0:54 ` Andrew Morton
  -1 siblings, 0 replies; 175+ messages in thread
From: Andrew Morton @ 2011-11-04 0:54 UTC (permalink / raw)
To: Jan Beulich
Cc: Linus Torvalds, Dan Magenheimer, Neo Jia, levinsasha928, Jeremy Fitzhardinge, linux-mm, Dave Hansen, Seth Jennings, Jonathan Corbet, Chris Mason, Konrad Wilk, ngupta, LKML

On Thu, 03 Nov 2011 16:49:27 +0000 "Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 27.10.11 at 20:52, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> > Hi Linus --
> >
> > Frontswap now has FOUR users: Two already merged in-tree (zcache
> > and Xen) and two still in development but in public git trees
> > (RAMster and KVM). Frontswap is part 2 of 2 of the core kernel
> > changes required to support transcendent memory; part 1 was cleancache
> > which you merged at 3.0 (and which now has FIVE users).
> >
> > Frontswap patches have been in linux-next since June 3 (with zero
> > changes since Sep 22). First posted to lkml in June 2009, frontswap
> > is now at version 11 and has incorporated feedback from a wide range
> > of kernel developers. For a good overview, see
> > http://lwn.net/Articles/454795.
> > If further rationale is needed, please see the end of this email
> > for more info.
> >
> > SO... Please pull:
> >
> > git://oss.oracle.com/git/djm/tmem.git #tmem
> >
> >...
> > Linux kernel distros incorporating frontswap:
> > - Oracle UEK 2.6.39 Beta:
> >   http://oss.oracle.com/git/?p=linux-2.6-unbreakable-beta.git;a=summary
> > - OpenSuSE since 11.2 (2009) [see mm/tmem-xen.c]
> >   http://kernel.opensuse.org/cgit/kernel/
>
> I've been away so I am too far behind to read this entire
> very long thread, but wanted to confirm that we've been
> carrying an earlier version of this code as indicated above
> and it would simplify our kernel maintenance if frontswap
> got merged. So please count me as supporting frontswap.

Are you able to tell us *why* you're carrying it, and what benefit it is providing to your users?

^ permalink raw reply [flat|nested] 175+ messages in thread
* Re: [GIT PULL] mm: frontswap (for 3.2 window)
  2011-11-04  0:54 ` Andrew Morton
@ 2011-11-04  8:49 ` Jan Beulich
  -1 siblings, 0 replies; 175+ messages in thread
From: Jan Beulich @ 2011-11-04 8:49 UTC (permalink / raw)
To: Andrew Morton
Cc: Neo Jia, levinsasha928, Jeremy Fitzhardinge, linux-mm, Linus Torvalds, Dave Hansen, Seth Jennings, Jonathan Corbet, Chris Mason, Dan Magenheimer, Konrad Wilk, ngupta, LKML

>>> On 04.11.11 at 01:54, Andrew Morton <akpm@linux-foundation.org> wrote:
> On Thu, 03 Nov 2011 16:49:27 +0000 "Jan Beulich" <JBeulich@suse.com> wrote:
>
>> >>> On 27.10.11 at 20:52, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
>> > Hi Linus --
>> >
>> > Frontswap now has FOUR users: Two already merged in-tree (zcache
>> > and Xen) and two still in development but in public git trees
>> > (RAMster and KVM). Frontswap is part 2 of 2 of the core kernel
>> > changes required to support transcendent memory; part 1 was cleancache
>> > which you merged at 3.0 (and which now has FIVE users).
>> >
>> > Frontswap patches have been in linux-next since June 3 (with zero
>> > changes since Sep 22). First posted to lkml in June 2009, frontswap
>> > is now at version 11 and has incorporated feedback from a wide range
>> > of kernel developers. For a good overview, see
>> > http://lwn.net/Articles/454795.
>> > If further rationale is needed, please see the end of this email
>> > for more info.
>> >
>> > SO... Please pull:
>> >
>> > git://oss.oracle.com/git/djm/tmem.git #tmem
>> >
>> >...
>> > Linux kernel distros incorporating frontswap:
>> > - Oracle UEK 2.6.39 Beta:
>> >   http://oss.oracle.com/git/?p=linux-2.6-unbreakable-beta.git;a=summary
>> > - OpenSuSE since 11.2 (2009) [see mm/tmem-xen.c]
>> >   http://kernel.opensuse.org/cgit/kernel/
>>
>> I've been away so I am too far behind to read this entire
>> very long thread, but wanted to confirm that we've been
>> carrying an earlier version of this code as indicated above
>> and it would simplify our kernel maintenance if frontswap
>> got merged. So please count me as supporting frontswap.
>
> Are you able to tell us *why* you're carrying it, and what benefit it
> is providing to your users?

Because we're supporting/using Xen, where this (within the general tmem picture) allows for better overall memory utilization.

Jan

^ permalink raw reply [flat|nested] 175+ messages in thread
* Re: [GIT PULL] mm: frontswap (for 3.2 window)
@ 2011-11-04 12:37 Clayton Weaver
  0 siblings, 0 replies; 175+ messages in thread
From: Clayton Weaver @ 2011-11-04 12:37 UTC (permalink / raw)
To: linux-kernel

"So where can I buy Network Attached RAM and skip all of this byzantine VM complication?"

So let me see if I have this right: when the frontswap backend fills up, the current design would force dumping newer pages to real on-disk swap (to avoid OOM), possibly compressed, while keeping older pages in the compressed RAM swap cache?

It seems like it should instead dump (blocksize/pagesize) * pagesize multiples of its oldest compressed pages to disk, and then store and compress the new pages that are submitted to it, thus preserving the "least recently used" logic in the frontswap backend.

A backend to frontswap should not be able to fail a put at all (unless the whole machine or container is OOM and no physical swap is configured, so the backend contains no pages and has no space to allocate from).

-- Clayton Weaver cgweav at fastmail dot fm

^ permalink raw reply [flat|nested] 175+ messages in thread
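[To make the eviction scheme described above concrete, here is a minimal userspace sketch of a put that never fails: when its pool is full, the backend first writes its oldest (least recently stored) compressed pages back to real swap, then admits the new page. This is illustrative only -- the structure and function names are hypothetical, it is not the in-kernel frontswap API, and the "compression" is faked.]

/*
 * Illustrative sketch only -- NOT the real frontswap API.
 * Models the suggestion above: a "put" that never fails because
 * the backend evicts its oldest (LRU) compressed pages to real
 * on-disk swap whenever its memory pool is full.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define POOL_PAGES 4           /* tiny pool so eviction is visible */
#define PAGE_SIZE  4096

struct cpage {                 /* one compressed page in the pool */
    unsigned long offset;      /* swap offset this page came from */
    size_t clen;               /* compressed length */
    char data[PAGE_SIZE];      /* compressed bytes (worst case) */
    struct cpage *next;        /* LRU list: head = oldest */
};

static struct cpage *lru_head, *lru_tail;
static int pool_count;

/* Stand-in for writing one page out to the real swap device. */
static void writeback_to_disk(struct cpage *cp)
{
    printf("evict: offset %lu (%zu bytes) -> disk swap\n",
           cp->offset, cp->clen);
    free(cp);
}

/* Evict the least-recently-stored page to make room. */
static void evict_oldest(void)
{
    struct cpage *cp = lru_head;
    lru_head = cp->next;
    if (!lru_head)
        lru_tail = NULL;
    pool_count--;
    writeback_to_disk(cp);
}

/*
 * The point of the sketch: this put cannot fail.  If the pool is
 * full, it first pushes the oldest page(s) out to disk, preserving
 * LRU order, then stores the new page.
 */
static int put_never_fails(unsigned long offset, const char *page)
{
    struct cpage *cp;

    while (pool_count >= POOL_PAGES)
        evict_oldest();

    cp = calloc(1, sizeof(*cp));
    if (!cp)
        return -1;             /* only true machine-wide OOM stops us */
    cp->offset = offset;
    cp->clen = PAGE_SIZE / 2;  /* pretend 2:1 compression */
    memcpy(cp->data, page, PAGE_SIZE);

    cp->next = NULL;           /* append at tail = newest */
    if (lru_tail)
        lru_tail->next = cp;
    else
        lru_head = cp;
    lru_tail = cp;
    pool_count++;
    return 0;
}

int main(void)
{
    char page[PAGE_SIZE] = {0};
    for (unsigned long off = 0; off < 10; off++)
        put_never_fails(off, page);
    return 0;
}

[The while loop is the crux of the "put cannot fail" property: room is always made before the new page is admitted, so only a true allocation failure -- the caveat noted in the message above -- can stop a put.]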
* Re: [GIT PULL] mm: frontswap (for 3.2 window)
@ 2011-11-05 17:08 Clayton Weaver
  0 siblings, 0 replies; 175+ messages in thread
From: Clayton Weaver @ 2011-11-05 17:08 UTC (permalink / raw)
To: linux-kernel

(NB: My only dog in this hunt is the length of this thread.)

When swapping to rotating media, all swapped pages have the same age. Is there any performance reason to keep this property when swapping to in-memory swap space that is backed by rotating media or some other longer-latency swap space for worst-case storage? Is there any performance reason to extend LRU logic to this type of low-latency/high-latency swap? Seems like an obvious question.

Will all of these potential frontswap backends want page compression? (Should it be factored out into a common page compression implementation that anything can use? Does this already exist? How many pages should it operate on at one time, batched together to get higher average compression ratios?)

-- Clayton Weaver cgweav at fastmail dot fm

^ permalink raw reply [flat|nested] 175+ messages in thread
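[On the batching question, one rough way to see the effect is to compress a set of pages one at a time and then as a single stream, and compare totals. The sketch below does this in userspace with zlib's compress(); the batch size and page payload are arbitrary assumptions for illustration, not a description of what any existing frontswap backend does.]

/*
 * Sketch for the "common page compression" question: compressing a
 * batch of pages as one buffer usually beats compressing each page
 * alone, because the dictionary can span page boundaries.
 * Build with -lz.
 */
#include <stdio.h>
#include <zlib.h>

#define PAGE_SIZE 4096
#define BATCH     8

int main(void)
{
    static unsigned char pages[BATCH][PAGE_SIZE];
    static unsigned char out[BATCH * PAGE_SIZE * 2];
    uLongf olen;
    uLong per_page_total = 0;

    /* Fill pages with similar, compressible content. */
    for (int i = 0; i < BATCH; i++)
        for (int j = 0; j < PAGE_SIZE; j++)
            pages[i][j] = "swap page payload "[j % 18];

    /* One page at a time. */
    for (int i = 0; i < BATCH; i++) {
        olen = sizeof(out);
        compress(out, &olen, pages[i], PAGE_SIZE);
        per_page_total += olen;
    }

    /* Whole batch as a single stream. */
    olen = sizeof(out);
    compress(out, &olen, &pages[0][0], BATCH * PAGE_SIZE);

    printf("per-page total: %lu bytes, batched: %lu bytes\n",
           per_page_total, (unsigned long)olen);
    return 0;
}

[The batched stream generally wins on ratio, but the trade-off is granularity: a single page can then only be decompressed by fetching the whole batch, which matters for a swap cache servicing one-page faults.]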