* Frontswap [PATCH 0/4] (was Transcendent Memory): overview
@ 2010-04-22 13:42 ` Dan Magenheimer
  0 siblings, 0 replies; 163+ messages in thread
From: Dan Magenheimer @ 2010-04-22 13:42 UTC (permalink / raw)
  To: linux-kernel, linux-mm, jeremy, hugh.dickins, ngupta, JBeulich,
	chris.mason, kurt.hackel, dave.mccracken, npiggin, akpm, riel

Frontswap [PATCH 0/4] (was Transcendent Memory): overview

Patch applies to 2.6.34-rc5

In previous patch postings, frontswap was part of the Transcendent
Memory ("tmem") patchset.  This patchset refocuses not on the underlying
technology (tmem) but instead on the useful functionality provided for Linux,
and provides a clean API so that frontswap can provide this very useful
functionality via a Xen tmem driver OR completely independent of tmem.
For example: Nitin Gupta (of compcache and ramzswap fame) is implementing
an in-kernel compression "backend" for frontswap; some believe
frontswap will be a very nice interface for building RAM-like functionality
for pseudo-RAM devices such as SSD or phase-change memory; and a Pune
University team is looking at a backend for virtio (see OLS'2010).

A more complete description of frontswap can be found in the introductory
comment in mm/frontswap.c (in PATCH 2/4) which is included below
for convenience.

Note that an earlier version of this patch is now shipping in OpenSuSE 11.2
and will soon ship in a release of Oracle Enterprise Linux.  Underlying
tmem technology is now shipping in Oracle VM 2.2 and was just released
in Xen 4.0 on April 15, 2010.  (Search news.google.com for Transcendent
Memory)

Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org>

 include/linux/frontswap.h |   98 ++++++++++++++
 include/linux/swap.h      |    2 
 include/linux/swapfile.h  |   13 +
 mm/Kconfig                |   16 ++
 mm/Makefile               |    1 
 mm/frontswap.c            |  301 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/page_io.c              |   12 +
 mm/swap.c                 |    4 
 mm/swapfile.c             |   58 +++++++-
 9 files changed, 496 insertions(+), 9 deletions(-)

Frontswap is so named because it can be thought of as the opposite of
a "backing" store for a swap device.  The storage is assumed to be
a synchronous concurrency-safe page-oriented pseudo-RAM device (such as
Xen's Transcendent Memory, aka "tmem", or in-kernel compressed memory,
aka "zmem", or other RAM-like devices) which is not directly accessible
or addressable by the kernel and is of unknown and possibly time-varying
size.  This pseudo-RAM device links itself to frontswap by setting the
frontswap_ops pointer appropriately; the functions it provides must
conform to the following policies:

An "init" prepares the pseudo-RAM to receive frontswap pages and returns
a non-negative pool id, used for all swap device numbers (aka "type").
A "put_page" will copy the page to pseudo-RAM and associate it with
the type and offset associated with the page. A "get_page" will copy the
page, if found, from pseudo-RAM into kernel memory, but will NOT remove
the page from pseudo-RAM.  A "flush_page" will remove the page from
pseudo-RAM and a "flush_area" will remove ALL pages associated with the
swap type (e.g., like swapoff) and notify the pseudo-RAM device to refuse
further puts with that swap type.
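
In code, the ops table described above looks roughly like the sketch
below.  This is illustrative only: the authoritative definition is in
include/linux/frontswap.h (PATCH 1/4), and the simplified types used
here (unsigned long offset, void *page) are placeholders rather than
the actual kernel signatures.

struct frontswap_ops {
	int  (*init)(unsigned type);        /* returns a non-negative pool id */
	int  (*put_page)(unsigned type, unsigned long offset, void *page);
	int  (*get_page)(unsigned type, unsigned long offset, void *page);
	void (*flush_page)(unsigned type, unsigned long offset);
	void (*flush_area)(unsigned type);  /* e.g. at swapoff time */
};

/* A backend (Xen tmem, zmem, ...) links itself in by setting this pointer. */
extern struct frontswap_ops *frontswap_ops;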

Once a page is successfully put, a matching get on the page will always
succeed.  So when the kernel finds itself in a situation where it needs
to swap out a page, it first attempts to use frontswap.  If the put returns
non-zero, the data has been successfully saved to pseudo-RAM; a disk write
is avoided, and so is a disk read if the data is later read back.  If a put
returns zero, pseudo-RAM has rejected the data, and the page is written to
swap as usual.
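
The resulting swap-out decision is simple.  The sketch below shows only
the shape of the hook, not the actual patch (the real changes touch
mm/page_io.c and mm/swapfile.c per the diffstat above), and
write_to_swap_device() is a hypothetical stand-in for the normal swap
I/O path.

extern int write_to_swap_device(unsigned type, unsigned long offset,
				void *page);    /* hypothetical disk path */

int swap_out_page(unsigned type, unsigned long offset, void *page)
{
	/* Non-zero return: the page is now held in pseudo-RAM. */
	if (frontswap_ops && frontswap_ops->put_page(type, offset, page) != 0)
		return 0;       /* disk write (and any later read) avoided */

	/* Zero return: pseudo-RAM rejected the page, fall back to disk. */
	return write_to_swap_device(type, offset, page);
}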

Note that if a page is put and the page already exists in pseudo-RAM
(a "duplicate" put), either the put succeeds and the data is overwritten,
or the put fails AND the page is flushed.  This ensures stale data may
never be obtained from pseudo-RAM.
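
From the backend's point of view, the duplicate-put rule amounts to the
following sketch; backend_store() and backend_flush() are hypothetical
helper names, not part of the patchset.

extern int  backend_store(unsigned type, unsigned long offset, void *page);
extern void backend_flush(unsigned type, unsigned long offset);

int backend_put_page(unsigned type, unsigned long offset, void *page)
{
	if (!backend_store(type, offset, page)) {   /* overwrite attempt failed */
		backend_flush(type, offset);        /* drop the stale old copy */
		return 0;                           /* report failure to the kernel */
	}
	return 1;                                   /* success: new data is stored */
}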

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-22 13:42 ` Dan Magenheimer
@ 2010-04-22 15:28   ` Avi Kivity
  -1 siblings, 0 replies; 163+ messages in thread
From: Avi Kivity @ 2010-04-22 15:28 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: linux-kernel, linux-mm, jeremy, hugh.dickins, ngupta, JBeulich,
	chris.mason, kurt.hackel, dave.mccracken, npiggin, akpm, riel

On 04/22/2010 04:42 PM, Dan Magenheimer wrote:
> Frontswap is so named because it can be thought of as the opposite of
> a "backing" store for a swap device.  The storage is assumed to be
> a synchronous concurrency-safe page-oriented pseudo-RAM device (such as
> Xen's Transcendent Memory, aka "tmem", or in-kernel compressed memory,
> aka "zmem", or other RAM-like devices) which is not directly accessible
> or addressable by the kernel and is of unknown and possibly time-varying
> size.  This pseudo-RAM device links itself to frontswap by setting the
> frontswap_ops pointer appropriately and the functions it provides must
> conform to certain policies as follows:
>    

How baked in is the synchronous requirement?  Memory, for example, can 
be asynchronous if it is copied by a dma engine, and since there are 
hardware encryption engines, there may be hardware compression engines 
in the future.


-- 
error compiling committee.c: too many arguments to function


* RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-22 15:28   ` Avi Kivity
@ 2010-04-22 15:48     ` Dan Magenheimer
  -1 siblings, 0 replies; 163+ messages in thread
From: Dan Magenheimer @ 2010-04-22 15:48 UTC (permalink / raw)
  To: Avi Kivity
  Cc: linux-kernel, linux-mm, jeremy, hugh.dickins, ngupta, JBeulich,
	chris.mason, kurt.hackel, dave.mccracken, npiggin, akpm, riel

> > a synchronous concurrency-safe page-oriented pseudo-RAM device (such
> >  :
> > conform to certain policies as follows:
> 
> How baked in is the synchronous requirement?  Memory, for example, can
> be asynchronous if it is copied by a dma engine, and since there are
> hardware encryption engines, there may be hardware compression engines
> in the future.

Thanks for the comment!

Synchronous is required, but likely could be simulated by ensuring all
coherency (and concurrency) requirements are met by some intermediate
"buffering driver" -- at the cost of an extra page copy into a buffer
and overhead of tracking the handles (poolid/inode/index) of pages in
the buffer that are "in flight".  This is an approach we are considering
to implement an SSD backend, but hasn't been tested yet so, ahem, the
proof will be in the put'ing. ;-)
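
As a rough illustration of that buffering idea (hypothetical names, no
locking, fixed 4KB pages, and not part of this patchset): the put copies
the page into a staging slot and returns synchronously, an asynchronous
worker (not shown) drains the slots to the real device later, and gets
must check the in-flight slots first so stale data is never returned.

#include <string.h>

#define BUF_PAGE_SIZE	4096
#define NR_SLOTS	64

struct inflight_slot {
	int used;
	unsigned pool_id;
	unsigned long index;
	unsigned char data[BUF_PAGE_SIZE];
};

static struct inflight_slot slots[NR_SLOTS];

int buffered_put(unsigned pool_id, unsigned long index, const void *page)
{
	for (int i = 0; i < NR_SLOTS; i++) {
		if (!slots[i].used) {
			slots[i].pool_id = pool_id;
			slots[i].index = index;
			memcpy(slots[i].data, page, BUF_PAGE_SIZE);
			slots[i].used = 1;	/* worker flushes this later */
			return 1;		/* success: caller may reuse the page */
		}
	}
	return 0;	/* buffer full: reject, caller writes to swap instead */
}

int buffered_get(unsigned pool_id, unsigned long index, void *page)
{
	for (int i = 0; i < NR_SLOTS; i++) {
		if (slots[i].used && slots[i].pool_id == pool_id &&
		    slots[i].index == index) {
			memcpy(page, slots[i].data, BUF_PAGE_SIZE);
			return 1;	/* served from the in-flight buffer */
		}
	}
	return 0;	/* not in flight: ask the asynchronous device */
}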

Dan

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-22 15:48     ` Dan Magenheimer
@ 2010-04-22 16:13       ` Avi Kivity
  -1 siblings, 0 replies; 163+ messages in thread
From: Avi Kivity @ 2010-04-22 16:13 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: linux-kernel, linux-mm, jeremy, hugh.dickins, ngupta, JBeulich,
	chris.mason, kurt.hackel, dave.mccracken, npiggin, akpm, riel

On 04/22/2010 06:48 PM, Dan Magenheimer wrote:
>>> a synchronous concurrency-safe page-oriented pseudo-RAM device (such
>>>   :
>>> conform to certain policies as follows:
>>>        
>> How baked in is the synchronous requirement?  Memory, for example, can
>> be asynchronous if it is copied by a dma engine, and since there are
>> hardware encryption engines, there may be hardware compression engines
>> in the future.
>>      
> Thanks for the comment!
>
> Synchronous is required, but likely could be simulated by ensuring all
> coherency (and concurrency) requirements are met by some intermediate
> "buffering driver" -- at the cost of an extra page copy into a buffer
> and overhead of tracking the handles (poolid/inode/index) of pages in
> the buffer that are "in flight".  This is an approach we are considering
> to implement an SSD backend, but hasn't been tested yet so, ahem, the
> proof will be in the put'ing. ;-)
>    

Well, copying memory so you can use a zero-copy dma engine is 
counterproductive.

Much easier to simulate an asynchronous API with a synchronous backend.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


* RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-22 16:13       ` Avi Kivity
@ 2010-04-22 20:15         ` Dan Magenheimer
  -1 siblings, 0 replies; 163+ messages in thread
From: Dan Magenheimer @ 2010-04-22 20:15 UTC (permalink / raw)
  To: Avi Kivity
  Cc: linux-kernel, linux-mm, jeremy, hugh.dickins, ngupta, JBeulich,
	chris.mason, kurt.hackel, dave.mccracken, npiggin, akpm, riel

> > Synchronous is required, but likely could be simulated by ensuring all
> > coherency (and concurrency) requirements are met by some intermediate
> > "buffering driver" -- at the cost of an extra page copy into a buffer
> > and overhead of tracking the handles (poolid/inode/index) of pages in
> > the buffer that are "in flight".  This is an approach we are considering
> > to implement an SSD backend, but hasn't been tested yet so, ahem, the
> > proof will be in the put'ing. ;-)
> 
> Much easier to simulate an asynchronous API with a synchronous backend.

Indeed.  But an asynchronous API is not appropriate for frontswap
(or cleancache).  The reason the hooks are so simple is because they
are assumed to be synchronous so that the page can be immediately
freed/reused.
 
> Well, copying memory so you can use a zero-copy dma engine is
> counterproductive.

Yes, but for something like an SSD where copying can be used to
build up a full 64K write, the cost of copying memory may not be
counterproductive.

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-22 20:15         ` Dan Magenheimer
@ 2010-04-23  9:48           ` Avi Kivity
  -1 siblings, 0 replies; 163+ messages in thread
From: Avi Kivity @ 2010-04-23  9:48 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: linux-kernel, linux-mm, jeremy, hugh.dickins, ngupta, JBeulich,
	chris.mason, kurt.hackel, dave.mccracken, npiggin, akpm, riel

On 04/22/2010 11:15 PM, Dan Magenheimer wrote:
>>
>> Much easier to simulate an asynchronous API with a synchronous backend.
>>      
> Indeed.  But an asynchronous API is not appropriate for frontswap
> (or cleancache).  The reason the hooks are so simple is because they
> are assumed to be synchronous so that the page can be immediately
> freed/reused.
>    

Swapping is inherently asynchronous, so we'll have to wait for that to 
complete anyway (as frontswap does not guarantee swap-in will succeed).  
I don't doubt it makes things simpler, but also less flexible and useful.

Something else that bothers me is the double swapping.  Sure we're 
making swapin faster, but we're still loading the io subsystem with 
writes.  Much better to make swap-to-ram authoritative (and have the 
hypervisor swap it to disk if it needs the memory).

>> Well, copying memory so you can use a zero-copy dma engine is
>> counterproductive.
>>      
> Yes, but for something like an SSD where copying can be used to
> build up a full 64K write, the cost of copying memory may not be
> counterproductive.
>    

I don't understand.  Please clarify.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


* RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-23  9:48           ` Avi Kivity
@ 2010-04-23 13:47             ` Dan Magenheimer
  -1 siblings, 0 replies; 163+ messages in thread
From: Dan Magenheimer @ 2010-04-23 13:47 UTC (permalink / raw)
  To: Avi Kivity
  Cc: linux-kernel, linux-mm, jeremy, hugh.dickins, ngupta, JBeulich,
	chris.mason, kurt.hackel, dave.mccracken, npiggin, akpm, riel

> >> Much easier to simulate an asynchronous API with a synchronous backend.
> >>
> > Indeed.  But an asynchronous API is not appropriate for frontswap
> > (or cleancache).  The reason the hooks are so simple is because they
> > are assumed to be synchronous so that the page can be immediately
> > freed/reused.
> >
> 
> Swapping is inherently asynchronous, so we'll have to wait for that to
> complete anyway (as frontswap does not guarantee swap-in will succeed).
> I don't doubt it makes things simpler, but also less flexible and
> useful.
> 
> Something else that bothers me is the double swapping.  Sure we're
> making swapin faster, but we're still loading the io subsystem with
> writes.  Much better to make swap-to-ram authoritative (and have the
> hypervisor swap it to disk if it needs the memory).

Hmmm.... I now realize you are thinking of applying frontswap to
a hosted hypervisor (e.g. KVM). Using frontswap with a bare-metal
hypervisor (e.g. Xen) works fully synchronously, guarantees swap-in
will succeed, never double-swaps, and doesn't load the io subsystem
with writes.  This all works very nicely today with a fully
synchronous "backend" (e.g. with tmem in Xen 4.0).

So, I agree, hiding a truly asynchronous interface behind
frontswap's synchronous interface may have some thorny issues.
I wasn't recommending that it should be done, just speculating
how it might be done.  This doesn't make frontswap any less
useful with a fully synchronous "backend".

> >> Well, copying memory so you can use a zero-copy dma engine is
> >> counterproductive.
> >>
> > Yes, but for something like an SSD where copying can be used to
> > build up a full 64K write, the cost of copying memory may not be
> > counterproductive.
> 
> I don't understand.  Please clarify.

If I understand correctly, SSDs work much more efficiently when
writing 64KB blocks.  So much more efficiently in fact that waiting
to collect 16 4KB pages (by first copying them to fill a 64KB buffer)
will be faster than page-at-a-time DMA'ing them.  If so, the
frontswap interface, backed by an asynchronous "buffering layer"
which collects 16 pages before writing to the SSD, may work
very nicely.  Again this is still just speculation... I was
only pointing out that zero-copy DMA may not always be the best
solution.
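
As a sketch of what such a collecting layer might do (hypothetical
names, single-threaded, and omitting the per-page handle tracking and
error recovery a real implementation would need):

#include <string.h>

#define SSD_PAGE_SIZE	4096
#define PAGES_PER_BUF	16	/* 16 x 4KB = one 64KB SSD-friendly write */

static unsigned char batch[PAGES_PER_BUF * SSD_PAGE_SIZE];
static int batched_pages;

/* Hypothetical SSD backend call; assume it returns 0 on success. */
extern int ssd_write_64k(const void *buf);

int collecting_put(const void *page)
{
	memcpy(batch + batched_pages * SSD_PAGE_SIZE, page, SSD_PAGE_SIZE);

	if (++batched_pages == PAGES_PER_BUF) {
		batched_pages = 0;
		if (ssd_write_64k(batch) != 0)
			return 0;	/* batch write failed: report rejection */
	}
	return 1;	/* page accepted; it reaches the SSD with its batch */
}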

Thanks,
Dan

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-23 13:47             ` Dan Magenheimer
@ 2010-04-23 13:57               ` Avi Kivity
  -1 siblings, 0 replies; 163+ messages in thread
From: Avi Kivity @ 2010-04-23 13:57 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: linux-kernel, linux-mm, jeremy, hugh.dickins, ngupta, JBeulich,
	chris.mason, kurt.hackel, dave.mccracken, npiggin, akpm, riel

On 04/23/2010 04:47 PM, Dan Magenheimer wrote:
>>>> Much easier to simulate an asynchronous API with a synchronous backend.
>>> Indeed.  But an asynchronous API is not appropriate for frontswap
>>> (or cleancache).  The reason the hooks are so simple is because they
>>> are assumed to be synchronous so that the page can be immediately
>>> freed/reused.
>>>
>>>        
>> Swapping is inherently asynchronous, so we'll have to wait for that to
>> complete anyway (as frontswap does not guarantee swap-in will succeed).
>> I don't doubt it makes things simpler, but also less flexible and
>> useful.
>>
>> Something else that bothers me is the double swapping.  Sure we're
>> making swapin faster, but we're still loading the io subsystem with 
>> writes.  Much better to make swap-to-ram authoritative (and have the
>> hypervisor swap it to disk if it needs the memory).
>>      
> Hmmm.... I now realize you are thinking of applying frontswap to
> a hosted hypervisor (e.g. KVM). Using frontswap with a bare-metal
> hypervisor (e.g. Xen) works fully synchronously, guarantees swap-in
> will succeed, never double-swaps, and doesn't load the io subsystem
> with writes.  This all works very nicely today with a fully
> synchronous "backend" (e.g. with tmem in Xen 4.0).
>    

Perhaps I misunderstood.  Isn't frontswap in front of the normal swap 
device?  So we do have double swapping, first to frontswap (which is in 
memory, yes, but still a nonzero cost), then the normal swap device.  
The io subsystem is loaded with writes; you only save the reads.

Better to swap to the hypervisor, and make it responsible for committing 
to disk on overcommit or keeping in RAM when memory is available.  This 
way we avoid the write to disk if memory is in fact available (or at 
least defer it until later).  This way you avoid both reads and writes 
if memory is available.

>>>> Well, copying memory so you can use a zero-copy dma engine is
>>>> counterproductive.
>>>>
>>>>          
>>> Yes, but for something like an SSD where copying can be used to
>>> build up a full 64K write, the cost of copying memory may not be
>>> counterproductive.
>>>        
>> I don't understand.  Please clarify.
>>      
> If I understand correctly, SSDs work much more efficiently when
> writing 64KB blocks.  So much more efficiently in fact that waiting
> to collect 16 4KB pages (by first copying them to fill a 64KB buffer)
> will be faster than page-at-a-time DMA'ing them.  If so, the
> frontswap interface, backed by an asynchronous "buffering layer"
> which collects 16 pages before writing to the SSD, may work
> very nicely.  Again this is still just speculation... I was
> only pointing out that zero-copy DMA may not always be the best
> solution.
>    

The guest can easily (and should) issue 64k dmas using scatter/gather.  
No need for copying.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


* RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-23 13:57               ` Avi Kivity
@ 2010-04-23 14:43                 ` Dan Magenheimer
  -1 siblings, 0 replies; 163+ messages in thread
From: Dan Magenheimer @ 2010-04-23 14:43 UTC (permalink / raw)
  To: Avi Kivity
  Cc: linux-kernel, linux-mm, jeremy, hugh.dickins, ngupta, JBeulich,
	chris.mason, kurt.hackel, dave.mccracken, npiggin, akpm, riel

> >> Something else that bothers me is the double swapping.  Sure we're
> >> making swapin faster, but we're still loading the io subsystem with
> >> writes.  Much better to make swap-to-ram authoritative (and have the
> >> hypervisor swap it to disk if it needs the memory).
> >>
> > Hmmm.... I now realize you are thinking of applying frontswap to
> > a hosted hypervisor (e.g. KVM). Using frontswap with a bare-metal
> > hypervisor (e.g. Xen) works fully synchronously, guarantees swap-in
> > will succeed, never double-swaps, and doesn't load the io subsystem
> > with writes.  This all works very nicely today with a fully
> > synchronous "backend" (e.g. with tmem in Xen 4.0).
> 
> Perhaps I misunderstood.  Isn't frontswap in front of the normal swap
> device?  So we do have double swapping, first to frontswap (which is in
> memory, yes, but still a nonzero cost), then the normal swap device.
> The io subsystem is loaded with writes; you only save the reads.
> Better to swap to the hypervisor, and make it responsible for
> committing
> to disk on overcommit or keeping in RAM when memory is available.  This
> way we avoid the write to disk if memory is in fact available (or at
> least defer it until later).  This way you avoid both reads and writes
> if memory is available.

Each page is either in frontswap OR on the normal swap device,
never both.  So, yes, both reads and writes are avoided if memory
is available and there is no write issued to the io subsystem if
memory is available.  The is_memory_available decision is determined
by the hypervisor dynamically for each page when the guest attempts
a "frontswap_put".  So, yes, you are indeed "swapping to the
hypervisor" but, at least in the case of Xen, the hypervisor
never swaps any memory to disk so there is never double swapping.
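
A sketch of that hypervisor-side decision, using the hypothetical names
from this discussion (the real logic lives in Xen's tmem implementation,
not in this patchset):

extern int is_memory_available(void);	/* hypervisor policy check */
extern int tmem_store_page(unsigned pool_id, unsigned long index,
			   const void *page);

int hypervisor_frontswap_put(unsigned pool_id, unsigned long index,
			     const void *page)
{
	if (!is_memory_available())
		return 0;	/* reject: guest falls back to its own swap disk */

	/* Accept: the hypervisor keeps the page in RAM, never on disk. */
	return tmem_store_page(pool_id, index, page);
}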
 
> > If I understand correctly, SSDs work much more efficiently when
> > writing 64KB blocks.  So much more efficiently in fact that waiting
> > to collect 16 4KB pages (by first copying them to fill a 64KB buffer)
> > will be faster than page-at-a-time DMA'ing them.  If so, the
> > frontswap interface, backed by an asynchronous "buffering layer"
> > which collects 16 pages before writing to the SSD, may work
> > very nicely.  Again this is still just speculation... I was
> > only pointing out that zero-copy DMA may not always be the best
> > solution.
> 
> The guest can easily (and should) issue 64k dmas using scatter/gather.
> No need for copying.

In many cases, this is true.  For the swap subsystem, it may not always
be true, though I see recent signs that it may be headed in that
direction.  In any case, unless you see this SSD discussion as
critical to the proposed acceptance of the frontswap patchset,
let's table it until there's some prototyping done.

Thanks,
Dan

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-23 14:43                 ` Dan Magenheimer
@ 2010-04-23 14:52                   ` Avi Kivity
  -1 siblings, 0 replies; 163+ messages in thread
From: Avi Kivity @ 2010-04-23 14:52 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: linux-kernel, linux-mm, jeremy, hugh.dickins, ngupta, JBeulich,
	chris.mason, kurt.hackel, dave.mccracken, npiggin, akpm, riel

On 04/23/2010 05:43 PM, Dan Magenheimer wrote:
>>
>> Perhaps I misunderstood.  Isn't frontswap in front of the normal swap
>> device?  So we do have double swapping, first to frontswap (which is in
>> memory, yes, but still a nonzero cost), then the normal swap device.
>> The io subsystem is loaded with writes; you only save the reads.
>> Better to swap to the hypervisor, and make it responsible for
>> committing
>> to disk on overcommit or keeping in RAM when memory is available.  This
>> way we avoid the write to disk if memory is in fact available (or at
>> least defer it until later).  This way you avoid both reads and writes
>> if memory is available.
>>      
> Each page is either in frontswap OR on the normal swap device,
> never both.  So, yes, both reads and writes are avoided if memory
> is available and there is no write issued to the io subsystem if
> memory is available.  The is_memory_available decision is determined
> by the hypervisor dynamically for each page when the guest attempts
> a "frontswap_put".  So, yes, you are indeed "swapping to the
> hypervisor" but, at least in the case of Xen, the hypervisor
> never swaps any memory to disk so there is never double swapping.
>    

I see.  So why not implement this as an ordinary swap device, with a 
higher priority than the disk device?  This way we reuse an API and keep 
things asynchronous, instead of introducing a special purpose API.

Doesn't this commit the hypervisor to retain this memory?  If so, isn't 
it simpler to give the page to the guest (so now it doesn't need to swap 
at all)?

What about live migration?  Do you live migrate frontswap pages?

>>> If I understand correctly, SSDs work much more efficiently when
>>> writing 64KB blocks.  So much more efficiently in fact that waiting
>>> to collect 16 4KB pages (by first copying them to fill a 64KB buffer)
>>> will be faster than page-at-a-time DMA'ing them.  If so, the
>>> frontswap interface, backed by an asynchronous "buffering layer"
>>> which collects 16 pages before writing to the SSD, may work
>>> very nicely.  Again this is still just speculation... I was
>>> only pointing out that zero-copy DMA may not always be the best
>>> solution.
>>>        
>> The guest can easily (and should) issue 64k dmas using scatter/gather.
>> No need for copying.
>>      
> In many cases, this is true.  For the swap subsystem, it may not always
> be true, though I see recent signs that it may be headed in that
> direction.

I think it will be true in an overwhelming number of cases.  Flash is 
new enough that most devices support scatter/gather.

> In any case, unless you see this SSD discussion as
> critical to the proposed acceptance of the frontswap patchset,
> let's table it until there's some prototyping done.
>    

It isn't particularly related.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-23 14:52                   ` Avi Kivity
@ 2010-04-23 15:00                     ` Avi Kivity
  -1 siblings, 0 replies; 163+ messages in thread
From: Avi Kivity @ 2010-04-23 15:00 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: linux-kernel, linux-mm, jeremy, hugh.dickins, ngupta, JBeulich,
	chris.mason, kurt.hackel, dave.mccracken, npiggin, akpm, riel

On 04/23/2010 05:52 PM, Avi Kivity wrote:
>
> I see.  So why not implement this as an ordinary swap device, with a 
> higher priority than the disk device?  this way we reuse an API and 
> keep things asynchronous, instead of introducing a special purpose API.
>

Ok, from your original post:

> An "init" prepares the pseudo-RAM to receive frontswap pages and returns
> a non-negative pool id, used for all swap device numbers (aka "type").
> A "put_page" will copy the page to pseudo-RAM and associate it with
> the type and offset associated with the page. A "get_page" will copy the
> page, if found, from pseudo-RAM into kernel memory, but will NOT remove
> the page from pseudo-RAM.  A "flush_page" will remove the page from
> pseudo-RAM and a "flush_area" will remove ALL pages associated with the
> swap type (e.g., like swapoff) and notify the pseudo-RAM device to refuse
> further puts with that swap type.
>
> Once a page is successfully put, a matching get on the page will always
> succeed.  So when the kernel finds itself in a situation where it needs
> to swap out a page, it first attempts to use frontswap.  If the put returns
> non-zero, the data has been successfully saved to pseudo-RAM and
> a disk write and, if the data is later read back, a disk read are avoided.
> If a put returns zero, pseudo-RAM has rejected the data, and the page can
> be written to swap as usual.
>
> Note that if a page is put and the page already exists in pseudo-RAM
> (a "duplicate" put), either the put succeeds and the data is overwritten,
> or the put fails AND the page is flushed.  This ensures stale data may
> never be obtained from pseudo-RAM.
>    

Looks like "init" == open, "put_page" == write, "get_page" == read, 
"flush_page|flush_area" == trim.  The only difference seems to be that 
an overwriting put_page may fail.  Doesn't seem to be much of a win, 
since a guest can simply avoid issuing the duplicate put_page, so the 
hypervisor is still committed to holding this memory for the guest.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


* RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-23 14:52                   ` Avi Kivity
@ 2010-04-23 15:56                     ` Dan Magenheimer
  -1 siblings, 0 replies; 163+ messages in thread
From: Dan Magenheimer @ 2010-04-23 15:56 UTC (permalink / raw)
  To: Avi Kivity
  Cc: linux-kernel, linux-mm, jeremy, hugh.dickins, ngupta, JBeulich,
	chris.mason, kurt.hackel, dave.mccracken, npiggin, akpm, riel

> > Each page is either in frontswap OR on the normal swap device,
> > never both.  So, yes, both reads and writes are avoided if memory
> > is available and there is no write issued to the io subsystem if
> > memory is available.  The is_memory_available decision is determined
> > by the hypervisor dynamically for each page when the guest attempts
> > a "frontswap_put".  So, yes, you are indeed "swapping to the
> > hypervisor" but, at least in the case of Xen, the hypervisor
> > never swaps any memory to disk so there is never double swapping.
> 
> I see.  So why not implement this as an ordinary swap device, with a
> higher priority than the disk device?  this way we reuse an API and
> keep
> things asynchronous, instead of introducing a special purpose API.

Because the swapping API doesn't adapt well to dynamic changes in
the size and availability of the underlying "swap" device, and that
adaptability is exactly what swapping to a (bare-metal) hypervisor
needs.
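
Schematically, the per-page decision looks like this (a toy userspace model
with made-up names, not the actual patch code): the backend gets to say yes
or no for each page, at put time, so the effective size of the "device" can
change from one page to the next.

/*
 * Toy sketch: try the pseudo-RAM backend first, fall back to the ordinary
 * swap device when the put is rejected.  Names are illustrative stand-ins.
 */
#include <stdio.h>

static int backend_room = 2;    /* stand-in for the hypervisor's spare memory */

/* returns nonzero if the page was stored in pseudo-RAM, zero if rejected */
static int frontswap_put(long offset)
{
        (void)offset;
        if (backend_room <= 0)
                return 0;
        backend_room--;
        return 1;
}

static void swap_out(long offset)
{
        if (frontswap_put(offset))
                printf("offset %ld: kept in pseudo-RAM, no disk I/O\n", offset);
        else
                printf("offset %ld: rejected, written to the swap device\n", offset);
}

int main(void)
{
        for (long offset = 0; offset < 4; offset++)
                swap_out(offset);
        return 0;
}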

> Doesn't this commit the hypervisor to retain this memory?  If so, isn't
> it simpler to give the page to the guest (so now it doesn't need to
> swap at all)?

Yes, the hypervisor is committed to retaining the memory.  In
some ways, giving a page of memory to a guest (via ballooning)
is simpler and in some ways not.  When a guest "owns" a page,
it can do whatever it wants with it, independent of what is best
for the "whole" virtualized system.  When the hypervisor
"owns" the page on behalf of the guest but the guest can't
directly address it, the hypervisor has more flexibility.
For example, tmem optionally compresses all frontswap pages,
effectively doubling the size of its available memory.
In the future, knowing that a guest application can never
access the pages directly, it might store all frontswap pages in
(slower but still synchronous) phase change memory or "far NUMA"
memory.

> What about live migration?  do you live migrate frontswap pages?

Yes, fully supported in Xen 4.0.  And as another example of
flexibility, note that "lazy migration" of frontswap'ed pages
might be quite reasonable.

> >> The guest can easily (and should) issue 64k dmas using
> scatter/gather.
> >> No need for copying.
> >>
> > In many cases, this is true.  For the swap subsystem, it may not
> always
> > be true, though I see recent signs that it may be headed in that
> > direction.
> 
> I think it will be true in an overwhelming number of cases.  Flash is
> new enough that most devices support scatter/gather.

I wasn't referring to hardware capability but to the availability
and timing constraints of the pages that need to be swapped.

> > In any case, unless you see this SSD discussion as
> > critical to the proposed acceptance of the frontswap patchset,
> > let's table it until there's some prototyping done.
>
> It isn't particularly related.

Agreed.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-23 15:00                     ` Avi Kivity
@ 2010-04-23 16:26                       ` Dan Magenheimer
  -1 siblings, 0 replies; 163+ messages in thread
From: Dan Magenheimer @ 2010-04-23 16:26 UTC (permalink / raw)
  To: Avi Kivity
  Cc: linux-kernel, linux-mm, jeremy, hugh.dickins, ngupta, JBeulich,
	chris.mason, kurt.hackel, dave.mccracken, npiggin, akpm, riel

> > If a put returns zero, pseudo-RAM has rejected the data, and the page
> can
> > be written to swap as usual.
> >
> > Note that if a page is put and the page already exists in pseudo-RAM
> > (a "duplicate" put), either the put succeeds and the data is
> overwritten,
> > or the put fails AND the page is flushed.  This ensures stale data
> may
> > never be obtained from pseudo-RAM.
> 
> Looks like "init" == open, "put_page" == write, "get_page" == read,
> "flush_page|flush_area" == trim.  The only difference seems to be that
> an overwriting put_page may fail.  Doesn't seem to be much of a win,

No, ANY put_page can fail, and this is a critical part of the API
that provides all of the flexibility for the hypervisor and all
the guests. (See previous reply.)

The "duplicate put" semantics are carefully specified as there
are some coherency corner cases that are very difficult to handle
in the "backend" but very easy to handle in the kernel.  So the
specification explicitly punts these to the kernel.
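
To make the punt concrete, the kernel-side handling of a put reduces to
something like this sketch (illustrative names only, not the patch code):

/*
 * On a rejected duplicate put the old copy is guaranteed gone, so the
 * kernel clears its "in pseudo-RAM" flag and writes the page to disk --
 * a later swap-in can never see stale data.
 */
extern unsigned char in_frontswap[];            /* one flag per swap offset          */
extern int  backend_put(long offset, void *p);  /* nonzero = stored, zero = rejected */
extern void write_to_swap_device(void *p);

void put_page_sketch(long offset, void *page)
{
        if (backend_put(offset, page)) {
                in_frontswap[offset] = 1;       /* fresh copy; any old one overwritten */
        } else {
                in_frontswap[offset] = 0;       /* rejected: old copy was flushed too  */
                write_to_swap_device(page);     /* disk now holds the only valid copy  */
        }
}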

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-23 13:47             ` Dan Magenheimer
@ 2010-04-23 16:35               ` Jiahua
  -1 siblings, 0 replies; 163+ messages in thread
From: Jiahua @ 2010-04-23 16:35 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Avi Kivity, linux-kernel, linux-mm, jeremy, hugh.dickins, ngupta,
	JBeulich, chris.mason, kurt.hackel, dave.mccracken, npiggin,
	akpm, riel

On Fri, Apr 23, 2010 at 6:47 AM, Dan Magenheimer
<dan.magenheimer@oracle.com> wrote:

> If I understand correctly, SSDs work much more efficiently when
> writing 64KB blocks.  So much more efficiently in fact that waiting
> to collect 16 4KB pages (by first copying them to fill a 64KB buffer)
> will be faster than page-at-a-time DMA'ing them.  If so, the
> frontswap interface, backed by an asynchronous "buffering layer"
> which collects 16 pages before writing to the SSD, may work
> very nicely.  Again this is still just speculation... I was
> only pointing out that zero-copy DMA may not always be the best
> solution.

I guess you are talking about the write amplification issue of SSDs.
In fact, most of the new-generation drives have already solved that
problem with a log-like structure.  Even with the old drives, the size
of the writes depends on the size of the erase block, which is not
necessarily 64KB.

Jiahua

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-23 14:52                   ` Avi Kivity
@ 2010-04-24  1:49                     ` Nitin Gupta
  -1 siblings, 0 replies; 163+ messages in thread
From: Nitin Gupta @ 2010-04-24  1:49 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Dan Magenheimer, linux-kernel, linux-mm, jeremy, hugh.dickins,
	JBeulich, chris.mason, kurt.hackel, dave.mccracken, npiggin,
	akpm, riel

On 04/23/2010 08:22 PM, Avi Kivity wrote:
> On 04/23/2010 05:43 PM, Dan Magenheimer wrote:
>>>
>>> Perhaps I misunderstood.  Isn't frontswap in front of the normal swap
>>> device?  So we do have double swapping, first to frontswap (which is in
>>> memory, yes, but still a nonzero cost), then the normal swap device.
>>> The io subsystem is loaded with writes; you only save the reads.
>>> Better to swap to the hypervisor, and make it responsible for
>>> committing
>>> to disk on overcommit or keeping in RAM when memory is available.  This
>>> way we avoid the write to disk if memory is in fact available (or at
>>> least defer it until later).  This way you avoid both reads and writes
>>> if memory is available.
>>>      
>> Each page is either in frontswap OR on the normal swap device,
>> never both.  So, yes, both reads and writes are avoided if memory
>> is available and there is no write issued to the io subsystem if
>> memory is available.  The is_memory_available decision is determined
>> by the hypervisor dynamically for each page when the guest attempts
>> a "frontswap_put".  So, yes, you are indeed "swapping to the
>> hypervisor" but, at least in the case of Xen, the hypervisor
>> never swaps any memory to disk so there is never double swapping.
>>    
> 
> I see.  So why not implement this as an ordinary swap device, with a
> higher priority than the disk device?  this way we reuse an API and keep
> things asynchronous, instead of introducing a special purpose API.
> 

ramzswap is exactly this: an ordinary swap device which stores every page
in (compressed) memory and is enabled as the highest-priority swap.
Currently, it stores these compressed chunks in guest memory itself, but it
is not very difficult to send these chunks out to the host/hypervisor using
virtio.

However, it suffers from unnecessary block I/O layer overhead and requires
weird hooks in the swap code, for example to get a notification when a swap
slot is freed.  OTOH, the frontswap approach gets rid of such artifacts and
overheads.
(ramzswap: http://code.google.com/p/compcache/)

Thanks,
Nitin

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-23 15:56                     ` Dan Magenheimer
@ 2010-04-24 18:22                       ` Avi Kivity
  -1 siblings, 0 replies; 163+ messages in thread
From: Avi Kivity @ 2010-04-24 18:22 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: linux-kernel, linux-mm, jeremy, hugh.dickins, ngupta, JBeulich,
	chris.mason, kurt.hackel, dave.mccracken, npiggin, akpm, riel

On 04/23/2010 06:56 PM, Dan Magenheimer wrote:
>>> Each page is either in frontswap OR on the normal swap device,
>>> never both.  So, yes, both reads and writes are avoided if memory
>>> is available and there is no write issued to the io subsystem if
>>> memory is available.  The is_memory_available decision is determined
>>> by the hypervisor dynamically for each page when the guest attempts
>>> a "frontswap_put".  So, yes, you are indeed "swapping to the
>>> hypervisor" but, at least in the case of Xen, the hypervisor
>>> never swaps any memory to disk so there is never double swapping.
>>>        
>> I see.  So why not implement this as an ordinary swap device, with a
>> higher priority than the disk device?  this way we reuse an API and
>> keep
>> things asynchronous, instead of introducing a special purpose API.
>>      
> Because the swapping API doesn't adapt well to dynamic changes in
> the size and availability of the underlying "swap" device, which
> is very useful for swap to (bare-metal) hypervisor.
>    

Can we extend it?  Adding new APIs is easy, but harder to maintain in 
the long term.

>> Doesn't this commit the hypervisor to retain this memory?  If so, isn't
>> it simpler to give the page to the guest (so now it doesn't need to
>> swap at all)?
>>      
> Yes the hypervisor is committed to retain the memory.  In
> some ways, giving a page of memory to a guest (via ballooning)
> is simpler and in some ways not.  When a guest "owns" a page,
> it can do whatever it wants with it, independent of what is best
> for the "whole" virtualized system.  When the hypervisor
> "owns" the page on behalf of the guest but the guest can't
> directly address it, the hypervisor has more flexibility.
> For example, tmem optionally compresses all frontswap pages,
> effectively doubling the size of its available memory.
> In the future, knowing that a guest application can never
> access the pages directly, it might store all frontswap pages in
> (slower but still synchronous) phase change memory or "far NUMA"
> memory.
>    

Ok.  For non-traditional RAM uses I really think an async API is 
needed.  If the API is backed by a cpu, a synchronous operation is 
fine, but once it isn't RAM, it can be all kinds of interesting things.

Note that even if you do give the page to the guest, you still control 
how it can access it, through the page tables.  So for example you can 
easily compress a guest's pages without telling it about it; whenever it 
touches them you decompress them on the fly.
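
Schematically (purely illustrative, not KVM or Xen code), the host-side
fault path would be:

/* The host keeps a compressed copy and unmaps the page from the guest;
 * the fault taken on the next guest access restores it transparently. */
struct gpage_sketch {
        unsigned long gfn;          /* guest frame number               */
        void         *zdata;        /* compressed copy held by the host */
        int           compressed;
};

extern void *alloc_host_page(void);
extern void  decompress(const void *zdata, void *dst);
extern void  map_into_guest(unsigned long gfn, void *page);

/* called from the host's fault path when the guest touches the page */
void host_refault_sketch(struct gpage_sketch *gp)
{
        if (gp->compressed) {
                void *p = alloc_host_page();
                decompress(gp->zdata, p);       /* restore the original contents    */
                map_into_guest(gp->gfn, p);     /* guest never noticed it was gone  */
                gp->compressed = 0;
        }
}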

>> I think it will be true in an overwhelming number of cases.  Flash is
>> new enough that most devices support scatter/gather.
>>      
> I wasn't referring to hardware capability but to the availability
> and timing constraints of the pages that need to be swapped.
>    

I have a feeling we're talking past each other here.  Swap has no timing 
constraints; it is asynchronous and usually goes to slow devices.


-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-23 16:26                       ` Dan Magenheimer
@ 2010-04-24 18:25                         ` Avi Kivity
  -1 siblings, 0 replies; 163+ messages in thread
From: Avi Kivity @ 2010-04-24 18:25 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: linux-kernel, linux-mm, jeremy, hugh.dickins, ngupta, JBeulich,
	chris.mason, kurt.hackel, dave.mccracken, npiggin, akpm, riel

On 04/23/2010 07:26 PM, Dan Magenheimer wrote:
>>
>> Looks like "init" == open, "put_page" == write, "get_page" == read,
>> "flush_page|flush_area" == trim.  The only difference seems to be that
>> an overwriting put_page may fail.  Doesn't seem to be much of a win,
>>      
> No, ANY put_page can fail, and this is a critical part of the API
> that provides all of the flexibility for the hypervisor and all
> the guests. (See previous reply.)
>    

The guest isn't required to do any put_page()s.  It can issue lots of 
them when memory is available, and keep them in the hypervisor forever.  
Failing new put_page()s isn't enough for a dynamic system; you need to 
be able to force the guest to give up some of its tmem.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-24  1:49                     ` Nitin Gupta
@ 2010-04-24 18:27                       ` Avi Kivity
  -1 siblings, 0 replies; 163+ messages in thread
From: Avi Kivity @ 2010-04-24 18:27 UTC (permalink / raw)
  To: ngupta
  Cc: Dan Magenheimer, linux-kernel, linux-mm, jeremy, hugh.dickins,
	JBeulich, chris.mason, kurt.hackel, dave.mccracken, npiggin,
	akpm, riel

On 04/24/2010 04:49 AM, Nitin Gupta wrote:
>
>> I see.  So why not implement this as an ordinary swap device, with a
>> higher priority than the disk device?  this way we reuse an API and keep
>> things asynchronous, instead of introducing a special purpose API.
>>
>>      
> ramzswap is exactly this: an ordinary swap device which stores every page
> in (compressed) memory and its enabled as highest priority swap. Currently,
> it stores these compressed chunks in guest memory itself but it is not very
> difficult to send these chunks out to host/hypervisor using virtio.
>
> However, it suffers from unnecessary block I/O layer overhead and requires
> weird hooks in swap code, say to get notification when a swap slot is freed.
>    

Isn't that TRIM?

> OTOH frontswap approach gets rid of any such artifacts and overheads.
> (ramzswap: http://code.google.com/p/compcache/)
>    

Maybe we should optimize these overheads instead.  Swap used to always 
be to slow devices, but swap-to-flash has the potential to make swap act 
like an extension of RAM.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 163+ messages in thread

* RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-24 18:22                       ` Avi Kivity
@ 2010-04-25  0:30                         ` Dan Magenheimer
  -1 siblings, 0 replies; 163+ messages in thread
From: Dan Magenheimer @ 2010-04-25  0:30 UTC (permalink / raw)
  To: Avi Kivity
  Cc: linux-kernel, linux-mm, jeremy, hugh.dickins, ngupta, JBeulich,
	chris.mason, kurt.hackel, dave.mccracken, npiggin, akpm, riel

> >> I see.  So why not implement this as an ordinary swap device, with a
> >> higher priority than the disk device?  this way we reuse an API and
> >> keep
> >> things asynchronous, instead of introducing a special purpose API.
> >>
> > Because the swapping API doesn't adapt well to dynamic changes in
> > the size and availability of the underlying "swap" device, which
> > is very useful for swap to (bare-metal) hypervisor.
> 
> Can we extend it?  Adding new APIs is easy, but harder to maintain in
> the long term.

Umm... I think the difference between a "new" API and extending
an existing one here is a choice of semantics.  As designed, frontswap
is an extremely simple, only-very-slightly-intrusive set of hooks that
allows swap pages to, under some conditions, go to pseudo-RAM instead
of an asynchronous disk-like device.  It works today with at least
one "backend" (Xen tmem), is shipping today in real distros, and is
extremely easy to enable/disable via CONFIG or module... meaning
no impact on anyone other than those who choose to benefit from it.
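
The enable/disable story is just the usual kernel pattern, roughly like
this (schematic, not the literal frontswap.h):

/* With the option off, the hook compiles to a no-op stub, so callers
 * unconditionally take the normal disk path and pay nothing. */
#ifdef CONFIG_FRONTSWAP
extern int frontswap_put_page(struct page *page);
#else
static inline int frontswap_put_page(struct page *page)
{
        return 0;       /* "rejected": swap out to the block device as usual */
}
#endif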

"Extending" the existing swap API, which has largely been untouched for
many years, seems like a significantly more complex and error-prone
undertaking that will affect nearly all Linux users with a likely long
bug tail.  And, by the way, there is no existence proof that it
will be useful.

Seems like a no-brainer to me.

> Ok.  For non traditional RAM uses I really think an async API is
> needed.  If the API is backed by a cpu synchronous operation is fine,
> but once it isn't RAM, it can be all kinds of interesting things.

Well, we shall see.  It may also be the case that the existing
asynchronous swap API will work fine for some non-traditional RAM,
and it may also be the case that frontswap works fine for some
non-traditional RAM.  I agree there is fertile ground for exploration
here.  But let's not allow our speculation about what may or may
not work in the future to halt forward progress of something that
works today.
 
> Note that even if you do give the page to the guest, you still control
> how it can access it, through the page tables.  So for example you can
> easily compress a guest's pages without telling it about it; whenever
> it
> touches them you decompress them on the fly.

Yes, at a much larger, more invasive cost to the kernel.  Frontswap
and cleancache and tmem are all well-layered for a good reason.

> >> I think it will be true in an overwhelming number of cases.  Flash
> is
> >> new enough that most devices support scatter/gather.
> >>
> > I wasn't referring to hardware capability but to the availability
> > and timing constraints of the pages that need to be swapped.
> >
> 
> I have a feeling we're talking past each other here.

Could be.

> Swap has no timing
> constraints, it is asynchronous and usually to slow devices.

What I was referring to is that the existing swap code DOES NOT
always have the ability to collect N scattered pages before
initiating an I/O write suitable for a device (such as an SSD)
that is optimized for writing N pages at a time.  That is what
I meant by a timing constraint.  See references to page_cluster
in the swap code (and this is for contiguous pages, not scattered).

Dan

^ permalink raw reply	[flat|nested] 163+ messages in thread

* RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-24 18:25                         ` Avi Kivity
@ 2010-04-25  0:41                           ` Dan Magenheimer
  -1 siblings, 0 replies; 163+ messages in thread
From: Dan Magenheimer @ 2010-04-25  0:41 UTC (permalink / raw)
  To: Avi Kivity
  Cc: linux-kernel, linux-mm, jeremy, hugh.dickins, ngupta, JBeulich,
	chris.mason, kurt.hackel, dave.mccracken, npiggin, akpm, riel

> > No, ANY put_page can fail, and this is a critical part of the API
> > that provides all of the flexibility for the hypervisor and all
> > the guests. (See previous reply.)
> 
> The guest isn't required to do any put_page()s.  It can issue lots of
> them when memory is available, and keep them in the hypervisor forever.
> Failing new put_page()s isn't enough for a dynamic system, you need to
> be able to force the guest to give up some of its tmem.

Yes, indeed, this is true.  That is why it is important for any
policy implemented behind frontswap to "bill" the guest if it
is attempting to keep frontswap pages in the hypervisor forever
and to prod the guest to reclaim them when it no longer needs
super-fast emergency swap space.  The frontswap patch already includes
the kernel mechanism to enable this, and the prodding can be implemented
by a guest daemon (one already exists).

(While devil's advocacy is always welcome, frontswap is NOT a
cool academic science project where these issues have not been
considered or tested.)

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-24 18:27                       ` Avi Kivity
@ 2010-04-25  3:11                         ` Nitin Gupta
  -1 siblings, 0 replies; 163+ messages in thread
From: Nitin Gupta @ 2010-04-25  3:11 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Dan Magenheimer, linux-kernel, linux-mm, jeremy, hugh.dickins,
	JBeulich, chris.mason, kurt.hackel, dave.mccracken, npiggin,
	akpm, riel

On 04/24/2010 11:57 PM, Avi Kivity wrote:
> On 04/24/2010 04:49 AM, Nitin Gupta wrote:
>>
>>> I see.  So why not implement this as an ordinary swap device, with a
>>> higher priority than the disk device?  this way we reuse an API and keep
>>> things asynchronous, instead of introducing a special purpose API.
>>>
>>>      
>> ramzswap is exactly this: an ordinary swap device which stores every page
>> in (compressed) memory and its enabled as highest priority swap.
>> Currently,
>> it stores these compressed chunks in guest memory itself but it is not
>> very
>> difficult to send these chunks out to host/hypervisor using virtio.
>>
>> However, it suffers from unnecessary block I/O layer overhead and
>> requires
>> weird hooks in swap code, say to get notification when a swap slot is
>> freed.
>>    
> 
> Isn't that TRIM?

No: trim or discard is not useful.  The problem is that we require a callback
_as soon as_ a page (swap slot) is freed.  Otherwise, stale data quickly
accumulates in memory, defeating the whole purpose of in-memory compressed
swap devices (like ramzswap).

Increasing the frequency of discards is also not an option:
 - Creating discard bio requests itself needs memory, and these swap devices
come into the picture only under low-memory conditions.
 - We need to regularly scan swap_map to issue these discards.  Increasing
discard frequency also means more frequent scanning (which will still not be
fast enough for ramzswap's needs).
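
What ramzswap wants instead is a direct hook at the moment a slot is freed,
roughly like this (illustrative names, not an existing kernel API):

/* Imagined call site in the swap code, right where a slot's use count
 * drops to zero, so the compressed chunk can be dropped immediately
 * instead of waiting for a discard scan. */
extern void ramzswap_free_chunk(unsigned type, unsigned long offset);

void swap_slot_freed_hook(unsigned type, unsigned long offset)
{
        ramzswap_free_chunk(type, offset);      /* drop the stale compressed copy now */
}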

> 
>> OTOH frontswap approach gets rid of any such artifacts and overheads.
>> (ramzswap: http://code.google.com/p/compcache/)
>>    
> 
> Maybe we should optimize these overheads instead.  Swap used to always
> be to slow devices, but swap-to-flash has the potential to make swap act
> like an extension of RAM.
> 

Spending a lot of effort optimizing an overhead which can be completely
avoided is probably not worth it.

Also, I think the choice of a synchronous-style API for frontswap and
cleancache is justified, as they want to send pages to host *RAM*.  If you
want to use other devices like SSDs, then they should just be added as
another swap device, as we do currently -- they should not be used as
frontswap storage directly.

Thanks,
Nitin

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-25  0:41                           ` Dan Magenheimer
@ 2010-04-25 12:06                             ` Avi Kivity
  -1 siblings, 0 replies; 163+ messages in thread
From: Avi Kivity @ 2010-04-25 12:06 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: linux-kernel, linux-mm, jeremy, hugh.dickins, ngupta, JBeulich,
	chris.mason, kurt.hackel, dave.mccracken, npiggin, akpm, riel

On 04/25/2010 03:41 AM, Dan Magenheimer wrote:
>>> No, ANY put_page can fail, and this is a critical part of the API
>>> that provides all of the flexibility for the hypervisor and all
>>> the guests. (See previous reply.)
>>>        
>> The guest isn't required to do any put_page()s.  It can issue lots of
>> them when memory is available, and keep them in the hypervisor forever.
>> Failing new put_page()s isn't enough for a dynamic system, you need to
>> be able to force the guest to give up some of its tmem.
>>      
> Yes, indeed, this is true.  That is why it is important for any
> policy implemented behind frontswap to "bill" the guest if it
> is attempting to keep frontswap pages in the hypervisor forever
> and to prod the guest to reclaim them when it no longer needs
> super-fast emergency swap space.  The frontswap patch already includes
> the kernel mechanism to enable this and the prodding can be implemented
> by a guest daemon (of which there already exists an existence proof).
>    

In this case you could use the same mechanism to stop new put_page()s?

Seems frontswap is like a reverse balloon, where the balloon is in 
hypervisor space instead of the guest space.

> (While devil's advocacy is always welcome, frontswap is NOT a
> cool academic science project where these issues have not been
> considered or tested.)
>    


Good to know.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-25  0:30                         ` Dan Magenheimer
@ 2010-04-25 12:11                           ` Avi Kivity
  -1 siblings, 0 replies; 163+ messages in thread
From: Avi Kivity @ 2010-04-25 12:11 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: linux-kernel, linux-mm, jeremy, hugh.dickins, ngupta, JBeulich,
	chris.mason, kurt.hackel, dave.mccracken, npiggin, akpm, riel

On 04/25/2010 03:30 AM, Dan Magenheimer wrote:
>>>> I see.  So why not implement this as an ordinary swap device, with a
>>>> higher priority than the disk device?  this way we reuse an API and
>>>> keep
>>>> things asynchronous, instead of introducing a special purpose API.
>>>>
>>>>          
>>> Because the swapping API doesn't adapt well to dynamic changes in
>>> the size and availability of the underlying "swap" device, which
>>> is very useful for swap to (bare-metal) hypervisor.
>>>        
>> Can we extend it?  Adding new APIs is easy, but harder to maintain in
>> the long term.
>>      
> Umm... I think the difference between a "new" API and extending
> an existing one here is a choice of semantics.  As designed, frontswap
> is an extremely simple, only-very-slightly-intrusive set of hooks that
> allows swap pages to, under some conditions, go to pseudo-RAM instead
> of an asynchronous disk-like device.  It works today with at least
> one "backend" (Xen tmem), is shipping today in real distros, and is
> extremely easy to enable/disable via CONFIG or module... meaning
> no impact on anyone other than those who choose to benefit from it.
>
> "Extending" the existing swap API, which has largely been untouched for
> many years, seems like a significantly more complex and error-prone
> undertaking that will affect nearly all Linux users with a likely long
> bug tail.  And, by the way, there is no existence proof that it
> will be useful.
>
> Seems like a no-brainer to me.
>    

My issue is with the API's synchronous nature.  Both RAM and more exotic 
memories can be used with DMA instead of copying.  A synchronous 
interface gives this up.

>> Ok.  For non traditional RAM uses I really think an async API is
>> needed.  If the API is backed by a cpu synchronous operation is fine,
>> but once it isn't RAM, it can be all kinds of interesting things.
>>      
> Well, we shall see.  It may also be the case that the existing
> asynchronous swap API will work fine for some non traditional RAM;
> and it may also be the case that frontswap works fine for some
> non traditional RAM.  I agree there is fertile ground for exploration
> here.  But let's not allow our speculation on what may or may
> not work in the future halt forward progress of something that works
> today.
>    

Let's not let the urge to merge prevent us from doing the right thing.

>
>    
>> Note that even if you do give the page to the guest, you still control
>> how it can access it, through the page tables.  So for example you can
>> easily compress a guest's pages without telling it about it; whenever
>> it
>> touches them you decompress them on the fly.
>>      
> Yes, at a much larger more invasive cost to the kernel.  Frontswap
> and cleancache and tmem are all well-layered for a good reason.
>    

No need to change the kernel at all; the hypervisor controls the page 
tables.

>> Swap has no timing
>> constraints, it is asynchronous and usually to slow devices.
>>      
> What I was referring to is that the existing swap code DOES NOT
> always have the ability to collect N scattered pages before
> initiating an I/O write suitable for a device (such as an SSD)
> that is optimized for writing N pages at a time.  That is what
> I meant by a timing constraint.  See references to page_cluster
> in the swap code (and this is for contiguous pages, not scattered).
>    

I see.  Given that swap-to-flash will soon be way more common than 
frontswap, it needs to be solved (either in flash or in the swap code).

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-25  3:11                         ` Nitin Gupta
@ 2010-04-25 12:16                           ` Avi Kivity
  -1 siblings, 0 replies; 163+ messages in thread
From: Avi Kivity @ 2010-04-25 12:16 UTC (permalink / raw)
  To: ngupta
  Cc: Dan Magenheimer, linux-kernel, linux-mm, jeremy, hugh.dickins,
	JBeulich, chris.mason, kurt.hackel, dave.mccracken, npiggin,
	akpm, riel

On 04/25/2010 06:11 AM, Nitin Gupta wrote:
> On 04/24/2010 11:57 PM, Avi Kivity wrote:
>    
>> On 04/24/2010 04:49 AM, Nitin Gupta wrote:
>>      
>>>        
>>>> I see.  So why not implement this as an ordinary swap device, with a
>>>> higher priority than the disk device?  this way we reuse an API and keep
>>>> things asynchronous, instead of introducing a special purpose API.
>>>>
>>>>
>>>>          
>>> ramzswap is exactly this: an ordinary swap device which stores every page
>>> in (compressed) memory and is enabled as the highest-priority swap.
>>> Currently,
>>> it stores these compressed chunks in guest memory itself but it is not
>>> very
>>> difficult to send these chunks out to host/hypervisor using virtio.
>>>
>>> However, it suffers from unnecessary block I/O layer overhead and
>>> requires
>>> weird hooks in swap code, say to get notification when a swap slot is
>>> freed.
>>>
>>>        
>> Isn't that TRIM?
>>      
> No: trim or discard is not useful. The problem is that we require a callback
> _as soon as_ a page (swap slot) is freed. Otherwise, stale data quickly accumulates
> in memory defeating the whole purpose of in-memory compressed swap devices (like ramzswap).
>    

Doesn't flash have similar requirements?  The earlier you discard, the 
likelier you are to reuse an erase block (or reduce the amount of copying).

> Increasing the frequency of discards is also not an option:
>   - Creating discard bio requests themselves need memory and these swap devices
> come into picture only under low memory conditions.
>    

That's fine, swap works under low memory conditions by using reserves.

>   - We need to regularly scan swap_map to issue these discards. Increasing discard
> frequency also means more frequent scanning (which will still not be fast enough
> for ramzswap needs).
>    

How does frontswap do this?  Does it maintain its own data structures?

>> Maybe we should optimize these overheads instead.  Swap used to always
>> be to slow devices, but swap-to-flash has the potential to make swap act
>> like an extension of RAM.
>>
>>      
> Spending lot of effort optimizing an overhead which can be completely avoided
> is probably not worth it.
>    

I'm not sure.  Swap-to-flash will soon be everywhere.   If it's slow, 
people will feel it a lot more than ramzswap slowness.

> Also, I think the choice of a synchronous style API for frontswap and cleancache
> is justified as they want to send pages to host *RAM*. If you want to use other
> devices like SSDs, then these should be just added as another swap device as
> we do currently -- these should not be used as frontswap storage directly.
>    

Even for copying to RAM an async API is wanted, so you can dma it 
instead of copying.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 163+ messages in thread

* RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-25 12:06                             ` Avi Kivity
@ 2010-04-25 13:12                               ` Dan Magenheimer
  -1 siblings, 0 replies; 163+ messages in thread
From: Dan Magenheimer @ 2010-04-25 13:12 UTC (permalink / raw)
  To: Avi Kivity
  Cc: linux-kernel, linux-mm, jeremy, hugh.dickins, ngupta, JBeulich,
	chris.mason, kurt.hackel, dave.mccracken, npiggin, akpm, riel

> On 04/25/2010 03:41 AM, Dan Magenheimer wrote:
> >>> No, ANY put_page can fail, and this is a critical part of the API
> >>> that provides all of the flexibility for the hypervisor and all
> >>> the guests. (See previous reply.)
> >>>
> >> The guest isn't required to do any put_page()s.  It can issue lots of
> >> them when memory is available, and keep them in the hypervisor forever.
> >> Failing new put_page()s isn't enough for a dynamic system, you need to
> >> be able to force the guest to give up some of its tmem.
> >>
> > Yes, indeed, this is true.  That is why it is important for any
> > policy implemented behind frontswap to "bill" the guest if it
> > is attempting to keep frontswap pages in the hypervisor forever
> > and to prod the guest to reclaim them when it no longer needs
> > super-fast emergency swap space.  The frontswap patch already includes
> > the kernel mechanism to enable this and the prodding can be implemented
> > by a guest daemon (of which there already exists an existence proof).
> 
> In this case you could use the same mechanism to stop new put_page()s?

You are suggesting that the hypervisor communicate rapidly-changing
physical memory availability information to a userland daemon in each guest,
and that each daemon relay this information to its respective kernel
to notify the kernel that hypervisor memory is not available?

Seems very convoluted to me, and anyway it doesn't eliminate the need
for a hook placed exactly where the frontswap_put hook is placed.
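
(For reference, the hook in question sits in the swap page write-out path.
Very roughly -- this is a simplified sketch of the idea, not the literal
patch text, and the success convention is illustrative:)

	/* in the swap writepage path, before building a bio */
	if (frontswap_put_page(page) == 0) {
		/* the page now lives in pseudo-RAM; no block I/O needed */
		set_page_writeback(page);
		unlock_page(page);
		end_page_writeback(page);
		return 0;
	}
	/* put failed or was refused: fall through to the normal
	 * bio-based write to the backing swap device */

If the put fails, nothing changes for the caller; the page simply goes to the
backing swap device as before.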

> Seems frontswap is like a reverse balloon, where the balloon is in
> hypervisor space instead of the guest space.

That's a reasonable analogy.  Frontswap serves nicely as an
emergency safety valve when a guest has given up (too) much of
its memory via ballooning but unexpectedly has an urgent need
that can't be serviced quickly enough by the balloon driver.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-25 13:12                               ` Dan Magenheimer
@ 2010-04-25 13:18                                 ` Avi Kivity
  -1 siblings, 0 replies; 163+ messages in thread
From: Avi Kivity @ 2010-04-25 13:18 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: linux-kernel, linux-mm, jeremy, hugh.dickins, ngupta, JBeulich,
	chris.mason, kurt.hackel, dave.mccracken, npiggin, akpm, riel

On 04/25/2010 04:12 PM, Dan Magenheimer wrote:
>>
>> In this case you could use the same mechanism to stop new put_page()s?
>>      
> You are suggesting that the hypervisor communicate rapidly-changing
> physical memory availability information to a userland daemon in each guest,
> and that each daemon relay this information to its respective kernel
> to notify the kernel that hypervisor memory is not available?
>
> Seems very convoluted to me, and anyway it doesn't eliminate the need
> for a hook placed exactly where the frontswap_put hook is placed.
>    

Yeah, it's pretty ugly.  Balloons typically communicate without a daemon 
too.

>> Seems frontswap is like a reverse balloon, where the balloon is in
>> hypervisor space instead of the guest space.
>>      
> That's a reasonable analogy.  Frontswap serves nicely as an
> emergency safety valve when a guest has given up (too) much of
> its memory via ballooning but unexpectedly has an urgent need
> that can't be serviced quickly enough by the balloon driver.
>    

(or ordinary swap)

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 163+ messages in thread

* RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-25 12:11                           ` Avi Kivity
@ 2010-04-25 13:37                             ` Dan Magenheimer
  -1 siblings, 0 replies; 163+ messages in thread
From: Dan Magenheimer @ 2010-04-25 13:37 UTC (permalink / raw)
  To: Avi Kivity
  Cc: linux-kernel, linux-mm, jeremy, hugh.dickins, ngupta, JBeulich,
	chris.mason, kurt.hackel, dave.mccracken, npiggin, akpm, riel

> My issue is with the API's synchronous nature.  Both RAM and more
> exotic
> memories can be used with DMA instead of copying.  A synchronous
> interface gives this up.
>  :
> Let's not allow the urge to merge to prevent us from doing the right
> thing.
>  :
> I see.  Given that swap-to-flash will soon be way more common than
> frontswap, it needs to be solved (either in flash or in the swap code).

While I admit that I started this whole discussion by implying
that frontswap (and cleancache) might be useful for SSDs, I think
we are going far astray here.  Frontswap is synchronous for a
reason: It uses real RAM, but RAM that is not directly addressable
by a (guest) kernel.  SSD's (at least today) are still I/O devices;
even though they may be very fast, they still live on a PCI (or
slower) bus and use DMA.  Frontswap is not intended for use with
I/O devices.

Today's memory technologies are either RAM that can be addressed
by the kernel, or I/O devices that sit on an I/O bus.  The
exotic memories that I am referring to may be a hybrid:
memory that is fast enough to live on a QPI/hypertransport,
but slow enough that you wouldn't want to randomly mix and
hand out to userland apps some pages from "exotic RAM" and some
pages from "normal RAM".  Such memory makes no sense today
because OS's wouldn't know what to do with it.  But it MAY
make sense with frontswap (and cleancache).

Nevertheless, frontswap works great today with a bare-metal
hypervisor.  I think it stands on its own merits, regardless
of one's vision of future SSD/memory technologies.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-25 13:37                             ` Dan Magenheimer
@ 2010-04-25 14:15                               ` Avi Kivity
  -1 siblings, 0 replies; 163+ messages in thread
From: Avi Kivity @ 2010-04-25 14:15 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: linux-kernel, linux-mm, jeremy, hugh.dickins, ngupta, JBeulich,
	chris.mason, kurt.hackel, dave.mccracken, npiggin, akpm, riel

On 04/25/2010 04:37 PM, Dan Magenheimer wrote:
>> My issue is with the API's synchronous nature.  Both RAM and more
>> exotic
>> memories can be used with DMA instead of copying.  A synchronous
>> interface gives this up.
>>   :
>> Let's not allow the urge to merge prevent us from doing the right
>> thing.
>>   :
>> I see.  Given that swap-to-flash will soon be way more common than
>> frontswap, it needs to be solved (either in flash or in the swap code).
>>      
> While I admit that I started this whole discussion by implying
> that frontswap (and cleancache) might be useful for SSDs, I think
> we are going far astray here.  Frontswap is synchronous for a
> reason: It uses real RAM, but RAM that is not directly addressable
> by a (guest) kernel.  SSD's (at least today) are still I/O devices;
> even though they may be very fast, they still live on a PCI (or
> slower) bus and use DMA.  Frontswap is not intended for use with
> I/O devices.
>
> Today's memory technologies are either RAM that can be addressed
> by the kernel, or I/O devices that sit on an I/O bus.  The
> exotic memories that I am referring to may be a hybrid:
> memory that is fast enough to live on a QPI/hypertransport,
> but slow enough that you wouldn't want to randomly mix and
> hand out to userland apps some pages from "exotic RAM" and some
> pages from "normal RAM".  Such memory makes no sense today
> because OS's wouldn't know what to do with it.  But it MAY
> make sense with frontswap (and cleancache).
>
> Nevertheless, frontswap works great today with a bare-metal
> hypervisor.  I think it stands on its own merits, regardless
> of one's vision of future SSD/memory technologies.
>    

Even when frontswapping to RAM on a bare metal hypervisor it makes sense 
to use an async API, in case you have a DMA engine on board.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 163+ messages in thread

* RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-25 14:15                               ` Avi Kivity
@ 2010-04-25 15:29                                 ` Dan Magenheimer
  -1 siblings, 0 replies; 163+ messages in thread
From: Dan Magenheimer @ 2010-04-25 15:29 UTC (permalink / raw)
  To: Avi Kivity
  Cc: linux-kernel, linux-mm, jeremy, hugh.dickins, ngupta, JBeulich,
	chris.mason, kurt.hackel, dave.mccracken, npiggin, akpm, riel

> > While I admit that I started this whole discussion by implying
> > that frontswap (and cleancache) might be useful for SSDs, I think
> > we are going far astray here.  Frontswap is synchronous for a
> > reason: It uses real RAM, but RAM that is not directly addressable
> > by a (guest) kernel.  SSD's (at least today) are still I/O devices;
> > even though they may be very fast, they still live on a PCI (or
> > slower) bus and use DMA.  Frontswap is not intended for use with
> > I/O devices.
> >
> > Today's memory technologies are either RAM that can be addressed
> > by the kernel, or I/O devices that sit on an I/O bus.  The
> > exotic memories that I am referring to may be a hybrid:
> > memory that is fast enough to live on a QPI/hypertransport,
> > but slow enough that you wouldn't want to randomly mix and
> > hand out to userland apps some pages from "exotic RAM" and some
> > pages from "normal RAM".  Such memory makes no sense today
> > because OS's wouldn't know what to do with it.  But it MAY
> > make sense with frontswap (and cleancache).
> >
> > Nevertheless, frontswap works great today with a bare-metal
> > hypervisor.  I think it stands on its own merits, regardless
> > of one's vision of future SSD/memory technologies.
> 
> Even when frontswapping to RAM on a bare metal hypervisor it makes
> sense
> to use an async API, in case you have a DMA engine on board.

When pages are 2MB, this may be true.  When pages are 4KB and 
copied individually, it may take longer to program a DMA engine 
than to just copy 4KB.

But in any case, frontswap works fine on all existing machines
today.  If/when most commodity CPUs have an asynchronous RAM DMA
engine, an asynchronous API may be appropriate.  Or the existing
swap API might be appropriate. Or the synchronous frontswap API
may work fine too.  Speculating further about non-existent
hardware that might exist in the (possibly far) future is irrelevant
to the proposed patch, which works today on all existing x86 hardware
and on shipping software.


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-25 12:16                           ` Avi Kivity
@ 2010-04-25 16:05                             ` Nitin Gupta
  -1 siblings, 0 replies; 163+ messages in thread
From: Nitin Gupta @ 2010-04-25 16:05 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Dan Magenheimer, linux-kernel, linux-mm, jeremy, hugh.dickins,
	JBeulich, chris.mason, kurt.hackel, dave.mccracken, npiggin,
	akpm, riel

On 04/25/2010 05:46 PM, Avi Kivity wrote:
> On 04/25/2010 06:11 AM, Nitin Gupta wrote:
>> On 04/24/2010 11:57 PM, Avi Kivity wrote:
>>   
>>> On 04/24/2010 04:49 AM, Nitin Gupta wrote:
>>>     
>>>>       
>>>>> I see.  So why not implement this as an ordinary swap device, with a
>>>>> higher priority than the disk device?  this way we reuse an API and
>>>>> keep
>>>>> things asynchronous, instead of introducing a special purpose API.
>>>>>
>>>>>
>>>>>          
>>>> ramzswap is exactly this: an ordinary swap device which stores every
>>>> page
>>>> in (compressed) memory and is enabled as the highest-priority swap.
>>>> Currently,
>>>> it stores these compressed chunks in guest memory itself but it is not
>>>> very
>>>> difficult to send these chunks out to host/hypervisor using virtio.
>>>>
>>>> However, it suffers from unnecessary block I/O layer overhead and
>>>> requires
>>>> weird hooks in swap code, say to get notification when a swap slot is
>>>> freed.
>>>>
>>>>        
>>> Isn't that TRIM?
>>>      
>> No: trim or discard is not useful. The problem is that we require a
>> callback
>> _as soon as_ a page (swap slot) is freed. Otherwise, stale data
>> quickly accumulates
>> in memory defeating the whole purpose of in-memory compressed swap
>> devices (like ramzswap).
>>    
> 
> Doesn't flash have similar requirements?  The earlier you discard, the
> likelier you are to reuse an erase block (or reduce the amount of copying).
> 

No. We do not want to issue a discard for every page as soon as it is freed.
I'm not a flash expert, but I guess issuing an erase is just too expensive to
do so frequently. OTOH, ramzswap needs a callback for every page, as soon as
it is freed.


>> Increasing the frequency of discards is also not an option:
>>   - Creating discard bio requests themselves need memory and these
>> swap devices
>> come into picture only under low memory conditions.
>>    
> 
> That's fine, swap works under low memory conditions by using reserves.
> 

Ok, but still, all this bio allocation and block-layer overhead seems
unnecessary and is easily avoidable. I think the frontswap code needs
cleanup, but at least it avoids all this bio overhead.

>>   - We need to regularly scan swap_map to issue these discards.
>> Increasing discard
>> frequency also means more frequent scanning (which will still not be
>> fast enough
>> for ramzswap needs).
>>    
> 
> How does frontswap do this?  Does it maintain its own data structures?
> 

frontswap simply calls frontswap_flush_page() in swap_entry_free(), i.e. as
soon as a swap slot is freed.  No bio allocation, etc.
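
In sketch form (simplified, not the literal hook):

	/* mm/swapfile.c, inside swap_entry_free(), once the slot's use
	 * count has dropped to zero and the slot is truly free: */
	frontswap_flush_page(p->type, swp_offset(entry));

so the backend can drop its copy immediately and never accumulates stale
compressed pages for slots the kernel no longer cares about.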

>>> Maybe we should optimize these overheads instead.  Swap used to always
>>> be to slow devices, but swap-to-flash has the potential to make swap act
>>> like an extension of RAM.
>>>
>>>      
>> Spending lot of effort optimizing an overhead which can be completely
>> avoided
>> is probably not worth it.
>>    
> 
> I'm not sure.  Swap-to-flash will soon be everywhere.   If it's slow,
> people will feel it a lot more than ramzswap slowness.
> 

Optimizing swap-to-flash is surely desirable, but that problem is separate
from ramzswap or frontswap optimization. For the latter, I think dealing
with bios and going through the block layer is pure overhead.

>> Also, I think the choice of a synchronous style API for frontswap and
>> cleancache
>> is justified as they want to send pages to host *RAM*. If you want to
>> use other
>> devices like SSDs, then these should be just added as another swap
>> device as
>> we do currently -- these should not be used as frontswap storage
>> directly.
>>    
> 
> Even for copying to RAM an async API is wanted, so you can dma it
> instead of copying.
>

Maybe incremental development is better? Stabilize and refine existing
code and gradually move to async API, if required in future?

Thanks,
Nitin


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-25 15:29                                 ` Dan Magenheimer
@ 2010-04-26  6:01                                   ` Avi Kivity
  -1 siblings, 0 replies; 163+ messages in thread
From: Avi Kivity @ 2010-04-26  6:01 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: linux-kernel, linux-mm, jeremy, hugh.dickins, ngupta, JBeulich,
	chris.mason, kurt.hackel, dave.mccracken, npiggin, akpm, riel

On 04/25/2010 06:29 PM, Dan Magenheimer wrote:
>>> While I admit that I started this whole discussion by implying
>>> that frontswap (and cleancache) might be useful for SSDs, I think
>>> we are going far astray here.  Frontswap is synchronous for a
>>> reason: It uses real RAM, but RAM that is not directly addressable
>>> by a (guest) kernel.  SSD's (at least today) are still I/O devices;
>>> even though they may be very fast, they still live on a PCI (or
>>> slower) bus and use DMA.  Frontswap is not intended for use with
>>> I/O devices.
>>>
>>> Today's memory technologies are either RAM that can be addressed
>>> by the kernel, or I/O devices that sit on an I/O bus.  The
>>> exotic memories that I am referring to may be a hybrid:
>>> memory that is fast enough to live on a QPI/hypertransport,
>>> but slow enough that you wouldn't want to randomly mix and
>>> hand out to userland apps some pages from "exotic RAM" and some
>>> pages from "normal RAM".  Such memory makes no sense today
>>> because OS's wouldn't know what to do with it.  But it MAY
>>> make sense with frontswap (and cleancache).
>>>
>>> Nevertheless, frontswap works great today with a bare-metal
>>> hypervisor.  I think it stands on its own merits, regardless
>>> of one's vision of future SSD/memory technologies.
>>>        
>> Even when frontswapping to RAM on a bare metal hypervisor it makes
>> sense
>> to use an async API, in case you have a DMA engine on board.
>>      
> When pages are 2MB, this may be true.  When pages are 4KB and
> copied individually, it may take longer to program a DMA engine
> than to just copy 4KB.
>    

Of course, you have to use a batching API, like virtio or Xen's rings, 
to avoid the overhead.
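
Purely as an illustration (this is not an existing interface), the shape
would be something like a ring of put descriptors that the backend or DMA
engine consumes in batches:

	struct put_desc {
		uint32_t type;      /* pool id / swap "type" */
		uint64_t offset;    /* page offset within that swap type */
		uint64_t gfn;       /* guest frame holding the 4KB page */
	};

	struct put_ring {
		uint32_t prod, cons;        /* producer/consumer indices */
		struct put_desc desc[256];  /* power-of-two sized ring */
	};

The guest queues many puts, kicks the backend once, and collects completions
in bulk, which is what amortizes the per-page DMA setup cost.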

> But in any case, frontswap works fine on all existing machines
> today.  If/when most commodity CPUs have an asynchronous RAM DMA
> engine, an asynchronous API may be appropriate.  Or the existing
> swap API might be appropriate. Or the synchronous frontswap API
> may work fine too.  Speculating further about non-existent
> hardware that might exist in the (possibly far) future is irrelevant
> to the proposed patch, which works today on all existing x86 hardware
> and on shipping software.
>    

DMA engines are present on commodity hardware now:

http://en.wikipedia.org/wiki/I/O_Acceleration_Technology

I don't know if consumer machines have them, but servers certainly do.  
modprobe ioatdma.


-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-25 16:05                             ` Nitin Gupta
@ 2010-04-26  6:06                               ` Avi Kivity
  -1 siblings, 0 replies; 163+ messages in thread
From: Avi Kivity @ 2010-04-26  6:06 UTC (permalink / raw)
  To: ngupta
  Cc: Dan Magenheimer, linux-kernel, linux-mm, jeremy, hugh.dickins,
	JBeulich, chris.mason, kurt.hackel, dave.mccracken, npiggin,
	akpm, riel

On 04/25/2010 07:05 PM, Nitin Gupta wrote:
>
>>> Increasing the frequency of discards is also not an option:
>>>    - Creating discard bio requests themselves need memory and these
>>> swap devices
>>> come into picture only under low memory conditions.
>>>
>>>        
>> That's fine, swap works under low memory conditions by using reserves.
>>
>>      
> Ok, but still, all this bio allocation and block-layer overhead seems
> unnecessary and is easily avoidable. I think the frontswap code needs
> cleanup, but at least it avoids all this bio overhead.
>    

Ok.  I agree it is silly to go through the block layer and end up 
servicing it within the kernel.

>>>    - We need to regularly scan swap_map to issue these discards.
>>> Increasing discard
>>> frequency also means more frequent scanning (which will still not be
>>> fast enough
>>> for ramzswap needs).
>>>
>>>        
>> How does frontswap do this?  Does it maintain its own data structures?
>>
>>      
> frontswap simply calls frontswap_flush_page() in swap_entry_free(), i.e. as
> soon as a swap slot is freed.  No bio allocation, etc.
>    

The same code could also issue the discard?

>> Even for copying to RAM an async API is wanted, so you can dma it
>> instead of copying.
>>
>>      
> Maybe incremental development is better? Stabilize and refine existing
> code and gradually move to async API, if required in future?
>    

Incremental development is fine, especially for ramzswap, where the APIs
are all internal.  I'm more worried about external interfaces; these
stick around a lot longer, and if not done right they're a pain forever.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 163+ messages in thread

* RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-26  6:01                                   ` Avi Kivity
@ 2010-04-26 12:45                                     ` Dan Magenheimer
  -1 siblings, 0 replies; 163+ messages in thread
From: Dan Magenheimer @ 2010-04-26 12:45 UTC (permalink / raw)
  To: Avi Kivity
  Cc: linux-kernel, linux-mm, jeremy, hugh.dickins, ngupta, JBeulich,
	chris.mason, kurt.hackel, dave.mccracken, npiggin, akpm, riel

> DMA engines are present on commodity hardware now:
> 
> http://en.wikipedia.org/wiki/I/O_Acceleration_Technology
> 
> I don't know if consumer machines have them, but servers certainly do.
> modprobe ioatdma.

They don't seem to have gained much ground in the FIVE YEARS
since the patch was first posted to Linux, have they?

Maybe it's because memory-to-memory copy using a CPU
is so fast (especially for page-ish quantities of data)
and is a small percentage of CPU utilization these days?
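
Back-of-envelope, with rough numbers rather than measurements: at a few GB/s
of memcpy bandwidth, copying one 4KB page costs on the order of
4096 B / 4 GB/s = ~1 microsecond, while setting up a DMA descriptor, ringing
a doorbell, and servicing a completion interrupt can easily cost that much
or more per operation unless the transfers are batched or are much larger
than a page.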

^ permalink raw reply	[flat|nested] 163+ messages in thread

* RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-26  6:06                               ` Avi Kivity
@ 2010-04-26 12:50                                 ` Dan Magenheimer
  -1 siblings, 0 replies; 163+ messages in thread
From: Dan Magenheimer @ 2010-04-26 12:50 UTC (permalink / raw)
  To: Avi Kivity, ngupta
  Cc: linux-kernel, linux-mm, jeremy, hugh.dickins, JBeulich,
	chris.mason, kurt.hackel, dave.mccracken, npiggin, akpm, riel

> > Maybe incremental development is better? Stabilize and refine existing
> > code and gradually move to async API, if required in future?
> 
> Incremental development is fine, especially for ramzswap where the APIs
> are all internal.  I'm more worried about external interfaces, these
> stick around a lot longer and if not done right they're a pain forever.

Well if you are saying that your primary objection to the
frontswap synchronous API is that it is exposed to modules via
some EXPORT_SYMBOLs, we can certainly fix that, at least
unless/until there are other pseudo-RAM devices that can use it.
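
(For context, the module-facing surface is small.  Roughly -- with the
my_* names as placeholders and the exact registration mechanism glossed
over:)

	/* in a backend module such as the Xen tmem driver -- sketch only */
	static struct frontswap_ops my_backend_ops = {
		.init       = my_init,        /* per-swap-type setup, returns a pool id */
		.put_page   = my_put_page,    /* copy a page into pseudo-RAM */
		.get_page   = my_get_page,    /* copy a page back out */
		.flush_page = my_flush_page,  /* drop one page */
		.flush_area = my_flush_area,  /* drop everything for a swap type */
	};

The handful of EXPORT_SYMBOLs exist only so that a module can perform this
hookup; hiding them is straightforward if that is the sticking point.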

Would that resolve your concerns?

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-26 12:50                                 ` Dan Magenheimer
@ 2010-04-26 13:43                                   ` Avi Kivity
  -1 siblings, 0 replies; 163+ messages in thread
From: Avi Kivity @ 2010-04-26 13:43 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: ngupta, linux-kernel, linux-mm, jeremy, hugh.dickins, JBeulich,
	chris.mason, kurt.hackel, dave.mccracken, npiggin, akpm, riel

On 04/26/2010 03:50 PM, Dan Magenheimer wrote:
>>> Maybe incremental development is better? Stabilize and refine
>>> existing code and gradually move to async API, if required in future?
>> Incremental development is fine, especially for ramzswap where the APIs
>> are all internal.  I'm more worried about external interfaces, these
>> stick around a lot longer and if not done right they're a pain forever.
>>      
> Well if you are saying that your primary objection to the
> frontswap synchronous API is that it is exposed to modules via
> some EXPORT_SYMBOLs, we can certainly fix that, at least
> unless/until there are other pseudo-RAM devices that can use it.
>
> Would that resolve your concerns?
>    

By external interfaces I mean the guest/hypervisor interface.  
EXPORT_SYMBOL is an internal interface as far as I'm concerned.

Now, the frontswap interface is also an internal interface, but it's 
close to the external one.  I'd feel much better if it was asynchronous.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-26  6:06                               ` Avi Kivity
@ 2010-04-26 13:47                                 ` Nitin Gupta
  -1 siblings, 0 replies; 163+ messages in thread
From: Nitin Gupta @ 2010-04-26 13:47 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Dan Magenheimer, linux-kernel, linux-mm, jeremy, hugh.dickins,
	JBeulich, chris.mason, kurt.hackel, dave.mccracken, npiggin,
	akpm, riel

On 04/26/2010 11:36 AM, Avi Kivity wrote:
> On 04/25/2010 07:05 PM, Nitin Gupta wrote:
>>
>>>> Increasing the frequency of discards is also not an option:
>>>>    - Creating discard bio requests themselves need memory and these
>>>> swap devices
>>>> come into picture only under low memory conditions.
>>>>
>>>>        
>>> That's fine, swap works under low memory conditions by using reserves.
>>>
>>>      
>> Ok, but still all this bio allocation and block layer overhead seems
>> unnecessary and is easily avoidable. I think frontswap code needs
>> clean up but at least it avoids all this bio overhead.
>>    
> 
> Ok.  I agree it is silly to go through the block layer and end up
> servicing it within the kernel.
> 
>>>>    - We need to regularly scan swap_map to issue these discards.
>>>> Increasing discard
>>>> frequency also means more frequent scanning (which will still not be
>>>> fast enough
>>>> for ramzswap needs).
>>>>
>>>>        
>>> How does frontswap do this?  Does it maintain its own data structures?
>>>
>>>      
>> frontswap simply calls frontswap_flush_page() in swap_entry_free()
>> i.e. as
>> soon as a swap slot is freed. No bio allocation etc.
>>    
> 
> The same code could also issue the discard?
> 


No, we cannot issue a discard bio at this point since the swap_lock
spinlock is held.
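
As a rough sketch (simplified: the real locking and signatures in
mm/swapfile.c differ in detail, and blkdev_issue_discard() stands in
for however the discard would actually be submitted), the contrast
looks like this:

  /* sketch only: freeing a swap slot while holding swap_lock */
  spin_lock(&swap_lock);
  /* ... swap_map bookkeeping marks (type, offset) free ... */
  frontswap_flush_page(type, offset); /* plain call into the backend */
  spin_unlock(&swap_lock);

  /*
   * A discard, by contrast, means allocating and submitting a bio
   * (e.g. via blkdev_issue_discard()), which may allocate memory and
   * sleep -- neither is allowed while the swap_lock spinlock is held.
   */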


Thanks,
Nitin

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-26 12:45                                     ` Dan Magenheimer
@ 2010-04-26 13:48                                       ` Avi Kivity
  -1 siblings, 0 replies; 163+ messages in thread
From: Avi Kivity @ 2010-04-26 13:48 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: linux-kernel, linux-mm, jeremy, hugh.dickins, ngupta, JBeulich,
	chris.mason, kurt.hackel, dave.mccracken, npiggin, akpm, riel

On 04/26/2010 03:45 PM, Dan Magenheimer wrote:
>> dma engines are present on commodity hardware now:
>>
>> http://en.wikipedia.org/wiki/I/O_Acceleration_Technology
>>
>> I don't know if consumer machines have them, but servers certainly do.
>> modprobe ioatdma.
>>      
> They don't seem to have gained much ground in the FIVE YEARS
> since the patch was first posted to Linux, have they?
>    

Why do you say this?  Servers have them and AFAIK networking uses them.  
There are other uses of the API in the code, but I don't know how much 
of this is for bulk copies.

> Maybe it's because memory-to-memory copy using a CPU
> is so fast (especially for page-ish quantities of data)
> and is a small percentage of CPU utilization these days?
>    

Copies take a small percentage of cpu because a lot of care goes into 
avoiding them, or placing them near the place where the copy is used.  
They certainly show up in high speed networking.

A page-sized copy is small, but many of them will be expensive.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-25 12:11                           ` Avi Kivity
@ 2010-04-27  0:49                             ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 163+ messages in thread
From: Jeremy Fitzhardinge @ 2010-04-27  0:49 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Dan Magenheimer, linux-kernel, linux-mm, hugh.dickins, ngupta,
	JBeulich, chris.mason, kurt.hackel, dave.mccracken, npiggin,
	akpm, riel

On 04/25/2010 05:11 AM, Avi Kivity wrote:
> No need to change the kernel at all; the hypervisor controls the page
> tables.

Not in Xen PV guests (the hypervisor vets guest updates, but it can't
safely make its own changes to the pagetables).  (It's kind of annoying.)

    J

^ permalink raw reply	[flat|nested] 163+ messages in thread

* RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-26 13:43                                   ` Avi Kivity
@ 2010-04-27  8:29                                     ` Dan Magenheimer
  -1 siblings, 0 replies; 163+ messages in thread
From: Dan Magenheimer @ 2010-04-27  8:29 UTC (permalink / raw)
  To: Avi Kivity
  Cc: ngupta, linux-kernel, linux-mm, jeremy, hugh.dickins, JBeulich,
	chris.mason, kurt.hackel, dave.mccracken, npiggin, akpm, riel

> > Well if you are saying that your primary objection to the
> > frontswap synchronous API is that it is exposed to modules via
> > some EXPORT_SYMBOLs, we can certainly fix that, at least
> > unless/until there are other pseudo-RAM devices that can use it.
> >
> > Would that resolve your concerns?
> >
> 
> By external interfaces I mean the guest/hypervisor interface.
> EXPORT_SYMBOL is an internal interface as far as I'm concerned.
> 
> Now, the frontswap interface is also an internal interface, but it's
> close to the external one.  I'd feel much better if it was
> asynchronous.

OK, so on the one hand, you think that the proposed synchronous
interface for frontswap is insufficiently extensible for other
uses (presumably including KVM).  On the other hand, you agree
that using the existing I/O subsystem is unnecessarily heavyweight.
On the third hand, Nitin has answered your questions and spent
a good part of three years finding that extending the existing swap
interface to efficiently support swap-to-pseudo-RAM requires
some kind of in-kernel notification mechanism to which Linus
has already objected.

So you are instead proposing some new guest-to-host asynchronous
notification mechanism that doesn't use the existing bio
mechanism (and so presumably not irqs), imitates or can
utilize a dma engine, and uses fewer CPU cycles than copying
pages.  AND, for long-term maintainability, you'd like to avoid
creating a new guest-host API that does all this, even one that
is as simple and lightweight as the proposed frontswap hooks.

Does that summarize your objection well?

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-27  8:29                                     ` Dan Magenheimer
@ 2010-04-27  9:21                                       ` Avi Kivity
  -1 siblings, 0 replies; 163+ messages in thread
From: Avi Kivity @ 2010-04-27  9:21 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: ngupta, linux-kernel, linux-mm, jeremy, hugh.dickins, JBeulich,
	chris.mason, kurt.hackel, dave.mccracken, npiggin, akpm, riel

On 04/27/2010 11:29 AM, Dan Magenheimer wrote:
>
> OK, so on the one hand, you think that the proposed synchronous
> interface for frontswap is insufficiently extensible for other
> uses (presumably including KVM).  On the other hand, you agree
> that using the existing I/O subsystem is unnecessarily heavyweight.
> On the third hand, Nitin has answered your questions and spent
> a good part of three years finding that extending the existing swap
> interface to efficiently support swap-to-pseudo-RAM requires
> some kind of in-kernel notification mechanism to which Linus
> has already objected.
>
> So you are instead proposing some new guest-to-host asynchronous
> notification mechanism that doesn't use the existing bio
> mechanism (and so presumably not irqs),

(any notification mechanism has to use irqs if it exits the guest)

> imitates or can
> utilize a dma engine, and uses fewer CPU cycles than copying
> pages.  AND, for long-term maintainability, you'd like to avoid
> creating a new guest-host API that does all this, even one that
> is as simple and lightweight as the proposed frontswap hooks.
>
> Does that summarize your objection well?
>    

No.  Adding a new async API that parallels the block layer would be 
madness.  My first preference would be to completely avoid new APIs.  I 
think that would work for swap-to-hypervisor but probably not for 
compcache.  Second preference is the synchronous API, third is a new 
async API.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-25 13:37                             ` Dan Magenheimer
  (?)
  (?)
@ 2010-04-27 11:52                             ` Valdis.Kletnieks
  -1 siblings, 0 replies; 163+ messages in thread
From: Valdis.Kletnieks @ 2010-04-27 11:52 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Avi Kivity, linux-kernel, linux-mm, jeremy, hugh.dickins, ngupta,
	JBeulich, chris.mason, kurt.hackel, dave.mccracken, npiggin,
	akpm, riel

On Sun, 25 Apr 2010 06:37:30 PDT, Dan Magenheimer said:

> While I admit that I started this whole discussion by implying
> that frontswap (and cleancache) might be useful for SSDs, I think
> we are going far astray here.  Frontswap is synchronous for a
> reason: It uses real RAM, but RAM that is not directly addressable
> by a (guest) kernel.

Are there any production boxes that actually do this currently? I know IBM had
'expanded storage' on the 3090 series 20 years ago; I haven't checked if the
Z-series still does that.  It was very cool at the time - it supported 900+
users with 128M of main memory and 256M of expanded storage, because you got
the first 3,000 or so page faults per second for almost free.  Oh, and the
3090 had 2 special opcodes for "move page to/from expanded", so it was a very
fast but still synchronous move (for whatever that's worth).


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-25  0:30                         ` Dan Magenheimer
@ 2010-04-27 12:55                           ` Pavel Machek
  -1 siblings, 0 replies; 163+ messages in thread
From: Pavel Machek @ 2010-04-27 12:55 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Avi Kivity, linux-kernel, linux-mm, jeremy, hugh.dickins, ngupta,
	JBeulich, chris.mason, kurt.hackel, dave.mccracken, npiggin,
	akpm, riel

Hi!

> > Can we extend it?  Adding new APIs is easy, but harder to maintain in
> > the long term.
> 
> Umm... I think the difference between a "new" API and extending
> an existing one here is a choice of semantics.  As designed, frontswap
> is an extremely simple, only-very-slightly-intrusive set of hooks that
> allows swap pages to, under some conditions, go to pseudo-RAM instead
...
> "Extending" the existing swap API, which has largely been untouched for
> many years, seems like a significantly more complex and error-prone
> undertaking that will affect nearly all Linux users with a likely long
> bug tail.  And, by the way, there is no existence proof that it
> will be useful.

> Seems like a no-brainer to me.

Stop right here. Instead of improving the existing swap API, you just
create a new one because it is less work.

We do not want APIs to accumulate; please just fix the existing one.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-25 15:29                                 ` Dan Magenheimer
@ 2010-04-27 12:56                                   ` Pavel Machek
  -1 siblings, 0 replies; 163+ messages in thread
From: Pavel Machek @ 2010-04-27 12:56 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Avi Kivity, linux-kernel, linux-mm, jeremy, hugh.dickins, ngupta,
	JBeulich, chris.mason, kurt.hackel, dave.mccracken, npiggin,
	akpm, riel

Hi!

> > > Nevertheless, frontswap works great today with a bare-metal
> > > hypervisor.  I think it stands on its own merits, regardless
> > > of one's vision of future SSD/memory technologies.
> > 
> > Even when frontswapping to RAM on a bare metal hypervisor it makes
> > sense
> > to use an async API, in case you have a DMA engine on board.
> 
> When pages are 2MB, this may be true.  When pages are 4KB and 
> copied individually, it may take longer to program a DMA engine 
> than to just copy 4KB.
> 
> But in any case, frontswap works fine on all existing machines
> today.  If/when most commodity CPUs have an asynchronous RAM DMA
> engine, an asynchronous API may be appropriate.  Or the existing
> swap API might be appropriate. Or the synchronous frontswap API
> may work fine too.  Speculating further about non-existent
> hardware that might exist in the (possibly far) future is irrelevant
> to the proposed patch, which works today on all existing x86 hardware
> and on shipping software.

If we added all the APIs that worked when proposed, we'd have
an unmaintainable mess by about 1996.

Why can't frontswap just use the existing swap API?
							Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 163+ messages in thread

* RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-27 12:56                                   ` Pavel Machek
@ 2010-04-27 14:32                                     ` Dan Magenheimer
  -1 siblings, 0 replies; 163+ messages in thread
From: Dan Magenheimer @ 2010-04-27 14:32 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Avi Kivity, linux-kernel, linux-mm, jeremy, hugh.dickins, ngupta,
	JBeulich, chris.mason, kurt.hackel, dave.mccracken, npiggin,
	akpm, riel

> Stop right here. Instead of improving the existing swap API, you just
> create a new one because it is less work.
> 
> We do not want APIs to accumulate; please just fix the existing one.

> If we added all the APIs that worked when proposed, we'd have
> an unmaintainable mess by about 1996.
> 
> Why can't frontswap just use the existing swap API?

Hi Pavel!

The existing swap API as it stands is inadequate for an efficient
synchronous interface (e.g. for swapping to RAM).  Both Nitin
and I independently have found this to be true.  But swap-to-RAM
is very useful in some cases (swap-to-kernel-compressed-RAM
and swap-to-hypervisor-RAM and maybe others) that were not even
conceived many years ago at the time the existing swap API was
designed for swap-to-disk.  Swap-to-RAM can relieve memory
pressure faster and more resource-efficiently than swap-to-device,
but it must assume that the RAM available for swap-to-RAM is dynamic
(not fixed in size).  (And swap-to-SSD, when the SSD is an
I/O device on an I/O bus, is NOT the same as swap-to-RAM.)

In my opinion, frontswap is NOT a new API, but the simplest
possible extension of the existing swap API to allow for
efficient swap-to-RAM.  Avi's comments about a new API
(as he explained later in the thread) refer to a new API
between kernel and hypervisor, what is essentially the
Transcendent Memory interface.  Frontswap was separated from
the tmem dependency to enable Nitin's swap-to-kernel-compressed-RAM
and the possibility that there may be other interesting
swap-to-RAM uses.

Does this help?

Dan

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-27 12:55                           ` Pavel Machek
@ 2010-04-27 14:43                             ` Nitin Gupta
  -1 siblings, 0 replies; 163+ messages in thread
From: Nitin Gupta @ 2010-04-27 14:43 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Dan Magenheimer, Avi Kivity, linux-kernel, linux-mm, jeremy,
	hugh.dickins, JBeulich, chris.mason, kurt.hackel, dave.mccracken,
	npiggin, akpm, riel

On 04/27/2010 06:25 PM, Pavel Machek wrote:
> 
>>> Can we extend it?  Adding new APIs is easy, but harder to maintain in
>>> the long term.
>>
>> Umm... I think the difference between a "new" API and extending
>> an existing one here is a choice of semantics.  As designed, frontswap
>> is an extremely simple, only-very-slightly-intrusive set of hooks that
>> allows swap pages to, under some conditions, go to pseudo-RAM instead
> ...
>> "Extending" the existing swap API, which has largely been untouched for
>> many years, seems like a significantly more complex and error-prone
>> undertaking that will affect nearly all Linux users with a likely long
>> bug tail.  And, by the way, there is no existence proof that it
>> will be useful.
> 
>> Seems like a no-brainer to me.
> 
> Stop right here. Instead of improving the existing swap API, you just
> create a new one because it is less work.
> 
> We do not want APIs to accumulate; please just fix the existing one.


I'm a bit confused: What do you mean by 'existing swap API'?
Frontswap simply hooks into swap_readpage() and swap_writepage() to
call frontswap_{get,put}_page() respectively. Now, to avoid a hardcoded
implementation of these functions, it introduces struct frontswap_ops
so that custom implementations of the frontswap get/put/etc. functions can be
provided. This allows easy implementation of swap-to-hypervisor,
in-memory-compressed-swapping etc. with a common set of hooks.
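
In outline, the shape is roughly this (a sketch only: the field layout
and the put_page success convention are assumed here, not copied from
the patch):

  /* sketch -- not the literal include/linux/frontswap.h */
  struct frontswap_ops {
          int  (*init)(unsigned type);
          int  (*put_page)(unsigned type, pgoff_t offset, struct page *page);
          int  (*get_page)(unsigned type, pgoff_t offset, struct page *page);
          void (*flush_page)(unsigned type, pgoff_t offset);
          /* ... */
  };

  /* conceptually, inside swap_writepage(): */
  if (frontswap_put_page(page)) {
          /* backend took the page into pseudo-RAM: writeback completes
           * locally, no bio is built, no block I/O is issued */
  } else {
          /* fall back to the ordinary bio-based swap write path */
  }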

So, how can the frontswap approach be seen as introducing a new API?

Thanks,
Nitin

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-25 13:12                               ` Dan Magenheimer
@ 2010-04-28  5:55                                 ` Pavel Machek
  -1 siblings, 0 replies; 163+ messages in thread
From: Pavel Machek @ 2010-04-28  5:55 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Avi Kivity, linux-kernel, linux-mm, jeremy, hugh.dickins, ngupta,
	JBeulich, chris.mason, kurt.hackel, dave.mccracken, npiggin,
	akpm, riel

Hi!

> > Seems frontswap is like a reverse balloon, where the balloon is in
> > hypervisor space instead of the guest space.
> 
> That's a reasonable analogy.  Frontswap serves nicely as an
> emergency safety valve when a guest has given up (too) much of
> its memory via ballooning but unexpectedly has an urgent need
> that can't be serviced quickly enough by the balloon driver.

wtf? So let's fix the ballooning driver instead?

There's no reason it could not be as fast as frontswap, right?
Actually I'd expect it to be faster -- it can deal with big chunks.

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-27 14:32                                     ` Dan Magenheimer
@ 2010-04-29 13:02                                       ` Pavel Machek
  -1 siblings, 0 replies; 163+ messages in thread
From: Pavel Machek @ 2010-04-29 13:02 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Avi Kivity, linux-kernel, linux-mm, jeremy, hugh.dickins, ngupta,
	JBeulich, chris.mason, kurt.hackel, dave.mccracken, npiggin,
	akpm, riel

Hi!

> > Stop right here. Instead of improving the existing swap API, you just
> > create a new one because it is less work.
> > 
> > We do not want APIs to accumulate; please just fix the existing one.
> 
> > If we added all the APIs that worked when proposed, we'd have
> > an unmaintainable mess by about 1996.
> > 
> > Why can't frontswap just use the existing swap API?
> 
> Hi Pavel!
> 
> The existing swap API as it stands is inadequate for an efficient
> synchronous interface (e.g. for swapping to RAM).  Both Nitin
> and I independently have found this to be true.  But swap-to-RAM

So... how much slower is swapping to RAM over the current interface
compared to the proposed interface, and how much is that slower than
just using the memory directly?
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-27 14:43                             ` Nitin Gupta
@ 2010-04-29 13:04                               ` Pavel Machek
  -1 siblings, 0 replies; 163+ messages in thread
From: Pavel Machek @ 2010-04-29 13:04 UTC (permalink / raw)
  To: Nitin Gupta
  Cc: Dan Magenheimer, Avi Kivity, linux-kernel, linux-mm, jeremy,
	hugh.dickins, JBeulich, chris.mason, kurt.hackel, dave.mccracken,
	npiggin, akpm, riel

On Tue 2010-04-27 20:13:39, Nitin Gupta wrote:
> On 04/27/2010 06:25 PM, Pavel Machek wrote:
> > 
> >>> Can we extend it?  Adding new APIs is easy, but harder to maintain in
> >>> the long term.
> >>
> >> Umm... I think the difference between a "new" API and extending
> >> an existing one here is a choice of semantics.  As designed, frontswap
> >> is an extremely simple, only-very-slightly-intrusive set of hooks that
> >> allows swap pages to, under some conditions, go to pseudo-RAM instead
> > ...
> >> "Extending" the existing swap API, which has largely been untouched for
> >> many years, seems like a significantly more complex and error-prone
> >> undertaking that will affect nearly all Linux users with a likely long
> >> bug tail.  And, by the way, there is no existence proof that it
> >> will be useful.
> > 
> >> Seems like a no-brainer to me.
> > 
> > Stop right here. Instead of improving the existing swap API, you just
> > create a new one because it is less work.
> > 
> > We do not want APIs to accumulate; please just fix the existing one.
>
>
> I'm a bit confused: What do you mean by 'existing swap API'?
> Frontswap simply hooks into swap_readpage() and swap_writepage() to
> call frontswap_{get,put}_page() respectively. Now, to avoid a hardcoded
> implementation of these functions, it introduces struct frontswap_ops
> so that custom implementations of the frontswap get/put/etc. functions can be
> provided. This allows easy implementation of swap-to-hypervisor,
> in-memory-compressed-swapping etc. with a common set of hooks.

Yes, and that set of hooks is a new API, right?

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 163+ messages in thread

* RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-28  5:55                                 ` Pavel Machek
@ 2010-04-29 14:42                                   ` Dan Magenheimer
  -1 siblings, 0 replies; 163+ messages in thread
From: Dan Magenheimer @ 2010-04-29 14:42 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Avi Kivity, linux-kernel, linux-mm, jeremy, hugh.dickins, ngupta,
	JBeulich, chris.mason, kurt.hackel, dave.mccracken, npiggin,
	akpm, riel

Hi Pavel --

The whole concept of RAM that _might_ be available to the
kernel and is _not_ directly addressable by the kernel takes
some thinking to wrap your mind around, but I assure you
there are very good use cases for it.  RAM owned and managed
by a hypervisor (using controls unknowable to the kernel)
is one example; this is Transcendent Memory.  RAM which
has been compressed is another example; Nitin is working
on this using the frontswap approach because of some
issues that arise with ramzswap (see elsewhere on this
thread).  There are likely more use cases.

So in that context, let me answer your questions, combined
into a single reply.

> > That's a reasonable analogy.  Frontswap serves nicely as an
> > emergency safety valve when a guest has given up (too) much of
> > its memory via ballooning but unexpectedly has an urgent need
> > that can't be serviced quickly enough by the balloon driver.
> 
> wtf? So let's fix the ballooning driver instead?
> 
> There's no reason it could not be as fast as frontswap, right?
> Actually I'd expect it to be faster -- it can deal with big chunks.

If this was possible by fixing the balloon driver, VMware would
have done it years ago.  The problem is that the balloon driver
is acting on very limited information, namely ONLY what THIS
kernel wants; every kernel is selfish and (eventually) uses every
bit of RAM it can get.  This is especially true when swapping
is required (under memory pressure).

So, in general, ballooning is NOT faster because a balloon
request to "get" RAM must wait for some other balloon driver
in some other kernel to "give" RAM.  OR some other entity
must periodically scan every kernel's memory and guess at which
kernels are using memory inefficiently and steal it away before
a "needy" kernel asks for it.

While this does indeed "work" today in VMware, if you talk to
VMware customers that use it, many are very unhappy with the
anomalous performance problems that occur.

> > The existing swap API as it stands is inadequate for an efficient
> > synchronous interface (e.g. for swapping to RAM).  Both Nitin
> > and I independently have found this to be true.  But swap-to-RAM
> 
> So... how much slower is swapping to RAM over the current interface
> compared to the proposed interface, and how much is that slower than
> just using the memory directly?

Simply copying RAM from one page owned by the kernel to another
page owned by the kernel is pretty pointless as far as swapping
is concerned because it does nothing to reduce memory pressure,
so the comparison is a bit irrelevant.  But...

In my measurements, the overhead of managing "pseudo-RAM" pages
is in the same ballpark as copying the page.  Compression or
deduplication of course has additional costs.  See the
performance results at the end of the following two presentations
for some performance information when "pseudo-RAM" is Transcendent
Memory.

http://oss.oracle.com/projects/tmem/dist/documentation/presentations/TranscendentMemoryLinuxConfAu2010.pdf 

http://oss.oracle.com/projects/tmem/dist/documentation/presentations/TranscendentMemoryXenSummit2010.pdf 

(the latter will be presented later today)

> > I'm a bit confused: What do you mean by 'existing swap API'?
> > Frontswap simply hooks into swap_readpage() and swap_writepage() to
> > call frontswap_{get,put}_page() respectively. Now, to avoid a
> > hardcoded implementation of these functions, it introduces struct frontswap_ops
> > so that custom implementations of the frontswap get/put/etc. functions can be
> > provided. This allows easy implementation of swap-to-hypervisor,
> > in-memory-compressed-swapping etc. with a common set of hooks.
> 
> Yes, and that set of hooks is a new API, right?

Well, no, if you define API as "application programming interface"
this is NOT exposed to userland.  If you define API as a new
in-kernel function call, yes, these hooks are a new API, but that
is true of virtually any new code in the kernel.  If you define
API as some new interface between the kernel and a hypervisor,
yes, this is a new API, but it is "optional" at several levels
so that any hypervisor (e.g. KVM) can completely ignore it.

So please let's not argue about whether the code is a "new API"
or not, but instead consider whether the concept is useful or not
and if useful, if there is or is not a cleaner way to implement it.

Thanks,
Dan


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-28  5:55                                 ` Pavel Machek
@ 2010-04-29 18:53                                   ` Avi Kivity
  -1 siblings, 0 replies; 163+ messages in thread
From: Avi Kivity @ 2010-04-29 18:53 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Dan Magenheimer, linux-kernel, linux-mm, jeremy, hugh.dickins,
	ngupta, JBeulich, chris.mason, kurt.hackel, dave.mccracken,
	npiggin, akpm, riel

On 04/28/2010 08:55 AM, Pavel Machek wrote:
>
>> That's a reasonable analogy.  Frontswap serves nicely as an
>> emergency safety valve when a guest has given up (too) much of
>> its memory via ballooning but unexpectedly has an urgent need
>> that can't be serviced quickly enough by the balloon driver.
>>      
> wtf? So lets fix the ballooning driver instead?
>    

You can't have a negative balloon size.  The two models are not equivalent.

Balloon allows you to give up a page for which you have a struct page.  
Frontswap (and swap) allows you to gain a page for which you don't have 
a struct page, but you can't access it directly.  The similarity is that 
in both cases the host may want the guest to give up a page, but cannot 
force it.

> There's no reason it could not be as fast as frontswap, right?
> Actually I'd expect it to be faster -- it can deal with big chunks.
>    

There's no reason for swapping and ballooning to behave differently when 
swap backing storage is RAM (they probably do now since swap was tuned 
for disks, not flash, but that's a bug if it's true).

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-29 14:42                                   ` Dan Magenheimer
@ 2010-04-29 18:59                                     ` Avi Kivity
  -1 siblings, 0 replies; 163+ messages in thread
From: Avi Kivity @ 2010-04-29 18:59 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Pavel Machek, linux-kernel, linux-mm, jeremy, hugh.dickins,
	ngupta, JBeulich, chris.mason, kurt.hackel, dave.mccracken,
	npiggin, akpm, riel

On 04/29/2010 05:42 PM, Dan Magenheimer wrote:
>>
>> Yes, and that set of hooks is new API, right?
>>      
> Well, no, if you define API as "application programming interface"
> this is NOT exposed to userland.  If you define API as a new
> in-kernel function call, yes, these hooks are a new API, but that
> is true of virtually any new code in the kernel.  If you define
> API as some new interface between the kernel and a hypervisor,
> yes, this is a new API, but it is "optional" at several levels
> so that any hypervisor (e.g. KVM) can completely ignore it.
>    

The concern is not with the hypervisor, but with Linux.  More external 
APIs reduce our flexibility to change things.

> So please let's not argue about whether the code is a "new API"
> or not, but instead consider whether the concept is useful or not
> and if useful, if there is or is not a cleaner way to implement it.
>    

I'm convinced it's useful.  The API is so close to a block device 
(read/write with key/value vs read/write with sector/value) that we 
should make the effort not to introduce a new API.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-29 18:59                                     ` Avi Kivity
@ 2010-04-29 19:01                                       ` Avi Kivity
  -1 siblings, 0 replies; 163+ messages in thread
From: Avi Kivity @ 2010-04-29 19:01 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Pavel Machek, linux-kernel, linux-mm, jeremy, hugh.dickins,
	ngupta, JBeulich, chris.mason, kurt.hackel, dave.mccracken,
	npiggin, akpm, riel

On 04/29/2010 09:59 PM, Avi Kivity wrote:
>
> I'm convinced it's useful.  The API is so close to a block device 
> (read/write with key/value vs read/write with sector/value) that we 
> should make the effort not to introduce a new API.
>

Plus of course the asynchrony and batching of the block layer.  Even
if you don't use a dma engine, you improve performance by exiting once
per several dozen pages instead of for every page, perhaps enough to
allow the hypervisor to justify copying the memory with non-temporal moves.
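
As a rough back-of-the-envelope (both costs below are assumed round
numbers for illustration, not measurements):

        one exit                     ~ 4 us
        copying one 4 KiB page       ~ 1 us

        one exit per page:           4 + 1     = 5.00 us per page
        one exit per 64-page batch:  4/64 + 1 ~= 1.06 us per page

i.e. batching moves the dominant cost from the exit to the copy itself,
which is where things like non-temporal moves start to pay off.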

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-28  5:55                                 ` Pavel Machek
@ 2010-04-30  1:45                                   ` Dave Hansen
  -1 siblings, 0 replies; 163+ messages in thread
From: Dave Hansen @ 2010-04-30  1:45 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Dan Magenheimer, Avi Kivity, linux-kernel, linux-mm, jeremy,
	hugh.dickins, ngupta, JBeulich, chris.mason, kurt.hackel,
	dave.mccracken, npiggin, akpm, riel

On Wed, 2010-04-28 at 07:55 +0200, Pavel Machek wrote:
> > > Seems frontswap is like a reverse balloon, where the balloon is in
> > > hypervisor space instead of the guest space.
> > 
> > That's a reasonable analogy.  Frontswap serves nicely as an
> > emergency safety valve when a guest has given up (too) much of
> > its memory via ballooning but unexpectedly has an urgent need
> > that can't be serviced quickly enough by the balloon driver.
> 
> wtf? So lets fix the ballooning driver instead?
> 
> There's no reason it could not be as fast as frontswap, right?
> Actually I'd expect it to be faster -- it can deal with big chunks.

Frontswap and things like CMM2[1] have some fundamental advantages over
swapping and ballooning.  First of all, there are serious limits on
ballooning.  It's difficult for a guest to span a very wide range of
memory sizes without also including memory hotplug in the mix.  The ~1%
'struct page' penalty alone causes issues here.
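
(For scale: 'struct page' costs a few tens of bytes per 4 KiB page --
roughly 32-64 bytes depending on architecture and config, figures
assumed here for illustration -- so

        32-64 / 4096  ~=  0.8-1.6%

of all RAM goes to that metadata, and it is consumed for every page the
guest could ever own, whether or not the memory is currently ballooned
away.)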

A large portion of CMM2's gain came from the fact that you could take
memory away from guests without _them_ doing any work.  If the system is
experiencing a load spike, you increase load even more by making the
guests swap.  If you can just take some of their memory away, you can
smooth that spike out.  CMM2 and frontswap do that.  The guests
explicitly give up page contents that the hypervisor does not have to
first consult with the guest before discarding.

[1] http://www.kernel.org/doc/ols/2006/ols2006v2-pages-321-336.pdf 

-- Dave


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-30  1:45                                   ` Dave Hansen
@ 2010-04-30  7:13                                     ` Avi Kivity
  -1 siblings, 0 replies; 163+ messages in thread
From: Avi Kivity @ 2010-04-30  7:13 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Pavel Machek, Dan Magenheimer, linux-kernel, linux-mm, jeremy,
	hugh.dickins, ngupta, JBeulich, chris.mason, kurt.hackel,
	dave.mccracken, npiggin, akpm, riel

On 04/30/2010 04:45 AM, Dave Hansen wrote:
>
> A large portion of CMM2's gain came from the fact that you could take
> memory away from guests without _them_ doing any work.  If the system is
> experiencing a load spike, you increase load even more by making the
> guests swap.  If you can just take some of their memory away, you can
> smooth that spike out.  CMM2 and frontswap do that.  The guests
> explicitly give up page contents that the hypervisor does not have to
> first consult with the guest before discarding.
>    

Frontswap does not do this.  Once a page has been frontswapped, the host 
is committed to retaining it until the guest releases it.  It's really 
not very different from a synchronous swap device.

I think cleancache allows the hypervisor to drop pages without the 
guest's immediate knowledge, but I'm not sure.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 163+ messages in thread

* RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-30  7:13                                     ` Avi Kivity
@ 2010-04-30 15:59                                       ` Dan Magenheimer
  -1 siblings, 0 replies; 163+ messages in thread
From: Dan Magenheimer @ 2010-04-30 15:59 UTC (permalink / raw)
  To: Avi Kivity, Dave Hansen
  Cc: Pavel Machek, linux-kernel, linux-mm, jeremy, hugh.dickins,
	ngupta, JBeulich, chris.mason, kurt.hackel, dave.mccracken,
	npiggin, akpm, riel

> > A large portion of CMM2's gain came from the fact that you could take
> > memory away from guests without _them_ doing any work.  If the system is
> > experiencing a load spike, you increase load even more by making the
> > guests swap.  If you can just take some of their memory away, you can
> > smooth that spike out.  CMM2 and frontswap do that.  The guests
> > explicitly give up page contents that the hypervisor does not have to
> > first consult with the guest before discarding.
> 
> Frontswap does not do this.  Once a page has been frontswapped, the
> host
> is committed to retaining it until the guest releases it.

Dave or others can correct me if I am wrong, but I think CMM2 also
handles dirty pages that must be retained by the hypervisor.  The
difference between CMM2 (for dirty pages) and frontswap is that
CMM2 sets hints that can be handled asynchronously while frontswap
provides explicit hooks that synchronously succeed/fail.

In fact, Avi, CMM2 is probably a fairly good approximation of what
the asynchronous interface you are suggesting might look like.
In other words, feasible but much much more complex than frontswap.

> [frontswap is] really
> not very different from a synchronous swap device.

Not to beat a dead horse, but there is a very key difference:
The size and availability of frontswap is entirely dynamic;
any page-to-be-swapped can be rejected at any time even if
a page was previously successfully swapped to the same index.
Every other swap device is much more static so the swap code
assumes a static device.  Existing swap code can account for
"bad blocks" on a static device, but this is far from sufficient
to handle the dynamicity needed by frontswap.
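
To make that contrast concrete in plain C (function names here are
illustrative, not the patch's identifiers): every single put has to be
treated as something the backend may refuse, which the static-device
assumptions described above do not cover:

#include <stdint.h>

struct page;
extern int frontswap_put_sketch(unsigned type, uint64_t offset, struct page *p);
extern int swap_device_write_sketch(unsigned type, uint64_t offset, struct page *p);

static int swap_write_sketch(unsigned type, uint64_t offset, struct page *p)
{
        /* the backend may reject any page at any time... */
        if (frontswap_put_sketch(type, offset, p) == 0)
                return 0;                       /* kept in pseudo-RAM */

        /* ...in which case the page goes to the ordinary, static device */
        return swap_device_write_sketch(type, offset, p);
}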

> I think cleancache allows the hypervisor to drop pages without the
> guest's immediate knowledge, but I'm not sure.

Yes, cleancache can drop pages at any time because (as the
name implies) only clean pages can be put into cleancache.


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-30  7:13                                     ` Avi Kivity
@ 2010-04-30 16:04                                       ` Dave Hansen
  -1 siblings, 0 replies; 163+ messages in thread
From: Dave Hansen @ 2010-04-30 16:04 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Pavel Machek, Dan Magenheimer, linux-kernel, linux-mm, jeremy,
	hugh.dickins, ngupta, JBeulich, chris.mason, kurt.hackel,
	dave.mccracken, npiggin, akpm, riel

On Fri, 2010-04-30 at 10:13 +0300, Avi Kivity wrote:
> On 04/30/2010 04:45 AM, Dave Hansen wrote:
> >
> > A large portion of CMM2's gain came from the fact that you could take
> > memory away from guests without _them_ doing any work.  If the system is
> > experiencing a load spike, you increase load even more by making the
> > guests swap.  If you can just take some of their memory away, you can
> > smooth that spike out.  CMM2 and frontswap do that.  The guests
> > explicitly give up page contents that the hypervisor does not have to
> > first consult with the guest before discarding.
> >    
> 
> Frontswap does not do this.  Once a page has been frontswapped, the host 
> is committed to retaining it until the guest releases it.  It's really 
> not very different from a synchronous swap device.
> 
> I think cleancache allows the hypervisor to drop pages without the 
> guest's immediate knowledge, but I'm not sure.

Gah.  You're right.  I'm reading the two threads and confusing the
concepts.  I'm a bit less mystified why the discussion is revolving
around the swap device so much. :)

-- Dave


^ permalink raw reply	[flat|nested] 163+ messages in thread

* RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-30 15:59                                       ` Dan Magenheimer
@ 2010-04-30 16:08                                         ` Dave Hansen
  -1 siblings, 0 replies; 163+ messages in thread
From: Dave Hansen @ 2010-04-30 16:08 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Avi Kivity, Pavel Machek, linux-kernel, linux-mm, jeremy,
	hugh.dickins, ngupta, JBeulich, chris.mason, kurt.hackel,
	dave.mccracken, npiggin, akpm, riel, Martin Schwidefsky

On Fri, 2010-04-30 at 08:59 -0700, Dan Magenheimer wrote:
> Dave or others can correct me if I am wrong, but I think CMM2 also
> handles dirty pages that must be retained by the hypervisor.  The
> difference between CMM2 (for dirty pages) and frontswap is that
> CMM2 sets hints that can be handled asynchronously while frontswap
> provides explicit hooks that synchronously succeed/fail.

Once pages were dirtied (or I guess just slightly before), they became
volatile, and I don't think the hypervisor could do anything with them.
It could still swap them out like usual, but none of the CMM-specific
optimizations could be performed.

CC'ing Martin since he's the expert. :)

-- Dave


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-30 15:59                                       ` Dan Magenheimer
@ 2010-04-30 16:16                                         ` Avi Kivity
  -1 siblings, 0 replies; 163+ messages in thread
From: Avi Kivity @ 2010-04-30 16:16 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Dave Hansen, Pavel Machek, linux-kernel, linux-mm, jeremy,
	hugh.dickins, ngupta, JBeulich, chris.mason, kurt.hackel,
	dave.mccracken, npiggin, akpm, riel

On 04/30/2010 06:59 PM, Dan Magenheimer wrote:
>>
>>> experiencing a load spike, you increase load even more by making the
>>> guests swap.  If you can just take some of their memory away, you can
>>> smooth that spike out.  CMM2 and frontswap do that.  The guests
>>> explicitly give up page contents that the hypervisor does not have to
>>> first consult with the guest before discarding.
>>>        
>> Frontswap does not do this.  Once a page has been frontswapped, the
>> host
>> is committed to retaining it until the guest releases it.
>>      
> Dave or others can correct me if I am wrong, but I think CMM2 also
> handles dirty pages that must be retained by the hypervisor.

But those are the guest's pages in the first place, that's not a new 
commitment.  CMM2 provides the hypervisor alternatives to swapping a 
page out.  Frontswap provides the guest alternatives to swapping a page out.

>    The
> difference between CMM2 (for dirty pages) and frontswap is that
> CMM2 sets hints that can be handled asynchronously while frontswap
> provides explicit hooks that synchronously succeed/fail.
>    

They are not directly comparable.  In fact for dirty pages CMM2 is 
mostly a no-op - the host is forced to swap them out if it wants them.  
CMM2 brings value for demand zero or clean pages which can be restored 
by the guest without requiring swapin.

I think for dirty pages what CMM2 brings is the ability to discard them 
if the host has swapped them out but the guest doesn't need them.

> In fact, Avi, CMM2 is probably a fairly good approximation of what
> the asynchronous interface you are suggesting might look like.
> In other words,

CMM2 is more directly comparable to ballooning rather than to 
frontswap.  Frontswap (and cleancache) work with storage that is 
external to the guest, and say nothing about the guest's page itself.

> feasible but much much more complex than frontswap.
>    

The swap API (e.g. the block layer) itself is an asynchronous batched 
version of frontswap.  The complexity in CMM2 comes from the fact that 
it is communicating information about guest pages to the host, and from 
the fact that communication is two-way and asynchronous in both directions.
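
Purely as an illustration of the two shapes (neither declaration below
is from the patch or from the kernel): a synchronous hook answers for a
single page before the caller can proceed, while an asynchronous,
batched one -- the shape the block layer already has -- submits many
pages and learns the outcome later via a completion callback:

#include <stdint.h>
#include <stddef.h>

struct page;

/* synchronous: one page, result known on return */
int sync_put(unsigned type, uint64_t offset, struct page *p);

/* asynchronous and batched: submit many, complete later */
struct put_batch {
        struct page **pages;
        uint64_t     *offsets;
        size_t        nr;
        void        (*complete)(struct put_batch *b, size_t nr_accepted);
};
int async_put_submit(unsigned type, struct put_batch *b);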


>    
>> [frontswap is] really
>> not very different from a synchronous swap device.
>>      
> Not to beat a dead horse, but there is a very key difference:
> The size and availability of frontswap is entirely dynamic;
> any page-to-be-swapped can be rejected at any time even if
> a page was previously successfully swapped to the same index.
> Every other swap device is much more static so the swap code
> assumes a static device.  Existing swap code can account for
> "bad blocks" on a static device, but this is far from sufficient
> to handle the dynamicity needed by frontswap.
>    

Given that whenever frontswap fails you need to swap anyway, it is 
better for the host to never fail a frontswap request and instead back 
it with disk storage if needed.  This way you avoid a pointless vmexit 
when you're out of memory.  Since it's disk backed it needs to be 
asynchronous and batched.

At this point we're back with the ordinary swap API.  Simply have your 
host expose a device which is write cached by host memory, you'll have 
all the benefits of frontswap with none of the disadvantages, and with 
no changes to guest code.
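
A host-side sketch of that suggestion (all helpers here are hypothetical
names used only to show the shape of the idea): the "device" the guest
swaps to never rejects a write; the host keeps the data in RAM when it
can and spills to its own disk when it cannot:

#include <stdint.h>
#include <stddef.h>
#include <string.h>

extern void *host_cache_alloc(size_t len);                      /* hypothetical */
extern void  host_cache_index(uint64_t sector, void *slot);     /* hypothetical */
extern int   host_disk_write(uint64_t sector, const void *d, size_t len); /* hypothetical */

int host_swap_write(uint64_t sector, const void *data, size_t len)
{
        void *slot = host_cache_alloc(len);     /* try to keep it in host RAM */

        if (slot) {
                memcpy(slot, data, len);
                host_cache_index(sector, slot); /* remember where it went */
                return 0;
        }
        return host_disk_write(sector, data, len); /* out of RAM: write back to disk */
}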


-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 163+ messages in thread

* RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-30 16:16                                         ` Avi Kivity
@ 2010-04-30 16:43                                           ` Dan Magenheimer
  -1 siblings, 0 replies; 163+ messages in thread
From: Dan Magenheimer @ 2010-04-30 16:43 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Dave Hansen, Pavel Machek, linux-kernel, linux-mm, jeremy,
	hugh.dickins, ngupta, JBeulich, chris.mason, kurt.hackel,
	dave.mccracken, npiggin, akpm, riel

(I'll back down on the CMM2 comparisons until I can go
back and read the paper :-)

> >> [frontswap is] really
> >> not very different from a synchronous swap device.
> >>
> > Not to beat a dead horse, but there is a very key difference:
> > The size and availability of frontswap is entirely dynamic;
> > any page-to-be-swapped can be rejected at any time even if
> > a page was previously successfully swapped to the same index.
> > Every other swap device is much more static so the swap code
> > assumes a static device.  Existing swap code can account for
> > "bad blocks" on a static device, but this is far from sufficient
> > to handle the dynamicity needed by frontswap.
> 
> Given that whenever frontswap fails you need to swap anyway, it is
> better for the host to never fail a frontswap request and instead back
> it with disk storage if needed.  This way you avoid a pointless vmexit
> when you're out of memory.  Since it's disk backed it needs to be
> asynchronous and batched.
> 
> At this point we're back with the ordinary swap API.  Simply have your
> host expose a device which is write cached by host memory, you'll have
> all the benefits of frontswap with none of the disadvantages, and with
> no changes to guest code.

I think you are making a number of possibly false assumptions here:
1) The host [the frontswap backend may not even be a hypervisor]
2) can back it with disk storage [not if it is a bare-metal hypervisor]
3) avoid a pointless vmexit [no vmexit for a non-VMX (e.g. PV) guest]
4) when you're out of memory [how can this be determined outside of
   the hypervisor?]

And, importantly, "have your host expose a device which is write
cached by host memory"... you are implying that all guest swapping
should be done to a device managed/controlled by the host?  That
eliminates guest swapping to directIO/SRIOV devices doesn't it?

Anyway, I think we can see now why frontswap might not be a good
match for a hosted hypervisor (KVM), but that doesn't make it
any less useful for a bare-metal hypervisor (or TBD for in-kernel
compressed swap and TBD for possible future pseudo-RAM technologies).

Dan

^ permalink raw reply	[flat|nested] 163+ messages in thread

* RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-30 16:43                                           ` Dan Magenheimer
@ 2010-04-30 17:10                                             ` Dave Hansen
  -1 siblings, 0 replies; 163+ messages in thread
From: Dave Hansen @ 2010-04-30 17:10 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Avi Kivity, Pavel Machek, linux-kernel, linux-mm, jeremy,
	hugh.dickins, ngupta, JBeulich, chris.mason, kurt.hackel,
	dave.mccracken, npiggin, akpm, riel

On Fri, 2010-04-30 at 09:43 -0700, Dan Magenheimer wrote:
> And, importantly, "have your host expose a device which is write
> cached by host memory"... you are implying that all guest swapping
> should be done to a device managed/controlled by the host?  That
> eliminates guest swapping to directIO/SRIOV devices doesn't it?

If you have a single swap device, sure.  But, I can also see a case
where you have a "fast" swap and "slow" swap.

The part of the frontswap argument that I like is the lack of
sizing exposed to the guest.  When you're dealing with swap-only, you
are stuck adding or removing swap devices if you want to "grow/shrink"
the memory footprint.  If the host (or whatever is backing the
frontswap) wants to change the sizes, they're fairly free to.

The part that bothers me is that it just pushes the problem
elsewhere.  For KVM, we still have to figure out _somewhere_ what to do
with all those pages.  It's nice that the host would have the freedom to
either swap or keep them around, but it doesn't really fix the problem.

I do see the lack of sizing exposed to the guest as being a bad thing,
too.  Let's say we saved 25% of system RAM to back a frontswap-type
device on a KVM host.  The first time a user boots up their set of VMs
and 25% of their RAM is gone, they're going to start complaining,
despite the fact that their 25% smaller systems may end up being faster.

I think I'd be more convinced if we saw this thing actually get used
somehow.  How is a ram-backed frontswap better than a /dev/ramX-backed
swap file in practice?

-- Dave


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-30 16:16                                         ` Avi Kivity
@ 2010-04-30 17:52                                           ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 163+ messages in thread
From: Jeremy Fitzhardinge @ 2010-04-30 17:52 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Dan Magenheimer, Dave Hansen, Pavel Machek, linux-kernel,
	linux-mm, hugh.dickins, ngupta, JBeulich, chris.mason,
	kurt.hackel, dave.mccracken, npiggin, akpm, riel

On 04/30/2010 09:16 AM, Avi Kivity wrote:
> Given that whenever frontswap fails you need to swap anyway, it is
> better for the host to never fail a frontswap request and instead back
> it with disk storage if needed.  This way you avoid a pointless vmexit
> when you're out of memory.  Since it's disk backed it needs to be
> asynchronous and batched.

I'd argue the opposite.  There's no point in having the host do swapping
on behalf of guests if guests can do it themselves; it's just a
duplication of functionality.  You end up having two IO paths for each
guest, and the resulting problems in trying to account for the IO,
rate-limit it, etc.  If you can simply say "all guest disk IO happens
via this single interface", its much easier to manage.

If frontswap has value, it's because it's providing a new facility to
guests that doesn't already exist and can't be easily emulated with
existing interfaces.

It seems to me the great strengths of the synchronous interface are:

    * it matches the needs of an existing implementation (tmem in Xen)
    * it is simple to understand within the context of the kernel code
      it's used in

Simplicity is important, because it allows the mm code to be understood
and maintained without having to have a deep understanding of
virtualization.  One of the problems with CMM2 was that it puts a lot of
intricate constraints on the mm code which can be easily broken, which
would only become apparent in subtle edge cases in a CMM2-using
environment.  An additional async frontswap-like interface - while not as
complex as CMM2 - still makes things harder for mm maintainers.

The downside is that it may not match some implementation in which the
get/put operations could take a long time (ie, physical IO to a slow
mechanical device).  But a general Linux principle is not to overdesign
interfaces for hypothetical users, only for real needs.

Do you think that you would be able to use frontswap in kvm if it were
an async interface, but not otherwise?  Or are you arguing a hypothetical?

> At this point we're back with the ordinary swap API.  Simply have your
> host expose a device which is write cached by host memory, you'll have
> all the benefits of frontswap with none of the disadvantages, and with
> no changes to guest code.

Yes, that's comfortably within the "guests page themselves" model. 
Setting up a block device for the domain which is backed by pagecache
(something we usually try hard to avoid) is pretty straightforward.  But
it doesn't work well for Xen unless the blkback domain is sized so that
it has all of Xen's free memory in its pagecache.

That said, it does concern me that the host/hypervisor is left holding
the bag on frontswapped pages.  An evil/uncooperative/lazy guest can just pump
a whole lot of pages into the frontswap pool and leave them there.   I
guess this is mitigated by the fact that the API is designed such that
they can't update or read the data without also allowing the hypervisor
to drop the page (updates can fail destructively, and reads are also
destructive), so the guest can't use it as a clumsy extension of their
normal dedicated memory.

    J


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
@ 2010-04-30 17:52                                           ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 163+ messages in thread
From: Jeremy Fitzhardinge @ 2010-04-30 17:52 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Dan Magenheimer, Dave Hansen, Pavel Machek, linux-kernel,
	linux-mm, hugh.dickins, ngupta, JBeulich, chris.mason,
	kurt.hackel, dave.mccracken, npiggin, akpm, riel

On 04/30/2010 09:16 AM, Avi Kivity wrote:
> Given that whenever frontswap fails you need to swap anyway, it is
> better for the host to never fail a frontswap request and instead back
> it with disk storage if needed.  This way you avoid a pointless vmexit
> when you're out of memory.  Since it's disk backed it needs to be
> asynchronous and batched.

I'd argue the opposite.  There's no point in having the host do swapping
on behalf of guests if guests can do it themselves; it's just a
duplication of functionality.  You end up having two IO paths for each
guest, and the resulting problems in trying to account for the IO,
rate-limit it, etc.  If you can simply say "all guest disk IO happens
via this single interface", its much easier to manage.

If frontswap has value, it's because its providing a new facility to
guests that doesn't already exist and can't be easily emulated with
existing interfaces.

It seems to me the great strengths of the synchronous interface are:

    * it matches the needs of an existing implementation (tmem in Xen)
    * it is simple to understand within the context of the kernel code
      it's used in

Simplicity is important, because it allows the mm code to be understood
and maintained without having to have a deep understanding of
virtualization.  One of the problems with CMM2 was that it puts a lot of
intricate constraints on the mm code which can be easily broken, which
would only become apparent in subtle edge cases in a CMM2-using
environment.  An addition async frontswap-like interface - while not as
complex as CMM2 - still makes things harder for mm maintainers.

The downside is that it may not match some implementation in which the
get/put operations could take a long time (ie, physical IO to a slow
mechanical device).  But a general Linux principle is not to overdesign
interfaces for hypothetical users, only for real needs.

Do you think that you would be able to use frontswap in kvm if it were
an async interface, but not otherwise?  Or are you arguing a hypothetical?

> At this point we're back with the ordinary swap API.  Simply have your
> host expose a device which is write cached by host memory, you'll have
> all the benefits of frontswap with none of the disadvantages, and with
> no changes to guest code.

Yes, that's comfortably within the "guests page themselves" model. 
Setting up a block device for the domain which is backed by pagecache
(something we usually try hard to avoid) is pretty straightforward.  But
it doesn't work well for Xen unless the blkback domain is sized so that
it has all of Xen's free memory in its pagecache.

That said, it does concern me that the host/hypervisor is left holding
the bag on frontswapped pages.  A evil/uncooperative/lazy can just pump
a whole lot of pages into the frontswap pool and leave them there.   I
guess this is mitigated by the fact that the API is designed such that
they can't update or read the data without also allowing the hypervisor
to drop the page (updates can fail destructively, and reads are also
destructive), so the guest can't use it as a clumsy extension of their
normal dedicated memory.

    J


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-30 16:43                                           ` Dan Magenheimer
@ 2010-04-30 18:08                                             ` Avi Kivity
  -1 siblings, 0 replies; 163+ messages in thread
From: Avi Kivity @ 2010-04-30 18:08 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Dave Hansen, Pavel Machek, linux-kernel, linux-mm, jeremy,
	hugh.dickins, ngupta, JBeulich, chris.mason, kurt.hackel,
	dave.mccracken, npiggin, akpm, riel

On 04/30/2010 07:43 PM, Dan Magenheimer wrote:
>> Given that whenever frontswap fails you need to swap anyway, it is
>> better for the host to never fail a frontswap request and instead back
>> it with disk storage if needed.  This way you avoid a pointless vmexit
>> when you're out of memory.  Since it's disk backed it needs to be
>> asynchronous and batched.
>>
>> At this point we're back with the ordinary swap API.  Simply have your
>> host expose a device which is write cached by host memory, you'll have
>> all the benefits of frontswap with none of the disadvantages, and with
>> no changes to guest code.
>>      
> I think you are making a number of possibly false assumptions here:
> 1) The host [the frontswap backend may not even be a hypervisor]
>    

True.  My remarks only apply to frontswap-to-hypervisor; for internally 
consumed frontswap the situation is different.

> 2) can back it with disk storage [not if it is a bare-metal hypervisor]
>    

So it seems a bare-metal hypervisor has less access to the bare metal 
than a non-bare-metal hypervisor?

Seriously, leave the bare-metal FUD to Simon.  People on this list know 
that kvm and Xen have exactly the same access to the hardware (well 
actually Xen needs to use privileged guests to access some of its hardware).

> 3) avoid a pointless vmexit [no vmexit for a non-VMX (e.g. PV) guest]
>    

There's still an exit.  It's much faster than a vmx/svm vmexit but still 
nontrivial.

But why are we optimizing for 5-year-old hardware?

> 4) when you're out of memory [how can this be determined outside of
>     the hypervisor?]
>    

It's determined by the hypervisor, same as with tmem.  The guest swaps 
to a virtual disk, the hypervisor places the data in RAM if it's 
available, or on disk if it isn't.  Write-back caching in all its glory.

> And, importantly, "have your host expose a device which is write
> cached by host memory"... you are implying that all guest swapping
> should be done to a device managed/controlled by the host?  That
> eliminates guest swapping to directIO/SRIOV devices doesn't it?
>    

You can have multiple swap devices.

wrt SR/IOV, you'll see synchronous frontswap reduce throughput.  SR/IOV 
will swap with <1 exit/page and will DMA guest pages directly, while frontswap/tmem 
will carry a 1 exit/page hit (even if no swap actually happens) and the 
copy cost (if it does).

The API really, really wants to be asynchronous.

> Anyway, I think we can see now why frontswap might not be a good
> match for a hosted hypervisor (KVM), but that doesn't make it
> any less useful for a bare-metal hypervisor (or TBD for in-kernel
> compressed swap and TBD for possible future pseudo-RAM technologies).
>    

In-kernel compressed swap does seem to be a good match for a synchronous 
API.  For future memory devices, or even bare-metal buzzword-compliant 
hypervisors, I disagree.  An asynchronous API is required for 
efficiency, and they'll all have swap capability sooner or later (kvm, 
vmware, and I believe xen 4 already do).

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-30 17:52                                           ` Jeremy Fitzhardinge
@ 2010-04-30 18:24                                             ` Avi Kivity
  -1 siblings, 0 replies; 163+ messages in thread
From: Avi Kivity @ 2010-04-30 18:24 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Dan Magenheimer, Dave Hansen, Pavel Machek, linux-kernel,
	linux-mm, hugh.dickins, ngupta, JBeulich, chris.mason,
	kurt.hackel, dave.mccracken, npiggin, akpm, riel

On 04/30/2010 08:52 PM, Jeremy Fitzhardinge wrote:
> On 04/30/2010 09:16 AM, Avi Kivity wrote:
>    
>> Given that whenever frontswap fails you need to swap anyway, it is
>> better for the host to never fail a frontswap request and instead back
>> it with disk storage if needed.  This way you avoid a pointless vmexit
>> when you're out of memory.  Since it's disk backed it needs to be
>> asynchronous and batched.
>>      
> I'd argue the opposite.  There's no point in having the host do swapping
> on behalf of guests if guests can do it themselves; it's just a
> duplication of functionality.

The problem with relying on the guest to swap is that it's voluntary.  
The guest may not be able to do it.  When the hypervisor needs memory 
and guests don't cooperate, it has to swap.

But I'm not suggesting that the host swap on behalf of the guest.  
Rather, the guest swaps to (what it sees as) a device with a large 
write-back cache; the host simply manages that cache.

> You end up having two IO paths for each
> guest, and the resulting problems in trying to account for the IO,
> rate-limit it, etc.  If you can simply say "all guest disk IO happens
> via this single interface", it's much easier to manage.
>    

With tmem you have to account for that memory, make sure it's 
distributed fairly, claim it back when you need it (requiring guest 
cooperation), live migrate and save/restore it.  It's a much larger 
change than introducing a write-back device for swapping (which has the 
benefit of working with unmodified guests).

> If frontswap has value, it's because it's providing a new facility to
> guests that doesn't already exist and can't be easily emulated with
> existing interfaces.
>
> It seems to me the great strengths of the synchronous interface are:
>
>      * it matches the needs of an existing implementation (tmem in Xen)
>      * it is simple to understand within the context of the kernel code
>        it's used in
>
> Simplicity is important, because it allows the mm code to be understood
> and maintained without having to have a deep understanding of
> virtualization.

If we use the existing paths, things are even simpler, and we match more 
needs (hypervisors with dma engines, the ability to reclaim memory 
without guest cooperation).

> One of the problems with CMM2 was that it puts a lot of
> intricate constraints on the mm code which can be easily broken, which
> would only become apparent in subtle edge cases in a CMM2-using
> environment.  An additional async frontswap-like interface - while not as
> complex as CMM2 - still makes things harder for mm maintainers.
>    

No doubt CMM2 is hard to swallow.

> The downside is that it may not match some implementation in which the
> get/put operations could take a long time (ie, physical IO to a slow
> mechanical device).  But a general Linux principle is not to overdesign
> interfaces for hypothetical users, only for real needs.
>    

> Do you think that you would be able to use frontswap in kvm if it were
> an async interface, but not otherwise?  Or are you arguing a hypothetical?
>    

For kvm (or Xen, with some modifications) all of the benefits of 
frontswap/tmem can be achieved with the ordinary swap.  It would need 
trim/discard support to avoid writing back freed data, but that's good 
for flash as well.

The advantages are:
- just works
- old guests
- <1 exit/page (since it's batched)
- no extra overhead if no free memory
- can use dma engine (since it's asynchronous)

>> At this point we're back with the ordinary swap API.  Simply have your
>> host expose a device which is write cached by host memory, you'll have
>> all the benefits of frontswap with none of the disadvantages, and with
>> no changes to guest code.
>>      
> Yes, that's comfortably within the "guests page themselves" model.
> Setting up a block device for the domain which is backed by pagecache
> (something we usually try hard to avoid) is pretty straightforward.  But
> it doesn't work well for Xen unless the blkback domain is sized so that
> it has all of Xen's free memory in its pagecache.
>    

Could be easily achieved with ballooning?

> That said, it does concern me that the host/hypervisor is left holding
> the bag on frontswapped pages.  An evil/uncooperative/lazy guest can just pump
> a whole lot of pages into the frontswap pool and leave them there.   I
> guess this is mitigated by the fact that the API is designed such that
> they can't update or read the data without also allowing the hypervisor
> to drop the page (updates can fail destructively, and reads are also
> destructive), so the guest can't use it as a clumsy extension of their
> normal dedicated memory.
>    

Eventually you'll have to swap frontswap pages, or kill uncooperative 
guests.  At which point all of the simplicity is gone.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-30 18:24                                             ` Avi Kivity
@ 2010-04-30 18:59                                               ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 163+ messages in thread
From: Jeremy Fitzhardinge @ 2010-04-30 18:59 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Dan Magenheimer, Dave Hansen, Pavel Machek, linux-kernel,
	linux-mm, hugh.dickins, ngupta, JBeulich, chris.mason,
	kurt.hackel, dave.mccracken, npiggin, akpm, riel

On 04/30/2010 11:24 AM, Avi Kivity wrote:
>> I'd argue the opposite.  There's no point in having the host do swapping
>> on behalf of guests if guests can do it themselves; it's just a
>> duplication of functionality.
>   
> The problem with relying on the guest to swap is that it's voluntary. 
> The guest may not be able to do it.  When the hypervisor needs memory
> and guests don't cooperate, it has to swap.

Or fail whatever operation it's trying to do.  You can only use
overcommit to fake unlimited resources for so long before you need a
government bailout.

>> You end up having two IO paths for each
>> guest, and the resulting problems in trying to account for the IO,
>> rate-limit it, etc.  If you can simply say "all guest disk IO happens
>> via this single interface", it's much easier to manage.
>>    
>
> With tmem you have to account for that memory, make sure it's
> distributed fairly, claim it back when you need it (requiring guest
> cooperation), live migrate and save/restore it.  It's a much larger
> change than introducing a write-back device for swapping (which has
> the benefit of working with unmodified guests).

Well, with caveats.  To be useful with migration the backing store needs
to be shared like other storage, so you can't use a specific host-local
fast (ssd) swap device.  And because the device is backed by pagecache
with delayed writes, it has much weaker integrity guarantees than a
normal device, so you need to be sure that the guests are only going to
use it for swap.  Sure, these are deployment issues rather than code
ones, but they're still issues.

>> If frontswap has value, it's because it's providing a new facility to
>> guests that doesn't already exist and can't be easily emulated with
>> existing interfaces.
>>
>> It seems to me the great strengths of the synchronous interface are:
>>
>>      * it matches the needs of an existing implementation (tmem in Xen)
>>      * it is simple to understand within the context of the kernel code
>>        it's used in
>>
>> Simplicity is important, because it allows the mm code to be understood
>> and maintained without having to have a deep understanding of
>> virtualization.
>
> If we use the existing paths, things are even simpler, and we match
> more needs (hypervisors with dma engines, the ability to reclaim
> memory without guest cooperation).

Well, you still can't reclaim memory; you can only write it out to storage. 
It may be cheaper/byte, but it's still a resource dedicated to the
guest.  But that's just a consequence of allowing overcommit, and to
what extent you're happy to allow it.

What kind of DMA engine do you have in mind?  Are there practical
memory->memory DMA engines that would be useful in this context?

>>> At this point we're back with the ordinary swap API.  Simply have your
>>> host expose a device which is write cached by host memory, you'll have
>>> all the benefits of frontswap with none of the disadvantages, and with
>>> no changes to guest code.
>>>      
>> Yes, that's comfortably within the "guests page themselves" model.
>> Setting up a block device for the domain which is backed by pagecache
>> (something we usually try hard to avoid) is pretty straightforward.  But
>> it doesn't work well for Xen unless the blkback domain is sized so that
>> it has all of Xen's free memory in its pagecache.
>>    
>
> Could be easily achieved with ballooning?

It could be achieved with ballooning, but it isn't completely trivial. 
It wouldn't work terribly well with a driver domain setup, unless all
the swap-devices turned out to be backed by the same domain (which in
turn would need to know how to balloon in response to overall system
demand).  The partitioning of the pagecache among the guests would be at
the mercy of the mm subsystem rather than subject to any specific QoS or
other per-domain policies you might want to put in place (maybe fiddling
around with [fm]advise could get you some control over that).

>
>> That said, it does concern me that the host/hypervisor is left holding
>> the bag on frontswapped pages.  An evil/uncooperative/lazy guest can just pump
>> a whole lot of pages into the frontswap pool and leave them there.   I
>> guess this is mitigated by the fact that the API is designed such that
>> they can't update or read the data without also allowing the hypervisor
>> to drop the page (updates can fail destructively, and reads are also
>> destructive), so the guest can't use it as a clumsy extension of their
>> normal dedicated memory.
>>    
>
> Eventually you'll have to swap frontswap pages, or kill uncooperative
> guests.  At which point all of the simplicity is gone.

Killing guests is pretty simple.  Presumably the oom killer will get kvm
processes like anything else?

    J


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-30 18:59                                               ` Jeremy Fitzhardinge
@ 2010-05-01  8:28                                                 ` Avi Kivity
  -1 siblings, 0 replies; 163+ messages in thread
From: Avi Kivity @ 2010-05-01  8:28 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Dan Magenheimer, Dave Hansen, Pavel Machek, linux-kernel,
	linux-mm, hugh.dickins, ngupta, JBeulich, chris.mason,
	kurt.hackel, dave.mccracken, npiggin, akpm, riel

On 04/30/2010 09:59 PM, Jeremy Fitzhardinge wrote:
> On 04/30/2010 11:24 AM, Avi Kivity wrote:
>    
>>> I'd argue the opposite.  There's no point in having the host do swapping
>>> on behalf of guests if guests can do it themselves; it's just a
>>> duplication of functionality.
>>>        
>>
>> The problem with relying on the guest to swap is that it's voluntary.
>> The guest may not be able to do it.  When the hypervisor needs memory
>> and guests don't cooperate, it has to swap.
>>      
> Or fail whatever operation its trying to do.  You can only use
> overcommit to fake unlimited resources for so long before you need a
> government bailout.
>    

Keep your commitment below RAM+swap and you'll be fine.  We want to 
overcommit RAM, not total storage.

>>> You end up having two IO paths for each
>>> guest, and the resulting problems in trying to account for the IO,
>>> rate-limit it, etc.  If you can simply say "all guest disk IO happens
>>> via this single interface", it's much easier to manage.
>>>
>>>        
>> With tmem you have to account for that memory, make sure it's
>> distributed fairly, claim it back when you need it (requiring guest
>> cooperation), live migrate and save/restore it.  It's a much larger
>> change than introducing a write-back device for swapping (which has
>> the benefit of working with unmodified guests).
>>      
> Well, with caveats.  To be useful with migration the backing store needs
> to be shared like other storage, so you can't use a specific host-local
> fast (ssd) swap device.

Live migration of local storage is possible (qemu does it).

> And because the device is backed by pagecache
> with delayed writes, it has much weaker integrity guarantees than a
> normal device, so you need to be sure that the guests are only going to
> use it for swap.  Sure, these are deployment issues rather than code
> ones, but they're still issues.
>    

You advertise it as a disk with write cache, so the guest is obliged to 
flush the cache if it wants a guarantee.  When it does, you flush your 
cache as well.  For swap, the guest will not issue any flushes.  This is 
already supported by qemu with cache=writeback.

I agree care is needed here.  You don't want to use the device for 
anything else.

>>> If frontswap has value, it's because it's providing a new facility to
>>> guests that doesn't already exist and can't be easily emulated with
>>> existing interfaces.
>>>
>>> It seems to me the great strengths of the synchronous interface are:
>>>
>>>       * it matches the needs of an existing implementation (tmem in Xen)
>>>       * it is simple to understand within the context of the kernel code
>>>         it's used in
>>>
>>> Simplicity is important, because it allows the mm code to be understood
>>> and maintained without having to have a deep understanding of
>>> virtualization.
>>>        
>> If we use the existing paths, things are even simpler, and we match
>> more needs (hypervisors with dma engines, the ability to reclaim
>> memory without guest cooperation).
>>      
> Well, you still can't reclaim memory; you can write it out to storage.
> It may be cheaper/byte, but it's still a resource dedicated to the
> guest.  But that's just a consequence of allowing overcommit, and to
> what extent you're happy to allow it.
>    

In general you want to run on RAM.  To maximise your RAM, you do things 
like page sharing and ballooning.  Both can fail, increasing the demand 
for RAM.  At that time you either kill a guest or swap to disk.

Consider frontswap/tmem on a bare-metal hypervisor cluster.  Presumably 
you give most of your free memory to guests.  A node dies.  Now you need 
to start its guests on the surviving nodes, but you're at the mercy of 
your guests to give up their tmem.

With an ordinary swap approach, you first flush cache to disk, and if 
that's not sufficient you start paging out guest memory.  You take a 
performance hit but you keep your guests running.

> What kind of DMA engine do you have in mind?  Are there practical
> memory->memory DMA engines that would be useful in this context?
>    

I/OAT (driver ioatdma).

When you don't have a lot of memory free, you can also switch from 
write cache to O_DIRECT, so you use the storage controller's dma engine 
to transfer pages to disk.
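
For reference, a minimal userspace sketch of what the switch to O_DIRECT
involves (this is not qemu code and the file name is made up): the write
bypasses the host page cache, at the cost of alignment requirements.

/* A hedged userspace sketch (not qemu code; the file name is made up):
 * O_DIRECT bypasses the host page cache so the controller can DMA the
 * data itself, but buffer, length and offset must all be aligned. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	int fd = open("swap-backing.img", O_WRONLY | O_CREAT | O_DIRECT, 0600);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	void *buf;
	if (posix_memalign(&buf, 4096, 4096))	/* O_DIRECT wants alignment */
		return 1;
	memset(buf, 0, 4096);

	if (pwrite(fd, buf, 4096, 0) != 4096)	/* goes straight to storage */
		perror("pwrite");

	free(buf);
	close(fd);
	return 0;
}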

>>> Yes, that's comfortably within the "guests page themselves" model.
>>> Setting up a block device for the domain which is backed by pagecache
>>> (something we usually try hard to avoid) is pretty straightforward.  But
>>> it doesn't work well for Xen unless the blkback domain is sized so that
>>> it has all of Xen's free memory in its pagecache.
>>>
>>>        
>> Could be easily achieved with ballooning?
>>      
> It could be achieved with ballooning, but it isn't completely trivial.
> It wouldn't work terribly well with a driver domain setup, unless all
> the swap-devices turned out to be backed by the same domain (which in
> turn would need to know how to balloon in response to overall system
> demand).  The partitioning of the pagecache among the guests would be at
> the mercy of the mm subsystem rather than subject to any specific QoS or
> other per-domain policies you might want to put in place (maybe fiddling
> around with [fm]advise could get you some control over that).
>    

See Documentation/cgroups/memory.txt.

>>> That said, it does concern me that the host/hypervisor is left holding
>>> the bag on frontswapped pages.  An evil/uncooperative/lazy guest can just pump
>>> a whole lot of pages into the frontswap pool and leave them there.   I
>>> guess this is mitigated by the fact that the API is designed such that
>>> they can't update or read the data without also allowing the hypervisor
>>> to drop the page (updates can fail destructively, and reads are also
>>> destructive), so the guest can't use it as a clumsy extension of their
>>> normal dedicated memory.
>>>
>>>        
>> Eventually you'll have to swap frontswap pages, or kill uncooperative
>> guests.  At which point all of the simplicity is gone.
>>      
> Killing guests is pretty simple.

Migrating to a hypervisor that doesn't kill guests isn't.

> Presumably the oom killer will get kvm
> processes like anything else?
>    

Yes.  Of course, you want your management code never to allow this to 
happen.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 163+ messages in thread

* RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-30 18:24                                             ` Avi Kivity
@ 2010-05-01 17:10                                               ` Dan Magenheimer
  -1 siblings, 0 replies; 163+ messages in thread
From: Dan Magenheimer @ 2010-05-01 17:10 UTC (permalink / raw)
  To: Avi Kivity, Jeremy Fitzhardinge
  Cc: Dave Hansen, Pavel Machek, linux-kernel, linux-mm, hugh.dickins,
	ngupta, JBeulich, chris.mason, kurt.hackel, dave.mccracken,
	npiggin, akpm, riel

> Eventually you'll have to swap frontswap pages, or kill uncooperative
> guests.  At which point all of the simplicity is gone.

OK, now I think I see the crux of the disagreement.

NO!  Frontswap on Xen+tmem never *never* _never_ NEVER results
in host swapping.  Host swapping is evil.  Host swapping is
the root of most of the bad reputation that memory overcommit
has gotten from VMware customers.  Host swapping can't be
avoided with some memory overcommit technologies (such as page
sharing), but frontswap on Xen+tmem CAN and DOES avoid it.

So, to summarize:

1) You agreed that a synchronous interface for frontswap makes
   sense for swap-to-in-kernel-compressed-RAM because it is
   truly swapping to RAM.
2) You have pointed out that an asynchronous interface for
   frontswap makes more sense for KVM than a synchronous
   interface, because KVM does host swapping.  Then you said
   if you have an asynchronous interface anyway, the existing
   swap code works just fine with no changes so frontswap
   is not needed at all... for KVM.
3) You have suggested that if Xen were more like KVM and required
   host-swapping, then Xen doesn't need frontswap either.

BUT frontswap on Xen+tmem always truly swaps to RAM.

So there are two users of frontswap for which the synchronous
interface makes sense.  I believe there may be more in the
future and you disagree but, as Jeremy said, "a general Linux
principle is not to overdesign interfaces for hypothetical users,
only for real needs."  We have demonstrated there is a need
with at least two users so the debate is only whether the
number of users is two or more than two.

Frontswap is a very non-invasive patch and is very cleanly
layered so that, if neither of the intended "users" is present,
it can be turned off in many different
ways with zero overhead (CONFIG'ed off) or extremely small overhead
(frontswap_ops is never set; or frontswap_ops is set but the
underlying hypervisor doesn't support it so frontswap_poolid
never gets set).
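
As a hedged model of that "extremely small overhead" case (the real
patch's signatures differ; only the frontswap_ops and frontswap_poolid
names above are taken from it), the hook on the swap path can degenerate
to a couple of in-memory tests when no backend ever registers:

/* A hedged userspace model, not the code in mm/frontswap.c: the names
 * frontswap_ops and frontswap_poolid come from the description above,
 * everything else (signatures included) is invented for the sketch. */
#include <stdio.h>

struct frontswap_ops {
	int (*put_page)(int poolid, unsigned type, unsigned long offset,
			void *page);
	/* get_page, flush_page, flush_area omitted from the sketch */
};

static struct frontswap_ops *frontswap_ops;	/* NULL: no backend registered */
static int frontswap_poolid = -1;		/* -1: "init" never ran */

static int frontswap_put(unsigned type, unsigned long offset, void *page)
{
	if (!frontswap_ops || frontswap_poolid < 0)
		return -1;	/* caller falls back to the real swap device */
	return frontswap_ops->put_page(frontswap_poolid, type, offset, page);
}

int main(void)
{
	char page[4096] = { 0 };

	/* With no backend, the hook reduces to two cheap in-memory tests. */
	printf("frontswap_put returned %d\n", frontswap_put(0, 1, page));
	return 0;
}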

So... KVM doesn't need it and won't use it.  Do you, Avi, have
any other objections as to why the frontswap patch shouldn't be
accepted as is for the users that DO need it and WILL use it?

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-05-01 17:10                                               ` Dan Magenheimer
@ 2010-05-02  7:11                                                 ` Pavel Machek
  -1 siblings, 0 replies; 163+ messages in thread
From: Pavel Machek @ 2010-05-02  7:11 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Avi Kivity, Jeremy Fitzhardinge, Dave Hansen, linux-kernel,
	linux-mm, hugh.dickins, ngupta, JBeulich, chris.mason,
	kurt.hackel, dave.mccracken, npiggin, akpm, riel


> So there are two users of frontswap for which the synchronous
> interface makes sense.  I believe there may be more in the
> future and you disagree but, as Jeremy said, "a general Linux
> principle is not to overdesign interfaces for hypothetical users,
> only for real needs."  We have demonstrated there is a need
> with at least two users so the debate is only whether the
> number of users is two or more than two.
> 
> Frontswap is a very non-invasive patch and is very cleanly
> layered so that if it is not in the presence of either of 
> the intended "users", it can be turned off in many different
> ways with zero overhead (CONFIG'ed off) or extremely small overhead
> (frontswap_ops is never set; or frontswap_ops is set but the
> underlying hypervisor doesn't support it so frontswap_poolid
> never gets set).

Yet there are less invasive solutions available, like 'add trim
operation to swap_ops'.
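
To make the comparison concrete, a hypothetical sketch of the kind of
hook being suggested (there is no swap_ops structure in the kernel today;
all names here are invented):

/* A hypothetical sketch only: there is no swap_ops structure in the
 * kernel, and this hook is invented to make the suggestion concrete. */
#include <stdio.h>

struct swap_ops {
	/* Tell the backing store these swap slots no longer hold live data,
	 * so it may drop or reclaim them (much like an SSD discard). */
	void (*trim)(unsigned type, unsigned long offset, unsigned long nr_pages);
};

static void backend_trim(unsigned type, unsigned long offset, unsigned long nr)
{
	printf("swap device %u: slots %lu..%lu may be discarded\n",
	       type, offset, offset + nr - 1);
}

static struct swap_ops ops = { .trim = backend_trim };

int main(void)
{
	/* The guest frees four swap slots; the backend is told about it. */
	ops.trim(0, 100, 4);
	return 0;
}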

So what needs to be said here is 'frontswap is XX times faster than a
swap_ops-based solution on workload YY'.
								       Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-05-01 17:10                                               ` Dan Magenheimer
@ 2010-05-02  7:57                                                 ` Nitin Gupta
  -1 siblings, 0 replies; 163+ messages in thread
From: Nitin Gupta @ 2010-05-02  7:57 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Avi Kivity, Jeremy Fitzhardinge, Dave Hansen, Pavel Machek,
	linux-kernel, linux-mm, hugh.dickins, JBeulich, chris.mason,
	kurt.hackel, dave.mccracken, npiggin, akpm, riel

On 05/01/2010 10:40 PM, Dan Magenheimer wrote:
>> Eventually you'll have to swap frontswap pages, or kill uncooperative
>> guests.  At which point all of the simplicity is gone.
> 
> OK, now I think I see the crux of the disagreement.
> 
> NO!  Frontswap on Xen+tmem never *never* _never_ NEVER results
> in host swapping.  Host swapping is evil.  Host swapping is
> the root of most of the bad reputation that memory overcommit
> has gotten from VMware customers.  Host swapping can't be
> avoided with some memory overcommit technologies (such as page
> sharing), but frontswap on Xen+tmem CAN and DOES avoid it.
> 

Why is host-level swapping evil? In the KVM case, a VM is just another
process, and the host will just swap out its pages using the same LRU-like
scheme as for any other process, AFAIK.

Also, with frontswap, the host cannot discard pages at any time, as is
the case with cleancache. So, while cleancache is obviously very
useful, the usefulness of frontswap remains doubtful.

IMHO, along with cleancache, we should just have in-memory
compressed swapping at the *host* level, i.e. no frontswap. I agree
that using frontswap hooks, it is easy to implement ramzswap
functionality, but I think it's not worth replacing this driver
with frontswap hooks. This driver already has all the goodness:
asynchronous interface, ability to dynamically add/remove ramzswap
devices etc. All that is lacking in this driver is a more efficient
'discard' functionality so we can free a page as soon as it becomes
unused.

It should also be easy to extend this driver to allow sending pages
to the host using virtio (for KVM) or Xen hypercalls, if frontswap is
needed at all.

So, IMHO we can focus on cleancache development and add the missing
parts to the ramzswap driver.

Thanks,
Nitin

^ permalink raw reply	[flat|nested] 163+ messages in thread

* RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-05-02  7:11                                                 ` Pavel Machek
@ 2010-05-02 15:05                                                   ` Dan Magenheimer
  -1 siblings, 0 replies; 163+ messages in thread
From: Dan Magenheimer @ 2010-05-02 15:05 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Avi Kivity, Jeremy Fitzhardinge, Dave Hansen, linux-kernel,
	linux-mm, hugh.dickins, ngupta, JBeulich, chris.mason,
	kurt.hackel, dave.mccracken, npiggin, akpm, riel

> > So there are two users of frontswap for which the synchronous
> > interface makes sense.  I believe there may be more in the
> > future and you disagree but, as Jeremy said, "a general Linux
> > principle is not to overdesign interfaces for hypothetical users,
> > only for real needs."  We have demonstrated there is a need
> > with at least two users so the debate is only whether the
> > number of users is two or more than two.
> >
> > Frontswap is a very non-invasive patch and is very cleanly
> > layered so that if it is not in the presence of either of
> > the intended "users", it can be turned off in many different
> > ways with zero overhead (CONFIG'ed off) or extremely small overhead
> > (frontswap_ops is never set; or frontswap_ops is set but the
> > underlying hypervisor doesn't support it so frontswap_poolid
> > never gets set).
> 
> Yet there are less invasive solutions available, like 'add trim
> operation to swap_ops'.

As Nitin pointed out much earlier in this thread:

"No: trim or discard is not useful"

I also think that trim does nothing to address the widely varying,
dynamically changing size that frontswap provides.
 
> So what needs to be said here is 'frontswap is XX times faster than
> swap_ops based solution on workload YY'.

Are you asking me to demonstrate that swap-to-hypervisor-RAM is
faster than swap-to-disk?


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-05-01 17:10                                               ` Dan Magenheimer
@ 2010-05-02 15:35                                                 ` Avi Kivity
  -1 siblings, 0 replies; 163+ messages in thread
From: Avi Kivity @ 2010-05-02 15:35 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Jeremy Fitzhardinge, Dave Hansen, Pavel Machek, linux-kernel,
	linux-mm, hugh.dickins, ngupta, JBeulich, chris.mason,
	kurt.hackel, dave.mccracken, npiggin, akpm, riel

On 05/01/2010 08:10 PM, Dan Magenheimer wrote:
>> Eventually you'll have to swap frontswap pages, or kill uncooperative
>> guests.  At which point all of the simplicity is gone.
>>      
> OK, now I think I see the crux of the disagreement.
>    

Alas, I think we're pretty far from that.

> NO!  Frontswap on Xen+tmem never *never* _never_ NEVER results
> in host swapping.

That's a bug.  You're giving the guest memory without the means to take 
it back.  The result is that you have to _undercommit_ your memory 
resources.

Consider a machine running a guest, with most of its memory free.  You 
give the memory via frontswap to the guest.  The guest happily swaps to 
frontswap, and uses the freed memory for something unswappable, like 
mlock()ed memory or hugetlbfs.

Now the second node dies and you need memory to migrate your guests 
into.  But you can't, and the hypervisor is at the mercy of the guest 
for getting its memory back; and the guest can't do it (at least not 
quickly).

> Host swapping is evil.  Host swapping is
> the root of most of the bad reputation that memory overcommit
> has gotten from VMware customers.  Host swapping can't be
> avoided with some memory overcommit technologies (such as page
> sharing), but frontswap on Xen+tmem CAN and DOES avoid it.
>    

In this case the guest expects that swapped-out memory will be slow
(since it was freed via the swap API; it will be slow if the host happened
to run out of tmem).  So by storing this memory on disk you aren't
reducing performance beyond what you promised to the guest.

Swapping guest RAM will indeed cause a performance hit, but sometimes 
you need to do it.

> So, to summarize:
>
> 1) You agreed that a synchronous interface for frontswap makes
>     sense for swap-to-in-kernel-compressed-RAM because it is
>     truly swapping to RAM.
>    

Because the interface is internal to the kernel.

> 2) You have pointed out that an asynchronous interface for
>     frontswap makes more sense for KVM than a synchronous
>     interface, because KVM does host swapping.

kvm's host swapping is unrelated.  Host swapping swaps guest-owned 
memory; that's not what we want here.  We want to cache guest swap in 
RAM, and that's easily done by having a virtual disk cached in main 
memory.  We're simply presenting a disk with a large write-back cache to 
the guest.
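
(To make the "disk with a large write-back cache" idea concrete, here
is an illustrative user-space toy of the concept only: guest writes
complete as soon as the data is in host RAM and only touch the slow
backing store on eviction.  It sketches neither qemu's nor Xen's
actual code; all names and sizes are invented.)

/*
 * Toy write-back cache in front of a slow store; hypothetical names.
 */
#include <string.h>
#include <stdio.h>

#define PAGE_SIZE   4096
#define CACHE_PAGES  256                 /* the "large" write-back cache */

static char cache[CACHE_PAGES][PAGE_SIZE];
static long cache_tag[CACHE_PAGES];      /* which disk sector, -1 = free */

static void slow_disk_write(long sector, const char *buf)
{
        /* stand-in for the real block I/O path */
        printf("writing sector %ld to disk\n", sector);
        (void)buf;
}

/* guest "swap write": completes as soon as the data is in host RAM */
static void vdisk_write(long sector, const char *buf)
{
        int idx = sector % CACHE_PAGES;

        if (cache_tag[idx] != -1 && cache_tag[idx] != sector)
                slow_disk_write(cache_tag[idx], cache[idx]);  /* evict */
        memcpy(cache[idx], buf, PAGE_SIZE);
        cache_tag[idx] = sector;
}

int main(void)
{
        char page[PAGE_SIZE] = "guest swap data";

        memset(cache_tag, -1, sizeof(cache_tag));
        vdisk_write(42, page);                /* fast: RAM only */
        vdisk_write(42 + CACHE_PAGES, page);  /* collision: evicts to disk */
        return 0;
}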

You could just as easily cache a block device in free RAM with Xen.  
Have a tmem domain behave as the backend for your swap device.  Use 
ballooning to force tmem to disk, or to allow more cache when memory is 
free.

Voila: you no longer depend on guests (you depend on the tmem domain, 
but that's part of the host code), you don't need guest modifications, 
so it works across a wider range of guests.

>    Then you said
>     if you have an asynchronous interface anyway, the existing
>     swap code works just fine with no changes so frontswap
>     is not needed at all... for KVM.
>    

For any hypervisor which implements virtual disks with write-back cache 
in host memory.

> 3) You have suggested that if Xen were more like KVM and required
>     host-swapping, then Xen doesn't need frontswap either.
>    

Host swapping is not a requirement.

> BUT frontswap on Xen+tmem always truly swaps to RAM.
>    

AND that's a problem because it puts the hypervisor at the mercy of the 
guest.

> So there are two users of frontswap for which the synchronous
> interface makes sense.

I believe there is only one.  See below.

> I believe there may be more in the
> future and you disagree but, as Jeremy said, "a general Linux
> principle is not to overdesign interfaces for hypothetical users,
> only for real needs."  We have demonstrated there is a need
> with at least two users so the debate is only whether the
> number of users is two or more than two.
>
> Frontswap is a very non-invasive patch and is very cleanly
> layered so that if it is not in the presence of either of
> the intended "users", it can be turned off in many different
> ways with zero overhead (CONFIG'ed off) or extremely small overhead
> (frontswap_ops is never set; or frontswap_ops is set but the
> underlying hypervisor doesn't support it so frontswap_poolid
> never gets set).
>    

The problem is not the complexity of the patch itself.  It's the fact 
that it introduces a new external API.  If we refactor swapping, that 
stands in the way.

How much, that's up to the mm maintainers to say.  If it isn't a problem 
for them, fine (but I still think 
swap-to-RAM-without-hypervisor-decommit is a bad idea).

> So... KVM doesn't need it and won't use it.  Do you, Avi, have
> any other objections as to why the frontswap patch shouldn't be
> accepted as is for the users that DO need it and WILL use it?
>    

Even ignoring the problems above (which are really hypervisor problems 
and the guest, which is what we're discussing here, shouldn't care if 
the hypervisor paints itself into an oom), a synchronous single-page DMA 
API is a bad idea.  Look at the Xen network and block code: while they
eventually do a memory copy for every page they see, they try to batch
multiple pages into an exit and make the response asynchronous.

As an example, with a batched API you could save/restore the fpu context
and use sse for copying the memory, while with a single-page API you'd
probably lose out.  Synchronous DMA, even for emulated hardware, is out
of place in 2010.
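
(As an illustration of that batching pattern, and emphatically not the
actual Xen ring code: pages are queued and the expensive per-exit cost
is paid once per batch.  Names and the batch size are invented.)

/*
 * Toy batched-put sketch; hypothetical names.
 */
#include <stdio.h>

#define BATCH 32

struct page_req {
        unsigned long pfn;
        unsigned long offset;
};

static struct page_req queue[BATCH];
static int queued;

static void expensive_exit(struct page_req *reqs, int n)
{
        /* one hypercall/DMA setup amortized over n pages */
        printf("flushing %d pages in one exit\n", n);
        (void)reqs;
}

static void batched_put(unsigned long pfn, unsigned long offset)
{
        queue[queued].pfn = pfn;
        queue[queued].offset = offset;
        if (++queued == BATCH) {
                expensive_exit(queue, queued);
                queued = 0;
        }
}

int main(void)
{
        for (unsigned long i = 0; i < 100; i++)
                batched_put(i, i);
        if (queued)                      /* flush the partial final batch */
                expensive_exit(queue, queued);
        return 0;
}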

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 163+ messages in thread

* RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-05-02  7:57                                                 ` Nitin Gupta
@ 2010-05-02 16:06                                                   ` Dan Magenheimer
  -1 siblings, 0 replies; 163+ messages in thread
From: Dan Magenheimer @ 2010-05-02 16:06 UTC (permalink / raw)
  To: ngupta
  Cc: Avi Kivity, Jeremy Fitzhardinge, Dave Hansen, Pavel Machek,
	linux-kernel, linux-mm, hugh.dickins, JBeulich, chris.mason,
	kurt.hackel, dave.mccracken, npiggin, akpm, riel

> > NO!  Frontswap on Xen+tmem never *never* _never_ NEVER results
> > in host swapping.  Host swapping is evil.  Host swapping is
> > the root of most of the bad reputation that memory overcommit
> > has gotten from VMware customers.  Host swapping can't be
> > avoided with some memory overcommit technologies (such as page
> > sharing), but frontswap on Xen+tmem CAN and DOES avoid it.
> 
> Why host-level swapping is evil? In KVM case, VM is just another
> process and host will just swap out pages using the same LRU like
> scheme as with any other process, AFAIK.

The first problem is that you are simulating a fast resource
(RAM) with a resource that is orders of magnitude slower with
NO visibility to the user that suffers the consequences.  A good
analogy (and no analogy is perfect) is if Linux discovers a 16MHz
80286 on a serial card in addition to the 32 3GHz cores on a
Nehalem box and, whenever the 32 cores are all busy, randomly
schedules a process on the 80286, while recording all CPU usage
data as if the 80286 is a "real" processor.... "Hmmm... why
did my compile suddenly run 100 times slower?"

The second problem is "double swapping": A guest may choose
a page to swap to "guest swap", but invisibly to the guest,
the host first must fetch it from "host swap".  (This may
seem like it is easy to avoid... it is not and happens more
frequently than you might think.)

Third, host swapping makes live migration much more difficult.
Either the host swap disk must be accessible to all machines
or data sitting on a local disk MUST be migrated along with
RAM (which is not impossible but complicates live migration
substantially).  Last I checked, VMware does not allow
page-sharing and live migration to both be enabled for the
same host.

If you talk to VMware customers (especially web-hosting services)
that have attempted to use overcommit technologies that require
host-swapping, you will find that they quickly become allergic
to memory overcommit and turn it off.  The end users (users of
the VMs that inexplicably grind to a halt) complain loudly.
As a result, RAM has become a bottleneck in many many systems,
which ultimately reduces the utility of servers and the value
of virtualization.

> Also, with frontswap, host cannot discard pages at any time as is
> the case will cleancache

True.  But in the Xen+tmem implementation there are disincentives
for a guest to unnecessarily retain pages put into frontswap,
so the host doesn't need to care that it can't discard the pages
as the guest is "billed" for them anyway.

So far we've been avoiding hypervisor policy implementation
questions and focused on mechanism (because, after all, this
is a *Linux kernel* mailing list), but we can go there if
needed.

> IMHO, along with cleancache, we should just have in in-memory
> compressed swapping at *host* level i.e. no frontswap. I agree
> that using frontswap hooks, it is easy to implement ramzswap
> functionality but I think its not worth replacing this driver
> with frontswap hooks. This driver already has all the goodness:
> asynchronous interface, ability to dynamically add/remove ramzswap
> devices etc. All that is lacking in this driver is a more efficient
> 'discard' functionality so we can free a page as soon as it becomes
> unused.

The key missing element with ramzswap is that, with frontswap, EVERY
attempt to swap a page to RAM is evaluated and potentially rejected
by the "backend" (hypervisor).  Further, frontswap requires no
additional per-guest system administration, unlike ramzswap, which
must be configured and sized per guest.  (How big should it be
anyway?)  This level of dynamicity is important for optimally
managing physical memory in a rapidly changing virtual environment.
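
(A sketch of the control flow being described, as self-contained toy
code: every page offered for swap goes to the backend first, which may
accept or reject it page by page, and a rejected page simply falls back
to the normal swap-to-disk path.  The names are only loosely modeled on
the patch, not exact kernel symbols.)

/*
 * Toy per-page accept/reject with fallback; hypothetical names.
 */
#include <stdio.h>

struct frontswap_ops {
        int (*put_page)(unsigned type, unsigned long offset, void *page);
};

/* toy backend: pretends hypervisor memory fills up after 4 pages */
static int toy_put_page(unsigned type, unsigned long offset, void *page)
{
        static int stored;
        (void)type; (void)offset; (void)page;
        return (stored++ < 4) ? 0 : -1;  /* 0 = accepted, -1 = rejected */
}

static struct frontswap_ops toy_ops = { .put_page = toy_put_page };
static struct frontswap_ops *frontswap_ops = &toy_ops;

static void write_to_swap_device(unsigned long offset)
{
        printf("offset %lu -> disk (backend rejected or absent)\n", offset);
}

static void swap_out(unsigned type, unsigned long offset, void *page)
{
        if (frontswap_ops && frontswap_ops->put_page(type, offset, page) == 0)
                printf("offset %lu -> pseudo-RAM\n", offset);
        else
                write_to_swap_device(offset);
}

int main(void)
{
        char page[4096] = { 0 };

        for (unsigned long i = 0; i < 6; i++)
                swap_out(0, i, page);
        return 0;
}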

> It should also be easy to extend this driver to allow sending pages
> to host using virtio (for KVM) or Xen hypercalls, if frontswap is
> needed at all.
> 
> So, IMHO we can focus on cleancache development and add missing
> parts to ramzswap driver.

I'm certainly open to someone exploring this approach to see if
it works for swap-to-hypervisor-RAM.  It has been my understanding
that Linus rejected the proposed discard hooks, without which
ramzswap doesn't even really work for swap-to-in-kernel-compressed-
RAM. However, I suspect that ramzswap, even with the discard hooks,
will not have the "dynamic range" useful for swap-to-hypervisor-RAM,
but frontswap will work fine for swap-to-in-kernel-compressed-RAM.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-05-02 16:06                                                   ` Dan Magenheimer
@ 2010-05-02 16:48                                                     ` Avi Kivity
  -1 siblings, 0 replies; 163+ messages in thread
From: Avi Kivity @ 2010-05-02 16:48 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: ngupta, Jeremy Fitzhardinge, Dave Hansen, Pavel Machek,
	linux-kernel, linux-mm, hugh.dickins, JBeulich, chris.mason,
	kurt.hackel, dave.mccracken, npiggin, akpm, riel

On 05/02/2010 07:06 PM, Dan Magenheimer wrote:
>>> NO!  Frontswap on Xen+tmem never *never* _never_ NEVER results
>>> in host swapping.  Host swapping is evil.  Host swapping is
>>> the root of most of the bad reputation that memory overcommit
>>> has gotten from VMware customers.  Host swapping can't be
>>> avoided with some memory overcommit technologies (such as page
>>> sharing), but frontswap on Xen+tmem CAN and DOES avoid it.
>>>        
>> Why host-level swapping is evil? In KVM case, VM is just another
>> process and host will just swap out pages using the same LRU like
>> scheme as with any other process, AFAIK.
>>      
> The first problem is that you are simulating a fast resource
> (RAM) with a resource that is orders of magnitude slower with
> NO visibility to the user that suffers the consequences.  A good
> analogy (and no analogy is perfect) is if Linux discovers a 16MHz
> 80286 on a serial card in addition to the 32 3GHz cores on a
> Nehalem box and, whenever the 32 cores are all busy, randomly
> schedules a process on the 80286, while recording all CPU usage
> data as if the 80286 is a "real" processor.... "Hmmm... why
> did my compile suddenly run 100 times slower?"
>    

It's bad, but it's better than ooming.

The same thing happens with vcpus: if you run 10 guests on one core and
they all wake up, your cpu is suddenly 10x slower and has 30000x the
interrupt latency (30ms vs 1us, assuming 3ms timeslices).  Your disks
become slower as well.

It's worse with memory, so you try to swap as a last resort.  However, 
swap is still faster than a crashed guest.


> The second problem is "double swapping": A guest may choose
> a page to swap to "guest swap", but invisibly to the guest,
> the host first must fetch it from "host swap".  (This may
> seem like it is easy to avoid... it is not and happens more
> frequently than you might think.)
>    

True.  In fact when the guest and host use the same LRU algorithm, it 
becomes even likelier.  That's one of the things CMM2 addresses.

> Third, host swapping makes live migration much more difficult.
> Either the host swap disk must be accessible to all machines
> or data sitting on a local disk MUST be migrated along with
> RAM (which is not impossible but complicates live migration
> substantially).

kvm does live migration with swapping, and has no special code to 
integrate them.

>    Last I checked, VMware does not allow
> page-sharing and live migration to both be enabled for the
> same host.
>    

Don't know about vmware, but kvm supports page sharing, swapping, and 
live migration simultaneously.

> If you talk to VMware customers (especially web-hosting services)
> that have attempted to use overcommit technologies that require
> host-swapping, you will find that they quickly become allergic
> to memory overcommit and turn it off.  The end users (users of
> the VMs that inexplicably grind to a halt) complain loudly.
> As a result, RAM has become a bottleneck in many many systems,
> which ultimately reduces the utility of servers and the value
> of virtualization.
>    

Choosing the correct overcommit ratio is certainly not an easy task.  
However, just hoping that memory will be available when you need it is 
not a good solution.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 163+ messages in thread

* RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-05-02 15:35                                                 ` Avi Kivity
@ 2010-05-02 17:06                                                   ` Dan Magenheimer
  -1 siblings, 0 replies; 163+ messages in thread
From: Dan Magenheimer @ 2010-05-02 17:06 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Jeremy Fitzhardinge, Dave Hansen, Pavel Machek, linux-kernel,
	linux-mm, hugh.dickins, ngupta, JBeulich, chris.mason,
	kurt.hackel, dave.mccracken, npiggin, akpm, riel

> > OK, now I think I see the crux of the disagreement.
> 
> Alas, I think we're pretty far from that.

Well, to be fair, I meant the disagreement over synchronous vs
asynchronous.
 
> > NO!  Frontswap on Xen+tmem never *never* _never_ NEVER results
> > in host swapping.
> 
> That's a bug.  You're giving the guest memory without the means to take
> it back.  The result is that you have to _undercommit_ your memory
> resources.
> 
> Consider a machine running a guest, with most of its memory free.  You
> give the memory via frontswap to the guest.  The guest happily swaps to
> frontswap, and uses the freed memory for something unswappable, like
> mlock()ed memory or hugetlbfs.
> 
> Now the second node dies and you need memory to migrate your guests
> into.  But you can't, and the hypervisor is at the mercy of the guest
> for getting its memory back; and the guest can't do it (at least not
> quickly).

Simple policies must exist and must be enforced by the hypervisor to ensure
this doesn't happen.  Xen+tmem provides these policies and enforces them.
And it enforces them very _dynamically_ to constantly optimize
RAM utilization across multiple guests each with dynamically varying RAM
usage.  Frontswap fits nicely into this framework.

> > Host swapping is evil.  Host swapping is
> > the root of most of the bad reputation that memory overcommit
> > has gotten from VMware customers.  Host swapping can't be
> > avoided with some memory overcommit technologies (such as page
> > sharing), but frontswap on Xen+tmem CAN and DOES avoid it.
> 
> In this case the guest expects that swapped out memory will be slow
> (since was freed via the swap API; it will be slow if the host happened
> to run out of tmem).  So by storing this memory on disk you aren't
> reducing performance beyond what you promised to the guest.
> 
> Swapping guest RAM will indeed cause a performance hit, but sometimes
> you need to do it.

Huge performance hits that are completely inexplicable to a user
give virtualization a bad reputation.  If the user (i.e. the guest
administrator, not the host administrator) can at least see "Hmmm...
I'm doing a lot of swapping, guess I'd better pay for more (virtual)
RAM", then the user objections are greatly reduced.

> > So, to summarize:
> >
> > 1) You agreed that a synchronous interface for frontswap makes
> >     sense for swap-to-in-kernel-compressed-RAM because it is
> >     truly swapping to RAM.
> 
> Because the interface is internal to the kernel.

Xen+tmem uses the SAME internal kernel interface.  The Xen-specific
code which performs the Xen-specific stuff (hypercalls) is only in
the Xen-specific directory.
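
(A sketch of the layering being claimed here, with hypothetical names:
the core only ever calls through a registered ops pointer; a Xen
backend fills those ops with functions that make hypercalls, while an
in-kernel compressed-RAM backend would fill them with plain memory
operations, and the core does not know or care which is registered.)

/*
 * Toy backend registration; hypothetical names, hypercall stubbed out.
 */
#include <stdio.h>

struct tmem_like_ops {
        int (*put_page)(unsigned long offset, void *page);
        int (*get_page)(unsigned long offset, void *page);
};

static struct tmem_like_ops *registered_ops;   /* the single internal hook */

static void register_backend(struct tmem_like_ops *ops)
{
        registered_ops = ops;
}

/* what a Xen-style backend would provide */
static int xen_put(unsigned long offset, void *page)
{
        printf("hypercall: put offset %lu\n", offset);
        (void)page;
        return 0;
}

static int xen_get(unsigned long offset, void *page)
{
        printf("hypercall: get offset %lu\n", offset);
        (void)page;
        return 0;
}

static struct tmem_like_ops xen_ops = {
        .put_page = xen_put,
        .get_page = xen_get,
};

int main(void)
{
        char page[4096] = { 0 };

        register_backend(&xen_ops);     /* done by the backend's own driver */
        if (registered_ops)             /* done by the core, backend-agnostic */
                registered_ops->put_page(1, page);
        return 0;
}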
 
> > 2) You have pointed out that an asynchronous interface for
> >     frontswap makes more sense for KVM than a synchronous
> >     interface, because KVM does host swapping.
> 
> kvm's host swapping is unrelated.  Host swapping swaps guest-owned
> memory; that's not what we want here.  We want to cache guest swap in
> RAM, and that's easily done by having a virtual disk cached in main
> memory.  We're simply presenting a disk with a large write-back cache
> to the guest.

The missing part again is dynamicity.  How large is the virtual
disk?  Or are you proposing that disks can dramatically vary
in size across time?  I suspect that would be a very big patch.
And you're talking about a disk that doesn't have all the
overhead of blockio, right?

> You could just as easily cache a block device in free RAM with Xen.
> Have a tmem domain behave as the backend for your swap device.  Use
> ballooning to force tmem to disk, or to allow more cache when memory is
> free.

A block device of what size?  Again, I don't think this will be
dynamic enough.

> Voila: you no longer depend on guests (you depend on the tmem domain,
> but that's part of the host code), you don't need guest modifications,
> so it works across a wider range of guests.

Ummm... no guest modifications, yet this special disk does everything
you've described above (and, to meet my dynamicity requirements,
varies in size as well)?

> > BUT frontswap on Xen+tmem always truly swaps to RAM.
> 
> AND that's a problem because it puts the hypervisor at the mercy of the
> guest.

As I described in a separate reply, this is simply not true.

> > So there are two users of frontswap for which the synchronous
> > interface makes sense.
> 
> I believe there is only one.  See below.
> 
> The problem is not the complexity of the patch itself.  It's the fact
> that it introduces a new external API.  If we refactor swapping, that
> stands in the way.

Could you please explicitly identify what you are referring
to as a new external API?  The part that is different from
the "only one" internal user?

> Even ignoring the problems above (which are really hypervisor problems
> and the guest, which is what we're discussing here, shouldn't care if
> the hypervisor paints itself into an oom)

which it doesn't.

> a synchronous single-page DMA
> API is a bad idea.  Look at the Xen network and block code, while they
> eventually do a memory copy for every page they see, they try to batch
> multiple pages into an exit, and make the response asynchronous.

As noted VERY early in this thread, if/when it makes sense, frontswap
can do exactly the same thing by adding a buffering layer invisible
to the internal kernel interfaces.

> As an example, with a batched API you could save/restore the fpu
> context
> and use sse for copying the memory, while with a single page API you'd
> probably lost out.  Synchronous DMA, even for emulated hardware, is out
> of place in 2010.

I think we agree that DMA makes sense when there is a lot of data to
copy and makes little sense when there is only a little (e.g. a
single page) to copy.  So I guess we need to understand what the
tradeoff is.  Do you have any idea what the breakeven point is
for your favorite DMA engine: the amount of data copied vs. the cost of
1) locking the memory pages
2) programming the DMA engine
3) responding to the interrupt from the DMA engine

And the simple act of waiting to collect enough pages to "batch"
means none of those pages can be used until the last page is collected
and the DMA engine is programmed and the DMA is complete.
A page-at-a-time interface synchronously releases the pages
for other (presumably more important) needs and thus, when
memory is under extreme pressure, also reduces the probability
of a (guest) OOM.
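
(Purely to illustrate the shape of that breakeven question, here is a
back-of-the-envelope model with invented placeholder costs.  The point
is only that the DMA engine's fixed cost (page locking, engine
programming, completion interrupt) has to be amortized over enough
pages to beat a plain CPU copy; the real numbers would come from
measurement.)

/*
 * Toy breakeven model; every constant below is a made-up placeholder.
 */
#include <stdio.h>

int main(void)
{
        const double cpu_copy_per_page = 1000.0;   /* memcpy of one page   */
        const double dma_fixed         = 5000.0;   /* lock + program + IRQ */
        const double dma_per_page      =  200.0;   /* per-page descriptor  */

        for (int pages = 1; pages <= 16; pages++) {
                double cpu = pages * cpu_copy_per_page;
                double dma = dma_fixed + pages * dma_per_page;
                printf("%2d page(s): cpu %.0f ns, dma %.0f ns%s\n",
                       pages, cpu, dma, dma < cpu ? "  <- DMA wins" : "");
        }
        return 0;
}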

^ permalink raw reply	[flat|nested] 163+ messages in thread

* RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-05-02 16:48                                                     ` Avi Kivity
@ 2010-05-02 17:22                                                       ` Dan Magenheimer
  -1 siblings, 0 replies; 163+ messages in thread
From: Dan Magenheimer @ 2010-05-02 17:22 UTC (permalink / raw)
  To: Avi Kivity
  Cc: ngupta, Jeremy Fitzhardinge, Dave Hansen, Pavel Machek,
	linux-kernel, linux-mm, hugh.dickins, JBeulich, chris.mason,
	kurt.hackel, dave.mccracken, npiggin, akpm, riel

> It's bad, but it's better than ooming.
> 
> The same thing happens with vcpus: you run 10 guests on one core, if
> they all wake up, your cpu is suddenly 10x slower and has 30000x
> interrupt latency (30ms vs 1us, assuming 3ms timeslices).  Your disks
> become slower as well.
> 
> It's worse with memory, so you try to swap as a last resort.  However,
> swap is still faster than a crashed guest.

Your analogy only holds when the host administrator is either
extremely greedy or stupid.  My analogy only requires some
statistical bad luck: Multiple guests with peaks and valleys
of memory requirements happen to have their peaks align.

> > Third, host swapping makes live migration much more difficult.
> > Either the host swap disk must be accessible to all machines
> > or data sitting on a local disk MUST be migrated along with
> > RAM (which is not impossible but complicates live migration
> > substantially).
> 
> kvm does live migration with swapping, and has no special code to
> integrate them.
>  :
> Don't know about vmware, but kvm supports page sharing, swapping, and
> live migration simultaneously.

Hmmm... I'll bet I can break it pretty easily.  I think the
case you raised that you thought would cause host OOM'ing
will cause kvm live migration to fail.

Or maybe not... when a guest is in the middle of a live migration,
I believe (in Xen) the entire guest memory allocation (possibly
excluding ballooned-out pages) must briefly be in RAM simultaneously
on BOTH the source and target machines.  That is, live migration is
not "pipelined".  Is this also true of KVM?  If so, your
statement above is just waiting for a corner case to break it.
And if not, I expect you've got fault-tolerance issues.

> > If you talk to VMware customers (especially web-hosting services)
> > that have attempted to use overcommit technologies that require
> > host-swapping, you will find that they quickly become allergic
> > to memory overcommit and turn it off.  The end users (users of
> > the VMs that inexplicably grind to a halt) complain loudly.
> > As a result, RAM has become a bottleneck in many many systems,
> > which ultimately reduces the utility of servers and the value
> > of virtualization.
> 
> Choosing the correct overcommit ratio is certainly not an easy task.
> However, just hoping that memory will be available when you need it is
> not a good solution.

Choosing the _optimal_ overcommit ratio is impossible without
prescient knowledge of the workload in each guest.  Hoping memory
will be available is certainly not a good solution, but if memory
is not available guest swapping is much better than host swapping.
And making RAM usage as dynamic as possible and live migration
as easy as possible are keys to maximizing the benefits (and
limiting the problems) of virtualization.


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-05-02 15:05                                                   ` Dan Magenheimer
@ 2010-05-02 20:06                                                     ` Pavel Machek
  -1 siblings, 0 replies; 163+ messages in thread
From: Pavel Machek @ 2010-05-02 20:06 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Avi Kivity, Jeremy Fitzhardinge, Dave Hansen, linux-kernel,
	linux-mm, hugh.dickins, ngupta, JBeulich, chris.mason,
	kurt.hackel, dave.mccracken, npiggin, akpm, riel


> > So what needs to be said here is 'frontswap is XX times faster than
> > swap_ops based solution on workload YY'.
> 
> Are you asking me to demonstrate that swap-to-hypervisor-RAM is
> faster than swap-to-disk?

I would like a comparison of swap-to-frontswap vs. swap-to-RAMdisk.
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 163+ messages in thread

* RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-05-02 20:06                                                     ` Pavel Machek
@ 2010-05-02 21:05                                                       ` Dan Magenheimer
  -1 siblings, 0 replies; 163+ messages in thread
From: Dan Magenheimer @ 2010-05-02 21:05 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Avi Kivity, Jeremy Fitzhardinge, Dave Hansen, linux-kernel,
	linux-mm, hugh.dickins, ngupta, JBeulich, chris.mason,
	kurt.hackel, dave.mccracken, npiggin, akpm, riel

> From: Pavel Machek [mailto:pavel@ucw.cz]
> 
> > > So what needs to be said here is 'frontswap is XX times faster than
> > > swap_ops based solution on workload YY'.
> >
> > Are you asking me to demonstrate that swap-to-hypervisor-RAM is
> > faster than swap-to-disk?
> 
> I would like comparison of swap-to-frontswap vs. swap-to-RAMdisk.
> 									Pavel

Well, it's not really apples-to-apples: swap-to-RAMdisk is copying
to a chunk of RAM with a known, permanently fixed size, so it SHOULD
be faster than swap-to-hypervisor, and should *definitely* be faster
than swap-to-in-kernel-compressed-RAM, but I suppose it is still an
interesting comparison.  I'll see what I can do, though it will
probably take a couple of days to figure out how to measure it
(e.g. without accidentally measuring any swap-to-disk).
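
To be concrete, the test driver could be as dumb as the sketch below:
touch more anonymous memory than free RAM and time the passes, once with
swap on a RAMdisk and once with frontswap enabled.  (Sketch only; the
size and pass count are arbitrary, and it assumes the swap device under
test is the only one configured.)

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	/* Touch more anonymous memory than free RAM so the kernel must
	 * swap; time one run with swap on a RAMdisk, one with frontswap.
	 */
	size_t mb = (argc > 1) ? strtoul(argv[1], NULL, 0) : 1024;
	size_t len = mb << 20, i, pass;
	char *p = malloc(len);

	if (!p)
		return 1;
	for (pass = 0; pass < 4; pass++)	/* several passes to keep  */
		for (i = 0; i < len; i += 4096)	/* the working set cycling */
			p[i] = (char)(i + pass);
	printf("touched %zu MB, 4 passes\n", mb);
	return 0;
}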

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-05-02 17:06                                                   ` Dan Magenheimer
@ 2010-05-03  8:46                                                     ` Avi Kivity
  -1 siblings, 0 replies; 163+ messages in thread
From: Avi Kivity @ 2010-05-03  8:46 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Jeremy Fitzhardinge, Dave Hansen, Pavel Machek, linux-kernel,
	linux-mm, hugh.dickins, ngupta, JBeulich, chris.mason,
	kurt.hackel, dave.mccracken, npiggin, akpm, riel

On 05/02/2010 08:06 PM, Dan Magenheimer wrote:
>
>>> NO!  Frontswap on Xen+tmem never *never* _never_ NEVER results
>>> in host swapping.
>>>        
>> That's a bug.  You're giving the guest memory without the means to take
>> it back.  The result is that you have to _undercommit_ your memory
>> resources.
>>
>> Consider a machine running a guest, with most of its memory free.  You
>> give the memory via frontswap to the guest.  The guest happily swaps to
>> frontswap, and uses the freed memory for something unswappable, like
>> mlock()ed memory or hugetlbfs.
>>
>> Now the second node dies and you need memory to migrate your guests
>> into.  But you can't, and the hypervisor is at the mercy of the guest
>> for getting its memory back; and the guest can't do it (at least not
>> quickly).
>>      
> Simple policies must exist and must be enforced by the hypervisor to ensure
> this doesn't happen.  Xen+tmem provides these policies and enforces them.
> And it enforces them very _dynamically_ to constantly optimize
> RAM utilization across multiple guests each with dynamically varying RAM
> usage.  Frontswap fits nicely into this framework.
>    

Can you explain what "enforcing" means in this context?  You loaned the 
guest some pages, can you enforce their return?

>>> Host swapping is evil.  Host swapping is
>>> the root of most of the bad reputation that memory overcommit
>>> has gotten from VMware customers.  Host swapping can't be
>>> avoided with some memory overcommit technologies (such as page
>>> sharing), but frontswap on Xen+tmem CAN and DOES avoid it.
>>>        
>> In this case the guest expects that swapped out memory will be slow
>> (since was freed via the swap API; it will be slow if the host happened
>> to run out of tmem).  So by storing this memory on disk you aren't
>> reducing performance beyond what you promised to the guest.
>>
>> Swapping guest RAM will indeed cause a performance hit, but sometimes
>> you need to do it.
>>      
> Huge performance hits that are completely inexplicable to a user
> give virtualization a bad reputation.  If the user (i.e. guest,
> not host, administrator) can at least see "Hmmm... I'm doing a lot
> of swapping, guess I'd better pay for more (virtual) RAM", then
> the user objections are greatly reduced.
>    

What you're saying is "don't overcommit".  That's a good policy for some 
scenarios but not for others.  Note it applies equally to cpu and to
memory.

frontswap+tmem is not overcommit, it's undercommit.   You have spare 
memory, and you give it away.  It isn't a replacement.  However, without 
the means to reclaim this spare memory, it can result in overcommit.


>>> So, to summarize:
>>>
>>> 1) You agreed that a synchronous interface for frontswap makes
>>>      sense for swap-to-in-kernel-compressed-RAM because it is
>>>      truly swapping to RAM.
>>>        
>> Because the interface is internal to the kernel.
>>      
> Xen+tmem uses the SAME internal kernel interface.  The Xen-specific
> code which performs the Xen-specific stuff (hypercalls) is only in
> the Xen-specific directory.
>    

This makes it an external interface.

>>> 2) You have pointed out that an asynchronous interface for
>>>      frontswap makes more sense for KVM than a synchronous
>>>      interface, because KVM does host swapping.
>>>        
>> kvm's host swapping is unrelated.  Host swapping swaps guest-owned
>> memory; that's not what we want here.  We want to cache guest swap in
>> RAM, and that's easily done by having a virtual disk cached in main
>> memory.  We're simply presenting a disk with a large write-back cache
>> to the guest.
>>      
> The missing part again is dynamicity.  How large is the virtual
> disk?

Exactly as large as the swap space which the guest would have in the 
frontswap+tmem case.

> Or are you proposing that disks can dramatically vary
> in size across time?

Not needed, though I expect it is already supported (SAN volumes do grow).

> I suspect that would be a very big patch.
> And you're talking about a disk that doesn't have all the
> overhead of blockio, right?
>    

If block layer overhead is a problem, go ahead and optimize it instead 
of adding new interfaces to bypass it.  Though I expect it wouldn't be 
needed, and if any optimization needs to be done it is in the swap layer.

Optimizing swap has the additional benefit of improving performance on 
flash-backed swap.

>> You could just as easily cache a block device in free RAM with Xen.
>> Have a tmem domain behave as the backend for your swap device.  Use
>> ballooning to force tmem to disk, or to allow more cache when memory is
>> free.
>>      
> A block device of what size?  Again, I don't think this will be
> dynamic enough.
>    

What happens when no tmem is available?  You swap to a volume.  That's 
the disk size needed.

>> Voila: you no longer depend on guests (you depend on the tmem domain,
>> but that's part of the host code), you don't need guest modifications,
>> so it works across a wider range of guests.
>>      
> Ummm... no guest modifications, yet this special disk does everything
> you've described above (and, to meet my dynamicity requirements,
> varies in size as well)?
>    

Your dynamic swap is limited too.  And no, no guest modifications.

>>> BUT frontswap on Xen+tmem always truly swaps to RAM.
>>>        
>> AND that's a problem because it puts the hypervisor at the mercy of the
>> guest.
>>      
> As I described in a separate reply, this is simply not true.
>    

I still don't understand why.

>>> So there are two users of frontswap for which the synchronous
>>> interface makes sense.
>>>        
>> I believe there is only one.  See below.
>>
>> The problem is not the complexity of the patch itself.  It's the fact
>> that it introduces a new external API.  If we refactor swapping, that
>> stands in the way.
>>      
> Could you please explicitly identify what you are referring
> to as a new external API?  The part this is different from
> the "only one" internal user?
>    

Something completely internal to the guest can be replaced by something 
completely different.  Something that talks to a hypervisor will need 
those hooks forever to avoid regressions.

>> a synchronous single-page DMA
>> API is a bad idea.  Look at the Xen network and block code, while they
>> eventually do a memory copy for every page they see, they try to batch
>> multiple pages into an exit, and make the response asynchronous.
>>      
> As noted VERY early in this thread, if/when it makes sense, frontswap
> can do exactly the same thing by adding a buffering layer invisible
> to the internal kernel interfaces.
>    

So, you take a synchronous copyful interface, add another copy to make 
it into an asynchronous interface, instead of using the original 
asynchronous copyless interface.

>> As an example, with a batched API you could save/restore the fpu
>> context
>> and use sse for copying the memory, while with a single page API you'd
>> probably lost out.  Synchronous DMA, even for emulated hardware, is out
>> of place in 2010.
>>      
> I think we agree that DMA makes sense when there is a lot of data to
> copy and makes little sense when there is only a little (e.g. a
> single page) to copy.  So I guess we need to understand what the
> tradeoff is.  So, do you have any idea what the breakeven point is
> for your favorite DMA engine for amount of data copied vs
> 1) locking the memory pages
> 2) programming the DMA engine
> 3) responding to the interrupt from the DMA engine
>
> And the simple act of waiting to collect enough pages to "batch"
> means none of those pages can be used until the last page is collected
> and the DMA engine is programmed and the DMA is complete.
> A page-at-a-time interface synchronously releases the pages
> for other (presumably more important) needs and thus, when
> memory is under extreme pressure, also reduces the probability
> of a (guest) OOM.
>    

When swapping out, Linux already batches pages in the block device's 
request queue.  Swapping out is inherently asynchronous and batched, 
you're swapping out those pages _because_ you don't need them, and 
you're never interested in swapping out a single page.  Linux already 
reserves memory for use during swapout.  There's no need to re-solve 
solved problems.

Swapping in is less simple; it is mostly synchronous (in some cases it 
isn't: with many threads, or with the preswap patches (IIRC unmerged)).  
You can always choose to copy if you don't have enough to justify dma.

The networking stack seems to think 4096 bytes is a good size for dma 
(see net/core/user_dma.c, NET_DMA_DEFAULT_COPYBREAK).
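
The pattern behind that copybreak threshold is roughly the sketch below
(illustration only, not the actual net/core/user_dma.c code): below the
threshold a plain CPU copy wins, above it the DMA setup cost is worth
paying.

#include <string.h>
#include <stddef.h>

#define COPYBREAK 4096	/* same default as NET_DMA_DEFAULT_COPYBREAK */

/* Stand-in for programming a real DMA engine; in this sketch it just
 * copies, whereas a real engine would complete asynchronously via an
 * interrupt.
 */
int dma_async_copy(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
	return 0;
}

/* Below the threshold, descriptor setup plus the completion interrupt
 * cost more than a CPU copy, so small transfers are copied directly.
 */
int copy_maybe_dma(void *dst, const void *src, size_t len)
{
	if (len < COPYBREAK) {
		memcpy(dst, src, len);
		return 0;
	}
	return dma_async_copy(dst, src, len);
}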

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-05-02 17:22                                                       ` Dan Magenheimer
@ 2010-05-03  9:39                                                         ` Avi Kivity
  -1 siblings, 0 replies; 163+ messages in thread
From: Avi Kivity @ 2010-05-03  9:39 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: ngupta, Jeremy Fitzhardinge, Dave Hansen, Pavel Machek,
	linux-kernel, linux-mm, hugh.dickins, JBeulich, chris.mason,
	kurt.hackel, dave.mccracken, npiggin, akpm, riel

On 05/02/2010 08:22 PM, Dan Magenheimer wrote:
>> It's bad, but it's better than ooming.
>>
>> The same thing happens with vcpus: you run 10 guests on one core, if
>> they all wake up, your cpu is suddenly 10x slower and has 30000x
>> interrupt latency (30ms vs 1us, assuming 3ms timeslices).  Your disks
>> become slower as well.
>>
>> It's worse with memory, so you try to swap as a last resort.  However,
>> swap is still faster than a crashed guest.
>>      
> Your analogy only holds when the host administrator is either
> extremely greedy or stupid.

A 10x vcpu overcommit is reasonable in some situations (VDI, powersave
at night).  Even a 2x vcpu overcommit will cause a 10000x interrupt
latency degradation.

> My analogy only requires some
> statistical bad luck: Multiple guests with peaks and valleys
> of memory requirements happen to have their peaks align.
>    

Not sure I understand.

>>> Third, host swapping makes live migration much more difficult.
>>> Either the host swap disk must be accessible to all machines
>>> or data sitting on a local disk MUST be migrated along with
>>> RAM (which is not impossible but complicates live migration
>>> substantially).
>>>        
>> kvm does live migration with swapping, and has no special code to
>> integrate them.
>>   :
>> Don't know about vmware, but kvm supports page sharing, swapping, and
>> live migration simultaneously.
>>      
> Hmmm... I'll bet I can break it pretty easily.  I think the
> case you raised that you thought would cause host OOM'ing
> will cause kvm live migration to fail.
>
> Or maybe not... when a guest is in the middle of a live migration,
> I believe (in Xen), the entire guest memory allocation (possibly
> excluding ballooned-out pages) must be simultaneously in RAM briefly
> in BOTH the host and target machine.  That is, live migration is
> not "pipelined".  Is this also true of KVM?

No.  The entire guest address space can be swapped out on the source and 
target, less the pages being copied to or from the wire, and pages 
actively accessed by the guest.  Of course performance will suck if all 
memory is swapped out.

> If so, your
> statement above is just waiting for a corner case to break it.
> And if not, I expect you've got fault tolerance issues.
>    

Not that I'm aware of.

>>> If you talk to VMware customers (especially web-hosting services)
>>> that have attempted to use overcommit technologies that require
>>> host-swapping, you will find that they quickly become allergic
>>> to memory overcommit and turn it off.  The end users (users of
>>> the VMs that inexplicably grind to a halt) complain loudly.
>>> As a result, RAM has become a bottleneck in many many systems,
>>> which ultimately reduces the utility of servers and the value
>>> of virtualization.
>>>        
>> Choosing the correct overcommit ratio is certainly not an easy task.
>> However, just hoping that memory will be available when you need it is
>> not a good solution.
>>      
> Choosing the _optimal_ overcommit ratio is impossible without a
> prescient knowledge of the workload in each guest.  Hoping memory
> will be available is certainly not a good solution, but if memory
> is not available guest swapping is much better than host swapping.
>    

You cannot rely on guest swapping.

> And making RAM usage as dynamic as possible and live migration
> as easy as possible are keys to maximizing the benefits (and
> limiting the problems) of virtualization.
>    

That is why you need overcommit.  You make things dynamic with page 
sharing and ballooning and live migration, but at some point you need a 
failsafe fallback.  The only failsafe fallback I can see (where the host 
doesn't rely on guests) is swapping.

As far as I can tell, frontswap+tmem increases the problem.  You loan 
the guest some memory without the means to take it back, which increases 
memory pressure on the host.  The result is that if you want to avoid 
swapping (or are unable to), you need to undercommit host resources.  
Instead of sum(guest mem) + reserve < (host mem), you need sum(guest mem 
+ committed tmem) + reserve < (host mem).  You need more host memory, 
fewer guests, or to be prepared to swap if the worst happens.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 163+ messages in thread

* RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-05-03  9:39                                                         ` Avi Kivity
@ 2010-05-03 14:59                                                           ` Dan Magenheimer
  -1 siblings, 0 replies; 163+ messages in thread
From: Dan Magenheimer @ 2010-05-03 14:59 UTC (permalink / raw)
  To: Avi Kivity
  Cc: ngupta, Jeremy Fitzhardinge, Dave Hansen, Pavel Machek,
	linux-kernel, linux-mm, hugh.dickins, JBeulich, chris.mason,
	kurt.hackel, dave.mccracken, npiggin, akpm, riel

> > My analogy only requires some
> > statistical bad luck: Multiple guests with peaks and valleys
> > of memory requirements happen to have their peaks align.
> 
> Not sure I understand.

Virtualization is all about statistical multiplexing of fixed
resources.  If all guests demand a resource simultaneously,
that is peak alignment == "bad luck".

(But, honestly, I don't even remember the point either of us
was trying to make here :-)

> > Or maybe not... when a guest is in the middle of a live migration,
> > I believe (in Xen), the entire guest memory allocation (possibly
> > excluding ballooned-out pages) must be simultaneously in RAM briefly
> > in BOTH the host and target machine.  That is, live migration is
> > not "pipelined".  Is this also true of KVM?
> 
> No.  The entire guest address space can be swapped out on the source
> and
> target, less the pages being copied to or from the wire, and pages
> actively accessed by the guest.  Of course performance will suck if all
> memory is swapped out.

Will it suck to the point of eventually causing the live migration
to fail?  Or will swap-storms effectively cause denial-of-service
for other guests?

Anyway, if live migration works fine with mostly-swapped-out guests
on KVM, that's great.

> > Choosing the _optimal_ overcommit ratio is impossible without a
> > prescient knowledge of the workload in each guest.  Hoping memory
> > will be available is certainly not a good solution, but if memory
> > is not available guest swapping is much better than host swapping.
> 
> You cannot rely on guest swapping.

Frontswap only relies on the guest having an existing swap device,
defined in /etc/fstab like any normal Linux swap device.  If this
is "relying on guest swapping", yes frontswap relies on guest swapping.

Or if you are referring to your "host can't force guest to
reclaim pages" argument, see the other thread.

> > And making RAM usage as dynamic as possible and live migration
> > as easy as possible are keys to maximizing the benefits (and
> > limiting the problems) of virtualization.
> 
> That is why you need overcommit.  You make things dynamic with page
> sharing and ballooning and live migration, but at some point you need a
> failsafe fallback.  The only failsafe fallback I can see (where the
> host doesn't rely on guests) is swapping.

No fallback is required if the overcommitment is done intelligently.

> As far as I can tell, frontswap+tmem increases the problem.  You loan
> the guest some memory without the means to take it back, this increases
> memory pressure on the host.  The result is that if you want to avoid
> swapping (or are unable to) you need to undercommit host resources.
> Instead of sum(guest mem) + reserve < (host mem), you need sum(guest
> mem
> + committed tmem) + reserve < (host mem).  You need more host memory,
> or less guests, or to be prepared to swap if the worst happens.

Your argument might make sense from a KVM perspective but is
not true of frontswap with Xen+tmem.  With KVM, the host's
swap disk(s) can all be used as "slow RAM".  With Xen, there is
no host swap disk.  So, yes, the degree of potential memory
overcommitment is smaller with Xen+tmem than with KVM.  In
order to avoid all the host problems with host-swapping,
frontswap+Xen+tmem intentionally limits the degree of memory
overcommitment... but this is just memory overcommitment done
intelligently.

^ permalink raw reply	[flat|nested] 163+ messages in thread

* RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-05-03  8:46                                                     ` Avi Kivity
@ 2010-05-03 16:01                                                       ` Dan Magenheimer
  -1 siblings, 0 replies; 163+ messages in thread
From: Dan Magenheimer @ 2010-05-03 16:01 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Jeremy Fitzhardinge, Dave Hansen, Pavel Machek, linux-kernel,
	linux-mm, hugh.dickins, ngupta, JBeulich, chris.mason,
	kurt.hackel, dave.mccracken, npiggin, akpm, riel

> > Simple policies must exist and must be enforced by the hypervisor to
> ensure
> > this doesn't happen.  Xen+tmem provides these policies and enforces
> them.
> > And it enforces them very _dynamically_ to constantly optimize
> > RAM utilization across multiple guests each with dynamically varying
> RAM
> > usage.  Frontswap fits nicely into this framework.
>
> Can you explain what "enforcing" means in this context?  You loaned the
> guest some pages, can you enforce their return?

We're getting into hypervisor policy issues, but given that probably
nobody else is listening by now, I guess that's OK. ;-)

The enforcement is on the "put" side.  The page is not loaned,
it is freely given, but only if the guest is within its
contractual limitations (e.g. within its predefined "maxmem").
If the guest chooses to never remove the pages from frontswap,
that's the guest's option, but that part of the guests
memory allocation can never be used for anything else so
it is in the guest's self-interest to "get" or "flush" the
pages from frontswap.
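
A minimal sketch of that put-side check (the names below are invented
for illustration; this is not the actual Xen tmem code):

#include <stdbool.h>

struct guest_account {
	unsigned long maxmem_pages;	/* contractual limit            */
	unsigned long ram_pages;	/* directly addressable RAM     */
	unsigned long frontswap_pages;	/* pages already put into tmem  */
};

/* A put succeeds only while the guest stays within maxmem; the guest
 * frees the slot again by doing a "get" or a "flush".
 */
bool frontswap_put_allowed(const struct guest_account *g)
{
	return g->ram_pages + g->frontswap_pages < g->maxmem_pages;
}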

> > Huge performance hits that are completely inexplicable to a user
> > give virtualization a bad reputation.  If the user (i.e. guest,
> > not host, administrator) can at least see "Hmmm... I'm doing a lot
> > of swapping, guess I'd better pay for more (virtual) RAM", then
> > the user objections are greatly reduced.
> 
> What you're saying is "don't overcommit".

Not at all.  I am saying "overcommit, but do it intelligently".

> That's a good policy for some
> scenarios but not for others.  Note it applies equally well for cpu as
> well as memory.

Perhaps, but CPU overcommit has been a well-understood
part of computing for a very long time and users, admins,
and hosting providers all know how to recognize it and
deal with it.  Not so with overcommitment of memory;
the only exposure to memory limitations is "my disk light
is flashing a lot, I'd better buy more RAM".  Obviously,
this doesn't translate to virtualization very well.

And, as for your interrupt latency analogy, let's
revisit that if/when Xen or KVM support CPU overcommitment
for real-time-sensitive guests.  Until then, your analogy
is misleading.

> frontswap+tmem is not overcommit, it's undercommit.   You have spare
> memory, and you give it away.  It isn't a replacement.  However,
> without
> the means to reclaim this spare memory, it can result in overcommit.

But you are missing part of the magic:  Once the memory
page is no longer directly addressable (AND this implies not
directly writable) by the guest, the hypervisor can do interesting
things with it, such as compression and deduplication.

As a result, the sum of pages used by all the guests exceeds
the total pages of RAM in the system.  Thus overcommitment.
I agree that the degree of overcommitment is less than possible
with host-swapping, but none of the evil issues of host-swapping
happen. Again, this is "intelligent overcommitment".  Other
existing forms are "overcommit and cross your fingers that bad
things don't happen."
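
As a made-up example of the arithmetic, assuming a 2:1 compression
ratio (nothing measured here):

#include <stdio.h>

int main(void)
{
	unsigned long host_pages = 1048576;		/* 4GB of 4KB pages    */
	unsigned long tmem_pool  = host_pages / 4;	/* pages given to tmem */
	double ratio = 2.0;				/* assumed 2:1 zmem    */

	unsigned long direct = host_pages - tmem_pool;
	unsigned long stored = (unsigned long)(tmem_pool * ratio);

	printf("guest-visible pages: %lu of %lu physical (%.0f%%)\n",
	       direct + stored, host_pages,
	       100.0 * (direct + stored) / host_pages);
	return 0;
}

With a quarter of RAM behind tmem and 2:1 compression, the guests
collectively see roughly 125% of physical memory.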

> > Xen+tmem uses the SAME internal kernel interface.  The Xen-specific
> > code which performs the Xen-specific stuff (hypercalls) is only in
> > the Xen-specific directory.
> 
> This makes it an external interface.
>  :
> Something completely internal to the guest can be replaced by something
> completely different.  Something that talks to a hypervisor will need
> those hooks forever to avoid regressions.

Uh, no.  As I've said, everything about frontswap is entirely
optional, both at compile-time and run-time.  A frontswap-enabled
guest is fully compatible with a hypervisor with no frontswap;
a frontswap-enabled hypervisor is fully compatible with a guest
with no frontswap.  The only thing that is reserved forever is
a hypervisor-specific "hypercall number" which is not exposed in
the Linux kernel except in Xen-specific code.  And, for Xen,
frontswap shares the same hypercall number with cleancache.

So, IMHO, you are being alarmist.  This is not an "API
maintenance" problem for Linux.

> Exactly as large as the swap space which the guest would have in the
> frontswap+tmem case.
>  :
> Not needed, though I expect it is already supported (SAN volumes do
> grow).
>  :
> If block layer overhead is a problem, go ahead and optimize it instead
> of adding new interfaces to bypass it.  Though I expect it wouldn't be
> needed, and if any optimization needs to be done it is in the swap
> layer.
> Optimizing swap has the additional benefit of improving performance on
> flash-backed swap.
>  :
> What happens when no tmem is available?  you swap to a volume.  That's
> the disk size needed.
>  :
> You're dynamic swap is limited too.  And no, no guest modifications.

You keep saying you are going to implement all of the dynamic features
of frontswap with no changes to the guest and no copying and no
host-swapping.  You are being disingenuous.  VMware has had a lot
of people working on virtualization a lot longer than you or I have.
Don't you think they would have done this by now?

Frontswap exists today and is even shipping in real released products.
If you can work your magic (in Xen... I am not trying to claim
frontswap should work with KVM), please show us the code.

> So, you take a synchronous copyful interface, add another copy to make
> it into an asynchronous interface, instead of using the original
> asynchronous copyless interface.

"Add another copy" is not required any more than it is with the
other examples you cited.

The "original asynchronous copyless interface" works because DMA
for devices has been around for >40 years and has been greatly
refined.  We're not talking about DMA to a device here, we're
talking about DMA from one place in RAM to another (i.e. from
guest RAM to hypervisor RAM).  Do you have examples of DMA engines
that do page-size-ish RAM-to-RAM more efficiently than copying?

> The networking stack seems to think 4096 bytes is a good size for dma
> (see net/core/user_dma.c, NET_DMA_DEFAULT_COPYBREAK).

Networking is a device-to-RAM transfer, not RAM-to-RAM.

> When swapping out, Linux already batches pages in the block device's
> request queue.  Swapping out is inherently asynchronous and batched,
> you're swapping out those pages _because_ you don't need them, and
> you're never interested in swapping out a single page.  Linux already
> reserves memory for use during swapout.  There's no need to re-solve
> solved problems.

Swapping out is inherently asynchronous and batched because it was
designed for swapping to a device, while you are claiming that the
same _unchanged_ interface is suitable for swap-to-hypervisor-RAM
and at the same time saying that the block layer might need
to be "optimized" (apparently without code changes).

I'm not trying to re-solve a solved problem; frontswap solves a NEW
problem, with very little impact to existing code.

> Swapping in is less simple, it is mostly synchronous (in some cases it
> isn't: with many threads, or with the preswap patches (IIRC unmerged)).
> You can always choose to copy if you don't have enough to justify dma.

Do you have a pointer to these preswap patches?


^ permalink raw reply	[flat|nested] 163+ messages in thread

* RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
@ 2010-05-03 16:01                                                       ` Dan Magenheimer
  0 siblings, 0 replies; 163+ messages in thread
From: Dan Magenheimer @ 2010-05-03 16:01 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Jeremy Fitzhardinge, Dave Hansen, Pavel Machek, linux-kernel,
	linux-mm, hugh.dickins, ngupta, JBeulich, chris.mason,
	kurt.hackel, dave.mccracken, npiggin, akpm, riel

> > Simple policies must exist and must be enforced by the hypervisor to
> ensure
> > this doesn't happen.  Xen+tmem provides these policies and enforces
> them.
> > And it enforces them very _dynamically_ to constantly optimize
> > RAM utilization across multiple guests each with dynamically varying
> RAM
> > usage.  Frontswap fits nicely into this framework.
>
> Can you explain what "enforcing" means in this context?  You loaned the
> guest some pages, can you enforce their return?

We're getting into hypervisor policy issues, but given that probably
nobody else is listening by now, I guess that's OK. ;-)

The enforcement is on the "put" side.  The page is not loaned,
it is freely given, but only if the guest is within its
contractual limitations (e.g. within its predefined "maxmem").
If the guest chooses to never remove the pages from frontswap,
that's the guest's option, but that part of the guests
memory allocation can never be used for anything else so
it is in the guest's self-interest to "get" or "flush" the
pages from frontswap.

> > Huge performance hits that are completely inexplicable to a user
> > give virtualization a bad reputation.  If the user (i.e. guest,
> > not host, administrator) can at least see "Hmmm... I'm doing a lot
> > of swapping, guess I'd better pay for more (virtual) RAM", then
> > the user objections are greatly reduced.
> 
> What you're saying is "don't overcommit".

Not at all.  I am saying "overcommit, but do it intelligently".

> That's a good policy for some
> scenarios but not for others.  Note it applies equally well for cpu as
> well as memory.

Perhaps, but CPU overcommit has been a well-understood
part of computing for a very long time and users, admins,
and hosting providers all know how to recognize it and
deal with it.  Not so with overcommitment of memory;
the only exposure to memory limitations is "my disk light
is flashing a lot, I'd better buy more RAM".  Obviously,
this doesn't translate to virtualization very well.

And, as for your interrupt latency analogy, let's
revisit that if/when Xen or KVM support CPU overcommitment
for real-time-sensitive guests.  Until then, your analogy
is misleading.

> frontswap+tmem is not overcommit, it's undercommit.   You have spare
> memory, and you give it away.  It isn't a replacement.  However,
> without
> the means to reclaim this spare memory, it can result in overcommit.

But you are missing part of the magic:  Once the memory
page is no longer directly addressable (AND this implies not
directly writable) by the guest, the hypervisor can do interesting
things with it, such as compression and deduplication.

As a result, the sum of pages used by all the guests exceeds
the total pages of RAM in the system.  Thus overcommitment.
I agree that the degree of overcommitment is less than possible
with host-swapping, but none of the evil issues of host-swapping
happen. Again, this is "intelligent overcommitment".  Other
existing forms are "overcommit and cross your fingers that bad
things don't happen."

> > Xen+tmem uses the SAME internal kernel interface.  The Xen-specific
> > code which performs the Xen-specific stuff (hypercalls) is only in
> > the Xen-specific directory.
> 
> This makes it an external interface.
>  :
> Something completely internal to the guest can be replaced by something
> completely different.  Something that talks to a hypervisor will need
> those hooks forever to avoid regressions.

Uh, no.  As I've said, everything about frontswap is entirely
optional, both at compile-time and run-time.  A frontswap-enabled
guest is fully compatible with a hypervisor with no frontswap;
a frontswap-enabled hypervisor is fully compatible with a guest
with no frontswap.  The only thing that is reserved forever is
a hypervisor-specific "hypercall number" which is not exposed in
the Linux kernel except in Xen-specific code.  And, for Xen,
frontswap shares the same hypercall number with cleancache.

So, IMHO, you are being alarmist.  This is not an "API
maintenance" problem for Linux.

> Exactly as large as the swap space which the guest would have in the
> frontswap+tmem case.
>  :
> Not needed, though I expect it is already supported (SAN volumes do
> grow).
>  :
> If block layer overhead is a problem, go ahead and optimize it instead
> of adding new interfaces to bypass it.  Though I expect it wouldn't be
> needed, and if any optimization needs to be done it is in the swap
> layer.
> Optimizing swap has the additional benefit of improving performance on
> flash-backed swap.
>  :
> What happens when no tmem is available?  you swap to a volume.  That's
> the disk size needed.
>  :
> You're dynamic swap is limited too.  And no, no guest modifications.

You keep saying you are going to implement all of the dynamic features
of frontswap with no changes to the guest and no copying and no
host-swapping.  You are being disingenuous.  VMware has had a lot
of people working on virtualization a lot longer than you or I have.
Don't you think they would have done this by now?

Frontswap exists today and is even shipping in real released products.
If you can work your magic (in Xen... I am not trying to claim
frontswap should work with KVM), please show us the code.

> So, you take a synchronous copyful interface, add another copy to make
> it into an asynchronous interface, instead of using the original
> asynchronous copyless interface.

"Add another copy" is not required any more than it is with the
other examples you cited.

The "original asynchronous copyless interface" works because DMA
for devices has been around for >40 years and has been greatly
refined.  We're not talking about DMA to a device here, we're
talking about DMA from one place in RAM to another (i.e. from
guest RAM to hypervisor RAM).  Do you have examples of DMA engines
that do page-size-ish RAM-to-RAM more efficiently than copying?

> The networking stack seems to think 4096 bytes is a good size for dma
> (see net/core/user_dma.c, NET_DMA_DEFAULT_COPYBREAK).

Networking is a device-to-RAM, not RAM-to-RAM.

> When swapping out, Linux already batches pages in the block device's
> request queue.  Swapping out is inherently asynchronous and batched,
> you're swapping out those pages _because_ you don't need them, and
> you're never interested in swapping out a single page.  Linux already
> reserves memory for use during swapout.  There's no need to re-solve
> solved problems.

Swapping out is inherently asynchronous and batched because it was
designed for swapping to a device.  Yet you claim that the
same _unchanged_ interface is suitable for swap-to-hypervisor-RAM,
while at the same time saying that the block layer might need
to be "optimized" (apparently without code changes).

I'm not trying to re-solve a solved problem; frontswap solves a NEW
problem, with very little impact to existing code.
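
The shape of that impact, as I understand it, is just a synchronous
"try frontswap first" test in the swap-out path.  The sketch below is
illustrative only (the function names and the toy backend are mine, not
the mm/page_io.c code): if the put succeeds the page stays in pseudo-RAM
and no block I/O is issued; if it fails, the existing asynchronous path
runs unchanged:

#include <stdbool.h>
#include <stdio.h>

struct page { int dummy; };

/* Toy backend: accepts pages while it has room, then starts refusing. */
static int tmem_room = 2;
static bool frontswap_put_page(struct page *p)
{
        (void)p;
        return tmem_room-- > 0;
}

static void swap_writepage_to_blockdev(struct page *p)
{
        (void)p;
        printf("queued async write to the swap device\n");
}

static void swap_writepage(struct page *p)
{
        if (frontswap_put_page(p)) {    /* synchronous; copies the page */
                printf("page kept in pseudo-RAM, no disk I/O\n");
                return;
        }
        swap_writepage_to_blockdev(p);  /* unchanged fallback path */
}

int main(void)
{
        struct page pages[3];

        for (int i = 0; i < 3; i++)
                swap_writepage(&pages[i]);
        return 0;
}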

> Swapping in is less simple; it is mostly synchronous (in some cases it
> isn't: with many threads, or with the preswap patches (IIRC unmerged)).
> You can always choose to copy if you don't have enough to justify DMA.

Do you have a pointer to these preswap patches?


^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-05-03 16:01                                                       ` Dan Magenheimer
@ 2010-05-03 19:32                                                         ` Pavel Machek
  -1 siblings, 0 replies; 163+ messages in thread
From: Pavel Machek @ 2010-05-03 19:32 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Avi Kivity, Jeremy Fitzhardinge, Dave Hansen, linux-kernel,
	linux-mm, hugh.dickins, ngupta, JBeulich, chris.mason,
	kurt.hackel, dave.mccracken, npiggin, akpm, riel


> > If block layer overhead is a problem, go ahead and optimize it instead
> > of adding new interfaces to bypass it.  Though I expect it wouldn't be
> > needed, and if any optimization needs to be done it is in the swap
> > layer.
> > Optimizing swap has the additional benefit of improving performance on
> > flash-backed swap.
> >  :
> > What happens when no tmem is available?  You swap to a volume.  That's
> > the disk size needed.
> >  :
> > Your dynamic swap is limited too.  And no, no guest modifications.
> 
> You keep saying you are going to implement all of the dynamic features
> of frontswap with no changes to the guest and no copying and no
> host-swapping.  You are being disingenuous.  VMware has had a lot

I don't see why "no copying" is a requirement.  I believe the
requirement should be "it is fast enough".
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 163+ messages in thread

* Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
  2010-04-30 16:08                                         ` Dave Hansen
@ 2010-05-10 16:05                                           ` Martin Schwidefsky
  -1 siblings, 0 replies; 163+ messages in thread
From: Martin Schwidefsky @ 2010-05-10 16:05 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dan Magenheimer, Avi Kivity, Pavel Machek, linux-kernel,
	linux-mm, jeremy, hugh.dickins, ngupta, JBeulich, chris.mason,
	kurt.hackel, dave.mccracken, npiggin, akpm, riel

On Fri, 30 Apr 2010 09:08:00 -0700
Dave Hansen <dave@linux.vnet.ibm.com> wrote:

> On Fri, 2010-04-30 at 08:59 -0700, Dan Magenheimer wrote:
> > Dave or others can correct me if I am wrong, but I think CMM2 also
> > handles dirty pages that must be retained by the hypervisor.  The
> > difference between CMM2 (for dirty pages) and frontswap is that
> > CMM2 sets hints that can be handled asynchronously while frontswap
> > provides explicit hooks that synchronously succeed/fail.
> 
> Once pages were dirtied (or I guess just slightly before), they became
> volatile, and I don't think the hypervisor could do anything with them.
> It could still swap them out like usual, but none of the CMM-specific
> optimizations could be performed.

Well, almost correct :-)
A dirty page (or one that is about to become dirty) can be in one of two
CMMA states:
1) stable
This is the case for pages where the kernel is doing some operation on
the page that will make it dirty, e.g. I/O.  Before the kernel can
allow the operation, the page has to be made stable.  If the state
conversion to stable fails because the hypervisor has removed the page,
the page needs to be deleted from the page cache and recreated from
scratch.
2) potentially-volatile
This state is used for page cache pages for which a writable mapping
exists.  The page can be removed by the hypervisor as long as the
physical per-page dirty bit is not set.  As soon as the bit is set, the
page is considered stable, although the CMMA state is still
potentially-volatile.

In both cases, the only thing the hypervisor can do with a dirty page is
to swap it as usual.
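
To restate the rule in code form, here is an illustrative sketch (not
s390 kernel code; the names are mine) of when the hypervisor may drop a
page under the two states above:

#include <stdbool.h>
#include <stdio.h>

enum cmma_state { CMMA_STABLE, CMMA_POTENTIALLY_VOLATILE };

struct guest_page {
        enum cmma_state state;
        bool hw_dirty;          /* physical per-page dirty bit */
};

/* Discard is allowed only while the page is potentially-volatile and
 * its physical dirty bit is still clear; otherwise the only option the
 * hypervisor has is to swap the page out as usual. */
static bool hypervisor_may_discard(const struct guest_page *p)
{
        return p->state == CMMA_POTENTIALLY_VOLATILE && !p->hw_dirty;
}

int main(void)
{
        struct guest_page clean  = { CMMA_POTENTIALLY_VOLATILE, false };
        struct guest_page dirty  = { CMMA_POTENTIALLY_VOLATILE, true  };
        struct guest_page stable = { CMMA_STABLE,               true  };

        printf("clean potentially-volatile: may discard = %d\n",
               hypervisor_may_discard(&clean));
        printf("dirty potentially-volatile: may discard = %d (swap as usual)\n",
               hypervisor_may_discard(&dirty));
        printf("stable:                     may discard = %d (swap as usual)\n",
               hypervisor_may_discard(&stable));
        return 0;
}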

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 163+ messages in thread

end of thread, other threads:[~2010-05-10 16:05 UTC | newest]

Thread overview: 163+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-04-22 13:42 Frontswap [PATCH 0/4] (was Transcendent Memory): overview Dan Magenheimer
2010-04-22 13:42 ` Dan Magenheimer
2010-04-22 15:28 ` Avi Kivity
2010-04-22 15:28   ` Avi Kivity
2010-04-22 15:48   ` Dan Magenheimer
2010-04-22 15:48     ` Dan Magenheimer
2010-04-22 16:13     ` Avi Kivity
2010-04-22 16:13       ` Avi Kivity
2010-04-22 20:15       ` Dan Magenheimer
2010-04-22 20:15         ` Dan Magenheimer
2010-04-23  9:48         ` Avi Kivity
2010-04-23  9:48           ` Avi Kivity
2010-04-23 13:47           ` Dan Magenheimer
2010-04-23 13:47             ` Dan Magenheimer
2010-04-23 13:57             ` Avi Kivity
2010-04-23 13:57               ` Avi Kivity
2010-04-23 14:43               ` Dan Magenheimer
2010-04-23 14:43                 ` Dan Magenheimer
2010-04-23 14:52                 ` Avi Kivity
2010-04-23 14:52                   ` Avi Kivity
2010-04-23 15:00                   ` Avi Kivity
2010-04-23 15:00                     ` Avi Kivity
2010-04-23 16:26                     ` Dan Magenheimer
2010-04-23 16:26                       ` Dan Magenheimer
2010-04-24 18:25                       ` Avi Kivity
2010-04-24 18:25                         ` Avi Kivity
     [not found]                         ` <1c02a94a-a6aa-4cbb-a2e6-9d4647760e91@default4BD43033.7090706@redhat.com>
2010-04-25  0:41                         ` Dan Magenheimer
2010-04-25  0:41                           ` Dan Magenheimer
2010-04-25 12:06                           ` Avi Kivity
2010-04-25 12:06                             ` Avi Kivity
2010-04-25 13:12                             ` Dan Magenheimer
2010-04-25 13:12                               ` Dan Magenheimer
2010-04-25 13:18                               ` Avi Kivity
2010-04-25 13:18                                 ` Avi Kivity
2010-04-28  5:55                               ` Pavel Machek
2010-04-28  5:55                                 ` Pavel Machek
2010-04-29 14:42                                 ` Dan Magenheimer
2010-04-29 14:42                                   ` Dan Magenheimer
2010-04-29 18:59                                   ` Avi Kivity
2010-04-29 18:59                                     ` Avi Kivity
2010-04-29 19:01                                     ` Avi Kivity
2010-04-29 19:01                                       ` Avi Kivity
2010-04-29 18:53                                 ` Avi Kivity
2010-04-29 18:53                                   ` Avi Kivity
2010-04-30  1:45                                 ` Dave Hansen
2010-04-30  1:45                                   ` Dave Hansen
2010-04-30  7:13                                   ` Avi Kivity
2010-04-30  7:13                                     ` Avi Kivity
2010-04-30 15:59                                     ` Dan Magenheimer
2010-04-30 15:59                                       ` Dan Magenheimer
2010-04-30 16:08                                       ` Dave Hansen
2010-04-30 16:08                                         ` Dave Hansen
2010-05-10 16:05                                         ` Martin Schwidefsky
2010-05-10 16:05                                           ` Martin Schwidefsky
2010-04-30 16:16                                       ` Avi Kivity
2010-04-30 16:16                                         ` Avi Kivity
     [not found]                                         ` <4BDB18CE.2090608@goop.org4BDB2069.4000507@redhat.com>
     [not found]                                           ` <3a62a058-7976-48d7-acd2-8c6a8312f10f@default20100502071059.GF1790@ucw.cz>
2010-04-30 16:43                                         ` Dan Magenheimer
2010-04-30 16:43                                           ` Dan Magenheimer
2010-04-30 17:10                                           ` Dave Hansen
2010-04-30 17:10                                             ` Dave Hansen
2010-04-30 18:08                                           ` Avi Kivity
2010-04-30 18:08                                             ` Avi Kivity
2010-04-30 17:52                                         ` Jeremy Fitzhardinge
2010-04-30 17:52                                           ` Jeremy Fitzhardinge
2010-04-30 18:24                                           ` Avi Kivity
2010-04-30 18:24                                             ` Avi Kivity
2010-04-30 18:59                                             ` Jeremy Fitzhardinge
2010-04-30 18:59                                               ` Jeremy Fitzhardinge
2010-05-01  8:28                                               ` Avi Kivity
2010-05-01  8:28                                                 ` Avi Kivity
2010-05-01 17:10                                             ` Dan Magenheimer
2010-05-01 17:10                                               ` Dan Magenheimer
2010-05-02  7:11                                               ` Pavel Machek
2010-05-02  7:11                                                 ` Pavel Machek
2010-05-02 15:05                                                 ` Dan Magenheimer
2010-05-02 15:05                                                   ` Dan Magenheimer
2010-05-02 20:06                                                   ` Pavel Machek
2010-05-02 20:06                                                     ` Pavel Machek
2010-05-02 21:05                                                     ` Dan Magenheimer
2010-05-02 21:05                                                       ` Dan Magenheimer
2010-05-02  7:57                                               ` Nitin Gupta
2010-05-02  7:57                                                 ` Nitin Gupta
2010-05-02 16:06                                                 ` Dan Magenheimer
2010-05-02 16:06                                                   ` Dan Magenheimer
2010-05-02 16:48                                                   ` Avi Kivity
2010-05-02 16:48                                                     ` Avi Kivity
2010-05-02 17:22                                                     ` Dan Magenheimer
2010-05-02 17:22                                                       ` Dan Magenheimer
2010-05-03  9:39                                                       ` Avi Kivity
2010-05-03  9:39                                                         ` Avi Kivity
2010-05-03 14:59                                                         ` Dan Magenheimer
2010-05-03 14:59                                                           ` Dan Magenheimer
2010-05-02 15:35                                               ` Avi Kivity
2010-05-02 15:35                                                 ` Avi Kivity
2010-05-02 17:06                                                 ` Dan Magenheimer
2010-05-02 17:06                                                   ` Dan Magenheimer
2010-05-03  8:46                                                   ` Avi Kivity
2010-05-03  8:46                                                     ` Avi Kivity
2010-05-03 16:01                                                     ` Dan Magenheimer
2010-05-03 16:01                                                       ` Dan Magenheimer
2010-05-03 19:32                                                       ` Pavel Machek
2010-05-03 19:32                                                         ` Pavel Machek
2010-04-30 16:04                                     ` Dave Hansen
2010-04-30 16:04                                       ` Dave Hansen
2010-04-23 15:56                   ` Dan Magenheimer
2010-04-23 15:56                     ` Dan Magenheimer
2010-04-24 18:22                     ` Avi Kivity
2010-04-24 18:22                       ` Avi Kivity
2010-04-25  0:30                       ` Dan Magenheimer
2010-04-25  0:30                         ` Dan Magenheimer
2010-04-25 12:11                         ` Avi Kivity
2010-04-25 12:11                           ` Avi Kivity
     [not found]                           ` <c5062f3a-3232-4b21-b032-2ee1f2485ff0@default4BD44E74.2020506@redhat.com>
2010-04-25 13:37                           ` Dan Magenheimer
2010-04-25 13:37                             ` Dan Magenheimer
2010-04-25 14:15                             ` Avi Kivity
2010-04-25 14:15                               ` Avi Kivity
2010-04-25 15:29                               ` Dan Magenheimer
2010-04-25 15:29                                 ` Dan Magenheimer
2010-04-26  6:01                                 ` Avi Kivity
2010-04-26  6:01                                   ` Avi Kivity
2010-04-26 12:45                                   ` Dan Magenheimer
2010-04-26 12:45                                     ` Dan Magenheimer
2010-04-26 13:48                                     ` Avi Kivity
2010-04-26 13:48                                       ` Avi Kivity
2010-04-27 12:56                                 ` Pavel Machek
2010-04-27 12:56                                   ` Pavel Machek
2010-04-27 14:32                                   ` Dan Magenheimer
2010-04-27 14:32                                     ` Dan Magenheimer
2010-04-29 13:02                                     ` Pavel Machek
2010-04-29 13:02                                       ` Pavel Machek
2010-04-27 11:52                             ` Valdis.Kletnieks
2010-04-27  0:49                           ` Jeremy Fitzhardinge
2010-04-27  0:49                             ` Jeremy Fitzhardinge
2010-04-27 12:55                         ` Pavel Machek
2010-04-27 12:55                           ` Pavel Machek
2010-04-27 14:43                           ` Nitin Gupta
2010-04-27 14:43                             ` Nitin Gupta
2010-04-29 13:04                             ` Pavel Machek
2010-04-29 13:04                               ` Pavel Machek
2010-04-24  1:49                   ` Nitin Gupta
2010-04-24  1:49                     ` Nitin Gupta
2010-04-24 18:27                     ` Avi Kivity
2010-04-24 18:27                       ` Avi Kivity
2010-04-25  3:11                       ` Nitin Gupta
2010-04-25  3:11                         ` Nitin Gupta
2010-04-25 12:16                         ` Avi Kivity
2010-04-25 12:16                           ` Avi Kivity
2010-04-25 16:05                           ` Nitin Gupta
2010-04-25 16:05                             ` Nitin Gupta
2010-04-26  6:06                             ` Avi Kivity
2010-04-26  6:06                               ` Avi Kivity
2010-04-26 12:50                               ` Dan Magenheimer
2010-04-26 12:50                                 ` Dan Magenheimer
2010-04-26 13:43                                 ` Avi Kivity
2010-04-26 13:43                                   ` Avi Kivity
2010-04-27  8:29                                   ` Dan Magenheimer
2010-04-27  8:29                                     ` Dan Magenheimer
2010-04-27  9:21                                     ` Avi Kivity
2010-04-27  9:21                                       ` Avi Kivity
2010-04-26 13:47                               ` Nitin Gupta
2010-04-26 13:47                                 ` Nitin Gupta
2010-04-23 16:35             ` Jiahua
2010-04-23 16:35               ` Jiahua
