linux-kernel.vger.kernel.org archive mirror
* swap on eMMC and other flash
@ 2012-03-30 17:44 Arnd Bergmann
  2012-03-30 18:50 ` Arnd Bergmann
       [not found] ` <CAEwNFnA2GeOayw2sJ_KXv4qOdC50_Nt2KoK796YmQF+YV1GiEA@mail.gmail.com>
  0 siblings, 2 replies; 41+ messages in thread
From: Arnd Bergmann @ 2012-03-30 17:44 UTC (permalink / raw)
  To: linaro-kernel
  Cc: android-kernel, linux-mm, Luca Porzio (lporzio),
	Alex Lemberg, linux-kernel, Saugata Das, Venkatraman S,
	Yejin Moon, Hyojin Jeong, linux-mmc

We've had a discussion in the Linaro storage team (Saugata, Venkat and me,
with Luca joining in on the discussion) about swapping to flash based media
such as eMMC. This is a summary of what we found and what we think should
be done. If people agree that this is a good idea, we can start working
on it.

The basic problem is that Linux without swap is sort of crippled and some
things either don't work at all (hibernate) or are not as efficient as they
should be (e.g. tmpfs). At the same time, the swap code seems to be rather
inappropriate for the algorithms used in most flash media today, causing
system performance to suffer drastically, and wearing out the flash hardware
much faster than necessary. In order to change that, we would be
implementing the following changes:

1) Try to swap out multiple pages at once, in a single write request. My
reading of the current code is that we always send pages one by one to
the swap device, while most flash devices have an optimum write size of
32 or 64 kb and some require an alignment of more than a page. Ideally
we would try to write an aligned 64 kb block all the time. Writing aligned
64 kb chunks often gives us ten times the throughput of linear 4 kb writes,
and going beyond 64 kb usually does not give any better performance.

2) Make variable sized swap clusters. Right now, the swap space is
organized in clusters of 256 pages (1MB), which is less than the typical
erase block size of 4 or 8 MB. We should try to make the swap cluster
aligned to erase blocks and have the size match to avoid garbage collection
in the drive. The cluster size would typically be set by mkswap as a new
option and interpreted at swapon time.

3) As Luca points out, some eMMC media would benefit significantly from
having discard requests issued for every page that gets freed from
the swap cache, rather than at the time just before we reuse a swap
cluster. This would probably have to become a configurable option
as well, to avoid the overhead of sending the discard requests on
media that don't benefit from this.

Does this all sound appropriate for the Linux memory management people?

Also, does this sound useful to the Android developers? Would you
start using swap if we make it perform well and not destroy the drives?

Finally, does this plan match up with the capabilities of the
various eMMC devices? I know more about SD and USB devices and
I'm quite convinced that it would help there, but eMMC can be
more like an SSD in some ways, and the current code should be fine
for real SSDs.

	Arnd

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: swap on eMMC and other flash
  2012-03-30 17:44 swap on eMMC and other flash Arnd Bergmann
@ 2012-03-30 18:50 ` Arnd Bergmann
  2012-03-30 22:08   ` Zach Pfeffer
                     ` (2 more replies)
       [not found] ` <CAEwNFnA2GeOayw2sJ_KXv4qOdC50_Nt2KoK796YmQF+YV1GiEA@mail.gmail.com>
  1 sibling, 3 replies; 41+ messages in thread
From: Arnd Bergmann @ 2012-03-30 18:50 UTC (permalink / raw)
  To: linaro-kernel, linux-mm
  Cc: Luca Porzio (lporzio),
	Alex Lemberg, linux-kernel, Saugata Das, Venkatraman S,
	Yejin Moon, Hyojin Jeong, linux-mmc, kernel-team

(sorry for the duplicated email, this corrects the address of the android
kernel team, please reply here)

On Friday 30 March 2012, Arnd Bergmann wrote:

 We've had a discussion in the Linaro storage team (Saugata, Venkat and me,
 with Luca joining in on the discussion) about swapping to flash based media
 such as eMMC. This is a summary of what we found and what we think should
 be done. If people agree that this is a good idea, we can start working
 on it.
 
 The basic problem is that Linux without swap is sort of crippled and some
 things either don't work at all (hibernate) or not as efficient as they
 should (e.g. tmpfs). At the same time, the swap code seems to be rather
 inappropriate for the algorithms used in most flash media today, causing
 system performance to suffer drastically, and wearing out the flash hardware
 much faster than necessary. In order to change that, we would be
 implementing the following changes:
 
 1) Try to swap out multiple pages at once, in a single write request. My
 reading of the current code is that we always send pages one by one to
 the swap device, while most flash devices have an optimum write size of
 32 or 64 kb and some require an alignment of more than a page. Ideally
 we would try to write an aligned 64 kb block all the time. Writing aligned
 64 kb chunks often gives us ten times the throughput of linear 4kb writes,
 and going beyond 64 kb usually does not give any better performance.
 
 2) Make variable sized swap clusters. Right now, the swap space is
 organized in clusters of 256 pages (1MB), which is less than the typical
 erase block size of 4 or 8 MB. We should try to make the swap cluster
 aligned to erase blocks and have the size match to avoid garbage collection
 in the drive. The cluster size would typically be set by mkswap as a new
 option and interpreted at swapon time.
 
 3) As Luca points out, some eMMC media would benefit significantly from
 having discard requests issued for every page that gets freed from
 the swap cache, rather than at the time just before we reuse a swap
 cluster. This would probably have to become a configurable option
 as well, to avoid the overhead of sending the discard requests on
 media that don't benefit from this.
 
 Does this all sound appropriate for the Linux memory management people?
 
 Also, does this sound useful to the Android developers? Would you
 start using swap if we make it perform well and not destroy the drives?
 
 Finally, does this plan match up with the capabilities of the
 various eMMC devices? I know more about SD and USB devices and
 I'm quite convinced that it would help there, but eMMC can be
 more like an SSD in some ways, and the current code should be fine
 for real SSDs.
 
 	Arnd
 



* Re: swap on eMMC and other flash
  2012-03-30 18:50 ` Arnd Bergmann
@ 2012-03-30 22:08   ` Zach Pfeffer
  2012-03-31  9:24     ` Arnd Bergmann
  2012-03-31 20:29   ` Hugh Dickins
  2012-04-04 12:21   ` Adrian Hunter
  2 siblings, 1 reply; 41+ messages in thread
From: Zach Pfeffer @ 2012-03-30 22:08 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linaro-kernel, linux-mm, Alex Lemberg, linux-mmc, linux-kernel,
	Hyojin Jeong, Luca Porzio (lporzio),
	kernel-team, Yejin Moon

On 30 March 2012 13:50, Arnd Bergmann <arnd@arndb.de> wrote:
> (sorry for the duplicated email, this corrects the address of the android
> kernel team, please reply here)
>
> On Friday 30 March 2012, Arnd Bergmann wrote:
>
>  We've had a discussion in the Linaro storage team (Saugata, Venkat and me,
>  with Luca joining in on the discussion) about swapping to flash based media
>  such as eMMC. This is a summary of what we found and what we think should
>  be done. If people agree that this is a good idea, we can start working
>  on it.
>
>  The basic problem is that Linux without swap is sort of crippled and some
>  things either don't work at all (hibernate) or not as efficient as they
>  should (e.g. tmpfs). At the same time, the swap code seems to be rather
>  inappropriate for the algorithms used in most flash media today, causing
>  system performance to suffer drastically, and wearing out the flash hardware
>  much faster than necessary. In order to change that, we would be
>  implementing the following changes:
>
>  1) Try to swap out multiple pages at once, in a single write request. My
>  reading of the current code is that we always send pages one by one to
>  the swap device, while most flash devices have an optimum write size of
>  32 or 64 kb and some require an alignment of more than a page. Ideally
>  we would try to write an aligned 64 kb block all the time. Writing aligned
>  64 kb chunks often gives us ten times the throughput of linear 4kb writes,
>  and going beyond 64 kb usually does not give any better performance.

Last I read, Transparent Huge Pages are still paged in and out one page
at a time; is this, or was this ever, the case? If it is, should the
paging system be extended to support THP, which would take care of
the big-block issues with flash media?

>  2) Make variable sized swap clusters. Right now, the swap space is
>  organized in clusters of 256 pages (1MB), which is less than the typical
>  erase block size of 4 or 8 MB. We should try to make the swap cluster
>  aligned to erase blocks and have the size match to avoid garbage collection
>  in the drive. The cluster size would typically be set by mkswap as a new
>  option and interpreted at swapon time.
>
>  3) As Luca points out, some eMMC media would benefit significantly from
>  having discard requests issued for every page that gets freed from
>  the swap cache, rather than at the time just before we reuse a swap
>  cluster. This would probably have to become a configurable option
>  as well, to avoid the overhead of sending the discard requests on
>  media that don't benefit from this.
>
>  Does this all sound appropriate for the Linux memory management people?
>
>  Also, does this sound useful to the Android developers? Would you
>  start using swap if we make it perform well and not destroy the drives?
>
>  Finally, does this plan match up with the capabilities of the
>  various eMMC devices? I know more about SD and USB devices and
>  I'm quite convinced that it would help there, but eMMC can be
>  more like an SSD in some ways, and the current code should be fine
>  for real SSDs.
>
>        Arnd
>
>
>
> _______________________________________________
> linaro-kernel mailing list
> linaro-kernel@lists.linaro.org
> http://lists.linaro.org/mailman/listinfo/linaro-kernel



-- 
Zach Pfeffer
Android Platform Team Lead, Linaro Platform Teams
Linaro.org | Open source software for ARM SoCs
Follow Linaro: http://www.facebook.com/pages/Linaro
http://twitter.com/#!/linaroorg - http://www.linaro.org/linaro-blog


* Re: swap on eMMC and other flash
  2012-03-30 22:08   ` Zach Pfeffer
@ 2012-03-31  9:24     ` Arnd Bergmann
  2012-04-03 18:17       ` Zach Pfeffer
  0 siblings, 1 reply; 41+ messages in thread
From: Arnd Bergmann @ 2012-03-31  9:24 UTC (permalink / raw)
  To: Zach Pfeffer
  Cc: linaro-kernel, linux-mm, Alex Lemberg, linux-mmc, linux-kernel,
	Hyojin Jeong, Luca Porzio (lporzio),
	kernel-team, Yejin Moon

On Friday 30 March 2012, Zach Pfeffer wrote:
> Last I read Transparent Huge Pages are still paged in and out a page
> at a time, is this or was this ever the case? If it is the case should
> the paging system be extended to support THP which would take care of
> the big block issues with flash media?
> 

I don't think we ever want to get /that/ big. As I mentioned, going
beyond 64kb does not improve throughput on most flash media. However,
paging out 16MB causes a very noticeable delay of up to a few seconds
on slow drives, which would be unacceptable to users.

Also, that would only deal with the rare case where the data you
want to page out is actually in huge pages, not the common case.

	Arnd


* Re: swap on eMMC and other flash
  2012-03-30 18:50 ` Arnd Bergmann
  2012-03-30 22:08   ` Zach Pfeffer
@ 2012-03-31 20:29   ` Hugh Dickins
  2012-04-02 11:45     ` Arnd Bergmann
  2012-04-02 12:52     ` Luca Porzio (lporzio)
  2012-04-04 12:21   ` Adrian Hunter
  2 siblings, 2 replies; 41+ messages in thread
From: Hugh Dickins @ 2012-03-31 20:29 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Rik van Riel, linaro-kernel, linux-mm, Luca Porzio (lporzio),
	Alex Lemberg, linux-kernel, Saugata Das, Venkatraman S,
	Yejin Moon, Hyojin Jeong, linux-mmc, kernel-team

On Fri, 30 Mar 2012, Arnd Bergmann wrote:
> On Friday 30 March 2012, Arnd Bergmann wrote:
> 
>  We've had a discussion in the Linaro storage team (Saugata, Venkat and me,
>  with Luca joining in on the discussion) about swapping to flash based media
>  such as eMMC. This is a summary of what we found and what we think should
>  be done. If people agree that this is a good idea, we can start working
>  on it.
>  
>  The basic problem is that Linux without swap is sort of crippled and some
>  things either don't work at all (hibernate) or not as efficient as they
>  should (e.g. tmpfs). At the same time, the swap code seems to be rather
>  inappropriate for the algorithms used in most flash media today, causing
>  system performance to suffer drastically, and wearing out the flash hardware
>  much faster than necessary. In order to change that, we would be
>  implementing the following changes:
>  
>  1) Try to swap out multiple pages at once, in a single write request. My
>  reading of the current code is that we always send pages one by one to
>  the swap device, while most flash devices have an optimum write size of
>  32 or 64 kb and some require an alignment of more than a page. Ideally
>  we would try to write an aligned 64 kb block all the time. Writing aligned
>  64 kb chunks often gives us ten times the throughput of linear 4kb writes,
>  and going beyond 64 kb usually does not give any better performance.

My suspicion is that we suffer a lot from the "distance" between when
we allocate swap space (add_to_swap getting the swp_entry_t to replace
ptes by) and when we finally decide to write out a page (swap_writepage):
intervening decisions can jumble the sequence badly.

I've not investigated to confirm that, but certainly it was the case two
or three years ago, that we got much better behaviour in swapping shmem
to flash, when we stopped giving it a second pass round the lru, which
used to come in between the allocation and the writeout.

I believe that you'll want to start by implementing something like what
Rik set out a year ago in the mail appended below.  Adding another layer
of indirection isn't always a pure win, and I think none of us have taken
it any further since then; but sooner or later we shall need to, and your
flash case might be just the prod needed.

With that change made (so swap ptes are just pointers into an intervening
structure, where we record disk blocks allocated at the time of writeout),
some improvement should come just from traditional merging by the I/O
scheduler (deadline seems both better for flash and better for swap: one
day it would be nice to work out how cfq can be tweaked better for swap).

Some improvement, but probably not enough, and you'd want to do something
more proactive, like the mblk_io_submit stuff ext4 does these days.

Though they might prove to give the greatest benefit on flash,
these kind of changes should be good for conventional disk too.

>  
>  2) Make variable sized swap clusters. Right now, the swap space is
>  organized in clusters of 256 pages (1MB), which is less than the typical
>  erase block size of 4 or 8 MB. We should try to make the swap cluster
>  aligned to erase blocks and have the size match to avoid garbage collection
>  in the drive. The cluster size would typically be set by mkswap as a new
>  option and interpreted at swapon time.

That gets to sound more flash-specific, and I feel less enthusiastic
about doing things in bigger and bigger lumps.  But if it really proves
to be of benefit, it's easy enough to let you.

Decide the cluster size at mkswap time, or at swapon time, or by
/sys/block/sda/queue parameters?  Perhaps a /sys parameter should give
the size, but a swapon flag decide whether to participate or not.  Perhaps.

>  
>  3) As Luca points out, some eMMC media would benefit significantly from
>  having discard requests issued for every page that gets freed from
>  the swap cache, rather than at the time just before we reuse a swap
>  cluster. This would probably have to become a configurable option
>  as well, to avoid the overhead of sending the discard requests on
>  media that don't benefit from this.

I'm surprised, I wouldn't have contemplated a discard per page;
but if you have cases where it can be proved of benefit, fine.
I know nothing at all of eMMC.

Though as things stand, that swap_lock spinlock makes it difficult
to find a good safe moment to issue a discard (you want the spinlock
to keep it safe, but you don't want to issue "I/O" while holding a
spinlock).  Perhaps that difficulty can be overcome in a satisfactory
way, in the course of restructuring swap allocation as Rik set out
(Rik suggests freeing on swapin, that should make it very easy).

Hugh

>  
>  Does this all sound appropriate for the Linux memory management people?
>  
>  Also, does this sound useful to the Android developers? Would you
>  start using swap if we make it perform well and not destroy the drives?
>  
>  Finally, does this plan match up with the capabilities of the
>  various eMMC devices? I know more about SD and USB devices and
>  I'm quite convinced that it would help there, but eMMC can be
>  more like an SSD in some ways, and the current code should be fine
>  for real SSDs.
>  
>  	Arnd

From riel@redhat.com Sun Apr 10 17:50:10 2011
Date: Sun, 10 Apr 2011 20:50:01 -0400
From: Rik van Riel <riel@redhat.com>
To: Linux Memory Management List <linux-mm@kvack.org>
Subject: [LSF/Collab] swap cache redesign idea

On Thursday after LSF, Hugh, Minchan, Mel, Johannes and I were
sitting in the hallway talking about yet more VM things.

During that discussion, we came up with a way to redesign the
swap cache.  During my flight home, I came with ideas on how
to use that redesign, that may make the changes worthwhile.

Currently, the page table entries that have swapped out pages
associated with them contain a swap entry, pointing directly
at the swap device and swap slot containing the data. Meanwhile,
the swap count lives in a separate array.

The redesign we are considering moves the swap entry to the
page cache radix tree for the swapper_space and having the pte
contain only the offset into the swapper_space.  The swap count
info can also fit inside the swapper_space page cache radix
tree (at least on 64 bits - on 32 bits we may need to get
creative or accept a smaller max amount of swap space).

This extra layer of indirection allows us to do several things:

1) get rid of the virtual address scanning in swapoff; instead
    we just swap the data in and mark the pages as present in
    the swapper_space radix tree

2) free swap entries as they are read in, without waiting for
    the process to fault them in - this may be useful for memory
    types that have a large erase block

3) together with the defragmentation from (2), we can always
    do writes in large aligned blocks - the extra indirection
    will make it relatively easy to have special backend code
    for different kinds of swap space, since all the state can
    now live in just one place

4) skip writeout of zero-filled pages - this can be a big help
    for KVM virtual machines running Windows, since Windows zeroes
    out free pages;   simply discarding a zero-filled page is not
    at all simple in the current VM, where we would have to iterate
    over all the ptes to free the swap entry before being able to
    free the swap cache page (I am not sure how that locking would
    even work)

    with the extra layer of indirection, the locking for this scheme
    can be trivial - either the faulting process gets the old page,
    or it gets a new one, either way it'll be zero filled

5) skip writeout of pages the guest has marked as free - same as
    above, with the same easier locking

Only one real question remaining - how do we handle the swap count
in the new scheme?  On 64 bit systems we have enough space in the
radix tree, on 32 bit systems maybe we'll have to start overflowing
into the "swap_count_continued" logic a little sooner than we are
now and reduce the maximum swap size a little?

-- 
All rights reversed


* Re: swap on eMMC and other flash
  2012-03-31 20:29   ` Hugh Dickins
@ 2012-04-02 11:45     ` Arnd Bergmann
  2012-04-02 14:41       ` Hugh Dickins
  2012-04-02 12:52     ` Luca Porzio (lporzio)
  1 sibling, 1 reply; 41+ messages in thread
From: Arnd Bergmann @ 2012-04-02 11:45 UTC (permalink / raw)
  To: linaro-kernel
  Cc: Hugh Dickins, Rik van Riel, linux-mmc, Alex Lemberg,
	linux-kernel, Luca Porzio (lporzio),
	linux-mm, Hyojin Jeong, kernel-team, Yejin Moon

On Saturday 31 March 2012, Hugh Dickins wrote:
> On Fri, 30 Mar 2012, Arnd Bergmann wrote:
> > On Friday 30 March 2012, Arnd Bergmann wrote:

> My suspicion is that we suffer a lot from the "distance" between when
> we allocate swap space (add_to_swap getting the swp_entry_t to replace
> ptes by) and when we finally decide to write out a page (swap_writepage):
> intervening decisions can jumble the sequence badly.
> 
> I've not investigated to confirm that, but certainly it was the case two
> or three years ago, that we got much better behaviour in swapping shmem
> to flash, when we stopped giving it a second pass round the lru, which
> used to come in between the allocation and the writeout.
> 
> I believe that you'll want to start by implementing something like what
> Rik set out a year ago in the mail appended below.  Adding another layer
> of indirection isn't always a pure win, and I think none of us have taken
> it any further since then; but sooner or later we shall need to, and your
> flash case might be just the prod needed.

Thanks a lot for that pointer; that certainly sounds interesting. I guess
we should first do some investigation into what order the pages normally
get written out to flash. If they are not strictly in sequential order, the
other improvements I suggested would be less effective as well.

Note that I'm not at all worried about reading pages back in from flash
out of order, that tends to be harmless because reads are much rarer than
writes on swap, and because only random writes require garbage collection
inside of the flash (forcing up to 500ms delays on a single write
occasionally), while reads are always uniformly fast.

> >  2) Make variable sized swap clusters. Right now, the swap space is
> >  organized in clusters of 256 pages (1MB), which is less than the typical
> >  erase block size of 4 or 8 MB. We should try to make the swap cluster
> >  aligned to erase blocks and have the size match to avoid garbage collection
> >  in the drive. The cluster size would typically be set by mkswap as a new
> >  option and interpreted at swapon time.
> 
> That gets to sound more flash-specific, and I feel less enthusiastic
> about doing things in bigger and bigger lumps.  But if it really proves
> to be of benefit, it's easy enough to let you.
> 
> Decide the cluster size at mkswap time, or at swapon time, or by
> /sys/block/sda/queue parameters?  Perhaps a /sys parameter should give
> the size, but a swapon flag decide whether to participate or not.  Perhaps.

I was thinking of mkswap time, because the erase block size is specific to
the storage hardware and there is no reason to ever change it at run time,
and we cannot always easily probe the value from looking at hardware
registers (USB doesn't have the data, in SD cards it's usually wrong,
and in eMMC it's sometimes wrong). I should also mention that it's not
always power-of-two, some drives that use TLC flash have three times
the erase block size of the equivalent SLC flash, e.g. 3 MB or 6 MB.

I don't think that's a problem, but I might be missing something here.
I have also encountered a few older drives that use some completely
random erase block size, but they are very rare.

Also, I'm unsure what the largest cluster size would be that we can
realistically support. 8 MB sounds fairly large already, especially
on systems that have less than 1 GB of RAM, as most of the ARM machines
today do. For shingled (SMR) hard drives, we would get very similar
behavior to that of flash media, but the chunks would be even larger,
on the order of 64 MB. If we can make those work, it would no longer
be specific to flash, but also a lot harder to do.

> >  3) As Luca points out, some eMMC media would benefit significantly from
> >  having discard requests issued for every page that gets freed from
> >  the swap cache, rather than at the time just before we reuse a swap
> >  cluster. This would probably have to become a configurable option
> >  as well, to avoid the overhead of sending the discard requests on
> >  media that don't benefit from this.
> 
> I'm surprised, I wouldn't have contemplated a discard per page;
> but if you have cases where it can be proved of benefit, fine.
> I know nothing at all of eMMC.

My understanding is that some devices can arbitrarily map between
physical flash pages (typically 4, 8, or 16kb) and logical sector
numbers, instead of remapping on the much larger erase block
granularity. In those cases, it makes sense to free up as many
pages as possible on the drive, in order to give the hardware more
room for reorganizing itself and doing background defragmentation
of its free space.

> Though as things stand, that swap_lock spinlock makes it difficult
> to find a good safe moment to issue a discard (you want the spinlock
> to keep it safe, but you don't want to issue "I/O" while holding a
> spinlock).  Perhaps that difficulty can be overcome in a satisfactory
> way, in the course of restructuring swap allocation as Rik set out
> (Rik suggests freeing on swapin, that should make it very easy).

Luca was suggesting using the disk->fops->swap_slot_free_notify
callback from swap_entry_free(), which is currently only used
by zram, but you're right, that would not work.

Another option would be batched discard as we do it for file systems:
occasionally stop writing to swap space and scanning for areas that
have become available since the last discard, then send discard
commands for those.

	Arnd


* RE: swap on eMMC and other flash
  2012-03-31 20:29   ` Hugh Dickins
  2012-04-02 11:45     ` Arnd Bergmann
@ 2012-04-02 12:52     ` Luca Porzio (lporzio)
  2012-04-02 14:58       ` Hugh Dickins
  1 sibling, 1 reply; 41+ messages in thread
From: Luca Porzio (lporzio) @ 2012-04-02 12:52 UTC (permalink / raw)
  To: Hugh Dickins, Arnd Bergmann
  Cc: Rik van Riel, linaro-kernel, linux-mm, Alex Lemberg,
	linux-kernel, Saugata Das, Venkatraman S, Yejin Moon,
	Hyojin Jeong, linux-mmc, kernel-team

Hugh,

Great topics. As per one of Rik original points:

> 4) skip writeout of zero-filled pages - this can be a big help
>     for KVM virtual machines running Windows, since Windows zeroes
>     out free pages;   simply discarding a zero-filled page is not
>     at all simple in the current VM, where we would have to iterate
>     over all the ptes to free the swap entry before being able to
>     free the swap cache page (I am not sure how that locking would
>     even work)
> 
>     with the extra layer of indirection, the locking for this scheme
>     can be trivial - either the faulting process gets the old page,
>     or it gets a new one, either way it'll be zero filled
> 

Since this is KVM's realm, can't KSM simply solve the zero-filled-pages problem, avoiding unnecessary burden on the swap subsystem?

Cheers, 
   Luca


* Re: swap on eMMC and other flash
  2012-04-02 11:45     ` Arnd Bergmann
@ 2012-04-02 14:41       ` Hugh Dickins
  2012-04-02 14:55         ` Arnd Bergmann
  0 siblings, 1 reply; 41+ messages in thread
From: Hugh Dickins @ 2012-04-02 14:41 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linaro-kernel, Rik van Riel, linux-mmc, Alex Lemberg,
	linux-kernel, Luca Porzio (lporzio),
	linux-mm, Hyojin Jeong, kernel-team, Yejin Moon

On Mon, 2 Apr 2012, Arnd Bergmann wrote:
> 
> Another option would be batched discard as we do it for file systems:
> occasionally stop writing to swap space and scanning for areas that
> have become available since the last discard, then send discard
> commands for those.

I'm not sure whether you've missed "swapon --discard", which switches
on discard_swap_cluster() just before we allocate from a new cluster;
or whether you're musing that it's no use to you because you want to
repurpose the swap cluster to match erase block: I'm mentioning it in
case you missed that it's already there (but few use it, since even
done at that scale it's often more trouble than it's worth).

Hugh


* Re: swap on eMMC and other flash
  2012-04-02 14:41       ` Hugh Dickins
@ 2012-04-02 14:55         ` Arnd Bergmann
  2012-04-05  0:17           ` 정효진
  2012-04-08 13:50           ` Alex Lemberg
  0 siblings, 2 replies; 41+ messages in thread
From: Arnd Bergmann @ 2012-04-02 14:55 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: linaro-kernel, Rik van Riel, linux-mmc, Alex Lemberg,
	linux-kernel, Luca Porzio (lporzio),
	linux-mm, Hyojin Jeong, kernel-team, Yejin Moon

On Monday 02 April 2012, Hugh Dickins wrote:
> On Mon, 2 Apr 2012, Arnd Bergmann wrote:
> > 
> > Another option would be batched discard as we do it for file systems:
> > occasionally stop writing to swap space and scanning for areas that
> > have become available since the last discard, then send discard
> > commands for those.
> 
> I'm not sure whether you've missed "swapon --discard", which switches
> on discard_swap_cluster() just before we allocate from a new cluster;
> or whether you're musing that it's no use to you because you want to
> repurpose the swap cluster to match erase block: I'm mentioning it in
> case you missed that it's already there (but few use it, since even
> done at that scale it's often more trouble than it's worth).

I actually argued that discard_swap_cluster is exactly the right thing
to do, especially when clusters match erase blocks on the less capable
devices like SD cards.

Luca was arguing that on some hardware there is no point in ever
submitting a discard just before we start reusing space, because
at that point the hardware already discards the old data by
overwriting the logical addresses with new blocks, while
issuing a discard on all blocks as soon as they become available
would make a bigger difference. I would be interested in hearing
from Hyojin Jeong and Alex Lemberg what they think is the best
time to issue a discard, because they would know about other hardware
than Luca.

	Arnd


* RE: swap on eMMC and other flash
  2012-04-02 12:52     ` Luca Porzio (lporzio)
@ 2012-04-02 14:58       ` Hugh Dickins
  2012-04-02 16:51         ` Rik van Riel
  0 siblings, 1 reply; 41+ messages in thread
From: Hugh Dickins @ 2012-04-02 14:58 UTC (permalink / raw)
  To: Luca Porzio (lporzio)
  Cc: Arnd Bergmann, Rik van Riel, linaro-kernel, linux-mm,
	Alex Lemberg, linux-kernel, Saugata Das, Venkatraman S,
	Yejin Moon, Hyojin Jeong, linux-mmc, kernel-team

On Mon, 2 Apr 2012, Luca Porzio (lporzio) wrote:
> 
> Great topics. As per one of Rik original points:
> 
> > 4) skip writeout of zero-filled pages - this can be a big help
> >     for KVM virtual machines running Windows, since Windows zeroes
> >     out free pages;   simply discarding a zero-filled page is not
> >     at all simple in the current VM, where we would have to iterate
> >     over all the ptes to free the swap entry before being able to
> >     free the swap cache page (I am not sure how that locking would
> >     even work)
> > 
> >     with the extra layer of indirection, the locking for this scheme
> >     can be trivial - either the faulting process gets the old page,
> >     or it gets a new one, either way it'll be zero filled
> > 
> 
> Since it's KVMs realm here, can't KSM simply solve the zero-filled pages problem avoiding unnecessary burden for the Swap subsystem?

I would expect that KSM already does largely handle this, yes.
But it's also quite possible that I'm missing Rik's point.

Hugh

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: swap on eMMC and other flash
  2012-04-02 14:58       ` Hugh Dickins
@ 2012-04-02 16:51         ` Rik van Riel
  0 siblings, 0 replies; 41+ messages in thread
From: Rik van Riel @ 2012-04-02 16:51 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Luca Porzio (lporzio),
	Arnd Bergmann, linaro-kernel, linux-mm, Alex Lemberg,
	linux-kernel, Saugata Das, Venkatraman S, Yejin Moon,
	Hyojin Jeong, linux-mmc, kernel-team

On 04/02/2012 10:58 AM, Hugh Dickins wrote:
> On Mon, 2 Apr 2012, Luca Porzio (lporzio) wrote:
>>
>> Great topics. As per one of Rik original points:
>>
>>> 4) skip writeout of zero-filled pages - this can be a big help
>>>      for KVM virtual machines running Windows, since Windows zeroes
>>>      out free pages;   simply discarding a zero-filled page is not
>>>      at all simple in the current VM, where we would have to iterate
>>>      over all the ptes to free the swap entry before being able to
>>>      free the swap cache page (I am not sure how that locking would
>>>      even work)
>>>
>>>      with the extra layer of indirection, the locking for this scheme
>>>      can be trivial - either the faulting process gets the old page,
>>>      or it gets a new one, either way it'll be zero filled
>>>
>>
>> Since it's KVMs realm here, can't KSM simply solve the zero-filled pages problem avoiding unnecessary burden for the Swap subsystem?
>
> I would expect that KSM already does largely handle this, yes.
> But it's also quite possible that I'm missing Rik's point.

Indeed, KSM handles it already.

However, it may be worthwhile for non-KVM users of transparent
huge pages to discard zero-filled parts of pages (memory allocated
by the kernel to the process, but never used).

Not just because it takes up swap space (writing to swap is
easy, space is cheap), but because never swapping that memory
back in later (because it is not used) will prevent us from
re-building the transparent huge page...


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: swap on eMMC and other flash
  2012-03-31  9:24     ` Arnd Bergmann
@ 2012-04-03 18:17       ` Zach Pfeffer
  0 siblings, 0 replies; 41+ messages in thread
From: Zach Pfeffer @ 2012-04-03 18:17 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linaro-kernel, linux-mm, Alex Lemberg, linux-mmc, linux-kernel,
	Hyojin Jeong, Luca Porzio (lporzio),
	kernel-team, Yejin Moon

On 31 March 2012 04:24, Arnd Bergmann <arnd@arndb.de> wrote:
> On Friday 30 March 2012, Zach Pfeffer wrote:
>> Last I read Transparent Huge Pages are still paged in and out a page
>> at a time, is this or was this ever the case? If it is the case should
>> the paging system be extended to support THP which would take care of
>> the big block issues with flash media?
>>
>
> I don't think we ever want to get /that/ big. As I mentioned, going
> beyond 64kb does not improve throughput on most flash media. However,
> paging out 16MB causes a very noticeable delay of up to a few seconds
> on slow drives, which would be unacceptable to users.
>
> Also, that would only deal with the rare case where the data you
> want to page out is actually in huge pages, not the common case.

What I had in mind was being able to swap out big contiguous buffers
used by media and graphics engines in one go. This would allow devices
to support multiple engines without needing to reserve contiguous
memory for each device. They would instead share the contiguous
memory. Only one multimedia engine could run at a time, but that would
be an okay limitation given certain application domains (low end smart
phones).

>
>        Arnd



-- 
Zach Pfeffer
Android Platform Team Lead, Linaro Platform Teams
Linaro.org | Open source software for ARM SoCs
Follow Linaro: http://www.facebook.com/pages/Linaro
http://twitter.com/#!/linaroorg - http://www.linaro.org/linaro-blog

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: swap on eMMC and other flash
  2012-03-30 18:50 ` Arnd Bergmann
  2012-03-30 22:08   ` Zach Pfeffer
  2012-03-31 20:29   ` Hugh Dickins
@ 2012-04-04 12:21   ` Adrian Hunter
  2012-04-04 12:47     ` Arnd Bergmann
  2 siblings, 1 reply; 41+ messages in thread
From: Adrian Hunter @ 2012-04-04 12:21 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linaro-kernel, linux-mm, Luca Porzio (lporzio),
	Alex Lemberg, linux-kernel, Saugata Das, Venkatraman S,
	Yejin Moon, Hyojin Jeong, linux-mmc, kernel-team

On 30/03/12 21:50, Arnd Bergmann wrote:
> (sorry for the duplicated email, this corrects the address of the android
> kernel team, please reply here)
> 
> On Friday 30 March 2012, Arnd Bergmann wrote:
> 
>  We've had a discussion in the Linaro storage team (Saugata, Venkat and me,
>  with Luca joining in on the discussion) about swapping to flash based media
>  such as eMMC. This is a summary of what we found and what we think should
>  be done. If people agree that this is a good idea, we can start working
>  on it.

There is mtdswap.

Also the old Nokia N900 had swap to eMMC.

The last I heard was that swap was considered to be simply too slow on hand
held devices.

As systems adopt more RAM, isn't there a decreasing demand for swap?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: swap on eMMC and other flash
  2012-04-04 12:21   ` Adrian Hunter
@ 2012-04-04 12:47     ` Arnd Bergmann
  2012-04-11 10:28       ` Adrian Hunter
  0 siblings, 1 reply; 41+ messages in thread
From: Arnd Bergmann @ 2012-04-04 12:47 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: linaro-kernel, linux-mm, Luca Porzio (lporzio),
	Alex Lemberg, linux-kernel, Saugata Das, Venkatraman S,
	Yejin Moon, Hyojin Jeong, linux-mmc, kernel-team

On Wednesday 04 April 2012, Adrian Hunter wrote:
> On 30/03/12 21:50, Arnd Bergmann wrote:
> > (sorry for the duplicated email, this corrects the address of the android
> > kernel team, please reply here)
> > 
> > On Friday 30 March 2012, Arnd Bergmann wrote:
> > 
> >  We've had a discussion in the Linaro storage team (Saugata, Venkat and me,
> >  with Luca joining in on the discussion) about swapping to flash based media
> >  such as eMMC. This is a summary of what we found and what we think should
> >  be done. If people agree that this is a good idea, we can start working
> >  on it.
> 
> There is mtdswap.

Ah, very interesting. I wasn't aware of that. Obviously we can't directly
use it on block devices that have their own garbage collection and wear
leveling built into them, but it's interesting to see how this was solved
before.

While we could build something similar that remaps blocks between an
eMMC device and the logical swap space that is used by the mm code,
my feeling is that it would be easier to modify the swap code itself
to do the right thing.

> Also the old Nokia N900 had swap to eMMC.
> 
> The last I heard was that swap was considered to be simply too slow on hand
> held devices.

That's the part that we want to solve here. It has nothing to do with
handheld devices, but more with specific incompatibilities of the
block allocation in the swap code vs. what an eMMC device expects
to see for fast operation. If you write data in the wrong order on
flash devices, you get long delays that you don't get when you do
it the right way. The same problem exists for file systems, and is
being addressed there as well.

> As systems adopt more RAM, isn't there a decreasing demand for swap?

No. You would never be able to make hibernate work, no matter how much
RAM you add ;-)

More seriously, the need for swap is not to work around the fact that
we have too little memory, it's one of the fundamental assumptions of
the mm subsystem that swap exists, and it's generally a good idea to
have, so you treat file backed memory in the same way as anonymous
memory.

	Arnd

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: swap on eMMC and other flash
  2012-04-02 14:55         ` Arnd Bergmann
@ 2012-04-05  0:17           ` 정효진
  2012-04-09 12:50             ` Arnd Bergmann
  2012-04-08 13:50           ` Alex Lemberg
  1 sibling, 1 reply; 41+ messages in thread
From: 정효진 @ 2012-04-05  0:17 UTC (permalink / raw)
  To: '정효진', 'Arnd Bergmann',
	'Hugh Dickins',
	cpgs
  Cc: linaro-kernel, 'Rik van Riel',
	linux-mmc, 'Alex Lemberg',
	linux-kernel, 'Luca Porzio (lporzio)',
	linux-mm, kernel-team, 'Yejin Moon'


Dear Arnd

Hello, 

I don't fully understand the history of this e-mail thread, because
I joined in the middle of it. Anyhow, I would like to comment on
discard in the swap area.
From the eMMC device's point of view, there is no information about
which files are in use in the system software (the Linux filesystem).
So, in the eMMC, there is no way to know the addresses of data that
have already been erased.
If a discard command sends this information (the addresses of erased
files) to the eMMC, the old data can be erased at the physical NAND
level and the free space reclaimed with a minimum of internal merging.

I'm not sure exactly how Linux manages the swap area.
If the host and the eMMC device have a different view of which data
is invalid, sending a discard to the eMMC is good for I/O
performance. It is the same as the general case of discard on a user
partition that is formatted with a filesystem.
As your e-mail mentioned, overwriting the logical addresses is
another way to communicate the addresses of invalid data, but only
for the overwritten area, and it is not the best way for the eMMC to
manage the physical NAND array. In that case, the eMMC has to trim
the physical NAND array and do the write operation at the same time,
which needs more latency.
If the host sends a discard with the addresses of the invalid data
in advance, the eMMC can find the best way to prepare the physical
NAND pages before the host uses them (for write operations).
I'm not sure whether these are the right comments for your concern.
If you need more info, please let me know.

Best Regards
Hyojin


-----Original Message-----
From: Arnd Bergmann [mailto:arnd@arndb.de]
Sent: Monday, April 02, 2012 11:55 PM
To: Hugh Dickins
Cc: linaro-kernel@lists.linaro.org; Rik van Riel; linux-
mmc@vger.kernel.org; Alex Lemberg; linux-kernel@vger.kernel.org; Luca
Porzio (lporzio); linux-mm@kvack.org; Hyojin Jeong; kernel-
team@android.com; Yejin Moon
Subject: Re: swap on eMMC and other flash

On Monday 02 April 2012, Hugh Dickins wrote:
> On Mon, 2 Apr 2012, Arnd Bergmann wrote:
> > 
> > Another option would be batched discard as we do it for file systems:
> > occasionally stop writing to swap space and scanning for areas that 
> > have become available since the last discard, then send discard 
> > commands for those.
> 
> I'm not sure whether you've missed "swapon --discard", which switches 
> on discard_swap_cluster() just before we allocate from a new cluster; 
> or whether you're musing that it's no use to you because you want to 
> repurpose the swap cluster to match erase block: I'm mentioning it in 
> case you missed that it's already there (but few use it, since even 
> done at that scale it's often more trouble than it's worth).

I actually argued that discard_swap_cluster is exactly the right thing to
do, especially when clusters match erase blocks on the less capable devices
like SD cards.

Luca was arguing that on some hardware there is no point in ever submitting
a discard just before we start reusing space, because at that point it the
hardware already discards the old data by overwriting the logical addresses
with new blocks, while issuing a discard on all blocks as soon as they
become available would make a bigger difference. I would be interested in
hearing from Hyojin Jeong and Alex Lemberg what they think is the best time
to issue a discard, because they would know about other hardware than Luca.

	Arnd


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: swap on eMMC and other flash
       [not found] ` <CAEwNFnA2GeOayw2sJ_KXv4qOdC50_Nt2KoK796YmQF+YV1GiEA@mail.gmail.com>
@ 2012-04-06 16:16   ` Arnd Bergmann
  2012-04-09  2:06     ` Minchan Kim
  0 siblings, 1 reply; 41+ messages in thread
From: Arnd Bergmann @ 2012-04-06 16:16 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linaro-kernel, android-kernel, linux-mm, Luca Porzio (lporzio),
	Alex Lemberg, linux-kernel, Saugata Das, Venkatraman S,
	Yejin Moon, Hyojin Jeong, linux-mmc

On Friday 06 April 2012, Minchan Kim wrote:
> On Sat, Mar 31, 2012 at 2:44 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> 
> > We've had a discussion in the Linaro storage team (Saugata, Venkat and me,
> > with Luca joining in on the discussion) about swapping to flash based media
> > such as eMMC. This is a summary of what we found and what we think should
> > be done. If people agree that this is a good idea, we can start working
> > on it.
> >
> > The basic problem is that Linux without swap is sort of crippled and some
> > things either don't work at all (hibernate) or not as efficient as they
> > should (e.g. tmpfs). At the same time, the swap code seems to be rather
> > inappropriate for the algorithms used in most flash media today, causing
> > system performance to suffer drastically, and wearing out the flash
> > hardware
> > much faster than necessary. In order to change that, we would be
> > implementing the following changes:
> >
> > 1) Try to swap out multiple pages at once, in a single write request. My
> > reading of the current code is that we always send pages one by one to
> > the swap device, while most flash devices have an optimum write size of
> > 32 or 64 kb and some require an alignment of more than a page. Ideally
> > we would try to write an aligned 64 kb block all the time. Writing aligned
> > 64 kb chunks often gives us ten times the throughput of linear 4kb writes,
> > and going beyond 64 kb usually does not give any better performance.
> >
> 
> It does make sense.
> I think we can batch will-be-swapped-out pages in shrink_page_list if they
> are located by contiguous swap slots.

But would that guarantee that all writes are the same size? While writing
larger chunks would generally be helpful, in order to guarantee that
the drive doesn't do any garbage collection, we would have to do all writes
in aligned chunks. It would probably be enough to do this in 8kb or
16kb units for most devices over the next few years, but implementing it
for 64kb should be the same amount of work and will get us a little bit
further.

I'm not sure what we would do when there are less than 64kb available
for pageout on the inactive list. The two choices I can think of are
either not writing anything, or wasting the swap slots and filling
up the data with zeroes.
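The "waste the swap slots" option can be illustrated with a small
standalone C sketch; build_chunk, CHUNK_PAGES and the buffer layout
are hypothetical names for illustration, not existing kernel code:

```c
#include <string.h>

#define PAGE_SIZE   4096
#define CHUNK_PAGES 16   /* 64 KiB / 4 KiB: one aligned write unit */

/*
 * Copy up to CHUNK_PAGES pages into one contiguous, chunk-sized write
 * buffer, zero-filling the tail when fewer pages are available, so the
 * device always sees full, aligned 64 KiB writes.  Returns the number
 * of real pages placed in the buffer.
 */
static int build_chunk(const char *pages[], int npages,
                       char buf[CHUNK_PAGES * PAGE_SIZE])
{
	int used = npages < CHUNK_PAGES ? npages : CHUNK_PAGES;
	int i;

	for (i = 0; i < used; i++)
		memcpy(buf + (size_t)i * PAGE_SIZE, pages[i], PAGE_SIZE);

	/* pad the remainder: wasted slots, but no partial write */
	memset(buf + (size_t)used * PAGE_SIZE, 0,
	       (size_t)(CHUNK_PAGES - used) * PAGE_SIZE);
	return used;
}
```

The trade-off is visible in the padding step: the zero pages consume
swap slots but keep every write a full aligned chunk.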

> > 2) Make variable sized swap clusters. Right now, the swap space is
> > organized in clusters of 256 pages (1MB), which is less than the typical
> > erase block size of 4 or 8 MB. We should try to make the swap cluster
> > aligned to erase blocks and have the size match to avoid garbage collection
> > in the drive. The cluster size would typically be set by mkswap as a new
> > option and interpreted at swapon time.
> >
> 
> If we can find such big contiguous swap slots easily, it would be good.
> But I am not sure how often we can get such big slots. And maybe we have to
> improve search method for getting such big empty cluster.

As long as there are clusters available, we should try to find them. When
free space is too fragmented to find any unused cluster, we can pick one
that has very little data in it, so that we reduce the time it takes to
GC that erase block in the drive. While we could theoretically do active
garbage collection of swap data in the kernel, it won't get more efficient
than the GC inside of the drive. If we do this, it unfortunately means that
we can't just send a discard for the entire erase block.
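The cluster-selection policy described above (prefer a fully empty
cluster, otherwise the least-occupied one so the drive's GC has the
least live data to move) might look like this sketch; pick_cluster is
a made-up name and the per-cluster usage counts are assumed to be
tracked elsewhere:

```c
/*
 * used_pages[i] counts in-use swap pages in cluster i.  Return a fully
 * empty cluster if one exists; otherwise return the cluster with the
 * fewest in-use pages, minimizing the data the drive must relocate
 * when it garbage-collects that erase block.
 */
static int pick_cluster(const int used_pages[], int nclusters)
{
	int best = -1;
	int i;

	for (i = 0; i < nclusters; i++) {
		if (used_pages[i] == 0)
			return i;	/* empty cluster: no GC cost */
		if (best < 0 || used_pages[i] < used_pages[best])
			best = i;
	}
	return best;
}
```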

	Arnd

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: swap on eMMC and other flash
  2012-04-02 14:55         ` Arnd Bergmann
  2012-04-05  0:17           ` 정효진
@ 2012-04-08 13:50           ` Alex Lemberg
  2012-04-09  2:14             ` Minchan Kim
  1 sibling, 1 reply; 41+ messages in thread
From: Alex Lemberg @ 2012-04-08 13:50 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linaro-kernel, Rik van Riel, linux-mmc, linux-kernel,
	Luca Porzio (lporzio),
	linux-mm, Hyojin Jeong, kernel-team, Yejin Moon, Hugh Dickins,
	Yaniv Iarovici

Hi Arnd,

Regarding time to issue discard/TRIM commands:
It would be advised to issue the discard command immediately after deleting/freeing a SWAP cluster (i.e. as soon as it becomes available).

Regarding SWAP page size:
Working with SWAP pages as large as possible would be recommended (preferably 64KB). Also, writing in as sequential a manner as possible while swapping large quantities of data is advisable.

SWAP pages and corresponding transactions should be aligned to the SWAP page size (i.e. 64KB above); the alignment should correspond to the physical storage "LBA 0", i.e. to the first LBA of the storage device (and not to a logical/physical partition).
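The point about aligning to the whole device rather than to a
partition can be made concrete with a small sketch (a hypothetical
helper, assuming the partition's start offset on the device is
known):

```c
/*
 * Round a partition-relative byte offset up to the next boundary that
 * is 'align'-aligned relative to the start of the whole device
 * (LBA 0), not relative to the partition.  A partition that itself
 * starts at an unaligned LBA shifts every "aligned" in-partition
 * offset.
 */
static unsigned long long align_to_device(unsigned long long part_start,
					  unsigned long long offset,
					  unsigned long long align)
{
	unsigned long long abs_off = part_start + offset;
	unsigned long long rounded = (abs_off + align - 1) / align * align;

	return rounded - part_start;	/* back to partition-relative */
}
```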

Thanks,
Alex

> -----Original Message-----
> From: Arnd Bergmann [mailto:arnd@arndb.de]
> Sent: Monday, April 02, 2012 5:55 PM
> To: Hugh Dickins
> Cc: linaro-kernel@lists.linaro.org; Rik van Riel; linux-
> mmc@vger.kernel.org; Alex Lemberg; linux-kernel@vger.kernel.org; Luca
> Porzio (lporzio); linux-mm@kvack.org; Hyojin Jeong; kernel-
> team@android.com; Yejin Moon
> Subject: Re: swap on eMMC and other flash
>
> On Monday 02 April 2012, Hugh Dickins wrote:
> > On Mon, 2 Apr 2012, Arnd Bergmann wrote:
> > >
> > > Another option would be batched discard as we do it for file
> systems:
> > > occasionally stop writing to swap space and scanning for areas that
> > > have become available since the last discard, then send discard
> > > commands for those.
> >
> > I'm not sure whether you've missed "swapon --discard", which switches
> > on discard_swap_cluster() just before we allocate from a new cluster;
> > or whether you're musing that it's no use to you because you want to
> > repurpose the swap cluster to match erase block: I'm mentioning it in
> > case you missed that it's already there (but few use it, since even
> > done at that scale it's often more trouble than it's worth).
>
> I actually argued that discard_swap_cluster is exactly the right thing
> to do, especially when clusters match erase blocks on the less capable
> devices like SD cards.
>
> Luca was arguing that on some hardware there is no point in ever
> submitting a discard just before we start reusing space, because
> at that point it the hardware already discards the old data by
> overwriting the logical addresses with new blocks, while
> issuing a discard on all blocks as soon as they become available
> would make a bigger difference. I would be interested in hearing
> from Hyojin Jeong and Alex Lemberg what they think is the best
> time to issue a discard, because they would know about other hardware
> than Luca.
>
>       Arnd

PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: swap on eMMC and other flash
  2012-04-06 16:16   ` Arnd Bergmann
@ 2012-04-09  2:06     ` Minchan Kim
  2012-04-09 12:35       ` Arnd Bergmann
  0 siblings, 1 reply; 41+ messages in thread
From: Minchan Kim @ 2012-04-09  2:06 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linaro-kernel, android-kernel, linux-mm, Luca Porzio (lporzio),
	Alex Lemberg, linux-kernel, Saugata Das, Venkatraman S,
	Yejin Moon, Hyojin Jeong, linux-mmc

On 2012-04-07 01:16, Arnd Bergmann wrote:

> On Friday 06 April 2012, Minchan Kim wrote:
>> On Sat, Mar 31, 2012 at 2:44 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>>
>>> We've had a discussion in the Linaro storage team (Saugata, Venkat and me,
>>> with Luca joining in on the discussion) about swapping to flash based media
>>> such as eMMC. This is a summary of what we found and what we think should
>>> be done. If people agree that this is a good idea, we can start working
>>> on it.
>>>
>>> The basic problem is that Linux without swap is sort of crippled and some
>>> things either don't work at all (hibernate) or not as efficient as they
>>> should (e.g. tmpfs). At the same time, the swap code seems to be rather
>>> inappropriate for the algorithms used in most flash media today, causing
>>> system performance to suffer drastically, and wearing out the flash
>>> hardware
>>> much faster than necessary. In order to change that, we would be
>>> implementing the following changes:
>>>
>>> 1) Try to swap out multiple pages at once, in a single write request. My
>>> reading of the current code is that we always send pages one by one to
>>> the swap device, while most flash devices have an optimum write size of
>>> 32 or 64 kb and some require an alignment of more than a page. Ideally
>>> we would try to write an aligned 64 kb block all the time. Writing aligned
>>> 64 kb chunks often gives us ten times the throughput of linear 4kb writes,
>>> and going beyond 64 kb usually does not give any better performance.
>>>
>>
>> It does make sense.
>> I think we can batch will-be-swapped-out pages in shrink_page_list if they
>> are located by contiguous swap slots.
> 
> But would that guarantee that all writes are the same size? While writing


Of course not.

> larger chunks would generally be helpful, in order to guarantee that we
> the drive doesn't do any garbage collection, we would have to do all writes


And we should guarantee it, to avoid unnecessary swapout and even OOM killing.

> in aligned chunks. It would probably be enough to do this in 8kb or
> 16kb units for most devices over the next few years, but implementing it
> for 64kb should be the same amount of work and will get us a little bit
> further.


I understand from your statement that writing 64K is best.
What are the 8K and 16K for? Could you elaborate on the relation
between 8K, 16K and 64K?

> 
> I'm not sure what we would do when there are less than 64kb available
> for pageout on the inactive list. The two choices I can think of are
> either not writing anything, or wasting the swap slots and filling


Not writing anything would cause unnecessarily many pages to be
swapped out at the next priority of scanning, and we can't guarantee
how long we would wait to queue up 64KB of anonymous pages. It might
take longer than the GC time, so we need some deadline.


> up the data with zeroes.


Zero padding would be a good solution, but I have a concern about
WAP, so we need a smart policy.

To be honest, swap-out is normally an asynchronous operation, so it
should not affect system latency as much as swap-read, which is a
synchronous operation. So under low memory pressure, we can queue
swap-out pages up to 64KB and then batch the write-out into an empty
cluster. If we don't have any empty cluster under low memory
pressure, we should write it out into a partial cluster. Maybe that
doesn't affect system latency severely under low memory pressure.

If system memory pressure is high (and that should not be frequent),
swap-out bandwidth would be more important. So we can reserve some
clusters for it, and I think we can use the page padding you
mentioned in this case to reduce latency, if we can fill the queue
up to 64KB within a threshold time.

Swap-read is also important. We have to investigate the
fragmentation of swap slots, because we disable swap readahead on
non-rotational devices. That can make lots of holes in swap clusters
and makes it hard to find an empty cluster. So it might be better to
enable swap readahead on non-rotational devices, too.
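The deadline idea could reduce to a small flush decision, sketched
here with made-up names (not kernel code):

```c
/*
 * Flush the pending swap-out batch when it has grown to a full chunk,
 * or when the oldest queued page has waited past the deadline, so
 * pages never wait indefinitely for a full 64 KiB to accumulate.
 */
static int should_flush(int queued_pages, int chunk_pages,
			long now_ms, long oldest_ms, long deadline_ms)
{
	if (queued_pages >= chunk_pages)
		return 1;		/* full aligned chunk ready */
	return queued_pages > 0 && now_ms - oldest_ms >= deadline_ms;
}
```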


> 
>>> 2) Make variable sized swap clusters. Right now, the swap space is
>>> organized in clusters of 256 pages (1MB), which is less than the typical
>>> erase block size of 4 or 8 MB. We should try to make the swap cluster
>>> aligned to erase blocks and have the size match to avoid garbage collection
>>> in the drive. The cluster size would typically be set by mkswap as a new
>>> option and interpreted at swapon time.
>>>
>>
>> If we can find such big contiguous swap slots easily, it would be good.
>> But I am not sure how often we can get such big slots. And maybe we have to
>> improve search method for getting such big empty cluster.
> 
> As long as there are clusters available, we should try to find them. When
> free space is too fragmented to find any unused cluster, we can pick one
> that has very little data in it, so that we reduce the time it takes to
> GC that erase block in the drive. While we could theoretically do active
> garbage collection of swap data in the kernel, it won't get more efficient
> than the GC inside of the drive. If we do this, it unfortunately means that
> we can't just send a discard for the entire erase block.


Might need some compaction during idle time, but the WAP concern arises again. :(

> 
> 	Arnd
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: swap on eMMC and other flash
  2012-04-08 13:50           ` Alex Lemberg
@ 2012-04-09  2:14             ` Minchan Kim
  2012-04-09  7:37               ` 정효진
  0 siblings, 1 reply; 41+ messages in thread
From: Minchan Kim @ 2012-04-09  2:14 UTC (permalink / raw)
  To: Alex Lemberg
  Cc: Arnd Bergmann, linaro-kernel, Rik van Riel, linux-mmc,
	linux-kernel, Luca Porzio (lporzio),
	linux-mm, Hyojin Jeong, kernel-team, Yejin Moon, Hugh Dickins,
	Yaniv Iarovici

On 2012-04-08 22:50, Alex Lemberg wrote:

> Hi Arnd,
> 
> Regarding time to issue discard/TRIM commands:
> It would be advised to issue the discard command immediately after deleting/freeing a SWAP cluster (i.e. as soon as it becomes available).


Is that still good at page size, rather than cluster size?

> 
> Regarding SWAP page size:
> Working with as large as SWAP pages as possible would be recommended (preferably 64KB). Also, writing in a sequential manner as much as possible while swapping large quantities of data is also advisable.
> 
> SWAP pages and corresponding transactions should be aligned to the SWAP page size (i.e. 64KB above), the alignment should correspond to the physical storage "LBA 0", i.e. to the first LBA of the storage device (and not to a logical/physical partition).
> 



I am curious whether the above comment is valid for Samsung and other
eMMC devices. Hyojin, could you answer?


> Thanks,
> Alex
> 
>> -----Original Message-----
>> From: Arnd Bergmann [mailto:arnd@arndb.de]
>> Sent: Monday, April 02, 2012 5:55 PM
>> To: Hugh Dickins
>> Cc: linaro-kernel@lists.linaro.org; Rik van Riel; linux-
>> mmc@vger.kernel.org; Alex Lemberg; linux-kernel@vger.kernel.org; Luca
>> Porzio (lporzio); linux-mm@kvack.org; Hyojin Jeong; kernel-
>> team@android.com; Yejin Moon
>> Subject: Re: swap on eMMC and other flash
>>
>> On Monday 02 April 2012, Hugh Dickins wrote:
>>> On Mon, 2 Apr 2012, Arnd Bergmann wrote:
>>>>
>>>> Another option would be batched discard as we do it for file
>> systems:
>>>> occasionally stop writing to swap space and scanning for areas that
>>>> have become available since the last discard, then send discard
>>>> commands for those.
>>>
>>> I'm not sure whether you've missed "swapon --discard", which switches
>>> on discard_swap_cluster() just before we allocate from a new cluster;
>>> or whether you're musing that it's no use to you because you want to
>>> repurpose the swap cluster to match erase block: I'm mentioning it in
>>> case you missed that it's already there (but few use it, since even
>>> done at that scale it's often more trouble than it's worth).
>>
>> I actually argued that discard_swap_cluster is exactly the right thing
>> to do, especially when clusters match erase blocks on the less capable
>> devices like SD cards.
>>
>> Luca was arguing that on some hardware there is no point in ever
>> submitting a discard just before we start reusing space, because
>> at that point it the hardware already discards the old data by
>> overwriting the logical addresses with new blocks, while
>> issuing a discard on all blocks as soon as they become available
>> would make a bigger difference. I would be interested in hearing
>> from Hyojin Jeong and Alex Lemberg what they think is the best
>> time to issue a discard, because they would know about other hardware
>> than Luca.
>>
>>       Arnd
> 
> 



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: swap on eMMC and other flash
  2012-04-09  2:14             ` Minchan Kim
@ 2012-04-09  7:37               ` 정효진
  2012-04-09  8:11                 ` Minchan Kim
  2012-04-09 12:54                 ` Arnd Bergmann
  0 siblings, 2 replies; 41+ messages in thread
From: 정효진 @ 2012-04-09  7:37 UTC (permalink / raw)
  To: 'Minchan Kim', 'Alex Lemberg'
  Cc: 'Arnd Bergmann', linaro-kernel, 'Rik van Riel',
	linux-mmc, linux-kernel, 'Luca Porzio (lporzio)',
	linux-mm, kernel-team, 'Yejin Moon',
	'Hugh Dickins', 'Yaniv Iarovici',
	cpgs

Hi Minchan,

How are you doing?

Regarding the time to issue Discard/Trim:
From the eMMC point of view, I believe that an immediate Discard/Trim command after deleting/freeing a SWAP cluster is always better, for all typical eMMC implementations.

Regarding swap page size:
Actually, I can't guarantee the optimal size for the different eMMC devices in the industry, because it depends on the NAND page size and the firmware implementation inside the eMMC. In the case of SAMSUNG eMMC, an 8KB page size and a 512KB block size (erase unit) is the current implementation.
I think that a multiple of the 8KB page size, aligned to 512KB, is good for SAMSUNG eMMC.
If the swap system used a 512KB page and issued Discard/Trim aligned to 512KB, the eMMC would achieve the best performance as of today. However, such a large page size in the swap partition may not be the best choice at the Linux system level.
I'm not sure what the best page size is between the swap system and the eMMC device.

Best Regards
Hyojin
-----Original Message-----
From: Minchan Kim [mailto:minchan@kernel.org] 
Sent: Monday, April 09, 2012 11:14 AM
To: Alex Lemberg
Cc: Arnd Bergmann; linaro-kernel@lists.linaro.org; Rik van Riel; linux-mmc@vger.kernel.org; linux-kernel@vger.kernel.org; Luca Porzio (lporzio); linux-mm@kvack.org; Hyojin Jeong; kernel-team@android.com; Yejin Moon; Hugh Dickins; Yaniv Iarovici
Subject: Re: swap on eMMC and other flash

On 2012-04-08 10:50 PM, Alex Lemberg wrote:

> Hi Arnd,
> 
> Regarding time to issue discard/TRIM commands:
> It would be advised to issue the discard command immediately after deleting/freeing a SWAP cluster (i.e. as soon as it becomes available).


Is it still beneficial at page granularity, rather than at cluster granularity?

> 
> Regarding SWAP page size:
> Working with SWAP pages as large as possible would be recommended (preferably 64KB). Also, writing in as sequential a manner as possible while swapping large quantities of data is advisable.
> 
> SWAP pages and corresponding transactions should be aligned to the SWAP page size (i.e. 64KB above), the alignment should correspond to the physical storage "LBA 0", i.e. to the first LBA of the storage device (and not to a logical/physical partition).
> 



I am curious whether the above comment is valid for Samsung and other eMMC devices.
Hyojin, could you answer?


> Thanks,
> Alex
> 
>> -----Original Message-----
>> From: Arnd Bergmann [mailto:arnd@arndb.de]
>> Sent: Monday, April 02, 2012 5:55 PM
>> To: Hugh Dickins
>> Cc: linaro-kernel@lists.linaro.org; Rik van Riel; linux- 
>> mmc@vger.kernel.org; Alex Lemberg; linux-kernel@vger.kernel.org; Luca 
>> Porzio (lporzio); linux-mm@kvack.org; Hyojin Jeong; kernel- 
>> team@android.com; Yejin Moon
>> Subject: Re: swap on eMMC and other flash
>>
>> On Monday 02 April 2012, Hugh Dickins wrote:
>>> On Mon, 2 Apr 2012, Arnd Bergmann wrote:
>>>>
>>>> Another option would be batched discard as we do it for file
>> systems:
>>>> occasionally stop writing to swap space and scanning for areas that 
>>>> have become available since the last discard, then send discard 
>>>> commands for those.
>>>
>>> I'm not sure whether you've missed "swapon --discard", which 
>>> switches on discard_swap_cluster() just before we allocate from a 
>>> new cluster; or whether you're musing that it's no use to you 
>>> because you want to repurpose the swap cluster to match erase block: 
>>> I'm mentioning it in case you missed that it's already there (but 
>>> few use it, since even done at that scale it's often more trouble than it's worth).
>>
>> I actually argued that discard_swap_cluster is exactly the right 
>> thing to do, especially when clusters match erase blocks on the less 
>> capable devices like SD cards.
>>
>> Luca was arguing that on some hardware there is no point in ever 
>> submitting a discard just before we start reusing space, because at 
>> that point the hardware already discards the old data by 
>> overwriting the logical addresses with new blocks, while issuing a 
>> discard on all blocks as soon as they become available would make a 
>> bigger difference. I would be interested in hearing from Hyojin Jeong 
>> and Alex Lemberg what they think is the best time to issue a discard, 
>> because they would know about other hardware than Luca.
>>
>>       Arnd
> 



--
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: swap on eMMC and other flash
  2012-04-09  7:37               ` 정효진
@ 2012-04-09  8:11                 ` Minchan Kim
  2012-04-09 13:00                   ` Arnd Bergmann
  2012-04-09 12:54                 ` Arnd Bergmann
  1 sibling, 1 reply; 41+ messages in thread
From: Minchan Kim @ 2012-04-09  8:11 UTC (permalink / raw)
  To: 정효진
  Cc: 'Alex Lemberg', 'Arnd Bergmann',
	linaro-kernel, 'Rik van Riel',
	linux-mmc, linux-kernel, 'Luca Porzio (lporzio)',
	linux-mm, kernel-team, 'Yejin Moon',
	'Hugh Dickins', 'Yaniv Iarovici',
	cpgs

On 2012-04-09 4:37 PM, 정효진 wrote:

> Hi Minchan
> 
> How are you doing?
> 


Pretty good :)

> Regarding time to issue Discard/Trim:
> From the eMMC point of view, I believe that issuing the Discard/Trim command immediately after deleting/freeing a SWAP cluster is always better for any typical eMMC implementation.


The question is whether discard at page-size granularity is useful or
not. Luca and Arnd said some devices would benefit if we sent the discard
command to the eMMC as soon as Linux frees _a swap page_, rather than
batching at cluster size. But AFAIK, Samsung eMMC does not benefit from
per-page discard, and I guess most eMMC devices handle per-page discard
poorly, because the FTL doesn't support full page mapping in such small
devices. So I'm not sure we should implement per-page discard when the
code would be rather complicated for the benefit of a few devices.

> 
> Regarding swap page size:
> Actually, I can't guarantee the optimal size for the different eMMC devices in the industry, because it depends on the NAND page size and the firmware implementation inside the eMMC. Samsung's current eMMC implementation uses an 8KB page size and a 512KB block size (erase unit).
> I think that writes in multiples of the 8KB page size, aligned to 512KB, are good for Samsung eMMC.
> If the swap system used a 512KB page size and issued Discard/Trim aligned to 512KB, today's eMMC would deliver its best performance. However, such a large page size in the swap partition may not be the best choice at the Linux system level.
> I'm not sure what the best page size is between the swap system and the eMMC device.


This variety is one of the challenges in avoiding GC in general. ;-(
I don't like manual configuration through /sys/block/xxx because it
requires the user to know the NAND page size and erase block size, which
are not easy for a normal user to find out.
Arnd, what's your plan to support the various flash storage devices
effectively?



> 
> Best Regards
> Hyojin
> -----Original Message-----
> From: Minchan Kim [mailto:minchan@kernel.org] 
> Sent: Monday, April 09, 2012 11:14 AM
> To: Alex Lemberg
> Cc: Arnd Bergmann; linaro-kernel@lists.linaro.org; Rik van Riel; linux-mmc@vger.kernel.org; linux-kernel@vger.kernel.org; Luca Porzio (lporzio); linux-mm@kvack.org; Hyojin Jeong; kernel-team@android.com; Yejin Moon; Hugh Dickins; Yaniv Iarovici
> Subject: Re: swap on eMMC and other flash
> 
> On 2012-04-08 10:50 PM, Alex Lemberg wrote:
> 
>> Hi Arnd,
>>
>> Regarding time to issue discard/TRIM commands:
>> It would be advised to issue the discard command immediately after deleting/freeing a SWAP cluster (i.e. as soon as it becomes available).
> 
> 
> Is it still beneficial at page granularity, rather than at cluster granularity?
> 
>>
>> Regarding SWAP page size:
>> Working with as large as SWAP pages as possible would be recommended (preferably 64KB). Also, writing in a sequential manner as much as possible while swapping large quantities of data is also advisable.
>>
>> SWAP pages and corresponding transactions should be aligned to the SWAP page size (i.e. 64KB above), the alignment should correspond to the physical storage "LBA 0", i.e. to the first LBA of the storage device (and not to a logical/physical partition).
>>
> 
> 
> 
> I am curious whether the above comment is valid for Samsung and other eMMC devices.
> Hyojin, could you answer?
> 
> 
>> Thanks,
>> Alex
>>
>>> -----Original Message-----
>>> From: Arnd Bergmann [mailto:arnd@arndb.de]
>>> Sent: Monday, April 02, 2012 5:55 PM
>>> To: Hugh Dickins
>>> Cc: linaro-kernel@lists.linaro.org; Rik van Riel; linux- 
>>> mmc@vger.kernel.org; Alex Lemberg; linux-kernel@vger.kernel.org; Luca 
>>> Porzio (lporzio); linux-mm@kvack.org; Hyojin Jeong; kernel- 
>>> team@android.com; Yejin Moon
>>> Subject: Re: swap on eMMC and other flash
>>>
>>> On Monday 02 April 2012, Hugh Dickins wrote:
>>>> On Mon, 2 Apr 2012, Arnd Bergmann wrote:
>>>>>
>>>>> Another option would be batched discard as we do it for file
>>> systems:
>>>>> occasionally stop writing to swap space and scanning for areas that 
>>>>> have become available since the last discard, then send discard 
>>>>> commands for those.
>>>>
>>>> I'm not sure whether you've missed "swapon --discard", which 
>>>> switches on discard_swap_cluster() just before we allocate from a 
>>>> new cluster; or whether you're musing that it's no use to you 
>>>> because you want to repurpose the swap cluster to match erase block: 
>>>> I'm mentioning it in case you missed that it's already there (but 
>>>> few use it, since even done at that scale it's often more trouble than it's worth).
>>>
>>> I actually argued that discard_swap_cluster is exactly the right 
>>> thing to do, especially when clusters match erase blocks on the less 
>>> capable devices like SD cards.
>>>
>>> Luca was arguing that on some hardware there is no point in ever 
>>> submitting a discard just before we start reusing space, because at 
>>> that point the hardware already discards the old data by 
>>> overwriting the logical addresses with new blocks, while issuing a 
>>> discard on all blocks as soon as they become available would make a 
>>> bigger difference. I would be interested in hearing from Hyojin Jeong 
>>> and Alex Lemberg what they think is the best time to issue a discard, 
>>> because they would know about other hardware than Luca.
>>>
>>>       Arnd
>>
> 
> 
> 
> --
> Kind regards,
> Minchan Kim
> 
> 



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: swap on eMMC and other flash
  2012-04-09  2:06     ` Minchan Kim
@ 2012-04-09 12:35       ` Arnd Bergmann
  2012-04-10  0:57         ` Minchan Kim
  0 siblings, 1 reply; 41+ messages in thread
From: Arnd Bergmann @ 2012-04-09 12:35 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linaro-kernel, android-kernel, linux-mm, Luca Porzio (lporzio),
	Alex Lemberg, linux-kernel, Saugata Das, Venkatraman S,
	Yejin Moon, Hyojin Jeong, linux-mmc

On Monday 09 April 2012, Minchan Kim wrote:
> On 2012-04-07 1:16 AM, Arnd Bergmann wrote:
> 
> > larger chunks would generally be helpful; in order to guarantee that
> > the drive doesn't do any garbage collection, we would have to do all writes
> 
> 
> And we should guarantee that, to avoid unnecessary swapout, or even OOM killing.
> 
> > in aligned chunks. It would probably be enough to do this in 8kb or
> > 16kb units for most devices over the next few years, but implementing it
> > for 64kb should be the same amount of work and will get us a little bit
> > further.
> 
> 
> I understand from your statement that writing 64K is best.
> What about 8K and 16K? Could you elaborate on the relation between 8K, 16K and 64K?

From my measurements, there are three sizes that are relevant here:

1. The underlying page size of the flash: This used to be less than 4kb,
which is fine when paging out 4kb mmu pages, as long as the partition is
aligned. Today, most devices use 8kb pages and the number is increasing
over time, meaning we will see more 16kb page devices in the future and
presumably larger sizes after that. Writes that are not naturally aligned
multiples of the page size tend to be a significant problem for the
controller to deal with: in order to guarantee that a 4kb write makes it
into permanent storage, the device has to write 8kb and the next 4kb
write has to go into another 8kb page because each page can only be
written once before the block is erased. At a later point, all the partial
pages get rewritten into a new erase block, a process that can take
hundreds of milliseconds and that we absolutely want to prevent from
happening, as it can block all other I/O to the device. Writing all
(flash) pages in an erase block sequentially usually avoids this, as
long as you don't write to too many different erase blocks at the same time.
Note that the page size depends on how the controller combines different
planes and channels.

2. The super-page size of the flash: When you have multiple channels
between the controller and the individual flash chips, you can write
multiple pages simultaneously, which means that e.g. sending 32kb of
data to the device takes roughly the same amount of time as writing a
single 8kb page. Writing less than the super-page size when there is
more data waiting to get written out is a waste of time, although the
effects are much less drastic than for writing data that is not aligned
to pages, because it does not require garbage collection.

3. optimum write size: While writing larger amounts of data in a single
request is usually faster than writing less, almost all devices
I've seen have a sharp cut-off where increasing the size of the write
does not actually help any more because of a bottleneck somewhere
in the stack. Writing more than 64kb almost never improves performance
and sometimes reduces performance.

From the measurements I've done, a typical profile could look like:

Size	Throughput
1KB	200KB/s
2KB	450KB/s
4KB	1MB/s
8KB	4MB/s		<== page size
16KB	8MB/s
32KB	16MB/s		<== superpage size
64KB	18MB/s		<== optimum size
128KB	17MB/s
...
8MB	18MB/s		<== erase block size
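The size hierarchy above suggests a simple batching rule for pageout. A hedged sketch, where the 8/32/64KB figures are the example numbers from this profile rather than universal constants, and the function name is made up:

```python
# Sketch: choose a write size for a batch of dirty 4 KB mmu pages,
# following the hierarchy above: cap at the optimum size (64 KB) and
# otherwise round down to a whole number of flash pages (8 KB), so a
# flash page is never written partially.

MMU_PAGE = 4 * 1024
FLASH_PAGE = 8 * 1024   # example flash page size
SUPERPAGE = 32 * 1024   # example super-page size (context only)
OPTIMUM = 64 * 1024     # example optimum write size

def pick_write_size(pending_bytes):
    """Bytes to submit now; 0 means wait for more pages (or pad)."""
    if pending_bytes >= OPTIMUM:
        return OPTIMUM
    # round down to whole flash pages
    return (pending_bytes // FLASH_PAGE) * FLASH_PAGE

print(pick_write_size(100 * 1024))  # 65536: capped at the optimum size
print(pick_write_size(20 * 1024))   # 16384: two whole flash pages
print(pick_write_size(4 * 1024))    # 0: half a flash page, wait or pad
```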

> > I'm not sure what we would do when there are less than 64kb available
> > for pageout on the inactive list. The two choices I can think of are
> > either not writing anything, or wasting the swap slots and filling
> 
> 
> Not writing will cause unnecessarily many pages to be swapped out at the
> next scanning priority, and we can't guarantee how long we would wait to
> queue up 64KB of anon pages. It might take longer than the GC time, so we need some deadline.
> 
> 
> > up the data with zeroes.
> 
> 
> Zero padding would be a good solution, but I have a concern about WAP,
> so we need a smart policy.
> 
> To be honest, I think swapout is normally an asynchronous operation, so
> it should not affect system latency, unlike swap read, which is a
> synchronous operation. So if the system is under low memory pressure, we
> can queue swap-out pages up to 64KB and then batch-write them out into
> an empty cluster. If we don't have any empty cluster under low memory
> pressure, we should write them out into a partial cluster. Maybe that
> doesn't affect system latency severely under low memory pressure.

The main thing that can affect system latency is garbage collection
that blocks any other reads or writes for an extended amount of time.
If we can avoid that, we've got the 95% solution.

Note that eMMC-4.5 provides a high-priority interrupt mechanism that
lets us interrupt a write that has hit the garbage collection
path, so we can send a more important read request to the device.
This will not work on other devices though, and the patches for this
are still under discussion.

> If system memory pressure is high (and that should not be frequent),
> swap-out bandwidth would be more important. So we can reserve some
> clusters for it, and I think we can use the page padding you mentioned
> in this case to reduce latency, if we can queue up 64KB within a
> threshold time.
> 
> Swap read is also important. We have to investigate the fragmentation of
> swap slots, because we disable swap readahead on non-rotational devices.
> That can make lots of holes in the swap clusters and makes it hard to
> find an empty cluster. So it might be better to enable swap readahead on
> non-rotational devices, too.

Yes, reading in up to 64kb or at least a superpage would also help here,
although there is no problem with reading in a single cpu page: it still
takes no more time than reading in a superpage.

> >>> 2) Make variable sized swap clusters. Right now, the swap space is
> >>> organized in clusters of 256 pages (1MB), which is less than the typical
> >>> erase block size of 4 or 8 MB. We should try to make the swap cluster
> >>> aligned to erase blocks and have the size match to avoid garbage collection
> >>> in the drive. The cluster size would typically be set by mkswap as a new
> >>> option and interpreted at swapon time.
> >>>
> >>
> >> If we can find such big contiguous swap slots easily, it would be good.
> >> But I am not sure how often we can get such big slots. And maybe we have to
> >> improve search method for getting such big empty cluster.
> > 
> > As long as there are clusters available, we should try to find them. When
> > free space is too fragmented to find any unused cluster, we can pick one
> > that has very little data in it, so that we reduce the time it takes to
> > GC that erase block in the drive. While we could theoretically do active
> > garbage collection of swap data in the kernel, it won't get more efficient
> > than the GC inside of the drive. If we do this, it unfortunately means that
> > we can't just send a discard for the entire erase block.
> 
> 
> We might need some compaction during idle time, but the WAP concern arises again. :(

Sorry for my ignorance, but what does WAP stand for?

	Arnd

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: swap on eMMC and other flash
  2012-04-05  0:17           ` 정효진
@ 2012-04-09 12:50             ` Arnd Bergmann
  0 siblings, 0 replies; 41+ messages in thread
From: Arnd Bergmann @ 2012-04-09 12:50 UTC (permalink / raw)
  To: 정효진
  Cc: 'Hugh Dickins',
	cpgs, linaro-kernel, 'Rik van Riel',
	linux-mmc, 'Alex Lemberg',
	linux-kernel, 'Luca Porzio (lporzio)',
	linux-mm, kernel-team, 'Yejin Moon'

On Thursday 05 April 2012, 정효진 wrote:

> I'm not sure how Linux manages the swap area.
> If the host and the eMMC device have different information about which
> data is invalid, sending a discard to the eMMC is good for I/O
> performance. It is the same as the general case of discard on a user
> partition formatted with a filesystem.
> As your e-mail mentioned, overwriting the logical address is another way
> to convey which data is invalid, but only for the overwritten area, and
> it is not the best way for the eMMC to manage the physical NAND array.
> In that case, the eMMC has to trim the physical NAND array and do the
> write operation at the same time, which adds latency.
> If the host sends a discard with the invalid data's address in advance,
> the eMMC can find the best way to manage the physical NAND pages before
> the host writes to them.
> I'm not sure whether this addresses your concern.
> If you need more info, please let me know.

One specific property of the linux swap code is that we write relatively
large clusters (1 MB today) sequentially and only reuse them once all
of the data in them has become invalid. Part of my suggestion was to
increase that size to the erase block size of the underlying storage,
e.g. 8MB for typical eMMC. Right now, we send a discard command
just before reusing a swap cluster, for the entire cluster.

In my interpretation, this already means a typical device will never do a
garbage collection of that erase block, because we never overwrite the
erase block partially.

Luca suggested that we could send the discard command as soon as an
individual 4kb page is freed, which would let the device reuse the
physical erase block as soon as all the pages in that erase block have
been freed over time, but my interpretation is that while this can
help for global wear levelling, it does not help avoid any garbage
collection.
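As a toy model of the policy described above (the class and method names are hypothetical, for illustration only, not kernel code):

```python
# Toy model of cluster reuse as described above: a swap cluster is
# reused only once every slot in it has been freed, and one discard
# covering the whole cluster is sent just before reuse, instead of a
# per-page discard each time a 4 KB page is freed.

class SwapCluster:
    def __init__(self, nr_slots):
        self.in_use = [False] * nr_slots
        self.discards_sent = 0

    def alloc(self, slot):
        self.in_use[slot] = True

    def free(self, slot):
        self.in_use[slot] = False  # note: no per-page discard here

    def try_reuse(self):
        """Reusable only when fully free; one whole-cluster discard then."""
        if any(self.in_use):
            return False
        self.discards_sent += 1
        return True

c = SwapCluster(4)
c.alloc(0); c.alloc(1)
print(c.try_reuse())    # False: slots 0 and 1 still hold data
c.free(0); c.free(1)
print(c.try_reuse())    # True: one discard covers the whole cluster
print(c.discards_sent)  # 1
```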

	Arnd

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: swap on eMMC and other flash
  2012-04-09  7:37               ` 정효진
  2012-04-09  8:11                 ` Minchan Kim
@ 2012-04-09 12:54                 ` Arnd Bergmann
  1 sibling, 0 replies; 41+ messages in thread
From: Arnd Bergmann @ 2012-04-09 12:54 UTC (permalink / raw)
  To: 정효진
  Cc: 'Minchan Kim', 'Alex Lemberg',
	linaro-kernel, 'Rik van Riel',
	linux-mmc, linux-kernel, 'Luca Porzio (lporzio)',
	linux-mm, kernel-team, 'Yejin Moon',
	'Hugh Dickins', 'Yaniv Iarovici',
	cpgs

On Monday 09 April 2012, 정효진 wrote:
> If swap system use 512KB page and issue Discard/Trim align with 512KB, eMMC make best
> performance as of today. However, large page size in swap partition may not best way
> in Linux system level.
> I'm not sure that the best page size between Swap system and eMMC device.

Can you explain the significance of the 512KB size? I've seen devices report a 512KB
erase size, although measurements clearly showed an erase block size of 8MB, and I
do not understand this discrepancy.

Right now, we always send discards of 1MB clusters to the device, which does what
you want, although I'm not sure if those clusters are naturally aligned to the start
of the partition. Obviously this also requires aligning the start of the partition
to the erase block size, but most devices should already get that right nowadays.
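The alignment check itself is simple arithmetic; a sketch, assuming partition starts reported in 512-byte sectors (as in /sys/block/<disk>/<part>/start) and taking the 8MB erase block size from this discussion as an assumption:

```python
# Sketch: check whether a partition start is aligned to the erase
# block size. Partition starts are counted in 512-byte sectors; the
# 8 MB erase block size is an assumed figure, since devices do not
# reliably report it.

SECTOR = 512

def partition_is_aligned(start_sector, erase_block_bytes):
    """True if the partition's byte offset is an erase block multiple."""
    return (start_sector * SECTOR) % erase_block_bytes == 0

ERASE_BLOCK = 8 * 1024 * 1024

print(partition_is_aligned(16384, ERASE_BLOCK))  # True: starts 8 MiB in
print(partition_is_aligned(2048, ERASE_BLOCK))   # False: 1 MiB offset
```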

	Arnd

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: swap on eMMC and other flash
  2012-04-09  8:11                 ` Minchan Kim
@ 2012-04-09 13:00                   ` Arnd Bergmann
  2012-04-10  1:10                     ` Minchan Kim
  0 siblings, 1 reply; 41+ messages in thread
From: Arnd Bergmann @ 2012-04-09 13:00 UTC (permalink / raw)
  To: Minchan Kim
  Cc: 정효진, 'Alex Lemberg',
	linaro-kernel, 'Rik van Riel',
	linux-mmc, linux-kernel, 'Luca Porzio (lporzio)',
	linux-mm, kernel-team, 'Yejin Moon',
	'Hugh Dickins', 'Yaniv Iarovici',
	cpgs

On Monday 09 April 2012, Minchan Kim wrote:
> > 
> > Regarding swap page size:
> > Actually, I can't guarantee the optimal size for the different eMMC devices in the industry, because it depends on the NAND page size and the firmware implementation inside the eMMC. Samsung's current eMMC implementation uses an 8KB page size and a 512KB block size (erase unit).
> > I think that writes in multiples of the 8KB page size, aligned to 512KB, are good for Samsung eMMC.
> > If the swap system used a 512KB page size and issued Discard/Trim aligned to 512KB, today's eMMC would deliver its best performance. However, such a large page size in the swap partition may not be the best choice at the Linux system level.
> > I'm not sure what the best page size is between the swap system and the eMMC device.
> 
> 
> The variety is one of challenges for removing GC generally. ;-(.
> I don't like manual setting through /sys/block/xxx because it requires
> that user have to know nand page size and erase block size but it's not
> easy to know to normal user.
> Arnd. What's your plan to support various flash storages effectively?

My preference would be to build the logic to detect the sizes into mkfs
and mkswap and encode them in the superblock in new fields. I don't think
we can trust any data that a device reports right now because operating
systems have ignored it in the past and either someone has forgotten to
update the fields after moving to new technology (eMMC), or the data can
not be encoded correctly according to the spec (SD, USB).
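One plausible way for mkswap to detect the geometry is to time writes at increasing offsets and look for the latency spike at an erase block boundary (the approach used by tools such as flashbench). In this sketch the measurements are fabricated and only the inference step is shown:

```python
# Sketch of geometry detection by timing, since reported values can't
# be trusted: write latency jumps when a write first triggers work on
# an erase block boundary. The samples below are fabricated; a real
# tool would measure them with O_DIRECT writes to the device.

def infer_boundary(samples):
    """samples: list of (offset_bytes, latency_ms) pairs. Return the
    first offset whose latency is far above the median, i.e. a
    suspected erase block boundary, or None if none stands out."""
    latencies = sorted(lat for _, lat in samples)
    median = latencies[len(latencies) // 2]
    for offset, lat in samples:
        if lat > 3 * median:
            return offset
    return None

fake_samples = [(0, 2), (2 * 2**20, 2), (4 * 2**20, 3), (8 * 2**20, 40)]
print(infer_boundary(fake_samples))  # 8388608: suggests an 8 MB erase block
```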

System builders for embedded systems can then make sure that they get
it right for the hardware they use, and we can try our best to help
that process.

	Arnd

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: swap on eMMC and other flash
  2012-04-09 12:35       ` Arnd Bergmann
@ 2012-04-10  0:57         ` Minchan Kim
  2012-04-10  8:32           ` Arnd Bergmann
  0 siblings, 1 reply; 41+ messages in thread
From: Minchan Kim @ 2012-04-10  0:57 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linaro-kernel, android-kernel, linux-mm, Luca Porzio (lporzio),
	Alex Lemberg, linux-kernel, Saugata Das, Venkatraman S,
	Yejin Moon, Hyojin Jeong, linux-mmc

On 2012-04-09 9:35 PM, Arnd Bergmann wrote:

> On Monday 09 April 2012, Minchan Kim wrote:
> On 2012-04-07 1:16 AM, Arnd Bergmann wrote:
>>
>>> larger chunks would generally be helpful; in order to guarantee that
>>> the drive doesn't do any garbage collection, we would have to do all writes
>>
>>
>> And we should guarantee that, to avoid unnecessary swapout, or even OOM killing.
>>
>>> in aligned chunks. It would probably be enough to do this in 8kb or
>>> 16kb units for most devices over the next few years, but implementing it
>>> for 64kb should be the same amount of work and will get us a little bit
>>> further.
>>
>>
>> I understand from your statement that writing 64K is best.
>> What about 8K and 16K? Could you elaborate on the relation between 8K, 16K and 64K?
> 
> From my measurements, there are three sizes that are relevant here:
> 
> 1. The underlying page size of the flash: This used to be less than 4kb,
> which is fine when paging out 4kb mmu pages, as long as the partition is
> aligned. Today, most devices use 8kb pages and the number is increasing
> over time, meaning we will see more 16kb page devices in the future and
> presumably larger sizes after that. Writes that are not naturally aligned
> multiples of the page size tend to be a significant problem for the
> controller to deal with: in order to guarantee that a 4kb write makes it
> into permanent storage, the device has to write 8kb and the next 4kb
> write has to go into another 8kb page because each page can only be
> written once before the block is erased. At a later point, all the partial
> pages get rewritten into a new erase block, a process that can take
> hundreds of milliseconds and that we absolutely want to prevent from
> happening, as it can block all other I/O to the device. Writing all
> (flash) pages in an erase block sequentially usually avoids this, as
> long as you don't write to too many different erase blocks at the same time.
> Note that the page size depends on how the controller combines different
> planes and channels.
> 
> 2. The super-page size of the flash: When you have multiple channels
> between the controller and the individual flash chips, you can write
> multiple pages simultaneously, which means that e.g. sending 32kb of
> data to the device takes roughly the same amount of time as writing a
> single 8kb page. Writing less than the super-page size when there is
> more data waiting to get written out is a waste of time, although the
> effects are much less drastic than for writing data that is not aligned
> to pages, because it does not require garbage collection.
> 
> 3. optimum write size: While writing larger amounts of data in a single
> request is usually faster than writing less, almost all devices
> I've seen have a sharp cut-off where increasing the size of the write
> does not actually help any more because of a bottleneck somewhere
> in the stack. Writing more than 64kb almost never improves performance
> and sometimes reduces performance.


Just to confirm our understanding: you mean we should do aligned writes,
preferring the sizes in the following order, if possible?

"NAND internal page size write (8K, 16K)" < "super-page size write (32K),
which exploits the parallelism of multiple channels and planes" < "some
big sequential write (64K)"

> 
> From the measurements I've done, a typical profile could look like:
> 
> Size	Throughput
> 1KB	200KB/s
> 2KB	450KB/s
> 4KB	1MB/s
> 8KB	4MB/s		<== page size
> 16KB	8MB/s
> 32KB	16MB/s		<== superpage size
> 64KB	18MB/s		<== optimum size
> 128KB	17MB/s
> ...
> 8MB	18MB/s		<== erase block size
> 
>>> I'm not sure what we would do when there are less than 64kb available
>>> for pageout on the inactive list. The two choices I can think of are
>>> either not writing anything, or wasting the swap slots and filling
>>
>>
>> Not writing will cause unnecessarily many pages to be swapped out at the
>> next scanning priority, and we can't guarantee how long we would wait to
>> queue up 64KB of anon pages. It might take longer than the GC time, so we need some deadline.
>>
>>
>>> up the data with zeroes.
>>
>>
>> Zero padding would be a good solution, but I have a concern about WAP,
>> so we need a smart policy.
>>
>> To be honest, I think swapout is normally an asynchronous operation, so
>> it should not affect system latency, unlike swap read, which is a
>> synchronous operation. So if the system is under low memory pressure, we
>> can queue swap-out pages up to 64KB and then batch-write them out into
>> an empty cluster. If we don't have any empty cluster under low memory
>> pressure, we should write them out into a partial cluster. Maybe that
>> doesn't affect system latency severely under low memory pressure.
> 
> The main thing that can affect system latency is garbage collection
> that blocks any other reads or writes for an extended amount of time.
> If we can avoid that, we've got the 95% solution.


I see.

> 
> Note that eMMC-4.5 provides a high-priority interrupt mechamism that
> lets us interrupt the a write that has hit the garbage collection
> path, so we can send a more important read request to the device.
> This will not work on other devices though and the patches for this
> are still under discussion.


A nice feature, but I think the swap system doesn't need to consider it.
It should be handled by the I/O subsystem, e.g. the I/O scheduler.

> 
>> If system memory pressure is high (and that should not be frequent),
>> swap-out bandwidth would be more important. So we can reserve some
>> clusters for it, and I think we can use the page padding you mentioned
>> in this case to reduce latency, if we can queue up 64KB within a
>> threshold time.
>>
>> Swap read is also important. We have to investigate the fragmentation of
>> swap slots, because we disable swap readahead on non-rotational devices.
>> That can make lots of holes in the swap clusters and makes it hard to
>> find an empty cluster. So it might be better to enable swap readahead on
>> non-rotational devices, too.
> 
> Yes, reading in up to 64kb or at least a superpage would also help here,
> although there is no problem with reading in a single cpu page: it still
> takes no more time than reading in a superpage.
> 
>>>>> 2) Make variable sized swap clusters. Right now, the swap space is
>>>>> organized in clusters of 256 pages (1MB), which is less than the typical
>>>>> erase block size of 4 or 8 MB. We should try to make the swap cluster
>>>>> aligned to erase blocks and have the size match to avoid garbage collection
>>>>> in the drive. The cluster size would typically be set by mkswap as a new
>>>>> option and interpreted at swapon time.
>>>>>
>>>>
>>>> If we can find such big contiguous swap slots easily, it would be good.
>>>> But I am not sure how often we can get such big slots. And maybe we have to
>>>> improve search method for getting such big empty cluster.
>>>
>>> As long as there are clusters available, we should try to find them. When
>>> free space is too fragmented to find any unused cluster, we can pick one
>>> that has very little data in it, so that we reduce the time it takes to
>>> GC that erase block in the drive. While we could theoretically do active
>>> garbage collection of swap data in the kernel, it won't get more efficient
>>> than the GC inside of the drive. If we do this, it unfortunately means that
>>> we can't just send a discard for the entire erase block.
>>
>>
>> Might need some compaction during idle time but WAP concern raises again. :(
> 
> Sorry for my ignorance, but what does WAP stand for?


I should have written the more general term. I meant write amplification, but
WAF (Write Amplification Factor) is the more popular term. :(

> 
> 	Arnd
> --
> To unsubscribe from this list: send the line "unsubscribe linux-mmc" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: swap on eMMC and other flash
  2012-04-09 13:00                   ` Arnd Bergmann
@ 2012-04-10  1:10                     ` Minchan Kim
  2012-04-10  8:40                       ` Arnd Bergmann
  0 siblings, 1 reply; 41+ messages in thread
From: Minchan Kim @ 2012-04-10  1:10 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: 정효진, 'Alex Lemberg',
	linaro-kernel, 'Rik van Riel',
	linux-mmc, linux-kernel, 'Luca Porzio (lporzio)',
	linux-mm, kernel-team, 'Yejin Moon',
	'Hugh Dickins', 'Yaniv Iarovici',
	cpgs

On 2012-04-09 10:00 PM, Arnd Bergmann wrote:

> On Monday 09 April 2012, Minchan Kim wrote:
>>>
>>> Regarding swap page size:
>>> Actually, I can't guarantee the optimal size for the different eMMC parts in the industry, because it depends on the NAND page size and the firmware implementation inside the eMMC. In the case of SAMSUNG eMMC, an 8KB page size and a 512KB block size (erase unit) is the current implementation.
>>> I think that a multiple of the 8KB page size, aligned to 512KB, is good for SAMSUNG eMMC.
>>> If the swap system uses 512KB pages and issues Discard/Trim aligned to 512KB, the eMMC gives its best performance as of today. However, a large page size in the swap partition may not be the best choice at the Linux system level.
>>> I'm not sure what the best page size is between the swap system and the eMMC device.
>>
>>
>> The variety is one of the challenges in removing GC generally. ;-(
>> I don't like manual settings through /sys/block/xxx, because they require
>> the user to know the NAND page size and erase block size, which is not
>> easy for a normal user to find out.
>> Arnd, what's your plan to support various flash storages effectively?
> 
> My preference would be to build the logic to detect the sizes into mkfs
> and mkswap and encode them in the superblock in new fields. I don't think
> we can trust any data that a device reports right now because operating
> systems have ignored it in the past and either someone has forgotten to
> update the fields after moving to new technology (eMMC), or the data can
> not be encoded correctly according to the spec (SD, USB).


I don't think that's a good approach.
How long does it take to detect such parameters?
I guess it's not short, so mkfs/mkswap would become dramatically
slower. If needed, let's maintain it as a separate tool.

If storage vendors break such fields, their devices won't work well on Linux,
which is very popular in the mobile world today; users will stop buying those
devices and the company will be gone. Let's put that pressure on the vendors
and make them keep their promises.


> 
> System builders for embedded systems can then make sure that they get
> it right for the hardware they use, and we can try our best to help
> that process.




> 
> 	Arnd



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: swap on eMMC and other flash
  2012-04-10  0:57         ` Minchan Kim
@ 2012-04-10  8:32           ` Arnd Bergmann
  2012-04-11  9:54             ` Minchan Kim
  0 siblings, 1 reply; 41+ messages in thread
From: Arnd Bergmann @ 2012-04-10  8:32 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linaro-kernel, android-kernel, linux-mm, Luca Porzio (lporzio),
	Alex Lemberg, linux-kernel, Saugata Das, Venkatraman S,
	Yejin Moon, Hyojin Jeong, linux-mmc

On Tuesday 10 April 2012, Minchan Kim wrote:
On 2012-04-09 9:35 PM, Arnd Bergmann wrote:

> >>
> >> I understand it's best for writing 64K in your statement.
> >> What the 8K, 16K? Could you elaborate relation between 8K, 16K and 64K?
> > 
> > From my measurements, there are three sizes that are relevant here:
> > 
> > 1. The underlying page size of the flash: This used to be less than 4kb,
> > which is fine when paging out 4kb mmu pages, as long as the partition is
> > aligned. Today, most devices use 8kb pages and the number is increasing
> > over time, meaning we will see more 16kb page devices in the future and
> > presumably larger sizes after that. Writes that are not naturally aligned
> > multiples of the page size tend to be a significant problem for the
> > controller to deal with: in order to guarantee that a 4kb write makes it
> > into permanent storage, the device has to write 8kb and the next 4kb
> > write has to go into another 8kb page because each page can only be
> > written once before the block is erased. At a later point, all the partial
> > pages get rewritten into a new erase block, a process that can take
> > hundreds of milliseconds and that we absolutely want to prevent from
> > happening, as it can block all other I/O to the device. Writing all
> > (flash) pages in an erase block sequentially usually avoids this, as
> > long as you don't write to too many different erase blocks at the same time.
> > Note that the page size depends on how the controller combines different
> > planes and channels.
> > 
> > 2. The super-page size of the flash: When you have multiple channels
> > between the controller and the individual flash chips, you can write
> > multiple pages simultaneously, which means that e.g. sending 32kb of
> > data to the device takes roughly the same amount of time as writing a
> > single 8kb page. Writing less than the super-page size when there is
> > more data waiting to get written out is a waste of time, although the
> > effects are much less drastic than writing data that is not aligned to
> > pages, because it does not require garbage collection.
> > 
> > 3. optimum write size: While writing larger amounts of data in a single
> > request is usually faster than writing less, almost all devices
> > I've seen have a sharp cut-off where increasing the size of the write
> > does not actually help any more because of a bottleneck somewhere
> > in the stack. Writing more than 64kb almost never improves performance
> > and sometimes reduces performance.
> 
> 
> For our understanding, you mean we should do aligned writes as follows,
> if possible?
> 
> "Nand internal page size write(8K, 16K)" < "Super-page size write(32K)
> which considers parallel working with number of channel and plane" <
> some sequential big write (64K)

In the definition I gave above, page size (8k, 16k) would be the only
one that requires alignment. Writing 64k at an arbitrary 16k alignment
should still give us the best performance in almost all cases and
introduce no extra write amplification, while writing with less than
page alignment causes significant write amplification and long latencies.

> 
> > 
> > > Note that eMMC-4.5 provides a high-priority interrupt mechanism that
> > > lets us interrupt a write that has hit the garbage collection
> > path, so we can send a more important read request to the device.
> > This will not work on other devices though and the patches for this
> > are still under discussion.
> 
> 
> Nice feature but I think swap system doesn't need to consider such
> feature. I should be handled by I/O subsystem like I/O scheduler.

Right, this is completely independent of swap. The current implementation
of the patch set favours only reads that are done for page-in operations
by interrupting any long-running writes when a more important read comes
in. IMHO we should do the same for any synchronous read, but that discussion
is completely orthogonal to having the swap device on emmc.

> >>>>> 2) Make variable sized swap clusters. Right now, the swap space is
> >>>>> organized in clusters of 256 pages (1MB), which is less than the typical
> >>>>> erase block size of 4 or 8 MB. We should try to make the swap cluster
> >>>>> aligned to erase blocks and have the size match to avoid garbage collection
> >>>>> in the drive. The cluster size would typically be set by mkswap as a new
> >>>>> option and interpreted at swapon time.
> >>>>>
> >>>>
> >>>> If we can find such big contiguous swap slots easily, it would be good.
> >>>> But I am not sure how often we can get such big slots. And maybe we have to
> >>>> improve search method for getting such big empty cluster.
> >>>
> >>> As long as there are clusters available, we should try to find them. When
> >>> free space is too fragmented to find any unused cluster, we can pick one
> >>> that has very little data in it, so that we reduce the time it takes to
> >>> GC that erase block in the drive. While we could theoretically do active
> >>> garbage collection of swap data in the kernel, it won't get more efficient
> >>> than the GC inside of the drive. If we do this, it unfortunately means that
> >>> we can't just send a discard for the entire erase block.
> >>
> >>
> >> Might need some compaction during idle time but WAP concern raises again. :(
> > 
> > Sorry for my ignorance, but what does WAP stand for?
> 
> 
> I should have written more general term. I means write amplication but
> WAF(Write Amplication Factor) is more popular. :(

D'oh. Thanks for the clarification. Note that the entire idea of increasing the
swap cluster size to the erase block size is to *reduce* write amplification:

If we pick arbitrary swap clusters that are part of an erase block (or worse,
span two partial erase blocks), sending a discard for one cluster does not
allow the device to actually discard an entire erase block. Consider the best
possible scenario where we have a 1MB cluster and 2MB erase blocks, all
naturally aligned. After we have written the entire swap device once, all
blocks are marked as used in the device, but some are available for reuse
in the kernel. The swap code picks a cluster that is currently unused and 
sends a discard to the device, then fills the cluster with new pages.
After that, we pick another swap cluster elsewhere. The erase block now
contains 50% new and 50% old data and has to be garbage collected, so the
device writes 2MB of data to another erase block. So, in order to write 1MB,
the device has written 3MB and the write amplification factor is 3. Using
8MB erase blocks, it would be 9.

If we do the active compaction and increase the cluster size to the erase
block size, there is no write amplification inside of the device (and no
stalls from the garbage collection, which are the other concern), and
we only need to write a few blocks again that are still valid in a cluster
at the time we want to reuse it. On an ideal device, the write amplification
for active compaction should be exactly the same as what we get when we
write a cluster while some of the data in it is still valid and we skip
those pages, though some devices might not like having to GC themselves.
Doing the compaction in software means we have to spend CPU cycles on it,
but we get to choose when it happens and don't have to block on the device
during GC.

	Arnd

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: swap on eMMC and other flash
  2012-04-10  1:10                     ` Minchan Kim
@ 2012-04-10  8:40                       ` Arnd Bergmann
  2012-04-12  8:32                         ` Luca Porzio (lporzio)
  0 siblings, 1 reply; 41+ messages in thread
From: Arnd Bergmann @ 2012-04-10  8:40 UTC (permalink / raw)
  To: Minchan Kim
  Cc: 정효진, 'Alex Lemberg',
	linaro-kernel, 'Rik van Riel',
	linux-mmc, linux-kernel, 'Luca Porzio (lporzio)',
	linux-mm, kernel-team, 'Yejin Moon',
	'Hugh Dickins', 'Yaniv Iarovici',
	cpgs

On Tuesday 10 April 2012, Minchan Kim wrote:
> I think it's not good approach.
> How long does it take to know such parameters?
> I guess it's not short so that mkfs/mkswap would be very long
> dramatically. If needed, let's maintain it as another tool.

I haven't come up with a way that is both fast and reliable.
A very fast method is to time short read requests across potential
erase block boundaries and see which ones are faster than others,
this works on about 3 out of 4 devices. 

For the other devices, I currently use a fairly manual process that
times a lot of write requests and can take a long time.

> If storage vendors break such fields, it doesn't work well on linux
> which is very popular on mobile world today and user will not use such
> vendor devices and company will be gone. Let's give such pressure to
> them and make vendor keep in promise.

This could work for eMMC, yes.

The SD card standard makes it impossible to write the correct value for
most devices, it only supports power-of-two values up to 4MB for SDHC,
and larger values (I believe 8, 12, 16, 24, ... 64) for SDXC, but a lot
of SDHC cards nowadays use 1.5, 3, 6 or 8 MB erase blocks.

	Arnd

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: swap on eMMC and other flash
  2012-04-10  8:32           ` Arnd Bergmann
@ 2012-04-11  9:54             ` Minchan Kim
  2012-04-11 15:57               ` Arnd Bergmann
  0 siblings, 1 reply; 41+ messages in thread
From: Minchan Kim @ 2012-04-11  9:54 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Minchan Kim, linaro-kernel, android-kernel, linux-mm,
	Luca Porzio (lporzio),
	Alex Lemberg, linux-kernel, Saugata Das, Venkatraman S,
	Yejin Moon, Hyojin Jeong, linux-mmc

On Tue, Apr 10, 2012 at 08:32:51AM +0000, Arnd Bergmann wrote:
> On Tuesday 10 April 2012, Minchan Kim wrote:
> On 2012-04-09 9:35 PM, Arnd Bergmann wrote:
> 
> > >>
> > >> I understand it's best for writing 64K in your statement.
> > >> What the 8K, 16K? Could you elaborate relation between 8K, 16K and 64K?
> > > 
> > > From my measurements, there are three sizes that are relevant here:
> > > 
> > > 1. The underlying page size of the flash: This used to be less than 4kb,
> > > which is fine when paging out 4kb mmu pages, as long as the partition is
> > > aligned. Today, most devices use 8kb pages and the number is increasing
> > > over time, meaning we will see more 16kb page devices in the future and
> > > presumably larger sizes after that. Writes that are not naturally aligned
> > > multiples of the page size tend to be a significant problem for the
> > > controller to deal with: in order to guarantee that a 4kb write makes it
> > > into permanent storage, the device has to write 8kb and the next 4kb
> > > write has to go into another 8kb page because each page can only be
> > > written once before the block is erased. At a later point, all the partial
> > > pages get rewritten into a new erase block, a process that can take
> > > hundreds of miliseconds and that we absolutely want to prevent from
> > > happening, as it can block all other I/O to the device. Writing all
> > > (flash) pages in an erase block sequentially usually avoids this, as
> > > long as you don't write to many different erase blocks at the same time.
> > > Note that the page size depends on how the controller combines different
> > > planes and channels.
> > > 
> > > 2. The super-page size of the flash: When you have multiple channels
> > > between the controller and the individual flash chips, you can write
> > > multiple pages simultaneously, which means that e.g. sending 32kb of
> > > data to the device takes roughly the same amount of time as writing a
> > > single 8kb page. Writing less than the super-page size when there is
> > > more data waiting to get written out is a waste of time, although the
> > > effects are much less drastic as writing data that is not aligned to
> > > pages because it does not require garbage collection.
> > > 
> > > 3. optimum write size: While writing larger amounts of data in a single
> > > request is usually faster than writing less, almost all devices
> > > I've seen have a sharp cut-off where increasing the size of the write
> > > does not actually help any more because of a bottleneck somewhere
> > > in the stack. Writing more than 64kb almost never improves performance
> > > and sometimes reduces performance.
> > 
> > 
> > For our understanding, you mean we have to do aligned-write as follows
> > if possible?
> > 
> > "Nand internal page size write(8K, 16K)" < "Super-page size write(32K)
> > which considers parallel working with number of channel and plane" <
> > some sequential big write (64K)
> 
> In the definition I gave above, page size (8k, 16k) would be the only
> one that requires alignment. Writing 64k at an arbitrary 16k alignment
> should still give us the best performance in almost all cases and
> introduce no extra write amplification, while writing with less than
> page alignment causes significant write amplification and long latencies.
> 
> > 
> > > 
> > > Note that eMMC-4.5 provides a high-priority interrupt mechamism that
> > > lets us interrupt the a write that has hit the garbage collection
> > > path, so we can send a more important read request to the device.
> > > This will not work on other devices though and the patches for this
> > > are still under discussion.
> > 
> > 
> > Nice feature but I think swap system doesn't need to consider such
> > feature. I should be handled by I/O subsystem like I/O scheduler.
> 
> Right, this is completely independent of swap. The current implementation
> of the patch set favours only reads that are done for page-in operations
> by interrupting any long-running writes when a more important read comes
> in. IMHO we should do the same for any synchronous read, but that discussion
> is completely orthogonal to having the swap device on emmc.
> 
> > >>>>> 2) Make variable sized swap clusters. Right now, the swap space is
> > >>>>> organized in clusters of 256 pages (1MB), which is less than the typical
> > >>>>> erase block size of 4 or 8 MB. We should try to make the swap cluster
> > >>>>> aligned to erase blocks and have the size match to avoid garbage collection
> > >>>>> in the drive. The cluster size would typically be set by mkswap as a new
> > >>>>> option and interpreted at swapon time.
> > >>>>>
> > >>>>
> > >>>> If we can find such big contiguous swap slots easily, it would be good.
> > >>>> But I am not sure how often we can get such big slots. And maybe we have to
> > >>>> improve search method for getting such big empty cluster.
> > >>>
> > >>> As long as there are clusters available, we should try to find them. When
> > >>> free space is too fragmented to find any unused cluster, we can pick one
> > >>> that has very little data in it, so that we reduce the time it takes to
> > >>> GC that erase block in the drive. While we could theoretically do active
> > >>> garbage collection of swap data in the kernel, it won't get more efficient
> > >>> than the GC inside of the drive. If we do this, it unfortunately means that
> > >>> we can't just send a discard for the entire erase block.
> > >>
> > >>
> > >> Might need some compaction during idle time but WAP concern raises again. :(
> > > 
> > > Sorry for my ignorance, but what does WAP stand for?
> > 
> > 
> > I should have written more general term. I means write amplication but
> > WAF(Write Amplication Factor) is more popular. :(
> 
> D'oh. Thanks for the clarification. Note that the entire idea of increasing the
> swap cluster size to the erase block size is to *reduce* write amplification:
> 
> If we pick arbitrary swap clusters that are part of an erase block (or worse,
> span two partial erase blocks), sending a discard for one cluster does not
> allow the device to actually discard an entire erase block. Consider the best
> possible scenario where we have a 1MB cluster and 2MB erase blocks, all
> naturally aligned. After we have written the entire swap device once, all
> blocks are marked as used in the device, but some are available for reuse
> in the kernel. The swap code picks a cluster that is currently unused and 
> sends a discard to the device, then fills the cluster with new pages.
> After that, we pick another swap cluster elsewhere. The erase block now
> contains 50% new and 50% old data and has to be garbage collected, so the
> device writes 2MB of data  to anther erase block. So, in order to write 1MB,
> the device has written 3MB and the write amplification factor is 3. Using
> 8MB erase blocks, it would be 9.
> 
> If we do the active compaction and increase the cluster size to the erase
> block size, there is no write amplification inside of the device (and no
> stalls from the garbage collection, which are the other concern), and
> we only need to write a few blocks again that are still valid in a cluster
> at the time we want to reuse it. On an ideal device, the write amplification
> for active compaction should be exactly the same as what we get when we
> write a cluster while some of the data in it is still valid and we skip
> those pages, while some devices might now like having to gc themselves.
> Doing the compaction in software means we have to spend CPU cycles on it,
> but we get to choose when it happens and don't have to block on the device
> during GC.

Thanks for the detailed explanation.
At least, we need active compaction to avoid GC completely when we can't find
an empty cluster and there are lots of holes.
The indirection layer we discussed at the last LSF/MM could make slot changes
by compaction easy.
I think the way we find empty clusters should be changed, because the current
linear scan is not appropriate for bigger cluster sizes.

I am looking forward to your work!

P.S) I'm afraid this work might restart the endless war over what the host can
do well vs. what the device can do well. If we can work this out, we don't need
a costly eMMC FTL, just dumb bare NAND, a controller and simple firmware.

> 
> 	Arnd

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: swap on eMMC and other flash
  2012-04-04 12:47     ` Arnd Bergmann
@ 2012-04-11 10:28       ` Adrian Hunter
  2012-07-16 13:29         ` Pavel Machek
  0 siblings, 1 reply; 41+ messages in thread
From: Adrian Hunter @ 2012-04-11 10:28 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Adrian Hunter, linaro-kernel, linux-mm, Luca Porzio (lporzio),
	Alex Lemberg, linux-kernel, Saugata Das, Venkatraman S,
	Yejin Moon, Hyojin Jeong, linux-mmc, kernel-team

On 04/04/12 15:47, Arnd Bergmann wrote:
> On Wednesday 04 April 2012, Adrian Hunter wrote:
>> On 30/03/12 21:50, Arnd Bergmann wrote:
>>> (sorry for the duplicated email, this corrects the address of the android
>>> kernel team, please reply here)
>>>
>>> On Friday 30 March 2012, Arnd Bergmann wrote:
>>>
>>>  We've had a discussion in the Linaro storage team (Saugata, Venkat and me,
>>>  with Luca joining in on the discussion) about swapping to flash based media
>>>  such as eMMC. This is a summary of what we found and what we think should
>>>  be done. If people agree that this is a good idea, we can start working
>>>  on it.
>>
>> There is mtdswap.
> 
> Ah, very interesting. I wasn't aware of that. Obviously we can't directly
> use it on block devices that have their own garbage collection and wear
> leveling built into them, but it's interesting to see how this was solved
> before.
> 
> While we could build something similar that remaps blocks between an
> eMMC device and the logical swap space that is used by the mm code,
> my feeling is that it would be easier to modify the swap code itself
> to do the right thing.
> 
>> Also the old Nokia N900 had swap to eMMC.
>>
>> The last I heard was that swap was considered to be simply too slow on hand
>> held devices.
> 
> That's the part that we want to solve here. It has nothing to do with
> handheld devices, but more with specific incompatibilities of the
> block allocation in the swap code vs. what an eMMC device expects
> to see for fast operation. If you write data in the wrong order on
> flash devices, you get long delays that you don't get when you do
> it the right way. The same problem exists for file systems, and is
> being addressed there as well.
> 
>> As systems adopt more RAM, isn't there a decreasing demand for swap?
> 
> No. You would never be able to make hibernate work, no matter how much
> RAM you add ;-)

Have you considered making hibernate work without swap?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: swap on eMMC and other flash
  2012-04-11  9:54             ` Minchan Kim
@ 2012-04-11 15:57               ` Arnd Bergmann
  2012-04-12  2:36                 ` Minchan Kim
  2012-04-16 18:22                 ` Stephan Uphoff
  0 siblings, 2 replies; 41+ messages in thread
From: Arnd Bergmann @ 2012-04-11 15:57 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linaro-kernel, android-kernel, linux-mm, Luca Porzio (lporzio),
	Alex Lemberg, linux-kernel, Saugata Das, Venkatraman S,
	Yejin Moon, Hyojin Jeong, linux-mmc

On Wednesday 11 April 2012, Minchan Kim wrote:
> On Tue, Apr 10, 2012 at 08:32:51AM +0000, Arnd Bergmann wrote:
> > > 
> > > I should have written more general term. I means write amplication but
> > > WAF(Write Amplication Factor) is more popular. :(
> > 
> > D'oh. Thanks for the clarification. Note that the entire idea of increasing the
> > swap cluster size to the erase block size is to *reduce* write amplification:
> > 
> > If we pick arbitrary swap clusters that are part of an erase block (or worse,
> > span two partial erase blocks), sending a discard for one cluster does not
> > allow the device to actually discard an entire erase block. Consider the best
> > possible scenario where we have a 1MB cluster and 2MB erase blocks, all
> > naturally aligned. After we have written the entire swap device once, all
> > blocks are marked as used in the device, but some are available for reuse
> > in the kernel. The swap code picks a cluster that is currently unused and 
> > sends a discard to the device, then fills the cluster with new pages.
> > After that, we pick another swap cluster elsewhere. The erase block now
> > contains 50% new and 50% old data and has to be garbage collected, so the
> > device writes 2MB of data  to anther erase block. So, in order to write 1MB,
> > the device has written 3MB and the write amplification factor is 3. Using
> > 8MB erase blocks, it would be 9.
> > 
> > If we do the active compaction and increase the cluster size to the erase
> > block size, there is no write amplification inside of the device (and no
> > stalls from the garbage collection, which are the other concern), and
> > we only need to write a few blocks again that are still valid in a cluster
> > at the time we want to reuse it. On an ideal device, the write amplification
> > for active compaction should be exactly the same as what we get when we
> > write a cluster while some of the data in it is still valid and we skip
> > those pages, while some devices might now like having to gc themselves.
> > Doing the compaction in software means we have to spend CPU cycles on it,
> > but we get to choose when it happens and don't have to block on the device
> > during GC.
> 
> Thanks for detail explanation.
> At least, we need active compaction to avoid GC completely when we can't find
> empty cluster and there are lots of hole.
> Indirection layer we discussed last LSF/MM could help slot change by
> compaction easily.
> I think way to find empty cluster should be changed because current linear scan
> is not proper for bigger cluster size.
> 
> I am looking forward to your works!
> 
> P.S) I'm afraid this work might raise endless war, again which host can do well VS
> device can do well. If we can work out, we don't need costly eMMC FTL, just need
> dumb bare nand, controller and simple firmware.

IMHO, we should only distinguish between dumb and smart devices, defined as follows:

1. smart devices behave like all but the extremely cheap SSDs. They are optimized
for 4KB random I/O, and the erase block size is not visible because there is
a write cache and a flexible controller between the block device abstraction
and the raw flash.

2. dumb devices have very visible effects that stem from a simplistic remapping
layer that translates logical erase block numbers into physical erase blocks,
and only a fixed number of those can be written at the same time before forcing
GC. Writes smaller than page size are strongly discouraged here. There is no 
RAM to cache writes in the controller, but we still expect these devices to
have a reasonable wear levelling policy.  This covers almost all of today's
eMMC, SD, USB and CF as well as some cheap ATA SSD.

A third category is of course spinning rust, but I think with the distinction
for solid state media above, we have a pretty good grip on all existing
media. As eMMC and UFS evolve over time, we might want to stick them into the
first category, but I don't think we need more categories.

	Arnd


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: swap on eMMC and other flash
  2012-04-11 15:57               ` Arnd Bergmann
@ 2012-04-12  2:36                 ` Minchan Kim
  2012-04-16 18:22                 ` Stephan Uphoff
  1 sibling, 0 replies; 41+ messages in thread
From: Minchan Kim @ 2012-04-12  2:36 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linaro-kernel, android-kernel, linux-mm, Luca Porzio (lporzio),
	Alex Lemberg, linux-kernel, Saugata Das, Venkatraman S,
	Yejin Moon, Hyojin Jeong, linux-mmc

On 04/12/2012 12:57 AM, Arnd Bergmann wrote:
> On Wednesday 11 April 2012, Minchan Kim wrote:
>> On Tue, Apr 10, 2012 at 08:32:51AM +0000, Arnd Bergmann wrote:
>>>>
>>>> I should have written more general term. I means write amplication but
>>>> WAF(Write Amplication Factor) is more popular. :(
>>>
>>> D'oh. Thanks for the clarification. Note that the entire idea of increasing the
>>> swap cluster size to the erase block size is to *reduce* write amplification:
>>>
>>> If we pick arbitrary swap clusters that are part of an erase block (or worse,
>>> span two partial erase blocks), sending a discard for one cluster does not
>>> allow the device to actually discard an entire erase block. Consider the best
>>> possible scenario where we have a 1MB cluster and 2MB erase blocks, all
>>> naturally aligned. After we have written the entire swap device once, all
>>> blocks are marked as used in the device, but some are available for reuse
>>> in the kernel. The swap code picks a cluster that is currently unused and
>>> sends a discard to the device, then fills the cluster with new pages.
>>> After that, we pick another swap cluster elsewhere. The erase block now
>>> contains 50% new and 50% old data and has to be garbage collected, so the
>>> device writes 2MB of data  to anther erase block. So, in order to write 1MB,
>>> the device has written 3MB and the write amplification factor is 3. Using
>>> 8MB erase blocks, it would be 9.
>>>
>>> If we do the active compaction and increase the cluster size to the erase
>>> block size, there is no write amplification inside of the device (and no
>>> stalls from the garbage collection, which are the other concern), and
>>> we only need to write a few blocks again that are still valid in a cluster
>>> at the time we want to reuse it. On an ideal device, the write amplification
>>> for active compaction should be exactly the same as what we get when we
>>> write a cluster while some of the data in it is still valid and we skip
>>> those pages, while some devices might now like having to gc themselves.
>>> Doing the compaction in software means we have to spend CPU cycles on it,
>>> but we get to choose when it happens and don't have to block on the device
>>> during GC.
>>
>> Thanks for the detailed explanation.
>> At the least, we need active compaction to avoid GC completely when we
>> can't find an empty cluster and there are lots of holes.
>> The indirection layer we discussed at the last LSF/MM could make slot
>> changes during compaction easy.
>> I think the way we find an empty cluster should also change, because the
>> current linear scan is not appropriate for a bigger cluster size.
>>
>> I am looking forward to your works!
>>
>> P.S) I'm afraid this work might reignite the endless war over what the host
>> can do well vs. what the device can do well. If we can make this work, we
>> won't need a costly eMMC FTL, just dumb bare NAND, a controller and simple
>> firmware.
>
> IMHO, we should only distinguish between dumb and smart devices, defined as follows:
>
> 1. smart devices behave like all but the extremely cheap SSDs. They are optimized
> for 4KB random I/O, and the erase block size is not visible because there is
> a write cache and a flexible controller between the block device abstraction
> and the raw flash.
>
> 2. dumb devices have very visible effects that stem from a simplistic remapping
> layer that translates logical erase block numbers into physical erase blocks,
> and only a fixed number of those can be written at the same time before forcing
> GC. Writes smaller than page size are strongly discouraged here. There is no
> RAM to cache writes in the controller, but we still expect these devices to
> have a reasonable wear levelling policy.  This covers almost all of today's
> eMMC, SD, USB and CF as well as some cheap ATA SSD.

Such dumb devices have a disadvantage: some users expect the device to
manage itself and some don't. So someone like you will add smart features
on the host to avoid GC, while someone else still believes that the eMMC
will do well enough by itself that he can use any filesystem on it.

Conflict happens.

Even if we solve the problems of using eMMC as swap, other partitions
could hold filesystems that are not aware of the eMMC's characteristics.
Those could still trigger GC inside the eMMC, so even with swap itself
handled well, long swap latencies could still happen.
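The write-amplification arithmetic in Arnd's quoted explanation above can be sketched with a toy model (the function and numbers below are illustrative only, not from any kernel code):

```python
def waf(cluster_mb, erase_block_mb):
    """Toy write-amplification model for the quoted example: the kernel
    discards and rewrites one swap cluster; when the cluster is smaller
    than the erase block, the device must garbage-collect the whole
    erase block, copying its contents to another erase block."""
    if cluster_mb >= erase_block_mb:
        return 1.0  # a discard frees whole erase blocks; no device-side GC
    # device writes: the new cluster, plus one full erase-block GC copy
    return (cluster_mb + erase_block_mb) / cluster_mb

print(waf(1, 2))  # 1MB clusters, 2MB erase blocks -> 3.0
print(waf(1, 8))  # 8MB erase blocks -> 9.0
print(waf(2, 2))  # cluster == erase block -> 1.0, the goal of the proposal
```

This is the best-case arithmetic from the thread; real devices with more mixed data in an erase block can do worse.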

>
> A third category is of course spinning rust, but I think with the distinction
> for solid state media above, we have a pretty good grip on all existing
> media. As eMMC and UFS evolve over time, we might want to stick them into the
> first category, but I don't think we need more categories.
>
> 	Arnd
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-mmc" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: swap on eMMC and other flash
  2012-04-10  8:40                       ` Arnd Bergmann
@ 2012-04-12  8:32                         ` Luca Porzio (lporzio)
  0 siblings, 0 replies; 41+ messages in thread
From: Luca Porzio (lporzio) @ 2012-04-12  8:32 UTC (permalink / raw)
  To: Arnd Bergmann, Minchan Kim
  Cc: 정효진, 'Alex Lemberg',
	linaro-kernel, 'Rik van Riel',
	linux-mmc, linux-kernel, linux-mm, kernel-team,
	'Yejin Moon', 'Hugh Dickins',
	'Yaniv Iarovici',
	cpgs

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 2143 bytes --]

Hi All,

> -----Original Message-----
> From: linux-mmc-owner@vger.kernel.org [mailto:linux-mmc-owner@vger.kernel.org]
> On Behalf Of Arnd Bergmann
> Sent: Tuesday, April 10, 2012 1:40 AM
> To: Minchan Kim
> Cc: 정효진; 'Alex Lemberg'; linaro-kernel@lists.linaro.org; 'Rik van Riel';
> linux-mmc@vger.kernel.org; linux-kernel@vger.kernel.org; Luca Porzio
> (lporzio); linux-mm@kvack.org; kernel-team@android.com; 'Yejin Moon'; 'Hugh
> Dickins'; 'Yaniv Iarovici'; cpgs@samsung.com
> Subject: Re: swap on eMMC and other flash
> 
> On Tuesday 10 April 2012, Minchan Kim wrote:
> > I think it's not a good approach.
> > How long does it take to learn such parameters?
> > I guess it's not short, so mkfs/mkswap would become dramatically
> > slower. If needed, let's maintain it as a separate tool.
> 
> I haven't come up with a way that is both fast and reliable.
> A very fast method is to time short read requests across potential
> erase block boundaries and see which ones are faster than others,
> this works on about 3 out of 4 devices.
> 
> For the other devices, I currently use a fairly manual process that
> times a lot of write requests and can take a long time.
> 
> > If storage vendors break such fields, their devices won't work well on
> > Linux, which is very popular in the mobile world today; users will avoid
> > those devices and the company will suffer. Let's apply that pressure and
> > make vendors keep their promises.
> 
> This could work for eMMC, yes.
> 

I like it ;)

> The SD card standard makes it impossible to write the correct value for
> most devices, it only supports power-of-two values up to 4MB for SDHC,
> and larger values (I believe 8, 12, 16, 24, ... 64) for SDXC, but a lot
> of SDHC cards nowadays use 1.5, 3, 6 or 8 MB erase blocks.
> 
> 	Arnd
> --
> To unsubscribe from this list: send the line "unsubscribe linux-mmc" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: swap on eMMC and other flash
  2012-04-11 15:57               ` Arnd Bergmann
  2012-04-12  2:36                 ` Minchan Kim
@ 2012-04-16 18:22                 ` Stephan Uphoff
  2012-04-16 18:59                   ` Arnd Bergmann
  2012-04-27  7:34                   ` Luca Porzio (lporzio)
  1 sibling, 2 replies; 41+ messages in thread
From: Stephan Uphoff @ 2012-04-16 18:22 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Minchan Kim, linaro-kernel, android-kernel, linux-mm,
	Luca Porzio (lporzio),
	Alex Lemberg, linux-kernel, Saugata Das, Venkatraman S,
	Yejin Moon, Hyojin Jeong, linux-mmc

I really like where this is going and would like to use the
opportunity to plant a few ideas.

In contrast to rotational disks, read and write operation overheads and
costs are not symmetric.
While random reads are much faster on flash, the number of write
operations is limited by wear-out and garbage collection overhead.
To further improve swapping on eMMC or similar flash media I believe
that the following issues need to be addressed:

1) Limit average write bandwidth to eMMC to a configurable level to
guarantee a minimum device lifetime
2) Aim for a low write amplification factor to maximize usable write bandwidth
3) Strongly favor read over write operations

Lowering write amplification (2) has been discussed in this email
thread - and the only observation I would like to add is that
over-provisioning the internal swap space compared to the exported
swap space significantly can guarantee a lower write amplification
factor with the indirection and GC techniques discussed.

I believe the swap functionality is currently optimized for storage
media where read and write costs are nearly identical.
As this is not the case on flash I propose splitting the anonymous
inactive queue (at least conceptually) - keeping clean anonymous pages
with swap slots on a separate queue as the cost of swapping them
out/in is only an inexpensive read operation. A variable similar to
swappiness (or a more dynamic algorithm) could determine the
preference for swapping out clean pages or dirty pages. ( A similar
argument could be made for splitting up the file inactive queue )

The problem of limiting the average write bandwidth reminds me of
enforcing cpu utilization limits on interactive workloads.
Just as with cpu workloads - using the resources to the limit produces
poor interactivity.
When interactivity suffers too much I believe the only sane response
for an interactive device is to limit usage of the swap device and
transition into a low memory situation - and if needed - either
allowing userspace to reduce memory usage or invoking the OOM killer.
As a result low memory situations could not only be encountered on new
memory allocations but also on workload changes that increase the
number of dirty pages.

A wild idea to avoid some writes altogether is to see if
de-duplication techniques can be used to (partially?) match pages
previously written to swap.
In case of unencrypted swap  (or encrypted swap with a static key)
swap pages on eMMC could even be re-used across multiple reboots.
A simple version would just compare dirty pages with data in their
swap slots as I suspect (but really don't know) that some user space
algorithms (garbage collection?) dirty a page just temporarily -
eventually reverting it to the previous content.

Stephan

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: swap on eMMC and other flash
  2012-04-16 18:22                 ` Stephan Uphoff
@ 2012-04-16 18:59                   ` Arnd Bergmann
  2012-04-16 21:12                     ` Stephan Uphoff
  2012-04-17  2:05                     ` Minchan Kim
  2012-04-27  7:34                   ` Luca Porzio (lporzio)
  1 sibling, 2 replies; 41+ messages in thread
From: Arnd Bergmann @ 2012-04-16 18:59 UTC (permalink / raw)
  To: Stephan Uphoff
  Cc: Minchan Kim, linaro-kernel, android-kernel, linux-mm,
	Luca Porzio (lporzio),
	Alex Lemberg, linux-kernel, Saugata Das, Venkatraman S,
	Yejin Moon, Hyojin Jeong, linux-mmc

On Monday 16 April 2012, Stephan Uphoff wrote:
> opportunity to plant a few ideas.
> 
> In contrast to rotational disks read/write operation overhead and
> costs are not symmetric.
> While random reads are much faster on flash - the number of write
> operations is limited by wearout and garbage collection overhead.
> To further improve swapping on eMMC or similar flash media I believe
> that the following issues need to be addressed:
> 
> 1) Limit average write bandwidth to eMMC to a configurable level to
> guarantee a minimum device lifetime
> 2) Aim for a low write amplification factor to maximize useable write bandwidth
> 3) Strongly favor read over write operations
> 
> Lowering write amplification (2) has been discussed in this email
> thread - and the only observation I would like to add is that
> over-provisioning the internal swap space compared to the exported
> swap space significantly can guarantee a lower write amplification
> factor with the indirection and GC techniques discussed.

Yes, good point.

> I believe the swap functionality is currently optimized for storage
> media where read and write costs are nearly identical.
> As this is not the case on flash I propose splitting the anonymous
> inactive queue (at least conceptually) - keeping clean anonymous pages
> with swap slots on a separate queue as the cost of swapping them
> out/in is only an inexpensive read operation. A variable similar to
> swappiness (or a more dynamic algorithm) could determine the
> preference for swapping out clean pages or dirty pages. ( A similar
> argument could be made for splitting up the file inactive queue )

I'm not sure I understand yet how this would be different from swappiness.

> The problem of limiting the average write bandwidth reminds me of
> enforcing cpu utilization limits on interactive workloads.
> Just as with cpu workloads - using the resources to the limit produces
> poor interactivity.
> When interactivity suffers too much I believe the only sane response
> for an interactive device is to limit usage of the swap device and
> transition into a low memory situation - and if needed - either
> allowing userspace to reduce memory usage or invoking the OOM killer.
> As a result low memory situations could not only be encountered on new
> memory allocations but also on workload changes that increase the
> number of dirty pages.

While swap is just a special case for anonymous memory in writeback
rather than file backed pages, I think what you want here is a tuning
knob that decides whether we should discard a clean page or write back
a dirty page under memory pressure. I have to say that I don't know
whether we already have such a knob or whether we already treat them
differently, but it is certainly a valid observation that on hard
drives, discarding a clean page that is likely going to be needed
again has about the same overhead as writing back a dirty page
(i.e. one seek operation), while on flash the former would be much
cheaper than the latter.

> A wild idea to avoid some writes altogether is to see if
> de-duplication techniques can be used to (partially?) match pages
> previously written to swap.

Interesting! We already have KSM (kernel samepage merging) to do
the same thing in memory, but I don't know how that works
during swapout. It might already be there, waiting to get switched
on, or might not be possible until we implement an extra remapping
layer in swap as has been proposed. It's certainly worth remembering
this as we work on the design for that remapping layer.

> In case of unencrypted swap  (or encrypted swap with a static key)
> swap pages on eMMC could even be re-used across multiple reboots.
> A simple version would just compare dirty pages with data in their
> swap slots as I suspect (but really don't know) that some user space
> algorithms (garbage collection?) dirty a page just temporarily -
> eventually reverting it to the previous content.

I think that would incur overhead for indexing the pages in swap space
in a persistent way, something that by itself would contribute to
write amplification because for every swapout, we would have to write
both the page and the index (eventually), and that index would likely
be a random write.

Thanks for your thoughts!

	Arnd

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: swap on eMMC and other flash
  2012-04-16 18:59                   ` Arnd Bergmann
@ 2012-04-16 21:12                     ` Stephan Uphoff
  2012-04-17  2:18                       ` Minchan Kim
  2012-04-17  2:05                     ` Minchan Kim
  1 sibling, 1 reply; 41+ messages in thread
From: Stephan Uphoff @ 2012-04-16 21:12 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Minchan Kim, linaro-kernel, android-kernel, linux-mm,
	Luca Porzio (lporzio),
	Alex Lemberg, linux-kernel, Saugata Das, Venkatraman S,
	Yejin Moon, Hyojin Jeong, linux-mmc

Hi Arnd,

On Mon, Apr 16, 2012 at 12:59 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Monday 16 April 2012, Stephan Uphoff wrote:
>> opportunity to plant a few ideas.
>>
>> In contrast to rotational disks read/write operation overhead and
>> costs are not symmetric.
>> While random reads are much faster on flash - the number of write
>> operations is limited by wearout and garbage collection overhead.
>> To further improve swapping on eMMC or similar flash media I believe
>> that the following issues need to be addressed:
>>
>> 1) Limit average write bandwidth to eMMC to a configurable level to
>> guarantee a minimum device lifetime
>> 2) Aim for a low write amplification factor to maximize useable write bandwidth
>> 3) Strongly favor read over write operations
>>
>> Lowering write amplification (2) has been discussed in this email
>> thread - and the only observation I would like to add is that
>> over-provisioning the internal swap space compared to the exported
>> swap space significantly can guarantee a lower write amplification
>> factor with the indirection and GC techniques discussed.
>
> Yes, good point.
>
>> I believe the swap functionality is currently optimized for storage
>> media where read and write costs are nearly identical.
>> As this is not the case on flash I propose splitting the anonymous
>> inactive queue (at least conceptually) - keeping clean anonymous pages
>> with swap slots on a separate queue as the cost of swapping them
>> out/in is only an inexpensive read operation. A variable similar to
>> swappiness (or a more dynamic algorithm) could determine the
>> preference for swapping out clean pages or dirty pages. ( A similar
>> argument could be made for splitting up the file inactive queue )
>
> I'm not sure I understand yet how this would be different from swappiness.

As I see it, swappiness determines the ratio of paging out file-backed
pages compared to anonymous, swap-backed pages.
I would like to further be able to set the ratio for throwing away
clean anonymous pages with swap slots ( that are easy to read back in)
as compared to writing out dirty anonymous pages to swap.
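A sketch of what such a knob could look like in a reclaim loop (purely illustrative Python; none of these names exist in the kernel, and the real LRU machinery is far more involved):

```python
def pick_victims(clean_anon, dirty_anon, clean_preference=4):
    """Prefer discarding clean anonymous pages that already have a swap
    slot (cheap: only a read to bring them back) over writing out dirty
    ones (costs a flash write).  clean_preference plays the role of the
    swappiness-like ratio discussed above: reclaim roughly that many
    clean pages per dirty one while both queues are non-empty."""
    victims = []
    while clean_anon or dirty_anon:
        for _ in range(max(1, clean_preference)):
            if not clean_anon:
                break
            victims.append((clean_anon.pop(0), 'discard'))    # no flash write
        if dirty_anon:
            victims.append((dirty_anon.pop(0), 'writeback'))  # flash write
    return victims

# With a 2:1 preference, the dirty page is written back only after
# two clean pages have been dropped first.
print(pick_victims(['a', 'b', 'c'], ['x'], clean_preference=2))
```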

>
>> The problem of limiting the average write bandwidth reminds me of
>> enforcing cpu utilization limits on interactive workloads.
>> Just as with cpu workloads - using the resources to the limit produces
>> poor interactivity.
>> When interactivity suffers too much I believe the only sane response
>> for an interactive device is to limit usage of the swap device and
>> transition into a low memory situation - and if needed - either
>> allowing userspace to reduce memory usage or invoking the OOM killer.
>> As a result low memory situations could not only be encountered on new
>> memory allocations but also on workload changes that increase the
>> number of dirty pages.
>
> While swap is just a special case for anonymous memory in writeback
> rather than file backed pages, I think what you want here is a tuning
> knob that decides whether we should discard a clean page or write back
> a dirty page under memory pressure. I have to say that I don't know
> whether we already have such a knob or whether we already treat them
> differently, but it is certainly a valid observation that on hard
> drives, discarding a clean page that is likely going to be needed
> again has about the same overhead as writing back a dirty page
> (i.e. one seek operation), while on flash the former would be much
> cheaper than the latter.

Exactly - as far as I see there is no such knob.
I mentioned splitting the anonymous inactive queue (into clean and
dirty) as I believe it would make it easier to implement such a knob
while preserving as much LRU information as possible.

>
>> A wild idea to avoid some writes altogether is to see if
>> de-duplication techniques can be used to (partially?) match pages
>> previously written to swap.
>
> Interesting! We already have KSM (kernel samepage merging) to do
> the same thing in memory, but I don't know how that works
> during swapout. It might already be there, waiting to get switched
> on, or might not be possible until we implement an extra remapping
> layer in swap as has been proposed. It's certainly worth remembering
> this as we work on the design for that remapping layer.
>
>> In case of unencrypted swap  (or encrypted swap with a static key)
>> swap pages on eMMC could even be re-used across multiple reboots.
>> A simple version would just compare dirty pages with data in their
>> swap slots as I suspect (but really don't know) that some user space
>> algorithms (garbage collection?) dirty a page just temporarily -
>> eventually reverting it to the previous content.
>
> I think that would incur overhead for indexing the pages in swap space
> in a persistent way, something that by itself would contribute to
> write amplification because for every swapout, we would have to write
> both the page and the index (eventually), and that index would likely
> be a random write.

I agree - the overhead may be too big.
Still, unless it is too energy intensive, I could see a case for an idle
task that matches up anonymous pages to pre-existing swap data some time
after reboot (and before memory is tight).
Unless memory layout is randomized, I expect many anonymous pages to
end up with the same data boot after boot.
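The "compare before write" variant of this idea can be sketched as follows (illustrative only; a real implementation would need the remapping layer discussed earlier in the thread, and the byte comparison stands in for whatever indexing or hashing it would actually use):

```python
def swap_out(page_data: bytes, swap_slot_data: bytes) -> str:
    """If a dirty page has reverted to the content already stored in
    its swap slot, skip the flash write entirely and just drop the
    page; otherwise pay for a writeback."""
    if page_data == swap_slot_data:
        return 'skip'   # avoided a write: page still matches its slot
    return 'write'      # content changed: must write to flash

print(swap_out(b'aaaa', b'aaaa'))  # page reverted -> skip
print(swap_out(b'aaab', b'aaaa'))  # page changed -> write
```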

>
> Thanks for your thoughts!
>
>        Arnd

Thanks for working on this

Stephan

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: swap on eMMC and other flash
  2012-04-16 18:59                   ` Arnd Bergmann
  2012-04-16 21:12                     ` Stephan Uphoff
@ 2012-04-17  2:05                     ` Minchan Kim
  1 sibling, 0 replies; 41+ messages in thread
From: Minchan Kim @ 2012-04-17  2:05 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Stephan Uphoff, linaro-kernel, android-kernel, linux-mm,
	Luca Porzio (lporzio),
	Alex Lemberg, linux-kernel, Saugata Das, Venkatraman S,
	Yejin Moon, Hyojin Jeong, linux-mmc

Hi Arnd,

On 04/17/2012 03:59 AM, Arnd Bergmann wrote:
> On Monday 16 April 2012, Stephan Uphoff wrote:
>> opportunity to plant a few ideas.
>>
>> In contrast to rotational disks read/write operation overhead and
>> costs are not symmetric.
>> While random reads are much faster on flash - the number of write
>> operations is limited by wearout and garbage collection overhead.
>> To further improve swapping on eMMC or similar flash media I believe
>> that the following issues need to be addressed:
>>
>> 1) Limit average write bandwidth to eMMC to a configurable level to
>> guarantee a minimum device lifetime
>> 2) Aim for a low write amplification factor to maximize useable write bandwidth
>> 3) Strongly favor read over write operations
>>
>> Lowering write amplification (2) has been discussed in this email
>> thread - and the only observation I would like to add is that
>> over-provisioning the internal swap space compared to the exported
>> swap space significantly can guarantee a lower write amplification
>> factor with the indirection and GC techniques discussed.
>
> Yes, good point.
>
>> I believe the swap functionality is currently optimized for storage
>> media where read and write costs are nearly identical.
>> As this is not the case on flash I propose splitting the anonymous
>> inactive queue (at least conceptually) - keeping clean anonymous pages
>> with swap slots on a separate queue as the cost of swapping them
>> out/in is only an inexpensive read operation. A variable similar to
>> swappiness (or a more dynamic algorithm) could determine the
>> preference for swapping out clean pages or dirty pages. ( A similar
>> argument could be made for splitting up the file inactive queue )
>
> I'm not sure I understand yet how this would be different from swappiness.
>
>> The problem of limiting the average write bandwidth reminds me of
>> enforcing cpu utilization limits on interactive workloads.
>> Just as with cpu workloads - using the resources to the limit produces
>> poor interactivity.
>> When interactivity suffers too much I believe the only sane response
>> for an interactive device is to limit usage of the swap device and
>> transition into a low memory situation - and if needed - either
>> allowing userspace to reduce memory usage or invoking the OOM killer.
>> As a result low memory situations could not only be encountered on new
>> memory allocations but also on workload changes that increase the
>> number of dirty pages.
>
> While swap is just a special case for anonymous memory in writeback
> rather than file backed pages, I think what you want here is a tuning
> knob that decides whether we should discard a clean page or write back
> a dirty page under memory pressure. I have to say that I don't know
> whether we already have such a knob or whether we already treat them
> differently, but it is certainly a valid observation that on hard
> drives, discarding a clean page that is likely going to be needed
> again has about the same overhead as writing back a dirty page
> (i.e. one seek operation), while on flash the former would be much
> cheaper than the latter.

It seems to make sense considering the read/write asymmetry of flash, and
there is a CFLRU (Clean First LRU) paper [1] about it. You might already
know it; if not, I hope it helps.

[1] 
http://staff.ustc.edu.cn/~jpq/paper/flash/2006-CASES-CFLRU-%20A%20Replacement%20Algorithm%20for%20Flash%20Memory.pdf


-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: swap on eMMC and other flash
  2012-04-16 21:12                     ` Stephan Uphoff
@ 2012-04-17  2:18                       ` Minchan Kim
  0 siblings, 0 replies; 41+ messages in thread
From: Minchan Kim @ 2012-04-17  2:18 UTC (permalink / raw)
  To: Stephan Uphoff
  Cc: Arnd Bergmann, linaro-kernel, android-kernel, linux-mm,
	Luca Porzio (lporzio),
	Alex Lemberg, linux-kernel, Saugata Das, Venkatraman S,
	Yejin Moon, Hyojin Jeong, linux-mmc

On 04/17/2012 06:12 AM, Stephan Uphoff wrote:
> Hi Arnd,
>
> On Mon, Apr 16, 2012 at 12:59 PM, Arnd Bergmann<arnd@arndb.de>  wrote:
>> On Monday 16 April 2012, Stephan Uphoff wrote:
>>> opportunity to plant a few ideas.
>>>
>>> In contrast to rotational disks read/write operation overhead and
>>> costs are not symmetric.
>>> While random reads are much faster on flash - the number of write
>>> operations is limited by wearout and garbage collection overhead.
>>> To further improve swapping on eMMC or similar flash media I believe
>>> that the following issues need to be addressed:
>>>
>>> 1) Limit average write bandwidth to eMMC to a configurable level to
>>> guarantee a minimum device lifetime
>>> 2) Aim for a low write amplification factor to maximize useable write bandwidth
>>> 3) Strongly favor read over write operations
>>>
>>> Lowering write amplification (2) has been discussed in this email
>>> thread - and the only observation I would like to add is that
>>> over-provisioning the internal swap space compared to the exported
>>> swap space significantly can guarantee a lower write amplification
>>> factor with the indirection and GC techniques discussed.
>>
>> Yes, good point.
>>
>>> I believe the swap functionality is currently optimized for storage
>>> media where read and write costs are nearly identical.
>>> As this is not the case on flash I propose splitting the anonymous
>>> inactive queue (at least conceptually) - keeping clean anonymous pages
>>> with swap slots on a separate queue as the cost of swapping them
>>> out/in is only an inexpensive read operation. A variable similar to
>>> swapiness (or a more dynamic algorithmn) could determine the
>>> preference for swapping out clean pages or dirty pages. ( A similar
>>> argument could be made for splitting up the file inactive queue )
>>
>> I'm not sure I understand yet how this would be different from swappiness.
>
> As I see it swappiness determines the ratio for paging out file backed
> as compared to anonymous, swap backed pages.
> I would like to further be able to set the ratio for throwing away
> clean anonymous pages with swap slots ( that are easy to read back in)
> as compared to writing out dirty anonymous pages to swap.

We can apply the rule to the file LRU list too, and we already have the
ISOLATE_CLEAN mode to select victim pages in the LRU list, so it should work.

Selecting clean anonymous pages with a swap slot needs more investigation.
Recently, Dan had a question about it and Hugh answered it:
http://marc.info/?l=linux-mm&m=133462346928786&w=2

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: swap on eMMC and other flash
  2012-04-16 18:22                 ` Stephan Uphoff
  2012-04-16 18:59                   ` Arnd Bergmann
@ 2012-04-27  7:34                   ` Luca Porzio (lporzio)
  1 sibling, 0 replies; 41+ messages in thread
From: Luca Porzio (lporzio) @ 2012-04-27  7:34 UTC (permalink / raw)
  To: Stephan Uphoff, Arnd Bergmann
  Cc: Minchan Kim, linaro-kernel, android-kernel, linux-mm,
	Alex Lemberg, linux-kernel, Saugata Das, Venkatraman S,
	Yejin Moon, Hyojin Jeong, linux-mmc

Stephan,

Good ideas. Some comments of mine below.

> -----Original Message-----
> From: linux-mmc-owner@vger.kernel.org [mailto:linux-mmc-owner@vger.kernel.org]
> On Behalf Of Stephan Uphoff
> Sent: Tuesday, April 17, 2012 3:22 AM
> To: Arnd Bergmann
> Cc: Minchan Kim; linaro-kernel@lists.linaro.org; android-
> kernel@googlegroups.com; linux-mm@kvack.org; Luca Porzio (lporzio); Alex
> Lemberg; linux-kernel@vger.kernel.org; Saugata Das; Venkatraman S; Yejin Moon;
> Hyojin Jeong; linux-mmc@vger.kernel.org
> Subject: Re: swap on eMMC and other flash
> 
> I really like where this is going and would like to use the
> opportunity to plant a few ideas.
> 
> In contrast to rotational disks read/write operation overhead and
> costs are not symmetric.
> While random reads are much faster on flash - the number of write
> operations is limited by wearout and garbage collection overhead.
> To further improve swapping on eMMC or similar flash media I believe
> that the following issues need to be addressed:
> 
> 1) Limit average write bandwidth to eMMC to a configurable level to
> guarantee a minimum device lifetime
> 2) Aim for a low write amplification factor to maximize useable write
> bandwidth
> 3) Strongly favor read over write operations
> 
> Lowering write amplification (2) has been discussed in this email
> thread - and the only observation I would like to add is that
> over-provisioning the internal swap space compared to the exported
> swap space significantly can guarantee a lower write amplification
> factor with the indirection and GC techniques discussed.
> 
> I believe the swap functionality is currently optimized for storage
> media where read and write costs are nearly identical.
> As this is not the case on flash I propose splitting the anonymous
> inactive queue (at least conceptually) - keeping clean anonymous pages
> with swap slots on a separate queue as the cost of swapping them
> out/in is only an inexpensive read operation. A variable similar to
> swappiness (or a more dynamic algorithm) could determine the
> preference for swapping out clean pages or dirty pages. ( A similar
> argument could be made for splitting up the file inactive queue )
> 

I totally agree. Reads are inexpensive on flash based devices, and as such a good swap algorithm (as well as a flash-oriented FS) should take this into account.

> The problem of limiting the average write bandwidth reminds me of
> enforcing cpu utilization limits on interactive workloads.
> Just as with cpu workloads - using the resources to the limit produces
> poor interactivity.

I don't quite get your definition of interactive workload, and I am not sure which technique for limiting resource utilization you have in mind.
CGroups, for example, have not proven very reliable over time.
Also, in my experience it has always been very difficult to correlate resource utilization stats with user interactivity.
The only technique that has proven reliable over time is to do work while the system is idle, which is, to my understanding, already done.

> When interactivity suffers too much I believe the only sane response
> for an interactive device is to limit usage of the swap device and
> transition into a low memory situation - and if needed - either
> allowing userspace to reduce memory usage or invoking the OOM killer.
> As a result low memory situations could not only be encountered on new
> memory allocations but also on workload changes that increase the
> number of dirty pages.
> 

I agree with your comments about the OOM killer (what is the point of swapping out a page if that process is going to be killed soon? That only increases the WAF on MMCs). In fact, one proposal here could be to somehow mix the OOM index with page age.
I would suggest first optimizing swap traffic for an MMC device and then starting to think about this.

> A wild idea to avoid some writes altogether is to see if
> de-duplication techniques can be used to (partially?) match pages
> previously written to swap.

If you have such a situation, I think this is where KSM may help. It is my personal belief that, with a bit of work, the KSM algorithm can be extended to swapped-out pages with little effort (at the expense of a small increase in read traffic, which is fine for flash based storage devices).

> In case of unencrypted swap  (or encrypted swap with a static key)
> swap pages on eMMC could even be re-used across multiple reboots.
> A simple version would just compare dirty pages with data in their
> swap slots as I suspect (but really don't know) that some user space
> algorithms (garbage collection?) dirty a page just temporarily -
> eventually reverting it to the previous content.
> 

This conflicts with discarding or trimming a page, so the advantages of this technique need to be weighed against the performance gain of using the discard command.

> Stephan

Cheers,
    Luca


* Re: swap on eMMC and other flash
  2012-04-11 10:28       ` Adrian Hunter
@ 2012-07-16 13:29         ` Pavel Machek
  0 siblings, 0 replies; 41+ messages in thread
From: Pavel Machek @ 2012-07-16 13:29 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Arnd Bergmann, linaro-kernel, linux-mm, Luca Porzio (lporzio),
	Alex Lemberg, linux-kernel, Saugata Das, Venkatraman S,
	Yejin Moon, Hyojin Jeong, linux-mmc, kernel-team,
	Rafael J. Wysocki

On Wed 2012-04-11 13:28:39, Adrian Hunter wrote:
> On 04/04/12 15:47, Arnd Bergmann wrote:
> > On Wednesday 04 April 2012, Adrian Hunter wrote:
> >> On 30/03/12 21:50, Arnd Bergmann wrote:
> >>> (sorry for the duplicated email, this corrects the address of the android
> >>> kernel team, please reply here)
> >>>
> >>> On Friday 30 March 2012, Arnd Bergmann wrote:
> >>>
> >>>  We've had a discussion in the Linaro storage team (Saugata, Venkat and me,
> >>>  with Luca joining in on the discussion) about swapping to flash based media
> >>>  such as eMMC. This is a summary of what we found and what we think should
> >>>  be done. If people agree that this is a good idea, we can start working
> >>>  on it.
> >>
> >> There is mtdswap.
> > 
> > Ah, very interesting. I wasn't aware of that. Obviously we can't directly
> > use it on block devices that have their own garbage collection and wear
> > leveling built into them, but it's interesting to see how this was solved
> > before.
> > 
> > While we could build something similar that remaps blocks between an
> > eMMC device and the logical swap space that is used by the mm code,
> > my feeling is that it would be easier to modify the swap code itself
> > to do the right thing.
> > 
> >> Also the old Nokia N900 had swap to eMMC.
> >>
> >> The last I heard was that swap was considered to be simply too slow on hand
> >> held devices.
> > 
> > That's the part that we want to solve here. It has nothing to do with
> > handheld devices, but more with specific incompatibilities of the
> > block allocation in the swap code vs. what an eMMC device expects
> > to see for fast operation. If you write data in the wrong order on
> > flash devices, you get long delays that you don't get when you do
> > it the right way. The same problem exists for file systems, and is
> > being addressed there as well.
> > 
> >> As systems adopt more RAM, isn't there a decreasing demand for swap?
> > 
> > No. You would never be able to make hibernate work, no matter how much
> > RAM you add ;-)
> 
> Have you considered making hibernate work without swap?

It does work without swap. See the userland suspend packages; where you write the image is up to you.

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


end of thread, other threads:[~2012-07-16 13:29 UTC | newest]

Thread overview: 41+ messages
2012-03-30 17:44 swap on eMMC and other flash Arnd Bergmann
2012-03-30 18:50 ` Arnd Bergmann
2012-03-30 22:08   ` Zach Pfeffer
2012-03-31  9:24     ` Arnd Bergmann
2012-04-03 18:17       ` Zach Pfeffer
2012-03-31 20:29   ` Hugh Dickins
2012-04-02 11:45     ` Arnd Bergmann
2012-04-02 14:41       ` Hugh Dickins
2012-04-02 14:55         ` Arnd Bergmann
2012-04-05  0:17           ` 정효진
2012-04-09 12:50             ` Arnd Bergmann
2012-04-08 13:50           ` Alex Lemberg
2012-04-09  2:14             ` Minchan Kim
2012-04-09  7:37               ` 정효진
2012-04-09  8:11                 ` Minchan Kim
2012-04-09 13:00                   ` Arnd Bergmann
2012-04-10  1:10                     ` Minchan Kim
2012-04-10  8:40                       ` Arnd Bergmann
2012-04-12  8:32                         ` Luca Porzio (lporzio)
2012-04-09 12:54                 ` Arnd Bergmann
2012-04-02 12:52     ` Luca Porzio (lporzio)
2012-04-02 14:58       ` Hugh Dickins
2012-04-02 16:51         ` Rik van Riel
2012-04-04 12:21   ` Adrian Hunter
2012-04-04 12:47     ` Arnd Bergmann
2012-04-11 10:28       ` Adrian Hunter
2012-07-16 13:29         ` Pavel Machek
     [not found] ` <CAEwNFnA2GeOayw2sJ_KXv4qOdC50_Nt2KoK796YmQF+YV1GiEA@mail.gmail.com>
2012-04-06 16:16   ` Arnd Bergmann
2012-04-09  2:06     ` Minchan Kim
2012-04-09 12:35       ` Arnd Bergmann
2012-04-10  0:57         ` Minchan Kim
2012-04-10  8:32           ` Arnd Bergmann
2012-04-11  9:54             ` Minchan Kim
2012-04-11 15:57               ` Arnd Bergmann
2012-04-12  2:36                 ` Minchan Kim
2012-04-16 18:22                 ` Stephan Uphoff
2012-04-16 18:59                   ` Arnd Bergmann
2012-04-16 21:12                     ` Stephan Uphoff
2012-04-17  2:18                       ` Minchan Kim
2012-04-17  2:05                     ` Minchan Kim
2012-04-27  7:34                   ` Luca Porzio (lporzio)
