* [LSF/MM TOPIC]swap improvements for fast SSD
@ 2013-01-22  6:53 Shaohua Li
  2013-01-23  7:58 ` Minchan Kim
                   ` (4 more replies)
  0 siblings, 5 replies; 31+ messages in thread
From: Shaohua Li @ 2013-01-22  6:53 UTC (permalink / raw)
  To: lsf-pc, linux-mm

Hi,

Because of their high density, low power, and low price, flash storage (SSD)
is a good candidate to partially replace DRAM. A quick answer for this is to
use SSD as swap. But Linux swap was designed for slow hard disk storage, so
there are a lot of challenges to using SSD efficiently for swap:

1. Lock contention (swap_lock, anon_vma mutex, swap address space lock)
2. TLB flush overhead. To reclaim one page, we need at least 2 TLB flushes.
This overhead is very high even on a normal 2-socket machine.
3. Better swap IO pattern. Both direct and kswapd page reclaim can do swap,
which makes the swap IO pattern interleaved. The block layer isn't always
able to merge such requests efficiently, and the pattern also makes swap
prefetch hard.
4. Swap map scan overhead. The in-memory swap map scan walks an array
linearly, which is very inefficient, especially if the swap storage is fast.
5. SSD-related optimizations, mainly discard support
6. Better swap prefetch algorithm. Besides item 3, sequentially accessed
pages aren't always adjacent in the LRU list, so page reclaim will not place
such pages in adjacent storage sectors. This makes swap prefetch hard.
7. An alternative page reclaim policy that biases toward reclaiming
anonymous pages. Currently reclaiming an anonymous page is considered more
expensive than reclaiming a file page, so we bias toward reclaiming file
pages. With high-speed swap storage, we could consider swapping more
aggressively.
8. Huge page swap. Huge page swap could solve many of the problems above,
but neither THP nor hugetlbfs supports swap.

I have made some progress in these areas recently:
http://marc.info/?l=linux-mm&m=134665691021172&w=2
http://marc.info/?l=linux-mm&m=135336039115191&w=2
http://marc.info/?l=linux-mm&m=135882182225444&w=2
http://marc.info/?l=linux-mm&m=135754636926984&w=2
http://marc.info/?l=linux-mm&m=135754634526979&w=2
But a lot of problems remain. I'd like to discuss the issues at the meeting.

Thanks,
Shaohua

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [LSF/MM TOPIC]swap improvements for fast SSD
  2013-01-22  6:53 [LSF/MM TOPIC]swap improvements for fast SSD Shaohua Li
@ 2013-01-23  7:58 ` Minchan Kim
  2013-01-23 19:04   ` Seth Jennings
                     ` (4 more replies)
  2013-01-23 16:56 ` Seth Jennings
                   ` (3 subsequent siblings)
  4 siblings, 5 replies; 31+ messages in thread
From: Minchan Kim @ 2013-01-23  7:58 UTC (permalink / raw)
  To: Shaohua Li; +Cc: lsf-pc, linux-mm, Rik van Riel

On Tue, Jan 22, 2013 at 02:53:41PM +0800, Shaohua Li wrote:
> Hi,
> 
> Because of their high density, low power, and low price, flash storage (SSD)
> is a good candidate to partially replace DRAM. A quick answer for this is to
> use SSD as swap. But Linux swap was designed for slow hard disk storage, so
> there are a lot of challenges to using SSD efficiently for swap:

Many of the items below could also apply to in-memory swap such as zram and
zcache.

> 
> 1. Lock contention (swap_lock, anon_vma mutex, swap address space lock)
> 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB flushes.
> This overhead is very high even on a normal 2-socket machine.
> 3. Better swap IO pattern. Both direct and kswapd page reclaim can do swap,
> which makes the swap IO pattern interleaved. The block layer isn't always
> able to merge such requests efficiently, and the pattern also makes swap
> prefetch hard.

Agreed.

> 4. Swap map scan overhead. The in-memory swap map scan walks an array
> linearly, which is very inefficient, especially if the swap storage is fast.

Agreed.

> 5. SSD-related optimizations, mainly discard support
> 6. Better swap prefetch algorithm. Besides item 3, sequentially accessed
> pages aren't always adjacent in the LRU list, so page reclaim will not place
> such pages in adjacent storage sectors. This makes swap prefetch hard.

One of the problems is LRU churning, which I have tried to fix:
http://marc.info/?l=linux-mm&m=130978831028952&w=4

> 7. An alternative page reclaim policy that biases toward reclaiming
> anonymous pages. Currently reclaiming an anonymous page is considered more
> expensive than reclaiming a file page, so we bias toward reclaiming file
> pages. With high-speed swap storage, we could consider swapping more
> aggressively.

Yes, we need it. I tried it by extending vm_swappiness to 200.

From: Minchan Kim <minchan@kernel.org>
Date: Mon, 3 Dec 2012 16:21:00 +0900
Subject: [PATCH] mm: increase swappiness to 200

We have assumed that the swap-out cost is very high, but that's not true
if we use a fast device like swap-over-zram. Nonetheless, we can swap out
anon and page cache at a 1:1 ratio at most. That's not enough to use the
swap device fully, so we can hit an OOM kill while there is plenty of
free space in the zram swap device, which is never what we want.

This patch makes swap-out more aggressive.

Cc: Luigi Semenzato <semenzato@google.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 kernel/sysctl.c |    3 ++-
 mm/vmscan.c     |    6 ++++--
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 693e0ed..f1dbd9d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -130,6 +130,7 @@ static int __maybe_unused two = 2;
 static int __maybe_unused three = 3;
 static unsigned long one_ul = 1;
 static int one_hundred = 100;
+extern int max_swappiness;
 #ifdef CONFIG_PRINTK
 static int ten_thousand = 10000;
 #endif
@@ -1157,7 +1158,7 @@ static struct ctl_table vm_table[] = {
                .mode           = 0644,
                .proc_handler   = proc_dointvec_minmax,
                .extra1         = &zero,
-               .extra2         = &one_hundred,
+               .extra2         = &max_swappiness,
        },
 #ifdef CONFIG_HUGETLB_PAGE
        {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 53dcde9..64f3c21 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -53,6 +53,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/vmscan.h>
 
+int max_swappiness = 200;
+
 struct scan_control {
        /* Incremented by the number of inactive pages that were scanned */
        unsigned long nr_scanned;
@@ -1626,6 +1628,7 @@ static int vmscan_swappiness(struct scan_control *sc)
        return mem_cgroup_swappiness(sc->target_mem_cgroup);
 }
 
+
 /*
  * Determine how aggressively the anon and file LRU lists should be
  * scanned.  The relative value of each set of LRU lists is determined
@@ -1701,11 +1704,10 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
        }
 
        /*
-        * With swappiness at 100, anonymous and file have the same priority.
         * This scanning priority is essentially the inverse of IO cost.
         */
        anon_prio = vmscan_swappiness(sc);
-       file_prio = 200 - anon_prio;
+       file_prio = max_swappiness - anon_prio;
 
        /*
         * OK, so we have swap space and a fair amount of page cache
-- 
1.7.9.5

> 8. Huge page swap. Huge page swap could solve many of the problems above,
> but neither THP nor hugetlbfs supports swap.

Another item is indirection layers. Please read Rik's mail below.
An indirection layer could give a lot of flexibility to backends and help
with defragmentation.

One idea I am considering is to make swap devices hierarchical, NOT
priority-based. Currently swap devices are consumed in priority order,
which is not a good fit if we use fast swap and slow swap at the same time.
I'd like to consume the fast swap device (e.g. in-memory swap) first, then
migrate some swap pages from fast swap to slow swap to make room in the
fast swap. That could address the concern below. In addition, buffering via
in-memory swap could build big chunks aligned to the slow device's block
size, so migration from fast swap to slow swap would be faster and the
wear-out problem would go away, too.

Quote from last KS2012 - http://lwn.net/Articles/516538/
"Andrea Arcangeli was also concerned that the first pages to be evicted from
memory are, by definition of the LRU page order, the ones that are least likely
to be used in the future. These are the pages that should be going to secondary
storage and more frequently used pages should be going to zcache. As it stands,
zcache may fill up with no-longer-used pages and then the system continues to
move used pages from and to the disk."


* Re: [LSF/MM TOPIC]swap improvements for fast SSD
  2013-01-22  6:53 [LSF/MM TOPIC]swap improvements for fast SSD Shaohua Li
  2013-01-23  7:58 ` Minchan Kim
@ 2013-01-23 16:56 ` Seth Jennings
  2013-01-24  6:28 ` Simon Jeons
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 31+ messages in thread
From: Seth Jennings @ 2013-01-23 16:56 UTC (permalink / raw)
  To: Shaohua Li; +Cc: lsf-pc, linux-mm, Minchan Kim

On 01/22/2013 12:53 AM, Shaohua Li wrote:
> Hi,
> 
> Because of their high density, low power, and low price, flash storage (SSD)
> is a good candidate to partially replace DRAM. A quick answer for this is to
> use SSD as swap. But Linux swap was designed for slow hard disk storage, so
> there are a lot of challenges to using SSD efficiently for swap:
> 
> 1. Lock contention (swap_lock, anon_vma mutex, swap address space lock)
> 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB flushes.
> This overhead is very high even on a normal 2-socket machine.
> 3. Better swap IO pattern. Both direct and kswapd page reclaim can do swap,
> which makes the swap IO pattern interleaved. The block layer isn't always
> able to merge such requests efficiently, and the pattern also makes swap
> prefetch hard.
> 4. Swap map scan overhead. The in-memory swap map scan walks an array
> linearly, which is very inefficient, especially if the swap storage is fast.
> 5. SSD-related optimizations, mainly discard support
> 6. Better swap prefetch algorithm. Besides item 3, sequentially accessed
> pages aren't always adjacent in the LRU list, so page reclaim will not place
> such pages in adjacent storage sectors. This makes swap prefetch hard.
> 7. An alternative page reclaim policy that biases toward reclaiming
> anonymous pages. Currently reclaiming an anonymous page is considered more
> expensive than reclaiming a file page, so we bias toward reclaiming file
> pages. With high-speed swap storage, we could consider swapping more
> aggressively.
> 8. Huge page swap. Huge page swap could solve many of the problems above,
> but neither THP nor hugetlbfs supports swap.

I have also observed these issues in my work with zswap, especially the
lock contention mentioned in 1 and the prefetch situation in 3 and 6,
where the heuristics assume rotational media.

I'd be very interested in discussing these issues and potential solutions.

Thanks to Minchan for the discussion on this front at last year's summits.

Seth

> 
> I have made some progress in these areas recently:
> http://marc.info/?l=linux-mm&m=134665691021172&w=2
> http://marc.info/?l=linux-mm&m=135336039115191&w=2
> http://marc.info/?l=linux-mm&m=135882182225444&w=2
> http://marc.info/?l=linux-mm&m=135754636926984&w=2
> http://marc.info/?l=linux-mm&m=135754634526979&w=2
> But a lot of problems remain. I'd like to discuss the issues at the meeting.
> 
> Thanks,
> Shaohua
> 


* Re: [LSF/MM TOPIC]swap improvements for fast SSD
  2013-01-23  7:58 ` Minchan Kim
@ 2013-01-23 19:04   ` Seth Jennings
  2013-01-24  1:40     ` Minchan Kim
  2013-01-24  2:02   ` Shaohua Li
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 31+ messages in thread
From: Seth Jennings @ 2013-01-23 19:04 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Shaohua Li, lsf-pc, linux-mm, Rik van Riel

On 01/23/2013 01:58 AM, Minchan Kim wrote:
> Currently, the page table entries that have swapped out pages
> associated with them contain a swap entry, pointing directly
> at the swap device and swap slot containing the data. Meanwhile,
> the swap count lives in a separate array.
> 
> The redesign we are considering moves the swap entry to the
> page cache radix tree for the swapper_space and having the pte
> contain only the offset into the swapper_space.  The swap count
> info can also fit inside the swapper_space page cache radix
> tree (at least on 64 bits - on 32 bits we may need to get
> creative or accept a smaller max amount of swap space).

Correct me if I'm wrong, but this recent patchset creating a
swapper_space per type would mess this up, right? The offset alone
would no longer be sufficient to access the proper swapper_space.

Why not just continue to store the entire swap entry (type and offset)
in the pte? Were you planning to use the type space in the pte for
something else?

Seth



* Re: [LSF/MM TOPIC]swap improvements for fast SSD
  2013-01-23 19:04   ` Seth Jennings
@ 2013-01-24  1:40     ` Minchan Kim
  2013-01-24  8:29       ` Simon Jeons
  0 siblings, 1 reply; 31+ messages in thread
From: Minchan Kim @ 2013-01-24  1:40 UTC (permalink / raw)
  To: Seth Jennings; +Cc: Shaohua Li, lsf-pc, linux-mm, Rik van Riel

Hi Seth,

On Wed, Jan 23, 2013 at 01:04:25PM -0600, Seth Jennings wrote:
> On 01/23/2013 01:58 AM, Minchan Kim wrote:
> > Currently, the page table entries that have swapped out pages
> > associated with them contain a swap entry, pointing directly
> > at the swap device and swap slot containing the data. Meanwhile,
> > the swap count lives in a separate array.
> > 
> > The redesign we are considering moves the swap entry to the
> > page cache radix tree for the swapper_space and having the pte
> > contain only the offset into the swapper_space.  The swap count
> > info can also fit inside the swapper_space page cache radix
> > tree (at least on 64 bits - on 32 bits we may need to get
> > creative or accept a smaller max amount of swap space).
> 
> Correct me if I'm wrong, but this recent patchset creating a
> swapper_space per type would mess this up right?  The offset alone
> would no longer be sufficient to access the proper swapper_space.

If I understand Rik's idea correctly, it doesn't mess things up, because we
already use (swp_type, swp_offset) as the offset into swapper_space. So
although he mentioned "the pte contains only the offset into the
swapper_space", that doesn't mean we will store only swp_offset in the pte;
we store the offset into swapper_space in the pte.

old:
        do_swap_page()
                swp_entry_t entry = pte_to_swp_entry(pte);
                if (!lookup_swap_cache(entry))
                        swapin_readahead(entry);

new:
        do_swap_page()
                pgoff_t offset = pte_to_swp_offset(pte);
                if (!lookup_swap_cache(offset)) {
                        swp_entry_t entry = offset_to_swp_entry(offset);
                        swapin_readahead(entry);
                }

IOW, the old entry and the new offset would be the same value.

> 
> Why not just continue to store the entire swap entry (type and offset)
> in the pte?  Where you planning to use the type space in the pte for
> something else?

No plan, unless I missed something. :)

> 
> Seth
> 

-- 
Kind regards,
Minchan Kim



* Re: [LSF/MM TOPIC]swap improvements for fast SSD
  2013-01-23  7:58 ` Minchan Kim
  2013-01-23 19:04   ` Seth Jennings
@ 2013-01-24  2:02   ` Shaohua Li
  2013-01-24  7:52   ` Simon Jeons
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 31+ messages in thread
From: Shaohua Li @ 2013-01-24  2:02 UTC (permalink / raw)
  To: Minchan Kim; +Cc: lsf-pc, linux-mm, Rik van Riel

On Wed, Jan 23, 2013 at 04:58:08PM +0900, Minchan Kim wrote:
> On Tue, Jan 22, 2013 at 02:53:41PM +0800, Shaohua Li wrote:
> > Hi,
> > 
> > Because of their high density, low power, and low price, flash storage
> > (SSD) is a good candidate to partially replace DRAM. A quick answer for
> > this is to use SSD as swap. But Linux swap was designed for slow hard
> > disk storage, so there are a lot of challenges to using SSD efficiently
> > for swap:
> 
> Many of the items below could also apply to in-memory swap such as zram
> and zcache.
> 
> > 
> > 1. Lock contention (swap_lock, anon_vma mutex, swap address space lock)
> > 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB
> > flushes. This overhead is very high even on a normal 2-socket machine.
> > 3. Better swap IO pattern. Both direct and kswapd page reclaim can do
> > swap, which makes the swap IO pattern interleaved. The block layer isn't
> > always able to merge such requests efficiently, and the pattern also
> > makes swap prefetch hard.
> 
> Agreed.
> 
> > 4. Swap map scan overhead. The in-memory swap map scan walks an array
> > linearly, which is very inefficient, especially if the swap storage is
> > fast.
> 
> Agreed.
> 
> > 5. SSD-related optimizations, mainly discard support
> > 6. Better swap prefetch algorithm. Besides item 3, sequentially accessed
> > pages aren't always adjacent in the LRU list, so page reclaim will not
> > place such pages in adjacent storage sectors. This makes swap prefetch
> > hard.
> 
> One of the problems is LRU churning, which I have tried to fix:
> http://marc.info/?l=linux-mm&m=130978831028952&w=4

Yes, LRU churning is a problem. Another problem is that we don't add
sequentially accessed pages to the LRU list adjacently if multiple tasks are
running and consuming memory at the same time. The percpu pagevec helps a
little, but its size isn't large.
 
> > 7. An alternative page reclaim policy that biases toward reclaiming
> > anonymous pages. Currently reclaiming an anonymous page is considered
> > more expensive than reclaiming a file page, so we bias toward reclaiming
> > file pages. With high-speed swap storage, we could consider swapping
> > more aggressively.
> 
> Yes, we need it. I tried it by extending vm_swappiness to 200.
> 
> From: Minchan Kim <minchan@kernel.org>
> Date: Mon, 3 Dec 2012 16:21:00 +0900
> Subject: [PATCH] mm: increase swappiness to 200

I had exactly the same code in my tree. And actually I found that if
swappiness is set to 200, zone reclaim has a problem. I have a patch for it
but haven't posted it yet.

Swappiness doesn't solve the whole problem here. Anonymous pages start in
the active list, and the rotation logic is biased toward anonymous pages
too. So even if you set a high swappiness, file pages can still be
reclaimed more easily.

> > 8. Huge page swap. Huge page swap could solve many of the problems
> > above, but neither THP nor hugetlbfs supports swap.
> 
> Another item is indirection layers. Please read Rik's mail below.
> An indirection layer could give a lot of flexibility to backends and help
> with defragmentation.
> 
> One idea I am considering is to make swap devices hierarchical, NOT
> priority-based. Currently swap devices are consumed in priority order,
> which is not a good fit if we use fast swap and slow swap at the same time.
> I'd like to consume the fast swap device (e.g. in-memory swap) first, then
> migrate some swap pages from fast swap to slow swap to make room in the
> fast swap. That could address the concern below. In addition, buffering via
> in-memory swap could build big chunks aligned to the slow device's block
> size, so migration from fast swap to slow swap would be faster and the
> wear-out problem would go away, too.

This looks interesting.

Thanks,
Shaohua



* Re: [LSF/MM TOPIC]swap improvements for fast SSD
  2013-01-22  6:53 [LSF/MM TOPIC]swap improvements for fast SSD Shaohua Li
  2013-01-23  7:58 ` Minchan Kim
  2013-01-23 16:56 ` Seth Jennings
@ 2013-01-24  6:28 ` Simon Jeons
  2013-03-15  9:39 ` Simon Jeons
  2013-04-28  8:12 ` Simon Jeons
  4 siblings, 0 replies; 31+ messages in thread
From: Simon Jeons @ 2013-01-24  6:28 UTC (permalink / raw)
  To: Shaohua Li; +Cc: lsf-pc, linux-mm

On Tue, 2013-01-22 at 14:53 +0800, Shaohua Li wrote:
> Hi,
> 
> Because of their high density, low power, and low price, flash storage (SSD)
> is a good candidate to partially replace DRAM. A quick answer for this is to
> use SSD as swap. But Linux swap was designed for slow hard disk storage, so
> there are a lot of challenges to using SSD efficiently for swap:
> 
> 1. Lock contention (swap_lock, anon_vma mutex, swap address space lock)
> 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB flushes. This

Which 2 TLB flushes?

> overhead is very high even on a normal 2-socket machine.
> 3. Better swap IO pattern. Both direct and kswapd page reclaim can do swap,
> which makes the swap IO pattern interleaved. The block layer isn't always
> able to merge such requests efficiently, and the pattern also makes swap
> prefetch hard.
> 4. Swap map scan overhead. The in-memory swap map scan walks an array
> linearly, which is very inefficient, especially if the swap storage is fast.
> 5. SSD-related optimizations, mainly discard support
> 6. Better swap prefetch algorithm. Besides item 3, sequentially accessed
> pages aren't always adjacent in the LRU list, so page reclaim will not place
> such pages in adjacent storage sectors. This makes swap prefetch hard.
> 7. An alternative page reclaim policy that biases toward reclaiming
> anonymous pages. Currently reclaiming an anonymous page is considered more
> expensive than reclaiming a file page, so we bias toward reclaiming file
> pages. With high-speed swap storage, we could consider swapping more
> aggressively.
> 8. Huge page swap. Huge page swap could solve many of the problems above,
> but neither THP nor hugetlbfs supports swap.
> 
> I have made some progress in these areas recently:
> http://marc.info/?l=linux-mm&m=134665691021172&w=2
> http://marc.info/?l=linux-mm&m=135336039115191&w=2
> http://marc.info/?l=linux-mm&m=135882182225444&w=2
> http://marc.info/?l=linux-mm&m=135754636926984&w=2
> http://marc.info/?l=linux-mm&m=135754634526979&w=2
> But a lot of problems remain. I'd like to discuss the issues at the meeting.
> 
> Thanks,
> Shaohua
> 




* Re: [LSF/MM TOPIC]swap improvements for fast SSD
  2013-01-23  7:58 ` Minchan Kim
  2013-01-23 19:04   ` Seth Jennings
  2013-01-24  2:02   ` Shaohua Li
@ 2013-01-24  7:52   ` Simon Jeons
  2013-01-24  9:09   ` Simon Jeons
  2013-04-05  0:17   ` Simon Jeons
  4 siblings, 0 replies; 31+ messages in thread
From: Simon Jeons @ 2013-01-24  7:52 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Shaohua Li, lsf-pc, linux-mm, Rik van Riel

On Wed, 2013-01-23 at 16:58 +0900, Minchan Kim wrote:
> On Tue, Jan 22, 2013 at 02:53:41PM +0800, Shaohua Li wrote:
> > Hi,
> > 
> > Because of their high density, low power, and low price, flash storage
> > (SSD) is a good candidate to partially replace DRAM. A quick answer for
> > this is to use SSD as swap. But Linux swap was designed for slow hard
> > disk storage, so there are a lot of challenges to using SSD efficiently
> > for swap:
> 
> Many of the items below could also apply to in-memory swap such as zram
> and zcache.
> 
> > 
> > 1. Lock contention (swap_lock, anon_vma mutex, swap address space lock)
> > 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB
> > flushes. This overhead is very high even on a normal 2-socket machine.
> > 3. Better swap IO pattern. Both direct and kswapd page reclaim can do
> > swap, which makes the swap IO pattern interleaved. The block layer isn't
> > always able to merge such requests efficiently, and the pattern also
> > makes swap prefetch hard.
> 
> Agreed.
> 
> > 4. Swap map scan overhead. The in-memory swap map scan walks an array
> > linearly, which is very inefficient, especially if the swap storage is
> > fast.
> 
> Agreed.
> 
> > 5. SSD-related optimizations, mainly discard support
> > 6. Better swap prefetch algorithm. Besides item 3, sequentially accessed
> > pages aren't always adjacent in the LRU list, so page reclaim will not
> > place such pages in adjacent storage sectors. This makes swap prefetch
> > hard.
> 
> One of the problems is LRU churning, which I have tried to fix:
> http://marc.info/?l=linux-mm&m=130978831028952&w=4

What's the LRU history you mentioned in your LRU churning patchset?

> 
> > 7. An alternative page reclaim policy that biases toward reclaiming
> > anonymous pages. Currently reclaiming an anonymous page is considered
> > more expensive than reclaiming a file page, so we bias toward reclaiming
> > file pages. With high-speed swap storage, we could consider swapping
> > more aggressively.
> 
> Yes, we need it. I tried it by extending vm_swappiness to 200.
> 
> From: Minchan Kim <minchan@kernel.org>
> Date: Mon, 3 Dec 2012 16:21:00 +0900
> Subject: [PATCH] mm: increase swappiness to 200
> 
> We have assumed that the swap-out cost is very high, but that's not true
> if we use a fast device like swap-over-zram. Nonetheless, we can swap out
> anon and page cache at a 1:1 ratio at most. That's not enough to use the
> swap device fully, so we can hit an OOM kill while there is plenty of
> free space in the zram swap device, which is never what we want.
> 
> This patch makes swap-out more aggressive.
> 
> Cc: Luigi Semenzato <semenzato@google.com>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  kernel/sysctl.c |    3 ++-
>  mm/vmscan.c     |    6 ++++--
>  2 files changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 693e0ed..f1dbd9d 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -130,6 +130,7 @@ static int __maybe_unused two = 2;
>  static int __maybe_unused three = 3;
>  static unsigned long one_ul = 1;
>  static int one_hundred = 100;
> +extern int max_swappiness;
>  #ifdef CONFIG_PRINTK
>  static int ten_thousand = 10000;
>  #endif
> @@ -1157,7 +1158,7 @@ static struct ctl_table vm_table[] = {
>                 .mode           = 0644,
>                 .proc_handler   = proc_dointvec_minmax,
>                 .extra1         = &zero,
> -               .extra2         = &one_hundred,
> +               .extra2         = &max_swappiness,
>         },
>  #ifdef CONFIG_HUGETLB_PAGE
>         {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 53dcde9..64f3c21 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -53,6 +53,8 @@
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/vmscan.h>
>  
> +int max_swappiness = 200;
> +
>  struct scan_control {
>         /* Incremented by the number of inactive pages that were scanned */
>         unsigned long nr_scanned;
> @@ -1626,6 +1628,7 @@ static int vmscan_swappiness(struct scan_control *sc)
>         return mem_cgroup_swappiness(sc->target_mem_cgroup);
>  }
>  
> +
>  /*
>   * Determine how aggressively the anon and file LRU lists should be
>   * scanned.  The relative value of each set of LRU lists is determined
> @@ -1701,11 +1704,10 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>         }
>  
>         /*
> -        * With swappiness at 100, anonymous and file have the same priority.
>          * This scanning priority is essentially the inverse of IO cost.
>          */
>         anon_prio = vmscan_swappiness(sc);
> -       file_prio = 200 - anon_prio;
> +       file_prio = max_swappiness - anon_prio;
>  
>         /*
>          * OK, so we have swap space and a fair amount of page cache
> -- 
> 1.7.9.5
> 
> > 8. Huge page swap. Huge page swap could solve many of the problems
> > above, but neither THP nor hugetlbfs supports swap.
> 
> Another item is indirection layers. Please read Rik's mail below.
> An indirection layer could give a lot of flexibility to backends and help
> with defragmentation.
> 
> One idea I am considering is to make swap devices hierarchical, NOT
> priority-based. Currently swap devices are consumed in priority order,
> which is not a good fit if we use fast swap and slow swap at the same time.
> I'd like to consume the fast swap device (e.g. in-memory swap) first, then
> migrate some swap pages from fast swap to slow swap to make room in the
> fast swap. That could address the concern below. In addition, buffering via
> in-memory swap could build big chunks aligned to the slow device's block
> size, so migration from fast swap to slow swap would be faster and the
> wear-out problem would go away, too.
> 
> Quote from last KS2012 - http://lwn.net/Articles/516538/
> "Andrea Arcangeli was also concerned that the first pages to be evicted from
> memory are, by definition of the LRU page order, the ones that are least likely
> to be used in the future. These are the pages that should be going to secondary
> storage and more frequently used pages should be going to zcache. As it stands,
> zcache may fill up with no-longer-used pages and then the system continues to
> move used pages from and to the disk."
> 
> From riel@redhat.com Sun Apr 10 17:50:10 2011
> Date: Sun, 10 Apr 2011 20:50:01 -0400
> From: Rik van Riel <riel@redhat.com>
> To: Linux Memory Management List <linux-mm@kvack.org>
> Subject: [LSF/Collab] swap cache redesign idea
> 
> On Thursday after LSF, Hugh, Minchan, Mel, Johannes and I were
> sitting in the hallway talking about yet more VM things.
> 
> During that discussion, we came up with a way to redesign the
> swap cache.  During my flight home, I came with ideas on how
> to use that redesign, that may make the changes worthwhile.
> 
> Currently, the page table entries that have swapped out pages
> associated with them contain a swap entry, pointing directly
> at the swap device and swap slot containing the data. Meanwhile,
> the swap count lives in a separate array.
> 
> The redesign we are considering moves the swap entry to the
> page cache radix tree for the swapper_space and having the pte
> contain only the offset into the swapper_space.  The swap count
> info can also fit inside the swapper_space page cache radix
> tree (at least on 64 bits - on 32 bits we may need to get
> creative or accept a smaller max amount of swap space).
> 
> This extra layer of indirection allows us to do several things:
> 
> 1) get rid of the virtual address scanning swapoff; instead
>     we just swap the data in and mark the pages as present in
>     the swapper_space radix tree
> 
> 2) free swap entries as they are read in, without waiting for
>     the process to fault it in - this may be useful for memory
>     types that have a large erase block
> 
> 3) together with the defragmentation from (2), we can always
>     do writes in large aligned blocks - the extra indirection
>     will make it relatively easy to have special backend code
>     for different kinds of swap space, since all the state can
>     now live in just one place
> 
> 4) skip writeout of zero-filled pages - this can be a big help
>     for KVM virtual machines running Windows, since Windows zeroes
>     out free pages;   simply discarding a zero-filled page is not
>     at all simple in the current VM, where we would have to iterate
>     over all the ptes to free the swap entry before being able to
>     free the swap cache page (I am not sure how that locking would
>     even work)
> 
>     with the extra layer of indirection, the locking for this scheme
>     can be trivial - either the faulting process gets the old page,
>     or it gets a new one, either way it'll be zero filled
> 
> 5) skip writeout of pages the guest has marked as free - same as
>     above, with the same easier locking
> 
> Only one real question remaining - how do we handle the swap count
> in the new scheme?  On 64 bit systems we have enough space in the
> radix tree, on 32 bit systems maybe we'll have to start overflowing
> into the "swap_count_continued" logic a little sooner than we are
> now and reduce the maximum swap size a little?
> 
> > 
> > I made some progress in these areas recently:
> > http://marc.info/?l=linux-mm&m=134665691021172&w=2
> > http://marc.info/?l=linux-mm&m=135336039115191&w=2
> > http://marc.info/?l=linux-mm&m=135882182225444&w=2
> > http://marc.info/?l=linux-mm&m=135754636926984&w=2
> > http://marc.info/?l=linux-mm&m=135754634526979&w=2
> > But a lot of problems remain. I'd like to discuss the issues at the meeting.
> 
> I have an interest in this topic.
> Thanks.
> 
> > 
> > Thanks,
> > Shaohua
> > 
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majordomo@kvack.org.  For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [LSF/MM TOPIC]swap improvements for fast SSD
  2013-01-24  1:40     ` Minchan Kim
@ 2013-01-24  8:29       ` Simon Jeons
  0 siblings, 0 replies; 31+ messages in thread
From: Simon Jeons @ 2013-01-24  8:29 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Seth Jennings, Shaohua Li, lsf-pc, linux-mm, Rik van Riel

On Thu, 2013-01-24 at 10:40 +0900, Minchan Kim wrote:
> Hi Seth,
> 
> On Wed, Jan 23, 2013 at 01:04:25PM -0600, Seth Jennings wrote:
> > On 01/23/2013 01:58 AM, Minchan Kim wrote:
> > > Currently, the page table entries that have swapped out pages
> > > associated with them contain a swap entry, pointing directly
> > > at the swap device and swap slot containing the data. Meanwhile,
> > > the swap count lives in a separate array.
> > > 
> > > The redesign we are considering moving the swap entry to the
> > > page cache radix tree for the swapper_space and having the pte
> > > contain only the offset into the swapper_space.  The swap count
> > > info can also fit inside the swapper_space page cache radix
> > > tree (at least on 64 bits - on 32 bits we may need to get
> > > creative or accept a smaller max amount of swap space).
> > 
> > Correct me if I'm wrong, but this recent patchset creating a
> > swapper_space per type would mess this up, right?  The offset alone
> > would no longer be sufficient to access the proper swapper_space.
> 
> If I understand Rik's idea correctly, it doesn't mess things up. We already
> use (swp_type, swp_offset) as the offset into swapper_space, so although
> he mentioned "pte contains only the offset into the swapper_space",
> it doesn't mean we will store only swp_offset in the pte; we store the
> offset into swapper_space in the pte.
> 
> old :
>         do_swap_page
>         swp_entry_t entry = pte_to_swp_entry(pte);
>         if (!lookup_swap_cache(entry))
>                 swapin_readahead(entry)
> 
> New :
>         do_swap_page
>         pgoff_t offset = pte_to_swp_offset(pte)
>         if (!lookup_swap_cache(offset)) {
>                 swp_entry_t entry = offset_to_swp_entry(offset);
>                 swapin_readahead(entry);
>         }
> 

Since Shaohua changed the logic so that each swap partition has its own
address_space, the idea mentioned above can't work any more, correct?

> IOW, the entry of the old scheme and the offset of the new scheme would be the same value.
> 
> > 
> > Why not just continue to store the entire swap entry (type and offset)
> > in the pte?  Were you planning to use the type space in the pte for
> > something else?
> 
> No plan if I didn't miss something. :)
> 
> > 
> > Seth
> > 
> 



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [LSF/MM TOPIC]swap improvements for fast SSD
  2013-01-23  7:58 ` Minchan Kim
                     ` (2 preceding siblings ...)
  2013-01-24  7:52   ` Simon Jeons
@ 2013-01-24  9:09   ` Simon Jeons
  2013-01-26  4:40     ` Kyungmin Park
  2013-04-05  0:17   ` Simon Jeons
  4 siblings, 1 reply; 31+ messages in thread
From: Simon Jeons @ 2013-01-24  9:09 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Shaohua Li, lsf-pc, linux-mm, Rik van Riel

Hi Minchan,
On Wed, 2013-01-23 at 16:58 +0900, Minchan Kim wrote:
> On Tue, Jan 22, 2013 at 02:53:41PM +0800, Shaohua Li wrote:
> > Hi,
> > 
> > Because of high density, low power and low price, flash storage (SSD) is a good
> > candidate to partially replace DRAM. A quick answer for this is using SSD as
> > swap. But Linux swap is designed for slow hard disk storage. There are a lot of
> > challenges to efficiently use SSD for swap:
> 
> Many of the items below could also apply to in-memory swap like zram and zcache.
> 
> > 
> > 1. Lock contentions (swap_lock, anon_vma mutex, swap address space lock)
> > 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB flushes. This
> > overhead is very high even on a normal 2-socket machine.
> > 3. Better swap IO pattern. Both direct and kswapd page reclaim can do swap,
> > which makes the swap IO pattern interleaved. The block layer isn't always
> > efficient at request merging. Such an IO pattern also makes swap prefetch hard.
> 
> Agreed.
> 
> > 4. Swap map scan overhead. The in-memory swap map scan walks an array, which is
> > very inefficient, especially if the swap storage is fast.
> 
> Agreed.
> 
> > 5. SSD related optimization, mainly discard support
> > 6. Better swap prefetch algorithm. Besides item 3, sequentially accessed pages
> > aren't always adjacent in the LRU list, so page reclaim will not place such pages
> > in adjacent storage sectors. This makes swap prefetch hard.
> 
> One of the problems is LRU churning, which I wanted to try to fix:
> http://marc.info/?l=linux-mm&m=130978831028952&w=4
> 
> > 7. Alternative page reclaim policy to bias reclaiming anonymous pages.
> > Currently reclaiming anonymous pages is considered harder than reclaiming file
> > pages, so we bias toward reclaiming file pages. With high-speed swap storage,
> > we could consider swapping more aggressively.
> 
> Yep. We need it. I tried it by extending vm_swappiness to 200.
> 
> From: Minchan Kim <minchan@kernel.org>
> Date: Mon, 3 Dec 2012 16:21:00 +0900
> Subject: [PATCH] mm: increase swappiness to 200
> 
> We have thought swap out cost is very high but it's not true
> if we use fast device like swap-over-zram. Nonetheless, we can
> swap out 1:1 ratio of anon and page cache at most.
> It's not enough to use swap device fully so we encounter OOM kill
> while there are many free space in zram swap device. It's never
> what we want.
> 
> This patch makes swap out aggressively.
> 
> Cc: Luigi Semenzato <semenzato@google.com>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  kernel/sysctl.c |    3 ++-
>  mm/vmscan.c     |    6 ++++--
>  2 files changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 693e0ed..f1dbd9d 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -130,6 +130,7 @@ static int __maybe_unused two = 2;
>  static int __maybe_unused three = 3;
>  static unsigned long one_ul = 1;
>  static int one_hundred = 100;
> +extern int max_swappiness;
>  #ifdef CONFIG_PRINTK
>  static int ten_thousand = 10000;
>  #endif
> @@ -1157,7 +1158,7 @@ static struct ctl_table vm_table[] = {
>                 .mode           = 0644,
>                 .proc_handler   = proc_dointvec_minmax,
>                 .extra1         = &zero,
> -               .extra2         = &one_hundred,
> +               .extra2         = &max_swappiness,
>         },
>  #ifdef CONFIG_HUGETLB_PAGE
>         {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 53dcde9..64f3c21 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -53,6 +53,8 @@
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/vmscan.h>
>  
> +int max_swappiness = 200;
> +
>  struct scan_control {
>         /* Incremented by the number of inactive pages that were scanned */
>         unsigned long nr_scanned;
> @@ -1626,6 +1628,7 @@ static int vmscan_swappiness(struct scan_control *sc)
>         return mem_cgroup_swappiness(sc->target_mem_cgroup);
>  }
>  
> +
>  /*
>   * Determine how aggressively the anon and file LRU lists should be
>   * scanned.  The relative value of each set of LRU lists is determined
> @@ -1701,11 +1704,10 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>         }
>  
>         /*
> -        * With swappiness at 100, anonymous and file have the same priority.
>          * This scanning priority is essentially the inverse of IO cost.
>          */
>         anon_prio = vmscan_swappiness(sc);
> -       file_prio = 200 - anon_prio;
> +       file_prio = max_swappiness - anon_prio;
>  
>         /*
>          * OK, so we have swap space and a fair amount of page cache
> -- 
> 1.7.9.5
> 
> > 8. Huge page swap. Huge page swap can solve a lot of the problems above, but
> > neither THP nor hugetlbfs supports swap.
> 
> Another item is indirection layers. Please read Rik's mail below.
> Indirection layers could give a lot of flexibility to backends and help
> with defragmentation.
> 
> One idea I am considering is to make swap devices hierarchical,
> NOT priority-based. Currently swap devices are used up in priority order,
> which is not a good fit when we use fast swap and slow swap at the same time.
> I'd like to consume the fast swap device (e.g. in-memory swap) first, then
> migrate some swap pages from fast swap to slow swap to make room in fast
> swap. That could address the concern below.
> In addition, buffering via in-memory swap could build big chunks aligned
> to the slow device's block size, so migration from fast swap to slow swap
> could be faster and the wear-out problem would go away, too.
> 
> Quote from last KS2012 - http://lwn.net/Articles/516538/
> "Andrea Arcangeli was also concerned that the first pages to be evicted from
> memory are, by definition of the LRU page order, the ones that are least likely
> to be used in the future. These are the pages that should be going to secondary
> storage and more frequently used pages should be going to zcache. As it stands,
> zcache may fill up with no-longer-used pages and then the system continues to
> move used pages from and to the disk."
> 
> From riel@redhat.com Sun Apr 10 17:50:10 2011
> Date: Sun, 10 Apr 2011 20:50:01 -0400
> From: Rik van Riel <riel@redhat.com>
> To: Linux Memory Management List <linux-mm@kvack.org>
> Subject: [LSF/Collab] swap cache redesign idea
> 
> On Thursday after LSF, Hugh, Minchan, Mel, Johannes and I were
> sitting in the hallway talking about yet more VM things.
> 
> During that discussion, we came up with a way to redesign the
> swap cache.  During my flight home, I came up with ideas on how
> to use that redesign that may make the changes worthwhile.
> 
> Currently, the page table entries that have swapped out pages
> associated with them contain a swap entry, pointing directly
> at the swap device and swap slot containing the data. Meanwhile,
> the swap count lives in a separate array.
> 
> The redesign we are considering moves the swap entry to the
> page cache radix tree for the swapper_space and has the pte
> contain only the offset into the swapper_space.  The swap count
> info can also fit inside the swapper_space page cache radix
> tree (at least on 64 bits - on 32 bits we may need to get
> creative or accept a smaller max amount of swap space).
> 
> This extra layer of indirection allows us to do several things:
> 
> 1) get rid of the virtual address scanning in swapoff; instead
>     we just swap the data in and mark the pages as present in
>     the swapper_space radix tree

Will the radix tree store all rmap entries for the pages? If not, how do we
locate the pages?

> 
> 2) free swap entries as they are read in, without waiting for
>     the process to fault it in - this may be useful for memory
>     types that have a large erase block
> 
> 3) together with the defragmentation from (2), we can always
>     do writes in large aligned blocks - the extra indirection
>     will make it relatively easy to have special backend code
>     for different kinds of swap space, since all the state can
>     now live in just one place
> 
> 4) skip writeout of zero-filled pages - this can be a big help
>     for KVM virtual machines running Windows, since Windows zeroes
>     out free pages;   simply discarding a zero-filled page is not
>     at all simple in the current VM, where we would have to iterate
>     over all the ptes to free the swap entry before being able to
>     free the swap cache page (I am not sure how that locking would
>     even work)
> 
>     with the extra layer of indirection, the locking for this scheme
>     can be trivial - either the faulting process gets the old page,
>     or it gets a new one, either way it'll be zero filled
> 
> 5) skip writeout of pages the guest has marked as free - same as
>     above, with the same easier locking
> 
> Only one real question remaining - how do we handle the swap count
> in the new scheme?  On 64 bit systems we have enough space in the
> radix tree, on 32 bit systems maybe we'll have to start overflowing
> into the "swap_count_continued" logic a little sooner than we are
> now and reduce the maximum swap size a little?
> 
> > 
> > I made some progress in these areas recently:
> > http://marc.info/?l=linux-mm&m=134665691021172&w=2
> > http://marc.info/?l=linux-mm&m=135336039115191&w=2
> > http://marc.info/?l=linux-mm&m=135882182225444&w=2
> > http://marc.info/?l=linux-mm&m=135754636926984&w=2
> > http://marc.info/?l=linux-mm&m=135754634526979&w=2
> > But a lot of problems remain. I'd like to discuss the issues at the meeting.
> 
> I have an interest in this topic.
> Thanks.
> 
> > 
> > Thanks,
> > Shaohua
> > 
> 



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [LSF/MM TOPIC]swap improvements for fast SSD
  2013-01-24  9:09   ` Simon Jeons
@ 2013-01-26  4:40     ` Kyungmin Park
  2013-01-27  0:26       ` Simon Jeons
  2013-01-27 14:18       ` Shaohua Li
  0 siblings, 2 replies; 31+ messages in thread
From: Kyungmin Park @ 2013-01-26  4:40 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Minchan Kim, lsf-pc, linux-mm, Rik van Riel, Simon Jeons

Hi,

On 1/24/13, Simon Jeons <simon.jeons@gmail.com> wrote:
> Hi Minchan,
> On Wed, 2013-01-23 at 16:58 +0900, Minchan Kim wrote:
>> On Tue, Jan 22, 2013 at 02:53:41PM +0800, Shaohua Li wrote:
>> > Hi,
>> >
>> > Because of high density, low power and low price, flash storage (SSD) is
>> > a good
>> > candidate to partially replace DRAM. A quick answer for this is using
>> > SSD as
>> > swap. But Linux swap is designed for slow hard disk storage. There are a
>> > lot of
>> > challenges to efficiently use SSD for swap:
>>
>> Many of the items below could also apply to in-memory swap like zram and zcache.
>>
>> >
>> > 1. Lock contentions (swap_lock, anon_vma mutex, swap address space
>> > lock)
>> > 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB
>> > flush. This
>> > overhead is very high even in a normal 2-socket machine.
>> > 3. Better swap IO pattern. Both direct and kswapd page reclaim can do
>> > swap, which makes the swap IO pattern interleaved. The block layer isn't
>> > always efficient at request merging. Such an IO pattern also makes swap
>> > prefetch hard.
>>
>> Agreed.
>>
>> > 4. Swap map scan overhead. The in-memory swap map scan walks an array,
>> > which is very inefficient, especially if the swap storage is fast.
>>
>> Agreed.
>>

5. SSD related optimization, mainly discard support.

The swap code currently works on individual swap slots, which means it can't
optimize discard: to get a meaningful performance gain, a discard needs to
cover at least 2 pages. That figure is for eMMC; in the case of SSD, it
requires even more pages per discard.

To address this, I am considering the batched discard approach used by
filesystems: *sometimes* scan all empty slots and issue discards over runs
of contiguous free swap slots, as long as possible.

What do you think?

Thank you,
Kyungmin Park

P.S. It's almost the same topic as optimizing eMMC for swap. I mean
I'm very interested in these topics.

>> > 6. Better swap prefetch algorithm. Besides item 3, sequentially accessed
>> > pages aren't always adjacent in the LRU list, so page reclaim will not
>> > place such pages in adjacent storage sectors. This makes swap prefetch
>> > hard.
>>
>> One of the problems is LRU churning, which I wanted to try to fix:
>> http://marc.info/?l=linux-mm&m=130978831028952&w=4
>>
>> > 7. Alternative page reclaim policy to bias reclaiming anonymous pages.
>> > Currently reclaiming anonymous pages is considered harder than reclaiming
>> > file pages, so we bias toward reclaiming file pages. With high-speed swap
>> > storage, we could consider swapping more aggressively.
>>
>> Yep. We need it. I tried it by extending vm_swappiness to 200.
>>
>> From: Minchan Kim <minchan@kernel.org>
>> Date: Mon, 3 Dec 2012 16:21:00 +0900
>> Subject: [PATCH] mm: increase swappiness to 200
>>
>> We have thought swap out cost is very high but it's not true
>> if we use fast device like swap-over-zram. Nonetheless, we can
>> swap out 1:1 ratio of anon and page cache at most.
>> It's not enough to use swap device fully so we encounter OOM kill
>> while there are many free space in zram swap device. It's never
>> what we want.
>>
>> This patch makes swap out aggressively.
>>
>> Cc: Luigi Semenzato <semenzato@google.com>
>> Signed-off-by: Minchan Kim <minchan@kernel.org>
>> ---
>>  kernel/sysctl.c |    3 ++-
>>  mm/vmscan.c     |    6 ++++--
>>  2 files changed, 6 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>> index 693e0ed..f1dbd9d 100644
>> --- a/kernel/sysctl.c
>> +++ b/kernel/sysctl.c
>> @@ -130,6 +130,7 @@ static int __maybe_unused two = 2;
>>  static int __maybe_unused three = 3;
>>  static unsigned long one_ul = 1;
>>  static int one_hundred = 100;
>> +extern int max_swappiness;
>>  #ifdef CONFIG_PRINTK
>>  static int ten_thousand = 10000;
>>  #endif
>> @@ -1157,7 +1158,7 @@ static struct ctl_table vm_table[] = {
>>                 .mode           = 0644,
>>                 .proc_handler   = proc_dointvec_minmax,
>>                 .extra1         = &zero,
>> -               .extra2         = &one_hundred,
>> +               .extra2         = &max_swappiness,
>>         },
>>  #ifdef CONFIG_HUGETLB_PAGE
>>         {
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 53dcde9..64f3c21 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -53,6 +53,8 @@
>>  #define CREATE_TRACE_POINTS
>>  #include <trace/events/vmscan.h>
>>
>> +int max_swappiness = 200;
>> +
>>  struct scan_control {
>>         /* Incremented by the number of inactive pages that were scanned
>> */
>>         unsigned long nr_scanned;
>> @@ -1626,6 +1628,7 @@ static int vmscan_swappiness(struct scan_control
>> *sc)
>>         return mem_cgroup_swappiness(sc->target_mem_cgroup);
>>  }
>>
>> +
>>  /*
>>   * Determine how aggressively the anon and file LRU lists should be
>>   * scanned.  The relative value of each set of LRU lists is determined
>> @@ -1701,11 +1704,10 @@ static void get_scan_count(struct lruvec *lruvec,
>> struct scan_control *sc,
>>         }
>>
>>         /*
>> -        * With swappiness at 100, anonymous and file have the same
>> priority.
>>          * This scanning priority is essentially the inverse of IO cost.
>>          */
>>         anon_prio = vmscan_swappiness(sc);
>> -       file_prio = 200 - anon_prio;
>> +       file_prio = max_swappiness - anon_prio;
>>
>>         /*
>>          * OK, so we have swap space and a fair amount of page cache
>> --
>> 1.7.9.5
>>
>> > 8. Huge page swap. Huge page swap can solve a lot of the problems above,
>> > but neither THP nor hugetlbfs supports swap.
>>
>> Another item is indirection layers. Please read Rik's mail below.
>> Indirection layers could give a lot of flexibility to backends and help
>> with defragmentation.
>>
>> One idea I am considering is to make swap devices hierarchical,
>> NOT priority-based. Currently swap devices are used up in priority
>> order, which is not a good fit when we use fast swap and slow swap at
>> the same time. I'd like to consume the fast swap device (e.g. in-memory
>> swap) first, then migrate some swap pages from fast swap to slow swap
>> to make room in fast swap. That could address the concern below.
>> In addition, buffering via in-memory swap could build big chunks
>> aligned to the slow device's block size, so migration from fast swap
>> to slow swap could be faster and the wear-out problem would go away,
>> too.
>>
>> Quote from last KS2012 - http://lwn.net/Articles/516538/
>> "Andrea Arcangeli was also concerned that the first pages to be evicted
>> from
>> memory are, by definition of the LRU page order, the ones that are least
>> likely
>> to be used in the future. These are the pages that should be going to
>> secondary
>> storage and more frequently used pages should be going to zcache. As it
>> stands,
>> zcache may fill up with no-longer-used pages and then the system continues
>> to
>> move used pages from and to the disk."
>>
>> From riel@redhat.com Sun Apr 10 17:50:10 2011
>> Date: Sun, 10 Apr 2011 20:50:01 -0400
>> From: Rik van Riel <riel@redhat.com>
>> To: Linux Memory Management List <linux-mm@kvack.org>
>> Subject: [LSF/Collab] swap cache redesign idea
>>
>> On Thursday after LSF, Hugh, Minchan, Mel, Johannes and I were
>> sitting in the hallway talking about yet more VM things.
>>
>> During that discussion, we came up with a way to redesign the
>> swap cache.  During my flight home, I came up with ideas on how
>> to use that redesign that may make the changes worthwhile.
>>
>> Currently, the page table entries that have swapped out pages
>> associated with them contain a swap entry, pointing directly
>> at the swap device and swap slot containing the data. Meanwhile,
>> the swap count lives in a separate array.
>>
>> The redesign we are considering moves the swap entry to the
>> page cache radix tree for the swapper_space and has the pte
>> contain only the offset into the swapper_space.  The swap count
>> info can also fit inside the swapper_space page cache radix
>> tree (at least on 64 bits - on 32 bits we may need to get
>> creative or accept a smaller max amount of swap space).
>>
>> This extra layer of indirection allows us to do several things:
>>
>> 1) get rid of the virtual address scanning in swapoff; instead
>>     we just swap the data in and mark the pages as present in
>>     the swapper_space radix tree
>
> Will the radix tree store all rmap entries for the pages? If not, how do we
> locate the pages?
>
>>
>> 2) free swap entries as they are read in, without waiting for
>>     the process to fault it in - this may be useful for memory
>>     types that have a large erase block
>>
>> 3) together with the defragmentation from (2), we can always
>>     do writes in large aligned blocks - the extra indirection
>>     will make it relatively easy to have special backend code
>>     for different kinds of swap space, since all the state can
>>     now live in just one place
>>
>> 4) skip writeout of zero-filled pages - this can be a big help
>>     for KVM virtual machines running Windows, since Windows zeroes
>>     out free pages;   simply discarding a zero-filled page is not
>>     at all simple in the current VM, where we would have to iterate
>>     over all the ptes to free the swap entry before being able to
>>     free the swap cache page (I am not sure how that locking would
>>     even work)
>>
>>     with the extra layer of indirection, the locking for this scheme
>>     can be trivial - either the faulting process gets the old page,
>>     or it gets a new one, either way it'll be zero filled
>>
>> 5) skip writeout of pages the guest has marked as free - same as
>>     above, with the same easier locking
>>
>> Only one real question remaining - how do we handle the swap count
>> in the new scheme?  On 64 bit systems we have enough space in the
>> radix tree, on 32 bit systems maybe we'll have to start overflowing
>> into the "swap_count_continued" logic a little sooner than we are
>> now and reduce the maximum swap size a little?
>>
>> >
>> > I made some progress in these areas recently:
>> > http://marc.info/?l=linux-mm&m=134665691021172&w=2
>> > http://marc.info/?l=linux-mm&m=135336039115191&w=2
>> > http://marc.info/?l=linux-mm&m=135882182225444&w=2
>> > http://marc.info/?l=linux-mm&m=135754636926984&w=2
>> > http://marc.info/?l=linux-mm&m=135754634526979&w=2
>> > But a lot of problems remain. I'd like to discuss the issues at the
>> > meeting.
>>
>> I have an interest in this topic.
>> Thanks.
>>
>> >
>> > Thanks,
>> > Shaohua
>> >
>>
>
>
>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [LSF/MM TOPIC]swap improvements for fast SSD
  2013-01-26  4:40     ` Kyungmin Park
@ 2013-01-27  0:26       ` Simon Jeons
  2013-01-27 14:18       ` Shaohua Li
  1 sibling, 0 replies; 31+ messages in thread
From: Simon Jeons @ 2013-01-27  0:26 UTC (permalink / raw)
  To: Kyungmin Park; +Cc: Shaohua Li, Minchan Kim, lsf-pc, linux-mm, Rik van Riel

On Sat, 2013-01-26 at 13:40 +0900, Kyungmin Park wrote:
> Hi,
> 
> On 1/24/13, Simon Jeons <simon.jeons@gmail.com> wrote:
> > Hi Minchan,
> > On Wed, 2013-01-23 at 16:58 +0900, Minchan Kim wrote:
> >> On Tue, Jan 22, 2013 at 02:53:41PM +0800, Shaohua Li wrote:
> >> > Hi,
> >> >
> >> > Because of high density, low power and low price, flash storage (SSD) is
> >> > a good
> >> > candidate to partially replace DRAM. A quick answer for this is using
> >> > SSD as
> >> > swap. But Linux swap is designed for slow hard disk storage. There are a
> >> > lot of
> >> > challenges to efficiently use SSD for swap:
> >>
> >> Many of the items below could also apply to in-memory swap like zram and zcache.
> >>
> >> >
> >> > 1. Lock contentions (swap_lock, anon_vma mutex, swap address space
> >> > lock)
> >> > 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB
> >> > flush. This
> >> > overhead is very high even in a normal 2-socket machine.
> >> > 3. Better swap IO pattern. Both direct and kswapd page reclaim can do
> >> > swap, which makes the swap IO pattern interleaved. The block layer isn't
> >> > always efficient at request merging. Such an IO pattern also makes swap
> >> > prefetch hard.
> >>
> >> Agreed.
> >>
> >> > 4. Swap map scan overhead. The in-memory swap map scan walks an array,
> >> > which is very inefficient, especially if the swap storage is fast.
> >>
> >> Agreed.
> >>
> 

Hi Kyungmin,

> 5. SSD related optimization, mainly discard support.
> 
> The swap code currently works on individual swap slots, which means it can't
> optimize discard: to get a meaningful performance gain, a discard needs to
> cover at least 2 pages. That figure is for eMMC; in the case of SSD, it
> requires even more pages per discard.

Could you explain what the 2 or more pages you mentioned are used for? Why
are they needed? I'm interested.

> 
> To address this, I am considering the batched discard approach used by
> filesystems: *sometimes* scan all empty slots and issue discards over runs
> of contiguous free swap slots, as long as possible.
> 
> What do you think?
> 
> Thank you,
> Kyungmin Park
> 
> P.S. It's almost the same topic as optimizing eMMC for swap. I mean
> I'm very interested in these topics.
> 
> >> > 6. Better swap prefetch algorithm. Besides item 3, sequentially accessed
> >> > pages aren't always adjacent in the LRU list, so page reclaim will not
> >> > place such pages in adjacent storage sectors. This makes swap prefetch
> >> > hard.
> >>
> >> One of the problems is LRU churning, which I wanted to try to fix:
> >> http://marc.info/?l=linux-mm&m=130978831028952&w=4
> >>
> >> > 7. Alternative page reclaim policy to bias reclaiming anonymous pages.
> >> > Currently reclaiming anonymous pages is considered harder than reclaiming
> >> > file pages, so we bias toward reclaiming file pages. With high-speed swap
> >> > storage, we could consider swapping more aggressively.
> >>
> >> Yep. We need it. I tried it by extending vm_swappiness to 200.
> >>
> >> From: Minchan Kim <minchan@kernel.org>
> >> Date: Mon, 3 Dec 2012 16:21:00 +0900
> >> Subject: [PATCH] mm: increase swappiness to 200
> >>
> >> We have thought swap out cost is very high but it's not true
> >> if we use fast device like swap-over-zram. Nonetheless, we can
> >> swap out 1:1 ratio of anon and page cache at most.
> >> It's not enough to use swap device fully so we encounter OOM kill
> >> while there are many free space in zram swap device. It's never
> >> what we want.
> >>
> >> This patch makes swap out aggressively.
> >>
> >> Cc: Luigi Semenzato <semenzato@google.com>
> >> Signed-off-by: Minchan Kim <minchan@kernel.org>
> >> ---
> >>  kernel/sysctl.c |    3 ++-
> >>  mm/vmscan.c     |    6 ++++--
> >>  2 files changed, 6 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> >> index 693e0ed..f1dbd9d 100644
> >> --- a/kernel/sysctl.c
> >> +++ b/kernel/sysctl.c
> >> @@ -130,6 +130,7 @@ static int __maybe_unused two = 2;
> >>  static int __maybe_unused three = 3;
> >>  static unsigned long one_ul = 1;
> >>  static int one_hundred = 100;
> >> +extern int max_swappiness;
> >>  #ifdef CONFIG_PRINTK
> >>  static int ten_thousand = 10000;
> >>  #endif
> >> @@ -1157,7 +1158,7 @@ static struct ctl_table vm_table[] = {
> >>                 .mode           = 0644,
> >>                 .proc_handler   = proc_dointvec_minmax,
> >>                 .extra1         = &zero,
> >> -               .extra2         = &one_hundred,
> >> +               .extra2         = &max_swappiness,
> >>         },
> >>  #ifdef CONFIG_HUGETLB_PAGE
> >>         {
> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> index 53dcde9..64f3c21 100644
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -53,6 +53,8 @@
> >>  #define CREATE_TRACE_POINTS
> >>  #include <trace/events/vmscan.h>
> >>
> >> +int max_swappiness = 200;
> >> +
> >>  struct scan_control {
> >>         /* Incremented by the number of inactive pages that were scanned
> >> */
> >>         unsigned long nr_scanned;
> >> @@ -1626,6 +1628,7 @@ static int vmscan_swappiness(struct scan_control
> >> *sc)
> >>         return mem_cgroup_swappiness(sc->target_mem_cgroup);
> >>  }
> >>
> >> +
> >>  /*
> >>   * Determine how aggressively the anon and file LRU lists should be
> >>   * scanned.  The relative value of each set of LRU lists is determined
> >> @@ -1701,11 +1704,10 @@ static void get_scan_count(struct lruvec *lruvec,
> >> struct scan_control *sc,
> >>         }
> >>
> >>         /*
> >> -        * With swappiness at 100, anonymous and file have the same
> >> priority.
> >>          * This scanning priority is essentially the inverse of IO cost.
> >>          */
> >>         anon_prio = vmscan_swappiness(sc);
> >> -       file_prio = 200 - anon_prio;
> >> +       file_prio = max_swappiness - anon_prio;
> >>
> >>         /*
> >>          * OK, so we have swap space and a fair amount of page cache
> >> --
> >> 1.7.9.5
> >>
> >> > 8. Huge page swap. Huge page swap could solve a lot of the problems
> >> > above, but
> >> > neither THP nor hugetlbfs supports swap.
> >>
> >> Other items are indirection layers. Please read Rik's mail below.
> >> Indirection layers could give a lot of flexibility to backends and help
> >> with defragmentation.
> >>
> >> One idea I am considering is to make swap devices hierarchical,
> >> NOT priority-based. I mean, currently swap devices are used up in
> >> priority order.
> >> That's not a good fit if we use fast swap and slow swap at the same time.
> >> I'd like to consume the fast swap device (e.g., in-memory swap) first,
> >> then migrate some swap pages from fast swap to slow swap to
> >> make room in the fast swap. That could solve the concern below.
> >> In addition, buffering via in-memory swap could build big chunks
> >> aligned
> >> to the slow device's block size, so migration from fast swap to slow
> >> swap
> >> could be faster and the wear-out problem would go away, too.
> >>
> >> Quote from last KS2012 - http://lwn.net/Articles/516538/
> >> "Andrea Arcangeli was also concerned that the first pages to be evicted
> >> from
> >> memory are, by definition of the LRU page order, the ones that are least
> >> likely
> >> to be used in the future. These are the pages that should be going to
> >> secondary
> >> storage and more frequently used pages should be going to zcache. As it
> >> stands,
> >> zcache may fill up with no-longer-used pages and then the system continues
> >> to
> >> move used pages from and to the disk."
> >>
> >> From riel@redhat.com Sun Apr 10 17:50:10 2011
> >> Date: Sun, 10 Apr 2011 20:50:01 -0400
> >> From: Rik van Riel <riel@redhat.com>
> >> To: Linux Memory Management List <linux-mm@kvack.org>
> >> Subject: [LSF/Collab] swap cache redesign idea
> >>
> >> On Thursday after LSF, Hugh, Minchan, Mel, Johannes and I were
> >> sitting in the hallway talking about yet more VM things.
> >>
> >> During that discussion, we came up with a way to redesign the
> >> swap cache.  During my flight home, I came with ideas on how
> >> to use that redesign, that may make the changes worthwhile.
> >>
> >> Currently, the page table entries that have swapped out pages
> >> associated with them contain a swap entry, pointing directly
> >> at the swap device and swap slot containing the data. Meanwhile,
> >> the swap count lives in a separate array.
> >>
> >> The redesign we are considering moves the swap entry to the
> >> page cache radix tree for the swapper_space and having the pte
> >> contain only the offset into the swapper_space.  The swap count
> >> info can also fit inside the swapper_space page cache radix
> >> tree (at least on 64 bits - on 32 bits we may need to get
> >> creative or accept a smaller max amount of swap space).
> >>
> >> This extra layer of indirection allows us to do several things:
> >>
> >> 1) get rid of the virtual address scanning in swapoff; instead
> >>     we just swap the data in and mark the pages as present in
> >>     the swapper_space radix tree
> >
> > Will the radix tree store all the rmaps to the pages? If not, how will
> > the pages be located?
> >
> >>
> >> 2) free swap entries as they are read in, without waiting for
> >>     the process to fault it in - this may be useful for memory
> >>     types that have a large erase block
> >>
> >> 3) together with the defragmentation from (2), we can always
> >>     do writes in large aligned blocks - the extra indirection
> >>     will make it relatively easy to have special backend code
> >>     for different kinds of swap space, since all the state can
> >>     now live in just one place
> >>
> >> 4) skip writeout of zero-filled pages - this can be a big help
> >>     for KVM virtual machines running Windows, since Windows zeroes
> >>     out free pages;   simply discarding a zero-filled page is not
> >>     at all simple in the current VM, where we would have to iterate
> >>     over all the ptes to free the swap entry before being able to
> >>     free the swap cache page (I am not sure how that locking would
> >>     even work)
> >>
> >>     with the extra layer of indirection, the locking for this scheme
> >>     can be trivial - either the faulting process gets the old page,
> >>     or it gets a new one, either way it'll be zero filled
> >>
> >> 5) skip writeout of pages the guest has marked as free - same as
> >>     above, with the same easier locking
> >>
> >> Only one real question remaining - how do we handle the swap count
> >> in the new scheme?  On 64 bit systems we have enough space in the
> >> radix tree, on 32 bit systems maybe we'll have to start overflowing
> >> into the "swap_count_continued" logic a little sooner than we are
> >> now and reduce the maximum swap size a little?
> >>
> >> >
> >> > I had some progresses in these areas recently:
> >> > http://marc.info/?l=linux-mm&m=134665691021172&w=2
> >> > http://marc.info/?l=linux-mm&m=135336039115191&w=2
> >> > http://marc.info/?l=linux-mm&m=135882182225444&w=2
> >> > http://marc.info/?l=linux-mm&m=135754636926984&w=2
> >> > http://marc.info/?l=linux-mm&m=135754634526979&w=2
> >> > But a lot of problems remain. I'd like to discuss the issues at the
> >> > meeting.
> >>
> >> I'm interested in this topic.
> >> Thanks.
> >>
> >> >
> >> > Thanks,
> >> > Shaohua
> >> >
> >> > --
> >> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> >> > the body to majordomo@kvack.org.  For more info on Linux MM,
> >> > see: http://www.linux-mm.org/ .
> >> > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> >>
> >
> >
> >



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [LSF/MM TOPIC]swap improvements for fast SSD
  2013-01-26  4:40     ` Kyungmin Park
  2013-01-27  0:26       ` Simon Jeons
@ 2013-01-27 14:18       ` Shaohua Li
  2013-01-28  7:37         ` Kyungmin Park
  2013-02-04  4:56         ` Hugh Dickins
  1 sibling, 2 replies; 31+ messages in thread
From: Shaohua Li @ 2013-01-27 14:18 UTC (permalink / raw)
  To: Kyungmin Park; +Cc: Minchan Kim, lsf-pc, linux-mm, Rik van Riel, Simon Jeons

On Sat, Jan 26, 2013 at 01:40:55PM +0900, Kyungmin Park wrote:
> Hi,
> 
> On 1/24/13, Simon Jeons <simon.jeons@gmail.com> wrote:
> > Hi Minchan,
> > On Wed, 2013-01-23 at 16:58 +0900, Minchan Kim wrote:
> >> On Tue, Jan 22, 2013 at 02:53:41PM +0800, Shaohua Li wrote:
> >> > Hi,
> >> >
> >> > Because of high density, low power and low price, flash storage (SSD) is
> >> > a good
> >> > candidate to partially replace DRAM. A quick answer for this is using
> >> > SSD as
> >> > swap. But Linux swap is designed for slow hard disk storage. There are a
> >> > lot of
> >> > challenges to efficiently use SSD for swap:
> >>
> >> Many of the items below could also apply to in-memory swap like zram and zcache.
> >>
> >> >
> >> > 1. Lock contentions (swap_lock, anon_vma mutex, swap address space
> >> > lock)
> >> > 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB
> >> > flushes. This
> >> > overhead is very high even on a normal 2-socket machine.
> >> > 3. Better swap IO pattern. Both direct and kswapd page reclaim can do
> >> > swap,
> >> > which makes the swap IO pattern interleaved. The block layer isn't always
> >> > efficient
> >> > at merging requests. Such an IO pattern also makes swap prefetch hard.
> >>
> >> Agreed.
> >>
> >> > 4. Swap map scan overhead. The in-memory swap map scan walks an array,
> >> > which is
> >> > very inefficient, especially if swap storage is fast.
> >>
> >> Agreed.
> >>
> 
> 5. SSD related optimization, mainly discard support.
> 
> Now the swap code operates on individual swap slots, which means it can't
> optimize the discard feature: to get a meaningful performance gain,
> discard needs at least 2 pages. That's for eMMC; in the case of
> SSD, it needs even more pages.
> 
> To address this issue, I'm considering the batched discard approach used
> in filesystems: *sometimes* scan all empty slots and issue discards for
> as many contiguous swap slots as possible.

I posted a patch before to make discard asynchronous, which looks almost good
to me, though we still discard a whole cluster at a time.
http://marc.info/?l=linux-mm&m=135087309208120&w=2



* Re: [LSF/MM TOPIC]swap improvements for fast SSD
  2013-01-27 14:18       ` Shaohua Li
@ 2013-01-28  7:37         ` Kyungmin Park
  2013-02-01 12:37           ` Kyungmin Park
  2013-02-04  4:56         ` Hugh Dickins
  1 sibling, 1 reply; 31+ messages in thread
From: Kyungmin Park @ 2013-01-28  7:37 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Minchan Kim, lsf-pc, linux-mm, Rik van Riel, Simon Jeons

On Sun, Jan 27, 2013 at 11:18 PM, Shaohua Li <shli@kernel.org> wrote:
> On Sat, Jan 26, 2013 at 01:40:55PM +0900, Kyungmin Park wrote:
>> Hi,
>>
>> On 1/24/13, Simon Jeons <simon.jeons@gmail.com> wrote:
>> > Hi Minchan,
>> > On Wed, 2013-01-23 at 16:58 +0900, Minchan Kim wrote:
>> >> On Tue, Jan 22, 2013 at 02:53:41PM +0800, Shaohua Li wrote:
>> >> > Hi,
>> >> >
>> >> > Because of high density, low power and low price, flash storage (SSD) is
>> >> > a good
>> >> > candidate to partially replace DRAM. A quick answer for this is using
>> >> > SSD as
>> >> > swap. But Linux swap is designed for slow hard disk storage. There are a
>> >> > lot of
>> >> > challenges to efficiently use SSD for swap:
>> >>
>> >> Many of below item could be applied in in-memory swap like zram, zcache.
>> >>
>> >> >
>> >> > 1. Lock contentions (swap_lock, anon_vma mutex, swap address space
>> >> > lock)
>> >> > 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB
>> >> > flush. This
>> >> > overhead is very high even in a normal 2-socket machine.
>> >> > 3. Better swap IO pattern. Both direct and kswapd page reclaim can do
>> >> > swap,
>> >> > which makes swap IO pattern is interleave. Block layer isn't always
>> >> > efficient
>> >> > to do request merge. Such IO pattern also makes swap prefetch hard.
>> >>
>> >> Agreed.
>> >>
>> >> > 4. Swap map scan overhead. Swap in-memory map scan scans an array, which
>> >> > is
>> >> > very inefficient, especially if swap storage is fast.
>> >>
>> >> Agreed.
>> >>
>>
>> 5. SSD related optimization, mainly discard support.
>>
>> Now the swap code operates on individual swap slots, which means it can't
>> optimize the discard feature: to get a meaningful performance gain,
>> discard needs at least 2 pages. That's for eMMC; in the case of
>> SSD, it needs even more pages.
>>
>> To address this issue, I'm considering the batched discard approach used
>> in filesystems: *sometimes* scan all empty slots and issue discards for
>> as many contiguous swap slots as possible.
>
> I posted a patch before to make discard asynchronous, which looks almost good
> to me, though we still discard a whole cluster at a time.
> http://marc.info/?l=linux-mm&m=135087309208120&w=2

I found your previous patches; they're almost the same concept as batched
discard. Now I'm testing them.
BTW, which test program do you use? We currently just test a few scenarios
and check those scenarios only;
there's no generic tool to measure the performance gain.

After test, I'll share the results.

Thank you,
Kyungmin Park



* Re: [LSF/MM TOPIC]swap improvements for fast SSD
  2013-01-28  7:37         ` Kyungmin Park
@ 2013-02-01 12:37           ` Kyungmin Park
  0 siblings, 0 replies; 31+ messages in thread
From: Kyungmin Park @ 2013-02-01 12:37 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Minchan Kim, lsf-pc, linux-mm, Rik van Riel, Simon Jeons

On Mon, Jan 28, 2013 at 4:37 PM, Kyungmin Park <kmpark@infradead.org> wrote:
> On Sun, Jan 27, 2013 at 11:18 PM, Shaohua Li <shli@kernel.org> wrote:
>> On Sat, Jan 26, 2013 at 01:40:55PM +0900, Kyungmin Park wrote:
>>> Hi,
>>>
>>> On 1/24/13, Simon Jeons <simon.jeons@gmail.com> wrote:
>>> > Hi Minchan,
>>> > On Wed, 2013-01-23 at 16:58 +0900, Minchan Kim wrote:
>>> >> On Tue, Jan 22, 2013 at 02:53:41PM +0800, Shaohua Li wrote:
>>> >> > Hi,
>>> >> >
>>> >> > Because of high density, low power and low price, flash storage (SSD) is
>>> >> > a good
>>> >> > candidate to partially replace DRAM. A quick answer for this is using
>>> >> > SSD as
>>> >> > swap. But Linux swap is designed for slow hard disk storage. There are a
>>> >> > lot of
>>> >> > challenges to efficiently use SSD for swap:
>>> >>
>>> >> Many of below item could be applied in in-memory swap like zram, zcache.
>>> >>
>>> >> >
>>> >> > 1. Lock contentions (swap_lock, anon_vma mutex, swap address space
>>> >> > lock)
>>> >> > 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB
>>> >> > flush. This
>>> >> > overhead is very high even in a normal 2-socket machine.
>>> >> > 3. Better swap IO pattern. Both direct and kswapd page reclaim can do
>>> >> > swap,
>>> >> > which makes swap IO pattern is interleave. Block layer isn't always
>>> >> > efficient
>>> >> > to do request merge. Such IO pattern also makes swap prefetch hard.
>>> >>
>>> >> Agreed.
>>> >>
>>> >> > 4. Swap map scan overhead. Swap in-memory map scan scans an array, which
>>> >> > is
>>> >> > very inefficient, especially if swap storage is fast.
>>> >>
>>> >> Agreed.
>>> >>
>>>
>>> 5. SSD related optimization, mainly discard support.
>>>
>>> Now the swap code operates on individual swap slots, which means it can't
>>> optimize the discard feature: to get a meaningful performance gain,
>>> discard needs at least 2 pages. That's for eMMC; in the case of
>>> SSD, it needs even more pages.
>>>
>>> To address this issue, I'm considering the batched discard approach used
>>> in filesystems: *sometimes* scan all empty slots and issue discards for
>>> as many contiguous swap slots as possible.
>>
>> I posted a patch before to make discard asynchronous, which looks almost good
>> to me, though we still discard a whole cluster at a time.
>> http://marc.info/?l=linux-mm&m=135087309208120&w=2
>
> I found your previous patches; they're almost the same concept as batched
> discard. Now I'm testing them.
> BTW, which test program do you use? We currently just test a few scenarios
> and check those scenarios only;
> there's no generic tool to measure the performance gain.
>
> After test, I'll share the results.
Update: it shows a good performance gain over the previous one, about 4x.

Feel free to add.
Tested-by: Kyungmin Park <kyungmin.park@samsung.com>
>
> Thank you,
> Kyungmin Park



* Re: [LSF/MM TOPIC]swap improvements for fast SSD
  2013-01-27 14:18       ` Shaohua Li
  2013-01-28  7:37         ` Kyungmin Park
@ 2013-02-04  4:56         ` Hugh Dickins
  2013-02-19  6:15           ` Shaohua Li
  1 sibling, 1 reply; 31+ messages in thread
From: Hugh Dickins @ 2013-02-04  4:56 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Kyungmin Park, Minchan Kim, lsf-pc, linux-mm, Rik van Riel, Simon Jeons

On Sun, 27 Jan 2013, Shaohua Li wrote:
> On Sat, Jan 26, 2013 at 01:40:55PM +0900, Kyungmin Park wrote:
> > 5. SSD related optimization, mainly discard support.
> > 
> > Now the swap code operates on individual swap slots, which means it can't
> > optimize the discard feature: to get a meaningful performance gain,
> > discard needs at least 2 pages. That's for eMMC; in the case of
> > SSD, it needs even more pages.
> > 
> > To address this issue, I'm considering the batched discard approach used
> > in filesystems: *sometimes* scan all empty slots and issue discards for
> > as many contiguous swap slots as possible.
> 
> I posted a patch before to make discard asynchronous, which looks almost good
> to me, though we still discard a whole cluster at a time.
> http://marc.info/?l=linux-mm&m=135087309208120&w=2

Any reason why you point to the 2012/10/22 patch rather than the 2012/11/19 one?

Seeing this reminded me to take your 1/2 and 2/2 (of 11/19) out again and
give them a fresh run - though they were easier to apply to 3.8-rc rather
than mmotm with your locking changes, so it was 3.8-rc6 I tried.

As I reported in private mail last year, I wish you'd remove the "buddy"
from the description of your 1/2 allocator, as that just misled me; but I've not
experienced any problem with the allocator, and I still like the direction
you take with improving swap discard in 2/2.

This time around I've not yet seen any "swap_free: Unused swap offset entry"
messages (despite forgetting to include your later SWAP_MAP_BAD addition to
__swap_duplicate() - I still haven't thought that through to be honest),
but did again get the VM_BUG_ON(error == -EEXIST) in __add_to_swap_cache()
called from add_to_swap() from shrink_page_list().

Since it came after 1.5 hours of load, I didn't give it much thought,
and just went on to test other things, thinking I could easily reproduce
it later; but have failed to do so in many hours since.  Still trying.

Hugh



* Re: [LSF/MM TOPIC]swap improvements for fast SSD
  2013-02-04  4:56         ` Hugh Dickins
@ 2013-02-19  6:15           ` Shaohua Li
  2013-02-19 19:41             ` Hugh Dickins
  0 siblings, 1 reply; 31+ messages in thread
From: Shaohua Li @ 2013-02-19  6:15 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Kyungmin Park, Minchan Kim, lsf-pc, linux-mm, Rik van Riel, Simon Jeons

On Sun, Feb 03, 2013 at 08:56:15PM -0800, Hugh Dickins wrote:
> On Sun, 27 Jan 2013, Shaohua Li wrote:
> > On Sat, Jan 26, 2013 at 01:40:55PM +0900, Kyungmin Park wrote:
> > > 5. SSD related optimization, mainly discard support.
> > > 
> > > Now the swap code operates on individual swap slots, which means it can't
> > > optimize the discard feature: to get a meaningful performance gain,
> > > discard needs at least 2 pages. That's for eMMC; in the case of
> > > SSD, it needs even more pages.
> > > 
> > > To address this issue, I'm considering the batched discard approach used
> > > in filesystems: *sometimes* scan all empty slots and issue discards for
> > > as many contiguous swap slots as possible.
> > 
> > I posted a patch before to make discard asynchronous, which looks almost good
> > to me, though we still discard a whole cluster at a time.
> > http://marc.info/?l=linux-mm&m=135087309208120&w=2
> 
> Any reason why you point to the 2012/10/22 patch rather than the 2012/11/19 one?
> 
> Seeing this reminded me to take your 1/2 and 2/2 (of 11/19) out again and
> give them a fresh run - though they were easier to apply to 3.8-rc rather
> than mmotm with your locking changes, so it was 3.8-rc6 I tried.
> 
> As I reported in private mail last year, I wish you'd remove the "buddy"
> from description of your 1/2 allocator, that just misled me; but I've not
> experienced any problem with the allocator, and I still like the direction
> you take with improving swap discard in 2/2.
> 
> This time around I've not yet seen any "swap_free: Unused swap offset entry"
> messages (despite forgetting to include your later SWAP_MAP_BAD addition to
> __swap_duplicate() - I still haven't thought that through to be honest),
> but did again get the VM_BUG_ON(error == -EEXIST) in __add_to_swap_cache()
> called from add_to_swap() from shrink_page_list().
> 
> Since it came after 1.5 hours of load, I didn't give it much thought,
> and just went on to test other things, thinking I could easily reproduce
> it later; but have failed to do so in many hours since.  Still trying.

Missed this mail, sorry. I'm planning to repost the patches against linux-next (because
of the locking changes) and will include the SWAP_MAP_BAD change. I did see
problems without the SWAP_MAP_BAD change.

Thanks,
Shaohua



* Re: [LSF/MM TOPIC]swap improvements for fast SSD
  2013-02-19  6:15           ` Shaohua Li
@ 2013-02-19 19:41             ` Hugh Dickins
  0 siblings, 0 replies; 31+ messages in thread
From: Hugh Dickins @ 2013-02-19 19:41 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Kyungmin Park, Minchan Kim, lsf-pc, linux-mm, Rik van Riel, Simon Jeons

On Tue, 19 Feb 2013, Shaohua Li wrote:
> On Sun, Feb 03, 2013 at 08:56:15PM -0800, Hugh Dickins wrote:
> > 
> > Seeing this reminded me to take your 1/2 and 2/2 (of 11/19) out again and
> > give them a fresh run - though they were easier to apply to 3.8-rc rather
> > than mmotm with your locking changes, so it was 3.8-rc6 I tried.
> > 
> > As I reported in private mail last year, I wish you'd remove the "buddy"
> > from the description of your 1/2 allocator, as that just misled me; but I've not
> > experienced any problem with the allocator, and I still like the direction
> > you take with improving swap discard in 2/2.
> > 
> > This time around I've not yet seen any "swap_free: Unused swap offset entry"
> > messages (despite forgetting to include your later SWAP_MAP_BAD addition to
> > __swap_duplicate() - I still haven't thought that through to be honest),
> > but did again get the VM_BUG_ON(error == -EEXIST) in __add_to_swap_cache()
> > called from add_to_swap() from shrink_page_list().
> > 
> > Since it came after 1.5 hours of load, I didn't give it much thought,
> > and just went on to test other things, thinking I could easily reproduce
> > it later; but have failed to do so in many hours since.  Still trying.
> 
> Missed this mail, sorry. I'm planning to repost the patches against linux-next (because
> of the locking changes) and will include the SWAP_MAP_BAD change. I did see
> problems without the SWAP_MAP_BAD change.

Good, I'll take a look at them then.

I did manage to hit the VM_BUG_ON(error == -EEXIST) in __add_to_swap_cache()
again with those patches, and verified that there really was another page
sitting in its radix_tree slot.

Although I've never succeeded in reproducing this without your patches,
I'm pretty sure they're not to blame, that they just perhaps alter the
timing in some way as to make this more likely to happen.

I believe (without actual evidence) that it's a race with swapin_readahead():
its read_swap_cache_async() coming in and reading into its own page, in
between the swap slot being allocated from the swap_map with SWAP_HAS_CACHE
and add_to_swap()'s page actually being inserted into the swap cache.

I've not prepared a fix for it yet, but it shouldn't be a worry.

Something I learnt in looking through the radix_tree to find the
right slot, a benefit of your per-device swapper_spaces that
we had not anticipated: once you have multiple swap areas (because
the swp_entry_t is arranged with the "type" at the top to get the
offsets contiguous), the single-swapper_space radix_tree becomes
very sparse, with matching high height and lots of silly levels
of radix_tree_nodes - I had to go down 10 levels, despite having
only two 1.5GB swap areas.

Hugh



* Re: [LSF/MM TOPIC]swap improvements for fast SSD
  2013-01-22  6:53 [LSF/MM TOPIC]swap improvements for fast SSD Shaohua Li
                   ` (2 preceding siblings ...)
  2013-01-24  6:28 ` Simon Jeons
@ 2013-03-15  9:39 ` Simon Jeons
  2013-03-18 10:38   ` Bob Liu
  2013-04-28  8:12 ` Simon Jeons
  4 siblings, 1 reply; 31+ messages in thread
From: Simon Jeons @ 2013-03-15  9:39 UTC (permalink / raw)
  To: Shaohua Li; +Cc: lsf-pc, linux-mm, Hugh Dickins, Minchan Kim, Rik van Riel

On 01/22/2013 02:53 PM, Shaohua Li wrote:
> Hi,
>
> Because of high density, low power and low price, flash storage (SSD) is a good
> candidate to partially replace DRAM. A quick answer for this is using SSD as
> swap. But Linux swap is designed for slow hard disk storage. There are a lot of
> challenges to efficiently use SSD for swap:
>
> 1. Lock contentions (swap_lock, anon_vma mutex, swap address space lock)
> 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB flushes.
> This overhead is very high even on a normal 2-socket machine.
> 3. Better swap IO pattern. Both direct and kswapd page reclaim can do swap,
> which makes the swap IO pattern interleaved. The block layer isn't always
> efficient at merging requests. Such an IO pattern also makes swap prefetch hard.
> 4. Swap map scan overhead. The in-memory swap map scan walks an array, which is
> very inefficient, especially if swap storage is fast.
> 5. SSD related optimization, mainly discard support
> 6. Better swap prefetch algorithm. Besides item 3, sequentially accessed pages
> aren't always adjacent in the LRU list, so page reclaim will not swap such pages
> to adjacent storage sectors. This makes swap prefetch hard.
> 7. Alternative page reclaim policy to bias reclaiming anonymous pages.
> Currently, reclaiming anonymous pages is considered harder than reclaiming file
> pages, so we bias toward reclaiming file pages. With high-speed swap storage,
> we should consider swapping more aggressively.
> 8. Huge page swap. Huge page swap could solve a lot of the problems above, but
> neither THP nor hugetlbfs supports swap.

Could you tell me in which workloads the inability to swap out hugetlb/THP
pages hurts your performance? Is it worth it?

>
> I had some progresses in these areas recently:
> http://marc.info/?l=linux-mm&m=134665691021172&w=2
> http://marc.info/?l=linux-mm&m=135336039115191&w=2
> http://marc.info/?l=linux-mm&m=135882182225444&w=2
> http://marc.info/?l=linux-mm&m=135754636926984&w=2
> http://marc.info/?l=linux-mm&m=135754634526979&w=2
> But a lot of problems remain. I'd like to discuss the issues at the meeting.
>
> Thanks,
> Shaohua
>



* Re: [LSF/MM TOPIC]swap improvements for fast SSD
  2013-03-15  9:39 ` Simon Jeons
@ 2013-03-18 10:38   ` Bob Liu
  2013-03-19  1:27     ` Shaohua Li
  0 siblings, 1 reply; 31+ messages in thread
From: Bob Liu @ 2013-03-18 10:38 UTC (permalink / raw)
  To: Simon Jeons
  Cc: Shaohua Li, lsf-pc, linux-mm, Hugh Dickins, Minchan Kim,
	Rik van Riel, dan.magenheimer, sjenning, rcj


On 03/15/2013 05:39 PM, Simon Jeons wrote:
> On 01/22/2013 02:53 PM, Shaohua Li wrote:
>> Hi,
>>
>> Because of high density, low power and low price, flash storage (SSD)
>> is a good
>> candidate to partially replace DRAM. A quick answer for this is using
>> SSD as
>> swap. But Linux swap is designed for slow hard disk storage. There are
>> a lot of
>> challenges to efficiently use SSD for swap:
>>
>> 1. Lock contentions (swap_lock, anon_vma mutex, swap address space lock)
>> 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB
>> flush. This
>> overhead is very high even in a normal 2-socket machine.
>> 3. Better swap IO pattern. Both direct and kswapd page reclaim can do
>> swap,
>> which makes swap IO pattern is interleave. Block layer isn't always
>> efficient
>> to do request merge. Such IO pattern also makes swap prefetch hard.
>> 4. Swap map scan overhead. Swap in-memory map scan scans an array,
>> which is
>> very inefficient, especially if swap storage is fast.
>> 5. SSD related optimization, mainly discard support
>> 6. Better swap prefetch algorithm. Besides item 3, sequentially
>> accessed pages
>> aren't always in LRU list adjacently, so page reclaim will not swap
>> such pages
>> in adjacent storage sectors. This makes swap prefetch hard.
>> 7. Alternative page reclaim policy to bias reclaiming anonymous page.
>> Currently reclaim anonymous page is considering harder than reclaim
>> file pages,
>> so we bias reclaiming file pages. If there are high speed swap
>> storage, we are
>> considering doing swap more aggressively.
>> 8. Huge page swap. Huge page swap can solve a lot of problems above,
>> but both
>> THP and hugetlbfs don't support swap.
> 
> Could you tell me in which workloads the inability to swap out hugetlb/THP
> pages hurts your performance? Is it worth it?
> 

I'm also very interested in this workload.
I think hugetlb/thp pages could be a potential user of zprojects like
zswap/zcache.
We could try to compress those pages before breaking them into normal pages.
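A hedged sketch of that idea, in userspace C for illustration only: all names
(swapout_thp_compressed, zstore_fn, the pool policies) are hypothetical
stand-ins, not the zswap/zcache API. The point is just the control flow: try to
push each 4K subpage of the THP into a compressed pool, and only split the huge
page into normal pages for regular swapout if the pool rejects a subpage.

```c
#include <stddef.h>

#define THP_NR_PAGES 512    /* 2M THP = 512 x 4K subpages */

/* Hypothetical compression hook standing in for a zswap/zcache store;
 * returns nonzero if the subpage was accepted by the compressed pool. */
typedef int (*zstore_fn)(const void *subpage);

/* Example pool policies, for illustration only. */
int pool_accepts_all(const void *p) { (void)p; return 1; }
int pool_rejects_all(const void *p) { (void)p; return 0; }

/* Try to store every subpage of a THP compressed before falling back to
 * splitting it into normal pages. Returns the number of subpages stored;
 * *did_split is set when the pool rejected one and the THP must be split. */
size_t swapout_thp_compressed(const void *thp_base, zstore_fn store,
                              int *did_split)
{
    const char *base = thp_base;
    size_t stored = 0;

    *did_split = 0;
    for (size_t i = 0; i < THP_NR_PAGES; i++) {
        if (!store(base + i * 4096)) {
            *did_split = 1;     /* fall back: split and swap out normally */
            break;
        }
        stored++;
    }
    return stored;
}
```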

-- 
Regards,
-Bob

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [LSF/MM TOPIC]swap improvements for fast SSD
  2013-03-18 10:38   ` Bob Liu
@ 2013-03-19  1:27     ` Shaohua Li
  2013-03-19  1:32       ` Simon Jeons
                         ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Shaohua Li @ 2013-03-19  1:27 UTC (permalink / raw)
  To: Bob Liu
  Cc: Simon Jeons, lsf-pc, linux-mm, Hugh Dickins, Minchan Kim,
	Rik van Riel, dan.magenheimer, sjenning, rcj

On Mon, Mar 18, 2013 at 06:38:29PM +0800, Bob Liu wrote:
> 
> On 03/15/2013 05:39 PM, Simon Jeons wrote:
> > On 01/22/2013 02:53 PM, Shaohua Li wrote:
> >> Hi,
> >>
> >> Because of high density, low power and low price, flash storage (SSD)
> >> is a good
> >> candidate to partially replace DRAM. A quick answer for this is using
> >> SSD as
> >> swap. But Linux swap is designed for slow hard disk storage. There are
> >> a lot of
> >> challenges to efficiently use SSD for swap:
> >>
> >> 1. Lock contentions (swap_lock, anon_vma mutex, swap address space lock)
> >> 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB
> >> flush. This
> >> overhead is very high even in a normal 2-socket machine.
> >> 3. Better swap IO pattern. Both direct and kswapd page reclaim can do
> >> swap,
> >> which makes swap IO pattern is interleave. Block layer isn't always
> >> efficient
> >> to do request merge. Such IO pattern also makes swap prefetch hard.
> >> 4. Swap map scan overhead. Swap in-memory map scan scans an array,
> >> which is
> >> very inefficient, especially if swap storage is fast.
> >> 5. SSD related optimization, mainly discard support
> >> 6. Better swap prefetch algorithm. Besides item 3, sequentially
> >> accessed pages
> >> aren't always in LRU list adjacently, so page reclaim will not swap
> >> such pages
> >> in adjacent storage sectors. This makes swap prefetch hard.
> >> 7. Alternative page reclaim policy to bias reclaiming anonymous page.
> >> Currently reclaim anonymous page is considering harder than reclaim
> >> file pages,
> >> so we bias reclaiming file pages. If there are high speed swap
> >> storage, we are
> >> considering doing swap more aggressively.
> >> 8. Huge page swap. Huge page swap can solve a lot of problems above,
> >> but both
> >> THP and hugetlbfs don't support swap.
> > 
> > Could you tell me in which workload hugetlb/thp pages can't swapout
> > influence your performance? Is it worth?
> > 
> 
> I'm also very interesting in this workload.
> I think hugetlb/thp pages can be a potential user of zprojects like
> zswap/zcache.
> We can try to compress those pages before breaking them to normal pages.

I don't have a particular workload and don't have data, for obvious reasons. What
I expect from swapping out hugetlb/thp pages is to reduce some overheads (e.g.,
TLB flushes) and improve the IO pattern.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [LSF/MM TOPIC]swap improvements for fast SSD
  2013-03-19  1:27     ` Shaohua Li
@ 2013-03-19  1:32       ` Simon Jeons
  2013-03-19  5:57         ` Shaohua Li
  2013-03-19  4:25       ` Wanpeng Li
  2013-03-19  4:25       ` Wanpeng Li
  2 siblings, 1 reply; 31+ messages in thread
From: Simon Jeons @ 2013-03-19  1:32 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Bob Liu, lsf-pc, linux-mm, Hugh Dickins, Minchan Kim,
	Rik van Riel, dan.magenheimer, sjenning, rcj

Hi Shaohua,
On 03/19/2013 09:27 AM, Shaohua Li wrote:
> On Mon, Mar 18, 2013 at 06:38:29PM +0800, Bob Liu wrote:
>> On 03/15/2013 05:39 PM, Simon Jeons wrote:
>>> On 01/22/2013 02:53 PM, Shaohua Li wrote:
>>>> Hi,
>>>>
>>>> Because of high density, low power and low price, flash storage (SSD)
>>>> is a good
>>>> candidate to partially replace DRAM. A quick answer for this is using
>>>> SSD as
>>>> swap. But Linux swap is designed for slow hard disk storage. There are
>>>> a lot of
>>>> challenges to efficiently use SSD for swap:
>>>>
>>>> 1. Lock contentions (swap_lock, anon_vma mutex, swap address space lock)
>>>> 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB
>>>> flush. This
>>>> overhead is very high even in a normal 2-socket machine.
>>>> 3. Better swap IO pattern. Both direct and kswapd page reclaim can do
>>>> swap,
>>>> which makes swap IO pattern is interleave. Block layer isn't always
>>>> efficient
>>>> to do request merge. Such IO pattern also makes swap prefetch hard.
>>>> 4. Swap map scan overhead. Swap in-memory map scan scans an array,
>>>> which is
>>>> very inefficient, especially if swap storage is fast.
>>>> 5. SSD related optimization, mainly discard support
>>>> 6. Better swap prefetch algorithm. Besides item 3, sequentially
>>>> accessed pages
>>>> aren't always in LRU list adjacently, so page reclaim will not swap
>>>> such pages
>>>> in adjacent storage sectors. This makes swap prefetch hard.
>>>> 7. Alternative page reclaim policy to bias reclaiming anonymous page.
>>>> Currently reclaim anonymous page is considering harder than reclaim
>>>> file pages,
>>>> so we bias reclaiming file pages. If there are high speed swap
>>>> storage, we are
>>>> considering doing swap more aggressively.
>>>> 8. Huge page swap. Huge page swap can solve a lot of problems above,
>>>> but both
>>>> THP and hugetlbfs don't support swap.
>>> Could you tell me in which workload hugetlb/thp pages can't swapout
>>> influence your performance? Is it worth?
>>>
>> I'm also very interesting in this workload.
>> I think hugetlb/thp pages can be a potential user of zprojects like
>> zswap/zcache.
>> We can try to compress those pages before breaking them to normal pages.
> I don't have particular workload and don't have data for obvious reason. What I
> expected is swapout hugetlb/thp is to reduce some overheads (eg, tlb flush) and
> improve IO pattern.
Do you have any idea how to implement this feature?



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [LSF/MM TOPIC]swap improvements for fast SSD
  2013-03-19  1:27     ` Shaohua Li
  2013-03-19  1:32       ` Simon Jeons
  2013-03-19  4:25       ` Wanpeng Li
@ 2013-03-19  4:25       ` Wanpeng Li
  2 siblings, 0 replies; 31+ messages in thread
From: Wanpeng Li @ 2013-03-19  4:25 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Bob Liu, Simon Jeons, lsf-pc, linux-mm, Hugh Dickins,
	Minchan Kim, Rik van Riel, dan.magenheimer, sjenning, rcj

On Tue, Mar 19, 2013 at 09:27:25AM +0800, Shaohua Li wrote:
>On Mon, Mar 18, 2013 at 06:38:29PM +0800, Bob Liu wrote:
>> 
>> On 03/15/2013 05:39 PM, Simon Jeons wrote:
>> > On 01/22/2013 02:53 PM, Shaohua Li wrote:
>> >> Hi,
>> >>
>> >> Because of high density, low power and low price, flash storage (SSD)
>> >> is a good
>> >> candidate to partially replace DRAM. A quick answer for this is using
>> >> SSD as
>> >> swap. But Linux swap is designed for slow hard disk storage. There are
>> >> a lot of
>> >> challenges to efficiently use SSD for swap:
>> >>
>> >> 1. Lock contentions (swap_lock, anon_vma mutex, swap address space lock)
>> >> 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB
>> >> flush. This
>> >> overhead is very high even in a normal 2-socket machine.
>> >> 3. Better swap IO pattern. Both direct and kswapd page reclaim can do
>> >> swap,
>> >> which makes swap IO pattern is interleave. Block layer isn't always
>> >> efficient
>> >> to do request merge. Such IO pattern also makes swap prefetch hard.
>> >> 4. Swap map scan overhead. Swap in-memory map scan scans an array,
>> >> which is
>> >> very inefficient, especially if swap storage is fast.
>> >> 5. SSD related optimization, mainly discard support
>> >> 6. Better swap prefetch algorithm. Besides item 3, sequentially
>> >> accessed pages
>> >> aren't always in LRU list adjacently, so page reclaim will not swap
>> >> such pages
>> >> in adjacent storage sectors. This makes swap prefetch hard.
>> >> 7. Alternative page reclaim policy to bias reclaiming anonymous page.
>> >> Currently reclaim anonymous page is considering harder than reclaim
>> >> file pages,
>> >> so we bias reclaiming file pages. If there are high speed swap
>> >> storage, we are
>> >> considering doing swap more aggressively.
>> >> 8. Huge page swap. Huge page swap can solve a lot of problems above,
>> >> but both
>> >> THP and hugetlbfs don't support swap.
>> > 
>> > Could you tell me in which workload hugetlb/thp pages can't swapout
>> > influence your performance? Is it worth?
>> > 
>> 
>> I'm also very interesting in this workload.
>> I think hugetlb/thp pages can be a potential user of zprojects like
>> zswap/zcache.
>> We can try to compress those pages before breaking them to normal pages.
>
>I don't have particular workload and don't have data for obvious reason. What I
>expected is swapout hugetlb/thp is to reduce some overheads (eg, tlb flush) and
>improve IO pattern.

Hi Shaohua and Bob,

I'm doing this work currently. :-)

Regards,
Wanpeng Li 



^ permalink raw reply	[flat|nested] 31+ messages in thread


* Re: [LSF/MM TOPIC]swap improvements for fast SSD
  2013-03-19  1:32       ` Simon Jeons
@ 2013-03-19  5:57         ` Shaohua Li
  2013-03-19  6:10           ` Simon Jeons
  0 siblings, 1 reply; 31+ messages in thread
From: Shaohua Li @ 2013-03-19  5:57 UTC (permalink / raw)
  To: Simon Jeons
  Cc: Bob Liu, lsf-pc, linux-mm, Hugh Dickins, Minchan Kim,
	Rik van Riel, dan.magenheimer, sjenning, rcj

On Tue, Mar 19, 2013 at 09:32:39AM +0800, Simon Jeons wrote:
> Hi Shaohua,
> On 03/19/2013 09:27 AM, Shaohua Li wrote:
> >On Mon, Mar 18, 2013 at 06:38:29PM +0800, Bob Liu wrote:
> >>On 03/15/2013 05:39 PM, Simon Jeons wrote:
> >>>On 01/22/2013 02:53 PM, Shaohua Li wrote:
> >>>>Hi,
> >>>>
> >>>>Because of high density, low power and low price, flash storage (SSD)
> >>>>is a good
> >>>>candidate to partially replace DRAM. A quick answer for this is using
> >>>>SSD as
> >>>>swap. But Linux swap is designed for slow hard disk storage. There are
> >>>>a lot of
> >>>>challenges to efficiently use SSD for swap:
> >>>>
> >>>>1. Lock contentions (swap_lock, anon_vma mutex, swap address space lock)
> >>>>2. TLB flush overhead. To reclaim one page, we need at least 2 TLB
> >>>>flush. This
> >>>>overhead is very high even in a normal 2-socket machine.
> >>>>3. Better swap IO pattern. Both direct and kswapd page reclaim can do
> >>>>swap,
> >>>>which makes swap IO pattern is interleave. Block layer isn't always
> >>>>efficient
> >>>>to do request merge. Such IO pattern also makes swap prefetch hard.
> >>>>4. Swap map scan overhead. Swap in-memory map scan scans an array,
> >>>>which is
> >>>>very inefficient, especially if swap storage is fast.
> >>>>5. SSD related optimization, mainly discard support
> >>>>6. Better swap prefetch algorithm. Besides item 3, sequentially
> >>>>accessed pages
> >>>>aren't always in LRU list adjacently, so page reclaim will not swap
> >>>>such pages
> >>>>in adjacent storage sectors. This makes swap prefetch hard.
> >>>>7. Alternative page reclaim policy to bias reclaiming anonymous page.
> >>>>Currently reclaim anonymous page is considering harder than reclaim
> >>>>file pages,
> >>>>so we bias reclaiming file pages. If there are high speed swap
> >>>>storage, we are
> >>>>considering doing swap more aggressively.
> >>>>8. Huge page swap. Huge page swap can solve a lot of problems above,
> >>>>but both
> >>>>THP and hugetlbfs don't support swap.
> >>>Could you tell me in which workload hugetlb/thp pages can't swapout
> >>>influence your performance? Is it worth?
> >>>
> >>I'm also very interesting in this workload.
> >>I think hugetlb/thp pages can be a potential user of zprojects like
> >>zswap/zcache.
> >>We can try to compress those pages before breaking them to normal pages.
> >I don't have particular workload and don't have data for obvious reason. What I
> >expected is swapout hugetlb/thp is to reduce some overheads (eg, tlb flush) and
> >improve IO pattern.
> Do you have any idea about implement this feature?

I didn't look at hugetlb yet, but for THP, maybe it's overkill to really do 2M
page swapping. My idea is to provide a special version of add_to_swap +
try_to_unmap in page reclaim. We still do the huge page split, but in the split
we also do the 'unmap', to reduce unnecessary TLB flushes. In the split, tail
pages should be added back to the page_list of shrink_page_list() instead of the
LRU list, so tail pages can be paged out soon. In this way, we can use the
existing swap code (not bothering to change arch code or swap space allocation,
for example) and reach my goal (reduce TLB flushes and improve the IO pattern).
That said, I haven't done any coding yet, so this might just be wrong, but I'll
try it some time.
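The intended ordering can be sketched as a simplified, userspace-testable
simulation; the structures and function names below are hypothetical stand-ins
for the kernel's real ones, only meant to show the two properties being aimed
for: a single unmap/TLB flush for the whole THP, and tail pages queued on a
local page_list for immediate pageout rather than returned to the LRU.

```c
#include <assert.h>
#include <stddef.h>

#define THP_NR_PAGES 512            /* 2M THP = 512 x 4K subpages */

/* Hypothetical stand-in for struct page, for illustration only. */
struct page { int queued_for_pageout; };

static int tlb_flushes;             /* counts whole-THP unmap flushes */

/* Unmap the huge PMD once, before splitting: one TLB flush for the
 * whole 2M mapping instead of one per 4K tail page. */
static void thp_unmap(void) { tlb_flushes++; }

/* Split the THP; instead of putting tail pages back on the LRU, append
 * them to the caller's page_list so a shrink_page_list()-style loop can
 * page them out immediately, giving sequential swap IO. */
static size_t thp_split_to_page_list(struct page *pages,
                                     struct page **page_list)
{
    for (size_t i = 0; i < THP_NR_PAGES; i++) {
        pages[i].queued_for_pageout = 1;
        page_list[i] = &pages[i];
    }
    return THP_NR_PAGES;
}

/* Reclaim one THP: unmap once, then split straight onto the pageout list. */
size_t reclaim_thp(struct page *pages, struct page **page_list)
{
    thp_unmap();                    /* one flush covers all 512 subpages */
    return thp_split_to_page_list(pages, page_list);
}
```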


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [LSF/MM TOPIC]swap improvements for fast SSD
  2013-03-19  5:57         ` Shaohua Li
@ 2013-03-19  6:10           ` Simon Jeons
  0 siblings, 0 replies; 31+ messages in thread
From: Simon Jeons @ 2013-03-19  6:10 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Bob Liu, lsf-pc, linux-mm, Hugh Dickins, Minchan Kim,
	Rik van Riel, dan.magenheimer, sjenning, rcj

Hi Shaohua,
On 03/19/2013 01:57 PM, Shaohua Li wrote:
> On Tue, Mar 19, 2013 at 09:32:39AM +0800, Simon Jeons wrote:
>> Hi Shaohua,
>> On 03/19/2013 09:27 AM, Shaohua Li wrote:
>>> On Mon, Mar 18, 2013 at 06:38:29PM +0800, Bob Liu wrote:
>>>> On 03/15/2013 05:39 PM, Simon Jeons wrote:
>>>>> On 01/22/2013 02:53 PM, Shaohua Li wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Because of high density, low power and low price, flash storage (SSD)
>>>>>> is a good
>>>>>> candidate to partially replace DRAM. A quick answer for this is using
>>>>>> SSD as
>>>>>> swap. But Linux swap is designed for slow hard disk storage. There are
>>>>>> a lot of
>>>>>> challenges to efficiently use SSD for swap:
>>>>>>
>>>>>> 1. Lock contentions (swap_lock, anon_vma mutex, swap address space lock)
>>>>>> 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB
>>>>>> flush. This
>>>>>> overhead is very high even in a normal 2-socket machine.
>>>>>> 3. Better swap IO pattern. Both direct and kswapd page reclaim can do
>>>>>> swap,
>>>>>> which makes swap IO pattern is interleave. Block layer isn't always
>>>>>> efficient
>>>>>> to do request merge. Such IO pattern also makes swap prefetch hard.
>>>>>> 4. Swap map scan overhead. Swap in-memory map scan scans an array,
>>>>>> which is
>>>>>> very inefficient, especially if swap storage is fast.
>>>>>> 5. SSD related optimization, mainly discard support
>>>>>> 6. Better swap prefetch algorithm. Besides item 3, sequentially
>>>>>> accessed pages
>>>>>> aren't always in LRU list adjacently, so page reclaim will not swap
>>>>>> such pages
>>>>>> in adjacent storage sectors. This makes swap prefetch hard.
>>>>>> 7. Alternative page reclaim policy to bias reclaiming anonymous page.
>>>>>> Currently reclaim anonymous page is considering harder than reclaim
>>>>>> file pages,
>>>>>> so we bias reclaiming file pages. If there are high speed swap
>>>>>> storage, we are
>>>>>> considering doing swap more aggressively.
>>>>>> 8. Huge page swap. Huge page swap can solve a lot of problems above,
>>>>>> but both
>>>>>> THP and hugetlbfs don't support swap.
>>>>> Could you tell me in which workload hugetlb/thp pages can't swapout
>>>>> influence your performance? Is it worth?
>>>>>
>>>> I'm also very interesting in this workload.
>>>> I think hugetlb/thp pages can be a potential user of zprojects like
>>>> zswap/zcache.
>>>> We can try to compress those pages before breaking them to normal pages.
>>> I don't have particular workload and don't have data for obvious reason. What I
>>> expected is swapout hugetlb/thp is to reduce some overheads (eg, tlb flush) and
>>> improve IO pattern.
>> Do you have any idea about implement this feature?
> Didn't look at hugetlb yet, but for THP, maybe it's an overkill to really do 2M
> page swapping. My idea is to provide a special version of add_to_swap +
> try_to_unmap in page reclaim. We still do huge page split, but in the split, we
> also do 'unmap' to reduce unnecessary TLB flush. In the split, tail pages
> should be added back to page_list of shrink_page_list() instead of lru list, so
> tail pages can be pageout soon. In this way, we can use existing swap code (not
> bothering changing arch code and swap space allocation for example) and reach
> my goal (reduce tlb flush and improve IO pattern). But that said, I didn't do
> any coding yet, this might be just wrong actually, but I'll try some time.

What will happen at swap-in time?



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [LSF/MM TOPIC]swap improvements for fast SSD
  2013-01-23  7:58 ` Minchan Kim
                     ` (3 preceding siblings ...)
  2013-01-24  9:09   ` Simon Jeons
@ 2013-04-05  0:17   ` Simon Jeons
  2013-04-05  8:08     ` Minchan Kim
  4 siblings, 1 reply; 31+ messages in thread
From: Simon Jeons @ 2013-04-05  0:17 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Shaohua Li, lsf-pc, linux-mm, Rik van Riel

Hi Minchan,
On 01/23/2013 03:58 PM, Minchan Kim wrote:
> On Tue, Jan 22, 2013 at 02:53:41PM +0800, Shaohua Li wrote:
>> Hi,
>>
>> Because of high density, low power and low price, flash storage (SSD) is a good
>> candidate to partially replace DRAM. A quick answer for this is using SSD as
>> swap. But Linux swap is designed for slow hard disk storage. There are a lot of
>> challenges to efficiently use SSD for swap:
> Many of below item could be applied in in-memory swap like zram, zcache.
>
>> 1. Lock contentions (swap_lock, anon_vma mutex, swap address space lock)
>> 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB flush. This
>> overhead is very high even in a normal 2-socket machine.
>> 3. Better swap IO pattern. Both direct and kswapd page reclaim can do swap,
>> which makes swap IO pattern is interleave. Block layer isn't always efficient
>> to do request merge. Such IO pattern also makes swap prefetch hard.
> Agreed.
>
>> 4. Swap map scan overhead. Swap in-memory map scan scans an array, which is
>> very inefficient, especially if swap storage is fast.
> Agreed.
>
>> 5. SSD related optimization, mainly discard support
>> 6. Better swap prefetch algorithm. Besides item 3, sequentially accessed pages
>> aren't always in LRU list adjacently, so page reclaim will not swap such pages
>> in adjacent storage sectors. This makes swap prefetch hard.
> One of problem is LRU churning and I wanted to try to fix it.
> http://marc.info/?l=linux-mm&m=130978831028952&w=4

I'm interested in this feature. Why wasn't it merged? Was there a fatal
issue in your patchset?
http://lwn.net/Articles/449866/
You mentioned a test script and an all-at-once patch, but I can't get them
from that URL; could you tell me how to get them?

>
>> 7. Alternative page reclaim policy to bias reclaiming anonymous page.
>> Currently reclaim anonymous page is considering harder than reclaim file pages,
>> so we bias reclaiming file pages. If there are high speed swap storage, we are
>> considering doing swap more aggressively.
> Yeb. We need it. I tried it with extending vm_swappiness to 200.
>
> From: Minchan Kim <minchan@kernel.org>
> Date: Mon, 3 Dec 2012 16:21:00 +0900
> Subject: [PATCH] mm: increase swappiness to 200
>
> We have thought swap out cost is very high but it's not true
> if we use fast device like swap-over-zram. Nonetheless, we can
> swap out 1:1 ratio of anon and page cache at most.
> It's not enough to use swap device fully so we encounter OOM kill
> while there are many free space in zram swap device. It's never
> what we want.
>
> This patch makes swap out aggressively.
>
> Cc: Luigi Semenzato <semenzato@google.com>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>   kernel/sysctl.c |    3 ++-
>   mm/vmscan.c     |    6 ++++--
>   2 files changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 693e0ed..f1dbd9d 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -130,6 +130,7 @@ static int __maybe_unused two = 2;
>   static int __maybe_unused three = 3;
>   static unsigned long one_ul = 1;
>   static int one_hundred = 100;
> +extern int max_swappiness;
>   #ifdef CONFIG_PRINTK
>   static int ten_thousand = 10000;
>   #endif
> @@ -1157,7 +1158,7 @@ static struct ctl_table vm_table[] = {
>                  .mode           = 0644,
>                  .proc_handler   = proc_dointvec_minmax,
>                  .extra1         = &zero,
> -               .extra2         = &one_hundred,
> +               .extra2         = &max_swappiness,
>          },
>   #ifdef CONFIG_HUGETLB_PAGE
>          {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 53dcde9..64f3c21 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -53,6 +53,8 @@
>   #define CREATE_TRACE_POINTS
>   #include <trace/events/vmscan.h>
>   
> +int max_swappiness = 200;
> +
>   struct scan_control {
>          /* Incremented by the number of inactive pages that were scanned */
>          unsigned long nr_scanned;
> @@ -1626,6 +1628,7 @@ static int vmscan_swappiness(struct scan_control *sc)
>          return mem_cgroup_swappiness(sc->target_mem_cgroup);
>   }
>   
> +
>   /*
>    * Determine how aggressively the anon and file LRU lists should be
>    * scanned.  The relative value of each set of LRU lists is determined
> @@ -1701,11 +1704,10 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>          }
>   
>          /*
> -        * With swappiness at 100, anonymous and file have the same priority.
>           * This scanning priority is essentially the inverse of IO cost.
>           */
>          anon_prio = vmscan_swappiness(sc);
> -       file_prio = 200 - anon_prio;
> +       file_prio = max_swappiness - anon_prio;
>   
>          /*
>           * OK, so we have swap space and a fair amount of page cache
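The effect of the quoted patch can be checked with a little arithmetic: in the
stock kernel, file_prio = 200 - anon_prio, so swappiness 100 gives anon and file
LRUs equal weight and the default of 60 still biases file reclaim. With
max_swappiness raised to 200, setting vm.swappiness to 200 drives file_prio to
0, i.e. reclaim scans anon pages exclusively. A minimal standalone sketch
(variable names mirror get_scan_count(), but this is just the formula, not
kernel code):

```c
/* Standalone arithmetic mirroring get_scan_count()'s priority split,
 * as modified by the quoted patch. */
static const int max_swappiness = 200;

/* Returns the file-LRU scan priority for a given vm.swappiness value. */
int file_prio_for(int swappiness)
{
    int anon_prio = swappiness;           /* anon weight == swappiness */
    return max_swappiness - anon_prio;    /* remainder goes to file LRU */
}
```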



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [LSF/MM TOPIC]swap improvements for fast SSD
  2013-04-05  0:17   ` Simon Jeons
@ 2013-04-05  8:08     ` Minchan Kim
  0 siblings, 0 replies; 31+ messages in thread
From: Minchan Kim @ 2013-04-05  8:08 UTC (permalink / raw)
  To: Simon Jeons; +Cc: Shaohua Li, lsf-pc, linux-mm, Rik van Riel

On Fri, Apr 05, 2013 at 08:17:00AM +0800, Simon Jeons wrote:
> Hi Minchan,
> On 01/23/2013 03:58 PM, Minchan Kim wrote:
> >On Tue, Jan 22, 2013 at 02:53:41PM +0800, Shaohua Li wrote:
> >>Hi,
> >>
> >>Because of high density, low power and low price, flash storage (SSD) is a good
> >>candidate to partially replace DRAM. A quick answer for this is using SSD as
> >>swap. But Linux swap is designed for slow hard disk storage. There are a lot of
> >>challenges to efficiently use SSD for swap:
> >Many of below item could be applied in in-memory swap like zram, zcache.
> >
> >>1. Lock contentions (swap_lock, anon_vma mutex, swap address space lock)
> >>2. TLB flush overhead. To reclaim one page, we need at least 2 TLB flush. This
> >>overhead is very high even in a normal 2-socket machine.
> >>3. Better swap IO pattern. Both direct and kswapd page reclaim can do swap,
> >>which makes swap IO pattern is interleave. Block layer isn't always efficient
> >>to do request merge. Such IO pattern also makes swap prefetch hard.
> >Agreed.
> >
> >>4. Swap map scan overhead. Swap in-memory map scan scans an array, which is
> >>very inefficient, especially if swap storage is fast.
> >Agreed.
> >
> >>5. SSD related optimization, mainly discard support
> >>6. Better swap prefetch algorithm. Besides item 3, sequentially accessed pages
> >>aren't always in LRU list adjacently, so page reclaim will not swap such pages
> >>in adjacent storage sectors. This makes swap prefetch hard.
> >One of problem is LRU churning and I wanted to try to fix it.
> >http://marc.info/?l=linux-mm&m=130978831028952&w=4
> 
> I'm interested in this feature, why it didn't merged? what's the
> fatal issue in your patchset?
> http://lwn.net/Articles/449866/

There wasn't any fatal issue, AFAIRC, but some people had concerns about the
balance between code complexity and benefit; it dragged on for a long time and
I lost interest.

> You mentioned test script and all-at-once patch, but I can't get
> them from the URL, could you tell me how to get it?

You can google it; google will find it in a few seconds.

http://www.filewatcher.com/b/ftp/ftp.cs.huji.ac.il/mirror/linux/kernel/linux/kernel/people/minchan/inorder_putback/v4-0.html

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [LSF/MM TOPIC]swap improvements for fast SSD
  2013-01-22  6:53 [LSF/MM TOPIC]swap improvements for fast SSD Shaohua Li
                   ` (3 preceding siblings ...)
  2013-03-15  9:39 ` Simon Jeons
@ 2013-04-28  8:12 ` Simon Jeons
  4 siblings, 0 replies; 31+ messages in thread
From: Simon Jeons @ 2013-04-28  8:12 UTC (permalink / raw)
  To: Shaohua Li; +Cc: lsf-pc, linux-mm, Shaohua Li

Hi Shaohua,
On 01/22/2013 02:53 PM, Shaohua Li wrote:
> Hi,
>
> Because of high density, low power and low price, flash storage (SSD) is a good
> candidate to partially replace DRAM. A quick answer for this is using SSD as
> swap. But Linux swap is designed for slow hard disk storage. There are a lot of
> challenges to efficiently use SSD for swap:
>
> 1. Lock contentions (swap_lock, anon_vma mutex, swap address space lock)
> 2. TLB flush overhead. To reclaim one page, we need at least 2 TLB flush. This
> overhead is very high even in a normal 2-socket machine.

Why at least 2 TLB flushes instead of one?


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org


* Re: [LSF/MM TOPIC]swap improvements for fast SSD
  2013-01-23 23:05 ` Dan Magenheimer
@ 2013-01-24  2:11   ` Shaohua Li
  0 siblings, 0 replies; 31+ messages in thread
From: Shaohua Li @ 2013-01-24  2:11 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: shli, linux-mm

On Wed, Jan 23, 2013 at 03:05:22PM -0800, Dan Magenheimer wrote:
> I would be very interested in this topic.
> 
> > Because of their high density, low power and low price, flash storage (SSD)
> > is a good candidate to partially replace DRAM. A quick answer is to use SSD
> > as swap. But Linux swap is designed for slow hard disk storage, so there are
> > a lot of challenges to using SSD efficiently for swap:
> > 
> > 1. Lock contention (swap_lock, anon_vma mutex, swap address space lock)
> > 2. TLB flush overhead. To reclaim one page we need at least two TLB flushes.
> > This overhead is very high even on an ordinary 2-socket machine.
> > 3. Better swap IO pattern. Both direct and kswapd page reclaim can do swap,
> > which makes the swap IO pattern interleaved. The block layer isn't always
> > able to merge such requests efficiently; this pattern also makes swap
> > prefetch hard.
> 
> Shaohua --
> 
> Have you considered the possibility of bypassing the block layer entirely
> and accessing the SSD like slow RAM rather than as a fast I/O device?  E.g.
> something like NVMe, as in this paper?
> 
> http://static.usenix.org/events/fast12/tech/full_papers/Yang.pdf 
> 
> If you think this could be an option, it could make a very
> interesting backend to frontswap (something like ramster).

We had a discussion about this before, but it looks like it requires very low
latency storage, so we haven't taken it seriously yet.

Thanks,
Shaohua


* RE: [LSF/MM TOPIC]swap improvements for fast SSD
       [not found] <766b9855-adf5-47ce-9484-971f88ff0e54@default>
@ 2013-01-23 23:05 ` Dan Magenheimer
  2013-01-24  2:11   ` Shaohua Li
  0 siblings, 1 reply; 31+ messages in thread
From: Dan Magenheimer @ 2013-01-23 23:05 UTC (permalink / raw)
  To: shli; +Cc: linux-mm

I would be very interested in this topic.

> Because of their high density, low power and low price, flash storage (SSD)
> is a good candidate to partially replace DRAM. A quick answer is to use SSD
> as swap. But Linux swap is designed for slow hard disk storage, so there are
> a lot of challenges to using SSD efficiently for swap:
> 
> 1. Lock contention (swap_lock, anon_vma mutex, swap address space lock)
> 2. TLB flush overhead. To reclaim one page we need at least two TLB flushes.
> This overhead is very high even on an ordinary 2-socket machine.
> 3. Better swap IO pattern. Both direct and kswapd page reclaim can do swap,
> which makes the swap IO pattern interleaved. The block layer isn't always
> able to merge such requests efficiently; this pattern also makes swap
> prefetch hard.

Shaohua --

Have you considered the possibility of bypassing the block layer entirely
and accessing the SSD like slow RAM rather than as a fast I/O device?  E.g.
something like NVMe, as in this paper?

http://static.usenix.org/events/fast12/tech/full_papers/Yang.pdf 

If you think this could be an option, it could make a very
interesting backend to frontswap (something like ramster).

Dan


Thread overview: 31+ messages
2013-01-22  6:53 [LSF/MM TOPIC]swap improvements for fast SSD Shaohua Li
2013-01-23  7:58 ` Minchan Kim
2013-01-23 19:04   ` Seth Jennings
2013-01-24  1:40     ` Minchan Kim
2013-01-24  8:29       ` Simon Jeons
2013-01-24  2:02   ` Shaohua Li
2013-01-24  7:52   ` Simon Jeons
2013-01-24  9:09   ` Simon Jeons
2013-01-26  4:40     ` Kyungmin Park
2013-01-27  0:26       ` Simon Jeons
2013-01-27 14:18       ` Shaohua Li
2013-01-28  7:37         ` Kyungmin Park
2013-02-01 12:37           ` Kyungmin Park
2013-02-04  4:56         ` Hugh Dickins
2013-02-19  6:15           ` Shaohua Li
2013-02-19 19:41             ` Hugh Dickins
2013-04-05  0:17   ` Simon Jeons
2013-04-05  8:08     ` Minchan Kim
2013-01-23 16:56 ` Seth Jennings
2013-01-24  6:28 ` Simon Jeons
2013-03-15  9:39 ` Simon Jeons
2013-03-18 10:38   ` Bob Liu
2013-03-19  1:27     ` Shaohua Li
2013-03-19  1:32       ` Simon Jeons
2013-03-19  5:57         ` Shaohua Li
2013-03-19  6:10           ` Simon Jeons
2013-03-19  4:25       ` Wanpeng Li
2013-03-19  4:25       ` Wanpeng Li
2013-04-28  8:12 ` Simon Jeons
     [not found] <766b9855-adf5-47ce-9484-971f88ff0e54@default>
2013-01-23 23:05 ` Dan Magenheimer
2013-01-24  2:11   ` Shaohua Li
