From mboxrd@z Thu Jan 1 00:00:00 1970
From: Arnd Bergmann
To: linaro-kernel@lists.linaro.org
Cc: Hugh Dickins, Rik van Riel, "linux-mmc@vger.kernel.org",
    Alex Lemberg, linux-kernel@vger.kernel.org,
    "Luca Porzio (lporzio)", linux-mm@kvack.org, Hyojin Jeong,
    kernel-team@android.com, Yejin Moon
Subject: Re: swap on eMMC and other flash
Date: Mon, 2 Apr 2012 11:45:42 +0000
Message-Id: <201204021145.43222.arnd@arndb.de>
References: <201203301744.16762.arnd@arndb.de> <201203301850.22784.arnd@arndb.de>

On Saturday 31 March 2012, Hugh Dickins wrote:
> On Fri, 30 Mar 2012, Arnd Bergmann wrote:
> > On Friday 30 March 2012, Arnd Bergmann wrote:
>
> My suspicion is that we suffer a lot from the "distance" between when
> we allocate swap space (add_to_swap getting the swp_entry_t to replace
> ptes by) and when we finally decide to write out a page (swap_writepage):
> intervening decisions can jumble the sequence badly.
>
> I've not investigated to confirm that, but certainly it was the case two
> or three years ago, that we got much better behaviour in swapping shmem
> to flash, when we stopped giving it a second pass round the lru, which
> used to come in between the allocation and the writeout.
>
> I believe that you'll want to start by implementing something like what
> Rik set out a year ago in the mail appended below. Adding another layer
> of indirection isn't always a pure win, and I think none of us have taken
> it any further since then; but sooner or later we shall need to, and your
> flash case might be just the prod needed.

Thanks a lot for that pointer, that certainly sounds interesting. I guess
we should first investigate in what order the pages normally get written
out to flash. If they are not in strict sequence order, the other
improvements I suggested would be less effective as well.

Note that I'm not at all worried about reading pages back in from flash
out of order. That tends to be harmless because reads are much rarer than
writes on swap, and because only random writes require garbage collection
inside of the flash (occasionally forcing delays of up to 500 ms on a
single write), while reads are always uniformly fast.

> > 2) Make variable sized swap clusters. Right now, the swap space is
> > organized in clusters of 256 pages (1 MB), which is less than the
> > typical erase block size of 4 or 8 MB. We should try to make the swap
> > cluster aligned to erase blocks and have the size match to avoid
> > garbage collection in the drive. The cluster size would typically be
> > set by mkswap as a new option and interpreted at swapon time.
>
> That gets to sound more flash-specific, and I feel less enthusiastic
> about doing things in bigger and bigger lumps. But if it really proves
> to be of benefit, it's easy enough to let you.
>
> Decide the cluster size at mkswap time, or at swapon time, or by
> /sys/block/sda/queue parameters? Perhaps a /sys parameter should give
> the size, but a swapon flag decide whether to participate or not.

Perhaps. I was thinking of mkswap time, because the erase block size is
specific to the storage hardware and there is no reason to ever change it
at run time, and we cannot always easily probe the value from hardware
registers (USB doesn't have the data, in SD cards it's usually wrong, and
in eMMC it's sometimes wrong).
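To make that concrete, here is a rough sketch of what I have in mind.
The cluster_pages field does not exist in the real swap header (the
actual info block carries version/last_page/nr_badpages plus uuid/label
and a bad-page list); it is invented here, along with the helper name
and the idea of a new mkswap option, purely to illustrate recording the
erase block size at mkswap time so that swapon can size clusters from it:

/*
 * Hypothetical extension of the swap signature page written by
 * mkswap. Illustration only; cluster_pages is not a real field.
 */
#include <stdint.h>

struct swap_header_info {
	uint32_t version;
	uint32_t last_page;
	uint32_t nr_badpages;
	uint32_t cluster_pages;	/* hypothetical: erase block size in pages */
};

/*
 * mkswap side: convert an erase block size given in bytes (say, via
 * a new command line option) into a page count to store in the
 * header; swapon would read it back and use it as the cluster size.
 */
static uint32_t erase_block_to_pages(uint64_t erase_bytes, uint32_t page_size)
{
	return (uint32_t)(erase_bytes / page_size);	/* 4 MB / 4 KB = 1024 */
}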
I should also mention that the erase block size is not always a power of
two: some drives that use TLC flash have three times the erase block size
of the equivalent SLC flash, e.g. 3 MB or 6 MB. I don't think that's a
problem, but I might be missing something here. I have also encountered
a few older drives that use completely arbitrary erase block sizes, but
they are very rare.

Also, I'm unsure what the largest cluster size is that we can
realistically support. 8 MB already sounds fairly large, especially on
systems that have less than 1 GB of RAM, as most of the ARM machines
today do.

For shingled hard drives, we would get very similar behavior to flash
media, but the chunks would be even larger, on the order of 64 MB. If we
can make those work, the scheme would no longer be specific to flash,
but it would also be a lot harder to do.
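The non-power-of-two sizes should indeed be harmless: cluster alignment
just has to use plain division instead of the usual bit-masking tricks.
A minimal sketch (the function names are illustrative, not existing
kernel code), using a 3 MB erase block of 4 KB pages:

/*
 * Align swap page offsets to erase-block-sized clusters when the
 * cluster size is not a power of two, e.g. 768 pages for a 3 MB
 * TLC erase block. Division replaces the bit masks that only work
 * for power-of-two cluster sizes.
 */
static unsigned long cluster_start(unsigned long offset,
				   unsigned long cluster_pages)
{
	return (offset / cluster_pages) * cluster_pages;
}

static unsigned long cluster_next(unsigned long offset,
				  unsigned long cluster_pages)
{
	return cluster_start(offset, cluster_pages) + cluster_pages;
}

/* With cluster_pages = 768 (3 MB / 4 KB), offset 1000 falls in the
 * cluster starting at 768, and the next cluster begins at 1536. */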
> > 3) As Luca points out, some eMMC media would benefit significantly
> > from having discard requests issued for every page that gets freed
> > from the swap cache, rather than at the time just before we reuse a
> > swap cluster. This would probably have to become a configurable
> > option as well, to avoid the overhead of sending the discard requests
> > on media that don't benefit from this.
>
> I'm surprised, I wouldn't have contemplated a discard per page;
> but if you have cases where it can be proved of benefit, fine.
> I know nothing at all of eMMC.

My understanding is that some devices can arbitrarily map between
physical flash pages (typically 4, 8, or 16 KB) and logical sector
numbers, instead of remapping at the much larger erase block granularity.
In those cases, it makes sense to free up as many pages as possible on
the drive, in order to give the hardware more room to reorganize itself
and do background defragmentation of its free space.

> Though as things stand, that swap_lock spinlock makes it difficult
> to find a good safe moment to issue a discard (you want the spinlock
> to keep it safe, but you don't want to issue "I/O" while holding a
> spinlock). Perhaps that difficulty can be overcome in a satisfactory
> way, in the course of restructuring swap allocation as Rik set out
> (Rik suggests freeing on swapin, that should make it very easy).

Luca was suggesting to use the disk->fops->swap_slot_free_notify
callback from swap_entry_free(), which is currently only used in zram,
but you're right, that would not work. Another option would be batched
discard as we do it for file systems: occasionally stop writing to swap
space, scan for areas that have become available since the last discard,
then send discard commands for those.
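A minimal sketch of what that batched scan could look like, assuming a
simplified view where the swap map is an array of use counts and
issue_discard() stands in for building and submitting the actual discard
request (both names are illustrative, not current kernel interfaces):

/* Stand-in for building and submitting a discard request. */
extern void issue_discard(unsigned long start_slot, unsigned long nr_slots);

/*
 * Walk the swap map, find contiguous runs of free slots, and send
 * one discard per run instead of one per freed page. map[i] == 0
 * means slot i is free, mirroring how swap_map counts users.
 */
static void batched_swap_discard(const unsigned char *map,
				 unsigned long nr_slots)
{
	unsigned long i = 0;

	while (i < nr_slots) {
		unsigned long start;

		/* skip slots that are still in use */
		while (i < nr_slots && map[i] != 0)
			i++;
		start = i;
		/* extend over the run of free slots */
		while (i < nr_slots && map[i] == 0)
			i++;
		if (i > start)
			issue_discard(start, i - start);
	}
}

This would run with writes to the swap device quiesced, so the scan
cannot race against new allocations.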
	Arnd