* discard feature, mkfs.ext4 and mmc default fallback to normal erase op @ 2020-12-07 15:10 Michael Walle 2020-12-07 18:35 ` Theodore Y. Ts'o 0 siblings, 1 reply; 11+ messages in thread From: Michael Walle @ 2020-12-07 15:10 UTC (permalink / raw) To: linux-ext4, linux-mmc, linux-block Hi, The problem I'm having is that I'm trying to install debian on an embedded system onto an sdcard. During installation it will format the target filesystem, but the "mkfs.ext4 -F /dev/mmcblk0p2" takes ages. What I've found out so far: - mkfs.ext4 tries to discard all blocks on the target device - with my target device being an sdcard it seems to fallback to normal erase [1], with erase_arg being set to what the card is capable of [2] Now I'm trying to figure out if this behavior is intended. I guess one can reduce it to "blkdiscard /dev/mmcblk0p2". Should this actually fall back to normal erasing or should it return -EOPNOTSUPP? -michael [1] https://elixir.bootlin.com/linux/v5.9.12/source/drivers/mmc/core/block.c#L1063 [2] https://elixir.bootlin.com/linux/v5.9.12/source/drivers/mmc/core/mmc.c#L1751 ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: discard feature, mkfs.ext4 and mmc default fallback to normal erase op 2020-12-07 15:10 discard feature, mkfs.ext4 and mmc default fallback to normal erase op Michael Walle @ 2020-12-07 18:35 ` Theodore Y. Ts'o 2020-12-07 20:39 ` Michael Walle 0 siblings, 1 reply; 11+ messages in thread From: Theodore Y. Ts'o @ 2020-12-07 18:35 UTC (permalink / raw) To: Michael Walle; +Cc: linux-ext4, linux-mmc, linux-block On Mon, Dec 07, 2020 at 04:10:27PM +0100, Michael Walle wrote: > Hi, > > The problem I'm having is that I'm trying to install debian on > an embedded system onto an sdcard. During installation it will > format the target filesystem, but the "mkfs.ext4 -F /dev/mmcblk0p2" > takes ages. > > What I've found out so far: > - mkfs.ext4 tries to discard all blocks on the target device > - with my target device being an sdcard it seems to fallback > to normal erase [1], with erase_arg being set to what the card > is capable of [2] > > Now I'm trying to figure out if this behavior is intended. I guess > one can reduce it to "blkdiscard /dev/mmcblk0p2". Should this > actually fall back to normal erasing or should it return -EOPNOTSUPP? There are three different MMC commands which are defined: 1) DISCARD 2) ERASE 3) SECURE ERASE The first two are expected to be fast, since it only involves clearing some metadata fields in the Flash Translation Layer (FTL), so that the LBA's in the specified range are no longer mapped to a flash page. The difference between "discard" and "erase" is that "discard" is a hint, so the device is allowed to ignore it whenever it wants (in practice, if it's busy doing a GC, or if it's busy writing back blocks in its writeback cache). "Erase" is guaranteed to work, in that after an erase, a read from a specified sector MUST return all zeros, but that can easily be done by redirecting a point in the FTL metadata. 
"Secure Erase" is the one which can be slow, since it requires physically zeroing all of the flash pages (although if the device is self-encrypting, this in theory could also be fast if you're doing a secure erase at the granularity of the device's encryption keys, so all it needs to do is to regenerate the crypto key). It sounds like your SD card is implementing the "erase" command in a particularly non-optimal way. If it's common, perhaps we need some kind of blacklist for drivers with badly implemented erase commands. As a workaround, you can run mke2fs with the command-line option "-E discard=0". Cheers, - Ted P.S. If your SD card got "erase" wrong, I'd be a little worried about what else the FTL implementation may have screwed up. So you want to under simply getting a different SD card --- especially if this is something that you plan to distribute as a product to downstream customers. In general, low-end flash needs to be very carefully qualified to make sure they are competently implemented if you plan to deploy in large quantities. An example of what happen if this qualification process is not done: https://insideevs.com/news/376037/tesla-mcu-emmc-memory-issue/ Tesla is currently under investigation by the National Highway Traffic Safety Administration due to cheaping out on their eMMC flash (probably just a few pennies per unit). Given that customers are having to pay $1500 to replace their engine controller out of warranty (and the NHTSA is considering whether or not to force Tesla to eat the costs, as opposed to forcing their customers to pay $$$), that's an example of false economy.... ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: discard feature, mkfs.ext4 and mmc default fallback to normal erase op 2020-12-07 18:35 ` Theodore Y. Ts'o @ 2020-12-07 20:39 ` Michael Walle 2020-12-08 2:40 ` Theodore Y. Ts'o 0 siblings, 1 reply; 11+ messages in thread From: Michael Walle @ 2020-12-07 20:39 UTC (permalink / raw) To: Theodore Y. Ts'o; +Cc: linux-ext4, linux-mmc, linux-block Hi Ted, Am 2020-12-07 19:35, schrieb Theodore Y. Ts'o: > On Mon, Dec 07, 2020 at 04:10:27PM +0100, Michael Walle wrote: >> Hi, >> >> The problem I'm having is that I'm trying to install debian on >> an embedded system onto an sdcard. During installation it will >> format the target filesystem, but the "mkfs.ext4 -F /dev/mmcblk0p2" >> takes ages. >> >> What I've found out so far: >> - mkfs.ext4 tries to discard all blocks on the target device >> - with my target device being an sdcard it seems to fallback >> to normal erase [1], with erase_arg being set to what the card >> is capable of [2] >> >> Now I'm trying to figure out if this behavior is intended. I guess >> one can reduce it to "blkdiscard /dev/mmcblk0p2". Should this >> actually fall back to normal erasing or should it return -EOPNOTSUPP? > > There are three different MMC commands which are defined: > > 1) DISCARD > 2) ERASE > 3) SECURE ERASE > > The first two are expected to be fast, since it only involves clearing > some metadata fields in the Flash Translation Layer (FTL), so that the > LBA's in the specified range are no longer mapped to a flash page. Mh, where is it specified that the erase command is fast? According to the Physical Layer Simplified Specification Version 8.00: The actual erase time may be quite long, and the host may issue CMD7 to deselect the card or perform card disconnection, as described in the Block Write section, above. Honest question. Also reading "4.14 Erase Timeout Calculation" doesn't sound that it is fast. 
Also there is this comment: https://elixir.bootlin.com/linux/v5.9.12/source/drivers/mmc/core/core.c#L1495 > The difference between "discard" and "erase" is that "discard" is a > hint, so the device is allowed to ignore it whenever it wants (in > practice, if it's busy doing a GC, or if it's busy writing back blocks > in its writeback cache). "Erase" is guaranteed to work, in that after > an erase, a read from a specified sector MUST return all zeros, but > that can easily be done by redirecting a point in the FTL metadata. > > "Secure Erase" is the one which can be slow, since it requires > physically zeroing all of the flash pages (although if the device is > self-encrypting, this in theory could also be fast if you're doing a > secure erase at the granularity of the device's encryption keys, so > all it needs to do is to regenerate the crypto key). > > It sounds like your SD card is implementing the "erase" command in a > particularly non-optimal way. If it's common, perhaps we need some > kind of blacklist for drivers with badly implemented erase commands. > As a workaround, you can run mke2fs with the command-line option "-E > discard=0". I've already tested that "mkfs.ext4 -E nodiscard" is fast (or works in the same way as before the pre-discard feature). But I wouldn't say it is a cheapo card (Toshiba Exceria). I cannot guarantee that it isn't a Chinese clone, but it looks authentic ;) > P.S. If your SD card got "erase" wrong, I'd be a little worried about > what else the FTL implementation may have screwed up. So you might want to > consider simply getting a different SD card --- especially if this is > something that you plan to distribute as a product to downstream > customers. In general, low-end flash needs to be very carefully > qualified to make sure they are competently implemented if you plan to > deploy in large quantities. 
An example of what happens if this > qualification process is not done: > > https://insideevs.com/news/376037/tesla-mcu-emmc-memory-issue/ > > Tesla is currently under investigation by the National Highway Traffic > Safety Administration due to cheaping out on their eMMC flash > (probably just a few pennies per unit). Given that customers are > having to pay $1500 to replace their engine controller out of warranty > (and the NHTSA is considering whether or not to force Tesla to eat the > costs, as opposed to forcing their customers to pay $$$), that's an > example of false economy.... Yeah I'm aware of the Tesla eMMC wear-out problem. But I'm looking at this mainly from a user's point of view. Take our product, where users can freely choose their SD card, only to then notice that installing their distribution is painfully slow. So I'm interested in understanding the implications: can the erase command really be assumed to be fast? -michael ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: discard feature, mkfs.ext4 and mmc default fallback to normal erase op 2020-12-07 20:39 ` Michael Walle @ 2020-12-08 2:40 ` Theodore Y. Ts'o 2020-12-08 9:49 ` Ulf Hansson 0 siblings, 1 reply; 11+ messages in thread From: Theodore Y. Ts'o @ 2020-12-08 2:40 UTC (permalink / raw) To: Michael Walle; +Cc: linux-ext4, linux-mmc, linux-block On Mon, Dec 07, 2020 at 09:39:32PM +0100, Michael Walle wrote: > > There are three different MMC commands which are defined: > > > > 1) DISCARD > > 2) ERASE > > 3) SECURE ERASE > > > > The first two are expected to be fast, since it only involves clearing > > some metadata fields in the Flash Translation Layer (FTL), so that the > > LBA's in the specified range are no longer mapped to a flash page. > > Mh, where is it specified that the erase command is fast? According > to the Physical Layer Simplified Specification Version 8.00: > > The actual erase time may be quite long, and the host may issue CMD7 > to deselect the card or perform card disconnection, as described in > the Block Write section, above. I looked at the eMMC specification from JEDEC (JESD84-A44) and there, both the "erase" and "trim" are specified that the work is to be queued to be done at a time which is convenient to the controller (read: FTL). This is in contrast to the "secure erase" and "secure trim" commands, where the erasing has to be done NOW NOW NOW for "high security applications". The only difference between "erase" and "trim" seems to be that erase has to be done in units of the "erase groups" which is typically larger than the "write pages" which is the granularity required by the trim command. There is also a comment that when you are erasing the entire partition, "erase" is preferred over "trim". (Presumably because it is more convenient? The spec is not clear.) Unfortunately, the SD Card spec and the eMMC spec both read like they were written by a standards committee stacked by hardware engineers. 
It doesn't look like they had file system engineers in the room, because the distinctions between "erase" and "trim" are pretty silly, and not well defined. Aside from what I wrote, the spec is remarkably silent about what the host OS can depend upon. From the fs perspective, what we care about is whether or not the command is a hint or a reliable way to zero a range of sectors. A command could be a hint if the device is allowed to ignore it, or if the values of the sector are indeterminate, or if the sectors are zero'ed or not could change after a power cycle. (I've seen an implementation where discard would result in the LBA's being read as zero --- but after a power cycle, reading from the same LBA would return the old data again. This is standards compliant, but it's not terribly useful.) Assuming that the command is reliable, the next question is whether the erase operation is logical or physical --- which is to say, if an attacker has physical access to the die, with the ability to bypass the FTL and directly read the flash cells, could the attacker retrieve the data, even if it required a destructive, physical attack on the hardware? A logical erase would not require that the data be erased or otherwise made inaccessible against an attacker who bypasses the FTL; a physical erase would provide security guarantees that even if your phone has been handed over to state-sponsored attacker, that nothing could be extracted after a physical erase. So if I were king, those would be the three levels of discard: "hint", "reliable logical", and "reliable physical", as those map to real use cases that are of actual use to a Host. 
The challenge is mapping what we *actually* are given by different specs, which were written by hardware engineers and make distinctions that are not well defined so that multiple implementations can be "standard compliant", but have completely different performance profiles, thus making life easy for the marketing types, and hard for the file system engineers. :-) All I can tell you is that I know a bunch of Android system team members at $WORK, and the current assumptions seem to work just fine for the sorts of devices that are used on mobile handsets --- even really cheap ones that are sold in India. At least, there are a bunch of "cost optimized" (as well as high end) Android devices running ext4, and no one has complained to me about mke2fs taking a long time. I definitely agree with you that the SD Card spec seems to imply that other standards-compliant implementations could have the erase command taking minutes, and this seems to be allowable by the spec. I would consider this to be a flaw in the spec, myself. But I don't sit on the standards committees, and I don't write the specs. I (and everyone else) just have to live with them. Sigh.... - Ted ^ permalink raw reply [flat|nested] 11+ messages in thread
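[Editorial sketch] The three levels proposed in the message above ("hint", "reliable logical", "reliable physical") can be written down as a small classification. This is only an illustration of the reasoning in the mail, not anything taken from the kernel or from the SD/eMMC specs; CommandSemantics, DiscardLevel and classify() are hypothetical names made up for this sketch.

```python
from dataclasses import dataclass
from enum import Enum


class DiscardLevel(Enum):
    HINT = "hint"
    RELIABLE_LOGICAL = "reliable logical"
    RELIABLE_PHYSICAL = "reliable physical"


@dataclass(frozen=True)
class CommandSemantics:
    """Hypothetical summary of what a device promises for one command."""
    may_be_ignored: bool        # the device is allowed to skip the request
    reads_back_zero: bool       # later reads of the range return zeros
    survives_power_cycle: bool  # ...and still do so after a power cycle
    flash_cells_erased: bool    # data is gone even if the FTL is bypassed


def classify(sem: CommandSemantics) -> DiscardLevel:
    """Map device promises onto the three levels named in the mail."""
    reliable = (
        not sem.may_be_ignored
        and sem.reads_back_zero
        and sem.survives_power_cycle
    )
    if not reliable:
        # Anything the host cannot depend upon is only a hint.
        return DiscardLevel.HINT
    if sem.flash_cells_erased:
        return DiscardLevel.RELIABLE_PHYSICAL
    return DiscardLevel.RELIABLE_LOGICAL
```

Under this sketch, the implementation described above (zeros after discard, old data back after a power cycle) comes out as a "hint", because survives_power_cycle is false.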
* Re: discard feature, mkfs.ext4 and mmc default fallback to normal erase op 2020-12-08 2:40 ` Theodore Y. Ts'o @ 2020-12-08 9:49 ` Ulf Hansson 2020-12-08 11:26 ` Michael Walle 0 siblings, 1 reply; 11+ messages in thread From: Ulf Hansson @ 2020-12-08 9:49 UTC (permalink / raw) To: Theodore Y. Ts'o, Michael Walle; +Cc: linux-ext4, linux-mmc, linux-block Hi Ted, Michael, On Tue, 8 Dec 2020 at 03:41, Theodore Y. Ts'o <tytso@mit.edu> wrote: > > On Mon, Dec 07, 2020 at 09:39:32PM +0100, Michael Walle wrote: > > > There are three different MMC commands which are defined: > > > > > > 1) DISCARD > > > 2) ERASE > > > 3) SECURE ERASE > > > > > > The first two are expected to be fast, since it only involves clearing > > > some metadata fields in the Flash Translation Layer (FTL), so that the > > > LBA's in the specified range are no longer mapped to a flash page. > > > > Mh, where is it specified that the erase command is fast? According > > to the Physical Layer Simplified Specification Version 8.00: > > > > The actual erase time may be quite long, and the host may issue CMD7 > > to deselect the card or perform card disconnection, as described in > > the Block Write section, above. Before I go into some more detail, of course I fully agree that dealing with erase/discard from the eMMC/SD specifications (and other types of devices) point of view isn't entirely easy. :-) But I also think we can do better than currently, at least for eMMC/SD. > > I looked at the eMMC specification from JEDEC (JESD84-A44) and there, > both the "erase" and "trim" are specified that the work is to be > queued to be done at a time which is convenient to the controller > (read: FTL). This is in contrast to the "secure erase" and "secure > trim" commands, where the erasing has to be done NOW NOW NOW for "high > security applications". 
> > The only difference between "erase" and "trim" seems to be that erase > has to be done in units of the "erase groups" which is typically > larger than the "write pages" which is the granularity required by the > trim command. There is also a comment that when you are erasing the > entire partition, "erase" is preferred over "trim". (Presumably > because it is more convenient? The spec is not clear.) > > Unfortunately, the SD Card spec and the eMMC spec both read like they > were written by a standards committee stacked by hardware engineers. > It doesn't look like they had file system engineers in the room, > because the distinctions between "erase" and "trim" are pretty silly, > and not well defined. Aside from what I wrote, the spec is remarkably > silent about what the host OS can depend upon. Moreover, the specs have evolved over the years. Somehow, we need to map a REQ_OP_DISCARD and REQ_OP_SECURE_ERASE to the best matching operation that the currently inserted eMMC/SD card supports... A long time ago, both the SD and eMMC spec introduced support for real discard commands, as being hints to the card without any guarantees of what will happen to the data from a logical or a physical point of view. If the card supports that, we should use it as the first option for REQ_OP_DISCARD. Although, what should we pick as the second best option, when the card doesn't support discard - that's when it becomes more tricky. And the same applies for REQ_OP_SECURE_ERASE, of course. If you have any suggestions for how we can improve in the above decisions, feel free to suggest something. Another issue that most likely is causing poor performance for REQ_OP_DISCARD/REQ_OP_SECURE_ERASE for eMMC/SD, is that in mmc_queue_setup_discard() we set up the maximum discard sectors allowed per request and the discard granularity. 
To find performance bottlenecks, I would start looking at what actual eMMC/SD commands/args we end up mapping towards the REQ_OP_DISCARD/REQ_OP_SECURE_ERASE requests. Then definitely, I would also look at the values we end up picking as max discard sectors and the discard granularity. > > From the fs perspective, what we care about is whether or not the > command is a hint or a reliable way to zero a range of sectors. A > command could be a hint if the device is allowed to ignore it, or if > the values of the sector are indeterminate, or if the sectors are > zero'ed or not could change after a power cycle. (I've seen an > implementation where discard would result in the LBA's being read as > zero --- but after a power cycle, reading from the same LBA would > return the old data again. This is standards compliant, but it's not > terribly useful.) :-) > > Assuming that the command is reliable, the next question is whether > the erase operation is logical or physical --- which is to say, if an > attacker has physical access to the die, with the ability to bypass > the FTL and directly read the flash cells, could the attacker retrieve > the data, even if it required a destructive, physical attack on the > hardware? A logical erase would not require that the data be erased > or otherwise made inaccessible against an attacker who bypasses the > FTL; a physical erase would provide security guarantees that even if > your phone has been handed over to state-sponsored attacker, that nothing > could be extracted after a physical erase. > > So if I were king, those would be the three levels of discard: "hint", > "reliable logical", and "reliable physical", as those map to real use > cases that are of actual use to a Host. 
The challenge is mapping what > we *actually* are given by different specs, which were written by > hardware engineers and make distinctions that are not well defined so > that multiple implementations can be "standard compliant", but have > completely different performance profiles, thus making life easy for > the marketing types, and hard for the file system engineers. :-) I agree, these are the three levels that make sense to support. Honestly I haven't been paying enough attention to discussions for the generic block layer around discards. However, considering what you just stated above, we seem to be missing one request operation, don't we? [...] Kind regards Uffe ^ permalink raw reply [flat|nested] 11+ messages in thread
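[Editorial sketch] The suggestion in the message above, to look at the max discard sectors and discard granularity that end up being picked, can be approximated from user space by reading the block queue attributes in sysfs. The attribute names discard_granularity, discard_max_bytes and discard_max_hw_bytes are the standard block-layer ones; the helper names and the sysfs_root parameter (there only so the code can be pointed at a test directory) are assumptions of this sketch. It shows the limits the block layer exposes, not which eMMC/SD command or argument is actually sent.

```python
from pathlib import Path

# Block queue attributes that describe the discard limits of a device.
DISCARD_ATTRS = ("discard_granularity", "discard_max_bytes", "discard_max_hw_bytes")


def read_discard_limits(device: str, sysfs_root: str = "/sys/block") -> dict:
    """Return the discard-related queue limits of `device` (e.g. "mmcblk0").

    Attributes that are missing on the running kernel are reported as None.
    """
    queue = Path(sysfs_root) / device / "queue"
    limits = {}
    for name in DISCARD_ATTRS:
        attr = queue / name
        limits[name] = int(attr.read_text().strip()) if attr.exists() else None
    return limits


def advertises_discard(limits: dict) -> bool:
    """A discard_max_bytes of 0 (or a missing file) means no discard support."""
    return bool(limits.get("discard_max_bytes"))
```

For example, read_discard_limits("mmcblk0") on the affected board would show whether the card is advertised as discard-capable at all, and with which granularity.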
* Re: discard feature, mkfs.ext4 and mmc default fallback to normal erase op 2020-12-08 9:49 ` Ulf Hansson @ 2020-12-08 11:26 ` Michael Walle 2020-12-08 16:17 ` Ulf Hansson 2020-12-08 16:52 ` Theodore Y. Ts'o 0 siblings, 2 replies; 11+ messages in thread From: Michael Walle @ 2020-12-08 11:26 UTC (permalink / raw) To: Ulf Hansson; +Cc: Theodore Y. Ts'o, linux-ext4, linux-mmc, linux-block Hi Ulf, Hi Ted, Am 2020-12-08 10:49, schrieb Ulf Hansson: > On Tue, 8 Dec 2020 at 03:41, Theodore Y. Ts'o <tytso@mit.edu> wrote: >> On Mon, Dec 07, 2020 at 09:39:32PM +0100, Michael Walle wrote: >> > > There are three different MMC commands which are defined: >> > > >> > > 1) DISCARD >> > > 2) ERASE >> > > 3) SECURE ERASE >> > > >> > > The first two are expected to be fast, since it only involves clearing >> > > some metadata fields in the Flash Translation Layer (FTL), so that the >> > > LBA's in the specified range are no longer mapped to a flash page. >> > >> > Mh, where is it specified that the erase command is fast? According >> > to the Physical Layer Simplified Specification Version 8.00: >> > >> > The actual erase time may be quite long, and the host may issue CMD7 >> > to deselect the card or perform card disconnection, as described in >> > the Block Write section, above. > > Before I go into some more detail, of course I fully agree that > dealing with erase/discard from the eMMC/SD specifications (and other > types of devices) point of view isn't entirely easy. :-) > > But I also think we can do better than currently, at least for eMMC/SD. > >> >> I looked at the eMMC specification from JEDEC (JESD84-A44) and there, >> both the "erase" and "trim" are specified that the work is to be >> queued to be done at a time which is convenient to the controller >> (read: FTL). This is in contrast to the "secure erase" and "secure >> trim" commands, where the erasing has to be done NOW NOW NOW for "high >> security applications". 
Oh this might also be because I've cited from the wrong place, namely the mmc_init_card() function. But what I really meant was the sd card equivalent which should be mmc_read_ssr(). Sorry. discard_support = UNSTUFF_BITS(resp, 313 - 288, 1); card->erase_arg = (card->scr.sda_specx && discard_support) ? SD_DISCARD_ARG : SD_ERASE_ARG; >> The only difference between "erase" and "trim" seems to be that erase >> has to be done in units of the "erase groups" which is typically >> larger than the "write pages" which is the granularity required by the >> trim command. There is also a comment that when you are erasing the >> entire partition, "erase" is preferred over "trim". (Presumably >> because it is more convenient? The spec is not clear.) >> >> Unfortunately, the SD Card spec and the eMMC spec both read like they >> were written by a standards committee stacked by hardware engineers. >> It doesn't look like they had file system engineers in the room, >> because the distinctions between "erase" and "trim" are pretty silly, >> and not well defined. Aside from what I wrote, the spec is remarkably >> silent about what the host OS can depend upon. > > Moreover, the specs have evolved over the years. Somehow, we need to > map a REQ_OP_DISCARD and REQ_OP_SECURE_ERASE to the best matching > operation that the currently inserted eMMC/SD card supports... Do we really need to map these functions? What if we don't have an actual discard, but just a slow erase (I'm now assuming that erase will likely be slow on sdcards)? Can't we just tell the user space there is no discard? Like on a normal HDD? I really don't know the implications, seems like mmc_erase() is just there for the linux discard feature. Coming from the user space side: does mkfs.ext4 assume its pre-discard is fast? I'd think so, right? I'd presume it was intended to tell the FTL of the block device, "hey these blocks are unused, you can do some wear leveling with them". 
> A long time ago, both the SD and eMMC spec introduced support for > real discard commands, as being hints to the card without any > guarantees of what will happen to the data from a logical or a > physical point of view. If the card supports that, we should use it as > the first option for REQ_OP_DISCARD. Although, what should we pick as > the second best option, when the card doesn't support discard - that's > when it becomes more tricky. And the same applies for > REQ_OP_SECURE_ERASE, of course. > > If you have any suggestions for how we can improve in the above > decisions, feel free to suggest something. > > Another issue that most likely is causing poor performance for > REQ_OP_DISCARD/REQ_OP_SECURE_ERASE for eMMC/SD, is that in > mmc_queue_setup_discard() we set up the maximum discard sectors > allowed per request and the discard granularity. > > To find performance bottlenecks, I would start looking at what actual > eMMC/SD commands/args we end up mapping towards the > REQ_OP_DISCARD/REQ_OP_SECURE_ERASE requests. Then definitely, I would > also look at the values we end up picking as max discard sectors and > the discard granularity. I'm currently gathering some SD cards to look at how they behave timing-wise and what they report as supported (i.e. erase or discard). Looks like other cards are doing better. But I'd have to find out if they support the discard (mine doesn't) and if they are slow too if I force them to use the normal erase. >> From the fs perspective, what we care about is whether or not the >> command is a hint or a reliable way to zero a range of sectors. A >> command could be a hint if the device is allowed to ignore it, or if >> the values of the sector are indeterminate, or if the sectors are >> zero'ed or not could change after a power cycle. (I've seen an >> implementation where discard would result in the LBA's being read as >> zero --- but after a power cycle, reading from the same LBA would >> return the old data again. 
This is standards compliant, but it's not >> terribly useful.) > > :-) > >> >> Assuming that the command is reliable, the next question is whether >> the erase operation is logical or physical --- which is to say, if an >> attacker has physical access to the die, with the ability to bypass >> the FTL and directly read the flash cells, could the attacker retrieve >> the data, even if it required a destructive, physical attack on the >> hardware? A logical erase would not require that the data be erased >> or otherwise made inaccessible against an attacker who bypasses the >> FTL; a physical erase would provide security guarantees that even if >> your phone has been handed over to state-sponsored attacker, that nothing >> could be extracted after a physical erase. >> >> So if I were king, those would be the three levels of discard: "hint", >> "reliable logical", and "reliable physical", as those map to real use >> cases that are of actual use to a Host. The challenge is mapping what >> we *actually* are given by different specs, which were written by >> hardware engineers and make distinctions that are not well defined so >> that multiple implementations can be "standard compliant", but have >> completely different performance profiles, thus making life easy for >> the marketing types, and hard for the file system engineers. :-) > > I agree, these are the three levels that make sense to support. > > Honestly I haven't been paying enough attention to discussions for the > generic block layer around discards. However, considering what you > just stated above, we seem to be missing one request operation, don't > we? -michael ^ permalink raw reply [flat|nested] 11+ messages in thread
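[Editorial sketch] The message above contrasts two behaviours for SD cards without a real discard command: fall back to the normal erase (what the quoted mmc_read_ssr() snippet does when it picks the erase argument), or report no discard support to user space, as on a normal HDD. The sketch below only models that decision; it is not kernel code. The strings stand in for the kernel's SD_DISCARD_ARG and SD_ERASE_ARG constants, and proposed_discard_backend() and its allow_erase_fallback parameter are hypothetical names for the alternative being asked about.

```python
from typing import Optional

# Symbolic stand-ins for the kernel constants named in the quoted snippet.
SD_DISCARD_ARG = "SD_DISCARD_ARG"
SD_ERASE_ARG = "SD_ERASE_ARG"


def current_erase_arg(sda_specx: int, discard_support: bool) -> str:
    """Mirror of the quoted selection: discard if the card has it, else erase."""
    return SD_DISCARD_ARG if (sda_specx and discard_support) else SD_ERASE_ARG


def proposed_discard_backend(
    sda_specx: int,
    discard_support: bool,
    allow_erase_fallback: bool = False,
) -> Optional[str]:
    """Hypothetical alternative: without a real discard, advertise nothing.

    Returning None models "tell user space there is no discard"; callers
    such as blkdiscard would then see the operation as unsupported.
    """
    if sda_specx and discard_support:
        return SD_DISCARD_ARG
    return SD_ERASE_ARG if allow_erase_fallback else None
```

With allow_erase_fallback=True the second function behaves like the first, so a knob of this kind would keep today's behaviour available for setups that can live with slow formatting.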
* Re: discard feature, mkfs.ext4 and mmc default fallback to normal erase op 2020-12-08 11:26 ` Michael Walle @ 2020-12-08 16:17 ` Ulf Hansson 2020-12-08 20:57 ` Michael Walle 2020-12-08 16:52 ` Theodore Y. Ts'o 1 sibling, 1 reply; 11+ messages in thread From: Ulf Hansson @ 2020-12-08 16:17 UTC (permalink / raw) To: Michael Walle; +Cc: Theodore Y. Ts'o, linux-ext4, linux-mmc, linux-block On Tue, 8 Dec 2020 at 12:26, Michael Walle <michael@walle.cc> wrote: > > Hi Ulf, Hi Ted, > > Am 2020-12-08 10:49, schrieb Ulf Hansson: > > On Tue, 8 Dec 2020 at 03:41, Theodore Y. Ts'o <tytso@mit.edu> wrote: > >> On Mon, Dec 07, 2020 at 09:39:32PM +0100, Michael Walle wrote: > >> > > There are three different MMC commands which are defined: > >> > > > >> > > 1) DISCARD > >> > > 2) ERASE > >> > > 3) SECURE ERASE > >> > > > >> > > The first two are expected to be fast, since it only involves clearing > >> > > some metadata fields in the Flash Translation Layer (FTL), so that the > >> > > LBA's in the specified range are no longer mapped to a flash page. > >> > > >> > Mh, where is it specified that the erase command is fast? According > >> > to the Physical Layer Simplified Specification Version 8.00: > >> > > >> > The actual erase time may be quite long, and the host may issue CMD7 > >> > to deselect the card or perform card disconnection, as described in > >> > the Block Write section, above. > > > > Before I go into some more detail, of course I fully agree that > > dealing with erase/discard from the eMMC/SD specifications (and other > > types of devices) point of view isn't entirely easy. :-) > > > > But I also think we can do better than currently, at least for eMMC/SD. > > > >> > >> I looked at the eMMC specification from JEDEC (JESD84-A44) and there, > >> both the "erase" and "trim" are specified that the work is to be > >> queued to be done at a time which is convenient to the controller > >> (read: FTL). 
This is in contrast to the "secure erase" and "secure > >> trim" commands, where the erasing has to be done NOW NOW NOW for "high > >> security applications". > > Oh this might also be because I've cited from the wrong place, namely > the > mmc_init_card() function. But what I really meant was the sd card > equivalent > which should be mmc_read_ssr(). Sorry. > > discard_support = UNSTUFF_BITS(resp, 313 - 288, 1); > card->erase_arg = (card->scr.sda_specx && discard_support) ? > SD_DISCARD_ARG : SD_ERASE_ARG; I assumed you were referring to this, but good that you pointed this out, for clarity. > > >> The only difference between "erase" and "trim" seems to be that erase > >> has to be done in units of the "erase groups" which is typically > >> larger than the "write pages" which is the granularity required by the > >> trim command. There is also a comment that when you are erasing the > >> entire partition, "erase" is preferred over "trim". (Presumably > >> because it is more convenient? The spec is not clear.) > >> > >> Unfortunately, the SD Card spec and the eMMC spec both read like they > >> were written by a standards committee stacked by hardware engineers. > >> It doesn't look like they had file system engineers in the room, > >> because the distinctions between "erase" and "trim" are pretty silly, > >> and not well defined. Aside from what I wrote, the spec is remarkably > >> silent about what the host OS can depend upon. > > > > Moreover, the specs have evolved over the years. Somehow, we need to > > map a REQ_OP_DISCARD and REQ_OP_SECURE_ERASE to the best matching > > operation that the currently inserted eMMC/SD card supports... > > Do we really need to map these functions? What if we don't have an > actual discard, but just a slow erase (I'm now assuming that erase > will likely be slow on sdcards)? Can't we just tell the user space > there is no discard? Like on a normal HDD? I have considered that, but not sure what would be the best option. 
> I really don't know the > implications, seems like mmc_erase() is just there for the linux > discard feature. mmc_erase() is used for both REQ_OP_DISCARD and REQ_OP_SECURE_ERASE, but that's an implementation detail that we can change, of course. Honestly, the whole erase/discard support in the mmc core deserves a cleanup and I am looking at that (occasionally). > > Coming from the user space side: does mkfs.ext4 assume its pre-discard > is fast? I'd think so, right? I'd presume it was intended to tell the > FTL of the block device, "hey these blocks are unused, you can do some > wear leveling with them". I would assume that too. On the other hand, I guess there are situations when user space could live with slow formatting times. In particular if the goal is to let the card clean up its internal garbage, as a way to improve "performance" for later I/O writes. > > > A long time ago, both the SD and eMMC spec introduced support for > > real discard commands, as being hints to the card without any > > guarantees of what will happen to the data from a logical or a > > physical point of view. If the card supports that, we should use it as > > the first option for REQ_OP_DISCARD. Although, what should we pick as > > the second best option, when the card doesn't support discard - that's > > when it becomes more tricky. And the same applies for > > REQ_OP_SECURE_ERASE, of course. > > > > If you have any suggestions for how we can improve in the above > > decisions, feel free to suggest something. > > > > Another issue that most likely is causing poor performance for > > REQ_OP_DISCARD/REQ_OP_SECURE_ERASE for eMMC/SD, is that in > > mmc_queue_setup_discard() we set up the maximum discard sectors > > allowed per request and the discard granularity. > > > > To find performance bottlenecks, I would start looking at what actual > > eMMC/SD commands/args we end up mapping towards the > > REQ_OP_DISCARD/REQ_OP_SECURE_ERASE requests. 
Then definitely, I would > > also look at the values we end up picking as max discard sectors and > > the discard granularity. > > I'm just about finding some SD cards and looking how they behave timing > wise and what they report they support (i.e. erase or discard). Looks > like other cards are doing better. But I'd have to find out if they > support the discard (mine doesn't) and if they are slow too if I force > them to use the normal erase. Sounds great, looking forward to hearing more about your findings. [...] Kind regards Uffe ^ permalink raw reply [flat|nested] 11+ messages in thread
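One way to look at the "max discard sectors" and "discard granularity" values mentioned above from user space is to read the block queue attributes that the kernel exports in sysfs. The following is only a minimal sketch: the helper name read_discard_limits() and the device name "mmcblk1" are made up for illustration, and it assumes the standard queue/discard_granularity and queue/discard_max_bytes attributes (both reported in bytes) are present for the device.

```python
from pathlib import Path


def read_discard_limits(device, sysfs_root="/sys/block"):
    """Return the discard limits the block layer exposes for `device`.

    `device` is a block device name such as "mmcblk1" (an assumption for
    this sketch). Values are in bytes; an attribute that does not exist
    is reported as None instead of raising.
    """
    queue = Path(sysfs_root) / device / "queue"
    limits = {}
    for name in ("discard_granularity", "discard_max_bytes"):
        attr = queue / name
        limits[name] = int(attr.read_text().strip()) if attr.exists() else None
    return limits


if __name__ == "__main__":
    # Hypothetical device name; adjust to the card under test.
    print(read_discard_limits("mmcblk1"))
```

Comparing these two values between a fast and a slow card would show whether the block layer is splitting a large discard into many small requests for both of them.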
* Re: discard feature, mkfs.ext4 and mmc default fallback to normal erase op 2020-12-08 16:17 ` Ulf Hansson @ 2020-12-08 20:57 ` Michael Walle 0 siblings, 0 replies; 11+ messages in thread From: Michael Walle @ 2020-12-08 20:57 UTC (permalink / raw) To: Ulf Hansson; +Cc: Theodore Y. Ts'o, linux-ext4, linux-mmc, linux-block Hi Ulf, Ted, Am 2020-12-08 17:17, schrieb Ulf Hansson: > On Tue, 8 Dec 2020 at 12:26, Michael Walle <michael@walle.cc> wrote: >> > To find performance bottlenecks, I would start looking at what actual >> > eMMC/SD commands/args we end up mapping towards the >> > REQ_OP_DISCARD/REQ_OP_SECURE_ERASE requests. Then definitely, I would >> > also look at the values we end up picking as max discard sectors and >> > the discard granularity. >> >> I'm just about finding some SD cards and looking how they behave >> timing >> wise and what they report they support (ie. erase or discard). Looks >> like other cards are doing better. But I'd have to find out if they >> support the discard (mine doesn't) and if they are slow too if I force >> them to use the normal erase. > > Sounds great, looking forward to hear more about your findings. Ok so sample size is 3 *g*. Two of these cards are actually "fast", meaning that a discard of any size will take less than a second and one is the slow card. I've added tracing to dump the cards parameters (see patch at the end of this mail). No card supports discard, they just use the normal erase method. That wasn't what I was expecting ;) (1) Fast card (Kingston CANVAS Select Plus, 16GB) # time blkdiscard -l 536870912 /dev/mmcblk1 real 0m 0.34s user 0m 0.00s sys 0m 0.00s kworker/0:2-81 [000] .... 123.285801: mmc_sd_setup_card: card->erase_arg=0, au=9 es=512 et=12 eo=3 kworker/1:3H-2368 [001] .... 133.570762: mmc_do_erase: from=0x0 to=0x1fff kworker/1:3H-2368 [001] .... 133.585204: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 133.585284: mmc_do_erase: from=0x2000 to=0x3fff kworker/1:3H-2368 [001] .... 
133.589201: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 133.589217: mmc_do_erase: from=0x4000 to=0x5fff kworker/1:3H-2368 [001] .... 133.591315: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 133.591330: mmc_do_erase: from=0x6000 to=0x7fff kworker/1:3H-2368 [001] .... 133.593202: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 133.593217: mmc_do_erase: from=0x8000 to=0x9fff kworker/1:3H-2368 [001] .... 133.595338: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 133.595353: mmc_do_erase: from=0xa000 to=0xbfff kworker/1:3H-2368 [001] .... 133.597473: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 133.597488: mmc_do_erase: from=0xc000 to=0xdfff kworker/1:3H-2368 [001] .... 133.599605: mmc_do_erase: mmc_poll_for_busy() done [..] kworker/1:3H-2368 [001] .... 133.891681: mmc_do_erase: from=0xf0000 to=0xf1fff kworker/1:3H-2368 [001] .... 133.893919: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 133.893947: mmc_do_erase: from=0xf2000 to=0xf3fff kworker/1:3H-2368 [001] .... 133.896186: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 133.896213: mmc_do_erase: from=0xf4000 to=0xf5fff kworker/1:3H-2368 [001] .... 133.898452: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 133.898481: mmc_do_erase: from=0xf6000 to=0xf7fff kworker/1:3H-2368 [001] .... 133.900713: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 133.900741: mmc_do_erase: from=0xf8000 to=0xf9fff kworker/1:3H-2368 [001] .... 133.902979: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 133.903008: mmc_do_erase: from=0xfa000 to=0xfbfff kworker/1:3H-2368 [001] .... 133.905246: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 133.905274: mmc_do_erase: from=0xfc000 to=0xfdfff kworker/1:3H-2368 [001] .... 133.909589: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 
133.909620: mmc_do_erase: from=0xfe000 to=0xfffff kworker/1:3H-2368 [001] .... 133.911870: mmc_do_erase: mmc_poll_for_busy() done (2) Fast card (Panasonic, Unknown model, 8GB) kworker/0:2-81 [000] .... 492.192453: mmc_sd_setup_card: card->erase_arg=0, au=9 es=8 et=1 eo=3 I didn't discard the blocks again, so no logs, but it didn't take long in the first run. (3) Slow card (Toshiba Exceria, 16GB) # time blkdiscard -l 536870912 /dev/mmcblk1 real 0m 39.78s user 0m 0.00s sys 0m 0.00s kworker/0:2-81 [000] .... 207.271171: mmc_sd_setup_card: card->erase_arg=0, au=9 es=512 et=12 eo=3 kworker/1:3H-2368 [001] .... 212.096265: mmc_do_erase: from=0x0 to=0x1fff kworker/1:3H-2368 [001] .... 212.100282: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 212.100328: mmc_do_erase: from=0x2000 to=0x3fff kworker/1:3H-2368 [001] .... 212.102207: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 212.102215: mmc_do_erase: from=0x4000 to=0x5fff kworker/1:3H-2368 [001] .... 212.104260: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 212.104267: mmc_do_erase: from=0x6000 to=0x7fff kworker/1:3H-2368 [001] .... 213.086808: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 213.086842: mmc_do_erase: from=0x8000 to=0x9fff kworker/1:3H-2368 [001] .... 213.149232: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 213.149263: mmc_do_erase: from=0xa000 to=0xbfff kworker/1:3H-2368 [001] .... 213.215185: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 213.215216: mmc_do_erase: from=0xc000 to=0xdfff kworker/1:3H-2368 [001] .... 213.346672: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 213.346702: mmc_do_erase: from=0xe000 to=0xffff kworker/1:3H-2368 [001] .... 213.412594: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 213.412623: mmc_do_erase: from=0x10000 to=0x11fff kworker/1:3H-2368 [001] .... 
213.478507: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 213.478541: mmc_do_erase: from=0x12000 to=0x13fff kworker/1:3H-2368 [001] .... 213.598798: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 213.598829: mmc_do_erase: from=0x14000 to=0x15fff kworker/1:3H-2368 [001] .... 213.664721: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 213.664750: mmc_do_erase: from=0x16000 to=0x17fff kworker/1:3H-2368 [001] .... 213.730632: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 213.730661: mmc_do_erase: from=0x18000 to=0x19fff kworker/1:3H-2368 [001] .... 213.862108: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 213.862138: mmc_do_erase: from=0x1a000 to=0x1bfff kworker/1:3H-2368 [001] .... 213.928017: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 213.928046: mmc_do_erase: from=0x1c000 to=0x1dfff kworker/1:3H-2368 [001] .... 213.993925: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 213.993954: mmc_do_erase: from=0x1e000 to=0x1ffff kworker/1:3H-2368 [001] .... 214.110795: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 214.110827: mmc_do_erase: from=0x20000 to=0x21fff kworker/1:3H-2368 [001] .... 214.173232: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 214.173263: mmc_do_erase: from=0x22000 to=0x23fff kworker/1:3H-2368 [001] .... 214.239191: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 214.239221: mmc_do_erase: from=0x24000 to=0x25fff kworker/1:3H-2368 [001] .... 215.069222: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 215.069253: mmc_do_erase: from=0x26000 to=0x27fff kworker/1:3H-2368 [001] .... 215.135138: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 215.135168: mmc_do_erase: from=0x28000 to=0x29fff kworker/1:3H-2368 [001] .... 215.197232: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 
215.197264: mmc_do_erase: from=0x2a000 to=0x2bfff kworker/1:3H-2368 [001] .... 216.040197: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 216.040229: mmc_do_erase: from=0x2c000 to=0x2dfff kworker/1:3H-2368 [001] .... 216.158794: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 216.158824: mmc_do_erase: from=0x2e000 to=0x2ffff kworker/1:3H-2368 [001] .... 216.221232: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 216.221263: mmc_do_erase: from=0x30000 to=0x31fff kworker/1:3H-2368 [001] .... 217.064195: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 217.064226: mmc_do_erase: from=0x32000 to=0x33fff kworker/1:3H-2368 [001] .... 217.182794: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 217.182824: mmc_do_erase: from=0x34000 to=0x35fff kworker/1:3H-2368 [001] .... 217.245231: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 217.245263: mmc_do_erase: from=0x36000 to=0x37fff kworker/1:3H-2368 [001] .... 218.083500: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 218.083532: mmc_do_erase: from=0x38000 to=0x39fff kworker/1:3H-2368 [001] .... 218.141223: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 218.141253: mmc_do_erase: from=0x3a000 to=0x3bfff kworker/1:3H-2368 [001] .... 218.207130: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 218.207160: mmc_do_erase: from=0x3c000 to=0x3dfff kworker/1:3H-2368 [001] .... 219.046630: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 219.046663: mmc_do_erase: from=0x3e000 to=0x3ffff kworker/1:3H-2368 [001] .... 219.112564: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 219.112595: mmc_do_erase: from=0x40000 to=0x41fff kworker/1:3H-2368 [001] .... 219.230811: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 219.230842: mmc_do_erase: from=0x42000 to=0x43fff kworker/1:3H-2368 [001] .... 
220.070631: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 220.070665: mmc_do_erase: from=0x44000 to=0x45fff kworker/1:3H-2368 [001] .... 220.136551: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 220.136580: mmc_do_erase: from=0x46000 to=0x47fff kworker/1:3H-2368 [001] .... 220.254794: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 220.254824: mmc_do_erase: from=0x48000 to=0x49fff kworker/1:3H-2368 [001] .... 221.094626: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 221.094658: mmc_do_erase: from=0x4a000 to=0x4bfff kworker/1:3H-2368 [001] .... 221.160559: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 221.160588: mmc_do_erase: from=0x4c000 to=0x4dfff kworker/1:3H-2368 [001] .... 221.278793: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 221.278823: mmc_do_erase: from=0x4e000 to=0x4ffff kworker/1:3H-2368 [001] .... 222.118626: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 222.118658: mmc_do_erase: from=0x50000 to=0x51fff kworker/1:3H-2368 [001] .... 222.184557: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 222.184586: mmc_do_erase: from=0x52000 to=0x53fff kworker/1:3H-2368 [001] .... 222.302797: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 222.302829: mmc_do_erase: from=0x54000 to=0x55fff kworker/1:3H-2368 [001] .... 223.142627: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 223.142659: mmc_do_erase: from=0x56000 to=0x57fff kworker/1:3H-2368 [001] .... 223.208558: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 223.208587: mmc_do_erase: from=0x58000 to=0x59fff kworker/1:3H-2368 [001] .... 223.326793: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 223.326823: mmc_do_erase: from=0x5a000 to=0x5bfff kworker/1:3H-2368 [001] .... 224.166631: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 
224.166663: mmc_do_erase: from=0x5c000 to=0x5dfff kworker/1:3H-2368 [001] .... 224.232553: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 224.232582: mmc_do_erase: from=0x5e000 to=0x5ffff kworker/1:3H-2368 [001] .... 224.350792: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 224.350822: mmc_do_erase: from=0x60000 to=0x61fff kworker/1:3H-2368 [001] .... 225.190627: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 225.190658: mmc_do_erase: from=0x62000 to=0x63fff kworker/1:3H-2368 [001] .... 225.256542: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 225.256571: mmc_do_erase: from=0x64000 to=0x65fff kworker/1:3H-2368 [001] .... 225.374796: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 225.374827: mmc_do_erase: from=0x66000 to=0x67fff kworker/1:3H-2368 [001] .... 226.214627: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 226.214658: mmc_do_erase: from=0x68000 to=0x69fff kworker/1:3H-2368 [001] .... 226.333222: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 226.333255: mmc_do_erase: from=0x6a000 to=0x6bfff kworker/1:3H-2368 [001] .... 226.399137: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 226.399168: mmc_do_erase: from=0x6c000 to=0x6dfff kworker/1:3H-2368 [001] .... 227.238625: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 227.238657: mmc_do_erase: from=0x6e000 to=0x6ffff kworker/1:3H-2368 [001] .... 227.304560: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 227.304589: mmc_do_erase: from=0x70000 to=0x71fff kworker/1:3H-2368 [001] .... 227.422792: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 227.422822: mmc_do_erase: from=0x72000 to=0x73fff kworker/1:3H-2368 [001] .... 228.262629: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 228.262661: mmc_do_erase: from=0x74000 to=0x75fff kworker/1:3H-2368 [001] .... 
228.328546: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 228.328575: mmc_do_erase: from=0x76000 to=0x77fff kworker/1:3H-2368 [001] .... 228.446796: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 228.446827: mmc_do_erase: from=0x78000 to=0x79fff kworker/1:3H-2368 [001] .... 229.286630: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 229.286662: mmc_do_erase: from=0x7a000 to=0x7bfff kworker/1:3H-2368 [001] .... 229.352545: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 229.352573: mmc_do_erase: from=0x7c000 to=0x7dfff kworker/1:3H-2368 [001] .... 229.470792: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 229.470822: mmc_do_erase: from=0x7e000 to=0x7ffff kworker/1:3H-2368 [001] .... 230.310627: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 230.310659: mmc_do_erase: from=0x80000 to=0x81fff kworker/1:3H-2368 [001] .... 230.376544: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 230.376574: mmc_do_erase: from=0x82000 to=0x83fff kworker/1:3H-2368 [001] .... 230.494792: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 230.494822: mmc_do_erase: from=0x84000 to=0x85fff kworker/1:3H-2368 [001] .... 231.334626: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 231.334658: mmc_do_erase: from=0x86000 to=0x87fff kworker/1:3H-2368 [001] .... 231.400542: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 231.400571: mmc_do_erase: from=0x88000 to=0x89fff kworker/1:3H-2368 [001] .... 231.518795: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 231.518827: mmc_do_erase: from=0x8a000 to=0x8bfff kworker/1:3H-2368 [001] .... 232.358627: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 232.358659: mmc_do_erase: from=0x8c000 to=0x8dfff kworker/1:3H-2368 [001] .... 232.477222: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 
232.477255: mmc_do_erase: from=0x8e000 to=0x8ffff kworker/1:3H-2368 [001] .... 232.543130: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 232.543160: mmc_do_erase: from=0x90000 to=0x91fff kworker/1:3H-2368 [001] .... 233.382626: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 233.382658: mmc_do_erase: from=0x92000 to=0x93fff kworker/1:3H-2368 [001] .... 233.448558: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 233.448587: mmc_do_erase: from=0x94000 to=0x95fff kworker/1:3H-2368 [001] .... 233.566793: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 233.566823: mmc_do_erase: from=0x96000 to=0x97fff kworker/1:3H-2368 [001] .... 234.406628: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 234.406659: mmc_do_erase: from=0x98000 to=0x99fff kworker/1:3H-2368 [001] .... 234.472545: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 234.472574: mmc_do_erase: from=0x9a000 to=0x9bfff kworker/1:3H-2368 [001] .... 234.590796: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 234.590827: mmc_do_erase: from=0x9c000 to=0x9dfff kworker/1:3H-2368 [001] .... 235.430625: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 235.430656: mmc_do_erase: from=0x9e000 to=0x9ffff kworker/1:3H-2368 [001] .... 235.496536: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 235.496566: mmc_do_erase: from=0xa0000 to=0xa1fff kworker/1:3H-2368 [001] .... 235.614792: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 235.614822: mmc_do_erase: from=0xa2000 to=0xa3fff kworker/1:3H-2368 [001] .... 236.454627: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 236.454657: mmc_do_erase: from=0xa4000 to=0xa5fff kworker/1:3H-2368 [001] .... 236.520546: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 236.520575: mmc_do_erase: from=0xa6000 to=0xa7fff kworker/1:3H-2368 [001] .... 
236.638793: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 236.638824: mmc_do_erase: from=0xa8000 to=0xa9fff kworker/1:3H-2368 [001] .... 237.478625: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 237.478656: mmc_do_erase: from=0xaa000 to=0xabfff kworker/1:3H-2368 [001] .... 237.544554: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 237.544583: mmc_do_erase: from=0xac000 to=0xadfff kworker/1:3H-2368 [001] .... 237.662796: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 237.662827: mmc_do_erase: from=0xae000 to=0xaffff kworker/1:3H-2368 [001] .... 238.502625: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 238.502656: mmc_do_erase: from=0xb0000 to=0xb1fff kworker/1:3H-2368 [001] .... 238.621222: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 238.621255: mmc_do_erase: from=0xb2000 to=0xb3fff kworker/1:3H-2368 [001] .... 238.687131: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 238.687161: mmc_do_erase: from=0xb4000 to=0xb5fff kworker/1:3H-2368 [001] .... 239.526626: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 239.526657: mmc_do_erase: from=0xb6000 to=0xb7fff kworker/1:3H-2368 [001] .... 239.592540: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 239.592569: mmc_do_erase: from=0xb8000 to=0xb9fff kworker/1:3H-2368 [001] .... 239.710792: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 239.710822: mmc_do_erase: from=0xba000 to=0xbbfff kworker/1:3H-2368 [001] .... 240.550626: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 240.550656: mmc_do_erase: from=0xbc000 to=0xbdfff kworker/1:3H-2368 [001] .... 240.616539: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 240.616567: mmc_do_erase: from=0xbe000 to=0xbffff kworker/1:3H-2368 [001] .... 240.734796: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 
240.734828: mmc_do_erase: from=0xc0000 to=0xc1fff kworker/1:3H-2368 [001] .... 241.574624: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 241.574655: mmc_do_erase: from=0xc2000 to=0xc3fff kworker/1:3H-2368 [001] .... 241.640552: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 241.640581: mmc_do_erase: from=0xc4000 to=0xc5fff kworker/1:3H-2368 [001] .... 241.758792: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 241.758823: mmc_do_erase: from=0xc6000 to=0xc7fff kworker/1:3H-2368 [001] .... 242.598626: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 242.598658: mmc_do_erase: from=0xc8000 to=0xc9fff kworker/1:3H-2368 [001] .... 242.664543: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 242.664571: mmc_do_erase: from=0xca000 to=0xcbfff kworker/1:3H-2368 [001] .... 242.782792: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 242.782822: mmc_do_erase: from=0xcc000 to=0xcdfff kworker/1:3H-2368 [001] .... 243.622626: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 243.622658: mmc_do_erase: from=0xce000 to=0xcffff kworker/1:3H-2368 [001] .... 243.688545: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 243.688574: mmc_do_erase: from=0xd0000 to=0xd1fff kworker/1:3H-2368 [001] .... 243.806795: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 243.806827: mmc_do_erase: from=0xd2000 to=0xd3fff kworker/1:3H-2368 [001] .... 244.646625: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 244.646656: mmc_do_erase: from=0xd4000 to=0xd5fff kworker/1:3H-2368 [001] .... 244.765222: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 244.765255: mmc_do_erase: from=0xd6000 to=0xd7fff kworker/1:3H-2368 [001] .... 244.831131: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 244.831160: mmc_do_erase: from=0xd8000 to=0xd9fff kworker/1:3H-2368 [001] .... 
245.670626: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 245.670658: mmc_do_erase: from=0xda000 to=0xdbfff kworker/1:3H-2368 [001] .... 245.736537: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 245.736566: mmc_do_erase: from=0xdc000 to=0xddfff kworker/1:3H-2368 [001] .... 245.854792: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 245.854823: mmc_do_erase: from=0xde000 to=0xdffff kworker/1:3H-2368 [001] .... 246.694624: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 246.694655: mmc_do_erase: from=0xe0000 to=0xe1fff kworker/1:3H-2368 [001] .... 246.760553: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 246.760582: mmc_do_erase: from=0xe2000 to=0xe3fff kworker/1:3H-2368 [001] .... 246.878795: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 246.878827: mmc_do_erase: from=0xe4000 to=0xe5fff kworker/1:3H-2368 [001] .... 247.718624: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 247.718656: mmc_do_erase: from=0xe6000 to=0xe7fff kworker/1:3H-2368 [001] .... 247.784540: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 247.784570: mmc_do_erase: from=0xe8000 to=0xe9fff kworker/1:3H-2368 [001] .... 247.902791: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 247.902821: mmc_do_erase: from=0xea000 to=0xebfff kworker/1:3H-2368 [001] .... 248.742625: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 248.742657: mmc_do_erase: from=0xec000 to=0xedfff kworker/1:3H-2368 [001] .... 248.808554: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 248.808584: mmc_do_erase: from=0xee000 to=0xeffff kworker/1:3H-2368 [001] .... 248.926792: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 248.926822: mmc_do_erase: from=0xf0000 to=0xf1fff kworker/1:3H-2368 [001] .... 249.766626: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 
249.766657: mmc_do_erase: from=0xf2000 to=0xf3fff kworker/1:3H-2368 [001] .... 249.832544: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 249.832574: mmc_do_erase: from=0xf4000 to=0xf5fff kworker/1:3H-2368 [001] .... 249.950797: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 249.950828: mmc_do_erase: from=0xf6000 to=0xf7fff kworker/1:3H-2368 [001] .... 250.790626: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 250.790658: mmc_do_erase: from=0xf8000 to=0xf9fff kworker/1:3H-2368 [001] .... 250.909223: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 250.909256: mmc_do_erase: from=0xfa000 to=0xfbfff kworker/1:3H-2368 [001] .... 250.975143: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 250.975173: mmc_do_erase: from=0xfc000 to=0xfdfff kworker/1:3H-2368 [001] .... 251.814626: mmc_do_erase: mmc_poll_for_busy() done kworker/1:3H-2368 [001] .... 251.814656: mmc_do_erase: from=0xfe000 to=0xfffff kworker/1:3H-2368 [001] .... 251.880556: mmc_do_erase: mmc_poll_for_busy() done As you can see, some erase operations are fast and some take significant longer. While for the fast card, all are completed almost instantaneously. Looks like the slow card will do some kind of background work between erase cycles. The reported parameters of the slow card sounds reasonable, like 15s for 2GiB. Because of this I've changed the perf_erase to its max value for this card (i.e. au * 512): # time blkdiscard /dev/mmcblk1 real 0m 1.72s user 0m 0.00s sys 0m 0.00s kworker/0:3H-2375 [000] .... 528.308617: mmc_do_erase: from=0x0 to=0x3fdfff kworker/0:3H-2375 [000] .... 528.435991: mmc_do_erase: mmc_poll_for_busy() done kworker/0:3H-2375 [000] .... 528.436047: mmc_do_erase: from=0x3fe000 to=0x7fbfff kworker/0:3H-2375 [000] .... 528.605276: mmc_do_erase: mmc_poll_for_busy() done kworker/0:3H-2375 [000] .... 528.605311: mmc_do_erase: from=0x7fc000 to=0x7ffffe kworker/0:3H-2375 [000] .... 
528.736726: mmc_do_erase: mmc_poll_for_busy() done kworker/0:3H-2375 [000] .... 528.736757: mmc_do_erase: from=0x7fffff to=0xbfdffe kworker/0:3H-2375 [000] .... 528.926908: mmc_do_erase: mmc_poll_for_busy() done kworker/0:3H-2375 [000] .... 528.926940: mmc_do_erase: from=0xbfdfff to=0xffbffe kworker/0:3H-2375 [000] .... 529.189489: mmc_do_erase: mmc_poll_for_busy() done kworker/0:3H-2375 [000] .... 529.189520: mmc_do_erase: from=0xffbfff to=0xfffffd kworker/0:3H-2375 [000] .... 529.386494: mmc_do_erase: mmc_poll_for_busy() done kworker/0:3H-2375 [000] .... 529.386524: mmc_do_erase: from=0xfffffe to=0x13fdffd kworker/0:3H-2375 [000] .... 529.629276: mmc_do_erase: mmc_poll_for_busy() done kworker/0:3H-2375 [000] .... 529.629307: mmc_do_erase: from=0x13fdffe to=0x17fbffd kworker/0:3H-2375 [000] .... 529.760731: mmc_do_erase: mmc_poll_for_busy() done kworker/0:3H-2375 [000] .... 529.760762: mmc_do_erase: from=0x17fbffe to=0x17ffffc kworker/0:3H-2375 [000] .... 529.892180: mmc_do_erase: mmc_poll_for_busy() done kworker/0:3H-2375 [000] .... 529.892211: mmc_do_erase: from=0x17ffffd to=0x1bfdffc kworker/0:3H-2375 [000] .... 530.023626: mmc_do_erase: mmc_poll_for_busy() done kworker/0:3H-2375 [000] .... 530.023656: mmc_do_erase: from=0x1bfdffd to=0x1cd9fff kworker/0:3H-2375 [000] .... 530.032057: mmc_do_erase: mmc_poll_for_busy() done Now there is a comment about the "perf_erase" that states it should be small to allow other I/O. But maybe we could also take the erase time into account and allow larger erase sizes. 
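To put numbers on "some are fast and some are slow", the traces above can be reduced to one duration per erase command. The following is a minimal sketch, not part of the patch: erase_durations() is a hypothetical helper, it only understands the two trace_printk() messages added by the debugging patch in this mail, and it assumes that from/to are 512-byte sector addresses (which should hold for block-addressed cards such as the 16GB ones tested here).

```python
import re

# Matches the two messages emitted by the debugging patch:
#   "<ts>: mmc_do_erase: from=0x... to=0x..."
#   "<ts>: mmc_do_erase: mmc_poll_for_busy() done"
EVENT_RE = re.compile(
    r"(?P<ts>\d+\.\d+): mmc_do_erase: "
    r"(?:from=0x(?P<first>[0-9a-f]+) to=0x(?P<last>[0-9a-f]+)"
    r"|(?P<done>mmc_poll_for_busy\(\) done))"
)

SECTOR_SIZE = 512  # assumption: from/to are 512-byte sector addresses


def erase_durations(trace):
    """Pair each erase start with the following 'done' event.

    Returns a list of (first_sector, last_sector, seconds) tuples.
    """
    results = []
    pending = None
    for m in EVENT_RE.finditer(trace):
        ts = float(m.group("ts"))
        if m.group("done"):
            if pending is not None:
                start_ts, first, last = pending
                results.append((first, last, ts - start_ts))
                pending = None
        else:
            pending = (ts, int(m.group("first"), 16), int(m.group("last"), 16))
    return results


if __name__ == "__main__":
    # Four events copied from the slow card's trace above.
    sample = (
        "kworker/1:3H-2368 [001] .... 212.096265: mmc_do_erase: from=0x0 to=0x1fff "
        "kworker/1:3H-2368 [001] .... 212.100282: mmc_do_erase: mmc_poll_for_busy() done "
        "kworker/1:3H-2368 [001] .... 212.104267: mmc_do_erase: from=0x6000 to=0x7fff "
        "kworker/1:3H-2368 [001] .... 213.086808: mmc_do_erase: mmc_poll_for_busy() done"
    )
    for first, last, secs in erase_durations(sample):
        mib = (last - first + 1) * SECTOR_SIZE / 2**20
        print(f"0x{first:x}-0x{last:x}: {mib:.0f} MiB in {secs * 1000:.1f} ms")
```

Under the sector assumption, each from/to pair in the first two traces covers 0x2000 sectors, i.e. 4 MiB per erase command, so a 512 MiB blkdiscard is split into 128 commands.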
-michael diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c index 19f1ee57fb34..e126a01414be 100644 --- a/drivers/mmc/core/core.c +++ b/drivers/mmc/core/core.c @@ -1675,6 +1675,8 @@ static int mmc_do_erase(struct mmc_card *card, unsigned int from, to <<= 9; } + trace_printk("from=0x%x to=0x%x\n", from, to); + if (mmc_card_sd(card)) cmd.opcode = SD_ERASE_WR_BLK_START; else @@ -1747,6 +1749,7 @@ static int mmc_do_erase(struct mmc_card *card, unsigned int from, /* Let's poll to find out when the erase operation completes. */ err = mmc_poll_for_busy(card, busy_timeout, MMC_BUSY_ERASE); + trace_printk("mmc_poll_for_busy() done\n"); out: mmc_retune_release(card->host); return err; diff --git a/drivers/mmc/core/sd.c b/drivers/mmc/core/sd.c index 6f054c449d46..5e48a2cd4ad3 100644 --- a/drivers/mmc/core/sd.c +++ b/drivers/mmc/core/sd.c @@ -291,6 +291,8 @@ static int mmc_read_ssr(struct mmc_card *card) card->erase_arg = (card->scr.sda_specx && discard_support) ? SD_DISCARD_ARG : SD_ERASE_ARG; + trace_printk("card->erase_arg=%d, au=%d es=%d et=%d eo=%d\n", card->erase_arg, au, es, et, eo); + return 0; } ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: discard feature, mkfs.ext4 and mmc default fallback to normal erase op 2020-12-08 11:26 ` Michael Walle 2020-12-08 16:17 ` Ulf Hansson @ 2020-12-08 16:52 ` Theodore Y. Ts'o 2020-12-09 14:51 ` Ulf Hansson 1 sibling, 1 reply; 11+ messages in thread From: Theodore Y. Ts'o @ 2020-12-08 16:52 UTC (permalink / raw) To: Michael Walle; +Cc: Ulf Hansson, linux-ext4, linux-mmc, linux-block On Tue, Dec 08, 2020 at 12:26:22PM +0100, Michael Walle wrote: > Do we really need to map these functions? What if we don't have an > actual discard, but just a slow erase (I'm now assuming that erase > will likely be slow on sdcards)? Can't we just tell the user space > there is no discard? Like on a normal HDD? I really don't know the > implications, seems like mmc_erase() is just there for the linux > discard feature. So the potential gotcha here is that "discard" is important for reducing write amplification, and thus improving the lifespan of devices. (See my reference to the Tesla engine controller story earlier.) So if a device doesn't have "discard" but has "erase", and "erase" is fast, then skipping the discard could end up significantly reducing the lifespan of your product, and we're back to the NHTSA investigating whether they should stick Tesla for the $1500 engine controller replacement when cards die early. I guess the JEDEC spec does specify a way to query the card for how long an erase takes, but I don't have the knowledge about how the actual real-world implementations of these specs (and their many variants over the years) actually behave. Can the erase times that they advertise actually be trusted to be accurate? How many of them actually supply erase times at all, no matter what the spec says? > Coming from the user space side. Does mkfs.ext4 assume its pre-discard > is fast? I'd think so, right? I'd presume it was intended to tell the > FTL of the block device, "hey these blocks are unused, you can do some > wear leveling with them". 
Yes, the assumption is that discard is fast. Exactly how fast seems to vary; this is one of the reasons why there are three different ways to do discards on a file system after files are deleted. One way is to do them after the deletion definitely won't be unwound (i.e., after the ext4 journal commit). But on some devices, the discard command, while fast, is slow enough that this will compete with the I/O completion times of other read commands, thus degrading system performance. So you can also execute the trim commands out of cron, using the fstrim command, which will run the discards in the background, and the system administrator can adjust when fstrim is executed during times when performance isn't critical. (e.g., when the phone is on a charger in the middle of the night, or at 4am local time, etc.) Finally, you can configure e2fsck to run the discards after the file system consistency check is done. The reason why we have to leave this up to the system administrators is that we have essentially no guidance from the device how slow the discard command might be, how it interferes with other device operations, and whether the discard might be more likely ignored if the device is busy. So it might be that the discard will more likely improve write endurance when it is done when the device is idle. All of the specs (SCSI, SATA, UFS, eMMC, SD) are remarkably unhelpful because performance considerations are generally considered "out of scope" of standards committees. They want to leave that up to market forces; which is why big companies (handset vendors, hyperscale cloud providers, systems integrators, etc.) have to spend as much money doing certification testing before deciding which products to buy; think of it as a full-employment act for storage engineers. :-) But yes, mke2fs assumes that discard is sufficiently fast that doing it at file system format time is extremely reasonable. 
The bigger concern is that we can't necessarily count on discard zero'ing the inode table, and there are robustness reasons (especially before we had metadata checksums) why file system repairs are much more robust if the inode table is zero'ed ahead of time. > I'm just about finding some SD cards and looking how they behave timing > wise and what they report they support (ie. erase or discard). Looks > like other cards are doing better. But I'd have to find out if they > support the discard (mine doesn't) and if they are slow too if I force > them to use the normal erase. The challenge is that this sort of thing gets rapidly out of date, and it's not just SD cards but also eMMC devices which are built into various embedded devices, high-end SDHC cards, etc., etc. So doing this gets very expensive. That being said, both ext4 and f2fs do pre-discards as part of the format step, since improving write endurance is important; customers get cranky when their $1000 smart phones die an early death. So an SD card that behaves the way yours does would probably get disqualified very early in the certification step if it were ever intended to be used in an Android handset, since pretty much all Android devices, or embedded devices for that matter, use either f2fs or ext4. That's one of the reasons why I was a bit surprised that your device had such an "interesting" performance profile. Maybe it was intended for use in digital cameras, and digital cameras don't issue discards? I don't know.... > > I agree, these are the three levels that make sense to support. > > > > Honestly I haven't been paying enough attention to discussions for the > > generic block layer around discards. However, considering what you > > just stated above, we seem to be missing one request operation, don't > > we? Yes, that's true. We only have "discard" and "secure discard". 
Part of that is because those are the only levels which are available for SSD's, for which I have the same general complaint vis-a-vis standards committees and the general lack of usefulness for file system engineers. For example, pretty much everyone in the enterprise and hyperscale cloud world assumes that low-numbered LBA's have better performance profiles, and are located physically at the outer diameter of HDD's, compared to high-numbered LBA's. But that's nothing which is specified by the standards committees, because "performance considerations are out of scope". Yet we still have to engineer storage systems which assume this to be true, even though nothing in the formal specs guarantees this. We just have to trust that anyone who tries to sell a HDD for which this isn't true, even if it is "standards compliant", is going to have a bad time, and trust that this is enough. (Perhaps this is why when a certain HDD manufacturer tried to sell HDD's containing drive-managed SMR for the NAS market, without disclosing this fact to consumers, this generated a massive backlash.... Simply being standards compliant is not enough.) - Ted
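As a side note on Michael's question about what user space is told: on Linux the block layer exports each device's discard limits under /sys/block/<dev>/queue/. The following is a minimal Python sketch, assuming the usual discard_granularity and discard_max_bytes attributes are present; the helper names and the device name mmcblk0 are only illustrative. Keep in mind that, as discussed in this thread, the MMC driver may advertise discard even when it maps the request to a legacy erase, so a non-zero value says nothing about how fast the operation is.

```python
from pathlib import Path

# Queue attributes exported by the Linux block layer for each block device.
DISCARD_ATTRS = ("discard_granularity", "discard_max_bytes")


def read_discard_limits(dev, sysfs_root="/sys/block"):
    """Return the discard attributes of `dev` as ints (None if unreadable)."""
    queue = Path(sysfs_root) / dev / "queue"
    limits = {}
    for name in DISCARD_ATTRS:
        try:
            limits[name] = int((queue / name).read_text().strip())
        except (OSError, ValueError):
            # Attribute missing or not a number: report it as unknown.
            limits[name] = None
    return limits


def advertises_discard(limits):
    """True if the kernel reports a non-zero maximum discard size."""
    return bool(limits.get("discard_max_bytes"))


if __name__ == "__main__":
    # "mmcblk0" is only an example device name.
    limits = read_discard_limits("mmcblk0")
    print(limits, "discard advertised:", advertises_discard(limits))
```

A zero or missing discard_max_bytes is how a device without discard support (the "normal HDD" case in Michael's question) shows up to user space.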
* Re: discard feature, mkfs.ext4 and mmc default fallback to normal erase op 2020-12-08 16:52 ` Theodore Y. Ts'o @ 2020-12-09 14:51 ` Ulf Hansson 2020-12-09 16:35 ` Theodore Y. Ts'o 0 siblings, 1 reply; 11+ messages in thread From: Ulf Hansson @ 2020-12-09 14:51 UTC (permalink / raw) To: Theodore Y. Ts'o; +Cc: Michael Walle, linux-ext4, linux-mmc, linux-block On Tue, 8 Dec 2020 at 17:52, Theodore Y. Ts'o <tytso@mit.edu> wrote: > > On Tue, Dec 08, 2020 at 12:26:22PM +0100, Michael Walle wrote: > > Do we really need to map these functions? What if we don't have an > > actual discard, but just a slow erase (I'm now assuming that erase > > will likely be slow on sdcards)? Can't we just tell the user space > > there is no discard? Like on a normal HDD? I really don't know the > > implications, seems like mmc_erase() is just there for the linux > > discard feature. > > So the potential gotcha here is that "discard" is important for > reducing write amplification, and thus improving the lifespan of > devices. (See my reference to the Tesla engine controller story > earlier.) So if a device doesn't have "discard" but has "erase", and > "erase" is fast, then skipping the discard could end up significantly > reducing the lifespan of your product, and we're back to the NHTSA > investigating whether they should stick Tesla for the $1500 engine > controller replacement when cards die early. Yes, exactly. The points about wear leveling and the lifespan of the device are critical. That said, we should continue to map discard requests to legacy erase commands for SD cards, unless the card supports the new discard, of course. One thing I realized though, is that we should probably announce and implement support for secure erase (QUEUE_FLAG_SECERASE) for SD cards, as that seems to map well to the erase command. The SD spec specifies that after an erase the data is either "0" or "1", which I guess is what is expected from a REQ_OP_SECURE_ERASE operation? 
> > I guess the JEDEC spec does specify a way to query the card for how > long an erase takes, but I don't have the knowledge about how the > actual real-world implementations of these specs (and their many > variants over the years) actually behave. Can the erase times that > they advertise actually be trusted to be accurate? How many of them > actually supply erase times at all, no matter what the spec says? For eMMC, discard commands are fast, and trim commands probably are too. For erase I don't know. As for the corresponding "erase times" that are specified in the eMMC registers, I guess those always refer to the worst-case scenario. I don't know how useful they really are in the end. In any case, we may end up with poor erase/discard performance, because of internal FW implementations. Although, what I think we may be able to improve, both from eMMC and SD point of view, is to allow more blocks per discard/erase operation. But honestly, I don't know how big of a problem this is, even if just staring at the code gives me some ideas. > > > Coming from the user space side. Does mkfs.ext4 assume its pre-discard > > is fast? I'd think so, right? I'd presume it was intended to tell the > > FTL of the block device, "hey these blocks are unused, you can do some > > wear leveling with them". > > Yes, the assumption is that discard is fast. Exactly how fast seems > to vary; this is one of the reasons why there are three different ways > to do discards on a file system after files are deleted. One way is > to do them after the delete definitely won't be unwound (i.e., after > the ext4 journal commit). But on some devices, the discard command, > while fast, is slow enough that this will compete with the I/O > completion times of other read commands, thus degrading system > performance. 
So you can also execute the trim commands out of cron, > using the fstrim command, which will run the discards in the > background, and the system administrator can adjust when fstrim is > executed during times when performance isn't critical. (e.g., when > the phone is on a charger in the middle of the night, or at 4am local > time, etc.) Finally, you can configure e2fsck to run the discards > after the file system consistency check is done. > > The reason why we have to leave this up to the system administrators > is that we have essentially no guidance from the device how slow the > discard command might be, how it interferes with other device > operations, and whether the discard might be more likely ignored if > the device is busy. So it might be that the discard will more likely > improve write endurance when it is done when the device is idle. All > of the specs (SCSI, SATA, UFS, eMMC, SD) are remarkably unhelpful > because performance considerations are generally considered "out of > scope" of standards committees. They want to leave that up to market > forces; which is why big companies (handset vendors, hyperscale > cloud providers, systems integrators, etc.) have to spend so much > money doing certification testing before deciding which products to > buy; think of it as a full-employment act for storage engineers. :-) A few comments related to the above. Even if the discarded blocks are flushed at some wisely selected point, when the device is idle, that doesn't guarantee that the internal garbage collection runs inside the device. In the end that depends on the FW implementation of the card - and I assume it's likely triggered based on some internal idle time and the amount of "garbage" there is to deal with. For both eMMC and SD cards, the specs define commands to manually control the background operations inside the cards. In principle, this allows us to tell the card when it's a good time to run garbage collection (and when not to). 
Both for eMMC and SD, we are not using this, yet. However, I have been playing with a couple of different ideas to explore this: *) Use the runtime PM framework to detect an idle period and then trigger background operations. The problem is that we don't really know how long we will be idle, meaning that we don't know if it's really a wise decision to trigger the background operations in the end. **) Invent a new type of generic block request, to let userspace trigger this. Of course, another option is also to leave this as is, thus relying on the internal FW of the card to act the best it can. Do you have any thoughts around this? [...] Kind regards Uffe
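Ulf's remark about allowing more blocks per discard/erase operation can be illustrated with some back-of-the-envelope arithmetic. The Python sketch below assumes a deliberately simplistic model in which every command covers at most a fixed number of bytes and costs a fixed latency; real cards need not behave like this, and the function names and the figures in the note that follows are made up for illustration.

```python
def command_count(total_bytes, max_bytes_per_cmd):
    """Number of discard/erase commands needed to cover total_bytes."""
    if total_bytes < 0 or max_bytes_per_cmd <= 0:
        raise ValueError("total_bytes must be >= 0 and max_bytes_per_cmd > 0")
    # Integer ceiling division: a partial final chunk still costs one command.
    return -(-total_bytes // max_bytes_per_cmd)


def estimated_seconds(total_bytes, max_bytes_per_cmd, seconds_per_cmd):
    """Total time under the (assumed) fixed-cost-per-command model."""
    return command_count(total_bytes, max_bytes_per_cmd) * seconds_per_cmd
```

Under this model a 32 GiB partition with a hypothetical 4 MiB per-command limit needs 8192 commands, and raising the limit to 64 MiB cuts that to 512. Whether the total time drops accordingly depends on whether the per-command cost grows with the size of the range, which is exactly what the thread leaves open.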
* Re: discard feature, mkfs.ext4 and mmc default fallback to normal erase op 2020-12-09 14:51 ` Ulf Hansson @ 2020-12-09 16:35 ` Theodore Y. Ts'o 0 siblings, 0 replies; 11+ messages in thread From: Theodore Y. Ts'o @ 2020-12-09 16:35 UTC (permalink / raw) To: Ulf Hansson; +Cc: Michael Walle, linux-ext4, linux-mmc, linux-block On Wed, Dec 09, 2020 at 03:51:24PM +0100, Ulf Hansson wrote: > > Even if the discarded blocks are flushed at some wisely selected > point, when the device is idle, that doesn't guarantee that the > internal garbage collection runs inside the device. In the end that > depends on the FW implementation of the card - and I assume it's > likely triggered based on some internal idle time and the amount of > "garbage" there is to deal with. At least from a file system perspective, I don't care when the internal garbage collection actually runs inside the device. What I do care about is that (a) a read to a discarded sector returns zero's after it has been discarded (or the storage device needs to tell me I can't count on that), and (b) that, for write endurance reasons, the garbage collection will *eventually* happen. If the list of erase blocks or flash pages that are not in use is tracked in such a way that they are actually garbage collected before the device actually needs free blocks, it really doesn't matter if it happens right away, or hours later. (If the device is 90% free, because it was just formatted and we did a pre-discard at format time, then it could happen hours or days later.) But if the device's FTL is so incompetent that it loses track of which erase blocks / flash pages need to be GC'ed, such that it impacts device lifetime... well then, that's sad, and it would be nice to find out about this without having to do an expensive, time-consuming certification process. 
(OTOH, all the big companies are doing hardware certifications anyway, because you can't fully trust the storage vendors, and how many storage vendors are really going to admit, or make it easy to determine, "the FTL is so cost-optimized that it's cr*p"? :-) Having a way to tell the storage device that it would be better to suspend GC, or to accelerate GC, because we know the device is about to become much less likely to perform writes, would certainly be a good and useful thing to do, although I see that as mostly being useful for improving I/O performance, especially for low-end flash. I suspect that high-end SSD's, which are designed so that they can handle continuous write streams without much performance degradation, have enough oomph in their internal CPU that they can do GC's in real-time while the device is under a continuous random write workload with only minimal performance impacts. > *) Use the runtime PM framework to detect an idle period and then > trigger background operations. The problem is, that we don't really > know how long we will be idle, meaning that we don't know if it's > really a wise decision to trigger the background operations in the > end. > > **) Invent a new type of generic block request, as to let userspace > trigger this. I think you really want to give userspace the ability to trigger this. Whether it's via a generic block request, or an ioctl, I'll leave that to the people who maintain the driver and/or block layer. That's because userspace will have knowledge of things like "the screen is off", or "the phone is on the wireless charger", and/or the user having said "OK, Google, goodnight" to trigger the night-time home automation commands. 
We can of course try to make some automatic determinations based on the runtime PM framework, but that doesn't necessarily tell us the likelihood that the system will become busy in the future; OTOH, maybe that doesn't matter, if the storage needs only a very tiny amount of time after it's told, "stop GC", to finish up what it's doing so it can respond to I/O requests at full speed? - Ted
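For comparison, userspace can already ask the block layer to discard a byte range through the existing BLKDISCARD ioctl, which is what the blkdiscard command mentioned at the start of the thread uses. The Python sketch below is not the background-operations trigger discussed above (no such interface is defined in this thread); it only shows, under stated assumptions, what an ioctl-driven request looks like from userspace and how one might time it. The request value is _IO(0x12, 119) as encoded on x86 and ARM, other architectures may encode it differently, and the function names are illustrative.

```python
import fcntl
import os
import struct
import time

# BLKDISCARD = _IO(0x12, 119). This encoding (0x1277) holds on x86 and ARM;
# it is an assumption for architectures where _IO() sets extra direction bits.
BLKDISCARD = (0x12 << 8) | 119


def pack_range(offset, length):
    """Pack the uint64_t[2] {start, length} argument that BLKDISCARD takes."""
    return struct.pack("=QQ", offset, length)


def timed_discard(path, offset, length):
    """Discard [offset, offset + length) on a block device; return seconds.

    WARNING: this destroys the data in that range. Needs write access to
    the device node, and raises OSError if the device rejects the request.
    """
    fd = os.open(path, os.O_WRONLY)
    try:
        start = time.monotonic()
        fcntl.ioctl(fd, BLKDISCARD, pack_range(offset, length))
        return time.monotonic() - start
    finally:
        os.close(fd)
```

Calling timed_discard("/dev/mmcblk0p2", 0, 1 << 20), with a device path chosen purely as an example, would discard the first MiB of that partition and report how long the kernel took to complete the request, which is one way to compare cards of the kind Michael describes.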
Thread overview: 11+ messages
2020-12-07 15:10 discard feature, mkfs.ext4 and mmc default fallback to normal erase op Michael Walle
2020-12-07 18:35 ` Theodore Y. Ts'o
2020-12-07 20:39 ` Michael Walle
2020-12-08 2:40 ` Theodore Y. Ts'o
2020-12-08 9:49 ` Ulf Hansson
2020-12-08 11:26 ` Michael Walle
2020-12-08 16:17 ` Ulf Hansson
2020-12-08 20:57 ` Michael Walle
2020-12-08 16:52 ` Theodore Y. Ts'o
2020-12-09 14:51 ` Ulf Hansson
2020-12-09 16:35 ` Theodore Y. Ts'o