* MMC quirks relating to performance/lifetime.
@ 2011-02-08 21:22 Andrei Warkentin
2011-02-08 21:38 ` Wolfram Sang
` (3 more replies)
0 siblings, 4 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-08 21:22 UTC (permalink / raw)
To: linux-arm-kernel
Hi,
I'm not sure if this is the best place to bring this up, but Russell's
name is on a fair share of drivers/mmc code, and there does seem to be
quite a bit of MMC-related discussions. Excuse me in advance if this
isn't the right forum :-).
Certain MMC vendors (maybe even quite a bit of them) use a pretty
rigid buffering scheme when it comes to handling writes. There is
usually a buffer A for random accesses, and a buffer B for sequential
accesses. For certain Toshiba parts, it looks like buffer A is 8KB
wide, with buffer B being 4MB wide, and all accesses larger than 8KB
effectively equating to 4MB accesses. Worse, consecutive small (8k)
writes are treated as one large sequential access, once again ending
up in buffer B, thus necessitating out-of-order writing to work around
this.
What this means is decreased life span for the parts, and it also
means a performance impact on small writes, but the first item is much
more crucial, especially for smaller parts.
As I've mentioned, probably more vendors are affected. How about a
generic MMC_BLOCK quirk that splits the requests (and optionally
reorders) them? The thresholds would then be adjustable as
module/kernel parameters based on manfid. I'm asking because I have a
patch now, but it's ugly and hardcoded against a specific manufacturer.
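To illustrate the kind of splitting proposed above, here is a minimal
user-space sketch in C. It is not the patch mentioned above and not existing
kernel code: the names write_chunk and split_write_at_pages are hypothetical,
and it assumes 512-byte sectors, so that an 8KB buffer corresponds to 16
sectors. It only splits a write at page boundaries; reordering is left out.

```c
#include <stddef.h>

/* Hypothetical names throughout; none of this is existing kernel API. */
struct write_chunk {
	unsigned long start;	/* first sector of the chunk */
	unsigned long nsect;	/* length of the chunk in sectors */
};

/*
 * Split the sector range [start, start + nsect) into chunks that never
 * cross a multiple of page_sectors. Returns the number of chunks stored
 * in out[], or 0 if page_sectors is 0 or out[] is too small to hold the
 * whole request.
 */
size_t split_write_at_pages(unsigned long start, unsigned long nsect,
			    unsigned long page_sectors,
			    struct write_chunk *out, size_t max_chunks)
{
	size_t n = 0;

	if (page_sectors == 0)
		return 0;

	while (nsect > 0) {
		/* Sectors left before the next page boundary. */
		unsigned long room = page_sectors - (start % page_sectors);
		unsigned long len = nsect < room ? nsect : room;

		if (n == max_chunks)
			return 0;

		out[n].start = start;
		out[n].nsect = len;
		n++;

		start += len;
		nsect -= len;
	}

	return n;
}
```

With page_sectors set to 16, a 20-sector write starting at sector 10 comes
out as two chunks: sectors 10-15 and sectors 16-29.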
Thanks,
A
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-08 21:22 MMC quirks relating to performance/lifetime Andrei Warkentin
@ 2011-02-08 21:38 ` Wolfram Sang
2011-02-08 22:42 ` Russell King - ARM Linux
` (2 subsequent siblings)
3 siblings, 0 replies; 117+ messages in thread
From: Wolfram Sang @ 2011-02-08 21:38 UTC (permalink / raw)
To: Andrei Warkentin; +Cc: linux-arm-kernel, linux-mmc
[-- Attachment #1: Type: text/plain, Size: 2032 bytes --]
On Tue, Feb 08, 2011 at 03:22:59PM -0600, Andrei Warkentin wrote:
> Hi,
>
> I'm not sure if this is the best place to bring this up, but Russell's
> name is on a fair share of drivers/mmc code, and there does seem to be
> quite a bit of MMC-related discussions. Excuse me in advance if this
> isn't the right forum :-).
Searching for MMC in MAINTAINERS will get you:
MULTIMEDIA CARD (MMC), SECURE DIGITAL (SD) AND SDIO SUBSYSTEM
M: Chris Ball <cjb@laptop.org>
L: linux-mmc@vger.kernel.org
...
List CCed...
> Certain MMC vendors (maybe even quite a bit of them) use a pretty
> rigid buffering scheme when it comes to handling writes. There is
> usually a buffer A for random accesses, and a buffer B for sequential
> accesses. For certain Toshiba parts, it looks like buffer A is 8KB
> wide, with buffer B being 4MB wide, and all accesses larger than 8KB
> effectively equating to 4MB accesses. Worse, consecutive small (8k)
> writes are treated as one large sequential access, once again ending
> up in buffer B, thus necessitating out-of-order writing to work around
> this.
>
> What this means is decreased life span for the parts, and it also
> means a performance impact on small writes, but the first item is much
> more crucial, especially for smaller parts.
>
> As I've mentioned, probably more vendors are affected. How about a
> generic MMC_BLOCK quirk that splits the requests (and optionally
> reorders) them? The thresholds would then be adjustable as
> module/kernel parameters based on manfid. I'm asking because I have a
> patch now, but it's ugly and hardcoded against a specific manufacturer.
>
> Thanks,
> A
>
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
--
Pengutronix e.K. | Wolfram Sang |
Industrial Linux Solutions | http://www.pengutronix.de/ |
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 197 bytes --]
^ permalink raw reply [flat|nested] 117+ messages in thread
* MMC quirks relating to performance/lifetime.
2011-02-08 21:22 MMC quirks relating to performance/lifetime Andrei Warkentin
2011-02-08 21:38 ` Wolfram Sang
@ 2011-02-08 22:42 ` Russell King - ARM Linux
2011-02-09 8:37 ` Linus Walleij
2011-02-11 14:41 ` Pavel Machek
3 siblings, 0 replies; 117+ messages in thread
From: Russell King - ARM Linux @ 2011-02-08 22:42 UTC (permalink / raw)
To: linux-arm-kernel
On Tue, Feb 08, 2011 at 03:22:59PM -0600, Andrei Warkentin wrote:
> I'm not sure if this is the best place to bring this up, but Russell's
> name is on a fair share of drivers/mmc code, and there does seem to be
> quite a bit of MMC-related discussions. Excuse me in advance if this
> isn't the right forum :-).
I dropped out of MMC stuff once we had a functional infrastructure
in place in the kernel - before that, there were various competing
implementations around.
The implementation that's there was based off what meager information
was available on the MMC protocol, as published by some of the card
manufacturers. Certainly no one had the backing to be able to get the
official specifications and such like, nor to approach the various
companies to get the sort of details you're talking about.
So, what's there is basically a best-effort to provide something usable
and which works (most of the time.) And to reflect that, error handling
is almost non-existent.
As part of trying to get better performance out of PIO-based interfaces,
I've recently been putting some effort into making the mmc block driver
a little more rugged in the face of various communication errors.
That's not to say that I'm now taking an active interest in MMC - I'm
not. I'm just fixing the occasional issue which causes me problems.
As for what you're talking about (controlling the coalescing of requests),
I think you're better off sorting that out with the higher block layers
to restrict the amount of coalescing that happens there. I think there
are some hooks already in place which allow you to define the maximum
size of any request, but this doesn't take account of read/write
properties. Maybe that's something the higher block layer should be
extended with?
If so, you'll have to discuss it with the block layer folk.
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-08 21:22 MMC quirks relating to performance/lifetime Andrei Warkentin
@ 2011-02-09 8:37 ` Linus Walleij
2011-02-08 22:42 ` Russell King - ARM Linux
` (2 subsequent siblings)
3 siblings, 0 replies; 117+ messages in thread
From: Linus Walleij @ 2011-02-09 8:37 UTC (permalink / raw)
To: Andrei Warkentin, linux-mmc; +Cc: linux-arm-kernel
[Quoting verbatim so the original mail hits linux-mmc; this is very
interesting!]
2011/2/8 Andrei Warkentin <andreiw@motorola.com>:
> Hi,
>
> I'm not sure if this is the best place to bring this up, but Russell's
> name is on a fair share of drivers/mmc code, and there does seem to be
> quite a bit of MMC-related discussions. Excuse me in advance if this
> isn't the right forum :-).
>
> Certain MMC vendors (maybe even quite a bit of them) use a pretty
> rigid buffering scheme when it comes to handling writes. There is
> usually a buffer A for random accesses, and a buffer B for sequential
> accesses. For certain Toshiba parts, it looks like buffer A is 8KB
> wide, with buffer B being 4MB wide, and all accesses larger than 8KB
> effectively equating to 4MB accesses. Worse, consecutive small (8k)
> writes are treated as one large sequential access, once again ending
> up in buffer B, thus necessitating out-of-order writing to work around
> this.
>
> What this means is decreased life span for the parts, and it also
> means a performance impact on small writes, but the first item is much
> more crucial, especially for smaller parts.
>
> As I've mentioned, probably more vendors are affected. How about a
> generic MMC_BLOCK quirk that splits the requests (and optionally
> reorders) them? The thresholds would then be adjustable as
> module/kernel parameters based on manfid. I'm asking because I have a
> patch now, but it's ugly and hardcoded against a specific manufacturer.
There is a quirk API so that specific quirks can be flagged for certain
vendors and cards, e.g. some Toshibas in this case. For an example, grep the
kernel source for MMC_QUIRK_BLKSZ_FOR_BYTE_MODE.
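As a rough illustration of per-vendor thresholds, here is a small C sketch
of a lookup table keyed by manufacturer ID. It does not reproduce the
kernel's quirk API: struct write_quirk, write_quirks and find_write_quirk
are hypothetical names, and the single entry simply mirrors the manfid,
page size and threshold defines from the patch attached later in this thread.

```c
#include <stddef.h>

/* Hypothetical structure; not the kernel's actual quirk API. */
struct write_quirk {
	unsigned int manfid;		/* manufacturer ID from the card's CID */
	unsigned int page_sectors;	/* split size in 512-byte sectors */
	unsigned int split_threshold;	/* split writes up to this many sectors */
};

const struct write_quirk write_quirks[] = {
	/* Values mirror the defines in the patch attached later in the thread. */
	{ 0x11, 16, 24 },
};

/* Return the entry for manfid, or NULL if the card needs no workaround. */
const struct write_quirk *find_write_quirk(unsigned int manfid)
{
	size_t i;

	for (i = 0; i < sizeof(write_quirks) / sizeof(write_quirks[0]); i++)
		if (write_quirks[i].manfid == manfid)
			return &write_quirks[i];

	return NULL;
}
```

Keeping the thresholds in a table rather than in defines would let other
vendors be added without touching the request path.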
But as Russell says this probably needs to be signalled up to the
block layer to be handled properly.
Why don't you post the code you have today as an RFC: patch?
I think many will be interested.
Yours,
Linus Walleij
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-09 8:37 ` Linus Walleij
@ 2011-02-09 9:13 ` Arnd Bergmann
-1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-09 9:13 UTC (permalink / raw)
To: linux-arm-kernel; +Cc: Linus Walleij, Andrei Warkentin, linux-mmc
On Wednesday 09 February 2011 09:37:40 Linus Walleij wrote:
> [Quoting verbatim so the original mail hits linux-mmc; this is very
> interesting!]
>
> 2011/2/8 Andrei Warkentin <andreiw@motorola.com>:
> > Hi,
> >
> > I'm not sure if this is the best place to bring this up, but Russell's
> > name is on a fair share of drivers/mmc code, and there does seem to be
> > quite a bit of MMC-related discussions. Excuse me in advance if this
> > isn't the right forum :-).
> >
> > Certain MMC vendors (maybe even quite a bit of them) use a pretty
> > rigid buffering scheme when it comes to handling writes. There is
> > usually a buffer A for random accesses, and a buffer B for sequential
> > accesses. For certain Toshiba parts, it looks like buffer A is 8KB
> > wide, with buffer B being 4MB wide, and all accesses larger than 8KB
> > effectively equating to 4MB accesses. Worse, consecutive small (8k)
> > writes are treated as one large sequential access, once again ending
> > up in buffer B, thus necessitating out-of-order writing to work around
> > this.
It's more complex, but I now have a pretty good understanding of
what the flash media actually do, after doing a lot of benchmarking.
Most of my results so far are documented on
https://wiki.linaro.org/WorkingGroups/KernelConsolidation/Projects/FlashCardSurvey
but I still need to write about the more recent discoveries.
What you describe as buffer A is the "page size" of the underlying
flash. It depends on the size and brand of the NAND flash chip and
can be anywhere between 2 KB and 16 KB for modern cards, depending
on how they combine multiple chips and planes within the chips.
What you describe as buffer B is sometimes called an "erase block
group" or an "allocation unit". This is the smallest unit that
gets kept in a global lookup table in the medium and can be anywhere
between 1 MB and 8 MB for cards larger than 4 GB, or as small as
128 KB (a single erase block) for smaller media, as far as I have
seen. When you don't write full aligned allocation units, the
card will have to eventually do garbage collection on the allocation
unit, which can take a long time (many milliseconds).
Most cards have a third size, typically somewhere between 32 and 128 KB,
which is the optimum size for writes. While you can do linear
writes to the card in page size units (writing an allocation unit
from start to finish), random access within the allocation unit
will be much faster with larger writes.
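The sizes described above can be turned into a few alignment checks. The
following C sketch is only an illustration: struct flash_geometry and the
helper names are hypothetical, and the page and allocation unit sizes have
to be supplied by the caller, since they differ from card to card.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of a card's geometry, in bytes. */
struct flash_geometry {
	uint64_t page_size;	/* e.g. 8 KB */
	uint64_t au_size;	/* allocation unit, e.g. 4 MB */
};

/* Index of the allocation unit that contains byte offset off. */
uint64_t au_index(const struct flash_geometry *g, uint64_t off)
{
	return off / g->au_size;
}

/* True if the write [off, off + len) starts and ends on page boundaries. */
bool write_is_page_aligned(const struct flash_geometry *g,
			   uint64_t off, uint64_t len)
{
	return len > 0 && off % g->page_size == 0 && len % g->page_size == 0;
}

/* True if the write covers one or more whole allocation units. */
bool write_covers_full_aus(const struct flash_geometry *g,
			   uint64_t off, uint64_t len)
{
	return len > 0 && off % g->au_size == 0 && len % g->au_size == 0;
}

/* True if the write touches more than one allocation unit. */
bool write_spans_aus(const struct flash_geometry *g,
		     uint64_t off, uint64_t len)
{
	return len > 0 && au_index(g, off) != au_index(g, off + len - 1);
}
```

A write for which write_covers_full_aus() is false is the case that
eventually forces garbage collection on the allocation unit.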
> > What this means is decreased life span for the parts, and it also
> > means a performance impact on small writes, but the first item is much
> > more crucial, especially for smaller parts.
> >
> > As I've mentioned, probably more vendors are affected. How about a
> > generic MMC_BLOCK quirk that splits the requests (and optionally
> > reorders) them? The thresholds would then be adjustable as
> > module/kernel parameters based on manfid. I'm asking because I have a
> > patch now, but it's ugly and hardcoded against a specific manufacturer.
It's not just MMC specific: USB flash drives, CF cards and even cheap
PATA or SATA SSDs have the same patterns. I think this will need
to be solved on a higher level, in the block device elevator code
and in the file systems.
> There is a quirk API so that specific quirks can be flagged for certain
> vendors and cards, e.g. some Toshibas in this case. e.g. grep the
> kernel source for MMC_QUIRK_BLKSZ_FOR_BYTE_MODE.
>
> But as Russell says this probably needs to be signalled up to the
> block layer to be handled properly.
>
> Why don't you post the code you have today as an RFC: patch,
> I think many will be interested?
Yes, I agree, that would be good. Also, I'd be interested to see the
output of 'head /sys/block/mmcblk0/device/*' on that card. I'm guessing
that the manufacturer ID of 0x0002 is Toshiba, and these are indeed
the worst cards that I have seen so far, because they cannot do
random access within an allocation unit, and they cannot alternate
writes between multiple allocation units (# open AUs linear is "1" in
my wiki table), while most cards can handle at least two.
Andrei, I'm certainly interested in working with you on this.
The point you brought up about the Toshiba cards being especially
bad is certainly valid. Even if we do something better in the block
layer, we need to have a way to detect the worst-case scenario,
so we can work around that.
Arnd
^ permalink raw reply [flat|nested] 117+ messages in thread
* MMC quirks relating to performance/lifetime.
2011-02-08 21:22 MMC quirks relating to performance/lifetime Andrei Warkentin
` (2 preceding siblings ...)
2011-02-09 8:37 ` Linus Walleij
@ 2011-02-11 14:41 ` Pavel Machek
2011-02-11 14:51 ` Arnd Bergmann
3 siblings, 1 reply; 117+ messages in thread
From: Pavel Machek @ 2011-02-11 14:41 UTC (permalink / raw)
To: linux-arm-kernel
Hi!
> I'm not sure if this is the best place to bring this up, but Russell's
> name is on a fair share of drivers/mmc code, and there does seem to be
> quite a bit of MMC-related discussions. Excuse me in advance if this
> isn't the right forum :-).
>
> Certain MMC vendors (maybe even quite a bit of them) use a pretty
> rigid buffering scheme when it comes to handling writes. There is
> usually a buffer A for random accesses, and a buffer B for sequential
> accesses. For certain Toshiba parts, it looks like buffer A is 8KB
> wide, with buffer B being 4MB wide, and all accesses larger than 8KB
> effectively equating to 4MB accesses. Worse, consecutive small (8k)
> writes are treated as one large sequential access, once again ending
> up in buffer B, thus necessitating out-of-order writing to work around
> this.
Hmmmm, I somehow assumed MMCs would be much more clever than this.
> reorders) them? The thresholds would then be adjustable as
> module/kernel parameters based on manfid. I'm asking because I have a
> patch now, but it's ugly and hardcoded against a specific manufacturer.
How big is performance difference?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply [flat|nested] 117+ messages in thread
* MMC quirks relating to performance/lifetime.
2011-02-11 14:41 ` Pavel Machek
@ 2011-02-11 14:51 ` Arnd Bergmann
2011-02-11 15:20 ` Lei Wen
2011-03-08 6:59 ` Pavel Machek
0 siblings, 2 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-11 14:51 UTC (permalink / raw)
To: linux-arm-kernel
On Friday 11 February 2011, Pavel Machek wrote:
> Hi!
>
> > I'm not sure if this is the best place to bring this up, but Russell's
> > name is on a fair share of drivers/mmc code, and there does seem to be
> > quite a bit of MMC-related discussions. Excuse me in advance if this
> > isn't the right forum :-).
> >
> > Certain MMC vendors (maybe even quite a bit of them) use a pretty
> > rigid buffering scheme when it comes to handling writes. There is
> > usually a buffer A for random accesses, and a buffer B for sequential
> > accesses. For certain Toshiba parts, it looks like buffer A is 8KB
> > wide, with buffer B being 4MB wide, and all accesses larger than 8KB
> > effectively equating to 4MB accesses. Worse, consecutive small (8k)
> > writes are treated as one large sequential access, once again ending
> > up in buffer B, thus necessitating out-of-order writing to work around
> > this.
>
> Hmmmm, I somehow assumed MMCs would be much more clever than this.
No, these devices are incredibly stupid, or extremely optimized to
a specific use case (writing large video files to FAT32), depending on how
you look at them.
> > reorders) them? The thresholds would then be adjustable as
> > module/kernel parameters based on manfid. I'm asking because I have a
> > patch now, but it's ugly and hardcoded against a specific manufacturer.
>
> How big is performance difference?
Several orders of magnitude. It is very easy to get a card that can write
12 MB/s into a case where it writes no more than 30 KB/s, doing only
things that happen frequently with ext3.
Arnd
^ permalink raw reply [flat|nested] 117+ messages in thread
* MMC quirks relating to performance/lifetime.
2011-02-11 14:51 ` Arnd Bergmann
@ 2011-02-11 15:20 ` Lei Wen
2011-02-11 15:25 ` Arnd Bergmann
2011-03-08 6:59 ` Pavel Machek
1 sibling, 1 reply; 117+ messages in thread
From: Lei Wen @ 2011-02-11 15:20 UTC (permalink / raw)
To: linux-arm-kernel
On Fri, Feb 11, 2011 at 10:51 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Friday 11 February 2011, Pavel Machek wrote:
>> Hi!
>>
>> > I'm not sure if this is the best place to bring this up, but Russell's
>> > name is on a fair share of drivers/mmc code, and there does seem to be
>> > quite a bit of MMC-related discussions. Excuse me in advance if this
>> > isn't the right forum :-).
>> >
>> > Certain MMC vendors (maybe even quite a bit of them) use a pretty
>> > rigid buffering scheme when it comes to handling writes. There is
>> > usually a buffer A for random accesses, and a buffer B for sequential
>> > accesses. For certain Toshiba parts, it looks like buffer A is 8KB
>> > wide, with buffer B being 4MB wide, and all accesses larger than 8KB
>> > effectively equating to 4MB accesses. Worse, consecutive small (8k)
>> > writes are treated as one large sequential access, once again ending
>> > up in buffer B, thus necessitating out-of-order writing to work around
>> > this.
>>
>> Hmmmm, I somehow assumed MMCs would be much more clever than this.
>
> No, these devices are incredibly stupid, or extremely optimized to
> a specific use case (writing large video files to FAT32), depending on how
> you look at them.
>
>> > reorders) them? The thresholds would then be adjustable as
>> > module/kernel parameters based on manfid. I'm asking because I have a
>> > patch now, but it's ugly and hardcoded against a specific manufacturer.
>>
>> How big is performance difference?
>
> Several orders of magnitude. It is very easy to get a card that can write
> 12 MB/s into a case where it writes no more than 30 KB/s, doing only
> things that happen frequently with ext3.
>
Maybe we could get that case into the mmc_test code, so that we could later
track whether it has been fixed or not? Or, in other words, to prove whether
the firmware in the SD card is stupid or not. :)
Best regards,
Lei
^ permalink raw reply [flat|nested] 117+ messages in thread
* MMC quirks relating to performance/lifetime.
2011-02-11 15:20 ` Lei Wen
@ 2011-02-11 15:25 ` Arnd Bergmann
0 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-11 15:25 UTC (permalink / raw)
To: linux-arm-kernel
On Friday 11 February 2011, Lei Wen wrote:
> > Several orders of magnitude. It is very easy to get a card that can write
> > 12 MB/s into a case where it writes no more than 30 KB/s, doing only
> > things that happen frequently with ext3.
> >
>
> Maybe we could get that case into the mmc_test code, so that we could later
> track whether it has been fixed or not? Or, in other words, to prove whether
> the firmware in the SD card is stupid or not. :)
There are many kinds of stupid, and a lot of cards are. I've actually had
excellent success with simply measuring from user space, which is
much easier than in mmc_test.
Unfortunately, you have to write to the card to do that, which may destroy
the data even if you write the same data that is already on it.
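For readers who want a feel for what measuring from user space can look
like, here is a minimal C sketch. It is not the tool referred to in this
mail: measure_write_throughput is a hypothetical helper that times
synchronous writes of a given block size to a path and reports bytes per
second. It assumes a POSIX system, and it overwrites the start of whatever
it is pointed at, so it should only be run against a scratch file or a
device whose contents do not matter.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: write count blocks of block_size bytes to path,
 * starting at offset 0, using O_SYNC so that each write() waits for the
 * data to be handed to the device. Returns the throughput in bytes per
 * second, or -1.0 on error.
 * WARNING: this overwrites the beginning of path.
 */
double measure_write_throughput(const char *path, size_t block_size,
				size_t count)
{
	struct timespec t0, t1;
	double secs;
	size_t i;
	char *buf;
	int fd;

	if (block_size == 0 || count == 0)
		return -1.0;

	buf = malloc(block_size);
	if (!buf)
		return -1.0;
	memset(buf, 0xA5, block_size);

	fd = open(path, O_WRONLY | O_CREAT | O_SYNC, 0600);
	if (fd < 0) {
		free(buf);
		return -1.0;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < count; i++) {
		if (write(fd, buf, block_size) != (ssize_t)block_size) {
			close(fd);
			free(buf);
			return -1.0;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	close(fd);
	free(buf);

	secs = (double)(t1.tv_sec - t0.tv_sec) +
	       (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
	if (secs <= 0.0)
		return -1.0;

	return (double)block_size * (double)count / secs;
}
```

Calling it with several block sizes (for example 4 KB, 64 KB and 4 MB) and
comparing the results gives a first impression of how strongly a card
depends on write size, although page cache and file system effects mean the
numbers are only indicative unless the raw device is used.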
See
https://wiki.linaro.org/WorkingGroups/KernelConsolidation/Projects/FlashCardSurvey
for most of my results. I'm about to write up a better paper with all the
measurements, and will make my tools available soon.
Arnd
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-09 8:37 ` Linus Walleij
@ 2011-02-11 22:27 ` Andrei Warkentin
-1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-11 22:27 UTC (permalink / raw)
To: Linus Walleij; +Cc: linux-mmc, linux-arm-kernel
[-- Attachment #1: Type: text/plain, Size: 2350 bytes --]
On Wed, Feb 9, 2011 at 2:37 AM, Linus Walleij <linus.walleij@linaro.org> wrote:
> [Quoting verbatim so the original mail hits linux-mmc; this is very
> interesting!]
>
> 2011/2/8 Andrei Warkentin <andreiw@motorola.com>:
>> Hi,
>>
>> I'm not sure if this is the best place to bring this up, but Russell's
>> name is on a fair share of drivers/mmc code, and there does seem to be
>> quite a bit of MMC-related discussions. Excuse me in advance if this
>> isn't the right forum :-).
>>
>> Certain MMC vendors (maybe even quite a bit of them) use a pretty
>> rigid buffering scheme when it comes to handling writes. There is
>> usually a buffer A for random accesses, and a buffer B for sequential
>> accesses. For certain Toshiba parts, it looks like buffer A is 8KB
>> wide, with buffer B being 4MB wide, and all accesses larger than 8KB
>> effectively equating to 4MB accesses. Worse, consecutive small (8k)
>> writes are treated as one large sequential access, once again ending
>> up in buffer B, thus necessitating out-of-order writing to work around
>> this.
>>
>> What this means is decreased life span for the parts, and it also
>> means a performance impact on small writes, but the first item is much
>> more crucial, especially for smaller parts.
>>
>> As I've mentioned, probably more vendors are affected. How about a
>> generic MMC_BLOCK quirk that splits the requests (and optionally
>> reorders) them? The thresholds would then be adjustable as
>> module/kernel parameters based on manfid. I'm asking because I have a
>> patch now, but it's ugly and hardcoded against a specific manufacturer.
>
> There is a quirk API so that specific quirks can be flagged for certain
> vendors and cards, e.g. some Toshibas in this case. e.g. grep the
> kernel source for MMC_QUIRK_BLKSZ_FOR_BYTE_MODE.
>
> But as Russell says this probably needs to be signalled up to the
> block layer to be handled properly.
>
> Why don't you post the code you have today as an RFC: patch,
> I think many will be interested?
>
> Yours,
> Linus Walleij
>
I think it's worthwhile to make the upper block layers aware of
MMC (and apparently other flash memory) limitations, but I think as a
first step it could make sense (for me) to reformat the patch I am
attaching into something that looks better.
Don't take the attached patch too seriously :-).
Thanks,
A
[-- Attachment #2: toshiba_emmc_opt.patch --]
[-- Type: text/x-diff, Size: 8738 bytes --]
diff --git a/drivers/mmc/card/block.c b/drivers/mmc/card/block.c
index 7054fd5..3b32329 100644
--- a/drivers/mmc/card/block.c
+++ b/drivers/mmc/card/block.c
@@ -60,6 +60,7 @@ struct mmc_blk_data {
spinlock_t lock;
struct gendisk *disk;
struct mmc_queue queue;
+ char *bounce;
unsigned int usage;
unsigned int read_only;
@@ -93,6 +94,9 @@ static void mmc_blk_put(struct mmc_blk_data *md)
__clear_bit(devidx, dev_use);
+ if (md->bounce)
+ kfree(md->bounce);
+
put_disk(md->disk);
kfree(md);
}
@@ -312,6 +316,157 @@ out:
return err ? 0 : 1;
}
+/*
+ * Workaround for Toshiba eMMC performance. If the request is less than two
+ * flash pages in size, then we want to split the write into one or two
+ * page-aligned writes to take advantage of faster buffering. Here we can
+ * adjust the size of the MMC request and let the block layer request handler
+ * deal with generating another MMC request.
+ */
+#define TOSHIBA_MANFID 0x11
+#define TOSHIBA_PAGE_SIZE 16 /* sectors */
+#define TOSHIBA_ADJUST_THRESHOLD 24 /* sectors */
+static bool mmc_adjust_toshiba_write(struct mmc_card *card,
+ struct mmc_request *mrq)
+{
+ if (mmc_card_mmc(card) && card->cid.manfid == TOSHIBA_MANFID &&
+ mrq->data->blocks <= TOSHIBA_ADJUST_THRESHOLD) {
+ int sectors_in_page = TOSHIBA_PAGE_SIZE -
+ (mrq->cmd->arg % TOSHIBA_PAGE_SIZE);
+ if (mrq->data->blocks > sectors_in_page) {
+ mrq->data->blocks = sectors_in_page;
+ return true;
+ }
+ }
+
+ return false;
+}
+
+/*
+ * This is another strange workaround to try to close the gap on Toshiba eMMC
+ * performance when compared to other vendors. In order to take advantage
+ * of certain optimizations and assumptions in those cards, we will look for
+ * multiblock write transfers below a certain size and we do the following:
+ *
+ * - Break them up into separate page-aligned (8k flash pages) transfers.
+ * - Execute the transfers in reverse order.
+ * - Use "reliable write" transfer mode.
+ *
+ * Neither the block I/O layer nor the scatterlist design seem to lend them-
+ * selves well to executing a block request out of order. So instead we let
+ * mmc_blk_issue_rq() setup the MMC request for the entire transfer and then
+ * break it up and reorder it here. This also requires that we put the data
+ * into a bounce buffer and send it as individual sg's.
+ */
+#define TOSHIBA_LOW_THRESHOLD 48 /* sectors */
+#define TOSHIBA_HIGH_THRESHOLD 64 /* sectors */
+static bool mmc_handle_toshiba_write(struct mmc_queue *mq,
+ struct mmc_card *card,
+ struct mmc_request *mrq)
+{
+ struct mmc_blk_data *md = mq->data;
+ unsigned int first_page, last_page, page;
+ unsigned long flags;
+
+ if (!md->bounce ||
+ mrq->data->blocks > TOSHIBA_HIGH_THRESHOLD ||
+ mrq->data->blocks < TOSHIBA_LOW_THRESHOLD)
+ return false;
+
+ first_page = mrq->cmd->arg / TOSHIBA_PAGE_SIZE;
+ last_page = (mrq->cmd->arg + mrq->data->blocks - 1) / TOSHIBA_PAGE_SIZE;
+
+ /* Single page write: just do it the normal way */
+ if (first_page == last_page)
+ return false;
+
+ local_irq_save(flags);
+ sg_copy_to_buffer(mrq->data->sg, mrq->data->sg_len,
+ md->bounce, mrq->data->blocks * 512);
+ local_irq_restore(flags);
+
+ for (page = last_page + 1; page-- > first_page; ) {
+ unsigned long offset, length;
+ struct mmc_blk_request brq;
+ struct mmc_command cmd;
+ struct scatterlist sg;
+
+ memset(&brq, 0, sizeof(struct mmc_blk_request));
+ brq.mrq.cmd = &brq.cmd;
+ brq.mrq.data = &brq.data;
+
+ brq.cmd.arg = page * TOSHIBA_PAGE_SIZE;
+ brq.data.blksz = 512;
+ if (page == first_page) {
+ brq.cmd.arg = mrq->cmd->arg;
+ brq.data.blocks = TOSHIBA_PAGE_SIZE -
+ (mrq->cmd->arg % TOSHIBA_PAGE_SIZE);
+ } else if (page == last_page)
+ brq.data.blocks = (mrq->cmd->arg + mrq->data->blocks) %
+ TOSHIBA_PAGE_SIZE;
+ if (brq.data.blocks == 0)
+ brq.data.blocks = TOSHIBA_PAGE_SIZE;
+
+ if (!mmc_card_blockaddr(card))
+ brq.cmd.arg <<= 9;
+ brq.cmd.flags = MMC_RSP_SPI_R1 | MMC_RSP_R1 | MMC_CMD_ADTC;
+ brq.stop.opcode = MMC_STOP_TRANSMISSION;
+ brq.stop.arg = 0;
+ brq.stop.flags = MMC_RSP_SPI_R1B | MMC_RSP_R1B | MMC_CMD_AC;
+
+ brq.data.flags |= MMC_DATA_WRITE;
+ if (brq.data.blocks > 1) {
+ if (!mmc_host_is_spi(card->host))
+ brq.mrq.stop = &brq.stop;
+ brq.cmd.opcode = MMC_WRITE_MULTIPLE_BLOCK;
+ } else {
+ brq.mrq.stop = NULL;
+ brq.cmd.opcode = MMC_WRITE_BLOCK;
+ }
+
+ if (brq.cmd.opcode == MMC_WRITE_MULTIPLE_BLOCK &&
+ brq.data.blocks <= card->ext_csd.rel_wr_sec_c) {
+ int err;
+
+ cmd.opcode = MMC_SET_BLOCK_COUNT;
+ cmd.arg = brq.data.blocks | (1 << 31);
+ cmd.flags = MMC_RSP_R1 | MMC_CMD_AC;
+ err = mmc_wait_for_cmd(card->host, &cmd, 0);
+ if (!err)
+ brq.mrq.stop = NULL;
+ }
+
+ mmc_set_data_timeout(&brq.data, card);
+
+ offset = (brq.cmd.arg - mrq->cmd->arg) * 512;
+ length = brq.data.blocks * 512;
+ sg_init_one(&sg, md->bounce + offset, length);
+ brq.data.sg = &sg;
+ brq.data.sg_len = 1;
+
+ mmc_wait_for_req(card->host, &brq.mrq);
+
+ mrq->data->bytes_xfered += brq.data.bytes_xfered;
+
+ if (brq.cmd.error || brq.data.error || brq.stop.error) {
+ mrq->cmd->error = brq.cmd.error;
+ mrq->data->error = brq.data.error;
+ mrq->stop->error = brq.stop.error;
+
+ /*
+ * We're executing the request backwards, so don't let
+ * the block layer think some part of it has succeeded.
+ * It will get it wrong. Since the failure will cause
+ * us to fall back on single block writes, we're better
+ * off reporting that none of the data was written.
+ */
+ mrq->data->bytes_xfered = 0;
+ break;
+ }
+ }
+
+ return true;
+}
static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *req)
{
struct mmc_blk_data *md = mq->data;
@@ -378,6 +533,9 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *req)
brq.data.flags |= MMC_DATA_WRITE;
}
+ if (rq_data_dir(req) == WRITE)
+ mmc_adjust_toshiba_write(card, &brq.mrq);
+
mmc_set_data_timeout(&brq.data, card);
brq.data.sg = mq->sg;
@@ -402,9 +560,14 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *req)
brq.data.sg_len = i;
}
- mmc_queue_bounce_pre(mq);
-
- mmc_wait_for_req(card->host, &brq.mrq);
+ mmc_queue_bounce_pre(mq);
+
+ /*
+ * Try the workaround first for writes, then fall back.
+ */
+ if (rq_data_dir(req) != WRITE || disable_multi ||
+ !mmc_handle_toshiba_write(mq, card, &brq.mrq))
+ mmc_wait_for_req(card->host, &brq.mrq);
mmc_queue_bounce_post(mq);
@@ -589,6 +752,15 @@ static struct mmc_blk_data *mmc_blk_alloc(struct mmc_card *card)
goto out;
}
+ if (card->cid.manfid == TOSHIBA_MANFID && mmc_card_mmc(card)) {
+ pr_info("%s: enable Toshiba workaround\n",
+ mmc_hostname(card->host));
+ md->bounce = kmalloc(TOSHIBA_HIGH_THRESHOLD * 512, GFP_KERNEL);
+ if (!md->bounce) {
+ ret = -ENOMEM;
+ goto err_kfree;
+ }
+ }
/*
* Set the read-only status based on the supported commands
@@ -655,6 +827,8 @@ static struct mmc_blk_data *mmc_blk_alloc(struct mmc_card *card)
err_putdisk:
put_disk(md->disk);
err_kfree:
+ if (md->bounce)
+ kfree(md->bounce);
kfree(md);
out:
return ERR_PTR(ret);
diff --git a/drivers/mmc/core/mmc.c b/drivers/mmc/core/mmc.c
index 45055c4..17eef89 100644
--- a/drivers/mmc/core/mmc.c
+++ b/drivers/mmc/core/mmc.c
@@ -307,6 +307,9 @@ static int mmc_read_ext_csd(struct mmc_card *card)
else
card->erased_byte = 0x0;
+ if (card->ext_csd.rev >= 5)
+ card->ext_csd.rel_wr_sec_c = ext_csd[EXT_CSD_REL_WR_SEC_C];
+
out:
kfree(ext_csd);
diff --git a/include/linux/mmc/card.h b/include/linux/mmc/card.h
index 6b75250..fea7ecb 100644
--- a/include/linux/mmc/card.h
+++ b/include/linux/mmc/card.h
@@ -43,6 +43,7 @@ struct mmc_csd {
struct mmc_ext_csd {
u8 rev;
+ u8 rel_wr_sec_c;
u8 erase_group_def;
u8 sec_feature_support;
unsigned int sa_timeout; /* Units: 100ns */
diff --git a/include/linux/mmc/mmc.h b/include/linux/mmc/mmc.h
index a5d765c..1e87020 100644
--- a/include/linux/mmc/mmc.h
+++ b/include/linux/mmc/mmc.h
@@ -260,6 +260,7 @@ struct _mmc_csd {
#define EXT_CSD_CARD_TYPE 196 /* RO */
#define EXT_CSD_SEC_CNT 212 /* RO, 4 bytes */
#define EXT_CSD_S_A_TIMEOUT 217 /* RO */
+#define EXT_CSD_REL_WR_SEC_C 222
#define EXT_CSD_ERASE_TIMEOUT_MULT 223 /* RO */
#define EXT_CSD_HC_ERASE_GRP_SIZE 224 /* RO */
#define EXT_CSD_BOOT_SIZE_MULTI 226
^ permalink raw reply related [flat|nested] 117+ messages in thread
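The split-and-reverse arithmetic in mmc_handle_toshiba_write() can be tried out in user space. The following is only a minimal sketch of that arithmetic, assuming 16-sector (8 KiB) pages as in the patch; the names split_reverse, struct chunk and PAGE_SECTORS are made up for illustration and are not part of the patch or of the kernel. It does no I/O and does not model reliable writes. The loop is written so that it also terminates when the first page index is 0, which matters because the page counter is unsigned.

```c
#include <assert.h>

#define PAGE_SECTORS 16u /* assumed 8 KiB flash page, in 512-byte sectors */

struct chunk {
	unsigned int start; /* first sector of this transfer */
	unsigned int count; /* number of sectors in this transfer */
};

/*
 * Split the sector range [start, start + count) on PAGE_SECTORS
 * boundaries and store the pieces last page first.
 * Returns the number of pieces, or 0 if they do not fit into out[].
 */
static unsigned int split_reverse(unsigned int start, unsigned int count,
				  struct chunk *out, unsigned int max)
{
	unsigned int end = start + count;
	unsigned int first_page, last_page, page, n = 0;

	assert(count > 0);
	first_page = start / PAGE_SECTORS;
	last_page = (end - 1) / PAGE_SECTORS;
	if (last_page - first_page + 1 > max)
		return 0;

	/* "page-- > first_page" stays correct when first_page is 0 */
	for (page = last_page + 1; page-- > first_page; ) {
		unsigned int lo = page * PAGE_SECTORS;
		unsigned int hi = lo + PAGE_SECTORS;

		/* Clamp the first and last piece to the request itself */
		if (lo < start)
			lo = start;
		if (hi > end)
			hi = end;
		out[n].start = lo;
		out[n].count = hi - lo;
		n++;
	}
	return n;
}
```

For a 48-sector write starting at sector 10, this yields four transfers: 10 sectors at 48, 16 at 32, 16 at 16, and finally 6 at 10.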
* Re: MMC quirks relating to performance/lifetime.
2011-02-09 9:13 ` Arnd Bergmann
@ 2011-02-11 22:33 ` Andrei Warkentin
-1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-11 22:33 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: Linus Walleij, linux-mmc, linux-arm-kernel
On Wed, Feb 9, 2011 at 3:13 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Wednesday 09 February 2011 09:37:40 Linus Walleij wrote:
>> [Quoting verbatim so the original mail hits linux-mmc, this is very
>> interesting!]
>>
>> 2011/2/8 Andrei Warkentin <andreiw@motorola.com>:
>> > Hi,
>> >
>> > I'm not sure if this is the best place to bring this up, but Russell's
>> > name is on a fair share of drivers/mmc code, and there does seem to be
>> > quite a bit of MMC-related discussions. Excuse me in advance if this
>> > isn't the right forum :-).
>> >
>> > Certain MMC vendors (maybe even quite a bit of them) use a pretty
>> > rigid buffering scheme when it comes to handling writes. There is
>> > usually a buffer A for random accesses, and a buffer B for sequential
>> > accesses. For certain Toshiba parts, it looks like buffer A is 8KB
>> > wide, with buffer B being 4MB wide, and all accesses larger than 8KB
>> > effectively equating to 4MB accesses. Worse, consecutive small (8k)
>> > writes are treated as one large sequential access, once again ending
>> > up in buffer B, thus necessitating out-of-order writing to work around
>> > this.
>
> It's more complex, but I now have a pretty good understanding of
> what the flash media actually do, after doing a lot of benchmarking.
> Most of my results so far are documented on
>
> https://wiki.linaro.org/WorkingGroups/KernelConsolidation/Projects/FlashCardSurvey
>
> but I still need to write about the more recent discoveries.
>
> What you describe as buffer A is the "page size" of the underlying
> flash. It depends on the size and brand of the NAND flash chip and
> can be anywhere between 2 KB and 16 KB for modern cards, depending
> on how they combine multiple chips and planes within the chips.
>
> What you describe as buffer B is sometimes called an "erase block
> group" or an "allocation unit". This is the smallest unit that
> gets kept in a global lookup table in the medium and can be anywhere
> between 1 MB and 8 MB for cards larger than 4 GB, or as small as
> 128 KB (a single erase block) for smaller media, as far as I have
> seen. When you don't write full aligned allocation units, the
> card will have to eventually do garbage collection on the allocation
> unit, which can take a long time (many milliseconds).
>
> Most cards have a third size, typically somewhere between 32 and 128 KB,
> which is the optimum size for writes. While you can do linear
> writes to the card in page size units (writing an allocation unit
> from start to finish), random access within the allocation unit
> will be much faster with larger writes.
>
>> > What this means is decreased life span for the parts, and it also
>> > means a performance impact on small writes, but the first item is much
>> > more crucial, especially for smaller parts.
>> >
>> > As I've mentioned, probably more vendors are affected. How about a
>> > generic MMC_BLOCK quirk that splits the requests (and optionally
>> > reorders) them? The thresholds would then be adjustable as
>> > module/kernel parameters based on manfid. I'm asking because I have a
>> > patch now, but it's ugly and hardcoded against a specific manufacturer.
>
> It's not just MMC specific: USB flash drives, CF cards and even cheap
> PATA or SATA SSDs have the same patterns. I think this will need
> to be solved on a higher level, in the block device elevator code
> and in the file systems.
>
>> There is a quirk API so that specific quirks can be flagged for certain
>> vendors and cards, e.g. some Toshibas in this case. e.g. grep the
>> kernel source for MMC_QUIRK_BLKSZ_FOR_BYTE_MODE.
>>
>> But as Russell says this probably needs to be signalled up to the
>> block layer to be handled properly.
>>
>> Why don't you post the code you have today as an RFC: patch,
>> I think many will be interested?
>
> Yes, I agree, that would be good. Also, I'd be interested to see the
> output of 'head /sys/block/mmcblk0/device/*' on that card. I'm guessing
> that the manufacturer ID of 0x0002 is Toshiba, and these are indeed
> the worst cards that I have seen so far, because they can not do
> random access within an allocation unit, and they can not write to
> multiple allocation units alternating (# open AUs linear is "1" in
> my wiki table), while most cards can do at least two.
>
> Andrei, I'm certainly interested in working with you on this.
> The point you brought up about the Toshiba cards being especially
> bad is certainly valid: even if we do something better in the block
> layer, we need a way to detect the worst-case scenario so we can
> work around it.
>
> Arnd
>
Arnd,
Yes, this is a Toshiba card. I've sent the patch as a reply to Linus' email.
cid - 02010053454d3332479070cc51451d00
csd - d00f00320f5903ffffffffff92404000
erase_size - 524288
fwrev - 0x0
hwrev - 0x0
manfid - 0x000002
name - SEM32G
oemid - 0x0100
preferred_erase_size - 2097152
^ permalink raw reply [flat|nested] 117+ messages in thread
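The manfid, oemid and name values above can be cross-checked against the raw cid string. The following is a minimal sketch, assuming the MMC CID layout with an 8-bit manufacturer ID in bits [127:120], a 16-bit OEM ID in bits [119:104] and a six-character product name in bits [103:56]; the names decode_mmc_cid and struct cid_fields are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct cid_fields {
	unsigned int manfid; /* CID bits [127:120] */
	unsigned int oemid;  /* CID bits [119:104] */
	char name[7];        /* CID bits [103:56], six ASCII characters */
};

/* Parse two hex digits at s into *out; returns 0 on success. */
static int hex_byte(const char *s, unsigned int *out)
{
	return sscanf(s, "%2x", out) == 1 ? 0 : -1;
}

/*
 * Decode the leading fields of a 32-digit CID hex string as printed
 * by sysfs. Returns 0 on success, -1 on malformed input.
 */
static int decode_mmc_cid(const char *hex, struct cid_fields *f)
{
	unsigned int b[9];
	int i;

	assert(hex != NULL && f != NULL);
	if (strlen(hex) != 32)
		return -1;
	for (i = 0; i < 9; i++)
		if (hex_byte(hex + 2 * i, &b[i]))
			return -1;

	f->manfid = b[0];
	f->oemid = (b[1] << 8) | b[2];
	for (i = 0; i < 6; i++)
		f->name[i] = (char)b[3 + i];
	f->name[6] = '\0';
	return 0;
}
```

Fed the cid string reported above, this gives manfid 0x02, oemid 0x0100 and the name SEM32G, matching the sysfs attributes. The patch posted in this thread matches on a manufacturer ID of 0x11, so the match value would presumably have to differ for a part that reports 0x000002.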
* Re: MMC quirks relating to performance/lifetime.
2011-02-09 9:13 ` Arnd Bergmann
@ 2011-02-11 23:23 ` Linus Walleij
-1 siblings, 0 replies; 117+ messages in thread
From: Linus Walleij @ 2011-02-11 23:23 UTC (permalink / raw)
To: Arnd Bergmann
Cc: linux-arm-kernel, Andrei Warkentin, linux-mmc,
Sebastian Rasmussen, Ulf Hansson
2011/2/9 Arnd Bergmann <arnd@arndb.de>:
> Most of my results so far are documented on
> https://wiki.linaro.org/WorkingGroups/KernelConsolidation/Projects/FlashCardSurvey
H'm! That's an interesting resource indeed. When you write
"From measurements, it appears that the size in which data is
managed is typically 64 kb on SD cards" and "the size of the
medium is always a multiple of entire allocation groups, and
the most common size today is 4 MB" and then list
Size, Allocation Unit, Write Size, Page Size, FAT Location,
open AUs linear, open AUs random, Algorithm.
How exactly do you measure that?
I'm sort of smelling a card-probe.git with this tool that you
can run on your device and get out data like that listed
in your table. We have a rather large stash of cards we can
probe for you to get that kind of data out if it is useful, and
I believe other Linaro members may have such stuff too,
if empirical data is useful to your work.
Yours,
Linus Walleij
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-11 23:23 ` Linus Walleij
@ 2011-02-12 10:45 ` Arnd Bergmann
-1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-12 10:45 UTC (permalink / raw)
To: Linus Walleij
Cc: linux-arm-kernel, Andrei Warkentin, linux-mmc,
Sebastian Rasmussen, Ulf Hansson
On Saturday 12 February 2011 00:23:37 Linus Walleij wrote:
> H'm! That's an interesting resource indeed. When you write
> "From measurements, it appears that the size in which data is
> managed is typically 64 kb on SD cards" and "the size of the
> medium is always a multiple of entire allocation groups, and
> the most common size today is 4 MB" and then list
> Size, Allocation Unit, Write Size, Page Size, FAT Location,
> open AUs linear, open AUs random, Algorithm.
>
> How exactly do you measure that?
It's not an exact science, but for most cards I have found
reasonably good ways to identify these numbers:
* the allocation unit size can almost always be found
using read-only tests: reading 2kb across an allocation
unit boundary is slightly slower than reading 2kb
just before or just after the boundary.
For a few cards where this doesn't work, I do write tests.
After finding out how many allocation units can be open,
it's trivial to find out the size.
* Finding the number of open allocation units means I write
to the start of a few AUs alternating. Up to a certain
number, the throughput is constant, above that, it drops
sharply, sometimes by one or two orders of magnitude.
* The page size can also be found doing read-only tests, with
varying block sizes. Smaller reads always give lower throughput
than larger reads, but getting smaller than page size
drops down significantly more than the difference between
multi-page reads. This effect is more prominent in write tests.
* Finding the algorithm basically means I write an allocation
unit using varying block sizes two times, using both linear
access and random access. Cards that are optimized for
linear access can be unbelievably slow in the random access
tests. Sometimes the performance is the same above a specific
block size, but slower for random access below that size.
This is the write block size.
* Finding the write block size in cases where this is not the
case can be harder. Most cards have a noticeable performance
drop in writes of less than a few pages, so that's the
size I put in the table.
* The FAT location is clearly visible in a number of tests
done inside of an allocation unit. It's normally slower for
linear access, but faster for random access. Sometimes
reading the FAT is also slower than reading elsewhere.
> I'm sort of smelling a card-probe.git with this tool that you
> can run on your device and get out data like that listed
> in your table. We have a rather large stash of cards we can
> probe for you to get that kind of data out if it is useful, and
> I believe other Linaro members may have such stuff too,
> if empirical data is useful to your work.
The tool I'm using is on http://git.linaro.org/gitweb?p=people/arnd/flashbench.git
Unfortunately, it's not yet in a state where I would recommend
that anyone besides me run it. I'm still rewriting the source
for every new card I get to nail down the specific properties.
I will make an announcement when I have the tool in a state
of general usefulness, and at that point I would really
appreciate people running it, but just not yet.
Arnd
^ permalink raw reply [flat|nested] 117+ messages in thread
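The read-only allocation unit test described above (a 2 KB read across a suspected boundary compared with reads just before and just after it) could be sketched as follows. This is not taken from flashbench; the names offsets_for, time_read_ns and PROBE_BYTES are invented here, and the candidate allocation unit size is something the caller has to guess and vary. For meaningful timings the descriptor would have to bypass the page cache, for example by opening the block device with O_DIRECT, which is why the buffer is aligned; that part is left to the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

#define PROBE_BYTES 2048 /* size of each probe read */

struct probe_offsets {
	off_t before; /* read ends exactly at the boundary */
	off_t across; /* read straddles the boundary */
	off_t after;  /* read starts exactly at the boundary */
};

/* Offsets of the three probe reads around the k-th multiple of au_bytes. */
struct probe_offsets offsets_for(off_t au_bytes, off_t k)
{
	off_t boundary = au_bytes * k;
	struct probe_offsets p;

	assert(au_bytes >= PROBE_BYTES && k >= 1);
	p.before = boundary - PROBE_BYTES;
	p.across = boundary - PROBE_BYTES / 2;
	p.after = boundary;
	return p;
}

/* Time one PROBE_BYTES read at offset; returns nanoseconds, or -1 on error. */
int64_t time_read_ns(int fd, off_t offset)
{
	static _Alignas(4096) char buf[PROBE_BYTES];
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	if (pread(fd, buf, sizeof(buf), offset) != (ssize_t)sizeof(buf))
		return -1;
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000LL +
	       (t1.tv_nsec - t0.tv_nsec);
}
```

A caller would repeat the three reads many times for each candidate size and compare averages; a candidate for which the straddling read is consistently slower than the other two is a plausible allocation unit size.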
* Re: MMC quirks relating to performance/lifetime.
2011-02-12 10:45 ` Arnd Bergmann
@ 2011-02-12 10:59 ` Russell King - ARM Linux
-1 siblings, 0 replies; 117+ messages in thread
From: Russell King - ARM Linux @ 2011-02-12 10:59 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Linus Walleij, Ulf Hansson, linux-mmc, Andrei Warkentin,
linux-arm-kernel, Sebastian Rasmussen
On Sat, Feb 12, 2011 at 11:45:41AM +0100, Arnd Bergmann wrote:
> * The FAT location is clearly visible in a number of tests
> done inside of an allocation unit. It's normally slower for
> linear access, but faster for random access. Sometimes
> reading the FAT is also slower than reading elsewhere.
I also wouldn't be surprised if there are some cards out there which parse
the FAT being written, and start activities (such as erasing clusters)
based upon changes therein. Such cards would be unsuitable for use with
non-FAT filesystems.
It might be worth devising some sort of check for this kind of behaviour.
Unrelated, I have a USB based device which provides an emulated FAT
filesystem - all files except one on this filesystem are read-only.
The writable file is a textual configuration file. It can be reliably
updated by Windows based systems, but updates from Linux based systems
are ignored - presumably because updates to the FAT/directory/data
clusters are occurring in a different order.
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-12 10:59 ` Russell King - ARM Linux
@ 2011-02-12 16:28 ` Arnd Bergmann
-1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-12 16:28 UTC (permalink / raw)
To: Russell King - ARM Linux
Cc: Linus Walleij, Ulf Hansson, linux-mmc, Andrei Warkentin,
linux-arm-kernel, Sebastian Rasmussen
On Saturday 12 February 2011 11:59:18 Russell King - ARM Linux wrote:
> On Sat, Feb 12, 2011 at 11:45:41AM +0100, Arnd Bergmann wrote:
> > * The FAT location is clearly visible in a number of tests
> > done inside of an allocation unit. It's normally slower for
> > linear access, but faster for random access. Sometimes
> > reading the FAT is also slower than reading elsewhere.
>
> I wouldn't also be surprised if there's some cards out there which parse
> the FAT being written, and start activities (such as erasing clusters)
> based upon changes therein. Such cards would be unsuitable for use with
> non-FAT filesystems.
>
> It might be worth devising some sort of check for this kind of behaviour.
Possible, but it doesn't seem to happen with any of the cards I have
tested; the controllers in there appear to be too simplistic.
Also, the recommendations for SD cards are to issue explicit erase
requests, which would make this unnecessary.
OTOH, SD cards do specify exactly where the FAT should be stored on
the medium, so it would be possible to make this kind of assumption.
USB sticks and CF cards might be smart enough to actually do it:
some of them have more sophisticated logic than SD cards (most
do not), and there is no USB mass storage command for erase.
> Unrelated, I have a USB based device which provides an emulated FAT
> filesystem - all files except one on this filesystem are read-only.
> The writable file is a textual configuration file. It can be reliably
> updated by Windows based systems, but updates from Linux based systems
> are ignored - presumably because updates to the FAT/directory/data
> clusters are occuring in a different order.
Fun. I think qemu also comes with one of these FAT emulation layers,
as do some mp3 players, but from what I have heard, they are not as
broken.
Arnd
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-12 16:28 ` Arnd Bergmann
@ 2011-02-12 16:37 ` Russell King - ARM Linux
-1 siblings, 0 replies; 117+ messages in thread
From: Russell King - ARM Linux @ 2011-02-12 16:37 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Linus Walleij, Ulf Hansson, linux-mmc, Andrei Warkentin,
linux-arm-kernel, Sebastian Rasmussen
On Sat, Feb 12, 2011 at 05:28:32PM +0100, Arnd Bergmann wrote:
> On Saturday 12 February 2011 11:59:18 Russell King - ARM Linux wrote:
> > Unrelated, I have a USB based device which provides an emulated FAT
> > filesystem - all files except one on this filesystem are read-only.
> > The writable file is a textual configuration file. It can be reliably
> > updated by Windows based systems, but updates from Linux based systems
> > are ignored - presumably because updates to the FAT/directory/data
> > clusters are occuring in a different order.
>
> Fun. I think qemu also comes with one of these FAT emulation layers,
> as do some mp3 players, but from what I have heard, they are not as
> broken.
Given that it is a secure GPS/barographic flight logger which has
approval for ratifying world record flight claims, you may understand why
it has to be extremely picky about how it interfaces with the external
world. In particular, it restricts updates to modification of the
configuration file, while not allowing any of the logged data files to
be changed in any way.
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-11 22:33 ` Andrei Warkentin
@ 2011-02-12 17:05 ` Arnd Bergmann
-1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-12 17:05 UTC (permalink / raw)
To: Andrei Warkentin; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc
On Friday 11 February 2011 23:33:42 Andrei Warkentin wrote:
> On Wed, Feb 9, 2011 at 3:13 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> Yes, this is a Toshiba card. I've sent the patch as a reply to Linus' email.
>
> cid - 02010053454d3332479070cc51451d00
> csd - d00f00320f5903ffffffffff92404000
> erase_size - 524288
> fwrev - 0x0
> hwrev - 0x0
> manfid - 0x000002
> name - SEM32G
> oemid - 0x0100
> preferred_erase_size - 2097152
Very interesting. So the manfid is the same as on most Kingston cards,
but the oemid is different. Most cards have a two-letter ASCII code
in there, 0x544d ("TM") on Kingston cards, and I always assumed that
this stood for "Toshiba Memory".
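As an illustration of how the manfid, oemid and name values quoted above relate to the raw cid string, here is a minimal sketch. It assumes the layout that reproduces the quoted sysfs output for this part (first byte is the manufacturer ID, the next two bytes are the OEM ID, the following six bytes are the product name); SD cards use a shorter product name, so this is not a general CID parser. The names `cid_fields`, `parse_cid` and `oemid_to_ascii` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct cid_fields {
    unsigned manfid;  /* first CID byte */
    unsigned oemid;   /* next two CID bytes, big-endian */
    char name[7];     /* six product-name characters plus NUL */
};

/* Convert one hex digit to its value, or return -1 if it is not hex. */
static int hex_nibble(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 10;
    return -1;
}

/* Parse a 32-digit hex CID string; returns 0 on success, -1 if malformed. */
static int parse_cid(const char *hex, struct cid_fields *out)
{
    uint8_t raw[16];

    if (strlen(hex) != 32)
        return -1;
    for (int i = 0; i < 16; i++) {
        int hi = hex_nibble(hex[2 * i]);
        int lo = hex_nibble(hex[2 * i + 1]);
        if (hi < 0 || lo < 0)
            return -1;
        raw[i] = (uint8_t)((hi << 4) | lo);
    }
    out->manfid = raw[0];
    out->oemid = ((unsigned)raw[1] << 8) | raw[2];
    memcpy(out->name, &raw[3], 6);
    out->name[6] = '\0';
    return 0;
}

/* Render a 16-bit OEM ID as two characters, e.g. 0x544d as "TM". */
static void oemid_to_ascii(unsigned oemid, char out[3])
{
    out[0] = (char)((oemid >> 8) & 0xff);
    out[1] = (char)(oemid & 0xff);
    out[2] = '\0';
}
```

Fed the cid string quoted above, this yields manfid 0x02, oemid 0x0100 and the name SEM32G, matching the sysfs values in the quote; 0x0100 is not printable ASCII, unlike the 0x544d seen on the Kingston cards.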
What is even stranger is the size value (among other fields) in the CSD:
the card claims a size of exactly 32GB, which I find hard to believe,
given that there are always some bad and reserved blocks.
Are you sure that the card you have is authentic? I've heard a lot about
fake USB sticks advertising a size that is much larger than the actual
flash inside of them.
Also, this is the first card I have seen advertise an allocation unit
size of 2 MB (preferred_erase_size); all other cards seem to advertise
4 MB these days, even if they actually have 2 or 8 MB.
Arnd
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-12 17:05 ` Arnd Bergmann
@ 2011-02-12 17:33 ` Andrei Warkentin
-1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-12 17:33 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc
On Sat, Feb 12, 2011 at 11:05 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Friday 11 February 2011 23:33:42 Andrei Warkentin wrote:
>> On Wed, Feb 9, 2011 at 3:13 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>
>> Yes, this is a Toshiba card. I've sent the patch as a reply to Linus' email.
>>
>> cid - 02010053454d3332479070cc51451d00
>> csd - d00f00320f5903ffffffffff92404000
>> erase_size - 524288
>> fwrev - 0x0
>> hwrev - 0x0
>> manfid - 0x000002
>> name - SEM32G
>> oemid - 0x0100
>> preferred_erase_size - 2097152
>
> Very interesting. So the manfid is the same as on most Kingston cards,
> but the oemid is different. Most cards have a two-letter ASCII code
> in there, 0x544d ("TM") on Kingston cards, and I always assumed that
> this stood for "Toshiba Memory".
>
> What is even stranger is the size value (among other fields) in the CSD,
> the card claims a size of exactly 32GB, which I find hard to believe,
> given that there are always some bad and reserved blocks.
>
> Are you sure that the card you have is authentic? I've heard a lot about
> fake USB sticks advertising a size that is much larger than the actual
> flash inside of them.
>
> Also this is the first card that I see advertise an allocation unit
> size of 2MB (preferred_erase_size), all other cards seem to advertise
> 4 MB these days, even if they actually have 2 or 8 MB.
>
> Arnd
>
This is a Toshiba eMMC part. It is 32GB as far as the OS can see and access.
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-12 17:33 ` Andrei Warkentin
@ 2011-02-12 18:22 ` Arnd Bergmann
-1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-12 18:22 UTC (permalink / raw)
To: Andrei Warkentin; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc
On Saturday 12 February 2011 18:33:10 Andrei Warkentin wrote:
> On Sat, Feb 12, 2011 at 11:05 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> > On Friday 11 February 2011 23:33:42 Andrei Warkentin wrote:
> >> On Wed, Feb 9, 2011 at 3:13 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> >
> >> Yes, this is a Toshiba card. I've sent the patch as a reply to Linus' email.
> >>
> >> cid - 02010053454d3332479070cc51451d00
> >> csd - d00f 0032 0f59 03ff ffffffff92404000
> >> erase_size - 524288
> >> fwrev - 0x0
> >> hwrev - 0x0
> >> manfid - 0x000002
> >> name - SEM32G
> >> oemid - 0x0100
> >> preferred_erase_size - 2097152
> >
>
> This is a Toshiba eMMC part. It is 32GB as far as the OS can see and access.
Ah, right, that explains all the values, which make sense for eMMC4
but not for SDHC ;-)
Arnd
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-11 22:27 ` Andrei Warkentin
@ 2011-02-12 18:37 ` Arnd Bergmann
-1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-12 18:37 UTC (permalink / raw)
To: linux-arm-kernel; +Cc: Andrei Warkentin, Linus Walleij, linux-mmc
On Friday 11 February 2011 23:27:51 Andrei Warkentin wrote:
>
> diff --git a/drivers/mmc/card/block.c b/drivers/mmc/card/block.c
> index 7054fd5..3b32329 100644
> --- a/drivers/mmc/card/block.c
> +++ b/drivers/mmc/card/block.c
> @@ -312,6 +316,157 @@ out:
> return err ? 0 : 1;
> }
>
> +/*
> + * Workaround for Toshiba eMMC performance. If the request is less than two
> + * flash pages in size, then we want to split the write into one or two
> + * page-aligned writes to take advantage of faster buffering. Here we can
> + * adjust the size of the MMC request and let the block layer request handler
> + * deal with generating another MMC request.
> + */
> +#define TOSHIBA_MANFID 0x11
> +#define TOSHIBA_PAGE_SIZE 16 /* sectors */
> +#define TOSHIBA_ADJUST_THRESHOLD 24 /* sectors */
> +static bool mmc_adjust_toshiba_write(struct mmc_card *card,
> + struct mmc_request *mrq)
> +{
> + if (mmc_card_mmc(card) && card->cid.manfid == TOSHIBA_MANFID &&
> + mrq->data->blocks <= TOSHIBA_ADJUST_THRESHOLD) {
> + int sectors_in_page = TOSHIBA_PAGE_SIZE -
> + (mrq->cmd->arg % TOSHIBA_PAGE_SIZE);
> + if (mrq->data->blocks > sectors_in_page) {
> + mrq->data->blocks = sectors_in_page;
> + return true;
> + }
> + }
> +
> + return false;
> +}
This part might make sense in general, though it's hard to know the
page size in the general case. For many SD cards, writing naturally
aligned 64 KB blocks was the ideal case in my testing, but some need
larger alignment or can deal well with smaller blocks.
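To make the alignment arithmetic of the quoted function easier to follow outside the driver, here is a user-space sketch of the same calculation with the page size and threshold turned into parameters rather than vendor constants. The names `split_params` and `first_chunk_sectors` are hypothetical, and the sketch assumes the start address is expressed in 512-byte sectors.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tunables; the quoted patch hardcodes 16 and 24 sectors. */
struct split_params {
    uint32_t page_sectors;      /* flash page size in sectors, e.g. 16 = 8 KB */
    uint32_t threshold_sectors; /* only requests up to this size are split */
};

/*
 * Return how many sectors the first request should carry so that it ends
 * on a page boundary. Requests above the threshold, or requests that
 * already fit inside the current page, are returned unchanged.
 */
static uint32_t first_chunk_sectors(const struct split_params *p,
                                    uint32_t start_sector, uint32_t nblocks)
{
    assert(p->page_sectors > 0);

    if (nblocks > p->threshold_sectors)
        return nblocks;

    /* Sectors left between the start address and the next page boundary. */
    uint32_t room = p->page_sectors - (start_sector % p->page_sectors);

    return nblocks > room ? room : nblocks;
}
```

With the values from the patch (16 and 24), a 16-sector write starting at sector 8 is cut to 8 sectors, and the remaining 8 sectors would be issued as a second, page-aligned request. A card that prefers naturally aligned 64 KB writes would correspond to a page size of 128 sectors.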
> +/*
> + * This is another strange workaround to try to close the gap on Toshiba eMMC
> + * performance when compared to other vendors. In order to take advantage
> + * of certain optimizations and assumptions in those cards, we will look for
> + * multiblock write transfers below a certain size and we do the following:
> + *
> + * - Break them up into seperate page-aligned (8k flash pages) transfers.
> + * - Execute the transfers in reverse order.
> + * - Use "reliable write" transfer mode.
> + *
> + * Neither the block I/O layer nor the scatterlist design seem to lend them-
> + * selves well to executing a block request out of order. So instead we let
> + * mmc_blk_issue_rq() setup the MMC request for the entire transfer and then
> + * break it up and reorder it here. This also requires that we put the data
> + * into a bounce buffer and send it as individual sg's.
> + */
A lot of the SD cards I've seen will react very badly to reverse order,
so that is definitely a dangerous thing to put into the code.
Also, the "reliable write" seems like a really interesting thing to
rely on for performance. I believe what the card is trying to do here
is to optimize FAT32 directory updates. By using the small blocks in
unpredictable order (anything but linear), you tell the card to treat
this as part of a directory, so it probably gets written in a different
way, but that might mean that it will try to turn the current erase
block group into a special small write mode.
I could imagine that this will cause problems on your eMMC once you
write small blocks to more than one erase block group, because that probably
causes it to start garbage collection -- it makes sense for the cards
to know that something is a directory, but it can only know about
a small number of directories, so it will turn the segment into a regular
one as soon as something else becomes a directory.
Arnd
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-12 18:37 ` Arnd Bergmann
@ 2011-02-13 0:10 ` Andrei Warkentin
-1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-13 0:10 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc
On Sat, Feb 12, 2011 at 12:37 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Friday 11 February 2011 23:27:51 Andrei Warkentin wrote:
>>
>> diff --git a/drivers/mmc/card/block.c b/drivers/mmc/card/block.c
>> index 7054fd5..3b32329 100644
>> --- a/drivers/mmc/card/block.c
>> +++ b/drivers/mmc/card/block.c
>> @@ -312,6 +316,157 @@ out:
>> return err ? 0 : 1;
>> }
>>
>> +/*
>> + * Workaround for Toshiba eMMC performance. If the request is less than two
>> + * flash pages in size, then we want to split the write into one or two
>> + * page-aligned writes to take advantage of faster buffering. Here we can
>> + * adjust the size of the MMC request and let the block layer request handler
>> + * deal with generating another MMC request.
>> + */
>> +#define TOSHIBA_MANFID 0x11
>> +#define TOSHIBA_PAGE_SIZE 16 /* sectors */
>> +#define TOSHIBA_ADJUST_THRESHOLD 24 /* sectors */
>> +static bool mmc_adjust_toshiba_write(struct mmc_card *card,
>> + struct mmc_request *mrq)
>> +{
>> + if (mmc_card_mmc(card) && card->cid.manfid == TOSHIBA_MANFID &&
>> + mrq->data->blocks <= TOSHIBA_ADJUST_THRESHOLD) {
>> + int sectors_in_page = TOSHIBA_PAGE_SIZE -
>> + (mrq->cmd->arg % TOSHIBA_PAGE_SIZE);
>> + if (mrq->data->blocks > sectors_in_page) {
>> + mrq->data->blocks = sectors_in_page;
>> + return true;
>> + }
>> + }
>> +
>> + return false;
>> +}
>
> This part might make sense in general, though it's hard to know the
> page size in the general case. For many SD cards, writing naturally
> aligned 64 KB blocks was the ideal case in my testing, but some need
> larger alignment or can deal well with smaller blocks.
>
...which is why I believe this should be a per-card boot parameter,
and that it really only makes sense for embedded parts, where you know
nothing else is going to be used as, say, mmcblk0.
>> +/*
>> + * This is another strange workaround to try to close the gap on Toshiba eMMC
>> + * performance when compared to other vendors. In order to take advantage
>> + * of certain optimizations and assumptions in those cards, we will look for
>> + * multiblock write transfers below a certain size and we do the following:
>> + *
>> + * - Break them up into seperate page-aligned (8k flash pages) transfers.
>> + * - Execute the transfers in reverse order.
>> + * - Use "reliable write" transfer mode.
>> + *
>> + * Neither the block I/O layer nor the scatterlist design seem to lend them-
>> + * selves well to executing a block request out of order. So instead we let
>> + * mmc_blk_issue_rq() setup the MMC request for the entire transfer and then
>> + * break it up and reorder it here. This also requires that we put the data
>> + * into a bounce buffer and send it as individual sg's.
>> + */
>
> A lot of the SD cards I've seen will react very badly to reverse order,
> so that is definitely a dangerous thing to put into the code.
>
> Also, the "reliable write" seems like a really interesting thing to
> rely on for performance. I believe what the card is trying to do here
> is to optimize FAT32 directory updates. By using the small blocks in
> unpredictable order (anything but linear), you tell the card to treat
> this as part of a directory, so it probably gets written in a different
> way, but that might mean that it will try to turn the current erase
> block group into a special small write mode.
>
> I could imagine that this will cause problems on your eMMC once you
> write small blocks to more than erase block group, because that probably
> causes it to start garbage collection -- it makes sense for the cards
> to know that something is a directory, but it can only know about
> a small number of directories, so it will turn the segment into a regular
> one as soon something else becomes a directory.
>
It's difficult for me to argue one way or another. The code provided
implements Toshiba's suggestions for mitigating excessive wear.
Basically, as far as certain Android products are concerned, Motorola
created some "typical usage" cases and collected data logs. These
logs were analyzed by Toshiba, which reported a write multiplication
factor of approximately 16x.
Analysis of the data written showed that there were many random accesses
of 16KB or 32KB, meaning they go into buffer B. According to T, that
means an extra GC and P/E cycle. I'm guessing per write.
So T suggested that random data should instead go into buffer A. How? Two suggestions.
1) Split smaller accesses into 8KB and write with reliable write.
2) Split smaller accesses into 8KB and write in reverse.
The patch does both and I am verifying if that is really necessary. I
need to go see the mmc spec and what it says about reliable write.
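The second suggestion (split into 8KB pieces and write them in reverse) can be sketched as follows. This is only an illustration of the ordering, not the patch itself or the vendor's recommendation verbatim: it does not model reliable write, it assumes addresses are in 512-byte sectors, and the names `chunk` and `split_reverse` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct chunk {
    uint32_t start; /* first sector of the piece */
    uint32_t count; /* number of sectors in the piece */
};

/*
 * Split the range [start, start + nblocks) at page boundaries and store
 * the pieces in out[] with the highest address first. Returns the number
 * of pieces, or -1 if out[] cannot hold them all.
 */
static int split_reverse(uint32_t start, uint32_t nblocks,
                         uint32_t page_sectors,
                         struct chunk *out, int max_chunks)
{
    assert(page_sectors > 0);

    int n = 0;
    uint32_t pos = start;
    uint32_t end = start + nblocks;

    while (pos < end) {
        uint32_t room = page_sectors - (pos % page_sectors);
        uint32_t left = end - pos;

        if (n == max_chunks)
            return -1;
        out[n].start = pos;
        out[n].count = left < room ? left : room;
        pos += out[n].count;
        n++;
    }

    /* Reverse in place so the last page is issued first. */
    for (int i = 0; i < n / 2; i++) {
        struct chunk tmp = out[i];
        out[i] = out[n - 1 - i];
        out[n - 1 - i] = tmp;
    }
    return n;
}
```

For a 32-sector (16KB) write starting at sector 8 with 16-sector pages, this produces three pieces issued as sectors 32-39, then 16-31, then 8-15, so no two consecutive requests look like one linear access.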
Basically, whatever behavior you choose is going to be wrong for some set
of cards. Which is why tuning it probably only makes sense for eMMC
parts, and should be a set of runtime/compile-time quirks. What do you
think?
^ permalink raw reply [flat|nested] 117+ messages in thread
* MMC quirks relating to performance/lifetime.
@ 2011-02-13 0:10 ` Andrei Warkentin
0 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-13 0:10 UTC (permalink / raw)
To: linux-arm-kernel
On Sat, Feb 12, 2011 at 12:37 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Friday 11 February 2011 23:27:51 Andrei Warkentin wrote:
>>
>> diff --git a/drivers/mmc/card/block.c b/drivers/mmc/card/block.c
>> index 7054fd5..3b32329 100644
>> --- a/drivers/mmc/card/block.c
>> +++ b/drivers/mmc/card/block.c
>> @@ -312,6 +316,157 @@ out:
>> return err ? 0 : 1;
>> }
>>
>> +/*
>> + * Workaround for Toshiba eMMC performance. If the request is less than two
>> + * flash pages in size, then we want to split the write into one or two
>> + * page-aligned writes to take advantage of faster buffering. Here we can
>> + * adjust the size of the MMC request and let the block layer request handler
>> + * deal with generating another MMC request.
>> + */
>> +#define TOSHIBA_MANFID 0x11
>> +#define TOSHIBA_PAGE_SIZE 16 /* sectors */
>> +#define TOSHIBA_ADJUST_THRESHOLD 24 /* sectors */
>> +static bool mmc_adjust_toshiba_write(struct mmc_card *card,
>> + struct mmc_request *mrq)
>> +{
>> + if (mmc_card_mmc(card) && card->cid.manfid == TOSHIBA_MANFID &&
>> + mrq->data->blocks <= TOSHIBA_ADJUST_THRESHOLD) {
>> + int sectors_in_page = TOSHIBA_PAGE_SIZE -
>> + (mrq->cmd->arg % TOSHIBA_PAGE_SIZE);
>> + if (mrq->data->blocks > sectors_in_page) {
>> + mrq->data->blocks = sectors_in_page;
>> + return true;
>> + }
>> + }
>> +
>> + return false;
>> +}
>
> This part might make sense in general, though it's hard to know the
> page size in the general case. For many SD cards, writing naturally
> aligned 64 KB blocks was the ideal case in my testing, but some need
> larger alignment or can deal well with smaller blocks.
>
...which is why I believe this should be a boot per-card parameter,
and that it really only makes sense for embedded parts, where you know
nothing else is going to be used as, say, mmcblk0.
>> +/*
>> + * This is another strange workaround to try to close the gap on Toshiba eMMC
>> + * performance when compared to other vendors. ?In order to take advantage
>> + * of certain optimizations and assumptions in those cards, we will look for
>> + * multiblock write transfers below a certain size and we do the following:
>> + *
>> + * - Break them up into seperate page-aligned (8k flash pages) transfers.
>> + * - Execute the transfers in reverse order.
>> + * - Use "reliable write" transfer mode.
>> + *
>> + * Neither the block I/O layer nor the scatterlist design seem to lend them-
>> + * selves well to executing a block request out of order. ?So instead we let
>> + * mmc_blk_issue_rq() setup the MMC request for the entire transfer and then
>> + * break it up and reorder it here. ?This also requires that we put the data
>> + * into a bounce buffer and send it as individual sg's.
>> + */
>
> A lot of the SD cards I've seen will react very badly to reverse order,
> so that is definitely a dangerous thing to put into the code.
>
> Also, the "reliable write" seems like a really interesting thing to
> rely on for performance. I believe what the card is trying to do here
> is to optimize FAT32 directory updates. By using the small blocks in
> unpredictable order (anything but linear), you tell the card to treat
> this as part of a directory, so it probably gets written in a different
> way, but that might mean that it will try to turn the current erase
> block group into a special small write mode.
>
> I could imagine that this will cause problems on your eMMC once you
> write small blocks to more than erase block group, because that probably
> causes it to start garbage collection -- it makes sense for the cards
> to know that something is a directory, but it can only know about
> a small number of directories, so it will turn the segment into a regular
> one as soon something else becomes a directory.
>
It's difficult for me to argue one way or another. The code provided
is implementing Toshiba's suggestions for mitigating excessive wear.
Basically, as far as certain Android products are concerned, Motorola
created some "typical usage" cases, and collected data logs. These
logs were analyzed by Toshiba, which reported an approx x16
multiplication factor for writes.
Analysis of data written showed that there were many random accesses
with 16KB or 32KB, meaning they go into buffer B. According to T, that
means extra GC and PE cycle. I'm guessing per write.
So T suggested for random data to better go into buffer A. How? Two suggestions.
1) Split smaller accesses into 8KB and write with reliable write.
2) Split smaller accesses into 8KB and write in reverse.
The patch does both and I am verifying if that is really necessary. I
need to go see the mmc spec and what it says about reliable write.
Basically, whatever behavior you choose is going to be wrong some set
of cards. Which is why tuning it probably only makes sense for eMMC
parts, and should be a set of runtime/compile-time quirks. What do you
think?
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-13 0:10 ` Andrei Warkentin
@ 2011-02-13 17:39 ` Arnd Bergmann
-1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-13 17:39 UTC (permalink / raw)
To: Andrei Warkentin; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc
On Sunday 13 February 2011 01:10:09 Andrei Warkentin wrote:
> On Sat, Feb 12, 2011 at 12:37 PM, Arnd Bergmann <arnd@arndb.de> wrote:
>
> > This part might make sense in general, though it's hard to know the
> > page size in the general case. For many SD cards, writing naturally
> > aligned 64 KB blocks was the ideal case in my testing, but some need
> > larger alignment or can deal well with smaller blocks.
> >
>
> ...which is why I believe this should be a boot per-card parameter,
> and that it really only makes sense for embedded parts, where you know
> nothing else is going to be used as, say, mmcblk0.
I don't think it needs to be boot-time; it can easily be run-time
tuneable using sysfs, where you can configure it using an init script
or some other logic from user space.
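To make the run-time idea a little more concrete, here is a minimal userspace sketch of the parsing a sysfs store handler for such a tunable might do. Nothing here is an existing kernel interface: struct split_tunables, its fields and tunable_store() are hypothetical, and the two fields simply stand in for the page size and threshold that the posted patch hardcodes (16 and 24 sectors).

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical per-card tunables, replacing compile-time constants. */
struct split_tunables {
	unsigned long page_sectors;       /* split granularity in sectors */
	unsigned long threshold_sectors;  /* only split writes up to this size */
};

/*
 * Parse one decimal value the way a sysfs store handler would see it:
 * a number, optionally followed by a single newline. On success the
 * value is written to *field and 0 is returned; otherwise *field is
 * left untouched and -EINVAL is returned.
 */
int tunable_store(unsigned long *field, const char *buf)
{
	char *end;
	unsigned long val;

	errno = 0;
	val = strtoul(buf, &end, 10);
	if (errno || end == buf)
		return -EINVAL;
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;
	*field = val;
	return 0;
}
```

An init script would then only need to echo a value into the corresponding attribute for the card it knows is soldered to the board.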
> > I could imagine that this will cause problems on your eMMC once you
> > write small blocks to more than one erase block group, because that probably
> > causes it to start garbage collection -- it makes sense for the cards
> > to know that something is a directory, but it can only know about
> > a small number of directories, so it will turn the segment into a regular
> > one as soon as something else becomes a directory.
> >
>
> It's difficult for me to argue one way or another. The code provided
> is implementing Toshiba's suggestions for mitigating excessive wear.
> Basically, as far as certain Android products are concerned, Motorola
> created some "typical usage" cases, and collected data logs. These
> logs were analyzed by Toshiba, which reported roughly a 16x
> multiplication factor for writes.
Yes, I've seen similar numbers in my measurements. My experience with
the Kingston/Toshiba cards is that they combine two unfortunate
problems:
* Only one 4 MB AU can be open, writing to a different AU waits for
  garbage collection on the old one. Other cards typically have
  five buffers for open AUs, which makes them much easier to work with.
* Only linear access within one AU is fast. Writing to a block with
  a lower address in the same AU causes garbage collection of the AU.
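The two restrictions above can be put into a toy model that counts how often a given write sequence would trigger garbage collection. This is only a sketch of the behavior as described here, not of any real controller: struct card_model and model_write() are hypothetical names, writes are assumed not to cross an AU boundary, and skipping forward inside the open AU is assumed to be free.

```c
#include <stdint.h>

#define AU_SECTORS 8192u   /* 4MB allocation unit in 512-byte sectors */

struct card_model {
	int      has_open_au;  /* 0 until the first write */
	uint32_t open_au;      /* index of the currently open AU */
	uint32_t next_sector;  /* first sector past the last write */
	unsigned gc_count;     /* garbage collections triggered so far */
};

/* Account for one write of count sectors starting at sector. */
void model_write(struct card_model *m, uint32_t sector, uint32_t count)
{
	uint32_t au = sector / AU_SECTORS;

	/* GC if we leave the open AU or step backwards inside it. */
	if (m->has_open_au &&
	    (au != m->open_au || sector < m->next_sector))
		m->gc_count++;

	m->has_open_au = 1;
	m->open_au = au;
	m->next_sector = sector + count;
}
```

Feeding a block trace through model_write() gives a crude lower bound on how many AU garbage collections a workload costs on such a card, which is enough to compare two orderings of the same writes.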
> Analysis of the data written showed that there were many random accesses
> of 16KB or 32KB, meaning they go into buffer B.
I have started a remapping layer that should be able to deal with
this independent of the card, see
https://wiki.linaro.org/WorkingGroups/KernelConsolidation/Projects/FlashDeviceMapper
It's still in the early stages, but maybe something like that will
help you as well.
The real solution would be to have a file system that knows what
accesses are fast and reorders file data accordingly. Right now,
the only thing that is normally fast is FAT32 using 32KB clusters,
and only if the file system is aligned properly.
> According to T, that
> means an extra GC and PE cycle. I'm guessing per write.
Yes.
What is "PE" here?
> So T suggested that random data should go into buffer A instead. How? Two suggestions.
> 1) Split smaller accesses into 8KB and write with reliable write.
> 2) Split smaller accesses into 8KB and write in reverse.
>
> The patch does both and I am verifying if that is really necessary. I
> need to go see the mmc spec and what it says about reliable write.
I should add this to my test tool once I can reproduce it. If it turns
out that other media do the same, we can also trigger the same behavior
for those.
> Basically, whatever behavior you choose is going to be wrong for some set
> of cards. Which is why tuning it probably only makes sense for eMMC
> parts, and should be a set of runtime/compile-time quirks. What do you
> think?
Your explanation makes sense, but I'd definitely favor a run-time solution
over compile-time or boot-time, because it would be much more flexible.
We should also be able to find some optimizations that are universally
good so we can always use them.
Arnd
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-13 17:39 ` Arnd Bergmann
@ 2011-02-14 19:29 ` Andrei Warkentin
-1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-14 19:29 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc
On Sun, Feb 13, 2011 at 11:39 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> I don't think it needs to be boot-time, it can easily be run-time
> tuneable using sysfs, where you can configure it using an init script
> or some other logic from user space.
True, definitely expose the controls through sysfs.
>
> Yes.
>
> What is "PE" here?
>
Ah sorry, I had to look that one up myself, I thought it was the local
jargon associated with the problem space :-). Program/Erase cycle.
>> So T suggested that random data should go into buffer A instead. How? Two suggestions.
>> 1) Split smaller accesses into 8KB and write with reliable write.
>> 2) Split smaller accesses into 8KB and write in reverse.
>>
>> The patch does both and I am verifying if that is really necessary. I
>> need to go see the mmc spec and what it says about reliable write.
>
> I should add this to my test tool once I can reproduce it. If it turns
> out that other media do the same, we can also trigger the same behavior
> for those.
>
As I mentioned, I am checking with T right now on whether we can use
suggestion (1) or suggestion (2), or if they need to be combined. The
documentation we got was open to interpretation and the patch created
from that did both.
You mentioned that writing in reverse is not a good idea. Could you
elaborate why? I would guess because you're always causing a write
into a different AU (on these Toshiba cards), causing extra GC on
every write?
>> Basically, whatever behavior you choose is going to be wrong for some set
>> of cards. Which is why tuning it probably only makes sense for eMMC
>> parts, and should be a set of runtime/compile-time quirks. What do you
>> think?
>
> Your explanation makes sense, but I'd definitely favor a run-time solution
> over compile-time or boot-time, because it would be much more flexible.
> We should also be able to find some optimizations that are universally
> good so we can always use them.
>
Then that's the angle I will pursue. It is the most flexible and then
you don't have to pollute the block driver with little workarounds for
soon-to-be-obsolete hardware. Hopefully I'll have something for
re-review soon.
Thanks again!
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-14 19:29 ` Andrei Warkentin
@ 2011-02-14 20:22 ` Arnd Bergmann
-1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-14 20:22 UTC (permalink / raw)
To: linux-arm-kernel; +Cc: Andrei Warkentin, Linus Walleij, linux-mmc
On Monday 14 February 2011 20:29:59 Andrei Warkentin wrote:
> On Sun, Feb 13, 2011 at 11:39 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>
> Ah sorry, I had to look that one up myself, I thought it was the local
> jargon associated with the problem space :-). Program/Erase cycle.
Ok, makes sense.
> >> So T suggested that random data should go into buffer A instead. How? Two suggestions.
> >> 1) Split smaller accesses into 8KB and write with reliable write.
> >> 2) Split smaller accesses into 8KB and write in reverse.
> >>
> >> The patch does both and I am verifying if that is really necessary. I
> >> need to go see the mmc spec and what it says about reliable write.
> >
> > I should add this to my test tool once I can reproduce it. If it turns
> > out that other media do the same, we can also trigger the same behavior
> > for those.
> >
>
> As I mentioned, I am checking with T right now on whether we can use
> suggestion (1) or suggestion (2), or if they need to be combined. The
> documentation we got was open to interpretation and the patch created
> from that did both.
> You mentioned that writing in reverse is not a good idea. Could you
> elaborate why? I would guess because you're always causing a write
> into a different AU (on these Toshiba cards), causing extra GC on
> every write?
Probably both the reliable write and writing small blocks in reverse
order will cause any card to do something that is different from
what it does on normal 64KB (or larger) aligned accesses.
There are multiple ways this could be implemented:
1. Have one exception cache for all "special" blocks. This would normally
   be for FAT32 subdirectory updates, which always write to the same
   few blocks. This means you can do small writes efficiently anywhere
   on the card, but only up to a (small) fixed number of block addresses.
   If you overflow the table, the card still needs to go through an
   extra PE for each new entry you write, in order to free up an entry.
2. Have a small number of AUs that can be in a special mode with efficient
   small writes but inefficient large writes. This means that when you
   alternate between small and large writes in the same AU, it has to go
   through a PE on every switch. Similarly, if you do small writes to
   more than the maximum number of AUs that can be held in this mode, you
   get the same effect. This number can be as small as one, because that
   is what FAT32 requires.
In both cases, you don't actually have a solution for the problem, you just
make it less likely for specific workloads.
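The first of the two guesses above, a single exception cache with a fixed number of entries, is easy to model. The sketch below is purely illustrative: the table size of four, the round-robin eviction and the names struct exc_cache and small_write() are all assumptions, since real controllers do not document any of this.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_ENTRIES 4   /* assumed; real table sizes are not documented */

struct exc_cache {
	uint32_t addr[CACHE_ENTRIES];  /* block addresses held in the cache */
	size_t   used;                 /* number of valid entries */
	size_t   next_victim;          /* round-robin eviction (an assumption) */
	unsigned extra_pe;             /* extra program/erase cycles incurred */
};

/* Account for one small write to the given block address. */
void small_write(struct exc_cache *c, uint32_t block)
{
	size_t i;

	for (i = 0; i < c->used; i++)
		if (c->addr[i] == block)
			return;            /* already cached: cheap update */

	if (c->used < CACHE_ENTRIES) {
		c->addr[c->used++] = block;  /* free slot: still cheap */
		return;
	}

	/* Table full: evicting an entry costs one extra PE cycle. */
	c->addr[c->next_victim] = block;
	c->next_victim = (c->next_victim + 1) % CACHE_ENTRIES;
	c->extra_pe++;
}
```

As long as the working set of small-write addresses fits in the table, extra_pe stays at zero. Once it does not, nearly every new address costs a PE cycle, which matches the point that such a scheme only makes the problem less likely for specific workloads.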
Arnd
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-14 20:22 ` Arnd Bergmann
@ 2011-02-14 22:25 ` Andrei Warkentin
-1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-14 22:25 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc
On Mon, Feb 14, 2011 at 2:22 PM, Arnd Bergmann <arnd@arndb.de> wrote:
>>
>> As I mentioned, I am checking with T right now on whether we can use
>> suggestion (1) or suggestion (2), or if they need to be combined. The
>> documentation we got was open to interpretation and the patch created
>> from that did both.
>> You mentioned that writing in reverse is not a good idea. Could you
>> elaborate why? I would guess because you're always causing a write
>> into a different AU (on these Toshiba cards), causing extra GC on
>> every write?
>
> Probably both the reliable write and writing small blocks in reverse
> order will cause any card to do something that is different from
> what it does on normal 64KB (or larger) aligned accesses.
>
> There are multiple ways this could be implemented:
>
> 1. Have one exception cache for all "special" blocks. This would normally
>    be for FAT32 subdirectory updates, which always write to the same
>    few blocks. This means you can do small writes efficiently anywhere
>    on the card, but only up to a (small) fixed number of block addresses.
>    If you overflow the table, the card still needs to go through an
>    extra PE for each new entry you write, in order to free up an entry.
>
> 2. Have a small number of AUs that can be in a special mode with efficient
>    small writes but inefficient large writes. This means that when you
>    alternate between small and large writes in the same AU, it has to go
>    through a PE on every switch. Similarly, if you do small writes to
>    more than the maximum number of AUs that can be held in this mode, you
>    get the same effect. This number can be as small as one, because that
>    is what FAT32 requires.
>
> In both cases, you don't actually have a solution for the problem, you just
> make it less likely for specific workloads.
Aha, ok. By the way, I did find out that either suggestion works. So
I'll pull out the reversing portion of the patch. No need to
overcomplicate :).
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-14 22:25 ` Andrei Warkentin
@ 2011-02-15 17:16 ` Arnd Bergmann
-1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-15 17:16 UTC (permalink / raw)
To: Andrei Warkentin; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc
On Monday 14 February 2011, Andrei Warkentin wrote:
> > There are multiple ways this could be implemented:
> >
> > 1. Have one exception cache for all "special" blocks. This would normally
> >    be for FAT32 subdirectory updates, which always write to the same
> >    few blocks. This means you can do small writes efficiently anywhere
> >    on the card, but only up to a (small) fixed number of block addresses.
> >    If you overflow the table, the card still needs to go through an
> >    extra PE for each new entry you write, in order to free up an entry.
> >
> > 2. Have a small number of AUs that can be in a special mode with efficient
> >    small writes but inefficient large writes. This means that when you
> >    alternate between small and large writes in the same AU, it has to go
> >    through a PE on every switch. Similarly, if you do small writes to
> >    more than the maximum number of AUs that can be held in this mode, you
> >    get the same effect. This number can be as small as one, because that
> >    is what FAT32 requires.
> >
> > In both cases, you don't actually have a solution for the problem, you just
> > make it less likely for specific workloads.
>
> Aha, ok. By the way, I did find out that either suggestion works. So
> I'll pull out the reversing portion of the patch. No need to
> overcomplicate :).
BTW, what file system are you using? I could imagine that each of ext4, btrfs
and nilfs2 give you very different results here. It could be that if your
patch is optimizing for one file system, it is actually pessimising for
another one.
What benchmark do you use to find out if your optimizations actually help you?
Arnd
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-15 17:16 ` Arnd Bergmann
@ 2011-02-17 2:08 ` Andrei Warkentin
-1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-17 2:08 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc
On Tue, Feb 15, 2011 at 11:16 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Monday 14 February 2011, Andrei Warkentin wrote:
>> > There are multiple ways how this could be implemented:
>> >
>> > 1. Have one exception cache for all "special" blocks. This would normally
>> > be for FAT32 subdirectory updates, which always write to the same
>> > few blocks. This means you can do small writes efficiently anywhere
>> > on the card, but only up to a (small) fixed number of block addresses.
>> > If you overflow the table, the card still needs to go through an
>> > extra PE for each new entry you write, in order to free up an entry.
>> >
>> > 2. Have a small number of AUs that can be in a special mode with efficient
>> > small writes but inefficient large writes. This means that when you
>> > alternate between small and large writes in the same AU, it has to go
>> > through a PE on every switch. Similarly, if you do small writes to
>> > more than the maximum number of AUs that can be held in this mode, you
>> > get the same effect. This number can be as small as one, because that
>> > is what FAT32 requires.
>> >
>> > In both cases, you don't actually have a solution for the problem, you just
>> > make it less likely for specific workloads.
>>
>> Aha, ok. By the way, I did find out that either suggestion works. So
>> I'll pull out the reversing portion of the patch. No need to
>> overcomplicate :).
>
> BTW, what file system are you using? I could imagine that each of ext4, btrfs
> and nilfs2 give you very different results here. It could be that if your
> patch is optimizing for one file system, it is actually pessimising for
> another one.
>
Ext4. I've actually been rewriting the patch a lot and it's taking
time because there are a lot of things that are wrong in it (so I feel
kinda bad for forwarding it to this list in the first place...). I've
already mentioned that there is no need to reorder, so that's going
away and it simplifies everything greatly.
I agree, which is why all of this is controlled now through sysfs, and
there are no more hard-coded checks for manfid, mmc versus sd or any
other magic. There is a page_size_secs attribute, through which you
can notify of the page size for the device. The workaround for small
writes crossing the page boundary (and winding up in Buffer B, instead
of A) is turned on by setting split_tlow and split_thigh, which
provide a threshold range in sectors over which the writes will
be split/aligned. The second workaround for splitting larger requests
and writing them with reliable write (to avoid getting coalesced and
winding up in Buffer B again) is controlled through split_relw_tlow
and split_relw_thigh. Do you think there is a better way? Or is this
good enough?
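
For illustration only, the first workaround (splitting small writes so that no piece crosses a page boundary) could be sketched as below. This is not the actual patch: the function name and the inclusive threshold check are assumptions, and only page_size_secs, split_tlow and split_thigh are taken from the description above.

```python
def split_write(start, nr, page_size_secs, split_tlow, split_thigh):
    """Split a write of `nr` sectors starting at `start` on page boundaries.

    Hypothetical helper: requests whose size lies outside the inclusive
    range [split_tlow, split_thigh] are passed through unchanged.
    """
    if not (split_tlow <= nr <= split_thigh):
        return [(start, nr)]
    chunks = []
    end = start + nr
    while start < end:
        # Next page boundary after `start`, capped at the end of the request.
        boundary = (start // page_size_secs + 1) * page_size_secs
        length = min(boundary, end) - start
        chunks.append((start, length))
        start += length
    return chunks
```

With a 16-sector page, a 16-sector write starting at sector 24 would be issued as two 8-sector writes, while the same write starting at sector 32 is already aligned and stays whole.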
So, as I mentioned before, T had done some tests given data provided
by M, and then T verified that this fix was good. I need to do my own
tests on the patch after I rewrite it. Is iozone the best tool I can
use? So far I have a MMC logging facility through connector that I use
to collect stats (useful for seeing how fs traffic translates to
actual mmc commands...once I clean it up I'll push here for RFC). What
about the tool you're writing? Any way I can use it?
^ permalink raw reply [flat|nested] 117+ messages in thread
* MMC quirks relating to performance/lifetime.
@ 2011-02-17 2:08 ` Andrei Warkentin
0 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-17 2:08 UTC (permalink / raw)
To: linux-arm-kernel
On Tue, Feb 15, 2011 at 11:16 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Monday 14 February 2011, Andrei Warkentin wrote:
>> > There are multiple ways how this could be implemented:
>> >
>> > 1. Have one exception cache for all "special" blocks. This would normally
>> > be for FAT32 subdirectory updates, which always write to the same
>> > few blocks. This means you can do small writes efficiently anywhere
>> > on the card, but only up to a (small) fixed number of block addresses.
>> > If you overflow the table, the card still needs to go through an
>> > extra PE for each new entry you write, in order to free up an entry.
>> >
>> > 2. Have a small number of AUs that can be in a special mode with efficient
>> > small writes but inefficient large writes. This means that when you
>> > alternate between small and large writes in the same AU, it has to go
>> > through a PE on every switch. Similarly, if you do small writes to
>> > more than the maximum number of AUs that can be held in this mode, you
>> > get the same effect. This number can be as small as one, because that
>> > is what FAT32 requires.
>> >
>> > In both cases, you don't actually have a solution for the problem, you just
>> > make it less likely for specific workloads.
>>
>> Aha, ok. By the way, I did find out that either suggestion works. So
>> I'll pull out the reversing portion of the patch. No need to
>> overcomplicate :).
>
> BTW, what file system are you using? I could imagine that each of ext4, btrfs
> and nilfs2 give you very different results here. It could be that if your
> patch is optimizing for one file system, it is actually pessimising for
> another one.
>
Ext4. I've actually been rewriting the patch a lot and it's taking
time because there are a lot of things that are wrong in it (so I feel
kinda bad for forwarding it to this list in the first place...). I've
already mentioned that there is no need to reorder, so that's going
away and it simplifies everything greatly.
I agree, which is why all of this is controlled now through sysfs, and
there are no more hard-coded checks for manfid, mmc versus sd or any
other magic. There is a page_size_secs attribute, through which you
can notify of the page size for the device. The workaround for small
writes crossing the page boundary (and winding up in Buffer B, instead
of A) is turned on by setting split_tlow and split_thigh, which
provide a threshold range in sectors over which the writes will
be split/aligned. The second workaround for splitting larger requests
and writing them with reliable write (to avoid getting coalesced and
winding up in Buffer B again) is controlled through split_relw_tlow
and split_relw_thigh. Do you think there is a better way? Or is this
good enough?
So, as I mentioned before, T had done some tests given data provided
by M, and then T verified that this fix was good. I need to do my own
tests on the patch after I rewrite it. Is iozone the best tool I can
use? So far I have a MMC logging facility through connector that I use
to collect stats (useful for seeing how fs traffic translates to
actual mmc commands...once I clean it up I'll push here for RFC). What
about the tool you're writing? Any way I can use it?
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-17 2:08 ` Andrei Warkentin
@ 2011-02-17 15:47 ` Arnd Bergmann
-1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-17 15:47 UTC (permalink / raw)
To: Andrei Warkentin; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc
On Thursday 17 February 2011, Andrei Warkentin wrote:
> Ext4.
Ok, I see. I haven't really done this kind of tests before, but my
feeling is that ext3/ext4 may be much worse than the alternatives
at the moment. It would certainly be worthwhile to do tests using
nilfs2 and btrfs, whose default behaviour matches the requirements
of your eMMC flash much better, and see how they perform with and
without your patch.
> I agree, which is why all of this is controlled now through sysfs, and
> there are no more hard-coded checks for manfid, mmc versus sd or any
> other magic. There is a page_size_secs attribute, through which you
> can notify of the page size for the device.
How about making that just page_size in bytes? Sectors don't always
mean 512 bytes, so this would be both shorter and less ambiguous.
> The workaround for small
> writes crossing the page boundary (and winding up in Buffer B, instead
> of A) is turned on by setting split_tlow and split_thigh, which
> provide a threshold range in sectors over which the writes will
> be split/aligned. The second workaround for splitting larger requests
> and writing them with reliable write (to avoid getting coalesced and
> winding up in Buffer B again) is controlled through split_relw_tlow
> and split_relw_thigh. Do you think there is a better way? Or is this
> good enough?
I think I'd try to reduce the number of sysfs files needed for this.
What are the values you would typically set here?
My feeling is that separating unaligned page writes from full pages
or multiples of pages could always be beneficial for all cards, or at
least harmless, but that will require more measurements.
Whether to do the reliable write or not could be a simple flag
if the numbers are the same.
> So, as I mentioned before, T had done some tests given data provided
> by M, and then T verified that this fix was good. I need to do my own
> tests on the patch after I rewrite it. Is iozone the best tool I can
> use? So far I have a MMC logging facility through connector that I use
> to collect stats (useful for seeing how fs traffic translates to
> actual mmc commands...once I clean it up I'll push here for RFC). What
> about the tool you're writing? Any way I can use it?
It's now available in an early almost-usable version at
git://git.linaro.org/people/arnd/flashbench.git
I don't have a test for the second buffer yet, but it would be
good to know some of the other characteristics of your eMMC drive.
Please try some of these commands:
flashbench -a /dev/mmcblk0 --blocksize=1024
flashbench --open-au --open-au-nr=1 /dev/mmcblk0 --blocksize=512
flashbench --open-au --open-au-nr=1 /dev/mmcblk0 --blocksize=512 --random
flashbench --open-au --open-au-nr=2 /dev/mmcblk0 --blocksize=512
flashbench --open-au --open-au-nr=2 /dev/mmcblk0 --blocksize=512 --random
flashbench --open-au --open-au-nr=3 /dev/mmcblk0 --blocksize=512
flashbench --open-au --open-au-nr=3 /dev/mmcblk0 --blocksize=512 --random
Note that the --open-au test will overwrite your data. You can do it on a
partition you don't use, but it needs to be aligned to 4 MB.
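
A quick way to check that alignment from the partition's sysfs `start` value is sketched below. The helper name is made up for illustration, and it assumes that `start` is reported in 512-byte units and that the allocation unit is 4 MiB.

```python
SECTOR_BYTES = 512            # assumption: sysfs `start` is in 512-byte units
AU_BYTES = 4 * 1024 * 1024    # assumption: 4 MiB allocation unit


def is_au_aligned(start_sectors, au_bytes=AU_BYTES):
    """Return True if a partition starting at `start_sectors` begins on an AU boundary."""
    return (start_sectors * SECTOR_BYTES) % au_bytes == 0
```

Under these assumptions a partition starting at sector 8192 is aligned, while one starting at sector 4096 sits in the middle of an AU.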
Arnd
^ permalink raw reply [flat|nested] 117+ messages in thread
* MMC quirks relating to performance/lifetime.
@ 2011-02-17 15:47 ` Arnd Bergmann
0 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-17 15:47 UTC (permalink / raw)
To: linux-arm-kernel
On Thursday 17 February 2011, Andrei Warkentin wrote:
> Ext4.
Ok, I see. I haven't really done this kind of tests before, but my
feeling is that ext3/ext4 may be much worse than the alternatives
at the moment. It would certainly be worthwhile to do tests using
nilfs2 and btrfs, whose default behaviour matches the requirements
of your eMMC flash much better, and see how they perform with and
without your patch.
> I agree, which is why all of this is controlled now through sysfs, and
> there are no more hard-coded checks for manfid, mmc versus sd or any
> other magic. There is a page_size_secs attribute, through which you
> can notify of the page size for the device.
How about making that just page_size in bytes? Sectors don't always
mean 512 bytes, so this would be both shorter and less ambiguous.
> The workaround for small
> writes crossing the page boundary (and winding up in Buffer B, instead
> of A) is turned on by setting split_tlow and split_thigh, which
> provide a threshold range in sectors over which the writes will
> be split/aligned. The second workaround for splitting larger requests
> and writing them with reliable write (to avoid getting coalesced and
> winding up in Buffer B again) is controlled through split_relw_tlow
> and split_relw_thigh. Do you think there is a better way? Or is this
> good enough?
I think I'd try to reduce the number of sysfs files needed for this.
What are the values you would typically set here?
My feeling is that separating unaligned page writes from full pages
or multiples of pages could always be beneficial for all cards, or at
least harmless, but that will require more measurements.
Whether to do the reliable write or not could be a simple flag
if the numbers are the same.
> So, as I mentioned before, T had done some tests given data provided
> by M, and then T verified that this fix was good. I need to do my own
> tests on the patch after I rewrite it. Is iozone the best tool I can
> use? So far I have a MMC logging facility through connector that I use
> to collect stats (useful for seeing how fs traffic translates to
> actual mmc commands...once I clean it up I'll push here for RFC). What
> about the tool you're writing? Any way I can use it?
It's now available in an early almost-usable version at
git://git.linaro.org/people/arnd/flashbench.git
I don't have a test for the second buffer yet, but it would be
good to know some of the other characteristics of your eMMC drive.
Please try some of these commands:
flashbench -a /dev/mmcblk0 --blocksize=1024
flashbench --open-au --open-au-nr=1 /dev/mmcblk0 --blocksize=512
flashbench --open-au --open-au-nr=1 /dev/mmcblk0 --blocksize=512 --random
flashbench --open-au --open-au-nr=2 /dev/mmcblk0 --blocksize=512
flashbench --open-au --open-au-nr=2 /dev/mmcblk0 --blocksize=512 --random
flashbench --open-au --open-au-nr=3 /dev/mmcblk0 --blocksize=512
flashbench --open-au --open-au-nr=3 /dev/mmcblk0 --blocksize=512 --random
Note that the --open-au test will overwrite your data. You can do it on a
partition you don't use, but it needs to be aligned to 4 MB.
Arnd
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-11 22:33 ` Andrei Warkentin
@ 2011-02-18 1:10 ` Andrei Warkentin
-1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-18 1:10 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc
On Fri, Feb 11, 2011 at 4:33 PM, Andrei Warkentin <andreiw@motorola.com> wrote:
> Arnd,
>
> Yes, this is a Toshiba card. I've sent the patch as a reply to Linus' email.
>
> cid - 02010053454d3332479070cc51451d00
> csd - d00f00320f5903ffffffffff92404000
> erase_size - 524288
> fwrev - 0x0
> hwrev - 0x0
> manfid - 0x000002
> name - SEM32G
> oemid - 0x0100
> preferred_erase_size - 2097152
>
Ok. Big mistake. Sorry about that. This card is a Sandisk card. I got
confused over all the manfids changing.
Here is the Toshiba card:
cid - 1101004d4d4333324703101a17746d00
csd - 900e00320f5903ffffffffe796400000
erase_size - 524288
fwrev - 0x0
hwrev - 0x0
manfid - 0x000011
name - MMC32G
oemid - 0x0100
preferred_erase_size - 4194304
I'll get you the flashbench timings for both.
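
As a cross-check, the manfid, oemid and name values can be read straight out of the cid strings. The sketch below assumes the byte layout that matches the sysfs values quoted in this message (manufacturer ID in the first byte, a 16-bit OEM field next, then a 6-character product name); the function name is made up, and SD cards lay out the CID differently, so this is not a general decoder.

```python
def decode_mmc_cid(cid_hex):
    """Decode the leading fields of an MMC CID given as 32 hex digits.

    Assumed layout (matches the sysfs values in this thread):
      byte 0      manufacturer ID
      bytes 1-2   OEM ID as a 16-bit big-endian value
      bytes 3-8   product name, 6 ASCII characters
    """
    raw = bytes.fromhex(cid_hex)
    if len(raw) != 16:
        raise ValueError("CID must be 16 bytes (32 hex digits)")
    return {
        "manfid": raw[0],
        "oemid": int.from_bytes(raw[1:3], "big"),
        "name": raw[3:9].decode("ascii"),
    }
```

Fed the two cid strings listed in this message, it yields manfid 0x11 / name MMC32G for the Toshiba part and manfid 0x02 / name SEM32G for the Sandisk part, with oemid 0x0100 in both cases.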
^ permalink raw reply [flat|nested] 117+ messages in thread
* MMC quirks relating to performance/lifetime.
@ 2011-02-18 1:10 ` Andrei Warkentin
0 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-18 1:10 UTC (permalink / raw)
To: linux-arm-kernel
On Fri, Feb 11, 2011 at 4:33 PM, Andrei Warkentin <andreiw@motorola.com> wrote:
> Arnd,
>
> Yes, this is a Toshiba card. I've sent the patch as a reply to Linus' email.
>
> cid - 02010053454d3332479070cc51451d00
> csd - d00f00320f5903ffffffffff92404000
> erase_size - 524288
> fwrev - 0x0
> hwrev - 0x0
> manfid - 0x000002
> name - SEM32G
> oemid - 0x0100
> preferred_erase_size - 2097152
>
Ok. Big mistake. Sorry about that. This card is a Sandisk card. I got
confused over all the manfids changing.
Here is the Toshiba card:
cid - 1101004d4d4333324703101a17746d00
csd - 900e00320f5903ffffffffe796400000
erase_size - 524288
fwrev - 0x0
hwrev - 0x0
manfid - 0x000011
name - MMC32G
oemid - 0x0100
preferred_erase_size - 4194304
I'll get you the flashbench timings for both.
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-18 1:10 ` Andrei Warkentin
@ 2011-02-18 13:44 ` Arnd Bergmann
-1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-18 13:44 UTC (permalink / raw)
To: linux-arm-kernel; +Cc: Andrei Warkentin, Linus Walleij, linux-mmc
On Friday 18 February 2011, Andrei Warkentin wrote:
> On Fri, Feb 11, 2011 at 4:33 PM, Andrei Warkentin <andreiw@motorola.com> wrote:
> > Arnd,
> >
> > Yes, this is a Toshiba card. I've sent the patch as a reply to Linus' email.
> >
> > cid - 02010053454d3332479070cc51451d00
> > csd - d00f00320f5903ffffffffff92404000
> > erase_size - 524288
> > fwrev - 0x0
> > hwrev - 0x0
> > manfid - 0x000002
> > name - SEM32G
> > oemid - 0x0100
> > preferred_erase_size - 2097152
> >
>
> Ok. Big mistake. Sorry about that. This card is a Sandisk card. I got
> confused over all the manfids changing.
>
> Here is the Toshiba card:
>
> cid - 1101004d4d4333324703101a17746d00
> csd - 900e00320f5903ffffffffe796400000
> erase_size - 524288
> fwrev - 0x0
> hwrev - 0x0
> manfid - 0x000011
> name - MMC32G
> oemid - 0x0100
> preferred_erase_size - 4194304
>
> I'll get you the flashbench timings for both.
I'm curious. Neither the manfid nor the oemid fields of either card
match what I have seen on SD cards, I would expect them to be
Sandisk: manfid 0x000003, oemid 0x5344
Toshiba: manfid 0x000002, oemid 0x544d
I have not actually seen any Toshiba SD cards, but I assume that they
use the same controllers as Kingston.
Does anyone know if the IDs have any correlation between MMC and SD
controllers?
Arnd
^ permalink raw reply [flat|nested] 117+ messages in thread
* MMC quirks relating to performance/lifetime.
@ 2011-02-18 13:44 ` Arnd Bergmann
0 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-18 13:44 UTC (permalink / raw)
To: linux-arm-kernel
On Friday 18 February 2011, Andrei Warkentin wrote:
> On Fri, Feb 11, 2011 at 4:33 PM, Andrei Warkentin <andreiw@motorola.com> wrote:
> > Arnd,
> >
> > Yes, this is a Toshiba card. I've sent the patch as a reply to Linus' email.
> >
> > cid - 02010053454d3332479070cc51451d00
> > csd - d00f00320f5903ffffffffff92404000
> > erase_size - 524288
> > fwrev - 0x0
> > hwrev - 0x0
> > manfid - 0x000002
> > name - SEM32G
> > oemid - 0x0100
> > preferred_erase_size - 2097152
> >
>
> Ok. Big mistake. Sorry about that. This card is a Sandisk card. I got
> confused over all the manfids changing.
>
> Here is the Toshiba card:
>
> cid - 1101004d4d4333324703101a17746d00
> csd - 900e00320f5903ffffffffe796400000
> erase_size - 524288
> fwrev - 0x0
> hwrev - 0x0
> manfid - 0x000011
> name - MMC32G
> oemid - 0x0100
> preferred_erase_size - 4194304
>
> I'll get you the flashbench timings for both.
I'm curious. Neither the manfid nor the oemid fields of either card
match what I have seen on SD cards, I would expect them to be
Sandisk: manfid 0x000003, oemid 0x5344
Toshiba: manfid 0x000002, oemid 0x544d
I have not actually seen any Toshiba SD cards, but I assume that they
use the same controllers as Kingston.
Does anyone know if the IDs have any correlation between MMC and SD
controllers?
Arnd
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-18 13:44 ` Arnd Bergmann
@ 2011-02-18 19:47 ` Andrei Warkentin
-1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-18 19:47 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc
On Fri, Feb 18, 2011 at 7:44 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> I'm curious. Neither the manfid nor the oemid fields of either card
> match what I have seen on SD cards, I would expect them to be
>
> Sandisk: manfid 0x000003, oemid 0x5344
> Toshiba: manfid 0x000002, oemid 0x544d
>
> I have not actually seen any Toshiba SD cards, but I assume that they
> use the same controllers as Kingston.
>
> Does anyone know if the IDs have any correlation between MMC and SD
> controllers?
>
> Arnd
>
I'm unsure about the older scheme (assigned by MMCA), but ever since
MMC is now JEDEC-controlled, the IDs have changed. Sandisk's new id
will be 0x45, and Toshiba I guess will be 0x11.
^ permalink raw reply [flat|nested] 117+ messages in thread
* MMC quirks relating to performance/lifetime.
@ 2011-02-18 19:47 ` Andrei Warkentin
0 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-18 19:47 UTC (permalink / raw)
To: linux-arm-kernel
On Fri, Feb 18, 2011 at 7:44 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> I'm curious. Neither the manfid nor the oemid fields of either card
> match what I have seen on SD cards, I would expect them to be
>
> Sandisk: manfid 0x000003, oemid 0x5344
> Toshiba: manfid 0x000002, oemid 0x544d
>
> I have not actually seen any Toshiba SD cards, but I assume that they
> use the same controllers as Kingston.
>
> Does anyone know if the IDs have any correlation between MMC and SD
> controllers?
>
> Arnd
>
I'm unsure about the older scheme (assigned by MMCA), but ever since
MMC is now JEDEC-controlled, the IDs have changed. Sandisk's new id
will be 0x45, and Toshiba I guess will be 0x11.
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-18 19:47 ` Andrei Warkentin
@ 2011-02-18 22:40 ` Andrei Warkentin
-1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-18 22:40 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc
[-- Attachment #1: Type: text/plain, Size: 2014 bytes --]
On Fri, Feb 18, 2011 at 1:47 PM, Andrei Warkentin <andreiw@motorola.com> wrote:
> On Fri, Feb 18, 2011 at 7:44 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>> I'm curious. Neither the manfid nor the oemid fields of either card
>> match what I have seen on SD cards, I would expect them to be
>>
>> Sandisk: manfid 0x000003, oemid 0x5344
>> Toshiba: manfid 0x000002, oemid 0x544d
>>
>> I have not actually seen any Toshiba SD cards, but I assume that they
>> use the same controllers as Kingston.
>>
>> Does anyone know if the IDs have any correlation between MMC and SD
>> controllers?
>>
>> Arnd
>>
>
> I'm unsure about the older scheme (assigned by MMCA), but ever since
> MMC is now JEDEC-controlled, the IDs have changed. Sandisk's new id
> will be 0x45, and Toshiba I guess will be 0x11.
>
Flashbench timings for both Sandisk and Toshiba cards. Attaching due to size.
Some interesting things that I don't understand. For the align test, I
extended it to do a write align test (-A). I tried two partitions that
I could write over, and both read and writes behaved differently for
the two partitions on same device. Odd. They are both 4MB aligned.
On the sandisk it was the write align that made the page size stand
out. The read align had pretty constant results.
On the toshiba the results varied wildly for the two partitions. For
partition 6, there was a clear pattern in the diff values for read
align. For 9, it was all over the place. For 9 with the write align,
the writes crossing 8K and 16K took ~115ms!! Look in the attached files
for all the data.
The AU tests were interesting too, especially how with several open
AUs the throughput is higher for certain smaller sizes on sandisk, but
if I interpret it correctly both cards have at least 4 AUs, as I
didn't yet see a significant drop for small sizes. The larger ones I
am running now on mmcblk0p9, which is sufficiently large for these
tests... (mmcblk0p6 is only 40mb, p9 is 314 mb)
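
One way to read the align results is to look for the smallest alignment at which the boundary-crossing penalty ("diff") is still large. The sketch below does that for the Sandisk mmcblk0p6 write-align run from the attached sandisk.txt; the helper name and the 500µs cut-off are arbitrary choices made for illustration, and treating the result as the page size is an interpretation, not something flashbench reports.

```python
def smallest_costly_alignment(results, threshold_us):
    """Return the smallest alignment whose diff is at least `threshold_us`.

    `results` is a list of (alignment_bytes, diff_us) pairs; returns None
    if no alignment reaches the threshold.
    """
    costly = [align for align, diff_us in results if diff_us >= threshold_us]
    return min(costly) if costly else None


# Sandisk mmcblk0p6, write align (-A): alignment in bytes, diff in microseconds.
sandisk_p6_write = [
    (524288, 653), (262144, 723), (131072, 729), (65536, 805), (32768, 710),
    (16384, 701), (8192, 55), (4096, 51.9), (2048, 68.7),
]

print(smallest_costly_alignment(sandisk_p6_write, 500))
```

For this data set the penalty stays above 500µs down to 16384 bytes and drops by an order of magnitude below that, which is what makes the page size stand out.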
Thanks,
A
[-- Attachment #2: toshiba.txt --]
[-- Type: text/plain, Size: 5447 bytes --]
/data # cat /sys/block/mmcblk0/device/block/mmcblk0/mmcblk0p9/start
643072
/data # cat /sys/block/mmcblk0/device/block/mmcblk0/mmcblk0p9/size
346112
/data # cat /sys/block/mmcblk0/device/block/mmcblk0/mmcblk0p6/start
77824
/data # cat /sys/block/mmcblk0/device/block/mmcblk0/mmcblk0p6/size
24576
# ./flashbench -a -b 1024 /dev/block/mmcblk0p6
align 524288 pre 613µs on 801µs post 570µs diff 210µs
align 262144 pre 739µs on 988µs post 767µs diff 235µs
align 131072 pre 740µs on 990µs post 767µs diff 236µs
align 65536 pre 749µs on 998µs post 767µs diff 240µs
align 32768 pre 761µs on 992µs post 746µs diff 238µs
align 16384 pre 755µs on 982µs post 755µs diff 227µs
align 8192 pre 748µs on 750µs post 748µs diff 1.94µs
align 4096 pre 747µs on 749µs post 747µs diff 1.41µs
align 2048 pre 747µs on 747µs post 748µs diff -93ns
# ./flashbench -a -b 1024 /dev/block/mmcblk0p9
align 8388608 pre 527µs on 743µs post 476µs diff 242µs
align 4194304 pre 544µs on 730µs post 543µs diff 187µs
align 2097152 pre 551µs on 714µs post 485µs diff 196µs
align 1048576 pre 742µs on 864µs post 745µs diff 120µs
align 524288 pre 760µs on 822µs post 789µs diff 47.9µs
align 262144 pre 760µs on 816µs post 789µs diff 42µs
align 131072 pre 760µs on 822µs post 789µs diff 47.8µs
align 65536 pre 758µs on 821µs post 789µs diff 48µs
align 32768 pre 771µs on 828µs post 760µs diff 62.7µs
align 16384 pre 672µs on 939µs post 771µs diff 217µs
align 8192 pre 668µs on 806µs post 671µs diff 136µs
align 4096 pre 671µs on 672µs post 670µs diff 1.5µs
align 2048 pre 671µs on 670µs post 671µs diff -859ns
# ./flashbench -A -b 1024 /dev/block/mmcblk0p6
write align 524288 pre 3.59ms on 6.74ms post 3.73ms diff 3.08ms
write align 262144 pre 3.69ms on 7.11ms post 3.69ms diff 3.42ms
write align 131072 pre 3.71ms on 17.4ms post 3.72ms diff 13.7ms
write align 65536 pre 3.72ms on 7.18ms post 3.52ms diff 3.56ms
write align 32768 pre 3.73ms on 11.9ms post 3.7ms diff 8.24ms
write align 16384 pre 3.93ms on 5.01ms post 4.6ms diff 745µs
write align 8192 pre 4.9ms on 4.89ms post 4.87ms diff 4.77µs
write align 4096 pre 5.03ms on 5.02ms post 5.01ms diff -437ns
write align 2048 pre 5.08ms on 5.08ms post 5.06ms diff 12.3µs
# ./flashbench -A -b 1024 /dev/block/mmcblk0p9
write align 8388608 pre 3.76ms on 7.07ms post 4.05ms diff 3.16ms
write align 4194304 pre 3.62ms on 6.5ms post 3.63ms diff 2.88ms
write align 2097152 pre 3.91ms on 6.84ms post 3.7ms diff 3.04ms
write align 1048576 pre 3.88ms on 6.96ms post 3.96ms diff 3.04ms
write align 524288 pre 3.93ms on 7.07ms post 4.05ms diff 3.08ms
write align 262144 pre 3.94ms on 7.07ms post 4.05ms diff 3.07ms
write align 131072 pre 3.95ms on 7.05ms post 4.05ms diff 3.05ms
write align 65536 pre 3.94ms on 7.07ms post 4.05ms diff 3.07ms
write align 32768 pre 3.95ms on 7.07ms post 4.04ms diff 3.07ms
write align 16384 pre 4.48ms on 117ms post 3.81ms diff 113ms
write align 8192 pre 3.61ms on 114ms post 3.58ms diff 110ms
write align 4096 pre 3.88ms on 3.87ms post 3.86ms diff 1.87µs
write align 2048 pre 3.88ms on 3.89ms post 3.89ms diff 3.11µs
./flashbench -O -0 1 -b 512 /dev/block/mmcblk0p6
4MiB 7.17M/s
2MiB 7.91M/s
1MiB 9.23M/s
512KiB 10.3M/s
256KiB 10.5M/s
128KiB 10.4M/s
64KiB 9.81M/s
32KiB 9.09M/s
16KiB 3.71M/s
8KiB 1.73M/s
4KiB 845K/s
2KiB 418K/s
1KiB 208K/s
512B 103K/s
./flashbench -O -0 1 -r -b 512 /dev/block/mmcblk0p6
4MiB 6.58M/s
2MiB 7.98M/s
1MiB 9.33M/s
512KiB 10.4M/s
256KiB 10.9M/s
128KiB 10.5M/s
64KiB 9.94M/s
32KiB 9.11M/s
16KiB 3.72M/s
8KiB 1.75M/s
4KiB 853K/s
2KiB 419K/s
1KiB 207K/s
512B 102K/s
./flashbench -O -0 2 -b 512 /dev/block/mmcblk0p6
4MiB 8.95M/s
2MiB 9.44M/s
1MiB 10.3M/s
512KiB 10.9M/s
256KiB 10.8M/s
128KiB 10.5M/s
64KiB 9.91M/s
32KiB 8.79M/s
16KiB 3.65M/s
8KiB 1.75M/s
4KiB 851K/s
2KiB 419K/s
1KiB 208K/s
512B 103K/s
./flashbench -O -0 2 -r -b 512 /dev/block/mmcblk0p6
4MiB 9.06M/s
2MiB 9.68M/s
1MiB 10.3M/s
512KiB 10.5M/s
256KiB 9.94M/s
128KiB 10.1M/s
64KiB 9.41M/s
32KiB 7.99M/s
16KiB 3.5M/s
8KiB 1.64M/s
4KiB 798K/s
2KiB 393K/s
1KiB 196K/s
512B 96.5K/s
./flashbench -O -0 3 -b 512 /dev/block/mmcblk0p6
4MiB 8.07M/s
2MiB 9.07M/s
1MiB 9.88M/s
512KiB 10.1M/s
256KiB 10M/s
128KiB 9.83M/s
64KiB 8.68M/s
32KiB 7.1M/s
16KiB 3.09M/s
8KiB 1.49M/s
4KiB 726K/s
2KiB 357K/s
1KiB 178K/s
512B 88.5K/s
./flashbench -O -0 3 -r -b 512 /dev/block/mmcblk0p6
4MiB 8.12M/s
2MiB 9.28M/s
1MiB 9.83M/s
512KiB 10M/s
256KiB 9.97M/s
128KiB 9.91M/s
64KiB 8.9M/s
32KiB 7.3M/s
16KiB 3.2M/s
8KiB 1.54M/s
4KiB 751K/s
2KiB 367K/s
1KiB 183K/s
512B 90.3K/s
./flashbench -O -0 4 -b 512 /dev/block/mmcblk0p6
4MiB 5.87M/s
2MiB 8.71M/s
1MiB 9.11M/s
512KiB 10.3M/s
256KiB 10.5M/s
128KiB 10M/s
64KiB 9.09M/s
32KiB 7.5M/s
16KiB 3.28M/s
8KiB 1.56M/s
4KiB 758K/s
2KiB 372K/s
1KiB 185K/s
512B 92.3K/s
./flashbench -O -0 4 -r -b 512 /dev/block/mmcblk0p6
4MiB 7.57M/s
2MiB 7.23M/s
1MiB 9.71M/s
512KiB 10M/s
256KiB 9.98M/s
128KiB 9.82M/s
64KiB 9.07M/s
32KiB 7.62M/s
16KiB 3.34M/s
8KiB 1.58M/s
4KiB 776K/s
2KiB 379K/s
1KiB 188K/s
512B 92.7K/s
[-- Attachment #3: sandisk.txt --]
[-- Type: text/plain, Size: 5529 bytes --]
/data # cat /sys/block/mmcblk0/device/block/mmcblk0/mmcblk0p9/start
647168
/data # cat /sys/block/mmcblk0/device/block/mmcblk0/mmcblk0p9/size
346112
/data # cat /sys/block/mmcblk0/device/block/mmcblk0/mmcblk0p6/start
81920
/data # cat /sys/block/mmcblk0/device/block/mmcblk0/mmcblk0p6/size
24576
/data # ./flashbench -a -b 1024 /dev/block/mmcblk0p6
align 524288 pre 1.01ms on 1.03ms post 858µs diff 93.5µs
align 262144 pre 1.16ms on 1.2ms post 926µs diff 153µs
align 131072 pre 1.16ms on 1.2ms post 924µs diff 151µs
align 65536 pre 1.15ms on 1.12ms post 919µs diff 84.9µs
align 32768 pre 1.16ms on 1.2ms post 923µs diff 154µs
align 16384 pre 1.16ms on 1.21ms post 941µs diff 162µs
align 8192 pre 1.15ms on 1.09ms post 874µs diff 80.2µs
align 4096 pre 1.16ms on 1.17ms post 902µs diff 138µs
align 2048 pre 1.16ms on 1.17ms post 903µs diff 135µs
/data # ./flashbench -a -b 1024 /dev/block/mmcblk0p9
align 8388608 pre 1.07ms on 1.1ms post 933µs diff 92.9µs
align 4194304 pre 1.28ms on 1.29ms post 1.05ms diff 129µs
align 2097152 pre 1.28ms on 1.31ms post 1.07ms diff 132µs
align 1048576 pre 1.27ms on 1.32ms post 1.07ms diff 147µs
align 524288 pre 1.38ms on 1.38ms post 1.12ms diff 135µs
align 262144 pre 1.27ms on 1.3ms post 1.04ms diff 140µs
align 131072 pre 1.28ms on 1.31ms post 1.02ms diff 164µs
align 65536 pre 1.38ms on 1.38ms post 1.12ms diff 135µs
align 32768 pre 1.38ms on 1.38ms post 1.12ms diff 134µs
align 16384 pre 1.38ms on 1.38ms post 1.11ms diff 135µs
align 8192 pre 1.38ms on 1.38ms post 1.11ms diff 134µs
align 4096 pre 1.38ms on 1.38ms post 1.11ms diff 136µs
align 2048 pre 1.38ms on 1.38ms post 1.11ms diff 134µs
/data # ./flashbench -A -b 1024 /dev/block/mmcblk0p6
write align 524288 pre 1.69ms on 2.38ms post 1.78ms diff 653µs
write align 262144 pre 1.87ms on 2.59ms post 1.86ms diff 723µs
write align 131072 pre 1.88ms on 2.61ms post 1.89ms diff 729µs
write align 65536 pre 1.86ms on 2.65ms post 1.83ms diff 805µs
write align 32768 pre 1.88ms on 2.61ms post 1.92ms diff 710µs
write align 16384 pre 1.8ms on 2.57ms post 1.95ms diff 701µs
write align 8192 pre 1.66ms on 1.71ms post 1.64ms diff 55µs
write align 4096 pre 1.67ms on 1.71ms post 1.64ms diff 51.9µs
write align 2048 pre 1.67ms on 1.71ms post 1.61ms diff 68.7µs
/data # ./flashbench -A -b 1024 /dev/block/mmcblk0p9
write align 8388608 pre 1.83ms on 2.62ms post 1.91ms diff 750µs
write align 4194304 pre 1.89ms on 2.87ms post 2.06ms diff 892µs
write align 2097152 pre 2.08ms on 2.86ms post 2.13ms diff 751µs
write align 1048576 pre 2.06ms on 2.93ms post 2.17ms diff 818µs
write align 524288 pre 2.07ms on 2.85ms post 2.18ms diff 724µs
write align 262144 pre 2.07ms on 2.85ms post 2.15ms diff 741µs
write align 131072 pre 2.05ms on 2.93ms post 2.19ms diff 809µs
write align 65536 pre 1.86ms on 2.77ms post 1.9ms diff 888µs
write align 32768 pre 2.06ms on 2.91ms post 2.19ms diff 783µs
write align 16384 pre 2.05ms on 2.76ms post 1.8ms diff 835µs
write align 8192 pre 1.83ms on 1.89ms post 1.8ms diff 72.9µs
write align 4096 pre 1.84ms on 1.9ms post 1.8ms diff 75µs
write align 2048 pre 1.84ms on 1.89ms post 1.8ms diff 70.8µs
/data # ./flashbench -O -0 1 -b 512 /dev/block/mmcblk0p6
4MiB 10.5M/s
2MiB 10.1M/s
1MiB 10.6M/s
512KiB 10.5M/s
256KiB 8.94M/s
128KiB 7.74M/s
64KiB 6.04M/s
32KiB 4.13M/s
16KiB 3.2M/s
8KiB 3.87M/s
4KiB 1.86M/s
2KiB 1.16M/s
1KiB 667K/s
512B 396K/s
/data # ./flashbench -O -0 1 -r -b 512 /dev/block/mmcblk0p6
4MiB 10.7M/s
2MiB 10.3M/s
1MiB 10.4M/s
512KiB 16.3M/s
256KiB 16.6M/s
128KiB 16.1M/s
64KiB 14M/s
32KiB 11.1M/s
16KiB 6.77M/s
8KiB 3.15M/s
4KiB 1.77M/s
2KiB 1.01M/s
1KiB 523K/s
512B 296K/s
/data # ./flashbench -O -0 2 -b 512 /dev/block/mmcblk0p6
4MiB 11.5M/s
2MiB 11.3M/s
1MiB 11.5M/s
512KiB 11.6M/s
256KiB 10.8M/s
128KiB 9.84M/s
64KiB 7.88M/s
32KiB 5.65M/s
16KiB 4.14M/s
8KiB 1.99M/s
4KiB 1.42M/s
2KiB 760K/s
1KiB 392K/s
512B 213K/s
/data # ./flashbench -O -0 2 -r -b 512 /dev/block/mmcblk0p6
4MiB 10.3M/s
2MiB 10.2M/s
1MiB 10.1M/s
512KiB 16M/s
256KiB 15.8M/s
128KiB 14.6M/s
64KiB 11.4M/s
32KiB 8.07M/s
16KiB 5.12M/s
8KiB 2.65M/s
4KiB 1.43M/s
2KiB 768K/s
1KiB 395K/s
512B 212K/s
/data # ./flashbench -O -0 3 -b 512 /dev/block/mmcblk0p6
4MiB 11.3M/s
2MiB 11.5M/s
1MiB 11.5M/s
512KiB 11.5M/s
256KiB 10.4M/s
128KiB 9.1M/s
64KiB 7.3M/s
32KiB 5.21M/s
16KiB 3.78M/s
8KiB 2.08M/s
4KiB 1.42M/s
2KiB 792K/s
1KiB 418K/s
512B 217K/s
/data/flashbench -O -0 3 -r -b 512 /dev/block/mmcblk0p6
4MiB 10.7M/s
2MiB 10.5M/s
1MiB 10.2M/s
512KiB 17.3M/s
256KiB 16.3M/s
128KiB 14.5M/s
64KiB 11.4M/s
32KiB 8.12M/s
16KiB 4.98M/s
8KiB 2.62M/s
4KiB 1.4M/s
2KiB 768K/s
1KiB 390K/s
512B 212K/s
./flashbench -O -0 4 -b 512 /dev/block/mmcblk0p6
4MiB 14.4M/s
2MiB 14M/s
1MiB 13.9M/s
512KiB 14.2M/s
256KiB 13.5M/s
128KiB 11.9M/s
64KiB 9.8M/s
32KiB 7.35M/s
16KiB 5.1M/s
8KiB 2.69M/s
4KiB 1.58M/s
2KiB 877K/s
1KiB 476K/s
512B 268K/s
./flashbench -O -0 4 -r -b 512 /dev/block/mmcblk0p6
4MiB 10.4M/s
2MiB 10.5M/s
1MiB 14.3M/s
512KiB 17.7M/s
256KiB 16.9M/s
128KiB 15.5M/s
64KiB 12.4M/s
32KiB 9.36M/s
16KiB 5.62M/s
8KiB 3M/s
4KiB 1.62M/s
2KiB 880K/s
1KiB 462K/s
512B 261K/s
^ permalink raw reply [flat|nested] 117+ messages in thread
* MMC quirks relating to performance/lifetime.
@ 2011-02-18 22:40 ` Andrei Warkentin
0 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-18 22:40 UTC (permalink / raw)
To: linux-arm-kernel
On Fri, Feb 18, 2011 at 1:47 PM, Andrei Warkentin <andreiw@motorola.com> wrote:
> On Fri, Feb 18, 2011 at 7:44 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>> I'm curious. Neither the manfid nor the oemid fields of either card
>> match what I have seen on SD cards, I would expect them to be
>>
>> Sandisk: manfid 0x000003, oemid 0x5344
>> Toshiba: manfid 0x000002, oemid 0x544d
>>
>> I have not actually seen any Toshiba SD cards, but I assume that they
>> use the same controllers as Kingston.
>>
>> Does anyone know if the IDs have any correlation between MMC and SD
>> controllers?
>>
>> Arnd
>>
>
> I'm unsure about the older scheme (assigned by MMCA), but ever since
> MMC is now JEDEC-controlled, the IDs have changed. Sandisk's new id
> will be 0x45, and Toshiba I guess will be 0x11.
>
Flashbench timings for both Sandisk and Toshiba cards. Attaching due to size.
Some interesting things that I don't understand. For the align test, I
extended it to do a write align test (-A). I tried two partitions that
I could write over, and both read and writes behaved differently for
the two partitions on same device. Odd. They are both 4MB aligned.
On the sandisk it was the write align that made the page size stand
out. The read align had pretty constant results.
On the toshiba the results varied wildly for the two partitions. For
partition 6, there was a clear pattern in the diff values for read
align. For 9, it was all over the place. For 9 with the write align,
8K and 16K the crossing writes took ~115ms!! Look in attached files
for all the data.
The AU tests were interesting too, especially how with several open
AUs the throughput is higher for certain smaller sizes on sandisk, but
if I interpret it correctly both cards have at least 4 AUs, as I
didn't see yet a significant drop for small sizes. The larger ones I
am running now on mmcblk0p9 which is sufficiently larger for these
tests... (mmcblk0p6 is only 40mb, p9 is 314 mb)
Thanks,
A
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: toshiba.txt
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20110218/3e560d5a/attachment.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: sandisk.txt
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20110218/3e560d5a/attachment-0001.txt>
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-18 22:40 ` Andrei Warkentin
@ 2011-02-18 23:17 ` Andrei Warkentin
-1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-18 23:17 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc
[-- Attachment #1: Type: text/plain, Size: 4240 bytes --]
2011/2/18 Andrei Warkentin <andreiw@motorola.com>:
> On Fri, Feb 18, 2011 at 1:47 PM, Andrei Warkentin <andreiw@motorola.com> wrote:
>> On Fri, Feb 18, 2011 at 7:44 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>>> I'm curious. Neither the manfid nor the oemid fields of either card
>>> match what I have seen on SD cards, I would expect them to be
>>>
>>> Sandisk: manfid 0x000003, oemid 0x5344
>>> Toshiba: manfid 0x000002, oemid 0x544d
>>>
>>> I have not actually seen any Toshiba SD cards, but I assume that they
>>> use the same controllers as Kingston.
>>>
>>> Does anyone know if the IDs have any correlation between MMC and SD
>>> controllers?
>>>
>>> Arnd
>>>
>>
>> I'm unsure about the older scheme (assigned by MMCA), but ever since
>> MMC is now JEDEC-controlled, the IDs have changed. Sandisk's new id
>> will be 0x45, and Toshiba I guess will be 0x11.
>>
>
> Flashbench timings for both Sandisk and Toshiba cards. Attaching due to size.
>
> Some interesting things that I don't understand. For the align test, I
> extended it to do a write align test (-A). I tried two partitions that
> I could write over, and both read and writes behaved differently for
> the two partitions on same device. Odd. They are both 4MB aligned.
>
> On the sandisk it was the write align that made the page size stand
> out. The read align had pretty constant results.
>
> On the toshiba the results varied wildly for the two partitions. For
> partition 6, there was a clear pattern in the diff values for read
> align. For 9, it was all over the place. For 9 with the write align,
> 8K and 16K the crossing writes took ~115ms!! Look in attached files
> for all the data.
>
> The AU tests were interesting too, especially how with several open
> AUs the throughput is higher for certain smaller sizes on sandisk, but
> if I interpret it correctly both cards have at least 4 AUs, as I
> didn't see yet a significant drop for small sizes. The larger ones I
> am running now on mmcblk0p9 which is sufficiently larger for these
> tests... (mmcblk0p6 is only 40mb, p9 is 314 mb)
>
> Thanks,
> A
>
I thought this was pretty interesting -
# echo 0 > /sys/block/mmcblk0/device/page_size
# ./flashbench -A -b 1024 /dev/block/mmcblk0p9
write align 8388608 pre 3.59ms on 6.54ms post 3.65ms diff 2.92ms
write align 4194304 pre 4.13ms on 7.37ms post 4.27ms diff 3.17ms
write align 2097152 pre 3.62ms on 6.81ms post 3.94ms diff 3.03ms
write align 1048576 pre 3.62ms on 6.53ms post 3.55ms diff 2.95ms
write align 524288 pre 3.62ms on 6.51ms post 3.63ms diff 2.88ms
write align 262144 pre 3.62ms on 6.51ms post 3.63ms diff 2.89ms
write align 131072 pre 3.62ms on 6.5ms post 3.63ms diff 2.88ms
write align 65536 pre 3.61ms on 6.49ms post 3.62ms diff 2.88ms
write align 32768 pre 3.61ms on 6.49ms post 3.61ms diff 2.88ms
write align 16384 pre 3.68ms on 107ms post 3.51ms diff 103ms
write align 8192 pre 3.74ms on 121ms post 3.91ms diff 117ms
write align 4096 pre 3.88ms on 3.87ms post 3.87ms diff -2937ns
write align 2048 pre 3.89ms on 3.88ms post 3.88ms diff -8734ns
# fjnh84@fjnh84-desktop:~/src/n/src/flash$ adb -s 17006185428011d7 shell
# echo 8192 > /sys/block/mmcblk0/device/page_size
# cd data
# ./flashbench -A -b 1024 /dev/block/mmcblk0p9
write align 8388608 pre 3.33ms on 6.8ms post 3.65ms diff 3.31ms
write align 4194304 pre 4.34ms on 8.14ms post 4.53ms diff 3.71ms
write align 2097152 pre 3.64ms on 7.31ms post 4.09ms diff 3.44ms
write align 1048576 pre 3.65ms on 7.52ms post 3.65ms diff 3.87ms
write align 524288 pre 3.62ms on 6.8ms post 3.63ms diff 3.17ms
write align 262144 pre 3.62ms on 6.84ms post 3.63ms diff 3.22ms
write align 131072 pre 3.62ms on 6.85ms post 3.44ms diff 3.32ms
write align 65536 pre 3.39ms on 6.8ms post 3.66ms diff 3.28ms
write align 32768 pre 3.64ms on 6.86ms post 3.66ms diff 3.21ms
write align 16384 pre 3.67ms on 6.86ms post 3.65ms diff 3.2ms
write align 8192 pre 3.66ms on 6.84ms post 3.64ms diff 3.19ms
write align 4096 pre 3.71ms on 3.71ms post 3.64ms diff 38.6µs
write align 2048 pre 3.71ms on 3.71ms post 3.72ms diff -656ns
The first run is with splitting disabled (page_size 0), the second with
the split unaligned accesses patch active (page_size 8192). I am
attaching the patch for comments.
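For anyone following the numbers, here is a rough userspace sketch of the
arithmetic the attached patch applies to each write. It is not the kernel
code: split_write() and its parameters are made-up names for
illustration, and it assumes the start address is given in sectors.

```c
#include <stdbool.h>

/*
 * Illustrative only: decide how many sectors of a write to issue first
 * so that the remainder starts on a page boundary.
 * start and *nr_sectors are in sectors; sizes are in bytes.
 * Returns true if the request was shortened.
 */
bool split_write(unsigned int start, unsigned int *nr_sectors,
                 unsigned int page_size_bytes, unsigned int sector_size)
{
    unsigned int page_sectors, left_in_page;

    /* A page size of 0 disables splitting, as in the patch. */
    if (!sector_size || page_size_bytes < sector_size)
        return false;

    page_sectors = page_size_bytes / sector_size;
    left_in_page = page_sectors - (start % page_sectors);

    if (left_in_page == page_sectors)
        return false;           /* already page aligned */
    if (*nr_sectors <= left_in_page)
        return false;           /* does not cross a page boundary */

    *nr_sectors = left_in_page; /* issue only the unaligned head */
    return true;
}
```

With an 8192-byte page and 512-byte sectors, a 4-sector write starting
at sector 14 is cut to 2 sectors; the rest is left for a follow-up
request that starts at the aligned sector 16.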
Thanks,
A
[-- Attachment #2: 0001-MMC-Split-non-page-size-aligned-accesses.patch --]
[-- Type: text/x-diff, Size: 5196 bytes --]
From b3e6a556a716e7cec86071342197e798b38c3cbf Mon Sep 17 00:00:00 2001
From: Andrei Warkentin <andreiw@motorola.com>
Date: Fri, 18 Feb 2011 17:46:00 -0600
Subject: [PATCH] MMC: Split non-page-size aligned accesses.
If the card page size is known, splits the access into an unaligned
and an aligned portion, which helps with the performance.
Change-Id: I4ad7588d613d775212fac87436e418577909a22b
Signed-off-by: Andrei Warkentin <andreiw@motorola.com>
---
drivers/mmc/card/block.c | 111 ++++++++++++++++++++++++++++++++++++++++++++++
include/linux/mmc/card.h | 1 +
2 files changed, 112 insertions(+), 0 deletions(-)
diff --git a/drivers/mmc/card/block.c b/drivers/mmc/card/block.c
index 7054fd5..be7d739 100644
--- a/drivers/mmc/card/block.c
+++ b/drivers/mmc/card/block.c
@@ -22,6 +22,7 @@
#include <linux/init.h>
#include <linux/kernel.h>
+#include <linux/ctype.h>
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/errno.h>
@@ -67,6 +68,74 @@ struct mmc_blk_data {
static DEFINE_MUTEX(open_lock);
+static ssize_t
+show_block_attr(struct device *dev, struct device_attribute *attr,
+ char *buf);
+
+static ssize_t
+set_block_attr(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t count);
+
+static DEVICE_ATTR(page_size, S_IRUGO | S_IWUSR, show_block_attr, set_block_attr);
+
+static ssize_t
+show_block_attr(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ unsigned int val;
+ ssize_t ret = 0;
+ struct mmc_card *card = container_of(dev, struct mmc_card, dev);
+ mmc_claim_host(card->host);
+ if (attr == &dev_attr_page_size)
+ val = card->page_size;
+ else
+ ret = -EINVAL;
+
+ mmc_release_host(card->host);
+ if (!ret)
+ ret = sprintf(buf, "%u\n", val);
+ return ret;
+}
+
+static ssize_t
+set_block_attr(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ ssize_t ret;
+ char *after;
+ unsigned int val, *dest = NULL;
+ struct mmc_card *card = container_of(dev, struct mmc_card, dev);
+ val = simple_strtoul(buf, &after, 10);
+ ret = after - buf;
+
+ while (isspace(*after++))
+ ret++;
+
+ if (ret != count)
+ return -EINVAL;
+
+ if (attr == &dev_attr_page_size)
+ dest = &card->page_size;
+ else
+ return -EINVAL;
+
+ if (dest) {
+ mmc_claim_host(card->host);
+ *dest = val;
+ mmc_release_host(card->host);
+ }
+ return ret;
+}
+
+static struct attribute *capability_attrs[] = {
+ &dev_attr_page_size.attr,
+ NULL,
+};
+
+static struct attribute_group attr_group = {
+ .attrs = capability_attrs,
+};
+
static struct mmc_blk_data *mmc_blk_get(struct gendisk *disk)
{
struct mmc_blk_data *md;
@@ -312,6 +381,38 @@ out:
return err ? 0 : 1;
}
+
+/*
+ * If the request is not aligned, split it into an unaligned
+ * and an aligned portion. Here we can adjust
+ * the size of the MMC request and let the block layer request handle
+ * deal with generating another MMC request.
+ */
+static bool mmc_adjust_write(struct mmc_card *card,
+ struct mmc_request *mrq)
+{
+ unsigned int left_in_page;
+ unsigned int page_size_blocks;
+
+ if (!card->page_size)
+ return false;
+
+ page_size_blocks = card->page_size / mrq->data->blksz;
+ left_in_page = page_size_blocks -
+ (mrq->cmd->arg % page_size_blocks);
+
+ /* Aligned access. */
+ if (left_in_page == page_size_blocks)
+ return false;
+
+ /* Not straddling page boundary. */
+ if (mrq->data->blocks <= left_in_page)
+ return false;
+
+ mrq->data->blocks = left_in_page;
+ return true;
+}
+
static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *req)
{
struct mmc_blk_data *md = mq->data;
@@ -339,6 +440,10 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *req)
brq.stop.flags = MMC_RSP_SPI_R1B | MMC_RSP_R1B | MMC_CMD_AC;
brq.data.blocks = blk_rq_sectors(req);
+ /* Check for unaligned accesses straddling pages. */
+ if (rq_data_dir(req) == WRITE)
+ mmc_adjust_write(card, &brq.mrq);
+
/*
* The block layer doesn't support all sector count
* restrictions, so we need to be prepared for too big
@@ -707,6 +812,10 @@ static int mmc_blk_probe(struct mmc_card *card)
if (err)
goto out;
+ err = sysfs_create_group(&card->dev.kobj, &attr_group);
+ if (err)
+ goto out;
+
string_get_size((u64)get_capacity(md->disk) << 9, STRING_UNITS_2,
cap_str, sizeof(cap_str));
printk(KERN_INFO "%s: %s %s %s %s\n",
@@ -735,6 +844,8 @@ static void mmc_blk_remove(struct mmc_card *card)
/* Stop new requests from getting into the queue */
del_gendisk(md->disk);
+ sysfs_remove_group(&card->dev.kobj, &attr_group);
+
/* Then flush out any already in there */
mmc_cleanup_queue(&md->queue);
diff --git a/include/linux/mmc/card.h b/include/linux/mmc/card.h
index 6b75250..d52768a 100644
--- a/include/linux/mmc/card.h
+++ b/include/linux/mmc/card.h
@@ -123,7 +123,7 @@ struct mmc_card {
unsigned int erase_size; /* erase size in sectors */
unsigned int erase_shift; /* if erase unit is power 2 */
unsigned int pref_erase; /* in sectors */
+ unsigned int page_size; /* page size in bytes */
u8 erased_byte; /* value of erased bytes */
u32 raw_cid[4]; /* raw card CID */
--
1.7.0.4
^ permalink raw reply related [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-18 22:40 ` Andrei Warkentin
@ 2011-02-19 9:54 ` Arnd Bergmann
-1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-19 9:54 UTC (permalink / raw)
To: Andrei Warkentin; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc
On Friday 18 February 2011 23:40:16 Andrei Warkentin wrote:
> On Fri, Feb 18, 2011 at 1:47 PM, Andrei Warkentin <andreiw@motorola.com> wrote:
>
> Flashbench timings for both Sandisk and Toshiba cards. Attaching due to size.
Very nice, thanks for the measurement!
I don't think having the results inline in the mail is a problem,
it would even make it easier to quote.
> Some interesting things that I don't understand. For the align test, I
> extended it to do a write align test (-A). I tried two partitions that
> I could write over, and both read and writes behaved differently for
> the two partitions on same device. Odd. They are both 4MB aligned.
I never did a write align test because the results will be highly
unreliable as soon as you get into thrashing. Your results seem
to be meaningful still, so maybe we should have it after all, but
I'll put a big warning on it.
> On the sandisk it was the write align that made the page size stand
> out. The read align had pretty constant results.
I've noticed on other Sandisk media that the read align test is
sometimes useless. It may help to do a full erase of the partition,
or to fill it with data before running the test.
> On the toshiba the results varied wildly for the two partitions. For
> partition 6, there was a clear pattern in the diff values for read
> align. For 9, it was all over the place. For 9 with the write align,
> 8K and 16K the crossing writes took ~115ms!! Look in attached files
> for all the data.
Partition 6 is a lot smaller, so the accesses are less than a segment
apart, and the test ends up showing other effects.
> The AU tests were interesting too, especially how with several open
> AUs the throughput is higher for certain smaller sizes on sandisk, but
> if I interpret it correctly both cards have at least 4 AUs, as I
> didn't see yet a significant drop for small sizes. The larger ones I
> am running now on mmcblk0p9 which is sufficiently larger for these
> tests... (mmcblk0p6 is only 40mb, p9 is 314 mb)
Right, you should try larger values for --open-au-nr here. It's at
least a good sign that the drive can do random access inside a segment
and that it can have at least 4 segments open. This is much better
than I expected from your descriptions at first.
However, the drop in performance from 32 KB to 16 KB is horrifying
for the Toshiba drive. It's clear that this one does not like
to be accessed in units smaller than 32 KB at a time, an obvious
optimization for FAT32 with 32 KB clusters. How does this change with
your kernel patches?
For the Sandisk drive, it's funny how it is consistently faster
doing random access than linear access. I don't think I've seen that
before. It does seem to have some cache for linear accesses smaller
than 16 KB, and can probably combine them when it's only
writing to a single segment.
Arnd
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-18 23:17 ` Andrei Warkentin
@ 2011-02-19 11:20 ` Arnd Bergmann
-1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-19 11:20 UTC (permalink / raw)
To: Andrei Warkentin; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc
On Saturday 19 February 2011 00:17:51 Andrei Warkentin wrote:
> # echo 0 > /sys/block/mmcblk0/device/page_size
> # ./flashbench -A -b 1024 /dev/block/mmcblk0p9
> write align 8388608 pre 3.59ms on 6.54ms post 3.65ms diff 2.92ms
> write align 4194304 pre 4.13ms on 7.37ms post 4.27ms diff 3.17ms
> write align 2097152 pre 3.62ms on 6.81ms post 3.94ms diff 3.03ms
> write align 1048576 pre 3.62ms on 6.53ms post 3.55ms diff 2.95ms
> write align 524288 pre 3.62ms on 6.51ms post 3.63ms diff 2.88ms
> write align 262144 pre 3.62ms on 6.51ms post 3.63ms diff 2.89ms
> write align 131072 pre 3.62ms on 6.5ms post 3.63ms diff 2.88ms
> write align 65536 pre 3.61ms on 6.49ms post 3.62ms diff 2.88ms
> write align 32768 pre 3.61ms on 6.49ms post 3.61ms diff 2.88ms
> write align 16384 pre 3.68ms on 107ms post 3.51ms diff 103ms
> write align 8192 pre 3.74ms on 121ms post 3.91ms diff 117ms
> write align 4096 pre 3.88ms on 3.87ms post 3.87ms diff -2937ns
> write align 2048 pre 3.89ms on 3.88ms post 3.88ms diff -8734ns
> # fjnh84@fjnh84-desktop:~/src/n/src/flash$ adb -s 17006185428011d7 shell
> # echo 8192 > /sys/block/mmcblk0/device/page_size
> # cd data
> # ./flashbench -A -b 1024 /dev/block/mmcblk0p9
> write align 8388608 pre 3.33ms on 6.8ms post 3.65ms diff 3.31ms
> write align 4194304 pre 4.34ms on 8.14ms post 4.53ms diff 3.71ms
> write align 2097152 pre 3.64ms on 7.31ms post 4.09ms diff 3.44ms
> write align 1048576 pre 3.65ms on 7.52ms post 3.65ms diff 3.87ms
> write align 524288 pre 3.62ms on 6.8ms post 3.63ms diff 3.17ms
> write align 262144 pre 3.62ms on 6.84ms post 3.63ms diff 3.22ms
> write align 131072 pre 3.62ms on 6.85ms post 3.44ms diff 3.32ms
> write align 65536 pre 3.39ms on 6.8ms post 3.66ms diff 3.28ms
> write align 32768 pre 3.64ms on 6.86ms post 3.66ms diff 3.21ms
> write align 16384 pre 3.67ms on 6.86ms post 3.65ms diff 3.2ms
> write align 8192 pre 3.66ms on 6.84ms post 3.64ms diff 3.19ms
> write align 4096 pre 3.71ms on 3.71ms post 3.64ms diff 38.6µs
> write align 2048 pre 3.71ms on 3.71ms post 3.72ms diff -656ns
>
> This was with the split unaligned accesses patch... Which I am
> attaching for comments.
I agree, this is very fascinating behavior. A latency of over 100 ms
for a single 2 KB access is definitely something we should try to avoid,
and I wonder why the drive decides to do that. It must get into a state
where it requires an extra garbage collection (you mentioned that earlier).
The numbers you see here are taken over multiple runs. Do you see a lot
of fluctuation when doing this with --count=1?
Also, does the same happen with other blocksizes, e.g. 4096 or 8192, passed
to flashbench?
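When reading the align tables, note that the diff values quoted in this
thread are consistent, to within rounding, with the boundary-crossing
time minus the mean of the two neighbouring times. A small sketch of
that calculation, assuming this is what the tool computes (the function
name is made up for the example, times are in nanoseconds):

```c
#include <stdint.h>

/*
 * Assumed relationship, inferred from the quoted tables:
 *   diff = on - (pre + post) / 2
 * All values are in nanoseconds. The result can be negative when the
 * crossing access is no slower than its neighbours.
 */
int64_t align_diff_ns(int64_t pre_ns, int64_t on_ns, int64_t post_ns)
{
    return on_ns - (pre_ns + post_ns) / 2;
}
```

For the 16384 row without splitting (pre 3.68ms, on 107ms, post 3.51ms)
this gives about 103ms, and for the same row with page_size set to 8192
(pre 3.67ms, on 6.86ms, post 3.65ms) it gives 3.2ms, matching the
tables.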
Arnd
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-19 9:54 ` Arnd Bergmann
@ 2011-02-20 4:39 ` Andrei Warkentin
-1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-20 4:39 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc
On Sat, Feb 19, 2011 at 3:54 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Friday 18 February 2011 23:40:16 Andrei Warkentin wrote:
>> On Fri, Feb 18, 2011 at 1:47 PM, Andrei Warkentin <andreiw@motorola.com> wrote:
>>
>> Flashbench timings for both Sandisk and Toshiba cards. Attaching due to size.
>
> Very nice, thanks for the measurement!
>
> I don't think having the results inline in the mail is a problem,
> it would even make it easier to quote.
>
>> Some interesting things that I don't understand. For the align test, I
>> extended it to do a write align test (-A). I tried two partitions that
>> I could write over, and both read and writes behaved differently for
>> the two partitions on same device. Odd. They are both 4MB aligned.
>
> I never did a write align test because the results will be highly
> unreliable as soon as you get into thrashing. Your results seem
> to be meaningful still, so maybe we should have it after all, but
> I'll put a big warning on it.
>
Actually it would be a good idea to also bail/warn if you do the AU
test with more open AUs than the size of the passed device allows,
since it'll just wrap around and skew the results.
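The check suggested here could look roughly like the sketch below. It is not flashbench code: the function name is made up for illustration, and it assumes each open AU occupies one erase-block-sized region, with a 4 MiB AU size taken from the alignment mentioned in this thread.

```python
MiB = 1024 * 1024


def open_au_test_fits(device_bytes: int, erase_bytes: int, open_au_nr: int) -> bool:
    """Return True if open_au_nr distinct AUs of erase_bytes each fit on the device.

    Hypothetical helper: if this returns False, the test offsets would wrap
    around and touch the same AU more than once, skewing the results.
    """
    if erase_bytes <= 0 or open_au_nr <= 0:
        raise ValueError("erase_bytes and open_au_nr must be positive")
    return open_au_nr * erase_bytes <= device_bytes


if __name__ == "__main__":
    # Partition sizes as quoted in this thread (p6 ~40 MB, p9 ~314 MB).
    for name, size, nr in (("p6", 40 * MiB, 8), ("p6", 40 * MiB, 20), ("p9", 314 * MiB, 50)):
        verdict = "ok" if open_au_test_fits(size, 4 * MiB, nr) else "would wrap, bail/warn"
        print(f"{name}: {nr} open AUs -> {verdict}")
```

With a 4 MiB AU, the 40 MB partition has room for only 10 AUs, so a run with 20 open AUs on it would be flagged.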
>> On the sandisk it was the write align that made the page size stand
>> out. The read align had pretty constant results.
>
> I've noticed on other Sandisk media that the read align test is
> sometimes useless. It may help to do a full erase of the partition,
> or to fill it with data before running the test.
>
>> On the toshiba the results varied wildly for the two partitions. For
>> partition 6, there was a clear pattern in the diff values for read
>> align. For 9, it was all over the place. For 9 with the write align,
>> 8K and 16K the crossing writes took ~115ms!! Look in attached files
>> for all the data.
>
> Partition 6 is a lot smaller, so you have the accesses less than a
> segment apart, so it shows other effects.
>
>> The AU tests were interesting too, especially how with several open
>> AUs the throughput is higher for certain smaller sizes on sandisk, but
>> if I interpret it correctly both cards have at least 4 AUs, as I
>> didn't see yet a significant drop for small sizes. The larger ones I
>> am running now on mmcblk0p9 which is sufficiently larger for these
>> tests... (mmcblk0p6 is only 40mb, p9 is 314 mb)
>
> Right, you should try larger values for --open-au-nr here. It's at
> least a good sign that the drive can do random access inside a segment
> and that it can have at least 4 segments open. This is much better
> than I expected from your descriptions at first.
Actually the Toshiba one seems to have 7 AUs if I interpret this correctly.
^C
# ./flashbench -O -0 6 -b 512 /dev/block/mmcblk0p9
4MiB 5.91M/s
2MiB 8.84M/s
1MiB 10.8M/s
512KiB 13M/s
256KiB 13.6M/s
^C
# ./flashbench -O -0 7 -b 512 /dev/block/mmcblk0p9
4MiB 6.32M/s
2MiB 8.63M/s
1MiB 10.5M/s
512KiB 13.2M/s
256KiB 13M/s
128KiB 12.3M/s
^C
# ./flashbench -O -0 8 -b 512 /dev/block/mmcblk0p9
4MiB 6.65M/s
2MiB 7.02M/s
1MiB 6.36M/s
512KiB 3.17M/s
256KiB 1.53M/s
The Sandisk one has 20 AUs.
# ./flashbench -O -0 20 -b 512 /dev/block/mmcblk0p9
4MiB 11.3M/s
2MiB 12.8M/s
1MiB 9.87M/s
512KiB 9.97M/s
256KiB 9.13M/s
128KiB 8.05M/s
^C
# ./flashbench -O -0 50 -b 512 /dev/block/mmcblk0p9
4MiB 7.19M/s
^C
# ./flashbench -O -0 2 -b 512 /dev/block/mmcblk0p9
^C
# ./flashbench -O -0 22 -b 512 /dev/block/mmcblk0p9
4MiB 11.6M/s
2MiB 12.3M/s
1MiB 5.13M/s
512KiB 2.57M/s
256KiB 1.59M/s
128KiB 1.16M/s
64KiB 776K/s
^C
# ./flashbench -O -0 21 -b 512 /dev/block/mmcblk0p9
4MiB 11.2M/s
2MiB 12.4M/s
1MiB 4.65M/s
512KiB 1.95M/s
256KiB 955K/s
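One way to read the AU counts off these runs is to look for the first open-AU count at which small-block throughput collapses. The sketch below does that for the 256KiB rows above; the 50% drop threshold is an arbitrary assumption for illustration, not something flashbench defines.

```python
# 256KiB throughput (in M/s) from the runs above, keyed by the open-AU count.
TOSHIBA_256K = {6: 13.6, 7: 13.0, 8: 1.53}
SANDISK_256K = {20: 9.13, 21: 0.955, 22: 1.59}


def max_open_aus(throughput_by_nr, drop_factor=0.5):
    """Return the largest AU count measured before throughput collapses.

    A collapse is assumed to be a run whose throughput is below
    drop_factor times that of the previous (smaller) AU count.
    """
    counts = sorted(throughput_by_nr)
    best = counts[0]
    for prev, cur in zip(counts, counts[1:]):
        if throughput_by_nr[cur] < drop_factor * throughput_by_nr[prev]:
            break
        best = cur
    return best


if __name__ == "__main__":
    print("Toshiba:", max_open_aus(TOSHIBA_256K))
    print("Sandisk:", max_open_aus(SANDISK_256K))
```

On this data the heuristic agrees with the reading given in the mail: 7 for the Toshiba part and 20 for the Sandisk part.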
>
> However, the drop from 32 KB to 16 KB in performance is horrifying
> for the Toshiba drive, it's clear that this one does not like
> to be accessed smaller than 32 KB at a time, an obvious optimization
> for FAT32 with 32 KB clusters. How does this change with your
> kernel patches?
Since the only performance-increasing patch here would be just the one
that splits unaligned accesses, I wouldn't expect any improvements for
page-aligned accesses < 32KB. As you can see here...
# cat /sys/block/mmcblk0/device/page_size
8192
# ./flashbench -O -0 1 -b 512 /dev/block/mmcblk0p9
4MiB 6.81M/s
2MiB 7.73M/s
1MiB 9.21M/s
512KiB 9.98M/s
256KiB 10.3M/s
128KiB 10.2M/s
64KiB 9.76M/s
32KiB 8.52M/s
16KiB 3.68M/s
8KiB 1.72M/s
4KiB 837K/s
^C
# echo 0 > /sys/block/mmcblk0/device/page_size
# ./flashbench -O -0 1 -b 512 /dev/block/mmcblk0p9
4MiB 6.42M/s
2MiB 7.79M/s
1MiB 9.22M/s
512KiB 10M/s
256KiB 9.94M/s
128KiB 10.1M/s
64KiB 9.68M/s
32KiB 8.5M/s
16KiB 3.65M/s
8KiB 1.73M/s
4KiB 838K/s
2KiB 417K/s
^C
#
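For readers without the patch at hand, the idea of splitting unaligned accesses can be sketched as follows. This is only an illustration of the concept under stated assumptions: the function name is invented, offsets are in bytes, a page_size of 0 is taken to mean "disabled" as in the sysfs usage above, and the actual patch may split requests differently.

```python
def split_unaligned(start, length, page_size):
    """Split a write so the part before the first page boundary goes out alone.

    Returns a list of (start, length) chunks. Writes that begin on a page
    boundary, or that fit entirely inside one page, are left untouched.
    """
    if length <= 0:
        return []
    if page_size <= 0:
        # Splitting disabled.
        return [(start, length)]
    offset = start % page_size
    if offset == 0 or offset + length <= page_size:
        return [(start, length)]
    head = page_size - offset
    return [(start, head), (start + head, length - head)]


if __name__ == "__main__":
    # A 2 KiB write starting 1 KiB before an 8 KiB boundary.
    print(split_unaligned(7168, 2048, 8192))
    # A page-aligned 16 KiB write is not split, which matches the
    # observation that aligned accesses < 32KB see no improvement.
    print(split_unaligned(8192, 16384, 8192))
```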
>
> For the sandisk drive, it's funny how it is consistently faster
> doing random access than linear access. I don't think I've seen that
> before. It does seem to have some cache for linear access using
> smaller than 16 KB, and can probably combine them when it's only
> writing to a single segment.
Yes, that is pretty interesting. Smaller than 16K? Not smaller than
32K? I wonder what it is doing...
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-19 11:20 ` Arnd Bergmann
@ 2011-02-20 5:56 ` Andrei Warkentin
-1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-20 5:56 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc
On Sat, Feb 19, 2011 at 5:20 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Saturday 19 February 2011 00:17:51 Andrei Warkentin wrote:
>> # echo 0 > /sys/block/mmcblk0/device/page_size
>> # ./flashbench -A -b 1024 /dev/block/mmcblk0p9
>> write align 8388608 pre 3.59ms on 6.54ms post 3.65ms diff 2.92ms
>> write align 4194304 pre 4.13ms on 7.37ms post 4.27ms diff 3.17ms
>> write align 2097152 pre 3.62ms on 6.81ms post 3.94ms diff 3.03ms
>> write align 1048576 pre 3.62ms on 6.53ms post 3.55ms diff 2.95ms
>> write align 524288 pre 3.62ms on 6.51ms post 3.63ms diff 2.88ms
>> write align 262144 pre 3.62ms on 6.51ms post 3.63ms diff 2.89ms
>> write align 131072 pre 3.62ms on 6.5ms post 3.63ms diff 2.88ms
>> write align 65536 pre 3.61ms on 6.49ms post 3.62ms diff 2.88ms
>> write align 32768 pre 3.61ms on 6.49ms post 3.61ms diff 2.88ms
>> write align 16384 pre 3.68ms on 107ms post 3.51ms diff 103ms
>> write align 8192 pre 3.74ms on 121ms post 3.91ms diff 117ms
>> write align 4096 pre 3.88ms on 3.87ms post 3.87ms diff -2937ns
>> write align 2048 pre 3.89ms on 3.88ms post 3.88ms diff -8734ns
>> # fjnh84@fjnh84-desktop:~/src/n/src/flash$ adb -s 17006185428011d7 shell
>> # echo 8192 > /sys/block/mmcblk0/device/page_size
>> # cd data
>> # ./flashbench -A -b 1024 /dev/block/mmcblk0p9
>> write align 8388608 pre 3.33ms on 6.8ms post 3.65ms diff 3.31ms
>> write align 4194304 pre 4.34ms on 8.14ms post 4.53ms diff 3.71ms
>> write align 2097152 pre 3.64ms on 7.31ms post 4.09ms diff 3.44ms
>> write align 1048576 pre 3.65ms on 7.52ms post 3.65ms diff 3.87ms
>> write align 524288 pre 3.62ms on 6.8ms post 3.63ms diff 3.17ms
>> write align 262144 pre 3.62ms on 6.84ms post 3.63ms diff 3.22ms
>> write align 131072 pre 3.62ms on 6.85ms post 3.44ms diff 3.32ms
>> write align 65536 pre 3.39ms on 6.8ms post 3.66ms diff 3.28ms
>> write align 32768 pre 3.64ms on 6.86ms post 3.66ms diff 3.21ms
>> write align 16384 pre 3.67ms on 6.86ms post 3.65ms diff 3.2ms
>> write align 8192 pre 3.66ms on 6.84ms post 3.64ms diff 3.19ms
>> write align 4096 pre 3.71ms on 3.71ms post 3.64ms diff 38.6µs
>> write align 2048 pre 3.71ms on 3.71ms post 3.72ms diff -656ns
>>
>> This was with the split unaligned accesses patch... Which I am
>> attaching for comments.
>
> I agree, this is very fascinating behavior. 100ms latency for a
> single 2KB access is definitely something we should try to avoid, and I
> wonder why the drive decides to do that. It must get into a state where
> it requires an extra garbage collection (you mentioned that earlier).
>
> The numbers you see here are taken over multiple runs. Do you see a lot
> of fluctuation when doing this with --count=1?
>
Yep. Quite a bit.
# ./flashbench -c 1 -A -b 1024 /dev/block/mmcblk0p9
write align 8388608 pre 4.52ms on 7.58ms post 3.93ms diff 3.36ms
write align 4194304 pre 5.97ms on 8.69ms post 4.36ms diff 3.53ms
write align 2097152 pre 3.57ms on 7.96ms post 4.6ms diff 3.88ms
write align 1048576 pre 5.33ms on 27.4ms post 4.88ms diff 22.3ms
write align 524288 pre 49.3ms on 31.4ms post 14.9ms diff -679265
write align 262144 pre 39.7ms on 38.3ms post 5.27ms diff 15.8ms
write align 131072 pre 33.8ms on 45.4ms post 5.26ms diff 25.9ms
write align 65536 pre 34.4ms on 40.9ms post 3.3ms diff 22.1ms
write align 32768 pre 30.2ms on 44.8ms post 5.13ms diff 27.1ms
write align 16384 pre 44.5ms on 5.05ms post 33.3ms diff -338542
write align 8192 pre 25.5ms on 70.6ms post 25.3ms diff 45.2ms
write align 4096 pre 4.89ms on 4.47ms post 5.29ms diff -623390
write align 2048 pre 4.88ms on 4.89ms post 5.2ms diff -155781
# ./flashbench -c 1 -A -b 1024 /dev/block/mmcblk0p9
write align 8388608 pre 4.68ms on 9.06ms post 5.14ms diff 4.15ms
write align 4194304 pre 4.37ms on 7.49ms post 4.59ms diff 3.01ms
write align 2097152 pre 23.7ms on 1.9ms post 14.8ms diff -173218
write align 1048576 pre 14.8ms on 19.9ms post 4.75ms diff 10.2ms
write align 524288 pre 20.2ms on 24.9ms post 10.7ms diff 9.46ms
write align 262144 pre 20.2ms on 3.01ms post 20.1ms diff -171062
write align 131072 pre 25.9ms on 24.9ms post 9.85ms diff 7.06ms
write align 65536 pre 15.5ms on 30.3ms post 2.95ms diff 21.1ms
write align 32768 pre 27.3ms on 19.1ms post 5.86ms diff 2.5ms
write align 16384 pre 25.4ms on 55.9ms post 12.7ms diff 36.9ms
write align 8192 pre 4.8ms on 102ms post 9.47ms diff 94.8ms
write align 4096 pre 4.92ms on 5.16ms post 4.98ms diff 207µs
write align 2048 pre 4.64ms on 4.92ms post 5.45ms diff -121860
# ./flashbench -c 1 -A -b 1024 /dev/block/mmcblk0p9
write align 8388608 pre 15.8ms on 9.39ms post 4.68ms diff -854295
write align 4194304 pre 4.76ms on 7.54ms post 3.82ms diff 3.24ms
write align 2097152 pre 19.9ms on 9.73ms post 4.44ms diff -244517
write align 1048576 pre 14.5ms on 19.1ms post 5.21ms diff 9.23ms
write align 524288 pre 24.9ms on 29ms post 5.89ms diff 13.6ms
write align 262144 pre 24.9ms on 2.41ms post 20.8ms diff -204328
write align 131072 pre 25.6ms on 30ms post 4.84ms diff 14.8ms
write align 65536 pre 26.4ms on 24.4ms post 6.16ms diff 8.12ms
write align 32768 pre 15ms on 30.6ms post 15.4ms diff 15.4ms
write align 16384 pre 16.1ms on 45.4ms post 16.5ms diff 29.1ms
write align 8192 pre 5.88ms on 107ms post 5.45ms diff 101ms
write align 4096 pre 5.17ms on 5.78ms post 4.83ms diff 778µs
write align 2048 pre 3.99ms on 5.27ms post 3.97ms diff 1.29ms
# ./flashbench -c 1 -A -b 1024 /dev/block/mmcblk0p9
write align 8388608 pre 16.1ms on 8.37ms post 5.44ms diff -241222
write align 4194304 pre 4.07ms on 7.27ms post 3.89ms diff 3.29ms
write align 2097152 pre 24.2ms on 18.5ms post 5.63ms diff 3.59ms
write align 1048576 pre 4.08ms on 18.9ms post 5.46ms diff 14.1ms
write align 524288 pre 25.1ms on 28ms post 14.6ms diff 8.13ms
write align 262144 pre 15.8ms on 30ms post 5.4ms diff 19.4ms
write align 131072 pre 24.7ms on 30.8ms post 4.43ms diff 16.2ms
write align 65536 pre 5ms on 40.5ms post 5.95ms diff 35.1ms
write align 32768 pre 24.7ms on 30.6ms post 4.92ms diff 15.8ms
write align 16384 pre 25.2ms on 132ms post 10.2ms diff 114ms
write align 8192 pre 7.64ms on 111ms post 9.18ms diff 102ms
write align 4096 pre 5.11ms on 3.92ms post 5.4ms diff -134159
write align 2048 pre 3.92ms on 4.41ms post 4.51ms diff 196µs
> Also, does the same happen with other blocksizes, e.g. 4096 or 8192, passed
> to flashbench?
>
# echo 0 > /sys/block/mmcblk0/device/page_size
# ./flashbench -A -b 1024 /dev/block/mmcblk0p9
write align 8388608 pre 3.63ms on 6.51ms post 3.66ms diff 2.86ms
write align 4194304 pre 3.61ms on 6.51ms post 3.62ms diff 2.89ms
write align 2097152 pre 3.61ms on 6.49ms post 3.62ms diff 2.87ms
write align 1048576 pre 3.64ms on 6.55ms post 3.62ms diff 2.92ms
write align 524288 pre 3.64ms on 6.57ms post 3.66ms diff 2.92ms
write align 262144 pre 3.44ms on 6.45ms post 3.66ms diff 2.9ms
write align 131072 pre 3.64ms on 6.56ms post 3.67ms diff 2.91ms
write align 65536 pre 3.33ms on 6.57ms post 3.65ms diff 3.08ms
write align 32768 pre 3.68ms on 6.6ms post 3.7ms diff 2.91ms
write align 16384 pre 3.64ms on 97.6ms post 3.26ms diff 94.2ms
write align 8192 pre 3.49ms on 115ms post 3.62ms diff 112ms
write align 4096 pre 3.91ms on 3.91ms post 3.9ms diff 360ns
write align 2048 pre 3.92ms on 3.92ms post 3.92ms diff -1374ns
# ./flashbench -A -b 2048 /dev/block/mmcblk0p9
write align 8388608 pre 3.76ms on 7.23ms post 4.18ms diff 3.27ms
write align 4194304 pre 3.65ms on 6.56ms post 3.66ms diff 2.9ms
write align 2097152 pre 3.9ms on 6.99ms post 3.67ms diff 3.2ms
write align 1048576 pre 4.03ms on 7.09ms post 4.07ms diff 3.04ms
write align 524288 pre 4.04ms on 7.26ms post 4.16ms diff 3.16ms
write align 262144 pre 3.8ms on 7.26ms post 4.06ms diff 3.33ms
write align 131072 pre 4.05ms on 7.25ms post 4.18ms diff 3.14ms
write align 65536 pre 4.02ms on 7.22ms post 4.14ms diff 3.14ms
write align 32768 pre 4ms on 7.07ms post 3.95ms diff 3.1ms
write align 16384 pre 3.66ms on 106ms post 3.4ms diff 102ms
write align 8192 pre 3.56ms on 106ms post 3.36ms diff 103ms
write align 4096 pre 3.61ms on 4.1ms post 4.35ms diff 117µs
# ./flashbench -A -b 4096 /dev/block/mmcblk0p9
write align 8388608 pre 3.64ms on 6.95ms post 3.96ms diff 3.15ms
write align 4194304 pre 3.65ms on 6.56ms post 3.66ms diff 2.9ms
write align 2097152 pre 3.89ms on 6.79ms post 3.66ms diff 3.01ms
write align 1048576 pre 3.88ms on 6.88ms post 3.95ms diff 2.97ms
write align 524288 pre 3.72ms on 6.97ms post 3.93ms diff 3.15ms
write align 262144 pre 3.89ms on 6.93ms post 3.95ms diff 3.01ms
write align 131072 pre 3.9ms on 6.98ms post 3.96ms diff 3.05ms
write align 65536 pre 3.89ms on 6.97ms post 3.96ms diff 3.04ms
write align 32768 pre 3.89ms on 6.97ms post 3.96ms diff 3.04ms
write align 16384 pre 3.74ms on 114ms post 4.05ms diff 110ms
write align 8192 pre 4.25ms on 115ms post 4.8ms diff 110ms
# ./flashbench -A -b 8192 /dev/block/mmcblk0p9
write align 8388608 pre 3.84ms on 7.53ms post 4.29ms diff 3.47ms
write align 4194304 pre 3.58ms on 6.54ms post 3.6ms diff 2.95ms
write align 2097152 pre 4.12ms on 7.27ms post 3.87ms diff 3.28ms
write align 1048576 pre 4.14ms on 7.49ms post 4.24ms diff 3.3ms
write align 524288 pre 4.12ms on 7.46ms post 4.23ms diff 3.29ms
write align 262144 pre 4.14ms on 7.45ms post 3.97ms diff 3.4ms
write align 131072 pre 3.89ms on 7.43ms post 4.24ms diff 3.37ms
write align 65536 pre 4.11ms on 7.46ms post 4.24ms diff 3.29ms
write align 32768 pre 4.15ms on 7.45ms post 4.25ms diff 3.25ms
write align 16384 pre 4.24ms on 96.1ms post 3.83ms diff 92.1ms
I thought the following was interesting. I ran it to see the long
access time go away, since it would end up being a 16K write straddling
an 8K boundary, but I don't understand the pre and post results at all.
# ./flashbench -A -b 16384 /dev/block/mmcblk0p9
write align 8388608 pre 121ms on 7.76ms post 116ms diff -110845
write align 4194304 pre 129ms on 7.57ms post 115ms diff -114863
write align 2097152 pre 121ms on 7.78ms post 123ms diff -114318
write align 1048576 pre 131ms on 7.74ms post 106ms diff -110856
write align 524288 pre 131ms on 7.58ms post 116ms diff -115926
write align 262144 pre 131ms on 7.55ms post 115ms diff -115591
write align 131072 pre 131ms on 7.54ms post 116ms diff -115617
write align 65536 pre 131ms on 7.54ms post 115ms diff -115579
write align 32768 pre 125ms on 6.89ms post 116ms diff -113408
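As a reading aid for the tables in this mail: judging purely from the numbers, the diff column looks like the "on" time minus the mean of "pre" and "post". That relation is inferred from the rows above, not taken from the flashbench source, so treat the sketch below as an assumption being checked against the data.

```python
def align_diff(pre_ms, on_ms, post_ms):
    """Assumed definition of the diff column: on minus the mean of pre and post."""
    return on_ms - (pre_ms + post_ms) / 2.0


# (pre, on, post, reported diff), all in ms, copied from rows in this mail.
ROWS = [
    (3.59, 6.54, 3.65, 2.92),   # -b 1024, align 8388608, page_size 0
    (3.68, 107.0, 3.51, 103.0), # -b 1024, align 16384, page_size 0
    (3.66, 6.84, 3.64, 3.19),   # -b 1024, align 8192, page_size 8192
]

if __name__ == "__main__":
    for pre, on, post, reported in ROWS:
        print(f"computed {align_diff(pre, on, post):.2f} ms, reported {reported} ms")
    # First row of the -b 16384 run: pre 121ms, on 7.76ms, post 116ms.
    print(f"-b 16384: computed {align_diff(121.0, 7.76, 116.0):.2f} ms")
```

Under that reading, the large negative diff values in the -b 16384 run come from pre and post being slow (over 100ms each) while the "on" access stays around 7.5ms.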
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-17 15:47 ` Arnd Bergmann
@ 2011-02-20 11:27 ` Andrei Warkentin
-1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-20 11:27 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc
On Thu, Feb 17, 2011 at 9:47 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> I think I'd try to reduce the number of sysfs files needed for this.
> What are the values you would typically set here?
>
> My feeling is that separating unaligned page writes from full pages
> or multiples of pages could always be beneficial for all cards, or at
> least harmless, but that will require more measurements.
> Whether to do the reliable write or not could be a simple flag
> if the numbers are the same.
I thought about this some more, and I realized it would be ugly if
everybody added enable_workaround_sec_start/enable_workaround_sec_end
for every novel idea of working around some issue with
performance/reliability on mmc/sd cards.
What about letting the user/embedder create policies for how certain
accesses are done? That way you provide runtime-accessible building
blocks for tuning the mmc block layer, with one interface to
manipulate (and combine) multiple workarounds, while catching
conflicts and without forcing a specific policy in code.
Essentially under /sys/block/mmcblk0/device you have an attribute
called "policies". Example:
# echo mypol0 > /sys/block/mmcblk0/device/policies
# ls /sys/block/mmcblk0/device/mypol0
debug
delete
start_block
end_block
access_size_low
access_size_high
write_policy
erase_policy
read_policy
# cat /sys/block/mmcblk0/device/mypol0/write_policy
Current: none
0x00000001: Split unaligned writes across page_size
0x00000002: Split writes into page_size chunks and write using reliable writes
0x00000004: Use reliable writes for WRITE_META blocks.
# cat /sys/block/mmcblk0/device/mypol0/erase_policy
Current: none
0x00000001: Use secure erase.
# echo 1 > delete
# Policy is deleted.
The policies are all stored in an rb-tree. The first order of business
inside mmc_blk_issue_rw_rq/mmc_blk_issue_* is to fetch an existing
policy given the access type and block start/end (which together tell
where the access is going and how large it is). Later, it's
that policy information which controls how the request is translated
into MMC commands. I'm almost done with a prototype.
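To make the lookup step concrete, here is a rough sketch in Python rather than kernel C. It is only an illustration under stated assumptions: a sorted list with bisect stands in for the rb-tree, policy ranges are assumed not to overlap, the access type is left out for brevity, and the names Policy, PolicyTable, add and find are made up for this example. The field names mirror the sysfs attributes listed above.

```python
import bisect
from dataclasses import dataclass

# Bit values copied from the write_policy listing above.
WRITE_SPLIT_UNALIGNED = 0x1
WRITE_SPLIT_RELIABLE = 0x2
WRITE_META_RELIABLE = 0x4


@dataclass
class Policy:
    name: str
    start_block: int          # first block covered (inclusive)
    end_block: int            # last block covered (inclusive)
    access_size_low: int      # smallest access, in blocks, that matches
    access_size_high: int     # largest access, in blocks, that matches
    write_policy: int = 0


class PolicyTable:
    """Policies ordered by start_block; ranges must not overlap."""

    def __init__(self):
        self._starts = []     # sorted start_block values (hypothetical index)
        self._policies = []   # policies in the same order as _starts

    def add(self, pol):
        i = bisect.bisect_left(self._starts, pol.start_block)
        # Catch conflicts with the neighbours on either side.
        if i > 0 and self._policies[i - 1].end_block >= pol.start_block:
            raise ValueError("conflicts with " + self._policies[i - 1].name)
        if i < len(self._policies) and self._policies[i].start_block <= pol.end_block:
            raise ValueError("conflicts with " + self._policies[i].name)
        self._starts.insert(i, pol.start_block)
        self._policies.insert(i, pol)

    def find(self, start, nblocks):
        """Return the policy covering the whole access, or None."""
        i = bisect.bisect_right(self._starts, start) - 1
        if i < 0:
            return None
        pol = self._policies[i]
        if start + nblocks - 1 > pol.end_block:
            return None       # access runs past the end of the policy range
        if not (pol.access_size_low <= nblocks <= pol.access_size_high):
            return None       # access size is outside the policy's window
        return pol
```

An access that matches no policy would simply take the default path, and the write_policy bits of a matching policy would decide how the request gets split up.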
I noticed that all sysfs attributes are managed by code under
core/mmc.c and core/sd.c, duplicating where necessary. I think some of
the new block-related settings like page_size (or policies) are
generic enough that they should live in the card/block code. How about
putting all future sysfs block related things into block-sysfs.c?
Thanks,
A
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-20 11:27 ` Andrei Warkentin
@ 2011-02-20 14:39 ` Arnd Bergmann
-1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-20 14:39 UTC (permalink / raw)
To: linux-arm-kernel, linux-fsdevel
Cc: Andrei Warkentin, Linus Walleij, linux-mmc
[adding linux-fsdevel to Cc, see http://lwn.net/Articles/428941/ and
http://comments.gmane.org/gmane.linux.ports.arm.kernel/105607 for more
on this discussion.]
On Sunday 20 February 2011 12:27:39 Andrei Warkentin wrote:
> On Thu, Feb 17, 2011 at 9:47 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> > I think I'd try to reduce the number of sysfs files needed for this.
> > What are the values you would typically set here?
> >
> > My feeling is that separating unaligned page writes from full pages
> > or multiples of pages could always be beneficial for all cards, or at
> > least harmless, but that will require more measurements.
> > Whether to do the reliable write or not could be a simple flag
> > if the numbers are the same.
>
> I thought about this some more, and I realized it would be ugly if
> everybody added enable_workaround_sec_start/enable_workaround_sec_end
> for every novel idea of working around some issue with
> performance/reliability on mmc/sd cards.
>
> What about letting the user/embedder create policies for how certain
> accesses are done? That way you give runtime-accessible
> blocks for tuning mmc block layer while having one interface to
> manipulate (and combine) multiple workarounds, all the while catching
> conflicts and
> without forcing specific policy in code.
>
> Essentially under /sys/block/mmcblk0/device you have an attribute
> called "policies". Example:
>
> # echo mypol0 > /sys/block/mmcblk0/device/policies
> # ls /sys/block/mmcblk0/device/mypol0
> debug
> delete
> start_block
> end_block
> access_size_low
> access_size_high
> write_policy
> erase_policy
> read_policy
> # cat /sys/block/mmcblk0/device/mypol0/write_policy
> Current: none
> 0x00000001: Split unaligned writes across page_size
> 0x00000002: Split writes into page_size chunks and write using reliable writes
> 0x00000004: Use reliable writes for WRITE_META blocks.
> # cat /sys/block/mmcblk0/device/mypol0/erase_policy
> Current: none
> 0x00000001: Use secure erase.
> # echo 1 > delete
> # Policy is deleted.
>
> The policies are all stored in a rb-tree. First order of business
> inside mmc_blk_issue_rw_rq/mmc_blk_issue_* is to fetch an existing
> policy given the access type and block start/end (which both tells
> where the access is going and the size of the access). Later, it's
> that policy information which controls how the request is translated
> into MMC commands. I'm almost done with a prototype.
I think it's good to discuss all the options, but my feeling is that
we should not add so much complexity at the interface level, because
we will never be able to change all that again. In general, sysfs
files should contain simple values that are self-descriptive (a simple
number or one word), and should have no side-effects (unlike the delete
or the policies attributes you describe).
The behavior of the Toshiba chip is peculiar enough to justify having
some workarounds for it, including run-time selected ones, but I'm
looking for something much simpler. I'd certainly be interested in
the patch you come up with and any performance results, but I don't
think it can be merged like that.
In the end, Chris will have to make the decision on mmc patches of
course -- I'm just trying to contribute experience from other subsystems.
What I see as a more promising approach is to add the tunables
to attributes of the CFQ I/O scheduler once we know what we want.
This will allow doing the same optimizations to non-MMC devices such
as USB sticks or CF/IDE cards without reimplementing it in other
subsystems, and give more control over the individual requests than
the MMC layer has.
E.g. the I/O scheduler can also make sure that we always submit all
blocks from the start of one erase unit (e.g. 4 MB) to the end, but
not try to merge requests across erase unit boundaries. It can
also try to group the requests in aligned power-of-two sized chunks
rather than merging as many sectors as possible up to the maximum
request size, ignoring the alignment.
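As a rough illustration of the grouping idea, the following Python sketch cuts one contiguous byte range into chunks that are each a power of two in size, aligned to their own size, and never cross an erase unit boundary. It is not taken from any I/O scheduler; split_request and its parameters are hypothetical names, and the 4 MB erase unit is just the example value from above (it has to be a power of two for this to work).

```python
def split_request(start, length, erase_unit=4 * 1024 * 1024, min_chunk=512):
    """Split [start, start + length) into aligned power-of-two chunks.

    Offsets and sizes are in bytes and must be multiples of min_chunk.
    Returns a list of (offset, size) tuples.
    """
    if start % min_chunk or length % min_chunk:
        raise ValueError("request must be sector aligned")
    chunks = []
    end = start + length
    while start < end:
        # Largest power of two that the current offset is aligned to,
        # capped at the erase unit size (offset 0 is aligned to anything).
        size = start & -start if start else erase_unit
        size = min(size, erase_unit)
        # Shrink until the chunk fits into what is left of the request.
        while size > end - start:
            size //= 2
        chunks.append((start, size))
        start += size
    return chunks
```

For example, a 24 KB request starting 8 KB below a 4 MB boundary comes out as one 8 KB chunk up to the boundary and one 16 KB chunk after it, so nothing straddles the erase unit.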
Arnd
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-20 4:39 ` Andrei Warkentin
@ 2011-02-20 15:03 ` Arnd Bergmann
-1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-20 15:03 UTC (permalink / raw)
To: linux-arm-kernel; +Cc: Andrei Warkentin, Linus Walleij, linux-mmc
On Sunday 20 February 2011 05:39:06 Andrei Warkentin wrote:
> Actually it would be a good idea to also bail/warn if you do the au
> test with more open au's than the size of the passed device allows,
> since it'll just wrap around and skew the results.
Yes, that's a bug. I never noticed because all the devices I tested
have much more space than the test can possibly exercise. I'll
fix it tomorrow.
> > Right, you should try larger values for --open-au-nr here. It's at
> > least a good sign that the drive can do random access inside a segment
> > and that it can have at least 4 segments open. This is much better
> > than I expected from your descriptions at first.
>
> Actually the Toshiba one seems to have 7 AUs if I interpret this correctly.
> ^C
> # ./flashbench -O -0 6 -b 512 /dev/block/mmcblk0p9
> 4MiB 5.91M/s
> 2MiB 8.84M/s
> 1MiB 10.8M/s
> 512KiB 13M/s
> 256KiB 13.6M/s
>
> ^C
> # ./flashbench -O -0 7 -b 512 /dev/block/mmcblk0p9
> 4MiB 6.32M/s
> 2MiB 8.63M/s
> 1MiB 10.5M/s
> 512KiB 13.2M/s
> 256KiB 13M/s
> 128KiB 12.3M/s
> ^C
> # ./flashbench -O -0 8 -b 512 /dev/block/mmcblk0p9
> 4MiB 6.65M/s
> 2MiB 7.02M/s
> 1MiB 6.36M/s
> 512KiB 3.17M/s
> 256KiB 1.53M/s
Yes, very good. I've never seen 7, but I've seen all other numbers
between 1 and 8 ;-).
> The Sandisk one has 20 AUs.
>
> # ./flashbench -O -0 20 -b 512 /dev/block/mmcblk0p9
> 4MiB 11.3M/s
> 2MiB 12.8M/s
> 1MiB 9.87M/s
> 512KiB 9.97M/s
> 256KiB 9.13M/s
> 128KiB 8.05M/s
> ^C
> # ./flashbench -O -0 50 -b 512 /dev/block/mmcblk0p9
> 4MiB 7.19M/s
> ^C
> # ./flashbench -O -0 2 -b 512 /dev/block/mmcblk0p9
> ^C
> # ./flashbench -O -0 22 -b 512 /dev/block/mmcblk0p9
> 4MiB 11.6M/s
> 2MiB 12.3M/s
> 1MiB 5.13M/s
> 512KiB 2.57M/s
> 256KiB 1.59M/s
> 128KiB 1.16M/s
> 64KiB 776K/s
> ^C
> # ./flashbench -O -0 21 -b 512 /dev/block/mmcblk0p9
> 4MiB 11.2M/s
> 2MiB 12.4M/s
> 1MiB 4.65M/s
> 512KiB 1.95M/s
> 256KiB 955K/s
20 is a lot, more than any other device I've tested, but that's
good. Sandisk keeps impressing me ;-)
Are you sure you have the correct allocation unit size for
this device and that you don't run into the wrap-around bug
you mention above?
If it indeed uses 4 MB allocation units, flashbench will show
only 10 open segments when run with --erasesize=$[8*1024*1024],
but 20 open segments when run with --erasesize=$[2*1024*1024].
From your flashbench -a run, I would guess that it uses
8 MB allocation units, although the data is not 100% conclusive
there.
> > However, the drop from 32 KB to 16 KB in performance is horrifying
> > for the Toshiba drive, it's clear that this one does not like
> > to be accessed smaller than 32 KB at a time, an obvious optimization
> > for FAT32 with 32 KB clusters. How does this change with your
> > kernel patches?
>
> Since the only performance-increasing patch here would be just the one
> that splits unaligned accesses, I wouldn't expect any improvements for
> page-aligned accesses < 32KB. As you can see here...
Ok.
> > For the sandisk drive, it's funny how it is consistently faster
> > doing random access than linear access. I don't think I've seem that
> > before. It does seem to have some cache for linear access using
> > smaller than 16 KB, and can probably combine them when it's only
> > writing to a single segment.
>
> Yes, that is pretty interesting. Smaller than 16K? Not smaller than
> 32K? I wonder what it is doing...
My interpretation is that it uses 16 KB pages, but can do two page-sized
writes in a single access (multi-plane write). Anything smaller than
a page goes to a temporary buffer first (like the Toshiba chip), but
gets flushed when the next one is not contiguous. If you manage to fill
the entire 16 KB page using small contiguous writes, it can do a single
efficient write access instead.
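This interpretation can be written down as a toy model. The Python sketch below is only a guess at the behaviour described here, not anything known about the actual controller: it assumes a single buffer, a 16 KB page, and that a partially filled page is flushed as soon as a non-contiguous write arrives. The name count_flushes is made up for the example.

```python
PAGE_SIZE = 16 * 1024  # assumed page size, to be confirmed by measurement


def count_flushes(writes, page_size=PAGE_SIZE):
    """Return (full_page_writes, partial_flushes) for a list of (offset, length).

    Data accumulates in one buffer for as long as each write starts where
    the previous one ended. A completely filled, aligned page counts as one
    efficient write; anything else that has to leave the buffer counts as a
    partial flush.
    """
    full = partial = 0
    start = end = 0              # buffered byte range; empty when start == end
    for offset, length in writes:
        if start != end and offset != end:
            partial += 1         # next write is not contiguous: flush leftover
            start = end
        if start == end:
            start = offset
        end = offset + length
        # Drain every page whose end the buffered range has now reached.
        while start != end:
            page_end = (start // page_size + 1) * page_size
            if end < page_end:
                break
            if start % page_size == 0:
                full += 1        # whole page assembled in the buffer
            else:
                partial += 1     # only the tail of a page was buffered
            start = page_end
    if start != end:
        partial += 1             # final flush of whatever is left
    return full, partial
```

In this model, four contiguous 4 KB writes cost one full-page write, while four 4 KB writes scattered over different pages cost four partial flushes, which is the kind of difference the measurements hint at.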
To confirm that 16 KB is the page size, you can try
flashbench -s --scatter-span=1 --scatter-order=10 -o plot.data \
/dev/mmcblk1 -c 32 --blocksize=16384
gnuplot -p -e 'plot "plot.data" '
On most MLC flashes, this will show a pattern alternating between slow
and fast pages like the one from https://lwn.net/Articles/428836/
Arnd
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-20 5:56 ` Andrei Warkentin
@ 2011-02-20 15:23 ` Arnd Bergmann
-1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-20 15:23 UTC (permalink / raw)
To: Andrei Warkentin; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc
On Sunday 20 February 2011 06:56:39 Andrei Warkentin wrote:
> On Sat, Feb 19, 2011 at 5:20 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> > The numbers you see here are taken over multiple runs. Do you see a lot
> > of fluctuation when doing this with --count=1?
> >
>
> Yep. Quite a bit.
>
> # ./flashbench -c 1 -A -b 1024 /dev/block/mmcblk0p9
> write align 8388608 pre 4.52ms on 7.58ms post 3.93ms diff 3.36ms
> write align 4194304 pre 5.97ms on 8.69ms post 4.36ms diff 3.53ms
> write align 2097152 pre 3.57ms on 7.96ms post 4.6ms diff 3.88ms
> write align 1048576 pre 5.33ms on 27.4ms post 4.88ms diff 22.3ms
> write align 524288 pre 49.3ms on 31.4ms post 14.9ms diff -679265
> write align 262144 pre 39.7ms on 38.3ms post 5.27ms diff 15.8ms
> write align 131072 pre 33.8ms on 45.4ms post 5.26ms diff 25.9ms
> write align 65536 pre 34.4ms on 40.9ms post 3.3ms diff 22.1ms
> write align 32768 pre 30.2ms on 44.8ms post 5.13ms diff 27.1ms
> write align 16384 pre 44.5ms on 5.05ms post 33.3ms diff -338542
> write align 8192 pre 25.5ms on 70.6ms post 25.3ms diff 45.2ms
> write align 4096 pre 4.89ms on 4.47ms post 5.29ms diff -623390
> write align 2048 pre 4.88ms on 4.89ms post 5.2ms diff -155781
> # ./flashbench -c 1 -A -b 1024 /dev/block/mmcblk0p9
> write align 8388608 pre 4.68ms on 9.06ms post 5.14ms diff 4.15ms
> write align 4194304 pre 4.37ms on 7.49ms post 4.59ms diff 3.01ms
> write align 2097152 pre 23.7ms on 1.9ms post 14.8ms diff -173218
> write align 1048576 pre 14.8ms on 19.9ms post 4.75ms diff 10.2ms
> write align 524288 pre 20.2ms on 24.9ms post 10.7ms diff 9.46ms
> write align 262144 pre 20.2ms on 3.01ms post 20.1ms diff -171062
> write align 131072 pre 25.9ms on 24.9ms post 9.85ms diff 7.06ms
> write align 65536 pre 15.5ms on 30.3ms post 2.95ms diff 21.1ms
> write align 32768 pre 27.3ms on 19.1ms post 5.86ms diff 2.5ms
> write align 16384 pre 25.4ms on 55.9ms post 12.7ms diff 36.9ms
> write align 8192 pre 4.8ms on 102ms post 9.47ms diff 94.8ms
> write align 4096 pre 4.92ms on 5.16ms post 4.98ms diff 207µs
> write align 2048 pre 4.64ms on 4.92ms post 5.45ms diff -121860
> # ./flashbench -c 1 -A -b 1024 /dev/block/mmcblk0p9
> write align 8388608 pre 15.8ms on 9.39ms post 4.68ms diff -854295
> write align 4194304 pre 4.76ms on 7.54ms post 3.82ms diff 3.24ms
> write align 2097152 pre 19.9ms on 9.73ms post 4.44ms diff -244517
> write align 1048576 pre 14.5ms on 19.1ms post 5.21ms diff 9.23ms
> write align 524288 pre 24.9ms on 29ms post 5.89ms diff 13.6ms
> write align 262144 pre 24.9ms on 2.41ms post 20.8ms diff -204328
> write align 131072 pre 25.6ms on 30ms post 4.84ms diff 14.8ms
> write align 65536 pre 26.4ms on 24.4ms post 6.16ms diff 8.12ms
> write align 32768 pre 15ms on 30.6ms post 15.4ms diff 15.4ms
> write align 16384 pre 16.1ms on 45.4ms post 16.5ms diff 29.1ms
> write align 8192 pre 5.88ms on 107ms post 5.45ms diff 101ms
> write align 4096 pre 5.17ms on 5.78ms post 4.83ms diff 778µs
> write align 2048 pre 3.99ms on 5.27ms post 3.97ms diff 1.29ms
> # ./flashbench -c 1 -A -b 1024 /dev/block/mmcblk0p9
> write align 8388608 pre 16.1ms on 8.37ms post 5.44ms diff -241222
> write align 4194304 pre 4.07ms on 7.27ms post 3.89ms diff 3.29ms
> write align 2097152 pre 24.2ms on 18.5ms post 5.63ms diff 3.59ms
> write align 1048576 pre 4.08ms on 18.9ms post 5.46ms diff 14.1ms
> write align 524288 pre 25.1ms on 28ms post 14.6ms diff 8.13ms
> write align 262144 pre 15.8ms on 30ms post 5.4ms diff 19.4ms
> write align 131072 pre 24.7ms on 30.8ms post 4.43ms diff 16.2ms
> write align 65536 pre 5ms on 40.5ms post 5.95ms diff 35.1ms
> write align 32768 pre 24.7ms on 30.6ms post 4.92ms diff 15.8ms
> write align 16384 pre 25.2ms on 132ms post 10.2ms diff 114ms
> write align 8192 pre 7.64ms on 111ms post 9.18ms diff 102ms
> write align 4096 pre 5.11ms on 3.92ms post 5.4ms diff -134159
> write align 2048 pre 3.92ms on 4.41ms post 4.51ms diff 196µs
Every value is the average of eight measurements, so there are probably
some that include the 100ms garbage collection, and others that don't.
I'm more confused about this now than I was before.
> > Also, does the same happen with other blocksizes, e.g. 4096 or 8192, passed
> > to flashbench?
>
> # echo 0 > /sys/block/mmcblk0/device/page_size
> # ./flashbench -A -b 1024 /dev/block/mmcblk0p9
> write align 65536 pre 3.33ms on 6.57ms post 3.65ms diff 3.08ms
> write align 32768 pre 3.68ms on 6.6ms post 3.7ms diff 2.91ms
> write align 16384 pre 3.64ms on 97.6ms post 3.26ms diff 94.2ms
> write align 8192 pre 3.49ms on 115ms post 3.62ms diff 112ms
> write align 4096 pre 3.91ms on 3.91ms post 3.9ms diff 360ns
> write align 2048 pre 3.92ms on 3.92ms post 3.92ms diff -1374ns
> # ./flashbench -A -b 2048 /dev/block/mmcblk0p9
> write align 65536 pre 4.02ms on 7.22ms post 4.14ms diff 3.14ms
> write align 32768 pre 4ms on 7.07ms post 3.95ms diff 3.1ms
> write align 16384 pre 3.66ms on 106ms post 3.4ms diff 102ms
> write align 8192 pre 3.56ms on 106ms post 3.36ms diff 103ms
> write align 4096 pre 3.61ms on 4.1ms post 4.35ms diff 117µs
> # ./flashbench -A -b 4096 /dev/block/mmcblk0p9
> write align 65536 pre 3.89ms on 6.97ms post 3.96ms diff 3.04ms
> write align 32768 pre 3.89ms on 6.97ms post 3.96ms diff 3.04ms
> write align 16384 pre 3.74ms on 114ms post 4.05ms diff 110ms
> write align 8192 pre 4.25ms on 115ms post 4.8ms diff 110ms
> # ./flashbench -A -b 8192 /dev/block/mmcblk0p9
> write align 65536 pre 4.11ms on 7.46ms post 4.24ms diff 3.29ms
> write align 32768 pre 4.15ms on 7.45ms post 4.25ms diff 3.25ms
> write align 16384 pre 4.24ms on 96.1ms post 3.83ms diff 92.1ms
Ok, that is very consistent then at least.
> The following I thought this was interesting. I did it to see the big
> time go away, since it would end up being a 16K write straddling an 8K
> boundary, but the pre and post results I don't understand at all.
>
> # ./flashbench -A -b 16384 /dev/block/mmcblk0p9
> write align 8388608 pre 121ms on 7.76ms post 116ms diff -110845
> write align 4194304 pre 129ms on 7.57ms post 115ms diff -114863
> write align 2097152 pre 121ms on 7.78ms post 123ms diff -114318
> write align 1048576 pre 131ms on 7.74ms post 106ms diff -110856
> write align 524288 pre 131ms on 7.58ms post 116ms diff -115926
> write align 262144 pre 131ms on 7.55ms post 115ms diff -115591
> write align 131072 pre 131ms on 7.54ms post 116ms diff -115617
> write align 65536 pre 131ms on 7.54ms post 115ms diff -115579
> write align 32768 pre 125ms on 6.89ms post 116ms diff -113408
The description of the test case is probably suboptimal. What this does
is 32 KB accesses, with 32 KB alignment in the pre and post cases, but 16 KB
alignment in the "on" case. The idea here is that it should never do
any access with less than "--blocksize" alignment.
This is what I think happens:
Since the partition is over 64 MB in size and the card can have seven 4 MB
allocation units open, writing to 8 locations on the drive separated by 8 MB
causes it to do garbage collection all the time for accesses of 32 KB and
larger. However, the "on" measurement is only 16 KB aligned, so it goes into
Toshiba's buffer A for small writes and does not hit the garbage collection
all the time, so it ends up being a lot faster.
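To spell out the access pattern, here is a small Python sketch. The offsets are an assumption based on the description above (accesses of twice the block size, with "on" shifted by one block so that it straddles the boundary), not a copy of the flashbench source, and the function names are made up. The diff formula is inferred from the numbers in the tables: "on" minus the mean of "pre" and "post".

```python
def align_test_offsets(align, blocksize):
    """Return {name: (offset, size)} for the three accesses around 'align'.

    Assumed layout: every access is 2 * blocksize long; pre and post sit on
    a 2 * blocksize boundary, while on starts one block below 'align' and
    is therefore only blocksize aligned.
    """
    size = 2 * blocksize
    return {
        "pre": (align - size, size),
        "on": (align - blocksize, size),
        "post": (align, size),
    }


def diff_ms(pre, on, post):
    """Extra time of the straddling access over the mean of its neighbours."""
    return on - (pre + post) / 2
```

With -b 16384 and an alignment of 32768 this gives 32 KB accesses at offsets 0, 16384 and 32768, and the first row of the -b 1024 table (3.33ms, 6.57ms, 3.65ms) works out to a diff of about 3.08ms, matching the reported value.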
Arnd
^ permalink raw reply [flat|nested] 117+ messages in thread
* MMC quirks relating to performance/lifetime.
@ 2011-02-20 15:23 ` Arnd Bergmann
0 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-20 15:23 UTC (permalink / raw)
To: linux-arm-kernel
On Sunday 20 February 2011 06:56:39 Andrei Warkentin wrote:
> On Sat, Feb 19, 2011 at 5:20 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> > The numbers you see here are taken over multiple runs. Do you see a lot
> > of fluctuation when doing this with --count=1?
> >
>
> Yep. Quite a bit.
>
> # ./flashbench -c 1 -A -b 1024 /dev/block/mmcblk0p9
> write align 8388608 pre 4.52ms on 7.58ms post 3.93ms diff 3.36ms
> write align 4194304 pre 5.97ms on 8.69ms post 4.36ms diff 3.53ms
> write align 2097152 pre 3.57ms on 7.96ms post 4.6ms diff 3.88ms
> write align 1048576 pre 5.33ms on 27.4ms post 4.88ms diff 22.3ms
> write align 524288 pre 49.3ms on 31.4ms post 14.9ms diff -679265
> write align 262144 pre 39.7ms on 38.3ms post 5.27ms diff 15.8ms
> write align 131072 pre 33.8ms on 45.4ms post 5.26ms diff 25.9ms
> write align 65536 pre 34.4ms on 40.9ms post 3.3ms diff 22.1ms
> write align 32768 pre 30.2ms on 44.8ms post 5.13ms diff 27.1ms
> write align 16384 pre 44.5ms on 5.05ms post 33.3ms diff -338542
> write align 8192 pre 25.5ms on 70.6ms post 25.3ms diff 45.2ms
> write align 4096 pre 4.89ms on 4.47ms post 5.29ms diff -623390
> write align 2048 pre 4.88ms on 4.89ms post 5.2ms diff -155781
> # ./flashbench -c 1 -A -b 1024 /dev/block/mmcblk0p9
> write align 8388608 pre 4.68ms on 9.06ms post 5.14ms diff 4.15ms
> write align 4194304 pre 4.37ms on 7.49ms post 4.59ms diff 3.01ms
> write align 2097152 pre 23.7ms on 1.9ms post 14.8ms diff -173218
> write align 1048576 pre 14.8ms on 19.9ms post 4.75ms diff 10.2ms
> write align 524288 pre 20.2ms on 24.9ms post 10.7ms diff 9.46ms
> write align 262144 pre 20.2ms on 3.01ms post 20.1ms diff -171062
> write align 131072 pre 25.9ms on 24.9ms post 9.85ms diff 7.06ms
> write align 65536 pre 15.5ms on 30.3ms post 2.95ms diff 21.1ms
> write align 32768 pre 27.3ms on 19.1ms post 5.86ms diff 2.5ms
> write align 16384 pre 25.4ms on 55.9ms post 12.7ms diff 36.9ms
> write align 8192 pre 4.8ms on 102ms post 9.47ms diff 94.8ms
> write align 4096 pre 4.92ms on 5.16ms post 4.98ms diff 207µs
> write align 2048 pre 4.64ms on 4.92ms post 5.45ms diff -121860
> # ./flashbench -c 1 -A -b 1024 /dev/block/mmcblk0p9
> write align 8388608 pre 15.8ms on 9.39ms post 4.68ms diff -854295
> write align 4194304 pre 4.76ms on 7.54ms post 3.82ms diff 3.24ms
> write align 2097152 pre 19.9ms on 9.73ms post 4.44ms diff -244517
> write align 1048576 pre 14.5ms on 19.1ms post 5.21ms diff 9.23ms
> write align 524288 pre 24.9ms on 29ms post 5.89ms diff 13.6ms
> write align 262144 pre 24.9ms on 2.41ms post 20.8ms diff -204328
> write align 131072 pre 25.6ms on 30ms post 4.84ms diff 14.8ms
> write align 65536 pre 26.4ms on 24.4ms post 6.16ms diff 8.12ms
> write align 32768 pre 15ms on 30.6ms post 15.4ms diff 15.4ms
> write align 16384 pre 16.1ms on 45.4ms post 16.5ms diff 29.1ms
> write align 8192 pre 5.88ms on 107ms post 5.45ms diff 101ms
> write align 4096 pre 5.17ms on 5.78ms post 4.83ms diff 778µs
> write align 2048 pre 3.99ms on 5.27ms post 3.97ms diff 1.29ms
> # ./flashbench -c 1 -A -b 1024 /dev/block/mmcblk0p9
> write align 8388608 pre 16.1ms on 8.37ms post 5.44ms diff -241222
> write align 4194304 pre 4.07ms on 7.27ms post 3.89ms diff 3.29ms
> write align 2097152 pre 24.2ms on 18.5ms post 5.63ms diff 3.59ms
> write align 1048576 pre 4.08ms on 18.9ms post 5.46ms diff 14.1ms
> write align 524288 pre 25.1ms on 28ms post 14.6ms diff 8.13ms
> write align 262144 pre 15.8ms on 30ms post 5.4ms diff 19.4ms
> write align 131072 pre 24.7ms on 30.8ms post 4.43ms diff 16.2ms
> write align 65536 pre 5ms on 40.5ms post 5.95ms diff 35.1ms
> write align 32768 pre 24.7ms on 30.6ms post 4.92ms diff 15.8ms
> write align 16384 pre 25.2ms on 132ms post 10.2ms diff 114ms
> write align 8192 pre 7.64ms on 111ms post 9.18ms diff 102ms
> write align 4096 pre 5.11ms on 3.92ms post 5.4ms diff -134159
> write align 2048 pre 3.92ms on 4.41ms post 4.51ms diff 196µs
Every value is the average of eight measurements, so there are probably
some that include the 100ms garbage collection, and others that don't.
I'm more confused about this now than I was before.
> > Also, does the same happen with other blocksizes, e.g. 4096 or 8192, passed
> > to flashbench?
>
> # echo 0 > /sys/block/mmcblk0/device/page_size
> # ./flashbench -A -b 1024 /dev/block/mmcblk0p9
> write align 65536 pre 3.33ms on 6.57ms post 3.65ms diff 3.08ms
> write align 32768 pre 3.68ms on 6.6ms post 3.7ms diff 2.91ms
> write align 16384 pre 3.64ms on 97.6ms post 3.26ms diff 94.2ms
> write align 8192 pre 3.49ms on 115ms post 3.62ms diff 112ms
> write align 4096 pre 3.91ms on 3.91ms post 3.9ms diff 360ns
> write align 2048 pre 3.92ms on 3.92ms post 3.92ms diff -1374ns
> # ./flashbench -A -b 2048 /dev/block/mmcblk0p9
> write align 65536 pre 4.02ms on 7.22ms post 4.14ms diff 3.14ms
> write align 32768 pre 4ms on 7.07ms post 3.95ms diff 3.1ms
> write align 16384 pre 3.66ms on 106ms post 3.4ms diff 102ms
> write align 8192 pre 3.56ms on 106ms post 3.36ms diff 103ms
> write align 4096 pre 3.61ms on 4.1ms post 4.35ms diff 117µs
> # ./flashbench -A -b 4096 /dev/block/mmcblk0p9
> write align 65536 pre 3.89ms on 6.97ms post 3.96ms diff 3.04ms
> write align 32768 pre 3.89ms on 6.97ms post 3.96ms diff 3.04ms
> write align 16384 pre 3.74ms on 114ms post 4.05ms diff 110ms
> write align 8192 pre 4.25ms on 115ms post 4.8ms diff 110ms
> # ./flashbench -A -b 8192 /dev/block/mmcblk0p9
> write align 65536 pre 4.11ms on 7.46ms post 4.24ms diff 3.29ms
> write align 32768 pre 4.15ms on 7.45ms post 4.25ms diff 3.25ms
> write align 16384 pre 4.24ms on 96.1ms post 3.83ms diff 92.1ms
Ok, that is very consistent then at least.
> I thought the following was interesting. I did it to see the big
> time go away, since it would end up being a 16K write straddling an 8K
> boundary, but I don't understand the pre and post results at all.
>
> # ./flashbench -A -b 16384 /dev/block/mmcblk0p9
> write align 8388608 pre 121ms on 7.76ms post 116ms diff -110845
> write align 4194304 pre 129ms on 7.57ms post 115ms diff -114863
> write align 2097152 pre 121ms on 7.78ms post 123ms diff -114318
> write align 1048576 pre 131ms on 7.74ms post 106ms diff -110856
> write align 524288 pre 131ms on 7.58ms post 116ms diff -115926
> write align 262144 pre 131ms on 7.55ms post 115ms diff -115591
> write align 131072 pre 131ms on 7.54ms post 116ms diff -115617
> write align 65536 pre 131ms on 7.54ms post 115ms diff -115579
> write align 32768 pre 125ms on 6.89ms post 116ms diff -113408
The description of the test case is probably suboptimal. What this does
is 32 KB accesses, with 32 KB alignment in the pre and post case, but 16 KB
alignment in the "on" case. The idea here is that it should never do
any access with less than "--blocksize" alignment.
This is what I think happens:
Since the partition is over 64 MB in size and the card can have seven 4 MB allocation units open,
writing to 8 locations on the drive separated by 8 MB causes it to do garbage collection
all the time for 32KB accesses and larger. However, the "on" measurement is only
16 KB aligned, so it goes into T's buffer A for small writes, and does not hit
the garbage collection all the time, so it ends up being a lot faster.
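The open-allocation-unit argument above can be sketched as a small simulation. This is only an illustration under stated assumptions: the card is modelled as keeping a fixed number of allocation units open and closing the least recently used one when another is needed. The LRU policy, the helper name count_evictions and the MAX_OPEN limit are hypothetical; the real firmware policy is not known from this thread.

```c
#include <stddef.h>

#define MAX_OPEN 16   /* upper bound for this sketch only */

/*
 * Count how many writes force an open allocation unit to be closed
 * (a stand-in for garbage collection). au[] holds the allocation unit
 * number touched by each write. Assumes LRU replacement, which is a
 * guess and not a measured property of any card.
 * Returns 0 if max_open is outside 1..MAX_OPEN.
 */
unsigned count_evictions(const unsigned *au, size_t n, unsigned max_open)
{
    unsigned open[MAX_OPEN];   /* open[0] is the most recently used unit */
    unsigned nopen = 0, evictions = 0;

    if (max_open == 0 || max_open > MAX_OPEN)
        return 0;

    for (size_t i = 0; i < n; i++) {
        unsigned pos = nopen;

        /* Look for the unit among the currently open ones. */
        for (unsigned j = 0; j < nopen; j++) {
            if (open[j] == au[i]) {
                pos = j;
                break;
            }
        }
        if (pos == nopen) {            /* not open yet */
            if (nopen == max_open) {
                evictions++;           /* close the least recently used */
                pos = nopen - 1;
            } else {
                nopen++;               /* use a free slot */
            }
        }
        /* Move the unit to the front of the list. */
        for (unsigned j = pos; j > 0; j--)
            open[j] = open[j - 1];
        open[0] = au[i];
    }
    return evictions;
}
```

Under this model, cycling over 8 units with room for only 7 turns every write after the first seven into an eviction, while cycling over 7 units never evicts.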
Arnd
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-20 15:03 ` Arnd Bergmann
@ 2011-02-22 6:42 ` Andrei Warkentin
-1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-22 6:42 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc
[-- Attachment #1: Type: text/plain, Size: 1409 bytes --]
On Sun, Feb 20, 2011 at 9:03 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>
> From your flashbench -a run, I would guess that it uses
> 8 MB allocation units, although the data is not 100% conclusive
> there.
>
Because the 8MB aligned write time is significantly faster, right?
>
> My interpretation is that it uses 16 KB pages, but can do two page-sized
> writes in a single access (multi-plane write). Anything smaller than
> a page goes to a temporary buffer first (like the Toshiba chip), but
> gets flushed when the next one is not contiguous. If you manage to fill
> the entire 16 KB page using small contiguous writes, it can do a single
> efficient write access instead.
>
> To confirm that 16 KB is the page size, you can try
>
> flashbench -s --scatter-span=1 --scatter-order=10 -o plot.data \
> /dev/mmcblk1 -c 32 --blocksize=16384
> gnuplot -p -e 'plot "plot.data" '
>
> On most MLC flashes, this will show a pattern alternating between slow
> and fast pages like the one from https://lwn.net/Articles/428836/
Cool.
I am attaching some graphs. The 16k sandisk shows the slow and fast
page parallel lines, as does the 8k toshiba (but we knew it for the
toshiba case), but the boundaries are strange for the sandisk case,
and there is an interesting 2 MB variation in the toshiba 8k graph. What
is the correct way to interpret graphs with other block sizes?
A
[-- Attachment #2: scatter_8k_read_ts.png --]
[-- Type: image/png, Size: 11238 bytes --]
[-- Attachment #3: scatter_8k_sandisk.png --]
[-- Type: image/png, Size: 8964 bytes --]
[-- Attachment #4: scatter_16k_sandisk.png --]
[-- Type: image/png, Size: 6853 bytes --]
[-- Attachment #5: scatter_32k_read_ts.png --]
[-- Type: image/png, Size: 9471 bytes --]
[-- Attachment #6: scatter_32k_sandisk.png --]
[-- Type: image/png, Size: 6790 bytes --]
[-- Attachment #7: scatter_16k_read_ts.png --]
[-- Type: image/png, Size: 9040 bytes --]
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-20 15:23 ` Arnd Bergmann
@ 2011-02-22 7:05 ` Andrei Warkentin
-1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-22 7:05 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc
On Sun, Feb 20, 2011 at 9:23 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> The description of the test case is probably suboptimal. What this does
> is 32 KB accesses, with 32 KB alignment in the pre and post case, but 16 KB
> alignment in the "on" case. The idea here is that it should never do
> any access with less than "--blocksize" alignment.
>
Now I feel slightly confused :(.
-b 16384 implies blocksize = 16384, maxalign is 8mb due to count 32,
    ret = time_rw_interval(dev, count, pre, blocksize,
                           align - blocksize, maxalign,
                           do_write); // <-- read 16k at align - 16k with 8mb intervals?
    returnif(ret);

    ret = time_rw_interval(dev, count, on, blocksize,
                           align - blocksize / 2, maxalign,
                           do_write); // <-- read 16k at align - 8k with 8mb intervals?
    returnif(ret);

    ret = time_rw_interval(dev, count, post, blocksize,
                           align, maxalign,
                           do_write); // <-- read 16k at align with 8mb intervals?
    returnif(ret);
I hope I'm not missing something obvious...
> This is what I think happens:
> Since the partition is over 64 MB in size and the card can have seven 4 MB allocation units open,
> writing to 8 locations on the drive separated by 8 MB causes it to do garbage collection
> all the time for 32KB accesses and larger. However, the "on" measurement is only
> 16 KB aligned, so it goes into T's buffer A for small writes, and does not hit
> the garbage collection all the time, so it ends up being a lot faster.
>
Can't go to A. A is 8KB big. Strange...
A
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-20 14:39 ` Arnd Bergmann
@ 2011-02-22 7:46 ` Andrei Warkentin
-1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-22 7:46 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc
On Sun, Feb 20, 2011 at 8:39 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> [adding linux-fsdevel to Cc, see http://lwn.net/Articles/428941/ and
> http://comments.gmane.org/gmane.linux.ports.arm.kernel/105607 for more
> on this discussion.]
>
>
> I think it's good to discuss all the options, but my feeling is that
> we should not add so much complexity at the interface level, because
> we will never be able to change all that again. In general, sysfs
> files should contain simple values that are self-descriptive (a simple
> number or one word), and should have no side-effects (unlike the delete
> or the policies attributes you describe).
>
> The behavior of the Toshiba chip is peculiar enough to justify having
> some workarounds for it, including run-time selected ones, but I'm
> looking for something much simpler. I'd certainly be interested in
> the patch you come up with and any performance results, but I don't
> think it can be merged like that.
>
Sure. The page_align patch is just going to be a single sysfs
attribute. All I need to prove to myself now is the effect for large
unaligned accesses (and show everyone else the data :-)).
> In the end, Chris will have to make the decision on mmc patches of
> course -- I'm just trying to contribute experience from other subsystems.
>
> What I see as a more promising approach is to add the tunables
> to attributes of the CFQ I/O scheduler once we know what we want.
> This will allow doing the same optimizations to non-MMC devices such
> as USB sticks or CF/IDE cards without reimplementing it in other
> subsystems, and give more control over the individual requests than
> the MMC layer has.
>
> E.g. the I/O scheduler can also make sure that we always submit all
> blocks from the start of one erase unit (e.g. 4 MB) to the end, but
> not try to merge requests across erase unit boundaries. It can
> also try to group the requests in aligned power-of-two sized chunks
> rather than merging as many sectors as possible up to the maximum
> request size, ignoring the alignment.
I agree. These are common things that affect any kind of flash
storage, and they belong in the I/O scheduler as simple tunables. I'll
see if I can figure my way around that...
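To make the "don't merge across erase unit boundaries" rule concrete, here is a minimal sketch of the arithmetic in plain C. It is not block layer code: struct chunk and split_at_unit are hypothetical names, and offsets are plain byte counts rather than sectors.

```c
#include <stddef.h>
#include <stdint.h>

struct chunk {
    uint64_t start;   /* byte offset of the chunk */
    uint64_t len;     /* length in bytes */
};

/*
 * Split the range [start, start + len) so that no chunk crosses a
 * multiple of unit (e.g. a 4 MB erase unit). Writes at most max chunks
 * to out[] and returns how many were written. Returns 0 if unit is 0.
 */
size_t split_at_unit(uint64_t start, uint64_t len, uint64_t unit,
                     struct chunk *out, size_t max)
{
    size_t n = 0;

    if (unit == 0)
        return 0;

    while (len > 0 && n < max) {
        uint64_t room = unit - (start % unit);   /* bytes left in this unit */
        uint64_t take = len < room ? len : room;

        out[n].start = start;
        out[n].len = take;
        n++;
        start += take;
        len -= take;
    }
    return n;
}
```

For example, a 64 KB request starting 16 KB before a 4 MB boundary comes out as a 16 KB chunk followed by a 48 KB chunk, while a request that is already contained in one unit comes back unchanged.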
What belongs in the mmc card driver are tunable workarounds for MMC/SD
brokenness. For example - needing to use 8K-split reliable writes to
ensure that a 64KB access doesn't wind up in the 4MB buffer B (so as to
improve lifespan of the card.) But you want a waterline above which
you don't do this anymore, otherwise the overall performance will go
to 0 - i.e. there is a need to balance between performance and
reliability, so the range of access size for which the workaround
works needs to be runtime controlled, as it's potentially different.
Another example (this one is apparently affecting Sandisk) - do
special stuff for block erase, since the card violates spec in that
regard (touch ext_csd instead of argument, I believe). A different
example might be turning on reliable writes for WRITE_META (or all)
blocks for a certain partition (but I just made that up... ).
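The size-window idea for the 8K split could look roughly like the sketch below. The struct and function names are made up for illustration, and the waterline value would be whatever the integrator tunes at runtime; nothing here is taken from an actual patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tunables for one card; names are illustrative only. */
struct split_policy {
    uint32_t chunk_bytes;      /* size of each split write, e.g. 8192 */
    uint32_t waterline_bytes;  /* requests larger than this are left alone */
};

/* Split only requests bigger than one chunk and not above the waterline. */
bool should_split(const struct split_policy *p, uint32_t req_bytes)
{
    return p->chunk_bytes != 0 &&
           req_bytes > p->chunk_bytes &&
           req_bytes <= p->waterline_bytes;
}

/* Number of writes actually issued to the card for one request. */
uint32_t num_writes(const struct split_policy *p, uint32_t req_bytes)
{
    if (!should_split(p, req_bytes))
        return 1;
    /* Round up; 64-bit arithmetic avoids overflow near UINT32_MAX. */
    return (uint32_t)(((uint64_t)req_bytes + p->chunk_bytes - 1) /
                      p->chunk_bytes);
}
```

With chunk_bytes = 8192 and a waterline of, say, 128 KB, a 64 KB request becomes eight 8 KB writes, while a 256 KB request is passed through as a single write.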
So there are things that just should be on (spec brokenness
workarounds), and things that apply only to a subset of accesses (and
thus they are selective at issue_*_rq time), whether it's because of
accessed offset or access size.
I agree that the sysfs method is particularly nasty, and I guess I
didn't have to make a prototype to figure that out :-) (but needed
something similar for selective testing anyway). Nothing else exists
right now that acts in the same way, and nothing really should, as
there is no feedback for manipulating the policies (echo POLICY_ENUM >
policy, if it doesn't stick, then the arguments were wrong, etc).
You could put the entire MMC block policy interface through an API
usable by system integrators - i.e. you would really only care for
tuning the MMC parameters if you're creating a device around an emmc.
Idea (1). One idea is to keep the "policies" from my previous mail.
Policies are registered through platform-specific code. The policies
could be then matched for enabling against a specific block device by
manfid/date/etc at the time of mmc_block_alloc... For removable media
no one would fiddle with the tunable parameters anyway, unless there
was some global database of cards and workarounds and a daemon or some
such to take care of that... Probably don't want to add such baggage
to the kernel.
Idea (2). There is probably no need to overcomplicate. Just add a
platform callback (something like int
(*mmc_platform_block_workaround)(struct request *, struct
mmc_blk_request *)). This will be usable as-is for R/W accesses, and
the discard code will need to be slightly modified.
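A rough shape for Idea (2), written as standalone C so it compiles outside the kernel. Only the callback signature comes from the text above; struct mmc_platform_hooks, mmc_apply_platform_workaround and example_workaround are hypothetical, and the two request structs are left as opaque stand-ins for the kernel types.

```c
#include <stddef.h>

/* Opaque stand-ins for the kernel types; never dereferenced here. */
struct request;
struct mmc_blk_request;

/* Callback signature as proposed in Idea (2). */
typedef int (*mmc_platform_block_workaround_t)(struct request *,
                                               struct mmc_blk_request *);

/* Hypothetical container that platform code would fill in. */
struct mmc_platform_hooks {
    mmc_platform_block_workaround_t block_workaround;   /* may be NULL */
};

/*
 * Hypothetical call site in the block driver: returns 0 when no hook is
 * registered, otherwise whatever the platform hook returns.
 */
int mmc_apply_platform_workaround(const struct mmc_platform_hooks *hooks,
                                  struct request *req,
                                  struct mmc_blk_request *brq)
{
    if (hooks == NULL || hooks->block_workaround == NULL)
        return 0;
    return hooks->block_workaround(req, brq);
}

/* Example platform hook that inspects nothing and reports "handled". */
int example_workaround(struct request *req, struct mmc_blk_request *brq)
{
    (void)req;
    (void)brq;
    return 1;
}
```

What the return value should mean (for instance "request was rewritten" versus "leave it alone") is left open here, as it is in the proposal.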
Do you think there is any need for runtime tuning of the MMC
workarounds (disregarding ones that really belong in the I/O
scheduler)? Should the workarounds be simply platform callbacks, or
should they be something heftier ("policies")?
A
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-22 6:42 ` Andrei Warkentin
@ 2011-02-22 16:42 ` Arnd Bergmann
-1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-22 16:42 UTC (permalink / raw)
To: Andrei Warkentin; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc
On Tuesday 22 February 2011, Andrei Warkentin wrote:
> On Sun, Feb 20, 2011 at 9:03 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > From your flashbench -a run, I would guess that it uses
> > 8 MB allocation units, although the data is not 100% conclusive
> > there.
> >
>
> Because the 8MB aligned write time is significantly faster, right?
I mean because a read spanning an 8 MB boundary is noticeably
slower than one spanning a 4 MB boundary (diff 242µs instead of 187µs),
while below that, the numbers for the 4 MB and 2 MB boundaries
are quite similar.
> I am attaching some graphs. The 16k sandisk shows the slow and fast
> page parallel lines, as does the 8k toshiba (but we knew it for the
> toshiba case), but the boundaries are strange for the sandisk case,
> and there is an interesting 2 MB variation in the toshiba 8k graph. What
> is the correct way to interpret graphs with other block sizes?
Not sure if it's correct, but my interpretation of your output
is this:
In the Toshiba graph, you see parallel lines that show measurements
30µs apart, e.g. 1.06ms and 1.09 ms in the first one. I assume what
you see here are fast and slow pages, respectively. It's a bit hard
to tell in the resolution you have, and it would make sense to zoom
into the picture to see if they are alternating or just random.
The three groups of double lines are probably just some jitter
from the timing of the interrupt controller. If you run with a larger
--count= value, these should become less visible.
The sandisk plot shows some sector ranges that are slower than others,
I'd assume that those are the ones that have been recently written.
The 16KB page plot has parallel lines (again, I'd have to see a
finer resolution plot to see if they are alternating), which the
32KB page plot does not have. I see this as an indication that the
pages are indeed 16KB, and in the 32KB plot the results are just
averaged out.
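The "averaged out" argument is simple arithmetic, sketched below. The function name access_time_spread and the latencies in the note are hypothetical; the point is only that if fast and slow pages alternate, an access covering one page of each kind always costs the same.

```c
#include <stddef.h>

/*
 * Group consecutive pages into accesses of pages_per_access pages, sum
 * the per-page times for each access, and return the difference between
 * the slowest and the fastest access. Trailing pages that do not fill a
 * whole access are ignored. Returns 0.0 if pages_per_access is 0.
 */
double access_time_spread(const double *page_us, size_t npages,
                          size_t pages_per_access)
{
    double min = 0.0, max = 0.0;
    size_t naccess = 0;

    if (pages_per_access == 0)
        return 0.0;

    for (size_t i = 0; i + pages_per_access <= npages;
         i += pages_per_access) {
        double t = 0.0;

        for (size_t j = 0; j < pages_per_access; j++)
            t += page_us[i + j];
        if (naccess == 0 || t < min)
            min = t;
        if (naccess == 0 || t > max)
            max = t;
        naccess++;
    }
    return max - min;
}
```

With made-up page times alternating between 100 and 130 (in µs), one-page accesses show a spread of 30, which would appear as two parallel lines in a scatter plot, while two-page accesses show a spread of 0, a single line.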
Arnd
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-22 7:05 ` Andrei Warkentin
@ 2011-02-22 16:49 ` Arnd Bergmann
-1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-22 16:49 UTC (permalink / raw)
To: Andrei Warkentin; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc
On Tuesday 22 February 2011, Andrei Warkentin wrote:
> On Sun, Feb 20, 2011 at 9:23 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> > The description of the test case is probably suboptimal. What this does
> > is 32 KB accesses, with 32 KB alignment in the pre and post case, but 16 KB
> > alignment in the "on" case. The idea here is that it should never do
> > any access with less than "--blocksize" alignment.
> >
>
> Now I feel slightly confused :(.
>
> -b 16384 implies blocksize = 16384, maxalign is 8mb due to count 32,
>
>     ret = time_rw_interval(dev, count, pre, blocksize,
>                            align - blocksize, maxalign,
>                            do_write); // <-- read 16k at align - 16k with 8mb intervals?
>     returnif(ret);
>
>     ret = time_rw_interval(dev, count, on, blocksize,
>                            align - blocksize / 2, maxalign,
>                            do_write); // <-- read 16k at align - 8k with 8mb intervals?
>     returnif(ret);
>
>     ret = time_rw_interval(dev, count, post, blocksize,
>                            align, maxalign,
>                            do_write); // <-- read 16k at align with 8mb intervals?
>     returnif(ret);
>
> I hope I'm not missing something obvious...
No, you are absolutely right. I think I changed this once and no longer
remembered what the final version did.
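For reference, the three offsets from the quoted calls can be worked out with a few lines of C. This only mirrors the arithmetic in the quoted snippet (align - blocksize, align - blocksize / 2, align); the struct and function names are made up for the illustration and are not part of flashbench.

```c
#include <stdint.h>

struct align_offsets {
    int64_t pre;    /* access ends exactly at the boundary */
    int64_t on;     /* access straddles the boundary */
    int64_t post;   /* access starts exactly at the boundary */
};

/* Start offsets of the three accesses, as in the quoted calls. */
struct align_offsets align_test_offsets(int64_t align, int64_t blocksize)
{
    struct align_offsets o;

    o.pre = align - blocksize;
    o.on = align - blocksize / 2;
    o.post = align;
    return o;
}

/*
 * Nonzero if the access [off, off + len) touches bytes on both sides of
 * a multiple of boundary. Assumes off >= 0, len > 0 and boundary > 0.
 */
int crosses_boundary(int64_t off, int64_t len, int64_t boundary)
{
    return off / boundary != (off + len - 1) / boundary;
}
```

With -b 16384 and align = 32768 this gives pre = 16384, on = 24576 and post = 32768, so only the "on" access crosses the 32 KB boundary, and it does so with 8 KB on each side.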
Arnd
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-22 7:46 ` Andrei Warkentin
@ 2011-02-22 17:00 ` Arnd Bergmann
-1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-22 17:00 UTC (permalink / raw)
To: Andrei Warkentin
Cc: linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc
On Tuesday 22 February 2011, Andrei Warkentin wrote:
> On Sun, Feb 20, 2011 at 8:39 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> > E.g. the I/O scheduler can also make sure that we always submit all
> > blocks from the start of one erase unit (e.g. 4 MB) to the end, but
> > not try to merge requests across erase unit boundaries. It can
> > also try to group the requests in aligned power-of-two sized chunks
> > rather than merging as many sectors as possible up to the maximum
> > request size, ignoring the alignment.
>
> I agree. These are common things that affect any kind of flash
> storage, and it belongs in the I/O scheduler as simple tuneables. I'll
> see if I can figure my way around that...
>
> What belongs in mmc card driver are tunable workarounds for MMC/SD
> brokenness. For example - needing to use 8K-split reliable writes to
> ensure that a 64KB access doesn't wind up in the 4MB buffer B (so as to
> improve lifespan of the card.) But you want a waterline above which
> you don't do this anymore, otherwise the overall performance will go
> to 0 - i.e. there is a need to balance between performance and
> reliability, so the range of access size for which the workaround
> works needs to be runtime controlled, as it's potentially different.
> Another example (this one is apparently affecting Sandisk) - do
> special stuff for block erase, since the card violates spec in that
> regard (touch ext_csd instead of argument, I believe). A different
> example might be turning on reliable writes for WRITE_META (or all)
> blocks for a certain partition (but I just made that up... ).
Yes, makes sense.
> You could put the entire MMC block policy interface through an API
> usable by system integrators - i.e. you would really only care for
> tuning the MMC parameters if you're creating a device around an emmc.
>
> Idea (1). One idea is to keep the "policies" from my previous mail.
> Policies are registered through platform-specific code. The policies
> could be then matched for enabling against a specific block device by
> manfid/date/etc at the time of mmc_block_alloc... For removable media
> no one would fiddle with the tunable parameters anyway, unless there
> was some global database of cards and workarounds and a daemon or some
> such to take care of that... Probably don't want to add such baggage
> to the kernel.
>
> Idea (2). There is probably no need to overcomplicate. Just add a
> platform callback (something like int
> (*mmc_platform_block_workaround)(struct request *, struct
> mmc_blk_request *)). This will be usable as-is for R/W accesses, and
> the discard code will need to be slightly modified.
>
> Do you think there is any need for runtime tuning of the MMC
> workarounds (disregarding ones that really belong in the I/O
> scheduler)? Should the workarounds be simply platform callbacks, or
> should they be something heftier ("policies")?
The platform hook seems the wrong place, because you might use
the same chip in multiple platforms, and a single platform might
have a large number of different boards, all of which require
separate workarounds.
A per-card quirk table does not seem so bad, we have that in
other subsystems as well. I wouldn't necessarily make it
a list of possible quirks, but rather a __devinit function that
is called for a new card on insertion, in order to tweak various
parameters.
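A minimal sketch of that idea in plain C, outside the kernel: every name here (card_params, card_fixup, card_apply_fixups, EXAMPLE_MANFID) is hypothetical and is not an existing kernel structure or function. The table maps a manufacturer ID to one init function, and that function only sets values and flags on the card; nothing is called per request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-card parameters; not the kernel's struct mmc_card. */
struct card_params {
    uint32_t manfid;             /* manufacturer ID reported by the card */
    uint32_t page_size;          /* smallest efficient write unit, bytes */
    bool     split_small_writes; /* vendor-specific workaround flag */
};

/* One table entry: which card it matches and how to tweak it. */
struct card_fixup {
    uint32_t manfid;
    void (*init)(struct card_params *card);
};

#define EXAMPLE_MANFID 0xEEu /* placeholder value, not a real vendor ID */

static void example_vendor_init(struct card_params *card)
{
    card->page_size = 8 * 1024;
    card->split_small_writes = true;
}

static const struct card_fixup fixups[] = {
    { EXAMPLE_MANFID, example_vendor_init },
};

/* Called once when a card is detected: set defaults, then run the
 * first matching fixup, if any. */
static void card_apply_fixups(struct card_params *card)
{
    assert(card != NULL);
    card->page_size = 32 * 1024; /* the 32KB default discussed in this thread */
    card->split_small_writes = false;

    for (size_t i = 0; i < sizeof(fixups) / sizeof(fixups[0]); i++) {
        if (fixups[i].manfid == card->manfid) {
            fixups[i].init(card);
            break;
        }
    }
}
```

A card that matches no entry keeps the defaults, so unknown cards behave as they do today.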
Arnd
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-22 17:00 ` Arnd Bergmann
@ 2011-02-23 10:19 ` Andrei Warkentin
-1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-23 10:19 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc
On Tue, Feb 22, 2011 at 11:00 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>>
>> Do you think there is any need for runtime tuning of the MMC
>> workarounds (disregarding ones that really belong in the I/O
>> scheduler)? Should the workarounds be simply platform callbacks, or
>> should they be something heftier ("policies")?
>
> The platform hook seems the wrong place, because you might use
> the same chip in multiple platforms, and a single platform might
> have a large number of different boards, all of which require
> separate workarounds.
>
That's a good point. At best it would result in massive copy-paste.
> A per-card quirk table does not seem so bad, we have that in
> other subsystems as well. I wouldn't necessarily make it
> a list of possible quirks, but rather a __devinit function that
> is called for a new card on insertion, in order to tweak various
> parameters.
>
That sounds good! In fact, for any quirks enabled for a particular
card, I'll expose the tuneables through sysfs attributes, something
like /sys/block/mmcblk0/device/quirks/quirk-name/attr-names.
Quirks will have block intervals and access size intervals over which
they are valid, along with any other quirk-specific parameter.
Interval overlap will not be allowed for quirks in the same operation
type (r/w/e). The goal here is to make the changes to issue_*_rq as
small as possible, and not to pollute block.c at all with the quirks
stuff. Quirks are looked up inside issue_*_rq based on req type and
[start,end) interval. The resulting found quirks structure will
contain a callback used inside issue_*_rq to modify mmc block request
structures prior to generating actual MMC commands.
Quirks consist of a callback called inside of mmc issue_*_rq,
configurable attributes, and the sysfs interface. Quirk groups are
defined per-card. At card insertion time, a matching quirk group is
found, and is enabled. The quirk group enable function then enables
the relevant quirks with the right parameters (adds them to per
mmc_blk_data quirk interval tree). Some sane defaults for the tunables
are used. If the tunables are modified through sysfs, care is taken
that an interval overlap never happens, otherwise the tunable is not
modified and a kernel error message is logged.
I hope I explained the tentative idea clearly... Thoughts?
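The overlap rule and the lookup can be sketched roughly as follows. This is only an illustration under assumptions: the names quirk, quirk_set, quirk_add and quirk_find are hypothetical, and a small array with a linear scan stands in for the interval tree mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum op_type { OP_READ, OP_WRITE, OP_ERASE };

/* Hypothetical quirk record covering blocks [start, end) for one op type. */
struct quirk {
    enum op_type op;
    uint64_t start;   /* first block covered */
    uint64_t end;     /* one past the last block covered */
    const char *name;
};

#define MAX_QUIRKS 8

struct quirk_set {
    struct quirk q[MAX_QUIRKS];
    size_t n;
};

/* Two quirks conflict only if they share an op type and their
 * half-open intervals intersect. */
static bool overlaps(const struct quirk *a, const struct quirk *b)
{
    return a->op == b->op && a->start < b->end && b->start < a->end;
}

/* Returns false and leaves the set unchanged if the new quirk would
 * overlap an existing one or the set is full. */
static bool quirk_add(struct quirk_set *set, struct quirk q)
{
    assert(q.start < q.end);
    if (set->n == MAX_QUIRKS)
        return false;
    for (size_t i = 0; i < set->n; i++)
        if (overlaps(&set->q[i], &q))
            return false;
    set->q[set->n++] = q;
    return true;
}

/* Find the quirk that fully covers request [start, end), or NULL. */
static const struct quirk *quirk_find(const struct quirk_set *set,
                                      enum op_type op,
                                      uint64_t start, uint64_t end)
{
    for (size_t i = 0; i < set->n; i++) {
        const struct quirk *q = &set->q[i];
        if (q->op == op && q->start <= start && end <= q->end)
            return q;
    }
    return NULL;
}
```

Because the intervals are half-open, a quirk ending at block 100 and another starting at block 100 do not conflict, and quirks for different operation types may cover the same blocks.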
A
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-23 10:19 ` Andrei Warkentin
@ 2011-02-23 16:09 ` Arnd Bergmann
-1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-23 16:09 UTC (permalink / raw)
To: Andrei Warkentin
Cc: linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc
On Wednesday 23 February 2011, Andrei Warkentin wrote:
> That sounds good! In fact, for any quirks enabled for a particular
> card, I'll expose the tuneables through sysfs attributes, something
> like /sys/block/mmcblk0/device/quirks/quirk-name/attr-names.
>
> Quirks will have block intervals and access size intervals over which
> they are valid, along with any other quirk-specific parameter.
> Interval overlap will not be allowed for quirks in the same operation
> type (r/w/e). The goal here is to make the changes to issue_*_rq as
> small as possible, and not to pollute block.c at all with the quirks
> stuff. Quirks are looked up inside issue_*_rq based on req type and
> [start,end) interval. The resulting found quirks structure will
> contain a callback used inside issue_*_rq to modify mmc block request
> structures prior to generating actual MMC commands.
>
> Quirks consist of a callback called inside of mmc issue_*_rq,
> configurable attributes, and the sysfs interface. Quirk groups are
> defined per-card. At card insertion time, a matching quirk group is
> found, and is enabled. The quirk group enable function then enables
> the relevant quirks with the right parameters (adds them to per
> mmc_blk_data quirk interval tree). Some sane defaults for the tunables
> are used. If the tunables are modified through sysfs, care is taken
> that an interval overlap never happens, otherwise the tunable is not
> modified and a kernel error message is logged.
>
> I hope I explained the tentative idea clearly... Thoughts?
I would hope that the quirks can be simpler than this still, without
the need to call any function pointers while using the device, or
quirk specific sysfs directories.
What I meant is to have a single function pointer that can get
called when detecting a specific known card. All this function
does is to set values and flags that we can export either through
common attributes of block devices (e.g. preferred erase size),
or attributes specific to mmc devices (e.g. the toshiba hack, as
a bool attribute).
An obvious attribute would be the minimum size of an atomic
page update. By default this could be 32KB, because any device
should support that (FAT32 cannot have larger clusters). A
card specific quirk can set it to another value, like 8KB, 16KB
or 64KB, and file systems or other tools like mkfs can optimize
for this value.
I would like the flags like "don't submit requests spanning
this boundary" and "make all writes below this size" to be defined
in terms of the regular sizes we already know about, like the
page size or the erase size.
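As an illustration of a "don't submit requests spanning this boundary" flag expressed purely in terms of a known size, here is a minimal sketch in C. The function names are hypothetical and the unit would be whatever the card reports or the quirk sets (for example the erase size); this is not block layer code.

```c
#include <assert.h>
#include <stdint.h>

/* Length of the first piece of [start, start + len) that does not
 * cross a multiple of unit. All values are in bytes. */
static uint64_t first_chunk_len(uint64_t start, uint64_t len, uint64_t unit)
{
    assert(unit > 0);
    uint64_t to_boundary = unit - (start % unit);
    return len < to_boundary ? len : to_boundary;
}

/* Number of pieces a request is cut into when no piece may span a
 * unit boundary. */
static unsigned split_count(uint64_t start, uint64_t len, uint64_t unit)
{
    unsigned pieces = 0;
    while (len > 0) {
        uint64_t chunk = first_chunk_len(start, len, unit);
        start += chunk;
        len -= chunk;
        pieces++;
    }
    return pieces;
}
```

With a 4 MB unit, a 16 KB request that starts 8 KB before a boundary becomes two 8 KB pieces, while an aligned 4 MB request is left as a single piece.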
Arnd
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-23 16:09 ` Arnd Bergmann
@ 2011-02-23 22:26 ` Andrei Warkentin
-1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-23 22:26 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc
On Wed, Feb 23, 2011 at 10:09 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Wednesday 23 February 2011, Andrei Warkentin wrote:
>> That sounds good! In fact, for any quirks enabled for a particular
>> card, I'll expose the tuneables through sysfs attributes, something
>> like /sys/block/mmcblk0/device/quirks/quirk-name/attr-names.
>>
>> Quirks will have block intervals and access size intervals over which
>> they are valid, along with any other quirk-specific parameter.
>> Interval overlap will not be allowed for quirks in the same operation
>> type (r/w/e). The goal here is to make the changes to issue_*_rq as
>> small as possible, and not to pollute block.c at all with the quirks
>> stuff. Quirks are looked up inside issue_*_rq based on req type and
>> [start,end) interval. The resulting found quirks structure will
>> contain a callback used inside issue_*_rq to modify mmc block request
>> structures prior to generating actual MMC commands.
>>
>> Quirks consist of a callback called inside of mmc issue_*_rq,
>> configurable attributes, and the sysfs interface. Quirk groups are
>> defined per-card. At card insertion time, a matching quirk group is
>> found, and is enabled. The quirk group enable function then enables
>> the relevant quirks with the right parameters (adds them to per
>> mmc_blk_data quirk interval tree). Some sane defaults for the tunables
>> are used. If the tunables are modified through sysfs, care is taken
>> that an interval overlap never happens, otherwise the tunable is not
>> modified and a kernel error message is logged.
>>
>> I hope I explained the tentative idea clearly... Thoughts?
>
> I would hope that the quirks can be simpler than this still, without
> the need to call any function pointers while using the device, or
> quirk specific sysfs directories.
>
I'll skip the sysfs part from the first RFC patch. I think this
complicates what I'm trying to achieve and makes this whole thing look
bigger than it is.
> What I meant is to have a single function pointer that can get
> called when detecting a specific known card. All this function
> does is to set values and flags that we can export either through
> common attributes of block devices (e.g. preferred erase size),
> or attributes specific to mmc devices (e.g. the toshiba hack, as
> a bool attribute).
>
> An obvious attribute would be the minimum size of an atomic
> page update. By default this could be 32KB, because any device
> should support that (FAT32 cannot have larger clusters). A
> card specific quirk can set it to another value, like 8KB, 16KB
> or 64KB, and file systems or other tools like mkfs can optimize
> for this value.
>
> I would like the flags like "don't submit requests spanning
> this boundary" and "make all writes below this size" to be defined
> in terms of the regular sizes we already know about, like the
> page size or the erase size.
>
I agree with you on the size/align issues. These are very generic
attributes and don't need a complicated framework like I described to
be dealt with. Ultimately they are just hints to the I/O scheduler, so
they should be part of the block device.
I am more concerned with workarounds that depend on access size (like
the toshiba one) and that modify the MMC commands sent (using reliable
writes, like the Toshiba one, or putting parameters differently like
the Sandisk erase workaround). It's these kinds of workarounds that
the quirks framework is meant to address. I don't think it's a good
idea to pollute mmc_blk_issue_rw_rq and mmc_blk_issue_discard_rq with
if()-elsed workarounds, because it's going to quickly complicate the
logic, and get out of hand and unmanageable the more cards are added.
I'm trying to avoid having to make any changes to card/block.c as part
of making quirk workarounds. The only cost when compared to an if-else
will be one O(log n) quirk lookup, where n is either one or something
close to that (since the search is only done for quirks per
mmc_blk_data), and one callback invoked after "brq.data.sg_len =
mmc_queue_map_sg(mq);" so it can patch up mrq as necessary.
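A rough sketch of such a size-dependent callback, in plain C rather than against card/block.c: the types blk_request and size_quirk, the functions split_reliable and apply_quirk, and the 128 KB upper limit used in the example below are all assumptions for illustration. The callback only patches the request description; issuing the commands stays where it is.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the request data a quirk may patch. */
struct blk_request {
    uint32_t bytes;        /* total size of the write */
    uint32_t chunk_bytes;  /* size of each transfer actually issued */
    bool     reliable;     /* issue the transfers as reliable writes */
};

/* Hypothetical quirk: applies to sizes in (min_bytes, max_bytes]. */
struct size_quirk {
    uint32_t min_bytes;
    uint32_t max_bytes;    /* the "waterline" above which nothing is done */
    uint32_t chunk_bytes;
    void (*adjust)(const struct size_quirk *q, struct blk_request *rq);
};

/* Callback: cut the write into small reliable-write chunks. */
static void split_reliable(const struct size_quirk *q, struct blk_request *rq)
{
    rq->chunk_bytes = q->chunk_bytes;
    rq->reliable = true;
}

/* Default is one normal transfer; the callback runs only when the
 * request size falls inside the quirk's window. */
static void apply_quirk(const struct size_quirk *q, struct blk_request *rq)
{
    assert(q != NULL && rq != NULL);
    rq->chunk_bytes = rq->bytes;
    rq->reliable = false;
    if (rq->bytes > q->min_bytes && rq->bytes <= q->max_bytes)
        q->adjust(q, rq);
}
```

With a window of (8 KB, 128 KB] and 8 KB chunks, a 64 KB write would be issued as 8 KB reliable writes, while a 4 KB or a 1 MB write would be left untouched.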
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-23 22:26 ` Andrei Warkentin
@ 2011-02-24 9:24 ` Arnd Bergmann
-1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-24 9:24 UTC (permalink / raw)
To: Andrei Warkentin
Cc: linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc
On Wednesday 23 February 2011, Andrei Warkentin wrote:
> I am more concerned with workarounds that depend on access size (like
> the toshiba one) and that modify the MMC commands sent (using reliable
> writes, like the Toshiba one, or putting parameters differently like
> the Sandisk erase workaround). It's these kinds of workarounds that
> the quirks framework is meant to address. I don't think it's a good
> idea to pollute mmc_blk_issue_rw_rq and mmc_blk_issue_discard_rq with
> if()-elsed workarounds, because it's going to quickly complicate the
> logic, and get out of hand and unmanageable the more cards are added.
> I'm trying to avoid having to make any changes to card/block.c as part
> of making quirk workarounds. The only cost when compared to an if-else
> will be one O(log n) quirk lookup, where n is either one or something
> close to that (since the search is only done for quirks per
> mmc_blk_data), and one callback invoked after "brq.data.sg_len =
> mmc_queue_map_sg(mq);" so it can patch up mrq as necessary.
Unlike the sysfs interface, the code does not need to be future-proof,
it can always be changed if we feel the code becomes more maintainable
by doing it another way.
The approach that I'd like to see here is:
* Start out with an ad-hoc patch for a quirk (like the one you already
have).
* Add a boolean variable to enable it per card.
* Get performance data for this quirk to show that it's useful in
real-world workloads for some cards but counterproductive for others
* Get the patch into the mmc tree.
* Repeat for the next quirk
* When the code becomes overly complicated after adding all the quirks,
decide on a good strategy to move the code around, and do a new patch.
I understand that you are convinced that you will need the indirect function
calls in the end. That is fine, just don't add them before they are
actually needed -- that would only make it harder for you to get the first
patch included.
Note that the situation is very different for user interfaces such as sysfs:
You need to plan ahead because once the interface is merged upstream, it
can never be changed. When you submit a patch that introduces a new sysfs
interface, it has to be documented, and you have to convince the reviewers
that it is sufficient to cover all the cases it is designed for, while
at the same time being the simplest way to achieve this.
Arnd
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-24 9:24 ` Arnd Bergmann
@ 2011-02-25 11:02 ` Andrei Warkentin
-1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-25 11:02 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc
On Thu, Feb 24, 2011 at 3:24 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> Unlike the sysfs interface, the code does not need to be future-proof,
> it can always be changed if we feel the code becomes more maintainable
> by doing it another way.
>
> The approach that I'd like to see here is:
>
> * Start out with an ad-hoc patch for a quirk (like the one you already
> have).
> * Add a boolean variable to enable it per card.
> * Get performance data for this quirk to show that it's useful in
> real-world workloads for some cards but counterproductive for others
> * Get the patch into the mmc tree.
> * Repeat for the next quirk
> * When the code becomes overly complicated after adding all the quirks,
> decide on a good strategy to move the code around, and do a new patch.
>
Yup. I understand :-). That's the strategy I'm going to follow. For
page_size-alignment/splitting I'm looking at the block layer now. Is
that the right approach or should I still submit a (cleaned up) patch
to mmc/card/block.c for that performance improvement? The other
(Toshiba quirk) is obviously a quirk belonging to mmc/card/block.c.
> I understand that you are convinced that you will need the indirect function
> calls in the end. That is fine, just don't add them before they are
> actually needed -- that would only make it harder for you to get the first
> patch included.
>
> Note that the situation is very different for user interfaces such as sysfs:
> You need to plan ahead because once the interface is merged upstream, it
> can never be changed. When you submit a patch that introduces a new sysfs
> interface, it has to be documented, and you have to convince the reviewers
> that it is sufficient to cover all the cases it is designed for, while
> at the same time it is the most simple way to achieve this.
Ok, thanks a lot for the explanation, I hadn't thought of it that way
(and should have).
A
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 117+ messages in thread
* MMC quirks relating to performance/lifetime.
@ 2011-02-25 11:02 ` Andrei Warkentin
0 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-25 11:02 UTC (permalink / raw)
To: linux-arm-kernel
On Thu, Feb 24, 2011 at 3:24 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> Unlike the sysfs interface, the code does not need to be future-proof,
> it can always be changed if we feel the code becomes more maintainable
> by doing it another way.
>
> The approach that I'd like to see here is:
>
> * Start out with an ad-hoc patch for a quirk (like the one you already
> have).
> * Add a boolean variable to enable it per card.
> * Get performance data for this quirk to show that it's useful in
> real-world workloads for some cards but counterproductive for others
> * Get the patch into the mmc tree.
> * Repeat for the next quirk
> * When the code becomes overly complicated after adding all the quirks,
> decide on a good strategy to move the code around, and do a new patch.
>
Yup. I understand :-). That's the strategy I'm going to follow. For
page_size-alignment/splitting I'm looking at the block layer now. Is
that the right approach or should I still submit a (cleaned up) patch
to mmc/card/block.c for that performance improvement? The other
(Toshiba quirk) is obviously a quirk belonging to mmc/card/block.c.
> I understand that you are convinced that you will need the indirect function
> calls in the end. That is fine, just don't add them before they are
> actually needed -- that would only make it harder for you to get the first
> patch included.
>
> Note that the situation is very different for user interfaces such as sysfs:
> You need to plan ahead because once the interface is merged upstream, it
> can never be changed. When you submit a patch that introduces a new sysfs
> interface, it has to be documented, and you have to convince the reviewers
> that it is sufficient to cover all the cases it is designed for, while
> at the same time it is the most simple way to achieve this.
Ok, thanks a lot for the explanation, I hadn't thought of it that way
(and should have).
A
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-25 11:02 ` Andrei Warkentin
@ 2011-02-25 12:21 ` Arnd Bergmann
-1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-25 12:21 UTC (permalink / raw)
To: Andrei Warkentin, Jens Axboe
Cc: linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc
On Friday 25 February 2011, Andrei Warkentin wrote:
> Yup. I understand :-). That's the strategy I'm going to follow. For
> page_size-alignment/splitting I'm looking at the block layer now. Is
> that the right approach or should I still submit a (cleaned up) patch
> to mmc/card/block.c for that performance improvement.
I guess it should live in block/cfq-iosched in the long run, but I don't
know how easy it is to implement it there for test purposes.
It may be easier to prototype it in the mmc code, since you are more
familiar with that already, post that patch together with benchmark
results and then do a new patch for the final solution. We'll need
more benchmarking to figure out if that should be applied for
all nonrotational storage, or if there are cases where it actually
hurts performance to split requests on page boundaries.
If it turns out to be a good idea in general, we won't even need a
sysfs interface for enabling it, just one for reading/writing the
underlying page size.
> The other (Toshiba quirk) is obviously a quirk belonging to mmc/card/block.c.
Makes sense.
Arnd
^ permalink raw reply [flat|nested] 117+ messages in thread
* MMC quirks relating to performance/lifetime.
@ 2011-02-25 12:21 ` Arnd Bergmann
0 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-25 12:21 UTC (permalink / raw)
To: linux-arm-kernel
On Friday 25 February 2011, Andrei Warkentin wrote:
> Yup. I understand :-). That's the strategy I'm going to follow. For
> page_size-alignment/splitting I'm looking at the block layer now. Is
> that the right approach or should I still submit a (cleaned up) patch
> to mmc/card/block.c for that performance improvement.
I guess it should live in block/cfq-iosched in the long run, but I don't
know how easy it is to implement it there for test purposes.
It may be easier to prototype it in the mmc code, since you are more
familiar with that already, post that patch together with benchmark
results and then do a new patch for the final solution. We'll need
more benchmarking to figure out if that should be applied for
all nonrotational storage, or if there are cases where it actually
hurts performance to split requests on page boundaries.
If it turns out to be a good idea in general, we won't even need a
sysfs interface for enabling it, just one for reading/writing the
underlying page size.
> The other (Toshiba quirk) is obviously a quirk belonging to mmc/card/block.c.
Makes sense.
Arnd
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-02-25 12:21 ` Arnd Bergmann
@ 2011-03-01 18:48 ` Jens Axboe
-1 siblings, 0 replies; 117+ messages in thread
From: Jens Axboe @ 2011-03-01 18:48 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Andrei Warkentin, linux-arm-kernel, linux-fsdevel, Linus Walleij,
linux-mmc
On 2011-02-25 07:21, Arnd Bergmann wrote:
> On Friday 25 February 2011, Andrei Warkentin wrote:
>> Yup. I understand :-). That's the strategy I'm going to follow. For
>> page_size-alignment/splitting I'm looking at the block layer now. Is
>> that the right approach or should I still submit a (cleaned up) patch
>> to mmc/card/block.c for that performance improvement.
>
> I guess it should live in block/cfq-iosched in the long run, but I don't
> know how easy it is to implement it there for test purposes.
I don't think I saw the original patch(es) for this?
--
Jens Axboe
^ permalink raw reply [flat|nested] 117+ messages in thread
* MMC quirks relating to performance/lifetime.
@ 2011-03-01 18:48 ` Jens Axboe
0 siblings, 0 replies; 117+ messages in thread
From: Jens Axboe @ 2011-03-01 18:48 UTC (permalink / raw)
To: linux-arm-kernel
On 2011-02-25 07:21, Arnd Bergmann wrote:
> On Friday 25 February 2011, Andrei Warkentin wrote:
>> Yup. I understand :-). That's the strategy I'm going to follow. For
>> page_size-alignment/splitting I'm looking at the block layer now. Is
>> that the right approach or should I still submit a (cleaned up) patch
>> to mmc/card/block.c for that performance improvement.
>
> I guess it should live in block/cfq-iosched in the long run, but I don't
> know how easy it is to implement it there for test purposes.
I don't think I saw the original patch(es) for this?
--
Jens Axboe
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-03-01 18:48 ` Jens Axboe
@ 2011-03-01 19:11 ` Arnd Bergmann
-1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-03-01 19:11 UTC (permalink / raw)
To: Jens Axboe
Cc: Andrei Warkentin, linux-arm-kernel, linux-fsdevel, Linus Walleij,
linux-mmc
On Tuesday 01 March 2011 19:48:17 Jens Axboe wrote:
>
> On 2011-02-25 07:21, Arnd Bergmann wrote:
> > On Friday 25 February 2011, Andrei Warkentin wrote:
> >> Yup. I understand :-). That's the strategy I'm going to follow. For
> >> page_size-alignment/splitting I'm looking at the block layer now. Is
> >> that the right approach or should I still submit a (cleaned up) patch
> >> to mmc/card/block.c for that performance improvement.
> >
> > I guess it should live in block/cfq-iosched in the long run, but I don't
> > know how easy it is to implement it there for test purposes.
>
> I don't think I saw the original patch(es) for this?
Nobody has posted one yet, only discussions. Andrei made a patch for the
MMC block driver to split requests in some cases, but I think the
concept has changed enough that it's probably not useful to look at
that patch.
I think what needs to be done here is to split requests in these cases:
* Small requests should be split on flash page boundaries, where a page
is typically 8 to 32 KB. Sending one hardware request that spans two
partial pages can be slower than sending two requests with the same
data, but on page boundaries.
* If a hardware transfer is limited to a few sectors, these should be
aligned to page boundaries. E.g. assuming a 16 sector page and 32 sector
maximum transfers, a request that spans from sector 7 to 62 should be
split into three transfers: 7-15, 16-47 and 48-62, not 7-38 and 39-62.
This reduces the number of page read-modify-write cycles that the drive
does.
* No request should ever span multiple erase blocks. Most flash drives today
have 4MB erase blocks (sometimes 1, 2 or 8), and the I/O scheduler should
treat the erase block boundary like a seek on a hard drive. The I/O
scheduler should try to send all sector writes of an erase block in sequence,
but after that it can choose any other erase block to write to next.
I think if we get this logic, we can deal well with all cheap flash drives.
The two parameters we need are the page size and the erase block size,
which the kernel can sometimes guess, but should also be tunable in
sysfs for devices that don't tell us or lie to the kernel about them.
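As a rough illustration of the second rule, here is a minimal userspace
sketch in Python (not kernel code). The function names are hypothetical,
and the sketch assumes that the maximum transfer size is a multiple of
the page size; it is only meant to reproduce the sector 7 to 62 example
above, not to describe how the block layer or the MMC driver would do it.

```python
def split_naive(first, last, max_sectors):
    """Cut a request into max-size transfers, ignoring page boundaries."""
    chunks = []
    start = first
    while start <= last:
        end = min(start + max_sectors - 1, last)
        chunks.append((start, end))
        start = end + 1
    return chunks


def split_page_aligned(first, last, page_sectors, max_sectors):
    """Cut a request so that every cut falls on a flash page boundary.

    Assumption: max_sectors is a multiple of page_sectors.
    """
    if max_sectors % page_sectors:
        raise ValueError("max_sectors must be a multiple of page_sectors")
    chunks = []
    start = first
    while start <= last:
        if start % page_sectors:
            # Unaligned head: stop at the end of the current page.
            end = (start // page_sectors + 1) * page_sectors - 1
        else:
            # Aligned start: a full-size transfer ends on a page boundary.
            end = start + max_sectors - 1
        end = min(end, last)
        chunks.append((start, end))
        start = end + 1
    return chunks


def partial_page_writes(chunks, page_sectors):
    """Count how many (transfer, page) pairs cover only part of a page."""
    count = 0
    for start, end in chunks:
        for page in range(start // page_sectors, end // page_sectors + 1):
            page_start = page * page_sectors
            page_end = page_start + page_sectors - 1
            if start > page_start or end < page_end:
                count += 1
    return count


if __name__ == "__main__":
    aligned = split_page_aligned(7, 62, page_sectors=16, max_sectors=32)
    naive = split_naive(7, 62, max_sectors=32)
    print(aligned, partial_page_writes(aligned, 16))
    print(naive, partial_page_writes(naive, 16))
```

With the numbers from the example, the aligned split touches two partial
pages, while the naive split touches four.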
I'm not sure if we want to do this for all nonrotational media, or
add another flag to enable these optimizations. On proper SSDs that have
an intelligent controller and enough RAM, they probably would not help
all that much, or even make it slightly slower due to a higher number
of separate write requests.
Arnd
^ permalink raw reply [flat|nested] 117+ messages in thread
* MMC quirks relating to performance/lifetime.
@ 2011-03-01 19:11 ` Arnd Bergmann
0 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-03-01 19:11 UTC (permalink / raw)
To: linux-arm-kernel
On Tuesday 01 March 2011 19:48:17 Jens Axboe wrote:
>
> On 2011-02-25 07:21, Arnd Bergmann wrote:
> > On Friday 25 February 2011, Andrei Warkentin wrote:
> >> Yup. I understand :-). That's the strategy I'm going to follow. For
> >> page_size-alignment/splitting I'm looking at the block layer now. Is
> >> that the right approach or should I still submit a (cleaned up) patch
> >> to mmc/card/block.c for that performance improvement.
> >
> > I guess it should live in block/cfq-iosched in the long run, but I don't
> > know how easy it is to implement it there for test purposes.
>
> I don't think I saw the original patch(es) for this?
Nobody has posted one yet, only discussions. Andrei made a patch for the
MMC block driver to split requests in some cases, but I think the
concept has changed enough that it's probably not useful to look at
that patch.
I think what needs to be done here is to split requests in these cases:
* Small requests should be split on flash page boundaries, where a page
is typically 8 to 32 KB. Sending one hardware request that spans two
partial pages can be slower than sending two requests with the same
data, but on page boundaries.
* If a hardware transfer is limited to a few sectors, these should be
aligned to page boundaries. E.g. assuming a 16 sector page and 32 sector
maximum transfers, a request that spans from sector 7 to 62 should be
split into three transfers: 7-15, 16-47 and 48-62, not 7-38 and 39-62.
This reduces the number of page read-modify-write cycles that the drive
does.
* No request should ever span multiple erase blocks. Most flash drives today
have 4MB erase blocks (sometimes 1, 2 or 8), and the I/O scheduler should
treat the erase block boundary like a seek on a hard drive. The I/O
scheduler should try to send all sector writes of an erase block in sequence,
but after that it can choose any other erase block to write to next.
I think if we get this logic, we can deal well with all cheap flash drives.
The two parameters we need are the page size and the erase block size,
which the kernel can sometimes guess, but should also be tunable in
sysfs for devices that don't tell us or lie to the kernel about them.
I'm not sure if we want to do this for all nonrotational media, or
add another flag to enable these optimizations. On proper SSDs that have
an intelligent controller and enough RAM, they probably would not help
all that much, or even make it slightly slower due to a higher number
of separate write requests.
Arnd
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-03-01 19:11 ` Arnd Bergmann
@ 2011-03-01 19:15 ` Jens Axboe
-1 siblings, 0 replies; 117+ messages in thread
From: Jens Axboe @ 2011-03-01 19:15 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Andrei Warkentin, linux-arm-kernel, linux-fsdevel, Linus Walleij,
linux-mmc
On 2011-03-01 14:11, Arnd Bergmann wrote:
> On Tuesday 01 March 2011 19:48:17 Jens Axboe wrote:
>>
>> On 2011-02-25 07:21, Arnd Bergmann wrote:
>>> On Friday 25 February 2011, Andrei Warkentin wrote:
>>>> Yup. I understand :-). That's the strategy I'm going to follow. For
>>>> page_size-alignment/splitting I'm looking at the block layer now. Is
>>>> that the right approach or should I still submit a (cleaned up) patch
>>>> to mmc/card/block.c for that performance improvement.
>>>
>>> I guess it should live in block/cfq-iosched in the long run, but I don't
>>> know how easy it is to implement it there for test purposes.
>>
>> I don't think I saw the original patch(es) for this?
>
> Nobody has posted one yet, only discussions. Andrei made a patch for the
> MMC block driver to split requests in some cases, but I think the
> concept has changed enough that it's probably not useful to look at
> that patch.
>
> I think what needs to be done here is to split requests in these cases:
>
> * Small requests should be split on flash page boundaries, where a page
> is typically 8 to 32 KB. Sending one hardware request that spans two
> partial pages can be slower than sending two requests with the same
> data, but on page boundaries.
>
> * If a hardware transfer is limited to a few sectors, these should be
> aligned to page boundaries. E.g. assuming a 16 sector page and 32 sector
> maximum transfers, a request that spans from sector 7 to 62 should be
> split into three transfers: 7-15, 16-47 and 48-62, not 7-38 and 39-62.
> This reduces the number of page read-modify-write cycles that the drive
> does.
>
> * No request should ever span multiple erase blocks. Most flash drives today
> have 4MB erase blocks (sometimes 1, 2 or 8), and the I/O scheduler should
> treat the erase block boundary like a seek on a hard drive. The I/O
> scheduler should try to send all sector writes of an erase block in sequence,
> but after that it can choose any other erase block to write to next.
>
> I think if we get this logic, we can deal well with all cheap flash drives.
> The two parameters we need are the page size and the erase block size,
> which the kernel can sometimes guess, but should also be tunable in
> sysfs for devices that don't tell us or lie to the kernel about them.
>
> I'm not sure if we want to do this for all nonrotational media, or
> add another flag to enable these optimizations. On proper SSDs that have
> an intelligent controller and enough RAM, they probably would not help
> all that much, or even make it slightly slower due to a higher number
> of separate write requests.
Thanks for the recap. One way to handle this would be to have a dm
target that ensures that requests are never built up to violate any of
the above items. Doing splitting is a little silly, when you can prevent
it from happening in the first place.
Alternatively, a queue ->merge_bvec_fn() with a settings table could
provide the same.
As this is of limited scope, I would prefer having this done via a
plugin of some sort (like a dm target).
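The idea of refusing a merge rather than splitting afterwards can be
sketched in userspace Python as below. This is only a model: it does not
use the real ->merge_bvec_fn() signature, the function names are
hypothetical, and requests are represented as plain (first_sector,
last_sector) tuples.

```python
def crosses_boundary(first, last, boundary_sectors):
    """True if sectors first..last span more than one aligned region."""
    return first // boundary_sectors != last // boundary_sectors


def may_merge(request, bio, boundary_sectors):
    """Decide whether `bio` may be appended to `request`.

    Both arguments are (first_sector, last_sector) tuples. Only a
    contiguous back merge is modelled; the merge is refused when the
    combined range would cross a boundary (for example an erase block).
    """
    req_first, req_last = request
    bio_first, bio_last = bio
    if bio_first != req_last + 1:
        return False  # not contiguous, nothing to merge
    return not crosses_boundary(req_first, bio_last, boundary_sectors)


if __name__ == "__main__":
    erase_sectors = 8192  # 4MB erase block in 512-byte sectors
    print(may_merge((8000, 8100), (8101, 8191), erase_sectors))
    print(may_merge((8000, 8191), (8192, 8200), erase_sectors))
```

Note that a check like this only stops requests from growing across a
boundary; a single request that already crosses one is not helped by it.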
--
Jens Axboe
^ permalink raw reply [flat|nested] 117+ messages in thread
* MMC quirks relating to performance/lifetime.
@ 2011-03-01 19:15 ` Jens Axboe
0 siblings, 0 replies; 117+ messages in thread
From: Jens Axboe @ 2011-03-01 19:15 UTC (permalink / raw)
To: linux-arm-kernel
On 2011-03-01 14:11, Arnd Bergmann wrote:
> On Tuesday 01 March 2011 19:48:17 Jens Axboe wrote:
>>
>> On 2011-02-25 07:21, Arnd Bergmann wrote:
>>> On Friday 25 February 2011, Andrei Warkentin wrote:
>>>> Yup. I understand :-). That's the strategy I'm going to follow. For
>>>> page_size-alignment/splitting I'm looking at the block layer now. Is
>>>> that the right approach or should I still submit a (cleaned up) patch
>>>> to mmc/card/block.c for that performance improvement.
>>>
>>> I guess it should live in block/cfq-iosched in the long run, but I don't
>>> know how easy it is to implement it there for test purposes.
>>
>> I don't think I saw the original patch(es) for this?
>
> Nobody has posted one yet, only discussions. Andrei made a patch for the
> MMC block driver to split requests in some cases, but I think the
> concept has changed enough that it's probably not useful to look at
> that patch.
>
> I think what needs to be done here is to split requests in these cases:
>
> * Small requests should be split on flash page boundaries, where a page
> is typically 8 to 32 KB. Sending one hardware request that spans two
> partial pages can be slower than sending two requests with the same
> data, but on page boundaries.
>
> * If a hardware transfer is limited to a few sectors, these should be
> aligned to page boundaries. E.g. assuming a 16 sector page and 32 sector
> maximum transfers, a request that spans from sector 7 to 62 should be
> split into three transfers: 7-15, 16-47 and 48-62, not 7-38 and 39-62.
> This reduces the number of page read-modify-write cycles that the drive
> does.
>
> * No request should ever span multiple erase blocks. Most flash drives today
> have 4MB erase blocks (sometimes 1, 2 or 8), and the I/O scheduler should
> treat the erase block boundary like a seek on a hard drive. The I/O
> scheduler should try to send all sector writes of an erase block in sequence,
> but after that it can choose any other erase block to write to next.
>
> I think if we get this logic, we can deal well with all cheap flash drives.
> The two parameters we need are the page size and the erase block size,
> which the kernel can sometimes guess, but should also be tunable in
> sysfs for devices that don't tell us or lie to the kernel about them.
>
> I'm not sure if we want to do this for all nonrotational media, or
> add another flag to enable these optimizations. On proper SSDs that have
> an intelligent controller and enough RAM, they probably would not help
> all that much, or even make it slightly slower due to a higher number
> of separate write requests.
Thanks for the recap. One way to handle this would be to have a dm
target that ensures that requests are never built up to violate any of
the above items. Doing splitting is a little silly, when you can prevent
it from happening in the first place.
Alternatively, a queue ->merge_bvec_fn() with a settings table could
provide the same.
As this is of limited scope, I would prefer having this done via a
plugin of some sort (like a dm target).
--
Jens Axboe
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-03-01 19:15 ` Jens Axboe
@ 2011-03-01 19:51 ` Arnd Bergmann
-1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-03-01 19:51 UTC (permalink / raw)
To: Jens Axboe
Cc: Andrei Warkentin, linux-arm-kernel, linux-fsdevel, Linus Walleij,
linux-mmc
On Tuesday 01 March 2011 20:15:30 Jens Axboe wrote:
> Thanks for the recap. One way to handle this would be to have a dm
> target that ensures that requests are never built up to violate any of
> the above items. Doing splitting is a little silly, when you can prevent
> it from happening in the first place.
Ok, that sounds good. I didn't know that it's possible to prevent
bios from getting created that violate this.
I'm actually trying to do a device mapper target that does much more than
this, see
https://wiki.linaro.org/WorkingGroups/KernelConsolidation/Projects/FlashDeviceMapper
for an early draft. The design has moved on since I wrote that, but
the basic idea is still the same: all blocks get written in a way that
fills up entire 4MB segments before moving to another segment,
independent of what the logical block numbers are, and a little space
is used to store a lookup table for the logical-to-physical block mapping.
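To make the basic idea concrete, here is a toy Python model of
"fill one segment before moving to the next, and remember where each
logical block went". It is not the design from the wiki page; the class
name, the block-granular mapping and the absence of garbage collection
are all simplifying assumptions made for illustration.

```python
class SegmentLog:
    """Append-only placement of logical blocks into fixed-size segments."""

    def __init__(self, blocks_per_segment, num_segments):
        self.blocks_per_segment = blocks_per_segment
        self.capacity = blocks_per_segment * num_segments
        self.next_physical = 0   # next free physical block in the log
        self.mapping = {}        # logical block -> latest physical block

    def write(self, logical_block):
        """Place a write at the log head, whatever its logical number."""
        if self.next_physical >= self.capacity:
            raise RuntimeError("log full; garbage collection is not modelled")
        physical = self.next_physical
        self.next_physical += 1
        self.mapping[logical_block] = physical  # an overwrite simply remaps
        return physical

    def lookup(self, logical_block):
        """Return the current physical block, or None if never written."""
        return self.mapping.get(logical_block)

    def segment_of(self, physical_block):
        return physical_block // self.blocks_per_segment


if __name__ == "__main__":
    log = SegmentLog(blocks_per_segment=4, num_segments=2)
    for logical in (100, 7, 100, 42, 9):
        physical = log.write(logical)
        print(logical, "->", physical, "segment", log.segment_of(physical))
```

Scattered logical block numbers end up in consecutive physical blocks,
so the first segment is filled completely before the second is touched.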
> Alternatively, a queue ->merge_bvec_fn() with a settings table could
> provide the same.
That's probably better for the common case. The device mapper target
would be useful for those that want the best case write performance,
but if I understand you correctly, the merge_bvec_fn() could be used
per block driver, so we could simply add that to the SCSI (for USB and
consumer SSD) and MMC block drivers.
The point that this does not solve is submitting all outstanding writes
for an erase block together, which is needed to reduce the garbage
collection overhead. When you do a partial update of an erase block
(4MB typically) and then start writing to another erase block, the
drive will have to copy all data you did not write in order to free
up internal resources.
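A small Python sketch of what "submitting all outstanding writes for an
erase block together" could look like, under stated assumptions:
512-byte sectors, a 4MB erase block, and hypothetical helper names. It
only reorders a list of pending sector numbers and counts how often
consecutive writes land in different erase blocks.

```python
from collections import defaultdict

SECTOR_BYTES = 512
ERASE_BLOCK_BYTES = 4 * 1024 * 1024  # assumed; real devices vary


def erase_block_of(sector, erase_block_bytes=ERASE_BLOCK_BYTES):
    return sector // (erase_block_bytes // SECTOR_BYTES)


def group_by_erase_block(pending_sectors, erase_block_bytes=ERASE_BLOCK_BYTES):
    """Return a dispatch order that finishes one erase block at a time.

    Erase blocks are visited in order of first arrival; sectors inside
    each erase block are issued in ascending order.
    """
    groups = defaultdict(list)
    for sector in pending_sectors:
        groups[erase_block_of(sector, erase_block_bytes)].append(sector)
    order = []
    for sectors in groups.values():
        order.extend(sorted(sectors))
    return order


def block_switches(order, erase_block_bytes=ERASE_BLOCK_BYTES):
    """Count transitions between different erase blocks in a dispatch order."""
    blocks = [erase_block_of(s, erase_block_bytes) for s in order]
    return sum(1 for a, b in zip(blocks, blocks[1:]) if a != b)


if __name__ == "__main__":
    pending = [8192, 10, 8200, 3, 16384, 8193]
    grouped = group_by_erase_block(pending)
    print(pending, block_switches(pending))
    print(grouped, block_switches(grouped))
```

In this example the arrival order switches erase blocks five times,
while the grouped order switches only twice.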
> As this is of limited scope, I would prefer having this done via a
> plugin of some sort (like a dm target).
I'm not sure what you mean by limited scope. This is certainly not
as important for the classic server environment (aside from USB boot
drives), but I assume that it is highly relevant for a large
portion of new embedded designs as people move from raw flash to
eMMC and similar "technologies".
Arnd
^ permalink raw reply [flat|nested] 117+ messages in thread
* MMC quirks relating to performance/lifetime.
@ 2011-03-01 19:51 ` Arnd Bergmann
0 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-03-01 19:51 UTC (permalink / raw)
To: linux-arm-kernel
On Tuesday 01 March 2011 20:15:30 Jens Axboe wrote:
> Thanks for the recap. One way to handle this would be to have a dm
> target that ensures that requests are never built up to violate any of
> the above items. Doing splitting is a little silly, when you can prevent
> it from happening in the first place.
Ok, that sounds good. I didn't know that it's possible to prevent
bios from getting created that violate this.
I'm actually trying to do a device mapper target that does much more than
this, see
https://wiki.linaro.org/WorkingGroups/KernelConsolidation/Projects/FlashDeviceMapper
for an early draft. The design has moved on since I wrote that, but
the basic idea is still the same: all blocks get written in a way that
fills up entire 4MB segments before moving to another segment,
independent of what the logical block numbers are, and a little space
is used to store a lookup table for the logical-to-physical block mapping.
> Alternatively, a queue ->merge_bvec_fn() with a settings table could
> provide the same.
That's probably better for the common case. The device mapper target
would be useful for those that want the best case write performance,
but if I understand you correctly, the merge_bvec_fn() could be used
per block driver, so we could simply add that to the SCSI (for USB and
consumer SSD) and MMC block drivers.
The point that this does not solve is submitting all outstanding writes
for an erase block together, which is needed to reduce the garbage
collection overhead. When you do a partial update of an erase block
(4MB typically) and then start writing to another erase block, the
drive will have to copy all data you did not write in order to free
up internal resources.
> As this is of limited scope, I would prefer having this done via a
> plugin of some sort (like a dm target).
I'm not sure what you mean by limited scope. This is certainly not
as important for the classic server environment (aside from USB boot
drives), but I assume that it is highly relevant for a large
portion of new embedded designs as people move from raw flash to
eMMC and similar "technologies".
Arnd
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-03-01 19:51 ` Arnd Bergmann
@ 2011-03-01 21:33 ` Andrei Warkentin
-1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-03-01 21:33 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Jens Axboe, linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc
On Tue, Mar 1, 2011 at 1:51 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tuesday 01 March 2011 20:15:30 Jens Axboe wrote:
>> Thanks for the recap. One way to handle this would be to have a dm
>> target that ensures that requests are never built up to violate any of
>> the above items. Doing splitting is a little silly, when you can prevent
>> it from happening in the first place.
>
> Ok, that sounds good. I didn't know that it's possible to prevent
> bios from getting created that violate this.
>
Wouldn't someone still be able to perform a generic_make_request that
would violate the conditions (i.e. cross alignment boundary while
performing unaligned write)? You could prevent the merges that would
result in violating the conditions, sure, but you would need to handle
single unaligned accesses correctly too... Sorry, I'm just groping my
way around the block layer; there is a lot I'm still trying to form a
mental picture of.
P.S. I've submitted for review the first 3 patches. Tear into them :).
A
^ permalink raw reply [flat|nested] 117+ messages in thread
* MMC quirks relating to performance/lifetime.
@ 2011-03-01 21:33 ` Andrei Warkentin
0 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-03-01 21:33 UTC (permalink / raw)
To: linux-arm-kernel
On Tue, Mar 1, 2011 at 1:51 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tuesday 01 March 2011 20:15:30 Jens Axboe wrote:
>> Thanks for the recap. One way to handle this would be to have a dm
>> target that ensures that requests are never built up to violate any of
>> the above items. Doing splitting is a little silly, when you can prevent
>> it from happening in the first place.
>
> Ok, that sounds good. I didn't know that it's possible to prevent
> bios from getting created that violate this.
>
Wouldn't someone still be able to perform a generic_make_request that
would violate the conditions (i.e. cross alignment boundary while
performing unaligned write)? You could prevent the merges that would
result in violating the conditions, sure, but you would need to handle
single unaligned accesses correctly too... Sorry, I'm just groping my
way around the block layer; there is a lot I'm still trying to form a
mental picture of.
P.S. I've submitted for review the first 3 patches. Tear into them :).
A
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-03-01 19:11 ` Arnd Bergmann
@ 2011-03-02 10:34 ` Andrei Warkentin
-1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-03-02 10:34 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Jens Axboe, linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc
On Tue, Mar 1, 2011 at 1:11 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tuesday 01 March 2011 19:48:17 Jens Axboe wrote:
>>
>> On 2011-02-25 07:21, Arnd Bergmann wrote:
>> > On Friday 25 February 2011, Andrei Warkentin wrote:
>> >> Yup. I understand :-). That's the strategy I'm going to follow. For
>> >> page_size-alignment/splitting I'm looking at the block layer now. Is
>> >> that the right approach or should I still submit a (cleaned up) patch
>> >> to mmc/card/block.c for that performance improvement.
>> >
>> > I guess it should live in block/cfq-iosched in the long run, but I don't
>> > know how easy it is to implement it there for test purposes.
>>
>> I don't think I saw the original patch(es) for this?
>
> Nobody has posted one yet, only discussions. Andrei made a patch for the
> MMC block driver to split requests in some cases, but I think the
> concept has changed enough that it's probably not useful to look at
> that patch.
>
Before the generic improvements are made to the block layer, I think
there is some value in implementing the (simpler) ones in the mmc block
code, as well as exposing an mmc block quirk interface by which it's
easy to add complex workarounds. Some things will never be able to stay
completely above the mmc block code. For example, when splitting up
smaller accesses, you need to be careful on the Toshiba card, since the
4th consecutive 8KB block results in the entire 32KB getting pushed into
the bigger 4MB buffer. On our platform, there are a lot of accesses in
the 16KB-32KB range which benefit from the splitting. Data collected
showed splitting more than 32KB to have an adverse effect on performance
(I guess that sort of makes sense; after all, why else would the
controller treat 4 consecutive 8KB accesses as a larger access and
treat it accordingly?). On the other hand, that data was collected on
code that used reliable write for every portion of the split access,
so I'm going to have to get some new data...
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-03-02 10:34 ` Andrei Warkentin
@ 2011-03-05 9:23 ` Andrei Warkentin
-1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-03-05 9:23 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Jens Axboe, linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc
On Wed, Mar 2, 2011 at 4:34 AM, Andrei Warkentin <andreiw@motorola.com> wrote:
> Before the generic improvements are made to the block layer, I think
> there is some value in implementing the (simpler) ones in mmc block
> code, as well as exposing an mmc block quirk interface by which it's
> easy to add complex workarounds. Some things will never be able to
> stay completely above mmc block code. For example, when splitting up
> smaller accesses, you need to be careful on the Toshiba card, since
> the 4th consecutive 8KB block results in the entire 32KB getting
> pushed into the bigger 4MB buffer. On our platform, there are a lot of
> accesses in the 16KB-32KB range which benefit from the splitting. The
> data collected showed that splitting accesses larger than 32KB has an
> adverse effect on performance (I guess that sort of makes sense; after
> all, why else would the controller treat 4 consecutive 8KB accesses as
> a larger access and handle them accordingly?). On the other hand, that
> data was collected on code that used reliable write for every portion
> of the split access, so I'm going to have to get some new data...
>
Just want to correct myself: any consecutive write that exceeds 8KB
goes into the 4MB buffer.
Also, according to the vendor, there is no performance penalty for
using reliable write.
This is why, in the patch set, when splitting larger requests (to
improve lifetime by reducing the number of allocation unit (AU)
write/erase cycles), I perform a reliable write for each split block set.
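To make the splitting idea concrete, here is a minimal userspace sketch in Python. It is not the patch itself. The 8KB piece size and the 32KB upper threshold are the figures mentioned in this thread for the Toshiba part; the function name, the 512-byte sector size, and the choice to issue the pieces in reverse order (so that no piece directly follows its predecessor on the device) are assumptions made purely for illustration. It does not model reliable write at all.

```python
SECTOR = 512              # assumed sector size in bytes
SMALL_BUF = 8 * 1024      # random-access buffer size reported in this thread
SPLIT_MAX = 32 * 1024     # above this, splitting did not help in the data collected


def split_write(start_sector, nbytes, piece=SMALL_BUF, split_max=SPLIT_MAX):
    """Return the list of (start_sector, nbytes) writes to issue, in issue order.

    Requests that already fit the small buffer, or that exceed the upper
    threshold, are passed through unchanged.
    """
    if nbytes <= piece or nbytes > split_max:
        return [(start_sector, nbytes)]

    pieces = []
    offset = 0
    while offset < nbytes:
        length = min(piece, nbytes - offset)
        pieces.append((start_sector + offset // SECTOR, length))
        offset += length

    # Hypothetical reordering: issue the last piece first so the card never
    # sees two pieces in ascending, back-to-back order.
    pieces.reverse()
    return pieces
```

For example, `split_write(0, 16 * 1024)` yields two 8KB writes, the one at sector 16 first and the one at sector 0 second. Whether a given card treats descending writes as non-sequential would have to be measured.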
^ permalink raw reply [flat|nested] 117+ messages in thread
* MMC quirks relating to performance/lifetime.
2011-02-11 14:51 ` Arnd Bergmann
2011-02-11 15:20 ` Lei Wen
@ 2011-03-08 6:59 ` Pavel Machek
2011-03-08 14:03 ` Arnd Bergmann
1 sibling, 1 reply; 117+ messages in thread
From: Pavel Machek @ 2011-03-08 6:59 UTC (permalink / raw)
To: linux-arm-kernel
Hi!
> > > I'm not sure if this is the best place to bring this up, but Russell's
> > > name is on a fair share of drivers/mmc code, and there does seem to be
> > > quite a bit of MMC-related discussions. Excuse me in advance if this
> > > isn't the right forum :-).
> > >
> > > Certain MMC vendors (maybe even quite a bit of them) use a pretty
> > > rigid buffering scheme when it comes to handling writes. There is
> > > usually a buffer A for random accesses, and a buffer B for sequential
> > > accesses. For certain Toshiba parts, it looks like buffer A is 8KB
> > > wide, with buffer B being 4MB wide, and all accesses larger than 8KB
> > > effectively equating to 4MB accesses. Worse, consecutive small (8k)
> > > writes are treated as one large sequential access, once again ending
> > > up in buffer B, thus necessitating out-of-order writing to work around
> > > this.
> >
> > Hmmmm, I somehow assumed MMCs would be much more clever than this.
>
> No, these devices are incredibly stupid, or extremely optimized to
> a specific use case (writing large video files to FAT32), depending on how
> you look at them.
>
> > > reorders) them? The thresholds would then be adjustable as
> > > module/kernel parameters based on manfid. I'm asking because I have a
> > > patch now, but it's ugly and hardcoded against a specific manufacturer.
> >
> > How big is the performance difference?
>
> Several orders of magnitude. It is very easy to get a card that can write
> 12 MB/s into a case where it writes no more than 30 KB/s, doing only
> things that happen frequently with ext3.
Ungood.
I guess we should create something like a loopback device, which knows
about flash specifics, and does the right coalescing so that the card
stays in the fast mode?
...or, do we need to create a new, simple filesystem with a layout
similar to fat32, for use on mmc cards?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: MMC quirks relating to performance/lifetime.
2011-03-08 6:59 ` Pavel Machek
@ 2011-03-08 14:03 ` Arnd Bergmann
0 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-03-08 14:03 UTC (permalink / raw)
To: Pavel Machek; +Cc: linux-arm-kernel, Andrei Warkentin, linux-fsdevel, linux-mmc
On Tuesday 08 March 2011, Pavel Machek wrote:
> > >
> > > How big is the performance difference?
> >
> > Several orders of magnitude. It is very easy to get a card that can write
> > 12 MB/s into a case where it writes no more than 30 KB/s, doing only
> > things that happen frequently with ext3.
>
> Ungood.
>
> I guess we should create something like a loopback device, which knows
> about flash specifics, and does the right coalescing so that the card
> stays in the fast mode?
I have listed a few suggestions for areas to work on in my article
at https://lwn.net/Articles/428584/. My idea was to use a device mapper
target, as described in https://wiki.linaro.org/WorkingGroups/KernelConsolidation/Projects/FlashDeviceMapper
but a loopback device might work as well.
The other area that I think will help a lot is to make the I/O
scheduler aware of the erase block size and the preferred access
patterns.
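One simple thing an erase-block-aware layer could do is avoid issuing a single request that straddles two erase blocks. The Python sketch below only shows the arithmetic; the function name and the 4 MB default are assumptions for illustration (4 MB is the segment size used as an example elsewhere in this thread), and it is not based on any existing block-layer interface.

```python
ERASE_BLOCK = 4 * 1024 * 1024   # assumed erase block size in bytes


def split_at_erase_blocks(offset, length, erase_block=ERASE_BLOCK):
    """Yield (offset, length) pieces, none of which crosses an erase block boundary."""
    end = offset + length
    while offset < end:
        # First byte of the erase block following the one that holds `offset`.
        boundary = (offset // erase_block + 1) * erase_block
        piece_end = min(end, boundary)
        yield offset, piece_end - offset
        offset = piece_end
```

A request that starts 4 KB before a boundary and is 8 KB long comes out as two 4 KB pieces, one on each side of the boundary; a request that lies entirely inside one erase block is returned unchanged.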
> ...or, do we need to create a new, simple filesystem with a layout
> similar to fat32, for use on mmc cards?
It doesn't need to be similar to fat32, but creating a new file system
could fix this, too. Microsoft seems to have built ExFAT around
cheap flash devices, though they don't document what that does exactly.
I think we can do better than that, and I still want to find out
how close nilfs2 and btrfs can actually get to the optimum.
Note that it's not just MMC cards though, you get the exact same
effects on some low-end SSDs (which are basically repackaged CF
cards) and most USB sticks. The best USB sticks I have seen
can hide some effects with a bit of caching, and they have a higher
number of open segments than the cheap ones, but the basic
problems are unchanged.
The requirements for a good low-end flash-optimized file system
would be roughly:
1. Do all writes in chunks of 32 or 64 KB. If there is less
   data to write, fill the chunk with zeroes and clean up later,
   but don't write more data to the same chunk.
2. Start writing on a segment (e.g. 4 MB, configurable) boundary,
   then write that segment to the end using the chunks mentioned
   above.
3. Erase full segments using trim/erase/discard before writing
   to them, if supported by the drive.
4. Have a configurable number of segments open for writing, i.e.
   segments where you have written blocks at the start but not
   yet filled the segment to the end. Typical hardware limitations
   are between 1 and 10 open segments.
5. Keep all metadata within a single 4 MB segment. Drives that cannot
   do random access within normal segments can do it in the area
   that holds the FAT. If 4 MB is not enough, the FAT area can be
   used as a journal or cache for a larger metadata area that gets
   written less frequently.
6. Because of the requirement to erase 4 MB chunks at once, there
   needs to be garbage collection to free up space. The quality
   of the garbage collection algorithm directly relates to the
   performance on full file systems and/or the space overhead.
7. Some static wear levelling is required to increase the expected
   life of consumer devices that only do dynamic wear levelling,
   i.e. the segments that contain purely static data need to
   be written occasionally so they make it back into the
   wear levelling pool of the hardware.
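As a rough illustration of requirements 1 and 2 only, here is a small Python sketch of the address arithmetic. All names are hypothetical, the 64 KB chunk and 4 MB segment sizes are just the example values from the list, and nothing here touches a real device: it pads each write to a whole number of chunks and hands out offsets front to back within one open segment.

```python
CHUNK = 64 * 1024            # example chunk size from requirement 1
SEGMENT = 4 * 1024 * 1024    # example segment size from requirement 2


def round_up(value, unit):
    """Round a non-negative integer up to the next multiple of unit."""
    return -(-value // unit) * unit


def pad_to_chunk(data, chunk=CHUNK):
    """Zero-fill data so that its length is a whole number of chunks."""
    return data + bytes(round_up(len(data), chunk) - len(data))


class SegmentWriter:
    """Hands out device offsets for one open segment, strictly front to back."""

    def __init__(self, segment_index, segment=SEGMENT, chunk=CHUNK):
        self.base = segment_index * segment   # segment-aligned start offset
        self.segment = segment
        self.chunk = chunk
        self.used = 0                         # bytes already written in this segment

    def append(self, data):
        """Return (device_offset, padded_data), or None if the data no longer fits."""
        padded = pad_to_chunk(data, self.chunk)
        if self.used + len(padded) > self.segment:
            return None
        offset = self.base + self.used
        self.used += len(padded)
        return offset, padded
```

A caller that gets `None` back would close this segment and open a fresh, erased one, subject to the open-segment limit in requirement 4. Garbage collection and wear levelling (requirements 6 and 7) are deliberately left out.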
Arnd
^ permalink raw reply [flat|nested] 117+ messages in thread
end of thread, other threads:[~2011-03-08 14:03 UTC | newest]
Thread overview: 117+ messages
2011-02-08 21:22 MMC quirks relating to performance/lifetime Andrei Warkentin
2011-02-08 21:38 ` Wolfram Sang
2011-02-08 21:38 ` Wolfram Sang
2011-02-08 22:42 ` Russell King - ARM Linux
2011-02-09 8:37 ` Linus Walleij
2011-02-09 8:37 ` Linus Walleij
2011-02-09 9:13 ` Arnd Bergmann
2011-02-09 9:13 ` Arnd Bergmann
2011-02-11 22:33 ` Andrei Warkentin
2011-02-11 22:33 ` Andrei Warkentin
2011-02-12 17:05 ` Arnd Bergmann
2011-02-12 17:05 ` Arnd Bergmann
2011-02-12 17:33 ` Andrei Warkentin
2011-02-12 17:33 ` Andrei Warkentin
2011-02-12 18:22 ` Arnd Bergmann
2011-02-12 18:22 ` Arnd Bergmann
2011-02-18 1:10 ` Andrei Warkentin
2011-02-18 1:10 ` Andrei Warkentin
2011-02-18 13:44 ` Arnd Bergmann
2011-02-18 13:44 ` Arnd Bergmann
2011-02-18 19:47 ` Andrei Warkentin
2011-02-18 19:47 ` Andrei Warkentin
2011-02-18 22:40 ` Andrei Warkentin
2011-02-18 22:40 ` Andrei Warkentin
2011-02-18 23:17 ` Andrei Warkentin
2011-02-18 23:17 ` Andrei Warkentin
2011-02-19 11:20 ` Arnd Bergmann
2011-02-19 11:20 ` Arnd Bergmann
2011-02-20 5:56 ` Andrei Warkentin
2011-02-20 5:56 ` Andrei Warkentin
2011-02-20 15:23 ` Arnd Bergmann
2011-02-20 15:23 ` Arnd Bergmann
2011-02-22 7:05 ` Andrei Warkentin
2011-02-22 7:05 ` Andrei Warkentin
2011-02-22 16:49 ` Arnd Bergmann
2011-02-22 16:49 ` Arnd Bergmann
2011-02-19 9:54 ` Arnd Bergmann
2011-02-19 9:54 ` Arnd Bergmann
2011-02-20 4:39 ` Andrei Warkentin
2011-02-20 4:39 ` Andrei Warkentin
2011-02-20 15:03 ` Arnd Bergmann
2011-02-20 15:03 ` Arnd Bergmann
2011-02-22 6:42 ` Andrei Warkentin
2011-02-22 6:42 ` Andrei Warkentin
2011-02-22 16:42 ` Arnd Bergmann
2011-02-22 16:42 ` Arnd Bergmann
2011-02-11 23:23 ` Linus Walleij
2011-02-11 23:23 ` Linus Walleij
2011-02-12 10:45 ` Arnd Bergmann
2011-02-12 10:45 ` Arnd Bergmann
2011-02-12 10:59 ` Russell King - ARM Linux
2011-02-12 10:59 ` Russell King - ARM Linux
2011-02-12 16:28 ` Arnd Bergmann
2011-02-12 16:28 ` Arnd Bergmann
2011-02-12 16:37 ` Russell King - ARM Linux
2011-02-12 16:37 ` Russell King - ARM Linux
2011-02-11 22:27 ` Andrei Warkentin
2011-02-11 22:27 ` Andrei Warkentin
2011-02-12 18:37 ` Arnd Bergmann
2011-02-12 18:37 ` Arnd Bergmann
2011-02-13 0:10 ` Andrei Warkentin
2011-02-13 0:10 ` Andrei Warkentin
2011-02-13 17:39 ` Arnd Bergmann
2011-02-13 17:39 ` Arnd Bergmann
2011-02-14 19:29 ` Andrei Warkentin
2011-02-14 19:29 ` Andrei Warkentin
2011-02-14 20:22 ` Arnd Bergmann
2011-02-14 20:22 ` Arnd Bergmann
2011-02-14 22:25 ` Andrei Warkentin
2011-02-14 22:25 ` Andrei Warkentin
2011-02-15 17:16 ` Arnd Bergmann
2011-02-15 17:16 ` Arnd Bergmann
2011-02-17 2:08 ` Andrei Warkentin
2011-02-17 2:08 ` Andrei Warkentin
2011-02-17 15:47 ` Arnd Bergmann
2011-02-17 15:47 ` Arnd Bergmann
2011-02-20 11:27 ` Andrei Warkentin
2011-02-20 11:27 ` Andrei Warkentin
2011-02-20 14:39 ` Arnd Bergmann
2011-02-20 14:39 ` Arnd Bergmann
2011-02-22 7:46 ` Andrei Warkentin
2011-02-22 7:46 ` Andrei Warkentin
2011-02-22 17:00 ` Arnd Bergmann
2011-02-22 17:00 ` Arnd Bergmann
2011-02-23 10:19 ` Andrei Warkentin
2011-02-23 10:19 ` Andrei Warkentin
2011-02-23 16:09 ` Arnd Bergmann
2011-02-23 16:09 ` Arnd Bergmann
2011-02-23 22:26 ` Andrei Warkentin
2011-02-23 22:26 ` Andrei Warkentin
2011-02-24 9:24 ` Arnd Bergmann
2011-02-24 9:24 ` Arnd Bergmann
2011-02-25 11:02 ` Andrei Warkentin
2011-02-25 11:02 ` Andrei Warkentin
2011-02-25 12:21 ` Arnd Bergmann
2011-02-25 12:21 ` Arnd Bergmann
2011-03-01 18:48 ` Jens Axboe
2011-03-01 18:48 ` Jens Axboe
2011-03-01 19:11 ` Arnd Bergmann
2011-03-01 19:11 ` Arnd Bergmann
2011-03-01 19:15 ` Jens Axboe
2011-03-01 19:15 ` Jens Axboe
2011-03-01 19:51 ` Arnd Bergmann
2011-03-01 19:51 ` Arnd Bergmann
2011-03-01 21:33 ` Andrei Warkentin
2011-03-01 21:33 ` Andrei Warkentin
2011-03-02 10:34 ` Andrei Warkentin
2011-03-02 10:34 ` Andrei Warkentin
2011-03-05 9:23 ` Andrei Warkentin
2011-03-05 9:23 ` Andrei Warkentin
2011-02-11 14:41 ` Pavel Machek
2011-02-11 14:51 ` Arnd Bergmann
2011-02-11 15:20 ` Lei Wen
2011-02-11 15:25 ` Arnd Bergmann
2011-03-08 6:59 ` Pavel Machek
2011-03-08 14:03 ` Arnd Bergmann
2011-03-08 14:03 ` Arnd Bergmann