* MMC quirks relating to performance/lifetime.
@ 2011-02-08 21:22 Andrei Warkentin
  2011-02-08 21:38   ` Wolfram Sang
                   ` (3 more replies)
  0 siblings, 4 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-08 21:22 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

I'm not sure if this is the best place to bring this up, but Russell's
name is on a fair share of drivers/mmc code, and there does seem to be
quite a bit of MMC-related discussion. Excuse me in advance if this
isn't the right forum :-).

Certain MMC vendors (maybe even quite a few of them) use a pretty
rigid buffering scheme when it comes to handling writes. There is
usually a buffer A for random accesses, and a buffer B for sequential
accesses. For certain Toshiba parts, it looks like buffer A is 8KB
wide, with buffer B being 4MB wide, and all accesses larger than 8KB
effectively equating to 4MB accesses. Worse, consecutive small (8k)
writes are treated as one large sequential access, once again ending
up in buffer B, thus necessitating out-of-order writing to work around
this.

What this means is decreased life span for the parts, and it also
means a performance impact on small writes, but the first item is much
more crucial, especially for smaller parts.

As I've mentioned, probably more vendors are affected. How about a
generic MMC_BLOCK quirk that splits (and optionally reorders) the
requests? The thresholds would then be adjustable as module/kernel
parameters based on manfid. I'm asking because I have a patch now,
but it's ugly and hardcoded against a specific manufacturer.

Thanks,
A
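
To make the proposed split-and-reorder quirk concrete, here is a minimal
userspace sketch in C. It is not the patch discussed in this thread. The
8 KB threshold comes from the description above; the names (struct chunk,
split_write, SPLIT_BYTES) and the policy of issuing the pieces highest
address first are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE   512u
#define SPLIT_BYTES   (8u * 1024u)              /* "buffer A" size reported above */
#define SPLIT_SECTORS (SPLIT_BYTES / SECTOR_SIZE)

/* Hypothetical description of one piece of a split write. */
struct chunk {
	uint64_t start;   /* first sector of the piece */
	uint32_t count;   /* number of sectors in the piece */
};

/*
 * Split the write [start, start + count) into pieces of at most
 * SPLIT_SECTORS sectors and store them in 'out' highest address first,
 * so that the pieces are not issued as one ascending sequential stream.
 * Returns the number of pieces, or 0 if 'out' is too small.
 */
size_t split_write(uint64_t start, uint32_t count,
		   struct chunk *out, size_t max_chunks)
{
	uint64_t n = ((uint64_t)count + SPLIT_SECTORS - 1) / SPLIT_SECTORS;
	uint64_t i;

	assert(out != NULL);
	if (n > max_chunks)
		return 0;

	for (i = 0; i < n; i++) {
		uint32_t off = (uint32_t)(i * SPLIT_SECTORS);
		uint32_t left = count - off;

		/* Piece i (in address order) lands at slot n - 1 - i. */
		out[n - 1 - i].start = start + off;
		out[n - 1 - i].count = left < SPLIT_SECTORS ? left : SPLIT_SECTORS;
	}
	return (size_t)n;
}
```

A 40-sector write starting at sector 100 would come out as three pieces,
issued as sectors 132-139, then 116-131, then 100-115. Whether descending
order is enough to defeat a given card's sequential detection is something
that would have to be measured per card.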

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-08 21:22 MMC quirks relating to performance/lifetime Andrei Warkentin
@ 2011-02-08 21:38   ` Wolfram Sang
  2011-02-08 22:42 ` Russell King - ARM Linux
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 117+ messages in thread
From: Wolfram Sang @ 2011-02-08 21:38 UTC (permalink / raw)
  To: Andrei Warkentin; +Cc: linux-arm-kernel, linux-mmc

On Tue, Feb 08, 2011 at 03:22:59PM -0600, Andrei Warkentin wrote:
> Hi,
> 
> I'm not sure if this is the best place to bring this up, but Russel's
> name is on a fair share of drivers/mmc code, and there does seem to be
> quite a bit of MMC-related discussions. Excuse me in advance if this
> isn't the right forum :-).

Searching for MMC in MAINTAINERS will get you:

MULTIMEDIA CARD (MMC), SECURE DIGITAL (SD) AND SDIO SUBSYSTEM
M:      Chris Ball <cjb@laptop.org>
L:      linux-mmc@vger.kernel.org
...

List CCed...

> Certain MMC vendors (maybe even quite a bit of them) use a pretty
> rigid buffering scheme when it comes to handling writes. There is
> usually a buffer A for random accesses, and a buffer B for sequential
> accesses. For certain Toshiba parts, it looks like buffer A is 8KB
> wide, with buffer B being 4MB wide, and all accesses larger than 8KB
> effectively equating to 4MB accesses. Worse, consecutive small (8k)
> writes are treated as one large sequential access, once again ending
> up in buffer B, thus necessitating out-of-order writing to work around
> this.
> 
> What this means is decreased life span for the parts, and it also
> means a performance impact on small writes, but the first item is much
> more crucial, especially for smaller parts.
> 
> As I've mentioned, probably more vendors are affected. How about a
> generic MMC_BLOCK quirk that splits the requests (and optionally
> reorders) them? The thresholds would then be adjustable as
> module/kernel parameters based on manfid. I'm asking because I have a
> patch now, but its ugly and hardcoded against a specific manufacturer.
> 
> Thanks,
> A
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

-- 
Pengutronix e.K.                           | Wolfram Sang                |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |

^ permalink raw reply	[flat|nested] 117+ messages in thread

* MMC quirks relating to performance/lifetime.
  2011-02-08 21:22 MMC quirks relating to performance/lifetime Andrei Warkentin
  2011-02-08 21:38   ` Wolfram Sang
@ 2011-02-08 22:42 ` Russell King - ARM Linux
  2011-02-09  8:37   ` Linus Walleij
  2011-02-11 14:41 ` Pavel Machek
  3 siblings, 0 replies; 117+ messages in thread
From: Russell King - ARM Linux @ 2011-02-08 22:42 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Feb 08, 2011 at 03:22:59PM -0600, Andrei Warkentin wrote:
> I'm not sure if this is the best place to bring this up, but Russel's
> name is on a fair share of drivers/mmc code, and there does seem to be
> quite a bit of MMC-related discussions. Excuse me in advance if this
> isn't the right forum :-).

I dropped out of MMC stuff once we had a functional infrastructure
in place in the kernel - before that, there were various competing
implementations around.

The implementation that's there was based off what meager information
was available on the MMC protocol, as published by some of the card
manufacturers.  Certainly no one had the backing to be able to get the
official specifications and such like, nor to approach the various
companies to get the sort of details you're talking about.

So, what's there is basically a best-effort to provide something usable
and which works (most of the time.)  And to reflect that, error handling
is almost non-existent.

As part of trying to get better performance out of PIO-based interfaces,
I've recently been putting some effort into making the mmc block driver
a little more rugged in the face of various communication errors.

That's not to say that I'm now taking an active interest in MMC - I'm
not.  I'm just fixing the occasional issue which causes me problems.

As for what you're talking about (controlling the coalescing of requests),
I think you're better off sorting that out with the higher block layers
to restrict the amount of coalescing that happens there.  I think there
are some hooks already in place which allow you to define the maximum
size of any request, but this doesn't take account of read/write
properties.  Maybe that's something the higher block layer should be
extended with?

If so, you'll have to discuss it with the block layer folk.

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-08 21:22 MMC quirks relating to performance/lifetime Andrei Warkentin
@ 2011-02-09  8:37   ` Linus Walleij
  2011-02-08 22:42 ` Russell King - ARM Linux
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 117+ messages in thread
From: Linus Walleij @ 2011-02-09  8:37 UTC (permalink / raw)
  To: Andrei Warkentin, linux-mmc; +Cc: linux-arm-kernel

[Quoting verbatim so the original mail hits linux-mmc, this is very
interesting!]

2011/2/8 Andrei Warkentin <andreiw@motorola.com>:
> Hi,
>
> I'm not sure if this is the best place to bring this up, but Russel's
> name is on a fair share of drivers/mmc code, and there does seem to be
> quite a bit of MMC-related discussions. Excuse me in advance if this
> isn't the right forum :-).
>
> Certain MMC vendors (maybe even quite a bit of them) use a pretty
> rigid buffering scheme when it comes to handling writes. There is
> usually a buffer A for random accesses, and a buffer B for sequential
> accesses. For certain Toshiba parts, it looks like buffer A is 8KB
> wide, with buffer B being 4MB wide, and all accesses larger than 8KB
> effectively equating to 4MB accesses. Worse, consecutive small (8k)
> writes are treated as one large sequential access, once again ending
> up in buffer B, thus necessitating out-of-order writing to work around
> this.
>
> What this means is decreased life span for the parts, and it also
> means a performance impact on small writes, but the first item is much
> more crucial, especially for smaller parts.
>
> As I've mentioned, probably more vendors are affected. How about a
> generic MMC_BLOCK quirk that splits the requests (and optionally
> reorders) them? The thresholds would then be adjustable as
> module/kernel parameters based on manfid. I'm asking because I have a
> patch now, but its ugly and hardcoded against a specific manufacturer.

There is a quirk API so that specific quirks can be flagged for certain
vendors and cards, e.g. some Toshibas in this case. For an example, grep
the kernel source for MMC_QUIRK_BLKSZ_FOR_BYTE_MODE.
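
The idea of flagging quirks per vendor can be modelled with a small lookup
table. The following is a userspace sketch in C, not the kernel's quirk
API: QUIRK_SPLIT_SMALL_WRITES, struct card_quirk and lookup_quirks are
hypothetical names, and the manufacturer ID 0x0002 is only a guess for
Toshiba that appears elsewhere in this thread.

```c
#include <stddef.h>

/* Hypothetical quirk bit for this sketch; not an existing kernel flag. */
#define QUIRK_SPLIT_SMALL_WRITES (1u << 0)

struct card_quirk {
	unsigned int manfid;   /* manufacturer ID taken from the card's CID */
	unsigned int quirks;   /* quirk bits to apply to matching cards */
};

/* 0x0002 is assumed (not confirmed) to identify the affected vendor. */
static const struct card_quirk quirk_table[] = {
	{ 0x0002, QUIRK_SPLIT_SMALL_WRITES },
};

/* Return the quirk bits for a manufacturer ID, or 0 if none match. */
unsigned int lookup_quirks(unsigned int manfid)
{
	size_t i;

	for (i = 0; i < sizeof(quirk_table) / sizeof(quirk_table[0]); i++)
		if (quirk_table[i].manfid == manfid)
			return quirk_table[i].quirks;
	return 0;
}
```

Keeping the match data in a table means a newly identified card only needs
a new entry, which is the main attraction over hardcoding one manufacturer.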

But as Russell says this probably needs to be signalled up to the
block layer to be handled properly.

Why don't you post the code you have today as an RFC patch?
I think many will be interested.

Yours,
Linus Walleij

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-09  8:37   ` Linus Walleij
@ 2011-02-09  9:13     ` Arnd Bergmann
  -1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-09  9:13 UTC (permalink / raw)
  To: linux-arm-kernel; +Cc: Linus Walleij, Andrei Warkentin, linux-mmc

On Wednesday 09 February 2011 09:37:40 Linus Walleij wrote:
> [Quoting in verbatin so the orginal mail hits linux-mmc, this is very
> interesting!]
> 
> 2011/2/8 Andrei Warkentin <andreiw@motorola.com>:
> > Hi,
> >
> > I'm not sure if this is the best place to bring this up, but Russel's
> > name is on a fair share of drivers/mmc code, and there does seem to be
> > quite a bit of MMC-related discussions. Excuse me in advance if this
> > isn't the right forum :-).
> >
> > Certain MMC vendors (maybe even quite a bit of them) use a pretty
> > rigid buffering scheme when it comes to handling writes. There is
> > usually a buffer A for random accesses, and a buffer B for sequential
> > accesses. For certain Toshiba parts, it looks like buffer A is 8KB
> > wide, with buffer B being 4MB wide, and all accesses larger than 8KB
> > effectively equating to 4MB accesses. Worse, consecutive small (8k)
> > writes are treated as one large sequential access, once again ending
> > up in buffer B, thus necessitating out-of-order writing to work around
> > this.

It's more complex, but I now have a pretty good understanding of
what the flash media actually do, after doing a lot of benchmarking.
Most of my results so far are documented on

https://wiki.linaro.org/WorkingGroups/KernelConsolidation/Projects/FlashCardSurvey

but I still need to write about the more recent discoveries.

What you describe as buffer A is the "page size" of the underlying
flash. It depends on the size and brand of the NAND flash chip and
can be anywhere between 2 KB and 16 KB for modern cards, depending
on how they combine multiple chips and planes within the chips.

What you describe as buffer B is sometimes called an "erase block
group" or an "allocation unit". This is the smallest unit that
gets kept in a global lookup table in the medium and can be anywhere
between 1 MB and 8 MB for cards larger than 4 GB, or as small as
128 KB (a single erase block) for smaller media, as far as I have
seen. When you don't write full aligned allocation units, the
card will have to eventually do garbage collection on the allocation
unit, which can take a long time (many milliseconds).

Most cards have a third size, typically somewhere between 32 and 128 KB,
which is the optimum size for writes. While you can do linear
writes to the card in page size units (writing an allocation unit
from start to finish), random access within the allocation unit
is much faster with larger writes.
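
The relationship between a write and the allocation units it touches can
be sketched in C as follows. This is an illustration only: the function
names are made up, and the allocation unit size has to be supplied by the
caller (for example 4 MB for the card discussed in this thread), since
nothing here probes the card.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True if the write [offset, offset + len) starts on an allocation unit
 * boundary and covers only whole allocation units, i.e. the case that
 * should not leave a partially written unit behind for garbage collection.
 */
bool covers_whole_aus(uint64_t offset, uint64_t len, uint64_t au_size)
{
	assert(au_size != 0);
	return len != 0 && offset % au_size == 0 && len % au_size == 0;
}

/* Number of allocation units that the write [offset, offset + len) touches. */
uint64_t aus_touched(uint64_t offset, uint64_t len, uint64_t au_size)
{
	assert(au_size != 0 && len != 0);
	return (offset + len - 1) / au_size - offset / au_size + 1;
}
```

With a 4 MB allocation unit, an 8 KB write that straddles a unit boundary
touches two units, and neither is written in full.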

> > What this means is decreased life span for the parts, and it also
> > means a performance impact on small writes, but the first item is much
> > more crucial, especially for smaller parts.
> >
> > As I've mentioned, probably more vendors are affected. How about a
> > generic MMC_BLOCK quirk that splits the requests (and optionally
> > reorders) them? The thresholds would then be adjustable as
> > module/kernel parameters based on manfid. I'm asking because I have a
> > patch now, but its ugly and hardcoded against a specific manufacturer.

It's not just MMC specific: USB flash drives, CF cards and even cheap
PATA or SATA SSDs have the same patterns. I think this will need
to be solved on a higher level, in the block device elevator code
and in the file systems.

> There is a quirk API so that specific quirks can be flagged for certain
> vendors and cards, e.g. some Toshibas in this case. e.g. grep the
> kernel source for MMC_QUIRK_BLKSZ_FOR_BYTE_MODE.
> 
> But as Russell says this probably needs to be signalled up to the
> block layer to be handled properly.
> 
> Why don't you post the code you have today as an RFC: patch,
> I think many will be interested?

Yes, I agree, that would be good. Also, I'd be interested to see the
output of 'head /sys/block/mmcblk0/device/*' on that card. I'm guessing
that the manufacturer ID of 0x0002 is Toshiba, and these are indeed
the worst cards that I have seen so far, because they cannot do
random access within an allocation unit, and they cannot alternate
writes between multiple allocation units (# open AUs linear is "1" in
my wiki table), while most cards can do at least two.
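
For checking the manufacturer ID programmatically rather than with 'head',
a small C sketch follows. It assumes the ID is exposed as a hexadecimal
string in a sysfs file such as /sys/block/mmcblk0/device/manfid; the
function names are hypothetical and the path is passed in by the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Parse a hexadecimal manufacturer ID such as "0x000002\n".
 * Returns 0 and stores the value in *out on success, -1 on bad input.
 */
int parse_manfid(const char *s, unsigned int *out)
{
	char *end;
	unsigned long v;

	assert(s != NULL && out != NULL);
	errno = 0;
	v = strtoul(s, &end, 16);   /* base 16 accepts an optional "0x" prefix */
	if (errno != 0 || end == s || (*end != '\0' && *end != '\n'))
		return -1;
	*out = (unsigned int)v;
	return 0;
}

/* Read the first line of 'path' and parse it as a manufacturer ID. */
int read_manfid(const char *path, unsigned int *out)
{
	char buf[32];
	FILE *f = fopen(path, "r");
	int ret = -1;

	if (!f)
		return -1;
	if (fgets(buf, sizeof(buf), f))
		ret = parse_manfid(buf, out);
	fclose(f);
	return ret;
}
```

A caller would pass the sysfs path of the card in question to read_manfid()
and compare the result against the IDs it wants to treat specially.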

Andrei, I'm certainly interested in working with you on this.
The point you brought up about the Toshiba cards being especially
bad is certainly valid. Even if we do something better in the block
layer, we need to have a way to detect the worst-case scenario,
so we can work around that.

	Arnd

^ permalink raw reply	[flat|nested] 117+ messages in thread

* MMC quirks relating to performance/lifetime.
  2011-02-08 21:22 MMC quirks relating to performance/lifetime Andrei Warkentin
                   ` (2 preceding siblings ...)
  2011-02-09  8:37   ` Linus Walleij
@ 2011-02-11 14:41 ` Pavel Machek
  2011-02-11 14:51   ` Arnd Bergmann
  3 siblings, 1 reply; 117+ messages in thread
From: Pavel Machek @ 2011-02-11 14:41 UTC (permalink / raw)
  To: linux-arm-kernel

Hi!

> I'm not sure if this is the best place to bring this up, but Russel's
> name is on a fair share of drivers/mmc code, and there does seem to be
> quite a bit of MMC-related discussions. Excuse me in advance if this
> isn't the right forum :-).
> 
> Certain MMC vendors (maybe even quite a bit of them) use a pretty
> rigid buffering scheme when it comes to handling writes. There is
> usually a buffer A for random accesses, and a buffer B for sequential
> accesses. For certain Toshiba parts, it looks like buffer A is 8KB
> wide, with buffer B being 4MB wide, and all accesses larger than 8KB
> effectively equating to 4MB accesses. Worse, consecutive small (8k)
> writes are treated as one large sequential access, once again ending
> up in buffer B, thus necessitating out-of-order writing to work around
> this.

Hmmmm, I somehow assumed MMCs would be much more clever than this.

> reorders) them? The thresholds would then be adjustable as
> module/kernel parameters based on manfid. I'm asking because I have a
> patch now, but its ugly and hardcoded against a specific manufacturer.

How big is the performance difference?
								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 117+ messages in thread

* MMC quirks relating to performance/lifetime.
  2011-02-11 14:41 ` Pavel Machek
@ 2011-02-11 14:51   ` Arnd Bergmann
  2011-02-11 15:20     ` Lei Wen
  2011-03-08  6:59     ` Pavel Machek
  0 siblings, 2 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-11 14:51 UTC (permalink / raw)
  To: linux-arm-kernel

On Friday 11 February 2011, Pavel Machek wrote:
> Hi!
> 
> > I'm not sure if this is the best place to bring this up, but Russel's
> > name is on a fair share of drivers/mmc code, and there does seem to be
> > quite a bit of MMC-related discussions. Excuse me in advance if this
> > isn't the right forum :-).
> > 
> > Certain MMC vendors (maybe even quite a bit of them) use a pretty
> > rigid buffering scheme when it comes to handling writes. There is
> > usually a buffer A for random accesses, and a buffer B for sequential
> > accesses. For certain Toshiba parts, it looks like buffer A is 8KB
> > wide, with buffer B being 4MB wide, and all accesses larger than 8KB
> > effectively equating to 4MB accesses. Worse, consecutive small (8k)
> > writes are treated as one large sequential access, once again ending
> > up in buffer B, thus necessitating out-of-order writing to work around
> > this.
> 
> Hmmmm, I somehow assumed MMCs would be much more cleverr than this.

No, these devices are incredibly stupid, or extremely optimized to
a specific use case (writing large video files to FAT32), depending on how
you look at them.

> > reorders) them? The thresholds would then be adjustable as
> > module/kernel parameters based on manfid. I'm asking because I have a
> > patch now, but its ugly and hardcoded against a specific manufacturer.
> 
> How big is performance difference?

Several orders of magnitude. It is very easy to get a card that can write
12 MB/s into a case where it writes no more than 30 KB/s, doing only
things that happen frequently with ext3.

	Arnd

^ permalink raw reply	[flat|nested] 117+ messages in thread

* MMC quirks relating to performance/lifetime.
  2011-02-11 14:51   ` Arnd Bergmann
@ 2011-02-11 15:20     ` Lei Wen
  2011-02-11 15:25       ` Arnd Bergmann
  2011-03-08  6:59     ` Pavel Machek
  1 sibling, 1 reply; 117+ messages in thread
From: Lei Wen @ 2011-02-11 15:20 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Feb 11, 2011 at 10:51 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Friday 11 February 2011, Pavel Machek wrote:
>> Hi!
>>
>> > I'm not sure if this is the best place to bring this up, but Russel's
>> > name is on a fair share of drivers/mmc code, and there does seem to be
>> > quite a bit of MMC-related discussions. Excuse me in advance if this
>> > isn't the right forum :-).
>> >
>> > Certain MMC vendors (maybe even quite a bit of them) use a pretty
>> > rigid buffering scheme when it comes to handling writes. There is
>> > usually a buffer A for random accesses, and a buffer B for sequential
>> > accesses. For certain Toshiba parts, it looks like buffer A is 8KB
>> > wide, with buffer B being 4MB wide, and all accesses larger than 8KB
>> > effectively equating to 4MB accesses. Worse, consecutive small (8k)
>> > writes are treated as one large sequential access, once again ending
>> > up in buffer B, thus necessitating out-of-order writing to work around
>> > this.
>>
>> Hmmmm, I somehow assumed MMCs would be much more cleverr than this.
>
> No, these devices are incredibly stupid, or extremely optimized to
> a specific use case (writing large video files to FAT32), depending on how
> you look at them.
>
>> > reorders) them? The thresholds would then be adjustable as
>> > module/kernel parameters based on manfid. I'm asking because I have a
>> > patch now, but its ugly and hardcoded against a specific manufacturer.
>>
>> How big is performance difference?
>
> Several orders of magnitude. It is very easy to get a card that can write
> 12 MB/s into a case where it writes no more than 30 KB/s, doing only
> things that happen frequently with ext3.
>

Maybe we could get that case into the mmc_test code, so that we could
later track whether it has been fixed or not? Or, in other words, to prove
whether the firmware in the SD card is stupid or not. :)

Best regards,
Lei

^ permalink raw reply	[flat|nested] 117+ messages in thread

* MMC quirks relating to performance/lifetime.
  2011-02-11 15:20     ` Lei Wen
@ 2011-02-11 15:25       ` Arnd Bergmann
  0 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-11 15:25 UTC (permalink / raw)
  To: linux-arm-kernel

On Friday 11 February 2011, Lei Wen wrote:
> > Several orders of magnitude. It is very easy to get a card that can write
> > 12 MB/s into a case where it writes no more than 30 KB/s, doing only
> > things that happen frequently with ext3.
> >
> 
> Maybe we could get that case into mmc_test code, so that we could track
> that in latter whether it already be fixed or not? Or in other word, to prove
> the firmware in sd card is stupid or not. :)
 
There are many kinds of stupid, and a lot of cards are. I've actually had
excellent success with simply measuring from user space, which is
much easier than in mmc_test.

Unfortunately, you have to write to the card to do that, which may destroy
the data even if you write the same data that is already on it.
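
As a rough idea of what measuring from user space can look like, here is a
minimal C sketch. It is not the tool mentioned in this message: the
function names are hypothetical, it only times plain write() calls
followed by fsync(), and it overwrites whatever the descriptor points at,
so it should only ever be aimed at a scratch file or a card whose contents
do not matter.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() */
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Convert a byte count and an elapsed time into KiB per second. */
double kib_per_sec(size_t bytes, double seconds)
{
	assert(seconds > 0.0);
	return (double)bytes / 1024.0 / seconds;
}

/*
 * Write 'count' blocks of 'block' bytes to 'fd' at its current offset,
 * flush with fsync(), and return the elapsed time in seconds.
 * Returns a negative value if allocation, a write, or the flush fails.
 */
double time_writes(int fd, size_t block, unsigned int count)
{
	struct timespec t0, t1;
	char *buf = malloc(block);
	unsigned int i;
	int failed = 0;

	if (!buf)
		return -1.0;
	memset(buf, 0xA5, block);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < count && !failed; i++)
		if (write(fd, buf, block) != (ssize_t)block)
			failed = 1;
	if (!failed && fsync(fd) != 0)
		failed = 1;
	clock_gettime(CLOCK_MONOTONIC, &t1);

	free(buf);
	if (failed)
		return -1.0;
	return (double)(t1.tv_sec - t0.tv_sec) +
	       (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Running time_writes() with different block sizes and start offsets
(positioned with lseek() beforehand) and feeding the result to
kib_per_sec() gives a crude way to compare access patterns on one card.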

See
https://wiki.linaro.org/WorkingGroups/KernelConsolidation/Projects/FlashCardSurvey
for most of my results. I'm about to write up a better paper with all the
measurements, and will make my tools available soon.


	Arnd

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-09  8:37   ` Linus Walleij
@ 2011-02-11 22:27     ` Andrei Warkentin
  -1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-11 22:27 UTC (permalink / raw)
  To: Linus Walleij; +Cc: linux-mmc, linux-arm-kernel

On Wed, Feb 9, 2011 at 2:37 AM, Linus Walleij <linus.walleij@linaro.org> wrote:
> [Quoting in verbatin so the orginal mail hits linux-mmc, this is very
> interesting!]
>
> 2011/2/8 Andrei Warkentin <andreiw@motorola.com>:
>> Hi,
>>
>> I'm not sure if this is the best place to bring this up, but Russel's
>> name is on a fair share of drivers/mmc code, and there does seem to be
>> quite a bit of MMC-related discussions. Excuse me in advance if this
>> isn't the right forum :-).
>>
>> Certain MMC vendors (maybe even quite a bit of them) use a pretty
>> rigid buffering scheme when it comes to handling writes. There is
>> usually a buffer A for random accesses, and a buffer B for sequential
>> accesses. For certain Toshiba parts, it looks like buffer A is 8KB
>> wide, with buffer B being 4MB wide, and all accesses larger than 8KB
>> effectively equating to 4MB accesses. Worse, consecutive small (8k)
>> writes are treated as one large sequential access, once again ending
>> up in buffer B, thus necessitating out-of-order writing to work around
>> this.
>>
>> What this means is decreased life span for the parts, and it also
>> means a performance impact on small writes, but the first item is much
>> more crucial, especially for smaller parts.
>>
>> As I've mentioned, probably more vendors are affected. How about a
>> generic MMC_BLOCK quirk that splits the requests (and optionally
>> reorders) them? The thresholds would then be adjustable as
>> module/kernel parameters based on manfid. I'm asking because I have a
>> patch now, but its ugly and hardcoded against a specific manufacturer.
>
> There is a quirk API so that specific quirks can be flagged for certain
> vendors and cards, e.g. some Toshibas in this case. e.g. grep the
> kernel source for MMC_QUIRK_BLKSZ_FOR_BYTE_MODE.
>
> But as Russell says this probably needs to be signalled up to the
> block layer to be handled properly.
>
> Why don't you post the code you have today as an RFC: patch,
> I think many will be interested?
>
> Yours,
> Linus Walleij
>

I think it's worthwhile to make the upper block layers aware of
MMC (and apparently other flash memory) limitations, but I think as a
first step it could make sense (for me) to reformat the patch I am
attaching into something that looks better.

Don't take the attached patch too seriously :-).

Thanks,
A

[-- Attachment #2: toshiba_emmc_opt.patch --]
[-- Type: text/x-diff, Size: 8738 bytes --]

diff --git a/drivers/mmc/card/block.c b/drivers/mmc/card/block.c
index 7054fd5..3b32329 100644
--- a/drivers/mmc/card/block.c
+++ b/drivers/mmc/card/block.c
@@ -60,6 +60,7 @@ struct mmc_blk_data {
 	spinlock_t	lock;
 	struct gendisk	*disk;
 	struct mmc_queue queue;
+	char            *bounce;
 
 	unsigned int	usage;
 	unsigned int	read_only;
@@ -93,6 +94,9 @@ static void mmc_blk_put(struct mmc_blk_data *md)
 
 		__clear_bit(devidx, dev_use);
 
+		if (md->bounce)
+			kfree(md->bounce);
+
 		put_disk(md->disk);
 		kfree(md);
 	}
@@ -312,6 +316,157 @@ out:
 	return err ? 0 : 1;
 }
 
+/*
+ * Workaround for Toshiba eMMC performance.  If the request is less than two
+ * flash pages in size, then we want to split the write into one or two
+ * page-aligned writes to take advantage of faster buffering.  Here we can
+ * adjust the size of the MMC request and let the block layer request handler
+ * deal with generating another MMC request.
+ */
+#define TOSHIBA_MANFID 0x11
+#define TOSHIBA_PAGE_SIZE 16		/* sectors */
+#define TOSHIBA_ADJUST_THRESHOLD 24	/* sectors */
+static bool mmc_adjust_toshiba_write(struct mmc_card *card,
+                                     struct mmc_request *mrq)
+{
+	if (mmc_card_mmc(card) && card->cid.manfid == TOSHIBA_MANFID &&
+	    mrq->data->blocks <= TOSHIBA_ADJUST_THRESHOLD) {
+		int sectors_in_page = TOSHIBA_PAGE_SIZE -
+		                      (mrq->cmd->arg % TOSHIBA_PAGE_SIZE);
+		if (mrq->data->blocks > sectors_in_page) {
+			mrq->data->blocks = sectors_in_page;
+			return true;
+		}
+	}
+
+	return false;
+}
+
+/*
+ * This is another strange workaround to try to close the gap on Toshiba eMMC
+ * performance when compared to other vendors.  In order to take advantage
+ * of certain optimizations and assumptions in those cards, we will look for
+ * multiblock write transfers below a certain size and we do the following:
+ *
+ * - Break them up into separate page-aligned (8k flash pages) transfers.
+ * - Execute the transfers in reverse order.
+ * - Use "reliable write" transfer mode.
+ *
+ * Neither the block I/O layer nor the scatterlist design seem to lend them-
+ * selves well to executing a block request out of order.  So instead we let
+ * mmc_blk_issue_rq() setup the MMC request for the entire transfer and then
+ * break it up and reorder it here.  This also requires that we put the data
+ * into a bounce buffer and send it as individual sg's.
+ */
+#define TOSHIBA_LOW_THRESHOLD 48	/* sectors */
+#define TOSHIBA_HIGH_THRESHOLD 64	/* sectors */
+static bool mmc_handle_toshiba_write(struct mmc_queue *mq,
+                                     struct mmc_card *card,
+                                     struct mmc_request *mrq)
+{
+	struct mmc_blk_data *md = mq->data;
+	unsigned int first_page, last_page, page;
+	unsigned long flags;
+
+	if (!md->bounce ||
+	    mrq->data->blocks > TOSHIBA_HIGH_THRESHOLD ||
+	    mrq->data->blocks < TOSHIBA_LOW_THRESHOLD)
+		return false;
+
+	first_page = mrq->cmd->arg / TOSHIBA_PAGE_SIZE;
+	last_page = (mrq->cmd->arg + mrq->data->blocks - 1) / TOSHIBA_PAGE_SIZE;
+
+	/* Single page write: just do it the normal way */
+	if (first_page == last_page)
+		return false;
+
+	local_irq_save(flags);
+	sg_copy_to_buffer(mrq->data->sg, mrq->data->sg_len,
+	                  md->bounce, mrq->data->blocks * 512);
+	local_irq_restore(flags);
+
+	for (page = last_page; page >= first_page; page--) {
+		unsigned long offset, length;
+		struct mmc_blk_request brq;
+		struct mmc_command cmd;
+		struct scatterlist sg;
+
+		memset(&brq, 0, sizeof(struct mmc_blk_request));
+		brq.mrq.cmd = &brq.cmd;
+		brq.mrq.data = &brq.data;
+
+		brq.cmd.arg = page * TOSHIBA_PAGE_SIZE;
+		brq.data.blksz = 512;
+		if (page == first_page) {
+			brq.cmd.arg = mrq->cmd->arg;
+			brq.data.blocks = TOSHIBA_PAGE_SIZE -
+			                  (mrq->cmd->arg % TOSHIBA_PAGE_SIZE);
+		} else if (page == last_page)
+			brq.data.blocks = (mrq->cmd->arg + mrq->data->blocks) %
+			                  TOSHIBA_PAGE_SIZE;
+		if (brq.data.blocks == 0)
+			brq.data.blocks = TOSHIBA_PAGE_SIZE;
+
+		if (!mmc_card_blockaddr(card))
+			brq.cmd.arg <<= 9;
+		brq.cmd.flags = MMC_RSP_SPI_R1 | MMC_RSP_R1 | MMC_CMD_ADTC;
+		brq.stop.opcode = MMC_STOP_TRANSMISSION;
+		brq.stop.arg = 0;
+		brq.stop.flags = MMC_RSP_SPI_R1B | MMC_RSP_R1B | MMC_CMD_AC;
+
+		brq.data.flags |= MMC_DATA_WRITE;
+		if (brq.data.blocks > 1) {
+			if (!mmc_host_is_spi(card->host))
+				brq.mrq.stop = &brq.stop;
+			brq.cmd.opcode = MMC_WRITE_MULTIPLE_BLOCK;
+		} else {
+			brq.mrq.stop = NULL;
+			brq.cmd.opcode = MMC_WRITE_BLOCK;
+		}
+
+		if (brq.cmd.opcode == MMC_WRITE_MULTIPLE_BLOCK &&
+		    brq.data.blocks <= card->ext_csd.rel_wr_sec_c) {
+			int err;
+
+			cmd.opcode = MMC_SET_BLOCK_COUNT;
+			cmd.arg = brq.data.blocks | (1 << 31);
+			cmd.flags = MMC_RSP_R1 | MMC_CMD_AC;
+			err = mmc_wait_for_cmd(card->host, &cmd, 0);
+			if (!err)
+				brq.mrq.stop = NULL;
+		}
+
+		mmc_set_data_timeout(&brq.data, card);
+
+		offset = (brq.cmd.arg - mrq->cmd->arg) * 512;
+		length = brq.data.blocks * 512;
+		sg_init_one(&sg, md->bounce + offset, length);
+		brq.data.sg = &sg;
+		brq.data.sg_len = 1;
+
+		mmc_wait_for_req(card->host, &brq.mrq);
+
+		mrq->data->bytes_xfered += brq.data.bytes_xfered;
+
+		if (brq.cmd.error || brq.data.error || brq.stop.error) {
+			mrq->cmd->error = brq.cmd.error;
+			mrq->data->error = brq.data.error;
+			mrq->stop->error = brq.stop.error;
+
+			/*
+			 * We're executing the request backwards, so don't let
+			 * the block layer think some part of it has succeeded.
+			 * It will get it wrong.  Since the failure will cause
+			 * us to fall back on single block writes, we're better
+			 * off reporting that none of the data was written.
+			 */
+			mrq->data->bytes_xfered = 0;
+			break;
+		}
+	}
+
+	return true;
+}
 static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *req)
 {
 	struct mmc_blk_data *md = mq->data;
@@ -378,6 +533,9 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *req)
 			brq.data.flags |= MMC_DATA_WRITE;
 		}
 
+		if (rq_data_dir(req) == WRITE)
+			mmc_adjust_toshiba_write(card, &brq.mrq);
+
 		mmc_set_data_timeout(&brq.data, card);
 
 		brq.data.sg = mq->sg;
@@ -402,9 +560,14 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *req)
 			brq.data.sg_len = i;
 		}
 
-		mmc_queue_bounce_pre(mq);
-
-		mmc_wait_for_req(card->host, &brq.mrq);
+		mmc_queue_bounce_pre(mq);
+
+		/*
+		 * Try the workaround first for writes, then fall back.
+		 */
+		if (rq_data_dir(req) != WRITE || disable_multi ||
+		    !mmc_handle_toshiba_write(mq, card, &brq.mrq))
+			mmc_wait_for_req(card->host, &brq.mrq);
 
 		mmc_queue_bounce_post(mq);
 
@@ -589,6 +752,15 @@ static struct mmc_blk_data *mmc_blk_alloc(struct mmc_card *card)
 		goto out;
 	}
 
+	if (card->cid.manfid == TOSHIBA_MANFID && mmc_card_mmc(card)) {
+		pr_info("%s: enable Toshiba workaround\n",
+			mmc_hostname(card->host));
+		md->bounce = kmalloc(TOSHIBA_HIGH_THRESHOLD * 512, GFP_KERNEL);
+		if (!md->bounce) {
+			ret = -ENOMEM;
+			goto err_kfree;
+		}
+	}
 
 	/*
 	 * Set the read-only status based on the supported commands
@@ -655,6 +827,8 @@ static struct mmc_blk_data *mmc_blk_alloc(struct mmc_card *card)
  err_putdisk:
 	put_disk(md->disk);
  err_kfree:
+	if (md->bounce)
+		kfree(md->bounce);
 	kfree(md);
  out:
 	return ERR_PTR(ret);
diff --git a/drivers/mmc/core/mmc.c b/drivers/mmc/core/mmc.c
index 45055c4..17eef89 100644
--- a/drivers/mmc/core/mmc.c
+++ b/drivers/mmc/core/mmc.c
@@ -307,6 +307,9 @@ static int mmc_read_ext_csd(struct mmc_card *card)
 	else
 		card->erased_byte = 0x0;
 
+	if (card->ext_csd.rev >= 5)
+		card->ext_csd.rel_wr_sec_c = ext_csd[EXT_CSD_REL_WR_SEC_C];
+
 out:
 	kfree(ext_csd);
 
diff --git a/include/linux/mmc/card.h b/include/linux/mmc/card.h
index 6b75250..fea7ecb 100644
--- a/include/linux/mmc/card.h
+++ b/include/linux/mmc/card.h
@@ -43,6 +43,7 @@ struct mmc_csd {
 
 struct mmc_ext_csd {
 	u8			rev;
+        u8                      rel_wr_sec_c;
 	u8			erase_group_def;
 	u8			sec_feature_support;
 	unsigned int		sa_timeout;		/* Units: 100ns */
diff --git a/include/linux/mmc/mmc.h b/include/linux/mmc/mmc.h
index a5d765c..1e87020 100644
--- a/include/linux/mmc/mmc.h
+++ b/include/linux/mmc/mmc.h
@@ -260,6 +260,7 @@ struct _mmc_csd {
 #define EXT_CSD_CARD_TYPE		196	/* RO */
 #define EXT_CSD_SEC_CNT			212	/* RO, 4 bytes */
 #define EXT_CSD_S_A_TIMEOUT		217	/* RO */
+#define EXT_CSD_REL_WR_SEC_C            222
 #define EXT_CSD_ERASE_TIMEOUT_MULT	223	/* RO */
 #define EXT_CSD_HC_ERASE_GRP_SIZE	224	/* RO */
 #define EXT_CSD_BOOT_SIZE_MULTI		226
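
For readers who want to follow the arithmetic in mmc_handle_toshiba_write()
without a kernel tree, here is a minimal user-space sketch of the same split:
a sector range is cut at 16-sector (8 KB) page boundaries and the pieces are
listed last page first. It is only an illustration, not the patch itself. The
names split_reverse(), set_block_count_arg() and struct chunk are made up for
this sketch, and it assumes that start + blocks does not overflow 32 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SECTORS 16u	/* TOSHIBA_PAGE_SIZE in the patch: 16 x 512 B = 8 KB */

/* Hypothetical helper type: one page-aligned piece of a larger write. */
struct chunk {
	uint32_t start;		/* first sector of the piece */
	uint32_t count;		/* number of sectors in the piece */
};

/*
 * Split [start, start + blocks) at page boundaries, last page first.
 * Returns the number of chunks stored in out[], or 0 if blocks is 0
 * or out[] is too small.
 */
size_t split_reverse(uint32_t start, uint32_t blocks,
		     struct chunk *out, size_t max)
{
	uint32_t first_page, last_page, page;
	size_t n = 0;

	assert(out != NULL);
	if (blocks == 0)
		return 0;

	first_page = start / PAGE_SECTORS;
	last_page = (start + blocks - 1) / PAGE_SECTORS;
	if ((size_t)(last_page - first_page) + 1 > max)
		return 0;

	/*
	 * Count down with a post-decrement test.  With an unsigned counter,
	 * a "page >= first_page" test never becomes false when first_page
	 * is 0, so the sketch avoids that form.
	 */
	for (page = last_page + 1; page-- > first_page; ) {
		uint32_t lo = page * PAGE_SECTORS;
		uint32_t hi = lo + PAGE_SECTORS;

		if (lo < start)
			lo = start;		/* partial first page */
		if (hi > start + blocks)
			hi = start + blocks;	/* partial last page */
		out[n].start = lo;
		out[n].count = hi - lo;
		n++;
	}
	return n;
}

/*
 * CMD23 (SET_BLOCK_COUNT) argument as the patch builds it: block count
 * in the low 16 bits, bit 31 set to request a reliable write.
 */
uint32_t set_block_count_arg(uint32_t blocks, int reliable)
{
	return (blocks & 0xffffu) | (reliable ? UINT32_C(1) << 31 : 0);
}
```

For example, a 50-sector write starting at sector 10 would come out as four
pieces: sectors 48-59, 32-47, 16-31 and 10-15, in that order.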

^ permalink raw reply related	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-09  9:13     ` Arnd Bergmann
@ 2011-02-11 22:33       ` Andrei Warkentin
  -1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-11 22:33 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: Linus Walleij, linux-mmc, linux-arm-kernel

On Wed, Feb 9, 2011 at 3:13 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Wednesday 09 February 2011 09:37:40 Linus Walleij wrote:
>> [Quoting verbatim so the original mail hits linux-mmc, this is very
>> interesting!]
>>
>> 2011/2/8 Andrei Warkentin <andreiw@motorola.com>:
>> > Hi,
>> >
>> > I'm not sure if this is the best place to bring this up, but Russel's
>> > name is on a fair share of drivers/mmc code, and there does seem to be
>> > quite a bit of MMC-related discussions. Excuse me in advance if this
>> > isn't the right forum :-).
>> >
>> > Certain MMC vendors (maybe even quite a bit of them) use a pretty
>> > rigid buffering scheme when it comes to handling writes. There is
>> > usually a buffer A for random accesses, and a buffer B for sequential
>> > accesses. For certain Toshiba parts, it looks like buffer A is 8KB
>> > wide, with buffer B being 4MB wide, and all accesses larger than 8KB
>> > effectively equating to 4MB accesses. Worse, consecutive small (8k)
>> > writes are treated as one large sequential access, once again ending
>> > up in buffer B, thus necessitating out-of-order writing to work around
>> > this.
>
> It's more complex, but I now have a pretty good understanding of
> what the flash media actually do, after doing a lot of benchmarking.
> Most of my results so far are documented on
>
> https://wiki.linaro.org/WorkingGroups/KernelConsolidation/Projects/FlashCardSurvey
>
> but I still need to write about the more recent discoveries.
>
> What you describe as buffer A is the "page size" of the underlying
> flash. It depends on the size and brand of the NAND flash chip and
> can be anywhere between 2 KB and 16 KB for modern cards, depending
> on how they combine multiple chips and planes within the chips.
>
> What you describe as buffer B is sometimes called an "erase block
> group" or an "allocation unit". This is the smallest unit that
> gets kept in a global lookup table in the medium and can be anywhere
> between 1 MB and 8 MB for cards larger than 4 GB, or as small as
> 128 KB (a single erase block) for smaller media, as far as I have
> seen. When you don't write full aligned allocation units, the
> card will have to eventually do garbage collection on the allocation
> unit, which can take a long time (many milliseconds).
>
> Most cards have a third size, typically somewhere between 32 and 128 KB,
> which is the optimum size for writes. While you can do linear
> writes to the card in page size units (writing an allocation unit
> from start to finish), doing random access within the allocation unit
> will be much faster doing larger writes.
>
>> > What this means is decreased life span for the parts, and it also
>> > means a performance impact on small writes, but the first item is much
>> > more crucial, especially for smaller parts.
>> >
>> > As I've mentioned, probably more vendors are affected. How about a
>> > generic MMC_BLOCK quirk that splits the requests (and optionally
>> > reorders) them? The thresholds would then be adjustable as
>> > module/kernel parameters based on manfid. I'm asking because I have a
>> > patch now, but its ugly and hardcoded against a specific manufacturer.
>
> It's not just MMC specific: USB flash drives, CF cards and even cheap
> PATA or SATA SSDs have the same patterns. I think this will need
> to be solved on a higher level, in the block device elevator code
> and in the file systems.
>
>> There is a quirk API so that specific quirks can be flagged for certain
>> vendors and cards, e.g. some Toshibas in this case. e.g. grep the
>> kernel source for MMC_QUIRK_BLKSZ_FOR_BYTE_MODE.
>>
>> But as Russell says this probably needs to be signalled up to the
>> block layer to be handled properly.
>>
>> Why don't you post the code you have today as an RFC: patch,
>> I think many will be interested?
>
> Yes, I agree, that would be good. Also, I'd be interested to see the
> output of 'head /sys/block/mmcblk0/device/*' on that card. I'm guessing
> that the manufacturer ID of 0x0002 is Toshiba, and these are indeed
> the worst cards that I have seen so far, because they can not do
> random access within an allocation unit, and they can not write to
> multiple allocation units alternating (# open AUs linear is "1" in
> my wiki table), while most cards can do at least two.
>
> Andrei, I'm certainly interested in working with you on this.
> The point you brought up about the toshiba cards being especially
> bad is certainly valid; even if we do something better in the block
> layer, we need to have a way to detect the worst-case scenario,
> so we can work around that.
>
>        Arnd
>

Arnd,

Yes, this is a Toshiba card. I've sent the patch as a reply to Linus' email.

cid - 02010053454d3332479070cc51451d00
csd - d00f00320f5903ffffffffff92404000
erase_size - 524288
fwrev - 0x0
hwrev - 0x0
manfid - 0x000002
name - SEM32G
oemid - 0x0100
preferred_erase_size - 2097152
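
As a cross-check of the values above, the manfid, oemid and name fields can
be recovered from the raw cid string. The sketch below assumes the MMC (not
SD) CID layout: manufacturer ID in the top byte, a 16-bit OEM field after it,
then a six-character product name. The function decode_mmc_cid() and struct
cid_fields are hypothetical names for this illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical container for the three fields decoded below. */
struct cid_fields {
	unsigned int manfid;	/* CID bits [127:120] */
	unsigned int oemid;	/* CID bits [119:104] */
	char name[7];		/* CID bits [103:56], six ASCII characters */
};

/* Convert one hex digit to its value, or return -1 if it is not hex. */
static int hex_nibble(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/*
 * Decode a 32-digit hex CID string as printed in sysfs.
 * Returns 0 on success, -1 if the string is malformed.
 */
int decode_mmc_cid(const char *hex, struct cid_fields *out)
{
	unsigned char raw[16];
	size_t i;

	assert(hex != NULL && out != NULL);
	if (strlen(hex) != 32)
		return -1;

	for (i = 0; i < 16; i++) {
		int hi = hex_nibble(hex[2 * i]);
		int lo = hex_nibble(hex[2 * i + 1]);

		if (hi < 0 || lo < 0)
			return -1;
		raw[i] = (unsigned char)(hi << 4 | lo);
	}

	out->manfid = raw[0];
	out->oemid = (unsigned int)raw[1] << 8 | raw[2];
	memcpy(out->name, &raw[3], 6);
	out->name[6] = '\0';
	return 0;
}
```

Fed the cid string above, this yields manfid 0x02, oemid 0x0100 and the name
"SEM32G", matching the sysfs attributes. For scale, erase_size is 512 KB and
preferred_erase_size is 2 MB.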

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-09  9:13     ` Arnd Bergmann
@ 2011-02-11 23:23       ` Linus Walleij
  -1 siblings, 0 replies; 117+ messages in thread
From: Linus Walleij @ 2011-02-11 23:23 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-arm-kernel, Andrei Warkentin, linux-mmc,
	Sebastian Rasmussen, Ulf Hansson

2011/2/9 Arnd Bergmann <arnd@arndb.de>:

> Most of my results so far are documented on
> https://wiki.linaro.org/WorkingGroups/KernelConsolidation/Projects/FlashCardSurvey

H'm! That's an interesting resource indeed. When you write
"From measurements, it appears that the size in which data is
managed is typically 64 kb on SD cards" and "the size of the
medium is always a multiple of entire allocation groups, and
the most common size today is 4 MB" and then list
Size, Allocation Unit, Write Size, Page Size, FAT Location,
open AUs linear, open AUs random, Algorithm.

How exactly do you measure that?

I'm sort of smelling a card-probe.git with this tool that you
can run on your device and get out data like that listed
in your table. We have a rather large stash of cards we can
probe for you to get that kind of data out if it is useful, and
I believe other Linaro members may have such stuff too,
if empirical data is useful to your work.

Yours,
Linus Walleij

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-11 23:23       ` Linus Walleij
@ 2011-02-12 10:45         ` Arnd Bergmann
  -1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-12 10:45 UTC (permalink / raw)
  To: Linus Walleij
  Cc: linux-arm-kernel, Andrei Warkentin, linux-mmc,
	Sebastian Rasmussen, Ulf Hansson

On Saturday 12 February 2011 00:23:37 Linus Walleij wrote:
> H'm! That's an interesting resource indeed. When you write
> "From measurements, it appears that the size in which data is
> managed is typically 64 kb on SD cards" and "the size of the
> medium is always a multiple of entire allocation groups, and
> the most common size today is 4 MB" and then list
> Size, Allocation Unit, Write Size, Page Size, FAT Location,
> open AUs linear, open AUs random, Algorithm.
> 
> How exactly do you measure that?

It's not an exact science, but for most cards I have found
reasonably good ways to identify these numbers:

* the allocation unit size can almost always be found
  using read-only tests: reading 2kb across an allocation
  unit boundary is slightly slower than reading 2kb
  just before or just after the boundary.
  For a few cards where this doesn't work, I do write tests.
  After finding out how many allocation units can be open,
  it's trivial to find out the size.

* Finding the number of open allocation units means I write
  to the start of a few AUs alternating. Up to a certain
  number, the throughput is constant, above that, it drops
  sharply, sometimes by one or two orders of magnitude.

* The page size can also be found doing read-only tests, with
  varying block sizes. Smaller reads always give lower throughput
  than larger reads, but getting smaller than page size
  drops down significantly more than the difference between
  multi-page reads. This effect is more prominent in write tests.

* Finding the algorithm basically means I write an allocation
  unit using varying block sizes two times, using both linear
  access and random access. Cards that are optimized for
  linear access can be unbelievably slow in the random access
  tests. Sometimes the performance is the same above a specific
  block size, but slower for random access below that size.
  This is the write block size.

* Finding the write block size in cases where this is not the
  case can be harder. Most cards have a noticeable performance
  drop in writes of less than a few pages, so that's the
  size I put in the table.

* The FAT location is clearly visible in a number of tests
  done inside of an allocation unit. It's normally slower for
  linear access, but faster for random access. Sometimes
  reading the FAT is also slower than reading elsewhere.
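
Two of the steps above reduce to simple arithmetic once the timings are in
hand. The sketch below is not flashbench code; it only illustrates the idea.
straddle_offset() picks the byte offset for a small read centered on an
allocation unit boundary, and open_au_limit() looks for the point where
throughput collapses as more AUs are written alternately. Both function
names and the drop_ratio threshold are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Byte offset of a read of len bytes centered on the k-th allocation
 * unit boundary (k >= 1), so that half of the read falls in each AU.
 */
uint64_t straddle_offset(uint64_t au_bytes, uint64_t k, uint64_t len)
{
	assert(k >= 1 && len / 2 <= k * au_bytes);
	return k * au_bytes - len / 2;
}

/*
 * throughput[i] is the measured write throughput (any consistent unit)
 * while alternating between i + 1 allocation units.  Returns the largest
 * AU count whose throughput stays at or above drop_ratio times the
 * single-AU figure, i.e. an estimate of how many AUs can be open at once.
 */
size_t open_au_limit(const double *throughput, size_t n, double drop_ratio)
{
	size_t i;

	assert(throughput != NULL && n > 0);
	for (i = 1; i < n; i++) {
		if (throughput[i] < drop_ratio * throughput[0])
			return i;	/* i AUs were fine, i + 1 were not */
	}
	return n;	/* no sharp drop within the tested range */
}
```

Since the drop is described as sharp, sometimes one or two orders of
magnitude, a loose threshold such as 0.5 should be enough to find it; the
exact value is a guess, not something taken from the measurements.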

> I'm sort of smelling a card-probe.git with this tool that you
> can run on your device and get out data like that listed
> in your table. We have a rather large stash of cards we can
> probe for you to get that kind of data out if it is useful, and
> I believe other Linaro members may have such stuff too,
> if empirical data is useful to your work.

The tool I'm using is on http://git.linaro.org/gitweb?p=people/arnd/flashbench.git
Unfortunately, it's not yet in a state where I would recommend that
anyone besides me run it. I'm still rewriting the source
for every new card I get to nail down the specific properties.

I will make an announcement when I have the tool in a state
of general usefulness, and at that point I would really
appreciate people running it, but just not yet.

	Arnd

^ permalink raw reply	[flat|nested] 117+ messages in thread

* MMC quirks relating to performance/lifetime.
@ 2011-02-12 10:45         ` Arnd Bergmann
  0 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-12 10:45 UTC (permalink / raw)
  To: linux-arm-kernel

On Saturday 12 February 2011 00:23:37 Linus Walleij wrote:
> H'm! That's an interesting resource indeed. When you write
> "From measurements, it appears that the size in which data is
> managed is typically 64 kb on SD cards" and "the size of the
> medium is always a multiple of entire allocation groups, and
> the most common size today is 4 MB" and then list
> Size, Allocation Unit, Write Size, Page Size, FAT Location,
> open AUs linear, open AUs random, Algorithm.
> 
> How exactly do you measure that?

It's not an exact science, but for most cards I have found
reasonably good ways to identify these numbers:

* the allocation unit size can almost always be found
  using read-only tests: reading 2kb across an allocation
  unit boundary is slightly slower than reading 2kb
  just before or just after the boundary.
  For a few cards where this doesn't work, I do write tests.
  After finding out how many allocation units can be open,
  it's trivial to find out the size.

* Finding the number of open allocation units means I write
  to the start of a few AUs alternating. Up to a certain
  number, the throughput is constant, above that, it drops
  sharply, sometimes by one or two orders of magnitude.

* The page size can also be found doing read-only tests, with
  varying block sizes. Smaller reads always give lower throughput
  than larger reads, but getting smaller than page size
  drops down significantly more than the difference between
  multi-page reads. This effect is more prominent in write tests.

* Finding the algorithm basically means I write an allocation
  unit using varying block sizes two times, using both linear
  access and random access. Cards that are optimized for
  linear access can be unbelievably slow in the random access
  tests. Sometimes the performance is the same above a specific
  block size, but slower for random access below that size.
  This is the write block size.

* Finding the write block size in cases where this is not the
  case can be harder. Most cards have a noticeable performance
  drop in writes of less than a few pages, so that's the
  size I put in the table.

* The FAT location is clearly visible in a number of tests
  done inside of an allocation unit. It's normally slower for
  linear access, but faster for random access. Sometimes
  reading the FAT is also slower than reading elsewhere.
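
The open-AU test above boils down to looking for the point where
throughput collapses. A minimal sketch of that last step, assuming the
measurements have already been collected into an array; the function
name and the keep_ratio parameter are hypothetical and not taken from
flashbench:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of flashbench: tput[i] is the throughput
 * measured while alternating writes across (i + 1) allocation units.
 * Returns the largest AU count whose throughput is still at least
 * keep_ratio times the single-AU throughput.
 */
static size_t estimate_open_aus(const double *tput, size_t n, double keep_ratio)
{
	size_t i;

	assert(tput != NULL && n > 0);
	assert(keep_ratio > 0.0 && keep_ratio <= 1.0);

	for (i = 1; i < n; i++) {
		if (tput[i] < tput[0] * keep_ratio)
			return i;	/* i AUs were fine, i + 1 were not */
	}
	return n;	/* no drop seen within the measured range */
}
```

Since the drop is described as being up to one or two orders of
magnitude, a generous keep_ratio such as 0.5 should be enough to
separate the two regimes on most cards.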

> I'm sort of smelling a card-probe.git with this tool that you
> can run on your device and get out data like that listed
> in your table. We have a rather large stash of cards we can
> probe for you to get that kind of data out if it is useful, and
> I believe other Linaro members may have such stuff too,
> if empirical data is useful to your work.

The tool I'm using is on http://git.linaro.org/gitweb?p=people/arnd/flashbench.git
Unfortunately, it's not yet in a state where I would recommend
that anyone besides me run it. I'm still rewriting the source
for every new card I get to nail down the specific properties.

I will make an announcement when I have the tool in a state
of general usefulness, and at that point I would really
appreciate it if people ran it, but just not yet.

	Arnd

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-12 10:45         ` Arnd Bergmann
@ 2011-02-12 10:59           ` Russell King - ARM Linux
  -1 siblings, 0 replies; 117+ messages in thread
From: Russell King - ARM Linux @ 2011-02-12 10:59 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Linus Walleij, Ulf Hansson, linux-mmc, Andrei Warkentin,
	linux-arm-kernel, Sebastian Rasmussen

On Sat, Feb 12, 2011 at 11:45:41AM +0100, Arnd Bergmann wrote:
> * The FAT location is clearly visible in a number of tests
>   done inside of an allocation unit. It's normally slower for
>   linear access, but faster for random access. Sometimes
>   reading the FAT is also slower than reading elsewhere.

I also wouldn't be surprised if there are some cards out there which parse
the FAT being written, and start activities (such as erasing clusters)
based upon changes therein.  Such cards would be unsuitable for use with
non-FAT filesystems.

It might be worth devising some sort of check for this kind of behaviour.

Unrelated, I have a USB based device which provides an emulated FAT
filesystem - all files except one on this filesystem are read-only.
The writable file is a textual configuration file.  It can be reliably
updated by Windows based systems, but updates from Linux based systems
are ignored - presumably because updates to the FAT/directory/data
clusters are occurring in a different order.

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-12 10:59           ` Russell King - ARM Linux
@ 2011-02-12 16:28             ` Arnd Bergmann
  -1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-12 16:28 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Linus Walleij, Ulf Hansson, linux-mmc, Andrei Warkentin,
	linux-arm-kernel, Sebastian Rasmussen

On Saturday 12 February 2011 11:59:18 Russell King - ARM Linux wrote:
> On Sat, Feb 12, 2011 at 11:45:41AM +0100, Arnd Bergmann wrote:
> > * The FAT location is clearly visible in a number of tests
> >   done inside of an allocation unit. It's normally slower for
> >   linear access, but faster for random access. Sometimes
> >   reading the FAT is also slower than reading elsewhere.
> 
> I also wouldn't be surprised if there are some cards out there which parse
> the FAT being written, and start activities (such as erasing clusters)
> based upon changes therein.  Such cards would be unsuitable for use with
> non-FAT filesystems.
> 
> It might be worth devising some sort of check for this kind of behaviour.

Possible, but it doesn't seem to happen with any of the cards I have
tested; the controllers in there appear to be too simplistic.
Also, the recommendations for SD cards are to issue explicit erase
requests, which would make this unnecessary.

OTOH, SD cards do specify exactly where the FAT should be stored on
the medium, so it would be possible to make this kind of assumption.

USB sticks and CF cards might be smart enough to actually do it:
some of them have more sophisticated logic than SD cards (most
do not), and there is no USB mass storage command for erase.

> Unrelated, I have a USB based device which provides an emulated FAT
> filesystem - all files except one on this filesystem are read-only.
> The writable file is a textual configuration file.  It can be reliably
> updated by Windows based systems, but updates from Linux based systems
> are ignored - presumably because updates to the FAT/directory/data
> clusters are occurring in a different order.

Fun. I think qemu also comes with one of these FAT emulation layers,
as do some mp3 players, but from what I have heard, they are not as
broken.

	Arnd

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-12 16:28             ` Arnd Bergmann
@ 2011-02-12 16:37               ` Russell King - ARM Linux
  -1 siblings, 0 replies; 117+ messages in thread
From: Russell King - ARM Linux @ 2011-02-12 16:37 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Linus Walleij, Ulf Hansson, linux-mmc, Andrei Warkentin,
	linux-arm-kernel, Sebastian Rasmussen

On Sat, Feb 12, 2011 at 05:28:32PM +0100, Arnd Bergmann wrote:
> On Saturday 12 February 2011 11:59:18 Russell King - ARM Linux wrote:
> > Unrelated, I have a USB based device which provides an emulated FAT
> > filesystem - all files except one on this filesystem are read-only.
> > The writable file is a textual configuration file.  It can be reliably
> > updated by Windows based systems, but updates from Linux based systems
> > are ignored - presumably because updates to the FAT/directory/data
> > clusters are occurring in a different order.
> 
> Fun. I think qemu also comes with one of these FAT emulation layers,
> as do some mp3 players, but from what I have heard, they are not as
> broken.

Given that it is a secure GPS/barographic flight logger which has
approval for ratifying world record flight claims, you may understand why
it has to be extremely picky about how it interfaces with the external
world, especially in restricting updates to modification of the
configuration file while not allowing any of the logged data files to
be changed in any way.

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-11 22:33       ` Andrei Warkentin
@ 2011-02-12 17:05         ` Arnd Bergmann
  -1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-12 17:05 UTC (permalink / raw)
  To: Andrei Warkentin; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc

On Friday 11 February 2011 23:33:42 Andrei Warkentin wrote:
> On Wed, Feb 9, 2011 at 3:13 AM, Arnd Bergmann <arnd@arndb.de> wrote:

> Yes, this is a Toshiba card. I've sent the patch as a reply to Linus' email.
> 
> cid - 02010053454d3332479070cc51451d00
> csd - d00f00320f5903ffffffffff92404000
> erase_size - 524288
> fwrev - 0x0
> hwrev - 0x0
> manfid - 0x000002
> name - SEM32G
> oemid - 0x0100
> preferred_erase_size - 2097152

Very interesting. So the manfid is the same as on most Kingston cards,
but the oemid is different. Most cards have a two-letter ASCII code
in there, 0x544d ("TM") on Kingston cards, and I always assumed that
this stood for "Toshiba Memory".

What is even stranger is the size value (among other fields) in the CSD:
the card claims a size of exactly 32GB, which I find hard to believe,
given that there are always some bad and reserved blocks.

Are you sure that the card you have is authentic? I've heard a lot about
fake USB sticks advertising a size that is much larger than the actual
flash inside of them.

Also, this is the first card I have seen advertise an allocation unit
size of 2 MB (preferred_erase_size); all other cards seem to advertise
4 MB these days, even if they actually have 2 or 8 MB.

	Arnd

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-12 17:05         ` Arnd Bergmann
@ 2011-02-12 17:33           ` Andrei Warkentin
  -1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-12 17:33 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc

On Sat, Feb 12, 2011 at 11:05 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Friday 11 February 2011 23:33:42 Andrei Warkentin wrote:
>> On Wed, Feb 9, 2011 at 3:13 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>
>> Yes, this is a Toshiba card. I've sent the patch as a reply to Linus' email.
>>
>> cid - 02010053454d3332479070cc51451d00
>> csd - d00f00320f5903ffffffffff92404000
>> erase_size - 524288
>> fwrev - 0x0
>> hwrev - 0x0
>> manfid - 0x000002
>> name - SEM32G
>> oemid - 0x0100
>> preferred_erase_size - 2097152
>
> Very interesting. So the manfid is the same as on most Kingston cards,
> but the oemid is different. Most cards have a two-letter ASCII code
> in there, 0x544d ("TM") on Kingston cards, and I always assumed that
> this stood for "Toshiba Memory".
>
> What is even stranger is the size value (among other fields) in the CSD,
> the card claims a size of exactly 32GB, which I find hard to believe,
> given that there are always some bad and reserved blocks.
>
> Are you sure that the card you have is authentic? I've heard a lot about
> fake USB sticks advertising a size that is much larger than the actual
> flash inside of them.
>
> Also this is the first card that I see advertise an allocation unit
> size of 2MB (preferred_erase_size), all other cards seem to advertise
> 4 MB these days, even if they actually have 2 or 8 MB.
>
>        Arnd
>

This is a Toshiba eMMC part. It is 32GB as far as the OS can see and access.

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-12 17:33           ` Andrei Warkentin
@ 2011-02-12 18:22             ` Arnd Bergmann
  -1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-12 18:22 UTC (permalink / raw)
  To: Andrei Warkentin; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc

On Saturday 12 February 2011 18:33:10 Andrei Warkentin wrote:
> On Sat, Feb 12, 2011 at 11:05 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> > On Friday 11 February 2011 23:33:42 Andrei Warkentin wrote:
> >> On Wed, Feb 9, 2011 at 3:13 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> >
> >> Yes, this is a Toshiba card. I've sent the patch as a reply to Linus' email.
> >>
> >> cid - 02010053454d3332479070cc51451d00
> >> csd - d00f 0032 0f59 03ff ffffffff92404000
> >> erase_size - 524288
> >> fwrev - 0x0
> >> hwrev - 0x0
> >> manfid - 0x000002
> >> name - SEM32G
> >> oemid - 0x0100
> >> preferred_erase_size - 2097152
> >
> 
> This is a Toshiba eMMC part. It is 32GB as far as the OS can see and access.

Ah, right, that explains all the values, which make sense for eMMC4
but not for SDHC ;-)
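
For reference, the quoted cid string can be split by hand using the MMC
(rather than SD) field layout. The following is only a sketch: it
assumes the layout the Linux MMC core applies to MMC cards (manfid in
the top byte, a 16-bit oemid after it, then a six-character product
name), and the struct and function names are made up for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Illustrative container, not a kernel structure. */
struct cid_fields {
	unsigned int manfid;	/* bits 127:120 */
	unsigned int oemid;	/* bits 119:104 */
	char name[7];		/* bits 103:56, six ASCII characters */
};

/* Convert one hex digit to its value, or return -1 if it is not hex. */
static int hex_nibble(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	c = (char)tolower((unsigned char)c);
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	return -1;
}

/* Decode a 32-digit hex CID string; returns 0 on success, -1 on bad input. */
static int decode_mmc_cid(const char *hex, struct cid_fields *out)
{
	unsigned char raw[16];
	size_t i;

	assert(hex != NULL && out != NULL);

	if (strlen(hex) != 32)
		return -1;

	for (i = 0; i < 16; i++) {
		int hi = hex_nibble(hex[2 * i]);
		int lo = hex_nibble(hex[2 * i + 1]);

		if (hi < 0 || lo < 0)
			return -1;
		raw[i] = (unsigned char)((hi << 4) | lo);
	}

	out->manfid = raw[0];
	out->oemid = ((unsigned int)raw[1] << 8) | raw[2];
	memcpy(out->name, &raw[3], 6);
	out->name[6] = '\0';
	return 0;
}
```

Fed the cid above, this yields the same manfid, oemid and name that
sysfs reported, which is consistent with the part being eMMC.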

	Arnd

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-11 22:27     ` Andrei Warkentin
@ 2011-02-12 18:37       ` Arnd Bergmann
  -1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-12 18:37 UTC (permalink / raw)
  To: linux-arm-kernel; +Cc: Andrei Warkentin, Linus Walleij, linux-mmc

On Friday 11 February 2011 23:27:51 Andrei Warkentin wrote:
>  
> diff --git a/drivers/mmc/card/block.c b/drivers/mmc/card/block.c
> index 7054fd5..3b32329 100644
> --- a/drivers/mmc/card/block.c
> +++ b/drivers/mmc/card/block.c
> @@ -312,6 +316,157 @@ out:
>  	return err ? 0 : 1;
>  }
>  
> +/*
> + * Workaround for Toshiba eMMC performance.  If the request is less than two
> + * flash pages in size, then we want to split the write into one or two
> + * page-aligned writes to take advantage of faster buffering.  Here we can
> + * adjust the size of the MMC request and let the block layer request handler
> + * deal with generating another MMC request.
> + */
> +#define TOSHIBA_MANFID 0x11
> +#define TOSHIBA_PAGE_SIZE 16		/* sectors */
> +#define TOSHIBA_ADJUST_THRESHOLD 24	/* sectors */
> +static bool mmc_adjust_toshiba_write(struct mmc_card *card,
> +                                     struct mmc_request *mrq)
> +{
> +	if (mmc_card_mmc(card) && card->cid.manfid == TOSHIBA_MANFID &&
> +	    mrq->data->blocks <= TOSHIBA_ADJUST_THRESHOLD) {
> +		int sectors_in_page = TOSHIBA_PAGE_SIZE -
> +		                      (mrq->cmd->arg % TOSHIBA_PAGE_SIZE);
> +		if (mrq->data->blocks > sectors_in_page) {
> +			mrq->data->blocks = sectors_in_page;
> +			return true;
> +		}
> +	}
> +
> +	return false;
> +}

This part might make sense in general, though it's hard to know the
page size in the general case. For many SD cards, writing naturally
aligned 64 KB blocks was the ideal case in my testing, but some need
larger alignment or can deal well with smaller blocks.
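
To make that concrete, here is a standalone restatement of the trimming
arithmetic from the quoted mmc_adjust_toshiba_write(), with the page
size and threshold passed in rather than hardcoded. It is a sketch only:
the function and parameter names are hypothetical, and it assumes the
start address is given in sectors.

```c
#include <assert.h>

/*
 * Returns the number of sectors the first transfer should carry so that
 * it ends on a page boundary. A result smaller than `blocks` means the
 * request was trimmed and the remainder must be issued separately.
 */
static unsigned int first_chunk_sectors(unsigned int start,
					unsigned int blocks,
					unsigned int page_sectors,
					unsigned int threshold)
{
	unsigned int to_boundary;

	assert(page_sectors > 0);

	if (blocks > threshold)
		return blocks;		/* large request: leave untouched */

	/* Sectors left in the page that contains `start`. */
	to_boundary = page_sectors - (start % page_sectors);
	return blocks > to_boundary ? to_boundary : blocks;
}
```

With the values from the patch (16-sector pages, threshold 24), a
12-sector write starting at sector 10 is trimmed to 6 sectors, and the
remaining 6 start page-aligned at sector 16.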

> +/*
> + * This is another strange workaround to try to close the gap on Toshiba eMMC
> + * performance when compared to other vendors.  In order to take advantage
> + * of certain optimizations and assumptions in those cards, we will look for
> + * multiblock write transfers below a certain size and we do the following:
> + *
> + * - Break them up into separate page-aligned (8k flash pages) transfers.
> + * - Execute the transfers in reverse order.
> + * - Use "reliable write" transfer mode.
> + *
> + * Neither the block I/O layer nor the scatterlist design seem to lend them-
> + * selves well to executing a block request out of order.  So instead we let
> + * mmc_blk_issue_rq() setup the MMC request for the entire transfer and then
> + * break it up and reorder it here.  This also requires that we put the data
> + * into a bounce buffer and send it as individual sg's.
> + */

A lot of the SD cards I've seen will react very badly to reverse order,
so that is definitely a dangerous thing to put into the code.

Also, the "reliable write" seems like a really interesting thing to
rely on for performance. I believe what the card is trying to do here
is to optimize FAT32 directory updates. By using the small blocks in
unpredictable order (anything but linear), you tell the card to treat
this as part of a directory, so it probably gets written in a different
way, but that might mean that it will try to turn the current erase
block group into a special small write mode.

I could imagine that this will cause problems on your eMMC once you
write small blocks to more than one erase block group, because that
probably causes it to start garbage collection -- it makes sense for the
cards to know that something is a directory, but they can only know about
a small number of directories, so they will turn the segment into a
regular one as soon as something else becomes a directory.

	Arnd 

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-12 18:37       ` Arnd Bergmann
@ 2011-02-13  0:10         ` Andrei Warkentin
  -1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-13  0:10 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc

On Sat, Feb 12, 2011 at 12:37 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Friday 11 February 2011 23:27:51 Andrei Warkentin wrote:
>>
>> diff --git a/drivers/mmc/card/block.c b/drivers/mmc/card/block.c
>> index 7054fd5..3b32329 100644
>> --- a/drivers/mmc/card/block.c
>> +++ b/drivers/mmc/card/block.c
>> @@ -312,6 +316,157 @@ out:
>>       return err ? 0 : 1;
>>  }
>>
>> +/*
>> + * Workaround for Toshiba eMMC performance.  If the request is less than two
>> + * flash pages in size, then we want to split the write into one or two
>> + * page-aligned writes to take advantage of faster buffering.  Here we can
>> + * adjust the size of the MMC request and let the block layer request handler
>> + * deal with generating another MMC request.
>> + */
>> +#define TOSHIBA_MANFID 0x11
>> +#define TOSHIBA_PAGE_SIZE 16         /* sectors */
>> +#define TOSHIBA_ADJUST_THRESHOLD 24  /* sectors */
>> +static bool mmc_adjust_toshiba_write(struct mmc_card *card,
>> +                                     struct mmc_request *mrq)
>> +{
>> +     if (mmc_card_mmc(card) && card->cid.manfid == TOSHIBA_MANFID &&
>> +         mrq->data->blocks <= TOSHIBA_ADJUST_THRESHOLD) {
>> +             int sectors_in_page = TOSHIBA_PAGE_SIZE -
>> +                                   (mrq->cmd->arg % TOSHIBA_PAGE_SIZE);
>> +             if (mrq->data->blocks > sectors_in_page) {
>> +                     mrq->data->blocks = sectors_in_page;
>> +                     return true;
>> +             }
>> +     }
>> +
>> +     return false;
>> +}
>
> This part might make sense in general, though it's hard to know the
> page size in the general case. For many SD cards, writing naturally
> aligned 64 KB blocks was the ideal case in my testing, but some need
> larger alignment or can deal well with smaller blocks.
>

...which is why I believe this should be a per-card boot parameter,
and that it really only makes sense for embedded parts, where you know
nothing else is going to be used as, say, mmcblk0.


>> +/*
>> + * This is another strange workaround to try to close the gap on Toshiba eMMC
>> + * performance when compared to other vendors.  In order to take advantage
>> + * of certain optimizations and assumptions in those cards, we will look for
>> + * multiblock write transfers below a certain size and we do the following:
>> + *
>> + * - Break them up into separate page-aligned (8k flash pages) transfers.
>> + * - Execute the transfers in reverse order.
>> + * - Use "reliable write" transfer mode.
>> + *
>> + * Neither the block I/O layer nor the scatterlist design seem to lend them-
>> + * selves well to executing a block request out of order.  So instead we let
>> + * mmc_blk_issue_rq() setup the MMC request for the entire transfer and then
>> + * break it up and reorder it here.  This also requires that we put the data
>> + * into a bounce buffer and send it as individual sg's.
>> + */
>
> A lot of the SD cards I've seen will react very badly to reverse order,
> so that is definitely a dangerous thing to put into the code.
>
> Also, the "reliable write" seems like a really interesting thing to
> rely on for performance. I believe what the card is trying to do here
> is to optimize FAT32 directory updates. By using the small blocks in
> unpredictable order (anything but linear), you tell the card to treat
> this as part of a directory, so it probably gets written in a different
> way, but that might mean that it will try to turn the current erase
> block group into a special small write mode.
>
> I could imagine that this will cause problems on your eMMC once you
> write small blocks to more than one erase block group, because that
> probably causes it to start garbage collection -- it makes sense for the
> cards to know that something is a directory, but they can only know about
> a small number of directories, so they will turn the segment into a
> regular one as soon as something else becomes a directory.
>

It's difficult for me to argue one way or another. The code provided
is implementing Toshiba's suggestions for mitigating excessive wear.
Basically, as far as certain Android products are concerned, Motorola
created some "typical usage" cases, and collected data logs. These
logs were analyzed by Toshiba, which reported an approx x16
multiplication factor for writes.

Analysis of data written showed that there were many random accesses
with 16KB or 32KB, meaning they go into buffer B. According to T, that
means extra GC and PE cycle. I'm guessing per write.

So T suggested for random data to better go into buffer A. How? Two suggestions.
1) Split smaller accesses into 8KB and write with reliable write.
2) Split smaller accesses into 8KB and write in reverse.

The patch does both and I am verifying if that is really necessary. I
need to go see the mmc spec and what it says about reliable write.
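
To make the two suggestions concrete, here is a minimal user-space sketch of
the splitting step, with the reversal as an option. It is not the code from
the patch: split_request() and struct chunk are hypothetical names, the 8KB
page is expressed as 16 sectors of 512 bytes, and the reliable write mode
itself is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SECTORS 16u	/* 8 KB flash page in 512-byte sectors */

struct chunk {
	uint32_t start;		/* first sector of the chunk */
	uint32_t count;		/* number of sectors in the chunk */
};

/*
 * Hypothetical helper: cut [start, start + blocks) at 8 KB page boundaries
 * and store the pieces in out[]. With reverse != 0 the pieces are emitted
 * highest address first. Returns the number of chunks stored, or 0 if
 * out[] is too small to hold them all.
 */
static size_t split_request(uint32_t start, uint32_t blocks, int reverse,
			    struct chunk *out, size_t max)
{
	size_t n = 0, i;

	assert(out != NULL);
	while (blocks > 0) {
		/* sectors left before the next page boundary */
		uint32_t room = PAGE_SECTORS - (start % PAGE_SECTORS);
		uint32_t count = blocks < room ? blocks : room;

		if (n == max)
			return 0;
		out[n].start = start;
		out[n].count = count;
		n++;
		start += count;
		blocks -= count;
	}

	if (reverse) {
		/* swap the array end for end */
		for (i = 0; i < n / 2; i++) {
			struct chunk tmp = out[i];

			out[i] = out[n - 1 - i];
			out[n - 1 - i] = tmp;
		}
	}
	return n;
}
```

In this sketch a 16KB write (32 sectors) starting at sector 8 becomes three
chunks: 8 sectors at 8, 16 sectors at 16 and 8 sectors at 32. With the
reverse flag set, the same chunks come out starting with the one at sector 32.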

Basically, whatever behavior you choose is going to be wrong for some
set of cards. Which is why tuning it probably only makes sense for eMMC
parts, and should be a set of runtime/compile-time quirks. What do you
think?

^ permalink raw reply	[flat|nested] 117+ messages in thread

* MMC quirks relating to performance/lifetime.
@ 2011-02-13  0:10         ` Andrei Warkentin
  0 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-13  0:10 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Feb 12, 2011 at 12:37 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Friday 11 February 2011 23:27:51 Andrei Warkentin wrote:
>>
>> diff --git a/drivers/mmc/card/block.c b/drivers/mmc/card/block.c
>> index 7054fd5..3b32329 100644
>> --- a/drivers/mmc/card/block.c
>> +++ b/drivers/mmc/card/block.c
>> @@ -312,6 +316,157 @@ out:
>>         return err ? 0 : 1;
>>  }
>>
>> +/*
>> + * Workaround for Toshiba eMMC performance.  If the request is less than two
>> + * flash pages in size, then we want to split the write into one or two
>> + * page-aligned writes to take advantage of faster buffering.  Here we can
>> + * adjust the size of the MMC request and let the block layer request handler
>> + * deal with generating another MMC request.
>> + */
>> +#define TOSHIBA_MANFID 0x11
>> +#define TOSHIBA_PAGE_SIZE 16         /* sectors */
>> +#define TOSHIBA_ADJUST_THRESHOLD 24  /* sectors */
>> +static bool mmc_adjust_toshiba_write(struct mmc_card *card,
>> +                                     struct mmc_request *mrq)
>> +{
>> +       if (mmc_card_mmc(card) && card->cid.manfid == TOSHIBA_MANFID &&
>> +           mrq->data->blocks <= TOSHIBA_ADJUST_THRESHOLD) {
>> +               int sectors_in_page = TOSHIBA_PAGE_SIZE -
>> +                                     (mrq->cmd->arg % TOSHIBA_PAGE_SIZE);
>> +               if (mrq->data->blocks > sectors_in_page) {
>> +                       mrq->data->blocks = sectors_in_page;
>> +                       return true;
>> +               }
>> +       }
>> +
>> +       return false;
>> +}
>
> This part might make sense in general, though it's hard to know the
> page size in the general case. For many SD cards, writing naturally
> aligned 64 KB blocks was the ideal case in my testing, but some need
> larger alignment or can deal well with smaller blocks.
>

...which is why I believe this should be a boot per-card parameter,
and that it really only makes sense for embedded parts, where you know
nothing else is going to be used as, say, mmcblk0.
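
For illustration, the trimming arithmetic in the quoted
mmc_adjust_toshiba_write() can be tried outside the kernel. This is only a
user-space sketch: trim_to_page() is a hypothetical helper, not part of the
patch, the two constants are copied from the patch's TOSHIBA_* defines, and
it assumes the start address is given in sectors (which is what the modulo
in the patch implies for block-addressed cards).

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SECTORS      16u	/* 8 KB flash page in 512-byte sectors */
#define ADJUST_THRESHOLD  24u	/* larger requests are left untouched */

/*
 * Hypothetical helper mirroring the check in the quoted patch: return the
 * number of sectors the first transfer should carry. A small request that
 * crosses a page boundary is cut at that boundary; anything else is
 * returned unchanged.
 */
static uint32_t trim_to_page(uint32_t start_sector, uint32_t blocks)
{
	uint32_t sectors_in_page;

	assert(blocks > 0);
	if (blocks > ADJUST_THRESHOLD)
		return blocks;

	/* sectors left between the start address and the next page boundary */
	sectors_in_page = PAGE_SECTORS - (start_sector % PAGE_SECTORS);
	return blocks > sectors_in_page ? sectors_in_page : blocks;
}
```

With these numbers, a 20-sector write starting at sector 10 is trimmed to 6
sectors (up to the boundary at sector 16), leaving the remaining 14 sectors
for a follow-up request. An aligned 16-sector write is not changed.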


>> +/*
>> + * This is another strange workaround to try to close the gap on Toshiba eMMC
>> + * performance when compared to other vendors.  In order to take advantage
>> + * of certain optimizations and assumptions in those cards, we will look for
>> + * multiblock write transfers below a certain size and we do the following:
>> + *
>> + * - Break them up into separate page-aligned (8k flash pages) transfers.
>> + * - Execute the transfers in reverse order.
>> + * - Use "reliable write" transfer mode.
>> + *
>> + * Neither the block I/O layer nor the scatterlist design seem to lend
>> + * themselves well to executing a block request out of order.  So instead
>> + * we let mmc_blk_issue_rq() setup the MMC request for the entire transfer
>> + * and then break it up and reorder it here.  This also requires that we
>> + * put the data into a bounce buffer and send it as individual sg's.
>> + */
>
> A lot of the SD cards I've seen will react very badly to reverse order,
> so that is definitely a dangerous thing to put into the code.
>
> Also, the "reliable write" seems like a really interesting thing to
> rely on for performance. I believe what the card is trying to do here
> is to optimize FAT32 directory updates. By using the small blocks in
> unpredictable order (anything but linear), you tell the card to treat
> this as part of a directory, so it probably gets written in a different
> way, but that might mean that it will try to turn the current erase
> block group into a special small write mode.
>
> I could imagine that this will cause problems on your eMMC once you
> write small blocks to more than one erase block group, because that probably
> causes it to start garbage collection -- it makes sense for the cards
> to know that something is a directory, but it can only know about
> a small number of directories, so it will turn the segment into a regular
> one as soon as something else becomes a directory.
>

It's difficult for me to argue one way or another. The code provided
is implementing Toshiba's suggestions for mitigating excessive wear.
Basically, as far as certain Android products are concerned, Motorola
created some "typical usage" cases, and collected data logs. These
logs were analyzed by Toshiba, which reported an approx x16
multiplication factor for writes.

Analysis of data written showed that there were many random accesses
with 16KB or 32KB, meaning they go into buffer B. According to T, that
means extra GC and PE cycle. I'm guessing per write.

So T suggested for random data to better go into buffer A. How? Two suggestions.
1) Split smaller accesses into 8KB and write with reliable write.
2) Split smaller accesses into 8KB and write in reverse.

The patch does both and I am verifying if that is really necessary. I
need to go see the mmc spec and what it says about reliable write.

Basically, whatever behavior you choose is going to be wrong for some
set of cards. Which is why tuning it probably only makes sense for eMMC
parts, and should be a set of runtime/compile-time quirks. What do you
think?

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-13  0:10         ` Andrei Warkentin
@ 2011-02-13 17:39           ` Arnd Bergmann
  -1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-13 17:39 UTC (permalink / raw)
  To: Andrei Warkentin; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc

On Sunday 13 February 2011 01:10:09 Andrei Warkentin wrote:
> On Sat, Feb 12, 2011 at 12:37 PM, Arnd Bergmann <arnd@arndb.de> wrote:
>
> > This part might make sense in general, though it's hard to know the
> > page size in the general case. For many SD cards, writing naturally
> > aligned 64 KB blocks was the ideal case in my testing, but some need
> > larger alignment or can deal well with smaller blocks.
> >
> 
> ...which is why I believe this should be a boot per-card parameter,
> and that it really only makes sense for embedded parts, where you know
> nothing else is going to be used as, say, mmcblk0.

I don't think it needs to be boot-time, it can easily be run-time
tuneable using sysfs, where you can configure it using an init script
or some other logic from user space.

> > I could imagine that this will cause problems on your eMMC once you
> > write small blocks to more than one erase block group, because that probably
> > causes it to start garbage collection -- it makes sense for the cards
> > to know that something is a directory, but it can only know about
> > a small number of directories, so it will turn the segment into a regular
> > one as soon as something else becomes a directory.
> >
> 
> It's difficult for me to argue one way or another. The code provided
> is implementing Toshiba's suggestions for mitigating excessive wear.
> Basically, as far as certain Android products are concerned, Motorola
> created some "typical usage" cases, and collected data logs. These
> logs were analyzed by Toshiba, which reported an approx x16
> multiplication factor for writes.

Yes, I've seen similar numbers in my measurements. My experience with
the Kingston/Toshiba cards is that they combine two unfortunate
problems:

* Only one 4 MB AU can be open, writing to a different AU waits for
garbage collection on the old one. Other cards typically have
five buffers for open AUs, which makes them much easier to work with.

* Only linear access within one AU is fast. Writing to a block with
a lower address in the same AU causes garbage collection of the AU.
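
The two rules above can be turned into a small toy model for counting
garbage collections over a write trace. This is a sketch under stated
assumptions, not a description of any real firmware: count_gc() is a
hypothetical helper, the AU is taken as 4 MB (8192 sectors of 512 bytes),
"lower address" is read as lower than the end of the previous write, and
writes are assumed not to cross an AU boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define AU_SECTORS 8192u	/* 4 MB allocation unit in 512-byte sectors */

/*
 * Toy model: the card keeps exactly one AU open and remembers the next
 * sector it expects inside it. A write to a different AU, or to a lower
 * address than expected inside the open AU, counts as one garbage
 * collection.
 */
static unsigned int count_gc(const uint32_t *start, const uint32_t *len,
			     size_t n)
{
	uint32_t open_au = UINT32_MAX;	/* no AU open yet */
	uint32_t next = 0;		/* next expected sector in the open AU */
	unsigned int gc = 0;
	size_t i;

	assert(start != NULL && len != NULL);
	for (i = 0; i < n; i++) {
		uint32_t au = start[i] / AU_SECTORS;

		if (open_au != UINT32_MAX &&
		    (au != open_au || start[i] < next))
			gc++;
		open_au = au;
		next = start[i] + len[i];
	}
	return gc;
}
```

In this model four sequential 8KB writes inside one AU cost nothing, the
same four writes issued in reverse order cost three garbage collections,
and so do four writes that alternate between two AUs.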

> Analysis of data written showed that there were many random accesses
> with 16KB or 32KB, meaning they go into buffer B. 

I have started a remapping layer that should be able to deal with
this independent of the card, see
https://wiki.linaro.org/WorkingGroups/KernelConsolidation/Projects/FlashDeviceMapper
It's still in the early stages, but maybe something like that will
help you as well.

The real solution would be to have a file system that knows what
accesses are fast and reorders file data accordingly. Right now,
the only thing that is normally fast is FAT32 using 32KB clusters,
and only if the file system is aligned properly.

> According to T, that
> means extra GC and PE cycle. I'm guessing per write.

Yes.

What is "PE" here?

> So T suggested for random data to better go into buffer A. How? Two suggestions.
> 1) Split smaller accesses into 8KB and write with reliable write.
> 2) Split smaller accesses into 8KB and write in reverse.
> 
> The patch does both and I am verifying if that is really necessary. I
> need to go see the mmc spec and what it says about reliable write.

I should add this to my test tool once I can reproduce it. If it turns
out that other media do the same, we can also trigger the same behavior
for those.

> Basically, whatever behavior you choose is going to be wrong for some
> set of cards. Which is why tuning it probably only makes sense for eMMC
> parts, and should be a set of runtime/compile-time quirks. What do you
> think?

Your explanation makes sense, but I'd definitely favor a run-time solution
over compile-time or boot-time, because it would be much more flexible.
We should also be able to find some optimizations that are universally
good so we can always use them.

	Arnd

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-13 17:39           ` Arnd Bergmann
@ 2011-02-14 19:29             ` Andrei Warkentin
  -1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-14 19:29 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc

On Sun, Feb 13, 2011 at 11:39 AM, Arnd Bergmann <arnd@arndb.de> wrote:

> I don't think it needs to be boot-time, it can easily be run-time
> tuneable using sysfs, where you can configure it using an init script
> or some other logic from user space.

True, definitely expose the controls through sysfs.

>
> Yes.
>
> What is "PE" here?
>

Ah sorry, I had to look that one up myself, I thought it was the local
jargon associated with the problem space :-). Program/Erase cycle.

>> So T suggested for random data to better go into buffer A. How? Two suggestions.
>> 1) Split smaller accesses into 8KB and write with reliable write.
>> 2) Split smaller accesses into 8KB and write in reverse.
>>
>> The patch does both and I am verifying if that is really necessary. I
>> need to go see the mmc spec and what it says about reliable write.
>
> I should add this to my test tool once I can reproduce it. If it turns
> out that other media do the same, we can also trigger the same behavior
> for those.
>

As I mentioned, I am checking with T right now on whether we can use
suggestion (1) or
suggestion (2) or if they need to be combined. The documentation we
got was open to interpretation and the patch created from that did
both.
You mentioned that writing in reverse is not a good idea. Could you
elaborate why? I would guess because you're always causing a write
into a different AU (on these Toshiba cards), causing extra GC on
every write?

>> Basically, whatever behavior you choose is going to be wrong for some
>> set of cards. Which is why tuning it probably only makes sense for eMMC
>> parts, and should be a set of runtime/compile-time quirks. What do you
>> think?
>
> Your explanation makes sense, but I'd definitely favor a run-time solution
> over compile-time or boot-time, because it would be much more flexible.
> We should also be able to find some optimizations that are universally
> good so we can always use them.
>

Then that's the angle I will pursue. It is the most flexible and then
you don't have to pollute the block driver with little workarounds for
soon-to-be-obsolete hardware. Hopefully I'll have something for
re-review soon.

Thanks Again!

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-14 19:29             ` Andrei Warkentin
@ 2011-02-14 20:22               ` Arnd Bergmann
  -1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-14 20:22 UTC (permalink / raw)
  To: linux-arm-kernel; +Cc: Andrei Warkentin, Linus Walleij, linux-mmc

On Monday 14 February 2011 20:29:59 Andrei Warkentin wrote:
> On Sun, Feb 13, 2011 at 11:39 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>
> Ah sorry, I had to look that one up myself, I thought it was the local
> jargon associated with the problem space :-). Program/Erase cycle.

Ok, makes sense.

> >> So T suggested for random data to better go into buffer A. How? Two suggestions.
> >> 1) Split smaller accesses into 8KB and write with reliable write.
> >> 2) Split smaller accesses into 8KB and write in reverse.
> >>
> >> The patch does both and I am verifying if that is really necessary. I
> >> need to go see the mmc spec and what it says about reliable write.
> >
> > I should add this to my test tool once I can reproduce it. If it turns
> > out that other media do the same, we can also trigger the same behavior
> > for those.
> >
> 
> As I mentioned, I am checking with T right now on whether we can use
> suggestion (1) or
> suggestion (2) or if they need to be combined. The documentation we
> got was open to interpretation and the patch created from that did
> both.
> You mentioned that writing in reverse is not a good idea. Could you
> elaborate why? I would guess because you're always causing a write
> into a different AU (on these Toshiba cards), causing extra GC on
> every write?

Probably both the reliable write and writing small blocks in reverse
order will cause any card to do something that is different from
what it does on normal 64kb (or larger) aligned accesses.

There are multiple ways how this could be implemented:

1. Have one exception cache for all "special" blocks. This would normally
   be for FAT32 subdirectory updates, which always write to the same
   few blocks. This means you can do small writes efficiently anywhere
   on the card, but only up to a (small) fixed number of block addresses.
   If you overflow the table, the card still needs to go through an
   extra PE for each new entry you write, in order to free up an entry.

2. Have a small number of AUs that can be in a special mode with efficient
   small writes but inefficient large writes. This means that when you
   alternate between small and large writes in the same AU, it has to go
   through a PE on every switch. Similarly, if you do small writes to
   more than the maximum number of AUs that can be held in this mode, you
   get the same effect. This number can be as small as one, because that
   is what FAT32 requires.

In both cases, you don't actually have a solution for the problem, you just
make it less likely for specific workloads.
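
A toy model of option 1 shows why the table only helps while the working
set fits. Everything here is an assumption made for illustration: the
four-entry table size, the FIFO eviction policy and the name
count_extra_pe() are hypothetical, and real cards do not document any of
this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_ENTRIES 4u	/* assumed table size */

/*
 * Toy model of option 1: a fixed table of block addresses that can take
 * small writes cheaply. A write to an address that is not in a full table
 * evicts the oldest entry (FIFO) and is counted as one extra PE cycle.
 */
static unsigned int count_extra_pe(const uint32_t *addr, size_t n)
{
	uint32_t table[CACHE_ENTRIES];
	size_t used = 0, oldest = 0, i, j;
	unsigned int extra_pe = 0;

	assert(addr != NULL);
	for (i = 0; i < n; i++) {
		/* look the address up in the table */
		for (j = 0; j < used; j++)
			if (table[j] == addr[i])
				break;
		if (j < used)
			continue;	/* hit: cheap small write */

		if (used < CACHE_ENTRIES) {
			table[used++] = addr[i];	/* free slot */
		} else {
			table[oldest] = addr[i];	/* overflow: evict */
			oldest = (oldest + 1) % CACHE_ENTRIES;
			extra_pe++;
		}
	}
	return extra_pe;
}
```

Rewriting the same four blocks over and over costs no extra PE cycles in
this model, while cycling through eight distinct blocks costs one for every
write after the table has filled up.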

	Arnd

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-14 20:22               ` Arnd Bergmann
@ 2011-02-14 22:25                 ` Andrei Warkentin
  -1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-14 22:25 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc

On Mon, Feb 14, 2011 at 2:22 PM, Arnd Bergmann <arnd@arndb.de> wrote:
>>
>> As I mentioned, I am checking with T right now on whether we can use
>> suggestion (1) or
>> suggestion (2) or if they need to be combined. The documentation we
>> got was open to interpretation and the patch created from that did
>> both.
>> You mentioned that writing in reverse is not a good idea. Could you
>> elaborate why? I would guess because you're always causing a write
>> into a different AU (on these Toshiba cards), causing extra GC on
>> every write?
>
> Probably both the reliable write and writing small blocks in reverse
> order will cause any card to do something that is different from
> what it does on normal 64kb (or larger) aligned accesses.
>
> There are multiple ways how this could be implemented:
>
> 1. Have one exception cache for all "special" blocks. This would normally
>   be for FAT32 subdirectory updates, which always write to the same
>   few blocks. This means you can do small writes efficiently anywhere
>   on the card, but only up to a (small) fixed number of block addresses.
>   If you overflow the table, the card still needs to go through an
>   extra PE for each new entry you write, in order to free up an entry.
>
> 2. Have a small number of AUs that can be in a special mode with efficient
>   small writes but inefficient large writes. This means that when you
>   alternate between small and large writes in the same AU, it has to go
>   through a PE on every switch. Similarly, if you do small writes to
>   more than the maximum number of AUs that can be held in this mode, you
>   get the same effect. This number can be as small as one, because that
>   is what FAT32 requires.
>
> In both cases, you don't actually have a solution for the problem, you just
> make it less likely for specific workloads.

Aha, ok. By the way, I did find out that either suggestion works. So
I'll pull out the reversing portion of the patch. No need to
overcomplicate :).

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-14 22:25                 ` Andrei Warkentin
@ 2011-02-15 17:16                   ` Arnd Bergmann
  -1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-15 17:16 UTC (permalink / raw)
  To: Andrei Warkentin; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc

On Monday 14 February 2011, Andrei Warkentin wrote:
> > There are multiple ways how this could be implemented:
> >
> > 1. Have one exception cache for all "special" blocks. This would normally
> >   be for FAT32 subdirectory updates, which always write to the same
> >   few blocks. This means you can do small writes efficiently anywhere
> >   on the card, but only up to a (small) fixed number of block addresses.
> >   If you overflow the table, the card still needs to go through an
> >   extra PE for each new entry you write, in order to free up an entry.
> >
> > 2. Have a small number of AUs that can be in a special mode with efficient
> >   small writes but inefficient large writes. This means that when you
> >   alternate between small and large writes in the same AU, it has to go
> >   through a PE on every switch. Similarly, if you do small writes to
> >   more than the maximum number of AUs that can be held in this mode, you
> >   get the same effect. This number can be as small as one, because that
> >   is what FAT32 requires.
> >
> > In both cases, you don't actually have a solution for the problem, you just
> > make it less likely for specific workloads.
> 
> Aha, ok. By the way, I did find out that either suggestion works. So
> I'll pull out the reversing portion of the patch. No need to
> overcomplicate :).

BTW, what file system are you using? I could imagine that each of ext4, btrfs
and nilfs2 give you very different results here. It could be that if your
patch is optimizing for one file system, it is actually pessimising for
another one.

What benchmark do you use to find out if your optimizations actually help you?

	Arnd

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-15 17:16                   ` Arnd Bergmann
@ 2011-02-17  2:08                     ` Andrei Warkentin
  -1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-17  2:08 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc

On Tue, Feb 15, 2011 at 11:16 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Monday 14 February 2011, Andrei Warkentin wrote:
>> > There are multiple ways how this could be implemented:
>> >
>> > 1. Have one exception cache for all "special" blocks. This would normally
>> >   be for FAT32 subdirectory updates, which always write to the same
>> >   few blocks. This means you can do small writes efficiently anywhere
>> >   on the card, but only up to a (small) fixed number of block addresses.
>> >   If you overflow the table, the card still needs to go through an
>> >   extra PE for each new entry you write, in order to free up an entry.
>> >
>> > 2. Have a small number of AUs that can be in a special mode with efficient
>> >   small writes but inefficient large writes. This means that when you
>> >   alternate between small and large writes in the same AU, it has to go
>> >   through a PE on every switch. Similarly, if you do small writes to
>> >   more than the maximum number of AUs that can be held in this mode, you
>> >   get the same effect. This number can be as small as one, because that
>> >   is what FAT32 requires.
>> >
>> > In both cases, you don't actually have a solution for the problem, you just
>> > make it less likely for specific workloads.
>>
>> Aha, ok. By the way, I did find out that either suggestion works. So
>> I'll pull out the reversing portion of the patch. No need to
>> overcomplicate :).
>
> BTW, what file system are you using? I could imagine that each of ext4, btrfs
> and nilfs2 give you very different results here. It could be that if your
> patch is optimizing for one file system, it is actually pessimising for
> another one.
>

Ext4. I've actually been rewriting the patch a lot and it's taking
time because there are a lot of things that are wrong in it (so I feel
kinda bad for forwarding it to this list in the first place...). I've
already mentioned that there is no need to reorder, so that's going
away and it simplifies everything greatly.

I agree, which is why all of this is controlled now through sysfs, and
there are no more hard-coded checks for manfid, mmc versus sd or any
other magic. There is a page_size_secs attribute, through which you
can notify of the page size for the device. The workaround for small
writes crossing the page boundary (and winding up in Buffer B, instead
of A) is turned on by setting split_tlow and split_thigh, which
provide a threshold range in sectors over which the writes will
be split/aligned. The second workaround for splitting larger requests
and writing them with reliable write (to avoid getting coalesced and
winding up in Buffer B again) is controlled through split_relw_tlow
and split_relw_thigh. Do you think there is a better way? Or is this
good enough?
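
The first workaround described above can be sketched as follows. This is only an illustration under stated assumptions, not the actual patch: split_write is a hypothetical helper, the parameters mirror the page_size_secs, split_tlow and split_thigh attributes named above, and the rule "split only when the length lies inside the threshold range" is an assumption about how the thresholds are applied.

```python
def split_write(start, nr, page_secs, tlow, thigh):
    """Split a write of `nr` sectors starting at `start` on page boundaries.

    All values are in sectors. Writes whose length is outside
    [tlow, thigh] are returned unchanged (assumed behaviour).
    """
    if not (tlow <= nr <= thigh):
        return [(start, nr)]
    chunks = []
    end = start + nr
    while start < end:
        # First page boundary strictly after the current position.
        boundary = (start // page_secs + 1) * page_secs
        chunk_end = min(boundary, end)
        chunks.append((start, chunk_end - start))
        start = chunk_end
    return chunks


# 8 KiB pages are 16 sectors of 512 bytes. An 8 KiB write starting at
# sector 8 crosses a page boundary and is split into two aligned pieces.
print(split_write(8, 16, 16, 1, 32))   # [(8, 8), (16, 8)]
# The same write starting on a page boundary is left alone.
print(split_write(16, 16, 16, 1, 32))  # [(16, 16)]
```

With this shape, a request that does not cross a boundary produces a single chunk, so the host only issues extra commands when the split can actually change which buffer the card uses.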

So, as I mentioned before, T had done some tests given data provided
by M, and then T verified that this fix was good. I need to do my own
tests on the patch after I rewrite it. Is iozone the best tool I can
use? So far I have an MMC logging facility through connector that I use
to collect stats (useful for seeing how fs traffic translates to
actual mmc commands...once I clean it up I'll push here for RFC). What
about the tool you're writing? Any way I can use it?

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-17  2:08                     ` Andrei Warkentin
@ 2011-02-17 15:47                       ` Arnd Bergmann
  -1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-17 15:47 UTC (permalink / raw)
  To: Andrei Warkentin; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc

On Thursday 17 February 2011, Andrei Warkentin wrote:
> Ext4.

Ok, I see. I haven't really done this kind of test before, but my
feeling is that ext3/ext4 may be much worse than the alternatives
at the moment. It would certainly be worthwhile to do tests using
nilfs2 and btrfs, whose default behaviour matches the requirements
of your eMMC flash much better, and see how they perform with and
without your patch.

> I agree, which is why all of this is controlled now through sysfs, and
> there are no more hard-coded checks for manfid, mmc versus sd or any
> other magic. There is a page_size_secs attribute, through which you
> can notify of the page size for the device.

How about making that just page_size in bytes? Sectors don't always
mean 512 bytes, so this would be both shorter and less ambiguous.

> The workaround for small
> writes crossing the page boundary (and winding up in Buffer B, instead
> of A) is turned on by setting split_tlow and split_thigh, which
> provide a threshold range in sectors over which the writes will
> be split/aligned. The second workaround for splitting larger requests
> and writing them with reliable write (to avoid getting coalesced and
> winding up in Buffer B again) is controlled through split_relw_tlow
> and split_relw_thigh. Do you think there is a better way? Or is this
> good enough?

I think I'd try to reduce the number of sysfs files needed for this.
What are the values you would typically set here?

My feeling is that separating unaligned page writes from full pages
or multiples of pages could always be beneficial for all cards, or at
least harmless, but that will require more measurements.
Whether to do the reliable write or not could be a simple flag
if the numbers are the same.

> So, as I mentioned before, T had done some tests given data provided
> by M, and then T verified that this fix was good. I need to do my own
> tests on the patch after I rewrite it. Is iozone the best tool I can
> use? So far I have a MMC logging facility through connector that I use
> to collect stats (useful for seeing how fs traffic translates to
> actual mmc commands...once I clean it up I'll push here for RFC). What
> about the tool you're writing? Any way I can use it?

It's now available in an early almost-usable version at
git://git.linaro.org/people/arnd/flashbench.git

I don't have a test for the second buffer yet, but it would be
good to know some of the other characteristics of your eMMC drive.

Please try some of these commands:

flashbench -a /dev/mmcblk0  --blocksize=1024
flashbench --open-au --open-au-nr=1 /dev/mmcblk0 --blocksize=512
flashbench --open-au --open-au-nr=1 /dev/mmcblk0 --blocksize=512 --random
flashbench --open-au --open-au-nr=2 /dev/mmcblk0 --blocksize=512
flashbench --open-au --open-au-nr=2 /dev/mmcblk0 --blocksize=512 --random
flashbench --open-au --open-au-nr=3 /dev/mmcblk0 --blocksize=512
flashbench --open-au --open-au-nr=3 /dev/mmcblk0 --blocksize=512 --random

Note that the --open-au test will overwrite your data. You can do it on a
partition you don't use, but it needs to be aligned to 4 MB.
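
The 4 MB alignment requirement can be checked from a partition's start value in sysfs, which is expressed in 512-byte units regardless of the device's block size. A minimal sketch: alignment_offset is a hypothetical helper, and the two start values are sample inputs.

```python
SECTOR_BYTES = 512  # sysfs "start" and "size" are always in 512-byte units


def alignment_offset(start_sectors, align_bytes=4 * 1024 * 1024):
    """Return how many bytes the partition start lies past an aligned address."""
    return (start_sectors * SECTOR_BYTES) % align_bytes


for start in (81920, 77824):
    off = alignment_offset(start)
    if off == 0:
        print(start, "is 4 MiB aligned")
    else:
        print(start, "is off by", off, "bytes")
```

A start of 81920 sectors is exactly 40 MiB and therefore aligned, while 77824 sectors is 38 MiB, which leaves a 2 MiB remainder.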

	Arnd

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-11 22:33       ` Andrei Warkentin
@ 2011-02-18  1:10         ` Andrei Warkentin
  -1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-18  1:10 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc

On Fri, Feb 11, 2011 at 4:33 PM, Andrei Warkentin <andreiw@motorola.com> wrote:
> Arnd,
>
> Yes, this is a Toshiba card. I've sent the patch as a reply to Linus' email.
>
> cid - 02010053454d3332479070cc51451d00
> csd - d00f00320f5903ffffffffff92404000
> erase_size - 524288
> fwrev - 0x0
> hwrev - 0x0
> manfid - 0x000002
> name - SEM32G
> oemid - 0x0100
> preferred_erase_size - 2097152
>

Ok. Big mistake. Sorry about that. This card is a Sandisk card. I got
confused over all the manfids changing.

Here is the Toshiba card:

cid - 1101004d4d4333324703101a17746d00
csd - 900e00320f5903ffffffffe796400000
erase_size - 524288
fwrev - 0x0
hwrev - 0x0
manfid - 0x000011
name - MMC32G
oemid - 0x0100
preferred_erase_size - 4194304

I'll get you the flashbench timings for both.
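
The manfid, oemid and name values listed above can be read directly out of the cid string. A minimal sketch, assuming the MMC-style CID layout in which the top byte is the manufacturer ID, the next 16 bits are reported as the OEM ID, and the following six bytes are the product name; decode_mmc_cid is a hypothetical helper, not a kernel function.

```python
def decode_mmc_cid(cid_hex):
    """Decode manfid, oemid and product name from a 128-bit CID hex string."""
    raw = bytes.fromhex(cid_hex)
    if len(raw) != 16:
        raise ValueError("CID must be 128 bits (32 hex digits)")
    return {
        "manfid": raw[0],                          # bits [127:120]
        "oemid": int.from_bytes(raw[1:3], "big"),  # bits [119:104]
        "name": raw[3:9].decode("ascii"),          # bits [103:56]
    }


for cid in ("02010053454d3332479070cc51451d00",
            "1101004d4d4333324703101a17746d00"):
    info = decode_mmc_cid(cid)
    print(hex(info["manfid"]), hex(info["oemid"]), info["name"])
```

Under that layout the two cid strings decode to manfid 0x2 with name SEM32G and manfid 0x11 with name MMC32G, which agrees with the sysfs values quoted in this message.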

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-18  1:10         ` Andrei Warkentin
@ 2011-02-18 13:44           ` Arnd Bergmann
  -1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-18 13:44 UTC (permalink / raw)
  To: linux-arm-kernel; +Cc: Andrei Warkentin, Linus Walleij, linux-mmc

On Friday 18 February 2011, Andrei Warkentin wrote:
> On Fri, Feb 11, 2011 at 4:33 PM, Andrei Warkentin <andreiw@motorola.com> wrote:
> > Arnd,
> >
> > Yes, this is a Toshiba card. I've sent the patch as a reply to Linus' email.
> >
> > cid - 02010053454d3332479070cc51451d00
> > csd - d00f00320f5903ffffffffff92404000
> > erase_size - 524288
> > fwrev - 0x0
> > hwrev - 0x0
> > manfid - 0x000002
> > name - SEM32G
> > oemid - 0x0100
> > preferred_erase_size - 2097152
> >
> 
> Ok. Big mistake. Sorry about that. This card is a Sandisk card. I got
> confused over all the manfids changing.
> 
> Here is the Toshiba card:
> 
> cid - 1101004d4d4333324703101a17746d00
> csd - 900e00320f5903ffffffffe796400000
> erase_size - 524288
> fwrev - 0x0
> hwrev - 0x0
> manfid - 0x000011
> name - MMC32G
> oemid - 0x0100
> preferred_erase_size - 4194304
> 
> I'll get you the flashbench timings for both.

I'm curious. Neither the manfid nor the oemid fields of either card
match what I have seen on SD cards; I would expect them to be

Sandisk: manfid 0x000003, oemid 0x5344
Toshiba: manfid 0x000002, oemid 0x544d

I have not actually seen any Toshiba SD cards, but I assume that they
use the same controllers as Kingston.

Does anyone know if the IDs have any correlation between MMC and SD
controllers?

	Arnd

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-18 13:44           ` Arnd Bergmann
@ 2011-02-18 19:47             ` Andrei Warkentin
  -1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-18 19:47 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc

On Fri, Feb 18, 2011 at 7:44 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> I'm curious. Neither the manfid nor the oemid fields of either card
> match what I have seen on SD cards, I would expect them to be
>
> Sandisk: manfid 0x000003, oemid 0x5344
> Toshiba: manfid 0x000002, oemid 0x544d
>
> I have not actually seen any Toshiba SD cards, but I assume that they
> use the same controllers as Kingston.
>
> Does anyone know if the IDs have any correlation between MMC and SD
> controllers?
>
>        Arnd
>

I'm unsure about the older scheme (assigned by MMCA), but ever since
MMC became JEDEC-controlled, the IDs have changed. Sandisk's new id
will be 0x45, and Toshiba's, I guess, will be 0x11.

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-18 19:47             ` Andrei Warkentin
@ 2011-02-18 22:40               ` Andrei Warkentin
  -1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-18 22:40 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc

[-- Attachment #1: Type: text/plain, Size: 2014 bytes --]

On Fri, Feb 18, 2011 at 1:47 PM, Andrei Warkentin <andreiw@motorola.com> wrote:
> On Fri, Feb 18, 2011 at 7:44 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>> I'm curious. Neither the manfid nor the oemid fields of either card
>> match what I have seen on SD cards, I would expect them to be
>>
>> Sandisk: manfid 0x000003, oemid 0x5344
>> Toshiba: manfid 0x000002, oemid 0x544d
>>
>> I have not actually seen any Toshiba SD cards, but I assume that they
>> use the same controllers as Kingston.
>>
>> Does anyone know if the IDs have any correlation between MMC and SD
>> controllers?
>>
>>        Arnd
>>
>
> I'm unsure about the older scheme (assigned by MMCA), but ever since
> MMC is now JEDEC-controlled, the IDs have changed. Sandisk's new id
> will be 0x45, and Toshiba I guess will be 0x11.
>

Flashbench timings for both Sandisk and Toshiba cards. Attaching due to size.

There are some interesting things that I don't understand. I extended
the align test to do a write align test (-A). I tried two partitions that
I could write over, and both reads and writes behaved differently for
the two partitions on the same device. Odd. They are both 4MB aligned.

On the Sandisk it was the write align that made the page size stand
out. The read align had pretty constant results.

On the Toshiba the results varied wildly for the two partitions. For
partition 6, there was a clear pattern in the diff values for read
align. For 9, it was all over the place. For 9 with the write align,
the 8K and 16K crossing writes took ~115ms!! Look in the attached files
for all the data.
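
One way to read the align output: the diff column looks like the time for the access that crosses the boundary ("on") minus the mean of the accesses just before and after it ("pre" and "post"). That formula is an assumption, but it reproduces the numbers in the attached toshiba.txt to within rounding. A minimal sketch using the mmcblk0p6 read rows; the function names and the 50µs threshold are made up for illustration.

```python
# (alignment in bytes, pre, on, post) in microseconds, copied from the
# Toshiba mmcblk0p6 read align run in the attachment.
ROWS = [
    (524288, 613, 801, 570),
    (262144, 739, 988, 767),
    (131072, 740, 990, 767),
    (65536, 749, 998, 767),
    (32768, 761, 992, 746),
    (16384, 755, 982, 755),
    (8192, 748, 750, 748),
    (4096, 747, 749, 747),
    (2048, 747, 747, 748),
]


def crossing_penalty(pre, on, post):
    """Extra time for the boundary-crossing access (assumed diff formula)."""
    return on - (pre + post) / 2


def smallest_costly_alignment(rows, threshold_us=50):
    """Smallest alignment whose crossing penalty exceeds the threshold."""
    costly = [align for align, pre, on, post in rows
              if crossing_penalty(pre, on, post) > threshold_us]
    return min(costly) if costly else None


print(smallest_costly_alignment(ROWS))  # 16384
```

For this run the penalty stays above 200µs down to the 16384-byte boundary and drops to about 2µs at 8192 and below.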

The AU tests were interesting too, especially how with several open
AUs the throughput is higher for certain smaller sizes on the Sandisk, but
if I interpret it correctly both cards have at least 4 AUs, as I
haven't yet seen a significant drop for small sizes. The larger ones I
am running now on mmcblk0p9, which is sufficiently large for these
tests... (mmcblk0p6 is only 40mb, p9 is 314 mb)

Thanks,
A

[-- Attachment #2: toshiba.txt --]
[-- Type: text/plain, Size: 5447 bytes --]

/data # cat /sys/block/mmcblk0/device/block/mmcblk0/mmcblk0p9/start
643072
/data # cat /sys/block/mmcblk0/device/block/mmcblk0/mmcblk0p9/size 
346112
/data # cat /sys/block/mmcblk0/device/block/mmcblk0/mmcblk0p6/start
77824
/data # cat /sys/block/mmcblk0/device/block/mmcblk0/mmcblk0p6/size
24576
# ./flashbench -a -b 1024 /dev/block/mmcblk0p6
align 524288	pre 613µs	on 801µs	post 570µs	diff 210µs
align 262144	pre 739µs	on 988µs	post 767µs	diff 235µs
align 131072	pre 740µs	on 990µs	post 767µs	diff 236µs
align 65536	pre 749µs	on 998µs	post 767µs	diff 240µs
align 32768	pre 761µs	on 992µs	post 746µs	diff 238µs
align 16384	pre 755µs	on 982µs	post 755µs	diff 227µs
align 8192	pre 748µs	on 750µs	post 748µs	diff 1.94µs
align 4096	pre 747µs	on 749µs	post 747µs	diff 1.41µs
align 2048	pre 747µs	on 747µs	post 748µs	diff -93ns
# ./flashbench -a -b 1024 /dev/block/mmcblk0p9
align 8388608	pre 527µs	on 743µs	post 476µs	diff 242µs
align 4194304	pre 544µs	on 730µs	post 543µs	diff 187µs
align 2097152	pre 551µs	on 714µs	post 485µs	diff 196µs
align 1048576	pre 742µs	on 864µs	post 745µs	diff 120µs
align 524288	pre 760µs	on 822µs	post 789µs	diff 47.9µs
align 262144	pre 760µs	on 816µs	post 789µs	diff 42µs
align 131072	pre 760µs	on 822µs	post 789µs	diff 47.8µs
align 65536	pre 758µs	on 821µs	post 789µs	diff 48µs
align 32768	pre 771µs	on 828µs	post 760µs	diff 62.7µs
align 16384	pre 672µs	on 939µs	post 771µs	diff 217µs
align 8192	pre 668µs	on 806µs	post 671µs	diff 136µs
align 4096	pre 671µs	on 672µs	post 670µs	diff 1.5µs
align 2048	pre 671µs	on 670µs	post 671µs	diff -859ns
# ./flashbench -A -b 1024 /dev/block/mmcblk0p6
write align 524288	pre 3.59ms	on 6.74ms	post 3.73ms	diff 3.08ms
write align 262144	pre 3.69ms	on 7.11ms	post 3.69ms	diff 3.42ms
write align 131072	pre 3.71ms	on 17.4ms	post 3.72ms	diff 13.7ms
write align 65536	pre 3.72ms	on 7.18ms	post 3.52ms	diff 3.56ms
write align 32768	pre 3.73ms	on 11.9ms	post 3.7ms	diff 8.24ms
write align 16384	pre 3.93ms	on 5.01ms	post 4.6ms	diff 745µs
write align 8192	pre 4.9ms	on 4.89ms	post 4.87ms	diff 4.77µs
write align 4096	pre 5.03ms	on 5.02ms	post 5.01ms	diff -437ns
write align 2048	pre 5.08ms	on 5.08ms	post 5.06ms	diff 12.3µs
# ./flashbench -A -b 1024 /dev/block/mmcblk0p9
write align 8388608	pre 3.76ms	on 7.07ms	post 4.05ms	diff 3.16ms
write align 4194304	pre 3.62ms	on 6.5ms	post 3.63ms	diff 2.88ms
write align 2097152	pre 3.91ms	on 6.84ms	post 3.7ms	diff 3.04ms
write align 1048576	pre 3.88ms	on 6.96ms	post 3.96ms	diff 3.04ms
write align 524288	pre 3.93ms	on 7.07ms	post 4.05ms	diff 3.08ms
write align 262144	pre 3.94ms	on 7.07ms	post 4.05ms	diff 3.07ms
write align 131072	pre 3.95ms	on 7.05ms	post 4.05ms	diff 3.05ms
write align 65536	pre 3.94ms	on 7.07ms	post 4.05ms	diff 3.07ms
write align 32768	pre 3.95ms	on 7.07ms	post 4.04ms	diff 3.07ms
write align 16384	pre 4.48ms	on 117ms	post 3.81ms	diff 113ms
write align 8192	pre 3.61ms	on 114ms	post 3.58ms	diff 110ms
write align 4096	pre 3.88ms	on 3.87ms	post 3.86ms	diff 1.87µs
write align 2048	pre 3.88ms	on 3.89ms	post 3.89ms	diff 3.11µs

./flashbench -O -0 1 -b 512 /dev/block/mmcblk0p6
4MiB    7.17M/s 
2MiB    7.91M/s 
1MiB    9.23M/s 
512KiB  10.3M/s 
256KiB  10.5M/s 
128KiB  10.4M/s 
64KiB   9.81M/s 
32KiB   9.09M/s 
16KiB   3.71M/s 
8KiB    1.73M/s 
4KiB    845K/s  
2KiB    418K/s  
1KiB    208K/s  
512B    103K/s  
./flashbench -O -0 1 -r -b 512 /dev/block/mmcblk0p6
4MiB    6.58M/s 
2MiB    7.98M/s 
1MiB    9.33M/s 
512KiB  10.4M/s 
256KiB  10.9M/s 
128KiB  10.5M/s 
64KiB   9.94M/s 
32KiB   9.11M/s 
16KiB   3.72M/s 
8KiB    1.75M/s 
4KiB    853K/s  
2KiB    419K/s  
1KiB    207K/s  
512B    102K/s  
./flashbench -O -0 2 -b 512 /dev/block/mmcblk0p6
4MiB    8.95M/s 
2MiB    9.44M/s 
1MiB    10.3M/s 
512KiB  10.9M/s 
256KiB  10.8M/s 
128KiB  10.5M/s 
64KiB   9.91M/s 
32KiB   8.79M/s 
16KiB   3.65M/s 
8KiB    1.75M/s 
4KiB    851K/s  
2KiB    419K/s  
1KiB    208K/s  
512B    103K/s  
./flashbench -O -0 2 -r -b 512 /dev/block/mmcblk0p6
4MiB    9.06M/s 
2MiB    9.68M/s 
1MiB    10.3M/s 
512KiB  10.5M/s 
256KiB  9.94M/s 
128KiB  10.1M/s 
64KiB   9.41M/s 
32KiB   7.99M/s 
16KiB   3.5M/s  
8KiB    1.64M/s 
4KiB    798K/s  
2KiB    393K/s  
1KiB    196K/s  
512B    96.5K/s 
./flashbench -O -0 3 -b 512 /dev/block/mmcblk0p6
4MiB    8.07M/s 
2MiB    9.07M/s 
1MiB    9.88M/s 
512KiB  10.1M/s 
256KiB  10M/s   
128KiB  9.83M/s 
64KiB   8.68M/s 
32KiB   7.1M/s  
16KiB   3.09M/s 
8KiB    1.49M/s 
4KiB    726K/s  
2KiB    357K/s  
1KiB    178K/s  
512B    88.5K/s 
./flashbench -O -0 3 -r -b 512 /dev/block/mmcblk0p6
4MiB    8.12M/s 
2MiB    9.28M/s 
1MiB    9.83M/s 
512KiB  10M/s   
256KiB  9.97M/s 
128KiB  9.91M/s 
64KiB   8.9M/s  
32KiB   7.3M/s  
16KiB   3.2M/s  
8KiB    1.54M/s 
4KiB    751K/s  
2KiB    367K/s  
1KiB    183K/s  
512B    90.3K/s 
./flashbench -O -0 4 -b 512 /dev/block/mmcblk0p6
4MiB    5.87M/s 
2MiB    8.71M/s 
1MiB    9.11M/s 
512KiB  10.3M/s 
256KiB  10.5M/s 
128KiB  10M/s   
64KiB   9.09M/s 
32KiB   7.5M/s  
16KiB   3.28M/s 
8KiB    1.56M/s 
4KiB    758K/s  
2KiB    372K/s  
1KiB    185K/s  
512B    92.3K/s 
./flashbench -O -0 4 -r -b 512 /dev/block/mmcblk0p6
4MiB    7.57M/s 
2MiB    7.23M/s 
1MiB    9.71M/s 
512KiB  10M/s   
256KiB  9.98M/s 
128KiB  9.82M/s 
64KiB   9.07M/s 
32KiB   7.62M/s 
16KiB   3.34M/s 
8KiB    1.58M/s 
4KiB    776K/s  
2KiB    379K/s  
1KiB    188K/s  
512B    92.7K/s 

[-- Attachment #3: sandisk.txt --]
[-- Type: text/plain, Size: 5529 bytes --]

/data # cat /sys/block/mmcblk0/device/block/mmcblk0/mmcblk0p9/start
647168
/data # cat /sys/block/mmcblk0/device/block/mmcblk0/mmcblk0p9/size 
346112
/data # cat /sys/block/mmcblk0/device/block/mmcblk0/mmcblk0p6/start
81920
/data # cat /sys/block/mmcblk0/device/block/mmcblk0/mmcblk0p6/size
24576
/data # ./flashbench -a -b 1024 /dev/block/mmcblk0p6
align 524288	pre 1.01ms	on 1.03ms	post 858µs	diff 93.5µs
align 262144	pre 1.16ms	on 1.2ms	post 926µs	diff 153µs
align 131072	pre 1.16ms	on 1.2ms	post 924µs	diff 151µs
align 65536	pre 1.15ms	on 1.12ms	post 919µs	diff 84.9µs
align 32768	pre 1.16ms	on 1.2ms	post 923µs	diff 154µs
align 16384	pre 1.16ms	on 1.21ms	post 941µs	diff 162µs
align 8192	pre 1.15ms	on 1.09ms	post 874µs	diff 80.2µs
align 4096	pre 1.16ms	on 1.17ms	post 902µs	diff 138µs
align 2048	pre 1.16ms	on 1.17ms	post 903µs	diff 135µs
/data # ./flashbench -a -b 1024 /dev/block/mmcblk0p9
align 8388608	pre 1.07ms	on 1.1ms	post 933µs	diff 92.9µs
align 4194304	pre 1.28ms	on 1.29ms	post 1.05ms	diff 129µs
align 2097152	pre 1.28ms	on 1.31ms	post 1.07ms	diff 132µs
align 1048576	pre 1.27ms	on 1.32ms	post 1.07ms	diff 147µs
align 524288	pre 1.38ms	on 1.38ms	post 1.12ms	diff 135µs
align 262144	pre 1.27ms	on 1.3ms	post 1.04ms	diff 140µs
align 131072	pre 1.28ms	on 1.31ms	post 1.02ms	diff 164µs
align 65536	pre 1.38ms	on 1.38ms	post 1.12ms	diff 135µs
align 32768	pre 1.38ms	on 1.38ms	post 1.12ms	diff 134µs
align 16384	pre 1.38ms	on 1.38ms	post 1.11ms	diff 135µs
align 8192	pre 1.38ms	on 1.38ms	post 1.11ms	diff 134µs
align 4096	pre 1.38ms	on 1.38ms	post 1.11ms	diff 136µs
align 2048	pre 1.38ms	on 1.38ms	post 1.11ms	diff 134µs
/data # ./flashbench -A -b 1024 /dev/block/mmcblk0p6
write align 524288	pre 1.69ms	on 2.38ms	post 1.78ms	diff 653µs
write align 262144	pre 1.87ms	on 2.59ms	post 1.86ms	diff 723µs
write align 131072	pre 1.88ms	on 2.61ms	post 1.89ms	diff 729µs
write align 65536	pre 1.86ms	on 2.65ms	post 1.83ms	diff 805µs
write align 32768	pre 1.88ms	on 2.61ms	post 1.92ms	diff 710µs
write align 16384	pre 1.8ms	on 2.57ms	post 1.95ms	diff 701µs
write align 8192	pre 1.66ms	on 1.71ms	post 1.64ms	diff 55µs
write align 4096	pre 1.67ms	on 1.71ms	post 1.64ms	diff 51.9µs
write align 2048	pre 1.67ms	on 1.71ms	post 1.61ms	diff 68.7µs
/data # ./flashbench -A -b 1024 /dev/block/mmcblk0p9
write align 8388608	pre 1.83ms	on 2.62ms	post 1.91ms	diff 750µs
write align 4194304	pre 1.89ms	on 2.87ms	post 2.06ms	diff 892µs
write align 2097152	pre 2.08ms	on 2.86ms	post 2.13ms	diff 751µs
write align 1048576	pre 2.06ms	on 2.93ms	post 2.17ms	diff 818µs
write align 524288	pre 2.07ms	on 2.85ms	post 2.18ms	diff 724µs
write align 262144	pre 2.07ms	on 2.85ms	post 2.15ms	diff 741µs
write align 131072	pre 2.05ms	on 2.93ms	post 2.19ms	diff 809µs
write align 65536	pre 1.86ms	on 2.77ms	post 1.9ms	diff 888µs
write align 32768	pre 2.06ms	on 2.91ms	post 2.19ms	diff 783µs
write align 16384	pre 2.05ms	on 2.76ms	post 1.8ms	diff 835µs
write align 8192	pre 1.83ms	on 1.89ms	post 1.8ms	diff 72.9µs
write align 4096	pre 1.84ms	on 1.9ms	post 1.8ms	diff 75µs
write align 2048	pre 1.84ms	on 1.89ms	post 1.8ms	diff 70.8µs
/data # ./flashbench -O -0 1 -b 512 /dev/block/mmcblk0p6 
4MiB    10.5M/s 
2MiB    10.1M/s 
1MiB    10.6M/s 
512KiB  10.5M/s 
256KiB  8.94M/s 
128KiB  7.74M/s 
64KiB   6.04M/s 
32KiB   4.13M/s 
16KiB   3.2M/s  
8KiB    3.87M/s 
4KiB    1.86M/s 
2KiB    1.16M/s 
1KiB    667K/s  
512B    396K/s  
/data # ./flashbench -O -0 1 -r  -b 512 /dev/block/mmcblk0p6 
4MiB    10.7M/s 
2MiB    10.3M/s 
1MiB    10.4M/s 
512KiB  16.3M/s 
256KiB  16.6M/s 
128KiB  16.1M/s 
64KiB   14M/s   
32KiB   11.1M/s 
16KiB   6.77M/s 
8KiB    3.15M/s 
4KiB    1.77M/s 
2KiB    1.01M/s 
1KiB    523K/s  
512B    296K/s  
/data # ./flashbench -O -0 2  -b 512 /dev/block/mmcblk0p6    
4MiB    11.5M/s 
2MiB    11.3M/s 
1MiB    11.5M/s 
512KiB  11.6M/s 
256KiB  10.8M/s 
128KiB  9.84M/s 
64KiB   7.88M/s 
32KiB   5.65M/s 
16KiB   4.14M/s 
8KiB    1.99M/s 
4KiB    1.42M/s 
2KiB    760K/s  
1KiB    392K/s  
512B    213K/s  
/data # ./flashbench -O -0 2 -r   -b 512 /dev/block/mmcblk0p6 
4MiB    10.3M/s 
2MiB    10.2M/s 
1MiB    10.1M/s 
512KiB  16M/s   
256KiB  15.8M/s 
128KiB  14.6M/s 
64KiB   11.4M/s 
32KiB   8.07M/s 
16KiB   5.12M/s 
8KiB    2.65M/s 
4KiB    1.43M/s 
2KiB    768K/s  
1KiB    395K/s  
512B    212K/s  
/data # ./flashbench -O -0 3    -b 512 /dev/block/mmcblk0p6   
4MiB    11.3M/s 
2MiB    11.5M/s 
1MiB    11.5M/s 
512KiB  11.5M/s 
256KiB  10.4M/s 
128KiB  9.1M/s  
64KiB   7.3M/s  
32KiB   5.21M/s 
16KiB   3.78M/s 
8KiB    2.08M/s 
4KiB    1.42M/s 
2KiB    792K/s  
1KiB    418K/s  
512B    217K/s 
/data/flashbench -O -0 3 -r  -b 512 /dev/block/mmcblk0p6
4MiB    10.7M/s 
2MiB    10.5M/s 
1MiB    10.2M/s 
512KiB  17.3M/s 
256KiB  16.3M/s 
128KiB  14.5M/s 
64KiB   11.4M/s 
32KiB   8.12M/s 
16KiB   4.98M/s 
8KiB    2.62M/s 
4KiB    1.4M/s  
2KiB    768K/s  
1KiB    390K/s  
512B    212K/s  
./flashbench -O -0 4 -b 512 /dev/block/mmcblk0p6
4MiB    14.4M/s 
2MiB    14M/s   
1MiB    13.9M/s 
512KiB  14.2M/s 
256KiB  13.5M/s 
128KiB  11.9M/s 
64KiB   9.8M/s  
32KiB   7.35M/s 
16KiB   5.1M/s  
8KiB    2.69M/s 
4KiB    1.58M/s 
2KiB    877K/s  
1KiB    476K/s  
512B    268K/s  
./flashbench -O -0 4 -r -b 512 /dev/block/mmcblk0p6
4MiB    10.4M/s 
2MiB    10.5M/s 
1MiB    14.3M/s 
512KiB  17.7M/s 
256KiB  16.9M/s 
128KiB  15.5M/s 
64KiB   12.4M/s 
32KiB   9.36M/s 
16KiB   5.62M/s 
8KiB    3M/s    
4KiB    1.62M/s 
2KiB    880K/s  
1KiB    462K/s  
512B    261K/s  


^ permalink raw reply	[flat|nested] 117+ messages in thread

* MMC quirks relating to performance/lifetime.
@ 2011-02-18 22:40               ` Andrei Warkentin
  0 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-18 22:40 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Feb 18, 2011 at 1:47 PM, Andrei Warkentin <andreiw@motorola.com> wrote:
> On Fri, Feb 18, 2011 at 7:44 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>> I'm curious. Neither the manfid nor the oemid fields of either card
>> match what I have seen on SD cards, I would expect them to be
>>
>> Sandisk: manfid 0x000003, oemid 0x5344
>> Toshiba: manfid 0x000002, oemid 0x544d
>>
>> I have not actually seen any Toshiba SD cards, but I assume that they
>> use the same controllers as Kingston.
>>
>> Does anyone know if the IDs have any correlation between MMC and SD
>> controllers?
>>
>>        Arnd
>>
>
> I'm unsure about the older scheme (assigned by MMCA), but ever since
> MMC is now JEDEC-controlled, the IDs have changed. Sandisk's new id
> will be 0x45, and Toshiba I guess will be 0x11.
>

Flashbench timings for both Sandisk and Toshiba cards. Attaching due to size.

Some interesting things that I don't understand. For the align test, I
extended it to do a write align test (-A). I tried two partitions that
I could write over, and both read and writes behaved differently for
the two partitions on same device. Odd. They are both 4MB aligned.

On the sandisk it was the write align that made the page size stand
out.  The read align had pretty constant results.

On the toshiba the results varied wildly for the two partitions. For
partition 6, there was a clear pattern in the diff values for read
align. For 9, it was all over the place. For 9 with the write align,
8K and 16K the crossing writes took ~115ms!! Look in attached files
for all the data.

The AU tests were interesting too, especially how with several open
AUs the throughput is higher for certain smaller sizes on sandisk, but
if I interpret it correctly both cards have at least 4 AUs, as I
didn't see yet a significant drop for small sizes. The larger ones I
am running now on mmcblk0p9 which is sufficiently larger for these
tests... (mmcblk0p6 is only 40mb, p9 is 314 mb)

Thanks,
A
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: toshiba.txt
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20110218/3e560d5a/attachment.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: sandisk.txt
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20110218/3e560d5a/attachment-0001.txt>

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-18 22:40               ` Andrei Warkentin
@ 2011-02-18 23:17                 ` Andrei Warkentin
  -1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-18 23:17 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc

[-- Attachment #1: Type: text/plain, Size: 4240 bytes --]

2011/2/18 Andrei Warkentin <andreiw@motorola.com>:
> On Fri, Feb 18, 2011 at 1:47 PM, Andrei Warkentin <andreiw@motorola.com> wrote:
>> On Fri, Feb 18, 2011 at 7:44 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>>> I'm curious. Neither the manfid nor the oemid fields of either card
>>> match what I have seen on SD cards, I would expect them to be
>>>
>>> Sandisk: manfid 0x000003, oemid 0x5344
>>> Toshiba: manfid 0x000002, oemid 0x544d
>>>
>>> I have not actually seen any Toshiba SD cards, but I assume that they
>>> use the same controllers as Kingston.
>>>
>>> Does anyone know if the IDs have any correlation between MMC and SD
>>> controllers?
>>>
>>>        Arnd
>>>
>>
>> I'm unsure about the older scheme (assigned by MMCA), but ever since
>> MMC is now JEDEC-controlled, the IDs have changed. Sandisk's new id
>> will be 0x45, and Toshiba I guess will be 0x11.
>>
>
> Flashbench timings for both Sandisk and Toshiba cards. Attaching due to size.
>
> Some interesting things that I don't understand. For the align test, I
> extended it to do a write align test (-A). I tried two partitions that
> I could write over, and both read and writes behaved differently for
> the two partitions on same device. Odd. They are both 4MB aligned.
>
> On the sandisk it was the write align that made the page size stand
> out.  The read align had pretty constant results.
>
> On the toshiba the results varied wildly for the two partitions. For
> partition 6, there was a clear pattern in the diff values for read
> align. For 9, it was all over the place. For 9 with the write align,
> 8K and 16K the crossing writes took ~115ms!! Look in attached files
> for all the data.
>
> The AU tests were interesting too, especially how with several open
> AUs the throughput is higher for certain smaller sizes on sandisk, but
> if I interpret it correctly both cards have at least 4 AUs, as I
> didn't see yet a significant drop for small sizes. The larger ones I
> am running now on mmcblk0p9 which is sufficiently larger for these
> tests... (mmcblk0p6 is only 40mb, p9 is 314 mb)
>
> Thanks,
> A
>

I thought this was pretty interesting -

# echo 0 > /sys/block/mmcblk0/device/page_size
# ./flashbench -A -b 1024 /dev/block/mmcblk0p9
write align 8388608	pre 3.59ms	on 6.54ms	post 3.65ms	diff 2.92ms
write align 4194304	pre 4.13ms	on 7.37ms	post 4.27ms	diff 3.17ms
write align 2097152	pre 3.62ms	on 6.81ms	post 3.94ms	diff 3.03ms
write align 1048576	pre 3.62ms	on 6.53ms	post 3.55ms	diff 2.95ms
write align 524288	pre 3.62ms	on 6.51ms	post 3.63ms	diff 2.88ms
write align 262144	pre 3.62ms	on 6.51ms	post 3.63ms	diff 2.89ms
write align 131072	pre 3.62ms	on 6.5ms	post 3.63ms	diff 2.88ms
write align 65536	pre 3.61ms	on 6.49ms	post 3.62ms	diff 2.88ms
write align 32768	pre 3.61ms	on 6.49ms	post 3.61ms	diff 2.88ms
write align 16384	pre 3.68ms	on 107ms	post 3.51ms	diff 103ms
write align 8192	pre 3.74ms	on 121ms	post 3.91ms	diff 117ms
write align 4096	pre 3.88ms	on 3.87ms	post 3.87ms	diff -2937ns
write align 2048	pre 3.89ms	on 3.88ms	post 3.88ms	diff -8734ns
# fjnh84@fjnh84-desktop:~/src/n/src/flash$ adb -s 17006185428011d7 shell
# echo 8192 > /sys/block/mmcblk0/device/page_size
# cd data
# ./flashbench -A -b 1024 /dev/block/mmcblk0p9
write align 8388608	pre 3.33ms	on 6.8ms	post 3.65ms	diff 3.31ms
write align 4194304	pre 4.34ms	on 8.14ms	post 4.53ms	diff 3.71ms
write align 2097152	pre 3.64ms	on 7.31ms	post 4.09ms	diff 3.44ms
write align 1048576	pre 3.65ms	on 7.52ms	post 3.65ms	diff 3.87ms
write align 524288	pre 3.62ms	on 6.8ms	post 3.63ms	diff 3.17ms
write align 262144	pre 3.62ms	on 6.84ms	post 3.63ms	diff 3.22ms
write align 131072	pre 3.62ms	on 6.85ms	post 3.44ms	diff 3.32ms
write align 65536	pre 3.39ms	on 6.8ms	post 3.66ms	diff 3.28ms
write align 32768	pre 3.64ms	on 6.86ms	post 3.66ms	diff 3.21ms
write align 16384	pre 3.67ms	on 6.86ms	post 3.65ms	diff 3.2ms
write align 8192	pre 3.66ms	on 6.84ms	post 3.64ms	diff 3.19ms
write align 4096	pre 3.71ms	on 3.71ms	post 3.64ms	diff 38.6µs
write align 2048	pre 3.71ms	on 3.71ms	post 3.72ms	diff -656ns

This was with the split unaligned accesses patch... Which I am
attaching for comments.

Thanks,
A

[-- Attachment #2: 0001-MMC-Split-non-page-size-aligned-accesses.patch --]
[-- Type: text/x-diff, Size: 5196 bytes --]

From b3e6a556a716e7cec86071342197e798b38c3cbf Mon Sep 17 00:00:00 2001
From: Andrei Warkentin <andreiw@motorola.com>
Date: Fri, 18 Feb 2011 17:46:00 -0600
Subject: [PATCH] MMC: Split non-page-size aligned accesses.

If the card page size is known, splits the access into an unaligned
and an aligned portion, which helps with the performance.

Change-Id: I4ad7588d613d775212fac87436e418577909a22b
Signed-off-by: Andrei Warkentin <andreiw@motorola.com>
---
 drivers/mmc/card/block.c |  111 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mmc/card.h |    1 +
 2 files changed, 112 insertions(+), 0 deletions(-)

diff --git a/drivers/mmc/card/block.c b/drivers/mmc/card/block.c
index 7054fd5..be7d739 100644
--- a/drivers/mmc/card/block.c
+++ b/drivers/mmc/card/block.c
@@ -22,6 +22,7 @@
 #include <linux/init.h>
 
 #include <linux/kernel.h>
+#include <linux/ctype.h>
 #include <linux/fs.h>
 #include <linux/slab.h>
 #include <linux/errno.h>
@@ -67,6 +68,74 @@ struct mmc_blk_data {
 
 static DEFINE_MUTEX(open_lock);
 
+static ssize_t
+show_block_attr(struct device *dev, struct device_attribute *attr,
+		char *buf);
+
+static ssize_t
+set_block_attr(struct device *dev, struct device_attribute *attr,
+	       const char *buf, size_t count);
+
+static DEVICE_ATTR(page_size, S_IRUGO | S_IWUSR, show_block_attr, set_block_attr);
+
+static ssize_t
+show_block_attr(struct device *dev, struct device_attribute *attr,
+		char *buf)
+{
+	unsigned int val;
+	ssize_t ret = 0;
+	struct mmc_card *card = container_of(dev, struct mmc_card, dev);
+	mmc_claim_host(card->host);
+	if (attr == &dev_attr_page_size)
+		val = card->page_size;
+	else
+		ret = -EINVAL;
+
+	mmc_release_host(card->host);
+	if (!ret)
+		ret = sprintf(buf, "%u\n", val);
+	return ret;
+}
+
+static ssize_t
+set_block_attr(struct device *dev, struct device_attribute *attr,
+	       const char *buf, size_t count)
+{
+	ssize_t ret;
+	char *after;
+	unsigned int val, *dest = NULL;
+	struct mmc_card *card = container_of(dev, struct mmc_card, dev);
+	val = simple_strtoul(buf, &after, 10);
+	ret = after - buf;
+
+	while (isspace(*after++))
+		ret++;
+
+	if (ret != count)
+		return -EINVAL;
+
+	if (attr == &dev_attr_page_size)
+		dest = &card->page_size;
+	else
+		return -EINVAL;
+
+	if (dest) {
+		mmc_claim_host(card->host);
+		*dest = val;
+		mmc_release_host(card->host);
+	}
+	return ret;
+}
+
+static struct attribute *capability_attrs[] = {
+	&dev_attr_page_size.attr,
+	NULL,
+};
+
+static struct attribute_group attr_group = {
+        .attrs = capability_attrs,
+};
+
 static struct mmc_blk_data *mmc_blk_get(struct gendisk *disk)
 {
 	struct mmc_blk_data *md;
@@ -312,6 +381,38 @@ out:
 	return err ? 0 : 1;
 }
 
+
+/*
+ * If the request is not aligned, split it into an unaligned
+ * and an aligned portion. Here we can adjust
+ * the size of the MMC request and let the block layer request handler
+ * deal with generating another MMC request.
+ */
+static bool mmc_adjust_write(struct mmc_card *card,
+			     struct mmc_request *mrq)
+{
+	unsigned int left_in_page;
+	unsigned int page_size_blocks;
+
+	if (!card->page_size)
+		return false;
+
+	page_size_blocks = card->page_size / mrq->data->blksz;
+	left_in_page = page_size_blocks -
+		(mrq->cmd->arg % page_size_blocks);
+
+	/* Aligned access. */
+	if (left_in_page == page_size_blocks)
+		return false;
+
+	/* Not straddling page boundary. */
+	if (mrq->data->blocks <= left_in_page)
+		return false;
+
+	mrq->data->blocks = left_in_page;
+	return true;
+}
+
 static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *req)
 {
 	struct mmc_blk_data *md = mq->data;
@@ -339,6 +440,10 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *req)
 		brq.stop.flags = MMC_RSP_SPI_R1B | MMC_RSP_R1B | MMC_CMD_AC;
 		brq.data.blocks = blk_rq_sectors(req);
 
+		/* Check for unaligned accesses straddling pages. */
+		if (rq_data_dir(req) == WRITE)
+			mmc_adjust_write(card, &brq.mrq);
+
 		/*
 		 * The block layer doesn't support all sector count
 		 * restrictions, so we need to be prepared for too big
@@ -707,6 +812,10 @@ static int mmc_blk_probe(struct mmc_card *card)
 	if (err)
 		goto out;
 
+	err = sysfs_create_group(&card->dev.kobj, &attr_group);
+	if (err)
+		goto out;
+
 	string_get_size((u64)get_capacity(md->disk) << 9, STRING_UNITS_2,
 			cap_str, sizeof(cap_str));
 	printk(KERN_INFO "%s: %s %s %s %s\n",
@@ -735,6 +844,8 @@ static void mmc_blk_remove(struct mmc_card *card)
 		/* Stop new requests from getting into the queue */
 		del_gendisk(md->disk);
 
+		sysfs_remove_group(&card->dev.kobj, &attr_group);
+
 		/* Then flush out any already in there */
 		mmc_cleanup_queue(&md->queue);
 
diff --git a/include/linux/mmc/card.h b/include/linux/mmc/card.h
index 6b75250..d52768a 100644
--- a/include/linux/mmc/card.h
+++ b/include/linux/mmc/card.h
@@ -123,6 +123,7 @@ struct mmc_card {
 	unsigned int		erase_size;	/* erase size in sectors */
  	unsigned int		erase_shift;	/* if erase unit is power 2 */
  	unsigned int		pref_erase;	/* in sectors */
+ 	unsigned int		page_size;	/* page size in bytes */
  	u8			erased_byte;	/* value of erased bytes */
 
 	u32			raw_cid[4];	/* raw card CID */
-- 
1.7.0.4


^ permalink raw reply related	[flat|nested] 117+ messages in thread

* MMC quirks relating to performance/lifetime.
@ 2011-02-18 23:17                 ` Andrei Warkentin
  0 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-18 23:17 UTC (permalink / raw)
  To: linux-arm-kernel

2011/2/18 Andrei Warkentin <andreiw@motorola.com>:
> On Fri, Feb 18, 2011 at 1:47 PM, Andrei Warkentin <andreiw@motorola.com> wrote:
>> On Fri, Feb 18, 2011 at 7:44 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>>> I'm curious. Neither the manfid nor the oemid fields of either card
>>> match what I have seen on SD cards, I would expect them to be
>>>
>>> Sandisk: manfid 0x000003, oemid 0x5344
>>> Toshiba: manfid 0x000002, oemid 0x544d
>>>
>>> I have not actually seen any Toshiba SD cards, but I assume that they
>>> use the same controllers as Kingston.
>>>
>>> Does anyone know if the IDs have any correlation between MMC and SD
>>> controllers?
>>>
>>>        Arnd
>>>
>>
>> I'm unsure about the older scheme (assigned by MMCA), but ever since
>> MMC is now JEDEC-controlled, the IDs have changed. Sandisk's new id
>> will be 0x45, and Toshiba I guess will be 0x11.
>>
>
> Flashbench timings for both Sandisk and Toshiba cards. Attaching due to size.
>
> Some interesting things that I don't understand. For the align test, I
> extended it to do a write align test (-A). I tried two partitions that
> I could write over, and both read and writes behaved differently for
> the two partitions on same device. Odd. They are both 4MB aligned.
>
> On the sandisk it was the write align that made the page size stand
> out.  The read align had pretty constant results.
>
> On the toshiba the results varied wildly for the two partitions. For
> partition 6, there was a clear pattern in the diff values for read
> align. For 9, it was all over the place. For 9 with the write align,
> 8K and 16K the crossing writes took ~115ms!! Look in attached files
> for all the data.
>
> The AU tests were interesting too, especially how with several open
> AUs the throughput is higher for certain smaller sizes on sandisk, but
> if I interpret it correctly both cards have at least 4 AUs, as I
> didn't see yet a significant drop for small sizes. The larger ones I
> am running now on mmcblk0p9 which is sufficiently larger for these
> tests... (mmcblk0p6 is only 40mb, p9 is 314 mb)
>
> Thanks,
> A
>

I thought this was pretty interesting -

# echo 0 > /sys/block/mmcblk0/device/page_size
# ./flashbench -A -b 1024 /dev/block/mmcblk0p9
write align 8388608	pre 3.59ms	on 6.54ms	post 3.65ms	diff 2.92ms
write align 4194304	pre 4.13ms	on 7.37ms	post 4.27ms	diff 3.17ms
write align 2097152	pre 3.62ms	on 6.81ms	post 3.94ms	diff 3.03ms
write align 1048576	pre 3.62ms	on 6.53ms	post 3.55ms	diff 2.95ms
write align 524288	pre 3.62ms	on 6.51ms	post 3.63ms	diff 2.88ms
write align 262144	pre 3.62ms	on 6.51ms	post 3.63ms	diff 2.89ms
write align 131072	pre 3.62ms	on 6.5ms	post 3.63ms	diff 2.88ms
write align 65536	pre 3.61ms	on 6.49ms	post 3.62ms	diff 2.88ms
write align 32768	pre 3.61ms	on 6.49ms	post 3.61ms	diff 2.88ms
write align 16384	pre 3.68ms	on 107ms	post 3.51ms	diff 103ms
write align 8192	pre 3.74ms	on 121ms	post 3.91ms	diff 117ms
write align 4096	pre 3.88ms	on 3.87ms	post 3.87ms	diff -2937ns
write align 2048	pre 3.89ms	on 3.88ms	post 3.88ms	diff -8734ns
# fjnh84@fjnh84-desktop:~/src/n/src/flash$ adb -s 17006185428011d7 shell
# echo 8192 > /sys/block/mmcblk0/device/page_size
# cd data
# ./flashbench -A -b 1024 /dev/block/mmcblk0p9
write align 8388608	pre 3.33ms	on 6.8ms	post 3.65ms	diff 3.31ms
write align 4194304	pre 4.34ms	on 8.14ms	post 4.53ms	diff 3.71ms
write align 2097152	pre 3.64ms	on 7.31ms	post 4.09ms	diff 3.44ms
write align 1048576	pre 3.65ms	on 7.52ms	post 3.65ms	diff 3.87ms
write align 524288	pre 3.62ms	on 6.8ms	post 3.63ms	diff 3.17ms
write align 262144	pre 3.62ms	on 6.84ms	post 3.63ms	diff 3.22ms
write align 131072	pre 3.62ms	on 6.85ms	post 3.44ms	diff 3.32ms
write align 65536	pre 3.39ms	on 6.8ms	post 3.66ms	diff 3.28ms
write align 32768	pre 3.64ms	on 6.86ms	post 3.66ms	diff 3.21ms
write align 16384	pre 3.67ms	on 6.86ms	post 3.65ms	diff 3.2ms
write align 8192	pre 3.66ms	on 6.84ms	post 3.64ms	diff 3.19ms
write align 4096	pre 3.71ms	on 3.71ms	post 3.64ms	diff 38.6µs
write align 2048	pre 3.71ms	on 3.71ms	post 3.72ms	diff -656ns

This was with the split unaligned accesses patch... Which I am
attaching for comments.

Thanks,
A
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-MMC-Split-non-page-size-aligned-accesses.patch
Type: text/x-diff
Size: 5195 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20110218/333fe63e/attachment-0001.bin>

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-18 22:40               ` Andrei Warkentin
@ 2011-02-19  9:54                 ` Arnd Bergmann
  -1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-19  9:54 UTC (permalink / raw)
  To: Andrei Warkentin; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc

On Friday 18 February 2011 23:40:16 Andrei Warkentin wrote:
> On Fri, Feb 18, 2011 at 1:47 PM, Andrei Warkentin <andreiw@motorola.com> wrote:
>
> Flashbench timings for both Sandisk and Toshiba cards. Attaching due to size.

Very nice, thanks for the measurement!

I don't think having the results inline in the mail is a problem,
it would even make it easier to quote.
 
> Some interesting things that I don't understand. For the align test, I
> extended it to do a write align test (-A). I tried two partitions that
> I could write over, and both read and writes behaved differently for
> the two partitions on same device. Odd. They are both 4MB aligned.

I never did a write align test because the results will be highly
unreliable as soon as you get into thrashing. Your results seem
to be meaningful still, so maybe we should have it after all, but
I'll put a big warning on it.

> On the sandisk it was the write align that made the page size stand
> out.  The read align had pretty constant results.

I've noticed on other Sandisk media that the read align test is
sometimes useless. It may help to do a full erase of the partition,
or to fill it with data before running the test.

> On the toshiba the results varied wildly for the two partitions. For
> partition 6, there was a clear pattern in the diff values for read
> align. For 9, it was all over the place. For 9 with the write align,
> 8K and 16K the crossing writes took ~115ms!! Look in attached files
> for all the data.

Partition 6 is a lot smaller, so you have the accesses less than a
segment apart, so it shows other effects.

> The AU tests were interesting too, especially how with several open
> AUs the throughput is higher for certain smaller sizes on sandisk, but
> if I interpret it correctly both cards have at least 4 AUs, as I
> didn't see yet a significant drop for small sizes. The larger ones I
> am running now on mmcblk0p9 which is sufficiently larger for these
> tests... (mmcblk0p6 is only 40mb, p9 is 314 mb)

Right, you should try larger values for --open-au-nr here. It's at
least a good sign that the drive can do random access inside a segment
and that it can have at least 4 segments open. This is much better
than I expected from your descriptions at first.

However, the drop from 32 KB to 16 KB in performance is horrifying
for the Toshiba drive, it's clear that this one does not like
to be accessed smaller than 32 KB at a time, an obvious optimization
for FAT32 with 32 KB clusters. How does this change with your
kernel patches?

For the sandisk drive, it's funny how it is consistently faster
doing random access than linear access. I don't think I've seen that
before. It does seem to have some cache for linear access using
smaller than 16 KB, and can probably combine them when it's only
writing to a single segment.

	Arnd

^ permalink raw reply	[flat|nested] 117+ messages in thread

* MMC quirks relating to performance/lifetime.
@ 2011-02-19  9:54                 ` Arnd Bergmann
  0 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-19  9:54 UTC (permalink / raw)
  To: linux-arm-kernel

On Friday 18 February 2011 23:40:16 Andrei Warkentin wrote:
> On Fri, Feb 18, 2011 at 1:47 PM, Andrei Warkentin <andreiw@motorola.com> wrote:
>
> Flashbench timings for both Sandisk and Toshiba cards. Attaching due to size.

Very nice, thanks for the measurement!

I don't think having the results inline in the mail is a problem,
it would even make it easier to quote.
 
> Some interesting things that I don't understand. For the align test, I
> extended it to do a write align test (-A). I tried two partitions that
> I could write over, and both read and writes behaved differently for
> the two partitions on same device. Odd. They are both 4MB aligned.

I never did a write align test because the results will be highly
unreliable as soon as you get into thrashing. Your results seem
to be meaningful still, so maybe we should have it after all, but
I'll put a big warning on it.

> On the sandisk it was the write align that made the page size stand
> out.  The read align had pretty constant results.

I've noticed on other Sandisk media that the read align test is
sometimes useless. It may help to do a full erase of the partition,
or to fill it with data before running the test.

> On the toshiba the results varied wildly for the two partitions. For
> partition 6, there was a clear pattern in the diff values for read
> align. For 9, it was all over the place. For 9 with the write align,
> 8K and 16K the crossing writes took ~115ms!! Look in attached files
> for all the data.

Partition 6 is a lot smaller, so you have the accesses less than a
segment apart, so it shows other effects.

> The AU tests were interesting too, especially how with several open
> AUs the throughput is higher for certain smaller sizes on sandisk, but
> if I interpret it correctly both cards have at least 4 AUs, as I
> didn't see yet a significant drop for small sizes. The larger ones I
> am running now on mmcblk0p9 which is sufficiently larger for these
> tests... (mmcblk0p6 is only 40mb, p9 is 314 mb)

Right, you should try larger values for --open-au-nr here. It's at
least a good sign that the drive can do random access inside a segment
and that it can have at least 4 segments open. This is much better
than I expected from your descriptions at first.

However, the drop from 32 KB to 16 KB in performance is horrifying
for the Toshiba drive, it's clear that this one does not like
to be accessed smaller than 32 KB at a time, an obvious optimization
for FAT32 with 32 KB clusters. How does this change with your
kernel patches?

For the sandisk drive, it's funny how it is consistently faster
doing random access than linear access. I don't think I've seen that
before. It does seem to have some cache for linear access using
smaller than 16 KB, and can probably combine them when it's only
writing to a single segment.

	Arnd

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-18 23:17                 ` Andrei Warkentin
@ 2011-02-19 11:20                   ` Arnd Bergmann
  -1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-19 11:20 UTC (permalink / raw)
  To: Andrei Warkentin; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc

On Saturday 19 February 2011 00:17:51 Andrei Warkentin wrote:
> # echo 0 > /sys/block/mmcblk0/device/page_size
> # ./flashbench -A -b 1024 /dev/block/mmcblk0p9
> write align 8388608     pre 3.59ms      on 6.54ms       post 3.65ms     diff 2.92ms
> write align 4194304     pre 4.13ms      on 7.37ms       post 4.27ms     diff 3.17ms
> write align 2097152     pre 3.62ms      on 6.81ms       post 3.94ms     diff 3.03ms
> write align 1048576     pre 3.62ms      on 6.53ms       post 3.55ms     diff 2.95ms
> write align 524288      pre 3.62ms      on 6.51ms       post 3.63ms     diff 2.88ms
> write align 262144      pre 3.62ms      on 6.51ms       post 3.63ms     diff 2.89ms
> write align 131072      pre 3.62ms      on 6.5ms        post 3.63ms     diff 2.88ms
> write align 65536       pre 3.61ms      on 6.49ms       post 3.62ms     diff 2.88ms
> write align 32768       pre 3.61ms      on 6.49ms       post 3.61ms     diff 2.88ms
> write align 16384       pre 3.68ms      on 107ms        post 3.51ms     diff 103ms
> write align 8192        pre 3.74ms      on 121ms        post 3.91ms     diff 117ms
> write align 4096        pre 3.88ms      on 3.87ms       post 3.87ms     diff -2937ns
> write align 2048        pre 3.89ms      on 3.88ms       post 3.88ms     diff -8734ns
> # fjnh84@fjnh84-desktop:~/src/n/src/flash$ adb -s 17006185428011d7 shell
> # echo 8192 > /sys/block/mmcblk0/device/page_size
> # cd data
> # ./flashbench -A -b 1024 /dev/block/mmcblk0p9
> write align 8388608     pre 3.33ms      on 6.8ms        post 3.65ms     diff 3.31ms
> write align 4194304     pre 4.34ms      on 8.14ms       post 4.53ms     diff 3.71ms
> write align 2097152     pre 3.64ms      on 7.31ms       post 4.09ms     diff 3.44ms
> write align 1048576     pre 3.65ms      on 7.52ms       post 3.65ms     diff 3.87ms
> write align 524288      pre 3.62ms      on 6.8ms        post 3.63ms     diff 3.17ms
> write align 262144      pre 3.62ms      on 6.84ms       post 3.63ms     diff 3.22ms
> write align 131072      pre 3.62ms      on 6.85ms       post 3.44ms     diff 3.32ms
> write align 65536       pre 3.39ms      on 6.8ms        post 3.66ms     diff 3.28ms
> write align 32768       pre 3.64ms      on 6.86ms       post 3.66ms     diff 3.21ms
> write align 16384       pre 3.67ms      on 6.86ms       post 3.65ms     diff 3.2ms
> write align 8192        pre 3.66ms      on 6.84ms       post 3.64ms     diff 3.19ms
> write align 4096        pre 3.71ms      on 3.71ms       post 3.64ms     diff 38.6µs
> write align 2048        pre 3.71ms      on 3.71ms       post 3.72ms     diff -656ns
> 
> This was with the split unaligned accesses patch... Which I am
> attaching for comments.

I agree, this is very fascinating behavior. A 100 ms latency for a
single 2KB access is definitely something we should try to avoid, and I
wonder why the drive decides to do that. It must get into a state where
it requires an extra garbage collection (you mentioned that earlier).
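
For anyone following the numbers: the "diff" column in the quoted output is consistent with the "on" time (the access that crosses the boundary) minus the mean of the "pre" and "post" times (the neighbouring accesses that do not cross it). The sketch below just reproduces that arithmetic on four of the quoted rows; the helper name `align_diff` is made up for illustration and the formula is inferred from the figures, not copied from flashbench.

```python
def align_diff(pre_ms, on_ms, post_ms):
    """Extra cost of the boundary-crossing access over the mean of its neighbours."""
    return on_ms - (pre_ms + post_ms) / 2.0


# (pre, on, post) in milliseconds, taken from the page_size=0 run quoted above.
rows = {
    32768: (3.61, 6.49, 3.61),
    16384: (3.68, 107.0, 3.51),
    8192: (3.74, 121.0, 3.91),
    4096: (3.88, 3.87, 3.87),
}

for align, (pre, on, post) in rows.items():
    print(f"align {align:>6}: diff {align_diff(pre, on, post):8.2f} ms")
```

Read this way, the 8 KB and 16 KB rows are the only ones where crossing the boundary costs far more than two ordinary accesses.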

The numbers you see here are taken over multiple runs. Do you see a lot
of fluctuation when doing this with --count=1?

Also, does the same happen with other blocksizes, e.g. 4096 or 8192, passed
to flashbench?

	Arnd

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-19  9:54                 ` Arnd Bergmann
@ 2011-02-20  4:39                   ` Andrei Warkentin
  -1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-20  4:39 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc

On Sat, Feb 19, 2011 at 3:54 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Friday 18 February 2011 23:40:16 Andrei Warkentin wrote:
>> On Fri, Feb 18, 2011 at 1:47 PM, Andrei Warkentin <andreiw@motorola.com> wrote:
>>
>> Flashbench timings for both Sandisk and Toshiba cards. Attaching due to size.
>
> Very nice, thanks for the measurement!
>
> I don't think having the results inline in the mail is a problem,
> it would even make it easier to quote.
>
>> Some interesting things that I don't understand. For the align test, I
>> extended it to do a write align test (-A). I tried two partitions that
>> I could write over, and both read and writes behaved differently for
>> the two partitions on same device. Odd. They are both 4MB aligned.
>
> I never did a write align test because the results will be highly
> unreliable as soon as you get into thrashing. Your results seem
> to be meaningful still, so maybe we should have it after all, but
> I'll put a big warning on it.
>

Actually, it would be a good idea to also bail out or warn if you run the AU
test with more open AUs than the size of the passed device allows,
since it will just wrap around and skew the results.
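
A minimal sketch of the kind of check I mean, assuming a 4 MiB allocation unit (the largest size in the runs below) and the partition sizes mentioned later in this mail. The function names are hypothetical and this is not flashbench code.

```python
MIB = 1024 * 1024


def max_open_aus(device_bytes, au_bytes):
    """Number of whole allocation units that fit on the device."""
    return device_bytes // au_bytes


def check_open_au_nr(device_bytes, au_bytes, open_au_nr):
    """Raise if the requested number of open AUs would wrap around the device."""
    limit = max_open_aus(device_bytes, au_bytes)
    if open_au_nr > limit:
        raise ValueError(
            f"open-au-nr {open_au_nr} exceeds the {limit} AUs that fit on the device"
        )
    return limit


# Assumed sizes: mmcblk0p6 is about 40 MB, mmcblk0p9 about 314 MB, AU = 4 MiB.
print(max_open_aus(40 * MIB, 4 * MIB))
print(max_open_aus(314 * MIB, 4 * MIB))
```

With these assumptions the small partition only holds 10 AUs, so any open-AU test asking for more than that on it would wrap.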

>> On the sandisk it was the write align that made the page size stand
>> out.  The read align had pretty constant results.
>
> I've noticed on other Sandisk media that the read align test is
> sometimes useless. It may help to do a full erase of the partition,
> or to fill it with data before running the test.
>
>> On the toshiba the results varied wildly for the two partitions. For
>> partition 6, there was a clear pattern in the diff values for read
>> align. For 9, it was all over the place. For 9 with the write align,
>> 8K and 16K the crossing writes took ~115ms!! Look in attached files
>> for all the data.
>
> Partition 6 is a lot smaller, so you have the accesses less than a
> segment apart, so it shows other effects.
>
>> The AU tests were interesting too, especially how with several open
>> AUs the throughput is higher for certain smaller sizes on sandisk, but
>> if I interpret it correctly both cards have at least 4 AUs, as I
>> didn't see yet a significant drop for small sizes. The larger ones I
>> am running now on mmcblk0p9 which is sufficiently larger for these
>> tests... (mmcblk0p6 is only 40mb, p9 is 314 mb)
>
> Right, you should try larger values for --open-au-nr here. It's at
> least a good sign that the drive can do random access inside a segment
> and that it can have at least 4 segments open. This is much better
> than I expected from your descriptions at first.

Actually the Toshiba one seems to have 7 AUs if I interpret this correctly.
^C
# ./flashbench -O -0 6  -b 512 /dev/block/mmcblk0p9
4MiB    5.91M/s
2MiB    8.84M/s
1MiB    10.8M/s
512KiB  13M/s
256KiB  13.6M/s

^C
# ./flashbench -O -0 7  -b 512 /dev/block/mmcblk0p9
4MiB    6.32M/s
2MiB    8.63M/s
1MiB    10.5M/s
512KiB  13.2M/s
256KiB  13M/s
128KiB  12.3M/s
^C
# ./flashbench -O -0 8  -b 512 /dev/block/mmcblk0p9
4MiB    6.65M/s
2MiB    7.02M/s
1MiB    6.36M/s
512KiB  3.17M/s
256KiB  1.53M/s

The Sandisk one has 20 AUs.

# ./flashbench -O -0 20  -b 512 /dev/block/mmcblk0p9
4MiB    11.3M/s
2MiB    12.8M/s
1MiB    9.87M/s
512KiB  9.97M/s
256KiB  9.13M/s
128KiB  8.05M/s
^C
# ./flashbench -O -0 50  -b 512 /dev/block/mmcblk0p9
4MiB    7.19M/s
^C
# ./flashbench -O -0 2  -b 512 /dev/block/mmcblk0p9
^C
# ./flashbench -O -0 22  -b 512 /dev/block/mmcblk0p9
4MiB    11.6M/s
2MiB    12.3M/s
1MiB    5.13M/s
512KiB  2.57M/s
256KiB  1.59M/s
128KiB  1.16M/s
64KiB   776K/s
^C
# ./flashbench -O -0 21  -b 512 /dev/block/mmcblk0p9
4MiB    11.2M/s
2MiB    12.4M/s
1MiB    4.65M/s
512KiB  1.95M/s
256KiB  955K/s
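
The way I am reading these tables can be sketched as follows: take the 256KiB throughput for each open-AU count and look for the first count where it collapses. The factor-of-two threshold and the function name are arbitrary choices for illustration; the figures are the 256KiB rows from the runs above, in M/s.

```python
def open_au_limit(results, drop_factor=2.0):
    """Return the largest open-AU count before small-block throughput collapses.

    results maps open-AU count -> throughput for one small block size.
    """
    counts = sorted(results)
    for prev, cur in zip(counts, counts[1:]):
        if results[prev] / results[cur] >= drop_factor:
            return prev
    # No collapse seen within the tested range.
    return counts[-1]


# 256KiB rows from the runs above (955K/s written as 0.955).
toshiba = {6: 13.6, 7: 13.0, 8: 1.53}
sandisk = {20: 9.13, 21: 0.955, 22: 1.59}

print("Toshiba:", open_au_limit(toshiba))
print("Sandisk:", open_au_limit(sandisk))
```

This only brackets the limit within the counts actually tested, so it says nothing about counts in between that were not run.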

>
> However, the drop from 32 KB to 16 KB in performance is horrifying
> for the Toshiba drive, it's clear that this one does not like
> to be accessed smaller than 32 KB at a time, an obvious optimization
> for FAT32 with 32 KB clusters. How does this change with your
> kernel patches?

Since the only performance-increasing patch here would be just the one
that splits unaligned accesses, I wouldn't expect any improvements for
page-aligned accesses < 32KB. As you can see here...

# cat /sys/block/mmcblk0/device/page_size
8192
# ./flashbench -O -0 1  -b 512 /dev/block/mmcblk0p9
4MiB    6.81M/s
2MiB    7.73M/s
1MiB    9.21M/s
512KiB  9.98M/s
256KiB  10.3M/s
128KiB  10.2M/s
64KiB   9.76M/s
32KiB   8.52M/s
16KiB   3.68M/s
8KiB    1.72M/s
4KiB    837K/s
^C
# echo 0 >  /sys/block/mmcblk0/device/page_size
# ./flashbench -O -0 1  -b 512 /dev/block/mmcblk0p9
4MiB    6.42M/s
2MiB    7.79M/s
1MiB    9.22M/s
512KiB  10M/s
256KiB  9.94M/s
128KiB  10.1M/s
64KiB   9.68M/s
32KiB   8.5M/s
16KiB   3.65M/s
8KiB    1.73M/s
4KiB    838K/s
2KiB    417K/s
^C
#
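
To illustrate why page-aligned accesses are unaffected, here is a rough model of the splitting idea, not the patch itself. It assumes a request is only touched when it starts off a page_size boundary, in which case the head up to the next boundary is issued separately, and that page_size 0 disables splitting. `split_unaligned` is a hypothetical name.

```python
def split_unaligned(offset, length, page_size):
    """Split a request that starts off a page boundary at the next boundary.

    Returns a list of (offset, length) tuples. Aligned requests, requests that
    do not reach the next boundary, and page_size <= 0 are returned unchanged.
    """
    if page_size <= 0 or offset % page_size == 0:
        return [(offset, length)]
    head = min(length, page_size - offset % page_size)
    if head == length:
        return [(offset, length)]
    return [(offset, head), (offset + head, length - head)]


# A 2 KB write straddling an 8 KB boundary, as in the write-align test with -b 1024.
print(split_unaligned(8192 - 1024, 2048, 8192))
# An aligned 16 KB write is left alone, matching the unchanged numbers above.
print(split_unaligned(0, 16384, 8192))
```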


>
> For the sandisk drive, it's funny how it is consistently faster
> doing random access than linear access. I don't think I've seen that
> before. It does seem to have some cache for linear access using
> smaller than 16 KB, and can probably combine them when it's only
> writing to a single segment.

Yes, that is pretty interesting. Smaller than 16K? Not smaller than
32K? I wonder what it is doing...

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-19 11:20                   ` Arnd Bergmann
@ 2011-02-20  5:56                     ` Andrei Warkentin
  -1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-20  5:56 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc

On Sat, Feb 19, 2011 at 5:20 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Saturday 19 February 2011 00:17:51 Andrei Warkentin wrote:
>> # echo 0 > /sys/block/mmcblk0/device/page_size
>> # ./flashbench -A -b 1024 /dev/block/mmcblk0p9
>> write align 8388608     pre 3.59ms      on 6.54ms       post 3.65ms     diff 2.92ms
>> write align 4194304     pre 4.13ms      on 7.37ms       post 4.27ms     diff 3.17ms
>> write align 2097152     pre 3.62ms      on 6.81ms       post 3.94ms     diff 3.03ms
>> write align 1048576     pre 3.62ms      on 6.53ms       post 3.55ms     diff 2.95ms
>> write align 524288      pre 3.62ms      on 6.51ms       post 3.63ms     diff 2.88ms
>> write align 262144      pre 3.62ms      on 6.51ms       post 3.63ms     diff 2.89ms
>> write align 131072      pre 3.62ms      on 6.5ms        post 3.63ms     diff 2.88ms
>> write align 65536       pre 3.61ms      on 6.49ms       post 3.62ms     diff 2.88ms
>> write align 32768       pre 3.61ms      on 6.49ms       post 3.61ms     diff 2.88ms
>> write align 16384       pre 3.68ms      on 107ms        post 3.51ms     diff 103ms
>> write align 8192        pre 3.74ms      on 121ms        post 3.91ms     diff 117ms
>> write align 4096        pre 3.88ms      on 3.87ms       post 3.87ms     diff -2937ns
>> write align 2048        pre 3.89ms      on 3.88ms       post 3.88ms     diff -8734ns
>> # fjnh84@fjnh84-desktop:~/src/n/src/flash$ adb -s 17006185428011d7 shell
>> # echo 8192 > /sys/block/mmcblk0/device/page_size
>> # cd data
>> # ./flashbench -A -b 1024 /dev/block/mmcblk0p9
>> write align 8388608     pre 3.33ms      on 6.8ms        post 3.65ms     diff 3.31ms
>> write align 4194304     pre 4.34ms      on 8.14ms       post 4.53ms     diff 3.71ms
>> write align 2097152     pre 3.64ms      on 7.31ms       post 4.09ms     diff 3.44ms
>> write align 1048576     pre 3.65ms      on 7.52ms       post 3.65ms     diff 3.87ms
>> write align 524288      pre 3.62ms      on 6.8ms        post 3.63ms     diff 3.17ms
>> write align 262144      pre 3.62ms      on 6.84ms       post 3.63ms     diff 3.22ms
>> write align 131072      pre 3.62ms      on 6.85ms       post 3.44ms     diff 3.32ms
>> write align 65536       pre 3.39ms      on 6.8ms        post 3.66ms     diff 3.28ms
>> write align 32768       pre 3.64ms      on 6.86ms       post 3.66ms     diff 3.21ms
>> write align 16384       pre 3.67ms      on 6.86ms       post 3.65ms     diff 3.2ms
>> write align 8192        pre 3.66ms      on 6.84ms       post 3.64ms     diff 3.19ms
>> write align 4096        pre 3.71ms      on 3.71ms       post 3.64ms     diff 38.6µs
>> write align 2048        pre 3.71ms      on 3.71ms       post 3.72ms     diff -656ns
>>
>> This was with the split unaligned accesses patch... Which I am
>> attaching for comments.
>
> I agree, this is very fascinating behavior. A 100 ms latency for a
> single 2KB access is definitely something we should try to avoid, and I
> wonder why the drive decides to do that. It must get into a state where
> it requires an extra garbage collection (you mentioned that earlier).
>
> The numbers you see here are taken over multiple runs. Do you see a lot
> of fluctuation when doing this with --count=1?
>

Yep. Quite a bit.

# ./flashbench -c 1 -A -b 1024 /dev/block/mmcblk0p9
write align 8388608	pre 4.52ms	on 7.58ms	post 3.93ms	diff 3.36ms
write align 4194304	pre 5.97ms	on 8.69ms	post 4.36ms	diff 3.53ms
write align 2097152	pre 3.57ms	on 7.96ms	post 4.6ms	diff 3.88ms
write align 1048576	pre 5.33ms	on 27.4ms	post 4.88ms	diff 22.3ms
write align 524288	pre 49.3ms	on 31.4ms	post 14.9ms	diff -679265
write align 262144	pre 39.7ms	on 38.3ms	post 5.27ms	diff 15.8ms
write align 131072	pre 33.8ms	on 45.4ms	post 5.26ms	diff 25.9ms
write align 65536	pre 34.4ms	on 40.9ms	post 3.3ms	diff 22.1ms
write align 32768	pre 30.2ms	on 44.8ms	post 5.13ms	diff 27.1ms
write align 16384	pre 44.5ms	on 5.05ms	post 33.3ms	diff -338542
write align 8192	pre 25.5ms	on 70.6ms	post 25.3ms	diff 45.2ms
write align 4096	pre 4.89ms	on 4.47ms	post 5.29ms	diff -623390
write align 2048	pre 4.88ms	on 4.89ms	post 5.2ms	diff -155781
# ./flashbench -c 1 -A -b 1024 /dev/block/mmcblk0p9
write align 8388608	pre 4.68ms	on 9.06ms	post 5.14ms	diff 4.15ms
write align 4194304	pre 4.37ms	on 7.49ms	post 4.59ms	diff 3.01ms
write align 2097152	pre 23.7ms	on 1.9ms	post 14.8ms	diff -173218
write align 1048576	pre 14.8ms	on 19.9ms	post 4.75ms	diff 10.2ms
write align 524288	pre 20.2ms	on 24.9ms	post 10.7ms	diff 9.46ms
write align 262144	pre 20.2ms	on 3.01ms	post 20.1ms	diff -171062
write align 131072	pre 25.9ms	on 24.9ms	post 9.85ms	diff 7.06ms
write align 65536	pre 15.5ms	on 30.3ms	post 2.95ms	diff 21.1ms
write align 32768	pre 27.3ms	on 19.1ms	post 5.86ms	diff 2.5ms
write align 16384	pre 25.4ms	on 55.9ms	post 12.7ms	diff 36.9ms
write align 8192	pre 4.8ms	on 102ms	post 9.47ms	diff 94.8ms
write align 4096	pre 4.92ms	on 5.16ms	post 4.98ms	diff 207µs
write align 2048	pre 4.64ms	on 4.92ms	post 5.45ms	diff -121860
# ./flashbench -c 1 -A -b 1024 /dev/block/mmcblk0p9
write align 8388608	pre 15.8ms	on 9.39ms	post 4.68ms	diff -854295
write align 4194304	pre 4.76ms	on 7.54ms	post 3.82ms	diff 3.24ms
write align 2097152	pre 19.9ms	on 9.73ms	post 4.44ms	diff -244517
write align 1048576	pre 14.5ms	on 19.1ms	post 5.21ms	diff 9.23ms
write align 524288	pre 24.9ms	on 29ms	post 5.89ms	diff 13.6ms
write align 262144	pre 24.9ms	on 2.41ms	post 20.8ms	diff -204328
write align 131072	pre 25.6ms	on 30ms	post 4.84ms	diff 14.8ms
write align 65536	pre 26.4ms	on 24.4ms	post 6.16ms	diff 8.12ms
write align 32768	pre 15ms	on 30.6ms	post 15.4ms	diff 15.4ms
write align 16384	pre 16.1ms	on 45.4ms	post 16.5ms	diff 29.1ms
write align 8192	pre 5.88ms	on 107ms	post 5.45ms	diff 101ms
write align 4096	pre 5.17ms	on 5.78ms	post 4.83ms	diff 778µs
write align 2048	pre 3.99ms	on 5.27ms	post 3.97ms	diff 1.29ms
# ./flashbench -c 1 -A -b 1024 /dev/block/mmcblk0p9
write align 8388608	pre 16.1ms	on 8.37ms	post 5.44ms	diff -241222
write align 4194304	pre 4.07ms	on 7.27ms	post 3.89ms	diff 3.29ms
write align 2097152	pre 24.2ms	on 18.5ms	post 5.63ms	diff 3.59ms
write align 1048576	pre 4.08ms	on 18.9ms	post 5.46ms	diff 14.1ms
write align 524288	pre 25.1ms	on 28ms	post 14.6ms	diff 8.13ms
write align 262144	pre 15.8ms	on 30ms	post 5.4ms	diff 19.4ms
write align 131072	pre 24.7ms	on 30.8ms	post 4.43ms	diff 16.2ms
write align 65536	pre 5ms	on 40.5ms	post 5.95ms	diff 35.1ms
write align 32768	pre 24.7ms	on 30.6ms	post 4.92ms	diff 15.8ms
write align 16384	pre 25.2ms	on 132ms	post 10.2ms	diff 114ms
write align 8192	pre 7.64ms	on 111ms	post 9.18ms	diff 102ms
write align 4096	pre 5.11ms	on 3.92ms	post 5.4ms	diff -134159
write align 2048	pre 3.92ms	on 4.41ms	post 4.51ms	diff 196µs

> Also, does the same happen with other blocksizes, e.g. 4096 or 8192, passed
> to flashbench?
>

# echo 0 > /sys/block/mmcblk0/device/page_size
# ./flashbench -A -b 1024 /dev/block/mmcblk0p9
write align 8388608	pre 3.63ms	on 6.51ms	post 3.66ms	diff 2.86ms
write align 4194304	pre 3.61ms	on 6.51ms	post 3.62ms	diff 2.89ms
write align 2097152	pre 3.61ms	on 6.49ms	post 3.62ms	diff 2.87ms
write align 1048576	pre 3.64ms	on 6.55ms	post 3.62ms	diff 2.92ms
write align 524288	pre 3.64ms	on 6.57ms	post 3.66ms	diff 2.92ms
write align 262144	pre 3.44ms	on 6.45ms	post 3.66ms	diff 2.9ms
write align 131072	pre 3.64ms	on 6.56ms	post 3.67ms	diff 2.91ms
write align 65536	pre 3.33ms	on 6.57ms	post 3.65ms	diff 3.08ms
write align 32768	pre 3.68ms	on 6.6ms	post 3.7ms	diff 2.91ms
write align 16384	pre 3.64ms	on 97.6ms	post 3.26ms	diff 94.2ms
write align 8192	pre 3.49ms	on 115ms	post 3.62ms	diff 112ms
write align 4096	pre 3.91ms	on 3.91ms	post 3.9ms	diff 360ns
write align 2048	pre 3.92ms	on 3.92ms	post 3.92ms	diff -1374ns
# ./flashbench -A -b 2048 /dev/block/mmcblk0p9
write align 8388608	pre 3.76ms	on 7.23ms	post 4.18ms	diff 3.27ms
write align 4194304	pre 3.65ms	on 6.56ms	post 3.66ms	diff 2.9ms
write align 2097152	pre 3.9ms	on 6.99ms	post 3.67ms	diff 3.2ms
write align 1048576	pre 4.03ms	on 7.09ms	post 4.07ms	diff 3.04ms
write align 524288	pre 4.04ms	on 7.26ms	post 4.16ms	diff 3.16ms
write align 262144	pre 3.8ms	on 7.26ms	post 4.06ms	diff 3.33ms
write align 131072	pre 4.05ms	on 7.25ms	post 4.18ms	diff 3.14ms
write align 65536	pre 4.02ms	on 7.22ms	post 4.14ms	diff 3.14ms
write align 32768	pre 4ms	on 7.07ms	post 3.95ms	diff 3.1ms
write align 16384	pre 3.66ms	on 106ms	post 3.4ms	diff 102ms
write align 8192	pre 3.56ms	on 106ms	post 3.36ms	diff 103ms
write align 4096	pre 3.61ms	on 4.1ms	post 4.35ms	diff 117µs
# ./flashbench -A -b 4096 /dev/block/mmcblk0p9
write align 8388608	pre 3.64ms	on 6.95ms	post 3.96ms	diff 3.15ms
write align 4194304	pre 3.65ms	on 6.56ms	post 3.66ms	diff 2.9ms
write align 2097152	pre 3.89ms	on 6.79ms	post 3.66ms	diff 3.01ms
write align 1048576	pre 3.88ms	on 6.88ms	post 3.95ms	diff 2.97ms
write align 524288	pre 3.72ms	on 6.97ms	post 3.93ms	diff 3.15ms
write align 262144	pre 3.89ms	on 6.93ms	post 3.95ms	diff 3.01ms
write align 131072	pre 3.9ms	on 6.98ms	post 3.96ms	diff 3.05ms
write align 65536	pre 3.89ms	on 6.97ms	post 3.96ms	diff 3.04ms
write align 32768	pre 3.89ms	on 6.97ms	post 3.96ms	diff 3.04ms
write align 16384	pre 3.74ms	on 114ms	post 4.05ms	diff 110ms
write align 8192	pre 4.25ms	on 115ms	post 4.8ms	diff 110ms
# ./flashbench -A -b 8192 /dev/block/mmcblk0p9
write align 8388608	pre 3.84ms	on 7.53ms	post 4.29ms	diff 3.47ms
write align 4194304	pre 3.58ms	on 6.54ms	post 3.6ms	diff 2.95ms
write align 2097152	pre 4.12ms	on 7.27ms	post 3.87ms	diff 3.28ms
write align 1048576	pre 4.14ms	on 7.49ms	post 4.24ms	diff 3.3ms
write align 524288	pre 4.12ms	on 7.46ms	post 4.23ms	diff 3.29ms
write align 262144	pre 4.14ms	on 7.45ms	post 3.97ms	diff 3.4ms
write align 131072	pre 3.89ms	on 7.43ms	post 4.24ms	diff 3.37ms
write align 65536	pre 4.11ms	on 7.46ms	post 4.24ms	diff 3.29ms
write align 32768	pre 4.15ms	on 7.45ms	post 4.25ms	diff 3.25ms
write align 16384	pre 4.24ms	on 96.1ms	post 3.83ms	diff 92.1ms

I thought the following was interesting. I ran it to see the large
latency go away, since it would end up being a 16K write straddling an 8K
boundary, but I don't understand the pre and post results at all.

# ./flashbench -A -b 16384  /dev/block/mmcblk0p9
write align 8388608	pre 121ms	on 7.76ms	post 116ms	diff -110845
write align 4194304	pre 129ms	on 7.57ms	post 115ms	diff -114863
write align 2097152	pre 121ms	on 7.78ms	post 123ms	diff -114318
write align 1048576	pre 131ms	on 7.74ms	post 106ms	diff -110856
write align 524288	pre 131ms	on 7.58ms	post 116ms	diff -115926
write align 262144	pre 131ms	on 7.55ms	post 115ms	diff -115591
write align 131072	pre 131ms	on 7.54ms	post 116ms	diff -115617
write align 65536	pre 131ms	on 7.54ms	post 115ms	diff -115579
write align 32768	pre 125ms	on 6.89ms	post 116ms	diff -113408

^ permalink raw reply	[flat|nested] 117+ messages in thread

* MMC quirks relating to performance/lifetime.
@ 2011-02-20  5:56                     ` Andrei Warkentin
  0 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-20  5:56 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Feb 19, 2011 at 5:20 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Saturday 19 February 2011 00:17:51 Andrei Warkentin wrote:
>> # echo 0 > /sys/block/mmcblk0/device/page_size
>> # ./flashbench -A -b 1024 /dev/block/mmcblk0p9
>> write align 8388608     pre 3.59ms      on 6.54ms       post 3.65ms     diff 2.92ms
>> write align 4194304     pre 4.13ms      on 7.37ms       post 4.27ms     diff 3.17ms
>> write align 2097152     pre 3.62ms      on 6.81ms       post 3.94ms     diff 3.03ms
>> write align 1048576     pre 3.62ms      on 6.53ms       post 3.55ms     diff 2.95ms
>> write align 524288      pre 3.62ms      on 6.51ms       post 3.63ms     diff 2.88ms
>> write align 262144      pre 3.62ms      on 6.51ms       post 3.63ms     diff 2.89ms
>> write align 131072      pre 3.62ms      on 6.5ms        post 3.63ms     diff 2.88ms
>> write align 65536       pre 3.61ms      on 6.49ms       post 3.62ms     diff 2.88ms
>> write align 32768       pre 3.61ms      on 6.49ms       post 3.61ms     diff 2.88ms
>> write align 16384       pre 3.68ms      on 107ms        post 3.51ms     diff 103ms
>> write align 8192        pre 3.74ms      on 121ms        post 3.91ms     diff 117ms
>> write align 4096        pre 3.88ms      on 3.87ms       post 3.87ms     diff -2937ns
>> write align 2048        pre 3.89ms      on 3.88ms       post 3.88ms     diff -8734ns
>> # fjnh84@fjnh84-desktop:~/src/n/src/flash$ adb -s 17006185428011d7 shell
>> # echo 8192 > /sys/block/mmcblk0/device/page_size
>> # cd data
>> # ./flashbench -A -b 1024 /dev/block/mmcblk0p9
>> write align 8388608     pre 3.33ms      on 6.8ms        post 3.65ms     diff 3.31ms
>> write align 4194304     pre 4.34ms      on 8.14ms       post 4.53ms     diff 3.71ms
>> write align 2097152     pre 3.64ms      on 7.31ms       post 4.09ms     diff 3.44ms
>> write align 1048576     pre 3.65ms      on 7.52ms       post 3.65ms     diff 3.87ms
>> write align 524288      pre 3.62ms      on 6.8ms        post 3.63ms     diff 3.17ms
>> write align 262144      pre 3.62ms      on 6.84ms       post 3.63ms     diff 3.22ms
>> write align 131072      pre 3.62ms      on 6.85ms       post 3.44ms     diff 3.32ms
>> write align 65536       pre 3.39ms      on 6.8ms        post 3.66ms     diff 3.28ms
>> write align 32768       pre 3.64ms      on 6.86ms       post 3.66ms     diff 3.21ms
>> write align 16384       pre 3.67ms      on 6.86ms       post 3.65ms     diff 3.2ms
>> write align 8192        pre 3.66ms      on 6.84ms       post 3.64ms     diff 3.19ms
>> write align 4096        pre 3.71ms      on 3.71ms       post 3.64ms     diff 38.6µs
>> write align 2048        pre 3.71ms      on 3.71ms       post 3.72ms     diff -656ns
>>
>> This was with the split unaligned accesses patch... Which I am
>> attaching for comments.
>
> I agree, this is very fascinating behavior. A 100 ms latency for a
> single 2KB access is definitely something we should try to avoid, and I
> wonder why the drive decides to do that. It must get into a state where
> it requires an extra garbage collection (you mentioned that earlier).
>
> The numbers you see here are taken over multiple runs. Do you see a lot
> of fluctuation when doing this with --count=1?
>

Yep. Quite a bit.

# ./flashbench -c 1 -A -b 1024 /dev/block/mmcblk0p9
write align 8388608	pre 4.52ms	on 7.58ms	post 3.93ms	diff 3.36ms
write align 4194304	pre 5.97ms	on 8.69ms	post 4.36ms	diff 3.53ms
write align 2097152	pre 3.57ms	on 7.96ms	post 4.6ms	diff 3.88ms
write align 1048576	pre 5.33ms	on 27.4ms	post 4.88ms	diff 22.3ms
write align 524288	pre 49.3ms	on 31.4ms	post 14.9ms	diff -679265
write align 262144	pre 39.7ms	on 38.3ms	post 5.27ms	diff 15.8ms
write align 131072	pre 33.8ms	on 45.4ms	post 5.26ms	diff 25.9ms
write align 65536	pre 34.4ms	on 40.9ms	post 3.3ms	diff 22.1ms
write align 32768	pre 30.2ms	on 44.8ms	post 5.13ms	diff 27.1ms
write align 16384	pre 44.5ms	on 5.05ms	post 33.3ms	diff -338542
write align 8192	pre 25.5ms	on 70.6ms	post 25.3ms	diff 45.2ms
write align 4096	pre 4.89ms	on 4.47ms	post 5.29ms	diff -623390
write align 2048	pre 4.88ms	on 4.89ms	post 5.2ms	diff -155781
# ./flashbench -c 1 -A -b 1024 /dev/block/mmcblk0p9
write align 8388608	pre 4.68ms	on 9.06ms	post 5.14ms	diff 4.15ms
write align 4194304	pre 4.37ms	on 7.49ms	post 4.59ms	diff 3.01ms
write align 2097152	pre 23.7ms	on 1.9ms	post 14.8ms	diff -173218
write align 1048576	pre 14.8ms	on 19.9ms	post 4.75ms	diff 10.2ms
write align 524288	pre 20.2ms	on 24.9ms	post 10.7ms	diff 9.46ms
write align 262144	pre 20.2ms	on 3.01ms	post 20.1ms	diff -171062
write align 131072	pre 25.9ms	on 24.9ms	post 9.85ms	diff 7.06ms
write align 65536	pre 15.5ms	on 30.3ms	post 2.95ms	diff 21.1ms
write align 32768	pre 27.3ms	on 19.1ms	post 5.86ms	diff 2.5ms
write align 16384	pre 25.4ms	on 55.9ms	post 12.7ms	diff 36.9ms
write align 8192	pre 4.8ms	on 102ms	post 9.47ms	diff 94.8ms
write align 4096	pre 4.92ms	on 5.16ms	post 4.98ms	diff 207µs
write align 2048	pre 4.64ms	on 4.92ms	post 5.45ms	diff -121860
# ./flashbench -c 1 -A -b 1024 /dev/block/mmcblk0p9
write align 8388608	pre 15.8ms	on 9.39ms	post 4.68ms	diff -854295
write align 4194304	pre 4.76ms	on 7.54ms	post 3.82ms	diff 3.24ms
write align 2097152	pre 19.9ms	on 9.73ms	post 4.44ms	diff -244517
write align 1048576	pre 14.5ms	on 19.1ms	post 5.21ms	diff 9.23ms
write align 524288	pre 24.9ms	on 29ms	post 5.89ms	diff 13.6ms
write align 262144	pre 24.9ms	on 2.41ms	post 20.8ms	diff -204328
write align 131072	pre 25.6ms	on 30ms	post 4.84ms	diff 14.8ms
write align 65536	pre 26.4ms	on 24.4ms	post 6.16ms	diff 8.12ms
write align 32768	pre 15ms	on 30.6ms	post 15.4ms	diff 15.4ms
write align 16384	pre 16.1ms	on 45.4ms	post 16.5ms	diff 29.1ms
write align 8192	pre 5.88ms	on 107ms	post 5.45ms	diff 101ms
write align 4096	pre 5.17ms	on 5.78ms	post 4.83ms	diff 778µs
write align 2048	pre 3.99ms	on 5.27ms	post 3.97ms	diff 1.29ms
# ./flashbench -c 1 -A -b 1024 /dev/block/mmcblk0p9
write align 8388608	pre 16.1ms	on 8.37ms	post 5.44ms	diff -241222
write align 4194304	pre 4.07ms	on 7.27ms	post 3.89ms	diff 3.29ms
write align 2097152	pre 24.2ms	on 18.5ms	post 5.63ms	diff 3.59ms
write align 1048576	pre 4.08ms	on 18.9ms	post 5.46ms	diff 14.1ms
write align 524288	pre 25.1ms	on 28ms	post 14.6ms	diff 8.13ms
write align 262144	pre 15.8ms	on 30ms	post 5.4ms	diff 19.4ms
write align 131072	pre 24.7ms	on 30.8ms	post 4.43ms	diff 16.2ms
write align 65536	pre 5ms	on 40.5ms	post 5.95ms	diff 35.1ms
write align 32768	pre 24.7ms	on 30.6ms	post 4.92ms	diff 15.8ms
write align 16384	pre 25.2ms	on 132ms	post 10.2ms	diff 114ms
write align 8192	pre 7.64ms	on 111ms	post 9.18ms	diff 102ms
write align 4096	pre 5.11ms	on 3.92ms	post 5.4ms	diff -134159
write align 2048	pre 3.92ms	on 4.41ms	post 4.51ms	diff 196µs

> Also, does the same happen with other blocksizes, e.g. 4096 or 8192, passed
> to flashbench?
>

# echo 0 > /sys/block/mmcblk0/device/page_size
# ./flashbench -A -b 1024 /dev/block/mmcblk0p9
write align 8388608	pre 3.63ms	on 6.51ms	post 3.66ms	diff 2.86ms
write align 4194304	pre 3.61ms	on 6.51ms	post 3.62ms	diff 2.89ms
write align 2097152	pre 3.61ms	on 6.49ms	post 3.62ms	diff 2.87ms
write align 1048576	pre 3.64ms	on 6.55ms	post 3.62ms	diff 2.92ms
write align 524288	pre 3.64ms	on 6.57ms	post 3.66ms	diff 2.92ms
write align 262144	pre 3.44ms	on 6.45ms	post 3.66ms	diff 2.9ms
write align 131072	pre 3.64ms	on 6.56ms	post 3.67ms	diff 2.91ms
write align 65536	pre 3.33ms	on 6.57ms	post 3.65ms	diff 3.08ms
write align 32768	pre 3.68ms	on 6.6ms	post 3.7ms	diff 2.91ms
write align 16384	pre 3.64ms	on 97.6ms	post 3.26ms	diff 94.2ms
write align 8192	pre 3.49ms	on 115ms	post 3.62ms	diff 112ms
write align 4096	pre 3.91ms	on 3.91ms	post 3.9ms	diff 360ns
write align 2048	pre 3.92ms	on 3.92ms	post 3.92ms	diff -1374ns
# ./flashbench -A -b 2048 /dev/block/mmcblk0p9
write align 8388608	pre 3.76ms	on 7.23ms	post 4.18ms	diff 3.27ms
write align 4194304	pre 3.65ms	on 6.56ms	post 3.66ms	diff 2.9ms
write align 2097152	pre 3.9ms	on 6.99ms	post 3.67ms	diff 3.2ms
write align 1048576	pre 4.03ms	on 7.09ms	post 4.07ms	diff 3.04ms
write align 524288	pre 4.04ms	on 7.26ms	post 4.16ms	diff 3.16ms
write align 262144	pre 3.8ms	on 7.26ms	post 4.06ms	diff 3.33ms
write align 131072	pre 4.05ms	on 7.25ms	post 4.18ms	diff 3.14ms
write align 65536	pre 4.02ms	on 7.22ms	post 4.14ms	diff 3.14ms
write align 32768	pre 4ms	on 7.07ms	post 3.95ms	diff 3.1ms
write align 16384	pre 3.66ms	on 106ms	post 3.4ms	diff 102ms
write align 8192	pre 3.56ms	on 106ms	post 3.36ms	diff 103ms
write align 4096	pre 3.61ms	on 4.1ms	post 4.35ms	diff 117µs
# ./flashbench -A -b 4096 /dev/block/mmcblk0p9
write align 8388608	pre 3.64ms	on 6.95ms	post 3.96ms	diff 3.15ms
write align 4194304	pre 3.65ms	on 6.56ms	post 3.66ms	diff 2.9ms
write align 2097152	pre 3.89ms	on 6.79ms	post 3.66ms	diff 3.01ms
write align 1048576	pre 3.88ms	on 6.88ms	post 3.95ms	diff 2.97ms
write align 524288	pre 3.72ms	on 6.97ms	post 3.93ms	diff 3.15ms
write align 262144	pre 3.89ms	on 6.93ms	post 3.95ms	diff 3.01ms
write align 131072	pre 3.9ms	on 6.98ms	post 3.96ms	diff 3.05ms
write align 65536	pre 3.89ms	on 6.97ms	post 3.96ms	diff 3.04ms
write align 32768	pre 3.89ms	on 6.97ms	post 3.96ms	diff 3.04ms
write align 16384	pre 3.74ms	on 114ms	post 4.05ms	diff 110ms
write align 8192	pre 4.25ms	on 115ms	post 4.8ms	diff 110ms
# ./flashbench -A -b 8192 /dev/block/mmcblk0p9
write align 8388608	pre 3.84ms	on 7.53ms	post 4.29ms	diff 3.47ms
write align 4194304	pre 3.58ms	on 6.54ms	post 3.6ms	diff 2.95ms
write align 2097152	pre 4.12ms	on 7.27ms	post 3.87ms	diff 3.28ms
write align 1048576	pre 4.14ms	on 7.49ms	post 4.24ms	diff 3.3ms
write align 524288	pre 4.12ms	on 7.46ms	post 4.23ms	diff 3.29ms
write align 262144	pre 4.14ms	on 7.45ms	post 3.97ms	diff 3.4ms
write align 131072	pre 3.89ms	on 7.43ms	post 4.24ms	diff 3.37ms
write align 65536	pre 4.11ms	on 7.46ms	post 4.24ms	diff 3.29ms
write align 32768	pre 4.15ms	on 7.45ms	post 4.25ms	diff 3.25ms
write align 16384	pre 4.24ms	on 96.1ms	post 3.83ms	diff 92.1ms

I thought the following was interesting. I did it to see the big
time go away, since it would end up being a 16K write straddling an 8K
boundary, but I don't understand the pre and post results at all.

# ./flashbench -A -b 16384  /dev/block/mmcblk0p9
write align 8388608	pre 121ms	on 7.76ms	post 116ms	diff -110845
write align 4194304	pre 129ms	on 7.57ms	post 115ms	diff -114863
write align 2097152	pre 121ms	on 7.78ms	post 123ms	diff -114318
write align 1048576	pre 131ms	on 7.74ms	post 106ms	diff -110856
write align 524288	pre 131ms	on 7.58ms	post 116ms	diff -115926
write align 262144	pre 131ms	on 7.55ms	post 115ms	diff -115591
write align 131072	pre 131ms	on 7.54ms	post 116ms	diff -115617
write align 65536	pre 131ms	on 7.54ms	post 115ms	diff -115579
write align 32768	pre 125ms	on 6.89ms	post 116ms	diff -113408
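
For reference, the diff column in the averaged runs above looks
consistent with "on" minus the mean of "pre" and "post". Here is a
small Python check of that. The rows are copied from the output above;
the relation itself is inferred from the numbers, not taken from the
flashbench source, so treat it as an assumption.

```python
# Rows copied from the averaged -A runs above: (pre, on, post, diff), all in ms.
ROWS = [
    (3.63, 6.51, 3.66, 2.86),   # -b 1024, align 8388608
    (3.64, 97.6, 3.26, 94.2),   # -b 1024, align 16384
    (3.65, 6.56, 3.66, 2.90),   # -b 2048, align 4194304
    (4.24, 96.1, 3.83, 92.1),   # -b 8192, align 16384
]


def expected_diff(pre, on, post):
    """Inferred relation: time on the boundary minus the mean of its neighbours."""
    return on - (pre + post) / 2.0


def max_error(rows):
    """Largest gap (in ms) between the reported and the recomputed diff."""
    return max(abs(expected_diff(pre, on, post) - diff)
               for pre, on, post, diff in rows)


if __name__ == "__main__":
    for pre, on, post, diff in ROWS:
        print("reported %6.2f ms   recomputed %6.2f ms"
              % (diff, expected_diff(pre, on, post)))
```

If the relation holds, the large negative diffs in the 16384 run are
just what it gives when pre and post are slow and on is fast:
7.76 - (121 + 116) / 2 is about -110.7.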

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-17 15:47                       ` Arnd Bergmann
@ 2011-02-20 11:27                         ` Andrei Warkentin
  -1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-20 11:27 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc

On Thu, Feb 17, 2011 at 9:47 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> I think I'd try to reduce the number of sysfs files needed for this.
> What are the values you would typically set here?
>
> My feeling is that separating unaligned page writes from full pages
> or multiples of pages could always be beneficial for all cards, or at
> least harmless, but that will require more measurements.
> Whether to do the reliable write or not could be a simple flag
> if the numbers are the same.

I thought about this some more, and I realized it would be ugly if
everybody added enable_workaround_sec_start/enable_workaround_sec_end
for every novel idea of working around some issue with
performance/reliability on mmc/sd cards.

What about letting the user/embedder create policies for how certain
accesses are done? That way you give runtime-accessible blocks for
tuning the mmc block layer while having one interface to manipulate
(and combine) multiple workarounds, all the while catching conflicts
and without forcing a specific policy in code.

Essentially under /sys/block/mmcblk0/device you have an attribute
called "policies". Example:

# echo mypol0 > /sys/block/mmcblk0/device/policies
# ls /sys/block/mmcblk0/device/mypol0
debug
delete
start_block
end_block
access_size_low
access_size_high
write_policy
erase_policy
read_policy
# cat /sys/block/mmcblk0/device/mypol0/write_policy
Current: none
0x00000001: Split unaligned writes across page_size
0x00000002: Split writes into page_size chunks and write using reliable writes
0x00000004: Use reliable writes for WRITE_META blocks.
# cat /sys/block/mmcblk0/device/mypol0/erase_policy
Current: none
0x00000001: Use secure erase.
# echo 1 > delete
# Policy is deleted.

The policies are all stored in an rb-tree. The first order of business
inside mmc_blk_issue_rw_rq/mmc_blk_issue_* is to fetch an existing
policy given the access type and block start/end (which together tell
where the access is going and how large it is). Later, it's
that policy information which controls how the request is translated
into MMC commands. I'm almost done with a prototype.
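
To make the lookup concrete, here is a minimal user-space sketch in
Python, not the kernel prototype. It swaps the rb-tree for a sorted
list searched with bisect, handles writes only, and assumes that policy
block ranges never overlap and that end_block is inclusive. The class
and method names (Policy, PolicyTable, add, lookup) are hypothetical;
only the field names and flag values come from the sysfs listing above.

```python
import bisect
from dataclasses import dataclass

# Flag values as listed in the write_policy attribute above.
WRITE_SPLIT_UNALIGNED = 0x00000001
WRITE_SPLIT_RELIABLE = 0x00000002
WRITE_META_RELIABLE = 0x00000004


@dataclass(frozen=True)
class Policy:
    name: str
    start_block: int        # first block covered (inclusive)
    end_block: int          # last block covered (inclusive, an assumption)
    access_size_low: int    # smallest access, in blocks, the policy applies to
    access_size_high: int   # largest access, in blocks, the policy applies to
    write_policy: int = 0


class PolicyTable:
    """Policies ordered by start_block; block ranges must not overlap."""

    def __init__(self):
        self._starts = []
        self._policies = []

    def add(self, policy):
        # Find the insertion point and reject ranges that touch a neighbour.
        i = bisect.bisect_left(self._starts, policy.start_block)
        prev_overlaps = (i > 0
                         and self._policies[i - 1].end_block >= policy.start_block)
        next_overlaps = (i < len(self._policies)
                         and self._policies[i].start_block <= policy.end_block)
        if prev_overlaps or next_overlaps:
            raise ValueError("policy %s conflicts with an existing range"
                             % policy.name)
        self._starts.insert(i, policy.start_block)
        self._policies.insert(i, policy)

    def lookup(self, start, nblocks):
        """Return the policy covering the whole access, or None."""
        i = bisect.bisect_right(self._starts, start) - 1
        if i < 0:
            return None
        policy = self._policies[i]
        if start + nblocks - 1 > policy.end_block:
            return None
        if not policy.access_size_low <= nblocks <= policy.access_size_high:
            return None
        return policy
```

A request that finds no policy would simply take the default path,
while one that finds, say, WRITE_SPLIT_UNALIGNED set would be split
before being turned into MMC commands.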

I noticed that all sysfs attributes are managed by code under
core/mmc.c and core/sd.c, duplicating where necessary. I think some of
the new block-related settings like page_size (or policies) are
generic enough that they should live in the card/block code. How about
putting all future block-related sysfs attributes into block-sysfs.c?

Thanks,
A

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-20 11:27                         ` Andrei Warkentin
@ 2011-02-20 14:39                           ` Arnd Bergmann
  -1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-20 14:39 UTC (permalink / raw)
  To: linux-arm-kernel, linux-fsdevel
  Cc: Andrei Warkentin, Linus Walleij, linux-mmc

[adding linux-fsdevel to Cc, see http://lwn.net/Articles/428941/ and
http://comments.gmane.org/gmane.linux.ports.arm.kernel/105607 for more
on this discussion.]

On Sunday 20 February 2011 12:27:39 Andrei Warkentin wrote:
> On Thu, Feb 17, 2011 at 9:47 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> > I think I'd try to reduce the number of sysfs files needed for this.
> > What are the values you would typically set here?
> >
> > My feeling is that separating unaligned page writes from full pages
> > or multiples of pages could always be beneficial for all cards, or at
> > least harmless, but that will require more measurements.
> > Whether to do the reliable write or not could be a simple flag
> > if the numbers are the same.
> 
> I thought about this some more, and I realized it would be ugly if
> everybody added enable_workaround_sec_start/enable_workaround_sec_end
> for every novel idea of working around some issue with
> performance/reliability on mmc/sd cards.
> 
> What about letting the user/embedder create policies for how certain
> accesses are done? That way you give runtime-accessible blocks for
> tuning the mmc block layer while having one interface to manipulate
> (and combine) multiple workarounds, all the while catching conflicts
> and without forcing a specific policy in code.
> 
> Essentially under /sys/block/mmcblk0/device you have an attribute
> called "policies". Example:
> 
> # echo mypol0 > /sys/block/mmcblk0/device/policies
> # ls /sys/block/mmcblk0/device/mypol0
> debug
> delete
> start_block
> end_block
> access_size_low
> access_size_high
> write_policy
> erase_policy
> read_policy
> # cat /sys/block/mmcblk0/device/mypol0/write_policy
> Current: none
> 0x00000001: Split unaligned writes across page_size
> 0x00000002: Split writes into page_size chunks and write using reliable writes
> 0x00000004: Use reliable writes for WRITE_META blocks.
> # cat /sys/block/mmcblk0/device/mypol0/erase_policy
> Current: none
> 0x00000001: Use secure erase.
> # echo 1 > delete
> # Policy is deleted.
> 
> The policies are all stored in an rb-tree. The first order of business
> inside mmc_blk_issue_rw_rq/mmc_blk_issue_* is to fetch an existing
> policy given the access type and block start/end (which together tell
> where the access is going and how large it is). Later, it's
> that policy information which controls how the request is translated
> into MMC commands. I'm almost done with a prototype.

I think it's good to discuss all the options, but my feeling is that
we should not add so much complexity at the interface level, because
we will never be able to change all that again. In general, sysfs
files should contain simple values that are self-descriptive (a simple
number or one word), and should have no side-effects (unlike the delete
or the policies attributes you describe).

The behavior of the Toshiba chip is peculiar enough to justify having
some workarounds for it, including run-time selected ones, but I'm
looking for something much simpler. I'd certainly be interested in
the patch you come up with and any performance results, but I don't
think it can be merged like that.

In the end, Chris will have to make the decision on mmc patches of
course -- I'm just trying to contribute experience from other subsystems.

What I see as a more promising approach is to add the tunables
to attributes of the CFQ I/O scheduler once we know what we want.
This will allow doing the same optimizations to non-MMC devices such
as USB sticks or CF/IDE cards without reimplementing it in other
subsystems, and give more control over the individual requests than
the MMC layer has.

E.g. the I/O scheduler can also make sure that we always submit all
blocks from the start of one erase unit (e.g. 4 MB) to the end, but
not try to merge requests across erase unit boundaries. It can
also try to group the requests in aligned power-of-two sized chunks
rather than merging as many sectors as possible up to the maximum
request size, ignoring the alignment.
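
As a rough illustration of that grouping (a user-space sketch in
Python, not CFQ code), the function below cuts one request into chunks
that are each a power of two in size, aligned to their own size, and
never larger than an erase unit, so no chunk can cross an erase unit
boundary. The name split_aligned is made up for this example, units are
512-byte sectors, and the erase unit is assumed to be a power of two.

```python
def split_aligned(start, length, erase_unit):
    """Split [start, start + length) into naturally aligned power-of-two chunks.

    All values are in sectors. Returns a list of (chunk_start, chunk_size).
    """
    if erase_unit <= 0 or erase_unit & (erase_unit - 1):
        raise ValueError("erase_unit must be a power of two")
    chunks = []
    while length > 0:
        # Largest power of two that divides the current start sector;
        # sector 0 is aligned to everything, so cap it at the erase unit.
        align = (start & -start) if start else erase_unit
        size = min(align, erase_unit)
        # Shrink until the chunk fits in what is left of the request.
        while size > length:
            size >>= 1
        chunks.append((start, size))
        start += size
        length -= size
    return chunks


if __name__ == "__main__":
    # 4 MB erase unit = 8192 sectors of 512 bytes.
    print(split_aligned(8190, 4, 8192))
    print(split_aligned(6, 20, 8192))
```

Because every chunk is aligned to its own size and the erase unit is a
multiple of every chunk size, a request that straddles an erase unit
boundary is always cut exactly at that boundary.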

	Arnd

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-20  4:39                   ` Andrei Warkentin
@ 2011-02-20 15:03                     ` Arnd Bergmann
  -1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-20 15:03 UTC (permalink / raw)
  To: linux-arm-kernel; +Cc: Andrei Warkentin, Linus Walleij, linux-mmc

On Sunday 20 February 2011 05:39:06 Andrei Warkentin wrote:
> Actually it would be a good idea to also bail/warn if you do the au
> test with more open au's than the size of the passed device allows,
> since it'll just wrap around and skew the results.

Yes, that's a bug. I never noticed because all the devices I tested
have much more space than the test can possibly exercise. I'll
fix it tomorrow.

> > Right, you should try larger values for --open-au-nr here. It's at
> > least a good sign that the drive can do random access inside a segment
> > and that it can have at least 4 segments open. This is much better
> > than I expected from your descriptions at first.
> 
> Actually the Toshiba one seems to have 7 AUs if I interpret this correctly.
> ^C
> # ./flashbench -O -0 6  -b 512 /dev/block/mmcblk0p9
> 4MiB    5.91M/s
> 2MiB    8.84M/s
> 1MiB    10.8M/s
> 512KiB  13M/s
> 256KiB  13.6M/s
> 
> ^C
> # ./flashbench -O -0 7  -b 512 /dev/block/mmcblk0p9
> 4MiB    6.32M/s
> 2MiB    8.63M/s
> 1MiB    10.5M/s
> 512KiB  13.2M/s
> 256KiB  13M/s
> 128KiB  12.3M/s
> ^C
> # ./flashbench -O -0 8  -b 512 /dev/block/mmcblk0p9
> 4MiB    6.65M/s
> 2MiB    7.02M/s
> 1MiB    6.36M/s
> 512KiB  3.17M/s
> 256KiB  1.53M/s

Yes, very good. I've never seen 7, but I've seen all other numbers
between 1 and 8 ;-).

> The Sandisk one has 20 AUs.
> 
> # ./flashbench -O -0 20  -b 512 /dev/block/mmcblk0p9
> 4MiB    11.3M/s
> 2MiB    12.8M/s
> 1MiB    9.87M/s
> 512KiB  9.97M/s
> 256KiB  9.13M/s
> 128KiB  8.05M/s
> ^C
> # ./flashbench -O -0 50  -b 512 /dev/block/mmcblk0p9
> 4MiB    7.19M/s
> ^C
> # ./flashbench -O -0 2  -b 512 /dev/block/mmcblk0p9
> ^C
> # ./flashbench -O -0 22  -b 512 /dev/block/mmcblk0p9
> 4MiB    11.6M/s
> 2MiB    12.3M/s
> 1MiB    5.13M/s
> 512KiB  2.57M/s
> 256KiB  1.59M/s
> 128KiB  1.16M/s
> 64KiB   776K/s
> ^C
> # ./flashbench -O -0 21  -b 512 /dev/block/mmcblk0p9
> 4MiB    11.2M/s
> 2MiB    12.4M/s
> 1MiB    4.65M/s
> 512KiB  1.95M/s
> 256KiB  955K/s

20 is a lot, more than any other device I've tested, but that's
good. Sandisk keeps impressing me ;-)

Are you sure you have the correct allocation unit size for
this device and are not hitting the wrap-around bug
you mention above?

If it indeed uses 4 MB allocation units, flashbench will show
only 10 open segments when run with --erasesize=$[8*1024*1024],
but 20 open segments when run with --erasesize=$[2*1024*1024].
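
The arithmetic behind this can be sketched in a few lines of Python.
This is a simplified model, not flashbench code: it assumes the card
keeps a fixed number of allocation units open, that test segments are
aligned to the erasesize passed on the command line, and that segments
no larger than an allocation unit each land in a different unit. The
function name is made up for the example.

```python
MIB = 1024 * 1024


def observed_open_segments(open_aus, au_size, test_erasesize):
    """Number of test segments that can stay open under the model above."""
    if test_erasesize <= au_size:
        # Each test segment fits inside a single allocation unit.
        return open_aus
    # A larger test segment spans several allocation units at once.
    aus_per_segment = -(-test_erasesize // au_size)  # ceiling division
    return open_aus // aus_per_segment


if __name__ == "__main__":
    for erasesize in (2 * MIB, 4 * MIB, 8 * MIB):
        print(erasesize // MIB, "MiB ->",
              observed_open_segments(20, 4 * MIB, erasesize))
```

In the same model, a card with 8 MB allocation units would still show
20 open segments at --erasesize=$[8*1024*1024], which is what makes the
8 MB run a useful way to tell the two cases apart.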

From your flashbench -a run, I would guess that it uses
8 MB allocation units, although the data is not 100% conclusive
there.

> > However, the drop from 32 KB to 16 KB in performance is horrifying
> > for the Toshiba drive, it's clear that this one does not like
> > to be accessed smaller than 32 KB at a time, an obvious optimization
> > for FAT32 with 32 KB clusters. How does this change with your
> > kernel patches?
> 
> Since the only performance-increasing patch here would be just the one
> that splits unaligned accesses, I wouldn't expect any improvements for
> page-aligned accesses < 32KB. As you can see here...

Ok.

> > For the sandisk drive, it's funny how it is consistently faster
> > doing random access than linear access. I don't think I've seen that
> > before. It does seem to have some cache for linear access using
> > smaller than 16 KB, and can probably combine them when it's only
> > writing to a single segment.
> 
> Yes, that is pretty interesting. Smaller than 16K? Not smaller than
> 32K? I wonder what it is doing...

My interpretation is that it uses 16 KB pages, but can do two page-sized
writes in a single access (multi-plane write). Anything smaller than
a page goes to a temporary buffer first (like the Toshiba chip), but
gets flushed when the next one is not contiguous. If you manage to fill
the entire 16 KB page using small contiguous writes, it can do a single
efficient write access instead.

To confirm that 16 KB is the page size, you can try 

flashbench -s --scatter-span=1 --scatter-order=10 -o plot.data \
	/dev/mmcblk1 -c 32 --blocksize=16384
gnuplot -p -e 'plot "plot.data" '

On most MLC flashes, this will show a pattern alternating between slow
and fast pages like the one from https://lwn.net/Articles/428836/

	Arnd

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-20  5:56                     ` Andrei Warkentin
@ 2011-02-20 15:23                       ` Arnd Bergmann
  -1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-20 15:23 UTC (permalink / raw)
  To: Andrei Warkentin; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc

On Sunday 20 February 2011 06:56:39 Andrei Warkentin wrote:
> On Sat, Feb 19, 2011 at 5:20 AM, Arnd Bergmann <arnd@arndb.de> wrote:

> > The numbers you see here are taken over multiple runs. Do you see a lot
> > of fluctuation when doing this with --count=1?
> >
> 
> Yep. Quite a bit.
> 
> # ./flashbench -c 1 -A -b 1024 /dev/block/mmcblk0p9
> write align 8388608	pre 4.52ms	on 7.58ms	post 3.93ms	diff 3.36ms
> write align 4194304	pre 5.97ms	on 8.69ms	post 4.36ms	diff 3.53ms
> write align 2097152	pre 3.57ms	on 7.96ms	post 4.6ms	diff 3.88ms
> write align 1048576	pre 5.33ms	on 27.4ms	post 4.88ms	diff 22.3ms
> write align 524288	pre 49.3ms	on 31.4ms	post 14.9ms	diff -679265
> write align 262144	pre 39.7ms	on 38.3ms	post 5.27ms	diff 15.8ms
> write align 131072	pre 33.8ms	on 45.4ms	post 5.26ms	diff 25.9ms
> write align 65536	pre 34.4ms	on 40.9ms	post 3.3ms	diff 22.1ms
> write align 32768	pre 30.2ms	on 44.8ms	post 5.13ms	diff 27.1ms
> write align 16384	pre 44.5ms	on 5.05ms	post 33.3ms	diff -338542
> write align 8192	pre 25.5ms	on 70.6ms	post 25.3ms	diff 45.2ms
> write align 4096	pre 4.89ms	on 4.47ms	post 5.29ms	diff -623390
> write align 2048	pre 4.88ms	on 4.89ms	post 5.2ms	diff -155781
> # ./flashbench -c 1 -A -b 1024 /dev/block/mmcblk0p9
> write align 8388608	pre 4.68ms	on 9.06ms	post 5.14ms	diff 4.15ms
> write align 4194304	pre 4.37ms	on 7.49ms	post 4.59ms	diff 3.01ms
> write align 2097152	pre 23.7ms	on 1.9ms	post 14.8ms	diff -173218
> write align 1048576	pre 14.8ms	on 19.9ms	post 4.75ms	diff 10.2ms
> write align 524288	pre 20.2ms	on 24.9ms	post 10.7ms	diff 9.46ms
> write align 262144	pre 20.2ms	on 3.01ms	post 20.1ms	diff -171062
> write align 131072	pre 25.9ms	on 24.9ms	post 9.85ms	diff 7.06ms
> write align 65536	pre 15.5ms	on 30.3ms	post 2.95ms	diff 21.1ms
> write align 32768	pre 27.3ms	on 19.1ms	post 5.86ms	diff 2.5ms
> write align 16384	pre 25.4ms	on 55.9ms	post 12.7ms	diff 36.9ms
> write align 8192	pre 4.8ms	on 102ms	post 9.47ms	diff 94.8ms
> write align 4096	pre 4.92ms	on 5.16ms	post 4.98ms	diff 207µs
> write align 2048	pre 4.64ms	on 4.92ms	post 5.45ms	diff -121860
> # ./flashbench -c 1 -A -b 1024 /dev/block/mmcblk0p9
> write align 8388608	pre 15.8ms	on 9.39ms	post 4.68ms	diff -854295
> write align 4194304	pre 4.76ms	on 7.54ms	post 3.82ms	diff 3.24ms
> write align 2097152	pre 19.9ms	on 9.73ms	post 4.44ms	diff -244517
> write align 1048576	pre 14.5ms	on 19.1ms	post 5.21ms	diff 9.23ms
> write align 524288	pre 24.9ms	on 29ms	post 5.89ms	diff 13.6ms
> write align 262144	pre 24.9ms	on 2.41ms	post 20.8ms	diff -204328
> write align 131072	pre 25.6ms	on 30ms	post 4.84ms	diff 14.8ms
> write align 65536	pre 26.4ms	on 24.4ms	post 6.16ms	diff 8.12ms
> write align 32768	pre 15ms	on 30.6ms	post 15.4ms	diff 15.4ms
> write align 16384	pre 16.1ms	on 45.4ms	post 16.5ms	diff 29.1ms
> write align 8192	pre 5.88ms	on 107ms	post 5.45ms	diff 101ms
> write align 4096	pre 5.17ms	on 5.78ms	post 4.83ms	diff 778µs
> write align 2048	pre 3.99ms	on 5.27ms	post 3.97ms	diff 1.29ms
> # ./flashbench -c 1 -A -b 1024 /dev/block/mmcblk0p9
> write align 8388608	pre 16.1ms	on 8.37ms	post 5.44ms	diff -241222
> write align 4194304	pre 4.07ms	on 7.27ms	post 3.89ms	diff 3.29ms
> write align 2097152	pre 24.2ms	on 18.5ms	post 5.63ms	diff 3.59ms
> write align 1048576	pre 4.08ms	on 18.9ms	post 5.46ms	diff 14.1ms
> write align 524288	pre 25.1ms	on 28ms	post 14.6ms	diff 8.13ms
> write align 262144	pre 15.8ms	on 30ms	post 5.4ms	diff 19.4ms
> write align 131072	pre 24.7ms	on 30.8ms	post 4.43ms	diff 16.2ms
> write align 65536	pre 5ms	on 40.5ms	post 5.95ms	diff 35.1ms
> write align 32768	pre 24.7ms	on 30.6ms	post 4.92ms	diff 15.8ms
> write align 16384	pre 25.2ms	on 132ms	post 10.2ms	diff 114ms
> write align 8192	pre 7.64ms	on 111ms	post 9.18ms	diff 102ms
> write align 4096	pre 5.11ms	on 3.92ms	post 5.4ms	diff -134159
> write align 2048	pre 3.92ms	on 4.41ms	post 4.51ms	diff 196µs

Every value is the average of eight measurements, so there are probably
some that include the 100ms garbage collection, and others that don't.
I'm more confused about this now than I was before.

> > Also, does the same happen with other blocksizes, e.g. 4096 or 8192, passed
> > to flashbench?
>
> # echo 0 > /sys/block/mmcblk0/device/page_size
> # ./flashbench -A -b 1024 /dev/block/mmcblk0p9
> write align 65536	pre 3.33ms	on 6.57ms	post 3.65ms	diff 3.08ms
> write align 32768	pre 3.68ms	on 6.6ms	post 3.7ms	diff 2.91ms
> write align 16384	pre 3.64ms	on 97.6ms	post 3.26ms	diff 94.2ms
> write align 8192	pre 3.49ms	on 115ms	post 3.62ms	diff 112ms
> write align 4096	pre 3.91ms	on 3.91ms	post 3.9ms	diff 360ns
> write align 2048	pre 3.92ms	on 3.92ms	post 3.92ms	diff -1374ns
> # ./flashbench -A -b 2048 /dev/block/mmcblk0p9
> write align 65536	pre 4.02ms	on 7.22ms	post 4.14ms	diff 3.14ms
> write align 32768	pre 4ms	on 7.07ms	post 3.95ms	diff 3.1ms
> write align 16384	pre 3.66ms	on 106ms	post 3.4ms	diff 102ms
> write align 8192	pre 3.56ms	on 106ms	post 3.36ms	diff 103ms
> write align 4096	pre 3.61ms	on 4.1ms	post 4.35ms	diff 117µs
> # ./flashbench -A -b 4096 /dev/block/mmcblk0p9
> write align 65536	pre 3.89ms	on 6.97ms	post 3.96ms	diff 3.04ms
> write align 32768	pre 3.89ms	on 6.97ms	post 3.96ms	diff 3.04ms
> write align 16384	pre 3.74ms	on 114ms	post 4.05ms	diff 110ms
> write align 8192	pre 4.25ms	on 115ms	post 4.8ms	diff 110ms
> # ./flashbench -A -b 8192 /dev/block/mmcblk0p9
> write align 65536	pre 4.11ms	on 7.46ms	post 4.24ms	diff 3.29ms
> write align 32768	pre 4.15ms	on 7.45ms	post 4.25ms	diff 3.25ms
> write align 16384	pre 4.24ms	on 96.1ms	post 3.83ms	diff 92.1ms

Ok, that is very consistent then at least.

> I thought the following was interesting. I did it to see the big
> time go away, since it would end up being a 16K write straddling an 8K
> boundary, but the pre and post results I don't understand at all.
> 
> # ./flashbench -A -b 16384  /dev/block/mmcblk0p9
> write align 8388608	pre 121ms	on 7.76ms	post 116ms	diff -110845
> write align 4194304	pre 129ms	on 7.57ms	post 115ms	diff -114863
> write align 2097152	pre 121ms	on 7.78ms	post 123ms	diff -114318
> write align 1048576	pre 131ms	on 7.74ms	post 106ms	diff -110856
> write align 524288	pre 131ms	on 7.58ms	post 116ms	diff -115926
> write align 262144	pre 131ms	on 7.55ms	post 115ms	diff -115591
> write align 131072	pre 131ms	on 7.54ms	post 116ms	diff -115617
> write align 65536	pre 131ms	on 7.54ms	post 115ms	diff -115579
> write align 32768	pre 125ms	on 6.89ms	post 116ms	diff -113408

The description of the test case is probably suboptimal. What this does
is 32 KB accesses, with 32 KB alignment in the pre and post case, but 16 KB
alignment in the "on" case. The idea here is that it should never do
any access with less than "--blocksize" alignment.

This is what I think happens:
Since the partition is over 64 MB in size and the card can have seven 4 MB
allocation units open, writing to 8 locations on the drive separated by 8 MB
causes it to do garbage collection all the time for 32 KB accesses and larger.
However, the "on" measurement is only 16 KB aligned, so it goes into T's
buffer A for small writes, and does not hit the garbage collection all the
time, so it ends up being a lot faster.
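
As a rough illustration of the reasoning above, here is a minimal sketch that
counts how often an allocation unit would have to be closed when writes cycle
over more locations than the card can keep open. It assumes
least-recently-used eviction, which is only a guess about the card's firmware;
count_evictions() and MAX_OPEN are names made up for this example.

```c
#include <assert.h>

#define MAX_OPEN 16   /* upper bound for this sketch only */

/*
 * Count how many writes force an open allocation unit (AU) to be closed.
 * Assumption: the card keeps at most max_open AUs open and closes the
 * least recently used one when a new AU is needed.
 */
unsigned count_evictions(unsigned max_open, unsigned locations,
                         unsigned rounds)
{
    unsigned open[MAX_OPEN];   /* open[0] is the most recently used AU */
    unsigned n_open = 0, evictions = 0;

    assert(max_open > 0 && max_open <= MAX_OPEN);

    for (unsigned r = 0; r < rounds; r++) {
        for (unsigned au = 0; au < locations; au++) {
            unsigned pos = n_open;

            /* Look for the AU among the open ones. */
            for (unsigned i = 0; i < n_open; i++) {
                if (open[i] == au) {
                    pos = i;
                    break;
                }
            }

            if (pos == n_open) {             /* AU is not open yet */
                if (n_open == max_open) {    /* table full: evict the LRU */
                    evictions++;
                    pos = n_open - 1;
                } else {
                    pos = n_open++;
                }
            }

            /* Move the AU to the front of the list. */
            for (unsigned i = pos; i > 0; i--)
                open[i] = open[i - 1];
            open[0] = au;
        }
    }
    return evictions;
}
```

Under these assumptions, cycling over 8 locations with 7 open units evicts on
every access after the first 7, while cycling over 7 locations never evicts.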

	Arnd

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-20 15:03                     ` Arnd Bergmann
@ 2011-02-22  6:42                       ` Andrei Warkentin
  -1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-22  6:42 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc

[-- Attachment #1: Type: text/plain, Size: 1409 bytes --]

On Sun, Feb 20, 2011 at 9:03 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>
> From your flashbench -a run, I would guess that it uses
> 8 MB allocation units, although the data is not 100% conclusive
> there.
>

Because the 8MB aligned write time is significantly faster, right?

>
> My interpretation is that it uses 16 KB pages, but can do two page-sized
> writes in a single access (multi-plane write). Anything smaller than
> a page goes to a temporary buffer first (like the Toshiba chip), but
> gets flushed when the next one is not contiguous. If you manage to fill
> the entire 16 KB page using small contiguous writes, it can do a single
> efficient write access instead.
>
> To confirm that 16 KB is the page size, you can try
>
> flashbench -s --scatter-span=1 --scatter-order=10 -o plot.data \
>        /dev/mmcblk1 -c 32 --blocksize=16384
> gnuplot -p -e 'plot "plot.data" '
>
> On most MLC flashes, this will show a pattern alternating between slow
> and fast pages like the one from https://lwn.net/Articles/428836/

Cool.

I am attaching some graphs. The 16k SanDisk shows the slow and fast
page parallel lines, as does the 8k Toshiba (but we knew it for the
Toshiba case), but the boundaries are strange for the SanDisk case,
and there is an interesting 2 MB variation in the Toshiba 8k graph. What
is the correct way to interpret graphs with other block sizes?

A

[-- Attachment #2: scatter_8k_read_ts.png --]
[-- Type: image/png, Size: 11238 bytes --]

[-- Attachment #3: scatter_8k_sandisk.png --]
[-- Type: image/png, Size: 8964 bytes --]

[-- Attachment #4: scatter_16k_sandisk.png --]
[-- Type: image/png, Size: 6853 bytes --]

[-- Attachment #5: scatter_32k_read_ts.png --]
[-- Type: image/png, Size: 9471 bytes --]

[-- Attachment #6: scatter_32k_sandisk.png --]
[-- Type: image/png, Size: 6790 bytes --]

[-- Attachment #7: scatter_16k_read_ts.png --]
[-- Type: image/png, Size: 9040 bytes --]

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-20 15:23                       ` Arnd Bergmann
@ 2011-02-22  7:05                         ` Andrei Warkentin
  -1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-22  7:05 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc

On Sun, Feb 20, 2011 at 9:23 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> The description of the test case is probably suboptimal. What this does
> is 32 KB accesses, with 32 KB alignment in the pre and post case, but 16 KB
> alignment in the "on" case. The idea here is that it should never do
> any access with less than "--blocksize" alignment.
>

Now I feel slightly confused :(.

-b 16384 implies blocksize = 16384, maxalign is 8mb due to count 32,

	ret = time_rw_interval(dev, count, pre, blocksize,
			       align - blocksize, maxalign,
			       do_write);
	// <-- read 16k at align - 16k with 8mb intervals?
	returnif(ret);

	ret = time_rw_interval(dev, count, on, blocksize,
			       align - blocksize / 2, maxalign,
			       do_write);
	// <-- read 16k at align - 8k with 8mb intervals?
	returnif(ret);

	ret = time_rw_interval(dev, count, post, blocksize,
			       align, maxalign, do_write);
	// <-- read 16k at align with 8mb intervals?
	returnif(ret);

I hope I'm not missing something obvious...
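
To make the question concrete, here is a small sketch that reproduces only the
offset arithmetic of the three calls quoted above and checks which access
crosses the alignment boundary. It assumes the fourth argument of
time_rw_interval() is the access size and the fifth is the starting offset, as
the annotations suggest; probe_offsets_for() and crosses_boundary() are helper
names invented for this example and are not part of flashbench.

```c
#include <assert.h>

struct probe_offsets {
    long long pre;    /* start of the "pre" access  */
    long long on;     /* start of the "on" access   */
    long long post;   /* start of the "post" access */
};

/* Same offset expressions as in the three time_rw_interval() calls. */
struct probe_offsets probe_offsets_for(long long align, long long blocksize)
{
    struct probe_offsets o;

    assert(blocksize > 0 && align >= blocksize);
    o.pre  = align - blocksize;
    o.on   = align - blocksize / 2;
    o.post = align;
    return o;
}

/* Does an access of len bytes at off cross a multiple of boundary? */
int crosses_boundary(long long off, long long len, long long boundary)
{
    assert(off >= 0 && len > 0 && boundary > 0);
    return off / boundary != (off + len - 1) / boundary;
}
```

With blocksize 16384 and align 32768 this gives starting offsets 16384, 24576
and 32768, and only the "on" access crosses the 32 KB boundary.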


> This is what I think happens:
> Since the partition is over 64 MB in size and it can have seven 4 MB allocation units open,
> writing to 8 locations on the drive separated by 8 MB causes it to do garbage collection
> all the time for 32KB accesses and larger. However, the "on" measurement is only
> 16 KB aligned, so it goes into T's buffer A for small writes, and does not hit
> the garbage collection all the time, so it ends up being a lot faster.
>

Can't go to A. A is 8KB big. Strange...

A

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-20 14:39                           ` Arnd Bergmann
@ 2011-02-22  7:46                             ` Andrei Warkentin
  -1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-22  7:46 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc

On Sun, Feb 20, 2011 at 8:39 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> [adding linux-fsdevel to Cc, see http://lwn.net/Articles/428941/ and
> http://comments.gmane.org/gmane.linux.ports.arm.kernel/105607 for more
> on this discussion.]
>
>
> I think it's good to discuss all the options, but my feeling is that
> we should not add so much complexity at the interface level, because
> we will never be able to change all that again. In general, sysfs
> files should contain simple values that are self-descriptive (a simple
> number or one word), and should have no side-effects (unlike the delete
> or the policies attributes you describe).
>
> The behavior of the Toshiba chip is peculiar enough to justify having
> some workarounds for it, including run-time selected ones, but I'm
> looking for something much simpler. I'd certainly be interested in
> the patch you come up with and any performance results, but I don't
> think it can be merged like that.
>

Sure. The page_align patch is just going to be a single sysfs
attribute. All I need to prove to myself now is the effect for large
unaligned accesses (and show everyone else the data :-)).

> In the end, Chris will have to make the decision on mmc patches of
> course -- I'm just trying to contribute experience from other subsystems.
>
> What I see as a more promising approach is to add the tunables
> to attributes of the CFQ I/O scheduler once we know what we want.
> This will allow doing the same optimizations to non-MMC devices such
> as USB sticks or CF/IDE cards without reimplementing it in other
> subsystems, and give more control over the individual requests than
> the MMC layer has.
>
> E.g. the I/O scheduler can also make sure that we always submit all
> blocks from the start of one erase unit (e.g. 4 MB) to the end, but
> not try to merge requests across erase unit boundaries. It can
> also try to group the requests in aligned power-of-two sized chunks
> rather than merging as many sectors as possible up to the maximum
> request size, ignoring the alignment.

I agree. These are common things that affect any kind of flash
storage, and they belong in the I/O scheduler as simple tunables. I'll
see if I can figure my way around that...
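
As a sketch of the kind of check such a scheduler tunable might drive, the
helpers below decide whether two contiguous requests may be merged without the
result crossing an erase unit boundary. The function names and the byte-based
interface are assumptions made for this illustration; they do not correspond
to any existing block layer API.

```c
#include <assert.h>
#include <stdint.h>

/* Return 1 if the byte range [start, start + len) lies in one erase unit. */
int within_one_erase_unit(uint64_t start, uint64_t len, uint64_t eu_size)
{
    assert(len > 0 && eu_size > 0);
    return start / eu_size == (start + len - 1) / eu_size;
}

/*
 * Return 1 if request b directly follows request a and the merged
 * request would stay inside a single erase unit.
 */
int may_merge(uint64_t a_start, uint64_t a_len,
              uint64_t b_start, uint64_t b_len, uint64_t eu_size)
{
    assert(a_len > 0 && b_len > 0);
    if (a_start + a_len != b_start)
        return 0;   /* not contiguous, nothing to merge */
    return within_one_erase_unit(a_start, a_len + b_len, eu_size);
}
```

With a 4 MB erase unit, two 4 KB requests ending exactly at the 4 MB mark may
be merged, while a request ending at 4 MB and one starting at 4 MB may not.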

What belongs in the mmc card driver are tunable workarounds for MMC/SD
brokenness. For example, needing to use 8K-split reliable writes to
ensure that a 64KB access doesn't wind up in the 4MB buffer B (so as to
improve the lifespan of the card). But you want a waterline above which
you don't do this anymore, otherwise the overall performance will go
to 0, i.e. there is a need to balance between performance and
reliability, so the range of access sizes for which the workaround
applies needs to be runtime controlled, as it's potentially different.
Another example (this one apparently affects SanDisk): do
special stuff for block erase, since the card violates the spec in that
regard (touch ext_csd instead of the argument, I believe). A different
example might be turning on reliable writes for WRITE_META (or all)
blocks for a certain partition (but I just made that up...).
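
The waterline idea can be sketched as a small decision function: writes larger
than the small-write buffer but no larger than the waterline are split into
buffer-sized chunks, and everything else is issued whole. The name
write_chunk_count() and its parameters are hypothetical and only illustrate
the size range check, not how the MMC block driver would issue the requests.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of requests to issue for a write of len bytes.
 * Writes with chunk < len <= waterline are split into chunk-sized
 * pieces; smaller and larger writes are issued as a single request.
 */
size_t write_chunk_count(size_t len, size_t chunk, size_t waterline)
{
    assert(len > 0 && chunk > 0 && waterline >= chunk);

    if (len <= chunk || len > waterline)
        return 1;
    return (len + chunk - 1) / chunk;   /* round up to whole chunks */
}
```

With an 8 KB chunk and a 64 KB waterline, a 64 KB write becomes eight
requests, while 4 KB and 128 KB writes are left alone.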

So there are things that just should be on (spec brokenness
workarounds), and things that apply only to a subset of accesses (and
thus they are selective at issue_*_rq time), whether it's because of
accessed offset or access size.

I agree that the sysfs method is particularly nasty, and I guess I
didn't have to make a prototype to figure that out :-) (but needed
something similar for selective testing anyway). Nothing else exists
right now that acts in the same way, and nothing really should, as
there is no feedback for manipulating the policies (echo POLICY_ENUM >
policy, if it doesn't stick, then the arguments were wrong, etc).

You could put the entire MMC block policy interface through an API
usable by system integrators - i.e. you would really only care for
tuning the MMC parameters if you're creating a device around an eMMC.

Idea (1). One idea is to keep the "policies" from my previous mail.
Policies are registered through platform-specific code. The policies
could be then matched for enabling against a specific block device by
manfid/date/etc at the time of mmc_block_alloc... For removable media
no one would fiddle with the tunable parameters anyway, unless there
was some global database of cards and workarounds and a daemon or some
such to take care of that... Probably don't want to add such baggage
to the kernel.

Idea (2). There is probably no need to overcomplicate. Just add a
platform callback (something like int
(*mmc_platform_block_workaround)(struct request *, struct
mmc_blk_request *)). This will be usable as-is for R/W accesses, and
the discard code will need to be slightly modified.
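
A user-space sketch of what such a callback could look like follows. The two
structs are simplified stand-ins for the kernel's struct request and struct
mmc_blk_request, with fields invented for this example; the hook body, the
prepare_request() wrapper and the 8 KB / 64 KB thresholds are likewise
assumptions, not existing kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures (fields are invented). */
struct request {
    unsigned int nr_sectors;   /* request length in 512-byte sectors */
    int is_write;
};

struct mmc_blk_request {
    unsigned int blocks;       /* sectors to transfer in this command */
    int reliable;              /* use a reliable write */
};

typedef int (*mmc_platform_block_workaround_t)(struct request *,
                                               struct mmc_blk_request *);

#define SMALL_WRITE_SECTORS 16u    /* 8 KB, hypothetical threshold  */
#define WATERLINE_SECTORS   128u   /* 64 KB, hypothetical waterline */

/* Example platform hook: returns 1 if it changed the request. */
int example_workaround(struct request *req, struct mmc_blk_request *brq)
{
    assert(req != NULL && brq != NULL);

    if (!req->is_write)
        return 0;
    if (req->nr_sectors <= SMALL_WRITE_SECTORS ||
        req->nr_sectors > WATERLINE_SECTORS)
        return 0;

    brq->blocks = SMALL_WRITE_SECTORS;   /* issue only the first 8 KB */
    brq->reliable = 1;
    return 1;
}

/* Fill in defaults, then let the optional platform hook adjust them. */
int prepare_request(mmc_platform_block_workaround_t hook,
                    struct request *req, struct mmc_blk_request *brq)
{
    assert(req != NULL && brq != NULL);

    brq->blocks = req->nr_sectors;
    brq->reliable = 0;
    return hook ? hook(req, brq) : 0;
}
```

A platform with no quirks would pass a NULL hook and get the unmodified
request back.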

Do you think there is any need for runtime tuning of the MMC
workarounds (disregarding ones that really belong in the I/O
scheduler)? Should the workarounds be simply platform callbacks, or
should they be something heftier ("policies")?

A

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-22  6:42                       ` Andrei Warkentin
@ 2011-02-22 16:42                         ` Arnd Bergmann
  -1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-22 16:42 UTC (permalink / raw)
  To: Andrei Warkentin; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc

On Tuesday 22 February 2011, Andrei Warkentin wrote:
> On Sun, Feb 20, 2011 at 9:03 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > From your flashbench -a run, I would guess that it uses
> > 8 MB allocation units, although the data is not 100% conclusive
> > there.
> >
> 
> Because the 8MB aligned write time is significantly faster, right?

I mean because a read spanning an 8 MB boundary is noticeably
slower than one spanning a 4 MB boundary (diff 242µs instead of 187µs),
while the numbers for the 4 MB and 2 MB boundaries and everything
below them are quite similar.

> I am attaching some graphs. The 16k sandisk shows the slow and fast
> page parallel lines, as does the 8k toshiba (but we knew it for the
> toshiba case), but the boundaries are strange for the sandisk case,
> and there is an interesting 2mb variation in the toshiba 8k graph. What
> is the correct way to interpret graphs with other block sizes?

Not sure if it's correct, but my interpretation of your output
is this:

In the Toshiba graph, you see parallel lines that show measurements
30µs apart, e.g. 1.06ms and 1.09 ms in the first one. I assume what
you see here are fast and slow pages, respectively. It's a bit hard
to tell in the resolution you have, and it would make sense to zoom
into the picture to see if they are alternating or just random.
The three groups of double lines are probably just some jitter
from the timing of the interrupt controller. If you run with a larger
--count= value, these should become less visible.

The sandisk plot shows some sector ranges that are slower than others,
I'd assume that those are the ones that have been recently written.
The 16KB page plot has parallel lines (again, I'd have to see a
finer resolution plot to see if they are alternating), which the
32KB page plot does not have. I see this as an indication that the
pages are indeed 16KB, and in the 32KB plot the results are just
averaged out.

	Arnd

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-22  7:05                         ` Andrei Warkentin
@ 2011-02-22 16:49                           ` Arnd Bergmann
  -1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-22 16:49 UTC (permalink / raw)
  To: Andrei Warkentin; +Cc: linux-arm-kernel, Linus Walleij, linux-mmc

On Tuesday 22 February 2011, Andrei Warkentin wrote:
> On Sun, Feb 20, 2011 at 9:23 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> > The description of the test case is probably suboptimal. What this does
> > is 32 KB accesses, with 32 KB alignment in the pre and post case, but 16 KB
> > alignment in the "on" case. The idea here is that it should never do
> > any access with less than "--blocksize" alignment.
> >
> 
> Now I feel slightly confused :(.
> 
> -b 16384 implies blocksize = 16384, maxalign is 8mb due to count 32,
> 
>                 /* read 16k at align - 16k with 8mb intervals? */
>                 ret = time_rw_interval(dev, count, pre, blocksize,
>                                        align - blocksize, maxalign,
>                                        do_write);
>                 returnif(ret);
> 
>                 /* read 16k at align - 8k with 8mb intervals? */
>                 ret = time_rw_interval(dev, count, on, blocksize,
>                                        align - blocksize / 2, maxalign,
>                                        do_write);
>                 returnif(ret);
> 
>                 /* read 16k at align with 8mb intervals? */
>                 ret = time_rw_interval(dev, count, post, blocksize,
>                                        align, maxalign, do_write);
>                 returnif(ret);
> 
> I hope I'm not missing something obvious...

No, you are absolutely right. I think I changed this once and no longer
remembered what the final version did.
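
For concreteness, here is a small sketch of the arithmetic behind the
three calls quoted above. It only mirrors the offset expressions
(align - blocksize, align - blocksize / 2, align); the struct and
function names are made up for illustration, and the 8MB boundary used
in the note below is an assumed example value.

```c
#include <stdint.h>

/* Start offsets of the three timed accesses around one boundary. */
struct probe_offsets {
	uint64_t pre;   /* access ends exactly at the boundary */
	uint64_t on;    /* access straddles the boundary */
	uint64_t post;  /* access starts exactly at the boundary */
};

static struct probe_offsets compute_probe_offsets(uint64_t align,
						  uint64_t blocksize)
{
	struct probe_offsets o;

	o.pre  = align - blocksize;
	o.on   = align - blocksize / 2;
	o.post = align;
	return o;
}
```

With blocksize = 16384 and an 8MB boundary, the "on" access starts 8KB
before the boundary and ends 8KB after it, so each access is 16KB long
but the "on" case is only 8KB aligned.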

	Arnd

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-22  7:46                             ` Andrei Warkentin
@ 2011-02-22 17:00                               ` Arnd Bergmann
  -1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-22 17:00 UTC (permalink / raw)
  To: Andrei Warkentin
  Cc: linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc

On Tuesday 22 February 2011, Andrei Warkentin wrote:
> On Sun, Feb 20, 2011 at 8:39 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> > E.g. the I/O scheduler can also make sure that we always submit all
> > blocks from the start of one erase unit (e.g. 4 MB) to the end, but
> > not try to merge requests across erase unit boundaries. It can
> > also try to group the requests in aligned power-of-two sized chunks
> > rather than merging as many sectors as possible up to the maximum
> > request size, ignoring the alignment.
> 
> I agree. These are common things that affect any kind of flash
> storage, and it belongs in the I/O scheduler as simple tuneables. I'll
> see if I can figure my way around that...
> 
> What belongs in mmc card driver are tunable workarounds for MMC/SD
> brokenness. For example - needing to use 8K-split reliable writes to
> ensure that a 64KB access doesn't wind up in the 4MB buffer B (so as to
> improve lifespan of the card). But you want a waterline above which
> you don't do this anymore, otherwise the overall performance will go
> to 0 - i.e. there is a need to balance between performance and
> reliability, so the range of access size for which the workaround
> works needs to be runtime controlled, as it's potentially different.
> Another example (this one is apparently affecting Sandisk) - do
> special stuff for block erase, since the card violates spec in that
> regard (touch ext_csd instead of argument, I believe). A different
> example might be turning on reliable writes for WRITE_META (or all)
> blocks for a certain partition (but I just made that up... ).

Yes, makes sense.

> You could put the entire MMC block policy interface through an API
> usable by system integrators - i.e. you would really only care for
> tuning the MMC parameters if you're creating a device around an emmc.
> 
> Idea (1). One idea is to keep the "policies" from my previous mail.
> Policies are registered through platform-specific code. The policies
> could be then matched for enabling against a specific block device by
> manfid/date/etc at the time of mmc_block_alloc... For removable media
> no one would fiddle with the tunable parameters anyway, unless there
> was some global database of cards and workarounds and a daemon or some
> such to take care of that... Probably don't want to add such baggage
> to the kernel.
> 
> Idea (2). There is probably no need to overcomplicate. Just add a
> platform callback (something like int
> (*mmc_platform_block_workaround)(struct request *, struct
> mmc_blk_request *)). This will be usable as-is for R/W accesses, and
> the discard code will need to be slightly modified.
> 
> Do you think there is any need for runtime tuning of the MMC
> workarounds (disregarding ones that really belong in the I/O
> scheduler)? Should the workarounds be simply platform callbacks, or
> should they be something heftier ("policies")?

The platform hook seems the wrong place, because you might use
the same chip in multiple platforms, and a single platform might
have a large number of different boards, all of which require
separate workarounds.

A per-card quirk table does not seem so bad, we have that in
other subsystems as well. I wouldn't necessarily make it
a list of possible quirks, but rather a __devinit function that
is called for a new card on insertion, in order to tweak various
parameters.
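
A rough sketch of what such a per-card init function could look like,
written as plain C rather than against the real mmc structures. The
struct card stand-in, the field names, the manufacturer ID values and
the default sizes are all assumptions for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's card structure. */
struct card {
	uint32_t manfid;             /* manufacturer ID */
	uint32_t page_size;          /* minimum atomic update size, bytes */
	uint32_t erase_size;         /* preferred erase unit, bytes */
	int      split_small_writes; /* vendor-specific write workaround */
};

struct card_fixup {
	uint32_t manfid;
	void (*init)(struct card *card);
};

/* Placeholder IDs, not real vendor IDs. */
#define EXAMPLE_MANFID_A 0xa1
#define EXAMPLE_MANFID_B 0xb2

static void vendor_a_init(struct card *card)
{
	card->page_size = 8 * 1024;
	card->split_small_writes = 1;
}

static void vendor_b_init(struct card *card)
{
	card->page_size = 16 * 1024;
	card->erase_size = 8 * 1024 * 1024;
}

static const struct card_fixup fixups[] = {
	{ EXAMPLE_MANFID_A, vendor_a_init },
	{ EXAMPLE_MANFID_B, vendor_b_init },
};

/* Called once when a card is detected: set defaults, then tweak. */
static void card_apply_fixups(struct card *card)
{
	size_t i;

	card->page_size = 32 * 1024;        /* assumed default */
	card->erase_size = 4 * 1024 * 1024; /* assumed default */
	card->split_small_writes = 0;

	for (i = 0; i < sizeof(fixups) / sizeof(fixups[0]); i++) {
		if (fixups[i].manfid == card->manfid) {
			fixups[i].init(card);
			break;
		}
	}
}
```

The point of this shape is that nothing is called per request: the
function runs once at detection time and only leaves values behind.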

	Arnd

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-22 17:00                               ` Arnd Bergmann
@ 2011-02-23 10:19                                 ` Andrei Warkentin
  -1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-23 10:19 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc

On Tue, Feb 22, 2011 at 11:00 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>>
>> Do you think there is any need for runtime tuning of the MMC
>> workarounds (disregarding ones that really belong in the I/O
>> scheduler)? Should the workarounds be simply platform callbacks, or
>> should they be something heftier ("policies")?
>
> The platform hook seems the wrong place, because you might use
> the same chip in multiple platforms, and a single platform might
> have a large number of different boards, all of which require
> separate workarounds.
>

That's a good point. At best it would result in massive copy-paste.

> A per-card quirk table does not seem so bad, we have that in
> other subsystems as well. I wouldn't necessarily make it
> a list of possible quirks, but rather a __devinit function that
> is called for a new card on insertion, in order to tweak various
> parameters.
>

That sounds good! In fact, for any quirks enabled for a particular
card, I'll expose the tuneables through sysfs attributes, something
like /sys/block/mmcblk0/device/quirks/quirk-name/attr-names.

Quirks will have block intervals and access size intervals over which
they are valid, along with any other quirk-specific parameter.
Interval overlap will not be allowed for quirks in the same operation
type (r/w/e). The goal here is to make the changes to issue_*_rq as
small as possible, and not to pollute block.c at all with the quirks
stuff. Quirks are looked up inside issue_*_rq based on req type and
[start,end) interval. The resulting found quirks structure will
contain a callback used inside issue_*_rq to modify mmc block request
structures prior to generating actual MMC commands.

Quirks consist of a callback called inside of mmc issue_*_rq,
configurable attributes, and the sysfs interface. Quirk groups are
defined per-card. At card insertion time, a matching quirk group is
found, and is enabled. The quirk group enable function then enables
the relevant quirks with the right parameters (adds them to per
mmc_blk_data quirk interval tree). Some sane defaults for the tunables
are used. If the tunables are modified through sysfs, care is taken
that an interval overlap never happens, otherwise the tunable is not
modified and a kernel error message is logged.
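
To make the lookup and the overlap rule concrete, here is a small
user-space sketch. It uses a fixed array with a linear scan instead of
an interval tree, and it assumes a request matches a quirk only when
it lies entirely inside the quirk's [start,end) range. All names are
hypothetical, not taken from an existing patch.

```c
#include <stddef.h>
#include <stdint.h>

enum quirk_op { QUIRK_READ, QUIRK_WRITE, QUIRK_ERASE };

struct quirk {
	enum quirk_op op;
	uint64_t start;            /* first block covered (inclusive) */
	uint64_t end;              /* one past the last block covered */
	void (*adjust)(void *req); /* hypothetical hook, may be NULL */
};

#define MAX_QUIRKS 8

struct quirk_set {
	struct quirk q[MAX_QUIRKS];
	size_t count;
};

/* Half-open intervals overlap when each starts before the other ends. */
static int intervals_overlap(uint64_t s1, uint64_t e1,
			     uint64_t s2, uint64_t e2)
{
	return s1 < e2 && s2 < e1;
}

/* Returns 0 on success, -1 if the quirk is invalid or would overlap. */
static int quirk_add(struct quirk_set *set, const struct quirk *nq)
{
	size_t i;

	if (nq->start >= nq->end || set->count == MAX_QUIRKS)
		return -1;
	for (i = 0; i < set->count; i++) {
		const struct quirk *q = &set->q[i];

		/* Overlap is only forbidden within one operation type. */
		if (q->op == nq->op &&
		    intervals_overlap(q->start, q->end, nq->start, nq->end))
			return -1;
	}
	set->q[set->count++] = *nq;
	return 0;
}

/* Find the quirk covering the request [start,end), or NULL. */
static const struct quirk *quirk_find(const struct quirk_set *set,
				      enum quirk_op op,
				      uint64_t start, uint64_t end)
{
	size_t i;

	for (i = 0; i < set->count; i++) {
		const struct quirk *q = &set->q[i];

		if (q->op == op && q->start <= start && end <= q->end)
			return q;
	}
	return NULL;
}
```

A sysfs store handler would call the same overlap check before
accepting a new interval, and leave the old value in place on failure.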

I hope I explained the tentative idea clearly... Thoughts?

A

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-23 10:19                                 ` Andrei Warkentin
@ 2011-02-23 16:09                                   ` Arnd Bergmann
  -1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-23 16:09 UTC (permalink / raw)
  To: Andrei Warkentin
  Cc: linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc

On Wednesday 23 February 2011, Andrei Warkentin wrote:
> That sounds good! In fact, for any quirks enabled for a particular
> card, I'll expose the tuneables through sysfs attributes, something
> like /sys/block/mmcblk0/device/quirks/quirk-name/attr-names.
> 
> Quirks will have block intervals and access size intervals over which
> they are valid, along with any other quirk-specific parameter.
> Interval overlap will not be allowed for quirks in the same operation
> type (r/w/e). The goal here is to make the changes to issue_*_rq as
> small as possible, and not to pollute block.c at all with the quirks
> stuff. Quirks are looked up inside issue_*_rq based on req type and
> [start,end) interval. The resulting found quirks structure will
> contain a callback used inside issue_*_rq to modify mmc block request
> structures prior to generating actual MMC commands.
>
> Quirks consist of a callback called inside of mmc issue_*_rq,
> configurable attributes, and the sysfs interface. Quirk groups are
> defined per-card. At card insertion time, a matching quirk group is
> found, and is enabled. The quirk group enable function then enables
> the relevant quirks with the right parameters (adds them to per
> mmc_blk_data quirk interval tree). Some sane defaults for the tunables
> are used. If the tunables are modified through sysfs, care is taken
> that an interval overlap never happens, otherwise the tunable is not
> modified and a kernel error message is logged.
> 
> I hope I explained the tentative idea clearly... Thoughts?

I would hope that the quirks can be simpler than this still, without
the need to call any function pointers while using the device, or
quirk specific sysfs directories.

What I meant is to have a single function pointer that can get
called when detecting a specific known card. All this function
does is to set values and flags that we can export either through
common attributes of block devices (e.g. preferred erase size),
or attributes specific to mmc devices (e.g. the toshiba hack, as
a bool attribute).

An obvious attribute would be the minimum size of an atomic
page update. By default this could be 32KB, because any device
should support that (FAT32 cannot have larger clusters). A
card specific quirk can set it to another value, like 8KB, 16KB
or 64KB, and file systems or other tools like mkfs can optimize
for this value.

I would like the flags like "don't submit requests spanning
this boundary" and "make all writes below this size" to be defined
in terms of the regular sizes we already know about, like the
page size or the erase size.
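
As an illustration of the "don't submit requests spanning this
boundary" flag, a helper along these lines would be enough to clamp a
request at the next boundary. This is only a sketch: the function name
is made up, and it assumes the boundary (e.g. the erase size) is a
power of two.

```c
#include <stdint.h>

/*
 * Return the largest length, up to `len`, that starts at `offset` and
 * does not cross the next multiple of `boundary`. `boundary` must be a
 * non-zero power of two; all values are in bytes.
 */
static uint64_t clamp_to_boundary(uint64_t offset, uint64_t len,
				  uint64_t boundary)
{
	/* Bytes left between `offset` and the next boundary. */
	uint64_t room = boundary - (offset & (boundary - 1));

	return len < room ? len : room;
}
```

For a 4MB boundary, a 64KB request starting 8KB below the boundary is
cut to 8KB, and the remaining 56KB would be issued as a separate
request starting on the boundary.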

	Arnd

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-23 16:09                                   ` Arnd Bergmann
@ 2011-02-23 22:26                                     ` Andrei Warkentin
  -1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-23 22:26 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc

On Wed, Feb 23, 2011 at 10:09 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Wednesday 23 February 2011, Andrei Warkentin wrote:
>> That sounds good! In fact, for any quirks enabled for a particular
>> card, I'll expose the tuneables through sysfs attributes, something
>> like /sys/block/mmcblk0/device/quirks/quirk-name/attr-names.
>>
>> Quirks will have block intervals and access size intervals over which
>> they are valid, along with any other quirk-specific parameter.
>> Interval overlap will not be allowed for quirks in the same operation
>> type (r/w/e). The goal here is to make the changes to issue_*_rq as
>> small as possible, and not to pollute block.c at all with the quirks
>> stuff. Quirks are looked up inside issue_*_rq based on req type and
>> [start,end) interval. The resulting found quirks structure will
>> contain a callback used inside issue_*_rq to modify mmc block request
>> structures prior to generating actual MMC commands.
>>
>> Quirks consist of a callback called inside of mmc issue_*_rq,
>> configurable attributes, and the sysfs interface. Quirk groups are
>> defined per-card. At card insertion time, a matching quirk group is
>> found, and is enabled. The quirk group enable function then enables
>> the relevant quirks with the right parameters (adds them to per
>> mmc_blk_data quirk interval tree). Some sane defaults for the tunables
>> are used. If the tunables are modified through sysfs, care is taken
>> that an interval overlap never happens, otherwise the tunable is not
>> modified and a kernel error message is logged.
>>
>> I hope I explained the tentative idea clearly... Thoughts?
>

> I would hope that the quirks can be simpler than this still, without
> the need to call any function pointers while using the device, or
> quirk specific sysfs directories.
>

I'll skip the sysfs part from the first RFC patch. I think this
complicates what I'm trying to achieve and makes this whole thing look
bigger than it is.

> What I meant is to have a single function pointer that can get
> called when detecting a specific known card. All this function
> does is to set values and flags that we can export either through
> common attributes of block devices (e.g. preferred erase size),
> or attributes specific to mmc devices (e.g. the toshiba hack, as
> a bool attribute).
>
> An obvious attribute would be the minimum size of an atomic
> page update. By default this could be 32KB, because any device
> should support that (FAT32 cannot have larger clusters). A
> card specific quirk can set it to another value, like 8KB, 16KB
> or 64KB, and file systems or other tools like mkfs can optimize
> for this value.
>
> I would like the flags like "don't submit requests spanning
> this boundary" and "make all writes below this size" to be defined
> in terms of the regular sizes we already know about, like the
> page size or the erase size.
>

I agree with you on the size/align issues. These are very generic
attributes and don't need a complicated framework like I described to
be dealt with. Ultimately they are just hints to the I/O scheduler, so
they should be part of the block device.

I am more concerned with workarounds that depend on access size (like
the toshiba one) and that modify the MMC commands sent (using reliable
writes, like the Toshiba one, or putting parameters differently like
the Sandisk erase workaround). It's these kinds of workarounds that
the quirks framework is meant to address. I don't think it's a good
idea to pollute mmc_blk_issue_rw_rq and mmc_blk_issue_discard_rq with
if()-elsed workarounds, because it's going to quickly complicate the
logic, and get out of hand and unmanageable the more cards are added.
I'm trying to avoid having to make any changes to card/block.c as part
of making quirk workarounds. The only cost when compared to an if-else
will be one O(log n) quirk lookup, where n is either one or something
close to that (since the search is only done for quirks per
mmc_blk_data), and one callback invoked after "brq.data.sg_len =
mmc_queue_map_sg(mq);" so it can patch up mrq as necessary.
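
A rough user-space sketch of that lookup-plus-callback shape follows. It
is an illustration under assumptions, not the actual patch: it uses a
sorted array with binary search in place of an interval tree, it keys the
lookup on access size only, and struct blk_req, struct quirk,
quirk_lookup, apply_quirks and use_reliable_write are all made-up names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the block request being prepared; not a kernel structure. */
struct blk_req {
	uint32_t nr_sectors;
	int reliable_write;
};

/* One quirk, valid for access sizes in [min_sectors, max_sectors). */
struct quirk {
	uint32_t min_sectors;
	uint32_t max_sectors;
	void (*fixup)(struct blk_req *req);
};

/*
 * Binary search over quirks sorted by min_sectors. The intervals must not
 * overlap, which is the invariant described for quirks of the same
 * operation type. Returns NULL when no quirk covers the access size.
 */
const struct quirk *quirk_lookup(const struct quirk *q, size_t n,
				 uint32_t nr_sectors)
{
	size_t lo = 0, hi = n;

	assert(q != NULL || n == 0);

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (nr_sectors < q[mid].min_sectors)
			hi = mid;
		else if (nr_sectors >= q[mid].max_sectors)
			lo = mid + 1;
		else
			return &q[mid];
	}
	return NULL;
}

/* Example callback: patch the request to use a reliable write. */
void use_reliable_write(struct blk_req *req)
{
	req->reliable_write = 1;
}

/* Called once per request, after the scatterlist has been mapped. */
void apply_quirks(const struct quirk *q, size_t n, struct blk_req *req)
{
	const struct quirk *hit = quirk_lookup(q, n, req->nr_sectors);

	if (hit)
		hit->fixup(req);
}
```

With a single table entry the search degenerates to one comparison, which
matches the "n is either one or something close to that" estimate.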

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-23 22:26                                     ` Andrei Warkentin
@ 2011-02-24  9:24                                       ` Arnd Bergmann
  -1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-24  9:24 UTC (permalink / raw)
  To: Andrei Warkentin
  Cc: linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc

On Wednesday 23 February 2011, Andrei Warkentin wrote:

> I am more concerned with workarounds that depend on access size (like
> the toshiba one) and that modify the MMC commands sent (using reliable
> writes, like the Toshiba one, or putting parameters differently like
> the Sandisk erase workaround). It's these kinds of workarounds that
> the quirks framework is meant to address. I don't think it's a good
> idea to pollute mmc_blk_issue_rw_rq and mmc_blk_issue_discard_rq with
> if()-elsed workarounds, because it's going to quickly complicate the
> logic, and get out of hand and unmanageable the more cards are added.
> I'm trying to avoid having to make any changes to card/block.c as part
> of making quirk workarounds. The only cost when compared to an if-else
> will be one O(log n) quirk lookup, where n is either one or something
> close to that (since the search is only done for quirks per
> mmc_blk_data), and one callback invoked after "brq.data.sg_len =
> mmc_queue_map_sg(mq);" so it can patch up mrq as necessary.

Unlike the sysfs interface, the code does not need to be future-proof,
it can always be changed if we feel the code becomes more maintainable
by doing it another way.

The approach that I'd like to see here is:

* Start out with an ad-hoc patch for a quirk (like the one you already
  have).
* Add a boolean variable to enable it per card.
* Get performance data for this quirk to show that it's useful in
  real-world workloads for some cards but counterproductive for others
* Get the patch into the mmc tree.
* Repeat for the next quirk
* When the code becomes overly complicated after adding all the quirks,
  decide on a good strategy to move the code around, and do a new patch.

I understand that you are convinced that you will need the indirect function
calls in the end. That is fine, just don't add them before they are
actually needed -- that would only make it harder for you to get the first
patch included.

Note that the situation is very different for user interfaces such as sysfs:
You need to plan ahead because once the interface is merged upstream, it
can never be changed. When you submit a patch that introduces a new sysfs
interface, it has to be documented, and you have to convince the reviewers
that it is sufficient to cover all the cases it is designed for, while
at the same time it is the most simple way to achieve this.

	Arnd

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-24  9:24                                       ` Arnd Bergmann
@ 2011-02-25 11:02                                         ` Andrei Warkentin
  -1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-02-25 11:02 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc

On Thu, Feb 24, 2011 at 3:24 AM, Arnd Bergmann <arnd@arndb.de> wrote:

> Unlike the sysfs interface, the code does not need to be future-proof,
> it can always be changed if we feel the code becomes more maintainable
> by doing it another way.
>
> The approach that I'd like to see here is:
>
> * Start out with an ad-hoc patch for a quirk (like the one you already
>  have).
> * Add a boolean variable to enable it per card.
> * Get performance data for this quirk to show that it's useful in
>  real-world workloads for some cards but counterproductive for others
> * Get the patch into the mmc tree.
> * Repeat for the next quirk
> * When the code becomes overly complicated after adding all the quirks,
>  decide on a good strategy to move the code around, and do a new patch.
>

Yup. I understand :-).  That's the strategy I'm going to follow. For
page_size-alignment/splitting I'm looking at the block layer now. Is
that the right approach or should I still submit a (cleaned up) patch
to mmc/card/block.c for that performance improvement? The other
(Toshiba quirk) is obviously a quirk belonging to mmc/card/block.c.

> I understand that you are convinced that you will need the indirect function
> calls in the end. That is fine, just don't add them before they are
> actually needed -- that would only make it harder for you to get the first
> patch included.
>
> Note that the situation is very different for user interfaces such as sysfs:
> You need to plan ahead because once the interface is merged upstream, it
> can never be changed. When you submit a patch that introduces a new sysfs
> interface, it has to be documented, and you have to convince the reviewers
> that it is sufficient to cover all the cases it is designed for, while
> at the same time it is the most simple way to achieve this.


Ok, thanks a lot for the explanation, I hadn't thought of it that way
(and should have).

A

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-25 11:02                                         ` Andrei Warkentin
@ 2011-02-25 12:21                                           ` Arnd Bergmann
  -1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-02-25 12:21 UTC (permalink / raw)
  To: Andrei Warkentin, Jens Axboe
  Cc: linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc

On Friday 25 February 2011, Andrei Warkentin wrote:
> Yup. I understand :-).  That's the strategy I'm going to follow. For
> page_size-alignment/splitting I'm looking at the block layer now. Is
> that the right approach or should I still submit a (cleaned up) patch
> to mmc/card/block.c for that performance improvement.

I guess it should live in block/cfq-iosched in the long run, but I don't
know how easy it is to implement it there for test purposes.

It may be easier to prototype it in the mmc code, since you are more
familiar with that already, post that patch together with benchmark
results and then do a new patch for the final solution. We'll need
more benchmarking to figure out if that should be applied for
all nonrotational storage, or if there are cases where it actually
hurts performance to split requests on page boundaries.

If it turns out to be a good idea in general, we won't even need a
sysfs interface for enabling it, just one for reading/writing the
underlying page size.

> The other (Toshiba quirk) is obviously a quirk belonging to mmc/card/block.c.

Makes sense.

	Arnd

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-25 12:21                                           ` Arnd Bergmann
@ 2011-03-01 18:48                                             ` Jens Axboe
  -1 siblings, 0 replies; 117+ messages in thread
From: Jens Axboe @ 2011-03-01 18:48 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Andrei Warkentin, linux-arm-kernel, linux-fsdevel, Linus Walleij,
	linux-mmc

On 2011-02-25 07:21, Arnd Bergmann wrote:
> On Friday 25 February 2011, Andrei Warkentin wrote:
>> Yup. I understand :-).  That's the strategy I'm going to follow. For
>> page_size-alignment/splitting I'm looking at the block layer now. Is
>> that the right approach or should I still submit a (cleaned up) patch
>> to mmc/card/block.c for that performance improvement.
> 
> I guess it should live in block/cfq-iosched in the long run, but I don't
> know how easy it is to implement it there for test purposes.

I don't think I saw the original patch(es) for this?


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-03-01 18:48                                             ` Jens Axboe
@ 2011-03-01 19:11                                               ` Arnd Bergmann
  -1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-03-01 19:11 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Andrei Warkentin, linux-arm-kernel, linux-fsdevel, Linus Walleij,
	linux-mmc

On Tuesday 01 March 2011 19:48:17 Jens Axboe wrote:
> 
> On 2011-02-25 07:21, Arnd Bergmann wrote:
> > On Friday 25 February 2011, Andrei Warkentin wrote:
> >> Yup. I understand :-).  That's the strategy I'm going to follow. For
> >> page_size-alignment/splitting I'm looking at the block layer now. Is
> >> that the right approach or should I still submit a (cleaned up) patch
> >> to mmc/card/block.c for that performance improvement.
> > 
> > I guess it should live in block/cfq-iosched in the long run, but I don't
> > know how easy it is to implement it there for test purposes.
> 
> I don't think I saw the original patch(es) for this?

Nobody has posted one yet, only discussions. Andrei made a patch for the
MMC block driver to split requests in some cases, but I think the
concept has changed enough that it's probably not useful to look at
that patch.

I think what needs to be done here is to split requests in these cases:

* Small requests should be split on flash page boundaries, where a page
is typically 8 to 32 KB. Sending one hardware request that spans two
partial pages can be slower than sending two requests with the same
data, but on page boundaries.

* If a hardware transfer is limited to a few sectors, these should be
aligned to page boundaries. E.g. assuming a 16 sector page and 32 sector
maximum transfers, a request that spans from sector 7 to 62 should be
split into three transfers: 7-15, 16-47 and 48-62, not 7-38 and 39-62.
This reduces the number of page read-modify-write cycles that the drive
does.

* No request should ever span multiple erase blocks. Most flash drives today
have 4MB erase blocks (sometimes 1, 2 or 8), and the I/O scheduler should
treat the erase block boundary like a seek on a hard drive. The I/O
scheduler should try to send all sector writes of an erase block in sequence,
but after that it can choose any other erase block to write to next.
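
The page-alignment case can be sketched in plain C as below. This is a
user-space illustration only; struct xfer and split_on_pages are
hypothetical names, and sectors are given as inclusive ranges to match the
worked example (16-sector page, 32-sector maximum transfer, request 7-62).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One hardware transfer, as an inclusive sector range. */
struct xfer {
	uint64_t first;
	uint64_t last;
};

/*
 * Split the inclusive range [first, last] so that an unaligned head is
 * sent on its own up to the next page boundary, and everything after it
 * goes out in page-aligned chunks of at most max_sectors.
 * Returns the number of transfers written to out (at most out_len).
 */
size_t split_on_pages(uint64_t first, uint64_t last,
		      uint32_t page_sectors, uint32_t max_sectors,
		      struct xfer *out, size_t out_len)
{
	uint32_t chunk;
	uint64_t pos = first;
	size_t n = 0;

	assert(page_sectors > 0 && max_sectors >= page_sectors);

	/* Largest whole number of pages that fits in one transfer. */
	chunk = max_sectors - max_sectors % page_sectors;

	while (pos <= last && n < out_len) {
		uint64_t end; /* exclusive end of this transfer */

		if (pos % page_sectors)
			end = pos + (page_sectors - pos % page_sectors);
		else
			end = pos + chunk;

		if (end > last + 1)
			end = last + 1;

		out[n].first = pos;
		out[n].last = end - 1;
		n++;
		pos = end;
	}
	return n;
}
```

For the example above this yields 7-15, 16-47 and 48-62, so only the first
and last pages are written partially.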

I think if we get this logic, we can deal well with all cheap flash drives.
The two parameters we need are the page size and the erase block size,
which the kernel can sometimes guess, but should also be tunable in
sysfs for devices that don't tell us or lie to the kernel about them.

I'm not sure if we want to do this for all nonrotational media, or
add another flag to enable these optimizations. On proper SSDs that have
an intelligent controller and enough RAM, they probably would not help
all that much, or even make it slightly slower due to a higher number
of separate write requests.

	Arnd

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-03-01 19:11                                               ` Arnd Bergmann
@ 2011-03-01 19:15                                                 ` Jens Axboe
  -1 siblings, 0 replies; 117+ messages in thread
From: Jens Axboe @ 2011-03-01 19:15 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Andrei Warkentin, linux-arm-kernel, linux-fsdevel, Linus Walleij,
	linux-mmc

On 2011-03-01 14:11, Arnd Bergmann wrote:
> On Tuesday 01 March 2011 19:48:17 Jens Axboe wrote:
>>
>> On 2011-02-25 07:21, Arnd Bergmann wrote:
>>> On Friday 25 February 2011, Andrei Warkentin wrote:
>>>> Yup. I understand :-).  That's the strategy I'm going to follow. For
>>>> page_size-alignment/splitting I'm looking at the block layer now. Is
>>>> that the right approach or should I still submit a (cleaned up) patch
>>>> to mmc/card/block.c for that performance improvement.
>>>
>>> I guess it should live in block/cfq-iosched in the long run, but I don't
>>> know how easy it is to implement it there for test purposes.
>>
>> I don't think I saw the original patch(es) for this?
> 
> Nobody has posted one yet, only discussions. Andrei made a patch for the
> MMC block driver to split requests in some cases, but I think the
> concept has changed enough that it's probably not useful to look at
> that patch.
> 
> I think what needs to be done here is to split requests in these cases:
> 
> * Small requests should be split on flash page boundaries, where a page
> is typically 8 to 32 KB. Sending one hardware request that spans two
> partial pages can be slower than sending two requests with the same
> data, but on page boundaries.
> 
> * If a hardware transfer is limited to a few sectors, these should be
> aligned to page boundaries. E.g. assuming a 16 sector page and 32 sector
> maximum transfers, a request that spans from sector 7 to 62 should be
> split into three transfers: 7-15, 16-47 and 48-62, not 7-38 and 39-62.
> This reduces the number of page read-modify-write cycles that the drive
> does.
> 
> * No request should ever span multiple erase blocks. Most flash drives today
> have 4MB erase blocks (sometimes 1, 2 or 8), and the I/O scheduler should
> treat the erase block boundary like a seek on a hard drive. The I/O
> scheduler should try to send all sector writes of an erase block in sequence,
> but after that it can choose any other erase block to write to next.
> 
> I think if we get this logic, we can deal well with all cheap flash drives.
> The two parameters we need are the page size and the erase block size,
> which the kernel can sometimes guess, but should also be tunable in
> sysfs for devices that don't tell us or lie to the kernel about them.
> 
> I'm not sure if we want to do this for all nonrotational media, or
> add another flag to enable these optimizations. On proper SSDs that have
> an intelligent controller and enough RAM, they probably would not help
> all that much, or even make it slightly slower due to a higher number
> of separate write requests.

Thanks for the recap. One way to handle this would be to have a dm
target that ensures that requests are never built up to violate any of
the above items. Doing splitting is a little silly, when you can prevent
it from happening in the first place.

Alternatively, a queue ->merge_bvec_fn() with a settings table could
provide the same.

As this is of limited scope, I would prefer having this done via a
plugin of some sort (like a dm target).

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 117+ messages in thread

* MMC quirks relating to performance/lifetime.
@ 2011-03-01 19:15                                                 ` Jens Axboe
  0 siblings, 0 replies; 117+ messages in thread
From: Jens Axboe @ 2011-03-01 19:15 UTC (permalink / raw)
  To: linux-arm-kernel

On 2011-03-01 14:11, Arnd Bergmann wrote:
> On Tuesday 01 March 2011 19:48:17 Jens Axboe wrote:
>>
>> On 2011-02-25 07:21, Arnd Bergmann wrote:
>>> On Friday 25 February 2011, Andrei Warkentin wrote:
>>>> Yup. I understand :-).  That's the strategy I'm going to follow. For
>>>> page_size-alignment/splitting I'm looking at the block layer now. Is
>>>> that the right approach or should I still submit a (cleaned up) patch
>>>> to mmc/card/block.c for that performance improvement.
>>>
>>> I guess it should live in block/cfq-iosched in the long run, but I don't
>>> know how easy it is to implement it there for test purposes.
>>
>> I don't think I saw the original patch(es) for this?
> 
> Nobody has posted one yet, only discussions. Andrei made a patch for the
> MMC block driver to split requests in some cases, but I think the
> concept has changed enough that it's probably not useful to look at
> that patch.
> 
> I think what needs to be done here is to split requests in these cases:
> 
> * Small requests should be split on flash page boundaries, where a page
> is typically 8 to 32 KB. Sending one hardware request that spans two
> partial pages can be slower than sending two requests with the same
> data, but on page boundaries.
> 
> * If a hardware transfer is limited to a few sectors, these should be
> aligned to page boundaries. E.g. assuming a 16 sector page and 32 sector
> maximum transfers, a request that spans from sector 7 to 62 should be
> split into three transfers: 7-15, 16-47 and 48-62, not 7-38 and 39-62.
> This reduces the number of page read-modify-write cycles that the drive
> does.
> 
> * No request should ever span multiple erase blocks. Most flash drives today
> have 4MB erase blocks (sometimes 1, 2 or 8), and the I/O scheduler should
> treat the erase block boundary like a seek on a hard drive. The I/O
> scheduler should try to send all sector writes of an erase block in sequence,
> but after that it can chose any other erase block to write to next.
> 
> I think if we get this logic, we can deal well with all cheap flash drives.
> The two parameters we need are the page size and the erase block size,
> which the kernel can sometimes guess, but should also be tunable in
> sysfs for devices that don't tell us or lie to the kernel about them.
> 
> I'm not sure if we want to do this for all nonrotational media, or
> add another flag to enable these optimizations. On proper SSDs that have
> an intelligent controller and enough RAM, they probably would not help
> all that much, or even make it slightly slower due to a higher number
> of separate write requests.
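
To make the arithmetic in the second item above concrete, here is a
minimal Python sketch (not kernel code). The names split_request,
page_sectors and max_sectors are made up for illustration, and it
assumes the maximum transfer size is a multiple of the page size.

```python
def split_request(first, last, page_sectors=16, max_sectors=32):
    """Split the inclusive sector range [first, last] so that every
    transfer after an unaligned head starts on a page boundary.

    Assumes max_sectors is a multiple of page_sectors.
    """
    transfers = []
    start = first
    while start <= last:
        if start % page_sectors:
            # Unaligned head: stop at the end of the current page.
            end = start - start % page_sectors + page_sectors - 1
        else:
            # Aligned start: send as much as the hardware allows.
            end = start + max_sectors - 1
        end = min(end, last)
        transfers.append((start, end))
        start = end + 1
    return transfers


# The example from the recap: 16-sector pages, 32-sector transfers.
print(split_request(7, 62))  # → [(7, 15), (16, 47), (48, 62)]
```

Only the first and last transfers touch a partial page here, whereas
the naive 7-38 / 39-62 split has a partial page at both ends of both
transfers.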

Thanks for the recap. One way to handle this would be to have a dm
target that ensures that requests are never built up to violate any of
the above items. Doing splitting is a little silly, when you can prevent
it from happening in the first place.

Alternatively, a queue ->merge_bvec_fn() with a settings table could
provide the same.
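
As a rough illustration of the check such a hook would have to make,
here is a Python sketch of the boundary arithmetic only. The real
merge_bvec_fn() callback operates on bios and bio_vecs and does not
have this signature; may_merge, sectors_until_boundary and the
8192-sector (4MB with 512-byte sectors) boundary are assumptions for
the example.

```python
def sectors_until_boundary(sector, boundary_sectors):
    """Number of sectors from `sector` up to the next boundary."""
    return boundary_sectors - (sector % boundary_sectors)


def may_merge(req_start, req_sectors, add_sectors, boundary_sectors):
    """True if appending add_sectors to the request keeps it inside a
    single boundary-sized block (e.g. one erase block)."""
    first_block = req_start // boundary_sectors
    last_sector = req_start + req_sectors + add_sectors - 1
    return first_block == last_sector // boundary_sectors


ERASE_BLOCK = 8192  # hypothetical: 4MB in 512-byte sectors

# A 4-sector request ending 4 sectors short of the boundary can take
# 4 more sectors, but not 8.
print(may_merge(8184, 4, 4, ERASE_BLOCK))  # → True
print(may_merge(8184, 4, 8, ERASE_BLOCK))  # → False
```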

As this is of limited scope, I would prefer having this done via a
plugin of some sort (like a dm target).

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-03-01 19:15                                                 ` Jens Axboe
@ 2011-03-01 19:51                                                   ` Arnd Bergmann
  -1 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-03-01 19:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Andrei Warkentin, linux-arm-kernel, linux-fsdevel, Linus Walleij,
	linux-mmc

On Tuesday 01 March 2011 20:15:30 Jens Axboe wrote:
> Thanks for the recap. One way to handle this would be to have a dm
> target that ensures that requests are never built up to violate any of
> the above items. Doing splitting is a little silly, when you can prevent
> it from happening in the first place.

Ok, that sounds good. I didn't know that it's possible to prevent
bios from getting created that violate this.

I'm actually trying to do a device mapper target that does much more than
this, see
https://wiki.linaro.org/WorkingGroups/KernelConsolidation/Projects/FlashDeviceMapper
for an early draft. The design has moved on since I wrote that, but
the basic idea is still the same: all blocks get written in a way that
fills up entire 4MB segments before moving to another segment,
independent of what the logical block numbers are, and a little space
is used to store a lookup table for the logical-to-physical block mapping.
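
A toy model of that idea in Python, only to show the shape of the
mapping. This is not the actual design from the wiki page; SegmentLog
and its fields are hypothetical names, and garbage collection is left
out entirely.

```python
class SegmentLog:
    """Append-only block allocator: every write goes to the next free
    physical block, so segments fill up front to back regardless of
    the logical block number."""

    def __init__(self, num_segments, blocks_per_segment):
        self.blocks_per_segment = blocks_per_segment
        self.num_blocks = num_segments * blocks_per_segment
        self.next_physical = 0
        self.mapping = {}  # logical block -> physical block

    def write(self, logical):
        if self.next_physical >= self.num_blocks:
            raise RuntimeError("log full; garbage collection not modelled")
        physical = self.next_physical
        self.next_physical += 1
        # A rewrite simply repoints the logical block; the old
        # physical block becomes garbage.
        self.mapping[logical] = physical
        return physical

    def segment_of(self, logical):
        return self.mapping[logical] // self.blocks_per_segment


log = SegmentLog(num_segments=2, blocks_per_segment=4)
for lba in [100, 3, 57, 100, 8]:
    log.write(lba)
print(log.mapping[100], log.segment_of(8))  # → 3 1
```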

> Alternatively, a queue ->merge_bvec_fn() with a settings table could
> provide the same.

That's probably better for the common case. The device mapper target
would be useful for those that want the best-case write performance,
but if I understand you correctly, the merge_bvec_fn() could be used
per block driver, so we could simply add it to the SCSI (for USB and
consumer SSD) and MMC block drivers.

The point that this does not solve is submitting all outstanding writes
for an erase block together, which is needed to reduce the garbage
collection overhead. When you do a partial update of an erase block
(4MB typically) and then start writing to another erase block, the
drive will have to copy all data you did not write in order to free
up internal resources.
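
To illustrate what "submitting all outstanding writes for an erase
block together" would mean, here is a small Python sketch. It only
reorders a list of pending sector numbers; order_by_erase_block,
block_switches and the 8192-sector erase block are assumptions for
the example, not anything the I/O scheduler does today.

```python
from collections import defaultdict


def order_by_erase_block(sectors, erase_block_sectors=8192):
    """Group pending writes by erase block and emit each group in
    ascending sector order before moving to the next erase block."""
    groups = defaultdict(list)
    for sector in sectors:
        groups[sector // erase_block_sectors].append(sector)
    ordered = []
    for block in groups:  # erase blocks in first-seen order
        ordered.extend(sorted(groups[block]))
    return ordered


def block_switches(sectors, erase_block_sectors=8192):
    """Count how often consecutive writes land in different erase
    blocks; each switch may force the drive to garbage-collect."""
    blocks = [s // erase_block_sectors for s in sectors]
    return sum(1 for a, b in zip(blocks, blocks[1:]) if a != b)


pending = [10, 9000, 20, 9010, 30]
print(block_switches(pending))                        # → 4
print(order_by_erase_block(pending))                  # → [10, 20, 30, 9000, 9010]
print(block_switches(order_by_erase_block(pending)))  # → 1
```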

> As this is of limited scope, I would prefer having this done via a
> plugin of some sort (like a dm target).

I'm not sure what you mean by limited scope. This is certainly not
as important for the classic server environment (aside from USB boot
drives), but I assume that it is highly relevant for a large
portion of new embedded designs as people move from raw flash to
eMMC and similar "technologies".

	Arnd

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-03-01 19:51                                                   ` Arnd Bergmann
@ 2011-03-01 21:33                                                     ` Andrei Warkentin
  -1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-03-01 21:33 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Jens Axboe, linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc

On Tue, Mar 1, 2011 at 1:51 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tuesday 01 March 2011 20:15:30 Jens Axboe wrote:
>> Thanks for the recap. One way to handle this would be to have a dm
>> target that ensures that requests are never built up to violate any of
>> the above items. Doing splitting is a little silly, when you can prevent
>> it from happening in the first place.
>
> Ok, that sounds good. I didn't know that it's possible to prevent
> bios from getting created that violate this.
>

Wouldn't someone still be able to issue a generic_make_request that
violates the conditions (i.e. crosses an alignment boundary with an
unaligned write)? You could prevent the merges that would result in
violating the conditions, sure, but you would still need to handle
single unaligned accesses correctly... Sorry, I'm just groping my
way around the block layer; there is a lot I'm still trying to form
a mental picture of.

P.S. I've submitted for review the first 3 patches. Tear into them :).

A

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-03-01 19:11                                               ` Arnd Bergmann
@ 2011-03-02 10:34                                                 ` Andrei Warkentin
  -1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-03-02 10:34 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Jens Axboe, linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc

On Tue, Mar 1, 2011 at 1:11 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tuesday 01 March 2011 19:48:17 Jens Axboe wrote:
>>
>> On 2011-02-25 07:21, Arnd Bergmann wrote:
>> > On Friday 25 February 2011, Andrei Warkentin wrote:
>> >> Yup. I understand :-).  That's the strategy I'm going to follow. For
>> >> page_size-alignment/splitting I'm looking at the block layer now. Is
>> >> that the right approach or should I still submit a (cleaned up) patch
>> >> to mmc/card/block.c for that performance improvement.
>> >
>> > I guess it should live in block/cfq-iosched in the long run, but I don't
>> > know how easy it is to implement it there for test purposes.
>>
>> I don't think I saw the original patch(es) for this?
>
> Nobody has posted one yet, only discussions. Andrei made a patch for the
> MMC block driver to split requests in some cases, but I think the
> concept has changed enough that it's probably not useful to look at
> that patch.
>

Before the generic improvements are made to the block layer, I think
there is some value in implementing the (simpler) ones in the mmc block
code, as well as exposing an mmc block quirk interface that makes it
easy to add complex workarounds. Some things will never be able to stay
completely above the mmc block code. For example, when splitting up
smaller accesses, you need to be careful on the Toshiba card, since the
4th consecutive 8KB block results in the entire 32KB getting pushed into
the bigger 4MB buffer. On our platform, there are a lot of accesses in
the 16KB-32KB range which benefit from the splitting. The data collected
showed that splitting accesses larger than 32KB has an adverse effect on
performance (I guess that sort of makes sense; after all, why else would
the controller treat 4 consecutive 8KB accesses as a larger access and
handle it accordingly?). On the other hand, that data was collected on
code that used reliable write for every portion of the split access,
so I'm going to have to get some new data...

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-03-02 10:34                                                 ` Andrei Warkentin
@ 2011-03-05  9:23                                                   ` Andrei Warkentin
  -1 siblings, 0 replies; 117+ messages in thread
From: Andrei Warkentin @ 2011-03-05  9:23 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Jens Axboe, linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc

On Wed, Mar 2, 2011 at 4:34 AM, Andrei Warkentin <andreiw@motorola.com> wrote:
> Before the generic improvements are made to the block layer, I think
> there is some value
> in implementing the (simpler) ones in mmc block code, as well as
> expose an mmc block quirk interface by which its easy to add complex
> workarounds. Some things will never be able to completely stay above
> mmc block code, for example, when splitting up smaller accesses, you
> need to be careful on the Toshiba card, since the 4th consecutive 8KB
> block results in the entire 32KB getting pushed  into the bigger 4MB
> buffer. On our platform, there are a lot of accesses in the 16KB-32KB
> range which benefit from the splitting. Data collected showed
> splitting more than 32KB to have adverse effect on performance (I
> guess that sort of makes sense, after all, why else would the
> controller treat 4 consecutive 8KB accesses as a larger access and
> treat it accordingly?) On the other hand, that data was collected on
> code that used reliable write for every portion of the split access,
> so I'm going to have to get some new data...
>

Just want to correct myself: any consecutive write that exceeds 8KB
goes into the 4MB buffer.
Also, according to the vendor, there is no performance penalty for using
reliable write. This is why, in the patch set, when splitting larger
requests (to improve lifetime by reducing the number of AU write/erase
cycles), I perform a reliable write for each split block set.
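
For illustration only, here is a Python sketch of the kind of split
being discussed. It is not the code from the patch set. The function
name, the 8KB/32KB thresholds as parameters, and especially the choice
of issuing the chunks in reverse order (so that no chunk starts where
the previous one ended) are assumptions; whether reverse order is
enough to keep a given card out of the 4MB buffer would have to be
measured.

```python
def split_small_write(offset_kb, length_kb, chunk_kb=8, max_split_kb=32):
    """Return a list of (offset_kb, length_kb) writes to issue.

    Writes that already fit in one chunk, or that are larger than
    max_split_kb, are passed through unchanged. Anything in between
    is cut into chunk_kb pieces and issued last-chunk-first.
    Assumes offset_kb is already chunk-aligned.
    """
    if length_kb <= chunk_kb or length_kb > max_split_kb:
        return [(offset_kb, length_kb)]
    chunks = []
    pos = offset_kb
    end = offset_kb + length_kb
    while pos < end:
        size = min(chunk_kb, end - pos)
        chunks.append((pos, size))
        pos += size
    # Reverse so consecutive writes are never sequential on the card.
    chunks.reverse()
    return chunks


print(split_small_write(64, 24))  # → [(80, 8), (72, 8), (64, 8)]
print(split_small_write(0, 64))   # → [(0, 64)]
```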

^ permalink raw reply	[flat|nested] 117+ messages in thread

* MMC quirks relating to performance/lifetime.
  2011-02-11 14:51   ` Arnd Bergmann
  2011-02-11 15:20     ` Lei Wen
@ 2011-03-08  6:59     ` Pavel Machek
  2011-03-08 14:03         ` Arnd Bergmann
  1 sibling, 1 reply; 117+ messages in thread
From: Pavel Machek @ 2011-03-08  6:59 UTC (permalink / raw)
  To: linux-arm-kernel

Hi!

> > > I'm not sure if this is the best place to bring this up, but Russel's
> > > name is on a fair share of drivers/mmc code, and there does seem to be
> > > quite a bit of MMC-related discussions. Excuse me in advance if this
> > > isn't the right forum :-).
> > > 
> > > Certain MMC vendors (maybe even quite a bit of them) use a pretty
> > > rigid buffering scheme when it comes to handling writes. There is
> > > usually a buffer A for random accesses, and a buffer B for sequential
> > > accesses. For certain Toshiba parts, it looks like buffer A is 8KB
> > > wide, with buffer B being 4MB wide, and all accesses larger than 8KB
> > > effectively equating to 4MB accesses. Worse, consecutive small (8k)
> > > writes are treated as one large sequential access, once again ending
> > > up in buffer B, thus necessitating out-of-order writing to work around
> > > this.
> > 
> > Hmmmm, I somehow assumed MMCs would be much more clever than this.
> 
> No, these devices are incredibly stupid, or extremely optimized to
> a specific use case (writing large video files to FAT32), depending on how
> you look at them.
> 
> > > reorders) them? The thresholds would then be adjustable as
> > > module/kernel parameters based on manfid. I'm asking because I have a
> > > patch now, but its ugly and hardcoded against a specific manufacturer.
> > 
> > How big is performance difference?
> 
> Several orders of magnitude. It is very easy to get a card that can write
> 12 MB/s into a case where it writes no more than 30 KB/s, doing only
> things that happen frequently with ext3.

Ungood.

I guess we should create something like a loopback device, which knows
about flash specifics, and does the right coalescing so that the card
stays in the fast mode?

...or, do we need to create a new, simple filesystem with a layout similar
to fat32, for use on mmc cards?
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-03-08  6:59     ` Pavel Machek
@ 2011-03-08 14:03         ` Arnd Bergmann
  0 siblings, 0 replies; 117+ messages in thread
From: Arnd Bergmann @ 2011-03-08 14:03 UTC (permalink / raw)
  To: Pavel Machek; +Cc: linux-arm-kernel, Andrei Warkentin, linux-fsdevel, linux-mmc

On Tuesday 08 March 2011, Pavel Machek wrote:
> > > 
> > > How big is performance difference?
> > 
> > Several orders of magnitude. It is very easy to get a card that can write
> > 12 MB/s into a case where it writes no more than 30 KB/s, doing only
> > things that happen frequently with ext3.
> 
> Ungood.
> 
> I guess we should create something like loopback device, which knows
> about flash specifics, and does the right coalescing so that card
> stays in the fast mode?

I have listed a few suggestions for areas to work on in my article
at https://lwn.net/Articles/428584/. My idea was to use a device mapper
target, as described in https://wiki.linaro.org/WorkingGroups/KernelConsolidation/Projects/FlashDeviceMapper
but a loopback device might work as well.

The other area that I think will help a lot is to make the I/O
scheduler aware of the erase block size and the preferred access
patterns.
 
> ...or, do we need to create new, simple filesystem with layout similar
> to fat32, for use on mmc cards?

It doesn't need to be similar to fat32, but creating a new file system
could fix this, too. Microsoft seems to have built ExFAT around
cheap flash devices, though they don't document what that does exactly.
I think we can do better than that, and I still want to find out
how close nilfs2 and btrfs can actually get to the optimum.

Note that it's not just MMC cards though, you get the exact same
effects on some low-end SSDs (which are basically repackaged CF
cards) and most USB sticks. The best USB sticks I have seen
can hide some effects with a bit of caching, and they have a higher
number of open segments than the cheap ones, but the basic
problems are unchanged.

The requirements for a good low-end flash optimized file system
would be roughly:

1. Do all writes in chunks of 32 or 64 KB. If there is less
   data to write, fill the chunk with zeroes and clean up later,
   but don't write more data to the same chunk.
2. Start writing on a segment (e.g. 4 MB, configurable) boundary,
   then write that segment to the end using the chunks mentioned
   above.
3. Erase full segments using trim/erase/discard before writing
   to them, if supported by the drive.
4. Have a configurable number of segments open for writing, i.e.
   you have written blocks at the start of the segment but not
   filled the segment to the end. Typical hardware limitations
   are between 1 and 10 open segments.
5. Keep all metadata within a single 4 MB segment. Drives that cannot
   do random access within normal segments can do it in the area
   that holds the FAT. If 4 MB is not enough, the FAT area can be
   used as a journal or cache, for a larger metadata area that gets
   written less frequently.
6. Because of the requirement to erase 4 MB chunks at once, there
   needs to be garbage collection to free up space. The quality
   of the garbage collection algorithm directly relates to the
   performance on full file systems and/or the space overhead.
7. Some static wear levelling is required to increase the expected
   life of consumer devices that only do dynamic wear levelling,
   i.e. the segments that contain purely static data need to
   be written occasionally so they make it back into the
   wear levelling pool of the hardware.
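
A minimal Python sketch of how requirements 1, 2 and 4 might fit
together, purely as an illustration. SegmentWriter, pad_to_chunk and
the default sizes are made-up names and values; a real file system
would also need the metadata area, garbage collection and wear
levelling from the other items.

```python
CHUNK = 32 * 1024          # write granularity (requirement 1)
SEGMENT = 4 * 1024 * 1024  # segment size (requirement 2)


def pad_to_chunk(data, chunk=CHUNK):
    """Zero-fill data up to the next multiple of the chunk size."""
    remainder = len(data) % chunk
    if remainder:
        data += bytes(chunk - remainder)
    return data


class SegmentWriter:
    """Appends padded chunks to segments front to back and limits how
    many partially written segments exist at once (requirement 4)."""

    def __init__(self, max_open=4, segment=SEGMENT, chunk=CHUNK):
        self.max_open = max_open
        self.segment = segment
        self.chunk = chunk
        self.open = {}  # segment index -> bytes written so far

    def append(self, seg, data):
        """Return (device_offset, padded_length) for this write."""
        padded = pad_to_chunk(data, self.chunk)
        if seg not in self.open:
            if len(self.open) >= self.max_open:
                raise RuntimeError("too many open segments")
            self.open[seg] = 0
        used = self.open[seg]
        if used + len(padded) > self.segment:
            raise ValueError("write does not fit in this segment")
        offset = seg * self.segment + used
        used += len(padded)
        if used == self.segment:
            del self.open[seg]  # segment is full, no longer open
        else:
            self.open[seg] = used
        return offset, len(padded)


writer = SegmentWriter(max_open=1)
print(writer.append(0, b"x" * 100))    # → (0, 32768)
print(writer.append(0, b"y" * 40000))  # → (32768, 65536)
```

With max_open=1, a write to any other segment is refused until
segment 0 has been filled to the end.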

	Arnd

^ permalink raw reply	[flat|nested] 117+ messages in thread

end of thread, other threads:[~2011-03-08 14:03 UTC | newest]

Thread overview: 117+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-02-08 21:22 MMC quirks relating to performance/lifetime Andrei Warkentin
2011-02-08 21:38 ` Wolfram Sang
2011-02-08 21:38   ` Wolfram Sang
2011-02-08 22:42 ` Russell King - ARM Linux
2011-02-09  8:37 ` Linus Walleij
2011-02-09  8:37   ` Linus Walleij
2011-02-09  9:13   ` Arnd Bergmann
2011-02-09  9:13     ` Arnd Bergmann
2011-02-11 22:33     ` Andrei Warkentin
2011-02-11 22:33       ` Andrei Warkentin
2011-02-12 17:05       ` Arnd Bergmann
2011-02-12 17:05         ` Arnd Bergmann
2011-02-12 17:33         ` Andrei Warkentin
2011-02-12 17:33           ` Andrei Warkentin
2011-02-12 18:22           ` Arnd Bergmann
2011-02-12 18:22             ` Arnd Bergmann
2011-02-18  1:10       ` Andrei Warkentin
2011-02-18  1:10         ` Andrei Warkentin
2011-02-18 13:44         ` Arnd Bergmann
2011-02-18 13:44           ` Arnd Bergmann
2011-02-18 19:47           ` Andrei Warkentin
2011-02-18 19:47             ` Andrei Warkentin
2011-02-18 22:40             ` Andrei Warkentin
2011-02-18 22:40               ` Andrei Warkentin
2011-02-18 23:17               ` Andrei Warkentin
2011-02-18 23:17                 ` Andrei Warkentin
2011-02-19 11:20                 ` Arnd Bergmann
2011-02-19 11:20                   ` Arnd Bergmann
2011-02-20  5:56                   ` Andrei Warkentin
2011-02-20  5:56                     ` Andrei Warkentin
2011-02-20 15:23                     ` Arnd Bergmann
2011-02-20 15:23                       ` Arnd Bergmann
2011-02-22  7:05                       ` Andrei Warkentin
2011-02-22  7:05                         ` Andrei Warkentin
2011-02-22 16:49                         ` Arnd Bergmann
2011-02-22 16:49                           ` Arnd Bergmann
2011-02-19  9:54               ` Arnd Bergmann
2011-02-19  9:54                 ` Arnd Bergmann
2011-02-20  4:39                 ` Andrei Warkentin
2011-02-20  4:39                   ` Andrei Warkentin
2011-02-20 15:03                   ` Arnd Bergmann
2011-02-20 15:03                     ` Arnd Bergmann
2011-02-22  6:42                     ` Andrei Warkentin
2011-02-22  6:42                       ` Andrei Warkentin
2011-02-22 16:42                       ` Arnd Bergmann
2011-02-22 16:42                         ` Arnd Bergmann
2011-02-11 23:23     ` Linus Walleij
2011-02-11 23:23       ` Linus Walleij
2011-02-12 10:45       ` Arnd Bergmann
2011-02-12 10:45         ` Arnd Bergmann
2011-02-12 10:59         ` Russell King - ARM Linux
2011-02-12 10:59           ` Russell King - ARM Linux
2011-02-12 16:28           ` Arnd Bergmann
2011-02-12 16:28             ` Arnd Bergmann
2011-02-12 16:37             ` Russell King - ARM Linux
2011-02-12 16:37               ` Russell King - ARM Linux
2011-02-11 22:27   ` Andrei Warkentin
2011-02-11 22:27     ` Andrei Warkentin
2011-02-12 18:37     ` Arnd Bergmann
2011-02-12 18:37       ` Arnd Bergmann
2011-02-13  0:10       ` Andrei Warkentin
2011-02-13  0:10         ` Andrei Warkentin
2011-02-13 17:39         ` Arnd Bergmann
2011-02-13 17:39           ` Arnd Bergmann
2011-02-14 19:29           ` Andrei Warkentin
2011-02-14 19:29             ` Andrei Warkentin
2011-02-14 20:22             ` Arnd Bergmann
2011-02-14 20:22               ` Arnd Bergmann
2011-02-14 22:25               ` Andrei Warkentin
2011-02-14 22:25                 ` Andrei Warkentin
2011-02-15 17:16                 ` Arnd Bergmann
2011-02-15 17:16                   ` Arnd Bergmann
2011-02-17  2:08                   ` Andrei Warkentin
2011-02-17  2:08                     ` Andrei Warkentin
2011-02-17 15:47                     ` Arnd Bergmann
2011-02-17 15:47                       ` Arnd Bergmann
2011-02-20 11:27                       ` Andrei Warkentin
2011-02-20 11:27                         ` Andrei Warkentin
2011-02-20 14:39                         ` Arnd Bergmann
2011-02-20 14:39                           ` Arnd Bergmann
2011-02-22  7:46                           ` Andrei Warkentin
2011-02-22  7:46                             ` Andrei Warkentin
2011-02-22 17:00                             ` Arnd Bergmann
2011-02-22 17:00                               ` Arnd Bergmann
2011-02-23 10:19                               ` Andrei Warkentin
2011-02-23 10:19                                 ` Andrei Warkentin
2011-02-23 16:09                                 ` Arnd Bergmann
2011-02-23 16:09                                   ` Arnd Bergmann
2011-02-23 22:26                                   ` Andrei Warkentin
2011-02-23 22:26                                     ` Andrei Warkentin
2011-02-24  9:24                                     ` Arnd Bergmann
2011-02-24  9:24                                       ` Arnd Bergmann
2011-02-25 11:02                                       ` Andrei Warkentin
2011-02-25 11:02                                         ` Andrei Warkentin
2011-02-25 12:21                                         ` Arnd Bergmann
2011-02-25 12:21                                           ` Arnd Bergmann
2011-03-01 18:48                                           ` Jens Axboe
2011-03-01 18:48                                             ` Jens Axboe
2011-03-01 19:11                                             ` Arnd Bergmann
2011-03-01 19:11                                               ` Arnd Bergmann
2011-03-01 19:15                                               ` Jens Axboe
2011-03-01 19:15                                                 ` Jens Axboe
2011-03-01 19:51                                                 ` Arnd Bergmann
2011-03-01 19:51                                                   ` Arnd Bergmann
2011-03-01 21:33                                                   ` Andrei Warkentin
2011-03-01 21:33                                                     ` Andrei Warkentin
2011-03-02 10:34                                               ` Andrei Warkentin
2011-03-02 10:34                                                 ` Andrei Warkentin
2011-03-05  9:23                                                 ` Andrei Warkentin
2011-03-05  9:23                                                   ` Andrei Warkentin
2011-02-11 14:41 ` Pavel Machek
2011-02-11 14:51   ` Arnd Bergmann
2011-02-11 15:20     ` Lei Wen
2011-02-11 15:25       ` Arnd Bergmann
2011-03-08  6:59     ` Pavel Machek
2011-03-08 14:03       ` Arnd Bergmann
2011-03-08 14:03         ` Arnd Bergmann
