* slow eMMC write speed
@ 2011-09-23  5:05 Praveen G K
  2011-09-28  5:42 ` Linus Walleij
  0 siblings, 1 reply; 32+ messages in thread
From: Praveen G K @ 2011-09-23  5:05 UTC (permalink / raw)
  To: linux-mmc

Hello all,

I am working on the block driver module of the eMMC driver (SDIO 3.0
controller).  I am seeing very low write speeds for eMMC transfers.  On
further debugging, I observed that every 63rd and 64th transfer takes
a long time.  I have tabulated this by noting the time interval
between when CMD25 (the multi-block write command) is sent and when
the DATA_END interrupt is received.  In normal cases, it takes about
2.5ms to transfer 64kB (128 blocks) of data.  But in the really slow
cases, it takes about 250-350ms to get the DATA_END interrupt after
the multi-block write command is sent, and in some cases even longer.
This radically reduces the throughput of the transfers.

I have also enabled the bounce buffers.  As a side question, can
somebody please tell me why the bounce buffer is restricted to 64kB?
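For reference, the limit I am asking about is the hard-coded constant
in drivers/mmc/card/queue.c (quoting from memory, so the surrounding
code may differ slightly):

    #if defined(CONFIG_MMC_BLOCK_BOUNCE)
    #define MMC_QUEUE_BOUNCESZ	65536
    #endif
    ...
    mq->bounce_buf = kmalloc(bouncesz, GFP_KERNEL);
    ...
    blk_queue_max_hw_sectors(mq->queue, bouncesz / 512);

Is the 64kB cap just because the bounce buffer has to be a single
physically contiguous kmalloc() allocation, or is there more to it?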

Thanks in advance for any help you can provide on this issue.  Also,
please include my email address in the Cc field when replying, since I
am not subscribed to this mailing list.

Thanks,
Praveen Krishnan


* Re: slow eMMC write speed
  2011-09-23  5:05 slow eMMC write speed Praveen G K
@ 2011-09-28  5:42 ` Linus Walleij
  2011-09-28 19:06   ` Praveen G K
  0 siblings, 1 reply; 32+ messages in thread
From: Linus Walleij @ 2011-09-28  5:42 UTC (permalink / raw)
  To: Praveen G K; +Cc: linux-mmc

On Fri, Sep 23, 2011 at 7:05 AM, Praveen G K <praveen.gk@gmail.com> wrote:

> I am working on the block driver module of the eMMC driver (SDIO 3.0
> controller).  I am seeing very low write speed for eMMC transfers.  On
> further debugging, I observed that every 63rd and 64th transfer takes
> a long time.

Are you not just seeing the card-internal garbage collection?
http://lwn.net/Articles/428584/

Yours,
Linus Walleij


* Re: slow eMMC write speed
  2011-09-28  5:42 ` Linus Walleij
@ 2011-09-28 19:06   ` Praveen G K
  2011-09-28 19:59     ` J Freyensee
  0 siblings, 1 reply; 32+ messages in thread
From: Praveen G K @ 2011-09-28 19:06 UTC (permalink / raw)
  To: Linus Walleij; +Cc: linux-mmc

On Tue, Sep 27, 2011 at 10:42 PM, Linus Walleij
<linus.walleij@linaro.org> wrote:
> On Fri, Sep 23, 2011 at 7:05 AM, Praveen G K <praveen.gk@gmail.com> wrote:
>
>> I am working on the block driver module of the eMMC driver (SDIO 3.0
>> controller).  I am seeing very low write speed for eMMC transfers.  On
>> further debugging, I observed that every 63rd and 64th transfer takes
>> a long time.
>
> Are you not just seeing the card-internal garbage collection?
> http://lwn.net/Articles/428584/
Does this mean that, theoretically, I should be able to achieve higher
speeds if I were not using Linux?  So, is this a software issue?  Or is
there a way to increase the size of the bounce buffers to 4MB?

> Yours,
> Linus Walleij
>


* Re: slow eMMC write speed
  2011-09-28 19:06   ` Praveen G K
@ 2011-09-28 19:59     ` J Freyensee
  2011-09-28 20:34       ` Praveen G K
  0 siblings, 1 reply; 32+ messages in thread
From: J Freyensee @ 2011-09-28 19:59 UTC (permalink / raw)
  To: Praveen G K; +Cc: Linus Walleij, linux-mmc

On 09/28/2011 12:06 PM, Praveen G K wrote:
> On Tue, Sep 27, 2011 at 10:42 PM, Linus Walleij
> <linus.walleij@linaro.org>  wrote:
>> On Fri, Sep 23, 2011 at 7:05 AM, Praveen G K<praveen.gk@gmail.com>  wrote:
>>
>>> I am working on the block driver module of the eMMC driver (SDIO 3.0
>>> controller).  I am seeing very low write speed for eMMC transfers.  On
>>> further debugging, I observed that every 63rd and 64th transfer takes
>>> a long time.
>>
>> Are you not just seeing the card-internal garbage collection?
>> http://lwn.net/Articles/428584/
> Does this mean, theoretically, I should be able to achieve larger
> speeds if I am not using linux?

In theory, in a fairy-tale world, maybe; in reality, not really.  In
R/W performance measurements we have done, eMMC performance in products
users would buy falls well, well short of any theoretical numbers.  We
believe that, in theory, the eMMC interface should be able to support
up to 100MB/s, but in reality, on real customer platforms, write
bandwidths (for example) barely approach 20MB/s, regardless of whether
it's a Microsoft Windows environment or Android (the Linux OS
environment we care about).  So maybe it is software implementation
issues across multiple OSes that prevent higher eMMC performance
numbers (hence the reason I sometimes ask basic coding questions about
the MMC subsystem - the code isn't the easiest to follow); however, one
need look no further than what Apple has done with the iPad2 to see
that eMMC probably just is not a good solution to use in the first
place.  We have measured Apple's iPad2 write performance, on *WHAT A
USER WOULD SEE*, at double what we see with products using eMMC
solutions.  The big difference?  Apple doesn't use eMMC at all for the
iPad2.

> So, is this a software issue?  Or is there a way to increase the size
> of bounce buffers to 4MB?



>> Yours,
>> Linus Walleij
>>


-- 
J (James/Jay) Freyensee
Storage Technology Group
Intel Corporation


* Re: slow eMMC write speed
  2011-09-28 19:59     ` J Freyensee
@ 2011-09-28 20:34       ` Praveen G K
  2011-09-28 21:01         ` J Freyensee
  2011-09-29  7:05         ` Linus Walleij
  0 siblings, 2 replies; 32+ messages in thread
From: Praveen G K @ 2011-09-28 20:34 UTC (permalink / raw)
  To: J Freyensee; +Cc: Linus Walleij, linux-mmc

On Wed, Sep 28, 2011 at 12:59 PM, J Freyensee
<james_p_freyensee@linux.intel.com> wrote:
> On 09/28/2011 12:06 PM, Praveen G K wrote:
>>
>> On Tue, Sep 27, 2011 at 10:42 PM, Linus Walleij
>> <linus.walleij@linaro.org>  wrote:
>>>
>>> On Fri, Sep 23, 2011 at 7:05 AM, Praveen G K<praveen.gk@gmail.com>
>>>  wrote:
>>>
>>>> I am working on the block driver module of the eMMC driver (SDIO 3.0
>>>> controller).  I am seeing very low write speed for eMMC transfers.  On
>>>> further debugging, I observed that every 63rd and 64th transfer takes
>>>> a long time.
>>>
>>> Are you not just seeing the card-internal garbage collection?
>>> http://lwn.net/Articles/428584/
>>
>> Does this mean, theoretically, I should be able to achieve larger
>> speeds if I am not using linux?
>
> In theory in a fairy-tale world, maybe, in reality, not really.  In R/W
> performance measurements we have done, eMMC performance in products users
> would buy falls well, well short of any theoretical numbers.  We believe in
> theory, the eMMC interface should be able to support up to 100MB/s, but in
> reality on real customer platforms write bandwidths (for example) barely
> approach 20MB/s, regardless if it's a Microsoft Windows environment or
> Android (Linux OS environment we care about).  So maybe it is software
> implementation issues of multiple OSs preventing higher eMMC performance
> numbers (hence the reason why I sometimes ask basic coding questions of the
> MMC subsystem- the code isn't the easiest to follow); however, one looks no
> further than what Apple has done with the iPad2 to see that eMMC probably
> just is not a good solution to use in the first place.  We have measured
> Apple's iPad2 write performance on *WHAT A USER WOULD SEE* being double what
> we see with products using eMMC solutions. The big difference?  Apple
> doesn't use eMMC at all for the iPad2.

Thanks for all the clarification.  The problem is that I am seeing
write speeds of about 5MBps on a SanDisk eMMC product, and I can
clearly see the time lost when measured between sending a command and
receiving the data irq.  I am not sure what kind of an issue this is.
5MBps feels really slow, but can the internal housekeeping of the card
take so much time?  I mean, for the usual transfers it takes about 3ms
to transfer 64kB of data, but for the 63rd and 64th transfers, it takes
250 ms.  The thing is, this is not on a file system.  I am measuring
the speed using a basic "dd" command to write directly to the block
device.
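(For reference, the measurement was along these lines - device name
illustrative, not the actual one:

    dd if=/dev/zero of=/dev/mmcblk0 bs=64k count=2048

with the speed computed from the total transfer time.)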

>> So, is this a software issue?  Or is there a way to increase the
>> size of bounce buffers to 4MB?
>
>
>
>>> Yours,
>>> Linus Walleij
>>>
>
>
> --
> J (James/Jay) Freyensee
> Storage Technology Group
> Intel Corporation
>


* Re: slow eMMC write speed
  2011-09-28 20:34       ` Praveen G K
@ 2011-09-28 21:01         ` J Freyensee
  2011-09-28 21:03           ` Praveen G K
  2011-09-29  7:05         ` Linus Walleij
  1 sibling, 1 reply; 32+ messages in thread
From: J Freyensee @ 2011-09-28 21:01 UTC (permalink / raw)
  To: Praveen G K; +Cc: Linus Walleij, linux-mmc

On 09/28/2011 01:34 PM, Praveen G K wrote:
> On Wed, Sep 28, 2011 at 12:59 PM, J Freyensee
> <james_p_freyensee@linux.intel.com>  wrote:
>> On 09/28/2011 12:06 PM, Praveen G K wrote:
>>>
>>> On Tue, Sep 27, 2011 at 10:42 PM, Linus Walleij
>>> <linus.walleij@linaro.org>    wrote:
>>>>
>>>> On Fri, Sep 23, 2011 at 7:05 AM, Praveen G K<praveen.gk@gmail.com>
>>>>   wrote:
>>>>
>>>>> I am working on the block driver module of the eMMC driver (SDIO 3.0
>>>>> controller).  I am seeing very low write speed for eMMC transfers.  On
>>>>> further debugging, I observed that every 63rd and 64th transfer takes
>>>>> a long time.
>>>>
>>>> Are you not just seeing the card-internal garbage collection?
>>>> http://lwn.net/Articles/428584/
>>>
>>> Does this mean, theoretically, I should be able to achieve larger
>>> speeds if I am not using linux?
>>
>> In theory in a fairy-tale world, maybe, in reality, not really.  In R/W
>> performance measurements we have done, eMMC performance in products users
>> would buy falls well, well short of any theoretical numbers.  We believe in
>> theory, the eMMC interface should be able to support up to 100MB/s, but in
>> reality on real customer platforms write bandwidths (for example) barely
>> approach 20MB/s, regardless if it's a Microsoft Windows environment or
>> Android (Linux OS environment we care about).  So maybe it is software
>> implementation issues of multiple OSs preventing higher eMMC performance
>> numbers (hence the reason why I sometimes ask basic coding questions of the
>> MMC subsystem- the code isn't the easiest to follow); however, one looks no
>> further than what Apple has done with the iPad2 to see that eMMC probably
>> just is not a good solution to use in the first place.  We have measured
>> Apple's iPad2 write performance on *WHAT A USER WOULD SEE* being double what
>> we see with products using eMMC solutions. The big difference?  Apple
>> doesn't use eMMC at all for the iPad2.
>
> Thanks for all the clarification.  The problem is I am seeing write
> speeds of about 5MBps on a Sandisk eMMC product and I can clearly see
> the time lost when measured between sending a command and receiving a
> data irq.  I am not sure what kind of an issue this is.  5MBps feels
> really slow but can the internal housekeeping of the card take so much
> time?

Have you tried to trace through all the structs used for an MMC
operation??!  Good gravy, there are request, mmc_queue, mmc_card,
mmc_host, mmc_blk_request, mmc_request, multiple mmc_commands, and
multiple scatterlists that these other structs use... I've been playing
around with trying to cache some things to improve performance, and it
blows me away how many variables and pointers I have to keep track of
for one operation going to an LBA on an MMC.  I keep wondering whether
more of the 'struct request' could have been used, and a third of these
structures eliminated.  Another thing I wonder is how much of this
infrastructure is really needed; when I ask a "what is this for?"
question on the list and no one responds, I wonder whether anyone else
understands if it's needed either.
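To give a flavor of it, the nesting looks roughly like this in the
2.6.3x-era drivers/mmc/card/block.c (from memory, illustrative only):

    struct mmc_blk_request {
    	struct mmc_request	mrq;	/* what mmc_wait_for_req() takes */
    	struct mmc_command	cmd;	/* e.g. CMD25 WRITE_MULTIPLE_BLOCK */
    	struct mmc_command	stop;	/* CMD12 STOP_TRANSMISSION */
    	struct mmc_data		data;	/* blksz/blocks + scatterlist */
    };

    /* and the driver wires the pointers up by hand for every request: */
    brq.mrq.cmd = &brq.cmd;
    brq.mrq.data = &brq.data;
    brq.data.sg = mq->sg;	/* or the bounce-buffer scatterlist */

...and that is before you even get to mmc_queue, mmc_card, and
mmc_host.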

> I mean, for the usual transfers it takes about 3ms to transfer
> 64kB of data, but for the 63rd and 64th transfers, it takes 250 ms.
> The thing is, this is not on a file system.  I am measuring the speed
> using a basic "dd" command to write directly to the block device.
>
>>> So, is this a software issue?  Or is there a way to increase the
>>> size of bounce buffers to 4MB?
>>
>>
>>
>>>> Yours,
>>>> Linus Walleij
>>>>
>>
>>
>> --
>> J (James/Jay) Freyensee
>> Storage Technology Group
>> Intel Corporation
>>


-- 
J (James/Jay) Freyensee
Storage Technology Group
Intel Corporation


* Re: slow eMMC write speed
  2011-09-28 21:01         ` J Freyensee
@ 2011-09-28 21:03           ` Praveen G K
  2011-09-28 21:34             ` J Freyensee
  0 siblings, 1 reply; 32+ messages in thread
From: Praveen G K @ 2011-09-28 21:03 UTC (permalink / raw)
  To: J Freyensee; +Cc: Linus Walleij, linux-mmc

On Wed, Sep 28, 2011 at 2:01 PM, J Freyensee
<james_p_freyensee@linux.intel.com> wrote:
> On 09/28/2011 01:34 PM, Praveen G K wrote:
>>
>> On Wed, Sep 28, 2011 at 12:59 PM, J Freyensee
>> <james_p_freyensee@linux.intel.com>  wrote:
>>>
>>> On 09/28/2011 12:06 PM, Praveen G K wrote:
>>>>
>>>> On Tue, Sep 27, 2011 at 10:42 PM, Linus Walleij
>>>> <linus.walleij@linaro.org>    wrote:
>>>>>
>>>>> On Fri, Sep 23, 2011 at 7:05 AM, Praveen G K<praveen.gk@gmail.com>
>>>>>  wrote:
>>>>>
>>>>>> I am working on the block driver module of the eMMC driver (SDIO 3.0
>>>>>> controller).  I am seeing very low write speed for eMMC transfers.  On
>>>>>> further debugging, I observed that every 63rd and 64th transfer takes
>>>>>> a long time.
>>>>>
>>>>> Are you not just seeing the card-internal garbage collection?
>>>>> http://lwn.net/Articles/428584/
>>>>
>>>> Does this mean, theoretically, I should be able to achieve larger
>>>> speeds if I am not using linux?
>>>
>>> In theory in a fairy-tale world, maybe, in reality, not really.  In R/W
>>> performance measurements we have done, eMMC performance in products users
>>> would buy falls well, well short of any theoretical numbers.  We believe
>>> in
>>> theory, the eMMC interface should be able to support up to 100MB/s, but
>>> in
>>> reality on real customer platforms write bandwidths (for example) barely
>>> approach 20MB/s, regardless if it's a Microsoft Windows environment or
>>> Android (Linux OS environment we care about).  So maybe it is software
>>> implementation issues of multiple OSs preventing higher eMMC performance
>>> numbers (hence the reason why I sometimes ask basic coding questions of
>>> the
>>> MMC subsystem- the code isn't the easiest to follow); however, one looks
>>> no
>>> further than what Apple has done with the iPad2 to see that eMMC probably
>>> just is not a good solution to use in the first place.  We have measured
>>> Apple's iPad2 write performance on *WHAT A USER WOULD SEE* being double
>>> what
>>> we see with products using eMMC solutions. The big difference?  Apple
>>> doesn't use eMMC at all for the iPad2.
>>
>> Thanks for all the clarification.  The problem is I am seeing write
>> speeds of about 5MBps on a Sandisk eMMC product and I can clearly see
>> the time lost when measured between sending a command and receiving a
>> data irq.  I am not sure what kind of an issue this is.  5MBps feels
>> really slow but can the internal housekeeping of the card take so much
>> time?
>
> Have you tried to trace through all structs used for an MMC operation??!
>  Good gravy, there are request, mmc_queue, mmc_card, mmc_host,
> mmc_blk_request, mmc_request, multiple mmc_command and multiple scatterlists
> that these other structs use...I've been playing around on trying to cache
> some things to try and improve performance and it blows me away how many
> variables and pointers I have to keep track of for one operation going to an
> LBA on an MMC.  I keep wondering if more of the 'struct request' could have
> been used, and 1/3 of these structures could be eliminated.  And another
> thing I wonder too is how much of this infrastructure is really needed, that
> when I do ask "what is this for?" question on the list and no one responds,
> if anyone else understands if it's needed either.

I know I am not using the scatterlists, since the scatterlists are
aggregated into a 64k bounce buffer.  Regarding the different structs,
I am just taking them at face value, assuming everything works "well".
But my concern is why it takes such a long time (250 ms) to return a
transfer-complete interrupt in occasional cases.  During this time, the
kernel is just waiting for the txfer_complete interrupt.  That's it.

>> I mean, for the usual transfers it takes about 3ms to transfer
>> 64kB of data, but for the 63rd and 64th transfers, it takes 250 ms.
>> The thing is, this is not on a file system.  I am measuring the speed
>> using a basic "dd" command to write directly to the block device.
>>
>>>> So, is this a software issue?  Or is there a way to increase the
>>>> size of bounce buffers to 4MB?
>>>
>>>
>>>
>>>>> Yours,
>>>>> Linus Walleij
>>>>>
>>>
>>>
>>> --
>>> J (James/Jay) Freyensee
>>> Storage Technology Group
>>> Intel Corporation
>>>
>
>
> --
> J (James/Jay) Freyensee
> Storage Technology Group
> Intel Corporation
>


* Re: slow eMMC write speed
  2011-09-28 21:03           ` Praveen G K
@ 2011-09-28 21:34             ` J Freyensee
  2011-09-28 22:24               ` Praveen G K
  2011-09-29  7:24               ` Linus Walleij
  0 siblings, 2 replies; 32+ messages in thread
From: J Freyensee @ 2011-09-28 21:34 UTC (permalink / raw)
  To: Praveen G K; +Cc: Linus Walleij, linux-mmc

On 09/28/2011 02:03 PM, Praveen G K wrote:
> On Wed, Sep 28, 2011 at 2:01 PM, J Freyensee
> <james_p_freyensee@linux.intel.com>  wrote:
>> On 09/28/2011 01:34 PM, Praveen G K wrote:
>>>
>>> On Wed, Sep 28, 2011 at 12:59 PM, J Freyensee
>>> <james_p_freyensee@linux.intel.com>    wrote:
>>>>
>>>> On 09/28/2011 12:06 PM, Praveen G K wrote:
>>>>>
>>>>> On Tue, Sep 27, 2011 at 10:42 PM, Linus Walleij
>>>>> <linus.walleij@linaro.org>      wrote:
>>>>>>
>>>>>> On Fri, Sep 23, 2011 at 7:05 AM, Praveen G K<praveen.gk@gmail.com>
>>>>>>   wrote:
>>>>>>
>>>>>>> I am working on the block driver module of the eMMC driver (SDIO 3.0
>>>>>>> controller).  I am seeing very low write speed for eMMC transfers.  On
>>>>>>> further debugging, I observed that every 63rd and 64th transfer takes
>>>>>>> a long time.
>>>>>>
>>>>>> Are you not just seeing the card-internal garbage collection?
>>>>>> http://lwn.net/Articles/428584/
>>>>>
>>>>> Does this mean, theoretically, I should be able to achieve larger
>>>>> speeds if I am not using linux?
>>>>
>>>> In theory in a fairy-tale world, maybe, in reality, not really.  In R/W
>>>> performance measurements we have done, eMMC performance in products users
>>>> would buy falls well, well short of any theoretical numbers.  We believe
>>>> in
>>>> theory, the eMMC interface should be able to support up to 100MB/s, but
>>>> in
>>>> reality on real customer platforms write bandwidths (for example) barely
>>>> approach 20MB/s, regardless if it's a Microsoft Windows environment or
>>>> Android (Linux OS environment we care about).  So maybe it is software
>>>> implementation issues of multiple OSs preventing higher eMMC performance
>>>> numbers (hence the reason why I sometimes ask basic coding questions of
>>>> the
>>>> MMC subsystem- the code isn't the easiest to follow); however, one looks
>>>> no
>>>> further than what Apple has done with the iPad2 to see that eMMC probably
>>>> just is not a good solution to use in the first place.  We have measured
>>>> Apple's iPad2 write performance on *WHAT A USER WOULD SEE* being double
>>>> what
>>>> we see with products using eMMC solutions. The big difference?  Apple
>>>> doesn't use eMMC at all for the iPad2.
>>>
>>> Thanks for all the clarification.  The problem is I am seeing write
>>> speeds of about 5MBps on a Sandisk eMMC product and I can clearly see
>>> the time lost when measured between sending a command and receiving a
>>> data irq.  I am not sure what kind of an issue this is.  5MBps feels
>>> really slow but can the internal housekeeping of the card take so much
>>> time?
>>
>> Have you tried to trace through all structs used for an MMC operation??!
>>   Good gravy, there are request, mmc_queue, mmc_card, mmc_host,
>> mmc_blk_request, mmc_request, multiple mmc_command and multiple scatterlists
>> that these other structs use...I've been playing around on trying to cache
>> some things to try and improve performance and it blows me away how many
>> variables and pointers I have to keep track of for one operation going to an
>> LBA on an MMC.  I keep wondering if more of the 'struct request' could have
>> been used, and 1/3 of these structures could be eliminated.  And another
>> thing I wonder too is how much of this infrastructure is really needed, that
>> when I do ask "what is this for?" question on the list and no one responds,
>> if anyone else understands if it's needed either.
>
> I know I am not using the scatterlists, since the scatterlists are
> aggregated into a 64k bounce buffer.  Regarding the different structs,
> I am just taking them on face value assuming everything works "well".
> But, my concern is why does it take such a long time (250 ms) to
> return a transfer complete interrupt on occasional cases.  During this
> time, the kernel is just waiting for the txfer_complete interrupt.
> That's it.

I think one fundamental problem with execution of the MMC commands is
that even though the MMC has its own structures and its own
DMA/host-controller, the OS's block subsystem and the MMC subsystem do
not really run independently of each other; each is still tied to the
other's fate, holding up performance of the kernel in general.

In particular, I have found in the 2.6.36+ kernels that the sooner you
can retire the 'struct request *req' (i.e. using __blk_end_request())
with respect to when the mmc_wait_for_req() call is made, the higher
performance you are going to get out of the OS in terms of reads/writes
using an MMC.  mmc_wait_for_req() is a blocking call, so the OS 'struct
request *req' will just sit around and do nothing until
mmc_wait_for_req() is done.  I have been able to do some caching of
some commands, calling __blk_end_request() before mmc_wait_for_req(),
and have gotten much higher performance in a few experiments (but the
work certainly is not ready for prime time).
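Roughly, the reordering I am describing is this (an experimental sketch
only - it assumes the request can safely be declared complete to the
block layer before the card actually finishes, which is exactly the
part that is not ready for prime time):

    /* stock order: block until the card finishes, then retire req */
    mmc_wait_for_req(card->host, &brq.mrq);
    spin_lock_irq(&md->lock);
    __blk_end_request(req, 0, brq.data.bytes_xfered);
    spin_unlock_irq(&md->lock);

    /* experiment: retire req first so the block layer can move on */
    spin_lock_irq(&md->lock);
    __blk_end_request(req, 0, blk_rq_bytes(req));
    spin_unlock_irq(&md->lock);
    mmc_wait_for_req(card->host, &brq.mrq);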

Now, in the 3.0 kernel, I know mmc_wait_for_req() has changed, and the
goal was to make that function a bit more non-blocking, but I have not
played with it much because my current focus is on existing products,
and no handheld product uses a 3.0 kernel yet (that I am aware of, at
least).  However, I still see the fundamental problem: the MMC stack,
which was probably written with the intent of being independent of the
OS block subsystem (struct request and other stuff), really isn't
independent of it, and the two cause holdups between one another,
thereby dragging down read/write performance of the MMC.

The other fundamental problem is the writes themselves.  Way, WAY more
writes occur on a handheld system in an end-user's hands than reads.
A fundamental computing principle says you make the common case fast,
so the focus should be on completing a write operation the fastest way
possible.

>
>>> I mean, for the usual transfers it takes about 3ms to transfer
>>> 64kB of data, but for the 63rd and 64th transfers, it takes 250 ms.
>>> The thing is, this is not on a file system.  I am measuring the
>>> speed using a basic "dd" command to write directly to the block
>>> device.
>>>
>>>>> So, is this a software issue?  Or is there a way to increase the
>>>>> size of bounce buffers to 4MB?
>>>>
>>>>
>>>>
>>>>>> Yours,
>>>>>> Linus Walleij
>>>>>>
>>>>
>>>>
>>>> --
>>>> J (James/Jay) Freyensee
>>>> Storage Technology Group
>>>> Intel Corporation
>>>>
>>
>>
>> --
>> J (James/Jay) Freyensee
>> Storage Technology Group
>> Intel Corporation
>>


-- 
J (James/Jay) Freyensee
Storage Technology Group
Intel Corporation


* Re: slow eMMC write speed
  2011-09-28 21:34             ` J Freyensee
@ 2011-09-28 22:24               ` Praveen G K
  2011-09-28 22:59                 ` J Freyensee
  2011-09-29  7:24               ` Linus Walleij
  1 sibling, 1 reply; 32+ messages in thread
From: Praveen G K @ 2011-09-28 22:24 UTC (permalink / raw)
  To: J Freyensee; +Cc: Linus Walleij, linux-mmc

On Wed, Sep 28, 2011 at 2:34 PM, J Freyensee
<james_p_freyensee@linux.intel.com> wrote:
> On 09/28/2011 02:03 PM, Praveen G K wrote:
>>
>> On Wed, Sep 28, 2011 at 2:01 PM, J Freyensee
>> <james_p_freyensee@linux.intel.com>  wrote:
>>>
>>> On 09/28/2011 01:34 PM, Praveen G K wrote:
>>>>
>>>> On Wed, Sep 28, 2011 at 12:59 PM, J Freyensee
>>>> <james_p_freyensee@linux.intel.com>    wrote:
>>>>>
>>>>> On 09/28/2011 12:06 PM, Praveen G K wrote:
>>>>>>
>>>>>> On Tue, Sep 27, 2011 at 10:42 PM, Linus Walleij
>>>>>> <linus.walleij@linaro.org>      wrote:
>>>>>>>
>>>>>>> On Fri, Sep 23, 2011 at 7:05 AM, Praveen G K<praveen.gk@gmail.com>
>>>>>>>  wrote:
>>>>>>>
>>>>>>>> I am working on the block driver module of the eMMC driver (SDIO 3.0
>>>>>>>> controller).  I am seeing very low write speed for eMMC transfers.
>>>>>>>>  On
>>>>>>>> further debugging, I observed that every 63rd and 64th transfer
>>>>>>>> takes
>>>>>>>> a long time.
>>>>>>>
>>>>>>> Are you not just seeing the card-internal garbage collection?
>>>>>>> http://lwn.net/Articles/428584/
>>>>>>
>>>>>> Does this mean, theoretically, I should be able to achieve larger
>>>>>> speeds if I am not using linux?
>>>>>
>>>>> In theory in a fairy-tale world, maybe, in reality, not really.  In R/W
>>>>> performance measurements we have done, eMMC performance in products
>>>>> users
>>>>> would buy falls well, well short of any theoretical numbers.  We
>>>>> believe
>>>>> in
>>>>> theory, the eMMC interface should be able to support up to 100MB/s, but
>>>>> in
>>>>> reality on real customer platforms write bandwidths (for example)
>>>>> barely
>>>>> approach 20MB/s, regardless if it's a Microsoft Windows environment or
>>>>> Android (Linux OS environment we care about).  So maybe it is software
>>>>> implementation issues of multiple OSs preventing higher eMMC
>>>>> performance
>>>>> numbers (hence the reason why I sometimes ask basic coding questions of
>>>>> the
>>>>> MMC subsystem- the code isn't the easiest to follow); however, one
>>>>> looks
>>>>> no
>>>>> further than what Apple has done with the iPad2 to see that eMMC
>>>>> probably
>>>>> just is not a good solution to use in the first place.  We have
>>>>> measured
>>>>> Apple's iPad2 write performance on *WHAT A USER WOULD SEE* being double
>>>>> what
>>>>> we see with products using eMMC solutions. The big difference?  Apple
>>>>> doesn't use eMMC at all for the iPad2.
>>>>
>>>> Thanks for all the clarification.  The problem is I am seeing write
>>>> speeds of about 5MBps on a Sandisk eMMC product and I can clearly see
>>>> the time lost when measured between sending a command and receiving a
>>>> data irq.  I am not sure what kind of an issue this is.  5MBps feels
>>>> really slow but can the internal housekeeping of the card take so much
>>>> time?
>>>
>>> Have you tried to trace through all structs used for an MMC operation??!
>>>  Good gravy, there are request, mmc_queue, mmc_card, mmc_host,
>>> mmc_blk_request, mmc_request, multiple mmc_command and multiple
>>> scatterlists
>>> that these other structs use...I've been playing around on trying to
>>> cache
>>> some things to try and improve performance and it blows me away how many
>>> variables and pointers I have to keep track of for one operation going to
>>> an
>>> LBA on an MMC.  I keep wondering if more of the 'struct request' could
>>> have
>>> been used, and 1/3 of these structures could be eliminated.  And another
>>> thing I wonder too is how much of this infrastructure is really needed,
>>> that
>>> when I do ask "what is this for?" question on the list and no one
>>> responds,
>>> if anyone else understands if it's needed either.
>>
>> I know I am not using the scatterlists, since the scatterlists are
>> aggregated into a 64k bounce buffer.  Regarding the different structs,
>> I am just taking them on face value assuming everything works "well".
>> But, my concern is why does it take such a long time (250 ms) to
>> return a transfer complete interrupt on occasional cases.  During this
>> time, the kernel is just waiting for the txfer_complete interrupt.
>> That's it.
>
> I think one fundamental problem with execution of the MMC commands is even
> though the MMC has its own structures and own DMA/Host-controller, the OS's
> block subsystem and MMC subsystem do not really run independent of each
> other and each are still tied to the other's fate, holding up performance
> of the kernel in general.
>
> In particular, I have found that in the 2.6.36+ kernels that the sooner you
> can retire the 'struct request *req' (ie using __blk_end_request()) with
> respect to when the mmc_wait_for_req() call is made, the higher performance
> you are going to get out of the OS in terms of reads/writes using an MMC.
>  mmc_wait_for_req() is a blocking call, so that OS 'struct request req' will
> just sit around and do nothing until mmc_wait_for_req() is done.  I have
> been able to do some caching of some commands, calling __blk_end_request()
> before mmc_wait_for_req(), and getting much higher performance in a few
> experiments (but the work certainly is not ready for prime-time).
>
> Now in the 3.0 kernel I know mmc_wait_for_req() has changed and the goal was
> to try and make that function a bit more non-blocking, but I have not played
> with it too much because my current focus is on existing products and no
> handheld product uses a 3.0 kernel yet (that I am aware of at least).
>  However, I still see the fundamental problem is that the MMC stack, which
> was probably written with the intended purpose to be independent of the OS
> block subsystem (struct request and other stuff), really isn't independent
> of the OS block subsystem and will cause holdups between one another,
> thereby dragging down read/write performance of the MMC.
>
> The other fundamental problem is the writes themselves.  Way, WAY more
> writes occur on a handheld system in an end-user's hands than reads.
> Fundamental computer principle states "you make the common case fast". So
> focus should be on how to complete a write operation the fastest way
> possible.

Thanks for the detailed explanation.
Please let me know if there is a fundamental issue with the way I am
inserting the high-res timers.  In the block.c file, I am timing the
transfers as follows:

1. Start timer
mmc_queue_bounce_pre()
mmc_wait_for_req()
mmc_queue_bounce_post()
End timer
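In code, that is essentially (a sketch; the variable names are mine):

    ktime_t start, end;

    start = ktime_get();		/* 1. Start timer */
    mmc_queue_bounce_pre(mq);
    mmc_wait_for_req(card->host, &brq.mrq);
    mmc_queue_bounce_post(mq);
    end = ktime_get();			/* End timer */

    pr_info("mmc: CMD%u: %u bytes in %lld us\n",
    	brq.cmd.opcode, brq.data.blksz * brq.data.blocks,
    	ktime_to_us(ktime_sub(end, start)));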

So I don't really have to worry about blk_end_request(), right?
Like you said, mmc_wait_for_req() is a blocking wait.  I don't see what
is wrong with that being a blocking wait, because until you get the
data-transfer-complete irq, there is no point in going ahead.
blk_end_request() comes into the picture only later, when all the data
has been transferred to the card.
My line of thought is that the card is taking a lot of time for its
internal housekeeping, but I want to be absolutely sure of my analysis
before I pass that judgement.

I have also used a Toshiba card that gives me about 12 MBps write
speed for the same code, but what worries me is whether I am masking
some issue by blaming it on the card.  What if the Toshiba card can
ideally give a throughput of more than 12MBps?

Or could there be an issue where the irq handler (sdhci_irq) is called
with some kind of delay, so that we are not capturing the
transfer-complete interrupt immediately?

>>
>>>> I mean, for the usual transfers it takes about 3ms to transfer
>>>> 64kB of data, but for the 63rd and 64th transfers, it takes 250 ms.
>>>> The thing is, this is not on a file system.  I am measuring the
>>>> speed using a basic "dd" command to write directly to the block
>>>> device.
>>>>
>>>>>> So, is this a software issue?  Or is there a way to increase the
>>>>>> size of bounce buffers to 4MB?
>>>>>
>>>>>
>>>>>
>>>>>>> Yours,
>>>>>>> Linus Walleij
>>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> J (James/Jay) Freyensee
>>>>> Storage Technology Group
>>>>> Intel Corporation
>>>>>
>>>
>>>
>>> --
>>> J (James/Jay) Freyensee
>>> Storage Technology Group
>>> Intel Corporation
>>>
>
>
> --
> J (James/Jay) Freyensee
> Storage Technology Group
> Intel Corporation
>


* Re: slow eMMC write speed
  2011-09-28 22:24               ` Praveen G K
@ 2011-09-28 22:59                 ` J Freyensee
  2011-09-28 23:16                   ` Praveen G K
  0 siblings, 1 reply; 32+ messages in thread
From: J Freyensee @ 2011-09-28 22:59 UTC (permalink / raw)
  To: Praveen G K; +Cc: Linus Walleij, linux-mmc

On 09/28/2011 03:24 PM, Praveen G K wrote:
> On Wed, Sep 28, 2011 at 2:34 PM, J Freyensee
> <james_p_freyensee@linux.intel.com>  wrote:
>> On 09/28/2011 02:03 PM, Praveen G K wrote:
>>>
>>> On Wed, Sep 28, 2011 at 2:01 PM, J Freyensee
>>> <james_p_freyensee@linux.intel.com>    wrote:
>>>>
>>>> On 09/28/2011 01:34 PM, Praveen G K wrote:
>>>>>
>>>>> On Wed, Sep 28, 2011 at 12:59 PM, J Freyensee
>>>>> <james_p_freyensee@linux.intel.com>      wrote:
>>>>>>
>>>>>> On 09/28/2011 12:06 PM, Praveen G K wrote:
>>>>>>>
>>>>>>> On Tue, Sep 27, 2011 at 10:42 PM, Linus Walleij
>>>>>>> <linus.walleij@linaro.org>        wrote:
>>>>>>>>
>>>>>>>> On Fri, Sep 23, 2011 at 7:05 AM, Praveen G K<praveen.gk@gmail.com>
>>>>>>>>   wrote:
>>>>>>>>
>>>>>>>>> I am working on the block driver module of the eMMC driver (SDIO 3.0
>>>>>>>>> controller).  I am seeing very low write speed for eMMC transfers.
>>>>>>>>>   On
>>>>>>>>> further debugging, I observed that every 63rd and 64th transfer
>>>>>>>>> takes
>>>>>>>>> a long time.
>>>>>>>>
>>>>>>>> Are you not just seeing the card-internal garbage collection?
>>>>>>>> http://lwn.net/Articles/428584/
>>>>>>>
>>>>>>> Does this mean, theoretically, I should be able to achieve larger
>>>>>>> speeds if I am not using linux?
>>>>>>
>>>>>> In theory in a fairy-tale world, maybe, in reality, not really.  In R/W
>>>>>> performance measurements we have done, eMMC performance in products
>>>>>> users
>>>>>> would buy falls well, well short of any theoretical numbers.  We
>>>>>> believe
>>>>>> in
>>>>>> theory, the eMMC interface should be able to support up to 100MB/s, but
>>>>>> in
>>>>>> reality on real customer platforms write bandwidths (for example)
>>>>>> barely
>>>>>> approach 20MB/s, regardless if it's a Microsoft Windows environment or
>>>>>> Android (Linux OS environment we care about).  So maybe it is software
>>>>>> implementation issues of multiple OSs preventing higher eMMC
>>>>>> performance
>>>>>> numbers (hence the reason why I sometimes ask basic coding questions of
>>>>>> the
>>>>>> MMC subsystem- the code isn't the easiest to follow); however, one
>>>>>> looks
>>>>>> no
>>>>>> further than what Apple has done with the iPad2 to see that eMMC
>>>>>> probably
>>>>>> just is not a good solution to use in the first place.  We have
>>>>>> measured
>>>>>> Apple's iPad2 write performance on *WHAT A USER WOULD SEE* being double
>>>>>> what
>>>>>> we see with products using eMMC solutions. The big difference?  Apple
>>>>>> doesn't use eMMC at all for the iPad2.
>>>>>
>>>>> Thanks for all the clarification.  The problem is I am seeing write
>>>>> speeds of about 5MBps on a Sandisk eMMC product and I can clearly see
>>>>> the time lost when measured between sending a command and receiving a
>>>>> data irq.  I am not sure what kind of an issue this is.  5MBps feels
>>>>> really slow but can the internal housekeeping of the card take so much
>>>>> time?
>>>>
>>>> Have you tried to trace through all structs used for an MMC operation??!
>>>>   Good gravy, there are request, mmc_queue, mmc_card, mmc_host,
>>>> mmc_blk_request, mmc_request, multiple mmc_command and multiple
>>>> scatterlists
>>>> that these other structs use...I've been playing around on trying to
>>>> cache
>>>> some things to try and improve performance and it blows me away how many
>>>> variables and pointers I have to keep track of for one operation going to
>>>> an
>>>> LBA on an MMC.  I keep wondering if more of the 'struct request' could
>>>> have
>>>> been used, and 1/3 of these structures could be eliminated.  And another
>>>> thing I wonder too is how much of this infrastructure is really needed,
>>>> that
>>>> when I do ask "what is this for?" question on the list and no one
>>>> responds,
>>>> if anyone else understands if it's needed either.
>>>
>>> I know I am not using the scatterlists, since the scatterlists are
>>> aggregated into a 64k bounce buffer.  Regarding the different structs,
>>> I am just taking them on face value assuming everything works "well".
>>> But, my concern is why does it take such a long time (250 ms) to
>>> return a transfer complete interrupt on occasional cases.  During this
>>> time, the kernel is just waiting for the txfer_complete interrupt.
>>> That's it.
>>
>> I think one fundamental problem with execution of the MMC commands is even
>> though the MMC has its own structures and own DMA/Host-controller, the OS's
>> block subsystem and MMC subsystem do not really run independent of each
>> other and each are still tied to the other's fate, holding up performance
>> of the kernel in general.
>>
>> In particular, I have found that in the 2.6.36+ kernels that the sooner you
>> can retire the 'struct request *req' (ie using __blk_end_request()) with
>> respect to when the mmc_wait_for_req() call is made, the higher performance
>> you are going to get out of the OS in terms of reads/writes using an MMC.
>>   mmc_wait_for_req() is a blocking call, so that OS 'struct request req' will
>> just sit around and do nothing until mmc_wait_for_req() is done.  I have
>> been able to do some caching of some commands, calling __blk_end_request()
>> before mmc_wait_for_req(), and getting much higher performance in a few
>> experiments (but the work certainly is not ready for prime-time).
>>
>> Now in the 3.0 kernel I know mmc_wait_for_req() has changed and the goal was
>> to try and make that function a bit more non-blocking, but I have not played
>> with it too much because my current focus is on existing products and no
>> handheld product uses a 3.0 kernel yet (that I am aware of at least).
>>   However, I still see the fundamental problem is that the MMC stack, which
>> was probably written with the intended purpose to be independent of the OS
>> block subsystem (struct request and other stuff), really isn't independent
>> of the OS block subsystem and will cause holdups between one another,
>> thereby dragging down read/write performance of the MMC.
>>
>> The other fundamental problem is the writes themselves.  Way, WAY more
>> writes occur on a handheld system in an end-user's hands than reads.
>> Fundamental computer principle states "you make the common case fast". So
>> focus should be on how to complete a write operation the fastest way
>> possible.
>
> Thanks for the detailed explanation.
> Please let me know if there is a fundamental issue with the way I am
> inserting the high res timers.  In the block.c file, I am timing the
> transfers as follows
>
> 1. Start timer
> mmc_queue_bounce_pre()
> mmc_wait_for_req()
> mmc_queue_bounce_post()
> End timer
>
> So, I don't really have to worry about the blk_end_request right.
> Like you said, wait_for_req is a blocking wait.  I don't see what is
> wrong with that being a blocking wait, because until you get the data
> xfer complete irq, there is no point in going ahead.  The
> blk_end_request comes later in the picture only when all the data is
> transferred to the card.

Yes, that is correct.

But if you can do some cache trickery or queue tricks, you can delay
when you have to actually write to the MMC, so then __blk_end_request()
and retiring the 'struct request *req' become the time sink.  That is a
reason why mmc_wait_for_req() got some work done on it in the 3.0
kernel.  The OS does not have to wait for the host controller to
complete the operation (i.e., block on mmc_wait_for_req()) if there is
no immediate dependency on that data - waiting anyway is kind of dumb.
This is why this can be a problem and a time sink.  It's no different
than out-of-order execution in CPUs.
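The idea, very roughly (the names follow the non-blocking
mmc_start_req() API that was being merged upstream around this time;
the two mmc_blk_* helpers here are hypothetical placeholders):

    struct mmc_async_req *areq, *prev;
    int err;

    while ((areq = mmc_blk_prep_next(mq)) != NULL) {
    	/* blocks only until the *previous* request completes, so the
    	 * CPU can prepare the next transfer while the card is busy */
    	prev = mmc_start_req(card->host, areq, &err);
    	if (prev)
    		mmc_blk_retire(mq, prev);  /* __blk_end_request() etc. */
    }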

> My line of thought is that the card is taking a lot of time for its
> internal housekeeping.

Each 'write' to a solid-state/nand/flash requires an erase operation 
first, so yes, there is more housekeeping going on than a simple 'write'.

> But, I want to be absolutely sure of my
> analysis before I can pass that judgement.
>
> I have also used another Toshiba card that gives me about 12 MBps
> write speed for the same code, but I am worried is whether I am
> masking some issue by blaming it on the card.  What if the Toshiba
> card can give a throughput more than 12MBps ideally?

No clue...you'd have to talk to Toshiba.

>
> Or could there be an issue that the irq handler(sdhci_irq) is called
> with some kind of a delay and is there a possibility that we are not
> capturing the transfer complete interrupt immediately?
>
>>>
>>>>> I mean, for the usual transfers it takes about 3ms to transfer
>>>>> 64kB of data, but for the 63rd and 64th transfers, it takes 250 ms.
>>>>> The thing is, this is not on a file system.  I am measuring the
>>>>> speed using a basic "dd" command to write directly to the block
>>>>> device.
>>>>>
>>>>>>> So, is this a software issue?  Or is there a way to increase the
>>>>>>> size of bounce buffers to 4MB?
>>>>>>
>>>>>>
>>>>>>
>>>>>>>> Yours,
>>>>>>>> Linus Walleij
>>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> J (James/Jay) Freyensee
>>>>>> Storage Technology Group
>>>>>> Intel Corporation
>>>>>>
>>>>
>>>>
>>>> --
>>>> J (James/Jay) Freyensee
>>>> Storage Technology Group
>>>> Intel Corporation
>>>>
>>
>>
>> --
>> J (James/Jay) Freyensee
>> Storage Technology Group
>> Intel Corporation
>>


-- 
J (James/Jay) Freyensee
Storage Technology Group
Intel Corporation


* Re: slow eMMC write speed
  2011-09-28 22:59                 ` J Freyensee
@ 2011-09-28 23:16                   ` Praveen G K
  2011-09-29  0:57                     ` Philip Rakity
  0 siblings, 1 reply; 32+ messages in thread
From: Praveen G K @ 2011-09-28 23:16 UTC (permalink / raw)
  To: J Freyensee; +Cc: Linus Walleij, linux-mmc

On Wed, Sep 28, 2011 at 3:59 PM, J Freyensee
<james_p_freyensee@linux.intel.com> wrote:
> On 09/28/2011 03:24 PM, Praveen G K wrote:
>>
>> On Wed, Sep 28, 2011 at 2:34 PM, J Freyensee
>> <james_p_freyensee@linux.intel.com>  wrote:
>>>
>>> On 09/28/2011 02:03 PM, Praveen G K wrote:
>>>>
>>>> On Wed, Sep 28, 2011 at 2:01 PM, J Freyensee
>>>> <james_p_freyensee@linux.intel.com>    wrote:
>>>>>
>>>>> On 09/28/2011 01:34 PM, Praveen G K wrote:
>>>>>>
>>>>>> On Wed, Sep 28, 2011 at 12:59 PM, J Freyensee
>>>>>> <james_p_freyensee@linux.intel.com>      wrote:
>>>>>>>
>>>>>>> On 09/28/2011 12:06 PM, Praveen G K wrote:
>>>>>>>>
>>>>>>>> On Tue, Sep 27, 2011 at 10:42 PM, Linus Walleij
>>>>>>>> <linus.walleij@linaro.org>        wrote:
>>>>>>>>>
>>>>>>>>> On Fri, Sep 23, 2011 at 7:05 AM, Praveen G K<praveen.gk@gmail.com>
>>>>>>>>>  wrote:
>>>>>>>>>
>>>>>>>>>> I am working on the block driver module of the eMMC driver (SDIO
>>>>>>>>>> 3.0
>>>>>>>>>> controller).  I am seeing very low write speed for eMMC transfers.
>>>>>>>>>>  On
>>>>>>>>>> further debugging, I observed that every 63rd and 64th transfer
>>>>>>>>>> takes
>>>>>>>>>> a long time.
>>>>>>>>>
>>>>>>>>> Are you not just seeing the card-internal garbage collection?
>>>>>>>>> http://lwn.net/Articles/428584/
>>>>>>>>
>>>>>>>> Does this mean, theoretically, I should be able to achieve larger
>>>>>>>> speeds if I am not using linux?
>>>>>>>
>>>>>>> In theory in a fairy-tale world, maybe, in reality, not really.  In
>>>>>>> R/W
>>>>>>> performance measurements we have done, eMMC performance in products
>>>>>>> users
>>>>>>> would buy falls well, well short of any theoretical numbers.  We
>>>>>>> believe
>>>>>>> in
>>>>>>> theory, the eMMC interface should be able to support up to 100MB/s,
>>>>>>> but
>>>>>>> in
>>>>>>> reality on real customer platforms write bandwidths (for example)
>>>>>>> barely
>>>>>>> approach 20MB/s, regardless if it's a Microsoft Windows environment
>>>>>>> or
>>>>>>> Android (Linux OS environment we care about).  So maybe it is
>>>>>>> software
>>>>>>> implementation issues of multiple OSs preventing higher eMMC
>>>>>>> performance
>>>>>>> numbers (hence the reason why I sometimes ask basic coding questions
>>>>>>> of
>>>>>>> the
>>>>>>> MMC subsystem- the code isn't the easiest to follow); however, one
>>>>>>> looks
>>>>>>> no
>>>>>>> further than what Apple has done with the iPad2 to see that eMMC
>>>>>>> probably
>>>>>>> just is not a good solution to use in the first place.  We have
>>>>>>> measured
>>>>>>> Apple's iPad2 write performance on *WHAT A USER WOULD SEE* being
>>>>>>> double
>>>>>>> what
>>>>>>> we see with products using eMMC solutions. The big difference?  Apple
>>>>>>> doesn't use eMMC at all for the iPad2.
>>>>>>
>>>>>> Thanks for all the clarification.  The problem is I am seeing write
>>>>>> speeds of about 5MBps on a Sandisk eMMC product and I can clearly see
>>>>>> the time lost when measured between sending a command and receiving a
>>>>>> data irq.  I am not sure what kind of an issue this is.  5MBps feels
>>>>>> really slow but can the internal housekeeping of the card take so much
>>>>>> time?
>>>>>
>>>>> Have you tried to trace through all structs used for an MMC
>>>>> operation??!
>>>>>  Good gravy, there are request, mmc_queue, mmc_card, mmc_host,
>>>>> mmc_blk_request, mmc_request, multiple mmc_command and multiple
>>>>> scatterlists
>>>>> that these other structs use...I've been playing around on trying to
>>>>> cache
>>>>> some things to try and improve performance and it blows me away how
>>>>> many
>>>>> variables and pointers I have to keep track of for one operation going
>>>>> to
>>>>> an
>>>>> LBA on an MMC.  I keep wondering if more of the 'struct request' could
>>>>> have
>>>>> been used, and 1/3 of these structures could be eliminated.  And
>>>>> another
>>>>> thing I wonder too is how much of this infrastructure is really needed,
>>>>> that
>>>>> when I do ask "what is this for?" question on the list and no one
>>>>> responds,
>>>>> if anyone else understands if it's needed either.
>>>>
>>>> I know I am not using the scatterlists, since the scatterlists are
>>>> aggregated into a 64k bounce buffer.  Regarding the different structs,
>>>> I am just taking them on face value assuming everything works "well".
>>>> But, my concern is why does it take such a long time (250 ms) to
>>>> return a transfer complete interrupt on occasional cases.  During this
>>>> time, the kernel is just waiting for the txfer_complete interrupt.
>>>> That's it.
>>>
>>> I think one fundamental problem with execution of the MMC commands is
>>> even
>>> though the MMC has its own structures and own DMA/Host-controller, the
>>> OS's block subsystem and MMC subsystem do not really run independent of
>>> each other and each are still tied to the other's fate, holding up
>>> performance
>>> of the kernel in general.
>>>
>>> In particular, I have found that in the 2.6.36+ kernels that the sooner
>>> you
>>> can retire the 'struct request *req' (ie using __blk_end_request()) with
>>> respect to when the mmc_wait_for_req() call is made, the higher
>>> performance
>>> you are going to get out of the OS in terms of reads/writes using an MMC.
>>>  mmc_wait_for_req() is a blocking call, so that OS 'struct request req'
>>> will
>>> just sit around and do nothing until mmc_wait_for_req() is done.  I have
>>> been able to do some caching of some commands, calling
>>> __blk_end_request()
>>> before mmc_wait_for_req(), and getting much higher performance in a few
>>> experiments (but the work certainly is not ready for prime-time).
>>>
>>> Now in the 3.0 kernel I know mmc_wait_for_req() has changed and the goal
>>> was
>>> to try and make that function a bit more non-blocking, but I have not
>>> played
>>> with it too much because my current focus is on existing products and no
>>> handheld product uses a 3.0 kernel yet (that I am aware of at least).
>>>  However, I still see the fundamental problem is that the MMC stack,
>>> which
>>> was probably written with the intended purpose to be independent of the
>>> OS
>>> block subsystem (struct request and other stuff), really isn't
>>> independent
>>> of the OS block subsystem and will cause holdups between one another,
>>> thereby dragging down read/write performance of the MMC.
>>>
>>> The other fundamental problem is the writes themselves.  Way, WAY more
>>> writes occur on a handheld system in an end-user's hands than reads.
>>> Fundamental computer principle states "you make the common case fast". So
>>> focus should be on how to complete a write operation the fastest way
>>> possible.
>>
>> Thanks for the detailed explanation.
>> Please let me know if there is a fundamental issue with the way I am
>> inserting the high res timers.  In the block.c file, I am timing the
>> transfers as follows
>>
>> 1. Start timer
>> mmc_queue_bounce_pre()
>> mmc_wait_for_req()
>> mmc_queue_bounce_post()
>> End timer
>>
>> So, I don't really have to worry about the blk_end_request right.
>> Like you said, wait_for_req is a blocking wait.  I don't see what is
>> wrong with that being a blocking wait, because until you get the data
>> xfer complete irq, there is no point in going ahead.  The
>> blk_end_request comes later in the picture only when all the data is
>> transferred to the card.
>
> Yes, that is correct.
>
> But if you can do some cache trickery or queue tricks, you can delay
> when you have to actually write to the MMC, so then __blk_end_request()
> and retiring the 'struct request *req' become the time sink.  That is a
> reason why mmc_wait_for_req() got some work done on it in the 3.0
> kernel.  The OS does not have to wait for the host controller to
> complete the operation (i.e., block on mmc_wait_for_req()) if there is
> no immediate dependency on that data - waiting anyway is kind of dumb.
> This is why this can be a problem and a time sink.  It's no different
> than out-of-order execution in CPUs.

Thanks, I'll look into the 3.0 code to see what the changes are and
whether they can improve the speed.  Thanks for your suggestions.

>> My line of thought is that the card is taking a lot of time for its
>> internal housekeeping.
>
> Each 'write' to a solid-state/nand/flash requires an erase operation first,
> so yes, there is more housekeeping going on than a simple 'write'.
>
>> But, I want to be absolutely sure of my
>> analysis before I can pass that judgement.
>>
>> I have also used another Toshiba card that gives me about 12 MBps
>> write speed for the same code, but I am worried is whether I am
>> masking some issue by blaming it on the card.  What if the Toshiba
>> card can give a throughput more than 12MBps ideally?
>
> No clue...you'd have to talk to Toshiba.
>
>>
>> Or could there be an issue that the irq handler(sdhci_irq) is called
>> with some kind of a delay and is there a possibility that we are not
>> capturing the transfer complete interrupt immediately?
>>
>>>>
>>>>>> I mean, for the usual transfers it takes about 3ms to transfer
>>>>>> 64kB of data, but for the 63rd and 64th transfers, it takes 250 ms.
>>>>>> The thing is, this is not on a file system.  I am measuring the
>>>>>> speed using a basic "dd" command to write directly to the block
>>>>>> device.
>>>>>>
>>>>>>>> So, is this a software issue?  Or is there a way to increase
>>>>>>>> the size of bounce buffers to 4MB?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>> Yours,
>>>>>>>>> Linus Walleij
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> J (James/Jay) Freyensee
>>>>>>> Storage Technology Group
>>>>>>> Intel Corporation
>>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> J (James/Jay) Freyensee
>>>>> Storage Technology Group
>>>>> Intel Corporation
>>>>>
>>>
>>>
>>> --
>>> J (James/Jay) Freyensee
>>> Storage Technology Group
>>> Intel Corporation
>>>
>
>
> --
> J (James/Jay) Freyensee
> Storage Technology Group
> Intel Corporation
>


* Re: slow eMMC write speed
  2011-09-28 23:16                   ` Praveen G K
@ 2011-09-29  0:57                     ` Philip Rakity
  2011-09-29  2:24                       ` Praveen G K
  0 siblings, 1 reply; 32+ messages in thread
From: Philip Rakity @ 2011-09-29  0:57 UTC (permalink / raw)
  To: Praveen G K; +Cc: J Freyensee, Linus Walleij, linux-mmc



On Sep 28, 2011, at 4:16 PM, Praveen G K wrote:

> On Wed, Sep 28, 2011 at 3:59 PM, J Freyensee
> <james_p_freyensee@linux.intel.com> wrote:
>> On 09/28/2011 03:24 PM, Praveen G K wrote:
>>> 
>>> On Wed, Sep 28, 2011 at 2:34 PM, J Freyensee
>>> <james_p_freyensee@linux.intel.com>  wrote:
>>>> 
>>>> On 09/28/2011 02:03 PM, Praveen G K wrote:
>>>>> 
>>>>> On Wed, Sep 28, 2011 at 2:01 PM, J Freyensee
>>>>> <james_p_freyensee@linux.intel.com>    wrote:
>>>>>> 
>>>>>> On 09/28/2011 01:34 PM, Praveen G K wrote:
>>>>>>> 
>>>>>>> On Wed, Sep 28, 2011 at 12:59 PM, J Freyensee
>>>>>>> <james_p_freyensee@linux.intel.com>      wrote:
>>>>>>>> 
>>>>>>>> On 09/28/2011 12:06 PM, Praveen G K wrote:
>>>>>>>>> 
>>>>>>>>> On Tue, Sep 27, 2011 at 10:42 PM, Linus Walleij
>>>>>>>>> <linus.walleij@linaro.org>        wrote:
>>>>>>>>>> 
>>>>>>>>>> On Fri, Sep 23, 2011 at 7:05 AM, Praveen G K<praveen.gk@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> I am working on the block driver module of the eMMC driver (SDIO
>>>>>>>>>>> 3.0
>>>>>>>>>>> controller).  I am seeing very low write speed for eMMC transfers.
>>>>>>>>>>> On
>>>>>>>>>>> further debugging, I observed that every 63rd and 64th transfer
>>>>>>>>>>> takes
>>>>>>>>>>> a long time.
>>>>>>>>>> 
>>>>>>>>>> Are you not just seeing the card-internal garbage collection?
>>>>>>>>>> http://lwn.net/Articles/428584/
>>>>>>>>> 
>>>>>>>>> Does this mean, theoretically, I should be able to achieve larger
>>>>>>>>> speeds if I am not using linux?
>>>>>>>> 
>>>>>>>> In theory in a fairy-tale world, maybe, in reality, not really.  In
>>>>>>>> R/W
>>>>>>>> performance measurements we have done, eMMC performance in products
>>>>>>>> users
>>>>>>>> would buy falls well, well short of any theoretical numbers.  We
>>>>>>>> believe
>>>>>>>> in
>>>>>>>> theory, the eMMC interface should be able to support up to 100MB/s,
>>>>>>>> but
>>>>>>>> in
>>>>>>>> reality on real customer platforms write bandwidths (for example)
>>>>>>>> barely
>>>>>>>> approach 20MB/s, regardless if it's a Microsoft Windows environment
>>>>>>>> or
>>>>>>>> Android (Linux OS environment we care about).  So maybe it is
>>>>>>>> software
>>>>>>>> implementation issues of multiple OSs preventing higher eMMC
>>>>>>>> performance
>>>>>>>> numbers (hence the reason why I sometimes ask basic coding questions
>>>>>>>> of
>>>>>>>> the
>>>>>>>> MMC subsystem- the code isn't the easiest to follow); however, one
>>>>>>>> looks
>>>>>>>> no
>>>>>>>> further than what Apple has done with the iPad2 to see that eMMC
>>>>>>>> probably
>>>>>>>> just is not a good solution to use in the first place.  We have
>>>>>>>> measured
>>>>>>>> Apple's iPad2 write performance on *WHAT A USER WOULD SEE* being
>>>>>>>> double
>>>>>>>> what
>>>>>>>> we see with products using eMMC solutions. The big difference?  Apple
>>>>>>>> doesn't use eMMC at all for the iPad2.
>>>>>>> 
>>>>>>> Thanks for all the clarification.  The problem is I am seeing write
>>>>>>> speeds of about 5MBps on a Sandisk eMMC product and I can clearly see
>>>>>>> the time lost when measured between sending a command and receiving a
>>>>>>> data irq.  I am not sure what kind of an issue this is.  5MBps feels
>>>>>>> really slow but can the internal housekeeping of the card take so much
>>>>>>> time?
>>>>>> 
>>>>>> Have you tried to trace through all structs used for an MMC
>>>>>> operation??!
>>>>>> Good gravy, there are request, mmc_queue, mmc_card, mmc_host,
>>>>>> mmc_blk_request, mmc_request, multiple mmc_command and multiple
>>>>>> scatterlists
>>>>>> that these other structs use...I've been playing around on trying to
>>>>>> cache
>>>>>> some things to try and improve performance and it blows me away how
>>>>>> many
>>>>>> variables and pointers I have to keep track of for one operation going
>>>>>> to
>>>>>> an
>>>>>> LBA on an MMC.  I keep wondering if more of the 'struct request' could
>>>>>> have
>>>>>> been used, and 1/3 of these structures could be eliminated.  And
>>>>>> another
>>>>>> thing I wonder too is how much of this infrastructure is really needed,
>>>>>> that
>>>>>> when I do ask "what is this for?" question on the list and no one
>>>>>> responds,
>>>>>> if anyone else understands if it's needed either.
>>>>> 
>>>>> I know I am not using the scatterlists, since the scatterlists are
>>>>> aggregated into a 64k bounce buffer.  Regarding the different structs,
>>>>> I am just taking them on face value assuming everything works "well".
>>>>> But, my concern is why does it take such a long time (250 ms) to
>>>>> return a transfer complete interrupt on occasional cases.  During this
>>>>> time, the kernel is just waiting for the txfer_complete interrupt.
>>>>> That's it.
>>>> 
>>>> I think one fundamental problem with execution of the MMC commands is
>>>> even
>>>> though the MMC has its own structures and own DMA/Host-controller, the
>>>> OS's
>>>> block subsystem and MMC subsystem do not really run independent of each
>>>> other, and each is still tied to the other's fate, holding up
>>>> performance
>>>> of the kernel in general.
>>>> 
>>>> In particular, I have found that in the 2.6.36+ kernels, the sooner
>>>> you
>>>> can retire the 'struct request *req' (ie using __blk_end_request()) with
>>>> respect to when the mmc_wait_for_req() call is made, the higher
>>>> performance
>>>> you are going to get out of the OS in terms of reads/writes using an MMC.
>>>> mmc_wait_for_req() is a blocking call, so that OS 'struct request req'
>>>> will
>>>> just sit around and do nothing until mmc_wait_for_req() is done.  I have
>>>> been able to do some caching of some commands, calling
>>>> __blk_end_request()
>>>> before mmc_wait_for_req(), and getting much higher performance in a few
>>>> experiments (but the work certainly is not ready for prime-time).
>>>> 
>>>> Now in the 3.0 kernel I know mmc_wait_for_req() has changed and the goal
>>>> was
>>>> to try and make that function a bit more non-blocking, but I have not
>>>> played
>>>> with it too much because my current focus is on existing products and no
>>>> handheld product uses a 3.0 kernel yet (that I am aware of at least).
>>>> However, I still see the fundamental problem is that the MMC stack,
>>>> which
>>>> was probably written with the intended purpose to be independent of the
>>>> OS
>>>> block subsystem (struct request and other stuff), really isn't
>>>> independent
>>>> of the OS block subsystem and will cause holdups between one another,
>>>> thereby dragging down read/write performance of the MMC.
>>>> 
>>>> The other fundamental problem is the writes themselves.  Way, WAY more
>>>> writes occur on a handheld system in an end-user's hands than reads.
>>>> Fundamental computer principle states "you make the common case fast". So
>>>> focus should be on how to complete a write operation the fastest way
>>>> possible.
>>> 
>>> Thanks for the detailed explanation.
>>> Please let me know if there is a fundamental issue with the way I am
>>> inserting the high res timers.  In the block.c file, I am timing the
>>> transfers as follows
>>> 
>>> 1. Start timer
>>> mmc_queue_bounce_pre()
>>> mmc_wait_for_req()
>>> mmc_queue_bounce_post()
>>> End timer
>>> 
>>> So, I don't really have to worry about the blk_end_request right.
>>> Like you said, wait_for_req is a blocking wait.  I don't see what is
>>> wrong with that being a blocking wait, because until you get the data
>>> xfer complete irq, there is no point in going ahead.  The
>>> blk_end_request comes later in the picture only when all the data is
>>> transferred to the card.
>> 
>> Yes, that is correct.
>> 
>> But if you can do some cache trickery or queue tricks, you can delay when
>> you have to actually write to the MMC, so then __blk_end_request() and
>> retiring the 'struct request *req' becomes the time-sync.  That is a reason
>> why mmc_wait_for_req() got some work done on it in the 3.0 kernel.  The OS
>> does not have to wait for the host controller to complete the operation (ie,
>> block on mmc_wait_for_data()) if there is no immediate dependency on that
>> data- that is kind-of dumb.  This is why this can be a problem and a time
>> sync.  It's no different than out-of-order execution in CPUs.
> 
> Thanks I'll look into the 3.0 code to see what the changes are and
> whether it can improve the speed.  Thanks for your suggestions.
> 
>>> My line of thought is that the card is taking a lot of time for its
>>> internal housekeeping.
>> 
>> Each 'write' to a solid-state/nand/flash requires an erase operation first,
>> so yes, there is more housekeeping going on than a simple 'write'.
>> 
>> But, I want to be absolutely sure of my
>>> 
>>> analysis before I can pass that judgement.
>>> 
>>> I have also used another Toshiba card that gives me about 12 MBps
>>> write speed for the same code, but what I am worried about is whether I am
>>> masking some issue by blaming it on the card.  What if the Toshiba
>>> card can give a throughput more than 12MBps ideally?
>> 
>> No clue...you'd have to talk to Toshiba.
>> 
>>> 
>>> Or could there be an issue that the irq handler(sdhci_irq) is called
>>> with some kind of a delay and is there a possibility that we are not
>>> capturing the transfer complete interrupt immediately?
>>> 
>>>>> 
>>>>>> I mean, for the usual transfers it takes about 3ms to transfer
>>>>>>> 
>>>>>>> 64kB of data, but for the 63rd and 64th transfers, it takes 250 ms.
>>>>>>> The thing is this is not on a file system.  I am measuring the speed
>>>>>>> using basic "dd" command to write directly to the block device.
>>>>>>> 
>>>>>>>> So, is this a software issue? or if
>>>>>>>>> 
>>>>>>>>> there is a way to increase the size of bounce buffers to 4MB?
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>>> Yours,
>>>>>>>>>> Linus Walleij
>>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-mmc"
>>>>>>>>> in
>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> J (James/Jay) Freyensee
>>>>>>>> Storage Technology Group
>>>>>>>> Intel Corporation
>>>>>>>> 
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-mmc"
>>>>>>> in
>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> J (James/Jay) Freyensee
>>>>>> Storage Technology Group
>>>>>> Intel Corporation
>>>>>> 
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe linux-mmc" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>> 
>>>> 
>>>> --
>>>> J (James/Jay) Freyensee
>>>> Storage Technology Group
>>>> Intel Corporation
>>>> 
>> 

some questions:

does using a bounce buffer make things faster?

I think you are using SDMA.  I am wondering if there is a way to increase the xfer size.
Is there some magic number inside the mmc code that can be increased?

Philip


>> 
>> --
>> J (James/Jay) Freyensee
>> Storage Technology Group
>> Intel Corporation
>> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-mmc" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: slow eMMC write speed
  2011-09-29  0:57                     ` Philip Rakity
@ 2011-09-29  2:24                       ` Praveen G K
  2011-09-29  3:30                         ` Philip Rakity
  0 siblings, 1 reply; 32+ messages in thread
From: Praveen G K @ 2011-09-29  2:24 UTC (permalink / raw)
  To: Philip Rakity; +Cc: J Freyensee, Linus Walleij, linux-mmc

On Wed, Sep 28, 2011 at 5:57 PM, Philip Rakity <prakity@marvell.com> wrote:
>
>
> On Sep 28, 2011, at 4:16 PM, Praveen G K wrote:
>
>> On Wed, Sep 28, 2011 at 3:59 PM, J Freyensee
>> <james_p_freyensee@linux.intel.com> wrote:
>>> On 09/28/2011 03:24 PM, Praveen G K wrote:
>>>>
>>>> On Wed, Sep 28, 2011 at 2:34 PM, J Freyensee
>>>> <james_p_freyensee@linux.intel.com>  wrote:
>>>>>
>>>>> On 09/28/2011 02:03 PM, Praveen G K wrote:
>>>>>>
>>>>>> On Wed, Sep 28, 2011 at 2:01 PM, J Freyensee
>>>>>> <james_p_freyensee@linux.intel.com>    wrote:
>>>>>>>
>>>>>>> On 09/28/2011 01:34 PM, Praveen G K wrote:
>>>>>>>>
>>>>>>>> On Wed, Sep 28, 2011 at 12:59 PM, J Freyensee
>>>>>>>> <james_p_freyensee@linux.intel.com>      wrote:
>>>>>>>>>
>>>>>>>>> On 09/28/2011 12:06 PM, Praveen G K wrote:
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 27, 2011 at 10:42 PM, Linus Walleij
>>>>>>>>>> <linus.walleij@linaro.org>        wrote:
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 23, 2011 at 7:05 AM, Praveen G K<praveen.gk@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I am working on the block driver module of the eMMC driver (SDIO
>>>>>>>>>>>> 3.0
>>>>>>>>>>>> controller).  I am seeing very low write speed for eMMC transfers.
>>>>>>>>>>>> On
>>>>>>>>>>>> further debugging, I observed that every 63rd and 64th transfer
>>>>>>>>>>>> takes
>>>>>>>>>>>> a long time.
>>>>>>>>>>>
>>>>>>>>>>> Are you not just seeing the card-internal garbage collection?
>>>>>>>>>>> http://lwn.net/Articles/428584/
>>>>>>>>>>
>>>>>>>>>> Does this mean, theoretically, I should be able to achieve larger
>>>>>>>>>> speeds if I am not using linux?
>>>>>>>>>
>>>>>>>>> In theory in a fairy-tale world, maybe, in reality, not really.  In
>>>>>>>>> R/W
>>>>>>>>> performance measurements we have done, eMMC performance in products
>>>>>>>>> users
>>>>>>>>> would buy falls well, well short of any theoretical numbers.  We
>>>>>>>>> believe
>>>>>>>>> in
>>>>>>>>> theory, the eMMC interface should be able to support up to 100MB/s,
>>>>>>>>> but
>>>>>>>>> in
>>>>>>>>> reality on real customer platforms write bandwidths (for example)
>>>>>>>>> barely
>>>>>>>>> approach 20MB/s, regardless if it's a Microsoft Windows environment
>>>>>>>>> or
>>>>>>>>> Android (Linux OS environment we care about).  So maybe it is
>>>>>>>>> software
>>>>>>>>> implementation issues of multiple OSs preventing higher eMMC
>>>>>>>>> performance
>>>>>>>>> numbers (hence the reason why I sometimes ask basic coding questions
>>>>>>>>> of
>>>>>>>>> the
>>>>>>>>> MMC subsystem- the code isn't the easiest to follow); however, one
>>>>>>>>> looks
>>>>>>>>> no
>>>>>>>>> further than what Apple has done with the iPad2 to see that eMMC
>>>>>>>>> probably
>>>>>>>>> just is not a good solution to use in the first place.  We have
>>>>>>>>> measured
>>>>>>>>> Apple's iPad2 write performance on *WHAT A USER WOULD SEE* being
>>>>>>>>> double
>>>>>>>>> what
>>>>>>>>> we see with products using eMMC solutions. The big difference?  Apple
>>>>>>>>> doesn't use eMMC at all for the iPad2.
>>>>>>>>
>>>>>>>> Thanks for all the clarification.  The problem is I am seeing write
>>>>>>>> speeds of about 5MBps on a Sandisk eMMC product and I can clearly see
>>>>>>>> the time lost when measured between sending a command and receiving a
>>>>>>>> data irq.  I am not sure what kind of an issue this is.  5MBps feels
>>>>>>>> really slow but can the internal housekeeping of the card take so much
>>>>>>>> time?
>>>>>>>
>>>>>>> Have you tried to trace through all structs used for an MMC
>>>>>>> operation??!
>>>>>>> Good gravy, there are request, mmc_queue, mmc_card, mmc_host,
>>>>>>> mmc_blk_request, mmc_request, multiple mmc_command and multiple
>>>>>>> scatterlists
>>>>>>> that these other structs use...I've been playing around on trying to
>>>>>>> cache
>>>>>>> some things to try and improve performance and it blows me away how
>>>>>>> many
>>>>>>> variables and pointers I have to keep track of for one operation going
>>>>>>> to
>>>>>>> an
>>>>>>> LBA on an MMC.  I keep wondering if more of the 'struct request' could
>>>>>>> have
>>>>>>> been used, and 1/3 of these structures could be eliminated.  And
>>>>>>> another
>>>>>>> thing I wonder too is how much of this infrastructure is really needed,
>>>>>>> that
>>>>>>> when I do ask "what is this for?" question on the list and no one
>>>>>>> responds,
>>>>>>> if anyone else understands if it's needed either.
>>>>>>
>>>>>> I know I am not using the scatterlists, since the scatterlists are
>>>>>> aggregated into a 64k bounce buffer.  Regarding the different structs,
>>>>>> I am just taking them on face value assuming everything works "well".
>>>>>> But, my concern is why does it take such a long time (250 ms) to
>>>>>> return a transfer complete interrupt on occasional cases.  During this
>>>>>> time, the kernel is just waiting for the txfer_complete interrupt.
>>>>>> That's it.
>>>>>
>>>>> I think one fundamental problem with execution of the MMC commands is
>>>>> even
>>>>> though the MMC has its own structures and own DMA/Host-controller, the
>>>>> OS's
>>>>> block subsystem and MMC subsystem do not really run independent of each
>>>>> other, and each is still tied to the other's fate, holding up
>>>>> performance
>>>>> of the kernel in general.
>>>>>
>>>>> In particular, I have found that in the 2.6.36+ kernels, the sooner
>>>>> you
>>>>> can retire the 'struct request *req' (ie using __blk_end_request()) with
>>>>> respect to when the mmc_wait_for_req() call is made, the higher
>>>>> performance
>>>>> you are going to get out of the OS in terms of reads/writes using an MMC.
>>>>> mmc_wait_for_req() is a blocking call, so that OS 'struct request req'
>>>>> will
>>>>> just sit around and do nothing until mmc_wait_for_req() is done.  I have
>>>>> been able to do some caching of some commands, calling
>>>>> __blk_end_request()
>>>>> before mmc_wait_for_req(), and getting much higher performance in a few
>>>>> experiments (but the work certainly is not ready for prime-time).
>>>>>
>>>>> Now in the 3.0 kernel I know mmc_wait_for_req() has changed and the goal
>>>>> was
>>>>> to try and make that function a bit more non-blocking, but I have not
>>>>> played
>>>>> with it too much because my current focus is on existing products and no
>>>>> handheld product uses a 3.0 kernel yet (that I am aware of at least).
>>>>> However, I still see the fundamental problem is that the MMC stack,
>>>>> which
>>>>> was probably written with the intended purpose to be independent of the
>>>>> OS
>>>>> block subsystem (struct request and other stuff), really isn't
>>>>> independent
>>>>> of the OS block subsystem and will cause holdups between one another,
>>>>> thereby dragging down read/write performance of the MMC.
>>>>>
>>>>> The other fundamental problem is the writes themselves.  Way, WAY more
>>>>> writes occur on a handheld system in an end-user's hands than reads.
>>>>> Fundamental computer principle states "you make the common case fast". So
>>>>> focus should be on how to complete a write operation the fastest way
>>>>> possible.
>>>>
>>>> Thanks for the detailed explanation.
>>>> Please let me know if there is a fundamental issue with the way I am
>>>> inserting the high res timers.  In the block.c file, I am timing the
>>>> transfers as follows
>>>>
>>>> 1. Start timer
>>>> mmc_queue_bounce_pre()
>>>> mmc_wait_for_req()
>>>> mmc_queue_bounce_post()
>>>> End timer
>>>>
>>>> So, I don't really have to worry about the blk_end_request right.
>>>> Like you said, wait_for_req is a blocking wait.  I don't see what is
>>>> wrong with that being a blocking wait, because until you get the data
>>>> xfer complete irq, there is no point in going ahead.  The
>>>> blk_end_request comes later in the picture only when all the data is
>>>> transferred to the card.
>>>
>>> Yes, that is correct.
>>>
>>> But if you can do some cache trickery or queue tricks, you can delay when
>>> you have to actually write to the MMC, so then __blk_end_request() and
>>> retiring the 'struct request *req' becomes the time-sync.  That is a reason
>>> why mmc_wait_for_req() got some work done on it in the 3.0 kernel.  The OS
>>> does not have to wait for the host controller to complete the operation (ie,
>>> block on mmc_wait_for_data()) if there is no immediate dependency on that
>>> data- that is kind-of dumb.  This is why this can be a problem and a time
>>> sync.  It's no different than out-of-order execution in CPUs.
>>
>> Thanks I'll look into the 3.0 code to see what the changes are and
>> whether it can improve the speed.  Thanks for your suggestions.
>>
>>>> My line of thought is that the card is taking a lot of time for its
>>>> internal housekeeping.
>>>
>>> Each 'write' to a solid-state/nand/flash requires an erase operation first,
>>> so yes, there is more housekeeping going on than a simple 'write'.
>>>
>>> But, I want to be absolutely sure of my
>>>>
>>>> analysis before I can pass that judgement.
>>>>
>>>> I have also used another Toshiba card that gives me about 12 MBps
>>>> write speed for the same code, but what I am worried about is whether I am
>>>> masking some issue by blaming it on the card.  What if the Toshiba
>>>> card can give a throughput more than 12MBps ideally?
>>>
>>> No clue...you'd have to talk to Toshiba.
>>>
>>>>
>>>> Or could there be an issue that the irq handler(sdhci_irq) is called
>>>> with some kind of a delay and is there a possibility that we are not
>>>> capturing the transfer complete interrupt immediately?
>>>>
>>>>>>
>>>>>>> I mean, for the usual transfers it takes about 3ms to transfer
>>>>>>>>
>>>>>>>> 64kB of data, but for the 63rd and 64th transfers, it takes 250 ms.
>>>>>>>> The thing is this is not on a file system.  I am measuring the speed
>>>>>>>> using basic "dd" command to write directly to the block device.
>>>>>>>>
>>>>>>>>> So, is this a software issue? or if
>>>>>>>>>>
>>>>>>>>>> there is a way to increase the size of bounce buffers to 4MB?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> Yours,
>>>>>>>>>>> Linus Walleij
>>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-mmc"
>>>>>>>>>> in
>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> J (James/Jay) Freyensee
>>>>>>>>> Storage Technology Group
>>>>>>>>> Intel Corporation
>>>>>>>>>
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-mmc"
>>>>>>>> in
>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> J (James/Jay) Freyensee
>>>>>>> Storage Technology Group
>>>>>>> Intel Corporation
>>>>>>>
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-mmc" in
>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>>
>>>>> --
>>>>> J (James/Jay) Freyensee
>>>>> Storage Technology Group
>>>>> Intel Corporation
>>>>>
>>>
>
> some questions:
>
> does using a bounce buffer make things faster?
>
> I think you are using SDMA.  I am wondering if there is a way to increase the xfer size.
> Is there some magic number inside the mmc code that can be increased?

The bounce buffer increases the speed, but that is limited to 64kB.  I
don't know why it is limited to that number though.
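
If it helps, the constant itself seems to live in
drivers/mmc/card/queue.c -- roughly this, in a 2.6.3x/3.0-era tree
(exact context may differ):

	#if defined(CONFIG_MMC_BLOCK_BOUNCE)
	#define MMC_QUEUE_BOUNCESZ	65536
	#endif

	/* ...and the buffer is one physically contiguous allocation: */
	mq->bounce_buf = kmalloc(bouncesz, GFP_KERNEL);

My guess is it is kept at 64kB because a much larger contiguous
kmalloc() (let alone 4MB per queue) becomes unreliable once memory is
fragmented.
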
> Philip
>
>
>>>
>>> --
>>> J (James/Jay) Freyensee
>>> Storage Technology Group
>>> Intel Corporation
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-mmc" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: slow eMMC write speed
  2011-09-29  2:24                       ` Praveen G K
@ 2011-09-29  3:30                         ` Philip Rakity
  0 siblings, 0 replies; 32+ messages in thread
From: Philip Rakity @ 2011-09-29  3:30 UTC (permalink / raw)
  To: Praveen G K; +Cc: J Freyensee, Linus Walleij, linux-mmc


On Sep 28, 2011, at 7:24 PM, Praveen G K wrote:

> On Wed, Sep 28, 2011 at 5:57 PM, Philip Rakity <prakity@marvell.com> wrote:
>> 
>> 
>> On Sep 28, 2011, at 4:16 PM, Praveen G K wrote:
>> 
>>> On Wed, Sep 28, 2011 at 3:59 PM, J Freyensee
>>> <james_p_freyensee@linux.intel.com> wrote:
>>>> On 09/28/2011 03:24 PM, Praveen G K wrote:
>>>>> 
>>>>> On Wed, Sep 28, 2011 at 2:34 PM, J Freyensee
>>>>> <james_p_freyensee@linux.intel.com>  wrote:
>>>>>> 
>>>>>> On 09/28/2011 02:03 PM, Praveen G K wrote:
>>>>>>> 
>>>>>>> On Wed, Sep 28, 2011 at 2:01 PM, J Freyensee
>>>>>>> <james_p_freyensee@linux.intel.com>    wrote:
>>>>>>>> 
>>>>>>>> On 09/28/2011 01:34 PM, Praveen G K wrote:
>>>>>>>>> 
>>>>>>>>> On Wed, Sep 28, 2011 at 12:59 PM, J Freyensee
>>>>>>>>> <james_p_freyensee@linux.intel.com>      wrote:
>>>>>>>>>> 
>>>>>>>>>> On 09/28/2011 12:06 PM, Praveen G K wrote:
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, Sep 27, 2011 at 10:42 PM, Linus Walleij
>>>>>>>>>>> <linus.walleij@linaro.org>        wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> On Fri, Sep 23, 2011 at 7:05 AM, Praveen G K<praveen.gk@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> I am working on the block driver module of the eMMC driver (SDIO
>>>>>>>>>>>>> 3.0
>>>>>>>>>>>>> controller).  I am seeing very low write speed for eMMC transfers.
>>>>>>>>>>>>> On
>>>>>>>>>>>>> further debugging, I observed that every 63rd and 64th transfer
>>>>>>>>>>>>> takes
>>>>>>>>>>>>> a long time.
>>>>>>>>>>>> 
>>>>>>>>>>>> Are you not just seeing the card-internal garbage collection?
>>>>>>>>>>>> http://lwn.net/Articles/428584/
>>>>>>>>>>> 
>>>>>>>>>>> Does this mean, theoretically, I should be able to achieve larger
>>>>>>>>>>> speeds if I am not using linux?
>>>>>>>>>> 
>>>>>>>>>> In theory in a fairy-tale world, maybe, in reality, not really.  In
>>>>>>>>>> R/W
>>>>>>>>>> performance measurements we have done, eMMC performance in products
>>>>>>>>>> users
>>>>>>>>>> would buy falls well, well short of any theoretical numbers.  We
>>>>>>>>>> believe
>>>>>>>>>> in
>>>>>>>>>> theory, the eMMC interface should be able to support up to 100MB/s,
>>>>>>>>>> but
>>>>>>>>>> in
>>>>>>>>>> reality on real customer platforms write bandwidths (for example)
>>>>>>>>>> barely
>>>>>>>>>> approach 20MB/s, regardless if it's a Microsoft Windows environment
>>>>>>>>>> or
>>>>>>>>>> Android (Linux OS environment we care about).  So maybe it is
>>>>>>>>>> software
>>>>>>>>>> implementation issues of multiple OSs preventing higher eMMC
>>>>>>>>>> performance
>>>>>>>>>> numbers (hence the reason why I sometimes ask basic coding questions
>>>>>>>>>> of
>>>>>>>>>> the
>>>>>>>>>> MMC subsystem- the code isn't the easiest to follow); however, one
>>>>>>>>>> looks
>>>>>>>>>> no
>>>>>>>>>> further than what Apple has done with the iPad2 to see that eMMC
>>>>>>>>>> probably
>>>>>>>>>> just is not a good solution to use in the first place.  We have
>>>>>>>>>> measured
>>>>>>>>>> Apple's iPad2 write performance on *WHAT A USER WOULD SEE* being
>>>>>>>>>> double
>>>>>>>>>> what
>>>>>>>>>> we see with products using eMMC solutions. The big difference?  Apple
>>>>>>>>>> doesn't use eMMC at all for the iPad2.
>>>>>>>>> 
>>>>>>>>> Thanks for all the clarification.  The problem is I am seeing write
>>>>>>>>> speeds of about 5MBps on a Sandisk eMMC product and I can clearly see
>>>>>>>>> the time lost when measured between sending a command and receiving a
>>>>>>>>> data irq.  I am not sure what kind of an issue this is.  5MBps feels
>>>>>>>>> really slow but can the internal housekeeping of the card take so much
>>>>>>>>> time?
>>>>>>>> 
>>>>>>>> Have you tried to trace through all structs used for an MMC
>>>>>>>> operation??!
>>>>>>>> Good gravy, there are request, mmc_queue, mmc_card, mmc_host,
>>>>>>>> mmc_blk_request, mmc_request, multiple mmc_command and multiple
>>>>>>>> scatterlists
>>>>>>>> that these other structs use...I've been playing around on trying to
>>>>>>>> cache
>>>>>>>> some things to try and improve performance and it blows me away how
>>>>>>>> many
>>>>>>>> variables and pointers I have to keep track of for one operation going
>>>>>>>> to
>>>>>>>> an
>>>>>>>> LBA on an MMC.  I keep wondering if more of the 'struct request' could
>>>>>>>> have
>>>>>>>> been used, and 1/3 of these structures could be eliminated.  And
>>>>>>>> another
>>>>>>>> thing I wonder too is how much of this infrastructure is really needed,
>>>>>>>> that
>>>>>>>> when I do ask "what is this for?" question on the list and no one
>>>>>>>> responds,
>>>>>>>> if anyone else understands if it's needed either.
>>>>>>> 
>>>>>>> I know I am not using the scatterlists, since the scatterlists are
>>>>>>> aggregated into a 64k bounce buffer.  Regarding the different structs,
>>>>>>> I am just taking them on face value assuming everything works "well".
>>>>>>> But, my concern is why does it take such a long time (250 ms) to
>>>>>>> return a transfer complete interrupt on occasional cases.  During this
>>>>>>> time, the kernel is just waiting for the txfer_complete interrupt.
>>>>>>> That's it.
>>>>>> 
>>>>>> I think one fundamental problem with execution of the MMC commands is
>>>>>> even
>>>>>> though the MMC has its own structures and own DMA/Host-controller, the
>>>>>> OS's
>>>>>> block subsystem and MMC subsystem do not really run independent of each
>>>>>> other, and each is still tied to the other's fate, holding up
>>>>>> performance
>>>>>> of the kernel in general.
>>>>>> 
>>>>>> In particular, I have found that in the 2.6.36+ kernels, the sooner
>>>>>> you
>>>>>> can retire the 'struct request *req' (ie using __blk_end_request()) with
>>>>>> respect to when the mmc_wait_for_req() call is made, the higher
>>>>>> performance
>>>>>> you are going to get out of the OS in terms of reads/writes using an MMC.
>>>>>> mmc_wait_for_req() is a blocking call, so that OS 'struct request req'
>>>>>> will
>>>>>> just sit around and do nothing until mmc_wait_for_req() is done.  I have
>>>>>> been able to do some caching of some commands, calling
>>>>>> __blk_end_request()
>>>>>> before mmc_wait_for_req(), and getting much higher performance in a few
>>>>>> experiments (but the work certainly is not ready for prime-time).
>>>>>> 
>>>>>> Now in the 3.0 kernel I know mmc_wait_for_req() has changed and the goal
>>>>>> was
>>>>>> to try and make that function a bit more non-blocking, but I have not
>>>>>> played
>>>>>> with it too much because my current focus is on existing products and no
>>>>>> handheld product uses a 3.0 kernel yet (that I am aware of at least).
>>>>>> However, I still see the fundamental problem is that the MMC stack,
>>>>>> which
>>>>>> was probably written with the intended purpose to be independent of the
>>>>>> OS
>>>>>> block subsystem (struct request and other stuff), really isn't
>>>>>> independent
>>>>>> of the OS block subsystem and will cause holdups between one another,
>>>>>> thereby dragging down read/write performance of the MMC.
>>>>>> 
>>>>>> The other fundamental problem is the writes themselves.  Way, WAY more
>>>>>> writes occur on a handheld system in an end-user's hands than reads.
>>>>>> Fundamental computer principle states "you make the common case fast". So
>>>>>> focus should be on how to complete a write operation the fastest way
>>>>>> possible.
>>>>> 
>>>>> Thanks for the detailed explanation.
>>>>> Please let me know if there is a fundamental issue with the way I am
>>>>> inserting the high res timers.  In the block.c file, I am timing the
>>>>> transfers as follows
>>>>> 
>>>>> 1. Start timer
>>>>> mmc_queue_bounce_pre()
>>>>> mmc_wait_for_req()
>>>>> mmc_queue_bounce_post()
>>>>> End timer
>>>>> 
>>>>> So, I don't really have to worry about the blk_end_request right.
>>>>> Like you said, wait_for_req is a blocking wait.  I don't see what is
>>>>> wrong with that being a blocking wait, because until you get the data
>>>>> xfer complete irq, there is no point in going ahead.  The
>>>>> blk_end_request comes later in the picture only when all the data is
>>>>> transferred to the card.
>>>> 
>>>> Yes, that is correct.
>>>> 
>>>> But if you can do some cache trickery or queue tricks, you can delay when
>>>> you have to actually write to the MMC, so then __blk_end_request() and
>>>> retiring the 'struct request *req' becomes the time-sync.  That is a reason
>>>> why mmc_wait_for_req() got some work done on it in the 3.0 kernel.  The OS
>>>> does not have to wait for the host controller to complete the operation (ie,
>>>> block on mmc_wait_for_data()) if there is no immediate dependency on that
>>>> data- that is kind-of dumb.  This is why this can be a problem and a time
>>>> sync.  It's no different than out-of-order execution in CPUs.
>>> 
>>> Thanks I'll look into the 3.0 code to see what the changes are and
>>> whether it can improve the speed.  Thanks for your suggestions.
>>> 
>>>>> My line of thought is that the card is taking a lot of time for its
>>>>> internal housekeeping.
>>>> 
>>>> Each 'write' to a solid-state/nand/flash requires an erase operation first,
>>>> so yes, there is more housekeeping going on than a simple 'write'.
>>>> 
>>>> But, I want to be absolutely sure of my
>>>>> 
>>>>> analysis before I can pass that judgement.
>>>>> 
>>>>> I have also used another Toshiba card that gives me about 12 MBps
>>>>> write speed for the same code, but what I am worried about is whether I am
>>>>> masking some issue by blaming it on the card.  What if the Toshiba
>>>>> card can give a throughput more than 12MBps ideally?
>>>> 
>>>> No clue...you'd have to talk to Toshiba.
>>>> 
>>>>> 
>>>>> Or could there be an issue that the irq handler(sdhci_irq) is called
>>>>> with some kind of a delay and is there a possibility that we are not
>>>>> capturing the transfer complete interrupt immediately?
>>>>> 
>>>>>>> 
>>>>>>>> I mean, for the usual transfers it takes about 3ms to transfer
>>>>>>>>> 
>>>>>>>>> 64kB of data, but for the 63rd and 64th transfers, it takes 250 ms.
>>>>>>>>> The thing is this is not on a file system.  I am measuring the speed
>>>>>>>>> using basic "dd" command to write directly to the block device.
>>>>>>>>> 
>>>>>>>>>> So, is this a software issue? or if
>>>>>>>>>>> 
>>>>>>>>>>> there is a way to increase the size of bounce buffers to 4MB?
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>>> Yours,
>>>>>>>>>>>> Linus Walleij
>>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-mmc"
>>>>>>>>>>> in
>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> J (James/Jay) Freyensee
>>>>>>>>>> Storage Technology Group
>>>>>>>>>> Intel Corporation
>>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-mmc"
>>>>>>>>> in
>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> J (James/Jay) Freyensee
>>>>>>>> Storage Technology Group
>>>>>>>> Intel Corporation
>>>>>>>> 
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-mmc" in
>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> J (James/Jay) Freyensee
>>>>>> Storage Technology Group
>>>>>> Intel Corporation
>>>>>> 
>>>> 
>> 
>> some questions:
>> 
>> does using a bounce buffer make things faster?
>>
>> I think you are using SDMA.  I am wondering if there is a way to increase the xfer size.
>> Is there some magic number inside the mmc code that can be increased?
> 
> The bounce buffer increases the speed, but that is limited to 64kB.  I
> don't know why it is limited to that number though.
>> Philip


I think there is a way to increase the size of the transfer but I cannot see it in the mmc/ directory.
I wonder if it is related to the file system type.


>> 
>> 
>>>> 
>>>> --
>>>> J (James/Jay) Freyensee
>>>> Storage Technology Group
>>>> Intel Corporation
>>>> 
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-mmc" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 
>> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: slow eMMC write speed
  2011-09-28 20:34       ` Praveen G K
  2011-09-28 21:01         ` J Freyensee
@ 2011-09-29  7:05         ` Linus Walleij
  2011-09-29  7:33           ` Linus Walleij
  1 sibling, 1 reply; 32+ messages in thread
From: Linus Walleij @ 2011-09-29  7:05 UTC (permalink / raw)
  To: Praveen G K; +Cc: J Freyensee, linux-mmc, Arnd Bergmann, Jon Medhurst

On Wed, Sep 28, 2011 at 10:34 PM, Praveen G K <praveen.gk@gmail.com> wrote:

> The problem is I am seeing write
> speeds of about 5MBps on a Sandisk eMMC product and I can clearly see
> the time lost when measured between sending a command and receiving a
> data irq.  I am not sure what kind of an issue this is.  5MBps feels
> really slow but can the internal housekeeping of the card take so much
> time?

It can indeed take as much time as it wants as long as it meets
the specifications; what does your datasheet say?

If you connect a signal analyzer to your MMC bus you *will*
know for sure whether this is actually caused by the card,
or if there is some kernel irq/workqueue latency involved.

If the card is the issue, what you can do to improve
performance is to look for other eMMC vendors... ;-)

> I mean, for the usual transfers it takes about 3ms to transfer
> 64kB of data, but for the 63rd and 64th transfers, it takes 250 ms.
> The thing is this is not on a file system.  I am measuring the speed
> using basic "dd" command to write directly to the block device.

Have you tried to make a deeper analysis of the card
characteristics using Arnd Bergmann's "flashbench" tool?
http://git.linaro.org/gitweb?p=people/arnd/flashbench.git;a=summary

There you will see for sure if there are some problematic
read boundaries on this specific card.
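
The usual starting point is the align test, something like this (flags
as per the flashbench README, which may have changed since):

	$ ./flashbench -a /dev/mmcblk0 --blocksize=1024

A jump in the reported access times at a given power-of-two boundary
usually marks the erase block / allocation unit size.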

Yours,
Linus Walleij

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: slow eMMC write speed
  2011-09-28 21:34             ` J Freyensee
  2011-09-28 22:24               ` Praveen G K
@ 2011-09-29  7:24               ` Linus Walleij
  2011-09-29  8:17                 ` Per Förlin
  1 sibling, 1 reply; 32+ messages in thread
From: Linus Walleij @ 2011-09-29  7:24 UTC (permalink / raw)
  To: J Freyensee
  Cc: Praveen G K, linux-mmc, Per Forlin, Arnd Bergmann, Jon Medhurst

On Wed, Sep 28, 2011 at 11:34 PM, J Freyensee
<james_p_freyensee@linux.intel.com> wrote:

> Now in the 3.0 kernel I know mmc_wait_for_req() has changed and the goal was
> to try and make that function a bit more non-blocking,

What has been done by Per Förlin is to add pre_req/post_req hooks
for the datapath. This will improve data transfers in general if and
only if the driver can do some meaningful work in these hooks, so
your driver needs to be patched to use these.

Per patched a few select drivers to prepare the DMA buffers
at this time. In our case (mmci.c) dma_map_sg() can be done in
parallel with an ongoing transfer.

In our case (ux500, mmci, dma40) we don't have bounce buffers
so the only thing that will happen in parallel with ongoing transfers
is L2 and L1 cache flush. *still* we see a noticeable improvement in
throughput, most in L2, but even on the U300 which only does L1
cache I see some small improvements.

I *guess* if you're using bounce buffers, the gain will be even
more pronounced.

(Per, correct me if I'm wrong on any of this...)
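
Concretely, the hooks are two new entries in struct mmc_host_ops
(3.0-era include/linux/mmc/host.h, trimmed down here):

	struct mmc_host_ops {
		/* called while the previous request may still be in
		 * flight: prepare the next one (e.g. dma_map_sg(),
		 * cache maintenance) so the work overlaps the
		 * ongoing transfer */
		void	(*pre_req)(struct mmc_host *host,
				   struct mmc_request *req,
				   bool is_first_req);
		/* called once the request completes: undo or finish
		 * the preparation (e.g. dma_unmap_sg()) */
		void	(*post_req)(struct mmc_host *host,
				    struct mmc_request *req,
				    int err);
		/* ... */
	};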

> with it too much because my current focus is on existing products and no
> handheld product uses a 3.0 kernel yet (that I am aware of at least).
>  However, I still see the fundamental problem is that the MMC stack, which
> was probably written with the intended purpose to be independent of the OS
> block subsystem (struct request and other stuff), really isn't independent
> of the OS block subsystem and will cause holdups between one another,
> thereby dragging down read/write performance of the MMC.

There are two issues IIRC:

- The block layer does not provide enough buffers at a time for
  the out-of-order buffer pre/post preps to take effect, I think this
  was during writes only (Per, can you elaborate?)

- Anything related to card geometries and special sectors and
  sector sizes etc, i.e. the stuff that Arnd has analyzed in detail,
  also Tixy looked into that for some cards IIRC.

Each needs to be addressed and is currently "to be done".

> The other fundamental problem is the writes themselves.  Way, WAY more
> writes occur on a handheld system in an end-user's hands than reads.
> Fundamental computer principle states "you make the common case fast". So
> focus should be on how to complete a write operation the fastest way
> possible.

First case above I think, yep it needs looking into...

Yours,
Linus Walleij

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: slow eMMC write speed
  2011-09-29  7:05         ` Linus Walleij
@ 2011-09-29  7:33           ` Linus Walleij
  0 siblings, 0 replies; 32+ messages in thread
From: Linus Walleij @ 2011-09-29  7:33 UTC (permalink / raw)
  To: Praveen G K; +Cc: J Freyensee, linux-mmc, Arnd Bergmann, Jon Medhurst

On Thu, Sep 29, 2011 at 9:05 AM, Linus Walleij <linus.walleij@linaro.org> wrote:

> Have you tried to make a deeper analysis of the card
> characteristics using Arnd Bergmanns "flashbench" tool?
> http://git.linaro.org/gitweb?p=people/arnd/flashbench.git;a=summary
>
> There you will see for sure if there are some problematic
> read boundaries on this specific card.

For comparison here are:

Condensed card survey:
https://wiki.linaro.org/WorkingGroups/Kernel/Projects/FlashCardSurvey

Tixy's detailed benchmarks for some specific SD cards:
http://yxit.co.uk/public/flash-performance/doc/cards

Yours,
Linus Walleij

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: slow eMMC write speed
  2011-09-29  7:24               ` Linus Walleij
@ 2011-09-29  8:17                 ` Per Förlin
  2011-09-29 20:16                   ` J Freyensee
  0 siblings, 1 reply; 32+ messages in thread
From: Per Förlin @ 2011-09-29  8:17 UTC (permalink / raw)
  To: Linus Walleij
  Cc: J Freyensee, Praveen G K, linux-mmc, Arnd Bergmann, Jon Medhurst

On 09/29/2011 09:24 AM, Linus Walleij wrote:
> On Wed, Sep 28, 2011 at 11:34 PM, J Freyensee
> <james_p_freyensee@linux.intel.com> wrote:
> 
>> Now in the 3.0 kernel I know mmc_wait_for_req() has changed and the goal was
>> to try and make that function a bit more non-blocking,
> 
> What has been done by Per Förlin is to add pre_req/post_req hooks
> for the datapath. This will improve data transfers in general if and
> only if the driver can do some meaningful work in these hooks, so
> your driver needs to be patched to use these.
> 
> Per patched a few select drivers to prepare the DMA buffers
> at this time. In our case (mmci.c) dma_map_sg() can be done in
> parallel with an ongoing transfer.
> 
> In our case (ux500, mmci, dma40) we don't have bounce buffers
> so the only thing that will happen in parallel with ongoing transfers
> is L2 and L1 cache flush. *still* we see a noticeable improvement in
> throughput, most in L2, but even on the U300 which only does L1
> cache I see some small improvements.
> 
> I *guess* if you're using bounce buffers, the gain will be even
> more pronounced.
> 
> (Per, correct me if I'm wrong on any of this...)
> 
Summary:
* The mmc block driver runs mmc_blk_rw_rq_prep(), mmc_queue_bounce_post() and __blk_end_request() in parallel with an ongoing mmc transfer.
* The driver may use the hooks to schedule low level work such as preparing dma and caches in parallel with ongoing mmc transfer.
* The big benefit of this is when using DMA and running the CPU at a lower speed. Here's an example of that: https://wiki.linaro.org/WorkingGroups/Kernel/Specs/StoragePerfMMC-async-req#Block_device_tests_with_governor
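
In block.c terms the issue path ends up shaped roughly like this (a
much-simplified sketch, not the actual code):

	struct mmc_async_req *prev;
	int status;

	/* issue the request prepared for this iteration; the call
	 * returns when the *previous* request completes on the bus */
	prev = mmc_start_req(card->host, &mqrq_cur->mmc_active, &status);

	/* while the new request transfers, retire the completed one:
	 * mmc_queue_bounce_post(), __blk_end_request(), and prepare
	 * the next one via the pre_req() hook */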


>> with it too much because my current focus is on existing products and no
>> handheld product uses a 3.0 kernel yet (that I am aware of at least).
>>  However, I still see the fundamental problem is that the MMC stack, which
>> was probably written with the intended purpose to be independent of the OS
>> block subsystem (struct request and other stuff), really isn't independent
>> of the OS block subsystem and will cause holdups between one another,
>> thereby dragging down read/write performance of the MMC.
> 
> There are two issues IIRC:
> 
> - The block layer does not provide enough buffers at a time for
>   the out-of-order buffer pre/post preps to take effect, I think this
>   was during writes only (Per, can you elaborate?)
> 
Writes are buffered and pushed down many in one go. This means they can easily be scheduled to be prepared while another is being transferred.
Large contiguous reads are pushed down to MMC synchronously, one request per read-ahead size. The next large contiguous read will wait in the block layer and not start until the current one is complete. Read more about the details here: https://wiki.linaro.org/WorkingGroups/Kernel/Specs/StoragePerfMMC-async-req#Analysis_of_how_block_layer_adds_read_request_to_the_mmc_block_queue

> - Anything related to card geometries and special sectors and
>   sector sizes etc, i.e. the stuff that Arnd has analyzed in detail,
>   also Tixy looked into that for some cards IIRC.
> 
> Each needs to be addressed and is currently "to be done".
> 
>> The other fundamental problem is the writes themselves.  Way, WAY more
>> writes occur on a handheld system in an end-user's hands than reads.
>> Fundamental computer principle states "you make the common case fast". So
>> focus should be on how to complete a write operation the fastest way
>> possible.
> 
> First case above I think, yep it needs looking into...
> 
The mmc non-blocking patches only try to move any overhead in parallel with the transfer. The actual transfer speed of MMC reads and writes is unaffected. I am hoping that the eMMC v4.5 packed commands support (the ability to group a series of commands in a single data transaction) will help to boost performance in the future.

Regards,
Per

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: slow eMMC write speed
  2011-09-29  8:17                 ` Per Förlin
@ 2011-09-29 20:16                   ` J Freyensee
  2011-09-30  8:22                     ` Andrei E. Warkentin
  0 siblings, 1 reply; 32+ messages in thread
From: J Freyensee @ 2011-09-29 20:16 UTC (permalink / raw)
  To: Per Förlin
  Cc: Linus Walleij, Praveen G K, linux-mmc, Arnd Bergmann, Jon Medhurst

On 09/29/2011 01:17 AM, Per Förlin wrote:
> On 09/29/2011 09:24 AM, Linus Walleij wrote:
>> On Wed, Sep 28, 2011 at 11:34 PM, J Freyensee
>> <james_p_freyensee@linux.intel.com>  wrote:
>>
>>> Now in the 3.0 kernel I know mmc_wait_for_req() has changed and the goal was
>>> to try and make that function a bit more non-blocking,
>>
>> What has been done by Per Förlin is to add pre_req/post_req hooks
>> for the datapath. This will improve data transfers in general if and
>> only if the driver can do some meaningful work in these hooks, so
>> your driver needs to be patched to use these.
>>
>> Per patched a few select drivers to prepare the DMA buffers
>> at this time. In our case (mmci.c) dma_map_sg() can be done in
>> parallel with an ongoing transfer.
>>
>> In our case (ux500, mmci, dma40) we don't have bounce buffers
>> so the only thing that will happen in parallel with ongoing transfers
>> is L2 and L1 cache flush. *still* we see a noticeable improvement in
>> throughput, most in L2, but even on the U300 which only does L1
>> cache I see some small improvements.
>>
>> I *guess* if you're using bounce buffers, the gain will be even
>> more pronounced.
>>
>> (Per, correct me if I'm wrong on any of this...)
>>
> Summary:
> * The mmc block driver runs mmc_blk_rw_rq_prep(), mmc_queue_bounce_post() and __blk_end_request() in parallel with an ongoing mmc transfer.
> * The driver may use the hooks to schedule low level work such as preparing dma and caches in parallel with ongoing mmc transfer.
> * The big benefit of this is when using DMA and running the CPU at a lower speed. Here's an example of that: https://wiki.linaro.org/WorkingGroups/Kernel/Specs/StoragePerfMMC-async-req#Block_device_tests_with_governor
>
>
>>> with it too much because my current focus is on existing products and no
>>> handheld product uses a 3.0 kernel yet (that I am aware of at least).
>>>   However, I still see the fundamental problem is that the MMC stack, which
>>> was probably written with the intended purpose to be independent of the OS
>>> block subsystem (struct request and other stuff), really isn't independent
>>> of the OS block subsystem and will cause holdups between one another,
>>> thereby dragging down read/write performance of the MMC.
>>
>> There are two issues IIRC:
>>
>> - The block layer does not provide enough buffers at a time for
>>    the out-of-order buffer pre/post preps to take effect, I think this
>>    was during writes only (Per, can you elaborate?)

As I've been playing around with buffering/caching, it seems to me 
an opportunity to simplify things in the MMC space is to eliminate the 
need for a mmc_blk_request struct or mmc_request struct.  Looking 
through mmc_blk_issue_rw_rq(), there is a lot of work to initialize 
struct mmc_blk_request brq, only to pass a struct mmc_queue variable to 
the actual mmc_wait_for_req() instead.  In fact, some of the parameters in 
the struct mmc_blk_request member brq.mrq (of type mmc_request) wind up 
just pointing to members in struct mmc_blk_request brq.  Granted, I 
totally don't understand everything going on here and I haven't studied 
this code nearly as long as others, but when I see something like this, 
the first thing that comes up in my mind is 'elimination/simplification 
opportunity'.
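
For reference, the backing store being described is this
(drivers/mmc/card/block.c, 3.0-era layout, trimmed):

	struct mmc_blk_request {
		struct mmc_request	mrq;
		struct mmc_command	sbc;	/* SET_BLOCK_COUNT */
		struct mmc_command	cmd;
		struct mmc_command	stop;
		struct mmc_data		data;
	};

mmc_blk_rw_rq_prep() fills in cmd/stop/data and then points
brq->mrq.cmd, brq->mrq.stop and brq->mrq.data back at those members,
which is the duplication I mean.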

>>
> Writes are buffered and pushed down many in one go. This means they can easily be scheduled to be prepared while another is being transferred.
> Large contiguous reads are pushed down to MMC synchronously, one request per read-ahead size. The next large contiguous read will wait in the block layer and not start until the current one is complete. Read more about the details here: https://wiki.linaro.org/WorkingGroups/Kernel/Specs/StoragePerfMMC-async-req#Analysis_of_how_block_layer_adds_read_request_to_the_mmc_block_queue
>
>> - Anything related to card geometries and special sectors and
>>    sector sizes etc, i.e. the stuff that Arnd has analyzed in detail,
>>    also Tixy looked into that for some cards IIRC.
>>
>> Each needs to be addressed and is currently "to be done".
>>
>>> The other fundamental problem is the writes themselves.  Way, WAY more
>>> writes occur on a handheld system in an end-user's hands than reads.
>>> Fundamental computer principle states "you make the common case fast". So
>>> focus should be on how to complete a write operation the fastest way
>>> possible.
>>
>> First case above I think, yep it needs looking into...
>>
> The mmc non-blocking patches only try to move any overhead in parallel with the transfer. The actual transfer speed of MMC reads and writes is unaffected. I am hoping that the eMMC v4.5 packed commands support (the ability to group a series of commands in a single data transaction) will help to boost performance in the future.
>
> Regards,
> Per


-- 
J (James/Jay) Freyensee
Storage Technology Group
Intel Corporation

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: slow eMMC write speed
  2011-09-29 20:16                   ` J Freyensee
@ 2011-09-30  8:22                     ` Andrei E. Warkentin
  2011-10-01  0:33                       ` J Freyensee
  0 siblings, 1 reply; 32+ messages in thread
From: Andrei E. Warkentin @ 2011-09-30  8:22 UTC (permalink / raw)
  To: J Freyensee, Praveen G K
  Cc: Per Förlin, Linus Walleij, linux-mmc, Arnd Bergmann, Jon Medhurst

Hi James,

2011/9/29 J Freyensee <james_p_freyensee@linux.intel.com>:
> As I've been playing around with buffering/caching, it seems to me an
> opportunity to simplify things in the MMC space is to eliminate the need for
> a mmc_blk_request struct or mmc_request struct.  Looking through
> mmc_blk_issue_rw_rq(), there is a lot of work to initialize struct
> mmc_blk_request brq, only to pass a struct mmc_queue variable to the actual
> mmc_wait_for_req() instead.  In fact, some of the parameters in the struct
> mmc_blk_request member brq.mrq (of type mmc_request) wind up just pointing
> to members in struct mmc_blk_request brq.  Granted, I totally don't
> understand everything going on here and I haven't studied this code nearly
> as long as others, but when I see something like this, the first thing that
> comes up in my mind is 'elimination/simplification opportunity'.

mmc_request is what is actually handled by the SD/MMC host driver -
compact representation of
what needs to be done (unneeded fields are NULL pointers).
mmc_blk_request is basically the backing
store for these fields, for the block driver. I would guess the
mmc_request doesn't contain the fields
because it would be inefficient (or correct me). And since there is
quite a bit of logic behind running the
actual MMC commands (esp. w.r.t. mrq->sbc and mrq->stop), it would
not be a good idea to get
rid of mmc_request and pull the strings from card drivers either.
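
For anyone reading along without a tree handy, mmc_request is roughly
this (3.0-era include/linux/mmc/core.h, trimmed):

	struct mmc_request {
		struct mmc_command	*sbc;	/* SET_BLOCK_COUNT, or NULL */
		struct mmc_command	*cmd;
		struct mmc_data		*data;	/* NULL for command-only */
		struct mmc_command	*stop;	/* NULL if no stop phase */

		struct completion	completion;
		void			(*done)(struct mmc_request *);
	};

so "unneeded fields are NULL pointers" is quite literal.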

Praveen:

As for timings on Toshiba cards, you can search "MMC quirks relating
to performance/lifetime." in the archives.
There was quite a lot of very interesting data and discussions
specifically regarding performance, and it would
be pretty impossible and a disservice to try to summarize it :-). In
short, I've definitely seen 100ms blips, brought on
by extra GC caused by unaligned accesses across allocation units (if I
remember correctly). You could try and reduce
the worst case, but it would make the average case worse. It's a bit
of voodoo. The best solution is to interact with your vendor
and get suggestions on use and errata.

A

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: slow eMMC write speed
  2011-09-30  8:22                     ` Andrei E. Warkentin
@ 2011-10-01  0:33                       ` J Freyensee
  2011-10-02  6:20                         ` Andrei E. Warkentin
  0 siblings, 1 reply; 32+ messages in thread
From: J Freyensee @ 2011-10-01  0:33 UTC (permalink / raw)
  To: Andrei E. Warkentin
  Cc: Praveen G K, Per Förlin, Linus Walleij, linux-mmc,
	Arnd Bergmann, Jon Medhurst

On 09/30/2011 01:22 AM, Andrei E. Warkentin wrote:
> Hi James,
>
> 2011/9/29 J Freyensee<james_p_freyensee@linux.intel.com>:
>> As I've been playing around with buffering/caching, it seems to me an
>> opportunity to simplify things in the MMC space is to eliminate the need for
>> a mmc_blk_request struct or mmc_request struct.  Looking through
>> mmc_blk_issue_rw_rq(), there is a lot of work to initialize struct
>> mmc_blk_request brq, only to pass a struct mmc_queue variable to the actual
>> mmc_wait_for_req() instead.  In fact, some of the parameters in the struct
>> mmc_blk_request member brq.mrq (of type mmc_request) wind up just pointing
>> to members in struct mmc_blk_request brq.  Granted, I totally don't
>> understand everything going on here and I haven't studied this code nearly
>> as long as others, but when I see something like this, the first thing that
>> comes up in my mind is 'elimination/simplification opportunity'.
>
> mmc_request is what is actually handled by the SD/MMC host driver -
> compact representation of
> what needs to be done (unneeded fields are NULL pointers).
> mmc_blk_request is basically the backing
> store for these fields, for the block driver. I would guess the
> mmc_request doesn't contain the fields
> because it would be inefficient (or correct me). And since there is
> quite a bit of logic behind running the
> actual MMC commands (esp. w.r.t. mrq->sbc and mrq->stop), it would
> not be a good idea to get
> rid of mmc_request and pull the strings from card drivers either.

So I have a question on write behavior.

Say mmc_blk_issue_rw_rq() is called.  Say the mmc_queue *mq variable 
passed in is a write.  Say that write is buffered, and its submission 
via mmc_wait_for_req() is delayed for 5 seconds before the request is 
actually issued.  Would that delay in sending the brq->mrq entry to 
mmc_wait_for_req() cause a timeout, i.e.:

mmc0: Timeout waiting for hardware interrupt.

??

If this is true, how would you extend the timeout?  I would not have 
expected this until mmc_wait_for_req() is called.  It appeared to me 
mmc_set_data_timeout() was just setting a variable in brq to be used 
when mmc_wait_for_req() is called.  I only see this behavior in eMMC 
cards, not MMC cards being stuck into an MMC slot of a laptop.

Thanks!

>
> Praveen:
>
> As for timings on Toshiba cards, you can search "MMC quirks relating
> to performance/lifetime." in the archives.
> There was quite a lot of very interesting data and discussions
> specifically regarding performance, and it would
> be pretty impossible and a disservice to try to summarize it :-). In
> short, I've definitely seen 100ms blips, made more pronounced
> by extra GC caused by unaligned accesses across allocation units (if I
> remember correctly). You could try to reduce
> the worst case, but it would make the average case worse. It's a bit
> of voodoo. The best solution is to interact with your vendor
> and get suggestions on use and errata.
>
> A


-- 
J (James/Jay) Freyensee
Storage Technology Group
Intel Corporation

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: slow eMMC write speed
  2011-10-01  0:33                       ` J Freyensee
@ 2011-10-02  6:20                         ` Andrei E. Warkentin
  2011-10-03 18:01                           ` J Freyensee
  0 siblings, 1 reply; 32+ messages in thread
From: Andrei E. Warkentin @ 2011-10-02  6:20 UTC (permalink / raw)
  To: J Freyensee
  Cc: Praveen G K, Per Förlin, Linus Walleij, linux-mmc,
	Arnd Bergmann, Jon Medhurst

Hi James,

2011/9/30 J Freyensee <james_p_freyensee@linux.intel.com>:
>
> So I have a question on write behavior.
>
> Say mmc_blk_issue_rw_rq() is called.  Say the mmc_queue *mq variable passed
> in is a write.

You mean the struct request?

> Say that write is buffered, delayed into being sent via
> mmc_wait_for_req() for 5 seconds, and it's sent to mmc_wait_for_req() later.
>  Would that delay of sending the brq->mrq entry to mmc_wait_for_req() cause
> a timeout, ie:

Are you working off of mmc-next? Sounds like you don't have Per
Förlin's async work yet.
I don't want to sound lame (yes, I know your Medfield or whatever
projects build on a particular baseline),
but you would be doing yourself a huge favor by doing your interesting
investigations on top of the top of
tree.

The old code indeed calls mmc_wait_for_req in  mmc_blk_issue_rw_rq,
while the new code does a
mmc_start_req, which waits for a previous async request to complete
before issuing the new one.
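
In code terms the difference is roughly this (signatures as of the
3.1-era core; the mmc_active/status names are from memory, so treat
this as a sketch and double-check against mmc-next):

	struct mmc_async_req *areq;
	int status;

	/* old path: block until this very request has completed */
	mmc_wait_for_req(card->host, &brq->mrq);

	/* new path: submit the next request; this call only waits for
	 * the *previous* async request, so preparing request N+1 (DMA
	 * mapping, bounce copying) overlaps the transfer of request N */
	areq = mmc_start_req(card->host, &mqrq->mmc_active, &status);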

Could you describe in greater detail what you're doing? What exactly
do you mean by buffering?
As far as I understand, until you call mmc_wait_for_req (old code) or
mmc_start_req (new code), your
request only exists as a data structure, and the host controller
doesn't know or care about it. So it doesn't
matter when you send it - now, or five seconds in the future (of
course, you probably don't want other requests
to get ahead of a barrier request).

The mmc_set_data_timeout routine is used to calculate the time the
host controller will wait for the card to
process the read/write. This is obviously tied to the transfer size,
type (read or write), card properties as inferred
from its registers and technology.
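
Concretely, the pairing in the block driver (mmc_blk_issue_rw_rq in
mmc/card/block.c) looks like this:

	/* computes data.timeout_ns/data.timeout_clks from the CSD, the
	 * clock and the transfer direction (writes get more time) */
	mmc_set_data_timeout(&brq.data, card);
	...
	/* the host driver programs that budget into hardware here; only
	 * now can "Timeout waiting for hardware interrupt" fire */
	mmc_wait_for_req(card->host, &brq.mrq);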

>
> mmc0: Timeout waiting for hardware interrupt.
>
> ??
>
> If this is true, how would you extend the timeout?  I would not have
> expected this until mmc_wait_for_req() is called.

The fact that you got a timeout implies that the host was processing a
struct mmc_request already.

> It appeared to me
> mmc_set_data_timeout() was just setting a variable in brq to be used when
> mmc_wait_for_req() is called.  I only see this behavior in eMMC cards, not
> MMC cards being stuck into an MMC slot of a laptop.
>

It's hard to say what is going on without seeing some code. My other suggestion is
to instrument the host driver (and block driver as well) and figure out
what request is failing
and why.

A

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: slow eMMC write speed
  2011-10-02  6:20                         ` Andrei E. Warkentin
@ 2011-10-03 18:01                           ` J Freyensee
  2011-10-03 20:19                             ` Andrei Warkentin
  0 siblings, 1 reply; 32+ messages in thread
From: J Freyensee @ 2011-10-03 18:01 UTC (permalink / raw)
  To: Andrei E. Warkentin
  Cc: Praveen G K, Per Förlin, Linus Walleij, linux-mmc,
	Arnd Bergmann, Jon Medhurst

[-- Attachment #1: Type: text/plain, Size: 4028 bytes --]

On 10/01/2011 11:20 PM, Andrei E. Warkentin wrote:
> Hi James,
>
> 2011/9/30 J Freyensee<james_p_freyensee@linux.intel.com>:
>>
>> So I have a question on write behavior.
>>
>> Say mmc_blk_issue_rw_rq() is called.  Say the mmc_queue *mq variable passed
>> in is a write.
>
> You mean the struct request?
>
>>   Say that write is buffered, delayed into being sent via
>> mmc_wait_for_req() for 5 seconds, and it's sent to mmc_wait_for_req() later.
>>   Would that delay of sending the brq->mrq entry to mmc_wait_for_req() cause
>> a timeout, ie:
>
> Are you working off of mmc-next? Sounds like you don't have Per
> Förlin's async work yet.
> I don't want to sound lame (yes, I know your Medfield or whatever
> projects build on a particular baseline),
> but you would be doing yourself a huge favor by doing your interesting
> investigations on top of the top of
> tree.

Yeah, I know I'd be doing myself a huge favor by working off of mmc-next 
(or close to it), but product-wise, my department doesn't care for 
sustaining current platforms...yet (still trying to convince).

>
> The old code indeed calls mmc_wait_for_req in  mmc_blk_issue_rw_rq,
> while the new code does a
> mmc_start_req, which waits for a previous async request to complete
> before issuing the new one.
>
> Could you describe in greater detail what you're doing? What exactly
> do you mean by buffering?

So I was looking into sticking a write cache into block.c driver as a 
parameter, so it can be turned on and off upon driver load.  Any write 
operation goes to the cache and only on a cache collision will the
write operation get sent to the host controller for a write.  What I 
have working so far is just with an MMC card in an MMC slot of a laptop, 
and just bare-bones.  No general flush routine, error-handling, etc. 
From a couple of performance measurements I did on the MMC slot using 
blktrace/blkparse and 400MB write transactions, I was seeing a huge 
performance boost with no data corruption.  So it is not looking like a 
total harebrained idea.  But I am still pretty far from understanding 
everything here.  And the real payoff we want to see is performance a 
user can see on handheld (i.e., Android) systems.
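
The write path boils down to roughly this (a distilled sketch of what
the attached block.c does, not the verbatim code; idx here stands for
mmc_index_cache(blk_rq_pos(req))):

	switch (mmc_check_cachehit(blk_rq_pos(req), blk_rq_sectors(req))) {
	case 0:	/* exact hit: overwrite the entry in place */
	case 2:	/* slot empty: just take it */
		mmc_insert_cacheentry(blk_rq_pos(req), blk_rq_sectors(req),
				      &brq, req);
		__blk_end_request(req, 0, blk_rq_bytes(req)); /* retire now */
		break;
	case 1:	/* partial hit */
	case 3:	/* collision: flush the old entry to the card first */
		mmc_wait_for_req(card->host,
				 &mmc_cache.entry[idx].brq->mrq);
		mmc_insert_cacheentry(blk_rq_pos(req), blk_rq_sectors(req),
				      &brq, req);
		__blk_end_request(req, 0, blk_rq_bytes(req));
		break;
	}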

I did attach the code if you want to look at it.  I heavily commented 
the code additions I made so it shouldn't be too scary to follow.  Kernel 
contributions I have made in the past have had similar coding 
documentation in them.  I have currently turned on debugging in the host 
controller so I'm trying to understand what is going on there.

Thanks,
Jay


> As far as I understand, until you call mmc_wait_for_req (old code) or
> mmc_start_req (new code), your
> request only exists as a data structure, and the host controller
> doesn't know or care about it. So it doesn't
> matter when you send it - now, or five seconds in the future (of
> course, you probably don't want other requests
> to get ahead of a barrier request).
>
> The mmc_set_data_timeout routine is used to calculate the time the
> host controller will wait for the card to
> process the read/write. This is obviously tied to the transfer size,
> type (read or write), card properties as inferred
> from its registers and technology.
>
>>
>> mmc0: Timeout waiting for hardware interrupt.
>>
>> ??
>>
>> If this is true, how would you extend the timeout?  I would not have
>> expected this until mmc_wait_for_req() is called.
>
> The fact that you got a timeout implies that the host was processing a
> struct mmc_request already.
>
>   It appeared to me
>> mmc_set_data_timeout() was just setting a variable in brq to be used when
>> mmc_wait_for_req() is called.  I only see this behavior in eMMC cards, not
>> MMC cards being stuck into an MMC slot of a laptop.
>>
>
> It's hard to say what is going on without seeing some code. My other suggestion is
> to instrument the host driver (and block driver as well) and figure out
> what request is failing
> and why.
>
> A


-- 
J (James/Jay) Freyensee
Storage Technology Group
Intel Corporation

[-- Attachment #2: block.c --]
[-- Type: text/x-csrc, Size: 41980 bytes --]

/*
 * Block driver for media (i.e., flash cards)
 *
 * Copyright 2002 Hewlett-Packard Company
 * Copyright 2005-2008 Pierre Ossman
 *
 * Use consistent with the GNU GPL is permitted,
 * provided that this copyright notice is
 * preserved in its entirety in all copies and derived works.
 *
 * HEWLETT-PACKARD COMPANY MAKES NO WARRANTIES, EXPRESSED OR IMPLIED,
 * AS TO THE USEFULNESS OR CORRECTNESS OF THIS CODE OR ITS
 * FITNESS FOR ANY PARTICULAR PURPOSE.
 *
 * Many thanks to Alessandro Rubini and Jonathan Corbet!
 *
 * Author:  Andrew Christian
 *          28 May 2002
 */
#define DEBUG
#include <linux/moduleparam.h>
#include <linux/module.h>
#include <linux/init.h>

#include <linux/kernel.h>
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/errno.h>
#include <linux/hdreg.h>
#include <linux/kdev_t.h>
#include <linux/blkdev.h>
#include <linux/mutex.h>
#include <linux/scatterlist.h>
#include <linux/string_helpers.h>

#include <linux/mmc/card.h>
#include <linux/mmc/host.h>
#include <linux/mmc/mmc.h>
#include <linux/mmc/sd.h>

#include <asm/system.h>
#include <asm/uaccess.h>

#if defined(CONFIG_DEBUG_FS)
#include <linux/dcache.h>
#include <linux/debugfs.h>
#endif

#include "queue.h"

MODULE_ALIAS("mmc:block");
#ifdef MODULE_PARAM_PREFIX
#undef MODULE_PARAM_PREFIX
#endif
#define MODULE_PARAM_PREFIX "mmcblk."

// jpf: This isn't working...probably need a spinlock because we probably
// don't want this to sleep
//static DEFINE_MUTEX(cache_mutex);
static DEFINE_MUTEX(block_mutex);

/*
 * The defaults come from config options but can be overridden by module
 * or bootarg options.
 */
static int perdev_minors = CONFIG_MMC_BLOCK_MINORS;

/*
 * We've only got one major, so number of mmcblk devices is
 * limited to 256 / number of minors per device.
 */
static int max_devices;

/* 256 minors, so at most 256 separate devices */
static DECLARE_BITMAP(dev_use, 256);

/*
 * There is one mmc_blk_data per slot.
 */
struct mmc_blk_data {
	spinlock_t	lock;
	struct gendisk	*disk;
	struct mmc_queue queue;

	unsigned int	usage;
	unsigned int	read_only;
};

static DEFINE_MUTEX(open_lock);

module_param(perdev_minors, int, 0444);
MODULE_PARM_DESC(perdev_minors, "Minors numbers to allocate per device");

struct mmc_blk_request {
	struct mmc_request	mrq;
	struct mmc_command	cmd;
	struct mmc_command	stop;
	struct mmc_data		data;
};

/*
 * The following is for a handset cache. Data collected using blktrace 
 * and blkparse on ARM and Intel handset/tablet solutions show 
 * a roughly 10:1 write-to-read ratio on end-user operations, with 
 * many of the writes only being 1-2 sectors.  This is to try to capture 
 * and optimize many of those 1-2 sector writes. Since this has been 
 * primarily targeted for certain Linux-based computers, the parameter
 * defaults to 'off'. 
 *  
 */

/*
 * handset_cachesize is used to specify the size of the cache. 
 * 0 for off; 1-5 is the power-of-two exponent of how 
 * many cache entries (1=2^1=2 entries, 5=2^5=32 entries) there 
 * will be in the cache. 
 * The cachesize should be small and simple,
 * to try and minimize issues like power consumption.  The max 
 * size of the cache is enforced during creation. 
 */
static unsigned short handset_cachesize = 0;
module_param(handset_cachesize, ushort, 0444);
MODULE_PARM_DESC(handset_cachesize, "Small cache targeted for handsets");
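
/*
 * Example usage (illustrative only; loaded as a module the prefix is
 * dropped, built in the bootarg carries it):
 *
 *   modprobe mmc_block handset_cachesize=3     -> 2^3 = 8 entries
 *   kernel cmdline:  mmcblk.handset_cachesize=3
 */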

struct mmc_cache_contents {
	sector_t cache_addr;         /* starting sector address  */
	unsigned int num_sectors;    /* number of sectors/blocks */
        struct mmc_blk_request *brq; /* The 'data' we need to keep */
	unsigned char valid;         /* If data in entry is valid */
	struct request *req;         /* ptr to outstanding request */
};
struct mmc_cache_frame {
	struct mmc_cache_contents *entry;
};
static struct mmc_cache_frame mmc_cache;

/**
 * mmc_cachesize() - return the number of actual or potential
 *      	     entries of the cache, regardless of whether the
 *      	     cache has actually been created yet.
 *  
 * Returns: 
 *      size of the allocated cache based on handset_cachesize; if
 *      handset_cachesize is 0 (cache disabled), returns 0.
 */
static unsigned int mmc_cachesize(void) 
{
	return handset_cachesize ? (1u << handset_cachesize) : 0;
}

/**
 * mmc_create_cache() - Allocate the cache. Cache is created 
 *      		based on handset_cachesize:
 *      		0 = no cache created
 *      		1 = 2^1 = cache size of 2 entries
 *      		2 = 2^2 = cache size of 4 entries
 *      		etc...
 *  
 * Caveats: Since the goal is to keep the cache small and not 
 * blow up power usage or the kernel itself, this function 
 * enforces a maximum cache cap. 
 *  
 * Returns: 
 *      0 for success
 *      -EPERM for inappropriate handset_cachesize value and no
 *       	creation of cache
 *      -ENOMEM for memory issue
 *      other value, error
 *  
 */
static int mmc_create_cache(void) 
{
	const unsigned short MAXSIZE = 5;
	int retval = -EPERM;
	unsigned int i;

	/* 
	 * In case this function gets called with
	 * handset_cachesize equal to 0, we want 
	 * to inform the caller that no cache was created.
	 */ 
	if (handset_cachesize == 0) {
		return retval;
	}
	//mutex_lock(&cache_mutex);
	if (handset_cachesize > MAXSIZE) {
		handset_cachesize = MAXSIZE;
	}
	
	mmc_cache.entry = kmalloc(mmc_cachesize() * 
				  sizeof(struct mmc_cache_contents),
				  GFP_NOWAIT);
	if (mmc_cache.entry == NULL)
		retval = -ENOMEM;
	else {
		/* 
		 * Should be good enough to set 'valid' to 0, NULL brq,
		 * and allow junk data in the rest of the fields.
		 */
		for (i = 0; i < mmc_cachesize(); i++) {
			mmc_cache.entry[i].valid = 0;
			mmc_cache.entry[i].brq   = NULL;
		}
		retval = 0;
	}
	//mutex_unlock(&cache_mutex);
	return retval;
}

/**
 * mmc_index_cache() - provides entry number of the cache based 
 * on the sector number. 
 *  
 * Caveats: Note this should not be used if handset_cachesize is
 * 0. It's really meant as a helper function for all the other 
 * mmc cache functions. 
 *  
 * @sector_addr: Unsigned 64-bit sector address number. 
 * 
 * Returns: 
 * 	Math result of (sector_addr % mmc_cachesize()), i.e. the
 * 	low handset_cachesize bits of sector_addr
 */
static unsigned int mmc_index_cache(sector_t sector_addr)
{
	sector_t mask_modulo = 1;
	unsigned int i = 1;
	const unsigned int CACHESIZE = handset_cachesize;

	while (i < CACHESIZE) {
		mask_modulo = mask_modulo << 1 | 1;
		i++;
	}
	pr_debug("%s return value: %d\n", __func__,
		 ((unsigned int) (sector_addr & mask_modulo)));
	return ((unsigned int) (sector_addr & mask_modulo));
}

/**
 * mmc_insert_cacheentry() - caches entry into the cache. Remember this 
 * is write-policy cache based on workloads measured with 
 * blktrace, so only writes go here. If there is a read miss, it 
 * just simply goes to the storage device to get it's info and 
 * the read data does NOT get pulled from the device and stored 
 * here. 
 *  
 * Caveats: This assumes we have an exact hit (cache_addr + 
 * num_sectors) or it's the first entry (valid == 0).
 * mmc_check_cachehit() should be called first to check 
 * if: 
 *      -we have an exact hit (cache_addr + num_sectors)
 *      -we have a semi-hit (cache_addr hit but num_sectors
 *       miss)
 *  
 * @cache_addr: sector_t address to be used to be inserted into 
 *              the cache.
 * @num_sectors: Number of sectors that will be accessed 
 *               starting with cache_addr.
 * @brq: the mmc_blk_request pointer that is sent to the host 
 *       controller for an actual write to the address.
 * 
 * Returns: 
 *      positive integer, which is the index of entry that was
 *      	successfully placed into the cache.
 *      negative integer, which symbolizes an error.
 *  
 */
static int mmc_insert_cacheentry(sector_t cache_addr, 
				 unsigned int num_sectors,
				 struct mmc_blk_request *brq,
				 struct request *req)
{
	int retval = -EPERM;
	int cache_index;

	if (handset_cachesize == 0)
		return retval;

	//mutex_lock(&cache_mutex);

	cache_index = mmc_index_cache(cache_addr);
	mmc_cache.entry[cache_index].cache_addr  = cache_addr;
	mmc_cache.entry[cache_index].num_sectors = num_sectors;

	if (mmc_cache.entry[cache_index].brq == NULL) {
		mmc_cache.entry[cache_index].brq = 
			kzalloc(sizeof(struct mmc_blk_request), 
			GFP_NOWAIT);
		if (mmc_cache.entry[cache_index].brq != NULL) {
		
		/* in lib/scatterlist.c, it is unlikely the scatterlist
		 * is actually chained.  In other words, it is expected
		 * that a scatterlist here will only have 1 node instead 
		 * of a chain (which is 'very unlikely'). So 
		 * assuming one scatterlist node.
		 */ 
			mmc_cache.entry[cache_index].brq->data.sg = 
			kzalloc(sizeof(struct scatterlist),
				GFP_NOWAIT);
			if (mmc_cache.entry[cache_index].brq->data.sg ==
			    NULL) {
				retval = -ENOMEM;
				goto no_memory;
			}
		} else {
			retval = -ENOMEM;
			goto no_memory;
		}
		
	}
	mmc_cache.entry[cache_index].brq->mrq.cmd = 
		&(mmc_cache.entry[cache_index].brq->cmd);
	mmc_cache.entry[cache_index].brq->mrq.data = 
		&(mmc_cache.entry[cache_index].brq->data);
	mmc_cache.entry[cache_index].brq->mrq.stop = NULL;

	mmc_cache.entry[cache_index].brq->cmd.arg =
		brq->cmd.arg;
	mmc_cache.entry[cache_index].brq->cmd.flags =
		brq->cmd.flags;
	mmc_cache.entry[cache_index].brq->data.blksz =
		brq->data.blksz;
	mmc_cache.entry[cache_index].brq->stop.opcode =
		brq->stop.opcode;
	mmc_cache.entry[cache_index].brq->stop.arg =
		brq->stop.arg;
	mmc_cache.entry[cache_index].brq->stop.flags =
		brq->stop.flags;
	mmc_cache.entry[cache_index].brq->data.blocks =
		brq->data.blocks;
	mmc_cache.entry[cache_index].brq->cmd.opcode =
		brq->cmd.opcode;
	mmc_cache.entry[cache_index].brq->data.flags =
		brq->data.flags;

	mmc_cache.entry[cache_index].brq->data.sg->dma_address =
		brq->data.sg->dma_address;
	mmc_cache.entry[cache_index].brq->data.sg->dma_length =
		brq->data.sg->dma_length;
	mmc_cache.entry[cache_index].brq->data.sg->length =
		brq->data.sg->length;
	mmc_cache.entry[cache_index].brq->data.sg->offset =
		brq->data.sg->offset;
	mmc_cache.entry[cache_index].brq->data.sg->page_link =
		brq->data.sg->page_link;
	mmc_cache.entry[cache_index].brq->data.sg_len =
		brq->data.sg_len;
	
	mmc_cache.entry[cache_index].req = req;
	/* 
	 * We only set the valid flag here, after the
	 * allocations, so that if kzalloc() fails the entry
	 * stays invalid and can be used again.
	 */
	mmc_cache.entry[cache_index].valid = 1;
	retval = cache_index;

	no_memory:
	//mutex_unlock(&cache_mutex);
	return retval;
}

/**
 * mmc_check_cachehit() - Checks to see what type of cache hit 
 * occurred for a given entry. 
 *  
 * @sector_addr: Sector address location start of which will be 
 *               used to check for a cache hit or miss.
 * @num_sectors: number of sectors to write to starting with 
 *               sector_addr.
 *  
 * Returns 
 *      - 0: Exact cache hit (sector_addr + num_sectors).  In
 *           this case, new data can just be written over the
 *           old data. In the case for a read, data can be read
 *           from the entry.
 *      - 1: partial cache hit (sector_addr) or miss and the
 *           data is valid.  In this case, for simplicity, the
 *           entry is written to the device. For a read, this
 *           data would first have to be written to the device,
 *           then the read would be allowed to proceed to the
 *           device.
 *      - 2: valid is 0.  In this case, write to the cache.  If
 *           it's a read, go to the device to get the
 *           information.
 *	- 3: entry is valid but cache_addr and sector_addr
 *	     don't match; we have a cache collision.  Report it,
 *	     and code calling this function should flush this entry
 *	     before insertion.
 *      - EPERM: function got called without handset_cachesize
 *        parameter being appropriately set
 *      - ENXIO: Unexpected cache address case; we should never see this.
 *        The only valid cases should be 0,1,2,EPERM (called wo/
 *        handset_cachesize set to a positive integer).
 */
static int mmc_check_cachehit(sector_t sector_addr, unsigned int num_sectors)
{
	int retval = -EPERM;
	unsigned int index;

	if (handset_cachesize == 0)
		return retval;

	//mutex_lock(&cache_mutex);
	pr_debug("mmc: %s() cache_addr/sector_addr: %#llx\n",
		__func__, sector_addr);
	pr_debug("mmc: %s() num_sectors:               %d\n",
		__func__, num_sectors);
	index = mmc_index_cache(sector_addr);
	pr_debug("mmc: %s() index:                     %d\n",
		__func__, index);
	
	/* case 2- valid is 0.*/
	if (mmc_cache.entry[index].valid == 0)
		retval = 2;

	/* case 0- perfect match */
	else if ((mmc_cache.entry[index].valid == 1) &&
		 ((mmc_cache.entry[index].cache_addr == sector_addr) &&
		  (mmc_cache.entry[index].num_sectors == num_sectors)))
		retval = 0;

	/* case 1- cache_addr matched and it's a valid entry */
	else if ((mmc_cache.entry[index].valid == 1) &&
		 ((mmc_cache.entry[index].cache_addr == sector_addr) &&
		  (mmc_cache.entry[index].num_sectors != num_sectors)))
		retval = 1;

	/* 
	 * case 3- entry is valid but cache_addr and sector_addr
	 * don't match; we have a cache collision.  Report it,
	 * and code calling this function should flush this entry
	 * before insertion.
         */
	else if ((mmc_cache.entry[index].valid == 1) &&
		 (mmc_cache.entry[index].cache_addr != sector_addr))
		retval = 3;

	/* We should never hit here */
	else
		retval = -ENXIO;

	pr_debug("mmc: %s(): returning %d\n", __func__, retval);
	//mutex_unlock(&cache_mutex);
	return retval;
}

/**
 * mmc_destroy_cache() - deallocates the cache. 
 *  
 * Caveats: It is believed this should only be called 
 * on shutdown, when everything is being destroyed. 
 */
static void mmc_destroy_cache(void) 
{	
	unsigned int i;

	if (handset_cachesize == 0)
		return;

	//mutex_lock(&cache_mutex);
	for (i = 0; i < mmc_cachesize(); i++) {
		/* check brq before dereferencing it; entries that were
		 * never written to still have brq == NULL */
		if (mmc_cache.entry[i].brq != NULL) {
			kfree(mmc_cache.entry[i].brq->data.sg);
			mmc_cache.entry[i].brq->data.sg = NULL;
			kfree(mmc_cache.entry[i].brq);
			mmc_cache.entry[i].brq = NULL;
		}
	}
	kfree(mmc_cache.entry);
	mmc_cache.entry = NULL;
	//mutex_unlock(&cache_mutex);
	return;
}

static struct mmc_blk_data *mmc_blk_get(struct gendisk *disk)
{
	struct mmc_blk_data *md;

	mutex_lock(&open_lock);
	md = disk->private_data;
	if (md && md->usage == 0)
		md = NULL;
	if (md)
		md->usage++;
	mutex_unlock(&open_lock);

	return md;
}

static void mmc_blk_put(struct mmc_blk_data *md)
{
	mutex_lock(&open_lock);
	md->usage--;
	if (md->usage == 0) {
		int devmaj = MAJOR(disk_devt(md->disk));
		int devidx = MINOR(disk_devt(md->disk)) / perdev_minors;

		if (!devmaj)
			devidx = md->disk->first_minor / perdev_minors;

		blk_cleanup_queue(md->queue.queue);

		__clear_bit(devidx, dev_use);

		put_disk(md->disk);
		kfree(md);
	}
	mutex_unlock(&open_lock);
}

static int mmc_blk_open(struct block_device *bdev, fmode_t mode)
{
	struct mmc_blk_data *md = mmc_blk_get(bdev->bd_disk);
	int ret = -ENXIO;

	mutex_lock(&block_mutex);
	if (md) {
		if (md->usage == 2)
			check_disk_change(bdev);
		ret = 0;

		if ((mode & FMODE_WRITE) && md->read_only) {
			mmc_blk_put(md);
			ret = -EROFS;
		}
	}
	mutex_unlock(&block_mutex);

	return ret;
}

static int mmc_blk_release(struct gendisk *disk, fmode_t mode)
{
	struct mmc_blk_data *md = disk->private_data;

	mutex_lock(&block_mutex);
	mmc_blk_put(md);
	mutex_unlock(&block_mutex);
	return 0;
}

static int
mmc_blk_getgeo(struct block_device *bdev, struct hd_geometry *geo)
{
	geo->cylinders = get_capacity(bdev->bd_disk) / (4 * 16);
	geo->heads = 4;
	geo->sectors = 16;
	return 0;
}

static const struct block_device_operations mmc_bdops = {
	.open			= mmc_blk_open,
	.release		= mmc_blk_release,
	.getgeo			= mmc_blk_getgeo,
	.owner			= THIS_MODULE,
};

static u32 mmc_sd_num_wr_blocks(struct mmc_card *card)
{
	int err;
	u32 result;
	__be32 *blocks;

	struct mmc_request mrq;
	struct mmc_command cmd;
	struct mmc_data data;
	unsigned int timeout_us;

	struct scatterlist sg;

	memset(&cmd, 0, sizeof(struct mmc_command));

	cmd.opcode = MMC_APP_CMD;
	cmd.arg = card->rca << 16;
	cmd.flags = MMC_RSP_SPI_R1 | MMC_RSP_R1 | MMC_CMD_AC;

	err = mmc_wait_for_cmd(card->host, &cmd, 0);
	if (err)
		return (u32)-1;
	if (!mmc_host_is_spi(card->host) && !(cmd.resp[0] & R1_APP_CMD))
		return (u32)-1;

	memset(&cmd, 0, sizeof(struct mmc_command));

	cmd.opcode = SD_APP_SEND_NUM_WR_BLKS;
	cmd.arg = 0;
	cmd.flags = MMC_RSP_SPI_R1 | MMC_RSP_R1 | MMC_CMD_ADTC;

	memset(&data, 0, sizeof(struct mmc_data));

	data.timeout_ns = card->csd.tacc_ns * 100;
	data.timeout_clks = card->csd.tacc_clks * 100;

	timeout_us = data.timeout_ns / 1000;
	timeout_us += data.timeout_clks * 1000 /
		(card->host->ios.clock / 1000);

	if (timeout_us > 100000) {
		data.timeout_ns = 100000000;
		data.timeout_clks = 0;
	}

	data.blksz = 4;
	data.blocks = 1;
	data.flags = MMC_DATA_READ;
	data.sg = &sg;
	data.sg_len = 1;

	memset(&mrq, 0, sizeof(struct mmc_request));

	mrq.cmd = &cmd;
	mrq.data = &data;

	blocks = kmalloc(4, GFP_KERNEL);
	if (!blocks)
		return (u32)-1;

	sg_init_one(&sg, blocks, 4);

	mmc_wait_for_req(card->host, &mrq);

	result = ntohl(*blocks);
	kfree(blocks);

	if (cmd.error || data.error)
		result = (u32)-1;

	return result;
}

static u32 get_card_status(struct mmc_card *card, char *disk_name)
{
	struct mmc_command cmd;
	int err;

	memset(&cmd, 0, sizeof(struct mmc_command));
	cmd.opcode = MMC_SEND_STATUS;
	if (!mmc_host_is_spi(card->host))
		cmd.arg = card->rca << 16;
	cmd.flags = MMC_RSP_SPI_R2 | MMC_RSP_R1 | MMC_CMD_AC;
	err = mmc_wait_for_cmd(card->host, &cmd, 0);
	if (err)
		printk(KERN_ERR "%s: error %d sending status comand",
		       disk_name, err);
	return cmd.resp[0];
}

static int mmc_blk_issue_discard_rq(struct mmc_queue *mq, struct request *req)
{
	struct mmc_blk_data *md = mq->data;
	struct mmc_card *card = md->queue.card;
	unsigned int from, nr, arg;
	int err = 0;

	mmc_claim_host(card->host);

	if (!mmc_can_erase(card)) {
		err = -EOPNOTSUPP;
		goto out;
	}

	from = blk_rq_pos(req);
	nr = blk_rq_sectors(req);

	if (mmc_can_trim(card))
		arg = MMC_TRIM_ARG;
	else
		arg = MMC_ERASE_ARG;

	/*
	 * Before issuing a user req, host driver should
	 * wait for the BKOPS is done or just use HPI to
	 * interrupt it.
	 */
	/* jpf: wasn't here in past recent versions, so must
	   not be that important to use Ubuntu to test
	err = mmc_wait_for_bkops(card);
	if (err)
		goto out;
	*/
	err = mmc_erase(card, from, nr, arg);
out:
	spin_lock_irq(&md->lock);
	__blk_end_request(req, err, blk_rq_bytes(req));
	spin_unlock_irq(&md->lock);

	mmc_release_host(card->host);

	return err ? 0 : 1;
}

static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *req)
{
	struct mmc_blk_data *md = mq->data;
	struct mmc_card *card = md->queue.card;
	struct mmc_blk_request brq;
	int ret = 1, disable_multi = 0;

	mmc_claim_host(card->host);

	do {
		struct mmc_command cmd;
		u32 readcmd, writecmd, status = 0;

		memset(&brq, 0, sizeof(struct mmc_blk_request));
		brq.mrq.cmd = &brq.cmd;
		brq.mrq.data = &brq.data;

		brq.cmd.arg = blk_rq_pos(req);
		if (!mmc_card_blockaddr(card))
			brq.cmd.arg <<= 9;
		brq.cmd.flags = MMC_RSP_SPI_R1 | MMC_RSP_R1 | MMC_CMD_ADTC;
		brq.data.blksz = 512;
		brq.stop.opcode = MMC_STOP_TRANSMISSION;
		brq.stop.arg = 0;
		brq.stop.flags = MMC_RSP_SPI_R1B | MMC_RSP_R1B | MMC_CMD_AC;
		brq.data.blocks = blk_rq_sectors(req);

		/*
		 * The block layer doesn't support all sector count
		 * restrictions, so we need to be prepared for too big
		 * requests.
		 */
		if (brq.data.blocks > card->host->max_blk_count)
			brq.data.blocks = card->host->max_blk_count;

		/*
		 * After a read error, we redo the request one sector at a time
		 * in order to accurately determine which sectors can be read
		 * successfully.
		 */
		if (disable_multi && brq.data.blocks > 1)
			brq.data.blocks = 1;

		if (brq.data.blocks > 1) {
			/* SPI multiblock writes terminate using a special
			 * token, not a STOP_TRANSMISSION request.
			 */
			if (!mmc_host_is_spi(card->host)
					|| rq_data_dir(req) == READ)
				brq.mrq.stop = &brq.stop;
			readcmd = MMC_READ_MULTIPLE_BLOCK;
			writecmd = MMC_WRITE_MULTIPLE_BLOCK;
		} else {
			brq.mrq.stop = NULL;
			readcmd = MMC_READ_SINGLE_BLOCK;
			writecmd = MMC_WRITE_BLOCK;
		}
		if (rq_data_dir(req) == READ) {
			brq.cmd.opcode = readcmd;
			brq.data.flags |= MMC_DATA_READ;
		} else {
			brq.cmd.opcode = writecmd;
			brq.data.flags |= MMC_DATA_WRITE;
		}

		mmc_set_data_timeout(&brq.data, card);

		brq.data.sg = mq->sg;
		brq.data.sg_len = mmc_queue_map_sg(mq);

		/*
		 * Adjust the sg list so it is the same size as the
		 * request.
		 */
		if (brq.data.blocks != blk_rq_sectors(req)) {
			int i, data_size = brq.data.blocks << 9;
			struct scatterlist *sg;

			for_each_sg(brq.data.sg, sg, brq.data.sg_len, i) {
				data_size -= sg->length;
				if (data_size <= 0) {
					sg->length += data_size;
					i++;
					break;
				}
			}
			brq.data.sg_len = i;
		}
		
		/* jpf: CACHE GOES HERE AND CALLS THE REST OF THE CODE
		 * ONLY IF ON A MISS, FLUSH, OR DEACTIVATION OF CACHE
		 */
		if (handset_cachesize > 0) {
			int cachehit_result = 0;
			int cache_index = mmc_index_cache(blk_rq_pos(req));

			cachehit_result = mmc_check_cachehit(
					  blk_rq_pos(req),
					  blk_rq_sectors(req));

			if ((brq.cmd.opcode == MMC_WRITE_BLOCK) ||
			    (brq.cmd.opcode == MMC_WRITE_MULTIPLE_BLOCK)) {

				if (brq.cmd.opcode == MMC_WRITE_BLOCK) {
					pr_debug("%s: single write block occuring",
						req->rq_disk->disk_name);
				}
				
				if ((cachehit_result == 0) ||
				    (cachehit_result == 2)) {
					
					/* I think if it's a cache hit I need to
					 * call __blk_end_request() to retire the
					 * old req entry before overwriting it
					 */
					if (cachehit_result == 0) {

					}
					cachehit_result = mmc_insert_cacheentry(
							  blk_rq_pos(req),
							  blk_rq_sectors(req),
							  &brq, req);
					// jpf: try2- retire all commands, per Shane
					spin_lock_irq(&md->lock);
					ret = __blk_end_request(req, 0,
								blk_rq_bytes(req));
					if (ret == 0) {
						pr_debug("%s: ret in __blk_end_request is 0\n", 
						__func__);
					}
					else {
						pr_debug("%s: ret in __blk_end_request is %d\n", 
						__func__, ret);
					}
					spin_unlock_irq(&md->lock);
					pr_debug("%s: cache entry filled: %d\n",
						req->rq_disk->disk_name,
						cachehit_result);
					pr_debug("===write: entry complete===");

				} else if ((cachehit_result == 1) || 
					   (cachehit_result == 3)) {
 
					pr_debug("%s: Partial/Collision write cache hit\n",
						req->rq_disk->disk_name);
					pr_debug("%s: mmc_check_cachehit(): %d\n",
						req->rq_disk->disk_name,
						cachehit_result);
					/* 
					 * CODE HERE TO SEND WRITE REQUEST 
					 * IN CACHE BEFORE CACHING PARTIAL HIT 
					 * OR COLLISION ENTRY. 
					 */
					/* jpf: hope this queue can be used, or :-( */
					mmc_queue_bounce_pre(&(md->queue));
										
					pr_crit("%s: call before write via cache\n", __func__);
					// jpf: 9/27/11: THIS CALL HERE SEEMS TO BE THE SMOKING GUN BETWEEN
					// eMMC IN ANDROID AND MMC IN LAPTOP.  NOT SURE WHY IT'S BROKEN WHEN
					// I USE MY mrq COPY ON AN ANDROID PLATFORM.
					mmc_wait_for_req(card->host,
					  &(mmc_cache.entry[cache_index].brq->mrq));
					  //&brq.mrq);
					pr_crit("%s: call after write via cache:\n", __func__);

					cachehit_result = mmc_insert_cacheentry(
							  blk_rq_pos(req),
							  blk_rq_sectors(req),
							  &brq, req);
					// jpf: try 2- retire request as soon as it's stored in cache, per Shane
					spin_lock_irq(&md->lock);
					ret = __blk_end_request(req, 0, brq.data.bytes_xfered);
					spin_unlock_irq(&md->lock);

					pr_debug("%s: cache entry filled: %d\n",
						req->rq_disk->disk_name,
						cachehit_result);
					pr_debug("===write: entry complete===");

				} else {
					pr_err("%s: mmc_check_cachehit() ",
						__func__);
					pr_err("returned unexpected value\n");
				}

			}
			else if ((brq.cmd.opcode == MMC_READ_SINGLE_BLOCK) ||
				 (brq.cmd.opcode == MMC_READ_MULTIPLE_BLOCK)) {

				/*
				 * Partial read hit would send the write 
				 * entry to the device before the read would 
				 * go to the device.  Perfect read hit 
				 * would go to the cache. Since this cache
				 * is for writes, we aren't going to do the
				 * more complicated thing and bring data
				 * to cache on a read miss.
				 */
				if (cachehit_result == 0) {
					pr_debug("%s: Perfect cache read hit",
						req->rq_disk->disk_name);
					pr_debug("cache stuff: %#llx | %d",
						(unsigned long long) 
						blk_rq_pos(req),
						blk_rq_sectors(req));

					/* "mmc_queue object"->queue */
					/* 
					 * jpf: 9/21/11
					 * Looks like there is one md, one queue
					 * per mmc 'slot' (area to stick mmc card)
					 * so i'm probably alright here.  Question is-
					 * how is the data from the cache getting to the
					 * read?? Not sure if this will work.
					 * From looking at mmc_queue_bounce_post(),
					 * data from a buffer gets copied to mmc_queue
					 * *mq's bounce_sg structure.  So the theory
					 * is, on host controller reads the data from
					 * the MMC card gets copied to a buffer, which
					 * then gets copied to mq->bounce_sg.  So all
					 * I need to do is just assign the cache entry
					 * hit's scatterlist to mq->bounce_sg.
					 * If this doesn't work, then I'm defaulting
					 * to what I do with partial reads.
					 */
					mmc_queue_bounce_post(&(md->queue));
					spin_lock_irq(&md->lock);
					
					mmc_cache.entry[cache_index].brq->data.sg->dma_address =
						mq->bounce_sg->dma_address;
					#ifdef CONFIG_NEED_SG_DMA_LENGTH
					mmc_cache.entry[cache_index].brq->data.sg->dma_length =
						mq->bounce_sg->dma_length;
					#endif
					mmc_cache.entry[cache_index].brq->data.sg->length =
						mq->bounce_sg->length;
					mmc_cache.entry[cache_index].brq->data.sg->offset =
						mq->bounce_sg->offset;
					mmc_cache.entry[cache_index].brq->data.sg->page_link =
						mq->bounce_sg->page_link;
					mmc_cache.entry[cache_index].brq->data.sg_len =
						mq->bounce_sg_len;

					/* jpf: kind-of praying this works.  I in fact
					   do not want to use what is in the cache
					   for __blk_end_request() though...I want to
					   retire the request passed into the function.
					*/
					 
					ret = __blk_end_request(req, 0,
								brq.data.bytes_xfered);
					spin_unlock_irq(&md->lock);
					pr_debug("===read: entry complete===");

				/* for now, we want to first write the entry to the
				   HW, then read from the HW
				 */
				} else if (cachehit_result == 1) {
					pr_debug("%s: Partial cache read hit",
						req->rq_disk->disk_name);
					pr_debug("cache stuff: %#llx | %d",
						(unsigned long long)
						blk_rq_pos(req),
						blk_rq_sectors(req));

					mmc_queue_bounce_pre(&(md->queue));
					mmc_wait_for_req(card->host,
					   &(mmc_cache.entry[cache_index].brq->mrq));
					mmc_queue_bounce_post(&(md->queue));
					mmc_cache.entry[cache_index].valid = 0;
					
					/* jpf: try and utilize what we got for the read in this
					   code so for now I'm not re-inventing the wheel
					 */
					pr_debug("===read: entry complete===");
					goto normal_req_flow;
				} else {
					pr_debug("=read: cache entry invalid=");
					goto normal_req_flow;
				}
			}
		} /* end handset_cachesize section */
		else {
			/*
			 * Based on looking at this code and from comments 
			 * in host.c, it is believed this module cannot 
			 * handle scatter-gather lists; therefore, this 
			 * call eventually does an operation in which it takes 
			 * a scatter-gather list and 'redoes it' as a 
			 * contiguous area of memory.  This is for writes ONLY.
			 */
			mmc_queue_bounce_pre(mq);
	
			/*
			 * Before issuing a user req, host driver should
			 * wait for the BKOPS is done or just use HPI to
			 * interrupt it.
			 */
	
			/* not here for Ubuntu 11.04 w/2.6.38 kernel either
			if (mmc_wait_for_bkops(card))
				goto cmd_err;
			*/
	
			/*
			 * Actual request being sent to the host 
			 * for writing/reading to the device. 
			 * This call waits for completion. 
			 */
			normal_req_flow: 
			mmc_wait_for_req(card->host, &brq.mrq);
	
			/*
			 * Since mmc_queue_bounce_pre() turns a scatter-gather 
			 * list and re-organizes it and writes it into a contiguous 
			 * memory for write operations, this does the opposite 
			 * for reads ONLY. 
			 */
			mmc_queue_bounce_post(mq);
	
			/*
			 * Check for errors here, but don't jump to cmd_err
			 * until later as we need to wait for the card to leave
			 * programming mode even when things go wrong.
			 */
			if (brq.cmd.error || brq.data.error || brq.stop.error) {
				if (brq.data.blocks > 1 && rq_data_dir(req) == READ) {
					/* Redo read one sector at a time */
					printk(KERN_WARNING "%s: retrying using single "
					       "block read\n", req->rq_disk->disk_name);
					disable_multi = 1;
					continue;
				}
				status = get_card_status(card, req->rq_disk->disk_name);
			} else if (disable_multi == 1) {
				disable_multi = 0;
			}
	
			if (brq.cmd.error) {
				printk(KERN_ERR "%s: error %d sending read/write "
				       "command, response %#x, card status %#x\n",
				       req->rq_disk->disk_name, brq.cmd.error,
				       brq.cmd.resp[0], status);
			}
	
			if (brq.data.error) {
				if (brq.data.error == -ETIMEDOUT && brq.mrq.stop)
					/* 'Stop' response contains card status */
					status = brq.mrq.stop->resp[0];
				printk(KERN_ERR "%s: error %d transferring data,"
				       " sector %u, nr %u, card status %#x\n",
				       req->rq_disk->disk_name, brq.data.error,
				       (unsigned)blk_rq_pos(req),
				       (unsigned)blk_rq_sectors(req), status);
			}
	
			if (brq.stop.error) {
				printk(KERN_ERR "%s: error %d sending stop command, "
				       "response %#x, card status %#x\n",
				       req->rq_disk->disk_name, brq.stop.error,
				       brq.stop.resp[0], status);
			}
	
			if (!mmc_host_is_spi(card->host) && rq_data_dir(req) != READ) {
				do {
					int err;
	
					cmd.opcode = MMC_SEND_STATUS;
					cmd.arg = card->rca << 16;
					cmd.flags = MMC_RSP_R1 | MMC_CMD_AC;
					err = mmc_wait_for_cmd(card->host, &cmd, 5);
					if (err) {
						printk(KERN_ERR "%s: error %d requesting status\n",
						       req->rq_disk->disk_name, err);
						goto cmd_err;
					}
					/*
					 * Some cards mishandle the status bits,
					 * so make sure to check both the busy
					 * indication and the card state.
					 */
				} while (!(cmd.resp[0] & R1_READY_FOR_DATA) ||
					(R1_CURRENT_STATE(cmd.resp[0]) == 7));
	
	#if 0
				if (cmd.resp[0] & ~0x00000900)
					printk(KERN_ERR "%s: status = %08x\n",
					       req->rq_disk->disk_name, cmd.resp[0]);
				if (mmc_decode_status(cmd.resp))
					goto cmd_err;
	#endif
			}
	
			if (brq.cmd.error || brq.stop.error || brq.data.error) {
				if (rq_data_dir(req) == READ) {
					/*
					 * After an error, we redo I/O one sector at a
					 * time, so we only reach here after trying to
					 * read a single sector.
					 */
					spin_lock_irq(&md->lock);
					ret = __blk_end_request(req, -EIO, brq.data.blksz);
					spin_unlock_irq(&md->lock);
					continue;
				}
				goto cmd_err;
			}
	
			/*
			 * Check if need to do bkops by each R1 response command
			 */
	
			/* jpf: not here for ubuntu 11.04 w/2.6.38 kernel 
			if (mmc_card_mmc(card) &&
					(brq.cmd.resp[0] & R1_URGENT_BKOPS))
				mmc_card_set_need_bkops(card);
			*/
	
			/*
			 * A block was successfully transferred.
			 */
			spin_lock_irq(&md->lock);
			ret = __blk_end_request(req, 0, brq.data.bytes_xfered);
			spin_unlock_irq(&md->lock);
		} /* jpf: else(handset_cachesize is 0) */
pr_debug("%s: inside while()\n", __func__);
	} while (ret);
pr_debug("%s: outside while()\n", __func__);
	mmc_release_host(card->host);

	return 1;

 cmd_err:
 	/*
 	 * If this is an SD card and we're writing, we can first
 	 * mark the known good sectors as ok.
 	 *
	 * If the card is not SD, we can still ok written sectors
	 * as reported by the controller (which might be less than
	 * the real number of written sectors, but never more).
	 */
	if (mmc_card_sd(card)) {
		u32 blocks;

		blocks = mmc_sd_num_wr_blocks(card);
		if (blocks != (u32)-1) {
			spin_lock_irq(&md->lock);
			ret = __blk_end_request(req, 0, blocks << 9);
			spin_unlock_irq(&md->lock);
		}
	} else {
		spin_lock_irq(&md->lock);
		ret = __blk_end_request(req, 0, brq.data.bytes_xfered);
		spin_unlock_irq(&md->lock);
	}

	mmc_release_host(card->host);

	spin_lock_irq(&md->lock);
	while (ret)
		ret = __blk_end_request(req, -EIO, blk_rq_cur_bytes(req));
	spin_unlock_irq(&md->lock);

	return 0;
}

static int mmc_blk_issue_rq(struct mmc_queue *mq, struct request *req)
{
	if (req->cmd_flags & REQ_DISCARD) {
		return mmc_blk_issue_discard_rq(mq, req);
	} else {
		return mmc_blk_issue_rw_rq(mq, req);
	}
}

static inline int mmc_blk_readonly(struct mmc_card *card)
{
	return mmc_card_readonly(card) ||
	       !(card->csd.cmdclass & CCC_BLOCK_WRITE);
}

static struct mmc_blk_data *mmc_blk_alloc(struct mmc_card *card)
{
	struct mmc_blk_data *md;
	int devidx, ret;

	devidx = find_first_zero_bit(dev_use, max_devices);
	if (devidx >= max_devices)
		return ERR_PTR(-ENOSPC);
	__set_bit(devidx, dev_use);

	md = kzalloc(sizeof(struct mmc_blk_data), GFP_KERNEL);
	if (!md) {
		ret = -ENOMEM;
		goto out;
	}


	/*
	 * Set the read-only status based on the supported commands
	 * and the write protect switch.
	 */
	md->read_only = mmc_blk_readonly(card);

	md->disk = alloc_disk(perdev_minors);
	if (md->disk == NULL) {
		ret = -ENOMEM;
		goto err_kfree;
	}

	spin_lock_init(&md->lock);
	md->usage = 1;

	ret = mmc_init_queue(&md->queue, card, &md->lock);
	if (ret)
		goto err_putdisk;

	md->queue.issue_fn = mmc_blk_issue_rq;
	md->queue.data = md;

	md->disk->major	= MMC_BLOCK_MAJOR;
	md->disk->first_minor = devidx * perdev_minors;
	md->disk->fops = &mmc_bdops;
	md->disk->private_data = md;
	md->disk->queue = md->queue.queue;
	md->disk->driverfs_dev = &card->dev;
	set_disk_ro(md->disk, md->read_only);

	/*
	 * As discussed on lkml, GENHD_FL_REMOVABLE should:
	 *
	 * - be set for removable media with permanent block devices
	 * - be unset for removable block devices with permanent media
	 *
	 * Since MMC block devices clearly fall under the second
	 * case, we do not set GENHD_FL_REMOVABLE.  Userspace
	 * should use the block device creation/destruction hotplug
	 * messages to tell when the card is present.
	 */

	snprintf(md->disk->disk_name, sizeof(md->disk->disk_name),
		"mmcblk%d", devidx);

	blk_queue_logical_block_size(md->queue.queue, 512);

	if (!mmc_card_sd(card) && mmc_card_blockaddr(card)) {
		/*
		 * The EXT_CSD sector count is in number of 512 byte
		 * sectors.
		 */
		set_capacity(md->disk, card->ext_csd.sectors);
	} else {
		/*
		 * The CSD capacity field is in units of read_blkbits.
		 * set_capacity takes units of 512 bytes.
		 */
		set_capacity(md->disk,
			card->csd.capacity << (card->csd.read_blkbits - 9));
	}
	return md;

 err_putdisk:
	put_disk(md->disk);
 err_kfree:
	kfree(md);
 out:
	return ERR_PTR(ret);
}

static int
mmc_blk_set_blksize(struct mmc_blk_data *md, struct mmc_card *card)
{
	int err;

	mmc_claim_host(card->host);
	err = mmc_set_blocklen(card, 512);
	mmc_release_host(card->host);

	if (err) {
		printk(KERN_ERR "%s: unable to set block size to 512: %d\n",
			md->disk->disk_name, err);
		return -EINVAL;
	}

	return 0;
}

static int mmc_blk_probe(struct mmc_card *card)
{
	struct mmc_blk_data *md;
	int err;
	char cap_str[10];

	/*
	 * Check that the card supports the command class(es) we need.
	 */
	if (!(card->csd.cmdclass & CCC_BLOCK_READ))
		return -ENODEV;

	md = mmc_blk_alloc(card);
	if (IS_ERR(md))
		return PTR_ERR(md);

	err = mmc_blk_set_blksize(md, card);
	if (err)
		goto out;

	string_get_size((u64)get_capacity(md->disk) << 9, STRING_UNITS_2,
			cap_str, sizeof(cap_str));
	printk(KERN_INFO "%s: %s %s %s %s\n",
		md->disk->disk_name, mmc_card_id(card), mmc_card_name(card),
		cap_str, md->read_only ? "(ro)" : "");

	mmc_set_drvdata(card, md);
	add_disk(md->disk);
	return 0;

 out:
	mmc_cleanup_queue(&md->queue);
	mmc_blk_put(md);

	return err;
}

static void mmc_blk_remove(struct mmc_card *card)
{
	struct mmc_blk_data *md = mmc_get_drvdata(card);

	if (md) {
		/* Stop new requests from getting into the queue */
		del_gendisk(md->disk);

		/* Then flush out any already in there */
		mmc_cleanup_queue(&md->queue);

		mmc_blk_put(md);
	}
	mmc_set_drvdata(card, NULL);
}

#ifdef CONFIG_PM
static int mmc_blk_suspend(struct mmc_card *card)
{
	struct mmc_blk_data *md = mmc_get_drvdata(card);

	if (md) {
		mmc_queue_suspend(&md->queue);
	}
	return 0;
}

static int mmc_blk_resume(struct mmc_card *card)
{
	struct mmc_blk_data *md = mmc_get_drvdata(card);

	if (md) {
		mmc_blk_set_blksize(md, card);
		mmc_queue_resume(&md->queue);
	}
	return 0;
}
#else
#define	mmc_blk_suspend	NULL
#define mmc_blk_resume	NULL
#endif

static struct mmc_driver mmc_driver = {
	.drv		= {
		.name	= "mmcblk",
	},
	.probe		= mmc_blk_probe,
	.remove		= mmc_blk_remove,
	.suspend	= mmc_blk_suspend,
	.resume		= mmc_blk_resume,
};

#if defined(CONFIG_DEBUG_FS)

struct mmc_cache_debugfs {
	struct dentry *cacherow;
	struct dentry *cache_addr;
	struct dentry *num_sectors;
	struct dentry *brq;
	struct dentry *valid;
	//struct debugfs_blob_wrapper nullflag;
};

static struct dentry *mmc_dentry_start = NULL;
static struct mmc_cache_debugfs *mmc_cache_debug = NULL;

#endif

static int __init mmc_blk_init(void)
{
	int res;
pr_debug("Jay's mmc_block driver init called\n");
	if (perdev_minors != CONFIG_MMC_BLOCK_MINORS)
		pr_debug("mmcblk: using %d minors per device\n", perdev_minors);

	max_devices = 256 / perdev_minors;

	res = register_blkdev(MMC_BLOCK_MAJOR, "mmc");
	if (res)
		goto out;

	res = mmc_register_driver(&mmc_driver);
	if (res)
		goto out2;

	/* 
	 * I think we really only want to have the mmc_cache called
	 * when no errors occur.
	 */
	if (handset_cachesize != 0) {
		res = mmc_create_cache();
		if (res != 0) {
			pr_err("mmcblk: error occured on creating cache: %d",
			       res);
			pr_err("mmcblk: cache will not be used.");
			handset_cachesize = 0;
		}
		
		#if defined(CONFIG_DEBUG_FS)
		mmc_cache_debug = kcalloc(mmc_cachesize(), 
					  sizeof *mmc_cache_debug, 
					  GFP_KERNEL);
		if ((mmc_cache_debug != NULL) && 
		    (mmc_dentry_start = debugfs_create_dir("mmc_cache", NULL))
		   ) {
			unsigned int i;
			for (i = 0; i < mmc_cachesize(); i++) {
				char cacherow[12];
				struct dentry *d;
				snprintf(cacherow, 12, "entry_%d", i);
				d = debugfs_create_dir(cacherow,
						       mmc_dentry_start);

				if (d != NULL) {
					mmc_cache_debug[i].cacherow = d;
					mmc_cache_debug[i].cache_addr =
					   debugfs_create_x64(
					   "cache_addr",
					   0444, d, 
					   &mmc_cache.entry[i].cache_addr);
					mmc_cache_debug[i].num_sectors =
					   debugfs_create_u32("num_sectors", 
					   0444, d, 
					   &mmc_cache.entry[i].num_sectors);
					/*
					I need to see brq actually changing,
					and using this method won't let me see
					it unless I add more debug code elsewhere,
					which I don't want to do.
					if (mmc_cache.entry[i].brq == NULL) {
						mmc_cache_debug[i].nullflag.data =
						"NULL\n";
						mmc_cache_debug[i].nullflag.size =
						5;
					} else {
						mmc_cache_debug[i].nullflag.data =
						"NOT_NULL\n";
						mmc_cache_debug[i].nullflag.size =
						9;
					}
					
					mmc_cache_debug[i].brq =
					   debugfs_create_blob("brq",
					   0444,d,
					   &mmc_cache_debug[i].nullflag);
					*/
					/* 
					 * jpf: Should be good enough; I just want
					 * to see brq change from 0 to !0.
					 * Since I'm targeting 32-bit archs, 
					 * an unsigned int * cast to see the pointer
					 * value should be fine...I hope...
					 */
					mmc_cache_debug[i].brq =
					   debugfs_create_x32(
					   "brq",
					   0444, d, 
					   ((unsigned int *) &mmc_cache.entry[i].brq));
					/*
					 * This is read/write because allowing 
					 * the opportunity to write the valid 
					 * bit could provide good tests. 
					 */
					mmc_cache_debug[i].valid =
						debugfs_create_u8("valid",
						0666, d,
						&mmc_cache.entry[i].valid);

				} else { 
					pr_err("mmcblk: ");
					pr_err("debugfs_create_dir(%s) ",
					       cacherow);
					
					pr_err("failed to get created");
					pr_err("Returned error %ld\n", 
					       PTR_ERR(d));
				}
			}
		} else {
			pr_err("mmcblk: ");
			pr_err("debugfs_create_dir(mmc_cache) ");
			pr_err("failed to get created");
			pr_err("Returned error %ld\n", 
			       PTR_ERR(mmc_dentry_start));
		}
		#endif
	}
	pr_debug("%s mmc successful\n", __func__);
	return 0;
 out2:
	unregister_blkdev(MMC_BLOCK_MAJOR, "mmc");
 out:
	return res;
}

static void __exit mmc_blk_exit(void)
{
	mmc_unregister_driver(&mmc_driver);
	unregister_blkdev(MMC_BLOCK_MAJOR, "mmc");

	if (handset_cachesize != 0) {
		mmc_destroy_cache();
	}

	#if defined(CONFIG_DEBUG_FS)
	if (mmc_cache_debug != NULL) {
		if (!IS_ERR_OR_NULL(mmc_dentry_start)) {
			unsigned int i;
			for (i = 0; i < mmc_cachesize(); i++) {
				if (!IS_ERR_OR_NULL(
				     mmc_cache_debug[i].cacherow)) {
					if (!IS_ERR_OR_NULL(
						mmc_cache_debug[i].cache_addr))
						debugfs_remove(
						   mmc_cache_debug[i].cache_addr);
					if (!IS_ERR_OR_NULL(
						mmc_cache_debug[i].num_sectors))
						debugfs_remove(
						   mmc_cache_debug[i].num_sectors);
					if (!IS_ERR_OR_NULL(
						mmc_cache_debug[i].brq))
						debugfs_remove(
						   mmc_cache_debug[i].brq);
					if (!IS_ERR_OR_NULL(
						mmc_cache_debug[i].valid))
						debugfs_remove(
						   mmc_cache_debug[i].valid);
					debugfs_remove(
					    mmc_cache_debug[i].cacherow);
				}
			}
			debugfs_remove(mmc_dentry_start);
		}
		kfree(mmc_cache_debug);
	}
	#endif
}

module_init(mmc_blk_init);
module_exit(mmc_blk_exit);

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Multimedia Card (MMC) block device driver");


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: slow eMMC write speed
  2011-10-03 18:01                           ` J Freyensee
@ 2011-10-03 20:19                             ` Andrei Warkentin
  2011-10-03 21:00                               ` J Freyensee
  0 siblings, 1 reply; 32+ messages in thread
From: Andrei Warkentin @ 2011-10-03 20:19 UTC (permalink / raw)
  To: J Freyensee
  Cc: Praveen G K, Per Förlin, Linus Walleij, linux-mmc,
	Arnd Bergmann, Jon Medhurst, Andrei E. Warkentin

Hi James,

----- Original Message -----
> From: "J Freyensee" <james_p_freyensee@linux.intel.com>
> 
> Yeah, I know I'd be doing myself a huge favor by working off of
> mmc-next
> (or close to it), but product-wise, my department doesn't care for
> sustaining current platforms...yet (still trying to convince).
> 

I'd suggest working on linux-mmc. You can always back-port.

> So I was looking into sticking a write cache into block.c driver as a
> parameter, so it can be turned on and off upon driver load.  Any
> write
> operation goes to the cache and only on a cache collision will the
> write operation get sent to the host controller for a write.  What I
> have working so far is just with an MMC card in an MMC slot of a
> laptop,
> and just bare-bones.  No general flush routine, error-handling, etc.
>  From a couple of performance measurements I did on the MMC slot using
> blktrace/blkparse and 400MB write transactions, I was seeing a huge
> performance boost with no data corruption.  So it is not looking like
> a total harebrained idea.  But I am still pretty far from
> understanding everything here.  And the real payoff we want to see is
> performance a user can see on handheld (i.e., Android) systems.
> 

Interesting. Thanks for sharing. I don't want to seem silly, but how is what you're doing different from
the page cache? The page cache certainly defers write back (and I believe this is tunable...I'm not too
familiar yet or comfortable around the rest of blk I/O and VM). What are your test workloads? I would
guess this wouldn't have too great of an impact on non-O_DIRECT accesses, and O_DIRECT accesses anyway have
to bypass any caching logic.

A

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: slow eMMC write speed
  2011-10-03 20:19                             ` Andrei Warkentin
@ 2011-10-03 21:00                               ` J Freyensee
  2011-10-04  7:59                                 ` Andrei E. Warkentin
  0 siblings, 1 reply; 32+ messages in thread
From: J Freyensee @ 2011-10-03 21:00 UTC (permalink / raw)
  To: Andrei Warkentin
  Cc: Praveen G K, Per Förlin, Linus Walleij, linux-mmc,
	Arnd Bergmann, Jon Medhurst, Andrei E. Warkentin

On 10/03/2011 01:19 PM, Andrei Warkentin wrote:
> Hi James,
>
> ----- Original Message -----
>> From: "J Freyensee"<james_p_freyensee@linux.intel.com>
>>
>> Yeah, I know I'd be doing myself a huge favor by working off of
>> mmc-next
>> (or close to it), but product-wise, my department doesn't care for
>> sustaining current platforms...yet (still trying to convince).
>>
>
> I'd suggest working on linux-mmc. You can always back-port.
>
>> So I was looking into sticking a write cache into block.c driver as a
>> parameter, so it can be turned on and off upon driver load.  Any
>> write
>> operation goes to the cache and only on a cache collision will the
>> write operation get sent to the host controller for a write.  What I
>> have working so far is just with an MMC card in an MMC slot of a
>> laptop,
>> and just bare-bones.  No general flush routine, error-handling, etc.
>>  From a couple of performance measurements I did on the MMC slot using
>> blktrace/blkparse and 400MB write transactions, I was seeing a huge
>> performance boost with no data corruption.  So it is not looking like
>> a total harebrained idea.  But I am still pretty far from
>> understanding everything here.  And the real payoff we want to see is
>> performance a user can see on handheld (i.e., Android) systems.
>>
>
> Interesting. Thanks for sharing. I don't want to seem silly, but how is what you're doing different from
> the page cache? The page cache certainly defers write back (and I believe this is tunable...I'm not too
> familiar yet or comfortable around the rest of blk I/O and VM).

The idea is the page cache is too generic for hand-held (i.e. Android) 
workloads.  The page cache handles regular files, directories, 
user-swappable processes, etc., and all of that has to contend for the 
resources available to the page cache.  This is specific to eMMC 
workloads.  Namely, for games and even .pdf files on an Android system 
(ARM or Intel), there are a lot of 1-2 sector writes and almost 0 reads.

But by no means am I an expert on the page cache area either.

You are certainly right that the page cache is tunable.  I briefly 
looked at this, but then I decided I need to start writing something to 
start understanding stuff.

> What are your test workloads?

The MMC tests I conducted were just write blasts, like writing 
200 1MB files 200 times.  I just did enough as a 'thumb' test to see if 
it's worth killing myself on an Android box...it's a little more 
challenging getting it to work on an Android system since block.c is 
*THE* driver, whereas an MMC slot on a laptop is like some peon 
extension the laptop doesn't need.

> I would
> guess this wouldn't have too great of an impact on non-O_DIRECT accesses, and O_DIRECT accesses anyway have
> to bypass any caching logic.

You are correct; I've already discovered I'd need to bypass the cache on 
O_DIRECT access.

>
> A


-- 
J (James/Jay) Freyensee
Storage Technology Group
Intel Corporation

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: slow eMMC write speed
  2011-10-03 21:00                               ` J Freyensee
@ 2011-10-04  7:59                                 ` Andrei E. Warkentin
  2011-10-19 23:27                                   ` Praveen G K
  0 siblings, 1 reply; 32+ messages in thread
From: Andrei E. Warkentin @ 2011-10-04  7:59 UTC (permalink / raw)
  To: J Freyensee
  Cc: Andrei Warkentin, Praveen G K, Per Förlin, Linus Walleij,
	linux-mmc, Arnd Bergmann, Jon Medhurst

Hi James,

2011/10/3 J Freyensee <james_p_freyensee@linux.intel.com>:
>
> The idea is the page cache is too generic for hand-held (i.e. Android)
> workloads.  The page cache handles regular files, directories, user-swappable
> processes, etc., and all of that has to contend for the resources available
> to the page cache.  This cache is specific to eMMC workloads.  Namely, for games
> and even .pdf files on an Android system (ARM or Intel), there are a lot of
> 1-2 sector writes and almost 0 reads.
>
> But by no means am I an expert on the page cache area either.
>

I misspoke, sorry, I really meant the buffer cache, which caches block
access. It may contend
with other resources, but it is practically boundless and responds
well to memory pressure (which
otherwise is something you need to consider).

As to Android workloads, what you're really trying to say is that
you're dealing with a tumult of SQLite accesses,
and coupled with ext4 these don't look so good when it comes down to
MMC performance and reliability, right? When
I investigated this problem in my previous life, it came down to
figuring out whether it was worth putting vendor hacks in the MMC
driver to purportedly prevent a drastic reduction in
reliability/life-span, while also improving performance for accesses
smaller than the flash page size.

The problem being, of course, that you have many small random accesses, which:
a) Chew through a fixed number of erase-block (AU, allocation unit)
slots in the internal (non-volatile) cache on the MMC.
b) As a consequence of (a), result in much thrashing, as erase-block
slot evictions result in (small) writes, which result in extra erases.
c) Could also end up spanning erase-blocks, which further multiplies
the performance and life-span damage.

The hacks I was investigating actually made things worse
performance-wise, and there was no way to measure reliability. I did
realize that you could, under some circumstances, and with some idea of
the GC behavior of MMCs and their flash parameters, devise an I/O
scheduler that would optimize accesses by grouping AUs and trying to
defer writing AUs which are being actively written to. Of course this
would be in no way generic, and would involve fine tuning on a per-card
basis, making it useful mainly for eMMC/eSD.
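
To illustrate the core of that idea (a sketch only -- all names here
are made up, and the 4 MiB AU size is an assumption; real AU geometry
is card-specific): bucket requests by allocation unit, and dispatch the
AU that has been idle longest, deferring "hot" AUs in the hope that
further writes to them can still be grouped.

#include <linux/blkdev.h>
#include <linux/jiffies.h>
#include <linux/list.h>

#define AU_SECTORS	(4 * 2048)	/* assumed 4 MiB AU, 512 B sectors */

struct au_bucket {
	sector_t		au;		/* allocation unit index */
	unsigned long		last_write;	/* jiffies of latest write */
	struct list_head	rqs;		/* requests queued for this AU */
	struct list_head	node;		/* link in the bucket list */
};

/* Which AU does this request start in? */
static inline sector_t rq_au(struct request *rq)
{
	return blk_rq_pos(rq) / AU_SECTORS;
}

/* Dispatch policy: pick the bucket whose AU has been idle longest. */
static struct au_bucket *au_pick_coldest(struct list_head *buckets)
{
	struct au_bucket *b, *coldest = NULL;

	list_for_each_entry(b, buckets, node)
		if (!coldest || time_before(b->last_write, coldest->last_write))
			coldest = b;
	return coldest;
}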

Caching by itself might save you some trouble from many writes to
similar places, but you can already tune the buffer cache to delay
writes (/proc/sys/vm/dirty_writeback_centisecs), and it's not going to
help with the fixed number of AUs and the card's preference for a
particular size of writes (i.e. the garbage collection mechanism inside
the MMC and the flash technology in it). On the other hand, caching
brings another set of problems - data loss, and the occasional need to
flush all data to disk, with a larger delay.

Speaking of reducing flash traffic...you might be interested in
bumping the commit interval (ext3/ext4), but that also has data-loss
implications.

Anyway, the point I want to make is that you should ask yourself the
question of what you're trying to achieve, and what the real problem
is - and why existing solutions don't work. If you think caching is
your problem, then you should probably answer the question of why the
buffer cache isn't sufficient - and if it isn't, how it should adapt
to fit the scenario. I would want to say that the real fix should be
the I/O-happy SQLite usage on Android... But there may be some value
in trying to alleviate it by grouping writes by AUs and deferring
"hot" AUs.

A

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: slow eMMC write speed
  2011-10-04  7:59                                 ` Andrei E. Warkentin
@ 2011-10-19 23:27                                   ` Praveen G K
  2011-10-20 15:01                                     ` Andrei E. Warkentin
  0 siblings, 1 reply; 32+ messages in thread
From: Praveen G K @ 2011-10-19 23:27 UTC (permalink / raw)
  To: Andrei E. Warkentin
  Cc: J Freyensee, Andrei Warkentin, Per Förlin, Linus Walleij,
	linux-mmc, Arnd Bergmann, Jon Medhurst

Also, can somebody please tell me the significance of blk_end_request? Thanks.
Why do we call this after every block read or write?

2011/10/4 Andrei E. Warkentin <andrey.warkentin@gmail.com>
>
> Hi James,
>
> 2011/10/3 J Freyensee <james_p_freyensee@linux.intel.com>:
> >
> > The idea is the page cache is too generic for hand-held (i.e. Android)
> > workloads.  The page cache handles regular files, directories, user-swappable
> > processes, etc., and all of that has to contend for the resources available
> > to the page cache.  This cache is specific to eMMC workloads.  Namely, for games
> > and even .pdf files on an Android system (ARM or Intel), there are a lot of
> > 1-2 sector writes and almost 0 reads.
> >
> > But by no means am I an expert on the page cache area either.
> >
>
> I misspoke, sorry, I really meant the buffer cache, which caches block
> access. It may contend
> with other resources, but it is practically boundless and responds
> well to memory pressure (which
> otherwise is something you need to consider).
>
> As to Android workloads, what you're really trying to say is that
> you're dealing with a tumult of SQLite accesses,
> and coupled with ext4 these don't look so good when it comes down to
> MMC performance and reliability, right? When
> I investigated this problem in my previous life, it came down to
> figuring out whether it was worth putting vendor hacks in the MMC
> driver to purportedly prevent a drastic reduction in
> reliability/life-span, while also improving performance for accesses
> smaller than the flash page size.
>
> The problem being, of course, that you have many small random accesses, which:
> a) Chew through a fixed number of erase-block (AU, allocation unit)
> slots in the internal (non-volatile) cache on the MMC.
> b) As a consequence of (a), result in much thrashing, as erase-block
> slot evictions result in (small) writes, which result in extra erases.
> c) Could also end up spanning erase-blocks, which further multiplies
> the performance and life-span damage.
>
> The hacks I was investigating actually made things worse
> performance-wise, and there was no way to measure reliability. I did
> realize that you could, under some circumstances, and with some idea of
> the GC behavior of MMCs and their flash parameters, devise an I/O
> scheduler that would optimize accesses by grouping AUs and trying to
> defer writing AUs which are being actively written to. Of course this
> would be in no way generic, and would involve fine tuning on a per-card
> basis, making it useful mainly for eMMC/eSD.
>
> Caching by itself might save you some trouble from many writes to
> similar places, but you can already tune the buffer cache to delay
> writes (/proc/sys/vm/dirty_writeback_centisecs), and it's not going to
> help with the fixed number of AUs and the card's preference for a
> particular size of writes (i.e. the garbage collection mechanism inside
> the MMC and the flash technology in it). On the other hand, caching
> brings another set of problems - data loss, and the occasional need to
> flush all data to disk, with a larger delay.
>
> Speaking of reducing flash traffic...you might be interested in
> bumping the commit interval (ext3/ext4), but that also has data-loss
> implications.
>
> Anyway, the point I want to make is that you should ask yourself the
> question of what you're trying to achieve, and what the real problem
> is - and why existing solutions don't work. If you think caching is
> your problem, then you should probably answer the question of why the
> buffer cache isn't sufficient - and if it isn't, how it should adapt
> to fit the scenario. I would want to say that the real fix should be
> the I/O-happy SQLite usage on Android... But there may be some value
> in trying to alleviate it by grouping writes by AUs and deferring
> "hot" AUs.
>
> A

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: slow eMMC write speed
  2011-10-19 23:27                                   ` Praveen G K
@ 2011-10-20 15:01                                     ` Andrei E. Warkentin
  2011-10-20 15:10                                       ` Praveen G K
  0 siblings, 1 reply; 32+ messages in thread
From: Andrei E. Warkentin @ 2011-10-20 15:01 UTC (permalink / raw)
  To: Praveen G K
  Cc: J Freyensee, Andrei Warkentin, Per Förlin, Linus Walleij,
	linux-mmc, Arnd Bergmann, Jon Medhurst

2011/10/19 Praveen G K <praveen.gk@gmail.com>:
> Also, can somebody please tell me the significance of blk_end_request? Thanks.
> Why do we call this after every block read or write?

Because you want to update the struct request with the amount
written/read. If the entire
requested I/O range has been satisfied, blk_end_request also calls
blk_finish_request and
completes the request.

I/O is asynchronous - hence, you need to let whatever made the request
know it's completed.
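
If it helps, this is roughly the pattern in the mmc block driver's
request routine (heavily simplified from drivers/mmc/card/block.c;
brq setup and error handling omitted):

	struct mmc_blk_request brq;
	int ret;

	do {
		/* set up brq to cover the next chunk of req, then: */
		mmc_wait_for_req(card->host, &brq.mrq);

		spin_lock_irq(&md->lock);
		/* report how many bytes of req actually completed */
		ret = __blk_end_request(req, 0, brq.data.bytes_xfered);
		spin_unlock_irq(&md->lock);
	} while (ret);	/* nonzero while part of req is still pending */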

A

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: slow eMMC write speed
  2011-10-20 15:01                                     ` Andrei E. Warkentin
@ 2011-10-20 15:10                                       ` Praveen G K
  2011-10-20 15:26                                         ` Andrei Warkentin
  0 siblings, 1 reply; 32+ messages in thread
From: Praveen G K @ 2011-10-20 15:10 UTC (permalink / raw)
  To: Andrei E. Warkentin
  Cc: J Freyensee, Andrei Warkentin, Per Förlin, Linus Walleij,
	linux-mmc, Arnd Bergmann, Jon Medhurst

2011/10/20 Andrei E. Warkentin <andrey.warkentin@gmail.com>:
> 2011/10/19 Praveen G K <praveen.gk@gmail.com>:
>> Also, can somebody please tell me the significance of blk_end_request? Thanks.
>> Why do we call this after every block read or write?
>
> Because you want to update the struct request with the amount
> written/read. If the entire
> requested I/O range has been satisfied, blk_end_request also calls
> blk_finish_request and
> completes the request.
Just for a quick understanding, I did the following:

During every eMMC write, I called the multi block write command with
the same set of data, and I called the mmc_end_request after, let's
say, every 10 transfers (with each transfer being 128 blocks).  I
noticed that I did not see those big busy wait times as frequently as
when calling blk_end_request after every 128 blocks were
transferred.  Why is that happening?
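
In code terms, my hack amounts to roughly this (a sketch only; brq
setup, locking details, and error handling are omitted):

	int i;

	/* issue the same 128-block write ten times... */
	for (i = 0; i < 10; i++)
		mmc_wait_for_req(card->host, &brq.mrq);

	/* ...but complete the request only once, at the end */
	__blk_end_request(req, 0, blk_rq_bytes(req));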

> I/O is asynchronous - hence, you need to let whatever made the request
> know it's completed.
So, does that mean, the actual writes to the eMMC (or reads from the
eMMC) takes place here? If so, what happens when we send the MULTI
BLOCK WRITE command?
> A
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: slow eMMC write speed
  2011-10-20 15:10                                       ` Praveen G K
@ 2011-10-20 15:26                                         ` Andrei Warkentin
  2011-10-20 16:07                                           ` Praveen G K
  0 siblings, 1 reply; 32+ messages in thread
From: Andrei Warkentin @ 2011-10-20 15:26 UTC (permalink / raw)
  To: Praveen G K
  Cc: J Freyensee, Per Förlin, Linus Walleij, linux-mmc,
	Arnd Bergmann, Jon Medhurst, Andrei E. Warkentin

----- Original Message -----
> From: "Praveen G K" <praveen.gk@gmail.com>
> To: "Andrei E. Warkentin" <andrey.warkentin@gmail.com>
> Cc: "J Freyensee" <james_p_freyensee@linux.intel.com>, "Andrei Warkentin" <awarkentin@vmware.com>, "Per Förlin"
> <per.forlin@stericsson.com>, "Linus Walleij" <linus.walleij@linaro.org>, linux-mmc@vger.kernel.org, "Arnd Bergmann"
> <arnd@arndb.de>, "Jon Medhurst" <tixy@linaro.org>
> Sent: Thursday, October 20, 2011 11:10:02 AM
> Subject: Re: slow eMMC write speed
> 
> 2011/10/20 Andrei E. Warkentin <andrey.warkentin@gmail.com>:
> > 2011/10/19 Praveen G K <praveen.gk@gmail.com>:
> >> Also, can somebody please tell me the significance of
> >> blk_end_request? Thanks.
> >> Why do we call this after every block read or write?
> >
> > Because you want to update the struct request with the amount
> > written/read. If the entire
> > requested I/O range has been satisfied, blk_end_request also calls
> > blk_finish_request and
> > completes the request.
> Just for a quick understanding, I did the following:
> 
> During every eMMC write, I called the multi block write command with
> the same set of data, and I called the mmc_end_request 

What's mmc_end_request? I'm assuming you meant blk_end_request.

> after, let's
> say, every 10 transfers (with each transfer being 128 blocks).  I
> noticed that I did not see those big busy wait times as frequently as
> when calling blk_end_request after every 128 blocks were
> transferred.  Why is that happening?

So you did 10 back-to-back 64k transfers inside the request handling routine, and you noticed that caused a smaller GC delay
than doing 10 separate multiblock 64k transfers the normal way?

Depends. Maybe mmc_host_lazy_disable() (called by mmc_release_host) ends up doing some power management, which is bound
to affect the card as well.

Or maybe the card is trying to optimize consecutive writes into one larger write internally, and you're crossing some internal
timestamp checking logic that distinguishes between separate and consecutive writes. You should talk to your manufacturer, and investigate what happens to your host on mmc_release_host.

A

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: slow eMMC write speed
  2011-10-20 15:26                                         ` Andrei Warkentin
@ 2011-10-20 16:07                                           ` Praveen G K
  2011-10-21  4:45                                             ` Andrei E. Warkentin
  0 siblings, 1 reply; 32+ messages in thread
From: Praveen G K @ 2011-10-20 16:07 UTC (permalink / raw)
  To: Andrei Warkentin
  Cc: J Freyensee, Per Förlin, Linus Walleij, linux-mmc,
	Arnd Bergmann, Jon Medhurst, Andrei E. Warkentin

2011/10/20 Andrei Warkentin <awarkentin@vmware.com>:
> ----- Original Message -----
>> From: "Praveen G K" <praveen.gk@gmail.com>
>> To: "Andrei E. Warkentin" <andrey.warkentin@gmail.com>
>> Cc: "J Freyensee" <james_p_freyensee@linux.intel.com>, "Andrei Warkentin" <awarkentin@vmware.com>, "Per Förlin"
>> <per.forlin@stericsson.com>, "Linus Walleij" <linus.walleij@linaro.org>, linux-mmc@vger.kernel.org, "Arnd Bergmann"
>> <arnd@arndb.de>, "Jon Medhurst" <tixy@linaro.org>
>> Sent: Thursday, October 20, 2011 11:10:02 AM
>> Subject: Re: slow eMMC write speed
>>
>> 2011/10/20 Andrei E. Warkentin <andrey.warkentin@gmail.com>:
>> > 2011/10/19 Praveen G K <praveen.gk@gmail.com>:
>> >> Also, can somebody please tell me the significance of
>> >> blk_end_request? Thanks.
>> >> Why do we call this after every block read or write?
>> >
>> > Because you want to update the struct request with the amount
>> > written/read. If the entire
>> > requested I/O range has been satisfied, blk_end_request also calls
>> > blk_finish_request and
>> > completes the request.
>> Just for a quick understanding, I did the following:
>>
>> During every eMMC write, I called the multi block write command with
>> the same set of data, and I called the mmc_end_request
>
> What's mmc_end_request? I'm assuming you meant blk_end_request.
Yes, you are right.  I meant blk_end_request.

>> after, let's
>> say, every 10 transfers (with each transfer being 128 blocks).  I
>> noticed that I did not see those big busy wait times as frequently as
>> when calling blk_end_request after every 128 blocks were
>> transferred.  Why is that happening?
>
> So you did 10 back-to-back 64k transfers inside the request handling routine, and you noticed that caused a smaller GC delay
> rather than doing 10 separate multiblock 64k transfers the normal way?
That's right.  Even though the request is supposed to service only
64k, I sent 10 write commands, each with 64k, and then completed the
request (within the loop that takes care of the reads/writes in
block.c) with the same data.

So, instead of calling blk_end_request just after one write, I call
blk_end_request after 10 writes.

> Depends. Maybe mmc_host_lazy_disable() (called by mmc_release_host) ends up doing some power management, which is bound
> to affect the card as well.
>
> Or maybe the card is trying to optimize consecutive writes into one larger write internally, and you're crossing some internal
> timestamp checking logic that distinguishes between separate and consecutive writes. You should talk to your manufacturer, and investigate what happens to your host on mmc_release_host.
>
> A
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: slow eMMC write speed
  2011-10-20 16:07                                           ` Praveen G K
@ 2011-10-21  4:45                                             ` Andrei E. Warkentin
  0 siblings, 0 replies; 32+ messages in thread
From: Andrei E. Warkentin @ 2011-10-21  4:45 UTC (permalink / raw)
  To: Praveen G K
  Cc: Andrei Warkentin, J Freyensee, Per Förlin, Linus Walleij,
	linux-mmc, Arnd Bergmann, Jon Medhurst

2011/10/20 Praveen G K <praveen.gk@gmail.com>:
>>
>> So you did 10 back-to-back 64k transfers inside the request handling routine, and you noticed that caused a smaller GC delay
>> than doing 10 separate multiblock 64k transfers the normal way?
> That's right.  Even though the request is supposed to service only
> 64k, I sent 10 write commands, each with 64k, and then completed the
> request (within the loop that takes care of the reads/writes in
> block.c) with the same data.
>
> So, instead of calling blk_end_request just after one write, I call
> blk_end_request after 10 writes.
>

Well, if you're just writing 10 times the data to the *same* location
(since it's inside the loop), then of course the GC delay you noticed
is smaller than when writing 10 times to different locations - the
garbage collection happens when the card notices you are writing to
more allocation units than are cached internally...
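
Rough arithmetic, assuming a 4 MiB allocation unit (an assumption only
- the real geometry is card-specific):

	128 blocks/transfer * 512 B/block = 64 KiB per CMD25
	 64 transfers * 64 KiB            =  4 MiB = one assumed AU

which would line up with the stall showing up around every 63rd/64th
transfer in your original report - roughly each time a new AU has to
be opened.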

A

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2011-10-21  4:45 UTC | newest]

Thread overview: 32+ messages
2011-09-23  5:05 slow eMMC write speed Praveen G K
2011-09-28  5:42 ` Linus Walleij
2011-09-28 19:06   ` Praveen G K
2011-09-28 19:59     ` J Freyensee
2011-09-28 20:34       ` Praveen G K
2011-09-28 21:01         ` J Freyensee
2011-09-28 21:03           ` Praveen G K
2011-09-28 21:34             ` J Freyensee
2011-09-28 22:24               ` Praveen G K
2011-09-28 22:59                 ` J Freyensee
2011-09-28 23:16                   ` Praveen G K
2011-09-29  0:57                     ` Philip Rakity
2011-09-29  2:24                       ` Praveen G K
2011-09-29  3:30                         ` Philip Rakity
2011-09-29  7:24               ` Linus Walleij
2011-09-29  8:17                 ` Per Förlin
2011-09-29 20:16                   ` J Freyensee
2011-09-30  8:22                     ` Andrei E. Warkentin
2011-10-01  0:33                       ` J Freyensee
2011-10-02  6:20                         ` Andrei E. Warkentin
2011-10-03 18:01                           ` J Freyensee
2011-10-03 20:19                             ` Andrei Warkentin
2011-10-03 21:00                               ` J Freyensee
2011-10-04  7:59                                 ` Andrei E. Warkentin
2011-10-19 23:27                                   ` Praveen G K
2011-10-20 15:01                                     ` Andrei E. Warkentin
2011-10-20 15:10                                       ` Praveen G K
2011-10-20 15:26                                         ` Andrei Warkentin
2011-10-20 16:07                                           ` Praveen G K
2011-10-21  4:45                                             ` Andrei E. Warkentin
2011-09-29  7:05         ` Linus Walleij
2011-09-29  7:33           ` Linus Walleij
