From mboxrd@z Thu Jan 1 00:00:00 1970 From: Bill Pringlemeir Date: Wed, 13 Aug 2014 11:14:24 -0400 Subject: [U-Boot] [PATCH 3/4] mtd: nand: add Freescale NFC driver In-Reply-To: <7b41b54fe3f11a943b0ec63f2e44f573@agner.ch> (Stefan Agner's message of "Wed, 13 Aug 2014 13:20:35 +0200") References: <4ea124ab817db6e3dee2067aadd6db14643990f5.1407312577.git.stefan@agner.ch> <1407796426.7427.100.camel@snotra.buserror.net> <1380dbe3110cbf9359220a78d69e1896@agner.ch> <1407881820.7427.137.camel@snotra.buserror.net> <87wqadl3h6.fsf@nbsps.com> <7b41b54fe3f11a943b0ec63f2e44f573@agner.ch> Message-ID: <87sil0l8vj.fsf@nbsps.com> List-Id: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: u-boot@lists.denx.de On 13 Aug 2014, stefan at agner.ch wrote: > Am 2014-08-13 00:58, schrieb Bill Pringlemeir: > [snip] >>>>>> +static u32 nfc_read(struct mtd_info *mtd, uint reg) >>>>>> +{ >>>>>> + struct fsl_nfc *nfc = mtd_to_nfc(mtd); >>>>>> + >>>>>> + if (reg == NFC_FLASH_STATUS1 || >>>>>> + reg == NFC_FLASH_STATUS2 || >>>>>> + reg == NFC_IRQ_STATUS) >>>>>> + return __raw_readl(nfc->regs + reg); >>>>>> + /* Gang read/writes together for most registers. */ >>>>>> + else >>>>>> + return *(u32 *)(nfc->regs + reg); >>>>>> +} >>>>>> + >>>>>> +static void nfc_write(struct mtd_info *mtd, uint reg, u32 val) >>>>>> +{ >>>>>> + struct fsl_nfc *nfc = mtd_to_nfc(mtd); >>>>>> + >>>>>> + if (reg == NFC_FLASH_STATUS1 || >>>>>> + reg == NFC_FLASH_STATUS2 || >>>>>> + reg == NFC_IRQ_STATUS) >>>>>> + __raw_writel(val, nfc->regs + reg); >>>>>> + /* Gang read/writes together for most registers. */ >>>>>> + else >>>>>> + *(u32 *)(nfc->regs + reg) = val; >>>>>> +} >>>>> >>>>> You should always be using raw I/O accessors. If the intent is to >>>>> bypass I/O ordering for certain registers, the raw accessors >>>>> should already be skipping that. >> >>>> Ok, will do. >> >>> Sorry, the above should say "always be using I/O accesors", not >>> always raw ones. >> >> This is probably because of me. There are lines like this for >> configuration, >> >> /* Set configuration register. */ >> nfc_clear(mtd, NFC_FLASH_CONFIG, CONFIG_ADDR_AUTO_INCR_BIT); >> nfc_clear(mtd, NFC_FLASH_CONFIG, CONFIG_BUFNO_AUTO_INCR_BIT); >> nfc_clear(mtd, NFC_FLASH_CONFIG, CONFIG_BOOT_MODE_BIT); >> nfc_clear(mtd, NFC_FLASH_CONFIG, CONFIG_DMA_REQ_BIT); >> nfc_set(mtd, NFC_FLASH_CONFIG, CONFIG_FAST_FLASH_BIT); >> >> If you use some sort of 'volatile', you are saying to the compiler >> that these lines are *not*, >> >> ldr r0, [r1, #CONFIG] # read previous value >> ldr r2, =~MASK >> orr r0, #FAST_FLASH_BIT # set one bit. >> and r0, r2 # clear all bits at once. >> str r0, [r1, #CONFIG] # write it back. >> >> Instead it will change into five different load/modify/stores. The >> memory itself is just like SRAM except a few registers that are >> actually volatile; ie changed via the hardware. >> >> Particularly bad are the memcpy_io() on the ARM. Using these results >> in about 1/2 to 1/4 of the performance of the drivers through-put on >> the Vybrid. I can see that using accessors is good, but just >> replacing this with 'writel' will result in significantly more code >> and slower throughput. >> >> If people insist on this, then something like, >> >> val = nfc_read(mtd, NFC_REG); >> val = nfc_clear(val, CONFIG_ADDR_AUTO_INCR_BIT); >> val = nfc_clear(val, CONFIG_BUFNO_AUTO_INCR_BIT); >> val = nfc_clear(val, CONFIG_BOOT_MODE_BIT); >> val = nfc_clear(val, CONFIG_DMA_REQ_BIT); >> val = nfc_set(val, CONFIG_FAST_FLASH_BIT); >> nfc_write(mtd, NFC_REG, val); >> >> would result in the same code with 'nfc_read()' and 'nfc_write()' >> using the I/O accessors. I found this more difficult to read for the >> higher level functions. Instead some some standard macros for >> setting and clearing bits could be used. The original driver was >> using 'nfc_set()' and 'nfc_clear()'. Maybe they should just go. >> > > I just applied this change: > > diff --git a/drivers/mtd/nand/fsl_nfc.c b/drivers/mtd/nand/fsl_nfc.c > index df2c8be..01c010f 100644 > --- a/drivers/mtd/nand/fsl_nfc.c > +++ b/drivers/mtd/nand/fsl_nfc.c > @@ -191,26 +191,14 @@ static u32 nfc_read(struct mtd_info *mtd, uint > reg) > { > struct fsl_nfc *nfc = mtd_to_nfc(mtd); > > - if (reg == NFC_FLASH_STATUS1 || > - reg == NFC_FLASH_STATUS2 || > - reg == NFC_IRQ_STATUS) > - return __raw_readl(nfc->regs + reg); > - /* Gang read/writes together for most registers. */ > - else > - return *(u32 *)(nfc->regs + reg); > + return __raw_readl(nfc->regs + reg); > } > > static void nfc_write(struct mtd_info *mtd, uint reg, u32 val) > { > struct fsl_nfc *nfc = mtd_to_nfc(mtd); > > - if (reg == NFC_FLASH_STATUS1 || > - reg == NFC_FLASH_STATUS2 || > - reg == NFC_IRQ_STATUS) > - __raw_writel(val, nfc->regs + reg); > - /* Gang read/writes together for most registers. */ > - else > - *(u32 *)(nfc->regs + reg) = val; > + __raw_writel(val, nfc->regs + reg); > } > > static void nfc_set(struct mtd_info *mtd, uint reg, u32 bits) > > And did some performance measurement: > > => nand read ${loadaddr} 0x0 0x2000000 > > ===> With if/else > NAND read: device 0 offset 0x0, size 0x2000000 > 33554432 bytes read in 8494 ms: OK (3.8 MiB/s) > > ===> With __raw_readl only... > NAND read: device 0 offset 0x0, size 0x2000000 > 33554432 bytes read in 8184 ms: OK (3.9 MiB/s) > > => nand write ${loadaddr} rootfs-ubi 0x2000000 > > ===> With if/else > NAND write: device 0 offset 0xa00000, size 0x2000000 > 33554432 bytes written in 18200 ms: OK (1.8 MiB/s) > > ===> With __raw_readl only... > NAND write: device 0 offset 0xa00000, size 0x2000000 > 33554432 bytes written in 15095 ms: OK (2.1 MiB/s) > > These values are reproducible. I guess by removing the conditional > branch in the nfc_read/nfc_write functions we even gain performance. > > IMHO we should use the raw_writel only and "hand optimize" for > functions which are used often. For the initialization/configuration > functions, there is little value to save some register access. This will indeed depend on the compiler. I guess that the U-Boot 'readl' and 'writel' are the same as Linux? If the compiler doesn't do the analysis to see from the parameters that the 'if/else' is not needed, then it will not reduce the code and this can be larger. I was using gcc 4.8 when I looked at this. The NFC is an AHB peripheral and takes several clocks to get access to. Did the object size increase/decrease? For Linux and this change with gcc 4.8, With if/else $ size drivers/mtd/nand/fsl_nfc.o text data bss dec hex filename 3300 3652 0 6952 1b28 drivers/mtd/nand/fsl_nfc.o Only readl/writel, $ size drivers/mtd/nand/fsl_nfc.o text data bss dec hex filename 3532 3652 0 7184 1c10 drivers/mtd/nand/fsl_nfc.o Do you get bigger sizes for the 'readl/writel' only cases? This would indicate the compiler didn't inline/reduce the function calls. For certain, the 'read/set+clear/write' will be smaller and marginally faster with all compilers. If you have a smaller binary with the 'if/else' and the code still performs slower, then something I don't understand is happening. It is sensible that this will not effect performance that much. The register access is not the main bottle neck. The 'memcpy()' in nfc_read_buf(), nfc_read_spare() and nfc_write_buf() are more critical to performance. Ie, getting the nand data on/off of the NFC controllers internal buffers to main memory is the main bottle neck. Ironically, on the Vybrid, the Cortex-A5 is able to do this faster than the DMA engine as it has much better though-put to the SDRAM. The buffer transfers were originally 'memcpy_io()' or something which did barriers after each and every memory fetch. By replacing the register access alone, you shouldn't get that much of a performance difference (1/2 to 1/4 through-put). Regards, Bill Pringlemeir.