Re: The contradictory nature of spl_nand_fit_read

From: Tim Harvey <tharvey@gateworks.com>
To: Sean Anderson <sean.anderson@seco.com>
Cc: Dario Binacchi <dariobin@libero.it>,
	 Michael Nazzareno Trimarchi <michael@amarulasolutions.com>,
	U-Boot Mailing List <u-boot@lists.denx.de>,
	 Tom Rini <trini@konsulko.com>, Lokesh Vutla <lokeshvutla@ti.com>
Subject: Re: The contradictory nature of spl_nand_fit_read
Date: Mon, 4 Apr 2022 11:03:11 -0700	[thread overview]
Message-ID: <CAJ+vNU0vJKRr=AALm8hxB3WeHr8KOxY_U6wtSpsZ+7j0eJ8YKA@mail.gmail.com> (raw)
In-Reply-To: <b68ae4d8-4267-aab2-0d8b-79dbbfb339a4@seco.com>

On Mon, Apr 4, 2022 at 9:26 AM Sean Anderson <sean.anderson@seco.com> wrote:
>
>
>
> On 4/3/22 11:56 AM, Dario Binacchi wrote:
> > Hi Sean,
> >
> >> Il 01/04/2022 23:43 Michael Nazzareno Trimarchi <michael@amarulasolutions.com> ha scritto:
> >>
> >>
> >> Hi Sean
> >>
> >> On Fri, Apr 1, 2022 at 8:53 PM Sean Anderson <sean.anderson@seco.com> wrote:
> >> >
> >> >
> >> >
> >> > On 4/1/22 2:46 PM, Sean Anderson wrote:
> >> > > Hi all,
> >> > >
> >> > > I don't understand how spl_nand_fit_read is supposed to work. This
> >> > > function has been seemingly hacked together by multiple people over the
> >> > > years, and it (presumably) (somehow) works despite mass confusion of
> >> > > units. I tried to do some refactoring, but the contradictions make it
> >> > > very difficult to determine what is a bug and what is intentional.
> >> > >
> >> > > Lets start with the basics. spl_nand_fit_read is used to implement
> >> > > spl_load_info.load. I've reproduced the documentation for this function
> >> > > below:
> >> > >
> >> > >>  read() - Read from device
> >> > >>
> >> > >>  @load: Information about the load state
> >> > >>  @sector: Sector number to read from (each @load->bl_len bytes)
> >> > >>  @count: Number of sectors to read
> >> > >>  @buf: Buffer to read into
> >> > >>  @return number of sectors read, 0 on error
> >> > >
> >> > > In particular, both @sector and @count are in units of load.bl_len. In
> >> > > addition, the return value should be @count on success.
> >> > >
> >> > >> static ulong spl_nand_fit_read(struct spl_load_info *load, ulong offs,
> >> > >>                             ulong size, void *dst)
> >> > >> {
> >> > >>      int err;
> >> > >> #ifdef CONFIG_SYS_NAND_BLOCK_SIZE
> >> > >
> >> > > The following logic was introduced in commit 9f6a14c47f ("spl: fit:
> >> > > nand: fix fit loading in case of bad blocks"). However, at the time it
> >> > > was not guarded by the above ifdef. nand_spl_adjust_offset is not
> >> > > defined for all nand drivers. It is defined for at least mxs, am335x,
> >> > > atmel, and simple. Additionally, some drivers (such as rockchip)
> >> > > implement bad block skipping in nand_spl_load_image.
> >> > >
> >> > > It's not at all clear to me that there is common config which determines
> >> > > whether nand_spl_adjust_offset is implemented. Nevertheless, in commit
> >> > > aa0032f672 ("spl: fit: nand: allow for non-page-aligned elements"), the
> >> > > above directive selects different code if CONFIG_SYS_NAND_BLOCK_SIZE is
> >> > > defined.
> >> > >
> >> > >>      ulong sector;
> >> > >>
> >> > >>      sector = *(int *)load->priv;
> >> > >
> >> > > This is the offset passed to nand_spl_load_image. This
> >> > > offset is in units of bytes, and is aligned to the block size. However,
> >> > > it is *not* set if CONFIG_SPL_LOAD_IMX_CONTAINER is enabled. Oops.
> >> > >
> >> > >>      offs = sector + nand_spl_adjust_offset(sector, offs - sector);
> >> >
> >> > I forgot to mention this, but this also adds sector to offs, but offs already includes the sector:
> >> >
> >> > > return spl_load_simple_fit(spl_image, &load, offset / bl_len, header);
> >>
> >> The spl_nand_fit_read was called from info->read that was in units of sectors
> >> count = info->read(info, sector, sectors, fit);
> >> debug("fit read sector %lx, sectors=%d, dst=%p, count=%lu, size=0x%lx\n",
> >>               sector, sectors, fit, count, size);
> >>
> >>
> >> load.dev = NULL;
> >> load.priv = &offset;
> >> load.filename = NULL;
> >> load.bl_len = 1;
>
> Ah, here is the problem. If bl_len is always 1, then there is no problem
> going back and forth between units. But aa0032f672 ("spl: fit: nand: allow
> for non-page-aligned elements") means that bl_len is not always 1, so
> now the conversion matters.
>
> So I suppose my question is now for Tim: how did you test your patch?
>

Hi Sean,

I maintain a set of boards using gwventana_nand_defconfg that have
various NAND parts.

Yes, I vaguely recall running into problems when moving my boards to
DM. It was some time ago but if I recall I found
nand_spl_adjust_offset didn't exist for MXS and when I noticed the
requirement for it was introduced in 9f6a14c47ff9 ("spl: fit: nand:
fix fit loading in case of bad blocks") I looked at that patch and
noticed it added the notion of sectors which lead me to submitting
39cb85043cdb "(spl: fit: nand: skip bad block handling if NAND chip
not fully defined)" which added the ifdef check around it to
essentially move it back to bytes.

As far as testing goes I test with IMX6 boards using NAND as a boot
device with board/gateworks/gw_ventana and gwventana_nand_defconfig

With some debugging added my boot looks like this:
Booting from NAND
PMIC:  PFUZE100 ID=0x10
Trying to boot from NAND
spl: nand - using hw ecc
spl_nand_load_element bl_len=0xe00000 offset=0x1000 mtd=18200268
Found FIT
spl_nand_fit_read offset=0xe00 size=0x3 bl_len=0x1000
DTB:   imx6q-gw54xx
spl_nand_fit_read offset=0xe02 size=0xca bl_len=0x1000
spl_nand_fit_read offset=0xf07 size=0xc bl_len=0x1000

For my config CONFIG_SYS_NAND_BLOCK_SIZE is not defined because the
boards supported use a variety of devices:
2GiB Micron Technology MT29F16G08ADACAH4 02120768 (4k page size, 256k
erase size, 224 oob size, legacy layout ecc str=16)
2GiB Cypress S34ML16G202BHI000 02120939 (2k page size, 128k erase
size, 128B oob size, legacy layout ecc str=16)
1GiB Micron MT29F8G08ABACAH4 02120782 (4k page size, 156k erase size,
224 oob size, legacy layout ecc str=16)
256MiB Micron MT29F2G08ABAEAH4 02120770 (2k page size, 128k erase
size, 64B oob size, legacy layout ecc str=16)

So maybe the confusion is all around CONFIG_SYS_NAND_BLOCK_SIZE?

I appreciate you trying to unwind and clean all of this up. I recall
being thoroughly confused when it came time for me to move my boards
to DM and there were so many changes to unravel.

Best Regards,

Tim

> >> load.read = spl_nand_fit_read;
> >>
> >> static ulong spl_nand_fit_read(struct spl_load_info *load, ulong offs,
> >>                                ulong size, void *dst)
> >> {
> >>         ulong sector;
> >>         int ret;
> >>
> >>         sector = *(int *)load->priv;
> >>         offs = sector + nand_spl_adjust_offset(sector, offs - sector);
>
> This line is ONLY ok if bl_len = 1
>
> >>         ret = nand_spl_load_image(offs, size, dst);
> >>         if (!ret)
> >>                 return size;
> >>         else
> >>                 return 0;
> >> }
> >>
> >> Sorry it's a bit of time and I think that we are discussing about this
> >> commit at the time was introduced
> >> Michael
> >>
> >> >
> >> > Which means that we add in sector twice (but with different units!)
> >> >
> >> > > The documentation for this this function is as follows
> >> > >
> >> > >> nand_spl_adjust_offset - Adjust offset from a starting sector
> >> > >> @sector:     Address of the sector
> >> > >> @offs:       Offset starting from @sector
> >> > >>
> >> > >> If one or more bad blocks are in the address space between @sector
> >> > >> and @sector + @offs, @offs is increased by the NAND block size for
> >> > >> each bad block found.
> >> > >
> >> > > In particular, note that offs is in units of bytes. However, we know
> >> > > from above that at this point in the function, offs is in units of
> >> > > bl_len. Oops.
> >> > >
> >> > >> #else
> >> > >>      offs *= load->bl_len;
> >> > >>      size *= load->bl_len;
> >> > >
> >> > > This code path adjusts both offs and size to be in units of
> >> >
> >> > *units of bytes
> >> >
> >> > >> #endif
> >> > >>      err = nand_spl_load_image(offs, size, dst);
> >> > >
> >> > > So by the time we get here, one code path is in units of bytes, and the
> >> > > other is in units of bl_len. nand_spl_load_image seems to expect bytes
> >> > > (though there is no documentation, so I can't be sure).  Which of course
> >> > > begs the question: how did the code introduced in 9f6a14c47f ever work?
> >
> > The commit 9f6a14c47f only added the code strictly necessary to skip a bad block
> > that was in that memory area. I heavily tested it on a custom board that presented
> > this problem and that without it it would not have been usable. The patch itself
> > was also tested on SOCs that did not have bad block, therefore backward compatible.
> > I think that your doubts arise from the patches following mine and that, however,
> > I have not had the opportunity to test.
>
> Have you tested v2021.07 or later?
>
> --Sean
>
> >> > >
> >> > >>      if (err)
> >> > >>              return 0;
> >> > >>
> >> > >>      return size / load->bl_len;
> >> > >
> >> > > And of course, this should only be done if size is in units of bytes.
> >> > >
> >> > >> }
> >> > >
> >> > > I have some patches for what I think are bugs, but the logic of this
> >> > > function is so confused that I am unable to determine if they actually
> >> > > are bugs. I would greatly appreciate if some of the authors of the
> >> > > mentioned commits could comment on their intent, and what they thought
> >> > > the units of offs and size were.
> >> > >
> >> > > --Sean
> >> > >