Re: [PATCH 2/5] dmaengine: Add support for custom data mapping

From: Abhishek Sahu <absahu@codeaurora.org>
To: Andy Gross <andy.gross@linaro.org>
Cc: Vinod Koul <vinod.koul@intel.com>,
	dan.j.williams@intel.com, stanimir.varbanov@linaro.org,
	mcgrof@suse.com, okaya@codeaurora.org, pramod.gurav@linaro.org,
	arnd@arndb.de, linux-kernel@vger.kernel.org,
	dmaengine@vger.kernel.org, linux-arm-msm@vger.kernel.org
Subject: Re: [PATCH 2/5] dmaengine: Add support for custom data mapping
Date: Mon, 02 Jan 2017 19:55:37 +0530	[thread overview]
Message-ID: <8055fa20b35139e7d13831583ebf4f4f@codeaurora.org> (raw)
In-Reply-To: <20161229175450.GB17770@hector.attlocal.net>

On 2016-12-29 23:24, Andy Gross wrote:
> On Thu, Dec 22, 2016 at 01:04:37AM +0530, Abhishek Sahu wrote:
>> On 2016-12-21 01:55, Andy Gross wrote:
>> >On Wed, Dec 21, 2016 at 12:58:50AM +0530, Abhishek Sahu wrote:
>> >
>> ><snip>
>> >
>> >>>>Okay, do you have pointer on that one, will avoid asking the same
>> >>>>questions
>> >>>>again.
>> >>>
>> >>>I'll see if I can find the correspondance and send to you directly.
>> >>>
>> >>>>
>> >>>>> Ahbishek, correct me where i am wrong on the following:
>> >>>>> So two main differences between a normal descriptor and a command descriptor:
>> >>>>> 1) size of the descriptor
>> >>>>> 2) the flag setting
>> >>>>> 3) data sent in is a modified scatter gather that includes flags , vs a normal
>> >>>>> scatter gather
>> >>
>> >>Top level descriptor is same for both. Only difference is Command flag.
>> >>The
>> >>command descriptor will contain list of register read/write instead of
>> >>data
>> >>address
>> >>The peripheral driver can form the list with helper function provided in
>> >>patch 5
>> >>and submit it to BAM. The main issue is for other flag like EOT/NWD.
>> >>
>> >>The top level descriptor is again in the form of list where BAM writes
>> >>the
>> >>address of the list in register before starting of transfer. In this
>> >>list,
>> >>each element will have different flags.
>> >
>> >Ah that's right.  The command descriptor information is the format of the
>> >data
>> >pointed to by the sgl.  So you'd have some set of register read/writes
>> >described
>> >in those entries.
>> >
>> >>
>> >>>>>
>> >>>>> So the CMD descriptors in a given sgl can all have varying flags set? I'd assume
>> >>>>> they all have CMD flag set.  Do the current users of the command descriptors
>> >>>>> coalesce all of their requests into a big list?
>> >>>>>
>> >>
>> >>The main user for command descriptor is currently QPIC NAND/LCD. The
>> >>NAND
>> >>uses
>> >>3 BAM channels- tx, rx and command. NAND controller do the data transfer
>> >>in
>> >>chunk of codeword (maximum 544 bytes). NAND chip does the data transfer
>> >>on
>> >>page basis so each page read/write can have multiple codewords. The NAND
>> >>driver prepares command, tx(write) or rx(read) descriptor for complete
>> >>page
>> >>, submit it to BAM and wait for its completion. So NAND driver coalesces
>> >>all of their requests into a big list. In this big list,
>> >>
>> >>1. Some of the request for command channel requires NWD flag to be set.
>> >
>> >I'd expect this to occur at the end of a chain.  So if you had 5 CMD
>> >descriptors
>> >described in the SGL, only the last descriptor would have the NWD set.
>> >Correct?
>> >
>> >>2. TX request depends upon the setting of EOT flag so some of the TX
>> >>request
>> >>   in complete page write will contain EOT flag and others will not. So
>> >>this
>> >>   custom mapping will be used for data descriptor also in NAND driver.
>> >
>> >Can you give a sequence description of the descriptors and flags?  I
>> >haven't
>> >seen the NAND documentation that describes the sequence/flow.
>> 
>> Following is the sample list of command descriptor for page write(2K 
>> page).
>> The actual list will contain more no of descriptor which involves
>> spare area transfer also.
>> 
>> Index	INT	NWD	CMD	24bit Register Address
>> 0	-	-	1	0x0000F0 (EBI2_ECC_BUF_CFG)
>> 
>> 1	-	-	1	0x000020 (NAND_DEVn_CFG0)
>> 				0x000024 (NAND_DEVn_CFG1)
>> 				0x0000F0 (EBI2_ECC_BUF_CFG)
>> 				0x00000C (NAND_FLASH_CHIP_SELECT)
>> 
>> 2	-	-	1	0x000004 (NAND_ADDR0)
>> 				0x000008 (NAND_ADDR1)
>> 
>> 3	-	1	1	0x000000 (NAND_FLASH_CMD)
>> 
>> 4	-	1	1	0x000010 (NANDC_EXEC_CMD)
>> 
>> 5	-	-	1	0x000014 (NAND_FLASH_STATUS)
>> 
>> 6	-	1	1	0x000000 (NAND_FLASH_CMD)
>> 
>> 7	-	1	1	0x000010 (NANDC_EXEC_CMD)
>> 
>> 8	-	-	1	0x000014 (NAND_FLASH_STATUS)
>> 
>> 9	-	1	1	0x000000 (NAND_FLASH_CMD)
>> 
>> 10	-	1	1	0x000010 (NANDC_EXEC_CMD)
>> 
>> 11	-	-	1	0x000014 (NAND_FLASH_STATUS)
>> 
>> 12	-	1	1	0x000000 (NAND_FLASH_CMD)
>> 
>> 13	-	1	1	0x000010 (NANDC_EXEC_CMD)
>> 
>> 14	1	-	1	0x000014 (NAND_FLASH_STATUS)
>> 
>> 15	-	-	1	0x000044 (NAND_FLASH_READ_STATUS)
>> 				0x000014 (NAND_FLASH_STATUS)
> 
> Yeah I was expecting something like:
> - Setup NAND controller using some command writes (indices 0-4)
>   Loop doing the following until all the data is done:
>   - Send/Receive the Data
>   - Check status.
> 
> The only one that sticks out to me is index 14.  Is the INT flag there 
> to mark
> the actual end of the data transfer from the device?  Then you do one 
> more
> Status read.
> 
This is sample list given in NAND document. INT will be set only for the 
last
command. I checked the NAND driver in which status will be read only 
once for
each codeword.
>> >
>> >>>>> So a couple of thoughts on how to deal with this:
>> >>>>>
>> >>>>> 1) Define a virtual channel for the command descriptors vs a normal DMA
>> >>>>> transaction.  This lets you use the same hardware channel, but lets you discern
>> >>>>> which descriptor format you need to use.  The only issue I see with this is the
>> >>>>> required change in device tree binding to target the right type of channel (cmd
>> >>>>> vs normal).
>> >>>>
>> >>>>Or mark the descriptor is cmd and write accordingly...
>> >>>
>> >>>The only issue i see with that is knowing how much to pre-allocate during
>> >>>the
>> >>>prep call.  The flag set API would be called on the allocated tx
>> >>>descriptor.  So
>> >>>you'd have to know up front and be able to specify it.
>> >>>
>> >>>>
>> >>>>>
>> >>>>> 2) Provide an API to set flags for the descriptor on a whole descriptor basis.
>> >>>>>
>> >>>>> 3) If you have a set of transactions described by an sgl that has disparate use
>> >>>>> of flags, you split the list and use a separate transaction.  In other words, we
>> >>>>> need to enforce that the flag set API will be applied to all descriptors
>> >>>>> described by an sgl.  This means that the whole transaction may be comprised of
>> >>>>> multiple async TX descriptors.
>> >>
>> >>Each async TX descriptor will generate the BAM interrupt so we are
>> >>deviating
>> >>from main purpose of DMA where ideally we should get the interrupt at
>> >>the
>> >>end
>> >>of transfer. This is the main reason for going for this patch.
>> >
>> >If the client queues 1 descriptor or 5 descriptors, it doesn't matter that
>> >much.
>> >The client knows when it is done by waiting for the descriptors to
>> >complete.  It
>> >is less efficient than grouping them all, but it should still work.
>> >
>> Yes. client will wait for final descriptor completion. But these 
>> interrupts
>> will be overhead for CPU. For 1-2 page it won't matter much I guess it 
>> will
>> be
>> significant for complete chip read/write(during boot and FS i.e JFFS
>> operations).
>> >>
>> >>With the submitted patch, only 1 interrupt per channel is required for
>> >>complete NAND page and it solves the setting of BAM specific flags also.
>> >>Only issue with this patch is adding new API in DMA framework itself.
>> >>But
>> >>this API can be used by other DMA engines in future for which mapping
>> >>cannot
>> >>be done with available APIs and if this mapping is vendor specific.
>> >
>> >I guess the key point in all of this is that the DMA operation being done
>> >is not
>> >a normal data flow to/from the device.  It's direct remote register access
>> >to
>> >the device using address information contained in the sgl.  And you are
>> >collating the standard data access along with the special command access
>> >in the
>> >same API call.
>> Yes. Normally DMA engine (QCA ADM DMA engine also) allows to 
>> read/write
>> memory mapped
>> registers just like data. But BAM is different (Since it is not a 
>> global DMA
>> Engine
>> and coupled with peripheral). Also, this different flag requirement is 
>> not
>> just
>> for command descriptors but for data descriptors also.
>> 
>> BAM data access and command access differs only with flag and register
>> read/write
>> list. The register read and write list will be simply array of
>> struct bam_cmd_element added in patch
>> struct bam_cmd_element {
>>         __le32 addr:24;
>>         __le32 command:8;
>>         __le32 data;
>>         __le32 mask;
>>         __le32 reserved;
>> };
>> 
>> The address and size of the array will be passed in data and size 
>> field of
>> SGL.
>> If we want to form the SGL for mentioned list then we will have SGL of 
>> size
>> 15
>> with just one descriptor.
>> 
>> Now we require different flag for each SG entry. currently SG does not 
>> have
>> flag parameter and we can't add flag parameter just for our 
>> requirement in
>> generic SG. So we have added the custom mapping function and passed 
>> modified
>> SG
>> as parameter which is generic SG and flag.
> 
> I really think that we need some additional API that allows for the 
> flag munging
> for the descriptors instead of overriding the prep_slave_sg.  We 
> already needed
> to change the way the flags are passed anyway.  And instead of building 
> up a
> special sg list, the API should take a structure that has a 1:1 mapping 
> of the
> flags to the descriptors.  And you would call this API on your 
> descriptor before
> issuing it.
> 
> So build up the sglist.  Call the prep_slave_sg.  You get back a tx 
> descriptor
> that underneath is a bam descriptor.  Then call the API giving the 
> descriptor
> and the structure that defines the flags for the descriptors.  Then 
> submit the
> descriptor.
> 
> Something like:
> int qcom_bam_apply_descriptor_flags(struct dma_async_tx_descriptor *tx,
> 				    u16 *flags)
> {
> 	struct bam_async_desc async_desc = container_of(tx,
> 							struct bam_async_desc,
> 							vd.tx);
> 	int i;
> 
> 	for (i = 0; i < async_desc->num_desc; i++)
> 		async_desc->desc[i].flags = cpu_to_le16(flags[i]);
> }
> 
> EXPORT_SYMBOL(qcom_bam_apply_descriptor_flags)
> 

We want to tightly couple the SG and its flag. But if this 1:1 mapping 
is acceptable
in linux kernel then we can go ahead with this. It will solve our 
requirement and
does not require any change in Linux DMA API. I will do the same and 
will submit the
new patches.

> This applies the flags directly to the underlying hardware descriptors. 
>  The
> prep_slave_sg call would need to remove all the flag munging.  The 
> bam_start_dma
> would need to account for this as well by only setting the INT flag if 
> the
> transfer cannot get all of the descriptors in the FIFO.

It seems no major change is required in prep_slave_sg or bam_start_dma 
since
it is just setting INT flag for last entry which is required for QPIC 
drivers
also. We need to change the assignment of flag with bitwise OR 
assignment for
last BAM desc in function bam_start_dma

         /* set any special flags on the last descriptor */
         if (async_desc->num_desc == async_desc->xfer_len)
                 desc[async_desc->xfer_len - 1].flags |=
                                         cpu_to_le16(async_desc->flags);

> 
> Regards,
> 
> Andy

-- 
Abhishek Sahu