Re: [PATCH v8 2/3] CMDQ: Mediatek CMDQ driver - Horng-Shyang Liao

From: Horng-Shyang Liao <hs.liao@mediatek.com>
To: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Rob Herring <robh+dt@kernel.org>,
	Daniel Kurtz <djkurtz@chromium.org>,
	Sascha Hauer <s.hauer@pengutronix.de>,
	<devicetree@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
	<linux-arm-kernel@lists.infradead.org>,
	<linux-mediatek@lists.infradead.org>,
	<srv_heupstream@mediatek.com>,
	"Sascha Hauer" <kernel@pengutronix.de>,
	Philipp Zabel <p.zabel@pengutronix.de>,
	Nicolas Boichat <drinkcat@chromium.org>,
	CK HU <ck.hu@mediatek.com>,
	"cawa cheng" <cawa.cheng@mediatek.com>,
	Bibby Hsieh <bibby.hsieh@mediatek.com>,
	"YT Shen" <yt.shen@mediatek.com>,
	Daoyuan Huang <daoyuan.huang@mediatek.com>,
	Damon Chu <damon.chu@mediatek.com>,
	Josh-YC Liu <josh-yc.liu@mediatek.com>,
	Glory Hung <glory.hung@mediatek.com>,
	Jiaguang Zhang <jiaguang.zhang@mediatek.com>,
	Dennis-YC Hsieh <dennis-yc.hsieh@mediatek.com>,
	Monica Wang <monica.wang@mediatek.com>,
	<jassisinghbrar@gmail.com>, <jaswinder.singh@linaro.org>,
	<hs.liao@mediatek.com>
Subject: Re: [PATCH v8 2/3] CMDQ: Mediatek CMDQ driver
Date: Tue, 14 Jun 2016 15:44:28 +0800	[thread overview]
Message-ID: <1465890268.7191.13.camel@mtksdaap41> (raw)
In-Reply-To: <57583B45.2080504@gmail.com>

Hi Matthias,

On Wed, 2016-06-08 at 17:35 +0200, Matthias Brugger wrote:
> 
> On 08/06/16 14:25, Horng-Shyang Liao wrote:
> > Hi Matthias,
> >
> > On Wed, 2016-06-08 at 12:45 +0200, Matthias Brugger wrote:
> >>
> >> On 08/06/16 07:40, Horng-Shyang Liao wrote:
> >>> Hi Matthias,
> >>>
> >>> On Tue, 2016-06-07 at 18:59 +0200, Matthias Brugger wrote:
> >>>>
> >>>> On 03/06/16 15:11, Matthias Brugger wrote:
> >>>>>
> >>>>>
> >>>> [...]
> >>>>
> >>>>>>>>>>>>>> +
> >>>>>>>>>>>>>> +            smp_mb(); /* modify jump before enable thread */
> >>>>>>>>>>>>>> +        }
> >>>>>>>>>>>>>> +
> >>>>>>>>>>>>>> +        cmdq_thread_writel(thread, task->pa_base +
> >>>>>>>>>>>>>> task->command_size,
> >>>>>>>>>>>>>> +                   CMDQ_THR_END_ADDR);
> >>>>>>>>>>>>>> +        cmdq_thread_resume(thread);
> >>>>>>>>>>>>>> +    }
> >>>>>>>>>>>>>> +    list_move_tail(&task->list_entry, &thread->task_busy_list);
> >>>>>>>>>>>>>> +    spin_unlock_irqrestore(&cmdq->exec_lock, flags);
> >>>>>>>>>>>>>> +}
> >>>>>>>>>>>>>> +
> >>>>>>>>>>>>>> +static void cmdq_handle_error_done(struct cmdq *cmdq,
> >>>>>>>>>>>>>> +                   struct cmdq_thread *thread, u32 irq_flag)
> >>>>>>>>>>>>>> +{
> >>>>>>>>>>>>>> +    struct cmdq_task *task, *tmp, *curr_task = NULL;
> >>>>>>>>>>>>>> +    u32 curr_pa;
> >>>>>>>>>>>>>> +    struct cmdq_cb_data cmdq_cb_data;
> >>>>>>>>>>>>>> +    bool err;
> >>>>>>>>>>>>>> +
> >>>>>>>>>>>>>> +    if (irq_flag & CMDQ_THR_IRQ_ERROR)
> >>>>>>>>>>>>>> +        err = true;
> >>>>>>>>>>>>>> +    else if (irq_flag & CMDQ_THR_IRQ_DONE)
> >>>>>>>>>>>>>> +        err = false;
> >>>>>>>>>>>>>> +    else
> >>>>>>>>>>>>>> +        return;
> >>>>>>>>>>>>>> +
> >>>>>>>>>>>>>> +    curr_pa = cmdq_thread_readl(thread, CMDQ_THR_CURR_ADDR);
> >>>>>>>>>>>>>> +
> >>>>>>>>>>>>>> +    list_for_each_entry_safe(task, tmp, &thread->task_busy_list,
> >>>>>>>>>>>>>> +                 list_entry) {
> >>>>>>>>>>>>>> +        if (curr_pa >= task->pa_base &&
> >>>>>>>>>>>>>> +            curr_pa < (task->pa_base + task->command_size))
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> What are you checking here? It seems as if you make some implcit
> >>>>>>>>>>>>> assumptions about pa_base and the order of execution of
> >>>>>>>>>>>>> commands in the
> >>>>>>>>>>>>> thread. Is it save to do so? Does dma_alloc_coherent give any
> >>>>>>>>>>>>> guarantees
> >>>>>>>>>>>>> about dma_handle?
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1. Check what is the current running task in this GCE thread.
> >>>>>>>>>>>> 2. Yes.
> >>>>>>>>>>>> 3. Yes, CMDQ doesn't use iommu, so physical address is continuous.
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Yes, physical addresses might be continous, but AFAIK there is no
> >>>>>>>>>>> guarantee that the dma_handle address is steadily growing, when
> >>>>>>>>>>> calling
> >>>>>>>>>>> dma_alloc_coherent. And if I understand the code correctly, you
> >>>>>>>>>>> use this
> >>>>>>>>>>> assumption to decide if the task picked from task_busy_list is
> >>>>>>>>>>> currently
> >>>>>>>>>>> executing. So I think this mecanism is not working.
> >>>>>>>>>>
> >>>>>>>>>> I don't use dma_handle address, and just use physical addresses.
> >>>>>>>>>>      From CPU's point of view, tasks are linked by the busy list.
> >>>>>>>>>>      From GCE's point of view, tasks are linked by the JUMP command.
> >>>>>>>>>>
> >>>>>>>>>>> In which cases does the HW thread raise an interrupt.
> >>>>>>>>>>> In case of error. When does CMDQ_THR_IRQ_DONE get raised?
> >>>>>>>>>>
> >>>>>>>>>> GCE will raise interrupt if any task is done or error.
> >>>>>>>>>> However, GCE is fast, so CPU may get multiple done tasks
> >>>>>>>>>> when it is running ISR.
> >>>>>>>>>>
> >>>>>>>>>> In case of error, that GCE thread will pause and raise interrupt.
> >>>>>>>>>> So, CPU may get multiple done tasks and one error task.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I think we should reimplement the ISR mechanism. Can't we just read
> >>>>>>>>> CURR_IRQ_STATUS and THR_IRQ_STATUS in the handler and leave
> >>>>>>>>> cmdq_handle_error_done to the thread_fn? You will need to pass
> >>>>>>>>> information from the handler to thread_fn, but that shouldn't be an
> >>>>>>>>> issue. AFAIK interrupts are disabled in the handler, so we should stay
> >>>>>>>>> there as short as possible. Traversing task_busy_list is expensive, so
> >>>>>>>>> we need to do it in a thread context.
> >>>>>>>>
> >>>>>>>> Actually, our initial implementation is similar to your suggestion,
> >>>>>>>> but display needs CMDQ to return callback function very precisely,
> >>>>>>>> else display will drop frame.
> >>>>>>>> For display, CMDQ interrupt will be raised every 16 ~ 17 ms,
> >>>>>>>> and CMDQ needs to call callback function in ISR.
> >>>>>>>> If we defer callback to workqueue, the time interval may be larger than
> >>>>>>>> 32 ms.sometimes.
> >>>>>>>>
> >>>>>>>
> >>>>>>> I think the problem is, that you implemented the workqueue as a ordered
> >>>>>>> workqueue, so there is no parallel processing. I'm still not sure why
> >>>>>>> you need the workqueue to be ordered. Can you please explain.
> >>>>>>
> >>>>>> The order should be kept.
> >>>>>> Let me use mouse cursor as an example.
> >>>>>> If task 1 means move mouse cursor to point A, task 2 means point B,
> >>>>>> and task 3 means point C, our expected result is A -> B -> C.
> >>>>>> If the order is not kept, the result could become A -> C -> B.
> >>>>>>
> >>>>>
> >>>>> Got it, thanks for the clarification.
> >>>>>
> >>>>
> >>>> I think a way to get rid of the workqueue is to use a timer, which gets
> >>>> programmed to the time a timeout in the first task in the busy list
> >>>> would happen. Everytime we update the busy list (e.g. because of task
> >>>> got finished by the thread), we update the timer. When the timer
> >>>> triggers, which hopefully won't happen too often, we return timeout on
> >>>> the busy list elements, until the time is lower then the actual time.
> >>>>
> >>>> At least with this we can reduce the data structures in this driver and
> >>>> make it more lightweight.
> >>>
> >>>   From my understanding, your proposed method can handle timeout case.
> >>>
> >>> However, the workqueue is also in charge of releasing tasks.
> >>> Do you take releasing tasks into consideration by using the proposed
> >>> timer method?
> >>> Furthermore, I think the code will become more complex if we also use
> >>> timer to implement releasing tasks.
> >>>
> >>
> >> Can't we call
> >>           clk_disable_unprepare(cmdq->clock);
> >>           cmdq_task_release(task);
> >> after invoking the callback?
> >
> > Do you mean just call these two functions in ISR?
> > My major concern is dma_free_coherent() and kfree() in
> > cmdq_task_release(task).
> 
> Why do we need the dma calls at all? Can't we just calculate the 
> physical address using __pa(x)?

I prefer to use dma_map_single/dma_unmap_single.

> > Therefore, your suggestion is to use GFP_ATOMIC for both
> > dma_alloc_coherent() and kzalloc(). Right?
> 
> I don't think we need GFP_ATOMIC, the critical path will just free the 
> memory.

I tested these two functions, and kfree was safe.
However, dma_free_coherent raised BUG.
BUG: failure at
/mnt/host/source/src/third_party/kernel/v3.18/mm/vmalloc.c:1514/vunmap()!
1512 void vunmap(const void *addr)
1513 {
1514         BUG_ON(in_interrupt());		// <-- here
1515         might_sleep();
1516         if (addr)
1517                 __vunmap(addr, 0);
1518 }
1519 EXPORT_SYMBOL(vunmap);

Therefore, I plan to use kmalloc + dma_map_single instead of
dma_alloc_coherent, and dma_unmap_single + kfree instead of
dma_free_coherent.

What do you think about the function replacement?

> > If so, I can try to implement timeout by timer, and discuss with you
> > if I have further questions.
> >
> 
> Sounds good :)
> 
> Thanks,
> Matthias

Thanks,
HS

> >> Regrading the clock, wouldn't it be easier to handle the clock
> >> enable/disable depending on the state of task_busy_list? I suppose we
> >> can't as we would need to check the task_busy_list of all threads, right?
> >>
> >> Regards,
> >> Matthias
> >
> > Thanks,
> > HS
> >