linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] mailbox: forward the hrtimer if not queued and under a lock
@ 2022-03-31  7:01 Björn Ardö
  2022-04-14 14:30 ` Jassi Brar
  0 siblings, 1 reply; 7+ messages in thread
From: Björn Ardö @ 2022-03-31  7:01 UTC (permalink / raw)
  To: Jassi Brar; +Cc: kernel, Björn Ardö, linux-kernel

This reverts commit c7dacf5b0f32957b24ef29df1207dc2cd8307743,
"mailbox: avoid timer start from callback"

The previous commit was reverted since it lead to a race that
caused the hrtimer to not be started at all. The check for
hrtimer_active() in msg_submit() will return true if the
callback function txdone_hrtimer() is currently running. This
function could return HRTIMER_NORESTART and then the timer
will not be restarted, and also msg_submit() will not start
the timer. This will lead to a message actually being submitted
but no timer will start to check for its compleation.

The original fix that added checking hrtimer_active() was added to
avoid a warning with hrtimer_forward. Looking in the kernel
another solution to avoid this warning is to check hrtimer_is_queued()
before calling hrtimer_forward_now() instead. This however requires a
lock so the timer is not started by msg_submit() inbetween this check
and the hrtimer_forward() call.

Signed-off-by: Björn Ardö <bjorn.ardo@axis.com>
---
 drivers/mailbox/mailbox.c          | 19 +++++++++++++------
 include/linux/mailbox_controller.h |  1 +
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
index 3e7d4b20ab34..4229b9b5da98 100644
--- a/drivers/mailbox/mailbox.c
+++ b/drivers/mailbox/mailbox.c
@@ -82,11 +82,11 @@ static void msg_submit(struct mbox_chan *chan)
 exit:
 	spin_unlock_irqrestore(&chan->lock, flags);
 
-	/* kick start the timer immediately to avoid delays */
 	if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
-		/* but only if not already active */
-		if (!hrtimer_active(&chan->mbox->poll_hrt))
-			hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
+		/* kick start the timer immediately to avoid delays */
+		spin_lock_irqsave(&chan->mbox->poll_hrt_lock, flags);
+		hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
+		spin_unlock_irqrestore(&chan->mbox->poll_hrt_lock, flags);
 	}
 }
 
@@ -120,20 +120,26 @@ static enum hrtimer_restart txdone_hrtimer(struct hrtimer *hrtimer)
 		container_of(hrtimer, struct mbox_controller, poll_hrt);
 	bool txdone, resched = false;
 	int i;
+	unsigned long flags;
 
 	for (i = 0; i < mbox->num_chans; i++) {
 		struct mbox_chan *chan = &mbox->chans[i];
 
 		if (chan->active_req && chan->cl) {
-			resched = true;
 			txdone = chan->mbox->ops->last_tx_done(chan);
 			if (txdone)
 				tx_tick(chan, 0);
+			else
+				resched = true;
 		}
 	}
 
 	if (resched) {
-		hrtimer_forward_now(hrtimer, ms_to_ktime(mbox->txpoll_period));
+		spin_lock_irqsave(&mbox->poll_hrt_lock, flags);
+		if (!hrtimer_is_queued(hrtimer))
+			hrtimer_forward_now(hrtimer, ms_to_ktime(mbox->txpoll_period));
+		spin_unlock_irqrestore(&mbox->poll_hrt_lock, flags);
+
 		return HRTIMER_RESTART;
 	}
 	return HRTIMER_NORESTART;
@@ -500,6 +506,7 @@ int mbox_controller_register(struct mbox_controller *mbox)
 		hrtimer_init(&mbox->poll_hrt, CLOCK_MONOTONIC,
 			     HRTIMER_MODE_REL);
 		mbox->poll_hrt.function = txdone_hrtimer;
+		spin_lock_init(&mbox->poll_hrt_lock);
 	}
 
 	for (i = 0; i < mbox->num_chans; i++) {
diff --git a/include/linux/mailbox_controller.h b/include/linux/mailbox_controller.h
index 36d6ce673503..6fee33cb52f5 100644
--- a/include/linux/mailbox_controller.h
+++ b/include/linux/mailbox_controller.h
@@ -83,6 +83,7 @@ struct mbox_controller {
 				      const struct of_phandle_args *sp);
 	/* Internal to API */
 	struct hrtimer poll_hrt;
+	spinlock_t poll_hrt_lock;
 	struct list_head node;
 };
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] mailbox: forward the hrtimer if not queued and under a lock
  2022-03-31  7:01 [PATCH] mailbox: forward the hrtimer if not queued and under a lock Björn Ardö
@ 2022-04-14 14:30 ` Jassi Brar
  2022-04-19 12:15   ` Bjorn Ardo
  0 siblings, 1 reply; 7+ messages in thread
From: Jassi Brar @ 2022-04-14 14:30 UTC (permalink / raw)
  To: Björn Ardö; +Cc: kernel, Linux Kernel Mailing List

On Thu, Mar 31, 2022 at 2:01 AM Björn Ardö <bjorn.ardo@axis.com> wrote:
>
> This reverts commit c7dacf5b0f32957b24ef29df1207dc2cd8307743,
> "mailbox: avoid timer start from callback"
>
> The previous commit was reverted since it lead to a race that
> caused the hrtimer to not be started at all. The check for
> hrtimer_active() in msg_submit() will return true if the
> callback function txdone_hrtimer() is currently running. This
> function could return HRTIMER_NORESTART and then the timer
> will not be restarted, and also msg_submit() will not start
> the timer. This will lead to a message actually being submitted
> but no timer will start to check for its compleation.
>
> The original fix that added checking hrtimer_active() was added to
> avoid a warning with hrtimer_forward. Looking in the kernel
> another solution to avoid this warning is to check hrtimer_is_queued()
> before calling hrtimer_forward_now() instead. This however requires a
> lock so the timer is not started by msg_submit() inbetween this check
> and the hrtimer_forward() call.
>
This is a very dense api used by many use-cases, I am not confident
making any changes without confirming its a real issue with the common
code. Please share your client code and traces, that will help me get
a clearer picture.

Thanks.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] mailbox: forward the hrtimer if not queued and under a lock
  2022-04-14 14:30 ` Jassi Brar
@ 2022-04-19 12:15   ` Bjorn Ardo
  2022-04-19 14:10     ` Jassi Brar
  0 siblings, 1 reply; 7+ messages in thread
From: Bjorn Ardo @ 2022-04-19 12:15 UTC (permalink / raw)
  To: Jassi Brar; +Cc: kernel, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 7533 bytes --]

Hi,


I can confirm that this is an actual issue found on our system, not just 
a theoretical case.


If I add the following trace-code to the original code:


diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
index 3e7d4b20ab34..8e9e82e5f4b1 100644
--- a/drivers/mailbox/mailbox.c
+++ b/drivers/mailbox/mailbox.c
@@ -57,6 +57,7 @@ static void msg_submit(struct mbox_chan *chan)
         void *data;
         int err = -EBUSY;

+       trace_printk("Entering msg_submit\n");
         spin_lock_irqsave(&chan->lock, flags);

         if (!chan->msg_count || chan->active_req)
@@ -85,9 +86,14 @@ static void msg_submit(struct mbox_chan *chan)
         /* kick start the timer immediately to avoid delays */
         if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
                 /* but only if not already active */
-               if (!hrtimer_active(&chan->mbox->poll_hrt))
+               if (!hrtimer_active(&chan->mbox->poll_hrt)) {
+                       trace_printk("Starting the hr timer from 
submit\n");
hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
+               } else {
+                       trace_printk("Not starting the hr timer from 
submit since it is active\n");
+               }
         }
+       trace_printk("Leaving msg_submit\n");
  }

  static void tx_tick(struct mbox_chan *chan, int r)
@@ -121,6 +127,7 @@ static enum hrtimer_restart txdone_hrtimer(struct 
hrtimer *hrtimer)
         bool txdone, resched = false;
         int i;

+       trace_printk("Entering txdone_hrtimer\n");
         for (i = 0; i < mbox->num_chans; i++) {
                 struct mbox_chan *chan = &mbox->chans[i];

@@ -134,8 +141,10 @@ static enum hrtimer_restart txdone_hrtimer(struct 
hrtimer *hrtimer)

         if (resched) {
                 hrtimer_forward_now(hrtimer, 
ms_to_ktime(mbox->txpoll_period));
+               trace_printk("Leaving txdone_hrtimer with restart\n");
                 return HRTIMER_RESTART;
         }
+       trace_printk("Leaving txdone_hrtimer without restart\n");
         return HRTIMER_NORESTART;
  }

Then I get the following trace output (I have cropped a small portion 
around where the error appears):


        vhost-475-480     [000] d..1.   217.440325: msg_submit: Entering 
msg_submit
        vhost-475-480     [000] d..1.   217.440332: msg_submit: Starting 
the hr timer from submit
        vhost-475-480     [000] d..1.   217.440336: msg_submit: Leaving 
msg_submit
        vhost-475-480     [000] d.h1.   217.440342: txdone_hrtimer: 
Entering txdone_hrtimer
        vhost-475-480     [000] d.h1.   217.440349: txdone_hrtimer: 
Leaving txdone_hrtimer without restart
        vhost-475-480     [000] d..1.   217.440597: msg_submit: Entering 
msg_submit
        vhost-475-480     [000] d..1.   217.440599: msg_submit: Starting 
the hr timer from submit
        vhost-475-480     [000] d..1.   217.440602: msg_submit: Leaving 
msg_submit
           <idle>-0       [001] ..s1.   217.440604: msg_submit: Entering 
msg_submit
        vhost-475-480     [000] d.h1.   217.440606: txdone_hrtimer: 
Entering txdone_hrtimer
        vhost-475-480     [000] d.h1.   217.440608: txdone_hrtimer: 
Leaving txdone_hrtimer without restart
           <idle>-0       [001] ..s1.   217.440609: msg_submit: Not 
starting the hr timer from submit since it is active
           <idle>-0       [001] ..s1.   217.440610: msg_submit: Leaving 
msg_submit


If I break down the log above we first have one case that works as 
intended. That is a message being written and a timer started and the 
message have been read when the timer hits:

        vhost-475-480     [000] d..1.   217.440325: msg_submit: Entering 
msg_submit
        vhost-475-480     [000] d..1.   217.440332: msg_submit: Starting 
the hr timer from submit
        vhost-475-480     [000] d..1.   217.440336: msg_submit: Leaving 
msg_submit
        vhost-475-480     [000] d.h1.   217.440342: txdone_hrtimer: 
Entering txdone_hrtimer
        vhost-475-480     [000] d.h1.   217.440349: txdone_hrtimer: 
Leaving txdone_hrtimer without restart


After this we write a new message and a new timer is started:

        vhost-475-480     [000] d..1.   217.440597: msg_submit: Entering 
msg_submit
        vhost-475-480     [000] d..1.   217.440599: msg_submit: Starting 
the hr timer from submit
        vhost-475-480     [000] d..1.   217.440602: msg_submit: Leaving 
msg_submit


However here comes the race. Now a new message is being written at the 
same time as the hr-timer is handling the first reply (on different CPU's):

           <idle>-0       [001] ..s1.   217.440604: msg_submit: Entering 
msg_submit
        vhost-475-480     [000] d.h1.   217.440606: txdone_hrtimer: 
Entering txdone_hrtimer

And the hr-timer decides not to restart itself:

        vhost-475-480     [000] d.h1.   217.440608: txdone_hrtimer: 
Leaving txdone_hrtimer without restart

And also the new message written decides to not start the timer (even 
thou it requires one), since it has the status of being active during 
the time that the hr-timer callback is running:

           <idle>-0       [001] ..s1.   217.440609: msg_submit: Not 
starting the hr timer from submit since it is active
           <idle>-0       [001] ..s1.   217.440610: msg_submit: Leaving 
msg_submit


So now we end up in a state where we have a message waiting for a reply 
but no timer started that can provide this reply.


Our own mailbox code is attached (it is not upstreamed yet, but it is on 
its way).


Best Regards

Björn

On 4/14/22 16:30, Jassi Brar wrote:
> On Thu, Mar 31, 2022 at 2:01 AM Björn Ardö <bjorn.ardo@axis.com> wrote:
>> This reverts commit c7dacf5b0f32957b24ef29df1207dc2cd8307743,
>> "mailbox: avoid timer start from callback"
>>
>> The previous commit was reverted since it lead to a race that
>> caused the hrtimer to not be started at all. The check for
>> hrtimer_active() in msg_submit() will return true if the
>> callback function txdone_hrtimer() is currently running. This
>> function could return HRTIMER_NORESTART and then the timer
>> will not be restarted, and also msg_submit() will not start
>> the timer. This will lead to a message actually being submitted
>> but no timer will start to check for its compleation.
>>
>> The original fix that added checking hrtimer_active() was added to
>> avoid a warning with hrtimer_forward. Looking in the kernel
>> another solution to avoid this warning is to check hrtimer_is_queued()
>> before calling hrtimer_forward_now() instead. This however requires a
>> lock so the timer is not started by msg_submit() inbetween this check
>> and the hrtimer_forward() call.
>>
> This is a very dense api used by many use-cases, I am not confident
> making any changes without confirming its a real issue with the common
> code. Please share your client code and traces, that will help me get
> a clearer picture.
>
> Thanks.

[-- Attachment #2: mailbox-artpec8.c --]
[-- Type: text/x-csrc, Size: 7507 bytes --]

// SPDX-License-Identifier: GPL-2.0
/* Copyright (C) 2020 Axis Communications AB */

#include <linux/mailbox_controller.h>
#include <linux/platform_device.h>
#include <linux/interrupt.h>
#include <linux/device.h>
#include <linux/iopoll.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/io.h>
#include <linux/of.h>

#include <linux/soc/axis/artpec-endpoint.h>

#define MAILBOX_MCUCTLR		0x00
#define MAILBOX_INTGR0		0x08
#define MAILBOX_INTCR0		0x0c
#define MAILBOX_INTMR0		0x10
#define MAILBOX_INTSR0		0x14
#define MAILBOX_INTMSR0		0x18
#define MAILBOX_INTGR1		0x1C
#define MAILBOX_INTCR1		0x20
#define MAILBOX_INTMR1		0x24
#define MAILBOX_INTSR1		0x28
#define MAILBOX_INTMSR1		0x2C
#define MAILBOX_IS_VERSION	0x50
#define MAILBOX_ISSR(n)		(0x80 + (n) * 4)

/*
 * We have 32 interrupts and thus 32 channels.  We also have 64 general-purpose
 * ISSR registers, and we use these to send 32-bit values between the hosts for
 * each of the channels.  The first 32 ISSR registers are used for data sent
 * from CPU1 to CPU0 and the remaining are used in the other direction.
 */
#define MAILBOX_NUM_CHANS	32

struct artpec8_mbox {
	void __iomem *base;
	unsigned int cpu;
	struct mutex mutex;
	int irq_low;
	int irq_high;
	struct mbox_controller controller;
	struct mbox_chan chans[MAILBOX_NUM_CHANS];
	const struct artpec_endpoint_info *endpoint;
	u32 msi;
};

static irqreturn_t artpec8_mbox_irq(int irq, void *p)
{
	unsigned int cr, msr, issroffset = 0;
	struct artpec8_mbox *mbox = p;
	u32 clear, ints;

	if (mbox->cpu == 0) {
		msr = MAILBOX_INTMSR0;
		cr = MAILBOX_INTCR0;
	} else {
		msr = MAILBOX_INTMSR1;
		cr = MAILBOX_INTCR1;
		issroffset = 32;
	}

	ints = readl_relaxed(mbox->base + msr);
	if (!ints)
		return IRQ_NONE;

	clear = ints;
	while (ints) {
		unsigned int channum = __ffs(ints);
		struct mbox_chan *chan = &mbox->chans[channum];
		u32 msg;

		ints &= ~BIT(channum);

		msg = readl_relaxed(mbox->base + MAILBOX_ISSR(channum + issroffset));
		mbox_chan_received_data(chan, (void *)(unsigned long)msg);
	}

	writel_relaxed(clear, mbox->base + cr);

	return IRQ_HANDLED;
}

static int artpec8_mbox_send_data(struct mbox_chan *chan, void *data)
{
	struct artpec8_mbox *mbox = chan->con_priv;
	unsigned int channum = chan - mbox->chans;
	unsigned int gr, issroffset = 0;
	u32 arg = (u32)(unsigned long)data;

	if (mbox->cpu == 0) {
		gr = MAILBOX_INTGR1;
		issroffset = 32;
	} else {
		gr = MAILBOX_INTGR0;
	}

	writel(arg, mbox->base + MAILBOX_ISSR(channum + issroffset));

	writel(BIT(channum), mbox->base + gr);

	if (mbox->endpoint)
		writel(mbox->endpoint->msi_data + mbox->msi, mbox->endpoint->msi);

	return 0;
}

static bool artpec8_mbox_last_tx_done(struct mbox_chan *chan)
{
	struct artpec8_mbox *mbox = chan->con_priv;
	unsigned int sr = mbox->cpu == 0 ? MAILBOX_INTSR1 : MAILBOX_INTSR0;
	unsigned int channum = chan - mbox->chans;

	return !(readl_relaxed(mbox->base + sr) & BIT(channum));
}

static int artpec8_mbox_startup(struct mbox_chan *chan)
{
	struct artpec8_mbox *mbox = chan->con_priv;
	unsigned int mr = mbox->cpu == 0 ? MAILBOX_INTMR0 : MAILBOX_INTMR1;
	unsigned int channum = chan - mbox->chans;
	u32 val;

	mutex_lock(&mbox->mutex);

	/*
	 * The hardware manual claims that writing 0 has No Effect, but it also
	 * doesn't provide any way to unmask interrupts, so I'm assuming that
	 * that is a typo.
	 */
	val = readl_relaxed(mbox->base + mr);
	val &= ~BIT(channum);
	writel_relaxed(val, mbox->base + mr);

	mutex_unlock(&mbox->mutex);

	return 0;
}

static void artpec8_mbox_shutdown(struct mbox_chan *chan)
{
	struct artpec8_mbox *mbox = chan->con_priv;
	unsigned int mr = mbox->cpu == 0 ? MAILBOX_INTMR0 : MAILBOX_INTMR1;
	unsigned int channum = chan - mbox->chans;
	u32 val;

	mutex_lock(&mbox->mutex);

	val = readl_relaxed(mbox->base + mr);
	val |= BIT(channum);
	writel_relaxed(val, mbox->base + mr);

	mutex_unlock(&mbox->mutex);
}

static const struct mbox_chan_ops artpec8_mbox_ops = {
	.send_data = artpec8_mbox_send_data,
	.startup = artpec8_mbox_startup,
	.shutdown = artpec8_mbox_shutdown,
	.last_tx_done = artpec8_mbox_last_tx_done,
};

static int artpec8_mbox_reset(struct artpec8_mbox *mbox)
{
	u32 val;
	int ret;

	/* We assume that CPU0 starts up first, so it should do the main reset. */
	if (mbox->cpu != 0) {
		writel_relaxed(~0ul, mbox->base + MAILBOX_INTMR1);
		writel_relaxed(~0ul, mbox->base + MAILBOX_INTCR1);
		return 0;
	}

	/*
	 * It's not entirely clear from the hardware manual, but this reset may
	 * also clear the SEMAPHORE* registers used by the hwspinlock driver.
	 * This will probably not be a problem in practice since both drivers
	 * will presumably be used together by the same clients.
	 */
	writel_relaxed(1, mbox->base + MAILBOX_MCUCTLR);
	ret = readl_relaxed_poll_timeout(mbox->base + MAILBOX_MCUCTLR, val,
					 !val, 0, 100);
	if (ret)
		return ret;

	writel_relaxed(~0ul, mbox->base + MAILBOX_INTMR0);
	writel_relaxed(~0ul, mbox->base + MAILBOX_INTCR0);

	return 0;
}

static int artpec8_mbox_probe(struct platform_device *pdev)
{
	struct device *dev = &pdev->dev;
	struct artpec8_mbox *mbox;
	struct resource	*res;
	const struct artpec_endpoint_info *endpoint;
	int ret;
	int i;

	mbox = devm_kzalloc(dev, sizeof(*mbox), GFP_KERNEL);
	if (!mbox)
		return -ENOMEM;

	of_property_read_u32(dev->of_node, "cpu", &mbox->cpu);
	if (mbox->cpu > 1)
		return -EINVAL;

	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
	if (!res)
		return -ENXIO;

	mbox->base = devm_ioremap(dev, res->start, resource_size(res));
	if (IS_ERR(mbox->base))
		return PTR_ERR(mbox->base);

	mbox->irq_low = platform_get_irq_optional(pdev, 0);
	if (mbox->irq_low == -EPROBE_DEFER)
		return mbox->irq_low;

	mbox->irq_high = platform_get_irq_optional(pdev, 1);
	if (mbox->irq_high == -EPROBE_DEFER)
		return mbox->irq_high;

	/* Only enable MSI if present */
	mbox->endpoint = NULL;
	endpoint = artpec_endpoint_get_info(&pdev->dev);
	if (endpoint)
		if (!of_property_read_u32(dev->of_node, "msi", &mbox->msi))
			mbox->endpoint = endpoint;

	mutex_init(&mbox->mutex);

	mbox->controller.txdone_irq = false;
	mbox->controller.txdone_poll = true;
	mbox->controller.txpoll_period = 1;

	mbox->controller.dev = dev;
	mbox->controller.num_chans = ARRAY_SIZE(mbox->chans);
	mbox->controller.chans = mbox->chans;
	mbox->controller.ops = &artpec8_mbox_ops;

	for (i = 0; i < ARRAY_SIZE(mbox->chans); i++)
		mbox->chans[i].con_priv = mbox;

	ret = artpec8_mbox_reset(mbox);
	if (ret)
		return ret;

	if (mbox->irq_low > 0) {
		ret = devm_request_irq(dev, mbox->irq_low, artpec8_mbox_irq,
				       0, "mbox-low", mbox);
		if (ret)
			return ret;
	}

	if (mbox->irq_high > 0) {
		ret = devm_request_irq(dev, mbox->irq_high, artpec8_mbox_irq,
				       0, "mbox-high", mbox);
		if (ret)
			return ret;
	}

	platform_set_drvdata(pdev, mbox);

	return mbox_controller_register(&mbox->controller);
}

static int artpec8_mbox_remove(struct platform_device *pdev)
{
	struct artpec8_mbox *mbox = platform_get_drvdata(pdev);

	mbox_controller_unregister(&mbox->controller);

	return 0;
}

static const struct of_device_id artpec8_mbox_match[] = {
	{ .compatible = "samsung,artpec8-mailbox" },
	{ /* Sentinel */ }
};
MODULE_DEVICE_TABLE(of, artpec8_mbox_match);

static struct platform_driver artpec8_mbox_driver = {
	.probe	= artpec8_mbox_probe,
	.remove	= artpec8_mbox_remove,
	.driver	= {
		.name	= "artpec8-mailbox",
		.of_match_table	= artpec8_mbox_match,
	},
};

module_platform_driver(artpec8_mbox_driver);

MODULE_LICENSE("GPL v2");

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] mailbox: forward the hrtimer if not queued and under a lock
  2022-04-19 12:15   ` Bjorn Ardo
@ 2022-04-19 14:10     ` Jassi Brar
  2022-04-20  8:28       ` Bjorn Ardo
  0 siblings, 1 reply; 7+ messages in thread
From: Jassi Brar @ 2022-04-19 14:10 UTC (permalink / raw)
  To: Bjorn Ardo; +Cc: kernel, Linux Kernel Mailing List

On Tue, Apr 19, 2022 at 7:15 AM Bjorn Ardo <bjorn.ardo@axis.com> wrote:
>
> Hi,
>
>
> I can confirm that this is an actual issue found on our system, not just
> a theoretical case.
>
>
> If I add the following trace-code to the original code:
>
>
> diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
> index 3e7d4b20ab34..8e9e82e5f4b1 100644
> --- a/drivers/mailbox/mailbox.c
> +++ b/drivers/mailbox/mailbox.c
> @@ -57,6 +57,7 @@ static void msg_submit(struct mbox_chan *chan)
>          void *data;
>          int err = -EBUSY;
>
> +       trace_printk("Entering msg_submit\n");
>          spin_lock_irqsave(&chan->lock, flags);
>
>          if (!chan->msg_count || chan->active_req)
> @@ -85,9 +86,14 @@ static void msg_submit(struct mbox_chan *chan)
>          /* kick start the timer immediately to avoid delays */
>          if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
>                  /* but only if not already active */
> -               if (!hrtimer_active(&chan->mbox->poll_hrt))
> +               if (!hrtimer_active(&chan->mbox->poll_hrt)) {
> +                       trace_printk("Starting the hr timer from
> submit\n");
> hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
> +               } else {
> +                       trace_printk("Not starting the hr timer from
> submit since it is active\n");
> +               }
>          }
> +       trace_printk("Leaving msg_submit\n");
>   }
>
>   static void tx_tick(struct mbox_chan *chan, int r)
> @@ -121,6 +127,7 @@ static enum hrtimer_restart txdone_hrtimer(struct
> hrtimer *hrtimer)
>          bool txdone, resched = false;
>          int i;
>
> +       trace_printk("Entering txdone_hrtimer\n");
>          for (i = 0; i < mbox->num_chans; i++) {
>                  struct mbox_chan *chan = &mbox->chans[i];
>
> @@ -134,8 +141,10 @@ static enum hrtimer_restart txdone_hrtimer(struct
> hrtimer *hrtimer)
>
>          if (resched) {
>                  hrtimer_forward_now(hrtimer,
> ms_to_ktime(mbox->txpoll_period));
> +               trace_printk("Leaving txdone_hrtimer with restart\n");
>                  return HRTIMER_RESTART;
>          }
> +       trace_printk("Leaving txdone_hrtimer without restart\n");
>          return HRTIMER_NORESTART;
>   }
>
> Then I get the following trace output (I have cropped a small portion
> around where the error appears):
>
>
>         vhost-475-480     [000] d..1.   217.440325: msg_submit: Entering
> msg_submit
>         vhost-475-480     [000] d..1.   217.440332: msg_submit: Starting
> the hr timer from submit
>         vhost-475-480     [000] d..1.   217.440336: msg_submit: Leaving
> msg_submit
>         vhost-475-480     [000] d.h1.   217.440342: txdone_hrtimer:
> Entering txdone_hrtimer
>         vhost-475-480     [000] d.h1.   217.440349: txdone_hrtimer:
> Leaving txdone_hrtimer without restart
>         vhost-475-480     [000] d..1.   217.440597: msg_submit: Entering
> msg_submit
>         vhost-475-480     [000] d..1.   217.440599: msg_submit: Starting
> the hr timer from submit
>         vhost-475-480     [000] d..1.   217.440602: msg_submit: Leaving
> msg_submit
>            <idle>-0       [001] ..s1.   217.440604: msg_submit: Entering
> msg_submit
>         vhost-475-480     [000] d.h1.   217.440606: txdone_hrtimer:
> Entering txdone_hrtimer
>         vhost-475-480     [000] d.h1.   217.440608: txdone_hrtimer:
> Leaving txdone_hrtimer without restart
>            <idle>-0       [001] ..s1.   217.440609: msg_submit: Not
> starting the hr timer from submit since it is active
>            <idle>-0       [001] ..s1.   217.440610: msg_submit: Leaving
> msg_submit
>
>
> If I break down the log above we first have one case that works as
> intended. That is a message being written and a timer started and the
> message have been read when the timer hits:
>
>         vhost-475-480     [000] d..1.   217.440325: msg_submit: Entering
> msg_submit
>         vhost-475-480     [000] d..1.   217.440332: msg_submit: Starting
> the hr timer from submit
>         vhost-475-480     [000] d..1.   217.440336: msg_submit: Leaving
> msg_submit
>         vhost-475-480     [000] d.h1.   217.440342: txdone_hrtimer:
> Entering txdone_hrtimer
>         vhost-475-480     [000] d.h1.   217.440349: txdone_hrtimer:
> Leaving txdone_hrtimer without restart
>
>
> After this we write a new message and a new timer is started:
>
>         vhost-475-480     [000] d..1.   217.440597: msg_submit: Entering
> msg_submit
>         vhost-475-480     [000] d..1.   217.440599: msg_submit: Starting
> the hr timer from submit
>         vhost-475-480     [000] d..1.   217.440602: msg_submit: Leaving
> msg_submit
>
>
> However here comes the race. Now a new message is being written at the
> same time as the hr-timer is handling the first reply (on different CPU's):
>
>            <idle>-0       [001] ..s1.   217.440604: msg_submit: Entering
> msg_submit
>         vhost-475-480     [000] d.h1.   217.440606: txdone_hrtimer:
> Entering txdone_hrtimer
>
I don't have access to your client driver, but if it submits another
message from rx_callback() that is the problem.

Please have a look at how other platforms do, for example
imx_dsp_rproc_rx_tx_callback()

/**
 * mbox_chan_received_data - A way for controller driver to push data
 *              received from remote to the upper layer.
 * @chan: Pointer to the mailbox channel on which RX happened.
 * @mssg: Client specific message typecasted as void *
 *
 * After startup and before shutdown any data received on the chan
 * is passed on to the API via atomic mbox_chan_received_data().
 * The controller should ACK the RX only after this call returns.
 */
void mbox_chan_received_data(struct mbox_chan *chan, void *mssg)

If not this, please share your client code as well.

thanks.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] mailbox: forward the hrtimer if not queued and under a lock
  2022-04-19 14:10     ` Jassi Brar
@ 2022-04-20  8:28       ` Bjorn Ardo
  2022-05-23 11:56         ` Bjorn Ardo
  0 siblings, 1 reply; 7+ messages in thread
From: Bjorn Ardo @ 2022-04-20  8:28 UTC (permalink / raw)
  To: Jassi Brar; +Cc: kernel, Linux Kernel Mailing List

Hi,


On 4/19/22 16:10, Jassi Brar wrote:
> I don't have access to your client driver, but if it submits another
> message from rx_callback() that is the problem.
>
> Please have a look at how other platforms do, for example
> imx_dsp_rproc_rx_tx_callback()
>
> /**
>   * mbox_chan_received_data - A way for controller driver to push data
>   *              received from remote to the upper layer.
>   * @chan: Pointer to the mailbox channel on which RX happened.
>   * @mssg: Client specific message typecasted as void *
>   *
>   * After startup and before shutdown any data received on the chan
>   * is passed on to the API via atomic mbox_chan_received_data().
>   * The controller should ACK the RX only after this call returns.
>   */
> void mbox_chan_received_data(struct mbox_chan *chan, void *mssg)
>
> If not this, please share your client code as well.
>
> thanks.


In rx_callback() we call tasklet_hi_schedule() to schedule a tasklet and 
this tasklet calls mbox_send_message(). The mailbox is setup with 
.tx_block = false.

I do not quite understand how calling mbox_send_message() from another 
contex (like a work thread as in imx_dsp_rproc_rx_tx_callback()) will 
resolve the race, could you explain this? Or do you mean that it should 
not be called from any interrupt context at all? Looking at the 
documentation of mbox_send_message() it states:


  * This function could be called from atomic context as it simply
  * queues the data and returns a token against the request.


If I look at the code in msg_submit() the part that calls 
hrtimer_active() and hrtimer_start() is completely without a lock. Also 
the code in txdone_hrtimer() that calls hrtimer_forward_now() is without 
a lock. So I cannot see anything preventing these two functions to be 
called concurrently on two different processors (as they do in my 
trace). And looking at the hr_timer code then hrtimer_active() will 
return true if the hrtimer is currently executing txdone_hrtimer().

The way I see the race is if the timer has called txdone_hrtimer() and 
it has passed the part with the for-loop looking at the individual 
channels when the msg_submit() runs. Then txdone_hrtimer() has decided 
to not restart the timer, but hrtimer_active() will still return true 
for a while (until txdone_hrtimer() return completely and the hrtimer 
code can change its status). In this time there is nothing that prevents 
msg_submit() from running, adding a new message that the timer should 
monitor. But if it manages to call hrtimer_active() in the time period 
before the hrtimer code updates it then the current code will never 
start the timer.

Also the poll_hrt is shared between all channels in the mailbox, so it 
does not have to be the same mailbox channel that hrtimer is currently 
waiting for that is calling msg_submit(). So not calling 
mbox_send_message() from rx_callback() will only alter timing for that 
channel. There could still be a race between two different mailbox channels.


This is my understanding of the current code, please tell me if I have 
missed or misunderstood any part of it?


Our current solution are using 4 different mailbox channels 
asynchronously. The code is part of a larger system, but I can put down 
some time and try and extract the relevant parts if you still think this 
is a client issue? But with my current understanding of the code, the 
race between msg_submit() and txdone_hrtimer() is quite clear, and with 
my proposed patch that removes this race we have be able to run for very 
long time without any problems (that is several days). Without the fix 
we get the race after 5-10 min.


Best Regards,

Björn




^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] mailbox: forward the hrtimer if not queued and under a lock
  2022-04-20  8:28       ` Bjorn Ardo
@ 2022-05-23 11:56         ` Bjorn Ardo
  2022-05-23 19:35           ` Jassi Brar
  0 siblings, 1 reply; 7+ messages in thread
From: Bjorn Ardo @ 2022-05-23 11:56 UTC (permalink / raw)
  To: Jassi Brar; +Cc: kernel, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 2580 bytes --]

Hi again,


On 4/20/22 10:28, Bjorn Ardo wrote:
>
>
> Our current solution are using 4 different mailbox channels 
> asynchronously. The code is part of a larger system, but I can put 
> down some time and try and extract the relevant parts if you still 
> think this is a client issue? But with my current understanding of the 
> code, the race between msg_submit() and txdone_hrtimer() is quite 
> clear, and with my proposed patch that removes this race we have be 
> able to run for very long time without any problems (that is several 
> days). Without the fix we get the race after 5-10 min.
>
>
>

I do not know if you have had any time to review my comments yet, but we 
have created some examples to trigger the error.


With the attached testmodule mailbox-loadtest.c I can trigger the error 
by attaching it to the two sides of an mailbox with the following 
devicetree code:

         mboxtest1 {
                 compatible = "mailbox-loadtest";
                 mbox-names = "ping", "pong";
                 mboxes = <&mbox_loop_pri 0 &mbox_loop_pri 1>;
         };

         mboxtest2 {
                 compatible = "mailbox-loadtest";
                 mbox-names = "pong", "ping";
                 mboxes = <&mbox_loop_scd 0 &mbox_loop_scd 1>;
         };


After that I create load on the mailbox by running (or respectively 
system) the following:

while echo 1 > /sys/kernel/debug/mboxtest1/ping ; do
usleep 1
done

while echo 1 > /sys/kernel/debug/mboxtest2/ping ; do
usleep 50000
done

After a few minutes (normally 2-5) I get errors.


Using the patch I sent earlier the errors goes away.


We also have created a mailbox-loopback.c that is a loopback mailbox 
that can be used on the same system (to make testing easier on systems 
that does not have a hardware mailbox), it is also attached. This can be 
probed by the following devicetree code:

         mbox_loop_pri: mailbox_loop_pri {
                 compatible = "mailbox-loopback";
                 #mbox-cells = <1>;
                 side = <0>;
         };
         mbox_loop_scd: mailbox_loop_scd {
                 compatible = "mailbox-loopback";
                 #mbox-cells = <1>;
                 side = <1>;
         };

And with this loopback mailbox we have also been able to reproduce the 
errors without the patch applied.


Best Regards,

Björn


[-- Attachment #2: mailbox-loadtest.c --]
[-- Type: text/x-csrc, Size: 5221 bytes --]

// SPDX-License-Identifier: GPL-2.0-or-later

/* Module to test a mailbox with heavy load on two concurrent channels.
 * Sends the value of the void pointer as the message, so no dereferencing
 * of the pointers are done here, or can be done by the mailbox driver.
 */

#include <linux/debugfs.h>
#include <linux/err.h>
#include <linux/fs.h>
#include <linux/io.h>
#include <linux/kernel.h>
#include <linux/mailbox_client.h>
#include <linux/module.h>
#include <linux/of.h>
#include <linux/platform_device.h>
#include <linux/poll.h>
#include <linux/slab.h>
#include <linux/uaccess.h>
#include <linux/interrupt.h>

struct mbox_loadtest_device {
	struct device		*dev;
	struct dentry		*root_debugfs_dir;

	struct mbox_client     	ping_client;
	struct mbox_chan	*ping_channel;
	struct completion	completion;
	struct mutex		ping_lock;
	u32			ping_rsp;

	struct mbox_client	pong_client;
	struct mbox_chan	*pong_channel;
	struct tasklet_struct	pong_tasklet;
	u32			pong_rsp;
};


static void mbox_loadtest_receive_ping_message(struct mbox_client *client, void *message)
{
	struct mbox_loadtest_device *tdev = container_of(client, struct mbox_loadtest_device, ping_client);

	tdev->ping_rsp = (u32)(unsigned long)message;
	complete(&tdev->completion);
}

static void mbox_loadtest_pong_tasklet(unsigned long data)
{
	struct mbox_loadtest_device *tdev = (struct mbox_loadtest_device *)data;

	mbox_send_message(tdev->pong_channel, (void *)(unsigned long)tdev->pong_rsp);
}

static void mbox_loadtest_receive_pong_message(struct mbox_client *client, void *message)
{
	struct mbox_loadtest_device *tdev = container_of(client, struct mbox_loadtest_device, pong_client);

	tdev->pong_rsp = ((u32)(unsigned long)message) + 1;
	tasklet_init(&tdev->pong_tasklet, mbox_loadtest_pong_tasklet, (unsigned long)tdev);
	tasklet_hi_schedule(&tdev->pong_tasklet);
}

static int mbox_loadtest_send_ping_message(struct mbox_loadtest_device *tdev, u32 message)
{
	int compleated;
	u32 rsp;

	mutex_lock(&tdev->ping_lock);
	reinit_completion(&tdev->completion);
	mbox_send_message(tdev->ping_channel, (void *)(unsigned long)message);

	compleated = wait_for_completion_timeout(&tdev->completion, msecs_to_jiffies(20));
	rsp = tdev->ping_rsp;
	mutex_unlock(&tdev->ping_lock);

	if (!compleated) {
		dev_err(tdev->dev, "Timeout\n");
		return -EFAULT;
	}
	if (rsp != message+1) {
		dev_err(tdev->dev, "Wrong ans %i != %i\n", rsp, message);
		return -EFAULT;
	}

	return 0;
}
static ssize_t mbox_loadtest_ping_write(struct file *filp,
				       const char __user *userbuf,
				       size_t count, loff_t *ppos)
{
	int ret;
	struct mbox_loadtest_device *tdev = filp->private_data;

	ret = mbox_loadtest_send_ping_message(tdev, 0x42);
	return ret < 0 ? ret : count;
}

static const struct file_operations mbox_loadtest_ping_ops = {
	.write	= mbox_loadtest_ping_write,
	.open	= simple_open,
};

static int mbox_loadtest_probe(struct platform_device *pdev)
{
	struct mbox_loadtest_device *tdev;
	struct device_node *np = pdev->dev.of_node;

	tdev = devm_kzalloc(&pdev->dev, sizeof(*tdev), GFP_KERNEL);
	if (!tdev)
		return -ENOMEM;

	if (of_property_match_string(np, "mbox-names", "ping") >= 0) {

		tdev->ping_client.dev = &pdev->dev;
		tdev->ping_client.rx_callback = mbox_loadtest_receive_ping_message;
		tdev->ping_client.tx_done = NULL;
		tdev->ping_client.tx_block = false;
		tdev->ping_client.knows_txdone = false;
		tdev->ping_client.tx_tout = 500;
		tdev->ping_channel =  mbox_request_channel_byname(&tdev->ping_client, "ping");
		if (IS_ERR(tdev->ping_channel)) {
			return -EPROBE_DEFER;
		}
		mutex_init(&tdev->ping_lock);

		if (debugfs_initialized()) {
			tdev->root_debugfs_dir = debugfs_create_dir(dev_name(&pdev->dev), NULL);
			debugfs_create_file("ping", 0600, tdev->root_debugfs_dir,
					    tdev, &mbox_loadtest_ping_ops);
		}
	}

	if (of_property_match_string(np, "mbox-names", "pong") >= 0) {

		tdev->pong_client.dev = &pdev->dev;
		tdev->pong_client.rx_callback = mbox_loadtest_receive_pong_message;
		tdev->pong_client.tx_done = NULL;
		tdev->pong_client.tx_block = false;
		tdev->pong_client.knows_txdone = false;
		tdev->pong_client.tx_tout = 500;
		tdev->pong_channel =  mbox_request_channel_byname(&tdev->pong_client, "pong");
		if (IS_ERR(tdev->pong_channel)) {
			return -EPROBE_DEFER;
		}
	}

	init_completion(&tdev->completion);
	tdev->dev = &pdev->dev;
	platform_set_drvdata(pdev, tdev);
	return 0;
}

static int mbox_loadtest_remove(struct platform_device *pdev)
{
	struct mbox_loadtest_device *tdev = platform_get_drvdata(pdev);

	debugfs_remove_recursive(tdev->root_debugfs_dir);

	if (tdev->ping_channel)
		mbox_free_channel(tdev->ping_channel);
	if (tdev->pong_channel)
		mbox_free_channel(tdev->pong_channel);

	return 0;
}

static const struct of_device_id mbox_loadtest_match[] = {
	{ .compatible = "mailbox-loadtest" },
	{},
};
MODULE_DEVICE_TABLE(of, mbox_loadtest_match);

static struct platform_driver mbox_loadtest_driver = {
	.driver = {
		.name = "mailbox_loadtest",
		.of_match_table = mbox_loadtest_match,
	},
	.probe  = mbox_loadtest_probe,
	.remove = mbox_loadtest_remove,
};
module_platform_driver(mbox_loadtest_driver);

MODULE_DESCRIPTION("Mailbox Load Testing Facility");
MODULE_LICENSE("GPL v2");

[-- Attachment #3: mailbox-loopback.c --]
[-- Type: text/x-csrc, Size: 4275 bytes --]

// SPDX-License-Identifier: GPL-2.0

/* This is a loopback mailbox that can be used to test drivers on
 * a single system. It uses a global memory so only 2 instances
 * (one for each side) can be probed.
 */


#include <linux/mailbox_controller.h>
#include <linux/platform_device.h>
#include <linux/irq_work.h>
#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/of.h>


struct mbox_loopback_chan {
	struct mbox_chan	*self;
	struct mbox_chan	*other;
	struct irq_work		work;
	struct hrtimer		timer;
	u32			msg;
	bool			pending;
};

#define MAILBOX_NUM_CHANS	32

struct mbox_loopback {
	struct mbox_controller		controller[2];
	bool				probed[2];
	struct mbox_loopback_chan	lchans[2*MAILBOX_NUM_CHANS];
	struct mbox_chan		chans[2*MAILBOX_NUM_CHANS];
};

/* A global shared memory for both sides of the mailbox */
static struct mbox_loopback mbox_loopback;

static void mbox_loopback_work(struct irq_work *work)
{
	struct mbox_loopback_chan *lchan = container_of(work, struct mbox_loopback_chan, work);

	mbox_chan_received_data(lchan->other, (void *)(unsigned long)lchan->msg);
	smp_wmb();
	lchan->pending = false;
}

static bool mbox_loopback_last_tx_done(struct mbox_chan *chan)
{
	struct mbox_loopback_chan *lchan = chan->con_priv;

	return !lchan->pending;
}

static int mbox_loopback_send_data(struct mbox_chan *chan, void *data)
{
	struct mbox_loopback_chan *lchan = chan->con_priv;

	lchan->msg = (u32)(unsigned long)data;
	lchan->pending = true;
	smp_wmb();

	/* Start a timer that will trigger an IRQ in a short while */
	hrtimer_start(&lchan->timer, ns_to_ktime(1000), HRTIMER_MODE_REL);
	return 0;
}

static enum hrtimer_restart mbox_loopback_timer_callback(struct hrtimer *hrtimer)
{
	struct mbox_loopback_chan *lchan = container_of(hrtimer, struct mbox_loopback_chan, timer);

	irq_work_queue(&lchan->work);
	return HRTIMER_NORESTART;
}

static void mbox_loopback_shutdown(struct mbox_chan *chan)
{
	struct mbox_loopback_chan *lchan = chan->con_priv;

	hrtimer_cancel(&lchan->timer);
	irq_work_sync(&lchan->work);
}

static const struct mbox_chan_ops mbox_loopback_ops = {
	.send_data = mbox_loopback_send_data,
	.shutdown = mbox_loopback_shutdown,
	.last_tx_done = mbox_loopback_last_tx_done,
};

static int mbox_loopback_probe(struct platform_device *pdev)
{
	struct device *dev = &pdev->dev;
	int i;
	unsigned int side;

	/* Check side (nothing or anything but 1 is primary side) */
	of_property_read_u32(dev->of_node, "side", &side);
	if (side != 1)
		side = 0;

	if (mbox_loopback.probed[side])
		return -ENOMEM;
	mbox_loopback.probed[side] = true;

	mbox_loopback.controller[side].dev = dev;
	mbox_loopback.controller[side].num_chans = MAILBOX_NUM_CHANS;
	mbox_loopback.controller[side].txdone_irq = false;
	mbox_loopback.controller[side].txdone_poll = true;
	mbox_loopback.controller[side].txpoll_period = 1;
	mbox_loopback.controller[side].chans = &mbox_loopback.chans[side * MAILBOX_NUM_CHANS];
	mbox_loopback.controller[side].ops = &mbox_loopback_ops;

	BUILD_BUG_ON(ARRAY_SIZE(mbox_loopback.chans) != ARRAY_SIZE(mbox_loopback.lchans));

	for (i = 0; i < MAILBOX_NUM_CHANS; i++) {
		int me = i + side * MAILBOX_NUM_CHANS;
		int other;

		if (me >= MAILBOX_NUM_CHANS) {
			other = me - MAILBOX_NUM_CHANS;
		} else {
			other = me + MAILBOX_NUM_CHANS;
		}

		mbox_loopback.lchans[me].self = &mbox_loopback.chans[me];
		mbox_loopback.lchans[me].other = &mbox_loopback.chans[other];
		init_irq_work(&mbox_loopback.lchans[me].work, mbox_loopback_work);
		hrtimer_init(&mbox_loopback.lchans[me].timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
		mbox_loopback.lchans[me].timer.function = mbox_loopback_timer_callback;
		mbox_loopback.chans[me].con_priv = &mbox_loopback.lchans[me];
	}
	return devm_mbox_controller_register(dev, &mbox_loopback.controller[side]);
}

static const struct of_device_id mbox_loopback_match[] = {
	{ .compatible = "mailbox-loopback" },
	{ /* sentinel */ }
};
MODULE_DEVICE_TABLE(of, mbox_loopback_match);

static struct platform_driver mbox_loopback_driver = {
	.probe	= mbox_loopback_probe,
	.driver	= {
		.name	= "mailbox-loopback",
		.of_match_table	= mbox_loopback_match,
	},
};

module_platform_driver(mbox_loopback_driver);

MODULE_DESCRIPTION("Loopback Mailbox");
MODULE_LICENSE("GPL v2");

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] mailbox: forward the hrtimer if not queued and under a lock
  2022-05-23 11:56         ` Bjorn Ardo
@ 2022-05-23 19:35           ` Jassi Brar
  0 siblings, 0 replies; 7+ messages in thread
From: Jassi Brar @ 2022-05-23 19:35 UTC (permalink / raw)
  To: Bjorn Ardo; +Cc: kernel, Linux Kernel Mailing List

On Mon, May 23, 2022 at 6:56 AM Bjorn Ardo <bjorn.ardo@axis.com> wrote:
>
> Hi again,
>
>
> On 4/20/22 10:28, Bjorn Ardo wrote:
> >
> >
> > Our current solution are using 4 different mailbox channels
> > asynchronously. The code is part of a larger system, but I can put
> > down some time and try and extract the relevant parts if you still
> > think this is a client issue? But with my current understanding of the
> > code, the race between msg_submit() and txdone_hrtimer() is quite
> > clear, and with my proposed patch that removes this race we have be
> > able to run for very long time without any problems (that is several
> > days). Without the fix we get the race after 5-10 min.
> >
> >
> >
>
> I do not know if you have had any time to review my comments yet, but we
> have created some examples to trigger the error.
>
Thanks, but your last explanation was enough. The logic seems fine.
I was hoping someone impacted by the commit may chime in with a tested/acked by.
I think I'll pick it up now.

Thanks.




>
> With the attached testmodule mailbox-loadtest.c I can trigger the error
> by attaching it to the two sides of an mailbox with the following
> devicetree code:
>
>          mboxtest1 {
>                  compatible = "mailbox-loadtest";
>                  mbox-names = "ping", "pong";
>                  mboxes = <&mbox_loop_pri 0 &mbox_loop_pri 1>;
>          };
>
>          mboxtest2 {
>                  compatible = "mailbox-loadtest";
>                  mbox-names = "pong", "ping";
>                  mboxes = <&mbox_loop_scd 0 &mbox_loop_scd 1>;
>          };
>
>
> After that I create load on the mailbox by running (or respectively
> system) the following:
>
> while echo 1 > /sys/kernel/debug/mboxtest1/ping ; do
> usleep 1
> done
>
> while echo 1 > /sys/kernel/debug/mboxtest2/ping ; do
> usleep 50000
> done
>
> After a few minutes (normally 2-5) I get errors.
>
>
> Using the patch I sent earlier the errors goes away.
>
>
> We also have created a mailbox-loopback.c that is a loopback mailbox
> that can be used on the same system (to make testing easier on systems
> that does not have a hardware mailbox), it is also attached. This can be
> probed by the following devicetree code:
>
>          mbox_loop_pri: mailbox_loop_pri {
>                  compatible = "mailbox-loopback";
>                  #mbox-cells = <1>;
>                  side = <0>;
>          };
>          mbox_loop_scd: mailbox_loop_scd {
>                  compatible = "mailbox-loopback";
>                  #mbox-cells = <1>;
>                  side = <1>;
>          };
>
> And with this loopback mailbox we have also been able to reproduce the
> errors without the patch applied.
>
>
> Best Regards,
>
> Björn
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2022-05-23 19:43 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-31  7:01 [PATCH] mailbox: forward the hrtimer if not queued and under a lock Björn Ardö
2022-04-14 14:30 ` Jassi Brar
2022-04-19 12:15   ` Bjorn Ardo
2022-04-19 14:10     ` Jassi Brar
2022-04-20  8:28       ` Bjorn Ardo
2022-05-23 11:56         ` Bjorn Ardo
2022-05-23 19:35           ` Jassi Brar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).