On 05.05.2021 11:30:49, Torin Cooper-Bennun wrote: > I've been testing the TCAN4550 recently with proper kit (no more jumper > wires, hooray!) and I'm happy to say the RX path is fixed in v5.12 with > the latest patches, and even with heavy load I see no missed frames or > errors. \o/ > However, TX still needs work. It's easy to break the driver due to the > following logic in m_can_start_xmit(): > > | if (cdev->tx_skb) { > | netdev_err(dev, "hard_xmit called while tx busy\n"); > | return NETDEV_TX_BUSY; > | } > > Regardless of your netif TX queue length or the number of TX buffers > allocated in the M_CAN core, if you try to transmit too quickly, you > will hit this. For the application I'm working on, I run into this very > quickly with real-world scenarios. > > Also, the queue is always stopped before the tx work is queued in > m_can_start_xmit(), which seems wrong and clearly doesn't solve the > problem: The simple approach is to stop the tx queue in ndo_start_xmit() and to restart it if the skb has been send. If the above error message is triggered I suppose there's a race condition in the driver, cdev->tx_skb is cleared _after_ the queue has been restarted. I think it's somewhere here: | static netdev_tx_t m_can_tx_handler(struct m_can_classdev *cdev) | { [...] | /* Enable TX FIFO element to start transfer */ | m_can_write(cdev, M_CAN_TXBAR, (1 << putidx)); TX complete IRQ can come after this... | /* stop network queue if fifo full */ | if (m_can_tx_fifo_full(cdev) || This is a SPI transfer and may take a lot of time.... | m_can_next_echo_skb_occupied(dev, putidx)) | netif_stop_queue(dev); [...] | } | static void m_can_tx_work_queue(struct work_struct *ws) | { | struct m_can_classdev *cdev = container_of(ws, struct m_can_classdev, | tx_work); | | m_can_tx_handler(cdev); ... here the tx_skb is set to NULL | cdev->tx_skb = NULL; | } I've send a patch (compile time tested, only!) to fix the probem. > | /* Need to stop the queue to avoid numerous requests > | * from being sent. Suggested improvement is to create > | * a queueing mechanism that will queue the skbs and > | * process them in order. > | */ > | cdev->tx_skb = skb; > | netif_stop_queue(cdev->net); > | queue_work(cdev->tx_wq, &cdev->tx_work); > > > So - I'd like to fix this. The comment in the snippet above suggests a > queueing mechanism. It would be good to hear your take on this, Marc - > AFAIU you have written a similar mechanism for mcp251xfd. :) For easier queuing, get rid of the worker. Directly send the can_frame from the xmit handler. The problem is, you cannot sleep inside it. This means, you cannot use regmap(), the only thing that works is spi_async(). And as it's async, you can only write data via SPI, reading doesn't make sense. Have a look at the mcp251xfd_start_xmit() function. All data structures for the SPI messages are prepared in beforehand in mcp251xfd_tx_ring_init_tx_obj(). The xmit function looks up the correct data structure, converts the skb to the chip's format mcp251xfd_tx_obj_from_skb() and then sends the data over spi. Special care is taken of the handling of the head and tail pointers and the stopping of the queue to avoid race conditions - see mcp251xfd_get_tx_free(), mcp251xfd_tx_busy(), netif_stop_queue(), netif_wake_queue(). Hope that helps, Marc -- Pengutronix e.K. | Marc Kleine-Budde | Embedded Linux | https://www.pengutronix.de | Vertretung West/Dortmund | Phone: +49-231-2826-924 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |