* pcie dma transfer
@ 2018-06-04 11:12 Christoph Böhmwalder
  2018-06-04 12:05 ` Greg KH
  0 siblings, 1 reply; 3+ messages in thread
From: Christoph Böhmwalder @ 2018-06-04 11:12 UTC (permalink / raw)
  To: kernelnewbies

Hi,

I'm not sure how on-topic this is on this list, but I have a question
regarding a device driver design issue.

For our Bachelor's project my team and I are tasked to optimize an
existing hardware solution. The design utilizes an FPGA to accomplish
various tasks, including a Triple Speed Ethernet controller that is linked to
the CPU via PCI express. Currently the implementation is fairly naive,
and the driver just does byte-by-byte reads directly from a FIFO on the
FPGA device. This, of course, is quite resource-intensive and basically
hogs the CPU completely (causing throughput to peak at around
10 Mbit/s).

Our plan to solve this problem is as follows:

* Keep a buffer on the FPGA that retains a number of Ethernet packets.
* Once a certain threshold is reached (or a period of time, e.g. 5ms, elapses),
  the buffer is flushed and sent directly to RAM via DMA.
* When the buffer is flushed and the data is in RAM and accessible by
  the CPU, the device raises an interrupt, signalling the CPU to read
  the data.
* In the interrupt handler, we `memcpy` the individual packets to
  another buffer and hand them to the upper layer in the network stack.

Our rationale behind keeping a buffer of packets rather than just
transmitting a single packet is to maximize the amount of data sent
with each PCIe transaction (and in turn minimize the overhead).
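
To make this concrete, here's a rough sketch of the interrupt handler we have
in mind. Everything named fpga_* is a placeholder rather than actual code from
our driver, and the 16-bit length prefix is just an assumption about how we
would frame the packets inside the buffer:

#include <linux/interrupt.h>
#include <linux/io.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/skbuff.h>
#include <asm/unaligned.h>

/* made-up register offsets, not our real layout */
#define FPGA_BUF_USED	0x10	/* bytes the FPGA wrote into the DMA buffer */
#define FPGA_BUF_ACK	0x14	/* write 1: CPU is done, FPGA may reuse the buffer */

struct fpga_priv {
	struct net_device *netdev;
	void __iomem *regs;
	void *dma_buf;		/* coherent buffer the FPGA flushes into */
};

static irqreturn_t fpga_eth_isr(int irq, void *dev_id)
{
	struct fpga_priv *priv = dev_id;
	u32 used = readl(priv->regs + FPGA_BUF_USED);
	u32 off = 0;

	while (off + 2 <= used) {
		/* assumption: the FPGA prefixes each packet with its length */
		u16 len = get_unaligned_le16(priv->dma_buf + off);
		struct sk_buff *skb;

		if (off + 2 + len > used)
			break;		/* truncated entry, stop parsing */

		skb = netdev_alloc_skb(priv->netdev, len);
		if (!skb)
			break;

		skb_put_data(skb, priv->dma_buf + off + 2, len); /* the memcpy step */
		skb->protocol = eth_type_trans(skb, priv->netdev);
		netif_rx(skb);		/* hand the packet to the stack */

		off += 2 + len;
	}

	writel(1, priv->regs + FPGA_BUF_ACK);	/* let the FPGA reuse the buffer */
	return IRQ_HANDLED;
}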

However, upon reading a relevant LDD chapter [1] (which, admittedly, we
should have done in the first place), we found that the authors of the
book take a different approach:

> The second case comes about when DMA is used asynchronously. This happens,
> for example, with data acquisition devices that go on pushing data even if
> nobody is reading them. In this case, the driver should maintain a buffer so
> that a subsequent read call will return all the accumulated data to user
> space. The steps involved in this kind of transfer are slightly different:
>   1. The hardware raises an interrupt to announce that new data has arrived.
>   2. The interrupt handler allocates a buffer and tells the hardware where to transfer
>      its data.
>   3. The peripheral device writes the data to the buffer and raises another interrupt
>      when it's done.
>   4. The handler dispatches the new data, wakes any relevant process, and takes care
>      of housekeeping.
> A variant of the asynchronous approach is often seen with network cards. These
> cards often expect to see a circular buffer (often called a DMA ring buffer) established
> in memory shared with the processor; each incoming packet is placed in the
> next available buffer in the ring, and an interrupt is signaled. The driver then passes
> the network packets to the rest of the kernel and places a new DMA buffer in the
> ring.

Now, there are some obvious advantages to this method (not the least of
which is that it's much easier to implement), but I can't help feeling
that it would be a little inefficient.
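
If I understand the ring scheme correctly, the shared state would look roughly
like this (the field names and sizes are just my guess at the general shape;
nothing here comes from LDD or from our hardware):

#define RX_RING_SIZE	256
#define RX_BUF_SIZE	2048			/* room for one full Ethernet frame */

/* one slot of the ring shared with the device */
struct rx_desc {
	__le64 buf_addr;	/* DMA address of a pre-allocated packet buffer */
	__le16 len;		/* filled in by the device */
	__le16 flags;		/* e.g. a "done"/ownership bit set by the device */
} __packed;

struct rx_ring {
	struct rx_desc *desc;			/* from dma_alloc_coherent() */
	dma_addr_t desc_dma;			/* bus address the device is told about */
	struct sk_buff *skb[RX_RING_SIZE];	/* host-side bookkeeping, one per slot */
	unsigned int next_to_clean;		/* next slot the CPU will look at */
};

The device just fills the next free slot and raises an interrupt, and the
driver only has to walk the ring from next_to_clean and hand full buffers up,
which I suppose is where the simplicity comes from.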

So here's my question: Is our solution to this problem sane? Do you
think it would be viable, or would it create more issues than it
solves? Should we go the LDD route instead and allocate a new
buffer every time an interrupt is raised?

Thanks for your help!

--
Regards,
Christoph

[1] https://static.lwn.net/images/pdf/LDD3/ch15.pdf (page 30)


* pcie dma transfer
  2018-06-04 11:12 pcie dma transfer Christoph Böhmwalder
@ 2018-06-04 12:05 ` Greg KH
  2018-06-04 12:31   ` Christoph Böhmwalder
  0 siblings, 1 reply; 3+ messages in thread
From: Greg KH @ 2018-06-04 12:05 UTC (permalink / raw)
  To: kernelnewbies

On Mon, Jun 04, 2018 at 01:12:48PM +0200, Christoph Böhmwalder wrote:
> Hi,
> 
> I'm not sure how on-topic this is on this list, but I have a question
> regarding a device driver design issue.
> 
> For our Bachelor's project my team and I are tasked to optimize an
> existing hardware solution. The design utilizes an FPGA to accomplish
> various tasks, including a Triple Speed Ethernet controller that is linked to
> the CPU via PCI express. Currently the implementation is fairly naive,
> and the driver just does byte-by-byte reads directly from a FIFO on the
> FPGA device. This, of course, is quite resource-intensive and basically
> hogs the CPU completely (causing throughput to peak at around
> 10 Mbit/s).
> 
> Our plan to solve this problem is as follows:
> 
> * Keep a buffer on the FPGA that retains a number of Ethernet packets.
> * Once a certain threshold is reached (or a period of time, e.g. 5ms, elapses),
>   the buffer is flushed and sent directly to RAM via DMA.
> * When the buffer is flushed and the data is in RAM and accessible by
>   the CPU, the device raises an interrupt, signalling the CPU to read
>   the data.

The problem in this design might happen right here.  What happens
in the device between the interrupt being signaled, and the data being
copied out of the buffer?  Where do new packets go to?  How does the
device know it is "safe" to write new data to that memory?  That extra
housekeeping in the hardware gets very complex very quickly.

> * In the interrupt handler, we `memcpy` the individual packets to
>   another buffer and hand them to the upper layer in the network stack.

This all might work, if you have multiple buffers, as that is how some
drivers work.  Look at how the XHCI design is specified.  The spec is
open, and it gives you a very good description of how a relatively
high-speed PCIe device should work, with buffer management and the like.
You can probably use a lot of that type of design for your new work and
make things run a lot faster than what you currently have.

You also have access to loads of very high-speed drivers in Linux today,
to get design examples from.  Look at the networking drivers of the
10, 40, and 100Gb cards, as well as the InfiniBand drivers, and even some
of the PCIe flash block drivers.  Look at what the NVMe spec says for
how those types of high-speed storage devices should be designed for
other examples.

best of luck!

greg k-h


* pcie dma transfer
  2018-06-04 12:05 ` Greg KH
@ 2018-06-04 12:31   ` Christoph Böhmwalder
  0 siblings, 0 replies; 3+ messages in thread
From: Christoph Böhmwalder @ 2018-06-04 12:31 UTC (permalink / raw)
  To: kernelnewbies

On Mon, Jun 04, 2018 at 02:05:05PM +0200, Greg KH wrote:
> The problem in this design might happen right here.  What happens
> in the device between the interrupt being signaled, and the data being
> copied out of the buffer?  Where do new packets go to?  How does the
> device know it is "safe" to write new data to that memory?  That extra
> housekeeping in the hardware gets very complex very quickly.

That's one of our concerns as well; our solution seems to be way too
complex (for example, we still need a way to parse the individual
packets out of the buffer). I think we should focus more on KISS going
forward.

> This all might work, if you have multiple buffers, as that is how some
> drivers work.  Look at how the XHCI design is specified.  The spec is
> open, and it gives you a very good description of how a relatively
> high-speed PCIe device should work, with buffer management and the like.
> You can probably use a lot of that type of design for your new work and
> make things run a lot faster than what you currently have.
> 
> You also have access to loads of very high-speed drivers in Linux today,
> to get design examples from.  Look at the networking drivers of the
> 10, 40, and 100Gb cards, as well as the InfiniBand drivers, and even some
> of the PCIe flash block drivers.  Look at what the NVMe spec says for
> how those types of high-speed storage devices should be designed for
> other examples.

Thanks for the pointer, I will take a look at that.

For now, we're looking at implementing a solution using the ring buffer
method described in LDD, since that seems quite reasonable. That may all
change once we research the other drivers a bit more, though.
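
Roughly what we have in mind for the receive path, to be called from our
interrupt handler and reusing the descriptor layout I sketched in my first
mail. All names (my_priv, DESC_DONE, ...) are placeholders, and error handling
such as dma_mapping_error() is left out:

/* plus #include <linux/dma-mapping.h> and <linux/bits.h> on top of the earlier includes */

#define DESC_DONE	BIT(0)		/* set by the device when a slot holds a packet */

struct my_priv {
	struct net_device *netdev;
	struct device *dev;		/* &pdev->dev, for the DMA API */
	struct rx_ring rx_ring;
};

static void my_rx_clean(struct my_priv *priv)
{
	struct rx_ring *ring = &priv->rx_ring;

	for (;;) {
		unsigned int i = ring->next_to_clean;
		struct rx_desc *desc = &ring->desc[i];
		struct sk_buff *skb, *new_skb;

		if (!(le16_to_cpu(desc->flags) & DESC_DONE))
			break;		/* device hasn't filled this slot yet */
		dma_rmb();		/* read the rest of the descriptor after the done bit */

		new_skb = netdev_alloc_skb(priv->netdev, RX_BUF_SIZE);
		if (new_skb) {
			/* hand the filled buffer up and install the fresh one */
			skb = ring->skb[i];
			dma_unmap_single(priv->dev, le64_to_cpu(desc->buf_addr),
					 RX_BUF_SIZE, DMA_FROM_DEVICE);
			skb_put(skb, le16_to_cpu(desc->len));
			skb->protocol = eth_type_trans(skb, priv->netdev);
			netif_rx(skb);

			ring->skb[i] = new_skb;
			desc->buf_addr = cpu_to_le64(dma_map_single(priv->dev,
						new_skb->data, RX_BUF_SIZE,
						DMA_FROM_DEVICE));
		}
		/* else: drop this packet and let the device refill the old buffer */

		desc->len = 0;
		desc->flags = 0;	/* ownership goes back to the device */
		ring->next_to_clean = (i + 1) % RX_RING_SIZE;
	}
}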

Thanks for your help and (as always) quick response time!

--
Regards,
Christoph

