* [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-07 14:17 UTC

From: Jesper Dangaard Brouer
To: lsf, linux-mm
Cc: brouer, James Bottomley, netdev, Tom Herbert, Alexei Starovoitov, Brenden Blanco, lsf-pc

(Topic proposal for MM-summit)

Network Interface Card (NIC) drivers, and increasing link speeds, stress the page allocator (and the DMA APIs). A number of driver-specific, open-coded approaches exist that work around these bottlenecks in the page allocator and DMA APIs, e.g. open-coded recycle mechanisms, and allocating larger pages and handing out page "fragments".

I'm proposing a generic page-pool recycle facility that can cover the driver use-cases, increase performance, and open up for zero-copy RX.

The basic performance problem is that pages (containing packets at RX) are cycled through the page allocator (freed at TX DMA completion time). A system in steady state could avoid calling the page allocator entirely, by keeping a pool of pages equal to the size of the RX ring plus the number of outstanding frames in the TX ring (waiting for DMA completion).

The motivation for quick page recycling came primarily from performance. But returning pages to the same pool also benefits other use-cases: if a NIC HW RX ring is strictly bound (e.g. to a process or a guest/KVM), then pages can be shared/mmap'ed (RX zero-copy), as information leaking does not occur. (Obviously, for this use-case, pages need to be zeroed out when added to the pool.)

The motivation behind implementing this (extremely fast) page-pool is that we need it as a building block in the network stack, but hopefully other areas can also benefit from it.
[Resources/Links]:

It is specifically related to what Facebook calls XDP (eXpress Data Path):
 * https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf
 * RFC patchset thread: http://thread.gmane.org/gmane.linux.network/406288

And what I call the "packet-page" level:
 * BoF on kernel network performance: http://lwn.net/Articles/676806/
 * http://people.netfilter.org/hawk/presentations/NetDev1.1_2016/links.html

See you soon at the LSF/MM-summit :-)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/
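The steady-state sizing argument above can be sketched in userspace C. This is a minimal illustration, not the proposed kernel implementation: all names (`pp_alloc`, `pp_free`, `struct pp_page`) are hypothetical, and `malloc` stands in for the page allocator and DMA mapping.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a DMA-mapped RX page. */
struct pp_page {
    void *data;
};

/* A trivial LIFO pool sized to cover the RX ring plus the frames
 * that may sit in the TX ring awaiting DMA completion. */
struct page_pool {
    struct pp_page **stack;
    unsigned int top;       /* number of pages currently pooled */
    unsigned int capacity;  /* rx_ring_size + max outstanding TX frames */
};

static struct page_pool *pp_create(unsigned int rx_ring, unsigned int tx_max)
{
    struct page_pool *pool = malloc(sizeof(*pool));
    pool->capacity = rx_ring + tx_max;
    pool->top = 0;
    pool->stack = calloc(pool->capacity, sizeof(*pool->stack));
    return pool;
}

/* Allocation hits the real allocator only when the pool is empty. */
static struct pp_page *pp_alloc(struct page_pool *pool)
{
    if (pool->top > 0)
        return pool->stack[--pool->top];   /* recycle: no allocator call */
    struct pp_page *p = malloc(sizeof(*p));
    p->data = malloc(4096);
    return p;
}

/* Called at TX DMA completion time: return the page to the pool. */
static void pp_free(struct page_pool *pool, struct pp_page *p)
{
    if (pool->top < pool->capacity) {
        pool->stack[pool->top++] = p;      /* recycled */
    } else {
        free(p->data);                     /* overflow: back to allocator */
        free(p);
    }
}
```

In steady state, with the pool sized as above, every `pp_alloc()` is served from the stack and the backing allocator is never entered on the packet fast path.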
* Re: [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-07 14:38 UTC

From: Christoph Hellwig
To: Jesper Dangaard Brouer
Cc: lsf, linux-mm, netdev, Brenden Blanco, James Bottomley, Tom Herbert, lsf-pc, Alexei Starovoitov

This is also very interesting for storage targets, which face the same issue. SCST has a mode where it caches some fully constructed SGLs, which is probably very similar to what NICs want to do.
* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-07 15:11 UTC

From: Bart Van Assche
To: Christoph Hellwig, Jesper Dangaard Brouer
Cc: James Bottomley, Tom Herbert, Brenden Blanco, lsf, linux-mm, netdev, lsf-pc, Alexei Starovoitov

On 04/07/16 07:38, Christoph Hellwig wrote:
> This is also very interesting for storage targets, which face the same
> issue. SCST has a mode where it caches some fully constructed SGLs,
> which is probably very similar to what NICs want to do.

I think a cached allocator for page sets + the scatterlists that describe these page sets would not only be useful for SCSI target implementations but also for the Linux SCSI initiator. Today the scsi-mq code reserves space in each scsi_cmnd for a scatterlist of SCSI_MAX_SG_SEGMENTS. If scatterlists were cached together with page sets, less memory would be needed per scsi_cmnd. See also scsi_mq_setup_tags() and scsi_alloc_sgtable().

Bart.
* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-10 18:45 UTC

From: Sagi Grimberg
To: Bart Van Assche, Christoph Hellwig, Jesper Dangaard Brouer
Cc: lsf, Tom Herbert, Brenden Blanco, James Bottomley, linux-mm, netdev, lsf-pc, Alexei Starovoitov

>> This is also very interesting for storage targets, which face the same
>> issue. SCST has a mode where it caches some fully constructed SGLs,
>> which is probably very similar to what NICs want to do.
>
> I think a cached allocator for page sets + the scatterlists that
> describe these page sets would not only be useful for SCSI target
> implementations but also for the Linux SCSI initiator. Today the scsi-mq
> code reserves space in each scsi_cmnd for a scatterlist of
> SCSI_MAX_SG_SEGMENTS. If scatterlists were cached together with page
> sets, less memory would be needed per scsi_cmnd.

If we go down this road, how about also attaching some driver opaques to the page sets? I know of some drivers that can make good use of those ;)
* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-11 21:41 UTC

From: Jesper Dangaard Brouer
To: Sagi Grimberg
Cc: Bart Van Assche, Christoph Hellwig, James Bottomley, Tom Herbert, Brenden Blanco, lsf, linux-mm, netdev, lsf-pc, Alexei Starovoitov, brouer

On Sun, 10 Apr 2016 21:45:47 +0300 Sagi Grimberg <sagi@grimberg.me> wrote:

[...]

> If we go down this road, how about also attaching some driver opaques
> to the page sets?

That was the ultimate plan... to leave some opaque bytes in the page struct that drivers could use.

In struct page I would need a pointer back to my page_pool struct, and a page flag. Then I would need room to store the dma_unmap address. (And some of the usual fields are still needed, like the refcnt, plus reusing some of the list constructs.) And a zero-copy cross-domain id.

For my packet-page idea, I would also need a packet length and an offset where data starts (I can derive the "head-room" for encap from these two).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
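The fields Jesper enumerates could look roughly as follows. This is a hypothetical userspace layout for illustration only; the actual proposal would overlay such fields onto struct page, and every name here is invented.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t dma_addr_t;        /* stand-in for the kernel type */

struct page_pool;                   /* pool this page recycles back into */

/* Hypothetical per-page metadata for a pooled packet-page. */
struct pp_page_meta {
    struct page_pool *pool;         /* back-pointer to the owning pool */
    unsigned long flags;            /* would carry a "pooled" page flag */
    dma_addr_t dma_addr;            /* stored so unmap needs no lookup */
    int refcnt;                     /* the usual reference count */
    struct pp_page_meta *next;      /* reuse of the list constructs */
    uint32_t zc_domain;             /* zero-copy cross-domain id */
    uint16_t offset;                /* where packet data starts */
    uint16_t len;                   /* packet length; head-room for
                                     * encap is derivable from offset */
};
```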
* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-11 22:02 UTC

From: Alexander Duyck
To: Jesper Dangaard Brouer
Cc: Sagi Grimberg, Bart Van Assche, Christoph Hellwig, James Bottomley, Tom Herbert, Brenden Blanco, lsf, linux-mm, netdev, lsf-pc, Alexei Starovoitov

On Mon, Apr 11, 2016 at 2:41 PM, Jesper Dangaard Brouer <brouer@redhat.com> wrote:

[...]

> That was the ultimate plan... to leave some opaque bytes in the
> page struct that drivers could use.
>
> In struct page I would need a pointer back to my page_pool struct, and a
> page flag. Then I would need room to store the dma_unmap address.
> (And some of the usual fields are still needed, like the refcnt,
> plus reusing some of the list constructs.) And a zero-copy cross-domain
> id.
>
> For my packet-page idea, I would also need a packet length and an offset
> where data starts (I can derive the "head-room" for encap from these
> two).

Have you taken a look at possibly trying to optimize the DMA pool API to work with pages? It sounds like it is supposed to do something similar to what you are wanting to do.

- Alex
* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-12 6:28 UTC

From: Jesper Dangaard Brouer
To: Alexander Duyck
Cc: lsf, James Bottomley, Sagi Grimberg, Tom Herbert, Brenden Blanco, Christoph Hellwig, linux-mm, netdev, Bart Van Assche, lsf-pc, Alexei Starovoitov, brouer

On Mon, 11 Apr 2016 15:02:51 -0700 Alexander Duyck <alexander.duyck@gmail.com> wrote:

> Have you taken a look at possibly trying to optimize the DMA pool API
> to work with pages? It sounds like it is supposed to do something
> similar to what you are wanting to do.

Yes, I have looked at the mm/dmapool.c API. AFAIK this is for DMA-coherent memory (see the use of dma_alloc_coherent/dma_free_coherent).

What we are doing is "streaming" DMA memory, when processing the RX ring.

(NICs only use DMA-coherent memory for the descriptors, which are allocated at driver init.)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-12 15:37 UTC

From: Alexander Duyck
To: Jesper Dangaard Brouer
Cc: lsf, James Bottomley, Sagi Grimberg, Tom Herbert, Brenden Blanco, Christoph Hellwig, linux-mm, netdev, Bart Van Assche, lsf-pc, Alexei Starovoitov

On Mon, Apr 11, 2016 at 11:28 PM, Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> Yes, I have looked at the mm/dmapool.c API. AFAIK this is for DMA-coherent
> memory (see the use of dma_alloc_coherent/dma_free_coherent).
>
> What we are doing is "streaming" DMA memory, when processing the RX
> ring.
>
> (NICs only use DMA-coherent memory for the descriptors, which are
> allocated at driver init.)

Yes, I know that, but it shouldn't take much to extend the API to provide the option of a streaming DMA mapping. That is why I thought you might want to look in this direction.

- Alex
* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-11 22:21 UTC

From: Alexei Starovoitov
To: Jesper Dangaard Brouer
Cc: Sagi Grimberg, Bart Van Assche, Christoph Hellwig, James Bottomley, Tom Herbert, Brenden Blanco, lsf, linux-mm, netdev, lsf-pc

On Mon, Apr 11, 2016 at 11:41:57PM +0200, Jesper Dangaard Brouer wrote:

[...]

> That was the ultimate plan... to leave some opaque bytes in the
> page struct that drivers could use.
>
> In struct page I would need a pointer back to my page_pool struct, and a
> page flag. Then I would need room to store the dma_unmap address.
> (And some of the usual fields are still needed, like the refcnt,
> plus reusing some of the list constructs.) And a zero-copy cross-domain
> id.

I don't think we need to add anything to struct page. This is supposed to be a small cache of dma-mapped pages with lockless access. It can be implemented as an array or linked list where every element is a dma_addr plus a pointer to the page. If it is full, dma_unmap_page()+put_page() sends the page back to the page allocator.
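Alexei's array variant can be sketched in userspace C. All names here are hypothetical, and `malloc`/`free` stand in for page allocation and for the dma_unmap_page()+put_page() overflow path:

```c
#include <stdlib.h>
#include <stdint.h>

typedef uint64_t dma_addr_t;        /* stand-in for the kernel type */

/* One cached element: a DMA address plus the page it maps. */
struct pp_elem {
    dma_addr_t dma;
    void *page;
};

#define POOL_SIZE 64

/* Small fixed-size cache of dma-mapped pages; no struct page changes. */
struct pp_cache {
    struct pp_elem elems[POOL_SIZE];
    unsigned int count;
};

/* Returns 1 if the page was kept (mapping preserved), 0 if released. */
static int pp_cache_put(struct pp_cache *c, void *page, dma_addr_t dma)
{
    if (c->count < POOL_SIZE) {
        c->elems[c->count].page = page;
        c->elems[c->count].dma = dma;
        c->count++;
        return 1;
    }
    /* Full: the kernel version would dma_unmap_page() + put_page(). */
    free(page);
    return 0;
}

/* Pop a page and its mapping; NULL means fall back to the allocator. */
static void *pp_cache_get(struct pp_cache *c, dma_addr_t *dma)
{
    if (c->count == 0)
        return NULL;
    c->count--;
    *dma = c->elems[c->count].dma;
    return c->elems[c->count].page;
}
```

The point of caching the dma_addr alongside the page is that a recycled page re-enters the RX ring without a fresh dma_map_page() call.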
* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-12 6:16 UTC

From: Jesper Dangaard Brouer
To: Alexei Starovoitov
Cc: lsf, James Bottomley, Sagi Grimberg, Tom Herbert, Brenden Blanco, Christoph Hellwig, linux-mm, netdev, Bart Van Assche, lsf-pc, brouer

On Mon, 11 Apr 2016 15:21:26 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

[...]

> I don't think we need to add anything to struct page.
> This is supposed to be a small cache of dma-mapped pages with lockless access.
> It can be implemented as an array or linked list where every element
> is a dma_addr plus a pointer to the page. If it is full, dma_unmap_page()+
> put_page() sends the page back to the page allocator.

It sounds like the Intel drivers' recycle facility, where they split the page into two parts, and keep the page in the RX ring by swapping to the other half of the page if page_count(page) <= 2. Thus, they use the atomic page refcount to synchronize on.

That way, we end up with two atomic operations per RX packet on the page refcnt, where DPDK has zero... by fully taking over the page as an allocator, almost like slab. I can optimize the common case (of the packet-page getting allocated and freed on the same CPU), and remove these atomic operations.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-12 17:20 UTC

From: Alexei Starovoitov
To: Jesper Dangaard Brouer
Cc: lsf, James Bottomley, Sagi Grimberg, Tom Herbert, Brenden Blanco, Christoph Hellwig, linux-mm, netdev, Bart Van Assche, lsf-pc

On Tue, Apr 12, 2016 at 08:16:49AM +0200, Jesper Dangaard Brouer wrote:

[...]

> It sounds like the Intel drivers' recycle facility, where they split the
> page into two parts, and keep the page in the RX ring by swapping to the
> other half of the page if page_count(page) <= 2. Thus, they use the
> atomic page refcount to synchronize on.

Actually, I'm proposing the opposite: one page = one packet. I'm perfectly happy to waste half a page, since the number of such pages is small and performance matters more. A typical performance-vs-memory tradeoff.

> That way, we end up with two atomic operations per RX packet on the page
> refcnt, where DPDK has zero...

The page recycling cache should have zero atomic ops per packet, otherwise it's a non-starter.

> By fully taking over the page as an allocator, almost like slab. I can
> optimize the common case (of the packet-page getting allocated and
> freed on the same CPU), and remove these atomic operations.

slub is doing a local cmpxchg. 40G networking cannot afford that per packet. If it's amortized due to batching, that will be ok.
* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-07 15:48 UTC

From: Chuck Lever
To: Christoph Hellwig
Cc: Jesper Dangaard Brouer, James Bottomley, Tom Herbert, Brenden Blanco, lsf, linux-mm, netdev, lsf-pc, Alexei Starovoitov

> On Apr 7, 2016, at 7:38 AM, Christoph Hellwig <hch@infradead.org> wrote:
>
> This is also very interesting for storage targets, which face the same
> issue. SCST has a mode where it caches some fully constructed SGLs,
> which is probably very similar to what NICs want to do.

+1 for NFS server.

-- 
Chuck Lever
* Re: [Lsf-pc] [Lsf] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-07 16:14 UTC

From: Rik van Riel
To: Chuck Lever, Christoph Hellwig
Cc: lsf, Tom Herbert, Brenden Blanco, James Bottomley, linux-mm, netdev, Jesper Dangaard Brouer, lsf-pc, Alexei Starovoitov

On Thu, 2016-04-07 at 08:48 -0700, Chuck Lever wrote:
> > On Apr 7, 2016, at 7:38 AM, Christoph Hellwig <hch@infradead.org>
> > wrote:
> >
> > This is also very interesting for storage targets, which face the
> > same issue. SCST has a mode where it caches some fully constructed
> > SGLs, which is probably very similar to what NICs want to do.
>
> +1 for NFS server.

I have swapped around my slot (into the MM track) with Jesper's slot (now a plenary session), since there seems to be a fair amount of interest in Jesper's proposal from IO and FS people, and my topic is more MM-specific.

-- 
All Rights Reversed.
* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-07 19:43 UTC

From: Jesper Dangaard Brouer
To: Rik van Riel
Cc: Chuck Lever, Christoph Hellwig, James Bottomley, Tom Herbert, Brenden Blanco, lsf, linux-mm, netdev, lsf-pc, Alexei Starovoitov, brouer

On Thu, 07 Apr 2016 12:14:00 -0400 Rik van Riel <riel@redhat.com> wrote:

> I have swapped around my slot (into the MM track) with Jesper's slot
> (now a plenary session), since there seems to be a fair amount of
> interest in Jesper's proposal from IO and FS people, and my topic is
> more MM-specific.

Wow - I'm impressed. I didn't expect such a good slot! Glad to see the interest! Thanks!

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
* Re: [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-07 15:18 UTC

From: Eric Dumazet
To: Jesper Dangaard Brouer
Cc: lsf, linux-mm, James Bottomley, netdev, Tom Herbert, Alexei Starovoitov, Brenden Blanco, lsf-pc

On Thu, 2016-04-07 at 16:17 +0200, Jesper Dangaard Brouer wrote:
> (Topic proposal for MM-summit)
>
> Network Interface Card (NIC) drivers, and increasing link speeds, stress
> the page allocator (and the DMA APIs). A number of driver-specific,
> open-coded approaches exist that work around these bottlenecks in the
> page allocator and DMA APIs, e.g. open-coded recycle mechanisms, and
> allocating larger pages and handing out page "fragments".
>
> I'm proposing a generic page-pool recycle facility that can cover the
> driver use-cases, increase performance, and open up for zero-copy RX.
>
> The basic performance problem is that pages (containing packets at RX)
> are cycled through the page allocator (freed at TX DMA completion
> time). A system in steady state could avoid calling the page allocator
> entirely, by keeping a pool of pages equal to the size of the RX ring
> plus the number of outstanding frames in the TX ring (waiting for
> DMA completion).

We certainly used this at Google for quite a while.

The thing is: in steady state, the number of pages being 'in tx queues' is lower than the number of pages that were allocated for RX queues. The page allocator is hardly hit, once you have big enough RX ring buffers. (Nothing fancy, simply the default number of slots.)

The 'hard coded' code is quite small actually:

	if (page_count(page) != 1) {
		/* Free the page and allocate another one,
		 * since we are not the exclusive owner.
		 * Prefer __GFP_COLD pages btw.
		 */
	}
	page_ref_inc(page);

The problem with a 'pool' is that it matches a router workload, not a host one. With the existing code, new pages are automatically allocated on demand, if, say, previous pages are still used by skbs stored in socket receive queues and consumers are slow to react to the presence of this data. But in most cases (steady state), the refcount on the page is released by the application reading the data before the driver has cycled through the RX ring buffer, and the driver only increments the page count.

I also played with grouping pages into the same 2MB pages, but got mixed results.
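Eric's check can be modeled in userspace C. This is only a sketch of the idea: a plain int stands in for the atomic page refcount, `malloc` for alloc_page(__GFP_COLD), and all function names are invented.

```c
#include <stdlib.h>

/* Minimal model of a page with a reference count. */
struct fake_page {
    int refcount;
};

/* Recycle the RX page only when the driver is the exclusive owner;
 * otherwise drop our reference and allocate a fresh page. */
static struct fake_page *rx_reuse_or_alloc(struct fake_page *page)
{
    if (page->refcount != 1) {
        /* Someone else (e.g. an skb sitting in a socket receive
         * queue) still holds the page: it cannot be reused yet. */
        page->refcount--;                 /* release our reference */
        page = malloc(sizeof(*page));     /* kernel: alloc a cold page */
        page->refcount = 1;
    }
    page->refcount++;                     /* kernel: page_ref_inc() */
    return page;
}
```

In steady state the application has already dropped its reference by the time the driver revisits the RX slot, so the exclusive-owner branch is the common one and the page is reused in place.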
* Re: [Lsf] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-09 9:11 UTC

From: Jesper Dangaard Brouer
To: Eric Dumazet
Cc: James Bottomley, Tom Herbert, Brenden Blanco, lsf, linux-mm, netdev, lsf-pc, Alexei Starovoitov, brouer

Hi Eric,

On Thu, 07 Apr 2016 08:18:29 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote:

[...]

> We certainly used this at Google for quite a while.
>
> The thing is: in steady state, the number of pages being 'in tx queues'
> is lower than the number of pages that were allocated for RX queues.

That was also my expectation; thanks for confirming it.

> The page allocator is hardly hit, once you have big enough RX ring
> buffers. (Nothing fancy, simply the default number of slots.)
>
> The 'hard coded' code is quite small actually:
>
>	if (page_count(page) != 1) {
>		/* Free the page and allocate another one,
>		 * since we are not the exclusive owner.
>		 * Prefer __GFP_COLD pages btw.
>		 */
>	}
>	page_ref_inc(page);

The above code is okay. But do you think we can also get away with the same trick we do with the SKB refcnt, where we avoid an atomic operation if refcnt == 1?

void kfree_skb(struct sk_buff *skb)
{
	if (unlikely(!skb))
		return;
	if (likely(atomic_read(&skb->users) == 1))
		smp_rmb();
	else if (likely(!atomic_dec_and_test(&skb->users)))
		return;
	trace_kfree_skb(skb, __builtin_return_address(0));
	__kfree_skb(skb);
}
EXPORT_SYMBOL(kfree_skb);

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
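The trick Jesper refers to can be modeled in userspace with C11 atomics. This is a hedged sketch of the general refcount fast path, with invented names; whether pages can safely use it is exactly what the follow-up messages dispute.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Model object with a user refcount, like skb->users. */
struct obj {
    atomic_int users;
};

/* Returns true when the caller should actually free the object.
 * Fast path: a lone owner observes users == 1 and skips the
 * locked read-modify-write entirely. */
static bool release_ref(struct obj *o)
{
    if (atomic_load_explicit(&o->users, memory_order_acquire) == 1)
        return true;            /* sole owner: no atomic dec needed */
    /* Shared: fall back to the usual atomic decrement-and-test. */
    return atomic_fetch_sub(&o->users, 1) == 1;
}
```

The fast path is sound only when nobody can concurrently take a new reference on an object whose count has reached 1, which holds for skbs with a single owner; the ec916983... commit Eric cites is precisely about pages violating that assumption via get_page_unless_zero().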
* Re: [Lsf] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-09 12:34 UTC

From: Eric Dumazet
To: Jesper Dangaard Brouer
Cc: James Bottomley, Tom Herbert, Brenden Blanco, lsf, linux-mm, netdev, lsf-pc, Alexei Starovoitov

On Sat, 2016-04-09 at 11:11 +0200, Jesper Dangaard Brouer wrote:

> The above code is okay. But do you think we can also get away with the
> same trick we do with the SKB refcnt, where we avoid an atomic
> operation if refcnt == 1?
[...]

No, we cannot use this trick for pages:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=ec91698360b3818ff426488a1529811f7a7ab87f
* Re: [Lsf] [LSF/MM TOPIC] Generic page-pool recycle facility? 2016-04-09 12:34 ` Eric Dumazet @ 2016-04-11 20:23 ` Jesper Dangaard Brouer 2016-04-11 21:27 ` Eric Dumazet 0 siblings, 1 reply; 35+ messages in thread From: Jesper Dangaard Brouer @ 2016-04-11 20:23 UTC (permalink / raw) To: Eric Dumazet Cc: lsf, Tom Herbert, Brenden Blanco, James Bottomley, linux-mm, netdev, lsf-pc, Alexei Starovoitov, brouer On Sat, 09 Apr 2016 05:34:38 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Sat, 2016-04-09 at 11:11 +0200, Jesper Dangaard Brouer wrote: > > > Above code is okay. But do you think we also can get away with the same > > trick we do with the SKB refcnf? Where we avoid an atomic operation if > > refcnt==1. > > > > void kfree_skb(struct sk_buff *skb) > > { > > if (unlikely(!skb)) > > return; > > if (likely(atomic_read(&skb->users) == 1)) > > smp_rmb(); > > else if (likely(!atomic_dec_and_test(&skb->users))) > > return; > > trace_kfree_skb(skb, __builtin_return_address(0)); > > __kfree_skb(skb); > > } > > EXPORT_SYMBOL(kfree_skb); > > No we can not use this trick this for pages : > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=ec91698360b3818ff426488a1529811f7a7ab87f > If we have a page-pool recycle facility, then we could use the trick, right? (As we know that get_page_unless_zero() cannot happen for pages in the pool). -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Lsf] [LSF/MM TOPIC] Generic page-pool recycle facility? 2016-04-11 20:23 ` Jesper Dangaard Brouer @ 2016-04-11 21:27 ` Eric Dumazet 0 siblings, 0 replies; 35+ messages in thread From: Eric Dumazet @ 2016-04-11 21:27 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: lsf, Tom Herbert, Brenden Blanco, James Bottomley, linux-mm, netdev, lsf-pc, Alexei Starovoitov On Mon, 2016-04-11 at 22:23 +0200, Jesper Dangaard Brouer wrote: > If we have a page-pool recycle facility, then we could use the trick, > right? (As we know that get_page_unless_zero() cannot happen for pages > in the pool). Well, if you disable everything that could possibly use get_page_unless_zero(), I guess this could work. But then, you'll have to watch lkml traffic forever to make sure no new feature is added in the kernel, using get_page_unless_zero() in a new clever way. You could use a page flag so that a BUG() triggers if get_page_unless_zero() is attempted on one of your precious pages ;) We had very subtle issues before my fixes (check 35b7a1915aa33da812074744647db0d9262a555c and children), so I would not waste time on the lock prefix avoidance at this point. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Lsf] [LSF/MM TOPIC] Generic page-pool recycle facility? 2016-04-07 14:17 ` [LSF/MM TOPIC] Generic page-pool recycle facility? Jesper Dangaard Brouer 2016-04-07 14:38 ` [Lsf-pc] " Christoph Hellwig 2016-04-07 15:18 ` Eric Dumazet @ 2016-04-07 19:48 ` Waskiewicz, PJ 2016-04-07 20:38 ` Jesper Dangaard Brouer 2016-04-11 8:58 ` [Lsf-pc] " Mel Gorman 3 siblings, 1 reply; 35+ messages in thread From: Waskiewicz, PJ @ 2016-04-07 19:48 UTC (permalink / raw) To: lsf, linux-mm, brouer Cc: netdev, bblanco, alexei.starovoitov, James.Bottomley, tom, lsf-pc On Thu, 2016-04-07 at 16:17 +0200, Jesper Dangaard Brouer wrote: > (Topic proposal for MM-summit) > > Network Interface Cards (NIC) drivers, and increasing speeds stress > the page-allocator (and DMA APIs). A number of driver specific > open-coded approaches exists that work-around these bottlenecks in > the > page allocator and DMA APIs. E.g. open-coded recycle mechanisms, and > allocating larger pages and handing-out page "fragments". > > I'm proposing a generic page-pool recycle facility, that can cover > the > driver use-cases, increase performance and open up for zero-copy RX. Is this based on the page recycle stuff from ixgbe that used to be in the driver? If so I'd really like to be part of the discussion. -PJ -- PJ Waskiewicz Principal Engineer, NetApp e: pj.waskiewicz@netapp.com d: 503.961.3705 ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Lsf] [LSF/MM TOPIC] Generic page-pool recycle facility? 2016-04-07 19:48 ` Waskiewicz, PJ @ 2016-04-07 20:38 ` Jesper Dangaard Brouer 2016-04-08 16:12 ` Alexander Duyck 0 siblings, 1 reply; 35+ messages in thread From: Jesper Dangaard Brouer @ 2016-04-07 20:38 UTC (permalink / raw) To: Waskiewicz, PJ Cc: lsf, linux-mm, netdev, bblanco, alexei.starovoitov, James.Bottomley, tom, lsf-pc, brouer On Thu, 7 Apr 2016 19:48:50 +0000 "Waskiewicz, PJ" <PJ.Waskiewicz@netapp.com> wrote: > On Thu, 2016-04-07 at 16:17 +0200, Jesper Dangaard Brouer wrote: > > (Topic proposal for MM-summit) > > > > Network Interface Cards (NIC) drivers, and increasing speeds stress > > the page-allocator (and DMA APIs). A number of driver specific > > open-coded approaches exists that work-around these bottlenecks in > > the > > page allocator and DMA APIs. E.g. open-coded recycle mechanisms, and > > allocating larger pages and handing-out page "fragments". > > > > I'm proposing a generic page-pool recycle facility, that can cover > > the > > driver use-cases, increase performance and open up for zero-copy RX. > > Is this based on the page recycle stuff from ixgbe that used to be in > the driver? If so I'd really like to be part of the discussion. Okay, so it is not part of the driver any longer? I've studied the current ixgbe driver (and other NIC drivers) closely. Do you have some code pointers to this older code? The likely-fastest recycle code I've seen is in the bnx2x driver. If you are interested see: bnx2x_reuse_rx_data(). Again, it is a bit of an open-coded producer/consumer ring queue (which it would also be nice to clean up). To amortize the cost of allocating a single page, most other drivers use the trick of allocating a larger (compound) page, and partitioning this page into smaller "fragments", which also amortizes the cost of dma_map/unmap (important on non-x86). 
This is actually problematic performance-wise, because the packet-data (in these page fragments) only gets DMA_sync'ed, and is thus considered "read-only". As the netstack needs to write packet headers, yet another (writable) memory area is allocated per packet (plus the SKB meta-data struct). -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Lsf] [LSF/MM TOPIC] Generic page-pool recycle facility? 2016-04-07 20:38 ` Jesper Dangaard Brouer @ 2016-04-08 16:12 ` Alexander Duyck 0 siblings, 0 replies; 35+ messages in thread From: Alexander Duyck @ 2016-04-08 16:12 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Waskiewicz, PJ, lsf, linux-mm, netdev, bblanco, alexei.starovoitov, James.Bottomley@HansenPartnership.com, tom, lsf-pc On Thu, Apr 7, 2016 at 1:38 PM, Jesper Dangaard Brouer <brouer@redhat.com> wrote: > On Thu, 7 Apr 2016 19:48:50 +0000 > "Waskiewicz, PJ" <PJ.Waskiewicz@netapp.com> wrote: > >> On Thu, 2016-04-07 at 16:17 +0200, Jesper Dangaard Brouer wrote: >> > (Topic proposal for MM-summit) >> > >> > Network Interface Cards (NIC) drivers, and increasing speeds stress >> > the page-allocator (and DMA APIs). A number of driver specific >> > open-coded approaches exists that work-around these bottlenecks in >> > the >> > page allocator and DMA APIs. E.g. open-coded recycle mechanisms, and >> > allocating larger pages and handing-out page "fragments". >> > >> > I'm proposing a generic page-pool recycle facility, that can cover >> > the >> > driver use-cases, increase performance and open up for zero-copy RX. >> >> Is this based on the page recycle stuff from ixgbe that used to be in >> the driver? If so I'd really like to be part of the discussion. > > Okay, so it is not part of the driver any-longer? I've studied the > current ixgbe driver (and other NIC drivers) closely. Do you have some > code pointers, to this older code? No, it is still in the driver. I think when PJ said "used to" he was referring to the fact that the code was present in the driver back when he was working on it at Intel. You have to realize that the page reuse code has been in the Intel drivers for a long time. 
I think I introduced it originally on igb in July of 2008 as page recycling, commit bf36c1a0040c ("igb: add page recycling support"), and it was copied over to ixgbe in September, commit 762f4c571058 ("ixgbe: recycle pages in packet split mode"). > The likely-fastest recycle code I've see is in the bnx2x driver. If > you are interested see: bnx2x_reuse_rx_data(). Again is it a bit > open-coded produce/consumer ring queue (which would be nice to also > cleanup). Yeah, that is essentially the same kind of code we have in ixgbe_reuse_rx_page(). From what I can tell though the bnx2x doesn't actually reuse the buffers in the common case. That function is only called in the copy-break and error cases to recycle the buffer so that it doesn't have to be freed. > To amortize the cost of allocating a single page, most other drivers > use the trick of allocating a larger (compound) page, and partition > this page into smaller "fragments". Which also amortize the cost of > dma_map/unmap (important on non-x86). Right. The only reason why I went the reuse route instead of the compound page route is that I had speculated that you could still bottleneck yourself since the issue I was trying to avoid was the dma_map call hitting a global lock in IOMMU enabled systems. With the larger page route I could at best reduce the number of map calls to 1/16 or 1/32 of what it was. By doing the page reuse I actually bring it down to something approaching 0 as long as the buffers are being freed in a reasonable timeframe. This way the code would scale so I wouldn't have to worry about how many rings were active at the same time. As PJ can attest we even saw bugs where the page reuse actually was too effective in some cases leading to us carrying memory from one node to another when the interrupt was migrated. That was why we had to add the code to force us to free the page if it came from another node. 
> This is actually problematic performance wise, because packet-data > (in these page fragments) only get DMA_sync'ed, and is thus considered > "read-only". As netstack need to write packet headers, yet-another > (writable) memory area is allocated per packet (plus the SKB meta-data > struct). Have you done any actual testing with build_skb recently that shows how much of a gain there is to be had? I'm just curious as I know I saw a gain back in the day, but back when I ran that test we didn't have things like napi_alloc_skb running around which should be a pretty big win. It might be useful to hack a driver such as ixgbe to use build_skb and see if it is even worth the trouble to do it properly. Here is a patch I had generated back in 2013 to convert ixgbe over to using build_skb, https://patchwork.ozlabs.org/patch/236044/. You might be able to updated to make it work against current ixgbe and then could come back to us with data on what the actual gain is. My thought is the gain should have significantly decreased since back in the day as we optimized napi_alloc_skb to the point where I think the only real difference is probably the memcpy to pull the headers from the page. - Alex -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility? 2016-04-07 14:17 ` [LSF/MM TOPIC] Generic page-pool recycle facility? Jesper Dangaard Brouer ` (2 preceding siblings ...) 2016-04-07 19:48 ` Waskiewicz, PJ @ 2016-04-11 8:58 ` Mel Gorman 2016-04-11 12:26 ` Jesper Dangaard Brouer 3 siblings, 1 reply; 35+ messages in thread From: Mel Gorman @ 2016-04-11 8:58 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: lsf, linux-mm, netdev, Brenden Blanco, James Bottomley, Tom Herbert, lsf-pc, Alexei Starovoitov On Thu, Apr 07, 2016 at 04:17:15PM +0200, Jesper Dangaard Brouer wrote: > (Topic proposal for MM-summit) > > Network Interface Cards (NIC) drivers, and increasing speeds stress > the page-allocator (and DMA APIs). A number of driver specific > open-coded approaches exists that work-around these bottlenecks in the > page allocator and DMA APIs. E.g. open-coded recycle mechanisms, and > allocating larger pages and handing-out page "fragments". > > I'm proposing a generic page-pool recycle facility, that can cover the > driver use-cases, increase performance and open up for zero-copy RX. > Which bottleneck dominates -- the page allocator or the DMA API when setting up coherent pages? I'm wary of another page allocator API being introduced if it's for performance reasons. In response to this thread, I spent two days on a series that boosts performance of the allocator in the fast paths by 11-18% to illustrate that there was low-hanging fruit for optimising. If the one-LRU-per-node series was applied on top, there would be a further boost to performance on the allocation side. It could be further boosted if debugging checks and statistic updates were conditionally disabled by the caller. The main reason another allocator concerns me is that those pages are effectively pinned and cannot be reclaimed by the VM in low memory situations. It ends up needing its own API for tuning the size and hoping all the drivers get it right without causing OOM situations. 
It becomes a slippery slope of introducing shrinkers, locking and complexity. Then callers start getting concerned about NUMA locality and having to deal with multiple lists to maintain performance. Ultimately, it ends up being as slow as the page allocator and back to square 1 except now with more code. If it's the DMA API that dominates then something may be required but it should rely on the existing page allocator to alloc/free from. It would also need something like drain_all_pages to force free everything in there in low memory situations. Remember that multiple instances private to drivers or tasks will require shrinker implementations and the complexity may get unwieldy. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility? 2016-04-11 8:58 ` [Lsf-pc] " Mel Gorman @ 2016-04-11 12:26 ` Jesper Dangaard Brouer 2016-04-11 13:08 ` Mel Gorman 0 siblings, 1 reply; 35+ messages in thread From: Jesper Dangaard Brouer @ 2016-04-11 12:26 UTC (permalink / raw) To: Mel Gorman Cc: lsf, linux-mm, netdev, Brenden Blanco, James Bottomley, Tom Herbert, lsf-pc, Alexei Starovoitov, brouer On Mon, 11 Apr 2016 09:58:19 +0100 Mel Gorman <mgorman@suse.de> wrote: > On Thu, Apr 07, 2016 at 04:17:15PM +0200, Jesper Dangaard Brouer wrote: > > (Topic proposal for MM-summit) > > > > Network Interface Cards (NIC) drivers, and increasing speeds stress > > the page-allocator (and DMA APIs). A number of driver specific > > open-coded approaches exists that work-around these bottlenecks in the > > page allocator and DMA APIs. E.g. open-coded recycle mechanisms, and > > allocating larger pages and handing-out page "fragments". > > > > I'm proposing a generic page-pool recycle facility, that can cover the > > driver use-cases, increase performance and open up for zero-copy RX. > > > > Which bottleneck dominates -- the page allocator or the DMA API when > setting up coherent pages? > It is actually both, but mostly DMA on non-x86 archs. The need to support multiple archs, then also cause a slowdown on x86, due to a side-effect. On arch's like PowerPC, the DMA API is the bottleneck. To workaround the cost of DMA calls, NIC driver alloc large order (compound) pages. (dma_map compound page, handout page-fragments for RX ring, and later dma_unmap when last RX page-fragments is seen). The unfortunate side-effect is that these RX page-fragments (which contain packet data) need to be considered 'read-only', because a dma_unmap call can be destructive. Network packets need to be modified (minimum time-to-live). Thus, netstack alloc new writable memory, copy-over IP-headers, and adjust offset pointer into RX-page. 
Avoiding the dma_unmap (AFAIK) would allow making RX-pages writable. The idea behind the page-pool is to recycle pages back to the originating device; then we can avoid the need to call dma_unmap(), and only call dma_map() when setting up pages. > I'm wary of another page allocator API being introduced if it's for > performance reasons. In response to this thread, I spent two days on > a series that boosts performance of the allocator in the fast paths by > 11-18% to illustrate that there was low-hanging fruit for optimising. If > the one-LRU-per-node series was applied on top, there would be a further > boost to performance on the allocation side. It could be further boosted > if debugging checks and statistic updates were conditionally disabled by > the caller. It is always great if you can optimize the page allocator. IMHO the page allocator is too slow. At least for my performance needs (67ns per packet, approx 201 cycles at 3GHz). I've measured[1] alloc_pages(order=0) + __free_pages() to cost 277 cycles(tsc). The trick described above, of allocating a higher order page and handing out page-fragments, also works around this page allocator bottleneck (on x86). I've measured order 3 (32KB) alloc_pages(order=3) + __free_pages() to cost approx 500 cycles(tsc). That was more expensive, BUT an order=3 page 32Kb corresponds to 8 pages (32768/4096), thus 500/8 = 62.5 cycles. Usually a network RX-frame only needs to be 2048 bytes, thus the "bulk" effect speed up is x16 (32768/2048), thus 31.25 cycles. I view this as a bulking trick... maybe the page allocator can just give us a bulking API? ;-) > The main reason another allocator concerns me is that those pages > are effectively pinned and cannot be reclaimed by the VM in low memory > situations. It ends up needing its own API for tuning the size and hoping > all the drivers get it right without causing OOM situations. It becomes > a slippery slope of introducing shrinkers, locking and complexity. 
Then > callers start getting concerned about NUMA locality and having to deal > with multiple lists to maintain performance. Ultimately, it ends up being > as slow as the page allocator and back to square 1 except now with more code. The pages assigned to the RX ring queue are pinned like today. The pages available in the pool could easily be reclaimed. I actually think we are better off providing a generic page pool interface the drivers can use, instead of the current situation where drivers and subsystems invent their own, which do not cooperate in OOM situations. For the networking fast forwarding use-case (NOT localhost delivery), the page pool size would actually be limited at a fairly small fixed size. Packets will be hard dropped if exceeding this limit. The idea is, you want to limit the maximum latency the system can introduce when forwarding a packet, even in high overload situations. There is a good argument in section 3.2 of Google's paper[2]. They limit the pool size to 3000 and calculate this can introduce at most 300 micro-sec of latency. > If it's the DMA API that dominates then something may be required but it > should rely on the existing page allocator to alloc/free from. It would > also need something like drain_all_pages to force free everything in there > in low memory situations. Remember that multiple instances private to > drivers or tasks will require shrinker implementations and the complexity > may get unwieldly. I'll read up on the shrinker interface. [1] https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench [2] http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/44824.pdf -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility? 2016-04-11 12:26 ` Jesper Dangaard Brouer @ 2016-04-11 13:08 ` Mel Gorman 2016-04-11 16:19 ` [Lsf] " Jesper Dangaard Brouer 2016-04-11 16:20 ` Matthew Wilcox 0 siblings, 2 replies; 35+ messages in thread From: Mel Gorman @ 2016-04-11 13:08 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Mel Gorman, lsf, linux-mm, netdev, Brenden Blanco, James Bottomley, Tom Herbert, lsf-pc, Alexei Starovoitov On Mon, Apr 11, 2016 at 02:26:39PM +0200, Jesper Dangaard Brouer wrote: > > Which bottleneck dominates -- the page allocator or the DMA API when > > setting up coherent pages? > > > > It is actually both, but mostly DMA on non-x86 archs. The need to > support multiple archs, then also cause a slowdown on x86, due to a > side-effect. > > On arch's like PowerPC, the DMA API is the bottleneck. To workaround > the cost of DMA calls, NIC driver alloc large order (compound) pages. > (dma_map compound page, handout page-fragments for RX ring, and later > dma_unmap when last RX page-fragments is seen). > So, IMO only holding onto the DMA pages is all that is justified but not a recycle of order-0 pages built on top of the core allocator. For DMA pages, it would take a bit of legwork but the per-cpu allocator could be split and converted to hold arbitrary sized pages with a constructer/destructor to do the DMA coherency step when pages are taken from or handed back to the core allocator. I'm not volunteering to do that unfortunately but I estimate it'd be a few days work unless it needs to be per-CPU and NUMA aware in which case the memory footprint will be high. > > I'm wary of another page allocator API being introduced if it's for > > performance reasons. In response to this thread, I spent two days on > > a series that boosts performance of the allocator in the fast paths by > > 11-18% to illustrate that there was low-hanging fruit for optimising. 
If > > the one-LRU-per-node series was applied on top, there would be a further > > boost to performance on the allocation side. It could be further boosted > > if debugging checks and statistic updates were conditionally disabled by > > the caller. > > It is always great if you can optimized the page allocator. IMHO the > page allocator is too slow. It's why I spent some time on it as any improvement in the allocator is an unconditional win without requiring driver modifications. > At least for my performance needs (67ns > per packet, approx 201 cycles at 3GHz). I've measured[1] > alloc_pages(order=0) + __free_pages() to cost 277 cycles(tsc). > It'd be worth retrying this with the branch http://git.kernel.org/cgit/linux/kernel/git/mel/linux.git/log/?h=mm-vmscan-node-lru-v4r5 This is an unreleased series that contains both the page allocator optimisations and the one-LRU-per-node series which in combination remove a lot of code from the page allocator fast paths. I have no data on how the combined series behaves but each series individually is known to improve page allocator performance. Once you have that, do a hackjob to remove the debugging checks from both the alloc and free path and see what that leaves. They could be bypassed properly with a __GFP_NOACCT flag used only by drivers that absolutely require pages as quickly as possible and willing to be less safe to get that performance. I expect then that the free path to be dominated by zone and pageblock lookups which are much harder to remove. The zone lookup can be removed if the caller knows exactly where the free pages need to go which is unlikely. The pageblock lookup could be removed if it was coming from a dedicated pool if the allocation side refills using pageblocks that are always MIGRATE_UNMOVABLE. > The trick described above, of allocating a higher order page and > handing out page-fragments, also workaround this page allocator > bottleneck (on x86). 
> Be aware that compound order allocs like this are a double edged sword as it'll be fast sometimes and other times require reclaim/compaction which can stall for prolonged periods of time. > I've measured order 3 (32KB) alloc_pages(order=3) + __free_pages() to > cost approx 500 cycles(tsc). That was more expensive, BUT an order=3 > page 32Kb correspond to 8 pages (32768/4096), thus 500/8 = 62.5 > cycles. Usually a network RX-frame only need to be 2048 bytes, thus > the "bulk" effect speed up is x16 (32768/2048), thus 31.25 cycles. > > I view this as a bulking trick... maybe the page allocator can just > give us a bulking API? ;-) > It could on the alloc side relatively easily using either a variation of rmqueue_bulk exposed at a higher level populating a linked list (link via page->lru) or an array supplied by the caller. It's harder to bulk free quickly as the pages being freed are not necessarily in the same pageblock requiring lookups in the free path. Tricky to get right, but preferable to a whole new allocator. > > The main reason another allocator concerns me is that those pages > > are effectively pinned and cannot be reclaimed by the VM in low memory > > situations. It ends up needing its own API for tuning the size and hoping > > all the drivers get it right without causing OOM situations. It becomes > > a slippery slope of introducing shrinkers, locking and complexity. Then > > callers start getting concerned about NUMA locality and having to deal > > with multiple lists to maintain performance. Ultimately, it ends up being > > as slow as the page allocator and back to square 1 except now with more code. > > The pages assigned to the RX ring queue are pinned like today. The > pages avail in the pool could easily be reclaimed. > How easy depends on how it's structured. If it's a global per-cpu list then it's an IPI to all CPUs which is straight-forward to implement but slow to execute. 
If it's per-driver then there needs to be a locked list of all pools and locking on each individual pool which could offset some of the performance benefit of using the pool in the first place. > I actually think we are better off providing a generic page pool > interface the drivers can use. Instead of the situation where drivers > and subsystems invent their own, which does not cooperate in OOM > situations. > If it's offsetting DMA setup/teardown then I'd be a bit happier. If it's yet-another-page allocator to bypass the core allocator then I'm less happy. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility? 2016-04-11 13:08 ` Mel Gorman @ 2016-04-11 16:19 ` Jesper Dangaard Brouer 2016-04-11 16:53 ` Eric Dumazet 2016-04-11 18:07 ` Mel Gorman 2016-04-11 16:20 ` Matthew Wilcox 1 sibling, 2 replies; 35+ messages in thread From: Jesper Dangaard Brouer @ 2016-04-11 16:19 UTC (permalink / raw) To: Mel Gorman Cc: James Bottomley, netdev, Brenden Blanco, lsf, linux-mm, Mel Gorman, Tom Herbert, lsf-pc, Alexei Starovoitov, brouer On Mon, 11 Apr 2016 14:08:27 +0100 Mel Gorman <mgorman@techsingularity.net> wrote: > On Mon, Apr 11, 2016 at 02:26:39PM +0200, Jesper Dangaard Brouer wrote: [...] > > > > It is always great if you can optimized the page allocator. IMHO the > > page allocator is too slow. > > It's why I spent some time on it as any improvement in the allocator is > an unconditional win without requiring driver modifications. > > > At least for my performance needs (67ns > > per packet, approx 201 cycles at 3GHz). I've measured[1] > > alloc_pages(order=0) + __free_pages() to cost 277 cycles(tsc). > > > > It'd be worth retrying this with the branch > > http://git.kernel.org/cgit/linux/kernel/git/mel/linux.git/log/?h=mm-vmscan-node-lru-v4r5 > The cost decreased to: 228 cycles(tsc), but there are some variations, sometimes it increase to 238 cycles(tsc). Nice, but there is still a looong way to my performance target, where I can spend 201 cycles for the entire forwarding path.... > This is an unreleased series that contains both the page allocator > optimisations and the one-LRU-per-node series which in combination remove a > lot of code from the page allocator fast paths. I have no data on how the > combined series behaves but each series individually is known to improve > page allocator performance. > > Once you have that, do a hackjob to remove the debugging checks from both the > alloc and free path and see what that leaves. 
They could be bypassed properly > with a __GFP_NOACCT flag used only by drivers that absolutely require pages > as quickly as possible and willing to be less safe to get that performance. I would be interested in testing/benchmarking a patch where you remove the debugging checks... You are also welcome to try out my benchmarking modules yourself: https://github.com/netoptimizer/prototype-kernel/blob/master/getting_started.rst This is really simple stuff (for rapid prototyping) I'm just doing: modprobe page_bench01; rmmod page_bench01 ; dmesg | tail -n40 [...] > > Be aware that compound order allocs like this are a double edged sword as > it'll be fast sometimes and other times require reclaim/compaction which > can stall for prolonged periods of time. Yes, I've noticed that there can be a fairly high variation, when doing compound order allocs, which is not so nice! I really don't like these variations.... Drivers also do tricks where they fall back to smaller order pages. E.g. lookup function mlx4_alloc_pages(). I've tried to simulate that function here: https://github.com/netoptimizer/prototype-kernel/blob/91d323fc53/kernel/mm/bench/page_bench01.c#L69 It does not seem very optimal. I tried to mem pressure the system a bit to cause the alloc_pages() to fail, and then the results were very bad, something like 2500 cycles, and it usually got the next order pages. > > I've measured order 3 (32KB) alloc_pages(order=3) + __free_pages() to > > cost approx 500 cycles(tsc). That was more expensive, BUT an order=3 > > page 32Kb correspond to 8 pages (32768/4096), thus 500/8 = 62.5 > > cycles. Usually a network RX-frame only need to be 2048 bytes, thus > > the "bulk" effect speed up is x16 (32768/2048), thus 31.25 cycles. The order=3 cost was reduced to: 417 cycles(tsc), nice! But I've also seen it jump to 611 cycles. 
-- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility? 2016-04-11 16:19 ` [Lsf] " Jesper Dangaard Brouer @ 2016-04-11 16:53 ` Eric Dumazet 2016-04-11 19:47 ` Jesper Dangaard Brouer 2016-04-11 18:07 ` Mel Gorman 1 sibling, 1 reply; 35+ messages in thread From: Eric Dumazet @ 2016-04-11 16:53 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Mel Gorman, James Bottomley, netdev, Brenden Blanco, lsf, linux-mm, Mel Gorman, Tom Herbert, lsf-pc, Alexei Starovoitov On Mon, 2016-04-11 at 18:19 +0200, Jesper Dangaard Brouer wrote: > Drivers also do tricks where they fallback to smaller order pages. E.g. > lookup function mlx4_alloc_pages(). I've tried to simulate that > function here: > https://github.com/netoptimizer/prototype-kernel/blob/91d323fc53/kernel/mm/bench/page_bench01.c#L69 We use order-0 pages on mlx4 at Google, as order-3 pages are very dangerous for some kind of attacks... An out-of-order TCP packet can hold an order-3 page, while claiming to use 1.5 KB via skb->truesize. Order-0-only pages allow the page recycle trick used by the Intel driver, and we hardly see any page allocations in typical workloads. While order-3 pages are 'nice' for friendly datacenter kind of traffic, they also are a higher risk on hosts connected to the wild Internet. Maybe I should upstream this patch ;) ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility? 2016-04-11 16:53 ` Eric Dumazet @ 2016-04-11 19:47 ` Jesper Dangaard Brouer 2016-04-11 21:14 ` Eric Dumazet 0 siblings, 1 reply; 35+ messages in thread From: Jesper Dangaard Brouer @ 2016-04-11 19:47 UTC (permalink / raw) To: Eric Dumazet Cc: lsf, netdev, Brenden Blanco, James Bottomley, linux-mm, Mel Gorman, Tom Herbert, lsf-pc, Mel Gorman, Alexei Starovoitov, brouer, Alexander Duyck, Waskiewicz, PJ On Mon, 11 Apr 2016 09:53:54 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Mon, 2016-04-11 at 18:19 +0200, Jesper Dangaard Brouer wrote: > > > Drivers also do tricks where they fallback to smaller order pages. E.g. > > lookup function mlx4_alloc_pages(). I've tried to simulate that > > function here: > > https://github.com/netoptimizer/prototype-kernel/blob/91d323fc53/kernel/mm/bench/page_bench01.c#L69 > > We use order-0 pages on mlx4 at Google, as order-3 pages are very > dangerous for some kind of attacks... Interesting! > An out of order TCP packet can hold an order-3 pages, while claiming to > use 1.5 KB via skb->truesize. > > order-0 only pages allow the page recycle trick used by Intel driver, > and we hardly see any page allocations in typical workloads. Yes, I looked at the Intel ixgbe driver's page recycle trick. It is actually quite cool, but code-wise it is a little hard to follow. I started to look at the variant in i40e; specifically, the function i40e_clean_rx_irq_ps() explains it a bit more explicitly. > While order-3 pages are 'nice' for friendly datacenter kind of > traffic, they also are a higher risk on hosts connected to the wild > Internet. > > Maybe I should upstream this patch ;) Definitely! Does this patch also include a page recycle trick? Else how do you get around the cost of allocating a single order-0 page?
-- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility? 2016-04-11 19:47 ` Jesper Dangaard Brouer @ 2016-04-11 21:14 ` Eric Dumazet 0 siblings, 0 replies; 35+ messages in thread From: Eric Dumazet @ 2016-04-11 21:14 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: lsf, netdev, Brenden Blanco, James Bottomley, linux-mm, Mel Gorman, Tom Herbert, lsf-pc, Mel Gorman, Alexei Starovoitov, Alexander Duyck, Waskiewicz, PJ On Mon, 2016-04-11 at 21:47 +0200, Jesper Dangaard Brouer wrote: > On Mon, 11 Apr 2016 09:53:54 -0700 > Eric Dumazet <eric.dumazet@gmail.com> wrote: > > > On Mon, 2016-04-11 at 18:19 +0200, Jesper Dangaard Brouer wrote: > > > > > Drivers also do tricks where they fallback to smaller order pages. E.g. > > > lookup function mlx4_alloc_pages(). I've tried to simulate that > > > function here: > > > https://github.com/netoptimizer/prototype-kernel/blob/91d323fc53/kernel/mm/bench/page_bench01.c#L69 > > > > We use order-0 pages on mlx4 at Google, as order-3 pages are very > > dangerous for some kind of attacks... > > Interesting! > > > An out of order TCP packet can hold an order-3 pages, while claiming to > > use 1.5 KBvia skb->truesize. > > > > order-0 only pages allow the page recycle trick used by Intel driver, > > and we hardly see any page allocations in typical workloads. > > Yes, I looked at the Intel ixgbe drivers page recycle trick. > > It is actually quite cool, but code wise it is a little hard to > follow. I started to look at the variant in i40e, specifically > function i40e_clean_rx_irq_ps() explains it a bit more explicit. > > > > While order-3 pages are 'nice' for friendly datacenter kind of > > traffic, they also are a higher risk on hosts connected to the wild > > Internet. > > > > Maybe I should upstream this patch ;) > > Definitely! > > Does this patch also include a page recycle trick? Else how do you get > around the cost of allocating a single order-0 page? > Yes, we use the page recycle trick. 
Obviously not on powerpc (or any arch with PAGE_SIZE >= 8192), but definitely on x86. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility? 2016-04-11 16:19 ` [Lsf] " Jesper Dangaard Brouer 2016-04-11 16:53 ` Eric Dumazet @ 2016-04-11 18:07 ` Mel Gorman 2016-04-11 19:26 ` Jesper Dangaard Brouer 1 sibling, 1 reply; 35+ messages in thread From: Mel Gorman @ 2016-04-11 18:07 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Mel Gorman, James Bottomley, netdev, Brenden Blanco, lsf, linux-mm, Tom Herbert, lsf-pc, Alexei Starovoitov On Mon, Apr 11, 2016 at 06:19:07PM +0200, Jesper Dangaard Brouer wrote: > > http://git.kernel.org/cgit/linux/kernel/git/mel/linux.git/log/?h=mm-vmscan-node-lru-v4r5 > > > > The cost decreased to: 228 cycles(tsc), but there are some variations, > sometimes it increase to 238 cycles(tsc). > In the free path, a bulk pcp free adds to the cycles. In the alloc path, a refill of the pcp lists costs quite a bit. Either option introduces variances. The bulk free path can be optimised a little so I chucked some additional patches at it that are not released yet but I suspect the benefit will be marginal. The real heavy costs there are splitting/merging buddies. Fixing that is much more fundamental but even fronting the allocator with a new recycle allocator would not offset that as the refill of the page-recycling thing would incur high costs. > Nice, but there is still a looong way to my performance target, where I > can spend 201 cycles for the entire forwarding path.... > While I accept the cost is still too high, I think the effort should still be spent on improving the allocator in general than trying to bypass it. > > > This is an unreleased series that contains both the page allocator > > optimisations and the one-LRU-per-node series which in combination remove a > > lot of code from the page allocator fast paths. I have no data on how the > > combined series behaves but each series individually is known to improve > > page allocator performance. 
> > > > Once you have that, do a hackjob to remove the debugging checks from both the > > alloc and free path and see what that leaves. They could be bypassed properly > > with a __GFP_NOACCT flag used only by drivers that absolutely require pages > > as quickly as possible and willing to be less safe to get that performance. > > I would be interested in testing/benchmarking a patch where you remove > the debugging checks... > Right now, I'm not proposing to remove the debugging checks despite their cost. They catch really difficult problems in the field unfortunately including corruption from buggy hardware. A GFP flag that disables them for a very specific case would be ok but I expect it to be resisted by others if it's done for the general case. Even a static branch for runtime debugging checks may be resisted. Even if GFP flags are tight, I have a patch that deletes __GFP_COLD on the grounds it is of questionable value. Applying that would free a flag for __GFP_NOACCT that bypasses debugging checks and statistic updates. That would work for the allocation side at least but doing the same for the free side would be hard (potentially impossible) to do transparently for drivers. > You are also welcome to try out my benchmarking modules yourself: > https://github.com/netoptimizer/prototype-kernel/blob/master/getting_started.rst > I took a quick look and functionally it's similar to the systemtap-based microbenchmark I'm using in mmtests so I don't think we have a problem with reproduction at the moment. > > Be aware that compound order allocs like this are a double edged sword as > > it'll be fast sometimes and other times require reclaim/compaction which > > can stall for prolonged periods of time. > > Yes, I've notice that there can be a fairly high variation, when doing > compound order allocs, which is not so nice! I really don't like these > variations.... > They can cripple you which is why I'm very wary of performance patches that require compound pages. 
It tends to look great only on benchmarks and then the corner cases hit in the real world and the bug reports are unpleasant. > Drivers also do tricks where they fallback to smaller order pages. E.g. > lookup function mlx4_alloc_pages(). I've tried to simulate that > function here: > https://github.com/netoptimizer/prototype-kernel/blob/91d323fc53/kernel/mm/bench/page_bench01.c#L69 > > It does not seem very optimal. I tried to mem pressure the system a bit > to cause the alloc_pages() to fail, and then the result were very bad, > something like 2500 cycles, and it usually got the next order pages. The options for fallback tend to have one hazard after the next. It's partially why the last series focused on order-0 pages only. > > > I've measured order 3 (32KB) alloc_pages(order=3) + __free_pages() to > > > cost approx 500 cycles(tsc). That was more expensive, BUT an order=3 > > > page 32Kb correspond to 8 pages (32768/4096), thus 500/8 = 62.5 > > > cycles. Usually a network RX-frame only need to be 2048 bytes, thus > > > the "bulk" effect speed up is x16 (32768/2048), thus 31.25 cycles. > > The order=3 cost were reduced to: 417 cycles(tsc), nice! But I've also > seen it jump to 611 cycles. > The corner cases can be minimised to some extent -- lazy buddy merging for example but it unfortunately has other consequences for users that require high-order pages for functional reasons. I tried something like that once (http://thread.gmane.org/gmane.linux.kernel/807683) but didn't pursue it to the end as it was a small part of the problem I was dealing with at the time. It shouldn't be ruled out but it should be considered a last resort. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility? 2016-04-11 18:07 ` Mel Gorman @ 2016-04-11 19:26 ` Jesper Dangaard Brouer 0 siblings, 0 replies; 35+ messages in thread From: Jesper Dangaard Brouer @ 2016-04-11 19:26 UTC (permalink / raw) To: Mel Gorman Cc: Mel Gorman, James Bottomley, netdev, Brenden Blanco, lsf, linux-mm, Tom Herbert, lsf-pc, Alexei Starovoitov, brouer On Mon, 11 Apr 2016 19:07:03 +0100 Mel Gorman <mgorman@suse.de> wrote: > On Mon, Apr 11, 2016 at 06:19:07PM +0200, Jesper Dangaard Brouer wrote: > > > http://git.kernel.org/cgit/linux/kernel/git/mel/linux.git/log/?h=mm-vmscan-node-lru-v4r5 > > > > > > > The cost decreased to: 228 cycles(tsc), but there are some variations, > > sometimes it increase to 238 cycles(tsc). > > > > In the free path, a bulk pcp free adds to the cycles. In the alloc path, > a refill of the pcp lists costs quite a bit. Either option introduces > variances. The bulk free path can be optimised a little so I chucked > some additional patches at it that are not released yet but I suspect the > benefit will be marginal. The real heavy costs there are splitting/merging > buddies. Fixing that is much more fundamental but even fronting the allocator > with a new recycle allocator would not offset that as the refill of the > page-recycling thing would incur high costs. > Yes, re-filling page-pool (in the non-steady state) could be problematic for performance. That is why I'm very motivated in helping out with a bulk alloc/free scheme for the page allocator. > > Nice, but there is still a looong way to my performance target, where I > > can spend 201 cycles for the entire forwarding path.... > > > > While I accept the cost is still too high, I think the effort should still > be spent on improving the allocator in general than trying to bypass it. > I do think improving the page allocator is very important work. I just don't see how we can ever reach my performance target, without a page-pool recycle facility. 
I work in an area where even the cost of a single atomic operation is too high, so I work on amortizing the individual atomic operations. That is what I did for the SLUB allocator, with the bulk API. See: Commit d0ecd894e3d5 ("slub: optimize bulk slowpath free by detached freelist") https://git.kernel.org/torvalds/c/d0ecd894e3d5 Commit fbd02630c6e3 ("slub: initial bulk free implementation") https://git.kernel.org/torvalds/c/fbd02630c6e3 This is now also used in the network stack: Commit 3134b9f019f2 ("Merge branch 'net-mitigate-kmem_free-slowpath'") Commit a3a8749d34d8 ("ixgbe: bulk free SKBs during TX completion cleanup cycle") > > > This is an unreleased series that contains both the page allocator > > > optimisations and the one-LRU-per-node series which in combination remove a > > > lot of code from the page allocator fast paths. I have no data on how the > > > combined series behaves but each series individually is known to improve > > > page allocator performance. > > > > > > Once you have that, do a hackjob to remove the debugging checks from both the > > > alloc and free path and see what that leaves. They could be bypassed properly > > > with a __GFP_NOACCT flag used only by drivers that absolutely require pages > > > as quickly as possible and willing to be less safe to get that performance. > > > > I would be interested in testing/benchmarking a patch where you remove > > the debugging checks... > > > > Right now, I'm not proposing to remove the debugging checks despite their > cost. They catch really difficult problems in the field unfortunately > including corruption from buggy hardware. A GFP flag that disables them > for a very specific case would be ok but I expect it to be resisted by > others if it's done for the general case. Even a static branch for runtime > debugging checks may be resisted. > > Even if GFP flags are tight, I have a patch that deletes __GFP_COLD on > the grounds it is of questionable value.
> Applying that would free a flag > for __GFP_NOACCT that bypasses debugging checks and statistic updates. > That would work for the allocation side at least but doing the same for > the free side would be hard (potentially impossible) to do transparently > for drivers. Before spending too much work on something, I usually try to determine what its maximum benefit would be. Thus, I propose you create a patch that hack-removes all the debug checks that you think could be beneficial to remove. And then benchmark it yourself, or send it to me for benchmarking... that is the quickest way to determine if this is worth spending time on. > > You are also welcome to try out my benchmarking modules yourself: > > https://github.com/netoptimizer/prototype-kernel/blob/master/getting_started.rst > > > > I took a quick look and functionally it's similar to the systemtap-based > microbenchmark I'm using in mmtests so I don't think we have a problem > with reproduction at the moment. > > > > Be aware that compound order allocs like this are a double edged sword as > > > it'll be fast sometimes and other times require reclaim/compaction which > > > can stall for prolonged periods of time. > > > > Yes, I've notice that there can be a fairly high variation, when doing > > compound order allocs, which is not so nice! I really don't like these > > variations.... > > > > They can cripple you which is why I'm very wary of performance patches that > require compound pages. It tends to look great only on benchmarks and then > the corner cases hit in the real world and the bug reports are unpleasant. That confirms Eric's experience at Google, where they disabled this compound order page feature in the driver... > > Drivers also do tricks where they fallback to smaller order pages. E.g. > > lookup function mlx4_alloc_pages().
> > I've tried to simulate that > > function here: > > https://github.com/netoptimizer/prototype-kernel/blob/91d323fc53/kernel/mm/bench/page_bench01.c#L69 > > > > It does not seem very optimal. I tried to mem pressure the system a bit > > to cause the alloc_pages() to fail, and then the result were very bad, > > something like 2500 cycles, and it usually got the next order pages. > > The options for fallback tend to have one hazard after the next. It's > partially why the last series focused on order-0 pages only. In other places in the network stack, this falling down through the orders got removed, and was replaced with a single "falldown" to order-0 pages (due to people reporting bad experiences of latency spikes). > > > > I've measured order 3 (32KB) alloc_pages(order=3) + __free_pages() to > > > > cost approx 500 cycles(tsc). That was more expensive, BUT an order=3 > > > > page 32Kb correspond to 8 pages (32768/4096), thus 500/8 = 62.5 > > > > cycles. Usually a network RX-frame only need to be 2048 bytes, thus > > > > the "bulk" effect speed up is x16 (32768/2048), thus 31.25 cycles. > > > > The order=3 cost were reduced to: 417 cycles(tsc), nice! But I've also > > seen it jump to 611 cycles. > > > > The corner cases can be minimised to some extent -- lazy buddy merging for > example but it unfortunately has other consequences for users that require > high-order pages for functional reasons. I tried something like that once > (http://thread.gmane.org/gmane.linux.kernel/807683) but didn't pursue it > to the end as it was a small part of the problem I was dealing with at the > time. It shouldn't be ruled out but it should be considered a last resort. > -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility? 2016-04-11 13:08 ` Mel Gorman 2016-04-11 16:19 ` [Lsf] " Jesper Dangaard Brouer @ 2016-04-11 16:20 ` Matthew Wilcox 2016-04-11 17:46 ` Thadeu Lima de Souza Cascardo 1 sibling, 1 reply; 35+ messages in thread From: Matthew Wilcox @ 2016-04-11 16:20 UTC (permalink / raw) To: Mel Gorman Cc: Jesper Dangaard Brouer, James Bottomley, netdev, Brenden Blanco, lsf, linux-mm, Mel Gorman, Tom Herbert, lsf-pc, Alexei Starovoitov On Mon, Apr 11, 2016 at 02:08:27PM +0100, Mel Gorman wrote: > On Mon, Apr 11, 2016 at 02:26:39PM +0200, Jesper Dangaard Brouer wrote: > > On arch's like PowerPC, the DMA API is the bottleneck. To workaround > > the cost of DMA calls, NIC driver alloc large order (compound) pages. > > (dma_map compound page, handout page-fragments for RX ring, and later > > dma_unmap when last RX page-fragments is seen). > > So, IMO only holding onto the DMA pages is all that is justified but not a > recycle of order-0 pages built on top of the core allocator. For DMA pages, > it would take a bit of legwork but the per-cpu allocator could be split > and converted to hold arbitrary sized pages with a constructer/destructor > to do the DMA coherency step when pages are taken from or handed back to > the core allocator. I'm not volunteering to do that unfortunately but I > estimate it'd be a few days work unless it needs to be per-CPU and NUMA > aware in which case the memory footprint will be high. Have "we" tried to accelerate the DMA calls in PowerPC? For example, it could hold onto a cache of recently used mappings and recycle them if that still works. It trades off a bit of security (a device can continue to DMA after the memory should no longer be accessible to it) for speed, but then so does the per-driver hack of keeping pages around still mapped. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility? 2016-04-11 16:20 ` Matthew Wilcox @ 2016-04-11 17:46 ` Thadeu Lima de Souza Cascardo 2016-04-11 18:37 ` Jesper Dangaard Brouer 0 siblings, 1 reply; 35+ messages in thread From: Thadeu Lima de Souza Cascardo @ 2016-04-11 17:46 UTC (permalink / raw) To: Matthew Wilcox Cc: Mel Gorman, Jesper Dangaard Brouer, James Bottomley, netdev, Brenden Blanco, lsf, linux-mm, Mel Gorman, Tom Herbert, lsf-pc, Alexei Starovoitov On Mon, Apr 11, 2016 at 12:20:47PM -0400, Matthew Wilcox wrote: > On Mon, Apr 11, 2016 at 02:08:27PM +0100, Mel Gorman wrote: > > On Mon, Apr 11, 2016 at 02:26:39PM +0200, Jesper Dangaard Brouer wrote: > > > On arch's like PowerPC, the DMA API is the bottleneck. To workaround > > > the cost of DMA calls, NIC driver alloc large order (compound) pages. > > > (dma_map compound page, handout page-fragments for RX ring, and later > > > dma_unmap when last RX page-fragments is seen). > > > > So, IMO only holding onto the DMA pages is all that is justified but not a > > recycle of order-0 pages built on top of the core allocator. For DMA pages, > > it would take a bit of legwork but the per-cpu allocator could be split > > and converted to hold arbitrary sized pages with a constructer/destructor > > to do the DMA coherency step when pages are taken from or handed back to > > the core allocator. I'm not volunteering to do that unfortunately but I > > estimate it'd be a few days work unless it needs to be per-CPU and NUMA > > aware in which case the memory footprint will be high. > > Have "we" tried to accelerate the DMA calls in PowerPC? For example, it > could hold onto a cache of recently used mappings and recycle them if that > still works. It trades off a bit of security (a device can continue to DMA > after the memory should no longer be accessible to it) for speed, but then > so does the per-driver hack of keeping pages around still mapped. 
> There are two problems on the DMA calls on Power servers. One is scalability. A new allocation method for the address space would be necessary to fix it. The other one is the latency or the cost of updating the TCE tables. The only number I have is that I could push around 1M updates per second. So, we could guess 1us per operation, which is pretty much a no-no for Jesper use case. Your solution could address both. But I am concerned about the security problem. Here is why I think this problem should be ignored if we go this way. IOMMU can be used for three problems: virtualization, paranoia security and debuggability. For virtualization, there is a solution already, and it's in place for Power and x86. Power servers have the ability to enlarge the DMA window, allowing the entire VM memory to be mapped during PCI driver probe time. After that, dma_map is a simple sum and dma_unmap is a nop. x86 KVM maps the entire VM memory even before booting the guest. Unless we want to fix this for old Power servers, I see no point in fixing it. Now, if you are using IOMMU on the host with no passthrough or linear system memory mapping, you are paranoid. It's not just a matter of security, in fact. It's also a matter of stability. Hardware, firmware and drivers can be buggy, and they are. When I worked with drivers on Power servers, I found and fixed a lot of driver bugs that caused the device to write to memory it was not supposed to. Good thing is that IOMMU prevented that memory write to happen and the driver would be reset by EEH. If we can make this scenario faster, and if we want it to be the default we need to, then your solution might not be desired. Otherwise, just turn your IOMMU off or put it into passthrough. Now, the driver keeps pages mapped, but those pages belong to the driver. They are not pages we decide to give to a userspace process because it's no longer in use by the driver. So, I don't quite agree this would be a good tradeoff. 
Certainly not if we can do it in a way that does not require this. So, Jesper, please take into consideration that this pool design would rather be per device. Otherwise, we allow some device to write into another's device/driver memory. Cascardo. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility? 2016-04-11 17:46 ` Thadeu Lima de Souza Cascardo @ 2016-04-11 18:37 ` Jesper Dangaard Brouer 2016-04-11 18:53 ` Bart Van Assche 0 siblings, 1 reply; 35+ messages in thread From: Jesper Dangaard Brouer @ 2016-04-11 18:37 UTC (permalink / raw) To: Thadeu Lima de Souza Cascardo Cc: Matthew Wilcox, Mel Gorman, James Bottomley, netdev, Brenden Blanco, lsf, linux-mm, Mel Gorman, Tom Herbert, lsf-pc, Alexei Starovoitov, brouer On Mon, 11 Apr 2016 14:46:25 -0300 Thadeu Lima de Souza Cascardo <cascardo@redhat.com> wrote: > So, Jesper, please take into consideration that this pool design > would rather be per device. Otherwise, we allow some device to write > into another's device/driver memory. Yes, that was my intended use. I want to have a page-pool per device. I actually want to go as far as a page-pool per NIC HW RX-ring queue. Because the other use-case for the page-pool is zero-copy RX. The NIC HW trick is that we today can create a HW filter in the NIC (via ethtool) and place that traffic into a separate RX queue in the NIC. Let's say matching NFS traffic or guest traffic. Then we can allow RX zero-copy of these pages into the application/guest, somehow binding it to the RX queue, e.g. introducing a "cross-domain-id" in the page-pool page that needs to match. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility? 2016-04-11 18:37 ` Jesper Dangaard Brouer @ 2016-04-11 18:53 ` Bart Van Assche 0 siblings, 0 replies; 35+ messages in thread From: Bart Van Assche @ 2016-04-11 18:53 UTC (permalink / raw) To: Jesper Dangaard Brouer, Thadeu Lima de Souza Cascardo Cc: lsf, netdev, Brenden Blanco, James Bottomley, linux-mm, Mel Gorman, Tom Herbert, Matthew Wilcox, lsf-pc, Mel Gorman, Alexei Starovoitov On 04/11/2016 11:37 AM, Jesper Dangaard Brouer wrote: > On Mon, 11 Apr 2016 14:46:25 -0300 > Thadeu Lima de Souza Cascardo <cascardo@redhat.com> wrote: > >> So, Jesper, please take into consideration that this pool design >> would rather be per device. Otherwise, we allow some device to write >> into another's device/driver memory. > > Yes, that was my intended use. I want to have a page-pool per device. > I actually, want to go as far as a page-pool per NIC HW RX-ring queue. > > Because the other use-case for the page-pool is zero-copy RX. > > The NIC HW trick is that we today can create a HW filter in the NIC > (via ethtool) and place that traffic into a separate RX queue in the > NIC. Lets say matching NFS traffic or guest traffic. Then we can allow > RX zero-copy of these pages, into the application/guest, somehow > binding it to RX queue, e.g. introducing a "cross-domain-id" in the > page-pool page that need to match. I think it is important to keep in mind that using a page pool for zero-copy RX is specific to protocols that are based on TCP/IP. Protocols like FC, SRP and iSER have been designed such that the side that allocates the buffers also initiates the data transfer (the target side). With TCP/IP, however, transferring data and allocating receive buffers happen on opposite sides of the connection. Bart. ^ permalink raw reply [flat|nested] 35+ messages in thread
end of thread, other threads:[~2016-04-12 17:20 UTC | newest] Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <1460034425.20949.7.camel@HansenPartnership.com> 2016-04-07 14:17 ` [LSF/MM TOPIC] Generic page-pool recycle facility? Jesper Dangaard Brouer 2016-04-07 14:38 ` [Lsf-pc] " Christoph Hellwig 2016-04-07 15:11 ` [Lsf] " Bart Van Assche 2016-04-10 18:45 ` Sagi Grimberg 2016-04-11 21:41 ` Jesper Dangaard Brouer 2016-04-11 22:02 ` Alexander Duyck 2016-04-12 6:28 ` Jesper Dangaard Brouer 2016-04-12 15:37 ` Alexander Duyck 2016-04-11 22:21 ` Alexei Starovoitov 2016-04-12 6:16 ` Jesper Dangaard Brouer 2016-04-12 17:20 ` Alexei Starovoitov 2016-04-07 15:48 ` Chuck Lever 2016-04-07 16:14 ` [Lsf-pc] [Lsf] " Rik van Riel 2016-04-07 19:43 ` [Lsf] [Lsf-pc] " Jesper Dangaard Brouer 2016-04-07 15:18 ` Eric Dumazet 2016-04-09 9:11 ` [Lsf] " Jesper Dangaard Brouer 2016-04-09 12:34 ` Eric Dumazet 2016-04-11 20:23 ` Jesper Dangaard Brouer 2016-04-11 21:27 ` Eric Dumazet 2016-04-07 19:48 ` Waskiewicz, PJ 2016-04-07 20:38 ` Jesper Dangaard Brouer 2016-04-08 16:12 ` Alexander Duyck 2016-04-11 8:58 ` [Lsf-pc] " Mel Gorman 2016-04-11 12:26 ` Jesper Dangaard Brouer 2016-04-11 13:08 ` Mel Gorman 2016-04-11 16:19 ` [Lsf] " Jesper Dangaard Brouer 2016-04-11 16:53 ` Eric Dumazet 2016-04-11 19:47 ` Jesper Dangaard Brouer 2016-04-11 21:14 ` Eric Dumazet 2016-04-11 18:07 ` Mel Gorman 2016-04-11 19:26 ` Jesper Dangaard Brouer 2016-04-11 16:20 ` Matthew Wilcox 2016-04-11 17:46 ` Thadeu Lima de Souza Cascardo 2016-04-11 18:37 ` Jesper Dangaard Brouer 2016-04-11 18:53 ` Bart Van Assche
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).