* AF_XDP design flaws
@ 2019-02-26 14:49 Maxim Mikityanskiy
From: Maxim Mikityanskiy
To: netdev, Björn Töpel, Magnus Karlsson, David S. Miller
Cc: Tariq Toukan, Saeed Mahameed, Eran Ben Elisha

Hi everyone,

I would like to discuss some design flaws of the AF_XDP socket (XSK)
implementation in the kernel. At the moment I don't see a way to work around
them without changing the API, so I would like to make sure that I'm not
missing anything, and to suggest and discuss some possible improvements.

The issues I describe below are caused by the fact that the driver depends on
the application doing certain things, and if the application is
slow/buggy/malicious, the driver is forced to busy poll because of the lack of
a notification mechanism from the application side. I will refer to the i40e
driver implementation a lot, as it is the first implementation of AF_XDP, but
the issues are general and affect any driver. I already considered trying to
fix this at the driver level, but it doesn't seem possible, so it looks like
the behavior and implementation of AF_XDP in the kernel have to be changed.

RX side busy polling
====================

On the RX side, the driver expects the application to put some descriptors in
the Fill Ring. There is no way for the application to notify the driver that
there are more Fill Ring descriptors to take, so the driver is forced to busy
poll the Fill Ring if it gets empty. E.g., the i40e driver does it in NAPI
poll:

    int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
    {
            ...
            failure = failure ||
                      !i40e_alloc_rx_buffers_fast_zc(rx_ring,
                                                     cleaned_count);
            ...
            return failure ? budget : (int)total_rx_packets;
    }

Basically, it means that if there are no descriptors in the Fill Ring, NAPI
will never stop, draining the CPU.

Possible cases when it happens
------------------------------

1.
The application is slow: it received some frames in the RX Ring, and it is
still handling the data, so it has no free frames to put into the Fill Ring.

2. The application is malicious: it opens an XSK and puts no frames into the
Fill Ring. This can be used as a local DoS attack.

3. The application is buggy and stops filling the Fill Ring for whatever
reason (a deadlock, waiting on another blocking operation, other bugs).

Although loading an XDP program requires root access, the DoS attack can be
targeted at setups that already use XDP, i.e. where an XDP program is already
loaded. Even under root, userspace applications should not be able to disrupt
system stability just by calling normal APIs without any intention to harm
the system, yet that is what happens in case 1.

Possible way to solve the issue
-------------------------------

When the driver can't take new Fill Ring frames, it shouldn't busy poll.
Instead, it signals the failure to the application (e.g., with POLLERR), and
after that it's up to the application to restart polling (e.g., by calling
sendto()) after refilling the Fill Ring. The issue with this approach is that
it changes the API, so we either have to accept that or introduce some API
version field.

TX side getting stuck
=====================

On the TX side, there is the Completion Ring that the application has to
clean. If it doesn't, the i40e driver stops taking descriptors from the TX
Ring. If the application finally completes something, the driver can go on
transmitting. However, that would require busy polling the Completion Ring
(just like the Fill Ring on the RX side). i40e doesn't do that; instead, it
relies on the application to kick the TX by calling sendto(). The issue is
that poll() doesn't return POLLOUT in this case, because the TX Ring is full,
so the application will never call sendto(), and the ring is stuck forever
(or at least until something else triggers NAPI).
Possible way to solve the issue
-------------------------------

When the driver can't reserve a descriptor in the Completion Ring, it should
signal the failure to the application (e.g., with POLLERR). The application
shouldn't call sendto() every time it sees that the number of outstanding
frames is greater than zero (like the xdpsock sample does). Instead, the
application should kick the TX only when it wants to flush the ring and, in
addition, after resolving the cause of POLLERR, i.e. after handling
Completion Ring entries. The API will also have to change with this approach.

Triggering NAPI on a different CPU core
=======================================

.ndo_xsk_async_xmit runs on a random CPU core, so, to preserve CPU affinity,
i40e triggers an interrupt to schedule NAPI instead of calling napi_schedule
directly. Scheduling NAPI on the correct CPU is what every driver would do, I
guess, but currently it has to be implemented differently in every driver,
and it relies on hardware features (the ability to trigger an IRQ on demand).

I suggest introducing a kernel API that would allow triggering NAPI on a
given CPU. A brief look shows that something like
smp_call_function_single_async could be used. Advantages:

1. It lifts the hardware requirement to be able to raise an interrupt on
demand.

2. It would allow moving common code into the kernel (.ndo_xsk_async_xmit).

3. It is also useful in the situation where CPU affinity changes while NAPI
poll is running. Currently, i40e and mlx5e try to stop NAPI polling by
returning a value less than budget if the CPU affinity changes. However,
there are cases (e.g., NAPIF_STATE_MISSED) when NAPI will be rescheduled on
the wrong CPU: it's a race between the interrupt, which will move NAPI to the
correct CPU, and __napi_schedule from the wrong CPU. Having an API to
schedule NAPI on a given CPU would benefit both mlx5e and i40e, because when
this situation happens, it kills performance.

I would be happy to hear your thoughts about these issues.
Thanks,
Max
* Re: AF_XDP design flaws
2019-02-26 16:41 ` John Fastabend
From: John Fastabend
To: Maxim Mikityanskiy, netdev, Björn Töpel, Magnus Karlsson,
    David S. Miller
Cc: Tariq Toukan, Saeed Mahameed, Eran Ben Elisha

On 2/26/19 6:49 AM, Maxim Mikityanskiy wrote:
> Hi everyone,
>
> [...]
>
> Although loading an XDP program requires root access, the DoS attack can be
> targeted to setups that already use XDP, i.e. an XDP program is already
> loaded. Even under root, userspace applications should not be able to
> disrupt system stability by just calling normal APIs without an intention
> to destroy the system, and here it happens in case 1.

I believe this is by design. If packets per second is your top priority (at
the expense of power, cpu, etc.), this is the behavior you might want. To
resolve your points: if you don't trust the application, it should be
isolated to a queue pair and cores so the impact is known and managed.

That said, having a flag to back off seems like a good idea. But it should be
doable as a feature without breaking the current API.

> Possible way to solve the issue
> -------------------------------
>
> When the driver can't take new Fill Ring frames, it shouldn't busy poll.
> Instead, it signals the failure to the application (e.g., with POLLERR),
> and after that it's up to the application to restart polling (e.g., by
> calling sendto()) after refilling the Fill Ring.
> [...]

See above. I like the idea here though.
> TX side getting stuck
> =====================
>
> [...] The issue is that poll() doesn't return POLLOUT in this case, because
> the TX Ring is full, so the application will never call sendto(), and the
> ring is stuck forever (or at least until something triggers NAPI).
>
> Possible way to solve the issue
> -------------------------------
>
> When the driver can't reserve a descriptor in the Completion Ring, it
> should signal the failure to the application (e.g., with POLLERR). [...]
> The API will also have to change with this approach.

+1, seems to me this can be done without breaking the existing API though.

> Triggering NAPI on a different CPU core
> =======================================
>
> .ndo_xsk_async_xmit runs on a random CPU core, so, to preserve CPU
> affinity, i40e triggers an interrupt to schedule NAPI, instead of calling
> napi_schedule directly. [...]

Ideally the application would be pinned to a core and the traffic steered to
that core using ethtool/tc. Yes, it requires a bit of mgmt on the platform,
but I think this is needed for best pps numbers.

> I suggest introducing a kernel API that would allow triggering NAPI on a
> given CPU. A brief look shows that something like
> smp_call_function_single_async can be used. Advantages:

Assuming you want to avoid pinning cores/traffic for some reason: could this
be done with some combination of cpumap + af_xdp? I haven't thought too much
about it though. Maybe Bjorn has some ideas.

> 1. It lifts the hardware requirement to be able to raise an interrupt on
> demand.
>
> 2. It would allow to move common code to the kernel (.ndo_xsk_async_xmit).
>
> 3. It is also useful in the situation where CPU affinity changes while
> being in NAPI poll. [...] Having an API to schedule NAPI on a given CPU
> will benefit both mlx5e and i40e, because when this situation happens, it
> kills the performance.

How would we know what core to trigger NAPI on?

> I would be happy to hear your thoughts about these issues.

At least the first two observations seem incrementally solvable to me and
would be nice improvements. I assume the last case can be resolved by pinning
cores/traffic, but for the general case perhaps it is useful.

> Thanks,
> Max
* Re: AF_XDP design flaws
2019-02-27 8:08 ` Björn Töpel
From: Björn Töpel
To: John Fastabend
Cc: Maxim Mikityanskiy, netdev, Björn Töpel, Magnus Karlsson,
    David S. Miller, Tariq Toukan, Saeed Mahameed, Eran Ben Elisha

On 2019-02-26 17:41, John Fastabend wrote:
> On 2/26/19 6:49 AM, Maxim Mikityanskiy wrote:
>> Hi everyone,
>>

Hi! It's exciting to see more vendors looking into AF_XDP. ARM was involved
early on, and now Mellanox. :-) It's really important that the AF_XDP model
is a good fit for all drivers/vendors! Thanks for looking into this.

>> [...]
>>
>> Although loading an XDP program requires root access, the DoS attack can
>> be targeted to setups that already use XDP, i.e. an XDP program is
>> already loaded. Even under root, userspace applications should not be
>> able to disrupt system stability by just calling normal APIs without an
>> intention to destroy the system, and here it happens in case 1.
>
> I believe this is by design. If packets per second is your top priority
> (at the expense of power, cpu, etc.) this is the behavior you might
> want. To resolve your points if you don't trust the application it
> should be isolated to a queue pair and cores so the impact is known and
> managed.

Correct, and I agree with John here. AF_XDP is a raw interface, and needs to
be managed.

> That said having a flag to back-off seems like a good idea. But should
> be doable as a feature without breaking current API.
>
>> Possible way to solve the issue
>> -------------------------------
>>
>> When the driver can't take new Fill Ring frames, it shouldn't busy poll.
>> Instead, it signals the failure to the application (e.g., with POLLERR),
>> and after that it's up to the application to restart polling (e.g., by
>> calling sendto()) after refilling the Fill Ring. [...]
>
> See above. I like the idea here though.

The uapi doesn't mandate *how* the driver should behave. The rationale for
the aggressive i40e fill ring polling (when the ring is empty) is to minimize
the latency. The "fill ring is empty" behavior might be different on
different drivers.

I see a number of different alternatives to the current i40e way of
busy-polling, ranging from watchdog timers to actual HW implementations where
writing to the tail pointer of the fill ring kicks a doorbell.

That being said, I agree with the previous speakers; being able to set a
policy/driver behavior from userland is a good idea. This can be added
without breaking the existing model. Max, any input on what this interface
should look like from your perspective?

>> TX side getting stuck
>> =====================
>>
>> [...] The issue is that poll() doesn't return POLLOUT in this case,
>> because the TX Ring is full, so the application will never call sendto(),
>> and the ring is stuck forever (or at least until something triggers
>> NAPI).
>>
>> Possible way to solve the issue
>> -------------------------------
>>
>> When the driver can't reserve a descriptor in the Completion Ring, it
>> should signal the failure to the application (e.g., with POLLERR). [...]
>> The API will also have to change with this approach.
>>
> +1 seems to me this can be done without breaking existing API though.
>

Keep in mind that the xdpsock application is a sample/toy, and should not be
seen as the API documentation.

I don't see that the API needs to be changed. AF_XDP requires that the fill
ring has entries in order to receive data. Analogously, the completion ring
needs to be managed by the application to send frames. Are you suggesting
that using poll() et al should be mandatory? Again, take me through your send
scenario, step by step. To me, the userland application has all the knobs to
determine whether the Tx ring can be flushed or not.

>> Triggering NAPI on a different CPU core
>> =======================================
>>
>> [...]
>
> Assuming you want to avoid pinning cores/traffic for some reason. Could
> this be done with some combination of cpumap + af_xdp? I haven't thought
> too much about it though. Maybe Bjorn has some ideas.
>

For the Intel drivers, the NAPI is associated with a certain interrupt. This
does not change with AF_XDP, so ndo_xsk_async_xmit simply kicks the NAPI to
run on the CPU defined by the IRQ affinity.

I'm all open for additional kernel APIs, and I'm happy to hack the i40e
internals as well. Again, it must be really simple to add XDP/AF_XDP-ZC
support for other vendors.

Ideally, what I would like is a generic way of associating netdev queues
(think ethtool channels) with NAPI contexts. E.g. "queues 1,2,3 on NAPI foo,
and I,II,III on NAPI bar, and the NAPIs should run on the baz cgroup". This
is a much bigger task and outside the AF_XDP scope. It ties into the whole
"threaded NAPIs and netdev queues as a first class kernel object" discussion.
;-)

Magnus is working on proper busy-poll (i.e. the poll() syscall executing the
NAPI context). With that solution the user application can select which core
it wants to execute poll() on -- and the NAPI will be executed from poll().

>> [...]
>>
>> 3. It is also useful in the situation where CPU affinity changes while
>> being in NAPI poll. [...] Having an API to schedule NAPI on a given CPU
>> will benefit both mlx5e and i40e, because when this situation happens, it
>> kills the performance.
>
> How would we know what core to trigger NAPI on?
>
>> I would be happy to hear your thoughts about these issues.
>
> At least the first two observations seem incrementally solvable to me
> and would be nice improvements. I assume the last case can be resolved
> by pinning cores/traffic but for the general case perhaps it is useful.
>

Good summary, John. Really good input, Max, and addressable -- but API
changes, as far as I can see, are not required.

(I'm out-of-office this week, skiing. So, I'll be away from the lists until
Monday!)

Cheers,
Björn
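For concreteness, the kind of NAPI-scheduling helper proposed in this thread
might look roughly like the following kernel-side sketch. This is not
upstream code and is not compilable as-is: napi_schedule_on_cpu() and
remote_napi_schedule() are invented names for this illustration, and a real
implementation would have to deal with csd lifetime, reuse of an in-flight
csd, and CPU hotplug:

```c
/* Sketch only: schedule a NAPI context on a given CPU via an IPI, instead
 * of relying on the NIC's ability to fire an interrupt on demand. */

static void remote_napi_schedule(void *info)
{
	struct napi_struct *napi = info;

	/* Runs in IPI (hardirq) context on the target CPU, which is a
	 * valid context for raising NET_RX_SOFTIRQ. */
	napi_schedule(napi);
}

static void napi_schedule_on_cpu(struct napi_struct *napi, int cpu,
				 call_single_data_t *csd)
{
	if (cpu == smp_processor_id()) {
		napi_schedule(napi);
		return;
	}

	csd->func = remote_napi_schedule;
	csd->info = napi;
	smp_call_function_single_async(cpu, csd);
}
```

With something like this, a driver's .ndo_xsk_async_xmit could kick its NAPI
on the IRQ-affine core without touching device registers, which is the
hardware-independence argument made above.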
* RE: AF_XDP design flaws
2019-02-27 19:17 ` Maxim Mikityanskiy
From: Maxim Mikityanskiy
To: Björn Töpel, John Fastabend
Cc: netdev, Björn Töpel, Magnus Karlsson, David S. Miller, Tariq Toukan,
    Saeed Mahameed, Eran Ben Elisha

> From: Björn Töpel <bjorn.topel@gmail.com>
> Sent: 27 February, 2019 10:09
> Subject: Re: AF_XDP design flaws
>
> On 2019-02-26 17:41, John Fastabend wrote:
> > On 2/26/19 6:49 AM, Maxim Mikityanskiy wrote:
> >> Hi everyone,
> >>
>
> Hi! It's exciting to see more vendors looking into AF_XDP. ARM was
> involved early on, and now Mellanox. :-) It's really important that
> the AF_XDP model is a good fit for all drivers/vendors! Thanks for
> looking into this.
>
> >> [...]
> >> Even under root, userspace applications should not be able to disrupt
> >> system stability by just calling normal APIs without an intention to
> >> destroy the system, and here it happens in case 1.
> > I believe this is by design. If packets per second is your top priority
> > (at the expense of power, cpu, etc.) this is the behavior you might
> > want. To resolve your points if you don't trust the application it
> > should be isolated to a queue pair and cores so the impact is known and
> > managed.
>
> Correct, and I agree with John here. AF_XDP is a raw interface, and
> needs to be managed.

OK, right, if you want high pps at the cost of system stability and
kernel-userspace isolation, XSK is at your service. It may be a valid point.
However, there is still the issue of unintended use. If you use XSK, you can
take measures to prevent the disruption, e.g. run only trusted and properly
tested applications, or isolate them. However, there is the case where XSKs
are not really used on a given machine, but are still available
(CONFIG_XDP_SOCKETS=y), opening a huge hole. If you don't even know that XSK
exists, you are still vulnerable.

> > That said having a flag to back-off seems like a good idea. But should
> > be doable as a feature without breaking current API.

A flag can be added, e.g., to the sxdp_flags field of struct sockaddr_xdp, to
switch to the behavior without busy polling on an empty Fill Ring. However, I
don't think the vulnerability can be eliminated while keeping the API intact.
To protect the systems that don't use XSK, we need one of these things:

- Disable XSK support by default, until explicitly enabled by root.

- Disable the old behavior by default, until explicitly enabled by root.

Both of those things will require additional setup on machines that use XSK
after upgrading the kernel.

> >> Possible way to solve the issue
> >> -------------------------------
> >>
> >> When the driver can't take new Fill Ring frames, it shouldn't busy poll.
> >> Instead, it signals the failure to the application (e.g., with
> >> POLLERR), and after that it's up to the application to restart polling
> >> (e.g., by calling sendto()) after refilling the Fill Ring. [...]
> > See above. I like the idea here though.
>
> The uapi doesn't mandate *how* the driver should behave. The rationale
> for the aggressive i40e fill ring polling (when the ring is empty) is
> to minimize the latency. The "fill ring is empty" behavior might be
> different on different drivers.

If the behavior differs between drivers, it will be difficult to create
applications that are portable across NICs. The point of creating APIs is to
provide a single interface to different implementations, thus hiding the
differences from a higher abstraction level. What the driver does internally
is up to the driver, but the result visible in userspace should be the same,
so we can't just make some drivers stop and wait when the Fill Ring is empty
while the others busy poll. It should be controlled by some flags if we want
to preserve both options.

> I see a number of different alternatives to the current i40e way of
> busy-polling, all ranging from watchdog timers to actual HW
> implementations where writing to the tail pointer of the fill ring
> kicks a door bell.
>
> That being said, I agree with previous speakers; Being able to set a
> policy/driver behavior from userland is a good idea. This can be added
> without breaking the existing model.
>
> Max, any input to what this interface should look like from your
> perspective?

As I said above, we can add a flag (flags?) to sxdp_flags that will be
considered only if !XDP_SHARED_UMEM. It uses an existing field, so backwards
compatibility can be preserved. The only thing I'm concerned about is that if
the current busy-polling behavior stays the default, some random machines
that don't even use XSK will stay vulnerable to a local DoS attack.

> >> TX side getting stuck
> >> =====================
> >>
> >> [...] The issue is that poll() doesn't return POLLOUT in this case,
> >> because the TX Ring is full, so the application will never call
> >> sendto(), and the ring is stuck forever (or at least until something
> >> triggers NAPI).
> >>
> >> Possible way to solve the issue
> >> -------------------------------
> >>
> >> [...] The API will also have to change with this approach.
> >>
> > +1 seems to me this can be done without breaking existing API though.
>
> Keep in mind that the xdpsock application is sample/toy, and should
> not be seen as the API documentation.

Unfortunately, Documentation/networking/af_xdp.rst misses a lot of things
that only become clear by reading the xdpsock, kernel, and i40e sources.
So, currently, xdpsock is the best reference code available (unless you can point me to something different).

> I don't see that the API need to be changed. AF_XDP requires that the
> fill ring has entries, in order to receive data. Analogous, the
> completion ring needs to managed by the application to send frames.

The difference is that on the RX side, the driver busy polls until it gets enough Fill Ring entries, while on the TX side it stops when the Completion Ring is full. The application doesn't have a convenient mechanism to detect this. Of course, when it takes entries from the Completion Ring, it can check every time whether the ring was full and kick the TX in those cases, but there are two points here:

- Why not signal it from poll(), to keep the TX kicking in one place and simplify the application?

- This behavior isn't documented anywhere. The documentation says that I need to call sendmsg() to start the transfer. However, looking at the source code, I found that I actually need to call sendto() each time I see outstanding packets, because i40e can ignore my sendto() if the Completion Ring is full. And the TX getting stuck is not covered even by the sample code.

> Are you suggesting that using poll() et al should be mandatory? Again,
> take me through your send scenario, step by step.

Not necessarily; as both the kernel and the application share the same rings, it doesn't matter which of them looks at the producer/consumer indices and decides whether TX can proceed. The conditions used in xsk_poll can be reproduced in the application itself if it wants to eliminate poll().

My send algorithm:

1. Wait until poll() returns POLLOUT.

2. Put the packets into the UMEM, and the descriptors into the TX Ring.

3. Call sendto() (the difference is that sendto() is called only after actual sending, not every time the application sees outstanding packets that may already be queued in the hardware).

4.
If the driver can't reserve a descriptor in the Completion Ring, the next poll() will return POLLERR. In this case it's up to the application to call sendto() again after freeing some space in the Completion Ring.

Pros:

1. It still doesn't force the application to use poll() - the application can just monitor the rings itself.

2. It eliminates useless sendto()s. Although xdpsock is only a simple example, with the current XSK implementation in the kernel I see no way to avoid calling sendto() every time the application sees some not-yet-completed packets, because the application doesn't know whether the kernel took a given frame or not. With my approach, sendto() is called only when needed.

3. sendto() and poll() can be placed in a single loop. When the Completion Ring overflows, poll() returns (with POLLERR), which doesn't happen currently.

4. It allows batching multiple packets in a single sendto(). If we try to batch-send packets now, we don't know how many of them were consumed by the kernel, so we are forced to repeat sendto() until all of them are transmitted.

> To me, the userland application has all the knobs to determine whether
> the Tx ring can be flushed or not.

Regarding this, consider the points above.

>>> Triggering NAPI on a different CPU core
>>> =======================================
>>>
>>> .ndo_xsk_async_xmit runs on a random CPU core, so, to preserve CPU affinity,
>>> i40e triggers an interrupt to schedule NAPI, instead of calling napi_schedule
>>> directly. Scheduling NAPI on the correct CPU is what every driver would do, I
>>> guess, but currently it has to be implemented differently in every driver, and
>>> it relies on hardware features (the ability to trigger an IRQ).
>>
>> Ideally the application would be pinned to a core and the traffic
>> steered to that core using ethtool/tc. Yes it requires a bit of mgmt on
>> the platform but I think this is needed for best pps numbers.

Yes, you are right.
However, the situation where .ndo_xsk_async_xmit is called on a wrong CPU is possible, and it will happen unless some additional setup is done, so the driver has to handle it.

>>> I suggest introducing a kernel API that would allow triggering NAPI on a given
>>> CPU. A brief look shows that something like smp_call_function_single_async can
>>> be used. Advantages:
>>
>> Assuming you want to avoid pinning cores/traffic for some reason. Could
>> this be done with some combination of cpumap + af_xdp? I haven't thought
>> too much about it though. Maybe Bjorn has some ideas.

Not that I want to avoid pinning CPUs... Intel has a hardware feature that allows triggering an IRQ on a given CPU to run NAPI on the right CPU. (I'm talking only about the TX side here.) The data to be transmitted can be written to the UMEM from the right CPU, but sendto() can be called from any CPU, and it will work for i40e. Mellanox hardware doesn't have this feature, although I can do the same using some hacks. So I'm seeking a portable solution.

> For the Intel drivers, the NAPI is associated to a certain
> interrupt.

The same for Mellanox.

> This does not change with AF_XDP, so ndo_xsk_async_xmit
> simply kicks the NAPI to run on the CPU bound defined by the irq
> affinity.

This is more complicated for Mellanox hardware. If there was a "trigger NAPI on CPU X" function (that could use smp_call_function_single_async internally), it would make this part completely driver- and hardware-agnostic; it could even be implemented as common code in the kernel.

> I'm all open for additionally kernel APIs, and I'm happy to hack the
> i40e internals as well. Again, it must be really simple to add
> XDP/AF_XDP-ZC support for other vendors.
>
> Ideally, what I would like is a generic way of associating netdev
> queues (think ethtool channels) with NAPI contexts. E.g. "queues
> 1,2,3 on NAPI foo, and I,II,III on NAPI bar, and the NAPIs should run
> on the baz cgroup".
> This is a much bigger task and outside the
> AF_XDP scope. It ties in to the whole "threaded NAPIs and netdev
> queues as a first class kernel object" discussion. ;-)

Yes, it's much more intrusive and big, let's deal with AF_XDP first :)

> Magnus is working on proper busy-poll() (i.e. that the poll() syscall
> executes the NAPI context). With that solution the user application
> can select which core it wants to execute the poll() on -- and the
> NAPI will be executed from the poll().

Wow, that's good. But the current i40e_napi_poll() checks the CPU affinity and reschedules if it's wrong - will this check be removed? Or will there be a separate NAPI instance for XSK TX?

>>> 1. It lifts the hardware requirement to be able to raise an interrupt on demand.
>>>
>>> 2. It would allow to move common code to the kernel (.ndo_xsk_async_xmit).
>>>
>>> 3. It is also useful in the situation where CPU affinity changes while being in
>>> NAPI poll. Currently, i40e and mlx5e try to stop NAPI polling by returning
>>> a value less than budget if CPU affinity changes. However, there are cases
>>> (e.g., NAPIF_STATE_MISSED) when NAPI will be rescheduled on a wrong CPU. It's a
>>> race between the interrupt, which will move NAPI to the correct CPU, and
>>> __napi_schedule from a wrong CPU. Having an API to schedule NAPI on a given CPU
>>> will benefit both mlx5e and i40e, because when this situation happens, it kills
>>> the performance.
>>
>> How would we know what core to trigger NAPI on?

We can check the cpumask for the necessary channel.

>>> I would be happy to hear your thoughts about these issues.
>>
>> At least the first two observations seem incrementally solvable to me
>> and would be nice improvements. I assume the last case can be resolved
>> by pinning cores/traffic but for the general case perhaps it is useful.
>
> Good summary, John. Really good input Max, and addressable -- but API
> changes as far as I see it is not required.
OK, I'll try to sum up the points:

1. RX busy polling is good for latency, and the issues caused by userspace misbehavior can be accepted as a trade-off for performance, but if AF_XDP sockets are available in the system, even if unused, they open a security hole (local DoS).

2. The TX stall can be solved in userspace with the current API; however, this issue is not documented, and my approach can give some benefits. We can add a flag to turn it on explicitly. No driver changes, kernel-only feature.

3. The NAPI issue can be worked around by calling sendto() from the right CPU (and that will most likely happen in production), but the current i40e implementation supports sendto() from any CPU, and it should be made driver-agnostic and put in the kernel.

Thank you John and Björn for your input!

> (I'm out-of-office this week, skiing. So, I'll be away from the lists
> until Monday!)
>
> Cheers,
> Björn

>> Thanks,
>> Max

^ permalink raw reply [flat|nested] 12+ messages in thread
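Max's proposed TX flow (kick only on flush, re-kick after reclaiming Completion Ring space when POLLERR is seen) can be illustrated with a self-contained toy model. This is not real AF_XDP code: the "kernel" is stubbed with counters, and kick_tx()/reap_completions() are invented names standing in for sendto() and for the application draining the Completion Ring.

```c
#include <assert.h>
#include <stdbool.h>

/* Stubbed "kernel" state: a Completion Ring with limited space. */
enum { COMP_RING_SIZE = 4 };

static int comp_used;   /* completed frames not yet reclaimed by the app */
static int sent;        /* frames actually transmitted */
static int kicks;       /* number of sendto() calls ("kicks") */

/* Models sendto(): fails (as the next poll() returning POLLERR would
 * report) when the Completion Ring has no room for the whole batch. */
static bool kick_tx(int frames)
{
    kicks++;
    if (comp_used + frames > COMP_RING_SIZE)
        return false;
    comp_used += frames;
    sent += frames;
    return true;
}

/* Models the application reaping (emptying) the Completion Ring. */
static void reap_completions(void)
{
    comp_used = 0;
}

/* The proposed algorithm: one kick per flushed batch; on failure, free
 * Completion Ring space and kick again (steps 3-4 above). */
static void send_batch(int frames)
{
    if (!kick_tx(frames)) {
        reap_completions();
        kick_tx(frames);
    }
}
```

The point the model makes is pro 2 above: the number of kicks grows with the number of flushed batches (plus rare retries), not with the number of outstanding packets.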
* Re: AF_XDP design flaws 2019-02-27 19:17 ` Maxim Mikityanskiy @ 2019-02-27 21:03 ` Jonathan Lemon 2019-02-28 10:49 ` Maxim Mikityanskiy 2019-02-28 14:23 ` Magnus Karlsson 0 siblings, 2 replies; 12+ messages in thread From: Jonathan Lemon @ 2019-02-27 21:03 UTC (permalink / raw) To: Maxim Mikityanskiy Cc: Björn Töpel, John Fastabend, netdev, Björn Töpel, Magnus Karlsson, David S. Miller, Tariq Toukan, Saeed Mahameed, Eran Ben Elisha On 27 Feb 2019, at 11:17, Maxim Mikityanskiy wrote: >> -----Original Message----- >> From: Björn Töpel <bjorn.topel@gmail.com> >> Sent: 27 February, 2019 10:09 >> To: John Fastabend <john.fastabend@gmail.com> >> Cc: Maxim Mikityanskiy <maximmi@mellanox.com>; >> netdev@vger.kernel.org; Björn >> Töpel <bjorn.topel@intel.com>; Magnus Karlsson >> <magnus.karlsson@intel.com>; >> David S. Miller <davem@davemloft.net>; Tariq Toukan >> <tariqt@mellanox.com>; >> Saeed Mahameed <saeedm@mellanox.com>; Eran Ben Elisha >> <eranbe@mellanox.com> >> Subject: Re: AF_XDP design flaws >> >> On 2019-02-26 17:41, John Fastabend wrote: >>> On 2/26/19 6:49 AM, Maxim Mikityanskiy wrote: >>>> Hi everyone, >>>> >> >> Hi! It's exciting to see more vendors looking into AF_XDP. ARM was >> involved early on, and now Mellanox. :-) It's really important that >> the AF_XDP model is a good fit for all drivers/vendors! Thanks for >> looking into this. >> >>>> I would like to discuss some design flaws of AF_XDP socket (XSK) >> implementation >>>> in kernel. At the moment I don't see a way to work around them >>>> without >> changing >>>> the API, so I would like to make sure that I'm not missing anything >>>> and >> to >>>> suggest and discuss some possible improvements that can be made. 
>>>> [...]
>>>>
>>>> 3.
>>>> The application is buggy and stops filling the Fill Ring for whatever
>>>> reason (deadlock, waiting for another blocking operation, other bugs).
>>>>
>>>> Although loading an XDP program requires root access, the DoS attack can be
>>>> targeted to setups that already use XDP, i.e. an XDP program is already loaded.
>>>> Even under root, userspace applications should not be able to disrupt system
>>>> stability by just calling normal APIs without an intention to destroy the
>>>> system, and here it happens in case 1.
>>>
>>> I believe this is by design. If packets per second is your top priority
>>> (at the expense of power, cpu, etc.) this is the behavior you might
>>> want. To resolve your points if you don't trust the application it
>>> should be isolated to a queue pair and cores so the impact is known and
>>> managed.
>>
>> Correct, and I agree with John here. AF_XDP is a raw interface, and
>> needs to be managed.
>
> OK, right, if you want high pps sacrificing system stability and
> kernel-userspace isolation, XSK is at your service. It may be a valid
> point. However, there is still the issue of unintended use.

I believe this is a double-edged sword - on one hand, it appears AF_XDP is aimed as a competitor to DPDK, and is focused only on raw packet speed, at the expense of usability. This isn't necessarily bad, but in my experience, if things aren't simple to use, then they tend not to get widely used.

> If you use XSK, you can take measures to prevent the disruption, e.g.
> run only trusted and properly tested applications or isolate them.
> However, there is a case where XSKs are not really used on a given
> machine, but are still available (CONFIG_XDP_SOCKETS=y), opening a huge
> hole. If you don't even know that XSK exists, you are still vulnerable.
>
>>> That said having a flag to back-off seems like a good idea. But should
>>> be doable as a feature without breaking current API.
> A flag can be added, e.g., to the sxdp_flags field of struct
> sockaddr_xdp, to switch to the behavior without busy polling on empty
> Fill Ring.
>
> However, I don't think the vulnerability can be eliminated while keeping
> the API intact. To protect the systems that don't use XSK, we need one
> of these things:
>
> - Disable XSK support by default, until explicitly enabled by root.
>
> - Disable the old behavior by default, until explicitly enabled by root.
>
> Both of those things will require additional setup on machines that use
> XSK, after upgrading the kernel.
>
>>>> [...]
>>
>> The uapi doesn't mandate *how* the driver should behave. The rational
>> for the aggressive i40e fill ring polling (when the ring is empty) is
>> to minimize the latency. The "fill ring is empty" behavior might be
>> different on different drivers.
>
> If the behavior is different with different drivers, it will be
> difficult to create applications that are portable across NICs. The
> point of creating APIs is to provide a single interface to different
> implementations, thus hiding the differences from a higher abstraction
> level. What the driver does internally is up to the driver, but the
> result visible in the userspace should be the same, so we can't just
> make some drivers stop and wait when the Fill Ring is empty, while the
> others busy poll.
> It should be controlled by some flags if we want to
> preserve both options.

Or if the driver can't obtain new frames from the fill ring, it just drops the incoming packet, and recycles the old frame in order to keep the ring full. (This also assumes that the driver was able to initially populate the ring in the first place, which eliminates one DoS attack of never putting anything in the fill ring initially.)

I do agree that the issues cited in this thread can be resolved with the use of flags to control the different behaviors - there just needs to be some agreement on what those behaviors are.
--
Jonathan

^ permalink raw reply [flat|nested] 12+ messages in thread
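Jonathan's drop-and-recycle policy can be captured in a few lines (a toy model with invented names, not driver code): when the fill ring is empty, the incoming packet is dropped and the old frame goes straight back to the HW ring, so per-packet work stays bounded and no busy polling is needed.

```c
#include <assert.h>

/* Toy model of the drop-and-recycle policy for an empty fill ring. */
static int fill_entries;   /* frames the application made available */
static int delivered;      /* packets handed to the socket Rx ring */
static int recycled;       /* packets dropped, frame recycled into HW ring */

static void rx_packet(void)
{
    if (fill_entries == 0) {
        /* No frame from userland: drop the packet and reuse the old
         * frame, keeping the HW ring full. Constant work, no spinning. */
        recycled++;
        return;
    }
    fill_entries--;
    delivered++;
}
```

A slow or malicious application then only hurts its own receive rate (its packets are dropped), instead of pinning a core in NAPI.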
* RE: AF_XDP design flaws 2019-02-27 21:03 ` Jonathan Lemon @ 2019-02-28 10:49 ` Maxim Mikityanskiy 2019-03-05 18:26 ` Björn Töpel 0 siblings, 1 reply; 12+ messages in thread From: Maxim Mikityanskiy @ 2019-02-28 10:49 UTC (permalink / raw) To: Jonathan Lemon Cc: Björn Töpel, John Fastabend, netdev, Björn Töpel, Magnus Karlsson, David S. Miller, Tariq Toukan, Saeed Mahameed, Eran Ben Elisha

> On 27 Feb 2019, at 11:17, Maxim Mikityanskiy wrote:
>
> [...]
> [...]
>
> > OK, right, if you want high pps sacrificing system stability and
> > kernel-userspace isolation, XSK is at your service. It may be a valid
> > point. However, there is still the issue of unintended use.
>
> I believe this is a double-edged sword - on one hand, it appears AF_XDP is
> aimed as a competitor to DPDK, and is focused only on raw packet speed, at
> the expense of usability.
> This isn't necessarily bad, but in my experience,
> if things aren't simple to use, then they tend not to get widely used.

You are right, but not getting widespread adoption also leads to a lack of usage examples, which in turn can lead to API misuse. Another point is that in the past, security issues have been found in code that is rarely used but enabled in binary kernels, and AF_XDP is exactly that kind of code - malware can make use of its flaws even if it is not used legitimately on a given system.

> [...]
>
> >>>> When the driver can't take new Fill Ring frames, it shouldn't busy
> >>>> poll.
> >>>> [...]
>
> Or if the driver can't obtain new frames from the fill ring, it just
> drops the incoming packet, and recycles the old frame in order to keep
> the ring full.

Good idea, thanks! I will need to analyze it, though.

> (This also assumes that the driver was able to initially
> populate the ring in the first place,

Which will not happen if there is nothing in the Fill Ring initially...

> which eliminates one DoS attack of
> > I do agree that issues cited in this thread can be resolved with the > use of flags to control the different behaviors - there just needs to > be some agreement on what those behaviors are. > -- > Jonathan ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: AF_XDP design flaws 2019-02-28 10:49 ` Maxim Mikityanskiy @ 2019-03-05 18:26 ` Björn Töpel 2019-03-07 15:09 ` Maxim Mikityanskiy 0 siblings, 1 reply; 12+ messages in thread From: Björn Töpel @ 2019-03-05 18:26 UTC (permalink / raw) To: Maxim Mikityanskiy Cc: Jonathan Lemon, John Fastabend, netdev, Björn Töpel, Magnus Karlsson, David S. Miller, Tariq Toukan, Saeed Mahameed, Eran Ben Elisha

On Thu, 28 Feb 2019 at 11:50, Maxim Mikityanskiy <maximmi@mellanox.com> wrote:
>
> [...]

Back in the saddle! Sorry for the delay!

Ok, let me try to summarize. First, let's go through the current AF_XDP semantics so that we're all on the same page, and then pull Max' suggestions in.

Ingress
-------

The simplified flow is:

1. Userland puts buffers on the fill ring
2. The fill ring is dequeued by the kernel
3. The kernel places the received buffer on the socket Rx ring

If 2 doesn't get a buffer, no feedback (other than a driver-level counter) is provided to userland. What retry policy the driver should use is up to the driver implementation. The i40e busy-polls, which means, as Max points out, that it will spend a lot of time in napi without a proper back-off mechanism.

If the Rx ring is full, so that 3 fails, the packet is dropped and no feedback (other than a counter) is provided to userland.

Egress
------

1. Userland puts buffer(s) on the Tx ring
2. Userland calls sendto
3. The Tx ring is dequeued by the kernel
4. The kernel enqueues the buffer on the completion ring

Again, little or no feedback is provided to userland. If the completion ring is full, no packets are sent. Further, if the napi is running, the Tx ring will potentially be drained *without* calling sendto. So, it's really up to the userland application to determine when to call sendto.

Further, if the napi is running and the driver cannot drain the Tx ring (completion full or HW full), i40e will busy-poll to get the packets out. Again, as Max points out, this will make the kernel spend a lot of time in napi context.
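The ingress flow enumerated above, with its two silent failure modes (fill ring empty at step 2, Rx ring full at step 3), can be sketched as a toy counter model. This is illustrative only, with invented names, not driver code:

```c
#include <assert.h>

/* Toy model of the ingress flow: fill ring -> kernel -> Rx ring, with
 * both failure modes reported only through driver-level counters. */
enum { RX_RING_SIZE = 4 };

static int fill_entries;   /* buffers userland put on the fill ring */
static int rx_entries;     /* received buffers waiting for userland */
static int no_buff_cnt;    /* counter: step 2 failed, fill ring ran dry */
static int drop_cnt;       /* counter: step 3 failed, Rx ring was full */

static void rx_one_packet(void)
{
    if (fill_entries == 0) {          /* step 2: no buffer available */
        no_buff_cnt++;
        return;
    }
    if (rx_entries == RX_RING_SIZE) { /* step 3: Rx ring full, drop */
        drop_cnt++;
        return;
    }
    fill_entries--;
    rx_entries++;
}
```

Note that in both failure branches the only observable effect is a counter increment; nothing wakes the application, which is exactly the feedback gap discussed in this thread.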
The kernel "kick" on egress via sendto is something that we'd like to make optional, such that the egress side is identical to the Rx side: four rings per socket, which the user fills (fill ring/Tx) and drains (Rx/completion ring) without any syscalls at all. Again, this is doable with kernel-side napi-threads.

The API is throughput oriented, and hence the current design.

Now, onto Max' concerns, from my perspective:

1. The kernel spins too much in napi mode.

Yes, the i40e driver does spin for throughput and latency reasons. I agree that we should add a back-off mechanism. I would prefer *not* adding this to the AF_XDP uapi, but having it as a driver knob. Another idea would be to move to a napi-thread similar to what Paolo Abeni suggested in [1], and let the scheduler deal with the fairness issue.

2. No/little error feedback to userland

Max would like a mode where feedback such as "fill ring has run dry", "completion queue is full", or "HW queue full" is returned to userland via the poll() syscall.

In this mode, Max suggests that sendto() will return an error if not all packets in the Tx ring can be sent. Further, the kernel should be kicked when items have been placed in the fill ring.

Again, all good and valid points!

I think we can address this with the upcoming busy-poll() support. In the busy-poll mode (which will be a new AF_XDP bind option), the napi will be executed in the poll() context.

Ingress would be:

1. Userland puts buffers on the fill ring
2. Call poll(), and from the poll context:
   a. The fill ring is dequeued by the kernel
   b. The kernel places the received buffer on the socket Rx ring

If a. fails, poll() will return a POLLERR, and userland can act on it. Ditto for egress, and poll() will return a POLLERR if the completion ring has fewer free entries than there are Tx ring entries.

So, we're addressing your concerns with the busy-poll mode, and leaving the throughput/non-busy-poll API as it is today.

What do you think about that, Max?
Would that be a path forward for Mellanox -- i.e. implementing the busy-poll and the current API?

3. Introduce an API to schedule a napi on a certain core

I think this is outside the AF_XDP scope (given my points above). This is mainly kernel internals, and I have no strong opinions here. As long as you guys are hacking on AF_XDP, I'm happy. :-P

Finally, yes, we need to work on the documentation! Patches are welcome! ;-)

Max, thanks for the input and for looking into this! Very much appreciated!

Cheers,
Björn

[1] https://lwn.net/Articles/686985/

^ permalink raw reply [flat|nested] 12+ messages in thread
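The busy-poll ingress semantics Björn proposes (napi work executed from poll(), with an empty fill ring surfacing as POLLERR instead of the driver spinning) can be sketched as a toy state machine. The names and constants are local inventions for illustration, not the real poll(2) bits or any kernel function:

```c
#include <assert.h>

/* Toy model of the proposed busy-poll bind mode. */
enum { MODEL_POLLIN = 0x1, MODEL_POLLERR = 0x8 };

static int fill_entries;  /* buffers userland put on the fill ring */
static int rx_entries;    /* buffers delivered to the socket Rx ring */
static int hw_pending;    /* packets waiting in the HW queue */

static int xsk_poll_model(void)
{
    while (hw_pending > 0) {
        if (fill_entries == 0)
            return MODEL_POLLERR;  /* step a. failed: report to userland */
        fill_entries--;            /* a. dequeue the fill ring */
        rx_entries++;              /* b. place buffer on the Rx ring */
        hw_pending--;
    }
    return rx_entries > 0 ? MODEL_POLLIN : 0;
}
```

The key property is that the loop is bounded by the work userland provisioned: when the fill ring runs dry, control returns to the application with an error instead of the kernel busy polling.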
* RE: AF_XDP design flaws 2019-03-05 18:26 ` Björn Töpel @ 2019-03-07 15:09 ` Maxim Mikityanskiy 2019-03-07 16:51 ` Alexei Starovoitov 2019-03-07 17:52 ` Björn Töpel 0 siblings, 2 replies; 12+ messages in thread From: Maxim Mikityanskiy @ 2019-03-07 15:09 UTC (permalink / raw) To: Björn Töpel Cc: Jonathan Lemon, John Fastabend, netdev, Björn Töpel, Magnus Karlsson, David S. Miller, Tariq Toukan, Saeed Mahameed, Eran Ben Elisha

> On Thu, 28 Feb 2019 at 11:50, Maxim Mikityanskiy <maximmi@mellanox.com>
> wrote:
>
> Back in the saddle! Sorry for the delay!
>
> [...]
>
> Egress
> ------
>
> 1. Userland puts buffer(s) on the Tx ring
> 2.
Userland calls sendto > 3. The Tx ring is dequeued by the kernel > 4. The kernel enqueues the buffer on the completion ring > > Again, little or no feedback is provided to userland. If the completion > ring is full, no packets are sent. Further, if the napi is running, > the Tx ring will potentially be drained *without* calling sendto. So, > it's really up to the userland application to determine when to call > sendto. > > Further, if the napi is running and the driver cannot drain the Tx > ring (completion full or HW full), i40e will busy-poll to get the > packets out. Again, as Max points out, this will make the kernel spend > a lot of time in napi context. > > The kernel "kick" on egress via sendto is something that we'd like to > make optional, such that the egress side is identical to the Rx > side. Four rings per socket that the user fills (fill ring/Tx) and > drains (Rx/completion ring) without any syscalls at all. Again, this > is doable with kernel-side napi-threads. > > The API is throughput oriented, and hence the current design. Thanks for the great summary! > Now, onto Max' concerns, from my perspective: > > 1. The kernel spins too much in napi mode. Not just too much, it will do it forever if the application simply doesn't act. > Yes, the i40e driver does spin for throughput and latency reasons. I > agree that we should add a back-off mechanism. I would prefer *not* > adding this to the AF_XDP uapi, but having it as a driver knob. A back-off won't be that good for latency, will it? Anyway, I'm fine with having aggressive polling to maximize throughput in the circumstances where the user wants it and can control the resource consumption. What I'm concerned about is a security hole this feature opens to the systems not using AF_XDP. I'd prefer having a kernel-level runtime switch that is off by default. 
If one wants to use AF_XDP, they have to turn it on first, and from then on it's up to them to be careful about which applications they run and whether they trust them. And if no one turns AF_XDP on, no application can abuse the API to DoS the kernel. > Another idea would be to move to a napi-thread similar to what Paolo > Abeni suggested in [1], and let the scheduler deal with the fairness > issue. Sounds good; it looks like the impact of the spinning NAPI thread will be the same as if the userspace application just started consuming 100% CPU. Still, it can be used to overcome the maximum process number set by ulimit and maximum CPU usage quotas. > 2. No/little error feedback to userland > > Max would like a mode where feedback like "fill ring has run dry", > "completion queue is full", or "HW queue full" is returned to userland via > the poll() syscall. > > In this mode, Max suggests that sendto() will return an error if not all > packets in the Tx ring can be sent. Further, the kernel should be > kicked when items have been placed in the fill ring. > > Again, all good and valid points! > > I think we can address this with the upcoming busy-poll() support. In > the busy-poll mode (which will be a new AF_XDP bind option), the napi > will be executed in the poll() context. > > Ingress would be: > > 1. Userland puts buffers on the fill ring > 2. Call poll(), and from the poll context: > a. The fill ring is dequeued by the kernel > b. The kernel places the received buffer on the socket Rx ring > > If a. fails, poll() will return a POLLERR, and userland can act on it. And if b. fails, is there any notification planned, besides the counter? > Ditto for egress, and poll() will return a POLLERR if the completion > ring has fewer free entries than the Tx ring. > > So, we're addressing your concerns with the busy-poll mode, and leave > the throughput/non-busy-poll API as it is today. > > What do you think about that, Max? Would that be a path forward for > Mellanox -- i.e. 
implementing the busy-poll and the current API? I see, but the security implications remain. If you just provide a second mode that is secure, malware can still use the first one. It only makes sense if the first one is deprecated and removed, but that's not the case, as you say it aims for the maximum performance. The second mode still makes sense though - if the throughput is good enough for the given application, this mode is more error-proof and spares CPU cycles. So, I still suggest introducing a kernel-level switch to turn on XSK explicitly. Yes, it affects the existing setups (minimally, requiring them to put a line into sysctl.conf, or whatever), but the feature is still actively evolving, and the security hole should be patched, so my opinion is that we can afford it. Thanks again for the good and structured summary. > 3. Introduce an API to schedule a napi on a certain core > > I think this is outside the AF_XDP scope (given my points above). This > is mainly kernel internals, and I have no strong opinions/thoughts > here. As long as you guys are hacking AF_XDP, I'm happy. :-P > > Finally, yes, we need to work on the documentation! Patches are > welcome! ;-) > > Max, thanks for the input and for looking into this! Very much > appreciated! > > > Cheers, > Björn > > [1] https://lwn.net/Articles/686985/ ^ permalink raw reply [flat|nested] 12+ messages in thread
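If the switch Max proposes existed, enabling XSK support might look like a one-line sysctl.conf entry. The knob name below is entirely hypothetical - no such sysctl exists upstream; it only illustrates the opt-in model being suggested:

```
# Hypothetical, not an upstream sysctl: illustrates the proposed
# opt-in switch. AF_XDP socket creation would fail until set to 1.
net.core.xdp_sockets_enabled = 1
```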
* Re: AF_XDP design flaws 2019-03-07 15:09 ` Maxim Mikityanskiy @ 2019-03-07 16:51 ` Alexei Starovoitov 2019-03-07 17:52 ` Björn Töpel 1 sibling, 0 replies; 12+ messages in thread From: Alexei Starovoitov @ 2019-03-07 16:51 UTC (permalink / raw) To: Maxim Mikityanskiy Cc: Björn Töpel, Jonathan Lemon, John Fastabend, netdev, Björn Töpel, Magnus Karlsson, David S. Miller, Tariq Toukan, Saeed Mahameed, Eran Ben Elisha On Thu, Mar 07, 2019 at 03:09:33PM +0000, Maxim Mikityanskiy wrote: > > What I'm concerned of is a security hole this feature opens to the systems not using AF_XDP. could you please explain the 'security hole' again because I still don't see one. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: AF_XDP design flaws 2019-03-07 15:09 ` Maxim Mikityanskiy 2019-03-07 16:51 ` Alexei Starovoitov @ 2019-03-07 17:52 ` Björn Töpel 2019-03-21 15:46 ` Maxim Mikityanskiy 1 sibling, 1 reply; 12+ messages in thread From: Björn Töpel @ 2019-03-07 17:52 UTC (permalink / raw) To: Maxim Mikityanskiy Cc: Jonathan Lemon, John Fastabend, netdev, Björn Töpel, Magnus Karlsson, David S. Miller, Tariq Toukan, Saeed Mahameed, Eran Ben Elisha On Thu, 7 Mar 2019 at 16:09, Maxim Mikityanskiy <maximmi@mellanox.com> wrote: > [...] > > > Now, onto Max' concerns, from my perspective: > > > > 1. The kernel spins too much in napi mode. > > Not just too much, it will do it forever if the application simply doesn't act. > > > Yes, the i40e driver does spin for throughput and latency reasons. I > > agree that we should add a back-off mechanism. I would prefer *not* > > adding this to the AF_XDP uapi, but having it as a driver knob. > > A back-off won't be that good for latency, will it? > Correct, the "no buffer -> buffer" latency will increase. You can't have it all. :-) Further, once the fill ring runs dry, your latency is somewhat messed up... > Anyway, I'm fine with having aggressive polling to maximize > throughput in the circumstances where the user wants it and can > control the resource consumption. What I'm concerned about is a > security hole this feature opens to the systems not using > AF_XDP. I'd prefer having a kernel-level runtime switch that is off > by default. If one wants to use AF_XDP, they have to turn it on > first, and from then on it's up to them to be careful about which > applications they run and whether they trust them. And if no one > turns AF_XDP on, no application can abuse the API to DoS the kernel. > Ok, some observations/thoughts. The "DoS" you're referring to is that when the fill ring is empty, the i40e napi loop will report "not done" in order to enter the napi loop again and retry. 
Yes, this will keep the napi ksoftirqd busy and waste resources (again, in favor of latency). It will not DoS the kernel. When a firehose's worth of packets enters the driver (but is dropped in the stack), the napi poll will be busy as well. Not DoS:ed. Again, this behavior is for the i40e zero-copy implementation, which I've already addressed in the previous mail. For non-i40e AF_XDP ZC this is not the case. XDP sockets require CAP_NET_RAW to create the socket, and on top of that you need proper packet steering for zero-copy and an XDP program enabled. As John stated early on, if you don't trust it, you should contain/control it. > > Another idea would be to move to a napi-thread similar to what Paolo > > Abeni suggested in [1], and let the scheduler deal with the fairness > > issue. > > Sounds good; it looks like the impact of the spinning NAPI thread will > be the same as if the userspace application just started consuming > 100% CPU. Still, it can be used to overcome the maximum process > number set by ulimit and maximum CPU usage quotas. > Yes, but user space applications that spin can be handled by the kernel. :-P > > > 2. No/little error feedback to userland > > > > Max would like a mode where feedback like "fill ring has run dry", > > "completion queue is full", or "HW queue full" is returned to userland via > > the poll() syscall. > > > > In this mode, Max suggests that sendto() will return an error if not all > > packets in the Tx ring can be sent. Further, the kernel should be > > kicked when items have been placed in the fill ring. > > > > Again, all good and valid points! > > > > I think we can address this with the upcoming busy-poll() support. In > > the busy-poll mode (which will be a new AF_XDP bind option), the napi > > will be executed in the poll() context. > > > > Ingress would be: > > > > 1. Userland puts buffers on the fill ring > > 2. Call poll(), and from the poll context: > a. The fill ring is dequeued by the kernel > b. 
The kernel places the received buffer on the socket Rx ring > > > > If a. fails, poll() will return an POLLERR, and userland can act on it. > > And if b. fails, is there any notification planned, besides the counter? > No. We'd like to leave that to the user space application (e.g. by monitoring progress via the head/tail pointers and the counters). The explicit model you proposed (which I like!), is for the upcoming busy-poll()-mode. > > Dito for egress, and poll() will return an POLLERR if the completion > > ring has less than Tx ring entries. > > > > So, we're addressing your concerns with the busy-poll mode, and let > > the throughput/non-busy-poll API as it is today. > > > > What do you think about that, Max? Would that be a path forward for > > Mellanox -- i.e. implementing the busy-poll and the current API? > > > I see, but the security implications remain. If you just provide a > second mode that is secure, malware can still use the first one. It > only makes sense if the first one is deprecated and removed, but > it's not the case as you say it aims for the maximum > performance. The second mode still has sense though - if the > throughput is good enough for the given application, this mode is > more error-proof and spares CPU cycles. > I don't agree that there are "security implications". > So, I still suggest introducing a kernel-level switch to turn on XSK > explicitly. Yes, it affects the existing setups (minimally, > requiring them to put a line into sysctl.conf, or whatever), but the > feature is still actively evolving, and the security hole should be > patched, so my opinion is that we can afford it. > Again, I don't agree with you here; I don't see a "security hole" and don't see the need for a "switch". I'm still keeping my fingers crossed for a Mellanox AF_XDP ZC implementation for a "throughput-mode (the current)" and "busy-poll()-mode" (upcoming). And again, addressing the "napi wastes resources" detail in the i40e driver can be fixed. 
Maybe you guys take another path! ;-) Thanks for looking into this! Cheers, Björn ^ permalink raw reply [flat|nested] 12+ messages in thread
* RE: AF_XDP design flaws 2019-03-07 17:52 ` Björn Töpel @ 2019-03-21 15:46 ` Maxim Mikityanskiy 0 siblings, 0 replies; 12+ messages in thread From: Maxim Mikityanskiy @ 2019-03-21 15:46 UTC (permalink / raw) To: Björn Töpel Cc: Jonathan Lemon, John Fastabend, netdev, Björn Töpel, Magnus Karlsson, David S. Miller, Tariq Toukan, Saeed Mahameed, Eran Ben Elisha, Alexei Starovoitov Hi, sorry for taking long time to reply. > -----Original Message----- > From: Björn Töpel <bjorn.topel@gmail.com> > Sent: 7 March, 2019 19:52 > To: Maxim Mikityanskiy <maximmi@mellanox.com> > Cc: Jonathan Lemon <jonathan.lemon@gmail.com>; John Fastabend > <john.fastabend@gmail.com>; netdev@vger.kernel.org; Björn Töpel > <bjorn.topel@intel.com>; Magnus Karlsson <magnus.karlsson@intel.com>; David > S. Miller <davem@davemloft.net>; Tariq Toukan <tariqt@mellanox.com>; Saeed > Mahameed <saeedm@mellanox.com>; Eran Ben Elisha <eranbe@mellanox.com> > Subject: Re: AF_XDP design flaws > > On Thu, 7 Mar 2019 at 16:09, Maxim Mikityanskiy <maximmi@mellanox.com> > wrote: > > > [...] > > > > > Now, onto Max' concerns, from my perspective: > > > > > > 1. The kernel spins too much in napi mode. > > > > Not just too much, it will do it forever if the application simply doesn't > act. > > > > > Yes, the i40e driver does spin for throughput and latency reasons. I > > > agree that we should add a back-off mechanism. I would prefer *not* > > > adding this to the AF_XDP uapi, but having it as a driver knob. > > > > A back-off won't be that good for latency, will it? > > > > Correct, the "no buffer -> buffer" latency will increase. You can't > have it all. :-) Further, once the fill ring dry, your latency is > somewhat messed up... > > > Anyway, I'm fine with having aggressive polling to maximize > > throughput in the circumstances where the user wants it and can > > control the resource consumption. What I'm concerned of is a > > security hole this feature opens to the systems not using > > AF_XDP. 
I'd prefer having a kernel-level runtime switch that is off > > by default. If one wants to use AF_XDP, they have to turn it on > > first, and from then on it's up to them to be careful about which > > applications they run and whether they trust them. And if no one > > turns AF_XDP on, no application can abuse the API to DoS the kernel. > > > > Ok, some observations/thoughts. The "DoS" you're referring to is that > when the fill ring is empty, the i40e napi loop will report "not > done" in order to enter the napi loop again and retry. Yes, this will keep the > napi ksoftirqd busy and waste resources (again, in favor of > latency). It will not DoS the kernel. When a firehose's worth of packets > enters the driver (but is dropped in the stack), the napi poll will be > busy as well. Not DoS:ed. Well, a huge flow of ingress packets can also DoS a machine. When a lot of resources are wasted for something useless, legitimate applications lack these resources to run at full speed. So, if it's a huge load of packets making all CPUs spin at 100% spending a lot of time in NAPI, it's perfectly fine, as long as these packet streams are useful and handled by some application. However, if these packets are a flood, or if we waste CPU for no reason in NAPI, it's DoS. Let's distinguish between these two cases; they are not the same. > Again, this behavior is for the i40e zero-copy implementation, which > I've already addressed in the previous mail. For non-i40e AF_XDP ZC > this is not the case. > > XDP sockets require CAP_NET_RAW to create the socket, That sounds good! It minimizes the attack surface, making it much harder to exploit. Normal users can't create an AF_XDP socket, root can - but root has a million other ways to destroy the system. The only concern is that when root gives CAP_NET_RAW to a process running as a normal user, they may be unaware that they also give that process permission to spin all cores at 100%. 
> and on top of > that you need proper packet steering for zero-copy and an XDP program > enabled. Well, you don't need packet steering to misuse AF_XDP :). Only the XDP program is required. > As John stated early on, if you don't trust it, you should > contain/control it. I totally agree with that, but, as I said, it applies only if you want to run AF_XDP and if you are aware of possible issues. > > > Another idea would be to move to a napi-thread similar to what Paolo > > > Abeni suggested in [1], and let the scheduler deal with the fairness > > > issue. > > > > Sounds good, looks like the impact of the spinning NAPI thread will > > be the same as if the userspace application just started consuming > > 100% CPU. Still, it can be used to overcome the maximum process > > number set by ulimit and maximum CPU usage quotes. > > > > Yes, but user space applications that spins can be handled by the > kernel. :-P > > > > > > 2. No/little error feedback to userland > > > > > > Max would like a mode where feedback when "fill ring has run dry", > > > "completion queue is full", "HW queue full" returned to userland via > > > the poll() syscall. > > > > > > In this mode, Max suggests that sendto() will return error if not all > > > packets in the Tx ring can be sent. Further, the kernel should be > > > kicked when there has been items placed in the fill ring. > > > > > > Again, all good and valid points! > > > > > > I think we can address this with the upcoming busy-poll() support. In > > > the busy-poll mode (which will be a new AF_XDP bind option), the napi > > > will be executed in the poll() context. > > > > > > Ingress would be: > > > > > > 1. Userland puts buffers on the fill ring > > > 2. Call poll(), and from the poll context: > > > a. The fill ring is dequeued by the kernel > > > b. The kernel places the received buffer on the socket Rx ring > > > > > > If a. fails, poll() will return an POLLERR, and userland can act on it. > > > > And if b. 
fails, is there any notification planned, besides the counter? > > > > No. We'd like to leave that to the user space application (e.g. by > monitoring progress via the head/tail pointers and the counters). > > The explicit model you proposed (which I like!), is for the upcoming > busy-poll()-mode. > > > > Dito for egress, and poll() will return an POLLERR if the completion > > > ring has less than Tx ring entries. > > > > > > So, we're addressing your concerns with the busy-poll mode, and let > > > the throughput/non-busy-poll API as it is today. > > > > > > What do you think about that, Max? Would that be a path forward for > > > Mellanox -- i.e. implementing the busy-poll and the current API? > > > > > > I see, but the security implications remain. If you just provide a > > second mode that is secure, malware can still use the first one. It > > only makes sense if the first one is deprecated and removed, but > > it's not the case as you say it aims for the maximum > > performance. The second mode still has sense though - if the > > throughput is good enough for the given application, this mode is > > more error-proof and spares CPU cycles. > > > > I don't agree that there are "security implications". > > > So, I still suggest introducing a kernel-level switch to turn on XSK > > explicitly. Yes, it affects the existing setups (minimally, > > requiring them to put a line into sysctl.conf, or whatever), but the > > feature is still actively evolving, and the security hole should be > > patched, so my opinion is that we can afford it. > > > > Again, I don't agree with you here; I don't see a "security hole" and > don't see the need for a "switch". Summing up things said above, it doesn't look for me that severe any more. 
In fact, it's hard to exploit, as it makes sense only in a rare case of non-root process running with CAP_NET_RAW and CAP_NET_ADMIN, and the impact is not disastrous - according to my tests, even if all cores are spinning 100% in NAPI, the system is still responsive, although the packet rate suffers. Even though the attack surface and possible impact is minimal, I suggest not ignoring the issue. At least, the users of CAP_NET_RAW should be aware that a non-privileged application has a way of degrading the performance of the whole system. Unfortunately, I don't see any documentation on capabilities in the kernel sources - it could be a good place to put this notice if this documentation existed. Anyway, I still think that having AF_XDP disabled by default is better than surprising the user when they suddenly find out that their sandboxed unprivileged CAP_NET_RAW application ate all CPU and network throughput. Thanks, Max. > I'm still keeping my fingers crossed for a Mellanox AF_XDP ZC > implementation for a "throughput-mode (the current)" and > "busy-poll()-mode" (upcoming). And again, addressing the "napi wastes > resources" detail in the i40e driver can be fixed. Maybe you guys take > another path! ;-) > > Thanks for looking into this! > > Cheers, > Björn ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: AF_XDP design flaws 2019-02-27 21:03 ` Jonathan Lemon 2019-02-28 10:49 ` Maxim Mikityanskiy @ 2019-02-28 14:23 ` Magnus Karlsson 1 sibling, 0 replies; 12+ messages in thread From: Magnus Karlsson @ 2019-02-28 14:23 UTC (permalink / raw) To: Jonathan Lemon Cc: Maxim Mikityanskiy, Björn Töpel, John Fastabend, Network Development, Björn Töpel, Magnus Karlsson, David S. Miller, Tariq Toukan, Saeed Mahameed, Eran Ben Elisha On Wed, Feb 27, 2019 at 10:04 PM Jonathan Lemon <jonathan.lemon@gmail.com> wrote: > > On 27 Feb 2019, at 11:17, Maxim Mikityanskiy wrote: > > >> -----Original Message----- > >> From: Björn Töpel <bjorn.topel@gmail.com> > >> Sent: 27 February, 2019 10:09 > >> To: John Fastabend <john.fastabend@gmail.com> > >> Cc: Maxim Mikityanskiy <maximmi@mellanox.com>; > >> netdev@vger.kernel.org; Björn > >> Töpel <bjorn.topel@intel.com>; Magnus Karlsson > >> <magnus.karlsson@intel.com>; > >> David S. Miller <davem@davemloft.net>; Tariq Toukan > >> <tariqt@mellanox.com>; > >> Saeed Mahameed <saeedm@mellanox.com>; Eran Ben Elisha > >> <eranbe@mellanox.com> > >> Subject: Re: AF_XDP design flaws > >> > >> On 2019-02-26 17:41, John Fastabend wrote: > >>> On 2/26/19 6:49 AM, Maxim Mikityanskiy wrote: > >>>> Hi everyone, > >>>> > >> > >> Hi! It's exciting to see more vendors looking into AF_XDP. ARM was > >> involved early on, and now Mellanox. :-) It's really important that > >> the AF_XDP model is a good fit for all drivers/vendors! Thanks for > >> looking into this. > >> > >>>> I would like to discuss some design flaws of AF_XDP socket (XSK) > >> implementation > >>>> in kernel. At the moment I don't see a way to work around them > >>>> without > >> changing > >>>> the API, so I would like to make sure that I'm not missing anything > >>>> and > >> to > >>>> suggest and discuss some possible improvements that can be made. 
> >>>> > >>>> The issues I describe below are caused by the fact that the driver > >> depends on > >>>> the application doing some things, and if the application is > >>>> slow/buggy/malicious, the driver is forced to busy poll because of > >>>> the > >> lack of a > >>>> notification mechanism from the application side. I will refer to > >>>> the > >> i40e > >>>> driver implementation a lot, as it is the first implementation of > >>>> AF_XDP, > >> but > >>>> the issues are general and affect any driver. I already considered > >>>> trying > >> to fix > >>>> it on driver level, but it doesn't seem possible, so it looks like > >>>> the > >> behavior > >>>> and implementation of AF_XDP in the kernel has to be changed. > >>>> > >>>> RX side busy polling > >>>> ==================== > >>>> > >>>> On the RX side, the driver expects the application to put some > >> descriptors in > >>>> the Fill Ring. There is no way for the application to notify the > >>>> driver > >> that > >>>> there are more Fill Ring descriptors to take, so the driver is > >>>> forced to > >> busy > >>>> poll the Fill Ring if it gets empty. E.g., the i40e driver does it > >>>> in > >> NAPI poll: > >>>> > >>>> int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget) > >>>> { > >>>> ... > >>>> failure = failure || > >>>> > >> !i40e_alloc_rx_buffers_fast_zc(rx_ring, > >>>> > >> cleaned_count); > >>>> ... > >>>> return failure ? budget : (int)total_rx_packets; > >>>> } > >>>> > >>>> Basically, it means that if there are no descriptors in the Fill > >>>> Ring, > >> NAPI will > >>>> never stop, draining CPU. > >>>> > >>>> Possible cases when it happens > >>>> ------------------------------ > >>>> > >>>> 1. The application is slow, it received some frames in the RX Ring, > >>>> and > >> it is > >>>> still handling the data, so it has no free frames to put to the > >>>> Fill > >> Ring. > >>>> > >>>> 2. The application is malicious, it opens an XSK and puts no frames > >>>> to > >> the Fill > >>>> Ring. 
It can be used as a local DoS attack. > >>>> > >>>> 3. The application is buggy and stops filling the Fill Ring for > >>>> whatever > >> reason > >>>> (deadlock, waiting for another blocking operation, other bugs). > >>>> > >>>> Although loading an XDP program requires root access, the DoS > >>>> attack can > >> be > >>>> targeted to setups that already use XDP, i.e. an XDP program is > >>>> already > >> loaded. > >>>> Even under root, userspace applications should not be able to > >>>> disrupt > >> system > >>>> stability by just calling normal APIs without an intention to > >>>> destroy the > >>>> system, and here it happens in case 1. > >>> I believe this is by design. If packets per second is your top > >>> priority > >>> (at the expense of power, cpu, etc.) this is the behavior you might > >>> want. To resolve your points if you don't trust the application it > >>> should be isolated to a queue pair and cores so the impact is known > >>> and > >>> managed. > >>> > >> > >> Correct, and I agree with John here. AF_XDP is a raw interface, and > >> needs to be managed. > > > > OK, right, if you want high pps sacrificing system stability and > > kernel-userspace isolation, XSK is at your service. It may be a valid > > point. However, there is still the issue of unintended use. > > I believe this is a double-edged sword - on one hand, it appears AF_XDP > is > aimed as a competitor to DPDK, and is focused only on raw packet speed, > at > the expense of usability. This isn't necessarily bad, but in my > experience, > if things aren't simple to use, then they tend not to get widely used. Yes, we need to find the right balance between usability and performance. IMO, it is too hard to use at the moment but there are plenty of things we could do first to make it easier to use that do not impact performance (documentation, libbpf helper functions, examples, etc.). Let us start there and see where that takes us. 
Driver support for XDP and AF_XDP ZC should also be simplified to facilitate adoption and force drivers to behave in the right way (and this also needs to be defined as you both point out). > > If you use XSK, you can take measures to prevent the disruption, e.g. > > run only trusted and properly tested applications or isolate them. > > However, there is a case where XSKs are not really used on a given > > machine, but are still available (CONFIG_XDP_SOCKETS=y), opening a > > huge > > hole. If you don't even know that XSK exists, you are still > > vulnerable. > > > >>> That said having a flag to back-off seems like a good idea. But > >>> should > >>> be doable as a feature without breaking current API. > > > > A flag can be added, e.g., to the sxdp_flags field of struct > > sockaddr_xdp, to switch to the behavior without busy polling on empty > > Fill Ring. > > > > However, I don't think the vulnerability can be eliminated while > > keeping > > the API intact. To protect the systems that don't use XSK, we need one > > of these things: > > > > - Disable XSK support by default, until explicitly enabled by root. > > > > - Disable the old behavior by default, until explicitly enabled by > > root. > > > > Both of those things will require additional setup on machines that > > use > > XSK, after upgrading the kernel. > > > >>>> Possible way to solve the issue > >>>> ------------------------------- > >>>> > >>>> When the driver can't take new Fill Ring frames, it shouldn't busy > >>>> poll. > >>>> Instead, it signals the failure to the application (e.g., with > >>>> POLLERR), > >> and > >>>> after that it's up to the application to restart polling (e.g., by > >> calling > >>>> sendto()) after refilling the Fill Ring. The issue with this > >>>> approach is > >> that it > >>>> changes the API, so we either have to deal with it or to introduce > >>>> some > >> API > >>>> version field. > >>> See above. I like the idea here though. 
> >>> > >> The uapi doesn't mandate *how* the driver should behave. The rationale > >> for the aggressive i40e fill ring polling (when the ring is empty) is > >> to minimize the latency. The "fill ring is empty" behavior might be > >> different on different drivers. > > > > If the behavior is different with different drivers, it will be > > difficult to create applications that are portable across NICs. The > > point of creating APIs is to provide a single interface to different > > implementations, thus hiding the differences from a higher abstraction > > level. What the driver does internally is up to the driver, but the > > result visible in the userspace should be the same, so we can't just > > make some drivers stop and wait when the Fill Ring is empty, while the > > others busy poll. It should be controlled by some flags if we want to > > preserve both options. > > Or if the driver can't obtain new frames from the fill ring, it just > drops the incoming packet, and recycles the old frame in order to keep > the ring full. (This also assumes that the driver was able to initially > populate the ring in the first place, which eliminates one DoS attack of > never putting anything in the fill ring initially). > > I do agree that issues cited in this thread can be resolved with the > use of flags to control the different behaviors - there just needs to > be some agreement on what those behaviors are. Will comment on Maxim's suggestions in a separate mail when I am back from vacation and have digested it all. Agree that we need to fix the DoS opportunity and that we need to make it easier for drivers such as Mellanox to implement this. Thanks for looking into this. Highly appreciated. /Magnus > -- > Jonathan ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2019-03-21 15:46 UTC | newest] Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-02-26 14:49 AF_XDP design flaws Maxim Mikityanskiy 2019-02-26 16:41 ` John Fastabend 2019-02-27 8:08 ` Björn Töpel 2019-02-27 19:17 ` Maxim Mikityanskiy 2019-02-27 21:03 ` Jonathan Lemon 2019-02-28 10:49 ` Maxim Mikityanskiy 2019-03-05 18:26 ` Björn Töpel 2019-03-07 15:09 ` Maxim Mikityanskiy 2019-03-07 16:51 ` Alexei Starovoitov 2019-03-07 17:52 ` Björn Töpel 2019-03-21 15:46 ` Maxim Mikityanskiy 2019-02-28 14:23 ` Magnus Karlsson