From mboxrd@z Thu Jan 1 00:00:00 1970
From: Paul Menzel
Date: Mon, 20 Sep 2021 08:46:52 +0200
Subject: [Intel-wired-lan] [PATCH net-next v3 1/3] ice: update dim usage and moderation
In-Reply-To: <20210824011259.738307-2-jesse.brandeburg@intel.com>
References: <20210824011259.738307-1-jesse.brandeburg@intel.com> <20210824011259.738307-2-jesse.brandeburg@intel.com>
Message-ID: 
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: intel-wired-lan@osuosl.org
List-ID: 

Dear Jesse,


On 24.08.21 03:12, Jesse Brandeburg wrote:
> The driver was having trouble with unreliable latency when doing single
> threaded ping-pong tests. This was root caused to the DIM algorithm
> landing on a too slow interrupt value, which caused high latency, and it
> was especially present when queues were being switched frequently by the
> scheduler as happens on default setups today.
>
> In attempting to improve this, we allow the upper rate limit for
> interrupts to move to rate limit of 4 microseconds as a max, which means
> that no vector can generate more than 250,000 interrupts per second. The
> old config was up to 100,000. The driver previously tried to program the
> rate limit too frequently and if the receive and transmit side were both
> active on the same vector, the INTRL would be set incorrectly, and this
> change fixes that issue as a side effect of the redesign.
>
> This driver will operate from now on with a slightly changed DIM table
> with more emphasis towards latency sensitivity by having more table
> entries with lower latency than with high latency (high being >= 64
> microseconds).
>
> The driver also resets the DIM algorithm state with a new stats set when
> there is no work done and the data becomes stale (older than 1 second),
> for the respective receive or transmit portion of the interrupt.
>
> Add a new helper for setting rate limit, which will be used more
> in a followup patch.

Nice work. Could you please add how to run the ping-pong test, so others
can verify the behavior?

> Signed-off-by: Jesse Brandeburg
>
> ---
> v3: merged on top of tx_ring/rx_ring split patch series
> v2: original version
> ---
>  drivers/net/ethernet/intel/ice/ice_lib.c  |  25 +++++
>  drivers/net/ethernet/intel/ice/ice_lib.h  |   1 +
>  drivers/net/ethernet/intel/ice/ice_main.c | 118 ++++++++++++----
>  drivers/net/ethernet/intel/ice/ice_txrx.c |  84 ++++++++-------
>  4 files changed, 141 insertions(+), 87 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c
> index f9f060da3f04..28240b9f83e5 100644
> --- a/drivers/net/ethernet/intel/ice/ice_lib.c
> +++ b/drivers/net/ethernet/intel/ice/ice_lib.c
> @@ -1941,6 +1941,31 @@ void ice_write_itr(struct ice_ring_container *rc, u16 itr)
>  	__ice_write_itr(q_vector, rc, itr);
>  }
>
> +/**
> + * ice_set_q_vector_intrl - set up interrupt rate limiting
> + * @q_vector: the vector to be configured
> + *
> + * Interrupt rate limiting is local to the vector, not per-queue so we must
> + * detect if either ring container has dynamic enabled to decide what to set

Dynamic moderation?

> + * the interrupt rate limit to via INTRL settings. In the case of dynamic
> + * disabled on both, write the value with the cached setting to make sure

Ditto.

> + * INTRL matches user visible value.
> + */
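(A side note for other readers on the arithmetic behind this cap, since it
is only implicit in the commit message: an INTRL value of N microseconds
allows at most one interrupt per N us window. The helper below is purely
illustrative and the name is mine, not a driver function.)

    /* Illustration only: how an INTRL value in microseconds bounds the
     * interrupt rate; intrl_to_max_ints() is a hypothetical name.
     */
    static inline unsigned int intrl_to_max_ints(unsigned int intrl_usecs)
    {
        /* at most one interrupt per intrl_usecs window */
        return 1000000 / intrl_usecs;   /* 4 -> 250,000; 10 -> 100,000 */
    }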
> +void ice_set_q_vector_intrl(struct ice_q_vector *q_vector)
> +{
> +	if (ITR_IS_DYNAMIC(&q_vector->tx) || ITR_IS_DYNAMIC(&q_vector->rx)) {
> +		/* in the case of dynamic enabled, cap each vector to no more
> +		 * than (4 us) 250,000 ints/sec, which allows low latency
> +		 * but still less than 500,000 interrupts per second, which
> +		 * reduces CPU a bit in the case of the lowest latency
> +		 * setting. The 4 here is a value in microseconds.
> +		 */
> +		ice_write_intrl(q_vector, 4);
> +	} else {
> +		ice_write_intrl(q_vector, q_vector->intrl);
> +	}
> +}
> +
>  /**
>   * ice_vsi_cfg_msix - MSIX mode Interrupt Config in the HW
>   * @vsi: the VSI being configured
> diff --git a/drivers/net/ethernet/intel/ice/ice_lib.h b/drivers/net/ethernet/intel/ice/ice_lib.h
> index f446c5071478..989c4241676a 100644
> --- a/drivers/net/ethernet/intel/ice/ice_lib.h
> +++ b/drivers/net/ethernet/intel/ice/ice_lib.h
> @@ -103,6 +103,7 @@ int ice_status_to_errno(enum ice_status err);
>
>  void ice_write_intrl(struct ice_q_vector *q_vector, u8 intrl);
>  void ice_write_itr(struct ice_ring_container *rc, u16 itr);
> +void ice_set_q_vector_intrl(struct ice_q_vector *q_vector);
>
>  enum ice_status
>  ice_vsi_cfg_mac_fltr(struct ice_vsi *vsi, const u8 *macaddr, bool set);
> diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
> index 10c6287aa5b2..7749bb0945bc 100644
> --- a/drivers/net/ethernet/intel/ice/ice_main.c
> +++ b/drivers/net/ethernet/intel/ice/ice_main.c
> @@ -5568,77 +5568,59 @@ int ice_vsi_cfg(struct ice_vsi *vsi)
>  }
>
>  /* THEORY OF MODERATION:
> - * The below code creates custom DIM profiles for use by this driver, because
> - * the ice driver hardware works differently than the hardware that DIMLIB was
> + * The ice driver hardware works differently than the hardware that DIMLIB was
>   * originally made for. ice hardware doesn't have packet count limits that
>   * can trigger an interrupt, but it *does* have interrupt rate limit support,
> - * and this code adds that capability to be used by the driver when it's using
> - * DIMLIB. The DIMLIB code was always designed to be a suggestion to the driver
> - * for how to "respond" to traffic and interrupts, so this driver uses a
> - * slightly different set of moderation parameters to get best performance.
> + * which is hard-coded to limit to 250,000 ints/second.

Maybe: hard-coded to a limit of 250,000 ints/second?

> + * If not using dynamic moderation, the INTRL value can be modified
> + * by ethtool rx-usecs-high.
>   */
>  struct ice_dim {
>  	/* the throttle rate for interrupts, basically worst case delay before
>  	 * an initial interrupt fires, value is stored in microseconds.
>  	 */
>  	u16 itr;
> -	/* the rate limit for interrupts, which can cap a delay from a small
> -	 * ITR at a certain amount of interrupts per second. f.e. a 2us ITR
> -	 * could yield as much as 500,000 interrupts per second, but with a
> -	 * 10us rate limit, it limits to 100,000 interrupts per second. Value
> -	 * is stored in microseconds.
> -	 */
> -	u16 intrl;
>  };
>
>  /* Make a different profile for Rx that doesn't allow quite so aggressive
> - * moderation at the high end (it maxes out at 128us or about 8k interrupts a
> - * second. The INTRL/rate parameters here are only useful to cap small ITR
> - * values, which is why for larger ITR's - like 128, which can only generate
> - * 8k interrupts per second, there is no point to rate limit and the values
> - * are set to zero. The rate limit values do affect latency, and so must
> - * be reasonably small so to not impact latency sensitive tests.
> + * moderation at the high end (it maxxes out at 126us or about 8k interrupts a

maxes?

> + * second.
>   */
>  static const struct ice_dim rx_profile[] = {
> -	{2, 10},
> -	{8, 16},
> -	{32, 0},
> -	{96, 0},
> -	{128, 0}
> +	{2},	/* 500,000 ints/s, capped at 250K by INTRL */
> +	{8},	/* 125,000 ints/s */
> +	{16},	/* 62,500 ints/s */
> +	{62},	/* 16,129 ints/s */
> +	{126}	/* 7,936 ints/s */
>  };
>
>  /* The transmit profile, which has the same sorts of values
>   * as the previous struct
>   */
>  static const struct ice_dim tx_profile[] = {
> -	{2, 10},
> -	{8, 16},
> -	{64, 0},
> -	{128, 0},
> -	{256, 0}
> +	{2},	/* 500,000 ints/s, capped at 250K by INTRL */
> +	{8},	/* 125,000 ints/s */
> +	{40},	/* 16,125 ints/s */
> +	{128},	/* 7,812 ints/s */
> +	{256}	/* 3,906 ints/s */
>  };
>
>  static void ice_tx_dim_work(struct work_struct *work)
>  {
>  	struct ice_ring_container *rc;
> -	struct ice_q_vector *q_vector;
>  	struct dim *dim;
> -	u16 itr, intrl;
> +	u16 itr;
>
>  	dim = container_of(work, struct dim, work);
> -	rc = container_of(dim, struct ice_ring_container, dim);
> -	q_vector = container_of(rc, struct ice_q_vector, tx);
> +	rc = (struct ice_ring_container *)dim->priv;
>
> -	if (dim->profile_ix >= ARRAY_SIZE(tx_profile))
> -		dim->profile_ix = ARRAY_SIZE(tx_profile) - 1;
> +	WARN_ON(dim->profile_ix >= ARRAY_SIZE(tx_profile));
>
>  	/* look up the values in our local table */
>  	itr = tx_profile[dim->profile_ix].itr;
> -	intrl = tx_profile[dim->profile_ix].intrl;
>
> -	ice_trace(tx_dim_work, q_vector, dim);
> +	ice_trace(tx_dim_work, container_of(rc, struct ice_q_vector, tx), dim);
>  	ice_write_itr(rc, itr);
> -	ice_write_intrl(q_vector, intrl);
>
>  	dim->state = DIM_START_MEASURE;
>  }
> @@ -5646,28 +5628,64 @@ static void ice_tx_dim_work(struct work_struct *work)
>  static void ice_rx_dim_work(struct work_struct *work)
>  {
>  	struct ice_ring_container *rc;
> -	struct ice_q_vector *q_vector;
>  	struct dim *dim;
> -	u16 itr, intrl;
> +	u16 itr;
>
>  	dim = container_of(work, struct dim, work);
> -	rc = container_of(dim, struct ice_ring_container, dim);
> -	q_vector = container_of(rc, struct ice_q_vector, rx);
> +	rc = (struct ice_ring_container *)dim->priv;
>
> -	if (dim->profile_ix >= ARRAY_SIZE(rx_profile))
> -		dim->profile_ix = ARRAY_SIZE(rx_profile) - 1;
> +	WARN_ON(dim->profile_ix >= ARRAY_SIZE(rx_profile));
>
>  	/* look up the values in our local table */
>  	itr = rx_profile[dim->profile_ix].itr;
> -	intrl = rx_profile[dim->profile_ix].intrl;
>
> -	ice_trace(rx_dim_work, q_vector, dim);
> +	ice_trace(rx_dim_work, container_of(rc, struct ice_q_vector, rx), dim);
>  	ice_write_itr(rc, itr);
> -	ice_write_intrl(q_vector, intrl);
>
>  	dim->state = DIM_START_MEASURE;
>  }
>
> +#define ICE_DIM_DEFAULT_PROFILE_IX 1
> +
> +/**
> + * ice_init_moderation - set up interrupt moderation
> + * @q_vector: the vector containing rings to be configured
> + *
> + * Set up interrupt moderation registers, with the intent to do the right thing
> + * when called from reset or from probe, and whether or not dynamic moderation
> + * is enabled or not. Take special care to write all the registers in both
> + * dynamic mode or not in order to make sure hardware is in a known state.

moderation?
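(Also, for readers not familiar with DIMLIB, my reading of the contract
being set up here is roughly the sketch below. Only INIT_WORK(), net_dim()
and the struct dim fields shown are the real API from include/linux/dim.h;
my_dim_work and rc are placeholders of mine.)

    struct dim *dim = &rc->dim;

    INIT_WORK(&dim->work, my_dim_work);  /* my_dim_work: hypothetical callback */
    dim->mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
    dim->profile_ix = ICE_DIM_DEFAULT_PROFILE_IX;  /* start from a middle entry */
    dim->priv = rc;                      /* recovered in the callback via dim->priv */

    /* later, from NAPI context: feed a sample; DIMLIB may queue dim->work
     * with an updated profile_ix, which the callback turns into an ITR write
     */
    net_dim(dim, sample);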
> + */
> +static void ice_init_moderation(struct ice_q_vector *q_vector)
> +{
> +	struct ice_ring_container *rc;
> +	bool tx_dynamic, rx_dynamic;
> +
> +	rc = &q_vector->tx;
> +	INIT_WORK(&rc->dim.work, ice_tx_dim_work);
> +	rc->dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
> +	rc->dim.profile_ix = ICE_DIM_DEFAULT_PROFILE_IX;
> +	rc->dim.priv = rc;
> +	tx_dynamic = ITR_IS_DYNAMIC(rc);
> +
> +	/* set the initial TX ITR to match the above */
> +	ice_write_itr(rc, tx_dynamic ?
> +		      tx_profile[rc->dim.profile_ix].itr : rc->itr_setting);
> +
> +	rc = &q_vector->rx;
> +	INIT_WORK(&rc->dim.work, ice_rx_dim_work);
> +	rc->dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
> +	rc->dim.profile_ix = ICE_DIM_DEFAULT_PROFILE_IX;
> +	rc->dim.priv = rc;
> +	rx_dynamic = ITR_IS_DYNAMIC(rc);
> +
> +	/* set the initial RX ITR to match the above */
> +	ice_write_itr(rc, rx_dynamic ? rx_profile[rc->dim.profile_ix].itr :
> +		      rc->itr_setting);
> +
> +	ice_set_q_vector_intrl(q_vector);
> +}
> +
>  /**
>   * ice_napi_enable_all - Enable NAPI for all q_vectors in the VSI
>   * @vsi: the VSI being configured
> @@ -5682,11 +5700,7 @@ static void ice_napi_enable_all(struct ice_vsi *vsi)
>  	ice_for_each_q_vector(vsi, q_idx) {
>  		struct ice_q_vector *q_vector = vsi->q_vectors[q_idx];
>
> -		INIT_WORK(&q_vector->tx.dim.work, ice_tx_dim_work);
> -		q_vector->tx.dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
> -
> -		INIT_WORK(&q_vector->rx.dim.work, ice_rx_dim_work);
> -		q_vector->rx.dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
> +		ice_init_moderation(q_vector);
>
>  		if (q_vector->rx.rx_ring || q_vector->tx.tx_ring)
>  			napi_enable(&q_vector->napi);
> diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
> index 01ae331927bd..33573774647f 100644
> --- a/drivers/net/ethernet/intel/ice/ice_txrx.c
> +++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
> @@ -1259,6 +1259,41 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
>  	return failure ? budget : (int)total_rx_pkts;
>  }
>
> +static void __ice_update_sample(struct ice_q_vector *q_vector,
> +				struct ice_ring_container *rc,
> +				struct dim_sample *sample,
> +				bool is_tx)
> +{
> +	u64 packets = 0, bytes = 0;
> +
> +	if (is_tx) {
> +		struct ice_tx_ring *tx_ring;
> +
> +		ice_for_each_tx_ring(tx_ring, *rc) {
> +			packets += tx_ring->stats.pkts;
> +			bytes += tx_ring->stats.bytes;
> +		}
> +	} else {
> +		struct ice_rx_ring *rx_ring;
> +
> +		ice_for_each_rx_ring(rx_ring, *rc) {
> +			packets += rx_ring->stats.pkts;
> +			bytes += rx_ring->stats.bytes;
> +		}
> +	}
> +
> +	dim_update_sample(q_vector->total_events, packets, bytes, sample);
> +	sample->comp_ctr = 0;
> +
> +	/* if dim settings get stale, like when not updated for 1
> +	 * second or longer, force it to start again. This addresses the
> +	 * frequent case of an idle queue being switched to by the
> +	 * scheduler. The 1,000 here means 1,000 milliseconds.
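(Side note: dim_update_sample() just timestamps the sample and stores the
raw counters, paraphrased below from my reading of include/linux/dim.h,
which is why sample->time can be compared against dim.start_sample.time
for the staleness check, and why clearing comp_ctr afterwards is enough.)

    /* paraphrase of the inline helper in include/linux/dim.h */
    static inline void
    dim_update_sample(u16 event_ctr, u64 packets, u64 bytes, struct dim_sample *s)
    {
        s->time      = ktime_get();   /* timestamp used for the 1 s check */
        s->pkt_ctr   = packets;       /* raw counters, diffed by DIMLIB */
        s->byte_ctr  = bytes;
        s->event_ctr = event_ctr;
    }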
> +	 */
> +	if (ktime_ms_delta(sample->time, rc->dim.start_sample.time) >= 1000)
> +		rc->dim.state = DIM_START_MEASURE;
> +}
> +
>  /**
>   * ice_net_dim - Update net DIM algorithm
>   * @q_vector: the vector associated with the interrupt
> @@ -1274,34 +1309,16 @@ static void ice_net_dim(struct ice_q_vector *q_vector)
>  	struct ice_ring_container *rx = &q_vector->rx;
>
>  	if (ITR_IS_DYNAMIC(tx)) {
> -		struct dim_sample dim_sample = {};
> -		u64 packets = 0, bytes = 0;
> -		struct ice_tx_ring *ring;
> -
> -		ice_for_each_tx_ring(ring, q_vector->tx) {
> -			packets += ring->stats.pkts;
> -			bytes += ring->stats.bytes;
> -		}
> -
> -		dim_update_sample(q_vector->total_events, packets, bytes,
> -				  &dim_sample);
> +		struct dim_sample dim_sample;
>
> +		__ice_update_sample(q_vector, tx, &dim_sample, true);
>  		net_dim(&tx->dim, dim_sample);
>  	}
>
>  	if (ITR_IS_DYNAMIC(rx)) {
> -		struct dim_sample dim_sample = {};
> -		u64 packets = 0, bytes = 0;
> -		struct ice_rx_ring *ring;
> -
> -		ice_for_each_rx_ring(ring, q_vector->rx) {
> -			packets += ring->stats.pkts;
> -			bytes += ring->stats.bytes;
> -		}
> -
> -		dim_update_sample(q_vector->total_events, packets, bytes,
> -				  &dim_sample);
> +		struct dim_sample dim_sample;
>
> +		__ice_update_sample(q_vector, rx, &dim_sample, false);
>  		net_dim(&rx->dim, dim_sample);
>  	}
>  }
> @@ -1328,15 +1345,14 @@ static u32 ice_buildreg_itr(u16 itr_idx, u16 itr)
>  }
>
>  /**
> - * ice_update_ena_itr - Update ITR moderation and re-enable MSI-X interrupt
> + * ice_enable_interrupt - re-enable MSI-X interrupt
>   * @q_vector: the vector associated with the interrupt to enable
>   *
> - * Update the net_dim() algorithm and re-enable the interrupt associated with
> - * this vector.
> - *
> - * If the VSI is down, the interrupt will not be re-enabled.
> + * If the VSI is down, the interrupt will not be re-enabled. Also,
> + * when enabling the interrupt always reset the wb_on_itr to false
> + * and trigger a software interrupt to clean out internal state.

Is the internal state cleaning required in the spec?

>   */
> -static void ice_update_ena_itr(struct ice_q_vector *q_vector)
> +static void ice_enable_interrupt(struct ice_q_vector *q_vector)
>  {
>  	struct ice_vsi *vsi = q_vector->vsi;
>  	bool wb_en = q_vector->wb_on_itr;
> @@ -1351,10 +1367,6 @@ static void ice_update_ena_itr(struct ice_q_vector *q_vector)
>  	if (wb_en)
>  		q_vector->wb_on_itr = false;
>
> -	/* This will do nothing if dynamic updates are not enabled. */
> -	ice_net_dim(q_vector);
> -
> -	/* net_dim() updates ITR out-of-band using a work item */
>  	itr_val = ice_buildreg_itr(ICE_ITR_NONE, 0);
>  	/* trigger an immediate software interrupt when exiting
>  	 * busy poll, to make sure to catch any pending cleanups
> @@ -1482,10 +1494,12 @@ int ice_napi_poll(struct napi_struct *napi, int budget)
>  	/* Exit the polling mode, but don't re-enable interrupts if stack might
>  	 * poll us due to busy-polling
>  	 */
> -	if (likely(napi_complete_done(napi, work_done)))
> -		ice_update_ena_itr(q_vector);
> -	else
> +	if (likely(napi_complete_done(napi, work_done))) {
> +		ice_net_dim(q_vector);
> +		ice_enable_interrupt(q_vector);
> +	} else {
>  		ice_set_wb_on_itr(q_vector);
> +	}
>
>  	return min_t(int, work_done, budget - 1);
>  }


Kind regards,

Paul
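PS: Regarding my question above about the ping-pong test: in case a
reference is useful, something like the minimal single-threaded UDP
round-trip loop below is what I would use to measure the latency. This is
a sketch of mine assuming an echo service on port 7 at the peer; it is
not from your patch or from any Intel test suite.

    /* pingpong.c - minimal UDP ping-pong latency probe (illustration).
     * Build: cc -O2 -o pingpong pingpong.c
     * Run:   ./pingpong <server-ip>
     * Prints one round-trip time in nanoseconds per line.
     */
    #include <arpa/inet.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <time.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        struct sockaddr_in peer = { .sin_family = AF_INET,
                                    .sin_port = htons(7) }; /* echo port */
        char buf[64] = "ping";
        struct timespec t0, t1;
        int i, s;

        if (argc < 2 || inet_pton(AF_INET, argv[1], &peer.sin_addr) != 1) {
            fprintf(stderr, "usage: %s <server-ip>\n", argv[0]);
            return 1;
        }
        s = socket(AF_INET, SOCK_DGRAM, 0);
        if (s < 0 || connect(s, (struct sockaddr *)&peer, sizeof(peer)))
            return 1;

        for (i = 0; i < 100000; i++) {  /* 100k round trips */
            clock_gettime(CLOCK_MONOTONIC, &t0);
            if (send(s, buf, sizeof(buf), 0) < 0 ||
                recv(s, buf, sizeof(buf), 0) < 0)
                return 1;
            clock_gettime(CLOCK_MONOTONIC, &t1);
            printf("%lld\n",
                   (long long)(t1.tv_sec - t0.tv_sec) * 1000000000LL +
                   (t1.tv_nsec - t0.tv_nsec));
        }
        close(s);
        return 0;
    }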