linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/2] [net-next] packet: clarify timestamp overflow
@ 2017-11-27 16:19 Arnd Bergmann
  2017-11-27 16:19 ` [PATCH 2/2] [RFC] packet: experimental support for 64-bit timestamps Arnd Bergmann
  0 siblings, 1 reply; 24+ messages in thread
From: Arnd Bergmann @ 2017-11-27 16:19 UTC (permalink / raw)
  To: David S. Miller
  Cc: y2038, Arnd Bergmann, Eric Dumazet, Willem de Bruijn, Rosen,
	Rami, Andrey Konovalov, Reshetova, Elena, Mike Maloney,
	Sowmini Varadhan, netdev, linux-kernel

The memory mapped packet socket data structure in version 1 through 3
all contain 32-bit second values for the packet time stamps, which makes
them suffer from the overflow of time_t in y2038 or y2106 (depending
on whether user space interprets the value as signed or unsigned).

The implementation uses the deprecated getnstimeofday() function.

In order to get rid of that, this changes the code to use
ktime_get_real_ts64() as a replacement, documenting the nature of the
overflow. As long as the user applications treat the timestamps as
unsigned, or only use the difference between timestamps, they are
fine, and changing the timestamps to 64-bit wouldn't require a more
invasive user space API change.

Note: a lot of other APIs suffer from incompatible structures when
time_t gets redefined to 64-bit in 32-bit user space, but this one
does not.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 net/packet/af_packet.c | 27 +++++++++++++++++----------
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 737092ca9b4e..7432c6699818 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -439,17 +439,17 @@ static int __packet_get_status(struct packet_sock *po, void *frame)
 	}
 }
 
-static __u32 tpacket_get_timestamp(struct sk_buff *skb, struct timespec *ts,
+static __u32 tpacket_get_timestamp(struct sk_buff *skb, struct timespec64 *ts,
 				   unsigned int flags)
 {
 	struct skb_shared_hwtstamps *shhwtstamps = skb_hwtstamps(skb);
 
 	if (shhwtstamps &&
 	    (flags & SOF_TIMESTAMPING_RAW_HARDWARE) &&
-	    ktime_to_timespec_cond(shhwtstamps->hwtstamp, ts))
+	    ktime_to_timespec64_cond(shhwtstamps->hwtstamp, ts))
 		return TP_STATUS_TS_RAW_HARDWARE;
 
-	if (ktime_to_timespec_cond(skb->tstamp, ts))
+	if (ktime_to_timespec64_cond(skb->tstamp, ts))
 		return TP_STATUS_TS_SOFTWARE;
 
 	return 0;
@@ -459,13 +459,20 @@ static __u32 __packet_set_timestamp(struct packet_sock *po, void *frame,
 				    struct sk_buff *skb)
 {
 	union tpacket_uhdr h;
-	struct timespec ts;
+	struct timespec64 ts;
 	__u32 ts_status;
 
 	if (!(ts_status = tpacket_get_timestamp(skb, &ts, po->tp_tstamp)))
 		return 0;
 
 	h.raw = frame;
+	/*
+	 * versions 1 through 3 overflow the timestamps in y2106, since they
+	 * all store the seconds in a 32-bit unsigned integer.
+	 * If we create a version 4, that should have a 64-bit timestamp,
+	 * either 64-bit seconds + 32-bit nanoseconds, or just 64-bit
+	 * nanoseconds.
+	 */
 	switch (po->tp_version) {
 	case TPACKET_V1:
 		h.h1->tp_sec = ts.tv_sec;
@@ -805,8 +812,8 @@ static void prb_close_block(struct tpacket_kbdq_core *pkc1,
 		 * It shouldn't really happen as we don't close empty
 		 * blocks. See prb_retire_rx_blk_timer_expired().
 		 */
-		struct timespec ts;
-		getnstimeofday(&ts);
+		struct timespec64 ts;
+		ktime_get_real_ts64(&ts);
 		h1->ts_last_pkt.ts_sec = ts.tv_sec;
 		h1->ts_last_pkt.ts_nsec	= ts.tv_nsec;
 	}
@@ -836,7 +843,7 @@ static void prb_thaw_queue(struct tpacket_kbdq_core *pkc)
 static void prb_open_block(struct tpacket_kbdq_core *pkc1,
 	struct tpacket_block_desc *pbd1)
 {
-	struct timespec ts;
+	struct timespec64 ts;
 	struct tpacket_hdr_v1 *h1 = &pbd1->hdr.bh1;
 
 	smp_rmb();
@@ -849,7 +856,7 @@ static void prb_open_block(struct tpacket_kbdq_core *pkc1,
 	BLOCK_NUM_PKTS(pbd1) = 0;
 	BLOCK_LEN(pbd1) = BLK_PLUS_PRIV(pkc1->blk_sizeof_priv);
 
-	getnstimeofday(&ts);
+	ktime_get_real_ts64(&ts);
 
 	h1->ts_first_pkt.ts_sec = ts.tv_sec;
 	h1->ts_first_pkt.ts_nsec = ts.tv_nsec;
@@ -2184,7 +2191,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 	unsigned long status = TP_STATUS_USER;
 	unsigned short macoff, netoff, hdrlen;
 	struct sk_buff *copy_skb = NULL;
-	struct timespec ts;
+	struct timespec64 ts;
 	__u32 ts_status;
 	bool is_drop_n_account = false;
 	bool do_vnet = false;
@@ -2312,7 +2319,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 	skb_copy_bits(skb, 0, h.raw + macoff, snaplen);
 
 	if (!(ts_status = tpacket_get_timestamp(skb, &ts, po->tp_tstamp)))
-		getnstimeofday(&ts);
+		ktime_get_real_ts64(&ts);
 
 	status |= ts_status;
 
-- 
2.9.0

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/2] [RFC] packet: experimental support for 64-bit timestamps
  2017-11-27 16:19 [PATCH 1/2] [net-next] packet: clarify timestamp overflow Arnd Bergmann
@ 2017-11-27 16:19 ` Arnd Bergmann
  2017-11-27 16:59   ` Jiri Pirko
  0 siblings, 1 reply; 24+ messages in thread
From: Arnd Bergmann @ 2017-11-27 16:19 UTC (permalink / raw)
  To: David S. Miller
  Cc: y2038, Arnd Bergmann, Shuah Khan, Willem de Bruijn, Mike Maloney,
	Eric Dumazet, Kees Cook, Rosen, Rami, Andrey Konovalov,
	Reshetova, Elena, Sowmini Varadhan, netdev, linux-kernel,
	linux-kselftest

I tried to figure out what it would take to do a version 4 mmap packet
socket interface to completely avoid the y2106 overflow problem. This is
what I came up with, reusing most of the v3 code, except for the parts
where we access the timestamps.

For kselftest, I'm adding support for testing v4 in addition to v1-v3,
but the test currently does not look at the timestamps, so it won't
check that the timestamp format actually works as intended, only that
I didn't break the parts that worked in the v3 selftest.

Overall, this is more of a mess than I expected, so it's probably not
worth doing a v4 format just for the timestamp, but the patch can serve
as a reference for anyone that needs a new format for other reasons and
fixes this along with the other changes.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
Untested and rather invasive, so don't apply this part without
discussion and testing
---
 include/uapi/linux/if_packet.h              |  24 +++++-
 net/packet/af_packet.c                      | 115 ++++++++++++++++++++--------
 tools/testing/selftests/net/psock_tpacket.c |  65 +++++++++-------
 3 files changed, 142 insertions(+), 62 deletions(-)

diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
index 67b61d91d89b..c2cf29acdd40 100644
--- a/include/uapi/linux/if_packet.h
+++ b/include/uapi/linux/if_packet.h
@@ -177,8 +177,27 @@ struct tpacket3_hdr {
 	__u8		tp_padding[8];
 };
 
+struct tpacket4_hdr {
+	__u32		tp_next_offset;
+	__u32		tp_nsec_hi;
+	__u32		tp_nsec_lo;
+	__u32		tp_snaplen;
+	__u32		tp_len;
+	__u32		tp_status;
+	__u16		tp_mac;
+	__u16		tp_net;
+	/* pkt_hdr variants */
+	union {
+		struct tpacket_hdr_variant1 hv1;
+	};
+	__u8		tp_padding[8];
+};
+
 struct tpacket_bd_ts {
-	unsigned int ts_sec;
+	union {
+		unsigned int ts_nsec_hi;
+		unsigned int ts_sec;
+	};
 	union {
 		unsigned int ts_usec;
 		unsigned int ts_nsec;
@@ -250,7 +269,8 @@ struct tpacket_block_desc {
 enum tpacket_versions {
 	TPACKET_V1,
 	TPACKET_V2,
-	TPACKET_V3
+	TPACKET_V3,
+	TPACKET_V4,
 };
 
 /*
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 7432c6699818..34a07e4a93a5 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -164,6 +164,7 @@ union tpacket_uhdr {
 	struct tpacket_hdr  *h1;
 	struct tpacket2_hdr *h2;
 	struct tpacket3_hdr *h3;
+	struct tpacket4_hdr *h4;
 	void *raw;
 };
 
@@ -200,7 +201,7 @@ static void prb_retire_current_block(struct tpacket_kbdq_core *,
 		struct packet_sock *, unsigned int status);
 static int prb_queue_frozen(struct tpacket_kbdq_core *);
 static void prb_open_block(struct tpacket_kbdq_core *,
-		struct tpacket_block_desc *);
+		struct tpacket_block_desc *, struct packet_sock *po);
 static void prb_retire_rx_blk_timer_expired(struct timer_list *);
 static void _prb_refresh_rx_retire_blk_timer(struct tpacket_kbdq_core *);
 static void prb_fill_rxhash(struct tpacket_kbdq_core *, struct tpacket3_hdr *);
@@ -404,6 +405,7 @@ static void __packet_set_status(struct packet_sock *po, void *frame, int status)
 		flush_dcache_page(pgv_to_page(&h.h2->tp_status));
 		break;
 	case TPACKET_V3:
+	case TPACKET_V4:
 		h.h3->tp_status = status;
 		flush_dcache_page(pgv_to_page(&h.h3->tp_status));
 		break;
@@ -430,6 +432,7 @@ static int __packet_get_status(struct packet_sock *po, void *frame)
 		flush_dcache_page(pgv_to_page(&h.h2->tp_status));
 		return h.h2->tp_status;
 	case TPACKET_V3:
+	case TPACKET_V4:
 		flush_dcache_page(pgv_to_page(&h.h3->tp_status));
 		return h.h3->tp_status;
 	default:
@@ -439,17 +442,17 @@ static int __packet_get_status(struct packet_sock *po, void *frame)
 	}
 }
 
-static __u32 tpacket_get_timestamp(struct sk_buff *skb, struct timespec64 *ts,
+static __u32 tpacket_get_timestamp(struct sk_buff *skb, s64 *stamp,
 				   unsigned int flags)
 {
 	struct skb_shared_hwtstamps *shhwtstamps = skb_hwtstamps(skb);
 
 	if (shhwtstamps &&
 	    (flags & SOF_TIMESTAMPING_RAW_HARDWARE) &&
-	    ktime_to_timespec64_cond(shhwtstamps->hwtstamp, ts))
+	    (*stamp = ktime_to_ns(shhwtstamps->hwtstamp)))
 		return TP_STATUS_TS_RAW_HARDWARE;
 
-	if (ktime_to_timespec64_cond(skb->tstamp, ts))
+	if ((*stamp = ktime_to_ns(skb->tstamp)))
 		return TP_STATUS_TS_SOFTWARE;
 
 	return 0;
@@ -460,19 +463,15 @@ static __u32 __packet_set_timestamp(struct packet_sock *po, void *frame,
 {
 	union tpacket_uhdr h;
 	struct timespec64 ts;
+	s64 stamp;
 	__u32 ts_status;
 
-	if (!(ts_status = tpacket_get_timestamp(skb, &ts, po->tp_tstamp)))
+	if (!(ts_status = tpacket_get_timestamp(skb, &stamp, po->tp_tstamp)))
 		return 0;
 
 	h.raw = frame;
-	/*
-	 * versions 1 through 3 overflow the timestamps in y2106, since they
-	 * all store the seconds in a 32-bit unsigned integer.
-	 * If we create a version 4, that should have a 64-bit timestamp,
-	 * either 64-bit seconds + 32-bit nanoseconds, or just 64-bit
-	 * nanoseconds.
-	 */
+
+	ts = ns_to_timespec64(stamp);
 	switch (po->tp_version) {
 	case TPACKET_V1:
 		h.h1->tp_sec = ts.tv_sec;
@@ -486,6 +485,9 @@ static __u32 __packet_set_timestamp(struct packet_sock *po, void *frame,
 		h.h3->tp_sec = ts.tv_sec;
 		h.h3->tp_nsec = ts.tv_nsec;
 		break;
+	case TPACKET_V4:
+		h.h4->tp_nsec_hi = upper_32_bits(stamp);
+		h.h4->tp_nsec_lo = lower_32_bits(stamp);
 	default:
 		WARN(1, "TPACKET version not supported.\n");
 		BUG();
@@ -633,7 +635,7 @@ static void init_prb_bdqc(struct packet_sock *po,
 	p1->max_frame_len = p1->kblk_size - BLK_PLUS_PRIV(p1->blk_sizeof_priv);
 	prb_init_ft_ops(p1, req_u);
 	prb_setup_retire_blk_timer(po);
-	prb_open_block(p1, pbd);
+	prb_open_block(p1, pbd, po);
 }
 
 /*  Do NOT update the last_blk_num first.
@@ -730,7 +732,7 @@ static void prb_retire_rx_blk_timer_expired(struct timer_list *t)
 				* opening a block thaws the queue,restarts timer
 				* Thawing/timer-refresh is a side effect.
 				*/
-				prb_open_block(pkc, pbd);
+				prb_open_block(pkc, pbd, po);
 				goto out;
 			}
 		}
@@ -792,30 +794,43 @@ static void prb_close_block(struct tpacket_kbdq_core *pkc1,
 {
 	__u32 status = TP_STATUS_USER | stat;
 
-	struct tpacket3_hdr *last_pkt;
+	struct tpacket3_hdr *last_pkt_v3;
+	struct tpacket4_hdr *last_pkt_v4;
 	struct tpacket_hdr_v1 *h1 = &pbd1->hdr.bh1;
 	struct sock *sk = &po->sk;
 
 	if (po->stats.stats3.tp_drops)
 		status |= TP_STATUS_LOSING;
 
-	last_pkt = (struct tpacket3_hdr *)pkc1->prev;
-	last_pkt->tp_next_offset = 0;
+	last_pkt_v3 = (struct tpacket3_hdr *)pkc1->prev;
+	last_pkt_v4 = (struct tpacket4_hdr *)pkc1->prev;
+	last_pkt_v3->tp_next_offset = 0;
 
 	/* Get the ts of the last pkt */
 	if (BLOCK_NUM_PKTS(pbd1)) {
-		h1->ts_last_pkt.ts_sec = last_pkt->tp_sec;
-		h1->ts_last_pkt.ts_nsec	= last_pkt->tp_nsec;
+		if (po->tp_version == TPACKET_V3) {
+			h1->ts_last_pkt.ts_sec = last_pkt_v3->tp_sec;
+			h1->ts_last_pkt.ts_nsec	= last_pkt_v3->tp_nsec;
+		} else {
+			h1->ts_last_pkt.ts_nsec_hi = last_pkt_v4->tp_nsec_hi;
+			h1->ts_last_pkt.ts_nsec	= last_pkt_v4->tp_nsec_lo;
+		}
 	} else {
 		/* Ok, we tmo'd - so get the current time.
 		 *
 		 * It shouldn't really happen as we don't close empty
 		 * blocks. See prb_retire_rx_blk_timer_expired().
 		 */
-		struct timespec64 ts;
-		ktime_get_real_ts64(&ts);
-		h1->ts_last_pkt.ts_sec = ts.tv_sec;
-		h1->ts_last_pkt.ts_nsec	= ts.tv_nsec;
+		if (po->tp_version == TPACKET_V3) {
+			struct timespec64 ts;
+			ktime_get_real_ts64(&ts);
+			h1->ts_last_pkt.ts_sec = ts.tv_sec;
+			h1->ts_last_pkt.ts_nsec	= ts.tv_nsec;
+		} else {
+			u64 ns = ktime_get_real_ns();
+			h1->ts_last_pkt.ts_nsec_hi = upper_32_bits(ns);
+			h1->ts_last_pkt.ts_nsec = lower_32_bits(ns);
+		}
 	}
 
 	smp_wmb();
@@ -841,9 +856,8 @@ static void prb_thaw_queue(struct tpacket_kbdq_core *pkc)
  *
  */
 static void prb_open_block(struct tpacket_kbdq_core *pkc1,
-	struct tpacket_block_desc *pbd1)
+	struct tpacket_block_desc *pbd1, struct packet_sock *po)
 {
-	struct timespec64 ts;
 	struct tpacket_hdr_v1 *h1 = &pbd1->hdr.bh1;
 
 	smp_rmb();
@@ -856,10 +870,19 @@ static void prb_open_block(struct tpacket_kbdq_core *pkc1,
 	BLOCK_NUM_PKTS(pbd1) = 0;
 	BLOCK_LEN(pbd1) = BLK_PLUS_PRIV(pkc1->blk_sizeof_priv);
 
-	ktime_get_real_ts64(&ts);
 
-	h1->ts_first_pkt.ts_sec = ts.tv_sec;
-	h1->ts_first_pkt.ts_nsec = ts.tv_nsec;
+	if (po->tp_version == TPACKET_V3) {
+		struct timespec64 ts;
+
+		ktime_get_real_ts64(&ts);
+		h1->ts_first_pkt.ts_sec = ts.tv_sec;
+		h1->ts_first_pkt.ts_nsec = ts.tv_nsec;
+	} else {
+		s64 ns = ktime_get_real_ns();
+
+		h1->ts_first_pkt.ts_nsec_hi = upper_32_bits(ns);
+		h1->ts_first_pkt.ts_nsec = lower_32_bits(ns);
+	}
 
 	pkc1->pkblk_start = (char *)pbd1;
 	pkc1->nxt_offset = pkc1->pkblk_start + BLK_PLUS_PRIV(pkc1->blk_sizeof_priv);
@@ -936,7 +959,7 @@ static void *prb_dispatch_next_block(struct tpacket_kbdq_core *pkc,
 	 * open this block and return the offset where the first packet
 	 * needs to get stored.
 	 */
-	prb_open_block(pkc, pbd);
+	prb_open_block(pkc, pbd, po);
 	return (void *)pkc->nxt_offset;
 }
 
@@ -1068,7 +1091,7 @@ static void *__packet_lookup_frame_in_block(struct packet_sock *po,
 			 * opening a block also thaws the queue.
 			 * Thawing is a side effect.
 			 */
-			prb_open_block(pkc, pbd);
+			prb_open_block(pkc, pbd, po);
 		}
 	}
 
@@ -1113,6 +1136,7 @@ static void *packet_current_rx_frame(struct packet_sock *po,
 					po->rx_ring.head, status);
 		return curr;
 	case TPACKET_V3:
+	case TPACKET_V4:
 		return __packet_lookup_frame_in_block(po, skb, status, len);
 	default:
 		WARN(1, "TPACKET version not supported\n");
@@ -1171,6 +1195,7 @@ static void packet_increment_rx_head(struct packet_sock *po,
 	case TPACKET_V2:
 		return packet_increment_head(rb);
 	case TPACKET_V3:
+	case TPACKET_V4:
 	default:
 		WARN(1, "TPACKET version not supported.\n");
 		BUG();
@@ -1279,7 +1304,7 @@ static int __packet_rcv_has_room(struct packet_sock *po, struct sk_buff *skb)
 			return ROOM_NONE;
 	}
 
-	if (po->tp_version == TPACKET_V3) {
+	if (po->tp_version == TPACKET_V3 || po->tp_version == TPACKET_V4) {
 		if (__tpacket_v3_has_room(po, ROOM_POW_OFF))
 			ret = ROOM_NORMAL;
 		else if (__tpacket_v3_has_room(po, 0))
@@ -2192,6 +2217,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 	unsigned short macoff, netoff, hdrlen;
 	struct sk_buff *copy_skb = NULL;
 	struct timespec64 ts;
+	u64 ns;
 	__u32 ts_status;
 	bool is_drop_n_account = false;
 	bool do_vnet = false;
@@ -2318,7 +2344,9 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 
 	skb_copy_bits(skb, 0, h.raw + macoff, snaplen);
 
-	if (!(ts_status = tpacket_get_timestamp(skb, &ts, po->tp_tstamp)))
+	if ((ts_status = tpacket_get_timestamp(skb, &ns, po->tp_tstamp)))
+		ts = ns_to_timespec64(ns);
+	else
 		ktime_get_real_ts64(&ts);
 
 	status |= ts_status;
@@ -2365,6 +2393,19 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 		memset(h.h3->tp_padding, 0, sizeof(h.h3->tp_padding));
 		hdrlen = sizeof(*h.h3);
 		break;
+	case TPACKET_V4:
+		/* identical to v3, except for the timestamp */
+		h.h4->tp_status |= status;
+		h.h4->tp_len = skb->len;
+		h.h4->tp_snaplen = snaplen;
+		h.h4->tp_mac = macoff;
+		h.h4->tp_net = netoff;
+		ns = timespec64_to_ns(&ts);
+		h.h4->tp_nsec_hi = upper_32_bits(ns);
+		h.h4->tp_nsec_lo = lower_32_bits(ns);
+		memset(h.h3->tp_padding, 0, sizeof(h.h3->tp_padding));
+		hdrlen = sizeof(*h.h3);
+		break;
 	default:
 		BUG();
 	}
@@ -2570,6 +2611,7 @@ static int tpacket_parse_header(struct packet_sock *po, void *frame,
 	ph.raw = frame;
 
 	switch (po->tp_version) {
+	case TPACKET_V4:
 	case TPACKET_V3:
 		if (ph.h3->tp_next_offset != 0) {
 			pr_warn_once("variable sized slot not supported");
@@ -2596,6 +2638,7 @@ static int tpacket_parse_header(struct packet_sock *po, void *frame,
 		off_max = po->tx_ring.frame_size - tp_len;
 		if (po->sk.sk_type == SOCK_DGRAM) {
 			switch (po->tp_version) {
+			case TPACKET_V4:
 			case TPACKET_V3:
 				off = ph.h3->tp_net;
 				break;
@@ -2608,6 +2651,7 @@ static int tpacket_parse_header(struct packet_sock *po, void *frame,
 			}
 		} else {
 			switch (po->tp_version) {
+			case TPACKET_V4:
 			case TPACKET_V3:
 				off = ph.h3->tp_mac;
 				break;
@@ -3658,6 +3702,7 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 			len = sizeof(req_u.req);
 			break;
 		case TPACKET_V3:
+		case TPACKET_V4:
 		default:
 			len = sizeof(req_u.req3);
 			break;
@@ -3693,6 +3738,7 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 		case TPACKET_V1:
 		case TPACKET_V2:
 		case TPACKET_V3:
+		case TPACKET_V4:
 			break;
 		default:
 			return -EINVAL;
@@ -3868,7 +3914,7 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
 		memset(&po->stats, 0, sizeof(po->stats));
 		spin_unlock_bh(&sk->sk_receive_queue.lock);
 
-		if (po->tp_version == TPACKET_V3) {
+		if (po->tp_version == TPACKET_V3 || po->tp_version == TPACKET_V4) {
 			lv = sizeof(struct tpacket_stats_v3);
 			st.stats3.tp_packets += st.stats3.tp_drops;
 			data = &st.stats3;
@@ -3906,6 +3952,7 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
 			val = sizeof(struct tpacket2_hdr);
 			break;
 		case TPACKET_V3:
+		case TPACKET_V4:
 			val = sizeof(struct tpacket3_hdr);
 			break;
 		default:
@@ -4250,6 +4297,7 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
 			po->tp_hdrlen = TPACKET2_HDRLEN;
 			break;
 		case TPACKET_V3:
+		case TPACKET_V4:
 			po->tp_hdrlen = TPACKET3_HDRLEN;
 			break;
 		}
@@ -4284,6 +4332,7 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
 		if (unlikely(!pg_vec))
 			goto out;
 		switch (po->tp_version) {
+		case TPACKET_V4:
 		case TPACKET_V3:
 			/* Block transmit is not supported yet */
 			if (!tx_ring) {
diff --git a/tools/testing/selftests/net/psock_tpacket.c b/tools/testing/selftests/net/psock_tpacket.c
index 7f6cd9fdacf3..b497632b6a70 100644
--- a/tools/testing/selftests/net/psock_tpacket.c
+++ b/tools/testing/selftests/net/psock_tpacket.c
@@ -3,7 +3,8 @@
  * Author: Daniel Borkmann <dborkman@redhat.com>
  *         Chetan Loke <loke.chetan@gmail.com> (TPACKET_V3 usage example)
  *
- * A basic test of packet socket's TPACKET_V1/TPACKET_V2/TPACKET_V3 behavior.
+ * A basic test of packet socket's TPACKET_V1/TPACKET_V2/TPACKET_V3/TPACKET_V4
+ * behavior.
  *
  * Control:
  *   Test the setup of the TPACKET socket with different patterns that are
@@ -19,6 +20,7 @@
  *   - TPACKET_V1: RX_RING, TX_RING
  *   - TPACKET_V2: RX_RING, TX_RING
  *   - TPACKET_V3: RX_RING
+ *   - TPACKET_V4: RX_RING
  *
  * License (GPLv2):
  *
@@ -310,12 +312,12 @@ static inline void __v2_tx_user_ready(struct tpacket2_hdr *hdr)
 	__sync_synchronize();
 }
 
-static inline int __v3_tx_kernel_ready(struct tpacket3_hdr *hdr)
+static inline int __v3_v4_tx_kernel_ready(struct tpacket3_hdr *hdr)
 {
 	return !(hdr->tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING));
 }
 
-static inline void __v3_tx_user_ready(struct tpacket3_hdr *hdr)
+static inline void __v3_v4_tx_user_ready(struct tpacket3_hdr *hdr)
 {
 	hdr->tp_status = TP_STATUS_SEND_REQUEST;
 	__sync_synchronize();
@@ -329,7 +331,8 @@ static inline int __tx_kernel_ready(void *base, int version)
 	case TPACKET_V2:
 		return __v2_tx_kernel_ready(base);
 	case TPACKET_V3:
-		return __v3_tx_kernel_ready(base);
+	case TPACKET_V4:
+		return __v3_v4_tx_kernel_ready(base);
 	default:
 		bug_on(1);
 		return 0;
@@ -346,7 +349,8 @@ static inline void __tx_user_ready(void *base, int version)
 		__v2_tx_user_ready(base);
 		break;
 	case TPACKET_V3:
-		__v3_tx_user_ready(base);
+	case TPACKET_V4:
+		__v3_v4_tx_user_ready(base);
 		break;
 	}
 }
@@ -372,6 +376,7 @@ static inline void *get_next_frame(struct ring *ring, int n)
 	case TPACKET_V2:
 		return ring->rd[n].iov_base;
 	case TPACKET_V3:
+	case TPACKET_V4:
 		return f0 + (n * ring->req3.tp_frame_size);
 	default:
 		bug_on(1);
@@ -454,7 +459,8 @@ static void walk_tx(int sock, struct ring *ring)
 				       packet_len);
 				total_bytes += ppd.v2->tp_h.tp_snaplen;
 				break;
-			case TPACKET_V3: {
+			case TPACKET_V3:
+			case TPACKET_V4: {
 				struct tpacket3_hdr *tx = next;
 
 				tx->tp_snaplen = packet_len;
@@ -517,22 +523,22 @@ static void walk_v1_v2(int sock, struct ring *ring)
 		walk_tx(sock, ring);
 }
 
-static uint64_t __v3_prev_block_seq_num = 0;
+static uint64_t __v3_v4_prev_block_seq_num = 0;
 
-void __v3_test_block_seq_num(struct block_desc *pbd)
+void __v3_v4_test_block_seq_num(struct block_desc *pbd)
 {
-	if (__v3_prev_block_seq_num + 1 != pbd->h1.seq_num) {
+	if (__v3_v4_prev_block_seq_num + 1 != pbd->h1.seq_num) {
 		fprintf(stderr, "\nprev_block_seq_num:%"PRIu64", expected "
 			"seq:%"PRIu64" != actual seq:%"PRIu64"\n",
-			__v3_prev_block_seq_num, __v3_prev_block_seq_num + 1,
+			__v3_v4_prev_block_seq_num, __v3_v4_prev_block_seq_num + 1,
 			(uint64_t) pbd->h1.seq_num);
 		exit(1);
 	}
 
-	__v3_prev_block_seq_num = pbd->h1.seq_num;
+	__v3_v4_prev_block_seq_num = pbd->h1.seq_num;
 }
 
-static void __v3_test_block_len(struct block_desc *pbd, uint32_t bytes, int block_num)
+static void __v3_v4_test_block_len(struct block_desc *pbd, uint32_t bytes, int block_num)
 {
 	if (pbd->h1.num_pkts && bytes != pbd->h1.blk_len) {
 		fprintf(stderr, "\nblock:%u with %upackets, expected "
@@ -542,23 +548,23 @@ static void __v3_test_block_len(struct block_desc *pbd, uint32_t bytes, int bloc
 	}
 }
 
-static void __v3_test_block_header(struct block_desc *pbd, const int block_num)
+static void __v3_v4_test_block_header(struct block_desc *pbd, const int block_num)
 {
 	if ((pbd->h1.block_status & TP_STATUS_USER) == 0) {
 		fprintf(stderr, "\nblock %u: not in TP_STATUS_USER\n", block_num);
 		exit(1);
 	}
 
-	__v3_test_block_seq_num(pbd);
+	__v3_v4_test_block_seq_num(pbd);
 }
 
-static void __v3_walk_block(struct block_desc *pbd, const int block_num)
+static void __v3_v4_walk_block(struct block_desc *pbd, const int block_num)
 {
 	int num_pkts = pbd->h1.num_pkts, i;
 	unsigned long bytes = 0, bytes_with_padding = ALIGN_8(sizeof(*pbd));
 	struct tpacket3_hdr *ppd;
 
-	__v3_test_block_header(pbd, block_num);
+	__v3_v4_test_block_header(pbd, block_num);
 
 	ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd +
 				       pbd->h1.offset_to_first_pkt);
@@ -580,17 +586,17 @@ static void __v3_walk_block(struct block_desc *pbd, const int block_num)
 		__sync_synchronize();
 	}
 
-	__v3_test_block_len(pbd, bytes_with_padding, block_num);
+	__v3_v4_test_block_len(pbd, bytes_with_padding, block_num);
 	total_bytes += bytes;
 }
 
-void __v3_flush_block(struct block_desc *pbd)
+void __v3_v4_flush_block(struct block_desc *pbd)
 {
 	pbd->h1.block_status = TP_STATUS_KERNEL;
 	__sync_synchronize();
 }
 
-static void walk_v3_rx(int sock, struct ring *ring)
+static void walk_v3_v4_rx(int sock, struct ring *ring)
 {
 	unsigned int block_num = 0;
 	struct pollfd pfd;
@@ -614,8 +620,8 @@ static void walk_v3_rx(int sock, struct ring *ring)
 		while ((pbd->h1.block_status & TP_STATUS_USER) == 0)
 			poll(&pfd, 1, 1);
 
-		__v3_walk_block(pbd, block_num);
-		__v3_flush_block(pbd);
+		__v3_v4_walk_block(pbd, block_num);
+		__v3_v4_flush_block(pbd);
 
 		block_num = (block_num + 1) % ring->rd_num;
 	}
@@ -623,7 +629,7 @@ static void walk_v3_rx(int sock, struct ring *ring)
 	pair_udp_close(udp_sock);
 
 	if (total_packets != 2 * NUM_PACKETS) {
-		fprintf(stderr, "walk_v3_rx: received %u out of %u pkts\n",
+		fprintf(stderr, "walk_v3_v4_rx: received %u out of %u pkts\n",
 			total_packets, NUM_PACKETS);
 		exit(1);
 	}
@@ -631,10 +637,10 @@ static void walk_v3_rx(int sock, struct ring *ring)
 	fprintf(stderr, " %u pkts (%u bytes)", NUM_PACKETS, total_bytes >> 1);
 }
 
-static void walk_v3(int sock, struct ring *ring)
+static void walk_v3_v4(int sock, struct ring *ring)
 {
 	if (ring->type == PACKET_RX_RING)
-		walk_v3_rx(sock, ring);
+		walk_v3_v4_rx(sock, ring);
 	else
 		walk_tx(sock, ring);
 }
@@ -655,7 +661,7 @@ static void __v1_v2_fill(struct ring *ring, unsigned int blocks)
 	ring->flen = ring->req.tp_frame_size;
 }
 
-static void __v3_fill(struct ring *ring, unsigned int blocks, int type)
+static void __v3_v4_fill(struct ring *ring, unsigned int blocks, int type)
 {
 	if (type == PACKET_RX_RING) {
 		ring->req3.tp_retire_blk_tov = 64;
@@ -671,7 +677,7 @@ static void __v3_fill(struct ring *ring, unsigned int blocks, int type)
 				 ring->req3.tp_block_nr;
 
 	ring->mm_len = ring->req3.tp_block_size * ring->req3.tp_block_nr;
-	ring->walk = walk_v3;
+	ring->walk = walk_v3_v4;
 	ring->rd_num = ring->req3.tp_block_nr;
 	ring->flen = ring->req3.tp_block_size;
 }
@@ -695,7 +701,8 @@ static void setup_ring(int sock, struct ring *ring, int version, int type)
 		break;
 
 	case TPACKET_V3:
-		__v3_fill(ring, blocks, type);
+	case TPACKET_V4:
+		__v3_v4_fill(ring, blocks, type);
 		ret = setsockopt(sock, SOL_PACKET, type, &ring->req3,
 				 sizeof(ring->req3));
 		break;
@@ -804,6 +811,7 @@ static const char *tpacket_str[] = {
 	[TPACKET_V1] = "TPACKET_V1",
 	[TPACKET_V2] = "TPACKET_V2",
 	[TPACKET_V3] = "TPACKET_V3",
+	[TPACKET_V4] = "TPACKET_V4",
 };
 
 static const char *type_str[] = {
@@ -854,6 +862,9 @@ int main(void)
 	ret |= test_tpacket(TPACKET_V3, PACKET_RX_RING);
 	ret |= test_tpacket(TPACKET_V3, PACKET_TX_RING);
 
+	ret |= test_tpacket(TPACKET_V4, PACKET_RX_RING);
+	ret |= test_tpacket(TPACKET_V4, PACKET_TX_RING);
+
 	if (ret)
 		return 1;
 
-- 
2.9.0

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] [RFC] packet: experimental support for 64-bit timestamps
  2017-11-27 16:19 ` [PATCH 2/2] [RFC] packet: experimental support for 64-bit timestamps Arnd Bergmann
@ 2017-11-27 16:59   ` Jiri Pirko
  2017-11-27 20:35     ` Willem de Bruijn
  0 siblings, 1 reply; 24+ messages in thread
From: Jiri Pirko @ 2017-11-27 16:59 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: David S. Miller, y2038, Shuah Khan, Willem de Bruijn,
	Mike Maloney, Eric Dumazet, Kees Cook, Rosen, Rami,
	Andrey Konovalov, Reshetova, Elena, Sowmini Varadhan, netdev,
	linux-kernel, linux-kselftest

Mon, Nov 27, 2017 at 05:19:25PM CET, arnd@arndb.de wrote:
>I tried to figure out what it would take to do a version 4 mmap packet
>socket interface to completely avoid the y2106 overflow problem. This is
>what I came up with, reusing most of the v3 code, except for the parts
>where we access the timestamps.
>
>For kselftest, I'm adding support for testing v4 in addition to v1-v3,
>but the test currently does not look at the timestamps, so it won't
>check that the timestamp format actually works as intended, only that
>I didn't break the parts that worked in the v3 selftest.
>
>Overall, this is more of a mess than I expected, so it's probably not
>worth doing a v4 format just for the timestamp, but the patch can serve
>as a reference for anyone that needs a new format for other reasons and
>fixes this along with the other changes.
>
>Signed-off-by: Arnd Bergmann <arnd@arndb.de>
>---

[...]


>@@ -250,7 +269,8 @@ struct tpacket_block_desc {
> enum tpacket_versions {
> 	TPACKET_V1,
> 	TPACKET_V2,
>-	TPACKET_V3
>+	TPACKET_V3,
>+	TPACKET_V4,

I wonder with how many versions are we going to eventually end up with :O

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] [RFC] packet: experimental support for 64-bit timestamps
  2017-11-27 16:59   ` Jiri Pirko
@ 2017-11-27 20:35     ` Willem de Bruijn
  2017-11-27 20:51       ` Arnd Bergmann
  0 siblings, 1 reply; 24+ messages in thread
From: Willem de Bruijn @ 2017-11-27 20:35 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Arnd Bergmann, David S. Miller, y2038, Shuah Khan,
	Willem de Bruijn, Mike Maloney, Eric Dumazet, Kees Cook, Rosen,
	Rami, Andrey Konovalov, Reshetova, Elena, Sowmini Varadhan,
	Network Development, LKML, linux-kselftest

On Mon, Nov 27, 2017 at 11:59 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Mon, Nov 27, 2017 at 05:19:25PM CET, arnd@arndb.de wrote:
>>I tried to figure out what it would take to do a version 4 mmap packet
>>socket interface to completely avoid the y2106 overflow problem. This is
>>what I came up with, reusing most of the v3 code, except for the parts
>>where we access the timestamps.
>>
>>For kselftest, I'm adding support for testing v4 in addition to v1-v3,
>>but the test currently does not look at the timestamps, so it won't
>>check that the timestamp format actually works as intended, only that
>>I didn't break the parts that worked in the v3 selftest.
>>
>>Overall, this is more of a mess than I expected, so it's probably not
>>worth doing a v4 format just for the timestamp, but the patch can serve
>>as a reference for anyone that needs a new format for other reasons and
>>fixes this along with the other changes.
>>
>>Signed-off-by: Arnd Bergmann <arnd@arndb.de>
>>---
>
> [...]
>
>
>>@@ -250,7 +269,8 @@ struct tpacket_block_desc {
>> enum tpacket_versions {
>>       TPACKET_V1,
>>       TPACKET_V2,
>>-      TPACKET_V3
>>+      TPACKET_V3,
>>+      TPACKET_V4,
>
> I wonder with how many versions are we going to eventually end up with :O

There already is an effort to come up with a new AF_PACKET V4 [1].
We should make sure that any new interface does not have the
Y2038/Y2106 issue. But, if a new version is being developed and
that subsumes all existing use cases, then there probably is no need
for another version that is a very small diff to V3.

If adding support for existing applications is useful, another approach
would be to add a new socket option that changes the semantics for
the two u32 fields in each of V1, V2 and V3 to hold nsec. Add a single
check after filling in those structs whether the option is set and, if so,
overwrite the two fields.

[1] https://lwn.net/Articles/737947/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] [RFC] packet: experimental support for 64-bit timestamps
  2017-11-27 20:35     ` Willem de Bruijn
@ 2017-11-27 20:51       ` Arnd Bergmann
  2017-11-28  7:04         ` Björn Töpel
  0 siblings, 1 reply; 24+ messages in thread
From: Arnd Bergmann @ 2017-11-27 20:51 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Jiri Pirko, David S. Miller, y2038 Mailman List, Shuah Khan,
	Willem de Bruijn, Mike Maloney, Eric Dumazet, Kees Cook, Rosen,
	Rami, Andrey Konovalov, Reshetova, Elena, Sowmini Varadhan,
	Network Development, LKML, linux-kselftest,
	Björn Töpel

On Mon, Nov 27, 2017 at 9:35 PM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> On Mon, Nov 27, 2017 at 11:59 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> Mon, Nov 27, 2017 at 05:19:25PM CET, arnd@arndb.de wrote:
>>>I tried to figure out what it would take to do a version 4 mmap packet
>>>socket interface to completely avoid the y2106 overflow problem. This is
>>>what I came up with, reusing most of the v3 code, except for the parts
>>>where we access the timestamps.
>>>
>>>For kselftest, I'm adding support for testing v4 in addition to v1-v3,
>>>but the test currently does not look at the timestamps, so it won't
>>>check that the timestamp format actually works as intended, only that
>>>I didn't break the parts that worked in the v3 selftest.
>>>
>>>Overall, this is more of a mess than I expected, so it's probably not
>>>worth doing a v4 format just for the timestamp, but the patch can serve
>>>as a reference for anyone that needs a new format for other reasons and
>>>fixes this along with the other changes.
>>>
>>>Signed-off-by: Arnd Bergmann <arnd@arndb.de>
>>>---
>>
>> [...]
>>
>>
>>>@@ -250,7 +269,8 @@ struct tpacket_block_desc {
>>> enum tpacket_versions {
>>>       TPACKET_V1,
>>>       TPACKET_V2,
>>>-      TPACKET_V3
>>>+      TPACKET_V3,
>>>+      TPACKET_V4,
>>
>> I wonder with how many versions are we going to eventually end up with :O
>
> There already is an effort to come up with a new AF_PACKET V4 [1].
> We should make sure that any new interface does not have the
> Y2038/Y2106 issue. But, if a new version is being developed and
> that subsumes all existing use cases, then there probably is no need
> for another version that is a very small diff to V3.

Ah, perfect, that's good timing. Adding Björn to Cc here.

> If adding support for existing applications is useful, another approach
> would be to add a new socket option that changes the semantics for
> the two u32 fields in each of V1, V2 and V3 to hold nsec. Add a single
> check after filling in those structs whether the option is set and, if so,
> overwrite the two fields.
>
> [1] https://lwn.net/Articles/737947/

I don't think that's necessary. As long as the V4 capabilities are a
superset of V1-V3, we should be able to just require all users to
move to V4 (or later) in the next 89 years, and make sure that they
use unsigned seconds if they care about 2038.

      Arnd

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] [RFC] packet: experimental support for 64-bit timestamps
  2017-11-27 20:51       ` Arnd Bergmann
@ 2017-11-28  7:04         ` Björn Töpel
  2017-11-28  8:46           ` Arnd Bergmann
  0 siblings, 1 reply; 24+ messages in thread
From: Björn Töpel @ 2017-11-28  7:04 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Willem de Bruijn, Jiri Pirko, David S. Miller,
	y2038 Mailman List, Shuah Khan, Willem de Bruijn, Mike Maloney,
	Eric Dumazet, Kees Cook, Rosen, Rami, Andrey Konovalov,
	Reshetova, Elena, Sowmini Varadhan, Network Development, LKML,
	linux-kselftest, Björn Töpel

2017-11-27 21:51 GMT+01:00 Arnd Bergmann <arnd@arndb.de>:
[...]
>> There already is an effort to come up with a new AF_PACKET V4 [1].
>> We should make sure that any new interface does not have the
>> Y2038/Y2106 issue. But, if a new version is being developed and
>> that subsumes all existing use cases, then there probably is no need
>> for another version that is a very small diff to V3.
>
> Ah, perfect, that's good timing. Adding Björn to Cc here.
>

Unfortunately, for the Y2038/Y2106 cases, we'll be (as a result of
netdevconf discussions) moving the AF_PACKET V4 implementation to a
separate, new, address/packet family.

>> If adding support for existing applications is useful, another approach
>> would be to add a new socket option that changes the semantics for
>> the two u32 fields in each of V1, V2 and V3 to hold nsec. Add a single
>> check after filling in those structs whether the option is set and, if so,
>> overwrite the two fields.
>>
>> [1] https://lwn.net/Articles/737947/
>
> I don't think that's necessary. As long as the V4 capabilities are a
> superset of V1-V3, we should be able to just require all users to
> move to V4 (or later) in the next 89 years, and make sure that they
> use unsigned seconds if they care about 2038.
>

Given that V4 wont be around for AF_PACKET -- at least not in the
shape of our patches -- Willem's suggestion is probably a good way
forward.

>       Arnd

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] [RFC] packet: experimental support for 64-bit timestamps
  2017-11-28  7:04         ` Björn Töpel
@ 2017-11-28  8:46           ` Arnd Bergmann
  2017-11-28 13:14             ` [PATCH] [RFC v2] " Arnd Bergmann
  2017-11-28 14:08             ` [PATCH 2/2] [RFC] " Willem de Bruijn
  0 siblings, 2 replies; 24+ messages in thread
From: Arnd Bergmann @ 2017-11-28  8:46 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Willem de Bruijn, Jiri Pirko, David S. Miller,
	y2038 Mailman List, Shuah Khan, Willem de Bruijn, Mike Maloney,
	Eric Dumazet, Kees Cook, Rosen, Rami, Andrey Konovalov,
	Reshetova, Elena, Sowmini Varadhan, Network Development, LKML,
	linux-kselftest, Björn Töpel

On Tue, Nov 28, 2017 at 8:04 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> 2017-11-27 21:51 GMT+01:00 Arnd Bergmann <arnd@arndb.de>:
> [...]
>>> There already is an effort to come up with a new AF_PACKET V4 [1].
>>> We should make sure that any new interface does not have the
>>> Y2038/Y2106 issue. But, if a new version is being developed and
>>> that subsumes all existing use cases, then there probably is no need
>>> for another version that is a very small diff to V3.
>>
>> Ah, perfect, that's good timing. Adding Björn to Cc here.
>>
>
> Unfortunately, for the Y2038/Y2106 cases, we'll be (as a result of
> netdevconf discussions) moving the AF_PACKET V4 implementation to a
> separate, new, address/packet family.

Ok, I see.

>>> If adding support for existing applications is useful, another approach
>>> would be to add a new socket option that changes the semantics for
>>> the two u32 fields in each of V1, V2 and V3 to hold nsec. Add a single
>>> check after filling in those structs whether the option is set and, if so,
>>> overwrite the two fields.
>>>
>>> [1] https://lwn.net/Articles/737947/
>>
>> I don't think that's necessary. As long as the V4 capabilities are a
>> superset of V1-V3, we should be able to just require all users to
>> move to V4 (or later) in the next 89 years, and make sure that they
>> use unsigned seconds if they care about 2038.
>>
>
> Given that V4 wont be around for AF_PACKET -- at least not in the
> shape of our patches -- Willem's suggestion is probably a good way
> forward.

That leaves one question: should we do that now, or wait until some
other reason for a V4 comes up? I don't mind creating another
patch for this, just want to get a feeling of whether the API clutter
is worth it when we have a way out that works until y2106 (at
which point we run into other problems as well).

       Arnd

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH] [RFC v2] packet: experimental support for 64-bit timestamps
  2017-11-28  8:46           ` Arnd Bergmann
@ 2017-11-28 13:14             ` Arnd Bergmann
  2017-11-28 14:24               ` Willem de Bruijn
  2017-11-28 15:05               ` David Miller
  2017-11-28 14:08             ` [PATCH 2/2] [RFC] " Willem de Bruijn
  1 sibling, 2 replies; 24+ messages in thread
From: Arnd Bergmann @ 2017-11-28 13:14 UTC (permalink / raw)
  To: David S. Miller, Miroslav Lichvar, Willem de Bruijn
  Cc: Björn Töpel, Arnd Bergmann, Greg Kroah-Hartman,
	Francis Yan, Eric Dumazet, Kees Cook, Rosen, Rami,
	Andrey Konovalov, Mike Maloney, Sowmini Varadhan, linux-kernel,
	netdev

This is a second attempt to allow 64-bit timestamps in packet sockets,
this time retaining the existing three API versions, and instead using
flags for the timestamping interface that let us reinterpret the existing
fields.

We already have an interface to select the timestamp source to be either
software or hardware timestamps, using the SOF_TIMESTAMPING_RAW_HARDWARE
flag in the PACKET_TIMESTAMP option that gets passed to setsockopt().

This patch adds another two flags for the same interface:
SOF_TIMESTAMPING_SKIP makes all timestamps zero, which can save a few
cycles per packet when we don't have to touch an expensive clocksource
hardware. When SOF_TIMESTAMPING_OPT_NS64 gets added to the flag, we
reinterpret the tp_sec/tp_nsec fields to be the upper and lower half of
the 64-bit nanosecond value. This lets us use the interface until year
2554, when 64-bit unsigned nanoseconds overflow.

The implementation is fairly straightforward, but I'm less sure about the
interface. Using SOF_TIMESTAMPING_* flags in PACKET_TIMESTAMP is a bit
odd already since most of the other flags make no sense here.  Adding two
more flags that only make sense for packet sockets but not the normal
SO_TIMESTAMPING option on other sockets makes this even more confusing.

Link: https://patchwork.kernel.org/patch/10077199/
Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
This is only a proof-of-concept patch, only compile-tested and put here
for discussion about whether we want it in principle.
---
 include/uapi/linux/net_tstamp.h |   4 +-
 net/packet/af_packet.c          | 129 +++++++++++++++++++++++++---------------
 2 files changed, 85 insertions(+), 48 deletions(-)

diff --git a/include/uapi/linux/net_tstamp.h b/include/uapi/linux/net_tstamp.h
index 4fe104b2411f..1b312d35fa41 100644
--- a/include/uapi/linux/net_tstamp.h
+++ b/include/uapi/linux/net_tstamp.h
@@ -30,8 +30,10 @@ enum {
 	SOF_TIMESTAMPING_OPT_STATS = (1<<12),
 	SOF_TIMESTAMPING_OPT_PKTINFO = (1<<13),
 	SOF_TIMESTAMPING_OPT_TX_SWHW = (1<<14),
+	SOF_TIMESTAMPING_SKIP = (1<<15),
+	SOF_TIMESTAMPING_OPT_NS64 = (1<<16),
 
-	SOF_TIMESTAMPING_LAST = SOF_TIMESTAMPING_OPT_TX_SWHW,
+	SOF_TIMESTAMPING_LAST = SOF_TIMESTAMPING_OPT_NS64,
 	SOF_TIMESTAMPING_MASK = (SOF_TIMESTAMPING_LAST - 1) |
 				 SOF_TIMESTAMPING_LAST
 };
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 7432c6699818..8e40040df967 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -200,7 +200,7 @@ static void prb_retire_current_block(struct tpacket_kbdq_core *,
 		struct packet_sock *, unsigned int status);
 static int prb_queue_frozen(struct tpacket_kbdq_core *);
 static void prb_open_block(struct tpacket_kbdq_core *,
-		struct tpacket_block_desc *);
+		struct tpacket_block_desc *, struct packet_sock *);
 static void prb_retire_rx_blk_timer_expired(struct timer_list *);
 static void _prb_refresh_rx_retire_blk_timer(struct tpacket_kbdq_core *);
 static void prb_fill_rxhash(struct tpacket_kbdq_core *, struct tpacket3_hdr *);
@@ -439,52 +439,92 @@ static int __packet_get_status(struct packet_sock *po, void *frame)
 	}
 }
 
-static __u32 tpacket_get_timestamp(struct sk_buff *skb, struct timespec64 *ts,
+static __u32 tpacket_get_timestamp(struct sk_buff *skb, __u32 *hi, __u32 *lo,
 				   unsigned int flags)
 {
+	struct packet_sock *po = pkt_sk(skb->sk);
 	struct skb_shared_hwtstamps *shhwtstamps = skb_hwtstamps(skb);
+	ktime_t stamp;
+	u32 type;
+
+	if (po->tp_tstamp & SOF_TIMESTAMPING_SKIP)
+		return 0;
 
 	if (shhwtstamps &&
-	    (flags & SOF_TIMESTAMPING_RAW_HARDWARE) &&
-	    ktime_to_timespec64_cond(shhwtstamps->hwtstamp, ts))
-		return TP_STATUS_TS_RAW_HARDWARE;
+	    (po->tp_tstamp & SOF_TIMESTAMPING_RAW_HARDWARE) &&
+	    shhwtstamps->hwtstamp) {
+		stamp = shhwtstamps->hwtstamp;
+		type = TP_STATUS_TS_RAW_HARDWARE;
+	} else if (skb->tstamp) {
+		stamp = skb->tstamp;
+		type = TP_STATUS_TS_SOFTWARE;
+	} else {
+		return 0;
+	}
 
-	if (ktime_to_timespec64_cond(skb->tstamp, ts))
-		return TP_STATUS_TS_SOFTWARE;
+	if (po->tp_tstamp & SOF_TIMESTAMPING_OPT_NS64) {
+		__u64 ns = ktime_to_ns(stamp);
 
-	return 0;
+		*hi = upper_32_bits(ns);
+		*lo = lower_32_bits(ns);
+	} else {
+		struct timespec64 ts = ktime_to_timespec64(stamp);
+
+		*hi = ts.tv_sec;
+		if (po->tp_version == TPACKET_V1)
+			*lo = ts.tv_nsec / NSEC_PER_USEC;
+		else
+			*lo = ts.tv_nsec;
+	}
+
+	return type;
+}
+
+static void packet_get_time(struct packet_sock *po, __u32 *hi, __u32 *lo)
+{
+	if (po->tp_tstamp & SOF_TIMESTAMPING_SKIP) {
+		*hi = 0;
+		*lo = 0;
+	} else if (po->tp_tstamp & SOF_TIMESTAMPING_OPT_NS64) {
+		__u64 ns = ktime_get_real_ns();
+
+		*hi = upper_32_bits(ns);
+		*hi = lower_32_bits(ns);
+	} else {
+		struct timespec64 ts;
+
+		ktime_get_real_ts64(&ts);
+		/* unsigned seconds overflow in y2106 here */
+		*hi = ts.tv_sec;
+		if (po->tp_version == TPACKET_V1)
+			*lo = ts.tv_nsec / NSEC_PER_USEC;
+		else
+			*lo = ts.tv_nsec;
+	}
 }
 
 static __u32 __packet_set_timestamp(struct packet_sock *po, void *frame,
 				    struct sk_buff *skb)
 {
 	union tpacket_uhdr h;
-	struct timespec64 ts;
-	__u32 ts_status;
+	__u32 ts_status, hi, lo;
 
-	if (!(ts_status = tpacket_get_timestamp(skb, &ts, po->tp_tstamp)))
+	if (!(ts_status = tpacket_get_timestamp(skb, &hi, &lo, po->tp_tstamp)))
 		return 0;
 
 	h.raw = frame;
-	/*
-	 * versions 1 through 3 overflow the timestamps in y2106, since they
-	 * all store the seconds in a 32-bit unsigned integer.
-	 * If we create a version 4, that should have a 64-bit timestamp,
-	 * either 64-bit seconds + 32-bit nanoseconds, or just 64-bit
-	 * nanoseconds.
-	 */
 	switch (po->tp_version) {
 	case TPACKET_V1:
-		h.h1->tp_sec = ts.tv_sec;
-		h.h1->tp_usec = ts.tv_nsec / NSEC_PER_USEC;
+		h.h1->tp_sec = hi;
+		h.h1->tp_usec = lo;
 		break;
 	case TPACKET_V2:
-		h.h2->tp_sec = ts.tv_sec;
-		h.h2->tp_nsec = ts.tv_nsec;
+		h.h2->tp_sec = hi;
+		h.h2->tp_nsec = lo;
 		break;
 	case TPACKET_V3:
-		h.h3->tp_sec = ts.tv_sec;
-		h.h3->tp_nsec = ts.tv_nsec;
+		h.h3->tp_sec = hi;
+		h.h3->tp_nsec = lo;
 		break;
 	default:
 		WARN(1, "TPACKET version not supported.\n");
@@ -633,7 +673,7 @@ static void init_prb_bdqc(struct packet_sock *po,
 	p1->max_frame_len = p1->kblk_size - BLK_PLUS_PRIV(p1->blk_sizeof_priv);
 	prb_init_ft_ops(p1, req_u);
 	prb_setup_retire_blk_timer(po);
-	prb_open_block(p1, pbd);
+	prb_open_block(p1, pbd, po);
 }
 
 /*  Do NOT update the last_blk_num first.
@@ -730,7 +770,7 @@ static void prb_retire_rx_blk_timer_expired(struct timer_list *t)
 				* opening a block thaws the queue,restarts timer
 				* Thawing/timer-refresh is a side effect.
 				*/
-				prb_open_block(pkc, pbd);
+				prb_open_block(pkc, pbd, po);
 				goto out;
 			}
 		}
@@ -812,10 +852,8 @@ static void prb_close_block(struct tpacket_kbdq_core *pkc1,
 		 * It shouldn't really happen as we don't close empty
 		 * blocks. See prb_retire_rx_blk_timer_expired().
 		 */
-		struct timespec64 ts;
-		ktime_get_real_ts64(&ts);
-		h1->ts_last_pkt.ts_sec = ts.tv_sec;
-		h1->ts_last_pkt.ts_nsec	= ts.tv_nsec;
+		packet_get_time(po, &h1->ts_last_pkt.ts_sec,
+				&h1->ts_last_pkt.ts_nsec);
 	}
 
 	smp_wmb();
@@ -841,9 +879,8 @@ static void prb_thaw_queue(struct tpacket_kbdq_core *pkc)
  *
  */
 static void prb_open_block(struct tpacket_kbdq_core *pkc1,
-	struct tpacket_block_desc *pbd1)
+	struct tpacket_block_desc *pbd1, struct packet_sock *po)
 {
-	struct timespec64 ts;
 	struct tpacket_hdr_v1 *h1 = &pbd1->hdr.bh1;
 
 	smp_rmb();
@@ -856,10 +893,8 @@ static void prb_open_block(struct tpacket_kbdq_core *pkc1,
 	BLOCK_NUM_PKTS(pbd1) = 0;
 	BLOCK_LEN(pbd1) = BLK_PLUS_PRIV(pkc1->blk_sizeof_priv);
 
-	ktime_get_real_ts64(&ts);
-
-	h1->ts_first_pkt.ts_sec = ts.tv_sec;
-	h1->ts_first_pkt.ts_nsec = ts.tv_nsec;
+	packet_get_time(po, &h1->ts_first_pkt.ts_sec,
+			&h1->ts_first_pkt.ts_nsec);
 
 	pkc1->pkblk_start = (char *)pbd1;
 	pkc1->nxt_offset = pkc1->pkblk_start + BLK_PLUS_PRIV(pkc1->blk_sizeof_priv);
@@ -936,7 +971,7 @@ static void *prb_dispatch_next_block(struct tpacket_kbdq_core *pkc,
 	 * open this block and return the offset where the first packet
 	 * needs to get stored.
 	 */
-	prb_open_block(pkc, pbd);
+	prb_open_block(pkc, pbd, po);
 	return (void *)pkc->nxt_offset;
 }
 
@@ -1068,7 +1103,7 @@ static void *__packet_lookup_frame_in_block(struct packet_sock *po,
 			 * opening a block also thaws the queue.
 			 * Thawing is a side effect.
 			 */
-			prb_open_block(pkc, pbd);
+			prb_open_block(pkc, pbd, po);
 		}
 	}
 
@@ -2191,8 +2226,8 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 	unsigned long status = TP_STATUS_USER;
 	unsigned short macoff, netoff, hdrlen;
 	struct sk_buff *copy_skb = NULL;
-	struct timespec64 ts;
 	__u32 ts_status;
+	__u32 hi, lo;
 	bool is_drop_n_account = false;
 	bool do_vnet = false;
 
@@ -2318,8 +2353,8 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 
 	skb_copy_bits(skb, 0, h.raw + macoff, snaplen);
 
-	if (!(ts_status = tpacket_get_timestamp(skb, &ts, po->tp_tstamp)))
-		ktime_get_real_ts64(&ts);
+	if (!(ts_status = tpacket_get_timestamp(skb, &hi, &lo, po->tp_tstamp)))
+		packet_get_time(po, &hi, &lo);
 
 	status |= ts_status;
 
@@ -2329,8 +2364,8 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 		h.h1->tp_snaplen = snaplen;
 		h.h1->tp_mac = macoff;
 		h.h1->tp_net = netoff;
-		h.h1->tp_sec = ts.tv_sec;
-		h.h1->tp_usec = ts.tv_nsec / NSEC_PER_USEC;
+		h.h1->tp_sec = hi;
+		h.h1->tp_usec = lo;
 		hdrlen = sizeof(*h.h1);
 		break;
 	case TPACKET_V2:
@@ -2338,8 +2373,8 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 		h.h2->tp_snaplen = snaplen;
 		h.h2->tp_mac = macoff;
 		h.h2->tp_net = netoff;
-		h.h2->tp_sec = ts.tv_sec;
-		h.h2->tp_nsec = ts.tv_nsec;
+		h.h2->tp_sec = hi;
+		h.h2->tp_nsec = lo;
 		if (skb_vlan_tag_present(skb)) {
 			h.h2->tp_vlan_tci = skb_vlan_tag_get(skb);
 			h.h2->tp_vlan_tpid = ntohs(skb->vlan_proto);
@@ -2360,8 +2395,8 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 		h.h3->tp_snaplen = snaplen;
 		h.h3->tp_mac = macoff;
 		h.h3->tp_net = netoff;
-		h.h3->tp_sec  = ts.tv_sec;
-		h.h3->tp_nsec = ts.tv_nsec;
+		h.h3->tp_sec  = hi;
+		h.h3->tp_nsec = lo;
 		memset(h.h3->tp_padding, 0, sizeof(h.h3->tp_padding));
 		hdrlen = sizeof(*h.h3);
 		break;
-- 
2.9.0

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] [RFC] packet: experimental support for 64-bit timestamps
  2017-11-28  8:46           ` Arnd Bergmann
  2017-11-28 13:14             ` [PATCH] [RFC v2] " Arnd Bergmann
@ 2017-11-28 14:08             ` Willem de Bruijn
  2017-11-28 14:22               ` Arnd Bergmann
  1 sibling, 1 reply; 24+ messages in thread
From: Willem de Bruijn @ 2017-11-28 14:08 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Björn Töpel, Jiri Pirko, David S. Miller,
	y2038 Mailman List, Shuah Khan, Willem de Bruijn, Mike Maloney,
	Eric Dumazet, Kees Cook, Rosen, Rami, Andrey Konovalov,
	Reshetova, Elena, Sowmini Varadhan, Network Development, LKML,
	linux-kselftest, Björn Töpel

On Tue, Nov 28, 2017 at 3:46 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tue, Nov 28, 2017 at 8:04 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
>> 2017-11-27 21:51 GMT+01:00 Arnd Bergmann <arnd@arndb.de>:
>> [...]
>>>> There already is an effort to come up with a new AF_PACKET V4 [1].
>>>> We should make sure that any new interface does not have the
>>>> Y2038/Y2106 issue. But, if a new version is being developed and
>>>> that subsumes all existing use cases, then there probably is no need
>>>> for another version that is a very small diff to V3.
>>>
>>> Ah, perfect, that's good timing. Adding Björn to Cc here.
>>>
>>
>> Unfortunately, for the Y2038/Y2106 cases, we'll be (as a result of
>> netdevconf discussions) moving the AF_PACKET V4 implementation to a
>> separate, new, address/packet family.
>
> Ok, I see.

Does it matter whether the replacement is a new version or a
new packet family?

>>>> If adding support for existing applications is useful, another approach
>>>> would be to add a new socket option that changes the semantics for
>>>> the two u32 fields in each of V1, V2 and V3 to hold nsec. Add a single
>>>> check after filling in those structs whether the option is set and, if so,
>>>> overwrite the two fields.
>>>>
>>>> [1] https://lwn.net/Articles/737947/
>>>
>>> I don't think that's necessary. As long as the V4 capabilities are a
>>> superset of V1-V3, we should be able to just require all users to
>>> move to V4 (or later) in the next 89 years, and make sure that they
>>> use unsigned seconds if they care about 2038.
>>>
>>
>> Given that V4 wont be around for AF_PACKET -- at least not in the
>> shape of our patches -- Willem's suggestion is probably a good way
>> forward.
>
> That leaves one question: should we do that now, or wait until some
> other reason for a V4 comes up? I don't mind creating another
> patch for this, just want to get a feeling of whether the API clutter
> is worth it when we have a way out that works until y2106 (at
> which point we run into other problems as well).

I don't expect that we'll have another packet version independent
from the work that Björn is doing. The choice to implement using
a new packet family is given by the complexity of the existing code,
especially the various locking mechanisms.

>From that point of view, and if we want to offer a Y2106 proof
AF_PACKET independent from the above, no reason to wait.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] [RFC] packet: experimental support for 64-bit timestamps
  2017-11-28 14:08             ` [PATCH 2/2] [RFC] " Willem de Bruijn
@ 2017-11-28 14:22               ` Arnd Bergmann
  0 siblings, 0 replies; 24+ messages in thread
From: Arnd Bergmann @ 2017-11-28 14:22 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Björn Töpel, Jiri Pirko, David S. Miller,
	y2038 Mailman List, Shuah Khan, Willem de Bruijn, Mike Maloney,
	Eric Dumazet, Kees Cook, Rosen, Rami, Andrey Konovalov,
	Reshetova, Elena, Sowmini Varadhan, Network Development, LKML,
	linux-kselftest, Björn Töpel

On Tue, Nov 28, 2017 at 3:08 PM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> On Tue, Nov 28, 2017 at 3:46 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>> On Tue, Nov 28, 2017 at 8:04 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
>>> 2017-11-27 21:51 GMT+01:00 Arnd Bergmann <arnd@arndb.de>:
>>> [...]
>>>>> There already is an effort to come up with a new AF_PACKET V4 [1].
>>>>> We should make sure that any new interface does not have the
>>>>> Y2038/Y2106 issue. But, if a new version is being developed and
>>>>> that subsumes all existing use cases, then there probably is no need
>>>>> for another version that is a very small diff to V3.
>>>>
>>>> Ah, perfect, that's good timing. Adding Björn to Cc here.
>>>>
>>>
>>> Unfortunately, for the Y2038/Y2106 cases, we'll be (as a result of
>>> netdevconf discussions) moving the AF_PACKET V4 implementation to a
>>> separate, new, address/packet family.
>>
>> Ok, I see.
>
> Does it matter whether the replacement is a new version or a
> new packet family?

It depends on whether the new packet family provides a superset of
the AF_PACKET features or not. If we can expect that all users of
AF_PACKET can migrate to the replacement over time, then doing
it there is sufficient, otherwise adding 64-bit timestamps into AF_PACKET
may be a better way to upgrade existing users.

>>>>> If adding support for existing applications is useful, another approach
>>>>> would be to add a new socket option that changes the semantics for
>>>>> the two u32 fields in each of V1, V2 and V3 to hold nsec. Add a single
>>>>> check after filling in those structs whether the option is set and, if so,
>>>>> overwrite the two fields.
>>>>>
>>>>> [1] https://lwn.net/Articles/737947/
>>>>
>>>> I don't think that's necessary. As long as the V4 capabilities are a
>>>> superset of V1-V3, we should be able to just require all users to
>>>> move to V4 (or later) in the next 89 years, and make sure that they
>>>> use unsigned seconds if they care about 2038.
>>>>
>>>
>>> Given that V4 wont be around for AF_PACKET -- at least not in the
>>> shape of our patches -- Willem's suggestion is probably a good way
>>> forward.
>>
>> That leaves one question: should we do that now, or wait until some
>> other reason for a V4 comes up? I don't mind creating another
>> patch for this, just want to get a feeling of whether the API clutter
>> is worth it when we have a way out that works until y2106 (at
>> which point we run into other problems as well).
>
> I don't expect that we'll have another packet version independent
> from the work that Björn is doing. The choice to implement using
> a new packet family is given by the complexity of the existing code,
> especially the various locking mechanisms.

Ok.

> From that point of view, and if we want to offer a Y2106 proof
> AF_PACKET independent from the above, no reason to wait.

Agreed.

       Arnd

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] [RFC v2] packet: experimental support for 64-bit timestamps
  2017-11-28 13:14             ` [PATCH] [RFC v2] " Arnd Bergmann
@ 2017-11-28 14:24               ` Willem de Bruijn
  2017-11-28 14:45                 ` Arnd Bergmann
  2017-11-28 15:05               ` David Miller
  1 sibling, 1 reply; 24+ messages in thread
From: Willem de Bruijn @ 2017-11-28 14:24 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: David S. Miller, Miroslav Lichvar, Willem de Bruijn,
	Björn Töpel, Greg Kroah-Hartman, Francis Yan,
	Eric Dumazet, Kees Cook, Rosen, Rami, Andrey Konovalov,
	Mike Maloney, Sowmini Varadhan, LKML, Network Development

On Tue, Nov 28, 2017 at 8:14 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> This is a second attempt to allow 64-bit timestamps in packet sockets,

Thanks for coding up this variant.

> The implementation is fairly straightforward, but I'm less sure about the
> interface. Using SOF_TIMESTAMPING_* flags in PACKET_TIMESTAMP is a bit
> odd already since most of the other flags make no sense here.  Adding two
> more flags that only make sense for packet sockets but not the normal
> SO_TIMESTAMPING option on other sockets makes this even more confusing.

Agreed.

Unfortunately, we're already stuck with SOL_PACKET/PACKET_TIMESTAMP
accepting SOF_TIMESTAMPING_RAW_HARDWARE.

Perhaps we can define a new PF_PACKET specific enum where the
equivalent option has the same value, so is backwards compatible:

enum {
      PACKET_TIMESTAMP_ORIG = 0,
      PACKET_TIMESTAMP_ZERO = 1 << 0,
      PACKET_TIMESTAMP_NS64 = 1 << 1,
      PACKET_TIMESTAMP_HW = 1 << 6
};

and  BUILD_BUG_ON(PACKET_TIMESTAMP_RAW != SOF_TIMESTAMPING_RAW_HARDWARE)
to document the dependency.

At high level, the code looks great to me, itself.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] [RFC v2] packet: experimental support for 64-bit timestamps
  2017-11-28 14:24               ` Willem de Bruijn
@ 2017-11-28 14:45                 ` Arnd Bergmann
  0 siblings, 0 replies; 24+ messages in thread
From: Arnd Bergmann @ 2017-11-28 14:45 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: David S. Miller, Miroslav Lichvar, Willem de Bruijn,
	Björn Töpel, Greg Kroah-Hartman, Francis Yan,
	Eric Dumazet, Kees Cook, Rosen, Rami, Andrey Konovalov,
	Mike Maloney, Sowmini Varadhan, LKML, Network Development

On Tue, Nov 28, 2017 at 3:24 PM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> On Tue, Nov 28, 2017 at 8:14 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>
> Unfortunately, we're already stuck with SOL_PACKET/PACKET_TIMESTAMP
> accepting SOF_TIMESTAMPING_RAW_HARDWARE.
>
> Perhaps we can define a new PF_PACKET specific enum where the
> equivalent option has the same value, so is backwards compatible:
>
> enum {
>       PACKET_TIMESTAMP_ORIG = 0,
>       PACKET_TIMESTAMP_ZERO = 1 << 0,
>       PACKET_TIMESTAMP_NS64 = 1 << 1,
>       PACKET_TIMESTAMP_HW = 1 << 6
> };
>
> and  BUILD_BUG_ON(PACKET_TIMESTAMP_RAW != SOF_TIMESTAMPING_RAW_HARDWARE)
> to document the dependency.

Yes, that would be much cleaner, but in order to do this, we have to
be relatively
confident that existing users don't pass any flags other than
SOF_TIMESTAMPING_RAW_HARDWARE. It might be safer to start
at bit 31 for the new flags to minimize that risk:

/*
 * we used to pass  SOF_TIMESTAMPING_RAW_HARDWARE from user space,
 * but really want our own flags here, so define PACKET_TIMESTAMP_HW to the
 * existing value and use the high bits for new non-overlapping flags.
 */
enum {
       PACKET_TIMESTAMP_ORIG = 0,
       PACKET_TIMESTAMP_HW = 1 << 6,
       PACKET_TIMESTAMP_ZERO = 1 << 30,
       PACKET_TIMESTAMP_NS64 = 1 << 31,
};

> At high level, the code looks great to me, itself.

Thanks,

        Arnd

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] [RFC v2] packet: experimental support for 64-bit timestamps
  2017-11-28 13:14             ` [PATCH] [RFC v2] " Arnd Bergmann
  2017-11-28 14:24               ` Willem de Bruijn
@ 2017-11-28 15:05               ` David Miller
  2017-11-28 20:02                 ` Arnd Bergmann
  1 sibling, 1 reply; 24+ messages in thread
From: David Miller @ 2017-11-28 15:05 UTC (permalink / raw)
  To: arnd
  Cc: mlichvar, willemb, bjorn.topel, gregkh, francisyyan, edumazet,
	keescook, rami.rosen, andreyknvl, maloney, sowmini.varadhan,
	linux-kernel, netdev

From: Arnd Bergmann <arnd@arndb.de>
Date: Tue, 28 Nov 2017 14:14:05 +0100

> The implementation is fairly straightforward, but I'm less sure about the
> interface. Using SOF_TIMESTAMPING_* flags in PACKET_TIMESTAMP is a bit
> odd already since most of the other flags make no sense here.  Adding two
> more flags that only make sense for packet sockets but not the normal
> SO_TIMESTAMPING option on other sockets makes this even more confusing.

We unfortunately never enforced any checking whatsoever of the
PACKET_TIMESTAMP mask the user gives us, we accept anything.

That makes any changes in this area effectively a grenade ready to go
off potentially at any moment.

We can't add new checks without potentially making existing apps stop
working.  And at least theoretically if we add new bits it is possible
for an existing app passing those bits in by accident to start
behaving improperly.

I know it sounds like overkill for this, but maybe we can add a new
socket option for the SKIP and 64-bit stuff.  That would be %100 safe.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] [RFC v2] packet: experimental support for 64-bit timestamps
  2017-11-28 15:05               ` David Miller
@ 2017-11-28 20:02                 ` Arnd Bergmann
  2017-11-28 20:08                   ` David Miller
  0 siblings, 1 reply; 24+ messages in thread
From: Arnd Bergmann @ 2017-11-28 20:02 UTC (permalink / raw)
  To: David Miller
  Cc: Miroslav Lichvar, Willem de Bruijn, Björn Töpel,
	gregkh, Francis Yan, Eric Dumazet, Kees Cook, Rosen, Rami,
	Andrey Konovalov, Mike Maloney, Sowmini Varadhan,
	Linux Kernel Mailing List, Networking

On Tue, Nov 28, 2017 at 4:05 PM, David Miller <davem@davemloft.net> wrote:
> From: Arnd Bergmann <arnd@arndb.de>
> Date: Tue, 28 Nov 2017 14:14:05 +0100
>
>> The implementation is fairly straightforward, but I'm less sure about the
>> interface. Using SOF_TIMESTAMPING_* flags in PACKET_TIMESTAMP is a bit
>> odd already since most of the other flags make no sense here.  Adding two
>> more flags that only make sense for packet sockets but not the normal
>> SO_TIMESTAMPING option on other sockets makes this even more confusing.
>
> We unfortunately never enforced any checking whatsoever of the
> PACKET_TIMESTAMP mask the user gives us, we accept anything.
>
> That makes any changes in this area effectively a grenade ready to go
> off potentially at any moment.
>
> We can't add new checks without potentially making existing apps stop
> working.  And at least theoretically if we add new bits it is possible
> for an existing app passing those bits in by accident to start
> behaving improperly.
>
> I know it sounds like overkill for this, but maybe we can add a new
> socket option for the SKIP and 64-bit stuff.  That would be %100 safe.

Sure, it's easy enough to do, I'll cook something up.

Does this mean you think the general idea of an extended interface
for 64-bit timestamps is useful for traditional packet sockets? I think
that was still an open question, though we seem to be getting closer
to consensus on the implementation and the interface that it should
use if we want it.

       Arnd

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] [RFC v2] packet: experimental support for 64-bit timestamps
  2017-11-28 20:02                 ` Arnd Bergmann
@ 2017-11-28 20:08                   ` David Miller
  2017-11-28 20:31                     ` Arnd Bergmann
  0 siblings, 1 reply; 24+ messages in thread
From: David Miller @ 2017-11-28 20:08 UTC (permalink / raw)
  To: arnd
  Cc: mlichvar, willemb, bjorn.topel, gregkh, francisyyan, edumazet,
	keescook, rami.rosen, andreyknvl, maloney, sowmini.varadhan,
	linux-kernel, netdev

From: Arnd Bergmann <arnd@arndb.de>
Date: Tue, 28 Nov 2017 21:02:05 +0100

> Does this mean you think the general idea of an extended interface
> for 64-bit timestamps is useful for traditional packet sockets? I
> think that was still an open question, though we seem to be getting
> closer to consensus on the implementation and the interface that it
> should use if we want it.

If it can be done reasonably easy, which you patches seem to indicate
is the case, I have no objections to extending packet socket for
64-bit timestamps.

I hope that AF_CAPTURE will be designed in such a way that all apps
can migrate to it, and I will be making sure it is implemented
appropriately with that in mind.

But I don't think AF_CAPTURE should block your work here.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] [RFC v2] packet: experimental support for 64-bit timestamps
  2017-11-28 20:08                   ` David Miller
@ 2017-11-28 20:31                     ` Arnd Bergmann
  2017-11-28 20:32                       ` [PATCH] [RFC v3] " Arnd Bergmann
  2017-11-29 13:45                       ` [PATCH] [RFC v4] " Arnd Bergmann
  0 siblings, 2 replies; 24+ messages in thread
From: Arnd Bergmann @ 2017-11-28 20:31 UTC (permalink / raw)
  To: David Miller
  Cc: Miroslav Lichvar, Willem de Bruijn, Björn Töpel,
	gregkh, Francis Yan, Eric Dumazet, Kees Cook, Rosen, Rami,
	Andrey Konovalov, Mike Maloney, Sowmini Varadhan,
	Linux Kernel Mailing List, Networking

On Tue, Nov 28, 2017 at 9:08 PM, David Miller <davem@davemloft.net> wrote:
> From: Arnd Bergmann <arnd@arndb.de>
> Date: Tue, 28 Nov 2017 21:02:05 +0100
>
>> Does this mean you think the general idea of an extended interface
>> for 64-bit timestamps is useful for traditional packet sockets? I
>> think that was still an open question, though we seem to be getting
>> closer to consensus on the implementation and the interface that it
>> should use if we want it.
>
> If it can be done reasonably easy, which you patches seem to indicate
> is the case, I have no objections to extending packet socket for
> 64-bit timestamps.
>
> I hope that AF_CAPTURE will be designed in such a way that all apps
> can migrate to it, and I will be making sure it is implemented
> appropriately with that in mind.
>
> But I don't think AF_CAPTURE should block your work here.

To clarify where I'm coming from, my interest is mainly in the first
patch to remove all users of 'struct timespec' in order to weed out
the users that are actually broken in 2038. I added the comment
about it breaking in 2106 and how it could be fixed, and then
decided to try out how ugly that fix would get.

I'll follow up with v3 that should address the comments I got so far,
but a conclusion of 'nobody is asking for it' would be fine too, and
then the patch could get dropped.

I'll also look at the AF_CAPTURE patches to make sure that
the timestamping aspect is handled in a safe way there.

       Arnd

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH] [RFC v3] packet: experimental support for 64-bit timestamps
  2017-11-28 20:31                     ` Arnd Bergmann
@ 2017-11-28 20:32                       ` Arnd Bergmann
  2017-11-28 22:28                         ` Willem de Bruijn
  2017-11-29 13:45                       ` [PATCH] [RFC v4] " Arnd Bergmann
  1 sibling, 1 reply; 24+ messages in thread
From: Arnd Bergmann @ 2017-11-28 20:32 UTC (permalink / raw)
  To: David S. Miller, Willem de Bruijn
  Cc: Björn Töpel, Richard Cochran, Arnd Bergmann,
	Thomas Gleixner, Mike Maloney, Eric Dumazet, Kees Cook,
	Hans Liljestrand, Andrey Konovalov, Rosen, Rami, Reshetova,
	Elena, Sowmini Varadhan, netdev, linux-kernel

As I noticed in my previous patch to remove the 'timespec' usage in
the packet socket, the timestamps in the packet socket are slightly
inefficient as they convert a nanosecond value into seconds/nanoseconds
or seconds/microseconds.

This adds two new socket options for the timestamp to resolve that:

PACKET_SKIPTIMESTAMP sets a flag to indicate whether to generate
timestamps at all. When this is set, all timestamps are hardcoded to
zero, which saves a few cycles for the conversion and the access of
the hardware clocksource. The idea was taken from pktgen, which has an
F_NO_TIMESTAMP option for the same purpose.

PACKET_TIMESTAMP_NS64 changes the interpretation of the time stamp fields:
instead of having 32 bits for seconds plus 32 bits for nanoseconds or
microseconds, we now always send down 64 bits worth of nanoseconds when
this flag is set.

Link: https://patchwork.kernel.org/patch/10077199/
Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
I still have not done any runtime testing on this patch,
only implemented the suggestions from the previous versions.

While I don't think anyone is actively looking for this feature,
I don't think there are any reasons left against merging it either,
and it might come in handy for someone.
---
 include/uapi/linux/if_packet.h |   2 +
 net/packet/af_packet.c         | 159 +++++++++++++++++++++++++++++------------
 net/packet/internal.h          |   2 +
 3 files changed, 116 insertions(+), 47 deletions(-)

diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
index 67b61d91d89b..2eba54770e6b 100644
--- a/include/uapi/linux/if_packet.h
+++ b/include/uapi/linux/if_packet.h
@@ -57,6 +57,8 @@ struct sockaddr_ll {
 #define PACKET_QDISC_BYPASS		20
 #define PACKET_ROLLOVER_STATS		21
 #define PACKET_FANOUT_DATA		22
+#define PACKET_SKIPTIMESTAMP		23
+#define PACKET_TIMESTAMP_NS64		24
 
 #define PACKET_FANOUT_HASH		0
 #define PACKET_FANOUT_LB		1
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 7432c6699818..ed6291b564a9 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -200,7 +200,7 @@ static void prb_retire_current_block(struct tpacket_kbdq_core *,
 		struct packet_sock *, unsigned int status);
 static int prb_queue_frozen(struct tpacket_kbdq_core *);
 static void prb_open_block(struct tpacket_kbdq_core *,
-		struct tpacket_block_desc *);
+		struct tpacket_block_desc *, struct packet_sock *);
 static void prb_retire_rx_blk_timer_expired(struct timer_list *);
 static void _prb_refresh_rx_retire_blk_timer(struct tpacket_kbdq_core *);
 static void prb_fill_rxhash(struct tpacket_kbdq_core *, struct tpacket3_hdr *);
@@ -439,52 +439,92 @@ static int __packet_get_status(struct packet_sock *po, void *frame)
 	}
 }
 
-static __u32 tpacket_get_timestamp(struct sk_buff *skb, struct timespec64 *ts,
+static __u32 tpacket_get_timestamp(struct sk_buff *skb, __u32 *hi, __u32 *lo,
 				   unsigned int flags)
 {
+	struct packet_sock *po = pkt_sk(skb->sk);
 	struct skb_shared_hwtstamps *shhwtstamps = skb_hwtstamps(skb);
+	ktime_t stamp;
+	u32 type;
+
+	if (po->tp_skiptstamp)
+		return 0;
 
 	if (shhwtstamps &&
-	    (flags & SOF_TIMESTAMPING_RAW_HARDWARE) &&
-	    ktime_to_timespec64_cond(shhwtstamps->hwtstamp, ts))
-		return TP_STATUS_TS_RAW_HARDWARE;
+	    (po->tp_tstamp & SOF_TIMESTAMPING_RAW_HARDWARE) &&
+	    shhwtstamps->hwtstamp) {
+		stamp = shhwtstamps->hwtstamp;
+		type = TP_STATUS_TS_RAW_HARDWARE;
+	} else if (skb->tstamp) {
+		stamp = skb->tstamp;
+		type = TP_STATUS_TS_SOFTWARE;
+	} else {
+		return 0;
+	}
 
-	if (ktime_to_timespec64_cond(skb->tstamp, ts))
-		return TP_STATUS_TS_SOFTWARE;
+	if (po->tp_tstamp_ns64) {
+		__u64 ns = ktime_to_ns(stamp);
 
-	return 0;
+		*hi = upper_32_bits(ns);
+		*lo = lower_32_bits(ns);
+	} else {
+		struct timespec64 ts = ktime_to_timespec64(stamp);
+
+		*hi = ts.tv_sec;
+		if (po->tp_version == TPACKET_V1)
+			*lo = ts.tv_nsec / NSEC_PER_USEC;
+		else
+			*lo = ts.tv_nsec;
+	}
+
+	return type;
+}
+
+static void packet_get_time(struct packet_sock *po, __u32 *hi, __u32 *lo)
+{
+	if (po->tp_skiptstamp) {
+		*hi = 0;
+		*lo = 0;
+	} else if (po->tp_tstamp_ns64) {
+		__u64 ns = ktime_get_real_ns();
+
+		*hi = upper_32_bits(ns);
+		*hi = lower_32_bits(ns);
+	} else {
+		struct timespec64 ts;
+
+		ktime_get_real_ts64(&ts);
+		/* unsigned seconds overflow in y2106 here */
+		*hi = ts.tv_sec;
+		if (po->tp_version == TPACKET_V1)
+			*lo = ts.tv_nsec / NSEC_PER_USEC;
+		else
+			*lo = ts.tv_nsec;
+	}
 }
 
 static __u32 __packet_set_timestamp(struct packet_sock *po, void *frame,
 				    struct sk_buff *skb)
 {
 	union tpacket_uhdr h;
-	struct timespec64 ts;
-	__u32 ts_status;
+	__u32 ts_status, hi, lo;
 
-	if (!(ts_status = tpacket_get_timestamp(skb, &ts, po->tp_tstamp)))
+	if (!(ts_status = tpacket_get_timestamp(skb, &hi, &lo, po->tp_tstamp)))
 		return 0;
 
 	h.raw = frame;
-	/*
-	 * versions 1 through 3 overflow the timestamps in y2106, since they
-	 * all store the seconds in a 32-bit unsigned integer.
-	 * If we create a version 4, that should have a 64-bit timestamp,
-	 * either 64-bit seconds + 32-bit nanoseconds, or just 64-bit
-	 * nanoseconds.
-	 */
 	switch (po->tp_version) {
 	case TPACKET_V1:
-		h.h1->tp_sec = ts.tv_sec;
-		h.h1->tp_usec = ts.tv_nsec / NSEC_PER_USEC;
+		h.h1->tp_sec = hi;
+		h.h1->tp_usec = lo;
 		break;
 	case TPACKET_V2:
-		h.h2->tp_sec = ts.tv_sec;
-		h.h2->tp_nsec = ts.tv_nsec;
+		h.h2->tp_sec = hi;
+		h.h2->tp_nsec = lo;
 		break;
 	case TPACKET_V3:
-		h.h3->tp_sec = ts.tv_sec;
-		h.h3->tp_nsec = ts.tv_nsec;
+		h.h3->tp_sec = hi;
+		h.h3->tp_nsec = lo;
 		break;
 	default:
 		WARN(1, "TPACKET version not supported.\n");
@@ -633,7 +673,7 @@ static void init_prb_bdqc(struct packet_sock *po,
 	p1->max_frame_len = p1->kblk_size - BLK_PLUS_PRIV(p1->blk_sizeof_priv);
 	prb_init_ft_ops(p1, req_u);
 	prb_setup_retire_blk_timer(po);
-	prb_open_block(p1, pbd);
+	prb_open_block(p1, pbd, po);
 }
 
 /*  Do NOT update the last_blk_num first.
@@ -730,7 +770,7 @@ static void prb_retire_rx_blk_timer_expired(struct timer_list *t)
 				* opening a block thaws the queue,restarts timer
 				* Thawing/timer-refresh is a side effect.
 				*/
-				prb_open_block(pkc, pbd);
+				prb_open_block(pkc, pbd, po);
 				goto out;
 			}
 		}
@@ -812,10 +852,8 @@ static void prb_close_block(struct tpacket_kbdq_core *pkc1,
 		 * It shouldn't really happen as we don't close empty
 		 * blocks. See prb_retire_rx_blk_timer_expired().
 		 */
-		struct timespec64 ts;
-		ktime_get_real_ts64(&ts);
-		h1->ts_last_pkt.ts_sec = ts.tv_sec;
-		h1->ts_last_pkt.ts_nsec	= ts.tv_nsec;
+		packet_get_time(po, &h1->ts_last_pkt.ts_sec,
+				&h1->ts_last_pkt.ts_nsec);
 	}
 
 	smp_wmb();
@@ -841,9 +879,8 @@ static void prb_thaw_queue(struct tpacket_kbdq_core *pkc)
  *
  */
 static void prb_open_block(struct tpacket_kbdq_core *pkc1,
-	struct tpacket_block_desc *pbd1)
+	struct tpacket_block_desc *pbd1, struct packet_sock *po)
 {
-	struct timespec64 ts;
 	struct tpacket_hdr_v1 *h1 = &pbd1->hdr.bh1;
 
 	smp_rmb();
@@ -856,10 +893,8 @@ static void prb_open_block(struct tpacket_kbdq_core *pkc1,
 	BLOCK_NUM_PKTS(pbd1) = 0;
 	BLOCK_LEN(pbd1) = BLK_PLUS_PRIV(pkc1->blk_sizeof_priv);
 
-	ktime_get_real_ts64(&ts);
-
-	h1->ts_first_pkt.ts_sec = ts.tv_sec;
-	h1->ts_first_pkt.ts_nsec = ts.tv_nsec;
+	packet_get_time(po, &h1->ts_first_pkt.ts_sec,
+			&h1->ts_first_pkt.ts_nsec);
 
 	pkc1->pkblk_start = (char *)pbd1;
 	pkc1->nxt_offset = pkc1->pkblk_start + BLK_PLUS_PRIV(pkc1->blk_sizeof_priv);
@@ -936,7 +971,7 @@ static void *prb_dispatch_next_block(struct tpacket_kbdq_core *pkc,
 	 * open this block and return the offset where the first packet
 	 * needs to get stored.
 	 */
-	prb_open_block(pkc, pbd);
+	prb_open_block(pkc, pbd, po);
 	return (void *)pkc->nxt_offset;
 }
 
@@ -1068,7 +1103,7 @@ static void *__packet_lookup_frame_in_block(struct packet_sock *po,
 			 * opening a block also thaws the queue.
 			 * Thawing is a side effect.
 			 */
-			prb_open_block(pkc, pbd);
+			prb_open_block(pkc, pbd, po);
 		}
 	}
 
@@ -2191,8 +2226,8 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 	unsigned long status = TP_STATUS_USER;
 	unsigned short macoff, netoff, hdrlen;
 	struct sk_buff *copy_skb = NULL;
-	struct timespec64 ts;
 	__u32 ts_status;
+	__u32 hi, lo;
 	bool is_drop_n_account = false;
 	bool do_vnet = false;
 
@@ -2318,8 +2353,8 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 
 	skb_copy_bits(skb, 0, h.raw + macoff, snaplen);
 
-	if (!(ts_status = tpacket_get_timestamp(skb, &ts, po->tp_tstamp)))
-		ktime_get_real_ts64(&ts);
+	if (!(ts_status = tpacket_get_timestamp(skb, &hi, &lo, po->tp_tstamp)))
+		packet_get_time(po, &hi, &lo);
 
 	status |= ts_status;
 
@@ -2329,8 +2364,8 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 		h.h1->tp_snaplen = snaplen;
 		h.h1->tp_mac = macoff;
 		h.h1->tp_net = netoff;
-		h.h1->tp_sec = ts.tv_sec;
-		h.h1->tp_usec = ts.tv_nsec / NSEC_PER_USEC;
+		h.h1->tp_sec = hi;
+		h.h1->tp_usec = lo;
 		hdrlen = sizeof(*h.h1);
 		break;
 	case TPACKET_V2:
@@ -2338,8 +2373,8 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 		h.h2->tp_snaplen = snaplen;
 		h.h2->tp_mac = macoff;
 		h.h2->tp_net = netoff;
-		h.h2->tp_sec = ts.tv_sec;
-		h.h2->tp_nsec = ts.tv_nsec;
+		h.h2->tp_sec = hi;
+		h.h2->tp_nsec = lo;
 		if (skb_vlan_tag_present(skb)) {
 			h.h2->tp_vlan_tci = skb_vlan_tag_get(skb);
 			h.h2->tp_vlan_tpid = ntohs(skb->vlan_proto);
@@ -2360,8 +2395,8 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 		h.h3->tp_snaplen = snaplen;
 		h.h3->tp_mac = macoff;
 		h.h3->tp_net = netoff;
-		h.h3->tp_sec  = ts.tv_sec;
-		h.h3->tp_nsec = ts.tv_nsec;
+		h.h3->tp_sec  = hi;
+		h.h3->tp_nsec = lo;
 		memset(h.h3->tp_padding, 0, sizeof(h.h3->tp_padding));
 		hdrlen = sizeof(*h.h3);
 		break;
@@ -3792,6 +3827,30 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 		po->tp_tstamp = val;
 		return 0;
 	}
+	case PACKET_SKIPTIMESTAMP:
+	{
+		int val;
+
+		if (optlen != sizeof(val))
+			return -EINVAL;
+		if (copy_from_user(&val, optval, sizeof(val)))
+			return -EFAULT;
+
+		po->tp_skiptstamp = val;
+		return 0;
+	}
+	case PACKET_TIMESTAMP_NS64:
+	{
+		int val;
+
+		if (optlen != sizeof(val))
+			return -EINVAL;
+		if (copy_from_user(&val, optval, sizeof(val)))
+			return -EFAULT;
+
+		po->tp_tstamp_ns64 = val;
+		return 0;
+	}
 	case PACKET_FANOUT:
 	{
 		int val;
@@ -3921,6 +3980,12 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
 	case PACKET_TIMESTAMP:
 		val = po->tp_tstamp;
 		break;
+	case PACKET_SKIPTIMESTAMP:
+		val = po->tp_skiptstamp;
+		break;
+	case PACKET_TIMESTAMP_NS64:
+		val = po->tp_tstamp_ns64;
+		break;
 	case PACKET_FANOUT:
 		val = (po->fanout ?
 		       ((u32)po->fanout->id |
diff --git a/net/packet/internal.h b/net/packet/internal.h
index 562fbc155006..20b69512210f 100644
--- a/net/packet/internal.h
+++ b/net/packet/internal.h
@@ -128,6 +128,8 @@ struct packet_sock {
 	unsigned int		tp_reserve;
 	unsigned int		tp_loss:1;
 	unsigned int		tp_tx_has_off:1;
+	unsigned int		tp_skiptstamp:1;
+	unsigned int		tp_tstamp_ns64:1;
 	unsigned int		tp_tstamp;
 	struct net_device __rcu	*cached_dev;
 	int			(*xmit)(struct sk_buff *skb);
-- 
2.9.0

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] [RFC v3] packet: experimental support for 64-bit timestamps
  2017-11-28 20:32                       ` [PATCH] [RFC v3] " Arnd Bergmann
@ 2017-11-28 22:28                         ` Willem de Bruijn
  2017-11-29 12:31                           ` Arnd Bergmann
  0 siblings, 1 reply; 24+ messages in thread
From: Willem de Bruijn @ 2017-11-28 22:28 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: David S. Miller, Willem de Bruijn, Björn Töpel,
	Richard Cochran, Thomas Gleixner, Mike Maloney, Eric Dumazet,
	Kees Cook, Hans Liljestrand, Andrey Konovalov, Rosen, Rami,
	Reshetova, Elena, Sowmini Varadhan, Network Development, LKML

On Tue, Nov 28, 2017 at 3:32 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> As I noticed in my previous patch to remove the 'timespec' usage in
> the packet socket, the timestamps in the packet socket are slightly
> inefficient as they convert a nanosecond value into seconds/nanoseconds
> or seconds/microseconds.
>
> This adds two new socket options for the timestamp to resolve that:
>
> PACKET_SKIPTIMESTAMP sets a flag to indicate whether to generate
> timestamps at all. When this is set, all timestamps are hardcoded to
> zero, which saves a few cycles for the conversion and the access of
> the hardware clocksource. The idea was taken from pktgen, which has an
> F_NO_TIMESTAMP option for the same purpose.
>
> PACKET_TIMESTAMP_NS64 changes the interpretation of the time stamp fields:
> instead of having 32 bits for seconds plus 32 bits for nanoseconds or
> microseconds, we now always send down 64 bits worth of nanoseconds when
> this flag is set.
>
> Link: https://patchwork.kernel.org/patch/10077199/
> Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>

This works. Another option would be to add a PACKET_TIMESTAMP_EX
with the semantics we discussed previously + fail hard when any undefined
bits are set. I don't feel strong either way, we don't intend to extend further.

If taking this approach, it might be good to split into separate patches, one
for each flag?

> -static __u32 tpacket_get_timestamp(struct sk_buff *skb, struct timespec64 *ts,
> +static __u32 tpacket_get_timestamp(struct sk_buff *skb, __u32 *hi, __u32 *lo,
>                                    unsigned int flags)

Argument flags is no longer used.

>  {
> +       struct packet_sock *po = pkt_sk(skb->sk);
>         struct skb_shared_hwtstamps *shhwtstamps = skb_hwtstamps(skb);
> +       ktime_t stamp;
> +       u32 type;
> +
> +       if (po->tp_skiptstamp)
> +               return 0;
>
>         if (shhwtstamps &&
> -           (flags & SOF_TIMESTAMPING_RAW_HARDWARE) &&
> -           ktime_to_timespec64_cond(shhwtstamps->hwtstamp, ts))
> -               return TP_STATUS_TS_RAW_HARDWARE;
> +           (po->tp_tstamp & SOF_TIMESTAMPING_RAW_HARDWARE) &&
> +           shhwtstamps->hwtstamp) {
> +               stamp = shhwtstamps->hwtstamp;
> +               type = TP_STATUS_TS_RAW_HARDWARE;
> +       } else if (skb->tstamp) {
> +               stamp = skb->tstamp;
> +               type = TP_STATUS_TS_SOFTWARE;
> +       } else {
> +               return 0;
> +       }
>
> -       if (ktime_to_timespec64_cond(skb->tstamp, ts))
> -               return TP_STATUS_TS_SOFTWARE;
> +       if (po->tp_tstamp_ns64) {
> +               __u64 ns = ktime_to_ns(stamp);
>
> -       return 0;
> +               *hi = upper_32_bits(ns);
> +               *lo = lower_32_bits(ns);
> +       } else {
> +               struct timespec64 ts = ktime_to_timespec64(stamp);
> +
> +               *hi = ts.tv_sec;
> +               if (po->tp_version == TPACKET_V1)

Very minor: may want to invert test to make newer the protocols the
likely branch.

>  static __u32 __packet_set_timestamp(struct packet_sock *po, void *frame,
>                                     struct sk_buff *skb)
>  {
>         union tpacket_uhdr h;
> -       struct timespec64 ts;
> -       __u32 ts_status;
> +       __u32 ts_status, hi, lo;
>
> -       if (!(ts_status = tpacket_get_timestamp(skb, &ts, po->tp_tstamp)))
> +       if (!(ts_status = tpacket_get_timestamp(skb, &hi, &lo, po->tp_tstamp)))
>                 return 0;
>
>         h.raw = frame;
> -       /*
> -        * versions 1 through 3 overflow the timestamps in y2106, since they
> -        * all store the seconds in a 32-bit unsigned integer.
> -        * If we create a version 4, that should have a 64-bit timestamp,
> -        * either 64-bit seconds + 32-bit nanoseconds, or just 64-bit
> -        * nanoseconds.
> -        */

Probably no need to introduce this in patch 1/2 when removing it in 2/2.

> @@ -2191,8 +2226,8 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
>         unsigned long status = TP_STATUS_USER;
>         unsigned short macoff, netoff, hdrlen;
>         struct sk_buff *copy_skb = NULL;
> -       struct timespec64 ts;
>         __u32 ts_status;
> +       __u32 hi, lo;

since this function is not time-specific, the context of hi and lo is not
immediately obvious here. tstamp_hi, tstamp_lo? Or even __u32
tstamp[2] and have tpacket_get_timestamp and packet_get_time take
one fewer argument.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] [RFC v3] packet: experimental support for 64-bit timestamps
  2017-11-28 22:28                         ` Willem de Bruijn
@ 2017-11-29 12:31                           ` Arnd Bergmann
  2017-11-29 16:51                             ` Willem de Bruijn
  0 siblings, 1 reply; 24+ messages in thread
From: Arnd Bergmann @ 2017-11-29 12:31 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: David S. Miller, Willem de Bruijn, Björn Töpel,
	Richard Cochran, Thomas Gleixner, Mike Maloney, Eric Dumazet,
	Kees Cook, Hans Liljestrand, Andrey Konovalov, Rosen, Rami,
	Reshetova, Elena, Sowmini Varadhan, Network Development, LKML

On Tue, Nov 28, 2017 at 11:28 PM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> On Tue, Nov 28, 2017 at 3:32 PM, Arnd Bergmann <arnd@arndb.de> wrote:
>> As I noticed in my previous patch to remove the 'timespec' usage in
>> the packet socket, the timestamps in the packet socket are slightly
>> inefficient as they convert a nanosecond value into seconds/nanoseconds
>> or seconds/microseconds.
>>
>> This adds two new socket options for the timestamp to resolve that:
>>
>> PACKET_SKIPTIMESTAMP sets a flag to indicate whether to generate
>> timestamps at all. When this is set, all timestamps are hardcoded to
>> zero, which saves a few cycles for the conversion and the access of
>> the hardware clocksource. The idea was taken from pktgen, which has an
>> F_NO_TIMESTAMP option for the same purpose.
>>
>> PACKET_TIMESTAMP_NS64 changes the interpretation of the time stamp fields:
>> instead of having 32 bits for seconds plus 32 bits for nanoseconds or
>> microseconds, we now always send down 64 bits worth of nanoseconds when
>> this flag is set.
>>
>> Link: https://patchwork.kernel.org/patch/10077199/
>> Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
>> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
>
> This works. Another option would be to add a PACKET_TIMESTAMP_EX
> with the semantics we discussed previously + fail hard when any undefined
> bits are set. I don't feel strong either way, we don't intend to extend further.
>
> If taking this approach, it might be good to split into separate patches, one
> for each flag?
>
>> -static __u32 tpacket_get_timestamp(struct sk_buff *skb, struct timespec64 *ts,
>> +static __u32 tpacket_get_timestamp(struct sk_buff *skb, __u32 *hi, __u32 *lo,
>>                                    unsigned int flags)
>
> Argument flags is no longer used.

Fixed

>> -       return 0;
>> +               *hi = upper_32_bits(ns);
>> +               *lo = lower_32_bits(ns);
>> +       } else {
>> +               struct timespec64 ts = ktime_to_timespec64(stamp);
>> +
>> +               *hi = ts.tv_sec;
>> +               if (po->tp_version == TPACKET_V1)
>
> Very minor: may want to invert test to make newer the protocols the
> likely branch.

Ok. I didn't think this would make any difference to the compiler, but
for readability it seems at least as good, so I've changed it as you suggested
and use "po->tp_version > TPACKET_V1".

>>  static __u32 __packet_set_timestamp(struct packet_sock *po, void *frame,
>>                                     struct sk_buff *skb)
>>  {
>>         union tpacket_uhdr h;
>> -       struct timespec64 ts;
>> -       __u32 ts_status;
>> +       __u32 ts_status, hi, lo;
>>
>> -       if (!(ts_status = tpacket_get_timestamp(skb, &ts, po->tp_tstamp)))
>> +       if (!(ts_status = tpacket_get_timestamp(skb, &hi, &lo, po->tp_tstamp)))
>>                 return 0;
>>
>>         h.raw = frame;
>> -       /*
>> -        * versions 1 through 3 overflow the timestamps in y2106, since they
>> -        * all store the seconds in a 32-bit unsigned integer.
>> -        * If we create a version 4, that should have a 64-bit timestamp,
>> -        * either 64-bit seconds + 32-bit nanoseconds, or just 64-bit
>> -        * nanoseconds.
>> -        */
>
> Probably no need to introduce this in patch 1/2 when removing it in 2/2.

I'm still considering this patch as experimental, since I haven't done any
actual testing on it, so I'm not sure it gets merged at the same time.
If patch 1 gets merged separately, I'd rather keep the comment in place.

>> @@ -2191,8 +2226,8 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
>>         unsigned long status = TP_STATUS_USER;
>>         unsigned short macoff, netoff, hdrlen;
>>         struct sk_buff *copy_skb = NULL;
>> -       struct timespec64 ts;
>>         __u32 ts_status;
>> +       __u32 hi, lo;
>
> since this function is not time-specific, the context of hi and lo is not
> immediately obvious here. tstamp_hi, tstamp_lo? Or even __u32
> tstamp[2] and have tpacket_get_timestamp and packet_get_time take
> one fewer argument.

Fixed.

Thanks for the review! Any suggestions for how to do the testing? If you have
existing test cases, could you give my next version a test run to see if there
are any regressions and if the timestamps work as expected?

I see that there are test cases in tools/testing/selftests/net/, but none
of them seem to use the time stamps so far, and I'm not overly familiar
with how it works in the details to extend it in a meaningful way.

        arnd

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH] [RFC v4] packet: experimental support for 64-bit timestamps
  2017-11-28 20:31                     ` Arnd Bergmann
  2017-11-28 20:32                       ` [PATCH] [RFC v3] " Arnd Bergmann
@ 2017-11-29 13:45                       ` Arnd Bergmann
  1 sibling, 0 replies; 24+ messages in thread
From: Arnd Bergmann @ 2017-11-29 13:45 UTC (permalink / raw)
  To: David S. Miller, Willem de Bruijn
  Cc: y2038, Björn Töpel, Richard Cochran, Arnd Bergmann,
	Mike Maloney, Eric Dumazet, Kees Cook, Andrey Konovalov, Rosen,
	Rami, Sowmini Varadhan, David Windsor, netdev, linux-kernel

As I noticed in my previous patch to remove the 'timespec' usage in
the packet socket, the timestamps in the packet socket are slightly
inefficient as they convert a nanosecond value into seconds/nanoseconds
or seconds/microseconds.

This adds two new socket options for the timestamp to resolve that:

PACKET_SKIPTIMESTAMP sets a flag to indicate whether to generate
timestamps at all. When this is set, all timestamps are hardcoded to
zero, which saves a few cycles for the conversion and the access of
the hardware clocksource. The idea was taken from pktgen, which has an
F_NO_TIMESTAMP option for the same purpose.

PACKET_TIMESTAMP_NS64 changes the interpretation of the time stamp fields:
instead of having 32 bits for seconds plus 32 bits for nanoseconds or
microseconds, we now always send down 64 bits worth of nanoseconds when
this flag is set.

Link: https://patchwork.kernel.org/patch/10077199/
Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
I still have not done any runtime testing on this patch,
only implemented the suggestions from the previous versions.

While I don't think anyone is actively looking for this feature,
I don't think there are any reasons left against merging it either,
and it might come in handy for someone.

v4: address minor comments from Willem
v3: rework to use setsockopt
v2: use new tstamp flags instead of a new version
v1: original implementation using TPACKET_V4
---
 include/uapi/linux/if_packet.h |   2 +
 net/packet/af_packet.c         | 160 ++++++++++++++++++++++++++++-------------
 net/packet/internal.h          |   2 +
 3 files changed, 116 insertions(+), 48 deletions(-)

diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
index 67b61d91d89b..2eba54770e6b 100644
--- a/include/uapi/linux/if_packet.h
+++ b/include/uapi/linux/if_packet.h
@@ -57,6 +57,8 @@ struct sockaddr_ll {
 #define PACKET_QDISC_BYPASS		20
 #define PACKET_ROLLOVER_STATS		21
 #define PACKET_FANOUT_DATA		22
+#define PACKET_SKIPTIMESTAMP		23
+#define PACKET_TIMESTAMP_NS64		24
 
 #define PACKET_FANOUT_HASH		0
 #define PACKET_FANOUT_LB		1
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 7432c6699818..f55f330ab547 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -200,7 +200,7 @@ static void prb_retire_current_block(struct tpacket_kbdq_core *,
 		struct packet_sock *, unsigned int status);
 static int prb_queue_frozen(struct tpacket_kbdq_core *);
 static void prb_open_block(struct tpacket_kbdq_core *,
-		struct tpacket_block_desc *);
+		struct tpacket_block_desc *, struct packet_sock *);
 static void prb_retire_rx_blk_timer_expired(struct timer_list *);
 static void _prb_refresh_rx_retire_blk_timer(struct tpacket_kbdq_core *);
 static void prb_fill_rxhash(struct tpacket_kbdq_core *, struct tpacket3_hdr *);
@@ -439,52 +439,91 @@ static int __packet_get_status(struct packet_sock *po, void *frame)
 	}
 }
 
-static __u32 tpacket_get_timestamp(struct sk_buff *skb, struct timespec64 *ts,
-				   unsigned int flags)
+static __u32 tpacket_get_timestamp(struct sk_buff *skb, __u32 *hi, __u32 *lo)
 {
+	struct packet_sock *po = pkt_sk(skb->sk);
 	struct skb_shared_hwtstamps *shhwtstamps = skb_hwtstamps(skb);
+	ktime_t stamp;
+	u32 type;
+
+	if (po->tp_skiptstamp)
+		return 0;
 
 	if (shhwtstamps &&
-	    (flags & SOF_TIMESTAMPING_RAW_HARDWARE) &&
-	    ktime_to_timespec64_cond(shhwtstamps->hwtstamp, ts))
-		return TP_STATUS_TS_RAW_HARDWARE;
+	    (po->tp_tstamp & SOF_TIMESTAMPING_RAW_HARDWARE) &&
+	    shhwtstamps->hwtstamp) {
+		stamp = shhwtstamps->hwtstamp;
+		type = TP_STATUS_TS_RAW_HARDWARE;
+	} else if (skb->tstamp) {
+		stamp = skb->tstamp;
+		type = TP_STATUS_TS_SOFTWARE;
+	} else {
+		return 0;
+	}
 
-	if (ktime_to_timespec64_cond(skb->tstamp, ts))
-		return TP_STATUS_TS_SOFTWARE;
+	if (po->tp_tstamp_ns64) {
+		__u64 ns = ktime_to_ns(stamp);
 
-	return 0;
+		*hi = upper_32_bits(ns);
+		*lo = lower_32_bits(ns);
+	} else {
+		struct timespec64 ts = ktime_to_timespec64(stamp);
+
+		*hi = ts.tv_sec;
+		if (po->tp_version > TPACKET_V1)
+			*lo = ts.tv_nsec;
+		else
+			*lo = ts.tv_nsec / NSEC_PER_USEC;
+	}
+
+	return type;
+}
+
+static void packet_get_time(struct packet_sock *po, __u32 *hi, __u32 *lo)
+{
+	if (po->tp_skiptstamp) {
+		*hi = 0;
+		*lo = 0;
+	} else if (po->tp_tstamp_ns64) {
+		__u64 ns = ktime_get_real_ns();
+
+		*hi = upper_32_bits(ns);
+		*hi = lower_32_bits(ns);
+	} else {
+		struct timespec64 ts;
+
+		ktime_get_real_ts64(&ts);
+		/* unsigned seconds overflow in y2106 here */
+		*hi = ts.tv_sec;
+		if (po->tp_version > TPACKET_V1)
+			*lo = ts.tv_nsec;
+		else
+			*lo = ts.tv_nsec / NSEC_PER_USEC;
+	}
 }
 
 static __u32 __packet_set_timestamp(struct packet_sock *po, void *frame,
 				    struct sk_buff *skb)
 {
 	union tpacket_uhdr h;
-	struct timespec64 ts;
-	__u32 ts_status;
+	__u32 ts_status, hi, lo;
 
-	if (!(ts_status = tpacket_get_timestamp(skb, &ts, po->tp_tstamp)))
+	if (!(ts_status = tpacket_get_timestamp(skb, &hi, &lo)))
 		return 0;
 
 	h.raw = frame;
-	/*
-	 * versions 1 through 3 overflow the timestamps in y2106, since they
-	 * all store the seconds in a 32-bit unsigned integer.
-	 * If we create a version 4, that should have a 64-bit timestamp,
-	 * either 64-bit seconds + 32-bit nanoseconds, or just 64-bit
-	 * nanoseconds.
-	 */
 	switch (po->tp_version) {
 	case TPACKET_V1:
-		h.h1->tp_sec = ts.tv_sec;
-		h.h1->tp_usec = ts.tv_nsec / NSEC_PER_USEC;
+		h.h1->tp_sec = hi;
+		h.h1->tp_usec = lo;
 		break;
 	case TPACKET_V2:
-		h.h2->tp_sec = ts.tv_sec;
-		h.h2->tp_nsec = ts.tv_nsec;
+		h.h2->tp_sec = hi;
+		h.h2->tp_nsec = lo;
 		break;
 	case TPACKET_V3:
-		h.h3->tp_sec = ts.tv_sec;
-		h.h3->tp_nsec = ts.tv_nsec;
+		h.h3->tp_sec = hi;
+		h.h3->tp_nsec = lo;
 		break;
 	default:
 		WARN(1, "TPACKET version not supported.\n");
@@ -633,7 +672,7 @@ static void init_prb_bdqc(struct packet_sock *po,
 	p1->max_frame_len = p1->kblk_size - BLK_PLUS_PRIV(p1->blk_sizeof_priv);
 	prb_init_ft_ops(p1, req_u);
 	prb_setup_retire_blk_timer(po);
-	prb_open_block(p1, pbd);
+	prb_open_block(p1, pbd, po);
 }
 
 /*  Do NOT update the last_blk_num first.
@@ -730,7 +769,7 @@ static void prb_retire_rx_blk_timer_expired(struct timer_list *t)
 				* opening a block thaws the queue,restarts timer
 				* Thawing/timer-refresh is a side effect.
 				*/
-				prb_open_block(pkc, pbd);
+				prb_open_block(pkc, pbd, po);
 				goto out;
 			}
 		}
@@ -812,10 +851,8 @@ static void prb_close_block(struct tpacket_kbdq_core *pkc1,
 		 * It shouldn't really happen as we don't close empty
 		 * blocks. See prb_retire_rx_blk_timer_expired().
 		 */
-		struct timespec64 ts;
-		ktime_get_real_ts64(&ts);
-		h1->ts_last_pkt.ts_sec = ts.tv_sec;
-		h1->ts_last_pkt.ts_nsec	= ts.tv_nsec;
+		packet_get_time(po, &h1->ts_last_pkt.ts_sec,
+				&h1->ts_last_pkt.ts_nsec);
 	}
 
 	smp_wmb();
@@ -841,9 +878,8 @@ static void prb_thaw_queue(struct tpacket_kbdq_core *pkc)
  *
  */
 static void prb_open_block(struct tpacket_kbdq_core *pkc1,
-	struct tpacket_block_desc *pbd1)
+	struct tpacket_block_desc *pbd1, struct packet_sock *po)
 {
-	struct timespec64 ts;
 	struct tpacket_hdr_v1 *h1 = &pbd1->hdr.bh1;
 
 	smp_rmb();
@@ -856,10 +892,8 @@ static void prb_open_block(struct tpacket_kbdq_core *pkc1,
 	BLOCK_NUM_PKTS(pbd1) = 0;
 	BLOCK_LEN(pbd1) = BLK_PLUS_PRIV(pkc1->blk_sizeof_priv);
 
-	ktime_get_real_ts64(&ts);
-
-	h1->ts_first_pkt.ts_sec = ts.tv_sec;
-	h1->ts_first_pkt.ts_nsec = ts.tv_nsec;
+	packet_get_time(po, &h1->ts_first_pkt.ts_sec,
+			&h1->ts_first_pkt.ts_nsec);
 
 	pkc1->pkblk_start = (char *)pbd1;
 	pkc1->nxt_offset = pkc1->pkblk_start + BLK_PLUS_PRIV(pkc1->blk_sizeof_priv);
@@ -936,7 +970,7 @@ static void *prb_dispatch_next_block(struct tpacket_kbdq_core *pkc,
 	 * open this block and return the offset where the first packet
 	 * needs to get stored.
 	 */
-	prb_open_block(pkc, pbd);
+	prb_open_block(pkc, pbd, po);
 	return (void *)pkc->nxt_offset;
 }
 
@@ -1068,7 +1102,7 @@ static void *__packet_lookup_frame_in_block(struct packet_sock *po,
 			 * opening a block also thaws the queue.
 			 * Thawing is a side effect.
 			 */
-			prb_open_block(pkc, pbd);
+			prb_open_block(pkc, pbd, po);
 		}
 	}
 
@@ -2191,8 +2225,8 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 	unsigned long status = TP_STATUS_USER;
 	unsigned short macoff, netoff, hdrlen;
 	struct sk_buff *copy_skb = NULL;
-	struct timespec64 ts;
 	__u32 ts_status;
+	__u32 tstamp_hi, tstamp_lo;
 	bool is_drop_n_account = false;
 	bool do_vnet = false;
 
@@ -2318,8 +2352,8 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 
 	skb_copy_bits(skb, 0, h.raw + macoff, snaplen);
 
-	if (!(ts_status = tpacket_get_timestamp(skb, &ts, po->tp_tstamp)))
-		ktime_get_real_ts64(&ts);
+	if (!(ts_status = tpacket_get_timestamp(skb, &tstamp_hi, &tstamp_lo)))
+		packet_get_time(po, &tstamp_hi, &tstamp_lo);
 
 	status |= ts_status;
 
@@ -2329,8 +2363,8 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 		h.h1->tp_snaplen = snaplen;
 		h.h1->tp_mac = macoff;
 		h.h1->tp_net = netoff;
-		h.h1->tp_sec = ts.tv_sec;
-		h.h1->tp_usec = ts.tv_nsec / NSEC_PER_USEC;
+		h.h1->tp_sec = tstamp_hi;
+		h.h1->tp_usec = tstamp_lo;
 		hdrlen = sizeof(*h.h1);
 		break;
 	case TPACKET_V2:
@@ -2338,8 +2372,8 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 		h.h2->tp_snaplen = snaplen;
 		h.h2->tp_mac = macoff;
 		h.h2->tp_net = netoff;
-		h.h2->tp_sec = ts.tv_sec;
-		h.h2->tp_nsec = ts.tv_nsec;
+		h.h2->tp_sec = tstamp_hi;
+		h.h2->tp_nsec = tstamp_lo;
 		if (skb_vlan_tag_present(skb)) {
 			h.h2->tp_vlan_tci = skb_vlan_tag_get(skb);
 			h.h2->tp_vlan_tpid = ntohs(skb->vlan_proto);
@@ -2360,8 +2394,8 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 		h.h3->tp_snaplen = snaplen;
 		h.h3->tp_mac = macoff;
 		h.h3->tp_net = netoff;
-		h.h3->tp_sec  = ts.tv_sec;
-		h.h3->tp_nsec = ts.tv_nsec;
+		h.h3->tp_sec  = tstamp_hi;
+		h.h3->tp_nsec = tstamp_lo;
 		memset(h.h3->tp_padding, 0, sizeof(h.h3->tp_padding));
 		hdrlen = sizeof(*h.h3);
 		break;
@@ -3792,6 +3826,30 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 		po->tp_tstamp = val;
 		return 0;
 	}
+	case PACKET_SKIPTIMESTAMP:
+	{
+		int val;
+
+		if (optlen != sizeof(val))
+			return -EINVAL;
+		if (copy_from_user(&val, optval, sizeof(val)))
+			return -EFAULT;
+
+		po->tp_skiptstamp = val;
+		return 0;
+	}
+	case PACKET_TIMESTAMP_NS64:
+	{
+		int val;
+
+		if (optlen != sizeof(val))
+			return -EINVAL;
+		if (copy_from_user(&val, optval, sizeof(val)))
+			return -EFAULT;
+
+		po->tp_tstamp_ns64 = val;
+		return 0;
+	}
 	case PACKET_FANOUT:
 	{
 		int val;
@@ -3921,6 +3979,12 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
 	case PACKET_TIMESTAMP:
 		val = po->tp_tstamp;
 		break;
+	case PACKET_SKIPTIMESTAMP:
+		val = po->tp_skiptstamp;
+		break;
+	case PACKET_TIMESTAMP_NS64:
+		val = po->tp_tstamp_ns64;
+		break;
 	case PACKET_FANOUT:
 		val = (po->fanout ?
 		       ((u32)po->fanout->id |
diff --git a/net/packet/internal.h b/net/packet/internal.h
index 562fbc155006..20b69512210f 100644
--- a/net/packet/internal.h
+++ b/net/packet/internal.h
@@ -128,6 +128,8 @@ struct packet_sock {
 	unsigned int		tp_reserve;
 	unsigned int		tp_loss:1;
 	unsigned int		tp_tx_has_off:1;
+	unsigned int		tp_skiptstamp:1;
+	unsigned int		tp_tstamp_ns64:1;
 	unsigned int		tp_tstamp;
 	struct net_device __rcu	*cached_dev;
 	int			(*xmit)(struct sk_buff *skb);
-- 
2.9.0

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] [RFC v3] packet: experimental support for 64-bit timestamps
  2017-11-29 12:31                           ` Arnd Bergmann
@ 2017-11-29 16:51                             ` Willem de Bruijn
  2017-11-29 20:06                               ` Arnd Bergmann
  0 siblings, 1 reply; 24+ messages in thread
From: Willem de Bruijn @ 2017-11-29 16:51 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: David S. Miller, Willem de Bruijn, Björn Töpel,
	Richard Cochran, Thomas Gleixner, Mike Maloney, Eric Dumazet,
	Kees Cook, Hans Liljestrand, Andrey Konovalov, Rosen, Rami,
	Reshetova, Elena, Sowmini Varadhan, Network Development, LKML

> Thanks for the review! Any suggestions for how to do the testing? If you have
> existing test cases, could you give my next version a test run to see if there
> are any regressions and if the timestamps work as expected?
>
> I see that there are test cases in tools/testing/selftests/net/, but none
> of them seem to use the time stamps so far, and I'm not overly familiar
> with how it works in the details to extend it in a meaningful way.

I could not find any good tests for this interface, either. The only
user of the interface I found was a little tool I wrote a few years
ago that compares timestamps at multiple points in the transmit
path for latency measurement [1]. But it may be easier to just write
a new test under tools/testing/selftests/net for this purpose. I can
help with that, too, if you want.

[1] https://github.com/wdebruij/kerneltools/blob/master/tools/tcplate/tcplate.c

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] [RFC v3] packet: experimental support for 64-bit timestamps
  2017-11-29 16:51                             ` Willem de Bruijn
@ 2017-11-29 20:06                               ` Arnd Bergmann
  2017-11-30  1:39                                 ` Willem de Bruijn
  0 siblings, 1 reply; 24+ messages in thread
From: Arnd Bergmann @ 2017-11-29 20:06 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: David S. Miller, Willem de Bruijn, Björn Töpel,
	Richard Cochran, Thomas Gleixner, Mike Maloney, Eric Dumazet,
	Kees Cook, Hans Liljestrand, Andrey Konovalov, Rosen, Rami,
	Reshetova, Elena, Sowmini Varadhan, Network Development, LKML

On Wed, Nov 29, 2017 at 5:51 PM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>> Thanks for the review! Any suggestions for how to do the testing? If you have
>> existing test cases, could you give my next version a test run to see if there
>> are any regressions and if the timestamps work as expected?
>>
>> I see that there are test cases in tools/testing/selftests/net/, but none
>> of them seem to use the time stamps so far, and I'm not overly familiar
>> with how it works in the details to extend it in a meaningful way.
>
> I could not find any good tests for this interface, either. The only
> user of the interface I found was a little tool I wrote a few years
> ago that compares timestamps at multiple points in the transmit
> path for latency measurement [1]. But it may be easier to just write
> a new test under tools/testing/selftests/net for this purpose. I can
> help with that, too, if you want.

Thanks, that would be great!

     Arnd

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] [RFC v3] packet: experimental support for 64-bit timestamps
  2017-11-29 20:06                               ` Arnd Bergmann
@ 2017-11-30  1:39                                 ` Willem de Bruijn
  2017-11-30  1:54                                   ` Willem de Bruijn
  0 siblings, 1 reply; 24+ messages in thread
From: Willem de Bruijn @ 2017-11-30  1:39 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: David S. Miller, Willem de Bruijn, Björn Töpel,
	Richard Cochran, Thomas Gleixner, Mike Maloney, Eric Dumazet,
	Kees Cook, Hans Liljestrand, Andrey Konovalov, Rosen, Rami,
	Reshetova, Elena, Sowmini Varadhan, Network Development, LKML

On Wed, Nov 29, 2017 at 3:06 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Wed, Nov 29, 2017 at 5:51 PM, Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
>>> Thanks for the review! Any suggestions for how to do the testing? If you have
>>> existing test cases, could you give my next version a test run to see if there
>>> are any regressions and if the timestamps work as expected?
>>>
>>> I see that there are test cases in tools/testing/selftests/net/, but none
>>> of them seem to use the time stamps so far, and I'm not overly familiar
>>> with how it works in the details to extend it in a meaningful way.
>>
>> I could not find any good tests for this interface, either. The only
>> user of the interface I found was a little tool I wrote a few years
>> ago that compares timestamps at multiple points in the transmit
>> path for latency measurement [1]. But it may be easier to just write
>> a new test under tools/testing/selftests/net for this purpose. I can
>> help with that, too, if you want.
>
> Thanks, that would be great!

I'll reply to this thread with git send-email with an extension to
tools/testing/selftests/net/psock_tpacket.c. I can resend that for
submission after your feature is merged (as it depends on it) or
feel free to include it in your patchset. The test currently fails for
the ns64 case. I probably did not convert correctly, but have to leave
the office and want to send what I have.

Two other comments: the test crashed the kernel due to a NULL ptr
in tpacket_get_timestamp. We cannot rely on skb->sk being set to
the packet socket here. And assignment to bitfield requires a cast to
boolean.

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index f55f330ab547..e9decc7fc5c3 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -439,9 +439,9 @@ static int __packet_get_status(struct packet_sock
*po, void *frame)
        }
 }

-static __u32 tpacket_get_timestamp(struct sk_buff *skb, __u32 *hi, __u32 *lo)
+static __u32 tpacket_get_timestamp(struct packet_sock *po, struct sk_buff *skb,
+                                  __u32 *hi, __u32 *lo)
 {
-       struct packet_sock *po = pkt_sk(skb->sk);
        struct skb_shared_hwtstamps *shhwtstamps = skb_hwtstamps(skb);
        ktime_t stamp;
        u32 type;
@@ -508,7 +508,7 @@ static __u32 __packet_set_timestamp(struct
packet_sock *po, void *frame,
        union tpacket_uhdr h;
        __u32 ts_status, hi, lo;

-       if (!(ts_status = tpacket_get_timestamp(skb, &hi, &lo)))
+       if (!(ts_status = tpacket_get_timestamp(po, skb, &hi, &lo)))
                return 0;

        h.raw = frame;
@@ -2352,7 +2352,7 @@ static int tpacket_rcv(struct sk_buff *skb,
struct net_device *dev,

        skb_copy_bits(skb, 0, h.raw + macoff, snaplen);

-       if (!(ts_status = tpacket_get_timestamp(skb, &tstamp_hi, &tstamp_lo)))
+       if (!(ts_status = tpacket_get_timestamp(po, skb, &tstamp_hi,
&tstamp_lo)))
                packet_get_time(po, &tstamp_hi, &tstamp_lo);

        status |= ts_status;
@@ -3835,7 +3835,7 @@ packet_setsockopt(struct socket *sock, int
level, int optname, char __user *optv
                if (copy_from_user(&val, optval, sizeof(val)))
                        return -EFAULT;

-               po->tp_skiptstamp = val;
+               po->tp_skiptstamp = !!val;
                return 0;
        }
        case PACKET_TIMESTAMP_NS64:
@@ -3847,7 +3847,7 @@ packet_setsockopt(struct socket *sock, int
level, int optname, char __user *optv
                if (copy_from_user(&val, optval, sizeof(val)))
                        return -EFAULT;

-               po->tp_tstamp_ns64 = val;
+               po->tp_tstamp_ns64 = !!val;
                return 0;
        }
        case PACKET_FANOUT:

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] [RFC v3] packet: experimental support for 64-bit timestamps
  2017-11-30  1:39                                 ` Willem de Bruijn
@ 2017-11-30  1:54                                   ` Willem de Bruijn
  0 siblings, 0 replies; 24+ messages in thread
From: Willem de Bruijn @ 2017-11-30  1:54 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: David S. Miller, Willem de Bruijn, Björn Töpel,
	Richard Cochran, Thomas Gleixner, Mike Maloney, Eric Dumazet,
	Kees Cook, Hans Liljestrand, Andrey Konovalov, Rosen, Rami,
	Reshetova, Elena, Sowmini Varadhan, Network Development, LKML

On Wed, Nov 29, 2017 at 8:39 PM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> On Wed, Nov 29, 2017 at 3:06 PM, Arnd Bergmann <arnd@arndb.de> wrote:
>> On Wed, Nov 29, 2017 at 5:51 PM, Willem de Bruijn
>> <willemdebruijn.kernel@gmail.com> wrote:
>>>> Thanks for the review! Any suggestions for how to do the testing? If you have
>>>> existing test cases, could you give my next version a test run to see if there
>>>> are any regressions and if the timestamps work as expected?
>>>>
>>>> I see that there are test cases in tools/testing/selftests/net/, but none
>>>> of them seem to use the time stamps so far, and I'm not overly familiar
>>>> with how it works in the details to extend it in a meaningful way.
>>>
>>> I could not find any good tests for this interface, either. The only
>>> user of the interface I found was a little tool I wrote a few years
>>> ago that compares timestamps at multiple points in the transmit
>>> path for latency measurement [1]. But it may be easier to just write
>>> a new test under tools/testing/selftests/net for this purpose. I can
>>> help with that, too, if you want.
>>
>> Thanks, that would be great!
>
> I'll reply to this thread with git send-email with an extension to
> tools/testing/selftests/net/psock_tpacket.c.

It appears that it did not end up in this thread. At least not when
using gmail threading. Patch at http://patchwork.ozlabs.org/patch/842854/

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2017-11-30  1:54 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-27 16:19 [PATCH 1/2] [net-next] packet: clarify timestamp overflow Arnd Bergmann
2017-11-27 16:19 ` [PATCH 2/2] [RFC] packet: experimental support for 64-bit timestamps Arnd Bergmann
2017-11-27 16:59   ` Jiri Pirko
2017-11-27 20:35     ` Willem de Bruijn
2017-11-27 20:51       ` Arnd Bergmann
2017-11-28  7:04         ` Björn Töpel
2017-11-28  8:46           ` Arnd Bergmann
2017-11-28 13:14             ` [PATCH] [RFC v2] " Arnd Bergmann
2017-11-28 14:24               ` Willem de Bruijn
2017-11-28 14:45                 ` Arnd Bergmann
2017-11-28 15:05               ` David Miller
2017-11-28 20:02                 ` Arnd Bergmann
2017-11-28 20:08                   ` David Miller
2017-11-28 20:31                     ` Arnd Bergmann
2017-11-28 20:32                       ` [PATCH] [RFC v3] " Arnd Bergmann
2017-11-28 22:28                         ` Willem de Bruijn
2017-11-29 12:31                           ` Arnd Bergmann
2017-11-29 16:51                             ` Willem de Bruijn
2017-11-29 20:06                               ` Arnd Bergmann
2017-11-30  1:39                                 ` Willem de Bruijn
2017-11-30  1:54                                   ` Willem de Bruijn
2017-11-29 13:45                       ` [PATCH] [RFC v4] " Arnd Bergmann
2017-11-28 14:08             ` [PATCH 2/2] [RFC] " Willem de Bruijn
2017-11-28 14:22               ` Arnd Bergmann

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).