* Add PGM protocol support to the IP stack
@ 2010-03-18 17:58 Christoph Lameter
2010-03-18 21:58 ` Christoph Lameter
2010-03-19 17:18 ` Andi Kleen
0 siblings, 2 replies; 21+ messages in thread
From: Christoph Lameter @ 2010-03-18 17:58 UTC (permalink / raw)
To: David Miller, netdev; +Cc: linux-kernel
Is there any work in progress on including PGM support (RFC 3208) in the
kernel?
I know about the openpgm implementation. Openpgm does this at the user
level and requires linking to a library. It is essentially a communication
protocol done in user space. It has privilege issues because it has to
create PGM packets via a raw socket, which also has implications for the
possible performance. Openpgm seems to be able to interact with major
commercial implementations of PGM.
I am looking at openpgm right now and it seems that there are a number of
useful files and functions in there that could be used to implement PGM
support in the kernel.
There is also an existing socket API for handling PGM available in another
operating system whose name we rather avoid mentioning. That socket API
could be used as the basis. PGM use would then be possible without a
library and without privilege and performance issues.
PGM support would provide two different modes of communication:
1. Native PGM (allows NAK suppression by Cisco routers to be used)
socket(AF_INET, SOCK_RDM, IPPROTO_RM)
(SOCK_RDM is defined in the kernel sources but not implemented. PGM
support would implement SOCK_RDM, IPPROTO_RM would need to be defined
according to the IANA protocol number for PGM).
2. PGM over UDP (which is used by many commercial products but not by the
unspeakable OS). No router support for NAK suppression is available. For
this I guess we would have to support
socket(AF_INET, SOCK_RDM, IPPROTO_UDP)
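A rough sketch of how an application might open a socket in either of the two modes, assuming the kernel implemented SOCK_RDM as proposed above. The names pgm_mode, pgm_socket_protocol() and pgm_open() are made up for illustration only, and IPPROTO_RM is defined locally as 113 (the IANA protocol number for PGM) since current headers do not provide it. On a kernel without PGM support the socket() call is simply expected to fail.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Assumption: not in current headers; 113 is the IANA number for PGM. */
#ifndef IPPROTO_RM
#define IPPROTO_RM 113
#endif

/* Hypothetical selector for the two modes described above. */
enum pgm_mode { PGM_NATIVE, PGM_OVER_UDP };

/* Map the mode to the protocol argument of socket(). */
static int pgm_socket_protocol(enum pgm_mode mode)
{
	return mode == PGM_OVER_UDP ? IPPROTO_UDP : IPPROTO_RM;
}

/* Returns a descriptor, or -1 with errno set if the kernel lacks support. */
static int pgm_open(enum pgm_mode mode)
{
	return socket(AF_INET, SOCK_RDM, pgm_socket_protocol(mode));
}
```

No library and no raw socket would be involved here; the mode is chosen purely through the protocol argument.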
I would be interested to find others who are interested in such a project
or maybe there is already a project in the works? If not then I will try
to come up with some code to get this going. Any help you could offer
would be appreciated.
* Re: Add PGM protocol support to the IP stack
2010-03-18 17:58 Add PGM protocol support to the IP stack Christoph Lameter
@ 2010-03-18 21:58 ` Christoph Lameter
2010-03-19 17:18 ` Andi Kleen
1 sibling, 0 replies; 21+ messages in thread
From: Christoph Lameter @ 2010-03-18 21:58 UTC (permalink / raw)
To: David Miller, netdev; +Cc: linux-kernel
Here is what I have so far after a couple of hours.
Something hacked together from openpgm and udplite.
---
Documentation/networking/pgm/TODO | 8
Documentation/networking/pgm/references | 2
Documentation/networking/pgm/usage | 91 ++++
include/linux/in.h | 2
include/linux/pgm.h | 720 ++++++++++++++++++++++++++++++++
net/ipv4/Kconfig | 14
net/ipv4/Makefile | 3
net/ipv4/pgm.c | 143 ++++++
8 files changed, 983 insertions(+)
Index: linux-2.6/include/linux/in.h
===================================================================
--- linux-2.6.orig/include/linux/in.h 2010-03-18 11:05:24.000000000 -0500
+++ linux-2.6/include/linux/in.h 2010-03-18 15:47:59.000000000 -0500
@@ -44,6 +44,7 @@ enum {
IPPROTO_PIM = 103, /* Protocol Independent Multicast */
IPPROTO_COMP = 108, /* Compression Header protocol */
+ IPPROTO_PGM = 113, /* Pragmatic General Multicast */
IPPROTO_SCTP = 132, /* Stream Control Transport Protocol */
IPPROTO_UDPLITE = 136, /* UDP-Lite (RFC 3828) */
@@ -51,6 +52,7 @@ enum {
IPPROTO_MAX
};
+#define IPPROTO_RM IPPROTO_PGM
/* Internet address. */
struct in_addr {
Index: linux-2.6/include/linux/pgm.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/pgm.h 2010-03-18 16:56:19.000000000 -0500
@@ -0,0 +1,720 @@
+/*
+ * PGM packet formats, RFC 3208.
+ *
+ * Copyright (c) 2006 Miru Limited.
+ * Copyright (c) 2010 Christoph Lameter, The Linux Foundation.
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ *
+ * March 17, 2010 Christoph Lameter
+ * Basic PGM definitions extracted from openpgm project.
+ * March 18, 2010
+ * Socket API and document intended usage.
+ * Basic protocol environment (from udplite.c)
+ */
+
+#ifndef _LINUX_PGM_H
+#define _LINUX_PGM_H
+
+#include <linux/types.h>
+
+/* PGM socket options */
+
+/* Transmitter */
+#define RM_LATEJOIN 1 /* X Not supported on receive so why have it? */
+#define RM_RATE_WINDOW_SIZE 2 /* See struct pgm_send_window */
+#define RM_SEND_WINDOW_ADV_RATE 3 /* X Increase of send window in percentage of window */
+#define RM_SENDER_STATISTICS 4 /* see struct pgm_sender_stats */
+#define RM_SENDER_WINDOW_ADVANCE_METHOD 5 /* X seems obsolete */
+#define RM_SET_MCAST_TTL 6 /* X Can be set via IP_MULTICAST_TTL */
+#define RM_SET_MESSAGE_BOUNDARY 7 /* Fix the size of the messages in bytes */
+#define RM_SET_SEND_IF 8 /* X use IP_MULTICAST_IF etc instead */
+#define RM_USE_FEC 9
+
+/* Receiver */
+#define RM_ADD_RECEIVE_IF 100 /* X ???? IP_MULTICAST_IF instead? */
+#define RM_DEL_RECEIVE_IF 101 /* X IP_MULTICAST_IF */
+#define RM_HIGH_SPEED_INTRANET_OPT 102 /* X PGM should adapt automatically to high speed networks */
+#define RM_RECEIVER_STATISTICS 103 /* See struct pgm_receiver_stats */
+
+/* Socket API structures (established by M$DN) */
+struct pgm_receiver_stats {
+ u64 NumODataPacketsReceived; /* Number of ODATA (original) sequences */
+ u64 NumRDataPacketsReceived; /* Number of RDATA (repair) sequences */
+ u64 NumDuplicateDataPackets; /* Duplicate sequences */
+ u64 DataBytesReceived;
+ u64 TotalBytesReceived;
+ u64 RateKBitsPerSecOverall; /* Receive rate since start of session X */
+ u64 RateKBitsPerSecLast; /* Receive rate for last second X*/
+ u64 TrailingEdgeSeqId; /* Oldest sequence in the receive window */
+ u64 LeadingEdgeSeqId; /* Newest sequence in the receive window */
+ u64 AverageSequencesInWindow; /* Average number of sequences in receive window X */
+ u64 MinSequencesInWindow; /* The minimum number of sequences */
+ u64 MaxSequencesInWindow; /* The maximum number of sequences */
+ u64 FirstNakSequenceNumber; /* First outstanding nack sequence number */
+ u64 NumPendingNaks; /* Number of sequences waiting for NCF */
+ u64 NumOutstandingNaks; /* Number of sequences waiting for RDATA */
+ u64 NumDataPacketsBuffered; /* Number of packets currently buffered */
+ u64 TotalSelectiveNaksSent; /* Number of NAKs sent total */
+ u64 TotalParityNaksSent; /* Number of parity NAKs sent */
+};
+
+struct pgm_sender_stats {
+ u64 DataBytesSent;
+ u64 TotalBytesSent;
+ u64 NaksReceived;
+ u64 NaksReceivedTooLate; /* NAKs received after receive window advanced */
+ u64 NumOutstandingNaks; /* Number of NAKs awaiting response */
+ u64 NumNaksAfterRData; /* Number of NAKs after RDATA sequences were sent which were ignored */
+ u64 RepairPacketsSent;
+ u64 BufferSpaceAvailable; /* Number of partial messages dropped */
+ u64 TrailingEdgeSeqId; /* Oldest sequence id in window */
+ u64 LeadingEdgeSeqId; /* Newest sequence id in window */
+ u64 RateKBitsPerSecOverall; /* Rate since start of session X */
+ u64 RateKBitsPerSecLast; /* Rate in last second X */
+ u64 TotalODataPacketsSent; /* Total data packets transmitted */
+};
+
+/* Setup of sender: RateKbitsPerSec = 8 * WindowSizeInBytes / WindowSizeInMSecs */
+struct pgm_send_window {
+ u64 RateKbitsPerSec; /* Allowed rate for the sender in kbits per second */
+ u64 WindowSizeInMSecs; /* Send window size in time */
+ u64 WindowSizeInBytes; /* Window size in bytes */
+};
+
+struct pgm_fec_info {
+ u16 FECBlockSize; /* Maximum number of packets for a group. Default and max = 255 */
+ u16 FECProActivePackets; /* Number of proactive packets per group. */
+ u8 FECGroupSize; /* Number of packets to be treated as a group. Power of two */
+ int fFECOnDemandParityEnabled; /* Allow sender to send parity repair packets */
+};
+
+/* address family indicator, rfc 1700 (ADDRESS FAMILY NUMBERS) */
+#ifndef AFI_IP
+#define AFI_IP 1 /* IP (IP version 4) */
+#define AFI_IP6 2 /* IP6 (IP version 6) */
+#endif
+
+/* UDP ports for UDP encapsulation, as per IBM WebSphere MQ */
+#define PGM_DEFAULT_UDP_ENCAP_UCAST_PORT 3055
+#define PGM_DEFAULT_UDP_ENCAP_MCAST_PORT 3056
+
+/* PGM default ports */
+#define PGM_DEFAULT_DATA_DESTINATION_PORT 7500
+#define PGM_DEFAULT_DATA_SOURCE_PORT 0 /* random */
+
+/* DoS limitation to protocol (MS08-036, KB950762) */
+#define PGM_MAX_APDU UINT16_MAX
+
+/* Cisco default: 24 (max 8200), Juniper & H3C default: 16 */
+#define PGM_MAX_FRAGMENTS 16
+
+enum pgm_type {
+ PGM_SPM = 0x00, /* 8.1: source path message */
+ PGM_POLL = 0x01, /* 14.7.1: poll request */
+ PGM_POLR = 0x02, /* 14.7.2: poll response */
+ PGM_ODATA = 0x04, /* 8.2: original data */
+ PGM_RDATA = 0x05, /* 8.2: repair data */
+ PGM_NAK = 0x08, /* 8.3: NAK or negative acknowledgement */
+ PGM_NNAK = 0x09, /* 8.3: N-NAK or null negative acknowledgement */
+ PGM_NCF = 0x0a, /* 8.3: NCF or NAK confirmation */
+ PGM_SPMR = 0x0c, /* 13.6: SPM request */
+ PGM_MAX = 0xff
+};
+
+#define PGM_OPT_LENGTH 0x00 /* options length */
+#define PGM_OPT_FRAGMENT 0x01 /* fragmentation */
+#define PGM_OPT_NAK_LIST 0x02 /* list of nak entries */
+#define PGM_OPT_JOIN 0x03 /* late joining */
+#define PGM_OPT_REDIRECT 0x07 /* redirect */
+#define PGM_OPT_SYN 0x0d /* synchronisation */
+#define PGM_OPT_FIN 0x0e /* session end */
+#define PGM_OPT_RST 0x0f /* session reset */
+
+#define PGM_OPT_PARITY_PRM 0x08 /* forward error correction parameters */
+#define PGM_OPT_PARITY_GRP 0x09 /* group number */
+#define PGM_OPT_CURR_TGSIZE 0x0a /* group size */
+
+#define PGM_OPT_CR 0x10 /* congestion report */
+#define PGM_OPT_CRQST 0x11 /* congestion report request */
+
+#define PGM_OPT_NAK_BO_IVL 0x04 /* nak back-off interval */
+#define PGM_OPT_NAK_BO_RNG 0x05 /* nak back-off range */
+#define PGM_OPT_NBR_UNREACH 0x0b /* neighbour unreachable */
+#define PGM_OPT_PATH_NLA 0x0c /* path nla */
+
+#define PGM_OPT_INVALID 0x7f /* option invalidated */
+
+/* 8. PGM header */
+struct pgm_header {
+ u16 sport; /* source port: tsi::sport or UDP port depending on direction */
+ u16 dport; /* destination port */
+ u8 type; /* version / packet type */
+ u8 options; /* options */
+#define PGM_OPT_PARITY 0x80 /* parity packet */
+#define PGM_OPT_VAR_PKTLEN 0x40 /* + variable sized packets */
+#define PGM_OPT_NETWORK 0x02 /* network-significant: must be interpreted by network elements */
+#define PGM_OPT_PRESENT 0x01 /* option extension are present */
+ u16 checksum; /* checksum */
+ u8 gsi[6]; /* global source id */
+ u16 tsdu_length; /* tsdu length */
+ /* tpdu length = th length (header + options) + tsdu length */
+};
+
+/* 8.1. Source Path Messages (SPM) */
+struct pgm_spm {
+ u32 sqn; /* spm sequence number */
+ u32 trail; /* trailing edge sequence number */
+ u32 lead; /* leading edge sequence number */
+ u16 nla_afi; /* nla afi */
+ u16 reserved; /* reserved */
+ struct in_addr spm_nla; /* path nla */
+ /* ... option extensions */
+};
+
+struct pgm_spm6 {
+ u32 sqn; /* spm sequence number */
+ u32 trail; /* trailing edge sequence number */
+ u32 lead; /* leading edge sequence number */
+ u16 nla_afi; /* nla afi */
+ u16 reserved; /* reserved */
+ struct in6_addr spm6_nla; /* path nla */
+ /* ... option extensions */
+};
+
+/* 8.2. Data Packet */
+struct pgm_data {
+ u32 sqn; /* data packet sequence number */
+ u32 trail; /* trailing edge sequence number */
+ /* ... option extensions */
+ /* ... data */
+};
+
+/* 8.3. Negative Acknowledgments and Confirmations (NAK, N-NAK, & NCF) */
+struct pgm_nak {
+ u32 sqn; /* requested sequence number */
+ u16 src_nla_afi; /* nla afi */
+ u16 reserved; /* reserved */
+ struct in_addr src_nla; /* source nla */
+ u16 grp_nla_afi; /* nla afi */
+ u16 reserved2; /* reserved */
+ struct in_addr grp_nla; /* multicast group nla */
+ /* ... option extension */
+};
+
+struct pgm_nak6 {
+ u32 sqn; /* requested sequence number */
+ u16 src_nla_afi; /* nla afi */
+ u16 reserved; /* reserved */
+ struct in6_addr src_nla; /* source nla */
+ u16 grp_nla_afi; /* nla afi */
+ u16 reserved2; /* reserved */
+ struct in6_addr grp_nla; /* multicast group nla */
+ /* ... option extension */
+};
+
+/* 9. Option header (max 16 per packet) */
+struct pgm_opt_header {
+ u8 type; /* option type */
+#define PGM_OPT_MASK 0x7f
+#define PGM_OPT_END 0x80 /* end of options flag */
+ u8 length; /* option length */
+ u8 reserved;
+#define PGM_OP_ENCODED 0x8 /* F-bit */
+#define PGM_OPX_MASK 0x3
+#define PGM_OPX_IGNORE 0x0 /* extensibility bits */
+#define PGM_OPX_INVALIDATE 0x1
+#define PGM_OPX_DISCARD 0x2
+#define PGM_OP_ENCODED_NULL 0x80 /* U-bit */
+};
+
+/* 9.1. Option extension length - OPT_LENGTH */
+struct pgm_opt_length {
+ u8 type; /* include header as total length overwrites reserved/OPX bits */
+ u8 length;
+ u16 total_length; /* total length of all options */
+};
+
+/* 9.2. Option fragment - OPT_FRAGMENT */
+struct pgm_opt_fragment {
+ u8 reserved; /* reserved */
+ u32 sqn; /* first sequence number */
+ u32 frag_off; /* offset */
+ u32 frag_len; /* length */
+};
+
+/* 9.3.5. Option NAK List - OPT_NAK_LIST */
+struct pgm_opt_nak_list {
+ u8 reserved; /* reserved */
+ u32 sqn[];
+};
+
+/* 9.4.2. Option Join - OPT_JOIN */
+struct pgm_opt_join {
+ u8 reserved; /* reserved */
+ u32 join_min; /* minimum sequence number */
+};
+
+/* 9.5.5. Option Redirect - OPT_REDIRECT */
+struct pgm_opt_redirect {
+ u8 reserved; /* reserved */
+ u16 nla_afi; /* nla afi */
+ u16 reserved2; /* reserved */
+ struct in_addr nla; /* dlr nla */
+};
+
+struct pgm_opt6_redirect {
+ u8 reserved; /* reserved */
+ u16 nla_afi; /* nla afi */
+ u16 reserved2; /* reserved */
+ struct in6_addr opt6_nla; /* dlr nla */
+};
+
+/* 9.6.2. Option Sources - OPT_SYN */
+struct pgm_opt_syn {
+ u8 reserved; /* reserved */
+};
+
+/* 9.7.4. Option End Session - OPT_FIN */
+struct pgm_opt_fin {
+ u8 reserved; /* reserved */
+};
+
+/* 9.8.4. Option Reset - OPT_RST */
+struct pgm_opt_rst {
+ u8 reserved; /* reserved */
+};
+
+
+/*
+ * Forward Error Correction - FEC
+ */
+
+/* 11.8.1. Option Parity - OPT_PARITY_PRM */
+struct pgm_opt_parity_prm {
+ u8 reserved; /* reserved */
+#define PGM_PARITY_PRM_MASK 0x3
+#define PGM_PARITY_PRM_PRO 0x1 /* source provides pro-active parity packets */
+#define PGM_PARITY_PRM_OND 0x2 /* on-demand parity packets */
+ u32 tgs; /* transmission group size */
+};
+
+/* 11.8.2. Option Parity Group - OPT_PARITY_GRP */
+struct pgm_opt_parity_grp {
+ u8 reserved; /* reserved */
+ u32 group; /* parity group number */
+};
+
+/* 11.8.3. Option Current Transmission Group Size - OPT_CURR_TGSIZE */
+struct pgm_opt_curr_tgsize {
+ u8 reserved; /* reserved */
+ u32 atgsize; /* actual transmission group size */
+};
+
+/*
+ * Congestion Control
+ */
+
+/* 12.7.1. Option Congestion Report - OPT_CR */
+struct pgm_opt_cr {
+ u8 reserved; /* reserved */
+ u32 cr_lead; /* congestion report reference sqn */
+ u16 cr_ne_wl; /* ne worst link */
+ u16 cr_ne_wp; /* ne worst path */
+ u16 cr_rx_wp; /* rcvr worst path */
+ u16 reserved2; /* reserved */
+ u16 nla_afi; /* nla afi */
+ u16 reserved3; /* reserved */
+ u32 cr_rcvr; /* worst receivers nla */
+};
+
+/* 12.7.2. Option Congestion Report Request - OPT_CRQST */
+struct pgm_opt_crqst {
+ u8 reserved; /* reserved */
+};
+
+
+/*
+ * SPM Requests
+ */
+
+/* 13.6. SPM Requests */
+struct pgm_spmr {
+ /* ... option extensions */
+};
+
+
+/*
+ * Poll Mechanism
+ */
+
+/* 14.7.1. Poll Request */
+struct pgm_poll {
+ u32 sqn; /* poll sequence number */
+ u16 round; /* poll round */
+ u16 type; /* poll sub-type */
+#define PGM_POLL_GENERAL 0x0 /* general poll */
+#define PGM_POLL_DLR 0x1 /* DLR poll */
+ u16 nla_afi; /* nla afi */
+ u16 reserved; /* reserved */
+ struct in_addr nla; /* path nla */
+ u32 bo_ivl; /* poll back-off interval */
+ char rand[4]; /* random string */
+ u32 mask; /* matching bit-mask */
+ /* ... option extensions */
+};
+
+struct pgm_poll6 {
+ u32 sqn; /* poll sequence number */
+ u16 round; /* poll round */
+ u16 s_type; /* poll sub-type */
+ u16 nla_afi; /* nla afi */
+ u16 reserved; /* reserved */
+ struct in6_addr nla; /* path nla */
+ u32 bo_ivl; /* poll back-off interval */
+ char rand[4]; /* random string */
+ u32 mask; /* matching bit-mask */
+ /* ... option extensions */
+};
+
+/* 14.7.2. Poll Response */
+struct pgm_polr {
+ u32 sqn; /* polr sequence number */
+ u16 round; /* polr round */
+ u16 reserved; /* reserved */
+ /* ... option extensions */
+};
+
+
+/*
+ * Implosion Prevention
+ */
+
+/* 15.4.1. Option NAK Back-Off Interval - OPT_NAK_BO_IVL */
+struct pgm_opt_nak_bo_ivl {
+ u8 opt_reserved; /* reserved */
+ u32 opt_nak_bo_ivl; /* nak back-off interval */
+ u32 opt_nak_bo_ivl_sqn; /* nak back-off interval sqn */
+};
+
+/* 15.4.2. Option NAK Back-Off Range - OPT_NAK_BO_RNG */
+struct pgm_opt_nak_bo_rng {
+ u8 opt_reserved; /* reserved */
+ u32 opt_nak_max_bo_ivl; /* maximum nak back-off interval */
+ u32 opt_nak_min_bo_ivl; /* minimum nak back-off interval */
+};
+
+/* 15.4.3. Option Neighbour Unreachable - OPT_NBR_UNREACH */
+struct pgm_opt_nbr_unreach {
+ u8 opt_reserved; /* reserved */
+};
+
+/* 15.4.4. Option Path - OPT_PATH_NLA */
+struct pgm_opt_path_nla {
+ u8 reserved; /* reserved */
+ struct in_addr opt_path_nla; /* path nla */
+};
+
+struct pgm_opt6_path_nla {
+ u8 reserved; /* reserved */
+ struct in6_addr opt6_path_nla; /* path nla */
+};
+
+#ifdef __KERNEL__
+
+#include <net/inet_sock.h>
+#include <linux/skbuff.h>
+#include <net/netns/hash.h>
+#include <linux/rslib.h>
+
+static inline int pgm_is_upstream(u8 type)
+{
+ return (type == PGM_NAK || /* unicast */
+ type == PGM_NNAK || /* unicast */
+ type == PGM_SPMR || /* multicast + unicast */
+ type == PGM_POLR); /* unicast */
+}
+
+static inline int pgm_is_peer(u8 type)
+{
+ return (type == PGM_SPMR); /* multicast */
+}
+
+static inline int pgm_is_downstream (u8 type)
+{
+ return (type == PGM_SPM || /* all multicast */
+ type == PGM_ODATA ||
+ type == PGM_RDATA ||
+ type == PGM_POLL ||
+ type == PGM_NCF);
+}
+
+int pgm_verify_spm(struct sk_buff *);
+int pgm_verify_spmr(struct sk_buff *);
+int pgm_verify_nak(struct sk_buff *);
+int pgm_verify_nnak(struct sk_buff *);
+int pgm_verify_ncf(struct sk_buff *);
+int pgm_verify_poll(struct sk_buff *);
+int pgm_verify_polr(struct sk_buff *);
+
+/* Global session ID */
+struct pgm_gsi {
+ char gsi[6];
+};
+
+struct pgm_tsi {
+ char gsi[6]; /* global session identifier */
+ u16 sport; /* source port: a random number to help detect session re-starts */
+};
+
+/* Receiver data structures */
+
+enum pgm_pkt_state {
+ PGM_PKT_ERROR_STATE,
+ PGM_PKT_BACK_OFF_STATE, /* PGM protocol recovery states */
+ PGM_PKT_WAIT_NCF_STATE,
+ PGM_PKT_WAIT_DATA_STATE,
+
+ PGM_PKT_HAVE_DATA_STATE, /* data received waiting to commit to application layer */
+
+ PGM_PKT_HAVE_PARITY_STATE, /* contains parity information not original data */
+ PGM_PKT_COMMIT_DATA_STATE, /* committed data waiting for purging */
+ PGM_PKT_LOST_DATA_STATE, /* if recovery fails, but packet has not yet been committed */
+};
+
+enum pgm_rxw_returns {
+ PGM_RXW_OK,
+ PGM_RXW_INSERTED,
+ PGM_RXW_APPENDED,
+ PGM_RXW_UPDATED,
+ PGM_RXW_MISSING,
+ PGM_RXW_DUPLICATE,
+ PGM_RXW_MALFORMED,
+ PGM_RXW_BOUNDS,
+ PGM_RXW_SLOW_CONSUMER,
+ PGM_RXW_UNKNOWN,
+};
+
+struct pgm_rxw_state {
+ unsigned long nak_rb_expiry;
+ unsigned long nak_rpt_expiry;
+ unsigned long nak_rdata_expiry;
+
+ enum pgm_pkt_state state;
+
+ u8 nak_transmit_count;
+ u8 ncf_retry_count;
+ u8 data_retry_count;
+
+/* only valid on tg_sqn::pkt_sqn = 0 */
+ unsigned is_contiguous:1; /* transmission group */
+};
+
+struct pgm_rxw {
+ struct pgm_tsi * tsi;
+
+ struct list_head backoff_queue;
+ struct list_head wait_ncf_queue;
+ struct list_head wait_data_queue;
+
+ /* window context counters */
+ u32 lost_count; /* failed to repair */
+ u32 fragment_count; /* incomplete apdu */
+ u32 parity_count; /* parity for repairs */
+ u32 committed_count; /* but still in window */
+
+ u16 max_tpdu; /* maximum packet size */
+ u32 lead, trail;
+ u32 rxw_trail, rxw_trail_init;
+ u32 commit_lead;
+ unsigned is_constrained:1;
+ unsigned is_defined:1;
+ unsigned has_event:1; /* edge triggered */
+ unsigned is_fec_available:1;
+ struct rs_t rs;
+ u32 tg_size; /* transmission group size for parity recovery */
+ unsigned tg_sqn_shift;
+
+ u32 min_fill_time; /* restricted from pgm_time_t */
+ u32 max_fill_time;
+ u32 min_nak_transmit_count;
+ u32 max_nak_transmit_count;
+ u32 cumulative_losses;
+ u32 bytes_delivered; /* Fix this: Will overflow */
+ u32 msgs_delivered;
+
+ size_t size; /* in bytes */
+ unsigned alloc; /* in pkts */
+ struct sk_buff *pdata[];
+};
+
+struct pgm_rxw* pgm_rxw_create(struct pgm_tsi *, u16, u32, unsigned, unsigned);
+void pgm_rxw_destroy(struct pgm_rxw *);
+int pgm_rxw_add(struct pgm_rxw *, struct sk_buff *, u64, u64);
+void pgm_rxw_remove_commit(struct pgm_rxw *);
+size_t pgm_rxw_readv(struct pgm_rxw *, struct kiovec *, unsigned int);
+unsigned int pgm_rxw_remove_trail (struct pgm_rxw *);
+unsigned int pgm_rxw_update(struct pgm_rxw *, u32, u32, u64, u64);
+void pgm_rxw_update_fec(struct pgm_rxw *, unsigned int);
+int pgm_rxw_confirm(struct pgm_rxw *, u32, u64, u64, u64);
+void pgm_rxw_lost(struct pgm_rxw *, u32);
+void pgm_rxw_state(struct pgm_rxw *, struct sk_buff *, enum pgm_pkt_state);
+struct sk_buff *pgm_rxw_peek(struct pgm_rxw *, u32);
+
+static inline int pgm_rxw_max_length(struct pgm_rxw *window)
+{
+ return window->alloc;
+}
+
+static inline u32 pgm_rxw_length(struct pgm_rxw *window)
+{
+ return ( 1 + window->lead ) - window->trail;
+}
+
+static inline size_t pgm_rxw_size(struct pgm_rxw *window)
+{
+ return window->size;
+}
+
+static inline int pgm_rxw_is_empty(struct pgm_rxw *window)
+{
+ return pgm_rxw_length (window) == 0;
+}
+
+static inline int pgm_rxw_is_full(struct pgm_rxw *window)
+{
+ return pgm_rxw_length (window) == pgm_rxw_max_length (window);
+}
+
+static inline u32 pgm_rxw_lead(struct pgm_rxw *window)
+{
+ return window->lead;
+}
+
+static inline u32 pgm_rxw_next_lead(struct pgm_rxw *window)
+{
+ return pgm_rxw_lead(window) + 1;
+}
+
+/* Transmitter data structures */
+
+struct pgm_txw_state {
+ u32 unfolded_checksum; /* first 32-bit word must be checksum */
+
+ unsigned waiting_retransmit:1; /* in retransmit queue */
+ unsigned retransmit_count:15;
+ unsigned nak_elimination_count:16;
+
+ unsigned long expiry; /* Advance with time */
+ unsigned long last_retransmit; /* NAK elimination */
+};
+
+struct pgm_txw {
+ struct pgm_tsi* tsi;
+
+/* option: lockless atomics */
+ u32 lead;
+ u32 trail;
+
+ struct list_head retransmit_queue;
+
+ struct rs_t rs;
+ unsigned int tg_sqn_shift;
+ struct sk_buff * parity_buffer;
+ unsigned is_fec_enabled:1;
+
+ u32 size; /* window content size in bytes */
+ u32 alloc; /* length of pdata[] */
+ struct sk_buff* pdata[];
+};
+
+struct pgm_txw *pgm_txw_create(struct pgm_tsi *, u16, u32, unsigned int,
+ unsigned int, int, unsigned int, unsigned int);
+void pgm_txw_shutdown (struct pgm_txw *);
+void pgm_txw_add(struct pgm_txw *, struct sk_buff *);
+struct sk_buff* pgm_txw_peek(struct pgm_txw* , u32);
+int pgm_txw_retransmit_push(struct pgm_txw *, u32, int, unsigned int);
+struct sk_buff* pgm_txw_retransmit_try_peek(struct pgm_txw *);
+void pgm_txw_retransmit_remove_head(struct pgm_txw *);
+
+static inline unsigned int pgm_txw_max_length(struct pgm_txw *window)
+{
+ return window->alloc;
+}
+
+static inline u32 pgm_txw_length(struct pgm_txw *window)
+{
+ return ( 1 + window->lead ) - window->trail;
+}
+
+static inline u32 pgm_txw_size(struct pgm_txw *window)
+{
+ return window->size;
+}
+
+static inline int pgm_txw_is_empty(struct pgm_txw *window)
+{
+ return pgm_txw_length(window) == 0;
+}
+
+static inline int pgm_txw_is_full(struct pgm_txw *window)
+{
+ return pgm_txw_length(window) == pgm_txw_max_length(window);
+}
+
+static inline u32 pgm_txw_lead(struct pgm_txw *window)
+{
+ return window->lead;
+}
+
+static inline u32 pgm_txw_next_lead(struct pgm_txw *window)
+{
+ return pgm_txw_lead (window) + 1;
+}
+
+static inline u32 pgm_txw_trail(struct pgm_txw *window)
+{
+ return window->trail;
+}
+
+static inline u32 pgm_txw_get_unfolded_checksum(struct sk_buff *skb)
+{
+ struct pgm_txw_state *state = (void *)&skb->cb;
+
+ return state->unfolded_checksum;
+}
+
+static inline void pgm_txw_set_unfolded_checksum(struct sk_buff* skb, u32 csum)
+{
+ struct pgm_txw_state *state = (void *)&skb->cb;
+
+ state->unfolded_checksum = csum;
+}
+
+static inline void pgm_txw_inc_retransmit_count(struct sk_buff * skb)
+{
+ struct pgm_txw_state *state = (void *)&skb->cb;
+
+ state->retransmit_count++;
+}
+
+static inline int pgm_txw_retransmit_is_empty(struct pgm_txw *window)
+{
+ return list_empty(&window->retransmit_queue);
+}
+
+#endif /* __KERNEL__ */
+
+#endif /* _LINUX_PGM_H */
Index: linux-2.6/Documentation/networking/pgm/TODO
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/Documentation/networking/pgm/TODO 2010-03-18 13:14:59.000000000 -0500
@@ -0,0 +1,8 @@
+- Define Socket API
+- Define /proc and sys api
+- Implement base logic
+- PGM over UDP
+- FEC Forward Error correction
+- Verify interaction with Cisco and other switches
+- Verify interaction with IBM Websphere, TIBCO, openpgm etc.
+
Index: linux-2.6/Documentation/networking/pgm/references
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/Documentation/networking/pgm/references 2010-03-18 13:14:59.000000000 -0500
@@ -0,0 +1,2 @@
+RFC3208
+
Index: linux-2.6/Documentation/networking/pgm/usage
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/Documentation/networking/pgm/usage 2010-03-18 15:55:17.000000000 -0500
@@ -0,0 +1,91 @@
+1. Opening a socket
+
+ A. Native PGM
+
+ fd = socket(AF_INET, SOCK_RDM, IPPROTO_PGM)
+
+ B. PGM over UDP
+
+ fd = socket(AF_INET, SOCK_RDM, IPPROTO_UDP)
+
+ C. PGM over SHM (?)
+
+ fd = socket(AF_UNIX, SOCK_RDM, 0)
+
+
+2. Binding to a multicast address
+
+ A. Sender
+
+ Connect the socket to a MC address and port using connect().
+
+ Note that the port is significant since multiple streams on different
+ ports can be run over the same MC addr.
+
+ B. Receiver
+
+ I. Bind the socket to the MC address and port of interest.
+
+ II. Listen to the socket.
+
+ Process will wait until a PGM packet destined to the port of interest
+ is received.
+
+ III. Accept a connection.
+
+ Establishes a session. Data can then be received.
+
+
+3. Sending and receiving
+
+ Use the usual socket read and write operations and the various flavors of waiting
+ for a packet via select, poll, epoll etc.
+
+ Message sizes are determined by the number of bytes in a single sendmsg() unless
+ overridden by the RM_SET_MESSAGE_BOUNDARY socket option.
+
+ The sender will block when the send window is full unless a non blocking write is performed.
+
+ The receiver shows the usual wait semantics. If the stream is set to unreliable then
+ packets may arrive in random order. If the socket is set to RM_RECEIVE_ONLY then packets may
+ just be missing.
+
+4. Transmitter Socket Options
+
+
+ A. Setting the window size / rate.
+
+ struct pgm_send_window x;
+ x.RateKbitsPerSec = 56;
+ x.WindowSizeInMSecs = 60000;
+ x.WindowSizeInBytes = 10000000;
+
+ setsockopt(fd, IPPROTO_RM, RM_RATE_WINDOW_SIZE, &x, sizeof(x));
+
+ Default is sending at 56Kbps with a buffer of 10 Megabytes and buffering for a minute.
+
+ B. FEC mode
+
+ struct pgm_fec_info x;
+
+ x.FECBlockSize = 255;
+ x.FECProActivePackets = 0;
+ x.FECGroupSize = 0;
+ x.fFECOnDemandParityEnabled = 1;
+
+ setsockopt(fd, IPPROTO_RM, RM_USE_FEC, &x, sizeof(x));
+
+
+5. Receiver Socket Options
+
+ None?
+
+
+Possible Extensions
+
+ RM_UNORDERED accept unordered packet avoiding delays when packets arrive out of sequence.
+ packet is still NAKed.
+
+ RM_RECEIVE_ONLY Simply ignore missed packets. Do not send any replies.
+
+
Index: linux-2.6/net/ipv4/pgm.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/net/ipv4/pgm.c 2010-03-18 16:37:17.000000000 -0500
@@ -0,0 +1,143 @@
+/*
+ * PGM An implementation of the PGM (Pragmatic General Multicast)
+ * protocol (RFC 3208).
+ *
+ * Authors: Christoph Lameter <cl@linux-foundation.org>
+ *
+ * Changes:
+ * Fixes:
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+#include "udp_impl.h"
+
+struct udp_table pgm_table __read_mostly;
+EXPORT_SYMBOL(pgm_table);
+
+static int pgm_rcv(struct sk_buff *skb)
+{
+ /* TBD */
+ return __udp4_lib_rcv(skb, &pgm_table, IPPROTO_PGM);
+}
+
+static void pgm_err(struct sk_buff *skb, u32 info)
+{
+ __udp4_lib_err(skb, info, &pgm_table);
+}
+
+static const struct net_protocol pgm_protocol = {
+ .handler = pgm_rcv,
+ .err_handler = pgm_err,
+ .no_policy = 1,
+ .netns_ok = 1,
+};
+
+struct proto pgm_prot = {
+ .name = "PGM",
+ .owner = THIS_MODULE,
+ .close = udp_lib_close,
+ .connect = ip4_datagram_connect,
+ .disconnect = udp_disconnect,
+ .ioctl = udp_ioctl,
+ .init = pgm_sk_init,
+ .destroy = udp_destroy_sock,
+ .setsockopt = pgm_setsockopt,
+ .getsockopt = pgm_getsockopt,
+ .sendmsg = pgm_sendmsg,
+ .recvmsg = pgm_recvmsg,
+ .sendpage = pgm_sendpage,
+ .backlog_rcv = udp_queue_rcv_skb,
+ .hash = udp_lib_hash,
+ .unhash = udp_lib_unhash,
+ .get_port = udp_v4_get_port,
+ .obj_size = sizeof(struct udp_sock),
+ .slab_flags = SLAB_DESTROY_BY_RCU,
+ .h.udp_table = &pgm_table,
+#ifdef CONFIG_COMPAT
+ .compat_setsockopt = compat_pgm_setsockopt,
+ .compat_getsockopt = compat_pgm_getsockopt,
+#endif
+};
+
+static struct inet_protosw pgm_ip_protosw = {
+ .type = SOCK_RDM,
+ .protocol = IPPROTO_PGM,
+ .prot = &pgm_prot,
+ .ops = &inet_pgm_ops,
+ .no_check = 0, /* must checksum (RFC 3208) */
+ .flags = INET_PROTOSW_PERMANENT,
+};
+
+static struct inet_protosw pgm_udp_protosw = {
+ .type = SOCK_RDM,
+ .protocol = IPPROTO_UDP,
+ .prot = &pgm_prot,
+ .ops = &inet_pgm_ops,
+ .no_check = 0, /* must checksum (RFC 3208) */
+ .flags = INET_PROTOSW_PERMANENT,
+};
+
+#ifdef CONFIG_PROC_FS
+static struct udp_seq_afinfo pgm_seq_afinfo = {
+ .name = "pgm",
+ .family = AF_INET,
+ .udp_table = &pgm_table,
+ .seq_fops = {
+ .owner = THIS_MODULE,
+ },
+ .seq_ops = {
+ .show = udp4_seq_show,
+ },
+};
+
+static int __net_init pgm_proc_init_net(struct net *net)
+{
+ return udp_proc_register(net, &pgm_seq_afinfo);
+}
+
+static void __net_exit pgm_proc_exit_net(struct net *net)
+{
+ udp_proc_unregister(net, &pgm_seq_afinfo);
+}
+
+static struct pernet_operations pgm4_net_ops = {
+ .init = pgm_proc_init_net,
+ .exit = pgm_proc_exit_net,
+};
+
+static __init int pgm_proc_init(void)
+{
+ return register_pernet_subsys(&pgm4_net_ops);
+}
+#else
+static inline int pgm_proc_init(void)
+{
+ return 0;
+}
+#endif
+
+void __init pgm_register(void)
+{
+ udp_table_init(&pgm_table, "PGM");
+ if (proto_register(&pgm_prot, 1))
+ goto out_register_err;
+
+ if (inet_add_protocol(&pgm_protocol, IPPROTO_PGM) < 0)
+ goto out_unregister_proto;
+
+ inet_register_protosw(&pgm_ip_protosw);
+ inet_register_protosw(&pgm_udp_protosw);
+
+ if (pgm_proc_init())
+ printk(KERN_ERR "%s: Cannot register /proc!\n", __func__);
+ return;
+
+out_unregister_proto:
+ proto_unregister(&pgm_prot);
+out_register_err:
+ printk(KERN_CRIT "%s: Cannot add PGM protocol.\n", __func__);
+}
+
+EXPORT_SYMBOL(pgm_prot);
Index: linux-2.6/net/ipv4/Kconfig
===================================================================
--- linux-2.6.orig/net/ipv4/Kconfig 2010-03-18 16:16:34.000000000 -0500
+++ linux-2.6/net/ipv4/Kconfig 2010-03-18 16:39:36.000000000 -0500
@@ -14,6 +14,20 @@ config IP_MULTICAST
<file:Documentation/networking/multicast.txt>. For most people, it's
safe to say N.
+config IP_PGM
+ bool "IP: Pragmatic General Multicast (RFC3208) support"
+ depends on IP_MULTICAST && EXPERIMENTAL
+ help
+ This is an implementation of reliable multicasting following
+ RFC3208. PGM is used for publisher-subscriber based information
+ services on private networks. The PGM protocol allows for recovery
+ of lost packets through resend requests (NAKs) and through the
+ reconstruction of missing packets via FEC. PGM is supported by router
+ vendors through logic that allows correlation of NAKs to avoid
+ flooding the network with NAKs (a NAK storm). PGM is widely used
+ in the financial industry and various commercial applications
+ support this protocol.
+
config IP_ADVANCED_ROUTER
bool "IP: advanced router"
---help---
Index: linux-2.6/net/ipv4/Makefile
===================================================================
--- linux-2.6.orig/net/ipv4/Makefile 2010-03-18 16:16:07.000000000 -0500
+++ linux-2.6/net/ipv4/Makefile 2010-03-18 16:24:04.000000000 -0500
@@ -52,3 +52,6 @@ obj-$(CONFIG_NETLABEL) += cipso_ipv4.o
obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \
xfrm4_output.o
+
+obj-$(CONFIG_IP_PGM) += pgm.o
+
* Re: Add PGM protocol support to the IP stack
2010-03-18 17:58 Add PGM protocol support to the IP stack Christoph Lameter
2010-03-18 21:58 ` Christoph Lameter
@ 2010-03-19 17:18 ` Andi Kleen
2010-03-19 21:53 ` David Miller
2010-03-22 14:20 ` Christoph Lameter
1 sibling, 2 replies; 21+ messages in thread
From: Andi Kleen @ 2010-03-19 17:18 UTC (permalink / raw)
To: Christoph Lameter; +Cc: David Miller, netdev, linux-kernel
Christoph Lameter <cl@linux-foundation.org> writes:
>
> I know about the openpgm implementation. Openpgm does this at the user
> level and requires linking to a library. It is essentially a communication
> protocol done in user space. It has privilege issues because it has to
> create PGM packets via a raw socket.
That seems like a poor reason alone to put something into the kernel.
Perhaps you rather need some way to have unprivileged raw sockets?
The classical way to do this is to start suid root, only open
the socket and then drop privileges.
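
For illustration, the open-then-drop pattern could look roughly like the
sketch below. It assumes a set-user-ID root binary and uses protocol number
113 (the IANA assignment for PGM), defined locally in case the libc headers
lack it. The helper names open_raw_pgm_socket(), drop_privileges() and
privileged_startup() are made up for this example and are not taken from
openpgm.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef IPPROTO_PGM
#define IPPROTO_PGM 113 /* IANA protocol number for PGM; local definition */
#endif

/* Open the raw socket while the process may still be privileged.
 * Returns the fd, or -1 with errno set (EPERM when not privileged). */
static int open_raw_pgm_socket(void)
{
    return socket(AF_INET, SOCK_RAW, IPPROTO_PGM);
}

/* Give up set-user-ID/set-group-ID privileges. Returns 0 on success. */
static int drop_privileges(void)
{
    /* Group first: after the uid is dropped, changing gids may be denied. */
    if (setgid(getgid()) != 0)
        return -1;
    if (setuid(getuid()) != 0)
        return -1;
    return 0;
}

/* Hypothetical startup helper: open the socket, then drop privileges no
 * matter whether the open succeeded. Returns the socket fd or -1. */
static int privileged_startup(void)
{
    int fd = open_raw_pgm_socket();
    int saved_errno = errno;

    if (drop_privileges() != 0) {
        if (fd >= 0)
            close(fd);
        errno = EPERM;
        return -1;
    }
    assert(geteuid() == getuid()); /* no leftover set-user-ID privilege */
    errno = saved_errno;
    return fd;
}
```

A program would call privileged_startup() first thing in main(), before
touching any untrusted input. Run without root, the socket() call is
expected to fail with EPERM and the privilege drop changes nothing.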
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
* Re: Add PGM protocol support to the IP stack
2010-03-19 17:18 ` Andi Kleen
@ 2010-03-19 21:53 ` David Miller
2010-03-19 22:26 ` H. Peter Anvin
2010-03-22 14:20 ` Christoph Lameter
1 sibling, 1 reply; 21+ messages in thread
From: David Miller @ 2010-03-19 21:53 UTC (permalink / raw)
To: andi; +Cc: cl, netdev, linux-kernel
From: Andi Kleen <andi@firstfloor.org>
Date: Fri, 19 Mar 2010 18:18:36 +0100
> Christoph Lameter <cl@linux-foundation.org> writes:
>>
>> I know about the openpgm implementation. Openpgm does this at the user
>> level and requires linking to a library. It is essentially a communication
>> protocol done in user space. It has privilege issues because it has to
>> create PGM packets via a raw socket.
>
> That seems like a poor reason alone to put something into the kernel
> Perhaps you rather need some way to have unpriviledged raw sockets?
>
> The classical way to do this is to start suid root, only open
> the socket and then drop privileges.
I completely agree.
We should be able to make a way for unprivileged users to
use RAW sockets in some limited capacity, for cases like this.
But I also don't consider what openpgm has to do right now to
be all that much of a restriction. You need privileges to
add the protocol to the kernel, you need privileges to run
the userspace variant, there is no real difference.
* Re: Add PGM protocol support to the IP stack
2010-03-19 21:53 ` David Miller
@ 2010-03-19 22:26 ` H. Peter Anvin
2010-03-22 14:24 ` Christoph Lameter
0 siblings, 1 reply; 21+ messages in thread
From: H. Peter Anvin @ 2010-03-19 22:26 UTC (permalink / raw)
To: David Miller; +Cc: andi, cl, netdev, linux-kernel
On 03/19/2010 02:53 PM, David Miller wrote:
> But I also don't consider what openpgm has to do right now to
> be all that much of a restriction. You need privileges to
> add the protocol to the kernel, you need privileges to run
> the userspace variant, there is no real difference.
The real difference is if multiplex is needed between multiple
unprivileged users.
-hpa
* Re: Add PGM protocol support to the IP stack
2010-03-19 17:18 ` Andi Kleen
2010-03-19 21:53 ` David Miller
@ 2010-03-22 14:20 ` Christoph Lameter
2010-03-22 16:36 ` Andi Kleen
1 sibling, 1 reply; 21+ messages in thread
From: Christoph Lameter @ 2010-03-22 14:20 UTC (permalink / raw)
To: Andi Kleen; +Cc: David Miller, netdev, linux-kernel
On Fri, 19 Mar 2010, Andi Kleen wrote:
> Christoph Lameter <cl@linux-foundation.org> writes:
> >
> > I know about the openpgm implementation. Openpgm does this at the user
> > level and requires linking to a library. It is essentially a communication
> > protocol done in user space. It has privilege issues because it has to
> > create PGM packets via a raw socket.
>
> That seems like a poor reason alone to put something into the kernel
> Perhaps you rather need some way to have unpriviledged raw sockets?
Not the only reason. There are also performance implications. NAKing and
other control messages from user space are a pain and the available
implementations add numerous threads just to control the timing of control
messages and the expiration of data etc. It's difficult to listen to a PGM
port from user space. You have to get all messages for the PGM protocol
and then filter in each process.
PGM operates on the same level as TCP and UDP.
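
To make the per-process filtering concrete, here is a small sketch of what
every user space receiver on a raw socket ends up doing for each packet. It
assumes the buffer starts with the IPv4 header (as raw IPv4 sockets deliver
it on Linux) followed by the fixed 16 byte PGM header of RFC 3208, whose
first two fields are the source and destination port. The function names
are hypothetical, and details such as NAKs (where the port roles are
swapped) are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PGM_PROTO_NUM  113 /* IANA protocol number for PGM */
#define PGM_HEADER_LEN 16  /* fixed PGM header length per RFC 3208 */

/* Return the PGM destination port (host byte order) of a raw IPv4 packet,
 * or -1 if the buffer is not a well-formed IPv4/PGM packet. */
static int pgm_dest_port(const uint8_t *pkt, size_t len)
{
    size_t ihl;

    if (len < 20 || (pkt[0] >> 4) != 4)
        return -1;
    ihl = (size_t)(pkt[0] & 0x0f) * 4; /* IPv4 header length in bytes */
    if (ihl < 20 || len < ihl + PGM_HEADER_LEN)
        return -1;
    if (pkt[9] != PGM_PROTO_NUM) /* IPv4 protocol field */
        return -1;
    /* Bytes 2..3 of the PGM header hold the destination port. */
    return (pkt[ihl + 2] << 8) | pkt[ihl + 3];
}

/* Every process sees every PGM packet and has to discard most of them. */
static int pgm_packet_wanted(const uint8_t *pkt, size_t len, uint16_t port)
{
    return pgm_dest_port(pkt, len) == (int)port;
}
```

With N processes listening on N different ports, each packet is copied to
all N raw sockets and rejected by N-1 of them, which is the overhead an
in-kernel port lookup would avoid.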
> The classical way to do this is to start suid root, only open
> the socket and then drop privileges.
Yes those solutions exist and the experience with their limitations is
the reason to try to get PGM in the kernel.
* Re: Add PGM protocol support to the IP stack
2010-03-19 22:26 ` H. Peter Anvin
@ 2010-03-22 14:24 ` Christoph Lameter
0 siblings, 0 replies; 21+ messages in thread
From: Christoph Lameter @ 2010-03-22 14:24 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: David Miller, andi, netdev, linux-kernel
On Fri, 19 Mar 2010, H. Peter Anvin wrote:
> On 03/19/2010 02:53 PM, David Miller wrote:
> > But I also don't consider what openpgm has to do right now to
> > be all that much of a restriction. You need privileges to
> > add the protocol to the kernel, you need privileges to run
> > the userspace variant, there is no real difference.
>
> The real difference is if multiplex is needed between multiple
> unprivileged users.
It is needed. PGM ports exist and work similarly to UDP and TCP ports.
PGM as provided by openpgm and other solutions avoids native PGM and
instead uses PGM over UDP. But the routers do not support PGM over UDP in
the same way as native PGM. So the NAK suppression and other advanced
features available in Juniper and Cisco switches cannot be used.
openpgm can work with the native PGM protocol via a raw socket but then
one cannot run multiple processes communicating via different ports
effectively.
The fragmentation of packets and the assembly etc in user space is a pain.
* Re: Add PGM protocol support to the IP stack
2010-03-22 14:20 ` Christoph Lameter
@ 2010-03-22 16:36 ` Andi Kleen
2010-03-22 16:51 ` Christoph Lameter
0 siblings, 1 reply; 21+ messages in thread
From: Andi Kleen @ 2010-03-22 16:36 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Andi Kleen, David Miller, netdev, linux-kernel
On Mon, Mar 22, 2010 at 09:20:42AM -0500, Christoph Lameter wrote:
> On Fri, 19 Mar 2010, Andi Kleen wrote:
>
> > Christoph Lameter <cl@linux-foundation.org> writes:
> > >
> > > I know about the openpgm implementation. Openpgm does this at the user
> > > level and requires linking to a library. It is essentially a communication
> > > protocol done in user space. It has privilege issues because it has to
> > > create PGM packets via a raw socket.
> >
> > That seems like a poor reason alone to put something into the kernel
> > Perhaps you rather need some way to have unpriviledged raw sockets?
>
> Not the only reason. There are also performance implications. NAKing and
> other control messages from user space are a pain and the available
> implementations add numerous threads just to control the timing of control
> messages and the expiration of data etc. Its difficult to listen to a PGM
> port from user space. You have to get all messages for the PGM protocol
> and then filter in each process.
Ok that sounds like a good reason to have a kernel protocol.
Thanks.
Multicast reliable kernel protocols are somewhat new, I guess one
would need to make sure to come up with a clean generic interface
for them first.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
* Re: Add PGM protocol support to the IP stack
2010-03-22 16:36 ` Andi Kleen
@ 2010-03-22 16:51 ` Christoph Lameter
2010-03-22 17:43 ` Andi Kleen
0 siblings, 1 reply; 21+ messages in thread
From: Christoph Lameter @ 2010-03-22 16:51 UTC (permalink / raw)
To: Andi Kleen; +Cc: David Miller, netdev, linux-kernel
On Mon, 22 Mar 2010, Andi Kleen wrote:
> Multicast reliable kernel protocols are somewhat new, I guess one
> would need to make sure to come up with a clean generic interface
> for them first.
It has been around for a long time in another OS. I wonder if I should use
the socket API realized there as a model or come up with something new
from scratch?
What I have right now is:
1. Opening a socket
A. Native PGM
fd = socket(AF_INET, SOCK_RDM, IPPROTO_PGM)
B. PGM over UDP
fd = socket(AF_INET, SOCK_RDM, IPPROTO_UDP)
C. PGM over SHM (?)
fd = socket(AF_UNIX, SOCK_RDM, 0)
2. Binding to a multicast address
A. Sender
Connect the socket to a MC address and port using connect().
Note that the port is significant since multiple streams on different
ports can be run over the same MC addr.
B. Receiver
I. Bind the socket to the MC address and port of interest.
II. Listen to the socket.
Process will wait until a PGM packet destined to the port of interest
is received.
III. Accept a connection.
Establishes a session. Data can then be received.
3. Sending and receiving
Use the usual socket read and write operations and the various flavors of waiting
for a packet via select, poll, epoll etc.
Packet sizes are determined by the number of packets in a single sendmsg() unless
overridden by the RM_SET_MESSAGE_BOUNDARY socket option.
The sender will block when the send window is full unless a non blocking write is performed.
The receiver shows the usual wait semantics. If the stream is set to unreliable then
packets may arrive in random order. If the socket is set to RM_LISTEN_ONLY then packets may
just be missing.
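
From an application, steps 1 to 3 might look roughly like the sketch below.
This only illustrates the proposed interface: no current kernel implements
SOCK_RDM for AF_INET, so socket() is expected to fail there. IPPROTO_PGM is
defined locally as 113 (the IANA number), and fill_group_addr(),
pgm_open_sender() and pgm_open_receiver() are hypothetical helper names.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef IPPROTO_PGM
#define IPPROTO_PGM 113 /* IANA protocol number for PGM; local definition */
#endif

/* Fill in an IPv4 multicast address. Returns 0, or -1 if the string is
 * not a dotted quad inside 224.0.0.0/4. */
static int fill_group_addr(struct sockaddr_in *sa, const char *group,
                           uint16_t port)
{
    memset(sa, 0, sizeof(*sa));
    sa->sin_family = AF_INET;
    sa->sin_port = htons(port);
    if (inet_pton(AF_INET, group, &sa->sin_addr) != 1)
        return -1;
    if (!IN_MULTICAST(ntohl(sa->sin_addr.s_addr)))
        return -1;
    return 0;
}

/* Step 2A: a sender connects to the multicast address and port. */
static int pgm_open_sender(const char *group, uint16_t port)
{
    struct sockaddr_in sa;
    int fd, err;

    if (fill_group_addr(&sa, group, port) != 0) {
        errno = EINVAL;
        return -1;
    }
    fd = socket(AF_INET, SOCK_RDM, IPPROTO_PGM);
    if (fd < 0)
        return -1; /* expected today: protocol not implemented */
    if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) != 0) {
        err = errno;
        close(fd);
        errno = err;
        return -1;
    }
    return fd; /* send()/write() go to the group from here on */
}

/* Step 2B: a receiver binds, listens and accepts one session. */
static int pgm_open_receiver(const char *group, uint16_t port)
{
    struct sockaddr_in sa;
    int lfd, fd, err;

    if (fill_group_addr(&sa, group, port) != 0) {
        errno = EINVAL;
        return -1;
    }
    lfd = socket(AF_INET, SOCK_RDM, IPPROTO_PGM);
    if (lfd < 0)
        return -1;
    if (bind(lfd, (struct sockaddr *)&sa, sizeof(sa)) != 0 ||
        listen(lfd, 1) != 0) {
        err = errno;
        close(lfd);
        errno = err;
        return -1;
    }
    fd = accept(lfd, NULL, NULL); /* would block until PGM traffic arrives */
    err = errno;
    close(lfd);
    errno = err;
    return fd; /* recv()/read() deliver datagrams of the session */
}
```

For PGM over UDP the only change in this sketch would be passing
IPPROTO_UDP instead of IPPROTO_PGM to socket().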
4. Transmitter Socket Options
A. Setting the window size / rate.
struct pgm_send_window x;
x.RateKbitsPerSec = 56;
x.WindowSizeInMSecs = 60000;
x.WindowSizeInBytes = 10000000;
setsockopt(fd, SOCK_RDM, RM_RATE_WINDOW_SIZE, &x, sizeof(x));
Default is sending at 56Kbps with a buffer of 10 Megabytes and buffering for a minute.
B. FEC mode
struct pgm_fec_info x;
x.FECBlockSize = 255;
x.FECProActivePackets = 0;
x.FECGroupSize = 0;
x.fFECOnDemandParityEnabled = 1;
setsockopt(fd, SOCK_RDM, RM_FEC_MODE, &x, sizeof(x));
5. Receiver Socket Options
None?
Possible Extensions
RM_UNORDERED accept unordered packet avoiding delays when packets arrive out of sequence.
packet is still NAKed.
RM_RECEIVE_ONLY Simply ignore missed packets. Do not send any replies.
Existing socket options in the other OS (X denotes that this looks
screwy and should be avoided)
/* PGM socket options */
/* Transmitter */
#define RM_LATEJOIN 1 /* X Not supported on receive so why have it? */
#define RM_RATE_WINDOW_SIZE 2 /* See struct pgm_send_window */
#define RM_SEND_WINDOW_ADV_RATE 3 /* X Increase of send window in percentage of window */
#define RM_SENDER_STATISTICS 4 /* see struct pgm_sender_stats */
#define RM_SENDER_WINDOW_ADVANCE_METHOD 5 /* X seems obsolete */
#define RM_SET_MCAST_TTL 6 /* X Can be set via IP_MULTICAST_TTL */
#define RM_SET_MESSAGE_BOUNDARY 7 /* Fix the size of the messages in bytes */
#define RM_SET_SEND_IF 8 /* X use IP_MULTICAST_IF etc instead */
#define RM_USE_FEC 9
/* Receiver */
#define RM_ADD_RECEIVE_IF 100 /* X ???? IP_MULTICAST_IF instead? */
#define RM_DEL_RECEIVE_IF 101 /* X IP_MULTICAST_IF */
#define RM_HIGH_SPEED_INTRANET_OPT 102 /* X PGM should adapt automatically to high speed networks */
#define RM_RECEIVER_STATISTICS 103 /* See struct pgm_receiver_stats */
/* Socket API structures (established by M$DN) */
struct pgm_receiver_stats {
u64 NumODataPacketsReceived; /* Number of ODATA (original) sequences */
u64 NumRDataPacketsReceived; /* Number of RDATA (repair) sequences */
u64 NumDuplicateDataPackets; /* Duplicate sequences */
u64 DataBytesReceived;
u64 TotalBytesReceived;
u64 RateKBitsPerSecOverall; /* Receive rate since start of session X */
u64 RateKBitsPerSecLast; /* Receive rate for last second X*/
u64 TrailingEdgeSeqId; /* Oldest sequence in the receive window */
u64 LeadingEdgeSeqId; /* Newest sequence in the receive window */
u64 AverageSequencesInWindow; /* Average number of sequences in receive window X */
u64 MinSequencesInWindow; /* The minimum number of sequences */
u64 MaxSequencesInWindow; /* The maximum number of sequences */
u64 FirstNakSequenceNumber; /* First outstanding NAK sequence number */
u64 NumPendingNaks; /* Number of sequences waiting for NCF */
u64 NumOutstandingNaks; /* Number of sequences waiting for RDATA */
u64 NumDataPacketsBuffered; /* Number of packets currently buffered */
u64 TotalSelectiveNaksSent; /* Number of NAKs sent total */
u64 TotalParityNaksSent; /* Number of parity NAKs sent */
};
struct pgm_sender_stats {
u64 DataBytesSent;
u64 TotalBytesSent;
u64 NaksReceived;
u64 NaksReceivedTooLate; /* NAKs received after receive window advanced */
u64 NumOutstandingNaks; /* Number of NAKs awaiting response */
u64 NumNaksAfterRData; /* Number of NAKs after RDATA sequences were sent which were ignored */
u64 RepairPacketsSent;
u64 BufferSpaceAvailable; /* Number of partial messages dropped */
u64 TrailingEdgeSeqId; /* Oldest sequence id in window */
u64 LeadingEdgeSeqId; /* Newest sequence id in window */
u64 RateKBitsPerSecOverall; /* Rate since start of session X */
u64 RateKBitsPerSecLast; /* Rate in last second X */
u64 TotalODataPacketsSent; /* Total data packets transmitted */
};
/* Setup of sender RateKbitsPerSec = WindowSizeBytes / WindowSizeMSecs */
struct pgm_send_window {
u64 RateKbitsPerSec; /* Allowed rate for the sender in kbits per second */
u64 WindowSizeInMSecs; /* Send window size in time */
u64 WindowSizeInBytes; /* Window size in bytes */
};
struct pgm_fec_info {
u16 FECBlockSize; /* Maximum number of packets for a group. Default and max = 255 */
u16 FECProActivePackets; /* Number of proactive packets per group. */
u8 FECGroupSize; /* Number of packets to be treated as a group. Power of two */
int fFECOnDemandParityEnabled; /* Allow sender to send parity repair packets */
};
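
A quick worked check of how the three fields of struct pgm_send_window
relate, as a sketch. The comment on the struct gives the rate as bytes
divided by milliseconds; with the rate in kilobits per second (decimal
kilo) a factor of 8 bits per byte is needed for the units to balance, and
that is what is assumed here. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed to hold window_msecs worth of data at the given rate.
 * kbit/s * ms = bits, and bits / 8 = bytes. */
static uint64_t pgm_window_bytes(uint64_t rate_kbits_per_sec,
                                 uint64_t window_msecs)
{
    return rate_kbits_per_sec * window_msecs / 8;
}

/* Rate in kbit/s (rounded down) that fills window_bytes in window_msecs.
 * Returns 0 for a zero-length window instead of dividing by zero. */
static uint64_t pgm_window_rate_kbits(uint64_t window_bytes,
                                      uint64_t window_msecs)
{
    if (window_msecs == 0)
        return 0;
    return window_bytes * 8 / window_msecs;
}
```

Under that assumption, 56 kbit/s over 60000 ms needs 420000 bytes, while a
10000000 byte window drained over 60000 ms corresponds to about 1333
kbit/s. The three defaults quoted in 4.A therefore do not satisfy the
relation simultaneously; presumably only two of them would be taken as
input and the third derived.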
* Re: Add PGM protocol support to the IP stack
2010-03-22 16:51 ` Christoph Lameter
@ 2010-03-22 17:43 ` Andi Kleen
2010-03-22 18:07 ` Christoph Lameter
0 siblings, 1 reply; 21+ messages in thread
From: Andi Kleen @ 2010-03-22 17:43 UTC (permalink / raw)
To: Christoph Lameter; +Cc: David Miller, netdev, linux-kernel
Christoph Lameter <cl@linux-foundation.org> writes:
> On Mon, 22 Mar 2010, Andi Kleen wrote:
>
>> Multicast reliable kernel protocols are somewhat new, I guess one
>> would need to make sure to come up with a clean generic interface
>> for them first.
>
> It has been around for a long time in another OS. I wonder if I should use
> the socket API realized there as a model or come up with something new
> from scratch?
If the other API doesn't have a serious flaw I guess it's better
to aim for a sub/superset at least, to make porting applications easier.
>
> What I have right now is:
>
> 1. Opening a socket
>
> A. Native PGM
>
> fd = socket(AF_INET, SOCK_RDM, IPPROTO_PGM)
RDM = Reliable ? Multicast ?
> B. PGM over UDP
>
> fd = socket(AF_INET, SOCK_RDM, IPPROTO_UDP)
>
> C. PGM over SHM (?)
>
> fd = socket(AF_UNIX, SOCK_RDM, 0)
Not sure how that should work.
> 3. Sending and receiving
>
> Use the usual socket read and write operations and the various flavors of waiting
> for a packet via select, poll, epoll etc.
>
> Packet sizes are determined by the number of packets in a single sendmsg() unless
Number of bytes surely?
> overridden by the RM_SET_MESSAGE_BOUNDARY socket option.
That's unusual to have such a option (except the MTU). What is it good for?
>
> 4. Transmitter Socket Options
>
>
> A. Setting the window size / rate.
>
> struct pgm_send_window x;
> x.RateKbitsPerSec = 56;
> x.WindowSizeInMsecs = 60000;
> x.WindowSizeinBytes = 10000000;
>
> setsockopt(fd, SOCK_RDM, RM_RATE_WINDOW_SIZE, &x, sizeof(x));
>
> Default is sending at 56Kbps with a buffer of 10 Megabytes and buffering for a minute.
That's a very large buffer for a socket. It would be better to use the usual
auto shrinking/increasing mechanisms.
> B. FEC mode
>
> struct pgm_fec_info x;
>
> x.FECBlocksize = 255;
> x.FECProActivePackets = 0;
> x.FECGroupSize = 0;
> x.fFECOnDemandParityEnabled = 1;
>
> setsockopt(fd, SOCK_RDM, RM_FEC_MODE, &x, sizeof(x));
Is that mode really needed?
> /* Socket API structures (established by M$DN) */
> struct pgm_receiver_stats {
> u64 NumODataPacketsReceived; /* Number of ODATA (original) sequences */
It's difficult to maintain 64 bit counters on 32bit hosts on all targets.
But I guess it would be ok to only fill in 32bit in this case.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
* Re: Add PGM protocol support to the IP stack
2010-03-22 17:43 ` Andi Kleen
@ 2010-03-22 18:07 ` Christoph Lameter
2010-03-22 18:53 ` Andi Kleen
0 siblings, 1 reply; 21+ messages in thread
From: Christoph Lameter @ 2010-03-22 18:07 UTC (permalink / raw)
To: Andi Kleen; +Cc: David Miller, netdev, linux-kernel
On Mon, 22 Mar 2010, Andi Kleen wrote:
> > What I have right now is:
> >
> > 1. Opening a socket
>
> >
> > A. Native PGM
> >
> > fd = socket(AF_INET, SOCK_RDM, IPPROTO_PGM)
>
> RDM = Reliable ? Multicast ?
RDM is Reliable Datagram Multicast I believe. I'd rather have SOCK_PGM if
I could choose.
>
> > B. PGM over UDP
> >
> > fd = socket(AF_INET, SOCK_RDM, IPPROTO_UDP)
> >
> > C. PGM over SHM (?)
> >
> > fd = socket(AF_UNIX, SOCK_RDM, 0)
>
> Not sure how that should work.
Multiple processes would communicate via shm segments. Maybe defer to the
future but it's an important operation mode as the systems grow bigger and bigger.
SHM segment would have to contain some sort of ring buffer that the
receivers could tap into. But that mode has not really been thought
through.
> > 3. Sending and receiving
> >
> > Use the usual socket read and write operations and the various flavors of waiting
> > for a packet via select, poll, epoll etc.
> >
> > Packet sizes are determined by the number of packets in a single sendmsg() unless
>
> Number of bytes surely?
Sorry yes you are right.
> > overridden by the RM_SET_MESSAGE_BOUNDARY socket option.
>
> That's unusual to have such a option (except the MTU). What is it good for?
No idea why it was implemented. It can be used to use send() for portions
of a message. Triggers the send() only when all bytes have been provided.
Probably necessary if one wants to have very long (megabytes) messages.
Esoteric and likely not going to be in a first release.
> > 4. Transmitter Socket Options
> >
> >
> > A. Setting the window size / rate.
> >
> > struct pgm_send_window x;
> > x.RateKbitsPerSec = 56;
> > x.WindowSizeInMsecs = 60000;
> > x.WindowSizeinBytes = 10000000;
> >
> > setsockopt(fd, SOCK_RDM, RM_RATE_WINDOW_SIZE, &x, sizeof(x));
> >
> > Default is sending at 56Kbps with a buffer of 10 Megabytes and buffering for a minute.
>
> That's a very large buffer for a socket. It would be better to use the usual
> auto shrinking/increasing mechanisms.
Reliable multicast protocols have a defined time period / "reliability
buffer" so that they can resend a message that was missed for a time
period. It is customary to either specify a time period or define the size
of the "reliability buffer".
> > B. FEC mode
> >
> > struct pgm_fec_info x;
> >
> > x.FECBlocksize = 255;
> > x.FECProActivePackets = 0;
> > x.FECGroupSize = 0;
> > x.fFECOnDemandParityEnabled = 1;
> >
> > setsockopt(fd, SOCK_RDM, RM_FEC_MODE, &x, sizeof(x));
>
> Is that mode really needed?
Never used it. I'd rather skip for now. Maybe later.
>
> > /* Socket API structures (established by M$DN) */
> > struct pgm_receiver_stats {
> > u64 NumODataPacketsReceived; /* Number of ODATA (original) sequences */
>
> It's difficult to maintain 64 bit counters on 32bit hosts on all targets.
> But I guess it would be ok to only fill in 32bit in this case.
32 bit counters have the awful habit of overflowing.
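
To put a number on that, a small sketch: the time until a 32 bit counter
wraps at a steady increment rate. The rates used below (a byte counter on a
saturated 1 Gbit/s link, i.e. 125000000 bytes per second) are illustrative
assumptions, not measurements, and the function name is made up.

```c
#include <assert.h>
#include <stdint.h>

/* Seconds until a 32 bit counter starting at zero wraps around when it is
 * incremented at a constant rate. Returns -1.0 for a non-positive rate. */
static double u32_wrap_seconds(double increments_per_sec)
{
    if (increments_per_sec <= 0.0)
        return -1.0;
    return 4294967296.0 / increments_per_sec; /* 2^32 increments */
}
```

At 125000000 bytes per second a 32 bit byte counter such as
DataBytesReceived would wrap after roughly 34 seconds, whereas a 64 bit
counter at the same rate lasts for thousands of years.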
* Re: Add PGM protocol support to the IP stack
2010-03-22 18:07 ` Christoph Lameter
@ 2010-03-22 18:53 ` Andi Kleen
2010-03-22 19:32 ` Christoph Lameter
` (2 more replies)
0 siblings, 3 replies; 21+ messages in thread
From: Andi Kleen @ 2010-03-22 18:53 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Andi Kleen, David Miller, netdev, linux-kernel
On Mon, Mar 22, 2010 at 01:07:37PM -0500, Christoph Lameter wrote:
> > > B. PGM over UDP
> > >
> > > fd = socket(AF_INET, SOCK_RDM, IPPROTO_UDP)
> > >
> > > C. PGM over SHM (?)
> > >
> > > fd = socket(AF_UNIX, SOCK_RDM, 0)
> >
> > Not sure how that should work.
>
> Multiple processes would communicate via shm segments. Maybe defer to the
> future but its an important operation mode as the systems grow bigger and bigger.
> SHM segment would have to contain some sort of ring buffer that the
> receivers could tap into. But that mode has not really been thought
> through.
AF_UNIX is not SHM today.
The only point is to avoid one copy? (user1 -> kernel -> user2 to user1 -> user2)
Not sure if that is really worth it. Don't you need another copy to the reliability
buffer anyways?
Letting kernel parse a data structure in user defined memory is also
always somewhat tricky.
But in principle AF_INET over localhost should not be that much less efficient
than AF_UNIX, so you can probably drop it for now (unless you need special AF_UNIX
features like credentials)
> > >
> > > Packet sizes are determined by the number of packets in a single sendmsg() unless
> >
> > Number of bytes surely?
>
> Sorry yes you are right.
>
> > > overridden by the RM_SET_MESSAGE_BOUNDARY socket option.
> >
> > That's unusual to have such a option (except the MTU). What is it good for?
>
> No idea why it was implemented. It can be used to use send() for portions
> of a message. Triggers the send() only when all bytes have been provided.
> Probably necessary if one wants to have very long (megabytes) messages.
Those could be a problem in kernel memory consumption. One would need
to be very careful to have a good memory management scheme for the socket
in place.
> > >
> > > A. Setting the window size / rate.
> > >
> > > struct pgm_send_window x;
> > > x.RateKbitsPerSec = 56;
> > > x.WindowSizeInMsecs = 60000;
> > > x.WindowSizeinBytes = 10000000;
> > >
> > > setsockopt(fd, SOCK_RDM, RM_RATE_WINDOW_SIZE, &x, sizeof(x));
> > >
> > > Default is sending at 56Kbps with a buffer of 10 Megabytes and buffering for a minute.
> >
> > That's a very large buffer for a socket. It would be better to use the usual
> > auto shrinking/increasing mechanisms.
>
> Reliable multicast protocols have a defined time period / "reliabilty
> buffer" so that they can resend a message that was missed for a time
> period. It is customary to either specify a time period or define the size
> of the "reliability buffer".
One problem is memory management then. What happens when a process opens 100 of those
sockets and fills them all?
I guess you would still need a suitable global limit like TCP has.
> Never used it. I'd rather skip for now. Maybe later.
>
> >
> > > /* Socket API structures (established by M$DN) */
> > > struct pgm_receiver_stats {
> > > u64 NumODataPacketsReceived; /* Number of ODATA (original) sequences */
> >
> > It's difficult to maintain 64 bit counters on 32bit hosts on all targets.
> > But I guess it would be ok to only fill in 32bit in this case.
>
> 32 bit counters have the awful habit of overflowing.
There's just no portable atomic64_t. Ok maybe you can use the socket lock
to synchronize all the counts if they are only per socket.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
* Re: Add PGM protocol support to the IP stack
2010-03-22 18:53 ` Andi Kleen
@ 2010-03-22 19:32 ` Christoph Lameter
2010-03-26 17:33 ` Christoph Lameter
2010-03-29 23:01 ` H. Peter Anvin
2 siblings, 0 replies; 21+ messages in thread
From: Christoph Lameter @ 2010-03-22 19:32 UTC (permalink / raw)
To: Andi Kleen; +Cc: David Miller, netdev, linux-kernel
On Mon, 22 Mar 2010, Andi Kleen wrote:
> > Multiple processes would communicate via shm segments. Maybe defer to the
> > future but its an important operation mode as the systems grow bigger and bigger.
> > SHM segment would have to contain some sort of ring buffer that the
> > receivers could tap into. But that mode has not really been thought
> > through.
>
> AF_UNIX is not SHM today.
>
> The only point is to avoid one copy? (user1 -> kernel -> user2 to user1 -> user2)
> Not sure if that is really worth it. Don't you need another copy to the reliability
> buffer anyways?
Not sure either. Access of multiple processes to one reliability buffer
would be best. Some sort of multiended pipe I guess.
> But in principle AF_INET over localhost should not be that less efficient
> than AF_UNIX, so you can probably drop it for now (unless you need special AF_UNIX
> features like credentials)
Well let's skip it for now and see if there are performance implications in
the future.
> > > That's unusual to have such a option (except the MTU). What is it good for?
> >
> > No idea why it was implemented. It can be used to use send() for portions
> > of a message. Triggers the send() only when all bytes have been provided.
> > Probably necessary if one wants to have very long (megabytes) messages.
>
> Those could be a problem in kernel memory consumption. One would need
> to be very careful to have a good memory management scheme for the socket
> in place.
Let's not support it then unless someone can make a convincing case.
> > Reliable multicast protocols have a defined time period / "reliabilty
> > buffer" so that they can resend a message that was missed for a time
> > period. It is customary to either specify a time period or define the size
> > of the "reliability buffer".
>
> One problem is memory management then. What happens when a process opens 100 of those
> sockets and fills them all?
Pushes out the app? Same as the user space apps now. Some sort of
upper limit is needed I guess.
> I guess you would still need a suitable global limit like TCP has.
Yes.
* Re: Add PGM protocol support to the IP stack
2010-03-22 18:53 ` Andi Kleen
2010-03-22 19:32 ` Christoph Lameter
@ 2010-03-26 17:33 ` Christoph Lameter
2010-03-27 13:11 ` Andi Kleen
2010-03-29 23:01 ` H. Peter Anvin
2 siblings, 1 reply; 21+ messages in thread
From: Christoph Lameter @ 2010-03-26 17:33 UTC (permalink / raw)
To: Andi Kleen; +Cc: David Miller, netdev, linux-kernel
Here is a pgm.7 manpage describing what the socket API could look like for
a PGM implementation.
I dumped the RM_* based socket options from the other OS since most of the
options were unusable.
.\" This man page is Copyright (C) 2010 Christoph Lameter <cl@linux-foundation.org>.
.\" Permission is granted to distribute possibly modified copies
.\" of this page provided the header is included verbatim,
.\" and in case of nontrivial modification author and date
.\" of the modification is added to the header.
.\"
.TH PGM 7 2010-08-01 "Linux" "Linux Programmer's Manual"
.SH NAME
pgm \- Pragmatic General Multicast Protocol Support for IPv4
.SH SYNOPSIS
.B #include <sys/socket.h>
.br
.B #include <netinet/in.h>
.br
.B #include <linux/pgm.h>
.sp
.B pgm_socket = socket(AF_INET, SOCK_RDM, IPPROTO_PGM);
.br
.B pgm_socket = socket(AF_INET, SOCK_RDM, IPPROTO_UDP);
.SH DESCRIPTION
This is an implementation of the Pragmatic General Multicast Protocol
described in RFC\ 3208.
PGM implements a connection oriented, Reliable Datagram Messaging
(thus SOCK_RDM) protocol. Packets are delivered in order even though the
network may
have reordered, duplicated or dropped packets. Receivers may ask for
retransmission of missed packets (NAK). Transmitters do not keep receiver
state so that an individual sender is able to interact with an unlimited
number of receivers.
The recovery mechanism of PGM can limit the scalability of PGM if too
many receivers are NAKing. Therefore measures exist at various layers
to reduce the potential repair volume that a transmitter may have to
deal with.
PGM supports two variants. The first one is the
.B native PGM protocol
which uses its own IP protocol implementation at the same level as TCP and UDP.
Native PGM supports NAK suppression ("assist") by network elements (Cisco,
Juniper and other commercially available routers have support for PGM) which
is an important measure to reduce the NAK volume in case of packet loss during
multicast replication of messages in the network. Routers can consolidate
multiple NAKs from downstream into a single upstream and are also able to
use
.B FEC
(Forward Error Correction) to directly provide repair data without having to
forward NAKs to a transmitter.
The second variant is
.B PGM over UDP.
UDP is used as a transport protocol
instead of IP. PGM over UDP does
.B not
support assist from network elements and
therefore has limited support for NAK suppression. PGM over UDP mainly exists
because of the lack of kernel based PGM implementations. Using raw sockets
for packet creation and packet reception is inefficient and slow. User space
based PGM implementations are typically restricted to a single stream or multiple
streams in the same process since the in-kernel multiplexing available for TCP
and UDP does not exist.
PGM over UDP allows the use of UDP port multiplexing instead, which allows for
efficient operation of multiple streams on a single system even if the
OS has no native support for PGM.
Creation of a PGM socket will lead to an unconnected socket. A sender must connect
to a multicast address to be able to send messages. A receiver needs to
bind to the multicast address and port number of interest and then listen
to the socket. The receiver can accept a connection when PGM traffic is
received on the chosen PGM multicast address and port. It is then
possible to receive datagrams on the PGM socket.
When
.BR connect (2)
is called on the socket, the multicast destination address is set and
datagrams can then be sent using
.BR send (2)
or
.BR write (2).
It is not possible to send to destinations other than the single multicast
address the socket is connected to. Note that send operations will cause the
application to be throttled (blocked) if the maximum transmission rate is exceeded.
Blocking can be avoided by setting the socket to non-blocking mode or
by using MSG_DONTWAIT.
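The sender side described above can be sketched as follows. This is a minimal sketch, assuming a kernel that implements the proposed SOCK_RDM/IPPROTO_PGM socket; current kernels are expected to reject the socket() call. The helper names are hypothetical, the value 113 is the IANA protocol number for PGM used as a stand-in for the proposed constant, and EAGAIN as the "throttled" result of a non-blocking send is an assumption that the text above does not state.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef IPPROTO_PGM
#define IPPROTO_PGM 113 /* IANA number for PGM; assumed value of the proposed constant */
#endif

/* Hypothetical helper: create a PGM socket and connect it to one multicast
 * group. Returns the descriptor, or -1 with errno set. */
int pgm_sender_open(const char *group, uint16_t port)
{
    struct sockaddr_in dst;

    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(port);

    /* Only a multicast destination makes sense for a PGM sender. */
    if (inet_pton(AF_INET, group, &dst.sin_addr) != 1 ||
        !IN_MULTICAST(ntohl(dst.sin_addr.s_addr))) {
        errno = EINVAL;
        return -1;
    }

    int fd = socket(AF_INET, SOCK_RDM, IPPROTO_PGM);
    if (fd < 0)
        return -1; /* expected on kernels without PGM support */

    if (connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}

/* Hypothetical helper: send without blocking on the rate limiter.
 * Returns 1 if sent, 0 if the send would have blocked, -1 on other errors. */
int pgm_send_nowait(int fd, const void *buf, size_t len)
{
    if (send(fd, buf, len, MSG_DONTWAIT) >= 0)
        return 1;
    return (errno == EAGAIN || errno == EWOULDBLOCK) ? 0 : -1;
}
```

A caller would typically retry later, for example after poll(2) reports the socket writable, when pgm_send_nowait() returns 0.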
In order to receive packets, the socket needs to be bound to a multicast
address first by using
.BR bind (2).
All receive operations return only one packet.
When the packet is smaller than the passed buffer, only that much
data is returned; when it is bigger, the packet is truncated and the
.B MSG_TRUNC
flag is set.
.B MSG_WAITALL
is not supported.
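The one-packet-per-call and truncation behaviour matches what other datagram sockets already do, so it can be sketched with recvmsg(2) today. The helper name is hypothetical, and the assumption is that a PGM socket would report MSG_TRUNC in msg_flags the same way existing datagram sockets do.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Receive exactly one datagram into buf.
 * Returns the number of bytes stored, or -1 on error.
 * *truncated is set to 1 when the datagram was larger than len. */
ssize_t recv_one_datagram(int fd, void *buf, size_t len, int *truncated)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    ssize_t n = recvmsg(fd, &msg, 0);
    if (n < 0)
        return -1;

    /* The kernel reports truncation through the returned message flags. */
    *truncated = (msg.msg_flags & MSG_TRUNC) != 0;
    return n;
}
```

Since the excess bytes of a truncated packet are lost, a receiver would normally size its buffer for the largest packet it expects.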
Some IP options may be sent or received using the socket options described in
.BR ip (7).
However, multicast join and leave operations are not supported.
See
.BR ip (7).
By default, Linux PGM does path MTU (Maximum Transmission Unit) discovery.
This means the kernel
will keep track of the MTU to a specific target IP address and return
.B EMSGSIZE
when a PGM packet write exceeds it.
When this happens, the application should decrease the packet size.
Path MTU discovery can also be turned off using the
.B IP_MTU_DISCOVER
socket option or the
.I /proc/sys/net/ipv4/ip_no_pmtu_disc
file; see
.BR ip (7)
for details.
When turned off, PGM will fragment outgoing PGM packets
that exceed the interface MTU.
However, disabling it is not recommended
for performance and reliability reasons.
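The IP_MTU_DISCOVER option mentioned above already exists for other IPv4 sockets, so toggling it can be sketched with real calls. The helper names are hypothetical, and the assumption is that a PGM socket would accept the option exactly as a UDP socket does.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Turn path MTU discovery on (enable != 0) or off for one socket.
 * Returns 0 on success, -1 with errno set on failure. */
int set_pmtu_discovery(int fd, int enable)
{
    int mode = enable ? IP_PMTUDISC_DO : IP_PMTUDISC_DONT;

    return setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &mode, sizeof(mode));
}

/* Read back the current mode (an IP_PMTUDISC_* value), or -1 on failure. */
int get_pmtu_discovery(int fd)
{
    int mode = 0;
    socklen_t len = sizeof(mode);

    if (getsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &mode, &len) < 0)
        return -1;
    return mode;
}
```

With discovery left on, an application that sees EMSGSIZE would reduce its packet size and retry rather than switch discovery off.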
.SS "Address Format"
PGM supports IPv4 and IPv6 but Linux currently only supports IPv4. The
.I sockaddr_in
address format described in
.BR ip (7)
is used.
.SS "Error Handling"
All fatal errors will be passed to the user as an error return even
when the socket is not connected.
This includes asynchronous errors
received from the network.
You may get an error for an earlier packet
that was sent on the same socket.
When the
.B IP_RECVERR
option is enabled, all errors are stored in the socket error queue,
and can be received by
.BR recvmsg (2)
with the
.B MSG_ERRQUEUE
flag set.
.SS /proc interfaces
System-wide PGM parameter settings can be accessed by files in the directory
.IR /proc/sys/net/ipv4/ .
.TP
.IR pgm_mem " "
This is a vector of three integers governing the number
of pages allowed for queueing by all PGM sockets.
.RS
.TP 10
.I min
Below this number of pages, PGM is not bothered about its
memory appetite.
When the amount of memory allocated by PGM exceeds
this number, PGM starts to moderate memory usage.
.TP
.I pressure
This value was introduced to follow the format of
.IR tcp_mem
(see
.BR tcp (7)).
.TP
.I max
Number of pages allowed for queueing by all PGM sockets.
.RE
.IP
Default values for these three items are
calculated at boot time from the amount of available memory.
.TP
.IR pgm_window_size_default " (integer; default value: 10 MB)"
Default size, in bytes, of receive and transmit windows used by PGM sockets.
Each PGM socket is able to use this size for its receive window,
even if the total pages of PGM sockets exceed the
.I pgm_mem
pressure value.
.TP
.IR pgm_window_msec_default " (integer; default value: 2000)"
Default time, in milliseconds, that packets are kept in the transmit and receive windows.
Each PGM socket is able to use this time period to resend data,
even if the total pages of PGM sockets exceed the
.I pgm_mem
pressure value.
.TP
.IR pgm_ambient_spm_msecs " (integer; default value: 15000)"
Interval, in milliseconds, of the unconditional heartbeat (ambient SPM) that PGM
transmitters send to periodically notify receivers about the stream status.
The default corresponds to 15 seconds.
.TP
.IR pgm_spm_list_usec " (integers; default value: 1000 1000 4000 8000 16000 32000 64000 128000 256000 1000000 2000000 8000000)"
Intervals, in microseconds, between successive SPM heartbeats once the connection goes idle.
Initial SPMs are sent rapidly to allow fast discovery of a missed packet; the interval
then backs off until the unconditional heartbeat interval is reached.
.TP
.IR pgm_transmitter_rate_kbps " (integer; default value: 56)"
Default limit on the rate of traffic produced by a single transmitter.
The rate is an overall maximum covering both repair and original data. The limit
is set low because transmitters can do a lot of harm to the network
(especially WAN links) if they send at high rates. It is advisable to
be careful when increasing the rate.
.TP
.IR pgm_transmitter_repair_rate_kbps " (integer; default value: 30)"
Default limit on the amount of repair data sent by a single transmitter.
.TP
.IR pgm_transmitter_nak_ignore_after_rdata_msec " (integer; default value: 50)"
Period during which receiver NAKs are ignored after repair data was sent
(usually set to match the maximum WAN delay seen). This is
used to avoid sending useless additional repair data while a NAK or repair data
is still in flight.
.TP
.IR pgm_crybaby_rate_kbps " (integer; default 20)"
Maximum rate of repair traffic to a single receiver. A single receiver may
be slow and unable to keep up, and may therefore continually ask for repairs
(hence the name
.BR crybaby ).
This parameter limits the impact of the continual repair traffic caused by a crybaby.
The limit typically causes the crybaby to fall so far out of sync that it finally has
to give up, since the messages for which repair is needed have expired on the transmitter side.
Note that transmitters do not keep track of receivers. Crybaby detection is an
opportunistic heuristic.
.TP
.IR pgm_fec_proactive_packets " (integer; default 0)"
The number of parity packets to insert in each sequence of
.B pgm_fec_group_size
packets. FEC (Forward Error Correction) is another means to reduce NAK
traffic in configurations with a large number of receivers. Receivers
(and network elements) will be able to reconstruct missed packets on their
own without resorting to NAKs. However, if too many packets are missed and
recovery is not possible, NAKs will still be sent.
.TP
.IR pgm_fec_group_size " (integer; default 16)"
Defines a unit of packets for which FEC parity packets are created.
.TP
.IR pgm_nak_retries " (integer; default 20)"
The number of recovery attempts to make for a single message before giving up.
.TP
.IR pgm_naks_per_sec " (integer; default 50)"
The maximum number of NAKs to send per second.
.TP
.IR pgm_debug " (integer; default 0)"
Enables diagnostics for PGM interaction on the network.
If set to one, PGM will log all recovery activities.
If set to two, PGM will additionally log SPMs, SPMRs, and connection setup and teardown.
If set to three, PGM will log all activities in the syslog.
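How some of the tunables above interact can be sketched with a few small helpers. The function names are hypothetical and each one encodes an assumption rather than the proposed implementation: the SPM list is read as one interval per idle heartbeat, capped by the ambient interval; 1 kbps is taken as 1000 bits per second; and FEC is assumed to be an erasure code (such as Reed-Solomon) where any group-size-many distinct packets rebuild the group.

```c
#include <stddef.h>
#include <stdint.h>

/* Interval before idle SPM number idx (0-based). Once the list is exhausted,
 * or an entry reaches the ambient interval, the ambient interval applies. */
uint64_t spm_interval_usec(const uint32_t *list, size_t n, size_t idx,
                           uint64_t ambient_usec)
{
    if (idx >= n || list[idx] >= ambient_usec)
        return ambient_usec;
    return list[idx];
}

/* Time over which `bytes` must be spread to stay within rate_kbps.
 * A rate of 0 is treated as "never send" and yields UINT64_MAX. */
uint64_t pacing_delay_usec(uint64_t bytes, uint32_t rate_kbps)
{
    if (rate_kbps == 0)
        return UINT64_MAX;
    /* bytes * 8 bits / (rate_kbps * 1000 bit/s), expressed in microseconds */
    return bytes * 8000u / rate_kbps;
}

/* A group can be rebuilt without NAKs when enough distinct packets
 * (original data plus parity) of that group have arrived. */
int fec_group_recoverable(unsigned group_size, unsigned data_received,
                          unsigned parity_received)
{
    return data_received + parity_received >= group_size;
}
```

For instance, at the default of 56 kbps a 1400-byte packet occupies the transmitter for 200000 microseconds, which illustrates why the text calls the default limit low.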
.SS "Socket Options"
To set or get a PGM socket option, call
.BR getsockopt (2)
to read or
.BR setsockopt (2)
to write the option with the option level argument set to
.BR IPPROTO_PGM .
.TP
.BR PGM_TRANSMITTER_CONFIG
This option is used to set up parameters for the transmitter before
connecting to a multicast address. The option cannot be used on a
connected SOCK_RDM socket. It is recommended to first get the
configuration data (which will contain the configured OS defaults) and
then modify individual fields as needed.
.sp
.in +4n
.nf
struct pgm_transmitter_config {
int rate_kbyte; /* Maximum rate per second */
int window_msecs; /* Window maximum packet age */
int window_kbytes; /* Window maximum size in kbytes */
int ambient_spm_msecs; /* Unconditional SPM */
int spm_msecs[12]; /* Idle SPM backoff */
int repeat_nak_ignore_msecs; /* How long to skip nacks after sending rdata */
int repair_rate_kbyte; /* Max permitted rate of repair traffic */
int crybaby_rate_kbyte; /* Max rate of repair traffic to individual receiver */
int transmit_only:1; /* If set do not process feedback from receivers */
int fec:1; /* Enable forward error correction */
int fec_parity:1; /* Respond to parity repair packet requests */
int fec_packets_per_group; /* Maximum number of packets for a group. */
int fec_proactive_packets; /* Number of proactive packets per group. */
int fec_group_size; /* Number of packets to be treated as a group. Power of two */
};
.fi
.TP
.BR PGM_TRANSMITTER_STATISTICS
Retrieves transmitter statistics.
.sp
.in +4n
.nf
struct pgm_transmitter_stats {
u64 bytes_received;
u64 data_send;
u64 naks_received;
u64 naks_too_late; /* NAKs received after receive window advanced */
u64 naks_outstanding; /* Number of NAKs awaiting response */
u64 naks_after_rdata; /* Number of NAKs after RDATA sequences were sent which were ignored */
u64 rdata_packets; /* Repair data */
u64 odata_packets; /* Original data */
u32 first_seqid; /* Oldest sequence id in window */
u32 last_seqid; /* Newest sequence id in window */
};
.fi
.TP
.BR PGM_RECEIVER_CONFIG
Used to set up receiver parameters before accepting a connection.
The option cannot be used on a connected SOCK_RDM socket.
.sp
.in +4n
.nf
struct pgm_receiver_config {
int window_msecs; /* Receive window maximum age (per transmitter) */
int window_kbyte; /* Receive window maximum size (per transmitter) */
int nak_retries; /* Nak retries before giving up */
int nak_ncf_retries; /* Nak retries after NCF before giving up */
int nak_backoff_interval; /* time to backoff on NAK failure */
int naks_per_sec; /* Limit on the naks per second */
int peer_timeout; /* Discard peer if silent for this time period */
int spmr_timeout; /* Abort connection if no SPMR response */
int receive_only:1; /* Never send data to sender */
};
.fi
.TP
.BR PGM_RECEIVER_STATISTICS
Retrieves receiver statistics.
.sp
.in +4n
.nf
struct pgm_receiver_stats {
u64 bytes_received; /* Total bytes received */
u64 data_received; /* Useful data bytes received */
u64 odata_packets; /* Number of ODATA (original) sequences */
u64 rdata_packets; /* Number of RDATA (repair) sequences */
u64 odata_duplicates; /* Duplicate ODATA */
u64 rdata_duplicates; /* Duplicate RDATA */
u32 first_seqid; /* First buffered sequence id (first transmitter) */
u32 last_seqid; /* Last buffered sequence id (first transmitter) */
u32 first_naked_seqid; /* First sequence id that was NAKed */
u64 pending_naks; /* Outstanding NAKs */
u64 pending_ncfs; /* Outstanding NCFs */
u64 naks_sent;
u64 parity_naks_sent;
u32 active_transmitters; /* Number of transmitters */
};
.fi
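The recommended get, modify, set sequence for PGM_TRANSMITTER_CONFIG can be sketched as follows. This is only a sketch of the proposed interface: the option number and the value of IPPROTO_PGM are placeholders, the struct is copied from the text above, and both helper names and the validity checks (repair rate not above the overall rate, FEC group size a power of two) are assumptions derived from the field comments.

```c
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef IPPROTO_PGM
#define IPPROTO_PGM 113          /* IANA number for PGM; assumed value */
#endif
#ifndef PGM_TRANSMITTER_CONFIG
#define PGM_TRANSMITTER_CONFIG 1 /* hypothetical option number */
#endif

struct pgm_transmitter_config { /* copied from the proposed man page */
    int rate_kbyte;
    int window_msecs;
    int window_kbytes;
    int ambient_spm_msecs;
    int spm_msecs[12];
    int repeat_nak_ignore_msecs;
    int repair_rate_kbyte;
    int crybaby_rate_kbyte;
    int transmit_only:1;
    int fec:1;
    int fec_parity:1;
    int fec_packets_per_group;
    int fec_proactive_packets;
    int fec_group_size;
};

/* Assumed sanity rules; returns 1 if the configuration looks consistent. */
int pgm_config_valid(const struct pgm_transmitter_config *c)
{
    int g = c->fec_group_size;

    if (c->rate_kbyte <= 0 || c->repair_rate_kbyte > c->rate_kbyte)
        return 0;
    if (g < 0 || (g != 0 && (g & (g - 1)) != 0))
        return 0; /* group size must be a power of two when set */
    return 1;
}

/* Fetch the OS defaults, change two fields, and write the result back.
 * Must run before connect(2). Returns 0 on success, -1 with errno set. */
int pgm_update_transmit_rate(int fd, int rate_kbyte, int repair_rate_kbyte)
{
    struct pgm_transmitter_config cfg;
    socklen_t len = sizeof(cfg);

    if (getsockopt(fd, IPPROTO_PGM, PGM_TRANSMITTER_CONFIG, &cfg, &len) < 0)
        return -1;

    cfg.rate_kbyte = rate_kbyte;
    cfg.repair_rate_kbyte = repair_rate_kbyte;
    if (!pgm_config_valid(&cfg)) {
        errno = EINVAL;
        return -1;
    }
    return setsockopt(fd, IPPROTO_PGM, PGM_TRANSMITTER_CONFIG, &cfg,
                      sizeof(cfg));
}
```

Starting from the values returned by getsockopt(2) keeps every field the caller does not care about at the system default.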
.SS Ioctls
These ioctls can be accessed using
.BR ioctl (2).
The correct syntax is:
.PP
.RS
.nf
.BI int " value";
.IB error " = ioctl(" pgm_socket ", " ioctl_type ", &" value ");"
.fi
.RE
.TP
.BR FIONREAD " (" SIOCINQ )
Takes a pointer to an integer as argument.
Returns the size of the next pending datagram in the integer in bytes,
or 0 when no datagram is pending.
.TP
.BR TIOCOUTQ " (" SIOCOUTQ )
Returns the number of data bytes in the local send queue.
.PP
In addition all ioctls documented in
.BR ip (7)
and
.BR socket (7)
are supported.
.SH ERRORS
All errors documented for
.BR socket (7)
or
.BR ip (7)
may be returned by a send or receive on a PGM socket.
.TP
.B ECONNREFUSED
The socket was not associated with a multicast address. For a receiver
this may mean that no PGM traffic was detected on the given port. The
address specified may not be a valid multicast address.
.TP
.B ENOTCONN
Socket is not connected.
.TP
.B EISCONN
Socket is already connected.
.TP
.B ECONNABORTED
Receiver was not able to keep up. Connection was
torn down.
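One way an application might react to these errors is sketched below. The enum, the function name and the chosen reactions are hypothetical; only the errno values come from the text (EAGAIN is included on the assumption that a throttled non-blocking send reports it).

```c
#include <errno.h>

enum pgm_action {
    PGM_RETRY_LATER,   /* transient: try the same call again later */
    PGM_SHRINK_PACKET, /* packet exceeded the path MTU */
    PGM_RECONNECT,     /* stream was torn down, set it up again */
    PGM_FIX_SETUP,     /* wrong call order or wrong address */
    PGM_FATAL          /* anything else */
};

/* Map an errno value from a PGM socket call to a suggested reaction. */
enum pgm_action pgm_classify_error(int err)
{
    switch (err) {
    case EAGAIN:       /* assumed result of a throttled non-blocking send */
        return PGM_RETRY_LATER;
    case EMSGSIZE:     /* path MTU discovery rejected the packet size */
        return PGM_SHRINK_PACKET;
    case ECONNABORTED: /* receiver was not able to keep up */
        return PGM_RECONNECT;
    case ECONNREFUSED: /* no multicast address or no PGM traffic seen */
    case ENOTCONN:
    case EISCONN:
        return PGM_FIX_SETUP;
    default:
        return PGM_FATAL;
    }
}
```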
.\" .SH CREDITS
.\" This man page was written by Christoph Lameter.
.SH "SEE ALSO"
.BR ip (7),
.BR raw (7),
.BR socket (7),
.BR udp (7)
RFC\ 3208 for the Pragmatic General Multicast protocol.
.br
RFC\ 1122 for the host requirements.
.br
RFC\ 1191 for a description of path MTU discovery.
.SH COLOPHON
This page is part of release 3.xx of the Linux
.I man-pages
project.
A description of the project,
and information about reporting bugs,
can be found at
http://www.kernel.org/doc/man-pages/.
* Re: Add PGM protocol support to the IP stack
2010-03-26 17:33 ` Christoph Lameter
@ 2010-03-27 13:11 ` Andi Kleen
2010-03-27 16:54 ` Martin Sustrik
2010-03-29 15:00 ` Christoph Lameter
0 siblings, 2 replies; 21+ messages in thread
From: Andi Kleen @ 2010-03-27 13:11 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Andi Kleen, David Miller, netdev, linux-kernel
On Fri, Mar 26, 2010 at 12:33:07PM -0500, Christoph Lameter wrote:
> Here is a pgm.7 manpage describing how the socket API could look like for
> a PGM implementation.
>
> I dumped the RM_* based socket options from the other OS since most of the
> options were unusable.
I did a quick read and the manpage/interface seem reasonable to me.
You changed the parameter struct fields to lower case. While
that looks definitely more Linuxy than before does it mean programs
have to #ifdef this? It might be good idea to have at least some
optional compat header that #defines.
-Andi
* Re: Add PGM protocol support to the IP stack
2010-03-27 13:11 ` Andi Kleen
@ 2010-03-27 16:54 ` Martin Sustrik
2010-03-29 14:50 ` Christoph Lameter
2010-03-29 15:00 ` Christoph Lameter
1 sibling, 1 reply; 21+ messages in thread
From: Martin Sustrik @ 2010-03-27 16:54 UTC (permalink / raw)
To: Andi Kleen; +Cc: Christoph Lameter, David Miller, netdev, linux-kernel
Andi Kleen wrote:
> I did a quick read and the manpage/interface seem reasonable to me.
You may also have a look at original PGM implementation by Luigi Rizzo
(FreeBSD). It's not maintained, but it might give you broader view.
http://info.iet.unipi.it/~luigi/pgm-code/
Martin
* Re: Add PGM protocol support to the IP stack
2010-03-27 16:54 ` Martin Sustrik
@ 2010-03-29 14:50 ` Christoph Lameter
0 siblings, 0 replies; 21+ messages in thread
From: Christoph Lameter @ 2010-03-29 14:50 UTC (permalink / raw)
To: Martin Sustrik; +Cc: Andi Kleen, David Miller, netdev, linux-kernel
On Sat, 27 Mar 2010, Martin Sustrik wrote:
> Andi Kleen wrote:
>
> > I did a quick read and the manpage/interface seem reasonable to me.
>
> You may also have a look at original PGM implementation by Luigi Rizzo
> (FreeBSD). It's not maintained, but it might give you broader view.
>
> http://info.iet.unipi.it/~luigi/pgm-code/
Interesting. Which files in that directory contain the most current code?
Looks like the tcpdump patch has been merged.
Here is another tcpdump patch that implements decoding PGM via UDP. Anyone
know how to submit something like that?
(Need to specify -Tpgm option to use pgm decoder on UDP traffic)
Index: tcpdump/interface.h
===================================================================
--- tcpdump.orig/interface.h 2010-02-26 18:50:39.411609391 -0600
+++ tcpdump/interface.h 2010-02-26 18:51:04.270350179 -0600
@@ -74,6 +74,7 @@
#define PT_CNFP 7 /* Cisco NetFlow protocol */
#define PT_TFTP 8 /* trivial file transfer protocol */
#define PT_AODV 9 /* Ad-hoc On-demand Distance Vector Protocol */
+#define PT_PGM 10 /* The PGM protocol */
#ifndef min
#define min(a,b) ((a)>(b)?(b):(a))
Index: tcpdump/print-udp.c
===================================================================
--- tcpdump.orig/print-udp.c 2010-02-26 18:51:35.921610552 -0600
+++ tcpdump/print-udp.c 2010-02-26 18:53:54.440349950 -0600
@@ -520,6 +520,11 @@
tftp_print(cp, length);
break;
+ case PT_PGM:
+ udpipaddr_print(ip, sport, dport);
+ pgm_print(cp, length, (const u_char *)ip);
+ break;
+
case PT_AODV:
udpipaddr_print(ip, sport, dport);
aodv_print((const u_char *)(up + 1), length,
Index: tcpdump/tcpdump.c
===================================================================
--- tcpdump.orig/tcpdump.c 2010-02-26 18:37:13.971601597 -0600
+++ tcpdump/tcpdump.c 2010-02-26 18:37:43.290033748 -0600
@@ -854,6 +854,8 @@
packettype = PT_TFTP;
else if (strcasecmp(optarg, "aodv") == 0)
packettype = PT_AODV;
+ else if (strcasecmp(optarg, "pgm") == 0)
+ packettype = PT_PGM;
else
error("unknown packet type `%s'", optarg);
break;
* Re: Add PGM protocol support to the IP stack
2010-03-27 13:11 ` Andi Kleen
2010-03-27 16:54 ` Martin Sustrik
@ 2010-03-29 15:00 ` Christoph Lameter
2010-03-29 21:43 ` Andi Kleen
1 sibling, 1 reply; 21+ messages in thread
From: Christoph Lameter @ 2010-03-29 15:00 UTC (permalink / raw)
To: Andi Kleen; +Cc: David Miller, netdev, linux-kernel
On Sat, 27 Mar 2010, Andi Kleen wrote:
> On Fri, Mar 26, 2010 at 12:33:07PM -0500, Christoph Lameter wrote:
> > Here is a pgm.7 manpage describing how the socket API could look like for
> > a PGM implementation.
> >
> > I dumped the RM_* based socket options from the other OS since most of the
> > options were unusable.
>
> I did a quick read and the manpage/interface seem reasonable to me.
Thanks. I will then proceed to get a patch out that implements the
network environment. Then we can plug the openpgm logic in there.
> You changed the parameter struct fields to lower case. While
> that looks definitely more Linuxy than before does it mean programs
> have to #ifdef this? It might be good idea to have at least some
> optional compat header that #defines.
The socket API will be completely different. The basic handling of the
sockets is the same (binding, listening, connecting). There is no way of
mapping M$ socket options to Linux socket options with the approach that
I proposed in the manpage. The stats structure is different too since some
key elements were missing.
What users are there of the M$ api? I have seen vendors supplying their
own pgm implementation (guess due to bit rot in the old M$
implementation).
* Re: Add PGM protocol support to the IP stack
2010-03-29 15:00 ` Christoph Lameter
@ 2010-03-29 21:43 ` Andi Kleen
0 siblings, 0 replies; 21+ messages in thread
From: Andi Kleen @ 2010-03-29 21:43 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Andi Kleen, David Miller, netdev, linux-kernel
On Mon, Mar 29, 2010 at 10:00:57AM -0500, Christoph Lameter wrote:
> On Sat, 27 Mar 2010, Andi Kleen wrote:
>
> > On Fri, Mar 26, 2010 at 12:33:07PM -0500, Christoph Lameter wrote:
> > > Here is a pgm.7 manpage describing how the socket API could look like for
> > > a PGM implementation.
> > >
> > > I dumped the RM_* based socket options from the other OS since most of the
> > > options were unusable.
> >
> > I did a quick read and the manpage/interface seem reasonable to me.
>
> Thanks. I will then proceed to get a patch out that implements the
> network environment. Then we can plug the openpgm logic in there.
You might still need some reviewing from network maintainers.
>
> > You changed the parameter struct fields to lower case. While
> > that looks definitely more Linuxy than before does it mean programs
> > have to #ifdef this? It might be good idea to have at least some
> > optional compat header that #defines.
>
> The socket API will be completely different. The basic handling of the
> sockets is the same (binding, listening, connecting). There is no way of
> mapping M$ socket options to Linux socket options with the approach that
> I proposed in the manpage. The stats structure is different too since some
> key elements were missing.
Ok.
>
> What users are there of the M$ api? I have seen vendors supplying their
> own pgm implementation (guess due to bit rot in the old M$
> implementation).
I don't know, it was just a general consideration.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
* Re: Add PGM protocol support to the IP stack
2010-03-22 18:53 ` Andi Kleen
2010-03-22 19:32 ` Christoph Lameter
2010-03-26 17:33 ` Christoph Lameter
@ 2010-03-29 23:01 ` H. Peter Anvin
2010-03-30 18:12 ` Christoph Lameter
2 siblings, 1 reply; 21+ messages in thread
From: H. Peter Anvin @ 2010-03-29 23:01 UTC (permalink / raw)
To: Andi Kleen; +Cc: Christoph Lameter, David Miller, netdev, linux-kernel
On 03/22/2010 11:53 AM, Andi Kleen wrote:
>
> There's just no portable atomic64_t. Ok maybe you can use the socket lock
> to synchronize all the counts if they are only per socket.
>
In 2.6.34 there is (although some arches which could support it natively
don't as of yet... but that's fixable.) See lib/atomic64.c.
-hpa
* Re: Add PGM protocol support to the IP stack
2010-03-29 23:01 ` H. Peter Anvin
@ 2010-03-30 18:12 ` Christoph Lameter
0 siblings, 0 replies; 21+ messages in thread
From: Christoph Lameter @ 2010-03-30 18:12 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: Andi Kleen, David Miller, netdev, linux-kernel
On Mon, 29 Mar 2010, H. Peter Anvin wrote:
> On 03/22/2010 11:53 AM, Andi Kleen wrote:
> >
> > There's just no portable atomic64_t. Ok maybe you can use the socket lock
> > to synchronize all the counts if they are only per socket.
> >
>
> In 2.6.34 there is (although some arches which could support it natively
> don't as of yet... but that's fixable.) See lib/atomic64.c.
There are also the 64bit thiscpu operations that were merged in 2.6.33.
They do the right thing if the arch does not provide operations.
end of thread, other threads:[~2010-03-30 18:12 UTC | newest]
Thread overview: 21+ messages
2010-03-18 17:58 Add PGM protocol support to the IP stack Christoph Lameter
2010-03-18 21:58 ` Christoph Lameter
2010-03-19 17:18 ` Andi Kleen
2010-03-19 21:53 ` David Miller
2010-03-19 22:26 ` H. Peter Anvin
2010-03-22 14:24 ` Christoph Lameter
2010-03-22 14:20 ` Christoph Lameter
2010-03-22 16:36 ` Andi Kleen
2010-03-22 16:51 ` Christoph Lameter
2010-03-22 17:43 ` Andi Kleen
2010-03-22 18:07 ` Christoph Lameter
2010-03-22 18:53 ` Andi Kleen
2010-03-22 19:32 ` Christoph Lameter
2010-03-26 17:33 ` Christoph Lameter
2010-03-27 13:11 ` Andi Kleen
2010-03-27 16:54 ` Martin Sustrik
2010-03-29 14:50 ` Christoph Lameter
2010-03-29 15:00 ` Christoph Lameter
2010-03-29 21:43 ` Andi Kleen
2010-03-29 23:01 ` H. Peter Anvin
2010-03-30 18:12 ` Christoph Lameter