* Xtables2 A7 spec draft
@ 2011-02-02 22:04 Jan Engelhardt
  2011-02-05 19:33 ` Jozsef Kadlecsik
  2011-02-07 20:50 ` James Nurmi
  0 siblings, 2 replies; 6+ messages in thread
From: Jan Engelhardt @ 2011-02-02 22:04 UTC
  To: Netfilter Developer Mailing List



I am posting the Xtables2 Netlink interface specification, draft 7
for comments.

Additionally, further documentation and toolchain around
it is available through the project page at

	http://jengelh.medozas.de/projects/xtables/

 * User Documentation Chapter 1: Architectural Differences
 * Developer Documentation Part 1: Netlink interface (WIP)
   This is copied below to facilitate inline replies
--8<--

Netlink interface

1 Concepts

This section is non-normative and should instead show the flow of 
thought and give reasons as to why the specification was 
conceived the way it is, and where the component problems are.

1.1 Nesting representation

The common element in Xtables is the ruleset, represented as a 
tree structure with ordering constraints at some levels:

ruleset (unordered tables)
 \__ table (unordered chains)
 |    \__ chain (ordered rules)
 |    |    \__ rule (ordered actions)
 |    |    |    \__ match (unordered data)
 |    |    |    |    \__ config-data
 |    |    |    |    |    \__ bin params
 |    |    |    |    \__ state-data
 |    |    |    |         \__ nlattrs
 |    |    |    \__ match...
 |    |    |    \__ target (unordered data)
 |    |    |    |    \__ config-data
 |    |    |    \__ target...
 |    |    |    \__ verdict...
 |    |    \__ rule...
 |    \__ chain...
 \__ table...

As a more concrete example, here is a small ruleset, encoded into
XML (just one of many possible representations):

<table>
  <chain name="INPUT">
    <rule idx="1">
      <match acidx="1" name="hashlimit" rev="1" csize="120">
        <config-data>...</config-data>
        <state-data>...</state-data>
      </match>
      <target acidx="2" name="TOS" rev="1">
        ...
      </target>
      <verdict acidx="3" name="ACCEPT" />
    </rule>
  </chain>
</table>

There are different ways to encode such a tree structure into a 
serialized stream. In many Netlink protocols, children attributes 
are encapsulated (a. k. a. “nested”, though we will avoid this 
term to avoid double-use) and treated as a whole as a parent's 
opaque data. It cannot be told apart from normal data. (Like 
writing “<chain> &lt;rule&gt; ... &lt;/rule&gt; </chain>” in 
XML.) We will call this format “Encapsulated Encoding”.

To encode an attribute's length, struct nlattr only has a 16-bit
field, which means the attribute header plus payload is limited
to 64 KB. This limit is easily exceeded with the encapsulated
encoding, since a chain attribute, for example, would have to
contain all of the chain's rules. The problem is aggravated by
the kernel's Netlink handler only allocating sk_buffs a page size
worth, which leaves little room for extension data. In the worst
case, the usable payload for attributes is only around 3600
bytes. In light of xt_u32's private data block being 1984 bytes
already, that means that you won't be able to fit two -m u32
invocations nested in a single rule into a dump.
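The arithmetic behind that last claim can be sketched as follows. This
is only an illustration: the 3600 and 1984 byte figures are the ones
quoted above, and the xt2_* helper names are made up for this example.

```c
#include <stdbool.h>
#include <stddef.h>

/* A Netlink attribute header is 4 bytes (u16 length + u16 type). */
#define XT2_NLA_HDRLEN 4

/* Figures quoted in the text above. */
#define XT2_WORST_CASE_PAYLOAD 3600
#define XT2_U32_PRIVATE_SIZE   1984

/* Attributes are padded to a multiple of 4 bytes. */
static inline size_t xt2_nla_align(size_t n)
{
	return (n + 3) & ~(size_t)3;
}

/* Space that one attribute carrying `payload` bytes occupies. */
static inline size_t xt2_attr_space(size_t payload)
{
	return xt2_nla_align(XT2_NLA_HDRLEN + payload);
}

/* Do `count` attributes of `payload` bytes fit the worst-case buffer? */
static inline bool xt2_fits_worst_case(size_t count, size_t payload)
{
	return count * xt2_attr_space(payload) <= XT2_WORST_CASE_PAYLOAD;
}
```

One u32 block occupies 1988 bytes and fits; two occupy 3976 bytes and
do not, before any other attribute of the rule is even counted.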

Certain voices in the community call for obsoleting such data
blobs and replacing them with Netlink attributes; there are no
objections to doing so. However, the problem of size-limited
sk_buffs applies to opaque data of any kind, and Netlink
attributes fall within that.

The Xtables2 Netlink protocol encodes each node of information as
a standalone attribute that is appended (a. k. a. “chained”) to
the data stream; we will call this format “Flat Encoding”.
Avoiding encapsulated attributes makes it possible to split
messages at much finer levels, and leaves a maximally-sized
buffer to those attributes that do carry opaque data.

1.2 Nest markers

Since Netlink messages do have a 32-bit quantity to store the 
message length, rulesets of roughly up to 4 GB are possible,
which is currently regarded as sufficient. The largest (while 
still being meaningful) rulesets seen to date in the industry 
weighed in at approximately 150 MB.

Whereas encapsulated attribute encoding automatically provided 
for boundaries, this is realized using dummy attributes in the 
chained approach. The start of a nesting level can be implicitly 
represented by the presence of the attribute that would have 
otherwise been used for encapsulated nesting. For declaring an 
end of a nest level, an extra attribute is needed:

• “chain { rule; rule; ... }” <=> CHAIN RULE RULE ... STOP
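A receiver can recover the nesting from such a flat stream with a
simple depth counter. The following is a minimal sketch, not part of
the specification; the enum values are placeholders (the draft only
fixes the stop marker at 0) and the function names are invented here.

```c
#include <stddef.h>

/* Placeholder message types; only the stop marker's value (0) is from the draft. */
enum xt2_msg {
	XT2_STOP = 0,
	XT2_CHAIN_ENTRY,
	XT2_RULE_ENTRY,
	XT2_MATCH_ENTRY,
	XT2_ARB_DATA,	/* plain data, opens no nesting level */
};

static int xt2_opens_nest(enum xt2_msg type)
{
	return type == XT2_CHAIN_ENTRY || type == XT2_RULE_ENTRY ||
	       type == XT2_MATCH_ENTRY;
}

/*
 * Walk a flat stream and return the nesting depth at its end:
 * 0 means balanced, >0 means unterminated levels remain,
 * -1 means a STOP appeared without a matching start marker.
 */
static int xt2_nest_depth(const enum xt2_msg *stream, size_t n)
{
	int depth = 0;

	for (size_t i = 0; i < n; ++i) {
		if (stream[i] == XT2_STOP) {
			if (depth == 0)
				return -1;
			--depth;
		} else if (xt2_opens_nest(stream[i])) {
			++depth;
		}
	}
	return depth;
}
```

For “CHAIN RULE STOP RULE STOP STOP” the function returns 0; cutting
the stream short leaves a positive depth.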

1.3 Attribute limitations in nfnetlink

Netlink, being just a base protocol, does not specify what comes 
after the nlmsghdr, or how it is ordered. This is left up to the 
subprotocols based on Netlink. nfnetlink has two effective 
shortcomings (due to its parser) that shall be held in mind:

• Attribute ordering is ignored and lost

• No support for more than one attribute with the same type 
  within a message

struct nlattr **tb;
nla_for_each_attr(attr, head, ...)
        tb[nla_type(attr)] = attr;

This kills the idea of being able to do, for example, a table 
replace, in a single Netlink request message. This is like having 
to split an XML file at every tag simply because two tags can 
carry the same attribute. So Netlink requests have to be broken 
down into many many tiny parts and extra state has to be kept 
around in the kernel.

put_header(msg, NFXTM_TABLE_REPLACE);
foreach (rule)
        put(msg, rule);
send(sock, msg);

will become

put_header(msg, NFXTM_TABLE_REPLACE);
send(sock, msg);
foreach (rule) {
        clean(msg);
        put_header(msg, NFXTM_RULE_DATA);
        put(msg, rule);
        send(sock, msg);
}
clean(msg);
put_header(msg, NFXTM_COMMIT);
send(sock, msg);

or worse. In other words, the fact that the kernel side will use 
a temporary table (an implementation detail) will be exposed to 
userspace, which is bad too.
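The two parser shortcomings named in this section can be reproduced in
a few lines of userspace C. This is a simplified model, not the kernel
code: struct xt2_attr and xt2_parse are invented for the illustration.

```c
#include <stddef.h>

#define XT2_ATTR_MAX 4

/* Simplified stand-in for a Netlink attribute. */
struct xt2_attr {
	unsigned int type;	/* valid types in this sketch: 1..XT2_ATTR_MAX */
	int value;
};

/*
 * Index attributes by type, as the table-filling loop shown earlier does.
 * Returns the number of distinct types seen.
 */
static size_t xt2_parse(const struct xt2_attr *attrs, size_t n,
			const struct xt2_attr *tb[XT2_ATTR_MAX + 1])
{
	size_t filled = 0;

	for (size_t i = 0; i <= XT2_ATTR_MAX; ++i)
		tb[i] = NULL;
	for (size_t i = 0; i < n; ++i) {
		unsigned int t = attrs[i].type;

		if (t == 0 || t > XT2_ATTR_MAX)
			continue;	/* out of range for this sketch */
		if (tb[t] == NULL)
			++filled;
		tb[t] = &attrs[i];	/* a later attribute replaces an earlier one */
	}
	return filled;
}
```

Feeding it the sequence (type 2, type 1, type 2) leaves two table
slots filled: the first type-2 attribute is gone, and the table no
longer says which attribute came first.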

1.4 Summary of transform

Essentially there is a 1:1 transform on the XML-like tree shown 
above, to:

NFXTM_CHAIN_ENTRY<name=INPUT,usertid=1>
  NFXTM_RULE_ENTRY<idx=1,usertid=1>
    NFXTM_MATCH_ENTRY<acidx=1,name=hashlimit,rev=1,usertid=2>
      NFXTM_CONFIG_DATA
        NFXTM_ARB_DATA<whatever>
        NFXTM_ARB_DATA<more arbitrary data>
      NFXTM_STOP
      NFXTM_STATE_DATA
        NFXTM_ATTR_DATA<nlattrs>
        NFXTM_ATTR_DATA<more nlattrs>
      NFXTM_STOP
    NFXTM_STOP
    NFXTM_TARGET_ENTRY<acidx=2,name=TOS,rev=1,usertid=3>
      ...
    NFXTM_STOP
    NFXTM_VERDICT_ENTRY<acidx=3,name=ACCEPT,usertid=3>
    NFXTM_STOP
  NFXTM_STOP
NFXTM_STOP

1.5 Extra sequence numbers

Netlink also does not specify any message ordering, though it 
does provide an nlmsg_seq field with which message order can at 
least be determined. The problem is that nothing specifies what 
nlmsg_seq should be in reply messages. It is assumed that the 
sequence number is linked, i. e. that a reply's number should be 
the same as the request's number, to do message matching (vague 
hint by netlink(7) manpage).

Even if that were decidedly so, that brings along a problem. In 
NLM_F_MULTI-style dumps, all messages would have the same 
nlmsg_seq. To counter this, multi messages will have an 
NFXT-specific sequence counter (NFXTA_SEQNO) in addition, 
especially since ordering is so much more crucial in Xtables than 
it is in other parts of networking.

1.6 Improved granularity error reporting

Xtables extensions as of Linux 2.6.37 can only return system 
error codes back to userspace in case there is a problem. The 
most common occurrences are, for example, ENOMEM (“Memory 
allocation failure” / “Out of memory”), and the dreaded EINVAL (“
Invalid argument”). Best practices at the moment are to printk a 
string to the kernel log for further information detailing the 
circumstances about the cause of EINVAL. In the light of this 
overload of EINVAL, an improved error reporting scheme is sought. 
(Other networking subsystems also suffer from this problem.)

By suggestion of Jozsef Kadlecsik, the Xtables2 protocol reports 
three kinds of errors:

• General/standard (integer) error codes, where there is no point
  in specifying (or no way to specify) the nature of the error
  exactly. ENOMEM from the example above is such a case: it is
  needless to report which new data field could not be allocated.

• General Xtables2 error codes (largely replaces EINVAL sites) in 
  integer form, similar to errno. Use cases include:

  – chain for a requested operation does not exist

  – an extension is used from a hook it is not supposed to be

• Free-form string. Standalone, or in addition to the above.
  It is impossible to provision error numbers for extensions,
  especially those that are out-of-tree. The problems that arise
  from forcing a component to reuse another component's error
  code space can be seen in the overuse of EINVAL. We are aware
  that raw strings in kernel modules can hinder
  internationalization, but it is seen as the better choice over
  awkward error codes that convey nothing. It is also expected
  that strings do not change that often.

The three error types will be conveyed by three distinct 
attributes: NFXTA_ERRNO (generic error codes), NFXTA_XTERRNO (xt2 
error codes), and NFXTA_ERRSTR (free-form string).
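As a rough illustration of how a client might combine the three
attributes into one diagnostic, consider the sketch below. The struct,
its field names and the preference order (string first, then xt2 code,
then system errno) are assumptions made for this example; the draft
does not prescribe any of them.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical decoded error report; 0 or NULL means "attribute absent". */
struct xt2_error {
	int sys_errno;		/* from NFXTA_ERRNO */
	int xt_errno;		/* from NFXTA_XTERRNO */
	const char *errstr;	/* from NFXTA_ERRSTR */
};

/* Render the most specific information available into buf. */
static int xt2_format_error(const struct xt2_error *e, char *buf, size_t len)
{
	if (e->errstr != NULL && e->xt_errno != 0)
		return snprintf(buf, len, "xt error %d: %s",
				e->xt_errno, e->errstr);
	if (e->errstr != NULL)
		return snprintf(buf, len, "%s", e->errstr);
	if (e->xt_errno != 0)
		return snprintf(buf, len, "xt error %d", e->xt_errno);
	return snprintf(buf, len, "%s", strerror(e->sys_errno));
}
```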

  Error pointer

Once a table/chain splice request has been finalized, 
xt_check_{match,target} is run, which can return:

• chain name, rule index, match/target index, NFXTE_*/custom 
  string

  Line number

I noticed Jozsef has added a line number attribute in ipset
version 5 to facilitate locating errors for users. For its
apparent value, such an attribute is also specified for xtnetlink:

A request message can contain a “ping attribute”, NFXTA_USERTID,
which xtnetlink may keep track of and which may be reported back
verbatim in case an error occurred. It may be used to represent
the source line, or any other number.

• For the tree example in section 1, the ruleset file would be:

    -A INPUT \
    -m hashlimit ... \
    -j TOS ... -j ACCEPT

1.7 Multi-type responses

Using multi-type responses provides for a seemingly shorter reply 
(in at least one case) than not doing so:

• => NFXTM_CHAIN_DUMP<NFXTA_NAME>
  <= NFXTM_RULE_START<>
  <= NFXTM_EMATCH<NFXTA_NAME, NFXTA_REVISION, NFXTA_DATA>
  <= NFXTM_EMATCH<NFXTA_NAME, NFXTA_REVISION, NFXTA_DATA>
  <= NFXTM_ETARGET<NFXTA_NAME, NFXTA_REVISION, NFXTA_DATA>
  <= NFXTM_ETARGET<NFXTA_NAME, NFXTA_REVISION, NFXTA_DATA>
  <= NFXTM_RULE_END<>
  <= NFXTM_RULE_START<>
  <= NFXTM_ETARGET<NFXTA_VERDICT>
  <= NFXTM_RULE_END<>
  <= NLMSG_DONE

• => CHAIN_DUMP<NFXTA_NAME>
  <= CHAIN_DUMP<NFXTA_RULE_START>
  <= CHAIN_DUMP<NFXTA_MATCH_START>
  <= CHAIN_DUMP<NFXTA_NAME, NFXTA_REVISION, NFXTA_DATA>
  <= CHAIN_DUMP<NFXTA_MATCH_END>
  <= CHAIN_DUMP<NFXTA_MATCH_START>
  <= CHAIN_DUMP<NFXTA_NAME, NFXTA_REVISION, NFXTA_DATA>
  <= CHAIN_DUMP<NFXTA_MATCH_END>
  <= CHAIN_DUMP<NFXTA_TARGET_START>
  <= CHAIN_DUMP<NFXTA_NAME, NFXTA_REVISION, NFXTA_DATA>
  <= CHAIN_DUMP<NFXTA_TARGET_END>
  <= CHAIN_DUMP<NFXTA_TARGET_START>
  <= CHAIN_DUMP<NFXTA_NAME, NFXTA_REVISION, NFXTA_DATA>
  <= CHAIN_DUMP<NFXTA_TARGET_END>
  <= CHAIN_DUMP<NFXTA_RULE_END>
  <= CHAIN_DUMP<NFXTA_RULE_START>
  <= CHAIN_DUMP<NFXTA_TARGET_START>
  <= CHAIN_DUMP<NFXTA_VERDICT>
  <= CHAIN_DUMP<NFXTA_TARGET_END>
  <= CHAIN_DUMP<NFXTA_RULE_END>
  <= NLMSG_DONE

2 General use

2.1 Socket

Xtables2 is made available through an nfnetlink socket. 
Specifically, this is a Netlink socket of type NETLINK_NETFILTER, 
with which messages are exchanged that are tagged having Xtables 
as the subsystem.

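#include <sys/socket.h>
#include <linux/netlink.h>

struct nlmsghdr nlmsg;
/* NETLINK_NETFILTER selects nfnetlink */
int nf_socket = socket(AF_NETLINK, SOCK_RAW, NETLINK_NETFILTER);
/* nfnetlink keeps the subsystem in the upper 8 bits of nlmsg_type */
nlmsg.nlmsg_type = (NFNL_SUBSYS_XTABLES << 8) | xt_msg_type;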

2.2 Message format

All messages transmitted over the Netlink socket are to have the 
base struct nlmsghdr header, followed by a struct nfgenmsg header 
as mandated by nfnetlink. The .nfgen_family member is always set 
to NFPROTO_UNSPEC. The .version member denotes the format of the 
byte stream following nfgenmsg; this is currently version 0. The 
.res_id member is unused.
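A minimal sketch of composing the two headers in userspace follows.
The function name xt2_put_header is invented here, and the subsystem
number is passed in as a parameter because the draft does not assign a
value to NFNL_SUBSYS_XTABLES. The choice of NLM_F_REQUEST | NLM_F_ACK
is likewise an assumption.

```c
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/netfilter/nfnetlink.h>

/*
 * Write nlmsghdr + nfgenmsg into buf (which must be suitably aligned).
 * Returns the message length, or 0 if buf is too small.
 */
static size_t xt2_put_header(void *buf, size_t buflen, unsigned int subsys,
			     unsigned int xt_msg_type, unsigned int seq)
{
	size_t len = NLMSG_LENGTH(sizeof(struct nfgenmsg));
	struct nlmsghdr *nlh = buf;
	struct nfgenmsg *nfg;

	if (buflen < NLMSG_ALIGN(len))
		return 0;
	memset(buf, 0, NLMSG_ALIGN(len));
	nlh->nlmsg_len   = len;
	nlh->nlmsg_type  = (subsys << 8) | xt_msg_type;
	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
	nlh->nlmsg_seq   = seq;

	nfg = NLMSG_DATA(nlh);
	nfg->nfgen_family = 0;			/* NFPROTO_UNSPEC */
	nfg->version      = NFNETLINK_V0;	/* stream format version 0 */
	nfg->res_id       = 0;			/* unused */
	return len;
}
```

Attributes, if any, would be appended after the returned length, with
nlmsg_len grown accordingly.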

3 Attributes

The meaning of attributes depends upon the message and logical 
nesting level in which they appear. Their type however remains 
the same, such that a single Netlink attribute validation policy 
object (struct nla_policy) can be used for all message types.

A table of all known attributes:


+--------+-------------------+---------------+-----------------+--------------------------------------+
| Value  | Mnemonic          |    C type     | NLA type        | Notes                                |
+--------+-------------------+---------------+-----------------+--------------------------------------+
+--------+-------------------+---------------+-----------------+--------------------------------------+
|   1    | NFXTA_SEQNO       | unsigned int  | NLA_U32         | Section 1.5                          |
+--------+-------------------+---------------+-----------------+--------------------------------------+
|  tba   | NFXTA_ERRNO       |     int       | NLA_U32         | Generic system errno (Exxx)          |
+--------+-------------------+---------------+-----------------+--------------------------------------+
|  ...   | NFXTA_XTERRNO     |     int       | NLA_U32         | NFXT errno (NFXTE_*)                 |
+--------+-------------------+---------------+-----------------+--------------------------------------+
|        | NFXTA_ERRSTR      |   char []     | NLA_NUL_STRING  | Arbitrary                            |
+--------+-------------------+---------------+-----------------+--------------------------------------+
|        | NFXTA_USERTID     | unsigned int  | NLA_U32         | Arbitrary, retained verbatim         |
+--------+-------------------+---------------+-----------------+--------------------------------------+
|        | NFXTA_CHAIN_NAME  |   char []     | NLA_NUL_STRING  |                                      |
+--------+-------------------+---------------+-----------------+--------------------------------------+
|        | NFXTA_RULE_IDX    | unsigned int  | NLA_U32         |                                      |
+--------+-------------------+---------------+-----------------+--------------------------------------+
|        | NFXTA_ACTION_IDX  | unsigned int  | NLA_U32         |                                      |
+--------+-------------------+---------------+-----------------+--------------------------------------+
|        | NFXTA_NAME        |   char []     | NLA_NUL_STRING  |                                      |
+--------+-------------------+---------------+-----------------+--------------------------------------+
|        | NFXTA_REVISION    |   uint8_t     | NLA_U32         |                                      |
+--------+-------------------+---------------+-----------------+--------------------------------------+
|        | NFXTA_HOOKNUM     | unsigned int  | NLA_U32         |                                      |
+--------+-------------------+---------------+-----------------+--------------------------------------+
|        | NFXTA_PRIORITY    |     int       | NLA_U32         |                                      |
+--------+-------------------+---------------+-----------------+--------------------------------------+
|        | NFXTA_NFPROTO     |   uint8_t     | NLA_U32         |                                      |
+--------+-------------------+---------------+-----------------+--------------------------------------+
|        | NFXTA_OFFSET      | unsigned int  | NLA_U32         |                                      |
+--------+-------------------+---------------+-----------------+--------------------------------------+
|        | NFXTA_LENGTH      |    size_t     | NLA_U32         |                                      |
+--------+-------------------+---------------+-----------------+--------------------------------------+
|        | NFXTA_HOOKMASK    | unsigned int  | NLA_U32         |                                      |
+--------+-------------------+---------------+-----------------+--------------------------------------+
|        | NFXTA_SIZE        |    size_t     | NLA_U32         |                                      |
+--------+-------------------+---------------+-----------------+--------------------------------------+
|        | NFXTA_NEW_NAME    |   char []     | NLA_NUL_STRING  |                                      |
+--------+-------------------+---------------+-----------------+--------------------------------------+


The kernel ignores attributes of type 0 during validation, so
that value is left unused.

4 Error types


+--------+---------------------+-------------------------------------------+
| Value  | Mnemonic            | Description                               |
+--------+---------------------+-------------------------------------------+
+--------+---------------------+-------------------------------------------+
|   0    | NFXTE_SUCCESS       | No error                                  |
+--------+---------------------+-------------------------------------------+
|   1    | NFXTE_CHAIN_EXIST   | Chain already exists                      |
+--------+---------------------+-------------------------------------------+
|   2    | NFXTE_CHAIN_NOENT   | Chain does not exist                      |
+--------+---------------------+-------------------------------------------+
|   3    | NFXTE_RULESET_LOOP  | Ruleset contains a loop                   |
+--------+---------------------+-------------------------------------------+
|   4    | NFXTE_EXT_HOOKMASK  | Rule invoked from incompatible hook       |
+--------+---------------------+-------------------------------------------+
|        | NFXTE_PROMO_STATUS  | Promotion/demotion state already achieved |
+--------+---------------------+-------------------------------------------+


5 Message types


+------+-----------------------+----------------+---------------------------------------------+
| ID   | Mnemonic              |      Dir       | Notes                                       |
+------+-----------------------+----------------+---------------------------------------------+
+------+-----------------------+----------------+---------------------------------------------+
|  0   | NFXTM_STOP            |     both       | End of logical nesting level or transaction |
+------+-----------------------+----------------+---------------------------------------------+
|  1   | NFXTM_ERROR           |     k->u       | Kills transactions (but not dumps)          |
+------+-----------------------+----------------+---------------------------------------------+
|  2   | NFXTM_ABORT           |     u->k       | Abort transaction                           |
+------+-----------------------+----------------+---------------------------------------------+
| tba  | NFXTM_CHAIN_NEW       |     u->k       |                                             |
+------+-----------------------+----------------+---------------------------------------------+
| ...  | NFXTM_CHAIN_DEL       |     u->k       |                                             |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_CHAIN_MOVE      |     u->k       |                                             |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_CHAIN_PROMOTE   |     u->k       |                                             |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_CHAIN_DEMOTE    |     u->k       |                                             |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_TABLE_DUMP      |     u->k       | Dump start                                  |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_CHAIN_ENTRY     |     both       | Nest start                                  |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_RULE_ENTRY      |     both       | Nest start                                  |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_MATCH_ENTRY     |     both       | Nest start                                  |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_TARGET_ENTRY    |     both       | Nest start                                  |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_VERDICT_ENTRY   |     both       | Nest start                                  |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_JUMP_ENTRY      |     both       | Nest start                                  |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_GOTO_ENTRY      |     both       | Nest start                                  |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_CONFIG_DATA     |     both       | Nest start                                  |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_STATE_DATA      |     both       | Nest start                                  |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_ARB_DATA        |     both       | Arbitrary data                              |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_ATTR_DATA       |     both       | Attribute list                              |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_CHAIN_SPLICE    |     u->k       | Transaction start, nest start               |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_TABLE_REPLACE   |     u->k       | Transaction start, nest start               |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_IDENTIFY        |     both       | Dump start                                  |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_IDMATCH_ENTRY   |     k->u       |                                             |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_IDTARGET_ENTRY  |     k->u       |                                             |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_EVENT           |     k->u       |                                             |
+------+-----------------------+----------------+---------------------------------------------+


5.1 End of nest level / transaction commit

NFXTM_STOP is used to end a nesting level as started by, for 
example, NFXTM_RULE_ENTRY.

It is also used to finish (commit) a transaction, such as with 
NFXTM_TABLE_REPLACE.

Request NFXTM_STOP:

• No attributes.

Response:

• Standard Netlink ACK.

5.2 Error report

xtnetlink uses NFXTM_ERROR to report back detailed errors on 
actions.

Possible attributes:

• NFXTA_ERRNO: generic error, using system-level errno codes 
  (ENOMEM, etc.)

• NFXTA_XTERRNO: xtnetlink error, see section 4 (Error types)

• NFXTA_ERRSTR: free-form error string provided by extensions

• NFXTA_USERTID: user token received earlier is echoed back for 
  reference (may be used for things like line numbers)

• NFXTA_CHAIN_NAME: name of chain whose processing caused the 
  error

• NFXTA_RULE_IDX: index to rule (0-based) that caused the error

• NFXTA_ACTION_IDX: index to match/target/verdict (0-based) in 
  the particular rule that caused the error

(RFC:) Should outstanding transaction be terminated?

When NFXTM_ERROR is sent in an NLM_F_MULTI dump stream, an 
NLMSG_DONE message will still follow.

5.3 Transaction termination

NFXTM_ABORT can be used to abort a transaction as started by, for 
example, NFXTM_TABLE_REPLACE.

Request NFXTM_ABORT:

• No attributes.

Response:

• Standard Netlink ACK.

5.4 Chain creation

Request NFXTM_CHAIN_NEW:

• Attribute NFXTA_NAME: name of the new chain.

Response:

• Standard Netlink ACK, or NFXTM_ERROR:

  – ENOMEM: Out of memory

  – NFXTE_CHAIN_EXIST: Chain already exists

5.5 Chain deletion

Request NFXTM_CHAIN_DEL with attributes:

• NFXTA_NAME attribute carrying the name of the chain to delete

Response:

• Standard Netlink ACK, or NFXTM_ERROR:

  – NFXTE_CHAIN_NOENT: Chain does not exist.

Notes:

The chain is automatically demoted.

5.6 Chain renaming

Request:

• Type: NFXTM_CHAIN_MOVE

• Attributes: NFXTA_NAME (old name), NFXTA_NEW_NAME (new chain 
  name).

Response:

• Standard Netlink ACK, or NFXTM_ERROR:

  – NFXTE_CHAIN_NOENT: Source chain does not exist

  – NFXTE_CHAIN_EXIST: Target chain already exists

5.7 Promotion to base chain

Sets the specified chain up as an entrypoint from the Netfilter 
proper. (It does this by creating an appropriate nf_hook.)

Request:

• Type: NFXTM_CHAIN_PROMOTE

• Attributes: NFXTA_NAME, NFXTA_HOOKNUM (NF_INET_*/NF_ARP_*/
  NF_BR_*), NFXTA_PRIORITY, NFXTA_NFPROTO (one of the NFPROTO_* 
  constants)

Response:

• Standard Netlink ACK, or NFXTM_ERROR:

  – NFXTE_CHAIN_NOENT: The specified chain does not exist.

  – NFXTE_PROMO_STATUS: Already promoted.

  – NFXTE_RULESET_LOOP: There is a loop in the rule tree, which 
    is not allowed.

  – NFXTE_EXT_HOOKMASK: One or more extensions are used from a 
    hook that they do not support being invoked from.

Example:

• Turn the chain named “filter/ipv6/INPUT” into the equivalent of 
  the classic INPUT hook in the filter table: NFXTA_NAME=“
  filter/ipv6/INPUT”, NFXTA_HOOKNUM=NF_INET_LOCAL_IN (1), 
  NFXTA_PRIORITY=0, NFXTA_NFPROTO=NFPROTO_IPV6 (10).
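The attributes of such a request could be laid out with a small
helper like the one below. This is a sketch only: xt2_put_attr is an
invented name, and since the draft lists most attribute numbers as
"tba", any type values passed to it are placeholders.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/*
 * Append one attribute (header + payload + padding) at offset *off.
 * Returns 0 on success, -1 if it does not fit or is too long.
 */
static int xt2_put_attr(unsigned char *buf, size_t buflen, size_t *off,
			uint16_t type, const void *data, uint16_t dlen)
{
	struct nlattr nla;
	size_t space;

	if (dlen > UINT16_MAX - NLA_HDRLEN)
		return -1;	/* would overflow the 16-bit length field */
	space = NLA_ALIGN(NLA_HDRLEN + dlen);
	if (*off > buflen || buflen - *off < space)
		return -1;

	nla.nla_len  = NLA_HDRLEN + dlen;	/* excludes trailing padding */
	nla.nla_type = type;
	memset(buf + *off, 0, space);
	memcpy(buf + *off, &nla, sizeof(nla));
	memcpy(buf + *off + NLA_HDRLEN, data, dlen);
	*off += space;
	return 0;
}
```

A string such as "INPUT" (6 bytes with its NUL terminator) occupies 12
bytes in the stream; a 32-bit value such as the hook number occupies 8.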

5.8 Demotion from base chain

Removes the nf_hook.

Request:

• Type: NFXTM_CHAIN_DEMOTE

• Attributes: NFXTA_NAME

Response:

• Standard Netlink ACK, or NFXTM_ERROR:

  – NFXTE_CHAIN_NOENT: The specified chain does not exist.

  – NFXTE_PROMO_STATUS: Already demoted.

5.9 Implementation Identification (debug)

First and foremost a debug command, and to get something 
(table/chain-independent) that users can glare at (they love 
doing that).

Request:

• nlmsg_type = NFXTM_IDENTIFY;

Multiple message response:

• An NFXTM_IDENTIFY message containing:

  – An NFXTA_NAME attribute giving the name of the 
    implementation/patchset.

• Zero or more NFXTM_IDMATCH_ENTRY messages, giving
  metainformation about the loaded match extensions. Each message
  contains the following attributes:

  – An NFXTA_NAME attribute for the name of the extension.

  – An NFXTA_REVISION attribute to denote the version of the 
    extension's parameter protocol.

  – An NFXTA_SIZE attribute for the size of its per-instance data 
    block.

  – An NFXTA_HOOKMASK attribute for the bitmap of hooks the 
    extensions may be used from.

• Zero or more NFXTM_IDTARGET_ENTRY messages, giving 
  metainformation about the loaded target extensions:

  – attributes like NFXTM_IDMATCH_ENTRY.

• NLMSG_DONE message.

5.10 Rule dump

Atomic dump of entire table/ruleset, or a single chain, with or 
without rules.

Request:

• nlmsg_type = NFXTM_TABLE_DUMP;

• NFXTA_NAME attribute specifying the name of the chain to dump. 
  Absence of attribute dumps entire table.

• NFXTA_RULE_IDX attribute specifying the particular rule 
  (1-based index) to dump. Absence of attribute dumps entire 
  chain. Use 0 to only get a chain list.

Multi Response:

• Zero or more chains, represented by the start marker message 
  NFXTM_CHAIN_ENTRY and the end marker NFXTM_STOP. The 
  NFXTM_CHAIN_ENTRY message may have NFXTA_HOOKNUM, 
  NFXTA_PRIORITY and NFXTA_NFPROTO attributes if it is a base 
  chain.

• Zero or more rules within NFXTM_CHAIN_ENTRY .. NFXTM_STOP, 
  represented by the start marker message NFXTM_RULE_ENTRY and 
  the end marker NFXTM_STOP.

• Zero or more actions within NFXTM_RULE_ENTRY .. NFXTM_STOP, 
  represented by the start marker message NFXTM_MATCH_ENTRY, 
  NFXTM_TARGET_ENTRY, NFXTM_VERDICT_ENTRY, NFXTM_JUMP_ENTRY or 
  NFXTM_GOTO_ENTRY and the end marker NFXTM_STOP.

• Zero or more config data messages within NFXTM_MATCH_ENTRY or 
  NFXTM_TARGET_ENTRY.

• Zero or more state data messages within NFXTM_MATCH_ENTRY or 
  NFXTM_TARGET_ENTRY.

(See section 1.4, Summary of transform, for an example.)

Errors:

• If an error occurs during dump, an NFXTM_ERROR message is
  emitted into the stream and the dump will then immediately
  terminate with a standard NLMSG_DONE message. No NFXTM_STOP
  messages will be emitted if the dump stopped in the middle of
  a nesting level.

5.11 Table replace

Atomic replacement of an entire table/ruleset.

1. User sends NFXTM_TABLE_REPLACE request. The state is 
  remembered per client socket.

2. Within this transaction, the following commands operate on a
  temporary table: NFXTM_CHAIN_NEW, NFXTM_CHAIN_DEL,
  NFXTM_CHAIN_MOVE, NFXTM_CHAIN_SPLICE.

3. End transaction with NFXTM_STOP, or abort with NFXTM_ABORT.

5.12 Chain splicing (add/delete rules)

Chain splicing does a bulk deletion of zero or more consecutive 
rules, followed by a bulk insertion of zero or more consecutive 
rules, all done in an atomic fashion. It operates similarly to
Perl's splice function on arrays.
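The array semantics can be sketched in C as follows. This is only a
model of the splice operation on a plain int array standing in for the
rule list; xt2_splice is an invented name, and the sketch says nothing
about locking or atomicity, which the kernel side has to provide.

```c
#include <stddef.h>
#include <string.h>

/*
 * Remove `del` entries starting at `offset` from rules[0..*count) and
 * insert ins[0..ins_n) in their place. `cap` is the array capacity.
 * Returns 0 on success, -1 if the range is invalid or the result
 * would not fit.
 */
static int xt2_splice(int *rules, size_t *count, size_t cap,
		      size_t offset, size_t del,
		      const int *ins, size_t ins_n)
{
	size_t tail;

	if (offset > *count || del > *count - offset)
		return -1;
	if (*count - del > cap || ins_n > cap - (*count - del))
		return -1;

	tail = *count - offset - del;
	/* Move the surviving tail to its final position, then fill the gap. */
	memmove(rules + offset + ins_n, rules + offset + del,
		tail * sizeof(*rules));
	if (ins_n > 0)
		memcpy(rules + offset, ins, ins_n * sizeof(*rules));
	*count = *count - del + ins_n;
	return 0;
}
```

With offset and length both zero this is a plain insertion at the head
of the chain; with no new rules it is a plain bulk deletion.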

The user starts a transaction with a NFXTM_CHAIN_SPLICE request, 
supplying the name of the chain that is to be modified in a 
NFXTA_NAME attribute. xtnetlink will take the read lock on the 
table to prevent a table replace operation from interfering, and 
will take the write lock on the chain.

1. While in this context, higher-level transactions like 
  NFXTM_TABLE_REPLACE, are rejected.

2. Send new rules (ordered list).

3. End transaction with NFXTM_STOP, or abort transaction entirely 
  with NFXTM_ABORT.

New rules:

1. Send NFXTM_RULE_NEW. Must occur within the context of 
  chain_splice.

2. NFXTM_STOP. This ends the current rule.

Request:

• NFXTA_NAME: Name of the chain to modify.

• NFXTA_OFFSET: Index of entry where operation should start.

• NFXTA_LENGTH: Number of entries starting from offset that 
  should be removed. May be zero or more.

• Zero or more rules.

Response:

• Standard ACK.

• or detailed error code.


* Re: Xtables2 A7 spec draft
  2011-02-02 22:04 Xtables2 A7 spec draft Jan Engelhardt
@ 2011-02-05 19:33 ` Jozsef Kadlecsik
  2011-02-05 21:38   ` Jan Engelhardt
  2011-02-07 20:50 ` James Nurmi
  1 sibling, 1 reply; 6+ messages in thread
From: Jozsef Kadlecsik @ 2011-02-05 19:33 UTC
  To: Jan Engelhardt; +Cc: Netfilter Developer Mailing List

Hi Jan,

On Wed, 2 Feb 2011, Jan Engelhardt wrote:

> I am posting the Xtables2 Netlink interface specification, draft 7
> for comments.
> 
> Additionally, further documentation and toolchain around
> it is available through the project page at
> 
> 	http://jengelh.medozas.de/projects/xtables/
> 
>  * User Documentation Chapter 1: Architectural Differences
>  * Developer Documentation Part 1: Netlink interface (WIP)
>    This is copied below to facilitate inline replies
> --8<--
> 
> Netlink interface
> 
> 1 Concepts
> 
> This section is non-normative and should instead show the flow of 
> thought and give reasons as to why the specification was 
> conceived the way it is, and where the component problems are.
> 
> 1.1 Nesting representation
> 
> The common element in Xtables is the ruleset, represented as a 
> tree structure with ordering constraints at some levels:
> 
> ruleset (unordered tables)
>  \__ table (unordered chains)
>  |    \__ chain (ordered rules)
>  |    |    \__ rule (ordered actions)
>  |    |    |    \__ match (unordered data)
>  |    |    |    |    \__ config-data
>  |    |    |    |    |    \__ bin params
>  |    |    |    |    \__ state-data
>  |    |    |    |         \__ nlattrs
>  |    |    |    \__ match...
>  |    |    |    \__ target (unordered data)
>  |    |    |    |    \__ config-data
>  |    |    |    \__ target...
>  |    |    |    \__ verdict...
>  |    |    \__ rule...
>  |    \__ chain...
>  \__ table...

I believe the objects 'match', 'target', 'verdict' should be generalized 
and unified into a single entity named 'action' (or named whatever). It 
should have an attribute (better a flag attribute with a flag value) to 
denote that the given action is a terminating one (terminating 
target/verdict), so that the parser could check and warn/reject 
unreachable actions. That way the protocol would be both simpler and more 
powerful at the same time. And we could express rules like

... -m whatever -j LOG -m more-specific -j DO-SOMETHING ...
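
The reachability check suggested here could look roughly like the following. It is a sketch only: the flag name ACTION_F_TERMINATING and struct action are hypothetical and appear in neither the draft nor the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag; the draft does not define it. */
#define ACTION_F_TERMINATING 0x1u

struct action {
	const char *name;
	unsigned int flags;
};

/*
 * Return the index of the first action that can never run because an
 * earlier action terminates rule processing, or -1 if all are reachable.
 */
static int first_unreachable(const struct action *acts, size_t n)
{
	size_t i;

	assert(acts != NULL || n == 0);
	for (i = 0; i + 1 < n; ++i)
		if (acts[i].flags & ACTION_F_TERMINATING)
			return (int)(i + 1);
	return -1;
}
```

A parser could run such a check once per rule and either warn about or reject the rule when the result is not -1.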

I don't like the idea of passing binary parameters at any level: 
everything should be expressed in nlattrs.
 
> A more concrete example, here is a small ruleset, encoded into 
> XML (just one of many possible representations):
> 
> <table>
>   <chain name="INPUT">
>     <rule idx="1">
>       <match acidx="1" name="hashlimit" rev="1" csize="120">
>         <config-data>...</config-data>
>         <state-data>...</state-data>
>       </match>
>       <target acidx="2" name="TOS" rev="1">
>         ...
>       </target>
>       <verdict acidx="3" name="ACCEPT" />
>     </rule>
>   </chain>
> </table>
> 
> There are different ways to encode such a tree structure into a 
> serialized stream. In many Netlink protocols, child attributes 
> are encapsulated (a. k. a. "nested", though we will avoid this 
> term to avoid double-use) and treated as a whole as a parent's 
> opaque data. It cannot be told apart from normal data. (Like 
> writing "<chain> &lt;rule&gt; ... &lt;/rule&gt; </chain>" in 
> XML.) We will call this format "Encapsulated Encoding".
> 
> To encode an attribute's length, struct nlattr only has a 16-bit 
> field, which means the attribute header plus payload is limited 
> to 64 KB. This is easily exceeded with the encapsulated 
> encoding, as a chain encapsulates all of its rules, for example. 
> The problem is aggravated by the kernel's Netlink handler only 
> allocating sk_buffs a page size worth, which leaves little room for 
> extension data. In the worst case, the usable payload for 
> attributes is around 3600 bytes only. In light of xt_u32's 
> private data block being 1984 bytes already, that means that you 
> won't be able to fit two -m u32 invocations nested in a single 
> rule into a dump.

The pagesize limit is a real problem. :-(( I don't see how we could avoid 
having to split a single rule into multiple messages when it simply does 
not fit into a single one.
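
The size argument from the quoted paragraph can be checked with a few lines of arithmetic. The sketch below assumes the usual Netlink attribute layout (a 4-byte header, total size padded to a multiple of 4) and takes the 3600-byte budget and the 1984-byte xt_u32 block from the draft; the macro and function names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

#define ATTR_HDRLEN 4u                              /* sizeof(struct nlattr) */
#define ATTR_ALIGN(len) (((len) + 3) & ~(size_t)3)  /* pad to 4 bytes */

/* Space one attribute occupies in a message: header, payload, padding. */
static size_t attr_total_size(size_t payload)
{
	return ATTR_ALIGN(ATTR_HDRLEN + payload);
}

/* Does a set of `count` equally sized attributes fit into `budget` bytes? */
static int attrs_fit(size_t payload, size_t count, size_t budget)
{
	assert(payload <= 65535u - ATTR_HDRLEN); /* nla_len is 16 bits wide */
	return attr_total_size(payload) * count <= budget;
}
```

One 1984-byte blob needs 1988 bytes including its header, so two of them need 3976 bytes and exceed the roughly 3600 usable bytes, before any rule or match headers are even counted.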

> Certain voices in the community call for the obsoletion of such 
> data blobs and replace them by Netlink attributes; there are no 
> objections to doing so. However, the problem of size-limited 
> sk_buffs applies to opaque data of any kind, and Netlink 
> attributes fall within that.

I'm among the ones who object to data blobs.
 
> The Xtables2 Netlink protocol encodes each node of information as 
> a standalone attribute, to be called Flat Encoding, that is 
> appended (a. k. a. "chained") to the data stream. By avoiding 
> encapsulated attributes, it is possible to split messages at much 
> finer levels, and provides for attributes that happen to use 
> opaque data with a maximally-sized buffer.

Even with encapsulation, the messages can be split at any level.
 
> 1.2 Nest markers
> 
> Since Netlink messages do have a 32-bit quantity to store the 
> message length, rulesets of roughly up to 4 GB are possible, 
> which is currently regarded as sufficient. The largest (while 
> still being meaningful) rulesets seen to date in the industry 
> weighed in at approximately 150 MB.
> 
> Whereas encapsulated attribute encoding automatically provided 
> for boundaries, this is realized using dummy attributes in the 
> chained approach. The start of a nesting level can be implicitly 
> represented by the presence of the attribute that would have 
> otherwise been used for encapsulated nesting. For declaring an 
> end of a nest level, an extra attribute is needed:
> 
> • "chain { rule; rule; ... }" <=> CHAIN RULE RULE ... 
>   STOP

With encapsulation, there would be no need for such an extra STOP attribute - 
except that we may have to split the encapsulated attributes into multiple 
messages and thus the STOP attribute/marker is needed.
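
How a receiver might recover nesting from a flat stream with STOP markers can be sketched as follows. The token kinds are hypothetical stand-ins for the real attribute or message types; the only thing taken from the draft is that a container start is implicit in its attribute and its end is an explicit STOP.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token kinds standing in for attribute/message types. */
enum tok { TOK_OPEN, TOK_LEAF, TOK_STOP };

/*
 * Walk a flat token stream and recover the nesting that encapsulation
 * would have made explicit. Returns the maximum depth seen, or -1 if a
 * STOP has no opener or a container is left unclosed.
 */
static int max_nest_depth(const enum tok *stream, size_t n)
{
	int depth = 0, max = 0;
	size_t i;

	assert(stream != NULL || n == 0);
	for (i = 0; i < n; ++i) {
		if (stream[i] == TOK_OPEN) {
			if (++depth > max)
				max = depth;
		} else if (stream[i] == TOK_STOP) {
			if (--depth < 0)
				return -1;
		}
	}
	return depth == 0 ? max : -1;
}
```

Because the depth counter is the only state, the stream can be cut between any two tokens and continued in the next message, which is the property the flat encoding is after.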
 
> 1.3 Attribute limitations in nfnetlink
> 
> Netlink, being just a base protocol, does not specify what comes 
> after the nlmsghdr, or how it is ordered. This is left up to the 
> subprotocols based on Netlink. nfnetlink has two effective 
> shortcomings (due to its parser) that shall be held in mind:
> 
> • Attribute ordering is ignored and lost

Even if netlink does not state that attribute ordering is kept, it does 
not state either that attributes may be reordered. Netlink as a transport 
protocol does not care about the attributes. So we can say that for 
xtables2, the attribute order in the netlink messages is fixed, period.
 
> • No support for more than one attribute with the same type 
>   within a message

Oh no, you can put as many attributes with the same type as you like (and 
fit) into a single nested attribute!

> struct nlattr **tb;
> nla_for_each_attr(attr, head, ...)
>         tb[nla_type(attr)] = attr;
> 
> This kills the idea of being able to do, for example, a table 
> replace, in a single Netlink request message. This is like having 
> to split an XML file at every tag simply because two tags can 
> carry the same attribute. So Netlink requests have to be broken 
> down into many many tiny parts and extra state has to be kept 
> around in the kernel.
> 
> put_header(msg, NFXTM_TABLE_REPLACE);
> foreach (rule)
>         put(msg, rule);
> send(sock, msg);

And so the simple processing above can be applied.
 
> will become
> 
> put_header(msg, NFXTM_TABLE_REPLACE);
> send(sock, msg);
> foreach (rule) {
>         clean(msg);
>         put_header(msg, NFXTM_RULE_DATA);
>         put(msg, rule);
>         send(sock, msg);
> }
> clean(msg);
> put_header(msg, NFXTM_COMMIT);
> send(sock, msg);
> 
> or worse. In other words, the fact that the kernel side will use 
> a temporary table (an implementation detail) will be exposed to 
> userspace, which is bad too.
 
> 1.4 Summary of transform
> 
> Essentially there is a 1:1 transform on the XML-like tree shown 
> above, to:
> 
> NFXTM_CHAIN_ENTRY<name=INPUT,usertid=1>
>   NFXTM_RULE_ENTRY<idx=1,usertid=1>
>     NFXTM_MATCH_ENTRY<acidx=1,name=hashlimit,rev=1,usertid=2>
>       NFXTM_CONFIG_DATA
>         NFXTM_ARB_DATA<whatever>
>         NFXTM_ARB_DATA<more arbitrary data>
>       NFXTM_STOP
>       NFXTM_STATE_DATA
>         NFXTM_ATTR_DATA<nlattrs>
>         NFXTM_ATTR_DATA<more nlattrs>
>       NFXTM_STOP
>     NFXTM_STOP
>     NFXTM_TARGET_ENTRY<acidx=2,name=TOS,rev=0,usertid=3>
>       ...
>     NFXTM_STOP
>     NFXTM_VERDICT_ENTRY<acidx=3,name=ACCEPT,usertid=3>
>     NFXTM_STOP
>   NFXTM_STOP
> NFXTM_STOP
> 
> 1.5 Extra sequence numbers
> 
> Netlink also does not specify any message ordering, though it 
> does provide an nlmsg_seq field with which message order can at 
> least be determined. The problem is that nothing specifies what 
> nlmsg_seq should be in reply messages. It is assumed that the 
> sequence number is linked, i. e. that a reply's number should be 
> the same as the request's number, to do message matching (vague 
> hint by netlink(7) manpage).

Nothing specifies what nlmsg_seq should be, so it's up to the 
application, i.e. xtables2, how it's used...
 
> Even if that were decidedly so, that brings along a problem. In 
> NLM_F_MULTI-style dumps, all messages would have the same 
> nlmsg_seq. To counter this, multi messages will have an 
> NFXT-specific sequence counter (NFXTA_SEQNO) in addition, 
> especially since ordering is so much more crucial in Xtables than 
> it is in other parts of networking.

...but yes, for dumping an additional attribute is required to make sure 
the ordering is kept. Actually, two attributes: one at the rule level, and 
one at the "action" level in the given rule.
 
> 1.6 Improved granularity error reporting
> 
> Xtables extensions as of Linux 2.6.37 can only return system 
> error codes back to userspace in case there is a problem. The 
> most common occurrences are, for example, ENOMEM ("Memory 
> allocation failure" / "Out of memory"), and the dreaded EINVAL 
> ("Invalid argument"). Best practices at the moment are to printk a 
> string to the kernel log for further information detailing the 
> circumstances about the cause of EINVAL. In the light of this 
> overload of EINVAL, an improved error reporting scheme is sought. 
> (Other networking subsystems also suffer from this problem.)
> 
> By suggestion of Jozsef Kadlecsik, the Xtables2 protocol reports 
> three kinds of errors:
> 
> • General/standard (integer) error codes, where there is no point 
>   in specifying (or no way to specify) the nature of the error 
>   exactly. Like in the example, ENOMEM: it is needless to report 
>   which new data field could not be allocated.
> 
> • General Xtables2 error codes (largely replaces EINVAL sites) in 
>   integer form, similar to errno. Use cases include:
> 
>   • chain for a requested operation does not exist
> 
>   • an extension is used from a hook it is not supposed to be 
>     used from
> 
> • Free-form string. Standalone, or in addition to the above.
>   It is impossible to provision error numbers for extensions, 
>   especially those that are out-of-tree. The problems that arise 
>   from forcing a component to reuse another component's error code 
>   space can be seen in the overuse of EINVAL. We are aware that 
>   raw strings in kernel modules can hinder internationalization, 
>   but it is seen as the better choice over awkward error codes 
>   that convey nothing. It is also expected that strings do not 
>   change that often.
> 
> The three error types will be conveyed by three distinct 
> attributes: NFXTA_ERRNO (generic error codes), NFXTA_XTERRNO (xt2 
> error codes), and NFXTA_ERRSTR (free-form string).

Let me hammer on this issue further :-). With properly separated error number 
domains, the three types can be expressed in a single error attribute. Just 
a second attribute is required to carry the identifier of the action in 
the rule to which the third type of error code belongs.

I'm still not convinced about the usefulness of the error string. The 
kernel part is always paired with the userspace part. The developer 
knows exactly which kinds of errors can be sent back to userspace and 
can thus provide the textual decoding. As netlink sends back the original 
message in the error message, the userspace can fully decode every 
attribute (since it itself encoded it) too.

If a decoding for an error code is not provided, that's a bug and thus must 
be fixed.

>   Error pointer
> 
> Once a table/chain splice request has been finalized, 
> xt_check_{match,target} is run, which can return:
> 
> • chain name, rule index, match/target index, NFXTE_*/custom 
>   string
> 
>   Line number
> 
> I noticed Jozsef has added a line number attribute in ipset 
> version 5 to facilitate locating errors for users. For its 
> apparent value, such an attribute is also specified for xtnetlink:
> 
> A request message can contain a "ping attribute", NFXTA_USERTID, 
> which xtnetlink may keep track of and which may be reported back 
> verbatim in case an error occurred. It may be used to represent 
> the source line, or any other number.

The line number is a very good identifier for a rule.

> • For the tree example in section 1, the ruleset file would be 
>   "-A INPUT \
>   -m hashlimit ... \
>   -j TOS ... -j ACCEPT".
> 
> 1.7 Multi-type responses
> 
> Using multi-type responses provides for a seemingly shorter reply 
> (in at least one case) than not doing so:
> 
> • => NFXTM_CHAIN_DUMP<NFXTA_NAME>
>   <= NFXTM_RULE_START<>
>   <= NFXTM_EMATCH<NFXTA_NAME, NFXTA_REVISION, NFXTA_DATA>
>   <= NFXTM_EMATCH<NFXTA_NAME, NFXTA_REVISION, NFXTA_DATA>
>   <= NFXTM_ETARGET<NFXTA_NAME, NFXTA_REVISION, NFXTA_DATA>
>   <= NFXTM_ETARGET<NFXTA_NAME, NFXTA_REVISION, NFXTA_DATA>
>   <= NFXTM_RULE_END<>
>   <= NFXTM_RULE_START<>
>   <= NFXTM_ETARGET<NFXTA_VERDICT>
>   <= NFXTM_RULE_END<>
>   <= NLMSG_DONE
> 
> • => CHAIN_DUMP<NFXTA_NAME>
>   <= CHAIN_DUMP<NFXTA_RULE_START>
>   <= CHAIN_DUMP<NFXTA_MATCH_START>
>   <= CHAIN_DUMP<NFXTA_NAME, NFXTA_REVISION, NFXTA_DATA>
>   <= CHAIN_DUMP<NFXTA_MATCH_END>
>   <= CHAIN_DUMP<NFXTA_MATCH_START>
>   <= CHAIN_DUMP<NFXTA_NAME, NFXTA_REVISION, NFXTA_DATA>
>   <= CHAIN_DUMP<NFXTA_MATCH_END>
>   <= CHAIN_DUMP<NFXTA_TARGET_START>
>   <= CHAIN_DUMP<NFXTA_NAME, NFXTA_REVISION, NFXTA_DATA>
>   <= CHAIN_DUMP<NFXTA_TARGET_END>
>   <= CHAIN_DUMP<NFXTA_TARGET_START>
>   <= CHAIN_DUMP<NFXTA_NAME, NFXTA_REVISION, NFXTA_DATA>
>   <= CHAIN_DUMP<NFXTA_TARGET_END>
>   <= CHAIN_DUMP<NFXTA_RULE_END>
>   <= CHAIN_DUMP<NFXTA_RULE_START>
>   <= CHAIN_DUMP<NFXTA_TARGET_START>
>   <= CHAIN_DUMP<NFXTA_VERDICT>
>   <= CHAIN_DUMP<NFXTA_TARGET_END>
>   <= CHAIN_DUMP<NFXTA_RULE_END>
>   <= NLMSG_DONE
> 
> 2 General use
> 
> 2.1 Socket
> 
> Xtables2 is made available through an nfnetlink socket. 
> Specifically, this is a Netlink socket of type NETLINK_NETFILTER, 
> with which messages are exchanged that are tagged having Xtables 
> as the subsystem.
> 
> #include <sys/socket.h>
> #include <linux/netlink.h>
> 
> struct nlmsghdr nlmsg;
> int nf_socket = socket(AF_NETLINK, SOCK_RAW, NETLINK_NETFILTER);
> nlmsg.nlmsg_type = (NFNL_SUBSYS_XTABLES << 8) | xt_msg_type;
> 
> 2.2 Message format
> 
> All messages transmitted over the Netlink socket are to have the 
> base struct nlmsghdr header, followed by a struct nfgenmsg header 
> as mandated by nfnetlink. The .nfgen_family member is always set 
> to NFPROTO_UNSPEC. The .version member denotes the format of the 
> byte stream following nfgenmsg; this is currently version 0. The 
> .res_id member is unused.
> 
> 3 Attributes
> 
> The meaning of attributes depends upon the message and logical 
> nesting level in which they appear. Their type however remains 
> the same, such that a single Netlink attribute validation policy 
> object (struct nla_policy) can be used for all message types.
> 
> A table of all known attributes:
[...]

Maybe it was just not worded explicitly in the specification, but all 
attribute types which are affected should be sent in network order.
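
For a 32-bit integer attribute, sending the payload in network order amounts to something like the sketch below. The helper names put_be32 and get_be32 are made up for this example; htonl() and ntohl() are the standard conversions from <arpa/inet.h>.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store a 32-bit attribute payload in network (big-endian) byte order. */
static void put_be32(unsigned char *payload, uint32_t host_value)
{
	uint32_t wire = htonl(host_value);

	assert(payload != NULL);
	memcpy(payload, &wire, sizeof(wire));
}

/* Read it back, converting to host order. */
static uint32_t get_be32(const unsigned char *payload)
{
	uint32_t wire;

	assert(payload != NULL);
	memcpy(&wire, payload, sizeof(wire));
	return ntohl(wire);
}
```

Using memcpy rather than a pointer cast avoids alignment problems when the payload does not start on a 4-byte boundary.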

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@mail.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

* Re: Xtables2 A7 spec draft
  2011-02-05 19:33 ` Jozsef Kadlecsik
@ 2011-02-05 21:38   ` Jan Engelhardt
  2011-02-06 11:43     ` Jozsef Kadlecsik
  0 siblings, 1 reply; 6+ messages in thread
From: Jan Engelhardt @ 2011-02-05 21:38 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: Netfilter Developer Mailing List


On Saturday 2011-02-05 20:33, Jozsef Kadlecsik wrote:
>> 1.1 Nesting representation
>> 
>> The common element in Xtables is the ruleset, represented as a 
>> tree structure with ordering constraints at some levels:
>> 
>> ruleset (unordered tables)
>>  \__ table (unordered chains)
>>  |    \__ chain (ordered rules)
>>  |    |    \__ rule (ordered actions)
>>  |    |    |    \__ match (unordered data)
>>  |    |    |    |    \__ config-data
>>  |    |    |    |    |    \__ bin params
>>  |    |    |    |    \__ state-data
>>  |    |    |    |         \__ nlattrs
>>  |    |    |    \__ match...
>>  |    |    |    \__ target (unordered data)
>>  |    |    |    |    \__ config-data
>>  |    |    |    \__ target...
>>  |    |    |    \__ verdict...
>>  |    |    \__ rule...
>>  |    \__ chain...
>>  \__ table...
>
>I believe the objects 'match', 'target', 'verdict' should be generalized 
>and unified into a single entity named 'action' (or named whatever).

That is why there are already traces of the word 'action'
in the kernel (e.g. 'struct xt_action_param').

>I don't like the idea of passing binary parameters at any level: 
>everything should be expressed in nlattrs.
>
>> Certain voices in the community call for the obsoletion of such 
>> data blobs and replace them by Netlink attributes; there are no 
>> objections to doing so. However, the problem of size-limited 
>> sk_buffs applies to opaque data of any kind, and Netlink 
>> attributes fall within that.
>
>I'm among the ones who object to data blobs.

I completely agree with many of your points, but my strategy shall be
clear: the first working code dump is _only_ supposed to 1. break up
the table blob, 2. do NL transport with the preexisting per-extension
blobs.

I would hate having to come up with a "perfect" solution from the
start. That won't work, for the following reasons: 1. it makes the
task look bigger, 2. it makes the stream of work seem never-ending, both
of which increase the chance of giving up prematurely. And that I
absolutely want to avoid. I suppose none of the maintainers would
want to review a 300-patch series at once either.

Of course, the new subsystem would only be marked stable once all
desires have been met. Something like how btrfs was merged.

>> To encode an attribute's length, struct nlattr only has a 16-bit 
>> field, which means the attribute header plus payload is limited 
>> to 64 KB. This is easily exceeded with the encapsulated 
>> encoding, as a chain encapsulates all of its rules, for example. 
>> The problem is aggravated by the kernel's Netlink handler only 
>> allocating sk_buffs a page size worth, which leaves little room for 
>> extension data. In the worst case, the usable payload for 
>> attributes is around 3600 bytes only. In light of xt_u32's 
>> private data block being 1984 bytes already, that means that you 
>> won't be able to fit two -m u32 invocations nested in a single 
>> rule into a dump.
>
>The pagesize limit is a real problem. :-(( I don't see how we could avoid 
>having to split a single rule into multiple messages when it simply does 
>not fit into a single one.

We will have to live with it, because when transferring from
kernel->user, other methods of transportation (such as a character
device) would run into the same limitation (it would be limited
by the size of the buffer passed to read(2)).

>> The Xtables2 Netlink protocol encodes each node of information as 
>> a standalone attribute, to be called Flat Encoding, that is 
>> appended (a. k. a. "chained") to the data stream. By avoiding 
>> encapsulated attributes, it is possible to split messages at much 
>> finer levels, and provides for attributes that happen to use 
>> opaque data with a maximally-sized buffer.
>
>Even with encapsulation, the messages can be split at any level.

I fear that won't work out so easily. Consider a Netlink message "msg
{ u32_params { atom1; atom2; ...; atomN; }}" with u32_params being an
NLA_F_NESTED attribute. You could split that across messages as, for
example, "msg { u32_params { atom1; atom2; } } msg { u32_params { atom3;
...; atomN; }}", but you would have to repeat container headers, i.e.
u32_params. Given a deep enough nesting level, that means the 2nd
message's space is used up by containers again.

If an analogy is needed: It is a bit like TCP segmentation vs. IP
fragmentation. In the former, there is one TCP hdr per message, in
the latter there is not.

>> Whereas encapsulated attribute encoding automatically provided 
>> for boundaries, this is realized using dummy attributes in the 
>> chained approach. The start of a nesting level can be implicitly 
>> represented by the presence of the attribute that would have 
>> otherwise been used for encapsulated nesting. For declaring an 
>> end of a nest level, an extra attribute is needed:
>> 
>> • "chain { rule; rule; ... }" <=> CHAIN RULE RULE ... 
>>   STOP
>
>With encapsulation, there would be no need for such an extra STOP attribute - 
>except that we may have to split the encapsulated attributes into multiple 
>messages and thus the STOP attribute/marker is needed.

Can you give an example of potential messages? The current spec
already lists

      NFXTM_STATE_DATA
        NFXTM_ATTR_DATA<nlattrs>
        NFXTM_ATTR_DATA<more nlattrs>
      NFXTM_STOP


>> 1.3 Attribute limitations in nfnetlink
>> 
>> Netlink, being just a base protocol, does not specify what comes 
>> after the nlmsghdr, or how it is ordered. This is left up to the 
>> subprotocols based on Netlink. nfnetlink has two effective 
>> shortcomings (due to its parser) that shall be held in mind:
>> 
>> • Attribute ordering is ignored and lost
>
>Even if netlink does not state that attribute ordering is kept, it does 
>not state either that attributes may be reordered. Netlink as a transport 
>protocol does not care about the attributes.

Indeed Netlink is fine. The beef is with nfnetlink, which, due to its
use of "struct nlattr *tb[]" basically forfeits attribute ordering.

>So we can say that for xtables2, the attribute order in the netlink
>messages is fixed, period.

Yeah, but Pablo refused to accept patches which don't use nfnetlink,
or which rely on attribute order.


>> • No support for more than one attribute with the same type 
>>   within a message
>
>Oh no, you can put as many attributes with the same type as you like (and 
>fit) into a single nested attribute!

Not just nested attributes. Attributes with the same type can be put
anywhere as long as you don't use a parser that utilizes the "struct
nlattr *tb[indexed_by_attr_type]" scheme. nfnetlink does use tb
however.

Encapsulating all the attrs in a nested attribute just to work around
nfnetlink's use of tb[] would beg the question of why one is using
nfnetlink in the first place then.
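
To make the tb[] point concrete, here is a small userspace model of the two parsing styles. It is a sketch, not nfnetlink's actual code: struct attr, ATTR_MAX and both function names are invented for illustration, and real attributes carry a header and payload rather than a single int.

```c
#include <assert.h>
#include <stddef.h>

#define ATTR_MAX 7  /* hypothetical highest attribute type */

struct attr {
	unsigned int type;
	int value;
};

/*
 * Table-style parse: one slot per type. A later attribute of the same
 * type overwrites the earlier one, and the relative order of different
 * types can no longer be read back from the table.
 */
static void parse_into_table(const struct attr *attrs, size_t n,
			     const struct attr *tb[ATTR_MAX + 1])
{
	size_t i;

	for (i = 0; i <= ATTR_MAX; ++i)
		tb[i] = NULL;
	for (i = 0; i < n; ++i) {
		assert(attrs[i].type <= ATTR_MAX);
		tb[attrs[i].type] = &attrs[i];
	}
}

/* Stream-style parse: visit every attribute in wire order. */
static size_t count_of_type(const struct attr *attrs, size_t n,
			    unsigned int type)
{
	size_t i, hits = 0;

	for (i = 0; i < n; ++i)
		if (attrs[i].type == type)
			++hits;
	return hits;
}
```

Given two attributes of type 1 in one message, the table keeps only the last one, while the stream-style walk still sees both and in their original order.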

>> Even if that were decidedly so, that brings along a problem. In 
>> NLM_F_MULTI-style dumps, all messages would have the same 
>> nlmsg_seq. To counter this, multi messages will have an 
>> NFXT-specific sequence counter (NFXTA_SEQNO) in addition, 
>> especially since ordering is so much more crucial in Xtables than 
>> it is in other parts of networking.
>
>...but yes, for dumping an additional attribute is required to make sure 
>the ordering is kept. Actually, two attributes: one at the rule level, and 
>one at the "action" level in the given rule.

NFXTM_TABLE_DUMP<nlmsg_seqno=7> would yield:

NFXTM_CHAIN_ENTRY<nlmsg_seqno=7,nfxt_seqno=0, name=INPUT,usertid=1>
  NFXTM_RULE_ENTRY<nlmsg_seqno=7,nfxt_seqno=1, idx=1,usertid=1>
    NFXTM_MATCH_ENTRY<(7,2), acidx=1,name=hashlimit,rev=1,usertid=2>
      NFXTM_CONFIG_DATA<(7,3)>
        NFXTM_ARB_DATA<(7,4) custom data>
        NFXTM_ARB_DATA<(7,5) more custom data>
      NFXTM_STOP<(7,6)>
      NFXTM_STATE_DATA<(7,7)>
        NFXTM_ATTR_DATA<(7,8) nlattrs>
        NFXTM_ATTR_DATA<(7,9) more nlattrs>
      NFXTM_STOP<(7,10)>
    NFXTM_STOP<(7,11)>
    NFXTM_TARGET_ENTRY<(7,12), acidx=2,name=TOS,rev=0,usertid=3>
      ...
    NFXTM_STOP<(7,95)>
    NFXTM_VERDICT_ENTRY<(7,96), acidx=3,name=ACCEPT,usertid=3>
    NFXTM_STOP<(7,97)>
  NFXTM_STOP<(7,98)>
NFXTM_STOP<(7,99)>

So I think I am fine with one extra seqno (NFXTA_SEQNO).
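
On the receiving side, the pair (nlmsg_seq, NFXTA_SEQNO) gives every dump message a total order. A minimal sketch of how userspace could use it, assuming the messages have already been reduced to the hypothetical struct dump_msg below:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical receive-side record: one entry per dump message. */
struct dump_msg {
	uint32_t nlmsg_seq;   /* same for every part of one dump */
	uint32_t nfxt_seqno;  /* per-message counter (NFXTA_SEQNO) */
};

/* Order by request first, then by position within the dump. */
static int dump_msg_cmp(const void *a, const void *b)
{
	const struct dump_msg *x = a, *y = b;

	if (x->nlmsg_seq != y->nlmsg_seq)
		return x->nlmsg_seq < y->nlmsg_seq ? -1 : 1;
	if (x->nfxt_seqno != y->nfxt_seqno)
		return x->nfxt_seqno < y->nfxt_seqno ? -1 : 1;
	return 0;
}

static void dump_sort(struct dump_msg *msgs, size_t n)
{
	assert(msgs != NULL || n == 0);
	if (n > 1)
		qsort(msgs, n, sizeof(*msgs), dump_msg_cmp);
}
```

In practice a receiver would more likely just verify that nfxt_seqno increases by one per message and treat a gap as a lost message; the sort is shown only because it makes the ordering key explicit.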


>> 1.6 Improved granularity error reporting
>> 
>> • General/standard (integer) error codes
>> • General Xtables2 error codes (largely replaces EINVAL sites)
>> • Free-form string. Standalone, or in addition to the above.
>>   It is impossible to provision error numbers for extensions, 
>>   especially those that are out-of-tree.
>>
>> The three error types will be conveyed by three distinct 
>> attributes: NFXTA_ERRNO (generic error codes), NFXTA_XTERRNO (xt2 
>> error codes), and NFXTA_ERRSTR (free-form string).
>
>Let me hammer on this issue further :-). With properly separated error number 
>domains, the three types can be expressed in a single error attribute.

That would mean that NFXTE_* codes would have to start at -4096
and go from there. Possible to work, but it does not feel right.

Currently the kernel has pointer values (-4095U)..(-1U) reserved for
error codes - no object will ever reside at those virtaddrs. Should
the kernel ever have a need for more than 4095 system errno codes,
the virtaddr limit of 0xfffff000 for mappings could simply be changed
to, say, 0xffff0000. But then you would run into the problem that
NFXTE error values suddenly overlap with system error codes.
Thus, keeping system error codes and NFXTE error codes separate
seems a sensible thing to do.
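
The range argument can be illustrated in userspace. MAX_ERRNO mirrors the kernel's value of 4095 and is_err_value() models its IS_ERR_VALUE() test; is_xt_error() is purely hypothetical and only shows where NFXTE codes would have to live if a single attribute carried both domains.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_ERRNO 4095  /* as in the kernel's include/linux/err.h */

/* Userspace model of the kernel's IS_ERR_VALUE() test. */
static int is_err_value(uintptr_t v)
{
	return v >= (uintptr_t)-MAX_ERRNO;
}

/*
 * Hypothetical single-attribute scheme: xt2 codes live below the errno
 * window, i.e. at -4096 and further down.
 */
static int is_xt_error(long code)
{
	assert(code < 0);
	return code < -MAX_ERRNO;
}
```

If MAX_ERRNO were ever raised, codes that is_xt_error() classifies as xt2 errors today would silently fall into the system errno window, which is the overlap described above.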

>I'm still not convinced about the usefulness of the error string.

Extensions shipped with the kernel are already provisioned for; the
error string was really only meant for extensions living outside the
kernel (those just can't be ignored).

I guess per-extension error codes are possible. (That just came to
mind.)

>As netlink sends back the original 
>message in the error message, the userspace can fully decode every 
>attribute (since it itself encoded it) too.

Generally just the original message header - not the entire message.
But in synchronous operations - in other words, most cases -
even that is not necessary: because we know what we constructed,
we don't need to rely on the replica inside the error message.


* Re: Xtables2 A7 spec draft
  2011-02-05 21:38   ` Jan Engelhardt
@ 2011-02-06 11:43     ` Jozsef Kadlecsik
  0 siblings, 0 replies; 6+ messages in thread
From: Jozsef Kadlecsik @ 2011-02-06 11:43 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: Netfilter Developer Mailing List

On Sat, 5 Feb 2011, Jan Engelhardt wrote:

> On Saturday 2011-02-05 20:33, Jozsef Kadlecsik wrote:
> >> 1.1 Nesting representation
> >> 
> >> The common element in Xtables is the ruleset, represented as a 
> >> tree structure with ordering constraints at some levels:
> >> 
> >> ruleset (unordered tables)
> >>  \__ table (unordered chains)
> >>  |    \__ chain (ordered rules)
> >>  |    |    \__ rule (ordered actions)
> >>  |    |    |    \__ match (unordered data)
> >>  |    |    |    |    \__ config-data
> >>  |    |    |    |    |    \__ bin params
> >>  |    |    |    |    \__ state-data
> >>  |    |    |    |         \__ nlattrs
> >>  |    |    |    \__ match...
> >>  |    |    |    \__ target (unordered data)
> >>  |    |    |    |    \__ config-data
> >>  |    |    |    \__ target...
> >>  |    |    |    \__ verdict...
> >>  |    |    \__ rule...
> >>  |    \__ chain...
> >>  \__ table...
> >
> >I believe the objects 'match', 'target', 'verdict' should be generalized 
> >and unified into a single entity named 'action' (or named whatever).
> 
> That is why there are already traces of the word 'action'
> in the kernel (e.g. 'struct xt_action_param').

Then I was misled by NFXTM_MATCH|TARGET|VERDICT_ENTRY. So those are there 
for backward compatibility reasons only.

> >I don't like the idea of passing binary parameters at any level: 
> >everything should be expressed in nlattrs.
> >
> >> Certain voices in the community call for the obsoletion of such 
> >> data blobs and replace them by Netlink attributes; there are no 
> >> objections to doing so. However, the problem of size-limited 
> >> sk_buffs applies to opaque data of any kind, and Netlink 
> >> attributes fall within that.
> >
> >I'm among the ones who object to data blobs.
> 
> I completely agree with many of your points, but my strategy shall be
> clear: the first working code dump is _only_ supposed to 1. break up
> the table blob, 2. do NL transport with the preexisting per-extension
> blobs.

That's a good plan, indeed. With it the task is split into more easily 
manageable parts.
 
> >> The Xtables2 Netlink protocol encodes each node of information as 
> >> a standalone attribute, to be called Flat Encoding, that is 
> >> appended (a. k. a. "chained") to the data stream. By avoiding 
> >> encapsulated attributes, it is possible to split messages at much 
> >> finer levels, and provides for attributes that happen to use 
> >> opaque data with a maximally-sized buffer.
> >
> >Even with encapsulation, the messages can be split at any level.
> 
> I fear that won't work out so easily. Consider a Netlink message "msg
> { u32_params { atom1; atom2; ...; atomN; }}" with u32_params being an
> NLA_F_NESTED attribute. You could split that across messages as, for
> example, "msg { u32_params { atom1; atom2; } } msg { u32_params { atom3;
> ...; atomN; }}", but you would have to repeat container headers, i.e.
> u32_params. Given a deep enough nesting level, that means the 2nd
> message's space is used up by containers again.

The levels of nesting are well defined in xtables2: table, chain, rule, 
action, data. It doesn't look like a big burden. I don't count the nesting 
levels of the different data containers, because those are required 
anyway.
 
> If an analogy is needed: It is a bit like TCP segmentation vs. IP
> fragmentation. In the former, there is one TCP hdr per message, in
> the latter there is not.

[Some regard IP fragmentation as a design mistake. Making it possible in 
IPv6 was actually a sin.]

> >> Whereas encapsulated attribute encoding automatically provided 
> >> for boundaries, this is realized using dummy attributes in the 
> >> chained approach. The start of a nesting level can be implicitly 
> >> represented by the presence of the attribute that would have 
> >> otherwise been used for encapsulated nesting. For declaring an 
> >> end of a nest level, an extra attribute is needed:
> >> 
> >> • "chain { rule; rule; ... }" <=> CHAIN RULE RULE ... 
> >>   STOP
> >
> >With encapsulation, there would be no need for such an extra STOP attribute - 
> >except that we may have to split the encapsulated attributes into multiple 
> >messages and thus the STOP attribute/marker is needed.
> 
> Can you give an example of potential messages? The current spec
> already lists
> 
>       NFXTM_STATE_DATA
>         NFXTM_ATTR_DATA<nlattrs>
>         NFXTM_ATTR_DATA<more nlattrs>
>       NFXTM_STOP

In my opinion the basic atomic element is a rule and our main issue is how 
to split a rule into multiple messages. It could be expressed with the 
NFXTM_STOP attribute, but I'd prefer an attribute flag value:

NFXTM_ACTION_ENTRY
  NFXTM_ACTION_FLAGS (MATCH|TARGET|VERDICT, COMPLETE)
    NFXTM_CONFIG_DATA
      NFXTM_ARB_DATA
      ...
    NFXTM_STATE_DATA
      NFXTM_ATTR_DATA
      ...

If the action entry is not flagged as complete, expect messages with 
additional config, state data entries. The config and state data are 
unordered, so there's no ordering issue here.
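
The receiver logic implied by such a COMPLETE flag could be sketched like this. Everything here is an assumption made for illustration: the flag value, struct action_part, and the idea of tracking only the data length per message part.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bit, following the COMPLETE flag sketched above. */
#define ACTION_F_COMPLETE 0x1u

struct action_part {
	unsigned int flags;
	size_t data_len;   /* bytes of config/state data in this message */
};

/*
 * Consume message parts until one carries the COMPLETE flag. Returns the
 * number of parts used and stores the accumulated data length in *total,
 * or returns 0 if the stream ends before the action is complete.
 */
static size_t collect_action(const struct action_part *parts, size_t n,
			     size_t *total)
{
	size_t i, sum = 0;

	assert(total != NULL);
	for (i = 0; i < n; ++i) {
		sum += parts[i].data_len;
		if (parts[i].flags & ACTION_F_COMPLETE) {
			*total = sum;
			return i + 1;
		}
	}
	return 0;
}
```

Compared with a separate STOP marker, the flag saves one message per action, at the cost of the sender having to know while emitting a part whether it is the last one.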
 
> >> 1.3 Attribute limitations in nfnetlink
> >> 
> >> Netlink, being just a base protocol, does not specify what comes 
> >> after the nlmsghdr, or how it is ordered. This is left up to the 
> >> subprotocols based on Netlink. nfnetlink has two effective 
> >> shortcomings (due to its parser) that shall be held in mind:
> >> 
> >> • Attribute ordering is ignored and lost
> >
> >Even if netlink does not state that attribute ordering is kept, it does 
> >not state either that attributes may be reordered. Netlink as transport
> >protocol does not care about the attributes.
> 
> Indeed Netlink is fine. The beef is with nfnetlink, which, due to its
> use of "struct nlattr *tb[]" basically forfeits attribute ordering.
>
> >So we can say that for xtables2, the attribute order in the netlink
> >messages is fixed, period.
> 
> Yeah, but Pablo refused to accept patches which don't use nfnetlink,
> or which rely on attribute order.
>
> >> • No support for more than one attribute with the same type
> >>   within a message
> >
> >Oh no, you can put as many attributes with the same type as you like (and 
> >fit) into a single nested attribute!
> 
> Not just nested attributes. Attributes with the same type can be put
> anywhere as long as you don't use a parser that utilizes the "struct
> nlattr *tb[indexed_by_attr_type]" scheme. nfnetlink does use tb
> however.

Nfnetlink parses the attributes at the toplevel only. And at toplevel you 
don't rely on ordered attributes: tables, chains are unordered.

So you don't need to add anything to netlink/nfnetlink: you can use 
ordered attributes by nesting them, at the level where ordering is 
required.
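The difference between the two parsing styles can be sketched in Python as follows. This is a simplified model of the struct nlattr wire layout (16-bit length, 16-bit type, payload padded to 4 bytes), not the kernel parser itself; the function names are made up for illustration.

```python
import struct

NLA_HDRLEN = 4

def nla_align(n):
    """Round up to the 4-byte attribute alignment."""
    return (n + 3) & ~3

def put_attr(atype, payload):
    """Encode one attribute: length and type header, then padded payload."""
    length = NLA_HDRLEN + len(payload)
    pad = nla_align(length) - length
    return struct.pack("=HH", length, atype) + payload + b"\0" * pad

def walk(buf):
    """Yield (type, payload) in stream order; duplicates are preserved."""
    off = 0
    while off + NLA_HDRLEN <= len(buf):
        length, atype = struct.unpack_from("=HH", buf, off)
        if length < NLA_HDRLEN or off + length > len(buf):
            break
        yield atype, buf[off + NLA_HDRLEN:off + length]
        off += nla_align(length)

def parse_tb(buf, maxtype):
    """Table-style parse: one slot per type, a later duplicate overwrites."""
    tb = [None] * (maxtype + 1)
    for atype, payload in walk(buf):
        if atype <= maxtype:
            tb[atype] = payload
    return tb
```

With a stream holding two attributes of type 2, walk() returns both in order, while parse_tb() keeps only the last one. That is why ordered or repeated attributes need a sequential walk at the level where they occur.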
  
> Encapsulating all the attrs in a nested attribute just to work around
> nfnetlink's use of tb[] would beg the question of why one is using
> nfnetlink in the first place then.
> 
> >> Even if that were decidedly so, that brings along a problem. In 
> >> NLM_F_MULTI-style dumps, all messages would have the same 
> >> nlmsg_seq. To counter this, multi messages will have an 
> >> NFXT-specific sequence counter (NFXTA_SEQNO) in addition, 
> >> especially since ordering is so much more crucial in Xtables than 
> >> it is in other parts of networking.
> >
> >...but yes, for dumping an additional attribute is required to make sure 
> >the ordering is kept. Actually, two attributes: one at the rule level, and 
> >one at the "action" level in the given rule.
> 
> NFXTM_TABLE_DUMP<nlmsg_seqno=7> would yield:
> 
> NFXTM_CHAIN_ENTRY<nlmsg_seqno=7,nfxt_seqno=0, name=INPUT,usertid=1>
>   NFXTM_RULE_ENTRY<nlmsg_seqno=7,nfxt_seqno=1, idx=1,usertid=1>
>     NFXTM_MATCH_ENTRY<(7,2), acidx=1,name=hashlimit,rev=1,usertid=2>
>       NFXTM_CONFIG_DATA<(7,3)>
>         NFXTM_ARB_DATA<(7,4) custom data>
>         NFXTM_ARB_DATA<(7,5) more custom data>
>       NFXTM_STOP<(7,6)>
>       NFXTM_STATE_DATA<(7,7)>
>         NFXTM_ATTR_DATA<(7,8) nlattrs>
>         NFXTM_ATTR_DATA<(7,9) more nlattrs>
>       NFXTM_STOP<(7,10)>
>     NFXTM_STOP<(7,11)>
>     NFXTM_TARGET_ENTRY<(7,12), acidx=2,name=TOS,rev=0,usertid=3>
>       ...
>     NFXTM_STOP<(7,95)>
>     NFXTM_VERDICT_ENTRY<(7,96), acidx=3,name=ACCEPT,usertid=3>
>     NFXTM_STOP<(7,97)>
>   NFXTM_STOP<(7,98)>
> NFXTM_STOP<(7,99)>
> 
> So I think I am fine with one extra seqno (NFXTA_SEQNO).

I regard NFXTA_ACTION_IDX as a second attribute besides NFXTA_SEQNO, which
is needed to reconstruct the proper order when receiving a full rule in
multiple messages.
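As a rough illustration of that reconstruction step, the Python sketch below regroups action fragments per rule and orders them by their action index. The tuple layout is an assumption made for the example; the draft only names the attributes.

```python
from collections import defaultdict

def rebuild_rules(fragments):
    """Rebuild per-rule action order from fragments in any arrival order.

    fragments: iterable of (rule_idx, action_idx, name) tuples.
    Returns {rule_idx: [name, ...]} with actions sorted by action_idx.
    """
    rules = defaultdict(list)
    for rule_idx, action_idx, name in fragments:
        rules[rule_idx].append((action_idx, name))
    return {rule: [name for _, name in sorted(actions)]
            for rule, actions in sorted(rules.items())}
```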

> >> 1.6 Improved granularity error reporting
> >> 
> >> • General/standard (integer) error codes
> >> • General Xtables2 error codes (largely replaces EINVAL sites)
> >> • Free-form string. Standalone, or in addition to the above.
> >>   It is impossible to provision error numbers for extensions,
> >>   especially those that are out-of-tree.
> >>
> >> The three error types will be conveyed by three distinct 
> >> attributes: NFXTA_ERRNO (generic error codes), NFXTA_XTERRNO (xt2 
> >> error codes), and NFXTA_ERRSTR (free-form string).
> >
> >I hammer the issue further :-). With properly separated error number 
> >domains, the three types can be expressed in a single error attribute.
> 
> That would mean that NFXTE_* codes would have to start at -4096
> and go from there. Possible to work, but it does not feel right.
> 
> Currently the kernel has pointer values (-4095U)..(-1U) reserved for
> error codes - no object will ever reside at those virtaddrs. Should
> the kernel ever have a need for more than 4095 system errno codes,
> the virtaddr limit of 0xfffff000 for mappings could simply be changed
> to, say, 0xffff0000. But then you would run into the problem that
> NFXTE error values suddenly overlap with system error codes.
> Thus, keeping system error codes and NFXTE error codes separate
> seems a sensible thing to do.

The highest system error code currently is 132. I think we have got plenty 
of time to exhaust the rest and overflow 4095 :-).
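A single error attribute with separated number domains could look roughly like the Python sketch below. The split point (NFXTE codes placed directly above 4095) follows the "-4096" remark quoted above but is otherwise an assumption; the NFXTE names and values are taken from the draft's error table.

```python
import errno

MAX_ERRNO = 4095   # magnitude range the kernel reserves for system errnos

# Codes from the "Error types" table of the draft.
NFXTE_NAMES = {1: "NFXTE_CHAIN_EXIST", 2: "NFXTE_CHAIN_NOENT",
               3: "NFXTE_RULESET_LOOP", 4: "NFXTE_EXT_HOOKMASK"}

def encode_nfxte(code):
    """Map an NFXTE_* code into the (assumed) range above MAX_ERRNO."""
    return MAX_ERRNO + code

def decode(value):
    """Classify the magnitude of a single error attribute by domain."""
    if value == 0:
        return ("none", "success")
    if value <= MAX_ERRNO:
        return ("system", errno.errorcode.get(value, "unknown"))
    return ("xtables2", NFXTE_NAMES.get(value - MAX_ERRNO, "unknown"))
```

The scheme stays unambiguous only as long as system error codes never grow past MAX_ERRNO, which is the concern raised in the quoted text.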
 
> >I'm still not convinced about the usefulness of the error string.
> 
> Extensions shipped with the kernel are already provisioned for; the
> error string was really only meant for extensions living outside the
> kernel (those just can't be ignored).
> 
> I guess per-extension error codes are possible. (That just came to
> mind.)
> 
> >As netlink sends back the original 
> >message in the error message, the userspace can fully decode every 
> >attribute (since it itself encoded it) too.
> 
> Generally just the original message header - not the entire message.
> But in synchronous operations - in other words, most cases -
> even that is not necessary: because we know what we constructed,
> we don't need to rely on the replica inside the error message.

Netlink sends back the entire message, not just the header. (Unless you 
handle error messages manually and force netlink not to send them.)
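How much of the request comes back can be checked by splitting a received NLMSG_ERROR message, as in this Python sketch. It assumes the usual layout (nlmsghdr, then struct nlmsgerr holding an int error code followed by the original nlmsghdr and whatever payload was echoed) and native byte order; the function name is made up.

```python
import struct

NLMSG_HDR = struct.Struct("=IHHII")   # len, type, flags, seq, pid
NLMSG_ERROR = 2

def parse_error(buf):
    """Split an NLMSG_ERROR message into (errno, original header, payload).

    The payload is empty if only the original header was echoed.
    """
    length, mtype, _flags, _seq, _pid = NLMSG_HDR.unpack_from(buf, 0)
    if mtype != NLMSG_ERROR:
        raise ValueError("not an NLMSG_ERROR message")
    (err,) = struct.unpack_from("=i", buf, NLMSG_HDR.size)
    orig_off = NLMSG_HDR.size + 4
    orig_hdr = NLMSG_HDR.unpack_from(buf, orig_off)
    echoed = buf[orig_off + NLMSG_HDR.size:length]
    # The kernel reports errors as negative errno values.
    return -err, orig_hdr, echoed
```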

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@mail.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary


* Re: Xtables2 A7 spec draft
  2011-02-02 22:04 Xtables2 A7 spec draft Jan Engelhardt
  2011-02-05 19:33 ` Jozsef Kadlecsik
@ 2011-02-07 20:50 ` James Nurmi
  2011-02-07 21:45   ` Jan Engelhardt
  1 sibling, 1 reply; 6+ messages in thread
From: James Nurmi @ 2011-02-07 20:50 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: Netfilter Developer Mailing List

(inline)

Comments are made as maintainer of GoNetlink, a Netlink implementation
in a 'not-C' language; disregard as desired.

On Wed, Feb 2, 2011 at 2:04 PM, Jan Engelhardt <jengelh@medozas.de> wrote:
>
>
> I am posting the Xtables2 Netlink interface specification, draft 7
> for comments.
>
> Additionally, further documentation and toolchain around
> it is available through the project page at
>
>        http://jengelh.medozas.de/projects/xtables/
>
>  * User Documentation Chapter 1: Architectural Differences
>  * Developer Documentation Part 1: Netlink interface (WIP)
>   This is copied below to facilitate inline replies
> --8<--
>
> Netlink interface
>
> 1 Concepts
>
> This section is non-normative and should instead show the flow of
> thought and give reasons as to why the specification was
> conceived the way it is, and where the component problems are.
>
> 1.1 Nesting representation
>
> The common element in Xtables is the ruleset, represented as a
> tree structure with ordering constraints at some levels:
>
> ruleset (unordered tables)
>  \__ table (unordered chains)
>  |    \__ chain (ordered rules)
>  |    |    \__ rule (ordered actions)
>  |    |    |    \__ match (unordered data)
>  |    |    |    |    \__ config-data
>  |    |    |    |    |    \__ bin params
>  |    |    |    |    \__ state-data
>  |    |    |    |         \__ nlattrs
>  |    |    |    \__ match...
>  |    |    |    \__ target (unordered data)
>  |    |    |    |    \__ config-data
>  |    |    |    \__ target...
>  |    |    |    \__ verdict...
>  |    |    \__ rule...
>  |    \__ chain...
>  \__ table...
>
> A more concrete example, here is a small ruleset, encoded into
> XML (just one of many possible representations):
>
> <table>
>  <chain name="INPUT">
>    <rule idx="1">
>      <match acidx="1" name="hashlimit" rev="1" csize="120">
>        <config-data>...</config-data>
>        <state-data>...</state-data>
>      </match>
>      <target acidx="2" name="TOS" rev="1">
>        ...
>      </target>
>      <verdict acidx="3" name="ACCEPT" />
>    </rule>
>  </chain>
> </table>
>
> There are different ways to encode such a tree structure into a
> serialized stream. In many Netlink protocols, children attributes
> are encapsulated (a. k. a. “nested”, though we will avoid this
> term to avoid double-use) and treated as a whole as a parent's
> opaque data. It cannot be told apart from normal data. (Like
> writing “<chain> &lt;rule&gt; ... &lt;/rule&gt; </chain>” in
> XML.) We will call this format “Encapsulated Encoding”.
>
> To encode an attribute's length, struct nlattr only has a 16-bit
> field, which means the attribute header plus payload is limited
> to 64 KB. This is easily exceedable with the encapsulated
> encoding as chains are collected rules in a chain, for example.
> The problem is aggravated by the kernel's Netlink handler only
> allocating sk_buffs a page size worth, which leaves little room for
> extension data. In the worst case, the usable payload for
> attributes is around 3600 bytes only. In light of xt_u32's
> private data block being 1984 bytes already, that means that you
> won't be able to fit two -m u32 invocations nested in a single
> rule into a dump.
>
> Certain voices in the community call for the obsoletion of such
> data blobs and replace them by Netlink attributes; there are no
> objections to doing so. However, the problem of size-limited
> sk_buffs applies to opaque data of any kind, and Netlink
> attributes fall within that.

I'm all for opaque data blobs where the user is not expected to
understand the data underneath (FILE handles), but only so far as they
can be safely serialized to alternate processes for collection of
additional data (no *pointers, and only TLV-styled abstractions).

>
> The Xtables2 Netlink protocol encodes each node of information as
> a standalone attribute, to be called Flat Encoding, that is
> appended (a. k. a. “chained”) to the data stream. By avoiding
> encapsulated attributes, it is possible to split messages at much
> finer levels, and provides for attributes that happen to use
> opaque data with a maximally-sized buffer.
>
> 1.2 Nest markers<sub:Nest-markers>
>
> Since Netlink messages do have a 32-bit quantity to store the
> message length, rulesets of roughly up to 4 GB are possible,
> which is currently regarded as sufficient. The largest (while
> still being meaningful) rulesets seen to date in the industry
> weighed in at approximately 150 MB.

While managing tables/rules/etc atomically should be priority #1, I'm
not certain if optimizing the protocol for this makes a lot of sense
for either the user or kernel contexts.

>
> Whereas encapsulated attribute encoding automatically provided
> for boundaries, this is realized using dummy attributes in the
> chained approach. The start of a nesting level can be implicitly
> represented by the presence of the attribute that would have
> otherwise been used for encapsulated nesting. For declaring an
> end of a nest level, an extra attribute is needed:
>
> • “chain { rule; rule; ... }” \Leftrightarrow CHAIN RULE RULE ...
>  STOP
>
> 1.3 Attribute limitations in nfnetlink
>
> Netlink, being just a base protocol, does not specify what comes
> after the nlmsghdr, or how it is ordered. This is left up to the
> subprotocols based on Netlink. nfnetlink has two effective
> shortcomings (due to its parser) that shall be held in mind:
>
> • Attribute ordering is ignored and lost

(GoNetlink doesn't adhere to this belief; I didn't realize there was
any standardization of this approach outside of the libnfnetlink
implementation, and so assumed I'd be screwed if I followed it.)

>
> • No support for more than one attribute with the same type
>  within a message

ditto

> 1.4 Summary of transform<sub:Summary-of-transform>
>
> Essentially there is a 1:1 transform on the XML-like tree shown
> above, to:
>
> NFXTM_CHAIN_ENTRY<name=INPUT,usertid=1>
>  NFXTM_RULE_ENTRY<idx=1,usertid=1>
>    NFXTM_MATCH_ENTRY<acidx=1,name=hashlimit,rev=1,usertid=2>
>      NFXTM_CONFIG_DATA
>        NFXTM_ARB_DATA<whatever>
>        NFXTM_ARB_DATA<more arbitrary data>
>      NFXTM_STOP
>      NFXTM_STATE_DATA
>        NFXTM_ATTR_DATA<nlattrs>
>        NFXTM_ATTR_DATA<more nlattrs>
>      NFXTM_STOP
>    NFXTM_STOP
>    NFXTM_TARGET_ENTRY<acidx=2,name=TOS,rev=0,usertid=3>
>      ...
>    NFXTM_STOP
>    NFXTM_VERDICT_ENTRY<acidx=3,name=ACCEPT,usertid=3>
>    NFXTM_STOP
>  NFXTM_STOP
> NFXTM_STOP
>
> 1.5 Extra sequence numbers<sub:Extra-sequence-numbers>
>
> Netlink also does not specify any message ordering, though it
> does provide an nlmsg_seq field with which message order can at
> least be determined. The problem is that nothing specifies what
> nlmsg_seq should be in reply messages. It is assumed that the
> sequence number is linked, i. e. that a reply's number should be
> the same as the request's number, to do message matching (vague
> hint by netlink(7) manpage).

RFC 3549 (2.3.2.1) seems to support you in that the usage of sequence
numbers is undefined. My experience has been to expect the response to
match the request and dispatch accordingly. Since that's the 'norm',
and netlink shouldn't ever fail, I'd actually rather see the protocol
use the NLM_F_MULTI, NLM_F_ATOMIC pair, with an internal
sequence/timestamp for clients that really need an atomic state.

>
> Even if that were decidedly so, that brings along a problem. In
> NLM_F_MULTI-style dumps, all messages would have the same
> nlmsg_seq. To counter this, multi messages will have an
> NFXT-specific sequence counter (NFXTA_SEQNO) in addition,
> especially since ordering is so much more crucial in Xtables than
> it is in other parts of networking.

That, to me, is fine -- netlink is an encapsulation from my view,
MULTI is the right way to do long messages.

>
> 1.6 Improved granularity error reporting
>...

As a non-C implementation, I'd prefer constant error (class) with
flags (bitfield), but expect to rewrite a lot of constants anyhow.

> 1.7 Multi-type responses
> ...

Most RTNetlink protocols (which will be a similar user base I imagine)
make assumptions on the response type based on the query type. In Go,
for example, there are no generics, so re-decomposing a response becomes
expensive.

Personally, I would prefer that responses be limited solely to the
query I provided or an error, not something with multiple (possibly
confounding?) types.

> ...
> 3 Attributes
>
> The meaning of attributes depends upon the message and logical
> nesting level in which they appear. Their type however remains
> the same, such that a single Netlink attribute validation policy
> object (struct nla_policy) can be used for all message types.
>
> A table of all known attributes:
>
>
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> | Value  | Mnemonic          |    C type     | NLA type        | Notes                                |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> |   1    | NFXTA_SEQNO       | unsigned int  | NLA_U32         | Section [sub:Extra-sequence-numbers] |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> |  tba   | NFXTA_ERRNO       |     int       | NLA_U32         | Generic system errno (Exxx)          |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> |  ...   | NFXTA_XTERRNO     |     int       | NLA_U32         | NFXT errno (NFXTE_*)                 |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> |        | NFXTA_ERRSTR      |   char []     | NLA_NUL_STRING  | Arbitrary                            |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> |        | NFXTA_USERTID     | unsigned int  | NLA_U32         | Arbitrary, retained verbatim         |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> |        | NFXTA_CHAIN_NAME  |   char []     | NLA_NUL_STRING  |                                      |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> |        | NFXTA_RULE_IDX    | unsigned int  | NLA_U32         |                                      |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> |        | NFXTA_ACTION_IDX  | unsigned int  | NLA_U32         |                                      |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> |        | NFXTA_NAME        |   char []     | NLA_NUL_STRING  |                                      |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> |        | NFXTA_REVISION    |   uint8_t     | NLA_U32         |                                      |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> |        | NFXTA_HOOKNUM     | unsigned int  | NLA_U32         |                                      |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> |        | NFXTA_PRIORITY    |     int       | NLA_U32         |                                      |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> |        | NFXTA_NFPROTO     |   uint8_t     | NLA_U32         |                                      |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> |        | NFXTA_OFFSET      | unsigned int  | NLA_U32         |                                      |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> |        | NFXTA_LENGTH      |    size_t     | NLA_U32         |                                      |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> |        | NFXTA_HOOKMASK    | unsigned int  | NLA_U32         |                                      |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> |        | NFXTA_SIZE        |    size_t     | NLA_U32         |                                      |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> |        | NFXTA_NEW_NAME    |   char []     | NLA_NUL_STRING  |                                      |
> +--------+-------------------+---------------+-----------------+--------------------------------------+


W/r/t the NUL_STRINGs -- is there a good reason to use NUL-terminated
strings for NAME/etc, given the length is known? Wouldn't it make more
sense to simply require a byte string and apply the NUL internally? I
see this frequently in Netlink, and imagine it's a kernel consistency
thing?

>
>
> The kernel ignores attributes with value 0 during validation, so
> it was left unused.
>
> 4 Error types<sec:Error-types>
>
>
> +--------+---------------------+-------------------------------------------+
> | Value  | Mnemonic            | Description                               |
> +--------+---------------------+-------------------------------------------+
> +--------+---------------------+-------------------------------------------+
> |   0    | NFXTE_SUCCESS       | No error                                  |
> +--------+---------------------+-------------------------------------------+
> |   1    | NFXTE_CHAIN_EXIST   | Chain already exists                      |
> +--------+---------------------+-------------------------------------------+
> |   2    | NFXTE_CHAIN_NOENT   | Chain does not exist                      |
> +--------+---------------------+-------------------------------------------+
> |   3    | NFXTE_RULESET_LOOP  | Ruleset contains a loop                   |
> +--------+---------------------+-------------------------------------------+
> |   4    | NFXTE_EXT_HOOKMASK  | Rule invoked from incompatible hook       |
> +--------+---------------------+-------------------------------------------+
> |        | NFXTE_PROMO_STATUS  | Promotion/demotion state already achieved |
> +--------+---------------------+-------------------------------------------+
>
>
> 5 Message types
> ...

My biggest concern here is, as already pointed out, the use of
STOP && deep nesting in messages. Every time a STOP occurs in an
internal message, it's semantically equivalent to the completion of an
NLM_F_MULTI, no?

I see the advantage of a trivial protocol, but wouldn't it be much
simpler to have a 'bigger' protocol (table/chain/rule) with an
optional ATOMIC guarantee?

I don't see anywhere else guaranteeing tables/matches/rules will be
managed (as a set) with atomicity [I'm probably wrong], so doing it in
the protocol feels awkward.

There are a LOT of definitions of atomicity, ordering, etc. within this
area, making me feel like doing that 'up one level' and in smaller
pieces might make for a more manageable interface.

Still, this all looks like phenomenal progress, and I look forward to
seeing it move on.

James
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Xtables2 A7 spec draft
  2011-02-07 20:50 ` James Nurmi
@ 2011-02-07 21:45   ` Jan Engelhardt
  0 siblings, 0 replies; 6+ messages in thread
From: Jan Engelhardt @ 2011-02-07 21:45 UTC (permalink / raw)
  To: James Nurmi; +Cc: Netfilter Developer Mailing List


On Monday 2011-02-07 21:50, James Nurmi wrote:

>> +--------+-------------------+---------------+-----------------+--------------------------------------+
>> |        | NFXTA_ERRSTR      |   char []     | NLA_NUL_STRING  | Arbitrary                            |
>> +--------+-------------------+---------------+-----------------+--------------------------------------+
>
>W/r/t the NUL_STRING's -- is there a good reason to use a NUL'd
>strings for NAME/etc, given the length is known?

Simpler to deal with string functions especially when it comes to strcmp.


>> 5 Message types
>> ...
>
>My biggest concern here seems as already pointed out -- the use of
>STOP && deep nesting in messages;  Every time a STOP occurs in an
>internal message, it's semantically equivalent to the completion of an
>NLM_F_MULTI, no?

MULTIs do not seem to be nestable.
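For comparison, nested STOP markers can be resolved with a simple stack. The Python sketch below rebuilds the tree from a flat list of message-type names; treating NFXTM_ARB_DATA and NFXTM_ATTR_DATA as leaves that open no level is read off the transform example in section 1.4 and is otherwise an assumption.

```python
# Entries that carry data but do not open a nesting level (assumed).
LEAVES = {"NFXTM_ARB_DATA", "NFXTM_ATTR_DATA"}

def build_tree(tokens):
    """Turn a flat token list with NFXTM_STOP markers into nested dicts."""
    root = {"name": "ROOT", "children": []}
    stack = [root]
    for name in tokens:
        if name == "NFXTM_STOP":
            if len(stack) == 1:
                raise ValueError("unbalanced NFXTM_STOP")
            stack.pop()          # close the innermost open level
            continue
        node = {"name": name, "children": []}
        stack[-1]["children"].append(node)
        if name not in LEAVES:
            stack.append(node)   # this entry opens a new level
    if len(stack) != 1:
        raise ValueError("missing NFXTM_STOP")
    return root["children"]
```

Each STOP closes exactly one level, so the depth is recovered without any per-message flag.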

>I see the advantage of a trivial protocol, but wouldn't it be much
>simpler to have a 'bigger' protocol (table/chain/rule) with an
>optional ATOMIC guarantee?

I have no idea what you could mean by that. (In fact, most of
your reply gave me nothing to act on.)

>I don't see anywhere else guaranteeing tables/matches/rules will be
>managed (as a set) with atomicity [I'm probably wrong], so doing it in
>the protocol feels awkward.

By issuing NFXTM_TABLE_REPLACE (atomic replace of table) or
NFXTM_CHAIN_SPLICE (atomic replace of a chain and its rules), rules
that follow are collected and implanted into the live ruleset on the
final NFXTM_STOP. And these two cover all the atomicity one would
need as I see it.
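The collect-then-commit behaviour can be modelled roughly as in the Python sketch below. It is a toy model, not the kernel logic: rules are reduced to single tokens without their own nested STOP markers, and the live ruleset is a plain dictionary.

```python
def chain_splice(live, chain, stream):
    """Stage rules from the stream; swap them in only on the final STOP.

    live: {chain name: [rule, ...]}. Returns a new mapping on commit,
    or the untouched original if the stream ends without NFXTM_STOP.
    """
    staged = []
    for item in stream:
        if item == "NFXTM_STOP":
            committed = dict(live)
            committed[chain] = staged   # replace the chain's rules in one step
            return committed
        staged.append(item)
    return live   # no final STOP seen: nothing is committed
```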


end of thread, other threads:[~2011-02-07 21:45 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-02-02 22:04 Xtables2 A7 spec draft Jan Engelhardt
2011-02-05 19:33 ` Jozsef Kadlecsik
2011-02-05 21:38   ` Jan Engelhardt
2011-02-06 11:43     ` Jozsef Kadlecsik
2011-02-07 20:50 ` James Nurmi
2011-02-07 21:45   ` Jan Engelhardt
