* Xtables2 A7 spec draft
@ 2011-02-02 22:04 Jan Engelhardt
2011-02-05 19:33 ` Jozsef Kadlecsik
2011-02-07 20:50 ` James Nurmi
From: Jan Engelhardt @ 2011-02-02 22:04 UTC
To: Netfilter Developer Mailing List
I am posting the Xtables2 Netlink interface specification, draft 7
for comments.
Additionally, further documentation and the toolchain around
it are available through the project page at
http://jengelh.medozas.de/projects/xtables/
* User Documentation Chapter 1: Architectural Differences
* Developer Documentation Part 1: Netlink interface (WIP)
This is copied below to facilitate inline replies
--8<--
Netlink interface
1 Concepts
This section is non-normative and should instead show the flow of
thought and give reasons as to why the specification was
conceived the way it is, and where the component problems are.
1.1 Nesting representation
The common element in Xtables is the ruleset, represented as a
tree structure with ordering constraints at some levels:
ruleset (unordered tables)
\__ table (unordered chains)
| \__ chain (ordered rules)
| | \__ rule (ordered actions)
| | | \__ match (unordered data)
| | | | \__ config-data
| | | | | \__ bin params
| | | | \__ state-data
| | | | \__ nlattrs
| | | \__ match...
| | | \__ target (unordered data)
| | | | \__ config-data
| | | \__ target...
| | | \__ verdict...
| | \__ rule...
| \__ chain...
\__ table...
As a more concrete example, here is a small ruleset encoded in
XML (just one of many possible representations):
<table>
<chain name="INPUT">
<rule idx="1">
<match acidx="1" name="hashlimit" rev="1" csize="120">
<config-data>...</config-data>
<state-data>...</state-data>
</match>
<target acidx="2" name="TOS" rev="1">
...
</target>
<verdict acidx="3" name="ACCEPT" />
</rule>
</chain>
</table>
There are different ways to encode such a tree structure into a
serialized stream. In many Netlink protocols, child attributes
are encapsulated (a. k. a. “nested”, though we will avoid this
term to avoid double-use) and treated as a whole as the parent's
opaque data. They cannot be told apart from ordinary data. (Like
writing “<chain> <rule> ... </rule> </chain>” in
XML.) We will call this format “Encapsulated Encoding”.
To encode an attribute's length, struct nlattr only has a 16-bit
field, which means the attribute header plus payload is limited
to 64 KB. This limit is easily exceeded with the encapsulated
encoding, since a chain attribute would have to contain all of
the chain's rules, for example.
The problem is aggravated by the kernel's Netlink handler only
allocating sk_buffs of one page in size, which leaves little room
for extension data. In the worst case, the usable payload for
attributes is only around 3600 bytes. Given that xt_u32's
private data block is already 1984 bytes, you cannot fit two
-m u32 invocations nested in a single rule into a dump.
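The size constraints above can be checked with a little arithmetic. What follows is a minimal sketch only: the constant and function names are made up for illustration, the 3600-byte and 1984-byte figures are simply the ones quoted in the text, and attribute header overhead is ignored (which only makes the real situation tighter).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative constants; the names are hypothetical, the values are from the text. */
enum {
	NLA_LEN_MAX        = UINT16_MAX, /* nla_len is a 16-bit field */
	WORST_CASE_PAYLOAD = 3600,       /* approx. usable bytes in one sk_buff */
	XT_U32_BLOB_SIZE   = 1984,       /* private data block of xt_u32 */
};

/* Returns 1 if @count opaque blobs of @blob_size bytes fit into @room bytes. */
int blobs_fit(size_t count, size_t blob_size, size_t room)
{
	assert(blob_size > 0);
	return count <= room / blob_size;
}
```

With these numbers, blobs_fit(2, XT_U32_BLOB_SIZE, WORST_CASE_PAYLOAD) is false, which is the "two -m u32 invocations" case mentioned above.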
Certain voices in the community call for such data blobs to be
made obsolete and replaced by Netlink attributes; there are no
objections to doing so. However, the problem of size-limited
sk_buffs applies to opaque data of any kind, and Netlink
attributes fall within that.
The Xtables2 Netlink protocol encodes each node of information as
a standalone attribute, to be called Flat Encoding, that is
appended (a. k. a. “chained”) to the data stream. By avoiding
encapsulated attributes, it is possible to split messages at much
finer levels, and attributes that happen to use opaque data get a
maximally-sized buffer.
1.2 Nest markers
Since Netlink messages do have a 32-bit quantity to store the
message length, rulesets of roughly up to 4 GB are possible,
which is currently regarded as sufficient. The largest (while
still being meaningful) rulesets seen to date in the industry
weighed in at approximately 150 MB.
Whereas encapsulated attribute encoding automatically provided
for boundaries, this is realized using dummy attributes in the
chained approach. The start of a nesting level can be implicitly
represented by the presence of the attribute that would have
otherwise been used for encapsulated nesting. For declaring an
end of a nest level, an extra attribute is needed:
• “chain { rule; rule; ... }” ⇔ CHAIN RULE RULE ...
STOP
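To illustrate how a receiver could recover the nesting from such a flat stream, here is a minimal sketch of a depth tracker. The enum msg_kind and the function name are hypothetical; they only distinguish "nest start", "STOP" and "anything else", since numeric message IDs are not assigned in this draft.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical message kinds for this sketch. */
enum msg_kind {
	MSG_STOP,   /* NFXTM_STOP: closes the innermost open level */
	MSG_ENTRY,  /* any nest-start message, e.g. NFXTM_CHAIN_ENTRY */
	MSG_DATA,   /* a leaf message, e.g. NFXTM_ARB_DATA */
};

/*
 * Walk a flat message stream and check that every nest-start message is
 * closed by a matching STOP. Returns the maximum depth seen, or -1 if
 * the stream is unbalanced.
 */
int check_nesting(const enum msg_kind *stream, size_t n)
{
	int depth = 0, max_depth = 0;

	assert(stream != NULL || n == 0);
	for (size_t i = 0; i < n; ++i) {
		if (stream[i] == MSG_ENTRY) {
			if (++depth > max_depth)
				max_depth = depth;
		} else if (stream[i] == MSG_STOP) {
			if (--depth < 0)
				return -1; /* STOP without an open level */
		}
	}
	return depth == 0 ? max_depth : -1;
}
```

A stream such as CHAIN RULE STOP RULE STOP STOP yields depth 2; a stream that ends with a level still open is reported as unbalanced.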
1.3 Attribute limitations in nfnetlink
Netlink, being just a base protocol, does not specify what comes
after the nlmsghdr, or how it is ordered. This is left up to the
subprotocols based on Netlink. nfnetlink has two effective
shortcomings (due to its parser) that should be kept in mind:
• Attribute ordering is ignored and lost
• No support for more than one attribute with the same type
within a message
struct nlattr **tb;
/* each attribute lands in the slot of its type; a later one overwrites */
nla_for_each_attr(attr, head, ...)
	tb[nla_type(attr)] = attr;
This kills the idea of being able to do, for example, a table
replace, in a single Netlink request message. This is like having
to split an XML file at every tag simply because two tags can
carry the same attribute. So Netlink requests have to be broken
down into many tiny parts, and extra state has to be kept
around in the kernel.
put_header(msg, NFXTM_TABLE_REPLACE);
foreach (rule)
	put(msg, rule);
send(sock, msg);
will become
put_header(msg, NFXTM_TABLE_REPLACE);
send(sock, msg);
foreach (rule) {
	clean(msg);
	put_header(msg, NFXTM_RULE_DATA);
	put(msg, rule);
	send(sock, msg);
}
clean(msg);
put_header(msg, NFXTM_COMMIT);
send(sock, msg);
or worse. In other words, the fact that the kernel side will use
a temporary table (an implementation detail) will be exposed to
userspace, which is bad too.
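The effect of the table-filling parser loop shown earlier in this section can be reproduced in plain userspace C. The following is a sketch under simplifying assumptions: struct demo_attr is a stand-in for struct nlattr, and ATTR_TYPE_MAX is an arbitrary bound chosen for the demo.

```c
#include <assert.h>
#include <stddef.h>

#define ATTR_TYPE_MAX 4 /* hypothetical upper bound for this demo */

/* Simplified stand-in for struct nlattr: a type and a payload value. */
struct demo_attr {
	unsigned int type;
	int value;
};

/*
 * Mimic the table-filling loop: each attribute is stored in the slot of
 * its type, so a later attribute of the same type silently replaces an
 * earlier one, and the original ordering is not recorded anywhere.
 */
void demo_parse(const struct demo_attr **tb,
		const struct demo_attr *attrs, size_t n)
{
	for (size_t i = 0; i <= ATTR_TYPE_MAX; ++i)
		tb[i] = NULL;
	for (size_t i = 0; i < n; ++i) {
		assert(attrs[i].type <= ATTR_TYPE_MAX);
		tb[attrs[i].type] = &attrs[i];
	}
}
```

Feeding it two attributes of type 1 leaves only the second one reachable through tb[1], which is why one rule per message is the most that such a parser can accept.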
1.4 Summary of transform
Essentially there is a 1:1 transform on the XML-like tree shown
above, to:
NFXTM_CHAIN_ENTRY<name=INPUT,usertid=1>
	NFXTM_RULE_ENTRY<idx=1,usertid=1>
		NFXTM_MATCH_ENTRY<acidx=1,name=hashlimit,rev=1,usertid=2>
			NFXTM_CONFIG_DATA
				NFXTM_ARB_DATA<whatever>
				NFXTM_ARB_DATA<more arbitrary data>
			NFXTM_STOP
			NFXTM_STATE_DATA
				NFXTM_ATTR_DATA<nlattrs>
				NFXTM_ATTR_DATA<more nlattrs>
			NFXTM_STOP
		NFXTM_STOP
		NFXTM_TARGET_ENTRY<acidx=2,name=TOS,rev=0,usertid=3>
			...
		NFXTM_STOP
		NFXTM_VERDICT_ENTRY<acidx=3,name=ACCEPT,usertid=3>
		NFXTM_STOP
	NFXTM_STOP
NFXTM_STOP
1.5 Extra sequence numbers
Netlink also does not specify any message ordering, though it
does provide an nlmsg_seq field with which message order can at
least be determined. The problem is that nothing specifies what
nlmsg_seq should be in reply messages. It is assumed that the
sequence number is linked, i. e. that a reply's number should be
the same as the request's number, to do message matching (vague
hint by netlink(7) manpage).
Even if that were decidedly so, that brings along a problem. In
NLM_F_MULTI-style dumps, all messages would have the same
nlmsg_seq. To counter this, multi messages will have an
NFXT-specific sequence counter (NFXTA_SEQNO) in addition,
especially since ordering is so much more crucial in Xtables than
it is in other parts of networking.
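A receiver could use such a counter to verify that the parts of a multi-part reply arrived in order. The sketch below assumes that the counter starts at a known value and increases by one per message; the draft does not fix either detail, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Check that the NFXTA_SEQNO values of one multi-part reply form a
 * gap-free ascending run starting at @first. Returns the index of the
 * first message that is out of order, or @n if all are in order.
 */
size_t first_out_of_order(const uint32_t *seqno, size_t n, uint32_t first)
{
	assert(seqno != NULL || n == 0);
	for (size_t i = 0; i < n; ++i)
		if (seqno[i] != first + (uint32_t)i)
			return i;
	return n;
}
```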
1.6 Improved granularity error reporting
Xtables extensions as of Linux 2.6.37 can only return system
error codes back to userspace in case there is a problem. The
most common occurrences are, for example, ENOMEM (“Memory
allocation failure” / “Out of memory”), and the dreaded EINVAL (“
Invalid argument”). Best practices at the moment are to printk a
string to the kernel log for further information detailing the
circumstances about the cause of EINVAL. In the light of this
overload of EINVAL, an improved error reporting scheme is sought.
(Other networking subsystems also suffer from this problem.)
By suggestion of Jozsef Kadlecsik, the Xtables2 protocol reports
three kinds of errors:
• General/standard (integer) error codes, where there is no point
in specifying the nature of the error exactly (or where that is
not possible). Take ENOMEM, for example: it is needless to report
which new data field could not be allocated.
• General Xtables2 error codes (largely replacing EINVAL sites) in
integer form, similar to errno. Use cases include:
– chain for a requested operation does not exist
– an extension is used from a hook it is not supposed to be used from
• Free-form string. Standalone, or in addition to the above.
It is impossible to provision error numbers for extensions,
especially those that are out-of-tree. The problems that arise
from forcing a component to reuse another component's error code
space can be seen in the overuse of EINVAL. We are aware that
raw strings in kernel modules can hinder internationalization,
but it is seen as the better choice over awkward error codes
that convey nothing. It is also expected that strings do not
change that often.
The three error types will be conveyed by three distinct
attributes: NFXTA_ERRNO (generic error codes), NFXTA_XTERRNO (xt2
error codes), and NFXTA_ERRSTR (free-form string).
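As an illustration of how userspace might combine the three attributes into one diagnostic, here is a minimal sketch. struct xt2_error and xt2_error_format are hypothetical names, and the precedence (string first, then xt2 code, then system errno) is an assumption, not something the draft prescribes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace view of one error report. */
struct xt2_error {
	int sys_errno;       /* NFXTA_ERRNO, 0 if absent */
	int xt_errno;        /* NFXTA_XTERRNO, 0 if absent */
	const char *errstr;  /* NFXTA_ERRSTR, NULL if absent */
};

/* Render the most specific information available into @buf. */
int xt2_error_format(const struct xt2_error *e, char *buf, size_t size)
{
	assert(e != NULL && buf != NULL && size > 0);
	if (e->errstr != NULL && e->xt_errno != 0)
		return snprintf(buf, size, "xt error %d: %s",
				e->xt_errno, e->errstr);
	if (e->errstr != NULL)
		return snprintf(buf, size, "%s", e->errstr);
	if (e->xt_errno != 0)
		return snprintf(buf, size, "xt error %d", e->xt_errno);
	/* Fall back to the generic system error text. */
	return snprintf(buf, size, "%s", strerror(e->sys_errno));
}
```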
Error pointer
Once a table/chain splice request has been finalized,
xt_check_{match,target} is run, which can return:
• chain name, rule index, match/target index, NFXTE_*/custom
string
Line number
I noticed Jozsef has added a line number attribute in ipset
version 5 to help users locate errors. Given its
apparent value, such an attribute is also specified for xtnetlink:
A request message can contain a “ping attribute”, NFXTA_USERTID,
which xtnetlink may keep track of and which may be reported back
verbatim in case an error occurred. It may be used to represent
the source line, or any other number.
• For the tree example in section 1, the ruleset file would be “
-A INPUT \
-m hashlimit ... \
-j TOS ... -j ACCEPT”.
1.7 Multi-type responses
Using multi-type responses provides for a seemingly shorter reply
(in at least one case) than not doing so:
• → NFXTM_CHAIN_DUMP<NFXTA_NAME>
← NFXTM_RULE_START<>
← NFXTM_EMATCH<NFXTA_NAME, NFXTA_REVISION, NFXTA_DATA>
← NFXTM_EMATCH<NFXTA_NAME, NFXTA_REVISION, NFXTA_DATA>
← NFXTM_ETARGET<NFXTA_NAME, NFXTA_REVISION, NFXTA_DATA>
← NFXTM_ETARGET<NFXTA_NAME, NFXTA_REVISION, NFXTA_DATA>
← NFXTM_RULE_END<>
← NFXTM_RULE_START<>
← NFXTM_ETARGET<NFXTA_VERDICT>
← NFXTM_RULE_END<>
← NLMSG_DONE
• → CHAIN_DUMP<NFXTA_NAME>
← CHAIN_DUMP<NFXTA_RULE_START>
← CHAIN_DUMP<NFXTA_MATCH_START>
← CHAIN_DUMP<NFXTA_NAME, NFXTA_REVISION, NFXTA_DATA>
← CHAIN_DUMP<NFXTA_MATCH_END>
← CHAIN_DUMP<NFXTA_MATCH_START>
← CHAIN_DUMP<NFXTA_NAME, NFXTA_REVISION, NFXTA_DATA>
← CHAIN_DUMP<NFXTA_MATCH_END>
← CHAIN_DUMP<NFXTA_TARGET_START>
← CHAIN_DUMP<NFXTA_NAME, NFXTA_REVISION, NFXTA_DATA>
← CHAIN_DUMP<NFXTA_TARGET_END>
← CHAIN_DUMP<NFXTA_TARGET_START>
← CHAIN_DUMP<NFXTA_NAME, NFXTA_REVISION, NFXTA_DATA>
← CHAIN_DUMP<NFXTA_TARGET_END>
← CHAIN_DUMP<NFXTA_RULE_END>
← CHAIN_DUMP<NFXTA_RULE_START>
← CHAIN_DUMP<NFXTA_TARGET_START>
← CHAIN_DUMP<NFXTA_VERDICT>
← CHAIN_DUMP<NFXTA_TARGET_END>
← CHAIN_DUMP<NFXTA_RULE_END>
← NLMSG_DONE
2 General use
2.1 Socket
Xtables2 is made available through an nfnetlink socket.
Specifically, this is a Netlink socket of type NETLINK_NETFILTER,
with which messages are exchanged that are tagged having Xtables
as the subsystem.
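#include <sys/socket.h>
#include <linux/netlink.h>
struct nlmsghdr nlmsg;
/* nfnetlink sockets use the NETLINK_NETFILTER protocol */
int nf_socket = socket(AF_NETLINK, SOCK_RAW,
                       NETLINK_NETFILTER);
/* upper byte: subsystem, lower byte: Xtables message type */
nlmsg.nlmsg_type = (NFNL_SUBSYS_XTABLES << 8) | xt_msg_type;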
2.2 Message format
All messages transmitted over the Netlink socket are to have the
base struct nlmsghdr header, followed by a struct nfgenmsg header
as mandated by nfnetlink. The .nfgen_family member is always set
to NFPROTO_UNSPEC. The .version member denotes the format of the
byte stream following nfgenmsg; this is currently version 0. The
.res_id member is unused.
3 Attributes
The meaning of attributes depends upon the message and logical
nesting level in which they appear. Their type however remains
the same, such that a single Netlink attribute validation policy
object (struct nla_policy) can be used for all message types.
A table of all known attributes:
+--------+-------------------+---------------+-----------------+--------------------------------------+
| Value | Mnemonic | C type | NLA type | Notes |
+--------+-------------------+---------------+-----------------+--------------------------------------+
+--------+-------------------+---------------+-----------------+--------------------------------------+
| 1      | NFXTA_SEQNO       | unsigned int  | NLA_U32         | Section 1.5                          |
+--------+-------------------+---------------+-----------------+--------------------------------------+
| tba | NFXTA_ERRNO | int | NLA_U32 | Generic system errno (Exxx) |
+--------+-------------------+---------------+-----------------+--------------------------------------+
| ... | NFXTA_XTERRNO | int | NLA_U32 | NFXT errno (NFXTE_*) |
+--------+-------------------+---------------+-----------------+--------------------------------------+
| | NFXTA_ERRSTR | char [] | NLA_NUL_STRING | Arbitrary |
+--------+-------------------+---------------+-----------------+--------------------------------------+
| | NFXTA_USERTID | unsigned int | NLA_U32 | Arbitrary, retained verbatim |
+--------+-------------------+---------------+-----------------+--------------------------------------+
| | NFXTA_CHAIN_NAME | char [] | NLA_NUL_STRING | |
+--------+-------------------+---------------+-----------------+--------------------------------------+
| | NFXTA_RULE_IDX | unsigned int | NLA_U32 | |
+--------+-------------------+---------------+-----------------+--------------------------------------+
| | NFXTA_ACTION_IDX | unsigned int | NLA_U32 | |
+--------+-------------------+---------------+-----------------+--------------------------------------+
| | NFXTA_NAME | char [] | NLA_NUL_STRING | |
+--------+-------------------+---------------+-----------------+--------------------------------------+
| | NFXTA_REVISION | uint8_t | NLA_U32 | |
+--------+-------------------+---------------+-----------------+--------------------------------------+
| | NFXTA_HOOKNUM | unsigned int | NLA_U32 | |
+--------+-------------------+---------------+-----------------+--------------------------------------+
| | NFXTA_PRIORITY | int | NLA_U32 | |
+--------+-------------------+---------------+-----------------+--------------------------------------+
| | NFXTA_NFPROTO | uint8_t | NLA_U32 | |
+--------+-------------------+---------------+-----------------+--------------------------------------+
| | NFXTA_OFFSET | unsigned int | NLA_U32 | |
+--------+-------------------+---------------+-----------------+--------------------------------------+
| | NFXTA_LENGTH | size_t | NLA_U32 | |
+--------+-------------------+---------------+-----------------+--------------------------------------+
| | NFXTA_HOOKMASK | unsigned int | NLA_U32 | |
+--------+-------------------+---------------+-----------------+--------------------------------------+
| | NFXTA_SIZE | size_t | NLA_U32 | |
+--------+-------------------+---------------+-----------------+--------------------------------------+
| | NFXTA_NEW_NAME | char [] | NLA_NUL_STRING | |
+--------+-------------------+---------------+-----------------+--------------------------------------+
The kernel ignores attributes of type 0 during validation, so
that value is left unused.
4 Error types
+--------+---------------------+-------------------------------------------+
| Value | Mnemonic | Description |
+--------+---------------------+-------------------------------------------+
+--------+---------------------+-------------------------------------------+
| 0 | NFXTE_SUCCESS | No error |
+--------+---------------------+-------------------------------------------+
| 1 | NFXTE_CHAIN_EXIST | Chain already exists |
+--------+---------------------+-------------------------------------------+
| 2 | NFXTE_CHAIN_NOENT | Chain does not exist |
+--------+---------------------+-------------------------------------------+
| 3 | NFXTE_RULESET_LOOP | Ruleset contains a loop |
+--------+---------------------+-------------------------------------------+
| 4 | NFXTE_EXT_HOOKMASK | Rule invoked from incompatible hook |
+--------+---------------------+-------------------------------------------+
| | NFXTE_PROMO_STATUS | Promotion/demotion state already achieved |
+--------+---------------------+-------------------------------------------+
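For userspace tools, the table above maps naturally onto a strerror-like helper. The following is a sketch: values 0 to 4 are taken from the table, whereas the value 5 for NFXTE_PROMO_STATUS is an assumption, because the draft leaves it blank; nfxt_strerror is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

enum nfxt_errno {
	NFXTE_SUCCESS      = 0,
	NFXTE_CHAIN_EXIST  = 1,
	NFXTE_CHAIN_NOENT  = 2,
	NFXTE_RULESET_LOOP = 3,
	NFXTE_EXT_HOOKMASK = 4,
	NFXTE_PROMO_STATUS = 5, /* assumed; the table leaves this value blank */
};

static const char *const nfxt_errtext[] = {
	[NFXTE_SUCCESS]      = "No error",
	[NFXTE_CHAIN_EXIST]  = "Chain already exists",
	[NFXTE_CHAIN_NOENT]  = "Chain does not exist",
	[NFXTE_RULESET_LOOP] = "Ruleset contains a loop",
	[NFXTE_EXT_HOOKMASK] = "Rule invoked from incompatible hook",
	[NFXTE_PROMO_STATUS] = "Promotion/demotion state already achieved",
};

/* Map an NFXTA_XTERRNO value to its description from the table. */
const char *nfxt_strerror(int code)
{
	size_t n = sizeof(nfxt_errtext) / sizeof(nfxt_errtext[0]);

	if (code < 0 || (size_t)code >= n)
		return "Unknown Xtables2 error";
	assert(nfxt_errtext[code] != NULL); /* the table must have no holes */
	return nfxt_errtext[code];
}
```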
5 Message types
+------+-----------------------+----------------+---------------------------------------------+
| ID | Mnemonic | Dir | Notes |
+------+-----------------------+----------------+---------------------------------------------+
+------+-----------------------+----------------+---------------------------------------------+
| 0 | NFXTM_STOP | both | End of logical nesting level or transaction |
+------+-----------------------+----------------+---------------------------------------------+
| 1    | NFXTM_ERROR           | k→u            | Kills transactions (but not dumps)          |
+------+-----------------------+----------------+---------------------------------------------+
| 2    | NFXTM_ABORT           | u→k            | Abort transaction                           |
+------+-----------------------+----------------+---------------------------------------------+
| tba  | NFXTM_CHAIN_NEW       | u→k            |                                             |
+------+-----------------------+----------------+---------------------------------------------+
| ...  | NFXTM_CHAIN_DEL       | u→k            |                                             |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_CHAIN_MOVE      | u→k            |                                             |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_CHAIN_PROMOTE   | u→k            |                                             |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_CHAIN_DEMOTE    | u→k            |                                             |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_TABLE_DUMP      | u→k            | Dump start                                  |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_CHAIN_ENTRY     | both           | Nest start                                  |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_RULE_ENTRY      | both           | Nest start                                  |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_MATCH_ENTRY     | both           | Nest start                                  |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_TARGET_ENTRY    | both           | Nest start                                  |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_VERDICT_ENTRY   | both           | Nest start                                  |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_JUMP_ENTRY      | both           | Nest start                                  |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_GOTO_ENTRY      | both           | Nest start                                  |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_CONFIG_DATA     | both           | Nest start                                  |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_STATE_DATA      | both           | Nest start                                  |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_ARB_DATA        | both           | Arbitrary data                              |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_ATTR_DATA       | both           | Attribute list                              |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_CHAIN_SPLICE    | u→k            | Transaction start, nest start               |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_TABLE_REPLACE   | u→k            | Transaction start, nest start               |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_IDENTIFY        | both           | Dump start                                  |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_IDMATCH_ENTRY   | k→u            |                                             |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_IDTARGET_ENTRY  | k→u            |                                             |
+------+-----------------------+----------------+---------------------------------------------+
|      | NFXTM_EVENT           | k→u            |                                             |
+------+-----------------------+----------------+---------------------------------------------+
5.1 End of nest level / transaction commit
NFXTM_STOP is used to end a nesting level as started by, for
example, NFXTM_RULE_ENTRY.
It is also used to finish (commit) a transaction, such as with
NFXTM_TABLE_REPLACE.
Request NFXTM_STOP:
• No attributes.
Response:
• Standard Netlink ACK.
5.2 Error report
xtnetlink uses NFXTM_ERROR to report back detailed errors on
actions.
Possible attributes:
• NFXTA_ERRNO: generic error, using system-level errno codes
(ENOMEM, etc.)
• NFXTA_XTERRNO: xtnetlink error, see section 4 (Error types)
• NFXTA_ERRSTR: free-form error string provided by extensions
• NFXTA_USERTID: user token received earlier is echoed back for
reference (may be used for things like line numbers)
• NFXTA_CHAIN_NAME: name of chain whose processing caused the
error
• NFXTA_RULE_IDX: index to rule (0-based) that caused the error
• NFXTA_ACTION_IDX: index to match/target/verdict (0-based) in
the particular rule that caused the error
(RFC:) Should an outstanding transaction be terminated?
When NFXTM_ERROR is sent in an NLM_F_MULTI dump stream, an
NLMSG_DONE message will still follow.
5.3 Transaction termination
NFXTM_ABORT can be used to abort a transaction as started by, for
example, NFXTM_TABLE_REPLACE.
Request NFXTM_ABORT:
• No attributes.
Response:
• Standard Netlink ACK.
5.4 Chain creation
Request NFXTM_CHAIN_NEW:
• Attribute NFXTA_NAME: name of the new chain.
Response:
• Standard Netlink ACK, or NFXTM_ERROR:
– ENOMEM: Out of memory
– NFXTE_CHAIN_EXIST: Chain already exists
5.5 Chain deletion
Request NFXTM_CHAIN_DEL with attributes:
• NFXTA_NAME attribute carrying the name of the chain to delete
Response:
• Standard Netlink ACK, or NFXTM_ERROR:
– NFXTE_CHAIN_NOENT: Chain does not exist.
Notes:
The chain is automatically demoted.
5.6 Chain renaming
Request:
• Type: NFXTM_CHAIN_MOVE
• Attributes: NFXTA_NAME (old name), NFXTA_NEW_NAME (new chain
name).
Response:
• Standard Netlink ACK, or NFXTM_ERROR:
– NFXTE_CHAIN_NOENT: Source chain does not exist
– NFXTE_CHAIN_EXIST: Target chain already exists
5.7 Promotion to base chain
Sets the specified chain up as an entrypoint from the Netfilter
proper. (It does this by creating an appropriate nf_hook.)
Request:
• Type: NFXTM_CHAIN_PROMOTE
• Attributes: NFXTA_NAME, NFXTA_HOOKNUM (NF_INET_*/NF_ARP_*/
NF_BR_*), NFXTA_PRIORITY, NFXTA_NFPROTO (one of the NFPROTO_*
constants)
Response:
• Standard Netlink ACK, or NFXTM_ERROR:
– NFXTE_CHAIN_NOENT: The specified chain does not exist.
– NFXTE_PROMO_STATUS: Already promoted.
– NFXTE_RULESET_LOOP: There is a loop in the rule tree, which
is not allowed.
– NFXTE_EXT_HOOKMASK: One or more extensions are used from a
hook that they do not support being invoked from.
Example:
• Turn the chain named “filter/ipv6/INPUT” into the equivalent of
the classic INPUT hook in the filter table: NFXTA_NAME=“
filter/ipv6/INPUT”, NFXTA_HOOKNUM=NF_INET_LOCAL_IN (1),
NFXTA_PRIORITY=0, NFXTA_NFPROTO=NFPROTO_IPV6 (10).
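The example request could be held in a structure like the one below before being serialized into attributes. This is only a sketch: struct xt2_promote_req, its 64-byte name limit and xt2_promote_init are hypothetical, and the two DEMO_ constants just repeat the numeric values quoted in the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Numeric values as quoted in the example. */
#define DEMO_NF_INET_LOCAL_IN 1
#define DEMO_NFPROTO_IPV6     10

/* Hypothetical in-memory form of an NFXTM_CHAIN_PROMOTE request. */
struct xt2_promote_req {
	char     name[64];  /* NFXTA_NAME; the length limit is an assumption */
	uint32_t hooknum;   /* NFXTA_HOOKNUM */
	int32_t  priority;  /* NFXTA_PRIORITY */
	uint8_t  nfproto;   /* NFXTA_NFPROTO */
};

/* Returns 0 on success, -1 if @name does not fit into the request. */
int xt2_promote_init(struct xt2_promote_req *req, const char *name,
		     uint32_t hooknum, int32_t priority, uint8_t nfproto)
{
	assert(req != NULL && name != NULL);
	if (strlen(name) >= sizeof(req->name))
		return -1;
	memset(req, 0, sizeof(*req));
	strcpy(req->name, name); /* length was checked above */
	req->hooknum = hooknum;
	req->priority = priority;
	req->nfproto = nfproto;
	return 0;
}
```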
5.8 Demotion from base chain
Removes the nf_hook.
Request:
• Type: NFXTM_CHAIN_DEMOTE
• Attributes: NFXTA_NAME
Response:
• Standard Netlink ACK, or NFXTM_ERROR:
– NFXTE_CHAIN_NOENT: The specified chain does not exist.
– NFXTE_PROMO_STATUS: Already demoted.
5.9 Implementation Identification (debug)
First and foremost a debug command, and to get something
(table/chain-independent) that users can glare at (they love
doing that).
Request:
• nlmsg_type = NFXTM_IDENTIFY;
Multiple message response:
• An NFXTM_IDENTIFY message containing:
– An NFXTA_NAME attribute giving the name of the
implementation/patchset.
• Zero or more NFXTM_IDMATCH_ENTRY messages, giving
metainformation about the loaded match extensions. Each message
contains four attributes:
– An NFXTA_NAME attribute for the name of the extension.
– An NFXTA_REVISION attribute to denote the version of the
extension's parameter protocol.
– An NFXTA_SIZE attribute for the size of its per-instance data
block.
– An NFXTA_HOOKMASK attribute for the bitmap of hooks the
extensions may be used from.
• Zero or more NFXTM_IDTARGET_ENTRY messages, giving
metainformation about the loaded target extensions:
– attributes like NFXTM_IDMATCH_ENTRY.
• NLMSG_DONE message.
5.10 Rule dump
Atomic dump of entire table/ruleset, or a single chain, with or
without rules.
Request:
• nlmsg_type = NFXTM_TABLE_DUMP;
• NFXTA_NAME attribute specifying the name of the chain to dump.
Absence of attribute dumps entire table.
• NFXTA_RULE_IDX attribute specifying the particular rule
(1-based index) to dump. Absence of attribute dumps entire
chain. Use 0 to only get a chain list.
Multi Response:
• Zero or more chains, represented by the start marker message
NFXTM_CHAIN_ENTRY and the end marker NFXTM_STOP. The
NFXTM_CHAIN_ENTRY message may have NFXTA_HOOKNUM,
NFXTA_PRIORITY and NFXTA_NFPROTO attributes if it is a base
chain.
• Zero or more rules within NFXTM_CHAIN_ENTRY .. NFXTM_STOP,
represented by the start marker message NFXTM_RULE_ENTRY and
the end marker NFXTM_STOP.
• Zero or more actions within NFXTM_RULE_ENTRY .. NFXTM_STOP,
represented by the start marker message NFXTM_MATCH_ENTRY,
NFXTM_TARGET_ENTRY, NFXTM_VERDICT_ENTRY, NFXTM_JUMP_ENTRY or
NFXTM_GOTO_ENTRY and the end marker NFXTM_STOP.
• Zero or more config data messages within NFXTM_MATCH_ENTRY or
NFXTM_TARGET_ENTRY.
• Zero or more state data messages within NFXTM_MATCH_ENTRY or
NFXTM_TARGET_ENTRY.
(See section 1.4, Summary of transform, for an example.)
Errors:
• If an error occurs during a dump, an NFXTM_ERROR message is
emitted into the stream and the dump will then immediately
terminate with a standard NLMSG_DONE message. No NFXTM_STOP
messages will be emitted if the dump stopped in the middle of
a nesting level.
5.11 Table replace
Atomic replacement of an entire table/ruleset.
1. User sends NFXTM_TABLE_REPLACE request. The state is
remembered per client socket.
2. Within this transaction, the following commands operate on a
temporary table: NFXTM_CHAIN_NEW, NFXTM_CHAIN_DEL,
NFXTM_CHAIN_MOVE, NFXTM_CHAIN_SPLICE.
3. End transaction with NFXTM_STOP, or abort with NFXTM_ABORT.
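The three steps above imply a small amount of per-socket state. The sketch below models it in userspace C; the structure, the result codes and the rule that a second NFXTM_TABLE_REPLACE inside an open transaction is rejected are all assumptions made for illustration. It also shows the double role of NFXTM_STOP: it closes a nesting level while one is open and commits the transaction otherwise.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-socket transaction state. */
struct xt2_sock_state {
	int in_replace; /* 1 while a table replace transaction is open */
	int depth;      /* open nesting levels inside the transaction */
};

enum xt2_result { XT2_OK, XT2_COMMITTED, XT2_ABORTED, XT2_REJECTED };

/* NFXTM_TABLE_REPLACE: open a transaction. */
enum xt2_result xt2_begin_replace(struct xt2_sock_state *s)
{
	assert(s != NULL);
	if (s->in_replace)
		return XT2_REJECTED; /* assumption: no nested replace */
	s->in_replace = 1;
	s->depth = 0;
	return XT2_OK;
}

/* Any nest-start message inside the transaction. */
enum xt2_result xt2_nest_start(struct xt2_sock_state *s)
{
	assert(s != NULL);
	if (!s->in_replace)
		return XT2_REJECTED;
	s->depth++;
	return XT2_OK;
}

/* NFXTM_STOP: close the innermost level, or commit at top level. */
enum xt2_result xt2_stop(struct xt2_sock_state *s)
{
	assert(s != NULL);
	if (!s->in_replace)
		return XT2_REJECTED;
	if (s->depth > 0) {
		s->depth--;
		return XT2_OK;
	}
	s->in_replace = 0; /* the temporary table would be swapped in here */
	return XT2_COMMITTED;
}

/* NFXTM_ABORT: drop the temporary table and all open levels. */
enum xt2_result xt2_abort(struct xt2_sock_state *s)
{
	assert(s != NULL);
	if (!s->in_replace)
		return XT2_REJECTED;
	s->in_replace = 0;
	s->depth = 0;
	return XT2_ABORTED;
}
```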
5.12 Chain splicing (add/delete rules)
Chain splicing does a bulk deletion of zero or more consecutive
rules, followed by a bulk insertion of zero or more consecutive
rules, all done in an atomic fashion. It operates similarly to
Perl's splice function on arrays.
The user starts a transaction with a NFXTM_CHAIN_SPLICE request,
supplying the name of the chain that is to be modified in a
NFXTA_NAME attribute. xtnetlink will take the read lock on the
table to prevent a table replace operation from interfering, and
will take the write lock on the chain.
1. While in this context, higher-level transactions like
NFXTM_TABLE_REPLACE are rejected.
2. Send new rules (ordered list).
3. End transaction with NFXTM_STOP, or abort transaction entirely
with NFXTM_ABORT.
New rules:
1. Send NFXTM_RULE_NEW. Must occur within the context of
chain_splice.
2. NFXTM_STOP. This ends the current rule.
Request:
• NFXTA_NAME: Name of the chain to modify.
• NFXTA_OFFSET: Index of entry where operation should start.
• NFXTA_LENGTH: Number of entries starting from offset that
should be removed. May be zero or more.
• Zero or more rules.
Response:
• Standard ACK.
• or detailed error code.
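The splice semantics (delete NFXTA_LENGTH entries at NFXTA_OFFSET, then insert the new rules at the same position) can be illustrated on a plain array. In this sketch rules are reduced to ints, the offset is taken to be 0-based (the draft does not say), and chain_splice is a hypothetical userspace function, not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Remove @length entries starting at @offset from @rules (which holds
 * *@count entries and has room for @cap) and insert @n_new entries from
 * @new_rules in their place. Returns 0 on success, or -1 on an invalid
 * range or if the result would not fit.
 */
int chain_splice(int *rules, size_t *count, size_t cap,
		 size_t offset, size_t length,
		 const int *new_rules, size_t n_new)
{
	assert(rules != NULL && count != NULL);
	if (offset > *count || length > *count - offset)
		return -1;

	size_t tail = *count - offset - length;
	size_t new_count = *count - length + n_new;

	if (new_count > cap)
		return -1;
	/* Shift the rules behind the deleted range to their new position. */
	memmove(rules + offset + n_new, rules + offset + length,
		tail * sizeof(*rules));
	if (n_new > 0)
		memcpy(rules + offset, new_rules, n_new * sizeof(*rules));
	*count = new_count;
	return 0;
}
```

With length 0 this is a pure insertion, and with no new rules it is a pure deletion, so both -A/-I and -D style operations reduce to the same request.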
* Re: Xtables2 A7 spec draft
From: Jozsef Kadlecsik @ 2011-02-05 19:33 UTC
To: Jan Engelhardt; +Cc: Netfilter Developer Mailing List
Hi Jan,
On Wed, 2 Feb 2011, Jan Engelhardt wrote:
> I am posting the Xtables2 Netlink interface specification, draft 7
> for comments.
>
> Additionally, further documentation and toolchain around
> it is available through the project page at
>
> http://jengelh.medozas.de/projects/xtables/
>
> * User Documentation Chapter 1: Architectural Differences
> * Developer Documentation Part 1: Netlink interface (WIP)
> This is copied below to facilitate inline replies
> --8<--
>
> Netlink interface
>
> 1 Concepts
>
> This section is non-normative and should instead show the flow of
> thought and give reasons as to why the specification was
> conceived the way it is, and where the component problems are.
>
> 1.1 Nesting representation
>
> The common element in Xtables is the ruleset, represented as a
> tree structure with ordering constraints at some levels:
>
> ruleset (unordered tables)
> \__ table (unordered chains)
> | \__ chain (ordered rules)
> | | \__ rule (ordered actions)
> | | | \__ match (unordered data)
> | | | | \__ config-data
> | | | | | \__ bin params
> | | | | \__ state-data
> | | | | \__ nlattrs
> | | | \__ match...
> | | | \__ target (unordered data)
> | | | | \__ config-data
> | | | \__ target...
> | | | \__ verdict...
> | | \__ rule...
> | \__ chain...
> \__ table...
I believe the objects 'match', 'target', 'verdict' should be generalized
and unified into a single entity named 'action' (or named whatever). It
should have an attribute (better a flag attribute with a flag value) to
denote that the given action is a terminating one (terminating
target/verdict), so that the parser could check and warn/reject
unreachable actions. That way the protocol would be both simpler and more
powerful at the same time. And we could express rules like
... -m whatever -j LOG -m more-specific -j DO-SOMETHING ...
I don't like the idea of passing binary parameters at any level:
everything should be expressed in nlattrs.
> A more concrete example, here is a small ruleset, encoded into
> XML (just one of many possible representations):
>
> <table>
> <chain name="INPUT">
> <rule idx="1">
> <match acidx="1" name="hashlimit" rev="1" csize="120">
> <config-data>...</config-data>
> <state-data>...</state-data>
> </match>
> <target acidx="2" name="TOS" rev="1">
> ...
> </target>
> <verdict acidx="3" name="ACCEPT" />
> </rule>
> </chain>
> </table>
>
> There are different ways to encode such a tree structure into a
> serialized stream. In many Netlink protocols, children attributes
> are encapsulated (a. k. a. “nested”, though we will avoid this
> term to avoid double-use) and treated as a whole as a parent's
> opaque data. It cannot be told apart from normal data. (Like
> writing “<chain> <rule> ... </rule> </chain>” in
> XML.) We will call this format “Encapsulated Encoding”.
>
> To encode an attribute's length, struct nlattr only has a 16-bit
> field, which means the attribute header plus payload is limited
> to 64 KB. This is easily exceeded with the encapsulated
> encoding as chains are collected rules in a chain, for example.
> The problem is aggravated by the kernel's Netlink handler only
> allocating sk_buffs a page size worth, which leaves little room for
> extension data. In the worst case, the usable payload for
> attributes is around 3600 bytes only. In light of xt_u32's
> private data block being 1984 bytes already, that means that you
> won't be able to fit two -m u32 invocations nested in a single
> rule into a dump.
The pagesize limit is a real problem. :-(( I don't see how we could avoid
having to split a single rule into multiple messages when it simply
does not fit into a single one.
> Certain voices in the community call for the obsoletion of such
> data blobs and replace them by Netlink attributes; there are no
> objections to doing so. However, the problem of size-limited
> sk_buffs applies to opaque data of any kind, and Netlink
> attributes fall within that.
I'm among the ones who object to data blobs.
> The Xtables2 Netlink protocol encodes each node of information as
> a standalone attribute, to be called Flat Encoding, that is
> appended (a. k. a. “chained”) to the data stream. By avoiding
> encapsulated attributes, it is possible to split messages at much
> finer levels, and provides for attributes that happen to use
> opaque data with a maximally-sized buffer.
Even with encapsulation, the messages can be split at any level.
> 1.2 Nest markers<sub:Nest-markers>
>
> Since Netlink messages do have a 32-bit quantity to store the
> message length, rulesets of roughly up to 4 GB are possible,
> which is currently regarded as sufficient. The largest (while
> still being meaningful) rulesets seen to date in the industry
> weighed in at approximately 150 MB.
>
> Whereas encapsulated attribute encoding automatically provided
> for boundaries, this is realized using dummy attributes in the
> chained approach. The start of a nesting level can be implicitly
> represented by the presence of the attribute that would have
> otherwise been used for encapsulated nesting. For declaring an
> end of a nest level, an extra attribute is needed:
>
> * “chain { rule; rule; ... }” ⇔ CHAIN RULE RULE ... STOP
With encapsulation, there would be no need for such an extra STOP attribute -
except that we may have to split the encapsulated attributes into multiple
messages, and thus the STOP attribute/marker is needed.
> 1.3 Attribute limitations in nfnetlink
>
> Netlink, being just a base protocol, does not specify what comes
> after the nlmsghdr, or how it is ordered. This is left up to the
> subprotocols based on Netlink. nfnetlink has two effective
> shortcomings (due to its parser) that shall be held in mind:
>
> * Attribute ordering is ignored and lost
Even if netlink does not state that attribute ordering is kept, it does
not state either that attributes may be reordered. Netlink as a transport
protocol does not care about the attributes. So we can say that for
xtables2, the attribute order in the netlink messages is fixed, period.
> * No support for more than one attribute with the same type
> within a message
Oh no, you can put as many attributes with the same type as you like (and
fit) into a single nested attribute!
> struct nlattr **tb;
> nla_for_each_attr(attr, head, ...)
> tb[nla_type(attr)] = attr;
>
> This kills the idea of being able to do, for example, a table
> replace, in a single Netlink request message. This is like having
> to split an XML file at every tag simply because two tags can
> carry the same attribute. So Netlink requests have to be broken
> down into many many tiny parts and extra state has to be kept
> around in the kernel.
>
> put_header(msg, NFXTM_TABLE_REPLACE);
> foreach (rule)
> put(msg, rule);
> send(sock, msg);
And so the simple processing above can be applied.
> will become
>
> put_header(msg, NFXTM_TABLE_REPLACE);
> send(sock, msg);
> foreach (rule) {
> clean(msg);
> put_header(msg, NFXTM_RULE_DATA);
> put(msg, rule);
> send(sock, msg);
> }
> clean(msg);
> put_header(msg, NFXTM_COMMIT);
> send(sock, msg);
>
> or worse. In other words, the fact that the kernel side will use
> a temporary table (an implementation detail) will be exposed to
> userspace, which is bad too.
> 1.4 Summary of transform<sub:Summary-of-transform>
>
> Essentially there is a 1:1 transform on the XML-like tree shown
> above, to:
>
> NFXTM_CHAIN_ENTRY<name=INPUT,usertid=1>
> NFXTM_RULE_ENTRY<idx=1,usertid=1>
> NFXTM_MATCH_ENTRY<acidx=1,name=hashlimit,rev=1,usertid=2>
> NFXTM_CONFIG_DATA
> NFXTM_ARB_DATA<whatever>
> NFXTM_ARB_DATA<more arbitrary data>
> NFXTM_STOP
> NFXTM_STATE_DATA
> NFXTM_ATTR_DATA<nlattrs>
> NFXTM_ATTR_DATA<more nlattrs>
> NFXTM_STOP
> NFXTM_STOP
> NFXTM_TARGET_ENTRY<acidx=2,name=TOS,rev=0,usertid=3>
> ...
> NFXTM_STOP
> NFXTM_VERDICT_ENTRY<acidx=3,name=ACCEPT,usertid=3>
> NFXTM_STOP
> NFXTM_STOP
> NFXTM_STOP
>
> 1.5 Extra sequence numbers<sub:Extra-sequence-numbers>
>
> Netlink also does not specify any message ordering, though it
> does provide an nlmsg_seq field with which message order can at
> least be determined. The problem is that nothing specifies what
> nlmsg_seq should be in reply messages. It is assumed that the
> sequence number is linked, i. e. that a reply's number should be
> the same as the request's number, to do message matching (vague
> hint by netlink(7) manpage).
Nothing specifies what nlmsg_seq should be in, so it's up to the
application, i.e. xtables2, how it's used...
> Even if that were decidedly so, that brings along a problem. In
> NLM_F_MULTI-style dumps, all messages would have the same
> nlmsg_seq. To counter this, multi messages will have an
> NFXT-specific sequence counter (NFXTA_SEQNO) in addition,
> especially since ordering is so much more crucial in Xtables than
> it is in other parts of networking.
...but yes, for dumping an additional attribute is required to make sure
the ordering is kept. Actually, two attributes: one at the rule level, and
one at the "action" level in the given rule.
> 1.6 Improved granularity error reporting
>
> Xtables extensions as of Linux 2.6.37 can only return system
> error codes back to userspace in case there is a problem. The
> most common occurrences are, for example, ENOMEM (“Memory
> allocation failure” / “Out of memory”), and the dreaded EINVAL
> (“Invalid argument”). Best practices at the moment are to printk a
> string to the kernel log for further information detailing the
> circumstances about the cause of EINVAL. In the light of this
> overload of EINVAL, an improved error reporting scheme is sought.
> (Other networking subsystems also suffer from this problem.)
>
> By suggestion of Jozsef Kadlecsik, the Xtables2 protocol reports
> three kinds of errors:
>
> * General/standard (integer) error codes, where there is no point
> in specifying (or no way to specify) the nature of the error exactly.
> As in the example, ENOMEM: it is needless to report which new data
> field could not be allocated.
>
> * General Xtables2 error codes (largely replaces EINVAL sites) in
> integer form, similar to errno. Use cases include:
>
> - chain for a requested operation does not exist
>
> - an extension is used from a hook it is not supposed to be
>
> * Free-form string. Standalone, or in addition to the above.
> It is impossible to provision error numbers for extensions,
> especially those that are out-of-tree. The problems that
> forcing a component to reuse another component's error code
> space can be seen in the overuse of EINVAL. We are aware that
> raw strings in kernel modules can hinder internationalization,
> but it is seen as the better choice over awkward error codes
> that convey nothing. It is also expected that strings do not
> change that often.
>
> The three error types will be conveyed by three distinct
> attributes: NFXTA_ERRNO (generic error codes), NFXTA_XTERRNO (xt2
> error codes), and NFXTA_ERRSTR (free-form string).
I hammer the issue further :-). With properly separated error number
domains, the three types can be expressed in a single error attribute. Just
a second attribute is required to carry the identifier of the action in
the rule to which the third type of error code belongs.
I'm still not convinced about the usefulness of the error string. The
kernel part is always paired with the userspace part. The developer
knows exactly which kinds of errors can be sent back to userspace and
can thus provide the textual decoding. As netlink sends back the original
message in the error message, userspace can fully decode every
attribute (since it itself encoded it) too.
If a decoding for an error code is not provided, that's a bug and thus must
be fixed.
> Error pointer
>
> Once a table/chain splice request has been finalized,
> xt_check_{match,target} is run, which can return:
>
> * chain name, rule index, match/target index, NFXTE_*/custom
> string
>
> Line number
>
> I noticed Jozsef has added a line number attribute in ipset
> version 5 to facilitate locating errors for users. For its
> apparent value, such attribute is also specified for xtnetlink:
>
> A request message can contain a “ping attribute”, NFXTA_USERTID,
> which xtnetlink may keep track of and which may be reported back
> verbatim in case an error occurred. It may be used to represent
> the source line, or any other number.
The line number is a very good identifier for a rule.
> * For the tree example in section 1, the ruleset file would be
> “-A INPUT \
> -m hashlimit ... \
> -j TOS ... -j ACCEPT”.
>
> 1.7 Multi-type responses
>
> Using multi-type responses provides for a seemingly shorter reply
> (in at least one case) than not doing so:
>
> * ⇒ NFXTM_CHAIN_DUMP<NFXTA_NAME>
> ⇐ NFXTM_RULE_START<>
> ⇐ NFXTM_EMATCH<NFXTA_NAME, NFXTA_REVISION, NFXTA_DATA>
> ⇐ NFXTM_EMATCH<NFXTA_NAME, NFXTA_REVISION, NFXTA_DATA>
> ⇐ NFXTM_ETARGET<NFXTA_NAME, NFXTA_REVISION, NFXTA_DATA>
> ⇐ NFXTM_ETARGET<NFXTA_NAME, NFXTA_REVISION, NFXTA_DATA>
> ⇐ NFXTM_RULE_END<>
> ⇐ NFXTM_RULE_START<>
> ⇐ NFXTM_ETARGET<NFXTA_VERDICT>
> ⇐ NFXTM_RULE_END<>
> ⇐ NLMSG_DONE
>
> * ⇒ CHAIN_DUMP<NFXTA_NAME>
> ⇐ CHAIN_DUMP<NFXTA_RULE_START>
> ⇐ CHAIN_DUMP<NFXTA_MATCH_START>
> ⇐ CHAIN_DUMP<NFXTA_NAME, NFXTA_REVISION, NFXTA_DATA>
> ⇐ CHAIN_DUMP<NFXTA_MATCH_END>
> ⇐ CHAIN_DUMP<NFXTA_MATCH_START>
> ⇐ CHAIN_DUMP<NFXTA_NAME, NFXTA_REVISION, NFXTA_DATA>
> ⇐ CHAIN_DUMP<NFXTA_MATCH_END>
> ⇐ CHAIN_DUMP<NFXTA_TARGET_START>
> ⇐ CHAIN_DUMP<NFXTA_NAME, NFXTA_REVISION, NFXTA_DATA>
> ⇐ CHAIN_DUMP<NFXTA_TARGET_END>
> ⇐ CHAIN_DUMP<NFXTA_TARGET_START>
> ⇐ CHAIN_DUMP<NFXTA_NAME, NFXTA_REVISION, NFXTA_DATA>
> ⇐ CHAIN_DUMP<NFXTA_TARGET_END>
> ⇐ CHAIN_DUMP<NFXTA_RULE_END>
> ⇐ CHAIN_DUMP<NFXTA_RULE_START>
> ⇐ CHAIN_DUMP<NFXTA_TARGET_START>
> ⇐ CHAIN_DUMP<NFXTA_VERDICT>
> ⇐ CHAIN_DUMP<NFXTA_TARGET_END>
> ⇐ CHAIN_DUMP<NFXTA_RULE_END>
> ⇐ NLMSG_DONE
>
> 2 General use
>
> 2.1 Socket
>
> Xtables2 is made available through an nfnetlink socket.
> Specifically, this is a Netlink socket of type NETLINK_NETFILTER,
> with which messages are exchanged that are tagged having Xtables
> as the subsystem.
>
> #include <sys/socket.h>
> #include <linux/netlink.h>
>
> struct nlmsghdr nlmsg;
> int nf_socket = socket(AF_NETLINK, SOCK_RAW,
> NETLINK_NETFILTER);
> nlmsg.nlmsg_type = (NFNL_SUBSYS_XTABLES << 8) | xt_msg_type;
>
> 2.2 Message format
>
> All messages transmitted over the Netlink socket are to have the
> base struct nlmsghdr header, followed by a struct nfgenmsg header
> as mandated by nfnetlink. The .nfgen_family member is always set
> to NFPROTO_UNSPEC. The .version member denotes the format of the
> byte stream following nfgenmsg; this is currently version 0. The
> .res_id member is unused.
>
> 3 Attributes
>
> The meaning of attributes depends upon the message and logical
> nesting level in which they appear. Their type however remains
> the same, such that a single Netlink attribute validation policy
> object (struct nla_policy) can be used for all message types.
>
> A table of all known attributes:
[...]
Maybe it was just not worded explicitly in the specification, but all
attribute types which are affected should be sent in network byte order.
Best regards,
Jozsef
-
E-mail : kadlec@blackhole.kfki.hu, kadlec@mail.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
H-1525 Budapest 114, POB. 49, Hungary
* Re: Xtables2 A7 spec draft
2011-02-05 19:33 ` Jozsef Kadlecsik
@ 2011-02-05 21:38 ` Jan Engelhardt
2011-02-06 11:43 ` Jozsef Kadlecsik
0 siblings, 1 reply; 6+ messages in thread
From: Jan Engelhardt @ 2011-02-05 21:38 UTC (permalink / raw)
To: Jozsef Kadlecsik; +Cc: Netfilter Developer Mailing List
On Saturday 2011-02-05 20:33, Jozsef Kadlecsik wrote:
>> 1.1 Nesting representation
>>
>> The common element in Xtables is the ruleset, represented as a
>> tree structure with ordering constraints at some levels:
>>
>> ruleset (unordered tables)
>> \__ table (unordered chains)
>> | \__ chain (ordered rules)
>> | | \__ rule (ordered actions)
>> | | | \__ match (unordered data)
>> | | | | \__ config-data
>> | | | | | \__ bin params
>> | | | | \__ state-data
>> | | | | \__ nlattrs
>> | | | \__ match...
>> | | | \__ target (unordered data)
>> | | | | \__ config-data
>> | | | \__ target...
>> | | | \__ verdict...
>> | | \__ rule...
>> | \__ chain...
>> \__ table...
>
>I believe the objects 'match', 'target', 'verdict' should be generalized
>and unified into a single entity named 'action' (or named whatever).
That is why there are already traces of the word 'action'
in the kernel (e.g. 'struct xt_action_param').
>I don't like the idea of passing binary parameters at any level:
>everything should be expressed in nlattrs.
>
>> Certain voices in the community call for the obsoletion of such
>> data blobs and replace them by Netlink attributes; there are no
>> objections to doing so. However, the problem of size-limited
>> sk_buffs applies to opaque data of any kind, and Netlink
>> attributes fall within that.
>
>I'm among the ones who object to data blobs.
I completely agree with many of your points, but my strategy shall be
clear: the first working code dump is _only_ supposed to 1. break up
the table blob, 2. do NL transport with the preexisting per-extension
blobs.
I would hate having to come up with a "perfect" solution from the
start. That won't work, for the following reasons: 1. it makes the
task look bigger, 2. the stream of work is seemingly never-ending, both
of which increase the chance of giving up prematurely. And that I
absolutely want to avoid. I suppose none of the maintainers would
want to review a 300-patch series at once either.
Of course, the new subsystem would only be marked stable once all
desires have been met. Something like how btrfs was merged.
>> To encode an attribute's length, struct nlattr only has a 16-bit
>> field, which means the attribute header plus payload is limited
>> to 64 KB. This is easily exceeded with the encapsulated
>> encoding, as a chain encapsulates all of its rules, for example.
>> The problem is aggravated by the kernel's Netlink handler only
>> allocating sk_buffs a page size worth, which leaves little room for
>> extension data. In the worst case, the usable payload for
>> attributes is around 3600 bytes only. In light of xt_u32's
>> private data block being 1984 bytes already, that means that you
>> won't be able to fit two -m u32 invocations nested in a single
>> rule into a dump.
>
>The pagesize limit is a real problem. :-(( I don't see how we could avoid
>having to split a single rule into multiple messages when it simply
>does not fit into a single one.
We will have to live with it, because when transferring from
kernel->user, other methods of transportation (such as a character
device) would run into the same limitation (it would be limited
by the size of the buffer passed to read(2)).
>> The Xtables2 Netlink protocol encodes each node of information as
>> a standalone attribute, to be called Flat Encoding, that is
>> appended (a. k. a. “chained”) to the data stream. By avoiding
>> encapsulated attributes, it is possible to split messages at much
>> finer levels, and provides for attributes that happen to use
>> opaque data with a maximally-sized buffer.
>
>Even with encapsulation, the messages can be split at any level.
I fear that won't work out so easily. Consider a Netlink message "msg
{ u32_params { atom1; atom2; ...; atomN; }}" with u32_params being an
NLA_F_NESTED. You could split that across messages as, for example,
"msg { u32_params { atom1; atom2; } } msg { u32_params { atom3; ...
atomN; }}", but you would have to repeat container headers, i.e.
u32_params. Which, given a big enough nesting level means that the
2nd message's space is used up by containers again.
If an analogy is needed: It is a bit like TCP segmentation vs. IP
fragmentation. In the former, there is one TCP hdr per message, in
the latter there is not.
>> Whereas encapsulated attribute encoding automatically provided
>> for boundaries, this is realized using dummy attributes in the
>> chained approach. The start of a nesting level can be implicitly
>> represented by the presence of the attribute that would have
>> otherwise been used for encapsulated nesting. For declaring an
>> end of a nest level, an extra attribute is needed:
>>
>> * “chain { rule; rule; ... }” ⇔ CHAIN RULE RULE ... STOP
>
>With encapsulation, there would be no need for such an extra STOP attribute -
>except that we may have to split the encapsulated attributes into multiple
>messages, and thus the STOP attribute/marker is needed.
Can you give an example of potential messages? The current spec
already lists
NFXTM_STATE_DATA
NFXTM_ATTR_DATA<nlattrs>
NFXTM_ATTR_DATA<more nlattrs>
NFXTM_STOP
>> 1.3 Attribute limitations in nfnetlink
>>
>> Netlink, being just a base protocol, does not specify what comes
>> after the nlmsghdr, or how it is ordered. This is left up to the
>> subprotocols based on Netlink. nfnetlink has two effective
>> shortcomings (due to its parser) that shall be held in mind:
>>
>> * Attribute ordering is ignored and lost
>
>Even if netlink does not state that attribute ordering is kept, it does
>not state either that attributes may be reordered. Netlink as a transport
>protocol does not care about the attributes.
Indeed Netlink is fine. The beef is with nfnetlink, which, due to its
use of "struct nlattr *tb[]" basically forfeits attribute ordering.
>So we can say that for xtables2, the attribute order in the netlink
>messages is fixed, period.
Yeah, but Pablo refused to accept patches which don't use nfnetlink,
or which rely on attribute order.
>> * No support for more than one attribute with the same type
>> within a message
>
>Oh no, you can put as many attributes with the same type as you like (and
>fit) into a single nested attribute!
Not just nested attributes. Attributes with the same type can be put
anywhere as long as you don't use a parser that utilizes the "struct
nlattr *tb[indexed_by_attr_type]" scheme. nfnetlink does use tb
however.
Encapsulating all the attrs in a nested attribute just to work around
nfnetlink's use of tb[] would beg the question of why one is using
nfnetlink in the first place then.
>> Even if that were decidedly so, that brings along a problem. In
>> NLM_F_MULTI-style dumps, all messages would have the same
>> nlmsg_seq. To counter this, multi messages will have an
>> NFXT-specific sequence counter (NFXTA_SEQNO) in addition,
>> especially since ordering is so much more crucial in Xtables than
>> it is in other parts of networking.
>
>...but yes, for dumping an additional attribute is required to make sure
>the ordering is kept. Actually, two attributes: one at the rule level, and
>one at the "action" level in the given rule.
NFXTM_TABLE_DUMP<nlmsg_seqno=7> would yield:
NFXTM_CHAIN_ENTRY<nlmsg_seqno=7,nfxt_seqno=0, name=INPUT,usertid=1>
NFXTM_RULE_ENTRY<nlmsg_seqno=7,nfxt_seqno=1, idx=1,usertid=1>
NFXTM_MATCH_ENTRY<(7,2), acidx=1,name=hashlimit,rev=1,usertid=2>
NFXTM_CONFIG_DATA<(7,3)>
NFXTM_ARB_DATA<(7,4) custom data>
NFXTM_ARB_DATA<(7,5) more custom data>
NFXTM_STOP<(7,6)>
NFXTM_STATE_DATA<(7,7)>
NFXTM_ATTR_DATA<(7,8) nlattrs>
NFXTM_ATTR_DATA<(7,9) more nlattrs>
NFXTM_STOP<(7,10)>
NFXTM_STOP<(7,11)>
NFXTM_TARGET_ENTRY<(7,12), acidx=2,name=TOS,rev=0,usertid=3>
...
NFXTM_STOP<(7,95)>
NFXTM_VERDICT_ENTRY<(7,96), acidx=3,name=ACCEPT,usertid=3>
NFXTM_STOP<(7,97)>
NFXTM_STOP<(7,98)>
NFXTM_STOP<(7,99)>
So I think I am fine with one extra seqno (NFXTA_SEQNO).
>> 1.6 Improved granularity error reporting
>>
>> * General/standard (integer) error codes
>> * General Xtables2 error codes (largely replaces EINVAL sites)
>> * Free-form string. Standalone, or in addition to the above.
>> It is impossible to provision error numbers for extensions,
>> especially those that are out-of-tree.
>>
>> The three error types will be conveyed by three distinct
>> attributes: NFXTA_ERRNO (generic error codes), NFXTA_XTERRNO (xt2
>> error codes), and NFXTA_ERRSTR (free-form string).
>
>I hammer the issue further :-). With properly separated error number
>domains, the three types can be expressed in a single error attribute.
That would mean that NFXTE_* codes would have to start at -4096
and go from there. That could work, but it does not feel right.
Currently the kernel has pointer values (-4095U)..(-1U) reserved for
error codes - no object will ever reside at those virtaddrs. Should
the kernel ever have a need for more than 4095 system errno codes,
the virtaddr limit of 0xfffff000 for mappings could simply be changed
to, say, 0xffff0000. But then you would run into the problem that
NFXTE error values suddenly overlap with system error codes.
Thus, keeping system error codes and NFXTE error codes separate
seems a sensible thing to do.
>I'm still not convinced about the usefulness of the error string.
Extensions shipped with the kernel are already provisioned for; the
error string was really only meant for extensions living outside the
kernel (those just can't be ignored).
I guess per-extension error codes are possible. (That just came to
mind.)
>As netlink sends back the original
>message in the error message, the userspace can fully decode every
>attribute (since it itself encoded it) too.
Generally just the original message header - not the entire message.
But in synchronous operations - in other words, most cases -
even that is not necessary: because we know what we constructed,
we don't need to rely on the replica inside the error message.
* Re: Xtables2 A7 spec draft
2011-02-05 21:38 ` Jan Engelhardt
@ 2011-02-06 11:43 ` Jozsef Kadlecsik
0 siblings, 0 replies; 6+ messages in thread
From: Jozsef Kadlecsik @ 2011-02-06 11:43 UTC (permalink / raw)
To: Jan Engelhardt; +Cc: Netfilter Developer Mailing List
On Sat, 5 Feb 2011, Jan Engelhardt wrote:
> On Saturday 2011-02-05 20:33, Jozsef Kadlecsik wrote:
> >> 1.1 Nesting representation
> >>
> >> The common element in Xtables is the ruleset, represented as a
> >> tree structure with ordering constraints at some levels:
> >>
> >> ruleset (unordered tables)
> >> \__ table (unordered chains)
> >> | \__ chain (ordered rules)
> >> | | \__ rule (ordered actions)
> >> | | | \__ match (unordered data)
> >> | | | | \__ config-data
> >> | | | | | \__ bin params
> >> | | | | \__ state-data
> >> | | | | \__ nlattrs
> >> | | | \__ match...
> >> | | | \__ target (unordered data)
> >> | | | | \__ config-data
> >> | | | \__ target...
> >> | | | \__ verdict...
> >> | | \__ rule...
> >> | \__ chain...
> >> \__ table...
> >
> >I believe the objects 'match', 'target', 'verdict' should be generalized
> >and unified into a single entity named 'action' (or named whatever).
>
> That is why there are already traces of the word 'action'
> in the kernel (e.g. 'struct xt_action_param').
Then I was misled by NFXTM_MATCH|TARGET|VERDICT_ENTRY. So those are there
for backward compatibility reasons only.
> >I don't like the idea of passing binary parameters at any level:
> >everything should be expressed in nlattrs.
> >
> >> Certain voices in the community call for the obsoletion of such
> >> data blobs and replace them by Netlink attributes; there are no
> >> objections to doing so. However, the problem of size-limited
> >> sk_buffs applies to opaque data of any kind, and Netlink
> >> attributes fall within that.
> >
> >I'm among the ones who object to data blobs.
>
> I completely agree with many of your points, but my strategy shall be
> clear: the first working code dump is _only_ supposed to 1. break up
> the table blob, 2. do NL transport with the preexisting per-extension
> blobs.
That's a good plan, indeed. With it the task is split into more easily
manageable parts.
> >> The Xtables2 Netlink protocol encodes each node of information as
> >> a standalone attribute, to be called Flat Encoding, that is
> >> appended (a. k. a. “chained”) to the data stream. By avoiding
> >> encapsulated attributes, it is possible to split messages at much
> >> finer levels, and provides for attributes that happen to use
> >> opaque data with a maximally-sized buffer.
> >
> >Even with encapsulation, the messages can be split at any level.
>
> I fear that won't work out so easily. Consider a Netlink message "msg
> { u32_params { atom1; atom2; ...; atomN; }}" with u32_params being an
> NLA_F_NESTED. You could split that across messages as, for example,
> "msg { u32_params { atom1; atom2; } } msg { u32_params { atom3; ...
> atomN; }}", but you would have to repeat container headers, i.e.
> u32_params. Which, given a big enough nesting level means that the
> 2nd message's space is used up by containers again.
The levels of nesting are well defined in xtables2: table, chain, rule,
action, data. It doesn't look like a big burden. I don't count the nesting
levels of the different data containers, because those are required
anyway.
> If an analogy is needed: It is a bit like TCP segmentation vs. IP
> fragmentation. In the former, there is one TCP hdr per message, in
> the latter there is not.
[Some regard IP fragmentation as a design mistake. Making it possible in
IPv6 was actually a sin.]
> >> Whereas encapsulated attribute encoding automatically provided
> >> for boundaries, this is realized using dummy attributes in the
> >> chained approach. The start of a nesting level can be implicitly
> >> represented by the presence of the attribute that would have
> >> otherwise been used for encapsulated nesting. For declaring an
> >> end of a nest level, an extra attribute is needed:
> >>
> >> * “chain { rule; rule; ... }” ⇔ CHAIN RULE RULE ... STOP
> >
> >With encapsulation, there would be no need for such an extra STOP attribute -
> >except that we may have to split the encapsulated attributes into multiple
> >messages, and thus the STOP attribute/marker is needed.
>
> Can you give an example of potential messages? The current spec
> already lists
>
> NFXTM_STATE_DATA
> NFXTM_ATTR_DATA<nlattrs>
> NFXTM_ATTR_DATA<more nlattrs>
> NFXTM_STOP
In my opinion the basic atomic element is a rule and our main issue is how
to split a rule into multiple messages. It could be expressed with the
NFXTM_STOP attribute, but I'd prefer an attribute flag value:
NFXTM_ACTION_ENTRY
NFXTM_ACTION_FLAGS (MATCH|TARGET|VERDICT, COMPLETE)
NFXTM_CONFIG_DATA
NFXTM_ARB_DATA
...
NFXTM_STATE_DATA
NFXTM_ATTR_DATA
...
If the action entry is not flagged as complete, expect messages with
additional config, state data entries. The config and state data are
unordered, so there's no ordering issue here.
> >> 1.3 Attribute limitations in nfnetlink
> >>
> >> Netlink, being just a base protocol, does not specify what comes
> >> after the nlmsghdr, or how it is ordered. This is left up to the
> >> subprotocols based on Netlink. nfnetlink has two effective
> >> shortcomings (due to its parser) that shall be held in mind:
> >>
> >> * Attribute ordering is ignored and lost
> >
> >Even if netlink does not state that attribute ordering is kept, it does
> >not state either that attributes may be reordered. Netlink as a transport
> >protocol does not care about the attributes.
>
> Indeed Netlink is fine. The beef is with nfnetlink, which, due to its
> use of "struct nlattr *tb[]" basically forfeits attribute ordering.
>
> >So we can say that for xtables2, the attribute order in the netlink
> >messages is fixed, period.
>
> Yeah, but Pablo refused to accept patches which don't use nfnetlink,
> or which rely on attribute order.
>
> >> * No support for more than one attribute with the same type
> >> within a message
> >
> >Oh no, you can put as many attributes with the same type as you like (and
> >fit) into a single nested attribute!
>
> Not just nested attributes. Attributes with the same type can be put
> anywhere as long as you don't use a parser that utilizes the "struct
> nlattr *tb[indexed_by_attr_type]" scheme. nfnetlink does use tb
> however.
Nfnetlink parses the attributes at the toplevel only. And at toplevel you
don't rely on ordered attributes: tables, chains are unordered.
So you don't need to add anything to netlink/nfnetlink: you can use
ordered attributes by nesting them, at the level where ordering is
required.
> Encapsulating all the attrs in a nested attribute just to work around
> nfnetlink's use of tb[] would beg the question of why one is using
> nfnetlink in the first place then.
>
> >> Even if that were decidedly so, that brings along a problem. In
> >> NLM_F_MULTI-style dumps, all messages would have the same
> >> nlmsg_seq. To counter this, multi messages will have an
> >> NFXT-specific sequence counter (NFXTA_SEQNO) in addition,
> >> especially since ordering is so much more crucial in Xtables than
> >> it is in other parts of networking.
> >
> >...but yes, for dumping an additional attribute is required to make sure
> >the ordering is kept. Actually, two attributes: one at the rule level, and
> >one at the "action" level in the given rule.
>
> NFXTM_TABLE_DUMP<nlmsg_seqno=7> would yield:
>
> NFXTM_CHAIN_ENTRY<nlmsg_seqno=7,nfxt_seqno=0, name=INPUT,usertid=1>
> NFXTM_RULE_ENTRY<nlmsg_seqno=7,nfxt_seqno=1, idx=1,usertid=1>
> NFXTM_MATCH_ENTRY<(7,2), acidx=1,name=hashlimit,rev=1,usertid=2>
> NFXTM_CONFIG_DATA<(7,3)>
> NFXTM_ARB_DATA<(7,4) custom data>
> NFXTM_ARB_DATA<(7,5) more custom data>
> NFXTM_STOP<(7,6)>
> NFXTM_STATE_DATA<(7,7)>
> NFXTM_ATTR_DATA<(7,8) nlattrs>
> NFXTM_ATTR_DATA<(7,9) more nlattrs>
> NFXTM_STOP<(7,10)>
> NFXTM_STOP<(7,11)>
> NFXTM_TARGET_ENTRY<(7,12), acidx=2,name=TOS,rev=0,usertid=3>
> ...
> NFXTM_STOP<(7,95)>
> NFXTM_VERDICT_ENTRY<(7,96), acidx=3,name=ACCEPT,usertid=3>
> NFXTM_STOP<(7,97)>
> NFXTM_STOP<(7,98)>
> NFXTM_STOP<(7,99)>
>
> So I think I am fine with one extra seqno (NFXTA_SEQNO).
I regard NFXTA_ACTION_IDX as a second attribute besides NFXTA_SEQNO, which
is needed to reconstruct the proper order when receiving a full rule in
multiple messages.
> >> 1.6 Improved granularity error reporting
> >>
> >> * General/standard (integer) error codes
> >> * General Xtables2 error codes (largely replaces EINVAL sites)
> >> * Free-form string. Standalone, or in addition to the above.
> >> It is impossible to provision error numbers for extensions,
> >> especially those that are out-of-tree.
> >>
> >> The three error types will be conveyed by three distinct
> >> attributes: NFXTA_ERRNO (generic error codes), NFXTA_XTERRNO (xt2
> >> error codes), and NFXTA_ERRSTR (free-form string).
> >
> >I hammer the issue further :-). With properly separated error number
> >domains, the three types can be expressed in a single error attribute.
>
> That would mean that NFXTE_* codes would have to start at -4096
> and go from there. That could work, but it does not feel right.
>
> Currently the kernel has pointer values (-4095U)..(-1U) reserved for
> error codes - no object will ever reside at those virtaddrs. Should
> the kernel ever have a need for more than 4095 system errno codes,
> the virtaddr limit of 0xfffff000 for mappings could simply be changed
> to, say, 0xffff0000. But then you would run into the problem that
> NFXTE error values suddenly overlap with system error codes.
> Thus, keeping system error codes and NFXTE error codes separate
> seems a sensible thing to do.
The highest system error code currently is 132. I think we have got plenty
of time to exhaust the rest and overflow 4095 :-).
> >I'm still not convinced about the usefulness of the error string.
>
> Extensions shipped with the kernel are already provisioned for; the
> error string was really only meant for extensions living outside the
> kernel (those just can't be ignored).
>
> I guess per-extension error codes are possible. (That just came to
> mind.)
>
> >As netlink sends back the original
> >message in the error message, the userspace can fully decode every
> >attribute (since it itself encoded it) too.
>
> Generally just the original message header - not the entire message.
> But in synchronous operations - in other words, most cases -
> even that is not necessary: because we know what we constructed,
> we don't need to rely on the replica inside the error message.
Netlink sends back the entire message, not just the header. (Unless you
handle error messages manually and force netlink not to send them.)
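
For readers who want to check this against the wire format, here is a minimal Python sketch that works on simulated bytes only (no socket is opened). It assumes the usual layout of an NLMSG_ERROR reply: the reply's own nlmsghdr, a signed 32-bit error code, then the echoed request. The request type 0x1234 and the attribute bytes are made up. As far as I know, a plain ACK (error 0) echoes just the request header, which may explain the differing recollections above.

```python
import struct

NLMSG_ERROR = 0x2
NLMSGHDR = struct.Struct("=IHHII")  # nlmsg_len, type, flags, seq, pid
ERRCODE = struct.Struct("=i")       # nlmsgerr.error, negative errno


def build_msg(msg_type, flags, seq, pid, payload):
    """Serialize one Netlink message (payload must already be aligned)."""
    total = NLMSGHDR.size + len(payload)
    return NLMSGHDR.pack(total, msg_type, flags, seq, pid) + payload


def parse_error(reply):
    """Return (error, echoed_bytes, echoed_is_complete) for an error reply."""
    length, msg_type, _flags, _seq, _pid = NLMSGHDR.unpack_from(reply, 0)
    if msg_type != NLMSG_ERROR:
        raise ValueError("not an NLMSG_ERROR message")
    (error,) = ERRCODE.unpack_from(reply, NLMSGHDR.size)
    echoed = reply[NLMSGHDR.size + ERRCODE.size:length]
    # The echoed header states how long the original request was.
    original_len = NLMSGHDR.unpack_from(echoed, 0)[0]
    return error, echoed, len(echoed) >= original_len


# Hypothetical request: type 0x1234, NLM_F_REQUEST, one 8-byte attribute.
request = build_msg(0x1234, 0x1, 7, 0, bytes([8, 0, 1, 0, 42, 0, 0, 0]))
# Simulated kernel reply carrying -EINVAL (-22) and the whole request.
reply = build_msg(NLMSG_ERROR, 0, 7, 0, ERRCODE.pack(-22) + request)

error, echoed, complete = parse_error(reply)
print(error, complete, echoed == request)
```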
Best regards,
Jozsef
-
E-mail : kadlec@blackhole.kfki.hu, kadlec@mail.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
H-1525 Budapest 114, POB. 49, Hungary
* Re: Xtables2 A7 spec draft
2011-02-02 22:04 Xtables2 A7 spec draft Jan Engelhardt
2011-02-05 19:33 ` Jozsef Kadlecsik
@ 2011-02-07 20:50 ` James Nurmi
2011-02-07 21:45 ` Jan Engelhardt
1 sibling, 1 reply; 6+ messages in thread
From: James Nurmi @ 2011-02-07 20:50 UTC (permalink / raw)
To: Jan Engelhardt; +Cc: Netfilter Developer Mailing List
(inline)
Comments are made as maintainer of GoNetlink, a 'not-C' language;
disregard as desired.
On Wed, Feb 2, 2011 at 2:04 PM, Jan Engelhardt <jengelh@medozas.de> wrote:
>
>
> I am posting the Xtables2 Netlink interface specification, draft 7
> for comments.
>
> Additionally, further documentation and toolchain around
> it is available through the project page at
>
> http://jengelh.medozas.de/projects/xtables/
>
> * User Documentation Chapter 1: Architectural Differences
> * Developer Documentation Part 1: Netlink interface (WIP)
> This is copied below to facilitate inline replies
> --8<--
>
> Netlink interface
>
> 1 Concepts
>
> This section is non-normative and should instead show the flow of
> thought and give reasons as to why the specification was
> conceived the way it is, and where the component problems are.
>
> 1.1 Nesting representation
>
> The common element in Xtables is the ruleset, represented as a
> tree structure with ordering constraints at some levels:
>
> ruleset (unordered tables)
> \__ table (unordered chains)
> | \__ chain (ordered rules)
> | | \__ rule (ordered actions)
> | | | \__ match (unordered data)
> | | | | \__ config-data
> | | | | | \__ bin params
> | | | | \__ state-data
> | | | | \__ nlattrs
> | | | \__ match...
> | | | \__ target (unordered data)
> | | | | \__ config-data
> | | | \__ target...
> | | | \__ verdict...
> | | \__ rule...
> | \__ chain...
> \__ table...
>
> A more concrete example, here is a small ruleset, encoded into
> XML (just one of many possible representations):
>
> <table>
> <chain name="INPUT">
> <rule idx="1">
> <match acidx="1" name="hashlimit" rev="1" csize="120">
> <config-data>...</config-data>
> <state-data>...</state-data>
> </match>
> <target acidx="2" name="TOS" rev="1">
> ...
> </target>
> <verdict acidx="3" name="ACCEPT" />
> </rule>
> </chain>
> </table>
>
> There are different ways to encode such a tree structure into a
> serialized stream. In many Netlink protocols, children attributes
> are encapsulated (a. k. a. “nested”, though we will avoid this
> term to avoid double-use) and treated as a whole as a parent's
> opaque data. It cannot be told apart from normal data. (Like
> writing “<chain> <rule> ... </rule> </chain>” in
> XML.) We will call this format “Encapsulated Encoding”.
>
> To encode an attribute's length, struct nlattr only has a 16-bit
> field, which means the attribute header plus payload is limited
> to 64 KB. This is easily exceedable with the encapsulated
> encoding as chains are collected rules in a chain, for example.
> The problem is aggravated by the kernel's Netlink handler only
> allocating sk_buffs a page size worth, which leaves little room for
> extension data. In the worst case, the usable payload for
> attributes is around 3600 bytes only. In light of xt_u32's
> private data block being 1984 bytes already, that means that you
> won't be able to fit two -m u32 invocations nested in a single
> rule into a dump.
>
> Certain voices in the community call for the obsoletion of such
> data blobs and replace them by Netlink attributes; there are no
> objections to doing so. However, the problem of size-limited
> sk_buffs applies to opaque data of any kind, and Netlink
> attributes fall within that.
I'm all for opaque data blobs where the user is not expected to
understand the data underneath (like FILE handles), but only so far as they
can be safely serialized to other processes for collection of
additional data (no pointers, only TLV-style abstractions).
>
> The Xtables2 Netlink protocol encodes each node of information as
> a standalone attribute, to be called Flat Encoding, that is
> appended (a. k. a. “chained”) to the data stream. By avoiding
> encapsulated attributes, it is possible to split messages at much
> finer levels, and provides for attributes that happen to use
> opaque data with a maximally-sized buffer.
>
> 1.2 Nest markers
>
> Since Netlink messages do have a 32-bit quantity to store the
> message length, rulesets of roughly up to 4 GB are possible,
> which is currently regarded as sufficient. The largest (while
> still being meaningful) rulesets seen to date in the industry
> weighed in at approximately 150 MB.
While managing tables/rules/etc atomically should be priority #1, I'm
not certain if optimizing the protocol for this makes a lot of sense
for either the user or kernel contexts.
>
> Whereas encapsulated attribute encoding automatically provided
> for boundaries, this is realized using dummy attributes in the
> chained approach. The start of a nesting level can be implicitly
> represented by the presence of the attribute that would have
> otherwise been used for encapsulated nesting. For declaring an
> end of a nest level, an extra attribute is needed:
>
> • “chain { rule; rule; ... }” <=> CHAIN RULE RULE ... STOP
>
> 1.3 Attribute limitations in nfnetlink
>
> Netlink, being just a base protocol, does not specify what comes
> after the nlmsghdr, or how it is ordered. This is left up to the
> subprotocols based on Netlink. nfnetlink has two effective
> shortcomings (due to its parser) that shall be held in mind:
>
> • Attribute ordering is ignored and lost
(GoNetlink doesn't adhere to this belief; I didn't realize there was
any standardization of this approach outside of the libnfnetlink
implementation, and so assumed I'd be screwed if I followed it.)
>
> • No support for more than one attribute with the same type
> within a message
ditto
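
To illustrate the two limitations quoted above, here is a small Python sketch. It is not libnfnetlink or GoNetlink code: it packs a few struct nlattr-style TLVs, then parses them once into an ordered list and once into a table indexed by attribute type, which is roughly how a type-indexed parser ends up losing order and duplicates. The attribute type numbers 3 and 5 are arbitrary.

```python
import struct

NLA_HDR = struct.Struct("=HH")  # nla_len (header + payload), nla_type


def align4(n):
    """Round n up to the 4-byte Netlink attribute alignment."""
    return (n + 3) & ~3


def pack_attr(attr_type, payload):
    """Serialize one attribute, padded to a 4-byte boundary."""
    length = NLA_HDR.size + len(payload)
    return NLA_HDR.pack(length, attr_type) + payload + b"\0" * (align4(length) - length)


def parse_ordered(buf):
    """Keep every attribute, in stream order, duplicates included."""
    attrs, off = [], 0
    while off + NLA_HDR.size <= len(buf):
        length, attr_type = NLA_HDR.unpack_from(buf, off)
        if length < NLA_HDR.size or off + length > len(buf):
            raise ValueError("truncated attribute")
        attrs.append((attr_type, buf[off + NLA_HDR.size:off + length]))
        off += align4(length)
    return attrs


def parse_table(buf):
    """Index by type: order is gone and the last duplicate wins."""
    return {attr_type: payload for attr_type, payload in parse_ordered(buf)}


stream = pack_attr(5, b"abc") + pack_attr(3, b"x") + pack_attr(5, b"de")
ordered = parse_ordered(stream)
table = parse_table(stream)
print(ordered)
print(table)
```

A flat encoding that relies on attribute order within a message would need the first kind of parser on both ends.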
> 1.4 Summary of transform
>
> Essentially there is a 1:1 transform on the XML-like tree shown
> above, to:
>
> NFXTM_CHAIN_ENTRY<name=INPUT,usertid=1>
> NFXTM_RULE_ENTRY<idx=1,usertid=1>
> NFXTM_MATCH_ENTRY<acidx=1,name=hashlimit,rev=1,usertid=2>
> NFXTM_CONFIG_DATA
> NFXTM_ARB_DATA<whatever>
> NFXTM_ARB_DATA<more arbitrary data>
> NFXTM_STOP
> NFXTM_STATE_DATA
> NFXTM_ATTR_DATA<nlattrs>
> NFXTM_ATTR_DATA<more nlattrs>
> NFXTM_STOP
> NFXTM_STOP
> NFXTM_TARGET_ENTRY<acidx=2,name=TOS,rev=0,usertid=3>
> ...
> NFXTM_STOP
> NFXTM_VERDICT_ENTRY<acidx=3,name=ACCEPT,usertid=3>
> NFXTM_STOP
> NFXTM_STOP
> NFXTM_STOP
>
> 1.5 Extra sequence numbers
>
> Netlink also does not specify any message ordering, though it
> does provide an nlmsg_seq field with which message order can at
> least be determined. The problem is that nothing specifies what
> nlmsg_seq should be in reply messages. It is assumed that the
> sequence number is linked, i. e. that a reply's number should be
> the same as the request's number, to do message matching (vague
> hint by netlink(7) manpage).
RFC 3549 (2.3.2.1) seems to support you in that the usage of sequence
numbers is undefined. My experience has been to expect the response to
match the request and dispatch accordingly. Since that's the 'norm',
and netlink shouldn't ever fail, I'd actually rather see the protocol
use an NLM_F_MULTI/NLM_F_ATOMIC pair, with an internal
sequence/timestamp for clients that really need an atomic state.
>
> Even if that were decidedly so, that brings along a problem. In
> NLM_F_MULTI-style dumps, all messages would have the same
> nlmsg_seq. To counter this, multi messages will have an
> NFXT-specific sequence counter (NFXTA_SEQNO) in addition,
> especially since ordering is so much more crucial in Xtables than
> it is in other parts of networking.
That, to me, is fine -- netlink is an encapsulation from my view,
MULTI is the right way to do long messages.
>
> 1.6 Improved granularity error reporting
>...
As a non-C implementation, I'd prefer constant error (class) with
flags (bitfield), but expect to rewrite a lot of constants anyhow.
> 1.7 Multi-type responses
> ...
Most RTNetlink protocols (which will have a similar user base, I imagine)
make assumptions about the response type based on the query type. In Go,
for example, there are no generics, so re-decomposing a response becomes
expensive.
Personally, I would prefer that responses be limited solely to the
query I provided or an error, not something with multiple (possibly
confounding?) types.
> ...
> 3 Attributes
>
> The meaning of attributes depends upon the message and logical
> nesting level in which they appear. Their type however remains
> the same, such that a single Netlink attribute validation policy
> object (struct nla_policy) can be used for all message types.
>
> A table of all known attributes:
>
>
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> | Value | Mnemonic | C type | NLA type | Notes |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> | 1 | NFXTA_SEQNO | unsigned int | NLA_U32 | Section 1.5                          |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> | tba | NFXTA_ERRNO | int | NLA_U32 | Generic system errno (Exxx) |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> | ... | NFXTA_XTERRNO | int | NLA_U32 | NFXT errno (NFXTE_*) |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> | | NFXTA_ERRSTR | char [] | NLA_NUL_STRING | Arbitrary |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> | | NFXTA_USERTID | unsigned int | NLA_U32 | Arbitrary, retained verbatim |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> | | NFXTA_CHAIN_NAME | char [] | NLA_NUL_STRING | |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> | | NFXTA_RULE_IDX | unsigned int | NLA_U32 | |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> | | NFXTA_ACTION_IDX | unsigned int | NLA_U32 | |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> | | NFXTA_NAME | char [] | NLA_NUL_STRING | |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> | | NFXTA_REVISION | uint8_t | NLA_U32 | |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> | | NFXTA_HOOKNUM | unsigned int | NLA_U32 | |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> | | NFXTA_PRIORITY | int | NLA_U32 | |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> | | NFXTA_NFPROTO | uint8_t | NLA_U32 | |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> | | NFXTA_OFFSET | unsigned int | NLA_U32 | |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> | | NFXTA_LENGTH | size_t | NLA_U32 | |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> | | NFXTA_HOOKMASK | unsigned int | NLA_U32 | |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> | | NFXTA_SIZE | size_t | NLA_U32 | |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> | | NFXTA_NEW_NAME | char [] | NLA_NUL_STRING | |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
W/r/t the NUL_STRINGs: is there a good reason to use NUL-terminated
strings for NAME etc., given the length is known? Wouldn't it make more
sense to simply require a byte string and apply the NUL internally? I
see this frequently in Netlink, and imagine it's a kernel consistency
thing?
>
>
> The kernel ignores attributes with value 0 during validation, so
> it was left unused.
>
> 4 Error types
>
>
> +--------+---------------------+-------------------------------------------+
> | Value | Mnemonic | Description |
> +--------+---------------------+-------------------------------------------+
> +--------+---------------------+-------------------------------------------+
> | 0 | NFXTE_SUCCESS | No error |
> +--------+---------------------+-------------------------------------------+
> | 1 | NFXTE_CHAIN_EXIST | Chain already exists |
> +--------+---------------------+-------------------------------------------+
> | 2 | NFXTE_CHAIN_NOENT | Chain does not exist |
> +--------+---------------------+-------------------------------------------+
> | 3 | NFXTE_RULESET_LOOP | Ruleset contains a loop |
> +--------+---------------------+-------------------------------------------+
> | 4 | NFXTE_EXT_HOOKMASK | Rule invoked from incompatible hook |
> +--------+---------------------+-------------------------------------------+
> | | NFXTE_PROMO_STATUS | Promotion/demotion state already achieved |
> +--------+---------------------+-------------------------------------------+
>
>
> 5 Message types
> ...
My biggest concern here is, as already pointed out, the use of
STOP and deep nesting in messages. Every time a STOP occurs in an
internal message, it's semantically equivalent to the completion of an
NLM_F_MULTI, no?
I see the advantage of a trivial protocol, but wouldn't it be much
simpler to have a 'bigger' protocol (table/chain/rule) with an
optional ATOMIC guarantee?
I don't see anywhere else guaranteeing tables/matches/rules will be
managed (as a set) with atomicity [I'm probably wrong], so doing it in
the protocol feels awkward.
There are a LOT of definitions of atomicity, ordering, etc. within this
area, making me feel like doing that 'up one level' and in smaller
pieces might make for a more manageable interface.
Still, this all looks like phenomenal progress, and I look forward to
seeing it move on.
James
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: Xtables2 A7 spec draft
2011-02-07 20:50 ` James Nurmi
@ 2011-02-07 21:45 ` Jan Engelhardt
0 siblings, 0 replies; 6+ messages in thread
From: Jan Engelhardt @ 2011-02-07 21:45 UTC (permalink / raw)
To: James Nurmi; +Cc: Netfilter Developer Mailing List
On Monday 2011-02-07 21:50, James Nurmi wrote:
>> +--------+-------------------+---------------+-----------------+--------------------------------------+
>> | | NFXTA_ERRSTR | char [] | NLA_NUL_STRING | Arbitrary |
>> +--------+-------------------+---------------+-----------------+--------------------------------------+
>
>W/r/t the NUL_STRING's -- is there a good reason to use a NUL'd
>strings for NAME/etc, given the length is known?
Simpler to deal with string functions especially when it comes to strcmp.
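
For a sense of what the terminator costs on the wire, here is a minimal Python sketch. It is an illustration under assumptions: attribute type 6 is a made-up value (the draft lists most attribute values as still to be assigned), and the decoder simply insists on a trailing NUL, which is stricter and simpler than whatever validation a real parser applies.

```python
import struct

NLA_HDR = struct.Struct("=HH")  # nla_len (header + payload), nla_type


def pack_string_attr(attr_type, text, nul_terminated):
    """Serialize a string attribute, with or without a trailing NUL."""
    payload = text.encode("ascii") + (b"\0" if nul_terminated else b"")
    length = NLA_HDR.size + len(payload)
    padding = -length % 4  # attributes are padded to 4 bytes
    return NLA_HDR.pack(length, attr_type) + payload + b"\0" * padding


def unpack_nul_string(attr):
    """Decode a NUL-terminated string attribute, rejecting a missing NUL."""
    length, _attr_type = NLA_HDR.unpack_from(attr, 0)
    payload = attr[NLA_HDR.size:length]
    if not payload.endswith(b"\0"):
        raise ValueError("string attribute is not NUL-terminated")
    return payload[:-1].decode("ascii")


with_nul = pack_string_attr(6, "INPUT", nul_terminated=True)
without_nul = pack_string_attr(6, "INPUT", nul_terminated=False)
nla_len_with = NLA_HDR.unpack_from(with_nul, 0)[0]
nla_len_without = NLA_HDR.unpack_from(without_nul, 0)[0]
print(nla_len_with, nla_len_without, len(with_nul), len(without_nul))
print(unpack_nul_string(with_nul))
```

For "INPUT" the declared lengths differ by one byte, but both variants occupy the same padded size, so the terminator is often free in practice.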
>> 5 Message types
>> ...
>
>My biggest concern here is, as already pointed out, the use of
>STOP and deep nesting in messages. Every time a STOP occurs in an
>internal message, it's semantically equivalent to the completion of an
>NLM_F_MULTI, no?
MULTIs do not seem to be nestable.
>I see the advantage of a trivial protocol, but wouldn't it be much
>simpler to have a 'bigger' protocol (table/chain/rule) with an
>optional ATOMIC guarantee?
I have no idea what you could mean by that. (In fact, most of
your reply gave me nothing to act on.)
>I don't see anywhere else guaranteeing tables/matches/rules will be
>managed (as a set) with atomicity [I'm probably wrong], so doing it in
>the protocol feels awkward.
By issuing NFXTM_TABLE_REPLACE (atomic replace of table) or
NFXTM_CHAIN_SPLICE (atomic replace of a chain and its rules), rules
that follow are collected and implanted into the live ruleset on the
final NFXTM_STOP. And these two cover all the atomicity one would
need as I see it.
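
A rough Python sketch of that collect-then-commit behaviour follows. It is an assumption about one way to model it, not the kernel implementation: the SpliceCollector class is hypothetical, it assumes NFXTM_CHAIN_SPLICE carries the chain name and opens a nesting level of its own, and it tracks nothing but action names.

```python
OPENERS = {
    "NFXTM_CHAIN_SPLICE", "NFXTM_RULE_ENTRY", "NFXTM_MATCH_ENTRY",
    "NFXTM_TARGET_ENTRY", "NFXTM_VERDICT_ENTRY",
}


class SpliceCollector:
    """Stage rules for one chain; swap them in on the outermost STOP."""

    def __init__(self, live):
        self.live = live      # dict: chain name -> list of rules
        self.depth = 0        # currently open nesting levels
        self.chain = None
        self.rules = []

    def feed(self, msg, name=None):
        if msg == "NFXTM_STOP":
            if self.depth == 0:
                raise ValueError("NFXTM_STOP without an open level")
            self.depth -= 1
            if self.depth == 0:
                # Final STOP: replace the chain in a single assignment.
                self.live[self.chain] = self.rules
                self.chain, self.rules = None, []
            return
        if msg not in OPENERS:
            raise ValueError("unexpected message " + msg)
        if (msg == "NFXTM_CHAIN_SPLICE") != (self.depth == 0):
            raise ValueError("misplaced " + msg)
        if msg == "NFXTM_CHAIN_SPLICE":
            self.chain, self.rules = name, []
        elif msg == "NFXTM_RULE_ENTRY":
            self.rules.append([])
        else:
            self.rules[-1].append(name)
        self.depth += 1


live = {"INPUT": [["DROP"]]}
collector = SpliceCollector(live)
for msg, name in [
    ("NFXTM_CHAIN_SPLICE", "INPUT"),
    ("NFXTM_RULE_ENTRY", None),
    ("NFXTM_MATCH_ENTRY", "hashlimit"),
    ("NFXTM_STOP", None),
    ("NFXTM_VERDICT_ENTRY", "ACCEPT"),
    ("NFXTM_STOP", None),
    ("NFXTM_STOP", None),           # closes the rule
]:
    collector.feed(msg, name)
before_commit = {chain: [list(r) for r in rules] for chain, rules in live.items()}
collector.feed("NFXTM_STOP")        # closes the splice and commits
print(before_commit, live)
```

Until the last NFXTM_STOP arrives, readers of the live dictionary keep seeing the old chain contents.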
2011-02-02 22:04 Xtables2 A7 spec draft Jan Engelhardt
2011-02-05 19:33 ` Jozsef Kadlecsik
2011-02-05 21:38 ` Jan Engelhardt
2011-02-06 11:43 ` Jozsef Kadlecsik
2011-02-07 20:50 ` James Nurmi
2011-02-07 21:45 ` Jan Engelhardt