Discussion:
Bug#934584: IPMasquerade=yes uses iptables (not nftables)
Trent W. Buck
2019-08-12 09:00:02 UTC
Package: systemd
Version: 241-5
Severity: normal
File: /lib/systemd/network/80-container-ve.network

Debian 10 defaults to nftables:

https://www.debian.org/releases/stable/amd64/release-notes/ch-whats-new.en.html#nftables

...but systemd doesn't for IPMasquerade=, see below.



AFAICT the default behaviour of "machinectl start my-new-container" is
to create a veth interface between the host and the container, with a
private /28 IPv4 address range shared between them, masquerading
(i.e. source NAT), and no IPv6 RA(?!). This is governed by

/lib/systemd/network/80-container-ve.network

AFAICT this is set up using *legacy* iptables, not using nftables.
Here is a system with sshguard installed (uses native nftables), and two systemd containers running:

bash5# iptables-save
# Table `sshguard' is incompatible, use 'nft' tool.
# Warning: iptables-legacy tables present, use iptables-legacy-save to see them

bash5# iptables-legacy-save
# Generated by iptables-save v1.8.3 on Mon Aug 12 18:01:37 2019
*nat
:PREROUTING ACCEPT [7781:686652]
:INPUT ACCEPT [7749:685024]
:OUTPUT ACCEPT [2108:206384]
:POSTROUTING ACCEPT [2108:206384]
-A POSTROUTING -s 10.0.0.16/28 -j MASQUERADE
-A POSTROUTING -s 10.0.0.0/28 -j MASQUERADE
COMMIT
# Completed on Mon Aug 12 18:01:37 2019

bash5# nft list ruleset
table ip sshguard {
    set attackers {
        type ipv4_addr
        flags interval
    }

    chain blacklist {
        type filter hook input priority filter - 10; policy accept;
        ip saddr @attackers drop
    }
}
table ip6 sshguard {
    set attackers {
        type ipv6_addr
        flags interval
    }

    chain blacklist {
        type filter hook input priority filter - 10; policy accept;
        ip6 saddr @attackers drop
    }
}


I *think* the nftables people said mixing legacy (xtables) and
nftables on the same system is Unsupported™ and Bad Things™ will happen,
but I can't find a citation right now.

I can do "busybox ping example.com" within the container, which
implies systemd's MASQUERADE rules *are* working.

This test nftables ruleset also appears to be working
(while systemd's legacy rules are present):

bash5# nft 'table a { chain b { type nat hook postrouting priority srcnat; }; chain c { type nat hook prerouting priority dstnat; tcp dport 80 counter log; }; }'
bash5# nft list ruleset
[...]
table ip a {
    chain b {
        type nat hook postrouting priority srcnat; policy accept;
    }

    chain c {
        type nat hook prerouting priority dstnat; policy accept;
        tcp dport 80 counter packets 1 bytes 60 log
    }
}

...so, I may be freaking out about nothing.

At a minimum, the legacy rules created by systemd are "invisible" to
an admin looking directly at "nft list ruleset", which is the only
place they will look if they expect the system to be nftables native.
That violates the principle of least surprise.
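
To make that concrete, here is a minimal sketch of checking both rule stores for masquerade rules. The helper name legacy_masquerade_rules is hypothetical; it only filters iptables-save style text on stdin, and the sample input is copied from the iptables-legacy-save output above. The live commands in the trailing comments need root and are not run by the sketch itself.

```shell
#!/bin/sh
# Hypothetical helper: print the MASQUERADE rules found in
# iptables-save style text read from stdin.
# "|| true" keeps the exit status 0 when nothing matches.
legacy_masquerade_rules() {
    grep -e '-j MASQUERADE' || true
}

# Sample input taken from the iptables-legacy-save output in this report.
legacy_masquerade_rules <<'EOF'
*nat
:POSTROUTING ACCEPT [2108:206384]
-A POSTROUTING -s 10.0.0.16/28 -j MASQUERADE
-A POSTROUTING -s 10.0.0.0/28 -j MASQUERADE
COMMIT
EOF

# On a live host (as root), both stores have to be inspected:
#   iptables-legacy-save -t nat | legacy_masquerade_rules
#   nft list ruleset | grep masquerade
```

An admin who only runs the second command sees nothing, even though the first shows NAT is active.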

Is it possible to make systemd use nftables instead of iptables when the system is so configured?
I think this would have Just Happened automatically if systemd were actually running iptables(8) or iptables-restore(8), but
I think it is instead talking directly to the kernel in src/shared/firewall-util.c:fw_add_masquerade()?


PS: I don't know how to do it directly in C, but from nft(8), your MASQUERADE rules would look like this (comments optional):

#!/usr/sbin/nft --file
table ip systemd-container-blah-blah {
    chain postrouting {
        type nat hook postrouting priority srcnat
        policy accept
        ip saddr 10.0.0.0/28 masquerade comment "for systemd-***@my-new-container"
        ip saddr 10.0.0.16/28 masquerade comment "for systemd-***@my-other-container"
    }
    chain prerouting {
        type nat hook prerouting priority dstnat
        policy accept
        continue comment "Apparently the postrouting chain won't work unless this prerouting chain also exists"
    }
}

The "modern" way to do it would be to put all the IP ranges into a
named set, so that you just add/remove from the set, not from the
rules. This is analogous to "iptables -m set --help" and ipset(8),
which you could already have used for efficiency when you have
hundreds of containers:

#!/usr/sbin/nft --file
table ip systemd-container-blah-blah {
    set systemd-container-masquerade-ranges { type ipv4_addr; flags interval; }
    chain postrouting {
        type nat hook postrouting priority srcnat
        policy accept
        ip saddr @systemd-container-masquerade-ranges masquerade comment "for systemd-nspawn@"
    }
    chain prerouting {
        type nat hook prerouting priority dstnat
        policy accept
        continue comment "Apparently the postrouting chain won't work unless this prerouting chain also exists"
    }
}

Then when a container comes up, do

nft 'add element ip systemd-container-blah-blah systemd-container-masquerade-ranges { 10.0.0.0/24 }'

In fact, this is exactly what sshguard is doing for its filter blacklist.
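
The matching teardown step (removing a range when a container stops) is not spelled out above. A minimal sketch, assuming the placeholder table and set names from the example ruleset and a hypothetical helper name masq_range_cmd. It only prints the nft command so it can be reviewed before being run as root; it does not touch the firewall itself.

```shell
#!/bin/sh
# Placeholder names copied from the example ruleset; not real systemd names.
TABLE='systemd-container-blah-blah'
SET='systemd-container-masquerade-ranges'

# Hypothetical helper: print the nft command that adds or removes one
# container address range from the masquerade set.
masq_range_cmd() {
    action=$1   # "add" when a container starts, "delete" when it stops
    range=$2    # e.g. 10.0.0.0/28
    case $action in
        add|delete) ;;
        *) echo "usage: masq_range_cmd add|delete RANGE" >&2; return 1 ;;
    esac
    printf 'nft %s element ip %s %s { %s }\n' "$action" "$TABLE" "$SET" "$range"
}

masq_range_cmd add 10.0.0.0/28      # container comes up
masq_range_cmd delete 10.0.0.0/28   # container goes down
```

Because the rule references the set rather than literal addresses, neither step rewrites the chain.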
Michael Biebl
2019-08-12 11:40:01 UTC
Post by Trent W. Buck
Is it possible to make systemd use nftables instead of iptables, when the system is so configured?
I think this would have Just Happened automatically if systemd was actually running iptables(8) or iptables-restore(8), but
I think it is instead talking direct to the kernel in src/shared/firewall-util.c:fw_add_masquerade() ?
In fact, this is exactly what sshguard is doing for its filter blacklist.
src/shared/firewall-util.* uses libiptc (which in turn uses iptables)

ttbomk, mixing nftables and iptables is supported, otherwise we'd have
huge problems in buster (e.g. firewalld was explicitly switched back to
use iptables as quite a few components are not yet nft ready, like
libvirt and other container managers like docker).
That said, I've CCed Arturo, maybe he can chime in here.


To me this sounds more like a wishlist bug to get systemd ported from
libiptc to libnftables and that should be filed and addressed upstream.

Michael
--
Why is it that all of the instruments seeking intelligent life in the
universe are pointed away from Earth?