Refactor sysctl tuning
- Use sysctl formula instead of static files.
- Drop old tuning options which are deprecated, match their default
value, or were too generic to be applied to all machines.
- Split settings into common and role specific ones.
- Move settings which were too generic but could be useful again on
a role level in the future to a comment block.
Below are descriptive comments explaining the thought process behind
each changed parameter:
1)
```
net.ipv4.neigh.default.gc_stale_time = 3600
net.ipv6.neigh.default.gc_stale_time = 3600
gc_stale_time (since Linux 2.2)
Determines how often to check for stale neighbor entries.
When a neighbor entry is considered stale, it is resolved
again before sending data to it. Defaults to 60 seconds.
```
There is no reason in our current network to check for stale entries
less often. In fact, I would be confused if `ip neigh sh` updated slower
than expected. It might be useful to combine this with situations where
2) applies though.
2)
```
net.ipv4.neigh.default.gc_thresh3 = 4096
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh1 = 1024
net.ipv6.neigh.default.gc_thresh3 = 4096
net.ipv6.neigh.default.gc_thresh2 = 2048
net.ipv6.neigh.default.gc_thresh1 = 1024
gc_thresh1 (since Linux 2.2)
The minimum number of entries to keep in the ARP cache.
The garbage collector will not run if there are fewer than
this number of entries in the cache. Defaults to 128.
gc_thresh2 (since Linux 2.2)
The soft maximum number of entries to keep in the ARP
cache. The garbage collector will allow the number of
entries to exceed this for 5 seconds before collection
will be performed. Defaults to 512.
gc_thresh3 (since Linux 2.2)
The hard maximum number of entries to keep in the ARP
cache. The garbage collector will always run if there are
more than this number of entries in the cache. Defaults
to 1024.
```
This should be tuned on servers where kernel messages such as
`arp_cache: neighbor table overflow!` are observed.
On internal machines this should never happen.
If it does happen on internet connected machines, we can
re-add this on a role basis later on.
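If such overflows are ever suspected, a quick check could look like the following sketch. A captured sample line stands in for the live kernel log so the snippet runs anywhere; on a real machine one would grep `dmesg` or `journalctl -k` instead.

```shell
# Count occurrences of the overflow message that would justify raising
# the gc_thresh* values (sample line instead of the real kernel log).
sample='[12345.678] arp_cache: neighbor table overflow!'
printf '%s\n' "$sample" | grep -c 'neighbor table overflow'
```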
3)
```
net.core.netdev_max_backlog = 50000
netdev_max_backlog
Maximum number of packets, queued on the INPUT side,
when the interface receives packets faster than kernel can
process them.
```
I'm not opposed to keeping this, though I would be interested which
specific systems benefit from it and how.
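One way to answer that question: the second hex column of `/proc/net/softnet_stat` counts packets dropped because the backlog queue was full, so a nonzero value there would indicate a machine that benefits. A sample line is used below so the sketch runs without the real file.

```shell
# Extract the "dropped" counter (second column, hex) from a sample
# softnet_stat line and print it in decimal.
line='0000272d 0000001a 00000000 00000000'
dropped_hex=$(printf '%s\n' "$line" | awk '{print $2}')
printf '%d\n' "0x$dropped_hex"
```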
4)
```
net.ipv4.tcp_syncookies = 1
tcp_syncookies - INTEGER
Only valid when the kernel was compiled with CONFIG_SYN_COOKIES
Send out syncookies when the syn backlog queue of a socket
overflows. This is to prevent against the common
'SYN flood attack'
Default: 1
```
This is 1 by default, and hence does not need to be set by us.
5)
```
net.ipv4.ip_forward = 0
ip_forward - BOOLEAN
0 - disabled (default)
not 0 - enabled
```
This is 0 by default, and hence does not need to be set by us. We will
soon be setting this to 1 on a role.gateway level.
6)
```
net.ipv6.conf.all.forwarding = 0
forwarding - BOOLEAN
Enable IP forwarding on this interface. This controls whether
packets received _on_ this interface can be forwarded.
```
Although the default value not being explicitly declared, the text
very much suggests that the toggle is there to enable it on demand, hence
us not needing to set 0.
Same as with 5), this will be enabled on a role level in the future.
7)
```
net.ipv4.tcp_ecn = 0
tcp_ecn - INTEGER
Control use of Explicit Congestion Notification (ECN) by TCP.
ECN is used only when both ends of the TCP connection indicate
support for it. This feature is useful in avoiding losses due
to congestion by allowing supporting routers to signal
congestion before having to drop packets.
Possible values are:
0 Disable ECN. Neither initiate nor accept ECN.
1 Enable ECN when requested by incoming connections and
also request ECN on outgoing connection attempts.
2 Enable ECN when requested by incoming connections
but do not request ECN on outgoing connections.
Default: 2
```
This is a very interesting one! Here I would not just leave it in, but
rather set it to 1.
https://en.wikipedia.org/wiki/Explicit_Congestion_Notification
https://www.juniper.net/documentation/us/en/software/junos/cos/topics/concept/cos-qfx-series-explicit-congestion-notification-understanding.html
Online research revealed that disabling this used to be common practice
due to issues with network equipment in the past.
Our network equipment should not have any issues handling ECN, given that
we do not disable it on the same hardware in other parts of the data centers.
(cboltz adds: "Indeed, 1 looks like a good choice here.")
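A role-level drop-in for this could look like the following sketch. The file name is illustrative; the snippet writes to a temporary file so it runs without root, whereas a real deployment would target `/etc/sysctl.d/`.

```shell
# Stand-in for e.g. /etc/sysctl.d/90-role-network.conf (name is hypothetical)
conf=$(mktemp)
cat > "$conf" <<'EOF'
# Request ECN on outgoing connections and accept it on incoming ones
net.ipv4.tcp_ecn = 1
EOF
grep 'tcp_ecn' "$conf"
rm -f "$conf"
```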
8)
```
net.ipv6.conf.default.autoconf = 0
net.ipv6.conf.default.accept_ra = 0
```
We might want to keep autoconf disabled (for now, we use only static IPv6
addresses) but enable router advertisements along with running a radvd server.
9)
```
net.ipv6.conf.default.accept_ra_defrtr = 0
accept_ra_defrtr - BOOLEAN
Learn default router in Router Advertisement.
Functional default: enabled if accept_ra is enabled.
disabled if accept_ra is disabled.
```
This setting follows `accept_ra` and is hence superfluous in our case.
```
net.ipv4.neigh.default.gc_interval = 3600
net.ipv6.neigh.default.gc_interval = 3600
gc_interval (since Linux 2.2)
How frequently the garbage collector for neighbor entries
should attempt to run. Defaults to 30 seconds.
```
I don't think we suffer from any bottlenecks by keeping frequent garbage
collection, though I could not find how such an impact would be measured and located.
It seems these toggles are useful if the kernel reports neighbour table overflows:
http://www.cyberciti.biz/faq/centos-redhat-debian-linux-neighbor-table-overflow/.
If they are needed on a machine suffering from such overflows, we can enable them on a role or id
level in the future. In this case we should use the `all` instead of the `default`
namespace.
10)
```
net.ipv4.conf.all.log_martians = 0
net.ipv4.conf.default.log_martians = 0
log_martians - BOOLEAN
Log packets with impossible addresses to kernel log.
log_martians for the interface will be enabled if at least one of
conf/{all,interface}/log_martians is set to TRUE,
it will be disabled otherwise
```
I am very much interested in this and would like it logged, hence setting this to 1.
11)
```
net.ipv4.conf.login.log_martians = 0
net.ipv4.conf.private.log_martians = 0
net.ipv4.conf.external.log_martians = 0
```
These use hardcoded interface names which make no sense to write configuration for
on all machines. Furthermore, the values match the previously defined default.
12)
```
net.ipv6.route.max_size=16384
route/max_size - INTEGER
Maximum number of routes allowed in the kernel. Increase
this when using large numbers of interfaces and/or routes.
From linux kernel 3.6 onwards, this is deprecated for ipv4
as route cache is no longer used.
```
Deprecated, removing.
13)
```
net.bridge.bridge-nf-call-arptables = 0
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 0
bridge-nf-call-arptables - BOOLEAN
1 : pass bridged ARP traffic to arptables' FORWARD chain.
0 : disable this.
Default: 1
bridge-nf-call-iptables - BOOLEAN
1 : pass bridged IPv4 traffic to iptables' chains.
0 : disable this.
Default: 1
bridge-nf-call-ip6tables - BOOLEAN
1 : pass bridged IPv6 traffic to ip6tables' chains.
0 : disable this.
Default: 1
```
https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/net/bridge/br_netfilter.c?id=d049a43dcf06a3e155f5496aade5184755a288c4
This is used by the `br_netfilter` module which is not loaded by default.
Checking on all machines using Salt, it is currently only loaded on
```
obsreview.infra.opensuse.org:
br_netfilter 28672 0
bridge 229376 1 br_netfilter
gitlab-runner2.infra.opensuse.org:
br_netfilter 32768 0
bridge 434176 1 br_netfilter
gitlab-runner1.infra.opensuse.org:
br_netfilter 32768 0
bridge 356352 1 br_netfilter
```
This seems to correlate with
https://unix.stackexchange.com/questions/719112/why-do-net-bridge-bridge-nf-call-arp-ip-ip6tables-default-to-1,
which suggests this module is loaded by Docker.
In this case it might make sense to keep these toggles at 0.
Setting them globally would not hurt, and would benefit machines which get `br_netfilter`
loaded by other software.
Of course, it is to be hoped that the deprecation finally happens, especially given the
use of nftables in all scenarios not involving Docker.
Alternatively it could be moved to a Docker role.
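The per-machine check behind the Salt output above boils down to inspecting the module list; a sketch, run against a captured `lsmod` sample so it works anywhere:

```shell
# Report whether br_netfilter shows up in (sample) lsmod output.
sample='br_netfilter 32768 0
bridge 434176 1 br_netfilter'
printf '%s\n' "$sample" | awk '$1 == "br_netfilter" {print "loaded"}'
```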
14)
```
net.bridge.bridge-nf-filter-pppoe-tagged = 0
net.bridge.bridge-nf-filter-vlan-tagged = 0
bridge-nf-filter-vlan-tagged - BOOLEAN
1 : pass bridged vlan-tagged ARP/IP/IPv6 traffic to {arp,ip,ip6}tables.
0 : disable this.
Default: 0
bridge-nf-filter-pppoe-tagged - BOOLEAN
1 : pass bridged pppoe-tagged IP/IPv6 traffic to {ip,ip6}tables.
0 : disable this.
Default: 0
```
A similar legacy situation as described in 13).
The machines using iptables/Docker do not use VLANs or PPPoE and hence
do not benefit from this.
15)
```
vm.swappiness = 5
swappiness
This control is used to define how aggressive the kernel will swap
memory pages. Higher values will increase aggressiveness, lower values
decrease the amount of swap. A value of 0 instructs the kernel not to
initiate swap until the amount of free and file-backed pages is less
than the high water mark in a zone.
The default value is 60.
```
Quoting RedHat:
```
Tuning vm.swappiness incorrectly may hurt performance or may have a different
impact between light and heavy workloads.
Changes to this parameter should be made in small increments and should be tested
under the same conditions that the system normally operates.
```
This should be tuned on memory hungry systems or when there is a recommendation
specific to the software running on a particular machine - for example, GitLab
recommends it in "constrained" environments:
https://docs.gitlab.com/omnibus/settings/memory_constrained_envs.html#configure-swap.
In all other cases I suggest to trust the default.
(cboltz adds: "I'd add that we have very few machines/VMs that actually have swap.",
"Also, we should consider to convert existing swap to "real" RAM which will give us a
much better performance improvement than this setting ;-)")
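cboltz's point is easy to verify per machine; a sketch using sample `/proc/meminfo` content so it runs anywhere:

```shell
# Print the SwapTotal value; 0 means the swappiness setting is moot.
sample='MemTotal:       16318260 kB
SwapTotal:             0 kB'
printf '%s\n' "$sample" | awk '/^SwapTotal:/ {print $2 " kB"}'
```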
16)
```
net.core.somaxconn = 2048
somaxconn - INTEGER
Limit of socket listen() backlog, known in userspace as SOMAXCONN.
Defaults to 4096. (Was 128 before linux-5.4)
See also tcp_max_syn_backlog for additional tuning for TCP sockets.
```
The comment "increasing the backlog limit" no longer makes sense, given that the new
default is even higher. Hence we can remove this setting on modern systems.
17)
```
net.ipv4.tcp_timestamps = 0
tcp_timestamps - INTEGER
Enable timestamps as defined in RFC1323.
0: Disabled.
1: Enable timestamps as defined in RFC1323 and use random offset for
each connection rather than only using the current time.
2: Like 1, but without random offsets.
Default: 1
```
It seems this used to be a vulnerability in the past, but no longer is, as the
default now uses a random offset:
https://security.stackexchange.com/a/224696.
Hence I take it this should be safe to keep at the default (1) now.
18)
```
net.core.optmem_max = 65536
```
This might be interesting to investigate further, as we are using up to 40G networking.
Instructions on calculating an ideal value seem to be scarce, though.
https://indico.cern.ch/event/212228/contributions/1507212/attachments/333941/466017/10GE_network_tests_with_UDP.pdf
19)
```
net.ipv4.tcp_max_tw_buckets = 1440000
tcp_max_tw_buckets - INTEGER
Maximal number of timewait sockets held by system simultaneously.
If this number is exceeded time-wait socket is immediately destroyed
and warning is printed. This limit exists only to prevent
simple DoS attacks, you _must_ not lower the limit artificially,
but rather increase it (probably, after increasing installed memory),
if network conditions require more than default value.
```
I feel like they neglected to elaborate on the "default value" in that text.
IBM goes into a bit more detail:
https://www.ibm.com/docs/en/linux-on-systems?topic=tuning-tcpip-ipv4-settings#net.ipv4.tcp_max_tw_buckets,
suggesting the default to be 262144.
Again, lots of "suggestions" on what to set this to, but no useful information on
determining a value suited to our environment.
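One way to at least gauge whether the limit is anywhere near being hit is counting the TIME-WAIT sockets currently held; the sketch below parses sample `ss -tan` style output so it runs anywhere, whereas a live check would pipe the real `ss` output.

```shell
# Count sockets in the TIME-WAIT state from (sample) socket listing output.
sample='ESTAB 0 0 10.0.0.1:443 10.0.0.2:51000
TIME-WAIT 0 0 10.0.0.1:443 10.0.0.3:51001
TIME-WAIT 0 0 10.0.0.1:443 10.0.0.4:51002'
printf '%s\n' "$sample" | awk '$1 == "TIME-WAIT"' | wc -l
```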
20)
```
net.ipv4.tcp_tw_recycle = 1
```
This no longer exists:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4396e46187ca5070219b81773c4e65088dac50cc
21)
```
net.ipv4.tcp_tw_reuse = 1
tcp_tw_reuse - INTEGER
Enable reuse of TIME-WAIT sockets for new connections when it is
safe from protocol viewpoint.
0 - disable
1 - global enable
2 - enable for loopback traffic only
It should not be changed without advice/request of technical
experts.
Default: 2
```
"technical experts" - are they saying all other kernel options should be set
by the average user? ;-)
Marginally more information is provided by this commit:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=79e9fed460385a3d8ba0b5782e9e74405cb199b1,
though it is still not clear why one would want to enable this for interfaces other
than the loopback ones.
I like the article in
https://vincent.bernat.ch/en/blog/2014-tcp-time-wait-state-linux, especially the part:
> The Linux kernel documentation is not very helpful about what net.ipv4.tcp_tw_recycle
> and net.ipv4.tcp_tw_reuse do. ...
Yes!
The article then goes on with a very in depth explanation.
Most importantly, it explains the individual kernel options after stating
> If you still think you have a problem with TIME-WAIT connections after reading the
> previous section, there are three additional solutions to solve them:
Hence, if we do think we have problems, we should properly read and understand the article,
and otherwise remove this setting until we have.
22)
```
net.ipv4.tcp_max_orphans = 16384
tcp_max_orphans - INTEGER
Maximal number of TCP sockets not attached to any user file handle,
held by system. If this number is exceeded orphaned connections are
reset immediately and warning is printed. This limit exists
only to prevent simple DoS attacks, you _must_ not rely on this
or lower the limit artificially, but rather increase it
(probably, after increasing installed memory),
if network conditions require more than default value,
and tune network services to linger and kill such states
more aggressively. Let me to remind again: each orphan eats
up to ~64K of unswappable memory.
```
Sounds reasonable to reduce memory usage, but it explicitly states
"you must not rely on this or lower the limit artificially".
The commit message in
https://github.com/torvalds/linux/commit/c5ed63d66f24fd4f7089b5a6e087b0ce7202aa8e suggests
that the default depends on the host system memory.
On my Tumbleweed system with 16GB RAM, the default seems to be 65536.
Hence we are lowering this on machines with enough memory, against the very explicit
recommendation not to. If we do want to tune this, we need to make it dependent on the
machine's memory. Easier would be to drop it and only configure it on a machine or role basis if
a specific system is found to emit `TCP: too many of orphaned ....` warning messages:
https://github.com/torvalds/linux/blob/c5ed63d66f24fd4f7089b5a6e087b0ce7202aa8e/net/ipv4/tcp.c#L2017.
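For scale, taking the "~64K of unswappable memory" per orphan quoted above at face value, the configured limit bounds the worst case at roughly:

```shell
# 16384 orphans at ~64 KiB each, expressed in MiB (so, about 1 GiB)
orphans=16384
echo "$((orphans * 64 / 1024)) MiB"
```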
23)
```
net.ipv4.tcp_orphan_retries = 0
tcp_orphan_retries (integer; default: 8; since Linux 2.4)
The maximum number of attempts made to probe the other end
of a connection which has been closed by our end.
or
tcp_orphan_retries - INTEGER
This value influences the timeout of a locally closed TCP connection,
when RTO retransmissions remain unacknowledged.
See tcp_retries2 for more details.
The default value is 8.
If your machine is a loaded WEB server,
you should think about lowering this value, such sockets
may consume significant resources. Cf. tcp_max_orphans.
```
Sysctl reports the default to be 0 on my Tumbleweed machine already,
hence we do not need to set this.
It seems the default of 8 mentioned in the documentation rather plays a role in the
following logic, and does not refer to the sysctl value:
https://github.com/openSUSE/kernel/blob/4d82a8f12dcb8809a6fd36f6df0a6c062eaf88ff/net/ipv4/tcp_timer.c#L148
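A simplified shell transcription of the fallback in that linked `tcp_orphan_retries()` function (the real code also zeroes the retries when an ICMP soft error was seen): a sysctl value of 0 on a still-alive connection is replaced by 8.

```shell
# $1 = sysctl value, $2 = connection alive (1/0); mirrors the kernel fallback.
orphan_retries() {
  retries=$1; alive=$2
  if [ "$retries" -eq 0 ] && [ "$alive" -eq 1 ]; then
    retries=8
  fi
  echo "$retries"
}
orphan_retries 0 1
```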
24)
```
net.ipv4.ipfrag_low_thresh = 446464
ipfrag_low_thresh - LONG INTEGER
(Obsolete since linux-4.17) Maximum memory used to reassemble IP fragments before
the kernel begins to remove incomplete fragment queues to free up resources.
The kernel still accepts new fragments for defragmentation.
```
Deprecated, removing.
25)
```
net.ipv4.neigh.default.proxy_qlen = 96
proxy_qlen (since Linux 2.2)
The maximum number of packets which may be queued to
proxy-ARP addresses. Defaults to 64.
TLDP has slightly more words around it:
/proc/sys/net/ipv4/neigh/DEV/proxy_delay
Maximum time (real time is random [0..proxytime]) before answering to an
ARP request for which we have an proxy ARP entry.
In some cases, this is used to prevent network flooding.
/proc/sys/net/ipv4/neigh/DEV/proxy_qlen
Maximum queue length of the delayed proxy arp timer. (see proxy_delay).
```
I do not know of us using Proxy ARP anywhere and hence do not think this makes sense to keep.
The comment in our file and the two lines following it match the ones in this file 1:1:
https://gist.github.com/dkulagin/c5081095c123fc8fe3f80f43cd7a15d5#file-sysctl-conf-L216
I know it's evil to suggest this was just copy-pasted. :-)
26)
```
net.ipv4.neigh.default.unres_qlen = 6
neigh/default/unres_qlen - INTEGER
The maximum number of packets which may be queued for each unresolved address by
other network layers.
(deprecated in linux 3.3) : use unres_qlen_bytes instead.
Prior to linux 3.3, the default value is 3 which may cause unexpected packet loss.
The current default value is calculated according to default value of unres_qlen_bytes
and true size of packet.
Default: 101
```
Deprecated, superseded by:
```
neigh/default/unres_qlen_bytes - INTEGER
The maximum number of bytes which may be used by packets
queued for each unresolved address by other network layers.
(added in linux 3.3)
Setting negative value is meaningless and will return error.
Default: SK_WMEM_MAX, (same as net.core.wmem_default).
Exact value depends on architecture and kernel options,
but should be enough to allow queuing 256 packets
of medium size.
```
Unfortunately again a candidate with very scarce information on how to assess
and calculate an optimized value.
27)
```
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
rmem_default
The default setting of the socket receive buffer in bytes.
rmem_max
The maximum receive socket buffer size in bytes.
wmem_default
The default setting (in bytes) of the socket send buffer.
wmem_max
The maximum send socket buffer size in bytes.
```
Relates to 18).
https://www.tecchannel.de/a/tcp-ip-tuning-fuer-linux,429773,6 suggests this is meant
for low-memory situations.
I'm not sure we suffer from those, though there might be other benefits which this
single article did not cover.
28)
```
net.ipv4.tcp_mem=8388608 8388608 8388608
net.ipv4.tcp_rmem=1048576 4194304 16777216
net.ipv4.tcp_wmem=1048576 4194304 16777216
tcp_mem - vector of 3 INTEGERs: min, pressure, max
min: below this number of pages TCP is not bothered about its
memory appetite.
pressure: when amount of memory allocated by TCP exceeds this number
of pages, TCP moderates its memory consumption and enters memory
pressure mode, which is exited when memory consumption falls
under "min".
max: number of pages allowed for queueing by all TCP sockets.
Defaults are calculated at boot time from amount of available
memory.
tcp_rmem - vector of 3 INTEGERs: min, default, max
min: Minimal size of receive buffer used by TCP sockets.
It is guaranteed to each TCP socket, even under moderate memory
pressure.
Default: 4K
default: initial size of receive buffer used by TCP sockets.
This value overrides net.core.rmem_default used by other protocols.
Default: 87380 bytes. This value results in window of 65535 with
default setting of tcp_adv_win_scale and tcp_app_win:0 and a bit
less for default tcp_app_win. See below about these variables.
max: maximal size of receive buffer allowed for automatically
selected receiver buffers for TCP socket. This value does not override
net.core.rmem_max. Calling setsockopt() with SO_RCVBUF disables
automatic tuning of that socket's receive buffer size, in which
case this value is ignored.
Default: between 87380B and 6MB, depending on RAM size.
tcp_wmem - vector of 3 INTEGERs: min, default, max
min: Amount of memory reserved for send buffers for TCP sockets.
Each TCP socket has rights to use it due to fact of its birth.
Default: 4K
default: initial size of send buffer used by TCP sockets. This
value overrides net.core.wmem_default used by other protocols.
It is usually lower than net.core.wmem_default.
Default: 16K
max: Maximal amount of memory allowed for automatically tuned
send buffers for TCP sockets. This value does not override
net.core.wmem_max. Calling setsockopt() with SO_SNDBUF disables
automatic tuning of that socket's send buffer size, in which case
this value is ignored.
Default: between 64K and 4MB, depending on RAM size.
```
Again, these are values whose defaults are calculated based on memory size.
As stated along the related settings in 27), I'm not sure we need this.
If we do, it should, from my understanding, not be a one-size-fits-all value.
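For scale, and assuming the common 4 KiB page size, the configured `tcp_mem` value translates from pages into a rather large absolute number:

```shell
# 8388608 pages * 4 KiB, expressed in GiB (so, 32 GiB for min/pressure/max)
pages=8388608
echo "$((pages * 4 / 1024 / 1024)) GiB"
```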
29)
```
net.ipv4.conf.all.log_martians = 0
net.ipv4.conf.default.log_martians = 0
```
This duplicates 10).
30)
```
net.ipv4.tcp_fin_timeout = 15
tcp_fin_timeout - INTEGER
The length of time an orphaned (no longer referenced by any
application) connection will remain in the FIN_WAIT_2 state
before it is aborted at the local end. While a perfectly
valid "receive only" state for an un-orphaned connection, an
orphaned connection in FIN_WAIT_2 state could otherwise wait
forever for the remote to close its end of the connection.
Cf. tcp_max_orphans
Default: 60 seconds
```
Interestingly, the kernel documentation omits the TCP specification part:
```
tcp_fin_timeout (integer; default: 60; since Linux 2.2)
This specifies how many seconds to wait for a final FIN packet before the socket
is forcibly closed. This is strictly a violation of the TCP specification, but
required to prevent denial-of-service attacks. In Linux 2.2, the default value was 180.
```
I do see that lowering this might be desirable to abort stale connections faster
on internet-facing hosts, but think this should not be altered on internal hosts.
31)
```
net.ipv4.tcp_keepalive_time = 300
tcp_keepalive_time - INTEGER
How often TCP sends out keepalive messages when keepalive is enabled.
Default: 2hours.
```
TLDP uses a few more words:
```
tcp_keepalive_time
the interval between the last data packet sent (simple ACKs are not considered data) and
the first keepalive probe; after the connection is marked to need keepalive, this counter
is not used any further
```
It seems this only affects applications explicitly using TCP with keepalive enabled.
I am confused by the comment stating "decrease the time default". According to my calculation, our
value equals five hours, which is three hours more than the default.
Though I might be confusing the units? If it is more than the default now,
it would make sense to remove it.
(cboltz adds: "The interesting question is which unit is used ;-)",
"Since other settings use seconds, I wouldn't be too surprised if 300 means 5 minutes")
32)
```
net.ipv4.tcp_keepalive_probes = 5
tcp_keepalive_probes - INTEGER
How many keepalive probes TCP sends out, until it decides that the
connection is broken. Default value: 9.
tcp_keepalive_probes (integer; default: 9; since Linux 2.2)
The maximum number of TCP keep-alive probes to send before giving up and killing the
connection if no response is obtained from the other end.
```
I can see why one might want to reduce keepalive probes, but I
cannot assess why 5 was chosen.
The article
https://webhostinggeeks.com/howto/tcp-keepalive-recommended-settings-and-best-practices/
suggests values between three and five are reasonable.
Since this setting is relatively easy to understand, I would be fine with keeping it
despite not understanding the exact choice of value, though of course I would
prefer to understand. :-)
(cboltz adds: "If the number would be 4, I'd say it was https://xkcd.com/221/ ;-)
(and I guess the reason for 5 is quite similar)",
"Keeping this indeed sounds useful.")
33)
```
net.ipv4.tcp_keepalive_intvl = 15
tcp_keepalive_intvl - INTEGER
How frequently the probes are send out. Multiplied by tcp_keepalive_probes it is time
to kill not responding connection, after probes started. Default value: 75sec i.e.
connection will be aborted after ~11 minutes of retries.
```
Same comment as in 32) applies.
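Assuming seconds throughout, the configured values in 31) through 33) together imply that a dead peer on a keepalive-enabled connection is detected after roughly:

```shell
# tcp_keepalive_time + tcp_keepalive_probes * tcp_keepalive_intvl
keepalive_time=300
probes=5
intvl=15
echo "$((keepalive_time + probes * intvl)) seconds"
```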
34)
```
net.ipv4.route.flush=1
net.ipv6.route.flush=1
```
Information on this is tough to find; there is a sysctl toggle with the same name/path which
can be used to trigger a one-time flush of the routing cache.
The answer in https://unix.stackexchange.com/a/734077 suggests that setting this to 1
in a configuration file is pointless. It is a bit more involved than some other toggles:
https://github.com/torvalds/linux/commit/39a23e75087ce815abbddbd565b9a2e567ac47da
and it is not quite clear when this would be triggered if not manually by writing to the sysctl path.
Signed-off-by: Georg Pfuetzenreuter <georg.pfuetzenreuter@suse.com>