aboutsummaryrefslogtreecommitdiff
path: root/include/net/netns
AgeCommit message (Collapse)Author
2021-04-26netfilter: remove all xt_table anchors from struct netFlorian Westphal
No longer needed, table pointer arg is now passed via netfilter core. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26netfilter: ebtables: remove the 3 ebtables pointers from struct netFlorian Westphal
ebtables stores the table internal data (what gets passed to the ebt_do_table() interpreter) in struct net. nftables keeps the internal interpreter format in pernet lists and passes it via the netfilter core infrastructure (priv pointer). Do the same for ebtables: the nf_hook_ops are duplicated via kmemdup, then the ops->priv pointer is set to the table that is being registered. After that, the netfilter core passes this table info to the hookfn. This allows to remove the pointers from struct net. Same pattern can be applied to ip/ip6/arptables. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-13netfilter: conntrack: convert sysctls to u8Florian Westphal
log_invalid sysctl allows values of 0 to 255 inclusive so we no longer need a range check: the min/max values can be removed. This also removes all member variables that were moved to net_generic data in previous patches. This reduces size of netns_ct struct by one cache line. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Conflicts: MAINTAINERS - keep Chandrasekar drivers/net/ethernet/mellanox/mlx5/core/en_main.c - simple fix + trust the code re-added to param.c in -next is fine include/linux/bpf.h - trivial include/linux/ethtool.h - trivial, fix kdoc while at it include/linux/skmsg.h - move to relevant place in tcp.c, comment re-wrapped net/core/skmsg.c - add the sk = sk // sk = NULL around calls net/tipc/crypto.c - trivial Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-04-09net: dccp: use net_generic storageFlorian Westphal
DCCP is virtually never used, so no need to use space in struct net for it. Put the pernet ipv4/v6 socket in the dccp ipv4/ipv6 modules instead. Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20210408174502.1625-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-04-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following batch contains Netfilter/IPVS updates for your net-next tree: 1) Simplify log infrastructure modularity: Merge ipv4, ipv6, bridge, netdev and ARP families to nf_log_syslog.c. Add module softdeps. This fixes a rare deadlock condition that might occur when log module autoload is required. From Florian Westphal. 2) Moves part of netfilter related pernet data from struct net to net_generic() infrastructure. All of these users can be modules, so if they are not loaded there is no need to waste space. Size reduction is 7 cachelines on x86_64, also from Florian. 2) Update nftables audit support to report events once per table, to get it aligned with iptables. From Richard Guy Briggs. 3) Check for stale routes from the flowtable garbage collector path. This is fixing IPv6 which breaks due missing check for the dst_cookie. 4) Add a nfnl_fill_hdr() function to simplify netlink + nfnetlink headers setup. 5) Remove documentation on several statified functions. 6) Remove printk on netns creation for the FTP IPVS tracker, from Florian Westphal. 7) Remove unnecessary nf_tables_destroy_list_lock spinlock initialization, from Yang Yingliang. 7) Remove a duplicated forward declaration in ipset, from Wan Jiabing. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-06net: remove obsolete members from struct netFlorian Westphal
all have been moved to generic_net infra. On x86_64, this reduces struct net size from 70 to 63 cache lines (4480 to 4032 byte). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-02net: reorganize fields in netns_mibEric Dumazet
Order fields to increase locality for most used protocols. udplite and icmp are moved at the end. Same for proc_net_devsnmp6 which is not used in fast path. This potentially saves one cache line miss for typical TCP/UDP over IPv4/IPv6. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31ipv6: move ip6_dst_ops first in netns_ipv6Eric Dumazet
ip6_dst_ops have cache line alignement. Moving it at beginning of netns_ipv6 removes a 48 byte hole, and shrinks netns_ipv6 from 12 to 11 cache lines. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31ipv6: convert elligible sysctls to u8Eric Dumazet
Convert most sysctls that can fit in a byte. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31tcp: convert tcp_comp_sack_nr sysctl to u8Eric Dumazet
tcp_comp_sack_nr max value was already 255. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31ipv4: convert igmp_link_local_mcast_reports sysctl to u8Eric Dumazet
This sysctl is a bool, can use less storage. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31ipv4: convert fib_multipath_{use_neigh|hash_policy} sysctls to u8Eric Dumazet
Make room for better packing of netns_ipv4 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31ipv4: convert udp_l3mdev_accept sysctl to u8Eric Dumazet
Reduce footprint of sysctls. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31ipv4: convert fib_notify_on_flag_change sysctl to u8Eric Dumazet
Reduce footprint of sysctls. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31inet: shrink netns_ipv4 by another cache lineEric Dumazet
By shuffling around some fields to remove 8 bytes of hole, we can save one cache line. pahole result before/after the patch : /* size: 768, cachelines: 12, members: 139 */ /* sum members: 673, holes: 11, sum holes: 39 */ /* padding: 56 */ /* paddings: 2, sum paddings: 7 */ /* forced alignments: 1 */ -> /* size: 704, cachelines: 11, members: 139 */ /* sum members: 673, holes: 10, sum holes: 31 */ /* paddings: 2, sum paddings: 7 */ /* forced alignments: 1 */ Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31inet: shrink inet_timewait_death_row by 48 bytesEric Dumazet
struct inet_timewait_death_row uses two cache lines, because we want tw_count to use a full cache line to avoid false sharing. Rework its definition and placement in netns_ipv4 so that: 1) We add 60 bytes of padding after tw_count to avoid false sharing, knowing that tcp_death_row will have ____cacheline_aligned_in_smp attribute. 2) We do not risk padding before tcp_death_row, because we move it at the beginning of netns_ipv4, even if new fields are added later. 3) We do not waste 48 bytes of padding after it. Note that I have not changed dccp. pahole result for struct netns_ipv4 before/after the patch : /* size: 832, cachelines: 13, members: 139 */ /* sum members: 721, holes: 12, sum holes: 95 */ /* padding: 16 */ /* paddings: 2, sum paddings: 55 */ -> /* size: 768, cachelines: 12, members: 139 */ /* sum members: 673, holes: 11, sum holes: 39 */ /* padding: 56 */ /* paddings: 2, sum paddings: 7 */ /* forced alignments: 1 */ Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-30net: add sysctl for enabling RFC 8335 PROBE messagesAndreas Roeseler
Section 8 of RFC 8335 specifies potential security concerns of responding to PROBE requests, and states that nodes that support PROBE functionality MUST be able to enable/disable responses and that responses MUST be disabled by default Signed-off-by: Andreas Roeseler <andreas.a.roeseler@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25tcp: convert elligible sysctls to u8Eric Dumazet
Many tcp sysctls are either bools or small ints that can fit into u8. Reducing space taken by sysctls can save few cache line misses when sending/receiving data while cpu caches are empty, for example after cpu idle period. This is hard to measure with typical network performance tests, but after this patch, struct netns_ipv4 has shrunk by three cache lines. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25inet: convert tcp_early_demux and udp_early_demux to u8Eric Dumazet
For these sysctls, their dedicated helpers have to use proc_dou8vec_minmax(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25ipv4: convert ip_forward_update_priority sysctl to u8Eric Dumazet
This sysctl uses ip_fwd_update_priority() helper, so the conversion needs to change it. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25ipv4: shrink netns_ipv4 with sysctl conversionsEric Dumazet
These sysctls that can fit in one byte instead of one int are converted to save space and thus reduce cache line misses. - icmp_echo_ignore_all, icmp_echo_ignore_broadcasts, - icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr - tcp_ecn, tcp_ecn_fallback - ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu - ip_nonlocal_bind, ip_autobind_reuse - ip_dynaddr, ip_early_demux, raw_l3mdev_accept - nexthop_compat_mode, fwmark_reflect Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24tcp_metrics: tcpm_hash_bucket is strictly localEric Dumazet
After commit 098a697b497e ("tcp_metrics: Use a single hash table for all network namespaces."), tcpm_hash_bucket is local to net/ipv4/tcp_metrics.c Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-22net: xfrm: Use sequence counter with associated spinlockAhmed S. Darwish
A sequence counter write section must be serialized or its internal state can get corrupted. A plain seqcount_t does not contain the information of which lock must be held to guaranteee write side serialization. For xfrm_state_hash_generation, use seqcount_spinlock_t instead of plain seqcount_t. This allows to associate the spinlock used for write serialization with the sequence counter. It thus enables lockdep to verify that the write serialization lock is indeed held before entering the sequence counter write section. If lockdep is disabled, this lock association is compiled out and has neither storage size nor runtime overhead. Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-03-22net: xfrm: Localize sequence counter per network namespaceAhmed S. Darwish
A sequence counter write section must be serialized or its internal state can get corrupted. The "xfrm_state_hash_generation" seqcount is global, but its write serialization lock (net->xfrm.xfrm_state_lock) is instantiated per network namespace. The write protection is thus insufficient. To provide full protection, localize the sequence counter per network namespace instead. This should be safe as both the seqcount read and write sections access data exclusively within the network namespace. It also lays the foundation for transforming "xfrm_state_hash_generation" data type from seqcount_t to seqcount_LOCKNAME_t in further commits. Fixes: b65e3d7be06f ("xfrm: state: add sequence count to detect hash resizes") Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-02-02net: ipv6: Emit notification when fib hardware flags are changedAmit Cohen
After installing a route to the kernel, user space receives an acknowledgment, which means the route was installed in the kernel, but not necessarily in hardware. The asynchronous nature of route installation in hardware can lead to a routing daemon advertising a route before it was actually installed in hardware. This can result in packet loss or mis-routed packets until the route is installed in hardware. It is also possible for a route already installed in hardware to change its action and therefore its flags. For example, a host route that is trapping packets can be "promoted" to perform decapsulation following the installation of an IPinIP/VXLAN tunnel. Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags are changed. The aim is to provide an indication to user-space (e.g., routing daemons) about the state of the route in hardware. Introduce a sysctl that controls this behavior. Keep the default value at 0 (i.e., do not emit notifications) for several reasons: - Multiple RTM_NEWROUTE notification per-route might confuse existing routing daemons. - Convergence reasons in routing daemons. - The extra notifications will negatively impact the insertion rate. - Not all users are interested in these notifications. Move fib6_info_hw_flags_set() to C file because it is no longer a short function. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-02net: ipv4: Emit notification when fib hardware flags are changedAmit Cohen
After installing a route to the kernel, user space receives an acknowledgment, which means the route was installed in the kernel, but not necessarily in hardware. The asynchronous nature of route installation in hardware can lead to a routing daemon advertising a route before it was actually installed in hardware. This can result in packet loss or mis-routed packets until the route is installed in hardware. It is also possible for a route already installed in hardware to change its action and therefore its flags. For example, a host route that is trapping packets can be "promoted" to perform decapsulation following the installation of an IPinIP/VXLAN tunnel. Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags are changed. The aim is to provide an indication to user-space (e.g., routing daemons) about the state of the route in hardware. Introduce a sysctl that controls this behavior. Keep the default value at 0 (i.e., do not emit notifications) for several reasons: - Multiple RTM_NEWROUTE notification per-route might confuse existing routing daemons. - Convergence reasons in routing daemons. - The extra notifications will negatively impact the insertion rate. - Not all users are interested in these notifications. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Acked-by: Roopa Prabhu <roopa@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-30sctp: add encap_port for netns sock asoc and transportXin Long
encap_port is added as per netns/sock/assoc/transport, and the latter one's encap_port inherits the former one's by default. The transport's encap_port value would mostly decide if one packet should go out with udp encapsulated or not. This patch also allows users to set netns' encap_port by sysctl. v1->v2: - Change to define encap_port as __be16 for sctp_sock, asoc and transport. v2->v3: - No change. v3->v4: - Add 'encap_port' entry in ip-sysctl.rst. v4->v5: - Improve the description of encap_port in ip-sysctl.rst. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-30sctp: create udp6 sock and set its encap_rcvXin Long
This patch is to add the udp6 sock part in sctp_udp_sock_start/stop(). udp_conf.use_udp6_rx_checksums is set to true, as: "The SCTP checksum MUST be computed for IPv4 and IPv6, and the UDP checksum SHOULD be computed for IPv4 and IPv6" says in rfc6951#section-5.3. v1->v2: - Add pr_err() when fails to create udp v6 sock. - Add #if IS_ENABLED(CONFIG_IPV6) not to create v6 sock when ipv6 is disabled. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-30sctp: create udp4 sock and add its encap_rcvXin Long
This patch is to add the functions to create/release udp4 sock, and set the sock's encap_rcv to process the incoming udp encap sctp packets. In sctp_udp_rcv(), as we can see, all we need to do is fix the transport header for sctp_rcv(), then it would implement the part of rfc6951#section-5.4: "When an encapsulated packet is received, the UDP header is removed. Then, the generic lookup is performed, as done by an SCTP stack whenever a packet is received, to find the association for the received SCTP packet" Note that these functions will be called in the last patch of this patchset when enabling this feature. v1->v2: - Add pr_err() when fails to create udp v4 sock. v2->v3: - Add 'select NET_UDP_TUNNEL' in sctp Kconfig. v3->v4: - No change. v4->v5: - Change to set udp_port to 0 by default. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-12can: remove obsolete version stringsOliver Hartkopp
As pointed out by Jakub Kicinski here: http://lore.kernel.org/r/20201009175751.5c54097f@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com this patch removes the obsolete version information of the different CAN protocols and the AF_CAN core module. Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://lore.kernel.org/r/20201012074354.25839-2-socketcan@hartkopp.net Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2020-09-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
Two minor conflicts: 1) net/ipv4/route.c, adding a new local variable while moving another local variable and removing it's initial assignment. 2) drivers/net/dsa/microchip/ksz9477.c, overlapping changes. One pretty prints the port mode differently, whilst another changes the driver to try and obtain the port mode from the port node rather than the switch node. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-15nexthop: Convert to blocking notification chainIdo Schimmel
Currently, the only listener of the nexthop notification chain is the VXLAN driver. Subsequent patches will add more listeners (e.g., device drivers such as netdevsim) that need to be able to block when processing notifications. Therefore, convert the notification chain to a blocking one. This is safe as notifications are always emitted from process context. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10tcp: reflect tos value received in SYN to the socketWei Wang
This commit adds a new TCP feature to reflect the tos value received in SYN, and send it out on the SYN-ACK, and eventually set the tos value of the established socket with this reflected tos value. This provides a way to set the traffic class/QoS level for all traffic in the same connection to be the same as the incoming SYN request. It could be useful in data centers to provide equivalent QoS according to the incoming request. This feature is guarded by /proc/sys/net/ipv4/tcp_reflect_tos, and is by default turned off. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-08netfilter: nf_tables: coalesce multiple notifications into one skbuffPablo Neira Ayuso
On x86_64, each notification results in one skbuff allocation which consumes at least 768 bytes due to the skbuff overhead. This patch coalesces several notifications into one single skbuff, so each notification consumes at least ~211 bytes, that ~3.5 times less memory consumption. As a result, this is reducing the chances to exhaust the netlink socket receive buffer. Rule of thumb is that each notification batch only contains netlink messages whose report flag is the same, nfnetlink_send() requires this to do appropriate delivery to userspace, either via unicast (echo mode) or multicast (monitor mode). The skbuff control buffer is used to annotate the report flag for later handling at the new coalescing routine. The batch skbuff notification size is NLMSG_GOODSIZE, using a larger skbuff would allow for more socket receiver buffer savings (to amortize the cost of the skbuff even more), however, going over that size might break userspace applications, so let's be conservative and stick to NLMSG_GOODSIZE. Reported-by: Phil Sutter <phil@nwl.cc> Acked-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-06-30bpf, netns: Keep a list of attached bpf_link'sJakub Sitnicki
To support multi-prog link-based attachments for new netns attach types, we need to keep track of more than one bpf_link per attach type. Hence, convert net->bpf.links into a list, that currently can be either empty or have just one item. Instead of reusing bpf_prog_list from bpf-cgroup, we link together bpf_netns_link's themselves. This makes list management simpler as we don't have to allocate, initialize, and later release list elements. We can do this because multi-prog attachment will be available only for bpf_link, and we don't need to build a list of programs attached directly and indirectly via links. No functional changes intended. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200625141357.910330-4-jakub@cloudflare.com
2020-06-30bpf, netns: Keep attached programs in bpf_prog_arrayJakub Sitnicki
Prepare for having multi-prog attachments for new netns attach types by storing programs to run in a bpf_prog_array, which is well suited for iterating over programs and running them in sequence. After this change bpf(PROG_QUERY) may block to allocate memory in bpf_prog_array_copy_to_user() for collected program IDs. This forces a change in how we protect access to the attached program in the query callback. Because bpf_prog_array_copy_to_user() can sleep, we switch from an RCU read lock to holding a mutex that serializes updaters. Because we allow only one BPF flow_dissector program to be attached to netns at all times, the bpf_prog_array pointed by net->bpf.run_array is always either detached (null) or one element long. No functional changes intended. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200625141357.910330-3-jakub@cloudflare.com
2020-06-01bpf: Add link-based BPF program attachment to network namespaceJakub Sitnicki
Extend bpf() syscall subcommands that operate on bpf_link, that is LINK_CREATE, LINK_UPDATE, OBJ_GET_INFO, to accept attach types tied to network namespaces (only flow dissector at the moment). Link-based and prog-based attachment can be used interchangeably, but only one can exist at a time. Attempts to attach a link when a prog is already attached directly, and the other way around, will be met with -EEXIST. Attempts to detach a program when link exists result in -EINVAL. Attachment of multiple links of same attach type to one netns is not supported with the intention to lift the restriction when a use-case presents itself. Because of that link create returns -E2BIG when trying to create another netns link, when one already exists. Link-based attachments to netns don't keep a netns alive by holding a ref to it. Instead links get auto-detached from netns when the latter is being destroyed, using a pernet pre_exit callback. When auto-detached, link lives in defunct state as long there are open FDs for it. -ENOLINK is returned if a user tries to update a defunct link. Because bpf_link to netns doesn't hold a ref to struct net, special care is taken when releasing, updating, or filling link info. The netns might be getting torn down when any of these link operations are in progress. That is why auto-detach and update/release/fill_info are synchronized by the same mutex. Also, link ops have to always check if auto-detach has not happened yet and if netns is still alive (refcnt > 0). Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200531082846.2117903-5-jakub@cloudflare.com
2020-06-01net: Introduce netns_bpf for BPF programs attached to netnsJakub Sitnicki
In order to: (1) attach more than one BPF program type to netns, or (2) support attaching BPF programs to netns with bpf_link, or (3) support multi-prog attach points for netns we will need to keep more state per netns than a single pointer like we have now for BPF flow dissector program. Prepare for the above by extracting netns_bpf that is part of struct net, for storing all state related to BPF programs attached to netns. Turn flow dissector callbacks for querying/attaching/detaching a program into generic ones that operate on netns_bpf. Next patch will move the generic callbacks into their own module. This is similar to how it is organized for cgroup with cgroup_bpf. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Cc: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/20200531082846.2117903-3-jakub@cloudflare.com
2020-05-22nexthop: add support for notifiersRoopa Prabhu
This patch adds nexthop add/del notifiers. To be used by vxlan driver in a later patch. Could possibly be used by switchdev drivers in the future. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30tcp: add hrtimer slack to sack compressionEric Dumazet
Add a sysctl to control hrtimer slack, default of 100 usec. This gives the opportunity to reduce system overhead, and help very short RTT flows. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-28net: ipv4: add sysctl for nexthop api compatibility modeRoopa Prabhu
Current route nexthop API maintains user space compatibility with old route API by default. Dumps and netlink notifications support both new and old API format. In systems which have moved to the new API, this compatibility mode cancels some of the performance benefits provided by the new nexthop API. This patch adds new sysctl nexthop_compat_mode which is on by default but provides the ability to turn off compatibility mode allowing systems to run entirely with the new routing API. Old route API behaviour and support is not modified by this sysctl. Uses a single sysctl to cover both ipv4 and ipv6 following other sysctls. Covers dumps and delete notifications as suggested by David Ahern. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29mptcp: add and use MIB counter infrastructureFlorian Westphal
Exported via same /proc file as the Linux TCP MIB counters, so "netstat -s" or "nstat" will show them automatically. The MPTCP MIB counters are allocated in a distinct pcpu area in order to avoid bloating/wasting TCP pcpu memory. Counters are allocated once the first MPTCP socket is created in a network namespace and free'd on exit. If no sockets have been allocated, all-zero mptcp counters are shown. The MIB counter list is taken from the multipath-tcp.org kernel, but only a few counters have been picked up so far. The counter list can be increased at any time later on. v2 -> v3: - remove 'inline' in foo.c files (David S. Miller) Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12tcp: bind(0) remove the SO_REUSEADDR restriction when ephemeral ports are ↵Kuniyuki Iwashima
exhausted. Commit aacd9289af8b82f5fb01bcdd53d0e3406d1333c7 ("tcp: bind() use stronger condition for bind_conflict") introduced a restriction to forbid to bind SO_REUSEADDR enabled sockets to the same (addr, port) tuple in order to assign ports dispersedly so that we can connect to the same remote host. The change results in accelerating port depletion so that we fail to bind sockets to the same local port even if we want to connect to the different remote hosts. You can reproduce this issue by following instructions below. 1. # sysctl -w net.ipv4.ip_local_port_range="32768 32768" 2. set SO_REUSEADDR to two sockets. 3. bind two sockets to (localhost, 0) and the latter fails. Therefore, when ephemeral ports are exhausted, bind(0) should fallback to the legacy behaviour to enable the SO_REUSEADDR option and make it possible to connect to different remote (addr, port) tuples. This patch allows us to bind SO_REUSEADDR enabled sockets to the same (addr, port) only when net.ipv4.ip_autobind_reuse is set 1 and all ephemeral ports are exhausted. This also allows connect() and listen() to share ports in the following way and may break some applications. So the ip_autobind_reuse is 0 by default and disables the feature. 1. setsockopt(sk1, SO_REUSEADDR) 2. setsockopt(sk2, SO_REUSEADDR) 3. bind(sk1, saddr, 0) 4. bind(sk2, saddr, 0) 5. connect(sk1, daddr) 6. listen(sk2) If it is set 1, we can fully utilize the 4-tuples, but we should use IP_BIND_ADDRESS_NO_PORT for bind()+connect() as possible. The notable thing is that if all sockets bound to the same port have both SO_REUSEADDR and SO_REUSEPORT enabled, we can bind sockets to an ephemeral port and also do listen(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
Minor conflict in mlx5 because changes happened to code that has moved meanwhile. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24netfilter: nf_tables: autoload modules from the abort pathPablo Neira Ayuso
This patch introduces a list of pending module requests. This new module list is composed of nft_module_request objects that contain the module name and one status field that tells if the module has been already loaded (the 'done' field). In the first pass, from the preparation phase, the netlink command finds that a module is missing on this list. Then, a module request is allocated and added to this list and nft_request_module() returns -EAGAIN. This triggers the abort path with the autoload parameter set on from nfnetlink, request_module() is called and the module request enters the 'done' state. Since the mutex is released when loading modules from the abort phase, the module list is zapped so this is iteration occurs over a local list. Therefore, the request_module() calls happen when object lists are in consistent state (after fulling aborting the transaction) and the commit list is empty. On the second pass, the netlink command will find that it already tried to load the module, so it does not request it again and nft_request_module() returns 0. Then, there is a look up to find the object that the command was missing. If the module was successfully loaded, the command proceeds normally since it finds the missing object in place, otherwise -ENOENT is reported to userspace. This patch also updates nfnetlink to include the reason to enter the abort phase, which is required for this new autoload module rationale. Fixes: ec7470b834fe ("netfilter: nf_tables: store transaction list locally while requesting module") Reported-by: syzbot+29125d208b3dae9a7019@syzkaller.appspotmail.com Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-12-09net-tcp: Disable TCP ssthresh metrics cache by defaultKevin(Yudong) Yang
This patch introduces a sysctl knob "net.ipv4.tcp_no_ssthresh_metrics_save" that disables TCP ssthresh metrics cache by default. Other parts of TCP metrics cache, e.g. rtt, cwnd, remain unchanged. As modern networks becoming more and more dynamic, TCP metrics cache today often causes more harm than benefits. For example, the same IP address is often shared by different subscribers behind NAT in residential networks. Even if the IP address is not shared by different users, caching the slow-start threshold of a previous short flow using loss-based congestion control (e.g. cubic) often causes the future longer flows of the same network path to exit slow-start prematurely with abysmal throughput. Caching ssthresh is very risky and can lead to terrible performance. Therefore it makes sense to make disabling ssthresh caching by default and opt-in for specific networks by the administrators. This practice also has worked well for several years of deployment with CUBIC congestion control at Google. Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Kevin(Yudong) Yang <yyd@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-21ipv6: keep track of routes using srcPaolo Abeni
Use a per namespace counter, increment it on successful creation of any route using the source address, decrement it on deletion of such routes. This allows us to check easily if the routing decision in the current namespace depends on the packet source. Will be used by the next patch. Suggested-by: David Ahern <dsahern@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08sctp: add support for Primary Path SwitchoverXin Long
This is a new feature defined in section 5 of rfc7829: "Primary Path Switchover". By introducing a new tunable parameter: Primary.Switchover.Max.Retrans (PSMR) The primary path will be changed to another active path when the path error counter on the old primary path exceeds PSMR, so that "the SCTP sender is allowed to continue data transmission on a new working path even when the old primary destination address becomes active again". This patch is to add this tunable parameter, 'ps_retrans' per netns, sock, asoc and transport. It also allows a user to change ps_retrans per netns by sysctl, and ps_retrans per sock/asoc/transport will be initialized with it. The check will be done in sctp_do_8_2_transport_strike() when this feature is enabled. Note this feature is disabled by initializing 'ps_retrans' per netns as 0xffff by default, and its value can't be less than 'pf_retrans' when changing by sysctl. v3->v4: - add define SCTP_PS_RETRANS_MAX 0xffff, and use it on extra2 of sysctl 'ps_retrans'. - add a new entry for ps_retrans on ip-sysctl.txt. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08sctp: add pf_expose per netns and sock and asocXin Long
As said in rfc7829, section 3, point 12: The SCTP stack SHOULD expose the PF state of its destination addresses to the ULP as well as provide the means to notify the ULP of state transitions of its destination addresses from active to PF, and vice versa. However, it is recommended that an SCTP stack implementing SCTP-PF also allows for the ULP to be kept ignorant of the PF state of its destinations and the associated state transitions, thus allowing for retention of the simpler state transition model of [RFC4960] in the ULP. Not only does it allow to expose the PF state to ULP, but also allow to ignore sctp-pf to ULP. So this patch is to add pf_expose per netns, sock and asoc. And in sctp_assoc_control_transport(), ulp_notify will be set to false if asoc->expose is not 'enabled' in next patch. It also allows a user to change pf_expose per netns by sysctl, and pf_expose per sock and asoc will be initialized with it. Note that pf_expose also works for SCTP_GET_PEER_ADDR_INFO sockopt, to not allow a user to query the state of a sctp-pf peer address when pf_expose is 'disabled', as said in section 7.3. v1->v2: - Fix a build warning noticed by Nathan Chancellor. v2->v3: - set pf_expose to UNUSED by default to keep compatible with old applications. v3->v4: - add a new entry for pf_expose on ip-sysctl.txt, as Marcelo suggested. - change this patch to 1/5, and move sctp_assoc_control_transport change into 2/5, as Marcelo suggested. - use SCTP_PF_EXPOSE_UNSET instead of SCTP_PF_EXPOSE_UNUSED, and set SCTP_PF_EXPOSE_UNSET to 0 in enum, as Marcelo suggested. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>