summaryrefslogtreecommitdiffstats
path: root/net/netfilter
AgeCommit message (Collapse)Author
2020-01-12netfilter: ctnetlink: netns exit must wait for callbacksFlorian Westphal
[ Upstream commit 18a110b022a5c02e7dc9f6109d0bd93e58ac6ebb ] Curtis Taylor and Jon Maxwell reported and debugged a crash on 3.10 based kernel. Crash occurs in ctnetlink_conntrack_events because net->nfnl socket is NULL. The nfnl socket was set to NULL by netns destruction running on another cpu. The exiting network namespace calls the relevant destructors in the following order: 1. ctnetlink_net_exit_batch This nulls out the event callback pointer in struct netns. 2. nfnetlink_net_exit_batch This nulls net->nfnl socket and frees it. 3. nf_conntrack_cleanup_net_list This removes all remaining conntrack entries. This is order is correct. The only explanation for the crash so ar is: cpu1: conntrack is dying, eviction occurs: -> nf_ct_delete() -> nf_conntrack_event_report \ -> nf_conntrack_eventmask_report -> notify->fcn() (== ctnetlink_conntrack_events). cpu1: a. fetches rcu protected pointer to obtain ctnetlink event callback. b. gets interrupted. cpu2: runs netns exit handlers: a runs ctnetlink destructor, event cb pointer set to NULL. b runs nfnetlink destructor, nfnl socket is closed and set to NULL. cpu1: c. resumes and trips over NULL net->nfnl. Problem appears to be that ctnetlink_net_exit_batch only prevents future callers of nf_conntrack_eventmask_report() from obtaining the callback. It doesn't wait of other cpus that might have already obtained the callbacks address. I don't see anything in upstream kernels that would prevent similar crash: We need to wait for all cpus to have exited the event callback. Fixes: 9592a5c01e79dbc59eb56fa ("netfilter: ctnetlink: netns support") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-09netfilter: nft_tproxy: Fix port selector on Big EndianPhil Sutter
[ Upstream commit 8cb4ec44de42b99b92399b4d1daf3dc430ed0186 ] On Big Endian architectures, u16 port value was extracted from the wrong parts of u32 sreg_port, just like commit 10596608c4d62 ("netfilter: nf_tables: fix mismatch in big-endian system") describes. Fixes: 4ed8eb6570a49 ("netfilter: nf_tables: Add native tproxy support") Signed-off-by: Phil Sutter <phil@nwl.cc> Acked-by: Florian Westphal <fw@strlen.de> Acked-by: Máté Eckl <ecklm94@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-04net: add bool confirm_neigh parameter for dst_ops.update_pmtuHangbin Liu
[ Upstream commit bd085ef678b2cc8c38c105673dfe8ff8f5ec0c57 ] The MTU update code is supposed to be invoked in response to real networking events that update the PMTU. In IPv6 PMTU update function __ip6_rt_update_pmtu() we called dst_confirm_neigh() to update neighbor confirmed time. But for tunnel code, it will call pmtu before xmit, like: - tnl_update_pmtu() - skb_dst_update_pmtu() - ip6_rt_update_pmtu() - __ip6_rt_update_pmtu() - dst_confirm_neigh() If the tunnel remote dst mac address changed and we still do the neigh confirm, we will not be able to update neigh cache and ping6 remote will failed. So for this ip_tunnel_xmit() case, _EVEN_ if the MTU is changed, we should not be invoking dst_confirm_neigh() as we have no evidence of successful two-way communication at this point. On the other hand it is also important to keep the neigh reachability fresh for TCP flows, so we cannot remove this dst_confirm_neigh() call. To fix the issue, we have to add a new bool parameter for dst_ops.update_pmtu to choose whether we should do neigh update or not. I will add the parameter in this patch and set all the callers to true to comply with the previous way, and fix the tunnel code one by one on later patches. v5: No change. v4: No change. v3: Do not remove dst_confirm_neigh, but add a new bool parameter in dst_ops.update_pmtu to control whether we should do neighbor confirm. Also split the big patch to small ones for each area. v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu. Suggested-by: David Miller <davem@davemloft.net> Reviewed-by: Guillaume Nault <gnault@redhat.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-04netfilter: nf_queue: enqueue skbs with NULL dstMarco Oliverio
[ Upstream commit 0b9173f4688dfa7c5d723426be1d979c24ce3d51 ] Bridge packets that are forwarded have skb->dst == NULL and get dropped by the check introduced by b60a77386b1d4868f72f6353d35dabe5fbe981f2 (net: make skb_dst_force return true when dst is refcounted). To fix this we check skb_dst() before skb_dst_force(), so we don't drop skb packet with dst == NULL. This holds also for skb at the PRE_ROUTING hook so we remove the second check. Fixes: b60a77386b1d ("net: make skb_dst_force return true when dst is refcounted") Signed-off-by: Marco Oliverio <marco.oliverio@tanaza.com> Signed-off-by: Rocco Folino <rocco.folino@tanaza.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-13netfilter: nf_tables: don't use position attribute on rule replacementFlorian Westphal
[ Upstream commit 447750f281abef547be44fdcfe3bc4447b3115a8 ] Its possible to set both HANDLE and POSITION when replacing a rule. In this case, the rule at POSITION gets replaced using the userspace-provided handle. Rule handles are supposed to be generated by the kernel only. Duplicate handles should be harmless, however better disable this "feature" by only checking for the POSITION attribute on insert operations. Fixes: 5e94846686d0 ("netfilter: nf_tables: add insert operation") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-05netfilter: nf_tables: fix a missing check of nla_put_failureKangjie Lu
[ Upstream commit eb8950861c1bfd3eecc8f6faad213e3bca0dc395 ] If nla_nest_start() may fail. The fix checks its return value and goes to nla_put_failure if it fails. Signed-off-by: Kangjie Lu <kjlu@umn.edu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-05netfilter: nf_nat_sip: fix RTP/RTCP source port translationsAlin Nastac
[ Upstream commit 8294059931448aa1ca112615bdffa3eab552c382 ] Each media stream negotiation between 2 SIP peers will trigger creation of 4 different expectations (2 RTP and 2 RTCP): - INVITE will create expectations for the media packets sent by the called peer - reply to the INVITE will create expectations for media packets sent by the caller The dport used by these expectations usually match the ones selected by the SIP peers, but they might get translated due to conflicts with another expectation. When such event occur, it is important to do this translation in both directions, dport translation on the receiving path and sport translation on the sending path. This commit fixes the sport translation when the peer requiring it is also the one that starts the media stream. In this scenario, first media stream packet is forwarded from LAN to WAN and will rely on nf_nat_sip_expected() to do the necessary sport translation. However, the expectation matched by this packet does not contain the necessary information for doing SNAT, this data being stored in the paired expectation created by the sender's SIP message (INVITE or reply to it). Signed-off-by: Alin Nastac <alin.nastac@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-24netfilter: nft_compat: do not dump private areaPablo Neira Ayuso
[ Upstream commit d701d8117200399d85e63a737d2e4e897932f3b6 ] Zero pad private area, otherwise we expose private kernel pointer to userspace. This patch also zeroes the tail area after the ->matchsize and ->targetsize that results from XT_ALIGN(). Fixes: 0ca743a55991 ("netfilter: nf_tables: add compatibility layer for x_tables") Reported-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-20netfilter: nf_tables: avoid BUG_ON usageFlorian Westphal
[ Upstream commit fa5950e498e7face21a1761f327e6c1152f778c3 ] None of these spots really needs to crash the kernel. In one two cases we can jsut report error to userspace, in the other cases we can just use WARN_ON (and leak memory instead). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-12netfilter: ipset: Copy the right MAC address in hash:ip,mac IPv6 setsStefano Brivio
[ Upstream commit 97664bc2c77e2b65cdedddcae2643fc93291d958 ] Same as commit 1b4a75108d5b ("netfilter: ipset: Copy the right MAC address in bitmap:ip,mac and hash:ip,mac sets"), another copy and paste went wrong in commit 8cc4ccf58379 ("netfilter: ipset: Allow matching on destination MAC address for mac and ipmac sets"). When I fixed this for IPv4 in 1b4a75108d5b, I didn't realise that hash:ip,mac sets also support IPv6 as family, and this is covered by a separate function, hash_ipmac6_kadt(). In hash:ip,mac sets, the first dimension is the IP address, and the second dimension is the MAC address: check the IPSET_DIM_TWO_SRC flag in flags while deciding which MAC address to copy, destination or source. This way, mixing source and destination matches for the two dimensions of ip,mac hash type works as expected, also for IPv6. With this setup: ip netns add A ip link add veth1 type veth peer name veth2 netns A ip addr add 2001:db8::1/64 dev veth1 ip -net A addr add 2001:db8::2/64 dev veth2 ip link set veth1 up ip -net A link set veth2 up dst=$(ip netns exec A cat /sys/class/net/veth2/address) ip netns exec A ipset create test_hash hash:ip,mac family inet6 ip netns exec A ipset add test_hash 2001:db8::1,${dst} ip netns exec A ip6tables -A INPUT -p icmpv6 --icmpv6-type 135 -j ACCEPT ip netns exec A ip6tables -A INPUT -m set ! --match-set test_hash src,dst -j DROP ipset now correctly matches a test packet: # ping -c1 2001:db8::2 >/dev/null # echo $? 0 Reported-by: Chen, Yi <yiche@redhat.com> Fixes: 8cc4ccf58379 ("netfilter: ipset: Allow matching on destination MAC address for mac and ipmac sets") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-12ipvs: move old_secure_tcp into struct netns_ipvsEric Dumazet
[ Upstream commit c24b75e0f9239e78105f81c5f03a751641eb07ef ] syzbot reported the following issue : BUG: KCSAN: data-race in update_defense_level / update_defense_level read to 0xffffffff861a6260 of 4 bytes by task 3006 on cpu 1: update_defense_level+0x621/0xb30 net/netfilter/ipvs/ip_vs_ctl.c:177 defense_work_handler+0x3d/0xd0 net/netfilter/ipvs/ip_vs_ctl.c:225 process_one_work+0x3d4/0x890 kernel/workqueue.c:2269 worker_thread+0xa0/0x800 kernel/workqueue.c:2415 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352 write to 0xffffffff861a6260 of 4 bytes by task 7333 on cpu 0: update_defense_level+0xa62/0xb30 net/netfilter/ipvs/ip_vs_ctl.c:205 defense_work_handler+0x3d/0xd0 net/netfilter/ipvs/ip_vs_ctl.c:225 process_one_work+0x3d4/0x890 kernel/workqueue.c:2269 worker_thread+0xa0/0x800 kernel/workqueue.c:2415 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 7333 Comm: kworker/0:5 Not tainted 5.4.0-rc3+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: events defense_work_handler Indeed, old_secure_tcp is currently a static variable, while it needs to be a per netns variable. Fixes: a0840e2e165a ("IPVS: netns, ip_vs_ctl local vars moved to ipvs struct.") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-12ipvs: don't ignore errors in case refcounting ip_vs module failsDavide Caratti
[ Upstream commit 62931f59ce9cbabb934a431f48f2f1f441c605ac ] if the IPVS module is removed while the sync daemon is starting, there is a small gap where try_module_get() might fail getting the refcount inside ip_vs_use_count_inc(). Then, the refcounts of IPVS module are unbalanced, and the subsequent call to stop_sync_thread() causes the following splat: WARNING: CPU: 0 PID: 4013 at kernel/module.c:1146 module_put.part.44+0x15b/0x290 Modules linked in: ip_vs(-) nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 veth ip6table_filter ip6_tables iptable_filter binfmt_misc intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ext4 mbcache jbd2 ghash_clmulni_intel snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_nhlt snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm aesni_intel crypto_simd cryptd glue_helper joydev pcspkr snd_timer virtio_balloon snd soundcore i2c_piix4 nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c ata_generic pata_acpi virtio_net net_failover virtio_blk failover virtio_console qxl drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ata_piix ttm crc32c_intel serio_raw drm virtio_pci libata virtio_ring virtio floppy dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nf_defrag_ipv6] CPU: 0 PID: 4013 Comm: modprobe Tainted: G W 5.4.0-rc1.upstream+ #741 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:module_put.part.44+0x15b/0x290 Code: 04 25 28 00 00 00 0f 85 18 01 00 00 48 83 c4 68 5b 5d 41 5c 41 5d 41 5e 41 5f c3 89 44 24 28 83 e8 01 89 c5 0f 89 57 ff ff ff <0f> 0b e9 78 ff ff ff 65 8b 1d 67 83 26 4a 89 db be 08 00 00 00 48 RSP: 0018:ffff888050607c78 EFLAGS: 00010297 RAX: 0000000000000003 RBX: ffffffffc1420590 RCX: ffffffffb5db0ef9 RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffffffffc1420590 RBP: 00000000ffffffff R08: fffffbfff82840b3 R09: fffffbfff82840b3 R10: 0000000000000001 R11: fffffbfff82840b2 R12: 1ffff1100a0c0f90 R13: ffffffffc1420200 R14: ffff88804f533300 R15: ffff88804f533ca0 FS: 00007f8ea9720740(0000) GS:ffff888053800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f3245abe000 CR3: 000000004c28a006 CR4: 00000000001606f0 Call Trace: stop_sync_thread+0x3a3/0x7c0 [ip_vs] ip_vs_sync_net_cleanup+0x13/0x50 [ip_vs] ops_exit_list.isra.5+0x94/0x140 unregister_pernet_operations+0x29d/0x460 unregister_pernet_device+0x26/0x60 ip_vs_cleanup+0x11/0x38 [ip_vs] __x64_sys_delete_module+0x2d5/0x400 do_syscall_64+0xa5/0x4e0 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7f8ea8bf0db7 Code: 73 01 c3 48 8b 0d b9 80 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 89 80 2c 00 f7 d8 64 89 01 48 RSP: 002b:00007ffcd38d2fe8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000000002436240 RCX: 00007f8ea8bf0db7 RDX: 0000000000000000 RSI: 0000000000000800 RDI: 00000000024362a8 RBP: 0000000000000000 R08: 00007f8ea8eba060 R09: 00007f8ea8c658a0 R10: 00007ffcd38d2a60 R11: 0000000000000206 R12: 0000000000000000 R13: 0000000000000001 R14: 00000000024362a8 R15: 0000000000000000 irq event stamp: 4538 hardirqs last enabled at (4537): [<ffffffffb6193dde>] quarantine_put+0x9e/0x170 hardirqs last disabled at (4538): [<ffffffffb5a0556a>] trace_hardirqs_off_thunk+0x1a/0x20 softirqs last enabled at (4522): [<ffffffffb6f8ebe9>] sk_common_release+0x169/0x2d0 softirqs last disabled at (4520): [<ffffffffb6f8eb3e>] sk_common_release+0xbe/0x2d0 Check the return value of ip_vs_use_count_inc() and let its caller return proper error. Inside do_ip_vs_set_ctl() the module is already refcounted, we don't need refcount/derefcount there. Finally, in register_ip_vs_app() and start_sync_thread(), take the module refcount earlier and ensure it's released in the error path. Change since v1: - better return values in case of failure of ip_vs_use_count_inc(), thanks to Julian Anastasov - no need to increase/decrease the module refcount in ip_vs_set_ctl(), thanks to Julian Anastasov Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-12netfilter: nf_flow_table: set timeout before insertion into hashesPablo Neira Ayuso
[ Upstream commit daf61b026f4686250e6afa619e6d7b49edc61df7 ] Other garbage collector might remove an entry not fully set up yet. [570953.958293] RIP: 0010:memcmp+0x9/0x50 [...] [570953.958567] flow_offload_hash_cmp+0x1e/0x30 [nf_flow_table] [570953.958585] flow_offload_lookup+0x8c/0x110 [nf_flow_table] [570953.958606] nf_flow_offload_ip_hook+0x135/0xb30 [nf_flow_table] [570953.958624] nf_flow_offload_inet_hook+0x35/0x37 [nf_flow_table_inet] [570953.958646] nf_hook_slow+0x3c/0xb0 [570953.958664] __netif_receive_skb_core+0x90f/0xb10 [570953.958678] ? ip_rcv_finish+0x82/0xa0 [570953.958692] __netif_receive_skb_one_core+0x3b/0x80 [570953.958711] __netif_receive_skb+0x18/0x60 [570953.958727] netif_receive_skb_internal+0x45/0xf0 [570953.958741] napi_gro_receive+0xcd/0xf0 [570953.958764] ixgbe_clean_rx_irq+0x432/0xe00 [ixgbe] [570953.958782] ixgbe_poll+0x27b/0x700 [ixgbe] [570953.958796] net_rx_action+0x284/0x3c0 [570953.958817] __do_softirq+0xcc/0x27c [570953.959464] irq_exit+0xe8/0x100 [570953.960097] do_IRQ+0x59/0xe0 [570953.960734] common_interrupt+0xf/0xf Fixes: 43c8f131184f ("netfilter: nf_flow_table: fix missing error check for rhashtable_insert_fast") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-12netfilter: ipset: Fix an error code in ip_set_sockfn_get()Dan Carpenter
commit 30b7244d79651460ff114ba8f7987ed94c86b99a upstream. The copy_to_user() function returns the number of bytes remaining to be copied. In this code, that positive return is checked at the end of the function and we return zero/success. What we should do instead is return -EFAULT. Fixes: a7b4f989a629 ("netfilter: ipset: IP set core support") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-06netfilter: ipset: Make invalid MAC address checks consistentStefano Brivio
[ Upstream commit 29edbc3ebdb0faa934114f14bf12fc0b784d4f1b ] Set types bitmap:ipmac and hash:ipmac check that MAC addresses are not all zeroes. Introduce one missing check, and make the remaining ones consistent, using is_zero_ether_addr() instead of comparing against an array containing zeroes. This was already done for hash:mac sets in commit 26c97c5d8dac ("netfilter: ipset: Use is_zero_ether_addr instead of static and memcmp"). Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-29netfilter: nft_connlimit: disable bh on garbage collectionPablo Neira Ayuso
[ Upstream commit 34a4c95abd25ab41fb390b985a08a651b1fa0b0f ] BH must be disabled when invoking nf_conncount_gc_list() to perform garbage collection, otherwise deadlock might happen. nf_conncount_add+0x1f/0x50 [nf_conncount] nft_connlimit_eval+0x4c/0xe0 [nft_connlimit] nft_dynset_eval+0xb5/0x100 [nf_tables] nft_do_chain+0xea/0x420 [nf_tables] ? sch_direct_xmit+0x111/0x360 ? noqueue_init+0x10/0x10 ? __qdisc_run+0x84/0x510 ? tcp_packet+0x655/0x1610 [nf_conntrack] ? ip_finish_output2+0x1a7/0x430 ? tcp_error+0x130/0x150 [nf_conntrack] ? nf_conntrack_in+0x1fc/0x4c0 [nf_conntrack] nft_do_chain_ipv4+0x66/0x80 [nf_tables] nf_hook_slow+0x44/0xc0 ip_rcv+0xb5/0xd0 ? ip_rcv_finish_core.isra.19+0x360/0x360 __netif_receive_skb_one_core+0x52/0x70 netif_receive_skb_internal+0x34/0xe0 napi_gro_receive+0xba/0xe0 e1000_clean_rx_irq+0x1e9/0x420 [e1000e] e1000e_poll+0xbe/0x290 [e1000e] net_rx_action+0x149/0x3b0 __do_softirq+0xde/0x2d8 irq_exit+0xba/0xc0 do_IRQ+0x85/0xd0 common_interrupt+0xf/0xf </IRQ> RIP: 0010:nf_conncount_gc_list+0x3b/0x130 [nf_conncount] Fixes: 2f971a8f4255 ("netfilter: nf_conncount: move all list iterations under spinlock") Reported-by: Laura Garcia Liebana <nevola@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-11netfilter: nf_tables: allow lookups in dynamic setsFlorian Westphal
[ Upstream commit acab713177377d9e0889c46bac7ff0cfb9a90c4d ] This un-breaks lookups in sets that have the 'dynamic' flag set. Given this active example configuration: table filter { set set1 { type ipv4_addr size 64 flags dynamic,timeout timeout 1m } chain input { type filter hook input priority 0; policy accept; } } ... this works: nft add rule ip filter input add @set1 { ip saddr } -> whenever rule is triggered, the source ip address is inserted into the set (if it did not exist). This won't work: nft add rule ip filter input ip saddr @set1 counter Error: Could not process rule: Operation not supported In other words, we can add entries to the set, but then can't make matching decision based on that set. That is just wrong -- all set backends support lookups (else they would not be very useful). The failure comes from an explicit rejection in nft_lookup.c. Looking at the history, it seems like NFT_SET_EVAL used to mean 'set contains expressions' (aka. "is a meter"), for instance something like nft add rule ip filter input meter example { ip saddr limit rate 10/second } or nft add rule ip filter input meter example { ip saddr counter } The actual meaning of NFT_SET_EVAL however, is 'set can be updated from the packet path'. 'meters' and packet-path insertions into sets, such as 'add @set { ip saddr }' use exactly the same kernel code (nft_dynset.c) and thus require a set backend that provides the ->update() function. The only set that provides this also is the only one that has the NFT_SET_EVAL feature flag. Removing the wrong check makes the above example work. While at it, also fix the flag check during set instantiation to allow supported combinations only. Fixes: 8aeff920dcc9b3f ("netfilter: nf_tables: add stateful object reference to set elements") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-01netfilter: nft_socket: fix erroneous socket assignmentFernando Fernandez Mancera
[ Upstream commit 039b1f4f24ecc8493b6bb9d70b4b78750d1b35c2 ] The socket assignment is wrong, see skb_orphan(): When skb->destructor callback is not set, but skb->sk is set, this hits BUG(). Link: https://bugzilla.redhat.com/show_bug.cgi?id=1651813 Fixes: 554ced0a6e29 ("netfilter: nf_tables: add support for native socket matching") Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-09-21netfilter: nf_conntrack_ftp: Fix debug outputThomas Jarosch
[ Upstream commit 3a069024d371125227de3ac8fa74223fcf473520 ] The find_pattern() debug output was printing the 'skip' character. This can be a NULL-byte and messes up further pr_debug() output. Output without the fix: kernel: nf_conntrack_ftp: Pattern matches! kernel: nf_conntrack_ftp: Skipped up to `<7>nf_conntrack_ftp: find_pattern `PORT': dlen = 8 kernel: nf_conntrack_ftp: find_pattern `EPRT': dlen = 8 Output with the fix: kernel: nf_conntrack_ftp: Pattern matches! kernel: nf_conntrack_ftp: Skipped up to 0x0 delimiter! kernel: nf_conntrack_ftp: Match succeeded! kernel: nf_conntrack_ftp: conntrack_ftp: match `172,17,0,100,200,207' (20 bytes at 4150681645) kernel: nf_conntrack_ftp: find_pattern `PORT': dlen = 8 Signed-off-by: Thomas Jarosch <thomas.jarosch@intra2net.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-09-21netfilter: xt_physdev: Fix spurious error message in physdev_mt_checkTodd Seidelmann
[ Upstream commit 3cf2f450fff304be9cf4868bf0df17f253bc5b1c ] Simplify the check in physdev_mt_check() to emit an error message only when passed an invalid chain (ie, NF_INET_LOCAL_OUT). This avoids cluttering up the log with errors against valid rules. For large/heavily modified rulesets, current behavior can quickly overwhelm the ring buffer, because this function gets called on every change, regardless of the rule that was changed. Signed-off-by: Todd Seidelmann <tseidelmann@linode.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-09-21netfilter: xt_nfacct: Fix alignment mismatch in xt_nfacct_match_infoJuliana Rodrigueiro
[ Upstream commit 89a26cd4b501e9511d3cd3d22327fc76a75a38b3 ] When running a 64-bit kernel with a 32-bit iptables binary, the size of the xt_nfacct_match_info struct diverges. kernel: sizeof(struct xt_nfacct_match_info) : 40 iptables: sizeof(struct xt_nfacct_match_info)) : 36 Trying to append nfacct related rules results in an unhelpful message. Although it is suggested to look for more information in dmesg, nothing can be found there. # iptables -A <chain> -m nfacct --nfacct-name <acct-object> iptables: Invalid argument. Run `dmesg' for more information. This patch fixes the memory misalignment by enforcing 8-byte alignment within the struct's first revision. This solution is often used in many other uapi netfilter headers. Signed-off-by: Juliana Rodrigueiro <juliana.rodrigueiro@intra2net.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-09-21netfilter: nft_flow_offload: missing netlink attribute policyPablo Neira Ayuso
[ Upstream commit 14c415862c0630e01712a4eeaf6159a2b1b6d2a4 ] The netlink attribute policy for NFTA_FLOW_TABLE_NAME is missing. Fixes: a3c90f7a2323 ("netfilter: nf_tables: flow offload expression") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-09-21netfilter: nf_flow_table: set default timeout after successful insertionPablo Neira Ayuso
commit 110e48725db6262f260f10727d0fb2d3d25895e4 upstream. Set up the default timeout for this new entry otherwise the garbage collector might quickly remove it right after the flowtable insertion. Fixes: ac2a66665e23 ("netfilter: add generic flow table infrastructure") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-10netfilter: nft_flow_offload: skip tcp rst and fin packetsPablo Neira Ayuso
[ Upstream commit dfe42be15fde16232340b8b2a57c359f51cc10d9 ] TCP rst and fin packets do not qualify to place a flow into the flowtable. Most likely there will be no more packets after connection closure. Without this patch, this flow entry expires and connection tracking picks up the entry in ESTABLISHED state using the fixup timeout, which makes this look inconsistent to the user for a connection that is actually already closed. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-09-10netfilter: nf_tables: use-after-free in failing rule with bound setPablo Neira Ayuso
[ Upstream commit 6a0a8d10a3661a036b55af695542a714c429ab7c ] If a rule that has already a bound anonymous set fails to be added, the preparation phase releases the rule and the bound set. However, the transaction object from the abort path still has a reference to the set object that is stale, leading to a use-after-free when checking for the set->bound field. Add a new field to the transaction that specifies if the set is bound, so the abort path can skip releasing it since the rule command owns it and it takes care of releasing it. After this update, the set->bound field is removed. [ 24.649883] Unable to handle kernel paging request at virtual address 0000000000040434 [ 24.657858] Mem abort info: [ 24.660686] ESR = 0x96000004 [ 24.663769] Exception class = DABT (current EL), IL = 32 bits [ 24.669725] SET = 0, FnV = 0 [ 24.672804] EA = 0, S1PTW = 0 [ 24.675975] Data abort info: [ 24.678880] ISV = 0, ISS = 0x00000004 [ 24.682743] CM = 0, WnR = 0 [ 24.685723] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000428952000 [ 24.692207] [0000000000040434] pgd=0000000000000000 [ 24.697119] Internal error: Oops: 96000004 [#1] SMP [...] [ 24.889414] Call trace: [ 24.891870] __nf_tables_abort+0x3f0/0x7a0 [ 24.895984] nf_tables_abort+0x20/0x40 [ 24.899750] nfnetlink_rcv_batch+0x17c/0x588 [ 24.904037] nfnetlink_rcv+0x13c/0x190 [ 24.907803] netlink_unicast+0x18c/0x208 [ 24.911742] netlink_sendmsg+0x1b0/0x350 [ 24.915682] sock_sendmsg+0x4c/0x68 [ 24.919185] ___sys_sendmsg+0x288/0x2c8 [ 24.923037] __sys_sendmsg+0x7c/0xd0 [ 24.926628] __arm64_sys_sendmsg+0x2c/0x38 [ 24.930744] el0_svc_common.constprop.0+0x94/0x158 [ 24.935556] el0_svc_handler+0x34/0x90 [ 24.939322] el0_svc+0x8/0xc [ 24.942216] Code: 37280300 f9404023 91014262 aa1703e0 (f9401863) [ 24.948336] ---[ end trace cebbb9dcbed3b56f ]--- Fixes: f6ac85858976 ("netfilter: nf_tables: unbind set in rule from commit path") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-29netfilter: ipset: Fix rename concurrency with listingJozsef Kadlecsik
[ Upstream commit 6c1f7e2c1b96ab9b09ac97c4df2bd9dc327206f6 ] Shijie Luo reported that when stress-testing ipset with multiple concurrent create, rename, flush, list, destroy commands, it can result ipset <version>: Broken LIST kernel message: missing DATA part! error messages and broken list results. The problem was the rename operation was not properly handled with respect of listing. The patch fixes the issue. Reported-by: Shijie Luo <luoshijie1@huawei.com> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-29netfilter: ipset: Copy the right MAC address in bitmap:ip,mac and ↵Stefano Brivio
hash:ip,mac sets [ Upstream commit 1b4a75108d5bc153daf965d334e77e8e94534f96 ] In commit 8cc4ccf58379 ("ipset: Allow matching on destination MAC address for mac and ipmac sets"), ipset.git commit 1543514c46a7, I added to the KADT functions for sets matching on MAC addreses the copy of source or destination MAC address depending on the configured match. This was done correctly for hash:mac, but for hash:ip,mac and bitmap:ip,mac, copying and pasting the same code block presents an obvious problem: in these two set types, the MAC address is the second dimension, not the first one, and we are actually selecting the MAC address depending on whether the first dimension (IP address) specifies source or destination. Fix this by checking for the IPSET_DIM_TWO_SRC flag in option flags. This way, mixing source and destination matches for the two dimensions of ip,mac set types works as expected. With this setup: ip netns add A ip link add veth1 type veth peer name veth2 netns A ip addr add 192.0.2.1/24 dev veth1 ip -net A addr add 192.0.2.2/24 dev veth2 ip link set veth1 up ip -net A link set veth2 up dst=$(ip netns exec A cat /sys/class/net/veth2/address) ip netns exec A ipset create test_bitmap bitmap:ip,mac range 192.0.0.0/16 ip netns exec A ipset add test_bitmap 192.0.2.1,${dst} ip netns exec A iptables -A INPUT -m set ! --match-set test_bitmap src,dst -j DROP ip netns exec A ipset create test_hash hash:ip,mac ip netns exec A ipset add test_hash 192.0.2.1,${dst} ip netns exec A iptables -A INPUT -m set ! --match-set test_hash src,dst -j DROP ipset correctly matches a test packet: # ping -c1 192.0.2.2 >/dev/null # echo $? 0 Reported-by: Chen Yi <yiche@redhat.com> Fixes: 8cc4ccf58379 ("ipset: Allow matching on destination MAC address for mac and ipmac sets") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-29netfilter: ipset: Actually allow destination MAC address for hash:ip,mac ↵Stefano Brivio
sets too [ Upstream commit b89d15480d0cacacae1a0fe0b3da01b529f2914f ] In commit 8cc4ccf58379 ("ipset: Allow matching on destination MAC address for mac and ipmac sets"), ipset.git commit 1543514c46a7, I removed the KADT check that prevents matching on destination MAC addresses for hash:mac sets, but forgot to remove the same check for hash:ip,mac set. Drop this check: functionality is now commented in man pages and there's no reason to restrict to source MAC address matching anymore. Reported-by: Chen Yi <yiche@redhat.com> Fixes: 8cc4ccf58379 ("ipset: Allow matching on destination MAC address for mac and ipmac sets") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-25netfilter: conntrack: Use consistent ct id hash calculationDirk Morris
commit 656c8e9cc1badbc18eefe6ba01d33ebbcae61b9a upstream. Change ct id hash calculation to only use invariants. Currently the ct id hash calculation is based on some fields that can change in the lifetime on a conntrack entry in some corner cases. The current hash uses the whole tuple which contains an hlist pointer which will change when the conntrack is placed on the dying list resulting in a ct id change. This patch also removes the reply-side tuple and extension pointer from the hash calculation so that the ct id will will not change from initialization until confirmation. Fixes: 3c79107631db1f7 ("netfilter: ctnetlink: don't use conntrack/expect object addresses as id") Signed-off-by: Dirk Morris <dmorris@metaloft.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-16netfilter: nft_hash: fix symhash with modulus oneLaura Garcia Liebana
[ Upstream commit 28b1d6ef53e3303b90ca8924bb78f31fa527cafb ] The rule below doesn't work as the kernel raises -ERANGE. nft add rule netdev nftlb lb01 ip daddr set \ symhash mod 1 map { 0 : 192.168.0.10 } fwd to "eth0" This patch allows to use the symhash modulus with one element, in the same way that the other types of hashes and algorithms that uses the modulus parameter. Signed-off-by: Laura Garcia Liebana <nevola@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-16netfilter: conntrack: always store window size un-scaledFlorian Westphal
[ Upstream commit 959b69ef57db00cb33e9c4777400ae7183ebddd3 ] Jakub Jankowski reported following oddity: After 3 way handshake completes, timeout of new connection is set to max_retrans (300s) instead of established (5 days). shortened excerpt from pcap provided: 25.070622 IP (flags [DF], proto TCP (6), length 52) 10.8.5.4.1025 > 10.8.1.2.80: Flags [S], seq 11, win 64240, [wscale 8] 26.070462 IP (flags [DF], proto TCP (6), length 48) 10.8.1.2.80 > 10.8.5.4.1025: Flags [S.], seq 82, ack 12, win 65535, [wscale 3] 27.070449 IP (flags [DF], proto TCP (6), length 40) 10.8.5.4.1025 > 10.8.1.2.80: Flags [.], ack 83, win 512, length 0 Turns out the last_win is of u16 type, but we store the scaled value: 512 << 8 (== 0x20000) becomes 0 window. The Fixes tag is not correct, as the bug has existed forever, but without that change all that this causes might cause is to mistake a window update (to-nonzero-from-zero) for a retransmit. Fixes: fbcd253d2448b8 ("netfilter: conntrack: lower timeout to RETRANS seconds if window is 0") Reported-by: Jakub Jankowski <shasta@toxcorp.com> Tested-by: Jakub Jankowski <shasta@toxcorp.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-16netfilter: nfnetlink: avoid deadlock due to synchronous request_moduleFlorian Westphal
[ Upstream commit 1b0890cd60829bd51455dc5ad689ed58c4408227 ] Thomas and Juliana report a deadlock when running: (rmmod nf_conntrack_netlink/xfrm_user) conntrack -e NEW -E & modprobe -v xfrm_user They provided following analysis: conntrack -e NEW -E netlink_bind() netlink_lock_table() -> increases "nl_table_users" nfnetlink_bind() # does not unlock the table as it's locked by netlink_bind() __request_module() call_usermodehelper_exec() This triggers "modprobe nf_conntrack_netlink" from kernel, netlink_bind() won't return until modprobe process is done. "modprobe xfrm_user": xfrm_user_init() register_pernet_subsys() -> grab pernet_ops_rwsem .. netlink_table_grab() calls schedule() as "nl_table_users" is non-zero so modprobe is blocked because netlink_bind() increased nl_table_users while also holding pernet_ops_rwsem. "modprobe nf_conntrack_netlink" runs and inits nf_conntrack_netlink: ctnetlink_init() register_pernet_subsys() -> blocks on "pernet_ops_rwsem" thanks to xfrm_user module both modprobe processes wait on one another -- neither can make progress. Switch netlink_bind() to "nowait" modprobe -- this releases the netlink table lock, which then allows both modprobe instances to complete. Reported-by: Thomas Jarosch <thomas.jarosch@intra2net.com> Reported-by: Juliana Rodrigueiro <juliana.rodrigueiro@intra2net.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-28net: make skb_dst_force return true when dst is refcountedFlorian Westphal
[ Upstream commit b60a77386b1d4868f72f6353d35dabe5fbe981f2 ] netfilter did not expect that skb_dst_force() can cause skb to lose its dst entry. I got a bug report with a skb->dst NULL dereference in netfilter output path. The backtrace contains nf_reinject(), so the dst might have been cleared when skb got queued to userspace. Other users were fixed via if (skb_dst(skb)) { skb_dst_force(skb); if (!skb_dst(skb)) goto handle_err; } But I think its preferable to make the 'dst might be cleared' part of the function explicit. In netfilter case, skb with a null dst is expected when queueing in prerouting hook, so drop skb for the other hooks. v2: v1 of this patch returned true in case skb had no dst entry. Eric said: Say if we have two skb_dst_force() calls for some reason on the same skb, only the first one will return false. This now returns false even when skb had no dst, as per Erics suggestion, so callers might need to check skb_dst() first before skb_dst_force(). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-26ipvs: fix tinfo memory leak in start_sync_threadJulian Anastasov
[ Upstream commit 5db7c8b9f9fc2aeec671ae3ca6375752c162e0e7 ] syzkaller reports for memory leak in start_sync_thread [1] As Eric points out, kthread may start and stop before the threadfn function is called, so there is no chance the data (tinfo in our case) to be released in thread. Fix this by releasing tinfo in the controlling code instead. [1] BUG: memory leak unreferenced object 0xffff8881206bf700 (size 32): comm "syz-executor761", pid 7268, jiffies 4294943441 (age 20.470s) hex dump (first 32 bytes): 00 40 7c 09 81 88 ff ff 80 45 b8 21 81 88 ff ff .@|......E.!.... 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<0000000057619e23>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 [inline] [<0000000057619e23>] slab_post_alloc_hook mm/slab.h:439 [inline] [<0000000057619e23>] slab_alloc mm/slab.c:3326 [inline] [<0000000057619e23>] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553 [<0000000086ce5479>] kmalloc include/linux/slab.h:547 [inline] [<0000000086ce5479>] start_sync_thread+0x5d2/0xe10 net/netfilter/ipvs/ip_vs_sync.c:1862 [<000000001a9229cc>] do_ip_vs_set_ctl+0x4c5/0x780 net/netfilter/ipvs/ip_vs_ctl.c:2402 [<00000000ece457c8>] nf_sockopt net/netfilter/nf_sockopt.c:106 [inline] [<00000000ece457c8>] nf_setsockopt+0x4c/0x80 net/netfilter/nf_sockopt.c:115 [<00000000942f62d4>] ip_setsockopt net/ipv4/ip_sockglue.c:1258 [inline] [<00000000942f62d4>] ip_setsockopt+0x9b/0xb0 net/ipv4/ip_sockglue.c:1238 [<00000000a56a8ffd>] udp_setsockopt+0x4e/0x90 net/ipv4/udp.c:2616 [<00000000fa895401>] sock_common_setsockopt+0x38/0x50 net/core/sock.c:3130 [<0000000095eef4cf>] __sys_setsockopt+0x98/0x120 net/socket.c:2078 [<000000009747cf88>] __do_sys_setsockopt net/socket.c:2089 [inline] [<000000009747cf88>] __se_sys_setsockopt net/socket.c:2086 [inline] [<000000009747cf88>] __x64_sys_setsockopt+0x26/0x30 net/socket.c:2086 [<00000000ded8ba80>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:301 [<00000000893b4ac8>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reported-by: syzbot+7e2e50c8adfccd2e5041@syzkaller.appspotmail.com Suggested-by: Eric Biggers <ebiggers@kernel.org> Fixes: 998e7a76804b ("ipvs: Use kthread_run() instead of doing a double-fork via kernel_thread()") Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-26ipvs: defer hook registration to avoid leaksJulian Anastasov
[ Upstream commit cf47a0b882a4e5f6b34c7949d7b293e9287f1972 ] syzkaller reports for memory leak when registering hooks [1] As we moved the nf_unregister_net_hooks() call into __ip_vs_dev_cleanup(), defer the nf_register_net_hooks() call, so that hooks are allocated and freed from same pernet_operations (ipvs_core_dev_ops). [1] BUG: memory leak unreferenced object 0xffff88810acd8a80 (size 96): comm "syz-executor073", pid 7254, jiffies 4294950560 (age 22.250s) hex dump (first 32 bytes): 02 00 00 00 00 00 00 00 50 8b bb 82 ff ff ff ff ........P....... 00 00 00 00 00 00 00 00 00 77 bb 82 ff ff ff ff .........w...... backtrace: [<0000000013db61f1>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 [inline] [<0000000013db61f1>] slab_post_alloc_hook mm/slab.h:439 [inline] [<0000000013db61f1>] slab_alloc_node mm/slab.c:3269 [inline] [<0000000013db61f1>] kmem_cache_alloc_node_trace+0x15b/0x2a0 mm/slab.c:3597 [<000000001a27307d>] __do_kmalloc_node mm/slab.c:3619 [inline] [<000000001a27307d>] __kmalloc_node+0x38/0x50 mm/slab.c:3627 [<0000000025054add>] kmalloc_node include/linux/slab.h:590 [inline] [<0000000025054add>] kvmalloc_node+0x4a/0xd0 mm/util.c:431 [<0000000050d1bc00>] kvmalloc include/linux/mm.h:637 [inline] [<0000000050d1bc00>] kvzalloc include/linux/mm.h:645 [inline] [<0000000050d1bc00>] allocate_hook_entries_size+0x3b/0x60 net/netfilter/core.c:61 [<00000000e8abe142>] nf_hook_entries_grow+0xae/0x270 net/netfilter/core.c:128 [<000000004b94797c>] __nf_register_net_hook+0x9a/0x170 net/netfilter/core.c:337 [<00000000d1545cbc>] nf_register_net_hook+0x34/0xc0 net/netfilter/core.c:464 [<00000000876c9b55>] nf_register_net_hooks+0x53/0xc0 net/netfilter/core.c:480 [<000000002ea868e0>] __ip_vs_init+0xe8/0x170 net/netfilter/ipvs/ip_vs_core.c:2280 [<000000002eb2d451>] ops_init+0x4c/0x140 net/core/net_namespace.c:130 [<000000000284ec48>] setup_net+0xde/0x230 net/core/net_namespace.c:316 [<00000000a70600fa>] copy_net_ns+0xf0/0x1e0 net/core/net_namespace.c:439 [<00000000ff26c15e>] create_new_namespaces+0x141/0x2a0 kernel/nsproxy.c:107 [<00000000b103dc79>] copy_namespaces+0xa1/0xe0 kernel/nsproxy.c:165 [<000000007cc008a2>] copy_process.part.0+0x11fd/0x2150 kernel/fork.c:2035 [<00000000c344af7c>] copy_process kernel/fork.c:1800 [inline] [<00000000c344af7c>] _do_fork+0x121/0x4f0 kernel/fork.c:2369 Reported-by: syzbot+722da59ccb264bc19910@syzkaller.appspotmail.com Fixes: 719c7d563c17 ("ipvs: Fix use-after-free in ip_vs_in") Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-26ipset: Fix memory accounting for hash types on resizeStefano Brivio
[ Upstream commit 11921796f4799ca9c61c4b22cc54d84aa69f8a35 ] If a fresh array block is allocated during resize, the current in-memory set size should be increased by the size of the block, not replaced by it. Before the fix, adding entries to a hash set type, leading to a table resize, caused an inconsistent memory size to be reported. This becomes more obvious when swapping sets with similar sizes: # cat hash_ip_size.sh #!/bin/sh FAIL_RETRIES=10 tries=0 while [ ${tries} -lt ${FAIL_RETRIES} ]; do ipset create t1 hash:ip for i in `seq 1 4345`; do ipset add t1 1.2.$((i / 255)).$((i % 255)) done t1_init="$(ipset list t1|sed -n 's/Size in memory: \(.*\)/\1/p')" ipset create t2 hash:ip for i in `seq 1 4360`; do ipset add t2 1.2.$((i / 255)).$((i % 255)) done t2_init="$(ipset list t2|sed -n 's/Size in memory: \(.*\)/\1/p')" ipset swap t1 t2 t1_swap="$(ipset list t1|sed -n 's/Size in memory: \(.*\)/\1/p')" t2_swap="$(ipset list t2|sed -n 's/Size in memory: \(.*\)/\1/p')" ipset destroy t1 ipset destroy t2 tries=$((tries + 1)) if [ ${t1_init} -lt 10000 ] || [ ${t2_init} -lt 10000 ]; then echo "FAIL after ${tries} tries:" echo "T1 size ${t1_init}, after swap ${t1_swap}" echo "T2 size ${t2_init}, after swap ${t2_swap}" exit 1 fi done echo "PASS" # echo -n 'func hash_ip4_resize +p' > /sys/kernel/debug/dynamic_debug/control # ./hash_ip_size.sh [ 2035.018673] attempt to resize set t1 from 10 to 11, t 00000000fe6551fa [ 2035.078583] set t1 resized from 10 (00000000fe6551fa) to 11 (00000000172a0163) [ 2035.080353] Table destroy by resize 00000000fe6551fa FAIL after 4 tries: T1 size 9064, after swap 71128 T2 size 71128, after swap 9064 Reported-by: NOYB <JunkYardMail1@Frontier.com> Fixes: 9e41f26a505c ("netfilter: ipset: Count non-static extension memory for userspace") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-10netfilter: nft_flow_offload: IPCB is only valid for ipv4 familyFlorian Westphal
commit 69aeb538587e087bfc81dd1f465eab3558ff3158 upstream. Guard this with a check vs. ipv4, IPCB isn't valid in ipv6 case. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-10netfilter: nft_flow_offload: don't offload when sequence numbers need adjustmentFlorian Westphal
commit 91a9048f238063dde7feea752b9dd386f7e3808b upstream. We can't deal with tcp sequence number rewrite in flow_offload. While at it, simplify helper check, we only need to know if the extension is present, we don't need the helper data. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-10netfilter: nft_flow_offload: set liberal tracking mode for tcpFlorian Westphal
commit 8437a6209f76f85a2db1abb12a9bde2170801617 upstream. Without it, whenever a packet has to be pushed up the stack (e.g. because of mtu mismatch), then conntrack will flag packets as invalid, which in turn breaks NAT. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-10netfilter: nf_flow_table: ignore DF bit settingFlorian Westphal
commit e75b3e1c9bc5b997d09bdf8eb72ab3dd3c1a7072 upstream. Its irrelevant if the DF bit is set or not, we must pass packet to stack in either case. If the DF bit is set, we must pass it to stack so the appropriate ICMP error can be generated. If the DF is not set, we must pass it to stack for fragmentation. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-22ipvs: Fix use-after-free in ip_vs_inYueHaibing
[ Upstream commit 719c7d563c17b150877cee03a4b812a424989dfa ] BUG: KASAN: use-after-free in ip_vs_in.part.29+0xe8/0xd20 [ip_vs] Read of size 4 at addr ffff8881e9b26e2c by task sshd/5603 CPU: 0 PID: 5603 Comm: sshd Not tainted 4.19.39+ #30 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: dump_stack+0x71/0xab print_address_description+0x6a/0x270 kasan_report+0x179/0x2c0 ip_vs_in.part.29+0xe8/0xd20 [ip_vs] ip_vs_in+0xd8/0x170 [ip_vs] nf_hook_slow+0x5f/0xe0 __ip_local_out+0x1d5/0x250 ip_local_out+0x19/0x60 __tcp_transmit_skb+0xba1/0x14f0 tcp_write_xmit+0x41f/0x1ed0 ? _copy_from_iter_full+0xca/0x340 __tcp_push_pending_frames+0x52/0x140 tcp_sendmsg_locked+0x787/0x1600 ? tcp_sendpage+0x60/0x60 ? inet_sk_set_state+0xb0/0xb0 tcp_sendmsg+0x27/0x40 sock_sendmsg+0x6d/0x80 sock_write_iter+0x121/0x1c0 ? sock_sendmsg+0x80/0x80 __vfs_write+0x23e/0x370 vfs_write+0xe7/0x230 ksys_write+0xa1/0x120 ? __ia32_sys_read+0x50/0x50 ? __audit_syscall_exit+0x3ce/0x450 do_syscall_64+0x73/0x200 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7ff6f6147c60 Code: 73 01 c3 48 8b 0d 28 12 2d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 5d 73 2d 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 RSP: 002b:00007ffd772ead18 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000034 RCX: 00007ff6f6147c60 RDX: 0000000000000034 RSI: 000055df30a31270 RDI: 0000000000000003 RBP: 000055df30a31270 R08: 0000000000000000 R09: 0000000000000000 R10: 00007ffd772ead70 R11: 0000000000000246 R12: 00007ffd772ead74 R13: 00007ffd772eae20 R14: 00007ffd772eae24 R15: 000055df2f12ddc0 Allocated by task 6052: kasan_kmalloc+0xa0/0xd0 __kmalloc+0x10a/0x220 ops_init+0x97/0x190 register_pernet_operations+0x1ac/0x360 register_pernet_subsys+0x24/0x40 0xffffffffc0ea016d do_one_initcall+0x8b/0x253 do_init_module+0xe3/0x335 load_module+0x2fc0/0x3890 __do_sys_finit_module+0x192/0x1c0 do_syscall_64+0x73/0x200 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Freed by task 6067: __kasan_slab_free+0x130/0x180 kfree+0x90/0x1a0 ops_free_list.part.7+0xa6/0xc0 unregister_pernet_operations+0x18b/0x1f0 unregister_pernet_subsys+0x1d/0x30 ip_vs_cleanup+0x1d/0xd2f [ip_vs] __x64_sys_delete_module+0x20c/0x300 do_syscall_64+0x73/0x200 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The buggy address belongs to the object at ffff8881e9b26600 which belongs to the cache kmalloc-4096 of size 4096 The buggy address is located 2092 bytes inside of 4096-byte region [ffff8881e9b26600, ffff8881e9b27600) The buggy address belongs to the page: page:ffffea0007a6c800 count:1 mapcount:0 mapping:ffff888107c0e600 index:0x0 compound_mapcount: 0 flags: 0x17ffffc0008100(slab|head) raw: 0017ffffc0008100 dead000000000100 dead000000000200 ffff888107c0e600 raw: 0000000000000000 0000000080070007 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected while unregistering ipvs module, ops_free_list calls __ip_vs_cleanup, then nf_unregister_net_hooks be called to do remove nf hook entries. It need a RCU period to finish, however net->ipvs is set to NULL immediately, which will trigger NULL pointer dereference when a packet is hooked and handled by ip_vs_in where net->ipvs is dereferenced. Another scene is ops_free_list call ops_free to free the net_generic directly while __ip_vs_cleanup finished, then calling ip_vs_in will triggers use-after-free. This patch moves nf_unregister_net_hooks from __ip_vs_cleanup() to __ip_vs_dev_cleanup(), where rcu_barrier() is called by unregister_pernet_device -> unregister_pernet_operations, that will do the needed grace period. Reported-by: Hulk Robot <hulkci@huawei.com> Fixes: efe41606184e ("ipvs: convert to use pernet nf_hook api") Suggested-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-22netfilter: nf_queue: fix reinject verdict handlingJagdish Motwani
[ Upstream commit 946c0d8e6ed43dae6527e878d0077c1e11015db0 ] This patch fixes netfilter hook traversal when there are more than 1 hooks returning NF_QUEUE verdict. When the first queue reinjects the packet, 'nf_reinject' starts traversing hooks with a proper hook_index. However, if it again receives a NF_QUEUE verdict (by some other netfilter hook), it queues the packet with a wrong hook_index. So, when the second queue reinjects the packet, it re-executes hooks in between. Fixes: 960632ece694 ("netfilter: convert hook list to an array") Signed-off-by: Jagdish Motwani <jagdish.motwani@sophos.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-15netfilter: nf_flow_table: fix netdev refcnt leakTaehee Yoo
[ Upstream commit 26a302afbe328ecb7507cae2035d938e6635131b ] flow_offload_alloc() calls nf_route() to get a dst_entry. Internally, nf_route() calls ip_route_output_key() that allocates a dst_entry and holds it. So, a dst_entry should be released by dst_release() if nf_route() is successful. Otherwise, netns exit routine cannot be finished and the following message is printed: [ 257.490952] unregister_netdevice: waiting for lo to become free. Usage count = 1 Fixes: ac2a66665e23 ("netfilter: add generic flow table infrastructure") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-15netfilter: nf_flow_table: check ttl value in flow offload data pathTaehee Yoo
[ Upstream commit 33cc3c0cfa64c86b6c4bbee86997aea638534931 ] nf_flow_offload_ip_hook() and nf_flow_offload_ipv6_hook() do not check ttl value. So, ttl value overflow may occur. Fixes: 97add9f0d66d ("netfilter: flow table support for IPv4") Fixes: 0995210753a2 ("netfilter: flow table support for IPv6") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-15netfilter: nf_tables: fix base chain stat rcu_dereference usageFlorian Westphal
[ Upstream commit edbd82c5fba009f68d20b5db585be1e667c605f6 ] Following splat gets triggered when nfnetlink monitor is running while xtables-nft selftests are running: net/netfilter/nf_tables_api.c:1272 suspicious rcu_dereference_check() usage! other info that might help us debug this: 1 lock held by xtables-nft-mul/27006: #0: 00000000e0f85be9 (&net->nft.commit_mutex){+.+.}, at: nf_tables_valid_genid+0x1a/0x50 Call Trace: nf_tables_fill_chain_info.isra.45+0x6cc/0x6e0 nf_tables_chain_notify+0xf8/0x1a0 nf_tables_commit+0x165c/0x1740 nf_tables_fill_chain_info() can be called both from dumps (rcu read locked) or from the transaction path if a userspace process subscribed to nftables notifications. In the 'table dump' case, rcu_access_pointer() cannot be used: We do not hold transaction mutex so the pointer can be NULLed right after the check. Just unconditionally fetch the value, then have the helper return immediately if its NULL. In the notification case we don't hold the rcu read lock, but updates are prevented due to transaction mutex. Use rcu_dereference_check() to make lockdep aware of this. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-15netfilter: nf_conntrack_h323: restore boundary check correctnessJakub Jankowski
[ Upstream commit f5e85ce8e733c2547827f6268136b70b802eabdb ] Since commit bc7d811ace4a ("netfilter: nf_ct_h323: Convert CHECK_BOUND macro to function"), NAT traversal for H.323 doesn't work, failing to parse H323-UserInformation. nf_h323_error_boundary() compares contents of the bitstring, not the addresses, preventing valid H.323 packets from being conntrack'd. This looks like an oversight from when CHECK_BOUND macro was converted to a function. To fix it, stop dereferencing bs->cur and bs->end. Fixes: bc7d811ace4a ("netfilter: nf_ct_h323: Convert CHECK_BOUND macro to function") Signed-off-by: Jakub Jankowski <shasta@toxcorp.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-15netfilter: nf_flow_table: fix missing error check for rhashtable_insert_fastTaehee Yoo
[ Upstream commit 43c8f131184faf20c07221f3e09724611c6525d8 ] rhashtable_insert_fast() may return an error value when memory allocation fails, but flow_offload_add() does not check for errors. This patch just adds missing error checking. Fixes: ac2a66665e23 ("netfilter: add generic flow table infrastructure") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-04jump_label: move 'asm goto' support test to KconfigMasahiro Yamada
commit e9666d10a5677a494260d60d1fa0b73cc7646eb3 upstream. Currently, CONFIG_JUMP_LABEL just means "I _want_ to use jump label". The jump label is controlled by HAVE_JUMP_LABEL, which is defined like this: #if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL) # define HAVE_JUMP_LABEL #endif We can improve this by testing 'asm goto' support in Kconfig, then make JUMP_LABEL depend on CC_HAS_ASM_GOTO. Ugly #ifdef HAVE_JUMP_LABEL will go away, and CONFIG_JUMP_LABEL will match to the real kernel capability. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Tested-by: Sedat Dilek <sedat.dilek@gmail.com> [nc: Fix trivial conflicts in 4.19 arch/xtensa/kernel/jump_label.c doesn't exist yet Ensured CC_HAVE_ASM_GOTO and HAVE_JUMP_LABEL were sufficiently eliminated] Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-16netfilter: nf_tables: add missing ->release_ops() in error path of newrule()Taehee Yoo
[ Upstream commit b25a31bf0ca091aa8bdb9ab329b0226257568bbe ] ->release_ops() callback releases resources and this is used in error path. If nf_tables_newrule() fails after ->select_ops(), it should release resources. but it can not call ->destroy() because that should be called after ->init(). At this point, ->release_ops() should be used for releasing resources. Test commands: modprobe -rv xt_tcpudp iptables-nft -I INPUT -m tcp <-- error command lsmod Result: Module Size Used by xt_tcpudp 20480 2 <-- it should be 0 Fixes: b8e204006340 ("netfilter: nft_compat: use .release_ops and remove list of extension") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2019-05-16netfilter: nf_tables: use-after-free in dynamic operationsPablo Neira Ayuso
[ Upstream commit 3f3a390dbd59d236f62cff8e8b20355ef7069e3d ] Smatch reports: net/netfilter/nf_tables_api.c:2167 nf_tables_expr_destroy() error: dereferencing freed memory 'expr->ops' net/netfilter/nf_tables_api.c 2162 static void nf_tables_expr_destroy(const struct nft_ctx *ctx, 2163 struct nft_expr *expr) 2164 { 2165 if (expr->ops->destroy) 2166 expr->ops->destroy(ctx, expr); ^^^^ --> 2167 module_put(expr->ops->type->owner); ^^^^^^^^^ 2168 } Smatch says there are three functions which free expr->ops. Fixes: b8e204006340 ("netfilter: nft_compat: use .release_ops and remove list of extension") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>