aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/net/vxlan.c
AgeCommit message (Collapse)Author
2021-07-20vxlan: add missing rcu_read_lock() in neigh_reduce()Eric Dumazet
[ Upstream commit 85e8b032d6ebb0f698a34dd22c2f13443d905888 ] syzbot complained in neigh_reduce(), because rcu_read_lock_bh() is treated differently than rcu_read_lock() WARNING: suspicious RCU usage 5.13.0-rc6-syzkaller #0 Not tainted ----------------------------- include/net/addrconf.h:313 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 3 locks held by kworker/0:0/5: #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline] #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: atomic64_set include/asm-generic/atomic-instrumented.h:856 [inline] #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: atomic_long_set include/asm-generic/atomic-long.h:41 [inline] #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: set_work_data kernel/workqueue.c:617 [inline] #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline] #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x871/0x1600 kernel/workqueue.c:2247 #1: ffffc90000ca7da8 ((work_completion)(&port->wq)){+.+.}-{0:0}, at: process_one_work+0x8a5/0x1600 kernel/workqueue.c:2251 #2: ffffffff8bf795c0 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0x1da/0x3130 net/core/dev.c:4180 stack backtrace: CPU: 0 PID: 5 Comm: kworker/0:0 Not tainted 5.13.0-rc6-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: events ipvlan_process_multicast Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x141/0x1d7 lib/dump_stack.c:120 __in6_dev_get include/net/addrconf.h:313 [inline] __in6_dev_get include/net/addrconf.h:311 [inline] neigh_reduce drivers/net/vxlan.c:2167 [inline] vxlan_xmit+0x34d5/0x4c30 drivers/net/vxlan.c:2919 __netdev_start_xmit include/linux/netdevice.h:4944 [inline] netdev_start_xmit include/linux/netdevice.h:4958 [inline] xmit_one net/core/dev.c:3654 [inline] dev_hard_start_xmit+0x1eb/0x920 net/core/dev.c:3670 __dev_queue_xmit+0x2133/0x3130 net/core/dev.c:4246 ipvlan_process_multicast+0xa99/0xd70 drivers/net/ipvlan/ipvlan_core.c:287 process_one_work+0x98d/0x1600 kernel/workqueue.c:2276 worker_thread+0x64c/0x1120 kernel/workqueue.c:2422 kthread+0x3b1/0x4a0 kernel/kthread.c:313 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294 Fixes: f564f45c4518 ("vxlan: add ipv6 proxy support") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-29vxlan: Copy needed_tailroom from lowerdevSven Eckelmann
[ Upstream commit a5e74021e84bb5eadf760aaf2c583304f02269be ] While vxlan doesn't need any extra tailroom, the lowerdev might need it. In that case, copy it over to reduce the chance for additional (re)allocations in the transmit path. Signed-off-by: Sven Eckelmann <sven@narfation.org> Link: https://lore.kernel.org/r/20201126125247.1047977-2-sven@narfation.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-29vxlan: Add needed_headroom for lower deviceSven Eckelmann
[ Upstream commit 0a35dc41fea67ac4495ce7584406bf9557a6e7d0 ] It was observed that sending data via batadv over vxlan (on top of wireguard) reduced the performance massively compared to raw ethernet or batadv on raw ethernet. A check of perf data showed that the vxlan_build_skb was calling all the time pskb_expand_head to allocate enough headroom for: min_headroom = LL_RESERVED_SPACE(dst->dev) + dst->header_len + VXLAN_HLEN + iphdr_len; But the vxlan_config_apply only requested needed headroom for: lowerdev->hard_header_len + VXLAN6_HEADROOM or VXLAN_HEADROOM So it completely ignored the needed_headroom of the lower device. The first caller of net_dev_xmit could therefore never make sure that enough headroom was allocated for the rest of the transmit path. Cc: Annika Wickert <annika.wickert@exaring.de> Signed-off-by: Sven Eckelmann <sven@narfation.org> Tested-by: Annika Wickert <aw@awlnx.space> Link: https://lore.kernel.org/r/20201126125247.1047977-1-sven@narfation.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-08-21Revert "vxlan: fix tos value before xmit"Hangbin Liu
[ Upstream commit a0dced17ad9dc08b1b25e0065b54c97a318e6e8b ] This reverts commit 71130f29979c7c7956b040673e6b9d5643003176. In commit 71130f29979c ("vxlan: fix tos value before xmit") we want to make sure the tos value are filtered by RT_TOS() based on RFC1349. 0 1 2 3 4 5 6 7 +-----+-----+-----+-----+-----+-----+-----+-----+ | PRECEDENCE | TOS | MBZ | +-----+-----+-----+-----+-----+-----+-----+-----+ But RFC1349 has been obsoleted by RFC2474. The new DSCP field defined like 0 1 2 3 4 5 6 7 +-----+-----+-----+-----+-----+-----+-----+-----+ | DS FIELD, DSCP | ECN FIELD | +-----+-----+-----+-----+-----+-----+-----+-----+ So with IPTOS_TOS_MASK 0x1E RT_TOS(tos) ((tos)&IPTOS_TOS_MASK) the first 3 bits DSCP info will get lost. To take all the DSCP info in xmit, we should revert the patch and just push all tos bits to ip_tunnel_ecn_encap(), which will handling ECN field later. Fixes: 71130f29979c ("vxlan: fix tos value before xmit") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-08-21vxlan: Ensure FDB dump is performed under RCUIdo Schimmel
[ Upstream commit b5141915b5aec3b29a63db869229e3741ebce258 ] The commit cited below removed the RCU read-side critical section from rtnl_fdb_dump() which means that the ndo_fdb_dump() callback is invoked without RCU protection. This results in the following warning [1] in the VXLAN driver, which relied on the callback being invoked from an RCU read-side critical section. Fix this by calling rcu_read_lock() in the VXLAN driver, as already done in the bridge driver. [1] WARNING: suspicious RCU usage 5.8.0-rc4-custom-01521-g481007553ce6 #29 Not tainted ----------------------------- drivers/net/vxlan.c:1379 RCU-list traversed in non-reader section!! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by bridge/166: #0: ffffffff85a27850 (rtnl_mutex){+.+.}-{3:3}, at: netlink_dump+0xea/0x1090 stack backtrace: CPU: 1 PID: 166 Comm: bridge Not tainted 5.8.0-rc4-custom-01521-g481007553ce6 #29 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: dump_stack+0x100/0x184 lockdep_rcu_suspicious+0x153/0x15d vxlan_fdb_dump+0x51e/0x6d0 rtnl_fdb_dump+0x4dc/0xad0 netlink_dump+0x540/0x1090 __netlink_dump_start+0x695/0x950 rtnetlink_rcv_msg+0x802/0xbd0 netlink_rcv_skb+0x17a/0x480 rtnetlink_rcv+0x22/0x30 netlink_unicast+0x5ae/0x890 netlink_sendmsg+0x98a/0xf40 __sys_sendto+0x279/0x3b0 __x64_sys_sendto+0xe6/0x1a0 do_syscall_64+0x54/0xa0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7fe14fa2ade0 Code: Bad RIP value. RSP: 002b:00007fff75bb5b88 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 00005614b1ba0020 RCX: 00007fe14fa2ade0 RDX: 000000000000011c RSI: 00007fff75bb5b90 RDI: 0000000000000003 RBP: 00007fff75bb5b90 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00005614b1b89160 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 Fixes: 5e6d24358799 ("bridge: netlink dump interface at par with brctl") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-20vxlan: Avoid infinite loop when suppressing NS messages with invalid optionsIdo Schimmel
[ Upstream commit 8066e6b449e050675df48e7c4b16c29f00507ff0 ] When proxy mode is enabled the vxlan device might reply to Neighbor Solicitation (NS) messages on behalf of remote hosts. In case the NS message includes the "Source link-layer address" option [1], the vxlan device will use the specified address as the link-layer destination address in its reply. To avoid an infinite loop, break out of the options parsing loop when encountering an option with length zero and disregard the NS message. This is consistent with the IPv6 ndisc code and RFC 4886 which states that "Nodes MUST silently discard an ND packet that contains an option with length zero" [2]. [1] https://tools.ietf.org/html/rfc4861#section-4.3 [2] https://tools.ietf.org/html/rfc4861#section-4.6 Fixes: 4b29dba9c085 ("vxlan: fix nonfunctional neigh_reduce()") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-20net: ipv6_stub: use ip6_dst_lookup_flow instead of ip6_dst_lookupSabrina Dubroca
commit 6c8991f41546c3c472503dff1ea9daaddf9331c2 upstream. ipv6_stub uses the ip6_dst_lookup function to allow other modules to perform IPv6 lookups. However, this function skips the XFRM layer entirely. All users of ipv6_stub->ip6_dst_lookup use ip_route_output_flow (via the ip_route_output_key and ip_route_output helpers) for their IPv4 lookups, which calls xfrm_lookup_route(). This patch fixes this inconsistent behavior by switching the stub to ip6_dst_lookup_flow, which also calls xfrm_lookup_route(). This requires some changes in all the callers, as these two functions take different arguments and have different return types. Fixes: 5f81bd2e5d80 ("ipv6: export a stub for IPv6 symbols used by vxlan") Reported-by: Xiumei Mu <xmu@redhat.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net> [bwh: Backported to 4.14: - Drop change in lwt_bpf.c - Delete now-unused "ret" in mlx5e_route_lookup_ipv6() - Initialise "out_dev" in mlx5e_create_encap_header_ipv6() to avoid introducing a spurious "may be used uninitialised" warning - Adjust filenames, context, indentation] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-02vxlan: check return value of gro_cells_init()Taehee Yoo
[ Upstream commit 384d91c267e621e0926062cfb3f20cb72dc16928 ] gro_cells_init() returns error if memory allocation is failed. But the vxlan module doesn't check the return value of gro_cells_init(). Fixes: 58ce31cca1ff ("vxlan: GRO support at tunnel layer")` Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-27vxlan: changelink: Fix handling of default remotesPetr Machata
[ Upstream commit ce5e098f7a10b4bf8e948c12fa350320c5c3afad ] Default remotes are stored as FDB entries with an Ethernet address of 00:00:00:00:00:00. When a request is made to change a remote address of a VXLAN device, vxlan_changelink() first deletes the existing default remote, and then creates a new FDB entry. This works well as long as the list of default remotes matches exactly the configuration of a VXLAN remote address. Thus when the VXLAN device has a remote of X, there should be exactly one default remote FDB entry X. If the VXLAN device has no remote address, there should be no such entry. Besides using "ip link set", it is possible to manipulate the list of default remotes by using the "bridge fdb". It is therefore easy to break the above condition. Under such circumstances, the __vxlan_fdb_delete() call doesn't delete the FDB entry itself, but just one remote. The following vxlan_fdb_create() then creates a new FDB entry, leading to a situation where two entries exist for the address 00:00:00:00:00:00, each with a different subset of default remotes. An even more obvious breakage rooted in the same cause can be observed when a remote address is configured for a VXLAN device that did not have one before. In that case vxlan_changelink() doesn't remove any remote, and just creates a new FDB entry for the new address: $ ip link add name vx up type vxlan id 2000 dstport 4789 $ bridge fdb ap dev vx 00:00:00:00:00:00 dst 192.0.2.20 self permanent $ bridge fdb ap dev vx 00:00:00:00:00:00 dst 192.0.2.30 self permanent $ ip link set dev vx type vxlan remote 192.0.2.30 $ bridge fdb sh dev vx | grep 00:00:00:00:00:00 00:00:00:00:00:00 dst 192.0.2.30 self permanent <- new entry, 1 rdst 00:00:00:00:00:00 dst 192.0.2.20 self permanent <- orig. entry, 2 rdsts 00:00:00:00:00:00 dst 192.0.2.30 self permanent To fix this, instead of calling vxlan_fdb_create() directly, defer to vxlan_fdb_update(). That has logic to handle the duplicates properly. Additionally, it also handles notifications, so drop that call from changelink as well. Fixes: 0241b836732f ("vxlan: fix default fdb entry netlink notify ordering during netdev create") Signed-off-by: Petr Machata <petrm@mellanox.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-12vxlan: fix tos value before xmitHangbin Liu
[ Upstream commit 71130f29979c7c7956b040673e6b9d5643003176 ] Before ip_tunnel_ecn_encap() and udp_tunnel_xmit_skb() we should filter tos value by RT_TOS() instead of using config tos directly. vxlan_get_route() would filter the tos to fl4.flowi4_tos but we didn't return it back, as geneve_get_v4_rt() did. So we have to use RT_TOS() directly in function ip_tunnel_ecn_encap(). Fixes: 206aaafcd279 ("VXLAN: Use IP Tunnels tunnel ENC encap API") Fixes: 1400615d64cf ("vxlan: allow setting ipv6 traffic class") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-05vxlan: Fix error path in __vxlan_dev_create()Petr Machata
[ Upstream commit 6db9246871394b3a136cd52001a0763676563840 ] When a failure occurs in rtnl_configure_link(), the current code calls unregister_netdevice() to roll back the earlier call to register_netdevice(), and jumps to errout, which calls vxlan_fdb_destroy(). However unregister_netdevice() calls transitively ndo_uninit, which is vxlan_uninit(), and that already takes care of deleting the default FDB entry by calling vxlan_fdb_delete_default(). Since the entry added earlier in __vxlan_dev_create() is exactly the default entry, the cleanup code in the errout block always leads to double free and thus a panic. Besides, since vxlan_fdb_delete_default() always destroys the FDB entry with notification enabled, the deletion of the default entry is notified even before the addition was notified. Instead, move the unregister_netdevice() call after the manual destroy, which solves both problems. Fixes: 0241b836732f ("vxlan: fix default fdb entry netlink notify ordering during netdev create") Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-10vxlan: check tun_info options_len properlyXin Long
[ Upstream commit eadf52cf1852196a1363044dcda22fa5d7f296f7 ] This patch is to improve the tun_info options_len by dropping the skb when TUNNEL_VXLAN_OPT is set but options_len is less than vxlan_metadata. This can void a potential out-of-bounds access on ip_tun_info. Fixes: ee122c79d422 ("vxlan: Flow based tunneling") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-03vxlan: Don't call gro_cells_destroy() before device is unregisteredZhiqiang Liu
[ Upstream commit cc4807bb609230d8959fd732b0bf3bd4c2de8eac ] Commit ad6c9986bcb62 ("vxlan: Fix GRO cells race condition between receive and link delete") fixed a race condition for the typical case a vxlan device is dismantled from the current netns. But if a netns is dismantled, vxlan_destroy_tunnels() is called to schedule a unregister_netdevice_queue() of all the vxlan tunnels that are related to this netns. In vxlan_destroy_tunnels(), gro_cells_destroy() is called and finished before unregister_netdevice_queue(). This means that the gro_cells_destroy() call is done too soon, for the same reasons explained in above commit. So we need to fully respect the RCU rules, and thus must remove the gro_cells_destroy() call or risk use after-free. Fixes: 58ce31cca1ff ("vxlan: GRO support at tunnel layer") Signed-off-by: Suanming.Mou <mousuanming@huawei.com> Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: Zhiqiang Liu <liuzhiqiang26@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-19vxlan: test dev->flags & IFF_UP before calling gro_cells_receive()Eric Dumazet
[ Upstream commit 59cbf56fcd98ba2a715b6e97c4e43f773f956393 ] Same reasons than the ones explained in commit 4179cb5a4c92 ("vxlan: test dev->flags & IFF_UP before calling netif_rx()") netif_rx() or gro_cells_receive() must be called under a strict contract. At device dismantle phase, core networking clears IFF_UP and flush_all_backlogs() is called after rcu grace period to make sure no incoming packet might be in a cpu backlog and still referencing the device. A similar protocol is used for gro_cells infrastructure, as gro_cells_destroy() will be called only after a full rcu grace period is observed after IFF_UP has been cleared. Most drivers call netif_rx() from their interrupt handler, and since the interrupts are disabled at device dismantle, netif_rx() does not have to check dev->flags & IFF_UP Virtual drivers do not have this guarantee, and must therefore make the check themselves. Otherwise we risk use-after-free and/or crashes. Fixes: d342894c5d2f ("vxlan: virtual extensible lan") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-19vxlan: Fix GRO cells race condition between receive and link deleteStefano Brivio
[ Upstream commit ad6c9986bcb627c7c22b8f9e9a934becc27df87c ] If we receive a packet while deleting a VXLAN device, there's a chance vxlan_rcv() is called at the same time as vxlan_dellink(). This is fine, except that vxlan_dellink() should never ever touch stuff that's still in use, such as the GRO cells list. Otherwise, vxlan_rcv() crashes while queueing packets via gro_cells_receive(). Move the gro_cells_destroy() to vxlan_uninit(), which runs after the RCU grace period is elapsed and nothing needs the gro_cells anymore. This is now done in the same way as commit 8e816df87997 ("geneve: Use GRO cells infrastructure.") originally implemented for GENEVE. Reported-by: Jianlin Shi <jishi@redhat.com> Fixes: 58ce31cca1ff ("vxlan: GRO support at tunnel layer") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-23vxlan: test dev->flags & IFF_UP before calling netif_rx()Eric Dumazet
[ Upstream commit 4179cb5a4c924cd233eaadd081882425bc98f44e ] netif_rx() must be called under a strict contract. At device dismantle phase, core networking clears IFF_UP and flush_all_backlogs() is called after rcu grace period to make sure no incoming packet might be in a cpu backlog and still referencing the device. Most drivers call netif_rx() from their interrupt handler, and since the interrupts are disabled at device dismantle, netif_rx() does not have to check dev->flags & IFF_UP Virtual drivers do not have this guarantee, and must therefore make the check themselves. Otherwise we risk use-after-free and/or crashes. Note this patch also fixes a small issue that came with commit ce6502a8f957 ("vxlan: fix a use after free in vxlan_encap_bypass"), since the dev->stats.rx_dropped change was done on the wrong device. Fixes: d342894c5d2f ("vxlan: virtual extensible lan") Fixes: ce6502a8f957 ("vxlan: fix a use after free in vxlan_encap_bypass") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Petr Machata <petrm@mellanox.com> Cc: Ido Schimmel <idosch@mellanox.com> Cc: Roopa Prabhu <roopa@cumulusnetworks.com> Cc: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-07-28vxlan: fix default fdb entry netlink notify ordering during netdev createRoopa Prabhu
[ Upstream commit e99465b952861533d9ba748fdbecc96d9a36da3e ] Problem: In vxlan_newlink, a default fdb entry is added before register_netdev. The default fdb creation function also notifies user-space of the fdb entry on the vxlan device which user-space does not know about yet. (RTM_NEWNEIGH goes before RTM_NEWLINK for the same ifindex). This patch fixes the user-space netlink notification ordering issue with the following changes: - decouple fdb notify from fdb create. - Move fdb notify after register_netdev. - Call rtnl_configure_link in vxlan newlink handler to notify userspace about the newlink before fdb notify and hence avoiding the user-space race. Fixes: afbd8bae9c79 ("vxlan: add implicit fdb entry for default destination") Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-28vxlan: make netlink notify in vxlan_fdb_destroy optionalRoopa Prabhu
[ Upstream commit f6e053858671bb156b6e44ad66418acc8c7f4e77 ] Add a new option do_notify to vxlan_fdb_destroy to make sending netlink notify optional. Used by a later patch. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-28vxlan: add new fdb alloc and create helpersRoopa Prabhu
[ Upstream commit 7431016b107c95cb5b2014aa1901fcb115f746bc ] - Add new vxlan_fdb_alloc helper - rename existing vxlan_fdb_create into vxlan_fdb_update: because it really creates or updates an existing fdb entry - move new fdb creation into a separate vxlan_fdb_create Main motivation for this change is to introduce the ability to decouple vxlan fdb creation and notify, used in a later patch. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-22net: fix use-after-free in GRO with ESPSabrina Dubroca
[ Upstream commit 603d4cf8fe095b1ee78f423d514427be507fb513 ] Since the addition of GRO for ESP, gro_receive can consume the skb and return -EINPROGRESS. In that case, the lower layer GRO handler cannot touch the skb anymore. Commit 5f114163f2f5 ("net: Add a skb_gro_flush_final helper.") converted some of the gro_receive handlers that can lead to ESP's gro_receive so that they wouldn't access the skb when -EINPROGRESS is returned, but missed other spots, mainly in tunneling protocols. This patch finishes the conversion to using skb_gro_flush_final(), and adds a new helper, skb_gro_flush_final_remcsum(), used in VXLAN and GUE. Fixes: 5f114163f2f5 ("net: Add a skb_gro_flush_final helper.") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-09net: don't call update_pmtu unconditionallyNicolas Dichtel
commit f15ca723c1ebe6c1a06bc95fda6b62cd87b44559 upstream. Some dst_ops (e.g. md_dst_ops)) doesn't set this handler. It may result to: "BUG: unable to handle kernel NULL pointer dereference at (null)" Let's add a helper to check if update_pmtu is available before calling it. Fixes: 52a589d51f10 ("geneve: update skb dst pmtu on tx path") Fixes: a93bf0ff4490 ("vxlan: update skb dst pmtu on tx path") CC: Roman Kapl <code@rkapl.cz> CC: Xin Long <lucien.xin@gmail.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Thomas Deutschmann <whissi@gentoo.org> Cc: Eddie Chapman <eddie@ehuk.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-03vxlan: update skb dst pmtu on tx pathXin Long
[ Upstream commit a93bf0ff449064e6b7f44e58522e940f88c0d966 ] Unlike ip tunnels, now vxlan doesn't do any pmtu update for upper dst pmtu, even if it doesn't match the lower dst pmtu any more. The problem can be reproduced when reducing the vxlan lower dev's pmtu when running netperf. In jianlin's testing, the performance went to 1/7 of the previous. This patch is to update the upper dst pmtu to match the lower dst pmtu on tx path so that packets can be sent out even when lower dev's pmtu has been changed. It also works for metadata dst. Note that this patch doesn't process any pmtu icmp packet. But even in the future, the support for pmtu icmp packets process of udp tunnels will also needs this. The same thing will be done for geneve in another patch. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-02vxlan: restore dev->mtu setting based on lower deviceAlexey Kodanev
[ Upstream commit f870c1ff65a6d1f3a083f277280802ee09a5b44d ] Stefano Brivio says: Commit a985343ba906 ("vxlan: refactor verification and application of configuration") introduced a change in the behaviour of initial MTU setting: earlier, the MTU for a link created on top of a given lower device, without an initial MTU specification, was set to the MTU of the lower device minus headroom as a result of this path in vxlan_dev_configure(): if (!conf->mtu) dev->mtu = lowerdev->mtu - (use_ipv6 ? VXLAN6_HEADROOM : VXLAN_HEADROOM); which is now gone. Now, the initial MTU, in absence of a configured value, is simply set by ether_setup() to ETH_DATA_LEN (1500 bytes). This breaks userspace expectations in case the MTU of the lower device is higher than 1500 bytes minus headroom. This patch restores the previous behaviour on newlink operation. Since max_mtu can be negative and we update dev->mtu directly, also check it for valid minimum. Reported-by: Junhan Yan <juyan@redhat.com> Fixes: a985343ba906 ("vxlan: refactor verification and application of configuration") Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com> Acked-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-24vxlan: fix the issue that neigh proxy blocks all icmpv6 packetsXin Long
[ Upstream commit 8bff3685a4bbf175a96bc6a528f13455d8d38244 ] Commit f1fb08f6337c ("vxlan: fix ND proxy when skb doesn't have transport header offset") removed icmp6_code and icmp6_type check before calling neigh_reduce when doing neigh proxy. It means all icmpv6 packets would be blocked by this, not only ns packet. In Jianlin's env, even ping6 couldn't work through it. This patch is to bring the icmp6_code and icmp6_type check back and also removed the same check from neigh_reduce(). Fixes: f1fb08f6337c ("vxlan: fix ND proxy when skb doesn't have transport header offset") Reported-by: Jianlin Shi <jishi@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Vincent Bernat <vincent@bernat.im> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-29vxlan: factor out VXLAN-GPE next protocolJiri Benc
The values are shared between VXLAN-GPE and NSH. Originally probably by coincidence but I notified both working groups about this last year and they seem to keep the values in sync since then. Hopefully they'll get a single IANA registry for the values, too. (I asked them for that.) Factor out the code to be shared by the NSH implementation. NSH and MPLS values are added in this patch, too. For MPLS, the drafts incorrectly assign only a single value, while we have two MPLS ethertypes. I raised the problem with both groups. For now, I assume the value is for unicast. Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-13vxlan: change vxlan_[config_]validate() to use netlink_ext_ack for error ↵Girish Moodalbail
reporting The kernel log is not where users expect error messages for netlink requests; as we have extended acks now, we can replace pr_debug() with NL_SET_ERR_MSG_ATTR(). Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net> Signed-off-by: Girish Moodalbail <girish.moodalbail@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
The UDP offload conflict is dealt with by simply taking what is in net-next where we have removed all of the UFO handling code entirely. The TCP conflict was a case of local variables in a function being removed from both net and net-next. In netvsc we had an assignment right next to where a missing set of u64 stats sync object inits were added. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-01vxlan: fix remcsum when GRO on and CHECKSUM_PARTIAL boundary is outer UDPK. Den
In the case that GRO is turned on and the original received packet is CHECKSUM_PARTIAL, if the outer UDP header is exactly at the last csum-unnecessary point, which for instance could occur if the packet comes from another Linux guest on the same Linux host, we have to do either remcsum_adjust or set up CHECKSUM_PARTIAL again with its csum_start properly reset considering RCO. However, since b7fe10e5ebac("gro: Fix remcsum offload to deal with frags in GRO") that barrier in such case could be skipped if GRO turned on, hence we pass over it and the inner L4 validation mistakenly reckons it as a bad csum. This patch makes remcsum_offload being reset at the same time of GRO remcsum cleanup, so as to make it work in such case as before. Fixes: b7fe10e5ebac ("gro: Fix remcsum offload to deal with frags in GRO") Signed-off-by: Koichiro Den <den@klaipeden.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-24geneve/vxlan: offload ports on register/unregister eventsSabrina Dubroca
This improves consistency of handling when moving a netdev to another netns. Most drivers currently do a full reset when the device goes up, so that will flush the offload state anyway. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-24geneve/vxlan: add support for NETDEV_UDP_TUNNEL_DROP_INFOSabrina Dubroca
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-04net, vxlan: convert vxlan_sock.refcnt from atomic_t to refcount_tReshetova, Elena
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-03vxlan: fix hlist corruptionJiri Benc
It's not a good idea to add the same hlist_node to two different hash lists. This leads to various hard to debug memory corruptions. Fixes: b1be00a6c39f ("vxlan: support both IPv4 and IPv6 sockets in a single vxlan device") Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-03vxlan: correctly set vxlan->net when creating the device in a netnsSabrina Dubroca
Commit a985343ba906 ("vxlan: refactor verification and application of configuration") modified vxlan device creation, and replaced the assignment of vxlan->net to src_net with dev_net(netdev) in ->setup(). But dev_net(netdev) is not the same as src_net. At the time ->setup() is called, dev_net hasn't been set yet, so we end up creating the socket for the vxlan device in init_net. Fix this by bringing back the assignment of vxlan->net during device creation. Fixes: a985343ba906 ("vxlan: refactor verification and application of configuration") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Matthias Schiffer <mschiffer@universe-factory.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-27vxlan: fix incorrect nlattr access in MTU checkMatthias Schiffer
The access to the wrong variable could lead to a NULL dereference and possibly other invalid memory reads in vxlan newlink/changelink requests with a IFLA_MTU attribute. Fixes: a985343ba906 "vxlan: refactor verification and application of configuration" Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-26net: add netlink_ext_ack argument to rtnl_link_ops.validateMatthias Schiffer
Add support for extended error reporting. Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-26net: add netlink_ext_ack argument to rtnl_link_ops.changelinkMatthias Schiffer
Add support for extended error reporting. Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-26net: add netlink_ext_ack argument to rtnl_link_ops.newlinkMatthias Schiffer
Add support for extended error reporting. Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-20vxlan: allow multiple VXLANs with same VNI for IPv6 link-local addressesMatthias Schiffer
As link-local addresses are only valid for a single interface, we can allow to use the same VNI for multiple independent VXLANs, as long as the used interfaces are distinct. This way, VXLANs can always be used as a drop-in replacement for VLANs with greater ID space. This also extends VNI lookup to respect the ifindex when link-local IPv6 addresses are used, so using the same VNI on multiple interfaces can actually work. Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-20vxlan: fix snooping for link-local IPv6 addressesMatthias Schiffer
If VXLAN is run over link-local IPv6 addresses, it is necessary to store the ifindex in the FDB entries. Otherwise, the used interface is undefined and unicast communication will most likely fail. Support for link-local IPv4 addresses should be possible as well, but as the semantics aren't as well defined as for IPv6, and there doesn't seem to be much interest in having the support, it's not implemented for now. Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-20vxlan: check valid combinations of address scopesMatthias Schiffer
* Multicast addresses are never valid as local address * Link-local IPv6 unicast addresses may only be used as remote when the local address is link-local as well * Don't allow link-local IPv6 local/remote addresses without interface We also store in the flags field if link-local addresses are used for the follow-up patches that actually make VXLAN over link-local IPv6 work. Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-20vxlan: improve validation of address family configurationMatthias Schiffer
Address families of source and destination addresses must match, and changelink operations can't change the address family. In addition, always use the VXLAN_F_IPV6 to check if a VXLAN device uses IPv4 or IPv6. Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-20vxlan: get rid of redundant vxlan_dev.flagsMatthias Schiffer
There is no good reason to keep the flags twice in vxlan_dev and vxlan_config. Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-20vxlan: refactor verification and application of configurationMatthias Schiffer
The vxlan_dev_configure function was mixing validation and application of the vxlan configuration; this could easily lead to bugs with the changelink operation, as it was hard to see if the function wcould return an error after parts of the configuration had already been applied. This commit splits validation and application out of vxlan_dev_configure as separate functions to make it clearer where error returns are allowed and where the vxlan_dev or net_device may be configured. Log messages in these functions are removed, as it is generally unexpected to find error output for netlink requests in the kernel log. Userspace should be able to handle errors based on the error codes returned via netlink just fine. In addition, some validation and initialization is moved to vxlan_validate and vxlan_setup respectively to improve grouping of similar settings. Finally, this also fixes two actual bugs: * if set, conf->mtu would overwrite dev->mtu in each changelink operation, reverting other changes of dev->mtu * the "if (!conf->dst_port)" branch would never be run, as conf->dst_port was set in vxlan_setup before. This caused VXLAN-GPE to use the same default port as other VXLAN sockets instead of the intended IANA-assigned 4790. Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16networking: make skb_push & __skb_push return void pointersJohannes Berg
It seems like a historic accident that these return unsigned char *, and in many places that means casts are required, more often than not. Make these functions return void * and remove all the casts across the tree, adding a (u8 *) cast only where the unsigned char pointer was used directly, all done with the following spatch: @@ expression SKB, LEN; typedef u8; identifier fn = { skb_push, __skb_push, skb_push_rcsum }; @@ - *(fn(SKB, LEN)) + *(u8 *)fn(SKB, LEN) @@ expression E, SKB, LEN; identifier fn = { skb_push, __skb_push, skb_push_rcsum }; type T; @@ - E = ((T *)(fn(SKB, LEN))) + E = fn(SKB, LEN) @@ expression SKB, LEN; identifier fn = { skb_push, __skb_push, skb_push_rcsum }; @@ - fn(SKB, LEN)[0] + *(u8 *)fn(SKB, LEN) Note that the last part there converts from push(...)[0] to the more idiomatic *(u8 *)push(...). Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16networking: convert many more places to skb_put_zero()Johannes Berg
There were many places that my previous spatch didn't find, as pointed out by yuan linyu in various patches. The following spatch found many more and also removes the now unnecessary casts: @@ identifier p, p2; expression len; expression skb; type t, t2; @@ ( -p = skb_put(skb, len); +p = skb_put_zero(skb, len); | -p = (t)skb_put(skb, len); +p = skb_put_zero(skb, len); ) ... when != p ( p2 = (t2)p; -memset(p2, 0, len); | -memset(p, 0, len); ) @@ type t, t2; identifier p, p2; expression skb; @@ t *p; ... ( -p = skb_put(skb, sizeof(t)); +p = skb_put_zero(skb, sizeof(t)); | -p = (t *)skb_put(skb, sizeof(t)); +p = skb_put_zero(skb, sizeof(t)); ) ... when != p ( p2 = (t2)p; -memset(p2, 0, sizeof(*p)); | -memset(p, 0, sizeof(*p)); ) @@ expression skb, len; @@ -memset(skb_put(skb, len), 0, len); +skb_put_zero(skb, len); Apply it to the tree (with one manual fixup to keep the comment in vxlan.c, which spatch removed.) Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
The conflicts were two cases of overlapping changes in batman-adv and the qed driver. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-12vxlan: dont migrate permanent fdb entries during learnRoopa Prabhu
This patch fixes vxlan_snoop to not move permanent fdb entries on learn events. This is consistent with the bridge fdb handling of permanent entries. Fixes: 26a41ae60438 ("vxlan: only migrate dynamic FDB entries") Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-07net: Fix inconsistent teardown and release of private netdev state.David S. Miller
Network devices can allocate reasources and private memory using netdev_ops->ndo_init(). However, the release of these resources can occur in one of two different places. Either netdev_ops->ndo_uninit() or netdev->destructor(). The decision of which operation frees the resources depends upon whether it is necessary for all netdev refs to be released before it is safe to perform the freeing. netdev_ops->ndo_uninit() presumably can occur right after the NETDEV_UNREGISTER notifier completes and the unicast and multicast address lists are flushed. netdev->destructor(), on the other hand, does not run until the netdev references all go away. Further complicating the situation is that netdev->destructor() almost universally does also a free_netdev(). This creates a problem for the logic in register_netdevice(). Because all callers of register_netdevice() manage the freeing of the netdev, and invoke free_netdev(dev) if register_netdevice() fails. If netdev_ops->ndo_init() succeeds, but something else fails inside of register_netdevice(), it does call ndo_ops->ndo_uninit(). But it is not able to invoke netdev->destructor(). This is because netdev->destructor() will do a free_netdev() and then the caller of register_netdevice() will do the same. However, this means that the resources that would normally be released by netdev->destructor() will not be. Over the years drivers have added local hacks to deal with this, by invoking their destructor parts by hand when register_netdevice() fails. Many drivers do not try to deal with this, and instead we have leaks. Let's close this hole by formalizing the distinction between what private things need to be freed up by netdev->destructor() and whether the driver needs unregister_netdevice() to perform the free_netdev(). netdev->priv_destructor() performs all actions to free up the private resources that used to be freed by netdev->destructor(), except for free_netdev(). netdev->needs_free_netdev is a boolean that indicates whether free_netdev() should be done at the end of unregister_netdevice(). Now, register_netdevice() can sanely release all resources after ndo_ops->ndo_init() succeeds, by invoking both ndo_ops->ndo_uninit() and netdev->priv_destructor(). And at the end of unregister_netdevice(), we invoke netdev->priv_destructor() and optionally call free_netdev(). Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-07vxlan: use a more suitable function when assigning NULLMark Bloch
When stopping the vxlan interface we detach it from the socket. Use RCU_INIT_POINTER() and not rcu_assign_pointer() to do so. Suggested-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Mark Bloch <markb@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-02vxlan: fix use-after-free on deletionMark Bloch
Adding a vxlan interface to a socket isn't symmetrical, while adding is done in vxlan_open() the deletion is done in vxlan_dellink(). This can cause a use-after-free error when we close the vxlan interface before deleting it. We add vxlan_vs_del_dev() to match vxlan_vs_add_dev() and call it from vxlan_stop() to match the call from vxlan_open(). Fixes: 56ef9c909b40 ("vxlan: Move socket initialization to within rtnl scope") Acked-by: Jiri Benc <jbenc@redhat.com> Tested-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Mark Bloch <markb@mellanox.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>