aboutsummaryrefslogtreecommitdiffstats
path: root/include/net
AgeCommit message (Collapse)Author
2019-11-10net/flow_dissector: switch to siphashEric Dumazet
commit 55667441c84fa5e0911a0aac44fb059c15ba6da2 upstream. UDP IPv6 packets auto flowlabels are using a 32bit secret (static u32 hashrnd in net/core/flow_dissector.c) and apply jhash() over fields known by the receivers. Attackers can easily infer the 32bit secret and use this information to identify a device and/or user, since this 32bit secret is only set at boot time. Really, using jhash() to generate cookies sent on the wire is a serious security concern. Trying to change the rol32(hash, 16) in ip6_make_flowlabel() would be a dead end. Trying to periodically change the secret (like in sch_sfq.c) could change paths taken in the network for long lived flows. Let's switch to siphash, as we did in commit df453700e8d8 ("inet: switch IP ID generator to siphash") Using a cryptographically strong pseudo random function will solve this privacy issue and more generally remove other weak points in the stack. Packet schedulers using skb_get_hash_perturb() benefit from this change. Fixes: b56774163f99 ("ipv6: Enable auto flow labels by default") Fixes: 42240901f7c4 ("ipv6: Implement different admin modes for automatic flow labels") Fixes: 67800f9b1f4e ("ipv6: Call skb_get_hash_flowi6 to get skb->hash in ip6_make_flowlabel") Fixes: cb1ce2ef387b ("ipv6: Implement automatic flow label generation on transmit") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jonathan Berger <jonathann1@walla.com> Reported-by: Amit Klein <aksecurity@gmail.com> Reported-by: Benny Pinkas <benny@pinkas.net> Cc: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-10net: fix sk_page_frag() recursion from memory reclaimTejun Heo
[ Upstream commit 20eb4f29b60286e0d6dc01d9c260b4bd383c58fb ] sk_page_frag() optimizes skb_frag allocations by using per-task skb_frag cache when it knows it's the only user. The condition is determined by seeing whether the socket allocation mask allows blocking - if the allocation may block, it obviously owns the task's context and ergo exclusively owns current->task_frag. Unfortunately, this misses recursion through memory reclaim path. Please take a look at the following backtrace. [2] RIP: 0010:tcp_sendmsg_locked+0xccf/0xe10 ... tcp_sendmsg+0x27/0x40 sock_sendmsg+0x30/0x40 sock_xmit.isra.24+0xa1/0x170 [nbd] nbd_send_cmd+0x1d2/0x690 [nbd] nbd_queue_rq+0x1b5/0x3b0 [nbd] __blk_mq_try_issue_directly+0x108/0x1b0 blk_mq_request_issue_directly+0xbd/0xe0 blk_mq_try_issue_list_directly+0x41/0xb0 blk_mq_sched_insert_requests+0xa2/0xe0 blk_mq_flush_plug_list+0x205/0x2a0 blk_flush_plug_list+0xc3/0xf0 [1] blk_finish_plug+0x21/0x2e _xfs_buf_ioapply+0x313/0x460 __xfs_buf_submit+0x67/0x220 xfs_buf_read_map+0x113/0x1a0 xfs_trans_read_buf_map+0xbf/0x330 xfs_btree_read_buf_block.constprop.42+0x95/0xd0 xfs_btree_lookup_get_block+0x95/0x170 xfs_btree_lookup+0xcc/0x470 xfs_bmap_del_extent_real+0x254/0x9a0 __xfs_bunmapi+0x45c/0xab0 xfs_bunmapi+0x15/0x30 xfs_itruncate_extents_flags+0xca/0x250 xfs_free_eofblocks+0x181/0x1e0 xfs_fs_destroy_inode+0xa8/0x1b0 destroy_inode+0x38/0x70 dispose_list+0x35/0x50 prune_icache_sb+0x52/0x70 super_cache_scan+0x120/0x1a0 do_shrink_slab+0x120/0x290 shrink_slab+0x216/0x2b0 shrink_node+0x1b6/0x4a0 do_try_to_free_pages+0xc6/0x370 try_to_free_mem_cgroup_pages+0xe3/0x1e0 try_charge+0x29e/0x790 mem_cgroup_charge_skmem+0x6a/0x100 __sk_mem_raise_allocated+0x18e/0x390 __sk_mem_schedule+0x2a/0x40 [0] tcp_sendmsg_locked+0x8eb/0xe10 tcp_sendmsg+0x27/0x40 sock_sendmsg+0x30/0x40 ___sys_sendmsg+0x26d/0x2b0 __sys_sendmsg+0x57/0xa0 do_syscall_64+0x42/0x100 entry_SYSCALL_64_after_hwframe+0x44/0xa9 In [0], tcp_send_msg_locked() was using current->page_frag when it called sk_wmem_schedule(). It already calculated how many bytes can be fit into current->page_frag. Due to memory pressure, sk_wmem_schedule() called into memory reclaim path which called into xfs and then IO issue path. Because the filesystem in question is backed by nbd, the control goes back into the tcp layer - back into tcp_sendmsg_locked(). nbd sets sk_allocation to (GFP_NOIO | __GFP_MEMALLOC) which makes sense - it's in the process of freeing memory and wants to be able to, e.g., drop clean pages to make forward progress. However, this confused sk_page_frag() called from [2]. Because it only tests whether the allocation allows blocking which it does, it now thinks current->page_frag can be used again although it already was being used in [0]. After [2] used current->page_frag, the offset would be increased by the used amount. When the control returns to [0], current->page_frag's offset is increased and the previously calculated number of bytes now may overrun the end of allocated memory leading to silent memory corruptions. Fix it by adding gfpflags_normal_context() which tests sleepable && !reclaim and use it to determine whether to use current->task_frag. v2: Eric didn't like gfp flags being tested twice. Introduce a new helper gfpflags_normal_context() and combine the two tests. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-10net: annotate lockless accesses to sk->sk_napi_idEric Dumazet
[ Upstream commit ee8d153d46a3b98c064ee15c0c0a3bbf1450e5a1 ] We already annotated most accesses to sk->sk_napi_id We missed sk_mark_napi_id() and sk_mark_napi_id_once() which might be called without socket lock held in UDP stack. KCSAN reported : BUG: KCSAN: data-race in udpv6_queue_rcv_one_skb / udpv6_queue_rcv_one_skb write to 0xffff888121c6d108 of 4 bytes by interrupt on cpu 0: sk_mark_napi_id include/net/busy_poll.h:125 [inline] __udpv6_queue_rcv_skb net/ipv6/udp.c:571 [inline] udpv6_queue_rcv_one_skb+0x70c/0xb40 net/ipv6/udp.c:672 udpv6_queue_rcv_skb+0xb5/0x400 net/ipv6/udp.c:689 udp6_unicast_rcv_skb.isra.0+0xd7/0x180 net/ipv6/udp.c:832 __udp6_lib_rcv+0x69c/0x1770 net/ipv6/udp.c:913 udpv6_rcv+0x2b/0x40 net/ipv6/udp.c:1015 ip6_protocol_deliver_rcu+0x22a/0xbe0 net/ipv6/ip6_input.c:409 ip6_input_finish+0x30/0x50 net/ipv6/ip6_input.c:450 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip6_input+0x177/0x190 net/ipv6/ip6_input.c:459 dst_input include/net/dst.h:442 [inline] ip6_rcv_finish+0x110/0x140 net/ipv6/ip6_input.c:76 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ipv6_rcv+0x1a1/0x1b0 net/ipv6/ip6_input.c:284 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124 process_backlog+0x1d3/0x420 net/core/dev.c:5955 napi_poll net/core/dev.c:6392 [inline] net_rx_action+0x3ae/0xa90 net/core/dev.c:6460 write to 0xffff888121c6d108 of 4 bytes by interrupt on cpu 1: sk_mark_napi_id include/net/busy_poll.h:125 [inline] __udpv6_queue_rcv_skb net/ipv6/udp.c:571 [inline] udpv6_queue_rcv_one_skb+0x70c/0xb40 net/ipv6/udp.c:672 udpv6_queue_rcv_skb+0xb5/0x400 net/ipv6/udp.c:689 udp6_unicast_rcv_skb.isra.0+0xd7/0x180 net/ipv6/udp.c:832 __udp6_lib_rcv+0x69c/0x1770 net/ipv6/udp.c:913 udpv6_rcv+0x2b/0x40 net/ipv6/udp.c:1015 ip6_protocol_deliver_rcu+0x22a/0xbe0 net/ipv6/ip6_input.c:409 ip6_input_finish+0x30/0x50 net/ipv6/ip6_input.c:450 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip6_input+0x177/0x190 net/ipv6/ip6_input.c:459 dst_input include/net/dst.h:442 [inline] ip6_rcv_finish+0x110/0x140 net/ipv6/ip6_input.c:76 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ipv6_rcv+0x1a1/0x1b0 net/ipv6/ip6_input.c:284 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124 process_backlog+0x1d3/0x420 net/core/dev.c:5955 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 10890 Comm: syz-executor.0 Not tainted 5.4.0-rc3+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: e68b6e50fa35 ("udp: enable busy polling for all sockets") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-10net: annotate accesses to sk->sk_incoming_cpuEric Dumazet
[ Upstream commit 7170a977743b72cf3eb46ef6ef89885dc7ad3621 ] This socket field can be read and written by concurrent cpus. Use READ_ONCE() and WRITE_ONCE() annotations to document this, and avoid some compiler 'optimizations'. KCSAN reported : BUG: KCSAN: data-race in tcp_v4_rcv / tcp_v4_rcv write to 0xffff88812220763c of 4 bytes by interrupt on cpu 0: sk_incoming_cpu_update include/net/sock.h:953 [inline] tcp_v4_rcv+0x1b3c/0x1bb0 net/ipv4/tcp_ipv4.c:1934 ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252 dst_input include/net/dst.h:442 [inline] ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124 process_backlog+0x1d3/0x420 net/core/dev.c:5955 napi_poll net/core/dev.c:6392 [inline] net_rx_action+0x3ae/0xa90 net/core/dev.c:6460 __do_softirq+0x115/0x33f kernel/softirq.c:292 do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1082 do_softirq.part.0+0x6b/0x80 kernel/softirq.c:337 do_softirq kernel/softirq.c:329 [inline] __local_bh_enable_ip+0x76/0x80 kernel/softirq.c:189 read to 0xffff88812220763c of 4 bytes by interrupt on cpu 1: sk_incoming_cpu_update include/net/sock.h:952 [inline] tcp_v4_rcv+0x181a/0x1bb0 net/ipv4/tcp_ipv4.c:1934 ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252 dst_input include/net/dst.h:442 [inline] ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124 process_backlog+0x1d3/0x420 net/core/dev.c:5955 napi_poll net/core/dev.c:6392 [inline] net_rx_action+0x3ae/0xa90 net/core/dev.c:6460 __do_softirq+0x115/0x33f kernel/softirq.c:292 run_ksoftirqd+0x46/0x60 kernel/softirq.c:603 smpboot_thread_fn+0x37d/0x4a0 kernel/smpboot.c:165 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.4.0-rc3+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-06sctp: fix the issue that flags are ignored when using kernel_connectXin Long
commit 644fbdeacf1d3edd366e44b8ba214de9d1dd66a9 upstream. Now sctp uses inet_dgram_connect as its proto_ops .connect, and the flags param can't be passed into its proto .connect where this flags is really needed. sctp works around it by getting flags from socket file in __sctp_connect. It works for connecting from userspace, as inherently the user sock has socket file and it passes f_flags as the flags param into the proto_ops .connect. However, the sock created by sock_create_kern doesn't have a socket file, and it passes the flags (like O_NONBLOCK) by using the flags param in kernel_connect, which calls proto_ops .connect later. So to fix it, this patch defines a new proto_ops .connect for sctp, sctp_inet_connect, which calls __sctp_connect() directly with this flags param. After this, the sctp's proto .connect can be removed. Note that sctp_inet_connect doesn't need to do some checks that are not needed for sctp, which makes thing better than with inet_dgram_connect. Suggested-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Reviewed-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-06sch_netem: fix rcu splat in netem_enqueue()Eric Dumazet
commit 159d2c7d8106177bd9a986fd005a311fe0d11285 upstream. qdisc_root() use from netem_enqueue() triggers a lockdep warning. __dev_queue_xmit() uses rcu_read_lock_bh() which is not equivalent to rcu_read_lock() + local_bh_disable_bh as far as lockdep is concerned. WARNING: suspicious RCU usage 5.3.0-rc7+ #0 Not tainted ----------------------------- include/net/sch_generic.h:492 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 3 locks held by syz-executor427/8855: #0: 00000000b5525c01 (rcu_read_lock_bh){....}, at: lwtunnel_xmit_redirect include/net/lwtunnel.h:92 [inline] #0: 00000000b5525c01 (rcu_read_lock_bh){....}, at: ip_finish_output2+0x2dc/0x2570 net/ipv4/ip_output.c:214 #1: 00000000b5525c01 (rcu_read_lock_bh){....}, at: __dev_queue_xmit+0x20a/0x3650 net/core/dev.c:3804 #2: 00000000364bae92 (&(&sch->q.lock)->rlock){+.-.}, at: spin_lock include/linux/spinlock.h:338 [inline] #2: 00000000364bae92 (&(&sch->q.lock)->rlock){+.-.}, at: __dev_xmit_skb net/core/dev.c:3502 [inline] #2: 00000000364bae92 (&(&sch->q.lock)->rlock){+.-.}, at: __dev_queue_xmit+0x14b8/0x3650 net/core/dev.c:3838 stack backtrace: CPU: 0 PID: 8855 Comm: syz-executor427 Not tainted 5.3.0-rc7+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 lockdep_rcu_suspicious+0x153/0x15d kernel/locking/lockdep.c:5357 qdisc_root include/net/sch_generic.h:492 [inline] netem_enqueue+0x1cfb/0x2d80 net/sched/sch_netem.c:479 __dev_xmit_skb net/core/dev.c:3527 [inline] __dev_queue_xmit+0x15d2/0x3650 net/core/dev.c:3838 dev_queue_xmit+0x18/0x20 net/core/dev.c:3902 neigh_hh_output include/net/neighbour.h:500 [inline] neigh_output include/net/neighbour.h:509 [inline] ip_finish_output2+0x1726/0x2570 net/ipv4/ip_output.c:228 __ip_finish_output net/ipv4/ip_output.c:308 [inline] __ip_finish_output+0x5fc/0xb90 net/ipv4/ip_output.c:290 ip_finish_output+0x38/0x1f0 net/ipv4/ip_output.c:318 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip_mc_output+0x292/0xf40 net/ipv4/ip_output.c:417 dst_output include/net/dst.h:436 [inline] ip_local_out+0xbb/0x190 net/ipv4/ip_output.c:125 ip_send_skb+0x42/0xf0 net/ipv4/ip_output.c:1555 udp_send_skb.isra.0+0x6b2/0x1160 net/ipv4/udp.c:887 udp_sendmsg+0x1e96/0x2820 net/ipv4/udp.c:1174 inet_sendmsg+0x9e/0xe0 net/ipv4/af_inet.c:807 sock_sendmsg_nosec net/socket.c:637 [inline] sock_sendmsg+0xd7/0x130 net/socket.c:657 ___sys_sendmsg+0x3e2/0x920 net/socket.c:2311 __sys_sendmmsg+0x1bf/0x4d0 net/socket.c:2413 __do_sys_sendmmsg net/socket.c:2442 [inline] __se_sys_sendmmsg net/socket.c:2439 [inline] __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2439 do_syscall_64+0xfd/0x6a0 arch/x86/entry/common.c:296 entry_SYSCALL_64_after_hwframe+0x49/0xbe Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-06llc: fix sk_buff leak in llc_conn_service()Eric Biggers
commit b74555de21acd791f12c4a1aeaf653dd7ac21133 upstream. syzbot reported: BUG: memory leak unreferenced object 0xffff88811eb3de00 (size 224): comm "syz-executor559", pid 7315, jiffies 4294943019 (age 10.300s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 a0 38 24 81 88 ff ff 00 c0 f2 15 81 88 ff ff ..8$............ backtrace: [<000000008d1c66a1>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 [inline] [<000000008d1c66a1>] slab_post_alloc_hook mm/slab.h:439 [inline] [<000000008d1c66a1>] slab_alloc_node mm/slab.c:3269 [inline] [<000000008d1c66a1>] kmem_cache_alloc_node+0x153/0x2a0 mm/slab.c:3579 [<00000000447d9496>] __alloc_skb+0x6e/0x210 net/core/skbuff.c:198 [<000000000cdbf82f>] alloc_skb include/linux/skbuff.h:1058 [inline] [<000000000cdbf82f>] llc_alloc_frame+0x66/0x110 net/llc/llc_sap.c:54 [<000000002418b52e>] llc_conn_ac_send_sabme_cmd_p_set_x+0x2f/0x140 net/llc/llc_c_ac.c:777 [<000000001372ae17>] llc_exec_conn_trans_actions net/llc/llc_conn.c:475 [inline] [<000000001372ae17>] llc_conn_service net/llc/llc_conn.c:400 [inline] [<000000001372ae17>] llc_conn_state_process+0x1ac/0x640 net/llc/llc_conn.c:75 [<00000000f27e53c1>] llc_establish_connection+0x110/0x170 net/llc/llc_if.c:109 [<00000000291b2ca0>] llc_ui_connect+0x10e/0x370 net/llc/af_llc.c:477 [<000000000f9c740b>] __sys_connect+0x11d/0x170 net/socket.c:1840 [...] The bug is that most callers of llc_conn_send_pdu() assume it consumes a reference to the skb, when actually due to commit b85ab56c3f81 ("llc: properly handle dev_queue_xmit() return value") it doesn't. Revert most of that commit, and instead make the few places that need llc_conn_send_pdu() to *not* consume a reference call skb_get() before. Fixes: b85ab56c3f81 ("llc: properly handle dev_queue_xmit() return value") Reported-by: syzbot+6b825a6494a04cc0e3f7@syzkaller.appspotmail.com Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-16ip6: fix skb leak in ip6frag_expire_frag_queue()Eric Dumazet
commit 47d3d7fdb10a21c223036b58bd70ffdc24a472c4 upstream. Since ip6frag_expire_frag_queue() now pulls the head skb from frag queue, we should no longer use skb_get(), since this leads to an skb leak. Stefan Bader initially reported a problem in 4.4.stable [1] caused by the skb_get(), so this patch should also fix this issue. 296583.091021] kernel BUG at /build/linux-6VmqmP/linux-4.4.0/net/core/skbuff.c:1207! [296583.091734] Call Trace: [296583.091749] [<ffffffff81740e50>] __pskb_pull_tail+0x50/0x350 [296583.091764] [<ffffffff8183939a>] _decode_session6+0x26a/0x400 [296583.091779] [<ffffffff817ec719>] __xfrm_decode_session+0x39/0x50 [296583.091795] [<ffffffff818239d0>] icmpv6_route_lookup+0xf0/0x1c0 [296583.091809] [<ffffffff81824421>] icmp6_send+0x5e1/0x940 [296583.091823] [<ffffffff81753238>] ? __netif_receive_skb+0x18/0x60 [296583.091838] [<ffffffff817532b2>] ? netif_receive_skb_internal+0x32/0xa0 [296583.091858] [<ffffffffc0199f74>] ? ixgbe_clean_rx_irq+0x594/0xac0 [ixgbe] [296583.091876] [<ffffffffc04eb260>] ? nf_ct_net_exit+0x50/0x50 [nf_defrag_ipv6] [296583.091893] [<ffffffff8183d431>] icmpv6_send+0x21/0x30 [296583.091906] [<ffffffff8182b500>] ip6_expire_frag_queue+0xe0/0x120 [296583.091921] [<ffffffffc04eb27f>] nf_ct_frag6_expire+0x1f/0x30 [nf_defrag_ipv6] [296583.091938] [<ffffffff810f3b57>] call_timer_fn+0x37/0x140 [296583.091951] [<ffffffffc04eb260>] ? nf_ct_net_exit+0x50/0x50 [nf_defrag_ipv6] [296583.091968] [<ffffffff810f5464>] run_timer_softirq+0x234/0x330 [296583.091982] [<ffffffff8108a339>] __do_softirq+0x109/0x2b0 Fixes: d4289fcc9b16 ("net: IP6 defrag: use rbtrees for IPv6 defrag") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Stefan Bader <stefan.bader@canonical.com> Cc: Peter Oskolkov <posk@google.com> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Baolin Wang <baolin.wang@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-16xfrm: clean up xfrm protocol checksCong Wang
commit dbb2483b2a46fbaf833cfb5deb5ed9cace9c7399 upstream. In commit 6a53b7593233 ("xfrm: check id proto in validate_tmpl()") I introduced a check for xfrm protocol, but according to Herbert IPSEC_PROTO_ANY should only be used as a wildcard for lookup, so it should be removed from validate_tmpl(). And, IPSEC_PROTO_ANY is expected to only match 3 IPSec-specific protocols, this is why xfrm_state_flush() could still miss IPPROTO_ROUTING, which leads that those entries are left in net->xfrm.state_all before exit net. Fix this by replacing IPSEC_PROTO_ANY with zero. This patch also extracts the check from validate_tmpl() to xfrm_id_proto_valid() and uses it in parse_ipsecrequest(). With this, no other protocols should be added into xfrm. Fixes: 6a53b7593233 ("xfrm: check id proto in validate_tmpl()") Reported-by: syzbot+0bf0519d6e0de15914fe@syzkaller.appspotmail.com Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Zubin Mithra <zsm@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-10net: sched: act_sample: fix psample group handling on overwriteVlad Buslov
[ Upstream commit dbf47a2a094edf58983265e323ca4bdcdb58b5ee ] Action sample doesn't properly handle psample_group pointer in overwrite case. Following issues need to be fixed: - In tcf_sample_init() function RCU_INIT_POINTER() is used to set s->psample_group, even though we neither setting the pointer to NULL, nor preventing concurrent readers from accessing the pointer in some way. Use rcu_swap_protected() instead to safely reset the pointer. - Old value of s->psample_group is not released or deallocated in any way, which results resource leak. Use psample_group_put() on non-NULL value obtained with rcu_swap_protected(). - The function psample_group_put() that released reference to struct psample_group pointed by rcu-pointer s->psample_group doesn't respect rcu grace period when deallocating it. Extend struct psample_group with rcu head and use kfree_rcu when freeing it. Fixes: 5c5670fae430 ("net/sched: Introduce sample tc action") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-10net_sched: fix a NULL pointer deref in ipt actionCong Wang
[ Upstream commit 981471bd3abf4d572097645d765391533aac327d ] The net pointer in struct xt_tgdtor_param is not explicitly initialized therefore is still NULL when dereferencing it. So we have to find a way to pass the correct net pointer to ipt_destroy_target(). The best way I find is just saving the net pointer inside the per netns struct tcf_idrinfo, which could make this patch smaller. Fixes: 0c66dc1ea3f0 ("netfilter: conntrack: register hooks in netns when needed by ruleset") Reported-and-tested-by: itugrok@yahoo.com Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-06tcp: fix tcp_rtx_queue_tail in case of empty retransmit queueTim Froidcoeur
Commit 8c3088f895a0 ("tcp: be more careful in tcp_fragment()") triggers following stack trace: [25244.848046] kernel BUG at ./include/linux/skbuff.h:1406! [25244.859335] RIP: 0010:skb_queue_prev+0x9/0xc [25244.888167] Call Trace: [25244.889182] <IRQ> [25244.890001] tcp_fragment+0x9c/0x2cf [25244.891295] tcp_write_xmit+0x68f/0x988 [25244.892732] __tcp_push_pending_frames+0x3b/0xa0 [25244.894347] tcp_data_snd_check+0x2a/0xc8 [25244.895775] tcp_rcv_established+0x2a8/0x30d [25244.897282] tcp_v4_do_rcv+0xb2/0x158 [25244.898666] tcp_v4_rcv+0x692/0x956 [25244.899959] ip_local_deliver_finish+0xeb/0x169 [25244.901547] __netif_receive_skb_core+0x51c/0x582 [25244.903193] ? inet_gro_receive+0x239/0x247 [25244.904756] netif_receive_skb_internal+0xab/0xc6 [25244.906395] napi_gro_receive+0x8a/0xc0 [25244.907760] receive_buf+0x9a1/0x9cd [25244.909160] ? load_balance+0x17a/0x7b7 [25244.910536] ? vring_unmap_one+0x18/0x61 [25244.911932] ? detach_buf+0x60/0xfa [25244.913234] virtnet_poll+0x128/0x1e1 [25244.914607] net_rx_action+0x12a/0x2b1 [25244.915953] __do_softirq+0x11c/0x26b [25244.917269] ? handle_irq_event+0x44/0x56 [25244.918695] irq_exit+0x61/0xa0 [25244.919947] do_IRQ+0x9d/0xbb [25244.921065] common_interrupt+0x85/0x85 [25244.922479] </IRQ> tcp_rtx_queue_tail() (called by tcp_fragment()) can call tcp_write_queue_prev() on the first packet in the queue, which will trigger the BUG in tcp_write_queue_prev(), because there is no previous packet. This happens when the retransmit queue is empty, for example in case of a zero window. Commit 8c3088f895a0 ("tcp: be more careful in tcp_fragment()") was not a simple cherry-pick of the original one from master (b617158dc096) because there is a specific TCP rtx queue only since v4.15. For more details, please see the commit message of b617158dc096 ("tcp: be more careful in tcp_fragment()"). The BUG() is hit due to the specific code added to versions older than v4.15. The comment in skb_queue_prev() (include/linux/skbuff.h:1406), just before the BUG_ON() somehow suggests to add a check before using it, what Tim did. In master, this code path causing the issue will not be taken because the implementation of tcp_rtx_queue_tail() is different: tcp_fragment() → tcp_rtx_queue_tail() → tcp_write_queue_prev() → skb_queue_prev() → BUG_ON() Fixes: 8c3088f895a0 ("tcp: be more careful in tcp_fragment()") Signed-off-by: Tim Froidcoeur <tim.froidcoeur@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Reviewed-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-25xfrm: policy: remove pcpu policy cacheFlorian Westphal
commit e4db5b61c572475bbbcf63e3c8a2606bfccf2c9d upstream. Kristian Evensen says: In a project I am involved in, we are running ipsec (Strongswan) on different mt7621-based routers. Each router is configured as an initiator and has around ~30 tunnels to different responders (running on misc. devices). Before the flow cache was removed (kernel 4.9), we got a combined throughput of around 70Mbit/s for all tunnels on one router. However, we recently switched to kernel 4.14 (4.14.48), and the total throughput is somewhere around 57Mbit/s (best-case). I.e., a drop of around 20%. Reverting the flow cache removal restores, as expected, performance levels to that of kernel 4.9. When pcpu xdst exists, it has to be validated first before it can be used. A negative hit thus increases cost vs. no-cache. As number of tunnels increases, hit rate decreases so this pcpu caching isn't a viable strategy. Furthermore, the xdst cache also needs to run with BH off, so when removing this the bh disable/enable pairs can be removed too. Kristian tested a 4.14.y backport of this change and reported increased performance: In our tests, the throughput reduction has been reduced from around -20% to -5%. We also see that the overall throughput is independent of the number of tunnels, while before the throughput was reduced as the number of tunnels increased. Reported-by: Kristian Evensen <kristian.evensen@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-25Revert "tcp: Clear sk_send_head after purging the write queue"Sasha Levin
This reverts commit e99e7745d03fc50ba7c5b7c91c17294fee2d5991. Ben Hutchings writes: >Sorry, this is the same issue that was already fixed by "tcp: reset >sk_send_head in tcp_write_queue_purge". You can drop my version from >the queue for 4.4 and 4.9 and revert it for 4.14. Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-16tcp: Clear sk_send_head after purging the write queueBen Hutchings
Denis Andzakovic discovered a potential use-after-free in older kernel versions, using syzkaller. tcp_write_queue_purge() frees all skbs in the TCP write queue and can leave sk->sk_send_head pointing to freed memory. tcp_disconnect() clears that pointer after calling tcp_write_queue_purge(), but tcp_connect() does not. It is (surprisingly) possible to add to the write queue between disconnection and reconnection, so this needs to be done in both places. This bug was introduced by backports of commit 7f582b248d0a ("tcp: purge write queue in tcp_connect_init()") and does not exist upstream because of earlier changes in commit 75c119afe14f ("tcp: implement rb-tree based retransmit queue"). The latter is a major change that's not suitable for stable. Reported-by: Denis Andzakovic <denis.andzakovic@pulsesecurity.co.nz> Bisected-by: Salvatore Bonaccorso <carnil@debian.org> Fixes: 7f582b248d0a ("tcp: purge write queue in tcp_connect_init()") Cc: <stable@vger.kernel.org> # before 4.15 Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-09tcp: be more careful in tcp_fragment()Eric Dumazet
commit b617158dc096709d8600c53b6052144d12b89fab upstream. Some applications set tiny SO_SNDBUF values and expect TCP to just work. Recent patches to address CVE-2019-11478 broke them in case of losses, since retransmits might be prevented. We should allow these flows to make progress. This patch allows the first and last skb in retransmit queue to be split even if memory limits are hit. It also adds the some room due to the fact that tcp_sendmsg() and tcp_sendpage() might overshoot sk_wmem_queued by about one full TSO skb (64KB size). Note this allowance was already present in stable backports for kernels < 4.15 Note for < 4.15 backports : tcp_rtx_queue_tail() will probably look like : static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk) { struct sk_buff *skb = tcp_send_head(sk); return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk); } Fixes: f070ef2ac667 ("tcp: tcp_fragment() should apply sane memory limits") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrew Prout <aprout@ll.mit.edu> Tested-by: Andrew Prout <aprout@ll.mit.edu> Tested-by: Jonathan Lemon <jonathan.lemon@gmail.com> Tested-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Christoph Paasch <cpaasch@apple.com> Cc: Jonathan Looney <jtl@netflix.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-04VSOCK: use TCP state constants for sk_stateStefan Hajnoczi
commit 3b4477d2dcf2709d0be89e2a8dced3d0f4a017f2 upstream. There are two state fields: socket->state and sock->sk_state. The socket->state field uses SS_UNCONNECTED, SS_CONNECTED, etc while the sock->sk_state typically uses values that match TCP state constants (TCP_CLOSE, TCP_ESTABLISHED). AF_VSOCK does not follow this convention and instead uses SS_* constants for both fields. The sk_state field will be exposed to userspace through the vsock_diag interface for ss(8), netstat(8), and other programs. This patch switches sk_state to TCP state constants so that the meaning of this field is consistent with other address families. Not just AF_INET and AF_INET6 use the TCP constants, AF_UNIX and others do too. The following mapping was used to convert the code: SS_FREE -> TCP_CLOSE SS_UNCONNECTED -> TCP_CLOSE SS_CONNECTING -> TCP_SYN_SENT SS_CONNECTED -> TCP_ESTABLISHED SS_DISCONNECTING -> TCP_CLOSING VSOCK_SS_LISTEN -> TCP_LISTEN In __vsock_create() the sk_state initialization was dropped because sock_init_data() already initializes sk_state to TCP_CLOSE. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> [Adjusted net/vmw_vsock/hyperv_transport.c since the commit b4562ca7925a ("hv_sock: add locking in the open/close/release code paths") and the commit c9d3fe9da094 ("VSOCK: fix outdated sk_state value in hvs_release()") were backported before 3b4477d2dcf2.] Signed-off-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-31tcp: fix tcp_set_congestion_control() use from bpf hookEric Dumazet
[ Upstream commit 8d650cdedaabb33e85e9b7c517c0c71fcecc1de9 ] Neal reported incorrect use of ns_capable() from bpf hook. bpf_setsockopt(...TCP_CONGESTION...) -> tcp_set_congestion_control() -> ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN) -> ns_capable_common() -> current_cred() -> rcu_dereference_protected(current->cred, 1) Accessing 'current' in bpf context makes no sense, since packets are processed from softirq context. As Neal stated : The capability check in tcp_set_congestion_control() was written assuming a system call context, and then was reused from a BPF call site. The fix is to add a new parameter to tcp_set_congestion_control(), so that the ns_capable() call is only performed under the right context. Fixes: 91b5b21c7c16 ("bpf: Add support for changing congestion control") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Lawrence Brakmo <brakmo@fb.com> Reported-by: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-31net: make skb_dst_force return true when dst is refcountedFlorian Westphal
[ Upstream commit b60a77386b1d4868f72f6353d35dabe5fbe981f2 ] netfilter did not expect that skb_dst_force() can cause skb to lose its dst entry. I got a bug report with a skb->dst NULL dereference in netfilter output path. The backtrace contains nf_reinject(), so the dst might have been cleared when skb got queued to userspace. Other users were fixed via if (skb_dst(skb)) { skb_dst_force(skb); if (!skb_dst(skb)) goto handle_err; } But I think its preferable to make the 'dst might be cleared' part of the function explicit. In netfilter case, skb with a null dst is expected when queueing in prerouting hook, so drop skb for the other hooks. v2: v1 of this patch returned true in case skb had no dst entry. Eric said: Say if we have two skb_dst_force() calls for some reason on the same skb, only the first one will return false. This now returns false even when skb had no dst, as per Erics suggestion, so callers might need to check skb_dst() first before skb_dst_force(). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-31ipvs: fix tinfo memory leak in start_sync_threadJulian Anastasov
[ Upstream commit 5db7c8b9f9fc2aeec671ae3ca6375752c162e0e7 ] syzkaller reports for memory leak in start_sync_thread [1] As Eric points out, kthread may start and stop before the threadfn function is called, so there is no chance the data (tinfo in our case) to be released in thread. Fix this by releasing tinfo in the controlling code instead. [1] BUG: memory leak unreferenced object 0xffff8881206bf700 (size 32): comm "syz-executor761", pid 7268, jiffies 4294943441 (age 20.470s) hex dump (first 32 bytes): 00 40 7c 09 81 88 ff ff 80 45 b8 21 81 88 ff ff .@|......E.!.... 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<0000000057619e23>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 [inline] [<0000000057619e23>] slab_post_alloc_hook mm/slab.h:439 [inline] [<0000000057619e23>] slab_alloc mm/slab.c:3326 [inline] [<0000000057619e23>] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553 [<0000000086ce5479>] kmalloc include/linux/slab.h:547 [inline] [<0000000086ce5479>] start_sync_thread+0x5d2/0xe10 net/netfilter/ipvs/ip_vs_sync.c:1862 [<000000001a9229cc>] do_ip_vs_set_ctl+0x4c5/0x780 net/netfilter/ipvs/ip_vs_ctl.c:2402 [<00000000ece457c8>] nf_sockopt net/netfilter/nf_sockopt.c:106 [inline] [<00000000ece457c8>] nf_setsockopt+0x4c/0x80 net/netfilter/nf_sockopt.c:115 [<00000000942f62d4>] ip_setsockopt net/ipv4/ip_sockglue.c:1258 [inline] [<00000000942f62d4>] ip_setsockopt+0x9b/0xb0 net/ipv4/ip_sockglue.c:1238 [<00000000a56a8ffd>] udp_setsockopt+0x4e/0x90 net/ipv4/udp.c:2616 [<00000000fa895401>] sock_common_setsockopt+0x38/0x50 net/core/sock.c:3130 [<0000000095eef4cf>] __sys_setsockopt+0x98/0x120 net/socket.c:2078 [<000000009747cf88>] __do_sys_setsockopt net/socket.c:2089 [inline] [<000000009747cf88>] __se_sys_setsockopt net/socket.c:2086 [inline] [<000000009747cf88>] __x64_sys_setsockopt+0x26/0x30 net/socket.c:2086 [<00000000ded8ba80>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:301 [<00000000893b4ac8>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reported-by: syzbot+7e2e50c8adfccd2e5041@syzkaller.appspotmail.com Suggested-by: Eric Biggers <ebiggers@kernel.org> Fixes: 998e7a76804b ("ipvs: Use kthread_run() instead of doing a double-fork via kernel_thread()") Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-21ip6_tunnel: allow not to count pkts on tstats by passing dev as NULLXin Long
[ Upstream commit 6f6a8622057c92408930c31698394fae1557b188 ] A similar fix to Patch "ip_tunnel: allow not to count pkts on tstats by setting skb's dev to NULL" is also needed by ip6_tunnel. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-25Bluetooth: Align minimum encryption key size for LE and BR/EDR connectionsMarcel Holtmann
commit d5bb334a8e171b262e48f378bd2096c0ea458265 upstream. The minimum encryption key size for LE connections is 56 bits and to align LE with BR/EDR, enforce 56 bits of minimum encryption key size for BR/EDR connections as well. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-17tcp: add tcp_min_snd_mss sysctlEric Dumazet
commit 5f3e2bf008c2221478101ee72f5cb4654b9fc363 upstream. Some TCP peers announce a very small MSS option in their SYN and/or SYN/ACK messages. This forces the stack to send packets with a very high network/cpu overhead. Linux has enforced a minimal value of 48. Since this value includes the size of TCP options, and that the options can consume up to 40 bytes, this means that each segment can include only 8 bytes of payload. In some cases, it can be useful to increase the minimal value to a saner value. We still let the default to 48 (TCP_MIN_SND_MSS), for compatibility reasons. Note that TCP_MAXSEG socket option enforces a minimal value of (TCP_MIN_MSS). David Miller increased this minimal value in commit c39508d6f118 ("tcp: Make TCP_MAXSEG minimum more correct.") from 64 to 88. We might in the future merge TCP_MIN_SND_MSS and TCP_MIN_MSS. CVE-2019-11479 -- tcp mss hardcoded to 48 Signed-off-by: Eric Dumazet <edumazet@google.com> Suggested-by: Jonathan Looney <jtl@netflix.com> Acked-by: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: Bruce Curtis <brucec@netflix.com> Cc: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-17tcp: limit payload size of sacked skbsEric Dumazet
commit 3b4929f65b0d8249f19a50245cd88ed1a2f78cff upstream. Jonathan Looney reported that TCP can trigger the following crash in tcp_shifted_skb() : BUG_ON(tcp_skb_pcount(skb) < pcount); This can happen if the remote peer has advertized the smallest MSS that linux TCP accepts : 48 An skb can hold 17 fragments, and each fragment can hold 32KB on x86, or 64KB on PowerPC. This means that the 16bit witdh of TCP_SKB_CB(skb)->tcp_gso_segs can overflow. Note that tcp_sendmsg() builds skbs with less than 64KB of payload, so this problem needs SACK to be enabled. SACK blocks allow TCP to coalesce multiple skbs in the retransmit queue, thus filling the 17 fragments to maximal capacity. CVE-2019-11477 -- u16 overflow of TCP_SKB_CB(skb)->tcp_gso_segs Backport notes, provided by Joao Martins <joao.m.martins@oracle.com> v4.15 or since commit 737ff314563 ("tcp: use sequence distance to detect reordering") had switched from the packet-based FACK tracking and switched to sequence-based. v4.14 and older still have the old logic and hence on tcp_skb_shift_data() needs to retain its original logic and have @fack_count in sync. In other words, we keep the increment of pcount with tcp_skb_pcount(skb) to later used that to update fack_count. To make it more explicit we track the new skb that gets incremented to pcount in @next_pcount, and we get to avoid the constant invocation of tcp_skb_pcount(skb) all together. Fixes: 832d11c5cd07 ("tcp: Try to restore large SKBs while SACK processing") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jonathan Looney <jtl@netflix.com> Acked-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Tyler Hicks <tyhicks@canonical.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Bruce Curtis <brucec@netflix.com> Cc: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-15Revert "Bluetooth: Align minimum encryption key size for LE and BR/EDR ↵Greg Kroah-Hartman
connections" This reverts commit 2fa7a155b25160696cd77cdd995536cf5e172e20 which is commit d5bb334a8e171b262e48f378bd2096c0ea458265 upstream. Lots of people have reported issues with this patch, and as there does not seem to be a fix going into Linus's kernel tree any time soon, revert the commit in the stable trees so as to get people's machines working properly again. Reported-by: Vasily Khoruzhick <anarsoul@gmail.com> Reported-by: Hans de Goede <hdegoede@redhat.com> Cc: Jeremy Cline <jeremy@jcline.org> Cc: Marcel Holtmann <marcel@holtmann.org> Cc: Johan Hedberg <johan.hedberg@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-11ipv4: Define __ipv4_neigh_lookup_noref when CONFIG_INET is disabledDavid Ahern
commit 9b3040a6aafd7898ece7fc7efcbca71e42aa8069 upstream. Define __ipv4_neigh_lookup_noref to return NULL when CONFIG_INET is disabled. Fixes: 4b2a2bfeb3f0 ("neighbor: Call __ipv4_neigh_lookup_noref in neigh_xmit") Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Nobuhiro Iwamatsu <nobuhiro1.iwamatsu@toshiba.co.jp> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-11ipv6: fix the check before getting the cookie in rt6_get_cookieXin Long
[ Upstream commit b7999b07726c16974ba9ca3bb9fe98ecbec5f81c ] In Jianlin's testing, netperf was broken with 'Connection reset by peer', as the cookie check failed in rt6_check() and ip6_dst_check() always returned NULL. It's caused by Commit 93531c674315 ("net/ipv6: separate handling of FIB entries from dst based routes"), where the cookie can be got only when 'c1'(see below) for setting dst_cookie whereas rt6_check() is called when !'c1' for checking dst_cookie, as we can see in ip6_dst_check(). Since in ip6_dst_check() both rt6_dst_from_check() (c1) and rt6_check() (!c1) will check the 'from' cookie, this patch is to remove the c1 check in rt6_get_cookie(), so that the dst_cookie can always be set properly. c1: (rt->rt6i_flags & RTF_PCPU || unlikely(!list_empty(&rt->rt6i_uncached))) Fixes: 93531c674315 ("net/ipv6: separate handling of FIB entries from dst based routes") Reported-by: Jianlin Shi <jishi@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-09inet: switch IP ID generator to siphashEric Dumazet
[ Upstream commit df453700e8d81b1bdafdf684365ee2b9431fb702 ] According to Amit Klein and Benny Pinkas, IP ID generation is too weak and might be used by attackers. Even with recent net_hash_mix() fix (netns: provide pure entropy for net_hash_mix()) having 64bit key and Jenkins hash is risky. It is time to switch to siphash and its 128bit keys. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Amit Klein <aksecurity@gmail.com> Reported-by: Benny Pinkas <benny@pinkas.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-16nfc: nci: Potential off by one in ->pipes[] arrayDan Carpenter
[ Upstream commit 6491d698396fd5da4941980a35ca7c162a672016 ] This is similar to commit e285d5bfb7e9 ("NFC: Fix the number of pipes") where we changed NFC_HCI_MAX_PIPES from 127 to 128. As the comment next to the define explains, the pipe identifier is 7 bits long. The highest possible pipe is 127, but the number of possible pipes is 128. As the code is now, then there is potential for an out of bounds array access: net/nfc/nci/hci.c:297 nci_hci_cmd_received() warn: array off by one? 'ndev->hci_dev->pipes[pipe]' '0-127 == 127' Fixes: 11f54f228643 ("NFC: nci: Add HCI over NCI protocol support") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
2019-05-16netfilter: ctnetlink: don't use conntrack/expect object addresses as idFlorian Westphal
[ Upstream commit 3c79107631db1f7fd32cf3f7368e4672004a3010 ] else, we leak the addresses to userspace via ctnetlink events and dumps. Compute an ID on demand based on the immutable parts of nf_conn struct. Another advantage compared to using an address is that there is no immediate re-use of the same ID in case the conntrack entry is freed and reallocated again immediately. Fixes: 3583240249ef ("[NETFILTER]: nf_conntrack_expect: kill unique ID") Fixes: 7f85f914721f ("[NETFILTER]: nf_conntrack: kill unique ID") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-10Bluetooth: Align minimum encryption key size for LE and BR/EDR connectionsMarcel Holtmann
commit d5bb334a8e171b262e48f378bd2096c0ea458265 upstream. The minimum encryption key size for LE connections is 56 bits and to align LE with BR/EDR, enforce 56 bits of minimum encryption key size for BR/EDR connections as well. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-08caif: reduce stack size with KASANArnd Bergmann
commit ce6289661b14a8b391d90db918c91b6d6da6540a upstream. When CONFIG_KASAN is set, we can use relatively large amounts of kernel stack space: net/caif/cfctrl.c:555:1: warning: the frame size of 1600 bytes is larger than 1280 bytes [-Wframe-larger-than=] This adds convenience wrappers around cfpkt_extr_head(), which is responsible for most of the stack growth. With those wrapper functions, gcc apparently starts reusing the stack slots for each instance, thus avoiding the problem. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-08sctp: avoid running the sctp state machine recursivelyXin Long
[ Upstream commit fbd019737d71e405f86549fd738f81e2ff3dd073 ] Ying triggered a call trace when doing an asconf testing: BUG: scheduling while atomic: swapper/12/0/0x10000100 Call Trace: <IRQ> [<ffffffffa4375904>] dump_stack+0x19/0x1b [<ffffffffa436fcaf>] __schedule_bug+0x64/0x72 [<ffffffffa437b93a>] __schedule+0x9ba/0xa00 [<ffffffffa3cd5326>] __cond_resched+0x26/0x30 [<ffffffffa437bc4a>] _cond_resched+0x3a/0x50 [<ffffffffa3e22be8>] kmem_cache_alloc_node+0x38/0x200 [<ffffffffa423512d>] __alloc_skb+0x5d/0x2d0 [<ffffffffc0995320>] sctp_packet_transmit+0x610/0xa20 [sctp] [<ffffffffc098510e>] sctp_outq_flush+0x2ce/0xc00 [sctp] [<ffffffffc098646c>] sctp_outq_uncork+0x1c/0x20 [sctp] [<ffffffffc0977338>] sctp_cmd_interpreter.isra.22+0xc8/0x1460 [sctp] [<ffffffffc0976ad1>] sctp_do_sm+0xe1/0x350 [sctp] [<ffffffffc099443d>] sctp_primitive_ASCONF+0x3d/0x50 [sctp] [<ffffffffc0977384>] sctp_cmd_interpreter.isra.22+0x114/0x1460 [sctp] [<ffffffffc0976ad1>] sctp_do_sm+0xe1/0x350 [sctp] [<ffffffffc097b3a4>] sctp_assoc_bh_rcv+0xf4/0x1b0 [sctp] [<ffffffffc09840f1>] sctp_inq_push+0x51/0x70 [sctp] [<ffffffffc099732b>] sctp_rcv+0xa8b/0xbd0 [sctp] As it shows, the first sctp_do_sm() running under atomic context (NET_RX softirq) invoked sctp_primitive_ASCONF() that uses GFP_KERNEL flag later, and this flag is supposed to be used in non-atomic context only. Besides, sctp_do_sm() was called recursively, which is not expected. Vlad tried to fix this recursive call in Commit c0786693404c ("sctp: Fix oops when sending queued ASCONF chunks") by introducing a new command SCTP_CMD_SEND_NEXT_ASCONF. But it didn't work as this command is still used in the first sctp_do_sm() call, and sctp_primitive_ASCONF() will be called in this command again. To avoid calling sctp_do_sm() recursively, we send the next queued ASCONF not by sctp_primitive_ASCONF(), but by sctp_sf_do_prm_asconf() in the 1st sctp_do_sm() directly. Reported-by: Ying Xu <yinxu@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-04net/sched: don't dereference a->goto_chain to read the chain indexDavide Caratti
[ Upstream commit fe384e2fa36ca084a456fd30558cccc75b4b3fbd ] callers of tcf_gact_goto_chain_index() can potentially read an old value of the chain index, or even dereference a NULL 'goto_chain' pointer, because 'goto_chain' and 'tcfa_action' are read in the traffic path without caring of concurrent write in the control path. The most recent value of chain index can be read also from a->tcfa_action (it's encoded there together with TC_ACT_GOTO_CHAIN bits), so we don't really need to dereference 'goto_chain': just read the chain id from the control action. Fixes: e457d86ada27 ("net: sched: add couple of goto_chain helpers") Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>
2019-04-27net: IP6 defrag: use rbtrees for IPv6 defragPeter Oskolkov
[ Upstream commit d4289fcc9b16b89619ee1c54f829e05e56de8b9a ] Currently, IPv6 defragmentation code drops non-last fragments that are smaller than 1280 bytes: see commit 0ed4229b08c1 ("ipv6: defrag: drop non-last frags smaller than min mtu") This behavior is not specified in IPv6 RFCs and appears to break compatibility with some IPv6 implemenations, as reported here: https://www.spinics.net/lists/netdev/msg543846.html This patch re-uses common IP defragmentation queueing and reassembly code in IPv6, removing the 1280 byte restriction. v2: change handling of overlaps to match that of upstream. Signed-off-by: Peter Oskolkov <posk@google.com> Reported-by: Tom Herbert <tom@herbertland.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-04-27ipv6: remove dependency of nf_defrag_ipv6 on ipv6 moduleFlorian Westphal
[ Upstream commit 70b095c84326640eeacfd69a411db8fc36e8ab1a ] IPV6=m DEFRAG_IPV6=m CONNTRACK=y yields: net/netfilter/nf_conntrack_proto.o: In function `nf_ct_netns_do_get': net/netfilter/nf_conntrack_proto.c:802: undefined reference to `nf_defrag_ipv6_enable' net/netfilter/nf_conntrack_proto.o:(.rodata+0x640): undefined reference to `nf_conntrack_l4proto_icmpv6' Setting DEFRAG_IPV6=y causes undefined references to ip6_rhash_params ip6_frag_init and ip6_expire_frag_queue so it would be needed to force IPV6=y too. This patch gets rid of the 'followup linker error' by removing the dependency of ipv6.ko symbols from netfilter ipv6 defrag. Shared code is placed into a header, then used from both. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-04-27net: IP defrag: encapsulate rbtree defrag code into callable functionsPeter Oskolkov
[ Upstream commit c23f35d19db3b36ffb9e04b08f1d91565d15f84f ] This is a refactoring patch: without changing runtime behavior, it moves rbtree-related code from IPv4-specific files/functions into .h/.c defrag files shared with IPv6 defragmentation code. v2: make handling of overlapping packets match upstream. Signed-off-by: Peter Oskolkov <posk@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Florian Westphal <fw@strlen.de> Cc: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-04-17netns: provide pure entropy for net_hash_mix()Eric Dumazet
[ Upstream commit 355b98553789b646ed97ad801a619ff898471b92 ] net_hash_mix() currently uses kernel address of a struct net, and is used in many places that could be used to reveal this address to a patient attacker, thus defeating KASLR, for the typical case (initial net namespace, &init_net is not dynamically allocated) I believe the original implementation tried to avoid spending too many cycles in this function, but security comes first. Also provide entropy regardless of CONFIG_NET_NS. Fixes: 0b4419162aa6 ("netns: introduce the net_hash_mix "salt" for hashes") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Amit Klein <aksecurity@gmail.com> Reported-by: Benny Pinkas <benny@pinkas.net> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-17vrf: check accept_source_route on the original netdeviceStephen Suryaputra
[ Upstream commit 8c83f2df9c6578ea4c5b940d8238ad8a41b87e9e ] Configuration check to accept source route IP options should be made on the incoming netdevice when the skb->dev is an l3mdev master. The route lookup for the source route next hop also needs the incoming netdev. v2->v3: - Simplify by passing the original netdevice down the stack (per David Ahern). Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-05netfilter: physdev: relax br_netfilter dependencyFlorian Westphal
[ Upstream commit 8e2f311a68494a6677c1724bdcb10bada21af37c ] Following command: iptables -D FORWARD -m physdev ... causes connectivity loss in some setups. Reason is that iptables userspace will probe kernel for the module revision of the physdev patch, and physdev has an artificial dependency on br_netfilter (xt_physdev use makes no sense unless a br_netfilter module is loaded). This causes the "phydev" module to be loaded, which in turn enables the "call-iptables" infrastructure. bridged packets might then get dropped by the iptables ruleset. The better fix would be to change the "call-iptables" defaults to 0 and enforce explicit setting to 1, but that breaks backwards compatibility. This does the next best thing: add a request_module call to checkentry. This was a stray '-D ... -m physdev' won't activate br_netfilter anymore. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-04-03sctp: get sctphdr by offset in sctp_compute_cksumXin Long
[ Upstream commit 273160ffc6b993c7c91627f5a84799c66dfe4dee ] sctp_hdr(skb) only works when skb->transport_header is set properly. But in Netfilter, skb->transport_header for ipv6 is not guaranteed to be right value for sctphdr. It would cause to fail to check the checksum for sctp packets. So fix it by using offset, which is always right in all places. v1->v2: - Fix the changelog. Fixes: e6d8b64b34aa ("net: sctp: fix and consolidate SCTP checksumming code") Reported-by: Li Shuang <shuali@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-03packets: Always register packet sk in the same orderMaxime Chevallier
[ Upstream commit a4dc6a49156b1f8d6e17251ffda17c9e6a5db78a ] When using fanouts with AF_PACKET, the demux functions such as fanout_demux_cpu will return an index in the fanout socket array, which corresponds to the selected socket. The ordering of this array depends on the order the sockets were added to a given fanout group, so for FANOUT_CPU this means sockets are bound to cpus in the order they are configured, which is OK. However, when stopping then restarting the interface these sockets are bound to, the sockets are reassigned to the fanout group in the reverse order, due to the fact that they were inserted at the head of the interface's AF_PACKET socket list. This means that traffic that was directed to the first socket in the fanout group is now directed to the last one after an interface restart. In the case of FANOUT_CPU, traffic from CPU0 will be directed to the socket that used to receive traffic from the last CPU after an interface restart. This commit introduces a helper to add a socket at the tail of a list, then uses it to register AF_PACKET sockets. Note that this changes the order in which sockets are listed in /proc and with sock_diag. Fixes: dc99f600698d ("packet: Add fanout support") Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-23phonet: fix building with clangArnd Bergmann
[ Upstream commit 6321aa197547da397753757bd84c6ce64b3e3d89 ] clang warns about overflowing the data[] member in the struct pnpipehdr: net/phonet/pep.c:295:8: warning: array index 4 is past the end of the array (which contains 1 element) [-Warray-bounds] if (hdr->data[4] == PEP_IND_READY) ^ ~ include/net/phonet/pep.h:66:3: note: array 'data' declared here u8 data[1]; Using a flexible array member at the end of the struct avoids the warning, but since we cannot have a flexible array member inside of the union, each index now has to be moved back by one, which makes it a little uglier. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Rémi Denis-Courmont <remi@remlab.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-13Bluetooth: Fix locking in bt_accept_enqueue() for BH contextMatthias Kaehlcke
commit c4f5627f7eeecde1bb6b646d8c0907b96dc2b2a6 upstream. With commit e16337622016 ("Bluetooth: Handle bt_accept_enqueue() socket atomically") lock_sock[_nested]() is used to acquire the socket lock before manipulating the socket. lock_sock[_nested]() may block, which is problematic since bt_accept_enqueue() can be called in bottom half context (e.g. from rfcomm_connect_ind()): [<ffffff80080d81ec>] __might_sleep+0x4c/0x80 [<ffffff800876c7b0>] lock_sock_nested+0x24/0x58 [<ffffff8000d7c27c>] bt_accept_enqueue+0x48/0xd4 [bluetooth] [<ffffff8000e67d8c>] rfcomm_connect_ind+0x190/0x218 [rfcomm] Add a parameter to bt_accept_enqueue() to indicate whether the function is called from BH context, and acquire the socket lock with bh_lock_sock_nested() if that's the case. Also adapt all callers of bt_accept_enqueue() to pass the new parameter: - l2cap_sock_new_connection_cb() - uses lock_sock() to lock the parent socket => process context - rfcomm_connect_ind() - acquires the parent socket lock with bh_lock_sock() => BH context - __sco_chan_add() - called from sco_chan_add(), which is called from sco_connect(). parent is NULL, hence bt_accept_enqueue() isn't called in this code path and we can ignore it - also called from sco_conn_ready(). uses bh_lock_sock() to acquire the parent lock => BH context Fixes: e16337622016 ("Bluetooth: Handle bt_accept_enqueue() socket atomically") Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Reviewed-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-13net: avoid use IPCB in cipso_v4_errorNazarov Sergey
[ Upstream commit 3da1ed7ac398f34fff1694017a07054d69c5f5c5 ] Extract IP options in cipso_v4_error and use __icmp_send. Signed-off-by: Sergey Nazarov <s-nazarov@yandex.ru> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-13net: Add __icmp_send helper.Nazarov Sergey
[ Upstream commit 9ef6b42ad6fd7929dd1b6092cb02014e382c6a91 ] Add __icmp_send function having ip_options struct parameter Signed-off-by: Sergey Nazarov <s-nazarov@yandex.ru> Reviewed-by: Paul Moore <paul@paul-moore.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-23ax25: fix possible use-after-freeEric Dumazet
commit 63530aba7826a0f8e129874df9c4d264f9db3f9e upstream. syzbot found that ax25 routes where not properly protected against concurrent use [1]. In this particular report the bug happened while copying ax25->digipeat. Fix this problem by making sure we call ax25_get_route() while ax25_route_lock is held, so that no modification could happen while using the route. The current two ax25_get_route() callers do not sleep, so this change should be fine. Once we do that, ax25_get_route() no longer needs to grab a reference on the found route. [1] ax25_connect(): syz-executor0 uses autobind, please contact jreuter@yaina.de BUG: KASAN: use-after-free in memcpy include/linux/string.h:352 [inline] BUG: KASAN: use-after-free in kmemdup+0x42/0x60 mm/util.c:113 Read of size 66 at addr ffff888066641a80 by task syz-executor2/531 ax25_connect(): syz-executor0 uses autobind, please contact jreuter@yaina.de CPU: 1 PID: 531 Comm: syz-executor2 Not tainted 5.0.0-rc2+ #10 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1db/0x2d0 lib/dump_stack.c:113 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187 kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317 check_memory_region_inline mm/kasan/generic.c:185 [inline] check_memory_region+0x123/0x190 mm/kasan/generic.c:191 memcpy+0x24/0x50 mm/kasan/common.c:130 memcpy include/linux/string.h:352 [inline] kmemdup+0x42/0x60 mm/util.c:113 kmemdup include/linux/string.h:425 [inline] ax25_rt_autobind+0x25d/0x750 net/ax25/ax25_route.c:424 ax25_connect.cold+0x30/0xa4 net/ax25/af_ax25.c:1224 __sys_connect+0x357/0x490 net/socket.c:1664 __do_sys_connect net/socket.c:1675 [inline] __se_sys_connect net/socket.c:1672 [inline] __x64_sys_connect+0x73/0xb0 net/socket.c:1672 do_syscall_64+0x1a3/0x800 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x458099 Code: 6d b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 3b b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f870ee22c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002a RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000458099 RDX: 0000000000000048 RSI: 0000000020000080 RDI: 0000000000000005 RBP: 000000000073bf00 R08: 0000000000000000 R09: 0000000000000000 ax25_connect(): syz-executor4 uses autobind, please contact jreuter@yaina.de R10: 0000000000000000 R11: 0000000000000246 R12: 00007f870ee236d4 R13: 00000000004be48e R14: 00000000004ce9a8 R15: 00000000ffffffff Allocated by task 526: save_stack+0x45/0xd0 mm/kasan/common.c:73 set_track mm/kasan/common.c:85 [inline] __kasan_kmalloc mm/kasan/common.c:496 [inline] __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:469 kasan_kmalloc+0x9/0x10 mm/kasan/common.c:504 ax25_connect(): syz-executor5 uses autobind, please contact jreuter@yaina.de kmem_cache_alloc_trace+0x151/0x760 mm/slab.c:3609 kmalloc include/linux/slab.h:545 [inline] ax25_rt_add net/ax25/ax25_route.c:95 [inline] ax25_rt_ioctl+0x3b9/0x1270 net/ax25/ax25_route.c:233 ax25_ioctl+0x322/0x10b0 net/ax25/af_ax25.c:1763 sock_do_ioctl+0xe2/0x400 net/socket.c:950 sock_ioctl+0x32f/0x6c0 net/socket.c:1074 vfs_ioctl fs/ioctl.c:46 [inline] file_ioctl fs/ioctl.c:509 [inline] do_vfs_ioctl+0x107b/0x17d0 fs/ioctl.c:696 ksys_ioctl+0xab/0xd0 fs/ioctl.c:713 __do_sys_ioctl fs/ioctl.c:720 [inline] __se_sys_ioctl fs/ioctl.c:718 [inline] __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718 do_syscall_64+0x1a3/0x800 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe ax25_connect(): syz-executor5 uses autobind, please contact jreuter@yaina.de Freed by task 550: save_stack+0x45/0xd0 mm/kasan/common.c:73 set_track mm/kasan/common.c:85 [inline] __kasan_slab_free+0x102/0x150 mm/kasan/common.c:458 kasan_slab_free+0xe/0x10 mm/kasan/common.c:466 __cache_free mm/slab.c:3487 [inline] kfree+0xcf/0x230 mm/slab.c:3806 ax25_rt_add net/ax25/ax25_route.c:92 [inline] ax25_rt_ioctl+0x304/0x1270 net/ax25/ax25_route.c:233 ax25_ioctl+0x322/0x10b0 net/ax25/af_ax25.c:1763 sock_do_ioctl+0xe2/0x400 net/socket.c:950 sock_ioctl+0x32f/0x6c0 net/socket.c:1074 vfs_ioctl fs/ioctl.c:46 [inline] file_ioctl fs/ioctl.c:509 [inline] do_vfs_ioctl+0x107b/0x17d0 fs/ioctl.c:696 ksys_ioctl+0xab/0xd0 fs/ioctl.c:713 __do_sys_ioctl fs/ioctl.c:720 [inline] __se_sys_ioctl fs/ioctl.c:718 [inline] __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718 do_syscall_64+0x1a3/0x800 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe The buggy address belongs to the object at ffff888066641a80 which belongs to the cache kmalloc-96 of size 96 The buggy address is located 0 bytes inside of 96-byte region [ffff888066641a80, ffff888066641ae0) The buggy address belongs to the page: page:ffffea0001999040 count:1 mapcount:0 mapping:ffff88812c3f04c0 index:0x0 flags: 0x1fffc0000000200(slab) ax25_connect(): syz-executor4 uses autobind, please contact jreuter@yaina.de raw: 01fffc0000000200 ffffea0001817948 ffffea0002341dc8 ffff88812c3f04c0 raw: 0000000000000000 ffff888066641000 0000000100000020 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888066641980: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc ffff888066641a00: 00 00 00 00 00 00 00 00 02 fc fc fc fc fc fc fc >ffff888066641a80: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc ^ ffff888066641b00: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc ffff888066641b80: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ralf Baechle <ralf@linux-mips.org> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-23tcp: clear icsk_backoff in tcp_write_queue_purge()Eric Dumazet
[ Upstream commit 04c03114be82194d4a4858d41dba8e286ad1787c ] soukjin bae reported a crash in tcp_v4_err() handling ICMP_DEST_UNREACH after tcp_write_queue_head(sk) returned a NULL pointer. Current logic should have prevented this : if (seq != tp->snd_una || !icsk->icsk_retransmits || !icsk->icsk_backoff || fastopen) break; Problem is the write queue might have been purged and icsk_backoff has not been cleared. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: soukjin bae <soukjin.bae@samsung.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-23net: ipv4: use a dedicated counter for icmp_v4 redirect packetsLorenzo Bianconi
[ Upstream commit c09551c6ff7fe16a79a42133bcecba5fc2fc3291 ] According to the algorithm described in the comment block at the beginning of ip_rt_send_redirect, the host should try to send 'ip_rt_redirect_number' ICMP redirect packets with an exponential backoff and then stop sending them at all assuming that the destination ignores redirects. If the device has previously sent some ICMP error packets that are rate-limited (e.g TTL expired) and continues to receive traffic, the redirect packets will never be transmitted. This happens since peer->rate_tokens will be typically greater than 'ip_rt_redirect_number' and so it will never be reset even if the redirect silence timeout (ip_rt_redirect_silence) has elapsed without receiving any packet requiring redirects. Fix it by using a dedicated counter for the number of ICMP redirect packets that has been sent by the host I have not been able to identify a given commit that introduced the issue since ip_rt_send_redirect implements the same rate-limiting algorithm from commit 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-02-06ipvlan, l3mdev: fix broken l3s mode wrt local routesDaniel Borkmann
[ Upstream commit d5256083f62e2720f75bb3c5a928a0afe47d6bc3 ] While implementing ipvlan l3 and l3s mode for kubernetes CNI plugin, I ran into the issue that while l3 mode is working fine, l3s mode does not have any connectivity to kube-apiserver and hence all pods end up in Error state as well. The ipvlan master device sits on top of a bond device and hostns traffic to kube-apiserver (also running in hostns) is DNATed from 10.152.183.1:443 to 139.178.29.207:37573 where the latter is the address of the bond0. While in l3 mode, a curl to https://10.152.183.1:443 or to https://139.178.29.207:37573 works fine from hostns, neither of them do in case of l3s. In the latter only a curl to https://127.0.0.1:37573 appeared to work where for local addresses of bond0 I saw kernel suddenly starting to emit ARP requests to query HW address of bond0 which remained unanswered and neighbor entries in INCOMPLETE state. These ARP requests only happen while in l3s. Debugging this further, I found the issue is that l3s mode is piggy- backing on l3 master device, and in this case local routes are using l3mdev_master_dev_rcu(dev) instead of net->loopback_dev as per commit f5a0aab84b74 ("net: ipv4: dst for local input routes should use l3mdev if relevant") and 5f02ce24c269 ("net: l3mdev: Allow the l3mdev to be a loopback"). I found that reverting them back into using the net->loopback_dev fixed ipvlan l3s connectivity and got everything working for the CNI. Now judging from 4fbae7d83c98 ("ipvlan: Introduce l3s mode") and the l3mdev paper in [0] the only sole reason why ipvlan l3s is relying on l3 master device is to get the l3mdev_ip_rcv() receive hook for setting the dst entry of the input route without adding its own ipvlan specific hacks into the receive path, however, any l3 domain semantics beyond just that are breaking l3s operation. Note that ipvlan also has the ability to dynamically switch its internal operation from l3 to l3s for all ports via ipvlan_set_port_mode() at runtime. In any case, l3 vs l3s soley distinguishes itself by 'de-confusing' netfilter through switching skb->dev to ipvlan slave device late in NF_INET_LOCAL_IN before handing the skb to L4. Minimal fix taken here is to add a IFF_L3MDEV_RX_HANDLER flag which, if set from ipvlan setup, gets us only the wanted l3mdev_l3_rcv() hook without any additional l3mdev semantics on top. This should also have minimal impact since dev->priv_flags is already hot in cache. With this set, l3s mode is working fine and I also get things like masquerading pod traffic on the ipvlan master properly working. [0] https://netdevconf.org/1.2/papers/ahern-what-is-l3mdev-paper.pdf Fixes: f5a0aab84b74 ("net: ipv4: dst for local input routes should use l3mdev if relevant") Fixes: 5f02ce24c269 ("net: l3mdev: Allow the l3mdev to be a loopback") Fixes: 4fbae7d83c98 ("ipvlan: Introduce l3s mode") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Mahesh Bandewar <maheshb@google.com> Cc: David Ahern <dsa@cumulusnetworks.com> Cc: Florian Westphal <fw@strlen.de> Cc: Martynas Pumputis <m@lambda.lt> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>