android_kernel_msm-6.1_noth.../include
John Fastabend 9f4d7efb33 bpf, sockmap: Convert schedule_work into delayed_work
[ Upstream commit 29173d07f79883ac94f5570294f98af3d4287382 ]

Sk_buffs are fed into sockmap verdict programs either from a strparser
(when the user might want to decide how framing of skb is done by attaching
another parser program) or directly through tcp_read_sock. The
tcp_read_sock is the preferred method for performance when the BPF logic is
a stream parser.

The flow for Cilium's common use case with a stream parser is,

 tcp_read_sock()
  sk_psock_verdict_recv
    ret = bpf_prog_run_pin_on_cpu()
    sk_psock_verdict_apply(sock, skb, ret)
     // if system is under memory pressure or app is slow we may
     // need to queue skb. Do this queuing through ingress_skb and
     // then kick timer to wake up handler
     skb_queue_tail(ingress_skb, skb)
     schedule_work(work);

The work queue is wired up to sk_psock_backlog(). This will then walk the
ingress_skb skb list that holds our sk_buffs that could not be handled,
but should be OK to run at some later point. However, its possible that
the workqueue doing this work still hits an error when sending the skb.
When this happens the skbuff is requeued on a temporary 'state' struct
kept with the workqueue. This is necessary because its possible to
partially send an skbuff before hitting an error and we need to know how
and where to restart when the workqueue runs next.

Now for the trouble, we don't rekick the workqueue. This can cause a
stall where the skbuff we just cached on the state variable might never
be sent. This happens when its the last packet in a flow and no further
packets come along that would cause the system to kick the workqueue from
that side.

To fix we could do simple schedule_work(), but while under memory pressure
it makes sense to back off some instead of continue to retry repeatedly. So
instead to fix convert schedule_work to schedule_delayed_work and add
backoff logic to reschedule from backlog queue on errors. Its not obvious
though what a good backoff is so use '1'.

To test we observed some flakes whil running NGINX compliance test with
sockmap we attributed these failed test to this bug and subsequent issue.

>From on list discussion. This commit

 bec217197b41("skmsg: Schedule psock work if the cached skb exists on the psock")

was intended to address similar race, but had a couple cases it missed.
Most obvious it only accounted for receiving traffic on the local socket
so if redirecting into another socket we could still get an sk_buff stuck
here. Next it missed the case where copied=0 in the recv() handler and
then we wouldn't kick the scheduler. Also its sub-optimal to require
userspace to kick the internal mechanisms of sockmap to wake it up and
copy data to user. It results in an extra syscall and requires the app
to actual handle the EAGAIN correctly.

Fixes: 04919bed94 ("tcp: Introduce tcp_read_skb()")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: William Findlay <will@isovalent.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-3-john.fastabend@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-05 09:26:18 +02:00
..
acpi ACPI: video: Add auto_detect arg to __acpi_video_get_backlight_type() 2023-04-13 16:55:33 +02:00
asm-generic asm-generic/io.h: suppress endianness warnings for readq() and writeq() 2023-05-11 23:02:58 +09:00
clocksource
crypto crypto: api - Add scaffolding to change completion function signature 2023-05-17 11:53:40 +02:00
drm drm: fix drmm_mutex_init() 2023-05-30 14:03:20 +01:00
dt-bindings dt-bindings: clocks: imx8mp: Add ID for usb suspend clock 2022-12-31 13:33:09 +01:00
keys
kunit kunit: fix kunit_test_init_section_suites(...) 2023-02-09 11:28:08 +01:00
kvm KVM: arm64: PMU: Align chained counter implementation with architecture pseudocode 2023-04-13 16:55:17 +02:00
linux bpf, sockmap: Convert schedule_work into delayed_work 2023-06-05 09:26:18 +02:00
math-emu
media media: uvcvideo: Add GUID for BGRA/X 8:8:8:8 2023-03-11 13:55:35 +01:00
memory memory: renesas-rpc-if: Split-off private data from struct rpcif 2023-03-11 13:55:17 +01:00
misc
net tls: rx: strp: preserve decryption status of skbs when needed 2023-06-05 09:26:18 +02:00
pcmcia
ras
rdma
rv
scsi scsi: libsas: Add sas_ata_device_link_abort() 2023-05-11 23:03:20 +09:00
soc ARM: at91: pm: avoid soft resetting AC DLL 2022-11-01 12:25:19 +02:00
sound ASoC: amd: fix ACP version typo mistake 2023-05-11 23:02:59 +09:00
target scsi: target: Fix multiple LUN_RESET handling 2023-05-11 23:03:19 +09:00
trace f2fs: refactor extent_cache to support for read and more 2023-05-17 11:53:52 +02:00
uapi ipv{4,6}/raw: fix output xfrm lookup wrt protocol 2023-06-05 09:26:16 +02:00
ufs scsi: ufs: exynos: Fix DMA alignment for PAGE_SIZE != 4096 2023-03-10 09:33:15 +01:00
vdso
video
xen ACPI: processor: Fix evaluating _PDC method when running as Xen dom0 2023-05-11 23:03:11 +09:00