A bug which may cause an assertion error in vnet/session
Chen Weihao
Hi vpp-team,
I'm new to VPP and I'm trying to run Redis 6.0.18 in VCL with LD_PRELOAD using VPP 22.10 and VPP 23.02. On VPP 23.02, an assert fails frequently; after checking, I found that it is the assert in the session_get function in vnet/session/session.h, and the cause is an invalid session index with a value of -1 (or ~0). session_get is called by session_half_open_migrate_notify in vnet/session/session.c, which is called by ct_accept_one in vnet/session/application_local.c. ct_accept_one runs because of an accept RPC request, issued from ct_connect in vnet/session/application_local.c and handled by session_event_dispatch_ctrl. ct_connect allocates and initializes a half-open transport object, but its c_s_index value is -1 (or ~0), i.e., no session has been allocated yet: the session is only allocated by session_alloc_for_half_open in session_open_vc (the caller of ct_connect, located in vnet/session/session.c). Therefore, I think the assertion failure is a case where ct_accept_one accesses the half-open tc before a session has been allocated for it.
This problem does not exist on VPP 22.10. I checked the patches between 22.10 and 23.02 and found "session: move connects to first worker" (https://gerrit.fd.io/r/c/vpp/+/35713), which might be related, but I can't say for certain and I don't know how to fix it. I would be very grateful if you could address this issue.
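To make the failing check concrete, below is a minimal, self-contained sketch of the pattern I believe is failing. The function names mirror vnet/session, but the pool check and types are simplified stand-ins, not the real VPP implementation:

/* cc sketch.c && ./a.out -- aborts on the assert */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define SESSION_INVALID_INDEX ((uint32_t) ~0)
#define N_SESSIONS 16 /* pretend session pool size */

typedef struct { uint32_t s_index; } transport_connection_t;

static int
session_is_valid (uint32_t si)
{
  /* stand-in for ASSERT (!pool_is_free (wrk->sessions, ...)) */
  return si < N_SESSIONS;
}

static void
session_half_open_migrate_notify (transport_connection_t *tc)
{
  /* ho_session_get () -> session_get () asserts on the index */
  assert (session_is_valid (tc->s_index));
  printf ("migrated half-open session %u\n", tc->s_index);
}

int
main (void)
{
  /* ct_connect () allocates the half-open tc with no session yet... */
  transport_connection_t ho = { .s_index = SESSION_INVALID_INDEX };

  /* ...so if ct_accept_one () runs before session_open_vc () calls
     session_alloc_for_half_open (), the assert fires with an invalid
     index (si=4294967295, as in the gdb backtrace later in this thread) */
  session_half_open_migrate_notify (&ho);
  return 0;
}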
Florin Coras
Hi,
That may very well be a problem introduced by the move of connects to first worker. Unfortunately, we don't have tests for all of those corner cases yet. However, to replicate this issue, could you provide a bit more detail about your setup and the exact backtrace? It looks like you're leveraging cut-through sessions, so the server and client are attached to the same vpp instance? Also, could you try vpp latest to check if the issue still persists? Regards, Florin
Florin Coras
I just tried iperf3 in cut-through mode, i.e., server and client attached to the same vpp instance running 4 workers, with 128 connections, and this seems to be working fine.
Could you try that out and see if it’s also working for you? It might be that this is something specific to how Redis uses sockets, so to reproduce we’ll need to replicate your testbed. Regards, Florin
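A cut-through iperf3 run of that shape over LDP might look like the following sketch; VPP_BUILD and the vcl.conf path are placeholders to adjust for your build, in the same style as the redis commands later in this thread:

# server and client attached to the same vpp instance
VCL_LDP=$VPP_BUILD/install-vpp-native/vpp/lib/libvcl_ldpreload.so
sudo bash -c "VCL_CONFIG=vcl.conf LD_PRELOAD=$VCL_LDP iperf3 -s" &
sudo bash -c "VCL_CONFIG=vcl.conf LD_PRELOAD=$VCL_LDP iperf3 -c 127.0.0.1 -P 128"   # 128 parallel connections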
Chen Weihao
Thank you for your reply.
This is the stack trace captured by gdb:

2: /home/chenweihao/vpp_dev/src/vnet/session/session.c:233 (session_is_valid) assertion `! pool_is_free (session_main.wrk[thread_index].sessions, _e)' fails
Thread 4 "vpp_wk_1" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffa93f5640 (LWP 4575)]
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140736032888384) at ./nptl/pthread_kill.c:44
44 ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140736032888384)
at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=6, threadid=140736032888384)
at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=140736032888384, signo=signo@entry=6)
at ./nptl/pthread_kill.c:89
#3 0x00007ffff6a42476 in __GI_raise (sig=sig@entry=6)
at ../sysdeps/posix/raise.c:26
#4 0x00007ffff6a287f3 in __GI_abort () at ./stdlib/abort.c:79
#5 0x000055555555b073 in os_panic ()
at /home/chenweihao/vpp_dev/src/vpp/vnet/main.c:417
#6 0x00007ffff6f0a699 in debugger ()
at /home/chenweihao/vpp_dev/src/vppinfra/error.c:84
#7 0x00007ffff6f0a450 in _clib_error (how_to_die=2, function_name=0x0,
line_number=0, fmt=0x7ffff7b88208 "%s:%d (%s) assertion `%s' fails")
at /home/chenweihao/vpp_dev/src/vppinfra/error.c:143
#8 0x00007ffff75f31c7 in session_is_valid (si=4294967295,
thread_index=1 '\001')
at /home/chenweihao/vpp_dev/src/vnet/session/session.c:233
#9 0x00007ffff75f177c in session_get (si=4294967295, thread_index=1)
at /home/chenweihao/vpp_dev/src/vnet/session/session.h:373
#10 0x00007ffff75f3770 in ho_session_get (ho_index=4294967295)
at /home/chenweihao/vpp_dev/src/vnet/session/session.h:689
#11 0x00007ffff75f37d8 in session_half_open_migrate_notify (tc=0x7fffbdba00c0)
at /home/chenweihao/vpp_dev/src/vnet/session/session.c:357
#12 0x00007ffff7648600 in ct_accept_one (thread_index=2, ho_index=2)
at /home/chenweihao/vpp_dev/src/vnet/session/application_local.c:669
#13 0x00007ffff7648243 in ct_accept_rpc_wrk_handler (rpc_args=0x2)
at /home/chenweihao/vpp_dev/src/vnet/session/application_local.c:760
#14 0x00007ffff7620b42 in session_event_dispatch_ctrl (wrk=0x7fffbdb6fe00,
elt=0x7fffbdccaf2c)
at /home/chenweihao/vpp_dev/src/vnet/session/session_node.c:1656
#15 0x00007ffff76175c2 in session_queue_node_fn (vm=0x7fffb82da700,
node=0x7fffbdbc21c0, frame=0x0)
at /home/chenweihao/vpp_dev/src/vnet/session/session_node.c:1962
#16 0x00007ffff7ea3a62 in dispatch_node (vm=0x7fffb82da700,
node=0x7fffbdbc21c0, type=VLIB_NODE_TYPE_INPUT,
dispatch_state=VLIB_NODE_STATE_POLLING, frame=0x0,
last_time_stamp=2911620275652)
at /home/chenweihao/vpp_dev/src/vlib/main.c:960
#17 0x00007ffff7e9f7d1 in vlib_main_or_worker_loop (vm=0x7fffb82da700,
is_main=0) at /home/chenweihao/vpp_dev/src/vlib/main.c:1557
#18 0x00007ffff7e9f1c7 in vlib_worker_loop (vm=0x7fffb82da700)
at /home/chenweihao/vpp_dev/src/vlib/main.c:1722
#19 0x00007ffff7edb020 in vlib_worker_thread_fn (arg=0x7fffb9182980)
at /home/chenweihao/vpp_dev/src/vlib/threads.c:1598
#20 0x00007ffff7ed62b6 in vlib_worker_thread_bootstrap_fn (arg=0x7fffb9182980)
at /home/chenweihao/vpp_dev/src/vlib/threads.c:418
#21 0x00007ffff6a94b43 in start_thread (arg=<optimized out>)
at ./nptl/pthread_create.c:442
#22 0x00007ffff6b26a00 in clone3 ()
at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

The reason the assertion failed is that the c_s_index value is -1 (or ~0).

The Redis version I used:

git clone https://github.com/redis/redis.git
cd redis
git checkout 6.0
make

My startup.conf:

unix {
nodaemon
interactive
full-coredump
cli-listen /run/vpp/cli.sock
gid vpp
startup-config /home/chenweihao/startup.txt
}
api-trace {
on
}
cpu {
main-core 0
workers 4
}
tcp {
cc-algo cubic
}
dpdk {
uio-driver vfio-pci
dev 0000:03:00.0 {
name eth1
num-rx-queues 2
num-tx-queues 2
}
}
buffers {
buffers-per-numa 131072
}
session { evt_qs_memfd_seg }
I tried iperf3 as you said; indeed, I did not encounter this problem. My startup.txt:

set interface ip address eth1 192.168.0.2/24
set interface state eth1 up
set interface rx-mode eth1 polling
create loopback interface
set interface ip address loop0 127.0.0.1/8
set interface state loop0 up
My vcl.conf:

vcl {
rx-fifo-size 40000000
tx-fifo-size 40000000
app-scope-local
app-scope-global
api-socket-name /run/vpp/api.sock
use-mq-eventfd
}
Chen Weihao
Thanks for your reply.
I gave a more detailed backtrace and config in https://lists.fd.io/g/vpp-dev/message/22731. My installation method is to clone vpp from GitHub and make build on Ubuntu 22.04 (kernel version 5.19); I use make run for testing and make debug for debugging. Yes, I tried to attach the server and client to the same vpp instance. I tried the latest version of vpp from GitHub yesterday, and the problem still exists. I am looking forward to your reply.
Florin Coras
Hi,
First of all, could you try this [1] with latest vpp? It's really interesting that iperf does not exhibit this issue. Regarding your config, some observations:

- I see you have configured 4 workers. I would then recommend using 4 rx-queues and 5 tx-queues (main can send packets), as opposed to 2.
- tcp defaults to cubic, so that config can be omitted.
- evt_qs_memfd_seg is now deprecated, so it can be omitted as well.
- Any particular reason for "set interface rx-mode eth1 polling"? dpdk interfaces are in polling mode by default.
- You're using the binary api socket "api-socket-name /run/vpp/api.sock". That works, but going forward we'll slowly deprecate that api, so I'd recommend using the app socket api. See for instance [2] for the changes needed to the session stanza and vcl.

Regards, Florin
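Applied to the startup.conf and vcl.conf quoted earlier, those suggestions might look like the sketch below; the option names match recent VPP, but the app socket path is the usual default and may need adjusting for your setup:

dpdk {
  uio-driver vfio-pci
  dev 0000:03:00.0 {
    name eth1
    num-rx-queues 4   # one per worker
    num-tx-queues 5   # 4 workers + main, since main can send packets
  }
}
session { use-app-socket-api }   # instead of the deprecated evt_qs_memfd_seg

And in vcl.conf, instead of api-socket-name:

vcl {
  app-socket-api /var/run/vpp/app_ns_sockets/default
}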
Chen Weihao
Thank you for your reply.
I tried to change num-tx-queues from 2 to 5, but it caused a SIGSEGV. The backtrace is:

#0 0x00007fffb453ff89 in rte_write32_relaxed (addr=0x80007ffffef0, value=0)
at ../src-dpdk/lib/eal/include/generic/rte_io.h:310
#1 rte_write32 (addr=0x80007ffffef0, value=0)
at ../src-dpdk/lib/eal/include/generic/rte_io.h:373
#2 vmxnet3_enable_intr (hw=0xac03b0600, intr_idx=4294967262)
at ../src-dpdk/drivers/net/vmxnet3/vmxnet3_ethdev.c:210
#3 0x00007fffb4544d35 in vmxnet3_dev_rx_queue_intr_enable (
dev=0x7fffb5186980 <rte_eth_devices>, queue_id=0)
at ../src-dpdk/drivers/net/vmxnet3/vmxnet3_ethdev.c:1815
#4 0x00007fffaff4bbf2 in rte_eth_dev_rx_intr_enable (port_id=0, queue_id=0)
at ../src-dpdk/lib/ethdev/rte_ethdev.c:4740
#5 0x00007fffb49f4564 in dpdk_setup_interrupts (xd=0x7fffbdbb2940)
at /home/chenweihao/vpp_dev/src/plugins/dpdk/device/common.c:336
#6 0x00007fffb49f4430 in dpdk_device_start (xd=0x7fffbdbb2940)
at /home/chenweihao/vpp_dev/src/plugins/dpdk/device/common.c:411
#7 0x00007fffb49ff713 in dpdk_interface_admin_up_down (
vnm=0x7ffff7e2b828 <vnet_main>, hw_if_index=1, flags=1)
at /home/chenweihao/vpp_dev/src/plugins/dpdk/device/device.c:476
#8 0x00007ffff70d60e8 in vnet_sw_interface_set_flags_helper (
vnm=0x7ffff7e2b828 <vnet_main>, sw_if_index=1,
flags=VNET_SW_INTERFACE_FLAG_ADMIN_UP, helper_flags=0)
at /home/chenweihao/vpp_dev/src/vnet/interface.c:470
#9 0x00007ffff70d645a in vnet_sw_interface_set_flags (
vnm=0x7ffff7e2b828 <vnet_main>, sw_if_index=1,
flags=VNET_SW_INTERFACE_FLAG_ADMIN_UP)
at /home/chenweihao/vpp_dev/src/vnet/interface.c:524
#10 0x00007ffff710515f in set_state (vm=0x7fffb6a00740, input=0x7fffa9f84bb8,
cmd=0x7fffb7180850)
at /home/chenweihao/vpp_dev/src/vnet/interface_cli.c:946
#11 0x00007ffff7e72257 in vlib_cli_dispatch_sub_commands (vm=0x7fffb6a00740,
cm=0x7ffff7f6a770 <vlib_global_main+48>, input=0x7fffa9f84bb8,
parent_command_index=20) at /home/chenweihao/vpp_dev/src/vlib/cli.c:650
#12 0x00007ffff7e71fea in vlib_cli_dispatch_sub_commands (vm=0x7fffb6a00740,
cm=0x7ffff7f6a770 <vlib_global_main+48>, input=0x7fffa9f84bb8,
parent_command_index=7) at /home/chenweihao/vpp_dev/src/vlib/cli.c:607
#13 0x00007ffff7e71fea in vlib_cli_dispatch_sub_commands (vm=0x7fffb6a00740,
cm=0x7ffff7f6a770 <vlib_global_main+48>, input=0x7fffa9f84bb8,
parent_command_index=0) at /home/chenweihao/vpp_dev/src/vlib/cli.c:607
#14 0x00007ffff7e7122a in vlib_cli_input (vm=0x7fffb6a00740,
input=0x7fffa9f84bb8, function=0x0, function_arg=0)
at /home/chenweihao/vpp_dev/src/vlib/cli.c:753
#15 0x00007ffff7ef7e23 in unix_cli_exec (vm=0x7fffb6a00740,
input=0x7fffa9f84f30, cmd=0x7fffb71815b8)
at /home/chenweihao/vpp_dev/src/vlib/unix/cli.c:3431
#16 0x00007ffff7e72257 in vlib_cli_dispatch_sub_commands (vm=0x7fffb6a00740,
cm=0x7ffff7f6a770 <vlib_global_main+48>, input=0x7fffa9f84f30,
parent_command_index=0) at /home/chenweihao/vpp_dev/src/vlib/cli.c:650
#17 0x00007ffff7e7122a in vlib_cli_input (vm=0x7fffb6a00740,
input=0x7fffa9f84f30, function=0x0, function_arg=0)
at /home/chenweihao/vpp_dev/src/vlib/cli.c:753
#18 0x00007ffff7efdfc5 in startup_config_process (vm=0x7fffb6a00740,
rt=0x7fffb9194080, f=0x0)
at /home/chenweihao/vpp_dev/src/vlib/unix/main.c:291
#19 0x00007ffff7ea2c5d in vlib_process_bootstrap (_a=140736084405176)
at /home/chenweihao/vpp_dev/src/vlib/main.c:1221
#20 0x00007ffff6f1ffd8 in clib_calljmp ()
at /home/chenweihao/vpp_dev/src/vppinfra/longjmp.S:123
#21 0x00007fffac516bb0 in ?? ()
#22 0x00007ffff7ea26f9 in vlib_process_startup (vm=0x8,
p=0x7ffff7ea53bb <dispatch_suspended_process+763>, f=0x7fffac516cc0)
at /home/chenweihao/vpp_dev/src/vlib/main.c:1246
#23 0x00007ffff6f7aa1c in vec_mem_size (v=0x7fffb6a00740)
at /home/chenweihao/vpp_dev/src/vppinfra/vec.c:15
#24 0x00000581655dfd1c in ?? ()
#25 0x0000003300000004 in ?? ()
#26 0x0000000000000030 in ?? ()
#27 0x00007fffbdbc7240 in ?? ()
#28 0x00007fffbdbc7240 in ?? ()
#29 0x00007fffb80e5498 in ?? ()
#30 0x0000000000000001 in ?? ()
#31 0x0000000000000000 in ?? ()
I tried changing num-rx-queues and num-tx-queues to 4, and then the SIGSEGV did not happen. I applied the patch https://gerrit.fd.io/r/c/vpp/+/38529, but the redis 6.0 problem still seemed to exist; the stack backtrace is the same as in https://lists.fd.io/g/vpp-dev/message/22731.
Florin Coras
Hi,
The problem seems to be that you’re using a vmxnet3 interface, so I suspect this might be a vm configuration issue. Your current config should work but could end up being inefficient. With respect to your problem, I just built redis and ran redis-server and cli over LDP. Everything seems to be working fine so I’m assuming you’re doing some stress tests of redis? Could you provide more info about your client? Regards, Florin
Chen Weihao
Thank you for your reply.
I did use a VMware virtual machine before. I switched to a physical NIC for testing, and now there is no problem setting rx-num to 5.
Also, I tested redis using redis-benchmark; the specific commands are as follows:

sudo taskset -c 9 bash -c "VCL_CONFIG=/home/chenweihao/vpp_config/vcl.conf LD_PRELOAD=/home/chenweihao/vpp_dev/build-root/install-vpp-native/vpp/lib/libvcl_ldpreload.so /home/chenweihao/redis_6.0/src/redis-server --protected-mode no"
sudo taskset -c 11 bash -c "VCL_CONFIG=/home/chenweihao/vpp_config/vcl.conf LD_PRELOAD=/home/chenweihao/vpp_dev/build-root/install-vpp-native/vpp/lib/libvcl_ldpreload.so /home/chenweihao/redis_6.0/src/redis-benchmark -h 127.0.0.1 -n 10000 -q -t set,get -d 4096"
When running both commands, the problems I mentioned before still occur (even though I applied the patch https://gerrit.fd.io/r/c/vpp/+/38529). However, the problem is not triggered on every test run; it may take several runs before it occurs once.
Florin Coras
Hi,
Apologies for the delay. I finally had some spare time to test with redis-benchmark and hit the issue you’ve found. Your original observation was right and, as expected, this is a side effect of moving connects to first worker. That is, workers can race with first worker and half-open sessions might not be fully initialized. Could you check the latest version of the patch and see if it helps? I couldn’t get it to crash on my end but since this is a timing issue, the testbed might matter. Regards, Florin
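The timing dependence is easy to see in a minimal pthread sketch of this race class; this is purely illustrative, not VPP code, and like the real bug it only trips the assert on some runs:

/* build with: cc -pthread race.c */
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define INVALID_INDEX ((uint32_t) ~0)

static volatile uint32_t rpc_sent = 0;            /* "accept RPC" flag */
static volatile uint32_t s_index = INVALID_INDEX; /* half-open's session */

static void *
first_worker (void *arg)
{
  (void) arg;
  rpc_sent = 1;   /* like ct_connect: RPC programmed before... */
  s_index = 42;   /* ...the session index is filled in */
  return NULL;
}

static void *
other_worker (void *arg)
{
  (void) arg;
  while (!rpc_sent)
    ;             /* wait for the "RPC" */
  /* like ct_accept_one: may observe ~0 if it wins the race */
  printf ("saw s_index=%u\n", s_index);
  assert (s_index != INVALID_INDEX);
  return NULL;
}

int
main (void)
{
  pthread_t a, b;
  pthread_create (&b, NULL, other_worker, NULL);
  pthread_create (&a, NULL, first_worker, NULL);
  pthread_join (a, NULL);
  pthread_join (b, NULL);
  return 0;
}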
Chen Weihao
I would like to express my sincere apologies to you.
I was busy with other things for a while and didn't follow up on your reply. May I ask which patch you are referring to when you say "latest version of the patch"?
I tried https://gerrit.fd.io/r/c/vpp/+/38529, but it still seems to cause a SIGABRT:
DBGvpp# 1: /home/chenweihao/vpp_dev/src/vnet/session/application_local.c:798 (ct_fwrk_flush_connects) assertion `fwrk_index == cm->fwrk_thread' fails
Thread 3 "vpp_wk_0" received signal SIGABRT, Aborted.
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140736034825792)
at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=6, threadid=140736034825792)
at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=140736034825792, signo=signo@entry=6)
at ./nptl/pthread_kill.c:89
#3 0x00007ffff6a42476 in __GI_raise (sig=sig@entry=6)
at ../sysdeps/posix/raise.c:26
#4 0x00007ffff6a287f3 in __GI_abort () at ./stdlib/abort.c:79
#5 0x000055555555b073 in os_panic ()
at /home/chenweihao/vpp_dev/src/vpp/vnet/main.c:417
#6 0x00007ffff6e436f9 in debugger ()
at /home/chenweihao/vpp_dev/src/vppinfra/error.c:84
#7 0x00007ffff6e434b0 in _clib_error (how_to_die=2, function_name=0x0,
line_number=0, fmt=0x7ffff7b93208 "%s:%d (%s) assertion `%s' fails")
at /home/chenweihao/vpp_dev/src/vppinfra/error.c:143
#8 0x00007ffff764dd37 in ct_fwrk_flush_connects (rpc_args=0x2)
at /home/chenweihao/vpp_dev/src/vnet/session/application_local.c:798
#9 0x00007ffff76258d2 in session_event_dispatch_ctrl (wrk=0x7fffbdd8dd80,
elt=0x7fffbdc05ef8)
at /home/chenweihao/vpp_dev/src/vnet/session/session_node.c:1661
#10 0x00007ffff761c3d6 in session_queue_node_fn (vm=0x7fffb8f8c240,
node=0x7fffbdbff880, frame=0x0)
at /home/chenweihao/vpp_dev/src/vnet/session/session_node.c:1971
#11 0x00007ffff6f37a62 in dispatch_node (vm=0x7fffb8f8c240,
node=0x7fffbdbff880, type=VLIB_NODE_TYPE_INPUT,
dispatch_state=VLIB_NODE_STATE_POLLING, frame=0x0,
last_time_stamp=5437425743575)
at /home/chenweihao/vpp_dev/src/vlib/main.c:960
#12 0x00007ffff6f337d1 in vlib_main_or_worker_loop (vm=0x7fffb8f8c240,
is_main=0) at /home/chenweihao/vpp_dev/src/vlib/main.c:1557
#13 0x00007ffff6f331c7 in vlib_worker_loop (vm=0x7fffb8f8c240)
at /home/chenweihao/vpp_dev/src/vlib/main.c:1722
#14 0x00007ffff6f6f220 in vlib_worker_thread_fn (arg=0x7fffb8fec440)
at /home/chenweihao/vpp_dev/src/vlib/threads.c:1598
#15 0x00007ffff6f6a4b6 in vlib_worker_thread_bootstrap_fn (arg=0x7fffb8fec440)
at /home/chenweihao/vpp_dev/src/vlib/threads.c:418
#16 0x00007ffff6a94b43 in start_thread (arg=<optimized out>)
at ./nptl/pthread_create.c:442
#17 0x00007ffff6b26a00 in clone3 ()
at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
I found that fwrk_index = 2 and cm->fwrk_thread = 1. fwrk_index is set in ct_program_connect_to_wrk:

thread_index = cm->n_workers ? (cm->n_sessions % cm->n_workers) + 1 : 0;
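For illustration, with the 4-worker config from earlier in the thread, that expression distributes connects across threads 1..4, so the flush RPC can apparently execute on a worker other than cm->fwrk_thread. A sketch of the arithmetic only (not VPP code):

#include <stdint.h>
#include <stdio.h>

int
main (void)
{
  uint32_t n_workers = 4, fwrk_thread = 1;
  for (uint32_t n_sessions = 0; n_sessions < 6; n_sessions++)
    {
      uint32_t thread_index = n_workers ? (n_sessions % n_workers) + 1 : 0;
      printf ("n_sessions=%u -> thread_index=%u (assert expects %u)\n",
              n_sessions, thread_index, fwrk_thread);
    }
  /* prints 1,2,3,4,1,2: matches the failing case fwrk_index=2 */
  return 0;
}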
Florin Coras
Hi Chen,
No worries. Inline
Regards, Florin

[1] https://gerrit.fd.io/r/c/vpp/+/38606
Chen Weihao
Thank you for your reply.
I tried https://gerrit.fd.io/r/c/vpp/+/38606; with this patch, vpp works well with redis-server and redis-benchmark. Regards,
Florin Coras
Great! Thanks for confirming!
Regards, Florin