Hi Team, We have a process node. which we will use to do some control plane related activity. Sometimes we observe that this process node is suspended indefinitely.
I know that if any process node is taking Unreasonably long time such nodes will not be scheduled further. But not able to figure out in code where this is done.
Can anyone point me to the code where we are tracking time taken by each process node and suspend indefinitely if it is consuming more time.
Thanks and regards, Sudhir
NOTICE TO
RECIPIENT This e-mail message and any attachments are confidential and may be
privileged. If you received this e-mail in error, any review, use,
dissemination, distribution, or copying of this e-mail is strictly
prohibited. Please notify us immediately of the error by return e-mail and
please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com
|
|
Process is a bit of a misnomer, “cooperative multitasking thread” would be more accurate. Src/vlib/main.c makes no effort to interrupt process nodes. If a process runs forever vpp will have a bad day. Recitations: which version of vpp is involved? Please send “show run” output for the process node in question. The process might be waiting on an event which never happens, or for the clock to reach a time so far in the future it won’t happen in one’s lifetime. Worse luck might involve memory corruption or an issue with the timer wheel code. HTH... Dave
toggle quoted message
Show quoted text
From: vpp-dev@... <vpp-dev@...> On Behalf Of Sudhir CR via lists.fd.io Sent: Friday, February 17, 2023 12:12 PM To: vpp-dev@... Subject: [vpp-dev] process node suspended indefinitely Hi Team, We have a process node. which we will use to do some control plane related activity. Sometimes we observe that this process node is suspended indefinitely. I know that if any process node is taking Unreasonably long time such nodes will not be scheduled further. But not able to figure out in code where this is done. Can anyone point me to the code where we are tracking time taken by each process node and suspend indefinitely if it is consuming more time. NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com
|
|
HI:
-
I have the same problem,
bfd process node stop running. I raised this issue,
https://lists.fd.io/g/vpp-dev/message/22380 I think there is a problem with the porcess scheduling module when using the time wheel.
|
|
Hi Dave, Thank you for your response and help.
Please find the additional details below. VPP Version 21.10 We are creating a process node rtb-vpp-epoll-process to handle control plane events like interface add/delete, route add/delete. This process node waits for 10ms of time (Not Interested in any events ) once 10ms is expired it will process control plane events mentioned above.
code snippet looks like below
``` static uword rtb_vpp_epoll_process (vlib_main_t *vm, vlib_node_runtime_t *rt, vlib_frame_t *f) { ... ... while (1) { vlib_process_wait_for_event_or_clock (vm, 10e-3); vlib_process_get_events (vm, NULL);
rtb_event_loop_run_once(); <---- controlplane events handling } } ```
What we observed is that sometimes (when there is a high controlplane load like request to install more routes) "rtb-vpp-epoll-process" is suspended and not scheduled furever. this we found by using "show runtime rtb-vpp-epoll-process" (in "show runtime rtb-vpp-epoll-process" command output suspends counter is not incrementing.)
show runtime output in working case : ``` DBGvpp# show runtime rtb-vpp-epoll-process Name State Calls Vectors Suspends Clocks Vectors/Call rtb-vpp-epoll-process any wait 0 0 192246 1.91e6 0.00 DBGvpp#
DBGvpp# show runtime rtb-vpp-epoll-process Name State Calls Vectors Suspends Clocks Vectors/Call rtb-vpp-epoll-process any wait 0 0 193634 1.89e6 0.00 DBGvpp#
```
show runtime output in issue case : ``` ```
Other process nodes like lldp-process, ip4-neighbor-age-process, ip6-ra-process running without any issue. only "rtb-vpp-epoll-process" process node suspended forever.
Please let me know if any additional information is required.
Hi Jinsh, Thanks for pointing me to the issue you faced. The issue I am facing looks similar. I will verify with the given patch. Thanks and Regards, Sudhir On Sun, Feb 19, 2023 at 6:19 AM jinsh11 < jinsh11@...> wrote: HI:
-
I have the same problem,
bfd process node stop running. I raised this issue,
NOTICE TO
RECIPIENT This e-mail message and any attachments are confidential and may be
privileged. If you received this e-mail in error, any review, use,
dissemination, distribution, or copying of this e-mail is strictly
prohibited. Please notify us immediately of the error by return e-mail and
please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com
|
|
Please try something like this, to eliminate the possibility that some bit of code is sending this process an event. It’s not a good idea to skip the vec_reset_length (event_data) step. while (1) { uword event_type, * event_data = 0; int i; vlib_process_wait_for_event_or_clock (vm, 1e-2 /* 10 ms */); event_type = vlib_process_get_events (vm, &event_data); switch (event_type) { case ~0: /* handle timer expirations */ rtb_event_loop_run_once (); break; default: /* bug! */ ASSERT (0); } vec_reset_length(event_data); }
toggle quoted message
Show quoted text
From: vpp-dev@... <vpp-dev@...> On Behalf Of Sudhir CR via lists.fd.io Sent: Monday, February 20, 2023 4:02 AM To: vpp-dev@... Subject: Re: [vpp-dev] process node suspended indefinitely Hi Dave, Thank you for your response and help. Please find the additional details below. We are creating a process node rtb-vpp-epoll-process to handle control plane events like interface add/delete, route add/delete. This process node waits for 10ms of time (Not Interested in any events ) once 10ms is expired it will process control plane events mentioned above.
code snippet looks like below
static uword rtb_vpp_epoll_process (vlib_main_t *vm, vlib_node_runtime_t *rt, vlib_frame_t *f) { ... ... while (1) { vlib_process_wait_for_event_or_clock (vm, 10e-3); vlib_process_get_events (vm, NULL);
rtb_event_loop_run_once(); <---- controlplane events handling } } ``` What we observed is that sometimes (when there is a high controlplane load like request to install more routes) "rtb-vpp-epoll-process" is suspended and not scheduled furever. this we found by using "show runtime rtb-vpp-epoll-process" (in "show runtime rtb-vpp-epoll-process" command output suspends counter is not incrementing.)
show runtime output in working case : ``` DBGvpp# show runtime rtb-vpp-epoll-process Name State Calls Vectors Suspends Clocks Vectors/Call rtb-vpp-epoll-process any wait 0 0 192246 1.91e6 0.00 DBGvpp#
DBGvpp# show runtime rtb-vpp-epoll-process Name State Calls Vectors Suspends Clocks Vectors/Call rtb-vpp-epoll-process any wait 0 0 193634 1.89e6 0.00 DBGvpp#
```
show runtime output in issue case : ``` Name State Calls Vectors Suspends Clocks Vectors/Call rtb-vpp-epoll-process any wait 0 0 81477 7.08e6 0.00 Name State Calls Vectors Suspends Clocks Vectors/Call rtb-vpp-epoll-process any wait 0 0 81477 7.08e6 0.00 ```
Other process nodes like lldp-process, ip4-neighbor-age-process, ip6-ra-process running without any issue. only "rtb-vpp-epoll-process" process node suspended forever. Please let me know if any additional information is required.
Hi Jinsh, Thanks for pointing me to the issue you faced. The issue I am facing looks similar. I will verify with the given patch. Thanks and Regards,
On Sun, Feb 19, 2023 at 6:19 AM jinsh11 <jinsh11@...> wrote: HI:
bfd process node stop running. I raised this issue,
NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com
|
|
Hi Dave, Thank you very much for your inputs. I will try this out and get back to you with the results.
Regards, Sudhir On Mon, Feb 20, 2023 at 6:01 PM Dave Barach < vpp@...> wrote: Please try something like this, to eliminate the possibility that some bit of code is sending this process an event. It’s not a good idea to skip the vec_reset_length (event_data) step. while (1) { uword event_type, * event_data = 0; int i; vlib_process_wait_for_event_or_clock (vm, 1e-2 /* 10 ms */); event_type = vlib_process_get_events (vm, &event_data); switch (event_type) { case ~0: /* handle timer expirations */ rtb_event_loop_run_once (); break; default: /* bug! */ ASSERT (0); } vec_reset_length(event_data); } Hi Dave, Thank you for your response and help. Please find the additional details below. We are creating a process node rtb-vpp-epoll-process to handle control plane events like interface add/delete, route add/delete. This process node waits for 10ms of time (Not Interested in any events ) once 10ms is expired it will process control plane events mentioned above.
code snippet looks like below
static uword rtb_vpp_epoll_process (vlib_main_t *vm, vlib_node_runtime_t *rt, vlib_frame_t *f) { ... ... while (1) { vlib_process_wait_for_event_or_clock (vm, 10e-3); vlib_process_get_events (vm, NULL);
rtb_event_loop_run_once(); <---- controlplane events handling } } ``` What we observed is that sometimes (when there is a high controlplane load like request to install more routes) "rtb-vpp-epoll-process" is suspended and not scheduled furever. this we found by using "show runtime rtb-vpp-epoll-process" (in "show runtime rtb-vpp-epoll-process" command output suspends counter is not incrementing.)
show runtime output in working case : ``` DBGvpp# show runtime rtb-vpp-epoll-process Name State Calls Vectors Suspends Clocks Vectors/Call rtb-vpp-epoll-process any wait 0 0 192246 1.91e6 0.00 DBGvpp#
DBGvpp# show runtime rtb-vpp-epoll-process Name State Calls Vectors Suspends Clocks Vectors/Call rtb-vpp-epoll-process any wait 0 0 193634 1.89e6 0.00 DBGvpp#
```
show runtime output in issue case : ``` Name State Calls Vectors Suspends Clocks Vectors/Call rtb-vpp-epoll-process any wait 0 0 81477 7.08e6 0.00 Name State Calls Vectors Suspends Clocks Vectors/Call rtb-vpp-epoll-process any wait 0 0 81477 7.08e6 0.00 ```
Other process nodes like lldp-process, ip4-neighbor-age-process, ip6-ra-process running without any issue. only "rtb-vpp-epoll-process" process node suspended forever. Please let me know if any additional information is required.
Hi Jinsh, Thanks for pointing me to the issue you faced. The issue I am facing looks similar. I will verify with the given patch. Thanks and Regards,
On Sun, Feb 19, 2023 at 6:19 AM jinsh11 <jinsh11@...> wrote: HI:
bfd process node stop running. I raised this issue,
NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com
NOTICE TO
RECIPIENT This e-mail message and any attachments are confidential and may be
privileged. If you received this e-mail in error, any review, use,
dissemination, distribution, or copying of this e-mail is strictly
prohibited. Please notify us immediately of the error by return e-mail and
please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com
|
|
Hi Sudhir,
Is your issue resolved?
Actually we are facing same issue on vpp.2106. In our case "api-rx-ring" is not getting called. in our usecase workers are calling some functions in main-thread context leading to RPC message and memory is allocated from api section. This leads to Api-segment memory is used fully and leads to crash.
Thanks, Chetan
toggle quoted message
Show quoted text
Hi Dave, Thank you very much for your inputs. I will try this out and get back to you with the results.
Regards, Sudhir
On Mon, Feb 20, 2023 at 6:01 PM Dave Barach < vpp@...> wrote: Please try something like this, to eliminate the possibility that some bit of code is sending this process an event. It’s not a good idea to skip the vec_reset_length (event_data) step. while (1) { uword event_type, * event_data = 0; int i; vlib_process_wait_for_event_or_clock (vm, 1e-2 /* 10 ms */); event_type = vlib_process_get_events (vm, &event_data); switch (event_type) { case ~0: /* handle timer expirations */ rtb_event_loop_run_once (); break; default: /* bug! */ ASSERT (0); } vec_reset_length(event_data); } Hi Dave, Thank you for your response and help. Please find the additional details below. We are creating a process node rtb-vpp-epoll-process to handle control plane events like interface add/delete, route add/delete. This process node waits for 10ms of time (Not Interested in any events ) once 10ms is expired it will process control plane events mentioned above.
code snippet looks like below
static uword rtb_vpp_epoll_process (vlib_main_t *vm, vlib_node_runtime_t *rt, vlib_frame_t *f) { ... ... while (1) { vlib_process_wait_for_event_or_clock (vm, 10e-3); vlib_process_get_events (vm, NULL);
rtb_event_loop_run_once(); <---- controlplane events handling } } ``` What we observed is that sometimes (when there is a high controlplane load like request to install more routes) "rtb-vpp-epoll-process" is suspended and not scheduled furever. this we found by using "show runtime rtb-vpp-epoll-process" (in "show runtime rtb-vpp-epoll-process" command output suspends counter is not incrementing.)
show runtime output in working case : ``` DBGvpp# show runtime rtb-vpp-epoll-process Name State Calls Vectors Suspends Clocks Vectors/Call rtb-vpp-epoll-process any wait 0 0 192246 1.91e6 0.00 DBGvpp#
DBGvpp# show runtime rtb-vpp-epoll-process Name State Calls Vectors Suspends Clocks Vectors/Call rtb-vpp-epoll-process any wait 0 0 193634 1.89e6 0.00 DBGvpp#
```
show runtime output in issue case : ``` Name State Calls Vectors Suspends Clocks Vectors/Call rtb-vpp-epoll-process any wait 0 0 81477 7.08e6 0.00 Name State Calls Vectors Suspends Clocks Vectors/Call rtb-vpp-epoll-process any wait 0 0 81477 7.08e6 0.00 ```
Other process nodes like lldp-process, ip4-neighbor-age-process, ip6-ra-process running without any issue. only "rtb-vpp-epoll-process" process node suspended forever. Please let me know if any additional information is required.
Hi Jinsh, Thanks for pointing me to the issue you faced. The issue I am facing looks similar. I will verify with the given patch. Thanks and Regards,
On Sun, Feb 19, 2023 at 6:19 AM jinsh11 <jinsh11@...> wrote: HI:
bfd process node stop running. I raised this issue,
NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com
NOTICE TO
RECIPIENT This e-mail message and any attachments are confidential and may be
privileged. If you received this e-mail in error, any review, use,
dissemination, distribution, or copying of this e-mail is strictly
prohibited. Please notify us immediately of the error by return e-mail and
please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com
|
|
Hi Chetan, In our case we are observing this issue occasionally exact steps to recreate the issue are not known.
I made changes to our process node as suggested by dave and with these changes trying to recreate the issue.
Soon I will update my results and findings in this mail thread.
Thanks and Regards, Sudhir Hi Sudhir,
Is your issue resolved?
Actually we are facing same issue on vpp.2106. In our case "api-rx-ring" is not getting called. in our usecase workers are calling some functions in main-thread context leading to RPC message and memory is allocated from api section. This leads to Api-segment memory is used fully and leads to crash.
Thanks, Chetan Hi Dave, Thank you very much for your inputs. I will try this out and get back to you with the results.
Regards, Sudhir
On Mon, Feb 20, 2023 at 6:01 PM Dave Barach < vpp@...> wrote: Please try something like this, to eliminate the possibility that some bit of code is sending this process an event. It’s not a good idea to skip the vec_reset_length (event_data) step. while (1) { uword event_type, * event_data = 0; int i; vlib_process_wait_for_event_or_clock (vm, 1e-2 /* 10 ms */); event_type = vlib_process_get_events (vm, &event_data); switch (event_type) { case ~0: /* handle timer expirations */ rtb_event_loop_run_once (); break; default: /* bug! */ ASSERT (0); } vec_reset_length(event_data); } Hi Dave, Thank you for your response and help. Please find the additional details below. We are creating a process node rtb-vpp-epoll-process to handle control plane events like interface add/delete, route add/delete. This process node waits for 10ms of time (Not Interested in any events ) once 10ms is expired it will process control plane events mentioned above.
code snippet looks like below
static uword rtb_vpp_epoll_process (vlib_main_t *vm, vlib_node_runtime_t *rt, vlib_frame_t *f) { ... ... while (1) { vlib_process_wait_for_event_or_clock (vm, 10e-3); vlib_process_get_events (vm, NULL);
rtb_event_loop_run_once(); <---- controlplane events handling } } ``` What we observed is that sometimes (when there is a high controlplane load like request to install more routes) "rtb-vpp-epoll-process" is suspended and not scheduled furever. this we found by using "show runtime rtb-vpp-epoll-process" (in "show runtime rtb-vpp-epoll-process" command output suspends counter is not incrementing.)
show runtime output in working case : ``` DBGvpp# show runtime rtb-vpp-epoll-process Name State Calls Vectors Suspends Clocks Vectors/Call rtb-vpp-epoll-process any wait 0 0 192246 1.91e6 0.00 DBGvpp#
DBGvpp# show runtime rtb-vpp-epoll-process Name State Calls Vectors Suspends Clocks Vectors/Call rtb-vpp-epoll-process any wait 0 0 193634 1.89e6 0.00 DBGvpp#
```
show runtime output in issue case : ``` Name State Calls Vectors Suspends Clocks Vectors/Call rtb-vpp-epoll-process any wait 0 0 81477 7.08e6 0.00 Name State Calls Vectors Suspends Clocks Vectors/Call rtb-vpp-epoll-process any wait 0 0 81477 7.08e6 0.00 ```
Other process nodes like lldp-process, ip4-neighbor-age-process, ip6-ra-process running without any issue. only "rtb-vpp-epoll-process" process node suspended forever. Please let me know if any additional information is required.
Hi Jinsh, Thanks for pointing me to the issue you faced. The issue I am facing looks similar. I will verify with the given patch. Thanks and Regards,
On Sun, Feb 19, 2023 at 6:19 AM jinsh11 <jinsh11@...> wrote: HI:
bfd process node stop running. I raised this issue,
NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com
NOTICE TO
RECIPIENT This e-mail message and any attachments are confidential and may be
privileged. If you received this e-mail in error, any review, use,
dissemination, distribution, or copying of this e-mail is strictly
prohibited. Please notify us immediately of the error by return e-mail and
please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com
NOTICE TO
RECIPIENT This e-mail message and any attachments are confidential and may be
privileged. If you received this e-mail in error, any review, use,
dissemination, distribution, or copying of this e-mail is strictly
prohibited. Please notify us immediately of the error by return e-mail and
please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com
|
|
Hi Dave, Please excuse my delayed response. It took some time to recreate this issue. I made changes to our process node as per your suggestion. now our process node code looks like this while (1) {
vlib_process_wait_for_event_or_clock (vm, RTB_VPP_EPOLL_PROCESS_NODE_TIMER); event_type = vlib_process_get_events (vm, &event_data); vec_reset_length(event_data);
switch (event_type) { case ~0: /* handle timer expirations */ rtb_event_loop_run_once (); break;
default: /* bug! */ ASSERT (0); } }
After these changes we didn't observe any assertions but we hit the process node suspend issue. with this it is clear other than time out we are not getting any other events.
In the issue state I have collected vlib_process node (rtb_vpp_epoll_process) flags value and it seems to be correct (flags = 11).
Please find the vlib_process_t and vlib_node_t data structure values collected in the issue state below.
vlib_process_t:
============ $38 = { cacheline0 = 0x7f9b2da50380 "\200~\274+\233\177", node_runtime = { cacheline0 = 0x7f9b2da50380 "\200~\274+\233\177", function = 0x7f9b2bbc7e80 <rtb_vpp_epoll_process>, errors = 0x7f9b3076a560, clocks_since_last_overflow = 0, max_clock = 3785970526, max_clock_n = 0, calls_since_last_overflow = 0, vectors_since_last_overflow = 0, next_frame_index = 1668, node_index = 437, input_main_loops_per_call = 0, main_loop_count_last_dispatch = 4147405645, main_loop_vector_stats = {0, 0}, flags = 0, state = 0, n_next_nodes = 0, cached_next_index = 0, thread_index = 0, runtime_data = 0x7f9b2da503c6 "" }, return_longjmp = { regs = {94502584873984, 140304430422064, 140306731463680, 94502584874048, 94502640552512, 0, 140304430422032, 140306703608766} }, resume_longjmp = { regs = {94502584873984, 140304161734368, 140306731463680, 94502584874048, 94502640552512, 0, 140304161734272, 140304430441787} }, flags = 11, log2_n_stack_bytes = 16, suspended_process_frame_index = 0, n_suspends = 0, pending_event_data_by_type_index = 0x7f9b307b8310, non_empty_event_type_bitmap = 0x7f9b307b8390, one_time_event_type_bitmap = 0x0, event_type_index_by_type_opaque = 0x7f9b2dab8bd8, event_type_pool = 0x7f9b2dcb5978, resume_clock_interval = 1000, stop_timer_handle = 3098, output_function = 0x0, output_function_arg = 0, stack = 0x7f9b1bb78000 }
vlib_node_t
========= (gdb) p *n $17 = { function = 0x7f9b2bbc7e80 <rtb_vpp_epoll_process>, name = 0x7f9b3076a3f0 "rtb-vpp-epoll-process", name_elog_string = 11783, stats_total = { calls = 0, vectors = 0, clocks = 1971244932732, suspends = 6847366, max_clock = 3785970526, max_clock_n = 0 }, stats_last_clear = { calls = 0, vectors = 0, clocks = 0, suspends = 0, max_clock = 0, max_clock_n = 0 }, type = VLIB_NODE_TYPE_PROCESS, index = 437, runtime_index = 40, runtime_data = 0x0, flags = 0, state = 0 '\000', runtime_data_bytes = 0 '\000', protocol_hint = 0 '\000', n_errors = 0, scalar_size = 0, vector_size = 0, error_heap_handle = 0, error_heap_index = 0, error_counters = 0x0, next_node_names = 0x7f9b3076a530, next_nodes = 0x0, sibling_of = 0x0, sibling_bitmap = 0x0, n_vectors_by_next_node = 0x0, next_slot_by_node = 0x0, prev_node_bitmap = 0x0, owner_node_index = 4294967295, owner_next_index = 4294967295, format_buffer = 0x0, unformat_buffer = 0x0, format_trace = 0x0, validate_frame = 0x0, state_string = 0x0, node_fn_registrations = 0x0 } I added an assert statement before clearing VLIB_PROCESS_IS_RUNNING flag in dispatch_suspended_process function. But this assert statement is not hitting.
diff --git a/src/vlib/main.c b/src/vlib/main.c index af0fcd1cb..55c231d8b 100644 --- a/src/vlib/main.c +++ b/src/vlib/main.c @@ -1490,6 +1490,9 @@ dispatch_suspended_process (vlib_main_t * vm, } else { + if (strcmp((char *)node->name, "rtb-vpp-epoll-process") == 0) { + ASSERT(0); + } p->flags &= ~VLIB_PROCESS_IS_RUNNING; pool_put_index (nm->suspended_process_frames, p->suspended_process_frame_index);
I am not able to figure out why this process node is suspended in some scenarios. Can you please help me by providing some pointers to debug and resolve this issue.
Hi Jinsh, I applied your patch to my code. The issue is not solved with your patch. Thank you for helping me out.
Thanks and Regards, Sudhir Hi Chetan, In our case we are observing this issue occasionally exact steps to recreate the issue are not known.
I made changes to our process node as suggested by dave and with these changes trying to recreate the issue.
Soon I will update my results and findings in this mail thread.
Thanks and Regards, Sudhir
Hi Sudhir,
Is your issue resolved?
Actually we are facing same issue on vpp.2106. In our case "api-rx-ring" is not getting called. in our usecase workers are calling some functions in main-thread context leading to RPC message and memory is allocated from api section. This leads to Api-segment memory is used fully and leads to crash.
Thanks, Chetan Hi Dave, Thank you very much for your inputs. I will try this out and get back to you with the results.
Regards, Sudhir
On Mon, Feb 20, 2023 at 6:01 PM Dave Barach < vpp@...> wrote: Please try something like this, to eliminate the possibility that some bit of code is sending this process an event. It’s not a good idea to skip the vec_reset_length (event_data) step. while (1) { uword event_type, * event_data = 0; int i; vlib_process_wait_for_event_or_clock (vm, 1e-2 /* 10 ms */); event_type = vlib_process_get_events (vm, &event_data); switch (event_type) { case ~0: /* handle timer expirations */ rtb_event_loop_run_once (); break; default: /* bug! */ ASSERT (0); } vec_reset_length(event_data); } Hi Dave, Thank you for your response and help. Please find the additional details below. We are creating a process node rtb-vpp-epoll-process to handle control plane events like interface add/delete, route add/delete. This process node waits for 10ms of time (Not Interested in any events ) once 10ms is expired it will process control plane events mentioned above.
code snippet looks like below
static uword rtb_vpp_epoll_process (vlib_main_t *vm, vlib_node_runtime_t *rt, vlib_frame_t *f) { ... ... while (1) { vlib_process_wait_for_event_or_clock (vm, 10e-3); vlib_process_get_events (vm, NULL);
rtb_event_loop_run_once(); <---- controlplane events handling } } ``` What we observed is that sometimes (when there is a high controlplane load like request to install more routes) "rtb-vpp-epoll-process" is suspended and not scheduled furever. this we found by using "show runtime rtb-vpp-epoll-process" (in "show runtime rtb-vpp-epoll-process" command output suspends counter is not incrementing.)
show runtime output in working case : ``` DBGvpp# show runtime rtb-vpp-epoll-process Name State Calls Vectors Suspends Clocks Vectors/Call rtb-vpp-epoll-process any wait 0 0 192246 1.91e6 0.00 DBGvpp#
DBGvpp# show runtime rtb-vpp-epoll-process Name State Calls Vectors Suspends Clocks Vectors/Call rtb-vpp-epoll-process any wait 0 0 193634 1.89e6 0.00 DBGvpp#
```
show runtime output in issue case : ``` Name State Calls Vectors Suspends Clocks Vectors/Call rtb-vpp-epoll-process any wait 0 0 81477 7.08e6 0.00 Name State Calls Vectors Suspends Clocks Vectors/Call rtb-vpp-epoll-process any wait 0 0 81477 7.08e6 0.00 ```
Other process nodes like lldp-process, ip4-neighbor-age-process, ip6-ra-process running without any issue. only "rtb-vpp-epoll-process" process node suspended forever. Please let me know if any additional information is required.
Hi Jinsh, Thanks for pointing me to the issue you faced. The issue I am facing looks similar. I will verify with the given patch. Thanks and Regards,
On Sun, Feb 19, 2023 at 6:19 AM jinsh11 <jinsh11@...> wrote: HI:
bfd process node stop running. I raised this issue,
NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com
NOTICE TO
RECIPIENT This e-mail message and any attachments are confidential and may be
privileged. If you received this e-mail in error, any review, use,
dissemination, distribution, or copying of this e-mail is strictly
prohibited. Please notify us immediately of the error by return e-mail and
please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com
NOTICE TO
RECIPIENT This e-mail message and any attachments are confidential and may be
privileged. If you received this e-mail in error, any review, use,
dissemination, distribution, or copying of this e-mail is strictly
prohibited. Please notify us immediately of the error by return e-mail and
please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com
NOTICE TO
RECIPIENT This e-mail message and any attachments are confidential and may be
privileged. If you received this e-mail in error, any review, use,
dissemination, distribution, or copying of this e-mail is strictly
prohibited. Please notify us immediately of the error by return e-mail and
please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com
|
|
-
I think you can query who stopped the current node's time wheel.
always_inline void *
vlib_process_signal_event_helper (vlib_node_main_t * nm,
vlib_node_t * n,
vlib_process_t * p,
uword t,
uword n_data_elts, uword n_data_elt_bytes)
{
if (add_to_pending)
{
u32 x = vlib_timing_wheel_data_set_suspended_process (n->runtime_index);
p->flags = p_flags | VLIB_PROCESS_RESUME_PENDING;
vec_add1 (nm->data_from_advancing_timing_wheel, x);
if (delete_from_wheel){
TW (tw_timer_stop) ((TWT (tw_timer_wheel) *) nm->timing_wheel,
p->stop_timer_handle);
vlib_process_t *p1 = vec_elt (nm->processes, vlib_get_node_by_name(vm,"rtb-vpp-epoll-process”)->runtime_index);
If ((p != p1 && (p-> stop_timer_handle == p1->stop_timer_handle))
{
ASSERT();
} }
}
}
|
|
Hi jinsh, Thanks for the help. I placed assert statement in vlib_process_signal_event_helper function. But in this place the assert statement didn't hit. When I debugged further I found that my process node is not there in the data_from_advancing_timing_wheel vector. i believe due to this process node is not getting called.now i am checking why rtb-vpp-epoll-process node entry is not present in data_from_advancing_timing_wheel vector.
Thanks and regards, Sudhir On Thu, Mar 9, 2023 at 9:08 PM jinsh11 < jinsh11@...> wrote:
-
I think you can query who stopped the current node's time wheel.
always_inline void *
vlib_process_signal_event_helper (vlib_node_main_t * nm,
vlib_node_t * n,
vlib_process_t * p,
uword t,
uword n_data_elts, uword n_data_elt_bytes)
{
if (add_to_pending)
{
u32 x = vlib_timing_wheel_data_set_suspended_process (n->runtime_index);
p->flags = p_flags | VLIB_PROCESS_RESUME_PENDING;
vec_add1 (nm->data_from_advancing_timing_wheel, x);
if (delete_from_wheel){
TW (tw_timer_stop) ((TWT (tw_timer_wheel) *) nm->timing_wheel,
p->stop_timer_handle);
vlib_process_t *p1 = vec_elt (nm->processes, vlib_get_node_by_name(vm,"rtb-vpp-epoll-process”)->runtime_index);
If ((p != p1 && (p-> stop_timer_handle == p1->stop_timer_handle))
{
ASSERT();
} }
}
}
NOTICE TO
RECIPIENT This e-mail message and any attachments are confidential and may be
privileged. If you received this e-mail in error, any review, use,
dissemination, distribution, or copying of this e-mail is strictly
prohibited. Please notify us immediately of the error by return e-mail and
please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com
|
|
I should have had the sense to ask this earlier: which version of vpp are you using? The line number in your debug snippet is more than 100 lines off from master/latest. The timer wheel code has been relatively untouched, but there have been several important fixes over the years... D. diff --git a/src/vlib/main.c b/src/vlib/main.c index af0fcd1cb..55c231d8b 100644 --- a/src/vlib/main.c +++ b/src/vlib/main.c @@ -1490,6 +1490,9 @@ dispatch_suspended_process (vlib_main_t * vm, } else { + if (strcmp((char *)node->name, "rtb-vpp-epoll-process") == 0) { + ASSERT(0); + }
toggle quoted message
Show quoted text
From: vpp-dev@... <vpp-dev@...> On Behalf Of Sudhir CR via lists.fd.io Sent: Thursday, March 9, 2023 4:00 AM To: vpp-dev@... Cc: rtbrick.com@... Subject: Re: [vpp-dev] process node suspended indefinitely Hi Dave, Please excuse my delayed response. It took some time to recreate this issue. I made changes to our process node as per your suggestion. now our process node code looks like this while (1) {
vlib_process_wait_for_event_or_clock (vm, RTB_VPP_EPOLL_PROCESS_NODE_TIMER); event_type = vlib_process_get_events (vm, &event_data); vec_reset_length(event_data);
switch (event_type) { case ~0: /* handle timer expirations */ rtb_event_loop_run_once (); break;
default: /* bug! */ ASSERT (0); } } After these changes we didn't observe any assertions but we hit the process node suspend issue. with this it is clear other than time out we are not getting any other events. In the issue state I have collected vlib_process node (rtb_vpp_epoll_process) flags value and it seems to be correct (flags = 11). Please find the vlib_process_t and vlib_node_t data structure values collected in the issue state below. $38 = { cacheline0 = 0x7f9b2da50380 "\200~\274+\233\177", node_runtime = { cacheline0 = 0x7f9b2da50380 "\200~\274+\233\177", function = 0x7f9b2bbc7e80 <rtb_vpp_epoll_process>, errors = 0x7f9b3076a560, clocks_since_last_overflow = 0, max_clock = 3785970526, max_clock_n = 0, calls_since_last_overflow = 0, vectors_since_last_overflow = 0, next_frame_index = 1668, node_index = 437, input_main_loops_per_call = 0, main_loop_count_last_dispatch = 4147405645, main_loop_vector_stats = {0, 0}, flags = 0, state = 0, n_next_nodes = 0, cached_next_index = 0, thread_index = 0, runtime_data = 0x7f9b2da503c6 "" }, return_longjmp = { regs = {94502584873984, 140304430422064, 140306731463680, 94502584874048, 94502640552512, 0, 140304430422032, 140306703608766} }, resume_longjmp = { regs = {94502584873984, 140304161734368, 140306731463680, 94502584874048, 94502640552512, 0, 140304161734272, 140304430441787} }, flags = 11, log2_n_stack_bytes = 16, suspended_process_frame_index = 0, n_suspends = 0, pending_event_data_by_type_index = 0x7f9b307b8310, non_empty_event_type_bitmap = 0x7f9b307b8390, one_time_event_type_bitmap = 0x0, event_type_index_by_type_opaque = 0x7f9b2dab8bd8, event_type_pool = 0x7f9b2dcb5978, resume_clock_interval = 1000, stop_timer_handle = 3098, output_function = 0x0, output_function_arg = 0, stack = 0x7f9b1bb78000 } $17 = { function = 0x7f9b2bbc7e80 <rtb_vpp_epoll_process>, name = 0x7f9b3076a3f0 "rtb-vpp-epoll-process", name_elog_string = 11783, stats_total = { calls = 0, vectors = 0, clocks = 1971244932732, suspends = 6847366, max_clock = 3785970526, max_clock_n = 0 }, stats_last_clear = { calls = 0, vectors = 0, clocks = 0, suspends = 0, max_clock = 0, max_clock_n = 0 }, type = VLIB_NODE_TYPE_PROCESS, index = 437, runtime_index = 40, runtime_data = 0x0, flags = 0, state = 0 '\000', runtime_data_bytes = 0 '\000', protocol_hint = 0 '\000', n_errors = 0, scalar_size = 0, vector_size = 0, error_heap_handle = 0, error_heap_index = 0, error_counters = 0x0, next_node_names = 0x7f9b3076a530, next_nodes = 0x0, sibling_of = 0x0, sibling_bitmap = 0x0, n_vectors_by_next_node = 0x0, next_slot_by_node = 0x0, prev_node_bitmap = 0x0, owner_node_index = 4294967295, owner_next_index = 4294967295, format_buffer = 0x0, unformat_buffer = 0x0, format_trace = 0x0, validate_frame = 0x0, state_string = 0x0, node_fn_registrations = 0x0 } I added an assert statement before clearing VLIB_PROCESS_IS_RUNNING flag in dispatch_suspended_process function. But this assert statement is not hitting. diff --git a/src/vlib/main.c b/src/vlib/main.c index af0fcd1cb..55c231d8b 100644 --- a/src/vlib/main.c +++ b/src/vlib/main.c @@ -1490,6 +1490,9 @@ dispatch_suspended_process (vlib_main_t * vm, } else { + if (strcmp((char *)node->name, "rtb-vpp-epoll-process") == 0) { + ASSERT(0); + } p->flags &= ~VLIB_PROCESS_IS_RUNNING; pool_put_index (nm->suspended_process_frames, p->suspended_process_frame_index); I am not able to figure out why this process node is suspended in some scenarios. Can you please help me by providing some pointers to debug and resolve this issue. I applied your patch to my code. The issue is not solved with your patch. Thank you for helping me out. Hi Chetan, In our case we are observing this issue occasionally exact steps to recreate the issue are not known. I made changes to our process node as suggested by dave and with these changes trying to recreate the issue. Soon I will update my results and findings in this mail thread. Hi Sudhir, Actually we are facing same issue on vpp.2106. In our case "api-rx-ring" is not getting called. in our usecase workers are calling some functions in main-thread context leading to RPC message and memory is allocated from api section. This leads to Api-segment memory is used fully and leads to crash. Hi Dave, Thank you very much for your inputs. I will try this out and get back to you with the results. On Mon, Feb 20, 2023 at 6:01 PM Dave Barach <vpp@...> wrote: Please try something like this, to eliminate the possibility that some bit of code is sending this process an event. It’s not a good idea to skip the vec_reset_length (event_data) step. while (1) { uword event_type, * event_data = 0; int i; vlib_process_wait_for_event_or_clock (vm, 1e-2 /* 10 ms */); event_type = vlib_process_get_events (vm, &event_data); switch (event_type) { case ~0: /* handle timer expirations */ rtb_event_loop_run_once (); break; default: /* bug! */ ASSERT (0); } vec_reset_length(event_data); } Hi Dave, Thank you for your response and help. Please find the additional details below. We are creating a process node rtb-vpp-epoll-process to handle control plane events like interface add/delete, route add/delete. This process node waits for 10ms of time (Not Interested in any events ) once 10ms is expired it will process control plane events mentioned above.
code snippet looks like below
static uword rtb_vpp_epoll_process (vlib_main_t *vm, vlib_node_runtime_t *rt, vlib_frame_t *f) { ... ... while (1) { vlib_process_wait_for_event_or_clock (vm, 10e-3); vlib_process_get_events (vm, NULL);
rtb_event_loop_run_once(); <---- controlplane events handling } } ``` What we observed is that sometimes (when there is a high controlplane load like request to install more routes) "rtb-vpp-epoll-process" is suspended and not scheduled furever. this we found by using "show runtime rtb-vpp-epoll-process" (in "show runtime rtb-vpp-epoll-process" command output suspends counter is not incrementing.)
show runtime output in working case : ``` DBGvpp# show runtime rtb-vpp-epoll-process Name State Calls Vectors Suspends Clocks Vectors/Call rtb-vpp-epoll-process any wait 0 0 192246 1.91e6 0.00 DBGvpp#
DBGvpp# show runtime rtb-vpp-epoll-process Name State Calls Vectors Suspends Clocks Vectors/Call rtb-vpp-epoll-process any wait 0 0 193634 1.89e6 0.00 DBGvpp#
```
show runtime output in issue case : ``` Name State Calls Vectors Suspends Clocks Vectors/Call rtb-vpp-epoll-process any wait 0 0 81477 7.08e6 0.00 Name State Calls Vectors Suspends Clocks Vectors/Call rtb-vpp-epoll-process any wait 0 0 81477 7.08e6 0.00 ```
Other process nodes like lldp-process, ip4-neighbor-age-process, ip6-ra-process running without any issue. only "rtb-vpp-epoll-process" process node suspended forever. Please let me know if any additional information is required.
Hi Jinsh, Thanks for pointing me to the issue you faced. The issue I am facing looks similar. I will verify with the given patch. Thanks and Regards,
On Sun, Feb 19, 2023 at 6:19 AM jinsh11 <jinsh11@...> wrote: HI: bfd process node stop running. I raised this issue,
NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com
NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com
NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com
NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com
|
|
Hi Dave, we are using VPP Version 21.10.
Thanks and regards, Sudhir On Fri, Mar 10, 2023 at 5:31 PM Dave Barach < vpp@...> wrote: I should have had the sense to ask this earlier: which version of vpp are you using? The line number in your debug snippet is more than 100 lines off from master/latest. The timer wheel code has been relatively untouched, but there have been several important fixes over the years... D. diff --git a/src/vlib/main.c b/src/vlib/main.c index af0fcd1cb..55c231d8b 100644 --- a/src/vlib/main.c +++ b/src/vlib/main.c @@ -1490,6 +1490,9 @@ dispatch_suspended_process (vlib_main_t * vm, } else { + if (strcmp((char *)node->name, "rtb-vpp-epoll-process") == 0) { + ASSERT(0); + } Hi Dave, Please excuse my delayed response. It took some time to recreate this issue. I made changes to our process node as per your suggestion. now our process node code looks like this while (1) {
vlib_process_wait_for_event_or_clock (vm, RTB_VPP_EPOLL_PROCESS_NODE_TIMER); event_type = vlib_process_get_events (vm, &event_data); vec_reset_length(event_data);
switch (event_type) { case ~0: /* handle timer expirations */ rtb_event_loop_run_once (); break;
default: /* bug! */ ASSERT (0); } } After these changes we didn't observe any assertions but we hit the process node suspend issue. with this it is clear other than time out we are not getting any other events. In the issue state I have collected vlib_process node (rtb_vpp_epoll_process) flags value and it seems to be correct (flags = 11). Please find the vlib_process_t and vlib_node_t data structure values collected in the issue state below. $38 = { cacheline0 = 0x7f9b2da50380 "\200~\274+\233\177", node_runtime = { cacheline0 = 0x7f9b2da50380 "\200~\274+\233\177", function = 0x7f9b2bbc7e80 <rtb_vpp_epoll_process>, errors = 0x7f9b3076a560, clocks_since_last_overflow = 0, max_clock = 3785970526, max_clock_n = 0, calls_since_last_overflow = 0, vectors_since_last_overflow = 0, next_frame_index = 1668, node_index = 437, input_main_loops_per_call = 0, main_loop_count_last_dispatch = 4147405645, main_loop_vector_stats = {0, 0}, flags = 0, state = 0, n_next_nodes = 0, cached_next_index = 0, thread_index = 0, runtime_data = 0x7f9b2da503c6 "" }, return_longjmp = { regs = {94502584873984, 140304430422064, 140306731463680, 94502584874048, 94502640552512, 0, 140304430422032, 140306703608766} }, resume_longjmp = { regs = {94502584873984, 140304161734368, 140306731463680, 94502584874048, 94502640552512, 0, 140304161734272, 140304430441787} }, flags = 11, log2_n_stack_bytes = 16, suspended_process_frame_index = 0, n_suspends = 0, pending_event_data_by_type_index = 0x7f9b307b8310, non_empty_event_type_bitmap = 0x7f9b307b8390, one_time_event_type_bitmap = 0x0, event_type_index_by_type_opaque = 0x7f9b2dab8bd8, event_type_pool = 0x7f9b2dcb5978, resume_clock_interval = 1000, stop_timer_handle = 3098, output_function = 0x0, output_function_arg = 0, stack = 0x7f9b1bb78000 } $17 = { function = 0x7f9b2bbc7e80 <rtb_vpp_epoll_process>, name = 0x7f9b3076a3f0 "rtb-vpp-epoll-process", name_elog_string = 11783, stats_total = { calls = 0, vectors = 0, clocks = 1971244932732, suspends = 6847366, max_clock = 3785970526, max_clock_n = 0 }, stats_last_clear = { calls = 0, vectors = 0, clocks = 0, suspends = 0, max_clock = 0, max_clock_n = 0 }, type = VLIB_NODE_TYPE_PROCESS, index = 437, runtime_index = 40, runtime_data = 0x0, flags = 0, state = 0 '\000', runtime_data_bytes = 0 '\000', protocol_hint = 0 '\000', n_errors = 0, scalar_size = 0, vector_size = 0, error_heap_handle = 0, error_heap_index = 0, error_counters = 0x0, next_node_names = 0x7f9b3076a530, next_nodes = 0x0, sibling_of = 0x0, sibling_bitmap = 0x0, n_vectors_by_next_node = 0x0, next_slot_by_node = 0x0, prev_node_bitmap = 0x0, owner_node_index = 4294967295, owner_next_index = 4294967295, format_buffer = 0x0, unformat_buffer = 0x0, format_trace = 0x0, validate_frame = 0x0, state_string = 0x0, node_fn_registrations = 0x0 } I added an assert statement before clearing VLIB_PROCESS_IS_RUNNING flag in dispatch_suspended_process function. But this assert statement is not hitting. diff --git a/src/vlib/main.c b/src/vlib/main.c index af0fcd1cb..55c231d8b 100644 --- a/src/vlib/main.c +++ b/src/vlib/main.c @@ -1490,6 +1490,9 @@ dispatch_suspended_process (vlib_main_t * vm, } else { + if (strcmp((char *)node->name, "rtb-vpp-epoll-process") == 0) { + ASSERT(0); + } p->flags &= ~VLIB_PROCESS_IS_RUNNING; pool_put_index (nm->suspended_process_frames, p->suspended_process_frame_index); I am not able to figure out why this process node is suspended in some scenarios. Can you please help me by providing some pointers to debug and resolve this issue. I applied your patch to my code. The issue is not solved with your patch. Thank you for helping me out. Hi Chetan, In our case we are observing this issue occasionally exact steps to recreate the issue are not known. I made changes to our process node as suggested by dave and with these changes trying to recreate the issue. Soon I will update my results and findings in this mail thread. Hi Sudhir, Actually we are facing same issue on vpp.2106. In our case "api-rx-ring" is not getting called. in our usecase workers are calling some functions in main-thread context leading to RPC message and memory is allocated from api section. This leads to Api-segment memory is used fully and leads to crash. Hi Dave, Thank you very much for your inputs. I will try this out and get back to you with the results. On Mon, Feb 20, 2023 at 6:01 PM Dave Barach <vpp@...> wrote: Please try something like this, to eliminate the possibility that some bit of code is sending this process an event. It’s not a good idea to skip the vec_reset_length (event_data) step. while (1) { uword event_type, * event_data = 0; int i; vlib_process_wait_for_event_or_clock (vm, 1e-2 /* 10 ms */); event_type = vlib_process_get_events (vm, &event_data); switch (event_type) { case ~0: /* handle timer expirations */ rtb_event_loop_run_once (); break; default: /* bug! */ ASSERT (0); } vec_reset_length(event_data); } Hi Dave, Thank you for your response and help. Please find the additional details below. We are creating a process node rtb-vpp-epoll-process to handle control plane events like interface add/delete, route add/delete. This process node waits for 10ms of time (Not Interested in any events ) once 10ms is expired it will process control plane events mentioned above.
code snippet looks like below
static uword rtb_vpp_epoll_process (vlib_main_t *vm, vlib_node_runtime_t *rt, vlib_frame_t *f) { ... ... while (1) { vlib_process_wait_for_event_or_clock (vm, 10e-3); vlib_process_get_events (vm, NULL);
rtb_event_loop_run_once(); <---- controlplane events handling } } ``` What we observed is that sometimes (when there is a high controlplane load like request to install more routes) "rtb-vpp-epoll-process" is suspended and not scheduled furever. this we found by using "show runtime rtb-vpp-epoll-process" (in "show runtime rtb-vpp-epoll-process" command output suspends counter is not incrementing.)
show runtime output in working case : ``` DBGvpp# show runtime rtb-vpp-epoll-process Name State Calls Vectors Suspends Clocks Vectors/Call rtb-vpp-epoll-process any wait 0 0 192246 1.91e6 0.00 DBGvpp#
DBGvpp# show runtime rtb-vpp-epoll-process Name State Calls Vectors Suspends Clocks Vectors/Call rtb-vpp-epoll-process any wait 0 0 193634 1.89e6 0.00 DBGvpp#
```
show runtime output in issue case : ``` Name State Calls Vectors Suspends Clocks Vectors/Call rtb-vpp-epoll-process any wait 0 0 81477 7.08e6 0.00 Name State Calls Vectors Suspends Clocks Vectors/Call rtb-vpp-epoll-process any wait 0 0 81477 7.08e6 0.00 ```
Other process nodes like lldp-process, ip4-neighbor-age-process, ip6-ra-process running without any issue. only "rtb-vpp-epoll-process" process node suspended forever. Please let me know if any additional information is required.
Hi Jinsh, Thanks for pointing me to the issue you faced. The issue I am facing looks similar. I will verify with the given patch. Thanks and Regards,
On Sun, Feb 19, 2023 at 6:19 AM jinsh11 <jinsh11@...> wrote: HI: bfd process node stop running. I raised this issue,
NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com
NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com
NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com
NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com
NOTICE TO
RECIPIENT This e-mail message and any attachments are confidential and may be
privileged. If you received this e-mail in error, any review, use,
dissemination, distribution, or copying of this e-mail is strictly
prohibited. Please notify us immediately of the error by return e-mail and
please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com
|
|
Hello Everyone,
Any work around/patch for the below issue??
Thanks, Chetan
toggle quoted message
Show quoted text
Hi Dave, we are using VPP Version 21.10.
Thanks and regards, Sudhir
On Fri, Mar 10, 2023 at 5:31 PM Dave Barach < vpp@...> wrote: I should have had the sense to ask this earlier: which version of vpp are you using? The line number in your debug snippet is more than 100 lines off from master/latest. The timer wheel code has been relatively untouched, but there have been several important fixes over the years... D. diff --git a/src/vlib/main.c b/src/vlib/main.c index af0fcd1cb..55c231d8b 100644 --- a/src/vlib/main.c +++ b/src/vlib/main.c @@ -1490,6 +1490,9 @@ dispatch_suspended_process (vlib_main_t * vm, } else { + if (strcmp((char *)node->name, "rtb-vpp-epoll-process") == 0) { + ASSERT(0); + } Hi Dave, Please excuse my delayed response. It took some time to recreate this issue. I made changes to our process node as per your suggestion. now our process node code looks like this while (1) {
vlib_process_wait_for_event_or_clock (vm, RTB_VPP_EPOLL_PROCESS_NODE_TIMER); event_type = vlib_process_get_events (vm, &event_data); vec_reset_length(event_data);
switch (event_type) { case ~0: /* handle timer expirations */ rtb_event_loop_run_once (); break;
default: /* bug! */ ASSERT (0); } } After these changes we didn't observe any assertions but we hit the process node suspend issue. with this it is clear other than time out we are not getting any other events. In the issue state I have collected vlib_process node (rtb_vpp_epoll_process) flags value and it seems to be correct (flags = 11). Please find the vlib_process_t and vlib_node_t data structure values collected in the issue state below. $38 = { cacheline0 = 0x7f9b2da50380 "\200~\274+\233\177", node_runtime = { cacheline0 = 0x7f9b2da50380 "\200~\274+\233\177", function = 0x7f9b2bbc7e80 <rtb_vpp_epoll_process>, errors = 0x7f9b3076a560, clocks_since_last_overflow = 0, max_clock = 3785970526, max_clock_n = 0, calls_since_last_overflow = 0, vectors_since_last_overflow = 0, next_frame_index = 1668, node_index = 437, input_main_loops_per_call = 0, main_loop_count_last_dispatch = 4147405645, main_loop_vector_stats = {0, 0}, flags = 0, state = 0, n_next_nodes = 0, cached_next_index = 0, thread_index = 0, runtime_data = 0x7f9b2da503c6 "" }, return_longjmp = { regs = {94502584873984, 140304430422064, 140306731463680, 94502584874048, 94502640552512, 0, 140304430422032, 140306703608766} }, resume_longjmp = { regs = {94502584873984, 140304161734368, 140306731463680, 94502584874048, 94502640552512, 0, 140304161734272, 140304430441787} }, flags = 11, log2_n_stack_bytes = 16, suspended_process_frame_index = 0, n_suspends = 0, pending_event_data_by_type_index = 0x7f9b307b8310, non_empty_event_type_bitmap = 0x7f9b307b8390, one_time_event_type_bitmap = 0x0, event_type_index_by_type_opaque = 0x7f9b2dab8bd8, event_type_pool = 0x7f9b2dcb5978, resume_clock_interval = 1000, stop_timer_handle = 3098, output_function = 0x0, output_function_arg = 0, stack = 0x7f9b1bb78000 } $17 = { function = 0x7f9b2bbc7e80 <rtb_vpp_epoll_process>, name = 0x7f9b3076a3f0 "rtb-vpp-epoll-process", name_elog_string = 11783, stats_total = { calls = 0, vectors = 0, clocks = 1971244932732, suspends = 6847366, max_clock = 3785970526, max_clock_n = 0 }, stats_last_clear = { calls = 0, vectors = 0, clocks = 0, suspends = 0, max_clock = 0, max_clock_n = 0 }, type = VLIB_NODE_TYPE_PROCESS, index = 437, runtime_index = 40, runtime_data = 0x0, flags = 0, state = 0 '\000', runtime_data_bytes = 0 '\000', protocol_hint = 0 '\000', n_errors = 0, scalar_size = 0, vector_size = 0, error_heap_handle = 0, error_heap_index = 0, error_counters = 0x0, next_node_names = 0x7f9b3076a530, next_nodes = 0x0, sibling_of = 0x0, sibling_bitmap = 0x0, n_vectors_by_next_node = 0x0, next_slot_by_node = 0x0, prev_node_bitmap = 0x0, owner_node_index = 4294967295, owner_next_index = 4294967295, format_buffer = 0x0, unformat_buffer = 0x0, format_trace = 0x0, validate_frame = 0x0, state_string = 0x0, node_fn_registrations = 0x0 } I added an assert statement before clearing VLIB_PROCESS_IS_RUNNING flag in dispatch_suspended_process function. But this assert statement is not hitting. diff --git a/src/vlib/main.c b/src/vlib/main.c index af0fcd1cb..55c231d8b 100644 --- a/src/vlib/main.c +++ b/src/vlib/main.c @@ -1490,6 +1490,9 @@ dispatch_suspended_process (vlib_main_t * vm, } else { + if (strcmp((char *)node->name, "rtb-vpp-epoll-process") == 0) { + ASSERT(0); + } p->flags &= ~VLIB_PROCESS_IS_RUNNING; pool_put_index (nm->suspended_process_frames, p->suspended_process_frame_index); I am not able to figure out why this process node is suspended in some scenarios. Can you please help me by providing some pointers to debug and resolve this issue. I applied your patch to my code. The issue is not solved with your patch. Thank you for helping me out. Hi Chetan, In our case we are observing this issue occasionally exact steps to recreate the issue are not known. I made changes to our process node as suggested by dave and with these changes trying to recreate the issue. Soon I will update my results and findings in this mail thread. Hi Sudhir, Actually we are facing same issue on vpp.2106. In our case "api-rx-ring" is not getting called. in our usecase workers are calling some functions in main-thread context leading to RPC message and memory is allocated from api section. This leads to Api-segment memory is used fully and leads to crash. Hi Dave, Thank you very much for your inputs. I will try this out and get back to you with the results. On Mon, Feb 20, 2023 at 6:01 PM Dave Barach <vpp@...> wrote: Please try something like this, to eliminate the possibility that some bit of code is sending this process an event. It’s not a good idea to skip the vec_reset_length (event_data) step. while (1) { uword event_type, * event_data = 0; int i; vlib_process_wait_for_event_or_clock (vm, 1e-2 /* 10 ms */); event_type = vlib_process_get_events (vm, &event_data); switch (event_type) { case ~0: /* handle timer expirations */ rtb_event_loop_run_once (); break; default: /* bug! */ ASSERT (0); } vec_reset_length(event_data); } Hi Dave, Thank you for your response and help. Please find the additional details below. We are creating a process node rtb-vpp-epoll-process to handle control plane events like interface add/delete, route add/delete. This process node waits for 10ms of time (Not Interested in any events ) once 10ms is expired it will process control plane events mentioned above.
code snippet looks like below
static uword rtb_vpp_epoll_process (vlib_main_t *vm, vlib_node_runtime_t *rt, vlib_frame_t *f) { ... ... while (1) { vlib_process_wait_for_event_or_clock (vm, 10e-3); vlib_process_get_events (vm, NULL);
rtb_event_loop_run_once(); <---- controlplane events handling } } ``` What we observed is that sometimes (when there is a high controlplane load like request to install more routes) "rtb-vpp-epoll-process" is suspended and not scheduled furever. this we found by using "show runtime rtb-vpp-epoll-process" (in "show runtime rtb-vpp-epoll-process" command output suspends counter is not incrementing.)
show runtime output in working case : ``` DBGvpp# show runtime rtb-vpp-epoll-process Name State Calls Vectors Suspends Clocks Vectors/Call rtb-vpp-epoll-process any wait 0 0 192246 1.91e6 0.00 DBGvpp#
DBGvpp# show runtime rtb-vpp-epoll-process Name State Calls Vectors Suspends Clocks Vectors/Call rtb-vpp-epoll-process any wait 0 0 193634 1.89e6 0.00 DBGvpp#
```
show runtime output in issue case : ``` Name State Calls Vectors Suspends Clocks Vectors/Call rtb-vpp-epoll-process any wait 0 0 81477 7.08e6 0.00 Name State Calls Vectors Suspends Clocks Vectors/Call rtb-vpp-epoll-process any wait 0 0 81477 7.08e6 0.00 ```
Other process nodes like lldp-process, ip4-neighbor-age-process, ip6-ra-process running without any issue. only "rtb-vpp-epoll-process" process node suspended forever. Please let me know if any additional information is required.
Hi Jinsh, Thanks for pointing me to the issue you faced. The issue I am facing looks similar. I will verify with the given patch. Thanks and Regards,
On Sun, Feb 19, 2023 at 6:19 AM jinsh11 <jinsh11@...> wrote: HI: bfd process node stop running. I raised this issue,
NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com
NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com
NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com
NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com
NOTICE TO
RECIPIENT This e-mail message and any attachments are confidential and may be
privileged. If you received this e-mail in error, any review, use,
dissemination, distribution, or copying of this e-mail is strictly
prohibited. Please notify us immediately of the error by return e-mail and
please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com
|
|

Dave Wallace
Unfortunately, VPP
21.10 is an unsupported version. Even if someone had a fix for
the problem, there is no CI available to verify any submissions to
the stable/2110 branch.
If so desired (e.g. to share a fix), one may push a gerrit change
to an unsupported stable branch, but it will not be merged. The
best practice is to fix the issue on 'master', then cherry-pick to
stable branches.
Has the issue been reproduced on 'master'?
Thanks,
-daw-
On 4/5/2023 12:15 PM, chetan bhasin
wrote:
toggle quoted message
Show quoted text
Hello Everyone,
Any work around/patch for the below issue??
Thanks,
Chetan
Hi Dave,
we are using VPP Version 21.10.
Thanks and regards,
Sudhir
On Fri, Mar 10, 2023
at 5:31 PM Dave Barach < vpp@...>
wrote:
I should have had the sense
to ask this earlier: which version of vpp are
you using?
The line number in your
debug snippet is more than 100 lines off from
master/latest. The timer wheel code has been
relatively untouched, but there have been
several important fixes over the years...
D.
diff
--git a/src/vlib/main.c b/src/vlib/main.c
index af0fcd1cb..55c231d8b 100644
--- a/src/vlib/main.c
+++ b/src/vlib/main.c
@@ -1490,6 +1490,9 @@
dispatch_suspended_process (vlib_main_t * vm,
}
else
{
+ if (strcmp((char *)node->name,
"rtb-vpp-epoll-process") == 0) {
+ ASSERT(0);
+ }
Hi Dave,
Please excuse my
delayed response. It took some time to
recreate this issue.
I made changes to our
process node as per your suggestion. now
our process node code looks like this
while (1) {
vlib_process_wait_for_event_or_clock (vm,
RTB_VPP_EPOLL_PROCESS_NODE_TIMER);
event_type =
vlib_process_get_events (vm,
&event_data);
vec_reset_length(event_data);
switch (event_type) {
case ~0: /* handle timer
expirations */
rtb_event_loop_run_once
();
break;
default: /* bug! */
ASSERT (0);
}
}
After these changes we
didn't observe any assertions but we hit
the process node suspend issue. with this
it is clear other than time out we are not
getting any other events.
In the issue state I
have collected vlib_process node
(rtb_vpp_epoll_process) flags value and it
seems to be correct (flags = 11).
Please find the
vlib_process_t and vlib_node_t data
structure values collected in the issue
state below.
$38 = {
cacheline0 = 0x7f9b2da50380
"\200~\274+\233\177",
node_runtime = {
cacheline0 = 0x7f9b2da50380
"\200~\274+\233\177",
function = 0x7f9b2bbc7e80
<rtb_vpp_epoll_process>,
errors = 0x7f9b3076a560,
clocks_since_last_overflow = 0,
max_clock = 3785970526,
max_clock_n = 0,
calls_since_last_overflow = 0,
vectors_since_last_overflow = 0,
next_frame_index = 1668,
node_index = 437,
input_main_loops_per_call = 0,
main_loop_count_last_dispatch =
4147405645,
main_loop_vector_stats = {0, 0},
flags = 0,
state = 0,
n_next_nodes = 0,
cached_next_index = 0,
thread_index = 0,
runtime_data = 0x7f9b2da503c6 ""
},
return_longjmp = {
regs = {94502584873984,
140304430422064, 140306731463680,
94502584874048, 94502640552512, 0,
140304430422032, 140306703608766}
},
resume_longjmp = {
regs = {94502584873984,
140304161734368, 140306731463680,
94502584874048, 94502640552512, 0,
140304161734272, 140304430441787}
},
flags = 11,
log2_n_stack_bytes = 16,
suspended_process_frame_index = 0,
n_suspends = 0,
pending_event_data_by_type_index =
0x7f9b307b8310,
non_empty_event_type_bitmap =
0x7f9b307b8390,
one_time_event_type_bitmap = 0x0,
event_type_index_by_type_opaque =
0x7f9b2dab8bd8,
event_type_pool = 0x7f9b2dcb5978,
resume_clock_interval = 1000,
stop_timer_handle = 3098,
output_function = 0x0,
output_function_arg = 0,
stack = 0x7f9b1bb78000
}
$17 = {
function = 0x7f9b2bbc7e80
<rtb_vpp_epoll_process>,
name = 0x7f9b3076a3f0
"rtb-vpp-epoll-process",
name_elog_string = 11783,
stats_total = {
calls = 0,
vectors = 0,
clocks = 1971244932732,
suspends = 6847366,
max_clock = 3785970526,
max_clock_n = 0
},
stats_last_clear = {
calls = 0,
vectors = 0,
clocks = 0,
suspends = 0,
max_clock = 0,
max_clock_n = 0
},
type = VLIB_NODE_TYPE_PROCESS,
index = 437,
runtime_index = 40,
runtime_data = 0x0,
flags = 0,
state = 0 '\000',
runtime_data_bytes = 0 '\000',
protocol_hint = 0 '\000',
n_errors = 0,
scalar_size = 0,
vector_size = 0,
error_heap_handle = 0,
error_heap_index = 0,
error_counters = 0x0,
next_node_names = 0x7f9b3076a530,
next_nodes = 0x0,
sibling_of = 0x0,
sibling_bitmap = 0x0,
n_vectors_by_next_node = 0x0,
next_slot_by_node = 0x0,
prev_node_bitmap = 0x0,
owner_node_index = 4294967295,
owner_next_index = 4294967295,
format_buffer = 0x0,
unformat_buffer = 0x0,
format_trace = 0x0,
validate_frame = 0x0,
state_string = 0x0,
node_fn_registrations = 0x0
}
I added an assert
statement before clearing VLIB_PROCESS_IS_RUNNING flag
in dispatch_suspended_process
function.
But this assert
statement is not hitting.
diff --git
a/src/vlib/main.c b/src/vlib/main.c
index af0fcd1cb..55c231d8b 100644
--- a/src/vlib/main.c
+++ b/src/vlib/main.c
@@ -1490,6 +1490,9 @@
dispatch_suspended_process (vlib_main_t *
vm,
}
else
{
+ if (strcmp((char
*)node->name, "rtb-vpp-epoll-process")
== 0) {
+ ASSERT(0);
+ }
p->flags &=
~VLIB_PROCESS_IS_RUNNING;
pool_put_index
(nm->suspended_process_frames,
p->suspended_process_frame_index);
I am not able to figure
out why this process node is suspended in
some scenarios. Can you please help me by
providing some pointers to debug and
resolve this issue.
I applied your patch to
my code. The issue is not solved with your
patch. Thank you for helping me out.
Hi Chetan,
In our case we are
observing this issue occasionally
exact steps to recreate the issue are
not known.
I made
changes to our process node as
suggested by dave and with these
changes trying to recreate the issue.
Soon I will update
my results and findings in this mail
thread.
Hi Sudhir,
Actually we are
facing same issue on vpp.2106.
In our case
"api-rx-ring" is not getting
called.
in our usecase
workers are calling some functions
in main-thread context leading to
RPC message and memory is
allocated from api section.
This leads to
Api-segment memory is used fully
and leads to crash.
Hi Dave,
Thank you
very much for your inputs. I
will try this out and get
back to you with the
results.
On Mon,
Feb 20, 2023 at 6:01 PM Dave
Barach <vpp@...>
wrote:
Please try something like
this, to eliminate
the possibility that
some bit of code is
sending this process
an event. It’s not a
good idea to skip
the vec_reset_length
(event_data) step.
while (1)
{
uword event_type, *
event_data = 0;
int i;
vlib_process_wait_for_event_or_clock
(vm, 1e-2 /* 10 ms
*/);
event_type =
vlib_process_get_events
(vm,
&event_data);
switch (event_type) {
case ~0: /* handle timer
expirations */
rtb_event_loop_run_once
();
break;
default: /* bug! */
ASSERT (0);
}
vec_reset_length(event_data);
}
Hi
Dave,
Thank you for your
response and help.
Please
find the
additional
details below.
We are creating a
process node rtb-vpp-epoll-process
to handle
control plane
events like
interface
add/delete, route
add/delete.
This process node
waits for 10ms
of time (Not
Interested in any
events ) once 10ms
is expired it will
process control
plane events
mentioned above.
code snippet looks
like below
static
uword
rtb_vpp_epoll_process (vlib_main_t *vm,
vlib_node_runtime_t
*rt,
vlib_frame_t
*f)
{
...
...
while (1) {
vlib_process_wait_for_event_or_clock
(vm, 10e-3);
vlib_process_get_events
(vm, NULL);
rtb_event_loop_run_once();
<----
controlplane
events handling
}
}
```
What
we observed is
that sometimes
(when there is a
high controlplane
load like request
to install more
routes)
"rtb-vpp-epoll-process"
is suspended and
not scheduled
furever. this we
found by using
"show runtime
rtb-vpp-epoll-process"
(in "show runtime
rtb-vpp-epoll-process" command output suspends counter is not
incrementing.)
show runtime
output in
working case :
```
DBGvpp# show
runtime
rtb-vpp-epoll-process
Name
State
Calls
Vectors Suspends
Clocks
Vectors/Call
rtb-vpp-epoll-process any wait 0
0 192246
1.91e6
0.00
DBGvpp#
DBGvpp# show
runtime
rtb-vpp-epoll-process
Name
State
Calls
Vectors Suspends
Clocks
Vectors/Call
rtb-vpp-epoll-process any wait 0
0 193634
1.89e6
0.00
DBGvpp#
```
show
runtime output
in issue case :
```
Name State Calls Vectors Suspends Clocks Vectors/Call
rtb-vpp-epoll-process any wait 0 0 81477 7.08e6 0.00
Name State Calls Vectors Suspends Clocks Vectors/Call
rtb-vpp-epoll-process any wait 0 0 81477 7.08e6 0.00
```
Other process
nodes like lldp-process,
ip4-neighbor-age-process, ip6-ra-process running without any issue. only
"rtb-vpp-epoll-process" process node suspended forever.
Please
let me know if any
additional
information is
required.
Hi Jinsh,
Thanks for
pointing me to the
issue you faced.
The issue I am
facing looks
similar.
I will verify with
the given patch.
Thanks and Regards,
On
Sun, Feb 19, 2023
at 6:19 AM jinsh11
<jinsh11@...>
wrote:
HI:
bfd
process node
stop running. I
raised this
issue,
NOTICE
TO RECIPIENT This
e-mail message and
any attachments are
confidential and may
be privileged. If
you received this
e-mail in error, any
review, use,
dissemination,
distribution, or
copying of this
e-mail is strictly
prohibited. Please
notify us
immediately of the
error by return
e-mail and please
delete this message
from your system.
For more information
about Rtbrick,
please visit us at www.rtbrick.com
NOTICE TO
RECIPIENT This e-mail message
and any attachments are
confidential and may be
privileged. If you received
this e-mail in error, any
review, use, dissemination,
distribution, or copying of
this e-mail is strictly
prohibited. Please notify us
immediately of the error by
return e-mail and please
delete this message from your
system. For more information
about Rtbrick, please visit us
at www.rtbrick.com
NOTICE TO RECIPIENT
This e-mail message and any attachments
are confidential and may be
privileged. If you received this e-mail
in error, any review, use,
dissemination, distribution, or copying
of this e-mail is strictly
prohibited. Please notify us immediately
of the error by return e-mail and please
delete this message from your system.
For more information about Rtbrick,
please visit us at www.rtbrick.com
NOTICE TO RECIPIENT This
e-mail message and any attachments are
confidential and may be privileged. If you
received this e-mail in error, any review,
use, dissemination, distribution, or copying
of this e-mail is strictly
prohibited. Please notify us immediately of
the error by return e-mail and please delete
this message from your system. For more
information about Rtbrick, please visit us
at www.rtbrick.com
NOTICE TO
RECIPIENT This e-mail message and any attachments are
confidential and may be
privileged. If you received this e-mail in error, any
review, use,
dissemination, distribution, or copying of this e-mail
is strictly
prohibited. Please notify us immediately of the error
by return e-mail and
please delete this message from your system. For more
information about Rtbrick, please visit us at www.rtbrick.com
|
|
Hi Dave , unfortunately we have not yet integrated master code with our product. So we are not able to test the issue with latest master code.
While testing the issue I made below change to our process node after that this issue is not seen. while (1) {
vlib_process_wait_for_event_or_clock (vm, RTB_VPP_EPOLL_PROCESS_NODE_TIMER); event_type = vlib_process_get_events (vm, &event_data); vec_reset_length(event_data);
switch (event_type) { case ~0: /* handle timer expirations */ rtb_event_loop_run_once (); vlib_process_suspend (vm, RTB_VPP_EPOLL_PROCESS_NODE_TIMER); <--- added newly break;
default: /* bug! */ clib_warning ("BUG: event type 0x%wx", event_type); ASSERT (0); } }
This is just workaround to our problem, if by any chance our process node is missing in nm->data_from_advancing_timing_wheel list vlib_process_suspend will add process node to data_from_advancing_timing_wheel so that our process node will be scheduled again.
Thanks and Regards, Sudhir
Unfortunately, VPP
21.10 is an unsupported version. Even if someone had a fix for
the problem, there is no CI available to verify any submissions to
the stable/2110 branch.
If so desired (e.g. to share a fix), one may push a gerrit change
to an unsupported stable branch, but it will not be merged. The
best practice is to fix the issue on 'master', then cherry-pick to
stable branches.
Has the issue been reproduced on 'master'?
Thanks,
-daw-
On 4/5/2023 12:15 PM, chetan bhasin
wrote:
Hello Everyone,
Any work around/patch for the below issue??
Thanks,
Chetan
Hi Dave,
we are using VPP Version 21.10.
Thanks and regards,
Sudhir
On Fri, Mar 10, 2023
at 5:31 PM Dave Barach < vpp@...>
wrote:
I should have had the sense
to ask this earlier: which version of vpp are
you using?
The line number in your
debug snippet is more than 100 lines off from
master/latest. The timer wheel code has been
relatively untouched, but there have been
several important fixes over the years...
D.
diff
--git a/src/vlib/main.c b/src/vlib/main.c
index af0fcd1cb..55c231d8b 100644
--- a/src/vlib/main.c
+++ b/src/vlib/main.c
@@ -1490,6 +1490,9 @@
dispatch_suspended_process (vlib_main_t * vm,
}
else
{
+ if (strcmp((char *)node->name,
"rtb-vpp-epoll-process") == 0) {
+ ASSERT(0);
+ }
Hi Dave,
Please excuse my
delayed response. It took some time to
recreate this issue.
I made changes to our
process node as per your suggestion. now
our process node code looks like this
while (1) {
vlib_process_wait_for_event_or_clock (vm,
RTB_VPP_EPOLL_PROCESS_NODE_TIMER);
event_type =
vlib_process_get_events (vm,
&event_data);
vec_reset_length(event_data);
switch (event_type) {
case ~0: /* handle timer
expirations */
rtb_event_loop_run_once
();
break;
default: /* bug! */
ASSERT (0);
}
}
After these changes we
didn't observe any assertions but we hit
the process node suspend issue. with this
it is clear other than time out we are not
getting any other events.
In the issue state I
have collected vlib_process node
(rtb_vpp_epoll_process) flags value and it
seems to be correct (flags = 11).
Please find the
vlib_process_t and vlib_node_t data
structure values collected in the issue
state below.
$38 = {
cacheline0 = 0x7f9b2da50380
"\200~\274+\233\177",
node_runtime = {
cacheline0 = 0x7f9b2da50380
"\200~\274+\233\177",
function = 0x7f9b2bbc7e80
<rtb_vpp_epoll_process>,
errors = 0x7f9b3076a560,
clocks_since_last_overflow = 0,
max_clock = 3785970526,
max_clock_n = 0,
calls_since_last_overflow = 0,
vectors_since_last_overflow = 0,
next_frame_index = 1668,
node_index = 437,
input_main_loops_per_call = 0,
main_loop_count_last_dispatch =
4147405645,
main_loop_vector_stats = {0, 0},
flags = 0,
state = 0,
n_next_nodes = 0,
cached_next_index = 0,
thread_index = 0,
runtime_data = 0x7f9b2da503c6 ""
},
return_longjmp = {
regs = {94502584873984,
140304430422064, 140306731463680,
94502584874048, 94502640552512, 0,
140304430422032, 140306703608766}
},
resume_longjmp = {
regs = {94502584873984,
140304161734368, 140306731463680,
94502584874048, 94502640552512, 0,
140304161734272, 140304430441787}
},
flags = 11,
log2_n_stack_bytes = 16,
suspended_process_frame_index = 0,
n_suspends = 0,
pending_event_data_by_type_index =
0x7f9b307b8310,
non_empty_event_type_bitmap =
0x7f9b307b8390,
one_time_event_type_bitmap = 0x0,
event_type_index_by_type_opaque =
0x7f9b2dab8bd8,
event_type_pool = 0x7f9b2dcb5978,
resume_clock_interval = 1000,
stop_timer_handle = 3098,
output_function = 0x0,
output_function_arg = 0,
stack = 0x7f9b1bb78000
}
$17 = {
function = 0x7f9b2bbc7e80
<rtb_vpp_epoll_process>,
name = 0x7f9b3076a3f0
"rtb-vpp-epoll-process",
name_elog_string = 11783,
stats_total = {
calls = 0,
vectors = 0,
clocks = 1971244932732,
suspends = 6847366,
max_clock = 3785970526,
max_clock_n = 0
},
stats_last_clear = {
calls = 0,
vectors = 0,
clocks = 0,
suspends = 0,
max_clock = 0,
max_clock_n = 0
},
type = VLIB_NODE_TYPE_PROCESS,
index = 437,
runtime_index = 40,
runtime_data = 0x0,
flags = 0,
state = 0 '\000',
runtime_data_bytes = 0 '\000',
protocol_hint = 0 '\000',
n_errors = 0,
scalar_size = 0,
vector_size = 0,
error_heap_handle = 0,
error_heap_index = 0,
error_counters = 0x0,
next_node_names = 0x7f9b3076a530,
next_nodes = 0x0,
sibling_of = 0x0,
sibling_bitmap = 0x0,
n_vectors_by_next_node = 0x0,
next_slot_by_node = 0x0,
prev_node_bitmap = 0x0,
owner_node_index = 4294967295,
owner_next_index = 4294967295,
format_buffer = 0x0,
unformat_buffer = 0x0,
format_trace = 0x0,
validate_frame = 0x0,
state_string = 0x0,
node_fn_registrations = 0x0
}
I added an assert
statement before clearing VLIB_PROCESS_IS_RUNNING flag
in dispatch_suspended_process
function.
But this assert
statement is not hitting.
diff --git
a/src/vlib/main.c b/src/vlib/main.c
index af0fcd1cb..55c231d8b 100644
--- a/src/vlib/main.c
+++ b/src/vlib/main.c
@@ -1490,6 +1490,9 @@
dispatch_suspended_process (vlib_main_t *
vm,
}
else
{
+ if (strcmp((char
*)node->name, "rtb-vpp-epoll-process")
== 0) {
+ ASSERT(0);
+ }
p->flags &=
~VLIB_PROCESS_IS_RUNNING;
pool_put_index
(nm->suspended_process_frames,
p->suspended_process_frame_index);
I am not able to figure
out why this process node is suspended in
some scenarios. Can you please help me by
providing some pointers to debug and
resolve this issue.
I applied your patch to
my code. The issue is not solved with your
patch. Thank you for helping me out.
Hi Chetan,
In our case we are
observing this issue occasionally
exact steps to recreate the issue are
not known.
I made
changes to our process node as
suggested by dave and with these
changes trying to recreate the issue.
Soon I will update
my results and findings in this mail
thread.
Hi Sudhir,
Actually we are
facing same issue on vpp.2106.
In our case
"api-rx-ring" is not getting
called.
in our usecase
workers are calling some functions
in main-thread context leading to
RPC message and memory is
allocated from api section.
This leads to
Api-segment memory is used fully
and leads to crash.
Hi Dave,
Thank you
very much for your inputs. I
will try this out and get
back to you with the
results.
On Mon,
Feb 20, 2023 at 6:01 PM Dave
Barach <vpp@...>
wrote:
Please try something like
this, to eliminate
the possibility that
some bit of code is
sending this process
an event. It’s not a
good idea to skip
the vec_reset_length
(event_data) step.
while (1)
{
uword event_type, *
event_data = 0;
int i;
vlib_process_wait_for_event_or_clock
(vm, 1e-2 /* 10 ms
*/);
event_type =
vlib_process_get_events
(vm,
&event_data);
switch (event_type) {
case ~0: /* handle timer
expirations */
rtb_event_loop_run_once
();
break;
default: /* bug! */
ASSERT (0);
}
vec_reset_length(event_data);
}
Hi
Dave,
Thank you for your
response and help.
Please
find the
additional
details below.
We are creating a
process node rtb-vpp-epoll-process
to handle
control plane
events like
interface
add/delete, route
add/delete.
This process node
waits for 10ms
of time (Not
Interested in any
events ) once 10ms
is expired it will
process control
plane events
mentioned above.
code snippet looks
like below
static
uword
rtb_vpp_epoll_process (vlib_main_t *vm,
vlib_node_runtime_t
*rt,
vlib_frame_t
*f)
{
...
...
while (1) {
vlib_process_wait_for_event_or_clock
(vm, 10e-3);
vlib_process_get_events
(vm, NULL);
rtb_event_loop_run_once();
<----
controlplane
events handling
}
}
```
What
we observed is
that sometimes
(when there is a
high controlplane
load like request
to install more
routes)
"rtb-vpp-epoll-process"
is suspended and
not scheduled
furever. this we
found by using
"show runtime
rtb-vpp-epoll-process"
(in "show runtime
rtb-vpp-epoll-process" command output suspends counter is not
incrementing.)
show runtime
output in
working case :
```
DBGvpp# show
runtime
rtb-vpp-epoll-process
Name
State
Calls
Vectors Suspends
Clocks
Vectors/Call
rtb-vpp-epoll-process any wait 0
0 192246
1.91e6
0.00
DBGvpp#
DBGvpp# show
runtime
rtb-vpp-epoll-process
Name
State
Calls
Vectors Suspends
Clocks
Vectors/Call
rtb-vpp-epoll-process any wait 0
0 193634
1.89e6
0.00
DBGvpp#
```
show
runtime output
in issue case :
```
Name State Calls Vectors Suspends Clocks Vectors/Call
rtb-vpp-epoll-process any wait 0 0 81477 7.08e6 0.00
Name State Calls Vectors Suspends Clocks Vectors/Call
rtb-vpp-epoll-process any wait 0 0 81477 7.08e6 0.00
```
Other process
nodes like lldp-process,
ip4-neighbor-age-process, ip6-ra-process running without any issue. only
"rtb-vpp-epoll-process" process node suspended forever.
Please
let me know if any
additional
information is
required.
Hi Jinsh,
Thanks for
pointing me to the
issue you faced.
The issue I am
facing looks
similar.
I will verify with
the given patch.
Thanks and Regards,
On
Sun, Feb 19, 2023
at 6:19 AM jinsh11
<jinsh11@...>
wrote:
HI:
bfd
process node
stop running. I
raised this
issue,
NOTICE
TO RECIPIENT This
e-mail message and
any attachments are
confidential and may
be privileged. If
you received this
e-mail in error, any
review, use,
dissemination,
distribution, or
copying of this
e-mail is strictly
prohibited. Please
notify us
immediately of the
error by return
e-mail and please
delete this message
from your system.
For more information
about Rtbrick,
please visit us at www.rtbrick.com
NOTICE TO
RECIPIENT This e-mail message
and any attachments are
confidential and may be
privileged. If you received
this e-mail in error, any
review, use, dissemination,
distribution, or copying of
this e-mail is strictly
prohibited. Please notify us
immediately of the error by
return e-mail and please
delete this message from your
system. For more information
about Rtbrick, please visit us
at www.rtbrick.com
NOTICE TO RECIPIENT
This e-mail message and any attachments
are confidential and may be
privileged. If you received this e-mail
in error, any review, use,
dissemination, distribution, or copying
of this e-mail is strictly
prohibited. Please notify us immediately
of the error by return e-mail and please
delete this message from your system.
For more information about Rtbrick,
please visit us at www.rtbrick.com
NOTICE TO RECIPIENT This
e-mail message and any attachments are
confidential and may be privileged. If you
received this e-mail in error, any review,
use, dissemination, distribution, or copying
of this e-mail is strictly
prohibited. Please notify us immediately of
the error by return e-mail and please delete
this message from your system. For more
information about Rtbrick, please visit us
at www.rtbrick.com
NOTICE TO
RECIPIENT This e-mail message and any attachments are
confidential and may be
privileged. If you received this e-mail in error, any
review, use,
dissemination, distribution, or copying of this e-mail
is strictly
prohibited. Please notify us immediately of the error
by return e-mail and
please delete this message from your system. For more
information about Rtbrick, please visit us at www.rtbrick.com
NOTICE TO
RECIPIENT This e-mail message and any attachments are confidential and may be
privileged. If you received this e-mail in error, any review, use,
dissemination, distribution, or copying of this e-mail is strictly
prohibited. Please notify us immediately of the error by return e-mail and
please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com
|
|

Dave Wallace
Sudhir,
Good to hear that you have a workaround. Hopefully you will be
able to move your product forward to a recent release soon.
Thanks,
-daw-
On 4/5/2023 11:43 PM, Sudhir CR via
lists.fd.io wrote:
toggle quoted message
Show quoted text
Hi Dave ,
unfortunately we have not yet integrated master code with
our product. So we are not able to test the issue with
latest master code.
While testing the issue I made below change to our process
node after that this issue is not seen.
while (1) {
vlib_process_wait_for_event_or_clock (vm,
RTB_VPP_EPOLL_PROCESS_NODE_TIMER);
event_type = vlib_process_get_events (vm,
&event_data);
vec_reset_length(event_data);
switch (event_type) {
case ~0: /* handle timer expirations */
rtb_event_loop_run_once ();
vlib_process_suspend (vm,
RTB_VPP_EPOLL_PROCESS_NODE_TIMER); <--- added newly
break;
default: /* bug! */
clib_warning ("BUG: event type 0x%wx",
event_type);
ASSERT (0);
}
}
This is just workaround to our problem, if by any chance our
process node is missing in nm->data_from_advancing_timing_wheel
list
vlib_process_suspend will add process node to data_from_advancing_timing_wheel
so that our process node will be scheduled again.
Thanks and Regards,
Sudhir
Unfortunately,
VPP 21.10 is an unsupported version. Even if someone had
a fix for the problem, there is no CI available to verify
any submissions to the stable/2110 branch.
If so desired (e.g. to share a fix), one may push a gerrit
change to an unsupported stable branch, but it will not be
merged. The best practice is to fix the issue on
'master', then cherry-pick to stable branches.
Has the issue been reproduced on 'master'?
Thanks,
-daw-
On 4/5/2023 12:15 PM, chetan bhasin wrote:
Hello Everyone,
Any work around/patch for the below
issue??
Thanks,
Chetan
Hi Dave,
we are using VPP Version 21.10.
Thanks and regards,
Sudhir
On Fri, Mar
10, 2023 at 5:31 PM Dave Barach < vpp@...>
wrote:
I should have had
the sense to ask this earlier: which
version of vpp are you using?
The line number in
your debug snippet is more than 100
lines off from master/latest. The
timer wheel code has been relatively
untouched, but there have been several
important fixes over the years...
D.
diff --git
a/src/vlib/main.c b/src/vlib/main.c
index af0fcd1cb..55c231d8b 100644
--- a/src/vlib/main.c
+++ b/src/vlib/main.c
@@ -1490,6 +1490,9 @@
dispatch_suspended_process
(vlib_main_t * vm,
}
else
{
+ if (strcmp((char
*)node->name,
"rtb-vpp-epoll-process") == 0) {
+ ASSERT(0);
+ }
Hi Dave,
Please excuse
my delayed response. It took some
time to recreate this issue.
I made changes
to our process node as per your
suggestion. now our process node
code looks like this
while (1) {
vlib_process_wait_for_event_or_clock
(vm,
RTB_VPP_EPOLL_PROCESS_NODE_TIMER);
event_type =
vlib_process_get_events (vm,
&event_data);
vec_reset_length(event_data);
switch (event_type) {
case ~0: /* handle
timer expirations */
rtb_event_loop_run_once ();
break;
default: /* bug! */
ASSERT (0);
}
}
After these
changes we didn't observe any
assertions but we hit the process
node suspend issue. with this it
is clear other than time out we
are not getting any other events.
In the issue
state I have collected
vlib_process node
(rtb_vpp_epoll_process) flags
value and it seems to be correct
(flags = 11).
Please find the
vlib_process_t and vlib_node_t
data structure values collected in
the issue state below.
$38 = {
cacheline0 = 0x7f9b2da50380
"\200~\274+\233\177",
node_runtime = {
cacheline0 = 0x7f9b2da50380
"\200~\274+\233\177",
function = 0x7f9b2bbc7e80
<rtb_vpp_epoll_process>,
errors = 0x7f9b3076a560,
clocks_since_last_overflow =
0,
max_clock = 3785970526,
max_clock_n = 0,
calls_since_last_overflow = 0,
vectors_since_last_overflow =
0,
next_frame_index = 1668,
node_index = 437,
input_main_loops_per_call = 0,
main_loop_count_last_dispatch
= 4147405645,
main_loop_vector_stats = {0,
0},
flags = 0,
state = 0,
n_next_nodes = 0,
cached_next_index = 0,
thread_index = 0,
runtime_data = 0x7f9b2da503c6
""
},
return_longjmp = {
regs = {94502584873984,
140304430422064, 140306731463680,
94502584874048, 94502640552512, 0,
140304430422032, 140306703608766}
},
resume_longjmp = {
regs = {94502584873984,
140304161734368, 140306731463680,
94502584874048, 94502640552512, 0,
140304161734272, 140304430441787}
},
flags = 11,
log2_n_stack_bytes = 16,
suspended_process_frame_index =
0,
n_suspends = 0,
pending_event_data_by_type_index
= 0x7f9b307b8310,
non_empty_event_type_bitmap =
0x7f9b307b8390,
one_time_event_type_bitmap =
0x0,
event_type_index_by_type_opaque
= 0x7f9b2dab8bd8,
event_type_pool =
0x7f9b2dcb5978,
resume_clock_interval = 1000,
stop_timer_handle = 3098,
output_function = 0x0,
output_function_arg = 0,
stack = 0x7f9b1bb78000
}
$17 = {
function = 0x7f9b2bbc7e80
<rtb_vpp_epoll_process>,
name = 0x7f9b3076a3f0
"rtb-vpp-epoll-process",
name_elog_string = 11783,
stats_total = {
calls = 0,
vectors = 0,
clocks = 1971244932732,
suspends = 6847366,
max_clock = 3785970526,
max_clock_n = 0
},
stats_last_clear = {
calls = 0,
vectors = 0,
clocks = 0,
suspends = 0,
max_clock = 0,
max_clock_n = 0
},
type = VLIB_NODE_TYPE_PROCESS,
index = 437,
runtime_index = 40,
runtime_data = 0x0,
flags = 0,
state = 0 '\000',
runtime_data_bytes = 0 '\000',
protocol_hint = 0 '\000',
n_errors = 0,
scalar_size = 0,
vector_size = 0,
error_heap_handle = 0,
error_heap_index = 0,
error_counters = 0x0,
next_node_names = 0x7f9b3076a530,
next_nodes = 0x0,
sibling_of = 0x0,
sibling_bitmap = 0x0,
n_vectors_by_next_node = 0x0,
next_slot_by_node = 0x0,
prev_node_bitmap = 0x0,
owner_node_index = 4294967295,
owner_next_index = 4294967295,
format_buffer = 0x0,
unformat_buffer = 0x0,
format_trace = 0x0,
validate_frame = 0x0,
state_string = 0x0,
node_fn_registrations = 0x0
}
I added an
assert statement before clearing
VLIB_PROCESS_IS_RUNNING flag
in dispatch_suspended_process
function.
But this
assert statement is not hitting.
diff --git
a/src/vlib/main.c
b/src/vlib/main.c
index af0fcd1cb..55c231d8b 100644
--- a/src/vlib/main.c
+++ b/src/vlib/main.c
@@ -1490,6 +1490,9 @@
dispatch_suspended_process
(vlib_main_t * vm,
}
else
{
+ if (strcmp((char
*)node->name,
"rtb-vpp-epoll-process") == 0) {
+ ASSERT(0);
+ }
p->flags &=
~VLIB_PROCESS_IS_RUNNING;
pool_put_index
(nm->suspended_process_frames,
p->suspended_process_frame_index);
I am not able
to figure out why this process
node is suspended in some
scenarios. Can you please help me
by providing some pointers to
debug and resolve this issue.
I applied your
patch to my code. The issue is not
solved with your patch. Thank you
for helping me out.
Hi Chetan,
In our case
we are observing this issue
occasionally exact steps to
recreate the issue are not
known.
I
made changes to our process
node as suggested by dave and
with these changes trying to
recreate the issue.
Soon I will
update my results and findings
in this mail thread.
Hi
Sudhir,
Actually
we are facing same issue
on vpp.2106.
In our
case "api-rx-ring" is not
getting called.
in our
usecase workers are
calling some functions in
main-thread context
leading to RPC message and
memory is allocated from
api section.
This
leads to Api-segment
memory is used fully and
leads to crash.
Hi
Dave,
Thank
you very much for
your inputs. I will
try this out and get
back to you with the
results.
On
Mon, Feb 20, 2023 at
6:01 PM Dave Barach
<vpp@...> wrote:
Please try something like
this, to
eliminate the
possibility
that some bit
of code is
sending this
process an
event. It’s
not a good
idea to skip
the
vec_reset_length
(event_data)
step.
while (1)
{
uword event_type, *
event_data =
0;
int i;
vlib_process_wait_for_event_or_clock
(vm, 1e-2 /*
10 ms */);
event_type =
vlib_process_get_events
(vm,
&event_data);
switch (event_type) {
case ~0: /* handle timer
expirations */
rtb_event_loop_run_once
();
break;
default: /* bug! */
ASSERT (0);
}
vec_reset_length(event_data);
}
Hi
Dave,
Thank you for
your response
and help.
Please
find the
additional
details below.
We are
creating a
process node rtb-vpp-epoll-process
to handle
control plane
events like
interface
add/delete,
route
add/delete.
This process
node waits for
10ms of
time (Not
Interested in
any events )
once 10ms is
expired it
will process
control plane
events
mentioned
above.
code snippet
looks like
below
static
uword
rtb_vpp_epoll_process (vlib_main_t *vm,
vlib_node_runtime_t
*rt,
vlib_frame_t
*f)
{
...
...
while (1)
{
vlib_process_wait_for_event_or_clock
(vm, 10e-3);
vlib_process_get_events
(vm, NULL);
rtb_event_loop_run_once();
<----
controlplane
events
handling
}
}
```
What
we observed is
that sometimes
(when there is
a high
controlplane
load like
request to
install more
routes)
"rtb-vpp-epoll-process"
is suspended
and not
scheduled
furever. this
we found by
using "show
runtime
rtb-vpp-epoll-process"
(in "show runtime
rtb-vpp-epoll-process" command output suspends counter is not
incrementing.)
show
runtime output
in working
case :
```
DBGvpp# show
runtime
rtb-vpp-epoll-process
Name
State
Calls
Vectors
Suspends
Clocks
Vectors/Call
rtb-vpp-epoll-process any wait 0
0 192246
1.91e6
0.00
DBGvpp#
DBGvpp# show
runtime
rtb-vpp-epoll-process
Name
State
Calls
Vectors
Suspends
Clocks
Vectors/Call
rtb-vpp-epoll-process any wait 0
0 193634
1.89e6
0.00
DBGvpp#
```
show
runtime output
in issue case
:
```
Name State Calls Vectors Suspends Clocks Vectors/Call
rtb-vpp-epoll-process any wait 0 0 81477 7.08e6 0.00
Name State Calls Vectors Suspends Clocks Vectors/Call
rtb-vpp-epoll-process any wait 0 0 81477 7.08e6 0.00
```
Other process
nodes like lldp-process,
ip4-neighbor-age-process, ip6-ra-process running without any issue. only
"rtb-vpp-epoll-process" process node suspended forever.
Please
let me know if
any additional
information is
required.
Hi Jinsh,
Thanks for
pointing me to
the issue you
faced. The
issue I am
facing looks
similar.
I will verify
with the given
patch.
Thanks and
Regards,
On
Sun, Feb 19,
2023 at 6:19
AM jinsh11
<jinsh11@...>
wrote:
HI:
bfd process node stop running. I raised
this issue,
NOTICE
TO RECIPIENT
This e-mail
message and
any
attachments
are
confidential
and may be
privileged. If
you received
this e-mail in
error, any
review, use,
dissemination,
distribution,
or copying of
this e-mail is
strictly
prohibited. Please
notify us
immediately of
the error by
return e-mail
and please
delete this
message from
your system.
For more
information
about Rtbrick,
please visit
us at www.rtbrick.com
NOTICE
TO RECIPIENT This
e-mail message and any
attachments are
confidential and may
be privileged. If you
received this e-mail
in error, any review,
use, dissemination,
distribution, or
copying of this e-mail
is strictly
prohibited. Please
notify us immediately
of the error by return
e-mail and please
delete this message
from your system. For
more information about
Rtbrick, please visit
us at www.rtbrick.com
NOTICE TO
RECIPIENT This e-mail message
and any attachments are
confidential and may be
privileged. If you received this
e-mail in error, any review,
use, dissemination,
distribution, or copying of this
e-mail is strictly
prohibited. Please notify us
immediately of the error by
return e-mail and please delete
this message from your system.
For more information about
Rtbrick, please visit us at www.rtbrick.com
NOTICE TO
RECIPIENT This e-mail message and
any attachments are confidential and
may be privileged. If you received
this e-mail in error, any review,
use, dissemination, distribution, or
copying of this e-mail is strictly
prohibited. Please notify us
immediately of the error by return
e-mail and please delete this
message from your system. For more
information about Rtbrick, please
visit us at www.rtbrick.com
NOTICE TO
RECIPIENT This e-mail message and any
attachments are confidential and may be
privileged. If you received this e-mail in
error, any review, use, dissemination,
distribution, or copying of this e-mail is
strictly prohibited. Please notify us
immediately of the error by return e-mail and
please delete this message from your system.
For more information about Rtbrick, please
visit us at www.rtbrick.com
NOTICE TO
RECIPIENT This e-mail message and any attachments are
confidential and may be
privileged. If you received this e-mail in error, any review,
use,
dissemination, distribution, or copying of this e-mail is
strictly
prohibited. Please notify us immediately of the error by
return e-mail and
please delete this message from your system. For more
information about Rtbrick, please visit us at www.rtbrick.com
|
|
Some of Netgate's customers have observed a similar issue on a build based on stable/2206 and Pim van Pelt has observed the same issue on some of his production routers using a build based on something close to the current master branch. In these cases, the api-rx-from-ring node stopped being executed.
My colleague Alexander Chernavin observed in https://lists.fd.io/g/vpp-dev/topic/96500275#22508 that when this issue arises, the process node for api-rx-from-ring has a timer ID in stop_timer_handle which is actually in use by another process node (the user_handle field in the timer entry points back to the process ID of a different process node). I haven't been able to reproduce this problem in a lab environment, but the going theory seems to be that when timers for process nodes are stopped or expired, the stop_timer_handle is not reset and contains a stale timer ID value until the node is dispatched and suspends itself again. So you could have a series of events like this:
process node A's timer is stopped (Its stop_timer_handle still refers to that timer) process node B is dispatched and suspends while adding a new timer. The new timer allocated has the ID formerly used by process A. process node C is dispatched and signals process node A with vlib_process_signal_event(). vlib_process_signal_event_helper() sees that the timer referred to by stop_timer_handle on process A is not free (it was allocated and is in use by process node B) and stops it. Process node B has no timer set now.
There is an existing patch in gerrit which seems aimed at addressing this - https://gerrit.fd.io/r/c/vpp/+/38400. It did not pass the checkstyle verify job when it was submitted, so I rebased it and fixed the formatting issues. That patch alone did not correct the issue on the router where Pim is able to reproduce the issue - it only resets the stop_timer_handle on a process node when vlib_process_signal_event_helper() explicitly stops a timer, it does not do the same for timers which were returned to the pool by TW (tw_timer_expire_timers_vec). I just uploaded this additional patch which attempts to deal with those timers - https://gerrit.fd.io/r/c/vpp/+/38621. Pim is currently testing it, so I can't say yet whether it actually solves the problem. But I thought I would share it so anyone else having this problem could try them out and give feedback on whether they help.
AFAICT both patches should be able to be applied to stable/2110.
Thanks, -Matt
toggle quoted message
Show quoted text
Sudhir,
Good to hear that you have a workaround. Hopefully you will be
able to move your product forward to a recent release soon.
Thanks,
-daw-
Hi Dave ,
unfortunately we have not yet integrated master code with
our product. So we are not able to test the issue with
latest master code.
While testing the issue I made below change to our process
node after that this issue is not seen.
while (1) {
vlib_process_wait_for_event_or_clock (vm,
RTB_VPP_EPOLL_PROCESS_NODE_TIMER);
event_type = vlib_process_get_events (vm,
&event_data);
vec_reset_length(event_data);
switch (event_type) {
case ~0: /* handle timer expirations */
rtb_event_loop_run_once ();
vlib_process_suspend (vm,
RTB_VPP_EPOLL_PROCESS_NODE_TIMER); <--- added newly
break;
default: /* bug! */
clib_warning ("BUG: event type 0x%wx",
event_type);
ASSERT (0);
}
}
This is just workaround to our problem, if by any chance our
process node is missing in nm->data_from_advancing_timing_wheel
list
vlib_process_suspend will add process node to data_from_advancing_timing_wheel
so that our process node will be scheduled again.
Thanks and Regards,
Sudhir
Unfortunately,
VPP 21.10 is an unsupported version. Even if someone had
a fix for the problem, there is no CI available to verify
any submissions to the stable/2110 branch.
If so desired (e.g. to share a fix), one may push a gerrit
change to an unsupported stable branch, but it will not be
merged. The best practice is to fix the issue on
'master', then cherry-pick to stable branches.
Has the issue been reproduced on 'master'?
Thanks,
-daw-
On 4/5/2023 12:15 PM, chetan bhasin wrote:
Hello Everyone,
Any work around/patch for the below
issue??
Thanks,
Chetan
Hi Dave,
we are using VPP Version 21.10.
Thanks and regards,
Sudhir
On Fri, Mar
10, 2023 at 5:31 PM Dave Barach < vpp@...>
wrote:
I should have had
the sense to ask this earlier: which
version of vpp are you using?
The line number in
your debug snippet is more than 100
lines off from master/latest. The
timer wheel code has been relatively
untouched, but there have been several
important fixes over the years...
D.
diff --git
a/src/vlib/main.c b/src/vlib/main.c
index af0fcd1cb..55c231d8b 100644
--- a/src/vlib/main.c
+++ b/src/vlib/main.c
@@ -1490,6 +1490,9 @@
dispatch_suspended_process
(vlib_main_t * vm,
}
else
{
+ if (strcmp((char
*)node->name,
"rtb-vpp-epoll-process") == 0) {
+ ASSERT(0);
+ }
Hi Dave,
Please excuse
my delayed response. It took some
time to recreate this issue.
I made changes
to our process node as per your
suggestion. now our process node
code looks like this
while (1) {
vlib_process_wait_for_event_or_clock
(vm,
RTB_VPP_EPOLL_PROCESS_NODE_TIMER);
event_type =
vlib_process_get_events (vm,
&event_data);
vec_reset_length(event_data);
switch (event_type) {
case ~0: /* handle
timer expirations */
rtb_event_loop_run_once ();
break;
default: /* bug! */
ASSERT (0);
}
}
After these
changes we didn't observe any
assertions but we hit the process
node suspend issue. with this it
is clear other than time out we
are not getting any other events.
In the issue
state I have collected
vlib_process node
(rtb_vpp_epoll_process) flags
value and it seems to be correct
(flags = 11).
Please find the
vlib_process_t and vlib_node_t
data structure values collected in
the issue state below.
$38 = {
cacheline0 = 0x7f9b2da50380
"\200~\274+\233\177",
node_runtime = {
cacheline0 = 0x7f9b2da50380
"\200~\274+\233\177",
function = 0x7f9b2bbc7e80
<rtb_vpp_epoll_process>,
errors = 0x7f9b3076a560,
clocks_since_last_overflow =
0,
max_clock = 3785970526,
max_clock_n = 0,
calls_since_last_overflow = 0,
vectors_since_last_overflow =
0,
next_frame_index = 1668,
node_index = 437,
input_main_loops_per_call = 0,
main_loop_count_last_dispatch
= 4147405645,
main_loop_vector_stats = {0,
0},
flags = 0,
state = 0,
n_next_nodes = 0,
cached_next_index = 0,
thread_index = 0,
runtime_data = 0x7f9b2da503c6
""
},
return_longjmp = {
regs = {94502584873984,
140304430422064, 140306731463680,
94502584874048, 94502640552512, 0,
140304430422032, 140306703608766}
},
resume_longjmp = {
regs = {94502584873984,
140304161734368, 140306731463680,
94502584874048, 94502640552512, 0,
140304161734272, 140304430441787}
},
flags = 11,
log2_n_stack_bytes = 16,
suspended_process_frame_index =
0,
n_suspends = 0,
pending_event_data_by_type_index
= 0x7f9b307b8310,
non_empty_event_type_bitmap =
0x7f9b307b8390,
one_time_event_type_bitmap =
0x0,
event_type_index_by_type_opaque
= 0x7f9b2dab8bd8,
event_type_pool =
0x7f9b2dcb5978,
resume_clock_interval = 1000,
stop_timer_handle = 3098,
output_function = 0x0,
output_function_arg = 0,
stack = 0x7f9b1bb78000
}
$17 = {
function = 0x7f9b2bbc7e80
<rtb_vpp_epoll_process>,
name = 0x7f9b3076a3f0
"rtb-vpp-epoll-process",
name_elog_string = 11783,
stats_total = {
calls = 0,
vectors = 0,
clocks = 1971244932732,
suspends = 6847366,
max_clock = 3785970526,
max_clock_n = 0
},
stats_last_clear = {
calls = 0,
vectors = 0,
clocks = 0,
suspends = 0,
max_clock = 0,
max_clock_n = 0
},
type = VLIB_NODE_TYPE_PROCESS,
index = 437,
runtime_index = 40,
runtime_data = 0x0,
flags = 0,
state = 0 '\000',
runtime_data_bytes = 0 '\000',
protocol_hint = 0 '\000',
n_errors = 0,
scalar_size = 0,
vector_size = 0,
error_heap_handle = 0,
error_heap_index = 0,
error_counters = 0x0,
next_node_names = 0x7f9b3076a530,
next_nodes = 0x0,
sibling_of = 0x0,
sibling_bitmap = 0x0,
n_vectors_by_next_node = 0x0,
next_slot_by_node = 0x0,
prev_node_bitmap = 0x0,
owner_node_index = 4294967295,
owner_next_index = 4294967295,
format_buffer = 0x0,
unformat_buffer = 0x0,
format_trace = 0x0,
validate_frame = 0x0,
state_string = 0x0,
node_fn_registrations = 0x0
}
I added an
assert statement before clearing
VLIB_PROCESS_IS_RUNNING flag
in dispatch_suspended_process
function.
But this
assert statement is not hitting.
diff --git
a/src/vlib/main.c
b/src/vlib/main.c
index af0fcd1cb..55c231d8b 100644
--- a/src/vlib/main.c
+++ b/src/vlib/main.c
@@ -1490,6 +1490,9 @@
dispatch_suspended_process
(vlib_main_t * vm,
}
else
{
+ if (strcmp((char
*)node->name,
"rtb-vpp-epoll-process") == 0) {
+ ASSERT(0);
+ }
p->flags &=
~VLIB_PROCESS_IS_RUNNING;
pool_put_index
(nm->suspended_process_frames,
p->suspended_process_frame_index);
I am not able
to figure out why this process
node is suspended in some
scenarios. Can you please help me
by providing some pointers to
debug and resolve this issue.
I applied your
patch to my code. The issue is not
solved with your patch. Thank you
for helping me out.
Hi Chetan,
In our case
we are observing this issue
occasionally exact steps to
recreate the issue are not
known.
I
made changes to our process
node as suggested by dave and
with these changes trying to
recreate the issue.
Soon I will
update my results and findings
in this mail thread.
Hi
Sudhir,
Actually
we are facing same issue
on vpp.2106.
In our
case "api-rx-ring" is not
getting called.
in our
usecase workers are
calling some functions in
main-thread context
leading to RPC message and
memory is allocated from
api section.
This
leads to Api-segment
memory is used fully and
leads to crash.
Hi
Dave,
Thank
you very much for
your inputs. I will
try this out and get
back to you with the
results.
On
Mon, Feb 20, 2023 at
6:01 PM Dave Barach
<vpp@...> wrote:
Please try something like
this, to
eliminate the
possibility
that some bit
of code is
sending this
process an
event. It’s
not a good
idea to skip
the
vec_reset_length
(event_data)
step.
while (1)
{
uword event_type, *
event_data =
0;
int i;
vlib_process_wait_for_event_or_clock
(vm, 1e-2 /*
10 ms */);
event_type =
vlib_process_get_events
(vm,
&event_data);
switch (event_type) {
case ~0: /* handle timer
expirations */
rtb_event_loop_run_once
();
break;
default: /* bug! */
ASSERT (0);
}
vec_reset_length(event_data);
}
Hi
Dave,
Thank you for
your response
and help.
Please
find the
additional
details below.
We are
creating a
process node rtb-vpp-epoll-process
to handle
control plane
events like
interface
add/delete,
route
add/delete.
This process
node waits for
10ms of
time (Not
Interested in
any events )
once 10ms is
expired it
will process
control plane
events
mentioned
above.
code snippet
looks like
below
static
uword
rtb_vpp_epoll_process (vlib_main_t *vm,
vlib_node_runtime_t
*rt,
vlib_frame_t
*f)
{
...
...
while (1)
{
vlib_process_wait_for_event_or_clock
(vm, 10e-3);
vlib_process_get_events
(vm, NULL);
rtb_event_loop_run_once();
<----
controlplane
events
handling
}
}
```
What
we observed is
that sometimes
(when there is
a high
controlplane
load like
request to
install more
routes)
"rtb-vpp-epoll-process"
is suspended
and not
scheduled
furever. this
we found by
using "show
runtime
rtb-vpp-epoll-process"
(in "show runtime
rtb-vpp-epoll-process" command output suspends counter is not
incrementing.)
show
runtime output
in working
case :
```
DBGvpp# show
runtime
rtb-vpp-epoll-process
Name
State
Calls
Vectors
Suspends
Clocks
Vectors/Call
rtb-vpp-epoll-process any wait 0
0 192246
1.91e6
0.00
DBGvpp#
DBGvpp# show
runtime
rtb-vpp-epoll-process
Name
State
Calls
Vectors
Suspends
Clocks
Vectors/Call
rtb-vpp-epoll-process any wait 0
0 193634
1.89e6
0.00
DBGvpp#
```
show
runtime output
in issue case
:
```
Name State Calls Vectors Suspends Clocks Vectors/Call
rtb-vpp-epoll-process any wait 0 0 81477 7.08e6 0.00
Name State Calls Vectors Suspends Clocks Vectors/Call
rtb-vpp-epoll-process any wait 0 0 81477 7.08e6 0.00
```
Other process
nodes like lldp-process,
ip4-neighbor-age-process, ip6-ra-process running without any issue. only
"rtb-vpp-epoll-process" process node suspended forever.
Please
let me know if
any additional
information is
required.
Hi Jinsh,
Thanks for
pointing me to
the issue you
faced. The
issue I am
facing looks
similar.
I will verify
with the given
patch.
Thanks and
Regards,
On
Sun, Feb 19,
2023 at 6:19
AM jinsh11
<jinsh11@...>
wrote:
HI:
bfd process node stop running. I raised
this issue,
NOTICE
TO RECIPIENT
This e-mail
message and
any
attachments
are
confidential
and may be
privileged. If
you received
this e-mail in
error, any
review, use,
dissemination,
distribution,
or copying of
this e-mail is
strictly
prohibited. Please
notify us
immediately of
the error by
return e-mail
and please
delete this
message from
your system.
For more
information
about Rtbrick,
please visit
us at www.rtbrick.com
NOTICE
TO RECIPIENT This
e-mail message and any
attachments are
confidential and may
be privileged. If you
received this e-mail
in error, any review,
use, dissemination,
distribution, or
copying of this e-mail
is strictly
prohibited. Please
notify us immediately
of the error by return
e-mail and please
delete this message
from your system. For
more information about
Rtbrick, please visit
us at www.rtbrick.com
NOTICE TO
RECIPIENT This e-mail message
and any attachments are
confidential and may be
privileged. If you received this
e-mail in error, any review,
use, dissemination,
distribution, or copying of this
e-mail is strictly
prohibited. Please notify us
immediately of the error by
return e-mail and please delete
this message from your system.
For more information about
Rtbrick, please visit us at www.rtbrick.com
NOTICE TO
RECIPIENT This e-mail message and
any attachments are confidential and
may be privileged. If you received
this e-mail in error, any review,
use, dissemination, distribution, or
copying of this e-mail is strictly
prohibited. Please notify us
immediately of the error by return
e-mail and please delete this
message from your system. For more
information about Rtbrick, please
visit us at www.rtbrick.com
NOTICE TO
RECIPIENT This e-mail message and any
attachments are confidential and may be
privileged. If you received this e-mail in
error, any review, use, dissemination,
distribution, or copying of this e-mail is
strictly prohibited. Please notify us
immediately of the error by return e-mail and
please delete this message from your system.
For more information about Rtbrick, please
visit us at www.rtbrick.com
NOTICE TO
RECIPIENT This e-mail message and any attachments are
confidential and may be
privileged. If you received this e-mail in error, any review,
use,
dissemination, distribution, or copying of this e-mail is
strictly
prohibited. Please notify us immediately of the error by
return e-mail and
please delete this message from your system. For more
information about Rtbrick, please visit us at www.rtbrick.com
|
|
toggle quoted message
Show quoted text
Some of Netgate's customers have observed a similar issue on a build based on stable/2206 and Pim van Pelt has observed the same issue on some of his production routers using a build based on something close to the current master branch. In these cases, the api-rx-from-ring node stopped being executed.
My colleague Alexander Chernavin observed in https://lists.fd.io/g/vpp-dev/topic/96500275#22508 that when this issue arises, the process node for api-rx-from-ring has a timer ID in stop_timer_handle which is actually in use by another process node (the user_handle field in the timer entry points back to the process ID of a different process node). I haven't been able to reproduce this problem in a lab environment, but the going theory seems to be that when timers for process nodes are stopped or expired, the stop_timer_handle is not reset and contains a stale timer ID value until the node is dispatched and suspends itself again. So you could have a series of events like this:
process node A's timer is stopped (Its stop_timer_handle still refers to that timer) process node B is dispatched and suspends while adding a new timer. The new timer allocated has the ID formerly used by process A. process node C is dispatched and signals process node A with vlib_process_signal_event(). vlib_process_signal_event_helper() sees that the timer referred to by stop_timer_handle on process A is not free (it was allocated and is in use by process node B) and stops it. Process node B has no timer set now.
There is an existing patch in gerrit which seems aimed at addressing this - https://gerrit.fd.io/r/c/vpp/+/38400. It did not pass the checkstyle verify job when it was submitted, so I rebased it and fixed the formatting issues. That patch alone did not correct the issue on the router where Pim is able to reproduce the issue - it only resets the stop_timer_handle on a process node when vlib_process_signal_event_helper() explicitly stops a timer, it does not do the same for timers which were returned to the pool by TW (tw_timer_expire_timers_vec). I just uploaded this additional patch which attempts to deal with those timers - https://gerrit.fd.io/r/c/vpp/+/38621. Pim is currently testing it, so I can't say yet whether it actually solves the problem. But I thought I would share it so anyone else having this problem could try them out and give feedback on whether they help.
AFAICT both patches should be able to be applied to stable/2110.
Thanks, -Matt
Sudhir,
Good to hear that you have a workaround. Hopefully you will be
able to move your product forward to a recent release soon.
Thanks,
-daw-
Hi Dave ,
unfortunately we have not yet integrated master code with
our product. So we are not able to test the issue with
latest master code.
While testing the issue I made below change to our process
node after that this issue is not seen.
while (1) {
vlib_process_wait_for_event_or_clock (vm,
RTB_VPP_EPOLL_PROCESS_NODE_TIMER);
event_type = vlib_process_get_events (vm,
&event_data);
vec_reset_length(event_data);
switch (event_type) {
case ~0: /* handle timer expirations */
rtb_event_loop_run_once ();
vlib_process_suspend (vm,
RTB_VPP_EPOLL_PROCESS_NODE_TIMER); <--- added newly
break;
default: /* bug! */
clib_warning ("BUG: event type 0x%wx",
event_type);
ASSERT (0);
}
}
This is just workaround to our problem, if by any chance our
process node is missing in nm->data_from_advancing_timing_wheel
list
vlib_process_suspend will add process node to data_from_advancing_timing_wheel
so that our process node will be scheduled again.
Thanks and Regards,
Sudhir
Unfortunately,
VPP 21.10 is an unsupported version. Even if someone had
a fix for the problem, there is no CI available to verify
any submissions to the stable/2110 branch.
If so desired (e.g. to share a fix), one may push a gerrit
change to an unsupported stable branch, but it will not be
merged. The best practice is to fix the issue on
'master', then cherry-pick to stable branches.
Has the issue been reproduced on 'master'?
Thanks,
-daw-
On 4/5/2023 12:15 PM, chetan bhasin wrote:
Hello Everyone,
Any work around/patch for the below
issue??
Thanks,
Chetan
Hi Dave,
we are using VPP Version 21.10.
Thanks and regards,
Sudhir
On Fri, Mar
10, 2023 at 5:31 PM Dave Barach < vpp@...>
wrote:
I should have had
the sense to ask this earlier: which
version of vpp are you using?
The line number in
your debug snippet is more than 100
lines off from master/latest. The
timer wheel code has been relatively
untouched, but there have been several
important fixes over the years...
D.
diff --git
a/src/vlib/main.c b/src/vlib/main.c
index af0fcd1cb..55c231d8b 100644
--- a/src/vlib/main.c
+++ b/src/vlib/main.c
@@ -1490,6 +1490,9 @@
dispatch_suspended_process
(vlib_main_t * vm,
}
else
{
+ if (strcmp((char
*)node->name,
"rtb-vpp-epoll-process") == 0) {
+ ASSERT(0);
+ }
Hi Dave,
Please excuse
my delayed response. It took some
time to recreate this issue.
I made changes
to our process node as per your
suggestion. now our process node
code looks like this
while (1) {
vlib_process_wait_for_event_or_clock
(vm,
RTB_VPP_EPOLL_PROCESS_NODE_TIMER);
event_type =
vlib_process_get_events (vm,
&event_data);
vec_reset_length(event_data);
switch (event_type) {
case ~0: /* handle
timer expirations */
rtb_event_loop_run_once ();
break;
default: /* bug! */
ASSERT (0);
}
}
After these
changes we didn't observe any
assertions but we hit the process
node suspend issue. with this it
is clear other than time out we
are not getting any other events.
In the issue
state I have collected
vlib_process node
(rtb_vpp_epoll_process) flags
value and it seems to be correct
(flags = 11).
Please find the
vlib_process_t and vlib_node_t
data structure values collected in
the issue state below.
$38 = {
cacheline0 = 0x7f9b2da50380
"\200~\274+\233\177",
node_runtime = {
cacheline0 = 0x7f9b2da50380
"\200~\274+\233\177",
function = 0x7f9b2bbc7e80
<rtb_vpp_epoll_process>,
errors = 0x7f9b3076a560,
clocks_since_last_overflow =
0,
max_clock = 3785970526,
max_clock_n = 0,
calls_since_last_overflow = 0,
vectors_since_last_overflow =
0,
next_frame_index = 1668,
node_index = 437,
input_main_loops_per_call = 0,
main_loop_count_last_dispatch
= 4147405645,
main_loop_vector_stats = {0,
0},
flags = 0,
state = 0,
n_next_nodes = 0,
cached_next_index = 0,
thread_index = 0,
runtime_data = 0x7f9b2da503c6
""
},
return_longjmp = {
regs = {94502584873984,
140304430422064, 140306731463680,
94502584874048, 94502640552512, 0,
140304430422032, 140306703608766}
},
resume_longjmp = {
regs = {94502584873984,
140304161734368, 140306731463680,
94502584874048, 94502640552512, 0,
140304161734272, 140304430441787}
},
flags = 11,
log2_n_stack_bytes = 16,
suspended_process_frame_index =
0,
n_suspends = 0,
pending_event_data_by_type_index
= 0x7f9b307b8310,
non_empty_event_type_bitmap =
0x7f9b307b8390,
one_time_event_type_bitmap =
0x0,
event_type_index_by_type_opaque
= 0x7f9b2dab8bd8,
event_type_pool =
0x7f9b2dcb5978,
resume_clock_interval = 1000,
stop_timer_handle = 3098,
output_function = 0x0,
output_function_arg = 0,
stack = 0x7f9b1bb78000
}
$17 = {
function = 0x7f9b2bbc7e80
<rtb_vpp_epoll_process>,
name = 0x7f9b3076a3f0
"rtb-vpp-epoll-process",
name_elog_string = 11783,
stats_total = {
calls = 0,
vectors = 0,
clocks = 1971244932732,
suspends = 6847366,
max_clock = 3785970526,
max_clock_n = 0
},
stats_last_clear = {
calls = 0,
vectors = 0,
clocks = 0,
suspends = 0,
max_clock = 0,
max_clock_n = 0
},
type = VLIB_NODE_TYPE_PROCESS,
index = 437,
runtime_index = 40,
runtime_data = 0x0,
flags = 0,
state = 0 '\000',
runtime_data_bytes = 0 '\000',
protocol_hint = 0 '\000',
n_errors = 0,
scalar_size = 0,
vector_size = 0,
error_heap_handle = 0,
error_heap_index = 0,
error_counters = 0x0,
next_node_names = 0x7f9b3076a530,
next_nodes = 0x0,
sibling_of = 0x0,
sibling_bitmap = 0x0,
n_vectors_by_next_node = 0x0,
next_slot_by_node = 0x0,
prev_node_bitmap = 0x0,
owner_node_index = 4294967295,
owner_next_index = 4294967295,
format_buffer = 0x0,
unformat_buffer = 0x0,
format_trace = 0x0,
validate_frame = 0x0,
state_string = 0x0,
node_fn_registrations = 0x0
}
I added an
assert statement before clearing
VLIB_PROCESS_IS_RUNNING flag
in dispatch_suspended_process
function.
But this
assert statement is not hitting.
diff --git
a/src/vlib/main.c
b/src/vlib/main.c
index af0fcd1cb..55c231d8b 100644
--- a/src/vlib/main.c
+++ b/src/vlib/main.c
@@ -1490,6 +1490,9 @@
dispatch_suspended_process
(vlib_main_t * vm,
}
else
{
+ if (strcmp((char
*)node->name,
"rtb-vpp-epoll-process") == 0) {
+ ASSERT(0);
+ }
p->flags &=
~VLIB_PROCESS_IS_RUNNING;
pool_put_index
(nm->suspended_process_frames,
p->suspended_process_frame_index);
I am not able
to figure out why this process
node is suspended in some
scenarios. Can you please help me
by providing some pointers to
debug and resolve this issue.
I applied your
patch to my code. The issue is not
solved with your patch. Thank you
for helping me out.
Hi Chetan,
In our case
we are observing this issue
occasionally exact steps to
recreate the issue are not
known.
I
made changes to our process
node as suggested by dave and
with these changes trying to
recreate the issue.
Soon I will
update my results and findings
in this mail thread.
Hi
Sudhir,
Actually
we are facing same issue
on vpp.2106.
In our
case "api-rx-ring" is not
getting called.
in our
usecase workers are
calling some functions in
main-thread context
leading to RPC message and
memory is allocated from
api section.
This
leads to Api-segment
memory is used fully and
leads to crash.
Hi
Dave,
Thank
you very much for
your inputs. I will
try this out and get
back to you with the
results.
On
Mon, Feb 20, 2023 at
6:01 PM Dave Barach
<vpp@...> wrote:
Please try something like
this, to
eliminate the
possibility
that some bit
of code is
sending this
process an
event. It’s
not a good
idea to skip
the
vec_reset_length
(event_data)
step.
while (1)
{
uword event_type, *
event_data =
0;
int i;
vlib_process_wait_for_event_or_clock
(vm, 1e-2 /*
10 ms */);
event_type =
vlib_process_get_events
(vm,
&event_data);
switch (event_type) {
case ~0: /* handle timer
expirations */
rtb_event_loop_run_once
();
break;
default: /* bug! */
ASSERT (0);
}
vec_reset_length(event_data);
}
Hi
Dave,
Thank you for
your response
and help.
Please
find the
additional
details below.
We are
creating a
process node rtb-vpp-epoll-process
to handle
control plane
events like
interface
add/delete,
route
add/delete.
This process
node waits for
10ms of
time (Not
Interested in
any events )
once 10ms is
expired it
will process
control plane
events
mentioned
above.
code snippet
looks like
below
static
uword
rtb_vpp_epoll_process (vlib_main_t *vm,
vlib_node_runtime_t
*rt,
vlib_frame_t
*f)
{
...
...
while (1)
{
vlib_process_wait_for_event_or_clock
(vm, 10e-3);
vlib_process_get_events
(vm, NULL);
rtb_event_loop_run_once();
<----
controlplane
events
handling
}
}
```
What
we observed is
that sometimes
(when there is
a high
controlplane
load like
request to
install more
routes)
"rtb-vpp-epoll-process"
is suspended
and not
scheduled
furever. this
we found by
using "show
runtime
rtb-vpp-epoll-process"
(in "show runtime
rtb-vpp-epoll-process" command output suspends counter is not
incrementing.)
show
runtime output
in working
case :
```
DBGvpp# show
runtime
rtb-vpp-epoll-process
Name
State
Calls
Vectors
Suspends
Clocks
Vectors/Call
rtb-vpp-epoll-process any wait 0
0 192246
1.91e6
0.00
DBGvpp#
DBGvpp# show
runtime
rtb-vpp-epoll-process
Name
State
Calls
Vectors
Suspends
Clocks
Vectors/Call
rtb-vpp-epoll-process any wait 0
0 193634
1.89e6
0.00
DBGvpp#
```
show
runtime output
in issue case
:
```
Name State Calls Vectors Suspends Clocks Vectors/Call
rtb-vpp-epoll-process any wait 0 0 81477 7.08e6 0.00
Name State Calls Vectors Suspends Clocks Vectors/Call
rtb-vpp-epoll-process any wait 0 0 81477 7.08e6 0.00
```
Other process
nodes like lldp-process,
ip4-neighbor-age-process, ip6-ra-process running without any issue. only
"rtb-vpp-epoll-process" process node suspended forever.
Please
let me know if
any additional
information is
required.
Hi Jinsh,
Thanks for
pointing me to
the issue you
faced. The
issue I am
facing looks
similar.
I will verify
with the given
patch.
Thanks and
Regards,
On
Sun, Feb 19,
2023 at 6:19
AM jinsh11
<jinsh11@...>
wrote:
HI:
bfd process node stop running. I raised
this issue,
NOTICE
TO RECIPIENT
This e-mail
message and
any
attachments
are
confidential
and may be
privileged. If
you received
this e-mail in
error, any
review, use,
dissemination,
distribution,
or copying of
this e-mail is
strictly
prohibited. Please
notify us
immediately of
the error by
return e-mail
and please
delete this
message from
your system.
For more
information
about Rtbrick,
please visit
us at www.rtbrick.com
NOTICE
TO RECIPIENT This
e-mail message and any
attachments are
confidential and may
be privileged. If you
received this e-mail
in error, any review,
use, dissemination,
distribution, or
copying of this e-mail
is strictly
prohibited. Please
notify us immediately
of the error by return
e-mail and please
delete this message
from your system. For
more information about
Rtbrick, please visit
us at www.rtbrick.com
NOTICE TO
RECIPIENT This e-mail message
and any attachments are
confidential and may be
privileged. If you received this
e-mail in error, any review,
use, dissemination,
distribution, or copying of this
e-mail is strictly
prohibited. Please notify us
immediately of the error by
return e-mail and please delete
this message from your system.
For more information about
Rtbrick, please visit us at www.rtbrick.com
NOTICE TO
RECIPIENT This e-mail message and
any attachments are confidential and
may be privileged. If you received
this e-mail in error, any review,
use, dissemination, distribution, or
copying of this e-mail is strictly
prohibited. Please notify us
immediately of the error by return
e-mail and please delete this
message from your system. For more
information about Rtbrick, please
visit us at www.rtbrick.com
NOTICE TO
RECIPIENT This e-mail message and any
attachments are confidential and may be
privileged. If you received this e-mail in
error, any review, use, dissemination,
distribution, or copying of this e-mail is
strictly prohibited. Please notify us
immediately of the error by return e-mail and
please delete this message from your system.
For more information about Rtbrick, please
visit us at www.rtbrick.com
NOTICE TO
RECIPIENT This e-mail message and any attachments are
confidential and may be
privileged. If you received this e-mail in error, any review,
use,
dissemination, distribution, or copying of this e-mail is
strictly
prohibited. Please notify us immediately of the error by
return e-mail and
please delete this message from your system. For more
information about Rtbrick, please visit us at www.rtbrick.com
|
|
Hi Matthew, Thank you very much for detailed information and patch. I will apply the patch,test it in our system and soone come back to you with the results.
Thanks and regards, Sudhir Some of Netgate's customers have observed a similar issue on a build based on stable/2206 and Pim van Pelt has observed the same issue on some of his production routers using a build based on something close to the current master branch. In these cases, the api-rx-from-ring node stopped being executed.
My colleague Alexander Chernavin observed in https://lists.fd.io/g/vpp-dev/topic/96500275#22508 that when this issue arises, the process node for api-rx-from-ring has a timer ID in stop_timer_handle which is actually in use by another process node (the user_handle field in the timer entry points back to the process ID of a different process node). I haven't been able to reproduce this problem in a lab environment, but the going theory seems to be that when timers for process nodes are stopped or expired, the stop_timer_handle is not reset and contains a stale timer ID value until the node is dispatched and suspends itself again. So you could have a series of events like this:
process node A's timer is stopped (Its stop_timer_handle still refers to that timer) process node B is dispatched and suspends while adding a new timer. The new timer allocated has the ID formerly used by process A. process node C is dispatched and signals process node A with vlib_process_signal_event(). vlib_process_signal_event_helper() sees that the timer referred to by stop_timer_handle on process A is not free (it was allocated and is in use by process node B) and stops it. Process node B has no timer set now.
There is an existing patch in gerrit which seems aimed at addressing this - https://gerrit.fd.io/r/c/vpp/+/38400. It did not pass the checkstyle verify job when it was submitted, so I rebased it and fixed the formatting issues. That patch alone did not correct the issue on the router where Pim is able to reproduce the issue - it only resets the stop_timer_handle on a process node when vlib_process_signal_event_helper() explicitly stops a timer, it does not do the same for timers which were returned to the pool by TW (tw_timer_expire_timers_vec). I just uploaded this additional patch which attempts to deal with those timers - https://gerrit.fd.io/r/c/vpp/+/38621. Pim is currently testing it, so I can't say yet whether it actually solves the problem. But I thought I would share it so anyone else having this problem could try them out and give feedback on whether they help.
AFAICT both patches should be able to be applied to stable/2110.
Thanks, -Matt
Sudhir,
Good to hear that you have a workaround. Hopefully you will be
able to move your product forward to a recent release soon.
Thanks,
-daw-
Hi Dave ,
unfortunately we have not yet integrated master code with
our product. So we are not able to test the issue with
latest master code.
While testing the issue I made below change to our process
node after that this issue is not seen.
while (1) {
vlib_process_wait_for_event_or_clock (vm,
RTB_VPP_EPOLL_PROCESS_NODE_TIMER);
event_type = vlib_process_get_events (vm,
&event_data);
vec_reset_length(event_data);
switch (event_type) {
case ~0: /* handle timer expirations */
rtb_event_loop_run_once ();
vlib_process_suspend (vm,
RTB_VPP_EPOLL_PROCESS_NODE_TIMER); <--- added newly
break;
default: /* bug! */
clib_warning ("BUG: event type 0x%wx",
event_type);
ASSERT (0);
}
}
This is just workaround to our problem, if by any chance our
process node is missing in nm->data_from_advancing_timing_wheel
list
vlib_process_suspend will add process node to data_from_advancing_timing_wheel
so that our process node will be scheduled again.
Thanks and Regards,
Sudhir
Unfortunately,
VPP 21.10 is an unsupported version. Even if someone had
a fix for the problem, there is no CI available to verify
any submissions to the stable/2110 branch.
If so desired (e.g. to share a fix), one may push a gerrit
change to an unsupported stable branch, but it will not be
merged. The best practice is to fix the issue on
'master', then cherry-pick to stable branches.
Has the issue been reproduced on 'master'?
Thanks,
-daw-
On 4/5/2023 12:15 PM, chetan bhasin wrote:
Hello Everyone,
Any work around/patch for the below
issue??
Thanks,
Chetan
Hi Dave,
we are using VPP Version 21.10.
Thanks and regards,
Sudhir
On Fri, Mar
10, 2023 at 5:31 PM Dave Barach < vpp@...>
wrote:
I should have had
the sense to ask this earlier: which
version of vpp are you using?
The line number in
your debug snippet is more than 100
lines off from master/latest. The
timer wheel code has been relatively
untouched, but there have been several
important fixes over the years...
D.
diff --git
a/src/vlib/main.c b/src/vlib/main.c
index af0fcd1cb..55c231d8b 100644
--- a/src/vlib/main.c
+++ b/src/vlib/main.c
@@ -1490,6 +1490,9 @@
dispatch_suspended_process
(vlib_main_t * vm,
}
else
{
+ if (strcmp((char
*)node->name,
"rtb-vpp-epoll-process") == 0) {
+ ASSERT(0);
+ }
Hi Dave,
Please excuse
my delayed response. It took some
time to recreate this issue.
I made changes
to our process node as per your
suggestion. now our process node
code looks like this
while (1) {
vlib_process_wait_for_event_or_clock
(vm,
RTB_VPP_EPOLL_PROCESS_NODE_TIMER);
event_type =
vlib_process_get_events (vm,
&event_data);
vec_reset_length(event_data);
switch (event_type) {
case ~0: /* handle
timer expirations */
rtb_event_loop_run_once ();
break;
default: /* bug! */
ASSERT (0);
}
}
After these
changes we didn't observe any
assertions but we hit the process
node suspend issue. with this it
is clear other than time out we
are not getting any other events.
In the issue
state I have collected
vlib_process node
(rtb_vpp_epoll_process) flags
value and it seems to be correct
(flags = 11).
Please find the
vlib_process_t and vlib_node_t
data structure values collected in
the issue state below.
$38 = {
cacheline0 = 0x7f9b2da50380
"\200~\274+\233\177",
node_runtime = {
cacheline0 = 0x7f9b2da50380
"\200~\274+\233\177",
function = 0x7f9b2bbc7e80
<rtb_vpp_epoll_process>,
errors = 0x7f9b3076a560,
clocks_since_last_overflow =
0,
max_clock = 3785970526,
max_clock_n = 0,
calls_since_last_overflow = 0,
vectors_since_last_overflow =
0,
next_frame_index = 1668,
node_index = 437,
input_main_loops_per_call = 0,
main_loop_count_last_dispatch
= 4147405645,
main_loop_vector_stats = {0,
0},
flags = 0,
state = 0,
n_next_nodes = 0,
cached_next_index = 0,
thread_index = 0,
runtime_data = 0x7f9b2da503c6
""
},
return_longjmp = {
regs = {94502584873984,
140304430422064, 140306731463680,
94502584874048, 94502640552512, 0,
140304430422032, 140306703608766}
},
resume_longjmp = {
regs = {94502584873984,
140304161734368, 140306731463680,
94502584874048, 94502640552512, 0,
140304161734272, 140304430441787}
},
flags = 11,
log2_n_stack_bytes = 16,
suspended_process_frame_index =
0,
n_suspends = 0,
pending_event_data_by_type_index
= 0x7f9b307b8310,
non_empty_event_type_bitmap =
0x7f9b307b8390,
one_time_event_type_bitmap =
0x0,
event_type_index_by_type_opaque
= 0x7f9b2dab8bd8,
event_type_pool =
0x7f9b2dcb5978,
resume_clock_interval = 1000,
stop_timer_handle = 3098,
output_function = 0x0,
output_function_arg = 0,
stack = 0x7f9b1bb78000
}
$17 = {
function = 0x7f9b2bbc7e80
<rtb_vpp_epoll_process>,
name = 0x7f9b3076a3f0
"rtb-vpp-epoll-process",
name_elog_string = 11783,
stats_total = {
calls = 0,
vectors = 0,
clocks = 1971244932732,
suspends = 6847366,
max_clock = 3785970526,
max_clock_n = 0
},
stats_last_clear = {
calls = 0,
vectors = 0,
clocks = 0,
suspends = 0,
max_clock = 0,
max_clock_n = 0
},
type = VLIB_NODE_TYPE_PROCESS,
index = 437,
runtime_index = 40,
runtime_data = 0x0,
flags = 0,
state = 0 '\000',
runtime_data_bytes = 0 '\000',
protocol_hint = 0 '\000',
n_errors = 0,
scalar_size = 0,
vector_size = 0,
error_heap_handle = 0,
error_heap_index = 0,
error_counters = 0x0,
next_node_names = 0x7f9b3076a530,
next_nodes = 0x0,
sibling_of = 0x0,
sibling_bitmap = 0x0,
n_vectors_by_next_node = 0x0,
next_slot_by_node = 0x0,
prev_node_bitmap = 0x0,
owner_node_index = 4294967295,
owner_next_index = 4294967295,
format_buffer = 0x0,
unformat_buffer = 0x0,
format_trace = 0x0,
validate_frame = 0x0,
state_string = 0x0,
node_fn_registrations = 0x0
}
I added an
assert statement before clearing
VLIB_PROCESS_IS_RUNNING flag
in dispatch_suspended_process
function.
But this
assert statement is not hitting.
diff --git
a/src/vlib/main.c
b/src/vlib/main.c
index af0fcd1cb..55c231d8b 100644
--- a/src/vlib/main.c
+++ b/src/vlib/main.c
@@ -1490,6 +1490,9 @@
dispatch_suspended_process
(vlib_main_t * vm,
}
else
{
+ if (strcmp((char
*)node->name,
"rtb-vpp-epoll-process") == 0) {
+ ASSERT(0);
+ }
p->flags &=
~VLIB_PROCESS_IS_RUNNING;
pool_put_index
(nm->suspended_process_frames,
p->suspended_process_frame_index);
I am not able
to figure out why this process
node is suspended in some
scenarios. Can you please help me
by providing some pointers to
debug and resolve this issue.
I applied your
patch to my code. The issue is not
solved with your patch. Thank you
for helping me out.
Hi Chetan,
In our case
we are observing this issue
occasionally exact steps to
recreate the issue are not
known.
I
made changes to our process
node as suggested by dave and
with these changes trying to
recreate the issue.
Soon I will
update my results and findings
in this mail thread.
Hi
Sudhir,
Actually
we are facing same issue
on vpp.2106.
In our
case "api-rx-ring" is not
getting called.
in our
usecase workers are
calling some functions in
main-thread context
leading to RPC message and
memory is allocated from
api section.
This
leads to Api-segment
memory is used fully and
leads to crash.
Hi
Dave,
Thank
you very much for
your inputs. I will
try this out and get
back to you with the
results.
On
Mon, Feb 20, 2023 at
6:01 PM Dave Barach
<vpp@...> wrote:
Please try something like
this, to
eliminate the
possibility
that some bit
of code is
sending this
process an
event. It’s
not a good
idea to skip
the
vec_reset_length
(event_data)
step.
while (1)
{
uword event_type, *
event_data =
0;
int i;
vlib_process_wait_for_event_or_clock
(vm, 1e-2 /*
10 ms */);
event_type =
vlib_process_get_events
(vm,
&event_data);
switch (event_type) {
case ~0: /* handle timer
expirations */
rtb_event_loop_run_once
();
break;
default: /* bug! */
ASSERT (0);
}
vec_reset_length(event_data);
}
Hi
Dave,
Thank you for
your response
and help.
Please
find the
additional
details below.
We are
creating a
process node rtb-vpp-epoll-process
to handle
control plane
events like
interface
add/delete,
route
add/delete.
This process
node waits for
10ms of
time (Not
Interested in
any events )
once 10ms is
expired it
will process
control plane
events
mentioned
above.
code snippet
looks like
below
static
uword
rtb_vpp_epoll_process (vlib_main_t *vm,
vlib_node_runtime_t
*rt,
vlib_frame_t
*f)
{
...
...
while (1)
{
vlib_process_wait_for_event_or_clock
(vm, 10e-3);
vlib_process_get_events
(vm, NULL);
rtb_event_loop_run_once();
<----
controlplane
events
handling
}
}
```
What
we observed is
that sometimes
(when there is
a high
controlplane
load like
request to
install more
routes)
"rtb-vpp-epoll-process"
is suspended
and not
scheduled
furever. this
we found by
using "show
runtime
rtb-vpp-epoll-process"
(in "show runtime
rtb-vpp-epoll-process" command output suspends counter is not
incrementing.)
show
runtime output
in working
case :
```
DBGvpp# show
runtime
rtb-vpp-epoll-process
Name
State
Calls
Vectors
Suspends
Clocks
Vectors/Call
rtb-vpp-epoll-process any wait 0
0 192246
1.91e6
0.00
DBGvpp#
DBGvpp# show
runtime
rtb-vpp-epoll-process
Name
State
Calls
Vectors
Suspends
Clocks
Vectors/Call
rtb-vpp-epoll-process any wait 0
0 193634
1.89e6
0.00
DBGvpp#
```
show
runtime output
in issue case
:
```
Name State Calls Vectors Suspends Clocks Vectors/Call
rtb-vpp-epoll-process any wait 0 0 81477 7.08e6 0.00
Name State Calls Vectors Suspends Clocks Vectors/Call
rtb-vpp-epoll-process any wait 0 0 81477 7.08e6 0.00
```
Other process
nodes like lldp-process,
ip4-neighbor-age-process, ip6-ra-process running without any issue. only
"rtb-vpp-epoll-process" process node suspended forever.
Please
let me know if
any additional
information is
required.
Hi Jinsh,
Thanks for
pointing me to
the issue you
faced. The
issue I am
facing looks
similar.
I will verify
with the given
patch.
Thanks and
Regards,
On
Sun, Feb 19,
2023 at 6:19
AM jinsh11
<jinsh11@...>
wrote:
HI:
bfd process node stop running. I raised
this issue,
NOTICE
TO RECIPIENT
This e-mail
message and
any
attachments
are
confidential
and may be
privileged. If
you received
this e-mail in
error, any
review, use,
dissemination,
distribution,
or copying of
this e-mail is
strictly
prohibited. Please
notify us
immediately of
the error by
return e-mail
and please
delete this
message from
your system.
For more
information
about Rtbrick,
please visit
us at www.rtbrick.com
NOTICE
TO RECIPIENT This
e-mail message and any
attachments are
confidential and may
be privileged. If you
received this e-mail
in error, any review,
use, dissemination,
distribution, or
copying of this e-mail
is strictly
prohibited. Please
notify us immediately
of the error by return
e-mail and please
delete this message
from your system. For
more information about
Rtbrick, please visit
us at www.rtbrick.com
NOTICE TO
RECIPIENT This e-mail message
and any attachments are
confidential and may be
privileged. If you received this
e-mail in error, any review,
use, dissemination,
distribution, or copying of this
e-mail is strictly
prohibited. Please notify us
immediately of the error by
return e-mail and please delete
this message from your system.
For more information about
Rtbrick, please visit us at www.rtbrick.com
NOTICE TO
RECIPIENT This e-mail message and
any attachments are confidential and
may be privileged. If you received
this e-mail in error, any review,
use, dissemination, distribution, or
copying of this e-mail is strictly
prohibited. Please notify us
immediately of the error by return
e-mail and please delete this
message from your system. For more
information about Rtbrick, please
visit us at www.rtbrick.com
NOTICE TO
RECIPIENT This e-mail message and any
attachments are confidential and may be
privileged. If you received this e-mail in
error, any review, use, dissemination,
distribution, or copying of this e-mail is
strictly prohibited. Please notify us
immediately of the error by return e-mail and
please delete this message from your system.
For more information about Rtbrick, please
visit us at www.rtbrick.com
NOTICE TO
RECIPIENT This e-mail message and any attachments are
confidential and may be
privileged. If you received this e-mail in error, any review,
use,
dissemination, distribution, or copying of this e-mail is
strictly
prohibited. Please notify us immediately of the error by
return e-mail and
please delete this message from your system. For more
information about Rtbrick, please visit us at www.rtbrick.com
NOTICE TO
RECIPIENT This e-mail message and any attachments are confidential and may be
privileged. If you received this e-mail in error, any review, use,
dissemination, distribution, or copying of this e-mail is strictly
prohibited. Please notify us immediately of the error by return e-mail and
please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com
|
|