Locked process node suspended indefinitely


Sudhir CR
 

Hi Team,
We have a process node. which we will use to do some control plane related activity. Sometimes we observe that this process node is suspended  indefinitely.

I know that if any process node is taking Unreasonably long time such nodes will  not be scheduled further. But not able to figure out in code where this is done.

Can anyone point me to the code where we are tracking time taken by each process node and suspend indefinitely if it is consuming more time.

Thanks and regards,
Sudhir

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com


Dave Barach
 

Process is a bit of a misnomer, “cooperative multitasking thread” would be more accurate. Src/vlib/main.c makes no effort to interrupt process nodes. If a process runs forever vpp will have a bad day.

 

Recitations: which version of vpp is involved? Please send “show run” output for the process node in question.

 

The process might be waiting on an event which never happens, or for the clock to reach a time so far in the future it won’t happen in one’s lifetime. Worse luck might involve memory corruption or an issue with the timer wheel code.

 

HTH... Dave

 

From: vpp-dev@... <vpp-dev@...> On Behalf Of Sudhir CR via lists.fd.io
Sent: Friday, February 17, 2023 12:12 PM
To: vpp-dev@...
Subject: [vpp-dev] process node suspended indefinitely

 

Hi Team,

We have a process node. which we will use to do some control plane related activity. Sometimes we observe that this process node is suspended  indefinitely.

 

I know that if any process node is taking Unreasonably long time such nodes will  not be scheduled further. But not able to figure out in code where this is done.

 

Can anyone point me to the code where we are tracking time taken by each process node and suspend indefinitely if it is consuming more time.

 

Thanks and regards,

Sudhir

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com


jinsh11
 

HI:
  • I have the same problem,

    bfd process node stop running. I raised this issue,
https://lists.fd.io/g/vpp-dev/message/22380
I think there is a problem with the porcess scheduling module when using the time wheel.


Sudhir CR
 

Hi Dave,
Thank you for your response and help. 

Please find the additional details below.
VPP Version 21.10

We are creating a process node rtb-vpp-epoll-process to handle control plane events like interface add/delete, route add/delete.
This process node waits for 10ms of time (Not Interested in any events ) once 10ms is expired it will process control plane events mentioned above.

code snippet looks like below 

```
static uword
rtb_vpp_epoll_process (vlib_main_t                 *vm,
                       vlib_node_runtime_t  *rt,
                       vlib_frame_t         *f)
{
    ...
    ...
    while (1) {
        vlib_process_wait_for_event_or_clock (vm, 10e-3);
        vlib_process_get_events (vm, NULL);

        rtb_event_loop_run_once();   <---- controlplane events handling 
    } 
}
``` 
What we observed is that sometimes (when there is a high controlplane load like request to install more routes) "rtb-vpp-epoll-process" is suspended and not scheduled furever. this we found by using "show runtime rtb-vpp-epoll-process"  (in "show runtime rtb-vpp-epoll-process" command output suspends counter is not incrementing.)

show runtime output in working case :

```
DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0          192246          1.91e6            0.00
DBGvpp#

DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0          193634          1.89e6            0.00
DBGvpp#

``` 
show runtime output in issue case :
```
DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0           81477          7.08e6            0.00
DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0           81477          7.08e6            0.00
```

Other process nodes like lldp-process, ip4-neighbor-age-process, ip6-ra-process running without any issue. only "rtb-vpp-epoll-process" process node suspended forever. 

Please let me know if any additional information is required.

Hi Jinsh,
Thanks for pointing me to the issue you faced. The issue I am facing looks similar.
I will verify with the given patch.

Thanks and Regards,
Sudhir

On Sun, Feb 19, 2023 at 6:19 AM jinsh11 <jinsh11@...> wrote:
HI:
  • I have the same problem,

    bfd process node stop running. I raised this issue,
https://lists.fd.io/g/vpp-dev/message/22380
I think there is a problem with the porcess scheduling module when using the time wheel.




NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com


Dave Barach
 

Please try something like this, to eliminate the possibility that some bit of code is sending this process an event. It’s not a good idea to skip the vec_reset_length (event_data) step.

 

while (1)

{

   uword event_type, * event_data = 0;

   int i;

 

   vlib_process_wait_for_event_or_clock (vm, 1e-2 /* 10 ms */);

 

   event_type = vlib_process_get_events (vm, &event_data);

 

   switch (event_type) {

  case ~0: /* handle timer expirations */

       rtb_event_loop_run_once ();

       break;

 

   default: /* bug! */

       ASSERT (0);

   }

 

   vec_reset_length(event_data);

}

 

From: vpp-dev@... <vpp-dev@...> On Behalf Of Sudhir CR via lists.fd.io
Sent: Monday, February 20, 2023 4:02 AM
To: vpp-dev@...
Subject: Re: [vpp-dev] process node suspended indefinitely

 

Hi Dave,
Thank you for your response and help. 

 

Please find the additional details below.

VPP Version 21.10


We are creating a process node rtb-vpp-epoll-process to handle control plane events like interface add/delete, route add/delete.
This process node waits for 10ms of time (Not Interested in any events ) once 10ms is expired it will process control plane events mentioned above.

code snippet looks like below 

 

```

static uword
rtb_vpp_epoll_process (vlib_main_t                 *vm,
                       vlib_node_runtime_t  *rt,
                       vlib_frame_t         *f)
{

    ...
    ...
    while (1) {
        vlib_process_wait_for_event_or_clock (vm, 10e-3);
        vlib_process_get_events (vm, NULL);

        rtb_event_loop_run_once();   <---- controlplane events handling 
    } 
}
``` 

What we observed is that sometimes (when there is a high controlplane load like request to install more routes) "rtb-vpp-epoll-process" is suspended and not scheduled furever. this we found by using "show runtime rtb-vpp-epoll-process"  (in "show runtime rtb-vpp-epoll-process" command output suspends counter is not incrementing.)

show runtime output in working case :


```
DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0          192246          1.91e6            0.00
DBGvpp#

DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0          193634          1.89e6            0.00
DBGvpp#

``` 

show runtime output in issue case :
```

DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0           81477          7.08e6            0.00
DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0           81477          7.08e6            0.00

```

Other process nodes like lldp-process, ip4-neighbor-age-process, ip6-ra-process running without any issue. only "rtb-vpp-epoll-process" process node suspended forever. 

 

Please let me know if any additional information is required.

Hi Jinsh,
Thanks for pointing me to the issue you faced. The issue I am facing looks similar.
I will verify with the given patch.


Thanks and Regards,

Sudhir

 

On Sun, Feb 19, 2023 at 6:19 AM jinsh11 <jinsh11@...> wrote:

HI

  • I have the same problem

bfd process node stop running. I raised this issue,

https://lists.fd.io/g/vpp-dev/message/22380
I think there is a problem with the porcess scheduling module when using the time wheel.



 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com


Sudhir CR
 

Hi Dave,
Thank you very much for your inputs. I will try this out and get back to you with the results.

Regards,
Sudhir 

On Mon, Feb 20, 2023 at 6:01 PM Dave Barach <vpp@...> wrote:

Please try something like this, to eliminate the possibility that some bit of code is sending this process an event. It’s not a good idea to skip the vec_reset_length (event_data) step.

 

while (1)

{

   uword event_type, * event_data = 0;

   int i;

 

   vlib_process_wait_for_event_or_clock (vm, 1e-2 /* 10 ms */);

 

   event_type = vlib_process_get_events (vm, &event_data);

 

   switch (event_type) {

  case ~0: /* handle timer expirations */

       rtb_event_loop_run_once ();

       break;

 

   default: /* bug! */

       ASSERT (0);

   }

 

   vec_reset_length(event_data);

}

 

From: vpp-dev@... <vpp-dev@...> On Behalf Of Sudhir CR via lists.fd.io
Sent: Monday, February 20, 2023 4:02 AM
To: vpp-dev@...
Subject: Re: [vpp-dev] process node suspended indefinitely

 

Hi Dave,
Thank you for your response and help. 

 

Please find the additional details below.

VPP Version 21.10


We are creating a process node rtb-vpp-epoll-process to handle control plane events like interface add/delete, route add/delete.
This process node waits for 10ms of time (Not Interested in any events ) once 10ms is expired it will process control plane events mentioned above.

code snippet looks like below 

 

```

static uword
rtb_vpp_epoll_process (vlib_main_t                 *vm,
                       vlib_node_runtime_t  *rt,
                       vlib_frame_t         *f)
{

    ...
    ...
    while (1) {
        vlib_process_wait_for_event_or_clock (vm, 10e-3);
        vlib_process_get_events (vm, NULL);

        rtb_event_loop_run_once();   <---- controlplane events handling 
    } 
}
``` 

What we observed is that sometimes (when there is a high controlplane load like request to install more routes) "rtb-vpp-epoll-process" is suspended and not scheduled furever. this we found by using "show runtime rtb-vpp-epoll-process"  (in "show runtime rtb-vpp-epoll-process" command output suspends counter is not incrementing.)

show runtime output in working case :


```
DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0          192246          1.91e6            0.00
DBGvpp#

DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0          193634          1.89e6            0.00
DBGvpp#

``` 

show runtime output in issue case :
```

DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0           81477          7.08e6            0.00
DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0           81477          7.08e6            0.00

```

Other process nodes like lldp-process, ip4-neighbor-age-process, ip6-ra-process running without any issue. only "rtb-vpp-epoll-process" process node suspended forever. 

 

Please let me know if any additional information is required.

Hi Jinsh,
Thanks for pointing me to the issue you faced. The issue I am facing looks similar.
I will verify with the given patch.


Thanks and Regards,

Sudhir

 

On Sun, Feb 19, 2023 at 6:19 AM jinsh11 <jinsh11@...> wrote:

HI

  • I have the same problem

bfd process node stop running. I raised this issue,

https://lists.fd.io/g/vpp-dev/message/22380
I think there is a problem with the porcess scheduling module when using the time wheel.



 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com





NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com


chetan bhasin
 

Hi Sudhir,

Is your issue resolved?

Actually we are facing same issue on vpp.2106. 
In our case "api-rx-ring" is not getting called.
in our usecase workers are calling some functions in main-thread context leading to RPC message and memory is allocated from api section.
This leads to Api-segment memory is used fully and leads to crash.

Thanks,
Chetan 


On Mon, Feb 20, 2023, 18:24 Sudhir CR via lists.fd.io <sudhir=rtbrick.com@...> wrote:
Hi Dave,
Thank you very much for your inputs. I will try this out and get back to you with the results.

Regards,
Sudhir 

On Mon, Feb 20, 2023 at 6:01 PM Dave Barach <vpp@...> wrote:

Please try something like this, to eliminate the possibility that some bit of code is sending this process an event. It’s not a good idea to skip the vec_reset_length (event_data) step.

 

while (1)

{

   uword event_type, * event_data = 0;

   int i;

 

   vlib_process_wait_for_event_or_clock (vm, 1e-2 /* 10 ms */);

 

   event_type = vlib_process_get_events (vm, &event_data);

 

   switch (event_type) {

  case ~0: /* handle timer expirations */

       rtb_event_loop_run_once ();

       break;

 

   default: /* bug! */

       ASSERT (0);

   }

 

   vec_reset_length(event_data);

}

 

From: vpp-dev@... <vpp-dev@...> On Behalf Of Sudhir CR via lists.fd.io
Sent: Monday, February 20, 2023 4:02 AM
To: vpp-dev@...
Subject: Re: [vpp-dev] process node suspended indefinitely

 

Hi Dave,
Thank you for your response and help. 

 

Please find the additional details below.

VPP Version 21.10


We are creating a process node rtb-vpp-epoll-process to handle control plane events like interface add/delete, route add/delete.
This process node waits for 10ms of time (Not Interested in any events ) once 10ms is expired it will process control plane events mentioned above.

code snippet looks like below 

 

```

static uword
rtb_vpp_epoll_process (vlib_main_t                 *vm,
                       vlib_node_runtime_t  *rt,
                       vlib_frame_t         *f)
{

    ...
    ...
    while (1) {
        vlib_process_wait_for_event_or_clock (vm, 10e-3);
        vlib_process_get_events (vm, NULL);

        rtb_event_loop_run_once();   <---- controlplane events handling 
    } 
}
``` 

What we observed is that sometimes (when there is a high controlplane load like request to install more routes) "rtb-vpp-epoll-process" is suspended and not scheduled furever. this we found by using "show runtime rtb-vpp-epoll-process"  (in "show runtime rtb-vpp-epoll-process" command output suspends counter is not incrementing.)

show runtime output in working case :


```
DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0          192246          1.91e6            0.00
DBGvpp#

DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0          193634          1.89e6            0.00
DBGvpp#

``` 

show runtime output in issue case :
```

DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0           81477          7.08e6            0.00
DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0           81477          7.08e6            0.00

```

Other process nodes like lldp-process, ip4-neighbor-age-process, ip6-ra-process running without any issue. only "rtb-vpp-epoll-process" process node suspended forever. 

 

Please let me know if any additional information is required.

Hi Jinsh,
Thanks for pointing me to the issue you faced. The issue I am facing looks similar.
I will verify with the given patch.


Thanks and Regards,

Sudhir

 

On Sun, Feb 19, 2023 at 6:19 AM jinsh11 <jinsh11@...> wrote:

HI

  • I have the same problem

bfd process node stop running. I raised this issue,

https://lists.fd.io/g/vpp-dev/message/22380
I think there is a problem with the porcess scheduling module when using the time wheel.



 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com





NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com




Sudhir CR
 

Hi Chetan,
In our case we are observing this issue occasionally exact steps  to recreate the issue are not known.
I made changes to our process node as suggested by dave and with these changes trying to recreate the issue.

Soon I will update my results and findings in this mail thread.

Thanks and Regards,
Sudhir

On Fri, Mar 3, 2023 at 12:37 PM chetan bhasin <chetan.bhasin017@...> wrote:
Hi Sudhir,

Is your issue resolved?

Actually we are facing same issue on vpp.2106. 
In our case "api-rx-ring" is not getting called.
in our usecase workers are calling some functions in main-thread context leading to RPC message and memory is allocated from api section.
This leads to Api-segment memory is used fully and leads to crash.

Thanks,
Chetan 


On Mon, Feb 20, 2023, 18:24 Sudhir CR via lists.fd.io <sudhir=rtbrick.com@...> wrote:
Hi Dave,
Thank you very much for your inputs. I will try this out and get back to you with the results.

Regards,
Sudhir 

On Mon, Feb 20, 2023 at 6:01 PM Dave Barach <vpp@...> wrote:

Please try something like this, to eliminate the possibility that some bit of code is sending this process an event. It’s not a good idea to skip the vec_reset_length (event_data) step.

 

while (1)

{

   uword event_type, * event_data = 0;

   int i;

 

   vlib_process_wait_for_event_or_clock (vm, 1e-2 /* 10 ms */);

 

   event_type = vlib_process_get_events (vm, &event_data);

 

   switch (event_type) {

  case ~0: /* handle timer expirations */

       rtb_event_loop_run_once ();

       break;

 

   default: /* bug! */

       ASSERT (0);

   }

 

   vec_reset_length(event_data);

}

 

From: vpp-dev@... <vpp-dev@...> On Behalf Of Sudhir CR via lists.fd.io
Sent: Monday, February 20, 2023 4:02 AM
To: vpp-dev@...
Subject: Re: [vpp-dev] process node suspended indefinitely

 

Hi Dave,
Thank you for your response and help. 

 

Please find the additional details below.

VPP Version 21.10


We are creating a process node rtb-vpp-epoll-process to handle control plane events like interface add/delete, route add/delete.
This process node waits for 10ms of time (Not Interested in any events ) once 10ms is expired it will process control plane events mentioned above.

code snippet looks like below 

 

```

static uword
rtb_vpp_epoll_process (vlib_main_t                 *vm,
                       vlib_node_runtime_t  *rt,
                       vlib_frame_t         *f)
{

    ...
    ...
    while (1) {
        vlib_process_wait_for_event_or_clock (vm, 10e-3);
        vlib_process_get_events (vm, NULL);

        rtb_event_loop_run_once();   <---- controlplane events handling 
    } 
}
``` 

What we observed is that sometimes (when there is a high controlplane load like request to install more routes) "rtb-vpp-epoll-process" is suspended and not scheduled furever. this we found by using "show runtime rtb-vpp-epoll-process"  (in "show runtime rtb-vpp-epoll-process" command output suspends counter is not incrementing.)

show runtime output in working case :


```
DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0          192246          1.91e6            0.00
DBGvpp#

DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0          193634          1.89e6            0.00
DBGvpp#

``` 

show runtime output in issue case :
```

DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0           81477          7.08e6            0.00
DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0           81477          7.08e6            0.00

```

Other process nodes like lldp-process, ip4-neighbor-age-process, ip6-ra-process running without any issue. only "rtb-vpp-epoll-process" process node suspended forever. 

 

Please let me know if any additional information is required.

Hi Jinsh,
Thanks for pointing me to the issue you faced. The issue I am facing looks similar.
I will verify with the given patch.


Thanks and Regards,

Sudhir

 

On Sun, Feb 19, 2023 at 6:19 AM jinsh11 <jinsh11@...> wrote:

HI

  • I have the same problem

bfd process node stop running. I raised this issue,

https://lists.fd.io/g/vpp-dev/message/22380
I think there is a problem with the porcess scheduling module when using the time wheel.



 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com





NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com







NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com


Sudhir CR
 

Hi Dave,
Please excuse my delayed response. It took some time to recreate this issue.
I made changes to our process node as per your suggestion. now our process node code looks like this
 
while (1) {

        vlib_process_wait_for_event_or_clock (vm, RTB_VPP_EPOLL_PROCESS_NODE_TIMER);
        event_type = vlib_process_get_events (vm, &event_data);
        vec_reset_length(event_data);

        switch (event_type) {
            case ~0: /* handle timer expirations */
                rtb_event_loop_run_once ();
                break;

            default: /* bug! */
                ASSERT (0);
        }
    }
After these changes we didn't observe any assertions but we hit the process node suspend issue. with this it is clear other than time out we are not getting any other events.

In the issue state I have collected vlib_process node (rtb_vpp_epoll_process) flags value and it seems to be correct (flags = 11).

Please find the vlib_process_t and vlib_node_t data structure values collected in the issue state below.

vlib_process_t:
============
$38 = {
  cacheline0 = 0x7f9b2da50380 "\200~\274+\233\177",
  node_runtime = {
    cacheline0 = 0x7f9b2da50380 "\200~\274+\233\177",
    function = 0x7f9b2bbc7e80 <rtb_vpp_epoll_process>,
    errors = 0x7f9b3076a560,
    clocks_since_last_overflow = 0,
    max_clock = 3785970526,
    max_clock_n = 0,
    calls_since_last_overflow = 0,
    vectors_since_last_overflow = 0,
    next_frame_index = 1668,
    node_index = 437,
    input_main_loops_per_call = 0,
    main_loop_count_last_dispatch = 4147405645,
    main_loop_vector_stats = {0, 0},
    flags = 0,
    state = 0,
    n_next_nodes = 0,
    cached_next_index = 0,
    thread_index = 0,
    runtime_data = 0x7f9b2da503c6 ""
  },
  return_longjmp = {
    regs = {94502584873984, 140304430422064, 140306731463680, 94502584874048, 94502640552512, 0, 140304430422032, 140306703608766}
  },
  resume_longjmp = {
    regs = {94502584873984, 140304161734368, 140306731463680, 94502584874048, 94502640552512, 0, 140304161734272, 140304430441787}
  },
  flags = 11,
  log2_n_stack_bytes = 16,
  suspended_process_frame_index = 0,
  n_suspends = 0,
  pending_event_data_by_type_index = 0x7f9b307b8310,
  non_empty_event_type_bitmap = 0x7f9b307b8390,
  one_time_event_type_bitmap = 0x0,
  event_type_index_by_type_opaque = 0x7f9b2dab8bd8,
  event_type_pool = 0x7f9b2dcb5978,
  resume_clock_interval = 1000,
  stop_timer_handle = 3098,
  output_function = 0x0,
  output_function_arg = 0,
  stack = 0x7f9b1bb78000
}

vlib_node_t
=========
 (gdb) p *n
$17 = {
  function = 0x7f9b2bbc7e80 <rtb_vpp_epoll_process>,
  name = 0x7f9b3076a3f0 "rtb-vpp-epoll-process",
  name_elog_string = 11783,
  stats_total = {
    calls = 0,
    vectors = 0,
    clocks = 1971244932732,
    suspends = 6847366,
    max_clock = 3785970526,
    max_clock_n = 0
  },
  stats_last_clear = {
    calls = 0,
    vectors = 0,
    clocks = 0,
    suspends = 0,
    max_clock = 0,
    max_clock_n = 0
  },
  type = VLIB_NODE_TYPE_PROCESS,
  index = 437,
  runtime_index = 40,
  runtime_data = 0x0,
  flags = 0,
  state = 0 '\000',
  runtime_data_bytes = 0 '\000',
  protocol_hint = 0 '\000',
  n_errors = 0,
  scalar_size = 0,
  vector_size = 0,
  error_heap_handle = 0,
  error_heap_index = 0,
  error_counters = 0x0,
  next_node_names = 0x7f9b3076a530,
  next_nodes = 0x0,
  sibling_of = 0x0,
  sibling_bitmap = 0x0,
  n_vectors_by_next_node = 0x0,
  next_slot_by_node = 0x0,
  prev_node_bitmap = 0x0,
  owner_node_index = 4294967295,
  owner_next_index = 4294967295,
  format_buffer = 0x0,
  unformat_buffer = 0x0,
  format_trace = 0x0,
  validate_frame = 0x0,
  state_string = 0x0,
  node_fn_registrations = 0x0
}

I added an assert statement before clearing VLIB_PROCESS_IS_RUNNING flag in dispatch_suspended_process function.
But this assert statement is not hitting.

diff --git a/src/vlib/main.c b/src/vlib/main.c
index af0fcd1cb..55c231d8b 100644
--- a/src/vlib/main.c
+++ b/src/vlib/main.c
@@ -1490,6 +1490,9 @@ dispatch_suspended_process (vlib_main_t * vm,
     }
   else
     {
+           if (strcmp((char *)node->name, "rtb-vpp-epoll-process") == 0) {
+                   ASSERT(0);
+           }
       p->flags &= ~VLIB_PROCESS_IS_RUNNING;
       pool_put_index (nm->suspended_process_frames,
                      p->suspended_process_frame_index);

I am not able to figure out why this process node is suspended in some scenarios. Can you please help me by providing some pointers to debug and resolve this issue. 

Hi Jinsh,
I applied your patch to my code. The issue is not solved with your patch. Thank you for helping me out.

Thanks and Regards,
Sudhir
 

On Fri, Mar 3, 2023 at 12:53 PM Sudhir CR via lists.fd.io <sudhir=rtbrick.com@...> wrote:
Hi Chetan,
In our case we are observing this issue occasionally exact steps  to recreate the issue are not known.
I made changes to our process node as suggested by dave and with these changes trying to recreate the issue.

Soon I will update my results and findings in this mail thread.

Thanks and Regards,
Sudhir

On Fri, Mar 3, 2023 at 12:37 PM chetan bhasin <chetan.bhasin017@...> wrote:
Hi Sudhir,

Is your issue resolved?

Actually we are facing same issue on vpp.2106. 
In our case "api-rx-ring" is not getting called.
in our usecase workers are calling some functions in main-thread context leading to RPC message and memory is allocated from api section.
This leads to Api-segment memory is used fully and leads to crash.

Thanks,
Chetan 


On Mon, Feb 20, 2023, 18:24 Sudhir CR via lists.fd.io <sudhir=rtbrick.com@...> wrote:
Hi Dave,
Thank you very much for your inputs. I will try this out and get back to you with the results.

Regards,
Sudhir 

On Mon, Feb 20, 2023 at 6:01 PM Dave Barach <vpp@...> wrote:

Please try something like this, to eliminate the possibility that some bit of code is sending this process an event. It’s not a good idea to skip the vec_reset_length (event_data) step.

 

while (1)

{

   uword event_type, * event_data = 0;

   int i;

 

   vlib_process_wait_for_event_or_clock (vm, 1e-2 /* 10 ms */);

 

   event_type = vlib_process_get_events (vm, &event_data);

 

   switch (event_type) {

  case ~0: /* handle timer expirations */

       rtb_event_loop_run_once ();

       break;

 

   default: /* bug! */

       ASSERT (0);

   }

 

   vec_reset_length(event_data);

}

 

From: vpp-dev@... <vpp-dev@...> On Behalf Of Sudhir CR via lists.fd.io
Sent: Monday, February 20, 2023 4:02 AM
To: vpp-dev@...
Subject: Re: [vpp-dev] process node suspended indefinitely

 

Hi Dave,
Thank you for your response and help. 

 

Please find the additional details below.

VPP Version 21.10


We are creating a process node rtb-vpp-epoll-process to handle control plane events like interface add/delete, route add/delete.
This process node waits for 10ms of time (Not Interested in any events ) once 10ms is expired it will process control plane events mentioned above.

code snippet looks like below 

 

```

static uword
rtb_vpp_epoll_process (vlib_main_t                 *vm,
                       vlib_node_runtime_t  *rt,
                       vlib_frame_t         *f)
{

    ...
    ...
    while (1) {
        vlib_process_wait_for_event_or_clock (vm, 10e-3);
        vlib_process_get_events (vm, NULL);

        rtb_event_loop_run_once();   <---- controlplane events handling 
    } 
}
``` 

What we observed is that sometimes (when there is a high controlplane load like request to install more routes) "rtb-vpp-epoll-process" is suspended and not scheduled furever. this we found by using "show runtime rtb-vpp-epoll-process"  (in "show runtime rtb-vpp-epoll-process" command output suspends counter is not incrementing.)

show runtime output in working case :


```
DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0          192246          1.91e6            0.00
DBGvpp#

DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0          193634          1.89e6            0.00
DBGvpp#

``` 

show runtime output in issue case :
```

DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0           81477          7.08e6            0.00
DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0           81477          7.08e6            0.00

```

Other process nodes like lldp-process, ip4-neighbor-age-process, ip6-ra-process running without any issue. only "rtb-vpp-epoll-process" process node suspended forever. 

 

Please let me know if any additional information is required.

Hi Jinsh,
Thanks for pointing me to the issue you faced. The issue I am facing looks similar.
I will verify with the given patch.


Thanks and Regards,

Sudhir

 

On Sun, Feb 19, 2023 at 6:19 AM jinsh11 <jinsh11@...> wrote:

HI

  • I have the same problem

bfd process node stop running. I raised this issue,

https://lists.fd.io/g/vpp-dev/message/22380
I think there is a problem with the porcess scheduling module when using the time wheel.



 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com





NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com







NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com




NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com


jinsh11
 

  • I think you can query who stopped the current node's time wheel.

    always_inline void *

    vlib_process_signal_event_helper (vlib_node_main_t * nm,

      vlib_node_t * n,

      vlib_process_t * p,

      uword t,

      uword n_data_elts, uword n_data_elt_bytes)

    {

     

    if (add_to_pending)

        {

          u32 x = vlib_timing_wheel_data_set_suspended_process (n->runtime_index);

          p->flags = p_flags | VLIB_PROCESS_RESUME_PENDING;

          vec_add1 (nm->data_from_advancing_timing_wheel, x);

          if (delete_from_wheel){

    TW (tw_timer_stop) ((TWT (tw_timer_wheel) *) nm->timing_wheel,

        p->stop_timer_handle);

           vlib_process_t *p1 = vec_elt (nm->processes, vlib_get_node_by_name(vm,"rtb-vpp-epoll-process”)->runtime_index);

          If ((p != p1 && (p-> stop_timer_handle == p1->stop_timer_handle))

          {

              ASSERT();

           }
    }

        }

    }


Sudhir CR
 

Hi jinsh,
Thanks for the help. 
I placed assert statement in vlib_process_signal_event_helper function. But in this place the assert statement didn't hit.
When I debugged further I found that my process node is not there in the data_from_advancing_timing_wheel vector. 
i believe due to this process node is not getting called.now i am checking why  rtb-vpp-epoll-process node entry is not present in data_from_advancing_timing_wheel vector.

Thanks and regards,
Sudhir

On Thu, Mar 9, 2023 at 9:08 PM jinsh11 <jinsh11@...> wrote:
  • I think you can query who stopped the current node's time wheel.

    always_inline void *

    vlib_process_signal_event_helper (vlib_node_main_t * nm,

      vlib_node_t * n,

      vlib_process_t * p,

      uword t,

      uword n_data_elts, uword n_data_elt_bytes)

    {

     

    if (add_to_pending)

        {

          u32 x = vlib_timing_wheel_data_set_suspended_process (n->runtime_index);

          p->flags = p_flags | VLIB_PROCESS_RESUME_PENDING;

          vec_add1 (nm->data_from_advancing_timing_wheel, x);

          if (delete_from_wheel){

    TW (tw_timer_stop) ((TWT (tw_timer_wheel) *) nm->timing_wheel,

        p->stop_timer_handle);

           vlib_process_t *p1 = vec_elt (nm->processes, vlib_get_node_by_name(vm,"rtb-vpp-epoll-process”)->runtime_index);

          If ((p != p1 && (p-> stop_timer_handle == p1->stop_timer_handle))

          {

              ASSERT();

           }
    }

        }

    }





NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com


Dave Barach
 

I should have had the sense to ask this earlier: which version of vpp are you using?

 

The line number in your debug snippet is more than 100 lines off from master/latest. The timer wheel code has been relatively untouched, but there have been several important fixes over the years...

 

D.

 

diff --git a/src/vlib/main.c b/src/vlib/main.c
index af0fcd1cb..55c231d8b 100644
--- a/src/vlib/main.c
+++ b/src/vlib/main.c
@@ -1490,6 +1490,9 @@ dispatch_suspended_process (vlib_main_t * vm,
     }
   else
     {
+           if (strcmp((char *)node->name, "rtb-vpp-epoll-process") == 0) {
+                   ASSERT(0);
+           }

 

From: vpp-dev@... <vpp-dev@...> On Behalf Of Sudhir CR via lists.fd.io
Sent: Thursday, March 9, 2023 4:00 AM
To: vpp-dev@...
Cc: rtbrick.com@...
Subject: Re: [vpp-dev] process node suspended indefinitely

 

Hi Dave,

Please excuse my delayed response. It took some time to recreate this issue.

I made changes to our process node as per your suggestion. now our process node code looks like this

 

while (1) {

        vlib_process_wait_for_event_or_clock (vm, RTB_VPP_EPOLL_PROCESS_NODE_TIMER);
        event_type = vlib_process_get_events (vm, &event_data);
        vec_reset_length(event_data);

        switch (event_type) {
            case ~0: /* handle timer expirations */
                rtb_event_loop_run_once ();
                break;

            default: /* bug! */
                ASSERT (0);
        }
    }

After these changes we didn't observe any assertions but we hit the process node suspend issue. with this it is clear other than time out we are not getting any other events.

 

In the issue state I have collected vlib_process node (rtb_vpp_epoll_process) flags value and it seems to be correct (flags = 11).

 

Please find the vlib_process_t and vlib_node_t data structure values collected in the issue state below.

 

vlib_process_t:

============

$38 = {
  cacheline0 = 0x7f9b2da50380 "\200~\274+\233\177",
  node_runtime = {
    cacheline0 = 0x7f9b2da50380 "\200~\274+\233\177",
    function = 0x7f9b2bbc7e80 <rtb_vpp_epoll_process>,
    errors = 0x7f9b3076a560,
    clocks_since_last_overflow = 0,
    max_clock = 3785970526,
    max_clock_n = 0,
    calls_since_last_overflow = 0,
    vectors_since_last_overflow = 0,
    next_frame_index = 1668,
    node_index = 437,
    input_main_loops_per_call = 0,
    main_loop_count_last_dispatch = 4147405645,
    main_loop_vector_stats = {0, 0},
    flags = 0,
    state = 0,
    n_next_nodes = 0,
    cached_next_index = 0,
    thread_index = 0,
    runtime_data = 0x7f9b2da503c6 ""
  },
  return_longjmp = {
    regs = {94502584873984, 140304430422064, 140306731463680, 94502584874048, 94502640552512, 0, 140304430422032, 140306703608766}
  },
  resume_longjmp = {
    regs = {94502584873984, 140304161734368, 140306731463680, 94502584874048, 94502640552512, 0, 140304161734272, 140304430441787}
  },
  flags = 11,
  log2_n_stack_bytes = 16,
  suspended_process_frame_index = 0,
  n_suspends = 0,
  pending_event_data_by_type_index = 0x7f9b307b8310,
  non_empty_event_type_bitmap = 0x7f9b307b8390,
  one_time_event_type_bitmap = 0x0,
  event_type_index_by_type_opaque = 0x7f9b2dab8bd8,
  event_type_pool = 0x7f9b2dcb5978,
  resume_clock_interval = 1000,
  stop_timer_handle = 3098,
  output_function = 0x0,
  output_function_arg = 0,
  stack = 0x7f9b1bb78000
}

 

vlib_node_t

=========

 (gdb) p *n

$17 = {
  function = 0x7f9b2bbc7e80 <rtb_vpp_epoll_process>,
  name = 0x7f9b3076a3f0 "rtb-vpp-epoll-process",
  name_elog_string = 11783,
  stats_total = {
    calls = 0,
    vectors = 0,
    clocks = 1971244932732,
    suspends = 6847366,
    max_clock = 3785970526,
    max_clock_n = 0
  },
  stats_last_clear = {
    calls = 0,
    vectors = 0,
    clocks = 0,
    suspends = 0,
    max_clock = 0,
    max_clock_n = 0
  },
  type = VLIB_NODE_TYPE_PROCESS,
  index = 437,
  runtime_index = 40,
  runtime_data = 0x0,
  flags = 0,
  state = 0 '\000',
  runtime_data_bytes = 0 '\000',
  protocol_hint = 0 '\000',
  n_errors = 0,
  scalar_size = 0,
  vector_size = 0,
  error_heap_handle = 0,
  error_heap_index = 0,
  error_counters = 0x0,
  next_node_names = 0x7f9b3076a530,
  next_nodes = 0x0,
  sibling_of = 0x0,
  sibling_bitmap = 0x0,
  n_vectors_by_next_node = 0x0,
  next_slot_by_node = 0x0,
  prev_node_bitmap = 0x0,
  owner_node_index = 4294967295,
  owner_next_index = 4294967295,
  format_buffer = 0x0,
  unformat_buffer = 0x0,
  format_trace = 0x0,
  validate_frame = 0x0,
  state_string = 0x0,
  node_fn_registrations = 0x0
}

 

I added an assert statement before clearing VLIB_PROCESS_IS_RUNNING flag in dispatch_suspended_process function.

But this assert statement is not hitting.

 

diff --git a/src/vlib/main.c b/src/vlib/main.c
index af0fcd1cb..55c231d8b 100644
--- a/src/vlib/main.c
+++ b/src/vlib/main.c
@@ -1490,6 +1490,9 @@ dispatch_suspended_process (vlib_main_t * vm,
     }
   else
     {
+           if (strcmp((char *)node->name, "rtb-vpp-epoll-process") == 0) {
+                   ASSERT(0);
+           }
       p->flags &= ~VLIB_PROCESS_IS_RUNNING;
       pool_put_index (nm->suspended_process_frames,
                      p->suspended_process_frame_index);

 

I am not able to figure out why this process node is suspended in some scenarios. Can you please help me by providing some pointers to debug and resolve this issue. 

 

Hi Jinsh,

I applied your patch to my code. The issue is not solved with your patch. Thank you for helping me out.

 

Thanks and Regards,

Sudhir

 

 

On Fri, Mar 3, 2023 at 12:53 PM Sudhir CR via lists.fd.io <sudhir=rtbrick.com@...> wrote:

Hi Chetan,

In our case we are observing this issue occasionally exact steps  to recreate the issue are not known.

I made changes to our process node as suggested by dave and with these changes trying to recreate the issue.

Soon I will update my results and findings in this mail thread.

 

Thanks and Regards,

Sudhir

 

On Fri, Mar 3, 2023 at 12:37 PM chetan bhasin <chetan.bhasin017@...> wrote:

Hi Sudhir,

 

Is your issue resolved?

 

Actually we are facing same issue on vpp.2106. 

In our case "api-rx-ring" is not getting called.

in our usecase workers are calling some functions in main-thread context leading to RPC message and memory is allocated from api section.

This leads to Api-segment memory is used fully and leads to crash.

 

Thanks,

Chetan 

 

On Mon, Feb 20, 2023, 18:24 Sudhir CR via lists.fd.io <sudhir=rtbrick.com@...> wrote:

Hi Dave,

Thank you very much for your inputs. I will try this out and get back to you with the results.

 

Regards,

Sudhir 

 

On Mon, Feb 20, 2023 at 6:01 PM Dave Barach <vpp@...> wrote:

Please try something like this, to eliminate the possibility that some bit of code is sending this process an event. It’s not a good idea to skip the vec_reset_length (event_data) step.

 

while (1)

{

   uword event_type, * event_data = 0;

   int i;

 

   vlib_process_wait_for_event_or_clock (vm, 1e-2 /* 10 ms */);

 

   event_type = vlib_process_get_events (vm, &event_data);

 

   switch (event_type) {

  case ~0: /* handle timer expirations */

       rtb_event_loop_run_once ();

       break;

 

   default: /* bug! */

       ASSERT (0);

   }

 

   vec_reset_length(event_data);

}

 

From: vpp-dev@... <vpp-dev@...> On Behalf Of Sudhir CR via lists.fd.io
Sent: Monday, February 20, 2023 4:02 AM
To: vpp-dev@...
Subject: Re: [vpp-dev] process node suspended indefinitely

 

Hi Dave,
Thank you for your response and help. 

 

Please find the additional details below.

VPP Version 21.10


We are creating a process node rtb-vpp-epoll-process to handle control plane events like interface add/delete, route add/delete.
This process node waits for 10ms of time (Not Interested in any events ) once 10ms is expired it will process control plane events mentioned above.

code snippet looks like below 

 

```

static uword
rtb_vpp_epoll_process (vlib_main_t                 *vm,
                       vlib_node_runtime_t  *rt,
                       vlib_frame_t         *f)
{

    ...
    ...
    while (1) {
        vlib_process_wait_for_event_or_clock (vm, 10e-3);
        vlib_process_get_events (vm, NULL);

        rtb_event_loop_run_once();   <---- controlplane events handling 
    } 
}
``` 

What we observed is that sometimes (when there is a high controlplane load like request to install more routes) "rtb-vpp-epoll-process" is suspended and not scheduled furever. this we found by using "show runtime rtb-vpp-epoll-process"  (in "show runtime rtb-vpp-epoll-process" command output suspends counter is not incrementing.)

show runtime output in working case :


```
DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0          192246          1.91e6            0.00
DBGvpp#

DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0          193634          1.89e6            0.00
DBGvpp#

``` 

show runtime output in issue case :
```

DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0           81477          7.08e6            0.00
DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0           81477          7.08e6            0.00

```

Other process nodes like lldp-process, ip4-neighbor-age-process, ip6-ra-process running without any issue. only "rtb-vpp-epoll-process" process node suspended forever. 

 

Please let me know if any additional information is required.

Hi Jinsh,
Thanks for pointing me to the issue you faced. The issue I am facing looks similar.
I will verify with the given patch.


Thanks and Regards,

Sudhir

 

On Sun, Feb 19, 2023 at 6:19 AM jinsh11 <jinsh11@...> wrote:

HI

  • I have the same problem

bfd process node stop running. I raised this issue,

https://lists.fd.io/g/vpp-dev/message/22380
I think there is a problem with the porcess scheduling module when using the time wheel.

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com

 

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com


Sudhir CR
 

Hi Dave, 
we are using VPP Version 21.10.

Thanks and regards,
Sudhir

On Fri, Mar 10, 2023 at 5:31 PM Dave Barach <vpp@...> wrote:

I should have had the sense to ask this earlier: which version of vpp are you using?

 

The line number in your debug snippet is more than 100 lines off from master/latest. The timer wheel code has been relatively untouched, but there have been several important fixes over the years...

 

D.

 

diff --git a/src/vlib/main.c b/src/vlib/main.c
index af0fcd1cb..55c231d8b 100644
--- a/src/vlib/main.c
+++ b/src/vlib/main.c
@@ -1490,6 +1490,9 @@ dispatch_suspended_process (vlib_main_t * vm,
     }
   else
     {
+           if (strcmp((char *)node->name, "rtb-vpp-epoll-process") == 0) {
+                   ASSERT(0);
+           }

 

From: vpp-dev@... <vpp-dev@...> On Behalf Of Sudhir CR via lists.fd.io
Sent: Thursday, March 9, 2023 4:00 AM
To: vpp-dev@...
Cc: rtbrick.com@...
Subject: Re: [vpp-dev] process node suspended indefinitely

 

Hi Dave,

Please excuse my delayed response. It took some time to recreate this issue.

I made changes to our process node as per your suggestion. now our process node code looks like this

 

while (1) {

        vlib_process_wait_for_event_or_clock (vm, RTB_VPP_EPOLL_PROCESS_NODE_TIMER);
        event_type = vlib_process_get_events (vm, &event_data);
        vec_reset_length(event_data);

        switch (event_type) {
            case ~0: /* handle timer expirations */
                rtb_event_loop_run_once ();
                break;

            default: /* bug! */
                ASSERT (0);
        }
    }

After these changes we didn't observe any assertions but we hit the process node suspend issue. with this it is clear other than time out we are not getting any other events.

 

In the issue state I have collected vlib_process node (rtb_vpp_epoll_process) flags value and it seems to be correct (flags = 11).

 

Please find the vlib_process_t and vlib_node_t data structure values collected in the issue state below.

 

vlib_process_t:

============

$38 = {
  cacheline0 = 0x7f9b2da50380 "\200~\274+\233\177",
  node_runtime = {
    cacheline0 = 0x7f9b2da50380 "\200~\274+\233\177",
    function = 0x7f9b2bbc7e80 <rtb_vpp_epoll_process>,
    errors = 0x7f9b3076a560,
    clocks_since_last_overflow = 0,
    max_clock = 3785970526,
    max_clock_n = 0,
    calls_since_last_overflow = 0,
    vectors_since_last_overflow = 0,
    next_frame_index = 1668,
    node_index = 437,
    input_main_loops_per_call = 0,
    main_loop_count_last_dispatch = 4147405645,
    main_loop_vector_stats = {0, 0},
    flags = 0,
    state = 0,
    n_next_nodes = 0,
    cached_next_index = 0,
    thread_index = 0,
    runtime_data = 0x7f9b2da503c6 ""
  },
  return_longjmp = {
    regs = {94502584873984, 140304430422064, 140306731463680, 94502584874048, 94502640552512, 0, 140304430422032, 140306703608766}
  },
  resume_longjmp = {
    regs = {94502584873984, 140304161734368, 140306731463680, 94502584874048, 94502640552512, 0, 140304161734272, 140304430441787}
  },
  flags = 11,
  log2_n_stack_bytes = 16,
  suspended_process_frame_index = 0,
  n_suspends = 0,
  pending_event_data_by_type_index = 0x7f9b307b8310,
  non_empty_event_type_bitmap = 0x7f9b307b8390,
  one_time_event_type_bitmap = 0x0,
  event_type_index_by_type_opaque = 0x7f9b2dab8bd8,
  event_type_pool = 0x7f9b2dcb5978,
  resume_clock_interval = 1000,
  stop_timer_handle = 3098,
  output_function = 0x0,
  output_function_arg = 0,
  stack = 0x7f9b1bb78000
}

 

vlib_node_t

=========

 (gdb) p *n

$17 = {
  function = 0x7f9b2bbc7e80 <rtb_vpp_epoll_process>,
  name = 0x7f9b3076a3f0 "rtb-vpp-epoll-process",
  name_elog_string = 11783,
  stats_total = {
    calls = 0,
    vectors = 0,
    clocks = 1971244932732,
    suspends = 6847366,
    max_clock = 3785970526,
    max_clock_n = 0
  },
  stats_last_clear = {
    calls = 0,
    vectors = 0,
    clocks = 0,
    suspends = 0,
    max_clock = 0,
    max_clock_n = 0
  },
  type = VLIB_NODE_TYPE_PROCESS,
  index = 437,
  runtime_index = 40,
  runtime_data = 0x0,
  flags = 0,
  state = 0 '\000',
  runtime_data_bytes = 0 '\000',
  protocol_hint = 0 '\000',
  n_errors = 0,
  scalar_size = 0,
  vector_size = 0,
  error_heap_handle = 0,
  error_heap_index = 0,
  error_counters = 0x0,
  next_node_names = 0x7f9b3076a530,
  next_nodes = 0x0,
  sibling_of = 0x0,
  sibling_bitmap = 0x0,
  n_vectors_by_next_node = 0x0,
  next_slot_by_node = 0x0,
  prev_node_bitmap = 0x0,
  owner_node_index = 4294967295,
  owner_next_index = 4294967295,
  format_buffer = 0x0,
  unformat_buffer = 0x0,
  format_trace = 0x0,
  validate_frame = 0x0,
  state_string = 0x0,
  node_fn_registrations = 0x0
}

 

I added an assert statement before clearing VLIB_PROCESS_IS_RUNNING flag in dispatch_suspended_process function.

But this assert statement is not hitting.

 

diff --git a/src/vlib/main.c b/src/vlib/main.c
index af0fcd1cb..55c231d8b 100644
--- a/src/vlib/main.c
+++ b/src/vlib/main.c
@@ -1490,6 +1490,9 @@ dispatch_suspended_process (vlib_main_t * vm,
     }
   else
     {
+           if (strcmp((char *)node->name, "rtb-vpp-epoll-process") == 0) {
+                   ASSERT(0);
+           }
       p->flags &= ~VLIB_PROCESS_IS_RUNNING;
       pool_put_index (nm->suspended_process_frames,
                      p->suspended_process_frame_index);

 

I am not able to figure out why this process node is suspended in some scenarios. Can you please help me by providing some pointers to debug and resolve this issue. 

 

Hi Jinsh,

I applied your patch to my code. The issue is not solved with your patch. Thank you for helping me out.

 

Thanks and Regards,

Sudhir

 

 

On Fri, Mar 3, 2023 at 12:53 PM Sudhir CR via lists.fd.io <sudhir=rtbrick.com@...> wrote:

Hi Chetan,

In our case we are observing this issue occasionally exact steps  to recreate the issue are not known.

I made changes to our process node as suggested by dave and with these changes trying to recreate the issue.

Soon I will update my results and findings in this mail thread.

 

Thanks and Regards,

Sudhir

 

On Fri, Mar 3, 2023 at 12:37 PM chetan bhasin <chetan.bhasin017@...> wrote:

Hi Sudhir,

 

Is your issue resolved?

 

Actually we are facing same issue on vpp.2106. 

In our case "api-rx-ring" is not getting called.

in our usecase workers are calling some functions in main-thread context leading to RPC message and memory is allocated from api section.

This leads to Api-segment memory is used fully and leads to crash.

 

Thanks,

Chetan 

 

On Mon, Feb 20, 2023, 18:24 Sudhir CR via lists.fd.io <sudhir=rtbrick.com@...> wrote:

Hi Dave,

Thank you very much for your inputs. I will try this out and get back to you with the results.

 

Regards,

Sudhir 

 

On Mon, Feb 20, 2023 at 6:01 PM Dave Barach <vpp@...> wrote:

Please try something like this, to eliminate the possibility that some bit of code is sending this process an event. It’s not a good idea to skip the vec_reset_length (event_data) step.

 

while (1)

{

   uword event_type, * event_data = 0;

   int i;

 

   vlib_process_wait_for_event_or_clock (vm, 1e-2 /* 10 ms */);

 

   event_type = vlib_process_get_events (vm, &event_data);

 

   switch (event_type) {

  case ~0: /* handle timer expirations */

       rtb_event_loop_run_once ();

       break;

 

   default: /* bug! */

       ASSERT (0);

   }

 

   vec_reset_length(event_data);

}

 

From: vpp-dev@... <vpp-dev@...> On Behalf Of Sudhir CR via lists.fd.io
Sent: Monday, February 20, 2023 4:02 AM
To: vpp-dev@...
Subject: Re: [vpp-dev] process node suspended indefinitely

 

Hi Dave,
Thank you for your response and help. 

 

Please find the additional details below.

VPP Version 21.10


We are creating a process node rtb-vpp-epoll-process to handle control plane events like interface add/delete, route add/delete.
This process node waits for 10ms of time (Not Interested in any events ) once 10ms is expired it will process control plane events mentioned above.

code snippet looks like below 

 

```

static uword
rtb_vpp_epoll_process (vlib_main_t                 *vm,
                       vlib_node_runtime_t  *rt,
                       vlib_frame_t         *f)
{

    ...
    ...
    while (1) {
        vlib_process_wait_for_event_or_clock (vm, 10e-3);
        vlib_process_get_events (vm, NULL);

        rtb_event_loop_run_once();   <---- controlplane events handling 
    } 
}
``` 

What we observed is that sometimes (when there is a high controlplane load like request to install more routes) "rtb-vpp-epoll-process" is suspended and not scheduled furever. this we found by using "show runtime rtb-vpp-epoll-process"  (in "show runtime rtb-vpp-epoll-process" command output suspends counter is not incrementing.)

show runtime output in working case :


```
DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0          192246          1.91e6            0.00
DBGvpp#

DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0          193634          1.89e6            0.00
DBGvpp#

``` 

show runtime output in issue case :
```

DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0           81477          7.08e6            0.00
DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0           81477          7.08e6            0.00

```

Other process nodes like lldp-process, ip4-neighbor-age-process, ip6-ra-process running without any issue. only "rtb-vpp-epoll-process" process node suspended forever. 

 

Please let me know if any additional information is required.

Hi Jinsh,
Thanks for pointing me to the issue you faced. The issue I am facing looks similar.
I will verify with the given patch.


Thanks and Regards,

Sudhir

 

On Sun, Feb 19, 2023 at 6:19 AM jinsh11 <jinsh11@...> wrote:

HI

  • I have the same problem

bfd process node stop running. I raised this issue,

https://lists.fd.io/g/vpp-dev/message/22380
I think there is a problem with the porcess scheduling module when using the time wheel.

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com

 

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com





NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com


chetan bhasin
 

Hello Everyone,

Any work around/patch for the below issue??

Thanks,
Chetan 


On Mon, Mar 13, 2023, 06:51 Sudhir CR via lists.fd.io <sudhir=rtbrick.com@...> wrote:
Hi Dave, 
we are using VPP Version 21.10.

Thanks and regards,
Sudhir

On Fri, Mar 10, 2023 at 5:31 PM Dave Barach <vpp@...> wrote:

I should have had the sense to ask this earlier: which version of vpp are you using?

 

The line number in your debug snippet is more than 100 lines off from master/latest. The timer wheel code has been relatively untouched, but there have been several important fixes over the years...

 

D.

 

diff --git a/src/vlib/main.c b/src/vlib/main.c
index af0fcd1cb..55c231d8b 100644
--- a/src/vlib/main.c
+++ b/src/vlib/main.c
@@ -1490,6 +1490,9 @@ dispatch_suspended_process (vlib_main_t * vm,
     }
   else
     {
+           if (strcmp((char *)node->name, "rtb-vpp-epoll-process") == 0) {
+                   ASSERT(0);
+           }

 

From: vpp-dev@... <vpp-dev@...> On Behalf Of Sudhir CR via lists.fd.io
Sent: Thursday, March 9, 2023 4:00 AM
To: vpp-dev@...
Cc: rtbrick.com@...
Subject: Re: [vpp-dev] process node suspended indefinitely

 

Hi Dave,

Please excuse my delayed response. It took some time to recreate this issue.

I made changes to our process node as per your suggestion. now our process node code looks like this

 

while (1) {

        vlib_process_wait_for_event_or_clock (vm, RTB_VPP_EPOLL_PROCESS_NODE_TIMER);
        event_type = vlib_process_get_events (vm, &event_data);
        vec_reset_length(event_data);

        switch (event_type) {
            case ~0: /* handle timer expirations */
                rtb_event_loop_run_once ();
                break;

            default: /* bug! */
                ASSERT (0);
        }
    }

After these changes we didn't observe any assertions but we hit the process node suspend issue. with this it is clear other than time out we are not getting any other events.

 

In the issue state I have collected vlib_process node (rtb_vpp_epoll_process) flags value and it seems to be correct (flags = 11).

 

Please find the vlib_process_t and vlib_node_t data structure values collected in the issue state below.

 

vlib_process_t:

============

$38 = {
  cacheline0 = 0x7f9b2da50380 "\200~\274+\233\177",
  node_runtime = {
    cacheline0 = 0x7f9b2da50380 "\200~\274+\233\177",
    function = 0x7f9b2bbc7e80 <rtb_vpp_epoll_process>,
    errors = 0x7f9b3076a560,
    clocks_since_last_overflow = 0,
    max_clock = 3785970526,
    max_clock_n = 0,
    calls_since_last_overflow = 0,
    vectors_since_last_overflow = 0,
    next_frame_index = 1668,
    node_index = 437,
    input_main_loops_per_call = 0,
    main_loop_count_last_dispatch = 4147405645,
    main_loop_vector_stats = {0, 0},
    flags = 0,
    state = 0,
    n_next_nodes = 0,
    cached_next_index = 0,
    thread_index = 0,
    runtime_data = 0x7f9b2da503c6 ""
  },
  return_longjmp = {
    regs = {94502584873984, 140304430422064, 140306731463680, 94502584874048, 94502640552512, 0, 140304430422032, 140306703608766}
  },
  resume_longjmp = {
    regs = {94502584873984, 140304161734368, 140306731463680, 94502584874048, 94502640552512, 0, 140304161734272, 140304430441787}
  },
  flags = 11,
  log2_n_stack_bytes = 16,
  suspended_process_frame_index = 0,
  n_suspends = 0,
  pending_event_data_by_type_index = 0x7f9b307b8310,
  non_empty_event_type_bitmap = 0x7f9b307b8390,
  one_time_event_type_bitmap = 0x0,
  event_type_index_by_type_opaque = 0x7f9b2dab8bd8,
  event_type_pool = 0x7f9b2dcb5978,
  resume_clock_interval = 1000,
  stop_timer_handle = 3098,
  output_function = 0x0,
  output_function_arg = 0,
  stack = 0x7f9b1bb78000
}

 

vlib_node_t

=========

 (gdb) p *n

$17 = {
  function = 0x7f9b2bbc7e80 <rtb_vpp_epoll_process>,
  name = 0x7f9b3076a3f0 "rtb-vpp-epoll-process",
  name_elog_string = 11783,
  stats_total = {
    calls = 0,
    vectors = 0,
    clocks = 1971244932732,
    suspends = 6847366,
    max_clock = 3785970526,
    max_clock_n = 0
  },
  stats_last_clear = {
    calls = 0,
    vectors = 0,
    clocks = 0,
    suspends = 0,
    max_clock = 0,
    max_clock_n = 0
  },
  type = VLIB_NODE_TYPE_PROCESS,
  index = 437,
  runtime_index = 40,
  runtime_data = 0x0,
  flags = 0,
  state = 0 '\000',
  runtime_data_bytes = 0 '\000',
  protocol_hint = 0 '\000',
  n_errors = 0,
  scalar_size = 0,
  vector_size = 0,
  error_heap_handle = 0,
  error_heap_index = 0,
  error_counters = 0x0,
  next_node_names = 0x7f9b3076a530,
  next_nodes = 0x0,
  sibling_of = 0x0,
  sibling_bitmap = 0x0,
  n_vectors_by_next_node = 0x0,
  next_slot_by_node = 0x0,
  prev_node_bitmap = 0x0,
  owner_node_index = 4294967295,
  owner_next_index = 4294967295,
  format_buffer = 0x0,
  unformat_buffer = 0x0,
  format_trace = 0x0,
  validate_frame = 0x0,
  state_string = 0x0,
  node_fn_registrations = 0x0
}

 

I added an assert statement before clearing VLIB_PROCESS_IS_RUNNING flag in dispatch_suspended_process function.

But this assert statement is not hitting.

 

diff --git a/src/vlib/main.c b/src/vlib/main.c
index af0fcd1cb..55c231d8b 100644
--- a/src/vlib/main.c
+++ b/src/vlib/main.c
@@ -1490,6 +1490,9 @@ dispatch_suspended_process (vlib_main_t * vm,
     }
   else
     {
+           if (strcmp((char *)node->name, "rtb-vpp-epoll-process") == 0) {
+                   ASSERT(0);
+           }
       p->flags &= ~VLIB_PROCESS_IS_RUNNING;
       pool_put_index (nm->suspended_process_frames,
                      p->suspended_process_frame_index);

 

I am not able to figure out why this process node is suspended in some scenarios. Can you please help me by providing some pointers to debug and resolve this issue. 

 

Hi Jinsh,

I applied your patch to my code. The issue is not solved with your patch. Thank you for helping me out.

 

Thanks and Regards,

Sudhir

 

 

On Fri, Mar 3, 2023 at 12:53 PM Sudhir CR via lists.fd.io <sudhir=rtbrick.com@...> wrote:

Hi Chetan,

In our case we are observing this issue occasionally exact steps  to recreate the issue are not known.

I made changes to our process node as suggested by dave and with these changes trying to recreate the issue.

Soon I will update my results and findings in this mail thread.

 

Thanks and Regards,

Sudhir

 

On Fri, Mar 3, 2023 at 12:37 PM chetan bhasin <chetan.bhasin017@...> wrote:

Hi Sudhir,

 

Is your issue resolved?

 

Actually we are facing same issue on vpp.2106. 

In our case "api-rx-ring" is not getting called.

in our usecase workers are calling some functions in main-thread context leading to RPC message and memory is allocated from api section.

This leads to Api-segment memory is used fully and leads to crash.

 

Thanks,

Chetan 

 

On Mon, Feb 20, 2023, 18:24 Sudhir CR via lists.fd.io <sudhir=rtbrick.com@...> wrote:

Hi Dave,

Thank you very much for your inputs. I will try this out and get back to you with the results.

 

Regards,

Sudhir 

 

On Mon, Feb 20, 2023 at 6:01 PM Dave Barach <vpp@...> wrote:

Please try something like this, to eliminate the possibility that some bit of code is sending this process an event. It’s not a good idea to skip the vec_reset_length (event_data) step.

 

while (1)

{

   uword event_type, * event_data = 0;

   int i;

 

   vlib_process_wait_for_event_or_clock (vm, 1e-2 /* 10 ms */);

 

   event_type = vlib_process_get_events (vm, &event_data);

 

   switch (event_type) {

  case ~0: /* handle timer expirations */

       rtb_event_loop_run_once ();

       break;

 

   default: /* bug! */

       ASSERT (0);

   }

 

   vec_reset_length(event_data);

}

 

From: vpp-dev@... <vpp-dev@...> On Behalf Of Sudhir CR via lists.fd.io
Sent: Monday, February 20, 2023 4:02 AM
To: vpp-dev@...
Subject: Re: [vpp-dev] process node suspended indefinitely

 

Hi Dave,
Thank you for your response and help. 

 

Please find the additional details below.

VPP Version 21.10


We are creating a process node rtb-vpp-epoll-process to handle control plane events like interface add/delete, route add/delete.
This process node waits for 10ms of time (Not Interested in any events ) once 10ms is expired it will process control plane events mentioned above.

code snippet looks like below 

 

```

static uword
rtb_vpp_epoll_process (vlib_main_t                 *vm,
                       vlib_node_runtime_t  *rt,
                       vlib_frame_t         *f)
{

    ...
    ...
    while (1) {
        vlib_process_wait_for_event_or_clock (vm, 10e-3);
        vlib_process_get_events (vm, NULL);

        rtb_event_loop_run_once();   <---- controlplane events handling 
    } 
}
``` 

What we observed is that sometimes (when there is a high controlplane load like request to install more routes) "rtb-vpp-epoll-process" is suspended and not scheduled furever. this we found by using "show runtime rtb-vpp-epoll-process"  (in "show runtime rtb-vpp-epoll-process" command output suspends counter is not incrementing.)

show runtime output in working case :


```
DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0          192246          1.91e6            0.00
DBGvpp#

DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0          193634          1.89e6            0.00
DBGvpp#

``` 

show runtime output in issue case :
```

DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0           81477          7.08e6            0.00
DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0           81477          7.08e6            0.00

```

Other process nodes like lldp-process, ip4-neighbor-age-process, ip6-ra-process running without any issue. only "rtb-vpp-epoll-process" process node suspended forever. 

 

Please let me know if any additional information is required.

Hi Jinsh,
Thanks for pointing me to the issue you faced. The issue I am facing looks similar.
I will verify with the given patch.


Thanks and Regards,

Sudhir

 

On Sun, Feb 19, 2023 at 6:19 AM jinsh11 <jinsh11@...> wrote:

HI

  • I have the same problem

bfd process node stop running. I raised this issue,

https://lists.fd.io/g/vpp-dev/message/22380
I think there is a problem with the porcess scheduling module when using the time wheel.

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com

 

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com





NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com




Dave Wallace
 

Unfortunately, VPP 21.10 is an unsupported version.  Even if someone had a fix for the problem, there is no CI available to verify any submissions to the stable/2110 branch.

If so desired (e.g. to share a fix), one may push a gerrit change to an unsupported stable branch, but it will not be merged.  The best practice is to fix the issue on 'master', then cherry-pick to stable branches.

Has the issue been reproduced on 'master'?

Thanks,
-daw-

On 4/5/2023 12:15 PM, chetan bhasin wrote:

Hello Everyone,

Any work around/patch for the below issue??

Thanks,
Chetan 

On Mon, Mar 13, 2023, 06:51 Sudhir CR via lists.fd.io <sudhir=rtbrick.com@...> wrote:
Hi Dave, 
we are using VPP Version 21.10.

Thanks and regards,
Sudhir

On Fri, Mar 10, 2023 at 5:31 PM Dave Barach <vpp@...> wrote:

I should have had the sense to ask this earlier: which version of vpp are you using?

 

The line number in your debug snippet is more than 100 lines off from master/latest. The timer wheel code has been relatively untouched, but there have been several important fixes over the years...

 

D.

 

diff --git a/src/vlib/main.c b/src/vlib/main.c
index af0fcd1cb..55c231d8b 100644
--- a/src/vlib/main.c
+++ b/src/vlib/main.c
@@ -1490,6 +1490,9 @@ dispatch_suspended_process (vlib_main_t * vm,
     }
   else
     {
+           if (strcmp((char *)node->name, "rtb-vpp-epoll-process") == 0) {
+                   ASSERT(0);
+           }

 

From: vpp-dev@... <vpp-dev@...> On Behalf Of Sudhir CR via lists.fd.io
Sent: Thursday, March 9, 2023 4:00 AM
To: vpp-dev@...
Cc: rtbrick.com@...
Subject: Re: [vpp-dev] process node suspended indefinitely

 

Hi Dave,

Please excuse my delayed response. It took some time to recreate this issue.

I made changes to our process node as per your suggestion. now our process node code looks like this

 

while (1) {

        vlib_process_wait_for_event_or_clock (vm, RTB_VPP_EPOLL_PROCESS_NODE_TIMER);
        event_type = vlib_process_get_events (vm, &event_data);
        vec_reset_length(event_data);

        switch (event_type) {
            case ~0: /* handle timer expirations */
                rtb_event_loop_run_once ();
                break;

            default: /* bug! */
                ASSERT (0);
        }
    }

After these changes we didn't observe any assertions but we hit the process node suspend issue. with this it is clear other than time out we are not getting any other events.

 

In the issue state I have collected vlib_process node (rtb_vpp_epoll_process) flags value and it seems to be correct (flags = 11).

 

Please find the vlib_process_t and vlib_node_t data structure values collected in the issue state below.

 

vlib_process_t:

============

$38 = {
  cacheline0 = 0x7f9b2da50380 "\200~\274+\233\177",
  node_runtime = {
    cacheline0 = 0x7f9b2da50380 "\200~\274+\233\177",
    function = 0x7f9b2bbc7e80 <rtb_vpp_epoll_process>,
    errors = 0x7f9b3076a560,
    clocks_since_last_overflow = 0,
    max_clock = 3785970526,
    max_clock_n = 0,
    calls_since_last_overflow = 0,
    vectors_since_last_overflow = 0,
    next_frame_index = 1668,
    node_index = 437,
    input_main_loops_per_call = 0,
    main_loop_count_last_dispatch = 4147405645,
    main_loop_vector_stats = {0, 0},
    flags = 0,
    state = 0,
    n_next_nodes = 0,
    cached_next_index = 0,
    thread_index = 0,
    runtime_data = 0x7f9b2da503c6 ""
  },
  return_longjmp = {
    regs = {94502584873984, 140304430422064, 140306731463680, 94502584874048, 94502640552512, 0, 140304430422032, 140306703608766}
  },
  resume_longjmp = {
    regs = {94502584873984, 140304161734368, 140306731463680, 94502584874048, 94502640552512, 0, 140304161734272, 140304430441787}
  },
  flags = 11,
  log2_n_stack_bytes = 16,
  suspended_process_frame_index = 0,
  n_suspends = 0,
  pending_event_data_by_type_index = 0x7f9b307b8310,
  non_empty_event_type_bitmap = 0x7f9b307b8390,
  one_time_event_type_bitmap = 0x0,
  event_type_index_by_type_opaque = 0x7f9b2dab8bd8,
  event_type_pool = 0x7f9b2dcb5978,
  resume_clock_interval = 1000,
  stop_timer_handle = 3098,
  output_function = 0x0,
  output_function_arg = 0,
  stack = 0x7f9b1bb78000
}

 

vlib_node_t

=========

 (gdb) p *n

$17 = {
  function = 0x7f9b2bbc7e80 <rtb_vpp_epoll_process>,
  name = 0x7f9b3076a3f0 "rtb-vpp-epoll-process",
  name_elog_string = 11783,
  stats_total = {
    calls = 0,
    vectors = 0,
    clocks = 1971244932732,
    suspends = 6847366,
    max_clock = 3785970526,
    max_clock_n = 0
  },
  stats_last_clear = {
    calls = 0,
    vectors = 0,
    clocks = 0,
    suspends = 0,
    max_clock = 0,
    max_clock_n = 0
  },
  type = VLIB_NODE_TYPE_PROCESS,
  index = 437,
  runtime_index = 40,
  runtime_data = 0x0,
  flags = 0,
  state = 0 '\000',
  runtime_data_bytes = 0 '\000',
  protocol_hint = 0 '\000',
  n_errors = 0,
  scalar_size = 0,
  vector_size = 0,
  error_heap_handle = 0,
  error_heap_index = 0,
  error_counters = 0x0,
  next_node_names = 0x7f9b3076a530,
  next_nodes = 0x0,
  sibling_of = 0x0,
  sibling_bitmap = 0x0,
  n_vectors_by_next_node = 0x0,
  next_slot_by_node = 0x0,
  prev_node_bitmap = 0x0,
  owner_node_index = 4294967295,
  owner_next_index = 4294967295,
  format_buffer = 0x0,
  unformat_buffer = 0x0,
  format_trace = 0x0,
  validate_frame = 0x0,
  state_string = 0x0,
  node_fn_registrations = 0x0
}

 

I added an assert statement before clearing VLIB_PROCESS_IS_RUNNING flag in dispatch_suspended_process function.

But this assert statement is not hitting.

 

diff --git a/src/vlib/main.c b/src/vlib/main.c
index af0fcd1cb..55c231d8b 100644
--- a/src/vlib/main.c
+++ b/src/vlib/main.c
@@ -1490,6 +1490,9 @@ dispatch_suspended_process (vlib_main_t * vm,
     }
   else
     {
+           if (strcmp((char *)node->name, "rtb-vpp-epoll-process") == 0) {
+                   ASSERT(0);
+           }
       p->flags &= ~VLIB_PROCESS_IS_RUNNING;
       pool_put_index (nm->suspended_process_frames,
                      p->suspended_process_frame_index);

 

I am not able to figure out why this process node is suspended in some scenarios. Can you please help me by providing some pointers to debug and resolve this issue. 

 

Hi Jinsh,

I applied your patch to my code. The issue is not solved with your patch. Thank you for helping me out.

 

Thanks and Regards,

Sudhir

 

 

On Fri, Mar 3, 2023 at 12:53 PM Sudhir CR via lists.fd.io <sudhir=rtbrick.com@...> wrote:

Hi Chetan,

In our case we are observing this issue occasionally exact steps  to recreate the issue are not known.

I made changes to our process node as suggested by dave and with these changes trying to recreate the issue.

Soon I will update my results and findings in this mail thread.

 

Thanks and Regards,

Sudhir

 

On Fri, Mar 3, 2023 at 12:37 PM chetan bhasin <chetan.bhasin017@...> wrote:

Hi Sudhir,

 

Is your issue resolved?

 

Actually we are facing same issue on vpp.2106. 

In our case "api-rx-ring" is not getting called.

in our usecase workers are calling some functions in main-thread context leading to RPC message and memory is allocated from api section.

This leads to Api-segment memory is used fully and leads to crash.

 

Thanks,

Chetan 

 

On Mon, Feb 20, 2023, 18:24 Sudhir CR via lists.fd.io <sudhir=rtbrick.com@...> wrote:

Hi Dave,

Thank you very much for your inputs. I will try this out and get back to you with the results.

 

Regards,

Sudhir 

 

On Mon, Feb 20, 2023 at 6:01 PM Dave Barach <vpp@...> wrote:

Please try something like this, to eliminate the possibility that some bit of code is sending this process an event. It’s not a good idea to skip the vec_reset_length (event_data) step.

 

while (1)

{

   uword event_type, * event_data = 0;

   int i;

 

   vlib_process_wait_for_event_or_clock (vm, 1e-2 /* 10 ms */);

 

   event_type = vlib_process_get_events (vm, &event_data);

 

   switch (event_type) {

  case ~0: /* handle timer expirations */

       rtb_event_loop_run_once ();

       break;

 

   default: /* bug! */

       ASSERT (0);

   }

 

   vec_reset_length(event_data);

}

 

From: vpp-dev@... <vpp-dev@...> On Behalf Of Sudhir CR via lists.fd.io
Sent: Monday, February 20, 2023 4:02 AM
To: vpp-dev@...
Subject: Re: [vpp-dev] process node suspended indefinitely

 

Hi Dave,
Thank you for your response and help. 

 

Please find the additional details below.

VPP Version 21.10


We are creating a process node rtb-vpp-epoll-process to handle control plane events like interface add/delete, route add/delete.
This process node waits for 10ms of time (Not Interested in any events ) once 10ms is expired it will process control plane events mentioned above.

code snippet looks like below 

 

```

static uword
rtb_vpp_epoll_process (vlib_main_t                 *vm,
                       vlib_node_runtime_t  *rt,
                       vlib_frame_t         *f)
{

    ...
    ...
    while (1) {
        vlib_process_wait_for_event_or_clock (vm, 10e-3);
        vlib_process_get_events (vm, NULL);

        rtb_event_loop_run_once();   <---- controlplane events handling 
    } 
}
``` 

What we observed is that sometimes (when there is a high controlplane load like request to install more routes) "rtb-vpp-epoll-process" is suspended and not scheduled furever. this we found by using "show runtime rtb-vpp-epoll-process"  (in "show runtime rtb-vpp-epoll-process" command output suspends counter is not incrementing.)

show runtime output in working case :


```
DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0          192246          1.91e6            0.00
DBGvpp#

DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0          193634          1.89e6            0.00
DBGvpp#

``` 

show runtime output in issue case :
```

DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0           81477          7.08e6            0.00
DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0           81477          7.08e6            0.00

```

Other process nodes like lldp-process, ip4-neighbor-age-process, ip6-ra-process running without any issue. only "rtb-vpp-epoll-process" process node suspended forever. 

 

Please let me know if any additional information is required.

Hi Jinsh,
Thanks for pointing me to the issue you faced. The issue I am facing looks similar.
I will verify with the given patch.


Thanks and Regards,

Sudhir

 

On Sun, Feb 19, 2023 at 6:19 AM jinsh11 <jinsh11@...> wrote:

HI

  • I have the same problem

bfd process node stop running. I raised this issue,

https://lists.fd.io/g/vpp-dev/message/22380
I think there is a problem with the porcess scheduling module when using the time wheel.

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com

 

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com





NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com








Sudhir CR
 

Hi Dave ,
unfortunately we have not yet integrated master code with our product. So we are not able to test the issue with latest master code.

While testing the issue I made below change to our process node after that this issue is not seen.
 
    while (1) {

        vlib_process_wait_for_event_or_clock (vm, RTB_VPP_EPOLL_PROCESS_NODE_TIMER);
        event_type = vlib_process_get_events (vm, &event_data);
        vec_reset_length(event_data);

        switch (event_type) {
            case ~0: /* handle timer expirations */
                rtb_event_loop_run_once ();
                vlib_process_suspend (vm, RTB_VPP_EPOLL_PROCESS_NODE_TIMER);  <--- added newly
                break;

            default: /* bug! */
                clib_warning ("BUG: event type 0x%wx", event_type);
                ASSERT (0);
        }
    }   

This is just workaround to our problem, if by any chance our process node is missing in nm->data_from_advancing_timing_wheel list
vlib_process_suspend will add process node to  data_from_advancing_timing_wheel  so that our process node will be scheduled again.

Thanks and Regards,
Sudhir

On Wed, Apr 5, 2023 at 10:02 PM Dave Wallace <dwallacelf@...> wrote:
Unfortunately, VPP 21.10 is an unsupported version.  Even if someone had a fix for the problem, there is no CI available to verify any submissions to the stable/2110 branch.

If so desired (e.g. to share a fix), one may push a gerrit change to an unsupported stable branch, but it will not be merged.  The best practice is to fix the issue on 'master', then cherry-pick to stable branches.

Has the issue been reproduced on 'master'?

Thanks,
-daw-

On 4/5/2023 12:15 PM, chetan bhasin wrote:
Hello Everyone,

Any work around/patch for the below issue??

Thanks,
Chetan 

On Mon, Mar 13, 2023, 06:51 Sudhir CR via lists.fd.io <sudhir=rtbrick.com@...> wrote:
Hi Dave, 
we are using VPP Version 21.10.

Thanks and regards,
Sudhir

On Fri, Mar 10, 2023 at 5:31 PM Dave Barach <vpp@...> wrote:

I should have had the sense to ask this earlier: which version of vpp are you using?

 

The line number in your debug snippet is more than 100 lines off from master/latest. The timer wheel code has been relatively untouched, but there have been several important fixes over the years...

 

D.

 

diff --git a/src/vlib/main.c b/src/vlib/main.c
index af0fcd1cb..55c231d8b 100644
--- a/src/vlib/main.c
+++ b/src/vlib/main.c
@@ -1490,6 +1490,9 @@ dispatch_suspended_process (vlib_main_t * vm,
     }
   else
     {
+           if (strcmp((char *)node->name, "rtb-vpp-epoll-process") == 0) {
+                   ASSERT(0);
+           }

 

From: vpp-dev@... <vpp-dev@...> On Behalf Of Sudhir CR via lists.fd.io
Sent: Thursday, March 9, 2023 4:00 AM
To: vpp-dev@...
Cc: rtbrick.com@...
Subject: Re: [vpp-dev] process node suspended indefinitely

 

Hi Dave,

Please excuse my delayed response. It took some time to recreate this issue.

I made changes to our process node as per your suggestion. now our process node code looks like this

 

while (1) {

        vlib_process_wait_for_event_or_clock (vm, RTB_VPP_EPOLL_PROCESS_NODE_TIMER);
        event_type = vlib_process_get_events (vm, &event_data);
        vec_reset_length(event_data);

        switch (event_type) {
            case ~0: /* handle timer expirations */
                rtb_event_loop_run_once ();
                break;

            default: /* bug! */
                ASSERT (0);
        }
    }

After these changes we didn't observe any assertions but we hit the process node suspend issue. with this it is clear other than time out we are not getting any other events.

 

In the issue state I have collected vlib_process node (rtb_vpp_epoll_process) flags value and it seems to be correct (flags = 11).

 

Please find the vlib_process_t and vlib_node_t data structure values collected in the issue state below.

 

vlib_process_t:

============

$38 = {
  cacheline0 = 0x7f9b2da50380 "\200~\274+\233\177",
  node_runtime = {
    cacheline0 = 0x7f9b2da50380 "\200~\274+\233\177",
    function = 0x7f9b2bbc7e80 <rtb_vpp_epoll_process>,
    errors = 0x7f9b3076a560,
    clocks_since_last_overflow = 0,
    max_clock = 3785970526,
    max_clock_n = 0,
    calls_since_last_overflow = 0,
    vectors_since_last_overflow = 0,
    next_frame_index = 1668,
    node_index = 437,
    input_main_loops_per_call = 0,
    main_loop_count_last_dispatch = 4147405645,
    main_loop_vector_stats = {0, 0},
    flags = 0,
    state = 0,
    n_next_nodes = 0,
    cached_next_index = 0,
    thread_index = 0,
    runtime_data = 0x7f9b2da503c6 ""
  },
  return_longjmp = {
    regs = {94502584873984, 140304430422064, 140306731463680, 94502584874048, 94502640552512, 0, 140304430422032, 140306703608766}
  },
  resume_longjmp = {
    regs = {94502584873984, 140304161734368, 140306731463680, 94502584874048, 94502640552512, 0, 140304161734272, 140304430441787}
  },
  flags = 11,
  log2_n_stack_bytes = 16,
  suspended_process_frame_index = 0,
  n_suspends = 0,
  pending_event_data_by_type_index = 0x7f9b307b8310,
  non_empty_event_type_bitmap = 0x7f9b307b8390,
  one_time_event_type_bitmap = 0x0,
  event_type_index_by_type_opaque = 0x7f9b2dab8bd8,
  event_type_pool = 0x7f9b2dcb5978,
  resume_clock_interval = 1000,
  stop_timer_handle = 3098,
  output_function = 0x0,
  output_function_arg = 0,
  stack = 0x7f9b1bb78000
}

 

vlib_node_t

=========

 (gdb) p *n

$17 = {
  function = 0x7f9b2bbc7e80 <rtb_vpp_epoll_process>,
  name = 0x7f9b3076a3f0 "rtb-vpp-epoll-process",
  name_elog_string = 11783,
  stats_total = {
    calls = 0,
    vectors = 0,
    clocks = 1971244932732,
    suspends = 6847366,
    max_clock = 3785970526,
    max_clock_n = 0
  },
  stats_last_clear = {
    calls = 0,
    vectors = 0,
    clocks = 0,
    suspends = 0,
    max_clock = 0,
    max_clock_n = 0
  },
  type = VLIB_NODE_TYPE_PROCESS,
  index = 437,
  runtime_index = 40,
  runtime_data = 0x0,
  flags = 0,
  state = 0 '\000',
  runtime_data_bytes = 0 '\000',
  protocol_hint = 0 '\000',
  n_errors = 0,
  scalar_size = 0,
  vector_size = 0,
  error_heap_handle = 0,
  error_heap_index = 0,
  error_counters = 0x0,
  next_node_names = 0x7f9b3076a530,
  next_nodes = 0x0,
  sibling_of = 0x0,
  sibling_bitmap = 0x0,
  n_vectors_by_next_node = 0x0,
  next_slot_by_node = 0x0,
  prev_node_bitmap = 0x0,
  owner_node_index = 4294967295,
  owner_next_index = 4294967295,
  format_buffer = 0x0,
  unformat_buffer = 0x0,
  format_trace = 0x0,
  validate_frame = 0x0,
  state_string = 0x0,
  node_fn_registrations = 0x0
}

 

I added an assert statement before clearing VLIB_PROCESS_IS_RUNNING flag in dispatch_suspended_process function.

But this assert statement is not hitting.

 

diff --git a/src/vlib/main.c b/src/vlib/main.c
index af0fcd1cb..55c231d8b 100644
--- a/src/vlib/main.c
+++ b/src/vlib/main.c
@@ -1490,6 +1490,9 @@ dispatch_suspended_process (vlib_main_t * vm,
     }
   else
     {
+           if (strcmp((char *)node->name, "rtb-vpp-epoll-process") == 0) {
+                   ASSERT(0);
+           }
       p->flags &= ~VLIB_PROCESS_IS_RUNNING;
       pool_put_index (nm->suspended_process_frames,
                      p->suspended_process_frame_index);

 

I am not able to figure out why this process node is suspended in some scenarios. Can you please help me by providing some pointers to debug and resolve this issue. 

 

Hi Jinsh,

I applied your patch to my code. The issue is not solved with your patch. Thank you for helping me out.

 

Thanks and Regards,

Sudhir

 

 

On Fri, Mar 3, 2023 at 12:53 PM Sudhir CR via lists.fd.io <sudhir=rtbrick.com@...> wrote:

Hi Chetan,

In our case we are observing this issue occasionally exact steps  to recreate the issue are not known.

I made changes to our process node as suggested by dave and with these changes trying to recreate the issue.

Soon I will update my results and findings in this mail thread.

 

Thanks and Regards,

Sudhir

 

On Fri, Mar 3, 2023 at 12:37 PM chetan bhasin <chetan.bhasin017@...> wrote:

Hi Sudhir,

 

Is your issue resolved?

 

Actually we are facing same issue on vpp.2106. 

In our case "api-rx-ring" is not getting called.

in our usecase workers are calling some functions in main-thread context leading to RPC message and memory is allocated from api section.

This leads to Api-segment memory is used fully and leads to crash.

 

Thanks,

Chetan 

 

On Mon, Feb 20, 2023, 18:24 Sudhir CR via lists.fd.io <sudhir=rtbrick.com@...> wrote:

Hi Dave,

Thank you very much for your inputs. I will try this out and get back to you with the results.

 

Regards,

Sudhir 

 

On Mon, Feb 20, 2023 at 6:01 PM Dave Barach <vpp@...> wrote:

Please try something like this, to eliminate the possibility that some bit of code is sending this process an event. It’s not a good idea to skip the vec_reset_length (event_data) step.

 

while (1)

{

   uword event_type, * event_data = 0;

   int i;

 

   vlib_process_wait_for_event_or_clock (vm, 1e-2 /* 10 ms */);

 

   event_type = vlib_process_get_events (vm, &event_data);

 

   switch (event_type) {

  case ~0: /* handle timer expirations */

       rtb_event_loop_run_once ();

       break;

 

   default: /* bug! */

       ASSERT (0);

   }

 

   vec_reset_length(event_data);

}

 

From: vpp-dev@... <vpp-dev@...> On Behalf Of Sudhir CR via lists.fd.io
Sent: Monday, February 20, 2023 4:02 AM
To: vpp-dev@...
Subject: Re: [vpp-dev] process node suspended indefinitely

 

Hi Dave,
Thank you for your response and help. 

 

Please find the additional details below.

VPP Version 21.10


We are creating a process node rtb-vpp-epoll-process to handle control plane events like interface add/delete, route add/delete.
This process node waits for 10ms of time (Not Interested in any events ) once 10ms is expired it will process control plane events mentioned above.

code snippet looks like below 

 

```

static uword
rtb_vpp_epoll_process (vlib_main_t                 *vm,
                       vlib_node_runtime_t  *rt,
                       vlib_frame_t         *f)
{

    ...
    ...
    while (1) {
        vlib_process_wait_for_event_or_clock (vm, 10e-3);
        vlib_process_get_events (vm, NULL);

        rtb_event_loop_run_once();   <---- controlplane events handling 
    } 
}
``` 

What we observed is that sometimes (when there is a high controlplane load like request to install more routes) "rtb-vpp-epoll-process" is suspended and not scheduled furever. this we found by using "show runtime rtb-vpp-epoll-process"  (in "show runtime rtb-vpp-epoll-process" command output suspends counter is not incrementing.)

show runtime output in working case :


```
DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0          192246          1.91e6            0.00
DBGvpp#

DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0          193634          1.89e6            0.00
DBGvpp#

``` 

show runtime output in issue case :
```

DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0           81477          7.08e6            0.00
DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0           81477          7.08e6            0.00

```

Other process nodes like lldp-process, ip4-neighbor-age-process, ip6-ra-process running without any issue. only "rtb-vpp-epoll-process" process node suspended forever. 

 

Please let me know if any additional information is required.

Hi Jinsh,
Thanks for pointing me to the issue you faced. The issue I am facing looks similar.
I will verify with the given patch.


Thanks and Regards,

Sudhir

 

On Sun, Feb 19, 2023 at 6:19 AM jinsh11 <jinsh11@...> wrote:

HI

  • I have the same problem

bfd process node stop running. I raised this issue,

https://lists.fd.io/g/vpp-dev/message/22380
I think there is a problem with the porcess scheduling module when using the time wheel.

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com

 

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com





NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com





    





NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com


Dave Wallace
 

Sudhir,

Good to hear that you have a workaround.  Hopefully you will be able to move your product forward to a recent release soon.

Thanks,
-daw-

On 4/5/2023 11:43 PM, Sudhir CR via lists.fd.io wrote:

Hi Dave ,
unfortunately we have not yet integrated master code with our product. So we are not able to test the issue with latest master code.

While testing the issue I made below change to our process node after that this issue is not seen.
 
    while (1) {

        vlib_process_wait_for_event_or_clock (vm, RTB_VPP_EPOLL_PROCESS_NODE_TIMER);
        event_type = vlib_process_get_events (vm, &event_data);
        vec_reset_length(event_data);

        switch (event_type) {
            case ~0: /* handle timer expirations */
                rtb_event_loop_run_once ();
                vlib_process_suspend (vm, RTB_VPP_EPOLL_PROCESS_NODE_TIMER);  <--- added newly
                break;

            default: /* bug! */
                clib_warning ("BUG: event type 0x%wx", event_type);
                ASSERT (0);
        }
    }   

This is just workaround to our problem, if by any chance our process node is missing in nm->data_from_advancing_timing_wheel list
vlib_process_suspend will add process node to  data_from_advancing_timing_wheel  so that our process node will be scheduled again.

Thanks and Regards,
Sudhir

On Wed, Apr 5, 2023 at 10:02 PM Dave Wallace <dwallacelf@...> wrote:
Unfortunately, VPP 21.10 is an unsupported version.  Even if someone had a fix for the problem, there is no CI available to verify any submissions to the stable/2110 branch.

If so desired (e.g. to share a fix), one may push a gerrit change to an unsupported stable branch, but it will not be merged.  The best practice is to fix the issue on 'master', then cherry-pick to stable branches.

Has the issue been reproduced on 'master'?

Thanks,
-daw-

On 4/5/2023 12:15 PM, chetan bhasin wrote:
Hello Everyone,

Any work around/patch for the below issue??

Thanks,
Chetan 

On Mon, Mar 13, 2023, 06:51 Sudhir CR via lists.fd.io <sudhir=rtbrick.com@...> wrote:
Hi Dave, 
we are using VPP Version 21.10.

Thanks and regards,
Sudhir

On Fri, Mar 10, 2023 at 5:31 PM Dave Barach <vpp@...> wrote:

I should have had the sense to ask this earlier: which version of vpp are you using?

 

The line number in your debug snippet is more than 100 lines off from master/latest. The timer wheel code has been relatively untouched, but there have been several important fixes over the years...

 

D.

 

diff --git a/src/vlib/main.c b/src/vlib/main.c
index af0fcd1cb..55c231d8b 100644
--- a/src/vlib/main.c
+++ b/src/vlib/main.c
@@ -1490,6 +1490,9 @@ dispatch_suspended_process (vlib_main_t * vm,
     }
   else
     {
+           if (strcmp((char *)node->name, "rtb-vpp-epoll-process") == 0) {
+                   ASSERT(0);
+           }

 

From: vpp-dev@... <vpp-dev@...> On Behalf Of Sudhir CR via lists.fd.io
Sent: Thursday, March 9, 2023 4:00 AM
To: vpp-dev@...
Cc: rtbrick.com@...
Subject: Re: [vpp-dev] process node suspended indefinitely

 

Hi Dave,

Please excuse my delayed response. It took some time to recreate this issue.

I made changes to our process node as per your suggestion. now our process node code looks like this

 

while (1) {

        vlib_process_wait_for_event_or_clock (vm, RTB_VPP_EPOLL_PROCESS_NODE_TIMER);
        event_type = vlib_process_get_events (vm, &event_data);
        vec_reset_length(event_data);

        switch (event_type) {
            case ~0: /* handle timer expirations */
                rtb_event_loop_run_once ();
                break;

            default: /* bug! */
                ASSERT (0);
        }
    }

After these changes we didn't observe any assertions but we hit the process node suspend issue. with this it is clear other than time out we are not getting any other events.

 

In the issue state I have collected vlib_process node (rtb_vpp_epoll_process) flags value and it seems to be correct (flags = 11).

 

Please find the vlib_process_t and vlib_node_t data structure values collected in the issue state below.

 

vlib_process_t:

============

$38 = {
  cacheline0 = 0x7f9b2da50380 "\200~\274+\233\177",
  node_runtime = {
    cacheline0 = 0x7f9b2da50380 "\200~\274+\233\177",
    function = 0x7f9b2bbc7e80 <rtb_vpp_epoll_process>,
    errors = 0x7f9b3076a560,
    clocks_since_last_overflow = 0,
    max_clock = 3785970526,
    max_clock_n = 0,
    calls_since_last_overflow = 0,
    vectors_since_last_overflow = 0,
    next_frame_index = 1668,
    node_index = 437,
    input_main_loops_per_call = 0,
    main_loop_count_last_dispatch = 4147405645,
    main_loop_vector_stats = {0, 0},
    flags = 0,
    state = 0,
    n_next_nodes = 0,
    cached_next_index = 0,
    thread_index = 0,
    runtime_data = 0x7f9b2da503c6 ""
  },
  return_longjmp = {
    regs = {94502584873984, 140304430422064, 140306731463680, 94502584874048, 94502640552512, 0, 140304430422032, 140306703608766}
  },
  resume_longjmp = {
    regs = {94502584873984, 140304161734368, 140306731463680, 94502584874048, 94502640552512, 0, 140304161734272, 140304430441787}
  },
  flags = 11,
  log2_n_stack_bytes = 16,
  suspended_process_frame_index = 0,
  n_suspends = 0,
  pending_event_data_by_type_index = 0x7f9b307b8310,
  non_empty_event_type_bitmap = 0x7f9b307b8390,
  one_time_event_type_bitmap = 0x0,
  event_type_index_by_type_opaque = 0x7f9b2dab8bd8,
  event_type_pool = 0x7f9b2dcb5978,
  resume_clock_interval = 1000,
  stop_timer_handle = 3098,
  output_function = 0x0,
  output_function_arg = 0,
  stack = 0x7f9b1bb78000
}

 

vlib_node_t

=========

 (gdb) p *n

$17 = {
  function = 0x7f9b2bbc7e80 <rtb_vpp_epoll_process>,
  name = 0x7f9b3076a3f0 "rtb-vpp-epoll-process",
  name_elog_string = 11783,
  stats_total = {
    calls = 0,
    vectors = 0,
    clocks = 1971244932732,
    suspends = 6847366,
    max_clock = 3785970526,
    max_clock_n = 0
  },
  stats_last_clear = {
    calls = 0,
    vectors = 0,
    clocks = 0,
    suspends = 0,
    max_clock = 0,
    max_clock_n = 0
  },
  type = VLIB_NODE_TYPE_PROCESS,
  index = 437,
  runtime_index = 40,
  runtime_data = 0x0,
  flags = 0,
  state = 0 '\000',
  runtime_data_bytes = 0 '\000',
  protocol_hint = 0 '\000',
  n_errors = 0,
  scalar_size = 0,
  vector_size = 0,
  error_heap_handle = 0,
  error_heap_index = 0,
  error_counters = 0x0,
  next_node_names = 0x7f9b3076a530,
  next_nodes = 0x0,
  sibling_of = 0x0,
  sibling_bitmap = 0x0,
  n_vectors_by_next_node = 0x0,
  next_slot_by_node = 0x0,
  prev_node_bitmap = 0x0,
  owner_node_index = 4294967295,
  owner_next_index = 4294967295,
  format_buffer = 0x0,
  unformat_buffer = 0x0,
  format_trace = 0x0,
  validate_frame = 0x0,
  state_string = 0x0,
  node_fn_registrations = 0x0
}

 

I added an assert statement before clearing VLIB_PROCESS_IS_RUNNING flag in dispatch_suspended_process function.

But this assert statement is not hitting.

 

diff --git a/src/vlib/main.c b/src/vlib/main.c
index af0fcd1cb..55c231d8b 100644
--- a/src/vlib/main.c
+++ b/src/vlib/main.c
@@ -1490,6 +1490,9 @@ dispatch_suspended_process (vlib_main_t * vm,
     }
   else
     {
+           if (strcmp((char *)node->name, "rtb-vpp-epoll-process") == 0) {
+                   ASSERT(0);
+           }
       p->flags &= ~VLIB_PROCESS_IS_RUNNING;
       pool_put_index (nm->suspended_process_frames,
                      p->suspended_process_frame_index);

 

I am not able to figure out why this process node is suspended in some scenarios. Can you please help me by providing some pointers to debug and resolve this issue. 

 

Hi Jinsh,

I applied your patch to my code. The issue is not solved with your patch. Thank you for helping me out.

 

Thanks and Regards,

Sudhir

 

 

On Fri, Mar 3, 2023 at 12:53 PM Sudhir CR via lists.fd.io <sudhir=rtbrick.com@...> wrote:

Hi Chetan,

In our case we are observing this issue occasionally exact steps  to recreate the issue are not known.

I made changes to our process node as suggested by dave and with these changes trying to recreate the issue.

Soon I will update my results and findings in this mail thread.

 

Thanks and Regards,

Sudhir

 

On Fri, Mar 3, 2023 at 12:37 PM chetan bhasin <chetan.bhasin017@...> wrote:

Hi Sudhir,

 

Is your issue resolved?

 

Actually we are facing same issue on vpp.2106. 

In our case "api-rx-ring" is not getting called.

in our usecase workers are calling some functions in main-thread context leading to RPC message and memory is allocated from api section.

This leads to Api-segment memory is used fully and leads to crash.

 

Thanks,

Chetan 

 

On Mon, Feb 20, 2023, 18:24 Sudhir CR via lists.fd.io <sudhir=rtbrick.com@...> wrote:

Hi Dave,

Thank you very much for your inputs. I will try this out and get back to you with the results.

 

Regards,

Sudhir 

 

On Mon, Feb 20, 2023 at 6:01 PM Dave Barach <vpp@...> wrote:

Please try something like this, to eliminate the possibility that some bit of code is sending this process an event. It’s not a good idea to skip the vec_reset_length (event_data) step.

 

while (1)

{

   uword event_type, * event_data = 0;

   int i;

 

   vlib_process_wait_for_event_or_clock (vm, 1e-2 /* 10 ms */);

 

   event_type = vlib_process_get_events (vm, &event_data);

 

   switch (event_type) {

  case ~0: /* handle timer expirations */

       rtb_event_loop_run_once ();

       break;

 

   default: /* bug! */

       ASSERT (0);

   }

 

   vec_reset_length(event_data);

}

 

From: vpp-dev@... <vpp-dev@...> On Behalf Of Sudhir CR via lists.fd.io
Sent: Monday, February 20, 2023 4:02 AM
To: vpp-dev@...
Subject: Re: [vpp-dev] process node suspended indefinitely

 

Hi Dave,
Thank you for your response and help. 

 

Please find the additional details below.

VPP Version 21.10


We are creating a process node rtb-vpp-epoll-process to handle control plane events like interface add/delete, route add/delete.
This process node waits for 10ms of time (Not Interested in any events ) once 10ms is expired it will process control plane events mentioned above.

code snippet looks like below 

 

```

static uword
rtb_vpp_epoll_process (vlib_main_t                 *vm,
                       vlib_node_runtime_t  *rt,
                       vlib_frame_t         *f)
{

    ...
    ...
    while (1) {
        vlib_process_wait_for_event_or_clock (vm, 10e-3);
        vlib_process_get_events (vm, NULL);

        rtb_event_loop_run_once();   <---- controlplane events handling 
    } 
}
``` 

What we observed is that sometimes (when there is a high controlplane load like request to install more routes) "rtb-vpp-epoll-process" is suspended and not scheduled furever. this we found by using "show runtime rtb-vpp-epoll-process"  (in "show runtime rtb-vpp-epoll-process" command output suspends counter is not incrementing.)

show runtime output in working case :


```
DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0          192246          1.91e6            0.00
DBGvpp#

DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0          193634          1.89e6            0.00
DBGvpp#

``` 

show runtime output in issue case :
```

DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0           81477          7.08e6            0.00
DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0           81477          7.08e6            0.00

```

Other process nodes like lldp-process, ip4-neighbor-age-process, ip6-ra-process running without any issue. only "rtb-vpp-epoll-process" process node suspended forever. 

 

Please let me know if any additional information is required.

Hi Jinsh,
Thanks for pointing me to the issue you faced. The issue I am facing looks similar.
I will verify with the given patch.


Thanks and Regards,

Sudhir

 

On Sun, Feb 19, 2023 at 6:19 AM jinsh11 <jinsh11@...> wrote:

HI

  • I have the same problem

bfd process node stop running. I raised this issue,

https://lists.fd.io/g/vpp-dev/message/22380
I think there is a problem with the porcess scheduling module when using the time wheel.

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com

 

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com





NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com









NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com





Matthew Smith
 


Some of Netgate's customers have observed a similar issue on a build based on stable/2206 and Pim van Pelt has observed the same issue on some of his production routers using a build based on something close to the current master branch. In these cases, the api-rx-from-ring node stopped being executed.

My colleague Alexander Chernavin observed in https://lists.fd.io/g/vpp-dev/topic/96500275#22508 that when this issue arises, the process node for api-rx-from-ring has a timer ID in stop_timer_handle which is actually in use by another process node (the user_handle field in the timer entry points back to the process ID of a different process node). I haven't been able to reproduce this problem in a lab environment, but the going theory seems to be that when timers for process nodes are stopped or expired, the stop_timer_handle is not reset and contains a stale timer ID value until the node is dispatched and suspends itself again. So you could have a series of events like this:

process node A's timer is stopped (Its stop_timer_handle still refers to that timer)
process node B is dispatched and suspends while adding a new timer. The new timer allocated has the ID formerly used by process A.
process node C is dispatched and signals process node A with vlib_process_signal_event(). vlib_process_signal_event_helper() sees that the timer referred to by stop_timer_handle on process A is not free (it was allocated and is in use by process node B) and stops it. Process node B has no timer set now.

There is an existing patch in gerrit which seems aimed at addressing this - https://gerrit.fd.io/r/c/vpp/+/38400. It did not pass the checkstyle verify job when it was submitted, so I rebased it and fixed the formatting issues. That patch alone did not correct the issue on the router where Pim is able to reproduce the issue - it only resets the stop_timer_handle on a process node when vlib_process_signal_event_helper() explicitly stops a timer, it does not do the same for timers which were returned to the pool by TW (tw_timer_expire_timers_vec). I just uploaded this additional patch which attempts to deal with those timers - https://gerrit.fd.io/r/c/vpp/+/38621. Pim is currently testing it, so I can't say yet whether it actually solves the problem. But I thought I would share it so anyone else having this problem could try them out and give feedback on whether they help.

AFAICT both patches should be able to be applied to stable/2110.

Thanks,
-Matt


On Thu, Apr 6, 2023 at 9:24 AM Dave Wallace <dwallacelf@...> wrote:
Sudhir,

Good to hear that you have a workaround.  Hopefully you will be able to move your product forward to a recent release soon.

Thanks,
-daw-

On 4/5/2023 11:43 PM, Sudhir CR via lists.fd.io wrote:
Hi Dave ,
unfortunately we have not yet integrated master code with our product. So we are not able to test the issue with latest master code.

While testing the issue I made below change to our process node after that this issue is not seen.
 
    while (1) {

        vlib_process_wait_for_event_or_clock (vm, RTB_VPP_EPOLL_PROCESS_NODE_TIMER);
        event_type = vlib_process_get_events (vm, &event_data);
        vec_reset_length(event_data);

        switch (event_type) {
            case ~0: /* handle timer expirations */
                rtb_event_loop_run_once ();
                vlib_process_suspend (vm, RTB_VPP_EPOLL_PROCESS_NODE_TIMER);  <--- added newly
                break;

            default: /* bug! */
                clib_warning ("BUG: event type 0x%wx", event_type);
                ASSERT (0);
        }
    }   

This is just workaround to our problem, if by any chance our process node is missing in nm->data_from_advancing_timing_wheel list
vlib_process_suspend will add process node to  data_from_advancing_timing_wheel  so that our process node will be scheduled again.

Thanks and Regards,
Sudhir

On Wed, Apr 5, 2023 at 10:02 PM Dave Wallace <dwallacelf@...> wrote:
Unfortunately, VPP 21.10 is an unsupported version.  Even if someone had a fix for the problem, there is no CI available to verify any submissions to the stable/2110 branch.

If so desired (e.g. to share a fix), one may push a gerrit change to an unsupported stable branch, but it will not be merged.  The best practice is to fix the issue on 'master', then cherry-pick to stable branches.

Has the issue been reproduced on 'master'?

Thanks,
-daw-

On 4/5/2023 12:15 PM, chetan bhasin wrote:
Hello Everyone,

Any work around/patch for the below issue??

Thanks,
Chetan 

On Mon, Mar 13, 2023, 06:51 Sudhir CR via lists.fd.io <sudhir=rtbrick.com@...> wrote:
Hi Dave, 
we are using VPP Version 21.10.

Thanks and regards,
Sudhir

On Fri, Mar 10, 2023 at 5:31 PM Dave Barach <vpp@...> wrote:

I should have had the sense to ask this earlier: which version of vpp are you using?

 

The line number in your debug snippet is more than 100 lines off from master/latest. The timer wheel code has been relatively untouched, but there have been several important fixes over the years...

 

D.

 

diff --git a/src/vlib/main.c b/src/vlib/main.c
index af0fcd1cb..55c231d8b 100644
--- a/src/vlib/main.c
+++ b/src/vlib/main.c
@@ -1490,6 +1490,9 @@ dispatch_suspended_process (vlib_main_t * vm,
     }
   else
     {
+           if (strcmp((char *)node->name, "rtb-vpp-epoll-process") == 0) {
+                   ASSERT(0);
+           }

 

From: vpp-dev@... <vpp-dev@...> On Behalf Of Sudhir CR via lists.fd.io
Sent: Thursday, March 9, 2023 4:00 AM
To: vpp-dev@...
Cc: rtbrick.com@...
Subject: Re: [vpp-dev] process node suspended indefinitely

 

Hi Dave,

Please excuse my delayed response. It took some time to recreate this issue.

I made changes to our process node as per your suggestion. now our process node code looks like this

 

while (1) {

        vlib_process_wait_for_event_or_clock (vm, RTB_VPP_EPOLL_PROCESS_NODE_TIMER);
        event_type = vlib_process_get_events (vm, &event_data);
        vec_reset_length(event_data);

        switch (event_type) {
            case ~0: /* handle timer expirations */
                rtb_event_loop_run_once ();
                break;

            default: /* bug! */
                ASSERT (0);
        }
    }

After these changes we didn't observe any assertions but we hit the process node suspend issue. with this it is clear other than time out we are not getting any other events.

 

In the issue state I have collected vlib_process node (rtb_vpp_epoll_process) flags value and it seems to be correct (flags = 11).

 

Please find the vlib_process_t and vlib_node_t data structure values collected in the issue state below.

 

vlib_process_t:

============

$38 = {
  cacheline0 = 0x7f9b2da50380 "\200~\274+\233\177",
  node_runtime = {
    cacheline0 = 0x7f9b2da50380 "\200~\274+\233\177",
    function = 0x7f9b2bbc7e80 <rtb_vpp_epoll_process>,
    errors = 0x7f9b3076a560,
    clocks_since_last_overflow = 0,
    max_clock = 3785970526,
    max_clock_n = 0,
    calls_since_last_overflow = 0,
    vectors_since_last_overflow = 0,
    next_frame_index = 1668,
    node_index = 437,
    input_main_loops_per_call = 0,
    main_loop_count_last_dispatch = 4147405645,
    main_loop_vector_stats = {0, 0},
    flags = 0,
    state = 0,
    n_next_nodes = 0,
    cached_next_index = 0,
    thread_index = 0,
    runtime_data = 0x7f9b2da503c6 ""
  },
  return_longjmp = {
    regs = {94502584873984, 140304430422064, 140306731463680, 94502584874048, 94502640552512, 0, 140304430422032, 140306703608766}
  },
  resume_longjmp = {
    regs = {94502584873984, 140304161734368, 140306731463680, 94502584874048, 94502640552512, 0, 140304161734272, 140304430441787}
  },
  flags = 11,
  log2_n_stack_bytes = 16,
  suspended_process_frame_index = 0,
  n_suspends = 0,
  pending_event_data_by_type_index = 0x7f9b307b8310,
  non_empty_event_type_bitmap = 0x7f9b307b8390,
  one_time_event_type_bitmap = 0x0,
  event_type_index_by_type_opaque = 0x7f9b2dab8bd8,
  event_type_pool = 0x7f9b2dcb5978,
  resume_clock_interval = 1000,
  stop_timer_handle = 3098,
  output_function = 0x0,
  output_function_arg = 0,
  stack = 0x7f9b1bb78000
}

 

vlib_node_t

=========

 (gdb) p *n

$17 = {
  function = 0x7f9b2bbc7e80 <rtb_vpp_epoll_process>,
  name = 0x7f9b3076a3f0 "rtb-vpp-epoll-process",
  name_elog_string = 11783,
  stats_total = {
    calls = 0,
    vectors = 0,
    clocks = 1971244932732,
    suspends = 6847366,
    max_clock = 3785970526,
    max_clock_n = 0
  },
  stats_last_clear = {
    calls = 0,
    vectors = 0,
    clocks = 0,
    suspends = 0,
    max_clock = 0,
    max_clock_n = 0
  },
  type = VLIB_NODE_TYPE_PROCESS,
  index = 437,
  runtime_index = 40,
  runtime_data = 0x0,
  flags = 0,
  state = 0 '\000',
  runtime_data_bytes = 0 '\000',
  protocol_hint = 0 '\000',
  n_errors = 0,
  scalar_size = 0,
  vector_size = 0,
  error_heap_handle = 0,
  error_heap_index = 0,
  error_counters = 0x0,
  next_node_names = 0x7f9b3076a530,
  next_nodes = 0x0,
  sibling_of = 0x0,
  sibling_bitmap = 0x0,
  n_vectors_by_next_node = 0x0,
  next_slot_by_node = 0x0,
  prev_node_bitmap = 0x0,
  owner_node_index = 4294967295,
  owner_next_index = 4294967295,
  format_buffer = 0x0,
  unformat_buffer = 0x0,
  format_trace = 0x0,
  validate_frame = 0x0,
  state_string = 0x0,
  node_fn_registrations = 0x0
}

 

I added an assert statement before clearing VLIB_PROCESS_IS_RUNNING flag in dispatch_suspended_process function.

But this assert statement is not hitting.

 

diff --git a/src/vlib/main.c b/src/vlib/main.c
index af0fcd1cb..55c231d8b 100644
--- a/src/vlib/main.c
+++ b/src/vlib/main.c
@@ -1490,6 +1490,9 @@ dispatch_suspended_process (vlib_main_t * vm,
     }
   else
     {
+           if (strcmp((char *)node->name, "rtb-vpp-epoll-process") == 0) {
+                   ASSERT(0);
+           }
       p->flags &= ~VLIB_PROCESS_IS_RUNNING;
       pool_put_index (nm->suspended_process_frames,
                      p->suspended_process_frame_index);

 

I am not able to figure out why this process node is suspended in some scenarios. Can you please help me by providing some pointers to debug and resolve this issue. 

 

Hi Jinsh,

I applied your patch to my code. The issue is not solved with your patch. Thank you for helping me out.

 

Thanks and Regards,

Sudhir

 

 

On Fri, Mar 3, 2023 at 12:53 PM Sudhir CR via lists.fd.io <sudhir=rtbrick.com@...> wrote:

Hi Chetan,

In our case we are observing this issue occasionally exact steps  to recreate the issue are not known.

I made changes to our process node as suggested by dave and with these changes trying to recreate the issue.

Soon I will update my results and findings in this mail thread.

 

Thanks and Regards,

Sudhir

 

On Fri, Mar 3, 2023 at 12:37 PM chetan bhasin <chetan.bhasin017@...> wrote:

Hi Sudhir,

 

Is your issue resolved?

 

Actually we are facing same issue on vpp.2106. 

In our case "api-rx-ring" is not getting called.

in our usecase workers are calling some functions in main-thread context leading to RPC message and memory is allocated from api section.

This leads to Api-segment memory is used fully and leads to crash.

 

Thanks,

Chetan 

 

On Mon, Feb 20, 2023, 18:24 Sudhir CR via lists.fd.io <sudhir=rtbrick.com@...> wrote:

Hi Dave,

Thank you very much for your inputs. I will try this out and get back to you with the results.

 

Regards,

Sudhir 

 

On Mon, Feb 20, 2023 at 6:01 PM Dave Barach <vpp@...> wrote:

Please try something like this, to eliminate the possibility that some bit of code is sending this process an event. It’s not a good idea to skip the vec_reset_length (event_data) step.

 

while (1)

{

   uword event_type, * event_data = 0;

   int i;

 

   vlib_process_wait_for_event_or_clock (vm, 1e-2 /* 10 ms */);

 

   event_type = vlib_process_get_events (vm, &event_data);

 

   switch (event_type) {

  case ~0: /* handle timer expirations */

       rtb_event_loop_run_once ();

       break;

 

   default: /* bug! */

       ASSERT (0);

   }

 

   vec_reset_length(event_data);

}

 

From: vpp-dev@... <vpp-dev@...> On Behalf Of Sudhir CR via lists.fd.io
Sent: Monday, February 20, 2023 4:02 AM
To: vpp-dev@...
Subject: Re: [vpp-dev] process node suspended indefinitely

 

Hi Dave,
Thank you for your response and help. 

 

Please find the additional details below.

VPP Version 21.10


We are creating a process node rtb-vpp-epoll-process to handle control plane events like interface add/delete, route add/delete.
This process node waits for 10ms of time (Not Interested in any events ) once 10ms is expired it will process control plane events mentioned above.

code snippet looks like below 

 

```

static uword
rtb_vpp_epoll_process (vlib_main_t                 *vm,
                       vlib_node_runtime_t  *rt,
                       vlib_frame_t         *f)
{

    ...
    ...
    while (1) {
        vlib_process_wait_for_event_or_clock (vm, 10e-3);
        vlib_process_get_events (vm, NULL);

        rtb_event_loop_run_once();   <---- controlplane events handling 
    } 
}
``` 

What we observed is that sometimes (when there is a high controlplane load like request to install more routes) "rtb-vpp-epoll-process" is suspended and not scheduled furever. this we found by using "show runtime rtb-vpp-epoll-process"  (in "show runtime rtb-vpp-epoll-process" command output suspends counter is not incrementing.)

show runtime output in working case :


```
DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0          192246          1.91e6            0.00
DBGvpp#

DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0          193634          1.89e6            0.00
DBGvpp#

``` 

show runtime output in issue case :
```

DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0           81477          7.08e6            0.00
DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0           81477          7.08e6            0.00

```

Other process nodes like lldp-process, ip4-neighbor-age-process, ip6-ra-process running without any issue. only "rtb-vpp-epoll-process" process node suspended forever. 

 

Please let me know if any additional information is required.

Hi Jinsh,
Thanks for pointing me to the issue you faced. The issue I am facing looks similar.
I will verify with the given patch.


Thanks and Regards,

Sudhir

 

On Sun, Feb 19, 2023 at 6:19 AM jinsh11 <jinsh11@...> wrote:

HI

  • I have the same problem

bfd process node stop running. I raised this issue,

https://lists.fd.io/g/vpp-dev/message/22380
I think there is a problem with the porcess scheduling module when using the time wheel.

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com

 

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com





NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com









NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com


    





Matthew Smith
 


Testing since thursday on a build with https://gerrit.fd.io/r/c/vpp/+/38400 and https://gerrit.fd.io/r/c/vpp/+/38621 applied seems to have eliminated the issue with api-rx-from-ring on the production system where the issue could be reproduced. Hopefully these patches can also help with the process node issues reported by other users.

-Matt


On Thu, Apr 6, 2023 at 6:37 PM Matthew Smith via lists.fd.io <mgsmith=netgate.com@...> wrote:

Some of Netgate's customers have observed a similar issue on a build based on stable/2206 and Pim van Pelt has observed the same issue on some of his production routers using a build based on something close to the current master branch. In these cases, the api-rx-from-ring node stopped being executed.

My colleague Alexander Chernavin observed in https://lists.fd.io/g/vpp-dev/topic/96500275#22508 that when this issue arises, the process node for api-rx-from-ring has a timer ID in stop_timer_handle which is actually in use by another process node (the user_handle field in the timer entry points back to the process ID of a different process node). I haven't been able to reproduce this problem in a lab environment, but the going theory seems to be that when timers for process nodes are stopped or expired, the stop_timer_handle is not reset and contains a stale timer ID value until the node is dispatched and suspends itself again. So you could have a series of events like this:

process node A's timer is stopped (Its stop_timer_handle still refers to that timer)
process node B is dispatched and suspends while adding a new timer. The new timer allocated has the ID formerly used by process A.
process node C is dispatched and signals process node A with vlib_process_signal_event(). vlib_process_signal_event_helper() sees that the timer referred to by stop_timer_handle on process A is not free (it was allocated and is in use by process node B) and stops it. Process node B has no timer set now.

There is an existing patch in gerrit which seems aimed at addressing this - https://gerrit.fd.io/r/c/vpp/+/38400. It did not pass the checkstyle verify job when it was submitted, so I rebased it and fixed the formatting issues. That patch alone did not correct the issue on the router where Pim is able to reproduce the issue - it only resets the stop_timer_handle on a process node when vlib_process_signal_event_helper() explicitly stops a timer, it does not do the same for timers which were returned to the pool by TW (tw_timer_expire_timers_vec). I just uploaded this additional patch which attempts to deal with those timers - https://gerrit.fd.io/r/c/vpp/+/38621. Pim is currently testing it, so I can't say yet whether it actually solves the problem. But I thought I would share it so anyone else having this problem could try them out and give feedback on whether they help.

AFAICT both patches should be able to be applied to stable/2110.

Thanks,
-Matt


On Thu, Apr 6, 2023 at 9:24 AM Dave Wallace <dwallacelf@...> wrote:
Sudhir,

Good to hear that you have a workaround.  Hopefully you will be able to move your product forward to a recent release soon.

Thanks,
-daw-

On 4/5/2023 11:43 PM, Sudhir CR via lists.fd.io wrote:
Hi Dave ,
unfortunately we have not yet integrated master code with our product. So we are not able to test the issue with latest master code.

While testing the issue I made below change to our process node after that this issue is not seen.
 
    while (1) {

        vlib_process_wait_for_event_or_clock (vm, RTB_VPP_EPOLL_PROCESS_NODE_TIMER);
        event_type = vlib_process_get_events (vm, &event_data);
        vec_reset_length(event_data);

        switch (event_type) {
            case ~0: /* handle timer expirations */
                rtb_event_loop_run_once ();
                vlib_process_suspend (vm, RTB_VPP_EPOLL_PROCESS_NODE_TIMER);  <--- added newly
                break;

            default: /* bug! */
                clib_warning ("BUG: event type 0x%wx", event_type);
                ASSERT (0);
        }
    }   

This is just workaround to our problem, if by any chance our process node is missing in nm->data_from_advancing_timing_wheel list
vlib_process_suspend will add process node to  data_from_advancing_timing_wheel  so that our process node will be scheduled again.

Thanks and Regards,
Sudhir

On Wed, Apr 5, 2023 at 10:02 PM Dave Wallace <dwallacelf@...> wrote:
Unfortunately, VPP 21.10 is an unsupported version.  Even if someone had a fix for the problem, there is no CI available to verify any submissions to the stable/2110 branch.

If so desired (e.g. to share a fix), one may push a gerrit change to an unsupported stable branch, but it will not be merged.  The best practice is to fix the issue on 'master', then cherry-pick to stable branches.

Has the issue been reproduced on 'master'?

Thanks,
-daw-

On 4/5/2023 12:15 PM, chetan bhasin wrote:
Hello Everyone,

Any work around/patch for the below issue??

Thanks,
Chetan 

On Mon, Mar 13, 2023, 06:51 Sudhir CR via lists.fd.io <sudhir=rtbrick.com@...> wrote:
Hi Dave, 
we are using VPP Version 21.10.

Thanks and regards,
Sudhir

On Fri, Mar 10, 2023 at 5:31 PM Dave Barach <vpp@...> wrote:

I should have had the sense to ask this earlier: which version of vpp are you using?

 

The line number in your debug snippet is more than 100 lines off from master/latest. The timer wheel code has been relatively untouched, but there have been several important fixes over the years...

 

D.

 

diff --git a/src/vlib/main.c b/src/vlib/main.c
index af0fcd1cb..55c231d8b 100644
--- a/src/vlib/main.c
+++ b/src/vlib/main.c
@@ -1490,6 +1490,9 @@ dispatch_suspended_process (vlib_main_t * vm,
     }
   else
     {
+           if (strcmp((char *)node->name, "rtb-vpp-epoll-process") == 0) {
+                   ASSERT(0);
+           }

 

From: vpp-dev@... <vpp-dev@...> On Behalf Of Sudhir CR via lists.fd.io
Sent: Thursday, March 9, 2023 4:00 AM
To: vpp-dev@...
Cc: rtbrick.com@...
Subject: Re: [vpp-dev] process node suspended indefinitely

 

Hi Dave,

Please excuse my delayed response. It took some time to recreate this issue.

I made changes to our process node as per your suggestion. now our process node code looks like this

 

while (1) {

        vlib_process_wait_for_event_or_clock (vm, RTB_VPP_EPOLL_PROCESS_NODE_TIMER);
        event_type = vlib_process_get_events (vm, &event_data);
        vec_reset_length(event_data);

        switch (event_type) {
            case ~0: /* handle timer expirations */
                rtb_event_loop_run_once ();
                break;

            default: /* bug! */
                ASSERT (0);
        }
    }

After these changes we didn't observe any assertions but we hit the process node suspend issue. with this it is clear other than time out we are not getting any other events.

 

In the issue state I have collected vlib_process node (rtb_vpp_epoll_process) flags value and it seems to be correct (flags = 11).

 

Please find the vlib_process_t and vlib_node_t data structure values collected in the issue state below.

 

vlib_process_t:

============

$38 = {
  cacheline0 = 0x7f9b2da50380 "\200~\274+\233\177",
  node_runtime = {
    cacheline0 = 0x7f9b2da50380 "\200~\274+\233\177",
    function = 0x7f9b2bbc7e80 <rtb_vpp_epoll_process>,
    errors = 0x7f9b3076a560,
    clocks_since_last_overflow = 0,
    max_clock = 3785970526,
    max_clock_n = 0,
    calls_since_last_overflow = 0,
    vectors_since_last_overflow = 0,
    next_frame_index = 1668,
    node_index = 437,
    input_main_loops_per_call = 0,
    main_loop_count_last_dispatch = 4147405645,
    main_loop_vector_stats = {0, 0},
    flags = 0,
    state = 0,
    n_next_nodes = 0,
    cached_next_index = 0,
    thread_index = 0,
    runtime_data = 0x7f9b2da503c6 ""
  },
  return_longjmp = {
    regs = {94502584873984, 140304430422064, 140306731463680, 94502584874048, 94502640552512, 0, 140304430422032, 140306703608766}
  },
  resume_longjmp = {
    regs = {94502584873984, 140304161734368, 140306731463680, 94502584874048, 94502640552512, 0, 140304161734272, 140304430441787}
  },
  flags = 11,
  log2_n_stack_bytes = 16,
  suspended_process_frame_index = 0,
  n_suspends = 0,
  pending_event_data_by_type_index = 0x7f9b307b8310,
  non_empty_event_type_bitmap = 0x7f9b307b8390,
  one_time_event_type_bitmap = 0x0,
  event_type_index_by_type_opaque = 0x7f9b2dab8bd8,
  event_type_pool = 0x7f9b2dcb5978,
  resume_clock_interval = 1000,
  stop_timer_handle = 3098,
  output_function = 0x0,
  output_function_arg = 0,
  stack = 0x7f9b1bb78000
}

 

vlib_node_t

=========

 (gdb) p *n

$17 = {
  function = 0x7f9b2bbc7e80 <rtb_vpp_epoll_process>,
  name = 0x7f9b3076a3f0 "rtb-vpp-epoll-process",
  name_elog_string = 11783,
  stats_total = {
    calls = 0,
    vectors = 0,
    clocks = 1971244932732,
    suspends = 6847366,
    max_clock = 3785970526,
    max_clock_n = 0
  },
  stats_last_clear = {
    calls = 0,
    vectors = 0,
    clocks = 0,
    suspends = 0,
    max_clock = 0,
    max_clock_n = 0
  },
  type = VLIB_NODE_TYPE_PROCESS,
  index = 437,
  runtime_index = 40,
  runtime_data = 0x0,
  flags = 0,
  state = 0 '\000',
  runtime_data_bytes = 0 '\000',
  protocol_hint = 0 '\000',
  n_errors = 0,
  scalar_size = 0,
  vector_size = 0,
  error_heap_handle = 0,
  error_heap_index = 0,
  error_counters = 0x0,
  next_node_names = 0x7f9b3076a530,
  next_nodes = 0x0,
  sibling_of = 0x0,
  sibling_bitmap = 0x0,
  n_vectors_by_next_node = 0x0,
  next_slot_by_node = 0x0,
  prev_node_bitmap = 0x0,
  owner_node_index = 4294967295,
  owner_next_index = 4294967295,
  format_buffer = 0x0,
  unformat_buffer = 0x0,
  format_trace = 0x0,
  validate_frame = 0x0,
  state_string = 0x0,
  node_fn_registrations = 0x0
}

 

I added an assert statement before clearing VLIB_PROCESS_IS_RUNNING flag in dispatch_suspended_process function.

But this assert statement is not hitting.

 

diff --git a/src/vlib/main.c b/src/vlib/main.c
index af0fcd1cb..55c231d8b 100644
--- a/src/vlib/main.c
+++ b/src/vlib/main.c
@@ -1490,6 +1490,9 @@ dispatch_suspended_process (vlib_main_t * vm,
     }
   else
     {
+           if (strcmp((char *)node->name, "rtb-vpp-epoll-process") == 0) {
+                   ASSERT(0);
+           }
       p->flags &= ~VLIB_PROCESS_IS_RUNNING;
       pool_put_index (nm->suspended_process_frames,
                      p->suspended_process_frame_index);

 

I am not able to figure out why this process node is suspended in some scenarios. Can you please help me by providing some pointers to debug and resolve this issue. 

 

Hi Jinsh,

I applied your patch to my code. The issue is not solved with your patch. Thank you for helping me out.

 

Thanks and Regards,

Sudhir

 

 

On Fri, Mar 3, 2023 at 12:53 PM Sudhir CR via lists.fd.io <sudhir=rtbrick.com@...> wrote:

Hi Chetan,

In our case we are observing this issue occasionally exact steps  to recreate the issue are not known.

I made changes to our process node as suggested by dave and with these changes trying to recreate the issue.

Soon I will update my results and findings in this mail thread.

 

Thanks and Regards,

Sudhir

 

On Fri, Mar 3, 2023 at 12:37 PM chetan bhasin <chetan.bhasin017@...> wrote:

Hi Sudhir,

 

Is your issue resolved?

 

Actually we are facing same issue on vpp.2106. 

In our case "api-rx-ring" is not getting called.

in our usecase workers are calling some functions in main-thread context leading to RPC message and memory is allocated from api section.

This leads to Api-segment memory is used fully and leads to crash.

 

Thanks,

Chetan 

 

On Mon, Feb 20, 2023, 18:24 Sudhir CR via lists.fd.io <sudhir=rtbrick.com@...> wrote:

Hi Dave,

Thank you very much for your inputs. I will try this out and get back to you with the results.

 

Regards,

Sudhir 

 

On Mon, Feb 20, 2023 at 6:01 PM Dave Barach <vpp@...> wrote:

Please try something like this, to eliminate the possibility that some bit of code is sending this process an event. It’s not a good idea to skip the vec_reset_length (event_data) step.

 

while (1)

{

   uword event_type, * event_data = 0;

   int i;

 

   vlib_process_wait_for_event_or_clock (vm, 1e-2 /* 10 ms */);

 

   event_type = vlib_process_get_events (vm, &event_data);

 

   switch (event_type) {

  case ~0: /* handle timer expirations */

       rtb_event_loop_run_once ();

       break;

 

   default: /* bug! */

       ASSERT (0);

   }

 

   vec_reset_length(event_data);

}

 

From: vpp-dev@... <vpp-dev@...> On Behalf Of Sudhir CR via lists.fd.io
Sent: Monday, February 20, 2023 4:02 AM
To: vpp-dev@...
Subject: Re: [vpp-dev] process node suspended indefinitely

 

Hi Dave,
Thank you for your response and help. 

 

Please find the additional details below.

VPP Version 21.10


We are creating a process node rtb-vpp-epoll-process to handle control plane events like interface add/delete, route add/delete.
This process node waits for 10ms of time (Not Interested in any events ) once 10ms is expired it will process control plane events mentioned above.

code snippet looks like below 

 

```

static uword
rtb_vpp_epoll_process (vlib_main_t                 *vm,
                       vlib_node_runtime_t  *rt,
                       vlib_frame_t         *f)
{

    ...
    ...
    while (1) {
        vlib_process_wait_for_event_or_clock (vm, 10e-3);
        vlib_process_get_events (vm, NULL);

        rtb_event_loop_run_once();   <---- controlplane events handling 
    } 
}
``` 

What we observed is that sometimes (when there is a high controlplane load like request to install more routes) "rtb-vpp-epoll-process" is suspended and not scheduled furever. this we found by using "show runtime rtb-vpp-epoll-process"  (in "show runtime rtb-vpp-epoll-process" command output suspends counter is not incrementing.)

show runtime output in working case :


```
DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0          192246          1.91e6            0.00
DBGvpp#

DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0          193634          1.89e6            0.00
DBGvpp#

``` 

show runtime output in issue case :
```

DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0           81477          7.08e6            0.00
DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0           81477          7.08e6            0.00

```

Other process nodes like lldp-process, ip4-neighbor-age-process, ip6-ra-process running without any issue. only "rtb-vpp-epoll-process" process node suspended forever. 

 

Please let me know if any additional information is required.

Hi Jinsh,
Thanks for pointing me to the issue you faced. The issue I am facing looks similar.
I will verify with the given patch.


Thanks and Regards,

Sudhir

 

On Sun, Feb 19, 2023 at 6:19 AM jinsh11 <jinsh11@...> wrote:

HI

  • I have the same problem

bfd process node stop running. I raised this issue,

https://lists.fd.io/g/vpp-dev/message/22380
I think there is a problem with the porcess scheduling module when using the time wheel.

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com

 

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com





NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com









NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com


    








Sudhir CR
 

Hi Matthew,
Thank you very much for detailed information and patch. I will apply the patch,test it in our system and soone come back to you with the results.


Thanks and regards,
Sudhir

On Mon, Apr 10, 2023 at 6:46 PM Matthew Smith via lists.fd.io <mgsmith=netgate.com@...> wrote:

Testing since thursday on a build with https://gerrit.fd.io/r/c/vpp/+/38400 and https://gerrit.fd.io/r/c/vpp/+/38621 applied seems to have eliminated the issue with api-rx-from-ring on the production system where the issue could be reproduced. Hopefully these patches can also help with the process node issues reported by other users.

-Matt


On Thu, Apr 6, 2023 at 6:37 PM Matthew Smith via lists.fd.io <mgsmith=netgate.com@...> wrote:

Some of Netgate's customers have observed a similar issue on a build based on stable/2206 and Pim van Pelt has observed the same issue on some of his production routers using a build based on something close to the current master branch. In these cases, the api-rx-from-ring node stopped being executed.

My colleague Alexander Chernavin observed in https://lists.fd.io/g/vpp-dev/topic/96500275#22508 that when this issue arises, the process node for api-rx-from-ring has a timer ID in stop_timer_handle which is actually in use by another process node (the user_handle field in the timer entry points back to the process ID of a different process node). I haven't been able to reproduce this problem in a lab environment, but the going theory seems to be that when timers for process nodes are stopped or expired, the stop_timer_handle is not reset and contains a stale timer ID value until the node is dispatched and suspends itself again. So you could have a series of events like this:

process node A's timer is stopped (Its stop_timer_handle still refers to that timer)
process node B is dispatched and suspends while adding a new timer. The new timer allocated has the ID formerly used by process A.
process node C is dispatched and signals process node A with vlib_process_signal_event(). vlib_process_signal_event_helper() sees that the timer referred to by stop_timer_handle on process A is not free (it was allocated and is in use by process node B) and stops it. Process node B has no timer set now.

There is an existing patch in gerrit which seems aimed at addressing this - https://gerrit.fd.io/r/c/vpp/+/38400. It did not pass the checkstyle verify job when it was submitted, so I rebased it and fixed the formatting issues. That patch alone did not correct the issue on the router where Pim is able to reproduce the issue - it only resets the stop_timer_handle on a process node when vlib_process_signal_event_helper() explicitly stops a timer, it does not do the same for timers which were returned to the pool by TW (tw_timer_expire_timers_vec). I just uploaded this additional patch which attempts to deal with those timers - https://gerrit.fd.io/r/c/vpp/+/38621. Pim is currently testing it, so I can't say yet whether it actually solves the problem. But I thought I would share it so anyone else having this problem could try them out and give feedback on whether they help.

AFAICT both patches should be able to be applied to stable/2110.

Thanks,
-Matt


On Thu, Apr 6, 2023 at 9:24 AM Dave Wallace <dwallacelf@...> wrote:
Sudhir,

Good to hear that you have a workaround.  Hopefully you will be able to move your product forward to a recent release soon.

Thanks,
-daw-

On 4/5/2023 11:43 PM, Sudhir CR via lists.fd.io wrote:
Hi Dave ,
unfortunately we have not yet integrated master code with our product. So we are not able to test the issue with latest master code.

While testing the issue I made below change to our process node after that this issue is not seen.
 
    while (1) {

        vlib_process_wait_for_event_or_clock (vm, RTB_VPP_EPOLL_PROCESS_NODE_TIMER);
        event_type = vlib_process_get_events (vm, &event_data);
        vec_reset_length(event_data);

        switch (event_type) {
            case ~0: /* handle timer expirations */
                rtb_event_loop_run_once ();
                vlib_process_suspend (vm, RTB_VPP_EPOLL_PROCESS_NODE_TIMER);  <--- added newly
                break;

            default: /* bug! */
                clib_warning ("BUG: event type 0x%wx", event_type);
                ASSERT (0);
        }
    }   

This is just workaround to our problem, if by any chance our process node is missing in nm->data_from_advancing_timing_wheel list
vlib_process_suspend will add process node to  data_from_advancing_timing_wheel  so that our process node will be scheduled again.

Thanks and Regards,
Sudhir

On Wed, Apr 5, 2023 at 10:02 PM Dave Wallace <dwallacelf@...> wrote:
Unfortunately, VPP 21.10 is an unsupported version.  Even if someone had a fix for the problem, there is no CI available to verify any submissions to the stable/2110 branch.

If so desired (e.g. to share a fix), one may push a gerrit change to an unsupported stable branch, but it will not be merged.  The best practice is to fix the issue on 'master', then cherry-pick to stable branches.

Has the issue been reproduced on 'master'?

Thanks,
-daw-

On 4/5/2023 12:15 PM, chetan bhasin wrote:
Hello Everyone,

Any work around/patch for the below issue??

Thanks,
Chetan 

On Mon, Mar 13, 2023, 06:51 Sudhir CR via lists.fd.io <sudhir=rtbrick.com@...> wrote:
Hi Dave, 
we are using VPP Version 21.10.

Thanks and regards,
Sudhir

On Fri, Mar 10, 2023 at 5:31 PM Dave Barach <vpp@...> wrote:

I should have had the sense to ask this earlier: which version of vpp are you using?

 

The line number in your debug snippet is more than 100 lines off from master/latest. The timer wheel code has been relatively untouched, but there have been several important fixes over the years...

 

D.

 

diff --git a/src/vlib/main.c b/src/vlib/main.c
index af0fcd1cb..55c231d8b 100644
--- a/src/vlib/main.c
+++ b/src/vlib/main.c
@@ -1490,6 +1490,9 @@ dispatch_suspended_process (vlib_main_t * vm,
     }
   else
     {
+           if (strcmp((char *)node->name, "rtb-vpp-epoll-process") == 0) {
+                   ASSERT(0);
+           }

 

From: vpp-dev@... <vpp-dev@...> On Behalf Of Sudhir CR via lists.fd.io
Sent: Thursday, March 9, 2023 4:00 AM
To: vpp-dev@...
Cc: rtbrick.com@...
Subject: Re: [vpp-dev] process node suspended indefinitely

 

Hi Dave,

Please excuse my delayed response. It took some time to recreate this issue.

I made changes to our process node as per your suggestion. now our process node code looks like this

 

while (1) {

        vlib_process_wait_for_event_or_clock (vm, RTB_VPP_EPOLL_PROCESS_NODE_TIMER);
        event_type = vlib_process_get_events (vm, &event_data);
        vec_reset_length(event_data);

        switch (event_type) {
            case ~0: /* handle timer expirations */
                rtb_event_loop_run_once ();
                break;

            default: /* bug! */
                ASSERT (0);
        }
    }

After these changes we didn't observe any assertions but we hit the process node suspend issue. with this it is clear other than time out we are not getting any other events.

 

In the issue state I have collected vlib_process node (rtb_vpp_epoll_process) flags value and it seems to be correct (flags = 11).

 

Please find the vlib_process_t and vlib_node_t data structure values collected in the issue state below.

 

vlib_process_t:

============

$38 = {
  cacheline0 = 0x7f9b2da50380 "\200~\274+\233\177",
  node_runtime = {
    cacheline0 = 0x7f9b2da50380 "\200~\274+\233\177",
    function = 0x7f9b2bbc7e80 <rtb_vpp_epoll_process>,
    errors = 0x7f9b3076a560,
    clocks_since_last_overflow = 0,
    max_clock = 3785970526,
    max_clock_n = 0,
    calls_since_last_overflow = 0,
    vectors_since_last_overflow = 0,
    next_frame_index = 1668,
    node_index = 437,
    input_main_loops_per_call = 0,
    main_loop_count_last_dispatch = 4147405645,
    main_loop_vector_stats = {0, 0},
    flags = 0,
    state = 0,
    n_next_nodes = 0,
    cached_next_index = 0,
    thread_index = 0,
    runtime_data = 0x7f9b2da503c6 ""
  },
  return_longjmp = {
    regs = {94502584873984, 140304430422064, 140306731463680, 94502584874048, 94502640552512, 0, 140304430422032, 140306703608766}
  },
  resume_longjmp = {
    regs = {94502584873984, 140304161734368, 140306731463680, 94502584874048, 94502640552512, 0, 140304161734272, 140304430441787}
  },
  flags = 11,
  log2_n_stack_bytes = 16,
  suspended_process_frame_index = 0,
  n_suspends = 0,
  pending_event_data_by_type_index = 0x7f9b307b8310,
  non_empty_event_type_bitmap = 0x7f9b307b8390,
  one_time_event_type_bitmap = 0x0,
  event_type_index_by_type_opaque = 0x7f9b2dab8bd8,
  event_type_pool = 0x7f9b2dcb5978,
  resume_clock_interval = 1000,
  stop_timer_handle = 3098,
  output_function = 0x0,
  output_function_arg = 0,
  stack = 0x7f9b1bb78000
}

 

vlib_node_t

=========

 (gdb) p *n

$17 = {
  function = 0x7f9b2bbc7e80 <rtb_vpp_epoll_process>,
  name = 0x7f9b3076a3f0 "rtb-vpp-epoll-process",
  name_elog_string = 11783,
  stats_total = {
    calls = 0,
    vectors = 0,
    clocks = 1971244932732,
    suspends = 6847366,
    max_clock = 3785970526,
    max_clock_n = 0
  },
  stats_last_clear = {
    calls = 0,
    vectors = 0,
    clocks = 0,
    suspends = 0,
    max_clock = 0,
    max_clock_n = 0
  },
  type = VLIB_NODE_TYPE_PROCESS,
  index = 437,
  runtime_index = 40,
  runtime_data = 0x0,
  flags = 0,
  state = 0 '\000',
  runtime_data_bytes = 0 '\000',
  protocol_hint = 0 '\000',
  n_errors = 0,
  scalar_size = 0,
  vector_size = 0,
  error_heap_handle = 0,
  error_heap_index = 0,
  error_counters = 0x0,
  next_node_names = 0x7f9b3076a530,
  next_nodes = 0x0,
  sibling_of = 0x0,
  sibling_bitmap = 0x0,
  n_vectors_by_next_node = 0x0,
  next_slot_by_node = 0x0,
  prev_node_bitmap = 0x0,
  owner_node_index = 4294967295,
  owner_next_index = 4294967295,
  format_buffer = 0x0,
  unformat_buffer = 0x0,
  format_trace = 0x0,
  validate_frame = 0x0,
  state_string = 0x0,
  node_fn_registrations = 0x0
}

 

I added an assert statement before clearing VLIB_PROCESS_IS_RUNNING flag in dispatch_suspended_process function.

But this assert statement is not hitting.

 

diff --git a/src/vlib/main.c b/src/vlib/main.c
index af0fcd1cb..55c231d8b 100644
--- a/src/vlib/main.c
+++ b/src/vlib/main.c
@@ -1490,6 +1490,9 @@ dispatch_suspended_process (vlib_main_t * vm,
     }
   else
     {
+           if (strcmp((char *)node->name, "rtb-vpp-epoll-process") == 0) {
+                   ASSERT(0);
+           }
       p->flags &= ~VLIB_PROCESS_IS_RUNNING;
       pool_put_index (nm->suspended_process_frames,
                      p->suspended_process_frame_index);

 

I am not able to figure out why this process node is suspended in some scenarios. Can you please help me by providing some pointers to debug and resolve this issue. 

 

Hi Jinsh,

I applied your patch to my code. The issue is not solved with your patch. Thank you for helping me out.

 

Thanks and Regards,

Sudhir

 

 

On Fri, Mar 3, 2023 at 12:53 PM Sudhir CR via lists.fd.io <sudhir=rtbrick.com@...> wrote:

Hi Chetan,

In our case we are observing this issue occasionally exact steps  to recreate the issue are not known.

I made changes to our process node as suggested by dave and with these changes trying to recreate the issue.

Soon I will update my results and findings in this mail thread.

 

Thanks and Regards,

Sudhir

 

On Fri, Mar 3, 2023 at 12:37 PM chetan bhasin <chetan.bhasin017@...> wrote:

Hi Sudhir,

 

Is your issue resolved?

 

Actually we are facing same issue on vpp.2106. 

In our case "api-rx-ring" is not getting called.

in our usecase workers are calling some functions in main-thread context leading to RPC message and memory is allocated from api section.

This leads to Api-segment memory is used fully and leads to crash.

 

Thanks,

Chetan 

 

On Mon, Feb 20, 2023, 18:24 Sudhir CR via lists.fd.io <sudhir=rtbrick.com@...> wrote:

Hi Dave,

Thank you very much for your inputs. I will try this out and get back to you with the results.

 

Regards,

Sudhir 

 

On Mon, Feb 20, 2023 at 6:01 PM Dave Barach <vpp@...> wrote:

Please try something like this, to eliminate the possibility that some bit of code is sending this process an event. It’s not a good idea to skip the vec_reset_length (event_data) step.

 

while (1)

{

   uword event_type, * event_data = 0;

   int i;

 

   vlib_process_wait_for_event_or_clock (vm, 1e-2 /* 10 ms */);

 

   event_type = vlib_process_get_events (vm, &event_data);

 

   switch (event_type) {

  case ~0: /* handle timer expirations */

       rtb_event_loop_run_once ();

       break;

 

   default: /* bug! */

       ASSERT (0);

   }

 

   vec_reset_length(event_data);

}

 

From: vpp-dev@... <vpp-dev@...> On Behalf Of Sudhir CR via lists.fd.io
Sent: Monday, February 20, 2023 4:02 AM
To: vpp-dev@...
Subject: Re: [vpp-dev] process node suspended indefinitely

 

Hi Dave,
Thank you for your response and help. 

 

Please find the additional details below.

VPP Version 21.10


We are creating a process node rtb-vpp-epoll-process to handle control plane events like interface add/delete, route add/delete.
This process node waits for 10ms of time (Not Interested in any events ) once 10ms is expired it will process control plane events mentioned above.

code snippet looks like below 

 

```

static uword
rtb_vpp_epoll_process (vlib_main_t                 *vm,
                       vlib_node_runtime_t  *rt,
                       vlib_frame_t         *f)
{

    ...
    ...
    while (1) {
        vlib_process_wait_for_event_or_clock (vm, 10e-3);
        vlib_process_get_events (vm, NULL);

        rtb_event_loop_run_once();   <---- controlplane events handling 
    } 
}
``` 

What we observed is that sometimes (when there is a high controlplane load like request to install more routes) "rtb-vpp-epoll-process" is suspended and not scheduled furever. this we found by using "show runtime rtb-vpp-epoll-process"  (in "show runtime rtb-vpp-epoll-process" command output suspends counter is not incrementing.)

show runtime output in working case :


```
DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0          192246          1.91e6            0.00
DBGvpp#

DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0          193634          1.89e6            0.00
DBGvpp#

``` 

show runtime output in issue case :
```

DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0           81477          7.08e6            0.00
DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call  
rtb-vpp-epoll-process           any wait                 0               0           81477          7.08e6            0.00

```

Other process nodes like lldp-process, ip4-neighbor-age-process, ip6-ra-process running without any issue. only "rtb-vpp-epoll-process" process node suspended forever. 

 

Please let me know if any additional information is required.

Hi Jinsh,
Thanks for pointing me to the issue you faced. The issue I am facing looks similar.
I will verify with the given patch.


Thanks and Regards,

Sudhir

 

On Sun, Feb 19, 2023 at 6:19 AM jinsh11 <jinsh11@...> wrote:

HI

  • I have the same problem

bfd process node stop running. I raised this issue,

https://lists.fd.io/g/vpp-dev/message/22380
I think there is a problem with the porcess scheduling module when using the time wheel.

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com

 

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com





NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com









NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com


    











NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com