VPP crashes on CSIT Taishan server


Lijian Zhang
 

Hi,

VPP crashes on the CSIT Taishan server because vlib_get_thread_core_numa (vlib_worker_thread_t * w, unsigned cpu_id) does not derive the correct NUMA node from cpu_id on this platform.

vlib_get_thread_core_numa () uses physical_package_id as the NUMA node.

 

However, the Taishan server has 2 physical sockets but 4 NUMA nodes, as shown in the lscpu output further down. In addition, sysfs shows that the physical_package_id values are not sequential on this server:

taishan-d05-08:~$ cat /sys/devices/system/cpu/cpu37/topology/physical_package_id
3002
taishan-d05-08:~$ cat /sys/devices/system/cpu/cpu3/topology/physical_package_id
36
taishan-d05-08:~$ cat /sys/devices/system/cpu/cpu15/topology/physical_package_id
36

 

How about using the sysfs entries below to look up the NUMA node for a given cpu_id? The code change is attached at the end.

 

$ cat /sys/devices/system/node/online
0-3
$ cat /sys/devices/system/node/node0/cpulist
0-15
$ cat /sys/devices/system/node/node1/cpulist
16-31
$ cat /sys/devices/system/node/node2/cpulist
32-47
$ cat /sys/devices/system/node/node3/cpulist
48-63
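
To illustrate the lookup outside of VPP, here is a minimal standalone sketch in plain C. It is not the proposed patch: the helper names cpulist_contains and numa_node_for_cpu are hypothetical, and the sketch assumes the per-node cpulist strings have already been read from sysfs into memory. It only shows the idea of matching cpu_id against each node's CPU list.

```c
#include <stdlib.h>

/* Return 1 if `cpu` is in a sysfs-style CPU list such as "0-15" or
   "0-3,8-11", 0 if it is not, and -1 if a malformed entry is hit
   before a match is found. A trailing newline is tolerated. */
static int
cpulist_contains (const char *list, unsigned long cpu)
{
  const char *s = list;
  char *end;

  while (*s != '\0' && *s != '\n')
    {
      unsigned long lo, hi;

      /* Parse the start of the range (or a single CPU number). */
      lo = strtoul (s, &end, 10);
      if (end == s)
        return -1;
      hi = lo;
      s = end;

      /* An optional "-N" turns the single number into a range. */
      if (*s == '-')
        {
          s++;
          hi = strtoul (s, &end, 10);
          if (end == s)
            return -1;
          s = end;
        }

      if (cpu >= lo && cpu <= hi)
        return 1;

      /* Entries are separated by commas. */
      if (*s == ',')
        s++;
      else if (*s != '\0' && *s != '\n')
        return -1;
    }
  return 0;
}

/* Given one cpulist string per NUMA node (index == node number),
   return the first node whose list contains `cpu`, or -1 if none does. */
static int
numa_node_for_cpu (const char *const *node_cpulists, int n_nodes,
                   unsigned long cpu)
{
  for (int node = 0; node < n_nodes; node++)
    if (cpulist_contains (node_cpulists[node], cpu) == 1)
      return node;
  return -1;
}
```

With the four cpulist values shown above ("0-15", "16-31", "32-47", "48-63"), cpu37 maps to node 2 and cpu3 and cpu15 both map to node 0, regardless of what physical_package_id reports.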

 

taishan-d05-08:~$ lscpu
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              64
On-line CPU(s) list: 0-63
Thread(s) per core:  1
Core(s) per socket:  32
Socket(s):           2
NUMA node(s):        4
Vendor ID:           ARM
Model:               2
Model name:          Cortex-A72
Stepping:            r0p2
BogoMIPS:            100.00
L1d cache:           32K
L1i cache:           48K
L2 cache:            1024K
L3 cache:            16384K
NUMA node0 CPU(s):   0-15
NUMA node1 CPU(s):   16-31
NUMA node2 CPU(s):   32-47
NUMA node3 CPU(s):   48-63
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid

 

 

diff --git a/src/vlib/threads.c b/src/vlib/threads.c
index 1ce4dc156..3f0905421 100644
--- a/src/vlib/threads.c
+++ b/src/vlib/threads.c
@@ -598,15 +598,30 @@ void
 vlib_get_thread_core_numa (vlib_worker_thread_t * w, unsigned cpu_id)
 {
   const char *sys_cpu_path = "/sys/devices/system/cpu/cpu";
+  const char *sys_node_path = "/sys/devices/system/node/node";
+  clib_bitmap_t *nbmp = 0, *cbmp = 0;
+  u32 node;
   u8 *p = 0;
   int core_id = -1, numa_id = -1;
 
   p = format (p, "%s%u/topology/core_id%c", sys_cpu_path, cpu_id, 0);
   clib_sysfs_read ((char *) p, "%d", &core_id);
   vec_reset_length (p);
-  p = format (p, "%s%u/topology/physical_package_id%c", sys_cpu_path,
-             cpu_id, 0);
-  clib_sysfs_read ((char *) p, "%d", &numa_id);
+
+  /* *INDENT-OFF* */
+  clib_sysfs_read ("/sys/devices/system/node/online", "%U",
+        unformat_bitmap_list, &nbmp);
+  clib_bitmap_foreach (node, nbmp, ({
+    p = format (p, "%s%u/cpulist%c", sys_node_path, node, 0);
+    clib_sysfs_read ((char *) p, "%U", unformat_bitmap_list, &cbmp);
+    if (clib_bitmap_get (cbmp, cpu_id))
+      numa_id = node;
+    vec_reset_length (cbmp);
+    vec_reset_length (p);
+  }));
+  /* *INDENT-ON* */
+  vec_free (nbmp);
+  vec_free (cbmp);
   vec_free (p);
 
   w->core_id = core_id;

 

Thanks.
