Debugging a Mixed Python and C Language Stack


2023-04-25 15:55:37

Debugging is hard. Debugging across multiple languages is especially challenging, and debugging across devices often requires a team with varied skill sets and expertise to reveal the underlying problem.

But projects often require using multiple languages, to ensure high performance where necessary, a user-friendly experience, and compatibility where possible. Unfortunately, there is no single programming language that offers all of the above, demanding that developers become versatile.

This post shows how a RAPIDS team approached debugging multiple programming languages, including the use of GDB to identify and resolve deadlocks. The team is dedicated to designing software to accelerate and scale data science solutions.

The bug featured in this post was part of the RAPIDS project and was identified and resolved in the summer of 2019. It involves a complex stack with multiple programming languages, primarily C, C++, and Python, as well as CUDA for GPU acceleration.

Documenting this historic bug and its resolution serves a few goals, including:

  1. Demonstrating Python and C debugging with GDB
  2. Presenting ideas on how to diagnose deadlocks
  3. Developing a better understanding of mixing Python and CUDA

The content presented in this post should help you understand how such bugs manifest and how to address similar issues in your own work.

Bug description

To be efficient and performant, RAPIDS depends on a variety of libraries for a multitude of different operations. To name a few, RAPIDS uses CuPy and cuDF to compute arrays and DataFrames on the GPU, respectively. Numba is a just-in-time compiler that can be used for accelerating user-defined Python operations on the GPU.

In addition, Dask is used to scale compute to multiple GPUs and multiple nodes. The last piece of the puzzle in the bug at hand is UCX, a communication framework used to leverage a variety of interconnects, such as InfiniBand and NVLink.

Figure 1 shows an overview of this stack. Although unknown at the time, a deadlock was occurring somewhere in this stack, preventing the workflow from completing.

Figure 1. Stack of components in a RAPIDS and Dask cluster

This deadlock was first observed in August 2019, shortly after UCX was introduced into the stack. It turns out that the deadlock had previously manifested without UCX as well (using the Dask default TCP communicator), only less frequently.

A lot of time was spent exploring the space where the deadlock occurred. Though unknown at the time, the bug could have been in a particular operation, such as group by aggregation, merge/joins, or repartitioning, or in a particular version of any of the libraries, including cuDF, CuPy, Dask, UCX, and more. There were many facets to explore as a result.

Prepare for debugging

The following sections walk you through how to prepare for debugging.

Set up a minimal reproducer

Finding a minimal reproducer is key to debugging anything. This problem was initially identified in a workflow running on eight GPUs. Over time, we reduced it down to two GPUs. Having a minimal reproducer is essential to easily share a bug with others and get the time and attention of a broader team.

Set up your environment

Before diving into the problem, set up your environment. The bug can be minimally reproduced with the 0.10 version of RAPIDS (released in October 2019). It is possible to set up the environment with either Conda or Docker (see the respective sections later in this post).

This entire process assumes the use of Linux. Because UCX is not supported on Windows or MacOS, the bug is not reproducible on those operating systems.

Conda

First, install Miniconda. After the initial setup, we strongly recommend that you install mamba by running the following command:

conda install mamba -n base -c conda-forge

Then run the following command to create and activate a conda environment with RAPIDS 0.10:

mamba create -n rapids-0.10 -c rapidsai -c nvidia -c conda-forge rapids=0.10 glog=0.4 cupy=6.7 numba=0.45.1 ucx-py=0.11 ucx=1.7 ucx-proc=*=gpu libnuma dask=2.30 dask-core=2.30 distributed=2.30 gdb
conda activate rapids-0.10

We recommend Mamba for speeding up environment resolution. Skipping that step and replacing mamba with conda should work as well, but may be considerably slower.

Docker

Alternatively, you can reproduce the bug with Docker. After you have the NVIDIA Container Toolkit set up, follow these instructions.

docker run -it --rm --cap-add sys_admin --cap-add sys_ptrace --ipc shareable --net host --gpus all rapidsai/rapidsai:0.10-cuda10.0-runtime-ubuntu18.04 /bin/bash

In the container, install mamba to speed up the environment resolution.

conda create -n mamba -c conda-forge mamba -y

Then, install UCX/UCX-Py and libnuma, which is a UCX dependency. Also, upgrade Dask to a version that has built-in UCX support. For debugging later, also install GDB.

/opt/conda/envs/mamba/bin/mamba install -y -c rapidsai -c nvidia -c conda-forge dask=2.30 dask-core=2.30 distributed=2.30 fsspec=2022.11.0 libnuma ucx-py=0.11 ucx=1.7 ucx-proc=*=gpu gdb -p /opt/conda/envs/rapids

Debugging

This section details how this particular problem was encountered and ultimately fixed, with a detailed step-by-step overview. You can also reproduce it and follow along with a few of the concepts described.

Running (or hanging)

The debugging issue in question is by no means limited to a single compute problem, but it is easier to use the same workflow that we used in 2019. That script can be downloaded to a local environment by running the following:

wget https://gist.githubusercontent.com/pentschev/9ce97f8efe370552c7dd5e84b64d3c92/raw/424c9cf95f31c18d32a9481f78dd241e08a071a9/cudf-deadlock.py

To reproduce the bug, execute the following:

OPENBLAS_NUM_THREADS=1 UCX_RNDV_SCHEME=put_zcopy UCX_MEMTYPE_CACHE=n UCX_TLS=sockcm,tcp,cuda_copy,cuda_ipc python cudf-deadlock.py

In just a few iterations (perhaps as few as one or two), you should see the preceding program hang. Now the real work begins.
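For context, the workflow in cudf-deadlock.py boils down to repeatedly merging cuDF DataFrames on a two-worker Dask cluster communicating over UCX. The sketch below is only a rough, hypothetical approximation of that kind of workload, not the actual gist; the cluster parameters, data shapes, and iteration count are illustrative assumptions based on RAPIDS 0.10-era APIs.

# Hypothetical sketch only -- not the actual cudf-deadlock.py gist.
import numpy as np
import cudf
import dask_cudf
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

if __name__ == "__main__":
    # Two GPU workers communicating over UCX, matching the minimal reproducer
    cluster = LocalCUDACluster(protocol="ucx", n_workers=2)
    client = Client(cluster)

    n = 1_000_000
    left = dask_cudf.from_cudf(
        cudf.DataFrame({"key": np.random.randint(0, 1000, n), "x": np.random.random(n)}),
        npartitions=8,
    )
    right = dask_cudf.from_cudf(
        cudf.DataFrame({"key": np.random.randint(0, 1000, n), "y": np.random.random(n)}),
        npartitions=8,
    )

    # Repeat the merge until the workflow eventually hangs
    for i in range(20):
        print("iteration", i)
        left.merge(right, on="key").compute()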

The deadlock

A nice attribute of deadlocks is that the processes and threads (if you know how to investigate them) can show what they are currently trying to do. You can infer what is causing the deadlock.

The essential tool is GDB. However, much time was spent initially with PDB, investigating what Python was doing at each step. GDB can attach to live processes, so you must first find out what the processes and their associated IDs are:

(rapids) root@dgx13:/rapids/notebooks# ps ax | grep python
   19 pts/0    S      0:01 /opt/conda/envs/rapids/bin/python /opt/conda/envs/rapids/bin/jupyter-lab --allow-root --ip=0.0.0.0 --no-browser --NotebookApp.token=
  865 pts/0    Sl+    0:03 python cudf-deadlock.py
  871 pts/0    S+     0:00 /opt/conda/envs/rapids/bin/python -c from multiprocessing.semaphore_tracker import main;main(69)
  873 pts/0    Sl+    0:08 /opt/conda/envs/rapids/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=70, pipe_handle=76) --multiprocessing-fork
  885 pts/0    Sl+    0:07 /opt/conda/envs/rapids/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=70, pipe_handle=85) --multiprocessing-fork

Four Python processes are associated with this problem:

  • Dask Client (865)
  • Dask Scheduler (871)
  • Two Dask workers (873 and 885)

Interestingly enough, significant progress has been made in debugging Python since this bug was originally investigated. In 2019, RAPIDS was on Python 3.6, which already had tools to debug the lower-level stacks, but only when Python was built in debug mode. That potentially required rebuilding the entire software stack, which is prohibitive in complex cases like this.

Since Python 3.8, debug builds use the same ABI as release builds, greatly simplifying debugging of the combined C and Python stacks. We don't cover that in this post.

GDB exploration

Use gdb to attach to the last running process (one of the Dask workers):

(rapids) root@dgx13:/rapids/notebooks# gdb -p 885
Attaching to process 885
[New LWP 889]
[New LWP 890]
[New LWP 891]
[New LWP 892]
[New LWP 893]
[New LWP 894]
[New LWP 898]
[New LWP 899]
[New LWP 902]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f5494d48938 in pthread_rwlock_wrlock () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb)

Each Dask worker has several threads (communication, compute, admin, and so on). Use the GDB command info threads to check what each thread is doing.

(gdb) info threads
  Id   Target Id                                        Frame
* 1    Thread 0x7f5495177740 (LWP 885) "python"         0x00007f5494d48938 in pthread_rwlock_wrlock () from /lib/x86_64-linux-gnu/libpthread.so.0
  2    Thread 0x7f5425b98700 (LWP 889) "python"         0x00007f5494d4d384 in read () from /lib/x86_64-linux-gnu/libpthread.so.0
  3    Thread 0x7f5425357700 (LWP 890) "python"         0x00007f5494d49f85 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  4    Thread 0x7f5424b16700 (LWP 891) "python"         0x00007f5494d49f85 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  5    Thread 0x7f5411fff700 (LWP 892) "cuda-EvtHandlr" 0x00007f5494a5fbf9 in poll () from /lib/x86_64-linux-gnu/libc.so.6
  6    Thread 0x7f54117fe700 (LWP 893) "python"         0x00007f5494a6cbb7 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  7    Thread 0x7f5410d3c700 (LWP 894) "python"         0x00007f5494d4c6d6 in do_futex_wait.constprop () from /lib/x86_64-linux-gnu/libpthread.so.0
  8    Thread 0x7f53f6048700 (LWP 898) "python"         0x00007f5494d49f85 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  9    Thread 0x7f53f5847700 (LWP 899) "cuda-EvtHandlr" 0x00007f5494a5fbf9 in poll () from /lib/x86_64-linux-gnu/libc.so.6
  10   Thread 0x7f53a39d9700 (LWP 902) "python"         0x00007f5494d4c6d6 in do_futex_wait.constprop () from /lib/x86_64-linux-gnu/libpthread.so.0

This Dask worker has 10 threads, and half of them appear to be waiting on a mutex/futex. The other half, cuda-EvtHandlr, are polling. Observe what the current thread (denoted by the * on the left), Thread 1, is doing by looking at the backtrace:

(gdb) bt
#0  0x00007f5494d48938 in pthread_rwlock_wrlock () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f548bc770a8 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
#2  0x00007f548ba3d87c in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
#3  0x00007f548bac6dfa in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
#4  0x00007f54240ba372 in uct_cuda_ipc_iface_event_fd_arm (tl_iface=0x562398656990, events=<optimized out>) at cuda_ipc/cuda_ipc_iface.c:271
#5  0x00007f54241d4fc2 in ucp_worker_arm (worker=0x5623987839e0) at core/ucp_worker.c:1990
#6  0x00007f5424259b76 in __pyx_pw_3ucp_5_libs_4core_18ApplicationContext_23_blocking_progress_mode_1_fd_reader_callback ()
   from /opt/conda/envs/rapids/lib/python3.6/site-packages/ucp/_libs/core.cpython-36m-x86_64-linux-gnu.so
#7  0x000056239601d5ae in PyObject_Call (func=<cython_function_or_method at remote 0x7f54242bb608>, args=<optimized out>, kwargs=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python_1596656032113/work/Objects/abstract.c:2261
#8  0x00005623960d13a2 in do_call_core (kwdict=0x0, callargs=(), func=<cython_function_or_method at remote 0x7f54242bb608>)
    at /home/conda/feedstock_root/build_artifacts/python_1596656032113/work/Python/ceval.c:5120
#9  _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at /home/conda/feedstock_root/build_artifacts/python_1596656032113/work/Python/ceval.c:3404
#10 0x00005623960924b5 in PyEval_EvalFrameEx (throwflag=0, f=Python Exception <class 'RuntimeError'> Type does not have a target.:
) at /home/conda/feedstock_root/build_artifacts/python_1596656032113/work/Python/ceval.c:754
#11 _PyFunction_FastCall (globals=<optimized out>, nargs=<optimized out>, args=<optimized out>, co=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python_1596656032113/work/Python/ceval.c:4933
#12 fast_function (func=<optimized out>, stack=<optimized out>, nargs=<optimized out>, kwnames=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python_1596656032113/work/Python/ceval.c:4968
#13 0x00005623960a13af in call_function (pp_stack=0x7ffdfa2311e8, oparg=<optimized out>, kwnames=0x0)
    at /home/conda/feedstock_root/build_artifacts/python_1596656032113/work/Python/ceval.c:4872
#14 0x00005623960cfcaa in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python_1596656032113/work/Python/ceval.c:3335
#15 0x00005623960924b5 in PyEval_EvalFrameEx (throwflag=0, Python Exception <class 'RuntimeError'> Type does not have a target.:
f=) at /home/conda/feedstock_root/build_artifacts/python_1596656032113/work/Python/ceval.c:754
#16 _PyFunction_FastCall (globals=<optimized out>, nargs=<optimized out>, args=<optimized out>, co=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python_1596656032113/work/Python/ceval.c:4933
#17 fast_function (func=<optimized out>, stack=<optimized out>, nargs=<optimized out>, kwnames=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python_1596656032113/work/Python/ceval.c:4968
#18 0x00005623960a13af in call_function (pp_stack=0x7ffdfa2313f8, oparg=<optimized out>, kwnames=0x0)
    at /home/conda/feedstock_root/build_artifacts/python_1596656032113/work/Python/ceval.c:4872
#19 0x00005623960cfcaa in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python_1596656032113/work/Python/ceval.c:3335
#20 0x00005623960924b5 in PyEval_EvalFrameEx (throwflag=0, Python Exception <class 'RuntimeError'> Type does not have a target.:
f=) at /home/conda/feedstock_root/build_artifacts/python_1596656032113/work/Python/ceval.c:754

Looking at the first 20 frames of the stack (the later frames are all irrelevant Python internal calls, omitted for brevity), you can see a handful of internal Python calls: _PyEval_EvalFrameDefault, _PyFunction_FastCall, and _PyEval_EvalCodeWithName. There are also some calls to libcuda.so.

This observation hints that maybe there is a deadlock. It could be Python, CUDA, or possibly both. The Linux Wikibook on Deadlocks contains methodologies for debugging deadlocks that can help you move forward.

However, instead of pthread_mutex_lock as described in the Wikibook, here it is pthread_rwlock_wrlock.

(gdb) bt
#0  0x00007f8e94762938 in pthread_rwlock_wrlock () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f8e8b6910a8 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
#2  0x00007f8e8b45787c in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
…

According to the documentation for pthread_rwlock_wrlock, it takes a single argument, rwlock, which is a read/write lock (the POSIX prototype is int pthread_rwlock_wrlock(pthread_rwlock_t *rwlock)). Now, look at what the code is doing and list the source:

(gdb) list
6       /home/conda/feedstock_root/build_artifacts/python_1596656032113/work/Programs/python.c: No such file or directory.

There are no debugging symbols. Returning to the Linux Wikibook, you can look at the registers. You can do that in GDB as well:

(gdb) info reg
rax            0xfffffffffffffe00  -512
rbx            0x5623984aa750      94710878873424
rcx            0x7f5494d48938      140001250937144
rdx            0x3                 3
rsi            0x189               393
rdi            0x5623984aa75c      94710878873436
rbp            0x0                 0x0
rsp            0x7ffdfa230be0      0x7ffdfa230be0
r8             0x0                 0
r9             0xffffffff          4294967295
r10            0x0                 0
r11            0x246               582
r12            0x5623984aa75c      94710878873436
r13            0xca                202
r14            0xffffffff          4294967295
r15            0x5623984aa754      94710878873428
rip            0x7f5494d48938      0x7f5494d48938 <pthread_rwlock_wrlock+328>
eflags         0x246               [ PF ZF IF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0

The problem is not knowing what they mean. Fortunately, documentation exists, for example the Guide to x86-64 from Stanford CS107, which explains that the first six arguments are passed in registers %rdi, %rsi, %rdx, %rcx, %r8, and %r9.

As seen previously, pthread_rwlock_wrlock takes only one argument, so that argument must be in %rdi. The rest are most likely being used as general-purpose registers by pthread_rwlock_wrlock.

Now, you need to read the %rdi register. You already know it has the type pthread_rwlock_t, so dereferencing must be possible:

(gdb) p *(pthread_rwlock_t*)$rdi
$2 = {__data = {__lock = 3, __nr_readers = 0, __readers_wakeup = 0, __writer_wakeup = 898, __nr_readers_queued = 0, __nr_writers_queued = 0, __writer = 0,
    __shared = 0, __pad1 = 0, __pad2 = 0, __flags = 0}, __size = "\003", '\000' <repeats 11 times>, "\202\003", '\000' <repeats 41 times>, __align = 3}

What is shown is the internal state of the pthread_rwlock_t object that libcuda.so passed to pthread_rwlock_wrlock: the lock itself. Unfortunately, the names are not of much help. You can infer that __lock likely means the number of concurrent attempts to acquire the lock, but that is the extent of the inference.

The only other attribute with a non-zero value is __writer_wakeup. The Linux Wikibook lists an interesting value called __owner, which points to the process identifier (PID) that currently has ownership of the lock. Given that pthread_rwlock_t is a read/write lock, presuming that __writer_wakeup points to the process that owns the lock may be a good next step.

One fact about Linux is that each thread in a program runs as if it were a process. Each thread should have a PID (or LWP in GDB).
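You can see this directly on the system: each thread of a process appears under /proc/<pid>/task, and the directory names are the same LWP numbers that GDB reports. For the worker attached to earlier (PID 885), the listing would look roughly like this (illustrative output):

ls /proc/885/task
885  889  890  891  892  893  894  898  899  902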

Look again at all the threads in the process for one thread that has the same PID as __writer_wakeup. Fortunately, one thread does have that ID:

(gdb) info threads
  Id   Target Id                                        Frame
  8    Thread 0x7f53f6048700 (LWP 898) "python"         0x00007f5494d49f85 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0

So far, it seems likely that Thread 8 holds the lock that Thread 1 is attempting to acquire. The stack of Thread 8 may provide a clue about what is going on. Run that next:

(gdb) thread apply 8 bt
Thread 8 (Thread 0x7f53f6048700 (LWP 898) "python"):
#0  0x00007f5494d49f85 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00005623960e59e0 in PyCOND_TIMEDWAIT (cond=0x562396232f40 <gil_cond>, mut=0x562396232fc0 <gil_mutex>, us=5000) at /home/conda/feedstock_root/build_artifacts/python_1596656032113/work/Python/condvar.h:103
#2  take_gil (tstate=0x5623987ff240) at /home/conda/feedstock_root/build_artifacts/python_1596656032113/work/Python/ceval_gil.h:224
#3  0x000056239601cf7e in PyEval_RestoreThread (tstate=0x5623987ff240) at /home/conda/feedstock_root/build_artifacts/python_1596656032113/work/Python/ceval.c:369
#4  0x00005623960e5cd4 in PyGILState_Ensure () at /home/conda/feedstock_root/build_artifacts/python_1596656032113/work/Python/pystate.c:895
#5  0x00007f5493610aa7 in _CallPythonObject (pArgs=0x7f53f6042e80, flags=4353, converters=(<_ctypes.PyCSimpleType at remote 0x562396b4d588>,), callable=<function at remote 0x7f53ec6e6950>, setfunc=0x7f549360ba80 <L_set>, restype=0x7f549369b9d8, mem=0x7f53f6043010) at /usr/local/src/conda/python-3.6.11/Modules/_ctypes/callbacks.c:141
#6  closure_fcn (cif=<optimized out>, resp=0x7f53f6043010, args=0x7f53f6042e80, userdata=<optimized out>) at /usr/local/src/conda/python-3.6.11/Modules/_ctypes/callbacks.c:296
#7  0x00007f54935fa3d0 in ffi_closure_unix64_inner () from /opt/conda/envs/rapids/lib/python3.6/lib-dynload/../../libffi.so.6
#8  0x00007f54935fa798 in ffi_closure_unix64 () from /opt/conda/envs/rapids/lib/python3.6/lib-dynload/../../libffi.so.6
#9  0x00007f548ba99dc6 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
#10 0x00007f548badd4a5 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
#11 0x00007f54935fa630 in ffi_call_unix64 () from /opt/conda/envs/rapids/lib/python3.6/lib-dynload/../../libffi.so.6
#12 0x00007f54935f9fed in ffi_call () from /opt/conda/envs/rapids/lib/python3.6/lib-dynload/../../libffi.so.6
#13 0x00007f549361109e in _call_function_pointer (argcount=6, resmem=0x7f53f6043400, restype=<optimized out>, atypes=0x7f53f6043380, avalues=0x7f53f60433c0, pProc=0x7f548bad61f0 <cuOccupancyMaxPotentialBlockSize>, flags=4353) at /usr/local/src/conda/python-3.6.11/Modules/_ctypes/callproc.c:831
#14 _ctypes_callproc (pProc=0x7f548bad61f0 <cuOccupancyMaxPotentialBlockSize>, argtuple=<optimized out>, flags=4353, argtypes=<optimized out>, restype=<_ctypes.PyCSimpleType at remote 0x562396b4d588>, checker=0x0) at /usr/local/src/conda/python-3.6.11/Modules/_ctypes/callproc.c:1195
#15 0x00007f5493611ad5 in PyCFuncPtr_call (self=self@entry=0x7f53ed534750, inargs=<optimized out>, kwds=<optimized out>) at /usr/local/src/conda/python-3.6.11/Modules/_ctypes/_ctypes.c:3970
#16 0x000056239601d5ae in PyObject_Call (func=Python Exception <class 'RuntimeError'> Type does not have a target.:
, args=<optimized out>, kwargs=<optimized out>) at /home/conda/feedstock_root/build_artifacts/python_1596656032113/work/Objects/abstract.c:2261
#17 0x00005623960d13a2 in do_call_core (kwdict=0x0, callargs=(<CArgObject at remote 0x7f53ed516530>, <CArgObject at remote 0x7f53ed516630>, <c_void_p at remote 0x7f53ed4cad08>, <CFunctionType at remote 0x7f5410f4ef20>, 0, 1024), func=Python Exception <class 'RuntimeError'> Type does not have a target.:
) at /home/conda/feedstock_root/build_artifacts/python_1596656032113/work/Python/ceval.c:5120
#18 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at /home/conda/feedstock_root/build_artifacts/python_1596656032113/work/Python/ceval.c:3404
#19 0x0000562396017ea8 in PyEval_EvalFrameEx (throwflag=0, f=Python Exception <class 'RuntimeError'> Type does not have a target.:
) at /home/conda/feedstock_root/build_artifacts/python_1596656032113/work/Python/ceval.c:754
#20 _PyEval_EvalCodeWithName (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kwnames=0x0, kwargs=0x7f541805a390, kwcount=<optimized out>, kwstep=1, defs=0x0, defcount=0, kwdefs=0x0, closure=(<cell at remote 0x7f5410520408>, <cell at remote 0x7f53ed637c48>, <cell at remote 0x7f53ed6377f8>), name=Python Exception <class 'RuntimeError'> Type does not have a target.:
, qualname=Python Exception <class 'RuntimeError'> Type does not have a target.:
) at /home/conda/feedstock_root/build_artifacts/python_1596656032113/work/Python/ceval.c:4166

At the top of the stack, it looks like an ordinary Python thread waiting for the GIL. It seems unsuspicious, so you might simply ignore it and look for clues elsewhere. That is exactly what we did in 2019.

Take a more thorough look at the rest of the stack, especially frames 9 and 10:

#9  0x00007f548ba99dc6 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
#10 0x00007f548badd4a5 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so

At this point, things may look even more confusing. Thread 1 is locking inside libcuda.so internals. Without access to the CUDA source code, debugging would be difficult.

Further examining the stack of Thread 8, you can see two frames that provide hints:

#13 0x00007f549361109e in _call_function_pointer (argcount=6, resmem=0x7f53f6043400, restype=<optimized out>, atypes=0x7f53f6043380, avalues=0x7f53f60433c0, pProc=0x7f548bad61f0 <cuOccupancyMaxPotentialBlockSize>, flags=4353) at /usr/local/src/conda/python-3.6.11/Modules/_ctypes/callproc.c:831
#14 _ctypes_callproc (pProc=0x7f548bad61f0 <cuOccupancyMaxPotentialBlockSize>, argtuple=<optimized out>, flags=4353, argtypes=<optimized out>, restype=<_ctypes.PyCSimpleType at remote 0x562396b4d588>, checker=0x0) at /usr/local/src/conda/python-3.6.11/Modules/_ctypes/callproc.c:1195

To summarize so far: two threads are sharing a lock. Thread 8 is trying to take the GIL while also making a CUDA call to cuOccupancyMaxPotentialBlockSize.

However, libcuda.so does not know anything about Python, so why is it trying to take the GIL?

The docs for cuOccupancyMaxPotentialBlockSize show that it takes a callback. Callbacks are functions that can be registered with another function to be executed at a certain point in time, effectively running a user-defined action at that predefined point.
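As a minimal illustration of the mechanism (not code from this bug), consider registering a Python callback with a plain C function through ctypes: every time the C code invokes the callback, ctypes must acquire the GIL to run the Python function. This is exactly what frames 4 and 5 of Thread 8 (PyGILState_Ensure and _CallPythonObject) show happening for the cuOccupancyMaxPotentialBlockSize callback.

import ctypes

libc = ctypes.CDLL("libc.so.6")

# Matches qsort's comparator type: int (*compar)(const void *, const void *)
CMPFUNC = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.POINTER(ctypes.c_int), ctypes.POINTER(ctypes.c_int))

def py_cmp(a, b):
    # Python code: every invocation from inside qsort must take the GIL
    return a[0] - b[0]

arr = (ctypes.c_int * 5)(5, 1, 4, 2, 3)
libc.qsort(arr, len(arr), ctypes.sizeof(ctypes.c_int), CMPFUNC(py_cmp))
print(list(arr))  # [1, 2, 3, 4, 5]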

This is interesting. Next, find out where that call comes from. Grepping through piles and piles of code (cuDF, Dask, RMM, CuPy, and Numba) reveals an explicit call to cuOccupancyMaxPotentialBlockSize in the 0.45 release of Numba:

    def get_max_potential_block_size(self, func, b2d_func, memsize, blocksizelimit, flags=None):
        """Suggest a launch configuration with reasonable occupancy.
        :param func: kernel for which occupancy is calculated
        :param b2d_func: function that calculates how much per-block dynamic shared memory 'func' uses based on the block size.
        :param memsize: per-block dynamic shared memory usage intended, in bytes
        :param blocksizelimit: maximum block size the kernel is designed to handle"""


        gridsize = c_int()
        blocksize = c_int()
        b2d_cb = cu_occupancy_b2d_size(b2d_func)
        if not flags:
            driver.cuOccupancyMaxPotentialBlockSize(byref(gridsize), byref(blocksize),
                                                    func.handle,
                                                    b2d_cb,
                                                    memsize, blocksizelimit)
        else:
            driver.cuOccupancyMaxPotentialBlockSizeWithFlags(byref(gridsize), byref(blocksize),
                                                             func.handle, b2d_cb,
                                                             memsize, blocksizelimit, flags)
        return (gridsize.value, blocksize.value)

This function is called in numba/cuda/compiler:

    def _compute_thread_per_block(self, kernel):
        tpb = self.thread_per_block
        # Prefer user-specified config
        if tpb != 0:
            return tpb
        # Else, ask the driver to give a good config
        else:
            ctx = get_context()
            kwargs = dict(
                func=kernel._func.get(),
                b2d_func=lambda tpb: 0,
                memsize=self.sharedmem,
                blocksizelimit=1024,
            )
            try:
                # Raises from the driver if the feature is unavailable
                _, tpb = ctx.get_max_potential_block_size(**kwargs)
            except AttributeError:
                # Fallback to table-based approach.
                tpb = self._fallback_autotune_best(kernel)
                raise
            return tpb

Looking closely at the function definition for _compute_thread_per_block, you can see a callback written as a Python lambda: b2d_func=lambda tpb: 0

Aha! In the middle of this CUDA call, the callback function must acquire the Python GIL to execute a function that simply returns 0. This is because executing any Python code requires the GIL, and the GIL can only be owned by a single thread at any given point in time.

Replacing this with a pure C function solves the problem, and you can write a pure C function from Python with Numba! The @cfunc decorator compiles the function to native code that is callable through a plain C function pointer (its address attribute), so the CUDA driver can invoke the callback without ever acquiring the GIL.

@cfunc("uint64(int32)")
def _b2d_func(tpb):
    return 0

b2d_func=_b2d_func.address

This fix was submitted and eventually merged into Numba in PR #4581. And this five-line code change is ultimately what resolved a problem that had kept several people poring over the code for many weeks of debugging.

Debugging lessons learned

Throughout the various debugging sessions, and even more after the bug was finally resolved, we reflected on the problem and came up with the following lessons:

  • Don't implement deadlocks. Really, don't!
  • Don't pass Python functions to C/C++ functions as callbacks unless you are absolutely certain that the GIL is not held by another thread when the callback is executed. Even if you are absolutely sure the GIL is not held, double- and triple-check. You don't want to take any chances here.
  • Use all the tools at your disposal. Even though you primarily write Python code, you can still find a bug in a library written in another language like C or C++. GDB is powerful for debugging C and C++ as well as Python. For more information, see GDB support.

Bug complexity compared to code fix complexity

The plot in Figure 2 is a representation of a common debugging pattern: the time spent understanding and finding the problem is high, while the extent of the changes is low. This case is a perfect example of that pattern: debugging time tending to infinity and lines of code written or modified tending to zero.

Figure 2. The lines of code required to fix a problem are inversely proportional to the debugging time invested

Conclusion

Debugging can be intimidating, particularly when you do not have access to all of the source code or a friendly IDE. As scary as GDB can look, it is just as powerful. However, with the right tools, experience, and knowledge gained over time, seemingly impossible-to-understand problems can be examined at varying levels of detail and thoroughly understood.

This post presented a step-by-step overview of how one bug took a multifaceted development team dozens of engineering hours to resolve. With this overview and some understanding of GDB, multithreading, and deadlocks, you can help resolve moderately complex problems with a newly gained skill set.

Finally, never limit yourself to the tools you already know. If you know PDB, try GDB next. If you understand enough about the OS call stack, try exploring registers and other CPU properties. Those skills can help make developers of all fields and programming languages more aware of potential pitfalls, and provide unique opportunities to prevent silly mistakes from becoming nightmarish monsters.
