Version 6 (modified by gz, 7 years ago)
Running CCL under GDB
Any tips and tricks to share?
What do I do about segfaults? What about SIG40?
You basically want to tell GDB to load (or "source", as a verb) a file that tells it which signals are handled by the application and that defines some macros (most of which have to do with printing lisp object values):
(gdb) source ccl/lisp-kernel/linuxx8664/.gdbinit
That file will be sourced automatically if it (or a link to it) is in the same directory as the executable (or, IIRC, in your home directory.)
I think that the "handle" forms in the .gdbinit file enumerate all of the signals that the lisp handles; there was a time last fall when at least one case was missing from the checked-in .gdbinit file. The general idea is to say something like:
handle SIGQUIT pass nostop noprint
which tells GDB that if the target process gets a SIGQUIT, it should let the application handle it (GDB should "pass" it to the application) without stopping or printing anything.
A SIGINT by default causes entry to GDB and is not passed to the application. I sometimes find it useful to be able to interrupt the lisp via SIGINT (after entering GDB). Doing something like
handle SIGINT pass stop print
causes GDB to ask for confirmation because "SIGINT is used by the debugger". (It's not used in the same way that breakpoints and single-step exceptions are used, so I usually just sigh and give it the confirmation it craves.)
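If you want to keep a few such settings in a file of your own, the "handle" lines can be generated rather than typed by hand. This is only a minimal sketch: the helper name handle_line and its parameters are made up for illustration, and it covers just the two cases discussed above. It is not a list of the signals the lisp actually handles; the checked-in .gdbinit is the place to look for that.

```python
def handle_line(signal_name, pass_to_program=True, stop=False, announce=False):
    """Build one GDB `handle` command for a signal (hypothetical helper).

    pass_to_program -> "pass"/"nopass": let the application see the signal
    stop            -> "stop"/"nostop": whether GDB stops the program
    announce        -> "print"/"noprint": whether GDB reports the signal
    """
    keywords = [
        "pass" if pass_to_program else "nopass",
        "stop" if stop else "nostop",
        "print" if announce else "noprint",
    ]
    return "handle {} {}".format(signal_name, " ".join(keywords))


# The two settings discussed in the text.
print(handle_line("SIGQUIT"))                          # quietly pass to lisp
print(handle_line("SIGINT", stop=True, announce=True))  # stop in GDB, then pass
```

The printed lines can be pasted at the (gdb) prompt or appended to a file that you then "source".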
GDB's very good at debugging C code that was compiled with debugging enabled and for which you have the source code. (It's even better if optimization's toned down.) If you're trying to debug C library code for which you have the source and for which debugging information was generated and not stripped, GDB's sort of in its element and offers lots of useful features.
The lisp wants to handle most of the signals that can be raised to indicate an exception. On x86-64 Linux, SIGSEGV means lots of things, and those things in turn mean different things when they occur while you're executing lisp code than they would if you were executing C ("foreign") code. If the lisp's exception-handling code doesn't know how to handle an exception, it enters the kernel debugger (there's no good and direct way to pass it to another debugger.) When an exception occurs in foreign code, the kernel debugger tries to note that fact; ideally, it would also disable some debugging commands that only make sense if the exception occurred while executing lisp code, but it leaves them enabled. (The "L" kernel debugger command is very useful for seeing the values of lisp objects in registers at the point of the exception, but it will crash or misbehave if those registers don't contain lisp objects, as they wouldn't if the exception occurred in foreign code.)
GDB's much more likely to be able to make at least some sense out of the state of things in the exception-in-foreign-code case than the lisp's kernel debugger is. If GDB's already running (as opposed to having been attached after the fact), you can get into GDB at the point of the exception via the same technique that I described a few weeks ago (but it's a little easier if you don't have to play "guess which thread was in the kernel debugger".) The general idea is:
a) In the kernel debugger do R to display raw (hex) register values and note the value in RIP (the program counter/instruction pointer.)
b) Drop into GDB (via ^C) and set a breakpoint at that address. If the address is 0x87654321, the GDB command to set that breakpoint would be:
(gdb) br *0x87654321
The leading asterisk is necessary to prevent GDB from interpreting the integer as a line number.
c) Tell GDB to let the interrupted process continue:
(gdb) continue
The kernel debugger will likely still be waiting for input. All other lisp threads should be suspended.
d) In the kernel debugger, use the "x" command, which exits from the kernel debugger and resumes other threads.
The next time any thread reaches the address of the breakpoint, GDB will be entered. It's hard to guarantee that the first thread that reaches that point will be the one that got the exception, but it's usually very likely (other threads usually require some time to wake up after being suspended.)
In GDB at that point,
(gdb) bt
will do a backtrace (at least as far back as the foreign function call from lisp), and
(gdb) info registers
will show register values.
If the foreign code has symbolic debugging information and wasn't heavily optimized, you can do a lot more (show argument and local variable values, see argument names and values in backtrace, etc.) at that point. If the problem is in some library code (either in its behavior or in the parameters that lisp is passing it) and it's possible to build the library with debugging enabled and optimization toned down, you'll probably find the problem much more quickly than you would otherwise.
As far as other tips and tricks ... I'm not sure what I could say that'd be meaningful without a long explanation of how the lisp is implemented. The manual actually does explain quite a bit of that. If you want to use GDB to step through/set breakpoints in compiled lisp code it's certainly possible to do that (I do it all the time ...), but explaining the issues and details might take a while. (From GDB's point of view, this is like debugging machine code or debugging C code that you don't have the source to and don't have symbolic information for; it's OK at that and there isn't anything better at it widely available under Linux, but that's not really its primary area of focus.)
Here are some hints for linuxx8664:
To find the address corresponding to a lisp symbol, first tell GDB to call the "find_symbol" function, which walks memory until it finds a symbol with a matching pname and returns the symbol tagged as a vector:
(gdb) call find_symbol("FIND-IF-NOT")
$1 = 52777632305533
You can then look at the slots of the symbol, which are a header followed by the pname, value and function. The address returned by find_symbol is tagged, so you have to subtract the tag (fulltag_misc, which is 13 on x8664) from it first:
(gdb) x/gx 52777632305533-13    ; subtract fulltag_misc = 13
0x300040069170: 0x0000000000000715    ; header
(gdb)
0x300040069178: 0x00003000000a995d    ; pname
(gdb)
0x300040069180: 0x0000000000000012    ; value
(gdb)
0x300040069188: 0x000030004006970f    ; function
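The arithmetic behind that dump can be checked outside GDB. The following is just a sketch of the address calculation, using the tag value (13) and the 8-byte slot spacing visible in the output above; the function name symbol_slots and the constant names are illustrative, not something defined by CCL or the .gdbinit file.

```python
FULLTAG_MISC = 13   # tag value given in the text for x8664
WORD_BYTES = 8      # the dump above shows slots 8 bytes apart


def symbol_slots(tagged_symbol):
    """Map slot names to addresses for a tagged symbol pointer.

    Only the four slots shown in the dump above are listed.
    """
    base = tagged_symbol - FULLTAG_MISC  # strip the tag to get the header address
    names = ["header", "pname", "value", "function"]
    return {name: base + i * WORD_BYTES for i, name in enumerate(names)}


# Value returned by find_symbol("FIND-IF-NOT") in the session above.
slots = symbol_slots(52777632305533)
for name, address in slots.items():
    print("{:<8} {:#x}".format(name, address))
```

Each address can be handed to x/gx directly; the word stored at the "function" address is what the breakpoint is then set on.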
You can set a breakpoint on entry to the function:
(gdb) br *0x000030004006970f
Note that you don't need to subtract any tags - the code starts right at the address of the function.
To enter GDB when lisp is starting up, set a breakpoint at *_SPfuncall, which is called soon after the image is loaded (and is rarely called thereafter, since funcall is inlined).
To enter GDB when lisp is already in the kernel debugger after an exception in foreign code, first do the R command in the kernel debugger to display raw (hex) register values and note the value in RIP (the program counter/instruction pointer.) Then:
shell> gdb /path/to/lisp-kernel
(gdb) source lisp-kernel/linuxx8664/.gdbinit
(gdb) attach <pid>      # pid is printed in brackets in the kernel debugger prompt
(gdb) br *0x87654321    # or whatever the RIP value is
(gdb) continue
Back in the kernel debugger:
[pid] Clozure CL kernel debugger: x
That should immediately break into GDB at the instruction that caused the fault. At that point:
(gdb) x/i $pc    # disassembles the instruction at the pc/%rip
(gdb) bt         # do a C backtrace
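If you go through the attach procedure often, the GDB side of it can be written to a command file. This is a rough sketch under a few assumptions: the prompt format is taken from the "[pid] Clozure CL kernel debugger:" example above, the pid 12345 is made up, and the helper names pid_from_prompt and attach_commands are hypothetical. You still have to read the RIP value off the R output yourself.

```python
import re


def pid_from_prompt(prompt):
    """Extract the pid from a '[pid] Clozure CL kernel debugger:' prompt."""
    match = re.match(r"\s*\[(\d+)\]", prompt)
    if match is None:
        raise ValueError("no [pid] prefix in prompt: {!r}".format(prompt))
    return int(match.group(1))


def attach_commands(prompt, rip, gdbinit="lisp-kernel/linuxx8664/.gdbinit"):
    """Return the GDB commands from the procedure above, one per list item."""
    pid = pid_from_prompt(prompt)
    return [
        "source {}".format(gdbinit),
        "attach {}".format(pid),
        "break *{:#x}".format(rip),  # leading * makes GDB treat it as an address
        "continue",
    ]


# Example with a made-up pid and the RIP value used in the text.
commands = attach_commands("[12345] Clozure CL kernel debugger: ", 0x87654321)
print("\n".join(commands))
```

Saving the output to a file and starting GDB with "gdb -x that-file /path/to/lisp-kernel" runs the same steps as typing them; after that, use "x" in the kernel debugger as described above.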
Some Linux distributions provide debugging information and library source for the standard libraries; on Fedora, this information is contained in optional "debuginfo" packages. If it's available, the information is often very useful.
To cause the GC (including the EGC) to run integrity checks on entry, add -DGC_INTEGRITY_CHECKING to the CDEFINES in the kernel Makefile and rebuild the kernel. Alternatively, you can (setq ccl::*gc-event-status-bits* 4) at any time for the same effect.
If you look at the .gdbinit file, there are a number of useful lisp-related commands defined there. Try them...