Custom Query (1030 matches)
Results (196 - 198 of 1030)
| Ticket | Resolution | Summary | Owner | Reporter |
|---|---|---|---|---|
| #410 | fixed | interrupting/suspending threads on Windows | ||
| Description |
On most Unix platforms, a thread is not interruptible (has all signals masked) while entering and exiting an exception/asynchronous signal handler. On Darwin, the transitions that occur on entry to and exit from an exception handler are managed by another thread, so there's no point at which a Darwin thread is runnable, interruptible, and transitioning between running lisp code and running an exception handler. Threads can be suspended (for GC or other reasons) or interrupted (for PROCESS-INTERRUPT) by sending asynchronous signals to the target thread, and the handlers for these signals can't run (the signals are masked) while the thread is transitioning into and out of a signal/exception handler. Windows has no real concept of maskable signals, and many aspects of the transition between "running lisp code" and "running an exception handler" happen in user-mode code that can be suspended. It's not generally safe to interrupt or suspend a thread while it's in the process of calling a handler or returning from a handler unless the machine context that the handler's being called with (or that the handler's returning to) is visible to other threads (e.g., via the target thread's TCR). On Windows, we generally try to emulate the effect of asynchronous signals by suspending the target thread and then (possibly) manipulating its context, possibly forcing it to run a handler. Depending on the state of the thread at the point where it's suspended, this can be complicated. The TCR has an integer field - tcr.valence - that is set to 0 (TCR_STATE_LISP) on synchronous transitions to lisp code (return from ff-call, entry to callback) and set to a non-zero value on synchronous transitions to foreign code (entry to ff-call, return from callback). On Unix platforms, transitions due to exception/signal processing change tcr.valence in a way that's atomic with respect to asynchronous signals.
On Windows, we try to detect the case where a target thread (that we wish to interrupt or suspend) is in the middle of a transition due to an exception by noting that tcr.valence is equal to TCR_STATE_LISP but the PC is not in some expected range (the lisp heap, subprims, the subprims jump table, the temp stack - as happens briefly when stack-consed closures are called - ...). If the PC is in the range of addresses used to return from an exception handler - in the assembler function restore_windows_context() - then we can generally do some form of pc-lusering to either back out of an attempt to partially restore the context or emulate its full restoration; this depends on precise knowledge of the inner workings of restore_windows_context(). For win64, that "precise knowledge" is probably pretty accurate; for win32, it's known not to be (there's a compiler warning indicating that some aspects of "win32 context setup" are incomplete). If a thread has lisp valence but the PC isn't in a "legal place for lisp code", then it's assumed that the thread is entering an exception handler. In this case, a bit is set in the thread's TCR and the thread is resumed; when it safely enters its exception handler (saves the context in the TCR, sets the valence correctly), it checks this bit and raises a semaphore to acknowledge the suspend request, then waits on another semaphore (which the suspending thread raises to indicate that the thread should resume). On both Windows platforms, these mechanisms are complicated and likely in need of further review and testing. For instance, it -may- be possible for a Windows thread to be suspended when it hasn't yet resumed after a previous suspend request. (I say "may" meaning exactly that: I'm not sure whether this is possible or not.) It should be possible (i.e., there's no known deep technical issue) to at least bring the win32 code up to the same level of completeness as the win64 code exhibits. 
SLIME tries to log/record information about protocol events, and the logged information often involves the printed representations of lisp PROCESS objects (which, by default, includes their WHOSTATE, and it's often necessary to suspend the thread in order to find the value of this thread-local binding.) This exercises the suspend/resume code heavily (heavily enough that I had to avoid printing the WHOSTATE in order to avoid exercising the buggy suspend/resume code.) Being able to run SLIME without disabling WHOSTATE printing would be a good test. (And since SLIME conses excessively and by default spawns threads for expression evaluation, it'll exercise other things as well.) |
|||
| #1136 | fixed | interning a million symbols is slow | ||
| Description |
From clop on #ccl:
; This script demonstrates an unusual, intermittent slowdown when growing
; packages by adding symbols with INTERN.
;
; The test simply tries to intern a million symbols into a package. This
; takes several minutes to complete.
;
; It appears that resizing the package is occasionally extremely slow. It can
; sometimes take hundreds of seconds to do an INTERN because the package is
; grown.
;
; It looks to me like the problem is NOT the search for a good, relatively
; prime number -- that never seems to take more than a few microseconds.
;
; It also looks to me like the problem isn't *really* that we're growing too
; slowly. We do seem to be growing slowly--only around 10% at each
; iteration--and it might make more sense to grow at a rate like 1.5x, but
; that doesn't seem like the real culprit.
;
; Rather, there seems to be some kind of bad hashing behavior happening at
; certain thresholds along the way. As the package is grown, each time only
; growing by a small number of entries, the rehashing time bounces around
; wildly, e.g.,:
;
; - half a second
; - half a second
; - half a second
; - 5 seconds
; - half a second
; - 370 seconds
;
; So this "feels" like what happens when a hashing algorithm isn't sufficiently
; randomizing things. But that's just speculation at this point; I haven't
; tried to analyze, e.g., how many collisions there are in the table.
; --------------------------------------------------------------------------
; Preliminaries, stuff from nfasload.lisp + more timing and printing
; $ ccl
(in-package "CCL")
(gc-verbose t t)
(ccl::egc nil)
(setq ccl::*warn-if-redefine-kernel* nil)
(defconstant $primsizes (make-array 23
:element-type '(unsigned-byte 16)
:initial-contents
'(41 61 97 149 223 337 509 769 887 971 1153 1559 1733
2609 2801 3917 5879 8819 13229 19843 24989 29789 32749)))
(defconstant $hprimes (make-array 8
:element-type '(unsigned-byte 16)
:initial-contents '(5 7 11 13 17 19 23 29)))
;;; Symbol hash tables: (htvec . (hcount . hlimit))
(defmacro htvec (htab) `(%car ,htab))
(defmacro htcount (htab) `(%cadr ,htab))
(defmacro htlimit (htab) `(%cddr ,htab))
(defun inspect-package (name)
;; Dumb tool to print out some stats
(let* ((pkg (find-package name))
(tab (ccl::pkg.itab pkg)))
(format t "; - Current count: ~:D~%" (htcount tab))
(format t "; - Current limit: ~:D~%" (htlimit tab))
nil))
(defun %initialize-htab (htab size)
(declare (fixnum size))
;; Ensure that "size" is relatively prime to all secondary hash values.
;; If it's small enough, pick the next highest known prime out of the
;; "primsizes" array. Otherwise, iterate through all of "hprimes"
;; until we find something relatively prime to all of them.
(format t "Looking for relatively prime, starting with ~:D~%" size)
(time (setq size
(if (> size 32749)
(do* ((nextsize (logior 1 size) (+ nextsize 2)))
()
(declare (fixnum nextsize))
(when (dotimes (i 8 t)
(unless (eql 1 (gcd nextsize (uvref #.$hprimes i)))
(return)))
(return nextsize)))
(dotimes (i (the fixnum (length #.$primsizes)))
(let* ((psize (uvref #.$primsizes i)))
(declare (fixnum psize))
(if (>= psize size)
(return psize)))))))
(setf (htvec htab) (make-array size #|:initial-element 0|#))
(setf (htcount htab) 0)
(setf (htlimit htab) (the fixnum (- size (the fixnum (ash size -3)))))
htab)
(defun %resize-htab (htab)
(declare (optimize (speed 3) (safety 0)))
(format t "About to resize:~%")
(inspect-package "FOO")
(time
(without-interrupts
(let* ((old-vector (htvec htab))
(old-len (length old-vector)))
(declare (fixnum old-len)
(simple-vector old-vector))
(let* ((nsyms 0))
(declare (fixnum nsyms))
(dovector (s old-vector)
(when (symbolp s) (incf nsyms)))
(%initialize-htab htab
(the fixnum (+
(the fixnum
(+ nsyms (the fixnum (ash nsyms -2))))
2)))
(let* ((new-vector (htvec htab))
(nnew 0))
(declare (fixnum nnew)
(simple-vector new-vector))
(dotimes (i old-len (setf (htcount htab) nnew))
(let* ((s (svref old-vector i)))
(if (symbolp s)
(let* ((pname (symbol-name s)))
(setf (svref
new-vector
(nth-value
2
(%get-htab-symbol
pname
(length pname)
htab)))
s)
(incf nnew)))))
htab)))))
(format t "Done with resize:~%")
(inspect-package "FOO"))
; ------------------------------------------------------------------------
; A very basic test... try to intern a million numbers as strings:
(make-package "FOO" :use nil)
; Just to create the strings takes about 1.9 seconds, 335 MB
(time (loop for i fixnum from 1 to 1000000 do
(let ((name (format nil "~a" i)))
(declare (ignore name)))))
; Now let's try to intern them...
(time (let ((pkg (find-package "FOO")))
(loop for i fixnum from 1 to 1000000 do
(let ((name (format nil "~a" i)))
(intern name pkg)))))
; ------------------------------------------------------------------------
; A cut down log of the output that I got...
#||
*** Cruising along pretty well... ***
[...]
About to resize:
; - Current count: 352,740
; - Current limit: 352,740
Looking for relatively prime, starting with 440,927
(SETQ SIZE ...) took 5 microseconds (0.000005 seconds) to run.
(WITHOUT-INTERRUPTS ...) took 373,011 microseconds (0.373011 seconds) to run.
3,550,272 bytes of memory allocated.
About to resize:
; - Current count: 385,812
; - Current limit: 385,812
Looking for relatively prime, starting with 482,267
(SETQ SIZE ...) took 4 microseconds (0.000004 seconds) to run.
(WITHOUT-INTERRUPTS ...) took 341,521 microseconds (0.341521 seconds) to run.
*** Until --- Yikes, what happened here?!? ***
About to resize:
; - Current count: 421,984
; - Current limit: 421,984
Looking for relatively prime, starting with 527,482
(SETQ SIZE ...) took 5 microseconds (0.000005 seconds) to run.
(WITHOUT-INTERRUPTS ...) took 89,159,913 microseconds (89.159910 seconds) to run.
During that period, and with 8 available CPU cores,
89,142,448 microseconds (89.142450 seconds) were spent in user mode
11,998 microseconds ( 0.011998 seconds) were spent in system mode
4,242,752 bytes of memory allocated.
2 minor page faults, 0 major page faults, 0 swaps.
Done with resize:
; - Current count: 421,984
; - Current limit: 461,552
*** And then we're back to normal for awhile... ***
About to resize:
; - Current count: 461,552
; - Current limit: 461,552
Looking for relatively prime, starting with 576,942
(SETQ SIZE ...) took 6 microseconds (0.000006 seconds) to run.
(WITHOUT-INTERRUPTS ...) took 385,721 microseconds (0.385721 seconds) to run.
About to resize:
; - Current count: 504,826
; - Current limit: 504,826
Looking for relatively prime, starting with 631,034
(SETQ SIZE ...) took 7 microseconds (0.000007 seconds) to run.
(WITHOUT-INTERRUPTS ...) took 500,481 microseconds (0.500481 seconds) to run.
About to resize:
; - Current count: 552,160
; - Current limit: 552,160
Looking for relatively prime, starting with 690,202
(SETQ SIZE ...) took 5 microseconds (0.000005 seconds) to run.
(WITHOUT-INTERRUPTS ...) took 320,744 microseconds (0.320744 seconds) to run.
About to resize:
; - Current count: 603,928
; - Current limit: 603,928
Looking for relatively prime, starting with 754,912
(SETQ SIZE ...) took 6 microseconds (0.000006 seconds) to run.
(WITHOUT-INTERRUPTS ...) took 457,308 microseconds (0.457308 seconds) to run.
*** Then a slight bump ***
About to resize:
; - Current count: 660,549
; - Current limit: 660,549
Looking for relatively prime, starting with 825,688
(SETQ SIZE ...) took 6 microseconds (0.000006 seconds) to run.
During that period, and with 8 available CPU cores,
0 microseconds (0.000000 seconds) were spent in user mode
0 microseconds (0.000000 seconds) were spent in system mode
(WITHOUT-INTERRUPTS ...) took 5,804,430 microseconds (5.804430 seconds) to run.
During that period, and with 8 available CPU cores,
5,794,119 microseconds (5.794119 seconds) were spent in user mode
9,999 microseconds (0.009999 seconds) were spent in system mode
6,628,368 bytes of memory allocated.
4 minor page faults, 0 major page faults, 0 swaps.
Done with resize:
; - Current count: 660,549
; - Current limit: 722,478
*** Then back to normal ... ***
About to resize:
; - Current count: 722,478
; - Current limit: 722,478
Looking for relatively prime, starting with 903,099
(SETQ SIZE ...) took 7 microseconds (0.000007 seconds) to run.
(WITHOUT-INTERRUPTS ...) took 389,308 microseconds (0.389308 seconds) to run.
*** Then something goes horribly wrong ***
About to resize:
; - Current count: 790,212
; - Current limit: 790,212
Looking for relatively prime, starting with 987,767
(SETQ SIZE ...) took 9 microseconds (0.000009 seconds) to run.
During that period, and with 8 available CPU cores,
0 microseconds (0.000000 seconds) were spent in user mode
0 microseconds (0.000000 seconds) were spent in system mode
(WITHOUT-INTERRUPTS ...) took 372,714,560 microseconds (372.714570 seconds) to run.
During that period, and with 8 available CPU cores,
372,589,358 microseconds (372.589360 seconds) were spent in user mode
76,988 microseconds ( 0.076988 seconds) were spent in system mode
7,925,024 bytes of memory allocated.
4 minor page faults, 0 major page faults, 0 swaps.
*** and on it goes ***
The next resize hasn't completed yet, but since it began I've cleaned up
this script, adding all of the comments, etc., so it's probably been
running for at least 20 minutes.
||#
|
|||
| #1313 | invalid | intermittent socket-related failures (concurrency-related?) | ||
| Description |
The presumed bug causes "Bad file descriptor (error #9) during write" and/or "Unexpected end of file" during lfarm tests. It never appears with the old ccl. Of course, there might instead be an issue with cl-store, flexi-streams, or my own code depending on an old bug. The tests do highly concurrent work with streams, and the failure is intermittent. The recent changes to sockets in ccl might have introduced a concurrency problem. To reproduce: run (ql:quickload :lfarm-test) and (lfarm-test:execute) on Linux x86 (32-bit). |
|||
