Ticket #297 (closed defect: fixed)

Opened 7 years ago

Last modified 6 years ago

process-enable experiencing broken symptoms on rc1.2

Reported by: ragerdl
Owned by: gb
Priority: major
Milestone:
Component: Runtime (threads, GC)
Version:
Keywords: process-enable, parallel, multi-threading
Cc: kaufmann@…

Description (last modified by gb)

Hello Gary,

Release candidate 1.2 breaks Parallel ACL2 in a repeatable way. OpenMCL reports that it is unable to enable a process after trying for one second. I noticed that process-enable in level-1/processes.lisp changed between 1.0.2 and rc1.2. It looks like the process is no longer being set as "Active". Perhaps this is just internal bookkeeping, but maybe calling attention to this change will help figure out why PACL2 doesn't work anymore.

Any ideas? I've included the transcript that reproduces the problem. As you can see, it's from the most basic parallel example we have, Fibonacci. There are more complex examples too if you're interested.

Thank you, David

lhug-7.cs.utexas.edu% ./saved_acl2
Welcome to Clozure Common Lisp Version 1.2-r9226-RC1  (LinuxX8664)!

 ACL2 Version 3.3 built May 5, 2008  02:56:57.
 Copyright (C) 2007  University of Texas at Austin
 ACL2 comes with ABSOLUTELY NO WARRANTY.  This is free software and you
 are welcome to redistribute it under certain conditions.  For details,
 see the GNU General Public License.

 Initialized with (INITIALIZE-ACL2 'INCLUDE-BOOK *ACL2-PASS-2-FILES*).
 See the documentation topic note-3-3 for recent changes.
 Note: We have modified the prompt in some underlying Lisps to further
 distinguish it from the ACL2 prompt.

 NOTE!!  Proof trees are disabled in ACL2.  To enable them in emacs,
 look under the ACL2 source directory in interface/emacs/README.doc; 
 and, to turn on proof trees, execute :START-PROOF-TREE in the ACL2 
 command loop.   Look in the ACL2 documentation under PROOF-TREE.

ACL2 Version 3.3.  Level 1.  Cbd 
"/v/filer4b/v8q002/hvg/parallel/pacl2-3.3/acl2-sources/".
Distributed books directory 
"/v/filer4b/v8q002/hvg/parallel/pacl2-3.3/acl2-sources/books/".
Type :help for help.
Type (good-bye) to quit completely out of ACL2.

ACL2 !>(include-book "parallel/fibonacci" :dir :system)

Summary
Form:  ( INCLUDE-BOOK "parallel/fibonacci" ...)
Rules: NIL
Warnings:  None
Time:  0.07 seconds (prove: 0.00, print: 0.00, other: 0.07)
 "/v/filer4b/v8q002/hvg/parallel/pacl2-3.3/acl2-sources/books/parallel/fibonacci.lisp"
ACL2 !>(time$ (fib 40))

(EV-REC (FARGN FORM 1) ALIST W (DECREMENT-BIG-N BIG-N) SAFE-MODE GC-OFF LATCHES HARD-ERROR-RETURNS-NILP) took 4,643,664 microseconds (4.643664 seconds) to run 
                    with 8 available CPU cores.
During that period, 4,644,291 microseconds (4.644291 seconds) were spent in user mode
                    0 microseconds (0.000000 seconds) were spent in system mode
 16 bytes of memory allocated.
 1 minor page faults, 0 major page faults, 0 swaps.
102334155
ACL2 !>(time$ (pfib 35))

(EV-REC (FARGN FORM 1) ALIST W (DECREMENT-BIG-N BIG-N) SAFE-MODE GC-OFF LATCHES HARD-ERROR-RETURNS-NILP) took 157,523 microseconds (0.157523 seconds) to run 
                    with 8 available CPU cores.
During that period, 380,023 microseconds (0.380023 seconds) were spent in user mode
                    4,001 microseconds (0.004001 seconds) were spent in system mode
4,494 microseconds (0.004494 seconds) was spent in GC.
 37,584 bytes of memory allocated.
 322 minor page faults, 0 major page faults, 0 swaps.
9227465
ACL2 !>(time$ (pfib 40))

> Error: Unable to enable process #<PROCESS Worker thread(38) [Active] #x300043A1C8ED>; have been trying for 1 seconds.
> While executing: PROCESS-ENABLE, in process Worker thread(35).


;;;
;;; #<PROCESS Worker thread(35) [Active] #x300043A1A69D> requires access to Shared Terminal Input
;;;

  C-c C-c> Break: interrupt signal
> While executing: CCL::%PROCESS-WAIT-ON-SEMAPHORE-PTR, in process listener(1).
> Type :GO to continue, :POP to abort, :R for a list of available restarts.
> If continued: Return from BREAK.
> Type :? for other options.
1 > [RAW LISP] :proc
38 :    Worker thread  [Active] 
35 :    Worker thread  [semaphore wait]  (Requesting terminal input)
14 :    Worker thread  [semaphore wait] 
1 : -> listener     [Active] 
0 :    Initial      [Active] 
1 > [RAW LISP] (:y 35)


;;;
;;; Shared Terminal Input is now owned by #<PROCESS Worker thread(35) [Active] #x300043A1A69D>
;;;

> Type :GO to continue, :POP to abort, :R for a list of available restarts.
> If continued: Keep trying.
> Type :? for other options.
1 > [RAW LISP] :b
 (2AAAAD619B18) : 0 (PROCESS-ENABLE #<PROCESS Worker thread(38) [Active] #x300043A1C8ED> [...]) 405
 (2AAAAD619B68) : 1 (%PROCESS-RUN-FUNCTION '(:NAME "Worker thread") #<COMPILED-LEXICAL-CLOSURE (:INTERNAL ACL2::RUN-THREAD) #x300043A1CD7F> NIL) 1373
 (2AAAAD619C58) : 2 (PROCESS-RUN-FUNCTION "Worker thread" #<COMPILED-LEXICAL-CLOSURE (:INTERNAL ACL2::RUN-THREAD) #x300043A1CD7F> [...]) 213
 (2AAAAD619C98) : 3 (SPAWN-WORKER-THREADS-IF-NEEDED) 757
 (2AAAAD619CC8) : 4 (PARALLELIZE-CLOSURE-LIST '(# #) [...]) 669
 (2AAAAD619D78) : 5 (PARALLELIZE-FN 'ACL2::IDENTITY-LIST '(# #) [...]) 125
 (2AAAAD619DA8) : 6 (PLET-FN '(# #) #<Compiled-function (:INTERNAL ACL2::PFIB) (Non-Global)  #x300043A0488F>) 45
 (2AAAAD619DD0) : 7 (EVAL-AND-SAVE-RESULT #S(ACL2::PARALLELISM-PIECE :THREAD-ARRAY #(#<# #(35) [#] #x300043A1A69D> NIL) :RESULT-ARRAY #(NIL #) ...)) 133
 (2AAAAD619E08) : 8 (CONSUME-WORK-ON-WORK-QUEUE-WHEN-ITS-THERE) 4677
 (2AAAAD619EB8) : 9 (RUN-PROCESS-INITIAL-FORM #<PROCESS Worker thread(35) [Active] #x300043A1A69D> '(#)) 717
 (2AAAAD619F48) : 10 (FUNCALL #'#<(:INTERNAL CCL::%PROCESS-PRESET-INTERNAL)> #<PROCESS Worker thread(35) [Active] #x300043A1A69D> '(#)) 397
 (2AAAAD619F98) : 11 (FUNCALL #'#<(:INTERNAL CCL::THREAD-MAKE-STARTUP-FUNCTION)>) 293
1 >

Change History

comment:1 Changed 7 years ago by ragerdl

This occurs on x86-64 Linux, specifically the UT 64-bit lhug machines. This particular bug does not occur on x86-64 Mac.

comment:2 Changed 7 years ago by gb

  • Description modified

comment:3 Changed 7 years ago by gb

  • Status changed from new to assigned

Looking at it.

The backtrace seems to suggest that we timed out waiting in thread 35 for a newly-created thread (38) to enter the "reset" state. However, process 38 is on the list of all processes (where :PROC can find it) and claims to be "Active". That doesn't make a whole lot of sense, if I'm interpreting the backtrace correctly.

What Linux distribution is this, and what Linux kernel version?
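
For readers following the backtrace, the sequence that frames 0 to 2 go through (create a thread, then wait for it to become ready) can be driven by hand. The sketch below assumes Clozure CL and its documented CCL:MAKE-PROCESS, CCL:PROCESS-PRESET, CCL:PROCESS-ENABLE (with its optional timeout argument) and CCL:PROCESS-WHOSTATE. The function name START-WORKER and the 10-second value are hypothetical, and this is not the code that ACL2 or CCL actually runs.

```lisp
;; Hypothetical helper, Clozure CL only: create a thread by hand and enable
;; it with an explicit timeout instead of relying on the default.
#+ccl
(defun start-worker (function &optional (timeout 10))
  (let ((process (ccl:make-process "Example worker")))
    ;; Print whatever state CCL reports for the freshly created process.
    (format t "~&Before enable: ~A~%" (ccl:process-whostate process))
    ;; Arrange for FUNCTION to be the thread's initial function.
    (ccl:process-preset process function)
    ;; TIMEOUT (in seconds) bounds how long the caller waits for the new
    ;; thread to become ready; if it is exceeded, an error like the one in
    ;; the transcript above is signalled in the calling thread.
    (ccl:process-enable process timeout)
    process))
```

Calling (start-worker (lambda () (print :hello))) from a CCL listener should start one such thread; the printed whostate shows what :PROC would report for it before it is enabled.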

comment:4 Changed 7 years ago by ragerdl

uname -a results in:

Linux lhug-7.cs.utexas.edu 2.6.22.8 #1 SMP Fri Oct 12 13:55:12 CDT 2007 x86_64 GNU/Linux

I think it's a Debian box. Is there another command I can type to give you some more information?

comment:5 Changed 7 years ago by ragerdl

The following definitions of Fibonacci can be used. I added them to the ticket, because I know organization is good.

; Serial version of Fibonacci
(defun fib (x)
  (declare (xargs :guard (natp x)))
  (cond ((mbe :logic (or (zp x) (<= x 0))
              :exec (<= x 0))
         0)
        ((= x 1) 1)
        (t (let ((a (fib (- x 1)))
                 (b (fib (- x 2))))
             (+ a b)))))

; Parallelized version of Fibonacci, using plet
(defun pfib (x)
  (declare (xargs :guard (natp x)))
  (cond ((mbe :logic (or (zp x) (<= x 0))
              :exec (<= x 0))
         0)
        ((= x 1) 1)
        (t (plet (declare (granularity (> x 33)))
                 ((a (pfib (- x 1)))
                  (b (pfib (- x 2))))
                 (+ a b)))))

(assert! (equal (fib 35) (pfib 35)))

; Parallel version of Fibonacci, using pargs
(defun pfib-with-pargs (x)
  (declare (xargs :guard (natp x)))
  (cond ((mbe :logic (or (zp x) (<= x 0))
              :exec (<= x 0))
         0)
        ((= x 1) 1)
        (t (pargs (declare (granularity (> x 33)))
                  (binary-+ (pfib-with-pargs (- x 1))
                            (pfib-with-pargs (- x 2)))))))

comment:6 Changed 7 years ago by ragerdl

You can delete:

(assert! (equal (fib 35) (pfib 35)))

It won't run without another book.

comment:7 Changed 7 years ago by gb

It's not really conclusive, but I was able to run (pfib 40) about 10 times in succession without problems (on a Core 2 Quad running Fedora 8 and 2.6.24-4-64). Does it usually fail right away for you, or does it take several attempts to provoke the bug?

comment:8 Changed 7 years ago by ragerdl

  • Cc kaufmann@… added

I'm fairly convinced that someone will have to log in to one of the 64-bit lhug machines to duplicate this problem (like lhug6.cs.utexas.edu). It's my understanding that GB has a login ID for these computers, and I can help track one down if GB doesn't.

comment:9 Changed 7 years ago by gb

I have an account on the lhug machines.

If I can reproduce the problem there but can't reproduce the problem on any other x86-64 Linux platform, what does that tell us?

Before concluding that I can't reproduce the bug on other machines, it'd be helpful to know if this fails pretty quickly/reliably for you on the lhug machines or if it takes a while to do so.

comment:10 Changed 7 years ago by ragerdl

Pasting this here so that it's with the ticket:

Hi, Gary --

David might be out-of-pocket, so I just tried his experiment on lhug-4.cs.utexas.edu. It failed quickly (a few seconds), and the first time. (I got an "uncertified" warning from ACL2 but I believe that's irrelevant.)

-- Matt

comment:11 Changed 6 years ago by gb

  • Status changed from assigned to closed
  • Resolution set to fixed

Some later discussion on openmcl-devel indicated that this was just a case of the 1-second timeout being too short (and some additional confusion about PROCESS-WHOSTATE reporting a "Reset" thread as being "Active".)

There's still a timeout involved, but it's now much, much larger; this should be in effect in both the 1.2 branch and the trunk.
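
To illustrate why a short fixed timeout can misfire, here is a minimal sketch in portable Common Lisp of a wait loop with a deadline. It is not CCL's implementation of PROCESS-ENABLE; WAIT-UNTIL, its POLL-INTERVAL keyword, and the simulated 0.2-second startup delay are all hypothetical names and values chosen for the example.

```lisp
;; Hypothetical helper: poll PREDICATE until it returns true (result T) or
;; until TIMEOUT seconds have elapsed (result NIL).
(defun wait-until (predicate timeout &key (poll-interval 0.01))
  (let ((deadline (+ (get-internal-real-time)
                     (* timeout internal-time-units-per-second))))
    (loop
      (when (funcall predicate) (return t))
      (when (>= (get-internal-real-time) deadline) (return nil))
      (sleep poll-interval))))

;; Simulate a thread that is slow to start: the predicate only becomes true
;; about 0.2 seconds after START was recorded.
(let* ((start (get-internal-real-time))
       (ready-p (lambda ()
                  (>= (- (get-internal-real-time) start)
                      (* 0.2 internal-time-units-per-second)))))
  ;; A timeout shorter than the startup delay gives up too early ...
  (format t "~&short timeout: ~A~%" (wait-until ready-p 0.05))
  ;; ... while a generous one simply waits until the condition holds.
  (format t "~&long timeout:  ~A~%" (wait-until ready-p 5)))
```

The same slow start is reported as a failure under the short timeout and as a success under the long one, which is the situation described in this comment: nothing was deadlocked, the wait was just cut off too soon.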

comment:12 Changed 6 years ago by rme

  • Milestone 1.2 deleted
