Message boards : Number crunching : Workunits getting stuck and aborting
MattDavis Send message Joined: 22 Sep 05 Posts: 206 Credit: 1,377,748 RAC: 0 |
I've had several of these in the past day, and I see other people making isolated comments about having the same issue, so maybe if we put them all here in one place the scientists can track down the error. These are mine thus far: https://boinc.bakerlab.org/rosetta/result.php?resultid=61646685 https://boinc.bakerlab.org/rosetta/result.php?resultid=61635395 https://boinc.bakerlab.org/rosetta/result.php?resultid=61598016 https://boinc.bakerlab.org/rosetta/result.php?resultid=61597212 https://boinc.bakerlab.org/rosetta/result.php?resultid=61589791 |
Feet1st Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
Looks like these are all DOC work units, and all were ended by the watchdog anywhere from 0.5 to 3 hours after starting. Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
netwraith Send message Joined: 3 Sep 06 Posts: 80 Credit: 13,483,227 RAC: 0 |
[quote]I've had several of these in the past day, and I see other people making isolated comments about having the same issue, so maybe if we put them all here in one place the scientists can track down the error.[/quote] I have a whole bunch of these (all of my crunchers are getting them)... If the IDs are needed, I will list them... let me know... I have aborted these WUs en masse on my single-thread machines. On the big machines, I will sort them out as they have problems. Looking for a team ??? Join BoincSynergy!! |
Thomas Leibold Send message Joined: 30 Jul 06 Posts: 55 Credit: 19,627,164 RAC: 0 |
I'm getting plenty of those DOC_* workunits as well. At some point between a few minutes and some hours into processing the workunit, the boinc manager shows the cpu time no longer progressing and the task at 100% completed, but the status remains "Running". A 'ps' shows the rosetta tasks still existing, but no longer consuming any cpu time. I keep aborting them, since they do not appear to time out. These systems run the 5.4.9 and 5.4.11 Linux boinc clients. Team Helix |
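A quick way to confirm this symptom from a shell, as a rough sketch only (the process name match and the one-minute interval are my assumptions, not anything the project prescribes): sample the CPU time of the running rosetta tasks twice and see whether it advances.

    # Sample the rosetta tasks' accumulated CPU time twice, one minute apart.
    ps -eo pid,etime,cputime,stat,comm | grep -i rosetta
    sleep 60
    ps -eo pid,etime,cputime,stat,comm | grep -i rosetta
    # A task whose TIME value is identical in both samples while its elapsed
    # time keeps growing matches the "running but not using cpu" state
    # described above.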
netwraith Send message Joined: 3 Sep 06 Posts: 80 Credit: 13,483,227 RAC: 0 |
[quote]I've had several of these in the past day, and I see other people making isolated comments about having the same issue, so maybe if we put them all here in one place the scientists can track down the error.[/quote] I think I have the problem on this one... I tailed a running stdout.txt, and the 3600 second watchdog timeout is too short for these. The task is still running properly when it gets interrupted by the watchdog. Maybe a 7200 second timeout should be set for these. **Edit** Yep, a 7200 timeout is getting it through so far on a 1.5 GHz P4; the cycle time is 4640 seconds, which is why the watchdog was killing it. It's not done yet (I run them each for 12 hours on that platform), but I suspect that it will emerge correctly. I also edited an unstarted WU with the extended timeout; we will see how it works. Looking for a team ??? Join BoincSynergy!! |
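For anyone who wants to watch a running task the same way, a minimal sketch (the BOINC data-directory path and the slot number are assumptions; adjust them for your install): each running task writes its stdout.txt inside a numbered slot directory.

    cd ~/BOINC/slots        # BOINC data directory; location varies by install
    ls                      # one numbered directory per running task
    tail -f 2/stdout.txt    # "2" is just an example slot number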
Chu Send message Joined: 23 Feb 06 Posts: 120 Credit: 112,439 RAC: 0 |
Thanks for reporting. We will look into that, and meanwhile those WUs have been temporarily removed from the queue. I am wondering if it is because of the high memory requirement set for those "DOC" WUs. The score is ZERO when the run was stuck, indicating it was not stuck in the middle of the run. From what you guys have described here, it looks like the worker has finished, but for some reason it did not signal the other thread properly, so the watchdog thinks it was stuck. To further help us track down the problem, could you please report what platform your host is running? It is definitely happening on Linux (from Thomas here and Conan at Ralph); what about the rest of you? [quote]I've had several of these in the past day, and I see other people making isolated comments about having the same issue, so maybe if we put them all here in one place the scientists can track down the error. These are mine thus far: https://boinc.bakerlab.org/rosetta/result.php?resultid=61646685 https://boinc.bakerlab.org/rosetta/result.php?resultid=61635395 https://boinc.bakerlab.org/rosetta/result.php?resultid=61598016 https://boinc.bakerlab.org/rosetta/result.php?resultid=61597212 https://boinc.bakerlab.org/rosetta/result.php?resultid=61589791[/quote] |
netwraith Send message Joined: 3 Sep 06 Posts: 80 Credit: 13,483,227 RAC: 0 |
[quote]Thanks for reporting. We will look into that, and meanwhile those WUs have been temporarily removed from the queue. I am wondering if it is because of the high memory requirement set for those "DOC" WUs. The score is ZERO when the run was stuck, indicating it was not stuck in the middle of the run. From what you guys have described here, it looks like the worker has finished, but for some reason it did not signal the other thread properly, so the watchdog thinks it was stuck.[/quote] Mine are all Linux systems, and editing client_state.xml and adding -watchdog_time 7200 to any of these workunits gets them through (at least all of mine to this point). Looking for a team ??? Join BoincSynergy!! |
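As a rough sketch of that edit, assuming the flag name and the 7200-second value exactly as described above, and assuming the task's arguments live in the command-line entry for each workunit in client_state.xml (stop the BOINC client and back the file up first; note that a later post in this thread reports the longer timeout only delays the hang, so treat this as a diagnostic aid rather than a fix):

    cd ~/BOINC                                   # BOINC data directory; path varies
    cp client_state.xml client_state.xml.bak     # keep a backup
    # Run once: turns "... -watchdog -constant_seed ..." into
    #           "... -watchdog -watchdog_time 7200 -constant_seed ..."
    # for every task that passes the -watchdog switch.
    sed -i 's/ -watchdog / -watchdog -watchdog_time 7200 /' client_state.xml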
anders n Send message Joined: 19 Sep 05 Posts: 403 Credit: 537,991 RAC: 0 |
[quote]Thanks for reporting. We will look into that, and meanwhile those WUs have been temporarily removed from the queue. I am wondering if it is because of the high memory requirement set for those "DOC" WUs. The score is ZERO when the run was stuck, indicating it was not stuck in the middle of the run. From what you guys have described here, it looks like the worker has finished, but for some reason it did not signal the other thread properly, so the watchdog thinks it was stuck.[/quote] If you look at https://boinc.bakerlab.org/rosetta/result.php?resultid=61635399 and https://boinc.bakerlab.org/rosetta/result.php?resultid=61635398 one is a Mac and one is Windows, so it seems to affect all systems. Anders n |
BitSpit Send message Joined: 5 Nov 05 Posts: 33 Credit: 4,147,344 RAC: 0 |
A DOC that hung last night that I had to abort. Stuck at 100%. https://boinc.bakerlab.org/rosetta/result.php?resultid=61387581

# random seed: 2214937
# cpu_run_time_pref: 14400
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score 14.2268 for 3600 seconds
**********************************************************************
GZIP SILENT FILE: ./dd1MLC.out
SIGSEGV: segmentation violation
Stack trace (26 frames): [0x8ae59f7] [0x8b018bc] [0xffffe420] [0x8b83c29] [0x8b528d7] [0x8b54cc1] [0x80b7cf8] [0x89671a9] [0x896f069] [0x86480f2] [0x8649421] [0x89718c2] [0x8975e94] [0x897940f] [0x89a8549] [0x89a9f75] [0x804d236] [0x876b46f] [0x876effa] [0x87702da] [0x8302b7d] [0x84e3d1b] [0x85faa8b] [0x85fab34] [0x8b60dd4] [0x8048111]
Exiting...
FILE_LOCK::unlock(): close failed.: Bad file descriptor |
netwraith Send message Joined: 3 Sep 06 Posts: 80 Credit: 13,483,227 RAC: 0 |
[quote]Thanks for reporting. We will look into that, and meanwhile those WUs have been temporarily removed from the queue. I am wondering if it is because of the high memory requirement set for those "DOC" WUs. The score is ZERO when the run was stuck, indicating it was not stuck in the middle of the run. From what you guys have described here, it looks like the worker has finished, but for some reason it did not signal the other thread properly, so the watchdog thinks it was stuck.[/quote] UPDATE... Nope, I was on the wrong track. They last longer, but still hang later in the process. Looking for a team ??? Join BoincSynergy!! |
Chu Send message Joined: 23 Feb 06 Posts: 120 Credit: 112,439 RAC: 0 |
This morning I also checked our local Windows and Mac platforms. Consistent with what has been reported here, I also saw several "Watchdog ending stuck runs" for "DOC" WUs. However, those stuck WUs were terminated by the watchdog thread properly (returned as success) and none of them hung in the boinc manager (i.e., none had to be aborted manually). So my speculation is:

1. The "DOC" WUs have some problem that makes their trajectories get stuck more frequently than the Rosetta average. We will look into this issue and come up with a fix.

2. When a stuck WU is terminated by the watchdog thread, there is some problem completely removing it from the task list on the Linux platform (but apparently not on Windows and Mac?) and it needs to be aborted by users. This speculation still needs more user feedback to be confirmed.

Please post any relevant observations on your side. Thank you for your help. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
The watchdog is looking out for problems, so they can be terminated. If a work unit runs for an hour and remains stuck on the same Rosetta score, the watchdog will end the task (at least when it's working properly; there seem to be some quirks with that at this point). So, regardless of your runtime preference, a score not moving for an hour is one of the things the watchdog looks out for. The other times should be based on your runtime preference, found in your Rosetta Preferences settings. 3hrs is the default runtime preference, if none is selected. If that is what you are seeing in stderr, that is how the work is running. Why that differs from your 8hr preference may be that you've just recently changed the preference, or that the preference you are comparing to is for a different location. Rosetta Moderator: Mod.Sense |
Chu Send message Joined: 23 Feb 06 Posts: 120 Credit: 112,439 RAC: 0 |
The "watchdog" error for recent "DOC" workunits has been tracked down to be a bug in Rosetta code which was introduced in the past month. The worker thread worked properly, but it left some gaps during the simulation in which "score" is not updated ( to make it even worse, sometimes it is reset to ZERO ). The way how the "watchdog" thread works is that it periodically checks the "score" and compare it against the previously recorded value. If same, it thinks the current trajectory is stuck and it should terminate the whole process. For "DOC" workunits, the gaps can be relatively long and the chance of this happening therefore turns out to be high. We have fixed this problem and will test it in the next update on Ralph (very soon). As mentioned in my previous post, there seem to be two isolated problems. The first one is why those "DOC" WUs get stuck and we have found the problem. The second one is why the watchdog thread did not terminate the process properly. This problem seems to be specific to linux platforms. As we queried our database on the problematic batch of DOC workunits, the "watchdog ending runs" message was seen across all platforms, but I have not so far seen one case for windows and mac that results were not returned as success. On the other hand, when this happened on linux platform, I saw mostly "aborted by users" outcomes which indicate that even if the watchdog thread found the run stuck, it could not terminate the process properly so that the WU is still hanging in system until mannualy killed by users. I am not sure this is also true for the watchdog termination of non-DOC workunits and we will continue to look into that. Again, the rate of "false watchdog termination" should go away with the new fix, but there might be other problems which can cause a real stuck trajectory. If that happens, please report back to us here. Thank you very much for the help! |
Thomas Leibold Send message Joined: 30 Jul 06 Posts: 55 Credit: 19,627,164 RAC: 0 |
I still have that one system sitting with the stuck workunit. Is there any way to provide you with additional information that would help to track down the reason why the watchdog-terminated process isn't really dead? (strace/gdb) I haven't tried it on this particular instance, but a hung DOC_* workunit remains hung after suspending and later resuming it. I'm pretty sure that also applied when terminating Boinc altogether and starting it back up. The only time I have seen this symptom of 'running' rosetta processes that don't consume any cpu cycles and make no progress, outside of these DOC_* workunits, was when I ran boinc without the preference setting to keep tasks in memory. So yes, there do appear to be other situations in which workunits can get stuck in this way. Team Helix |
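One way such extra information could be gathered, offered only as a sketch (the PID is hypothetical, and it assumes gdb and strace are installed and the user running them may ptrace the BOINC processes): attach to the already-hung task instead of restarting boinc under strace.

    HUNG_PID=23800    # hypothetical; substitute the stuck rosetta PID from ps
    # Back-traces of every thread in the hung process:
    gdb -p "$HUNG_PID" -batch -ex "thread apply all bt" > rosetta_hung_gdb.txt
    # Separately (only one tracer can attach at a time), watch which system
    # calls it is, or is not, making; stop with Ctrl-C:
    strace -tt -p "$HUNG_PID" -o rosetta_hung.strace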
Thomas Leibold Send message Joined: 30 Jul 06 Posts: 55 Credit: 19,627,164 RAC: 0 |
This is from another system, but also linux. After the same Rosetta workunit hung a second time, I restarted boinc with strace:

strace -ff -tt -o boinc_rosetta ./boinc

user 23795  6196  0 21:25 pts/2 00:00:01 strace -ff -tt -o /xen2/boinc_rosetta ./boinc
user 23796 23795  0 21:25 pts/2 00:00:00 ./boinc
user 23797 23796 97 21:25 pts/2 00:03:21 rosetta_5.45_i686-pc-linux-gnu dd 1IAI 1 -s 1IAI.rppk -dock -pose -dock_mcm -randomize1 -randomize2 -unbound_start -ex1 -ex2aro_only -dock_rtmin -unboundrot -find_disulf -norepack_disulf -dock_score_norepack -no_filters -output_all -accept_all -nstruct 10 -silent -output_silent_gz -pose_silent_out -cpu_run_time 10800 -watchdog -constant_seed -jran 2083828
user 23798 23797  0 21:25 pts/2 00:00:00 rosetta_5.45_i686-pc-linux-gnu [same command line as 23797]
user 23799 23798  0 21:25 pts/2 00:00:00 rosetta_5.45_i686-pc-linux-gnu [same command line as 23797]
user 23800 23798  0 21:25 pts/2 00:00:00 rosetta_5.45_i686-pc-linux-gnu [same command line as 23797]

PID 23795 is strace
PID 23796 is the boinc client started by strace
PID 23797 is the rosetta client started by boinc (this does all the computation)
PID 23798 is another rosetta task (2nd) started by the first one
PID 23799 is another rosetta task (3rd) started by the second one
PID 23800 is another rosetta task (4th) started by the second one (watchdog?)
21:25:52.938929 PID 23796, boinc is being executed (started by strace)
21:25:53.136532 PID 23796 forks (clone system call) and creates PID 23797
21:25:53.224175 PID 23797 rosetta is being executed (started by boinc PID 23796)
21:25:53.719005 PID 23797 creates file "boinc_lockfile"
21:25:53.724098 PID 23797 forks (clone system call) and creates PID 23798
21:25:53.725109 PID 23797 waits for signal (sigsuspend)
21:25:53.726537 PID 23798 forks (clone system call) and creates PID 23799
21:25:53.726825 PID 23798 sends signal SIGRTMIN to PID 23797 (kill)
21:25:53.726951 PID 23797 receives signal SIGRTMIN and continues
21:25:53.731726 PID 23799 starts (but never does anything interesting)
(PID 23797 writes lots of stuff to stdout.txt)
21:25:55.768784 PID 23797 waits for signal (sigsuspend)
21:25:55.769412 PID 23798 forks (clone system call) and creates PID 23800
21:25:55.769752 PID 23798 sends signal SIGRTMIN to PID 23797 (kill)
21:25:55.769875 PID 23797 receives signal SIGRTMIN and continues
21:25:55.772258 PID 23800 starts
22:26:42.220181 PID 23800 checks file "init_data.xml"
22:26:42.225455 PID 23800 writes "Rosetta score is stuck" to stdout.txt
22:26:42.226143 PID 23800 writes "Rosetta score is stuck" to stderr.txt
22:26:42.231475 PID 23800 writes "watchdog_failure: Stuck at score" to dd1IAI.out
22:26:45.470033 PID 23800 creates file "boinc_finish_called"
22:26:45.472508 PID 23800 removes file "boinc_lockfile"
22:26:45.490173 PID 23800 sends signal SIGRTMIN to PID 23797 (kill)
22:26:45.490560 PID 23797 receives signal SIGRTMIN
22:26:45.490459 PID 23800 waits for signal (sigsuspend)
22:26:45.491002 PID 23797 sends signal SIGRTMIN to PID 23800 (kill)
22:26:45.491108 PID 23800 receives signal SIGRTMIN and continues
22:26:45.502802 PID 23800 Segmentation Violation occurs!

The SIGSEGV happens just after several munmap (memory unmap) calls, so possibly there was a reference to unmapped memory?

22:26:45.503104 PID 23800 writes "SIGSEGV: segmentation violation" to stderr.txt
22:26:45.507844 PID 23797 waits for signal (sigsuspend)
22:26:45.509013 PID 23800 writes stack trace to stderr.txt
22:26:45.511959 PID 23800 writes "Exiting..." to stderr.txt (but it's a lie!)
22:26:45.512252 PID 23800 waits for signal (sigsuspend) that doesn't come!
22:26:45.821360 PID 23797 receives SIGALRM (timer expired?)
22:26:45.822021 PID 23797 waits for signal (sigsuspend)

The last two lines keep repeating, with PID 23797 waiting for a signal (perhaps another SIGRTMIN from PID 23800?) and getting SIGALRM (timeout) instead. The normal watchdog termination procedure seems to have been thrown off track by the watchdog itself crashing in the process. Left out of the sequence above is some communication between 23797 and 23798 through a pipe. I'm assuming 23797 and 23800 are communicating with shared memory (besides signalling with SIGRTMIN), but that would not be visible in the strace output. Full strace logs available to anybody who is interested. Team Helix |
MattDavis Send message Joined: 22 Sep 05 Posts: 206 Credit: 1,377,748 RAC: 0 |
Great, now this thread is impossible to read. |
Chu Send message Joined: 23 Feb 06 Posts: 120 Credit: 112,439 RAC: 0 |
Thomas, thanks for helping debug this problem and posting such detailed log output. I have never used strace before and do not have much knowledge of how processes work and communicate in Linux. I will share your findings and thoughts with the other project developers tomorrow to see what we can learn from them. I have run some problematic DOC workunits on our Linux computers in standalone mode (without the boinc manager) and it seemed that all the watchdog terminations exited properly. In particular, I do not remember seeing any segmentation violations (I will double-check this tomorrow). So I guess this will also help us narrow down whether the problem is within Rosetta or between Rosetta and the boinc manager. [quote]This is from another system, but also linux.[/quote] |
Chu Send message Joined: 23 Feb 06 Posts: 120 Credit: 112,439 RAC: 0 |
if you mean the widdddddth of the board, I am wondering that too... [quote]Great, now this thread is impossible to read.[/quote] |
Thomas Leibold Send message Joined: 30 Jul 06 Posts: 55 Credit: 19,627,164 RAC: 0 |
[quote]if you mean the widdddddth of the board, I am wondering that too...[/quote] I'm sorry about blowing out the margins. I used the PRE tag to preserve the formatting of the output in my earlier post, and the rosetta command line is very, very long :( Unfortunately there isn't any preview option, so I didn't know what would happen until it was too late. It won't let me edit that post either; perhaps a moderator can remove the PRE /PRE tags? Team Helix |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
A moderator can only delete the post (both posts, now that it has been quoted in a reply). You can only "preview" by posting, and then you have up to an hour to make edits. I went after the wrong post first :) But it looks like I got normal margins back. Hope I have improved the thread more than disrupted it. Rosetta Moderator: Mod.Sense |