Message boards : Number crunching : BOINC Dying and orphaning Rosetta - Possible cause
Author | Message |
---|---|
David Ball Joined: 25 Nov 05 Posts: 25 Credit: 1,439,333 RAC: 0 |
I keep finding that BOINC on my Linux RHEL3 machine has died and orphaned Rosetta. I have to kill Rosetta manually and restart BOINC. BOINC runs as a service. The machine has libsafe on it. While going through the logs looking for an error with Docking@Home, I might have found the reason BOINC is dying. It looks like sometimes the Rosetta command line might be too long.

This is from stdoutdae.txt. Lines ending in $ were cut short by nano.

2006-11-22 03:41:02 [Docking@Home] Unrecoverable error for result 1tng_mod0001_9218_83020_5 (process exited with code 1 (0x1$
2006-11-22 03:41:02 [Docking@Home] Deferring scheduler requests for 1 minutes and 0 seconds
2006-11-22 03:41:02 [---] Rescheduling CPU: application exited
2006-11-22 03:41:02 [Docking@Home] Computation for task 1tng_mod0001_9218_83020_5 finished
2006-11-22 03:41:02 [---] Resuming round-robin CPU scheduling.
2006-11-22 03:41:02 [rosetta@home] Resuming task DOC_1MLC_R061114_pose_u_global_search_1402_736_0 using rosetta version 540
2006-11-22 04:14:59 [---] Resuming network activity
2006-11-22 04:14:59 [---] Allowing work fetch again.
.........Skipped some attempted work fetches and upload of the failed docking workunit.
2006-11-22 04:15:09 [rosetta@home] Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
2006-11-22 04:15:09 [rosetta@home] Reason: To fetch work
2006-11-22 04:15:09 [rosetta@home] Requesting 21600 seconds of new work, and reporting 1 completed tasks
2006-11-22 04:15:12 [Docking@Home] Finished upload of file 1tng_mod0001_9218_83020_5_2
2006-11-22 04:15:12 [Docking@Home] Throughput 51542 bytes/sec
2006-11-22 04:15:12 [Docking@Home] Finished upload of file 1tng_mod0001_9218_83020_5_3
2006-11-22 04:15:12 [Docking@Home] Throughput 598603 bytes/sec
2006-11-22 04:15:14 [rosetta@home] Scheduler request succeeded
2006-11-22 04:15:15 [rosetta@home] Started download of file hom018_s014_.fasta.gz
2006-11-22 04:15:15 [rosetta@home] Started download of file hom018_s014_.psipred_ss2.gz

At this point I discovered the BOINC service was dead, and had to kill Rosetta manually, and restart the BOINC service. The NEXT line in stdoutdae.txt is:

2006-11-22 04:17:38 [---] Starting BOINC client version 5.4.9 for i686-pc-linux-gnu
2006-11-22 04:17:38 [---] libcurl/7.15.3 OpenSSL/0.9.8a zlib/1.2.3
2006-11-22 04:17:38 [---] Executing as a daemon
2006-11-22 04:17:38 [---] Data directory: /home/BOINC
2006-11-22 04:17:38 [rosetta@home] State file error: result DOC_1MLC_R061114_pose_u_global_search_1402_736_0 is in wrong sta$
2006-11-22 04:17:38 [---] Processor: 1 GenuineIntel Intel(R) Celeron(R) CPU 2.40GHz
2006-11-22 04:17:38 [---] Memory: 1.95 GB physical, 1.95 GB virtual
2006-11-22 04:17:38 [---] Disk: 16.02 GB total, 11.62 GB free
2006-11-22 04:17:38 [Docking@Home] URL: http://docking.utep.edu/; Computer ID: 223; location: work; project prefs: default
2006-11-22 04:17:38 [SETI@home] URL: http://setiathome.berkeley.edu/; Computer ID: 2185126; location: work; project prefs: d$
2006-11-22 04:17:38 [rosetta@home] URL: https://boinc.bakerlab.org/rosetta/; Computer ID: 211470; location: work; project pre$
2006-11-22 04:17:38 [lhcathome] URL: http://lhcathome.cern.ch/lhcathome/; Computer ID: 2363079; location: work; project pref$
2006-11-22 04:17:38 [---] General prefs: from Docking@Home (last modified 2006-11-22 03:03:43)
2006-11-22 04:17:38 [---] General prefs: using separate prefs for work
2006-11-22 04:17:38 [---] Local control only allowed
2006-11-22 04:17:38 [---] Listening on port 31416
2006-11-22 04:17:38 [SETI@home] Deferring task 10jn03ab.7548.30496.284650.3.57_1
2006-11-22 04:17:38 [SETI@home] Restarting task 10jn03ab.7548.30496.284650.3.57_1 using setiathome_enhanced version 512
2006-11-22 04:17:39 [rosetta@home] Started download of file hom018_s014_.fasta.gz
2006-11-22 04:17:39 [rosetta@home] Started download of file hom018_s014_.psipred_ss2.gz
2006-11-22 04:17:42 [rosetta@home] Finished download of file hom018_s014_.fasta.gz
2006-11-22 04:17:42 [rosetta@home] Throughput 1149 bytes/sec
2006-11-22 04:17:42 [rosetta@home] Finished download of file hom018_s014_.psipred_ss2.gz
2006-11-22 04:17:42 [rosetta@home] Throughput 7188 bytes/sec
2006-11-22 04:17:42 [rosetta@home] Started download of file boinc_hom018_aas014_03_05.200_v1_3.gz
2006-11-22 04:17:42 [rosetta@home] Started download of file boinc_hom018_aas014_09_05.200_v1_3.gz
2006-11-22 04:17:44 [rosetta@home] Finished download of file boinc_hom018_aas014_09_05.200_v1_3.gz
2006-11-22 04:17:44 [rosetta@home] Throughput 360810 bytes/sec
2006-11-22 04:17:44 [rosetta@home] Started download of file sg_target_description.txt
2006-11-22 04:17:45 [rosetta@home] Finished download of file boinc_hom018_aas014_03_05.200_v1_3.gz
2006-11-22 04:17:45 [rosetta@home] Throughput 687255 bytes/sec
2006-11-22 04:17:45 [rosetta@home] Finished download of file sg_target_description.txt
2006-11-22 04:17:45 [rosetta@home] Throughput 943 bytes/sec
2006-11-22 04:17:46 [---] Rescheduling CPU: files downloaded
2006-11-22 04:17:46 [---] Using earliest-deadline-first scheduling because computer is overcommitted.
2006-11-22 04:17:46 [SETI@home] Pausing task 10jn03ab.7548.30496.284650.3.57_1 (left in memory)
2006-11-22 04:17:46 [rosetta@home] Starting task s014__BOINC_ABRELAX_SAVE_ALL_OUT_hom018__1406_4371_0 using rosetta version $
2006-11-22 04:17:49 [---] Suspending work fetch because computer is overcommitted.
2006-11-22 08:17:51 [---] Allowing work fetch again.

Now, from the stderrdae.txt file

2006-11-22 03:41:02 [Docking@Home] Unrecoverable error for result 1tng_mod0001_9218_83020_5 (process exited with code 1 (0x1$
2006-11-22 04:15:04 [Docking@Home] Message from server: No work sent
2006-11-22 04:15:04 [Docking@Home] Message from server: (reached daily quota of 1 results)
2006-11-22 04:15:04 [Docking@Home] No work from project
SIGSEGV: segmentation violation
Stack trace (16 frames):
/home/BOINC/boinc[0x8089dc2]
/lib/libpthread.so.0[0x40174619]
/lib/libc.so.6[0x400482b8]
/lib/libc.so.6(vsprintf+0x5b)[0x4007da5b]
/home/BOINC/boinc[0x808bc52]
/home/BOINC/boinc[0x808c01b]
/home/BOINC/boinc[0x80515c7]
/home/BOINC/boinc[0x8051d2a]
/home/BOINC/boinc[0x80718a9]
/home/BOINC/boinc[0x80715eb]
/home/BOINC/boinc[0x8071a99]
/home/BOINC/boinc[0x8059c15]
/home/BOINC/boinc[0x807d189]
/home/BOINC/boinc[0x807d2b7]
/lib/libc.so.6(__libc_start_main+0x8d)[0x40036bd1]
/home/BOINC/boinc(__fxstat64+0x99)[0x804c1e1]
Exiting...
2006-11-22 04:17:38 [rosetta@home] State file error: result DOC_1MLC_R061114_pose_u_global_search_1402_736_0 is in wrong sta$

I'm not sure if this is related to Docking or Rosetta, but I have noticed that anytime you stop the BOINC service on this machine, Rosetta keeps running, but in sleeping mode so it doesn't use any CPU. You have to kill Rosetta from top with a SIGTERM. This had happened prior to the log above when I stopped boinc to change something for another try at getting docking to work on this machine, IIRC.

BTW, a couple of times I have noticed it wasn't reporting results and found that rosetta had been sleeping for 2+ days and boinc was nowhere to be found.
This is the standard boinc 5.4.9 client on a text only machine (both console and ssh are text only), running as a service. They really need to release a command line only Linux boinc client version again. I'm having to use the boinc_cmd from boinc 5.2.13 to control it.

That error might have to do with the Rosetta command line being too long. I just ran a "ps axu" and here are the boinc processes as of now.

boinc 28900 0.0 0.0 4724 2020 ? S Nov25 0:00 /home/BOINC/boinc -redirectio daemon
boinc 31274 0.0 1.3 39288 26796 ? SN Nov26 0:00 setiathome-5.12.i686-pc-linux-gnu
boinc 31275 0.0 1.3 39288 26796 ? RN Nov26 0:00 setiathome-5.12.i686-pc-linux-gnu
boinc 31276 11.4 1.3 39288 26796 ? SN Nov26 111:25 setiathome-5.12.i686-pc-linux-gnu
boinc 31277 0.0 1.3 39288 26796 ? SN Nov26 0:00 setiathome-5.12.i686-pc-linux-gnu
boinc 9907 88.7 3.6 111684 73940 ? RN 01:45 385:40 rosetta_5.40_i686-pc-linux-gnu dd 1TAB 1 -s 1TAB.uppk -dock -pose -dock_mcm -randomize1 -randomize2 -unbound_start -ex1 -ex2aro_only -dock_rtmin -unboundrot -find_disulf -norepack_disulf -dock_score_norepack -no_filters -output_all -accept_all -nstruct 10 -
boinc 9908 0.0 3.6 111684 73940 ? RN 01:45 0:00 rosetta_5.40_i686-pc-linux-gnu dd 1TAB 1 -s 1TAB.uppk -dock -pose -dock_mcm -randomize1 -randomize2 -unbound_start -ex1 -ex2aro_only -dock_rtmin -unboundrot -find_disulf -norepack_disulf -dock_score_norepack -no_filters -output_all -accept_all -nstruct 10 -
boinc 9909 0.0 3.6 111684 73940 ? SN 01:45 0:00 rosetta_5.40_i686-pc-linux-gnu dd 1TAB 1 -s 1TAB.uppk -dock -pose -dock_mcm -randomize1 -randomize2 -unbound_start -ex1 -ex2aro_only -dock_rtmin -unboundrot -find_disulf -norepack_disulf -dock_score_norepack -no_filters -output_all -accept_all -nstruct 10 -
boinc 9921 0.0 3.6 111684 73940 ? SN 01:45 0:00 rosetta_5.40_i686-pc-linux-gnu dd 1TAB 1 -s 1TAB.uppk -dock -pose -dock_mcm -randomize1 -randomize2 -unbound_start -ex1 -ex2aro_only -dock_rtmin -unboundrot -find_disulf -norepack_disulf -dock_score_norepack -no_filters -output_all -accept_all -nstruct 10 -

Sorry for any weird formatting. I piped the output of "ps axu" into "nano -v" and did a cut-paste from the screen in nano. It looks like "ps axu" clipped the command lines for rosetta. Again, I don't know if it was docking or rosetta that killed boinc.

Just went into /proc/9907 and got the command line from there. The spaces between options didn't show so I'm guessing at that part.

[/proc/9907]# cat cmdline
rosetta_5.40_i686-pc-linux-gnu dd 1TAB 1 -s 1TAB.uppk -dock -pose -dock_mcm -randomize1 -randomize2 -unbound_start -ex1 -ex2aro_only -dock_rtmin -unboundrot -find_disulf -norepack_disulf -dock_score_norepack -no_filters -output_all -accept_all -nstruct 10 -silent -output_silent_gz -pose_silent_out -cpu_run_time 10800 -watchdog -constant_seed -jran 2638533

I could be totally wrong but could the problem with BOINC controlling Rosetta be because the command line is too long and it's killing the BOINC client? BTW, I'm running the stock boinc client and applications.

-- David

EDIT: The only thing I was changing for Docking@Home was to increase the allowed stack limit to unlimited. I keep finding more places where config files drop it back to the default unless you're root. Even if I have Docking suspended, I sometimes find the boinc client dead with Rosetta running in sleep mode. Since this error appears to be in a vsprintf call (in libc, going by the stack trace), I thought the BOINC client might be erroring out when it tried to control Rosetta.

Have you read a good Science Fiction book lately? |
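A note on the missing spaces in /proc/9907/cmdline: on Linux that file stores the arguments separated by NUL bytes rather than spaces, which is why cat shows them run together. The following is a minimal Python sketch, assuming a Linux /proc filesystem, of how the exact argument list could be recovered without guessing. The function names `split_cmdline` and `read_cmdline` are made up for this illustration and are not part of BOINC or Rosetta.

```python
from pathlib import Path
from typing import List


def split_cmdline(raw: bytes) -> List[str]:
    """Split the raw bytes of /proc/<pid>/cmdline into its arguments."""
    # Each argument is terminated by a NUL byte, so splitting on NUL
    # leaves one empty trailing element that we drop.
    parts = raw.split(b"\0")
    if parts and parts[-1] == b"":
        parts.pop()
    return [part.decode("utf-8", errors="replace") for part in parts]


def read_cmdline(pid: int) -> List[str]:
    """Read and split the command line of a running process (Linux only)."""
    return split_cmdline(Path(f"/proc/{pid}/cmdline").read_bytes())


# Hypothetical usage, with the PID taken from the ps output in the post:
#     for arg in read_cmdline(9907):
#         print(arg)
```

Printing one argument per line also avoids the clipping seen in the "ps axu" output. With procps, "ps axww" should likewise print the full command line.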
Marky-UK Joined: 1 Nov 05 Posts: 73 Credit: 1,689,495 RAC: 0 |
This looks very similar to the problems listed in this thread - the BOINC client crashes just after downloading files from Rosetta. I'd been wondering if the problem could be down to the very long names that Rosetta uses, but of course it hasn't crashed since I started looking harder. |
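To check whether the client really tends to die right after a Rosetta download starts, one could scan stdoutdae.txt for every "Starting BOINC client version" line and look at the entry just before it. The sketch below is a hedged Python illustration that assumes the log format shown in the first post (timestamp, [project], message). The function name `entries_before_restart` is invented here. It cannot tell a crash from a clean stop of the service by itself, so the results still need reading by eye.

```python
import re
from typing import List, Tuple

# Matches lines such as:
# 2006-11-22 04:17:38 [---] Starting BOINC client version 5.4.9 for ...
LOG_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(.*?)\] (.*)$"
)


def entries_before_restart(lines: List[str]) -> List[Tuple[str, str]]:
    """Return (last entry before a client start, the start entry) pairs."""
    pairs = []
    previous = None
    for line in lines:
        line = line.rstrip("\n")
        match = LOG_LINE.match(line)
        if not match:
            # Skip anything that is not a timestamped log entry.
            continue
        message = match.group(3)
        if message.startswith("Starting BOINC client version") and previous is not None:
            pairs.append((previous, line))
        previous = line
    return pairs


# Hypothetical usage (path taken from the "Data directory" log line):
#     with open("/home/BOINC/stdoutdae.txt") as handle:
#         for before, start in entries_before_restart(handle.readlines()):
#             print(before)
#             print(start)
#             print()
```

If most of the "before" entries turn out to be Rosetta downloads, that would support the pattern described here. If they are spread across projects, the cause is more likely elsewhere.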
River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
"I could be totally wrong but could the problem with BOINC controlling Rosetta be because the command line is too long and it's killing the BOINC client?"

Rosetta has always had these horrendously long command lines(*), and has not always failed like this. It is possible that the latest BOINC has a shorter buffer than earlier versions, in which case the long line might be an issue, though my guess is that this is unlikely. Certainly Linux has no problem dealing with very long command lines. If Windows had such a problem, again I'd be puzzled why it has not surfaced before. I'd agree it is a possibility that needs to be 'eliminated from enquiries', as detectives say in bad crime fiction, but my guess is that this is not the smoking gnu.

River~~

(*) Linux users can see the command line of the current rosetta task with this command from a terminal window / shell: ps ax|grep rosetta |
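The point that Linux copes with long command lines can be checked roughly by comparing the size of an argument list with the system limit reported as ARG_MAX. The Python sketch below assumes a Unix system where `os.sysconf("SC_ARG_MAX")` is available. The helper names are invented for this example, and the real kernel limit also counts the environment, so treat the result as an estimate only.

```python
import os
from typing import List


def argv_bytes(argv: List[str]) -> int:
    """Bytes used by the argument strings, one NUL terminator each."""
    return sum(len(arg.encode("utf-8")) + 1 for arg in argv)


def fits_in_limit(argv: List[str], limit: int) -> bool:
    """True if the arguments alone fit within the given byte limit."""
    return argv_bytes(argv) <= limit


def system_arg_max() -> int:
    """Ask the OS for its ARG_MAX value (Unix only)."""
    return os.sysconf("SC_ARG_MAX")


# Hypothetical usage with a shortened form of the Rosetta command line:
#     argv = "rosetta_5.40_i686-pc-linux-gnu dd 1TAB 1 -s 1TAB.uppk".split()
#     print(argv_bytes(argv), system_arg_max(), fits_in_limit(argv, system_arg_max()))
```

The full Rosetta command line quoted in the first post is only a few hundred bytes. If the length matters at all, a fixed-size buffer inside the client, as suggested above, is a more plausible suspect than the operating system limit.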
©2025 University of Washington
https://www.bakerlab.org