Message boards : Number crunching : Minirosetta 3.14
Author | Message |
---|---|
robertmiles Send message Joined: 16 Jun 08 Posts: 1235 Credit: 14,360,346 RAC: 1,269 |
I just had one of the hung workunits, on a computer that's usually much more reliable at completing Rosetta@home workunits. casd_sgr145_boinc_3duwA_208.nonlocal.pctid_0.09.tmscore_0.63331._nonlocal_tex_IGNORE THE REST_27533_3268 I've selected 12 hour workunit lengths. Currently at 20:04:42 elapsed, 5.770% progress and not increasing, 53:36:27 to completion. CPU time at last checkpoint 00:43:47 CPU time 00:43:51 Commit size 352,408 KB BOINC lists it as running, but it's not using any CPU time. The workunits on the other CPU cores are from other BOINC projects, and running just fine. Appears to be one of the several workunits I've seen where the hang occurred just after a checkpoint. BOINC appears likely to be in a state where it asks for GPU workunits only, and only from BOINC projects where I've never seen any available. I've seen this condition fairly often on my laptop before, but not on my desktop where it is now. Clicking on the Show Graphics button brings up the minirosetta_graphics_3.13_windows_x86_64.exe program (previously not running) showing a window with a proper frame and a proper label at the top, but with the space inside the frame totally black. Clicking on the X in the red space at the top right corner of the graphics window gives this error message: minirosetta_graphics_3.13_windows_x86_64.exe is not responding Problem Event Name: AppHangB1 Hang Signature: dd05 Hang Type: 0 (too many more detail lines to copy if every time I enter a window to copy them to, or start Snipping Tool, the error message disappears.) Should I abort this workunit so you can see the output files? Or just restart BOINC to see if that will restart that workunit properly? Or something else? 
6/18/2011 5:48:13 PM Starting BOINC client version 6.10.58 for windows_x86_64 6/18/2011 5:48:13 PM log flags: file_xfer, sched_ops, task 6/18/2011 5:48:13 PM Libraries: libcurl/7.19.7 OpenSSL/0.9.8l zlib/1.2.3 6/18/2011 5:48:13 PM Data directory: C:\ProgramData\BOINC 6/18/2011 5:48:13 PM Running under account Bobby 6/18/2011 5:48:13 PM Processor: 4 GenuineIntel Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz [Family 6 Model 23 Stepping 10] 6/18/2011 5:48:13 PM Processor: 6.00 MB cache 6/18/2011 5:48:13 PM Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 cx16 sse4_1 syscall nx lm vmx smx tm2 pbe 6/18/2011 5:48:13 PM OS: Microsoft Windows Vista: Home Premium x64 Edition, Service Pack 2, (06.00.6002.00) 6/18/2011 5:48:13 PM Memory: 8.00 GB physical, 15.66 GB virtual 6/18/2011 5:48:13 PM Disk: 919.67 GB total, 544.50 GB free 6/18/2011 5:48:13 PM Local time is UTC -5 hours 6/18/2011 5:48:13 PM NVIDIA GPU 0: GeForce GTS 450 (driver version 26724, CUDA version 3020, compute capability 2.1, 993MB, 476 GFLOPS peak) 6/18/2011 5:48:13 PM General prefs: using separate prefs for work 6/18/2011 5:48:13 PM Reading preferences override file 6/18/2011 5:48:13 PM Preferences: 6/18/2011 5:48:13 PM max memory usage when active: 3276.16MB 6/18/2011 5:48:13 PM max memory usage when idle: 3276.16MB 6/18/2011 5:48:13 PM max disk usage: 30.00GB 6/18/2011 5:48:13 PM max CPUs used: 3 6/18/2011 5:48:13 PM (to change preferences, visit the web site of an attached project, or select Preferences in the Manager) 6/18/2011 5:48:13 PM Not using a proxy |
robertmiles Send message Joined: 16 Jun 08 Posts: 1235 Credit: 14,360,346 RAC: 1,269 |
Have you thought of creating a test application specifically to gather more information on the computer environment it is running on, then sending one such workunit to each machine known to have a problem with workunits freezing? No objection if it then goes on to attempt to run a normal workunit afterwards, possibly with more debugging output than usual enabled. Combining your system types and mine suggests that it might be worthwhile checking if it is specific to Intel CPUs, and even perhaps some ranges of Intel CPU types. As for what's next on Ralph@Home, it currently has no workunits queued, so you'll have to wait for some to become available. Since I currently have an Einstein@Home CPU workunit that seems rather reluctant to finish in a reasonable time (perhaps due to the debt from the several Einstein@Home GPU workunits run recently), I've decided to drain the queue of CPU workunits on my desktop by temporarily setting all BOINC projects offering CPU workunits to No New Tasks. |
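The kind of environment report suggested above could start from something like the following. This is only a minimal sketch using Python's standard library, not anything the project actually ships; the function name `collect_environment` and the choice of fields are assumptions made for illustration.

```python
import json
import os
import platform


def collect_environment():
    """Return a small dictionary describing the host this script runs on."""
    return {
        "os": platform.system(),            # e.g. "Windows", "Linux", "Darwin"
        "os_release": platform.release(),
        "os_version": platform.version(),
        "machine": platform.machine(),      # e.g. "AMD64", "x86_64"
        "processor": platform.processor(),  # may be an empty string on some systems
        "cpu_count": os.cpu_count(),        # logical CPUs, or None if unknown
    }


if __name__ == "__main__":
    # Print the report as JSON so it can be pasted into a forum post.
    print(json.dumps(collect_environment(), indent=2))
```

A report like this would make it easier to compare the machines that hang against the ones that do not, for example to check the Intel-only hypothesis.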
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0 |
If you have stalled work units, please go ahead and abort the jobs and manually kill the minirosetta process from the Task Manager if necessary. We'll post an update on Ralph sometime early next week and submit more test work units. Please post the names of the work units so we know which ones to test on Ralph. I'd also recommend suspending R@h for the time being. |
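For anyone who prefers the command line to the BOINC Manager and Task Manager, the three steps above (abort the task, kill the process, suspend the project) can be sketched as command lines. The helper names here are hypothetical, and the project URL is an assumption: use whatever URL your own client shows for Rosetta@home. The code only builds the argument lists and does not run anything.

```python
def abort_task_command(project_url, task_name):
    """Arguments for boinccmd to abort one task of a project."""
    return ["boinccmd", "--task", project_url, task_name, "abort"]


def suspend_project_command(project_url):
    """Arguments for boinccmd to suspend a whole project."""
    return ["boinccmd", "--project", project_url, "suspend"]


def kill_process_command(image_name):
    """Arguments for Windows taskkill to force-terminate a process by image name."""
    return ["taskkill", "/F", "/IM", image_name]


# Assumed values for illustration only.
PROJECT_URL = "https://boinc.bakerlab.org/rosetta/"
TASK_NAME = "name_of_the_stalled_task"  # placeholder, copy the real name from the Manager
IMAGE_NAME = "minirosetta_3.14_windows_intelx86.exe"

for command in (
    abort_task_command(PROJECT_URL, TASK_NAME),
    kill_process_command(IMAGE_NAME),
    suspend_project_command(PROJECT_URL),
):
    print(" ".join(command))
```

Each list could be passed to `subprocess.run` once the values have been checked against the local installation.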
alpha Send message Joined: 4 Nov 06 Posts: 27 Credit: 1,550,107 RAC: 0 |
Computation error: https://boinc.bakerlab.org/rosetta/result.php?resultid=430150155 - Unhandled Exception Record - Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x750F9617 |
Holmis Send message Joined: 15 Nov 07 Posts: 6 Credit: 975,490 RAC: 0 |
I've also got a computation error on this task. <message> Felaktig funktion. (0x1) - exit code 1 (0x1) </message> and ERROR: Cannot open PDB file "2ilaA.pdb" ERROR:: Exit from: ..\..\..\src\core\import_pose\import_pose.cc line: 199 BOINC:: Error reading and gzipping output datafile: default.out called boinc_finish Translation: Felaktig funktion = Incorrect function |
robertmiles Send message Joined: 16 Jun 08 Posts: 1235 Credit: 14,360,346 RAC: 1,269 |
One more where BOINC thinks it's running, but it's using no CPU time at all: ilv_hr41_all_boinc_2ebmA_108.nonlocal.pctid_0.14.tmscore_0.45048._nonlocal_tex_IGNORE_THE_REST_27535_3351 12 hour workunits requested. 13:38:38 elapsed, 62.456% progress, 07:46:40 To completion CPU time at last checkpoint 07:31:42 CPU time 07:31:46 Appears to be one more of the many I've seen that stopped using any CPU time shortly after a checkpoint, or possibly after the checkpoint was started but not yet finished. Rosetta@Home already on No new tasks while I drain the list of CPU workunits to force one especially slowly running workunit to get enough CPU time to finish. No more already on that computer. Computer environment already described above. |
Alan J Rodger Send message Joined: 16 Oct 05 Posts: 7 Credit: 32,282 RAC: 0 |
I've had to abort several work units because time elapsed and time to completion both go up without % completion changing - one work unit reached 25 hours and went from ca 3 hours to completion at the start to 18 hours to completion. Many are minirosetta 3.14. What's up? Alan |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0 |
There are reports of minirosetta 3.14 continuing on with 0 cpu usage. This sounds like the same issue. You'll have to manually kill the process using the task manager and suspend the R@h project for the time being. We are looking into this issue. |
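The symptom reported in this thread, elapsed time climbing while CPU time stays flat, can be checked mechanically. The following is a rough sketch only: the function names and the one-hour threshold are assumptions, and a task that was legitimately waiting for other reasons could trip the same check.

```python
def parse_hms(text):
    """Convert an 'HH:MM:SS' string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def looks_stalled(elapsed, cpu_time, max_gap_seconds=3600):
    """Flag a task whose elapsed time has run far ahead of its CPU time.

    max_gap_seconds is an arbitrary threshold chosen for this sketch.
    """
    return parse_hms(elapsed) - parse_hms(cpu_time) > max_gap_seconds


# Figures quoted earlier in the thread for two hung workunits.
print(looks_stalled("20:04:42", "00:43:51"))
print(looks_stalled("13:38:38", "07:31:46"))
```

Both calls flag the task, since the gaps are over 19 hours and over 6 hours respectively, while a healthy task normally shows elapsed and CPU time within minutes of each other.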
.clair. Send message Joined: 2 Jan 07 Posts: 274 Credit: 26,399,595 RAC: 0 |
Compute error after 2.1 seconds, wingman had the same ilv_fgf2_all_boinc_2ilaA_139.nonlocal.pctid_0.12.tmscore_._nonlocal_tex_IGNORE_THE_REST_27534_2350_1 ERROR: Cannot open PDB file "2ilaA.pdb" ERROR:: Exit from: src/core/import_pose/import_pose.cc line: 199 BOINC:: Error reading and gzipping output datafile: default.out called boinc_finish workunit - https://boinc.bakerlab.org/rosetta/workunit.php?wuid=392925238 |
N7QLT Send message Joined: 19 Dec 05 Posts: 2 Credit: 3,753,965 RAC: 0 |
I have noticed occasions when workunits are running but not using any CPU. If I exit BOINC and request it to stop science projects, wait a moment and then restart BOINC, it seems to reset "something" and the Rosetta projects start running again. I am guessing that it is related to some other process on the system, maybe nightly virus scans or other maintenance. Maybe it's related to the workunit in progress. Just haven't had time to research it further. |
Holmis Send message Joined: 15 Nov 07 Posts: 6 Credit: 975,490 RAC: 0 |
Got one more compute error today; it's the same error as I posted in message #70604 earlier in this thread. ERROR: Cannot open PDB file "2ilaA.pdb" ERROR:: Exit from: ..\..\..\src\core\import_pose\import_pose.cc line: 199 BOINC:: Error reading and gzipping output datafile: default.out called boinc_finish It also appears to be the same type of task: ilv_fgf2_all_boinc_2ilaA_139.nonlocal.pctid_0.12.tmscore_._nonlocal_tex_IGNORE_THE_REST_27534_4527_0 and ilv_fgf2_all_boinc_2ilaA_139.nonlocal.pctid_0.12.tmscore_._nonlocal_tex_IGNORE_THE_REST_27534_6682_0 Link to new error: https://boinc.bakerlab.org/rosetta/result.php?resultid=431062430 Something wrong with them? |
.clair. Send message Joined: 2 Jan 07 Posts: 274 Credit: 26,399,595 RAC: 0 |
It looks like the ilv_fgf2_all_boinc units have a problem. Another one failed after 2.1 seconds, same error as before. ERROR: Cannot open PDB file "2ilaA.pdb" ERROR:: Exit from: src/core/import_pose/import_pose.cc line: 199 BOINC:: Error reading and gzipping output datafile: default.out called boinc_finish https://boinc.bakerlab.org/rosetta/workunit.php?wuid=393122110 |
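Spotting a bad batch like this one gets easier if failed workunit names are grouped by their leading part. The sketch below assumes, purely from the names posted in this thread, that a name begins with a batch prefix ending in `_boinc`, followed by a five-character target ID such as `2ilaA`; that layout is an inference, not a documented naming scheme.

```python
import re
from collections import Counter

# Assumed layout: <batch ending in _boinc>_<digit + 3 alphanumerics + chain letter>_...
NAME_PATTERN = re.compile(r"^(?P<batch>.+?_boinc)_(?P<target>\d[0-9A-Za-z]{3}[A-Za-z])_")


def batch_and_target(workunit_name):
    """Return (batch, target) for a workunit name, or None if it does not fit."""
    match = NAME_PATTERN.match(workunit_name)
    if match is None:
        return None
    return match.group("batch"), match.group("target")


def count_failures(workunit_names):
    """Count failed workunits per (batch, target) pair."""
    counts = Counter()
    for name in workunit_names:
        key = batch_and_target(name)
        if key is not None:
            counts[key] += 1
    return counts


failed = [
    "ilv_fgf2_all_boinc_2ilaA_139.nonlocal.pctid_0.12.tmscore_._nonlocal_tex_IGNORE_THE_REST_27534_2350_1",
    "ilv_fgf2_all_boinc_2ilaA_139.nonlocal.pctid_0.12.tmscore_._nonlocal_tex_IGNORE_THE_REST_27534_4527_0",
    "ilv_fgf2_all_boinc_2ilaA_139.nonlocal.pctid_0.12.tmscore_._nonlocal_tex_IGNORE_THE_REST_27534_6682_0",
]
for (batch, target), count in count_failures(failed).most_common():
    print(batch, target, count)
```

With the three names reported so far, every failure lands on the same batch and the same target, which matches the "Cannot open PDB file" error naming `2ilaA.pdb`.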
James Thompson Send message Joined: 13 Oct 05 Posts: 46 Credit: 186,109 RAC: 0 |
It looks like the ilv_fgf2_all_boinc units have a problem. Another one failed after 2.1 seconds, same error as before. Something is wrong with those workunits. I'll remove them now. |
P . P . L . Send message Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
This is not funny. Why is this still happening? This task should have finished after 49 models. I believe it should have stopped, but for some reason it started again and did one more model, and that's all I'm getting credit for after 4+ hrs. # cpu_run_time_pref: 14400 ====================================================== DONE :: 49 starting structures 14250.2 cpu seconds This process generated 49 decoys from 49 attempts ====================================================== BOINC :: WS_max 0 BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down cleanly ... called boinc_finish # cpu_run_time_pref: 14400 ====================================================== DONE :: 1 starting structures 14501.5 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== BOINC :: WS_max 0 BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down cleanly ... called boinc_finish Valid Claimed credit__119.52 Granted credit__1.81 application version 3.14 |
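The restart described above shows up in the task output as two separate "DONE" summaries. A small parser makes that easy to spot across many results. This is an illustrative sketch only: the regular expression is based solely on the lines quoted in this post, and `find_done_blocks` is a hypothetical helper name.

```python
import re

# Matches summary lines such as "DONE :: 49 starting structures 14250.2 cpu seconds".
DONE_PATTERN = re.compile(r"DONE\s*::\s*(\d+) starting structures\s+([\d.]+) cpu seconds")


def find_done_blocks(stderr_text):
    """Return a list of (structures, cpu_seconds) for every DONE summary found."""
    return [(int(count), float(seconds)) for count, seconds in DONE_PATTERN.findall(stderr_text)]


sample = (
    "# cpu_run_time_pref: 14400 "
    "DONE :: 49 starting structures 14250.2 cpu seconds "
    "This process generated 49 decoys from 49 attempts "
    "called boinc_finish "
    "# cpu_run_time_pref: 14400 "
    "DONE :: 1 starting structures 14501.5 cpu seconds "
    "This process generated 1 decoys from 1 attempts "
    "called boinc_finish"
)

blocks = find_done_blocks(sample)
print(blocks)
if len(blocks) > 1:
    print("Task finished more than once; the last summary reports", blocks[-1][0], "structure(s)")
```

In the output quoted above, the second summary reports 1 structure while the first reported 49, which lines up with the gap between the claimed credit (119.52) and the granted credit (1.81).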
Alan J Rodger Send message Joined: 16 Oct 05 Posts: 7 Credit: 32,282 RAC: 0 |
Why don't you stop sending out Minirosetta 3.14 work units until you solve the problem? The other units seem to work. |
robertmiles Send message Joined: 16 Jun 08 Posts: 1235 Credit: 14,360,346 RAC: 1,269 |
Why don't you stop sending out Minirosetta 3.14 work units until you solve the problem? The other units seem to work. Possibly because the Minirosetta 3.14 workunits work on some computers. For example, my laptop. Possibly because they want to gather more information on what kinds of computers those workunits don't work on. |
svincent Send message Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
Task 430561413 failed with an Out of Memory message Unhandled Exception Detected... - Unhandled Exception Record - Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x758B9617 Engaging BOINC Windows Runtime Debugger... (much debugging stuff snipped) Odd considering it's a C2D with 4 GB of memory and these tasks don't seem to use more than 300-400 MB. I've also been having a lot of these 'task hanging' issues on W7 (not Mac): they're curable by quitting BOINC and restarting. Haven't been keeping track of all the names but tasks with names like ilv* seem particularly prone to this behaviour. |
[AF>france>pas-de-calais]symaski62 Send message Joined: 19 Sep 05 Posts: 47 Credit: 33,871 RAC: 0 |
25/06/2011 01:02:23 | rosetta@home | Started download of minirosetta_3.14_windows_intelx86.exe 25/06/2011 01:02:23 | rosetta@home | Started download of minirosetta_graphics_3.13_windows_intelx86.exe 3.14 version ? |
robertmiles Send message Joined: 16 Jun 08 Posts: 1235 Credit: 14,360,346 RAC: 1,269 |
25/06/2011 01:02:23 | rosetta@home | Started download of minirosetta_3.14_windows_intelx86.exe Yes, the 3.14 version of the main application program. Looks behind on the screensaver graphics program, though. |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=430776403 casd_sgr145_boinc_3lccA_26.nonlocal.pctid_0.20.tmscore_0.67557._nonlocal_tex_IGNORE_THE_REST_27533_2536_1 Outcome Client error Client state Compute error Exit status -529697949 (0xe06d7363) Unhandled Exception Detected... - Unhandled Exception Record - Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x7C812AFB This task chewed up 3.24GB of RAM? That's insane. |
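The two forms of the exit status in these reports, -529697949 and 0xe06d7363, are the same 32-bit value read as signed and unsigned. The short sketch below shows the conversion; the helper names are made up for this example. The low three bytes spell "msc" in ASCII, which is the marker the Microsoft Visual C++ runtime uses for C++ exceptions, consistent with the "(C++ Exception)" label in the log.

```python
def to_signed32(value):
    """Interpret the low 32 bits of value as a signed integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def to_unsigned32(value):
    """Interpret the low 32 bits of value as an unsigned integer."""
    return value & 0xFFFFFFFF


code = 0xE06D7363
print(to_signed32(code))                 # signed form, as BOINC reports the exit status
print(hex(to_unsigned32(-529697949)))    # back to the hexadecimal form
print((code & 0xFFFFFF).to_bytes(3, "big").decode("ascii"))  # low three bytes as text
```

So the exit status only says that an uncaught C++ exception ended the process; the "Out Of Memory" reason comes from the separate line in the debugger output.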