Message boards : Number crunching : minirosetta 2.03
Author | Message |
---|---|
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
This one died last night, same as the others. homopt4.t328_.t328_.IGNORE_THE_REST.S_00002_0000009_0_0_0_0001.pdb_00004.pdb_00002.pdb.JOB_16816_14_0 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=282247578 ERROR: [ERROR] Error opening RBSeg file 'S_00001_0000002_07.pdb_00001.pdb.loopfile' ERROR:: Exit from: src/protocols/loops/Loops.cc line: 483 BOINC:: Error reading and gzipping output datafile: default.out called boinc_finish |
Sarel Joined: 11 May 06 Posts: 51 Credit: 81,712 RAC: 0 |
Hi again, David Kim and I have tracked down this problem and I'm going to test a fix for it in the upcoming release. The problem was that per-decoy checkpointing was not turned on in this batch of simulations. When I mentioned that these protocols do not need checkpointing, I only meant within-trajectory checkpointing. For the time being, I've stopped sending out this type of simulation, though over the next few days your computers might still work on them, as quite a few have already been sent out. To reassure you, the results of these simulations are certainly useful to us, and in most cases credit will be allocated correctly. Thanks a lot for sending specific comments that allowed us to figure this out! |
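The distinction Sarel draws between per-decoy and within-trajectory checkpointing can be illustrated with a small sketch. This is not Rosetta or BOINC code: `run_batch`, `load_finished`, the JSON state file, and the step counts are all hypothetical, chosen only to show why finished decoys are lost on a restart when nothing is saved between decoys.

```python
import json
import os
import tempfile


def load_finished(state_path):
    """Return the decoy ids recorded on disk, or [] if no state file exists."""
    if not os.path.exists(state_path):
        return []
    with open(state_path) as fh:
        return json.load(fh)["finished"]


def run_batch(n_decoys, steps_per_decoy, state_path,
              per_decoy_checkpoint=True, stop_after_steps=None):
    """Generate decoys, resuming from whatever the state file already holds.

    per_decoy_checkpoint: write the list of finished decoys after each decoy.
    stop_after_steps: simulate a shutdown after this many steps (hypothetical).
    Nothing is saved *within* a trajectory, so a partly built decoy is always
    lost; that is the small loss the protocols were said to tolerate.
    """
    finished = load_finished(state_path)
    steps_done = 0
    for decoy_id in range(len(finished), n_decoys):
        for _ in range(steps_per_decoy):
            if stop_after_steps is not None and steps_done >= stop_after_steps:
                return finished  # simulated shutdown in mid-trajectory
            steps_done += 1
        finished.append(decoy_id)
        if per_decoy_checkpoint:
            with open(state_path, "w") as fh:
                json.dump({"finished": finished}, fh)
    return finished
```

With 10 steps per decoy and a simulated shutdown after 25 steps, two decoys survive on disk when per-decoy checkpointing is on, and none survive when it is off, which matches the kind of loss reported in this thread.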
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
I don't know if this is a validator problem or a problem with the task. Any ideas? Edit: It ran for over 4 hours non-stop to finish. 9gbnnotyr_3gbn_2hxm_9Jan2010_16880_35_0 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=282555003 # cpu_run_time_pref: 14400 ====================================================== DONE :: 27 starting structures 15475.2 cpu seconds This process generated 27 decoys from 27 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down cleanly ... called boinc_finish </stderr_txt> Over__Validate error__Done__15,475.75 |
Sid Celery Joined: 11 Feb 08 Posts: 2126 Credit: 41,253,494 RAC: 7,932 |
More of the previously reported errors here on what's usually a very reliable error-free machine. One new odd one though, relating to credits rather than anything else: 9gbnnotyr_3gbn_2p8g_9Jan2010_16860_4_0 Outcome Success Generally my granted credit on this W7 laptop is close to the claimed credit, with the occasional one being 30% less or 50% more, but 99% less seems very odd. Any ideas, or just a one-off? |
Admin Joined: 13 Apr 07 Posts: 42 Credit: 260,782 RAC: 0 |
I've been having a slight issue with WUs freezing on my computer for the last few days. Most seem fine, but there have been a few odd ones lately. Most are Homopt WUs, and now I'm having problems with boinc_filtered_loopbuild_threading (the 2nd one that has frozen). They seem to get to 20-70%, then for some reason stop and just tick away with the process sitting idle. I've chosen to manually abort these. Has anyone else been having this issue? Also, when I go to "show graphics" the graphics window freezes, which forces me to kill the process. I don't want to keep aborting these WUs, but there doesn't seem to be anything else I can do... anyone want to help me out? |
Admin Joined: 13 Apr 07 Posts: 42 Credit: 260,782 RAC: 0 |
Just downloaded 2 more WUs of the same type and they are stuck at 8% and 9%... can anyone tell me what's going on? |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Admin, I've not heard of such problems until just the past few days. I've emailed the Project Team asking them to look into it. Do you spot any pattern in the names of the WUs that are working vs. those hanging up? Your profile looks like you are running Win7. Rosetta Moderator: Mod.Sense |
svincent Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
"Admin, I've not heard of such problems until just the past few days. I've emailed the Project Team asking them to look into it." Admin, I see you have one computer and it's running Windows 7. I've had identical problems with R@h tasks running under this OS, some of which I've reported above. There seems to be no common pattern to the tasks that have to be aborted: given two tasks with names identical apart from the digits at the end, one may complete successfully while the other has to be aborted. For the tasks I've looked at, though, it always seems that they get completed successfully by a wingman running under a different OS. |
Admin Joined: 13 Apr 07 Posts: 42 Credit: 260,782 RAC: 0 |
It's really random, so I can't quite say which ones work, but I've stated the ones that don't work for me above. If you can check the tasks I've aborted, those are the WUs that have been faulty. It's been happening more and more over the past few days, so I've aborted the last 2 bad ones and won't get any more for now until the issue is looked at. Homopt and boinc_filtered_loopbuild_threading seem to be the biggest issues for me, and they have been the only ones I've been getting. Anything else you need to know? Yes, I'm running Windows 7 RC right now. |
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
This one lasted about 11 seconds. t287__boinc_filtered_loopbuild_threading_cst_lb_tex_IGNORE_THE_REST_16900_576_0 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=282963623 <core_client_version>6.2.14</core_client_version> <![CDATA[ <message> process got signal 11 </message> Wed 13 Jan 2010 16:27:59 EST|rosetta@home|Output file t287__boinc_filtered_loopbuild_threading_cst_lb_tex_IGNORE_THE_REST_16900_576_0_0 for task absent |
Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 26,101,436 RAC: 16,911 |
Hello, I am not worried about losing one unfinished model: in these tasks the models are small, so the loss would amount to no more than a few minutes of CPU time. But what about losing all the models calculated before a shutdown (or a reboot of the computer, or a restart of the BOINC client)? Judging from the screenshot posted above, this type of WU does not write any checkpoints at all during the whole computation. Or are the results of finished (completely calculated) models saved in some other way (not through the checkpoint mechanism), with checkpoints needed only to save intermediate results within a single model? And BOINC simply does not know about it and so reports "no CPU time at last checkpoint"? |
Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 26,101,436 RAC: 16,911 |
"Hi again, ..." Oh, I wrote my previous post before reading this one. Glad to hear that the problem has been tracked down. It is always pleasant to squash one more bug in software. :) (In my main job I work in programming in general, and in testing and debugging in particular. Our projects are much simpler than scientific ones, but programming has a lot in common everywhere.) |
Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 26,101,436 RAC: 16,911 |
"More of the previously reported errors here on what's usually a very reliable error-free machine." I think this is also a case of the problem with saving calculation results. Your computer reported "10 decoys", which is very few for this type of WU. For comparison, here is the result of a similar WU on my processor: https://boinc.bakerlab.org/rosetta/result.php?resultid=309983219 As you can see, my processor calculated 96 decoys in 7138.77 CPU seconds. Your result was 10 decoys in 28348.9 CPU seconds, despite a more powerful processor. The credits seem to be calculated correctly: 15.15 Cr for 96 decoys (my result) and 1.5 Cr for 10 decoys (your result), i.e. about 0.15 Cr per decoy in both cases. So I think the problem is on your computer's side rather than on the server. Unless the computer has a serious problem capable of slowing calculations down many times over (for example heavy swapping), it most likely calculates many more decoys, but most of them are lost for some reason and only 10 make it into the report. |
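The per-decoy comparison in the post above can be checked with a few lines of arithmetic. The figures are the ones quoted in the post; the function names are only illustrative.

```python
def credit_per_decoy(credit, decoys):
    """Granted credit divided by the number of reported decoys."""
    return credit / decoys


def seconds_per_decoy(cpu_seconds, decoys):
    """CPU seconds spent per reported decoy."""
    return cpu_seconds / decoys


# Figures quoted in the post: (credit, decoys, CPU seconds).
mad_max = (15.15, 96, 7138.77)
sid = (1.5, 10, 28348.9)

for name, (credit, decoys, cpu) in (("Mad_Max", mad_max), ("Sid Celery", sid)):
    print(f"{name}: {credit_per_decoy(credit, decoys):.3f} Cr/decoy, "
          f"{seconds_per_decoy(cpu, decoys):.1f} s/decoy")

# How many times more CPU time per *reported* decoy the second machine used.
slowdown = seconds_per_decoy(sid[2], sid[1]) / seconds_per_decoy(mad_max[2], mad_max[1])
print(f"CPU time per reported decoy differs by a factor of about {slowdown:.0f}")
```

The credit per decoy is nearly the same on both machines, while the CPU time per reported decoy differs by a large factor, which is what leads the post to suspect lost decoys rather than a server-side credit error.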
Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 26,101,436 RAC: 16,911 |
From my observations (after running into a similar problem, I spent some time watching the disk writes of the Rosetta application), most WUs wrote checkpoints even more often: about once every 1-2 minutes (I think this follows the BOINC setting, which defaults to 60 seconds). I think I have caught a second checkpoint bug. This time it is not "between small models" but "inside one big model". Here is one such task: https://boinc.bakerlab.org/rosetta/result.php?resultid=310448366 As you can see, the ratio between "Claimed credit" and "Granted credit" is very bad, which indirectly points to a problem (too few useful results for that much CPU time). Tasks that are never interrupted while running usually show a much better ratio on my computer. Now, here is how this job ran on my computer: it was completed in 3 stages with 2 restarts between them (the 1st was the computer being turned off for the night; for the 2nd I deliberately restarted BOINC as a test). At the end of the first stage (before the 1st restart) the CPU time was about 2.5 hours, the progress was ~88%, and "show graphics" showed 1 model and a lot of steps (several thousand). The next day, at startup, the progress immediately dropped to ~47%, though I think it was really reset to zero; BOINC simply calculated it as 2:49 hours (CPU time already used) divided by 6 hours (the maximum allowed time = target CPU time × 3 = 6 h). "Show graphics" displayed the following: http://s004.radikal.ru/i206/1001/e5/15254410b960.jpg http://s005.radikal.ru/i210/1001/d5/a235df07123e.jpg It looks as though the computation started again from the very beginning. After two hours of computing I restarted BOINC a 2nd time (Exit on the tray icon); after startup "show graphics" looked like this: http://i069.radikal.ru/1001/1f/f431840cb759.jpg Again the model and step counters start from 0... The task log (stderr out) does contain a record of reading a checkpoint, but only one, although the job was interrupted and restarted twice.
Besides, there were many more checkpoint-related files in the working folder. |
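The ~47% figure in the post above follows from the arithmetic it describes. The sketch below reproduces that calculation; it encodes Mad_Max's hypothesis about how the displayed progress was derived, not confirmed BOINC or Rosetta behaviour. The function name is illustrative, and the 2-hour target is inferred from the post's "target CPU time × 3 = 6 h".

```python
def estimated_progress(cpu_seconds_used, target_cpu_seconds, max_factor=3):
    """Progress as CPU time used divided by the maximum allowed CPU time.

    max_factor=3 is the multiplier quoted in the post (assumed, not verified).
    """
    return cpu_seconds_used / (target_cpu_seconds * max_factor)


used = 2 * 3600 + 49 * 60   # 2:49 hours of CPU time already used
target = 2 * 3600           # 2-hour target, so the maximum is 6 hours

print(f"{estimated_progress(used, target):.1%}")
```

Under these assumptions the result rounds to 47%, consistent with the value observed after the first restart.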
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
There seems to be a common theme with these tasks: homopt4.t367_.t367_.IGNORE_THE_REST.S_00006_0000001_0_0_10089.pdb_00002.pdb_00006.pdb.JOB_16819_18_1 https://boinc.bakerlab.org/rosetta/result.php?resultid=309699920 homopt4.t328_.t328_.IGNORE_THE_REST.S_00001_0000022_0_0_0_0077.pdb_00001.pdb_00001.pdb.JOB_16816_16_1 https://boinc.bakerlab.org/rosetta/result.php?resultid=309816167 homopt4.t322_.t322_.IGNORE_THE_REST.S_00006_0000023_0_0_00034.pdb_00008.pdb_00006.pdb.JOB_16815_24_1 https://boinc.bakerlab.org/rosetta/result.php?resultid=309824444 They all died immediately due to: ERROR: No values of the appropriate type specified for multi-valued option -loops:loop_file |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=310017128 homopt_fa_cstmc_1.t370_.t370_.IGNORE_THE_REST.S_00003_0000784_010.pdb.JOB_16898_23_0 Outcome Client error Client state Compute error Exit status -177 (0xffffff4f) Unhandled Exception Detected... - Unhandled Exception Record - Reason: Breakpoint Encountered (0x80000003) at address 0x7C90120E Engaging BOINC Windows Runtime Debugger... BOINC Windows Runtime Debugger Version 6.5.0 Dump Timestamp : 01/21/10 00:25:36 LoadLibraryA( E:xxxxx: GetLastError = 126 Loaded Library : version.dll Debugger Engine : 4.0.5.0 |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=310017144 t308__boinc_filtered_loopbuild_threading_cst_all_tex_IGNORE_THE_REST_16902_69_0 Outcome Client error Client state Compute error Exit status -177 (0xffffff4f) <core_client_version>6.10.18</core_client_version> <![CDATA[ <message> Maximum elapsed time exceeded </message> ]]> |
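The two forms of the exit status shown in the reports above, -177 and 0xffffff4f, are the same 32-bit value read as signed and as unsigned. A short sketch of the conversion (the helper names are illustrative):

```python
def as_unsigned32(status):
    """Reinterpret a signed 32-bit exit status as an unsigned value."""
    return status & 0xFFFFFFFF


def as_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed exit status."""
    return value - 0x100000000 if value & 0x80000000 else value


print(hex(as_unsigned32(-177)))   # 0xffffff4f
print(as_signed32(0xFFFFFF4F))    # -177
```

This only shows that the decimal and hexadecimal forms agree; it says nothing about what the status means for a particular task.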
©2024 University of Washington
https://www.bakerlab.org