Message boards : Number crunching : Minirosetta v1.40 bug thread
Author | Message |
---|---|
Rabinovitch Send message Joined: 28 Apr 07 Posts: 28 Credit: 5,439,728 RAC: 0 |
15.11.2008 8:23:48|rosetta@home|Computation for task 1ail__BOINC_ABRELAX_SPLIT_SPLIT2_IGNORE_THE_REST-S25-9-S3-3--1ail_-_4768_650_0 finished 15.11.2008 8:23:48|rosetta@home|Output file 1ail__BOINC_ABRELAX_SPLIT_SPLIT2_IGNORE_THE_REST-S25-9-S3-3--1ail_-_4768_650_0_0 for task 1ail__BOINC_ABRELAX_SPLIT_SPLIT2_IGNORE_THE_REST-S25-9-S3-3--1ail_-_4768_650_0 absent https://boinc.bakerlab.org/rosetta/result.php?resultid=207358562 |
P . P . L . Send message Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Hi. I just noticed on my quad that I had six tasks running. Four were running normally, but the two Rosetta Mini 1.40 tasks were marked as waiting to run while their time and percentage kept going up. I tried suspending them, but it didn't work. I don't know whether the problem is BOINC version 6.2.14 or Rosetta. Any ideas? pete. |
robertmiles Send message Joined: 16 Jun 08 Posts: 1233 Credit: 14,324,975 RAC: 3,637 |
I've found that just increasing the minimum and maximum virtual memory sizes is not enough to make minirosetta v1.40 start using more virtual memory, at least on my Vista SP1 machine. I then increased the maximum amount of disk space BOINC is allowed to use, and the maximum percentage of virtual memory BOINC is allowed to use. Since then, I've more often seen two of the more memory-hungry workunits, such as two minirosetta v1.40 workunits, run at the same time on my dual-core machine, and I've started seeing a higher total size of virtual memory in use. However, I've also stopped seeing workunits with the known tags for use of the new mode of minirosetta v1.40, so it's hard to tell which is responsible for the improvement. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
...so it's hard to tell which is responsible for the improvement. Yes, it is always hard to say for certain cause and effect. Keep in mind that CPU time is the main contributor here, not the amount of virtual memory utilized. It will actually run faster if it has real memory than virtual. And you cannot force an application to use more memory. It either requires it, or it doesn't. It is sort of like consuming water as your objective and someone suggests you use a larger glass with the thought that it would help you consume water faster. As long as your prior glass was able to provide water at a rate similar to the rate of consumption, a larger glass will not help. Rosetta Moderator: Mod.Sense |
Rabinovitch Send message Joined: 28 Apr 07 Posts: 28 Credit: 5,439,728 RAC: 0 |
Another one: https://boinc.bakerlab.org/rosetta/result.php?resultid=207358562 |
Saharak Send message Joined: 28 Apr 07 Posts: 7 Credit: 1,170,212 RAC: 0 |
1hzh_2exu_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_284_0 Had been running for 25 hours then it was suspended because of time of day. Next day it restarted at 50% (which means 12 hours of computing was lost). Therefore I canceled the unit. |
P . P . L . Send message Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Hi. Well, it looks like it could be the mini app that's the problem, as I have two Beta 5.98 tasks on now and they are suspending/waiting to run properly. I'll have to keep an eye on it when I get a couple of 1.40's running. Edit// I forgot, these are the two tasks that were on at the time. cs_jumping_abrelax_6PNAS_proteins3_homo_bench_cs_jumping_abrelax_cs_ccr19_olange_4727_24570_0 cs_jumping_abrelax_6PNAS_proteins3_homo_bench_cs_jumping_abrelax_cs_ccr19_olange_4727_26842_0 pete. |
robertmiles Send message Joined: 16 Jun 08 Posts: 1233 Credit: 14,324,975 RAC: 3,637 |
...so it's hard to tell which is responsible for the improvement. I already got all the physical memory my motherboard can use when I saw a slowdown of the programs I normally use a few months ago. At least the greater use of virtual memory to swap out the workunits that aren't running allows me to use the browser and newsreader without the slowdowns I saw recently. |
robertmiles Send message Joined: 16 Jun 08 Posts: 1233 Credit: 14,324,975 RAC: 3,637 |
Hi. Those workunit names look a lot like the name of at least one workunit I ran recently, but under 1.40 instead. Under 1.40, they seemed to run OK after I made the changes that let BOINC use more virtual memory to swap out workunits that weren't running. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=189103816 |
softweir Send message Joined: 30 Oct 05 Posts: 1 Credit: 1,691,061 RAC: 0 |
I keep getting compute errors on my minirosetta tasks. The following (from taskID 207659162) is typical:- stderr out Unhandled Exception Detected... - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x7C910193 write attempt to address 0x009254E6 Engaging BOINC Windows Runtime Debugger... Unhandled Exception Detected... - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x7C910193 write attempt to address 0x00408E6E Engaging BOINC Windows Runtime Debugger... |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2130 Credit: 41,424,155 RAC: 16,102 |
Several errors over the last week or so, which I'm only just catching up with: These messages (to a greater or lesser degree) appear in the Task IDs listed beneath (note: I've seen this reported in the thread Rosetta Mini with new score terms bug thread too): <core_client_version>6.2.19</core_client_version> Task ID 205994161 Task ID 206059326 Task ID 206211546 Task ID 206290866 Task ID 206333375 Task ID 206790264 Task ID 206812382 Task ID 206932670 Task ID 207028575 Task ID 207063049 Task ID 207098754 Task ID 207231397 Task ID 207268928 Task ID 207273838 Task ID 207278136 Task ID 207281019 Task ID 207305018 Task ID 207461993 Task ID 207466440 Task ID 207471528 Task ID 207471528 These messages appear in the Task IDs listed beneath: ERROR: NANs occured in hbonding! Task ID 206157249 Task ID 207109581 However, these and many more WUs error out with the following details: Client state: Compute error. The computer is listed here: AMD Phenom(tm) 9850 Quad-Core, Vista Home Premium x64 Edition, SP1, 8 GB RAM, 330 GB free space, preferences set to 2 hour run time due to constant erroring out with "Can't acquire lockfile - exiting". This problem never occurs with Rosetta Beta 5.98 (or earlier versions of the Beta), only with all versions of MiniRosetta since I upgraded to this machine and 64-bit OS. Of my last 162 WUs: Beta 5.98: 58 WUs, 100% success. Mini 1.40: 104 WUs, 60% success (62), 40% failure (42). Failure of Mini 1.40 WUs rises rapidly if runtime is increased above 2 hours (60% failure rate). |
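For anyone who wants to check the success and failure percentages quoted in the post above, here is a minimal sketch of the arithmetic. The counts come straight from the post; the variable and function names are made up for illustration.

```python
# Counts as reported in the post above (names are illustrative only).
beta_total = 58      # Beta 5.98 WUs, all successful
mini_success = 62    # Mini 1.40 WUs that completed
mini_failure = 42    # Mini 1.40 WUs that errored out
mini_total = mini_success + mini_failure


def rate(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


total_wus = beta_total + mini_total
mini_success_rate = rate(mini_success, mini_total)
mini_failure_rate = rate(mini_failure, mini_total)

print(f"total WUs: {total_wus}")
print(f"Mini 1.40 success: {mini_success_rate}%  failure: {mini_failure_rate}%")
```

The exact figures come out slightly off the rounded 60% / 40% in the post, which is consistent with the poster having rounded to the nearest whole percent.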
(_KoDAk_) Send message Joined: 18 Jul 06 Posts: 109 Credit: 1,859,263 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=207486683 https://boinc.bakerlab.org/rosetta/result.php?resultid=207628604 |
Dave Mickey Send message Joined: 29 Dec 07 Posts: 33 Credit: 4,136,957 RAC: 0 |
"As far as I can tell from the messages here, people are seeing two major problems: 1. long run times with relatively low credit 2. larger than anticipated memory requirements. Please let me know if you see any other type of problem." I think I'm seeing something else. I have 2 machines sharing r@h with seti. I'm seeing that each machine has gotten to a state where BOINC has suspended a rosetta task in order to restart a seti task. But I see that the seti task is only getting 50% of the CPU time, according to boincview/boinc. When I look in Windows Task Manager, I see this is because the rosetta task is continuing to execute, and thus the rosetta and seti tasks are trying to share the processor, each getting about 50%. But boincview has the rosetta task as waiting. Despite that, the cumulative CPU time reported in boincview is increasing, with both the rosetta and seti tasks getting about 30 secs of CPU each minute. In the current case, they've been sharing the CPU for about 2 hours, so this seems to be a steady state condition. This rosetta app (or maybe something about the WU) has apparently made it ignore BOINC's command to suspend and be preempted. The current problem case is 2008-11-16 04:47:50 [rosetta@home] [cpu_sched] Preempting cs_jumping_abrelax_6PNAS_proteins3_homo_bench_cs_jumping_abrelax_cs_ccr19_olange_4727_614_0 (left in memory) In the first instance, I shut down BOINC and restarted, and it properly restarted with only the seti wu executing. Dave |
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 10 |
Can't acquire lockfile - exiting I posted this in another thread, but as it seems to be Mini Rosetta specific I'll copy and paste it here as well. I said... That's familiar. Go to "Your Account" then "Computing Preferences" check that at the bottom of the first block "Use at most" is set to 100%. That lock file error is common on systems where this is not set to 100% at some projects. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=207080587 h013__BOINC_ABRELAX_RANGE_yebf_IGNORE_THE_REST-S25-14-S3-7--h013_-_4675_237_1 It completed OK, but there were a lot of messages like this: stderr out <core_client_version>6.2.19</core_client_version> <![CDATA[ <stderr_txt> recovering checkpoint of tag S_U14X7X_00000001 with id abrelax_rg_state recovering checkpoint of tag S_U14X7X_00000001 with id stage_1 recovering checkpoint of tag S_U14X7X_00000001 with id stage_2 # cpu_run_time_pref: 21600 recovering checkpoint of tag S_U14X7X_00000001 with id stage_3_iter1_1 recovering checkpoint of tag S_U14X7X_00000001 with id stage_3_iter1_2 recovering checkpoint of tag S_U14X7X_00000001 with id stage_3_iter1_3 recovering checkpoint of tag S_U14X7X_00000001 with id stage_3_iter1_4 recovering checkpoint of tag S_U14X7X_00000001 with id stage_3_iter1_5 recovering checkpoint of tag S_U14X7X_00000001 with id stage_3_iter1_6 recovering checkpoint of tag S_U14X7X_00000001 with id stage_3_iter1_7 recovering checkpoint of tag S_U14X7X_00000001 with id stage_3_iter1_8 recovering checkpoint of tag S_U14X7X_00000001 with id stage_3_iter1_9 recovering checkpoint of tag S_U14X7X_00000001 with id stage_3_iter1_10 recovering checkpoint of tag S_U14X7X_00000001 with id stage4_kk_1 and it goes on and on, repeating "recovering checkpoint of tag S_U14X7X_00000001" as the central theme. |
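When stderr fills up with lines like the ones above, a quick tally can make the pattern easier to read. The following is only a sketch: the regular expression is inferred from the message format shown in this post, and `summarize_checkpoints` is a hypothetical helper, not part of BOINC or Rosetta.

```python
import re
from collections import Counter

# Pattern inferred from the stderr lines quoted above (an assumption,
# not a documented log format).
PATTERN = re.compile(r"recovering checkpoint of tag (\S+) with id (\S+)")


def summarize_checkpoints(stderr_text):
    """Count 'recovering checkpoint' messages per tag and keep ids in order seen."""
    counts = Counter()
    ids = {}
    for tag, ckpt_id in PATTERN.findall(stderr_text):
        counts[tag] += 1
        ids.setdefault(tag, []).append(ckpt_id)
    return counts, ids


# A short excerpt of the output quoted in the post.
sample = (
    "recovering checkpoint of tag S_U14X7X_00000001 with id abrelax_rg_state "
    "recovering checkpoint of tag S_U14X7X_00000001 with id stage_1 "
    "recovering checkpoint of tag S_U14X7X_00000001 with id stage_2 "
    "# cpu_run_time_pref: 21600 "
    "recovering checkpoint of tag S_U14X7X_00000001 with id stage_3_iter1_1"
)

counts, ids = summarize_checkpoints(sample)
for tag, n in counts.items():
    print(tag, n, ids[tag])
```

Pasting the full stderr text in place of `sample` would show how many checkpoint ids were recovered for each tag.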
P . P . L . Send message Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Hi. I got this one overnight; it ran for 3 hrs 47 min, then errored. h001b_BOINC_ABRELAX_RANGE_yebf_IGNORE_THE_REST-S25-8-S3-3--h001b-_4769_1442_0 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=189422310 <core_client_version>6.2.14</core_client_version> <![CDATA[ <message> process exited with code 1 (0x1, -255) </message> <stderr_txt> ERROR: NANs occured in hbonding! ERROR:: Exit from: src/core/scoring/hbonds/hbonds_geom.cc line: 763 called boinc_finish pete. |
(_KoDAk_) Send message Joined: 18 Jul 06 Posts: 109 Credit: 1,859,263 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=206369194 https://boinc.bakerlab.org/rosetta/result.php?resultid=205975695 ===================== https://boinc.bakerlab.org/rosetta/result.php?resultid=206340374 Validate state: Valid. Claimed credit: 269.214089450125. Granted credit: 81.1915060162041. WTF ?????? |
(_KoDAk_) Send message Joined: 18 Jul 06 Posts: 109 Credit: 1,859,263 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=207312755 https://boinc.bakerlab.org/rosetta/result.php?resultid=207413491 https://boinc.bakerlab.org/rosetta/result.php?resultid=207413498 https://boinc.bakerlab.org/rosetta/result.php?resultid=207417625 https://boinc.bakerlab.org/rosetta/result.php?resultid=207417646 |
robertmiles Send message Joined: 16 Jun 08 Posts: 1233 Credit: 14,324,975 RAC: 3,637 |
===================== That claimed to granted credit ratio is what typically happens when you return a workunit that had a serious underestimate of the amount of CPU time it needed to run. |
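To put a number on the claimed-to-granted ratio being discussed, here is a small sketch using the two credit values quoted earlier in the thread. The variable names are illustrative; nothing here reflects how the project actually computes credit.

```python
# Credit values as quoted earlier in this thread.
claimed_credit = 269.214089450125
granted_credit = 81.1915060162041

# How many times larger the claim was than the grant.
ratio = claimed_credit / granted_credit

# Share of the claimed credit that was actually granted.
granted_fraction = granted_credit / claimed_credit

print(f"claimed/granted ratio: {ratio:.2f}")
print(f"granted share of claim: {granted_fraction:.1%}")
```

In other words, the task was granted roughly three tenths of what it claimed.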
Mike Tyka Send message Joined: 20 Oct 05 Posts: 96 Credit: 2,190 RAC: 0 |
Hello all, I tried re-running this here locally in the lab and it runs just fine, so I'm not sure what went wrong there, I'm afraid :( Thanks for posting anyway! http://beautifulproteins.blogspot.com/ http://www.miketyka.com/ |