Message boards : Number crunching : minirosetta v1.19 bug thread
Author | Message |
---|---|
Ingleside Send message Joined: 25 Sep 05 Posts: 107 Credit: 1,514,472 RAC: 0 |
Yes, that usually is the cause of it. I don't know if there's an official bug report on it. I do know it's a question that shows up in the BOINC forums every few months where the explanation is given and they claim it would be too much effort to fix. Adding a little more, there are at least 2 open Trac tickets about this, #113 and #336. "I make so many mistakes. But then just think of all the mistakes I don't make, although I might." |
Rom Walton (BOINC) Volunteer moderator Project developer Send message Joined: 17 Sep 05 Posts: 18 Credit: 40,071 RAC: 0 |
I'll throw in a bit more about the no heartbeat message. At least once per release cycle we try to resolve this issue; so far the attempts to resolve it have led to crashes within the core client. DNS resolution is done through libcurl, and using either libcurl's native async-dns solution or the c-ares library hasn't resolved the issue. We haven't found a way to reproduce this issue in a lab environment, and so we haven't been able to give the libcurl guys enough information to get it fixed. So until we can get more info to the libcurl guys who can then fix it, the no heartbeat message is better than a crash. ----- Rom My Blog |
Betting Slip Send message Joined: 26 Sep 05 Posts: 71 Credit: 5,702,246 RAC: 0 |
Finally got one to finish https://boinc.bakerlab.org/rosetta/workunit.php?wuid=148643026 It consumed 1,063MB of memory and similar VM. This was on a 12hr run. Bet if I had rebooted it would have failed. I watched the last 5% in task manager. The to completion time stopped at 9 mins 59 secs and the WU finished at 96.6% I then got a lot less credit than requested LOL Hope the result was worth it. :) |
[KWSN]John Galt 007 Send message Joined: 4 Aug 06 Posts: 6 Credit: 1,017,647 RAC: 0 |
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=145523230 Client errors on 2 machines, one of which is mine. 0.00 seconds, so no time lost. |
RiverboatSam Send message Joined: 9 Dec 05 Posts: 1 Credit: 59,080 RAC: 0 |
Suddenly this morning, Rosetta is using all my CPU resources. I am having to kill it in order to do any work. I need to figure out how to leave the project - I cannot have this happening. |
glaesum Send message Joined: 16 Oct 06 Posts: 21 Credit: 508,632 RAC: 0 |
error #161 (whatever that is) finally a wu failed, that's on top of the usual non-fatal 120 error: resultid=162869266 <core_client_version>5.10.30</core_client_version> <stderr_txt> AllocateAndInitializeSid Error 120 failed to create shared mem segment # cpu_run_time_pref: 14400 : BOINC :: Watchdog shutting down... called boinc_finish </stderr_txt> <message> <file_xfer_error> <file_name>rb_05_12_11631_20348_T0397_IGNORE_THE_REST_10_16_3247_49_0_0</file_name> <error_code>-161</error_code> </file_xfer_error> ]]> |
David Emigh Send message Joined: 13 Mar 06 Posts: 158 Credit: 417,178 RAC: 0 |
Error number 4, at 77,900+ CPU seconds. Reason: Access Violation (0xc0000005) at address 0x005C1E7C write attempt to address 0x00000024 Large and detailed debugger report available at the link, if anyone is reading those things at this point. The host that received the above error is 1/4 on mini 1.19 tasks that have a runtime preference in excess of 12 hours, but is 8/8 on mini 1.19 tasks with a runtime preference of 12 hours or less. Rosie, Rosie, she's our gal, If she can't do it, no one shall! |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Suddenly this morning, Rosetta is using all my CPU resources. I am having to kill it in order to do any work. I need to figure out how to leave the project - I cannot have this happening. Nothing has changed on the end of Rosetta suddenly this morning. It is designed to run at a low priority, so anything else your computer is working on is ahead in line for the CPU. You can configure BOINC to use a fraction of the CPU, or to only run at specific times of day. You can just go to the advanced view, then use the advanced pulldown menu, and select preferences to set these up for that specific machine. So, using all of your CPU is normal when you aren't doing anything else. And if it is causing any noticeable impact on your work, it is actually more likely an issue of how much memory is available than of the CPU being used. Rosetta Moderator: Mod.Sense |
Alan Roberts Send message Joined: 7 Jun 06 Posts: 61 Credit: 6,901,926 RAC: 0 |
I have just finished an "observing" session on a Windows 2K server where multiple Mini 1.19 tasks were not honoring suspend behavior. I'm allowed to run Rosetta jobs on this machine during off hours. When I examined the tasks within BOINC Manager they reported as suspended, and were not accumulating CPU time. Checking in Windows Task Manager showed Rosetta Mini merrily consuming CPU. When I toggled Activity with BOINC Manager from "Run based on preferences" to "Run always" I would see the CPU time within BOINC Manager "catch up" to that shown in Windows Task Manager. I aborted the first Mini job, and the second started and demonstrated the same behavior. Shut down the BOINC service (which did kill everything), and restarted. Problem continued. Shut down BOINC again, uninstalled and reinstalled BOINC (5.10.45) and restarted. Problem continued. Aborted the second Mini job, observed the problem with the third one, and aborted that job too. Now I've got Beta 5.96 tasks downloaded, and these are obeying suspend/resume flawlessly. Has anyone else seen this, and more importantly, if so, is there a fix? I have a collection of machines where I'm allowed to run Rosetta only during off hours ... I'll have to pull them out of action if I can't count on reliable time of day suspends. Alternatively, is there anything I can do to tell any machine exhibiting this behavior to avoid Mini jobs, since Beta 5.96 is behaving correctly? |
caesar1987 Send message Joined: 28 Nov 06 Posts: 13 Credit: 22,268 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=162352303 |
Venturini Dario[VENETO] Send message Joined: 25 May 07 Posts: 22 Credit: 245,028 RAC: 0 |
Validate error on a 84k+ seconds task (I'd say... rather annoying) https://boinc.bakerlab.org/rosetta/result.php?resultid=162388905 |
(_KoDAk_) Send message Joined: 18 Jul 06 Posts: 109 Credit: 1,859,263 RAC: 0 |
- Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x005C3051 write attempt to address 0x00000024 https://boinc.bakerlab.org/rosetta/result.php?resultid=162428256 - Unhandled Exception Record - Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x76A342EB \ + 1480Mb use of ram https://boinc.bakerlab.org/rosetta/result.php?resultid=162386305 - Unhandled Exception Record - Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x76A342EB https://boinc.bakerlab.org/rosetta/result.php?resultid=162246424 |
(_KoDAk_) Send message Joined: 18 Jul 06 Posts: 109 Credit: 1,859,263 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=161724764 https://boinc.bakerlab.org/rosetta/result.php?resultid=161544482 https://boinc.bakerlab.org/rosetta/result.php?resultid=161438499 https://boinc.bakerlab.org/rosetta/result.php?resultid=161028445 |
popandbob Send message Joined: 30 Oct 05 Posts: 4 Credit: 1,671,080 RAC: 125 |
3 errors... Error 1 <core_client_version>5.10.30</core_client_version> <![CDATA[ <message> Incorrect function. (0x1) - exit code 1 (0x1) </message> <stderr_txt> # cpu_run_time_pref: 36000 # cpu_run_time_pref: 36000 ERROR: Conformation: fold_tree nres should match size ERROR:: Exit from: ....srccoreconformationConformation.cc line: 192 called boinc_finish </stderr_txt> Error 2 <core_client_version>5.10.30</core_client_version> <![CDATA[ <message> Incorrect function. (0x1) - exit code 1 (0x1) </message> <stderr_txt> ERROR: unrecognized atom_type_name HOH ERROR:: Exit from: c:cygwinhomeboincboinc_buildminirosetta_1.19minisrccore/chemical/AtomTypeSet.hh line: 79 called boinc_finish </stderr_txt> ]]> Error 3 <core_client_version>5.10.30</core_client_version> <![CDATA[ <message> - exit code -1073741819 (0xc0000005) </message> <stderr_txt> # cpu_run_time_pref: 36000 Unhandled Exception Detected... - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x005C3030 write attempt to address 0x00000004 Engaging BOINC Windows Runtime Debugger... |
alpha Send message Joined: 4 Nov 06 Posts: 27 Credit: 1,550,107 RAC: 0 |
Access violation (exit code -1073741819 (0xc0000005)) after nearly 22,000 seconds: https://boinc.bakerlab.org/rosetta/result.php?resultid=162546878 |
Jipsu Send message Joined: 27 Jan 08 Posts: 10 Credit: 454,555 RAC: 0 |
I think the out of memory error is corrected already in minirosetta v1.2, which is going through testing at RALPH at the moment. 24h minirosetta v1.2 tasks are taking around 150M of memory, and the out of memory error in minirosetta v1.19 seems to exist only in the Windows version of the application. Just throwing my thoughts around, but I think it's pointless to post out of memory errors since the problem is already fixed in v1.2. |
David Emigh Send message Joined: 13 Mar 06 Posts: 158 Credit: 417,178 RAC: 0 |
{...} "Pointless" only for those who: 1) Participate in RALPH@home, 2) Have long runtime preferences, 3) Run Windows operating systems, and 4) Agree with the conclusion that the problem is solved. Rosie, Rosie, she's our gal, If she can't do it, no one shall! |
Venturini Dario[VENETO] Send message Joined: 25 May 07 Posts: 22 Credit: 245,028 RAC: 0 |
Validate error on a 84k+ seconds task (I'd say... rather annoying) And here's another one: https://boinc.bakerlab.org/rosetta/result.php?resultid=162547290 Both happened after a segfault error some hours before completion. |
Dr Who Fan Send message Joined: 28 May 06 Posts: 70 Credit: 268,055 RAC: 300 |
Anyone else seen this yet? I have a single incidence of minirosetta v1.19 using both "cores" of my Pentium 4 with Hyper Thread. It is not following the BOINC rules to use only 1 core/app/cpu. It is currently running: Task ID 164060225 Task Name h003__BOINC_ABRELAX_IGNORE_THE_REST-S25-5-S3-3--h003_-_3321_121_0 |
The_Bad_Penguin Send message Joined: 5 Jun 06 Posts: 2751 Credit: 4,271,025 RAC: 0 |
For those with inquiring minds: rb_05_17_11407_20379_tim23_IGNORE_THE_REST_06_10_3329_49 <core_client_version>5.10.30</core_client_version> <![CDATA[ <stderr_txt> ====================================================== DONE :: 1 starting structures 10440.8 cpu seconds This process generated 8 decoys from 8 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down... called boinc_finish </stderr_txt> ]]> Validate state Invalid Claimed credit 28.6254621435188 Granted credit 0 application version 1.19 and rb_05_17_11462_20386_CRF-BP_IGNORE_THE_REST_06_17_3330_30 <core_client_version>5.10.30</core_client_version> <![CDATA[ <stderr_txt> ====================================================== DONE :: 1 starting structures 10685.3 cpu seconds This process generated 6 decoys from 6 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down... called boinc_finish </stderr_txt> ]]> Validate state Invalid Claimed credit 29.2953911287149 Granted credit 0 application version 1.19 |
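
Several reports in this thread quote the same failure two ways: as exit code -1073741819 and as 0xc0000005 (Access Violation). They are the same 32-bit value, read once as a signed integer and once as unsigned hex. A minimal Python sketch of that conversion follows; the function name `to_status_hex` is made up for illustration and is not part of BOINC or Rosetta.

```python
def to_status_hex(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex.

    Hypothetical helper for illustration only; not a BOINC or Rosetta API.
    """
    # Masking with 0xFFFFFFFF keeps the low 32 bits, which turns a negative
    # two's-complement value into its unsigned equivalent.
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


if __name__ == "__main__":
    # Access violation exit code as reported in the posts above.
    print(to_status_hex(-1073741819))  # 0xc0000005
    # "Incorrect function" exit code as reported in the posts above.
    print(to_status_hex(1))  # 0x00000001
```

Masking rather than adding 2**32 keeps the helper correct for codes that are already non-negative, such as the exit code 1 cases.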
©2024 University of Washington
https://www.bakerlab.org