Message boards : Number crunching : minirosetta v1.19 bug thread
Author | Message |
---|---|
David Emigh Joined: 13 Mar 06 Posts: 158 Credit: 417,178 RAC: 0 |
I've got a "Compute Error" on 76,000+ seconds of CPU time. Reason: Access Violation (0xc0000005) at address 0x005C3051, write attempt to address 0x00000024. There is a large and detailed debugger report. Rosie, Rosie, she's our gal, If she can't do it, no one shall! |
Betting Slip Joined: 26 Sep 05 Posts: 71 Credit: 5,702,246 RAC: 0 |
Thanks Rom and sorry for being a bit short with you. Sometimes wonder where all this irritability comes from. I sometimes long for a slower pace of life LOL |
M.L. Joined: 21 Nov 06 Posts: 182 Credit: 180,462 RAC: 0 |
Task ID: 162368970; Name: SSPAIR_MIN_ABINITIO_1fna_3115_6915_2; Workunit: 145958649; Created: 10 May 2008 22:29:48 UTC; Sent: 10 May 2008 22:30:26 UTC; Received: 11 May 2008 15:18:04 UTC; Server state: Over; Outcome: Client error; Client state: Compute error; Exit status: 1 (0x1); Computer ID: 735230; Report deadline: 20 May 2008 22:30:26 UTC; CPU time: 0; stderr out: <core_client_version>5.10.45</core_client_version> <![CDATA[ <message> Incorrect function. (0x1) - exit code 1 (0x1) </message> <stderr_txt> ERROR: Option matching -fudge not found in command line top-level context </stderr_txt> ]]>; Validate state: Invalid; Claimed credit: 0; Granted credit: 0; Application version: 1.19. Fudge is gooood, except in this case. |
BobCat13 Joined: 18 Jun 06 Posts: 4 Credit: 130,387 RAC: 0 |
Rom Walton wrote: In this particular case there isn't anything that any of us can do, I've passed the info on to the MiniRosetta devs. Basically MiniRosetta is a 32-bit process, and generally 32-bit processes are limited to 2GB of user-mode memory. MiniRosetta hit that limit and so when it asked for more the OS said NO, leading to the crash. I just errored out with this same problem: https://boinc.bakerlab.org/rosetta/result.php?resultid=162306731 Watching Process Explorer, the MiniRosetta application was constantly grabbing more memory. Both physical and virtual were increasing throughout the task's run. My preference was set to 24 hours, but it only made it to ~15.5 hours before reaching the 2GB limit. I then changed preferences to 2 hours and a task finished properly. It appears I will have to set preferences at 12 hours on this machine to avoid the 2GB limit. For people running a Core2 at 3GHz or higher, you may want to try setting preferences at 8 hours or less to see if that helps. |
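The run-time suggestion in the post above can be checked with some rough arithmetic. The sketch below assumes, purely for illustration, that memory use grows linearly from about zero and reaches the 2GB limit at roughly 15.5 hours; the function names and the linear model are assumptions, not anything measured on the poster's machine.

```python
# Rough projection of memory use versus run-time preference.
# ASSUMPTION: memory grows linearly from ~0 MB; real growth may not be linear.

LIMIT_MB = 2048  # 2GB user-mode limit for a 32-bit process, as quoted above


def growth_rate_mb_per_hour(hours_to_limit, start_mb=0.0, limit_mb=LIMIT_MB):
    """Average growth rate implied by hitting the limit after hours_to_limit."""
    return (limit_mb - start_mb) / hours_to_limit


def projected_mb(run_hours, rate, start_mb=0.0):
    """Memory use projected after run_hours at a constant growth rate."""
    return start_mb + rate * run_hours


rate = growth_rate_mb_per_hour(15.5)  # the task failed at ~15.5 hours
for pref_hours in (2, 8, 12, 24):
    mb = projected_mb(pref_hours, rate)
    status = "under" if mb < LIMIT_MB else "over"
    print(f"{pref_hours:>2} h preference: ~{mb:.0f} MB ({status} the 2GB limit)")
```

Under these assumptions a 12-hour preference stays below the limit with some headroom, while 24 hours does not, which is consistent with what the post reports.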
AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
I just had a couple of mtlr_test2_S.00000001.*_3238_1 WUs error out. https://boinc.bakerlab.org/rosetta/result.php?resultid=161862548 https://boinc.bakerlab.org/rosetta/result.php?resultid=161862513 In both cases the WU ran the normal length of time (16 hr), then printed a bunch of "can not open psipred_ss2 file tt" lines to stderr. The WUs ended up being marked "invalid". These WUs were on separate machines, both running Linux. |
David Emigh Joined: 13 Mar 06 Posts: 158 Credit: 417,178 RAC: 0 |
A second "Compute Error", this one on 85,000+ seconds of CPU time: Reason: Access Violation (0xc0000005) at address 0x005C3051, write attempt to address 0x00000024. There is a large and detailed debugger message. This error, and the one I reported earlier in this thread, have the same signature as the errors I was getting with mini 1.15, errors which crippled two stable and reliable crunchers until I discovered a workaround. The only difference now is that the mini 1.19 workunits take about twice as long to crash, resulting in twice as much wasted CPU time... Rosie, Rosie, she's our gal, If she can't do it, no one shall! |
radu Joined: 7 May 08 Posts: 4 Credit: 66,301 RAC: 0 |
More segfaults, on Linux running the 5.10.45 client. Apparently there was a problem with the network connection and the client kept trying to reconnect. All tasks whose results were about to be sent were marked with "compute error", for example: https://boinc.bakerlab.org/rosetta/result.php?resultid=162637592 I hope this helps. Output of dmesg: tg3: eth0: Link is down. minirosetta_1.1[1243]: segfault at ff5fbff8 rip 881dcb0 rsp ff5fbed8 error 6 minirosetta_1.1[1258]: segfault at ff3fbff8 rip 881dcb0 rsp ff3fbed8 error 6 Clocksource tsc unstable (delta = -116217092 ns) minirosetta_1.1[1348]: segfault at ff7fbff8 rip 881dcb0 rsp ff7fbed8 error 6 rosetta_beta_5.[1353]: segfault at ff1fbfe8 rip 8e8fe90 rsp ff1fbec8 error 6 minirosetta_1.1[1363]: segfault at ff7fbff8 rip 881dcb0 rsp ff7fbed8 error 6 rosetta_beta_5.[1367]: segfault at ff3fbfe8 rip 8e8fe90 rsp ff3fbec8 error 6 minirosetta_1.1[1375]: segfault at ff1fbff8 rip 881dcb0 rsp ff1fbed8 error 6 rosetta_beta_5.[1379]: segfault at ff5fbfe8 rip 8e8fe90 rsp ff5fbec8 error 6 minirosetta_1.1[1390]: segfault at ff3fbff8 rip 881dcb0 rsp ff3fbed8 error 6 rosetta_beta_5.[1395]: segfault at ff3fbfe8 rip 8e8fe90 rsp ff3fbec8 error 6 rosetta_beta_5.[1411]: segfault at ff3fbfe8 rip 8e8fe90 rsp ff3fbec8 error 6 rosetta_beta_5.[1407]: segfault at ff3fbfe8 rip 8e8fe90 rsp ff3fbec8 error 6 rosetta_beta_5.[1422]: segfault at ff5fbfe8 rip 8e8fe90 rsp ff5fbec8 error 6 rosetta_beta_5.[1426]: segfault at ff7fbfe8 rip 8e8fe90 rsp ff7fbec8 error 6 rosetta_beta_5.[1434]: segfault at ff7fbfe8 rip 8e8fe90 rsp ff7fbec8 error 6 rosetta_beta_5.[1440]: segfault at ff7fbfe8 rip 8e8fe90 rsp ff7fbec8 error 6 rosetta_beta_5.[1449]: segfault at ff3fbfe8 rip 8e8fe90 rsp ff3fbec8 error 6 rosetta_beta_5.[1454]: segfault at ff5fbfe8 rip 8e8fe90 rsp ff5fbec8 error 6 rosetta_beta_5.[1470]: segfault at ff1fbfe8 rip 8e8fe90 rsp ff1fbec8 error 6 rosetta_beta_5.[1463]: segfault at ff1fbfe8 rip 8e8fe90 rsp ff1fbec8 error 6 minirosetta_1.1[1486]: segfault at ff3fbff8 rip 881dcb0 rsp ff3fbed8 error 6 rosetta_beta_5.[1481]: segfault at ff7fbfe8 rip 8e8fe90 rsp ff7fbec8 error 6 minirosetta_1.1[1498]: segfault at ff1fbff8 rip 881dcb0 rsp ff1fbed8 error 6 rosetta_beta_5.[1503]: segfault at ff1fbfe8 rip 8e8fe90 rsp ff1fbec8 error 6 minirosetta_1.1[1509]: segfault at ff5fbff8 rip 881dcb0 rsp ff5fbed8 error 6 rosetta_beta_5.[1514]: segfault at ff3fbfe8 rip 8e8fe90 rsp ff3fbec8 error 6 minirosetta_1.1[1520]: segfault at ff3fbff8 rip 881dcb0 rsp ff3fbed8 error 6 rosetta_beta_5.[1526]: segfault at ff3fbfe8 rip 8e8fe90 rsp ff3fbec8 error 6 rosetta_beta_5.[1537]: segfault at ff1fbfe8 rip 8e8fe90 rsp ff1fbec8 error 6 rosetta_beta_5.[1543]: segfault at ff7fbfe8 rip 8e8fe90 rsp ff7fbec8 error 6 |
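A dmesg dump like the one above is easier to read once the segfault lines are grouped by process name and instruction pointer. The following is a minimal sketch in Python; the regular expression is written against the line format shown in this post only, and the helper name `summarize_segfaults` is made up for illustration.

```python
import re
from collections import Counter

# Matches kernel lines of the form seen in the post, e.g.
#   minirosetta_1.1[1243]: segfault at ff5fbff8 rip 881dcb0 rsp ff5fbed8 error 6
# ASSUMPTION: other kernel versions may format these lines differently.
SEGFAULT_RE = re.compile(
    r"(?P<proc>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>\d+)"
)


def summarize_segfaults(dmesg_text):
    """Count segfaults per (process name, instruction pointer) pair."""
    counts = Counter()
    for match in SEGFAULT_RE.finditer(dmesg_text):
        counts[(match.group("proc"), match.group("rip"))] += 1
    return counts


# A few lines copied from the dmesg output above.
sample = """\
tg3: eth0: Link is down.
minirosetta_1.1[1243]: segfault at ff5fbff8 rip 881dcb0 rsp ff5fbed8 error 6
minirosetta_1.1[1258]: segfault at ff3fbff8 rip 881dcb0 rsp ff3fbed8 error 6
Clocksource tsc unstable (delta = -116217092 ns)
rosetta_beta_5.[1353]: segfault at ff1fbfe8 rip 8e8fe90 rsp ff1fbec8 error 6
"""

for (proc, rip), n in sorted(summarize_segfaults(sample).items()):
    print(f"{proc} rip={rip}: {n} segfault(s)")
```

In the full log every minirosetta crash shares the rip value 881dcb0 and every rosetta_beta crash shares 8e8fe90, so each application appears to be failing at a single, repeatable location rather than at random.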
David Emigh Joined: 13 Mar 06 Posts: 158 Credit: 417,178 RAC: 0 |
A third "Compute Error", this one on 73,000+ seconds of CPU time. The system cannot find the path specified. (0x3) - exit code 3 (0x3) Rosie, Rosie, she's our gal, If she can't do it, no one shall! |
Jipsu Joined: 27 Jan 08 Posts: 10 Credit: 454,555 RAC: 0 |
|
Path7 Joined: 25 Aug 07 Posts: 128 Credit: 61,751 RAC: 0 |
Hello all, Running Ubuntu 7.10 x86 this Task ID: 162048556 has Outcome = Success, but a double message: <core_client_version>5.10.45</core_client_version> <![CDATA[ <stderr_txt> # cpu_run_time_pref: 14400 # cpu_run_time_pref: 14400 ====================================================== DONE :: 1 starting structures 13761.5 cpu seconds This process generated 4 decoys from 4 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down... called boinc_finish # cpu_run_time_pref: 14400 ====================================================== DONE :: 1 starting structures 16847.8 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down... called boinc_finish </stderr_txt> ]]> From Boinc I got this message: ma 12 mei 2008 01:33:52 CEST|rosetta@home|Task 1bkrA_BOINC_ABRELAX_IGNORE_THE_REST-S25-10-S3-11--1bkrA-_3181_3_1 exited with zero status but no 'finished' file ma 12 mei 2008 01:33:52 CEST|rosetta@home|If this happens repeatedly you may need to reset the project. Its total runtime was 16848.04 seconds. This WU errored before, running on Windows XP as Invalid. Have a nice day, Path7. |
BitSpit Joined: 5 Nov 05 Posts: 33 Credit: 4,147,344 RAC: 0 |
radu wrote: More segfaults,on linux running 5.10.45 client. That's usually caused by a known, unfixed BOINC flaw, not Rosetta. When BOINC is resolving a domain name, it blocks all other communication, including with running tasks. If that continues past 30 seconds, things start failing or crashing. The only known workaround is changing the DNS timeout. That's done in resolv.conf (usually located at /etc/resolv.conf) by adding the line: options timeout:2 That makes each attempt 2 seconds, with the default of 2 retries per DNS server. You can play with the options some based on your number of DNS servers, but try not to go over 25 seconds. |
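To see how the timeout, the retry count and the number of DNS servers combine, here is a small sketch that reads resolv.conf-style text and estimates a worst-case lookup time. It assumes the simple upper bound timeout × attempts × nameservers; the actual resolver may scale timeouts differently, and both function names are hypothetical. The defaults used (timeout 5, attempts 2) are the ones documented in the resolv.conf man page.

```python
def parse_resolv_conf(text):
    """Return (timeout, attempts, nameserver_count) from resolv.conf-style text."""
    timeout, attempts, servers = 5, 2, 0  # documented resolver defaults
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        fields = line.split()
        if fields[0] == "nameserver":
            servers += 1
        elif fields[0] == "options":
            for opt in fields[1:]:
                name, _, value = opt.partition(":")
                if name == "timeout" and value.isdigit():
                    timeout = int(value)
                elif name == "attempts" and value.isdigit():
                    attempts = int(value)
    return timeout, attempts, servers


def worst_case_dns_seconds(timeout, attempts, nameservers):
    """Rough upper bound: every attempt waits the full timeout on every server."""
    return timeout * attempts * nameservers


# Example file with two placeholder name servers and the suggested option.
sample = """\
# example only; the addresses are documentation placeholders
nameserver 192.0.2.1
nameserver 192.0.2.2
options timeout:2
"""

settings = parse_resolv_conf(sample)
print(settings, "->", worst_case_dns_seconds(*settings), "seconds worst case")
```

With the defaults and three name servers the same bound comes to 5 × 2 × 3 = 30 seconds, which is right at the point where the post says things start failing; with `options timeout:2` it drops to 12 seconds.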
glaesum Joined: 16 Oct 06 Posts: 21 Credit: 508,866 RAC: 11 |
I got a similar error message to AMD_is_logical's, except the task succeeded and validated, even on top of the error reporting with every work unit done on Win98. Note: the 'psipred' line only occurred three times = no. of decoys, hmmmm?? https://boinc.bakerlab.org/rosetta/result.php?resultid=162095619 Received 11 May 2008 2:20:03 UTC <core_client_version>5.10.30</core_client_version> <stderr_txt> AllocateAndInitializeSid Error 120 failed to create shared mem segment WARNING: Override of option -out:nstruct sets a different value can not open psipred_ss2 file tt # cpu_run_time_pref: 14400 can not open psipred_ss2 file tt can not open psipred_ss2 file tt ====================================================== DONE :: 1 starting structures 10977 cpu seconds This process generated 3 decoys from 3 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down... called boinc_finish </stderr_txt> (This work unit had been through BOINC v.6.2 with another user, where it failed to validate.) AMD_is_logical wrote: I just had a couple of mtlr_test2_S.00000001.*_3238_1 WUs error out. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Bitspit, is that the root cause of the "no heartbeat from client in 31 secs" msgs?? Do you have a link to the trac item for this? Rosetta Moderator: Mod.Sense |
Alan Roberts Joined: 7 Jun 06 Posts: 61 Credit: 6,901,926 RAC: 0 |
I've got a Mini 1.19 work unit with a duration of 13:27:06 (machine is set for 14hr target) that has consumed 33:17:05, with a progress of 0.000%. Is this a stuck job that should be aborted, or should I let it grind on in hopes of producing something? Thanks. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Alan Roberts wrote: Is this a stuck job that should be aborted, or should I let it grind on in hopes of producing something? I'd suggest you suspend it and resume it again, and if the progress % doesn't change within 5 min of going back to a "running" status, I'd abort it. Rosetta Moderator: Mod.Sense |
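The suspend-and-resume rule of thumb can be written down as a tiny decision function. This is only a sketch of the advice above, not a BOINC feature; the function name and parameters are hypothetical, and the progress values would have to be read off BOINC Manager by hand.

```python
def should_abort(progress_before, progress_after, seconds_running, grace_seconds=300):
    """Apply the 5-minute rule: abort only if progress has not moved after the grace period.

    progress_before: progress % noted just before suspending the task
    progress_after:  progress % read after the task has been running again
    seconds_running: time since the task returned to "running" status
    """
    if seconds_running < grace_seconds:
        return False  # too early to judge; keep waiting
    return progress_after <= progress_before


# The stuck task from the thread: still at 0.000% five minutes after resuming.
print(should_abort(0.0, 0.0, 300))   # True
# A task whose progress moved after resuming should be left alone.
print(should_abort(0.0, 0.5, 600))   # False
```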
MacDitch Joined: 1 Aug 06 Posts: 10 Credit: 206,444 RAC: 0 |
This computer errors on every Rosetta Mini work unit it gets - immediately! WU 1, WU 2, WU 3 & WU 4. I've literally just done this WU 5, and the messages in the manager were: 12/05/2008 18:09:02|rosetta@home|Starting fa_max_dis_9-1cei_-test_2008-5-6_3222_7535_0 12/05/2008 18:09:02|rosetta@home|Starting task fa_max_dis_9-1cei_-test_2008-5-6_3222_7535_0 using minirosetta version 119 12/05/2008 18:09:04|rosetta@home|Computation for task fa_max_dis_9-1cei_-test_2008-5-6_3222_7535_0 finished 12/05/2008 18:09:04|rosetta@home|Output file fa_max_dis_9-1cei_-test_2008-5-6_3222_7535_0_0 for task fa_max_dis_9-1cei_-test_2008-5-6_3222_7535_0 absent Note: The computer happily crunches on ~15 projects, has had no changes in weeks and does Rosetta Beta without problems... :? Any ideas out there? |
BitSpit Joined: 5 Nov 05 Posts: 33 Credit: 4,147,344 RAC: 0 |
Mod.Sense wrote: Bitspit, is that the root cause of the "no heartbeat from client in 31 secs" msgs?? Do you have a link to the trac item for this? Yes, that usually is the cause of it. I don't know if there's an official bug report on it. I do know it's a question that shows up in the BOINC forums every few months, where the explanation is given and they claim it would be too much effort to fix. |
Ingleside Joined: 25 Sep 05 Posts: 107 Credit: 1,514,472 RAC: 0 |
Mod.Sense wrote: Bitspit, is that the root cause of the "no heartbeat from client in 31 secs" msgs?? Do you have a link to the trac item for this? Not sure if there's a trac item for this, but waiting for a DNS lookup is definitely one of the reasons for "no heartbeat". If I remember correctly, I've seen this problem on Win2k: one time when the internet connection went down, it was spitting out "no heartbeat" all the time, while another machine running Win2003 just continued crunching even though it couldn't do a DNS lookup... Not sure, but during very heavy disk usage it's likely also possible to get a "no heartbeat". And, at least in my experience, each and every time any of the DVD players makes a nasty noise before spitting out "read-error", I'm getting a "no heartbeat" in BOINC... "I make so many mistakes. But then just think of all the mistakes I don't make, although I might." |
Alan Roberts Joined: 7 Jun 06 Posts: 61 Credit: 6,901,926 RAC: 0 |
Mod.Sense, thanks for the suggestion. It doesn't seem to have fixed anything for this machine, but it does produce an interesting result. According to BOINC Manager, this task is now suspended, and I see no CPU time accumulation within BOINC Manager. According to Windows Task Manager, minirosetta_1.1 is still grinding along, consuming CPU. When I resume the task the CPU time display within BOINC Manager catches up with what Windows Task Manager reports. I'm off to see if there is a later version of BOINC, but this work unit is looking like an abort. |
senatoralex85 Joined: 27 Sep 05 Posts: 66 Credit: 169,644 RAC: 0 |
Mod.Sense wrote: Bitspit, is that the root cause of the "no heartbeat from client in 31 secs" msgs?? Do you have a link to the trac item for this? David Baker has gotten this error on his own laptop. See here: https://boinc.bakerlab.org/rosetta/result.php?resultid=161624205 stderr out <core_client_version>5.4.9</core_client_version> <stderr_txt> No heartbeat from core client for 30 sec - exiting No heartbeat from core client for 30 sec - exiting FILE_LOCK::unlock(): close failed.: Bad file descriptor No heartbeat from core client for 30 sec - exiting No heartbeat from core client for 30 sec - exiting No heartbeat from core client for 30 sec - exiting Too many restarts with no progress. Keep application in memory while preempted. ====================================================== DONE :: 1 starting structures 131.476 cpu seconds This process generated 0 decoys from 0 attempts **Edit** Added Error Log results! |
©2024 University of Washington
https://www.bakerlab.org