Major problems... multiple machines various errors + 100% complete lock down



Dougga

Joined: 27 Nov 06
Posts: 28
Credit: 5,248,050
RAC: 0
Message 53949 - Posted: 24 Jun 2008, 5:47:09 UTC

It seems that Rosetta is undergoing some growing pains. I live in Seattle on the same block as one of the programmers; I need to buy him a few beers to really hear what's going on. If you browse my machines you'll see problems all over the place. The biggest annoyance is the client locking up.

It seems that when a work unit is approaching its deadline, it shows as running at high priority, and to me that is a flag for trouble: if a unit is running at high priority, it locks up the client when it reaches 100%. I have one Intel Core 2 Quad and one Core 2 Duo, and both are showing this behavior. My overall productivity has taken a beating due to these irregularities.

In addition to this, I'm seeing lots of segmentation faults and miscellaneous programming errors. I'm thinking this is not machine-based but somehow related to the code in the application.
Greg_BE

Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 53952 - Posted: 24 Jun 2008, 7:40:34 UTC

Odd that you lock up in high priority.
I am running a ton of work at high priority because I got a ton of tasks scheduled for the same day with various deadlines, and I have never had any lock-up issues. I have a Core 2 Duo as well and am not seeing the problem you're describing, even though I'm pushing the CPU with an overclock.

Anyone else reading this have his problem?
dcdc

Joined: 3 Nov 05
Posts: 1832
Credit: 119,677,569
RAC: 10,479
Message 53953 - Posted: 24 Jun 2008, 7:52:43 UTC

Just want to make the distinction that 'high priority' in BOINC doesn't mean the thread is high priority from an operating-system point of view - the Rosetta thread always runs as a low-priority thread.
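
A minimal sketch of that distinction (assuming a POSIX system; this is illustrative, not BOINC's actual source): a science application can drop its own process to the lowest OS priority, so BOINC's "high priority" label only decides which task the client works on next, not how hard it competes with your other programs.

#include <sys/resource.h>   // setpriority, PRIO_PROCESS
#include <cstdio>

int main() {
    // Request the lowest scheduling priority (nice 19) for this process.
    // BOINC's deadline-driven "high priority" scheduling is a separate,
    // client-level decision and does not change this value.
    if (setpriority(PRIO_PROCESS, 0, 19) != 0) {
        std::perror("setpriority");
        return 1;
    }
    std::printf("crunching at idle OS priority\n");
    // ... the actual number crunching would happen here ...
    return 0;
}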
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 53955 - Posted: 24 Jun 2008, 8:21:13 UTC

Doug, can you give some specifics, like which work units are getting stuck? We had a bad batch of work units sent out last week; the task names started with "t405_". I just posted a news item about it and am working on a fix.

It was a pretty bad bug that caused the client to sometimes stall and sit idle. We didn't catch it on Ralph because the stalled jobs did not get reported back, so we had no information about their status. The successful jobs did get reported back, of course, so everything appeared okay on Ralph.
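
A hedged sketch of the monitoring gap described here (hypothetical, not the project's actual tooling): if you only look at the results that come back, a batch with silently stalled clients still looks healthy, so you also have to compare what was sent against what was reported.

#include <cstdio>

struct BatchStats {
    int sent;       // results dispatched to clients
    int reported;   // results that came back, whether success or error
};

// A batch looks suspicious when a large share of the dispatched work
// never reports back at all - the signature of clients stalling silently.
bool looks_stalled(const BatchStats& b, double missing_threshold = 0.2) {
    if (b.sent == 0) return false;
    double missing = 1.0 - static_cast<double>(b.reported) / b.sent;
    return missing > missing_threshold;
}

int main() {
    BatchStats t405{1000, 520};   // made-up numbers, purely for illustration
    std::printf("batch suspicious? %s\n", looks_stalled(t405) ? "yes" : "no");
    return 0;
}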

[B^S] thierry@home

Joined: 17 Sep 05
Posts: 182
Credit: 281,902
RAC: 0
Message 53964 - Posted: 24 Jun 2008, 18:44:24 UTC
Last modified: 24 Jun 2008, 18:45:04 UTC

Here's a "bad" WU with a Q9300:
t434_1_NMRREF_1_t434_1_T0434_2QPWA_2JV0_hybridIGNORE_THE_REST_truncated_4104_10212_0


<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
Fonction incorrecte. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# cpu_run_time_pref: 28800
# random seed: 2584433
ERROR:: Exit from: .refold.cc line: 338

</stderr_txt>
]]>
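
The "ERROR:: Exit from: .refold.cc line: 338" line suggests an internal error macro that prints the failing source file and line and then terminates with status 1, which BOINC then reports as "exit code 1 (0x1)". A hypothetical sketch of that pattern (not Rosetta's actual source):

#include <cstdio>
#include <cstdlib>

// Print the failing source location in the same style as the stderr above,
// then terminate with exit status 1.
#define EXIT_WITH_ERROR()                                                \
    do {                                                                 \
        std::fprintf(stderr, "ERROR:: Exit from: %s line: %d\n",         \
                     __FILE__, __LINE__);                                \
        std::exit(1);                                                    \
    } while (0)

int main() {
    bool refold_ok = false;   // stand-in for a failed internal check
    if (!refold_ok) EXIT_WITH_ERROR();
    return 0;
}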
Greg_BE

Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 53966 - Posted: 24 Jun 2008, 19:57:26 UTC

I'm curious about this one as well; it's on my to-do list for a few days from now.
Mine is 4104_7660_0, and it's the only one.
Vatsan

Joined: 19 Nov 05
Posts: 2
Credit: 6
RAC: 0
Message 53969 - Posted: 24 Jun 2008, 21:27:04 UTC - in response to Message 53966.  

I'm curious about this one as well; it's on my to-do list for a few days from now.
Mine is 4104_7660_0, and it's the only one.


This was my work unit. I am tracking down the problem. Here is my analysis based on a preliminary investigation:
There are two stages in refinement. The first stage is aggressive loop modeling in the regions that are unaligned with the template, and the second stage is full-atom relax. In full-atom relax, the full-chain structure (no broken loops) is backbone-perturbed, side-chain repacked, and minimized over a number of cycles.
However, it is possible that not all loops can be closed in the first stage. In that case, Rosetta will not do full-atom relax. If a loop is not fully closed at the end of the first stage, Rosetta should write out the broken-loop structure and exit. I suspect this is not happening cleanly, and that might be the problem.
There is a mechanism in Rosetta to stochastically extend the length of the defined loop region to try to close the loop. As a result, if it is a hard-to-close loop, extending the loop could close it.
For this WU, not all jobs failed: those that extended the loop went on to the second stage and completed successfully; those that did not extend the loop adequately failed.
The bottom line is that if the first stage, loop modeling, fails, the job should exit without an error, and that is not happening now. I am looking into it.
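
A hypothetical control-flow sketch of the intended behavior described above (not Rosetta's actual code): close the loops in stage one, optionally retrying with a stochastically extended loop region; if closure still fails, write out the broken-loop model and exit cleanly rather than with an error, and only otherwise go on to full-atom relax.

#include <cstdio>
#include <cstdlib>

// Stage 1: loop modeling over the regions unaligned with the template.
// Stubbed out for illustration; returns whether all loops closed.
bool close_loops(bool allow_extension) {
    // Extending the defined loop region makes a hard loop easier to close.
    return allow_extension;   // placeholder outcome, not a real algorithm
}

// Stage 2: full-atom relax (backbone perturbation, side-chain repacking,
// minimization over a number of cycles). Stubbed out for illustration.
void full_atom_relax() { std::puts("full-atom relax"); }

// Fallback: write the broken-loop structure so the job still produces output.
void write_broken_loop_structure() { std::puts("writing broken-loop model"); }

int main() {
    bool closed = close_loops(/*allow_extension=*/false);
    if (!closed) {
        // Retry with the stochastically extended loop definition.
        closed = close_loops(/*allow_extension=*/true);
    }
    if (!closed) {
        write_broken_loop_structure();
        return EXIT_SUCCESS;   // intended behavior: exit cleanly, no error code
    }
    full_atom_relax();
    return EXIT_SUCCESS;
}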
Sorry for all the trouble.



