Message boards : Number crunching : Recent problems SAVEing status
Author | Message |
---|---|
Pelle Send message Joined: 17 Sep 06 Posts: 4 Credit: 31,795 RAC: 0 |
Hello everybody, I'm only occasionally number-crunching for Rosetta, because I'm busy with other projects, too. I recently got an update for Rosetta, and ever since I've been running into serious problems with this application, which led me to cancel all recent tasks. There are two problems: (1) The state of the task is not saved. Whenever I stop the task via the BOINC interface, reboot, or for some other reason stop and restart BOINC, the task restarts from scratch as well. This makes it close to impossible for me to finish a task, because my computer is rarely on long enough to finish a task in one run. (2) Towards the end of the task, computation slows down significantly, such that I get to a near standstill at about 9 minutes left (I've waited at that point for about 3 hours without any considerable progress). Now, where should I turn with this kind of problem? Thanks everybody! Regards, Pelle. |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Hi Pelle, could you post this same thread over in the number crunching board? The more technically inclined people are there and can point you towards the right solution. I can tell you in general that there has been a lot of talk about how often a model is saved to the hard drive. The information I have seen says it is usually done after each model of that particular work unit is finished, or on occasion at some other point in its process. That the work unit goes back to the starting point again usually means the program did not complete enough work to create a save point. You should probably change your settings to do shorter run times (4 hours minimum is best) with Rosetta, since you say your computer is not on that long. But then again, what do you mean by that? Is it a laptop that shuts off after a while when it is on battery power, or are you installing and uninstalling applications that cause your system to reboot? Are you running behind any firewalls on your system? They could be causing issues as well. It is normal for the task status to slow down near the end of the run, as the program does not know how much longer it will take to complete the last model of the work unit. That it stalls out at 9 minutes for over 3 hours is very odd. I would think after 1 hour it would end. It is possible that the other projects you are attached to could be taking over as priority on your CPU when Rosetta gets to the end of a model. That is just my opinion. Again, some other people may have a different idea. I am not sure what the error code -197 (0xffffff3b) means, but it is something consistent with your aborting of the work unit. Perhaps you could try this: run 1 other project with Rosetta and set your run time to 4 hrs in your preferences in the setup part of your account online. See if you can complete 1 work unit of Rosetta in that time frame. Let it run and see if it stalls; if it does, just let it sit for the rest of the day and see if it aborts itself. 
Check the messages tab in BOINC Manager to see what kind of error messages show up there. Again, just let the Rosetta work units go for a run time of 4 hours, and do not do anything to them for 8 hours. If the work unit aborts itself, you should see an error in BOINC Manager in the status of the work unit in the tasks tab (status column), in the messages tab, and in your results listings online. These 3 places are where you can see the status and the end result of the task. Again, for better results, repost your question over in the number crunching board. greg_be |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I moved this thread from the Cafe. I wanted to extend Greg's comments. Greg, thanks for providing the basic framework of the reply. I didn't have time to address all the issues last night, so I was hoping someone would. "...it is usually done after each model of that particular work unit is finished or on occasion at some other point in its process." A checkpoint is ALWAYS saved at the end of a model. At least, one will be saved at the next interval that your "write to disk" setting allows (defaults to 1 min). So within 1 min of the end of the model, work is all saved. And since Rosetta does not write frequently, that setting generally does not delay Rosetta's use of the disk drive. "That the work unit goes back to the starting point again usually means the program did not complete enough work to create a save point." Correct. And recently, some larger proteins are being studied. These take longer to complete each model, some taking as long as 6 hours on my Pentium 4s. They are also running some types of work that are not able to take checkpoints in the middle of a model. "You should probably change your settings to do shorter (4 hours minimum is best) run times with Rosetta since you say your computer is not on that long." ...that won't help. You'll get the same tasks as with any other setting, it will still take just as long to complete a model, and the task will still only checkpoint when it is able to. The runtime preference mostly determines how many models you will complete before the work is reported back ahead of the deadline. "It is normal for the task status to slow down near the end of the run as the program does not know how much longer it will take to complete the last model of the work unit. That it stalls out at 9 minutes for over 3 hours is very odd. I would think after 1 hour it would end." Yes, the work itself does not slow down. But the estimate to completion does. 
The initial estimated completion time is just based on the runtime preference you've defined in your Rosetta Preferences. If that is 1 hr, then the estimated completion starts at 1 hr. If you happen to get one of those tasks that takes 6 hrs to complete the first model, it's going to have to run for the six hours before it can be reported back. It may or may not be able to take checkpoints during those 6 hrs. So, after 50 minutes, we have about 10 minutes estimated time remaining... but we're not done yet. In fact, the machine really doesn't know ahead of time that this one is going to take 6 hrs, and so it just makes time march very slooowly forward, getting exponentially slower and slower during that last 10 min. It was a way to show you it is still moving forward and running... and yet the client isn't really sure how much longer it's going to take. Rosetta's watchdog is prepared for machines like yours that are not running all the time. If a given task is ended without reaching a checkpoint, and then restarted again... 4 times, without saving any progress, it will terminate the task for you. The next task you download has a high likelihood of being one that runs models faster and checkpoints more frequently, and this will be a better fit for how you are using your machine. So, no manual intervention is required. The watchdog detects the specific tasks that are not running well for your situation, and ends them. It is then random as to what you get next, but the number of tasks with these very long runtimes per model is probably <10%. In fact, at some points in time, there won't be any like that in the work queue. So, your next task downloaded will tend to run more as you expected. Rosetta Moderator: Mod.Sense |
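
To make the checkpoint and watchdog behaviour described above easier to picture, here is a toy simulation in Python. It is only a sketch and not Rosetta's or BOINC's actual code: the function `run_task`, its parameters, and the exact way restarts are counted are assumptions made for illustration. It assumes a checkpoint is taken only at the end of each model, that work inside an unfinished model is lost when the machine shuts down, and that the watchdog ends the task after 4 consecutive sessions that saved no progress.

```python
def run_task(model_minutes, sessions, max_failed_restarts=4):
    """Toy model of end-of-model checkpointing plus a restart watchdog.

    model_minutes: minutes each model needs (hypothetical values).
    sessions: minutes the machine stays on in each session.
    Returns (status, models_completed).
    """
    completed = 0        # models finished and checkpointed so far
    failed_restarts = 0  # consecutive sessions that saved no progress

    for uptime in sessions:
        remaining = uptime
        progressed = False

        # Finish as many whole models as fit in this session.
        while completed < len(model_minutes) and remaining >= model_minutes[completed]:
            remaining -= model_minutes[completed]
            completed += 1     # checkpoint at the end of the model
            progressed = True

        if completed == len(model_minutes):
            return ("finished", completed)

        # Any partial work on the current model is lost at shutdown.
        if progressed:
            failed_restarts = 0
        else:
            failed_restarts += 1
            if failed_restarts >= max_failed_restarts:
                return ("ended by watchdog", completed)

    return ("still running", completed)


if __name__ == "__main__":
    # One 6-hour model on a machine that is only on 2 hours at a time.
    print(run_task([360], [120] * 10))
    # Two 30-minute models on a machine that is on 45 minutes at a time.
    print(run_task([30, 30], [45, 45]))
```

In this model a 6-hour task never reaches a checkpoint on a machine that runs 2 hours at a time, so it is ended after the fourth fruitless session, while a task with short models makes steady progress across restarts.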
©2024 University of Washington
https://www.bakerlab.org