Problems and Technical Issues with Rosetta@home

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1751
Credit: 18,534,891
RAC: 857
Message 109300 - Posted: 27 May 2024, 9:24:39 UTC - in response to Message 109299.  

When running more than one project, no cache is best. Less chance of deadline issues.
Preferences, Computing Preferences, Other,
Store at least            0.1 days of work
Store up to an additional 0.01 days of work

Yes, this is the default, isn't it?
Nope.
No idea what the defaults are, but it certainly isn't those values- if it were, people wouldn't have nearly as many problems as they do.
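
For reference, if you'd rather set it in a file than through the website, those two values map (as far as I know) to the work buffer entries in global_prefs_override.xml in the BOINC data directory:

<global_preferences>
   <work_buf_min_days>0.1</work_buf_min_days>
   <work_buf_additional_days>0.01</work_buf_additional_days>
</global_preferences>

Have the client re-read local prefs (or just restart it) and it will use these instead of the website values.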



Run Memtest on the system to see if there is an issue with the memory,
I'm running memtester on 1GB since yesterday. I don't think it covers much but it's a start, I guess.
If you are running Windows 7 or later, you can just use its Memory Diagnostic tool. It doesn't work the memory as hard, so it doesn't take as long, but it will still show up dodgy memory. If the fault is only borderline it may not pick up a problem; in that case, use the F1 key to change the default test options to a more thorough test, which will take longer.
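
For anyone following along at the command line, the two tests look something like this (pick whatever size you can spare - memtester locks that much RAM while it runs):

# Linux: test 4 GB of RAM for 3 passes (run as root so it can lock the memory)
sudo memtester 4G 3

# Windows 7 or later: schedule the built-in Memory Diagnostic, which runs on the next reboot
mdsched.exe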



But you're saying because I didn't do parts 2 to 4 it's bad for the project?
Sorry, but I've got no idea what it is you're asking about there.



Also check your completed Valid Tasks and compare the Run time to the CPU time. If there's more than a few minutes' difference, you're using your system a bit; if there's 30 min or so, you're using it a lot.
If the difference is hours or more, you or something else on the computer is making heavy use of your CPU's time.

I can at least check the running WUs. Are you saying that if the difference is too big I shouldn't crunch at all?
No, just that you should find out what else is chewing up your CPU time.
Taking 3 hrs 10 min to do 3 hrs of work isn't an issue. But if you're taking 9 hrs or more to do only 3 hrs worth of work, it's really something you should look into.
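
To put a number on it, the thing to look at is just CPU time divided by Run time - the task's CPU efficiency. A quick throwaway sketch (the times are the made-up examples from above, not anyone's real tasks):

def cpu_efficiency(cpu_time_s, run_time_s):
    # fraction of the task's wall-clock Run time spent doing its own CPU work
    return cpu_time_s / run_time_s

# 3 h of CPU time in 3 h 10 min of Run time: ~95%, nothing to worry about
print(f"{cpu_efficiency(3 * 3600, 3 * 3600 + 10 * 60):.0%}")
# 3 h of CPU time in 9 h of Run time: ~33%, something else is eating the CPU
print(f"{cpu_efficiency(3 * 3600, 9 * 3600):.0%}")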





Jean-David Beyer wrote:
it was pretty easy to find which module it was
You mean by turning the computer off, pulling a module, turning the computer on, turning it off again etc.?
Nope.
Turn the computer off, remove all but one module, then power back up & test that module. If it's faulty- job done. If not, power down, pull that module, fit another one. Test again. Etc, etc.
On systems with huge amounts of RAM & multiple modules, testing one at a time lets you do other things in between testing modules if you have to- otherwise all you can do is start the test & then wait for it to finish, hours (or days) later.
Grant
Darwin NT
ID: 109300

Jean-David Beyer
Joined: 2 Nov 05
Posts: 201
Credit: 6,765,644
RAC: 6,588
Message 109301 - Posted: 27 May 2024, 13:42:04 UTC - in response to Message 109300.  

Jean-David Beyer wrote:

it was pretty easy to find which module it was

You mean by turning the computer off, pulling a module, turning the computer on, turning it off again etc.?
Nope.
Turn the computer off, remove all but one module, then power back up & test that module. If it's faulty- job done. If not, power down, pull that module, fit another one. Test again. Etc, etc.


I could do it a little more efficiently than that. It ran with 4 modules for many months; the problem occurred when I added 4 new modules. So I took out all 4 new modules and the problem went away. I put in two of the new modules and still had no problems. I moved those two new modules to the other two memory slots (was it a slot problem or a module problem?) and it still worked. So I put another new module in and it still worked, which meant it was probably the last new module. And so it proved.
ID: 109301

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 2025
Credit: 9,943,884
RAC: 6,777
Message 109310 - Posted: 29 May 2024, 14:55:07 UTC

Some daemons are down...so, no validation
ID: 109310

Sid Celery
Joined: 11 Feb 08
Posts: 2185
Credit: 41,726,991
RAC: 6,784
Message 109311 - Posted: 30 May 2024, 0:41:55 UTC - in response to Message 109196.  

Hello. I'm back :)
Now, where were we...
I have to tell you, I'm absolutely amazed that you think Boinc scheduling being wrong by 50% one way or 2-300% the other way for the bulk of the time a task is processing - and 100% of the time it's sitting waiting in the cache - is no kind of problem,
And I am absolutely amazed & astounded that you would think something that at no stage have I ever said or suggested.

Nowhere have I said it is not a problem.
What I have said is that it is not as big a problem as you make it out to be. What I have said is that it is not the root cause of the High Priority issues. It contributes to it, but it is not the cause.
How on earth do you turn "it is not as big a problem as you make it out to be" into "is no kind of a problem?"
Seriously? How on earth can you think that???

Well, it started when I talked about reducing target runtime from 12 to 8hrs, thereby reducing wallclock time of running tasks by 7-11hrs each and reducing wallclock time of all cached tasks by 7-11hrs each as well, and you said 'all it would do is reduce tasks tripping into panic mode', as if that wasn't the pragmatic solution.
Taking 14-22hrs out of runtime goes a long way - in all likelihood all the way - to prevent deadlines being missed and panic mode being tripped without changing anything else, while running all the projects the user wanted to run.

By saying "all it would do..." you're saying it's not a solution, when it might be the entire solution to missed deadlines and panic mode.

It is a problem for Scheduling.
But as I keep on repeating because you don't appear to be listening, it's not the cause of the High Priority issue. It's a contributing factor, but not the cause. The cause is the huge discrepancy between CPU time and Run time.

The entire problem is one of scheduling and the failure to meet deadlines.
Everything else you talk about is purely academic and entirely irrelevant to the user if deadlines are met and all CPU time is maximised for projects important to the user. Which they are.
It's an issue for the computer to solve, which it's perfectly capable of and it will always do for itself better than any attempt to micromanage it with other settings.
ID: 109311

Sid Celery
Joined: 11 Feb 08
Posts: 2185
Credit: 41,726,991
RAC: 6,784
Message 109312 - Posted: 30 May 2024, 0:55:00 UTC - in response to Message 109197.  

But for anyone else that's been reading these posts...

If you limit the number of cores/threads available to BOINC, you will maximise your BOINC processing. You will get the maximum possible amount of work done each day that your system is capable of, you won't have issues with deadlines (unless of course you have inappropriate cache settings), or Panic Mode or any of those types of issues.

Well, that's obviously not true.
If you limit the number of cores available to Boinc, your Boinc processing will be limited to the maximum # of cores you've allocated, while your unallocated cores won't be used for Boinc and may or may not be fully utilised, depending what else is going on.
Use all your cores all the time. Your computer will decide millions of times per second what it should do with its capability better than any human ever will.
Do not listen to the man behind the curtain.
ID: 109312

Sid Celery
Joined: 11 Feb 08
Posts: 2185
Credit: 41,726,991
RAC: 6,784
Message 109313 - Posted: 30 May 2024, 1:10:28 UTC - in response to Message 109310.  
Last modified: 30 May 2024, 1:22:13 UTC

Some daemons are down... so, no validation

They were, probably for about 16hrs today.
I think it's now fixed.
175k backlog when I looked earlier, now below 100k.

Edit: Server status page says there's still a 96k backlog, but it's not a live figure. Checking all my hosts, all tasks have been validated for each of them.
ID: 109313

Bryn Mawr
Joined: 26 Dec 18
Posts: 411
Credit: 12,359,416
RAC: 3,742
Message 109314 - Posted: 30 May 2024, 5:45:44 UTC - in response to Message 109312.  

But for anyone else that's been reading these posts...

If you limit the number of cores/threads available to BOINC, you will maximise your BOINC processing. You will get the maximum possible amount of work done each day that your system is capable of, you won't have issues with deadlines (unless of course you have inappropriate cache settings), or Panic Mode or any of those types of issues.

Well, that's obviously not true.
If you limit the number of cores available to Boinc, your Boinc processing will be limited to the maximum # of cores you've allocated, while your unallocated cores won't be used for Boinc and may or may not be fully utilised, depending what else is going on.
Use all your cores all the time. Your computer will decide millions of times per second what it should do with its capability better than any human ever will.
Do not listen to the man behind the curtain.


One thing I've noticed is that my Ryzens appear to be power limited: the TDP is 65 W and the PPT comes out at 88 W, so with all 24 cores running each core is getting about 3.67 W, but with only 20 cores running each core gets about 4.4 W and the power draw is still 88 W, with the cores running at a higher frequency.
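
The arithmetic, for anyone who wants to try their own core counts (the 88 W PPT is just what my boards report, so treat it as an example):

ppt_watts = 88
for active_cores in (24, 23, 20):
    print(active_cores, "cores ->", round(ppt_watts / active_cores, 2), "W per core")
# 24 cores -> 3.67 W, 23 -> 3.83 W, 20 -> 4.4 W per core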

Also, running 23 cores allows the OS to have its two pennies' worth without having to swap out the data for a running WU and then swap it back in again, so the WUs can run closer to 100% rather than the 97/98% they get when running 24 cores.

That being said, I always run at 24 cores and let the computer sort itself out.
ID: 109314

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1751
Credit: 18,534,891
RAC: 857
Message 109315 - Posted: 30 May 2024, 5:51:22 UTC - in response to Message 109311.  

Taking 14-22hrs out of runtime goes a long way - in all likelihood all the way - to prevent deadlines being missed and panic mode being tripped without changing anything else, while running all the projects the user wanted to run.
Which is exactly what my advice does- it reduces the Runtime for each and every Task, for all projects that the person does. It doesn't just do it for one Project, but for all of them.

As I said before & I will say again - your suggestion addresses the symptom, mine fixes what is actually causing the problem.
Most people prefer to fix the problem. If you're ok with just fixing the symptom, then so be it.




But for anyone else that's been reading these posts...

If you limit the number of cores/threads available to BOINC, you will maximise your BOINC processing. You will get the maximum possible amount of work done each day that your system is capable of, you won't have issues with deadlines (unless of course you have inappropriate cache settings), or Panic Mode or any of those types of issues.
Well, that's obviously not true.
It's obviously true if you actually understand what is going on.
If you don't understand, then it's not going to be obvious.

One final attempt to point out the obvious-
People doing GPU processing have known this for over a decade. If a GPU application requires a CPU core/thread to keep it fed, then by giving up the output of that core/thread running the CPU application in order to support each Task running on the GPU, you can get 10-20 times more work done from the GPU, and you don't reduce your CPU output by a large amount because the Tasks aren't all fighting for CPU processing time that just isn't available.
If you don't reserve that core/thread, your GPU output is way less than it could be, and your CPU output takes a massive dive as well.


For the CPU-
Needing only 3 hours to do 3 hours worth of work means you will do way more work than if it takes you 12 hours to do 3 hours worth of work- it's that simple.
By losing the output of 1 thread, you end up doing 4 times the amount of work on each of the remaining cores/threads.
So the amount of work done each day by not using all the cores/threads is many, many times greater than the amount of work done if you try to use all cores/threads for BOINC work on a system that is also doing large amounts of other CPU intensive work.
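
To make that concrete with invented-but-illustrative numbers (an 8-thread box where heavy non-BOINC load leaves each BOINC task only ~25% of a core):

threads = 8
oversubscribed = threads * 0.25      # all 8 threads to BOINC, each ~25% efficient -> ~2 core-hours per hour
reserved = (threads - 1) * 1.0       # 1 thread reserved, the other 7 near 100% -> ~7 core-hours per hour
print(oversubscribed, reserved)      # giving up one thread more than triples the daily BOINC output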

So the statement you claim is not true, is true & factual, as evidenced by the output of thousands of computers over many years of crunching.
Grant
Darwin NT
ID: 109315

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1751
Credit: 18,534,891
RAC: 857
Message 109316 - Posted: 30 May 2024, 6:02:10 UTC - in response to Message 109314.  

Also, running 23 cores allows the os to have its two pennies worth without having to swap out the data for a running WU and then swap it back in again, the WUs can run closer to 100% than the 97/98% they get when running 24 cores.

That being said, I always run at 24 cores and let the computer sort itself out.
Unfortunately this will just muddy the waters for those that don't understand the issue of an over-committed system.

What I've been talking about is systems that are doing a lot of non-BOINC work while BOINC is running. Hence the massive difference between Run time & CPU time- for the person that started all this with Denis it was 4 times as long. 4 times...


For a system that is lightly used, or just a dedicated cruncher, then using all cores & threads all the time will result in the greatest amount of work being done each day.
But if it's heavily used for other things, reserving a thread or 2 will result in a massive increase in output of BOINC work, even with the loss of BOINC output from that thread/ those threads.
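
For anyone wanting to do that, the knob is Computing preferences, "Use at most N % of the CPUs". As far as I know the same thing can also be put in global_prefs_override.xml, e.g. about 96% leaves one of 24 threads free:

<global_preferences>
   <max_ncpus_pct>96</max_ncpus_pct>
</global_preferences>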
Grant
Darwin NT
ID: 109316

Bill F
Joined: 29 Jan 08
Posts: 50
Credit: 1,656,004
RAC: 1,491
Message 109327 - Posted: 31 May 2024, 19:20:51 UTC

You two guys are having too much fun to be doing this by yourselves. Since we have lots of users with different configurations, I figured that I would help muddy the waters a little more.

If you are running Windows 10 or newer, some GPUs can do more of their own scheduling without as much CPU involvement... please see below; this is mostly old news.

-----------------------------------

On Windows 10 version 2004, Hardware-accelerated GPU scheduling was added, provided your video card supported it and you had NVIDIA driver 451.48 or AMD driver 20.5.1 Beta (or newer) installed. The theory being that it offloads from the CPU the GPU scheduling that the GPU can do for itself.

You can Google "Hardware accelerated GPU scheduling"

https://www.howtogeek.com/756935/how-to-enable-hardware-accelerated-gpu-scheduling-in-windows-11/
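
The article goes through the Settings app, but if memory serves the same switch is just a registry value (needs a reboot to take effect), something along these lines:

reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v HwSchMode /t REG_DWORD /d 2 /f

where 2 turns hardware-accelerated GPU scheduling on and 1 turns it off.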

Have Fun
Bill F
ID: 109327

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1751
Credit: 18,534,891
RAC: 857
Message 109329 - Posted: 31 May 2024, 22:48:14 UTC - in response to Message 109327.  

You two guys are having too much fun to be doing this by yourselves. Since we have lots of users with different configurations, I figured that I would help muddy the waters a little more.
Yep, bringing up something that is only tangentially relevant to the discussion at best certainly doesn't help in the slightest.

Here we are talking about Scheduling work between different BOINC applications & sharing time with non-BOINC applications.
The link you posted to is about Operating System scheduling. The first line of that article-
Windows 10 and Windows 11 come with an advanced setting, called Hardware-Accelerated GPU Scheduling, which can boost gaming and video performance using your PC's GPU.
It's there to boost your Video card's performance.

If you're playing a CPU limited game while running BOINC work in the background, it might provide some very slight benefit, if any. Limiting the number of cores/threads BOINC can use so it doesn't compete with the game would provide much, much more benefit.
Using the BOINC settings to suspend BOINC while gaming would be better still.
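
For the record, BOINC's cc_config.xml has an option for exactly that - list the game's executable as an exclusive application and the client suspends its computing whenever that program is running. Something like this (the exe name is just a placeholder):

<cc_config>
   <options>
      <exclusive_app>your_game.exe</exclusive_app>
   </options>
</cc_config>

Re-read the config file (or restart the client) for it to take effect.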



The whole idea behind BOINC was to make use of unused CPU resources, not to try to use them even while they're being used heavily by other applications.
As I have been pointing out over & over again in this discussion, trying to do so results in BOINC not actually getting much work done. And as any gamer would tell you, you don't want anything else running in the background while gaming as it will impact your gaming experience (no matter how many cores/threads you have, although it probably wouldn't be that much of an issue with Threadrippers & greater; but then people get worked up over the difference between 200 frames per second and 203 frames per second...).
Grant
Darwin NT
ID: 109329

Sid Celery
Joined: 11 Feb 08
Posts: 2185
Credit: 41,726,991
RAC: 6,784
Message 109331 - Posted: 2 Jun 2024, 18:18:20 UTC - in response to Message 109315.  

Taking 14-22hrs out of runtime goes a long way - in all likelihood all the way - to prevent deadlines being missed and panic mode being tripped without changing anything else, while running all the projects the user wanted to run.
Which is exactly what my advice does- it reduces the Runtime for each and every Task, for all projects that the person does. It doesn't just do it for one Project, but for all of them.

As I said before & I will say again - your suggestion addresses the symptom, mine fixes what is actually causing the problem.
Most people prefer to fix the problem. If you're ok with just fixing the symptom, then so be it.

In this case, you're assuming the cause when there's no evidence, based on the symptom, that it's the one you describe.
A while back you rightly pointed out the symptom was one of scheduling. Solve the scheduling issue where Rosetta knowingly misleads Boinc, in the way I described - the end.
If you take a look at adrianxw's Rosetta tasks (where, to be fair, he only seems to be running Rosetta tasks at the moment, which mislead Boinc in the other direction), no deadlines are being missed any more, so Panic mode won't be arising, let alone deadlines being missed. Nor will they be if his tasks are all Beta ones.

You also ignore the fact that, with all cores usable to Boinc tasks, the Folding tasks are <additional> to those tasks.
I don't know how many Folding tasks run at a time - I assume it's one - so the inefficiency you see in Boinc tasks is entirely taken up by a 9th task running at normal priority on an 8-core machine.
How does that pan out? Neither of us knows for sure, but I'm going to suggest that almost all of that "inefficiency" disappears with the processing of a 9th task on an 8-core machine.

But for anyone else that's been reading these posts...

If you limit the number of cores/threads available to BOINC, you will maximise your BOINC processing. You will get the maximum possible amount of work done each day that your system is capable of, you won't have issues with deadlines (unless of course you have inappropriate cache settings), or Panic Mode or any of those types of issues.
Well, that's obviously not true.
It's obviously true if you actually understand what is going on.
If you don't understand, then it's not going to be obvious.

Having cut off the quote before the critical part of what I wrote about how the cores unutilised by Boinc are made use of, this isn't a statement on what I wrote but on what I explicitly didn't write, so it's worthless.

One final attempt to point out the obvious-

On a different situation no-one's talking about... irrelevant.
When someone raises that as their issue, bring it up again. Then it might have a point.
ID: 109331

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 2025
Credit: 9,943,884
RAC: 6,777
Message 109332 - Posted: 3 Jun 2024, 6:34:24 UTC

Still the same error, again and again

ERROR: Error in protocols::cyclic_peptide_predict::SimpleCycpepPredictpplication::set_up_n_to_c_cyclization_mover() function: residue 1 does not have a LOWER_CONNECT.
ERROR:: Exit from: src/protocols/cyclic_peptide_predict/SimpleCycpepPredictApplication.cc line: 2442
BOINC:: Error reading and gzipping output datafile: default.out
20:49:27 (8612): called boinc_finish(1)

ID: 109332

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1751
Credit: 18,534,891
RAC: 857
Message 109333 - Posted: 3 Jun 2024, 6:57:41 UTC - in response to Message 109331.  

...
time to end it.
I have tried my best to help you understand, but every point you make shows that you still don't understand what is happening, so it really is time for me to give up once and for all.
Grant
Darwin NT
ID: 109333

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1751
Credit: 18,534,891
RAC: 857
Message 109336 - Posted: 5 Jun 2024, 8:26:29 UTC
Last modified: 5 Jun 2024, 9:22:36 UTC

Server Status showing all green, but there's a backlog of 12,640 Tasks waiting on Validation at present.
Grant
Darwin NT
ID: 109336

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1751
Credit: 18,534,891
RAC: 857
Message 109337 - Posted: 5 Jun 2024, 9:22:14 UTC - in response to Message 109336.  
Last modified: 5 Jun 2024, 9:22:52 UTC

Server Status showing all green, but there's a backlog of 12,640 Tasks waiting on Validation at present.
Now up to 24,128, and the Server Status is showing several processes on boinc-process not running.
Seems to be nothing but recurring issues with that server lately.
Grant
Darwin NT
ID: 109337

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1751
Credit: 18,534,891
RAC: 857
Message 109338 - Posted: 5 Jun 2024, 10:45:39 UTC - in response to Message 109337.  
Last modified: 5 Jun 2024, 10:47:51 UTC

Server Status showing all green, but there's a backlog of 12,640 Tasks waiting on Validation at present.
Now up to 24,128, and the Server Status is showing several processes on boinc-process not running.
Seems to be nothing but recurring issues with that server lately.
Now all processes on boinc-process are down and Waiting for Validation is now up to 35,496.

Maybe it's gone down in sympathy with the ralph server over on Ralph. It's been down for 4-5 days now.
Grant
Darwin NT
ID: 109338

kotenok2000
Joined: 22 Feb 11
Posts: 276
Credit: 513,050
RAC: 161
Message 109339 - Posted: 5 Jun 2024, 10:48:34 UTC

Everything is running as of 5 Jun 2024, 10:16:46 UTC
ID: 109339

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1751
Credit: 18,534,891
RAC: 857
Message 109340 - Posted: 5 Jun 2024, 10:51:12 UTC - in response to Message 109339.  
Last modified: 5 Jun 2024, 10:54:37 UTC

Everything is running as of 5 Jun 2024, 10:16:46 UTC
10 minutes earlier everything on boinc-processes was dead.
And the same with the ralph server at Ralph, it's showing life again as well.


BTW- check the date time stamp- that's for the Task application data.

The server status data is this one- Remote daemon status as of 5 Jun 2024, 10:45:06 UTC
It would be good if these things were updated more often.
Grant
Darwin NT
ID: 109340

kotenok2000
Joined: 22 Feb 11
Posts: 276
Credit: 513,050
RAC: 161
Message 109341 - Posted: 5 Jun 2024, 10:52:16 UTC

They probably rebooted it.
ID: 109341