Waiting for memory, have 60 GB RAM and 30 cores available to Boinc.

rcollins0618
Joined: 11 Jan 08
Posts: 4
Credit: 902,441
RAC: 9,473
Message 109918 - Posted: 26 Oct 2024, 14:13:56 UTC
Last modified: 26 Oct 2024, 15:10:55 UTC

Hello,

I'm using BOINC for multiple projects, and it works great for all of them except Rosetta@home. Whenever Rosetta is running, I can't get more than 16 instances going at a time. I see one task says "Waiting for memory" now that the GPU has switched over to a GPUGRID project.

I'm using Ubuntu 24.04, and when I run free -h, it tells me I'm only using 10 GB of memory (buff/cache says 19Gi), with 31Gi free and 51Gi available. Am I doing something wrong here? Shouldn't I have more Rosetta instances running?

Searching the threads, I saw that this came up for someone else a long time ago. The suggestion was to uncheck "Leave non-GPU tasks in memory while suspended". Does that write unfinished work to disk, or forget it completely? I don't want to uncheck this if I have to start over at 0% when it switches between projects.

For clarity I have 64GB RAM, and a Ryzen 9 7950X3D (16 cores, 32 threads). Is it wise to use hyperthreading and run more than 16 instances of Rosetta anyway? I also have it set (I use local prefs on all machines) so that I use 93.75% of the CPU at all times (in use or not in use), and 93% of Memory at all times.

EDIT: I checked one of my other computers that has Rosetta running (4 cores/8 threads). It only has 16GB of memory, and it is running 7 instances of Rosetta right now with no problem (that machine is set to use 7 threads by percent in local prefs). Multiplying by 4 for both threads and memory, I should have at least 28 going right now, no?

Thanks,

Rich

rcollins0618
Joined: 11 Jan 08
Posts: 4
Credit: 902,441
RAC: 9,473
Message 109921 - Posted: 26 Oct 2024, 19:00:13 UTC - in response to Message 109918.  

* free says 31Gi

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1675
Credit: 17,738,985
RAC: 22,900
Message 109922 - Posted: 26 Oct 2024, 22:35:40 UTC
Last modified: 26 Oct 2024, 22:45:08 UTC

On the system that isn't using all the possible cores/threads, check that it is actually using local preferences. (I prefer to use web-based ones: set it once and every system uses the same preferences. I save local preferences for systems with issues of one sort or another.)
At the top of the dialogue box it will say "using local preferences. Click to use web prefs from ...."
Also check that the "use at most % of memory" boxes show the values you expect.
For me, I've set both to 95%.

With Rosetta some Tasks can use over 2.5GB of RAM, and we had a batch of work recently that was using anywhere from 1GB to 2.2GB of RAM per Task. The current batches of work are only using 500MB or less.


It was suggested to uncheck "Leave non-GPU tasks in memory while suspended". Does that write unfinished work to disk, or forget it completely?
Neither.
If you suspend a project, it is up to the application whether or not it will checkpoint the data at that point. The default is to checkpoint every 60 seconds or so, so at most you will lose 60 seconds of work, as long as the application supports checkpointing.
I personally don't keep suspended Tasks in RAM. 60 seconds isn't a big loss if the application doesn't checkpoint on suspending/exit.

EDIT- i know that some applications for Einstein will only checkpoint every few hours, and don't/can't checkpoint before suspending, so leaving Tasks in RAM is needed to avoid losing 1-4 hours of work.
Such Tasks should get shunted to the page file when more RAM is required by active Tasks, but i don't know how suspended Tasks appear to the OS- ie needed but inactive or active.

I'd check your RAM settings, then exit BOINC (assuming it's set to shut down all applications on exit), wait a bit for everything to exit cleanly, then restart BOINC.
If everything doesn't come up as expected, check the Event Log for the more detailed messages there as to how much RAM is actually available to BOINC, and how much it needs to run those "waiting for memory" Tasks.
Grant
Darwin NT

rcollins0618
Joined: 11 Jan 08
Posts: 4
Credit: 902,441
RAC: 9,473
Message 109924 - Posted: 27 Oct 2024, 0:17:52 UTC - in response to Message 109922.  

Yes. It says "Using local prefs. Click to use web prefs from..." With a "use web prefs" button available and not clicked.

So... I enabled mem_usage_debug in the diagnostic log flags selector screen, then ran sudo service boinc-client stop. Then I started it back up and ran boincmgr. From there, I looked at the Event Log.
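
For anyone who prefers a file over the Manager dialog, the same log flags can also be set in cc_config.xml in the BOINC data directory. The Python sketch below only builds and prints the XML. Where the file lives on a given install is an assumption to verify locally (on Ubuntu's boinc-client package it is usually under /etc/boinc-client/ or /var/lib/boinc-client/), and cpu_sched_debug is added here only because its messages help with this kind of problem.

```python
import xml.etree.ElementTree as ET


def build_cc_config(flags):
    """Return cc_config.xml text that switches on the given BOINC log flags."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        # Each log flag is an element whose text is 1 (on) or 0 (off).
        ET.SubElement(log_flags, name).text = "1"
    ET.indent(root)  # pretty-print; requires Python 3.9 or newer
    return ET.tostring(root, encoding="unicode")


# The flag list is an illustrative choice, not a recommendation from the thread.
config_text = build_cc_config(["mem_usage_debug", "cpu_sched_debug"])
print(config_text)
```

After saving the output as cc_config.xml, the client has to re-read its config files or be restarted, as was done here.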

"""
[mem usage] BOINC totals: WS 10016.54MB, smoothed 9999.somethingMB, swap 18564MB 0.00 page faults/sec
"""
Note: those values changed quicker than I could type, but they stay very close to the original value (only off by a few MB each time). It looks like it's using about 10 GiB of memory, just like "free -h" told me.
My memory usage is set to 93%, and earlier in the log after startup it showed 50-something GB free for memory. Even if it used only 93% of that, it would be way more than 10 GB.

Note: I do run Einstein@Home too, but i did uncheck the "Leave non-GPU tasks in memory while suspended" for the past few hours just to see if that would change anything. I did this a few hours before this restart and log check. I've got 17 Rosettas running right now.

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1675
Credit: 17,738,985
RAC: 22,900
Message 109925 - Posted: 27 Oct 2024, 0:50:37 UTC - in response to Message 109924.  
Last modified: 27 Oct 2024, 1:07:45 UTC

[mem usage] BOINC totals: WS 10016.54MB, smoothed 9999.somethingMB, swap 18564MB 0.00 page faults/sec
Probably easier to find without the memory debug flags, but check where it says how much RAM is in use for Rosetta, and how much Rosetta is requesting.

EDIT- in the memory settings, Page/swap file: use at most xx%
What value is that?
I've got mine set at 75%.

With all that available RAM, swap file usage shouldn't be an issue. But if one of the other projects is making use of it while Rosetta is trying to run it might cause problems (but i wouldn't expect an insufficient RAM issue message to be the result).


EDIT- I re-read your initial post, and it mentions only 1 Task waiting on RAM? In which case any other available Tasks should start running on the available cores/threads.
Re-check your event viewer log right at the start & check that it reports 32 processors when BOINC first starts up.



Even if they aren't being used by Rosetta, if you have enough RAM, enough swap file space, and their usage isn't restricted by the "Use at most xx% of the CPUs" setting, then all available cores/threads should be in use by BOINC.
Check that BOINC can see all 32 cores/threads. Check in the event log how much RAM Rosetta is needing, and how much more RAM BOINC thinks it requires to meet that need.
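
The event log check can be scripted. This is a minimal Python sketch that pulls the two numbers out of a scheduler "skipping" message, assuming the message has the shape quoted elsewhere in this thread. The task name example_task_0 is a made-up placeholder, and the regular expression is a guess at the format based on that one quoted line, not something taken from BOINC's source.

```python
import re

# Pattern for the scheduler's "skipping" message, inferred from one quoted line.
SKIP_RE = re.compile(
    r"skipping (?P<task>\S+): estimated WSS \((?P<wss>[\d.]+)MB\) "
    r"exceeds RAM left \((?P<left>[\d.]+)MB\)"
)


def parse_skip_line(line):
    """Return (task, estimated_wss_mb, ram_left_mb), or None if no match."""
    m = SKIP_RE.search(line)
    if m is None:
        return None
    return m["task"], float(m["wss"]), float(m["left"])


# Hypothetical task name; the two MB figures are the ones quoted in the thread.
sample = (
    "Rosetta@home | [cpu_sched_debug] skipping example_task_0: "
    "estimated WSS (3337.86MB) exceeds RAM left (2072.66MB)"
)
task, wss_mb, left_mb = parse_skip_line(sample)
print(f"{task}: wants {wss_mb} MB, scheduler has {left_mb} MB left")
```

Running it over every line of a saved log and collecting the non-None results shows how much the scheduler thinks each skipped Task needs and how much it believes is left.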
Grant
Darwin NT

rcollins0618
Joined: 11 Jan 08
Posts: 4
Credit: 902,441
RAC: 9,473
Message 109931 - Posted: 27 Oct 2024, 16:49:57 UTC - in response to Message 109925.  

Probably easier to find without the memory debug flags, but check where it says how much RAM is in use for Rosetta, and how much Rosetta is requesting.

from start of Debug log: "Memory: 61.76GB physical, 8.00 GB virtual"...
"max memory usage: 57.44 GB" <---It says this under both "when computer is in use" and "not in use".

EDIT- in the memory settings, Page/swap file: use at most xx%
What value is that?
I've got mine set at 75%.

I upped it to 75%, it was at 50.

Re-check your event viewer log right at the start & check that it reports 32 processors when BOINC first starts up.

"Processor: 32 AuthenticAMD AMD Ryzen 9 7950X3D 16-Core Processor [Family 25 Model 97 Stepping 2]"
"max CPUs used: 30"
"max CPUs used: 30"
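
These startup figures line up with the local preferences mentioned earlier in the thread (93.75% of CPUs, 93% of memory). A small Python check of that arithmetic follows. The truncation to a whole number of CPUs is an assumption about how the client rounds, not something confirmed here.

```python
def cpus_allowed(logical_cpus, cpu_pct):
    # Assumption: the client truncates to a whole number of CPUs, minimum 1.
    return max(1, int(logical_cpus * cpu_pct / 100))


def ram_allowed_gb(physical_gb, mem_pct):
    # Plain percentage of the physical memory reported at startup.
    return physical_gb * mem_pct / 100


cpus = cpus_allowed(32, 93.75)       # 32 logical processors from the log
ram_gb = ram_allowed_gb(61.76, 93)   # 61.76 GB physical from the log
print(cpus, round(ram_gb, 2))
```

The result matches the "max CPUs used: 30" and "max memory usage: 57.44 GB" lines, so the preferences themselves appear to be applied as intended.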

Check in the event log how much RAM Rosetta is needing, and how much more RAM BOINC thinks it requires to meet that need.

"[mem_usage] enforce: available RAM 58816.28MB swap 6553.60MB"
As you'll see in boinc-log2.txt (from when I enabled mem_usage_debug before a restart of the boinc-client), I think each Rosetta Beta 6.06 task is assumed to require about 3.3 GiB of memory, and BOINC wrongly stops short of running more because it thinks I only have 2072 MB of RAM left after scheduling 17 instances to run.
NOTE!: 17 * 3.3 is 56.1, and max mem usage is 57.44. I think BOINC thinks that each task is requesting 3300 MB of RAM when really they're only using about 125-500 MB each... I do see one reporting 976 MB, but it's the outlier.
In the log with mem_usage_debug enabled, I see a bunch of work units being skipped because it thinks I'm out of RAM:
"[Date&Time Stamp] | Rosetta@home | [cpu_sched_debug] skipping 8a_hal_t_hal_8aa[...rest of task name]: estimated WSS (3337.86MB) exceeds RAM left (2072.66MB)"
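
The numbers in these two log lines can be cross-checked with a few lines of Python. This is a deliberately simplified model in which every task is charged its full estimated working set. It is an assumption made for the arithmetic, not a description of BOINC's actual scheduler code.

```python
import math

available_mb = 58816.28      # "[mem_usage] enforce: available RAM" from the log
estimated_wss_mb = 3337.86   # "estimated WSS" from the skip message

# How many tasks fit if each one is charged the full estimate?
tasks_that_fit = math.floor(available_mb / estimated_wss_mb)
ram_left_mb = available_mb - tasks_that_fit * estimated_wss_mb

print(tasks_that_fit, round(ram_left_mb, 2))  # prints: 17 2072.66
```

Under that model exactly 17 tasks fit and 2072.66 MB is left over, which is the "RAM left" figure in the skip message. So the log is at least consistent with the estimate, rather than the measured usage, being counted against the memory limit for all 17 running tasks.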

Check my logs:
https://pastebin.com/QVJvLBxr == boinc-log.txt == boinc log as I have it configured as stated above.
https://pastebin.com/fSxFyDET == boinc-log2.txt == boinc log from startup to running (just like boinc-log.txt), but with mem_usage_debug enabled in the event log options.

Sid Celery
Joined: 11 Feb 08
Posts: 2119
Credit: 41,179,074
RAC: 11,480
Message 109934 - Posted: 27 Oct 2024, 22:08:57 UTC - in response to Message 109931.  

Probably easier to find without the memory debug flags, but check where it says how much RAM is in use for Rosetta, and how much Rosetta is requesting.

from start of Debug log: "Memory: 61.76GB physical, 8.00 GB virtual"...
"max memory usage: 57.44 GB" <---It says this under both "when computer is in use" and "not in use".

EDIT- in the memory settings, Page/swap file: use at most xx%
What value is that?
I've got mine set at 75%.

I upped it to 75%, it was at 50.

This should probably sort things out.
Fwiw I use 85% when the computer is in use and 95% when not in use on a 32 GB RAM 8C/16T machine.
Rosetta tasks seem to call for this 3337.86 MB when the task starts, even if they use much less while running, and if a task can't clear that hurdle then you get the problems you've been reporting.
No-one really knows why.
Because this hurdle is so artificial, there's no reason not to allow Rosetta tasks access to lots of RAM, because it never actually calls for it.
Someone once suggested they'd like to allocate 120% of RAM if it's so meaningless in practise - it's difficult to argue against that.

Meaningless note: The RAM requirement is actually set in bytes.
3337.86 MB is 3337.86 * 1024 * 1024 ≈ 3,500,000,000 bytes.
No-one knows why this is either.
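
The conversion is easy to verify in Python. It starts from the round figure of 3,500,000,000 bytes given in this post and divides by 1024 * 1024, on the assumption that the "MB" in the event log is binary megabytes.

```python
memory_bound_bytes = 3_500_000_000  # round decimal figure from the post above

# Assumption: the event log's "MB" means 1024 * 1024 bytes.
mb_as_logged = memory_bound_bytes / (1024 * 1024)

print(f"{mb_as_logged:.2f} MB")  # prints: 3337.86 MB
```

So the odd-looking 3337.86 MB in the log is a round decimal byte count shown in binary units.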

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1675
Credit: 17,738,985
RAC: 22,900
Message 109937 - Posted: 28 Oct 2024, 6:03:29 UTC - in response to Message 109931.  
Last modified: 28 Oct 2024, 6:09:41 UTC

"[mem_usage] enforce: available RAM 58816.28MB swap 6553.60MB"
As you'll see in boinc-log2.txt (from when I enabled mem_usage_debug before a restart of the boinc-client), I think each Rosetta Beta 6.06 task is assumed to require about 3.3 GiB of memory, and BOINC wrongly stops short of running more because it thinks I only have 2072 MB of RAM left after scheduling 17 instances to run.
NOTE!: 17 * 3.3 is 56.1, and max mem usage is 57.44. I think BOINC thinks that each task is requesting 3300 MB of RAM when really they're only using about 125-500 MB each... I do see one reporting 976 MB, but it's the outlier.
In the log with mem_usage_debug enabled, I see a bunch of work units being skipped because it thinks I'm out of RAM:
"[Date&Time Stamp] | Rosetta@home | [cpu_sched_debug] skipping 8a_hal_t_hal_8aa[...rest of task name]: estimated WSS (3337.86MB) exceeds RAM left (2072.66MB)"

Check my logs:
https://pastebin.com/QVJvLBxr == boinc-log.txt == boinc log as I have it configured as stated above.
https://pastebin.com/fSxFyDET == boinc-log2.txt == boinc log from startup to running (just like boinc-log.txt), but with mem_usage_debug enabled in the event log options.
Might be worth posting to the BOINC forums on this.
There's 57GB of RAM available to BOINC. As you pointed out, even if they actually need that much, it's still less than what is available to BOINC. And once they start, they no longer request the larger amount, only what they are using at the time. Even so, some Tasks are already running, so at least several more should be able to start before the rest get even remotely close to running into any sort of RAM limit.

Personally, i can't see any reason from those logs why the Tasks won't start.
You've got enough RAM, you've got enough cores/threads, you've got enough virtual memory & disk space, and it's available to BOINC to use.
Short of setting the "use at most" in-use and not-in-use memory values to 100% and seeing what happens, I can't think of anything else to try (other than reducing the number of cores/threads available to BOINC one at a time until those Tasks start to run, and then seeing what that magic number and RAM value is). What you've got, and the settings you have, should result in all those Tasks running, with not even a single one waiting for memory, let alone a whole bunch of them, IMHO.
Grant
Darwin NT

Keith Myers
Joined: 29 Mar 20
Posts: 97
Credit: 332,473
RAC: 472
Message 109940 - Posted: 30 Oct 2024, 1:21:35 UTC

One of the major flaws of Windows is that if an application requests XXX amount of RAM, Windows reserves that XXX amount even if the application only ever uses X amount. And you don't have any control over how much RAM the application requests. That comes from the task generator template values the admin/scientist sets up for the tasks to use for work generation.

OTOH, Linux does not do this and only allocates memory on the fly, at the request of the application, when it actually runs. This issue is a big problem for tasks that need a huge amount of RAM reserved. The Windows users over at GPUGrid run into this issue all the time. The solution for Windows users is to set up a VERY large pagefile, on the order of 50-64GB in size, so that when the application tells Windows to reserve 30GB of memory, it can actually do as requested by utilizing the pagefile.

Klimax
Joined: 27 Apr 07
Posts: 44
Credit: 2,800,788
RAC: 1,333
Message 109943 - Posted: 30 Oct 2024, 17:29:55 UTC - in response to Message 109940.  

One of the major flaws of Windows is that if an application requests XXX amount of RAM, Windows reserves that XXX amount even if the application only ever uses X amount. And you don't have any control over how much RAM the application requests. That comes from the task generator template values the admin/scientist sets up for the tasks to use for work generation.

OTOH, Linux does not do this and only allocates memory on the fly, at the request of the application, when it actually runs. This issue is a big problem for tasks that need a huge amount of RAM reserved. The Windows users over at GPUGrid run into this issue all the time. The solution for Windows users is to set up a VERY large pagefile, on the order of 50-64GB in size, so that when the application tells Windows to reserve 30GB of memory, it can actually do as requested by utilizing the pagefile.

It should be noted that programmers can reserve address space without actually committing memory by calling the VirtualAlloc function with the MEM_RESERVE flag and committing pages later with MEM_COMMIT (https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-virtualalloc). Unfortunately, that means writing custom allocators.

BTW: I thought Linux had stopped overcommitting by default some time ago, too.
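
The overcommit question can be checked on any given Linux machine, since the kernel exposes the policy in /proc/sys/vm/overcommit_memory. Below is a minimal Python sketch. The mode descriptions paraphrase the kernel documentation, and what a particular distribution ships as its default is not something this thread establishes.

```python
from pathlib import Path

# Meanings of vm.overcommit_memory, paraphrased from the kernel documentation.
OVERCOMMIT_MODES = {
    0: "heuristic overcommit",
    1: "always overcommit",
    2: "strict accounting, no overcommit beyond the commit limit",
}


def describe_overcommit(mode):
    """Map a vm.overcommit_memory value to a short description."""
    return OVERCOMMIT_MODES.get(mode, f"unknown mode {mode}")


def read_overcommit_mode(path="/proc/sys/vm/overcommit_memory"):
    """Return the current mode as an int, or None where the file is absent."""
    p = Path(path)
    if not p.exists():
        return None
    return int(p.read_text().strip())


mode = read_overcommit_mode()
print("no such setting on this system" if mode is None else describe_overcommit(mode))
```

On a system reporting mode 0 or 1, a large allocation that is never touched generally does not consume physical RAM, which fits the behaviour described for Linux above.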



©2024 University of Washington
https://www.bakerlab.org