Problems and Technical Issues with Rosetta@home

Jean-David Beyer
Joined: 2 Nov 05
Posts: 188
Credit: 6,416,824
RAC: 5,577
Message 110042 - Posted: 22 Nov 2024, 0:25:07 UTC - in response to Message 110041.  

... they're sitting there (headless for the most part) doing nothing but running boinc.
I guess my question is...that's the "boinc" process right? When the "rosetta" process kicks off how does it interact with that? Do the jobs that rosetta provides indicate how much memory they will be using?


My Linux machine runs lots of processes. It has 16 cores and 128 GBytes of RAM.

As far as Boinc is concerned, the main process is the Boinc Client. It uses very little RAM and very little CPU time. From time to time, the Boinc client sends a message to a Boinc server and asks for work. The server sends a reply: either a complaint that it cannot find any work, or a bunch of messages describing the files the client should download. In the latter case, the client downloads the files into the proper places. Then, if the client has spare cores, it selects one and forks off a process to run the task.

So let us say there are no Boinc tasks running and the client has just received a task from the Rosetta server. The client then forks off the Rosetta task.

top - 19:12:56 up 16 days,  8:42,  2 users,  load average: 13.38, 13.32, 13.29
Tasks: 483 total,  14 running, 469 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.9 us,  0.3 sy, 80.6 ni, 18.0 id,  0.0 wa,  0.2 hi,  0.1 si,  0.0 st
MiB Mem : 128086.0 total,   5047.0 free,   7395.4 used, 115643.6 buff/cache
MiB Swap:  15992.0 total,  15687.0 free,    305.0 used. 116733.0 avail Mem 

    PID    PPID USER      PR  NI S    RES  %MEM  %CPU  P     TIME+ COMMAND                                                                   
3176351    2043 boinc     39  19 R 596760   0.5  99.0 13  10:12.79 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-g+ 
3161135    2043 boinc     39  19 R 581420   0.4  99.3  2 121:33.16 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-g+ 
3111703    2043 boinc     39  19 R 541240   0.4  99.1  9 455:40.07 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-g+ 
3163687    2043 boinc     39  19 R 481148   0.4  99.2 10 103:13.41 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-g+ 
3144411    2043 boinc     39  19 R 443480   0.3  99.1  6 233:56.51 ../../projects/einstein.phys.uwm.edu/hsgamma_FGRP5_1.08_x86_64-pc-linux-+ 
   2043       1 boinc     30  10 S  54708   0.0   0.1  8 300278:26 /usr/bin/boinc                                                            
3171024    2043 boinc     39  19 R  39676   0.0  99.3  4  48:38.05 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+ 
3166711    2043 boinc     39  19 R  39668   0.0  99.3 11  80:07.82 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+ 
3171561    2043 boinc     39  19 R  39584   0.0  99.2  0  44:34.46 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+ 
3167425    2043 boinc     39  19 R  39520   0.0  99.3  7  75:58.11 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+ 
3176944    2043 boinc     39  19 R  39172   0.0  99.4 15   5:33.72 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+ 
3172039    2043 boinc     39  19 R  39116   0.0  99.3  3  41:39.57 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+ 
3176627    2043 boinc     39  19 R  36824   0.0  99.4  1   8:20.14 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_i686-pc-l+ 
3141011    2043 boinc     39  19 R  29944   0.0  99.3  5 258:04.99 ../../projects/einstein.phys.uwm.edu/hsgamma_FGRP5_1.08_x86_64-pc-linux-+ 


PID is the process ID; PPID is the PID of the process's parent.
PID 1 is the first process started at boot and the ancestor of the other processes listed here. One of the processes it starts is PID 2043, which is my Boinc Client,
/usr/bin/boinc
This client starts all the other Boinc tasks shown above.
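
The parent/child relationship in the top listing above can be checked mechanically. What follows is only a minimal Python sketch, not anything Boinc itself provides: it takes a few rows copied from that listing (PID, PPID, RES in KiB, command), picks out the children of PID 2043, and totals their resident memory. The helper name `children_of` and the shortened command strings are assumptions made for illustration.

```python
# Rows copied from the top listing above: (PID, PPID, RES in KiB, command).
# Command strings are shortened here for readability (an assumption of this sketch).
ROWS = [
    (3176351, 2043, 596760, "rosetta_4.20"),
    (3161135, 2043, 581420, "rosetta_4.20"),
    (3111703, 2043, 541240, "rosetta_4.20"),
    (3163687, 2043, 481148, "rosetta_4.20"),
    (3144411, 2043, 443480, "hsgamma_FGRP5_1.08"),
    (2043, 1, 54708, "/usr/bin/boinc"),
]


def children_of(rows, parent_pid):
    """Return the rows whose PPID equals parent_pid (hypothetical helper)."""
    return [row for row in rows if row[1] == parent_pid]


kids = children_of(ROWS, 2043)
total_kib = sum(row[2] for row in kids)

print(len(kids), "children of PID 2043 in this sample")
print(round(total_kib / 1024), "MiB resident in those children")
```

The client itself (PID 2043) has PPID 1, so it is not counted among its own children; only the science applications are.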

Bill Swisher
Joined: 10 Jun 13
Posts: 35
Credit: 33,089,810
RAC: 44,273
Message 110043 - Posted: 22 Nov 2024, 2:06:41 UTC - in response to Message 110042.  

I'm running openSUSE on all of my computers. Here's the one that caused the problem:

top - 16:42:35 up 2 days, 6:10, 2 users, load average: 33.55, 33.40, 33.38
Tasks: 475 total, 34 running, 441 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.2 sy, 99.8 ni, 0.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 31927.27+total, 22505.63+free, 8120.531 used, 1792.707 buff/cache
MiB Swap: 2048.062 total, 2048.062 free, 0.000 used. 23806.74+avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
61822 boinc 39 19 78392 35748 2304 R 100.3 0.109 100:31.99 wcgrid_mcm1_map
61824 boinc 39 19 78232 38868 2048 R 100.3 0.119 99:54.31 wcgrid_mcm1_map
62632 boinc 39 19 825532 674520 70144 R 100.3 2.063 249:05.56 rosetta_4.20_x8
62921 boinc 39 19 903184 753668 70400 R 100.3 2.305 217:03.78 rosetta_4.20_x8
63867 boinc 39 19 78524 38568 2048 R 100.3 0.118 108:19.06 wcgrid_mcm1_map
64033 boinc 39 19 78392 34544 2304 R 100.3 0.106 89:55.77 wcgrid_mcm1_map
64063 boinc 39 19 78396 40032 2048 R 100.3 0.122 87:05.91 wcgrid_mcm1_map
60802 boinc 39 19 2272136 2.027g 76288 R 100.0 6.502 448:17.00 rosetta_4.20_x8
61312 boinc 39 19 771740 617852 70144 R 100.0 1.890 387:41.08 rosetta_4.20_x8
61344 boinc 39 19 819136 665404 70400 R 100.0 2.035 382:14.14 rosetta_4.20_x8
61727 boinc 39 19 78640 40168 2304 R 100.0 0.123 109:32.81 wcgrid_mcm1_map
62042 boinc 39 19 956000 801992 70144 R 100.0 2.453 311:50.22 rosetta_4.20_x8
63796 boinc 39 19 78232 38764 2304 R 100.0 0.119 115:34.03 wcgrid_mcm1_map
63805 boinc 39 19 78232 39828 2304 R 100.0 0.122 112:30.09 wcgrid_mcm1_map
63859 boinc 39 19 78536 38488 2304 R 100.0 0.118 110:09.20 wcgrid_mcm1_map
63873 boinc 39 19 78232 39728 2304 R 100.0 0.122 107:39.82 wcgrid_mcm1_map
63877 boinc 39 19 78392 38792 2048 R 100.0 0.119 106:19.39 wcgrid_mcm1_map
63881 boinc 39 19 78524 38808 2304 R 100.0 0.119 105:28.10 wcgrid_mcm1_map
63885 boinc 39 19 78232 38796 2304 R 100.0 0.119 104:28.67 wcgrid_mcm1_map
63976 boinc 39 19 78392 39360 2304 R 100.0 0.120 92:57.77 wcgrid_mcm1_map
64022 boinc 39 19 78468 39464 2304 R 100.0 0.121 92:03.05 wcgrid_mcm1_map
64027 boinc 39 19 78392 39200 2304 R 100.0 0.120 90:19.73 wcgrid_mcm1_map
64061 boinc 39 19 78652 39572 2304 R 100.0 0.121 87:13.84 wcgrid_mcm1_map
64067 boinc 39 19 78536 38944 2304 R 100.0 0.119 86:28.16 wcgrid_mcm1_map
64757 boinc 39 19 78260 39200 2304 R 100.0 0.120 4:00.44 wcgrid_mcm1_map
61710 boinc 39 19 78592 39564 2304 R 99.67 0.121 114:27.44 wcgrid_mcm1_map
61732 boinc 39 19 78392 39740 2304 R 99.67 0.122 108:00.30 wcgrid_mcm1_map
63854 boinc 39 19 78232 39136 2304 R 99.67 0.120 111:13.06 wcgrid_mcm1_map
64029 boinc 39 19 78580 39696 2304 R 99.67 0.121 90:13.18 wcgrid_mcm1_map
64038 boinc 39 19 78292 38760 2304 R 99.67 0.119 87:54.97 wcgrid_mcm1_map
61814 boinc 39 19 78392 39304 2304 R 99.34 0.120 102:28.23 wcgrid_mcm1_map
63851 boinc 39 19 78232 39780 2304 R 99.34 0.122 112:01.97 wcgrid_mcm1_map

Notice the size of the rosetta processes. I've gone in and created the app_config, as root, to control how many rosetta processes can run.
# cd /var/lib/boinc/projects/boinc.bakerlab.org_rosetta/
# cat app_config.xml
<app_config>
  <app>
    <name>rosetta_beta</name>
    <max_concurrent>6</max_concurrent>
  </app>
  <app>
    <name>rosetta</name>
    <max_concurrent>6</max_concurrent>
  </app>
</app_config>

I have a newer machine running and the rosetta processes are much smaller. Dunno why.
I guess the answer is to go get more memory. Unfortunately I won't be near that computer until April.
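
For anyone who would rather generate the file than type it, here is a minimal Python sketch that builds the same app_config.xml structure shown above. It only reproduces the elements used in this post (`app`, `name`, `max_concurrent`); the function name `build_app_config` is an assumption for illustration, and `ET.indent` needs Python 3.9 or later.

```python
import xml.etree.ElementTree as ET


def build_app_config(limits):
    """Return app_config XML text; `limits` maps app name -> max_concurrent."""
    root = ET.Element("app_config")
    for name, max_concurrent in limits.items():
        app = ET.SubElement(root, "app")
        ET.SubElement(app, "name").text = name
        ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    ET.indent(root, space="  ")  # pretty-print; available from Python 3.9
    return ET.tostring(root, encoding="unicode")


# Same app names and limits as in the post above.
xml_text = build_app_config({"rosetta_beta": 6, "rosetta": 6})
print(xml_text)
```

The printed text would go into app_config.xml in the project directory shown above, and the client has to re-read its config files (or be restarted) before the limit takes effect.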

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1679
Credit: 17,796,505
RAC: 22,597
Message 110044 - Posted: 22 Nov 2024, 5:34:12 UTC - in response to Message 110041.  

I guess my question is...that's the "boinc" process right? When the "rosetta" process kicks off how does it interact with that?
There is no BOINC process.
Those BOINC settings limit the amount of RAM, disk space, network activity, CPU usage etc available for all projects that run under BOINC.


Setting a massive swap file value allows Tasks that claim to require massive amounts of memory to start, even though they don't actually use that amount of memory in order to run.
So systems with limited amounts of RAM can still run Tasks that claim to require a significant amount of RAM, as long as the RAM they have (and that is available for BOINC to use) is more than the Tasks actually need to run, even if the Tasks claim to require massive amounts of RAM above & beyond that in order to start.
If the Tasks do need more RAM than is available for BOINC to make use of, then the large swap file still allows them to run, but with massive amounts of swapping. If you have an SSD, it means the system will be sluggish at best. With a HDD, it will probably grind to a non-responsive halt, depending on just how much more RAM the Task(s) need, and how many of them are trying to run.
Grant
Darwin NT

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1679
Credit: 17,796,505
RAC: 22,597
Message 110045 - Posted: 22 Nov 2024, 5:42:44 UTC

And boinc-process is back up and running. That must be one of its shortest outages yet.
Grant
Darwin NT

Sid Celery
Joined: 11 Feb 08
Posts: 2122
Credit: 41,186,203
RAC: 9,305
Message 110046 - Posted: 22 Nov 2024, 6:37:20 UTC - in response to Message 110041.  

Guess I broke it.

If only any of us were that powerful...

So, to continue on with the subject of memory usage, now that I've calmed down, and managed to place a limit on rosetta on the remote computers.

Someone, and forgive me for not mentioning your name, suggested setting the max memory variables under the computing preferences option. I tend to set mine at 98% since, with the exception of this computer, they're sitting there (headless for the most part) doing nothing but running Boinc.

Personally I agree with your choice rather than the suggestion of restricting RAM so that there's even less space to run.

I guess my question is... that's the "Boinc" process right? When the "Rosetta" process kicks off how does it interact with that?
Do the jobs that Rosetta provides indicate how much memory they will be using?

AIUI, yes.

Then there's the whole swap thingie (the swap space on my computers is set at the pretty much standard 2GB per machine). Any wizards out there who can explain it to a maroon like me?

I certainly don't understand the exact mechanics of how this works, but we have had the experience in the past where disk space became a limitation and the solution wasn't entirely obvious.
I do recall that my original setting was 10-20GB rather than 2GB, so that's one thing, but even that became an issue/restriction when the problem arose before.
I recall raising it to 500GB and it still not solving the problem back then.

The solution turned out to be not having any restriction at all.
That is, on the disk tab, unselecting "Use no more than xx GB"
Combined with that I'm currently using "Use no more than 90% of total" disk space and
Page/swap file: Use at most 75%

In the situation where you're currently running headless, I hope none of these settings should be a problem on those hosts.

I have no idea whether this solution to an old problem also solves your current one, but I am surprised you're reporting the problem at all as I haven't heard anyone experiencing anything similar in recent years.
Nothing to lose by trying anyway.

Sid Celery
Joined: 11 Feb 08
Posts: 2122
Credit: 41,186,203
RAC: 9,305
Message 110047 - Posted: 22 Nov 2024, 6:46:13 UTC - in response to Message 110045.  

And boinc-process is back up and running. That must be one of its shortest outages yet.

That's what I originally came here to say.
Not that it matters a great deal with current task availability but still...

Bryn Mawr
Joined: 26 Dec 18
Posts: 392
Credit: 12,094,573
RAC: 5,446
Message 110048 - Posted: 22 Nov 2024, 9:19:35 UTC - in response to Message 110041.  

Guess I broke it.

So, to continue on with the subject of memory usage, now that I've calmed down, and managed to place a limit on rosetta on the remote computers.

Someone, and forgive me for not mentioning your name, suggested setting the max memory variables under the computing preferences option. I tend to set mine at 98% since, with the exception of this computer, they're sitting there (headless for the most part) doing nothing but running boinc.
I guess my question is...that's the "boinc" process right? When the "rosetta" process kicks off how does it interact with that? Do the jobs that rosetta provides indicate how much memory they will be using? Then there's the whole swap thingie (the swap space on my computers is set at the pretty much standard 2GB per machine). Any wizards out there who can explain it to a maroon like me?


No, the percentage you set covers all the programs running under the Boinc user, so the Boinc Client, the Manager and all of the work units (e.g. Rosetta) are within the same pot, and there’s no “interaction” between the one and the other.

The swap setting works the same way: the percentage covers all memory required by Boinc over and above the physical memory present.
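
To put rough numbers on the "same pot" idea, here is a small Python sketch using figures from the openSUSE top listing earlier in the thread. It is plain arithmetic under stated assumptions, not a description of how the Boinc client schedules work: the 98% figure is the preference mentioned in the thread, the 40 MiB per wcgrid task is an assumed round figure, and the "worst case" line is a hypothetical in which every slot runs a task as large as the biggest Rosetta task seen.

```python
# Figures copied from the openSUSE top listing earlier in the thread.
TOTAL_MIB = 31927        # "MiB Mem ... total", fractional part dropped
LIMIT_PERCENT = 98       # memory preference mentioned in the thread
SLOTS = 32               # task rows in that listing (6 rosetta + 26 wcgrid)

# RES of the six rosetta_4.20 rows in MiB (top shows KiB, or GiB with a "g" suffix).
rosetta_mib = [674520 / 1024, 753668 / 1024, 2.027 * 1024,
               617852 / 1024, 665404 / 1024, 801992 / 1024]
WCG_MIB = 40             # assumed round figure; the wcgrid rows show roughly 34 to 39 MiB

budget_mib = TOTAL_MIB * LIMIT_PERCENT / 100
snapshot_mib = sum(rosetta_mib) + (SLOTS - len(rosetta_mib)) * WCG_MIB
largest_mib = max(rosetta_mib)
worst_case_mib = SLOTS * largest_mib   # hypothetical: every slot at the largest size

print(f"budget for everything under Boinc: {budget_mib:8.0f} MiB")
print(f"tasks in the snapshot:             {snapshot_mib:8.0f} MiB")
print(f"hypothetical worst case:           {worst_case_mib:8.0f} MiB")
print("largest-size tasks that fit:", int(budget_mib // largest_mib))
```

The snapshot fits comfortably, but the hypothetical worst case does not, which is consistent with limiting how many Rosetta tasks run at once.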

Sid Celery
Joined: 11 Feb 08
Posts: 2122
Credit: 41,186,203
RAC: 9,305
Message 110051 - Posted: 23 Nov 2024, 8:08:07 UTC - in response to Message 110047.  

And boinc-process is back up and running. That must be one of its shortest outages yet.

That's what I originally came here to say.
Not that it matters a great deal with current task availability but still...

Some 40-50k tasks became available maybe 12-14 hrs ago, and it seems another 800k in the last few hours.

Bryn Mawr
Joined: 26 Dec 18
Posts: 392
Credit: 12,094,573
RAC: 5,446
Message 110052 - Posted: 23 Nov 2024, 12:56:18 UTC - in response to Message 110051.  

So far, one error:

<core_client_version>8.0.4</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)</message>
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.06_x86_64-pc-linux-gnu @8a_hal_u_hal_8aa_4jp3235_d104_0001_1.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937
Using database: database_f5ae1de8e1/database

ERROR: Error in protocols::cyclic_peptide_predict::SimpleCycpepPredictpplication::set_up_n_to_c_cyclization_mover() function: residue 1 does not have a LOWER_CONNECT.
ERROR:: Exit from: src/protocols/cyclic_peptide_predict/SimpleCycpepPredictApplication.cc line: 2534
BOINC:: Error reading and gzipping output datafile: default.out
09:31:32 (83823): called boinc_finish(1)

</stderr_txt>
]]>
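
When several of these turn up, it can help to pull the key fields out of the stderr text rather than read each one by eye. The following is a minimal Python sketch that does so for an excerpt of the output above. The regular expressions are assumptions fitted to this one sample, not a documented Rosetta or BOINC log format, and the function name `summarize_stderr` is made up for illustration.

```python
import re

# Excerpt copied from the stderr output above.
STDERR = """\
process exited with code 1 (0x1, -255)</message>
<stderr_txt>
Using database: database_f5ae1de8e1/database

ERROR: Error in protocols::cyclic_peptide_predict::SimpleCycpepPredictpplication::set_up_n_to_c_cyclization_mover() function: residue 1 does not have a LOWER_CONNECT.
ERROR:: Exit from: src/protocols/cyclic_peptide_predict/SimpleCycpepPredictApplication.cc line: 2534
BOINC:: Error reading and gzipping output datafile: default.out
09:31:32 (83823): called boinc_finish(1)
"""


def summarize_stderr(text):
    """Return (exit code, ERROR lines, (source file, line number)); None where absent."""
    code = re.search(r"process exited with code (\d+)", text)
    where = re.search(r"Exit from: (\S+) line: (\d+)", text)
    errors = [line for line in text.splitlines() if line.startswith("ERROR")]
    exit_code = int(code.group(1)) if code else None
    location = (where.group(1), int(where.group(2))) if where else None
    return exit_code, errors, location


exit_code, errors, location = summarize_stderr(STDERR)
print("exit code:", exit_code)
print("error lines:", len(errors))
print("raised at:", location)
```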

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1679
Credit: 17,796,505
RAC: 22,597
Message 110055 - Posted: 23 Nov 2024, 20:09:24 UTC - in response to Message 110052.  

ERROR: Error in protocols::cyclic_peptide_predict::SimpleCycpepPredictpplication::set_up_n_to_c_cyclization_mover() function: residue 1 does not have a LOWER_CONNECT.
Unfortunately, one of the usual ones.
Grant
Darwin NT

Dr Who Fan
Joined: 28 May 06
Posts: 70
Credit: 265,580
RAC: 368
Message 110057 - Posted: 23 Nov 2024, 22:30:26 UTC

Upload/Download SERVER(s) appear to be off-line again but the server status page is all green
11/23/2024 16:25:35 Internet access OK - project servers may be temporarily down.
Rosetta@home 11/23/2024 16:25:34 Backing off 01:14:05 on download of input_rb_11_23_642993_638479__t000__2_C1_robetta.zip
Rosetta@home 11/23/2024 16:25:34 Temporarily failed download of input_rb_11_23_642993_638479__t000__2_C1_robetta.zip: transient HTTP error
Rosetta@home 11/23/2024 16:25:34 Backing off 01:59:39 on download of flags_rb_11_23_642993_638479__t000__2_C1_robetta
Rosetta@home 11/23/2024 16:25:34 Temporarily failed download of flags_rb_11_23_642993_638479__t000__2_C1_robetta: transient HTTP error
Rosetta@home 11/23/2024 16:25:34 Backing off 02:07:26 on download of input_rb_11_23_642993_638479__t000__1_C1_robetta.zip
Rosetta@home 11/23/2024 16:25:34 Temporarily failed download of input_rb_11_23_642993_638479__t000__1_C1_robetta.zip: transient HTTP error
Rosetta@home 11/23/2024 16:25:34 Backing off 01:24:28 on download of flags_rb_11_23_642993_638479__t000__1_C1_robetta
Rosetta@home 11/23/2024 16:25:34 Temporarily failed download of flags_rb_11_23_642993_638479__t000__1_C1_robetta: transient HTTP error
11/23/2024 16:25:34 Project communication failed: attempting access to reference site
Rosetta@home 11/23/2024 16:25:33 Started download of input_rb_11_23_642993_638479__t000__2_C1_robetta.zip
Rosetta@home 11/23/2024 16:25:33 Started download of flags_rb_11_23_642993_638479__t000__2_C1_robetta
Rosetta@home 11/23/2024 16:25:33 Started download of input_rb_11_23_642993_638479__t000__1_C1_robetta.zip
Rosetta@home 11/23/2024 16:25:33 Started download of flags_rb_11_23_642993_638479__t000__1_C1_robetta
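
The back-off times in a log like this can be pulled out with a few lines of Python. This is only a sketch under the assumption that the lines look exactly like the ones pasted above; the pattern and the function name `backoffs` are made up for illustration and are not part of the Boinc client.

```python
import re
from datetime import timedelta

# The four "Backing off" lines copied from the log above.
LOG = """\
Rosetta@home 11/23/2024 16:25:34 Backing off 01:14:05 on download of input_rb_11_23_642993_638479__t000__2_C1_robetta.zip
Rosetta@home 11/23/2024 16:25:34 Backing off 01:59:39 on download of flags_rb_11_23_642993_638479__t000__2_C1_robetta
Rosetta@home 11/23/2024 16:25:34 Backing off 02:07:26 on download of input_rb_11_23_642993_638479__t000__1_C1_robetta.zip
Rosetta@home 11/23/2024 16:25:34 Backing off 01:24:28 on download of flags_rb_11_23_642993_638479__t000__1_C1_robetta
"""

PATTERN = re.compile(r"Backing off (\d+):(\d+):(\d+) on download of (\S+)")


def backoffs(log_text):
    """Map each file name to its back-off delay as a timedelta."""
    result = {}
    for match in PATTERN.finditer(log_text):
        hours, minutes, seconds = (int(part) for part in match.groups()[:3])
        result[match.group(4)] = timedelta(hours=hours, minutes=minutes, seconds=seconds)
    return result


delays = backoffs(LOG)
slowest = max(delays, key=delays.get)
for name, delay in sorted(delays.items(), key=lambda item: item[1]):
    print(delay, name)
print("longest wait:", slowest)
```

Each file gets its own delay, so without a manual retry the last of these four downloads would not be attempted again for a little over two hours.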



Bryn Mawr
Joined: 26 Dec 18
Posts: 392
Credit: 12,094,573
RAC: 5,446
Message 110058 - Posted: 23 Nov 2024, 22:31:07 UTC - in response to Message 110055.  

ERROR: Error in protocols::cyclic_peptide_predict::SimpleCycpepPredictpplication::set_up_n_to_c_cyclization_mover() function: residue 1 does not have a LOWER_CONNECT.
Unfortunately, one of the usual ones.


Yes, presumably a definition error for the molecule being tested.

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1679
Credit: 17,796,505
RAC: 22,597
Message 110059 - Posted: 23 Nov 2024, 23:32:06 UTC - in response to Message 110057.  

Upload/Download SERVER(s) appear to be off-line again but the server status page is all green
I'm not having any issues at all.
Grant
Darwin NT

Dr Who Fan
Joined: 28 May 06
Posts: 70
Credit: 265,580
RAC: 368
Message 110060 - Posted: 24 Nov 2024, 0:41:49 UTC - in response to Message 110059.  

Upload/Download SERVER(s) appear to be off-line again but the server status page is all green
I'm not having any issues at all.

Did a manual retry a few minutes ago and they downloaded successfully.

©2024 University of Washington
https://www.bakerlab.org