preempt queue on derecho

raeder

Member
Data assimilation jobs (using DART and CESM) could benefit from using the preempt queue on Derecho.
The assimilation cycles can be made shorter than 10 minutes, so there is a natural stopping point
within the window between the PBS preemption request and the kill.
I don't see 'preempt' in CIME as of cime5.8.47, so I'm wondering whether there's been any discussion
about how CESM might use this queue.

If not, I'll work on implementing it within the DATA_ASSIMILATION_SCRIPT.
 

jedwards

CSEG and Liaisons
Staff member
My thought for the preempt queue was to trigger a CESM restart write alarm in ESMF. This is a feature request that we have submitted to the ESMF development team. I think that implementing this in the DATA_ASSIMILATION_SCRIPT for DART would be appropriate, since that would be a natural break point for that case. Maybe you would like to meet in person and brainstorm this idea a bit?
 

raeder

Member
Thanks for the status report. I've also talked with Brian D. about this, and would like to figure out the best way forward.
My schedule's open today and Monday.
 

jedwards

CSEG and Liaisons
Staff member
You were going to implement this in the DATA_ASSIMILATION_SCRIPT. I don't think that you need any changes in CESM for that, do you?
To try the preempt queue, just run `./xmlchange JOB_QUEUE=preempt --force`
 

raeder

Member
My understanding of the preempt queue is that PBS sends a "kill" signal
(actually the SIGUSR1 signal) to the CESM job process.
We don't want the CESM job to end immediately when it gets that signal.
We want it to finish the cycle normally, then stop.

The data assimilation script is a distinct process that can't (?) see this PBS signal,
so there needs to be a separate signal from the CESM process to the DA process that it should end.
The mechanism that I coded in the DA script listens for a signal from CESM.
When it receives it, it executes `xmlchange DATA_ASSIMILATION_CYCLES=$cycle`
where `cycle` is an existing argument passed from case_run to DATA_ASSIMILATION_SCRIPT.
The DA script finishes normally. Then case_run will see that it's the new "last" cycle and end normally.
This last step requires(?) adding a line in the "cycles" loop in case_run.py to xmlquery the value
of DATA_ASSIMILATION_CYCLES during each iteration.
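
The loop change described above can be illustrated with a minimal sketch. This is not the actual case_run.py code: `run_cycles`, `run_one_cycle`, and `query_total_cycles` are hypothetical names, and `query_total_cycles` stands in for whatever call would return the current value of DATA_ASSIMILATION_CYCLES. The point is only that the cycle count is re-read on every iteration, so a value lowered during a cycle ends the loop once that cycle finishes.

```python
def run_cycles(run_one_cycle, query_total_cycles):
    """Run cycles until the (re-queried) total is reached.

    run_one_cycle:      hypothetical callable taking the cycle index.
    query_total_cycles: hypothetical callable returning the current total,
                        standing in for an xmlquery of DATA_ASSIMILATION_CYCLES.
    Returns the number of cycles that were run.
    """
    cycle = 0
    # Re-query the total each time instead of reading it once before the loop,
    # so a change made during a cycle is seen before the next one starts.
    while cycle < query_total_cycles():
        run_one_cycle(cycle)
        cycle += 1
    return cycle


if __name__ == "__main__":
    settings = {"total": 5}

    def one_cycle(cycle):
        # Pretend a stop request arrives during the second cycle (index 1).
        if cycle == 1:
            settings["total"] = cycle

    print(run_cycles(one_cycle, lambda: settings["total"]))
```

Whether the real `cycle` argument is 0-based or 1-based is not stated in the thread, so the sketch only assumes that the lowered total is no larger than the number of cycles already completed.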

So far, there is no signal from CESM to DA, so that needs to be created.

This probably isn't as general as an ESMF solution would be,
but could be very useful for us in the short term.
 

jedwards

CSEG and Liaisons
Staff member
Currently we have three distinct executables running in a dart job - correct?
  • cesm
  • dart
  • data_assimilation_script
As I read the documentation, I take it that each of the three would have to have signal handling code, since there is
no way to know which process is running when the signal is issued. I think that this whole can of worms is more complicated than you imagine it to be.
 

raeder

Member
In case seeing the code is helpful:
env_run.xml: DATA_ASSIMILATION_SCRIPT=assim_preempt.sh (attached)
This script will run the traditional assimilate.csh script (which runs the fortran executable) in the background,
while monitoring for a signal (still to be defined) to end.

I don't think that the backgrounded assimilate.csh would need to listen for the signal.
The intent is that it will finish, even if the signal is received by data_assimilation_script (assim_preempt.sh).
That leaves 2 processes, one a subprocess of the other.

2 key things that I don't know are
1) After CESM, in preempt mode, starts data_assimilation_script, does it continue listening for a signal from PBS?
That's the mode that's described in the documentation you referenced, and it's how assim_preempt.sh works.
2) Can the subprocess (data_assimilation_script=assim_preempt.sh) see the PBS signal directly?

If sending a system signal from CESM to data_assimilation_script is difficult (because it's not yet running, ...)
one idea is for CESM to write out a small signal file, or define an environment variable,
which data_assimilation_script could look for whenever it runs.
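
The signal-file idea could look roughly like the sketch below. The file name `preempt_requested` and both function names are assumptions made for illustration; nothing in CESM or DART defines them. One side creates the file when it learns of the preemption, and the other side checks for it at a convenient point. A file has one advantage over an environment variable: the environment is copied into a child process at launch, so a variable set later in the parent is not visible to a child that is already running.

```python
from pathlib import Path


def request_stop(run_dir, flag_name="preempt_requested"):
    """Create the (hypothetical) signal file in run_dir and return its path."""
    path = Path(run_dir) / flag_name
    path.touch()
    return path


def stop_requested(run_dir, flag_name="preempt_requested"):
    """Return True if the (hypothetical) signal file exists in run_dir."""
    return (Path(run_dir) / flag_name).exists()
```

The reader would also need to delete the file before the job is resubmitted, otherwise the restarted job would stop after its first cycle.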
 

Attachments

  • assim_preempt.sh.tar
    4 KB · Views: 0

jedwards

CSEG and Liaisons
Staff member
Perhaps you could write a Python wrapper, starting from the NCAR HPC Documentation page "Using Preemption",
which sets the signal handler and calls each of CESM, the DA script, and DART in turn. It would check for signals after each task completes and exit rather than start the next task if a signal is found. It would take some experimenting and I don't think I have time right now.
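
A minimal sketch of such a wrapper follows. It is not a tested Derecho workflow. It assumes that the preemption signal is SIGUSR1 (as mentioned earlier in the thread) and that it reaches the wrapper process; `run_tasks` and the command lists are hypothetical. A handler installed in Python is not carried over into the programs the wrapper launches, so if the signal also reaches those children they would still need their own handling.

```python
import signal
import subprocess


def run_tasks(commands, preempt_signal=signal.SIGUSR1):
    """Run each command in turn; skip the remaining ones once a signal arrives.

    commands: list of argument lists, e.g. [["./task_a"], ["./task_b"]]
              (placeholder names, not real CESM or DART entry points).
    Returns the number of commands that ran to completion.
    """
    state = {"stop": False}

    def on_preempt(signum, frame):
        # Only record the request; the task that is running is left to finish.
        state["stop"] = True

    # Must be called from the main thread of the wrapper.
    previous = signal.signal(preempt_signal, on_preempt)
    completed = 0
    try:
        for cmd in commands:
            if state["stop"]:
                break  # a signal arrived during an earlier task
            subprocess.run(cmd, check=True)
            completed += 1
    finally:
        # Restore whatever handler was installed before.
        signal.signal(preempt_signal, previous)
    return completed
```

In this form the wrapper never interrupts a running task, so each task has to be short enough to finish inside the window PBS allows between the signal and the kill.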
 