
2.5 Submitting a Job
A job is submitted for execution to Condor using the
condor_submit command.
condor_submit takes as an argument the name of a
file called a submit description file.
This file contains commands and keywords to direct the queuing of jobs.
In the submit description file, Condor finds everything it needs
to know about the job.
Items such as the name of the executable to run,
the initial working directory, and command-line arguments to the
program all go into
the submit description file.
condor_submit creates a job
ClassAd based upon the information,
and Condor
works toward running the job.
A well-constructed submit description file can save time for Condor users,
because it makes it easy to submit multiple runs of a program to
Condor. To run the same program 500 times on 500
different input data sets, arrange your data files
accordingly so that each run reads its own input, and each run
writes its own output.
Each individual run may have its own initial
working directory, stdin, stdout, stderr, command-line arguments, and
shell environment.
A program that directly opens its own
files will read the file names to use either from stdin
or from the command line.
A program that opens a static filename every time
will need to use a separate subdirectory for the output of each run.
The condor_submit manual page
contains a complete description of how to use condor_submit.
It also includes descriptions of all the commands that may be placed
into a submit description file.
In addition, the index lists entries for each command under the
heading of Submit Commands.
2.5.1 Sample submit description files
In addition to the examples of submit description files given in the
condor_submit manual page, here are a few more.
Example 1 is one of the simplest submit description
files possible. It queues up one copy of the program foo
(which had been created by condor_compile)
for execution by Condor.
Since no platform is specified, Condor will use its default,
which is to run the job on a machine which has the
same architecture and operating system as the machine from which it was
submitted.
No input, output, or error
commands are given in the submit
description file, so the
files stdin, stdout, and stderr will all refer to
/dev/null.
The program may produce output by explicitly opening a file and writing to it.
A log file, foo.log, will also be produced that contains the events
the job had during its lifetime inside of Condor.
When the job finishes, its exit conditions will be noted in the log file.
It is recommended that you always have a log file so you know what
happened to your jobs.
####################
# Example 1
# Simple condor job description file
####################
Executable   = foo
Universe     = standard
Log          = foo.log
Queue
Example 2 queues two copies of the program mathematica. The
first copy will run in directory run_1, and the second will run in
directory run_2. For both queued copies,
stdin will be test.data,
stdout will be loop.out, and
stderr will be loop.error.
There will be two sets of files written,
as the files are each written to their own directories.
This is a convenient way to organize data if you
have a large group of Condor jobs to run. The example file
shows program submission of
mathematica as a vanilla universe job.
This may be necessary if the source
and/or object code to mathematica is not available.
####################
# Example 2: demonstrate use of multiple
# directories for data organization.
####################
Executable   = mathematica
Universe     = vanilla
Input        = test.data
Output       = loop.out
Error        = loop.error
Log          = loop.log

Initialdir   = run_1
Queue

Initialdir   = run_2
Queue
The submit description file for Example 3 queues 150
runs of program foo which has been compiled and linked for
LINUX running on a 32-bit Intel processor.
This job requires Condor to run the program on machines which have
greater than 32 megabytes of physical memory, and expresses a
preference to run the program on machines with more than 64 megabytes.
It also advises Condor that it will
use up to 28 megabytes of memory when running.
Each of the 150 runs of the program is given its own process number,
starting with process number 0.
stdin, stdout, and stderr will
refer to in.0, out.0, and err.0 for the first run
of the program,
in.1, out.1,
and err.1 for the second run of the program, and so forth.
A log file containing entries
about when and where Condor runs, checkpoints, and migrates processes for
all the 150 queued programs
will be written into the single file foo.log.
####################
# Example 3: Show off some fancy features including
# use of pre-defined macros and logging.
####################
Executable   = foo
Universe     = standard
Requirements = Memory >= 32 && OpSys == "LINUX" && Arch == "INTEL"
Rank         = Memory >= 64
Image_Size   = 28 Meg

Error        = err.$(Process)
Input        = in.$(Process)
Output       = out.$(Process)
Log          = foo.log

Queue 150
2.5.2 About Requirements and Rank
The requirements and rank commands in the submit description file
are powerful and flexible.
Using them effectively requires care, and this section presents
those details.
Both requirements and rank need to be specified
as valid Condor ClassAd expressions, however, default values are set by the
condor_submit program if these are not defined in the submit description file.
From the condor_submit manual page and the above examples, you see
that writing ClassAd expressions is intuitive, especially if you
are familiar with the programming language C.
There are some
pretty nifty expressions you can write with ClassAds.
A complete description of ClassAds and their expressions
can be found in the ClassAd section of this manual.
All of the commands in the submit description file are case insensitive,
except for the ClassAd attribute string values.
ClassAd attribute names are
case insensitive, but ClassAd string
values are case preserving.
Note that the comparison operators
(<, >, <=, >=, and ==)
compare strings
case insensitively.
The special comparison operators
=?= and =!=
compare strings case sensitively.
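As a hypothetical illustration of the difference, consider these two requirements fragments; the attribute value "LINUX" is taken from the examples above:

```
# Matches a machine advertising OpSys as "LINUX", "linux", "Linux", ...
# because == compares strings case insensitively
Requirements = OpSys == "LINUX"

# Matches only a machine advertising exactly the string "LINUX",
# because =?= compares strings case sensitively
Requirements = OpSys =?= "LINUX"
```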
A requirements or rank command in
the submit description file may utilize attributes
that appear in a machine or a job ClassAd.
Within the submit description file (for a job) the
prefix MY. (on a ClassAd attribute name)
causes a reference to the job ClassAd attribute,
and the prefix TARGET. causes a reference to
a potential machine or matched machine ClassAd attribute.
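A sketch of these prefixes, using attributes that appear in the examples of this section (and ignoring, for simplicity, that the two attributes may be advertised in different units):

```
# MY.ImageSize comes from this job's own ClassAd;
# TARGET.Memory comes from the candidate machine's ClassAd
Requirements = TARGET.Memory > MY.ImageSize
```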
The condor_status command displays
statistics about machines within the pool.
The -l option displays the
machine ClassAd attributes for all machines in the Condor pool.
The job ClassAds, if there are jobs in the queue, can be seen
with the condor_q -l command.
This shows all the defined attributes for current jobs in the queue.
A list of defined ClassAd attributes for job ClassAds
is given in an unnumbered Appendix,
as is a list of defined ClassAd attributes for machine ClassAds.
2.5.2.1 Rank Expression Examples
When considering the match between a job and a machine, rank is used
to choose a match from among all machines that satisfy the job's
requirements and are available to the user, after accounting for
the user's priority and the machine's rank of the job.
The rank expressions, simple or complex, define a numerical value
that expresses preferences.
The job's Rank expression evaluates to one of three values.
It can be UNDEFINED, ERROR, or a floating point value.
If Rank evaluates to a floating point value,
the best match will be the one with the largest, positive value.
If no Rank is given
in the submit description file,
then Condor substitutes a default value of 0.0 when considering
machines to match.
If the job's Rank of a given machine evaluates
to UNDEFINED or ERROR,
this same value of 0.0 is used.
Therefore, the machine is still considered for a match,
but has no ranking above any other.
A boolean expression evaluates to the numerical value of 1.0
if true, and 0.0 if false.
The following Rank expressions provide examples to follow.
For a job that desires the machine with the most available memory:
Rank = memory
For a job that prefers to run on a friend's machine
on Saturdays and Sundays:
Rank = ( (clockday == 0) || (clockday == 6) )
&& (machine == "friend.cs.wisc.edu")
For a job that prefers to run on one of three specific machines:
Rank = (machine == "friend1.cs.wisc.edu") ||
(machine == "friend2.cs.wisc.edu") ||
(machine == "friend3.cs.wisc.edu")
For a job that wants the machine with the best floating point
performance (on Linpack benchmarks):
Rank = kflops
This particular example highlights a difficulty with Rank expression
evaluation as currently defined.
While all machines have floating point processing ability,
not all machines will have the kflops attribute defined.
For machines where this attribute is not defined,
Rank will evaluate to the value UNDEFINED, and
Condor will use a default rank value of 0.0 for the machine.
The Rank attribute will only rank machines where
the attribute is defined.
Therefore, the machine with the highest floating point
performance may not be the one given the highest rank.
So, it is wise when writing a Rank expression to check
if the expression's evaluation will lead to the expected
resulting ranking of machines.
This can be accomplished using the condor_status command with the
-constraint argument.
This allows the user to see a list of
machines that fit a constraint.
To see which machines in the pool have kflops defined,
condor_status -constraint kflops
Alternatively, to see a list of machines where
kflops is not defined, use
condor_status -constraint "kflops=?=undefined"
For a job that prefers specific machines in a specific order:
Rank = ((machine == "friend1.cs.wisc.edu")*3) +
((machine == "friend2.cs.wisc.edu")*2) +
(machine == "friend3.cs.wisc.edu")
If the machine being ranked is friend1.cs.wisc.edu, then the
expression
(machine == "friend1.cs.wisc.edu")
is true, and gives the value 1.0.
The expressions
(machine == "friend2.cs.wisc.edu")
(machine == "friend3.cs.wisc.edu")
are false, and give the value 0.0.
Therefore, Rank evaluates to the value 3.0.
In this way, machine friend1.cs.wisc.edu is ranked higher than
machine friend2.cs.wisc.edu,
machine friend2.cs.wisc.edu
is ranked higher than
machine friend3.cs.wisc.edu,
and all three of these machines are ranked higher than others.
2.5.3 Submitting Jobs Using a Shared File System
If vanilla, java, or parallel universe
jobs are submitted without using the File Transfer mechanism,
Condor must use a shared file system to access input and output files.
In this case, the job must be able to access the data files
from any machine on which it could potentially run.
As an example, suppose a job is submitted from blackbird.cs.wisc.edu,
and the job requires a particular data file called
/u/p/s/psilord/data.txt.
If the job were to run on
cardinal.cs.wisc.edu, the file /u/p/s/psilord/data.txt must be
available through either NFS or AFS for the job to run correctly.
Condor allows users to ensure their jobs have access to the right
shared files by using the FileSystemDomain and
UidDomain machine ClassAd attributes.
These attributes specify which machines have access to the same shared
file systems.
All machines that mount the same shared directories in the same
locations are considered to belong to the same file system domain.
Similarly, all machines that share the same user information (in
particular, the same UID, which is important for file systems like
NFS) are considered part of the same UID domain.
The default configuration for Condor places each machine
in its own UID domain and file system domain, using the full host name of the
machine as the name of the domains.
So, if a pool does have access to a shared file system,
the pool administrator must correctly configure Condor
such that all
the machines mounting the same files have the same
FileSystemDomain configuration.
Similarly, all machines that share common user information must be
configured to have the same UidDomain configuration.
When a job relies on a shared file system,
Condor uses the
requirements expression to ensure that the job runs
on a machine in the
correct UidDomain and FileSystemDomain.
In this case, the default requirements expression specifies
that the job must run on a machine with the same UidDomain
and FileSystemDomain as the machine from which the job
is submitted.
This default is almost always correct.
However, in a pool spanning multiple UidDomains and/or
FileSystemDomains, the user may need to specify a different
requirements expression to have the job run on the correct machines.
For example, imagine a pool made up of both desktop workstations and a
dedicated compute cluster.
Most of the pool, including the compute cluster, has access to a
shared file system, but some of the desktop machines do not.
In this case, the administrators would probably define the
FileSystemDomain to be cs.wisc.edu for all the machines
that mounted the shared files, and to the full host name for each
machine that did not. An example is jimi.cs.wisc.edu.
In this example,
a user wants to submit vanilla universe jobs from her own desktop
machine (jimi.cs.wisc.edu) which does not mount the shared file system
(and is therefore in its own file system domain, in its own world).
But, she wants the jobs to be able to run on more than just her own
machine (in particular, the compute cluster), so she puts the program
and input files onto the shared file system.
When she submits the jobs, she needs to tell Condor to send them to
machines that have access to that shared data, so she specifies a
different requirements expression than the default:
Requirements = TARGET.UidDomain == "cs.wisc.edu" && \
TARGET.FileSystemDomain == "cs.wisc.edu"
WARNING: If there is no shared file system, or the Condor pool
administrator does not configure the FileSystemDomain
setting correctly (the default is that each machine in a pool is in
its own file system and UID domain), a user submits a job that cannot
use remote system calls (for example, a vanilla universe job), and the
user does not enable Condor's File Transfer mechanism, the job will
only run on the machine from which it was submitted.
2.5.4 Submitting Jobs Without a Shared File System:
Condor's File Transfer Mechanism
Condor works well without a shared file system.
The Condor file transfer mechanism permits the user to select which files are
transferred and under which circumstances.
Condor can transfer any files needed by a job from
the machine where the job was submitted into a
remote scratch directory on the machine where the
job is to be executed.
Condor executes the job
and transfers output back to the submitting machine.
The user specifies which files and directories to transfer,
and at what point the output files should be copied back to the
submitting machine.
This specification is done within the job's submit description file.
The default behavior of the file transfer mechanism
varies across the different Condor universes,
and it differs between Unix and Windows machines.
For jobs submitted under the standard universe,
the existence of a shared file system is not relevant.
Access to files (input and output) is handled through Condor's
remote system call mechanism.
The executable and checkpoint files are transferred automatically, when needed.
Therefore, the user does not need to change the submit description
file if there is no shared file system,
as the file transfer mechanism is not utilized.
For the vanilla, java, and parallel
universes, access to input files and the executable
through a shared file system is presumed as a default
on jobs submitted from Unix machines.
If there is no shared file system, then Condor's file transfer
mechanism must be explicitly enabled.
When submitting a job from a Windows machine,
Condor presumes the opposite: no access to a shared file system.
It instead enables the file transfer mechanism by default.
The submit description file may need to specify which files to
transfer, and/or when to transfer the output files back.
For the grid universe,
jobs are to be executed on remote machines, so there would never
be a shared file system between machines.
See the section on the grid universe for more details.
For the scheduler universe,
Condor is only using the machine from which the job is submitted.
Therefore, the existence of a shared file system is not relevant.
2.5.4.2 Specifying If and When to Transfer Files
To enable the file transfer mechanism, place two commands
in the job's submit description file:
should_transfer_files and when_to_transfer_output.
In the common case, they will be set as:
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
Setting the should_transfer_files command explicitly
enables or disables the file transfer mechanism.
The command takes on one of three possible values:
YES: Condor transfers both the executable and the file
defined by the input command from the machine where the job is
submitted to the remote machine where the job is to be executed.
The file defined by the output command as well as any files
created by the execution of the job are transferred back to the machine
where the job was submitted.
The timing of the transfer and the directory location of the transferred files
are determined by the command when_to_transfer_output.
IF_NEEDED: Condor transfers files if the job is
matched with and to be executed on a machine in a
different FileSystemDomain than the
one the submit machine belongs to, the same as if
should_transfer_files = YES.
If the job is matched with a machine in the local FileSystemDomain,
Condor will not transfer files and relies
on the shared file system.
NO: Condor's file transfer mechanism is disabled.
The when_to_transfer_output command tells Condor when output
files are to be transferred back to the submit machine.
The command takes on one of two possible values:
ON_EXIT: Condor transfers the file defined by the
output command,
as well as any other files in the remote scratch directory created by the job,
back to the submit machine only when the job exits on its own.
ON_EXIT_OR_EVICT: Condor behaves the same as described
for the value ON_EXIT when the job exits on its own.
However, if the job is evicted from a machine,
files are also transferred back at the time of each eviction.
The files that
are transferred back at eviction time may include intermediate files
that are not part of the final output of the job.
Before the job
starts running again, all of the files that were stored when the job
was last evicted are copied to the job's new remote scratch
directory.
The purpose of saving files at eviction time is to allow the job to
resume from where it left off.
This is similar to using the checkpoint feature of the standard universe,
but just specifying ON_EXIT_OR_EVICT is not enough to make a job
capable of producing or utilizing checkpoints.
The job must be designed to save and restore its state
using the files that are saved at eviction time.
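A minimal sketch of such a submit description file fragment, assuming the program itself periodically writes a state file (here hypothetically named state.dat) and checks for it at startup:

```
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT_OR_EVICT

# The job writes state.dat into its scratch directory as it runs.
# At eviction, Condor saves the scratch directory's files; when the
# job restarts, state.dat reappears and the job resumes from it.
```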
The files that are transferred back at eviction time are not stored in
the location where the job's final output will be written when the job exits.
Condor manages these files automatically,
so usually the only reason for a user to worry about them
is to make sure that there is enough space to store them.
The files are stored on the submit machine in a temporary directory within the
directory defined by the configuration variable SPOOL.
The directory is named using the ClusterId and ProcId job
ClassAd attributes.
The directory name takes the form:
<X mod 10000>/<Y mod 10000>/cluster<X>.proc<Y>.subproc0
where <X> is the value of ClusterId, and
<Y> is the value of ProcId.
As an example, if job 735.0 is evicted, it will produce the directory
$(SPOOL)/735/0/cluster735.proc0.subproc0
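The directory-naming rule above is simple modular arithmetic; the following sketch (a hypothetical helper, not part of Condor) computes the name for any job:

```python
def evict_spool_dir(cluster_id, proc_id, spool="$(SPOOL)"):
    """Build the spool subdirectory name for an evicted job's saved files,
    following the form <X mod 10000>/<Y mod 10000>/cluster<X>.proc<Y>.subproc0."""
    return "%s/%d/%d/cluster%d.proc%d.subproc0" % (
        spool, cluster_id % 10000, proc_id % 10000, cluster_id, proc_id)

# The example from the text: job 735.0
print(evict_spool_dir(735, 0))  # -> $(SPOOL)/735/0/cluster735.proc0.subproc0
```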
There is no default value for when_to_transfer_output.
If using the file transfer mechanism,
this command must be defined.
However, if when_to_transfer_output is specified in the submit
description file,
but should_transfer_files is not, Condor assumes a
value of YES for should_transfer_files.
NOTE: The combination of:
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT_OR_EVICT
would produce undefined file access semantics.
Therefore, this combination is prohibited by condor_submit.
When submitting from a Windows platform,
the file transfer mechanism is enabled by default.
If the two commands when_to_transfer_output and
should_transfer_files are not in the job's
submit description file, then Condor uses the values:
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
If the file transfer mechanism is enabled,
Condor will transfer the following files before the job
is run on a remote machine.
the executable, as defined with the executable command
the input, as defined with the input command
any jar files, for the java universe,
as defined with the jar_files command
If the job requires other input files,
the submit description file should utilize the
transfer_input_files command.
This comma-separated list specifies any other files or directories that Condor is to
transfer to the remote scratch directory,
to set up the execution environment for the job before it is run.
These files are placed in the same directory as the job's executable.
For example:
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = file1,file2
This example explicitly enables the file transfer mechanism,
and it transfers the executable, the file specified by the input
command, any jar files specified by the jar_files command,
and files file1 and file2.
If the file transfer mechanism is enabled,
Condor will transfer the following files from the execute machine
back to the submit machine after the job exits.
the output file, as defined with the output command
the error file, as defined with the error command
any files created or modified by the job in the remote scratch directory;
this automatic detection occurs only for jobs other than grid
universe jobs, and for Condor-C grid jobs.
Directories created by the job within the remote scratch directory
are ignored for this automatic detection of files to be transferred.
A path given for output and error commands represents
a path on the submit machine.
If no path is specified, the directory
specified with initialdir is used, and if that is not specified,
the directory from which the job was submitted is used.
At the time the job is submitted, zero-length files are created
on the submit machine, at the given path for the files defined by the
output and error commands.
This permits job submission failure, if these files cannot be written by the user.
To restrict the output files
or permit entire directory contents to be transferred,
specify the exact list with
transfer_output_files.
Delimit the list of file names, directory names, or paths with commas.
When this list is defined, and any of the files or directories
do not exist as the job exits,
Condor considers this an error, and places the job on hold.
When this list is defined, automatic detection of output files created by
the job is disabled.
Paths specified in this list refer to locations on the execute machine.
The naming and placement of files and directories relies on the
term base name.
For example, the path a/b/c has the base name c.
It is the file name or directory name with all directories
leading up to that name stripped off.
On the submit machine, the transferred files or directories
are named using only the base name.
Therefore, each output file or directory must have a different name,
even if they originate from different paths.
For grid universe jobs other than Condor-C grid jobs,
files to be transferred
(other than standard output and standard error)
must be specified using transfer_output_files
in the submit description file, because automatic detection of new files
created by the job does not take place.
Here are examples to promote understanding of what files and
directories are transferred, and how they are named after transfer.
Assume that the job produces the following structure within the
remote scratch directory:

    o1
    o2
    d1 (directory)
        o3
        o4
If the submit description file sets
transfer_output_files = o1,o2,d1
then the files transferred back to the submit machine will be

    o1
    o2
    d1 (directory)
        o3
        o4
Note that the directory d1 and all its contents are specified,
and therefore transferred.
If the directory d1 is not created by the job before exit,
then the job is placed on hold.
If the directory d1 is created by the job before exit,
but is empty, this is not an error.
If, instead, the submit description file sets
transfer_output_files = o1,o2,d1/o3
then the files transferred back to the submit machine will be

    o1
    o2
    o3
Note that only the base name is used in the naming and placement
of the file specified with d1/o3.
The file transfer mechanism specifies file names and/or paths on
both the file system of the submit machine and on the
file system of the execute machine.
Care must be taken to know which machine, submit or execute,
is utilizing the file name and/or path.
Files in the transfer_input_files command
are specified as they are accessed on the submit machine.
The job, as it executes, accesses files as they are
found on the execute machine.
There are three ways to specify files and paths
for transfer_input_files:
Relative to the current working directory as the job is submitted,
if the submit command initialdir is not specified.
Relative to the initial directory, if the submit command
initialdir is specified.
Absolute.
Before executing the program, Condor copies the
executable, an input file as specified
by the submit command input,
along with any input files specified
by transfer_input_files.
All these files are placed into
a remote scratch directory on the execute machine,
in which the program runs.
Therefore,
the executing program must access input files relative to its
working directory.
Because all files and directories listed for transfer are placed into a single,
flat directory,
inputs must be uniquely named to
avoid collision when transferred.
A collision causes the last file in the list to
overwrite the earlier one.
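As a hypothetical example of such a collision (the directory names are made up for illustration):

```
# Both paths end in the base name "data", so only one file named data
# will exist in the flat scratch directory; the second one listed wins
transfer_input_files = dir_a/data, dir_b/data
```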
Both relative and absolute paths may be used in
transfer_output_files.
Relative paths are relative to
the job's remote scratch directory on the execute machine.
When the files and directories are copied back to the submit machine, they
are placed in the job's initial working directory as the base name of
the original path.
An alternate name or path may be specified by using
transfer_output_remaps.
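For instance, a hypothetical remap that renames an output file as it is copied back to the submit machine (the file name d1/o3 is borrowed from the earlier example):

```
transfer_output_files  = d1/o3
# Without a remap, the file would arrive under its base name, o3.
# This places it under a different name in the initial working directory.
transfer_output_remaps = "o3 = my-output.$(Process)"
```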
A job may create files outside the remote scratch directory
but within the file system of the execute machine,
in a directory such as /tmp,
if this directory is guaranteed to exist and be
accessible on all possible execute machines.
Condor will not automatically
transfer such files back after execution completes, nor will it clean
up these files.
Here are several examples to illustrate the use of file transfer.
The program executable is called my_program,
and it uses three command-line arguments as it executes:
two input file names and an output file name.
The program executable and the submit description file
for this job are located in directory
/scratch/test.
Here is the directory tree as it exists on the submit machine,
for all the examples:
/scratch/test (directory)
    my_program.condor (the submit description file)
    my_program (the executable)
    files (directory)
        logs2 (directory)
        in1 (file)
        in2 (file)
    logs (directory)
This first example explicitly transfers input files.
These input files to be transferred
are specified relative to the directory where the job is submitted.
An output file specified in the arguments command, out1,
is created when the job is executed.
It will be transferred back into the directory /scratch/test.
# file name:  my_program.condor
# Condor submit description file for my_program
Executable   = my_program
Universe     = vanilla
Error        = logs/err.$(cluster)
Output       = logs/out.$(cluster)
Log          = logs/log.$(cluster)

should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = files/in1,files/in2

Arguments    = in1 in2 out1
Queue
The log file is written on the submit machine, and is not involved
with the file transfer mechanism.
This second example is identical to Example 1,
except that absolute paths to the input files are specified,
instead of relative paths to the input files.
# file name:  my_program.condor
# Condor submit description file for my_program
Executable   = my_program
Universe     = vanilla
Error        = logs/err.$(cluster)
Output       = logs/out.$(cluster)
Log          = logs/log.$(cluster)

should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = /scratch/test/files/in1,/scratch/test/files/in2

Arguments    = in1 in2 out1
Queue
This third example illustrates the use of the
submit command initialdir, and its effect
on the paths used for the various files.
The expected location of the
executable is not affected by the
initialdir command.
All other files
(specified by input, output, error,
transfer_input_files,
as well as files modified or created by the job
and automatically transferred back)
are located relative to the specified initialdir.
Therefore, the output file, out1,
will be placed in the files directory.
Note that the logs2 directory
exists to make this example work correctly.
# file name:  my_program.condor
# Condor submit description file for my_program
Executable   = my_program
Universe     = vanilla
Error        = logs2/err.$(cluster)
Output       = logs2/out.$(cluster)
Log          = logs2/log.$(cluster)

initialdir   = files

should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = in1,in2

Arguments    = in1 in2 out1
Queue
Example 4 - Illustrates an Error
This example illustrates a job that will fail.
The files specified using the
transfer_input_files command work
correctly (see Example 1).
The relative paths to files in the
arguments command
cause the executing program to fail.
The file system on the submission side may utilize
relative paths to files,
however those files are placed into the single,
flat, remote scratch directory on the execute machine.
# file name:  my_program.condor
# Condor submit description file for my_program
Executable   = my_program
Universe     = vanilla
Error        = logs/err.$(cluster)
Output       = logs/out.$(cluster)
Log          = logs/log.$(cluster)

should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = files/in1,files/in2

Arguments    = files/in1 files/in2 files/out1
Queue
This example fails with the following error:
err: files/out1: No such file or directory.
Example 5 - Illustrates an Error
As with Example 4,
this example illustrates a job that will fail.
The executing program's use of
absolute paths cannot work.
# file name:  my_program.condor
# Condor submit description file for my_program
Executable   = my_program
Universe     = vanilla
Error        = logs/err.$(cluster)
Output       = logs/out.$(cluster)
Log          = logs/log.$(cluster)

should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = /scratch/test/files/in1, /scratch/test/files/in2

Arguments = /scratch/test/files/in1 /scratch/test/files/in2 /scratch/test/files/out1
Queue
The job fails with the following error:
err: /scratch/test/files/out1: No such file or directory.
This example illustrates a case
where the executing program creates an output file in a directory
other than within the remote scratch directory that the
program executes within.
The file creation may or may not cause an error,
depending on the existence and permissions
of the directories on the remote file system.
The output file /tmp/out1 is transferred back to the job's
initial working directory as /scratch/test/out1.
# file name:  my_program.condor
# Condor submit description file for my_program
Executable   = my_program
Universe     = vanilla
Error        = logs/err.$(cluster)
Output       = logs/out.$(cluster)
Log          = logs/log.$(cluster)

should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = files/in1,files/in2
transfer_output_files = /tmp/out1

Arguments    = in1 in2 /tmp/out1
Queue
This section describes Condor's behavior for some error cases
in dealing with the transfer of files.
Disk Full on Execute Machine
When transferring any files from the submit machine to the remote scratch
directory,
if the disk is full on the execute machine,
then the job is placed on hold.
Error Creating Zero-Length Files on Submit Machine
As a job is submitted, Condor creates zero-length files as placeholders
on the submit machine for the files defined by
output and error.
If these files cannot be created, then job submission fails.
This job submission failure avoids having the job run to completion,
only to be unable to transfer the job's output due to permission errors.
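The idea behind this early check can be sketched as follows; this is an illustrative model only, not Condor's actual implementation:

```python
import os
import tempfile

# Illustrative model (not Condor's code): try to create a zero-length
# placeholder at submit time so permission and path errors surface
# before the job runs to completion.
def can_create_placeholder(path):
    try:
        with open(path, "a"):
            pass
        return True
    except OSError:
        return False

work_dir = tempfile.mkdtemp()
ok = can_create_placeholder(os.path.join(work_dir, "out.1"))
bad = can_create_placeholder(os.path.join(work_dir, "no_such_dir", "out.1"))
```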
Error When Transferring Files from Execute Machine to Submit Machine
When a job exits, or potentially when a job is evicted from an execute
machine, one or more files may be transferred from the execute machine
back to the machine on which the job was submitted.
During transfer, if any of the following three similar types of errors
occurs, the job is put on hold:
- The file cannot be opened on the submit machine, for example
because the system is out of inodes.
- The file cannot be written on the submit machine, for example
because the permissions do not permit it.
- The write of the file on the submit machine fails, for example
because the system is out of disk space.
2.5.4.6 File Transfer Using a URL
Instead of file transfer that goes only between the submit machine
and the execute machine,
Condor has the ability to transfer files from a location specified
by a URL for a job's input file,
or from the execute machine to a location specified by a URL
for a job's output file(s).
This capability requires administrative set up,
as described in section&.
The transfer of an input file is restricted to
vanilla and vm universe jobs only.
Condor's file transfer mechanism must be enabled.
Therefore, the submit description file for the job will define both
should_transfer_files and when_to_transfer_output.
In addition, the URL for any file specified with a URL is
given in the transfer_input_files command.
An example portion of the submit description file for a job
that has a single file specified with a URL:
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = http://www.full.url/path/to/filename
The destination file is given by the file name within the URL.
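A quick sketch of that naming rule, using the placeholder URL from the fragment above: the destination file name is the final path component of the URL.

```python
import os.path
from urllib.parse import urlparse

# The destination file name is the last path component of the URL.
url = "http://www.full.url/path/to/filename"
dest = os.path.basename(urlparse(url).path)
```

Here dest resolves to "filename", so the transferred file appears under that name in the job's scratch directory.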
For the transfer of the entire contents of the output sandbox,
which are all files that the job creates or modifies,
Condor's file transfer mechanism must be enabled.
In this sample portion of the submit description file,
the first two commands explicitly enable file transfer,
and the added output_destination command provides
both the protocol to be used and the destination of the transfer.
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
output_destination = urltype://path/to/destination/directory
Note that with this feature, no files are transferred back to the
submit machine.
This does not interfere with the streaming of output.
If only a subset of the output sandbox should be transferred,
the subset is specified by further adding a submit command of the form:
transfer_output_files = file1, file2
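Putting these commands together, a hypothetical fragment that transfers only two files of the output sandbox to a URL destination (the urltype, paths, and file names are placeholders):

```
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
output_destination      = urltype://path/to/destination/directory
transfer_output_files   = file1, file2
```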
The requirements expression for a job must depend
on the should_transfer_files command.
The job must specify the correct logic to ensure that the job is matched
with a resource that meets the file transfer needs.
If no requirements expression is in the submit description file,
or if the expression specified does not refer to the
attributes listed below, condor_submit adds an
appropriate clause to the requirements expression for the job.
condor_submit appends these clauses with a logical AND, &&,
to ensure that the proper conditions are met.
Here are the default clauses corresponding to the different values of
should_transfer_files:
should_transfer_files = YES results in the addition of
the clause (HasFileTransfer).
If the job is always going to transfer files, it is required to
match with a machine that has the capability to transfer files.
should_transfer_files = NO results in the addition of
(TARGET.FileSystemDomain == MY.FileSystemDomain).
In addition, Condor automatically adds the
FileSystemDomain attribute to the job ClassAd, with whatever
string is defined for the condor_schedd to which the job is
submitted.
If the job is not using the file transfer mechanism, Condor assumes
it will need a shared file system, and therefore, a machine in the
same FileSystemDomain as the submit machine.
should_transfer_files = IF_NEEDED results in the addition of
(HasFileTransfer || (TARGET.FileSystemDomain == MY.FileSystemDomain))
If Condor will optionally transfer files, it must require
that the machine is either capable of transferring files
or in the same file system domain.
To ensure that the job is matched to a machine with enough local disk
space to hold all the transferred files, Condor automatically adds the
DiskUsage job attribute.
This attribute includes the total
size of the job's executable and all input files to be transferred.
Condor then adds an additional clause to the Requirements
expression that states that the remote machine must have at least
enough available disk space to hold all these files:
&& (Disk >= DiskUsage)
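As a worked example with assumed values, a submit description file containing requirements = Memory >= 1024 together with should_transfer_files = YES would produce a job Requirements expression of the form:

```
Requirements = (Memory >= 1024) && (HasFileTransfer) && (Disk >= DiskUsage)
```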
If should_transfer_files = IF_NEEDED and the job prefers
to run on a machine in the local file system domain
rather than transfer files,
but is still willing to run remotely and transfer files,
the following Rank expression works well:
rank = (TARGET.FileSystemDomain == MY.FileSystemDomain)
The Rank expression is a floating point value,
so if other items are considered in ranking the possible machines this job
may run on, add the items:
Rank = kflops + (TARGET.FileSystemDomain == MY.FileSystemDomain)
The value of kflops can vary widely among machines,
so this Rank expression will likely not behave as intended.
To place emphasis on the job running in the same file system domain,
but still consider floating point speed among the machines
in the file system domain,
weight the part of the expression that is matching the file system domains.
For example:
Rank = kflops + (10000 * (TARGET.FileSystemDomain == MY.FileSystemDomain))
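A small numeric sketch of why the weight helps; the kflops values below are assumed for illustration only:

```python
# Rank = kflops + 10000 * (domain match): with the weight, a machine in
# the local file system domain beats a faster out-of-domain machine,
# while kflops still breaks ties among machines within the domain.
def rank(kflops, same_domain):
    return kflops + 10000 * (1 if same_domain else 0)

local_slow = rank(1500, True)    # in-domain machine, modest speed
remote_fast = rank(9000, False)  # out-of-domain machine, faster
local_fast = rank(3000, True)    # in-domain machine, faster
```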
The environment under which a job executes often contains
information that is potentially useful to the job.
Condor allows a user to both set and reference environment
variables for a job or job cluster.
Within a submit description file, the user may define environment
variables for the job's environment by using the
environment command.
See within the condor_submit manual page at
section& for more details about this command.
The submitter's entire environment can be copied into the job
ClassAd for the job at job submission.
The getenv command within the submit description file
does this,
as described at section&.
If the environment is set with the environment command and
getenv is also set to true, values specified with
environment override values in the submitter's environment,
regardless of the order of the environment and getenv commands.
Commands within the submit description file may reference the
environment variables of the submitter as a job is submitted.
Submit description file commands use $ENV(EnvironmentVariableName)
to reference the value of an environment variable.
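For example, a hypothetical submit description file could place each run under the submitter's home directory (the directory layout here is assumed):

```
# Hypothetical fragment: $ENV(HOME) expands to the submitter's HOME
# environment variable at submit time.
initialdir = $ENV(HOME)/run_$(Process)
```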
Condor sets several additional environment variables for each executing
job that may be useful for the job to reference.
_CONDOR_SCRATCH_DIR
gives the directory
where the job may place temporary data files.
This directory is unique for every job that is run,
and its contents are deleted by Condor
when the job stops running on a machine, no matter how the job completes.
_CONDOR_SLOT
gives the name of the slot (for SMP machines) on which the job is run.
On machines with only a single slot, the value of this variable will be
1, just like the SlotID attribute in the machine's ClassAd.
This setting is available in all universes.
See section& for more details about SMP
machines and their configuration.
_CONDOR_VM
equivalent to _CONDOR_SLOT described above, except that it is
only available in the standard universe.
NOTE: As of Condor version 6.9.3, this environment variable is no longer
used. It will only be defined if the ALLOW_VM_CRUFT configuration
variable is set to True.
X509_USER_PROXY
gives the full path to the X.509 user proxy file if one is
associated with the job.
Typically, a user will specify
x509userproxy in the submit description file.
This setting is currently available in the
local, java, and vanilla universes.
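Returning to _CONDOR_SCRATCH_DIR, a job might place its temporary files as sketched below; the fallback to the system temporary directory is this sketch's assumption for testing outside of Condor:

```python
import os
import tempfile

# Use Condor's per-job scratch directory when it is set; otherwise fall
# back to the system temporary directory (for testing outside Condor).
scratch = os.environ.get("_CONDOR_SCRATCH_DIR", tempfile.gettempdir())
work_file = os.path.join(scratch, "work.dat")
with open(work_file, "w") as f:
    f.write("intermediate data\n")
```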
If executables are available for the different platforms of machines
in the Condor pool,
Condor can be allowed the choice of a larger number of machines
when allocating a machine for a job.
Modifications to the submit description file allow this choice
of platforms.
A simplified example is a cross submission.
An executable is available for one platform, but
the submission is done from a different platform.
Given the correct executable, the requirements command in
the submit description file specifies the target architecture.
For example, an executable compiled for a 32-bit Intel processor running
Windows Vista, submitted from an Intel architecture machine running Linux,
would add the requirement
requirements = Arch == "INTEL" && OpSys == "WINNT60"
Without this requirement, condor_submit
will assume that the program is to be executed on
a machine with the same platform as the machine where the job
is submitted.
Cross submission works for all universes except scheduler and local.
See section& for how matchmaking works in the
grid universe.
The burden is on the user to both obtain and specify
the correct executable for the target architecture.
To list the architecture and operating systems of the machines
in a pool, run condor_status.
A more complex example of a heterogeneous submission
occurs when a job may be executed on
many different architectures to gain full
use of a diverse architecture and operating system pool.
If the executables are available for the different architectures,
then a modification to the submit description file
will allow Condor to choose an executable after an
available machine is chosen.
A special-purpose Machine Ad substitution macro can be used in
attributes in the submit description file.
The macro has the form
$$(MachineAdAttribute)
The $$() informs Condor to substitute the requested
MachineAdAttribute
from the machine where the job will be executed.
An example of the heterogeneous job submission
has executables available for two platforms:
RHEL 3 on both 32-bit and 64-bit Intel processors.
This example uses povray
to render images using a popular free rendering engine.
The substitution macro chooses a specific executable after
a platform for running the job is chosen.
These executables must therefore be named based on the
machine attributes that describe a platform.
The executables named
povray.LINUX.INTEL
povray.LINUX.X86_64
will work correctly for the macro
povray.$$(OpSys).$$(Arch)
The executables or links to executables with this name
are placed into the initial working directory so that they may be
found by Condor.
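The naming scheme can be sketched by applying the substitution by hand for the two platforms in this example:

```python
# Resolve povray.$$(OpSys).$$(Arch) against the two machine ClassAds
# used in this example.
machine_ads = [
    {"OpSys": "LINUX", "Arch": "INTEL"},
    {"OpSys": "LINUX", "Arch": "X86_64"},
]
names = ["povray.{OpSys}.{Arch}".format(**ad) for ad in machine_ads]
```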
A submit description file that queues three jobs for this example:
####################
# Example of heterogeneous submission
####################
Universe     = vanilla
Executable   = povray.$$(OpSys).$$(Arch)
Log          = povray.log
Output       = povray.out.$(Process)
Error        = povray.err.$(Process)
Requirements = (Arch == "INTEL" && OpSys == "LINUX") || \
               (Arch == "X86_64" && OpSys == "LINUX")
Arguments    = +W1024 +H768 +Iimage1.pov
Queue
Arguments    = +W1024 +H768 +Iimage2.pov
Queue
Arguments    = +W1024 +H768 +Iimage3.pov
Queue
These jobs are submitted to the vanilla universe
to assure that once a job is started on a specific platform,
it will finish running on that platform.
Switching platforms in the middle of job execution cannot
work correctly.
There are two common errors made with the substitution macro.
The first is the use of a non-existent MachineAdAttribute.
If the specified MachineAdAttribute does not
exist in the machine's ClassAd, then Condor will place
the job in the held state until the problem is resolved.
The second common error occurs due to an incomplete job set up.
For example, the submit description file given above specifies
three available executables.
If one is missing, Condor reports back that an
executable is missing when it happens to match the
job with a resource that requires the missing binary.
Jobs submitted to the standard universe may produce checkpoints.
A checkpoint can then be used to start up and continue execution
of a partially completed job.
For a partially completed job, the checkpoint and the job are specific
to a platform.
If the job migrates to a different machine, correct execution requires
that the platform remain the same.
In previous versions of Condor, the author of the heterogeneous
submission file would need to write extra policy expressions in the
requirements expression to force Condor to choose the
same type of platform when continuing a checkpointed job.
However, since it is needed in the common case, this
additional policy is now automatically added
to the requirements expression.
The additional expression is added
provided the user does not use
CkptArch in the requirements expression.
Condor will remain backward compatible for those users who have explicitly
specified CkptRequirements (implying use of CkptArch)
in their requirements expression.
The expression added when the attribute CkptArch is not specified
will default to
# Added by Condor
CkptRequirements = ((CkptArch == Arch) || (CkptArch =?= UNDEFINED)) && \
((CkptOpSys == OpSys) || (CkptOpSys =?= UNDEFINED))
Requirements = (<user specified policy>) && $(CkptRequirements)
The behavior of the CkptRequirements expressions and its addition to
requirements is as follows.
The CkptRequirements expression guarantees correct operation
in the two possible cases for a job.
In the first case, the job has not produced a checkpoint.
The ClassAd attributes CkptArch and CkptOpSys
will be undefined, and therefore the meta operator (=?=)
evaluates to true.
In the second case, the job has produced a checkpoint.
The Machine ClassAd is restricted to require further execution
only on a machine of the same platform.
The attributes CkptArch and CkptOpSys
will be defined, ensuring that the platform chosen for further
execution will be the same as the one used just before the
checkpoint.
Note that this restriction of platforms also applies to platforms where
the executables are binary compatible.
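The two cases can be modeled in a few lines; here None stands in for the ClassAd UNDEFINED value and for the =?= comparison against it (a modeling assumption, not full ClassAd semantics):

```python
# Model of CkptRequirements:
#   ((CkptArch == Arch) || (CkptArch =?= UNDEFINED)) &&
#   ((CkptOpSys == OpSys) || (CkptOpSys =?= UNDEFINED))
def ckpt_ok(ckpt_arch, ckpt_opsys, arch, opsys):
    arch_ok = (ckpt_arch == arch) or (ckpt_arch is None)
    opsys_ok = (ckpt_opsys == opsys) or (ckpt_opsys is None)
    return arch_ok and opsys_ok

no_ckpt = ckpt_ok(None, None, "INTEL", "LINUX")       # no checkpoint yet
same = ckpt_ok("X86_64", "LINUX", "X86_64", "LINUX")  # same platform
diff = ckpt_ok("X86_64", "LINUX", "INTEL", "LINUX")   # different platform
```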
The complete submit description file for this example:
####################
# Example of heterogeneous submission
####################
Universe     = standard
Executable   = povray.$$(OpSys).$$(Arch)
Log          = povray.log
Output       = povray.out.$(Process)
Error        = povray.err.$(Process)
# Condor automatically adds the correct expressions to insure that the
# checkpointed jobs will restart on the correct platform types.
Requirements = ( (Arch == "INTEL" && OpSys == "LINUX") || \
                 (Arch == "X86_64" && OpSys == "LINUX") )
Arguments    = +W1024 +H768 +Iimage1.pov
Queue
Arguments    = +W1024 +H768 +Iimage2.pov
Queue
Arguments    = +W1024 +H768 +Iimage3.pov
Queue
htcondor-admin@cs.wisc.edu
