Other Compiler Switches

-x Compiler Switches

The compilers have a number of supported and unsupported switches that are used internally by the compilers. These are referred to as xflags, and this section documents them.

These xflags can be invoked in two ways.

  1. Use -x <number> <value> where <number> is a decimal number and ranges from 0 to 127. For example, to turn on information generation during compilation use: -x 0 2.

  2. The compiler may use symbolic names to map to certain xflags. For example, the -alpha switch is identical to using -x 25 15 and -beta is identical to using -x 25 240.

The first number in an xflag invocation is an array index into the array flg.x[128]. The second number is used to mask the value from that array. This sets up the xflag switches to be easily used as bit masks. The macro, XBIT(n,m), should be used to test an xflag where n is the xflag number and m is the mask.

The following is a brief summary of the reserved xflags. Note that not all are implemented.

flg.x[] - Values and Meanings

0:

Used to turn on compiler information reporting (to stderr):

0x01: compilation statistics

0x02: general information on loops

0x04: (pgc dev only) - prototypes for function definitions

0x08: inlining information (TBD)

0x10: code block sizes in cycles

0x20: put symbol comments into assembly file

0x40: put verbose comments into assembly file (includes ili)

0x80: put misc info out to stderr

0x100: OpenMP information.

0x200: Variable information within parallel regions.

0x400: Integrate info messages with source code and save to a .info file

0x800: Info on data flow optimizations (e.g. CSE, PRE, hoisting)

0x1000: issue variable use before def warning messages.

0x2000: for ST100, override default and put command line into assembly file

0x4000: Compute / Data Intensity output

0x8000: issue unrecognized pragma/directive messages (development/DEBUG versions only)

0x10000: Info on control flow optimizations (e.g. predication)

0x20000: describe when loops can be parallelized

0x40000: describe vector streaming calls

0x80000: produce negative loop information

0x100000: describe when barriers are deleted

0x200000: minfo messages for IPA optimizations

0x400000: minfo messages for Unified Binary optimizations

0x800000: minfo messages for -Mautoinline

0x1000000: don’t demangle C++ function names in minfo messages

0x2000000: Output CCFF to STDERR

0x4000000: Use CCFF messages

0x8000000: Output minfo/CCFF to stderr when lowering (pgf901)

0x20000000: Don’t emit error message text for select error messages.

0x40000000: Error messages in dev studio format:

block1 : block2 : block3
block1 : block3
block1 contains <filename>(line number); block2 contains either
error or warning; block3 contains message text.

In addition, the line number in block1 can also include the column number, <filename>(n,m). In all cases the line number is required.

0x80000000: L-suffixed constants (dev pgc only).

1:

Used to turn on producing negative and other information (stderr):

0x01: Enable enhanced error messages.

0x02: general information on loop optimizations not performed

0x04: For ST100, describe when (and why) dsp intrinsics are NOT expanded.

0x08: For IPA negative information.

0x10: For inliner failures

0x800: Give failure information on data-flow optimizations (e.g. CSE, PRE, hoisting).

0x10000: Give failure information on control-flow optimizations (e.g. predication).

0x20000: describe when loops can not be parallelized

0x40000: For PFO.

0x80000: For PRE.

2:

Used to turn on various non-standard optimizations. Note that combinations can be used to get a desired effect.

0x01: C dummy args are treated with the same copyin/copyout semantics as Fortran dummy args.

0x02: C local ptrs do not conflict with any other local variables.

0x04: C static ptrs do not conflict with any other static variables.

0x08: C global ptrs do not conflict with any other global variables.

0x10: C malloc ptrs do not conflict with any other variables.

0x20: C struct ptrs do not conflict with other struct ptrs of the same type unless they are either the same structure or the same field. In other words, ptrs to structures are not skewed.

0x80: Relax dependence checking for unknown dependence relations.

0x100: C private ptrs do not conflict with any other private variables.

0x200 In dependence testing, assume 0/x and x/x are 0 and 1, even when x is an unknown nonconstant value.

0x400 Turn off ANSI C pointer rules. These rules are implemented in dependence testing and alias analysis, which assume that a pointer to type A cannot conflict with a pointer to type B.

0x800 Disable checking if a member reference is derived from a pointer dereference.

0x1000 Compile (keep) C/C++ static functions that are never called and whose addresses are never taken

0x2000 For x86-64 extended asm, do not consider PDALNG field on pointer input/output item. Otherwise, look at PDALNG field to see if it’s 16-byte aligned. If so, we can generate movapd for the input/output item.

0x4000 Set ADDRTKNP bit for C/C++ structure moves.

0x8000 AVAILABLE (Was “safer interpretation of -Msafeptr”)

0x10000 C++ class type-based disambiguation rule. Different class types do not conflict. Class objects do not conflict with their pointers.

0x20000 Do not generate member NMEs for union members (just elide the member).

0x40000 Do not mangle member names with line number information, so that IPA inlining can match class types.

0x100000 Fill in BIH_LINENO with line numbers on inlined blocks.

0x200000 don’t use SMOVEI/SMOVES instead of SMOVE in exp_rte.c

0x400000 F90 pointer optimizations.

0x800000 Use GSMOVE in exp_rte.c

0x1000000 Don’t expand SMOVEs (struct moves) in a single IL_SMOVEI/IL_SMOVES

3:

Used to turn on/off various dual-op/dual-inst/pipelined ops.

0x01: turn on pipelined operations

0x02: turn on dual-instructions

0x04: turn off pipelined operations

0x08: turn off dual-instructions on i860 or swpipe on ST100

0x10: block level heuristics for above (default is file level)

0x20: loop level heuristics for above (dual mod if 1 column loop)

0x40: multi-block loop level heuristics for above

0x80: function level heuristics for above

0x100: Use pipelined moves instead of static dp moves.

0x200: For LLVM, use ymm registers on x86 to perform LRE

0x400: Don’t perform reduction compression

4:

Used to alter optimization/code generation techniques.

0x01: Use pipelined operations

0x02: Use dual-instruction mode

0x04: Enable multiple fp registers to be cached (used on the x86)

0x08: used in cgsched.c

0x10: used in cgsched.c

0x20: used in cgsched.c

0x40: used in cgsched.c

0x80: used in cgsched.c

0x1000 Generate a null pointer store fault.

0x2000 Generate a divide by zero fault.

0x4000 Generate the above or below fault on the third function, not the first.

0x8000 Generate an infinite loop fault

0x100000 In Fortran front end, disable any dependence checking, and assume no forall or array assignment needs a temp.

0x200000 In Fortran front end, disable early flow analysis to determine if the forall or array assignment needs a temp.

0x400000 In Fortran front end, disable kernels rescoping.

0x800000 Initialization of an array with zeros can sometimes be done directly with machine instructions, rather than a loop nest (array assignment collapsing). Disable array assignment collapsing optimization.

5:

CG uses

0x01: Ignore deletable store information. Hammer - disable register allocation at -O0.

0x02: Perform hardcoded register allocation in CG

0x04: used in cgsched.c

0x08: used in cgsched.c

0x10: used in cgsched.c

0x20: used in cgsched.c

0x40: used in cgsched.c

0x80: used in cgsched.c

6:

Inhibit optimizations

0x01: global constant propagation (opt >= 2)

0x02: store deletion due to constant propagation (opt >= 2)

0x04: copy LOOP ili (bla instruction) (opt >= 2)

0x08: br_to_br (opt >= 2)

0x10 remove useless register moves (opt >= 2)

0x20 loop live variable checks (opt >= 2)

0x40 replacement of data initialized local fortran variables or data initialized const C variables (opt >= 2)

0x80 using function return registers for global register assignments (opt >= 1, and only when function returns a value)

0x100 copy propagation where value already loaded; floating point constant propagation.

0x200 prevent copy propagation of values with any floating point compare (x86)

0x400 inhibit optimization to replace divide-by-constant by multiplication

0x800 inhibit floating point constant prop. (like 0x100), but this still allows floating point global regs.

0x1000 inhibit multiply-accumulate/subtract optimization (e.g. x += a * b).

0x2000 inhibit remove_jump() optimization

0x4000 store deletion within an extended basic block.

0x8000 SAVE (SC_STATIC) removal (invarif.c:save_elim(), fortran-only).

0x10000 inhibit optimizations based on the LSCOPE f90 symbol table flag.

0x20000 inhibit optimizations of ‘pure’ functions.

0x40000 Inhibit block splitting at inline points in expander followed by block unsplitting call in main.

0x80000 Inhibit ‘equals’ propagation.

0x100000 TEMPORARY – topsort the loops

0x200000 don’t defer compilation of routines based on whether they are called

0x400000 don’t remove unreachable blocks in fgraph

0x20000000 Do max/min pattern, but inhibit all other transformations.

0x40000000 TEMPORARY – don’t check address of threadprivate variables in recog; Q&D QA to work-around regressions.

0x80000000 perform analysis only: inhibit all transformations when building the flowgraph, discovering loops, and building the flow information. Generally, the transformations should not occur when the cg is calling the optimizer’s functions.

7:

Inhibit optimizations

0x01: non-constant def copy (opt >= 2)

0x02: terminal function optimization (opt >= 1)

0x04: a = a (opt >= 2)

0x08: separate base pointers (opt >= 3)

0x10: glue arg copy (opt >= 2)

0x20: non-glue arg copy (opt >= 2)

0x40: use scratch as globals if calls present

0x80: copy arg/glue if loops present

0x100: member store is within its boundaries (doesn’t mark parent structure as ‘stored’).

0x200: invariant hoisting of the LPSTRT & LDLPCT ili (ST100, opt >= 2).

0x400: hoisting of the LPSTRT & LDLPCT ili if there exists a call in the preheader block to which the ili are hoisted (ST100, opt >= 2).

0x800: Disable use of scratch as globals in terminal (leaf) functions.

0x1000: Inhibit global sign-extension elimination.

0x2000: For ST1xx, disable the `HRA16’ (`Holes Register Allocator for GP16’) GP16 register allocator enhancements.

0x4000: Inhibit setting the XMMSAFE flag for functions.

0x8000: Enable scalar replacement.

0x10000: Enable transitive loop invariant motion.

0x20000: Stress transitive loop invariant motion, i.e., disable heuristics to control register pressure.

0x40000: Enable class/struct based mod/div replacement by reciprocal code emit. There were problems with this, so we’ve inverted the flag.

0x80000: Disable invariant searching of ILI_ALT.

0x100000: Disable replacement of data initialized local fortran scalar variables with assignment statement (opt >= 2)

0x200000: Disable optimization where we set GSCOPE on Fortran host subprogram declared symbols only when they are used in a contains subprogram.

0x400000: Disable dissociation of local C/C++ struct/class instance members, an optimization for the modern STL iterator implementation to get sole members into registers.

0x800000: Disable clearing the “address taken” flag on local symbols whose addresses are taken only in normal load/store sequences.

0x1000000: Disable more aggressive expression rewriting with store forwarding to loads.

0x2000000: Disable NME rewriting to clean up after constant propagation.

0x4000000: Don’t return fast from forward() if other xflags have disabled everything.

0x8000000: Disable interval analysis in hlscrub.c.

0x10000000: Disable block duplication in hlscrub.c.

0x20000000: Disable dead code elimination in hlscrub.c.

0x40000000: Disable conversion of pointer inequality loop tests in hlscrub.c.

0x80000000: Disable CSE replacements in hlscrub.c.

8:

Inhibit optimizations

0x01: loop count (opt >= 2)

0x02: store deletion (opt >= 2)

0x04: block merging (opt >= 2)

0x08: global register assignment (opt >= 1) (also disables loop count)

0x10: recurrence relations (opt >= 3)

0x20: invariant array addresses (opt >= 3)

0x40: last value computations (opt >= 2)

0x80: loop count if non-induction use occurs (opt >= 2)

0x100: replace not-equal test of loop control variable with less than or greater than test (opt >= 2)

0x200: reducing the number of induction variables (opt >= 2)

0x400 global register allocation (opt >= 2) using live ranges (both integer and floating point globals)

0x800 allow fp caching in x86 if not innermost loop

0x1000 turn off fp caching in x86

0x2000: recognizer.

0x4000: finding common base pointers.

0x8000: ignoring a short/char extend when searching for basic induction variables.

0x10000: searching for stores via the same pointer (invar.c) when the pointer is ‘safe’.

0x20000: allowing QJSRs as candidates for invariancy (ST100) – should always be safe but there may be regarg problems introduced by hoisting QJSRs.

0x40000: When finding a common base pointer (recog.c), select the use whose linear reference has offset 0.

0x80000: hw looping.

0x100000: allocating hw loop registers with respect to the total number of loops in a loop nest; if XBIT is set, allocate hw loop registers with respect to the ‘level’ of a loop, i.e., inner-to-outer.

0x200000: hw looping for ‘while’ loops.

0x400000: hw looping for loops containing calls.

0x800000: Inhibit use of H/W loop reload registers (ST1xx)

0x1000000: Inhibit signextend elimination (ST1xx)

0x2000000: Inhibit detecting PTRSAFE members and POINTER members in invar (PGF90)

0x4000000: Inhibit recognition of loops with constant loop count of 1

0x8000000: Inhibit hlinduc0:memset/memzero/memcopy idiom recognition.

0x10000000: Inhibit hlinduc0:memcopy idiom recognition

0x20000000: Disable inhibiting strength reduction for address-mode expressions

0x40000000: reserved

0x80000000: inhibit iltutil.c:merge_bih() - WARNING: in certain cases, merging must occur to complete an optimization, e.g. br_flatten().

9:

A non-zero value invokes the loop unroller. If the value is other than 1, it represents the number of times to unroll loops, or the maximum iteration count when completely unrolling a loop.

10:

Number of unrolls (# of loop bodies) of a loop with non-constant iteration count.

11:

Unrolling

0x01: Enable completely unrolling a loop when the UNROLL directive is used.

0x02: Enable unrolling a loop by a factor specified via the UNROLL(n) directive.

0x04: (I386,X86_64) Ignore the check of the number of variable strides.

0x08: Inhibit completely unrolling outer loops.

0x10: (I386,X86_64) Don’t reduce the scoring by a factor of two when there are only variable strides.

0x20: (I386,X86_64) Ignore the check of the number of variable strides when attempting to completely unroll a loop.

0x40: (I386,X86_64) Ignore the check of the number of nested invariant array references when attempting to completely unroll a loop.

0x80: (I386,X86_64) Ignore the check of the number of variable strides from an innerloop when attempting to completely unroll its containing (outer) loop.

0x100: (I386,X86_64) Ignore the check of the number of nested invariant array references from an innerloop when attempting to completely unroll its containing (outer) loop.

0x200: Enable unrolling of multi-block loops. The unroll count is initially default 4 or flg.x[10] if that is set.

0x400: Disable unrolling of a loop marked with the NOUNROLL directive.

0x800: Do not attempt to increase the threshold for completely unrolling innermost multi-block loops.

12:

Inhibit local optimizations

0x01: short bte/btne branching

0x02: Change float 0.0 compares so that the compare is done in integer unit.

0x04: inhibit elimination of redundant float register movement for SNGL or DBLE casts into argument registers or global registers.

0x08: inhibit branch to bla and branch to branch optimization inside linearizer.

0x10: inhibit st_sta_ld pointer precedence checking.

0x20: inhibit ulshifti followed by lshifti folding (or vice versa).

0x40: inhibit ANDHI followed by BIEQI/BINEI folding into a BEQANDHI.

0x80: Inhibit special treatment of ICJMPZ pointing to a ISUB or ISUBI.

0x100: Inhibit BIH_RGSET(bih) register optimization and just use curr_entry->regset.

0x200: Inhibit moving of individual members of a structure.

0x400: Inhibit deletion of odd global reg obtained from -x 12 256

0x800: Inhibit LDINC/STINC optimizations.

0x1000: Inhibit replacement of uplevel variable address load optimization for llvm target.

13:

Used to turn on experimental inliner techniques.

0x01: array formal parameters replaced with pointers; expressions as arguments allowed.

0x02: Turn off CG checking of Fortran inlined SC_BASED variable dependency checking.

0x04: used in inliner.c

0x08: Suppress accelerator error messages with -Mextract.

0x10: Only available in dev builds (under #if DEBUG). Calls inline_mulh to inline IMULH, UIMULH, KMULH, and UKMULH calls into the appropriate ILI to get the upper half of integer/i8 signed/unsigned multiplies.

0x20: Replace memcpy/memset with faster hammer __c_mcopy1/__c_mset1 calls.

0x40: Replace memset of value 0 with __c_mzero1 on hammer.

0x800 AVAILABLE

0x1000 When the extents of the dummy array and actual argument do not match, linearize the subscript expressions; this amounts to generating the ilm, INLELEM. Normally, the dummy array is expressed as a cray pointee and its corresponding pointer is assigned the address of the actual argument.

0x8000 Call-site inlining: inliner will inline those call sites where ipa auto-inlining has decided to inline. This is in contrast to inlining all call sites for a given callee in a function if ipa finds at least one call site beneficial to inline.

0x20000 Replace all memset with __c_mset1 in the fast_libc_calls(). This is only for an experimental purpose, not for production.

0x20000000 Enable inlining into OpenACC host_data regions.

0x40000000 Enable inlining into OpenMP task regions.

0x80000000 turn on alpha-level experimental features.

14:

Extractor/Inliner.

0x01: Require actual & dummy arrays to match in type, rather than just the size and alignment of their base types.

0x02: Don’t perform the optimization when ‘&var’ is an actual argument. The optimization replaces the dereference of the formal argument with ‘&var’.

0x04: Don’t reuse inliner temps across inlinings; this allows more precise IPA pointer target analysis (C)

0x08: Don’t extract this function (set by pragma noinline)

0x10: Run the extractor in the compiler itself; this is used for one-pass IPA and IPA-driven inlining.

0x20: Do inlining during the expand phase, instead of during the parse phase. This allows multiple levels of inlining during the compiler without multiple levels of extraction.

0x40: Leave the original names during inlining, instead of changing all names to ..inline

0x80: For Fortran, use IM_FARG for arguments

0x100: Compress the extract file, using lz.c

0x200: Share inliner temps for local variables from multiple inlines of the same function (fortran); hopefully, in the near future, will be reversing the sense of this XBIT.

0x400: For C, USED in inliner.c

0x800: inliner.c PGCPLUS decode_identifer???

0x1000: Don’t automatically create a new ili block if there are calls.

0x2000: Do not mark ‘inline’ functions as ‘static’ (this-file-only). The default is to treat inline like static.

0x4000: Used with IPA, allow extracting and inlining of functions or subprograms with C statics or Fortran SAVE.

0x8000: Do not automatically mark static functions as this-file-only, if extracting for IPA.

0x10000: Disable global inliner

0x20000: Enable global ILM module, which reads in all ILMs at once

0x40000: In the inliner, try to reuse struct/union datatypes and member symbols

0x80000: In the inliner, implement ‘small function’ heuristic

0x100000: Apply compression to the inline file when extracting ‘inline’ keyword functions.

0x200000: Extract and inline functions with the ‘inline’ keyword.

0x400000: Used with -x 14 0x200000, extract and inline all functions

0x800000: Used with -x 14 0x200000, when extracting functions with the inline keyword, save the extract file, named EXFILE (for debugging).

0x1000000: Apply libc memset() inlining.

0x2000000: For C/C++ extractor, extract ALL symbols, change language from C or D to E

0x4000000: For Fortran inliner, don’t inline if we must reshape array arguments with a Cray-pointer style based array.

0x8000000: For C/C++, also inline IPA-discovered ‘tiny’ routines.

0x10000000: for C/C++, increase the block ilm limit from 60,000 to 90,000

0x20000000: for C/C++, don’t extract routines with the INLINE_THIS_FILE_ONLY flag set. This is useful for extracting for libraries that we only are going to inline across files, like the libstd or libcpp routines.

0x40000000: Allow IPA-driven inlining of file static functions across files in some cases.

0x80000000: Don’t allow inlining functions into parallel regions.

0x10000000: Allow functions with statics in C and SAVE in Fortran to be inlined with Minline.

15:

ILI strength reductions or transformations

0x01: Compute ‘x/y’ as ‘x *(1.0/y)’, where x is not a constant & y is a constant; also set by -Mnouniform.

0x02 Compute ‘x/y/z’ as ‘x/(y*z)’.

0x04 Compute ‘x/y’ as ‘x *(1.0/y)’, (if not IEEE switch & -Mfprelaxed=div)

0x08 Disable the sincos transformation

0x10 Relaxed fpmath. Enables a set of operations that can be performed using various methods, such as Newton’s method, that provide reasonable approximations to the actual results.

0x20 Do not check cpu type for relaxed fpmath.

0x40 call mkfunc() instead of mkfunc_cncall() when creating functions as ili replacements.

0x80 Don’t transform ‘(double)x <relop> y’ into ‘x <relop> yy’, where y is a double constant and can be exactly represented as the float yy.

0x100 Inhibit combining IAMV/KAMV in the operands of an AADD (fortran only).

0x200 -Mnouniform - do not require fp transformations/optimizations to be uniform across simd and scalar generated code; e.g., x/constant -> x * (1.0/constant); the vectorizer may hoist an invariant reciprocal, but the residual will perform a divide; undo of -Mfprelaxed=div if the recip only has one use, etc.

0x400 -Mfprelaxed=intrinsic

0x800 Disable the generation of [SD]CMPLXDIV ILIs, which perform complex division using the new representation of complex data types by calling the fastmath routines “__f[sv][cz]_div”.

0x1000 Disable sorting of the ILI free list after garbage collection.

0x2000 AVAILABLE

0x4000 AVAILABLE

0x8000 AVAILABLE

0x10000 Disable the Newton’s appx for single precision sqrt

0x20000 Disable the Newton’s appx for single precision recip sqrt

0x40000 Disable the Newton’s appx for single precision divide

0x80000 AVAILABLE

0x100000 AVAILABLE

0x200000 AVAILABLE

0x400000 AVAILABLE

0x800000 AVAILABLE

0x1000000 AVAILABLE

0x2000000 AVAILABLE

0x4000000 Do not use the vex/fma4 fast math naming conventions.

0x8000000 Inhibit IEEE compare semantics unless -Kieee is present

0x10000000 Compute divide using the approximating instruction.

0x20000000 Compute sqrt using the approximating instruction.

0x40000000 Compute rsqrt using the approximating instruction.

0x80000000 Experimental ili transformations

16:

alternate code for vectorization; vectorized code is executed if count is greater than n, i.e., the value of x[16].

17:

alternate code for software pipelining; pipelined code is executed if count is greater than n, i.e., the value of x[17].

18:

alternate code for unrolling; completely unrolled code is executed if count is less than or equal to n, i.e., the value of x[18].

19:

Modify optimizations (pragmas/directives)

0x01 noeqvchk. Don’t check equivalences for data dependencies.

0x02 nolstval. Don’t compute last values.

0x04 split. can split subroutine/function calls from loop.

0x08 notransform (no hlvect); also novector sets this bit

0x10 norecog (no llvect); also novector sets this bit

0x20 noswpipe (no recognize)

0x40 nostream

0x80 noinvarif. Don’t perform loop invariant conditional optimizations.

0x100 independent loop (forall-independent loop).

0x200 don’t perform tail recursion elimination

0x400 perform loop vectorisation.

0x800 Perform zero trip elimination - will we ever be able to switch the sense?

0x1000 Allow an induction variable with a nonconstant stride to be used to compute a loop count.

0x2000 is_invariant:always_executed() - when is_invariant() is called, the default is to assume that the fg node containing the ili is always executed; if the XBIT is set, assume that the fg node is not always executed.

0x4000 induc.c:while_repl() - allow calls to be present when attempting to use hw looping for while loops.

0x8000 assume that induc.c:max_loop_count() cannot determine the maximum value of a loop count.

0x10000 assume that the loop count after unrolling, returned by unroll.c:unrolled_lpcnt(), is large.

0x20000 don’t reassociate adds/mults in the front-end

0x40000 Allow ‘extended range loops’ (a node is not within the lexical scope of the head and tail nodes) to be countable. If these types of loops must be allowed by default, detection needs to be added to the vectorizer and other high level opts (unrolling).

0x80000 Change zero-trip checks to use the ST122c ‘skiplp’ instruction.

0x100000 Assume addresses of dummy array arguments & allocatables/pointers are valid.

0x200000 Don’t allow 64-bit int variables as induction variables (TEMPORARY)

0x400000 Rely on the ADDRTKN flag of static variables being set by the front-end; phases after the front-end can check the ADDRTKN flag of statics (PGC).

0x800000 assume that the subscripts to invariant array references which appear in blocks that do not dominate the tail of the loop will not cause an illegal address to be generated.

0x1000000 only allow reassociation if the terms are variables or constants. Reassociation is disabled if XBIT(19,0x20000) is set.

0x2000000 Inhibit prefetching in induc.

0x4000000 Don’t assume guarded invariant floating point expressions are valid.

0x8000000 Disable replacing induc’s loop count.

0x10000000 Turn off tail recursion for X86_32,X86_64

0x20000000 Select the prefetching in induc.c using the implementation which integrates both inductive pointers and array address expressions.

0x40000000 Disable prefetching for indirect loads.

0x80000000 USED.

20:

Used to affect exception handling:

0x01: hw has exceptions in 21/22 turned off (default is on)

0x02: compiler turns off exceptions in 21/22 on program level

0x04: compiler turns off exceptions in 21/22 on file level

0x08: compiler turns off exceptions in 21/22 on function level (TBD)

0x10: compiler turns off exceptions in 21/22 on block level (TBD)

21:

Used to affect exception handling: Active status of individual fp exceptions.

0x01: all fp exceptions active

0x02: divide by zero (DIVZ)

0x04: fp overflow (FOVF)

0x08: fp underflow (FUNF)

0x10: fp invalid input (src denormalized, NaN, or inf)

0x20: fp inexact result (pipe or result denorm, NaN)

0x40 used in cgutil.c for n10

0x80 used in cgutil.c for n10

22:

Used to affect exception handling: Active status of individual integer exceptions.

0x01: all int exceptions active

0x02: divide by zero (DIVZ)

0x04: int overflow (FOVF)

0x08: int underflow (FUNF)

23:

Used to affect exception handling (fsr):

0x01: FTE (floating point trap enable), no flushz

0x02: FTE, TI, no flushz

0x04 FTE, flushz

0x08 no FTE, flushz

24:

Used to affect exception handling (traps)

0x01: -Ktrap=fp (ABI systems only)

0x02 -Ktrap=align (ABI systems only)

0x04 All normal calls are followed by a call to a system routine that only modifies R30/R31.

0x08 Ktrap=inv (x86-FCW invalid operation)

0x10 Ktrap=denorm (x86-FCW denormalized operand)

0x20 Ktrap=divz (x86-FCW zero divide)

0x40 Ktrap=ovf (x86-FCW overflow)

0x80 Ktrap=unf (x86-FCW underflow)

0x100 Ktrap=inexact (x86-FCW precision)

25:

Experimental features of compilers.

0x01: Turn on alpha-level experimental front-end features.

0x02: Turn on alpha-level experimental vectorizer features.

0x04: Turn on alpha-level experimental optimizer features.

0x08: Turn on alpha-level experimental code generation.

0x10: Turn on beta-level experimental front-end features.

0x20: Turn on beta-level experimental vectorizer features.

0x40: Turn on beta-level experimental optimizer features.

0x80: Turn on beta-level experimental code generation.

26:

Modify ILI. Was pipe flushing (deprecated)

0x01: TEMPORARY - enable new math names for complex routines under development (XBIT(164,0x800000) must also be set). Was “Flush pipes in minimal fashion” (deprecated)

0x02: When using the new math naming scheme for scalar routines, follow the ‘vector’ ABI instead of the C ABI. On x64, this will alter passing a complex double scalar. Was “Full flush of all pipes” (deprecated)

0x04: used in cgutil.c for n10

0x08: used in cgutil.c for n10

27:

reserved

28:

Optimizer - modify behavior

0x01: turn on global reg for region 0 if function exits early - a function exits early if there exists a branch to the exit from the 4th (or earlier) block of the function. For region 0 of a function which exits early, new global regs are not assigned and any registers which were previously assigned are not propagated; goal is to minimize the amount of code prior to the early exit.

0x02: perform complete induction analysis (override attempts to exclude certain linear, integer, array references).

0x04: propagate any registers assignments for a function which was determined to exit early.

0x08: inhibit recognition of min/max pattern ( if (a <rel> b) a = b ).

0x10: Disables copying of POINTER array to sequential temp array at calls. Disables using the descriptor’s ‘len’ field as the final subscript multiplier, i.e., assume that the pointer locates a contiguous array. Enabling this is nonstandard F90.

0x20: At subroutine call, when passing POINTER array to sequential dummy array, we usually copy to a sequential temp. If this bit is set, a run-time test for NULL pointer is inserted, and the temp is not created or copied if it is null. Note: passing the NULL pointer is nonstandard, but some other compilers implement this.

0x40 Allow [unsigned] long long variables to be assigned global registers for targets, such as the ST100, where the default is to disallow such assignments.

0x80 Allow float/double variables to be assigned global registers for targets, such as the ST100, where the default is to disallow such assignments.

0x100 Allow copy propagation of all exprs. Currently we disallow costly exprs (if the vectorizer is on.) [ in optutil.c - cp_loop. ]

0x200 Use alternate induc method to reduce induction variables; the actual methods are target dependent.

0x400 When inlining Fortran, passing 1D array element to array, use a base pointer

0x800 Do not allow a temp to be assigned to certain types of invariant loads if its address computation is “costly”; normally, we do not trade a simple load for a load of a temp.

0x1000 Use unique temps when replacing fp constants during invar. May be extended to include scalar replacement.

0x2000 Use unique temps when replacing invariant expressions. available

0x4000 Disable optimizations fg_opt_comp_one/fg_opt_comp_zero from fgraph.c. These are designed to remove useless tests and uses of intermediate variables that hold results of comparisons. For example:

cond = (a>b) ;
if (cond) {
    ...
}

will be transformed into

cond = (a>b) ;
if (a>b) {
    ...
}

Thus if variable cond is no longer used it gets eliminated. This optimization mainly benefits C++ codes.

0x8000 Inhibit checking for invariant common base pointers.

0x10000 inhibit recognition of the if-then-else pattern

0x20000 inhibit recognition of the if-then pattern

0x40000 inhibit recognition of the if-then-else & if-then patterns when the conditional is floating point and the expressions are integer or pointer

0x80000 Disable replacing narrow integer scalars with int temporaries

0x100000 Disable recognition of the if-then pattern with FREE* ops (often due to post-mod) and replacement of such a pattern with SELECT.

0x200000 Reassociate address computation expressions to improve code floating of subexpressions.

0x400000 Do not classify an address constant as costly to compute, such as one for computing the address of an external when -fpic for 64-bit. Costly acons or invariant loads via costly acons may be assigned to temp.

0x800000 Exclude all induction pointers used as basepointers in load/store operations.

0x1000000 Inhibit combining of invariants in address expressions generated for subscripting of fortran arrays.

0x2000000 Restrict -x 28 0x1000000 (combining of invariants …) to fortran pointers.

0x4000000 Inhibit hlinduc0:do_ptr_branch - create countable loops out of pointer-controlled loops

0x8000000 Inhibit hlinduc0:do_ptr_branch - create countable loops out of all candidate pointer-controlled loops (aggressive)

0x10000000 Replace loop_cnt with new induction variable even if we cannot determine that init + (loop_cnt * skip) will not overflow.

0x20000000 Do not attempt to SIMD-ize a sequence of reciprocal sqrts (aka the gromacs hack).

0x40000000 Experimental invar

0x80000000 Experimental induc
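
The 0x10000000 entry above turns on whether init + (loop_cnt * skip) can overflow. As a rough illustration of the kind of question involved, the following C sketch evaluates the final value in 64-bit arithmetic and tests whether it fits in 32 bits. The function name and the choice of 32-bit operands are assumptions made for this example; the compiler's actual analysis is not described in this document.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not from the compiler sources): returns true if
 * init + loop_cnt * skip is representable as a 32-bit signed integer.
 * The operands are widened to 64 bits first, so the intermediate
 * product and sum cannot themselves overflow. */
bool final_value_fits_int32(int32_t init, int32_t loop_cnt, int32_t skip)
{
    assert(loop_cnt >= 0);   /* a trip count is never negative */
    int64_t last = (int64_t)init + (int64_t)loop_cnt * (int64_t)skip;
    return last >= INT32_MIN && last <= INT32_MAX;
}
```

When such a check cannot be decided at compile time, the default is to leave loop_cnt alone; the flag asks for the replacement anyway.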

29:

Optimizer - modify behavior

0x01: For the gromacs optimization on AVX, use avx (256-bit avx)

0x02: For the gromacs optimization on AVX, use vex (128-bit avx)

0x04: (C++) inhibit flow.c:delete_unrefd()

0x08: Inhibit recognition of the if-then-else & if-then patterns when the expressions are floating point

0x10: Inhibit recognition of the if-then-else & if-then patterns for a LHS which is not a scalar variable

0x20: Inhibit recognition of power-of-2 multipliers of induction variables appearing as subscripts

0x40 For scalar prefetching (induc.c:prefetch_integrated()), the non-stride-1 constraint is applied to the candidate’s induction variable rather than the induction family/master variable.

0x80 Experimental flow.c: treat uses of COMPLEX differently.

0x100 Inhibit the induction branch optimization if a call occurs.

0x200 Inhibit creating countable loops from pointer-controlled loops; does not apply when the loop-end condition is a known ‘distance’ away from the initial value of the pointer

0x400 Disable scalar replacement for invariant array references within loops.

0x800 reserved.
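
Several entries above (-x 28 0x4000000, -x 28 0x8000000 and -x 29 0x200) refer to creating countable loops out of pointer-controlled loops. The C sketch below shows the general idea with two hypothetical functions: one whose loop exit is a pointer comparison, and an equivalent form whose trip count is computed before the loop. It illustrates the concept only and is not the output of hlinduc0.

```c
#include <assert.h>
#include <stddef.h>

/* Pointer-controlled loop: the exit test compares two pointers. */
long sum_ptr_loop(const int *p, const int *end)
{
    long sum = 0;
    for (; p != end; ++p) {
        sum += *p;
    }
    return sum;
}

/* Countable form: the trip count is the distance between the initial
 * pointer and the end pointer, computed once before the loop. */
long sum_counted_loop(const int *p, const int *end)
{
    assert(end >= p);        /* both must point into the same array */
    long sum = 0;
    ptrdiff_t n = end - p;
    for (ptrdiff_t i = 0; i < n; ++i) {
        sum += p[i];
    }
    return sum;
}
```

The countable form exposes an integer trip count, which is what later loop optimizations generally need.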

30-39

Reserved for low-level vectorizer.

30:

High level vectorizer - maximum size of loop nests to process.

31:

Low-level vectorizer - cache vectors only if strip size >= n. (860 only).

32:

Low-level vectorizer - amount of cache used by low-level vectorizer. (860 only).

High-level vectorizer - size of on-chip cache (x86 only).

33:

Low-level vectorizer - maximum strip size of loops with non-invariant complex vectors.

34:

Low-level vectorizer - modify behavior.

0x01: Generate mcp calls (FPS option).

0x02: Stream in/out all linear loads/stores

0x04: Generate XP calls.

0x08: Inhibit vector intrinsics recognition.

0x10: (Sparc only) Don’t allow parallel outer loop.

0x20: Sparc: Don’t allow parallel inner loop. 860: Don’t allow parallel loop.

0x40: (Sparc only) Designate outer loop to be parallel.

0x80: Sparc: Designate inner loop to be parallel. 860: Designate loop to be parallel.

0x100: (860 only) Allocate loop iterations to CPUs cyclically.

0x200: (mp sparc & 860) Permit automatic parallelization of loops.

0x400: (860 only) Permit parallel inner loops to contain invariant vectors.

0x800 Last values are computed on the last iteration of a loop.

0x1000 permit innermost loop to be parallelized if parallelizable

0x2000 don’t check loop count when parallelizing non-innermost loops

0x4000 disallow parallelizing innermost loops with a conditional reduction

0x8000 set thread number to be constant 2 for dual_core system

0x10000 generate only parallel version for runtime testings

0x20000 generate serial version regarding ncpus setting value 1

0x40000 disallow pipeline parallelization

0x80000: Ignore any array bounds information when determining if stripmining needs to be performed (i.e., assume that array bounds will be violated in a vectorizable loop).

0x100000: nolastdim. Ignore the (declared) extents of an array in blank common if the extent of the last dimension is 1 (pgftn-sparc only); directive scope is either routine or global (not loop).

0x200000: in llvect for hammer, disable enhanced array reference alignment testing

0x400000: (hammer and x8632 only) `#pragma altcode alignment’: if possible, generate an alternative version of the loop with extra aligned moves, guarded by a runtime alignment test.

0x800000: Set the minimum loop count of innermost loops to 128 for -Mconcur

0x1000000: Disable the profitability check for -Mconcur

0x2000000: Inhibit the 2nd pass of the high-level vectorizer

0x4000000: Classify single level loops as innermost for -Mconcur

0x8000000: Disable multi-level invariant hoisting for loop created by array assignment or loop with same loop bounds.

0x10000000: Disable the generation of altcode whose execution is governed by runtime pointer conflict tests.

0x20000000: Enable the following performance enhancement: on x86 targets that support AVX, only perform LRE (i.e. loop-carried redundancy elimination) on a loop if it removes at least 10% of the loop’s real*4, real*8, complex*8 or complex*16 operations. The rationale for this is that LRE forces a loop to be vectorised using xmm registers rather than ymm or zmm ones, so if it only removes a small percentage of the loop’s operations then its benefit may be outweighed by the cost of vectorising the loop using smaller vector registers. For example this enhancement gives a speed-up for applu by preventing LRE from being performed on the loop at line 1673 of applu.f, for which it only removes 2% of the operations.

0x40000000 Allow conditional vectorization containing reductions (experimental)

0x80000000 Disable LLVM vectorization containing SELECT
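
The 10% rule described for -x 34 0x20000000 can be written as a small predicate. The C sketch below is an illustration under assumptions: the function name and the idea of passing raw operation counts are invented for the example, and only the 10% threshold and the 2% applu figure come from the text above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical profitability test: LRE is considered worthwhile only
 * if it removes at least 10% of the loop's floating-point operations.
 * Integer arithmetic avoids rounding: removed / total >= 1 / 10. */
bool lre_is_profitable(unsigned removed_ops, unsigned total_ops)
{
    assert(removed_ops <= total_ops);
    if (total_ops == 0) {
        return false;        /* nothing to remove */
    }
    return (unsigned long long)removed_ops * 10u >= total_ops;
}
```

With the applu figure quoted above (2 operations removed out of every 100), the predicate is false, so LRE would be skipped for that loop.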

35:

Low-level vectorizer - maximum loop iteration count; 0 means unknown.

0x01: Disable limiting the number of non-temporal stores according to the relative alignment’s impact on the write-combining buffer.

36:

0x01: Place the vcache area on the stack. (860 only).

0x02: In parallel loops, allocate static area for the vcache. (860 only).

0x04: Beta fast- and/or relaxed- math scalar/vector versions of certain intrinsics.

0x08: LLVM - disable extended scalar analysis in conditional loops with a definition on only one side of the conditional.

0x10: LLVM - disable extended conditional vectorization in all loops where the predicate size is different than the computational size.

0x20: LLVM - allow vectorization if multiple lhs data type sizes exist within the inner loop

0x40: LLVM - don’t allow vectorization of DBLE ili

0x80: LLVM - don’t allow vectorization of DFLOAT/IKMV ili

37:

0x01: Sparc only – generate code for VPU.

0x02: Sparc only – use old-style parameter block code.

0x04: Put all loops in llv loop table. With loop scope, the following loop will be placed in the llv loop table, even if not parallelizable.

0x08: Put no loops in llv loop table. With loop scope, the following loop will not be placed in the llv loop table, even if parallelizable.

0x10 Insert Meiko polling code into outer loops.

0x20 Allow loops containing stack-based variables to be vectorized.

0x80 Temporary switch - turns off insertion of vsld32 instruction in front of certain single precision vector loads. This instruction is inserted to work around a hardware problem.

0x100 Turn off extended scalar expansion

0x200 Turn off calculation of conditional vectorization possibilities

0x400 Disable llvect from generating vectorization code for conditionals

0x800 Disable llvect from generating masked fdiv fp routine for conditionals

0x1000 Disable conditional vectorization for compound predicates

0x2000 Don’t check conditional vectorization masks for all 0’s or 1’s (short circuiting)

0x4000 Conditional vectorization: turn off extended CSE for code outside current block

0x8000 Conditional vectorization: don’t use mask vector intrinsics

0x10000 Check for all 0’s and 1’s regardless of threshold value

0x20000 Don’t let intrinsic calls prevent short circuiting

0x40000 Treat scalars the old way - NOP analysis not affected by conditional vect

0x80000 Allow chained control dependence with conditional vectorization

0x100000 Allow complex chained control dependence with conditional vectorization

0x200000 Turn off vectorization with assignments to logical compares

0x400000 For llvm compilers, do vectorize max/min operations

0x800000 For llvm compilers, don’t construct vector ILI trees with math intrinsic calls

0x1000000 For LLVM compilers, enable vectorization with small ints on rhs

0x2000000 For LLVM compilers, don’t allow scalar expansion with vector temps within loops

0x4000000 Native x86 compilers, don’t vectorize conditionals with any “OR” predicates

0x8000000 LLVM compilers, check use counts

0x10000000 LLVM compilers, don’t perform newton’s method within llvect

0x20000000 Allow CVECT with just one link to flow down this value without merge

0x40000000 LLVM compilers, don’t perform vectorization on induction iterators

0x80000000 LLVM compilers, don’t allow any store matching to the RHS within add_vili()

38:

reserved.

39:

i860 low-level vectorizer: Maximum number of elements over which array references may span before they can be combined within a single cache vector. The span between A(i+k1) and A(i+k2) is defined to be |k1-k2|+1. If this switch is 0, allow any span.

Hammer and x8632 low-level vectoriser:

0x01: Don’t generate prefetches in vectorised loops or loops that are unrolled by the vectoriser.

0x02: Don’t vectorise loops (though the vectoriser can unroll them).

0x04: Don’t unroll a vector loop body.

0x08: Prefetch one vector iteration ahead.

0x10: Disable the vectorisation profitability test for real*4 loads and stores with (stride != 1).

0x20: Disable the streaming store optimization.

0x80: Enable vectorisation and llvect unrolling of loops containing intrinsic function calls.

0x100: Disable all vectorisation profitability tests.

0x200: -Mnontemporal or -Mmovnt: generate non-temporal stores even in loops that are not in memory altcode. (They’re generated by default in memory altcode loops.)

0x400: Disable the optimisation of stride-2 loads and stores.

0x800: Generate “prefetchnta” instructions instead of the default prefetch instructions.

0x1000: Generate “prefetchw” instructions for arrays that are stored into.

0x2000: Don’t vectorise loops that are “too big”.

0x4000: Generate “prefetcht0” instructions. (This is the default anyway.)

0x8000: Disable unrolling of non-vectorised loops by the vectoriser.

0x10000: Generate “prefetch” instructions instead of the default prefetch instructions.

0x20000: Generate prefetches for loads with any stride, rather than just for loads with stride 1 or 2.

0x40000: Disable the complex_add_loop optimisation.

0x80000: Don’t peel vectorised loops that contain non-stride-1 loads.

0x100000: Mark an array that has been parallelized as aligned if its initial non-parallel address is aligned.

0x200000: Don’t vectorize loops that have a lexically forward anti-flow dependence with (<) direction.

0x400000: In llvect, when checking whether a load and store with different addresses conflict, if ‘hlconflict’ returns ‘SAME’ we normally change that to ‘CONFLICT’, because the NMEs are not updated by loop unrolling, so ‘SAME’ is imprecise. This flag disables that behavior.

0x800000: Don’t vectorise loops that contain a store to an array element whose address is not a linear function of the loop index with a compile-time constant stride.

0x1000000 Vectorize loops with a constant small loop count if possible.

0x2000000 Don’t generate scalar non-temporal stores.

0x4000000 No induction analysis in the presence of indirect array refs; also checked in induc.c to inhibit all induc optimizations on loops that have been vectorized.

0x8000000 Use ‘movlpd’ and ‘movhpd’ to load and store real*8 stride-2 pairs.

0x10000000 Use multiple registers to accumulate a vectorised reduction, i.e. use a different register to accumulate the reduction in each copy of the unrolled vector loop body. By default the same register is used to accumulate the reduction in all copies of the unrolled vector loop body.

0x20000000 Double the default number of vector loop unrolls for AMD processors >= greyhound.

0x40000000 Don’t vectorise or unroll a loop that contains conditional multiple blocks.

0x80000000 Allow vectorizing multiple blocks and 64-bit selects.
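
The difference that -x 39 0x10000000 makes to a reduction can be pictured with a scalar analogue in C. The sketch below is an assumption-laden illustration: it uses plain floats instead of vector registers, an unroll factor of two, and invented function names. It shows only the structural change, one accumulator per copy of the unrolled body.

```c
#include <assert.h>
#include <stddef.h>

/* One accumulator: every addition depends on the previous one. */
float sum_single(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        acc += x[i];
    }
    return acc;
}

/* Two accumulators for a body unrolled twice: each copy of the body
 * adds into its own accumulator, and the partial sums are combined
 * after the loop. This reorders the floating-point additions, so the
 * result can differ from sum_single in the last bits. */
float sum_two_accumulators(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    float acc0 = 0.0f, acc1 = 0.0f;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc0 += x[i];
        acc1 += x[i + 1];
    }
    if (i < n) {             /* leftover element when n is odd */
        acc0 += x[i];
    }
    return acc0 + acc1;
}
```

Separate accumulators shorten the dependence chain between successive additions, at the cost of extra registers and a final combining step.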

40:

High-level vectorizer: loop-splitting heuristic; number of array loads/stores allowed before loop is split. Default is 20.

41:

High-level vectorizer: loop-splitting heuristic; number of floating point operations allowed before loop is split. Double precision ops count 2, single precision ops count 1. Default is 40.
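
The two loop-splitting limits (xflags 40 and 41) can be combined into a small decision function. The C sketch below is only a guess at the shape of such a heuristic: the function names, the use of a strict greater-than comparison, and the way the two limits are combined are all assumptions. The defaults (20 and 40) and the weighting of double versus single precision operations come from the text above.

```c
#include <assert.h>
#include <stdbool.h>

/* Defaults quoted in the text for xflags 40 and 41. */
enum { DEFAULT_MEM_REF_LIMIT = 20, DEFAULT_FP_OP_LIMIT = 40 };

/* Weighted floating-point count: double precision ops count 2,
 * single precision ops count 1. */
int weighted_fp_ops(int single_ops, int double_ops)
{
    assert(single_ops >= 0 && double_ops >= 0);
    return single_ops + 2 * double_ops;
}

/* Hypothetical decision: split once either count exceeds its limit.
 * The real heuristic may compare or combine the limits differently. */
bool loop_should_split(int mem_refs, int single_ops, int double_ops)
{
    return mem_refs > DEFAULT_MEM_REF_LIMIT ||
           weighted_fp_ops(single_ops, double_ops) > DEFAULT_FP_OP_LIMIT;
}
```

For instance, a loop with 10 single precision and 15 double precision operations has a weighted count of 40, which sits exactly at the default limit.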

42:

High-level vectorizer behavior modification.

0x01 (Obsolete) Inhibit loop distribution and interchange.

0x02 Inhibit breaking cycles of anti-dependences in the Sparc compilers.

0x04 Permit external calls in vectorized loops.

0x08 Inhibit array expansion.

0x10 Disable loop blocking (tiling)

0x20 Disable unroll and jam

0x40 Disable outer loop distribution

0x80 ENABLE inner loop distribution (NOTICE sense difference here)

0x100 Disable loop interchange

0x200 Perform scalar unroll & jam on loop

0x400 Disable reduction marking in hlv

0x800 Disable scalar replacement in hlv

0x1000 Don’t enter hlv_vectorize() for a particular routine

0x2000 Go ahead and perform outer loop distribution on single-nested loops

0x4000 Don’t generate strip loop around loop-distributed code

0x8000 Limit vectorization on functions based upon heuristics

0x10000 Allow loop distribution as the only loop transformation

0x20000 Enable loop fusion.

0x40000 Allow loop fusion with calls

0x80000 Disable loop fusion of noninner loops.

0x100000 Disable scalar unroll and jam

0x200000 Disable loop-carried redundancy elimination

0x400000 Loop-carried redundancy elimination: disallow reassociation

0x800000 Loop-carried redundancy elimination: disallow reassociation

0x1000000 For testing: LRE temps do not get CCSYM flag set

0x2000000 LRE: treat array refs as expressions, so a[k] and a[k-1] will be recognized as LREs

0x4000000 LRE: allow modifications in the loop, so a[k]= will not eliminate a[k]+b[k] and a[k-1]+b[k-1] from being considered as LREs

0x8000000 LRE: build balanced tree of operands when rebuilding expressions; should only be used with reassociation

0x10000000 LRE: run vanilla LRE before vectorizer; default is after vectorizer

0x20000000 LRE: run LRE with X heuristic before vectorizer; also runs full LRE after vectorizer

0x40000000 LRE: allow indirection

0x80000000 Enable loop fusion when loop contains array dummy argument read/write (default not allowed).
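
The LRE entries above mention expressions such as a[k]+b[k] and a[k-1]+b[k-1]. The C sketch below illustrates what eliminating such a loop-carried redundancy means at the source level. The function names and the surrounding computation are invented for the example, and the compiler performs the transformation on its internal representation rather than on source code.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: a[k-1] + b[k-1] in iteration k recomputes the value
 * that a[k] + b[k] produced in iteration k-1. */
void sums_original(const double *a, const double *b, double *out, size_t n)
{
    assert(out != a && out != b);    /* out must not alias the inputs */
    for (size_t k = 1; k < n; ++k) {
        out[k] = (a[k] + b[k]) * (a[k - 1] + b[k - 1]);
    }
}

/* With the redundancy eliminated: the sum is computed once per
 * iteration and carried to the next iteration in a scalar temp. */
void sums_lre(const double *a, const double *b, double *out, size_t n)
{
    assert(out != a && out != b);
    if (n < 2) {
        return;
    }
    double prev = a[0] + b[0];
    for (size_t k = 1; k < n; ++k) {
        double cur = a[k] + b[k];
        out[k] = cur * prev;
        prev = cur;
    }
}
```

The carried temp creates a dependence between iterations, which is one reason the interaction between LRE and vectorization is controlled by several of the flags above.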

43:

(860) Minimum loop count for an innermost loop to be parallelized if it contains a reduction.

44:

Minimum loop count for an innermost loop to be parallelized.

45:

The ‘machine number’ to use for the datatype table. The default is machine zero.

46:

reserved

47:

reserved

0x100 Disable shmem_get inlining.

0x200 Disable inline_small_matmul.

0x1000 Disable dead code and scalar optimization phase.

0x2000 Disable optimization of gather/scatter/copy/overlap-shift communication

0x4000 Disable optimization of hcstart

0x8000 Disable optimization of allobnds

0x10000 Disable optimization of localize-bounds and section

0x20000 Disable optimization of get_scalar

0x40000 Disable optimization of copy communication

0x80000 Disable optimization of gather/scatter communication

0x100000 Disable optimization of overlap-shift communication

0x200000 Enable automatic loop parallelization

0x400000 inline pgf90_sect calls

0x800000 Disable using the lhs of an assignment as the result of a call to a use function.

0x1000000 in fe90/outconv, do set the global-size for descriptors even if not global and not passed as arguments

0x2000000 in fe90/lowersym, do initialize pointer/sdsc for compiler-generated temp pointer arrays

0x4000000 Do fuse foralls even if the RHS is a constant zero; by default, we don’t fuse these, because we can efficiently turn them into mzero calls

0x8000000 Emit call to pghpf_associated for the ASSOCIATED intrinsic.

0x10000000 Do not call our streamlined/dgemm-like matmul run-time routines.

0x20000000 Disable the optimization of reshape

0x40000000 Do not attempt to dial down the opt level in the f90 frontend

0x80000000 reserved

48:

reserved

49:

Behavior modification.

0x1 Inhibit transformation passes – directive processing is done.

0x2 Inhibit forall conversion.

0x4 reserved

0x8 Inhibit discarding parentheses.

0x10 Don’t include PARAMETER variables when AST expression simplification is performed.

0x20 Inhibit communication analyzer and optimizer.

0x40 Allow CM fortran’s intrinsics.

0x80 f77 output (pgftn extensions not allowed; Cray POINTERS allowed).

0x100 Target’s pointers are 64-bits.

0x200 Normally double precision is twice real; this makes it the same as real. Similarly for double complex.

0x400 reserved

0x800 reserved

0x1000 Emit function & line number information for runtime error handling and for communication profiling.

0x2000 reserved

0x4000 Source profiling.

0x8000 Cray POINTERs not allowed in the output. Valid only if the output is f77 (-x 49 0x80).

0x10000 reserved

0x20000 Disable invariant communication call hoisting.

0x40000 C90 target.

0x80000 reserved

0x100000 Don’t pass character constants as arguments to pgf90_loc (e.g., when the platform is HP).

0x200000 Inhibit inlining on following loop.

0x400000 F90 output – kinded constants, pass some intrinsics through.

0x800000 Native REAL is REAL8 and CMPLX is CMPLX16. Map double precision constants (0.0d0) to real constant format (0.0e0).

0x1000000 T3D/T3E target.

0x2000000 reserved

0x4000000 reserved

0x8000000 reserved

0x10000000 For POINTERs in commons, place their associated variables near the beginning of the common. This is non-f90/-f95/-f2003 behavior but was our original behavior.

0x20000000 In lowerilm, don’t use PLD and PST for load/store of pointer variables

0x40000000 Return the value of complex functions the same as C (e.g. be XLF compatible).

0x80000000 Native INTEGER is INTEGER*8 and LOGICAL is LOGICAL*8.

0x80000000 reserved

50:

0x01: i860/apx under DOS (default is under UNIX)

0x02: native Fortran

0x04: Only put out # linenum or #line linenum when the line number changes, rather than the default which is when the line number sequence is broken. Good for debugging.

0x10: For Fortran, generate ‘verbose’ .ilm files.

0x20: Inhibit any specific DOS end-of-line checks.

0x40: Enable unconditional_branches() (Fortran): look for conditional branches with constant conditions; remove the branch, remove unreachable code as well.

0x100: Don’t generate pgdbg_stub reference, used for generating shared libraries

51:

In Fortran, determines host specific output options for TINY/HUGE.

0x01: Defines smallest integer*1 as -HUGE(integer*1).

0x02: Defines smallest integer*2 as -HUGE(integer*2).

0x04: Defines smallest integer*4 as -HUGE(integer*4). In two’s complement arithmetic, 0x80000000 is -2147483648, whereas 0x7fffffff is +2147483647; some compilers reserve the value 0x80000000, so we have to use 0x80000001 as the smallest integer (this is used for MAXVAL initialization).

0x08: Defines smallest integer*8 as -HUGE(integer*8).

0x10: Tells the compiler to use a hexadecimal double-word constant to represent TINY(1.0d0); the normal value for TINY(1.0e0) is 1.175494351E-38 (represented by 0x00800000 in IEEE floating point), and TINY(1.0d0) is 2.22507385850720138E-308 (represented by 0x0010000000000000 in IEEE). However, the IBM xlf compiler (and perhaps others) will round a value of 2.22507385850720138E-308 to zero; it will accept the z’0010000000000000’ syntax and will then use this and print it as the correct value (go figure).

0x20: Disallow REAL/DOUBLE PRECISION/COMPLEX in typeless pgi predefined functions AND, OR, EQV, NEQV, COMPL, and SHIFT.

0x40: Keeps ‘TINY’ and ‘HUGE’ even without F90 output.

0x80 When generating code for reductions, don’t generate quad-precision accumulators for double precision arguments.
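
The bit patterns quoted in the 0x04 and 0x10 entries for xflag 51 can be checked with a few lines of C. The sketch below uses invented helper names and assumes a platform with two's complement 32-bit integers and IEEE 754 64-bit doubles; it is an illustration, not compiler code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* -HUGE(integer*4): one above the most negative two's complement
 * value, i.e. the bit pattern 0x80000001 rather than 0x80000000. */
int32_t smallest_int4_as_neg_huge(void)
{
    return -INT32_MAX;
}

/* Build a double from a raw 64-bit pattern, mirroring the hexadecimal
 * double-word constant form mentioned for TINY(1.0d0). */
double double_from_bits(uint64_t bits)
{
    assert(sizeof(double) == sizeof(uint64_t));
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}
```

Calling double_from_bits with 0x0010000000000000 gives the smallest normalized double directly from its bit pattern, without depending on how a decimal literal is rounded.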

52:

Host dependent:

0x01: Fortran: Complex in Common blocks must be aligned to double-word boundaries.

0x02: AVAILABLE

0x04: Fortran: do linearize arrays. (used to be don’t linearize, but we reversed the sense of the bit).

0x08: Do use the old method of filling in .A0000 variables for adjustable array bounds temps.

0x10: reserved

0x20 reserved

0x40 Generate unified .mod module output file.

0x80 Front-end generates the linkage names for module subprograms (still experimental)

53:

0x01: Require pointer target analysis or interprocedural pointer disambiguation to be enabled when testing nme loop safeness (optutil.c, is_nme_loop_safe).

0x02: Enable intraprocedural pointer target analysis.

0x04: Remove points-to information before schedule().

0x08: Build the LP_PLOADS (loop pointer loads) structure when doing flow analysis on loops, allowing more pointer target analysis on loops.

0x10000 Disable checking the ptr refs information collected in flow for determining if there are any pointer conflicts with respect to the ansi alias rules.

0x20000 Enable creating DEFS for calls whose uses will be loads of variables that can be modified by calls.

0x40000 is_sym_parsect_safe() - only consider private variables safe in parallel sections; inhibit more aggressive checks.

0x80000 Enable ipa pointer alias analysis

0x100000 Enable ipa structure reaggregation optimization

0x200000 Don’t use cgr_modifies() (optutil.c:is_static_call_safe()).

0x400000 Disable propagation of certain IPA pointer information from actual arguments to ..inline temporaries when a call-site is inlined.

54:

More Flang behavior modification

0x01: Enable full Fortran 2003 allocatable attribute regularization

0x02: don’t assume assumed-shape arrays are stride 1

0x04: No 2003 allocatable assignment semantics for allocatable components

0x08: Allocate automatic arrays on the stack instead of the heap by using an alloca-like method (affects the frontend)

0x10: Fortran Back-End only: Where possible, Implement the alloca-like method by inlining alloca; otherwise, call our ‘builtin’ alloca routine (affects the backend). Note that XBIT(54,0x08) must be set.

0x10: Fortran Front-End only: Use pre-F2008 STOP command semantics; do not return integer values from STOP commands as the program exit status.

0x20: Assume that dummy arguments declared EXTERNAL are Fortran routines that were compiled with Flang.

0x40: Enable contiguity pointer checks on pointer assignments and on actual arguments inside callees.

0x80: Enable contiguity pointer checks at call-sites.

0x100: Use an alternate contiguity pointer check inline that checks whether the pointer target’s descriptor flags have __SEQUENTIAL_SECTION set and whether the object’s data type length matches the descriptor’s data type length. This check is experimental and intended for pointer assignments and actual arguments inside callees. This check cannot currently be generated at call-sites. The XBIT(54, 0x40) must also be enabled. If XBIT(54, 0x80) is enabled, then we perform the contiguity check at call-sites using a library routine. Note: In the case of an optional argument, the inline check will also check whether the argument is present.

0x200: When checking contiguity (using XBIT(54,0x40), XBIT(54,0x80), XBIT(54,0x100)), do not flag null pointer targets as noncontiguous.
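
The dependency between the xflag 54 contiguity bits can be sketched with the XBIT(n,m) test. In the C example below, the flag storage, the macro definition, the set_xflag helper and the enum are stand-ins written for this illustration; they follow the description of flg.x[128] and XBIT but are not copied from the compiler sources, and OR-ing the value into the array is an assumption about how -x combines settings.

```c
#include <assert.h>

/* Minimal stand-ins for the compiler's flag array and test macro. */
struct flags { int x[128]; };
static struct flags flg;
#define XBIT(n, m) (flg.x[n] & (m))

/* Hypothetical equivalent of "-x n m": OR the mask into the array. */
void set_xflag(int n, int m)
{
    assert(n >= 0 && n < 128);
    flg.x[n] |= m;
}

enum contig_check { CHECK_NONE, CHECK_DEFAULT, CHECK_INLINE_ALT };

/* Which check applies to pointer assignments and actual arguments
 * inside callees, following the xflag 54 entries above. */
enum contig_check callee_contiguity_check(void)
{
    if (!XBIT(54, 0x40)) {
        return CHECK_NONE;   /* 0x100 has no effect without 0x40 */
    }
    return XBIT(54, 0x100) ? CHECK_INLINE_ALT : CHECK_DEFAULT;
}
```

Setting only 0x100 therefore selects nothing in this sketch; both 0x40 and 0x100 must be set for the alternate inline check.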

55:

0x01: AVAILABLE

0x02: reserved

0x04: AVAILABLE

0x80: Don’t call update_shape_info() when assumed-shape array is marked target.

0x100: Try to reduce array copies in argument passing

0x200: AVAILABLE

56:

Algebraic transformation; llvect overflow

0x01 unused

0x02 Enable the floating-point factoring transformation.

0x04 Enable the integer factoring transformation.

0x08 Disallow prefetchnta auto-generation in llvect.c

0x20 Disable putting term to the end of each group in factoring_tm() when breaking them into 2 groups to keep them in the same order as before as much as possible in algetrans.c.

0x40 This x-flag is set by the command-line option -Mvect=simd:128. It restricts vectorisation to a vector length of 128 bits even if the target processor supports larger vector lengths.

0x80 Enable multiple outer loop unroll_and_jam.

0x100 This x-flag is set by the command-line option -Mvect=simd:256. It asserts that the target processor supports SIMD instructions with a vector length of at least 256 bits and restricts vectorisation to 256 bits even if the target processor supports larger vector lengths.

0x200 Do not replace a scalar expression of the form (+-(a * b) +- c) or (c +- (a * b)) by a scalar FMA instruction if the product (a * b) has more than one use.

0x400 Do not replace a vectorised expression of the form (+-(a(i) * b(i)) +- c(i)) or (c(i) +- (a(i) * b(i))) by a vector FMA instruction if the product (a(i) * b(i)) has more than one use.

0x800 This x-flag is set by the command-line option -Mvect=simd:512. It asserts that the target processor supports SIMD instructions with a vector length of at least 512 bits and restricts vectorisation to 512 bits even if the target processor supports larger vector lengths.

0x80000000 If a user-written prefetch inhibits vectorization, do not attempt to replace its address expression with the address of a matching array reference.

57:

Fortran behavior modification

0x01 Replace “$” with “_” in symbols occurring within the debug output file.

0x02 Disallow integer*8/logical*8.

0x04 Disallow real*16

0x08 Disallow complex*32.

0x10 Map REAL*16 and REAL(16) to REAL*8, and map COMPLEX*32 and COMPLEX(16) to COMPLEX*16. Map kinded real constants (0.0_16) to the corresponding kinded constants (0.0_8). Give a warning in each case.

0x20 For F90, export all symbols from front end to back end, as with -debug, without creating -debug file.

0x40 For source to source compiler with F90 output, dollar signs and underscores are (by default) disallowed. Setting this switch allows them (with a warning).

0x80 When using base/offset (instead of cray pointers), for formal arguments, don’t use a $bs array, instead use the original variable as the formal argument. This prevents problems of having a local derived type that is unaligned with respect to the dummy argument.

0x100 For native compilers, renumber lines to be sequential as generated.

0x200 Print “DOUBLE COMPLEX” as “COMPLEX*16” (for Sun’s F90 compiler).

0x400 Set INHERIT bit for dummy arrays with TARGET attribute that don’t have explicit distributions.

0x800 Print -128_1 as (-127_1-1_1), and similarly for _2, _4, _8 types. This is for some compilers that treat -128_1 as negative 128_1, which overflows.

0x1000 Print integer*8 with _8 suffix, even if not ‘f90output’.

0x2000 remove unused variables from source to source output

0x4000 don’t allow an ac-do-variable to be in limit expression of an implied-do-loop

0x8000 Don’t replace references to the pghpf_ commons with other values (constants, static addresses, etc.); e.g., the value of an ‘absent’ argument was represented by &pghpf_0_; now, it’s just 0.

0x10000 Don’t generate pghpf_copy_in/copy_out calls for assumed-shape arguments. Instead, use the descriptors as passed in directly. For Fortran.

0x20000 Don’t generate pghpf_ptr_in/ptr_out calls for pointer arguments. Use the pointer and descriptor as passed in directly. For Fortran.

0x40000 For F90, generate pghpf_template/pghpf_instance calls in host subprograms for all globals and host arrays that MIGHT be used in the subprogram.

0x80000 Only for F90, pass pointer actual to pointer dummies as the pointer itself, eliminate the pghpf_ptr_in/pghpf_ptr_out calls in the callee.

0x100000 For F90, when passing a continuous section to a subroutine, don’t call pghpf_sect, instead call _template and build a new template. This allows the template creation routine call to float out of a loop.

0x200000 For F90, when passing a section to a subroutine, pass the address of the first element of the section, not the starting address of the array. Requires building the right section descriptor.

0x400000 For F90, don’t try to share section descriptors for arrays, build a new section descriptor for each array.

0x800000 Set PDALN field for arrays in module common blocks.

0x1000000 Do not apply any PDALN (pad & align) values to common block members.

0x2000000 Don’t perform additional padding beyond PDALN for module common blocks.

0x4000000 Special code for Rice CAF support. Recognize pgi_get_descriptor and pgi_set_descriptor functions.

0x8000000 Do not call lighter-weight alloc/dealloc functions for automatic arrays, (… hope to expand this list to include compiler-created allocatable temps …).

0x10000000 For 32-bit, do not check the PDALN field of module-created commons to set their default alignment to 16 bytes; PDALN is set by module.c (fe90) and checked in f90’s assem.c

0x20000000 Do not inline PRESENT.

0x40000000 in outconv, generate value-arguments to pgf90_template[123]v routines even for 64-bit compilers

0x80000000 Do not make the default for the allocate size argument a 64-bit integer (64-bit targets only).

58:

Fortran behavior modification

0x01: Cray-style POINTERs allowed, but the pointer objects may not be character (e.g., Cray’s f77 compilers). Valid only if the output is f77 (-x 49 0x80).

0x02: caller mapping, if remapping occurs, caller would have explicit interface.

0x04: SERIAL_ONLY directive: the program unit will only be called from serial regions.

0x08: PARALLEL_ONLY directive: the program unit can only be called from parallel regions.

0x10: PARALLEL_AND_SERIAL directive: the program unit can be called from both parallel and serial regions.

0x20: no copy_in and copy_out inside callee.

0x40: reserved

0x80: Generate shared-memory communications.

0x100: Enables CRAFT features.

0x200: Create character constants for FORMATs.

0x400: ON HOME clause of INDEPENDENT loops cannot be overridden.

0x800: is f77 (-x 49 0x80).

0x1000: Set if this is an F90 compiler; only extrinsic F90/SERIAL allowed.

0x2000: Default extrinsic model is LOCAL

0x4000: Default extrinsic model is SERIAL

0x8000: Default extrinsic language is F77

0x10000: Pass F90 pointer variables through to the back end, I think.

0x20000: This is used for the Fortran compiler; the Fortran compiler allocates temporary arrays (such as for WHERE statements) to the full size of the aligned array, so the temporary array will be distributed and aligned to that array. The Fortran 90 compiler should not do that, and this flag disables that; when set, temporary array sizes will come from the array shape.

0x40000: Cray-style POINTERs allowed, but the pointer objects may not be derived type (e.g., Cray’s f77 compilers). Valid only if the output is f77 (-x 49 0x80).

0x80000: Fortran - for the compiler-created module commons, do not prepend an underscore.

0x100000: Compiler owned module.

0x200000: Revert to previous behavior of including module name in link name of bind(C) routine.

0x400000: Don’t make a copy of assumed-shape array arguments if the callee has it marked as target.

0x800000: Don’t attempt to call the descriptor-less read/write I/O function of an array.

0x1000000: Disable the use of rhs constant bound for forall loop. When converting an array assignment to a forall, if the lhs bound is not constant, check whether an array on the rhs has a constant bound and use it as the forall loop bound.

0x2000000: AVAILABLE

0x4000000: AVAILABLE

0x8000000: For Fortran, don’t expand pointer references with multiply by section stride and add section offset in each dimension (pointer_squeezer) in fe90; do perform this in f90 back end

0x10000000: used for ??? (outconv.c:convert_output())

0x20000000: For Fortran, don’t replace alloc calls with calloc calls when allocating derived type objects containing a pointer member.

0x40000000: For F90 native, don’t add multiply-by-section-stride and add-section-offset; requires a modified runtime to fold these into the linear offset and linear stride.

0x80000000: When checking if a pointer lhs in a forall has a scatter dependency, revert to the old/conservative method where any array used in the lhs’ subscripts causes a conflict.

59:

0x01: Loop-scope pragmas and directives applied to loops affect all nested loops.

0x02: Ignore all pragmas and directives (no message is generated).

0x04: Allow the ‘mem’ pragmas/directives.

0x08: Allow block statements in the scope of OpenMP and OpenACC directives.
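A minimal sketch of how masks such as those for xflag 59 are tested, following the flg.x[] array and XBIT(n,m) macro described at the start of this section. The struct, the macro body, and the helper names below are assumptions made for illustration; the definitions in the compiler sources may differ. The sketch also assumes that -x ORs the given value into the array element.

```c
#include <assert.h>

/* Stand-in for the compiler's flag structure (assumption: only x[] is shown). */
struct xflag_sketch {
  unsigned int x[128];
};
static struct xflag_sketch flg;

/* Assumed shape of the test macro: nonzero if any bit of mask m is set. */
#define XBIT(n, m) (flg.x[(n)] & (m))

/* Hypothetical helper mirroring "-x <number> <value>": OR the mask in. */
static void set_xflag_sketch(int n, unsigned int mask)
{
  assert(n >= 0 && n < 128);
  flg.x[n] |= mask;
}

/* Hypothetical query: are all pragmas and directives ignored (59, 0x02)? */
static int pragmas_ignored(void)
{
  return XBIT(59, 0x02) != 0;
}

/* Hypothetical query: do loop-scope pragmas affect nested loops (59, 0x01)? */
static int loop_pragmas_nest(void)
{
  return XBIT(59, 0x01) != 0;
}
```

With this sketch, calling set_xflag_sketch(59, 0x02) corresponds to passing -x 59 0x02, after which pragmas_ignored() is nonzero and loop_pragmas_nest() is still zero.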

60:

Used to affect code generation when -debug is used. Turning all these on will produce code under -debug that is ‘nearly’ identical to that without -debug. Especially valuable for the higher optimization levels.

0x01: fill delay slot of delay branch.

0x02: do not fill delay slot of delay branch.

61:

Used to affect general delay slot filling for sparc.

0x01: (DON’T) fill unconditional delay slots (IL_JMP,…)

0x02: (DON’T) fill delay slots for ISUBI, BIGTI ..,0 combos

0x04: General delay???

0x08: used in cgasm.c

0x10: used in cgasm.c

0x20: used in cgasm.c

0x40: used in cgasm.c

0x80: used in cgasm.c

62:

General code gen mods

0x01 Targets of branches can execute multiple instr instead of just 1.

0x02 Change ICJMPZ into ICMPZ1…ICJMPZ1.

0x04 Inhibit MAX/MIN optimization whereby the results of the MAX/MIN are stored directly within ili template.

0x08: Generate position-independent code

63:

Used to pass opt level to CUDA back end code generator.

64:

Controls code straightener; the lower 8 bits are used as a branch probability percentage (must be between 50 and 100) to be treated as high percentage; if it is below 50, a compiled-in default is used, and >100 is treated as 100 (essentially disabling straightening).

The rest of the word is used as the minimum block size to try to straighten out, that is, the block size which is more profitable to leave in place (perhaps to predicate) than to branch over.
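The packing of xflag 64 can be sketched as follows. The field layout (low 8 bits, remaining bits) and the clamping rules come from the description above; the struct, the function name, and the value of STRAIGHTEN_DEFAULT_PCT are hypothetical, since the compiled-in default is not stated here.

```c
/* Hypothetical default; the compiled-in value is not given in this section. */
#define STRAIGHTEN_DEFAULT_PCT 75u

struct straighten_params {
  unsigned int pct;       /* branch probability treated as "high" */
  unsigned int min_block; /* minimum block size to try to straighten */
};

/* Split an xflag 64 value into its two fields and apply the clamping rules. */
static struct straighten_params decode_x64(unsigned int val)
{
  struct straighten_params p;
  unsigned int pct = val & 0xffu; /* lower 8 bits: percentage */

  if (pct < 50u)
    pct = STRAIGHTEN_DEFAULT_PCT; /* below 50: use the default */
  else if (pct > 100u)
    pct = 100u;                   /* above 100: treated as 100 */

  p.pct = pct;
  p.min_block = val >> 8;         /* rest of the word: block size */
  return p;
}
```

Under these assumptions, a value of 0x0a5a decodes to a percentage of 90 and a minimum block size of 10.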

65:

66:

Used by DSP function expansion, ipa, etc.

0x01: Enable expansion of long long intrinsics by dspfunc

0x02: Remove constant pointer arguments made redundant via IPA.

0x04: Test bit for IPA

0x08: Externalize locals or global statics that are used as precise unique pointer targets, to allow those arguments to be removed.

0x10: Remove constant integer arguments made redundant via IPA.

0x20: Optimize which arguments get removed: for st100, don’t remove all pointer arguments, only those in excess of three.

0x40: IPA: automatic assign variables to SDA area

0x80: IPA: automatic assign variables to TDA area

0x100: Internal use only: for DSP user intrinsics, generate table of instructions that might be matched.

0x200: For user-defined intrinsics, prefix __ to name, convert to lower case, like the DMD intrinsics.

0x400: Disable Scalar Replacement optimization (no chains will be generated and no code generation for accelerator region)

0x1000: Use ‘old’ integer array dependence test

0x2000: IPA: propagate qalignment to dummy pointers

0x4000: Recognize #pragma ipa

0x8000: Recognize #pragma ipofile, and halt compile just after parser (used to create ipofiles for library routines)

0x10000: Propagate information about function calls, whether function modifies globals/statics, etc.

0x20000: Propagate user assignments of globals to SDA/TDA.

0x40000: reserved

0x80000: Use old method in dspfunc to determine whether to assign an intrinsic argument to a temp variable

0x100000 For IPA argument removal, don’t actually remove the argument

0x200000 ST100: For user-defined intrinsic creation, be silent. X86: when running with #pragma ipofile, prepend a character to the ipofile name.

0x400000 Instead insert check code to check that the removed argument actually receives the value that IPA propagates; use with -x 66 0x100000

0x800000 Don’t rename functions that have arguments removed

0x1000000 Simple Fortran 90 pointer disambiguation

0x2000000 allow .xrodata/.yrodata = bank assignment for CONST data

0x4000000 Fortran 90 assumed-shape dummy argument shape-propagation

0x8000000 allow RODATA to be placed in SDA/TDA

0x10000000 Don’t mangle function names

0x20000000 Use small IPA export protocol.

0x40000000 datadep.c, disable Fortran based-array conflict addition for unknown base vars

0x80000000 Use the ‘optimized’ datatype for a called function

67:

Branch optimizations.

0x01 Eliminates serially nested redundant conditionals.

0x02 Performs structure transposition optimization.

0x04 Performs loop peeling and loop index splitting optimizations.

0x08 Performs more branch elimination optimization.

0x10 Disable peephole redundant instruction elimination.

0x20 Performs structure transposition optimization unconditionally.

0x40 Performs loop invariant imul strength reduction optimization.

0x80 Replaces movsd with movlpd in certain loops.

0x100 Turn off invarif_merge invariant if analysis

0x200 Performs conditional loop invariant hoisting.

0x400 struct transposition - when transpose_struct_ili() recurses for an AADD, pass the AADD as the parent of its left operands instead of the AADD’s parent.

68:

Multiple language implementation-defined behavior modifications.

0x01 Assume large arrays: implies that the array bounds information will be stored as 64-bit integers and subscript expressions are 64-bit; also, the macro BIGOBJ must be defined.

0x02 For large arrays, make the return data type of size, lbound, and ubound integer*8; the usual return type is default integer.

0x04 Disable assignment of type descriptor to an allocatable/pointer descriptor. By default, we will assign a type descriptor to an allocatable/pointer descriptor for all allocatable/pointer derived type objects. This is required to support F2003 features. However, if F2003 features are never used then this XBIT could be used to eliminate an extra assignment when we set up the allocatable/pointer descriptor. This XBIT will also disable creation of type descriptors for base types.

0x08 Automatically put non system and non constant global and static variables that are not TLS in the TLS or ETLS if the ETLS switch is set.

0x10 Use ETLS. As a result, threadprivates are put in ETLS at the ETLS_OMP level, and the privatization done by the 68,0x08 switch is modified so that auto-privatized symbols are put at the ETLS_TASK level.

0x20 Fortran character length for 64-bit target is integer*8 by default.

0x40 Use TLS to implement OpenMP threadprivates instead of TP vectors. Has no effect if the ETLS switch is set.

0x80 Allow non-standard F2003 type bound procedure calls of the form z = x%foo (i.e., treat missing parentheses as an empty arg list).

69:

SMP implementation-defined behavior modifications.

0x01: Don’t recognize OpenMP directives (-Mnoopenmp).

0x02: Don’t recognize SGI directives (-Mnosgimp).

0x04: For block-static parallel do/for, make the thread’s loop count a multiple of a value sufficient to keep the alignment of arrays the same as their alignment when serial.

0x08: Default schedule is dynamic (the normal default is static).

0x10: Default schedule is guided.

0x20: Default schedule is runtime.

0x40: P & V functions specifically for unnamed critical sections.

0x80: RESERVED for threadprivate-tls work.

0x100: Cache align & pad semaphore variables.

0x200: Allocate threadprivate data in parallel, i.e., a thread’s copy will hopefully be local to the thread (call _mp_cdeclp()).

0x400: Use the ‘fair’ schedule as the default static schedule for parallel do loops in pgf90.

0x800: linux 64 C++ : revert to .rodata sections, instead of linkonce.r sections for jump tables in weak(templated) functions

0x1000: Disable new OpenMP atomic and reduction implementation. Currently new OpenMP atomic is enabled with LLVM target only.

0x2000: Available

0x4000: Available

0x8000: Available

0x10000: Add trace points for the mp/omp constructs.

0x20000: Unconditionally generate prtcnt.

0x40000: Execute tasks immediately

0x80000: In the outliner used for KPMC openmp regions, when filling the argument

0x100000: Enable nodepchk for simd construct/clause.

70-79:

RESERVED FOR ALTERNATE CODE GENERATION

70:

Used to affect alternate code generation

0x01: Generate zero stride check when non-constant stride is used in the basepointer optimization (opt >= 3)

0x02: Check subscripts.

0x04: Check null pointers (f90).

0x08 Linearize arrays, and remove distributed members.

0x10 Linearize arrays, do not remove distributed members.

0x20 reserved

0x40 Don’t call redundant subscript removal - removes common subscript expressions to temps, floats out of loops, etc.

0x80 For ST100 -small, generate two versions of each function, one in -gp32, one in -gp16 (unless disabled for some other reason).

0x100 For ST100: Disable the H/W Loop check to determine if -small will generate two versions of the same function. By default, if a H/W loop uses a GP32 only HW loop (which is anything other than 2) then we only generate this function in GP32. Enabling this XBIT causes -small to always generate a GP16 and a GP32 version.

0x200 in redundant subscript removal - also remove redundancies in basic blocks

0x400 for PGF90, call sectfloat, float section descriptor calls out of loops

0x800 for PGF90, in sectfloat, floating section descriptor calls out of loops, also look at non-DO loops

0x1000 in flow.c, allow constant propagation of HCCSYM symbols

0x2000 in optimize.c, for PGF90, set -x 70 0x1000 around the call to flow()

0x4000 in optimize.c, remove empty loops

0x8000 Generate ‘unified binary’, that is, AMD and Intel binaries in one

0x10000 Generate ‘self-debugging’ binary, that is, one with -g and no opt, one with regular options

0x20000 When generating multiple versions of a function, generate each into a different .text section.

0x40000 Use the old unified-binary-version selection method

0x80000 When generating unified binary, generate two copies of version 1. (for debugging)

0x100000 When generating unified binary, generate two copies of version 2. (for debugging)

0x200000 When inlining code for F90 sum-like reduction intrinsics, don’t use a temp if the argument is ‘simple-enough’ to evaluate in-line, and the DIM argument is one.

0x400000 When inlining code for F90 sum-like reduction intrinsics, don’t use a temp if the argument is ‘simple-enough’ to evaluate in-line, regardless of the DIM argument.

0x1000000 Don’t expand reductions with expression arguments and DIM=1 inline.

0x2000000 Don’t bother to fill in the runtime pointer field of the section descriptor, unless debug is set.

0x4000000 for F90, in exp_ftn.c, always subtract zbase*size from array base even if zbase or size is not a constant. for F90, in func.c, don’t expand complex dot_product inline

0x8000000 in fe90/redundss.c, don’t remove subscripts from non-pointer arrays

0x10000000 For unified binary, disable culling

0x40000000 Enable complex operation; add, subtract, multiply, etc., as a single complex operation instead of operating on 2 parts.

0x80000000 WIN DLL target where pgf90 generates indirections for its own commons which are shared between the run-time and the generated code; pointers are generated by the compiler and are filled in upon program startup. For the MS-method of DLL (the default for hammer as of 6.0), these commons are simply imported.

71:

Used to affect candidate list creation.

0x01: Make ili containing frcp appear to have standard latency + 1.

0x02: Make ili containing frcp appear to have standard latency + 2.

0x04: Make ili containing frcp appear to have standard latency + 4.

0x08: Make ili containing frcp appear to have standard latency + 8.

0x10: Create breadth first candidate list.

72:

Used to affect Scheduling of ili for link predecessors.

0x01: Make ili containing frcp appear to have standard latency + 1. (On the i860, it is frcp; on the SuperSparc it is fpop result). Enable the dag scheduler for X86_64 & X86_32.

0x02: Make ili containing frcp appear to have standard latency + 2. (X86_64 & X86_32) EM64T heuristic

0x04: Make ili containing frcp appear to have standard latency + 4. (X86_64 & X86_32) EM64T heuristic

0x08: Make ili containing frcp appear to have standard latency + 8.

0x10000: (X86_64 & X86_32) Don’t check for profitability. Presumably, this flag will be used to rule out performance regressions due to throttling the scheduler.

73:

Used to affect Scheduling of ili for non-link predecessors.

0x01: Make ili containing frcp appear to have standard latency + 1.

0x02: Make ili containing frcp appear to have standard latency + 2.

0x04: Make ili containing frcp appear to have standard latency + 4.

0x08: Make ili containing frcp appear to have standard latency + 8.

74:

Fadd and Fmul handling.

0x01: Make stores off of these fadd,dadd,fmul and dmul operations delay by one more cycle in single-operation mode. This should allow another fadd or fmul to begin earlier.

0x02: Allow dualops that take outputs from the fadder and feed the data into the fmul through a register and not a direct data path. This caused a problem in DYNA?? and is useful for 33MHz chips. mi2tpa instruction.

75:

Used for pipelined load selection

0x01: Change the first ldinc

76:

Used to affect scheduling technique

0x01 Do not use the default scheduling scheme for the Sparc. Currently (12/27/93), the default is XBIT(83,2) (i.e. psched_ili()).

0x02 Software Pipeline the instructions.

0x04 Change LDword to LDSP when possible.

0x08 Schedule next ili after last ili scheduled instead of as early as possible.

77:

Used to affect preemption and spilling.

0x1: Don’t spill constants, just reload.

0x2: Turn on experimental spilling.

0x4: (ST100 only) In a block containing an extended asm statement, any argument registers (r0-3, p0-2) that are in the asm statement’s clobber list are removed from the scratch register list. N.B.: At the time of writing (11/19/02) this has not been implemented. Instead an alternative way of handling clobbered argument registers has been implemented, namely by preempting them if they are in use, which is enclosed in the condition “if ( ! XBIT(77,4))”.

78:

Used to affect spilling of registers:

0xf: Number of ir registers to spill == 0xf (default is 3).

0xf0: Number of sp registers to spill == (0xf0 >> 4) (default is 3).

0xf00: Number of dp registers to spill == (0xf00 >> 8) (default is 3).

0xf000: used in cgregmgr.c
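The nibble fields of xflag 78 can be extracted as in the sketch below. The masks and shifts are the ones listed above; the struct and function names are hypothetical, and the handling of the defaults (3 registers each) and of the 0xf000 field is not modelled because this section does not describe it.

```c
struct spill_counts {
  unsigned int ir; /* number of ir registers to spill */
  unsigned int sp; /* number of sp registers to spill */
  unsigned int dp; /* number of dp registers to spill */
};

/* Extract the three 4-bit spill counts from an xflag 78 value. */
static struct spill_counts decode_x78(unsigned int val)
{
  struct spill_counts c;
  c.ir = val & 0xfu;
  c.sp = (val & 0xf0u) >> 4;
  c.dp = (val & 0xf00u) >> 8;
  return c;
}
```

For example, a value of 0x245 would request 5 ir, 4 sp, and 2 dp registers under this reading.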

79:

Used to control CSE of a DP load. This is hardwired at a distance of 16 linear ili. If you want to CSE always, then just supply a value of 255. If you never want to CSE a DP load, then supply a value of 0.

0x01: used in cglinear.c

0x02: used in cglinear.c

0x04: used in cglinear.c

0x08: used in cglinear.c

0x10: used in cglinear.c

0x20: used in cglinear.c

0x40: used in cglinear.c

0x80: used in cglinear.c

80:

Sparc/X86 versions. The first byte is reserved for the Sparc; the second byte is reserved for the X86.

0x01: Version 8 (has smul, umul, sdiv, udiv instructions + version 7)

0x02: Version 7 (has sqrt instr.)

0x04: Version 9

0x100: P6-only

0x200: P6-optimized but doesn’t use P6-only code unless 0x100 set

0x400: Following unconditional branch, align code on 16-byte boundary

0x800: Don’t generate store-load sequence to round floats before conversion to int.

0x1000: Interleave f.p. operations using FXCH; a P5-specific optimization

0x2000: Don’t eliminate floating pt. spilling; even right after a fp load from mem.

0x4000: Don’t pre-allocate argument space, always push arguments onto stack

0x8000: Issue the ‘CLD’ instruction (clears the direction flag, DF) when generating the rep-movestring instruction sequence.

0x10000: Don’t use the standard prolog; use our own variety

0x20000: perform a runtime check to assure internal floating point stack consistency; at the beginning of each routine and after all calls (not QJSR or CCSYM)

0x40000: don’t put in argument checking code for sin, cos, and tan

0x80000: use .byte instead of fcom and fcmov instructions (for old x86 assemblers)

0x100000: disable GH tuning for scalar conversion merge dependencies.

0x200000: disable GH tuning to eliminate merge dependencies on movhpd, movlhpx, movhlpx loads.

0x400000: AVAILABLE use cvttsd2si instruction. (Pentium IV specific. Uses opcode in assembly.)

0x800000: scalar sse code generation

0x1000000: sse4/mni/core2

0x2000000: gh

0x4000000: sse3/Prescott (pni).

0x8000000: AMD x86-32 (hammer-32).

0x10000000: AMD x86-64 (hammer).

0x20000000: AMD Athlon XP.

0x40000000: Willamette (wni)

0x80000000: AMD Athlon.

81:

Sparc chip

0x01: Regular Sparc Chip. This is the default (MT_S)

0x02: SuperSparc Chip (MT_SS)

0x03 HyperSparc Chip (MT_HS)

0x04 UltraSparc Chip (MT_US)

0x05 ST100 Chip (MT_STGP32)

0x06 ST100 Chip (MT_STVLIW)

0x07 ST100 Chip (MT_STGP16)
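Unlike most xflags, the values listed for xflag 81 are consecutive numbers rather than independent bits (0x03 is also 0x01 | 0x02). The sketch below therefore compares the whole value instead of masking it. The MT_ names and numbers are taken from the list above; the enum tag, the helper functions, and the assumption that the compiler compares by equality are illustrative only.

```c
/* Chip selectors as listed for xflag 81. */
enum sparc_chip_sketch {
  MT_S = 0x01,      /* regular Sparc (default) */
  MT_SS = 0x02,     /* SuperSparc */
  MT_HS = 0x03,     /* HyperSparc */
  MT_US = 0x04,     /* UltraSparc */
  MT_STGP32 = 0x05, /* ST100 */
  MT_STVLIW = 0x06, /* ST100 */
  MT_STGP16 = 0x07  /* ST100 */
};

/* Hypothetical helper: is the selected chip one of the ST100 entries? */
static int is_st100_chip(int x81)
{
  return x81 >= MT_STGP32 && x81 <= MT_STGP16;
}

/* Hypothetical helper: equality test, since a mask test would be ambiguous. */
static int is_hypersparc(int x81)
{
  return x81 == MT_HS;
}
```

A mask test such as (value & 0x01) would match MT_S, MT_HS, MT_STGP32, and MT_STGP16 alike, which is why the sketch avoids it.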

82:

i860 and Sparc CSE of loads in SW pipelined loops. If set, the value determines when a load is CSE’d in a software pipelined loop.

83:

Scheduling technique of Sparc compiler.

0x01 Use scheduling such that ili are laid down in order determined by candidate list.

0x02 Use scheduling that uses cyc/subcyc in general but not for register allocation.

0x04 Use scheduling that uses cyc/subcyc everywhere, even for register allocation.

0x08 Schedule next ili after last ili scheduled instead of as early as possible.

84:

Register allocation scheme of Sparc compiler.

0x01 Try to release reg exactly on subcycle available.

0x02 Try to substitute freed up DP reg on cycle needed for new reg.

0x04 Suppress optimization of an AADD by not swapping operands. (cgoptim.c, peep_ar_res)

85:

Affect linearization process:

0x01: Do not CSE QJSRs.

0x02: used in cglinear.c

0x08: Originally this invoked the old version of PRE (partial redundancy elimination), i.e. function ‘cg_pre()’, but that call has been commented out since it has been superseded by the new version of PRE, i.e. function ‘pre_lilis()’. We should remove all references to this flag in the compiler.

0x10: do value hashing on distributed expressions in PRE.

0x20: disable hashing of loads missing data flow info in traditional extended block scope.

0x40: signal in PRE phase, desired to be set/unset only by PRE.

0x80: disable pattern-match forward propagation in PRE.

0x100: do value hashing of loads missing data flow info in tree-region-style extended blocks.

0x200: enable most aggressive PRE.

0x400: disable the tracking of point register pressure for innermost loops.

0x800: disable heuristic to avoid force stores.

0x1000: Enable ‘peephole0’ phase for x8632/hammer CG; initially this eliminates or reduces redundant %esp/%rsp updates.

86:

Reserved for VLIW/DSP compiler usage.

0x1 Force VLIW/scalar code gen heuristic

0x2 Allow generation of VLIW code for non-loop regions.

0x4 Disable generation of VLIW code for inter-loop regions.

87:

Code generation options for DSPs

0x1 16-bit code generation mode

0x2 GP32-bit code generation w/ scheduling (Default on ST100)

0x4 VLIW-bit code generation w/ scheduling

0x8 reserved

0x10 reserved

0x20 reserved

0x40 reserved

0x80 reserved

0xNNXX NN indicates the memory latency in cycles for the ST100. Default is 6 on ST100.

0x100000 This is used in direct.c to imply that the mode flags flg.x[87] & 0xff should be inherited; that is, not changed. Generally this is set for loop directives, not set for global/routine directives
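As a sketch of how an xflag 87 value splits into its parts: the low byte holds the mode flags (flg.x[87] & 0xff), the next byte (NN in 0xNNXX) holds the ST100 memory latency, and 0x100000 marks the mode flags as inherited. The struct and function names are hypothetical, and the default latency of 6 is not modelled.

```c
struct dsp_codegen_opts {
  unsigned int mode;        /* low byte: code generation mode flags */
  unsigned int mem_latency; /* NN byte: memory latency in cycles */
  int inherit_mode;         /* nonzero if 0x100000 is set */
};

/* Split an xflag 87 value into mode flags, latency, and the inherit bit. */
static struct dsp_codegen_opts decode_x87(unsigned int val)
{
  struct dsp_codegen_opts o;
  o.mode = val & 0xffu;
  o.mem_latency = (val >> 8) & 0xffu;
  o.inherit_mode = (val & 0x100000u) != 0;
  return o;
}
```

Under this reading, 0x100802 selects mode 0x2 (GP32 with scheduling), a memory latency of 8 cycles, and inherited mode flags.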

88:

Predication optimizations

0x01 Simple control-flow flattening (use predication rather than jumps)

0x02 Advanced control-flow flattening using APT information.

0x04 More aggressive predication scheme.

0x08 Re-compute SLIW legality after predication.

0x10 Allow guarded procedure calls.

89:

Advanced optimizations for dsp chips and IPA-related stuff:

0x01 Enable inlining of DSP functions.

0x02 Enable IPA analysis.

0x04 Enable 2nd inlining pass of DSP functions.

0x08 Stop immediately after IPA analysis (don’t generate assembly file).

0x10 In IPA collection phase, output all function names, even if not used.

0x20 Enable ‘fake IPA collection’ mode, whereby IPA information is saved only for the given functions into a specified .ipo file; this is used to create IPA info for library functions for which we have no source.

0x40 Enable IPA inheritance (This is set for the IPA recompile).

0x80 Used internally to disable future IPA inheritance; used in case of errors when inheriting (such as stale .ipa file), or when the IPA collection is stale.

0x100 Do IPA pointer disambiguation in cgutil.c/nm_conflict

0x200 Do IPA pointer disambiguation in cgutil.c/st_sta_ld_conflict

0x400 Do IPA vestigial function elimination.

0x800 Do IPA constant propagation.

0x1000 Do IPA bank assignment.

0x2000 Do automatic fast/slow mode selection.

0x4000 Test IPA frequency feedback.

0x8000 Do IPA driven inlining.

0x10000 Enable IPA frequency feedback.

0x20000 Do IPA-driven global register allocation (safe to allocate global to register).

0x40000 Enable array-of-struct transpose to struct-of-array

0x80000 For DSP function expansion, use the ‘second’ set of functions.

0x100000 Enable outliner.

0x200000 Read .lai file to produce .dsp file.

0x400000 Read .dsp file for user-defined dsp functions.

0x800000 Propagate user-assignments of globals to X/Y banks.

0x1000000 Slim profiler mode.

0x2000000 Disable extended basic-block creation (more accurate line profiles).

0x4000000 Do actual replacement of loops in outlining.

0x8000000 Used for testing of ‘dsplai’ converter.

0x10000000 Enables IPA constant-range propagation and IF removal

0x20000000 Enables enhanced safe pointer optimizations using target analysis

0x40000000 For dsplai, generate .prn file

0x80000000 Compress the .ipo file (using lz)

90:

Used to affect candidate list creation.

0x01: Make ili containing ptr load appear to have standard latency + 1. (On the SuperSparc, it is all loads.)

0x02: Make ili containing ptr load appear to have standard latency + 2.

0x04: Make ili containing ptr load appear to have standard latency + 4.

0x08: Make ili containing ptr load appear to have standard latency + 8.

0x10: used in cgcand.c

0x20: used in machreg.c for st100

91:

Used to enable H/Q bug workarounds for ST1xx

0x01: Handle CB-15

0x08: Handle CB-4 & CB-10

92:

CG optimizations.

0x1: Do analysis on the candidate list to determine what schema to use (currently only breadth-first or depth-first selection).

0x2: On ST100, disable speculative scheduling. On hammer and x8632, move the tail block to the end of the sequence.

0x4: in sched-dag.c:selectinst(), check for prefetch instructions

0x8: Old behavior of sched-dag.c:isRM().

0x10: Generate first level of LAI output (up to, not including, virtual registers).

0x20: Generate LAI virtual registers (should probably only be used with 92,0x10).

0x80: used in cgoptim.c

0x100: Omit non LAI-friendly directives (e.g. the .word before a function for debug info)

0x1000: used in cgcand.c

0x2000: used in cgcand.c

0x4000: used in cglinear.c

0x8000: used in cgregmgr.c

0x10000: used in cgsched.c

0x20000: used in cgregmgr.c

0x40000: Disable propagate and eliminate sign extensions. Used in cgoptim2.c.

93:

VJS XFLAG for ST100 CG alpha/beta opts. DO NOT TOUCH unless you’re VJS.

0x01: Allow for X Y banked loads.

0x02: Allow for new scheduling (MT_STGP32) mode for PL loops at -O3. This will put it thru ssched_ili32a

0x04: Set up 0 load latencies for DR loads inside loops.

0x08: Output ‘nop’s into Superscalar assembly code stream even when not needed. NOP will be inserted upon any empty subcycle.

0x10: Set up 0 load latencies for DR outside of loops.

0x20: Try scheduling at all opt levels. Do not drop down to -O2.

0x40: Set up 0 store latencies for DR register assigns from the DU unit and stored.

0x80: Change result availability for latest ILI. On 4/25/01, AIMV and IAMV were changed.

94:

Used for new CG pass.

0x1 Enable new pass

95:

Used to alter inner resource checking bounds for sparc.

0x1 Have each ili schedule exactly one cycle after last scheduled ili. Increments between microps is one full cycle.

0x2 Each ili microp must schedule within two cycles after last scheduled microp within the ili but ili must start exactly one cycle after last scheduled ili. Increments between microps is one subcycle.

0x4 Each ili microp must schedule within one cycle after last scheduled microp. ILI can start anytime after early start time. Increments between microps is one full cycle.

0x8 Each ili microp must schedule within one cycle after last scheduled microp. ILI can start anytime after early start time. Increments between microps is one subcycle.

0x10 Each ili microp must schedule within two cycles after last scheduled microp. ILI can start anytime after early start time. Increments between microps is one subcycle.

0x20 Each ili microp must schedule within three cycles after last scheduled microp. ILI can start anytime after early start time. Increments between microps is one subcycle.

0x40 Do not allow a cascade from an alu into the shifter within the same group.

0x80 Only split a condition code if it is set as a cascade into an alu. Otherwise branch can be performed within the same group.

96:

Used for scheduling multiple blocks.

0x01: Schedule inner loops that form a region with multiple blocks. Attempt to SW pipeline these loops.

0x02: Generate the multi-column loop even if ‘iteration count’ < ‘swpipe loop columns’.

97:

Used for guards

0x01: Force cg to guard ambiguous fp loads following fp stores. LDINC/STINC only.

0x02: Force cg to assume all stores within SW pipelined loops do not hit cache.

0x04: Force cg to ignore the fact that stores are marked to miss cache (ILT_MCACHE).

98:

Used for alternate memory accesses.

0x01: Force cg to use pipelined stores for fp autoinc stores.

0x02: Force cg to assume all double memory references are aligned.

99:

Used for alternate cg IL handling.

0x01: Force cg to choose dual-inst at -opt 4

0x02: Force cg to choose non-dual-inst at -opt 4

0x04: Let cg choose dual-inst at -opt 4

0x08: Let cg process .pgi file for alternate cg stuff (pgvision).

0x10: Let cg process loop level pragmas for innermost blocks only.

0x20: used in cgutil.c

0x40: used in cgutil.c

0x80: used in cgutil.c

100:

If nonzero, break blocks: a block is broken if the number of ilm words for an ili block exceeds (2 ** (val % 31)).
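The threshold for xflag 100 can be computed as in the following sketch. The formula 2 ** (val % 31) is from the text above; the function names are hypothetical.

```c
/* Limit on ilm words per ili block for a nonzero xflag 100 value. */
static unsigned long block_break_limit(unsigned int val)
{
  return 1ul << (val % 31u); /* 2 ** (val % 31); the shift is at most 30 */
}

/* Nonzero if a block with ilm_words words should be broken. */
static int should_break_block(unsigned int val, unsigned long ilm_words)
{
  if (val == 0)
    return 0; /* flag not set: never break on size */
  return ilm_words > block_break_limit(val);
}
```

Note that because of the modulo, a value of 31 gives the same limit as 2 ** 0, that is, 1.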

101:

ST Processor stepping information. See “stepping.h”

102:

Used to affect cg register handling.

0x01: Allow the ARDF of a scratch reg to be freed of the NOUSE flag and put back in list.

0x02: Handle special case of assigning to a DP reg out of an SP that is the same as that of the DP and whose usecnt is > 1.

103:

Used to affect cg.

0x01: Use alternate FRCP/DRCP ili that leave larger holes.

0x02: UNUSED

0x04: inhibit IL_ZFSUBFMP code generation.

0x08: change IL_FMLOW to IL_PFMLOW for dualop code generation.

104:

Used to affect names conflict checking (CONFLICT and nm_conflict).

0x01: Inhibit check for unequal member names checking. Just return conflict.

0x02: Modify SAFE checking so that distance is always 6 for safe names.

0x04: inhibit NME_INLARR() inline array checking.

0x08: perform additional checks of inliner-created cray pointees with other inliner-created cray pointees and user arrays (hlconflict()).

0x10: Perform further looking at member symbols that are marked as noconflict. This is particularly used to determine that regular user symbols can’t conflict with section descriptor members.

0x20: in conflict, Fortran symbols marked as CCSYM will not conflict with a pointer NT_IND reference; this is unsafe, since even CCSYM symbols may be pointer targets

0x40: Assume a conflict between an ‘unknown’ NME and an NME for a symbol of nonbasic type (like struct, union, array).

105:

Used to specify maximum unroll factor in unroll & jam transformation

106:

Used to specify scalar unroll factor in unroll & jam transformation

107:

Used to specify loop threshold for entering vectorization

108:

Used to specify stripmine size for scalar expansion (STRIPSIZE in hlvect.h)

109:

Used to specify ili count threshold in br_flatten (ST100)

110:

Used to affect latency for ‘alu_latency’ sparc resource. Value is # of cycles + 1 of latency (note: 0 will not work).

111:

Used to affect latency for ‘fpu_latency’ sparc resource. Value is # of cycles + 1 of latency (note: 0 will not work).

112:

Used to affect latency for ‘fdiv_latency’ sparc resource. Value is # of cycles + 1 of latency (note: 0 will not work).

113:

Used to affect latency for ‘fld_latency’ sparc resource. Value is # of cycles + 1 of latency (note: 0 will not work).

114:

Used to affect latency for ‘ld_latency’ sparc resource. Value is # of cycles + 1 of latency (note: 0 will not work).
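For xflags 110 through 114 the stored value is the latency in cycles plus one. A small sketch of the conversion in both directions follows; the function names are hypothetical, and returning -1 for a value of 0 is only this sketch's way of marking the value that the text says will not work.

```c
/* Latency in cycles encoded by an xflag 110..114 value, or -1 if val is 0. */
static int latency_from_xflag(unsigned int val)
{
  if (val == 0)
    return -1; /* 0 is documented as not working */
  return (int)val - 1;
}

/* Value to pass to -x 110..114 for a desired latency in cycles. */
static unsigned int xflag_from_latency(unsigned int cycles)
{
  return cycles + 1u;
}
```

Under this reading, requesting a 3-cycle ‘ld_latency’ would mean passing -x 114 4.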

115:

n from -Minline=levels:n how many levels of inlining to do

116:

Used as a value between 0 and 100 to determine whether a function should execute in fast mode (gp32 for ST100) or slow mode. Sort the functions by the amount of execution time spent in each on a profiling run. From fastest to slowest, compute cumulative amount of time spent in this and more time-consuming functions. For each function, compute the percent of that cumulative time relative to the total time of the profiling run. If this percent is less than the value of the x flag, run in fast mode.
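One reading of the selection rule for xflag 116 is sketched below: functions are ordered from most to least time-consuming, a running total is kept, and a function runs in fast mode while the running total stays below the given percentage of the whole profile. The struct, the function names, and this interpretation of the ordering are assumptions; the profiler data format is not described here.

```c
#include <stdlib.h>

struct func_profile {
  const char *name;
  double time;   /* execution time from the profiling run */
  int fast_mode; /* output: nonzero if selected for fast mode */
};

/* qsort comparator: most time-consuming function first. */
static int by_time_desc(const void *a, const void *b)
{
  const struct func_profile *fa = a;
  const struct func_profile *fb = b;
  if (fa->time < fb->time)
    return 1;
  if (fa->time > fb->time)
    return -1;
  return 0;
}

/* Sort f[] and mark each entry fast or slow for the given threshold. */
static void select_fast_mode(struct func_profile *f, size_t n,
                             unsigned int threshold)
{
  double total = 0.0, cumulative = 0.0;
  size_t i;

  for (i = 0; i < n; i++)
    total += f[i].time;
  qsort(f, n, sizeof *f, by_time_desc);
  for (i = 0; i < n; i++) {
    /* time spent in this and all more time-consuming functions */
    cumulative += f[i].time;
    f[i].fast_mode =
        total > 0.0 && (100.0 * cumulative / total) < (double)threshold;
  }
}
```

With times of 50, 30, 15, and 5 and a threshold of 90, the first two functions (cumulative 50% and 80%) are marked fast and the remaining two (95% and 100%) are marked slow.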

117:

Reserved for C++.

0x01: turn off extra C++ debug information when EDG produces C code (–c)

0x02: turn on output of mangled names in C++ debug information: for dolphin inc

0x04: Put out TAG_formal_parameter instead of TAG_unspecified_parameters for the this parameter

0x08 Turn off the translation of the EDG generated call to _mp_lcpu3() to IM_LCPUS3. This is an MP optimization.

0x10 Turn off a new optimization for --one_instantiation_per_object where we don’t read the file scope information for every new template

0x20 Extract only C++ functions with the inline keyword. The default is to allow all functions to be inlinable.

0x40 Turn off the gnu style inlining in which we mark all non static member functions as inlinable, even those that are declared outside the class.

0x80 turn on ADDRTKN flag setting according to EDG collected information on variable.

0x100 AVAILABLE (Formerly indicated setjmp/longjmp style exceptions, which are no longer supported.)

0x200 C++ exceptions are enabled. (Formerly indicated zero-cost exceptions, as opposed to setjmp/longjmp exceptions. But zero-cost exceptions are now the only style of exceptions that are supported.)

0x400 Disable GSCOPE optimization retarget. (Does not appear to be used anywhere.)

0x800 When using the auto-reinliner, treat flg.autoinline as having value 1

0x1000 Enable the auto-reinliner, that is, inline during the extract phase of auto-inline. This allows multiple levels of auto-inlining with a single inliner pass, since the inlining will have been done during the extract.

0x2000 Do not generate instrumented profile calls (e.g., prof_ruent, etc.) inside templated functions.

0x4000 Enable restart for levels-driven bottom-up auto-inlining from the leaves.

0x8000 AVAILABLE (Formerly indicated that exceptions had been disabled, but 117,0x200 now covers that case.)

0x10000 Enable bottom-up inlining for -Minline.

118:

Reserved for C++.

0x01: Force .ctor sections instead of .init sections, as a temporary step for x86 C++. Hammer C++ already uses .ctor sections. Win64 does not.

0x02: pgc++ is the nvcc host compiler : remove gnu __builtin for –c

0x04: emit gnu compatible DW.ref sections when -fpic is set. We don’t turn this on right now because libpgc.so contains a c++ file, and would give an undefined gxx_personality symbol.

119:

Assembler - (NOTE: overflow at 129)

0x01: sym+off not allowed in .val for debug (coff)

0x02: unix-style (mcount()) profiling (augments -profile); see -x 119 0x40000.

0x04: align functions to 32 bytes, rather than 8 bytes.

0x08: emit unreferenced data-initialized statics (C).

0x10: For i860, misc directive hacks to generate a.out assembly language acceptable for input to gas i860 assembler (should be an astype) [ temp ]. For i386, (old) linux compatibility mode: the value placed in the .align directive is used as a power of 2 (number of low-order zero bits); fp instructions which pop the stack are suffixed with ‘p’; .s comment character is ‘#’; ‘include_next’ is recognized as a synonym for ‘include’.

0x20: Place strings in read-only section (C).

0x40: Allow repeat counts in data-initializing directives.

0x80: efficient SP & DP constants generated in-line.

0x100: Compiler-created variables allocated in vcache.

0x200: all SP & DP constants generated in-line.

0x400: Generate ..sys local symbol (i860). Generate call to __pgimain() (x86-nt, pgc, pgc++).

0x800: Don’t emit definition of __mp_fsr (i860).

0x1000: Add leading underscore to external names.

0x2000: x86 precision control: -pc 32

0x4000: x86 precision control: -pc 64

0x8000: x86 precision control: -pc 80

0x10000: x86 fp instructions without operands which pop the stack are suffixed with ‘p’.

0x20000: x86 ELF .section directive - don’t enclose the name of the section in quotes (‘”’)

0x40000: x86 - Same as mcount profiling, but libcount() is called rather than mcount (used by SSD to instrument library functions).

0x80000: x86 - .lcomm & .comm directives require values to indicate alignment.

0x100000 Don’t require 8-byte alignment for long long, unsigned long long, integer*8, and logical*8 data; instead, use 4-byte (int) alignment. -nodalign affects both double precision and 64-bit integer data.

0x200000 Assembly comment character is ‘#’.

0x400000 No .version directive.

0x800000 x86 - use .local, .comm directive sequence instead of .lcomm.; add value to .comm directive to indicate alignment.

0x1000000 x86 fortran - Don’t add any trailing underscores

0x2000000 x86 fortran - add a second trailing underscore if name contains an underscore

0x4000000 : Do not append @#bytes to function references for MS standard call (weird g77 compatibility mode)

0x8000000 Align stack in prolog of main routine, rather than crt1

0x10000000 Cache align data sections, e.g., the stack, common blocks.

0x20000000 Align outermost loops on a 4-byte boundary (pmn. changed from 16)

0x40000000 Align innermost loops on a 4-byte boundary (pmn. changed from 16)

0x80000000 Generate profiling calls for all loads and stores on x86

120:

Coff debug information

0x01: generate additional symbolic information for pgftn.

0x02: ???

0x04: turn off translation of prototyped function info: P_FUNC is needed to produce correct debug info for overloaded functions, but may create user errors.

0x08: turn off generation of BASED array stab debug information if stab_sym is N_LSYM.

0x10: For the sparc, stab debug information uses stab 2.0 for the data type entries, allowing debugging on PGI’s Sun OS 4.x compilers with sunpro’s debugger. For the x86, generate gnu-style stab debug information.

0x20: generate stabs in ELF or COFF object files.

0x40: generate C++ debug information for all symbols. Do not delete according to the “referenced” flag.

0x80: generate dwarf in COFF object files.

0x100 Print DWARF comments

0x200 Generate dwarf2 (X86)

0x400 Do not generate dwarf2 call frame (ST100). Do not generate xdata/pdata (WIN64).

0x800 Do not allocate unreferenced variables when generating dwarf1 or dwarf2.

0x1000 Generate debug lite.

0x2000 Inhibit dwarf2 generation for fortran block data.

0x4000 Set the dwarf version to 3. Emit 4-byte quantity for DW_FORM_ref_addr regardless of the size of an address on the target machine.

0x8000 For the DW_AT_upper_bound of a VLA, generate the address of the compiler-created temp which is assigned the upper bound. This is actually incorrect, but needed as a work-around for pgdbg reporting ‘not compiled with -g’ when the correct info is present. When pgdbg is fixed, remove the use of the XBIT.

0x10000 Inhibit emission of DW_TAG_imported_declaration DIEs for each used module.

0x20000 Generate a popsection/previous.

0x40000 Do not extract the file name from the first line, a # line directive, of a file when it’s the output of the preprocessor. If the name is extracted, it will be used as the name of the file to be debugged.

0x80000 Obtain OpenMP thread id using DWARF3 compliant operations (as opposed to using DWARF3 extension DW_OP_PGI_OMP_THREAD_NUM).

0x100000: Do not generate attribute DW_AT_MIPS_linkage_name (C++/F90).

0x200000: Do not generate .pgi_trace section.

0x400000: Do not generate the addressing hacks for common blocks and statically allocated locals on Mac OS X.

0x800000: Do not emit artificial dwarf entries for compiler-created arguments to function/subroutine.

0x1000000 Set the dwarf version to 4.

0x2000000 Set the dwarf version to 5.

0x4000000: Do not generate include file tables.

0x8000000 AVAILABLE

0x10000000: Generating eh_frame.

0x20000000: Generating eh_frame with .cfi directives: requires 120,0x10000000 to be on

0x40000000 Generate .debug_names/.debug_pubnames section.

0x80000000: no license check in executable.

121:

Linkage modifications

0x01: don’t set up frame (only if not debug, alloca(), and varargs)

0x02: additional restriction for -x 121 1, no (*p)()

0x04: Replace normal calls (JSR on all platforms and QJSR for ST100) with far calls (JSRFAR).

0x08: in use

0x100: Do not generate calls to __builtin_stinit() on Windows (when allocating stack)

0x200: Use __chkstk instead of __builtin_stinit() on Windows (when allocating stack)

0x400: Generate ABI-neutral IL_RETURN for aggregate data types. The expander does generic argument and return value bindings.

0x800: Generate ABI-neutral calls (GJSR/GJSRA) – eventually, this will be default with CUDA & OpenACC

0x10000 : WINNT/WIN95 calling conventions are the default for pgf77, pgf90.

0x20000 : pgcc, pgCC - for MSCALL defined names, also emit the undecorated entry name.

0x40000 : WIN CREF calling conventions for pgf77, pgf90.

0x80000 : WIN NOMIXED_STRLENs for pgf77, pgf90 (augments mscall or cref).

0x100000 : x86 - return small structs in registers (eax or eax+edx).

0x200000 : WIN - use lowercase names for fortran external names.

0x400000 : x86 C - return float complex the same as gcc (in registers eax+edx).

0x8000000 : call a check-stack-overflow function to check the per-thread stack size and perhaps a function’s stack size

122-127:

RESERVED FOR NON-STANDARD/IMPLEMENTATION-DEFINED BEHAVIOR MODIFICATIONS

122:

C implementation-defined behavior modifications.

0x01: Perform a narrowing operation from an int value by sign extending.

0x02: implied by -Xs: K&R

0x04: long long, unsigned long long

0x08: treat extern and static data as volatile

0x10: treat plain char as unsigned char; the default is signed char.

0x20: treat long as int and unsigned long as unsigned int.

0x40: Allow the GNU-defined __signed__ keyword as a synonym for signed (unless in strict ansi mode).

0x80: Use alternate builtin functions for arithmetic operations (e.g., integer divide).

0x100: ST100 - Disable enhanced jump table method for switch statements.

0x200: ST100 - Disable non-conservative approach in all enhanced switch statements (applies to enhanced jump table method and constant time method).

0x400: ST100 - Disable enhanced inline jump table method for switch statements (a.k.a. the constant time method).

0x800: ST100 - Disable copya elimination enhancement in constant time switch method.

0x1000: ST100 - Disable use of multiple guards in constant time switch method.

0x2000: nonST100 - For a use of a store ILM, don’t attempt to refer to the result as a load of the left-hand side; instead, refer to the result as a cse of the right-hand side. Someday, will want uses of store ILMs to be consistent across targets.

0x4000: allow narrow int arguments in a prototyped function declaration to be compatible with int arguments in an old-style function definition.

0x8000: for “bug compatibility”, revert to alignment used in previous releases for certain structures containing long integers and int bit fields that cross 2-byte boundaries.

0x10000: Output C macro definitions as they are encountered.

0x20000: Output C #include definitions as they are encountered

0x40000: Output C macro definitions for predefined macros.

0x80000: When outputting macro definitions, do NOT include the definitions.

0x100000: Emit warnings when invoking prototype-less functions.

0x200000: Drop limit on the maximum length of a line generated after preprocessing (‘cpp’ mode).

0x400000: C11

0x800000: AVAILABLE

0x1000000: AVAILABLE

0x2000000: AVAILABLE

0x4000000: AVAILABLE

0x8000000: AVAILABLE

0x10000000: AVAILABLE

0x20000000: AVAILABLE

0x40000000: AVAILABLE

0x80000000: temporary, 03/25/2010 (I hope) - at the center of fixing 16741, ST_UNKNOWNs are created immediately for formal arguments; however, this has the effect of ‘hiding’ previously declared variables which semant has to deal with. Just in case a regression occurs in the field, this XBIT says don’t create ST_UNKNOWNs (yes, f16741 will then fail).

123:

C implementation-defined behavior modifications (cont).

0x01: preprocessor passes comments thru (also implies -es); driver option -C

0x02: preprocessor generates makefile information to stdout; driver option -M

0x04: preprocessor allows C++ style comments; driver option -B

0x08: preprocessor generates makefile information to <program.d>; driver option -MD

0x10: implied by -Xa: att cc compatibility; default value of __STDC__ is 0 and XBIT(123,0x100) is set.

0x20: preprocessor does not separate tokens with spaces.

0x40: preprocessor performs macro replacement within character constants and strings

0x80: implied by -Xt: k&r compatibility plus transitional msgs.

0x100: implied by -Xc (C): strict Ansi conformance (C); default value of __STDC__ is 1 and XBIT(123,0x10) is not set. For fortran, don’t emit the #line directives.

0x200: preprocessor suppresses whitespace between tokens that are OUTSIDE of macro bodies. Whitespace is still added between tokens that are in macro bodies.

0x400: Don’t alter optimizations when generating debugging information. For example, if this bit is set, inhibit generating the lexical block debugging information by semant.

0x800 Don’t collapse whitespace (‘cpp’ mode)

0x1000 C preprocessor - allow gcc’s preprocessor extensions: #include_next, #warning, arg … (vararg function macros), CPATH, C_INCLUDE_PATH, etc.

0x2000 C preprocessor - expand macros within #pragma lines

0x4000: preprocessor ignores system files (<a.h>) when generating makefile information either to stdout (123 2) or file.d (123 8); only quoted files are handled.

0x8000: Do not check the first preprocessing token after #pragma to determine if macro replacement is to be performed for the #pragma line; normally, macro replacement will occur in the line if the token is “omp”, “acc”, or “pgi”.

0x10000: F90: print out .mod files needed to compile this file to stdout

0x20000: F90: print out .mod files needed to compile this file to filename.m

0x40000: PVF build dependencies.

0x80000: Keep blank lines … for -Mcpp switch

0x100000: When preprocessing, $ is not allowed in an identifier.

0x200000: When preprocessing assembly file, unrecognized # directives are just text.

0x400000: Don’t check definition of __STDC__

0x800000: Don’t attempt to distinguish include files as system header files

0x1000000 Don’t issue messages for extra tokens for line directives, as produced by gcc preprocessor.

0x2000000 Don’t terminate the expansion of the _Pragma preprocessor operator with a newline (i.e., the old behavior)

0x4000000 Use the legacy Fortran preprocessor (fpp), and not the ANSI-C99 preprocessor.

0x8000000 Preprocessor puts out dependence lines to gbl.cppfil instead of file.d or stdout

0x10000000 Unused.

0x20000000 preprocessor generates makefile information to stdout; driver option -MT

0x40000000 preprocessor generates makefile information to stdout; driver option -MQ

0x80000000 C9X

124:

F77 implementation-defined behavior modifications.

0x01: Perform a narrowing operation from an int value by sign extending.

0x02: pack common blocks and structures (not impl.)

0x04: treat unit ‘*’ as stdin if read, stdout if write

0x08: treat REAL as DOUBLEPRECISION and COMPLEX as DOUBLECOMPLEX (also applies to real/complex constants)

0x10: treat INTEGER as INTEGER*8 and LOGICAL as LOGICAL*8

0x20: treat the intrinsics REAL and CMPLX as DBLE and DCMPLX (obsolete in Fortran).

0x40: treat backslash as an ordinary character (no escape sequences)

0x80: don’t mark data initialized locals as SAVEd (not impl.)

0x100: enable cexe$ lines

0x200: inhibit expanding x**c, 1<=c<=__MAXPOW (10), to a sequence of multiplies

0x400: 64 bits of precision for integer*8 and logical*8 operations.

0x800: Perform hardcoded register allocation in CG

0x1000: Emit references to unreferenced EXTERNALs. This flag implies that global directives will be issued; for an actual reference, -x 124 0x4000, must also be present.

0x2000: AVAILABLE

0x4000 Emit an actual reference to unreferenced EXTERNALs; -x 124 0x1000 must also be present.

0x8000 Null-terminate character literals.

0x10000 The preprocessor behaves like cpp; for example, a function-like macro is expanded whenever the name appears irrespective of the presence of actual arguments.

0x20000 Change the level of the “has not been explicitly declared” error (#38) from severe to warning (f77, f90).

0x40000 Inhibit transforming x**c into x**i, where c is the integer i expressed as a real or double constant.

0x80000 Expand the list of real intrinsics to be treated as double to include float, TBD.

0x100000 Preprocessor - skip over fortran comments (e.g., don’t expand macros in comments, etc.).

0x200000 Preprocessor - ‘pgi’ is no longer defined by default (f15141); define pgi iff -Mx,124,0x200000 is set (just in case)

125:

F77 implementation-defined behavior modifications (cont).

0x01: treat an i/o statement as a critical section.

0x02: byte-swapped unformatted i/o

0x04: Treat all EUC characters as a single column position for Hollerith, source line length.

0x08: When testing logical values, treat zero as false and non-zero as true instead of even and odd, respectively.

0x10: Print error messages in Kanji.

0x20: Allocatable commons are allocated just once (can use precise names entries).

0x40: Use Cray’s ‘no conflict’ semantics for references via pointers; expander generates precise NMEs for references of pointer-based objects.

0x80: Allow implicit statements after specification statements.

0x100: The bounds of pointer-based arrays are precise; normally, it’s assumed that the last dimension is not valid even if it’s a constant.

0x200: Assume varargs callee (hammer)

0x400: For f90 array pointers, don’t attempt to multiply the subscript by the section stride and add in the section offset (don’t set ptrexpand).

0x800: Don’t replace calling …str_cpy1 with a ‘block move’.

0x1000: When replacing …str_cpy2 with a ‘block move’ and the rhs is a shorter constant, create a new constant completely padded with blanks. Normally, the new constant is a multiple of 8 (64-bit) or 4 (32-bit).

0x2000 For F90, use TY_PTR for f90 pointers instead of Cray pointer integer types

0x4000 When expanding a subscript expression for non-pointer arrays, do not attempt to move the first subscript when constant into the zbase computation.

0x8000 When expanding a subscript expression for pointer arrays, do not attempt to move the first subscript when constant into the zbase computation.

0x10000 I was experimenting with a different way to expand array subscripts, and that’s controlled here.

0x20000 Use 64-bit subscripting (ALSO for C)

0x40000 Pass string lengths as ‘int’ (not as the target’s size_t)

0x80000 -Mcontiguous (fortran front-end and back-end)

0x100000 -Mnovariadic_macros (-Mvariadic_macros is the default and is used to augment the -c89 switch when we need to turn them back on )
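Bit 0x08 of xflag 125 changes how a Fortran logical value is tested. A minimal C sketch of the two tests follows. The function name and the boolean parameter standing in for XBIT(125,0x08) are hypothetical, and the default is assumed here to test the low-order bit (odd is true, even is false).

```c
#include <stdbool.h>

/* Hypothetical helper: decide whether a Fortran logical value is true.
 * zero_nonzero stands in for XBIT(125, 0x08). */
bool logical_is_true(int value, bool zero_nonzero)
{
    if (zero_nonzero)
        return value != 0;      /* zero is false, anything else is true */
    return (value & 1) != 0;    /* assumed default: odd is true, even is false */
}
```

Under these assumptions the value 2 tests false by default and true when the bit is set.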

126:

FTN keyword extensions

127:

C keyword extensions

0x01: asm

0x02: volatile (backend handling)

0x04: gcc keywords - __attribute__, … (see semant.c)

0x08: ghs keywords - __inline, … (see semant.c)

0x10: gcc compatible asm (see semant.c); incompatible with 127,1

0x20: disable built-in __m128, __m128d, __m128i, __m256, __m256d, __m256i data types

128:

LAI/LAO extensions

0x01: Enable basic LAI output by inhibiting harmful directives.

0x02: Enable Virtual Register output.

0x04: Inhibit push/pop sequence; emit .sliw - .ends; emit .leave

0x08: Enable .livein/.liveout directives.

0x10: Enable .proto and .loopinfo directives.

0x20: Enable LAO defect workarounds.

129:

Assembler - (NOTE: overflow from 119:)

0x01: don’t put out profiling line entry calls for lineno:0

0x02: x86/assem.c Set sse flush to zero mode.

0x04: x86/assem.c Set sse denorms are zero mode.

0x08: x86/assem.c Align smaller than size_of(int) auto vars on int boundary.

0x10: Generate %rip-relative addressing on WIN64.

0x20: Unified binary - generate test/jump in reverse order in stub

0x40: Unified binary - generate stub between the two versions, not after both

0x80: x86 - .lcomm (not .comm) directives require values to indicate alignment.

0x100: Allow 16-byte misaligned memory operands in vector arithmetic instructions and maximize the usage of memory operands in vector arithmetic instructions.

0x200: No special startup/initialization for main().

0x400: x86/assem.c Don’t set sse denorms to zero mode. We need this negative flag because the -tp type sometimes sets the mode.

0x800: hammer - -Mprof=instrument:functions – same as -Mprof=func, but call instent64/instret64

0x1000: Disable 32-byte stack alignment. Note that this only applies to AVX targets, since 32-byte stack alignment is not used for non-AVX targets.

0x2000: The stack is kept 16-byte aligned for 32-bit Linux per the OSX abi. When XBIT(129,0x2000) is set, allow legacy callers in which case we can only emit unaligned 16-byte moves.

0x4000: x86/assem.c: don’t set sse denorms to zero mode. This is used with -Mnodaz; for x86 processors, the default is target CPU specific, and this overrides the CPU-specific default.

0x8000:

0x10000: Use .align 8 at the function entry (hammer).

0x20000: Don’t align the function entry (hammer).

0x40000:

0x80000:

0x100000: Inhibit writing .ident info to assembly file.

0x200000: Sun assembler syntax for amd64: Assembly comment character is ‘/’; movdq instead of movd.

0x400000: Don’t add a second ‘#’ to the comment char (when XBIT(119,0x10) or XBIT(119,0x200000)).

0x800000: Including comments for floating point constants has become a compile-time problem since the cost of converting the fp representation to ascii can be relatively high. Do not emit the values of fp constants in comments unless this XBIT is used.

0x1000000: use 16 byte alignment for stack data less than 16 bytes on x64

0x2000000: Don’t place constants in a read-only section. The default is to not protect constants.

0x4000000:

0x8000000:

0x10000000: The presence of -Msmartalloc=huge; note that the value n in -Msmartalloc=huge:n is passed via flg.x[156].

0x20000000: mallopt secret

0x40000000: Hammer - 64-byte (cache) alignment and padding for locals (bss) 64 bytes or larger.

0x80000000: ST100 - when placing objects in the small data/bss sections, use the minimum alignment rule, i.e., the possible sections are .s[bss|data][1|2|4]. The default is to only use .sbss1/.sdata1.

130:

VLIW levels.

131:

Predication levels.

132:

ST100 local register allocation.

0x1: Use the static local register allocator (-Mregalloc=static) for GP32 and SLIW code.

0x4: Use the optimized local register allocator and re-allocator, also known as the ‘holes’ register allocator (HRA), for GP32 and SLIW code. This is incompatible with -Mregalloc=static, i.e. with XBIT(132, 0x19). If any of the latter flags are set they take precedence and the HRA is disabled.

0x8: Use -Mregalloc=static only for GP32 code.

0x10: Use -Mregalloc=static only for SLIW code. Currently (11/19/02) this is not supported.

0x20: Use the HRA only for GP32 code. This may be combined with the use of -Mregalloc=static for SLIW code, i.e. XBIT(132, 0x10), but not for GP32 code, i.e. XBIT(132, 9). If either of the latter flags are set they take precedence and the HRA is disabled.

0x40: Use the HRA only for SLIW code. This may be combined with the use of -Mregalloc=static for GP32 code, i.e. XBIT(132, 8), but not for SLIW code, i.e. XBIT(132, 0x11). If either of the latter flags are set they take precedence and the HRA is disabled.
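The precedence rules for xflag 132 can be sketched as two predicates. This is one reading of the descriptions above, not code from the compiler: the function names and enumerator names are hypothetical, and the combined masks 0x19, 0x9 and 0x11 are taken directly from the text.

```c
#include <stdbool.h>

/* Hypothetical names for the documented bits of flg.x[132]. */
enum {
    RA_STATIC_ALL  = 0x1,   /* -Mregalloc=static for GP32 and SLIW */
    RA_HRA_ALL     = 0x4,   /* HRA for GP32 and SLIW */
    RA_STATIC_GP32 = 0x8,   /* -Mregalloc=static for GP32 only */
    RA_STATIC_SLIW = 0x10,  /* -Mregalloc=static for SLIW only */
    RA_HRA_GP32    = 0x20,  /* HRA for GP32 only */
    RA_HRA_SLIW    = 0x40   /* HRA for SLIW only */
};

/* HRA is used for GP32 code if requested and no conflicting static bit is set. */
bool hra_enabled_gp32(unsigned x132)
{
    bool all  = (x132 & RA_HRA_ALL)  && !(x132 & 0x19);
    bool gp32 = (x132 & RA_HRA_GP32) && !(x132 & 0x9);
    return all || gp32;
}

/* HRA is used for SLIW code if requested and no conflicting static bit is set. */
bool hra_enabled_sliw(unsigned x132)
{
    bool all  = (x132 & RA_HRA_ALL)  && !(x132 & 0x19);
    bool sliw = (x132 & RA_HRA_SLIW) && !(x132 & 0x11);
    return all || sliw;
}
```

For example, 0x20 combined with 0x10 keeps the HRA for GP32 code, while 0x20 combined with 0x8 disables it.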

133:

A number n where 0 <= n && n <= 40. This gives the density threshold for SLIW scheduling on the ST100. Thus, if one uses ‘20’ for the value, the density would be 20.0/10.0, or 2.0 instructions/bundle. The threshold is open, so in the case of a 2.0 inst/bundle threshold, there must be more than 2.0 inst/bundle.
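The scaling by 10 and the open (strict) threshold can be sketched in C. The function names are hypothetical and are not taken from the compiler sources.

```c
#include <stdbool.h>

/* Hypothetical helper: convert the xflag 133 value into a density
 * threshold in instructions per bundle, e.g. 20 -> 2.0. */
double sliw_density_threshold(int xflag133)
{
    return xflag133 / 10.0;
}

/* The threshold is open: the measured density must strictly exceed it. */
bool sliw_density_ok(double insts_per_bundle, int xflag133)
{
    return insts_per_bundle > sliw_density_threshold(xflag133);
}
```

With a value of 20, a density of exactly 2.0 instructions per bundle does not pass, while 2.5 does.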

134:

Hammer/X8632 CG reg stall values. The GP stall limit is the bottom nibble; the next nibble is the smm stall limit (see cgopt2rg.c).
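Unpacking the two nibbles might look like the following sketch. The function names are hypothetical; the actual extraction code is in cgopt2rg.c and may differ.

```c
/* Hypothetical helpers: flg.x[134] packs two 4-bit stall limits. */

/* GP stall limit: bottom nibble (bits 0..3). */
unsigned gp_stall_limit(unsigned x134)
{
    return x134 & 0xF;
}

/* smm stall limit: next nibble (bits 4..7). */
unsigned smm_stall_limit(unsigned x134)
{
    return (x134 >> 4) & 0xF;
}
```

Under this layout, a value of 0x53 would give a GP stall limit of 3 and an smm stall limit of 5.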

135:

Hammer CG. (NOTE: continued at 164)

0x1: -mcmodel=medium

0x2: DOCUMENT

0x4: DOCUMENT

0x8: DOCUMENT

0x10: DOCUMENT

0x20: DOCUMENT

0x40: DOCUMENT

0x80: skip move exit code

0x100 use PUSH/POP for callee-save GP regs in entry/exit code

0x200 cgoptim2.c:cg_global_opts() - such as -Mdse

0x400 use PUSH/POP for callee-save GP regs in entry/exit code

0x800: no .p2align for labels of non-innermost loops (see xflag 155 for altering the .p2align values).

0x1000: AVAILABLE

0x2000: .align 16 before loop; no .align after jmp

0x4000: .align 8 before loop; no .align after jmp

0x8000: no align before loop; no .align after jmp

0x10000: no align after jmp

0x20000: allow coalescing of register-to-register moves of different sizes.

0x40000: disable two byte return for branch-to-ret scenario

0x80000: DOCUMENT

0x100000: Force OPT1 regalloc method

0x400000: Enable 32B loop alignment for GH. (!)

0x800000: Enable ‘tregion’ CSE.

0x1000000: DOCUMENT

0x2000000: DOCUMENT

0x4000000: DOCUMENT

0x8000000: DOCUMENT

0x10000000: DOCUMENT

0x20000000: DOCUMENT

0x40000000: enables experimental enhancements to CSE elimination. See also -Mx,145, 146 and 147.

0x80000000: enables Steve Christiansen’s experimental enhancement to CSE elimination.

136:

Branch prediction and optimizations

0x1: Enable static branch prediction

0x2:

0x4:

0x8:

0x10: Enable return heuristic

0x20: Enable call heuristic

0x40: Enable guard heuristic

0x80: Enable opcode heuristic

0x100: Enable pointer compare heuristic

0x200: Enable loop heuristic

0x400: Disable exit heuristic

0x800: Disable eh (exception handling) heuristic

0x2000: A compilation-time efficient block position implementation.

0x4000: Use edge frequencies to guide merging sequences in the block position final phase.

0x10000: Region-based (allowing small hammock regions, instead of pure trace-based) code layout.

0x40000: Skip dynamic code layout if the number of edges without matched edge counts is over a threshold.

0x80000: Experiment with code layout with C++ –zc_eh. The brpred.c blkcnt threshold is set to MAX_BLOCKS rather than 500.

137:

0x01: Enable CUDA C++ and Fortran parsing.

0x02: Enable CUDA Fortran emulation.

0x04: CUDA Fortran old/new calls to global routines.

0x08: Disable CUDA Fortran parallel task creation for emulation.

0x10: Enable CUDA Fortran automatic USE of cudadevice.mod in device routines.

0x20: Enable inlining of pgf90_lba and pgf90_uba even if not in device code.

0x40: Put the device array descriptor into constant memory. Perf optimization.

0x100: Enable CUDA X86 back end code generation.

0x200: allow automatic shared arrays

0x400: Don’t use optimized CUDA X86 back end

0x800: Do use optimized CUDA X86 back end, even at opt 0 or 1

0x1000: Temporarily, use kernel optimization in F90

0x2000: Imply MANAGED for all ALLOCATABLE objects in F90

0x4000: Don’t put managed variable array descriptors in constant memory

0x8000: Allow character strings in CUDA Fortran

0x10000: Allow some formatted print statements in CUDA Fortran, EXPERIMENTAL

0x20000: Don’t allow statements between the DO loops of a cuf kernels do construct

0x40000: reserved

0x80000: reserved

0x100000: reserved

138:

vect prefetch limit

139:

single precision SSE size limit

140:

single precision SSE size limit

141:

iteration count passed to llvect

142:

vect prefetch distance

143:

iteration limit for use of non-temporal stores

144:

Limit on number of non-temporal stores to use per loop (currently 1 for amd and 2 for intel targets).

145:

0x1: Enable static and inline unreferenced functions removal (LX-only for now).

(Temporary, for hammer only): if -Mx,135,0x40000000 is specified and (opt >= 2), then a non-zero value for -Mx,145 gives the maximum live range for constant CSEs. By default their maximum live range is calculated in the same way as for other types of CSE.

146:

(Temporary, for hammer only): a tuning parameter for CSE elimination at (opt >= 2). If either flg.x[146] or flg.x[147] is non-zero the maximum CSE live range is given by (flg.x[146] + (flg.x[147] * n_nodes_ilitree( ili ))), otherwise it is 170.

147:

(Temporary, for hammer only): a tuning parameter for CSE elimination at (opt >= 2). If either flg.x[146] or flg.x[147] is non-zero the maximum CSE live range is given by (flg.x[146] + (flg.x[147] * n_nodes_ilitree( ili ))), otherwise it is 170.
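The live-range formula shared by xflags 146 and 147 can be written as a small C function. This is a sketch: the function name is hypothetical, and the n_nodes parameter stands in for the result of n_nodes_ilitree( ili ).

```c
/* Hypothetical helper: maximum CSE live range at opt >= 2.
 * x146 and x147 are the values of flg.x[146] and flg.x[147];
 * n_nodes stands in for n_nodes_ilitree(ili). */
int max_cse_live_range(int x146, int x147, int n_nodes)
{
    if (x146 != 0 || x147 != 0)
        return x146 + x147 * n_nodes;
    return 170;   /* default when both tuning parameters are zero */
}
```

For instance, x146 = 100 and x147 = 5 give a maximum live range of 160 for an ILI tree of 12 nodes.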

148:

Options for controlling collection and use of data for PFO.

0x1: Enable collection of information

0x2: Disable collection of edge information

0x4: Disable collection of value information

0x8: Use Min-MST form of edge instrumentation

0x10: Output BIH numbers instead of FG numbers for (src, dst) of EFCs.

0x20: The PFI_LONG members of the PFO structure are aligned on 8-byte boundaries (32-bit targets only).

0x1000: Enable use of PF data

0x2000: Enable old edge propagation.

0x4000: Disable new edge propagation.

0x8000: Enable simple forward edge propagation without dealing with inlined functions and loops.

0x10000: Disable basic block reordering based on profile data

0x20000: Disable optimizations of code involving semi-invariant values

0x40000: PFO-guided switch expansion to peel off hot cases.

0x80000: Disable profile feedback guidance of register allocation.

0x100000: Disable pgInstrumentValues() and pgInstrumentLoops().

0x200000: Disable the call to pgInstrumentEdges().

0x400000: Invoke PFO_Edges() again from optimize() under PFO.

0x800000: Enable the new method of computing BIH_BLKCNT values.

0x1000000: Disable the invocation of branch_prediction from latepredict().

0x2000000: Force block position even in the presence of missing or inconsistent edge counts.

0x4000000: Indirect call profiling.

0x8000000: Disable the fixup of ILM tags in edge count propagation.

0x10000000: Enable partial edge propagation in inlinee even if inliner’s profile data is missing.

0x20000000: Disable the shutdown of certain optimizations for cold loops.

0x40000000: Disable the code layout heuristic to favor a lexical order in the case of a tie on execution frequency.

149:

For hammer and x8632 only, a non-zero value n invokes the generation of alternative loop code without peeling. Its precise meaning depends on the value of n:

(1) If n > 1 it means: if (cnt <= n), where cnt is the loop count, then execute loop code that does not have any iterations peeled, otherwise execute the loop code that is generated by default, which may or may not be peeled.

Alternative code is only generated for a loop that has a non-constant count and is peeled by default. Otherwise only one version of the loop is generated, which is not peeled if (cnt <= n), and which is peeled or not according to the default heuristics if (cnt > n).

(2) If n == 1 the meaning is the same as above, but the critical value n is calculated by the compiler using a cost-benefit analysis to estimate the minimum loop count for which peeling is profitable.
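The run-time choice between the two loop versions can be sketched as follows. The names are hypothetical, and estimated_min_count is a stand-in for the critical value the compiler derives from its cost-benefit analysis when n == 1; how that estimate is computed is not shown here.

```c
typedef enum { LOOP_UNPEELED, LOOP_DEFAULT } loop_version;

/* Hypothetical helper: n is the non-zero value of xflag 149, cnt is the
 * loop count, estimated_min_count is only consulted when n == 1. */
loop_version select_loop_version(long cnt, long n, long estimated_min_count)
{
    long critical = (n == 1) ? estimated_min_count : n;
    return (cnt <= critical) ? LOOP_UNPEELED : LOOP_DEFAULT;
}
```

LOOP_DEFAULT here means the code generated by the default heuristics, which may or may not be peeled.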

150:

For hammer and x8632 only, a non-zero value n invokes the generation of alternative loop code with non-temporal stores. Its precise meaning depends on the value of n:

(1) If n > 1 it means: if (cnt <= n), where cnt is the loop count, then execute loop code that does not perform non-temporal stores, otherwise execute loop code that performs non-temporal stores if possible. In the latter case the maximum number of non-temporal stores is determined in the usual way, namely it is given by the value of x[144] if it is non-zero, otherwise it is 1, 2 or 4 depending on the target.

Alternative loop code is only generated if a loop has a non-constant count and the compiler can generate non-temporal stores in it. Otherwise only one version of the loop is generated, which does not have non-temporal stores if (cnt <= n), and which has them if possible if (cnt > n).

This option overrides -Mx,39,0x200, which means “use non-temporal stores if possible”.

(2) If n == 1 it means: if a loop has a non-constant count and the compiler can generate non-temporal stores in it, then generate two versions of the loop, one with and one without non-temporal stores. The latter is executed if (cnt <= N), where N equals (flg.x[143] ? flg.x[143] : 200000)/B. B is the approximate total number of bytes loaded and stored in one iteration of the loop, so the value of N is loop-dependent.

If a loop has a constant count then the default heuristic is still used to decide whether to generate non-temporal stores, namely they are only generated if (cnt*B >= (flg.x[143] ? flg.x[143] : 200000)). By default the compiler does not generate non-temporal stores for loops with a non-constant count. Thus, -Mx,150,1 employs alternative code generation to apply the same (or a very similar) condition for using non-temporal stores to all loops, regardless of whether their loop count is constant.
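The thresholds for xflag 150 can be sketched in C. All function names are hypothetical, integer division is assumed for N, and bytes_per_iter (B in the text) is assumed to be positive.

```c
#include <stdbool.h>

/* Byte limit: flg.x[143] if non-zero, otherwise 200000. */
long nt_byte_limit(long x143)
{
    return x143 ? x143 : 200000;
}

/* Case n == 1: critical loop count N = limit / B (integer division assumed). */
long nt_critical_count(long x143, long bytes_per_iter)
{
    return nt_byte_limit(x143) / bytes_per_iter;
}

/* Run-time choice for a non-constant count: true selects the version
 * with non-temporal stores. x150 is the non-zero value of xflag 150. */
bool use_nt_version(long cnt, long x150, long x143, long bytes_per_iter)
{
    long critical = (x150 > 1) ? x150
                               : nt_critical_count(x143, bytes_per_iter);
    return cnt > critical;
}

/* Default compile-time heuristic for a constant count. */
bool const_count_uses_nt(long cnt, long x143, long bytes_per_iter)
{
    return cnt * bytes_per_iter >= nt_byte_limit(x143);
}
```

With the default limit and B = 16, N is 12500, so a loop of 12501 iterations would take the non-temporal version under -Mx,150,1.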

151:

(Temporary, for hammer and x8632 only): provides parameters and flags for controlling and tuning alternative code generation. See file hammer/src/llvect.c for full details.

0x4000000: Enable peel and shuffle transformation which is not enabled by default for non-GH.

152:

Provides a parameter n for loop splitting. If loop splitting is enabled and n > 0, then split the loop after every n’th statement where possible.

153:

Provides a parameter n for .p2align emission after a JMP instruction. If n != 0, it overrides .align directive emission driven by xflag 135. Writing n = 2^x + z with z < 2^x, we emit a .p2align x,,z directive. For example, -Mx,153,25 implies (25 = 2^4 + 9) .p2align 4,,9 directives.
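The decomposition of n into the two .p2align operands can be sketched as below. The function names are hypothetical; the sketch assumes n > 0 and simply takes x as the position of the highest set bit of n.

```c
#include <stdio.h>

/* Split n (> 0) into x and z such that n == 2^x + z and z < 2^x. */
void p2align_split(unsigned n, unsigned *x, unsigned *z)
{
    unsigned shift = 0;
    unsigned rest = n;
    while (rest > 1) {          /* find the highest set bit of n */
        rest >>= 1;
        shift++;
    }
    *x = shift;
    *z = n - (1u << shift);
}

/* Write the directive text into buf; returns what snprintf returns. */
int p2align_directive(char *buf, size_t len, unsigned n)
{
    unsigned x, z;
    p2align_split(n, &x, &z);
    return snprintf(buf, len, ".p2align %u,,%u", x, z);
}
```

Calling p2align_directive with n = 25 writes the text .p2align 4,,9 into the buffer, matching the example in the description.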

154:

Similar to xflag 153 but .p2align directives are generated in front of loop start.

155:

Change the default values for .p2align emitted for labels of non-innermost loops. The form of .p2align is

.p2align m,,n

where m is the number of low order bits of the address which are zero, and n is the maximum number of bytes that can be used to align the address. The default values for m and n are 4 and 7, respectively. Use, if nonzero, the value of the lower nibble of flg.x[155] as n. Use, if nonzero, the value of the next nibble of flg.x[155] as m.
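Selecting m and n from flg.x[155] can be sketched as follows. The function name is hypothetical; the defaults of 4 and 7 and the nibble layout come from the description above.

```c
/* Hypothetical helper: derive the .p2align m,,n operands from flg.x[155].
 * Lower nibble overrides n, next nibble overrides m; zero keeps the default. */
void p2align_params_155(unsigned x155, unsigned *m, unsigned *n)
{
    unsigned lo = x155 & 0xF;
    unsigned hi = (x155 >> 4) & 0xF;
    *m = hi ? hi : 4;   /* default m: 4 low-order zero bits */
    *n = lo ? lo : 7;   /* default n: at most 7 padding bytes */
}
```

A value of 0x5A would therefore request .p2align 5,,10, while 0x03 changes only n and yields .p2align 4,,3.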

156:

The value n in -Msmartalloc=huge:n

157:

Number of unrolls (# of loop bodies) of a loop with non-constant iteration count and multiple blocks.

158:

An upper bound to control the scale of code generation phase global data flow analysis. The value is (number_of_flow_graph_nodes * number_of_definitions * number_of_locations).

159:

The value is: (number_of_definitions for ALL_GLOBAL_LOCS * number_of_global_locations). Above this threshold, global locations are not tracked in the code generation phase global data flow.

160:

Used in intense.c for computing intensity

0x01: Display load/store information per loop

0x02: Display verifier messages. This flag will go away when verifier errors are rare.

161:

Used in ccffinfo to turn on informational messages

0x01: Inliner messages

0x02: Loop optimization messages

0x04: LRE messages

0x08: Intensity messages

0x10: IPA messages

0x20: Fusion messages

0x40: Vectorizer messages

0x80: OpenMP messages

0x100: Optimizer messages

0x200: Prefetch messages

0x400: Fortran-specific messages

0x800: Parallelization messages

0x1000: reserved

0x2000: PFO messages

0x4000: Accelerator messages

0x8000: Unified binary messages

0x10000: Additional information, usually used only for regression testing

0x20000: PCAST messages

0x100000: Use short tags
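
Because each message category above is a separate bit, several categories can be requested with one value and tested individually with XBIT. The C sketch below assumes a plausible definition of XBIT (a bitwise AND against the flag word), assumes that a flag value is OR-ed into the array, and uses hypothetical symbolic names for three of the bits; none of these are taken from the compiler sources.

```c
#include <assert.h>

/* Stand-in for the compiler's flag structure, sized generously for this sketch. */
struct { unsigned x[256]; } flg;

/* Assumed definition: non-zero when any bit of mask m is set in xflag n. */
#define XBIT(n, m) (flg.x[(n)] & (m))

/* Hypothetical names for three xflag 161 bits. */
enum {
    CCFF_INLINER    = 0x01,
    CCFF_LOOP_OPT   = 0x02,
    CCFF_VECTORIZER = 0x40
};

/* Record one "-x <number> <value>" occurrence (assumed to OR the value in). */
void set_xflag(int number, unsigned value)
{
    assert(number >= 0 && number < 256);
    flg.x[number] |= value;
}
```

Under these assumptions, setting xflag 161 to 0x42 turns on the loop optimization and vectorizer messages and leaves the inliner messages off.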

162:

Used in ccffinfo to turn on neg-informational messages. It uses the same bit mapping as above, for those that have negative information.

163:

0x01: Enable accelerator pragma/directive recognition

0x02: Just do the analysis, don’t generate the code

0x04: Do the analysis and generate the code, but don’t call the CUDA compiler

0x08: Do the analysis and generate the code and save the .gpu files

0x10: Save all the GPU files

0x20: don’t cache even with user cache directives

0x40: Generate __fmul_rn instead of ‘*’ instructions, to avoid coalescing multiply and add into FMA instructions, which gives different roundoff.

0x80: Disable double precision.

0x100: Enable shared-memory caching

0x200: Use fast math library

0x400: use 24-bit multiplies for subscripting

0x800: Generate ‘emulation mode’ code

0x1000: Generate strip-mined code on the host when private arrays are used.

0x2000: Original behavior: live-out induction variable marks a loop as invalid; now we usually just make it sequential on the device

0x4000: When compiling for a host version of the accelerator as well.

0x8000: For debugging, set unknown bounds of an array to 1:100

0x10000: test caching

0x20000: Save all the GPU files and load the modules from the .gpu files instead of inlining the GPU code.

0x40000: Keep .ptx file.

0x80000: Keep .bin file.

0x100000: Used only for testing

0x200000: Enable output from pgnvd

0x400000: Generate -ptxas -v output

0x800000: debug GPU code

0x1000000: Disable linear CG optimizations

0x2000000: Disable linear CG unrolling

0x4000000: for testing: insert call to __Test in the constructor

0x8000000: Disable dead-code after unrolling

0x10000000: for testing: change cudaRegisterFatBinary call to pgiRegisterFatBinary

0x20000000: Override default, unroll loops with calls

0x40000000: default is wait, don’t wait for each kernel to finish

0x80000000: always wait for each kernel to finish

164:

Hammer llvect and CG. (NOTE: continued from 135)

0x1: pragma save_all_gp_regs: At the entry and exit of a function, in addition to saving and restoring the used callee-saved GP and XMM registers (which is the normal action) also save and restore all non-callee-saved GP registers, except for any that are used to return the function result.

0x2: pragma save_all_regs: At the entry and exit of a function, in addition to saving and restoring the used callee-saved GP and XMM registers (which is the normal action) also save and restore all non-callee-saved GP and XMM registers, except for any that are used to return the function result.

0x4: pragma save_used_gp_regs: At the entry and exit of a function, in addition to saving and restoring the used callee-saved GP and XMM registers (which is the normal action) also save and restore used non-callee-saved GP registers, except for any that are used to return the function result.

0x8: pragma save_used_regs: At the entry and exit of a function, in addition to saving and restoring the used callee-saved GP and XMM registers (which is the normal action) also save and restore used non-callee-saved GP and XMM registers, except for any that are used to return the function result.

0x10: Disable the new method for reducing block pressures so that they are within limits.

0x20: Disable the new method for reducing loop pressures so that they are within limits.

0x40: Disable the new method for selecting register candidates to eliminate in order to reduce loop pressures to within limit.

0x80: Disable the improvements to the estimation of block execution frequencies.

0x100: Enable an experimental register allocator optimisation that attempts to restore eliminated register candidates at the end of the ‘limit resources’ phase.

0x200: Disable an enhancement to the ‘optimize_imul()’ function.

0x400: Disable a KIMV peephole optimisation.

0x800: Disable store re-scheduling, i.e. the cggenai.c optimisation of moving a store LILI forwards if it avoids the pre-emption of a load.

0x1000: Enable partial redundancy elimination on the linear ILIs.

0x2000: Enable copy propagation

0x4000: Do not perform CSE on QJSR ILIs.

0x8000: Disable the optimisation that inserts an xorps or xorpd instruction before cvtsi2ss, cvtsd2ss and cvtss2sd instructions whose dest != src in order to break merge dependences on the ‘dest’ register.

0x10000: Used by the f90 front end: enable float code in sfloat() in an accelerator region.

0x20000: Do not allow partial redundancy elimination to add new blocks after the lexically-last block in a function.

0x40000: Use the old heuristics for performing partial redundancy elimination.

0x80000: For AVX-512, enable the generation of calls to 64-byte-wide versions of the vector fastmath intrinsic functions, which take zmm register operands and return zmm register results. Without this x-flag such calls are replaced by two calls to the ymm version of the intrinsic. Currently the latter behaviour is enabled by default because zmm versions of the fastmath intrinsics are not available yet.

0x100000: For AVX, do not insert any ‘vzeroupper’ instructions.

0x200000: For AVX, only insert ‘vzeroupper’ instructions before calls to run-time library functions, not before ‘ret’ instructions or calls to user-defined functions as is done by default.

0x400000: Disable the vectorisation of loops containing ILIs that operate on the new representation of complex data-types.

0x800000: Use the new math naming scheme (not yet default), i.e.

__f<type><data type>_<name>_<vectlen><mask>
<type>      : f - fastmath (default)
              r - relaxed math (-Mfprelaxed ...)
              p - precise math (-Kieee)
<data type> : s - single precision
              d - double precision
              c - single precision complex
              z - double precision complex
<name>      : exp, log, log10, pow, powi, powk, sin, cos, tan, asin, acos,
              atan, sinh, cosh, tanh, atan2
<vectlen>   : 1 (scalar), 2, 4, 8, 16
<mask>      : m or null

Currently, the new method only applies to exp, log, pow, and atan on 64-bit Linux.

0x1000000: For AVX, replace 32-byte aligned load and store instructions by their unaligned equivalents. This is a ‘quick fix’ that was added to avoid 32-byte alignment errors in AVX code, but these errors have now been fixed so this quick fix should not be necessary.

0x2000000: For AVX, generate ‘vzeroupper’ instructions even if -Mvect=simd:128 is used. By default ‘vzeroupper’ instructions are not generated for -Mvect=simd:128.

0x4000000: Disable the generation of non-destructive syntax, i.e. (dest != src2), for AVX packed merge-type instructions.

0x8000000: Disable the following optimisations to the generation of prefetch instructions in vectorised and unrolled loops: (i) increasing the default prefetch distance if necessary to ensure that none of the prefetched data is required in the current iteration; (ii) issuing 2 prefetch instructions per array reference instead of one if the vector loop processes 128 bytes of data per iteration; and (iii) spreading out the prefetches across the first half of the loop body instead of generating them all at the start of the loop body.

0x10000000: Disable the improvements to the LILI peephole optimisations for integer constant folding and address code generation.

0x20000000: Halve the unroll factor that is used for AVX 256-bit vectorised loops (or to be more precise, inhibit the doubling of the unroll factor that is normally performed for such loops), provided that it is legal to do so, i.e. provided the loop still processes at least 32 bytes of data per vector iteration.

0x40000000: Disable the generation of scalar FMA instructions. (This only affects bulldozer code generation, since currently these instructions are only generated on bulldozer.)

0x80000000: Disable the vectorisation of loops that contain any of the following: (i) a reference to the loop induction variable as a primary in a non-address expression, e.g.: for ( i = 0; i < 10; i++ ) a[i] = i; (ii) a FLOAT, DFLOAT or DFLOATK ILI, i.e. an integer*4 to real*4, integer*4 to real*8 or integer*8 to real*8 type conversion.
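
The two loop shapes named under bit 0x80000000 look like this in C. The function names are hypothetical, and the mapping of the C int-to-double cast onto the integer*4 to real*8 conversion is an assumption made for illustration.

```c
#include <assert.h>

/* (i) The induction variable i is stored as a value, not only used as a subscript. */
void fill_with_index(int *a, int n)
{
    assert(a != 0 || n == 0);
    for (int i = 0; i < n; i++)
        a[i] = i;
}

/* (ii) An integer to floating-point conversion inside the loop body
 * (assumed to correspond to the integer*4 to real*8 case). */
void convert_to_double(const int *src, double *dst, int n)
{
    assert((src != 0 && dst != 0) || n == 0);
    for (int i = 0; i < n; i++)
        dst[i] = (double)src[i];
}
```

With the bit set, loops of either shape would be left scalar; without it they remain candidates for vectorisation.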

165:

Used temporarily in accelerator compiler to set thread-block size.

166:

Used for testing in the accelerator compiler to test selection criteria.

167:

Used for testing in the accelerator compiler to control automatic insertion of accelerator regions.

168:

For C/C++, control maximum size of auto-inlined function.

169:

0x01: For C/C++, we now normally remove compiler-created symbols from the symbol table hash lists after each function; this disables that.

0x02: TEMPORARY for 9.0-2… promote member inlined functions to extern weak symbols (as with member templated functions)

0x04: for C++ only. Can use with -Wc,--zc_eh_no_opt : do not remove zc_eh regions marked no_throw. --zc_eh_no_opt is the equivalent switch for pgcpp1.

0x08: for C++ only. Turn off removal of all the regions in a function if all landing pads are zero.

0x10: Turn off the special processing of lambdas in accelerated regions to copy them in on the data clause

170:

Used temporarily for debugging loop fusion

171:

0x01: Override FEATURE_SCALAR_SSE in x86 settings, set to zero

0x02: Override FEATURE_SSE in x86 settings, set to zero

0x04: Override FEATURE_SSE2 in x86 settings, set to zero

0x08: Override FEATURE_SSE3 in x86 settings, set to zero

0x10: Override FEATURE_SSE41 in x86 settings, set to zero

0x20: Override FEATURE_SSE42 in x86 settings, set to zero

0x40: Override FEATURE_SSE4A in x86 settings, set to zero

0x80: Override FEATURE_SSE5 in x86 settings, set to zero

0x100: Override FEATURE_MNI in x86 settings, set to zero

0x200: Override FEATURE_DAZ in x86 settings, set to zero

0x400: Override FEATURE_PREFER_MOVLPD in x86 settings, set to zero

0x800: Override FEATURE_USE_INC in x86 settings, set to zero

0x1000: Override FEATURE_USE_MOVAPD in x86 settings, set to zero

0x2000: Override FEATURE_MERGE_DEPENDENT in x86 settings, set to zero

0x4000: Override FEATURE_SCALAR_NONTEMP in x86 settings, set to zero

0x8000: Override FEATURE_SSEIMAX in x86 settings, set to zero

0x10000: Override FEATURE_MISALIGNEDSSE in x86 settings, set to zero

0x20000: Override FEATURE_LD_MOVUPD in x86 settings, set to zero

0x40000: Override FEATURE_ST_MOVUPD in x86 settings, set to zero

0x80000: Override FEATURE_UNROLL_16 in x86 settings, set to zero

0x100000: Override FEATURE_DOUBLE_UNROLL in x86 settings, set to zero

0x200000: Override FEATURE_PEEL_SHUFFLE in x86 settings, set to zero

0x400000: Override FEATURE_PREFETCHNTA in x86 settings, set to zero

0x800000: Override FEATURE_PDSHUF in x86 settings, set to zero

0x1000000: Override FEATURE_SSEPMAX in x86 settings, set to zero

0x2000000: Override FEATURE_GHLIBS in x86 settings, set to zero

0x4000000: Override FEATURE_SSEMISALN in x86 settings, set to zero

0x8000000: Override FEATURE_ABM in x86 settings, set to zero

0x10000000: Override FEATURE_AVX in x86 settings, set to zero

0x20000000: Override FEATURE_LRBNI in x86 settings, set to zero

0x40000000: Override FEATURE_FMA4 in x86 settings, set to zero

0x80000000: Override FEATURE_XOP in x86 settings, set to zero

172:

This uses the same bits as xflag 171, but the override sets the feature to 1 instead of zero. When the same bit is given in both xflags, the reset (xflag 171) overrides the set.
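
One way to picture the interaction of xflags 171 and 172 is the C sketch below. It assumes the feature word uses the same bit layout as the xflags and that "reset overrides set" means a bit present in both xflags ends up cleared; the function name and the layout are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { XFLAG_RESET = 171, XFLAG_SET = 172 };

/* Apply the set mask first and the reset mask last, so that reset wins
 * when the same bit appears in both xflags. */
uint32_t apply_feature_overrides(uint32_t features, const uint32_t *xflags)
{
    assert(xflags != 0);
    features |= xflags[XFLAG_SET];
    features &= ~xflags[XFLAG_RESET];
    return features;
}
```

For instance, with 0x14 in xflag 172 and 0x04 in xflag 171, a starting feature word of 0x01 becomes 0x11: bit 0x10 is forced on, and bit 0x04 stays off because the reset takes precedence.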

173:

(Temporary, for hammer only): a tuning parameter for common subexpression elimination (CSE). If flg.x[173] is non-zero then the maximum range over which a CSE can be applied on 64-bit targets at (opt >= 2) is flg.x[173], otherwise it is 170.

174:

Another throttle for auto-inliner for C/C++. This sets the maximum function size into which to auto-inline.

175:

Set max-reg-count for NVIDIA assembler

176:

Accelerator flags

0x01: Formerly: For NVIDIA, use the CUDA 2.3 toolkit and all that implies; no longer supported.

0x02: For NVIDIA, use the CUDA 3.0 toolkit

0x04: For NVIDIA, use the CUDA 3.1 toolkit

0x08: For NVIDIA, use the CUDA 3.2 toolkit

0x10: Use 32-bit mode on 64-bit systems

0x20: Use the more general upload/download routines to allow asynchronous uploads

0x40: Inverted: Don’t use updated general upload/download routines; this should become the default.

0x80: Don’t try to minimize expression insertions, use redundancy elimination instead

0x100: Generate only compute capability that we specify on the command line.

0x200: Generate compute capability 1.0.

0x400: Generate compute capability 1.1.

0x800: Generate compute capability 1.2.

0x1000: Generate compute capability 1.3.

0x2000: Generate compute capability 2.0.

0x4000: output block numbers in .gpu file

0x8000: Testing a new planner

0x10000: do generate cache memory loads, but don’t use the cache memory in the expressions. This is for debugging bad cache memory references.

0x20000: Disable loop test replacement

0x40000: don’t regularize the compare operations (which changes a<b ==> b>a, and so forth).

0x80000: use the old loop unroller

0x100000: Mark induction variables live only if they are used.

0x200000: enable expression reassociation

0x400000: do generate register loads, but don’t use the register in the expressions. This is for debugging bad cache memory references.

0x800000: For testing, generate common blocks as a single block of bytes

0x1000000: when reassociating, invert the loop order

0x2000000: add induction increment at top of loop; default is at the bottom

0x4000000: disable some expression floating

0x8000000: used in accel.c to do lifetime analysis on whole accelerator region

0x10000000: use fdiv_rn instead of divide

0x20000000: Disable scalar kernels

0x40000000: use new paramset struct

0x80000000: Old method for placing fast-path tests, which tends to put them farther out but allows for fewer fast-path tests.

177:

More accelerator optimizer flags

0x01: Enable initial forward substitution

0x02: Enable initial expression reassociation

0x04: Enable induction variable substitution

0x08: Enable loop unrolling

0x10: Enable forward substitution after unrolling

0x20: Enable reassociation after substitution

0x40: Enable final forward substitution

0x80: Enable available expression replacement

0x100: Enable distribution of multiplication over addition when reassociating expressions.

0x200: Only float available expressions out of inner loops, or loops which contain unrolled code.

0x400: Do remove partially available expressions that are cheap, even try to float them out of a loop.

0x800: For distribution, only distribute multiplication over addition even when it’s not a constant times addition of a constant plus another value.

0x1000: Do generate fastpath even if there aren’t enough fastpath tests to warrant it.

0x2000: Count the maximum number of live variables we have in the program.

0x4000: Don’t regularize comparisons (in accelerator mode), put threadIdx.x on one side of the compare, everything else on the other side

0x8000: Regularize comparisons (in cuda fortran mode), put threadIdx.x on one side of the compare, everything else on the other side

0x10000: Don’t make induction variables be protected symbols.

0x20000: Do find positive, zero, negative expressions.

0x40000: Don’t replace positive, zero, negative comparisons with constant, when possible

0x80000: If unswitching

0x100000: Disable trivial PLOOP optimizations

0x200000: Don’t insert syncthreads calls for vector synchronization; this limits vector length == 32 for vector/nonparallel loops

0x400000: Disable scalar kernels

0x800000: testing fastpath

0x1000000: Don’t create FSINCOS to eliminate redundant sin/cos operations

0x2000000: Late basic-block-local redundancy elimination

0x4000000: Enable generation of ‘fast-path’

0x8000000: Don’t create a ‘temp’ for an address computation

0x10000000: Don’t multiply by the constant tile size, use the blockdim variable

0x20000000: Fastpath for arefs

0x40000000: disable ‘protected’ sequential loops, which puts the strip counters in shared memory

0x80000000: Split cache loads into register loads followed by cache stores

178:

0x01: Override FEATURE_FMA3 in x86 settings, set to zero

0x02: Override FEATURE_MULTI_ACCUM in x86 settings, set to zero

0x04: Override FEATURE_SIMD128 in x86 settings, set to zero

0x08: Override FEATURE_NOPREFETCH in x86 settings, set to zero

0x10: Override FEATURE_ALIGNLOOP4 in x86 settings, set to zero

0x20: Override FEATURE_ALIGNLOOP8 in x86 settings, set to zero

0x40: Override FEATURE_ALIGNLOOP16 in x86 settings, set to zero

0x80: Override FEATURE_ALIGNLOOP32 in x86 settings, set to zero

0x100: Override FEATURE_LD_VMOVUPD in x86 settings, set to zero

0x200: Override FEATURE_ST_VMOVUPD in x86 settings, set to zero

0x400: Override FEATURE_AVX2 in x86 settings, set to zero

0x800: Override FEATURE_AVX512F in x86 settings, set to zero

0x1000: Override ACC_FEATURE_OCLOFFSET in accel settings, set to zero

0x2000: Override FEATURE_AVX512VL in x86 settings, set to zero

179:

This uses the same bits as xflag 178, but the override sets the feature to 1 instead of zero. When the same bit is given in both xflags, the reset (xflag 178) overrides the set.

180:

0x01: Use OpenCL compiler to build accelerator output

0x02: one of -acc=required or -acc=norequired was set

0x04: -acc=required

0x08: for Fermi cards (compute capability 2.0), disable L1 caching

0x10: Reclaimed flag: now used to disable flush-to-zero mode.

0x20: Enable passing array sections to reflected arguments.

0x40: Print user variable names in the .gpu file.

0x80: Enable flush-to-zero mode.

0x100: Extract device routines to an accelerator inline library (for development purposes so far)

0x200: Use old multiple paramset/launch routines instead of single routine to launch kernels

0x400: Enable OpenACC parsing

0x800: Save the fatbinary file.

0x1000: Don’t parse REFLECTED directive

0x2000: Don’t parse MIRROR directive

0x4000: Don’t parse LOCAL directive

0x8000: Don’t parse COPY/COPYIN/COPYOUT directive

0x10000: For OpenCL, use any device.

0x20000: Don’t pass unused arguments.

0x40000: Don’t remove late redundant operations in a basic block.

0x80000: For debugging generated code.

0x100000: Insert implicit copyin/copyout of the whole array for any REFLECTED arrays used.

0x200000: Insert implicit copyin/copyout of the whole array for any MIRROR arrays used.

0x400000: Insert implicit copyin/copyout of the whole array for any LOCAL arrays used.

0x800000: Insert implicit copyin/copyout of the whole array for any COPY/COPYIN/COPYOUT arrays used.

0x1000000: Don’t generate unrolled reduction code for accelerator reductions

0x2000000: don’t add __threadfence_block() calls after each synchronous update in unrolled reduction code

0x4000000: Use dataon/off/up/down instead of upload/download/alloc/etc.

0x8000000: Don’t generate acclin temps for TY_PTR data (workaround for another problem)

0x10000000: Insert implicit update device of all reflected arrays at the implicit data region top, and update host at the bottom.

0x20000000: add -verbose to pgocld call

0x40000000: for compute capability 2.0+, enable L1+L2 caching

0x80000000: Don’t change IL_AADD to IL_IADD or IL_KADD before acclinopt does its work; improves redundancy elimination

181:

Enables the 3D ‘mgrid’ tiling, i.e., tile at most the outer two loops in a loop nest of depth 3:

The lower halfword of -x 181 must be non-zero and is the tilesize.
If the upper halfword of -x 181 is non-zero, the outer loop is
tiled and this value is its tilesize; the outer loop is not
tiled if it’s a parallel loop.
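
The packing of the two tile sizes into one value can be sketched in C as follows. The sketch assumes a halfword is 16 bits of a 32-bit flag value; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct mgrid_tiling {
    unsigned tilesize;        /* lower halfword; must be non-zero */
    unsigned outer_tilesize;  /* upper halfword; 0 means the outer loop is not tiled */
};

/* Split the xflag 181 value into its two halfwords. */
struct mgrid_tiling decode_xflag_181(uint32_t value)
{
    struct mgrid_tiling t;
    t.tilesize = value & 0xFFFFu;
    t.outer_tilesize = (value >> 16) & 0xFFFFu;
    assert(t.tilesize != 0);  /* the text requires a non-zero lower halfword */
    return t;
}
```

Under this assumption, a value of 0x00080020 requests a tile size of 32 and an outer-loop tile size of 8, while a value of 64 tiles with size 64 and leaves the outer loop untiled.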

182:

OpenCL modifications

183:

LLVM modifications

0x01: Do not attempt to replace calls to our run-time for certain ‘builtins’ with llvm instructions

0x02: Replace VLDU/VSTU of vect3 dtypes with bcopy calls - temporary front-end work-around for bugs in llc with unaligned vect3 references.

0x04: Print the data layout for intended target.

0x08: Copy-in all formal arguments into the function’s stack – appears that llc has a problem with generating dwarf location expressions for arguments which are passed on the stack

0x10: (Fortran only) Enable LLVM inlining by not marking all routines with the LLVM attribute ‘noinline’.

0x20: Disable cse load optimization and dead instr removal in LLVM bridge

0x40: Enable scheduling of llvm instructions for interesting blocks. This opt is only performed if cse load optimization is enabled.

0x80: Enable experimental enhanced conflict detection in LLVM bridge

0x100: Enable scheduling of llvm instructions for all blocks. This opt is only performed if cse load optimization is enabled and scheduling is enabled.

0x200: Use ILI_ALT when available.

0x400: Enable block level optimization (peep-hole)

0x800: Dump some extra information as comments of LLVM instructions, available only in DEBUG mode

0x1000: Temporary flag used by compiler back end, enable stb processing if set. Should be removed once stb processing is working.

0x2000: Disable openmp parallel region outlined function through kmpc_fork_call.

0x4000: Disable workaround to mark x86 dp vector math calls as not varargs for Fortran

0x8000: Disable reciprocal multiply undo

0x10000: Enable the use of Newton’s approximation for square root

0x20000: Disable generation of TBAA metadata in LLVM output

0x40000: Disable GEP folding

0x80000: For references to uplevel PAR variables in the outlined functions for OpenMP regions, emit indirect (NT_IND) nmes. Otherwise (the default), use NT_VAR; with NT_VAR, we have ‘precise’ info for flow analysis, subscripting, etc. (i.e., this is the same as NT_IND vs NT_VAR for cray pointers).

0x100000: Use intermediate temp variables in the call to __kmpc_for_…_init routines

0x200000: Turn off ENHANCED_CSE_OPT in cgmain for LLVM

0x400000: Disable promotion of INTEGER*2 in called function on X86-64

0x800000: Allow loop distribution on POWER even when routine is outlined

0x1000000: Allow new fast math power vector routines when real base elements are different size from integer power elements

0x2000000: Switch definition of “long double” on Power from “double double” to __float128

0x4000000: Disable generation of !llvm.loop metadata

0x8000000: (C/C++ only) Disable the LLVM inliner by marking all routines with the LLVM attribute ‘noinline’.

0x10000000: Enable arithmetic widening on address arithmetic.

0x20000000: Put constants in non read-only memories.

0x40000000: Emit DWARF name for Fortran COMMON blocks.

184:

ARM modifications

0x01: Generate the equivalent of ‘float-abi=hard’ where fp values are passed according to the vfp register conventions.

0x02: Specify in datalayout for ARM target that 8-bits/16-bits are native types for the target

185:

Accelerator OpenCL output flags

0x01: Accelerator OpenCL output for NVIDIA

0x02: Accelerator OpenCL output for Platform 2012

0x04: Accelerator OpenCL output for ATI

0x08: Accelerator OpenCL output for X86

0x10: Accelerator OpenCL output for Generic target (anything)

0x20: Accelerator OpenCL output for Generic host

0x40: Accelerator OpenCL output for Generic GPU

186:

More Accelerator flags

0x01: For NVIDIA, use the CUDA 4.0 toolkit

0x02: For NVIDIA, use the CUDA 4.1 toolkit

0x04: For NVIDIA, use the CUDA 4.1 or 4.2 toolkit with old CG

0x10: Failure mitigation mode is on by default; this turns it off

0x20: Disentangle data regions from compute regions. Every compute region must re-determine whether the data is present.

0x80: Enable warning messages when users attempt to use PGI Accelerator Directives that are being deprecated.

0x1000: Remove extra ‘protected’ symbol assignments that are only used in the same basic block.

0x2000: Forward substitution only for integer symbols.

0x4000: debug use

0x8000: for IEEE NOTxx comparisons, instead of generating the if(!(a>b)), generate if(a<=b).

0x10000: Generate smallest compute capability 1.x that is supported.

0x20000: Generate smallest compute capability 2.x that is supported.

0x40000: Array subscript range test in generated device code.

0x80000: Enable OpenACC interpretation of directives

0x100000: ACCSTRICT: Strict compliance with OpenACC syntax; issue warnings for any non-OpenACC accelerator directive

0x200000: ACCVERYSTRICT: Stricter compliance with OpenACC syntax; issue errors for any non-OpenACC accelerator directive

0x400000: Reorganize array calculations like we’re doing subscript range tests, but don’t insert the array checks.

0x800000: Combine redundant conditionals

0x1000000: Allow non-tightly nested vector/worker loops

0x2000000: enable or disable fmaopt

0x4000000: Remove unreachable code

0x8000000: change the way conditionals are generated testing for cache loads

0x10000000: Insert threadfence_block for cache-line sharing before the syncthreads call

0x20000000: Implicit ‘present’ on all data clauses

0x40000000: Add data region enter/exit calls.

0x80000000: Create local shadows of all argument symbols
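
Bit 0x8000 of xflag 186 matters because the two comparison forms are not equivalent under IEEE arithmetic when an operand is a NaN. The C sketch below, with hypothetical function names, shows the difference.

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>

/* Negated form: true when a > b is false, which includes the unordered (NaN) case. */
bool not_greater(double a, double b) { return !(a > b); }

/* Direct form: false whenever either operand is a NaN. */
bool less_or_equal(double a, double b) { return a <= b; }

/* Self-check: the two forms agree whenever neither operand is a NaN. */
void check_ordered_agreement(double a, double b)
{
    assert(isnan(a) || isnan(b) || not_greater(a, b) == less_or_equal(a, b));
}
```

For ordinary values the two forms give the same answer. With a NaN operand, !(a>b) is true while a<=b is false, so the flag trades IEEE-exact behaviour for the simpler comparison.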

187:

0x01: Enable store-forwarding

0x02: Strict kernels gang scheduling; don’t make a loop ‘gang parallel’ if it was specified as only worker or vector.

0x10: convert 1/sqrt(x) to rsqrt(x) (single and double)

0x20: convert 1/(x*sqrt(x)) to t=rsqrt(x), t*t*t (single and double)

0x40: Extend the above two to include y/sqrt(x) and y/(x*sqrt(x))

0x100: “Protect” symbols that hold descriptor values

0x200: “Protect” symbols that hold descriptor values even in CUDA Fortran

0x400: earliest useful placement of a computation

0x800: Combine conditionals again after unrolling

0x1000: Debug output for comparing outputs.

0x10000: Acclinopt: compute earliest computation points at edges.

0x20000: Disable finding single-entry/single-exit regions in acclinopt.

0x100000: Override default for GPU code: llvm version 3.5

0x200000: Override default for GPU code: llvm version 3.6

0x400000: Override default for GPU code: llvm version 3.7

0x800000: Override default for GPU code: llvm version 3.8

0x1000000: Override default for GPU code: llvm version 3.9

0x2000000: Override default for GPU code: llvm version 4.0

0x4000000: Override default for GPU code: llvm version 5.0

188:

The default OpenACC vector length

189:

More Accelerator flags

0x01: Generate compute capability 3.0 (Kepler-1)

0x02: Generate compute capability 3.5 (Kepler-2)

0x04: Generate both compute capability 3.0 and 3.5.

0x08: For NVIDIA, use the CUDA 4.2 toolkit

0x10: Generate llvm LL file, using llc llvm-ptx compiler

0x20: For NVIDIA, use the CUDA 5.0 toolkit

0x40: Don’t generate __ldg() refs to INTENT(IN) with compute capability 3.5 and up.

0x80: Generate calls to __pgiSetupArgument and __pgiLaunch instead of cudaSetupArgument and cudaLaunch, so we can intercept the calls for debugging.

0x100: generate declarations of all struct datatypes, used or not.

0x200: Generate cache loads by recreating the expression tree instead of using the memref info.

0x400: Generate AMD Trinity APU code

0x800: Generate AMD Tahiti GPU code

0x1000: Debugging: use VALUE for index variable names

0x2000: Multi-target accelerator code

0x4000: Use offsets to call OpenCL kernels.

0x8000: Generate relocatable device code, link at link time.

0x10000: Add -restrict to the build line.

0x20000: Add __restrict to pointer arguments to a kernel.

0x40000: Special code generation mode to call device-specific runtime routines, with no begin/end calls.

0x80000: generate __align__ on the common block declaration whether ‘extern’ or not

0x100000: don’t run the ‘demote’ pass in acclinopt

0x200000: don’t run the ‘lin_peep’ pass in acclinopt

0x400000: old way to load register data, for comparison only

0x800000: Add __restrict to pointer arguments to an accelerator kernel.

0x1000000: Use modified dataon/dataoff routines with baseoffset argument

0x2000000: In acclinopt, do demote IL_AADD, IL_KMUL operands; useful for 32-bit targets.

0x4000000: Disable autoparallelization of loops in acc parallel constructs.

0x8000000: Disable autoscoping and automatic detection of reductions in loops.

0x10000000: Implicit ‘present_or_’ on all data clauses

0x20000000: Don’t implicitly collapse outer parallel loops

0x40000000:

0x80000000: only two nested gang loops

190:

Extractor/Inliner (overflow).

0x01: Don’t perform the optimization of replacing a struct/union formal with its actual argument. When the optimization occurs, the actual argument is not copied into an inliner-generated temporary.

0x02: Set LVAL for dummy variables of a PST and LOC ilms in the extractor; setting LVAL is no longer the default given the front-ends are now tracking lval/rval.

0x04: Inhibit replacing CONST pointer formals with the actual argument.

0x08: Don’t attempt to use the number of switch cases to throttle inlining

0x10: Turn on bottom-up autoinlining when IPA inlining is used.

191:

Temporary flags

0x01: Turn on C++ prototype implementation of the gnu visibility attribute “hidden”

0x02: Enable “alwaysinline” attribute for a function, using “forceinline” pragma

0x04: Enable vectorize always loop directive

192:

More Accelerator flags

0x01: Accelerator: Move planned strip-mine ploops outwards

0x02: Accelerator: Move planned gang loops outwards

0x04: Accelerator: Enable user-written planner

0x08: Accelerator: save the planner files

0x10: Accelerator: Always set blockDim, even if the block dim is constant

0x20: Accelerator: use nightly build of the next cuda release

0x40: Add const __restrict to pointer arguments to a kernel.

0x80: Add const __restrict to pointer arguments to an accelerator kernel.

0x100: GPS - gang private shared; inverted: gang private arrays don’t all get put into shared memory

0x200: WPS - worker private shared; worker private arrays all get put into shared memory. This is not yet implemented.

0x400: VPS - vector private shared; inverted: vector private arrays do not get put into local memory

0x800: Generate a different plan and kernel for each compute capability

0x1000: for CUDA output, disable insertion of __ldg() for global memory loads that are read-only

0x2000: Defer private array allocation

0x4000: Optimize vector0/worker0 sections of code

0x8000: Generate AMD Barts GPU code

0x10000: Generate AMD Cayman GPU code

0x20000: Generate AMD Pitcairn GPU code

0x40000: Generate AMD Bonaire GPU code

0x80000: Generate AMD Hawaii GPU code

0x100000: Special OpenCL code for reductions

0x200000: Allow variable-sized private arrays in the cache

0x400000: Accelerator: Don’t move planned vector loops outwards

0x800000: Treat all kernel launches as asynchronous.

0x1000000: Do call mark_array_subscripts in reassociate in acclinopt so array subscript multiply-by-array-size is not reassociated

0x2000000: Implicitly mark all routines as ‘acc routine’

0x4000000: Reserved to extend above flag

0x8000000: Reserved to extend above flag

0x10000000: Reserved to extend above flag

0x20000000: enable auto-loop-collapse without collapse directive

0x40000000: enable lineinfo generation for accelerator target

0x80000000: Code-sinking: allow non-tightly nested loops to be tiled.

193:

Used to set an unroll size and count limit in acclinopt

194:

More Accelerator flags

0x01: Generate AMD Capeverde GPU code

0x02: Generate AMD Spectre GPU code

0x04:

0x1000: Treat all parallel and kernels regions like ‘acc scalar region’

0x2000: Run on accelerator and host, and compare results.

0x4000: don’t print out ‘const’ (inverted)

0x8000: Default(none) implied on all OpenACC compute regions.

0x10000: gang-vector mode, ignore ‘worker’ dimension

0x20000: gang-worker mode, ignore ‘vector’ dimension

0x40000: Generate alternate code for reductions

0x80000: Generate multiple versions for different compute capabilities

0x100000: Maxwell compute capability 5.x

0x200000: Maxwell compute capability 5.0

0x400000: Maxwell compute capability 5.2

0x800000: For testing: allow unknown NME types in acc references

0x1000000: Allow expressions in vector() and vector_length() clauses

0x2000000: A loop with a user annotation of ‘vector’ implicitly scheduled as ‘shortloop’

0x4000000: don’t generate vector loop tests or strip loop branches if we know the trip count is less than the vector length

0x8000000: For AMD GPU, keep the OpenCL or SPIR source, don’t compile it.

0x10000000: Default(present) implied on all OpenACC compute regions.

0x20000000: Allow unknown-sized arrays, essentially assuming they will be present

0x40000000: Recognize libm functions even if we don’t know they are libm

0x80000000: For CUDA output, generate –devdebug flag to generate dwarf for cuda C

195:

reserved

196:

Threshold value for conditional vectorization short circuiting.

197:

For NVIDIA code generation, the lower 12 bits set the __launch_bounds__ 2nd argument value. The next 8 bits are masked into the blockIdx value to randomize block assignment.
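
The bit layout described for flag 197 can be sketched as follows. This is a minimal illustration only: the helper names are hypothetical and do not come from the compiler sources, and the field positions are taken directly from the description above (12 low bits, then the next 8 bits).

```c
#include <stdint.h>

/* Hypothetical helpers: split a flg.x[197] value into its two fields.
 * The names are illustrative; only the bit layout comes from the text. */

/* Lower 12 bits: second argument of __launch_bounds__. */
uint32_t x197_launch_bounds_arg2(uint32_t x197)
{
    return x197 & 0xFFFu;
}

/* Next 8 bits (bits 12..19): value masked into blockIdx. */
uint32_t x197_block_mask(uint32_t x197)
{
    return (x197 >> 12) & 0xFFu;
}
```

Under this reading, a value of 0x3004 would decode to a __launch_bounds__ second argument of 4 and a block mask of 3.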

198:

More Accelerator flags

0x01: Accelerator scalar replacement.

0x02: Do generate ‘dev only’ Minfo messages even in the release.

0x04: Don’t combine list-oriented Minfo messages

0x08: acclinopt: check for uninitialized values

0x10: when compiling for NVIDIA, set PTXOPT level to zero

0x20: when compiling for NVIDIA, set PTXOPT level to one

0x40: when compiling for NVIDIA, set PTXOPT level to two

0x7f: when compiling for NVIDIA, set PTXOPT level to three

0x100: Compile with -ta=tesla:managed, use managed memory interface

0x1000: acclinopt: check for uninitialized values

0x2000: acclinopt: check for uninitialized values and give errors if there are any

0x4000: acclinopt: disable the wide load/store global memory optimization

0x8000: Use the open source llc GPU back end instead of libnvvm from the CUDA team.

0x10000: test multicore planner

0x20000: reserved

0x40000: Disable the insertion of begin/end labels for lexical scopes. These scope labels are used to privatize arrays and structs that are local to an accelerator region. See -Mnoautoprivatize.

0x80000: Don’t depend on warp-synchronous execution, insert syncs even with vector(32).

0x100000: For -ta=multicore, don’t actually go parallel but do everything else for the multicore code generation.

0x200000: For -ta=multicore, call __test_malloc and __test_free instead of malloc and free, so we can intercept the calls for debugging.

0x400000: Compile with -ta=tesla:pin, allocate using pinned memory

0x800000: Experimenting with statement unrolling

0x1000000: Experimenting with changing placement of synchronizations for calls to vector routines.

0x2000000: global vector-32 mode; GPU code uses vector length of 32 for NVIDIA

0x4000000: don’t go into vector-32 mode; GPU code will not restrict to vector length of 32 for NVIDIA even with ‘acc routine’ calls

0x8000000: Enable -ta=tesla:safecache, allowing variable-sized array section in cache directives.

0x10000000: Print out line numbers for all ccff messages.

0x20000000: In acclinopt, allow some builtin function calls to be marked redundant.

0x40000000: In acclin, disable printing of the lilix index for each statement in cuda C output. This makes it easier to compare two outputs from slightly different versions.

0x80000000: Enable unified memory support for OpenACC

199:

A non-zero value enables -Mvect=fastfuse. This flag must be passed only when -fast is enabled. A value other than 0 gives the maximum number of blocks for which -Mvect=fastfuse is enabled. The default value is 10.

200:

how many levels of inlining to do from leaves for bottom-up auto-inlining

201:

Enable/Disable Accelerator optimizations

0x04: Disable FMA generation

0x08: Enable FMA generation

0x10: Disable vector sync optimization - add vector syncs after every worker/vector loop

0x100: Enable gang-vector mode globally

0x200: Enable gang-vector mode only with gang/worker/vector routines or calls to them

0x400: Disable gang-vector mode entirely

0x800: Enable gang-worker mode globally

0x1000: Enable gang-worker mode only with gang/worker/vector routines or calls to them

0x2000: Disable gang-worker mode entirely

0x4000: Enable vector-32 mode for NVIDIA GPUs globally

0x8000: Enable vector-32 mode only with gang/worker/vector routines or calls to them

0x10000: Disable vector-32 entirely

0x20000:

0x40000: Set accelerator CG loop index variables as ‘noforward’

0x80000: Print array assignments using pointer arithmetic always.

0x100000: Don’t demote address KMUL operations

0x200000: in LLVM output, don’t output the instruction info (lilix index, opcode)

0x400000:

0x800000: If the number of ACIV induction variables is too large, kill off all but the innermost loop ones.

0x1000000: Only find ACIV induction variables for innermost loops. reserved

0x2000000: Assume that complex arrays on GPU are aligned as follows: complex:8-byte dcmplx:16-byte
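
Several groups of bits in flag 201 form a three-way choice (globally, only with routines, or disabled entirely). The sketch below decodes the gang-vector group (0x100, 0x200, 0x400) into one mode. The enum, the function name, and in particular the precedence used when more than one bit is set are assumptions made for illustration; the text above does not say how the compiler resolves conflicting bits.

```c
#include <stdint.h>

/* Illustrative tri-state (plus "nothing requested") for one mode group. */
enum acc_mode { MODE_DEFAULT, MODE_GLOBAL, MODE_ROUTINES_ONLY, MODE_DISABLED };

/* Hypothetical decoder for the gang-vector bits of flg.x[201].
 * ASSUMPTION: if several bits are set, "disable" is checked first,
 * then "globally", then "routines only". */
enum acc_mode gang_vector_mode(uint32_t x201)
{
    if (x201 & 0x400u) return MODE_DISABLED;
    if (x201 & 0x100u) return MODE_GLOBAL;
    if (x201 & 0x200u) return MODE_ROUTINES_ONLY;
    return MODE_DEFAULT;
}
```

The same pattern would apply to the gang-worker group (0x800, 0x1000, 0x2000) and the vector-32 group (0x4000, 0x8000, 0x10000) with the masks swapped.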

202:

Set number of bigbuffers for multi-buffer memory management for AMD GPU. (moved from 250)

203:

Set the default vector_length for OpenACC scheduling for NVIDIA

204:

Set the default num_workers for OpenACC scheduling for NVIDIA

205:

Set the default vector_length for OpenACC scheduling for AMD

206:

Set the default num_workers for OpenACC scheduling for AMD

207:

Set the default vector_length for OpenACC scheduling for Generic OpenCL

208:

Set the default num_workers for OpenACC scheduling for Generic OpenCL

209:

0x01:

0x02:

0x04: Restore old IL_SMOVE usage, don’t expand into IL_SMOVEI/IL_SMOVES tree

210:

OpenACC Multicore behavior

0x01: Old behavior for collapsed gang loops

0x02: Remove unused induction variable assignments

0x04: don’t optimize away unused private variable assignments

0x08: enable tracing with -ta=multicore.

0x10: Enable master-thread task distribution model.

0x20: Generate “guided” schedule by default for OpenACC multicore with the LLVM backend.

211:

Enable various accelerator CG optimizations

0x01: Removing unreachable code. (unreachable)

0x02: Rearranging threadidx compares. (threadidxcompares)

0x04: Unswitching. (unswitching)

0x08: Simplify threadidx compares. (simplifycompares)

0x10: Loop unrolling.

0x20: Combine conditionals.

0x40: Find fused mul-add opportunities.

0x80: Redundancy elimination.

0x100: Local store forwarding.

212:

Disable various accelerator CG optimizations. The bits here are the same as for flag 211. Disable overrides Enable.
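
The "Disable overrides Enable" rule can be expressed as a single mask operation. The following is a minimal sketch with a hypothetical helper name; it assumes the two flag words are the only inputs and ignores any defaults the compiler may apply on top of them.

```c
#include <stdint.h>

/* Hypothetical helper: combine the enable word (flg.x[211]) with the
 * disable word (flg.x[212]). A bit set in the disable word always wins. */
uint32_t effective_cg_opts(uint32_t enable_bits, uint32_t disable_bits)
{
    return enable_bits & ~disable_bits;
}
```

For instance, if flag 211 requests unreachable-code removal (0x01) and loop unrolling (0x10) while flag 212 sets 0x10, only 0x01 remains in effect.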

213:

Enable O2 accelerator CG optimizations

0x01: Initial forward substitution (forward1)

0x02: Find expressions that are only positive or negative, and optimize away some branches. (findsign)

0x04: Combine conditionals. (combineconditionals)

0x08: Initial reassociation. (reassociate1, reassociatedead)

0x10: Induction variable recognition and replacement. (induct)

0x20: Safe expression forward substitution. (safeforward)

0x40: Reassociate after safe expression forward substitution (setlevel and reassociatesafe)

0x80: peephole optimizations. (peephole)

0x100: Mark cheap expressions

0x200: Forward substitution after marking cheap expressions. (forward2)

0x400: Local redundancy elimination. (localredund)

0x800: Local forward substitution. (localforward)

0x1000: Protect symbols holding descriptor values in OpenACC

0x2000: Protect symbols holding descriptor values in CUDA Fortran

0x4000: Second peephole optimization pass

0x8000: Wide load/store global memory optimization

0x10000: Use LDG instruction for CUDA and cc35+

0x20000: Interchange vector ploops outwards

0x40000: Scalar replacement.

0x80000: Late expression deassociation: turn 8*n + 8*m into 8*(n+m)

0x100000: Conditional removal based on min/max value determination

0x200000: Enable induction variable analysis across the memsize*subscript multiply for an array reference.

0x400000: When combining induction variables to families with the same step, do or don’t (default) limit to those with constant-offsets from the base value.

0x800000: Mark IL_IKMV as induction variable

0x1000000: Optimize branches based on finding min/max values of variables and expressions.

0x2000000: When combining induction variables to families with the same step, do or don’t (default) limit to those with constant-offsets and maybe a constant multiple of threadIdx, blockIdx, blockDim or gridDim from the base value.

214:

Disable O2 accelerator CG optimizations. The bits here are the same as for flag 213. Disable overrides Enable.

215:

reserved

216:

FLANG flags

0x01: The -ffast-math command-line option is present.

0x02: Disable fast math attribute on floating-point addition.

0x04: Disable fast math attribute on floating-point division.

0x08: Add nsz attribute to LLVM arithmetic operations.

0x10: Add reassoc attribute to LLVM arithmetic operations.

0x1000: The -ffp-contract=[fast|on] command-line option is present.

217:

POWER Modifications

0x01: Enable auto initialization of stack memory to 64bit signaling NaNs.

218:

reserved

220:

Enable tuning code for -Minline.

221:

Sets the maximum size of a caller function into which Minline will inline.

222:

Functions whose size is smaller than this value will get inlined by Minline.
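
How flags 221 and 222 might combine into one size check is sketched below. This is an illustration only: the function name is hypothetical, the units of "size" are not specified in this document, and treating the caller limit as inclusive is an assumption.

```c
#include <stdbool.h>

/* Hypothetical size check combining flg.x[221] and flg.x[222].
 * ASSUMPTIONS: both sizes use the same (unspecified) units as the flag
 * values, and the caller limit from flag 221 is inclusive. */
bool minline_size_ok(int caller_size, int callee_size,
                     int x221_max_caller, int x222_callee_limit)
{
    /* Caller must not exceed the limit; callee must be strictly smaller. */
    return caller_size <= x221_max_caller && callee_size < x222_callee_limit;
}
```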

232:

OpenMP Accelerator Model flags for Flang compiler

0x01: Enable outlining for device functions. The compiler creates an extra function for teams and parallel directives on the device.

0x02: Disable symbol replacer while saving ILM of outlined function. It is enabled normally for OpenMP GPU offload.

0x04: Disable skipping openmp cpu reduction code generation. We normally skip it since gpu has different implementation.

0x08: Enable debug information for GPU code. Experimental

0x10: Init libomptarget library in the main instead of constructor.

0x20: Enable codegen for push loop trip count for libomptarget runtime.

0x40: Enable codegen for spmd kernel init.

233:

reserved

234:

vector vectorlength identifier

0x01: vector vectorlength(number)

0x02: vector vectorlength(fixed)

0x04: vector vectorlength(scalable)

235:

Provides a parameter n for vector vectorlength(number)

248:

OpenMP Threadprivate TLS/TPvector implementation control.

249:

LLVM version number n = flg.x[249], decoded as Major = n / 10 and Minor = n - (Major * 10).
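
The decoding formula for flag 249 translates directly into integer arithmetic. The helper names below are hypothetical; only the formula itself comes from the description above.

```c
/* Hypothetical helpers: decode an flg.x[249] value n into the
 * major and minor parts of the LLVM version number. */
int llvm_major(int n)
{
    return n / 10;                    /* integer division */
}

int llvm_minor(int n)
{
    return n - llvm_major(n) * 10;    /* remainder after removing major */
}
```

With this encoding, n = 39 would decode to version 3.9 and n = 70 to version 7.0.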

250:

Set number of bigbuffers for multi-buffer memory management for AMD GPU. (moved to 202)

251:

(NOT available - check declaration in global.h for flg.x[], all compilers)