|STAT Statistical Data Analysis
Free Data Analysis Programs for UNIX and DOS
by Gary Perlman
dm is a data manipulating program with many operators for manipulating columnated files of numbers and strings. dm helps avoid writing little BASIC or C programs every time some transformation to a file of data is wanted. To use dm, a list of expressions is entered, and for each line of data, dm prints the result of evaluating each expression.
Column Extraction: dm can be used to extract columns. If data is the name of a file of five columns, then the following will extract the 3rd string followed by the 1st, followed by the 4th, and print them to the standard output.
dm s3 s1 s4 < data
Thus dm is useful for putting data in a correct format for input to many programs, notably the |STAT data analysis programs. Warning: If a column is missing (e.g., you access column 3 and there is no third column in the input), then the value of the access will be taken from the previous input line. This feature must be considered if there are blank lines in the input; it may be best to remove blank lines, with dm or some other filter program.
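The extraction above, together with the missing-column warning, can be pictured with a short Python sketch. This is an emulation under stated assumptions, not dm's own code: the helper name extract_columns is hypothetical, and the carry-over of a missing column is modeled by never clearing the stored columns between lines.

```python
def extract_columns(lines, wanted):
    """Emulate `dm s3 s1 s4`: pick 1-based columns from each line.

    Hypothetical helper, not part of dm. A column absent on the current
    line keeps the value it had on the previous line, mimicking the
    warning in the text.
    """
    stored = {}          # column number -> last string seen in that column
    out = []
    for line in lines:
        for i, field in enumerate(line.split(), start=1):
            stored[i] = field
        # Assumption: a column never seen so far is shown as an empty string.
        out.append("\t".join(stored.get(i, "") for i in wanted))
    return out


rows = extract_columns(["a b c d e", "f g"], wanted=[3, 1, 4])
print("\n".join(rows))
```

The second input line has only two columns, so its third and fourth values are carried over from the first line, which is why removing short or blank lines first is usually the safer choice.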
dm x1+x2 x3-x4 < data
Almost all arithmetic operations are available and expressions can be of arbitrary complexity. Care must be taken because many of the symbols used by dm (such as * for multiplication) have special meaning when used in UNIX (though not MSDOS). Problems can be avoided by putting expressions in quotes. For example, the following will print the sum of the squares of the first two columns followed by the square of the third, a simple Pythagorean program.
dm "x1*x1+x2*x2" 'x3*x3' < data
dm "if x1 >= 100 then INPUT else NEXT" < data
This will print only those lines whose first column has a value greater than or equal to 100. The variable INPUT refers to the whole input line. The special variable NEXT instructs dm to stop processing the current line and go to the next.
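For readers more at home in a general-purpose language, the same filter can be sketched in Python. The function name filter_lines and the handling of blank lines are assumptions made for this illustration; they are not taken from dm.

```python
def filter_lines(lines, threshold=100.0):
    """Keep whole lines whose first column is >= threshold."""
    kept = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue               # assumption: blank lines are skipped
        if float(fields[0]) >= threshold:
            kept.append(line)      # like INPUT: the whole original line
        # otherwise fall through, like NEXT: no output for this line
    return kept


kept = filter_lines(["99 a", "100 b", "250.5 c"])
print("\n".join(kept))
```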
dm s2 s3 s1
This will print the second, third and first columns of the input. One special string is called INPUT, and is the current input line of data. String constants in expressions are delimited by single or double quotes. For example:
"I am a string"
dm x1+x2 x1/x2
This will print out two columns, first the sum of the first two input columns, then their ratio.
The value of a previously evaluated expression can be accessed to avoid evaluating the same sub-expression more than once. yi refers to the numerical value of the ith expression. Instead of writing:
dm x1+x2+x3 (x1+x2+x3)/3
the following would be more efficient:
dm x1+x2+x3 y1/3
y1 is the value of the first expression, x1+x2+x3. String values of expressions are unfortunately inaccessible.
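The reuse of an earlier expression value can be sketched in Python as follows. The function name sum_and_mean is hypothetical; the lists x and y stand in for dm's xi and yi variables, shifted to Python's zero-based indexing.

```python
def sum_and_mean(line):
    """Emulate `dm x1+x2+x3 y1/3` for one input line."""
    x = [float(f) for f in line.split()]   # x[0] plays the role of x1
    y = []                                 # y[0] plays the role of y1
    y.append(x[0] + x[1] + x[2])           # expression 1: x1+x2+x3
    y.append(y[0] / 3)                     # expression 2: y1/3, no re-adding
    return y


result = sum_and_mean("3 4 8")
print(result)
```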
A numerical variable is usually indexed by putting the index directly after x or y. If the index must depend on the input, as when lines have a variable number of columns and only the last column is of interest, a computed index is needed: write the index as an expression in square brackets following x or y. For example, x[N] is the value of the last column of the input, where N is a special variable equal to the number of columns in INPUT. Both x1 and x[1] are accepted, but x1 executes faster, so computed indexes should not be used unless necessary.
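A computed index such as x[N] corresponds to the following Python sketch, in which last_column is a hypothetical helper and n stands in for dm's N.

```python
def last_column(line):
    """Return the numeric value of the last column, like dm's x[N]."""
    fields = line.split()
    n = len(fields)               # dm's N: number of columns on this line
    return float(fields[n - 1])   # dm indexes from 1, Python from 0


lasts = [last_column(line) for line in ["1 2 3", "4 5", "6 7 8 9"]]
print(lasts)
```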
N            the number of columns in the current input line
SUM          the sum of the numbers on the input line
INLINE       the line number of the input (initially 1.0)
OUTLINE      the number of lines so far output (initially 0.0)
RAND   [R]   a random number uniform in [0,1) (may be followed by a seed)
INPUT  [I]   the original input line, all spaces, etc. included
NIL          the empty expression (often used with a test)
KILL   [K]   stop processing the current line and produce no output
NEXT         synonym for KILL
SKIP         synonym for KILL
EXIT   [E]   exit immediately (useful after a search)
dm Efilename
where filename is a file of expressions. This mode makes it easier to use dm with pipelines and redirection.
dm reads data a line at a time and stores that line in a string variable called INPUT. dm then takes each column in INPUT, separated by spaces or tabs, and stores each in the string variables, si. dm then tries to convert these strings to numbers and stores the result in the number variables, xi. If a column is not a number (e.g., it is a string) then its numerical value will be inaccessible, and trying to refer to such a column will cause an error message. The number of columns in a line is stored in a special variable called N, so variable numbers of columns can be dealt with gracefully. The general control structure of dm is summarized in the following display.
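read in n expressions; e1, e2, ..., en.
repeat while there is some input left
    INPUT  = <next line of input>
    N      = <number of columns in INPUT>
    SUM    = 0
    RAND   = <a new random number in [0,1)>
    INLINE = INLINE + 1
    for i = 1 until N do
        si  = <ith column of INPUT>
        xi  = <si converted to a number>
        SUM = SUM + xi
    for i = 1 until n do
        switch on <value of ei>
            case EXIT: <exit immediately>
            case KILL: <stop processing this line>
            case NIL : <no value to print>
            default  :
                OUTLINE = OUTLINE + 1
                yi = <value of ei>
                if (ei not X'd) print yi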
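The control structure summarized above can be approximated by a small Python loop. This is a rough sketch, not a reimplementation of dm: expressions are modeled as Python callables that receive a dictionary of variables, the sentinel objects EXIT, KILL and NIL and the function name run are hypothetical, and OUTLINE and the X flag are left out for brevity. Non-numeric columns are stored as None and ignored in SUM, which is an assumption about behavior the text does not spell out. The y values are kept across lines and assumed to start at zero.

```python
import random

EXIT, KILL, NIL = object(), object(), object()   # hypothetical sentinels


def run(lines, expressions):
    """Evaluate every expression (a Python callable) on every input line."""
    output = []
    inline = 0
    y = [0.0] * len(expressions)          # assumption: y values start at zero
    for line in lines:
        inline += 1
        s = line.split()
        x = []
        for field in s:
            try:
                x.append(float(field))
            except ValueError:
                x.append(None)            # assumption: no number for this column
        env = {"INPUT": line, "N": len(s), "INLINE": inline,
               "SUM": sum(v for v in x if v is not None),
               "RAND": random.random(), "s": s, "x": x, "y": y}
        printed, killed = [], False
        for i, expr in enumerate(expressions):
            value = expr(env)
            if value is EXIT:
                return output
            if value is KILL:
                killed = True
                break
            if value is NIL:
                continue
            y[i] = value
            printed.append("%.6g" % value if isinstance(value, float) else str(value))
        if not killed:
            output.append("\t".join(printed))
    return output


result = run(
    ["1 2", "stop 0", "3 4"],
    [lambda e: KILL if e["x"][0] is None else e["x"][0] + e["x"][1],
     lambda e: e["y"][0] * 10],
)
print(result)
```

The second expression reads the value of the first through y, and the line whose first column is not a number is dropped by returning the KILL sentinel.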
Output file or pipe: A filename, a "pipe command," or just RETURN can be entered. A null filename tells dm to print to the terminal. If output is being directed to a file, the output file should be different from the input file. dm will ask permission to overwrite any file that contains anything, but that does not mean it makes sense to write to the file it is reading from.
On UNIX, the output from dm can be redirected to another program by having the first character of the output specification be a pipe symbol, the vertical bar: |. For example, the following line tells dm to pipe its output to tee which prints a copy of its output to the terminal, and a copy to the named file.
Output file or pipe: | tee dm.save
Out of interactive mode, dm prints to the standard output.
dm prints the values of all its expressions in %.6g format for numbers (maintaining at most six digits of precision and printing in the fewest possible characters), and %s format for strings. A tab is printed after every column to ensure separation.
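The effect of the %.6g format is easy to see with Python's printf-style formatting, which follows the same conventions as C for these conversions. The helper name dm_format is hypothetical.

```python
def dm_format(value):
    """Format one output value: %s for strings, %.6g for numbers."""
    if isinstance(value, str):
        return "%s" % value
    return "%.6g" % value


samples = [dm_format(v) for v in (3.0, 1 / 3, 1234567.0, 0.5, "abc")]
print("\t".join(samples))
```

Whole numbers print without a decimal point, long fractions are cut to six significant digits, and large values switch to exponent notation.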
An assignment operator is not directly available. Instead, variables can be evaluated but not printed by using the expression suppression flag, X. If the first character of an expression is X, it will be evaluated, but not printed. The value of a suppressed expression can later be accessed with the expression value variable, yi.
"abcde" <= 'eeek!'
is equal to 1.0. The length of strings can be found with the len operator.
len 'five'
evaluates to 4, the length of the string argument. The character # is a synonym for the len operator. The numerical type of a string can be checked with the number function, which returns 0 for non-numerical strings, 1 for integer strings, and 2 for real numbers (scientific notation or strings with non-zero digits after the decimal point).
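The classification made by the number function can be sketched in Python. The regular expression below is an assumed grammar for numeric strings, chosen for this illustration; dm's actual rules for edge cases (leading signs, a bare decimal point, surrounding spaces) may differ.

```python
import re

# Assumed grammar: optional sign, digits with an optional decimal point,
# optional exponent. This is an illustration, not dm's own parser.
_NUMERIC = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")


def number(text):
    """Return 0 for non-numeric, 1 for integer, 2 for real strings."""
    text = text.strip()
    match = _NUMERIC.fullmatch(text)
    if match is None:
        return 0
    if match.group(2):
        return 2                      # scientific notation counts as real
    fraction = text.partition(".")[2]
    return 2 if fraction.strip("0") else 1   # non-zero digits after the point


kinds = [number(t) for t in ("five", "12", "3.0", "3.14", "1e3")]
length = len("five")                  # counterpart of dm's len 'five'
print(kinds, length)
```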
Individual characters inside strings can be accessed by following a string with an index in square brackets.
"abcdefg"[4]
is the ASCII character number (100.0) of the 4th character in abcdefg. Indexing a string is mainly useful for comparing characters because it is not the character that is printed, but the character number. A warning is appropriate here:
s1[1] = '*'
will result in an error because the left side of the = is a number, and the right hand side is a string. The correct (although inelegant) form is:
s1[1] = '*'[1]
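String indexing and the character comparison above have a close Python analogue in ord. The helper name char_code is hypothetical, and the index is 1-based to match dm.

```python
def char_code(text, index):
    """Return the character number at a 1-based index, as a float."""
    return float(ord(text[index - 1]))


code = char_code("abcdefg", 4)        # character number of the 4th character
# Counterpart of s1[1] = '*'[1]: compare numbers with numbers.
starts_with_star = char_code("*data", 1) == char_code("*", 1)
print(code, starts_with_star)
```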
A substring test is available. The expression:
string1 C string2
will return 1.0 if string1 is somewhere in string2. This can be used as a test for character membership if string1 has only one character. Also available is !C which returns 1.0 if string1 is NOT in string2.
The comparison operators
< <= = != >= >
LT LE EQ NE GE GT
also apply to numbers, and have meanings analogous to their string counterparts.
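The substring test maps directly onto Python's in operator. The function names contains and not_contains are hypothetical stand-ins for C and !C, and the results are floats because dm's logical values are 1.0 and 0.0.

```python
def contains(string1, string2):
    """Counterpart of `string1 C string2`."""
    return 1.0 if string1 in string2 else 0.0


def not_contains(string1, string2):
    """Counterpart of `string1 !C string2`."""
    return 1.0 - contains(string1, string2)


found = contains("bad", "a bad line")
is_vowel = contains("e", "aeiou")     # single character: membership test
missing = not_contains("bad", "a good line")
print(found, is_vowel, missing)
```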
The binary operators, + (addition), - (subtraction or "change-sign"), * (multiplication), and / (division) are available. Multiplication and division are evaluated before addition and subtraction, and are all evaluated left to right. Exponentiation, ^, is the binary operator of highest precedence and is evaluated right to left. Modulo division, %, has the same properties as division, and is useful for tests of even/odd and the like. NOTE: Modulo division truncates its operands to integers before dividing.
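Python happens to share these precedence and associativity rules, so it can be used to check how an expression groups. The helper dm_mod is a hypothetical sketch of the truncating modulo described in the note; for negative operands the sign of the result may differ between Python and a C program such as dm.

```python
def dm_mod(a, b):
    """Modulo division after truncating both operands to integers."""
    return float(int(a) % int(b))


power = 2 ** 3 ** 2            # right to left: 2 ** (3 ** 2)
mixed = 1 + 2 * 3 - 8 / 4      # * and / are evaluated before + and -
odd = dm_mod(7.9, 2.5)         # operands truncate to 7 and 2
print(power, mixed, odd)
```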
Several unary functions are available: l (natural log [log]), L (base ten log [Log]), e (exponential [exp]), a (absolute value [abs]), f (floor [floor]), c (ceiling [ceil]). Their meaning can be verified in the UNIX Programmer's Manual. Either the single letter names for these functions or the more mnemonic strings bracketed after their names can be used. Also available are trigonometric functions that work on angles measured in radians: sin cos tan asin acos atan.
x1<x2 & x2<x3 | x1>x2 & x2>x3
would equal 1.0 if the condition was satisfied. Parentheses are unnecessary because < and > are of higher precedence than &, which is of higher precedence than |. The above expression could be written as:
x1 LT x2 AND x2 LT x3 OR x1 GT x2 AND x2 GT x3
by using synonyms for the special character operators. This is useful to avoid the special meaning of characters in command lines. The unary logical operator, ! (NOT), evaluates to 1.0 if its operand is 0.0, otherwise it equals 0.0. Many binary operators can be immediately preceded by ! to negate their value. != is "not equal to," !| is "neither," !& is "not both," and !C is "not in."
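The grouping of the expression above, and two of the negated operators, can be checked with a Python sketch. The function names ordered, neither and not_both are hypothetical; Python's and and or stand in for & and |.

```python
def ordered(x1, x2, x3):
    """1.0 if the values are strictly increasing or strictly decreasing."""
    # No parentheses needed: comparisons bind tighter than `and`,
    # and `and` binds tighter than `or`, as with <, >, & and | in dm.
    return 1.0 if x1 < x2 and x2 < x3 or x1 > x2 and x2 > x3 else 0.0


def neither(a, b):
    """Counterpart of a !| b."""
    return 1.0 if not (a or b) else 0.0


def not_both(a, b):
    """Counterpart of a !& b."""
    return 1.0 if not (a and b) else 0.0


print(ordered(1, 2, 3), ordered(3, 2, 1), ordered(1, 3, 2))
```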
if expression1 then expression2 else expression3
expression1 ? expression2 : expression3
evaluate to expression2 if expression1 is non-zero; otherwise they evaluate to expression3. The first form is more mnemonic than the second, which is consistent with C syntax. Upper case names can be used in their stead. Both forms have the same meaning. expression1 has to be numerical; expression2 and expression3 can be numerical or string. For example, the following expression will filter out lines with the word bad in them.
if 'bad' C INPUT then KILL else INPUT
As another example, the following expression will print the ratio of columns two and three if (a) there are at least three columns, and (b) column three is not zero.
if (N >= 3) & (x3 != 0) then x2/x3 else 'bad line'
Conditional expressions are the only expressions, besides si or a string constant, that can evaluate to a string. If a conditional expression does evaluate to a string, then it CANNOT be used in some other expression. The conditional expression is of lowest precedence and groups left to right; however, parentheses are recommended to make the semantics obvious.
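The guarded ratio can be sketched in Python, where a single function may likewise return either a number or a string. The name ratio_or_flag is hypothetical. Python's and stops evaluating as soon as the first test fails, which is what keeps the lookup of a missing third column from raising an error here; whether dm's & behaves the same way is not stated in the text.

```python
def ratio_or_flag(line):
    """Return column 2 divided by column 3, or 'bad line' if that is unsafe."""
    fields = line.split()
    n = len(fields)                              # dm's N
    if n >= 3 and float(fields[2]) != 0:
        return float(fields[1]) / float(fields[2])
    return "bad line"


results = [ratio_or_flag(line) for line in ["1 6 3", "1 6 0", "1 6"]]
print(results)
```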
Operators of higher precedence are executed first. All binary operators are left associative except exponentiation, which groups to the right. An operator, O, is left associative if xOxOx is parsed as (xOx)Ox, while one that is right associative is parsed as xO(xOx).
op      prec  description
sin      10   sine of argument (in radians)
cos      10   cosine of argument (in radians)
tan      10   tangent of argument (in radians)
asin     10   arc (inverse) sine function
acos     10   arc (inverse) cosine function
atan     10   arc (inverse) tangent function
sqrt     10   square root function
log      10   base e logarithm [l]
Log      10   base 10 logarithm [L]
exp      10   exponential [e]
abs      10   absolute value [a]
ceil     10   ceiling (rounds up to next integer) [c]
floor    10   floor (rounds down to last integer) [f]
len      10   number of characters in string [#]
number   10   report if string is a number (0 non, 1 int, 2 real)
[]       10   ASCII number of indexed string character
-         9   change sign
!         4   logical not (also NOT, not)
op      prec  description
^         8   exponentiation
*         7   multiplication
/         7   division
%         7   modulo division
+         6   addition
-         6   subtraction
=         5   test for equality (also EQ; opposite !=, NE)
>         5   test for greater-than (also GT; opposite <=, LE)
<         5   test for less-than (also LT; opposite >=, GE)
C         5   substring (opposite !C)
&         4   logical AND (also AND, and; opposite !&)
|         3   logical OR (also OR, or; opposite !|)
dm "if x >= 10 and x <= 20 then INPUT else SKIP" < dm.dat
To print all the lines longer than 100 characters, you could run the following:
dm "if len(INPUT) > 100 then INPUT else SKIP" < dm.dat
To print the running sums of values in a column, you need to use the y variables. The following will print the running sum of values in the first column.
dm y1+x1
To print the running sum of the data in the 5th column is a bit unintuitive. y1 is the value from the previous line of the first expression, and x5 is the value of the fifth column on the current line. To get the running sum of column 5, you would type:
dm y1+x5
If the running sum is to come out in the third column, then you would run:
dm <something> <something> y3+x5
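The way y1 carries a value from one line to the next can be sketched in Python. The function name running_sum is hypothetical, and the sketch assumes that y1 is zero before the first line is read.

```python
def running_sum(lines, column):
    """Emulate `dm y1+x5` (for column 5): one running total per line."""
    y1 = 0.0                      # assumption: y1 starts at zero
    totals = []
    for line in lines:
        x = [float(f) for f in line.split()]
        y1 = y1 + x[column - 1]   # previous value of the expression plus x5
        totals.append(y1)
    return totals


totals = running_sum(["1 0 0 0 10", "2 0 0 0 20", "3 0 0 0 30"], column=5)
print(totals)
```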
dm is good at making tables of computed values. In the following example, the echo command prints headings for the columns, and colex reformats the output of dm. colex sets the default format to 10.3n (numbers 10 wide, with 3 decimal places), and prints column 1 in 2i format (2-wide integer) and column 3 in 6i format (6-wide integer). The -t option to colex stops the printing of tabs after columns.
echo " x 1/x x**2 sqrt(x) log(x)"
series 1 10 | dm x1 1/x1 "x1*x1" "sqrt(x1)" "log(x1)" | colex -t -F 10.3n 2i1 2 6i3 4-5
 x       1/x  x**2   sqrt(x)    log(x)
 1     1.000     1     1.000     0.000
 2     0.500     4     1.414     0.693
 3     0.333     9     1.732     1.099
 4     0.250    16     2.000     1.386
 5     0.200    25     2.236     1.609
 6     0.167    36     2.449     1.792
 7     0.143    49     2.646     1.946
 8     0.125    64     2.828     2.079
 9     0.111    81     3.000     2.197
10     0.100   100     3.162     2.303
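For comparison, the same table can be produced by a few lines of Python. This sketch assumes that colex's 10.3n and i formats behave like printf's %10.3f and %d with the stated widths; the function name table_row is hypothetical.

```python
import math


def table_row(x):
    """Format one row: x, 1/x, x squared, square root and natural log."""
    return "%2d%10.3f%6d%10.3f%10.3f" % (
        x, 1 / x, x * x, math.sqrt(x), math.log(x))


rows = [table_row(x) for x in range(1, 11)]
print("\n".join(rows))
```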
© 1986 Gary Perlman