|STAT Statistical Data Analysis : Data Manipulation Program Overview

|STAT Statistical Data Analysis

Free Data Analysis Programs for UNIX and DOS

by Gary Perlman

Chapter 4: Data Manipulation

4.1 Data Generation/Augmentation
4.2 Data Transformation
4.3 Data Formatting
4.4 Data Extraction
4.5 Data Validation
4.6 DM: Tutorial and Manual

This chapter introduces all of the data manipulation programs and shows some of their options. Full documentation is in the manual entries. |STAT data manipulation tools allow users to generate, transform, format, extract, and validate data. dm, the data manipulator, is the most important tool for use with other |STAT programs.

There are several classes of data manipulation programs.

Generation programs produce more data than their inputs by repeating data, numbering data, or by creating new data.
Transformation programs allow algebraic conversion of data.
Formatting programs change the shape or order of the data.
Extraction programs produce subsets of datasets.
Validation programs check the consistency, data types, and ranges of data.

4.1 Data Generation/Augmentation

repeat: repeat a string or file
repeat can repeat strings or lines in a file as many times as requested. It helps generate labels for datasets, or feed a program like dm that needs input to produce output. The following will repeat the file data 10 times.

repeat  -n 10  data

The following will repeat its input series of 20 numbers 15 times.

series  1 20  |  repeat  -n 15

Strings can be repeated using the echo command. The following will repeat the string hello 100 times.

echo hello | repeat -n 100

Manual for repeat

series: generate a linear series
series generates a linear series of numbers between two values. By default, its values change by units, but this can be modified. The following produces a series of 10 numbers, 1 to 10, one per line.

series 1 10

The following produces the same series, but in reverse order; the start of the series can be greater than the end.

series 10 1

Non-integral series can be created by supplying an optional increment.

series 0 1 .1

produces the series:

0  .1  .2 .3  .4  .5  .6  .7  .8  .9  1

except that each value is on its own line. The output from series can be transformed with dm to produce other than linear series. Here is an exponential series:

series 1 10 | dm "exp(x1)"
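
For readers who want to see the rules above spelled out, here is a minimal Python sketch of a linear series: a default step of one, a start that may be greater than the end, and an optional increment. The function name and the rounding to ten decimal places are assumptions made for illustration; this is not how series itself is implemented.

```python
import math

def series(start, end, step=1.0):
    """Return a linear series from start to end, inclusive (illustrative only)."""
    step = abs(step)
    if step == 0:
        raise ValueError("step must be non-zero")
    direction = 1 if end >= start else -1
    # Count the steps up front instead of accumulating, to avoid drift.
    count = int(math.floor(abs(end - start) / step + 1e-9)) + 1
    # Round to hide binary floating-point noise in values such as 3 * 0.1.
    return [round(start + direction * i * step, 10) for i in range(count)]

print(series(1, 10))                          # ascending, unit steps
print(series(10, 1))                          # start greater than end
print(series(0, 1, 0.1))                      # non-integral increment
print([math.exp(x) for x in series(1, 10)])   # like piping through dm "exp(x1)"
```

Computing each value from its index, rather than adding the increment repeatedly, keeps the last value from drifting past the end of the series.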

Manual for series

probdist: generate random numbers
probdist can generate random numbers for several probability distributions. The following will generate 100 random numbers from the uniform distribution (between 0 and 1).

probdist random uniform 100

This can be transformed using dm to get random numbers with other ranges. The following will produce 100 random integers uniformly distributed between 10 and 29.

probdist random uniform 100 | dm "floor(x1*20+10)"

The following generates numbers from a one-trial binomial distribution with probability 0.5.

probdist random uniform 100 | dm "if x1 > .5 then 1 else 0"

probdist also has a binomial distribution built in, so the following would be equivalent to the previous example:

probdist rand binomial 1 1/2 100

The random number generator can be seeded. The following will seed the random number generator with 143 and generate 100 normally distributed z values.

probdist -s 143 random normal 100

The seeding option is useful when a random sequence must be repeated. The random normal numbers have a mean of 0 and a standard deviation of 1, so dm can help create different random normal distributions. The following samples a normal distribution with mean 100 and standard deviation 15.

probdist random normal 100  |  dm  "x1*15+100"
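
The transformations above are simple rescalings, and a rough Python sketch may make the arithmetic easier to check. The helper names are hypothetical, and Python's generator is not the one used by probdist, so the same seed will not reproduce probdist's numbers; only the idea of seeding and rescaling carries over.

```python
import math
import random

def uniform_sample(n, seed=None):
    """n values in [0, 1); a stand-in for 'probdist random uniform n'."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

def normal_sample(n, seed=None):
    """n standard normal values; a stand-in for 'probdist random normal n'."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

u = uniform_sample(100, seed=143)
integers = [math.floor(x * 20 + 10) for x in u]   # like dm "floor(x1*20+10)"
coin = [1 if x > 0.5 else 0 for x in u]           # like dm "if x1 > .5 then 1 else 0"
scaled = [z * 15 + 100 for z in normal_sample(100, seed=143)]  # like dm "x1*15+100"

print(min(integers), max(integers))
print(sum(coin), "ones out of", len(coin))
```

Because the uniform values are at least 0 and less than 1, x*20+10 is at least 10 and less than 30, so the floor is always an integer from 10 to 29.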

Manual for probdist

abut: number lines, recycle files
abut can number input lines in files using the -n option, or cycle through input files as many times as is necessary to match the length of longer files. The latter case is common in creating input files for programs like anova and contab, which have input data tagged with regular patterns of labels.

File1     File2     Data
large     easy      12
small     easy      23
          hard      34
          hard      45
                    56
                    67
                    78
                    89

For the above input file configuration, the command

abut -nc File1 File2 Data

would produce the following by recycling the smaller files.

1         large     easy      12
2         small     easy      23
3         large     hard      34
4         small     hard      45
5         large     easy      56
6         small     easy      67
7         large     hard      78
8         small     hard      89
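
The recycling shown above can be sketched in a few lines of Python. The function name and its arguments are made up for this illustration, and each input list stands for the lines of one file; it is a sketch of the idea, not of abut's code.

```python
def abut_cycle(columns, number_lines=False):
    """Paste columns side by side, recycling shorter ones (illustrative only).

    Assumes every column has at least one entry.
    """
    longest = max(len(col) for col in columns)
    rows = []
    for i in range(longest):
        # The modulo wraps around to the start of a shorter column.
        row = [col[i % len(col)] for col in columns]
        if number_lines:
            row.insert(0, str(i + 1))
        rows.append(row)
    return rows

file1 = ["large", "small"]
file2 = ["easy", "easy", "hard", "hard"]
data = ["12", "23", "34", "45", "56", "67", "78", "89"]

for row in abut_cycle([file1, file2, data], number_lines=True):
    print("\t".join(row))
```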

Manual for abut

dm: number lines
dm can number its input lines with its special variables INLINE, which always contains the input line number, and INPUT, which always contains the current input line.

dm INLINE INPUT < data

Manual for dm

4.2 Data Transformation

dm: conditional algebraic combinations of columns
dm can produce algebraic combinations of columns. The following command reads from data and produces the ratio of columns 2 and 1 with column 3 added on.

dm  x2/x1+x3  <  data

Transformations can be based on conditions. For example, if x1, the value in column 1, is 0 in the above example, then dm will exit after producing an error message like:

dm: division by zero. input line 12  expr[1].

To avoid this problem, the following will do the division only if x1 is non-zero.

dm "if x1 then x2/x1+x3 else 0" < data

Manual for dm

probdist: probability/statistic conversion
probdist can convert probabilities to distribution statistics and vice versa as seen in tables at the end of most statistics textbooks. Many distributions are supported, including: the normal z, binomial, chi-square, F, and t. The following will print the two-tailed probability of an obtained t statistic of 2.5 with 20 degrees of freedom.

probdist prob t 20 2.5

0.021234

Similarly, the following will print the two-tailed probability of an F ratio of 6.25 with 1 and 20 degrees of freedom.

probdist prob F 1 20 6.25

0.021234

These results are the same because of the relationship between the t and F distributions.
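
The relationship in question is the standard one: the square of a t variable with a given number of degrees of freedom follows the F distribution with 1 and that many degrees of freedom. Applied to the two commands above, with the probability taken from their output:

```latex
t_{\nu}^{2} \sim F_{1,\nu}
\qquad\Longrightarrow\qquad
P\left(|t_{20}| \ge 2.5\right)
  = P\left(F_{1,20} \ge 2.5^{2}\right)
  = P\left(F_{1,20} \ge 6.25\right)
  \approx 0.021234
```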

The following prints the critical value (also called the quantile) in the chi-square distribution with 5 degrees of freedom to obtain a significance level of .05.

probdist crit chisq 5 .05

11.070498

Both probabilities and critical values in the normal z distribution use the lower tail of the distribution, which ranges from -infinity to +infinity, so the z value that cuts off the lowest .05 of the distribution is obtained with the following.

probdist crit z .05

-1.644854

The critical value for the 99th percentile is found with the following.

probdist crit z .99

2.326348

Binomial distribution critical values are treated differently than those of the continuous distributions. For the binomial distribution based on five trials and a probability of success of one half, the critical value for a one-tailed test at the .05 level is:

probdist crit binomial 5 1/2 .05

even though the probability of 5 successes is proportionally much less than .05:

probdist prob binomial 5 1/2 5

0.031250

This is because the binomial distribution is discrete. Not only are critical values conservative, sometimes there may be no possible value; there is no way to get a less probable event than five out of five successes:

probdist crit binomial 5 1/2 .01

Here, probdist is returning an impossible value (one with zero probability).
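
The discreteness argument can be checked by hand with a short Python sketch. It assumes that a one-tailed critical value is the smallest number of successes whose upper-tail probability does not exceed the requested level; that definition and the function names are assumptions made here, not a description of probdist's algorithm.

```python
from math import comb

def upper_tail(n, p, k):
    """P(X >= k) for X distributed as binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def critical_value(n, p, level):
    """Smallest k with P(X >= k) <= level (assumed definition).

    The result can be n + 1, a count that cannot occur in n trials.
    """
    for k in range(n + 2):
        if upper_tail(n, p, k) <= level:
            return k

print(upper_tail(5, 0.5, 5))          # probability of five successes in five trials
print(upper_tail(5, 0.5, 4))          # probability of four or more successes
print(critical_value(5, 0.5, 0.05))
print(critical_value(5, 0.5, 0.01))
```

Under this definition, the jump from four or more successes to exactly five skips over the .05 level entirely, which is why the test is conservative, and no achievable count reaches the .01 level.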

Manual for probdist

ranksort: convert data to ranks
ranksort can rank order data from numerical data columns. For the input:

1   95  4.3
2   113 5.2
3   89  4.5
4   100 5.0
5   89  4.5

ranksort would produce:

1   3   1
2   5   5
3   1.5 2.5
4   4   4
5   1.5 2.5

The ties in the second and third columns get the average rank of the values for which they are tied. Once data are ranksorted, further ranksorting has no effect. With rank orders within columns, rank order statistics (e.g., Spearman rank order correlation, average group rank) can be computed by parametric programs like pair or regress.
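
The average-rank rule for ties can be sketched in Python as follows. The function works on one column at a time and its name is hypothetical; it illustrates the rule described above rather than ranksort's own implementation.

```python
def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the run while the next sorted value equals the current one.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Mean of the 1-based positions start+1 through end+1.
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

print(average_ranks([95, 113, 89, 100, 89]))
print(average_ranks([4.3, 5.2, 4.5, 5.0, 4.5]))
```

Ranking the ranks again returns the same ranks, which is the sense in which further ranksorting has no effect.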

Manual for ranksort

4.3 Data Formatting

maketrix: form a matrix format file
maketrix reads its data, one whitespace separated string at a time from its free format input, and produces a multicolumn output.

series 1 20 | maketrix 5

The above produces a five column output.

1   2   3   4   5
6   7   8   9   10
11  12  13  14  15
16  17  18  19  20
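
The reshaping amounts to reading whitespace-separated strings and starting a new output line after every N of them. A minimal Python sketch, with a function name borrowed from the program purely for illustration:

```python
def maketrix(text, ncols):
    """Split free-format text into rows of ncols strings (illustrative only)."""
    tokens = text.split()   # any run of spaces, tabs, or newlines separates strings
    return [tokens[i:i + ncols] for i in range(0, len(tokens), ncols)]

numbers = "\n".join(str(i) for i in range(1, 21))   # like the output of series 1 20
for row in maketrix(numbers, 5):
    print("\t".join(row))
```

In this sketch, if the number of strings is not a multiple of the column count, the last row is simply shorter.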

Manual for maketrix

perm: permute lines
perm, with no options, randomizes its input lines. It can randomize output from programs like series.

series 1 20 | perm

A subset of this permutation is a sample without replacement. The following is a sample of size 10 from the file data.

perm < data | dm "if INLINE <= 10 then INPUT else EXIT"

perm can be supplied a seed for its random number generator, to replicate a random permutation.

series 1 20 | perm -s 5762 | maketrix 5

The above produces (with my system's random number generator):

18  7   10  13  2
14  11  19  15  20
1   3   9   6   16
8   17  12  5   4

perm can also put its lines in alphabetical or numerical order. For example, the output from the previous example could be put into ascending order (according to the first number on each line) with:

series 1 20 | perm -s 5762 | maketrix 5 | perm -n

This produces:

1   3   9   6   16
8   17  12  5   4
14  11  19  15  20
18  7   10  13  2
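
The sampling idea used earlier in this section, shuffling the lines and keeping the first few, can be sketched in Python. The function name is hypothetical, and Python's random number generator differs from the one perm uses, so a seed such as 5762 will not reproduce the permutation shown above.

```python
import random

def sample_lines(lines, size, seed=None):
    """Shuffle the lines and keep the first 'size': a sample without replacement."""
    rng = random.Random(seed)
    shuffled = list(lines)      # copy, so the caller's list is not reordered
    rng.shuffle(shuffled)
    return shuffled[:size]

lines = [str(i) for i in range(1, 21)]
print(sample_lines(lines, 10, seed=5762))
print(sample_lines(lines, 10, seed=5762))   # same seed, same sample
```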

Manual for perm

dsort: sort data lines by multiple keys
The last example of the perm filter showed how lines can be ordered according to the numerical value in the first column. dsort can sort lines based on numerical or alphabetical values in any column. For example, the following command sorts the previous example matrix in ascending order of the values in the third column.

series 1 20 | perm -s 5762 | maketrix 5 | dsort -n 3

This produces:

1   3   9   6   16
18  7   10  13  2
8   17  12  5   4
14  11  19  15  20

If there were ties in a column, dsort could sort by additional key columns.
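
Sorting by several keys means comparing rows on the first key column and consulting later key columns only to break ties. The Python sketch below uses a made-up key notation, a list of (column, numeric) pairs with columns counted from 1, which is not dsort's option syntax.

```python
def sort_rows(rows, keys):
    """Sort rows by several key columns (illustrative only).

    keys is a list of (column, numeric) pairs; columns are counted from 1.
    """
    def sort_key(row):
        return [float(row[c - 1]) if numeric else row[c - 1] for c, numeric in keys]
    return sorted(rows, key=sort_key)

matrix = [
    ["18", "7", "10", "13", "2"],
    ["14", "11", "19", "15", "20"],
    ["1", "3", "9", "6", "16"],
    ["8", "17", "12", "5", "4"],
]
for row in sort_rows(matrix, [(3, True)]):       # numeric sort on column 3
    print("\t".join(row))

# Ties in column 2 are broken alphabetically by column 1.
print(sort_rows([["b", "2"], ["a", "2"], ["c", "1"]], [(2, True), (1, False)]))
```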

Manual for dsort

transpose: transpose matrix format file
transpose flips rows and columns in its input. For the input:

1   2   3   4
5   6   7   8
9   10  11  12

transpose produces:

1   5   9
2   6   10
3   7   11
4   8   12

The input to transpose does not have to be regular, nor does it have to be numerical.

one       two       three
four      five
six
seven     eight
nine      ten       eleven

For the above input, transpose produces the following.

one       four      six       seven     nine
two       five                eight     ten
three                                   eleven

Note that with regular inputs, the transposition of a transposition yields the original. This is not necessarily so with data as in the above input and output. The above output piped through another pass of transpose produces a result different from the original input.

one       two       three
four      five      eleven
six       eight
seven     ten
nine
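
A short Python sketch shows why a second pass does not restore an irregular input. It assumes that missing cells are written as empty, tab-separated positions and that input is read in free format, so the empty positions vanish when the output is read again. Both points are assumptions made for this illustration, as is the function name.

```python
def transpose(text):
    """Flip rows and columns of free-format text (illustrative only)."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    width = max((len(row) for row in rows), default=0)
    # Missing cells become empty strings so that the printed columns stay aligned.
    grid = [[row[c] if c < len(row) else "" for row in rows] for c in range(width)]
    return "\n".join("\t".join(cells).rstrip() for cells in grid)

regular = "1 2 3 4\n5 6 7 8\n9 10 11 12"
ragged = "one two three\nfour five\nsix\nseven eight\nnine ten eleven"

print(transpose(regular))
print(transpose(ragged))
print(transpose(transpose(ragged)))   # differs from the original ragged input
```

On the second pass the empty positions are read as ordinary whitespace, so "eleven" slides left to sit under "five".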

Manual for transpose

reverse: reverse lines, columns, characters
reverse can reverse the lines, fields, or characters in its input. It can provide easier access to the last lines in a file, or the last columns on lines. To get the last 10 lines in a file, we can reverse the file, get the first 10 lines, and then reverse those 10 lines.

reverse < data | dm "if INLINE GT 10 then EXIT else INPUT" | reverse

Getting the last two columns in a file is easier.

reverse -f < data | dm s2 s1

Here, dm is used for column extraction, and rather than call reverse a second time, what were the last two columns before reversal are listed in the opposite order.

Manual for reverse

colex: reorder columns, reformat columns
colex is a column extraction program that shares some of the functionality of dm and reverse. colex is faster and has a simpler syntax than dm and has data formatting capabilities. Suppose a matrix dataset with 10 columns is created with the following.

series 1 50 | maketrix 10

colex can extract the last five columns followed by the first five with the command:

series 1 50 | maketrix 10 | colex 6-10 1 2 3 4 5

Either ranges of columns or single columns can be given. The above command produces:

6   7   8   9   10  1   2   3   4   5
16  17  18  19  20  11  12  13  14  15
26  27  28  29  30  21  22  23  24  25
36  37  38  39  40  31  32  33  34  35
46  47  48  49  50  41  42  43  44  45

Note in the previous example how the numbers line up on the left, rather than in the customary format that lines up the unit digits. This is because colex puts tabs between columns, and it is not a problem because |STAT programs read data in free format.

colex can print its columns in several numerical formats as well as the default string format. The numerical formatting can round values to some number of decimal places (like zero, for whole numbers). The option -F 4i would tell colex to format all the columns as integers, each four spaces wide, and the -t option would tell colex not to place a tab between columns. A format can be assigned to individual columns by placing it before each range of columns. For example, the following variation on the previous command would print columns 6-10 in a money format with two digits after the decimal point, and columns 1-5 as integers four wide.

series 1 50 | maketrix 10 | colex -t 6.2n6-10 4i1-5

  6.00  7.00  8.00  9.00 10.00   1   2   3   4   5
 16.00 17.00 18.00 19.00 20.00  11  12  13  14  15
 26.00 27.00 28.00 29.00 30.00  21  22  23  24  25
 36.00 37.00 38.00 39.00 40.00  31  32  33  34  35
 46.00 47.00 48.00 49.00 50.00  41  42  43  44  45
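
The effect of the two formats in this example can be imitated with Python's own format specifications. The sketch treats 6.2n as "six wide, two decimal places" and 4i as "four wide, whole number"; that reading comes from the output above, and the function names are hypothetical.

```python
def money(value):
    """Stand-in for the 6.2n format: six wide, two decimal places."""
    return f"{value:6.2f}"

def integer(value):
    """Stand-in for the 4i format: four wide, rounded to a whole number."""
    return f"{round(value):4d}"

def format_row(fields, specs):
    """specs is a list of (columns, formatter) pairs; columns count from 1."""
    parts = []
    for columns, formatter in specs:
        for c in columns:
            parts.append(formatter(float(fields[c - 1])))
    return "".join(parts)       # no tab between columns, as with -t

row = [str(i) for i in range(1, 11)]
print(format_row(row, [(range(6, 11), money), (range(1, 6), integer)]))
```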

Manual for colex

dm: reorder columns
dm, like colex, can reorder columns. However, it does not allow the specification of ranges of columns. The above example of colex could be done with dm with similar results.

series 1 50 | maketrix 10 | dm s6 s7 s8 s9 s10 s1 s2 s3 s4 s5

Manual for dm

abut: paste corresponding lines from files
abut can join data in separate files beside one another. In the usual case, abut takes N files with K lines and produces 1 file with K lines. Suppose the files height and weight contain the respective heights and weights of the same people. Each line in each file contains one height or weight. These could be plotted with the plotting option on the pair program with the following command.

abut height weight | pair -p

Manual for abut

4.4 Data Extraction

dm: conditional data extraction
dm can extract subsets of its input, either by columns or by lines. To extract columns of data, each extracted column is specified with the number of the column preceded by the letter s. The following extracts columns 8, 2, and 11, in that order.

dm s8 s2 s11

dm can extract lines of data by using its built-in line skipping expression SKIP. The following will extract lines 50 to 100.

dm "if INLINE >= 50 & INLINE <= 100 then INPUT else SKIP"

Line extraction with dm is more awkward than column extraction, but column extraction is the more common task.

Manual for dm

colex: quick column extraction
colex can extract individual columns, or ranges of columns. For column extraction, it is easier to use and faster than dm. The following extracts, in order, columns 8, 2, and 11.

colex  8  2  11

Manual for colex

linex: line extraction
linex can extract individual lines (by number), or ranges of lines. The following extracts, in order, lines 8, 2, and 11.

linex  8  2  11

To extract lines 50 to 100, you could type:

linex  50-100

or you could even extract them in reverse order:

linex  100-50
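
Single line numbers, ranges, and reversed ranges can all be handled by one small loop. The Python sketch below is an illustration of that idea with a hypothetical function name; skipping line numbers beyond the end of the input is an assumption of the sketch, not documented linex behavior.

```python
def extract_lines(lines, specs):
    """Pick lines by number or range, in the order requested (illustrative only)."""
    picked = []
    for spec in specs:
        first, _, last = spec.partition("-")
        low = int(first)
        high = int(last) if last else low
        step = 1 if high >= low else -1      # a range such as 100-50 counts down
        for n in range(low, high + step, step):
            if 1 <= n <= len(lines):         # assumption: out-of-range numbers are skipped
                picked.append(lines[n - 1])
    return picked

lines = [f"line {i}" for i in range(1, 121)]
print(extract_lines(lines, ["8", "2", "11"]))
print(extract_lines(lines, ["100-50"])[:3])
```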

Manual for linex

4.5 Data Validation

validata: data validation
validata will report for its input the number of columns, data-types of columns, and for columns with numerical values, the maxima and minima. validata reports any inconsistencies in the number of columns in its input. Floating point numbers can be entered in scientific notation. For the input:

1   2   3
4   5   6
7   2E2 end
5       1e-3

validata's output is:

validata: Variable number of columns at line 4
Col   N  NA alnum alpha   int float other  type   min   max
  1   4   0     4     0     4     4     0   int     1     7
  2   4   0     3     0     2     4     0 float 0.001   200
  3   3   0     3     1     2     2     0 alnum     3     6
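
Part of this report, the column counts and the minima and maxima of numerical values, can be reproduced with a short Python sketch. It uses Python's float parsing to decide what counts as a number, which is an assumption and may not match validata's rules in every case; the function names are hypothetical and the data-type tallies are left out.

```python
def is_number(field):
    """True if Python can parse the field as a float (2E2 and 1e-3 qualify)."""
    try:
        float(field)
        return True
    except ValueError:
        return False

def validate(text):
    """Return (lines with an unexpected column count, per-column statistics)."""
    expected = None
    irregular = []
    stats = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue
        if expected is None:
            expected = len(fields)          # the first line sets the expectation
        elif len(fields) != expected:
            irregular.append(line_no)
        for col, field in enumerate(fields, start=1):
            entry = stats.setdefault(col, {"n": 0, "min": None, "max": None})
            entry["n"] += 1
            if is_number(field):
                value = float(field)
                entry["min"] = value if entry["min"] is None else min(entry["min"], value)
                entry["max"] = value if entry["max"] is None else max(entry["max"], value)
    return irregular, stats

sample = "1   2   3\n4   5   6\n7   2E2 end\n5       1e-3\n"
irregular, stats = validate(sample)
print("irregular lines:", irregular)
for col in sorted(stats):
    print(col, stats[col])
```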

Manual for validata

dm: conditional data validation
dm can find exceptional cases in its input. A simple case is non-numerical input, which can be checked with dm's number function.

dm  "if !number(s1) then 'bad input on line' else SKIP"  INLINE

dm can check for specific values, ranges of values, or specific relations of values. The following prints all lines in data with the string bad in them.

dm "if 'bad' C INPUT then INPUT else SKIP"

The input line number could be prepended.

dm INLINE "if 'bad' C INPUT then INPUT else SKIP"

This is possible because dm will produce no output for skipped lines, regardless of expression order. The following prints all lines where column 3 is greater than column 2.

dm "if x3 > x2 then INPUT else SKIP"

dm can print lengths of strings and check for numerical fields:

dm  len(s1)  number(s1)

will print the length of column 1 strings, and report if they are numerical (0 for non-numbers, 1 for integers, 2 for real numbers, 3 for exponential scientific notation numbers).
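
The four return codes can be imitated with regular expressions. The patterns in this Python sketch are guesses at what counts as an integer, a real number, and scientific notation; the exact syntax that dm's number function accepts is not given in this chapter, so treat the boundaries as assumptions.

```python
import re

# Assumed patterns; dm's own rules may differ at the edges.
INTEGER = re.compile(r"[+-]?\d+")
REAL = re.compile(r"[+-]?(\d+\.\d*|\.\d+)")
EXPONENTIAL = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)[eE][+-]?\d+")

def number(field):
    """Return 0 for non-numbers, 1 for integers, 2 for reals, 3 for exponentials."""
    if INTEGER.fullmatch(field):
        return 1
    if REAL.fullmatch(field):
        return 2
    if EXPONENTIAL.fullmatch(field):
        return 3
    return 0

for field in ["12", "4.5", "1e-3", "2E2", "end"]:
    print(field, len(field), number(field))   # like dm s1 len(s1) number(s1)
```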
