Program DISTPCOA

Pierre Legendre
Département de sciences biologiques
Université de Montréal
C.P. 6128, succursale Centre-ville
Montréal, Québec H3C 3J7, Canada
Marti J. Anderson
Centre for Research on Ecological Impacts of Coastal Cities
and School of Biological Sciences
Marine Ecology Laboratories, A11
University of Sydney
Sydney, NSW 2006, Australia
MJAnders@bio.usyd.edu.au
January 1998

What does program DISTPCOA do?

This program performs Principal Coordinate Analysis (PCoA; Gower 1966) with the option of correcting for negative eigenvalues. This procedure is used as part of the distance-based redundancy analysis method (db-RDA) proposed by Legendre & Anderson (1998a). It may also be used in any other case where one wishes to obtain a full Euclidean representation of a distance matrix. If negative eigenvalues are produced, the correction methods available in this program allow one to obtain a full Euclidean representation in all cases.

The program can either read in a pre-computed distance matrix, or calculate a distance matrix from a raw data table. Five distance functions are available within the program: Bray-Curtis, square root of Bray-Curtis, chi-square, Hellinger, and Euclidean. Descriptions of these distances can be found in Legendre & Legendre (1998), among other texts.

The program uses a Householder procedure to find the eigenvalues and eigenvectors of the Gower-centred matrix derived from the square distance matrix. The subroutines (TRED2, TQLI) are from Chapter 11 of Numerical Recipes (Press et al., 1986).

Negative eigenvalues may be generated during the principal coordinate analysis of semimetric or nonmetric distance measures. For descriptions and comparisons of the properties of various distance measures, see Gower & Legendre (1986) and Legendre & Legendre (1998). For example, the Bray-Curtis distance, which is widely used in ecology with species abundance data and is offered by the program, is a semimetric. Negative eigenvalues may also be produced during the analysis of some metric distances which do not guarantee a full Euclidean representation, as shown by Gower & Legendre (1986); see also Legendre & Legendre (1998, Table 7.2). The problem with negative eigenvalues is that the corresponding ordination axes are imaginary, since the length of each axis is the square root of its eigenvalue.

Corrections for negative eigenvalues may be obtained using two methods:
  1. Lingoes method: d’(i,j)^2 = d(i,j)^2 + 2*c1 for i ≠ j, where c1 is the absolute value of the largest negative eigenvalue (the negative eigenvalue largest in absolute value) of the first PCoA run. Note that d’(i,i) = d(i,i) = 0: the constant is added to off-diagonal values only.
  2. Cailliez method: d’(i,j) = d(i,j) + c2 for i ≠ j, where c2 is the largest eigenvalue of a special non-symmetric matrix. Note that d’(i,i) = d(i,i) = 0. The eigenvalues of the special matrix are found using a QR algorithm for real Hessenberg matrices. The subroutines (BALANC, ELMHES, and HQR) are from Chapter 11 of Numerical Recipes (Press et al., 1986).
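The two correction formulas above can be sketched in a few lines of Python. This is only an illustration, not the program's own code: the function names `lingoes` and `cailliez` are hypothetical, the toy distance matrix is made up, and the constant 0.5 is an arbitrary stand-in for c1 or c2, which the program derives from eigenvalues.

```python
import math

def lingoes(dist, c1):
    """Method 1: d'(i,j) = sqrt(d(i,j)^2 + 2*c1) for i != j; the diagonal stays 0."""
    n = len(dist)
    return [[0.0 if i == j else math.sqrt(dist[i][j] ** 2 + 2.0 * c1)
             for j in range(n)] for i in range(n)]

def cailliez(dist, c2):
    """Method 2: d'(i,j) = d(i,j) + c2 for i != j; the diagonal stays 0."""
    n = len(dist)
    return [[0.0 if i == j else dist[i][j] + c2
             for j in range(n)] for i in range(n)]

# Made-up 3 x 3 distance matrix, for illustration only.
d = [[0.0, 1.0, 3.0],
     [1.0, 0.0, 1.0],
     [3.0, 1.0, 0.0]]

print(lingoes(d, 0.5)[0][1])   # sqrt(1**2 + 2*0.5) = sqrt(2)
print(cailliez(d, 0.5)[0][2])  # 3 + 0.5
```

In both functions the constant is applied to off-diagonal values only, so each object remains at distance zero from itself.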
The metric measures offered by the program (square-root-transformed Bray-Curtis, chi-square, Hellinger and Euclidean distances) produce no negative eigenvalues in PCoA, so no correction is needed. Methods 1 and 2 are described in Gower & Legendre (1986, theorem 7), in Legendre & Anderson (1998a), and in Legendre & Legendre (1998). The fact that square-root-transformed Bray-Curtis distances produce no negative eigenvalues in PCoA is substantiated in Legendre & Anderson (1998a).

For use with db-RDA, Legendre & Anderson (1998a) have shown that correction method 1 does not affect the permutation test of the analysis-of-variance statistic. They therefore recommend correction method 1 in that context.

Input files

The input data file is an ASCII text file.
  1. It may contain a raw data table, where objects (i.e. sites, replicates, etc.) are rows and variables (i.e. species or other descriptors) are columns. There is no identifier of any sort at the beginning of the file. Neither columns nor rows should have any labels whatsoever. The program asks the user how many objects and variables there are before reading the file.
  2. It may contain a square distance or similarity matrix computed using some other program; the diagonal is included in the matrix. There is no identifier of any sort at the beginning of the file or at the beginning of the rows. The only values in the file must be the distances or similarities themselves. The program asks the user how many objects there are in the matrix before reading the file.
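Both layouts described above are label-free tables of numbers, so a file can be checked before it is handed to the program. The following is a minimal sketch under the assumption that values are separated by spaces or line breaks; the manual does not state the exact format the program expects, and the helper name `read_table` is hypothetical.

```python
def read_table(text, n_rows, n_cols):
    """Parse whitespace-separated numbers into n_rows lists of n_cols floats."""
    values = [float(tok) for tok in text.split()]
    if len(values) != n_rows * n_cols:
        raise ValueError(
            f"expected {n_rows * n_cols} values, found {len(values)}")
    return [values[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]

# Raw data table: 3 objects (rows) by 3 variables (columns), no labels.
raw = read_table("3 4 5\n3 2 5\n3 6 4\n", n_rows=3, n_cols=3)

# Square distance matrix with its diagonal included (made-up values).
dist = read_table("0 0.2 0.5\n0.2 0 0.4\n0.5 0.4 0\n", n_rows=3, n_cols=3)

print(raw[1])      # second object
print(dist[2][1])  # distance between objects 3 and 2
```

A stray row or column label makes `float` fail, and a wrong object count raises `ValueError`, which catches the two most likely formatting mistakes.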
Options of the program

The following choices are offered by the program:
  1. Input data file: a square distance or similarity matrix, or a raw data file.
  2. A variety of preliminary data transformations are available for the analysis of raw data files, if desired: square root (i.e. y’ = y^(1/2)), double square root (i.e. y’ = y^(1/4)), as well as four logarithmic transformations: y’ = ln(y), y’ = ln(y + 1), y’ = log10(y), and y’ = log10(y + 1).
  3. For raw input data files, users may choose to compute one of the following distances:
    • Bray-Curtis distance
    • sqrt(Bray-Curtis distance)
    • Chi-square distance
    • Hellinger distance
    • Euclidean distance
  4. Correction for negative eigenvalues:
    • Method 1 (Lingoes)
    • Method 2 (Cailliez)
    • No correction
If a metric distance measure has been chosen, no correction is necessary and the program applies none, regardless of the choice made here. Choosing “No correction” changes the analysis only if negative eigenvalues are produced (for example, with Bray-Curtis distances). In that case, the eigenvectors corresponding to the negative eigenvalues are ignored and only the coordinates corresponding to the positive eigenvalues are output.

Output files

The run dialogue as well as the uncorrected and corrected eigenvalues are given in the dialogue window. The principal coordinates (scaled eigenvectors) are written to a separate output file called PCOORD.TXT. If a correction for negative eigenvalues has been applied, the coordinates in the output file are those obtained from the corrected distance matrix. The rows of this file correspond to the objects and the columns are the coordinates (i.e. variables) in the new system of axes. This file can be used directly as input to other data analysis programs. For users of the db-RDA procedure, in particular, this file may become the “Species” matrix of a redundancy analysis using the CANOCO program.

Disclaimer

This program is provided without any explicit or implicit warranty of correct functioning. It has been developed as part of a university-based research program. If, however, you should encounter problems with this program, the authors will be happy to help solve them. Researchers may use this program for scientific purposes, but the source code remains the property of Pierre Legendre and Marti J. Anderson. Publications should give proper credit to the method by referring to the Legendre & Anderson (1998a) paper. Users of the program may refer to the present user’s manual as follows:
Legendre, P. & M. J. Anderson. 1998b. Program DISTPCOA. Département de sciences biologiques, Université de Montréal. 10 pages.
Technical notes

The program is distributed in a variety of forms:

References
Gower, J. C. 1966. Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53: 325-338.
Gower, J. C. & P. Legendre. 1986. Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification 3: 5-48.
Legendre, P. & M. J. Anderson. 1999. Distance-based redundancy analysis: testing multi-species responses in multi-factorial ecological experiments. Ecological Monographs 69 (1): 1-24.
Legendre, P. & L. Legendre. 1998. Numerical Ecology, 2nd English edition. Elsevier Science BV, Amsterdam. xv + 853 pages.
Press, W. H., B. P. Flannery, S. A. Teukolsky & W. T. Vetterling. 1986. Numerical recipes - The art of scientific computing. Cambridge Univ. Press, Cambridge. xx + 818 p.
Appendix: Test runs

Consider the following input data matrix, called “test, 7x3”. It has 7 rows (sites) and 3 columns (species):
3 4 5
3 2 5
3 6 4
7 5 7
6 8 9
3 6 3
4 5 7
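For readers who want to check distances by hand, here is a minimal Python sketch of four of the distance functions applied to the test matrix. The formulas are the usual textbook definitions and are assumed, not confirmed, to match the program's implementation; the function names are hypothetical. The chi-square distance is left out because its exact scaling in the program is not stated in this manual.

```python
import math

def bray_curtis(a, b):
    """Sum of absolute differences divided by the sum of all abundances."""
    return sum(abs(x - y) for x, y in zip(a, b)) / (sum(a) + sum(b))

def sqrt_bray_curtis(a, b):
    return math.sqrt(bray_curtis(a, b))

def hellinger(a, b):
    """Euclidean distance between square-rooted relative abundances."""
    sa, sb = sum(a), sum(b)
    return math.sqrt(sum((math.sqrt(x / sa) - math.sqrt(y / sb)) ** 2
                         for x, y in zip(a, b)))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# The 7 x 3 test matrix (sites in rows, species in columns).
data = [[3, 4, 5], [3, 2, 5], [3, 6, 4], [7, 5, 7],
        [6, 8, 9], [3, 6, 3], [4, 5, 7]]

# Sites 1 and 2 differ only in the second species (4 versus 2).
print(bray_curtis(data[0], data[1]))       # 2 / 22
print(sqrt_bray_curtis(data[0], data[1]))
print(hellinger(data[0], data[1]))
print(euclidean(data[0], data[1]))         # 2.0
```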

The output in the dialogue window is the following, using the Bray-Curtis distance and correction method 1.
Principal coordinate analysis
with correction for negative eigenvalues, if any.
Maximum size of matrix:  400 objects and descriptors

Do you have a file with (1) a square Distance or Similarity matrix, or
                        (2) raw data    ?
(Type -1 or -2 to get intermediate matrices printed.)
2
Name of input file with raw data?
(in which columns are variables and rows are replicates)
Input file name (raw data): test,7x3

How many objects?
7

How many variables?
3

Transform the raw data before computing distances?
   (0) No transformation
   (1) y’ = sqrt(y),        i.e. y’ = y^0.5
   (2) y’ = double sqrt(y), i.e. y’ = y^0.25
   (3) y’ = ln(y)
   (4) y’ = ln(y + 1)
   (5) y’ = log10(y)
   (6) y’ = log10(y + 1)
0

Options: (1) Bray-Curtis distance
         (2) sqrt(Bray-Curtis)
         (3) Chi-square distance
         (4) Hellinger distance
         (5) Euclidean distance
1

Correction for negative eigenvalues, if any: 
1) Method 1 (Lingoes):  d’(i,j) = sqrt(d(i,j)**2 + 2*c1)
2) Method 2 (Cailliez): d’(i,j) = d(i,j) + c2
3) No correction: yields coordinates corresponding
   to positive eigenvalues only
1
18:02:17  


*** Results of PCoA on the original distance matrix ***

Trace of Gower-centred matrix =            0.15814

PCoA eigenvalues
   0.10936   0.04657   0.00673   0.00017   0.00000  -0.00152  -0.00318

The largest negative eigenvalue is  -0.0031792355

Sum of computed eigenvalues =              0.15814


*** Results of PcoA on corrected distance matrix ***

Trace of Gower-centred matrix =            0.17721

PCoA eigenvalues
   0.11254   0.04975   0.00991   0.00335   0.00166   0.00000   0.00000

Sum of computed eigenvalues =              0.17721

The number of non-zero eigenvalues is:           5

Non-zero Principal coordinates
have been written to output file: “Pcoord.txt”

18:02:18  
Real time spent:      0.13 seconds

End of program.
File PCOORD.TXT contains the new coordinates of the 7 sites in 5 dimensions:
   -0.09732   0.03677   0.01757  -0.00996  -0.02045
   -0.16516   0.11596  -0.03876   0.00110   0.00089
   -0.06308  -0.08861  -0.02175  -0.01839   0.02695
    0.13589   0.06345   0.05297  -0.02800   0.00535
    0.21189  -0.02103  -0.05983   0.00077  -0.01204
   -0.07513  -0.14534   0.02514   0.00929  -0.01342
    0.05291   0.03880   0.02464   0.04518   0.01272

For Bray-Curtis distance and correction method 2, the output in the dialogue window is the following.
Principal coordinate analysis
with correction for negative eigenvalues, if any.
Maximum size of matrix:  400 objects and descriptors

Do you have a file with (1) a square Distance or Similarity matrix, or
                        (2) raw data    ?
(Type -1 or -2 to get intermediate matrices printed.)
2
Name of input file with raw data?
(in which columns are variables and rows are replicates)
Input file name (raw data): test,7x3

How many objects?
7

How many variables?
3

Transform the raw data before computing distances?
   (0) No transformation
   (1) y’ = sqrt(y),        i.e. y’ = y^0.5
   (2) y’ = double sqrt(y), i.e. y’ = y^0.25
   (3) y’ = ln(y)
   (4) y’ = ln(y + 1)
   (5) y’ = log10(y)
   (6) y’ = log10(y + 1)
0

Options: (1) Bray-Curtis distance
         (2) sqrt(Bray-Curtis)
         (3) Chi-square distance
         (4) Hellinger distance
         (5) Euclidean distance
1
Correction for negative eigenvalues, if any: 
1) Method 1 (Lingoes):  d’(i,j) = sqrt(d(i,j)**2 + 2*c1)
2) Method 2 (Cailliez): d’(i,j) = d(i,j) + c2
3) No correction: yields coordinates corresponding
   to positive eigenvalues only
2
18:10:21  


*** Results of PCoA on the original distance matrix ***

Trace of Gower-centred matrix =            0.15814

PCoA eigenvalues
   0.10936   0.04657   0.00673   0.00017   0.00000  -0.00152  -0.00318

Sum of computed eigenvalues =              0.15814

*** Create Special matrix and find its largest eigenvalue ***

The largest eigenvalue of the Special matrix is   0.0380438751

*** Results of PcoA on corrected distance matrix ***

Trace of Gower-centred matrix =            0.21088

PCoA eigenvalues
   0.13191   0.06090   0.01325   0.00351   0.00131   0.00000   0.00000

Sum of computed eigenvalues =              0.21088

The number of non-zero eigenvalues is:           5

Non-zero Principal coordinates
have been written to output file: “Pcoord.txt”

18:10:21  
Real time spent:      0.15 seconds

End of program.
File PCOORD.TXT contains the new coordinates of the 7 sites in 5 dimensions:
   -0.10669   0.04391  -0.01393   0.01163  -0.02278
   -0.17486   0.13057   0.04492  -0.00032   0.00661
   -0.07177  -0.10046   0.01498   0.00893   0.02273
    0.14993   0.06591  -0.05857   0.03115   0.00733
    0.22728  -0.02391   0.07344  -0.00071  -0.00801
   -0.08399  -0.15847  -0.02214  -0.00253  -0.00993
    0.06009   0.04243  -0.03869  -0.04815   0.00405
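The corrected trace of this run can be checked without any centring, because the trace of the Gower-centred matrix equals the sum of squared distances over all pairs of objects divided by n. The Python sketch below uses that identity; the Bray-Curtis formula is the usual one and c2 is copied from the dialogue above, not recomputed from the special matrix.

```python
def bray_curtis(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / (sum(a) + sum(b))

data = [[3, 4, 5], [3, 2, 5], [3, 6, 4], [7, 5, 7],
        [6, 8, 9], [3, 6, 3], [4, 5, 7]]
n = len(data)
pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]

c2 = 0.0380438751  # largest eigenvalue of the special matrix (from the dialogue)

# Trace of the Gower-centred matrix = (1/n) * sum over pairs of d(i,j)^2.
trace_original = sum(bray_curtis(data[i], data[j]) ** 2
                     for i, j in pairs) / n
trace_corrected = sum((bray_curtis(data[i], data[j]) + c2) ** 2
                      for i, j in pairs) / n

print(f"{trace_original:.5f}")   # dialogue reports 0.15814
print(f"{trace_corrected:.5f}")  # dialogue reports 0.21088
```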

For Bray-Curtis distance without any correction for negative eigenvalues, the output in the dialogue window is the following.
Principal coordinate analysis
with correction for negative eigenvalues, if any.
Maximum size of matrix:  400 objects and descriptors

Do you have a file with (1) a square Distance or Similarity matrix, or
                        (2) raw data    ?
(Type -1 or -2 to get intermediate matrices printed.)
2
Name of input file with raw data?
(in which columns are variables and rows are replicates)
Input file name (raw data): test,7x3

How many objects?
7

How many variables?
3

Transform the raw data before computing distances?
   (0) No transformation
   (1) y’ = sqrt(y),        i.e. y’ = y^0.5
   (2) y’ = double sqrt(y), i.e. y’ = y^0.25
   (3) y’ = ln(y)
   (4) y’ = ln(y + 1)
   (5) y’ = log10(y)
   (6) y’ = log10(y + 1)
0

Options: (1) Bray-Curtis distance
         (2) sqrt(Bray-Curtis)
         (3) Chi-square distance
         (4) Hellinger distance
         (5) Euclidean distance
1
Correction for negative eigenvalues, if any: 
1) Method 1 (Lingoes):  d’(i,j) = sqrt(d(i,j)**2 + 2*c1)
2) Method 2 (Cailliez): d’(i,j) = d(i,j) + c2
3) No correction: yields coordinates corresponding
   to positive eigenvalues only
3
18:13:09  


*** Results of PCoA on the original distance matrix ***

Trace of Gower-centred matrix =            0.15814

PCoA eigenvalues
   0.10936   0.04657   0.00673   0.00017   0.00000  -0.00152  -0.00318
The negative eigenvalues, if any,
are being ignored in this analysis.

Sum of computed eigenvalues =              0.15814

The number of positive eigenvalues is:           4

Principal coordinates corresponding
to positive eigenvalues only
have been written to output file: “Pcoord.txt”

18:13:09  
Real time spent:      0.08 seconds

End of program. 
File PCOORD.TXT contains the new coordinates of the 7 sites in the 4 dimensions corresponding to the positive eigenvalues:
   -0.09594   0.03558  -0.01448   0.00225
   -0.16281   0.11219   0.03194  -0.00025
   -0.06219  -0.08574   0.01792   0.00416
    0.13396   0.06139  -0.04366   0.00633
    0.20888  -0.02034   0.04930  -0.00017
   -0.07406  -0.14062  -0.02072  -0.00210
    0.05216   0.03754  -0.02031  -0.01022
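The eigenvalues of the uncorrected analysis, and the count of positive ones, can be checked with the Python sketch below. Note that it swaps in cyclic Jacobi rotations for the Householder and QL routines (TRED2, TQLI) used by the program, simply because Jacobi is short to write; all function names are hypothetical, and the tolerance used to decide that an eigenvalue is positive is an assumption.

```python
import math

def bray_curtis(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / (sum(a) + sum(b))

def gower_centre(dist):
    """Double-centre the matrix of -d(i,j)^2 / 2."""
    n = len(dist)
    a = [[-0.5 * dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(a[i]) / n for i in range(n)]
    col = [sum(a[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row) / n
    return [[a[i][j] - row[i] - col[j] + grand for j in range(n)]
            for i in range(n)]

def jacobi_eigenvalues(m, sweeps=50, tol=1e-24):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations."""
    n = len(m)
    a = [r[:] for r in m]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-30:
                    continue
                theta = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                t = math.copysign(1.0, theta) / (abs(theta) + math.hypot(theta, 1.0))
                c = 1.0 / math.sqrt(t * t + 1.0)
                s = t * c
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)

data = [[3, 4, 5], [3, 2, 5], [3, 6, 4], [7, 5, 7],
        [6, 8, 9], [3, 6, 3], [4, 5, 7]]
n = len(data)
dist = [[bray_curtis(data[i], data[j]) for j in range(n)] for i in range(n)]

eigenvalues = jacobi_eigenvalues(gower_centre(dist))
n_positive = sum(1 for ev in eigenvalues if ev > 1e-9)

print(" ".join(f"{ev:.5f}" for ev in eigenvalues))  # compare with the dialogue
print(n_positive)                                   # dialogue reports 4
```

The principal coordinates themselves are the eigenvectors scaled by the square roots of their eigenvalues; this sketch stops at the eigenvalues to stay short.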