Calculate principal components of mixed-type data

Nancy Chelaru-Centea

Aug. 10, 2019


1. Introduction

Many of the datasets a data scientist encounters in the real world contain both numerical and categorical variables. Factor analysis of mixed data (FAMD) is a principal component method that combines principal component analysis (PCA) for the continuous variables with multiple correspondence analysis (MCA) for the categorical ones. To learn more about FAMD, see an excellent tutorial using the FactoMineR package.
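To make that combination concrete, here is a minimal Python sketch of the idea behind FAMD (a toy illustration, not any package's exact implementation): numeric columns are standardized as in PCA, one-hot indicator columns are centred and scaled by the square root of each category's proportion as in MCA, and an ordinary PCA is then run on the combined matrix.

import numpy as np
import pandas as pd

def famd_sketch(df, n_components=2):
    ## Split the mixed data frame into numeric and categorical parts
    num = df.select_dtypes(include='number')
    cat = df.select_dtypes(exclude='number')

    ## Quantitative part: z-score each column (PCA-style standardization)
    z_num = (num - num.mean()) / num.std(ddof=0)

    ## Qualitative part: one-hot encode, centre, and divide each indicator
    ## by sqrt(p_k), where p_k is the proportion of rows in category k (MCA-style)
    dummies = pd.get_dummies(cat).astype(float)
    p = dummies.mean()
    z_cat = (dummies - p) / np.sqrt(p)

    ## Ordinary PCA (via SVD) on the combined, centred matrix
    z = pd.concat([z_num, z_cat], axis=1).to_numpy()
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    eigenvalues = s ** 2 / len(df)       ## variance carried by each dimension
    scores = u[:, :n_components] * s[:n_components]
    return eigenvalues, scores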

I will use three commonly used packages, two in R (FactoMineR and PCAmixdata) and one in Python (prince), to perform FAMD on the Telco customer churn dataset and gain insight into the relationships between various aspects of customer behaviour. This will be a toy example of how FAMD can be used to derive actionable business insights in the real world.

2. Import and pre-process data

Here, I will import the cleaned Telco dataset in both R and Python.

As Calculated_TotalCharges is highly correlated with tenure and MonthlyCharges, it will be excluded from the analysis.

Also, all three packages automatically normalize the numerical variables, so I will not do so beforehand.

2.1 In R

## Import the cleaned Telco dataset
df <- read.csv('https://github.com/nd823/data-cleaning/raw/master/telco_cleaned_Jun13.csv')

## Drop the redundant Calculated_TotalCharges column
df <- within(df, rm('Calculated_TotalCharges'))

2.2 In Python

## Import library
import pandas as pd

## Import the cleaned Telco dataset
df = pd.read_csv('https://github.com/nd823/data-cleaning/raw/master/telco_cleaned_Jun13.csv')

## Drop the redundant Calculated_TotalCharges column
df.drop('Calculated_TotalCharges', axis=1, inplace=True)
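As a quick sanity check of the correlation claim above, here is a pandas sketch (re-reading the raw file, since the column has just been dropped from df; column names as referenced earlier):

## Confirm that Calculated_TotalCharges is highly correlated with tenure and MonthlyCharges
raw = pd.read_csv('https://github.com/nd823/data-cleaning/raw/master/telco_cleaned_Jun13.csv')
print(raw[['tenure', 'MonthlyCharges', 'Calculated_TotalCharges']].corr())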

3. Factor analysis of mixed data (FAMD)

3.1 FactoMineR (R package)

FactoMineR provides a variety of functions for PCA, correspondence analysis (CA), multiple correspondence analysis (MCA) and FAMD.

See the CRAN documentation for FactoMineR.

## Import libraries
library(FactoMineR)
library(factoextra)

## FAMD
res.famd <- FAMD(df, 
                 sup.var = 19,  ## Set the target variable "Churn" as a supplementary variable, so it is not included in the analysis for now
                 graph = FALSE, 
                 ncp=25)

## Inspect principal components
get_eigenvalue(res.famd)
A matrix: 22 × 3 of type dbl

           eigenvalue  variance.percent  cumulative.variance.percent
Dim.1    4.5098513842       20.49932447                     20.49932
Dim.2    2.8024012638       12.73818756                     33.23751
Dim.3    1.8247464408        8.29430200                     41.53181
Dim.4    1.1490819680        5.22309985                     46.75491
Dim.5    1.0495900326        4.77086378                     51.52578
Dim.6    1.0144020899        4.61091859                     56.13670
Dim.7    0.9993340111        4.54242732                     60.67912
Dim.8    0.9842737776        4.47397172                     65.15310
Dim.9    0.9039827483        4.10901249                     69.26211
Dim.10   0.8465510646        3.84795938                     73.11007
Dim.11   0.7634513562        3.47023344                     76.58030
Dim.12   0.7069736613        3.21351664                     79.79382
Dim.13   0.6743784465        3.06535658                     82.85917
Dim.14   0.6215863186        2.82539236                     85.68457
Dim.15   0.6096204549        2.77100207                     88.45557
Dim.16   0.5972640853        2.71483675                     91.17041
Dim.17   0.4783653418        2.17438792                     93.34479
Dim.18   0.4652041236        2.11456420                     95.45936
Dim.19   0.4628728877        2.10396767                     97.56332
Dim.20   0.3061815235        1.39173420                     98.95506
Dim.21   0.2291355725        1.04152533                     99.99658
Dim.22   0.0007514474        0.00341567                    100.00000

To inspect the results in further detail, use the summary(res.famd) and print(res.famd) functions.

3.2 PCAmixdata (R package)

According to its authors, PCAmixdata is "dedicated to multivariate analysis of mixed data where observations are described by a mixture of numerical and categorical variables" (Chavent et al., 2017). As we will see in part 2 of this series, PCAmixdata provides a very useful function for performing (a generalized form of) varimax rotation that aids in interpreting the principal components identified.
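The rotation itself will appear in part 2; for intuition in the meantime, here is a sketch of plain varimax in Python (the classic algorithm, not PCAmixdata's generalized mixed-data version), which seeks an orthogonal rotation of the loading matrix so that each variable loads strongly on as few components as possible:

import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    ## Iteratively find the orthogonal rotation that maximizes
    ## the variance of the squared loadings
    p, k = loadings.shape
    rotation = np.eye(k)
    var = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        u, s, vt = np.linalg.svd(
            loadings.T @ (rotated ** 3
                          - (gamma / p) * rotated @ np.diag((rotated ** 2).sum(axis=0))))
        rotation = u @ vt
        if s.sum() < var * (1 + tol):   ## stop once the criterion plateaus
            break
        var = s.sum()
    return loadings @ rotation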

See the CRAN documentation for PCAmixdata.

## Import library
library(PCAmixdata)

## Split mixed dataset into quantitative and qualitative variables
split <- splitmix(df[1:18])  ## For now excluding the target variable "Churn", which will be added back later as a supplementary variable

## FAMD
res.pcamix <- PCAmix(X.quanti=split$X.quanti,  
                     X.quali=split$X.quali, 
                     rename.level=TRUE, 
                     graph=FALSE, 
                     ndim=25)

## Inspect principal components
res.pcamix$eig
A matrix: 22 × 3 of type dbl

          Eigenvalue   Proportion  Cumulative
dim 1   4.5098513842  20.49932447    20.49932
dim 2   2.8024012638  12.73818756    33.23751
dim 3   1.8247464408   8.29430200    41.53181
dim 4   1.1490819680   5.22309985    46.75491
dim 5   1.0495900326   4.77086378    51.52578
dim 6   1.0144020899   4.61091859    56.13670
dim 7   0.9993340111   4.54242732    60.67912
dim 8   0.9842737776   4.47397172    65.15310
dim 9   0.9039827483   4.10901249    69.26211
dim 10  0.8465510646   3.84795938    73.11007
dim 11  0.7634513562   3.47023344    76.58030
dim 12  0.7069736613   3.21351664    79.79382
dim 13  0.6743784465   3.06535658    82.85917
dim 14  0.6215863186   2.82539236    85.68457
dim 15  0.6096204549   2.77100207    88.45557
dim 16  0.5972640853   2.71483675    91.17041
dim 17  0.4783653418   2.17438792    93.34479
dim 18  0.4652041236   2.11456420    95.45936
dim 19  0.4628728877   2.10396767    97.56332
dim 20  0.3061815235   1.39173420    98.95506
dim 21  0.2291355725   1.04152533    99.99658
dim 22  0.0007514474   0.00341567   100.00000

Similarly, to inspect the results in further detail, use the summary(res.pcamix) and print(res.pcamix) functions.

Thus far, we see that the results from FactoMineR and PCAmixdata are identical.

A little background: for standardized data, an eigenvalue > 1 indicates that a principal component (PC) accounts for more variance than any single one of the original variables does. This is commonly used as a cutoff for deciding which PCs to retain.

Interestingly, by this criterion only the first six PCs account for more variance than an original variable (the fifth and sixth only barely clear the cutoff), and together they account for just 56.1% of the total variance in the dataset. This suggests that the patterns between the variables are likely non-linear and complex.
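As a minimal illustration in Python (eigenvalues copied from the output above; the total inertia, i.e. the sum of all 22 eigenvalues, is 22):

## Apply the eigenvalue > 1 cutoff to the values reported above
eigenvalues = [4.5099, 2.8024, 1.8247, 1.1491, 1.0496, 1.0144, 0.9993,
               0.9843, 0.9040, 0.8466]   ## first 10 of the 22 eigenvalues

retained = [e for e in eigenvalues if e > 1]
print(len(retained))                     ## 6 components pass the cutoff
print(100 * sum(retained) / 22)          ## ~56.1% of the total variance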

3.3 prince (Python package)

Like FactoMineR, prince can be used to perform a variety of factor analyses on purely numerical, purely categorical, or mixed-type datasets. Implemented in Python, it uses the familiar scikit-learn API.

Unlike the two R packages above, prince does not seem to offer an option for adding supplementary variables after FAMD (a possible workaround is sketched below).

For more detailed documentation, see the GitHub repo.

## Import library
import prince

## Instantiate FAMD object
famd = prince.FAMD(
     n_components=25,
     n_iter=10,
     copy=True,
     check_input=True,
     engine='auto',       ## Can be 'auto', 'sklearn', or 'fbpca'
     random_state=42)

## Fit FAMD object to data 
famd = famd.fit(df.drop('Churn', axis=1)) ## Excluding the target variable "Churn"
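Since there is no supplementary-variable option, one possible workaround (a sketch, assuming the fitted famd object above and prince's row_coordinates method) is to compute the row coordinates and summarize them by the held-out target afterwards:

## Project the observations, then compare Churn groups post hoc
coords = famd.row_coordinates(df.drop('Churn', axis=1))
print(coords.groupby(df['Churn']).mean().iloc[:, :2])   ## group means on the first two dimensions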

## Inspect principal dimensions
famd.explained_inertia_            
[0.5374654125558428,
 0.08801867828903967,
 0.05728512443689753,
 0.03937333023240265,
 0.03127877295810108,
 0.02791225157585326,
 0.024703045349880843,
 0.02080730170414231,
 0.01893723097468581,
 0.01800539646894817,
 0.016560227742665183,
 0.015976757070266752,
 0.014945093457876387,
 0.013999456118270725,
 0.013763376399511602,
 0.013589916692549968,
 0.012208295034160321,
 0.011979368941411946,
 0.01133988994746592,
 0.007002655988383323,
 0.004847867736269992,
 5.485296596458002e-07,
 1.7957132325092428e-09,
 5.366065875277613e-33,
 5.366065875277613e-33]

Surprisingly, the results here differ greatly from the ones above. From my preliminary reading, "explained inertia" is synonymous with "explained variance", so a difference in terminology is unlikely to explain the discrepancy. I will keep digging, but as you will see in later parts of this series, FAMD performed using prince reaches nearly identical conclusions to the two R packages.
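To quantify the gap, a quick sketch converting prince's fractions to percentages for comparison with the variance.percent column reported by the R packages:

## explained_inertia_ sums to 1, so scale the values to percentages
pct = [100 * v for v in famd.explained_inertia_]
print([round(x, 1) for x in pct[:4]])    ## ~[53.7, 8.8, 5.7, 3.9] vs 20.5, 12.7, 8.3, 5.2 in R
print(round(sum(pct[:4]), 1))            ## ~72.2% cumulative vs ~46.8% in R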

Finally, if you want to try this out for yourself, please head over to the Intelligence Refinery workspace at Nextjournal to check out the fully interactive notebook.

Til next post! :)