Factor analysis of mixed data (FAMD) in R and Python - Part 1

Nancy Chelaru-Centea

June 13, 2019


Introduction

  • Factor analysis of mixed data (FAMD) is a principal component method for exploring data with both continuous and categorical variables
    • Roughly a combination of principal component analysis (PCA) for continuous variables and multiple correspondence analysis (MCA) for categorical variables
      • Continuous variables are scaled to unit variance
      • Categorical variables are transformed into a disjunctive data table and scaled
  • FAMD can be used to study relationships between individual data points and between all variables
  • To learn more about FAMD, see an excellent tutorial using the FactoMineR package
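
The two scaling steps listed above can be sketched in plain Python. This is only a minimal illustration of the weighting usually described for FAMD (indicator columns divided by the square root of the level proportion), not the code that any of the packages below actually runs. The function names and the toy values are made up for the example.

```python
from math import sqrt
from statistics import mean, pstdev, pvariance


def famd_transform(quant, qual):
    """Build the columns that a FAMD-style PCA would be run on.

    quant: dict of name -> list of floats (continuous variables)
    qual:  dict of name -> list of labels (categorical variables)
    """
    columns = {}
    # Continuous variables: centre and scale to unit variance.
    for name, values in quant.items():
        mu, sd = mean(values), pstdev(values)
        columns[name] = [(v - mu) / sd for v in values]
    # Categorical variables: one 0/1 indicator column per level (the
    # disjunctive table), divided by sqrt(level proportion), then centred.
    for name, labels in qual.items():
        n = len(labels)
        for level in sorted(set(labels)):
            p = labels.count(level) / n
            scaled = [(1.0 if lab == level else 0.0) / sqrt(p) for lab in labels]
            mu = mean(scaled)
            columns[f"{name}={level}"] = [x - mu for x in scaled]
    return columns


# Hypothetical toy data, not taken from the telco data set.
quant = {
    "tenure": [1.0, 12.0, 24.0, 36.0, 48.0, 60.0],
    "monthly": [29.85, 56.95, 53.85, 42.30, 70.70, 99.65],
}
qual = {
    "contract": ["monthly", "monthly", "yearly", "yearly", "two-year", "monthly"],
    "paperless": ["yes", "no", "yes", "yes", "no", "no"],
}

cols = famd_transform(quant, qual)
# Total variance = (number of continuous variables)
#                + (number of levels - number of categorical variables)
inertia = sum(pvariance(col) for col in cols.values())
print(len(cols), round(inertia, 6))
```

With this weighting, each continuous variable contributes a variance of 1 and each categorical variable contributes its number of levels minus 1, so neither type of variable dominates simply because of its units.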

Import data

## R
df <- read.csv('https://github.com/nd823/data-cleaning/raw/master/telco_cleaned_Jun13.csv')

## Python
import pandas as pd

df = pd.read_csv('https://github.com/nd823/data-cleaning/raw/master/telco_cleaned_Jun13.csv')

Pre-process data

Drop Calculated_TotalCharges column

As Calculated_TotalCharges is highly correlated with tenure and MonthlyCharges, it is excluded from the analysis.
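
As a rough illustration of why such a column is redundant, the sketch below computes Pearson correlations on a handful of made-up rows. It assumes the total is simply tenure multiplied by the monthly charge, which is a guess about how the column was calculated, and the numbers are hypothetical rather than taken from the data set.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical rows, for illustration only.
tenure = [1, 12, 24, 36, 48, 60, 72]
monthly = [30, 55, 70, 45, 90, 65, 100]
# Assumption: the calculated total is tenure * monthly charge.
total = [t * m for t, m in zip(tenure, monthly)]

r_tenure = pearson(tenure, total)
r_monthly = pearson(monthly, total)
print(round(r_tenure, 2), round(r_monthly, 2))
```

A column that is this strongly tied to two others adds little new information and would mostly inflate the weight of what tenure and MonthlyCharges already describe.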

## R
pca_df <- within(df, rm('Calculated_TotalCharges'))

## Python
df.drop('Calculated_TotalCharges', axis=1, inplace=True)

Normalize numerical variables

ind <- sapply(pca_df, is.numeric)

pca_df[ind] <- lapply(pca_df[ind], scale)


Principal component analysis

An eigenvalue > 1 indicates that a principal component (PC) accounts for more variance than any single original variable. N.B. this holds only when the data are standardized, so that each variable has a variance of 1.

This is commonly used as a cutoff for deciding which PCs to retain (the Kaiser criterion).

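In the results below, six dimensions have an eigenvalue above 1, but the fifth and sixth only marginally (1.05 and 1.01). So really only the first four PCs account for clearly more variance than a single original variable, and together they account for only 46.75% of the total variance in the data set.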
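
The cutoff can be checked directly from the eigenvalues. The sketch below copies the 22 eigenvalues reported by FactoMineR in this post and recomputes the percentage of variance per dimension, the cumulative percentage, and the dimensions that pass the eigenvalue > 1 rule. The variable names are only illustrative.

```python
from itertools import accumulate

# Eigenvalues copied from the FactoMineR output in this post.
eigenvalues = [
    4.5098513842, 2.8024012638, 1.8247464408, 1.1490819680,
    1.0495900326, 1.0144020899, 0.9993340111, 0.9842737776,
    0.9039827483, 0.8465510646, 0.7634513562, 0.7069736613,
    0.6743784465, 0.6215863186, 0.6096204549, 0.5972640853,
    0.4783653418, 0.4652041236, 0.4628728877, 0.3061815235,
    0.2291355725, 0.0007514474,
]

total = sum(eigenvalues)  # total variance across all dimensions
variance_percent = [100 * ev / total for ev in eigenvalues]
cumulative_percent = list(accumulate(variance_percent))

# Dimensions (1-based) with an eigenvalue above 1.
retained = [i for i, ev in enumerate(eigenvalues, start=1) if ev > 1]

print("retained dimensions:", retained)
print("cumulative % after 4 dimensions:", round(cumulative_percent[3], 2))
```

The eigenvalues sum to 22, so each percentage is just the eigenvalue divided by 22.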


FactoMineR

The FactoMineR package documentation states that "By default, all quantitative variables are scaled to unit variance."
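
One consequence is that scaling the numerical variables beforehand should be redundant but harmless, because standardizing data that have already been rescaled gives the same result as standardizing the raw data. The sketch below illustrates this with made-up values. `pre_scale` mimics R's `scale()` (sample standard deviation), while `standardize` stands in for whatever a package does internally; which standard deviation the package really uses is an assumption here and does not change the conclusion.

```python
from statistics import mean, pstdev, stdev


def pre_scale(values):
    """Centre and scale with the sample standard deviation, like R's scale()."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]


def standardize(values):
    """Stand-in for a package's internal scaling to unit variance."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]


# Hypothetical values, for illustration only.
raw = [1.0, 12.0, 24.0, 36.0, 48.0, 60.0]

from_raw = standardize(raw)
from_scaled = standardize(pre_scale(raw))

max_diff = max(abs(a - b) for a, b in zip(from_raw, from_scaled))
print(max_diff < 1e-12)
```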

## Import libraries
library(FactoMineR)
library(factoextra)

## FAMD (column 19 is treated as a supplementary variable)
res.famd <- FAMD(pca_df, 
                 sup.var = 19, 
                 graph = FALSE, 
                 ncp=25)

## Inspect principal components
get_eigenvalue(res.famd)
A matrix: 22 × 3 of type dbl
          eigenvalue  variance.percent  cumulative.variance.percent
Dim.1   4.5098513842       20.49932447                     20.49932
Dim.2   2.8024012638       12.73818756                     33.23751
Dim.3   1.8247464408        8.29430200                     41.53181
Dim.4   1.1490819680        5.22309985                     46.75491
Dim.5   1.0495900326        4.77086378                     51.52578
Dim.6   1.0144020899        4.61091859                     56.13670
Dim.7   0.9993340111        4.54242732                     60.67912
Dim.8   0.9842737776        4.47397172                     65.15310
Dim.9   0.9039827483        4.10901249                     69.26211
Dim.10  0.8465510646        3.84795938                     73.11007
Dim.11  0.7634513562        3.47023344                     76.58030
Dim.12  0.7069736613        3.21351664                     79.79382
Dim.13  0.6743784465        3.06535658                     82.85917
Dim.14  0.6215863186        2.82539236                     85.68457
Dim.15  0.6096204549        2.77100207                     88.45557
Dim.16  0.5972640853        2.71483675                     91.17041
Dim.17  0.4783653418        2.17438792                     93.34479
Dim.18  0.4652041236        2.11456420                     95.45936
Dim.19  0.4628728877        2.10396767                     97.56332
Dim.20  0.3061815235        1.39173420                     98.95506
Dim.21  0.2291355725        1.04152533                     99.99658
Dim.22  0.0007514474        0.00341567                    100.00000


PCAmixdata

library(PCAmixdata)

## Split quantitative and qualitative variables
split <- splitmix(pca_df[1:18])

## PCAmix
res.pcamix <- PCAmix(X.quanti=pca_df[c(5, 18)], 
                     X.quali=split$X.quali, 
                     rename.level=TRUE, 
                     graph=FALSE, 
                     ndim=25)

## Inspect principal components
res.pcamix$eig
A matrix: 22 × 3 of type dbl
          Eigenvalue   Proportion  Cumulative
dim 1   4.5098513842  20.49932447    20.49932
dim 2   2.8024012638  12.73818756    33.23751
dim 3   1.8247464408   8.29430200    41.53181
dim 4   1.1490819680   5.22309985    46.75491
dim 5   1.0495900326   4.77086378    51.52578
dim 6   1.0144020899   4.61091859    56.13670
dim 7   0.9993340111   4.54242732    60.67912
dim 8   0.9842737776   4.47397172    65.15310
dim 9   0.9039827483   4.10901249    69.26211
dim 10  0.8465510646   3.84795938    73.11007
dim 11  0.7634513562   3.47023344    76.58030
dim 12  0.7069736613   3.21351664    79.79382
dim 13  0.6743784465   3.06535658    82.85917
dim 14  0.6215863186   2.82539236    85.68457
dim 15  0.6096204549   2.77100207    88.45557
dim 16  0.5972640853   2.71483675    91.17041
dim 17  0.4783653418   2.17438792    93.34479
dim 18  0.4652041236   2.11456420    95.45936
dim 19  0.4628728877   2.10396767    97.56332
dim 20  0.3061815235   1.39173420    98.95506
dim 21  0.2291355725   1.04152533    99.99658
dim 22  0.0007514474   0.00341567   100.00000


prince

Unlike the two packages above, prince is implemented in Python.

The prince package automatically scales continuous variables for FAMD, so the original unscaled dataset can be used.

## Import libraries
import prince
import fbpca

## Instantiate FAMD object
famd = prince.FAMD(
     n_components=25,
     n_iter=10,
     copy=True,
     check_input=True,
     engine='auto',   ## Can be "auto", 'sklearn', 'fbpca'
     random_state=42)

## Fit FAMD object to data (excluding target variable)
famd = famd.fit(df.drop('Churn', axis=1))

## Inspect principal dimensions
famd.explained_inertia_            
[0.5374654125558428,
 0.08801867828903982,
 0.0572851244368975,
 0.03937333023240256,
 0.03127877295810106,
 0.027912251575853336,
 0.024703045349880923,
 0.020807301704142356,
 0.018937230974685852,
 0.01800539646894815,
 0.01656022774266522,
 0.015976757070266686,
 0.014945093457876379,
 0.013999456118270703,
 0.013763376399511574,
 0.013589916692549978,
 0.012208295034160313,
 0.011979368941411956,
 0.011339889947465936,
 0.007002655988383335,
 0.0048478677362699864,
 5.485296596457979e-07,
 1.795713232509243e-09,
 5.3660658752776136e-33,
 5.3660658752776136e-33]
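
To compare these proportions with the percentage columns reported by the R packages, they can be converted to percentages and accumulated. The short sketch below does this for the first four values copied from the output above; the variable names are only illustrative.

```python
from itertools import accumulate

# First four values copied from famd.explained_inertia_ above.
explained_inertia = [
    0.5374654125558428,
    0.08801867828903982,
    0.0572851244368975,
    0.03937333023240256,
]

percent = [100 * x for x in explained_inertia]
cumulative = list(accumulate(percent))

for i, (p, c) in enumerate(zip(percent, cumulative), start=1):
    print(f"Dim.{i}  {p:9.5f}  {c:9.5f}")
```

On this scale the first prince dimension accounts for about 53.7% of the inertia, against about 20.5% for the first dimension in both R outputs, so the Python result is not directly equivalent to the R results as run here.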