# Calculate principal components of mixed-type data

## 1. Introduction

Many datasets that a data scientist encounters in the real world contain both numerical and categorical variables. Factor analysis of mixed data (FAMD) is a principal component method that combines principal component analysis (PCA) for continuous variables with multiple correspondence analysis (MCA) for categorical variables. To learn more about FAMD, see an excellent tutorial using the `FactoMineR` package.

I will use three commonly used packages in R (`FactoMineR` and `PCAmixdata`) and Python (`prince`) to perform FAMD on the Telco customer churn dataset, with the aim of gaining insights into the relationships between various aspects of customer behaviour. This will be a toy example of how FAMD can be used to derive actionable business insights in the real world.

## 2. Import and pre-process data

Here, I will import the cleaned Telco dataset in both R and Python.

As `Calculated_TotalCharges` is highly correlated with `tenure` and `MonthlyCharges`, it will be excluded from the analysis.

Also, all three packages automatically normalize the numerical variables, so I will not do so beforehand.

### 2.1 In R

```
## Read in the cleaned Telco dataset
df <- read.csv('https://github.com/nd823/data-cleaning/raw/master/telco_cleaned_Jun13.csv')

## Drop the redundant, highly correlated variable
df <- within(df, rm('Calculated_TotalCharges'))
```
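
If you want to verify the correlation claim above, a quick sanity check is to inspect the pairwise correlations before dropping the column. Here is a minimal sketch, assuming the raw file contains the `tenure` and `MonthlyCharges` columns used later in the analysis:

```
## Sanity check: pairwise correlations motivating the exclusion
raw <- read.csv('https://github.com/nd823/data-cleaning/raw/master/telco_cleaned_Jun13.csv')
cor(raw[, c('Calculated_TotalCharges', 'tenure', 'MonthlyCharges')])
```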

### 2.2 In Python

```
## Read in the cleaned Telco dataset
import pandas as pd
df = pd.read_csv('https://github.com/nd823/data-cleaning/raw/master/telco_cleaned_Jun13.csv')

## Drop the redundant, highly correlated variable
df.drop('Calculated_TotalCharges', axis=1, inplace=True)
```

## 3. Factor analysis of mixed data (FAMD)

### 3.1 `FactoMineR` (R package)

`FactoMineR` provides a variety of functions for PCA, correspondence analysis (CA), multiple correspondence analysis (MCA) and FAMD. See the CRAN documentation for `FactoMineR`.

```
## Import libraries
library(FactoMineR)
library(factoextra)

## FAMD
res.famd <- FAMD(df,
                 sup.var = 19,  ## Set the target variable "Churn" as a supplementary variable, so it is not included in the analysis for now
                 graph = FALSE,
                 ncp = 25)

## Inspect principal components
get_eigenvalue(res.famd)
```

| | eigenvalue | variance.percent | cumulative.variance.percent |
|---|---|---|---|
| Dim.1 | 4.5098513842 | 20.49932447 | 20.49932 |
| Dim.2 | 2.8024012638 | 12.73818756 | 33.23751 |
| Dim.3 | 1.8247464408 | 8.29430200 | 41.53181 |
| Dim.4 | 1.1490819680 | 5.22309985 | 46.75491 |
| Dim.5 | 1.0495900326 | 4.77086378 | 51.52578 |
| Dim.6 | 1.0144020899 | 4.61091859 | 56.13670 |
| Dim.7 | 0.9993340111 | 4.54242732 | 60.67912 |
| Dim.8 | 0.9842737776 | 4.47397172 | 65.15310 |
| Dim.9 | 0.9039827483 | 4.10901249 | 69.26211 |
| Dim.10 | 0.8465510646 | 3.84795938 | 73.11007 |
| Dim.11 | 0.7634513562 | 3.47023344 | 76.58030 |
| Dim.12 | 0.7069736613 | 3.21351664 | 79.79382 |
| Dim.13 | 0.6743784465 | 3.06535658 | 82.85917 |
| Dim.14 | 0.6215863186 | 2.82539236 | 85.68457 |
| Dim.15 | 0.6096204549 | 2.77100207 | 88.45557 |
| Dim.16 | 0.5972640853 | 2.71483675 | 91.17041 |
| Dim.17 | 0.4783653418 | 2.17438792 | 93.34479 |
| Dim.18 | 0.4652041236 | 2.11456420 | 95.45936 |
| Dim.19 | 0.4628728877 | 2.10396767 | 97.56332 |
| Dim.20 | 0.3061815235 | 1.39173420 | 98.95506 |
| Dim.21 | 0.2291355725 | 1.04152533 | 99.99658 |
| Dim.22 | 0.0007514474 | 0.00341567 | 100.00000 |

To inspect the results in further detail, use the `summary(res.famd)` and `print(res.famd)` functions.
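
For a quick visual inspection, the `factoextra` package loaded above also provides plotting helpers. Here is a minimal sketch; the scree plot and variable contributions shown are illustrative additions, not part of the original analysis:

```
## Scree plot: percentage of variance explained by each dimension
fviz_screeplot(res.famd, addlabels = TRUE)

## Contributions of the variables to the first principal dimension
fviz_contrib(res.famd, choice = "var", axes = 1)
```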

### 3.2 `PCAmixdata` (R package)

According to its authors, `PCAmixdata` is "dedicated to multivariate analysis of mixed data where observations are described by a mixture of numerical and categorical variables" (Chavent et al., 2017). As we will see in part 2 of this series, `PCAmixdata` provides a very useful function for performing (a generalized form of) varimax rotation that aids in interpreting the principal components identified. See the CRAN documentation for `PCAmixdata`.

```
## Import library
library(PCAmixdata)

## Split the mixed dataset into quantitative and qualitative variables
split <- splitmix(df[1:18])  ## For now excluding the target variable "Churn", which will be added back later as a supplementary variable

## FAMD
res.pcamix <- PCAmix(X.quanti = split$X.quanti,
                     X.quali = split$X.quali,
                     rename.level = TRUE,
                     graph = FALSE,
                     ndim = 25)

## Inspect principal components
res.pcamix$eig
```

| | Eigenvalue | Proportion | Cumulative |
|---|---|---|---|
| dim 1 | 4.5098513842 | 20.49932447 | 20.49932 |
| dim 2 | 2.8024012638 | 12.73818756 | 33.23751 |
| dim 3 | 1.8247464408 | 8.29430200 | 41.53181 |
| dim 4 | 1.1490819680 | 5.22309985 | 46.75491 |
| dim 5 | 1.0495900326 | 4.77086378 | 51.52578 |
| dim 6 | 1.0144020899 | 4.61091859 | 56.13670 |
| dim 7 | 0.9993340111 | 4.54242732 | 60.67912 |
| dim 8 | 0.9842737776 | 4.47397172 | 65.15310 |
| dim 9 | 0.9039827483 | 4.10901249 | 69.26211 |
| dim 10 | 0.8465510646 | 3.84795938 | 73.11007 |
| dim 11 | 0.7634513562 | 3.47023344 | 76.58030 |
| dim 12 | 0.7069736613 | 3.21351664 | 79.79382 |
| dim 13 | 0.6743784465 | 3.06535658 | 82.85917 |
| dim 14 | 0.6215863186 | 2.82539236 | 85.68457 |
| dim 15 | 0.6096204549 | 2.77100207 | 88.45557 |
| dim 16 | 0.5972640853 | 2.71483675 | 91.17041 |
| dim 17 | 0.4783653418 | 2.17438792 | 93.34479 |
| dim 18 | 0.4652041236 | 2.11456420 | 95.45936 |
| dim 19 | 0.4628728877 | 2.10396767 | 97.56332 |
| dim 20 | 0.3061815235 | 1.39173420 | 98.95506 |
| dim 21 | 0.2291355725 | 1.04152533 | 99.99658 |
| dim 22 | 0.0007514474 | 0.00341567 | 100.00000 |

Similarly, to inspect the results in further detail, use the `summary(res.pcamix)` and `print(res.pcamix)` functions.
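
As a preview of the rotation mentioned above, `PCAmixdata`'s `PCArot()` function performs the generalized varimax rotation on a chosen number of dimensions. A minimal sketch follows; the choice of four dimensions here is purely illustrative, and part 2 of this series covers rotation properly:

```
## Rotate the first 4 principal components (illustrative choice)
res.pcarot <- PCArot(res.pcamix, dim = 4, graph = FALSE)

## Variance of the rotated components
res.pcarot$eig
```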

Thus far, we see that the results from `FactoMineR` and `PCAmixdata` are identical.

A little background: an **eigenvalue > 1** indicates that a principal component (PC) accounts for **more** variance than any single one of the original variables does in **standardized** data (**N.B. this holds true only when the data are standardized**). This is commonly used as a cutoff point for deciding which PCs to retain.

Interestingly, then, only the **first six** PCs have eigenvalues above 1 (with `Dim.5` and `Dim.6` clearing the cutoff only barely), and together they account for just 56.1% of the total variance in the dataset. This suggests that the patterns between the variables are likely non-linear and complex.
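
Applying this cutoff programmatically is a one-liner on the `eig` table returned by `PCAmixdata`; here is a minimal sketch, using the column name shown in the output above:

```
## Retain only the principal components with eigenvalue > 1
eig <- res.pcamix$eig
eig[eig[, "Eigenvalue"] > 1, ]
```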

### 3.3 `prince` (Python package)

Like `FactoMineR`, `prince` can be used to perform a variety of factor analyses on purely numerical, purely categorical or mixed-type datasets. Implemented in Python, this package uses a familiar `scikit-learn` API.

Unlike the two R packages above, there does not seem to be an option for adding in supplementary variables after FAMD.

For more detailed documentation, see the GitHub repo.

```
## Import libraries
import prince

## Instantiate FAMD object
famd = prince.FAMD(
    n_components=25,
    n_iter=10,
    copy=True,
    check_input=True,
    engine='auto',  ## Can be 'auto', 'sklearn' or 'fbpca'
    random_state=42)

## Fit FAMD object to data
famd = famd.fit(df.drop('Churn', axis=1))  ## Excluding the target variable "Churn"

## Inspect principal dimensions
famd.explained_inertia_
```

```
[0.5374654125558428,
0.08801867828903967,
0.05728512443689753,
0.03937333023240265,
0.03127877295810108,
0.02791225157585326,
0.024703045349880843,
0.02080730170414231,
0.01893723097468581,
0.01800539646894817,
0.016560227742665183,
0.015976757070266752,
0.014945093457876387,
0.013999456118270725,
0.013763376399511602,
0.013589916692549968,
0.012208295034160321,
0.011979368941411946,
0.01133988994746592,
0.007002655988383323,
0.004847867736269992,
5.485296596458002e-07,
1.7957132325092428e-09,
5.366065875277613e-33,
5.366065875277613e-33]
```

Surprisingly, the results here differ greatly from the ones above. From my preliminary reading, "explained inertia" is synonymous with "explained variance", so terminology seems an unlikely cause of the discrepancy. I will keep digging, but as you will see in later parts of this series, FAMD performed using `prince` does reach nearly identical conclusions to the two R packages.
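
For a like-for-like comparison with `prince`'s fractional output, the percentage proportions reported by the R packages can simply be rescaled; a quick sketch in R:

```
## Express the PCAmix proportions as fractions of total inertia,
## on the same scale as prince's explained_inertia_
res.pcamix$eig[, "Proportion"] / 100
```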

Finally, if you want to try this out for yourself, please head over to the Intelligence Refinery workspace at Nextjournal to check out the fully interactive notebook.

Til next post! :)