Conditional probability density plots

Nancy Chelaru-Centea

Aug. 12, 2019


Conditional probability density plots are a great way to examine the relationship between a continuous and a categorical variable, as they show how the conditional distribution of the latter changes over the range of the former. Commonly used bar graphs show only the mean or median of a continuous variable for each level of a categorical variable, which collapses much of the information into just a few numbers. These plots, by contrast, can reveal interesting dynamics over the whole range of values.

To make these plots, we use the smoothed density estimate implemented in the ggplot2 package (geom_density), which calculates and plots the kernel density estimate over a range of values. As usual, we will use the Telco customer churn dataset as a toy example of using these techniques to gain actionable business insights from data. As part of exploratory data analysis, I will make a conditional density plot for every combination of a categorical variable and a numerical variable in the Telco dataset, to see whether anything interesting emerges.
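To make concrete what a filled, stacked density plot represents, here is a minimal base-R sketch on synthetic data (the numbers and the "No"/"Yes" labels are made up for illustration and are not the Telco data). It estimates one kernel density per group on a shared grid and then normalizes across groups at every grid point. As far as I understand the ggplot2 defaults, geom_density scales each group's curve to integrate to 1, so the filled plot shows each group's share of the summed densities; weighting each density by its group size gives an estimate of the conditional probability of each group at a given x. Both versions are computed below so they can be compared.

```r
# Synthetic data (assumption: invented for illustration only)
set.seed(1)
x <- c(rnorm(300, mean = 30, sd = 8), rnorm(100, mean = 60, sd = 10))
group <- rep(c("No", "Yes"), times = c(300, 100))

# One kernel density estimate per group, all evaluated on the same grid
grid_from <- min(x)
grid_to <- max(x)
dens <- lapply(split(x, group), function(v) {
  density(v, from = grid_from, to = grid_to, n = 512)
})

# 512 x 2 matrix of density values, one column per group
f <- sapply(dens, function(d) d$y)
n_k <- table(group)[colnames(f)]  # group sizes, in column order

# Share of the summed densities (each group integrates to 1)
share_unweighted <- f / rowSums(f)

# Densities weighted by group size, then normalized:
# an estimate of P(group | x) at each grid point
weighted <- sweep(f, 2, as.numeric(n_k), `*`)
cond_prob <- weighted / rowSums(weighted)

# Inspect a few grid points
grid_x <- dens[[1]]$x
head(data.frame(x = round(grid_x, 1), round(cond_prob, 3)))
```

When the groups are very unequal in size, the two versions can look quite different, which is worth keeping in mind when reading such a plot as a probability. In ggplot2, mapping `y = after_stat(count)` inside `aes()` is one way to get the size-weighted version.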

## Import libraries
library(PCAmixdata)
library(ggplot2)
library(plotly)
library(gridExtra)

## Import data
df <- read.csv("https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_renamed.csv")

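## Create a conditional probability density plot for each categorical variable
## against each of the two continuous variables
plots <- list()

split <- splitmix(df) ## Split the dataframe into categorical and continuous variables

i <- 1

for (cat_var in colnames(split$X.quali)) {
    for (num_var in c("Tenure", "MonthlyCharges")) {
        ## .data[[...]] looks up each column by its name inside aes()
        plots[[i]] <- ggplot(df, aes(x = .data[[num_var]], fill = .data[[cat_var]])) +
                        geom_density(position = "fill", alpha = 0.5) +
                        xlab(num_var) + labs(fill = cat_var) +
                        theme(legend.text = element_text(size = 12),
                              axis.title = element_text(size = 14))

        i <- i + 1
    }
}

## Plot the static ggplot objects in a two-column grid
## (grid.arrange() lays out ggplot objects/grobs, not plotly widgets;
## call ggplotly() on an individual plot for an interactive version)
grid.arrange(grobs = plots, ncol = 2)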

[Figure: grid of conditional density plots, one per categorical variable against Tenure and MonthlyCharges]

A quick glance reveals several interesting patterns:

  • Male and female customers have very similar tenures and monthly charges
    • Factor analyses showed that the Gender variable factors very little into the variations in this dataset (see post here)
  • Monthly charges for customers with fiber optic internet service are much higher than for those with DSL or no internet service at all
    • Preliminary EDA (see post here) showed that customers who churn mostly have fiber optic internet service and higher monthly charges than those who do not churn
    • So dissatisfaction with the fiber optic service, or with its price, may factor into a customer's decision to leave
  • There are two "tiers" of monthly charges at which customers appear to be more likely to churn: ~40 dollars and ~75-100 dollars
    • It may be useful to see if the services and prices at these two "tiers" are less competitive than those offered by other companies
    • Conversely, perhaps the services offered at around 60 dollars/month are more appealing than those from competitors, and so may warrant more focus for advertising and promotions
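One way to cross-check an apparent "tier" in a smoothed plot is to bin the continuous variable and compute the raw churn rate in each bin. The sketch below does this on a made-up data frame: the `toy` data, the 10-dollar bin width, and the simulated churn probabilities are all assumptions for illustration, and the `Churn` column name with "Yes"/"No" values is assumed rather than taken from the dataset shown above.

```r
# Made-up stand-in for the Telco data (assumption: not the real dataset)
set.seed(7)
toy <- data.frame(MonthlyCharges = runif(1000, min = 20, max = 120))

# Simulated churn: higher probability above 70 dollars (arbitrary choice)
p_churn <- ifelse(toy$MonthlyCharges > 70, 0.40, 0.15)
toy$Churn <- ifelse(runif(1000) < p_churn, "Yes", "No")

# Bin monthly charges into 10-dollar intervals
toy$ChargeBin <- cut(toy$MonthlyCharges,
                     breaks = seq(20, 120, by = 10),
                     include.lowest = TRUE)

# Raw churn rate and number of customers per bin
churn_rate <- tapply(toy$Churn == "Yes", toy$ChargeBin, mean)
bin_size <- table(toy$ChargeBin)

summary_tbl <- data.frame(bin = names(churn_rate),
                          n = as.integer(bin_size),
                          churn_rate = round(as.numeric(churn_rate), 3))
print(summary_tbl)
```

Reporting the bin sizes alongside the rates helps flag bins where a high or low rate rests on only a handful of customers, which a smoothed plot does not show.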

Of course, it is important to keep in mind that these correlations do not indicate causal relationships, or even the direction of the relationship, between the variables examined. These are just starting points for further investigation.

If you are interested in trying this out for yourself, head over to the Intelligence Refinery workspace at Nextjournal to get the interactive notebook.

Til the next post! :)