Keywords: WhatsApp, Data Analysis, Statistical Consulting, Data Science

Introduction

In the following we want to create a mathematical model of a two-person WhatsApp chat.

My Ansatz is the following. Let \(t_{j,i}\) be the time at which the \(i\)-th message of conversation \(j\) is sent, by either Person A or Person B. The time stamps are ordered as \(0=t_{1,1}<t_{1,2}<\cdots<t_{1,a_1}<t_{2,1}<t_{2,2}<\cdots<t_{2,a_2}<\cdots<t_{n,1}<\cdots<t_{n,a_n}\), so the chat consists of \(n\) “conversations” between the two people, and conversation \(j\) contains \(a_j\) messages. I model the gap between two consecutive conversations as a pause \(P_j\):

\(t_{1,a_1}+P_1 = t_{2,1}\)

\(t_{2,a_2}+P_2 = t_{3,1}\)

\(\cdots\)

\(t_{n-1,a_{n-1}}+P_{n-1} = t_{n,1}\)

I have checked all of my distributional assumptions with the Kolmogorov-Smirnov test. The model is

\(P_j \sim Exp(\lambda_P)\)

\(d_{j,i} = t_{j,i+1}-t_{j,i} \sim Exp(\lambda_d)\) “interarrival times”

\(a_j \sim Pois(\lambda_a)\)

One can think of this as a “nested Poisson process”: an outer Poisson process governs when the conversations occur, and within each conversation the messages follow a homogeneous Poisson process.

In reality, however, we cannot observe when a conversation starts and when it ends. So the question is: given the data \(t_1 < \cdots < t_m\), is it possible to calibrate the above model to find out how many conversations the chat contains and when each conversation starts and ends, or does the model have too many parameters to estimate?
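To make the model concrete, here is a minimal R sketch that simulates time stamps from it. The function name `simulate_chat` and its return format are my own choices for illustration. It also makes one assumption that goes beyond the text: every conversation contains at least one message, so \(a_j\) is drawn as \(1 + Pois(\lambda_a - 1)\), which keeps \(E(a_j) = \lambda_a\) but is not exactly \(Pois(\lambda_a)\).

```r
# Simulate message time stamps from the nested model (illustrative sketch).
# Assumption: each conversation has at least one message, so a_j is drawn as
# 1 + Poisson(lambda_a - 1); this keeps E(a_j) = lambda_a.
simulate_chat <- function(n, lambda_P, lambda_d, lambda_a) {
  stopifnot(n >= 1, lambda_a >= 1)
  times <- numeric(0)
  conv  <- integer(0)
  now   <- 0                                      # t_{1,1} = 0
  for (j in seq_len(n)) {
    a_j  <- 1 + rpois(1, lambda_a - 1)            # number of messages in conversation j
    gaps <- rexp(a_j - 1, rate = lambda_d)        # interarrival times d_{j,i}
    t_j  <- now + c(0, cumsum(gaps))              # time stamps t_{j,1}, ..., t_{j,a_j}
    times <- c(times, t_j)
    conv  <- c(conv, rep(j, a_j))
    now   <- t_j[a_j] + rexp(1, rate = lambda_P)  # pause P_j before the next conversation
  }
  data.frame(time = times, conversation = conv)
}
```

A call such as `simulate_chat(n = 50, lambda_P = 1/600, lambda_d = 1/2, lambda_a = 8)` returns the time stamps together with the (normally unobserved) conversation labels, which is convenient for testing an estimation procedure against known parameters. The parameter values in this call are arbitrary.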

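Summing the pauses and the interarrival times gives (counting one pause per conversation; strictly, only \(n-1\) pauses precede \(t_{n,a_n}\), which makes little difference for large \(n\))

\(t_{n,a_n} = \sum_{j=1}^n P_j + \sum_{j=1}^n\sum_{i=1}^{a_j-1}d_{j,i}\)

From this I have computed the expected value and the variance of \(t_{n,a_n}\), assuming that all variables are independent:

\(E(t_{n,a_n}) = n/\lambda_P + n(\lambda_a-1)/\lambda_d\)

\(Var(t_{n,a_n}) = n/\lambda_P^2 + n(2\lambda_a-1)/\lambda_d^2\)

The variance contains the term \(n\lambda_a/\lambda_d^2\) in addition to \(n(\lambda_a-1)/\lambda_d^2\), because the number of interarrival times \(a_j-1\) in each conversation is itself random with variance \(\lambda_a\) (law of total variance).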

The question is now how to estimate the parameters \(n, \lambda_P, \lambda_d, \lambda_a\) from the data \(t_1<\cdots<t_m\).

Let \(d_i = t_{i+1}-t_i\) be the observed gaps between consecutive messages, and suppose we had a cutoff value \(\widehat{d}\). Then let \(n\) be the number of gaps with \(d_i > \widehat{d}\).

Suppose that this procedure can distinguish between a gap within a conversation and a pause. Then we have \(E(m) = \sum_{i=1}^nE(a_i) = n \lambda_a\), hence we can estimate \(\lambda_a\) as \(\widehat{\lambda_a} = m / n\). Likewise, we can estimate \(\lambda_P\) as the reciprocal of the mean pause length, \(\widehat{\lambda_P} = \frac{1}{\frac{1}{n} \sum_{d_j>\widehat{d}}d_j}\).

And the Ansatz

\(t_m = n/\widehat{\lambda_P}+n(\widehat{\lambda_a}-1)/\widehat{\lambda_d}\)

gives an estimate of \(\widehat{\lambda_d}\) as:

\(\widehat{\lambda_d} = \frac{m/n-1}{t_m/n-\frac{1}{n} \sum_{d_j>\widehat{d}}d_j}\)
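The three estimators above can be collected into one small R function. This is a minimal sketch under the assumption that the cutoff really separates pauses from gaps within a conversation. The names `estimate_rates`, `times` and `cutoff` are introduced here for illustration only, and the first time stamp is taken as the origin, so that \(t_m\) is the total elapsed time.

```r
# Moment estimators for a fixed cutoff (illustrative sketch).
estimate_rates <- function(times, cutoff) {
  times  <- sort(times)
  gaps   <- diff(times)              # observed gaps d_i = t_{i+1} - t_i
  pauses <- gaps[gaps > cutoff]      # gaps classified as pauses
  n      <- length(pauses)           # n = number of gaps above the cutoff
  stopifnot(n >= 1)
  m      <- length(times)            # number of messages
  t_m    <- times[m] - times[1]      # elapsed time, first message as origin
  lambda_a <- m / n
  lambda_P <- 1 / mean(pauses)       # 1 / ((1/n) * sum of the pauses)
  lambda_d <- (m / n - 1) / (t_m / n - mean(pauses))
  c(n = n, lambda_P = lambda_P, lambda_a = lambda_a, lambda_d = lambda_d)
}
```

For the toy input `estimate_rates(c(0, 1, 2, 12, 13, 14, 24, 25), cutoff = 5)` the two gaps of length 10 are classified as pauses, so the function uses \(n = 2\) and \(m = 8\).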

The question is now how to find the cutoff value \(T = \widehat{d}\). Consider the following scenario: let \(X_1,\ldots,X_m\) be Bernoulli variables with success probability \(p\), let \(D_1,\ldots,D_m \sim Exp(\lambda_d)\) and \(P_1,\ldots,P_m \sim Exp(\lambda_p)\), and set \(d_i = X_i D_i + (1-X_i) P_i\), so that \(d_i\) is a gap within a conversation with probability \(p\) and a pause otherwise. Suppose that \(\lambda_d \gg \lambda_p\). We want to find a threshold \(T\) such that \(d_i > T\) indicates \(X_i=0\) (hence \(P_i\) was chosen) and \(d_i \le T\) indicates \(X_i = 1\) (hence \(D_i\) was chosen).

One way to do this is to assume that \(p,\lambda_d,\lambda_p\) are known and then to minimize the probability of misclassifying a gap:

\((1-p)\mathbb{P}(P_i \le T) + p \mathbb{P}(D_i \ge T) = (1-p)(1-e^{-\lambda_p T}) + p e^{-\lambda_d T} \approx (1-p)\lambda_p T + p e^{-\lambda_d T}\), where the last approximation holds because we assume that \(T \ll 1/\lambda_p\). Taking the derivative with respect to \(T\), setting it equal to \(0\) and solving for \(T\), we get:

\[ T = -\frac{1}{\lambda_d} \log\left(\frac{(1-p)\lambda_p}{p\lambda_d}\right) \]
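As a sanity check, the closed-form threshold can be compared with a numerical minimization of the exact misclassification probability, that is, without the linearization \(T \ll 1/\lambda_p\). The following R sketch does this for one set of parameter values. These values are arbitrary illustrations, not estimates from real chat data, and the function names are my own.

```r
# Closed-form cutoff from the linearised error (needs the log argument to be < 1).
cutoff_formula <- function(p, lambda_d, lambda_p) {
  -1 / lambda_d * log((1 - p) * lambda_p / (p * lambda_d))
}

# Exact misclassification probability, without assuming T << 1/lambda_p.
misclass <- function(T_cut, p, lambda_d, lambda_p) {
  (1 - p) * pexp(T_cut, rate = lambda_p) +
    p * pexp(T_cut, rate = lambda_d, lower.tail = FALSE)
}

# Illustrative parameter values (assumptions, not fitted to data).
p <- 0.9; lambda_d <- 1 / 2; lambda_p <- 1 / 600

T_approx <- cutoff_formula(p, lambda_d, lambda_p)
T_exact  <- optimize(misclass, interval = c(0, 100), p = p,
                     lambda_d = lambda_d, lambda_p = lambda_p)$minimum
print(c(T_approx = T_approx, T_exact = T_exact))
```

When \(\lambda_d \gg \lambda_p\), as in this example, the two values should lie close together, which supports using the simpler closed form inside an iterative scheme.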

So the idea is to start with the mean gap as the first cutoff, \(T = \widehat{d} := \frac{1}{m} \sum_{i=1}^m d_i\), then to estimate \(\lambda_d,\lambda_p,p\) as described above, compute a new cutoff from the formula for \(T\), and iterate this procedure, say 10 times.

(Simulations suggest that this procedure does not always converge to the true \(\lambda_d,\lambda_p,p\), but that it comes close enough for practical applications.)

The procedure described above can be implemented in R as follows:

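  # One update step: classify the gaps with the current cutoff, then re-estimate.
  threshold <- function(di,pu,lambdaD,lambdaP){
     # cutoff from the formula for T (named 'cutoff' so that R's T/TRUE is not masked)
     cutoff <- -1/lambdaD*log((1-pu)*lambdaP/(pu*lambdaD))
     A <- di[di>cutoff]    # gaps classified as pauses
     B <- di[di<=cutoff]   # gaps classified as within a conversation
     lp <- 1/mean(A)
     ld <- 1/mean(B)
     p <- 1-length(A)/length(di)   # share of gaps within a conversation
     return( c(cutoff,lp,ld,p) )
  }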

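  # dt: vector of observed gaps d_i between consecutive messages
  calibrate <- function(dt){
     A <- dt[dt>mean(dt)]    # initial cutoff: the mean gap
     B <- dt[dt<=mean(dt)]
     lp <- 1/mean(A)
     ld <- 1/mean(B)
     Pu <- 1-length(A)/length(dt)

     # re-classify and re-estimate 10 times
     for( i in seq(1,10)){
       y <- threshold(dt,Pu,ld,lp)
       lp <- y[2]
       ld <- y[3]
       Pu <- y[4]
     }
     return( c(lp,ld,Pu) )
  }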

Application:

As an application of the above model, WhatsApp could implement a reminder to write to somebody you have not written to for a “long” time.

I recommend sending the reminder when the time elapsed since the last conversation, in minutes, is greater than \(-\log(1-0.999)/\widehat{\lambda_P}\), which is the 99.9% quantile of the fitted pause distribution \(Exp(\widehat{\lambda_P})\).
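The reminder rule is a one-liner in R, since the bound is just a quantile of the exponential distribution. The sketch below assumes that times are measured in minutes and that \(\widehat{\lambda_P}\) has already been estimated. The function name `should_remind` and its arguments are hypothetical.

```r
# Reminder rule (illustrative sketch): TRUE if the current silence is longer
# than the chosen quantile of the fitted pause distribution Exp(lambda_P).
should_remind <- function(now_min, last_msg_min, lambda_P, level = 0.999) {
  limit <- qexp(level, rate = lambda_P)   # same as -log(1 - level) / lambda_P
  (now_min - last_msg_min) > limit
}
```

For example, with a hypothetical \(\widehat{\lambda_P} = 1/600\) per minute (a mean pause of 10 hours), the limit is roughly 4145 minutes, so a reminder would appear after a little under three days of silence.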

Thanks go to Bjørn Kjos-Hanssen for pointing out how to simplify the model and Anthony Quas for pointing out how to find a threshold \(\widehat{d}\) for \(d_i\).