<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="http://sijiachen.me/feed.xml" rel="self" type="application/atom+xml" /><link href="http://sijiachen.me/" rel="alternate" type="text/html" /><updated>2026-05-21T03:20:17+00:00</updated><id>http://sijiachen.me/feed.xml</id><title type="html">Chen Sijia’s personal blog</title><subtitle>This is where I share fun stuff about my life</subtitle><author><name>Chen Sijia</name></author><entry><title type="html">Survival Analysis Case Report - Telecom Customer Churn Prediction</title><link href="http://sijiachen.me/2026/04/27/survival-analysis-case-report.html" rel="alternate" type="text/html" title="Survival Analysis Case Report - Telecom Customer Churn Prediction" /><published>2026-04-27T00:00:00+00:00</published><updated>2026-04-27T00:00:00+00:00</updated><id>http://sijiachen.me/2026/04/27/survival-analysis-case-report</id><content type="html" xml:base="http://sijiachen.me/2026/04/27/survival-analysis-case-report.html"><![CDATA[<p><strong>Author</strong>: Chen Sijia</p>

<p><strong>Dataset</strong>: IBM Telco Customer Churn</p>

<p><strong>Tutorial Link</strong>: https://github.com/databricks-industry-solutions/survival-analysis</p>

<p><strong>Environment</strong>: PySpark Environment</p>

<hr />
<h2 id="table-of-contents">Table of Contents</h2>
<ol>
  <li><a href="#introduction">Introduction</a></li>
  <li><a href="#data-overview">Data Overview</a></li>
  <li><a href="#survival-analysis-methods">Survival Analysis Methods</a></li>
  <li><a href="#kaplan-meier-survival-analysis">Kaplan-Meier Survival Analysis</a></li>
  <li><a href="#cox-proportional-hazards-model">Cox Proportional Hazards Model</a></li>
  <li><a href="#accelerated-failure-time-model-aft">Accelerated Failure Time Model (AFT)</a></li>
  <li><a href="#customer-lifetime-value-clv">Customer Lifetime Value (CLV)</a></li>
  <li><a href="#conclusion">Conclusion</a></li>
  <li><a href="#business-recommendations">Business Recommendations</a></li>
  <li><a href="#model-limitations">Model Limitations</a></li>
  <li><a href="#appendix">Appendix</a></li>
</ol>

<hr />

<h2 id="introduction">1. Introduction</h2>

<h3 id="11-what-is-survival-analysis">1.1 What is Survival Analysis?</h3>

<p>Survival Analysis is a collection of statistical methods for studying “time-to-event” data. Although originally applied in medicine (studying patient survival time), it is now widely used in:</p>
<ul>
  <li><strong>Telecommunications</strong>: Customer churn prediction</li>
  <li><strong>Manufacturing</strong>: Equipment failure prediction</li>
  <li><strong>Finance</strong>: Loan default time prediction</li>
  <li><strong>E-commerce</strong>: User activation time prediction</li>
</ul>

<h3 id="12-project-overview">1.2 Project Overview</h3>

<p>This case uses survival analysis to predict churn time for telecom customers, helping enterprises:</p>
<ul>
  <li>Identify customers with high churn risk</li>
  <li>Take retention actions at critical time points</li>
  <li>Optimize customer retention strategies</li>
</ul>

<hr />

<h2 id="data-overview">2. Data Overview</h2>

<h3 id="21-data-source">2.1 Data Source</h3>
<ul>
  <li><strong>Dataset</strong>: IBM Telco Customer Churn Dataset</li>
  <li><strong>Original records</strong>: 7,043</li>
  <li><strong>Analysis sample size</strong>: 3,351 (month-to-month contract + internet service customers)</li>
</ul>

<h3 id="22-sample-characteristics">2.2 Sample Characteristics</h3>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Number of churned customers</td>
      <td>1,556</td>
    </tr>
    <tr>
      <td>Churn rate</td>
      <td>46.4%</td>
    </tr>
    <tr>
      <td>Observation time range</td>
      <td>0-72 months</td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="survival-analysis-methods">3. Survival Analysis Methods</h2>

<h3 id="31-kaplan-meier-estimator">3.1 Kaplan-Meier Estimator</h3>

<p><strong>Basic Principle</strong>:<br />
Non-parametric method to estimate the survival function S(t) = P(T &gt; t)</p>

<p><strong>Formula</strong>:<br />
S(t) = ∏ (1 - d_j / n_j), with the product taken over all event times t_j ≤ t</p>

<p>Where:</p>
<ul>
  <li>d_j: number of events (churns) at event time t_j</li>
  <li>n_j: number of customers at risk just before t_j</li>
</ul>
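
<p>As a minimal sketch of the product-limit formula above, the pure-Python function below walks through the distinct times and multiplies the running survival probability by (1 - d_j / n_j) at each event time. The function name kaplan_meier and the toy data are assumptions made for illustration; this is not how lifelines implements the estimator.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import Counter


def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct event time (product-limit estimate)."""
    deaths = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)   # events and censorings both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical toy data: tenure in months, 1 = churned, 0 = censored
toy_tenure = [1, 2, 2, 3, 5, 5]
toy_churn = [1, 1, 0, 1, 0, 1]
km_curve = kaplan_meier(toy_tenure, toy_churn)
print(km_curve)
</code></pre></div></div>

<p>Censored customers (churn = 0) never reduce the survival probability directly; they only shrink the risk set for later time points, which is how the estimator handles censoring.</p>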

<p><strong>Advantages</strong>:</p>
<ul>
  <li>No distributional assumptions</li>
  <li>Handles censored data</li>
  <li>Intuitive and easy to interpret</li>
</ul>

<p><strong>Application</strong>: Visualize survival curves for different groups, compare differences between groups</p>

<h3 id="32-cox-proportional-hazards-model">3.2 Cox Proportional Hazards Model</h3>

<p><strong>Basic Principle</strong>:<br />
Semi-parametric model to analyze the effect of multiple covariates on survival time</p>

<p><strong>Model Form</strong>:<br />
h(t|X) = h₀(t) × exp(β₁X₁ + … + βₚXₚ)</p>

<p>Where:</p>
<ul>
  <li>h(t|X): hazard function</li>
  <li>h₀(t): baseline hazard</li>
  <li>β: covariate coefficients</li>
  <li>exp(β): hazard ratio (HR)</li>
</ul>
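
<p>The "proportional" part of the model can be illustrated with a short sketch: because the baseline hazard multiplies every customer's hazard, it cancels in the ratio between two customers, leaving only exp(β·ΔX). The coefficients, covariate vectors, and baseline values below are hypothetical and chosen only for illustration.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math


def hazard(baseline, betas, x):
    """Cox hazard h(t|X) = h0(t) * exp(sum(beta_i * x_i)) at a single time point."""
    return baseline * math.exp(sum(b * xi for b, xi in zip(betas, x)))


betas = [-0.5, 0.3]        # hypothetical coefficients
customer_a = [1, 0]        # first covariate switched on
customer_b = [0, 0]        # reference customer

# Hypothetical baseline hazards at three different times
ratios = [hazard(h0, betas, customer_a) / hazard(h0, betas, customer_b)
          for h0 in (0.01, 0.05, 0.20)]
print(ratios)              # the same value each time: exp(-0.5)
</code></pre></div></div>

<p>The ratio does not depend on the baseline hazard, which is why the model can be fitted without specifying h₀(t) and why a time-varying ratio signals a violated assumption.</p>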

<p><strong>Advantages</strong>:</p>
<ul>
  <li>Analyzes multiple factors simultaneously</li>
  <li>No need to specify baseline hazard</li>
  <li>HR &lt; 1 indicates protective factor, HR &gt; 1 indicates risk factor</li>
</ul>

<p><strong>Application</strong>: Identify key factors influencing customer churn</p>

<h3 id="33-accelerated-failure-time-aft-model">3.3 Accelerated Failure Time (AFT) Model</h3>

<p><strong>Basic Principle</strong>:<br />
Parametric model assuming covariates accelerate or decelerate survival time</p>

<p><strong>Model Form</strong>:<br />
T = exp(β₁X₁ + … + βₚXₚ + σ·ε)</p>

<p>Where:</p>
<ul>
  <li>T: survival time</li>
  <li>exp(β): time acceleration factor (&gt;1 extends, &lt;1 shortens)</li>
  <li>σ: scale parameter</li>
  <li>ε: error term</li>
</ul>
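
<p>To make the acceleration-factor reading concrete, the sketch below evaluates the model form for one customer with and without a binary covariate. All numbers (the coefficient, the scale, and the error draw) are hypothetical; the point is only that the ratio of the two survival times equals exp(β) regardless of the error term.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math


def aft_time(linear_predictor, sigma, eps):
    """Survival time under the log-linear AFT form T = exp(xb + sigma * eps)."""
    return math.exp(linear_predictor + sigma * eps)


beta = 0.5      # hypothetical coefficient for a binary covariate
sigma = 1.2     # hypothetical scale parameter
eps = -0.3      # one fixed draw of the error term

t_without = aft_time(0.0, sigma, eps)     # covariate = 0
t_with = aft_time(beta, sigma, eps)       # covariate = 1
ratio = t_with / t_without                # equals exp(beta) for any eps
print(ratio, math.exp(beta))
</code></pre></div></div>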

<p><strong>Advantages</strong>:</p>
<ul>
  <li>Directly predicts survival time</li>
  <li>Handles multiple distributions (Weibull, LogNormal, LogLogistic)</li>
</ul>

<p><strong>Application</strong>: Predict customer churn probability at specific time points</p>

<h3 id="34-log-rank-test">3.4 Log-Rank Test</h3>

<p><strong>Basic Principle</strong>:<br />
Chi-square-based test of whether two or more survival curves are statistically equivalent</p>

<p><strong>Null Hypothesis</strong>: All groups share the same survival function</p>

<p><strong>Application</strong>: Verify whether survival curves differ significantly across groups</p>

<h3 id="35-method-comparison">3.5 Method Comparison</h3>

<table>
  <thead>
    <tr>
      <th>Method</th>
      <th>Type</th>
      <th>Purpose</th>
      <th>Output</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>KM</td>
      <td>Non-parametric</td>
      <td>Estimate survival function</td>
      <td>Survival probability</td>
    </tr>
    <tr>
      <td>Cox</td>
      <td>Semi-parametric</td>
      <td>Analyze influencing factors</td>
      <td>Hazard ratio (HR)</td>
    </tr>
    <tr>
      <td>AFT</td>
      <td>Parametric</td>
      <td>Predict survival time</td>
      <td>Time acceleration factor</td>
    </tr>
    <tr>
      <td>Log-rank</td>
      <td>Non-parametric test</td>
      <td>Compare differences between groups</td>
      <td>test_statistic, p, -log2(p)</td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="kaplan-meier-survival-analysis">4. Kaplan-Meier Survival Analysis</h2>

<h3 id="41-analysis-workflow">4.1 Analysis Workflow</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">lifelines</span> <span class="kn">import</span> <span class="n">KaplanMeierFitter</span>
<span class="kn">from</span> <span class="nn">lifelines.statistics</span> <span class="kn">import</span> <span class="n">pairwise_logrank_test</span>

<span class="c1"># Overall KM fit
</span><span class="n">kmf</span> <span class="o">=</span> <span class="n">KaplanMeierFitter</span><span class="p">()</span>
<span class="n">kmf</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">telco_pd</span><span class="p">[</span><span class="s">'tenure'</span><span class="p">],</span> <span class="n">telco_pd</span><span class="p">[</span><span class="s">'churn'</span><span class="p">].</span><span class="n">astype</span><span class="p">(</span><span class="nb">float</span><span class="p">))</span>

<span class="c1"># Group KM and log-rank test
</span><span class="k">def</span> <span class="nf">plot_km</span><span class="p">(</span><span class="n">col</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">telco_pd</span><span class="p">[</span><span class="n">col</span><span class="p">].</span><span class="n">unique</span><span class="p">():</span>
        <span class="n">ix</span> <span class="o">=</span> <span class="n">telco_pd</span><span class="p">[</span><span class="n">col</span><span class="p">]</span> <span class="o">==</span> <span class="n">r</span>
        <span class="n">kmf</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">telco_pd</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">ix</span><span class="p">,</span> <span class="s">'tenure'</span><span class="p">],</span> <span class="n">telco_pd</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">ix</span><span class="p">,</span> <span class="s">'churn'</span><span class="p">],</span> <span class="n">label</span><span class="o">=</span><span class="n">r</span><span class="p">)</span>
        <span class="n">kmf</span><span class="p">.</span><span class="n">plot</span><span class="p">()</span>

<span class="k">def</span> <span class="nf">print_logrank</span><span class="p">(</span><span class="n">col</span><span class="p">):</span>
    <span class="n">log_rank</span> <span class="o">=</span> <span class="n">pairwise_logrank_test</span><span class="p">(</span><span class="n">telco_pd</span><span class="p">[</span><span class="s">'tenure'</span><span class="p">],</span> <span class="n">telco_pd</span><span class="p">[</span><span class="n">col</span><span class="p">],</span> <span class="n">telco_pd</span><span class="p">[</span><span class="s">'churn'</span><span class="p">])</span>
    <span class="k">print</span><span class="p">(</span><span class="n">log_rank</span><span class="p">.</span><span class="n">summary</span><span class="p">)</span>

<span class="c1"># Perform group analysis
</span><span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">categorical_cols</span><span class="p">:</span>
    <span class="n">plot_km</span><span class="p">(</span><span class="n">col</span><span class="p">)</span>
    <span class="n">print_logrank</span><span class="p">(</span><span class="n">col</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="42-analysis-results">4.2 Analysis Results</h3>

<h4 id="421-overall-survival-curve">4.2.1 Overall Survival Curve</h4>

<ul>
  <li><strong>Median survival time: 34 months</strong></li>
  <li><strong>Interpretation</strong>: The estimated survival probability falls to 50% at 34 months, i.e. about half of customers are expected to churn within 34 months</li>
</ul>

<p><img src="/assets/images/km_overall_curve.png" alt="Overall survival curve" />
<em>Figure 1: Overall Kaplan-Meier survival curve</em></p>

<h4 id="422-dsl-internet-service-survival-probability-first-10-months">4.2.2 DSL Internet Service Survival Probability (first 10 months)</h4>

<table>
  <thead>
    <tr>
      <th>Month</th>
      <th>DSL Survival Probability</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>0</td>
      <td>1.000000</td>
    </tr>
    <tr>
      <td>1</td>
      <td>0.902698</td>
    </tr>
    <tr>
      <td>2</td>
      <td>0.864380</td>
    </tr>
    <tr>
      <td>3</td>
      <td>0.834702</td>
    </tr>
    <tr>
      <td>4</td>
      <td>0.810522</td>
    </tr>
    <tr>
      <td>5</td>
      <td>0.794352</td>
    </tr>
    <tr>
      <td>6</td>
      <td>0.783900</td>
    </tr>
    <tr>
      <td>7</td>
      <td>0.776362</td>
    </tr>
    <tr>
      <td>8</td>
      <td>0.768486</td>
    </tr>
    <tr>
      <td>9</td>
      <td>0.750833</td>
    </tr>
  </tbody>
</table>
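
<p>One way to read the table above is to convert consecutive survival probabilities into a month-by-month conditional churn rate, 1 - S(j) / S(j-1). The sketch below does this for the ten values listed; the variable names are illustrative and the calculation is a reading aid, not part of the original workflow.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># DSL survival probabilities for months 0-9, copied from the table above
dsl_survival = [1.000000, 0.902698, 0.864380, 0.834702, 0.810522,
                0.794352, 0.783900, 0.776362, 0.768486, 0.750833]

# Share of customers still active at month j-1 who churn by month j
monthly_churn = [1 - later / earlier
                 for earlier, later in zip(dsl_survival, dsl_survival[1:])]

for month, rate in enumerate(monthly_churn, start=1):
    print(f"month {month}: {rate:.2%}")
</code></pre></div></div>

<p>The first month carries by far the largest conditional churn (about 9.7%), which suggests that early-tenure customers are the most time-sensitive group for retention actions.</p>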

<h4 id="423-group-survival-analysis-results">4.2.3 Group Survival Analysis Results</h4>

<ol>
  <li><strong>Gender</strong></li>
</ol>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>test_statistic</td>
      <td>2.038938</td>
    </tr>
    <tr>
      <td>p-value</td>
      <td>0.153317</td>
    </tr>
    <tr>
      <td>-log2(p)</td>
      <td>2.705414</td>
    </tr>
  </tbody>
</table>

<ul>
  <li><strong>Conclusion</strong>: p &gt; 0.05, survival curves for different genders are not significantly different</li>
</ul>

<p><img src="/assets/images/km_gender_curve.png" alt="Gender group survival curve" />
<em>Figure 2: Survival curve by gender</em></p>
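
<p>The three numbers in each result table are linked: assuming a two-group log-rank statistic is compared against a chi-square distribution with 1 degree of freedom, the p-value is the upper-tail probability of the statistic, and -log2(p) re-expresses that p-value in bits. The sketch below reproduces the gender row using only the standard library; the helper name chi2_sf_1df is hypothetical.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math


def chi2_sf_1df(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


gender_statistic = 2.038938             # test_statistic from the gender table
p_value = chi2_sf_1df(gender_statistic)
surprise_bits = -math.log2(p_value)     # the "-log2(p)" column
print(round(p_value, 6), round(surprise_bits, 4))
</code></pre></div></div>

<p>The result should land close to the tabulated p-value of 0.153 and -log2(p) of about 2.71. Larger -log2(p) values mean stronger evidence against the null hypothesis, which makes very small p-values easier to compare across variables.</p>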

<hr />

<ol start="2">
  <li><strong>Senior Citizen Status (seniorCitizen)</strong></li>
</ol>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>test_statistic</td>
      <td>0.125471</td>
    </tr>
    <tr>
      <td>p-value</td>
      <td>0.723174</td>
    </tr>
    <tr>
      <td>-log2(p)</td>
      <td>0.467584</td>
    </tr>
  </tbody>
</table>

<ul>
  <li><strong>Conclusion</strong>: p &gt; 0.05, senior citizen status has no significant impact on customer retention</li>
</ul>

<p><img src="/assets/images/km_senior_curve.png" alt="Senior citizen group survival curve" />
<em>Figure 3: Survival curve by senior citizen status</em></p>

<hr />

<ol start="3">
  <li><strong>Partner Status (partner)</strong></li>
</ol>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>test_statistic</td>
      <td>135.758896</td>
    </tr>
    <tr>
      <td>p-value</td>
      <td>2.252911e-31</td>
    </tr>
    <tr>
      <td>-log2(p)</td>
      <td>101.807981</td>
    </tr>
  </tbody>
</table>

<ul>
  <li><strong>Conclusion</strong>: Customers with partners have significantly longer retention time</li>
</ul>

<p><img src="/assets/images/km_partner_curve.png" alt="Partner status group survival curve" />
<em>Figure 4: Survival curve by partner status</em></p>

<hr />

<ol start="4">
  <li><strong>Dependents Status (dependents)</strong></li>
</ol>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>test_statistic</td>
      <td>35.031241</td>
    </tr>
    <tr>
      <td>p-value</td>
      <td>3.244576e-09</td>
    </tr>
    <tr>
      <td>-log2(p)</td>
      <td>28.199323</td>
    </tr>
  </tbody>
</table>

<ul>
  <li><strong>Conclusion</strong>: Customers with dependents have significantly longer retention time</li>
</ul>

<p><img src="/assets/images/km_dependents_curve.png" alt="Dependents status group survival curve" />
<em>Figure 5: Survival curve by dependents status</em></p>

<hr />

<ol start="5">
  <li><strong>Phone Service (phoneService)</strong></li>
</ol>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>test_statistic</td>
      <td>1.683709</td>
    </tr>
    <tr>
      <td>p-value</td>
      <td>0.194432</td>
    </tr>
    <tr>
      <td>-log2(p)</td>
      <td>2.36266</td>
    </tr>
  </tbody>
</table>

<ul>
  <li><strong>Conclusion</strong>: p &gt; 0.05, having phone service has no significant impact on retention</li>
</ul>

<p><img src="/assets/images/km_phoneservice_curve.png" alt="Phone service group survival curve" />
<em>Figure 6: Survival curve by phone service</em></p>

<hr />

<ol start="6">
  <li><strong>Multiple Lines Service (multipleLines)</strong></li>
</ol>

<table>
  <thead>
    <tr>
      <th>Group Comparison</th>
      <th>test_statistic</th>
      <th>p-value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>No phone service vs No</td>
      <td>12.382712</td>
      <td>4.333273e-04</td>
    </tr>
    <tr>
      <td>No vs Yes</td>
      <td>72.358368</td>
      <td>1.794602e-17</td>
    </tr>
    <tr>
      <td>No phone service vs Yes</td>
      <td>1.500291</td>
      <td>0.2206266</td>
    </tr>
  </tbody>
</table>

<ul>
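  <li><strong>Conclusion</strong>: Multiple lines status has a significant impact on retention: the “No” group differs significantly from both the “Yes” and “No phone service” groups, while “No phone service” vs “Yes” is not significant (p ≈ 0.22)</li>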
</ul>

<p><img src="/assets/images/km_multiplelines_curve.png" alt="Multiple lines service group survival curve" />
<em>Figure 7: Survival curve by multiple lines service</em></p>

<hr />

<ol start="7">
  <li><strong>Internet Service Type (internetService)</strong></li>
</ol>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>test_statistic</td>
      <td>25.172866</td>
    </tr>
    <tr>
      <td>p-value</td>
      <td>5.241449e-07</td>
    </tr>
    <tr>
      <td>-log2(p)</td>
      <td>20.863531</td>
    </tr>
  </tbody>
</table>

<ul>
  <li><strong>Conclusion</strong>: DSL customers retain significantly better than fiber optic customers</li>
</ul>

<p><img src="/assets/images/km_internet_curve.png" alt="Internet service group survival curve" />
<em>Figure 8: Survival curve by internet service type</em></p>

<hr />

<ol start="8">
  <li><strong>Streaming TV (streamingTV)</strong></li>
</ol>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>test_statistic</td>
      <td>12.93926</td>
    </tr>
    <tr>
      <td>p-value</td>
      <td>0.000322</td>
    </tr>
    <tr>
      <td>-log2(p)</td>
      <td>11.601718</td>
    </tr>
  </tbody>
</table>

<ul>
  <li><strong>Conclusion</strong>: Customers with streaming TV service retain significantly better</li>
</ul>

<p><img src="/assets/images/km_streamingtv_curve.png" alt="Streaming TV group survival curve" />
<em>Figure 9: Survival curve by streaming TV</em></p>

<hr />

<ol start="9">
  <li><strong>Streaming Movies (streamingMovies)</strong></li>
</ol>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>test_statistic</td>
      <td>17.941685</td>
    </tr>
    <tr>
      <td>p-value</td>
      <td>0.000023</td>
    </tr>
    <tr>
      <td>-log2(p)</td>
      <td>15.422016</td>
    </tr>
  </tbody>
</table>

<ul>
  <li><strong>Conclusion</strong>: Customers with streaming movies service retain significantly better</li>
</ul>

<p><img src="/assets/images/km_streamingmovies_curve.png" alt="Streaming movies group survival curve" />
<em>Figure 10: Survival curve by streaming movies</em></p>

<hr />

<ol start="10">
  <li><strong>Online Security Service (onlineSecurity)</strong></li>
</ol>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>test_statistic</td>
      <td>141.60316</td>
    </tr>
    <tr>
      <td>p-value</td>
      <td>1.187554e-32</td>
    </tr>
    <tr>
      <td>-log2(p)</td>
      <td>106.053706</td>
    </tr>
  </tbody>
</table>

<ul>
  <li><strong>Conclusion</strong>: Customers with online security service have significantly longer retention time</li>
</ul>

<p><img src="/assets/images/km_onlinesecurity_curve.png" alt="Online security group survival curve" />
<em>Figure 11: Survival curve by online security</em></p>

<hr />

<ol start="11">
  <li><strong>Online Backup Service (onlineBackup)</strong></li>
</ol>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>test_statistic</td>
      <td>189.482865</td>
    </tr>
    <tr>
      <td>p-value</td>
      <td>4.122979e-43</td>
    </tr>
    <tr>
      <td>-log2(p)</td>
      <td>140.799221</td>
    </tr>
  </tbody>
</table>

<ul>
  <li><strong>Conclusion</strong>: Customers with online backup service have significantly longer retention time</li>
</ul>

<p><img src="/assets/images/km_onlinebackup_curve.png" alt="Online backup group survival curve" />
<em>Figure 12: Survival curve by online backup</em></p>

<hr />

<ol start="12">
  <li><strong>Device Protection Service (deviceProtection)</strong></li>
</ol>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>test_statistic</td>
      <td>71.496825</td>
    </tr>
    <tr>
      <td>p-value</td>
      <td>2.777047e-17</td>
    </tr>
    <tr>
      <td>-log2(p)</td>
      <td>54.999226</td>
    </tr>
  </tbody>
</table>

<ul>
  <li><strong>Conclusion</strong>: Customers with device protection service have significantly longer retention time</li>
</ul>

<p><img src="/assets/images/km_deviceprotection_curve.png" alt="Device protection group survival curve" />
<em>Figure 13: Survival curve by device protection</em></p>

<hr />

<ol start="13">
  <li><strong>Tech Support Service (techSupport)</strong></li>
</ol>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>test_statistic</td>
      <td>90.430334</td>
    </tr>
    <tr>
      <td>p-value</td>
      <td>1.916059e-21</td>
    </tr>
    <tr>
      <td>-log2(p)</td>
      <td>68.822348</td>
    </tr>
  </tbody>
</table>

<ul>
  <li><strong>Conclusion</strong>: Customers with tech support service have significantly longer retention time</li>
</ul>

<p><img src="/assets/images/km_techsupport_curve.png" alt="Tech support group survival curve" />
<em>Figure 14: Survival curve by tech support</em></p>

<hr />

<ol start="14">
  <li><strong>Paperless Billing (paperlessBilling)</strong></li>
</ol>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>test_statistic</td>
      <td>8.340802</td>
    </tr>
    <tr>
      <td>p-value</td>
      <td>0.003876</td>
    </tr>
    <tr>
      <td>-log2(p)</td>
      <td>8.011049</td>
    </tr>
  </tbody>
</table>

<ul>
  <li><strong>Conclusion</strong>: Customers using paperless billing have higher churn risk</li>
</ul>

<p><img src="/assets/images/km_paperless_curve.png" alt="Paperless billing group survival curve" />
<em>Figure 15: Survival curve by paperless billing</em></p>

<hr />

<ol start="15">
  <li><strong>Payment Method (paymentMethod)</strong>
    <ul>
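      <li><strong>Conclusion</strong>: Payment method has a highly significant impact on retention. The two automatic methods (bank transfer, credit card) do not differ from each other (p ≈ 0.80), nor do electronic check and mailed check (p ≈ 0.34), but every automatic-vs-check comparison is highly significant; electronic check is a high-risk payment method</li>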
    </ul>
  </li>
</ol>

<table>
  <thead>
    <tr>
      <th>Group Comparison</th>
      <th>test_statistic</th>
      <th>p-value</th>
      <th>-log2(p)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Bank transfer (automatic) vs Credit card (automatic)</td>
      <td>0.061543</td>
      <td>8.040732e-01</td>
      <td>0.314601</td>
    </tr>
    <tr>
      <td>Bank transfer (automatic) vs Electronic check</td>
      <td>91.191889</td>
      <td>1.303937e-21</td>
      <td>69.377616</td>
    </tr>
    <tr>
      <td>Bank transfer (automatic) vs Mailed check</td>
      <td>43.536998</td>
      <td>4.160192e-11</td>
      <td>34.484559</td>
    </tr>
    <tr>
      <td>Credit card (automatic) vs Electronic check</td>
      <td>79.991082</td>
      <td>3.761035e-19</td>
      <td>61.205504</td>
    </tr>
    <tr>
      <td>Credit card (automatic) vs Mailed check</td>
      <td>39.684613</td>
      <td>2.984678e-10</td>
      <td>31.641706</td>
    </tr>
    <tr>
      <td>Electronic check vs Mailed check</td>
      <td>0.898320</td>
      <td>3.432326e-01</td>
      <td>1.542741</td>
    </tr>
  </tbody>
</table>

<p><img src="/assets/images/km_paymentmethod_curve.png" alt="Payment method group survival curve" />
<em>Figure 16: Survival curve by payment method</em></p>

<h4 id="424-key-findings">4.2.4 Key Findings</h4>

<ol>
  <li>
    <p><strong>Overall customer retention level</strong><br />
The target customer segment has a median survival time of 34 months, meaning 50% of customers churn within 34 months of joining the network, indicating a relatively high overall churn risk.</p>
  </li>
  <li>
    <p><strong>Factors with no significant impact on retention</strong><br />
Gender, senior citizen status, and phone service subscription do not significantly affect customer retention (p &gt; 0.05).</p>
  </li>
  <li><strong>Protective services that significantly extend customer retention</strong><br />
Online backup, online security, tech support, and device protection all significantly reduce churn risk, with:
    <ul>
      <li>Online backup having the strongest effect (log-rank statistic as high as 189.48)</li>
      <li>Online security second</li>
      <li>Tech support also being a core protective factor</li>
    </ul>
  </li>
  <li><strong>Service type differences</strong>
    <ul>
      <li>DSL customers retain significantly better than fiber optic customers; fiber optic customers are a key churn concern.</li>
      <li>Customers with streaming TV and streaming movies have significantly better retention.</li>
    </ul>
  </li>
  <li>
    <p><strong>Impact of customer personal characteristics</strong><br />
Customers with partners or dependents have lower churn risk; family-type customers are more stable.</p>
  </li>
  <li><strong>Billing and payment method risk signals</strong>
    <ul>
      <li>Customers using paperless billing have higher churn risk.</li>
      <li>Electronic check is a high-risk payment method, although its curve is not significantly different from mailed check (p ≈ 0.34); automatic payments (bank transfer/credit card) yield the best retention.</li>
    </ul>
  </li>
  <li><strong>Summary of high-risk customer profile</strong><br />
Customers without a partner, without dependents, using fiber optic internet service, not purchasing value-added services (security/backup/tech support/device protection), and paying by electronic check are the highest churn risk group in this analysis.</li>
</ol>

<hr />

<h2 id="cox-proportional-hazards-model">5. Cox Proportional Hazards Model</h2>

<h3 id="51-analysis-workflow">5.1 Analysis Workflow</h3>

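<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">from</span> <span class="nn">lifelines</span> <span class="kn">import</span> <span class="n">CoxPHFitter</span>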

<span class="c1"># Data preparation and One-Hot encoding
</span><span class="n">encode_cols</span> <span class="o">=</span> <span class="p">[</span><span class="s">'dependents'</span><span class="p">,</span> <span class="s">'internetService'</span><span class="p">,</span> <span class="s">'onlineBackup'</span><span class="p">,</span> <span class="s">'techSupport'</span><span class="p">,</span> <span class="s">'paperlessBilling'</span><span class="p">]</span>
<span class="n">encoded_pd</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">get_dummies</span><span class="p">(</span><span class="n">telco_pd</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">encode_cols</span><span class="p">,</span> <span class="n">prefix</span><span class="o">=</span><span class="n">encode_cols</span><span class="p">,</span> <span class="n">drop_first</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>

<span class="c1"># Select variables
</span><span class="n">survival_pd</span> <span class="o">=</span> <span class="n">encoded_pd</span><span class="p">[[</span><span class="s">'churn'</span><span class="p">,</span> <span class="s">'tenure'</span><span class="p">,</span> <span class="s">'dependents_Yes'</span><span class="p">,</span> 
                          <span class="s">'internetService_DSL'</span><span class="p">,</span> <span class="s">'onlineBackup_Yes'</span><span class="p">,</span> <span class="s">'techSupport_Yes'</span><span class="p">]]</span>

<span class="c1"># Fit Cox model
</span><span class="n">cph</span> <span class="o">=</span> <span class="n">CoxPHFitter</span><span class="p">(</span><span class="n">alpha</span><span class="o">=</span><span class="mf">0.05</span><span class="p">)</span>
<span class="n">cph</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">survival_pd</span><span class="p">,</span> <span class="s">'tenure'</span><span class="p">,</span> <span class="s">'churn'</span><span class="p">)</span>

<span class="c1"># Output results
</span><span class="n">cph</span><span class="p">.</span><span class="n">print_summary</span><span class="p">()</span>
<span class="n">cph</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">hazard_ratios</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="c1"># Proportional hazards assumption test
</span><span class="n">cph</span><span class="p">.</span><span class="n">check_assumptions</span><span class="p">(</span><span class="n">survival_pd</span><span class="p">,</span> <span class="n">p_value_threshold</span><span class="o">=</span><span class="mf">0.05</span><span class="p">,</span> <span class="n">show_plots</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="52-analysis-results">5.2 Analysis Results</h3>

<h4 id="521-model-overview">5.2.1 Model Overview</h4>
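<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>model</td>
      <td>lifelines.CoxPHFitter</td>
    </tr>
    <tr>
      <td>duration col</td>
      <td>tenure</td>
    </tr>
    <tr>
      <td>event col</td>
      <td>churn</td>
    </tr>
    <tr>
      <td>baseline estimation</td>
      <td>breslow</td>
    </tr>
    <tr>
      <td>number of observations</td>
      <td>3351</td>
    </tr>
    <tr>
      <td>number of events observed</td>
      <td>1556</td>
    </tr>
    <tr>
      <td>partial log-likelihood</td>
      <td>-11315.95</td>
    </tr>
    <tr>
      <td>Concordance</td>
      <td>0.64</td>
    </tr>
    <tr>
      <td>Partial AIC</td>
      <td>22639.90</td>
    </tr>
    <tr>
      <td>log-likelihood ratio test</td>
      <td>337.77 (df=4)</td>
    </tr>
    <tr>
      <td>-log2(p) of ll-ratio test</td>
      <td>236.24</td>
    </tr>
  </tbody>
</table>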
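
<p>Two of the summary figures above can be cross-checked by hand. The sketch below recomputes the partial AIC from the partial log-likelihood and the four fitted coefficients, and recovers -log2(p) for the likelihood ratio test from its statistic. It relies on the closed-form chi-square tail for 4 degrees of freedom, so it is a consistency check under that assumption rather than a general routine.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math

partial_log_likelihood = -11315.95
n_coefficients = 4

# Partial AIC = -2 * log-likelihood + 2 * number of fitted coefficients
partial_aic = -2 * partial_log_likelihood + 2 * n_coefficients

# For 4 degrees of freedom the chi-square upper tail has the closed form
# exp(-x / 2) * (1 + x / 2); other degrees of freedom need a statistics library.
lr_statistic = 337.77
lr_p_value = math.exp(-lr_statistic / 2) * (1 + lr_statistic / 2)
lr_surprise_bits = -math.log2(lr_p_value)

print(round(partial_aic, 2), round(lr_surprise_bits, 2))
</code></pre></div></div>

<p>Both values should agree with the reported Partial AIC of 22639.90 and -log2(p) of about 236.24, up to rounding of the inputs.</p>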

<h4 id="522-model-coefficients-and-hazard-ratio-analysis">5.2.2 Model Coefficients and Hazard Ratio Analysis</h4>

<p><img src="/assets/images/cox_hazard_ratios.png" alt="Cox model hazard ratios" />
<em>Figure 17: Cox model hazard ratios (HR&lt;1 indicates protective factor, with 95% CI)</em></p>

<table>
  <thead>
    <tr>
      <th>Variable</th>
      <th>coef</th>
      <th>exp(coef)</th>
      <th>se(coef)</th>
      <th>coef 95% CI</th>
      <th>exp(coef) 95% CI</th>
      <th>z</th>
      <th>p-value</th>
      <th>-log2(p)</th>
      <th>Significance</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>dependents_Yes</td>
      <td>-0.33</td>
      <td>0.72</td>
      <td>0.07</td>
      <td>[-0.47, -0.19]</td>
      <td>[0.63, 0.83]</td>
      <td>-4.64</td>
      <td>&lt;0.005</td>
      <td>18.12</td>
      <td>***</td>
    </tr>
    <tr>
      <td>internetService_DSL</td>
      <td>-0.22</td>
      <td>0.80</td>
      <td>0.06</td>
      <td>[-0.33, -0.10]</td>
      <td>[0.72, 0.90]</td>
      <td>-3.68</td>
      <td>&lt;0.005</td>
      <td>12.07</td>
      <td>***</td>
    </tr>
    <tr>
      <td>onlineBackup_Yes</td>
      <td>-0.78</td>
      <td>0.46</td>
      <td>0.06</td>
      <td>[-0.89, -0.66]</td>
      <td>[0.41, 0.52]</td>
      <td>-13.13</td>
      <td>&lt;0.005</td>
      <td>128.37</td>
      <td>***</td>
    </tr>
    <tr>
      <td>techSupport_Yes</td>
      <td>-0.64</td>
      <td>0.53</td>
      <td>0.08</td>
      <td>[-0.79, -0.49]</td>
      <td>[0.46, 0.61]</td>
      <td>-8.48</td>
      <td>&lt;0.005</td>
      <td>55.36</td>
      <td>***</td>
    </tr>
  </tbody>
</table>

<p><strong>Significance markers</strong>: *** p&lt;0.001, ** p&lt;0.01, * p&lt;0.05</p>
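
<p>The exp(coef) column is simply the exponential of the coef column, and (1 - HR) gives the relative reduction in churn hazard compared with the reference group. The short sketch below recomputes both from the rounded coefficients in the table; small differences from the fitted model are possible because the inputs are rounded to two decimals.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math

# Coefficients copied from the table above (rounded to two decimals)
coefs = {
    "dependents_Yes": -0.33,
    "internetService_DSL": -0.22,
    "onlineBackup_Yes": -0.78,
    "techSupport_Yes": -0.64,
}

# Hazard ratio for each covariate: HR = exp(coef)
hazard_ratios = {name: math.exp(beta) for name, beta in coefs.items()}

for name, hr in hazard_ratios.items():
    reduction = (1 - hr) * 100   # percent lower hazard vs. the reference group
    print(f"{name}: HR={hr:.2f}, hazard lower by about {reduction:.0f}%")
</code></pre></div></div>

<p>Read this way, online backup is associated with roughly half the churn hazard of customers without it, holding the other three covariates fixed.</p>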

<p><img src="/assets/images/schoenfeld_residuals_techSupport.png" alt="Scaled Schoenfeld residuals plot" />
<img src="/assets/images/schoenfeld_residuals_dependentYes.png" alt="Scaled Schoenfeld residuals plot" />
<img src="/assets/images/schoenfeld_residuals_internetService.png" alt="Scaled Schoenfeld residuals plot" />
<img src="/assets/images/schoenfeld_residuals_onlineBackup.png" alt="Scaled Schoenfeld residuals plot" /><br />
<em>Figure 18: Scaled Schoenfeld residual plots for each variable (with both rank and km time transformation methods)</em></p>

<h4 id="523-proportional-hazards-assumption-test-results">5.2.3 Proportional Hazards Assumption Test Results</h4>

<table>
  <thead>
    <tr>
      <th>Variable</th>
      <th>Test Method</th>
      <th>Test Statistic</th>
      <th>p-value</th>
      <th>-log2(p)</th>
      <th>Assumption Check</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>dependents_Yes</td>
      <td>km</td>
      <td>1.48</td>
      <td>0.22</td>
      <td>2.16</td>
      <td>Pass</td>
    </tr>
    <tr>
      <td>dependents_Yes</td>
      <td>rank</td>
      <td>0.81</td>
      <td>0.37</td>
      <td>1.44</td>
      <td>Pass</td>
    </tr>
    <tr>
      <td>internetService_DSL</td>
      <td>km</td>
      <td>20.98</td>
      <td>&lt;0.005</td>
      <td>17.72</td>
      <td>Violated</td>
    </tr>
    <tr>
      <td>internetService_DSL</td>
      <td>rank</td>
      <td>26.71</td>
      <td>&lt;0.005</td>
      <td>22.01</td>
      <td>Violated</td>
    </tr>
    <tr>
      <td>onlineBackup_Yes</td>
      <td>km</td>
      <td>17.80</td>
      <td>&lt;0.005</td>
      <td>15.31</td>
      <td>Violated</td>
    </tr>
    <tr>
      <td>onlineBackup_Yes</td>
      <td>rank</td>
      <td>17.47</td>
      <td>&lt;0.005</td>
      <td>15.07</td>
      <td>Violated</td>
    </tr>
    <tr>
      <td>techSupport_Yes</td>
      <td>km</td>
      <td>8.09</td>
      <td>&lt;0.005</td>
      <td>7.81</td>
      <td>Violated</td>
    </tr>
    <tr>
      <td>techSupport_Yes</td>
      <td>rank</td>
      <td>13.76</td>
      <td>&lt;0.005</td>
      <td>12.23</td>
      <td>Violated</td>
    </tr>
  </tbody>
</table>

<p>The following variables violate the proportional hazards assumption:</p>
<ol>
  <li><strong>internetService_DSL</strong>: p-value &lt; 5e-05</li>
  <li><strong>onlineBackup_Yes</strong>: p-value &lt; 5e-05</li>
  <li><strong>techSupport_Yes</strong>: p-value = 0.0002</li>
</ol>
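
<p>The p-values and -log2(p) entries in the test table can be recovered from the test statistics alone. The sketch below is an illustration, not part of the original analysis: it assumes each statistic is chi-squared with one degree of freedom, which the reported -log2(p) values are consistent with. The helper name <code class="language-plaintext highlighter-rouge">chi2_df1_p_value</code> is hypothetical.</p>

```python
import math


def chi2_df1_p_value(statistic):
    """Upper-tail p-value of a chi-squared statistic with one degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


# (variable, time transform, test statistic) copied from the table above
ph_tests = [
    ("dependents_Yes", "km", 1.48),
    ("dependents_Yes", "rank", 0.81),
    ("internetService_DSL", "km", 20.98),
    ("internetService_DSL", "rank", 26.71),
    ("onlineBackup_Yes", "km", 17.80),
    ("onlineBackup_Yes", "rank", 17.47),
    ("techSupport_Yes", "km", 8.09),
    ("techSupport_Yes", "rank", 13.76),
]

ph_results = []
for variable, transform, statistic in ph_tests:
    p = chi2_df1_p_value(statistic)
    neg_log2_p = -math.log2(p)
    ph_results.append((variable, transform, p, neg_log2_p))
    print(f"{variable:22s} {transform:5s} p={p:.2e}  -log2(p)={neg_log2_p:.2f}")
```

<p>Because the inputs are rounded to two decimals, the recomputed values can differ from the table in the last digit.</p>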

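<p><strong>Remedial suggestion</strong>: Refit the model with <code class="language-plaintext highlighter-rouge">strata=['internetService_DSL', 'onlineBackup_Yes', 'techSupport_Yes']</code> so that each combination of the violating variables gets its own baseline hazard. The trade-off is that stratified variables no longer receive coefficients, so the stratified model cannot report hazard ratios for them.</p>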

<p><img src="/assets/images/loglog_km_curves-1.png" alt="Log-log KM curves" />
<img src="/assets/images/loglog_km_curves-2.png" alt="Log-log KM curves" />
<img src="/assets/images/loglog_km_curves-3.png" alt="Log-log KM curves" />
<img src="/assets/images/loglog_km_curves-4.png" alt="Log-log KM curves" /><br />
<em>Figure 19: Log-log Kaplan-Meier curves for each variable group, used to visually verify the proportional hazards assumption</em></p>

<h4 id="524-key-findings">5.2.4 Key Findings</h4>
<ol>
  <li><strong>Protective factors (reducing customer churn risk)</strong><br />
All variables included in this model are protective factors against customer churn, ordered by effect strength as follows:
    <ul>
      <li><strong>onlineBackup_Yes</strong>: HR=0.46, reduces customer churn risk by 54.0% (p&lt;0.001) – strongest churn inhibition factor</li>
      <li><strong>techSupport_Yes</strong>: HR=0.53, reduces customer churn risk by 47.2% (p&lt;0.001)</li>
      <li><strong>dependents_Yes</strong>: HR=0.72, reduces customer churn risk by 28.0% (p&lt;0.001)</li>
      <li><strong>internetService_DSL</strong>: HR=0.80, reduces customer churn risk by 19.5% (p&lt;0.001)</li>
    </ul>
  </li>
  <li><strong>Risk factors</strong><br />
In the Cox regression model constructed in this study, no risk factors with HR&gt;1.2 and statistical significance were identified. All included features showed a positive effect on customer retention.</li>
</ol>
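
<p>The link between a Cox coefficient, its hazard ratio, and the quoted risk reduction is HR = exp(coef) and reduction = 1 - HR. The following minimal sketch applies that conversion to the rounded coefficients from the table in 5.2.2. The function names are illustrative, and because the inputs are rounded, the percentages can differ by a few tenths from those listed above.</p>

```python
import math

# Rounded coefficients copied from the Cox coefficient table (section 5.2.2)
cox_coefs = {
    "onlineBackup_Yes": -0.78,
    "techSupport_Yes": -0.64,
    "dependents_Yes": -0.33,
    "internetService_DSL": -0.22,
}


def hazard_ratio(coef):
    """Hazard ratio implied by a Cox log-hazard coefficient."""
    return math.exp(coef)


def hazard_reduction_pct(coef):
    """Percentage reduction in the churn hazard relative to the reference group."""
    return (1.0 - math.exp(coef)) * 100.0


# Most negative coefficient first, i.e. strongest protective effect first
for name, coef in sorted(cox_coefs.items(), key=lambda item: item[1]):
    print(f"{name:22s} HR={hazard_ratio(coef):.2f}  reduction={hazard_reduction_pct(coef):.1f}%")
```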

<hr />

<h2 id="accelerated-failure-time-model-aft">6. Accelerated Failure Time Model (AFT)</h2>

<h3 id="61-analysis-workflow">6.1 Analysis Workflow</h3>

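<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">from</span> <span class="nn">lifelines</span> <span class="kn">import</span> <span class="n">LogLogisticAFTFitter</span>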

<span class="c1"># Data preparation and One-Hot encoding
</span><span class="n">encode_cols</span> <span class="o">=</span> <span class="p">[</span><span class="s">'partner'</span><span class="p">,</span> <span class="s">'multipleLines'</span><span class="p">,</span> <span class="s">'internetService'</span><span class="p">,</span> <span class="s">'onlineSecurity'</span><span class="p">,</span> 
               <span class="s">'onlineBackup'</span><span class="p">,</span> <span class="s">'deviceProtection'</span><span class="p">,</span> <span class="s">'techSupport'</span><span class="p">,</span> <span class="s">'paymentMethod'</span><span class="p">]</span>
<span class="n">encoded_pd</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">get_dummies</span><span class="p">(</span><span class="n">telco_pd</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">encode_cols</span><span class="p">,</span> <span class="n">prefix</span><span class="o">=</span><span class="n">encode_cols</span><span class="p">,</span> <span class="n">drop_first</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>

<span class="c1"># Select variables
</span><span class="n">survival_pd</span> <span class="o">=</span> <span class="n">encoded_pd</span><span class="p">[[</span><span class="s">'churn'</span><span class="p">,</span> <span class="s">'tenure'</span><span class="p">,</span> <span class="s">'partner_Yes'</span><span class="p">,</span> <span class="s">'multipleLines_Yes'</span><span class="p">,</span>
                          <span class="s">'internetService_DSL'</span><span class="p">,</span> <span class="s">'onlineSecurity_Yes'</span><span class="p">,</span> <span class="s">'onlineBackup_Yes'</span><span class="p">,</span>
                          <span class="s">'deviceProtection_Yes'</span><span class="p">,</span> <span class="s">'techSupport_Yes'</span><span class="p">,</span>
                          <span class="s">'paymentMethod_Bank transfer (automatic)'</span><span class="p">,</span>
                          <span class="s">'paymentMethod_Credit card (automatic)'</span><span class="p">]]</span>

<span class="c1"># Fit LogLogistic AFT model
</span><span class="n">aft</span> <span class="o">=</span> <span class="n">LogLogisticAFTFitter</span><span class="p">()</span>
<span class="n">aft</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">survival_pd</span><span class="p">,</span> <span class="n">duration_col</span><span class="o">=</span><span class="s">'tenure'</span><span class="p">,</span> <span class="n">event_col</span><span class="o">=</span><span class="s">'churn'</span><span class="p">)</span>

<span class="c1"># Output results
# Note: lifelines reports median_survival_time_ on the time scale (months of tenure);
# np.exp() is kept here as in the tutorial and exponentiates that value again.
</span><span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Median Survival Time:</span><span class="si">{</span><span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">aft</span><span class="p">.</span><span class="n">median_survival_time_</span><span class="p">)</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="n">aft</span><span class="p">.</span><span class="n">print_summary</span><span class="p">()</span>
<span class="n">aft</span><span class="p">.</span><span class="n">plot</span><span class="p">()</span>
</code></pre></div></div>

<h3 id="62-analysis-results">6.2 Analysis Results</h3>

<h4 id="621-model-results">6.2.1 Model Results</h4>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>model</td>
      <td>lifelines.LogLogisticAFTFitter</td>
    </tr>
    <tr>
      <td>duration col</td>
      <td>tenure</td>
    </tr>
    <tr>
      <td>event col</td>
      <td>churn</td>
    </tr>
    <tr>
      <td>number of observations</td>
      <td>3351</td>
    </tr>
    <tr>
      <td>number of events observed</td>
      <td>1556</td>
    </tr>
    <tr>
      <td>log-likelihood</td>
      <td>-6838.36</td>
    </tr>
    <tr>
      <td>Concordance</td>
      <td>0.73</td>
    </tr>
    <tr>
      <td>AIC</td>
      <td>13698.72</td>
    </tr>
    <tr>
      <td>log-likelihood ratio test</td>
      <td>877.49 (df=9)</td>
    </tr>
    <tr>
      <td>-log2(p) of ll-ratio test</td>
      <td>605.78</td>
    </tr>
  </tbody>
</table>

<p><img src="/assets/images/aft_model_plot.png" alt="AFT model coefficient plot" />
<em>Figure 20: AFT model coefficients and confidence intervals</em></p>

<h4 id="622-aft-model-coefficient-table">6.2.2 AFT Model Coefficient Table</h4>

<table>
  <thead>
    <tr>
      <th>Variable</th>
      <th>coef</th>
      <th>exp(coef)</th>
      <th>se(coef)</th>
      <th>z</th>
      <th>p</th>
      <th>-log2(p)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>deviceProtection_Yes</td>
      <td>0.48</td>
      <td>1.62</td>
      <td>0.07</td>
      <td>6.88</td>
      <td>&lt;0.005</td>
      <td>-</td>
    </tr>
    <tr>
      <td>internetService_DSL</td>
      <td>0.38</td>
      <td>1.47</td>
      <td>0.08</td>
      <td>4.98</td>
      <td>&lt;0.005</td>
      <td>-</td>
    </tr>
    <tr>
      <td>multipleLines_Yes</td>
      <td>0.66</td>
      <td>1.94</td>
      <td>0.07</td>
      <td>9.64</td>
      <td>&lt;0.005</td>
      <td>-</td>
    </tr>
    <tr>
      <td>onlineBackup_Yes</td>
      <td>0.81</td>
      <td>2.25</td>
      <td>0.07</td>
      <td>11.63</td>
      <td>&lt;0.005</td>
      <td>-</td>
    </tr>
    <tr>
      <td>onlineSecurity_Yes</td>
      <td>0.86</td>
      <td>2.37</td>
      <td>0.09</td>
      <td>10.12</td>
      <td>&lt;0.005</td>
      <td>-</td>
    </tr>
    <tr>
      <td>partner_Yes</td>
      <td>0.68</td>
      <td>1.97</td>
      <td>0.07</td>
      <td>10.21</td>
      <td>&lt;0.005</td>
      <td>-</td>
    </tr>
    <tr>
      <td>paymentMethod_Bank transfer</td>
      <td>0.74</td>
      <td>2.10</td>
      <td>0.09</td>
      <td>8.05</td>
      <td>&lt;0.005</td>
      <td>-</td>
    </tr>
    <tr>
      <td>paymentMethod_Credit card</td>
      <td>0.80</td>
      <td>2.22</td>
      <td>0.10</td>
      <td>8.36</td>
      <td>&lt;0.005</td>
      <td>-</td>
    </tr>
    <tr>
      <td>techSupport_Yes</td>
      <td>0.69</td>
      <td>1.99</td>
      <td>0.09</td>
      <td>7.90</td>
      <td>&lt;0.005</td>
      <td>-</td>
    </tr>
    <tr>
      <td>Intercept</td>
      <td>1.59</td>
      <td>4.91</td>
      <td>0.07</td>
      <td>24.47</td>
      <td>&lt;0.005</td>
      <td>-</td>
    </tr>
    <tr>
      <td>beta_Intercept</td>
      <td>0.12</td>
      <td>1.13</td>
      <td>0.02</td>
      <td>5.71</td>
      <td>&lt;0.005</td>
      <td>-</td>
    </tr>
  </tbody>
</table>
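
<p>In a log-logistic AFT model the median survival time equals exp(Intercept + Σ coef·x), so each exp(coef) acts as a time ratio: it multiplies the median survival time of a customer who has that attribute. The sketch below illustrates this with the rounded coefficients from the table. The dictionary and function names are hypothetical, and the numbers are approximate because of rounding.</p>

```python
import math

# Rounded values copied from the AFT coefficient table above
aft_intercept = 1.59
aft_coefs = {
    "deviceProtection_Yes": 0.48,
    "internetService_DSL": 0.38,
    "multipleLines_Yes": 0.66,
    "onlineBackup_Yes": 0.81,
    "onlineSecurity_Yes": 0.86,
    "partner_Yes": 0.68,
    "paymentMethod_Bank transfer": 0.74,
    "paymentMethod_Credit card": 0.80,
    "techSupport_Yes": 0.69,
}


def loglogistic_median(covariates):
    """Median survival time (months) for a customer given 0/1 covariate values."""
    linear = aft_intercept + sum(aft_coefs[name] * value for name, value in covariates.items())
    return math.exp(linear)


baseline_median = loglogistic_median({})  # every indicator set to 0
backup_median = loglogistic_median({"onlineBackup_Yes": 1})
time_ratio = backup_median / baseline_median  # equals exp(0.81)

print(f"baseline median: {baseline_median:.2f} months")
print(f"with online backup: {backup_median:.2f} months (time ratio {time_ratio:.2f})")
```

<p>Read this way, a positive AFT coefficient lengthens tenure, which agrees in direction with the protective (HR&lt;1) effects found by the Cox model.</p>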

<h4 id="623-model-assumption-verification---log-odds-plots">6.2.3 Model Assumption Verification - Log-odds Plots</h4>

<p><img src="/assets/images/logodds_partner.png" alt="Log-odds plot - partner" />
<em>Figure 21: Log-odds plot (partner)</em></p>

<p><img src="/assets/images/logodds_multiplelines.png" alt="Log-odds plot - multipleLines" />
<em>Figure 22: Log-odds plot (multipleLines)</em></p>

<p><img src="/assets/images/logodds_internet.png" alt="Log-odds plot - internetService" />
<em>Figure 23: Log-odds plot (internetService)</em></p>

<p><img src="/assets/images/logodds_onlineSecurity.png" alt="Log-odds plot - onlineSecurity" />
<em>Figure 24: Log-odds plot (onlineSecurity)</em></p>

<p><img src="/assets/images/logodds_onlineBackup.png" alt="Log-odds plot - onlineBackup" />
<em>Figure 25: Log-odds plot (onlineBackup)</em></p>

<p><img src="/assets/images/logodds_deviceProtection.png" alt="Log-odds plot - deviceProtection" />
<em>Figure 26: Log-odds plot (deviceProtection)</em></p>

<p><img src="/assets/images/logodds_techSupport.png" alt="Log-odds plot - techSupport" />
<em>Figure 27: Log-odds plot (techSupport)</em></p>

<p><img src="/assets/images/logodds_paymentMethod.png" alt="Log-odds plot - paymentMethod" />
<em>Figure 28: Log-odds plot (paymentMethod)</em></p>

<h4 id="624-reliability-warnings">6.2.4 Reliability Warnings</h4>
<ul>
  <li><strong>Warning 1</strong>: The predicted median survival time (135.5 months) exceeds 1.5 times the maximum observed tenure (72 months).</li>
  <li><strong>Warning 2</strong>: The prediction differs sharply from the Kaplan-Meier median (34 months): the ratio is 3.99.</li>
</ul>
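
<p>Both warnings reduce to simple arithmetic on three numbers from this report. The sketch below shows one way such a sanity check could be written. The 1.5 factor comes from Warning 1; the variable names are illustrative and not taken from the original notebook.</p>

```python
# Figures reported in this section
predicted_median = 135.5   # months, AFT output as reported
km_median = 34.0           # months, Kaplan-Meier estimate
max_tenure = 72.0          # months, longest observed tenure
range_factor = 1.5         # tolerance used in Warning 1

# Warning 1: prediction lies far outside the observed data range
exceeds_range = predicted_median > range_factor * max_tenure

# Warning 2: ratio between the parametric and non-parametric medians
km_ratio = predicted_median / km_median

print(f"limit = {range_factor * max_tenure:.1f} months, exceeded: {exceeds_range}")
print(f"AFT / KM ratio = {km_ratio:.2f}")
```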

<h4 id="625-recommendation">6.2.5 Recommendation</h4>
<p><strong>Do not use AFT model results for business decisions. Use Kaplan-Meier results (34 months) instead.</strong></p>

<hr />

<h2 id="customer-lifetime-value-clv">7. Customer Lifetime Value (CLV)</h2>

<h3 id="71-calculation-workflow">7.1 Calculation Workflow</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>

<span class="k">def</span> <span class="nf">calculate_customer_lifetime_value</span><span class="p">(</span><span class="n">cph</span><span class="p">,</span> <span class="n">monthly_profit</span><span class="o">=</span><span class="mi">30</span><span class="p">,</span> <span class="n">discount_rate</span><span class="o">=</span><span class="mf">0.10</span><span class="p">):</span>
    <span class="c1"># Define baseline customer
</span>    <span class="n">baseline_customer</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">([{</span>
        <span class="s">'dependents_Yes'</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span> <span class="s">'internetService_DSL'</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
        <span class="s">'onlineBackup_Yes'</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span> <span class="s">'techSupport_Yes'</span><span class="p">:</span> <span class="mi">0</span>
    <span class="p">}])</span>
    
    <span class="c1"># Monthly discount rate: the annual rate split evenly over 12 months
</span>    <span class="n">irr</span> <span class="o">=</span> <span class="n">discount_rate</span> <span class="o">/</span> <span class="mi">12</span>
    <span class="n">survival_func</span> <span class="o">=</span> <span class="n">cph</span><span class="p">.</span><span class="n">predict_survival_function</span><span class="p">(</span><span class="n">baseline_customer</span><span class="p">)</span>
    
    <span class="c1"># Build cohort table
</span>    <span class="n">cohort_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">([</span><span class="mf">1.00</span><span class="p">]),</span> <span class="nb">round</span><span class="p">(</span><span class="n">survival_func</span><span class="p">,</span> <span class="mi">2</span><span class="p">)])</span>
    <span class="n">cohort_df</span> <span class="o">=</span> <span class="n">cohort_df</span><span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="mi">0</span><span class="p">:</span> <span class="s">'Survival Probability'</span><span class="p">})</span>
    <span class="n">cohort_df</span><span class="p">[</span><span class="s">'Contract Month'</span><span class="p">]</span> <span class="o">=</span> <span class="n">cohort_df</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">astype</span><span class="p">(</span><span class="s">'int'</span><span class="p">)</span>
    <span class="n">cohort_df</span><span class="p">[</span><span class="s">'Monthly Profit for the Selected Plan'</span><span class="p">]</span> <span class="o">=</span> <span class="n">monthly_profit</span>
    <span class="n">cohort_df</span><span class="p">[</span><span class="s">'Avg Expected Monthly Profit'</span><span class="p">]</span> <span class="o">=</span> <span class="nb">round</span><span class="p">(</span><span class="n">cohort_df</span><span class="p">[</span><span class="s">'Survival Probability'</span><span class="p">]</span> <span class="o">*</span> <span class="n">monthly_profit</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
    <span class="n">cohort_df</span><span class="p">[</span><span class="s">'NPV of Avg Expected Monthly Profit'</span><span class="p">]</span> <span class="o">=</span> <span class="nb">round</span><span class="p">(</span>
        <span class="n">cohort_df</span><span class="p">[</span><span class="s">'Avg Expected Monthly Profit'</span><span class="p">]</span> <span class="o">/</span> <span class="p">((</span><span class="mi">1</span> <span class="o">+</span> <span class="n">irr</span><span class="p">)</span> <span class="o">**</span> <span class="n">cohort_df</span><span class="p">[</span><span class="s">'Contract Month'</span><span class="p">]),</span> <span class="mi">2</span>
    <span class="p">)</span>
    <span class="n">cohort_df</span><span class="p">[</span><span class="s">'Cumulative NPV'</span><span class="p">]</span> <span class="o">=</span> <span class="n">cohort_df</span><span class="p">[</span><span class="s">'NPV of Avg Expected Monthly Profit'</span><span class="p">].</span><span class="n">cumsum</span><span class="p">()</span>
    <span class="n">cohort_df</span><span class="p">[</span><span class="s">'Contract Month'</span><span class="p">]</span> <span class="o">=</span> <span class="n">cohort_df</span><span class="p">[</span><span class="s">'Contract Month'</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span>
    
    <span class="k">return</span> <span class="n">cohort_df</span><span class="p">.</span><span class="n">set_index</span><span class="p">(</span><span class="s">'Contract Month'</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="72-calculation-results">7.2 Calculation Results</h3>

<h4 id="721-calculation-parameters">7.2.1 Calculation Parameters</h4>
<ul>
  <li><strong>Monthly profit per customer</strong>: $30</li>
  <li><strong>Annual discount rate</strong>: 10%</li>
  <li><strong>Monthly discount rate</strong>: 0.83%</li>
  <li><strong>Forecast time horizon</strong>: 72 months</li>
</ul>

<h4 id="722-clv-key-node-results">7.2.2 CLV Key Node Results</h4>

<table>
  <thead>
    <tr>
      <th>Time Horizon</th>
      <th>Cumulative NPV (CLV)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>12 months</td>
      <td>$251.40</td>
    </tr>
    <tr>
      <td>24 months</td>
      <td>$405.44</td>
    </tr>
    <tr>
      <td>36 months</td>
      <td>$515.01</td>
    </tr>
    <tr>
      <td>Lifetime CLV</td>
      <td>$626.69</td>
    </tr>
  </tbody>
</table>

<p><img src="/assets/images/clv_payback_period.png" alt="CLV payback period analysis" />
<em>Figure 29: Payback period analysis</em></p>

<p><img src="/assets/images/clv_survival_curve.png" alt="CLV survival probability curve" />
<em>Figure 30: Survival probability curve</em></p>

<h4 id="723-clv-trend-table-complete-data-for-first-25-months">7.2.3 CLV Trend Table (complete data for first 25 months)</h4>

<table>
  <thead>
    <tr>
      <th>Contract Month</th>
      <th>Survival Probability</th>
      <th>Monthly Profit</th>
      <th>Avg Expected Monthly Profit</th>
      <th>NPV</th>
      <th>Cumulative NPV</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>1.00</td>
      <td>30</td>
      <td>30.00</td>
      <td>30.00</td>
      <td>30.00</td>
    </tr>
    <tr>
      <td>2</td>
      <td>0.87</td>
      <td>30</td>
      <td>26.10</td>
      <td>25.88</td>
      <td>55.88</td>
    </tr>
    <tr>
      <td>3</td>
      <td>0.81</td>
      <td>30</td>
      <td>24.30</td>
      <td>23.90</td>
      <td>79.78</td>
    </tr>
    <tr>
      <td>4</td>
      <td>0.77</td>
      <td>30</td>
      <td>23.10</td>
      <td>22.53</td>
      <td>102.31</td>
    </tr>
    <tr>
      <td>5</td>
      <td>0.74</td>
      <td>30</td>
      <td>22.20</td>
      <td>21.48</td>
      <td>123.79</td>
    </tr>
    <tr>
      <td>6</td>
      <td>0.71</td>
      <td>30</td>
      <td>21.30</td>
      <td>20.43</td>
      <td>144.22</td>
    </tr>
    <tr>
      <td>7</td>
      <td>0.69</td>
      <td>30</td>
      <td>20.70</td>
      <td>19.69</td>
      <td>163.91</td>
    </tr>
    <tr>
      <td>8</td>
      <td>0.67</td>
      <td>30</td>
      <td>20.10</td>
      <td>18.97</td>
      <td>182.88</td>
    </tr>
    <tr>
      <td>9</td>
      <td>0.65</td>
      <td>30</td>
      <td>19.50</td>
      <td>18.25</td>
      <td>201.13</td>
    </tr>
    <tr>
      <td>10</td>
      <td>0.63</td>
      <td>30</td>
      <td>18.90</td>
      <td>17.54</td>
      <td>218.67</td>
    </tr>
    <tr>
      <td>11</td>
      <td>0.60</td>
      <td>30</td>
      <td>18.00</td>
      <td>16.57</td>
      <td>235.24</td>
    </tr>
    <tr>
      <td>12</td>
      <td>0.59</td>
      <td>30</td>
      <td>17.70</td>
      <td>16.16</td>
      <td>251.40</td>
    </tr>
    <tr>
      <td>13</td>
      <td>0.57</td>
      <td>30</td>
      <td>17.10</td>
      <td>15.48</td>
      <td>266.88</td>
    </tr>
    <tr>
      <td>14</td>
      <td>0.55</td>
      <td>30</td>
      <td>16.50</td>
      <td>14.81</td>
      <td>281.69</td>
    </tr>
    <tr>
      <td>15</td>
      <td>0.54</td>
      <td>30</td>
      <td>16.20</td>
      <td>14.42</td>
      <td>296.11</td>
    </tr>
    <tr>
      <td>16</td>
      <td>0.52</td>
      <td>30</td>
      <td>15.60</td>
      <td>13.77</td>
      <td>309.88</td>
    </tr>
    <tr>
      <td>17</td>
      <td>0.51</td>
      <td>30</td>
      <td>15.30</td>
      <td>13.40</td>
      <td>323.28</td>
    </tr>
    <tr>
      <td>18</td>
      <td>0.50</td>
      <td>30</td>
      <td>15.00</td>
      <td>13.03</td>
      <td>336.31</td>
    </tr>
    <tr>
      <td>19</td>
      <td>0.48</td>
      <td>30</td>
      <td>14.40</td>
      <td>12.40</td>
      <td>348.71</td>
    </tr>
    <tr>
      <td>20</td>
      <td>0.47</td>
      <td>30</td>
      <td>14.10</td>
      <td>12.04</td>
      <td>360.75</td>
    </tr>
    <tr>
      <td>21</td>
      <td>0.46</td>
      <td>30</td>
      <td>13.80</td>
      <td>11.69</td>
      <td>372.44</td>
    </tr>
    <tr>
      <td>22</td>
      <td>0.45</td>
      <td>30</td>
      <td>13.50</td>
      <td>11.34</td>
      <td>383.78</td>
    </tr>
    <tr>
      <td>23</td>
      <td>0.44</td>
      <td>30</td>
      <td>13.20</td>
      <td>11.00</td>
      <td>394.78</td>
    </tr>
    <tr>
      <td>24</td>
      <td>0.43</td>
      <td>30</td>
      <td>12.90</td>
      <td>10.66</td>
      <td>405.44</td>
    </tr>
    <tr>
      <td>25</td>
      <td>0.42</td>
      <td>30</td>
      <td>12.60</td>
      <td>10.32</td>
      <td>415.76</td>
    </tr>
  </tbody>
</table>
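
<p>The NPV columns can be reproduced without the fitted model. The sketch below starts from the rounded survival probabilities in the trend table and applies the same steps as the workflow in 7.1: expected profit = survival probability × monthly profit, discounted by (1 + monthly rate) raised to the number of months elapsed, then accumulated. It is a plain-Python illustration with hypothetical variable names, not the original pandas code.</p>

```python
# Survival probabilities for contract months 1..25, copied from the trend table
survival = [
    1.00, 0.87, 0.81, 0.77, 0.74, 0.71, 0.69, 0.67, 0.65, 0.63,
    0.60, 0.59, 0.57, 0.55, 0.54, 0.52, 0.51, 0.50, 0.48, 0.47,
    0.46, 0.45, 0.44, 0.43, 0.42,
]
monthly_profit = 30
monthly_rate = 0.10 / 12  # 10% annual discount rate

cumulative_npv = []
running_total = 0.0
for months_elapsed, probability in enumerate(survival):
    expected_profit = round(probability * monthly_profit, 2)
    # Contract month 1 is undiscounted because months_elapsed starts at 0
    npv = round(expected_profit / (1 + monthly_rate) ** months_elapsed, 2)
    running_total = round(running_total + npv, 2)
    cumulative_npv.append(running_total)

for contract_month in (12, 13, 24, 25):
    print(f"contract month {contract_month}: {cumulative_npv[contract_month - 1]:.2f}")
```

<p>Here <code class="language-plaintext highlighter-rouge">cumulative_npv[11]</code> is the value through contract month 12 and <code class="language-plaintext highlighter-rouge">cumulative_npv[23]</code> the value through contract month 24, matching the last column of the table.</p>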

<h4 id="724-key-findings">7.2.4 Key Findings</h4>
<ol>
  <li><strong>Customer Lifetime Value (CLV)</strong>: The cumulative net present value (NPV) for the baseline customer over 72 months is <strong>$626.69</strong>, a core reference metric for setting customer acquisition cost limits.</li>
  <li><strong>Revenue growth trend</strong>: CLV reaches $251.40 after 12 contract months, $405.44 after 24, and $515.01 after 36, after which growth slows. The early period therefore contributes most of the value.</li>
  <li><strong>Survival probability decay</strong>: Customer survival probability continuously declines over time, from 1.00 in the first month to 0.43 by 24 months, reflecting the long-term trend of customer churn.</li>
  <li><strong>Impact of expected profit and discounting</strong>: Due to decaying survival probability and the discount rate, the average expected monthly profit per customer gradually declines from $30.00 in the first month to $12.90 by 24 months, and the growth rate of NPV also slows.</li>
  <li><strong>Business decision recommendations</strong>: Customer acquisition cost (CAC) should be controlled within 30% of CLV (approximately $188) to ensure profitability of customer relationships; at the same time, focus on implementing customer retention strategies within the first 24 months to maximize long-term customer value.</li>
</ol>

<hr />

<h2 id="conclusion">8. Conclusion</h2>

<h3 id="81-model-applicability-and-reliability-assessment">8.1 Model Applicability and Reliability Assessment</h3>

<p>Based on the IBM Telco Customer Churn dataset, this study systematically quantifies churn behavior of month-to-month internet service customers using Kaplan-Meier estimation, Cox proportional hazards regression, Accelerated Failure Time (AFT) models, and the Customer Lifetime Value (CLV) framework. Main model evaluation conclusions are as follows:</p>

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>Reliability</th>
      <th>Primary Use</th>
      <th>Key Output</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Kaplan-Meier estimation</td>
      <td>✅ Highly reliable</td>
      <td>Non-parametric survival function estimation</td>
      <td>Median survival time: 34 months</td>
    </tr>
    <tr>
      <td>Cox proportional hazards model</td>
      <td>✅ Reliable</td>
      <td>Multi-factor hazard ratio analysis</td>
      <td>Concordance Index: 0.64; HR(onlineBackup)=0.46</td>
    </tr>
    <tr>
      <td>LogLogistic AFT model</td>
      <td>❌ Unreliable</td>
      <td>Parametric survival time prediction</td>
      <td>Predicted median survival time 135.5 months (beyond observation range)</td>
    </tr>
    <tr>
      <td>CLV framework</td>
      <td>✅ Usable</td>
      <td>Long-term customer value quantification</td>
      <td>72-month cumulative NPV: $626.69</td>
    </tr>
  </tbody>
</table>

<p><strong>Overall judgment</strong>: The Kaplan-Meier and Cox models provide the core analytical conclusions of this study. The AFT model is not suitable for business decisions because it extrapolates beyond the supported data range. The CLV framework is informative but depends on the predictive ability of the Cox model.</p>

<h3 id="82-core-empirical-findings">8.2 Core Empirical Findings</h3>

<h4 id="1-overall-customer-retention-level">(1) Overall customer retention level</h4>

<p>The target customer segment (month-to-month + internet service users) has a median survival time of <strong>34 months</strong>. This indicates that 50% of customers in this segment churn within 34 months after joining the network, representing a relatively high overall churn risk.</p>

<h4 id="2-identification-of-key-protective-factors-based-on-cox-model">(2) Identification of key protective factors (based on Cox model)</h4>

<p>Four variables were identified as significant protective factors, ordered by effect strength:</p>

<table>
  <thead>
    <tr>
      <th>Protective Factor</th>
      <th>Hazard Ratio (HR)</th>
      <th>Reduction in Churn Risk</th>
      <th>Statistical Significance</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>onlineBackup_Yes</td>
      <td>0.46</td>
      <td>54.0%</td>
      <td>p &lt; 0.001</td>
    </tr>
    <tr>
      <td>techSupport_Yes</td>
      <td>0.53</td>
      <td>47.2%</td>
      <td>p &lt; 0.001</td>
    </tr>
    <tr>
      <td>dependents_Yes</td>
      <td>0.72</td>
      <td>28.0%</td>
      <td>p &lt; 0.001</td>
    </tr>
    <tr>
      <td>internetService_DSL</td>
      <td>0.80</td>
      <td>19.5%</td>
      <td>p &lt; 0.001</td>
    </tr>
  </tbody>
</table>

<p>These results indicate that online backup and tech support services are the two most effective interventions for reducing customer churn risk. The Cox model’s Concordance Index is 0.64, indicating moderate discriminative ability.</p>

<h4 id="3-high-risk-customer-profile">(3) High-risk customer profile</h4>

<p>Combining KM group analysis and marginal effects from the Cox model, high-risk churn customers exhibit the following typical characteristics:</p>

<ul>
  <li><strong>Demographic characteristics</strong>: No partner, no dependents</li>
  <li><strong>Service usage characteristics</strong>: Use fiber optic internet service, not subscribed to value-added services such as online backup/online security/device protection/tech support</li>
  <li><strong>Payment behavior characteristics</strong>: Pay by electronic check, use paperless billing</li>
</ul>

<p>Log-rank tests give a test statistic of 135.76 (p ≈ 2.25e-31) for partner status, 35.03 (p ≈ 3.24e-09) for dependents, and 25.17 (p ≈ 5.24e-07) for fiber vs. DSL users; all three between-group differences are statistically significant.</p>

<h4 id="4-customer-lifetime-value-clv">(4) Customer Lifetime Value (CLV)</h4>

<p>The 72-month cumulative NPV for the baseline customer (not subscribed to any value-added services) predicted by the Cox model is <strong>$626.69</strong>. Of this, the first 12 contract months contribute $251.40 (40.1% of total value), and the first 24 contribute $405.44 (64.7% of total value), indicating that customer value is concentrated in the first two years after joining.</p>

<h3 id="83-summary-of-methodological-limitations">8.3 Summary of Methodological Limitations</h3>

<ul>
  <li><strong>Unreliability of the AFT model</strong>: The LogLogistic AFT model predicted a median survival time (135.5 months) far beyond the observed range (0–72 months), 3.99 times the KM estimate (34 months). This deviation likely reflects a high censoring rate combined with a short observation window, which limits the model’s ability to extrapolate.</li>
  <li><strong>Partial violation of proportional hazards assumption</strong>: In the Cox model, the variables <code class="language-plaintext highlighter-rouge">internetService_DSL</code>, <code class="language-plaintext highlighter-rouge">onlineBackup_Yes</code>, and <code class="language-plaintext highlighter-rouge">techSupport_Yes</code> did not pass the proportional hazards test (p &lt; 0.05), suggesting that the effects of these variables may change over time. Stratified Cox or time-varying covariate models are recommended to address this.</li>
  <li><strong>Sample selection bias</strong>: This study includes only month-to-month contract customers who subscribe to internet services. Conclusions cannot be directly generalized to long-term contract customers or those without internet service.</li>
</ul>

<hr />

<h2 id="business-recommendations">9. Business Recommendations</h2>

<h3 id="91-short-term-operational-strategies-06-months">9.1 Short-term Operational Strategies (0–6 months)</h3>

<h4 id="1-value-added-service-promotion-plan">(1) Value-added service promotion plan</h4>

<p>Based on the hazard ratio estimates from the Cox model, online backup (HR=0.46) and tech support (HR=0.53) are the most effective risk mitigation tools. Recommendations:</p>

<ul>
  <li>Implement <strong>bundling strategies</strong> for new customers, offering online backup and tech support as default add-ons to internet service with a first-month free trial.</li>
  <li>Conduct <strong>targeted marketing campaigns</strong> for existing high-risk customers (fiber users, those without partners/dependents) via email, in-app notifications, etc., to promote these services.</li>
  <li>Establish an <strong>A/B testing framework</strong> to quantify the causal effect of interventions on retention.</li>
</ul>
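
<p>As a rough illustration of how the A/B testing framework could quantify an intervention's effect, the sketch below compares retention rates between a control group and a treatment group with a two-proportion z-test. The function name and the counts are hypothetical and are not taken from the report; a production framework would also need randomized assignment and a pre-registered observation period.</p>

```python
from math import erfc, sqrt


def two_proportion_z_test(retained_a, n_a, retained_b, n_b):
    """Compare retention in control (a) and treatment (b) groups.

    Returns the retention lift, the z statistic, and the two-sided p-value.
    """
    p_a = retained_a / n_a
    p_b = retained_b / n_b
    # Pooled retention rate under the null hypothesis of no difference.
    pooled = (retained_a + retained_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = erfc(abs(z) / sqrt(2))
    return p_b - p_a, z, p_value


# Hypothetical counts: 1,000 customers per arm, retained after six months.
lift, z, p_value = two_proportion_z_test(800, 1000, 850, 1000)
print(f"lift={lift:.3f}, z={z:.2f}, p={p_value:.4f}")
```

<p>A survival-based comparison (for example a log-rank test on time to churn) would use the censored data more fully; the fixed-horizon test above is only the simplest starting point.</p>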

<h4 id="2-early-identification-of-high-risk-customers">(2) Early identification of high-risk customers</h4>

<p>Based on median survival time differences from KM group analysis:</p>

<table>
  <thead>
    <tr>
      <th>Risk Dimension</th>
      <th>High-risk Group</th>
      <th>Low-risk Group</th>
      <th>Median Survival Time Difference</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Partner status</td>
      <td>No partner (24 months)</td>
      <td>With partner (49 months)</td>
      <td>25 months</td>
    </tr>
    <tr>
      <td>Dependents status</td>
      <td>No dependents (25 months)</td>
      <td>With dependents (48 months)</td>
      <td>23 months</td>
    </tr>
    <tr>
      <td>Internet service</td>
      <td>Fiber (30 months)</td>
      <td>DSL (52 months)</td>
      <td>22 months</td>
    </tr>
    <tr>
      <td>Tech support</td>
      <td>No (29 months)</td>
      <td>Yes (56 months)</td>
      <td>27 months</td>
    </tr>
  </tbody>
</table>

<p>We recommend embedding the four risk dimensions above as labels in the <strong>real-time risk scoring engine</strong> of the CRM system and scheduling automated retention interventions at months 6, 12, and 18 after customer onboarding.</p>

<h3 id="92-medium-term-strategy-optimization-612-months">9.2 Medium-term Strategy Optimization (6–12 months)</h3>

<h4 id="1-customer-stratification-and-refined-operations">(1) Customer stratification and refined operations</h4>

<p>Based on the risk score (linear predictor = β̂ᵀX) output by the Cox model, divide customers into three risk tiers:</p>

<table>
  <thead>
    <tr>
      <th>Risk Tier</th>
      <th>Risk Score Percentile</th>
      <th>Suggested Intervention</th>
      <th>Expected Resource Investment</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Low risk</td>
      <td>&lt; 25%</td>
      <td>Routine service maintenance</td>
      <td>Low</td>
    </tr>
    <tr>
      <td>Medium risk</td>
      <td>25%–75%</td>
      <td>Quarterly service follow-up, pushed coupon offers</td>
      <td>Medium</td>
    </tr>
    <tr>
      <td>High risk</td>
      <td>&gt; 75%</td>
      <td>Dedicated account manager, personalized retention plan</td>
      <td>High</td>
    </tr>
  </tbody>
</table>
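
<p>The tiering rule in the table can be sketched in a few lines. The example below assumes the Cox linear predictors are already available as a plain list of numbers; the function name and the sample scores are hypothetical, and the quartile method (<code class="language-plaintext highlighter-rouge">inclusive</code>) is one reasonable choice rather than the report's documented setting.</p>

```python
from statistics import quantiles


def assign_risk_tiers(scores):
    """Label each Cox risk score as low, medium, or high using quartile cut-offs."""
    # quantiles(..., n=4) returns the 25th, 50th, and 75th percentile cut points.
    q25, _, q75 = quantiles(scores, n=4, method="inclusive")
    tiers = []
    for score in scores:
        if score < q25:
            tiers.append("low")
        elif score > q75:
            tiers.append("high")
        else:
            tiers.append("medium")
    return tiers


# Hypothetical linear predictors for eight customers (not from the report).
example_scores = [-1.2, -0.8, -0.3, 0.0, 0.2, 0.5, 0.9, 1.4]
example_tiers = assign_risk_tiers(example_scores)
for score, tier in zip(example_scores, example_tiers):
    print(f"{score:+.1f} -> {tier}")
```

<p>In practice the cut-offs would be computed once on the model-fitting cohort and then reused for scoring new customers, so that tier membership is comparable over time.</p>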

<h4 id="2-service-portfolio-optimization">(2) Service portfolio optimization</h4>

<ul>
  <li>For fiber optic internet service customers (median survival only 30 months), design <strong>exclusive service packages</strong> including online backup, tech support, and device protection to close the retention gap with DSL users.</li>
  <li>Target <strong>single-person households</strong> (no partner and no dependents) as the primary intervention group; their KM median survival is only 24 months, well below that of customers with families.</li>
</ul>

<h4 id="3-customer-acquisition-cost-cac-control">(3) Customer Acquisition Cost (CAC) control</h4>

<p>Based on the CLV estimate ($626.69) and the 10% annual discount rate assumption, it is recommended to:</p>

<ul>
  <li>Control CAC within <strong>30% of CLV</strong>, i.e., not exceeding <strong>$188</strong>.</li>
  <li>Adjust CAC limits by channel according to the average risk score of customers acquired from that channel. Channels with higher risk propensity should have a lower CAC ceiling.</li>
</ul>
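
<p>The CAC ceiling is simple arithmetic on the CLV estimate, and the per-channel adjustment can be expressed as a multiplier. In the sketch below only the CLV ($626.69) and the 30% share come from the report; the function name, the channel names, and the risk multipliers are hypothetical.</p>

```python
def cac_ceiling(clv, max_share=0.30, risk_multiplier=1.0):
    """Return the maximum acquisition cost as a share of CLV, scaled per channel."""
    return clv * max_share * risk_multiplier


CLV_ESTIMATE = 626.69  # CLV estimate quoted in the report

baseline_cap = cac_ceiling(CLV_ESTIMATE)
print(f"Baseline CAC ceiling: ${baseline_cap:.2f}")

# Hypothetical multipliers: channels that bring in riskier customers get a lower cap.
channel_multipliers = {"referral": 1.0, "paid_social": 0.8}
channel_caps = {
    name: cac_ceiling(CLV_ESTIMATE, risk_multiplier=multiplier)
    for name, multiplier in channel_multipliers.items()
}
for name, cap in channel_caps.items():
    print(f"{name}: ${cap:.2f}")
```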

<h3 id="93-long-term-strategic-recommendations-1236-months">9.3 Long-term Strategic Recommendations (12–36 months)</h3>

<h4 id="1-model-lifecycle-management">(1) Model lifecycle management</h4>

<ul>
  <li>Establish a <strong>quarterly model recalibration mechanism</strong>, incorporating the latest churn data to update Cox model coefficients.</li>
  <li>Expand feature engineering to include <strong>behavioral time-series features</strong> such as customer service interaction records (number of complaints, call duration), bill payment delay days, and plan change history.</li>
  <li>Explore the use of <strong>random survival forests</strong> or <strong>deep survival models</strong> (e.g., DeepSurv) as alternatives to the Cox model to capture non-linear effects and interactions.</li>
</ul>

<h4 id="2-retention-effectiveness-monitoring-system">(2) Retention effectiveness monitoring system</h4>

<p>Recommend setting up the following Key Performance Indicators (KPIs) with automated monitoring dashboards:</p>

<table>
  <thead>
    <tr>
      <th>KPI</th>
      <th>Definition</th>
      <th>Update Frequency</th>
      <th>Alert Threshold</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Overall median survival time</td>
      <td>Time at which the KM survival estimate falls to 50%</td>
      <td>Monthly</td>
      <td>Month-over-month decrease &gt; 5%</td>
    </tr>
    <tr>
      <td>Proportion of high-risk customers</td>
      <td>Percentage of customers with risk score above the 75th percentile of the baseline (model-fitting) cohort</td>
      <td>Weekly</td>
      <td>Proportion &gt; 30%</td>
    </tr>
    <tr>
      <td>Value-added service penetration rate</td>
      <td>Subscription rate for online backup/tech support</td>
      <td>Monthly</td>
      <td>Year-over-year growth &lt; 5%</td>
    </tr>
    <tr>
      <td>CLV trend</td>
      <td>72-month cumulative NPV for baseline customer</td>
      <td>Quarterly</td>
      <td>Quarter-over-quarter decrease &gt; 10%</td>
    </tr>
  </tbody>
</table>
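
<p>To make the first KPI concrete, the following sketch computes a Kaplan-Meier median in plain Python and applies the 5% month-over-month alert rule from the table. It is a minimal illustration, not the report's pipeline (which uses lifelines); the function names and the toy durations are hypothetical.</p>

```python
from collections import Counter


def km_median(durations, events):
    """Return the first time at which the Kaplan-Meier survival estimate is at or below 0.5.

    events holds 1 for an observed churn and 0 for a censored customer.
    Returns None when the median is not reached within follow-up.
    """
    churned = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)  # churned and censored customers both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    for t in sorted(exits):
        d = churned.get(t, 0)
        if d:
            survival *= 1 - d / at_risk
        if survival <= 0.5:
            return t
        at_risk -= exits[t]
    return None


def median_alert(previous, current, max_drop=0.05):
    """Flag a month-over-month decrease in median survival larger than max_drop."""
    return (previous - current) / previous > max_drop


# Toy data: tenure in months and churn indicator (hypothetical, not from the dataset).
toy_durations = [2, 3, 3, 5, 8, 8, 12, 20]
toy_events = [1, 1, 0, 1, 0, 1, 1, 0]
toy_median = km_median(toy_durations, toy_events)
print("KM median (months):", toy_median)
print("Alert when median moves from 34 to 32:", median_alert(34, 32))
```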

<h4 id="3-maximizing-customer-lifetime-value">(3) Maximizing customer lifetime value</h4>

<p>The CLV trend analysis shows that the first 24 months contribute 64.7% of total value. Therefore:</p>

<ul>
  <li><strong>Front-load retention resources</strong> in the first two years after onboarding, implementing the highest-intensity interventions during this period.</li>
  <li>Set up <strong>key touchpoints</strong> at months 12 and 24 to enhance renewal rates at those times through personalized offers, service upgrade recommendations, etc.</li>
  <li>For long-standing customers who have been active for more than 36 months, reduce retention resource investment and move them to low-maintenance “stable period” management.</li>
</ul>
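
<p>The share of value earned in the first 24 months follows from the cumulative discounted margin curve. The sketch below shows the calculation under stated assumptions: the survival curve and the monthly margin are hypothetical, and the 10% annual rate is converted to a monthly rate by simple division, which may differ from the convention used in the report. It therefore will not reproduce the $626.69 or 64.7% figures.</p>

```python
def cumulative_npv(survival_probs, monthly_margin, annual_rate=0.10):
    """Return the cumulative discounted expected margin for each month.

    survival_probs[i] is the probability that the customer is still active in month i + 1.
    """
    monthly_rate = annual_rate / 12  # simple conversion, an assumption
    total = 0.0
    cumulative = []
    for month, survival in enumerate(survival_probs, start=1):
        total += survival * monthly_margin / (1 + monthly_rate) ** month
        cumulative.append(total)
    return cumulative


# Hypothetical inputs: 2% monthly churn and a $30 monthly margin over 72 months.
survival_curve = [0.98 ** month for month in range(1, 73)]
npv_by_month = cumulative_npv(survival_curve, monthly_margin=30.0)

share_first_24 = npv_by_month[23] / npv_by_month[-1]
print(f"72-month NPV: ${npv_by_month[-1]:.2f}")
print(f"Share earned in the first 24 months: {share_first_24:.1%}")
```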

<hr />

<h2 id="model-limitations">10. Model Limitations and Improvement Directions</h2>

<h3 id="101-reasons-for-aft-model-prediction-failure">10.1 Reasons for AFT Model Prediction Failure</h3>

<p>The LogLogistic AFT model used in this study predicted a median survival time (135.5 months) that significantly deviates from the Kaplan-Meier estimate (34 months). The root causes can be attributed to the following two points:</p>

<ol>
  <li><strong>High censoring rate</strong>: The analysis sample (n=3,351) had 1,556 observed churn events, so 1,795 customers (approximately 53.6%) were censored. These customers had not yet churned by the end of the observation period (72 months), so the tail of the survival distribution is inferred almost entirely from the parametric assumption, which leads to severe extrapolation bias in the AFT model.</li>
  <li><strong>Mismatched distributional assumption</strong>: The LogLogistic distribution has a unimodal hazard function (increasing then decreasing) when its shape parameter exceeds 1 and a monotonically decreasing hazard otherwise, so its fitted hazard may not match the true shape of telecom churn risk over tenure. We recommend also fitting a Weibull distribution (monotonically increasing, constant, or decreasing hazard) and selecting the parametric form via cross‑validation.</li>
</ol>
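
<p>The difference in hazard shapes between the two families can be checked numerically. The sketch below evaluates the standard log-logistic and Weibull hazard functions over months 1 to 72 and classifies each curve; the scale of 34 months and the shape values are illustrative choices, not fitted parameters from this study.</p>

```python
def loglogistic_hazard(t, alpha, beta):
    """Log-logistic hazard with scale alpha and shape beta (t must be positive)."""
    ratio = (t / alpha) ** beta
    return (beta / t) * ratio / (1 + ratio)


def weibull_hazard(t, lam, rho):
    """Weibull hazard with scale lam and shape rho (t must be positive)."""
    return (rho / lam) * (t / lam) ** (rho - 1)


def hazard_shape(values):
    """Classify a sampled hazard curve as increasing, decreasing, unimodal, or other."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    peak = values.index(max(values))
    rising = all(d > 0 for d in diffs[:peak])
    falling = all(d < 0 for d in diffs[peak:])
    if not (rising and falling):
        return "other"
    if peak == 0:
        return "decreasing"
    if peak == len(values) - 1:
        return "increasing"
    return "unimodal"


months = range(1, 73)
# Illustrative parameters only: scale 34 months, hypothetical shape values.
shapes = {
    "loglogistic shape=2.0": hazard_shape([loglogistic_hazard(t, 34, 2.0) for t in months]),
    "loglogistic shape=0.8": hazard_shape([loglogistic_hazard(t, 34, 0.8) for t in months]),
    "weibull shape=1.5": hazard_shape([weibull_hazard(t, 34, 1.5) for t in months]),
    "weibull shape=0.8": hazard_shape([weibull_hazard(t, 34, 0.8) for t in months]),
}
for name, shape in shapes.items():
    print(f"{name}: {shape}")
```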

<p><strong>Remedial suggestion</strong>: When the observation window is too short to capture churn events for the majority of customers, rely on non‑parametric (KM) or semi‑parametric (Cox) estimates within the observed range, and treat any extrapolation from a fully parametric AFT model beyond that range with caution.</p>

<h3 id="102-data-limitations">10.2 Data Limitations</h3>

<ol>
  <li><strong>Sample selection bias</strong>: This study includes only month-to-month contract customers with internet service (n=3,351), representing 47.6% of the original sample (n=7,043). This selection criterion controls for heterogeneity in contract type and service scope, but it also means conclusions cannot be generalized to:
    <ul>
      <li>Customers with annual/two-year long-term contracts (typically lower churn rates)</li>
      <li>Customers with only phone service (no internet)</li>
    </ul>
  </li>
  <li>
    <p><strong>Insufficient observation window</strong>: The maximum follow‑up time is 72 months. The AFT model’s predicted churn time far exceeds this range, indicating that the available data are insufficient for reliable inference about the churn time of long‑tail customers.</p>
  </li>
  <li><strong>Cross‑sectional data limitation</strong>: The data used are cross‑sectional observations, lacking time‑series information on customer behavior (e.g., service usage frequency, billing payment history, customer service interactions), limiting the model’s ability to capture dynamic churn signals.</li>
</ol>

<h3 id="103-external-validity">10.3 External Validity</h3>

<ul>
  <li>The empirical findings of this study are based on a simulated telecom dataset provided by IBM. Although designed to reflect real‑world business scenarios, differences exist compared to actual telecom company operations (e.g., service pricing, market competition intensity, customer demographic distributions).</li>
  <li>When generalizing the conclusions of this study to other industries (e.g., finance, retail, SaaS), the model needs to be recalibrated and validated for industry‑specific customer lifecycle characteristics.</li>
</ul>

<h3 id="104-model-assumption-violations-and-mitigation-strategies">10.4 Model Assumption Violations and Mitigation Strategies</h3>

<p>The proportional hazards assumption test for the Cox model shows that the variables <code class="language-plaintext highlighter-rouge">internetService_DSL</code>, <code class="language-plaintext highlighter-rouge">onlineBackup_Yes</code>, and <code class="language-plaintext highlighter-rouge">techSupport_Yes</code> all fail the test (p &lt; 0.05). This suggests that the effects of these variables may change over customer tenure. For example:</p>

<ul>
  <li>The protective effect of online backup may be more pronounced early in a customer’s tenure and decay over time.</li>
  <li>The effect of tech support may manifest at specific times when customers encounter issues, rather than being uniformly distributed.</li>
</ul>

<p><strong>Model improvement directions</strong>:</p>

<table>
  <thead>
    <tr>
      <th>Strategy</th>
      <th>Operational Path</th>
      <th>Applicable Scenario</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Stratified Cox model</td>
      <td>Use <code class="language-plaintext highlighter-rouge">strata=['internetService_DSL', 'onlineBackup_Yes', 'techSupport_Yes']</code></td>
      <td>Variable effect does not change uniformly over time, but explicit time interaction modeling is not required</td>
    </tr>
    <tr>
      <td>Time-varying covariates</td>
      <td>Construct interaction terms of the form <code class="language-plaintext highlighter-rouge">X(t) = X × g(t)</code></td>
      <td>Need to quantify the specific functional form of effect change over time</td>
    </tr>
    <tr>
      <td>Extended Cox model</td>
      <td>Use <code class="language-plaintext highlighter-rouge">CoxTimeVaryingFitter</code></td>
      <td>Covariate values themselves change over time (e.g., service subscription status changes)</td>
    </tr>
  </tbody>
</table>

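<p>We recommend trying the <strong>stratified Cox model</strong> first in model iterations. This method is easy to implement and accommodates violations of the proportional hazards assumption by giving each stratum its own baseline hazard. Note that stratification variables no longer receive hazard ratio estimates, so if the effects of online backup and tech support must still be quantified, the time-interaction approach is the better fit for those two variables.</p>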
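
<p>For the time-interaction strategy, the data must first be reshaped into a long format with one row per follow-up interval. The sketch below builds such rows for a single customer using <code class="language-plaintext highlighter-rouge">g(t) = log(t)</code>; the function name, column names, and cut points are hypothetical. An exact treatment would split at every distinct event time, so the coarse cut points here are an approximation. Rows in this start/stop layout are the kind of input that lifelines' <code class="language-plaintext highlighter-rouge">CoxTimeVaryingFitter</code> expects.</p>

```python
import math


def split_episodes(customer_id, duration, event, x, cut_points):
    """Split one customer's follow-up into (start, stop] rows carrying x * log(stop).

    Assumes duration is positive; cut points outside (0, duration) are ignored.
    """
    stops = [c for c in sorted(cut_points) if 0 < c < duration] + [duration]
    rows = []
    start = 0
    for stop in stops:
        rows.append({
            "id": customer_id,
            "start": start,
            "stop": stop,
            "x": x,
            "x_logt": x * math.log(stop),  # interaction term X * g(t)
            # The churn event, if any, belongs to the final interval only.
            "event": int(bool(event) and stop == duration),
        })
        start = stop
    return rows


# Hypothetical customer: churned at month 14 and subscribed to online backup (x = 1).
episodes = split_episodes("c1", duration=14, event=1, x=1, cut_points=[6, 12, 18])
for row in episodes:
    print(row)
```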

<h3 id="105-future-research-directions">10.5 Future Research Directions</h3>

<ol>
  <li><strong>Feature engineering expansion</strong>: Incorporate behavioral time‑series features (e.g., average monthly data usage, number of customer service complaints, bill payment delay days) to build dynamic survival analysis models.</li>
  <li><strong>Model comparison experiments</strong>: Compare the predictive performance of the Cox model, random survival forest, and DeepSurv on the same validation set, using time‑dependent AUC or Brier Score as evaluation metrics.</li>
  <li><strong>Causal inference extensions</strong>: Use propensity score matching (PSM) or instrumental variables to test whether the association between online backup/tech support services and customer retention reflects a causal effect, and to reduce self‑selection bias.</li>
</ol>

<hr />

<h2 id="appendix">11. Appendix</h2>

<h3 id="appendix-a-technical-parameters">Appendix A: Technical Parameters</h3>
<ul>
  <li><strong>Analysis tools</strong>: PySpark + Lifelines</li>
  <li><strong>Spark configuration</strong>: Driver memory 4G, Executor memory 2G</li>
  <li><strong>Model version</strong>: v1.0</li>
</ul>

<h3 id="appendix-b-file-list">Appendix B: File List</h3>

<table>
  <thead>
    <tr>
      <th>File Name</th>
      <th>Content</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>kaplan_meier_summary.csv</td>
      <td>KM analysis event table</td>
    </tr>
    <tr>
      <td>cox_model_summary.csv</td>
      <td>Detailed Cox model results</td>
    </tr>
    <tr>
      <td>aft_model_summary.csv</td>
      <td>AFT model results</td>
    </tr>
    <tr>
      <td>clv_cohort.csv</td>
      <td>CLV monthly calculation results</td>
    </tr>
    <tr>
      <td>analysis_report.txt</td>
      <td>Full analysis report</td>
    </tr>
  </tbody>
</table>

<h3 id="appendix-c-code-runtime-environment">Appendix C: Code Runtime Environment</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Report generation date: 2026-04-26
Python version: 3.x
Dependencies: pyspark, pandas, numpy, lifelines, matplotlib, seaborn
</code></pre></div></div>]]></content><author><name>Chen Sijia</name></author><summary type="html"><![CDATA[Author: Chen Sijia]]></summary></entry></feed>