Poisson regression reference notes

Poisson regression
Author

Duc Nguyen

Published

February 4, 2026

A guide to interpreting the results:

Backup

library(readxl)
df <- read_excel("E:/GITHUB/tailieuthamkhao/posts/poisson_regression/pois.xlsx")

df <- as.data.frame(df)

head(df)
  GENDER AGE SUM_K
1 FEMALE  22    78
2   MALE  20     0
3   MALE  20    40
4   MALE  19    42
5   MALE  20     0
6 FEMALE  19    58
library(dplyr)

# Use `::` (exported functions), not `:::` (internal namespace access)
df |>
  dplyr::group_by(GENDER) |>
  dplyr::summarise(mean_AGE = mean(AGE),
                   mean_SUM_K = mean(SUM_K),
                   count = n())
# A tibble: 2 × 4
  GENDER mean_AGE mean_SUM_K count
  <chr>     <dbl>      <dbl> <int>
1 FEMALE     19.3       35.4    88
2 MALE       19.9       27.2    63
df$SUM_K <- df$SUM_K + 1

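Because the data contain many zeros, we add 1 to every SUM_K value so that the model is easier to fit. The response in the models below is therefore the shifted count, not the original SUM_K. (A Poisson GLM with a log link does accept zero counts, so this shift is a modeling choice rather than a requirement.)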
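Whether zeros are a problem for the fitting routine can be checked directly. The following is a minimal sketch on simulated data: the variables `x` and `y` and the coefficients 0.2 and 0.5 are made up for illustration and are not taken from pois.xlsx. It fits a Poisson GLM to counts that contain zeros, without shifting them.

```r
# Simulated data for illustration only (not the pois.xlsx data).
set.seed(1)
x <- rnorm(200)
y <- rpois(200, lambda = exp(0.2 + 0.5 * x))

# Number of zero counts in the simulated response.
n_zero <- sum(y == 0)

# Fit the Poisson model directly on the unshifted counts.
fit_sim <- glm(y ~ x, family = poisson)

print(n_zero)
print(fit_sim$converged)
print(coef(fit_sim))
```

If the data hold many more zeros than a Poisson or negative binomial model would predict, a zero-inflated model (the topic of the zinb link at the end of this post) is a commonly used alternative to shifting the response.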

# Same summary after the shift; the SUM_K means are 1 higher
df |>
  dplyr::group_by(GENDER) |>
  dplyr::summarise(mean_AGE = mean(AGE),
                   mean_SUM_K = mean(SUM_K),
                   count = n())
# A tibble: 2 × 4
  GENDER mean_AGE mean_SUM_K count
  <chr>     <dbl>      <dbl> <int>
1 FEMALE     19.3       36.4    88
2 MALE       19.9       28.2    63
fit <- glm(SUM_K ~ GENDER + AGE,
           data = df,
           family = "poisson")

summary(fit)

Call:
glm(formula = SUM_K ~ GENDER + AGE, family = "poisson", data = df)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.46369    0.24746   5.915 3.32e-09 ***
GENDERMALE  -0.33307    0.03112 -10.703  < 2e-16 ***
AGE          0.11037    0.01275   8.658  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 4004.5  on 150  degrees of freedom
Residual deviance: 3852.7  on 148  degrees of freedom
AIC: 4550.9

Number of Fisher Scoring iterations: 5
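
The residual deviance above is far larger than its degrees of freedom, which is the usual sign of overdispersion. A minimal sketch of that check follows, using the two numbers printed by `summary(fit)`; with the fitted object available, `deviance(fit)` and `df.residual(fit)` return the same values.

```r
# Values copied from the summary(fit) output above.
resid_dev <- 3852.7   # residual deviance
resid_df  <- 148      # residual degrees of freedom

# Under a well-fitting Poisson model this ratio should be close to 1.
dispersion_ratio <- resid_dev / resid_df

# Chi-square goodness-of-fit test: a tiny p-value means the Poisson
# model does not fit the data well.
p_gof <- pchisq(resid_dev, df = resid_df, lower.tail = FALSE)

print(round(dispersion_ratio, 2))
print(p_gof)
```

A ratio of about 26 is well above 1, so the Poisson standard errors are probably too small. That is the reason for trying a negative binomial model next.
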
library(MASS)
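# NOTE: SUM_K was already shifted by 1 before the Poisson fit. This line shifts it
# again, so the negative binomial model is fit on the original SUM_K + 2
# (the output below was produced with it). Drop it to fit both models on the same response.
df$SUM_K <- df$SUM_K + 1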
fit.nb <- glm.nb(SUM_K ~ GENDER + AGE,
                 data = df) 

summary(fit.nb)

Call:
glm.nb(formula = SUM_K ~ GENDER + AGE, data = df, init.theta = 0.9955253049, 
    link = log)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.55819    1.40250   1.111    0.267
GENDERMALE  -0.28457    0.17420  -1.634    0.102
AGE          0.10613    0.07255   1.463    0.143

(Dispersion parameter for Negative Binomial(0.9955) family taken to be 1)

    Null deviance: 174.85  on 150  degrees of freedom
Residual deviance: 170.87  on 148  degrees of freedom
AIC: 1375.4

Number of Fisher Scoring iterations: 1

              Theta:  0.996 
          Std. Err.:  0.109 

 2 x log-likelihood:  -1367.416 
library(stargazer)

stargazer::stargazer(fit, 
                     fit.nb,
                     type = "text")

==============================================
                      Dependent variable:     
                  ----------------------------
                             SUM_K            
                    Poisson       negative    
                                  binomial    
                      (1)           (2)       
----------------------------------------------
GENDERMALE         -0.333***       -0.285     
                    (0.031)       (0.174)     
                                              
AGE                0.110***        0.106      
                    (0.013)       (0.073)     
                                              
Constant           1.464***        1.558      
                    (0.247)       (1.403)     
                                              
----------------------------------------------
Observations          151           151       
Log Likelihood    -2,272.439      -684.708    
theta                         0.996*** (0.109)
Akaike Inf. Crit.  4,550.879     1,375.416    
==============================================
Note:              *p<0.1; **p<0.05; ***p<0.01

\[\text{ln}(\lambda (\text{SUM\_K})) = \beta_0 + \beta_1 \text{GENDER} + \beta_2 \text{AGE}\]
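
Exponentiating both sides gives the expected count itself, which is why each coefficient acts multiplicatively. As a sketch, plugging in the rounded Poisson estimates from the table above (assuming GENDER is coded 1 for MALE and 0 for FEMALE, as the GENDERMALE row implies):

\[\hat{\lambda}(\text{SUM\_K}) = e^{1.464 - 0.333\,\text{GENDER} + 0.110\,\text{AGE}} = e^{1.464} \times e^{-0.333\,\text{GENDER}} \times e^{0.110\,\text{AGE}}\]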

INTERPRETATION:

AGE is a continuous variable

Interpretation 1: The coefficient for AGE is \(0.11\). For each one-unit increase in AGE, the expected SUM_K count is multiplied by a factor of \(e^{0.11} \approx 1.116\), that is, it increases by about 11.6%, holding the other variables in the model constant.

Interpretation 2: The coefficient for AGE is \(0.11\). This means that the expected log count of SUM_K increases by \(0.11\) for each one-unit increase in AGE.

In plain terms: the coefficient for AGE is \(0.11\). When AGE increases by 1 unit, the SUM_K count is multiplied by a factor of \(e^{0.11} \approx 1.116\) (a ratio, not a number of units), provided the other variables are held constant.
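
The factor quoted above can be reproduced, together with an approximate interval, from the numbers in the Poisson coefficient table. This is a minimal sketch that copies the AGE estimate and standard error by hand; with the fitted object available, `exp(coef(fit))` and `exp(confint.default(fit))` give the same kind of result for all coefficients.

```r
# Estimate and standard error for AGE, copied from summary(fit).
beta_age <- 0.11037
se_age   <- 0.01275

# Rate ratio: multiplicative change in the expected count
# per 1-unit increase in AGE.
rr_age <- exp(beta_age)

# Approximate 95% Wald interval, built on the log scale
# and then exponentiated.
z <- qnorm(0.975)
ci_age <- exp(beta_age + c(-1, 1) * z * se_age)

# Percent change in the expected count per 1-unit increase in AGE.
pct_change <- (rr_age - 1) * 100

print(round(c(rate_ratio = rr_age, lower = ci_age[1], upper = ci_age[2]), 3))
print(round(pct_change, 1))
```

The interval relies on the Poisson standard error, so it is only as trustworthy as the Poisson variance assumption.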

GENDER is a categorical variable, with FEMALE as the reference level

The GENDERMALE coefficient is the estimated Poisson regression coefficient comparing MALE to FEMALE.

Interpretation 1: The coefficient for MALE is \(-0.333\), so being MALE is associated with the mean SUM_K being multiplied by a factor of \(e^{-0.333} \approx 0.717\). That is, the expected SUM_K for MALE is only about 71.7% of that for FEMALE at the same AGE.

Interpretation 2: The log of the expected SUM_K count is 0.333 units lower for MALE than for FEMALE, holding the other variables in the model constant.

Interpretation 3: The indicator variable GENDERMALE compares MALE with FEMALE. The expected log count for MALE is lower by about \(0.333\).

In plain terms: by gender, being MALE multiplies the mean SUM_K by a factor of about 0.717 relative to FEMALE. In other words, the SUM_K score for MALE is lower than for FEMALE.
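
To see why the exponentiated coefficient is a ratio of means, write the model for MALE (GENDER = 1) and FEMALE (GENDER = 0) at the same AGE and divide. This assumes the 0/1 coding implied by the GENDERMALE row. The intercept and AGE terms cancel:

\[\frac{\lambda_{\text{MALE}}}{\lambda_{\text{FEMALE}}} = \frac{e^{\beta_0 + \beta_1 + \beta_2 \text{AGE}}}{e^{\beta_0 + \beta_2 \text{AGE}}} = e^{\beta_1} = e^{-0.333} \approx 0.717\]

So, under the Poisson fit, the expected SUM_K for MALE is about 28.3% lower than for FEMALE of the same AGE.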

https://stats.oarc.ucla.edu/r/dae/poisson-regression/

https://stats.oarc.ucla.edu/stata/output/poisson-regression/

https://stats.oarc.ucla.edu/r/dae/zinb/