Review of Empirical Exercises

Ian McCarthy | Emory University

Areas of empirical work

Empirical exercises cover four “applied” areas:

  1. Difference-in-differences
  2. Instrumental variables
  3. Regression discontinuity
  4. Demand estimation and hospital market construction

Today, let’s talk about each of these (briefly) and provide some code examples/hints.

Workflow

Another sub-goal of the exercises is to practice good workflow. I’ll focus on two possible workflows:

  1. “Solo-author” workflow, where everything lives on your computer and on Git/GitHub, and nothing is hard-coded (see the path sketch after this list)
  2. “Co-author” workflow, where you transition to Overleaf for writing purposes
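
For the “nothing is hard-coded” part, one option is to build every file path relative to the project root. A minimal sketch using the here package (the file name below is just a placeholder):

## build paths relative to the project root so the script runs on any machine
library(here)
library(readr)

## here() resolves from the project root (the .Rproj/.git folder); no absolute paths needed
## hypothetical file name, for illustration only
hcris.data <- read_csv(here("data", "input", "hcris_data.csv"))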

Exercise 1: Difference-in-differences

Question

Did Medicaid expansion reduce hospital uncompensated care costs?

Data

  1. Hospital Cost Report Information System
  2. Provider of Services files
  3. Medicaid expansion from KFF

Data management

  • Create the HCRIS, POS, and KFF Medicaid datasets separately
  • Join appropriately
  • Define expansion period and uncompensated care costs
  • Trim data to remove outliers (many, many different/better ways to do this)
R Code
## exercise 1 code assumes these packages are loaded: tidyverse, lubridate, fixest,
## modelsummary, broom, did, HonestDiD, and scales; hcris.data, pos.data, and
## medicaid.data are built in earlier steps of the solution
full.data <- hcris.data %>% rename(state_short=state) %>%
  left_join(pos.data %>% select(provider_number=provider, category, own_change, beds_cert, beds_tot, name,
                                term_date, fac_type, own_type, category_sub, profit_status, year), 
            by=c("provider_number", "year")) %>%
  left_join(medicaid.data, by="state_short") %>%
  mutate(profit_status = 
           case_when(
             own_type=="Non-profit Private" ~ "Non Profit",
             own_type %in% c("Physician Owned","Profit") ~ "For Profit"
           )) %>%
  mutate(expand_year=year(date_adopted),
         expand = (year>=expand_year & !is.na(expand_year)),
         uncomp_care=uncomp_care*cost_to_charge) %>%
  rename(expand_ever=expanded) %>%
  filter(year>=2003, year<2020,
         uncomp_care>0, tot_pat_rev>0)

trim.data <- full.data %>%
  group_by(year) %>%
  mutate(ptile_uncomp=ntile(uncomp_care,100)) %>%
  filter(ptile_uncomp>1 & ptile_uncomp<99)

Identification strategy(ies)

  • DD with staggered treatment adoption (Medicaid expansion)
  • Work through different specifications/strategies:
    • TWFE by group and with staggered treatment
    • Event study by group and with staggered treatment
    • Sun and Abraham with event study
    • Callaway and Sant’Anna with event study
    • Rambachan and Roth “honest DD”

Estimation

TWFE
reg.data1 <- trim.data %>% mutate(uncomp_care=uncomp_care/1000) %>%
  mutate(treat=
           case_when(
             year>=expand_year & !is.na(expand_year) ~ 1,
             year<expand_year & !is.na(expand_year) ~0,
             is.na(expand_year) ~ 0
           )
         )

reg.data2 <- trim.data %>% mutate(uncomp_care=uncomp_care/1000) %>%
  filter(is.na(expand_year) | expand_year==2014) %>%
  mutate(post=(year>=2014),
         treat=post*expand_ever)

reg.data3 <- trim.data %>% mutate(uncomp_care=uncomp_care/1000) %>%
  filter(is.na(expand_year) | expand_year==2015) %>%
  mutate(post=(year>=2015),
         treat=post*expand_ever)

reg.data4 <- trim.data %>% mutate(uncomp_care=uncomp_care/1000) %>%
  filter(is.na(expand_year) | expand_year==2016) %>%
  mutate(post=(year>=2016),
         treat=post*expand_ever)

dd.est1 <- feols(uncomp_care~treat | year + provider_number, data=reg.data1)
dd.est2 <- feols(uncomp_care~treat | year + provider_number, data=reg.data2)
dd.est3 <- feols(uncomp_care~treat | year + provider_number, data=reg.data3)
dd.est4 <- feols(uncomp_care~treat | year + provider_number, data=reg.data4)

sum.fmt <- function(x) formatC(x, digits = 2, big.mark = ",", format = "f")
dd.summary <- msummary(list("Full Sample"=dd.est1, "Expand 2014"=dd.est2, 
                            "Expand 2015"=dd.est3, "Expand 2016"=dd.est4),
                       shape=term + statistic ~ model, 
                       gof_map=NA,
                       coef_omit='Intercept',
                       coef_rename=c("treat"="Expansion"),
                       fmt= sum.fmt,
                       vcov = ~provider_number,
                       output="markdown",
                       caption="TWFE Estimates for Different Treatment Groups",
                       label="ddmodels")
Event Studies
## based on full data
event.data1 <- reg.data1 %>%
  mutate(event_time=
           case_when(
             !is.na(expand_year) ~ year-expand_year,
             is.na(expand_year) ~ -1
           )
         ) 

event.reg1 <- feols(uncomp_care ~ i(as.factor(event_time), expand_ever, ref=-1) | year + provider_number, 
                   cluster=~provider_number, data=event.data1)



## based on 2014 treatment group only
event.reg2 <- feols(uncomp_care ~ i(as.factor(year), expand_ever, ref=2013) | year + provider_number, 
                    cluster=~provider_number, data=reg.data2)
Sun and Abraham
sa.data <- event.data1 %>%
  mutate(expand_year = ifelse(expand_ever==FALSE, 10000, expand_year),
         time_to_treat = ifelse(expand_ever==FALSE, -1, year-expand_year),
         time_to_treat = ifelse(time_to_treat < -15, -15, time_to_treat))

sa.reg <- feols(uncomp_care ~ sunab(expand_year, time_to_treat) | year + provider_number,
                cluster=~provider_number, data=sa.data)
sa.est <- tidy(summary(sa.reg, agg=FALSE)) %>%
  filter(str_detect(term,"cohort::2014|cohort::2015|cohort::2016")) %>%
  mutate(term=str_replace(term,"time_to_treat::",""),
         term=str_replace(term,":cohort::",":")) %>%
  separate(term,c("period","cohort"),":") %>%
  mutate(period=as.numeric(period)) %>%
  select(period, cohort, estimate, p.value) %>%
  rename(p_value=p.value) 

sa.event.plot <- iplot(sa.reg, xlab = "Time to Treatment", main="Sun and Abraham Event Study")
Callaway and Sant’Anna
cs.data <- reg.data1 %>%
  mutate(expand_year=ifelse(is.na(expand_year),0,expand_year)) %>%
  group_by(provider_number) %>%
  mutate(hospital_id=cur_group_id()) %>% ungroup()


cs.mod <- att_gt(yname="uncomp_care", tname="year", idname="hospital_id",
                 gname="expand_year",
                 data=cs.data, panel=TRUE, est_method="dr",
                 allow_unbalanced_panel=TRUE)
cs.event <- aggte(cs.mod, type="dynamic")

coef.cs <- tidy(cs.event) %>%
  select(rel_year=event.time, estimate, ci_lower=conf.low, ci_upper=conf.high) %>%
  mutate(rel_year=as.numeric(rel_year))
coef.cs <- as_tibble(coef.cs)

cs.plot <- ggplot(coef.cs, aes(x=rel_year, y=estimate)) + 
  geom_point(size = 2)  +
  geom_linerange(aes(ymin = ci_lower, ymax = ci_upper)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_vline(xintercept = -0.5, linetype = "dashed") +
  theme(legend.position="none") +
  scale_x_continuous(breaks = -15:5, minor_breaks = NULL) +
  scale_y_continuous(minor_breaks = NULL, label=comma) +
  labs(x = "Relative Time", y = "Estimated Effect, $1000s", color = NULL, title = NULL) +
  theme_bw()
Rambachan and Roth Honest DD
library(devtools)
install_github("bcallaway11/BMisc", dependencies = TRUE)
install_github("asheshrambachan/HonestDiD", dependencies = TRUE)

library(HonestDiD)
source('solutions/exercise1/fn_honest_did.R')

cs.hd <- att_gt(yname="uncomp_care", tname="year", idname="hospital_id",
                 gname="expand_year",
                 data=cs.data, panel=TRUE, est_method="dr",
                 allow_unbalanced_panel=TRUE,
                base_period="universal")
cs.hd.event <- aggte(cs.hd, type="dynamic", min_e=-10, max_e=5)

hd.cs <- honest_did(cs.hd.event, type="smoothness", Mvec=seq(from=0, to=2000, by=500))
hd.cs.graph <- createSensitivityPlot(hd.cs$robust_ci,
                                        hd.cs$orig_ci)

coef.cs.hd <- hd.cs$robust_ci %>% bind_rows(hd.cs$orig_ci) %>%
  mutate(type=case_when(
    M==2000 ~ "M = +2,000",
    M==1500 ~ "M = +1,500",
    M==1000 ~ "M = +1,000",
    M==500 ~ "M = +500",
    M==0 ~ "M = 0",
    is.na(M) ~ "Original"
  ))

cs.hd.plot <- ggplot(coef.cs.hd, aes(x=factor(type, 
                                              level=c('Original', 'M = 0', 'M = +500', 'M = +1,000', 'M = +1,500', 'M = +2,000')))) + 
  geom_linerange(aes(ymin = lb, ymax = ub)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  theme(legend.position="none") +
  scale_y_continuous(minor_breaks = NULL, label=comma) +
  labs(x = "Violation in Parallel Trends", y = "Estimated Effect (at t=0)", color = NULL, title = NULL) +
  theme_bw()

Exercise 2: Instrumental Variables

Question

What is the effect of vertical integration on total physician billable activity?

Data

  1. Medicare Data on Provider and Practice Specialty (MD-PPAS)
  2. Medicare Utilization and Payment Data (PUF)
  3. Physician Fee Schedule (PFS)…already provided for you

Data management

  • Using MD-PPAS, PFS, and PUF…calculate physician relative price changes due to price shock
  • Aggregate to practice level
  • Repeat 100 times for random realizations of price shock
  • Recenter practice level price changes based on random realizations
R Code
pfs.update <- read_tsv("solutions/exercise2/data/input/PFS_update_data.txt")
taxid.base <- read_csv("solutions/exercise2/data/input/MDPPAS/PhysicianData_2009.csv") %>%
  select(npi, tax_id_base=group1)

## for recentering (part 8)
pfs.year <- pfs.update %>% filter(year==2012)
value <- 1:100
n.rows=nrow(pfs.year)
pfs.year.rc <- bind_cols(pfs.year,
                           map_dfc(value, ~pfs.year %>% transmute(!!paste0('dprice_rel_2010_', .x) := sample(dprice_rel_2010, n.rows)))) %>%
  select(hcpcs, starts_with('dprice_rel_2010_'))
##

step <- 0
for (i in 2012:2017) {
  step <- step+1
  
  if (i<=2013) {
    pfs.yearly <- pfs.update %>% filter(year==i)
  } else {
    pfs.yearly <- pfs.update %>% filter(year==2013)
  }
  
  
  mdppas <- read_csv(paste0("solutions/exercise2/data/input/MDPPAS/PhysicianData_",i,".csv"))
  
  medicare.puf <- read_delim(paste0("solutions/exercise2/data/input/utilization-payment-puf/",i,"/Medicare_Provider_Util_Payment_PUF_CY",i,".txt")) %>%
    mutate(year=i) %>%
    rename_with(tolower) %>%
    filter(str_detect(nppes_credentials, "M.D.|MD|m.d.|md")) %>%
    mutate(npi=as.numeric(npi),
           spend=average_medicare_allowed_amt*bene_day_srvc_cnt)

  price.shock <- medicare.puf %>% inner_join(taxid.base, by="npi") %>%
    inner_join(pfs.yearly %>% select(hcpcs, dprice_rel_2010, price_nonfac_orig_2010, price_nonfac_orig_2007), by=c("hcpcs_code"="hcpcs")) %>%
    mutate_at(vars(dprice_rel_2010, price_nonfac_orig_2010, price_nonfac_orig_2007), replace_na, 0) %>%
    mutate(price_shock = case_when(
            i<=2013 ~ ((i-2009)/4)*dprice_rel_2010,
            i>2013  ~ dprice_rel_2010),
          denom = line_srvc_cnt*price_nonfac_orig_2010,
          numer = price_shock*line_srvc_cnt*price_nonfac_orig_2010) %>%
    group_by(npi) %>%
    summarize(phy_numer=sum(numer, na.rm=TRUE), phy_denom=sum(denom, na.rm=TRUE), tax_id_base=first(tax_id_base)) %>%
    ungroup() %>%
    mutate(phy_rev_change=phy_numer/phy_denom)

  price.shock.practice <- price.shock %>%
    group_by(tax_id_base) %>%
    summarize(practice_rev_change=sum(phy_rev_change, na.rm=TRUE)) %>%
    ungroup()
  
  price.shock.rc <- medicare.puf %>% inner_join(taxid.base, by="npi") %>%
    inner_join(pfs.yearly %>% select(hcpcs, price_nonfac_orig_2010), by=c("hcpcs_code"="hcpcs")) %>%        
    inner_join(pfs.year.rc, by=c("hcpcs_code"="hcpcs")) %>%
    mutate(across(starts_with('dprice_rel_2010'), replace_na, 0)) %>%
    mutate(across(starts_with('dprice_rel_2010'),
                  ~ case_when(
                    i<=2013 ~ ((i-2009)/4)*(.x),
                    i>2013 ~ .x),
                  .names = "price_shock{sub('dprice_rel_2010_','_',col)}"),
           denom = line_srvc_cnt*price_nonfac_orig_2010,
           across(starts_with('price_shock'),
                  ~ (.x)*line_srvc_cnt*price_nonfac_orig_2010,
                  .names = "numer{sub('price_shock_','_',col)}")) %>%
    group_by(npi) %>%
    summarize(across(starts_with('numer'),
                     ~ sum( .x, na.rm=TRUE),
                     .names = "phy_{col}"),
              phy_denom=sum(denom, na.rm=TRUE),
              tax_id_base=first(tax_id_base)) %>%
    ungroup() %>%
    mutate(across(starts_with('phy_numer'),
                  ~ (.x)/phy_denom,
                  .names = "phy_rev_change{sub('phy_numer_','_',col)}"))

  price.shock.practice.rc <- price.shock.rc %>%
    group_by(tax_id_base) %>%
    summarize(across(starts_with('phy_rev_change'),
                     ~ sum(.x, na.rm=TRUE),
                     .names = "{sub('phy','practice',col)}")) %>%
    ungroup() %>%
    mutate(mean_practice_shock=rowMeans( across(starts_with('practice_rev_change_')), na.rm=TRUE)) %>%
    select(tax_id_base, mean_practice_shock)

  phy.data <- medicare.puf %>% 
    group_by(npi) %>%
    summarize(total_claims=sum(bene_day_srvc_cnt, na.rm=TRUE),
              total_spend=sum(spend, na.rm=TRUE),
              total_patients=sum(bene_unique_cnt, na.rm=TRUE)) %>%
    inner_join(taxid.base, by="npi") %>%
    inner_join(mdppas %>% select(npi, group1, group2, pos_opd, pos_office, pos_asc), by="npi") %>%
    inner_join(price.shock %>% select(-tax_id_base), by="npi") %>%
    inner_join(price.shock.practice, by="tax_id_base")  %>%
    inner_join(price.shock.practice.rc, by="tax_id_base")  %>%
    rename(tax_id_2009=tax_id_base) %>%
    mutate(year=i) %>%
    ungroup()
  
  if (step==1) {
    final.data <- phy.data
  } else {
    final.data <- bind_rows(final.data, phy.data)
  }
}

Identification strategy

  • First, assess potential role of selection on unobservables
  • Exploit PPIS price shock as instrument for integration
    • Consider new work on instrument strength
    • Consider “recentering” as proposed by Borusyak and Hull

Estimation

Role of selection on observables
## reg.data: regression dataset with the integration indicator (int) and log of
## claims (ln_claims), constructed from final.data elsewhere in the solution
delta.dx <- feols(ln_claims ~ int | npi + year, data=reg.data)
delta.d <- feols(ln_claims ~ int, data=reg.data)

hat.dx <- tidy(delta.dx) %>% select(estimate)
hat.d <- tidy(delta.d) %>% filter(term=='int1') %>% select(estimate)
R.dx <- glance(delta.dx)$r.squared
R.d <- glance(delta.d)$r.squared
oster.data <- tibble(
  coef_dx = hat.dx$estimate,
  coef_d = hat.d$estimate,
  r_dx = R.dx,
  r_d = R.d
)

rho <- tibble(
  rho= c(0,.5,1,1.5,2),
)

r.max <- tibble(
  r_max= c(0.5, 0.6, 0.7, 0.8, 0.9, 1)
)

oster.bound <- crossing(rho, r.max) %>%
  bind_cols(oster.data) %>%
  mutate(d_upper = coef_dx - rho*(coef_d - coef_dx)*( (r_max - r_dx)/(r_dx - r_d)))

oster.table <- pivot_wider(oster.bound %>% select(rho, r_max, d_upper),
            names_from=rho,
            names_glue="rho_{rho}",
            values_from=d_upper)

kable(oster.table, format="latex",
      col.names=c("Max R2","Rho 0","Rho 0.5", "Rho 1.0", "Rho 1.5", "Rho 2.0"),
      digits=c(1,3,3,3,3,3),
      format.args=list(big.mark=","),
      booktabs=T) %>%
  kable_styling(latex_options=c("HOLD_position")) %>%
  save_kable("solutions/exercise2/t3_oster.tex")
IV Estimation
mod.fs <- feols(as.numeric(int) ~ practice_rev_change | npi + year, data=reg.data)
mod.iv <- feols(ln_claims ~ 0 | npi + year | int ~ practice_rev_change, data=reg.data)
mod.rf <- feols(ln_claims ~ practice_rev_change | npi + year, data=reg.data)

## note: mod.ols (the OLS comparison model) is estimated elsewhere in the solution code
iv.summary <- msummary(list("OLS"=mod.ols, "IV"=mod.iv, 
                            "Reduced Form"=mod.rf, "First Stage"=mod.fs),
                       shape=term + statistic ~ model, 
                       gof_map=NA,
                       coef_omit='Intercept',
                       coef_rename=c("int1"="Integration",
                                     "fit_int1"="Integration",
                                     "practice_rev_change"="Instrument"),
                       vcov = ~npi,
                       caption="Instrumental Variables Estimates",
                       output="solutions/exercise2/t4-iv.tex",
                       label="ivmodels",
                       booktabs=T) %>%
  kable_styling(latex_options='HOLD_position')
Weak Instruments
mod.fe1 <- feols(as.numeric(int) ~ 1 | npi + year, data=reg.data)
mod.fe2 <- feols(practice_rev_change ~ 1 | npi + year, data=reg.data)
mod.fe3 <- feols(ln_claims ~ 1 | npi + year, data=reg.data)

ar.data <- reg.data %>%
  filter(!is.na(npi)) %>%
  add_residuals(mod.fe1, var="x_hat") %>%
  add_residuals(mod.fe2, var="z_hat") %>%
  add_residuals(mod.fe3, var="y_hat") %>%
  select(y_hat, x_hat, z_hat, npi, year)

ar.data <- as.data.frame(ar.data)
Y=ar.data[,"y_hat"]
D=ar.data[,"x_hat"]
Z=ar.data[,"z_hat"]
weak.test <- ivmodel(Y=Y, D=D, Z=Z)
kable(weak.test$AR$ci, format="latex",
      col.names=c("Lower CI","Upper CI"),
      digits=c(3,3),
      booktabs=T) %>%
  kable_styling(latex_options=c("HOLD_position")) %>%
  save_kable("solutions/exercise2/t6_ar.tex")

## still need Lee et al tF stat
lee.tf <- msummary(list("IV"=mod.iv, "First Stage"=mod.fs),
                        shape=term + statistic ~ model, 
                        gof_map=NA,
                        coef_omit='Intercept',
                        coef_rename=c("int1"="Integration",
                                      "fit_int1"="Integration",
                                      "practice_rev_change"="Instrument"),
                        vcov = ~npi,
                        statistic="statistic",
                        caption="Instrumental Variables Estimates",
                        output="solutions/exercise2/t6-tf.tex",
                        label="ivmodels",
                        booktabs=T) %>%
  kable_styling(latex_options='HOLD_position')
Recentering
reg.data <- reg.data %>%
  mutate(mean_shock=mean_practice_shock/1000) %>%
  mutate(shock_rc=practice_rev_change-mean_shock)

mod.fs2 <- feols(as.numeric(int) ~ shock_rc | npi + year, data=reg.data)
mod.iv2 <- feols(ln_claims ~ 0 | npi + year | int ~ shock_rc, data=reg.data)
mod.rf2 <- feols(ln_claims ~ shock_rc | npi + year, data=reg.data)

iv.summary2 <- msummary(list("OLS"=mod.ols, "IV"=mod.iv2, 
                            "Reduced Form"=mod.rf2, "First Stage"=mod.fs2),
                       shape=term + statistic ~ model, 
                       gof_map=NA,
                       coef_omit='Intercept',
                       coef_rename=c("int1"="Integration",
                                     "fit_int1"="Integration",
                                     "shock_rc"="Recentered Instrument"),
                       vcov = ~npi,
                       caption="Instrumental Variables Estimates",
                       output="solutions/exercise2/t7-iv2.tex",
                       label="ivmodels",
                       booktabs=T) %>%
  kable_styling(latex_options='HOLD_position')

Exercise 3: Regression Discontinuity

Question

What is the effect of the low income subsidy on Part D enrollment?

Data

  1. Replication data from Ericson (2014)
  2. Based on Part D enrollment and low-income subsidy data (publicly available)

Data management

Just combining datasets from replication files

R Code
# calculate shares from pdp data
pdp.data <- pdp.data %>%
  group_by(state, year) %>%
  mutate(state_yr_enroll = sum(enrollment, na.rm=TRUE)) %>%
  ungroup() %>%
  mutate(share = enrollment/state_yr_enroll,
         ln_share = log(share))

# reshape subsidy data to long
lis.data <- pivot_longer(lis.data, cols=c("s2006","s2007","s2008","s2009","s2010"), 
                         names_to="year",
                         names_prefix="s",
                         values_to="LISsubsidy") %>%
  mutate(year=as.numeric(year))

# merge the subsidy data into the pdp data and create new variables matching Ericson's code files
final.data <- pdp.data %>%
  inner_join(lis.data, by=c("PDPregion","year")) %>%
  mutate(LISPremium = premium - LISsubsidy,
         proposedBenchmarkPlan = ifelse(LISPremium<=0,1,0),
         ProblemObs = case_when(
           LISPremium < 0 & LIS == 0 ~ 1,
           LISPremium > 0 & LIS == 1 ~ 2
         ),
         LISPremium = ifelse(benefit=="E",NA,LISPremium),
         proposedBenchmarkPlan = ifelse(benefit=="E", NA, proposedBenchmarkPlan),
         LISPremiumNeg = ifelse(LISPremium<=0, LISPremium, 0),
         LISPremiumPos = ifelse(LISPremium>=0, LISPremium, 0))

# recreate Ericson RD windows
final.data.rd <- final.data %>%
  mutate(RDwindow1 = ifelse(LISPremium>=-10 & LISPremium<=10 & year==2006, 1, 0),
         belowBench1 = ifelse(LISPremium<=0 & RDwindow1==1, 1, 0),
         RDwindow2 = ifelse(LISPremium>=-4 & LISPremium<=4 & year==2006, 1, 0),
         belowBench2 = ifelse(LISPremium<=0 & RDwindow2==1, 1, 0),
         RDwindow3 = ifelse(LISPremium>=-2.5 & LISPremium<=2.5 & year==2006, 1, 0),
         belowBench3 = ifelse(LISPremium<=0 & RDwindow3==1, 1, 0),
         RDwindow4 = ifelse(LISPremium>=-6 & LISPremium<=6 & year==2006, 1, 0),
         belowBench4 = ifelse(LISPremium<=0 & RDwindow4==1, 1, 0)) %>%
  select(uniqueID, starts_with(c("RDwindow","belowBench"))) %>%
  group_by(uniqueID) %>%
  summarize_all(max, na.rm=TRUE) %>%
  filter_at(vars(RDwindow1:belowBench4), all_vars(is.finite(.)))

final.data <- final.data %>%
  left_join(final.data.rd, by=c("uniqueID"))

Identification strategy

  • Binned scatterplots
  • RD estimation
    • Manipulation of the running variable
    • Optimal bandwidth and inference
  • Extension of Ericson (2014) using IV

Estimation

Binned Scatterplots
final.bw1 <- final.data %>% 
  filter(RDwindow1==1, benefit=="B", year==2006)

rd.result1 <- rdplot(y=final.bw1$ln_share, 
                     x=final.bw1$LISPremium, 
                     c=0,
                     p=4,
                     hide=TRUE,
                     ci=95,
                     nbins=20)

bin.avg1 <- as_tibble(rd.result1$vars_bins)

rd.plot1 <- bin.avg1 %>%
  ggplot() +
  geom_point(aes(x=rdplot_mean_bin, y=rdplot_mean_y)) +
  geom_vline(aes(xintercept=0),linetype='dashed') +
  labs(
    x="Monthly premium - LIS subsidy, 2006",
    y="log enrollment share, 2006"
  ) +
  geom_line(aes(x=rdplot_mean_bin, y=rdplot_ci_l), linetype='longdash') +
  geom_line(aes(x=rdplot_mean_bin, y=rdplot_ci_r), linetype='longdash') +
  theme_bw()
Binned Scatterplots with Different Binwidths
rd.result2 <- rdplot(y=final.bw1$ln_share, 
                     x=final.bw1$LISPremium, 
                     c=0,
                     p=4,
                     hide=TRUE,
                     ci=95,
                     nbins=10)

bin.avg2 <- as_tibble(cbind(rd.result2$vars_bins))

rd.plot2 <- bin.avg2 %>%
  ggplot() +
  geom_point(aes(x=rdplot_mean_bin, y=rdplot_mean_y)) +
  geom_vline(aes(xintercept=0),linetype='dashed') +
  labs(
    x="Monthly premium - LIS subsidy, 2006",
    y="log enrollment share, 2006"
  ) +
  geom_line(aes(x=rdplot_mean_bin, y=rdplot_ci_l), linetype='longdash') +
  geom_line(aes(x=rdplot_mean_bin, y=rdplot_ci_r), linetype='longdash') +
  theme_bw()



rd.result3 <- rdplot(y=final.bw1$ln_share, 
                     x=final.bw1$LISPremium, 
                     c=0,
                     p=4,
                     hide=TRUE,
                     ci=95,
                     nbins=30)

bin.avg3 <- as_tibble(cbind(rd.result3$vars_bins))

rd.plot3 <- bin.avg3 %>%
  ggplot() +
  geom_point(aes(x=rdplot_mean_bin, y=rdplot_mean_y)) +
  geom_vline(aes(xintercept=0),linetype='dashed') +
  labs(
    x="Monthly premium - LIS subsidy, 2006",
    y="log enrollment share, 2006"
  ) +
  geom_line(aes(x=rdplot_mean_bin, y=rdplot_ci_l), linetype='longdash') +
  geom_line(aes(x=rdplot_mean_bin, y=rdplot_ci_r), linetype='longdash') +
  theme_bw()
Scatterplot with Optimal Binwidth
rd.result4 <- rdplot(y=final.bw1$ln_share, 
                     x=final.bw1$LISPremium, 
                     c=0,
                     p=4,
                     hide=TRUE,
                     ci=95,
                     binselect="es")

bin.avg4 <- as_tibble(cbind(rd.result4$vars_bins))

rd.plot4 <- bin.avg4 %>%
  ggplot() +
  geom_point(aes(x=rdplot_mean_bin, y=rdplot_mean_y)) +
  geom_vline(aes(xintercept=0),linetype='dashed') +
  labs(
    x="Monthly premium - LIS subsidy, 2006",
    y="log enrollment share, 2006"
  ) +
  geom_line(aes(x=rdplot_mean_bin, y=rdplot_ci_l), linetype='longdash') +
  geom_line(aes(x=rdplot_mean_bin, y=rdplot_ci_r), linetype='longdash') +
  theme_bw()

opt.bins <- nrow(bin.avg4)
Manipulation Test
rd.test <- rddensity(X=final.bw1$LISPremium, p=4)
pval.rd.test <- rd.test$test$p_jk
diff.rd.test <- rd.test$hat$diff
rd.test.plot <- rdplotdensity(rd.test, final.bw1$LISPremium,
                              lcol = c("black","black"),
                              xlabel = "Monthly premium - LIS subsidy, 2006",
                              plotRange = c(-10,10),
                              plotN = 100)
RD Estimation
data.2006 <- final.data %>%
  filter(year==2006) %>%
  select(orgParentCode, planName, state, contractId, uniqueID,
         enrollment, share, ln_share, LISPremium_2006=LISPremium,
         premium_2006=premium)


for (i in 2006:2010) {
  rd.dat <- final.data %>% filter(year==i) %>%
    left_join(data.2006 %>% select(uniqueID, LISPremium_2006), by=c("uniqueID"))
  rd.est.p1 <- rdrobust(y=rd.dat$ln_share, x=rd.dat$LISPremium_2006, h=4, kernel="uniform")
  rd.est.p2 <- rdrobust(y=rd.dat$ln_share, x=rd.dat$LISPremium_2006, h=4, kernel="uniform", p=2)
  
  ti <- data.frame(
    term=c("RD Linear", "RD Polynomial"),
    estimate=c(-1*rd.est.p1$coef[1], -1*rd.est.p2$coef[1]),
    std.error=c(rd.est.p1$se[1], rd.est.p2$se[1])
  )
  
  gl <- data.frame(
    N1 = rd.est.p1$M[1],
    N2 = rd.est.p1$M[2]
  )

  mod <- list(
    tidy=ti,
    glance=gl
  )

  class(mod) <- "modelsummary_list"
    
  assign(paste0("mod",i),mod)
}
CE-optimal Bandwidth
for (i in 2006:2010) {
  rd.dat <- final.data %>% filter(year==i) %>%
    left_join(data.2006 %>% select(uniqueID, LISPremium_2006), by=c("uniqueID"))
  rd.est.p1 <- rdrobust(y=rd.dat$ln_share, x=rd.dat$LISPremium_2006, bwselect="cerrd", kernel="uniform")
  rd.est.p2 <- rdrobust(y=rd.dat$ln_share, x=rd.dat$LISPremium_2006, bwselect="cerrd", kernel="uniform", p=2)

  ti <- data.frame(
    term=c("RD Linear", "RD Polynomial"),
    estimate=c(-1*rd.est.p1$coef[1], -1*rd.est.p2$coef[1]),
    std.error=c(rd.est.p1$se[1], rd.est.p2$se[1])
  )
  
  gl <- data.frame(
    N1 = rd.est.p1$M[1],
    N2 = rd.est.p1$M[2]
  )
  
  mod <- list(
    tidy=ti,
    glance=gl
  )
  
  class(mod) <- "modelsummary_list"
  
  assign(paste0("modce",i),mod)
  
}
Discontinuity as IV
data.2007 <- final.data %>%
  filter(year==2007) %>%
  select(orgParentCode, planName, state, contractId, uniqueID,
         premium_2007=premium)

reg.data <- data.2006 %>%
  left_join(data.2007, by=c("orgParentCode","planName","state",
                            "contractId","uniqueID")) %>%
  mutate(premium_diff=premium_2007-premium_2006)

mean.share <- round(as.numeric(reg.data %>% summarize(mean_share=mean(share, na.rm=TRUE))), 3)
inertia <- feols(premium_diff~1 | ln_share~LISPremium_2006, data=reg.data)

Exercise 4: Demand Estimation and Market Share Construction

Question

What is the role of market definition on market shares and demand estimates?

Data

  1. Hospital Cost Report Information System
  2. Hospital Service Area Files

Data management

  • Biggest issue is creating markets using the community detection algorithm
  • Due to time, I did that for you with the ‘hospital_markets’ data, but if you want to do it yourself, please take a look at my ongoing Hospital Choice Project

Market Definition

Every analysis of competition requires some definition of the market. This is complicated in healthcare for several reasons:

  1. Hospital markets are more local than insurance markets
  2. Hospitals are multi-product firms
  3. Geographic market may differ by procedure
  4. Insurance networks limit choice within a geographic market

Hospital Service Areas (HSAs)

  1. Begin with town or cities with a hospital (possibly more than one)
  2. Assign zip codes to that town/hospital(s) if the plurality of people in that zip code receive care from that town/hospital(s)
  3. Define the HSA as all contiguous zip codes from step 2

Around 3,400 HSAs total
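
As a rough illustration of the plurality rule in step 2, here’s a minimal sketch assuming a hypothetical patient-flow data frame flows with columns zip, hosp_town, and patients (not the actual Dartmouth Atlas code):

## assign each zip to the town/hospital(s) receiving the plurality of its care
library(tidyverse)

zip.assign <- flows %>%
  group_by(zip, hosp_town) %>%
  summarize(patients=sum(patients), .groups="drop") %>%
  group_by(zip) %>%
  slice_max(patients, n=1, with_ties=FALSE) %>%
  ungroup() %>%
  select(zip, hsa_town=hosp_town)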

Hospital Referral Regions (HRRs)

  • Contiguous HSAs
  • Population of at least 120,000
  • Account for at least 65% of residents’ health care services (cardiovascular and neurosurgery)

306 HRRs total

Community Detection

Goal: Identify connected nodes (some geographic region like zip code or county) where residents tend to receive health care services

Community Detection

  1. Form data on geographic units, providers, and patient counts (bipartite). This is a matrix with geographic units as rows, providers as columns, and patient counts as cells
  2. Convert to a matrix of counts of connections (common hospitals) between geographic areas (unipartite)
  3. Employ the “cluster walktrap” algorithm to identify clusters of geographic units (sketched in the code below)
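
Here’s a minimal sketch of those three steps, assuming a hypothetical patient-flow data frame hosp.zip with columns zip, provider, and patients (illustrative only, not the exact code behind the hospital_markets data):

library(tidyverse)
library(igraph)

## step 1: bipartite matrix with zips as rows, providers as columns, patient counts as cells
bp <- hosp.zip %>%
  pivot_wider(names_from=provider, values_from=patients, values_fill=0) %>%
  column_to_rownames("zip") %>%
  as.matrix()

## step 2: unipartite zip-by-zip matrix counting common hospitals
bp.ind <- 1*(bp>0)
up <- bp.ind %*% t(bp.ind)
diag(up) <- 0

## step 3: cluster walktrap on the weighted zip graph
g <- graph_from_adjacency_matrix(up, mode="undirected", weighted=TRUE)
markets <- cluster_walktrap(g)
zip.market <- tibble(zip=rownames(up), market=as.integer(membership(markets)))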

Cluster Walktrap

What is a “cluster walktrap”?

  • Identify densely connected subgraphs (aka communities)
  • Random walk
    • “walker” moves from node to node, uniformly randomly among neighbors
    • “distance” will be large across communities and small within a community
  • Algorithm
    • Begin with each node as its own community and calculate distance from each adjacent node
    • Merge two adjacent communities, selected based on minimum sum of squared distances between each node and its community
    • Update distances between communities and repeat

Estimation

  • We’re asked to estimate a discrete choice logit model, but we only have continuous data on market shares at the hospital level
  • So…we need to employ Berry (1994)
  • His paper shows that a logit discrete choice model can be estimated with continuous market share data as follows…

Basic Setup

Indirect utility of person \(i\) from choice \(j\), \[u_{ij} = x_{ij}\beta + \epsilon_{ij},\] where \(x_{ij}\) denotes person (and perhaps product) characteristics and \(\epsilon_{ij}\) denotes an error term.

  • Standard (binary) logit: two alternatives, \(j=0,1\)
  • Multinomial logit: many possible choices, \(j=0,1,...,J\)

Logit terminology

A few different terms for very similar models:

  • Multinomial Logit: Individual covariates only, alternative-specific coefficients (a toy calculation follows this list). \[u_{ij}=x_{i}\beta_{j} + \epsilon_{ij},\] such that \[p_{ij} = \frac{e^{x_{i}\beta_{j}}}{\sum_{k} e^{x_{i}\beta_{k}}}\]

  • Conditional Logit: Allow for alternative-specific regressors, such that \[u_{ij}=x_{ij}\beta + \epsilon_{ij}\]

  • “Mixed” Logit: Allow for individual and alternative-specific regressors, such that \[u_{ij}=x_{ij}\beta + w_{i} \gamma_{j} + \epsilon_{ij}\]

but people sometimes use “mixed” to refer to random-coefficients logit
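
As a quick toy calculation of the multinomial logit choice probabilities above (illustrative numbers only):

## toy multinomial logit probabilities for one person facing three alternatives
x_i  <- 2                      # individual covariate
beta <- c(0, 0.5, -0.3)        # alternative-specific coefficients (reference alternative normalized to 0)
v    <- x_i*beta               # representative utility for each alternative
p    <- exp(v)/sum(exp(v))     # p_ij = exp(x_i*beta_j) / sum_k exp(x_i*beta_k)
sum(p)                         # probabilities sum to 1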

Does it matter?

These are really all the same and it’s just a matter of specification (e.g., interact individual covariates with product characteristics or with product dummies). I’ll refer to them as “multinomial” logit.

The Independence of Irrelevant Alternatives

Fundamental issue with logit models…the ratio of choice probabilities for \(j\) and \(k\) does not depend on any other alternatives: \[\frac{P_{ij}}{P_{ik}} = \frac{e^{V_{ij}}}{e^{V_{ik}}}.\]
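
A quick numerical check of this property (toy values only):

## the ratio of probabilities for alternatives 1 and 2 ignores alternative 3 (IIA)
v  <- c(0, 1, -0.6)                   # representative utilities for three alternatives
p3 <- exp(v)/sum(exp(v))              # probabilities with all three alternatives
p2 <- exp(v[1:2])/sum(exp(v[1:2]))    # probabilities after removing alternative 3
p3[1]/p3[2]; p2[1]/p2[2]              # identical ratios: both equal exp(v[1]-v[2])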

Relaxing IIA

  • This is really an omitted variables problem…with enough interactions, we can allow for a sufficiently rich substitution pattern
  • Alternatively, relax assumptions on the error term with nested logit or random-coefficient logit (or multinomial probit)

Discrete choice with market level data

Utility of individual \(i\) from selecting product \(j\) is \[U_{ij}=\delta_{j}+\epsilon_{ij},\] where \(\delta_{j}=x_{j}\beta + \xi_{j}\), and \(\xi_{j}\) represents the mean level of utility derived from unobserved characteristics.

Discrete choice with market level data

The goal is to find \(\hat{\delta}\) to satisfy the moment condition, \[\frac{1}{J}\sum_{j} (\hat{\delta}_{j}-x_{j}\beta)z_{j}=0.\]

In the standard logit, \(s_{j}=e^{\delta_{j}}/\sum_{k} e^{\delta_{k}}\), and \(\delta_{j}\) then follows directly from taking logs and subtracting the log outside share (with the normalization \(\delta_{0}=0\)), which yields the estimating equation \[\ln(s_{j}) - \ln(s_{0}) = x_{j}\beta + \xi_{j}.\]
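
In practice this is just a linear regression once the inside and outside shares are constructed. A minimal sketch, assuming a hypothetical market-level data frame mkt.data with columns market, hospital, discharges, market_pop, and a characteristic such as beds (not the exact exercise solution):

library(tidyverse)
library(fixest)

berry.data <- mkt.data %>%
  group_by(market) %>%
  mutate(share = discharges/market_pop,                # inside shares s_j
         share_out = 1 - sum(share),                   # outside share s_0
         ln_diff = log(share) - log(share_out)) %>%    # ln(s_j) - ln(s_0)
  ungroup()

## estimating equation: ln(s_j) - ln(s_0) = x_j*beta + xi_j
berry.est <- feols(ln_diff ~ beds | market, data=berry.data)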

Discrete choice with market level data

  • Standard logit imposes cross-price elasticities that are proportional to market shares (limited substitution patterns)
  • Relax with nested logit or random-coefficients logit
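
For reference, Berry (1994) also derives the nested logit version of this inversion, which relaxes the proportional substitution pattern by adding the within-group share: \[\ln(s_{j}) - \ln(s_{0}) = x_{j}\beta + \sigma \ln(s_{j|g}) + \xi_{j},\] where \(s_{j|g}\) is product \(j\)’s share within its nest \(g\) and \(\sigma\) captures within-nest correlation.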

References

Berry, Steven T. 1994. “Estimating Discrete-Choice Models of Product Differentiation.” The RAND Journal of Economics 25 (2): 242–62.