Population, Infrastructure, and Nighttime Lights: A Spatial Regression Analysis Across Countries
Authors: Bouchra Daddaoui · Namuunzaya Barsbold · Amanda González Mejía
Repository: https://github.com/namibarsbold/STATS201-Course-project
GitHub Pages: https://namibarsbold.github.io/STATS201-Course-project/
Research Question and Motivation
This project asks how population density and infrastructure jointly predict satellite-observed nighttime light intensity, and whether the population–light relationship varies across regions and countries in ways consistent with spatial electrification inequality. Nighttime lights are widely used as a remotely sensed proxy for economic activity and development when administrative statistics are noisy or missing; related work emphasizes that lights also reflect infrastructure access and spatial inequality, capturing uneven electrification, industrial concentration, and development disparities (Henderson et al., 2012; Falchetta et al., 2020).
Electricity is a foundational input shaping labor markets, productivity, and household welfare. Causal evidence from rural electrification expansions links electricity access to higher labor market participation, particularly increased female employment in South Africa (Dinkelman, 2011). If lights proxy electrification, then spatial variation in lights plausibly reflects differences in economic opportunity.
This variation is spatially structured. Subnational light patterns have been used to measure within-country inequality, revealing regional disparities that national averages obscure (Lipscomb et al., 2013). Related work argues that population–development elasticities vary across spatial contexts as institutions and infrastructure change how demographic concentration translates into observed outcomes (Lessmann & Seidel, 2017).
Data and Feature Engineering
Remote Sensing and Population Data
Nighttime lights come from the Earth Observation Group (EOG) VIIRS annual nighttime lights composite product (referred to in our materials as "Annual VNL V1"), derived from the VIIRS Day/Night Band. These annual composites are processed (not raw imagery): they filter sunlit/moonlit/cloudy observations, reduce ephemeral/background noise, use the vcm screening configuration to exclude stray-light-impacted data, and apply outlier filtering to reduce biomass burning and other anomalous radiance. Population comes from WorldPop annual gridded population surfaces, which are model-based (spatial disaggregation / dasymetric mapping rather than census counts at each pixel), and therefore introduce measurement uncertainty at fine spatial scales.
Spatial Aggregation and Region Typology
Annual lights and population rasters are aggregated into fixed tiles to form a panel. Tile sizing is adaptive (256–1024 pixels; 60–80 tiles along the short dimension) to manage computation while preserving within-country structure. Tiles are filtered for data quality (e.g., dropping tiles with fewer than 500 valid pixels). The resulting dataset contains 30,442 tile-year observations across the three study countries (Brazil, China, and Morocco). Each tile-year is classified into one of five spatial regimes using country-year–specific thresholds computed from the joint distribution of tile-level population density and tile-level light intensity. Country-year medians of population and lights serve as the primary thresholds defining "high" vs. "low" relative to the country-year. The four corner regimes are assigned by the median split:
- urban_core: population ≥ median and lights ≥ median (high–high)
- dense_dim: population ≥ median and lights < median (high population, low light)
- bright_sparse: population < median and lights ≥ median (low population, high light)
- empty_or_rural: population < median and lights < median (low–low)
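The median split can be sketched in a few lines. This is an illustrative reading of the rule, not the project's pipeline code: the function name and the `band` parameter (a relative half-width around each median inside which a tile is labelled "mixed", the fifth regime) are hypothetical, since the report does not state the width actually used.

```python
from statistics import median

def classify_regimes(pop, light, band=0.05):
    """Label each tile of ONE country-year with a regime.

    pop, light: tile-level values for a single country-year.
    band: hypothetical relative half-width around each median; tiles inside
          it are labelled "mixed". The actual width is not given in the report.
    """
    pop_med, light_med = median(pop), median(light)
    labels = []
    for p, l in zip(pop, light):
        near_pop = abs(p - pop_med) <= band * abs(pop_med)
        near_light = abs(l - light_med) <= band * abs(light_med)
        if near_pop or near_light:
            labels.append("mixed")
        elif p >= pop_med and l >= light_med:
            labels.append("urban_core")
        elif p >= pop_med:
            labels.append("dense_dim")
        elif l >= light_med:
            labels.append("bright_sparse")
        else:
            labels.append("empty_or_rural")
    return labels

# Toy country-year with five tiles (made-up numbers).
pop = [1.0, 2.0, 10.0, 50.0, 100.0]
light = [0.1, 5.0, 1.0, 0.2, 9.0]
labels = classify_regimes(pop, light)
```

Because the thresholds are recomputed per country-year, the function would be called once per (country, year) group rather than on the pooled panel.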
"mixed" is reserved for tiles that are not cleanly separated by the median split in practice, i.e., tiles that fall in a narrow neighborhood around the thresholds (near-median population and/or near-median lights) and are therefore not substantively "extreme" in either direction. Operationally in the pipeline, these observations are flagged as mixed during the classification step rather than being forced into a corner regime. This typology is fully data-driven and not externally validated against administrative urban/rural definitions. Because it is constructed from the same variables used in the regression, it is used as a descriptive stratification (regime heterogeneity) rather than a causal treatment.
Figure 1. Cross-country maps of 2023 population–light regimes and cumulative brightness change (2014–2023), illustrating persistent spatial stratification and uneven light growth.
Figure 2. Log population vs log nighttime light intensity at the tile level (2020). Both Brazil and Morocco show a clear positive association — tiles with higher resident populations tend to emit more light. Substantial scatter around the trend line indicates that population density alone does not fully account for spatial variation in brightness. Outliers above the regression line likely reflect industrial zones, tourist areas, or resource-extraction sites with high radiance relative to permanent residents. Brazil shows a tighter overall relationship with a pronounced upper-right cluster of dense, bright metropolitan tiles; Morocco displays more heterogeneity, consistent with coastal concentration of economic activity and large dim interior regions. This scatter motivates the regime-stratified interaction approach used throughout this study.
Figure 3. Region typology share evolution across 2014–2023. In Brazil (left), spatial regime composition is relatively stable over the decade: the vast majority of tiles remain classified as "empty_or_rural," while "urban_core" tiles constitute a consistently small but stable share concentrated in major metropolitan clusters. In Morocco (right), regime shares shift more noticeably — a gradual reduction in "empty_or_rural" tiles and modest growth in "dense_dim" and "mixed" tiles — consistent with ongoing peri-urban expansion and electrification around secondary cities. Temporal stability in Brazil supports using the typology as a structural covariate; the more dynamic pattern in Morocco motivates attention to temporal robustness.
Infrastructure Features
Infrastructure is proxied with spatial accessibility measures from tile geometry: (i) distance to urban core (isolation from the nearest urban-core tile), (ii) local urban density (count of urban-classified tiles within a fixed radius, capturing clustering/spillovers), and (iii) geographic centrality score (distance from the tile centroid to the national geographic centroid). Distance to urban core and local urban density enter in logs for comparability with the log–log population/lights specification; centrality enters untransformed, as written in the final specification. We initially planned to include road density derived from OpenStreetMap via Geofabrik extracts (road length per tile area), but due to processing and harmonization constraints this variable is not included in the final specification and is treated as future work.
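A minimal sketch of how the three accessibility measures could be computed from tile centroids. Several details are assumptions made for illustration: the function name, the `radius` value, excluding a tile from its own neighbor count, and approximating the national centroid by the mean of tile centroids. Coordinates are assumed to be in a projected system where Euclidean distance is meaningful.

```python
import numpy as np

def accessibility_features(xy, is_urban_core, radius):
    """xy: (n, 2) tile centroids; is_urban_core: (n,) bool; radius: same units as xy."""
    xy = np.asarray(xy, dtype=float)
    core = np.asarray(is_urban_core, dtype=bool)
    # Dense pairwise distances: fine for a sketch, a KD-tree scales better.
    d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=2)
    # (i) distance to the nearest urban-core tile (0 for core tiles themselves)
    dist_to_core = d[:, core].min(axis=1)
    # (ii) number of urban-core tiles within `radius`, not counting the tile itself
    within = (d <= radius) & core[None, :]
    np.fill_diagonal(within, False)
    local_density = within.sum(axis=1)
    # (iii) distance to the centroid (approximated by the mean of tile centroids)
    centrality = np.linalg.norm(xy - xy.mean(axis=0), axis=1)
    return dist_to_core, local_density, centrality

# Four tiles on a line; the first two are urban cores (made-up layout).
xy = [[0, 0], [1, 0], [2, 0], [10, 0]]
dist_to_core, local_density, centrality = accessibility_features(
    xy, [True, True, False, False], radius=1.5
)
```

Since distance is zero for urban-core tiles and counts can be zero, a log transform of these features would need an offset such as `np.log1p`; whether the project uses one is not stated in the report.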
Figure 4. Spatial distribution of infrastructure accessibility variables (distance to urban core, local urban density, centrality) across Brazil, China, and Morocco (2023). Each map shows how infrastructure access varies across tiles within each country. Brazil exhibits a large, relatively distributed urban network with moderate accessibility gradients across its interior. China shows a steep west-to-east accessibility gradient, with coastal provinces highly connected and interior provinces remote. Morocco displays pronounced coastal concentration of infrastructure access, with the interior largely isolated from urban networks — consistent with the large infrastructure R² gains observed for Morocco in the regression.
Methods
Modeling Strategy
We use interpretable regression (rather than high-capacity machine learning) because the goal is estimation and interpretation of regime-specific elasticities linking population density to nighttime light intensity. All models use a log–log structure, so slopes are interpretable as elasticities: a 1% increase in population density is associated with a β% change in lights, holding other regressors constant.
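One caveat on the elasticity reading, offered as a short derivation rather than as part of the estimated results. With a $\log(1+x)$ transform on both sides, $\beta$ is exactly the elasticity of $(1+\text{Light})$ with respect to $(1+\text{Pop})$. Differentiating $\log(1+\text{Light}) = \alpha + \beta\log(1+\text{Pop})$ gives the elasticity of the untransformed variables:

$$\frac{\partial \log \text{Light}}{\partial \log \text{Pop}} = \beta \cdot \frac{\text{Pop}}{1+\text{Pop}} \cdot \frac{1+\text{Light}}{\text{Light}}$$

The two correction factors approach 1 when both population and light are large, so the plain reading of $\beta$ as an elasticity is most accurate for populous, bright tiles and is only approximate for tiles with values near zero.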
Baseline Models
Pooled Log–Log Model
The simplest baseline estimates the overall association between population and light:
$$\log(1 + \text{Light}_{it}) = \alpha + \beta \log(1 + \text{Pop}_{it}) + \varepsilon_{it}$$
This specification establishes the aggregate population–light relationship without allowing spatial heterogeneity.
Regime-Interaction Model
To allow elasticities to vary across spatial development regimes, we estimate:
$$\log(1 + \text{Light}_{it}) = \alpha + \beta \log(1 + \text{Pop}_{it}) + \gamma_r + \delta_r \bigl(\log(1 + \text{Pop}_{it}) \times R_{it}\bigr) + \varepsilon_{it}$$
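where $R_{it}$ is an indicator that tile $i$ belongs to region type $r$ in year $t$, and $\gamma_r$ and $\delta_r$ are the regime-specific intercept and slope shifts. With an intercept $\alpha$ and a common slope $\beta$ in the model, one regime must serve as the omitted reference category (for which $\gamma_r = \delta_r = 0$).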
In this interaction design, the regime-specific elasticity for regime $r$ is given by:
$$\beta + \delta_r$$
This specification captures heterogeneity in how population translates into light across urban cores, dense-but-dim regions, bright-sparse areas, mixed tiles, and empty/rural tiles.
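A sketch of how the interaction design and the implied slopes $\beta + \delta_r$ could be computed, using plain least squares on synthetic data. The function name, the choice of `empty_or_rural` as the reference regime, and all numbers in the example are assumptions for illustration; they are not the project's estimates.

```python
import numpy as np

REGIMES = ["empty_or_rural", "mixed", "bright_sparse", "dense_dim", "urban_core"]

def fit_regime_interaction(log_pop, log_light, regime, reference="empty_or_rural"):
    """Return the implied slope (beta + delta_r) for every regime."""
    log_pop = np.asarray(log_pop, dtype=float)
    log_light = np.asarray(log_light, dtype=float)
    regime = np.asarray(regime)
    others = [r for r in REGIMES if r != reference]
    cols = [np.ones_like(log_pop), log_pop]      # alpha, beta
    for r in others:
        dummy = (regime == r).astype(float)
        cols.append(dummy)                       # gamma_r (intercept shift)
        cols.append(dummy * log_pop)             # delta_r (slope shift)
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, log_light, rcond=None)
    beta = coef[1]
    slopes = {reference: beta}
    for j, r in enumerate(others):
        slopes[r] = beta + coef[3 + 2 * j]       # delta_r sits after gamma_r
    return slopes

# Synthetic check: generate data with known regime slopes and recover them.
rng = np.random.default_rng(0)
true_slopes = {"empty_or_rural": 0.0, "mixed": 0.1, "bright_sparse": 0.2,
               "dense_dim": 0.3, "urban_core": 0.5}
regime = np.repeat(REGIMES, 200)
log_pop = rng.uniform(0.0, 8.0, regime.size)
slope = np.array([true_slopes[r] for r in regime])
log_light = 0.2 + slope * log_pop + rng.normal(0.0, 0.05, regime.size)
slopes = fit_regime_interaction(log_pop, log_light, regime)
```

The reference regime's slope is $\beta$ itself, so changing the reference changes the individual coefficients but not the implied slope of any regime.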
Figure 5. Population–light interaction slopes by spatial regime (2023). Each bar shows the regime-specific elasticity — the percentage change in nighttime light intensity associated with a 1% increase in population density within that spatial type. Elasticities range from near-zero in "empty_or_rural" tiles (Brazil: 0.014; China: 0.017; Morocco: 0.015) to well above 0.4 in "urban_core" tiles (Brazil: 0.471; China: 0.696; Morocco: 0.544). This sharp cross-regime variation confirms that the population–light relationship is strongly nonlinear in spatial context: demographic growth in low-access regions yields almost no increase in radiance, while equivalent growth in dense urban systems translates into large brightness gains. China's particularly high urban-core elasticity (0.696) suggests a strong "urban multiplier" effect — dense built form, shared networks, and industrial concentration amplify population–light coupling far beyond what population size alone predicts.
Final Model
The final model extends the regime-interaction baseline by incorporating spatial accessibility variables that proxy infrastructure conditions. These variables are added additively while retaining regime-specific elasticities.
The final specification is:
$$\log(1 + \text{Light}_{it}) = \alpha + (\beta + \delta_r)\log(1 + \text{Pop}_{it}) + \lambda_1 \log(\text{DistanceToUrbanCore}_{it}) + \lambda_2 \log(\text{LocalUrbanDensity}_{it}) + \lambda_3 \text{Centrality}_{it} + \gamma_r + \varepsilon_{it}$$
where $r$ denotes region type.
Models are estimated by OLS. Key design choices that can affect estimates include tile sizing, valid-pixel thresholds, log(1+x) transforms, and median-based regime thresholds. Given heteroskedasticity in residuals (especially in high-light urban regions), we report heteroskedasticity-robust standard errors.
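For reference, heteroskedasticity-robust standard errors can be written out directly as a sandwich estimator. The sketch below uses the HC1 variant on synthetic data; the report does not say which HC variant was used, so HC1 and the function name are assumptions.

```python
import numpy as np

def ols_hc1(X, y):
    """OLS coefficients with HC1 heteroskedasticity-robust standard errors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    coef = xtx_inv @ X.T @ y
    resid = y - X @ coef
    # Sandwich "meat": sum over observations of e_i^2 * x_i x_i'
    meat = X.T @ (X * (resid ** 2)[:, None])
    # HC1 = HC0 scaled by n / (n - k) as a small-sample correction
    cov = xtx_inv @ meat @ xtx_inv * n / (n - k)
    return coef, np.sqrt(np.diag(cov))

# Synthetic data whose noise grows with the regressor (made-up numbers).
rng = np.random.default_rng(1)
n = 2000
log_pop = rng.uniform(0.0, 5.0, n)
noise = rng.normal(0.0, 0.1 * (1.0 + log_pop))
log_light = 1.0 + 0.5 * log_pop + noise
X = np.column_stack([np.ones(n), log_pop])
coef, robust_se = ols_hc1(X, log_light)
```

In statsmodels the same covariance can be requested with `OLS(y, X).fit(cov_type="HC1")`. Robust standard errors change inference only; the coefficient estimates are identical to plain OLS.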
Evaluation and Results
Evaluation Framework
We evaluate in-sample because the goal is explaining spatial variation and interpreting elasticities, not claiming generalization performance. Because tiles are spatially autocorrelated, blocked spatial cross-validation is the correct out-of-sample approach, but it is not implemented in this report.
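Although blocked spatial cross-validation is not implemented in this report, the fold assignment it requires is simple to sketch. The idea is to group tiles into coarse spatial blocks and hold out whole blocks, so that neighboring tiles do not end up on both sides of a split. The function name, block size, and fold count below are hypothetical.

```python
import numpy as np

def spatial_block_folds(xy, block_size, n_folds=5, seed=0):
    """Assign each tile to a spatial block, and each block to one CV fold."""
    xy = np.asarray(xy, dtype=float)
    # Integer grid cell of each tile centroid
    cells = np.floor(xy / block_size).astype(int)
    # One integer id per distinct cell
    _, block_id = np.unique(cells, axis=0, return_inverse=True)
    block_id = block_id.ravel()
    n_blocks = block_id.max() + 1
    rng = np.random.default_rng(seed)
    # Shuffle blocks, then deal them out to folds round-robin
    fold_of_block = rng.permutation(n_blocks) % n_folds
    return block_id, fold_of_block[block_id]

# 500 made-up tile centroids in a 100 x 100 square, 20-unit blocks.
rng = np.random.default_rng(0)
xy = rng.uniform(0.0, 100.0, size=(500, 2))
block_id, fold = spatial_block_folds(xy, block_size=20.0, n_folds=5)

for f in range(5):
    train_idx = np.where(fold != f)[0]
    test_idx = np.where(fold == f)[0]
    # fit on train_idx, score on test_idx
```

Blocks should be at least as large as the range over which residuals are correlated; otherwise the held-out score remains optimistic.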
Metrics
We report three complementary performance measures. Firstly, R², which captures the share of variation in log-transformed nighttime light intensity explained by the model. Secondly, RMSE, which measures the typical magnitude of prediction error (in log units). Thirdly, AIC, an information criterion that compares model efficiency while penalizing additional covariates. Together, these metrics quantify explanatory power and relative model efficiency within the analyzed sample.
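The three metrics can be computed from observed and fitted values as follows. This is a sketch under stated assumptions: AIC uses the Gaussian log-likelihood at the maximum-likelihood variance estimate, and `n_params` counts every estimated regression coefficient including the intercept. Other software may count parameters differently, which shifts AIC by a constant but leaves comparisons between models fitted in the same way unaffected.

```python
import math

def r2_rmse_aic(y, y_hat, n_params):
    """In-sample R^2, RMSE, and Gaussian AIC for a fitted linear model."""
    n = len(y)
    mean_y = sum(y) / n
    rss = sum((a - b) ** 2 for a, b in zip(y, y_hat))   # residual sum of squares
    tss = sum((a - mean_y) ** 2 for a in y)             # total sum of squares
    r2 = 1.0 - rss / tss
    rmse = math.sqrt(rss / n)
    # Gaussian log-likelihood evaluated at sigma^2 = rss / n
    loglik = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1.0)
    aic = -2.0 * loglik + 2.0 * n_params
    return r2, rmse, aic

# Tiny made-up example.
y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.1, 1.9, 3.2, 3.8]
r2, rmse, aic = r2_rmse_aic(y, y_hat, n_params=2)
```

Because AIC adds 2 per parameter, an extra covariate lowers AIC only if it raises the log-likelihood by more than 1.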
Baseline Interaction Model Results
Fit is high across countries (R² = 0.879–0.941), and elasticities vary sharply by regime, indicating systematic spatial nonlinearity in population–light coupling. Empty/rural elasticities are near zero and statistically indistinguishable from zero (Brazil 0.014; China 0.017; Morocco 0.015). Substantively, marginal population increases in low-density areas yield negligible changes in radiance (a 10% population increase corresponds to roughly a 0.1–0.2% change in lights).
Mixed tiles show small but statistically significant elasticities (Brazil 0.056; China 0.039; Morocco 0.083), consistent with population translating into brightness once baseline infrastructure is present (roughly 0.4–0.8% higher lights for a 10% population increase). Bright_sparse shows pronounced cross-country heterogeneity: Morocco has a large, highly significant elasticity (0.238), while Brazil (0.039) and China (0.030) are small and statistically weak—suggesting that "bright but low-density" areas capture different underlying structures across countries.
Urban-core elasticities are large and precisely estimated (Brazil 0.471; China 0.696; Morocco 0.544). In China, a 10% increase in population in urban cores is associated with nearly a 7% increase in lights, consistent with an "urban premium" where dense built form and shared networks amplify the population–light relationship.
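The "10% increase" readings use the linear approximation 10 × β. The exact implied change under a constant-elasticity model is (1.10^β − 1) × 100, which the sketch below computes for the urban-core elasticities reported above. Strictly, this applies to (1 + Pop) and (1 + Light) because of the log(1+x) transform; the helper name is hypothetical.

```python
urban_core_elasticity = {"Brazil": 0.471, "China": 0.696, "Morocco": 0.544}

def pct_change_in_lights(beta, pct_pop=10.0):
    """Exact % change in (1 + Light) for a pct_pop % change in (1 + Pop)."""
    return ((1.0 + pct_pop / 100.0) ** beta - 1.0) * 100.0

exact = {c: pct_change_in_lights(b) for c, b in urban_core_elasticity.items()}
approx = {c: 10.0 * b for c, b in urban_core_elasticity.items()}

for country in urban_core_elasticity:
    print(f"{country}: exact {exact[country]:.2f}%, linear approx {approx[country]:.2f}%")
```

For elasticities below 1 the exact figure is slightly smaller than the linear approximation, so "nearly 7%" for China is a fair description of either number.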
Table 1: Baseline Regression Results. Cells show the implied slope of log(1+Population) within each regime (standard errors in parentheses). *** p<0.01, ** p<0.05, * p<0.10.
Infrastructure-Augmented Model Results
Adding distance to urban core, local urban density, and centrality modestly but consistently improves model performance relative to the baseline: ΔR² = +0.013 (Morocco), +0.005 (China), +0.001 (Brazil). AIC declines in all three countries, most in China (ΔAIC = −28.1), followed by Morocco (−6.7) and Brazil (−5.5), indicating improved efficiency despite additional covariates.
Conditioning on infrastructure shifts the role of population. In Morocco, log(pop) remains large and highly significant (0.242, p < 0.01), implying that a 10% population increase is associated with about a 2.4% increase in lights, holding infrastructure constant. In Brazil and China, the direct population coefficient becomes small and insignificant (0.003; 0.024), consistent with the idea that what looked like a "population effect" in simpler models is largely mediated by spatial accessibility and urban structure.
Distance to urban centers is negative and statistically significant in Brazil (−0.024, p < 0.01) and China (−0.027, p < 0.01): a 10% increase in distance predicts roughly 0.2–0.3% lower lights, conditional on population and regime. Local urban density is strongly negative in all countries (Morocco −0.749, p < 0.05; Brazil −0.786, p < 0.01; China −1.705, p < 0.01), capturing a steep gradient from dense urban systems to less developed peripheries. Centrality is significant and negative in China (−0.151, p < 0.01) and Morocco (−0.202, p < 0.05), but small and statistically insignificant in Brazil (p > 0.10). Because local urban density is mechanically related to the regime classification and broader spatial structure, the large negative coefficients should be interpreted as controlling for spatial stratification rather than as a standalone causal mechanism, as multicollinearity or suppression effects may partly influence magnitude.
Figure 6. Model Fit Comparison and Infrastructure-induced R² Gains (2023). Left panel: R² for baseline and final model side by side per country. Right panel: marginal ΔR² from adding infrastructure variables — Morocco gains the most (+0.013), China moderately (+0.005), Brazil minimally (+0.001), consistent with Morocco's high geographic polarization where spatial accessibility explains variance beyond population and regime type alone.
Figure 7. Predicted vs. actual log nighttime light intensity for the final infrastructure-augmented model (2023 cross-section). Points close to the 45-degree line indicate accurate predictions. Across all three countries, the bulk of observations cluster tightly around the line, consistent with the high in-sample R² values. A small number of extreme outliers at the upper end — primarily corresponding to ultra-bright tiles in industrial or metropolitan zones — show systematic underprediction, indicating that the constant-elasticity log–log specification does not fully capture nonlinear dynamics at the top of the light distribution.
Figure 8. Spatial model maps showing observed log nighttime light, baseline model predictions, and final model predictions at the tile level (2023). Comparing the observed and predicted maps reveals where each model captures or misses the spatial pattern of brightness. The final model (right column) tracks the observed map more closely than the baseline (middle column), particularly in geographically polarized countries: Morocco's coastal concentration and China's east-west gradient are better reproduced once infrastructure accessibility variables are included. Remaining prediction gaps are concentrated in extremely bright urban cores, consistent with the heteroskedasticity and underprediction documented in the diagnostics section.
Diagnostics and Robustness Checks
Diagnostic Assessment
All three metrics defined under Evaluation favor the final model. R² rises in every country (ΔR² = +0.013 Morocco; +0.005 China; +0.001 Brazil). RMSE, measured in log units, falls from 0.0983 to 0.0902 in Morocco (−8.2%), from 0.0844 to 0.0839 in Brazil (−0.6%), and from 0.1444 to 0.1418 in China (−1.8%). AIC also declines despite the added covariates (ΔAIC = −6.7 Morocco; −28.1 China; −5.5 Brazil). Within the analyzed sample, the infrastructure variables therefore improve explanatory power, prediction error, and relative model efficiency, with the largest R² and RMSE gains in Morocco and the largest AIC gain in China.
Furthermore, residual–fitted plots indicate that the infrastructure model modestly reduces dispersion, most visibly in Morocco, consistent with the RMSE improvement (−8.2%). However, heteroskedasticity persists at higher fitted values, particularly in China, where residual variance increases substantially in bright urban cores. Its fan-shaped pattern suggests nonlinear brightness dynamics at the upper tail rather than simple random noise. In Brazil and China, residuals also display mild clustering at mid-to-high fitted values, indicating that some peri-urban and industrial zones remain systematically under- or over-predicted. These patterns imply remaining spatial structure not captured by the additive specification. While robust standard errors address heteroskedasticity in inference, formal spatial diagnostics (like Moran's I on residuals) would be required to quantify the remaining autocorrelation.
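As an illustration of the spatial diagnostic mentioned above, Moran's I can be computed directly when residuals sit on a complete rectangular grid. The sketch assumes rook contiguity (tiles sharing an edge are neighbors) with binary weights and no missing tiles; real tile sets with gaps would need an explicit weight matrix, for example from a library such as PySAL's esda. The function name is hypothetical.

```python
import numpy as np

def morans_i_rook(resid_grid):
    """Moran's I for residuals on a full 2-D grid with rook (edge) neighbors."""
    z = np.asarray(resid_grid, dtype=float)
    z = z - z.mean()
    # Sum of z_i * z_j over horizontally and vertically adjacent pairs
    horiz = (z[:, :-1] * z[:, 1:]).sum()
    vert = (z[:-1, :] * z[1:, :]).sum()
    n_pairs = z[:, :-1].size + z[:-1, :].size
    # The symmetric weight matrix counts each undirected pair twice
    cross = 2.0 * (horiz + vert)
    s0 = 2.0 * n_pairs
    return (z.size / s0) * cross / (z ** 2).sum()

# Two toy residual surfaces on a 4 x 4 grid.
rows, cols = np.indices((4, 4))
checkerboard = np.where((rows + cols) % 2 == 0, 1.0, -1.0)  # neighbors always differ
gradient = cols.astype(float)                               # smooth west-east trend

i_checker = morans_i_rook(checkerboard)
i_gradient = morans_i_rook(gradient)
```

Values near zero are expected when residuals are spatially random; clearly positive values, as for the gradient surface, indicate the kind of clustering visible in the residual maps.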
In addition to residual diagnostics, we evaluate temporal and specification sensitivity. Figure 9 shows that model R² remains stable over 2014–2023 in all three countries, with the final specification consistently outperforming the baseline; infrastructure gains are persistent rather than episodic, averaging roughly 0.013–0.017 in Morocco, 0.004–0.006 in China, and near zero in Brazil. Figure 10 indicates that the population–light slope evolves smoothly over time, with no structural breaks, suggesting stable coupling between demographic density and radiance. Finally, Figure 11 shows that regime-specific elasticities in 2023 remain qualitatively unchanged between the baseline and infrastructure models: the urban-core premium remains dominant, low-density regimes remain near zero, and relative ranking across regimes is preserved. These patterns indicate that the core results are not driven by a single year or fragile specification choice, but reflect stable spatial relationships.
Figure 9. Temporal R² 2014–2023. Left: R² for baseline and final model per country over time — Morocco (red), Brazil (blue), China (gold). Right: infrastructure ΔR² gain over time. R² remains consistently high and stable with no systematic degradation, validating the 2023 cross-section as representative and confirming that infrastructure gains are persistent rather than episodic.
Figure 10. Temporal Coefficient Trends 2014–2023. The pooled population–nightlights slope (β) evolves smoothly over time with no structural breaks across Brazil, China, and Morocco, suggesting stable coupling between demographic density and radiance over the decade. Morocco's slope rises gradually from ~0.12 to ~0.14, consistent with ongoing electrification; Brazil and China remain stable near 0.16–0.18.
Figure 11. Cross-Country Elasticity Comparison. Regime-specific population–light elasticities (baseline vs. final model) across Morocco, Brazil, and China (2023). Urban-core elasticities remain the largest in all countries (urban-core premium), low-density regimes remain near zero, and the relative ranking across regimes is preserved after adding infrastructure variables — confirming that core results are robust to model specification.
Figure 12. Residual-versus-fitted plots for Brazil (left, red), China (center, teal), and Morocco (right, gold), final model, 2023 cross-section. All three panels display a mild funnel shape: residual variance is larger at intermediate fitted values and compresses at the high end, indicating heteroskedasticity. This is most pronounced in China, where the highest-brightness metropolitan tiles carry large positive residuals (observed light exceeds predicted, i.e., systematic underprediction), consistent with nonlinear dynamics in ultra-dense urban systems. The persistence of heteroskedasticity across all countries, despite the log transformation, motivates the use of heteroskedasticity-robust standard errors throughout.
Figure 13. Spatial residual maps (2020) showing the geographic distribution of model prediction errors at the tile level. Warm colors indicate underprediction (actual light exceeds predicted); cool colors indicate overprediction. Systematic spatial clustering in residuals signals that not all spatial structure has been captured by population, regime type, and accessibility variables. In Brazil, residual clustering is visible in the northeastern interior and parts of the São Paulo metropolitan area. In China, pronounced clustering along the coastal manufacturing belt reflects agglomeration economies not captured by the population-and-accessibility framework. In Morocco, residuals concentrate along the Casablanca–Rabat coastal corridor. These maps highlight the need for additional controls — road density, industrial composition, policy-zone designations — in future model extensions.
Interpretation and Substantive Takeaways
Population density is strongly associated with nighttime brightness across all three countries. However, the strength of this association varies systematically by spatial regime. In dense metropolitan systems, increases in population translate efficiently into greater luminosity, whereas in low-access regions, demographic growth produces little change in brightness. Cross-country comparisons therefore reflect differences in spatial structure rather than simply differences in national average brightness. For example, China's modest pooled population–light slope (roughly 0.16–0.18; Figure 10) masks a sharp internal divide between highly responsive urban cores (elasticity 0.696) and weakly responsive empty/rural tiles (0.017). This highlights how national aggregates obscure regime-level inequality.
Adding infrastructure accessibility variables improves explanatory power most in geographically polarized systems, particularly Morocco and China. In these contexts, population density alone fails to account for brightness patterns when spatial connectivity is uneven. In Morocco, for instance, coastal concentration of lights and large interior distances to urban cores indicate that geographic isolation explains substantial additional variance beyond population. These findings suggest that electrification intensity is not purely demographic; it is mediated by spatial access and connectivity.
Figure 14. Infrastructure accessibility (distance to urban core) vs nighttime light intensity across tiles, by country. The downward-sloping relationship indicates that tiles farther from the nearest urban center tend to emit less light, even after conditioning on population density and regime type. This negative gradient is steepest in Morocco and China, where geographic polarization is strong: beyond a certain distance from the urban network, tiles are systematically less bright regardless of resident population size. The persistence of this gradient after controlling for population and spatial regime supports interpreting geographic isolation as an independent dimension of electrification inequality — areas disconnected from urban infrastructure networks remain structurally underserved relative to what their population size would predict.
Limitations and Future Work
Limitations
Results are, overall, structural associations, not causal effects. The regime typology (constructed from population and lights) introduces potential endogeneity and should be treated as descriptive. Nighttime radiance is an imperfect proxy and can include non-residential sources (industrial sites, gas flares, road lighting), so lights should be interpreted as development/electrification-related activity rather than welfare. WorldPop estimates introduce measurement uncertainty, and residual clustering suggests omitted spatial processes (grid architecture, industrial composition, policy zones).
Ethics and Data Governance
Inputs are aggregated satellite imagery and gridded population surfaces that do not identify individuals, but outputs can still be sensitive if interpreted as performance or deprivation measures; results should be communicated cautiously without normative labeling of regions. If OpenStreetMap-derived roads are incorporated later, attribution and ODbL compliance should be explicitly acknowledged.
Future Work
Future research will incorporate exogenous road-density measures to better isolate infrastructure effects from typology-based spatial proxies. We also plan to address potential nonlinear dynamics in extreme urban cores through spline terms or quantile regression, as residual diagnostics suggest possible nonlinearity at high brightness levels. Future versions will also either harmonize tile resolution across countries or formally test elasticity stability under alternative spatial resolutions. Finally, extending the regime-specific elasticity estimates, reported here for the 2023 cross-section, across the full 2014–2023 panel will allow evaluation of whether they are stable over time or evolving alongside infrastructure expansion.
References and AI acknowledgment
GitHub repository: "namibarsbold/STATS201-Course-project" (folders include data, figures, scripts; notebooks, docs, README).
Datasets and tools:
- VIIRS nighttime lights annual composites (EOG "Annual VNL V1"; annual composites use "vcm" and screen ephemeral lights/background; documentation and processing notes).
- WorldPop population surfaces (methods and RF/dasymetric redistribution documentation; note that WorldPop products are model-based gridded estimates).
- Spatial cross-validation guidance for autocorrelated data (blocked CV recommended when dependence exists).
- Breusch–Pagan heteroskedasticity test implementation documentation (statsmodels het_breuschpagan) and robust covariance (get_robustcov_results).
- Metric definitions used in reporting (scikit-learn regression metrics for R² and MSE/RMSE).
References:
Dinkelman, T. (2011). The effects of rural electrification on employment: New evidence from South Africa. American Economic Review, 101(7), 3078–3108. https://doi.org/10.1257/aer.101.7.3078
Falchetta, G., Pachauri, S., Byers, E., Danylo, O., & Parkinson, S. C. (2020). Satellite observations reveal inequalities in the progress and effectiveness of recent electrification in Sub-Saharan Africa. One Earth, 2(4), 364–379. https://doi.org/10.1016/j.oneear.2020.03.007
Henderson, J. V., Storeygard, A., & Weil, D. N. (2012). Measuring economic growth from outer space. American Economic Review, 102(2), 994–1028. https://doi.org/10.1257/aer.102.2.994
Lessmann, C., & Seidel, A. (2017). Regional inequality, convergence, and its determinants – A view from outer space. European Economic Review, 92, 110–132. https://doi.org/10.1016/j.euroecorev.2016.11.009
Lipscomb, M., Mobarak, A. M., & Barham, T. (2013). Development effects of electrification: Evidence from the topographic placement of hydropower plants in Brazil. American Economic Journal: Applied Economics, 5(2), 200–231. https://doi.org/10.1257/app.5.2.200
AI Acknowledgment: Regressions were brainstormed and edited with assistance from ChatGPT for correctness. Project infrastructure (repository structure, GitHub Pages setup, and reproducibility documentation) was designed with assistance from Claude (Anthropic).