Machine learning estimates of eddy covariance carbon flux ... · Scrub is a xeric category of...

26
Biogeosciences, 18, 367–392, 2021 https://doi.org/10.5194/bg-18-367-2021 © Author(s) 2021. This work is distributed under the Creative Commons Attribution 4.0 License. Machine learning estimates of eddy covariance carbon flux in a scrub in the Mexican highland Aurelio Guevara-Escobar 1 , Enrique González-Sosa 2 , Mónica Cervantes-Jiménez 1 , Humberto Suzán-Azpiri 1 , Mónica Elisa Queijeiro-Bolaños 1 , Israel Carrillo-Ángeles 1 , and Víctor Hugo Cambrón-Sandoval 1 1 Facultad de Ciencias Naturales, Universidad Autónoma de Querétaro, Av. de las Ciencias s/n Juriquilla, CP. 76230, Querétaro, Querétaro, Mexico 2 Facultad de Ingeniería, Universidad Autónoma de Querétaro, Cerro de las Campanas s/n Las Campanas, CP. 76010 Querétaro, Querétaro, Mexico Correspondence: Mónica Cervantes-Jiménez ([email protected]) Received: 23 November 2019 – Discussion started: 10 January 2020 Revised: 22 September 2020 – Accepted: 15 November 2020 – Published: 18 January 2021 Abstract. Arid and semiarid ecosystems contain relatively high species diversity and are subject to intense use, in par- ticular extensive cattle grazing, which has favored the expan- sion and encroachment of perennial thorny shrubs into the grasslands, thus decreasing the value of the rangeland. How- ever, these environments have been shown to positively im- pact global carbon dynamics. Machine learning and remote sensing have enhanced our knowledge about carbon dynam- ics, but they need to be further developed and adapted to par- ticular analysis. We measured the net ecosystem exchange (NEE) of C with the eddy covariance (EC) method and es- timated gross primary production (GPP) in a thorny scrub at Bernal in Mexico. We tested the agreement between EC estimates and remotely sensed GPP estimates from the Mod- erate Resolution Imaging Spectroradiometer (MODIS), and also with two alternative modeling methods: ordinary-least- squares (OLS) regression and ensembles of machine learning algorithms (EMLs). The variables used as predictors were MODIS spectral bands, vegetation indices and products, and gridded environmental variables. The Bernal site was a car- bon sink even though it was overgrazed, the average NEE during 15 months of 2017 and 2018 was -0.78 g C m -2 d -1 , and the flux was negative or neutral during the measured months. The probability of agreement (θ s) represented the agreement between observed and estimated values of GPP across the range of measurement. According to the mean value of θ s, agreement was higher for the EML (0.6) fol- lowed by OLS (0.5) and then MODIS (0.24). This graphic metric was more informative than r 2 (0.98, 0.67, 0.58, re- spectively) to evaluate the model performance. This was par- ticularly true for MODIS because the maximum θ s of 4.3 was for measurements of 0.8 g C m -2 d -1 and then decreased steadily below 1 θ s for measurements above 6.5 g C m -2 d -1 for this scrub vegetation. In the case of EML and OLS, the θ s was stable across the range of measurement. We used an EML for the Ameriflux site US-SRM, which is similar in vegetation and climate, to predict GPP at Bernal, but θ s was low (0.16), indicating the local specificity of this model. Al- though cacti were an important component of the vegetation, the nighttime flux was characterized by positive NEE, sug- gesting that the photosynthetic dark-cycle flux of cacti was lower than ecosystem respiration. The discrepancy between MODIS and EC GPP estimates stresses the need to under- stand the limitations of both methods. 1 Introduction Deserts and semideserts occupy more than 30 % of terres- trial ecosystems. In Mexico, almost 2 × 10 6 km 2 (50 %) cor- respond to arid and semiarid ecosystems, mainly the Sono- ran and the Chihuahuan deserts (Verbist et al., 2010). The Spanish-Criollo intrusion (1540–1640) brought new land use methods, but there is no evidence of additional landscape degradation from the central highlands to the northeastern frontier of New Spain until well into the 18th century (Butzer and Butzer, 1997). At the country scale, the extent of grass- lands in Mexico declined, and the area of croplands and Published by Copernicus Publications on behalf of the European Geosciences Union.

Transcript of Machine learning estimates of eddy covariance carbon flux ... · Scrub is a xeric category of...

Page 1: Machine learning estimates of eddy covariance carbon flux ... · Scrub is a xeric category of shrublands characterized by plants with small leaves; it is very thorny, and its biomass

Biogeosciences 18 367ndash392 2021httpsdoiorg105194bg-18-367-2021copy Author(s) 2021 This work is distributed underthe Creative Commons Attribution 40 License

Machine learning estimates of eddy covariance carbon flux in ascrub in the Mexican highlandAurelio Guevara-Escobar1 Enrique Gonzaacutelez-Sosa2 Moacutenica Cervantes-Jimeacutenez1 Humberto Suzaacuten-Azpiri1Moacutenica Elisa Queijeiro-Bolantildeos1 Israel Carrillo-Aacutengeles1 and Viacutector Hugo Cambroacuten-Sandoval11Facultad de Ciencias Naturales Universidad Autoacutenoma de Quereacutetaro Av de las Ciencias sn JuriquillaCP 76230 Quereacutetaro Quereacutetaro Mexico2Facultad de Ingenieriacutea Universidad Autoacutenoma de Quereacutetaro Cerro de las Campanas sn Las CampanasCP 76010 Quereacutetaro Quereacutetaro Mexico

Correspondence Moacutenica Cervantes-Jimeacutenez (monicacervantesuaqmx)

Received 23 November 2019 ndash Discussion started 10 January 2020Revised 22 September 2020 ndash Accepted 15 November 2020 ndash Published 18 January 2021

Abstract Arid and semiarid ecosystems contain relativelyhigh species diversity and are subject to intense use in par-ticular extensive cattle grazing which has favored the expan-sion and encroachment of perennial thorny shrubs into thegrasslands thus decreasing the value of the rangeland How-ever these environments have been shown to positively im-pact global carbon dynamics Machine learning and remotesensing have enhanced our knowledge about carbon dynam-ics but they need to be further developed and adapted to par-ticular analysis We measured the net ecosystem exchange(NEE) of C with the eddy covariance (EC) method and es-timated gross primary production (GPP) in a thorny scrubat Bernal in Mexico We tested the agreement between ECestimates and remotely sensed GPP estimates from the Mod-erate Resolution Imaging Spectroradiometer (MODIS) andalso with two alternative modeling methods ordinary-least-squares (OLS) regression and ensembles of machine learningalgorithms (EMLs) The variables used as predictors wereMODIS spectral bands vegetation indices and products andgridded environmental variables The Bernal site was a car-bon sink even though it was overgrazed the average NEEduring 15 months of 2017 and 2018 was minus078 gCmminus2 dminus1and the flux was negative or neutral during the measuredmonths The probability of agreement (θs) represented theagreement between observed and estimated values of GPPacross the range of measurement According to the meanvalue of θs agreement was higher for the EML (06) fol-lowed by OLS (05) and then MODIS (024) This graphicmetric was more informative than r2 (098 067 058 re-

spectively) to evaluate the model performance This was par-ticularly true for MODIS because the maximum θs of 43was for measurements of 08 gCmminus2 dminus1 and then decreasedsteadily below 1 θs for measurements above 65 gCmminus2 dminus1

for this scrub vegetation In the case of EML and OLS theθs was stable across the range of measurement We used anEML for the Ameriflux site US-SRM which is similar invegetation and climate to predict GPP at Bernal but θs waslow (016) indicating the local specificity of this model Al-though cacti were an important component of the vegetationthe nighttime flux was characterized by positive NEE sug-gesting that the photosynthetic dark-cycle flux of cacti waslower than ecosystem respiration The discrepancy betweenMODIS and EC GPP estimates stresses the need to under-stand the limitations of both methods

1 Introduction

Deserts and semideserts occupy more than 30 of terres-trial ecosystems In Mexico almost 2times 106 km2 (50 ) cor-respond to arid and semiarid ecosystems mainly the Sono-ran and the Chihuahuan deserts (Verbist et al 2010) TheSpanish-Criollo intrusion (1540ndash1640) brought new land usemethods but there is no evidence of additional landscapedegradation from the central highlands to the northeasternfrontier of New Spain until well into the 18th century (Butzerand Butzer 1997) At the country scale the extent of grass-lands in Mexico declined and the area of croplands and

Published by Copernicus Publications on behalf of the European Geosciences Union

368 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

woody areas increased ruralndashurban migration being an im-portant driver of that transition (Bonilla-Moheno and Aide2020) The transition from grasslands to shrublands or scrubis linked to the extremely heavy grazing by domestic live-stock (Wilcox et al 2018)

Vegetation in the arid and semiarid ecosystems are mostlyclassified as rangelands These are one of the most widelydistributed landscapes on earth incorporating a wide rangeof communities including grasslands shrublands and savan-nah Scrub is a xeric category of shrublands characterized byplants with small leaves it is very thorny and its biomass isdistributed mainly to roots and leaves rather than the stems(Rzedowski 1978 Wheeler et al 2007 Zhang et al 2017)Studies need to consider multiple sources of evidence andthe driving processes of land use change in Mexico to aidin policy formulation and to identify regions that may pro-vide important ecosystem services (Murray-Tortarolo et al2016)

On the other hand photosynthesis contributes to carbonsequestration by moving carbon stock from the atmosphereto other pools or sinks such as aboveground biomass rootsand soil organic matter (Booker et al 2013) The role of veg-etation in carbon sequestration on arid and semiarid ecosys-tems is less evident because the growth rate is low andbiomass partition above and below ground is different fromthat of temperate and tropical forests Competitive interac-tions of arid plants at the community level are strongly influ-enced by rooting architecture and phenological growth (Zenget al 2008) Many plants in semiarid systems support a deepand wide root system as a drought adaptation but also fornutrient uptake (McCulley et al 2004)

Recent time trends indicate that semiarid ecosystems reg-ulate the terrestrial carbon sink and dominate its interan-nual variability (Piao et al 2019 Scott et al 2015 Zhanget al 2020) This variability mainly results from the imbal-ance between two larger biogenic fluxes that constitute thenet ecosystem exchange (NEE) the photosynthetic uptake ofCO2 (gross primary production GPP) and the respiratory re-lease of CO2 (total ecosystem respiration Reco) Radiationand water availability are important environmental drivers ofNEE and thus GPP and Reco (Marcolla et al 2017) How-ever other carbon fluxes contribute to the imbalance suchas fire and anthropogenic CO2 emissions (Jaumlrvi et al 2019Piao et al 2019) Another atmospheric CO2 flux is that fromsoil inorganic carbon in arid and semiarid ecosystems (Soperet al 2017) Calcium carbonates form in the soil at a rel-atively low rate of 5 to 150 kgChaminus1 yrminus1 this carbon canreturn to the atmosphere but it forms a carbon sink when car-bonates are leached into the groundwater (Lal et al 2004)

The methods used to explore the ecosystems and the un-derstanding of their functioning are changing rapidly par-ticularly for arid and semiarid ecosystems (Goldstein et al2020 Ma et al 2020 Xiao et al 2019 Yao et al 2020)There are many instruments and techniques for estimatingcarbon and water fluxes but two stand out in the literature

eddy covariance (EC) and remote-sensing techniques ECis a micrometeorological method that measures the ecosys-tem community NEE at short time intervals representinga land surface smaller than 1 km2 The orbital remote sen-sors measure radiation emitted or reflected by the earthrsquos sur-face using different algorithms representing different traitsof vegetation activity from large-scale areas Both techniquesare complementary but an agreement between their esti-mates is important for regional countrywide and global spa-tiotemporal monitoring of greenhouse gas inventories (Yonaet al 2020) ecological modeling quantifying the interactionamong the vegetation component and the hydrological com-ponent energy and nutrient cycles and others applications(Pasetto et al 2018) Particularly products from the Mod-erate Resolution Imaging Spectroradiometer (MODIS) haveample availability and have been extensively used to studyland surface since 2000

Gross primary production can be represented by a widerange of models ranging in complexity from simple regres-sion based on climatic forcing variables to complex mod-els that simulate biophysical and ecophysiological processes(Anav et al 2015) The MODIS MOD17 product uses aphotosynthetic radiation conversion efficiency model (Run-ning and Zhao 2015) but a better relationship is reportedwith EC-derived GPP when the model uses vegetation in-dices calculated from the same MODIS platform (Ma et al2014 Wu et al 2010) Although they are black box modelsin principle recent modeling efforts report good agreementof GPP estimates obtained from machine learning (ML) algo-rithms or ensembles of models (Eshel et al 2019 Joiner andYoshida 2020 Jung et al 2020) Different machine learningalgorithms are powerful because they can identify trends andpatterns in big datasets and solve regression or classificationproblems

To generate models of GPP we measured EC fluxes dur-ing 2017ndash2018 in a thorny scrub with semiarid climate inthe highlands of Mexico (Bernal site) Competing mod-els were data-driven machine learning regression ensembles(EMLs) and ordinary-least-squares (OLS) regression bothusing Daymet (Thornton et al 2017) and MODIS datasets asexplanatory variables The MODIS GPP product was used asa baseline comparison The second step was to use an EMLmodel based on local data (Daymet and MODIS) from a sitewith EC instrumentation and similar vegetation to that of theBernalrsquos site and then use that model to predict GPP at theBernal site The site we used was Santa Rita from the Ameri-flux network While Santa Rita is in the Sonoran Desert andBernal is on the southern border of the Chihuahuan Desertboth have a similar climate and vegetation (Fig 1) A goodagreement between Bernal EC data and the predictions fromthe Santa Rita model would support the use of machine learn-ing algorithms as a scale-up mechanism This would be use-ful to the understanding of semiarid ecosystems and also im-prove current earth system models (Piao et al 2019)

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 369

We measured the carbon flux for the vegetation in a semi-arid site located at Bernal Mexico The phenology patternsin the region suggested that this site could be a carbon sinkduring the wet season or a carbon source during the dryseason since some predominant species reproduce duringwinter and spring particularly cacti Acacia and Prosopis(mesquite) Furthermore the Bernal site had a history of dis-turbance by overgrazing this could decrease the GPP andeven result in a positive carbon balance thus being a car-bon source If the shrub vegetation in this site predomi-nantly absorbed carbon during the measuring period thenthis evidence would contribute to reinforce the reported im-portance of semiarid environments in the global carbon bal-ance (Zhang et al 2020) However land ownership patternsand the balance between agricultural investments and landconservation will determine the absolute amount of carbonsequestered Hopefully the results of the present investiga-tion would promote the idea that carbon sequestration is pos-sible in scrubland and be incorporated in informed decisionsand new policy

2 Materials and methods

21 Site description

The study site (Bernal) is located at 20717 N 99941Wand 2050 masl in the municipality of Ezequiel Montes inQuereacutetaro where real-estate development feedlot beef pro-duction cheese and wine production associated with tourismand automotive-industry development are very attractive op-tions for landowners in the region Bernal is located in a shal-low valley oriented from north to south approximately 15 to20 km wide and opening to the south to the Riacuteo Lerma basinand then draining into the Pacific Ocean The northern limitof the valley is surrounded by hill country and its characteris-tic dacitic dome 433 m in height (Aguirre-Diacuteaz et al 2013)Moisture-laden winds blow westward from the Gulf of Mex-ico but the Sierra Gorda located 60 km east of Bernal castsa rain shadow over the area (Segerstrom 1961)

The Bernal site is private property with grazing dairy cat-tle receiving additional concentrated feedstuffs under stall-feeding Grazing was continuous and water for livestock wasonly available in the feeding and milking area there were nopasture divisions and the perimeter fence was made of stoneThese characteristics of the animal production model and thestate of vegetation are representative of land managementpractices and the scrub vegetation of the region Howeverthe Bernal site suffered important changes in land use during2019 and the scrub was suddenly cleared and converted intorainfed cropping

The climate is arid with summer rains (BSk) with a meanannual rainfall of 476 mm and a mean annual temperature of171 C (CICESE 2015) Prevailing wind is from the eastand northeast The terrain is mostly flat most grades are

below 2 The soil has a clay loam texture the class is aVertisol with abundant subrounded basaltic stones withoutrocky outcrops and the depth is greater than 06 m Vegeta-tion was less than 3 m in height with an overgrazed herba-ceous stratum Vegetation corresponds to secondary scrubwith the dominant genera Acacia Prosopis and differentCacti (Fig 3) This site was classified as grassland by theMODIS land cover product

For the scrub and tree species the importance vegetationindex (IVI) was determined following Curtis and McIntosh(1950) to assess the vegetation homogeneity The IVI is thesum of relative dominance relative density and relative fre-quency of the species present Vegetation sample points werechosen according to the flux footprint of the eddy covari-ance tower (Fig 2) For each plant in the vegetation samplepoints two stem diameters the number of individuals (abun-dance) and identity of each species were measured as well asthe coverage which is the horizontal projection of the aerialparts of the individuals on the ground expressed as a per-centage of the total area (Wilson 2011)

22 Eddy covariance measurements

The micrometeorological EC technique measures at the plantcommunity level NEE in a nondestructive way and con-tinuously over time (Baldocchi 2014) The negative CO2fluxes corresponded to NEE which is equivalent to NEP (netecosystem production) but with opposite sign The EC hasadvantages compared to other techniques that need to scaleup measurements from the leaf plant or soil levels up toecosystems especially when the vegetation is heterogeneous(Yepez et al 2003) However EC is an expensive techniqueand data analysis and processing are complicated also spe-cific assumptions must be met regarding the terrain vege-tation and micrometeorological conditions among other as-pects (Richardson et al 2019)

The fluxes were measured with the EC technique at aheight of 6 m with the following instruments a Biomet sys-tem (LI-COR Biosciences USA) to measure H2O and CO2fluxes using an IRGASON-EC-150 open-circuit analyzer aCSAT3 sonic anemometer and a KH20 krypton hygrome-ter these were connected to a CR3000 data logger (Camp-bell Scientific Inc Logan UT USA) The relative humid-ity and air temperature were measured with an HMP155Aprobe (Vaisala Corporation Helsinki Finland) net radiationwas measured with an NR-Lite2 radiometer (Kipp and Zo-nen BV Delft the Netherlands) and the photosynthetic ac-tive radiation (PAR) was measured with a quantum sensor(SKP215 Skye Instruments Llandrindod Wells UK) Mea-surements of the soil heat flux was implemented with fourself-calibrating HFP01SC plates at 80 mm depth and in fourrepresentative positions of the landscape (Hukseflux Ther-mal Sensors BV Delft The Netherlands) Three time do-main reflectometry (TDR) probes (CS616) measured volu-metric water content in the ground installed vertically and

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

370 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 1 Localization and land use maps for the study site (a) biogeographic arid and semiarid zones of southern North America relevantfor the Bernal and Santa Rita sites the black outline represents the state limit for Quereacutetaro (b) Heterogeneity of the normalized differencevegetation index (NDVI) surrounding the EC tower at Bernal during the peak of the growing season 2017 (DOY 257) (c) Land cover in theregion of the Bernal site according to the annual University of Maryland (UMD) classification (MCD12 MODIS product)

two sets of TCAV (averaging soil thermocouple) probes mea-sured the temperature at 60 and 40 mm depths and abovethe HFP01SC plates (Campbell Scientific Inc Logan UTUSA) The TE525 (Texas Electronics Dallas TX USA)tipping-bucket rain gauge was installed at 12 m height and3 m away from the tower All these meteorological variableswere measured every 5 s and average values were stored ev-ery 30 min rainfall was accumulated for the same time inter-val Sensible (H ) and latent (λE) heat fluxes were calculatedby the EddyPro package (LI-COR Biosciences USA)

23 Flux data processing

All EC data were collected at 10 MHz in the data loggerand reported as micromoles of CO2 per square meter persecond and processed with the EddyPro package to con-vert values into average fluxes of 30 min intervals Only

quality-flagged records were used to account for the CO2flux (qc_co2_flux= 0) according to the Mauder and Foken(2011) policy in the EddyPro program (LI-COR 2019)

However this quality-checking is not sufficient especiallyin the case of CO2 therefore data were postprocessed usingthe REddyProc package of R (R Development Core Team2009) to estimate the friction speed thresholds (ulowast) gap-filldata and partition the NEE flux into its GPP and Reco com-ponents (Wutzler et al 2018) The filled-in estimates of NEE(NEE_uStar_f) GPP (GPP_uStar_f) and Reco (Reco_uStar)were used when the ulowast was lower than a ulowast annual thresholdabove which nighttime fluxes are considered valid The an-nual ulowast threshold was 0193 and 0194 for 2017 and 2018The difference between these thresholds and the 95 ulowast

threshold was small (0033 msminus1) Appendix A presents thethreshold means and confidence intervals calculated in theREddyProc package Only 1050 half-hour records (93 )

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 371

Figure 2 Source strength-weighted function at the Bernal site during 2017 and 2018 (gray points) Horizontally plots show the peak ofthe function and increasing percentages of the flux footprint The radial scale is along-wind distance from the sensor (m) The measurementheight was 6 m The blue points represent the vegetation sample plots used to calculate the importance vegetation index (IVI)

Figure 3 Thorny scrub at Bernal Quereacutetaro during the rainy season 2017 In the foreground is Cylindropuntia imbricata a very thornycactus shrubs in the background are Prosopis laevigata mesquites

had a ulowast below the annual mean ulowast threshold The data with aflag equal to 0 were used for the variable NEE_uStar_fqc asdefined by REddyProc Carbon dioxide flux data were time-integrated and converted to grams of C per square meter perday using the molar ratio of C We only reported the contin-uous measurements of the exchange of CO2 for the periodof April 2017 (DOY 89) to August 2018 (DOY 234) usingthe EC technique Due to equipment malfunction and incom-plete datasets some periods of time were not considered Themeasurement campaign presented here was not biased by wet

winters since both years were characterized by a less-than-weak NintildeondashNintildea

24 Remotely sensed data

Data were requested to the National Aeronautics andSpace Administration (NASA) via the Land Processes Dis-tributed Active Archive Center (DAAC) using the Ap-plication for Extracting and Exploring Analysis ReadySamples (AppEEARS) to obtain spatial and temporal

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

372 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

subsets for the Bernal and Santa Rita sites includ-ing daily surface reflectance (MOD09GA006 and MOD-OCGA006) daily daytime and nighttime land surface tem-perature (LST) (MOD11A2006 and MYD11A2006) 8 dleaf area index (LAI) and fraction of photosynthetically ac-tive radiation (FPAR) (MOD15A2H006 MYD15A2H006MCD15A2H006) 16 d enhanced vegetation index (EVI)(MYD13A1006) 16 d gross primary production (GPP)and net photosynthesis (PsnNet) (MOD17A2H006) TheAppEEARS also unpacks and interprets the quality layersAppendix B presents the details of each spectral band ofMODIS Data with less-than-good quality flags were deletedMissing data were filled with splines and a database witha 1 d time step was generated This would smooth lin-ear temporalndashphenological evolution between any two suc-cessive remotely sensed data points (Eshel et al 2019)Daily accumulated rainfall was requested using the Giovanniplatform from the Goddard Space Flight Center (GSFC)DAAC two products were used the 3IMERGDF006 fromthe Global Precipitation Measurement mission (GPM) andthe 3B42007 from the Tropical Rainfall Measuring mis-sion (TRMM) Gridded weather parameters from the ORNLDAAC Daymet dataset were precipitation shortwave radia-tion maximum and minimum air temperature and water va-por pressure Daymet is a data product derived from a collec-tion of algorithms interpolating and extrapolating daily me-teorological observations (Thornton et al 2017) FollowingHenrich et al (2012) and Hill et al (2006) daily reflectancebands of MODIS were used to compute several vegetation in-dices redndashgreen ratio index (RGRI) simple ratio (SimpleR)moisture stress (MoistS) disease stress index (DSI) greenatmospherically resistant vegetation index (GARI) normal-ized difference vegetation index (NDVI) normalized differ-ence water index (NDVI_w) and enhanced vegetation in-dex (EVI) the corresponding equations are presented in Ap-pendix B

25 MODIS algorithm for GPP

Estimates of GPP are derived from data recorded by theMODIS sensor aboard the Terra and Aqua satellites The ef-ficiency (ε gCMJminus1) with which vegetation produces drymatter is defined as the amount of solar energy stored by pho-tosynthesis in a given period divided by the solar constant in-tegrated over the same period (Monteith 1972) Not all inci-dent solar radiation is available for biomass conversion onlyabout 48 is photosynthetically active (PAR MJmminus2) andnot all PAR is absorbed (Zhu et al 2008) Thus carbon ex-change is mainly controlled by the amount of PAR absorbedby green vegetation (APAR) and modified by ε (Gitelson etal 2015) The fraction of absorbed PAR (FPAR) is equalto APARPAR but can be represented by the NDVI spectralvegetation index produced by MODIS (Running 2004) Theefficiency term ε is described as the product of different fac-tors as a whole or as a part of the system (Monteith 1972) but

mostly those related to the efficiencies with which the vegeta-tion intercepts the radiation and the efficiency of convertingthe intercepted radiation into biomass (Long et al 2015)The MODIS algorithm that estimates GPP in the MOD17product is (Running et al 2004)

GPP= εtimesFPARtimesPAR (1)GPP= εtimesNDVItimesPAR (2)

The ε term in the MODIS algorithm is represented by amaximum radiation conversion efficiency (εmax kgCMJminus1)that is attenuated by suboptimal climatic conditions mainlyminimum air temperature (Tmin) and vapor pressure deficit(VPD) Two parameters for each Tmin and VPD are usedto define attenuation scalars for general biome types Theseparameters form linear functions between the scalars (Run-ning and Zhao 2015 Wang et al 2013) daily minimumtemperature at which ε = εmax and at which ε = 0 and thedaylight average VPD at which ε = εmax and at which ε =0 GPP is truncated on days when air temperature is be-low 0 C or VPD is higher than 2000 Pa (Running andZhao 2015) Stress and nutrient constraints on vegetationgrowth are quantified by the limiting relation of leaf areain NDVItimesPAR rather than constrained through ε (Runninget al 2004) However the MODIS algorithm does not con-sider stomatal sensitivity to leaf-to-air vapor pressure acrossand within species and particularly between isohydric andanisohydric plant species (Grossiord et al 2020)

ε = εmaxtimes Tmin scalartimesVPD_scalar (3)

The MOD17 userrsquos guide presents a biome-property look-up table (BPLUT) with the parameters for each biome typeand assumes that they do not vary with space or time (Run-ning and Zhao 2015) This aspect is important because εmaxhas the strongest impact on the predicted GPP of the MOD17algorithm (Wang et al 2013) The assumption is also impor-tant because the overstory and understory could be decou-pled from each other and would intercept different amountsof light and have different water sources during the grow-ing season (Scott et al 2003) Light quantity and quality asdiffuse light or sunflecks determine differences among un-derstory species in their temporal response to gaps involvingacclimation and avoidance of photoinhibition (Pearcy 2007)Another shortcoming is that few land cover classifications areincorporated into the MOD17 algorithm

26 Santa Rita site dataset

The Santa Rita Experimental Range (SRER) is located inthe western range of the Santa Rita Mountains in ArizonaUSA (lat 318214 longminus1108661 1120 masl) Climate isBSk with mean annual precipitation of 380 mm temperatureof 179 C and Ustic Torrifluvent soils Established in 1903SRER has a long history of experimental manipulations toenhance grazing potential for cattle (Glenn et al 2015) Two

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 373

Figure 4 Semidesert grassland encroached by mesquite (Prosopisvelutina) at Santa Rita Arizona (US-SRM) Image credit RussellScott 9 December 2016

Ameriflux sites are located in the SRER Santa Rita Grass-land (US-SRG) and Santa Rita Mezquite (US-SRM) Weused EC data for the years 2013ndash2019 from US-SRM whichis a mesquite grass savanna (35 mesquite canopy coverand mean canopy height above 2 m 22 grasses and 43 bare soil) although MODIS describes this site as open shrub-lands (Glenn et al 2015 Scott et al 2004) The US-SRMsite is dominated by velvet mesquite (Prosopis velutina) andhas a diversity of shrubs cacti succulents and bunch grasses(McClaran 2003) This site was chosen because the vegeta-tion and climate are similar to the Bernal site and it was theclosest EC instrumentation with data availability (Fig 4)

27 Modeling

Gross primary production estimated by EC at the Bernal sitewas modeled using OLS and EML The explanatory vari-ables were the remotely sensed data the weather parame-ters and the vegetation indices (Appendix B) The OLS isa particular case of the generalized linear model where thevariation of a single response variable is explained by severalindependent variables The OLS was fitted with the stepwiseprocedure the final model included variables with a variance

inflation factor (VIF) lower than 10 and a significance levelof 005 Predictions of the EC GPP were obtained with the fi-nal model Analysis and diagnostics were made with Minitabv 17 (Minitab LLC) Analysis of the OLS model can be usedto determine agreement between methods of measurementbut it is sensitive to the range of values in the dataset andits metrics ndash r r2 and root mean square ndash do not provideinformation on the type of association (Bland and Altman2010) In this paper we used another metric the probabil-ity of agreement to determine bias and agreement betweenmodel estimates and observed data (see below)

While OLS is a well-known algorithm machine learningalgorithms are emerging techniques that focus on the datastructure and match that data onto models The EML ap-proach considers the different realizations of machine learn-ing models and constructs an ensemble of models com-ing with the advantage of being more accurate than thepredictions from the individual ensemble members How-ever EML is computationally intensive requiring nodeswith purpose-built hardware such as multiple processors orreduced-precision accelerators The nodes could be aggre-gated in computing clusters which require storage powerand cooling redundancy

A stack of EMLs was obtained with the H2O packageof R (Hall et al 2019) This package provides several al-gorithms that can contribute to a stack of ensembles usingthe AutoML (automatic machine learning) function feedfor-ward artificial neural network (DL) general linear models(GLMs) gradient-boosting machine (GBM) extreme gra-dient boosting (XGBoost) default distributed random for-est (DRF) and extremely randomized trees (XRT) AutoMLtrains two stacked-ensemble models one ensemble containsall the models and the second ensemble contains just thebest-performing model from each algorithm class or fam-ily both ensembles should produce better models than anyindividual model from the AutoML run The term AutoMLimplies data preprocessing normalization feature engineer-ing model selection hyperparameter optimization and pre-diction analysis including procedures to identify and dealwith nonindependent and identically distributed observationsand overfitting (Michailidis 2018 Truong et al 2019)

Machine learning has two elements for supervised learn-ing training loss and regularization The task of training triesto find the best parameters for the model while minimizingthe training loss function the mean squared error for exam-ple The regularization term controls the complexity of themodel helping to reduce overfitting Overfitting becomes ap-parent when the model performs accurately during the train-ing but the accuracy is low during the testing A good modelneeds extensive parameter tuning by running the algorithmmany times to explore the effect on regularization and cross-validation accuracy (Mitchell and Frank 2017) In this inves-tigation the function of training loss was the deviance whichis a generalization of the residual sum of squares driven bythe likelihood Deviance is a measure of model fit and lower

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

374 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

or negative values indicate better model performance (McEl-reath 2020)

A stack of EML solutions was based on a random sam-ple of the dataset for training the model For the Bernal site85 was used for training for US-SRM 80 was usedThe AutoML function was run 20 times each run added ap-proximately 48 models to the leader board and ranked thebest-performing models by their deviance Each run splitsthe training data 10 times for k-fold cross-validation Theseed for an EML that is dependent on randomization waschanged in every run The stopping rule for each run wasset at 100 s and the maximum memory allocation pool forH2O was 100 GB in a single workstation with dual Xeon2680 v4 processors and 128 GB of RAM The H2O packagewas installed in a rockergeospatial docker container (imageavailable at httpshubdockercomrrockergeospatial lastaccess 9 January 2021) which is a portable scalable and re-producible environment (Boettiger and Eddelbuettel 2017)

Two sets of predictions for the GPP at the Bernal site wereobtained from the stacked ensemble The first set of predic-tions was based on the 15 of the Bernal site data reservedfor testing The second set of predictions was obtained byrefeeding the US-SRM site model with the Bernal site ex-planatory variables The first set of predictions would showthe importance of local data to predict EC-based GPP Thesecond set of predictions would represent the suitability ofoff-site data to predict EC-based GPP If the second scenariohas good agreement then an EML model could be used torepresent wider areas of the ecosystem

28 Variable importance

The variable importance within individual models was usedto answer the question of which environmental variableswere important for GPP prediction For the ldquoall-models en-semblerdquo and ldquobest-of-family ensemblerdquo generated by Au-toML it is not possible to examine the variable importance orthe contribution of the individual models to the stack (H2Oai2017) Therefore a weight (wi)was calculated using Eq (4)which is adequate for other information criteria besides theAkaike weights this weight is an estimate of the conditionalprobability that the model will make the best predictions onnew data considering the set of models (McElreath 2020)Then the importance of each variable () was multipliedby the modelrsquos weight (wi) and then added by variable tobuild the variable importance index This index would mea-sure how often a given variable was used on the leader board

wi =exp

(minus

12 dWAICi

)summj=1 exp

(minus

12 dWAICi

) (4)

dWAIC= deviancei minus deviance of top-performing model(5)

29 Model agreement

Calibration and agreement between methods of measure-ment are different procedures Calibration compares knownquantities of the true value or measurements made bya highly accurate method (a gold standard) against themeasurements of a new or a contending method When twomethods of measurement are compared neither providesan unequivocally correct measurement because both havea measurement error and the true value remains unknown(Bland and Altman 2010) Stevens et al (2015) proposes theprobability of agreement (θs) as a plot metric to representthe agreement between two measurement systems acrossa range of plausible values The θs method addressessome of the challenges of the accepted ldquolimits agreementmethodrdquo presented by Bland and Altman (2010) Besides theagreement plot agreement is based on maximum likelihoodbias parameters α and β quantifying the fixed bias andthe proportional bias If α = 0 β = 1 and σ1= σ2 thenthe two measurements are identical σj is the measurementvariation The probability-of-agreement analysis was per-formed using the ProbAgreeAnalysis (httpsuwaterloocabusiness-and-industrial-statistics-research-groupsoftwarelast access 28 November 2020) in MATLAB 94 (Math-Works Inc) An arbitrary 1 gCmminus2 dminus1 was considered tobe a tolerable magnitude to conclude that agreement is suf-ficient enough to use either estimated GPP interchangeablyThe reference measurement was the GPP obtained fromEC data at the Bernal site and tested against the MODISMOD17 model the OLS predictions or each of the two setsof EML predictions If the probability-of-agreement plotsuggested disagreement between two measurement systemsthen the predictions can be adjusted using

Adjusted predictor= (predictorminusα)β (6)

3 Results

31 Eddy covariance fluxes at Bernal

Dominant flow at the Bernal site was from the northeast(Fig 2) Energy balance closure for this site had a slopeof 072 and r2

= 092 (Fig 5) Homogeneous sites of theFluxnet network obtain closure percentages higher than72 and for the Bernal site the vegetation heterogeneitywas important (see below) Average H was always nega-tive during nighttime but during some months of the dryand rainy season λE was positive particularly after dawn(Fig 6) so as to allow nighttime evaporation from the soil orvegetation However during some months of the dry seasonrainfall was low (Fig 7) then the positive λE suggested thatcacti could have an active gas exchange at that time

Carbon dioxide absorptions had a diurnal behavior begin-ning at dawn and ending before sunset (Fig 8) Nighttimeflux was positive indicating respiration notwithstanding the

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 375

Figure 5 Closure of the surface energy balance from eddy covari-ance measurements averaged at 30 min between the turbulent fluxes(H + λE) and available fluxes (Rn-G) Data are from 30 March2017 to 22 August 2018 at the Bernal site The regression wasy = 2302+072times with adjusted r2

= 092 The diagonal line rep-resents the 1 1 relation

presence of cacti Although summer rains are characteristicof the climate at the Bernal site (Fig 7) a negative NEE fluxoccurred in all measured months The lowest CO2 flux wasrecorded in January and February 2017 and in May 2018 (Ta-ble 1) This behavior resulted from the phenology of the veg-etation since most species lost their leaves in the dry seasonand was also due to the effect of low temperature Withinthe rainy season the flux of CO2 increased compared to themonths of January to June The correlation between NEE andprecipitation was minus045 When the sum of the precipitationof the current month and that of the previous month was con-sidered the correlation with NEE was minus07 suggesting thatcontinuous availability of soil moisture is important for theabsorption of CO2 in this environment

The scrub at Bernal was heterogeneous in botanical com-position A total of 24 species of cacti and shrub wereidentified on average each sampling plot had 103 speciesThe IVI was similar between all cacti (036plusmn 004) shrublegumes (038plusmn004) and other shrubs (023plusmn006) sampledMost sampling plots were in areas of high flux frequency(Fig 2) Cylindropuntia imbricata had the largest IVI fol-lowed by Acacia farnesiana Acacia schaffneri and Prosopislaevigata The IVI of the herbaceous stratum representedby grasses was not characterized due to the state of over-grazing and the absence of reproductive structures in plantswhich made measurement of their abundances frequenciesand dominance difficult The grass genera present were Meli-nis Chloris Cynodon and Cenchrus all corresponding to in-

Table 1 Daily average values of the net ecosystem exchange(NEE) gross primary productivity (GPP) and ecosystem respira-tion (Reco) in a scrub at the Bernal site Negative values of NEEindicate photosynthetic absorption

NEE GPP RecomicromolCO2 mminus2 sminus1

2017

JANFEBMARAPR minus054 248 194MAY 005 286 291JUN 038 421 459JUL minus100 178 077AUGSEPOCT minus126 525 399NOV minus013 265 252DEC 004 180 184

2018

JAN minus005 129 124FEB 006 188 194MAR minus094 245 151APR minus058 287 229MAY minus029 377 348JUN minus252 433 181JUL minus283 823 541AUG minus193 921 728

vasive C4 tropical grasses Scrub species of higher IVI hada similar LAI (12) although the magnitude of the LAI of Plaevigata stood out (Table 2)

32 Machine learning ensembles as predictors of eddycovariance GPP

In this section we describe the modeling with EML using lo-cal remotely sensed data from the Bernal site to predict GPPat the same site and then the agreement between EML GPPpredictions and EC-derived GPP The AutoML function gen-erated 1031 models with an average deviance of 135 whilethe deviance of the leader model was 063 in the trainingdataset (Table 3) A total of 11 models of type GBM andfive XGBoost models were in the top 30 models along withthe nine best-of-family ensembles and five all-models ensem-bles The weighted variable importance on the leader boardwas higher for the LAI from the MOD15 and MCD15 prod-ucts (17 and 14 ) The PsnNet EVI (MOD17) FPAR(MCD15) green atmospherically resistant vegetation index(GARI) and MODIS reflectance band 13 had an importancehigher than 3 (59 54 42 36 and 30 ) LAI(MCD13 and MOD13) PsnNet and the FPAR (MOD15)were the more important variables (20 17 13 and10 ) in the top nonstacked model a GBM model that wasranked in fourth place

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

376 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 6 Latent heat flux daily trend at Bernal during different months emphasizing nighttime λE Panel (a) shows months with low rainfallin the previous month and predominantly negative λE during nighttime (these months had low rainfall 017 113 and 033 mm rainfall forJanuary March and December) Panel (b) shows months with low rainfall in the previous month and positive λE after sunset (67 73 and207 mm rainfall for February April and May) Panel (c) shows months during the rainy season with positive λE mainly due to soil wetnessand antecedent rainfall of 43 202 9 190 and 34 mm for June July August October and November Plot (c) is out of scale on the y axis forcompatibility with the other plots

Table 2 Importance value index (IVI) and leaf area index (LAI) of the main species present at the Bernal Quereacutetaro study site

Species Plant type IVI SEMa LAI SEM

Coryphantha cornifera Cactus 007 027Bouvardia ternifolia Herb 007 027Karwinskia humboldtiana Shrub 007 027Forestiera phillyreoides Shrub 009 027Ferocactus latispinus Cactus 009 027Cylindropuntia leptocaulis Cactus 009 027Asphodelus fistulosus Shrub 009 019Brickellia veronicifolia Shrub 010 027Dalea lutea Shrub 011 015Eysenhardtia polystachya Legume 013 015Myrtillocactus geometrizans Cactus 014 027Schinus molle Shrub 014 019Jatropha dioica Herb 015 019Mammillaria uncinata Cactus 016 012Opuntia tomentosa Cactus 017 011Opuntia robusta Cactus 023 007Opuntia hyptiacantha Cactus 026 007Mimosa monancistra Legume 028 012Mimosa depauperata Legume 031 012Zaluzania augusta Shrub 033 010Viguiera linearis Herb 036 011Acacia schaffneri Legume 041 007 113 015Prosopis laevigata Legume 041 007 148 012Acacia farnesiana Legume 056 009 112 037Cylindropuntia imbricata Cactus 074 007 113 011

a SEM standard error of the mean

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 377

Figure 7 Monthly rainfall and land surface temperature (LST) dur-ing years (a) 2017 and (b) 2018 at the Bernal site The LST valuescorrespond to the 1330 (LST day) and 0130 (LST night) MODISAqua satellite overpasses

Predictions of GPP in the testing dataset showed disper-sion in the lower range of the scale of measurement andthe correlation was 094 (Fig 9a) The final prediction ofGPP for the whole dataset had a probability of agreement(θs) of 058plusmn 001 (parameter estimate and standard error)α =minus00616plusmn 011 and β = 10133plusmn 002 suggesting agood fit with low fixed and proportional bias (Fig 9b) Theprobability of agreement decreased slightly at the lower andupper range of the scale of measurement (Fig 9c) indicatingthat the EML model would predict GPP without increasingthe bias particularly in the range from 0 to 4 gCmminus2 dminus1However the value of θs should be higher than 095 so as toconsider EC measurements and EML to be interchangeableThe correlation of 099 (r2

= 098) for the data in Fig 9bcould be misleading as it would suggest a very good fit

Using only the five of the more important variables namedabove to generate an EML resulted in an XBoost leadermodel with 273 deviance and a total of 1094 models Thetop 30 models were 16 XBoost and GBM models and the

best-of-family ensembles started to show up in 12th placeAlthough the number of runs was the same (20) the AutoMLfunction increased the number of produced models but thesmaller set of explanatory variables constrained the ability toidentify features contributing to better models Using anotherset of five randomly selected explanatory variables (one veg-etation index and four MODIS bands) resulted in a leadermodel with 252 deviance out of 1024 models but this timethe leader was the best-of-the-family ensemble Using only afew variables was considered to increase the deviance com-pared to the average deviance of 135 obtained during thetraining phase and using all available variables

An important question for modeling upscaling is the ca-pacity to extrapolate results temporally and spatially herewe explored the latter posing the following question wouldpredictions of GPP from EML for an EC site agree withEC observations from another site with ldquosimilarrdquo environ-mental conditions First an EML solution was found train-ing 80 of the Santa Rita dataset and obtaining a best-of-family ensemble with 023 deviance out of 634 trainedmodels (hereafter this model is referred to as the Santa Ritamodel) Then the environmental and remotely sensed datafrom the Bernal site were fed into the Santa Rita model thiswould be an external validation dataset However agreementwas not good the mean value of θs was 015plusmn 001 withα =minus10822plusmn 009 and β = 058127plusmn 002 The value ofθs was not constant across the range of measurement and de-creased rapidly after 2 gCmminus2 dminus1 (Fig 10b) Because thebias was important predictions were adjusted using Eq (6)showing some improvement with r = 078 (Fig 10a) Com-paring Figs 5b and 6a it is evident that an EML modelextrapolation to other conditions is noisier ie Santa Ritamodel trying to represent the ecosystem function at BernalNotwithstanding some of the most important variables wereshared by both EML ensembles Bernal and Santa Rita in thecase of Santa Rita LAI from MOD15 MYD15 and MCD15had 350 48 and 31 of variable importance and theFPAR from MOD15 was 12

33 MODIS as predictor of eddy covariance GPP

MODIS is important because it overpasses every point ofthe earth every 1 or 2 d and it implements a GPP prod-uct (MOD17) that has helped track the response of the bio-sphere to the environment since 2000 The product MOD17has been validated against many EC sites but few validationsites correspond to deserts and semideserts (Running et al2004) The GPP MOD17 underestimated the GPP derivedfrom EC data at Bernal (Fig 11a) In a similar bell-shapeddistribution of θs as in the case of the extrapolation of theSanta Rita site (Fig 10b) here the θs was not constant acrossthe range of measurement mean θs was 024plusmn 013 withα = 000047plusmn0087 and β = 048749plusmn002 (Fig 11c) Ad-justing MOD17 estimates with Eq (6) improved the relation-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

378 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 8 Net ecosystem exchange (NEE) and photosynthetic active radiation (PAR) at the Bernal site in (a) 2017 and (b) 2018 Negativevalues in the CO2 flux indicate photosynthesis The gray shadow is the standard error of the mean for each month at any given hour

Table 3 Leader board of EML models for the Bernal site using 85 of day observations as training dataset NA denotes the outcome wherethe type of model was not present in the 30 top-performing models according to deviance

Type of model Number of models Average deviance

All models Top models Leader model

Stacked ensembleAll models 20 0852 0834Best of family 20 0812 0703 0633GBM 453 1366 0796DRF 20 1108 NAXGBoost 277 1385 0778XRT 20 1077 NADeep learning 201 2072 NAGLM 20 1474 NATotal 1031 1356

ship but note that the value of r (076) was the same for theoriginal MOD17 and the adjusted MOD17 (Fig 11b)

34 Prediction eddy covariance GPP withordinary-least-squares multiple regression

OLS is a common estimation method for linear models andhere this model appeared as adequate judging by the gen-eral distribution of predictions (Fig 12a) and the probability-of-agreement plot (Fig 12b) A total of 14 variables wereincluded in the model all of them with VIF values lowerthan 70 (Appendix C) the VIF statistic quantifies the sever-ity of multicollinearity and an acceptable threshold is 10The most significant variables were the EVI from MYD13and Daymet variables precipitation shortwave radiation and

minimum and maximum temperatures (see Appendix B forvariable details) Variables with high coefficient values wereMODIS reflectance band 14 (917) the EVI from MYD13(853) and the NDVI (39) while Daymet temperatures hadsmall coefficientsminus023 for maximum temperature and 017for minimum temperature The θs decreased slightly at theend of the measurement range mean θs was 05plusmn0014 withα = 018845plusmn0137 and β = 094966plusmn0031No correctionfor this model was calculated since α was close to 0 β wasclose to 1 and the EML model had a higher θs

35 Agreement

The probability of agreement (θs) was the statistic to deter-mine if the mean responses were in agreement Comparing

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 379

Figure 9 Agreement between predictions of GPP obtained with machine learning algorithms or derived from eddy covariance measurementsat the Bernal site (a) Example of one run of predictions of GPP in the test dataset from the Bernal site using the leader model of an ensembleof machine learning algorithms (EML) the test dataset was 15 of data with r = 094 (b) Predictions for the complete dataset withr = 099 the diagonal line is the 1 1 agreement (c) Function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreementbetween methods of estimation of GPP corresponding to plot (b) The horizontal axis (s) represents the magnitude of the measurement thevertical axis is the probability of agreement for the measurement and the red line is the confidence interval p lt 005

Figure 10 (a) Adjusted predictions of GPP for the complete dataset from the Bernal site using the leader model of the final ensembleof machine learning algorithms (EML) derived from the Santa Rita site compared to estimates of GPP from EC data with r = 078 (b)Respective function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreement between methods of estimation of GPP ECdata from the Bernal site and EML model for the Santa Rita site The horizontal axis (s) represents the magnitude of the measurement thevertical axis is the probability of agreement for the measurement and the red line is the confidence interval p lt 005

the confidence intervals (CIs) for θs the best modeling ap-proach was the EML because its CI (056ndash059) was differentfrom that of the OLS model (047ndash052) More importantlyfor the EML and OLS models the values of θs had little vari-ation across the range of GPP estimates However the CIsof the EML and OLS models were similar in their bias es-timates α and β and both models had no bias (p lt 005)Altogether the best result was the EML ensemble using en-vironmental and remotely sensed data corresponding to thesame site ie Bernal (Fig 10c) This kind of EML wouldbe useful for gap filling or the evaluation of GPP time se-ries of the site that generated the model Machine learningalgorithms can fill gaps longer than 30 d (Kang et al 2019)

The second option to estimate GPP using the samedatasets was the multiple-regression OLS model (Fig 12a)The multiple regression is straightforward and here multi-collinearity was not a problem The EML ensemble and theOLS regression have the highest values of θs (058 and 05respectively Fig 12b) Higher values for θs are desirable(gt 095) and this could be achieved by increasing the samplesize relaxing the tolerable magnitude for agreement (here itwas set at 1 gCmminus2 dminus1) or perhaps using different forcingvariables

The MODIS estimates were a third-best alternative sincethe mean θs was 024 In the present study we used a splineto fill the data to a daily time step since the MCD17 is an8 d composite product but a similar result was obtained if

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

380 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 11 (a) MOD17 GPP daily time step from the Bernal site versus estimates of GPP from EC data using the complete dataset at adaily time step derived from spline for MOD17 GPP (copy) with r = 076 or 8 d composite estimates obtained by 8 d averages of EC GPP() with r = 076 (b) Adjusted predictions of GPP from MODIS (c) Corresponding function of probability of agreement for Fig 7a using1 g Cmminus2 dminus1 as a tolerable agreement between methods of estimation of GPP EC data from the Bernal site and MODIS MOD17 Thehorizontal axis (s) represents the magnitude of the measurement The vertical axis is the probability of agreement for the measurement

Figure 12 (a) Ordinary-least-squares multiple-regression estimates of GPP for the complete dataset from the Bernal site versus estimates ofGPP from EC data with r = 082 (b) Respective function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreement betweenmethods of estimation of GPP EC data from the Bernal site and OLS multiple regression The horizontal axis (s) represents the magnitudeof the measurement The vertical axis is the probability of agreement for the measurement

the EC GPP was rescaled and compared to the original 8 dMCD17 data (Fig 11a) The GPP from MODIS was an un-derestimate of EC GPP and when the estimates were ad-justed (Eq 6 Fig 11b) the performance was not better thanthe OLS (model not shown) The MODIS land cover classi-fication represented this site as grassland (MCD12) and thiscould be another reason for the poor agreement besides theassumptions made in the MODIS algorithm regarding the εand εmax parameters and the response of vegetation to VPDAgreement of MODIS GPP is crucial because MODIS prod-ucts are frequently used in country-wide assessments of thecarbon cycle and can influence public policies

The model with least agreement resulted when the EMLensemble generated from the Santa Rita site was used to pre-dict GPP at Bernal Machine learning models can make pre-dictions but their usefulness decreases when they are usedoutside the context of where they were built while process-

based mechanistic models have this ability Although the Au-toML function in H2O is designed to protect against over-fitting using cross-validation runs (Michailidis 2018) in ourstudy the Santa Rita model could not be generalized to repre-sent the Bernal site GPP process probably because the vari-ables and features selected for Bernal or Santa Rita were dif-ferent during the AutoML workflow

The Santa Rita model was good at predicting GPP at thatsite with a deviance of the leader model of 023 while atBernal the deviance of the leader model was 063 (Table 3)indicating that the Santa Rita EML ensemble was at leastas good a model as the EML at Bernal (not shown) TheGPP time series for Santa Rita was about 4 times the sizeof the Bernal dataset and therefore the deviance was lowerHowever when the Santa Rita model was used with Bernaldata the mean θs was 016 indicating that the agreement wasinsufficient Eyeballing the predictions in Figs 9a 10b and

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 381

11a and their corresponding correlation values (078 076and 082 for adjusted EML Santa Rita adjusted MODIS andmultiple-regression OLS) it could be argued that these mod-els were comparable However their θs plots present a dif-ferent perspective

4 Discussion

41 Model agreement

The Santa Rita model had higher θs when it extrapolatedfor low GPP values for the Bernal site suggesting that theSanta Rita model had a better skill when predicting GPPclose to 0 and even negative GPP values (Fig 10b) Al-though the Bernal and Santa Rita sites had similar vegeta-tion and climate classifications they are more than 1600 kmapart and rainfall monthly distribution is different The GPPseasonal cycle at Bernal started in February and steadily in-creased to a maximum during July and August (Table 1)At Santa Rita the GPP was low from January until mid-July (lt 05 gCO2 mminus2 dminus1) and then increased sharply to amaximum level in mid-August (Joiner and Yoshida 2020)What matters eventually for machine learning methods ishow well the predictor space rather than geographic spaceis sampled (Jung et al 2020) Incorporating data from morehumid (semiarid) sites could improve the GPP predictions bya machine learning method Not only more EC sites but alsosites representing a water availability gradient would be im-portant for semiarid ecosystems representing long spells andthe influence of oceanic oscillations and monsoon rains

All the models presented here used transient data to rep-resent GPP specifically at the 1 d time step Besides theradiation-related variables two sources of rainfall were usedas forcing variables but evapotranspiration (ET) was notused The MOD16A2 version 6 is available as an 8 d gap-filled product and could be included in EML or OLS mod-els However a more general representation of the carboncycling could be achieved when including variables that rep-resent annual or seasonal timescales of soil water evapo-ration or precipitation (Scott and Biederman 2019) Scottet al (2015) suggest that real lags between precipitation andproductivity that may impart legacy effects may also be par-tially masked by using ET as ET more carefully tracks pro-ductivity when soil moisture storage is accessed The occur-rence of off-season rainfall dry spells and carryover effectscould be parameterized as windowed events of a given dura-tion A window would be a period with distinct time bound-aries the window allows the grouping of records with similarfeatures The effect of the window at a given time could berepresented as moving weights as the point in time in ques-tion is closer or farther from the window

In our study the metric to assess model agreement wasthe probability of agreement (θs) and their bias parame-ters Many other metrics can be used to evaluate model per-

formance such as the root mean square error r r2 or themodel efficiency factor (MEF) presented by Nash and Sut-cliffe (1970) In particular the MEF is a step in the Fluxnetdata processing pipeline (Pastorello et al 2020) A MEFvalue close to 1 represents a high correlation and lower bi-ases (Joiner and Yoshida 2020) Therefore the calculatedMEF for the Bernal site EML (098) would suggest very goodpredictive performance while the θs of 06 for this model in-dicates a more modest performance The θs for a particularvalue of the measurand is the probability that the differencebetween two measurements made by different systems fallswithin an interval that is deemed to be acceptable (Stevenset al 2015) In the present study we used 1 gCmminus2 dminus1 as acritical value defining an acceptable difference If this valueis smaller then the probability of agreement would decreasefor this same model In such a case it would be less likelythat the predictors agree considering that the θs takes intoaccount both the difference in the function means at a givenvalue of the measurand and the uncertainty in its estimation

42 MODIS discrepancies

Different authors have reported discrepancies betweenMCD17 and EC estimates of GPP in semiarid regions Ex-amining MODIS discrepancies in these ecosystems is im-portant because the errors induced by cloud cover are ex-pected to be minimal and other effects can be identified(Gebremichael and Barros 2006) The GPP of MOD17 didnot relate well (EC= 011+ 017MODIS r2

= 067) withestimates of EC GPP in semidesert vegetation of the Sahel(Tagesson et al 2017) With data from different types of veg-etation in the Heihe basin in China MODIS17 overestimatedthe GPP from EC (EC= 115+024MODIS r2

= 068 Cuiet al 2016) For scrub sites in Mexico the relation be-tween GPP calculated from EC and MOD17 was not good(MODIS= 38382+ 0467EC r2

= 06 Delgado-Balbuenaet al 2018) In arid and semiarid ecosystems in China op-timizing parameters of the MODIS GPP model with site-specific data improved the estimate to explain 91 of thevariation in the GPP of the data observed by EC (Wang et al2019) These same authors propose improving the land useclassification used by the MOD17 algorithm and recalibrat-ing light use efficiency parameters to solve the GPP estima-tion problem Gebremichael and Barros (2006) examined anopen shrubland site in a semiarid region of Sonora Mexicoand their analysis of the temporal evolution of the discrep-ancies with MODIS GPP suggested revisiting the light useefficiency parameterization especially the functional depen-dence on VPD and PAR and water stress or soil moistureavailability

The relationship between the GPP MODIS and the GPPEC presented in Sect 33 is an approximation because theuncertainty in the respiration component must be consid-ered The empirical relationship between nocturnal NEE andsoil temperature has been used to represent ecosystem res-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

382 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

piration (Reco) in order to separate the processes that con-tribute to daytime NEE (Richardson and Hollinger 2005Wofsy et al 1993) Nighttime NEE should be equal to therates of autotrophic and heterotrophic respiration while dur-ing daytime NEE should be equal to the combined rates ofcarboxylation and oxidation of Rubisco autotrophic respi-ration and heterotrophic respiration Then the GPP can becalculated as the difference between daytime NEE and Recoestimated through its relationship with temperature (Gouldenet al 1996) In the present study Reco was calculated basedon soil and air temperature following the procedure of Reich-stein et al (2005) implemented in REddyProc (Wutzler et al2018) Although it is possible to measure or model the parti-tion of respiration (Running et al 2004 Wang et al 2018)the presence of cacti complicates the calculation assum-ing that all nighttime flux represents ecosystem respiration(Owen et al 2016 Richardson and Hollinger 2005) Whilesoil respiration tends to be temperature-limited when soilmoisture is nonlimiting in temperate ecosystems in range-land ecosystems the controls of soil CO2 efflux were photo-synthesis soil temperature and moisture (Roby et al 2019)In our study the instrumentation did not include measure-ments of plant or soil respiration partition to validate theRecoestimates

A problem regarding data comparison from remote orbitalsensors and terrestrial observations is that different quanti-ties are fundamentally measured MODIS measures the radi-ation reflected by the earthrsquos surface in two spectral bands at250 m spatial resolution per pixel five bands at 500 m and29 bands at 1000 m The EC technique has a footprint tomeasure CO2 that varies dynamically in shape and size butis generally considered to be 1 km2 in this study the foot-print radius was 600 m To solve the scaling MODIS prod-ucts related to the carbon cycle have been validated with theEC technique and biometric measurements on several spa-tial scales using process-based ecosystem models and char-acterizing areas up to 47 km2 around the EC measuring tower(Cohen et al 2003)

At Bernal the vegetation was heterogeneous This situa-tion was represented by the heterogeneity in vegetation ac-tivity in the 4 pixels used spanning about 02 units of NDVIat the peak of the season activity (Fig 1) The standard errorof the IVI differed by 1 order of magnitude among speciesAlthough more important species had a lower standard errortheir means were similar indicating that they were equallyabundant at all sampling plots The higher importance ofsome species (Table 2) was explained by selective grazingndashbrowsing behavior and the dispersion caused by cattle by ei-ther ingesting or transporting seeds or plant parts (Belaynehand Tessema 2017) Regarding landscape heterogeneity thetower fetch was predominantly from the northeast capturingthe less heterogeneous area of the site but also the more ac-tive according to the NDVI (Fig 1)

The thorny scrub examined had two vegetation layersthe overstory layer mainly consisting of mesquite acacia

and cacti and the understory layer that included grasses andherbs Cattle preferentially graze the understory and becausethey eat using their tongue they will avoid browsing thornyspecies unlike goats or deer which use their lips Withoutgrazing management overtime the competition balance willfavor bush species resulting in encroaching and the under-story will be stressed only unpalatable species or those withtheir growing meristems very close to the ground would sur-vive Representing the structure and functioning of these twolayers using MODIS is possible (Liu et al 2017) RecentlyHill and Guerschman (2020) presented a MODIS product de-rived from MCD43A4 to estimate the fractions of photosyn-thetic and nonphotosynthetic vegetation and the remainingfraction of bare soil These developments could improve theMOD17 GPP estimates since its model represents a homoge-neous single vegetation layer All these considerations helpto understand the low θs between MOD17 estimates of GPPand EC-derived GPP

43 Forcing variables and machine learning

Forcing variables in this study were gridded meteorologi-cal data and MODIS spectral bands and products but thevariables mostly included in the ML algorithms were LAIand FPAR accounting for 364 and 549 of the vari-able importance index at Bernal and Santa Rita respectivelyin their leader board This result supported the view thatCO2 fluxes can be represented by ML algorithms exclu-sively using remotely sensed data (Tramontana et al 2016Joiner and Yoshida 2020) Using neural networks Joiner andYoshida (2020) showed that with only satellite reflectancefrom MODIS and top-of-atmosphere PAR it is possible tocapture a large fraction of GPP variability They argue thatvegetation indices may reduce the information content ofthe underlying reflectance when compressing the informa-tion from two or more bands into a single index In our studyreflectance bands were not often included in the ML modelsThis was most likely because MOD09GA and MODOCGAprovide estimates of surface reflectance uncorrected for theillumination and viewing geometry while the MCD43 usedin Joiner and Yoshida (2020) uses a surface bidirectional re-flectance distribution function (BRDF) model that improvesthe quality of surface albedo retrievals By propagating theBRDF correction in the MODIS processing pipeline the veg-etation indices would more likely relate better to the GPPFuture modeling efforts should use benchmark scenarios ac-cording to sets of forcing variables already identified as use-ful

44 Carbon flux

Although the carbon balance in ecosystems is influenced bydifferent factors such as soil type and amount of nutrientsthe relationship with soil temperature and humidity is partic-ularly strong (Anderson-Teixeira et al 2011 Hastings et al

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 383

2005) How much of the rainwater the system can retain orlose has been described as the leakiness of the system (Guer-schman et al 2009) More than immediate incident rainfallthe available soil moisture and its redistribution are impor-tant in semiarid ecosystems including steam flow prefer-ential flow paths hydraulic lift and others (Barron-Gaffordet al 2017) At Bernal when the sum of the precipitation ofthe current month and that of the previous month was con-sidered the correlation with NEE was minus07 suggesting thatcontinuous availability of soil moisture is important for theabsorption of CO2 in this environment This result is con-sistent with other studies in which the relationship betweenthe net productivity of the ecosystem (NEP) and precipita-tion is initially positive but is leveled from 1000 to 1500 mmannually (Xu et al 2014) The hydraulic redistribution ofwater from moist (deeper) to drier soils through plant rootstended to increase modeled annual ecosystem uptake of CO2this process was identified at US-SMR (Fu et al 2018 Scottet al 2003)

The leakiness is highly dependent upon vegetation frac-tional cover the proportion of the surface occupied by baresoil and vegetation photosynthetically active vegetation andnonphotosynthetically active vegetation such as litter woodand dead biomass (Guerschman et al 2009) All these vege-tation fractions have a water storage capacity and can reducethe amount of effective rainfall available to plant roots It ispossible that the canopy of the bushes completely interceptedthe rainfall in some months because the scrub can interceptup to 20 of the precipitation and its canopy storage ca-pacity is 097 mm (Mastachi-Loza et al 2010) Consideringonly the daily rain events greater than 5 mm the correlationbetween precipitation and NEE rose to minus072 In the presentstudy the interception of rain by vegetation surfaces was notcalculated but the results suggest that it would be impor-tant to explore the relationship between net precipitation andNEE

The average NEE at a global level is minus156plusmn284 gCmminus2 yrminus1 (Baldocchi 2014) The highest frequencyamong sites that measured NEE with EC occurs from minus200to minus300 gCmminus2 yrminus1 but in sites with biometric measure-ments the peak occurs at minus100 gCmminus2 yrminus1 (Xu et al2014) Using the daily averages of Table 2 the averageNEE during the measurement period was minus078 gCmminus2 dminus1

and annually would be minus2835 gCmminus2 yrminus1 This resultwas higher than the annual values of the induced grass-land and scrubland vegetation characterizing the SonoranDesert plains (138 and 130 gCmminus2 yrminus1 Hinojo-Hinojoet al 2019) In New Mexico NEE values measured withEC are between 35ndash50 gCmminus2 yrminus1 in desert grassland and344ndash355 gCmminus2 yrminus1 in mixed coniferous forest (Anderson-Teixeira et al 2011) In a dryer region the sarcocaulescentscrubland of Baja California in Mexico the NEE was minus39and minus52 gCmminus2 yrminus1 in 2002 and 2003 respectively (Hast-ings et al 2005) The NEE measured here was within therange of NEE 03plusmn 02 kgCmminus2 yrminus1 for grasslands and

shrublands in Mexico (Murray-Tortarolo et al 2016) Al-though the measurements of the present study had gaps andwere compared with annual studies we considered that thereported value of C was representative of the main season ofgrowth of this type of scrub

45 Ecosystem management

Overgrazing is an appreciation relative to the grazing produc-tive system where the forage resource is overused in a mixedshrubndashgrass ecosystem such as Bernal it usually refers tothe understory Overgrazing means that the plant regrowth isreadily grazed tillers and root reserves are lost and eventu-ally the plant may die Although the Bernal site was over-grazed the carbon fluxes indicated that the plant communitywas photosynthetically active in both dry and rainy seasonsIt is fair to assume that water in the soil was not limiting forthe deep-rooted bush species and that was the reason whyit was possible to maintain the photosynthetic function dur-ing rainless months However this primary production wouldnot have tangible benefits for ranchersrsquo production systemsince no edible biomass would be produced for the cattleFrom the point of view of carbon capture the system accu-mulated nonlabile biomass that would remain in the systemfor a longer time compared to a grassland ecosystem (al-though it would be necessary to determine the partition ofsaid shrub biomass) However the overgrazing condition af-fects the biomass of the understory roots and consequentlythe carbon pool in the soil

In the short term it can be thought that the estimated neg-ative carbon flows had a favorable effect on the environmen-tal agenda As time passes it is possible that the gaps be-tween the individual shrubs of the overstory expand and thiswould have had an effect on soil erosion It is also possiblethat the water stored in the soil profile used by the bushesgradually decreases to the point of causing drought changesin phenology and advancements of the desertification pro-cess There are many opportunities for ecology conservationand livestock-oriented management these may include con-trolled grazing or propagating native thornless shrub speciesIf ranchers do not identify a benefit in the vegetation thenthey will be tempted to remove it as it occurred at thestudy site Because of its wide coverage and availability theMODIS GPP product accuracy is important in representingthe carbon cycle raising awareness and monitoring advance-ment of environmental decisions

Although we found that EML was a good option for mod-eling the GPP of a site what is really needed to evaluate theperformance of semiarid ecosystems is a spatial representa-tion of the carbon flux This is a problem for an underrepre-sented area regarding instrumented EC towers However theEML could be designed to take into account the explanatoryvariables in a spatiotemporal continuity

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

384 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

5 Conclusions

The best modeling approach was the ensemble of ma-chine learning The second option to estimate GPP was themultiple-regression OLS and the third alternative were theMODIS estimates Machine learning was a good option topredict GPP in the local context where it was generated oth-erwise its performance was not good This was demonstratedwhen using the EML from Santa Rita and trying to predictfluxes at Bernal Nevertheless a machine learning modelwould be useful for gap filling or the evaluation of GPP atthe same site The GPP estimates of a given model can be ad-justed using the bias parameters of the probability of agree-ment to improve the relationship Further research explor-ing the usefulness of EML to predict GPP in other contextscould use EML to calibrate mechanistic model parametershybrid approaches integrating EML and biologically mean-ingful mechanistic models or EML representing the spatialvariability of the landscape

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 385

Appendix A

Table A1 Threshold of velocity friction (ulowast) above which nighttime fluxes were considered valid Estimates were obtained using the Rpackage REddyProc We used only the flux records with ulowast equal to or higher than the corresponding mean ulowast threshold for each yearBound values are a 95 confidence interval NA not applicable

Aggregation method Year Season Mean Lower bound 5 Upper bound 95

ulowast (msminus1)

Single NA NA 0193 0140 0209Year 2017 NA 0194 0128 0227Year 2018 NA 0193 0132 0225Season 2017 2016012 0194 0128 0227Season 2017 2017003 0215 0141 0297Season 2017 2017006 0194 0128 0227Season 2017 2017009 0181 0144 0208Season 2018 2017012 0149 0116 0232Season 2018 2018003 0271 0134 0297Season 2018 2018006 0193 0132 0221

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

386 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix B

Table B1 MODIS reflectance bands and products database

Product name Satellite layer Spatial Temporal Spectralresolution resolution coverage

(d) (nm)

MOD09GA Band 1 500 m 1 620ndash670Band 2 841ndash876Band 3 459ndash479Band 4 545ndash565Band 5 1230ndash1250Band 6 1628ndash1652Band 7 2105ndash2155

MODOCGA Band 8 1 km 405ndash420Band 9 438ndash448Band 10 483ndash493Band 11 526ndash536Band 12 546ndash556Band 13 662ndash672Band 14 673ndash683

MOD17A2H Gross primary production (GPP) 8

Net photosynthesis (PsnNet)

MOD15A2H Fraction of photosynthetically active radiation (Fpar) na

Leaf area index (LAI)

MOD11A2 Land surface temperature and emissivity of day (LST day) 1 km

Land surface temperature and emissivity of night (LST night)

MYD13A1 Enhanced vegetation index (EVI) 500 m 16

MYD15A2H Fraction of photosynthetically active radiation (Fpar) 8

Leaf area index (LAI)

MYD11A2 Land surface temperature and emissivity of day (LST day) 1 km na

Land surface temperature and emissivity of night (LST night)

MCD15A2H Fraction of photosynthetically active radiation (Fpar) 500 m

Leaf area index (LAI)

na not applicable

Table B2 Daily vegetation indexes computed using the MODIS reflectance bands described in Table 1

Index Formula Reference

Simple ratio (SimpleR) SimpleR= Band 2Band 1

(Hill et al 2006)Moisture stress (MoistS) MoistS= Band 6

Band 2

Disease stress index (DSI) DSI= Band 2+Band 4Band 6+Band 1

Redndashgreen ratio index (RGRI) RGRI= Band 1Band 4

Normalized difference vegetation index (NDVI) NDVI= Band 2minusBand 1Band 2+Band 1

Normalized difference water index (NDVI_w) NDVI_w =Band 2minusBand 5Band 2+Band 5

Green leaf index (GLI) GLI= 2lowastBand 11minusBand 14minusBand 92lowastBand 11+Band 14+Band

(Henrich et al 2012)Green atmospherically resistant vegetation (GARI) GARI= Band 5minus(Band 11minus(Band 9minusBand 14))

Band 5minus(Band 11+(Band 9minusBand 14))

Enhanced vegetation index (EVI) EVI= 25lowast Band 2minusBand 1Band 2+(6lowastBand 1)minus(75lowastBand 9)+1

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 387

Table B3 Daymet meteorological database

Variable Spatial resolution Temporal resolution Reference

Precipitation (Dayprc)

1 km Daily (Thornton et al 2017)Shortwave radiation (Daysrad)Maximum air temperature (DayTmax)Minimum air temperature (DayTmin)Water vapor pressure (Dayvp)

Table B4 Precipitation data

Satellite Product name Spatial resolution Temporal resolution

Global Precipitation Measurement (GPM) 3IMERGDF v006 010Daily

Tropical Rainfall Measuring Mission (TRMM) 3B42 v007 025

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

388 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix C

Table C1 Analysis of variance for GGP derived from EC data atthe Bernal site Variable details are described in Appendix B

Model term Coefficient SEa DFb SSc F p VIFd

Regression 14 178048Constant minus1318 648 minus203 004R_08_40542e

minus212 060 1 2680 minus352 0 220R_13_66267e

minus286 123 1 1161 minus232 002 155R_14_67368e 917 259 1 2706 353 0 122RGRI6

minus083 037 1 1100 minus225 003 143GARIf 201 108 1 755 187 006 237EVIMYDg 853 201 1 3903 425 0 294NDVI_wf 390 155 1 1367 251 001 271PsnNetg 0013 0 1 7530 590 0 383LstNgtMYDg 005 002 1 818 194 005 201Dayprch 014 003 1 3590 407 0 192Daysradh 001 0 1 5121 486 0 501DayTmaxh

minus024 006 1 3658 minus411 0 689DayTminh 018 005 1 3222 386 0 496TRMMi minus004 002 1 1087 minus224 003 141Error 403 87297Total 417 265345

a Standard error of coefficient b Degrees of freedom c Adjusted sum of squares d Variance inflation factore Variable name denotes the band number and spectral bandwidth of MODIS (Moderate Resolution ImagingSpectroradiometer) f Vegetation indices RGRI is redndashgreen ratio index and GARI is green atmosphericallyresistant vegetation index details of formula are described in Appendix B g Layers from MODIS productsEVIMYD is enhanced vegetation index from MYD13 PsnNet is photosynthesis product form MOD17LstNgtMYD is nighttime land surface temperature emissivity from MYD11 h Variables obtained fromDaymet daily dataset DayTmin is minimum temperature DayTmax is maximum temperature Daysrad isshortwave radiation Dayprc is precipitation i Daily rainfall rate from 3B42 TRMM (Tropical RainfallMeasuring Mission)

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 389

Code and data availability Database and programming code areavailable at httpsdoiorg105281zenodo3598595 (Guevara-Escobar et al 2020)

Author contributions AGE contributed to conceptualization for-mal analysis investigation methodology software supervisionvalidation visualization and writing of the original draft as wellas review and editing EGS was in charge of funding acquisitionproject administration and writing and review MCJ contributed toinvestigation methodology validation visualization and writing ofthe original draft as well as review editing and communicationMEQB helped with investigation validation writing and reviewHSA contributed with conceptualization writing and review ICAacutehelped with investigation writing and review VHCS contributedwith writing review and funding acquisition

Competing interests The authors declare that they have no conflictof interest

Acknowledgement We thank NASA and Ameriflux for data avail-ability Data filtering was carried out by the cluster Abacus ICentre of Applied Mathematics and High Performance ComputingABACUS-CINVESTAV

Financial support This research has been supported by CONA-CyT ndash SEMARNAT (grant no 2014-1-(249407))

Review statement This paper was edited by Paul Stoy and re-viewed by three anonymous referees

References

Aguirre-Diacuteaz G J Aguilloacuten-Robles A Tristaacuten-Gonzaacutelez MLabarthe-Hernaacutendez G Loacutepez-Martiacutenez M Bellon Hand Nieto-Obregoacuten J Geologic setting of the Pentildeade Bernal Natural Monument Quereacutetaro Meacutexico Anendogenous volcanic dome Geosphere 9 557ndash571httpsdoiorg101130GES008431 2013

Anav A Friedlingstein P Beer C Ciais P Harper AJones C Murray-Tortarolo G Papale D Parazoo N Cand Peylin P Spatiotemporal patterns of terrestrial grossprimary production A review Rev Geophys 53 785ndash818httpsdoiorg1010022015RG000483 2015

Anderson-Teixeira K J Delong J P Fox A M Brese D Aand Litvak M E Differential responses of production and res-piration to temperature and moisture drive the carbon balanceacross a climatic gradient in New Mexico Glob Change Biol17 410ndash424 httpsdoiorg101111j1365-2486201002269x2011

Baldocchi D Measuring fluxes of trace gases and energy be-tween ecosystems and the atmospherendashthe state and future of the

eddy covariance method Glob Change Biol 20 3600ndash3609httpsdoiorg101111gcb12649 2014

Barron-Gafford G A Sanchez-Cantildeete E P Minor R LHendryx S M Lee E Sutter L F Tran N Parra EColella T Murphy P C Hamerlynck E P Kumar P andScott R L Impacts of hydraulic redistribution on grassndashtreecompetition vs facilitation in a semi-arid savanna New Phytol215 1451ndash1461 httpsdoiorg101111nph14693 2017

Belayneh A and Tessema Z K Mechanisms of bush encroach-ment and its inter-connection with rangeland degradation insemi-arid African ecosystems a review J Arid Land 9 299ndash312 httpsdoiorg101007s40333-016-0023-x 2017

Bland J M and Altman D G Statistical methods for assessingagreement between two methods of clinical measurement Int JNurs Stud 47 931ndash936 2010

Boettiger C and Eddelbuettel D An introduction to rockerDocker containers 639 for r ArXiv Prepr ArXiv171003675 6402017

Bonilla-Moheno M and Aide T M Beyond deforestationLand cover transitions in Mexico Agric Syst 178 102734httpsdoiorg101016jagsy2019102734 2020

Booker K Huntsinger L Bartolome J W Sayre N Fand Stewart W What can ecological science tell us aboutopportunities for carbon sequestration on arid rangelands inthe United States Glob Environ Change 23 240ndash251httpsdoiorg101016jgloenvcha201210001 2013

Butzer K W and Butzer E K The lsquonaturalrsquovegetation ofthe Mexican Bajiacuteo archival documentation of a 16th-century savanna environment Quat Int 43 161ndash172httpsdoiorg101016S1040-6182(97)00032-3 1997

CICESE C de I C y de E S de E Base de datos climatoloacutegicanacional (CLICOM) Villa Bernal Quereacutetaro available at httpclicom-mexcicesemx (last access 28 August 2019) 2015

Cohen W B Maiersperger T K Yang Z Gower S TTurner D P Ritts W D Berterretche M and Run-ning S W Comparisons of land cover and LAI esti-mates derived from ETM+ and MODIS for four sites inNorth America A quality assessment of 20002001 provi-sional MODIS products Remote Sens Environ 88 233ndash255httpsdoiorg101016jrse200306006 2003

Cui T Wang Y Sun R Qiao C Fan W Jiang GHao L and Zhang L Estimating vegetation primaryproduction in the Heihe River Basin of China withmulti-source and multi-scale data PloS One 11 1ndash20httpsdoiorg101371journalpone0153971 2016

Curtis J T and McIntosh R P The Interrelations of Certain An-alytic and Synthetic Phytosociological Characters Ecology 31434ndash455 httpsdoiorg1023071931497 1950

Delgado-Balbuena J Yepez E A Aacutengeles-Peacuterez G Aguirre-Gutierrez C Arredondo T Ayala-Nintildeo F Bullock S HCastellanos A E Cueva A Figueroa-Espinoza B Garatuza-Payaacuten J Hinojo-Hinojo C Maya-Delgado Y Meacutendez-Barroso L Oechel W Paz-Pellat F Peacuterez-Ruiacutez E R Ro-driacuteguez J C Rojas-Robles N Saacutenchez-Mejiacutea Z M Uuh-Sonda J Vargas R Verduzco V S Vivoni E R andWatts C Flujos anuales de carbono en ecosistemas terrestres deMeacutexico in Estado Actual del Conocimiento del Ciclo del Car-bono y sus Interacciones en Meacutexico Siacutentesis a 2018 p 686Texcoco Estado de Meacutexico Meacutexico 2018

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

390 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Eshel G Dayalu A Wofsy S C Munger J W andTziperman E Listening to the Forest An ArtificialNeural Network-Based Model of Carbon Uptake at Har-vard Forest J Geophys Res-Biogeo 124 461ndash478httpsdoiorg1010292018JG004791 2019

Fu C Wang G Bible K Goulden M L Saleska S RScott R L and Cardon Z G Hydraulic redistributionaffects modeled carbon cycling via soil microbial activityand suppressed fire Glob Change Biol 24 3472ndash3485httpsdoiorg101111gcb14164 2018

Gebremichael M and Barros A P Evaluation of MODISGross Primary Productivity (GPP) in tropical mon-soon regions Remote Sens Environ 100 150ndash166httpsdoiorg101016jrse200510009 2006

Gitelson A A Peng Y Arkebauer T J and Suyker A EProductivity absorbed photosynthetically active radiation andlight use efficiency in crops Implications for remote sensingof crop primary production J Plant Physiol 177 100ndash109httpsdoiorg101016jjplph201412015 2015

Glenn E P Scott R L Nguyen U and Nagler P L Wide-area ratios of evapotranspiration to precipitation in monsoon-dependent semiarid vegetation communities J Arid Envi-ron 117 84ndash95 httpsdoiorg101016jjaridenv2015020102015

Goldstein A Turner W R Spawn S A Anderson-Teixeira K J Cook-Patton S Fargione J Gibbs H KGriscom B Hewson J H and Howard J F Protecting ir-recoverable carbon in Earthrsquos ecosystems Nat Clim Change10 1ndash9 httpsdoiorg101038s41558-020-0738-8 2020

Goulden M L Munger J W Fan S Daube B Cand Wofsy S C Measurements of carbon sequestra-tion by long-term eddy covariance Methods and a criti-cal evaluation of accuracy Glob Change Biol 2 169ndash182httpsdoiorg101111j1365-24861996tb00070x 1996

Grossiord C Buckley T N Cernusak L A Novick K A Poul-ter B Siegwolf R T Sperry J S and McDowell N GPlant responses to rising vapor pressure deficit New Phytol226 1550ndash1566 2020

Guerschman J P Hill M J Renzullo L J Barrett D JMarks A S and Botha E J Estimating fractional coverof photosynthetic vegetation non-photosynthetic vegetation andbare soil in the Australian tropical savanna region upscaling theEO-1 Hyperion and MODIS sensors Remote Sens Environ113 928ndash945 httpsdoiorg101016jrse200901006 2009

Guevara-Escobar A Gonzaacutelez-Sosa E Cervantes-Jimeacutenez MSuzaacuten-Azpiri H Queijeiro-Bolantildeos M E Carrillo-AacutengelesI and Cambroacuten-Sandoval V H Machine learning estimates ofeddy covariance carbon flux in a scrub in the Mexican highlandZenodo httpsdoiorg105281zenodo3598595 2020

Hall P Gill N Kurka M Phan W and Bartz A MachineLearning Interpretability with H2O Driverless AI edited byBartz A H2Oai Inc California US 2019

Hastings S J Oechel W C and Muhlia-Melo A Diur-nal seasonal and annual variation in the net ecosystem CO2exchange of a desert shrub community (Sarcocaulescent) inBaja California Mexico Glob Change Biol 11 927ndash939httpsdoiorg101111j1365-2486200500951x 2005

Henrich V Krauss G Goumltze C and Sandow C Entwicklungeiner Datenbank fuumlr Fernerkundungsindizes vol 45 AK Fern-erkundung Bochum 2012

Hill M Held A Leuning R Coops N Hughes D andCleugh H MODIS spectral signals at a flux tower site Rela-tionships with high-resolution data and CO2 flux and light useefficiency measurements Remote Sens Environ 103 351ndash368httpsdoiorg101016jrse200506015 2006

Hill M J and Guerschman J P The MODIS Global VegetationFractional Cover Product 2001ndash2018 Characteristics of Vegeta-tion Fractional Cover in Grasslands and Savanna Woodlands Re-mote Sens 12 406 httpsdoiorg103390rs12030406 2020

Hinojo-Hinojo C Castellanos A E Huxman T Ro-driguez J C Vargas R Romo-Leoacuten J R and Bie-derman J A Native shrubland and managed buffelgrasssavanna in drylands Implications for ecosystem carbonand water fluxes Agr Forest Meteorol 268 269ndash278httpsdoiorg101016jagrformet201901030 2019

Jaumlrvi L Havu M Ward H C Bellucco V McFad-den J P Toivonen T Heikinheimo V Kolari P Ri-ikonen A and Grimmond C S B Spatial modeling oflocal-scale biogenic and anthropogenic carbon dioxide emis-sions in Helsinki J Geophys Res-Atmos 124 8363ndash8384httpsdoiorg1010292018JD029576 2019

Joiner J and Yoshida Y Satellite-based reflectances capture largefraction of variability in global gross primary production (GPP)at weekly time scales Agr Forest Meteorol 291 108092httpsdoiorg101016jagrformet2020108092 2020

Jung M Schwalm C Migliavacca M Walther S Camps-VallsG Koirala S Anthoni P Besnard S Bodesheim P Carval-hais N Chevallier F Gans F Goll D S Haverd V KoumlhlerP Ichii K Jain A K Liu J Lombardozzi D Nabel J EM S Nelson J A OrsquoSullivan M Pallandt M Papale DPeters W Pongratz J Roumldenbeck C Sitch S TramontanaG Walker A Weber U and Reichstein M Scaling carbonfluxes from eddy covariance sites to globe synthesis and eval-uation of the FLUXCOM approach Biogeosciences 17 1343ndash1365 httpsdoiorg105194bg-17-1343-2020 2020

Kang M Ichii K Kim J Indrawati Y M Park J Moon MLim J-H and Chun J-H New Gap-Filling Strategies forLong-Period Flux Data Gaps Using a Data-Driven ApproachAtmosphere 10 568 httpsdoiorg103390atmos101005682019

Lal R Griffin M Apt J Lave L and Morgan M GManaging Soil Carbon Science 304 393ndash393httpsdoiorg101126science1093079 2004

LI-COR B Eddypro Software instruction manual Nebraska USversion 70 2019

Liu Y Liu R Pisek J and Chen J M Separating overstoryand understory leaf area indices for global needleleaf and de-ciduous broadleaf forests by fusion of MODIS and MISR dataBiogeosciences 14 1093ndash1110 httpsdoiorg105194bg-14-1093-2017 2017

Long S P Marshall-Colon A and Zhu X-G Meetingthe Global Food Demand of the Future by EngineeringCrop Photosynthesis and Yield Potential Cell 161 56ndash66httpsdoiorg101016jcell201503019 2015

Ma X Huete A Yu Q Restrepo-Coupe N Beringer J Hut-ley L B Kanniah K D Cleverly J and Eamus D Param-

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 391

eterization of an ecosystem light-use-efficiency model for pre-dicting savanna GPP using MODIS EVI Remote Sens Environ154 253ndash271 2014

Ma X Migliavacca M Wirth C Bohn F J Huth ARichter R and Mahecha M D Monitoring Plant FunctionalDiversity Using the Reflectance and Echo from Space RemoteSens 12 1248 httpsdoiorg103390rs12081248 2020

Marcolla B Roumldenbeck C and Cescatti A Patterns and con-trols of inter-annual variability in the terrestrial carbon budgetBiogeosciences 14 3815ndash3829 httpsdoiorg105194bg-14-3815-2017 2017

Mastachi-Loza C A Gonzaacutelez-Sosa E Becerril-Pintildea R andBraud I Peacuterdidas por intercepcioacuten en mezquite (Prosopis laevi-gata) y huizache (Acacia farnesiana) de la regioacuten semiaacuterida delcentro de Meacutexico Tecnol Cienc Agua 1 103ndash120 2010

Mauder M and Foken T Documentation and instruction man-ual of the eddy-covariance software package TK3 UniversitaumltBayreuth Abteilung Mikrometeorologie Bayreuth Germany2011

McClaran M A century of vegetation change on the Santa Rita Ex-perimental Range in Santa Rita Experimental Range 100 years(1903 to 2003) to accomplishments and contributions edited byMcClaran M P Ffolliott P F and Edminister C B Tech Co-ords USDA Forest service proceedings RMRS-P-30 TucsonAZ US 16ndash33 2003

McCulley R Jobbagy E Pockman W and Jackson R Nu-trient uptake as a contributing explanation for deep rootingin arid and semi-arid ecosystems Oecologia 141 620ndash628httpsdoiorg101007s00442-004-1687-z 2004

McElreath R Statistical Rethinking A Bayesian Course with Ex-amples in R and STAN CRC Press Boca Raton FL US 2020

Michailidis M How Driverless AI Prevents Overfitting and Leak-age Open Source Lead AI ML available at httpswwwh2oaiblogdriverless-ai-prevents-overfitting-leakage (last access7 July 2020) 2018

Mitchell R and Frank E Accelerating the XGBoost algo-rithm using GPU computing PeerJ Comput Sci 3 e127httpsdoiorg107717peerj-cs127 2017

Monteith J Solar radiation and productivity in tropical ecosys-tems J Appl Ecol 9 747ndash766 1972

Murray-Tortarolo G Friedlingstein P Sitch S Jaramillo V JMurguiacutea-Flores F Anav A Liu Y Arneth A Arvanitis AHarper A Jain A Kato E Koven C Poulter B Stocker BD Wiltshire A Zaehle S and Zeng N The carbon cycle inMexico past present and future of C stocks and fluxes Biogeo-sciences 13 223ndash238 httpsdoiorg105194bg-13-223-20162016

Nash J E and Sutcliffe J V River flow forecasting through con-ceptual models part I ndash A discussion of principles J Hydrol 10282ndash290 1970

Owen N A Choncubhair Oacute N Males J del Real Laborde J IRubio-Corteacutes R Griffiths H and Lanigan G Eddy covari-ance captures four-phase crassulacean acid metabolism (CAM)gas exchange signature in Agave Plant Cell Environ 39 295ndash309 httpsdoiorg101111pce12610 2016

Pasetto D Arenas-Castro S Bustamante J Casagrandi RChrysoulakis N Cord A F Dittrich A Domingo-Marimon C El Serafy G and Karnieli A Integration ofsatellite remote sensing data in ecosystem modelling at local

scales Practices and trends Methods Ecol Evol 9 1810ndash1821httpsdoiorg1011112041-210X13018 2018

Pastorello G Trotta C and Canfora E The FLUXNET2015dataset and the ONEFlux processing pipeline for eddy covari-ance data Sci Data 7 225 pp httpsdoiorg101038s41597-020-0534-3 2020

Pearcy R W Heterogeneous light environments in Handbookof Functional Plant Ecology edited by Pugnaire I F and Val-ladares F CRC Press Boca Ratoacuten FL US 270ndash314 2007

Piao S Liu Q Chen A Janssens I A Fu Y Dai J Liu LLian X Shen M and Zhu X Plant phenology and global cli-mate change Current progresses and challenges Glob ChangeBiol 25 1922ndash1940 httpsdoiorg101111gcb14619 2019

R Development Core Team R A language and environment forstatistical computing R Foundation for Statistical Computingavailable at httpswwwr-projectorg (last access 15 October2015) 2009

Reichstein M Falge E Baldocchi D Papale D Aubinet MBerbigier P Bernhofer C Buchmann N Gilmanov Tand Granier A On the separation of net ecosystem ex-change into assimilation and ecosystem respiration reviewand improved algorithm Glob Change Biol 11 1424ndash1439httpsdoiorg101111j1365-24862005001002x 2005

Richardson A D and Hollinger D Y Statistical model-ing of ecosystem respiration using eddy covariance datamaximum likelihood parameter estimation and Monte Carlosimulation of model and parameter uncertainty applied tothree simple models Agr Forest Meteorol 131 191ndash208httpsdoiorg101016jagrformet200505008 2005

Richardson A D Hollinger D Y Shoemaker J K Hughes HSavage K and Davidson E A Six years of ecosystem-atmosphere greenhouse gas fluxes measured in a sub-borealforest Sci Data 6 1ndash15 httpsdoiorg101038s41597-019-0119-1 2019

Roby M C Scott R L Barron-Gafford G A Hamerlynck E Pand Moore D J P Environmental and Vegetative Controls onSoil CO2 Efflux in Three Semiarid Ecosystems Soil Syst 3 6httpsdoiorg103390soilsystems3010006 2019

Running S W and Zhao M Daily GPP and annual NPP(MOD17A2A3) products NASA Earth Observing SystemMODIS land algorithm Univ of Mont Missoula Mont US2015

Running S W Nemani R R Heinsch F A Zhao MReeves M and Hashimoto H A continuous satellite-derived measure of global terrestrial primary productionBioscience 54 547ndash560 httpsdoiorg1016410006-3568(2004)054[0547ACSMOG]20CO2 2004

Rzedowski J Vegetacioacuten de Meacutexico Meacutexico Editor Limusa1978

Scott R L and Biederman J A Critical Zone Water Balance Over13 Years in a Semiarid Savanna Water Resour Res 55 574ndash588 httpsdoiorg1010292018WR023477 2019

Scott R L Watts C Payan J G Edwards E Goodrich D CWilliams D and James Shuttleworth W The understory andoverstory partitioning of energy and water fluxes in an opencanopy semiarid woodland Agr Forest Meteorol 114 127ndash139 httpsdoiorg101016S0168-1923(02)00197-1 2003

Scott R L Edwards E A Shuttleworth W J Huxman T EWatts C and Goodrich D C Interannual and seasonal vari-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

392 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

ation in fluxes of water and carbon dioxide from a ripar-ian woodland ecosystem Agr Forest Meteorol 122 65ndash84httpsdoiorg101016jagrformet200309001 2004

Scott R L Biederman J A Hamerlynck E P and Barron-Gafford G A The carbon balance pivot point of southwest-ern US semiarid ecosystems Insights from the 21st centurydrought J Geophys Res-Biogeo 120 2612ndash2624 2015

Segerstrom K Geology of the Bernal-Jalpan Area Estado deQueretaro Mexico Geol Surv Bull 1104 19ndash82 1961

Soper F M McCalley C K Sparks K and Sparks J P Soilcarbon dioxide emissions from the Mojave desert Isotopic evi-dence for a carbonate source Geophys Res Lett 44 245ndash251httpsdoiorg1010022016GL071198 2017

Stevens N T Steiner S H and MacKay R J Assessing agree-ment between two measurement systems An alternative to thelimits of agreement approach Stat Methods Med Res 262487ndash2504 httpsdoiorg1011770962280215601133 2015

Tagesson T Ardouml J Cappelaere B Kergoat L AbdiA Horion S and Fensholt R Modelling spatial andtemporal dynamics of gross primary production in theSahel from earth-observation-based photosynthetic capacityand quantum efficiency Biogeosciences 14 1333ndash1348httpsdoiorg105194bg-14-1333-2017 2017

Thornton P Thornton M Mayer B Wei Y Devarakonda RVose R and Cook R Daymet Daily Surface Weather Data ona 1-km Grid for North America Version 3 Oak Ridge NationalLab (ORNL) Oak Ridge TN US 2017

Tramontana G Jung M Schwalm C R Ichii K Camps-VallsG Raacuteduly B Reichstein M Arain M A Cescatti A KielyG Merbold L Serrano-Ortiz P Sickert S Wolf S andPapale D Predicting carbon dioxide and energy fluxes acrossglobal FLUXNET sites with regression algorithms Biogeo-sciences 13 4291ndash4313 httpsdoiorg105194bg-13-4291-2016 2016rdquo

Truong A Walters A Goodsitt J Hines K Bruss C B andFarivar R Towards automated machine learning Evaluationand comparison of AutoML approaches and tools in Proceed-ings of the 2019 IEEE 31st International Conference on Toolswith Artificial Intelligence (ICTAI) Portland OR US 1471ndash1479 2019

Verbist K Santibantildeez F Gabriels D and Soto G Atlas dezonas aacuteridas de Ameacuterica Latina y El Caribe Donoso MC(Coord) UNESCO-PHI and CAZALAC Montevideo Uruguay2010

Wang H Li X Ma M and Geng L Improving esti-mation of gross primary production in dryland ecosystemsby a model-data fusion approach Remote Sens 11 225httpsdoiorg103390rs11030225 2019

Wang X Ma M Li X Song Y Tan J Huang G Zhang ZZhao T Feng J and Ma Z Validation of MODIS-GPP prod-uct at 10 flux sites in northern China Int J Remote Sens 34587ndash599 httpsdoiorg101080014311612012715774 2013

Wang X Wang H Li X and Ran Y Photosynthesis (NPPNEP Respiration) Obs Meas Ecohydrol Process 1ndash30httpsdoiorg101007978-3-662-47871-4_10-1 2018

Wheeler C W Archer S R Asner G P and McMurtry C RClimaticedaphic controls on soil carbonnitrogen response toshrub encroachment in desert grassland Ecol Appl 17 1911ndash1928 httpsdoiorg10189006-15801 2007

Wilcox B P Birt A Fuhlendorf S D and Archer S R Emerg-ing frameworks for understanding and mitigating woody plantencroachment in grassy biomes Curr Opin Env Sust 32 46ndash52 httpsdoiorg101016jcosust201804005 2018

Wilson J B Cover plus ways of measuring plant canopiesand the terms used for them J Veg Sci 22 197ndash206httpsdoiorg101111j1654-1103201001238x 2011

Wofsy S Goulden M Munger J Fan S-M Bakwin PDaube B Bassow S and Bazzaz F Net exchange ofCO2 in a mid-latitude forest Science 260 1314ndash1317httpsdoiorg101126science26051121314 1993

Wu C Niu Z and Gao S Gross primary production estimationfrom MODIS data with vegetation index and photosyntheticallyactive radiation in maize J Geophys Res-Atmos 115 D12127httpsdoiorg1010292009JD013023 2010

Wutzler T Lucas-Moffat A Migliavacca M Knauer JSickel K Šigut L Menzer O and Reichstein MBasic and extensible post-processing of eddy covarianceflux data with REddyProc Biogeosciences 15 5015ndash5030httpsdoiorg105194bg-15-5015-2018 2018

Xiao H McDonald-Madden E Sabbadin R Peyrard NDee L E and Chadegraves I The value of understanding feed-backs from ecosystem functions to species for managing ecosys-tems Nat Commun 10 1ndash10 httpsdoiorg101038s41467-019-11890-7 2019

Xu B Yang Y Li P Shen H and Fang J Globalpatterns of ecosystem carbon flux in forests A biometricdata-based synthesis Global Biogeochem Cy 28 962ndash973httpsdoiorg1010022013GB004593 2014

Yao J Liu H Huang J Gao Z Wang G Li D Yu H andChen X Accelerated dryland expansion regulates future vari-ability in dryland gross primary production Nat Commun 111ndash10 httpsdoiorg101038s41467-020-15515-2 2020

Yepez E A Williams D G Scott R L and Lin GPartitioning overstory and understory evapotranspiration ina semiarid savanna woodland from the isotopic composi-tion of water vapor Agr Forest Meteorol 119 53ndash68httpsdoiorg101016S0168-1923(03)00116-3 2003

Yona L Cashore B Jackson R B Ometto J and Brad-ford M A Refining national greenhouse gas inventories Am-bio 1ndash6 httpsdoiorg101007s13280-019-01312-9 2020

Zeng X Zeng X and Barlage M Growing temperate shrubsover arid and semiarid regions in the Community Land ModelndashDynamic Global Vegetation Model Global Biogeochem Cy 22GB3003 httpsdoiorg1010292007GB003014 2008

Zhang A Jia G Epstein H E and Xia J ENSO elicits oppos-ing responses of semi-arid vegetation between Hemispheres SciRep 7 42281 httpsdoiorg101038srep42281 2017

Zhang L Xiao J Zheng Y Li S and Zhou Y Increased carbonuptake and water use efficiency in global semi-arid ecosystemsEnviron Res Lett 15 034022 httpsdoiorg1010881748-9326ab68ec 2020

Zhu X-G Long S P and Ort D R What is the max-imum efficiency with which photosynthesis can convert so-lar energy into biomass Curr Opin Biotech 19 153ndash159httpsdoiorg101016jcopbio200802004 2008

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

  • Abstract
  • Introduction
  • Materials and methods
    • Site description
    • Eddy covariance measurements
    • Flux data processing
    • Remotely sensed data
    • MODIS algorithm for GPP
    • Santa Rita site dataset
    • Modeling
    • Variable importance
    • Model agreement
      • Results
        • Eddy covariance fluxes at Bernal
        • Machine learning ensembles as predictors of eddy covariance GPP
        • MODIS as predictor of eddy covariance GPP
        • Prediction eddy covariance GPP with ordinary-least-squares multiple regression
        • Agreement
          • Discussion
            • Model agreement
            • MODIS discrepancies
            • Forcing variables and machine learning
            • Carbon flux
            • Ecosystem management
              • Conclusions
              • Appendix A
              • Appendix B
              • Appendix C
              • Code and data availability
              • Author contributions
              • Competing interests
              • Acknowledgement
              • Financial support
              • Review statement
              • References
Page 2: Machine learning estimates of eddy covariance carbon flux ... · Scrub is a xeric category of shrublands characterized by plants with small leaves; it is very thorny, and its biomass

368 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

woody areas increased ruralndashurban migration being an im-portant driver of that transition (Bonilla-Moheno and Aide2020) The transition from grasslands to shrublands or scrubis linked to the extremely heavy grazing by domestic live-stock (Wilcox et al 2018)

Vegetation in the arid and semiarid ecosystems are mostlyclassified as rangelands These are one of the most widelydistributed landscapes on earth incorporating a wide rangeof communities including grasslands shrublands and savan-nah Scrub is a xeric category of shrublands characterized byplants with small leaves it is very thorny and its biomass isdistributed mainly to roots and leaves rather than the stems(Rzedowski 1978 Wheeler et al 2007 Zhang et al 2017)Studies need to consider multiple sources of evidence andthe driving processes of land use change in Mexico to aidin policy formulation and to identify regions that may pro-vide important ecosystem services (Murray-Tortarolo et al2016)

On the other hand photosynthesis contributes to carbonsequestration by moving carbon stock from the atmosphereto other pools or sinks such as aboveground biomass rootsand soil organic matter (Booker et al 2013) The role of veg-etation in carbon sequestration on arid and semiarid ecosys-tems is less evident because the growth rate is low andbiomass partition above and below ground is different fromthat of temperate and tropical forests Competitive interac-tions of arid plants at the community level are strongly influ-enced by rooting architecture and phenological growth (Zenget al 2008) Many plants in semiarid systems support a deepand wide root system as a drought adaptation but also fornutrient uptake (McCulley et al 2004)

Recent time trends indicate that semiarid ecosystems reg-ulate the terrestrial carbon sink and dominate its interan-nual variability (Piao et al 2019 Scott et al 2015 Zhanget al 2020) This variability mainly results from the imbal-ance between two larger biogenic fluxes that constitute thenet ecosystem exchange (NEE) the photosynthetic uptake ofCO2 (gross primary production GPP) and the respiratory re-lease of CO2 (total ecosystem respiration Reco) Radiationand water availability are important environmental drivers ofNEE and thus GPP and Reco (Marcolla et al 2017) How-ever other carbon fluxes contribute to the imbalance suchas fire and anthropogenic CO2 emissions (Jaumlrvi et al 2019Piao et al 2019) Another atmospheric CO2 flux is that fromsoil inorganic carbon in arid and semiarid ecosystems (Soperet al 2017) Calcium carbonates form in the soil at a rel-atively low rate of 5 to 150 kgChaminus1 yrminus1 this carbon canreturn to the atmosphere but it forms a carbon sink when car-bonates are leached into the groundwater (Lal et al 2004)

The methods used to explore the ecosystems and the un-derstanding of their functioning are changing rapidly par-ticularly for arid and semiarid ecosystems (Goldstein et al2020 Ma et al 2020 Xiao et al 2019 Yao et al 2020)There are many instruments and techniques for estimatingcarbon and water fluxes but two stand out in the literature

eddy covariance (EC) and remote-sensing techniques ECis a micrometeorological method that measures the ecosys-tem community NEE at short time intervals representinga land surface smaller than 1 km2 The orbital remote sen-sors measure radiation emitted or reflected by the earthrsquos sur-face using different algorithms representing different traitsof vegetation activity from large-scale areas Both techniquesare complementary but an agreement between their esti-mates is important for regional countrywide and global spa-tiotemporal monitoring of greenhouse gas inventories (Yonaet al 2020) ecological modeling quantifying the interactionamong the vegetation component and the hydrological com-ponent energy and nutrient cycles and others applications(Pasetto et al 2018) Particularly products from the Mod-erate Resolution Imaging Spectroradiometer (MODIS) haveample availability and have been extensively used to studyland surface since 2000

Gross primary production can be represented by a widerange of models ranging in complexity from simple regres-sion based on climatic forcing variables to complex mod-els that simulate biophysical and ecophysiological processes(Anav et al 2015) The MODIS MOD17 product uses aphotosynthetic radiation conversion efficiency model (Run-ning and Zhao 2015) but a better relationship is reportedwith EC-derived GPP when the model uses vegetation in-dices calculated from the same MODIS platform (Ma et al2014 Wu et al 2010) Although they are black box modelsin principle recent modeling efforts report good agreementof GPP estimates obtained from machine learning (ML) algo-rithms or ensembles of models (Eshel et al 2019 Joiner andYoshida 2020 Jung et al 2020) Different machine learningalgorithms are powerful because they can identify trends andpatterns in big datasets and solve regression or classificationproblems

To generate models of GPP we measured EC fluxes dur-ing 2017ndash2018 in a thorny scrub with semiarid climate inthe highlands of Mexico (Bernal site) Competing mod-els were data-driven machine learning regression ensembles(EMLs) and ordinary-least-squares (OLS) regression bothusing Daymet (Thornton et al 2017) and MODIS datasets asexplanatory variables The MODIS GPP product was used asa baseline comparison The second step was to use an EMLmodel based on local data (Daymet and MODIS) from a sitewith EC instrumentation and similar vegetation to that of theBernalrsquos site and then use that model to predict GPP at theBernal site The site we used was Santa Rita from the Ameri-flux network While Santa Rita is in the Sonoran Desert andBernal is on the southern border of the Chihuahuan Desertboth have a similar climate and vegetation (Fig 1) A goodagreement between Bernal EC data and the predictions fromthe Santa Rita model would support the use of machine learn-ing algorithms as a scale-up mechanism This would be use-ful to the understanding of semiarid ecosystems and also im-prove current earth system models (Piao et al 2019)

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 369

We measured the carbon flux for the vegetation in a semi-arid site located at Bernal Mexico The phenology patternsin the region suggested that this site could be a carbon sinkduring the wet season or a carbon source during the dryseason since some predominant species reproduce duringwinter and spring particularly cacti Acacia and Prosopis(mesquite) Furthermore the Bernal site had a history of dis-turbance by overgrazing this could decrease the GPP andeven result in a positive carbon balance thus being a car-bon source If the shrub vegetation in this site predomi-nantly absorbed carbon during the measuring period thenthis evidence would contribute to reinforce the reported im-portance of semiarid environments in the global carbon bal-ance (Zhang et al 2020) However land ownership patternsand the balance between agricultural investments and landconservation will determine the absolute amount of carbonsequestered Hopefully the results of the present investiga-tion would promote the idea that carbon sequestration is pos-sible in scrubland and be incorporated in informed decisionsand new policy

2 Materials and methods

21 Site description

The study site (Bernal) is located at 20717 N 99941Wand 2050 masl in the municipality of Ezequiel Montes inQuereacutetaro where real-estate development feedlot beef pro-duction cheese and wine production associated with tourismand automotive-industry development are very attractive op-tions for landowners in the region Bernal is located in a shal-low valley oriented from north to south approximately 15 to20 km wide and opening to the south to the Riacuteo Lerma basinand then draining into the Pacific Ocean The northern limitof the valley is surrounded by hill country and its characteris-tic dacitic dome 433 m in height (Aguirre-Diacuteaz et al 2013)Moisture-laden winds blow westward from the Gulf of Mex-ico but the Sierra Gorda located 60 km east of Bernal castsa rain shadow over the area (Segerstrom 1961)

The Bernal site is private property with grazing dairy cat-tle receiving additional concentrated feedstuffs under stall-feeding Grazing was continuous and water for livestock wasonly available in the feeding and milking area there were nopasture divisions and the perimeter fence was made of stoneThese characteristics of the animal production model and thestate of vegetation are representative of land managementpractices and the scrub vegetation of the region Howeverthe Bernal site suffered important changes in land use during2019 and the scrub was suddenly cleared and converted intorainfed cropping

The climate is arid with summer rains (BSk) with a meanannual rainfall of 476 mm and a mean annual temperature of171 C (CICESE 2015) Prevailing wind is from the eastand northeast The terrain is mostly flat most grades are

below 2 The soil has a clay loam texture the class is aVertisol with abundant subrounded basaltic stones withoutrocky outcrops and the depth is greater than 06 m Vegeta-tion was less than 3 m in height with an overgrazed herba-ceous stratum Vegetation corresponds to secondary scrubwith the dominant genera Acacia Prosopis and differentCacti (Fig 3) This site was classified as grassland by theMODIS land cover product

For the scrub and tree species the importance vegetationindex (IVI) was determined following Curtis and McIntosh(1950) to assess the vegetation homogeneity The IVI is thesum of relative dominance relative density and relative fre-quency of the species present Vegetation sample points werechosen according to the flux footprint of the eddy covari-ance tower (Fig 2) For each plant in the vegetation samplepoints two stem diameters the number of individuals (abun-dance) and identity of each species were measured as well asthe coverage which is the horizontal projection of the aerialparts of the individuals on the ground expressed as a per-centage of the total area (Wilson 2011)

22 Eddy covariance measurements

The micrometeorological EC technique measures at the plantcommunity level NEE in a nondestructive way and con-tinuously over time (Baldocchi 2014) The negative CO2fluxes corresponded to NEE which is equivalent to NEP (netecosystem production) but with opposite sign The EC hasadvantages compared to other techniques that need to scaleup measurements from the leaf plant or soil levels up toecosystems especially when the vegetation is heterogeneous(Yepez et al 2003) However EC is an expensive techniqueand data analysis and processing are complicated also spe-cific assumptions must be met regarding the terrain vege-tation and micrometeorological conditions among other as-pects (Richardson et al 2019)

The fluxes were measured with the EC technique at aheight of 6 m with the following instruments a Biomet sys-tem (LI-COR Biosciences USA) to measure H2O and CO2fluxes using an IRGASON-EC-150 open-circuit analyzer aCSAT3 sonic anemometer and a KH20 krypton hygrome-ter these were connected to a CR3000 data logger (Camp-bell Scientific Inc Logan UT USA) The relative humid-ity and air temperature were measured with an HMP155Aprobe (Vaisala Corporation Helsinki Finland) net radiationwas measured with an NR-Lite2 radiometer (Kipp and Zo-nen BV Delft the Netherlands) and the photosynthetic ac-tive radiation (PAR) was measured with a quantum sensor(SKP215 Skye Instruments Llandrindod Wells UK) Mea-surements of the soil heat flux was implemented with fourself-calibrating HFP01SC plates at 80 mm depth and in fourrepresentative positions of the landscape (Hukseflux Ther-mal Sensors BV Delft The Netherlands) Three time do-main reflectometry (TDR) probes (CS616) measured volu-metric water content in the ground installed vertically and

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

370 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 1 Localization and land use maps for the study site (a) biogeographic arid and semiarid zones of southern North America relevantfor the Bernal and Santa Rita sites the black outline represents the state limit for Quereacutetaro (b) Heterogeneity of the normalized differencevegetation index (NDVI) surrounding the EC tower at Bernal during the peak of the growing season 2017 (DOY 257) (c) Land cover in theregion of the Bernal site according to the annual University of Maryland (UMD) classification (MCD12 MODIS product)

two sets of TCAV (averaging soil thermocouple) probes mea-sured the temperature at 60 and 40 mm depths and abovethe HFP01SC plates (Campbell Scientific Inc Logan UTUSA) The TE525 (Texas Electronics Dallas TX USA)tipping-bucket rain gauge was installed at 12 m height and3 m away from the tower All these meteorological variableswere measured every 5 s and average values were stored ev-ery 30 min rainfall was accumulated for the same time inter-val Sensible (H ) and latent (λE) heat fluxes were calculatedby the EddyPro package (LI-COR Biosciences USA)

23 Flux data processing

All EC data were collected at 10 MHz in the data loggerand reported as micromoles of CO2 per square meter persecond and processed with the EddyPro package to con-vert values into average fluxes of 30 min intervals Only

quality-flagged records were used to account for the CO2flux (qc_co2_flux= 0) according to the Mauder and Foken(2011) policy in the EddyPro program (LI-COR 2019)

However this quality-checking is not sufficient especiallyin the case of CO2 therefore data were postprocessed usingthe REddyProc package of R (R Development Core Team2009) to estimate the friction speed thresholds (ulowast) gap-filldata and partition the NEE flux into its GPP and Reco com-ponents (Wutzler et al 2018) The filled-in estimates of NEE(NEE_uStar_f) GPP (GPP_uStar_f) and Reco (Reco_uStar)were used when the ulowast was lower than a ulowast annual thresholdabove which nighttime fluxes are considered valid The an-nual ulowast threshold was 0193 and 0194 for 2017 and 2018The difference between these thresholds and the 95 ulowast

threshold was small (0033 msminus1) Appendix A presents thethreshold means and confidence intervals calculated in theREddyProc package Only 1050 half-hour records (93 )

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 371

Figure 2 Source strength-weighted function at the Bernal site during 2017 and 2018 (gray points) Horizontally plots show the peak ofthe function and increasing percentages of the flux footprint The radial scale is along-wind distance from the sensor (m) The measurementheight was 6 m The blue points represent the vegetation sample plots used to calculate the importance vegetation index (IVI)

Figure 3 Thorny scrub at Bernal Quereacutetaro during the rainy season 2017 In the foreground is Cylindropuntia imbricata a very thornycactus shrubs in the background are Prosopis laevigata mesquites

had a ulowast below the annual mean ulowast threshold The data with aflag equal to 0 were used for the variable NEE_uStar_fqc asdefined by REddyProc Carbon dioxide flux data were time-integrated and converted to grams of C per square meter perday using the molar ratio of C We only reported the contin-uous measurements of the exchange of CO2 for the periodof April 2017 (DOY 89) to August 2018 (DOY 234) usingthe EC technique Due to equipment malfunction and incom-plete datasets some periods of time were not considered Themeasurement campaign presented here was not biased by wet

winters since both years were characterized by a less-than-weak NintildeondashNintildea

24 Remotely sensed data

Data were requested to the National Aeronautics andSpace Administration (NASA) via the Land Processes Dis-tributed Active Archive Center (DAAC) using the Ap-plication for Extracting and Exploring Analysis ReadySamples (AppEEARS) to obtain spatial and temporal

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

372 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

subsets for the Bernal and Santa Rita sites includ-ing daily surface reflectance (MOD09GA006 and MOD-OCGA006) daily daytime and nighttime land surface tem-perature (LST) (MOD11A2006 and MYD11A2006) 8 dleaf area index (LAI) and fraction of photosynthetically ac-tive radiation (FPAR) (MOD15A2H006 MYD15A2H006MCD15A2H006) 16 d enhanced vegetation index (EVI)(MYD13A1006) 16 d gross primary production (GPP)and net photosynthesis (PsnNet) (MOD17A2H006) TheAppEEARS also unpacks and interprets the quality layersAppendix B presents the details of each spectral band ofMODIS Data with less-than-good quality flags were deletedMissing data were filled with splines and a database witha 1 d time step was generated This would smooth lin-ear temporalndashphenological evolution between any two suc-cessive remotely sensed data points (Eshel et al 2019)Daily accumulated rainfall was requested using the Giovanniplatform from the Goddard Space Flight Center (GSFC)DAAC two products were used the 3IMERGDF006 fromthe Global Precipitation Measurement mission (GPM) andthe 3B42007 from the Tropical Rainfall Measuring mis-sion (TRMM) Gridded weather parameters from the ORNLDAAC Daymet dataset were precipitation shortwave radia-tion maximum and minimum air temperature and water va-por pressure Daymet is a data product derived from a collec-tion of algorithms interpolating and extrapolating daily me-teorological observations (Thornton et al 2017) FollowingHenrich et al (2012) and Hill et al (2006) daily reflectancebands of MODIS were used to compute several vegetation in-dices redndashgreen ratio index (RGRI) simple ratio (SimpleR)moisture stress (MoistS) disease stress index (DSI) greenatmospherically resistant vegetation index (GARI) normal-ized difference vegetation index (NDVI) normalized differ-ence water index (NDVI_w) and enhanced vegetation in-dex (EVI) the corresponding equations are presented in Ap-pendix B

25 MODIS algorithm for GPP

Estimates of GPP are derived from data recorded by theMODIS sensor aboard the Terra and Aqua satellites The ef-ficiency (ε gCMJminus1) with which vegetation produces drymatter is defined as the amount of solar energy stored by pho-tosynthesis in a given period divided by the solar constant in-tegrated over the same period (Monteith 1972) Not all inci-dent solar radiation is available for biomass conversion onlyabout 48 is photosynthetically active (PAR MJmminus2) andnot all PAR is absorbed (Zhu et al 2008) Thus carbon ex-change is mainly controlled by the amount of PAR absorbedby green vegetation (APAR) and modified by ε (Gitelson etal 2015) The fraction of absorbed PAR (FPAR) is equalto APARPAR but can be represented by the NDVI spectralvegetation index produced by MODIS (Running 2004) Theefficiency term ε is described as the product of different fac-tors as a whole or as a part of the system (Monteith 1972) but

mostly those related to the efficiencies with which the vegeta-tion intercepts the radiation and the efficiency of convertingthe intercepted radiation into biomass (Long et al 2015)The MODIS algorithm that estimates GPP in the MOD17product is (Running et al 2004)

GPP= εtimesFPARtimesPAR (1)GPP= εtimesNDVItimesPAR (2)

The ε term in the MODIS algorithm is represented by amaximum radiation conversion efficiency (εmax kgCMJminus1)that is attenuated by suboptimal climatic conditions mainlyminimum air temperature (Tmin) and vapor pressure deficit(VPD) Two parameters for each Tmin and VPD are usedto define attenuation scalars for general biome types Theseparameters form linear functions between the scalars (Run-ning and Zhao 2015 Wang et al 2013) daily minimumtemperature at which ε = εmax and at which ε = 0 and thedaylight average VPD at which ε = εmax and at which ε =0 GPP is truncated on days when air temperature is be-low 0 C or VPD is higher than 2000 Pa (Running andZhao 2015) Stress and nutrient constraints on vegetationgrowth are quantified by the limiting relation of leaf areain NDVItimesPAR rather than constrained through ε (Runninget al 2004) However the MODIS algorithm does not con-sider stomatal sensitivity to leaf-to-air vapor pressure acrossand within species and particularly between isohydric andanisohydric plant species (Grossiord et al 2020)

ε = εmaxtimes Tmin scalartimesVPD_scalar (3)

The MOD17 userrsquos guide presents a biome-property look-up table (BPLUT) with the parameters for each biome typeand assumes that they do not vary with space or time (Run-ning and Zhao 2015) This aspect is important because εmaxhas the strongest impact on the predicted GPP of the MOD17algorithm (Wang et al 2013) The assumption is also impor-tant because the overstory and understory could be decou-pled from each other and would intercept different amountsof light and have different water sources during the grow-ing season (Scott et al 2003) Light quantity and quality asdiffuse light or sunflecks determine differences among un-derstory species in their temporal response to gaps involvingacclimation and avoidance of photoinhibition (Pearcy 2007)Another shortcoming is that few land cover classifications areincorporated into the MOD17 algorithm

26 Santa Rita site dataset

The Santa Rita Experimental Range (SRER) is located inthe western range of the Santa Rita Mountains in ArizonaUSA (lat 318214 longminus1108661 1120 masl) Climate isBSk with mean annual precipitation of 380 mm temperatureof 179 C and Ustic Torrifluvent soils Established in 1903SRER has a long history of experimental manipulations toenhance grazing potential for cattle (Glenn et al 2015) Two

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 373

Figure 4 Semidesert grassland encroached by mesquite (Prosopisvelutina) at Santa Rita Arizona (US-SRM) Image credit RussellScott 9 December 2016

Ameriflux sites are located in the SRER Santa Rita Grass-land (US-SRG) and Santa Rita Mezquite (US-SRM) Weused EC data for the years 2013ndash2019 from US-SRM whichis a mesquite grass savanna (35 mesquite canopy coverand mean canopy height above 2 m 22 grasses and 43 bare soil) although MODIS describes this site as open shrub-lands (Glenn et al 2015 Scott et al 2004) The US-SRMsite is dominated by velvet mesquite (Prosopis velutina) andhas a diversity of shrubs cacti succulents and bunch grasses(McClaran 2003) This site was chosen because the vegeta-tion and climate are similar to the Bernal site and it was theclosest EC instrumentation with data availability (Fig 4)

27 Modeling

Gross primary production estimated by EC at the Bernal sitewas modeled using OLS and EML The explanatory vari-ables were the remotely sensed data the weather parame-ters and the vegetation indices (Appendix B) The OLS isa particular case of the generalized linear model where thevariation of a single response variable is explained by severalindependent variables The OLS was fitted with the stepwiseprocedure the final model included variables with a variance

inflation factor (VIF) lower than 10 and a significance levelof 005 Predictions of the EC GPP were obtained with the fi-nal model Analysis and diagnostics were made with Minitabv 17 (Minitab LLC) Analysis of the OLS model can be usedto determine agreement between methods of measurementbut it is sensitive to the range of values in the dataset andits metrics ndash r r2 and root mean square ndash do not provideinformation on the type of association (Bland and Altman2010) In this paper we used another metric the probabil-ity of agreement to determine bias and agreement betweenmodel estimates and observed data (see below)

While OLS is a well-known algorithm machine learningalgorithms are emerging techniques that focus on the datastructure and match that data onto models The EML ap-proach considers the different realizations of machine learn-ing models and constructs an ensemble of models com-ing with the advantage of being more accurate than thepredictions from the individual ensemble members How-ever EML is computationally intensive requiring nodeswith purpose-built hardware such as multiple processors orreduced-precision accelerators The nodes could be aggre-gated in computing clusters which require storage powerand cooling redundancy

A stack of EMLs was obtained with the H2O packageof R (Hall et al 2019) This package provides several al-gorithms that can contribute to a stack of ensembles usingthe AutoML (automatic machine learning) function feedfor-ward artificial neural network (DL) general linear models(GLMs) gradient-boosting machine (GBM) extreme gra-dient boosting (XGBoost) default distributed random for-est (DRF) and extremely randomized trees (XRT) AutoMLtrains two stacked-ensemble models one ensemble containsall the models and the second ensemble contains just thebest-performing model from each algorithm class or fam-ily both ensembles should produce better models than anyindividual model from the AutoML run The term AutoMLimplies data preprocessing normalization feature engineer-ing model selection hyperparameter optimization and pre-diction analysis including procedures to identify and dealwith nonindependent and identically distributed observationsand overfitting (Michailidis 2018 Truong et al 2019)

Machine learning has two elements for supervised learn-ing training loss and regularization The task of training triesto find the best parameters for the model while minimizingthe training loss function the mean squared error for exam-ple The regularization term controls the complexity of themodel helping to reduce overfitting Overfitting becomes ap-parent when the model performs accurately during the train-ing but the accuracy is low during the testing A good modelneeds extensive parameter tuning by running the algorithmmany times to explore the effect on regularization and cross-validation accuracy (Mitchell and Frank 2017) In this inves-tigation the function of training loss was the deviance whichis a generalization of the residual sum of squares driven bythe likelihood Deviance is a measure of model fit and lower

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

374 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

or negative values indicate better model performance (McEl-reath 2020)

A stack of EML solutions was based on a random sam-ple of the dataset for training the model For the Bernal site85 was used for training for US-SRM 80 was usedThe AutoML function was run 20 times each run added ap-proximately 48 models to the leader board and ranked thebest-performing models by their deviance Each run splitsthe training data 10 times for k-fold cross-validation Theseed for an EML that is dependent on randomization waschanged in every run The stopping rule for each run wasset at 100 s and the maximum memory allocation pool forH2O was 100 GB in a single workstation with dual Xeon2680 v4 processors and 128 GB of RAM The H2O packagewas installed in a rockergeospatial docker container (imageavailable at httpshubdockercomrrockergeospatial lastaccess 9 January 2021) which is a portable scalable and re-producible environment (Boettiger and Eddelbuettel 2017)

Two sets of predictions for the GPP at the Bernal site wereobtained from the stacked ensemble The first set of predic-tions was based on the 15 of the Bernal site data reservedfor testing The second set of predictions was obtained byrefeeding the US-SRM site model with the Bernal site ex-planatory variables The first set of predictions would showthe importance of local data to predict EC-based GPP Thesecond set of predictions would represent the suitability ofoff-site data to predict EC-based GPP If the second scenariohas good agreement then an EML model could be used torepresent wider areas of the ecosystem

28 Variable importance

The variable importance within individual models was usedto answer the question of which environmental variableswere important for GPP prediction For the ldquoall-models en-semblerdquo and ldquobest-of-family ensemblerdquo generated by Au-toML it is not possible to examine the variable importance orthe contribution of the individual models to the stack (H2Oai2017) Therefore a weight (wi)was calculated using Eq (4)which is adequate for other information criteria besides theAkaike weights this weight is an estimate of the conditionalprobability that the model will make the best predictions onnew data considering the set of models (McElreath 2020)Then the importance of each variable () was multipliedby the modelrsquos weight (wi) and then added by variable tobuild the variable importance index This index would mea-sure how often a given variable was used on the leader board

wi =exp

(minus

12 dWAICi

)summj=1 exp

(minus

12 dWAICi

) (4)

dWAIC= deviancei minus deviance of top-performing model(5)

29 Model agreement

Calibration and agreement between methods of measure-ment are different procedures Calibration compares knownquantities of the true value or measurements made bya highly accurate method (a gold standard) against themeasurements of a new or a contending method When twomethods of measurement are compared neither providesan unequivocally correct measurement because both havea measurement error and the true value remains unknown(Bland and Altman 2010) Stevens et al (2015) proposes theprobability of agreement (θs) as a plot metric to representthe agreement between two measurement systems acrossa range of plausible values The θs method addressessome of the challenges of the accepted ldquolimits agreementmethodrdquo presented by Bland and Altman (2010) Besides theagreement plot agreement is based on maximum likelihoodbias parameters α and β quantifying the fixed bias andthe proportional bias If α = 0 β = 1 and σ1= σ2 thenthe two measurements are identical σj is the measurementvariation The probability-of-agreement analysis was per-formed using the ProbAgreeAnalysis (httpsuwaterloocabusiness-and-industrial-statistics-research-groupsoftwarelast access 28 November 2020) in MATLAB 94 (Math-Works Inc) An arbitrary 1 gCmminus2 dminus1 was considered tobe a tolerable magnitude to conclude that agreement is suf-ficient enough to use either estimated GPP interchangeablyThe reference measurement was the GPP obtained fromEC data at the Bernal site and tested against the MODISMOD17 model the OLS predictions or each of the two setsof EML predictions If the probability-of-agreement plotsuggested disagreement between two measurement systemsthen the predictions can be adjusted using

Adjusted predictor= (predictorminusα)β (6)

3 Results

31 Eddy covariance fluxes at Bernal

Dominant flow at the Bernal site was from the northeast(Fig 2) Energy balance closure for this site had a slopeof 072 and r2

= 092 (Fig 5) Homogeneous sites of theFluxnet network obtain closure percentages higher than72 and for the Bernal site the vegetation heterogeneitywas important (see below) Average H was always nega-tive during nighttime but during some months of the dryand rainy season λE was positive particularly after dawn(Fig 6) so as to allow nighttime evaporation from the soil orvegetation However during some months of the dry seasonrainfall was low (Fig 7) then the positive λE suggested thatcacti could have an active gas exchange at that time

Carbon dioxide absorptions had a diurnal behavior begin-ning at dawn and ending before sunset (Fig 8) Nighttimeflux was positive indicating respiration notwithstanding the

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 375

Figure 5 Closure of the surface energy balance from eddy covari-ance measurements averaged at 30 min between the turbulent fluxes(H + λE) and available fluxes (Rn-G) Data are from 30 March2017 to 22 August 2018 at the Bernal site The regression wasy = 2302+072times with adjusted r2

= 092 The diagonal line rep-resents the 1 1 relation

presence of cacti Although summer rains are characteristicof the climate at the Bernal site (Fig 7) a negative NEE fluxoccurred in all measured months The lowest CO2 flux wasrecorded in January and February 2017 and in May 2018 (Ta-ble 1) This behavior resulted from the phenology of the veg-etation since most species lost their leaves in the dry seasonand was also due to the effect of low temperature Withinthe rainy season the flux of CO2 increased compared to themonths of January to June The correlation between NEE andprecipitation was minus045 When the sum of the precipitationof the current month and that of the previous month was con-sidered the correlation with NEE was minus07 suggesting thatcontinuous availability of soil moisture is important for theabsorption of CO2 in this environment

The scrub at Bernal was heterogeneous in botanical com-position A total of 24 species of cacti and shrub wereidentified on average each sampling plot had 103 speciesThe IVI was similar between all cacti (036plusmn 004) shrublegumes (038plusmn004) and other shrubs (023plusmn006) sampledMost sampling plots were in areas of high flux frequency(Fig 2) Cylindropuntia imbricata had the largest IVI fol-lowed by Acacia farnesiana Acacia schaffneri and Prosopislaevigata The IVI of the herbaceous stratum representedby grasses was not characterized due to the state of over-grazing and the absence of reproductive structures in plantswhich made measurement of their abundances frequenciesand dominance difficult The grass genera present were Meli-nis Chloris Cynodon and Cenchrus all corresponding to in-

Table 1 Daily average values of the net ecosystem exchange(NEE) gross primary productivity (GPP) and ecosystem respira-tion (Reco) in a scrub at the Bernal site Negative values of NEEindicate photosynthetic absorption

NEE GPP RecomicromolCO2 mminus2 sminus1

2017

JANFEBMARAPR minus054 248 194MAY 005 286 291JUN 038 421 459JUL minus100 178 077AUGSEPOCT minus126 525 399NOV minus013 265 252DEC 004 180 184

2018

JAN minus005 129 124FEB 006 188 194MAR minus094 245 151APR minus058 287 229MAY minus029 377 348JUN minus252 433 181JUL minus283 823 541AUG minus193 921 728

vasive C4 tropical grasses Scrub species of higher IVI hada similar LAI (12) although the magnitude of the LAI of Plaevigata stood out (Table 2)

32 Machine learning ensembles as predictors of eddycovariance GPP

In this section we describe the modeling with EML using lo-cal remotely sensed data from the Bernal site to predict GPPat the same site and then the agreement between EML GPPpredictions and EC-derived GPP The AutoML function gen-erated 1031 models with an average deviance of 135 whilethe deviance of the leader model was 063 in the trainingdataset (Table 3) A total of 11 models of type GBM andfive XGBoost models were in the top 30 models along withthe nine best-of-family ensembles and five all-models ensem-bles The weighted variable importance on the leader boardwas higher for the LAI from the MOD15 and MCD15 prod-ucts (17 and 14 ) The PsnNet EVI (MOD17) FPAR(MCD15) green atmospherically resistant vegetation index(GARI) and MODIS reflectance band 13 had an importancehigher than 3 (59 54 42 36 and 30 ) LAI(MCD13 and MOD13) PsnNet and the FPAR (MOD15)were the more important variables (20 17 13 and10 ) in the top nonstacked model a GBM model that wasranked in fourth place

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

376 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 6 Latent heat flux daily trend at Bernal during different months emphasizing nighttime λE Panel (a) shows months with low rainfallin the previous month and predominantly negative λE during nighttime (these months had low rainfall 017 113 and 033 mm rainfall forJanuary March and December) Panel (b) shows months with low rainfall in the previous month and positive λE after sunset (67 73 and207 mm rainfall for February April and May) Panel (c) shows months during the rainy season with positive λE mainly due to soil wetnessand antecedent rainfall of 43 202 9 190 and 34 mm for June July August October and November Plot (c) is out of scale on the y axis forcompatibility with the other plots

Table 2 Importance value index (IVI) and leaf area index (LAI) of the main species present at the Bernal Quereacutetaro study site

Species Plant type IVI SEMa LAI SEM

Coryphantha cornifera Cactus 007 027Bouvardia ternifolia Herb 007 027Karwinskia humboldtiana Shrub 007 027Forestiera phillyreoides Shrub 009 027Ferocactus latispinus Cactus 009 027Cylindropuntia leptocaulis Cactus 009 027Asphodelus fistulosus Shrub 009 019Brickellia veronicifolia Shrub 010 027Dalea lutea Shrub 011 015Eysenhardtia polystachya Legume 013 015Myrtillocactus geometrizans Cactus 014 027Schinus molle Shrub 014 019Jatropha dioica Herb 015 019Mammillaria uncinata Cactus 016 012Opuntia tomentosa Cactus 017 011Opuntia robusta Cactus 023 007Opuntia hyptiacantha Cactus 026 007Mimosa monancistra Legume 028 012Mimosa depauperata Legume 031 012Zaluzania augusta Shrub 033 010Viguiera linearis Herb 036 011Acacia schaffneri Legume 041 007 113 015Prosopis laevigata Legume 041 007 148 012Acacia farnesiana Legume 056 009 112 037Cylindropuntia imbricata Cactus 074 007 113 011

a SEM standard error of the mean

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 377

Figure 7 Monthly rainfall and land surface temperature (LST) dur-ing years (a) 2017 and (b) 2018 at the Bernal site The LST valuescorrespond to the 1330 (LST day) and 0130 (LST night) MODISAqua satellite overpasses

Predictions of GPP in the testing dataset showed disper-sion in the lower range of the scale of measurement andthe correlation was 094 (Fig 9a) The final prediction ofGPP for the whole dataset had a probability of agreement(θs) of 058plusmn 001 (parameter estimate and standard error)α =minus00616plusmn 011 and β = 10133plusmn 002 suggesting agood fit with low fixed and proportional bias (Fig 9b) Theprobability of agreement decreased slightly at the lower andupper range of the scale of measurement (Fig 9c) indicatingthat the EML model would predict GPP without increasingthe bias particularly in the range from 0 to 4 gCmminus2 dminus1However the value of θs should be higher than 095 so as toconsider EC measurements and EML to be interchangeableThe correlation of 099 (r2

= 098) for the data in Fig 9bcould be misleading as it would suggest a very good fit

Using only the five of the more important variables namedabove to generate an EML resulted in an XBoost leadermodel with 273 deviance and a total of 1094 models Thetop 30 models were 16 XBoost and GBM models and the

best-of-family ensembles started to show up in 12th placeAlthough the number of runs was the same (20) the AutoMLfunction increased the number of produced models but thesmaller set of explanatory variables constrained the ability toidentify features contributing to better models Using anotherset of five randomly selected explanatory variables (one veg-etation index and four MODIS bands) resulted in a leadermodel with 252 deviance out of 1024 models but this timethe leader was the best-of-the-family ensemble Using only afew variables was considered to increase the deviance com-pared to the average deviance of 135 obtained during thetraining phase and using all available variables

An important question for modeling upscaling is the ca-pacity to extrapolate results temporally and spatially herewe explored the latter posing the following question wouldpredictions of GPP from EML for an EC site agree withEC observations from another site with ldquosimilarrdquo environ-mental conditions First an EML solution was found train-ing 80 of the Santa Rita dataset and obtaining a best-of-family ensemble with 023 deviance out of 634 trainedmodels (hereafter this model is referred to as the Santa Ritamodel) Then the environmental and remotely sensed datafrom the Bernal site were fed into the Santa Rita model thiswould be an external validation dataset However agreementwas not good the mean value of θs was 015plusmn 001 withα =minus10822plusmn 009 and β = 058127plusmn 002 The value ofθs was not constant across the range of measurement and de-creased rapidly after 2 gCmminus2 dminus1 (Fig 10b) Because thebias was important predictions were adjusted using Eq (6)showing some improvement with r = 078 (Fig 10a) Com-paring Figs 5b and 6a it is evident that an EML modelextrapolation to other conditions is noisier ie Santa Ritamodel trying to represent the ecosystem function at BernalNotwithstanding some of the most important variables wereshared by both EML ensembles Bernal and Santa Rita in thecase of Santa Rita LAI from MOD15 MYD15 and MCD15had 350 48 and 31 of variable importance and theFPAR from MOD15 was 12

33 MODIS as predictor of eddy covariance GPP

MODIS is important because it overpasses every point ofthe earth every 1 or 2 d and it implements a GPP prod-uct (MOD17) that has helped track the response of the bio-sphere to the environment since 2000 The product MOD17has been validated against many EC sites but few validationsites correspond to deserts and semideserts (Running et al2004) The GPP MOD17 underestimated the GPP derivedfrom EC data at Bernal (Fig 11a) In a similar bell-shapeddistribution of θs as in the case of the extrapolation of theSanta Rita site (Fig 10b) here the θs was not constant acrossthe range of measurement mean θs was 024plusmn 013 withα = 000047plusmn0087 and β = 048749plusmn002 (Fig 11c) Ad-justing MOD17 estimates with Eq (6) improved the relation-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

378 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 8 Net ecosystem exchange (NEE) and photosynthetic active radiation (PAR) at the Bernal site in (a) 2017 and (b) 2018 Negativevalues in the CO2 flux indicate photosynthesis The gray shadow is the standard error of the mean for each month at any given hour

Table 3 Leader board of EML models for the Bernal site using 85 of day observations as training dataset NA denotes the outcome wherethe type of model was not present in the 30 top-performing models according to deviance

Type of model Number of models Average deviance

All models Top models Leader model

Stacked ensembleAll models 20 0852 0834Best of family 20 0812 0703 0633GBM 453 1366 0796DRF 20 1108 NAXGBoost 277 1385 0778XRT 20 1077 NADeep learning 201 2072 NAGLM 20 1474 NATotal 1031 1356

ship but note that the value of r (076) was the same for theoriginal MOD17 and the adjusted MOD17 (Fig 11b)

34 Prediction eddy covariance GPP withordinary-least-squares multiple regression

OLS is a common estimation method for linear models andhere this model appeared as adequate judging by the gen-eral distribution of predictions (Fig 12a) and the probability-of-agreement plot (Fig 12b) A total of 14 variables wereincluded in the model all of them with VIF values lowerthan 70 (Appendix C) the VIF statistic quantifies the sever-ity of multicollinearity and an acceptable threshold is 10The most significant variables were the EVI from MYD13and Daymet variables precipitation shortwave radiation and

minimum and maximum temperatures (see Appendix B forvariable details) Variables with high coefficient values wereMODIS reflectance band 14 (917) the EVI from MYD13(853) and the NDVI (39) while Daymet temperatures hadsmall coefficientsminus023 for maximum temperature and 017for minimum temperature The θs decreased slightly at theend of the measurement range mean θs was 05plusmn0014 withα = 018845plusmn0137 and β = 094966plusmn0031No correctionfor this model was calculated since α was close to 0 β wasclose to 1 and the EML model had a higher θs

35 Agreement

The probability of agreement (θs) was the statistic to deter-mine if the mean responses were in agreement Comparing

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 379

Figure 9 Agreement between predictions of GPP obtained with machine learning algorithms or derived from eddy covariance measurementsat the Bernal site (a) Example of one run of predictions of GPP in the test dataset from the Bernal site using the leader model of an ensembleof machine learning algorithms (EML) the test dataset was 15 of data with r = 094 (b) Predictions for the complete dataset withr = 099 the diagonal line is the 1 1 agreement (c) Function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreementbetween methods of estimation of GPP corresponding to plot (b) The horizontal axis (s) represents the magnitude of the measurement thevertical axis is the probability of agreement for the measurement and the red line is the confidence interval p lt 005

Figure 10 (a) Adjusted predictions of GPP for the complete dataset from the Bernal site using the leader model of the final ensembleof machine learning algorithms (EML) derived from the Santa Rita site compared to estimates of GPP from EC data with r = 078 (b)Respective function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreement between methods of estimation of GPP ECdata from the Bernal site and EML model for the Santa Rita site The horizontal axis (s) represents the magnitude of the measurement thevertical axis is the probability of agreement for the measurement and the red line is the confidence interval p lt 005

the confidence intervals (CIs) for θs the best modeling ap-proach was the EML because its CI (056ndash059) was differentfrom that of the OLS model (047ndash052) More importantlyfor the EML and OLS models the values of θs had little vari-ation across the range of GPP estimates However the CIsof the EML and OLS models were similar in their bias es-timates α and β and both models had no bias (p lt 005)Altogether the best result was the EML ensemble using en-vironmental and remotely sensed data corresponding to thesame site ie Bernal (Fig 10c) This kind of EML wouldbe useful for gap filling or the evaluation of GPP time se-ries of the site that generated the model Machine learningalgorithms can fill gaps longer than 30 d (Kang et al 2019)

The second option to estimate GPP using the samedatasets was the multiple-regression OLS model (Fig 12a)The multiple regression is straightforward and here multi-collinearity was not a problem The EML ensemble and theOLS regression have the highest values of θs (058 and 05respectively Fig 12b) Higher values for θs are desirable(gt 095) and this could be achieved by increasing the samplesize relaxing the tolerable magnitude for agreement (here itwas set at 1 gCmminus2 dminus1) or perhaps using different forcingvariables

The MODIS estimates were a third-best alternative sincethe mean θs was 024 In the present study we used a splineto fill the data to a daily time step since the MCD17 is an8 d composite product but a similar result was obtained if

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

380 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 11 (a) MOD17 GPP daily time step from the Bernal site versus estimates of GPP from EC data using the complete dataset at adaily time step derived from spline for MOD17 GPP (copy) with r = 076 or 8 d composite estimates obtained by 8 d averages of EC GPP() with r = 076 (b) Adjusted predictions of GPP from MODIS (c) Corresponding function of probability of agreement for Fig 7a using1 g Cmminus2 dminus1 as a tolerable agreement between methods of estimation of GPP EC data from the Bernal site and MODIS MOD17 Thehorizontal axis (s) represents the magnitude of the measurement The vertical axis is the probability of agreement for the measurement

Figure 12 (a) Ordinary-least-squares multiple-regression estimates of GPP for the complete dataset from the Bernal site versus estimates ofGPP from EC data with r = 082 (b) Respective function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreement betweenmethods of estimation of GPP EC data from the Bernal site and OLS multiple regression The horizontal axis (s) represents the magnitudeof the measurement The vertical axis is the probability of agreement for the measurement

the EC GPP was rescaled and compared to the original 8 dMCD17 data (Fig 11a) The GPP from MODIS was an un-derestimate of EC GPP and when the estimates were ad-justed (Eq 6 Fig 11b) the performance was not better thanthe OLS (model not shown) The MODIS land cover classi-fication represented this site as grassland (MCD12) and thiscould be another reason for the poor agreement besides theassumptions made in the MODIS algorithm regarding the εand εmax parameters and the response of vegetation to VPDAgreement of MODIS GPP is crucial because MODIS prod-ucts are frequently used in country-wide assessments of thecarbon cycle and can influence public policies

The model with least agreement resulted when the EMLensemble generated from the Santa Rita site was used to pre-dict GPP at Bernal Machine learning models can make pre-dictions but their usefulness decreases when they are usedoutside the context of where they were built while process-

based mechanistic models have this ability Although the Au-toML function in H2O is designed to protect against over-fitting using cross-validation runs (Michailidis 2018) in ourstudy the Santa Rita model could not be generalized to repre-sent the Bernal site GPP process probably because the vari-ables and features selected for Bernal or Santa Rita were dif-ferent during the AutoML workflow

The Santa Rita model was good at predicting GPP at thatsite with a deviance of the leader model of 023 while atBernal the deviance of the leader model was 063 (Table 3)indicating that the Santa Rita EML ensemble was at leastas good a model as the EML at Bernal (not shown) TheGPP time series for Santa Rita was about 4 times the sizeof the Bernal dataset and therefore the deviance was lowerHowever when the Santa Rita model was used with Bernaldata the mean θs was 016 indicating that the agreement wasinsufficient Eyeballing the predictions in Figs 9a 10b and

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 381

11a and their corresponding correlation values (078 076and 082 for adjusted EML Santa Rita adjusted MODIS andmultiple-regression OLS) it could be argued that these mod-els were comparable However their θs plots present a dif-ferent perspective

4 Discussion

41 Model agreement

The Santa Rita model had higher θs when it extrapolatedfor low GPP values for the Bernal site suggesting that theSanta Rita model had a better skill when predicting GPPclose to 0 and even negative GPP values (Fig 10b) Al-though the Bernal and Santa Rita sites had similar vegeta-tion and climate classifications they are more than 1600 kmapart and rainfall monthly distribution is different The GPPseasonal cycle at Bernal started in February and steadily in-creased to a maximum during July and August (Table 1)At Santa Rita the GPP was low from January until mid-July (lt 05 gCO2 mminus2 dminus1) and then increased sharply to amaximum level in mid-August (Joiner and Yoshida 2020)What matters eventually for machine learning methods ishow well the predictor space rather than geographic spaceis sampled (Jung et al 2020) Incorporating data from morehumid (semiarid) sites could improve the GPP predictions bya machine learning method Not only more EC sites but alsosites representing a water availability gradient would be im-portant for semiarid ecosystems representing long spells andthe influence of oceanic oscillations and monsoon rains

All the models presented here used transient data to rep-resent GPP specifically at the 1 d time step Besides theradiation-related variables two sources of rainfall were usedas forcing variables but evapotranspiration (ET) was notused The MOD16A2 version 6 is available as an 8 d gap-filled product and could be included in EML or OLS mod-els However a more general representation of the carboncycling could be achieved when including variables that rep-resent annual or seasonal timescales of soil water evapo-ration or precipitation (Scott and Biederman 2019) Scottet al (2015) suggest that real lags between precipitation andproductivity that may impart legacy effects may also be par-tially masked by using ET as ET more carefully tracks pro-ductivity when soil moisture storage is accessed The occur-rence of off-season rainfall dry spells and carryover effectscould be parameterized as windowed events of a given dura-tion A window would be a period with distinct time bound-aries the window allows the grouping of records with similarfeatures The effect of the window at a given time could berepresented as moving weights as the point in time in ques-tion is closer or farther from the window

In our study the metric to assess model agreement wasthe probability of agreement (θs) and their bias parame-ters Many other metrics can be used to evaluate model per-

formance such as the root mean square error r r2 or themodel efficiency factor (MEF) presented by Nash and Sut-cliffe (1970) In particular the MEF is a step in the Fluxnetdata processing pipeline (Pastorello et al 2020) A MEFvalue close to 1 represents a high correlation and lower bi-ases (Joiner and Yoshida 2020) Therefore the calculatedMEF for the Bernal site EML (098) would suggest very goodpredictive performance while the θs of 06 for this model in-dicates a more modest performance The θs for a particularvalue of the measurand is the probability that the differencebetween two measurements made by different systems fallswithin an interval that is deemed to be acceptable (Stevenset al 2015) In the present study we used 1 gCmminus2 dminus1 as acritical value defining an acceptable difference If this valueis smaller then the probability of agreement would decreasefor this same model In such a case it would be less likelythat the predictors agree considering that the θs takes intoaccount both the difference in the function means at a givenvalue of the measurand and the uncertainty in its estimation

42 MODIS discrepancies

Different authors have reported discrepancies betweenMCD17 and EC estimates of GPP in semiarid regions Ex-amining MODIS discrepancies in these ecosystems is im-portant because the errors induced by cloud cover are ex-pected to be minimal and other effects can be identified(Gebremichael and Barros 2006) The GPP of MOD17 didnot relate well (EC= 011+ 017MODIS r2

= 067) withestimates of EC GPP in semidesert vegetation of the Sahel(Tagesson et al 2017) With data from different types of veg-etation in the Heihe basin in China MODIS17 overestimatedthe GPP from EC (EC= 115+024MODIS r2

= 068 Cuiet al 2016) For scrub sites in Mexico the relation be-tween GPP calculated from EC and MOD17 was not good(MODIS= 38382+ 0467EC r2

= 06 Delgado-Balbuenaet al 2018) In arid and semiarid ecosystems in China op-timizing parameters of the MODIS GPP model with site-specific data improved the estimate to explain 91 of thevariation in the GPP of the data observed by EC (Wang et al2019) These same authors propose improving the land useclassification used by the MOD17 algorithm and recalibrat-ing light use efficiency parameters to solve the GPP estima-tion problem Gebremichael and Barros (2006) examined anopen shrubland site in a semiarid region of Sonora Mexicoand their analysis of the temporal evolution of the discrep-ancies with MODIS GPP suggested revisiting the light useefficiency parameterization especially the functional depen-dence on VPD and PAR and water stress or soil moistureavailability

The relationship between the GPP MODIS and the GPPEC presented in Sect 33 is an approximation because theuncertainty in the respiration component must be consid-ered The empirical relationship between nocturnal NEE andsoil temperature has been used to represent ecosystem res-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

382 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

piration (Reco) in order to separate the processes that con-tribute to daytime NEE (Richardson and Hollinger 2005Wofsy et al 1993) Nighttime NEE should be equal to therates of autotrophic and heterotrophic respiration while dur-ing daytime NEE should be equal to the combined rates ofcarboxylation and oxidation of Rubisco autotrophic respi-ration and heterotrophic respiration Then the GPP can becalculated as the difference between daytime NEE and Recoestimated through its relationship with temperature (Gouldenet al 1996) In the present study Reco was calculated basedon soil and air temperature following the procedure of Reich-stein et al (2005) implemented in REddyProc (Wutzler et al2018) Although it is possible to measure or model the parti-tion of respiration (Running et al 2004 Wang et al 2018)the presence of cacti complicates the calculation assum-ing that all nighttime flux represents ecosystem respiration(Owen et al 2016 Richardson and Hollinger 2005) Whilesoil respiration tends to be temperature-limited when soilmoisture is nonlimiting in temperate ecosystems in range-land ecosystems the controls of soil CO2 efflux were photo-synthesis soil temperature and moisture (Roby et al 2019)In our study the instrumentation did not include measure-ments of plant or soil respiration partition to validate theRecoestimates

A problem regarding data comparison from remote orbitalsensors and terrestrial observations is that different quanti-ties are fundamentally measured MODIS measures the radi-ation reflected by the earthrsquos surface in two spectral bands at250 m spatial resolution per pixel five bands at 500 m and29 bands at 1000 m The EC technique has a footprint tomeasure CO2 that varies dynamically in shape and size butis generally considered to be 1 km2 in this study the foot-print radius was 600 m To solve the scaling MODIS prod-ucts related to the carbon cycle have been validated with theEC technique and biometric measurements on several spa-tial scales using process-based ecosystem models and char-acterizing areas up to 47 km2 around the EC measuring tower(Cohen et al 2003)

At Bernal the vegetation was heterogeneous This situa-tion was represented by the heterogeneity in vegetation ac-tivity in the 4 pixels used spanning about 02 units of NDVIat the peak of the season activity (Fig 1) The standard errorof the IVI differed by 1 order of magnitude among speciesAlthough more important species had a lower standard errortheir means were similar indicating that they were equallyabundant at all sampling plots The higher importance ofsome species (Table 2) was explained by selective grazingndashbrowsing behavior and the dispersion caused by cattle by ei-ther ingesting or transporting seeds or plant parts (Belaynehand Tessema 2017) Regarding landscape heterogeneity thetower fetch was predominantly from the northeast capturingthe less heterogeneous area of the site but also the more ac-tive according to the NDVI (Fig 1)

The thorny scrub examined had two vegetation layersthe overstory layer mainly consisting of mesquite acacia

and cacti and the understory layer that included grasses andherbs Cattle preferentially graze the understory and becausethey eat using their tongue they will avoid browsing thornyspecies unlike goats or deer which use their lips Withoutgrazing management overtime the competition balance willfavor bush species resulting in encroaching and the under-story will be stressed only unpalatable species or those withtheir growing meristems very close to the ground would sur-vive Representing the structure and functioning of these twolayers using MODIS is possible (Liu et al 2017) RecentlyHill and Guerschman (2020) presented a MODIS product de-rived from MCD43A4 to estimate the fractions of photosyn-thetic and nonphotosynthetic vegetation and the remainingfraction of bare soil These developments could improve theMOD17 GPP estimates since its model represents a homoge-neous single vegetation layer All these considerations helpto understand the low θs between MOD17 estimates of GPPand EC-derived GPP

43 Forcing variables and machine learning

Forcing variables in this study were gridded meteorologi-cal data and MODIS spectral bands and products but thevariables mostly included in the ML algorithms were LAIand FPAR accounting for 364 and 549 of the vari-able importance index at Bernal and Santa Rita respectivelyin their leader board This result supported the view thatCO2 fluxes can be represented by ML algorithms exclu-sively using remotely sensed data (Tramontana et al 2016Joiner and Yoshida 2020) Using neural networks Joiner andYoshida (2020) showed that with only satellite reflectancefrom MODIS and top-of-atmosphere PAR it is possible tocapture a large fraction of GPP variability They argue thatvegetation indices may reduce the information content ofthe underlying reflectance when compressing the informa-tion from two or more bands into a single index In our studyreflectance bands were not often included in the ML modelsThis was most likely because MOD09GA and MODOCGAprovide estimates of surface reflectance uncorrected for theillumination and viewing geometry while the MCD43 usedin Joiner and Yoshida (2020) uses a surface bidirectional re-flectance distribution function (BRDF) model that improvesthe quality of surface albedo retrievals By propagating theBRDF correction in the MODIS processing pipeline the veg-etation indices would more likely relate better to the GPPFuture modeling efforts should use benchmark scenarios ac-cording to sets of forcing variables already identified as use-ful

44 Carbon flux

Although the carbon balance in ecosystems is influenced bydifferent factors such as soil type and amount of nutrientsthe relationship with soil temperature and humidity is partic-ularly strong (Anderson-Teixeira et al 2011 Hastings et al

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 383

2005) How much of the rainwater the system can retain orlose has been described as the leakiness of the system (Guer-schman et al 2009) More than immediate incident rainfallthe available soil moisture and its redistribution are impor-tant in semiarid ecosystems including steam flow prefer-ential flow paths hydraulic lift and others (Barron-Gaffordet al 2017) At Bernal when the sum of the precipitation ofthe current month and that of the previous month was con-sidered the correlation with NEE was minus07 suggesting thatcontinuous availability of soil moisture is important for theabsorption of CO2 in this environment This result is con-sistent with other studies in which the relationship betweenthe net productivity of the ecosystem (NEP) and precipita-tion is initially positive but is leveled from 1000 to 1500 mmannually (Xu et al 2014) The hydraulic redistribution ofwater from moist (deeper) to drier soils through plant rootstended to increase modeled annual ecosystem uptake of CO2this process was identified at US-SMR (Fu et al 2018 Scottet al 2003)

The leakiness is highly dependent upon vegetation frac-tional cover the proportion of the surface occupied by baresoil and vegetation photosynthetically active vegetation andnonphotosynthetically active vegetation such as litter woodand dead biomass (Guerschman et al 2009) All these vege-tation fractions have a water storage capacity and can reducethe amount of effective rainfall available to plant roots It ispossible that the canopy of the bushes completely interceptedthe rainfall in some months because the scrub can interceptup to 20 of the precipitation and its canopy storage ca-pacity is 097 mm (Mastachi-Loza et al 2010) Consideringonly the daily rain events greater than 5 mm the correlationbetween precipitation and NEE rose to minus072 In the presentstudy the interception of rain by vegetation surfaces was notcalculated but the results suggest that it would be impor-tant to explore the relationship between net precipitation andNEE

The average NEE at a global level is minus156plusmn284 gCmminus2 yrminus1 (Baldocchi 2014) The highest frequencyamong sites that measured NEE with EC occurs from minus200to minus300 gCmminus2 yrminus1 but in sites with biometric measure-ments the peak occurs at minus100 gCmminus2 yrminus1 (Xu et al2014) Using the daily averages of Table 2 the averageNEE during the measurement period was minus078 gCmminus2 dminus1

and annually would be minus2835 gCmminus2 yrminus1 This resultwas higher than the annual values of the induced grass-land and scrubland vegetation characterizing the SonoranDesert plains (138 and 130 gCmminus2 yrminus1 Hinojo-Hinojoet al 2019) In New Mexico NEE values measured withEC are between 35ndash50 gCmminus2 yrminus1 in desert grassland and344ndash355 gCmminus2 yrminus1 in mixed coniferous forest (Anderson-Teixeira et al 2011) In a dryer region the sarcocaulescentscrubland of Baja California in Mexico the NEE was minus39and minus52 gCmminus2 yrminus1 in 2002 and 2003 respectively (Hast-ings et al 2005) The NEE measured here was within therange of NEE 03plusmn 02 kgCmminus2 yrminus1 for grasslands and

shrublands in Mexico (Murray-Tortarolo et al 2016) Al-though the measurements of the present study had gaps andwere compared with annual studies we considered that thereported value of C was representative of the main season ofgrowth of this type of scrub

45 Ecosystem management

Overgrazing is an appreciation relative to the grazing produc-tive system where the forage resource is overused in a mixedshrubndashgrass ecosystem such as Bernal it usually refers tothe understory Overgrazing means that the plant regrowth isreadily grazed tillers and root reserves are lost and eventu-ally the plant may die Although the Bernal site was over-grazed the carbon fluxes indicated that the plant communitywas photosynthetically active in both dry and rainy seasonsIt is fair to assume that water in the soil was not limiting forthe deep-rooted bush species and that was the reason whyit was possible to maintain the photosynthetic function dur-ing rainless months However this primary production wouldnot have tangible benefits for ranchersrsquo production systemsince no edible biomass would be produced for the cattleFrom the point of view of carbon capture the system accu-mulated nonlabile biomass that would remain in the systemfor a longer time compared to a grassland ecosystem (al-though it would be necessary to determine the partition ofsaid shrub biomass) However the overgrazing condition af-fects the biomass of the understory roots and consequentlythe carbon pool in the soil

In the short term it can be thought that the estimated neg-ative carbon flows had a favorable effect on the environmen-tal agenda As time passes it is possible that the gaps be-tween the individual shrubs of the overstory expand and thiswould have had an effect on soil erosion It is also possiblethat the water stored in the soil profile used by the bushesgradually decreases to the point of causing drought changesin phenology and advancements of the desertification pro-cess There are many opportunities for ecology conservationand livestock-oriented management these may include con-trolled grazing or propagating native thornless shrub speciesIf ranchers do not identify a benefit in the vegetation thenthey will be tempted to remove it as it occurred at thestudy site Because of its wide coverage and availability theMODIS GPP product accuracy is important in representingthe carbon cycle raising awareness and monitoring advance-ment of environmental decisions

Although we found that EML was a good option for mod-eling the GPP of a site what is really needed to evaluate theperformance of semiarid ecosystems is a spatial representa-tion of the carbon flux This is a problem for an underrepre-sented area regarding instrumented EC towers However theEML could be designed to take into account the explanatoryvariables in a spatiotemporal continuity

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

384 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

5 Conclusions

The best modeling approach was the ensemble of ma-chine learning The second option to estimate GPP was themultiple-regression OLS and the third alternative were theMODIS estimates Machine learning was a good option topredict GPP in the local context where it was generated oth-erwise its performance was not good This was demonstratedwhen using the EML from Santa Rita and trying to predictfluxes at Bernal Nevertheless a machine learning modelwould be useful for gap filling or the evaluation of GPP atthe same site The GPP estimates of a given model can be ad-justed using the bias parameters of the probability of agree-ment to improve the relationship Further research explor-ing the usefulness of EML to predict GPP in other contextscould use EML to calibrate mechanistic model parametershybrid approaches integrating EML and biologically mean-ingful mechanistic models or EML representing the spatialvariability of the landscape

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 385

Appendix A

Table A1 Threshold of velocity friction (ulowast) above which nighttime fluxes were considered valid Estimates were obtained using the Rpackage REddyProc We used only the flux records with ulowast equal to or higher than the corresponding mean ulowast threshold for each yearBound values are a 95 confidence interval NA not applicable

Aggregation method Year Season Mean Lower bound 5 Upper bound 95

ulowast (msminus1)

Single NA NA 0193 0140 0209Year 2017 NA 0194 0128 0227Year 2018 NA 0193 0132 0225Season 2017 2016012 0194 0128 0227Season 2017 2017003 0215 0141 0297Season 2017 2017006 0194 0128 0227Season 2017 2017009 0181 0144 0208Season 2018 2017012 0149 0116 0232Season 2018 2018003 0271 0134 0297Season 2018 2018006 0193 0132 0221

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

386 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix B

Table B1 MODIS reflectance bands and products database

Product name Satellite layer Spatial Temporal Spectralresolution resolution coverage

(d) (nm)

MOD09GA Band 1 500 m 1 620ndash670Band 2 841ndash876Band 3 459ndash479Band 4 545ndash565Band 5 1230ndash1250Band 6 1628ndash1652Band 7 2105ndash2155

MODOCGA Band 8 1 km 405ndash420Band 9 438ndash448Band 10 483ndash493Band 11 526ndash536Band 12 546ndash556Band 13 662ndash672Band 14 673ndash683

MOD17A2H Gross primary production (GPP) 8

Net photosynthesis (PsnNet)

MOD15A2H Fraction of photosynthetically active radiation (Fpar) na

Leaf area index (LAI)

MOD11A2 Land surface temperature and emissivity of day (LST day) 1 km

Land surface temperature and emissivity of night (LST night)

MYD13A1 Enhanced vegetation index (EVI) 500 m 16

MYD15A2H Fraction of photosynthetically active radiation (Fpar) 8

Leaf area index (LAI)

MYD11A2 Land surface temperature and emissivity of day (LST day) 1 km na

Land surface temperature and emissivity of night (LST night)

MCD15A2H Fraction of photosynthetically active radiation (Fpar) 500 m

Leaf area index (LAI)

na not applicable

Table B2 Daily vegetation indexes computed using the MODIS reflectance bands described in Table 1

Index Formula Reference

Simple ratio (SimpleR) SimpleR= Band 2Band 1

(Hill et al 2006)Moisture stress (MoistS) MoistS= Band 6

Band 2

Disease stress index (DSI) DSI= Band 2+Band 4Band 6+Band 1

Redndashgreen ratio index (RGRI) RGRI= Band 1Band 4

Normalized difference vegetation index (NDVI) NDVI= Band 2minusBand 1Band 2+Band 1

Normalized difference water index (NDVI_w) NDVI_w =Band 2minusBand 5Band 2+Band 5

Green leaf index (GLI) GLI= 2lowastBand 11minusBand 14minusBand 92lowastBand 11+Band 14+Band

(Henrich et al 2012)Green atmospherically resistant vegetation (GARI) GARI= Band 5minus(Band 11minus(Band 9minusBand 14))

Band 5minus(Band 11+(Band 9minusBand 14))

Enhanced vegetation index (EVI) EVI= 25lowast Band 2minusBand 1Band 2+(6lowastBand 1)minus(75lowastBand 9)+1

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 387

Table B3 Daymet meteorological database

Variable Spatial resolution Temporal resolution Reference

Precipitation (Dayprc)

1 km Daily (Thornton et al 2017)Shortwave radiation (Daysrad)Maximum air temperature (DayTmax)Minimum air temperature (DayTmin)Water vapor pressure (Dayvp)

Table B4 Precipitation data

Satellite Product name Spatial resolution Temporal resolution

Global Precipitation Measurement (GPM) 3IMERGDF v006 010Daily

Tropical Rainfall Measuring Mission (TRMM) 3B42 v007 025

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

388 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix C

Table C1 Analysis of variance for GGP derived from EC data atthe Bernal site Variable details are described in Appendix B

Model term Coefficient SEa DFb SSc F p VIFd

Regression 14 178048Constant minus1318 648 minus203 004R_08_40542e

minus212 060 1 2680 minus352 0 220R_13_66267e

minus286 123 1 1161 minus232 002 155R_14_67368e 917 259 1 2706 353 0 122RGRI6

minus083 037 1 1100 minus225 003 143GARIf 201 108 1 755 187 006 237EVIMYDg 853 201 1 3903 425 0 294NDVI_wf 390 155 1 1367 251 001 271PsnNetg 0013 0 1 7530 590 0 383LstNgtMYDg 005 002 1 818 194 005 201Dayprch 014 003 1 3590 407 0 192Daysradh 001 0 1 5121 486 0 501DayTmaxh

minus024 006 1 3658 minus411 0 689DayTminh 018 005 1 3222 386 0 496TRMMi minus004 002 1 1087 minus224 003 141Error 403 87297Total 417 265345

a Standard error of coefficient b Degrees of freedom c Adjusted sum of squares d Variance inflation factore Variable name denotes the band number and spectral bandwidth of MODIS (Moderate Resolution ImagingSpectroradiometer) f Vegetation indices RGRI is redndashgreen ratio index and GARI is green atmosphericallyresistant vegetation index details of formula are described in Appendix B g Layers from MODIS productsEVIMYD is enhanced vegetation index from MYD13 PsnNet is photosynthesis product form MOD17LstNgtMYD is nighttime land surface temperature emissivity from MYD11 h Variables obtained fromDaymet daily dataset DayTmin is minimum temperature DayTmax is maximum temperature Daysrad isshortwave radiation Dayprc is precipitation i Daily rainfall rate from 3B42 TRMM (Tropical RainfallMeasuring Mission)

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 389

Code and data availability Database and programming code areavailable at httpsdoiorg105281zenodo3598595 (Guevara-Escobar et al 2020)

Author contributions AGE contributed to conceptualization for-mal analysis investigation methodology software supervisionvalidation visualization and writing of the original draft as wellas review and editing EGS was in charge of funding acquisitionproject administration and writing and review MCJ contributed toinvestigation methodology validation visualization and writing ofthe original draft as well as review editing and communicationMEQB helped with investigation validation writing and reviewHSA contributed with conceptualization writing and review ICAacutehelped with investigation writing and review VHCS contributedwith writing review and funding acquisition

Competing interests The authors declare that they have no conflictof interest

Acknowledgement We thank NASA and Ameriflux for data avail-ability Data filtering was carried out by the cluster Abacus ICentre of Applied Mathematics and High Performance ComputingABACUS-CINVESTAV

Financial support This research has been supported by CONA-CyT ndash SEMARNAT (grant no 2014-1-(249407))

Review statement This paper was edited by Paul Stoy and re-viewed by three anonymous referees

References

Aguirre-Diacuteaz G J Aguilloacuten-Robles A Tristaacuten-Gonzaacutelez MLabarthe-Hernaacutendez G Loacutepez-Martiacutenez M Bellon Hand Nieto-Obregoacuten J Geologic setting of the Pentildeade Bernal Natural Monument Quereacutetaro Meacutexico Anendogenous volcanic dome Geosphere 9 557ndash571httpsdoiorg101130GES008431 2013

Anav A Friedlingstein P Beer C Ciais P Harper AJones C Murray-Tortarolo G Papale D Parazoo N Cand Peylin P Spatiotemporal patterns of terrestrial grossprimary production A review Rev Geophys 53 785ndash818httpsdoiorg1010022015RG000483 2015

Anderson-Teixeira K J Delong J P Fox A M Brese D Aand Litvak M E Differential responses of production and res-piration to temperature and moisture drive the carbon balanceacross a climatic gradient in New Mexico Glob Change Biol17 410ndash424 httpsdoiorg101111j1365-2486201002269x2011

Baldocchi D Measuring fluxes of trace gases and energy be-tween ecosystems and the atmospherendashthe state and future of the

eddy covariance method Glob Change Biol 20 3600ndash3609httpsdoiorg101111gcb12649 2014

Barron-Gafford G A Sanchez-Cantildeete E P Minor R LHendryx S M Lee E Sutter L F Tran N Parra EColella T Murphy P C Hamerlynck E P Kumar P andScott R L Impacts of hydraulic redistribution on grassndashtreecompetition vs facilitation in a semi-arid savanna New Phytol215 1451ndash1461 httpsdoiorg101111nph14693 2017

Belayneh A and Tessema Z K Mechanisms of bush encroach-ment and its inter-connection with rangeland degradation insemi-arid African ecosystems a review J Arid Land 9 299ndash312 httpsdoiorg101007s40333-016-0023-x 2017

Bland J M and Altman D G Statistical methods for assessingagreement between two methods of clinical measurement Int JNurs Stud 47 931ndash936 2010

Boettiger C and Eddelbuettel D An introduction to rockerDocker containers 639 for r ArXiv Prepr ArXiv171003675 6402017

Bonilla-Moheno M and Aide T M Beyond deforestationLand cover transitions in Mexico Agric Syst 178 102734httpsdoiorg101016jagsy2019102734 2020

Booker K Huntsinger L Bartolome J W Sayre N Fand Stewart W What can ecological science tell us aboutopportunities for carbon sequestration on arid rangelands inthe United States Glob Environ Change 23 240ndash251httpsdoiorg101016jgloenvcha201210001 2013

Butzer K W and Butzer E K The lsquonaturalrsquovegetation ofthe Mexican Bajiacuteo archival documentation of a 16th-century savanna environment Quat Int 43 161ndash172httpsdoiorg101016S1040-6182(97)00032-3 1997

CICESE C de I C y de E S de E Base de datos climatoloacutegicanacional (CLICOM) Villa Bernal Quereacutetaro available at httpclicom-mexcicesemx (last access 28 August 2019) 2015

Cohen W B Maiersperger T K Yang Z Gower S TTurner D P Ritts W D Berterretche M and Run-ning S W Comparisons of land cover and LAI esti-mates derived from ETM+ and MODIS for four sites inNorth America A quality assessment of 20002001 provi-sional MODIS products Remote Sens Environ 88 233ndash255httpsdoiorg101016jrse200306006 2003

Cui T Wang Y Sun R Qiao C Fan W Jiang GHao L and Zhang L Estimating vegetation primaryproduction in the Heihe River Basin of China withmulti-source and multi-scale data PloS One 11 1ndash20httpsdoiorg101371journalpone0153971 2016

Curtis J T and McIntosh R P The Interrelations of Certain An-alytic and Synthetic Phytosociological Characters Ecology 31434ndash455 httpsdoiorg1023071931497 1950

Delgado-Balbuena J Yepez E A Aacutengeles-Peacuterez G Aguirre-Gutierrez C Arredondo T Ayala-Nintildeo F Bullock S HCastellanos A E Cueva A Figueroa-Espinoza B Garatuza-Payaacuten J Hinojo-Hinojo C Maya-Delgado Y Meacutendez-Barroso L Oechel W Paz-Pellat F Peacuterez-Ruiacutez E R Ro-driacuteguez J C Rojas-Robles N Saacutenchez-Mejiacutea Z M Uuh-Sonda J Vargas R Verduzco V S Vivoni E R andWatts C Flujos anuales de carbono en ecosistemas terrestres deMeacutexico in Estado Actual del Conocimiento del Ciclo del Car-bono y sus Interacciones en Meacutexico Siacutentesis a 2018 p 686Texcoco Estado de Meacutexico Meacutexico 2018

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

390 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Eshel G Dayalu A Wofsy S C Munger J W andTziperman E Listening to the Forest An ArtificialNeural Network-Based Model of Carbon Uptake at Har-vard Forest J Geophys Res-Biogeo 124 461ndash478httpsdoiorg1010292018JG004791 2019

Fu C Wang G Bible K Goulden M L Saleska S RScott R L and Cardon Z G Hydraulic redistributionaffects modeled carbon cycling via soil microbial activityand suppressed fire Glob Change Biol 24 3472ndash3485httpsdoiorg101111gcb14164 2018

Gebremichael M and Barros A P Evaluation of MODISGross Primary Productivity (GPP) in tropical mon-soon regions Remote Sens Environ 100 150ndash166httpsdoiorg101016jrse200510009 2006

Gitelson A A Peng Y Arkebauer T J and Suyker A EProductivity absorbed photosynthetically active radiation andlight use efficiency in crops Implications for remote sensingof crop primary production J Plant Physiol 177 100ndash109httpsdoiorg101016jjplph201412015 2015

Glenn E P Scott R L Nguyen U and Nagler P L Wide-area ratios of evapotranspiration to precipitation in monsoon-dependent semiarid vegetation communities J Arid Envi-ron 117 84ndash95 httpsdoiorg101016jjaridenv2015020102015

Goldstein A Turner W R Spawn S A Anderson-Teixeira K J Cook-Patton S Fargione J Gibbs H KGriscom B Hewson J H and Howard J F Protecting ir-recoverable carbon in Earthrsquos ecosystems Nat Clim Change10 1ndash9 httpsdoiorg101038s41558-020-0738-8 2020

Goulden M L Munger J W Fan S Daube B Cand Wofsy S C Measurements of carbon sequestra-tion by long-term eddy covariance Methods and a criti-cal evaluation of accuracy Glob Change Biol 2 169ndash182httpsdoiorg101111j1365-24861996tb00070x 1996

Grossiord C Buckley T N Cernusak L A Novick K A Poul-ter B Siegwolf R T Sperry J S and McDowell N GPlant responses to rising vapor pressure deficit New Phytol226 1550ndash1566 2020

Guerschman J P Hill M J Renzullo L J Barrett D JMarks A S and Botha E J Estimating fractional coverof photosynthetic vegetation non-photosynthetic vegetation andbare soil in the Australian tropical savanna region upscaling theEO-1 Hyperion and MODIS sensors Remote Sens Environ113 928ndash945 httpsdoiorg101016jrse200901006 2009

Guevara-Escobar A Gonzaacutelez-Sosa E Cervantes-Jimeacutenez MSuzaacuten-Azpiri H Queijeiro-Bolantildeos M E Carrillo-AacutengelesI and Cambroacuten-Sandoval V H Machine learning estimates ofeddy covariance carbon flux in a scrub in the Mexican highlandZenodo httpsdoiorg105281zenodo3598595 2020

Hall P Gill N Kurka M Phan W and Bartz A MachineLearning Interpretability with H2O Driverless AI edited byBartz A H2Oai Inc California US 2019

Hastings S J Oechel W C and Muhlia-Melo A Diur-nal seasonal and annual variation in the net ecosystem CO2exchange of a desert shrub community (Sarcocaulescent) inBaja California Mexico Glob Change Biol 11 927ndash939httpsdoiorg101111j1365-2486200500951x 2005

Henrich V Krauss G Goumltze C and Sandow C Entwicklungeiner Datenbank fuumlr Fernerkundungsindizes vol 45 AK Fern-erkundung Bochum 2012

Hill M Held A Leuning R Coops N Hughes D andCleugh H MODIS spectral signals at a flux tower site Rela-tionships with high-resolution data and CO2 flux and light useefficiency measurements Remote Sens Environ 103 351ndash368httpsdoiorg101016jrse200506015 2006

Hill M J and Guerschman J P The MODIS Global VegetationFractional Cover Product 2001ndash2018 Characteristics of Vegeta-tion Fractional Cover in Grasslands and Savanna Woodlands Re-mote Sens 12 406 httpsdoiorg103390rs12030406 2020

Hinojo-Hinojo C Castellanos A E Huxman T Ro-driguez J C Vargas R Romo-Leoacuten J R and Bie-derman J A Native shrubland and managed buffelgrasssavanna in drylands Implications for ecosystem carbonand water fluxes Agr Forest Meteorol 268 269ndash278httpsdoiorg101016jagrformet201901030 2019

Jaumlrvi L Havu M Ward H C Bellucco V McFad-den J P Toivonen T Heikinheimo V Kolari P Ri-ikonen A and Grimmond C S B Spatial modeling oflocal-scale biogenic and anthropogenic carbon dioxide emis-sions in Helsinki J Geophys Res-Atmos 124 8363ndash8384httpsdoiorg1010292018JD029576 2019

Joiner J and Yoshida Y Satellite-based reflectances capture largefraction of variability in global gross primary production (GPP)at weekly time scales Agr Forest Meteorol 291 108092httpsdoiorg101016jagrformet2020108092 2020

Jung M Schwalm C Migliavacca M Walther S Camps-VallsG Koirala S Anthoni P Besnard S Bodesheim P Carval-hais N Chevallier F Gans F Goll D S Haverd V KoumlhlerP Ichii K Jain A K Liu J Lombardozzi D Nabel J EM S Nelson J A OrsquoSullivan M Pallandt M Papale DPeters W Pongratz J Roumldenbeck C Sitch S TramontanaG Walker A Weber U and Reichstein M Scaling carbonfluxes from eddy covariance sites to globe synthesis and eval-uation of the FLUXCOM approach Biogeosciences 17 1343ndash1365 httpsdoiorg105194bg-17-1343-2020 2020

Kang M Ichii K Kim J Indrawati Y M Park J Moon MLim J-H and Chun J-H New Gap-Filling Strategies forLong-Period Flux Data Gaps Using a Data-Driven ApproachAtmosphere 10 568 httpsdoiorg103390atmos101005682019

Lal R Griffin M Apt J Lave L and Morgan M GManaging Soil Carbon Science 304 393ndash393httpsdoiorg101126science1093079 2004

LI-COR B Eddypro Software instruction manual Nebraska USversion 70 2019

Liu Y Liu R Pisek J and Chen J M Separating overstoryand understory leaf area indices for global needleleaf and de-ciduous broadleaf forests by fusion of MODIS and MISR dataBiogeosciences 14 1093ndash1110 httpsdoiorg105194bg-14-1093-2017 2017

Long S P Marshall-Colon A and Zhu X-G Meetingthe Global Food Demand of the Future by EngineeringCrop Photosynthesis and Yield Potential Cell 161 56ndash66httpsdoiorg101016jcell201503019 2015

Ma X Huete A Yu Q Restrepo-Coupe N Beringer J Hut-ley L B Kanniah K D Cleverly J and Eamus D Param-

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 391

eterization of an ecosystem light-use-efficiency model for pre-dicting savanna GPP using MODIS EVI Remote Sens Environ154 253ndash271 2014

Ma X Migliavacca M Wirth C Bohn F J Huth ARichter R and Mahecha M D Monitoring Plant FunctionalDiversity Using the Reflectance and Echo from Space RemoteSens 12 1248 httpsdoiorg103390rs12081248 2020

Marcolla B Roumldenbeck C and Cescatti A Patterns and con-trols of inter-annual variability in the terrestrial carbon budgetBiogeosciences 14 3815ndash3829 httpsdoiorg105194bg-14-3815-2017 2017

Mastachi-Loza C A Gonzaacutelez-Sosa E Becerril-Pintildea R andBraud I Peacuterdidas por intercepcioacuten en mezquite (Prosopis laevi-gata) y huizache (Acacia farnesiana) de la regioacuten semiaacuterida delcentro de Meacutexico Tecnol Cienc Agua 1 103ndash120 2010

Mauder M and Foken T Documentation and instruction man-ual of the eddy-covariance software package TK3 UniversitaumltBayreuth Abteilung Mikrometeorologie Bayreuth Germany2011

McClaran M A century of vegetation change on the Santa Rita Ex-perimental Range in Santa Rita Experimental Range 100 years(1903 to 2003) to accomplishments and contributions edited byMcClaran M P Ffolliott P F and Edminister C B Tech Co-ords USDA Forest service proceedings RMRS-P-30 TucsonAZ US 16ndash33 2003

McCulley R Jobbagy E Pockman W and Jackson R Nu-trient uptake as a contributing explanation for deep rootingin arid and semi-arid ecosystems Oecologia 141 620ndash628httpsdoiorg101007s00442-004-1687-z 2004

McElreath R Statistical Rethinking A Bayesian Course with Ex-amples in R and STAN CRC Press Boca Raton FL US 2020

Michailidis M How Driverless AI Prevents Overfitting and Leak-age Open Source Lead AI ML available at httpswwwh2oaiblogdriverless-ai-prevents-overfitting-leakage (last access7 July 2020) 2018

Mitchell R and Frank E Accelerating the XGBoost algo-rithm using GPU computing PeerJ Comput Sci 3 e127httpsdoiorg107717peerj-cs127 2017

Monteith J Solar radiation and productivity in tropical ecosys-tems J Appl Ecol 9 747ndash766 1972

Murray-Tortarolo G Friedlingstein P Sitch S Jaramillo V JMurguiacutea-Flores F Anav A Liu Y Arneth A Arvanitis AHarper A Jain A Kato E Koven C Poulter B Stocker BD Wiltshire A Zaehle S and Zeng N The carbon cycle inMexico past present and future of C stocks and fluxes Biogeo-sciences 13 223ndash238 httpsdoiorg105194bg-13-223-20162016

Nash J E and Sutcliffe J V River flow forecasting through con-ceptual models part I ndash A discussion of principles J Hydrol 10282ndash290 1970

Owen N A Choncubhair Oacute N Males J del Real Laborde J IRubio-Corteacutes R Griffiths H and Lanigan G Eddy covari-ance captures four-phase crassulacean acid metabolism (CAM)gas exchange signature in Agave Plant Cell Environ 39 295ndash309 httpsdoiorg101111pce12610 2016

Pasetto D Arenas-Castro S Bustamante J Casagrandi RChrysoulakis N Cord A F Dittrich A Domingo-Marimon C El Serafy G and Karnieli A Integration ofsatellite remote sensing data in ecosystem modelling at local

scales Practices and trends Methods Ecol Evol 9 1810ndash1821httpsdoiorg1011112041-210X13018 2018

Pastorello G Trotta C and Canfora E The FLUXNET2015dataset and the ONEFlux processing pipeline for eddy covari-ance data Sci Data 7 225 pp httpsdoiorg101038s41597-020-0534-3 2020

Pearcy R W Heterogeneous light environments in Handbookof Functional Plant Ecology edited by Pugnaire I F and Val-ladares F CRC Press Boca Ratoacuten FL US 270ndash314 2007

Piao S Liu Q Chen A Janssens I A Fu Y Dai J Liu LLian X Shen M and Zhu X Plant phenology and global cli-mate change Current progresses and challenges Glob ChangeBiol 25 1922ndash1940 httpsdoiorg101111gcb14619 2019

R Development Core Team R A language and environment forstatistical computing R Foundation for Statistical Computingavailable at httpswwwr-projectorg (last access 15 October2015) 2009

Reichstein M Falge E Baldocchi D Papale D Aubinet MBerbigier P Bernhofer C Buchmann N Gilmanov Tand Granier A On the separation of net ecosystem ex-change into assimilation and ecosystem respiration reviewand improved algorithm Glob Change Biol 11 1424ndash1439httpsdoiorg101111j1365-24862005001002x 2005

Richardson A D and Hollinger D Y Statistical model-ing of ecosystem respiration using eddy covariance datamaximum likelihood parameter estimation and Monte Carlosimulation of model and parameter uncertainty applied tothree simple models Agr Forest Meteorol 131 191ndash208httpsdoiorg101016jagrformet200505008 2005

Richardson A D Hollinger D Y Shoemaker J K Hughes HSavage K and Davidson E A Six years of ecosystem-atmosphere greenhouse gas fluxes measured in a sub-borealforest Sci Data 6 1ndash15 httpsdoiorg101038s41597-019-0119-1 2019

Roby M C Scott R L Barron-Gafford G A Hamerlynck E Pand Moore D J P Environmental and Vegetative Controls onSoil CO2 Efflux in Three Semiarid Ecosystems Soil Syst 3 6httpsdoiorg103390soilsystems3010006 2019

Running S W and Zhao M Daily GPP and annual NPP(MOD17A2A3) products NASA Earth Observing SystemMODIS land algorithm Univ of Mont Missoula Mont US2015

Running S W Nemani R R Heinsch F A Zhao MReeves M and Hashimoto H A continuous satellite-derived measure of global terrestrial primary productionBioscience 54 547ndash560 httpsdoiorg1016410006-3568(2004)054[0547ACSMOG]20CO2 2004

Rzedowski J Vegetacioacuten de Meacutexico Meacutexico Editor Limusa1978

Scott R L and Biederman J A Critical Zone Water Balance Over13 Years in a Semiarid Savanna Water Resour Res 55 574ndash588 httpsdoiorg1010292018WR023477 2019

Scott R L Watts C Payan J G Edwards E Goodrich D CWilliams D and James Shuttleworth W The understory andoverstory partitioning of energy and water fluxes in an opencanopy semiarid woodland Agr Forest Meteorol 114 127ndash139 httpsdoiorg101016S0168-1923(02)00197-1 2003

Scott R L Edwards E A Shuttleworth W J Huxman T EWatts C and Goodrich D C Interannual and seasonal vari-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

392 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

ation in fluxes of water and carbon dioxide from a ripar-ian woodland ecosystem Agr Forest Meteorol 122 65ndash84httpsdoiorg101016jagrformet200309001 2004

Scott R L Biederman J A Hamerlynck E P and Barron-Gafford G A The carbon balance pivot point of southwest-ern US semiarid ecosystems Insights from the 21st centurydrought J Geophys Res-Biogeo 120 2612ndash2624 2015

Segerstrom K Geology of the Bernal-Jalpan Area Estado deQueretaro Mexico Geol Surv Bull 1104 19ndash82 1961

Soper F M McCalley C K Sparks K and Sparks J P Soilcarbon dioxide emissions from the Mojave desert Isotopic evi-dence for a carbonate source Geophys Res Lett 44 245ndash251httpsdoiorg1010022016GL071198 2017

Stevens N T Steiner S H and MacKay R J Assessing agree-ment between two measurement systems An alternative to thelimits of agreement approach Stat Methods Med Res 262487ndash2504 httpsdoiorg1011770962280215601133 2015

Tagesson T Ardouml J Cappelaere B Kergoat L AbdiA Horion S and Fensholt R Modelling spatial andtemporal dynamics of gross primary production in theSahel from earth-observation-based photosynthetic capacityand quantum efficiency Biogeosciences 14 1333ndash1348httpsdoiorg105194bg-14-1333-2017 2017

Thornton P Thornton M Mayer B Wei Y Devarakonda RVose R and Cook R Daymet Daily Surface Weather Data ona 1-km Grid for North America Version 3 Oak Ridge NationalLab (ORNL) Oak Ridge TN US 2017

Tramontana G Jung M Schwalm C R Ichii K Camps-VallsG Raacuteduly B Reichstein M Arain M A Cescatti A KielyG Merbold L Serrano-Ortiz P Sickert S Wolf S andPapale D Predicting carbon dioxide and energy fluxes acrossglobal FLUXNET sites with regression algorithms Biogeo-sciences 13 4291ndash4313 httpsdoiorg105194bg-13-4291-2016 2016rdquo

Truong A Walters A Goodsitt J Hines K Bruss C B andFarivar R Towards automated machine learning Evaluationand comparison of AutoML approaches and tools in Proceed-ings of the 2019 IEEE 31st International Conference on Toolswith Artificial Intelligence (ICTAI) Portland OR US 1471ndash1479 2019

Verbist K Santibantildeez F Gabriels D and Soto G Atlas dezonas aacuteridas de Ameacuterica Latina y El Caribe Donoso MC(Coord) UNESCO-PHI and CAZALAC Montevideo Uruguay2010

Wang H Li X Ma M and Geng L Improving esti-mation of gross primary production in dryland ecosystemsby a model-data fusion approach Remote Sens 11 225httpsdoiorg103390rs11030225 2019

Wang X Ma M Li X Song Y Tan J Huang G Zhang ZZhao T Feng J and Ma Z Validation of MODIS-GPP prod-uct at 10 flux sites in northern China Int J Remote Sens 34587ndash599 httpsdoiorg101080014311612012715774 2013

Wang X Wang H Li X and Ran Y Photosynthesis (NPPNEP Respiration) Obs Meas Ecohydrol Process 1ndash30httpsdoiorg101007978-3-662-47871-4_10-1 2018

Wheeler C W Archer S R Asner G P and McMurtry C RClimaticedaphic controls on soil carbonnitrogen response toshrub encroachment in desert grassland Ecol Appl 17 1911ndash1928 httpsdoiorg10189006-15801 2007

Wilcox B P Birt A Fuhlendorf S D and Archer S R Emerg-ing frameworks for understanding and mitigating woody plantencroachment in grassy biomes Curr Opin Env Sust 32 46ndash52 httpsdoiorg101016jcosust201804005 2018

Wilson J B Cover plus ways of measuring plant canopiesand the terms used for them J Veg Sci 22 197ndash206httpsdoiorg101111j1654-1103201001238x 2011

Wofsy S Goulden M Munger J Fan S-M Bakwin PDaube B Bassow S and Bazzaz F Net exchange ofCO2 in a mid-latitude forest Science 260 1314ndash1317httpsdoiorg101126science26051121314 1993

Wu C Niu Z and Gao S Gross primary production estimationfrom MODIS data with vegetation index and photosyntheticallyactive radiation in maize J Geophys Res-Atmos 115 D12127httpsdoiorg1010292009JD013023 2010

Wutzler T Lucas-Moffat A Migliavacca M Knauer JSickel K Šigut L Menzer O and Reichstein MBasic and extensible post-processing of eddy covarianceflux data with REddyProc Biogeosciences 15 5015ndash5030httpsdoiorg105194bg-15-5015-2018 2018

Xiao H McDonald-Madden E Sabbadin R Peyrard NDee L E and Chadegraves I The value of understanding feed-backs from ecosystem functions to species for managing ecosys-tems Nat Commun 10 1ndash10 httpsdoiorg101038s41467-019-11890-7 2019

Xu B Yang Y Li P Shen H and Fang J Globalpatterns of ecosystem carbon flux in forests A biometricdata-based synthesis Global Biogeochem Cy 28 962ndash973httpsdoiorg1010022013GB004593 2014

Yao J Liu H Huang J Gao Z Wang G Li D Yu H andChen X Accelerated dryland expansion regulates future vari-ability in dryland gross primary production Nat Commun 111ndash10 httpsdoiorg101038s41467-020-15515-2 2020

Yepez E A Williams D G Scott R L and Lin GPartitioning overstory and understory evapotranspiration ina semiarid savanna woodland from the isotopic composi-tion of water vapor Agr Forest Meteorol 119 53ndash68httpsdoiorg101016S0168-1923(03)00116-3 2003

Yona L Cashore B Jackson R B Ometto J and Brad-ford M A Refining national greenhouse gas inventories Am-bio 1ndash6 httpsdoiorg101007s13280-019-01312-9 2020

Zeng X Zeng X and Barlage M Growing temperate shrubsover arid and semiarid regions in the Community Land ModelndashDynamic Global Vegetation Model Global Biogeochem Cy 22GB3003 httpsdoiorg1010292007GB003014 2008

Zhang A Jia G Epstein H E and Xia J ENSO elicits oppos-ing responses of semi-arid vegetation between Hemispheres SciRep 7 42281 httpsdoiorg101038srep42281 2017

Zhang L Xiao J Zheng Y Li S and Zhou Y Increased carbonuptake and water use efficiency in global semi-arid ecosystemsEnviron Res Lett 15 034022 httpsdoiorg1010881748-9326ab68ec 2020

Zhu X-G Long S P and Ort D R What is the max-imum efficiency with which photosynthesis can convert so-lar energy into biomass Curr Opin Biotech 19 153ndash159httpsdoiorg101016jcopbio200802004 2008

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

  • Abstract
  • Introduction
  • Materials and methods
    • Site description
    • Eddy covariance measurements
    • Flux data processing
    • Remotely sensed data
    • MODIS algorithm for GPP
    • Santa Rita site dataset
    • Modeling
    • Variable importance
    • Model agreement
      • Results
        • Eddy covariance fluxes at Bernal
        • Machine learning ensembles as predictors of eddy covariance GPP
        • MODIS as predictor of eddy covariance GPP
        • Prediction eddy covariance GPP with ordinary-least-squares multiple regression
        • Agreement
          • Discussion
            • Model agreement
            • MODIS discrepancies
            • Forcing variables and machine learning
            • Carbon flux
            • Ecosystem management
              • Conclusions
              • Appendix A
              • Appendix B
              • Appendix C
              • Code and data availability
              • Author contributions
              • Competing interests
              • Acknowledgement
              • Financial support
              • Review statement
              • References
Page 3: Machine learning estimates of eddy covariance carbon flux ... · Scrub is a xeric category of shrublands characterized by plants with small leaves; it is very thorny, and its biomass

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 369

We measured the carbon flux for the vegetation in a semi-arid site located at Bernal Mexico The phenology patternsin the region suggested that this site could be a carbon sinkduring the wet season or a carbon source during the dryseason since some predominant species reproduce duringwinter and spring particularly cacti Acacia and Prosopis(mesquite) Furthermore the Bernal site had a history of dis-turbance by overgrazing this could decrease the GPP andeven result in a positive carbon balance thus being a car-bon source If the shrub vegetation in this site predomi-nantly absorbed carbon during the measuring period thenthis evidence would contribute to reinforce the reported im-portance of semiarid environments in the global carbon bal-ance (Zhang et al 2020) However land ownership patternsand the balance between agricultural investments and landconservation will determine the absolute amount of carbonsequestered Hopefully the results of the present investiga-tion would promote the idea that carbon sequestration is pos-sible in scrubland and be incorporated in informed decisionsand new policy

2 Materials and methods

21 Site description

The study site (Bernal) is located at 20717 N 99941Wand 2050 masl in the municipality of Ezequiel Montes inQuereacutetaro where real-estate development feedlot beef pro-duction cheese and wine production associated with tourismand automotive-industry development are very attractive op-tions for landowners in the region Bernal is located in a shal-low valley oriented from north to south approximately 15 to20 km wide and opening to the south to the Riacuteo Lerma basinand then draining into the Pacific Ocean The northern limitof the valley is surrounded by hill country and its characteris-tic dacitic dome 433 m in height (Aguirre-Diacuteaz et al 2013)Moisture-laden winds blow westward from the Gulf of Mex-ico but the Sierra Gorda located 60 km east of Bernal castsa rain shadow over the area (Segerstrom 1961)

The Bernal site is private property with grazing dairy cat-tle receiving additional concentrated feedstuffs under stall-feeding Grazing was continuous and water for livestock wasonly available in the feeding and milking area there were nopasture divisions and the perimeter fence was made of stoneThese characteristics of the animal production model and thestate of vegetation are representative of land managementpractices and the scrub vegetation of the region Howeverthe Bernal site suffered important changes in land use during2019 and the scrub was suddenly cleared and converted intorainfed cropping

The climate is arid with summer rains (BSk) with a meanannual rainfall of 476 mm and a mean annual temperature of171 C (CICESE 2015) Prevailing wind is from the eastand northeast The terrain is mostly flat most grades are

below 2 The soil has a clay loam texture the class is aVertisol with abundant subrounded basaltic stones withoutrocky outcrops and the depth is greater than 06 m Vegeta-tion was less than 3 m in height with an overgrazed herba-ceous stratum Vegetation corresponds to secondary scrubwith the dominant genera Acacia Prosopis and differentCacti (Fig 3) This site was classified as grassland by theMODIS land cover product

For the scrub and tree species the importance vegetationindex (IVI) was determined following Curtis and McIntosh(1950) to assess the vegetation homogeneity The IVI is thesum of relative dominance relative density and relative fre-quency of the species present Vegetation sample points werechosen according to the flux footprint of the eddy covari-ance tower (Fig 2) For each plant in the vegetation samplepoints two stem diameters the number of individuals (abun-dance) and identity of each species were measured as well asthe coverage which is the horizontal projection of the aerialparts of the individuals on the ground expressed as a per-centage of the total area (Wilson 2011)

22 Eddy covariance measurements

The micrometeorological EC technique measures at the plantcommunity level NEE in a nondestructive way and con-tinuously over time (Baldocchi 2014) The negative CO2fluxes corresponded to NEE which is equivalent to NEP (netecosystem production) but with opposite sign The EC hasadvantages compared to other techniques that need to scaleup measurements from the leaf plant or soil levels up toecosystems especially when the vegetation is heterogeneous(Yepez et al 2003) However EC is an expensive techniqueand data analysis and processing are complicated also spe-cific assumptions must be met regarding the terrain vege-tation and micrometeorological conditions among other as-pects (Richardson et al 2019)

The fluxes were measured with the EC technique at aheight of 6 m with the following instruments a Biomet sys-tem (LI-COR Biosciences USA) to measure H2O and CO2fluxes using an IRGASON-EC-150 open-circuit analyzer aCSAT3 sonic anemometer and a KH20 krypton hygrome-ter these were connected to a CR3000 data logger (Camp-bell Scientific Inc Logan UT USA) The relative humid-ity and air temperature were measured with an HMP155Aprobe (Vaisala Corporation Helsinki Finland) net radiationwas measured with an NR-Lite2 radiometer (Kipp and Zo-nen BV Delft the Netherlands) and the photosynthetic ac-tive radiation (PAR) was measured with a quantum sensor(SKP215 Skye Instruments Llandrindod Wells UK) Mea-surements of the soil heat flux was implemented with fourself-calibrating HFP01SC plates at 80 mm depth and in fourrepresentative positions of the landscape (Hukseflux Ther-mal Sensors BV Delft The Netherlands) Three time do-main reflectometry (TDR) probes (CS616) measured volu-metric water content in the ground installed vertically and

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

370 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 1 Localization and land use maps for the study site (a) biogeographic arid and semiarid zones of southern North America relevantfor the Bernal and Santa Rita sites the black outline represents the state limit for Quereacutetaro (b) Heterogeneity of the normalized differencevegetation index (NDVI) surrounding the EC tower at Bernal during the peak of the growing season 2017 (DOY 257) (c) Land cover in theregion of the Bernal site according to the annual University of Maryland (UMD) classification (MCD12 MODIS product)

two sets of TCAV (averaging soil thermocouple) probes mea-sured the temperature at 60 and 40 mm depths and abovethe HFP01SC plates (Campbell Scientific Inc Logan UTUSA) The TE525 (Texas Electronics Dallas TX USA)tipping-bucket rain gauge was installed at 12 m height and3 m away from the tower All these meteorological variableswere measured every 5 s and average values were stored ev-ery 30 min rainfall was accumulated for the same time inter-val Sensible (H ) and latent (λE) heat fluxes were calculatedby the EddyPro package (LI-COR Biosciences USA)

23 Flux data processing

All EC data were collected at 10 MHz in the data loggerand reported as micromoles of CO2 per square meter persecond and processed with the EddyPro package to con-vert values into average fluxes of 30 min intervals Only

quality-flagged records were used to account for the CO2flux (qc_co2_flux= 0) according to the Mauder and Foken(2011) policy in the EddyPro program (LI-COR 2019)

However this quality-checking is not sufficient especiallyin the case of CO2 therefore data were postprocessed usingthe REddyProc package of R (R Development Core Team2009) to estimate the friction speed thresholds (ulowast) gap-filldata and partition the NEE flux into its GPP and Reco com-ponents (Wutzler et al 2018) The filled-in estimates of NEE(NEE_uStar_f) GPP (GPP_uStar_f) and Reco (Reco_uStar)were used when the ulowast was lower than a ulowast annual thresholdabove which nighttime fluxes are considered valid The an-nual ulowast threshold was 0193 and 0194 for 2017 and 2018The difference between these thresholds and the 95 ulowast

threshold was small (0033 msminus1) Appendix A presents thethreshold means and confidence intervals calculated in theREddyProc package Only 1050 half-hour records (93 )

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 371

Figure 2 Source strength-weighted function at the Bernal site during 2017 and 2018 (gray points) Horizontally plots show the peak ofthe function and increasing percentages of the flux footprint The radial scale is along-wind distance from the sensor (m) The measurementheight was 6 m The blue points represent the vegetation sample plots used to calculate the importance vegetation index (IVI)

Figure 3 Thorny scrub at Bernal Quereacutetaro during the rainy season 2017 In the foreground is Cylindropuntia imbricata a very thornycactus shrubs in the background are Prosopis laevigata mesquites

had a ulowast below the annual mean ulowast threshold The data with aflag equal to 0 were used for the variable NEE_uStar_fqc asdefined by REddyProc Carbon dioxide flux data were time-integrated and converted to grams of C per square meter perday using the molar ratio of C We only reported the contin-uous measurements of the exchange of CO2 for the periodof April 2017 (DOY 89) to August 2018 (DOY 234) usingthe EC technique Due to equipment malfunction and incom-plete datasets some periods of time were not considered Themeasurement campaign presented here was not biased by wet

winters since both years were characterized by a less-than-weak NintildeondashNintildea

24 Remotely sensed data

Data were requested to the National Aeronautics andSpace Administration (NASA) via the Land Processes Dis-tributed Active Archive Center (DAAC) using the Ap-plication for Extracting and Exploring Analysis ReadySamples (AppEEARS) to obtain spatial and temporal

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

372 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

subsets for the Bernal and Santa Rita sites includ-ing daily surface reflectance (MOD09GA006 and MOD-OCGA006) daily daytime and nighttime land surface tem-perature (LST) (MOD11A2006 and MYD11A2006) 8 dleaf area index (LAI) and fraction of photosynthetically ac-tive radiation (FPAR) (MOD15A2H006 MYD15A2H006MCD15A2H006) 16 d enhanced vegetation index (EVI)(MYD13A1006) 16 d gross primary production (GPP)and net photosynthesis (PsnNet) (MOD17A2H006) TheAppEEARS also unpacks and interprets the quality layersAppendix B presents the details of each spectral band ofMODIS Data with less-than-good quality flags were deletedMissing data were filled with splines and a database witha 1 d time step was generated This would smooth lin-ear temporalndashphenological evolution between any two suc-cessive remotely sensed data points (Eshel et al 2019)Daily accumulated rainfall was requested using the Giovanniplatform from the Goddard Space Flight Center (GSFC)DAAC two products were used the 3IMERGDF006 fromthe Global Precipitation Measurement mission (GPM) andthe 3B42007 from the Tropical Rainfall Measuring mis-sion (TRMM) Gridded weather parameters from the ORNLDAAC Daymet dataset were precipitation shortwave radia-tion maximum and minimum air temperature and water va-por pressure Daymet is a data product derived from a collec-tion of algorithms interpolating and extrapolating daily me-teorological observations (Thornton et al 2017) FollowingHenrich et al (2012) and Hill et al (2006) daily reflectancebands of MODIS were used to compute several vegetation in-dices redndashgreen ratio index (RGRI) simple ratio (SimpleR)moisture stress (MoistS) disease stress index (DSI) greenatmospherically resistant vegetation index (GARI) normal-ized difference vegetation index (NDVI) normalized differ-ence water index (NDVI_w) and enhanced vegetation in-dex (EVI) the corresponding equations are presented in Ap-pendix B

25 MODIS algorithm for GPP

Estimates of GPP are derived from data recorded by theMODIS sensor aboard the Terra and Aqua satellites The ef-ficiency (ε gCMJminus1) with which vegetation produces drymatter is defined as the amount of solar energy stored by pho-tosynthesis in a given period divided by the solar constant in-tegrated over the same period (Monteith 1972) Not all inci-dent solar radiation is available for biomass conversion onlyabout 48 is photosynthetically active (PAR MJmminus2) andnot all PAR is absorbed (Zhu et al 2008) Thus carbon ex-change is mainly controlled by the amount of PAR absorbedby green vegetation (APAR) and modified by ε (Gitelson etal 2015) The fraction of absorbed PAR (FPAR) is equalto APARPAR but can be represented by the NDVI spectralvegetation index produced by MODIS (Running 2004) Theefficiency term ε is described as the product of different fac-tors as a whole or as a part of the system (Monteith 1972) but

mostly those related to the efficiencies with which the vegeta-tion intercepts the radiation and the efficiency of convertingthe intercepted radiation into biomass (Long et al 2015)The MODIS algorithm that estimates GPP in the MOD17product is (Running et al 2004)

GPP= εtimesFPARtimesPAR (1)GPP= εtimesNDVItimesPAR (2)

The ε term in the MODIS algorithm is represented by amaximum radiation conversion efficiency (εmax kgCMJminus1)that is attenuated by suboptimal climatic conditions mainlyminimum air temperature (Tmin) and vapor pressure deficit(VPD) Two parameters for each Tmin and VPD are usedto define attenuation scalars for general biome types Theseparameters form linear functions between the scalars (Run-ning and Zhao 2015 Wang et al 2013) daily minimumtemperature at which ε = εmax and at which ε = 0 and thedaylight average VPD at which ε = εmax and at which ε =0 GPP is truncated on days when air temperature is be-low 0 C or VPD is higher than 2000 Pa (Running andZhao 2015) Stress and nutrient constraints on vegetationgrowth are quantified by the limiting relation of leaf areain NDVItimesPAR rather than constrained through ε (Runninget al 2004) However the MODIS algorithm does not con-sider stomatal sensitivity to leaf-to-air vapor pressure acrossand within species and particularly between isohydric andanisohydric plant species (Grossiord et al 2020)

ε = εmaxtimes Tmin scalartimesVPD_scalar (3)

The MOD17 userrsquos guide presents a biome-property look-up table (BPLUT) with the parameters for each biome typeand assumes that they do not vary with space or time (Run-ning and Zhao 2015) This aspect is important because εmaxhas the strongest impact on the predicted GPP of the MOD17algorithm (Wang et al 2013) The assumption is also impor-tant because the overstory and understory could be decou-pled from each other and would intercept different amountsof light and have different water sources during the grow-ing season (Scott et al 2003) Light quantity and quality asdiffuse light or sunflecks determine differences among un-derstory species in their temporal response to gaps involvingacclimation and avoidance of photoinhibition (Pearcy 2007)Another shortcoming is that few land cover classifications areincorporated into the MOD17 algorithm

26 Santa Rita site dataset

The Santa Rita Experimental Range (SRER) is located inthe western range of the Santa Rita Mountains in ArizonaUSA (lat 318214 longminus1108661 1120 masl) Climate isBSk with mean annual precipitation of 380 mm temperatureof 179 C and Ustic Torrifluvent soils Established in 1903SRER has a long history of experimental manipulations toenhance grazing potential for cattle (Glenn et al 2015) Two

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 373

Figure 4 Semidesert grassland encroached by mesquite (Prosopisvelutina) at Santa Rita Arizona (US-SRM) Image credit RussellScott 9 December 2016

Ameriflux sites are located in the SRER Santa Rita Grass-land (US-SRG) and Santa Rita Mezquite (US-SRM) Weused EC data for the years 2013ndash2019 from US-SRM whichis a mesquite grass savanna (35 mesquite canopy coverand mean canopy height above 2 m 22 grasses and 43 bare soil) although MODIS describes this site as open shrub-lands (Glenn et al 2015 Scott et al 2004) The US-SRMsite is dominated by velvet mesquite (Prosopis velutina) andhas a diversity of shrubs cacti succulents and bunch grasses(McClaran 2003) This site was chosen because the vegeta-tion and climate are similar to the Bernal site and it was theclosest EC instrumentation with data availability (Fig 4)

27 Modeling

Gross primary production estimated by EC at the Bernal sitewas modeled using OLS and EML The explanatory vari-ables were the remotely sensed data the weather parame-ters and the vegetation indices (Appendix B) The OLS isa particular case of the generalized linear model where thevariation of a single response variable is explained by severalindependent variables The OLS was fitted with the stepwiseprocedure the final model included variables with a variance

inflation factor (VIF) lower than 10 and a significance levelof 005 Predictions of the EC GPP were obtained with the fi-nal model Analysis and diagnostics were made with Minitabv 17 (Minitab LLC) Analysis of the OLS model can be usedto determine agreement between methods of measurementbut it is sensitive to the range of values in the dataset andits metrics ndash r r2 and root mean square ndash do not provideinformation on the type of association (Bland and Altman2010) In this paper we used another metric the probabil-ity of agreement to determine bias and agreement betweenmodel estimates and observed data (see below)

While OLS is a well-known algorithm machine learningalgorithms are emerging techniques that focus on the datastructure and match that data onto models The EML ap-proach considers the different realizations of machine learn-ing models and constructs an ensemble of models com-ing with the advantage of being more accurate than thepredictions from the individual ensemble members How-ever EML is computationally intensive requiring nodeswith purpose-built hardware such as multiple processors orreduced-precision accelerators The nodes could be aggre-gated in computing clusters which require storage powerand cooling redundancy

A stack of EMLs was obtained with the H2O packageof R (Hall et al 2019) This package provides several al-gorithms that can contribute to a stack of ensembles usingthe AutoML (automatic machine learning) function feedfor-ward artificial neural network (DL) general linear models(GLMs) gradient-boosting machine (GBM) extreme gra-dient boosting (XGBoost) default distributed random for-est (DRF) and extremely randomized trees (XRT) AutoMLtrains two stacked-ensemble models one ensemble containsall the models and the second ensemble contains just thebest-performing model from each algorithm class or fam-ily both ensembles should produce better models than anyindividual model from the AutoML run The term AutoMLimplies data preprocessing normalization feature engineer-ing model selection hyperparameter optimization and pre-diction analysis including procedures to identify and dealwith nonindependent and identically distributed observationsand overfitting (Michailidis 2018 Truong et al 2019)

Machine learning has two elements for supervised learn-ing training loss and regularization The task of training triesto find the best parameters for the model while minimizingthe training loss function the mean squared error for exam-ple The regularization term controls the complexity of themodel helping to reduce overfitting Overfitting becomes ap-parent when the model performs accurately during the train-ing but the accuracy is low during the testing A good modelneeds extensive parameter tuning by running the algorithmmany times to explore the effect on regularization and cross-validation accuracy (Mitchell and Frank 2017) In this inves-tigation the function of training loss was the deviance whichis a generalization of the residual sum of squares driven bythe likelihood Deviance is a measure of model fit and lower

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

374 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

or negative values indicate better model performance (McEl-reath 2020)

A stack of EML solutions was based on a random sam-ple of the dataset for training the model For the Bernal site85 was used for training for US-SRM 80 was usedThe AutoML function was run 20 times each run added ap-proximately 48 models to the leader board and ranked thebest-performing models by their deviance Each run splitsthe training data 10 times for k-fold cross-validation Theseed for an EML that is dependent on randomization waschanged in every run The stopping rule for each run wasset at 100 s and the maximum memory allocation pool forH2O was 100 GB in a single workstation with dual Xeon2680 v4 processors and 128 GB of RAM The H2O packagewas installed in a rockergeospatial docker container (imageavailable at httpshubdockercomrrockergeospatial lastaccess 9 January 2021) which is a portable scalable and re-producible environment (Boettiger and Eddelbuettel 2017)

Two sets of predictions for the GPP at the Bernal site wereobtained from the stacked ensemble The first set of predic-tions was based on the 15 of the Bernal site data reservedfor testing The second set of predictions was obtained byrefeeding the US-SRM site model with the Bernal site ex-planatory variables The first set of predictions would showthe importance of local data to predict EC-based GPP Thesecond set of predictions would represent the suitability ofoff-site data to predict EC-based GPP If the second scenariohas good agreement then an EML model could be used torepresent wider areas of the ecosystem

28 Variable importance

The variable importance within individual models was usedto answer the question of which environmental variableswere important for GPP prediction For the ldquoall-models en-semblerdquo and ldquobest-of-family ensemblerdquo generated by Au-toML it is not possible to examine the variable importance orthe contribution of the individual models to the stack (H2Oai2017) Therefore a weight (wi)was calculated using Eq (4)which is adequate for other information criteria besides theAkaike weights this weight is an estimate of the conditionalprobability that the model will make the best predictions onnew data considering the set of models (McElreath 2020)Then the importance of each variable () was multipliedby the modelrsquos weight (wi) and then added by variable tobuild the variable importance index This index would mea-sure how often a given variable was used on the leader board

wi =exp

(minus

12 dWAICi

)summj=1 exp

(minus

12 dWAICi

) (4)

dWAIC= deviancei minus deviance of top-performing model(5)

29 Model agreement

Calibration and agreement between methods of measure-ment are different procedures Calibration compares knownquantities of the true value or measurements made bya highly accurate method (a gold standard) against themeasurements of a new or a contending method When twomethods of measurement are compared neither providesan unequivocally correct measurement because both havea measurement error and the true value remains unknown(Bland and Altman 2010) Stevens et al (2015) proposes theprobability of agreement (θs) as a plot metric to representthe agreement between two measurement systems acrossa range of plausible values The θs method addressessome of the challenges of the accepted ldquolimits agreementmethodrdquo presented by Bland and Altman (2010) Besides theagreement plot agreement is based on maximum likelihoodbias parameters α and β quantifying the fixed bias andthe proportional bias If α = 0 β = 1 and σ1= σ2 thenthe two measurements are identical σj is the measurementvariation The probability-of-agreement analysis was per-formed using the ProbAgreeAnalysis (httpsuwaterloocabusiness-and-industrial-statistics-research-groupsoftwarelast access 28 November 2020) in MATLAB 94 (Math-Works Inc) An arbitrary 1 gCmminus2 dminus1 was considered tobe a tolerable magnitude to conclude that agreement is suf-ficient enough to use either estimated GPP interchangeablyThe reference measurement was the GPP obtained fromEC data at the Bernal site and tested against the MODISMOD17 model the OLS predictions or each of the two setsof EML predictions If the probability-of-agreement plotsuggested disagreement between two measurement systemsthen the predictions can be adjusted using

Adjusted predictor= (predictorminusα)β (6)

3 Results

31 Eddy covariance fluxes at Bernal

Dominant flow at the Bernal site was from the northeast(Fig 2) Energy balance closure for this site had a slopeof 072 and r2

= 092 (Fig 5) Homogeneous sites of theFluxnet network obtain closure percentages higher than72 and for the Bernal site the vegetation heterogeneitywas important (see below) Average H was always nega-tive during nighttime but during some months of the dryand rainy season λE was positive particularly after dawn(Fig 6) so as to allow nighttime evaporation from the soil orvegetation However during some months of the dry seasonrainfall was low (Fig 7) then the positive λE suggested thatcacti could have an active gas exchange at that time

Carbon dioxide absorptions had a diurnal behavior begin-ning at dawn and ending before sunset (Fig 8) Nighttimeflux was positive indicating respiration notwithstanding the

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 375

Figure 5 Closure of the surface energy balance from eddy covari-ance measurements averaged at 30 min between the turbulent fluxes(H + λE) and available fluxes (Rn-G) Data are from 30 March2017 to 22 August 2018 at the Bernal site The regression wasy = 2302+072times with adjusted r2

= 092 The diagonal line rep-resents the 1 1 relation

presence of cacti Although summer rains are characteristicof the climate at the Bernal site (Fig 7) a negative NEE fluxoccurred in all measured months The lowest CO2 flux wasrecorded in January and February 2017 and in May 2018 (Ta-ble 1) This behavior resulted from the phenology of the veg-etation since most species lost their leaves in the dry seasonand was also due to the effect of low temperature Withinthe rainy season the flux of CO2 increased compared to themonths of January to June The correlation between NEE andprecipitation was minus045 When the sum of the precipitationof the current month and that of the previous month was con-sidered the correlation with NEE was minus07 suggesting thatcontinuous availability of soil moisture is important for theabsorption of CO2 in this environment

The scrub at Bernal was heterogeneous in botanical com-position A total of 24 species of cacti and shrub wereidentified on average each sampling plot had 103 speciesThe IVI was similar between all cacti (036plusmn 004) shrublegumes (038plusmn004) and other shrubs (023plusmn006) sampledMost sampling plots were in areas of high flux frequency(Fig 2) Cylindropuntia imbricata had the largest IVI fol-lowed by Acacia farnesiana Acacia schaffneri and Prosopislaevigata The IVI of the herbaceous stratum representedby grasses was not characterized due to the state of over-grazing and the absence of reproductive structures in plantswhich made measurement of their abundances frequenciesand dominance difficult The grass genera present were Meli-nis Chloris Cynodon and Cenchrus all corresponding to in-

Table 1 Daily average values of the net ecosystem exchange(NEE) gross primary productivity (GPP) and ecosystem respira-tion (Reco) in a scrub at the Bernal site Negative values of NEEindicate photosynthetic absorption

NEE GPP RecomicromolCO2 mminus2 sminus1

2017

JANFEBMARAPR minus054 248 194MAY 005 286 291JUN 038 421 459JUL minus100 178 077AUGSEPOCT minus126 525 399NOV minus013 265 252DEC 004 180 184

2018

JAN minus005 129 124FEB 006 188 194MAR minus094 245 151APR minus058 287 229MAY minus029 377 348JUN minus252 433 181JUL minus283 823 541AUG minus193 921 728

vasive C4 tropical grasses Scrub species of higher IVI hada similar LAI (12) although the magnitude of the LAI of Plaevigata stood out (Table 2)

32 Machine learning ensembles as predictors of eddycovariance GPP

In this section we describe the modeling with EML using lo-cal remotely sensed data from the Bernal site to predict GPPat the same site and then the agreement between EML GPPpredictions and EC-derived GPP The AutoML function gen-erated 1031 models with an average deviance of 135 whilethe deviance of the leader model was 063 in the trainingdataset (Table 3) A total of 11 models of type GBM andfive XGBoost models were in the top 30 models along withthe nine best-of-family ensembles and five all-models ensem-bles The weighted variable importance on the leader boardwas higher for the LAI from the MOD15 and MCD15 prod-ucts (17 and 14 ) The PsnNet EVI (MOD17) FPAR(MCD15) green atmospherically resistant vegetation index(GARI) and MODIS reflectance band 13 had an importancehigher than 3 (59 54 42 36 and 30 ) LAI(MCD13 and MOD13) PsnNet and the FPAR (MOD15)were the more important variables (20 17 13 and10 ) in the top nonstacked model a GBM model that wasranked in fourth place

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

376 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 6 Latent heat flux daily trend at Bernal during different months emphasizing nighttime λE Panel (a) shows months with low rainfallin the previous month and predominantly negative λE during nighttime (these months had low rainfall 017 113 and 033 mm rainfall forJanuary March and December) Panel (b) shows months with low rainfall in the previous month and positive λE after sunset (67 73 and207 mm rainfall for February April and May) Panel (c) shows months during the rainy season with positive λE mainly due to soil wetnessand antecedent rainfall of 43 202 9 190 and 34 mm for June July August October and November Plot (c) is out of scale on the y axis forcompatibility with the other plots

Table 2 Importance value index (IVI) and leaf area index (LAI) of the main species present at the Bernal Quereacutetaro study site

Species Plant type IVI SEMa LAI SEM

Coryphantha cornifera Cactus 007 027Bouvardia ternifolia Herb 007 027Karwinskia humboldtiana Shrub 007 027Forestiera phillyreoides Shrub 009 027Ferocactus latispinus Cactus 009 027Cylindropuntia leptocaulis Cactus 009 027Asphodelus fistulosus Shrub 009 019Brickellia veronicifolia Shrub 010 027Dalea lutea Shrub 011 015Eysenhardtia polystachya Legume 013 015Myrtillocactus geometrizans Cactus 014 027Schinus molle Shrub 014 019Jatropha dioica Herb 015 019Mammillaria uncinata Cactus 016 012Opuntia tomentosa Cactus 017 011Opuntia robusta Cactus 023 007Opuntia hyptiacantha Cactus 026 007Mimosa monancistra Legume 028 012Mimosa depauperata Legume 031 012Zaluzania augusta Shrub 033 010Viguiera linearis Herb 036 011Acacia schaffneri Legume 041 007 113 015Prosopis laevigata Legume 041 007 148 012Acacia farnesiana Legume 056 009 112 037Cylindropuntia imbricata Cactus 074 007 113 011

a SEM standard error of the mean

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 377

Figure 7 Monthly rainfall and land surface temperature (LST) dur-ing years (a) 2017 and (b) 2018 at the Bernal site The LST valuescorrespond to the 1330 (LST day) and 0130 (LST night) MODISAqua satellite overpasses

Predictions of GPP in the testing dataset showed disper-sion in the lower range of the scale of measurement andthe correlation was 094 (Fig 9a) The final prediction ofGPP for the whole dataset had a probability of agreement(θs) of 058plusmn 001 (parameter estimate and standard error)α =minus00616plusmn 011 and β = 10133plusmn 002 suggesting agood fit with low fixed and proportional bias (Fig 9b) Theprobability of agreement decreased slightly at the lower andupper range of the scale of measurement (Fig 9c) indicatingthat the EML model would predict GPP without increasingthe bias particularly in the range from 0 to 4 gCmminus2 dminus1However the value of θs should be higher than 095 so as toconsider EC measurements and EML to be interchangeableThe correlation of 099 (r2

= 098) for the data in Fig 9bcould be misleading as it would suggest a very good fit

Using only the five of the more important variables namedabove to generate an EML resulted in an XBoost leadermodel with 273 deviance and a total of 1094 models Thetop 30 models were 16 XBoost and GBM models and the

best-of-family ensembles started to show up in 12th placeAlthough the number of runs was the same (20) the AutoMLfunction increased the number of produced models but thesmaller set of explanatory variables constrained the ability toidentify features contributing to better models Using anotherset of five randomly selected explanatory variables (one veg-etation index and four MODIS bands) resulted in a leadermodel with 252 deviance out of 1024 models but this timethe leader was the best-of-the-family ensemble Using only afew variables was considered to increase the deviance com-pared to the average deviance of 135 obtained during thetraining phase and using all available variables

An important question for modeling upscaling is the ca-pacity to extrapolate results temporally and spatially herewe explored the latter posing the following question wouldpredictions of GPP from EML for an EC site agree withEC observations from another site with ldquosimilarrdquo environ-mental conditions First an EML solution was found train-ing 80 of the Santa Rita dataset and obtaining a best-of-family ensemble with 023 deviance out of 634 trainedmodels (hereafter this model is referred to as the Santa Ritamodel) Then the environmental and remotely sensed datafrom the Bernal site were fed into the Santa Rita model thiswould be an external validation dataset However agreementwas not good the mean value of θs was 015plusmn 001 withα =minus10822plusmn 009 and β = 058127plusmn 002 The value ofθs was not constant across the range of measurement and de-creased rapidly after 2 gCmminus2 dminus1 (Fig 10b) Because thebias was important predictions were adjusted using Eq (6)showing some improvement with r = 078 (Fig 10a) Com-paring Figs 5b and 6a it is evident that an EML modelextrapolation to other conditions is noisier ie Santa Ritamodel trying to represent the ecosystem function at BernalNotwithstanding some of the most important variables wereshared by both EML ensembles Bernal and Santa Rita in thecase of Santa Rita LAI from MOD15 MYD15 and MCD15had 350 48 and 31 of variable importance and theFPAR from MOD15 was 12

33 MODIS as predictor of eddy covariance GPP

MODIS is important because it overpasses every point ofthe earth every 1 or 2 d and it implements a GPP prod-uct (MOD17) that has helped track the response of the bio-sphere to the environment since 2000 The product MOD17has been validated against many EC sites but few validationsites correspond to deserts and semideserts (Running et al2004) The GPP MOD17 underestimated the GPP derivedfrom EC data at Bernal (Fig 11a) In a similar bell-shapeddistribution of θs as in the case of the extrapolation of theSanta Rita site (Fig 10b) here the θs was not constant acrossthe range of measurement mean θs was 024plusmn 013 withα = 000047plusmn0087 and β = 048749plusmn002 (Fig 11c) Ad-justing MOD17 estimates with Eq (6) improved the relation-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

378 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 8 Net ecosystem exchange (NEE) and photosynthetic active radiation (PAR) at the Bernal site in (a) 2017 and (b) 2018 Negativevalues in the CO2 flux indicate photosynthesis The gray shadow is the standard error of the mean for each month at any given hour

Table 3 Leader board of EML models for the Bernal site using 85 of day observations as training dataset NA denotes the outcome wherethe type of model was not present in the 30 top-performing models according to deviance

Type of model Number of models Average deviance

All models Top models Leader model

Stacked ensembleAll models 20 0852 0834Best of family 20 0812 0703 0633GBM 453 1366 0796DRF 20 1108 NAXGBoost 277 1385 0778XRT 20 1077 NADeep learning 201 2072 NAGLM 20 1474 NATotal 1031 1356

ship but note that the value of r (076) was the same for theoriginal MOD17 and the adjusted MOD17 (Fig 11b)

34 Prediction eddy covariance GPP withordinary-least-squares multiple regression

OLS is a common estimation method for linear models andhere this model appeared as adequate judging by the gen-eral distribution of predictions (Fig 12a) and the probability-of-agreement plot (Fig 12b) A total of 14 variables wereincluded in the model all of them with VIF values lowerthan 70 (Appendix C) the VIF statistic quantifies the sever-ity of multicollinearity and an acceptable threshold is 10The most significant variables were the EVI from MYD13and Daymet variables precipitation shortwave radiation and

minimum and maximum temperatures (see Appendix B forvariable details) Variables with high coefficient values wereMODIS reflectance band 14 (917) the EVI from MYD13(853) and the NDVI (39) while Daymet temperatures hadsmall coefficientsminus023 for maximum temperature and 017for minimum temperature The θs decreased slightly at theend of the measurement range mean θs was 05plusmn0014 withα = 018845plusmn0137 and β = 094966plusmn0031No correctionfor this model was calculated since α was close to 0 β wasclose to 1 and the EML model had a higher θs

35 Agreement

The probability of agreement (θs) was the statistic to deter-mine if the mean responses were in agreement Comparing

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 379

Figure 9 Agreement between predictions of GPP obtained with machine learning algorithms or derived from eddy covariance measurementsat the Bernal site (a) Example of one run of predictions of GPP in the test dataset from the Bernal site using the leader model of an ensembleof machine learning algorithms (EML) the test dataset was 15 of data with r = 094 (b) Predictions for the complete dataset withr = 099 the diagonal line is the 1 1 agreement (c) Function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreementbetween methods of estimation of GPP corresponding to plot (b) The horizontal axis (s) represents the magnitude of the measurement thevertical axis is the probability of agreement for the measurement and the red line is the confidence interval p lt 005

Figure 10 (a) Adjusted predictions of GPP for the complete dataset from the Bernal site using the leader model of the final ensembleof machine learning algorithms (EML) derived from the Santa Rita site compared to estimates of GPP from EC data with r = 078 (b)Respective function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreement between methods of estimation of GPP ECdata from the Bernal site and EML model for the Santa Rita site The horizontal axis (s) represents the magnitude of the measurement thevertical axis is the probability of agreement for the measurement and the red line is the confidence interval p lt 005

the confidence intervals (CIs) for θs the best modeling ap-proach was the EML because its CI (056ndash059) was differentfrom that of the OLS model (047ndash052) More importantlyfor the EML and OLS models the values of θs had little vari-ation across the range of GPP estimates However the CIsof the EML and OLS models were similar in their bias es-timates α and β and both models had no bias (p lt 005)Altogether the best result was the EML ensemble using en-vironmental and remotely sensed data corresponding to thesame site ie Bernal (Fig 10c) This kind of EML wouldbe useful for gap filling or the evaluation of GPP time se-ries of the site that generated the model Machine learningalgorithms can fill gaps longer than 30 d (Kang et al 2019)

The second option to estimate GPP using the samedatasets was the multiple-regression OLS model (Fig 12a)The multiple regression is straightforward and here multi-collinearity was not a problem The EML ensemble and theOLS regression have the highest values of θs (058 and 05respectively Fig 12b) Higher values for θs are desirable(gt 095) and this could be achieved by increasing the samplesize relaxing the tolerable magnitude for agreement (here itwas set at 1 gCmminus2 dminus1) or perhaps using different forcingvariables

The MODIS estimates were a third-best alternative sincethe mean θs was 024 In the present study we used a splineto fill the data to a daily time step since the MCD17 is an8 d composite product but a similar result was obtained if

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

380 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 11 (a) MOD17 GPP daily time step from the Bernal site versus estimates of GPP from EC data using the complete dataset at adaily time step derived from spline for MOD17 GPP (copy) with r = 076 or 8 d composite estimates obtained by 8 d averages of EC GPP() with r = 076 (b) Adjusted predictions of GPP from MODIS (c) Corresponding function of probability of agreement for Fig 7a using1 g Cmminus2 dminus1 as a tolerable agreement between methods of estimation of GPP EC data from the Bernal site and MODIS MOD17 Thehorizontal axis (s) represents the magnitude of the measurement The vertical axis is the probability of agreement for the measurement

Figure 12 (a) Ordinary-least-squares multiple-regression estimates of GPP for the complete dataset from the Bernal site versus estimates ofGPP from EC data with r = 082 (b) Respective function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreement betweenmethods of estimation of GPP EC data from the Bernal site and OLS multiple regression The horizontal axis (s) represents the magnitudeof the measurement The vertical axis is the probability of agreement for the measurement

the EC GPP was rescaled and compared to the original 8 dMCD17 data (Fig 11a) The GPP from MODIS was an un-derestimate of EC GPP and when the estimates were ad-justed (Eq 6 Fig 11b) the performance was not better thanthe OLS (model not shown) The MODIS land cover classi-fication represented this site as grassland (MCD12) and thiscould be another reason for the poor agreement besides theassumptions made in the MODIS algorithm regarding the εand εmax parameters and the response of vegetation to VPDAgreement of MODIS GPP is crucial because MODIS prod-ucts are frequently used in country-wide assessments of thecarbon cycle and can influence public policies

The model with least agreement resulted when the EMLensemble generated from the Santa Rita site was used to pre-dict GPP at Bernal Machine learning models can make pre-dictions but their usefulness decreases when they are usedoutside the context of where they were built while process-

based mechanistic models have this ability Although the Au-toML function in H2O is designed to protect against over-fitting using cross-validation runs (Michailidis 2018) in ourstudy the Santa Rita model could not be generalized to repre-sent the Bernal site GPP process probably because the vari-ables and features selected for Bernal or Santa Rita were dif-ferent during the AutoML workflow

The Santa Rita model was good at predicting GPP at thatsite with a deviance of the leader model of 023 while atBernal the deviance of the leader model was 063 (Table 3)indicating that the Santa Rita EML ensemble was at leastas good a model as the EML at Bernal (not shown) TheGPP time series for Santa Rita was about 4 times the sizeof the Bernal dataset and therefore the deviance was lowerHowever when the Santa Rita model was used with Bernaldata the mean θs was 016 indicating that the agreement wasinsufficient Eyeballing the predictions in Figs 9a 10b and

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 381

11a and their corresponding correlation values (078 076and 082 for adjusted EML Santa Rita adjusted MODIS andmultiple-regression OLS) it could be argued that these mod-els were comparable However their θs plots present a dif-ferent perspective

4 Discussion

41 Model agreement

The Santa Rita model had higher θs when it extrapolatedfor low GPP values for the Bernal site suggesting that theSanta Rita model had a better skill when predicting GPPclose to 0 and even negative GPP values (Fig 10b) Al-though the Bernal and Santa Rita sites had similar vegeta-tion and climate classifications they are more than 1600 kmapart and rainfall monthly distribution is different The GPPseasonal cycle at Bernal started in February and steadily in-creased to a maximum during July and August (Table 1)At Santa Rita the GPP was low from January until mid-July (lt 05 gCO2 mminus2 dminus1) and then increased sharply to amaximum level in mid-August (Joiner and Yoshida 2020)What matters eventually for machine learning methods ishow well the predictor space rather than geographic spaceis sampled (Jung et al 2020) Incorporating data from morehumid (semiarid) sites could improve the GPP predictions bya machine learning method Not only more EC sites but alsosites representing a water availability gradient would be im-portant for semiarid ecosystems representing long spells andthe influence of oceanic oscillations and monsoon rains

All the models presented here used transient data to rep-resent GPP specifically at the 1 d time step Besides theradiation-related variables two sources of rainfall were usedas forcing variables but evapotranspiration (ET) was notused The MOD16A2 version 6 is available as an 8 d gap-filled product and could be included in EML or OLS mod-els However a more general representation of the carboncycling could be achieved when including variables that rep-resent annual or seasonal timescales of soil water evapo-ration or precipitation (Scott and Biederman 2019) Scottet al (2015) suggest that real lags between precipitation andproductivity that may impart legacy effects may also be par-tially masked by using ET as ET more carefully tracks pro-ductivity when soil moisture storage is accessed The occur-rence of off-season rainfall dry spells and carryover effectscould be parameterized as windowed events of a given dura-tion A window would be a period with distinct time bound-aries the window allows the grouping of records with similarfeatures The effect of the window at a given time could berepresented as moving weights as the point in time in ques-tion is closer or farther from the window

In our study the metric to assess model agreement wasthe probability of agreement (θs) and their bias parame-ters Many other metrics can be used to evaluate model per-

formance such as the root mean square error r r2 or themodel efficiency factor (MEF) presented by Nash and Sut-cliffe (1970) In particular the MEF is a step in the Fluxnetdata processing pipeline (Pastorello et al 2020) A MEFvalue close to 1 represents a high correlation and lower bi-ases (Joiner and Yoshida 2020) Therefore the calculatedMEF for the Bernal site EML (098) would suggest very goodpredictive performance while the θs of 06 for this model in-dicates a more modest performance The θs for a particularvalue of the measurand is the probability that the differencebetween two measurements made by different systems fallswithin an interval that is deemed to be acceptable (Stevenset al 2015) In the present study we used 1 gCmminus2 dminus1 as acritical value defining an acceptable difference If this valueis smaller then the probability of agreement would decreasefor this same model In such a case it would be less likelythat the predictors agree considering that the θs takes intoaccount both the difference in the function means at a givenvalue of the measurand and the uncertainty in its estimation

42 MODIS discrepancies

Different authors have reported discrepancies betweenMCD17 and EC estimates of GPP in semiarid regions Ex-amining MODIS discrepancies in these ecosystems is im-portant because the errors induced by cloud cover are ex-pected to be minimal and other effects can be identified(Gebremichael and Barros 2006) The GPP of MOD17 didnot relate well (EC= 011+ 017MODIS r2

= 067) withestimates of EC GPP in semidesert vegetation of the Sahel(Tagesson et al 2017) With data from different types of veg-etation in the Heihe basin in China MODIS17 overestimatedthe GPP from EC (EC= 115+024MODIS r2

= 068 Cuiet al 2016) For scrub sites in Mexico the relation be-tween GPP calculated from EC and MOD17 was not good(MODIS= 38382+ 0467EC r2

= 06 Delgado-Balbuenaet al 2018) In arid and semiarid ecosystems in China op-timizing parameters of the MODIS GPP model with site-specific data improved the estimate to explain 91 of thevariation in the GPP of the data observed by EC (Wang et al2019) These same authors propose improving the land useclassification used by the MOD17 algorithm and recalibrat-ing light use efficiency parameters to solve the GPP estima-tion problem Gebremichael and Barros (2006) examined anopen shrubland site in a semiarid region of Sonora Mexicoand their analysis of the temporal evolution of the discrep-ancies with MODIS GPP suggested revisiting the light useefficiency parameterization especially the functional depen-dence on VPD and PAR and water stress or soil moistureavailability

The relationship between the GPP MODIS and the GPPEC presented in Sect 33 is an approximation because theuncertainty in the respiration component must be consid-ered The empirical relationship between nocturnal NEE andsoil temperature has been used to represent ecosystem res-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

382 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

piration (Reco) in order to separate the processes that con-tribute to daytime NEE (Richardson and Hollinger 2005Wofsy et al 1993) Nighttime NEE should be equal to therates of autotrophic and heterotrophic respiration while dur-ing daytime NEE should be equal to the combined rates ofcarboxylation and oxidation of Rubisco autotrophic respi-ration and heterotrophic respiration Then the GPP can becalculated as the difference between daytime NEE and Recoestimated through its relationship with temperature (Gouldenet al 1996) In the present study Reco was calculated basedon soil and air temperature following the procedure of Reich-stein et al (2005) implemented in REddyProc (Wutzler et al2018) Although it is possible to measure or model the parti-tion of respiration (Running et al 2004 Wang et al 2018)the presence of cacti complicates the calculation assum-ing that all nighttime flux represents ecosystem respiration(Owen et al 2016 Richardson and Hollinger 2005) Whilesoil respiration tends to be temperature-limited when soilmoisture is nonlimiting in temperate ecosystems in range-land ecosystems the controls of soil CO2 efflux were photo-synthesis soil temperature and moisture (Roby et al 2019)In our study the instrumentation did not include measure-ments of plant or soil respiration partition to validate theRecoestimates

A problem regarding data comparison from remote orbitalsensors and terrestrial observations is that different quanti-ties are fundamentally measured MODIS measures the radi-ation reflected by the earthrsquos surface in two spectral bands at250 m spatial resolution per pixel five bands at 500 m and29 bands at 1000 m The EC technique has a footprint tomeasure CO2 that varies dynamically in shape and size butis generally considered to be 1 km2 in this study the foot-print radius was 600 m To solve the scaling MODIS prod-ucts related to the carbon cycle have been validated with theEC technique and biometric measurements on several spa-tial scales using process-based ecosystem models and char-acterizing areas up to 47 km2 around the EC measuring tower(Cohen et al 2003)

At Bernal the vegetation was heterogeneous This situa-tion was represented by the heterogeneity in vegetation ac-tivity in the 4 pixels used spanning about 02 units of NDVIat the peak of the season activity (Fig 1) The standard errorof the IVI differed by 1 order of magnitude among speciesAlthough more important species had a lower standard errortheir means were similar indicating that they were equallyabundant at all sampling plots The higher importance ofsome species (Table 2) was explained by selective grazingndashbrowsing behavior and the dispersion caused by cattle by ei-ther ingesting or transporting seeds or plant parts (Belaynehand Tessema 2017) Regarding landscape heterogeneity thetower fetch was predominantly from the northeast capturingthe less heterogeneous area of the site but also the more ac-tive according to the NDVI (Fig 1)

The thorny scrub examined had two vegetation layersthe overstory layer mainly consisting of mesquite acacia

and cacti and the understory layer that included grasses andherbs Cattle preferentially graze the understory and becausethey eat using their tongue they will avoid browsing thornyspecies unlike goats or deer which use their lips Withoutgrazing management overtime the competition balance willfavor bush species resulting in encroaching and the under-story will be stressed only unpalatable species or those withtheir growing meristems very close to the ground would sur-vive Representing the structure and functioning of these twolayers using MODIS is possible (Liu et al 2017) RecentlyHill and Guerschman (2020) presented a MODIS product de-rived from MCD43A4 to estimate the fractions of photosyn-thetic and nonphotosynthetic vegetation and the remainingfraction of bare soil These developments could improve theMOD17 GPP estimates since its model represents a homoge-neous single vegetation layer All these considerations helpto understand the low θs between MOD17 estimates of GPPand EC-derived GPP

43 Forcing variables and machine learning

Forcing variables in this study were gridded meteorologi-cal data and MODIS spectral bands and products but thevariables mostly included in the ML algorithms were LAIand FPAR accounting for 364 and 549 of the vari-able importance index at Bernal and Santa Rita respectivelyin their leader board This result supported the view thatCO2 fluxes can be represented by ML algorithms exclu-sively using remotely sensed data (Tramontana et al 2016Joiner and Yoshida 2020) Using neural networks Joiner andYoshida (2020) showed that with only satellite reflectancefrom MODIS and top-of-atmosphere PAR it is possible tocapture a large fraction of GPP variability They argue thatvegetation indices may reduce the information content ofthe underlying reflectance when compressing the informa-tion from two or more bands into a single index In our studyreflectance bands were not often included in the ML modelsThis was most likely because MOD09GA and MODOCGAprovide estimates of surface reflectance uncorrected for theillumination and viewing geometry while the MCD43 usedin Joiner and Yoshida (2020) uses a surface bidirectional re-flectance distribution function (BRDF) model that improvesthe quality of surface albedo retrievals By propagating theBRDF correction in the MODIS processing pipeline the veg-etation indices would more likely relate better to the GPPFuture modeling efforts should use benchmark scenarios ac-cording to sets of forcing variables already identified as use-ful

44 Carbon flux

Although the carbon balance in ecosystems is influenced bydifferent factors such as soil type and amount of nutrientsthe relationship with soil temperature and humidity is partic-ularly strong (Anderson-Teixeira et al 2011 Hastings et al

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 383

2005) How much of the rainwater the system can retain orlose has been described as the leakiness of the system (Guer-schman et al 2009) More than immediate incident rainfallthe available soil moisture and its redistribution are impor-tant in semiarid ecosystems including steam flow prefer-ential flow paths hydraulic lift and others (Barron-Gaffordet al 2017) At Bernal when the sum of the precipitation ofthe current month and that of the previous month was con-sidered the correlation with NEE was minus07 suggesting thatcontinuous availability of soil moisture is important for theabsorption of CO2 in this environment This result is con-sistent with other studies in which the relationship betweenthe net productivity of the ecosystem (NEP) and precipita-tion is initially positive but is leveled from 1000 to 1500 mmannually (Xu et al 2014) The hydraulic redistribution ofwater from moist (deeper) to drier soils through plant rootstended to increase modeled annual ecosystem uptake of CO2this process was identified at US-SMR (Fu et al 2018 Scottet al 2003)

The leakiness is highly dependent upon vegetation frac-tional cover the proportion of the surface occupied by baresoil and vegetation photosynthetically active vegetation andnonphotosynthetically active vegetation such as litter woodand dead biomass (Guerschman et al 2009) All these vege-tation fractions have a water storage capacity and can reducethe amount of effective rainfall available to plant roots It ispossible that the canopy of the bushes completely interceptedthe rainfall in some months because the scrub can interceptup to 20 of the precipitation and its canopy storage ca-pacity is 097 mm (Mastachi-Loza et al 2010) Consideringonly the daily rain events greater than 5 mm the correlationbetween precipitation and NEE rose to minus072 In the presentstudy the interception of rain by vegetation surfaces was notcalculated but the results suggest that it would be impor-tant to explore the relationship between net precipitation andNEE

The average NEE at a global level is minus156plusmn284 gCmminus2 yrminus1 (Baldocchi 2014) The highest frequencyamong sites that measured NEE with EC occurs from minus200to minus300 gCmminus2 yrminus1 but in sites with biometric measure-ments the peak occurs at minus100 gCmminus2 yrminus1 (Xu et al2014) Using the daily averages of Table 2 the averageNEE during the measurement period was minus078 gCmminus2 dminus1

and annually would be minus2835 gCmminus2 yrminus1 This resultwas higher than the annual values of the induced grass-land and scrubland vegetation characterizing the SonoranDesert plains (138 and 130 gCmminus2 yrminus1 Hinojo-Hinojoet al 2019) In New Mexico NEE values measured withEC are between 35ndash50 gCmminus2 yrminus1 in desert grassland and344ndash355 gCmminus2 yrminus1 in mixed coniferous forest (Anderson-Teixeira et al 2011) In a dryer region the sarcocaulescentscrubland of Baja California in Mexico the NEE was minus39and minus52 gCmminus2 yrminus1 in 2002 and 2003 respectively (Hast-ings et al 2005) The NEE measured here was within therange of NEE 03plusmn 02 kgCmminus2 yrminus1 for grasslands and

shrublands in Mexico (Murray-Tortarolo et al 2016) Al-though the measurements of the present study had gaps andwere compared with annual studies we considered that thereported value of C was representative of the main season ofgrowth of this type of scrub

45 Ecosystem management

Overgrazing is an appreciation relative to the grazing produc-tive system where the forage resource is overused in a mixedshrubndashgrass ecosystem such as Bernal it usually refers tothe understory Overgrazing means that the plant regrowth isreadily grazed tillers and root reserves are lost and eventu-ally the plant may die Although the Bernal site was over-grazed the carbon fluxes indicated that the plant communitywas photosynthetically active in both dry and rainy seasonsIt is fair to assume that water in the soil was not limiting forthe deep-rooted bush species and that was the reason whyit was possible to maintain the photosynthetic function dur-ing rainless months However this primary production wouldnot have tangible benefits for ranchersrsquo production systemsince no edible biomass would be produced for the cattleFrom the point of view of carbon capture the system accu-mulated nonlabile biomass that would remain in the systemfor a longer time compared to a grassland ecosystem (al-though it would be necessary to determine the partition ofsaid shrub biomass) However the overgrazing condition af-fects the biomass of the understory roots and consequentlythe carbon pool in the soil

In the short term it can be thought that the estimated neg-ative carbon flows had a favorable effect on the environmen-tal agenda As time passes it is possible that the gaps be-tween the individual shrubs of the overstory expand and thiswould have had an effect on soil erosion It is also possiblethat the water stored in the soil profile used by the bushesgradually decreases to the point of causing drought changesin phenology and advancements of the desertification pro-cess There are many opportunities for ecology conservationand livestock-oriented management these may include con-trolled grazing or propagating native thornless shrub speciesIf ranchers do not identify a benefit in the vegetation thenthey will be tempted to remove it as it occurred at thestudy site Because of its wide coverage and availability theMODIS GPP product accuracy is important in representingthe carbon cycle raising awareness and monitoring advance-ment of environmental decisions

Although we found that EML was a good option for mod-eling the GPP of a site what is really needed to evaluate theperformance of semiarid ecosystems is a spatial representa-tion of the carbon flux This is a problem for an underrepre-sented area regarding instrumented EC towers However theEML could be designed to take into account the explanatoryvariables in a spatiotemporal continuity

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

384 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

5 Conclusions

The best modeling approach was the ensemble of ma-chine learning The second option to estimate GPP was themultiple-regression OLS and the third alternative were theMODIS estimates Machine learning was a good option topredict GPP in the local context where it was generated oth-erwise its performance was not good This was demonstratedwhen using the EML from Santa Rita and trying to predictfluxes at Bernal Nevertheless a machine learning modelwould be useful for gap filling or the evaluation of GPP atthe same site The GPP estimates of a given model can be ad-justed using the bias parameters of the probability of agree-ment to improve the relationship Further research explor-ing the usefulness of EML to predict GPP in other contextscould use EML to calibrate mechanistic model parametershybrid approaches integrating EML and biologically mean-ingful mechanistic models or EML representing the spatialvariability of the landscape

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 385

Appendix A

Table A1 Threshold of velocity friction (ulowast) above which nighttime fluxes were considered valid Estimates were obtained using the Rpackage REddyProc We used only the flux records with ulowast equal to or higher than the corresponding mean ulowast threshold for each yearBound values are a 95 confidence interval NA not applicable

Aggregation method Year Season Mean Lower bound 5 Upper bound 95

ulowast (msminus1)

Single NA NA 0193 0140 0209Year 2017 NA 0194 0128 0227Year 2018 NA 0193 0132 0225Season 2017 2016012 0194 0128 0227Season 2017 2017003 0215 0141 0297Season 2017 2017006 0194 0128 0227Season 2017 2017009 0181 0144 0208Season 2018 2017012 0149 0116 0232Season 2018 2018003 0271 0134 0297Season 2018 2018006 0193 0132 0221

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

386 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix B

Table B1 MODIS reflectance bands and products database

Product name Satellite layer Spatial Temporal Spectralresolution resolution coverage

(d) (nm)

MOD09GA Band 1 500 m 1 620ndash670Band 2 841ndash876Band 3 459ndash479Band 4 545ndash565Band 5 1230ndash1250Band 6 1628ndash1652Band 7 2105ndash2155

MODOCGA Band 8 1 km 405ndash420Band 9 438ndash448Band 10 483ndash493Band 11 526ndash536Band 12 546ndash556Band 13 662ndash672Band 14 673ndash683

MOD17A2H Gross primary production (GPP) 8

Net photosynthesis (PsnNet)

MOD15A2H Fraction of photosynthetically active radiation (Fpar) na

Leaf area index (LAI)

MOD11A2 Land surface temperature and emissivity of day (LST day) 1 km

Land surface temperature and emissivity of night (LST night)

MYD13A1 Enhanced vegetation index (EVI) 500 m 16

MYD15A2H Fraction of photosynthetically active radiation (Fpar) 8

Leaf area index (LAI)

MYD11A2 Land surface temperature and emissivity of day (LST day) 1 km na

Land surface temperature and emissivity of night (LST night)

MCD15A2H Fraction of photosynthetically active radiation (Fpar) 500 m

Leaf area index (LAI)

na not applicable

Table B2 Daily vegetation indexes computed using the MODIS reflectance bands described in Table 1

Index Formula Reference

Simple ratio (SimpleR) SimpleR= Band 2Band 1

(Hill et al 2006)Moisture stress (MoistS) MoistS= Band 6

Band 2

Disease stress index (DSI) DSI= Band 2+Band 4Band 6+Band 1

Redndashgreen ratio index (RGRI) RGRI= Band 1Band 4

Normalized difference vegetation index (NDVI) NDVI= Band 2minusBand 1Band 2+Band 1

Normalized difference water index (NDVI_w) NDVI_w =Band 2minusBand 5Band 2+Band 5

Green leaf index (GLI) GLI= 2lowastBand 11minusBand 14minusBand 92lowastBand 11+Band 14+Band

(Henrich et al 2012)Green atmospherically resistant vegetation (GARI) GARI= Band 5minus(Band 11minus(Band 9minusBand 14))

Band 5minus(Band 11+(Band 9minusBand 14))

Enhanced vegetation index (EVI) EVI= 25lowast Band 2minusBand 1Band 2+(6lowastBand 1)minus(75lowastBand 9)+1

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 387

Table B3 Daymet meteorological database

Variable Spatial resolution Temporal resolution Reference

Precipitation (Dayprc)

1 km Daily (Thornton et al 2017)Shortwave radiation (Daysrad)Maximum air temperature (DayTmax)Minimum air temperature (DayTmin)Water vapor pressure (Dayvp)

Table B4 Precipitation data

Satellite Product name Spatial resolution Temporal resolution

Global Precipitation Measurement (GPM) 3IMERGDF v006 010Daily

Tropical Rainfall Measuring Mission (TRMM) 3B42 v007 025

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

388 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix C

Table C1 Analysis of variance for GGP derived from EC data atthe Bernal site Variable details are described in Appendix B

Model term Coefficient SEa DFb SSc F p VIFd

Regression 14 178048Constant minus1318 648 minus203 004R_08_40542e

minus212 060 1 2680 minus352 0 220R_13_66267e

minus286 123 1 1161 minus232 002 155R_14_67368e 917 259 1 2706 353 0 122RGRI6

minus083 037 1 1100 minus225 003 143GARIf 201 108 1 755 187 006 237EVIMYDg 853 201 1 3903 425 0 294NDVI_wf 390 155 1 1367 251 001 271PsnNetg 0013 0 1 7530 590 0 383LstNgtMYDg 005 002 1 818 194 005 201Dayprch 014 003 1 3590 407 0 192Daysradh 001 0 1 5121 486 0 501DayTmaxh

minus024 006 1 3658 minus411 0 689DayTminh 018 005 1 3222 386 0 496TRMMi minus004 002 1 1087 minus224 003 141Error 403 87297Total 417 265345

a Standard error of coefficient b Degrees of freedom c Adjusted sum of squares d Variance inflation factore Variable name denotes the band number and spectral bandwidth of MODIS (Moderate Resolution ImagingSpectroradiometer) f Vegetation indices RGRI is redndashgreen ratio index and GARI is green atmosphericallyresistant vegetation index details of formula are described in Appendix B g Layers from MODIS productsEVIMYD is enhanced vegetation index from MYD13 PsnNet is photosynthesis product form MOD17LstNgtMYD is nighttime land surface temperature emissivity from MYD11 h Variables obtained fromDaymet daily dataset DayTmin is minimum temperature DayTmax is maximum temperature Daysrad isshortwave radiation Dayprc is precipitation i Daily rainfall rate from 3B42 TRMM (Tropical RainfallMeasuring Mission)

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 389

Code and data availability Database and programming code areavailable at httpsdoiorg105281zenodo3598595 (Guevara-Escobar et al 2020)

Author contributions AGE contributed to conceptualization for-mal analysis investigation methodology software supervisionvalidation visualization and writing of the original draft as wellas review and editing EGS was in charge of funding acquisitionproject administration and writing and review MCJ contributed toinvestigation methodology validation visualization and writing ofthe original draft as well as review editing and communicationMEQB helped with investigation validation writing and reviewHSA contributed with conceptualization writing and review ICAacutehelped with investigation writing and review VHCS contributedwith writing review and funding acquisition

Competing interests The authors declare that they have no conflictof interest

Acknowledgement We thank NASA and Ameriflux for data avail-ability Data filtering was carried out by the cluster Abacus ICentre of Applied Mathematics and High Performance ComputingABACUS-CINVESTAV

Financial support This research has been supported by CONA-CyT ndash SEMARNAT (grant no 2014-1-(249407))

Review statement This paper was edited by Paul Stoy and re-viewed by three anonymous referees

References

Aguirre-Diacuteaz G J Aguilloacuten-Robles A Tristaacuten-Gonzaacutelez MLabarthe-Hernaacutendez G Loacutepez-Martiacutenez M Bellon Hand Nieto-Obregoacuten J Geologic setting of the Pentildeade Bernal Natural Monument Quereacutetaro Meacutexico Anendogenous volcanic dome Geosphere 9 557ndash571httpsdoiorg101130GES008431 2013

Anav A Friedlingstein P Beer C Ciais P Harper AJones C Murray-Tortarolo G Papale D Parazoo N Cand Peylin P Spatiotemporal patterns of terrestrial grossprimary production A review Rev Geophys 53 785ndash818httpsdoiorg1010022015RG000483 2015

Anderson-Teixeira K J Delong J P Fox A M Brese D Aand Litvak M E Differential responses of production and res-piration to temperature and moisture drive the carbon balanceacross a climatic gradient in New Mexico Glob Change Biol17 410ndash424 httpsdoiorg101111j1365-2486201002269x2011

Baldocchi D Measuring fluxes of trace gases and energy be-tween ecosystems and the atmospherendashthe state and future of the

eddy covariance method Glob Change Biol 20 3600ndash3609httpsdoiorg101111gcb12649 2014

Barron-Gafford G A Sanchez-Cantildeete E P Minor R LHendryx S M Lee E Sutter L F Tran N Parra EColella T Murphy P C Hamerlynck E P Kumar P andScott R L Impacts of hydraulic redistribution on grassndashtreecompetition vs facilitation in a semi-arid savanna New Phytol215 1451ndash1461 httpsdoiorg101111nph14693 2017

Belayneh A and Tessema Z K Mechanisms of bush encroach-ment and its inter-connection with rangeland degradation insemi-arid African ecosystems a review J Arid Land 9 299ndash312 httpsdoiorg101007s40333-016-0023-x 2017

Bland J M and Altman D G Statistical methods for assessingagreement between two methods of clinical measurement Int JNurs Stud 47 931ndash936 2010

Boettiger C and Eddelbuettel D An introduction to rockerDocker containers 639 for r ArXiv Prepr ArXiv171003675 6402017

Bonilla-Moheno M and Aide T M Beyond deforestationLand cover transitions in Mexico Agric Syst 178 102734httpsdoiorg101016jagsy2019102734 2020

Booker K Huntsinger L Bartolome J W Sayre N Fand Stewart W What can ecological science tell us aboutopportunities for carbon sequestration on arid rangelands inthe United States Glob Environ Change 23 240ndash251httpsdoiorg101016jgloenvcha201210001 2013

Butzer K W and Butzer E K The lsquonaturalrsquovegetation ofthe Mexican Bajiacuteo archival documentation of a 16th-century savanna environment Quat Int 43 161ndash172httpsdoiorg101016S1040-6182(97)00032-3 1997

CICESE C de I C y de E S de E Base de datos climatoloacutegicanacional (CLICOM) Villa Bernal Quereacutetaro available at httpclicom-mexcicesemx (last access 28 August 2019) 2015

Cohen W B Maiersperger T K Yang Z Gower S TTurner D P Ritts W D Berterretche M and Run-ning S W Comparisons of land cover and LAI esti-mates derived from ETM+ and MODIS for four sites inNorth America A quality assessment of 20002001 provi-sional MODIS products Remote Sens Environ 88 233ndash255httpsdoiorg101016jrse200306006 2003

Cui T Wang Y Sun R Qiao C Fan W Jiang GHao L and Zhang L Estimating vegetation primaryproduction in the Heihe River Basin of China withmulti-source and multi-scale data PloS One 11 1ndash20httpsdoiorg101371journalpone0153971 2016

Curtis J T and McIntosh R P The Interrelations of Certain An-alytic and Synthetic Phytosociological Characters Ecology 31434ndash455 httpsdoiorg1023071931497 1950

Delgado-Balbuena J Yepez E A Aacutengeles-Peacuterez G Aguirre-Gutierrez C Arredondo T Ayala-Nintildeo F Bullock S HCastellanos A E Cueva A Figueroa-Espinoza B Garatuza-Payaacuten J Hinojo-Hinojo C Maya-Delgado Y Meacutendez-Barroso L Oechel W Paz-Pellat F Peacuterez-Ruiacutez E R Ro-driacuteguez J C Rojas-Robles N Saacutenchez-Mejiacutea Z M Uuh-Sonda J Vargas R Verduzco V S Vivoni E R andWatts C Flujos anuales de carbono en ecosistemas terrestres deMeacutexico in Estado Actual del Conocimiento del Ciclo del Car-bono y sus Interacciones en Meacutexico Siacutentesis a 2018 p 686Texcoco Estado de Meacutexico Meacutexico 2018

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

390 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Eshel G Dayalu A Wofsy S C Munger J W andTziperman E Listening to the Forest An ArtificialNeural Network-Based Model of Carbon Uptake at Har-vard Forest J Geophys Res-Biogeo 124 461ndash478httpsdoiorg1010292018JG004791 2019

Fu C Wang G Bible K Goulden M L Saleska S RScott R L and Cardon Z G Hydraulic redistributionaffects modeled carbon cycling via soil microbial activityand suppressed fire Glob Change Biol 24 3472ndash3485httpsdoiorg101111gcb14164 2018

Gebremichael M and Barros A P Evaluation of MODISGross Primary Productivity (GPP) in tropical mon-soon regions Remote Sens Environ 100 150ndash166httpsdoiorg101016jrse200510009 2006

Gitelson A A Peng Y Arkebauer T J and Suyker A EProductivity absorbed photosynthetically active radiation andlight use efficiency in crops Implications for remote sensingof crop primary production J Plant Physiol 177 100ndash109httpsdoiorg101016jjplph201412015 2015

Glenn E P Scott R L Nguyen U and Nagler P L Wide-area ratios of evapotranspiration to precipitation in monsoon-dependent semiarid vegetation communities J Arid Envi-ron 117 84ndash95 httpsdoiorg101016jjaridenv2015020102015

Goldstein A Turner W R Spawn S A Anderson-Teixeira K J Cook-Patton S Fargione J Gibbs H KGriscom B Hewson J H and Howard J F Protecting ir-recoverable carbon in Earthrsquos ecosystems Nat Clim Change10 1ndash9 httpsdoiorg101038s41558-020-0738-8 2020

Goulden M L Munger J W Fan S Daube B Cand Wofsy S C Measurements of carbon sequestra-tion by long-term eddy covariance Methods and a criti-cal evaluation of accuracy Glob Change Biol 2 169ndash182httpsdoiorg101111j1365-24861996tb00070x 1996

Grossiord C Buckley T N Cernusak L A Novick K A Poul-ter B Siegwolf R T Sperry J S and McDowell N GPlant responses to rising vapor pressure deficit New Phytol226 1550ndash1566 2020

Guerschman J P Hill M J Renzullo L J Barrett D JMarks A S and Botha E J Estimating fractional coverof photosynthetic vegetation non-photosynthetic vegetation andbare soil in the Australian tropical savanna region upscaling theEO-1 Hyperion and MODIS sensors Remote Sens Environ113 928ndash945 httpsdoiorg101016jrse200901006 2009

Guevara-Escobar A Gonzaacutelez-Sosa E Cervantes-Jimeacutenez MSuzaacuten-Azpiri H Queijeiro-Bolantildeos M E Carrillo-AacutengelesI and Cambroacuten-Sandoval V H Machine learning estimates ofeddy covariance carbon flux in a scrub in the Mexican highlandZenodo httpsdoiorg105281zenodo3598595 2020

Hall P Gill N Kurka M Phan W and Bartz A MachineLearning Interpretability with H2O Driverless AI edited byBartz A H2Oai Inc California US 2019

Hastings S J Oechel W C and Muhlia-Melo A Diur-nal seasonal and annual variation in the net ecosystem CO2exchange of a desert shrub community (Sarcocaulescent) inBaja California Mexico Glob Change Biol 11 927ndash939httpsdoiorg101111j1365-2486200500951x 2005

Henrich V Krauss G Goumltze C and Sandow C Entwicklungeiner Datenbank fuumlr Fernerkundungsindizes vol 45 AK Fern-erkundung Bochum 2012

Hill M Held A Leuning R Coops N Hughes D andCleugh H MODIS spectral signals at a flux tower site Rela-tionships with high-resolution data and CO2 flux and light useefficiency measurements Remote Sens Environ 103 351ndash368httpsdoiorg101016jrse200506015 2006

Hill M J and Guerschman J P The MODIS Global VegetationFractional Cover Product 2001ndash2018 Characteristics of Vegeta-tion Fractional Cover in Grasslands and Savanna Woodlands Re-mote Sens 12 406 httpsdoiorg103390rs12030406 2020

Hinojo-Hinojo C Castellanos A E Huxman T Ro-driguez J C Vargas R Romo-Leoacuten J R and Bie-derman J A Native shrubland and managed buffelgrasssavanna in drylands Implications for ecosystem carbonand water fluxes Agr Forest Meteorol 268 269ndash278httpsdoiorg101016jagrformet201901030 2019

Jaumlrvi L Havu M Ward H C Bellucco V McFad-den J P Toivonen T Heikinheimo V Kolari P Ri-ikonen A and Grimmond C S B Spatial modeling oflocal-scale biogenic and anthropogenic carbon dioxide emis-sions in Helsinki J Geophys Res-Atmos 124 8363ndash8384httpsdoiorg1010292018JD029576 2019

Joiner J and Yoshida Y Satellite-based reflectances capture largefraction of variability in global gross primary production (GPP)at weekly time scales Agr Forest Meteorol 291 108092httpsdoiorg101016jagrformet2020108092 2020

Jung M Schwalm C Migliavacca M Walther S Camps-VallsG Koirala S Anthoni P Besnard S Bodesheim P Carval-hais N Chevallier F Gans F Goll D S Haverd V KoumlhlerP Ichii K Jain A K Liu J Lombardozzi D Nabel J EM S Nelson J A OrsquoSullivan M Pallandt M Papale DPeters W Pongratz J Roumldenbeck C Sitch S TramontanaG Walker A Weber U and Reichstein M Scaling carbonfluxes from eddy covariance sites to globe synthesis and eval-uation of the FLUXCOM approach Biogeosciences 17 1343ndash1365 httpsdoiorg105194bg-17-1343-2020 2020

Kang M Ichii K Kim J Indrawati Y M Park J Moon MLim J-H and Chun J-H New Gap-Filling Strategies forLong-Period Flux Data Gaps Using a Data-Driven ApproachAtmosphere 10 568 httpsdoiorg103390atmos101005682019

Lal R Griffin M Apt J Lave L and Morgan M GManaging Soil Carbon Science 304 393ndash393httpsdoiorg101126science1093079 2004

LI-COR B Eddypro Software instruction manual Nebraska USversion 70 2019

Liu Y Liu R Pisek J and Chen J M Separating overstoryand understory leaf area indices for global needleleaf and de-ciduous broadleaf forests by fusion of MODIS and MISR dataBiogeosciences 14 1093ndash1110 httpsdoiorg105194bg-14-1093-2017 2017

Long S P Marshall-Colon A and Zhu X-G Meetingthe Global Food Demand of the Future by EngineeringCrop Photosynthesis and Yield Potential Cell 161 56ndash66httpsdoiorg101016jcell201503019 2015

Ma X Huete A Yu Q Restrepo-Coupe N Beringer J Hut-ley L B Kanniah K D Cleverly J and Eamus D Param-

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 391

eterization of an ecosystem light-use-efficiency model for pre-dicting savanna GPP using MODIS EVI Remote Sens Environ154 253ndash271 2014

Ma X Migliavacca M Wirth C Bohn F J Huth ARichter R and Mahecha M D Monitoring Plant FunctionalDiversity Using the Reflectance and Echo from Space RemoteSens 12 1248 httpsdoiorg103390rs12081248 2020

Marcolla B Roumldenbeck C and Cescatti A Patterns and con-trols of inter-annual variability in the terrestrial carbon budgetBiogeosciences 14 3815ndash3829 httpsdoiorg105194bg-14-3815-2017 2017

Mastachi-Loza C A Gonzaacutelez-Sosa E Becerril-Pintildea R andBraud I Peacuterdidas por intercepcioacuten en mezquite (Prosopis laevi-gata) y huizache (Acacia farnesiana) de la regioacuten semiaacuterida delcentro de Meacutexico Tecnol Cienc Agua 1 103ndash120 2010

Mauder M and Foken T Documentation and instruction man-ual of the eddy-covariance software package TK3 UniversitaumltBayreuth Abteilung Mikrometeorologie Bayreuth Germany2011

McClaran M A century of vegetation change on the Santa Rita Ex-perimental Range in Santa Rita Experimental Range 100 years(1903 to 2003) to accomplishments and contributions edited byMcClaran M P Ffolliott P F and Edminister C B Tech Co-ords USDA Forest service proceedings RMRS-P-30 TucsonAZ US 16ndash33 2003

McCulley R Jobbagy E Pockman W and Jackson R Nu-trient uptake as a contributing explanation for deep rootingin arid and semi-arid ecosystems Oecologia 141 620ndash628httpsdoiorg101007s00442-004-1687-z 2004

McElreath R Statistical Rethinking A Bayesian Course with Ex-amples in R and STAN CRC Press Boca Raton FL US 2020

Michailidis M How Driverless AI Prevents Overfitting and Leak-age Open Source Lead AI ML available at httpswwwh2oaiblogdriverless-ai-prevents-overfitting-leakage (last access7 July 2020) 2018

Mitchell R and Frank E Accelerating the XGBoost algo-rithm using GPU computing PeerJ Comput Sci 3 e127httpsdoiorg107717peerj-cs127 2017

Monteith J Solar radiation and productivity in tropical ecosys-tems J Appl Ecol 9 747ndash766 1972

Murray-Tortarolo G Friedlingstein P Sitch S Jaramillo V JMurguiacutea-Flores F Anav A Liu Y Arneth A Arvanitis AHarper A Jain A Kato E Koven C Poulter B Stocker BD Wiltshire A Zaehle S and Zeng N The carbon cycle inMexico past present and future of C stocks and fluxes Biogeo-sciences 13 223ndash238 httpsdoiorg105194bg-13-223-20162016

Nash J E and Sutcliffe J V River flow forecasting through con-ceptual models part I ndash A discussion of principles J Hydrol 10282ndash290 1970

Owen N A Choncubhair Oacute N Males J del Real Laborde J IRubio-Corteacutes R Griffiths H and Lanigan G Eddy covari-ance captures four-phase crassulacean acid metabolism (CAM)gas exchange signature in Agave Plant Cell Environ 39 295ndash309 httpsdoiorg101111pce12610 2016

Pasetto D Arenas-Castro S Bustamante J Casagrandi RChrysoulakis N Cord A F Dittrich A Domingo-Marimon C El Serafy G and Karnieli A Integration ofsatellite remote sensing data in ecosystem modelling at local

scales Practices and trends Methods Ecol Evol 9 1810ndash1821httpsdoiorg1011112041-210X13018 2018

Pastorello G Trotta C and Canfora E The FLUXNET2015dataset and the ONEFlux processing pipeline for eddy covari-ance data Sci Data 7 225 pp httpsdoiorg101038s41597-020-0534-3 2020

Pearcy R W Heterogeneous light environments in Handbookof Functional Plant Ecology edited by Pugnaire I F and Val-ladares F CRC Press Boca Ratoacuten FL US 270ndash314 2007

Piao S Liu Q Chen A Janssens I A Fu Y Dai J Liu LLian X Shen M and Zhu X Plant phenology and global cli-mate change Current progresses and challenges Glob ChangeBiol 25 1922ndash1940 httpsdoiorg101111gcb14619 2019

R Development Core Team R A language and environment forstatistical computing R Foundation for Statistical Computingavailable at httpswwwr-projectorg (last access 15 October2015) 2009

Reichstein M Falge E Baldocchi D Papale D Aubinet MBerbigier P Bernhofer C Buchmann N Gilmanov Tand Granier A On the separation of net ecosystem ex-change into assimilation and ecosystem respiration reviewand improved algorithm Glob Change Biol 11 1424ndash1439httpsdoiorg101111j1365-24862005001002x 2005

Richardson A D and Hollinger D Y Statistical model-ing of ecosystem respiration using eddy covariance datamaximum likelihood parameter estimation and Monte Carlosimulation of model and parameter uncertainty applied tothree simple models Agr Forest Meteorol 131 191ndash208httpsdoiorg101016jagrformet200505008 2005

Richardson A D Hollinger D Y Shoemaker J K Hughes HSavage K and Davidson E A Six years of ecosystem-atmosphere greenhouse gas fluxes measured in a sub-borealforest Sci Data 6 1ndash15 httpsdoiorg101038s41597-019-0119-1 2019

Roby M C Scott R L Barron-Gafford G A Hamerlynck E Pand Moore D J P Environmental and Vegetative Controls onSoil CO2 Efflux in Three Semiarid Ecosystems Soil Syst 3 6httpsdoiorg103390soilsystems3010006 2019

Running S W and Zhao M Daily GPP and annual NPP(MOD17A2A3) products NASA Earth Observing SystemMODIS land algorithm Univ of Mont Missoula Mont US2015

Running S W Nemani R R Heinsch F A Zhao MReeves M and Hashimoto H A continuous satellite-derived measure of global terrestrial primary productionBioscience 54 547ndash560 httpsdoiorg1016410006-3568(2004)054[0547ACSMOG]20CO2 2004

Rzedowski J Vegetacioacuten de Meacutexico Meacutexico Editor Limusa1978

Scott R L and Biederman J A Critical Zone Water Balance Over13 Years in a Semiarid Savanna Water Resour Res 55 574ndash588 httpsdoiorg1010292018WR023477 2019

Scott R L Watts C Payan J G Edwards E Goodrich D CWilliams D and James Shuttleworth W The understory andoverstory partitioning of energy and water fluxes in an opencanopy semiarid woodland Agr Forest Meteorol 114 127ndash139 httpsdoiorg101016S0168-1923(02)00197-1 2003

Scott R L Edwards E A Shuttleworth W J Huxman T EWatts C and Goodrich D C Interannual and seasonal vari-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

392 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

ation in fluxes of water and carbon dioxide from a ripar-ian woodland ecosystem Agr Forest Meteorol 122 65ndash84httpsdoiorg101016jagrformet200309001 2004

Scott R L Biederman J A Hamerlynck E P and Barron-Gafford G A The carbon balance pivot point of southwest-ern US semiarid ecosystems Insights from the 21st centurydrought J Geophys Res-Biogeo 120 2612ndash2624 2015

Segerstrom K Geology of the Bernal-Jalpan Area Estado deQueretaro Mexico Geol Surv Bull 1104 19ndash82 1961

Soper F M McCalley C K Sparks K and Sparks J P Soilcarbon dioxide emissions from the Mojave desert Isotopic evi-dence for a carbonate source Geophys Res Lett 44 245ndash251httpsdoiorg1010022016GL071198 2017

Stevens N T Steiner S H and MacKay R J Assessing agree-ment between two measurement systems An alternative to thelimits of agreement approach Stat Methods Med Res 262487ndash2504 httpsdoiorg1011770962280215601133 2015

Tagesson T Ardouml J Cappelaere B Kergoat L AbdiA Horion S and Fensholt R Modelling spatial andtemporal dynamics of gross primary production in theSahel from earth-observation-based photosynthetic capacityand quantum efficiency Biogeosciences 14 1333ndash1348httpsdoiorg105194bg-14-1333-2017 2017

Thornton P Thornton M Mayer B Wei Y Devarakonda RVose R and Cook R Daymet Daily Surface Weather Data ona 1-km Grid for North America Version 3 Oak Ridge NationalLab (ORNL) Oak Ridge TN US 2017

Tramontana G Jung M Schwalm C R Ichii K Camps-VallsG Raacuteduly B Reichstein M Arain M A Cescatti A KielyG Merbold L Serrano-Ortiz P Sickert S Wolf S andPapale D Predicting carbon dioxide and energy fluxes acrossglobal FLUXNET sites with regression algorithms Biogeo-sciences 13 4291ndash4313 httpsdoiorg105194bg-13-4291-2016 2016rdquo

Truong A Walters A Goodsitt J Hines K Bruss C B andFarivar R Towards automated machine learning Evaluationand comparison of AutoML approaches and tools in Proceed-ings of the 2019 IEEE 31st International Conference on Toolswith Artificial Intelligence (ICTAI) Portland OR US 1471ndash1479 2019

Verbist K Santibantildeez F Gabriels D and Soto G Atlas dezonas aacuteridas de Ameacuterica Latina y El Caribe Donoso MC(Coord) UNESCO-PHI and CAZALAC Montevideo Uruguay2010

Wang H Li X Ma M and Geng L Improving esti-mation of gross primary production in dryland ecosystemsby a model-data fusion approach Remote Sens 11 225httpsdoiorg103390rs11030225 2019

Wang X Ma M Li X Song Y Tan J Huang G Zhang ZZhao T Feng J and Ma Z Validation of MODIS-GPP prod-uct at 10 flux sites in northern China Int J Remote Sens 34587ndash599 httpsdoiorg101080014311612012715774 2013

Wang X Wang H Li X and Ran Y Photosynthesis (NPPNEP Respiration) Obs Meas Ecohydrol Process 1ndash30httpsdoiorg101007978-3-662-47871-4_10-1 2018

Wheeler C W Archer S R Asner G P and McMurtry C RClimaticedaphic controls on soil carbonnitrogen response toshrub encroachment in desert grassland Ecol Appl 17 1911ndash1928 httpsdoiorg10189006-15801 2007

Wilcox B P Birt A Fuhlendorf S D and Archer S R Emerg-ing frameworks for understanding and mitigating woody plantencroachment in grassy biomes Curr Opin Env Sust 32 46ndash52 httpsdoiorg101016jcosust201804005 2018

Wilson J B Cover plus ways of measuring plant canopiesand the terms used for them J Veg Sci 22 197ndash206httpsdoiorg101111j1654-1103201001238x 2011

Wofsy S Goulden M Munger J Fan S-M Bakwin PDaube B Bassow S and Bazzaz F Net exchange ofCO2 in a mid-latitude forest Science 260 1314ndash1317httpsdoiorg101126science26051121314 1993

Wu C Niu Z and Gao S Gross primary production estimationfrom MODIS data with vegetation index and photosyntheticallyactive radiation in maize J Geophys Res-Atmos 115 D12127httpsdoiorg1010292009JD013023 2010

Wutzler T Lucas-Moffat A Migliavacca M Knauer JSickel K Šigut L Menzer O and Reichstein MBasic and extensible post-processing of eddy covarianceflux data with REddyProc Biogeosciences 15 5015ndash5030httpsdoiorg105194bg-15-5015-2018 2018

Xiao H McDonald-Madden E Sabbadin R Peyrard NDee L E and Chadegraves I The value of understanding feed-backs from ecosystem functions to species for managing ecosys-tems Nat Commun 10 1ndash10 httpsdoiorg101038s41467-019-11890-7 2019

Xu B Yang Y Li P Shen H and Fang J Globalpatterns of ecosystem carbon flux in forests A biometricdata-based synthesis Global Biogeochem Cy 28 962ndash973httpsdoiorg1010022013GB004593 2014

Yao J Liu H Huang J Gao Z Wang G Li D Yu H andChen X Accelerated dryland expansion regulates future vari-ability in dryland gross primary production Nat Commun 111ndash10 httpsdoiorg101038s41467-020-15515-2 2020

Yepez E A Williams D G Scott R L and Lin GPartitioning overstory and understory evapotranspiration ina semiarid savanna woodland from the isotopic composi-tion of water vapor Agr Forest Meteorol 119 53ndash68httpsdoiorg101016S0168-1923(03)00116-3 2003

Yona L Cashore B Jackson R B Ometto J and Brad-ford M A Refining national greenhouse gas inventories Am-bio 1ndash6 httpsdoiorg101007s13280-019-01312-9 2020

Zeng X Zeng X and Barlage M Growing temperate shrubsover arid and semiarid regions in the Community Land ModelndashDynamic Global Vegetation Model Global Biogeochem Cy 22GB3003 httpsdoiorg1010292007GB003014 2008

Zhang A Jia G Epstein H E and Xia J ENSO elicits oppos-ing responses of semi-arid vegetation between Hemispheres SciRep 7 42281 httpsdoiorg101038srep42281 2017

Zhang L Xiao J Zheng Y Li S and Zhou Y Increased carbonuptake and water use efficiency in global semi-arid ecosystemsEnviron Res Lett 15 034022 httpsdoiorg1010881748-9326ab68ec 2020

Zhu X-G Long S P and Ort D R What is the max-imum efficiency with which photosynthesis can convert so-lar energy into biomass Curr Opin Biotech 19 153ndash159httpsdoiorg101016jcopbio200802004 2008

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

  • Abstract
  • Introduction
  • Materials and methods
    • Site description
    • Eddy covariance measurements
    • Flux data processing
    • Remotely sensed data
    • MODIS algorithm for GPP
    • Santa Rita site dataset
    • Modeling
    • Variable importance
    • Model agreement
      • Results
        • Eddy covariance fluxes at Bernal
        • Machine learning ensembles as predictors of eddy covariance GPP
        • MODIS as predictor of eddy covariance GPP
        • Prediction eddy covariance GPP with ordinary-least-squares multiple regression
        • Agreement
          • Discussion
            • Model agreement
            • MODIS discrepancies
            • Forcing variables and machine learning
            • Carbon flux
            • Ecosystem management
              • Conclusions
              • Appendix A
              • Appendix B
              • Appendix C
              • Code and data availability
              • Author contributions
              • Competing interests
              • Acknowledgement
              • Financial support
              • Review statement
              • References
Page 4: Machine learning estimates of eddy covariance carbon flux ... · Scrub is a xeric category of shrublands characterized by plants with small leaves; it is very thorny, and its biomass

370 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 1 Localization and land use maps for the study site (a) biogeographic arid and semiarid zones of southern North America relevantfor the Bernal and Santa Rita sites the black outline represents the state limit for Quereacutetaro (b) Heterogeneity of the normalized differencevegetation index (NDVI) surrounding the EC tower at Bernal during the peak of the growing season 2017 (DOY 257) (c) Land cover in theregion of the Bernal site according to the annual University of Maryland (UMD) classification (MCD12 MODIS product)

two sets of TCAV (averaging soil thermocouple) probes mea-sured the temperature at 60 and 40 mm depths and abovethe HFP01SC plates (Campbell Scientific Inc Logan UTUSA) The TE525 (Texas Electronics Dallas TX USA)tipping-bucket rain gauge was installed at 12 m height and3 m away from the tower All these meteorological variableswere measured every 5 s and average values were stored ev-ery 30 min rainfall was accumulated for the same time inter-val Sensible (H ) and latent (λE) heat fluxes were calculatedby the EddyPro package (LI-COR Biosciences USA)

23 Flux data processing

All EC data were collected at 10 MHz in the data loggerand reported as micromoles of CO2 per square meter persecond and processed with the EddyPro package to con-vert values into average fluxes of 30 min intervals Only

quality-flagged records were used to account for the CO2flux (qc_co2_flux= 0) according to the Mauder and Foken(2011) policy in the EddyPro program (LI-COR 2019)

However this quality-checking is not sufficient especiallyin the case of CO2 therefore data were postprocessed usingthe REddyProc package of R (R Development Core Team2009) to estimate the friction speed thresholds (ulowast) gap-filldata and partition the NEE flux into its GPP and Reco com-ponents (Wutzler et al 2018) The filled-in estimates of NEE(NEE_uStar_f) GPP (GPP_uStar_f) and Reco (Reco_uStar)were used when the ulowast was lower than a ulowast annual thresholdabove which nighttime fluxes are considered valid The an-nual ulowast threshold was 0193 and 0194 for 2017 and 2018The difference between these thresholds and the 95 ulowast

threshold was small (0033 msminus1) Appendix A presents thethreshold means and confidence intervals calculated in theREddyProc package Only 1050 half-hour records (93 )

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 371

Figure 2 Source strength-weighted function at the Bernal site during 2017 and 2018 (gray points) Horizontally plots show the peak ofthe function and increasing percentages of the flux footprint The radial scale is along-wind distance from the sensor (m) The measurementheight was 6 m The blue points represent the vegetation sample plots used to calculate the importance vegetation index (IVI)

Figure 3 Thorny scrub at Bernal Quereacutetaro during the rainy season 2017 In the foreground is Cylindropuntia imbricata a very thornycactus shrubs in the background are Prosopis laevigata mesquites

had a ulowast below the annual mean ulowast threshold The data with aflag equal to 0 were used for the variable NEE_uStar_fqc asdefined by REddyProc Carbon dioxide flux data were time-integrated and converted to grams of C per square meter perday using the molar ratio of C We only reported the contin-uous measurements of the exchange of CO2 for the periodof April 2017 (DOY 89) to August 2018 (DOY 234) usingthe EC technique Due to equipment malfunction and incom-plete datasets some periods of time were not considered Themeasurement campaign presented here was not biased by wet

winters since both years were characterized by a less-than-weak NintildeondashNintildea

24 Remotely sensed data

Data were requested to the National Aeronautics andSpace Administration (NASA) via the Land Processes Dis-tributed Active Archive Center (DAAC) using the Ap-plication for Extracting and Exploring Analysis ReadySamples (AppEEARS) to obtain spatial and temporal

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

372 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

subsets for the Bernal and Santa Rita sites includ-ing daily surface reflectance (MOD09GA006 and MOD-OCGA006) daily daytime and nighttime land surface tem-perature (LST) (MOD11A2006 and MYD11A2006) 8 dleaf area index (LAI) and fraction of photosynthetically ac-tive radiation (FPAR) (MOD15A2H006 MYD15A2H006MCD15A2H006) 16 d enhanced vegetation index (EVI)(MYD13A1006) 16 d gross primary production (GPP)and net photosynthesis (PsnNet) (MOD17A2H006) TheAppEEARS also unpacks and interprets the quality layersAppendix B presents the details of each spectral band ofMODIS Data with less-than-good quality flags were deletedMissing data were filled with splines and a database witha 1 d time step was generated This would smooth lin-ear temporalndashphenological evolution between any two suc-cessive remotely sensed data points (Eshel et al 2019)Daily accumulated rainfall was requested using the Giovanniplatform from the Goddard Space Flight Center (GSFC)DAAC two products were used the 3IMERGDF006 fromthe Global Precipitation Measurement mission (GPM) andthe 3B42007 from the Tropical Rainfall Measuring mis-sion (TRMM) Gridded weather parameters from the ORNLDAAC Daymet dataset were precipitation shortwave radia-tion maximum and minimum air temperature and water va-por pressure Daymet is a data product derived from a collec-tion of algorithms interpolating and extrapolating daily me-teorological observations (Thornton et al 2017) FollowingHenrich et al (2012) and Hill et al (2006) daily reflectancebands of MODIS were used to compute several vegetation in-dices redndashgreen ratio index (RGRI) simple ratio (SimpleR)moisture stress (MoistS) disease stress index (DSI) greenatmospherically resistant vegetation index (GARI) normal-ized difference vegetation index (NDVI) normalized differ-ence water index (NDVI_w) and enhanced vegetation in-dex (EVI) the corresponding equations are presented in Ap-pendix B

25 MODIS algorithm for GPP

Estimates of GPP are derived from data recorded by theMODIS sensor aboard the Terra and Aqua satellites The ef-ficiency (ε gCMJminus1) with which vegetation produces drymatter is defined as the amount of solar energy stored by pho-tosynthesis in a given period divided by the solar constant in-tegrated over the same period (Monteith 1972) Not all inci-dent solar radiation is available for biomass conversion onlyabout 48 is photosynthetically active (PAR MJmminus2) andnot all PAR is absorbed (Zhu et al 2008) Thus carbon ex-change is mainly controlled by the amount of PAR absorbedby green vegetation (APAR) and modified by ε (Gitelson etal 2015) The fraction of absorbed PAR (FPAR) is equalto APARPAR but can be represented by the NDVI spectralvegetation index produced by MODIS (Running 2004) Theefficiency term ε is described as the product of different fac-tors as a whole or as a part of the system (Monteith 1972) but

mostly those related to the efficiencies with which the vegeta-tion intercepts the radiation and the efficiency of convertingthe intercepted radiation into biomass (Long et al 2015)The MODIS algorithm that estimates GPP in the MOD17product is (Running et al 2004)

GPP= εtimesFPARtimesPAR (1)GPP= εtimesNDVItimesPAR (2)

The ε term in the MODIS algorithm is represented by amaximum radiation conversion efficiency (εmax kgCMJminus1)that is attenuated by suboptimal climatic conditions mainlyminimum air temperature (Tmin) and vapor pressure deficit(VPD) Two parameters for each Tmin and VPD are usedto define attenuation scalars for general biome types Theseparameters form linear functions between the scalars (Run-ning and Zhao 2015 Wang et al 2013) daily minimumtemperature at which ε = εmax and at which ε = 0 and thedaylight average VPD at which ε = εmax and at which ε =0 GPP is truncated on days when air temperature is be-low 0 C or VPD is higher than 2000 Pa (Running andZhao 2015) Stress and nutrient constraints on vegetationgrowth are quantified by the limiting relation of leaf areain NDVItimesPAR rather than constrained through ε (Runninget al 2004) However the MODIS algorithm does not con-sider stomatal sensitivity to leaf-to-air vapor pressure acrossand within species and particularly between isohydric andanisohydric plant species (Grossiord et al 2020)

ε = εmaxtimes Tmin scalartimesVPD_scalar (3)

The MOD17 userrsquos guide presents a biome-property look-up table (BPLUT) with the parameters for each biome typeand assumes that they do not vary with space or time (Run-ning and Zhao 2015) This aspect is important because εmaxhas the strongest impact on the predicted GPP of the MOD17algorithm (Wang et al 2013) The assumption is also impor-tant because the overstory and understory could be decou-pled from each other and would intercept different amountsof light and have different water sources during the grow-ing season (Scott et al 2003) Light quantity and quality asdiffuse light or sunflecks determine differences among un-derstory species in their temporal response to gaps involvingacclimation and avoidance of photoinhibition (Pearcy 2007)Another shortcoming is that few land cover classifications areincorporated into the MOD17 algorithm

26 Santa Rita site dataset

The Santa Rita Experimental Range (SRER) is located inthe western range of the Santa Rita Mountains in ArizonaUSA (lat 318214 longminus1108661 1120 masl) Climate isBSk with mean annual precipitation of 380 mm temperatureof 179 C and Ustic Torrifluvent soils Established in 1903SRER has a long history of experimental manipulations toenhance grazing potential for cattle (Glenn et al 2015) Two

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 373

Figure 4 Semidesert grassland encroached by mesquite (Prosopisvelutina) at Santa Rita Arizona (US-SRM) Image credit RussellScott 9 December 2016

Ameriflux sites are located in the SRER Santa Rita Grass-land (US-SRG) and Santa Rita Mezquite (US-SRM) Weused EC data for the years 2013ndash2019 from US-SRM whichis a mesquite grass savanna (35 mesquite canopy coverand mean canopy height above 2 m 22 grasses and 43 bare soil) although MODIS describes this site as open shrub-lands (Glenn et al 2015 Scott et al 2004) The US-SRMsite is dominated by velvet mesquite (Prosopis velutina) andhas a diversity of shrubs cacti succulents and bunch grasses(McClaran 2003) This site was chosen because the vegeta-tion and climate are similar to the Bernal site and it was theclosest EC instrumentation with data availability (Fig 4)

27 Modeling

Gross primary production estimated by EC at the Bernal sitewas modeled using OLS and EML The explanatory vari-ables were the remotely sensed data the weather parame-ters and the vegetation indices (Appendix B) The OLS isa particular case of the generalized linear model where thevariation of a single response variable is explained by severalindependent variables The OLS was fitted with the stepwiseprocedure the final model included variables with a variance

inflation factor (VIF) lower than 10 and a significance levelof 005 Predictions of the EC GPP were obtained with the fi-nal model Analysis and diagnostics were made with Minitabv 17 (Minitab LLC) Analysis of the OLS model can be usedto determine agreement between methods of measurementbut it is sensitive to the range of values in the dataset andits metrics ndash r r2 and root mean square ndash do not provideinformation on the type of association (Bland and Altman2010) In this paper we used another metric the probabil-ity of agreement to determine bias and agreement betweenmodel estimates and observed data (see below)

While OLS is a well-known algorithm machine learningalgorithms are emerging techniques that focus on the datastructure and match that data onto models The EML ap-proach considers the different realizations of machine learn-ing models and constructs an ensemble of models com-ing with the advantage of being more accurate than thepredictions from the individual ensemble members How-ever EML is computationally intensive requiring nodeswith purpose-built hardware such as multiple processors orreduced-precision accelerators The nodes could be aggre-gated in computing clusters which require storage powerand cooling redundancy

A stack of EMLs was obtained with the H2O packageof R (Hall et al 2019) This package provides several al-gorithms that can contribute to a stack of ensembles usingthe AutoML (automatic machine learning) function feedfor-ward artificial neural network (DL) general linear models(GLMs) gradient-boosting machine (GBM) extreme gra-dient boosting (XGBoost) default distributed random for-est (DRF) and extremely randomized trees (XRT) AutoMLtrains two stacked-ensemble models one ensemble containsall the models and the second ensemble contains just thebest-performing model from each algorithm class or fam-ily both ensembles should produce better models than anyindividual model from the AutoML run The term AutoMLimplies data preprocessing normalization feature engineer-ing model selection hyperparameter optimization and pre-diction analysis including procedures to identify and dealwith nonindependent and identically distributed observationsand overfitting (Michailidis 2018 Truong et al 2019)

Machine learning has two elements for supervised learn-ing training loss and regularization The task of training triesto find the best parameters for the model while minimizingthe training loss function the mean squared error for exam-ple The regularization term controls the complexity of themodel helping to reduce overfitting Overfitting becomes ap-parent when the model performs accurately during the train-ing but the accuracy is low during the testing A good modelneeds extensive parameter tuning by running the algorithmmany times to explore the effect on regularization and cross-validation accuracy (Mitchell and Frank 2017) In this inves-tigation the function of training loss was the deviance whichis a generalization of the residual sum of squares driven bythe likelihood Deviance is a measure of model fit and lower

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

374 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

or negative values indicate better model performance (McEl-reath 2020)

A stack of EML solutions was based on a random sam-ple of the dataset for training the model For the Bernal site85 was used for training for US-SRM 80 was usedThe AutoML function was run 20 times each run added ap-proximately 48 models to the leader board and ranked thebest-performing models by their deviance Each run splitsthe training data 10 times for k-fold cross-validation Theseed for an EML that is dependent on randomization waschanged in every run The stopping rule for each run wasset at 100 s and the maximum memory allocation pool forH2O was 100 GB in a single workstation with dual Xeon2680 v4 processors and 128 GB of RAM The H2O packagewas installed in a rockergeospatial docker container (imageavailable at httpshubdockercomrrockergeospatial lastaccess 9 January 2021) which is a portable scalable and re-producible environment (Boettiger and Eddelbuettel 2017)

Two sets of predictions for the GPP at the Bernal site wereobtained from the stacked ensemble The first set of predic-tions was based on the 15 of the Bernal site data reservedfor testing The second set of predictions was obtained byrefeeding the US-SRM site model with the Bernal site ex-planatory variables The first set of predictions would showthe importance of local data to predict EC-based GPP Thesecond set of predictions would represent the suitability ofoff-site data to predict EC-based GPP If the second scenariohas good agreement then an EML model could be used torepresent wider areas of the ecosystem

28 Variable importance

The variable importance within individual models was usedto answer the question of which environmental variableswere important for GPP prediction For the ldquoall-models en-semblerdquo and ldquobest-of-family ensemblerdquo generated by Au-toML it is not possible to examine the variable importance orthe contribution of the individual models to the stack (H2Oai2017) Therefore a weight (wi)was calculated using Eq (4)which is adequate for other information criteria besides theAkaike weights this weight is an estimate of the conditionalprobability that the model will make the best predictions onnew data considering the set of models (McElreath 2020)Then the importance of each variable () was multipliedby the modelrsquos weight (wi) and then added by variable tobuild the variable importance index This index would mea-sure how often a given variable was used on the leader board

wi =exp

(minus

12 dWAICi

)summj=1 exp

(minus

12 dWAICi

) (4)

dWAIC= deviancei minus deviance of top-performing model(5)

29 Model agreement

Calibration and agreement between methods of measure-ment are different procedures Calibration compares knownquantities of the true value or measurements made bya highly accurate method (a gold standard) against themeasurements of a new or a contending method When twomethods of measurement are compared neither providesan unequivocally correct measurement because both havea measurement error and the true value remains unknown(Bland and Altman 2010) Stevens et al (2015) proposes theprobability of agreement (θs) as a plot metric to representthe agreement between two measurement systems acrossa range of plausible values The θs method addressessome of the challenges of the accepted ldquolimits agreementmethodrdquo presented by Bland and Altman (2010) Besides theagreement plot agreement is based on maximum likelihoodbias parameters α and β quantifying the fixed bias andthe proportional bias If α = 0 β = 1 and σ1= σ2 thenthe two measurements are identical σj is the measurementvariation The probability-of-agreement analysis was per-formed using the ProbAgreeAnalysis (httpsuwaterloocabusiness-and-industrial-statistics-research-groupsoftwarelast access 28 November 2020) in MATLAB 94 (Math-Works Inc) An arbitrary 1 gCmminus2 dminus1 was considered tobe a tolerable magnitude to conclude that agreement is suf-ficient enough to use either estimated GPP interchangeablyThe reference measurement was the GPP obtained fromEC data at the Bernal site and tested against the MODISMOD17 model the OLS predictions or each of the two setsof EML predictions If the probability-of-agreement plotsuggested disagreement between two measurement systemsthen the predictions can be adjusted using

Adjusted predictor= (predictorminusα)β (6)

3 Results

31 Eddy covariance fluxes at Bernal

Dominant flow at the Bernal site was from the northeast(Fig 2) Energy balance closure for this site had a slopeof 072 and r2

= 092 (Fig 5) Homogeneous sites of theFluxnet network obtain closure percentages higher than72 and for the Bernal site the vegetation heterogeneitywas important (see below) Average H was always nega-tive during nighttime but during some months of the dryand rainy season λE was positive particularly after dawn(Fig 6) so as to allow nighttime evaporation from the soil orvegetation However during some months of the dry seasonrainfall was low (Fig 7) then the positive λE suggested thatcacti could have an active gas exchange at that time

Carbon dioxide absorptions had a diurnal behavior begin-ning at dawn and ending before sunset (Fig 8) Nighttimeflux was positive indicating respiration notwithstanding the

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 375

Figure 5 Closure of the surface energy balance from eddy covari-ance measurements averaged at 30 min between the turbulent fluxes(H + λE) and available fluxes (Rn-G) Data are from 30 March2017 to 22 August 2018 at the Bernal site The regression wasy = 2302+072times with adjusted r2

= 092 The diagonal line rep-resents the 1 1 relation

presence of cacti Although summer rains are characteristicof the climate at the Bernal site (Fig 7) a negative NEE fluxoccurred in all measured months The lowest CO2 flux wasrecorded in January and February 2017 and in May 2018 (Ta-ble 1) This behavior resulted from the phenology of the veg-etation since most species lost their leaves in the dry seasonand was also due to the effect of low temperature Withinthe rainy season the flux of CO2 increased compared to themonths of January to June The correlation between NEE andprecipitation was minus045 When the sum of the precipitationof the current month and that of the previous month was con-sidered the correlation with NEE was minus07 suggesting thatcontinuous availability of soil moisture is important for theabsorption of CO2 in this environment

The scrub at Bernal was heterogeneous in botanical com-position A total of 24 species of cacti and shrub wereidentified on average each sampling plot had 103 speciesThe IVI was similar between all cacti (036plusmn 004) shrublegumes (038plusmn004) and other shrubs (023plusmn006) sampledMost sampling plots were in areas of high flux frequency(Fig 2) Cylindropuntia imbricata had the largest IVI fol-lowed by Acacia farnesiana Acacia schaffneri and Prosopislaevigata The IVI of the herbaceous stratum representedby grasses was not characterized due to the state of over-grazing and the absence of reproductive structures in plantswhich made measurement of their abundances frequenciesand dominance difficult The grass genera present were Meli-nis Chloris Cynodon and Cenchrus all corresponding to in-

Table 1 Daily average values of the net ecosystem exchange(NEE) gross primary productivity (GPP) and ecosystem respira-tion (Reco) in a scrub at the Bernal site Negative values of NEEindicate photosynthetic absorption

NEE GPP RecomicromolCO2 mminus2 sminus1

2017

JANFEBMARAPR minus054 248 194MAY 005 286 291JUN 038 421 459JUL minus100 178 077AUGSEPOCT minus126 525 399NOV minus013 265 252DEC 004 180 184

2018

JAN minus005 129 124FEB 006 188 194MAR minus094 245 151APR minus058 287 229MAY minus029 377 348JUN minus252 433 181JUL minus283 823 541AUG minus193 921 728

vasive C4 tropical grasses Scrub species of higher IVI hada similar LAI (12) although the magnitude of the LAI of Plaevigata stood out (Table 2)

32 Machine learning ensembles as predictors of eddycovariance GPP

In this section we describe the modeling with EML using lo-cal remotely sensed data from the Bernal site to predict GPPat the same site and then the agreement between EML GPPpredictions and EC-derived GPP The AutoML function gen-erated 1031 models with an average deviance of 135 whilethe deviance of the leader model was 063 in the trainingdataset (Table 3) A total of 11 models of type GBM andfive XGBoost models were in the top 30 models along withthe nine best-of-family ensembles and five all-models ensem-bles The weighted variable importance on the leader boardwas higher for the LAI from the MOD15 and MCD15 prod-ucts (17 and 14 ) The PsnNet EVI (MOD17) FPAR(MCD15) green atmospherically resistant vegetation index(GARI) and MODIS reflectance band 13 had an importancehigher than 3 (59 54 42 36 and 30 ) LAI(MCD13 and MOD13) PsnNet and the FPAR (MOD15)were the more important variables (20 17 13 and10 ) in the top nonstacked model a GBM model that wasranked in fourth place

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

376 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 6 Latent heat flux daily trend at Bernal during different months emphasizing nighttime λE Panel (a) shows months with low rainfallin the previous month and predominantly negative λE during nighttime (these months had low rainfall 017 113 and 033 mm rainfall forJanuary March and December) Panel (b) shows months with low rainfall in the previous month and positive λE after sunset (67 73 and207 mm rainfall for February April and May) Panel (c) shows months during the rainy season with positive λE mainly due to soil wetnessand antecedent rainfall of 43 202 9 190 and 34 mm for June July August October and November Plot (c) is out of scale on the y axis forcompatibility with the other plots

Table 2 Importance value index (IVI) and leaf area index (LAI) of the main species present at the Bernal Quereacutetaro study site

Species Plant type IVI SEMa LAI SEM

Coryphantha cornifera Cactus 007 027Bouvardia ternifolia Herb 007 027Karwinskia humboldtiana Shrub 007 027Forestiera phillyreoides Shrub 009 027Ferocactus latispinus Cactus 009 027Cylindropuntia leptocaulis Cactus 009 027Asphodelus fistulosus Shrub 009 019Brickellia veronicifolia Shrub 010 027Dalea lutea Shrub 011 015Eysenhardtia polystachya Legume 013 015Myrtillocactus geometrizans Cactus 014 027Schinus molle Shrub 014 019Jatropha dioica Herb 015 019Mammillaria uncinata Cactus 016 012Opuntia tomentosa Cactus 017 011Opuntia robusta Cactus 023 007Opuntia hyptiacantha Cactus 026 007Mimosa monancistra Legume 028 012Mimosa depauperata Legume 031 012Zaluzania augusta Shrub 033 010Viguiera linearis Herb 036 011Acacia schaffneri Legume 041 007 113 015Prosopis laevigata Legume 041 007 148 012Acacia farnesiana Legume 056 009 112 037Cylindropuntia imbricata Cactus 074 007 113 011

a SEM standard error of the mean

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 377

Figure 7 Monthly rainfall and land surface temperature (LST) dur-ing years (a) 2017 and (b) 2018 at the Bernal site The LST valuescorrespond to the 1330 (LST day) and 0130 (LST night) MODISAqua satellite overpasses

Predictions of GPP in the testing dataset showed disper-sion in the lower range of the scale of measurement andthe correlation was 094 (Fig 9a) The final prediction ofGPP for the whole dataset had a probability of agreement(θs) of 058plusmn 001 (parameter estimate and standard error)α =minus00616plusmn 011 and β = 10133plusmn 002 suggesting agood fit with low fixed and proportional bias (Fig 9b) Theprobability of agreement decreased slightly at the lower andupper range of the scale of measurement (Fig 9c) indicatingthat the EML model would predict GPP without increasingthe bias particularly in the range from 0 to 4 gCmminus2 dminus1However the value of θs should be higher than 095 so as toconsider EC measurements and EML to be interchangeableThe correlation of 099 (r2

= 098) for the data in Fig 9bcould be misleading as it would suggest a very good fit

Using only the five of the more important variables namedabove to generate an EML resulted in an XBoost leadermodel with 273 deviance and a total of 1094 models Thetop 30 models were 16 XBoost and GBM models and the

best-of-family ensembles started to show up in 12th placeAlthough the number of runs was the same (20) the AutoMLfunction increased the number of produced models but thesmaller set of explanatory variables constrained the ability toidentify features contributing to better models Using anotherset of five randomly selected explanatory variables (one veg-etation index and four MODIS bands) resulted in a leadermodel with 252 deviance out of 1024 models but this timethe leader was the best-of-the-family ensemble Using only afew variables was considered to increase the deviance com-pared to the average deviance of 135 obtained during thetraining phase and using all available variables

An important question for modeling upscaling is the ca-pacity to extrapolate results temporally and spatially herewe explored the latter posing the following question wouldpredictions of GPP from EML for an EC site agree withEC observations from another site with ldquosimilarrdquo environ-mental conditions First an EML solution was found train-ing 80 of the Santa Rita dataset and obtaining a best-of-family ensemble with 023 deviance out of 634 trainedmodels (hereafter this model is referred to as the Santa Ritamodel) Then the environmental and remotely sensed datafrom the Bernal site were fed into the Santa Rita model thiswould be an external validation dataset However agreementwas not good the mean value of θs was 015plusmn 001 withα =minus10822plusmn 009 and β = 058127plusmn 002 The value ofθs was not constant across the range of measurement and de-creased rapidly after 2 gCmminus2 dminus1 (Fig 10b) Because thebias was important predictions were adjusted using Eq (6)showing some improvement with r = 078 (Fig 10a) Com-paring Figs 5b and 6a it is evident that an EML modelextrapolation to other conditions is noisier ie Santa Ritamodel trying to represent the ecosystem function at BernalNotwithstanding some of the most important variables wereshared by both EML ensembles Bernal and Santa Rita in thecase of Santa Rita LAI from MOD15 MYD15 and MCD15had 350 48 and 31 of variable importance and theFPAR from MOD15 was 12

33 MODIS as predictor of eddy covariance GPP

MODIS is important because it overpasses every point ofthe earth every 1 or 2 d and it implements a GPP prod-uct (MOD17) that has helped track the response of the bio-sphere to the environment since 2000 The product MOD17has been validated against many EC sites but few validationsites correspond to deserts and semideserts (Running et al2004) The GPP MOD17 underestimated the GPP derivedfrom EC data at Bernal (Fig 11a) In a similar bell-shapeddistribution of θs as in the case of the extrapolation of theSanta Rita site (Fig 10b) here the θs was not constant acrossthe range of measurement mean θs was 024plusmn 013 withα = 000047plusmn0087 and β = 048749plusmn002 (Fig 11c) Ad-justing MOD17 estimates with Eq (6) improved the relation-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

378 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 8 Net ecosystem exchange (NEE) and photosynthetic active radiation (PAR) at the Bernal site in (a) 2017 and (b) 2018 Negativevalues in the CO2 flux indicate photosynthesis The gray shadow is the standard error of the mean for each month at any given hour

Table 3 Leader board of EML models for the Bernal site using 85 of day observations as training dataset NA denotes the outcome wherethe type of model was not present in the 30 top-performing models according to deviance

Type of model Number of models Average deviance

All models Top models Leader model

Stacked ensembleAll models 20 0852 0834Best of family 20 0812 0703 0633GBM 453 1366 0796DRF 20 1108 NAXGBoost 277 1385 0778XRT 20 1077 NADeep learning 201 2072 NAGLM 20 1474 NATotal 1031 1356

ship but note that the value of r (076) was the same for theoriginal MOD17 and the adjusted MOD17 (Fig 11b)

34 Prediction eddy covariance GPP withordinary-least-squares multiple regression

OLS is a common estimation method for linear models andhere this model appeared as adequate judging by the gen-eral distribution of predictions (Fig 12a) and the probability-of-agreement plot (Fig 12b) A total of 14 variables wereincluded in the model all of them with VIF values lowerthan 70 (Appendix C) the VIF statistic quantifies the sever-ity of multicollinearity and an acceptable threshold is 10The most significant variables were the EVI from MYD13and Daymet variables precipitation shortwave radiation and

minimum and maximum temperatures (see Appendix B forvariable details) Variables with high coefficient values wereMODIS reflectance band 14 (917) the EVI from MYD13(853) and the NDVI (39) while Daymet temperatures hadsmall coefficientsminus023 for maximum temperature and 017for minimum temperature The θs decreased slightly at theend of the measurement range mean θs was 05plusmn0014 withα = 018845plusmn0137 and β = 094966plusmn0031No correctionfor this model was calculated since α was close to 0 β wasclose to 1 and the EML model had a higher θs

35 Agreement

The probability of agreement (θs) was the statistic to deter-mine if the mean responses were in agreement Comparing

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 379

Figure 9 Agreement between predictions of GPP obtained with machine learning algorithms or derived from eddy covariance measurementsat the Bernal site (a) Example of one run of predictions of GPP in the test dataset from the Bernal site using the leader model of an ensembleof machine learning algorithms (EML) the test dataset was 15 of data with r = 094 (b) Predictions for the complete dataset withr = 099 the diagonal line is the 1 1 agreement (c) Function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreementbetween methods of estimation of GPP corresponding to plot (b) The horizontal axis (s) represents the magnitude of the measurement thevertical axis is the probability of agreement for the measurement and the red line is the confidence interval p lt 005

Figure 10 (a) Adjusted predictions of GPP for the complete dataset from the Bernal site using the leader model of the final ensembleof machine learning algorithms (EML) derived from the Santa Rita site compared to estimates of GPP from EC data with r = 078 (b)Respective function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreement between methods of estimation of GPP ECdata from the Bernal site and EML model for the Santa Rita site The horizontal axis (s) represents the magnitude of the measurement thevertical axis is the probability of agreement for the measurement and the red line is the confidence interval p lt 005

the confidence intervals (CIs) for θs the best modeling ap-proach was the EML because its CI (056ndash059) was differentfrom that of the OLS model (047ndash052) More importantlyfor the EML and OLS models the values of θs had little vari-ation across the range of GPP estimates However the CIsof the EML and OLS models were similar in their bias es-timates α and β and both models had no bias (p lt 005)Altogether the best result was the EML ensemble using en-vironmental and remotely sensed data corresponding to thesame site ie Bernal (Fig 10c) This kind of EML wouldbe useful for gap filling or the evaluation of GPP time se-ries of the site that generated the model Machine learningalgorithms can fill gaps longer than 30 d (Kang et al 2019)

The second option to estimate GPP using the samedatasets was the multiple-regression OLS model (Fig 12a)The multiple regression is straightforward and here multi-collinearity was not a problem The EML ensemble and theOLS regression have the highest values of θs (058 and 05respectively Fig 12b) Higher values for θs are desirable(gt 095) and this could be achieved by increasing the samplesize relaxing the tolerable magnitude for agreement (here itwas set at 1 gCmminus2 dminus1) or perhaps using different forcingvariables

The MODIS estimates were a third-best alternative sincethe mean θs was 024 In the present study we used a splineto fill the data to a daily time step since the MCD17 is an8 d composite product but a similar result was obtained if

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

380 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 11 (a) MOD17 GPP daily time step from the Bernal site versus estimates of GPP from EC data using the complete dataset at adaily time step derived from spline for MOD17 GPP (copy) with r = 076 or 8 d composite estimates obtained by 8 d averages of EC GPP() with r = 076 (b) Adjusted predictions of GPP from MODIS (c) Corresponding function of probability of agreement for Fig 7a using1 g Cmminus2 dminus1 as a tolerable agreement between methods of estimation of GPP EC data from the Bernal site and MODIS MOD17 Thehorizontal axis (s) represents the magnitude of the measurement The vertical axis is the probability of agreement for the measurement

Figure 12 (a) Ordinary-least-squares multiple-regression estimates of GPP for the complete dataset from the Bernal site versus estimates ofGPP from EC data with r = 082 (b) Respective function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreement betweenmethods of estimation of GPP EC data from the Bernal site and OLS multiple regression The horizontal axis (s) represents the magnitudeof the measurement The vertical axis is the probability of agreement for the measurement

the EC GPP was rescaled and compared to the original 8 dMCD17 data (Fig 11a) The GPP from MODIS was an un-derestimate of EC GPP and when the estimates were ad-justed (Eq 6 Fig 11b) the performance was not better thanthe OLS (model not shown) The MODIS land cover classi-fication represented this site as grassland (MCD12) and thiscould be another reason for the poor agreement besides theassumptions made in the MODIS algorithm regarding the εand εmax parameters and the response of vegetation to VPDAgreement of MODIS GPP is crucial because MODIS prod-ucts are frequently used in country-wide assessments of thecarbon cycle and can influence public policies

The model with least agreement resulted when the EMLensemble generated from the Santa Rita site was used to pre-dict GPP at Bernal Machine learning models can make pre-dictions but their usefulness decreases when they are usedoutside the context of where they were built while process-

based mechanistic models have this ability Although the Au-toML function in H2O is designed to protect against over-fitting using cross-validation runs (Michailidis 2018) in ourstudy the Santa Rita model could not be generalized to repre-sent the Bernal site GPP process probably because the vari-ables and features selected for Bernal or Santa Rita were dif-ferent during the AutoML workflow

The Santa Rita model was good at predicting GPP at thatsite with a deviance of the leader model of 023 while atBernal the deviance of the leader model was 063 (Table 3)indicating that the Santa Rita EML ensemble was at leastas good a model as the EML at Bernal (not shown) TheGPP time series for Santa Rita was about 4 times the sizeof the Bernal dataset and therefore the deviance was lowerHowever when the Santa Rita model was used with Bernaldata the mean θs was 016 indicating that the agreement wasinsufficient Eyeballing the predictions in Figs 9a 10b and

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 381

11a and their corresponding correlation values (078 076and 082 for adjusted EML Santa Rita adjusted MODIS andmultiple-regression OLS) it could be argued that these mod-els were comparable However their θs plots present a dif-ferent perspective

4 Discussion

41 Model agreement

The Santa Rita model had higher θs when it extrapolatedfor low GPP values for the Bernal site suggesting that theSanta Rita model had a better skill when predicting GPPclose to 0 and even negative GPP values (Fig 10b) Al-though the Bernal and Santa Rita sites had similar vegeta-tion and climate classifications they are more than 1600 kmapart and rainfall monthly distribution is different The GPPseasonal cycle at Bernal started in February and steadily in-creased to a maximum during July and August (Table 1)At Santa Rita the GPP was low from January until mid-July (lt 05 gCO2 mminus2 dminus1) and then increased sharply to amaximum level in mid-August (Joiner and Yoshida 2020)What matters eventually for machine learning methods ishow well the predictor space rather than geographic spaceis sampled (Jung et al 2020) Incorporating data from morehumid (semiarid) sites could improve the GPP predictions bya machine learning method Not only more EC sites but alsosites representing a water availability gradient would be im-portant for semiarid ecosystems representing long spells andthe influence of oceanic oscillations and monsoon rains

All the models presented here used transient data to rep-resent GPP specifically at the 1 d time step Besides theradiation-related variables two sources of rainfall were usedas forcing variables but evapotranspiration (ET) was notused The MOD16A2 version 6 is available as an 8 d gap-filled product and could be included in EML or OLS mod-els However a more general representation of the carboncycling could be achieved when including variables that rep-resent annual or seasonal timescales of soil water evapo-ration or precipitation (Scott and Biederman 2019) Scottet al (2015) suggest that real lags between precipitation andproductivity that may impart legacy effects may also be par-tially masked by using ET as ET more carefully tracks pro-ductivity when soil moisture storage is accessed The occur-rence of off-season rainfall dry spells and carryover effectscould be parameterized as windowed events of a given dura-tion A window would be a period with distinct time bound-aries the window allows the grouping of records with similarfeatures The effect of the window at a given time could berepresented as moving weights as the point in time in ques-tion is closer or farther from the window

In our study the metric to assess model agreement wasthe probability of agreement (θs) and their bias parame-ters Many other metrics can be used to evaluate model per-

formance such as the root mean square error r r2 or themodel efficiency factor (MEF) presented by Nash and Sut-cliffe (1970) In particular the MEF is a step in the Fluxnetdata processing pipeline (Pastorello et al 2020) A MEFvalue close to 1 represents a high correlation and lower bi-ases (Joiner and Yoshida 2020) Therefore the calculatedMEF for the Bernal site EML (098) would suggest very goodpredictive performance while the θs of 06 for this model in-dicates a more modest performance The θs for a particularvalue of the measurand is the probability that the differencebetween two measurements made by different systems fallswithin an interval that is deemed to be acceptable (Stevenset al 2015) In the present study we used 1 gCmminus2 dminus1 as acritical value defining an acceptable difference If this valueis smaller then the probability of agreement would decreasefor this same model In such a case it would be less likelythat the predictors agree considering that the θs takes intoaccount both the difference in the function means at a givenvalue of the measurand and the uncertainty in its estimation

42 MODIS discrepancies

Different authors have reported discrepancies betweenMCD17 and EC estimates of GPP in semiarid regions Ex-amining MODIS discrepancies in these ecosystems is im-portant because the errors induced by cloud cover are ex-pected to be minimal and other effects can be identified(Gebremichael and Barros 2006) The GPP of MOD17 didnot relate well (EC= 011+ 017MODIS r2

= 067) withestimates of EC GPP in semidesert vegetation of the Sahel(Tagesson et al 2017) With data from different types of veg-etation in the Heihe basin in China MODIS17 overestimatedthe GPP from EC (EC= 115+024MODIS r2

= 068 Cuiet al 2016) For scrub sites in Mexico the relation be-tween GPP calculated from EC and MOD17 was not good(MODIS= 38382+ 0467EC r2

= 06 Delgado-Balbuenaet al 2018) In arid and semiarid ecosystems in China op-timizing parameters of the MODIS GPP model with site-specific data improved the estimate to explain 91 of thevariation in the GPP of the data observed by EC (Wang et al2019) These same authors propose improving the land useclassification used by the MOD17 algorithm and recalibrat-ing light use efficiency parameters to solve the GPP estima-tion problem Gebremichael and Barros (2006) examined anopen shrubland site in a semiarid region of Sonora Mexicoand their analysis of the temporal evolution of the discrep-ancies with MODIS GPP suggested revisiting the light useefficiency parameterization especially the functional depen-dence on VPD and PAR and water stress or soil moistureavailability

The relationship between the GPP MODIS and the GPPEC presented in Sect 33 is an approximation because theuncertainty in the respiration component must be consid-ered The empirical relationship between nocturnal NEE andsoil temperature has been used to represent ecosystem res-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

382 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

piration (Reco) in order to separate the processes that con-tribute to daytime NEE (Richardson and Hollinger 2005Wofsy et al 1993) Nighttime NEE should be equal to therates of autotrophic and heterotrophic respiration while dur-ing daytime NEE should be equal to the combined rates ofcarboxylation and oxidation of Rubisco autotrophic respi-ration and heterotrophic respiration Then the GPP can becalculated as the difference between daytime NEE and Recoestimated through its relationship with temperature (Gouldenet al 1996) In the present study Reco was calculated basedon soil and air temperature following the procedure of Reich-stein et al (2005) implemented in REddyProc (Wutzler et al2018) Although it is possible to measure or model the parti-tion of respiration (Running et al 2004 Wang et al 2018)the presence of cacti complicates the calculation assum-ing that all nighttime flux represents ecosystem respiration(Owen et al 2016 Richardson and Hollinger 2005) Whilesoil respiration tends to be temperature-limited when soilmoisture is nonlimiting in temperate ecosystems in range-land ecosystems the controls of soil CO2 efflux were photo-synthesis soil temperature and moisture (Roby et al 2019)In our study the instrumentation did not include measure-ments of plant or soil respiration partition to validate theRecoestimates

A problem regarding data comparison from remote orbitalsensors and terrestrial observations is that different quanti-ties are fundamentally measured MODIS measures the radi-ation reflected by the earthrsquos surface in two spectral bands at250 m spatial resolution per pixel five bands at 500 m and29 bands at 1000 m The EC technique has a footprint tomeasure CO2 that varies dynamically in shape and size butis generally considered to be 1 km2 in this study the foot-print radius was 600 m To solve the scaling MODIS prod-ucts related to the carbon cycle have been validated with theEC technique and biometric measurements on several spa-tial scales using process-based ecosystem models and char-acterizing areas up to 47 km2 around the EC measuring tower(Cohen et al 2003)

At Bernal the vegetation was heterogeneous This situa-tion was represented by the heterogeneity in vegetation ac-tivity in the 4 pixels used spanning about 02 units of NDVIat the peak of the season activity (Fig 1) The standard errorof the IVI differed by 1 order of magnitude among speciesAlthough more important species had a lower standard errortheir means were similar indicating that they were equallyabundant at all sampling plots The higher importance ofsome species (Table 2) was explained by selective grazingndashbrowsing behavior and the dispersion caused by cattle by ei-ther ingesting or transporting seeds or plant parts (Belaynehand Tessema 2017) Regarding landscape heterogeneity thetower fetch was predominantly from the northeast capturingthe less heterogeneous area of the site but also the more ac-tive according to the NDVI (Fig 1)

The thorny scrub examined had two vegetation layersthe overstory layer mainly consisting of mesquite acacia

and cacti and the understory layer that included grasses andherbs Cattle preferentially graze the understory and becausethey eat using their tongue they will avoid browsing thornyspecies unlike goats or deer which use their lips Withoutgrazing management overtime the competition balance willfavor bush species resulting in encroaching and the under-story will be stressed only unpalatable species or those withtheir growing meristems very close to the ground would sur-vive Representing the structure and functioning of these twolayers using MODIS is possible (Liu et al 2017) RecentlyHill and Guerschman (2020) presented a MODIS product de-rived from MCD43A4 to estimate the fractions of photosyn-thetic and nonphotosynthetic vegetation and the remainingfraction of bare soil These developments could improve theMOD17 GPP estimates since its model represents a homoge-neous single vegetation layer All these considerations helpto understand the low θs between MOD17 estimates of GPPand EC-derived GPP

43 Forcing variables and machine learning

Forcing variables in this study were gridded meteorologi-cal data and MODIS spectral bands and products but thevariables mostly included in the ML algorithms were LAIand FPAR accounting for 364 and 549 of the vari-able importance index at Bernal and Santa Rita respectivelyin their leader board This result supported the view thatCO2 fluxes can be represented by ML algorithms exclu-sively using remotely sensed data (Tramontana et al 2016Joiner and Yoshida 2020) Using neural networks Joiner andYoshida (2020) showed that with only satellite reflectancefrom MODIS and top-of-atmosphere PAR it is possible tocapture a large fraction of GPP variability They argue thatvegetation indices may reduce the information content ofthe underlying reflectance when compressing the informa-tion from two or more bands into a single index In our studyreflectance bands were not often included in the ML modelsThis was most likely because MOD09GA and MODOCGAprovide estimates of surface reflectance uncorrected for theillumination and viewing geometry while the MCD43 usedin Joiner and Yoshida (2020) uses a surface bidirectional re-flectance distribution function (BRDF) model that improvesthe quality of surface albedo retrievals By propagating theBRDF correction in the MODIS processing pipeline the veg-etation indices would more likely relate better to the GPPFuture modeling efforts should use benchmark scenarios ac-cording to sets of forcing variables already identified as use-ful

44 Carbon flux

Although the carbon balance in ecosystems is influenced bydifferent factors such as soil type and amount of nutrientsthe relationship with soil temperature and humidity is partic-ularly strong (Anderson-Teixeira et al 2011 Hastings et al

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 383

2005) How much of the rainwater the system can retain orlose has been described as the leakiness of the system (Guer-schman et al 2009) More than immediate incident rainfallthe available soil moisture and its redistribution are impor-tant in semiarid ecosystems including steam flow prefer-ential flow paths hydraulic lift and others (Barron-Gaffordet al 2017) At Bernal when the sum of the precipitation ofthe current month and that of the previous month was con-sidered the correlation with NEE was minus07 suggesting thatcontinuous availability of soil moisture is important for theabsorption of CO2 in this environment This result is con-sistent with other studies in which the relationship betweenthe net productivity of the ecosystem (NEP) and precipita-tion is initially positive but is leveled from 1000 to 1500 mmannually (Xu et al 2014) The hydraulic redistribution ofwater from moist (deeper) to drier soils through plant rootstended to increase modeled annual ecosystem uptake of CO2this process was identified at US-SMR (Fu et al 2018 Scottet al 2003)

The leakiness is highly dependent upon vegetation frac-tional cover the proportion of the surface occupied by baresoil and vegetation photosynthetically active vegetation andnonphotosynthetically active vegetation such as litter woodand dead biomass (Guerschman et al 2009) All these vege-tation fractions have a water storage capacity and can reducethe amount of effective rainfall available to plant roots It ispossible that the canopy of the bushes completely interceptedthe rainfall in some months because the scrub can interceptup to 20 of the precipitation and its canopy storage ca-pacity is 097 mm (Mastachi-Loza et al 2010) Consideringonly the daily rain events greater than 5 mm the correlationbetween precipitation and NEE rose to minus072 In the presentstudy the interception of rain by vegetation surfaces was notcalculated but the results suggest that it would be impor-tant to explore the relationship between net precipitation andNEE

The average NEE at a global level is minus156plusmn284 gCmminus2 yrminus1 (Baldocchi 2014) The highest frequencyamong sites that measured NEE with EC occurs from minus200to minus300 gCmminus2 yrminus1 but in sites with biometric measure-ments the peak occurs at minus100 gCmminus2 yrminus1 (Xu et al2014) Using the daily averages of Table 2 the averageNEE during the measurement period was minus078 gCmminus2 dminus1

and annually would be minus2835 gCmminus2 yrminus1 This resultwas higher than the annual values of the induced grass-land and scrubland vegetation characterizing the SonoranDesert plains (138 and 130 gCmminus2 yrminus1 Hinojo-Hinojoet al 2019) In New Mexico NEE values measured withEC are between 35ndash50 gCmminus2 yrminus1 in desert grassland and344ndash355 gCmminus2 yrminus1 in mixed coniferous forest (Anderson-Teixeira et al 2011) In a dryer region the sarcocaulescentscrubland of Baja California in Mexico the NEE was minus39and minus52 gCmminus2 yrminus1 in 2002 and 2003 respectively (Hast-ings et al 2005) The NEE measured here was within therange of NEE 03plusmn 02 kgCmminus2 yrminus1 for grasslands and

shrublands in Mexico (Murray-Tortarolo et al 2016) Al-though the measurements of the present study had gaps andwere compared with annual studies we considered that thereported value of C was representative of the main season ofgrowth of this type of scrub

45 Ecosystem management

Overgrazing is an appreciation relative to the grazing produc-tive system where the forage resource is overused in a mixedshrubndashgrass ecosystem such as Bernal it usually refers tothe understory Overgrazing means that the plant regrowth isreadily grazed tillers and root reserves are lost and eventu-ally the plant may die Although the Bernal site was over-grazed the carbon fluxes indicated that the plant communitywas photosynthetically active in both dry and rainy seasonsIt is fair to assume that water in the soil was not limiting forthe deep-rooted bush species and that was the reason whyit was possible to maintain the photosynthetic function dur-ing rainless months However this primary production wouldnot have tangible benefits for ranchersrsquo production systemsince no edible biomass would be produced for the cattleFrom the point of view of carbon capture the system accu-mulated nonlabile biomass that would remain in the systemfor a longer time compared to a grassland ecosystem (al-though it would be necessary to determine the partition ofsaid shrub biomass) However the overgrazing condition af-fects the biomass of the understory roots and consequentlythe carbon pool in the soil

In the short term it can be thought that the estimated neg-ative carbon flows had a favorable effect on the environmen-tal agenda As time passes it is possible that the gaps be-tween the individual shrubs of the overstory expand and thiswould have had an effect on soil erosion It is also possiblethat the water stored in the soil profile used by the bushesgradually decreases to the point of causing drought changesin phenology and advancements of the desertification pro-cess There are many opportunities for ecology conservationand livestock-oriented management these may include con-trolled grazing or propagating native thornless shrub speciesIf ranchers do not identify a benefit in the vegetation thenthey will be tempted to remove it as it occurred at thestudy site Because of its wide coverage and availability theMODIS GPP product accuracy is important in representingthe carbon cycle raising awareness and monitoring advance-ment of environmental decisions

Although we found that EML was a good option for mod-eling the GPP of a site what is really needed to evaluate theperformance of semiarid ecosystems is a spatial representa-tion of the carbon flux This is a problem for an underrepre-sented area regarding instrumented EC towers However theEML could be designed to take into account the explanatoryvariables in a spatiotemporal continuity

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

384 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

5 Conclusions

The best modeling approach was the ensemble of ma-chine learning The second option to estimate GPP was themultiple-regression OLS and the third alternative were theMODIS estimates Machine learning was a good option topredict GPP in the local context where it was generated oth-erwise its performance was not good This was demonstratedwhen using the EML from Santa Rita and trying to predictfluxes at Bernal Nevertheless a machine learning modelwould be useful for gap filling or the evaluation of GPP atthe same site The GPP estimates of a given model can be ad-justed using the bias parameters of the probability of agree-ment to improve the relationship Further research explor-ing the usefulness of EML to predict GPP in other contextscould use EML to calibrate mechanistic model parametershybrid approaches integrating EML and biologically mean-ingful mechanistic models or EML representing the spatialvariability of the landscape

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 385

Appendix A

Table A1 Threshold of velocity friction (ulowast) above which nighttime fluxes were considered valid Estimates were obtained using the Rpackage REddyProc We used only the flux records with ulowast equal to or higher than the corresponding mean ulowast threshold for each yearBound values are a 95 confidence interval NA not applicable

Aggregation method Year Season Mean Lower bound 5 Upper bound 95

ulowast (msminus1)

Single NA NA 0193 0140 0209Year 2017 NA 0194 0128 0227Year 2018 NA 0193 0132 0225Season 2017 2016012 0194 0128 0227Season 2017 2017003 0215 0141 0297Season 2017 2017006 0194 0128 0227Season 2017 2017009 0181 0144 0208Season 2018 2017012 0149 0116 0232Season 2018 2018003 0271 0134 0297Season 2018 2018006 0193 0132 0221

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

386 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix B

Table B1 MODIS reflectance bands and products database

Product name Satellite layer Spatial Temporal Spectralresolution resolution coverage

(d) (nm)

MOD09GA Band 1 500 m 1 620ndash670Band 2 841ndash876Band 3 459ndash479Band 4 545ndash565Band 5 1230ndash1250Band 6 1628ndash1652Band 7 2105ndash2155

MODOCGA Band 8 1 km 405ndash420Band 9 438ndash448Band 10 483ndash493Band 11 526ndash536Band 12 546ndash556Band 13 662ndash672Band 14 673ndash683

MOD17A2H Gross primary production (GPP) 8

Net photosynthesis (PsnNet)

MOD15A2H Fraction of photosynthetically active radiation (Fpar) na

Leaf area index (LAI)

MOD11A2 Land surface temperature and emissivity of day (LST day) 1 km

Land surface temperature and emissivity of night (LST night)

MYD13A1 Enhanced vegetation index (EVI) 500 m 16

MYD15A2H Fraction of photosynthetically active radiation (Fpar) 8

Leaf area index (LAI)

MYD11A2 Land surface temperature and emissivity of day (LST day) 1 km na

Land surface temperature and emissivity of night (LST night)

MCD15A2H Fraction of photosynthetically active radiation (Fpar) 500 m

Leaf area index (LAI)

na not applicable

Table B2 Daily vegetation indexes computed using the MODIS reflectance bands described in Table 1

Index Formula Reference

Simple ratio (SimpleR) SimpleR= Band 2Band 1

(Hill et al 2006)Moisture stress (MoistS) MoistS= Band 6

Band 2

Disease stress index (DSI) DSI= Band 2+Band 4Band 6+Band 1

Redndashgreen ratio index (RGRI) RGRI= Band 1Band 4

Normalized difference vegetation index (NDVI) NDVI= Band 2minusBand 1Band 2+Band 1

Normalized difference water index (NDVI_w) NDVI_w =Band 2minusBand 5Band 2+Band 5

Green leaf index (GLI) GLI= 2lowastBand 11minusBand 14minusBand 92lowastBand 11+Band 14+Band

(Henrich et al 2012)Green atmospherically resistant vegetation (GARI) GARI= Band 5minus(Band 11minus(Band 9minusBand 14))

Band 5minus(Band 11+(Band 9minusBand 14))

Enhanced vegetation index (EVI) EVI= 25lowast Band 2minusBand 1Band 2+(6lowastBand 1)minus(75lowastBand 9)+1

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 387

Table B3 Daymet meteorological database

Variable Spatial resolution Temporal resolution Reference

Precipitation (Dayprc)

1 km Daily (Thornton et al 2017)Shortwave radiation (Daysrad)Maximum air temperature (DayTmax)Minimum air temperature (DayTmin)Water vapor pressure (Dayvp)

Table B4 Precipitation data

Satellite Product name Spatial resolution Temporal resolution

Global Precipitation Measurement (GPM) 3IMERGDF v006 010Daily

Tropical Rainfall Measuring Mission (TRMM) 3B42 v007 025

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

388 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix C

Table C1 Analysis of variance for GGP derived from EC data atthe Bernal site Variable details are described in Appendix B

Model term Coefficient SEa DFb SSc F p VIFd

Regression 14 178048Constant minus1318 648 minus203 004R_08_40542e

minus212 060 1 2680 minus352 0 220R_13_66267e

minus286 123 1 1161 minus232 002 155R_14_67368e 917 259 1 2706 353 0 122RGRI6

minus083 037 1 1100 minus225 003 143GARIf 201 108 1 755 187 006 237EVIMYDg 853 201 1 3903 425 0 294NDVI_wf 390 155 1 1367 251 001 271PsnNetg 0013 0 1 7530 590 0 383LstNgtMYDg 005 002 1 818 194 005 201Dayprch 014 003 1 3590 407 0 192Daysradh 001 0 1 5121 486 0 501DayTmaxh

minus024 006 1 3658 minus411 0 689DayTminh 018 005 1 3222 386 0 496TRMMi minus004 002 1 1087 minus224 003 141Error 403 87297Total 417 265345

a Standard error of coefficient b Degrees of freedom c Adjusted sum of squares d Variance inflation factore Variable name denotes the band number and spectral bandwidth of MODIS (Moderate Resolution ImagingSpectroradiometer) f Vegetation indices RGRI is redndashgreen ratio index and GARI is green atmosphericallyresistant vegetation index details of formula are described in Appendix B g Layers from MODIS productsEVIMYD is enhanced vegetation index from MYD13 PsnNet is photosynthesis product form MOD17LstNgtMYD is nighttime land surface temperature emissivity from MYD11 h Variables obtained fromDaymet daily dataset DayTmin is minimum temperature DayTmax is maximum temperature Daysrad isshortwave radiation Dayprc is precipitation i Daily rainfall rate from 3B42 TRMM (Tropical RainfallMeasuring Mission)

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 389

Code and data availability Database and programming code areavailable at httpsdoiorg105281zenodo3598595 (Guevara-Escobar et al 2020)

Author contributions AGE contributed to conceptualization for-mal analysis investigation methodology software supervisionvalidation visualization and writing of the original draft as wellas review and editing EGS was in charge of funding acquisitionproject administration and writing and review MCJ contributed toinvestigation methodology validation visualization and writing ofthe original draft as well as review editing and communicationMEQB helped with investigation validation writing and reviewHSA contributed with conceptualization writing and review ICAacutehelped with investigation writing and review VHCS contributedwith writing review and funding acquisition

Competing interests The authors declare that they have no conflictof interest

Acknowledgement We thank NASA and Ameriflux for data avail-ability Data filtering was carried out by the cluster Abacus ICentre of Applied Mathematics and High Performance ComputingABACUS-CINVESTAV

Financial support This research has been supported by CONA-CyT ndash SEMARNAT (grant no 2014-1-(249407))

Review statement This paper was edited by Paul Stoy and re-viewed by three anonymous referees

References

Aguirre-Diacuteaz G J Aguilloacuten-Robles A Tristaacuten-Gonzaacutelez MLabarthe-Hernaacutendez G Loacutepez-Martiacutenez M Bellon Hand Nieto-Obregoacuten J Geologic setting of the Pentildeade Bernal Natural Monument Quereacutetaro Meacutexico Anendogenous volcanic dome Geosphere 9 557ndash571httpsdoiorg101130GES008431 2013

Anav A Friedlingstein P Beer C Ciais P Harper AJones C Murray-Tortarolo G Papale D Parazoo N Cand Peylin P Spatiotemporal patterns of terrestrial grossprimary production A review Rev Geophys 53 785ndash818httpsdoiorg1010022015RG000483 2015

Anderson-Teixeira K J Delong J P Fox A M Brese D Aand Litvak M E Differential responses of production and res-piration to temperature and moisture drive the carbon balanceacross a climatic gradient in New Mexico Glob Change Biol17 410ndash424 httpsdoiorg101111j1365-2486201002269x2011

Baldocchi D Measuring fluxes of trace gases and energy be-tween ecosystems and the atmospherendashthe state and future of the

eddy covariance method Glob Change Biol 20 3600ndash3609httpsdoiorg101111gcb12649 2014

Barron-Gafford G A Sanchez-Cantildeete E P Minor R LHendryx S M Lee E Sutter L F Tran N Parra EColella T Murphy P C Hamerlynck E P Kumar P andScott R L Impacts of hydraulic redistribution on grassndashtreecompetition vs facilitation in a semi-arid savanna New Phytol215 1451ndash1461 httpsdoiorg101111nph14693 2017

Belayneh A and Tessema Z K Mechanisms of bush encroach-ment and its inter-connection with rangeland degradation insemi-arid African ecosystems a review J Arid Land 9 299ndash312 httpsdoiorg101007s40333-016-0023-x 2017

Bland J M and Altman D G Statistical methods for assessingagreement between two methods of clinical measurement Int JNurs Stud 47 931ndash936 2010

Boettiger C and Eddelbuettel D An introduction to rockerDocker containers 639 for r ArXiv Prepr ArXiv171003675 6402017

Bonilla-Moheno M and Aide T M Beyond deforestationLand cover transitions in Mexico Agric Syst 178 102734httpsdoiorg101016jagsy2019102734 2020

Booker K Huntsinger L Bartolome J W Sayre N Fand Stewart W What can ecological science tell us aboutopportunities for carbon sequestration on arid rangelands inthe United States Glob Environ Change 23 240ndash251httpsdoiorg101016jgloenvcha201210001 2013

Butzer K W and Butzer E K The lsquonaturalrsquovegetation ofthe Mexican Bajiacuteo archival documentation of a 16th-century savanna environment Quat Int 43 161ndash172httpsdoiorg101016S1040-6182(97)00032-3 1997

CICESE C de I C y de E S de E Base de datos climatoloacutegicanacional (CLICOM) Villa Bernal Quereacutetaro available at httpclicom-mexcicesemx (last access 28 August 2019) 2015

Cohen W B Maiersperger T K Yang Z Gower S TTurner D P Ritts W D Berterretche M and Run-ning S W Comparisons of land cover and LAI esti-mates derived from ETM+ and MODIS for four sites inNorth America A quality assessment of 20002001 provi-sional MODIS products Remote Sens Environ 88 233ndash255httpsdoiorg101016jrse200306006 2003

Cui T Wang Y Sun R Qiao C Fan W Jiang GHao L and Zhang L Estimating vegetation primaryproduction in the Heihe River Basin of China withmulti-source and multi-scale data PloS One 11 1ndash20httpsdoiorg101371journalpone0153971 2016

Curtis J T and McIntosh R P The Interrelations of Certain An-alytic and Synthetic Phytosociological Characters Ecology 31434ndash455 httpsdoiorg1023071931497 1950

Delgado-Balbuena J Yepez E A Aacutengeles-Peacuterez G Aguirre-Gutierrez C Arredondo T Ayala-Nintildeo F Bullock S HCastellanos A E Cueva A Figueroa-Espinoza B Garatuza-Payaacuten J Hinojo-Hinojo C Maya-Delgado Y Meacutendez-Barroso L Oechel W Paz-Pellat F Peacuterez-Ruiacutez E R Ro-driacuteguez J C Rojas-Robles N Saacutenchez-Mejiacutea Z M Uuh-Sonda J Vargas R Verduzco V S Vivoni E R andWatts C Flujos anuales de carbono en ecosistemas terrestres deMeacutexico in Estado Actual del Conocimiento del Ciclo del Car-bono y sus Interacciones en Meacutexico Siacutentesis a 2018 p 686Texcoco Estado de Meacutexico Meacutexico 2018

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

390 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Eshel G Dayalu A Wofsy S C Munger J W andTziperman E Listening to the Forest An ArtificialNeural Network-Based Model of Carbon Uptake at Har-vard Forest J Geophys Res-Biogeo 124 461ndash478httpsdoiorg1010292018JG004791 2019

Fu C Wang G Bible K Goulden M L Saleska S RScott R L and Cardon Z G Hydraulic redistributionaffects modeled carbon cycling via soil microbial activityand suppressed fire Glob Change Biol 24 3472ndash3485httpsdoiorg101111gcb14164 2018

Gebremichael M and Barros A P Evaluation of MODISGross Primary Productivity (GPP) in tropical mon-soon regions Remote Sens Environ 100 150ndash166httpsdoiorg101016jrse200510009 2006

Gitelson A A Peng Y Arkebauer T J and Suyker A EProductivity absorbed photosynthetically active radiation andlight use efficiency in crops Implications for remote sensingof crop primary production J Plant Physiol 177 100ndash109httpsdoiorg101016jjplph201412015 2015

Glenn E P Scott R L Nguyen U and Nagler P L Wide-area ratios of evapotranspiration to precipitation in monsoon-dependent semiarid vegetation communities J Arid Envi-ron 117 84ndash95 httpsdoiorg101016jjaridenv2015020102015

Goldstein A Turner W R Spawn S A Anderson-Teixeira K J Cook-Patton S Fargione J Gibbs H KGriscom B Hewson J H and Howard J F Protecting ir-recoverable carbon in Earthrsquos ecosystems Nat Clim Change10 1ndash9 httpsdoiorg101038s41558-020-0738-8 2020

Goulden M L Munger J W Fan S Daube B Cand Wofsy S C Measurements of carbon sequestra-tion by long-term eddy covariance Methods and a criti-cal evaluation of accuracy Glob Change Biol 2 169ndash182httpsdoiorg101111j1365-24861996tb00070x 1996

Grossiord C Buckley T N Cernusak L A Novick K A Poul-ter B Siegwolf R T Sperry J S and McDowell N GPlant responses to rising vapor pressure deficit New Phytol226 1550ndash1566 2020

Guerschman J P Hill M J Renzullo L J Barrett D JMarks A S and Botha E J Estimating fractional coverof photosynthetic vegetation non-photosynthetic vegetation andbare soil in the Australian tropical savanna region upscaling theEO-1 Hyperion and MODIS sensors Remote Sens Environ113 928ndash945 httpsdoiorg101016jrse200901006 2009

Guevara-Escobar A Gonzaacutelez-Sosa E Cervantes-Jimeacutenez MSuzaacuten-Azpiri H Queijeiro-Bolantildeos M E Carrillo-AacutengelesI and Cambroacuten-Sandoval V H Machine learning estimates ofeddy covariance carbon flux in a scrub in the Mexican highlandZenodo httpsdoiorg105281zenodo3598595 2020

Hall P Gill N Kurka M Phan W and Bartz A MachineLearning Interpretability with H2O Driverless AI edited byBartz A H2Oai Inc California US 2019

Hastings S J Oechel W C and Muhlia-Melo A Diur-nal seasonal and annual variation in the net ecosystem CO2exchange of a desert shrub community (Sarcocaulescent) inBaja California Mexico Glob Change Biol 11 927ndash939httpsdoiorg101111j1365-2486200500951x 2005

Henrich V Krauss G Goumltze C and Sandow C Entwicklungeiner Datenbank fuumlr Fernerkundungsindizes vol 45 AK Fern-erkundung Bochum 2012

Hill M Held A Leuning R Coops N Hughes D andCleugh H MODIS spectral signals at a flux tower site Rela-tionships with high-resolution data and CO2 flux and light useefficiency measurements Remote Sens Environ 103 351ndash368httpsdoiorg101016jrse200506015 2006

Hill M J and Guerschman J P The MODIS Global VegetationFractional Cover Product 2001ndash2018 Characteristics of Vegeta-tion Fractional Cover in Grasslands and Savanna Woodlands Re-mote Sens 12 406 httpsdoiorg103390rs12030406 2020

Hinojo-Hinojo C Castellanos A E Huxman T Ro-driguez J C Vargas R Romo-Leoacuten J R and Bie-derman J A Native shrubland and managed buffelgrasssavanna in drylands Implications for ecosystem carbonand water fluxes Agr Forest Meteorol 268 269ndash278httpsdoiorg101016jagrformet201901030 2019

Jaumlrvi L Havu M Ward H C Bellucco V McFad-den J P Toivonen T Heikinheimo V Kolari P Ri-ikonen A and Grimmond C S B Spatial modeling oflocal-scale biogenic and anthropogenic carbon dioxide emis-sions in Helsinki J Geophys Res-Atmos 124 8363ndash8384httpsdoiorg1010292018JD029576 2019

Joiner J and Yoshida Y Satellite-based reflectances capture largefraction of variability in global gross primary production (GPP)at weekly time scales Agr Forest Meteorol 291 108092httpsdoiorg101016jagrformet2020108092 2020

Jung M Schwalm C Migliavacca M Walther S Camps-VallsG Koirala S Anthoni P Besnard S Bodesheim P Carval-hais N Chevallier F Gans F Goll D S Haverd V KoumlhlerP Ichii K Jain A K Liu J Lombardozzi D Nabel J EM S Nelson J A OrsquoSullivan M Pallandt M Papale DPeters W Pongratz J Roumldenbeck C Sitch S TramontanaG Walker A Weber U and Reichstein M Scaling carbonfluxes from eddy covariance sites to globe synthesis and eval-uation of the FLUXCOM approach Biogeosciences 17 1343ndash1365 httpsdoiorg105194bg-17-1343-2020 2020

Kang M Ichii K Kim J Indrawati Y M Park J Moon MLim J-H and Chun J-H New Gap-Filling Strategies forLong-Period Flux Data Gaps Using a Data-Driven ApproachAtmosphere 10 568 httpsdoiorg103390atmos101005682019

Lal R Griffin M Apt J Lave L and Morgan M GManaging Soil Carbon Science 304 393ndash393httpsdoiorg101126science1093079 2004

LI-COR B Eddypro Software instruction manual Nebraska USversion 70 2019

Liu Y Liu R Pisek J and Chen J M Separating overstoryand understory leaf area indices for global needleleaf and de-ciduous broadleaf forests by fusion of MODIS and MISR dataBiogeosciences 14 1093ndash1110 httpsdoiorg105194bg-14-1093-2017 2017

Long S P Marshall-Colon A and Zhu X-G Meetingthe Global Food Demand of the Future by EngineeringCrop Photosynthesis and Yield Potential Cell 161 56ndash66httpsdoiorg101016jcell201503019 2015

Ma X Huete A Yu Q Restrepo-Coupe N Beringer J Hut-ley L B Kanniah K D Cleverly J and Eamus D Param-

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 391

eterization of an ecosystem light-use-efficiency model for pre-dicting savanna GPP using MODIS EVI Remote Sens Environ154 253ndash271 2014

Ma X Migliavacca M Wirth C Bohn F J Huth ARichter R and Mahecha M D Monitoring Plant FunctionalDiversity Using the Reflectance and Echo from Space RemoteSens 12 1248 httpsdoiorg103390rs12081248 2020

Marcolla B Roumldenbeck C and Cescatti A Patterns and con-trols of inter-annual variability in the terrestrial carbon budgetBiogeosciences 14 3815ndash3829 httpsdoiorg105194bg-14-3815-2017 2017

Mastachi-Loza C A Gonzaacutelez-Sosa E Becerril-Pintildea R andBraud I Peacuterdidas por intercepcioacuten en mezquite (Prosopis laevi-gata) y huizache (Acacia farnesiana) de la regioacuten semiaacuterida delcentro de Meacutexico Tecnol Cienc Agua 1 103ndash120 2010

Mauder M and Foken T Documentation and instruction man-ual of the eddy-covariance software package TK3 UniversitaumltBayreuth Abteilung Mikrometeorologie Bayreuth Germany2011

McClaran M A century of vegetation change on the Santa Rita Ex-perimental Range in Santa Rita Experimental Range 100 years(1903 to 2003) to accomplishments and contributions edited byMcClaran M P Ffolliott P F and Edminister C B Tech Co-ords USDA Forest service proceedings RMRS-P-30 TucsonAZ US 16ndash33 2003

McCulley R Jobbagy E Pockman W and Jackson R Nu-trient uptake as a contributing explanation for deep rootingin arid and semi-arid ecosystems Oecologia 141 620ndash628httpsdoiorg101007s00442-004-1687-z 2004

McElreath R Statistical Rethinking A Bayesian Course with Ex-amples in R and STAN CRC Press Boca Raton FL US 2020

Michailidis M How Driverless AI Prevents Overfitting and Leak-age Open Source Lead AI ML available at httpswwwh2oaiblogdriverless-ai-prevents-overfitting-leakage (last access7 July 2020) 2018

Mitchell R and Frank E Accelerating the XGBoost algo-rithm using GPU computing PeerJ Comput Sci 3 e127httpsdoiorg107717peerj-cs127 2017

Monteith J Solar radiation and productivity in tropical ecosys-tems J Appl Ecol 9 747ndash766 1972

Murray-Tortarolo G Friedlingstein P Sitch S Jaramillo V JMurguiacutea-Flores F Anav A Liu Y Arneth A Arvanitis AHarper A Jain A Kato E Koven C Poulter B Stocker BD Wiltshire A Zaehle S and Zeng N The carbon cycle inMexico past present and future of C stocks and fluxes Biogeo-sciences 13 223ndash238 httpsdoiorg105194bg-13-223-20162016

Nash J E and Sutcliffe J V River flow forecasting through con-ceptual models part I ndash A discussion of principles J Hydrol 10282ndash290 1970

Owen N A Choncubhair Oacute N Males J del Real Laborde J IRubio-Corteacutes R Griffiths H and Lanigan G Eddy covari-ance captures four-phase crassulacean acid metabolism (CAM)gas exchange signature in Agave Plant Cell Environ 39 295ndash309 httpsdoiorg101111pce12610 2016

Pasetto D Arenas-Castro S Bustamante J Casagrandi RChrysoulakis N Cord A F Dittrich A Domingo-Marimon C El Serafy G and Karnieli A Integration ofsatellite remote sensing data in ecosystem modelling at local

scales Practices and trends Methods Ecol Evol 9 1810ndash1821httpsdoiorg1011112041-210X13018 2018

Pastorello G Trotta C and Canfora E The FLUXNET2015dataset and the ONEFlux processing pipeline for eddy covari-ance data Sci Data 7 225 pp httpsdoiorg101038s41597-020-0534-3 2020

Pearcy R W Heterogeneous light environments in Handbookof Functional Plant Ecology edited by Pugnaire I F and Val-ladares F CRC Press Boca Ratoacuten FL US 270ndash314 2007

Piao S Liu Q Chen A Janssens I A Fu Y Dai J Liu LLian X Shen M and Zhu X Plant phenology and global cli-mate change Current progresses and challenges Glob ChangeBiol 25 1922ndash1940 httpsdoiorg101111gcb14619 2019

R Development Core Team R A language and environment forstatistical computing R Foundation for Statistical Computingavailable at httpswwwr-projectorg (last access 15 October2015) 2009

Reichstein M Falge E Baldocchi D Papale D Aubinet MBerbigier P Bernhofer C Buchmann N Gilmanov Tand Granier A On the separation of net ecosystem ex-change into assimilation and ecosystem respiration reviewand improved algorithm Glob Change Biol 11 1424ndash1439httpsdoiorg101111j1365-24862005001002x 2005

Richardson A D and Hollinger D Y Statistical model-ing of ecosystem respiration using eddy covariance datamaximum likelihood parameter estimation and Monte Carlosimulation of model and parameter uncertainty applied tothree simple models Agr Forest Meteorol 131 191ndash208httpsdoiorg101016jagrformet200505008 2005

Richardson A D Hollinger D Y Shoemaker J K Hughes HSavage K and Davidson E A Six years of ecosystem-atmosphere greenhouse gas fluxes measured in a sub-borealforest Sci Data 6 1ndash15 httpsdoiorg101038s41597-019-0119-1 2019

Roby M C Scott R L Barron-Gafford G A Hamerlynck E Pand Moore D J P Environmental and Vegetative Controls onSoil CO2 Efflux in Three Semiarid Ecosystems Soil Syst 3 6httpsdoiorg103390soilsystems3010006 2019

Running S W and Zhao M Daily GPP and annual NPP(MOD17A2A3) products NASA Earth Observing SystemMODIS land algorithm Univ of Mont Missoula Mont US2015

Running S W Nemani R R Heinsch F A Zhao MReeves M and Hashimoto H A continuous satellite-derived measure of global terrestrial primary productionBioscience 54 547ndash560 httpsdoiorg1016410006-3568(2004)054[0547ACSMOG]20CO2 2004

Rzedowski J Vegetacioacuten de Meacutexico Meacutexico Editor Limusa1978

Scott R L and Biederman J A Critical Zone Water Balance Over13 Years in a Semiarid Savanna Water Resour Res 55 574ndash588 httpsdoiorg1010292018WR023477 2019

Scott R L Watts C Payan J G Edwards E Goodrich D CWilliams D and James Shuttleworth W The understory andoverstory partitioning of energy and water fluxes in an opencanopy semiarid woodland Agr Forest Meteorol 114 127ndash139 httpsdoiorg101016S0168-1923(02)00197-1 2003

Scott R L Edwards E A Shuttleworth W J Huxman T EWatts C and Goodrich D C Interannual and seasonal vari-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

392 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

ation in fluxes of water and carbon dioxide from a ripar-ian woodland ecosystem Agr Forest Meteorol 122 65ndash84httpsdoiorg101016jagrformet200309001 2004

Scott R L Biederman J A Hamerlynck E P and Barron-Gafford G A The carbon balance pivot point of southwest-ern US semiarid ecosystems Insights from the 21st centurydrought J Geophys Res-Biogeo 120 2612ndash2624 2015

Segerstrom K Geology of the Bernal-Jalpan Area Estado deQueretaro Mexico Geol Surv Bull 1104 19ndash82 1961

Soper F M McCalley C K Sparks K and Sparks J P Soilcarbon dioxide emissions from the Mojave desert Isotopic evi-dence for a carbonate source Geophys Res Lett 44 245ndash251httpsdoiorg1010022016GL071198 2017

Stevens N T Steiner S H and MacKay R J Assessing agree-ment between two measurement systems An alternative to thelimits of agreement approach Stat Methods Med Res 262487ndash2504 httpsdoiorg1011770962280215601133 2015

Tagesson T Ardouml J Cappelaere B Kergoat L AbdiA Horion S and Fensholt R Modelling spatial andtemporal dynamics of gross primary production in theSahel from earth-observation-based photosynthetic capacityand quantum efficiency Biogeosciences 14 1333ndash1348httpsdoiorg105194bg-14-1333-2017 2017

Thornton P Thornton M Mayer B Wei Y Devarakonda RVose R and Cook R Daymet Daily Surface Weather Data ona 1-km Grid for North America Version 3 Oak Ridge NationalLab (ORNL) Oak Ridge TN US 2017

Tramontana G Jung M Schwalm C R Ichii K Camps-VallsG Raacuteduly B Reichstein M Arain M A Cescatti A KielyG Merbold L Serrano-Ortiz P Sickert S Wolf S andPapale D Predicting carbon dioxide and energy fluxes acrossglobal FLUXNET sites with regression algorithms Biogeo-sciences 13 4291ndash4313 httpsdoiorg105194bg-13-4291-2016 2016rdquo

Truong A Walters A Goodsitt J Hines K Bruss C B andFarivar R Towards automated machine learning Evaluationand comparison of AutoML approaches and tools in Proceed-ings of the 2019 IEEE 31st International Conference on Toolswith Artificial Intelligence (ICTAI) Portland OR US 1471ndash1479 2019

Verbist K Santibantildeez F Gabriels D and Soto G Atlas dezonas aacuteridas de Ameacuterica Latina y El Caribe Donoso MC(Coord) UNESCO-PHI and CAZALAC Montevideo Uruguay2010

Wang H Li X Ma M and Geng L Improving esti-mation of gross primary production in dryland ecosystemsby a model-data fusion approach Remote Sens 11 225httpsdoiorg103390rs11030225 2019

Wang X Ma M Li X Song Y Tan J Huang G Zhang ZZhao T Feng J and Ma Z Validation of MODIS-GPP prod-uct at 10 flux sites in northern China Int J Remote Sens 34587ndash599 httpsdoiorg101080014311612012715774 2013

Wang X Wang H Li X and Ran Y Photosynthesis (NPPNEP Respiration) Obs Meas Ecohydrol Process 1ndash30httpsdoiorg101007978-3-662-47871-4_10-1 2018

Wheeler C W Archer S R Asner G P and McMurtry C RClimaticedaphic controls on soil carbonnitrogen response toshrub encroachment in desert grassland Ecol Appl 17 1911ndash1928 httpsdoiorg10189006-15801 2007

Wilcox B P Birt A Fuhlendorf S D and Archer S R Emerg-ing frameworks for understanding and mitigating woody plantencroachment in grassy biomes Curr Opin Env Sust 32 46ndash52 httpsdoiorg101016jcosust201804005 2018

Wilson J B Cover plus ways of measuring plant canopiesand the terms used for them J Veg Sci 22 197ndash206httpsdoiorg101111j1654-1103201001238x 2011

Wofsy S Goulden M Munger J Fan S-M Bakwin PDaube B Bassow S and Bazzaz F Net exchange ofCO2 in a mid-latitude forest Science 260 1314ndash1317httpsdoiorg101126science26051121314 1993

Wu C Niu Z and Gao S Gross primary production estimationfrom MODIS data with vegetation index and photosyntheticallyactive radiation in maize J Geophys Res-Atmos 115 D12127httpsdoiorg1010292009JD013023 2010

Wutzler T Lucas-Moffat A Migliavacca M Knauer JSickel K Šigut L Menzer O and Reichstein MBasic and extensible post-processing of eddy covarianceflux data with REddyProc Biogeosciences 15 5015ndash5030httpsdoiorg105194bg-15-5015-2018 2018

Xiao H McDonald-Madden E Sabbadin R Peyrard NDee L E and Chadegraves I The value of understanding feed-backs from ecosystem functions to species for managing ecosys-tems Nat Commun 10 1ndash10 httpsdoiorg101038s41467-019-11890-7 2019

Xu B Yang Y Li P Shen H and Fang J Globalpatterns of ecosystem carbon flux in forests A biometricdata-based synthesis Global Biogeochem Cy 28 962ndash973httpsdoiorg1010022013GB004593 2014

Yao J Liu H Huang J Gao Z Wang G Li D Yu H andChen X Accelerated dryland expansion regulates future vari-ability in dryland gross primary production Nat Commun 111ndash10 httpsdoiorg101038s41467-020-15515-2 2020

Yepez E A Williams D G Scott R L and Lin GPartitioning overstory and understory evapotranspiration ina semiarid savanna woodland from the isotopic composi-tion of water vapor Agr Forest Meteorol 119 53ndash68httpsdoiorg101016S0168-1923(03)00116-3 2003

Yona L Cashore B Jackson R B Ometto J and Brad-ford M A Refining national greenhouse gas inventories Am-bio 1ndash6 httpsdoiorg101007s13280-019-01312-9 2020

Zeng X Zeng X and Barlage M Growing temperate shrubsover arid and semiarid regions in the Community Land ModelndashDynamic Global Vegetation Model Global Biogeochem Cy 22GB3003 httpsdoiorg1010292007GB003014 2008

Zhang A Jia G Epstein H E and Xia J ENSO elicits oppos-ing responses of semi-arid vegetation between Hemispheres SciRep 7 42281 httpsdoiorg101038srep42281 2017

Zhang L Xiao J Zheng Y Li S and Zhou Y Increased carbonuptake and water use efficiency in global semi-arid ecosystemsEnviron Res Lett 15 034022 httpsdoiorg1010881748-9326ab68ec 2020

Zhu X-G Long S P and Ort D R What is the max-imum efficiency with which photosynthesis can convert so-lar energy into biomass Curr Opin Biotech 19 153ndash159httpsdoiorg101016jcopbio200802004 2008

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

  • Abstract
  • Introduction
  • Materials and methods
    • Site description
    • Eddy covariance measurements
    • Flux data processing
    • Remotely sensed data
    • MODIS algorithm for GPP
    • Santa Rita site dataset
    • Modeling
    • Variable importance
    • Model agreement
      • Results
        • Eddy covariance fluxes at Bernal
        • Machine learning ensembles as predictors of eddy covariance GPP
        • MODIS as predictor of eddy covariance GPP
        • Prediction eddy covariance GPP with ordinary-least-squares multiple regression
        • Agreement
          • Discussion
            • Model agreement
            • MODIS discrepancies
            • Forcing variables and machine learning
            • Carbon flux
            • Ecosystem management
              • Conclusions
              • Appendix A
              • Appendix B
              • Appendix C
              • Code and data availability
              • Author contributions
              • Competing interests
              • Acknowledgement
              • Financial support
              • Review statement
              • References
Page 5: Machine learning estimates of eddy covariance carbon flux ... · Scrub is a xeric category of shrublands characterized by plants with small leaves; it is very thorny, and its biomass

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 371

Figure 2 Source strength-weighted function at the Bernal site during 2017 and 2018 (gray points) Horizontally plots show the peak ofthe function and increasing percentages of the flux footprint The radial scale is along-wind distance from the sensor (m) The measurementheight was 6 m The blue points represent the vegetation sample plots used to calculate the importance vegetation index (IVI)

Figure 3 Thorny scrub at Bernal Quereacutetaro during the rainy season 2017 In the foreground is Cylindropuntia imbricata a very thornycactus shrubs in the background are Prosopis laevigata mesquites

had a ulowast below the annual mean ulowast threshold The data with aflag equal to 0 were used for the variable NEE_uStar_fqc asdefined by REddyProc Carbon dioxide flux data were time-integrated and converted to grams of C per square meter perday using the molar ratio of C We only reported the contin-uous measurements of the exchange of CO2 for the periodof April 2017 (DOY 89) to August 2018 (DOY 234) usingthe EC technique Due to equipment malfunction and incom-plete datasets some periods of time were not considered Themeasurement campaign presented here was not biased by wet

winters since both years were characterized by a less-than-weak NintildeondashNintildea

24 Remotely sensed data

Data were requested to the National Aeronautics andSpace Administration (NASA) via the Land Processes Dis-tributed Active Archive Center (DAAC) using the Ap-plication for Extracting and Exploring Analysis ReadySamples (AppEEARS) to obtain spatial and temporal

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

372 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

subsets for the Bernal and Santa Rita sites includ-ing daily surface reflectance (MOD09GA006 and MOD-OCGA006) daily daytime and nighttime land surface tem-perature (LST) (MOD11A2006 and MYD11A2006) 8 dleaf area index (LAI) and fraction of photosynthetically ac-tive radiation (FPAR) (MOD15A2H006 MYD15A2H006MCD15A2H006) 16 d enhanced vegetation index (EVI)(MYD13A1006) 16 d gross primary production (GPP)and net photosynthesis (PsnNet) (MOD17A2H006) TheAppEEARS also unpacks and interprets the quality layersAppendix B presents the details of each spectral band ofMODIS Data with less-than-good quality flags were deletedMissing data were filled with splines and a database witha 1 d time step was generated This would smooth lin-ear temporalndashphenological evolution between any two suc-cessive remotely sensed data points (Eshel et al 2019)Daily accumulated rainfall was requested using the Giovanniplatform from the Goddard Space Flight Center (GSFC)DAAC two products were used the 3IMERGDF006 fromthe Global Precipitation Measurement mission (GPM) andthe 3B42007 from the Tropical Rainfall Measuring mis-sion (TRMM) Gridded weather parameters from the ORNLDAAC Daymet dataset were precipitation shortwave radia-tion maximum and minimum air temperature and water va-por pressure Daymet is a data product derived from a collec-tion of algorithms interpolating and extrapolating daily me-teorological observations (Thornton et al 2017) FollowingHenrich et al (2012) and Hill et al (2006) daily reflectancebands of MODIS were used to compute several vegetation in-dices redndashgreen ratio index (RGRI) simple ratio (SimpleR)moisture stress (MoistS) disease stress index (DSI) greenatmospherically resistant vegetation index (GARI) normal-ized difference vegetation index (NDVI) normalized differ-ence water index (NDVI_w) and enhanced vegetation in-dex (EVI) the corresponding equations are presented in Ap-pendix B

25 MODIS algorithm for GPP

Estimates of GPP are derived from data recorded by theMODIS sensor aboard the Terra and Aqua satellites The ef-ficiency (ε gCMJminus1) with which vegetation produces drymatter is defined as the amount of solar energy stored by pho-tosynthesis in a given period divided by the solar constant in-tegrated over the same period (Monteith 1972) Not all inci-dent solar radiation is available for biomass conversion onlyabout 48 is photosynthetically active (PAR MJmminus2) andnot all PAR is absorbed (Zhu et al 2008) Thus carbon ex-change is mainly controlled by the amount of PAR absorbedby green vegetation (APAR) and modified by ε (Gitelson etal 2015) The fraction of absorbed PAR (FPAR) is equalto APARPAR but can be represented by the NDVI spectralvegetation index produced by MODIS (Running 2004) Theefficiency term ε is described as the product of different fac-tors as a whole or as a part of the system (Monteith 1972) but

mostly those related to the efficiencies with which the vegeta-tion intercepts the radiation and the efficiency of convertingthe intercepted radiation into biomass (Long et al 2015)The MODIS algorithm that estimates GPP in the MOD17product is (Running et al 2004)

GPP= εtimesFPARtimesPAR (1)GPP= εtimesNDVItimesPAR (2)

The ε term in the MODIS algorithm is represented by amaximum radiation conversion efficiency (εmax kgCMJminus1)that is attenuated by suboptimal climatic conditions mainlyminimum air temperature (Tmin) and vapor pressure deficit(VPD) Two parameters for each Tmin and VPD are usedto define attenuation scalars for general biome types Theseparameters form linear functions between the scalars (Run-ning and Zhao 2015 Wang et al 2013) daily minimumtemperature at which ε = εmax and at which ε = 0 and thedaylight average VPD at which ε = εmax and at which ε =0 GPP is truncated on days when air temperature is be-low 0 C or VPD is higher than 2000 Pa (Running andZhao 2015) Stress and nutrient constraints on vegetationgrowth are quantified by the limiting relation of leaf areain NDVItimesPAR rather than constrained through ε (Runninget al 2004) However the MODIS algorithm does not con-sider stomatal sensitivity to leaf-to-air vapor pressure acrossand within species and particularly between isohydric andanisohydric plant species (Grossiord et al 2020)

ε = εmaxtimes Tmin scalartimesVPD_scalar (3)

The MOD17 userrsquos guide presents a biome-property look-up table (BPLUT) with the parameters for each biome typeand assumes that they do not vary with space or time (Run-ning and Zhao 2015) This aspect is important because εmaxhas the strongest impact on the predicted GPP of the MOD17algorithm (Wang et al 2013) The assumption is also impor-tant because the overstory and understory could be decou-pled from each other and would intercept different amountsof light and have different water sources during the grow-ing season (Scott et al 2003) Light quantity and quality asdiffuse light or sunflecks determine differences among un-derstory species in their temporal response to gaps involvingacclimation and avoidance of photoinhibition (Pearcy 2007)Another shortcoming is that few land cover classifications areincorporated into the MOD17 algorithm

26 Santa Rita site dataset

The Santa Rita Experimental Range (SRER) is located inthe western range of the Santa Rita Mountains in ArizonaUSA (lat 318214 longminus1108661 1120 masl) Climate isBSk with mean annual precipitation of 380 mm temperatureof 179 C and Ustic Torrifluvent soils Established in 1903SRER has a long history of experimental manipulations toenhance grazing potential for cattle (Glenn et al 2015) Two

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 373

Figure 4 Semidesert grassland encroached by mesquite (Prosopisvelutina) at Santa Rita Arizona (US-SRM) Image credit RussellScott 9 December 2016

Ameriflux sites are located in the SRER Santa Rita Grass-land (US-SRG) and Santa Rita Mezquite (US-SRM) Weused EC data for the years 2013ndash2019 from US-SRM whichis a mesquite grass savanna (35 mesquite canopy coverand mean canopy height above 2 m 22 grasses and 43 bare soil) although MODIS describes this site as open shrub-lands (Glenn et al 2015 Scott et al 2004) The US-SRMsite is dominated by velvet mesquite (Prosopis velutina) andhas a diversity of shrubs cacti succulents and bunch grasses(McClaran 2003) This site was chosen because the vegeta-tion and climate are similar to the Bernal site and it was theclosest EC instrumentation with data availability (Fig 4)

27 Modeling

Gross primary production estimated by EC at the Bernal sitewas modeled using OLS and EML The explanatory vari-ables were the remotely sensed data the weather parame-ters and the vegetation indices (Appendix B) The OLS isa particular case of the generalized linear model where thevariation of a single response variable is explained by severalindependent variables The OLS was fitted with the stepwiseprocedure the final model included variables with a variance

inflation factor (VIF) lower than 10 and a significance levelof 005 Predictions of the EC GPP were obtained with the fi-nal model Analysis and diagnostics were made with Minitabv 17 (Minitab LLC) Analysis of the OLS model can be usedto determine agreement between methods of measurementbut it is sensitive to the range of values in the dataset andits metrics ndash r r2 and root mean square ndash do not provideinformation on the type of association (Bland and Altman2010) In this paper we used another metric the probabil-ity of agreement to determine bias and agreement betweenmodel estimates and observed data (see below)

While OLS is a well-known algorithm machine learningalgorithms are emerging techniques that focus on the datastructure and match that data onto models The EML ap-proach considers the different realizations of machine learn-ing models and constructs an ensemble of models com-ing with the advantage of being more accurate than thepredictions from the individual ensemble members How-ever EML is computationally intensive requiring nodeswith purpose-built hardware such as multiple processors orreduced-precision accelerators The nodes could be aggre-gated in computing clusters which require storage powerand cooling redundancy

A stack of EMLs was obtained with the H2O packageof R (Hall et al 2019) This package provides several al-gorithms that can contribute to a stack of ensembles usingthe AutoML (automatic machine learning) function feedfor-ward artificial neural network (DL) general linear models(GLMs) gradient-boosting machine (GBM) extreme gra-dient boosting (XGBoost) default distributed random for-est (DRF) and extremely randomized trees (XRT) AutoMLtrains two stacked-ensemble models one ensemble containsall the models and the second ensemble contains just thebest-performing model from each algorithm class or fam-ily both ensembles should produce better models than anyindividual model from the AutoML run The term AutoMLimplies data preprocessing normalization feature engineer-ing model selection hyperparameter optimization and pre-diction analysis including procedures to identify and dealwith nonindependent and identically distributed observationsand overfitting (Michailidis 2018 Truong et al 2019)

Machine learning has two elements for supervised learn-ing training loss and regularization The task of training triesto find the best parameters for the model while minimizingthe training loss function the mean squared error for exam-ple The regularization term controls the complexity of themodel helping to reduce overfitting Overfitting becomes ap-parent when the model performs accurately during the train-ing but the accuracy is low during the testing A good modelneeds extensive parameter tuning by running the algorithmmany times to explore the effect on regularization and cross-validation accuracy (Mitchell and Frank 2017) In this inves-tigation the function of training loss was the deviance whichis a generalization of the residual sum of squares driven bythe likelihood Deviance is a measure of model fit and lower

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

374 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

or negative values indicate better model performance (McEl-reath 2020)

A stack of EML solutions was based on a random sam-ple of the dataset for training the model For the Bernal site85 was used for training for US-SRM 80 was usedThe AutoML function was run 20 times each run added ap-proximately 48 models to the leader board and ranked thebest-performing models by their deviance Each run splitsthe training data 10 times for k-fold cross-validation Theseed for an EML that is dependent on randomization waschanged in every run The stopping rule for each run wasset at 100 s and the maximum memory allocation pool forH2O was 100 GB in a single workstation with dual Xeon2680 v4 processors and 128 GB of RAM The H2O packagewas installed in a rockergeospatial docker container (imageavailable at httpshubdockercomrrockergeospatial lastaccess 9 January 2021) which is a portable scalable and re-producible environment (Boettiger and Eddelbuettel 2017)

Two sets of predictions for the GPP at the Bernal site wereobtained from the stacked ensemble The first set of predic-tions was based on the 15 of the Bernal site data reservedfor testing The second set of predictions was obtained byrefeeding the US-SRM site model with the Bernal site ex-planatory variables The first set of predictions would showthe importance of local data to predict EC-based GPP Thesecond set of predictions would represent the suitability ofoff-site data to predict EC-based GPP If the second scenariohas good agreement then an EML model could be used torepresent wider areas of the ecosystem

28 Variable importance

The variable importance within individual models was usedto answer the question of which environmental variableswere important for GPP prediction For the ldquoall-models en-semblerdquo and ldquobest-of-family ensemblerdquo generated by Au-toML it is not possible to examine the variable importance orthe contribution of the individual models to the stack (H2Oai2017) Therefore a weight (wi)was calculated using Eq (4)which is adequate for other information criteria besides theAkaike weights this weight is an estimate of the conditionalprobability that the model will make the best predictions onnew data considering the set of models (McElreath 2020)Then the importance of each variable () was multipliedby the modelrsquos weight (wi) and then added by variable tobuild the variable importance index This index would mea-sure how often a given variable was used on the leader board

wi =exp

(minus

12 dWAICi

)summj=1 exp

(minus

12 dWAICi

) (4)

dWAIC= deviancei minus deviance of top-performing model(5)

29 Model agreement

Calibration and agreement between methods of measure-ment are different procedures Calibration compares knownquantities of the true value or measurements made bya highly accurate method (a gold standard) against themeasurements of a new or a contending method When twomethods of measurement are compared neither providesan unequivocally correct measurement because both havea measurement error and the true value remains unknown(Bland and Altman 2010) Stevens et al (2015) proposes theprobability of agreement (θs) as a plot metric to representthe agreement between two measurement systems acrossa range of plausible values The θs method addressessome of the challenges of the accepted ldquolimits agreementmethodrdquo presented by Bland and Altman (2010) Besides theagreement plot agreement is based on maximum likelihoodbias parameters α and β quantifying the fixed bias andthe proportional bias If α = 0 β = 1 and σ1= σ2 thenthe two measurements are identical σj is the measurementvariation The probability-of-agreement analysis was per-formed using the ProbAgreeAnalysis (httpsuwaterloocabusiness-and-industrial-statistics-research-groupsoftwarelast access 28 November 2020) in MATLAB 94 (Math-Works Inc) An arbitrary 1 gCmminus2 dminus1 was considered tobe a tolerable magnitude to conclude that agreement is suf-ficient enough to use either estimated GPP interchangeablyThe reference measurement was the GPP obtained fromEC data at the Bernal site and tested against the MODISMOD17 model the OLS predictions or each of the two setsof EML predictions If the probability-of-agreement plotsuggested disagreement between two measurement systemsthen the predictions can be adjusted using

Adjusted predictor= (predictorminusα)β (6)

3 Results

31 Eddy covariance fluxes at Bernal

Dominant flow at the Bernal site was from the northeast(Fig 2) Energy balance closure for this site had a slopeof 072 and r2

= 092 (Fig 5) Homogeneous sites of theFluxnet network obtain closure percentages higher than72 and for the Bernal site the vegetation heterogeneitywas important (see below) Average H was always nega-tive during nighttime but during some months of the dryand rainy season λE was positive particularly after dawn(Fig 6) so as to allow nighttime evaporation from the soil orvegetation However during some months of the dry seasonrainfall was low (Fig 7) then the positive λE suggested thatcacti could have an active gas exchange at that time

Carbon dioxide absorptions had a diurnal behavior begin-ning at dawn and ending before sunset (Fig 8) Nighttimeflux was positive indicating respiration notwithstanding the

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 375

Figure 5 Closure of the surface energy balance from eddy covari-ance measurements averaged at 30 min between the turbulent fluxes(H + λE) and available fluxes (Rn-G) Data are from 30 March2017 to 22 August 2018 at the Bernal site The regression wasy = 2302+072times with adjusted r2

= 092 The diagonal line rep-resents the 1 1 relation

presence of cacti Although summer rains are characteristicof the climate at the Bernal site (Fig 7) a negative NEE fluxoccurred in all measured months The lowest CO2 flux wasrecorded in January and February 2017 and in May 2018 (Ta-ble 1) This behavior resulted from the phenology of the veg-etation since most species lost their leaves in the dry seasonand was also due to the effect of low temperature Withinthe rainy season the flux of CO2 increased compared to themonths of January to June The correlation between NEE andprecipitation was minus045 When the sum of the precipitationof the current month and that of the previous month was con-sidered the correlation with NEE was minus07 suggesting thatcontinuous availability of soil moisture is important for theabsorption of CO2 in this environment

The scrub at Bernal was heterogeneous in botanical com-position A total of 24 species of cacti and shrub wereidentified on average each sampling plot had 103 speciesThe IVI was similar between all cacti (036plusmn 004) shrublegumes (038plusmn004) and other shrubs (023plusmn006) sampledMost sampling plots were in areas of high flux frequency(Fig 2) Cylindropuntia imbricata had the largest IVI fol-lowed by Acacia farnesiana Acacia schaffneri and Prosopislaevigata The IVI of the herbaceous stratum representedby grasses was not characterized due to the state of over-grazing and the absence of reproductive structures in plantswhich made measurement of their abundances frequenciesand dominance difficult The grass genera present were Meli-nis Chloris Cynodon and Cenchrus all corresponding to in-

Table 1 Daily average values of the net ecosystem exchange(NEE) gross primary productivity (GPP) and ecosystem respira-tion (Reco) in a scrub at the Bernal site Negative values of NEEindicate photosynthetic absorption

NEE GPP RecomicromolCO2 mminus2 sminus1

2017

JANFEBMARAPR minus054 248 194MAY 005 286 291JUN 038 421 459JUL minus100 178 077AUGSEPOCT minus126 525 399NOV minus013 265 252DEC 004 180 184

2018

JAN minus005 129 124FEB 006 188 194MAR minus094 245 151APR minus058 287 229MAY minus029 377 348JUN minus252 433 181JUL minus283 823 541AUG minus193 921 728

vasive C4 tropical grasses Scrub species of higher IVI hada similar LAI (12) although the magnitude of the LAI of Plaevigata stood out (Table 2)

32 Machine learning ensembles as predictors of eddycovariance GPP

In this section we describe the modeling with EML using lo-cal remotely sensed data from the Bernal site to predict GPPat the same site and then the agreement between EML GPPpredictions and EC-derived GPP The AutoML function gen-erated 1031 models with an average deviance of 135 whilethe deviance of the leader model was 063 in the trainingdataset (Table 3) A total of 11 models of type GBM andfive XGBoost models were in the top 30 models along withthe nine best-of-family ensembles and five all-models ensem-bles The weighted variable importance on the leader boardwas higher for the LAI from the MOD15 and MCD15 prod-ucts (17 and 14 ) The PsnNet EVI (MOD17) FPAR(MCD15) green atmospherically resistant vegetation index(GARI) and MODIS reflectance band 13 had an importancehigher than 3 (59 54 42 36 and 30 ) LAI(MCD13 and MOD13) PsnNet and the FPAR (MOD15)were the more important variables (20 17 13 and10 ) in the top nonstacked model a GBM model that wasranked in fourth place

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

376 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 6 Latent heat flux daily trend at Bernal during different months emphasizing nighttime λE Panel (a) shows months with low rainfallin the previous month and predominantly negative λE during nighttime (these months had low rainfall 017 113 and 033 mm rainfall forJanuary March and December) Panel (b) shows months with low rainfall in the previous month and positive λE after sunset (67 73 and207 mm rainfall for February April and May) Panel (c) shows months during the rainy season with positive λE mainly due to soil wetnessand antecedent rainfall of 43 202 9 190 and 34 mm for June July August October and November Plot (c) is out of scale on the y axis forcompatibility with the other plots

Table 2 Importance value index (IVI) and leaf area index (LAI) of the main species present at the Bernal Quereacutetaro study site

Species Plant type IVI SEMa LAI SEM

Coryphantha cornifera Cactus 007 027Bouvardia ternifolia Herb 007 027Karwinskia humboldtiana Shrub 007 027Forestiera phillyreoides Shrub 009 027Ferocactus latispinus Cactus 009 027Cylindropuntia leptocaulis Cactus 009 027Asphodelus fistulosus Shrub 009 019Brickellia veronicifolia Shrub 010 027Dalea lutea Shrub 011 015Eysenhardtia polystachya Legume 013 015Myrtillocactus geometrizans Cactus 014 027Schinus molle Shrub 014 019Jatropha dioica Herb 015 019Mammillaria uncinata Cactus 016 012Opuntia tomentosa Cactus 017 011Opuntia robusta Cactus 023 007Opuntia hyptiacantha Cactus 026 007Mimosa monancistra Legume 028 012Mimosa depauperata Legume 031 012Zaluzania augusta Shrub 033 010Viguiera linearis Herb 036 011Acacia schaffneri Legume 041 007 113 015Prosopis laevigata Legume 041 007 148 012Acacia farnesiana Legume 056 009 112 037Cylindropuntia imbricata Cactus 074 007 113 011

a SEM standard error of the mean

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 377

Figure 7 Monthly rainfall and land surface temperature (LST) dur-ing years (a) 2017 and (b) 2018 at the Bernal site The LST valuescorrespond to the 1330 (LST day) and 0130 (LST night) MODISAqua satellite overpasses

Predictions of GPP in the testing dataset showed disper-sion in the lower range of the scale of measurement andthe correlation was 094 (Fig 9a) The final prediction ofGPP for the whole dataset had a probability of agreement(θs) of 058plusmn 001 (parameter estimate and standard error)α =minus00616plusmn 011 and β = 10133plusmn 002 suggesting agood fit with low fixed and proportional bias (Fig 9b) Theprobability of agreement decreased slightly at the lower andupper range of the scale of measurement (Fig 9c) indicatingthat the EML model would predict GPP without increasingthe bias particularly in the range from 0 to 4 gCmminus2 dminus1However the value of θs should be higher than 095 so as toconsider EC measurements and EML to be interchangeableThe correlation of 099 (r2

= 098) for the data in Fig 9bcould be misleading as it would suggest a very good fit

Using only the five of the more important variables namedabove to generate an EML resulted in an XBoost leadermodel with 273 deviance and a total of 1094 models Thetop 30 models were 16 XBoost and GBM models and the

best-of-family ensembles started to show up in 12th placeAlthough the number of runs was the same (20) the AutoMLfunction increased the number of produced models but thesmaller set of explanatory variables constrained the ability toidentify features contributing to better models Using anotherset of five randomly selected explanatory variables (one veg-etation index and four MODIS bands) resulted in a leadermodel with 252 deviance out of 1024 models but this timethe leader was the best-of-the-family ensemble Using only afew variables was considered to increase the deviance com-pared to the average deviance of 135 obtained during thetraining phase and using all available variables

An important question for modeling upscaling is the ca-pacity to extrapolate results temporally and spatially herewe explored the latter posing the following question wouldpredictions of GPP from EML for an EC site agree withEC observations from another site with ldquosimilarrdquo environ-mental conditions First an EML solution was found train-ing 80 of the Santa Rita dataset and obtaining a best-of-family ensemble with 023 deviance out of 634 trainedmodels (hereafter this model is referred to as the Santa Ritamodel) Then the environmental and remotely sensed datafrom the Bernal site were fed into the Santa Rita model thiswould be an external validation dataset However agreementwas not good the mean value of θs was 015plusmn 001 withα =minus10822plusmn 009 and β = 058127plusmn 002 The value ofθs was not constant across the range of measurement and de-creased rapidly after 2 gCmminus2 dminus1 (Fig 10b) Because thebias was important predictions were adjusted using Eq (6)showing some improvement with r = 078 (Fig 10a) Com-paring Figs 5b and 6a it is evident that an EML modelextrapolation to other conditions is noisier ie Santa Ritamodel trying to represent the ecosystem function at BernalNotwithstanding some of the most important variables wereshared by both EML ensembles Bernal and Santa Rita in thecase of Santa Rita LAI from MOD15 MYD15 and MCD15had 350 48 and 31 of variable importance and theFPAR from MOD15 was 12

33 MODIS as predictor of eddy covariance GPP

MODIS is important because it overpasses every point ofthe earth every 1 or 2 d and it implements a GPP prod-uct (MOD17) that has helped track the response of the bio-sphere to the environment since 2000 The product MOD17has been validated against many EC sites but few validationsites correspond to deserts and semideserts (Running et al2004) The GPP MOD17 underestimated the GPP derivedfrom EC data at Bernal (Fig 11a) In a similar bell-shapeddistribution of θs as in the case of the extrapolation of theSanta Rita site (Fig 10b) here the θs was not constant acrossthe range of measurement mean θs was 024plusmn 013 withα = 000047plusmn0087 and β = 048749plusmn002 (Fig 11c) Ad-justing MOD17 estimates with Eq (6) improved the relation-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

378 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 8 Net ecosystem exchange (NEE) and photosynthetic active radiation (PAR) at the Bernal site in (a) 2017 and (b) 2018 Negativevalues in the CO2 flux indicate photosynthesis The gray shadow is the standard error of the mean for each month at any given hour

Table 3 Leader board of EML models for the Bernal site using 85 of day observations as training dataset NA denotes the outcome wherethe type of model was not present in the 30 top-performing models according to deviance

Type of model Number of models Average deviance

All models Top models Leader model

Stacked ensembleAll models 20 0852 0834Best of family 20 0812 0703 0633GBM 453 1366 0796DRF 20 1108 NAXGBoost 277 1385 0778XRT 20 1077 NADeep learning 201 2072 NAGLM 20 1474 NATotal 1031 1356

ship but note that the value of r (076) was the same for theoriginal MOD17 and the adjusted MOD17 (Fig 11b)

34 Prediction eddy covariance GPP withordinary-least-squares multiple regression

OLS is a common estimation method for linear models andhere this model appeared as adequate judging by the gen-eral distribution of predictions (Fig 12a) and the probability-of-agreement plot (Fig 12b) A total of 14 variables wereincluded in the model all of them with VIF values lowerthan 70 (Appendix C) the VIF statistic quantifies the sever-ity of multicollinearity and an acceptable threshold is 10The most significant variables were the EVI from MYD13and Daymet variables precipitation shortwave radiation and

minimum and maximum temperatures (see Appendix B forvariable details) Variables with high coefficient values wereMODIS reflectance band 14 (917) the EVI from MYD13(853) and the NDVI (39) while Daymet temperatures hadsmall coefficientsminus023 for maximum temperature and 017for minimum temperature The θs decreased slightly at theend of the measurement range mean θs was 05plusmn0014 withα = 018845plusmn0137 and β = 094966plusmn0031No correctionfor this model was calculated since α was close to 0 β wasclose to 1 and the EML model had a higher θs

35 Agreement

The probability of agreement (θs) was the statistic to deter-mine if the mean responses were in agreement Comparing

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 379

Figure 9 Agreement between predictions of GPP obtained with machine learning algorithms or derived from eddy covariance measurementsat the Bernal site (a) Example of one run of predictions of GPP in the test dataset from the Bernal site using the leader model of an ensembleof machine learning algorithms (EML) the test dataset was 15 of data with r = 094 (b) Predictions for the complete dataset withr = 099 the diagonal line is the 1 1 agreement (c) Function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreementbetween methods of estimation of GPP corresponding to plot (b) The horizontal axis (s) represents the magnitude of the measurement thevertical axis is the probability of agreement for the measurement and the red line is the confidence interval p lt 005

Figure 10 (a) Adjusted predictions of GPP for the complete dataset from the Bernal site using the leader model of the final ensembleof machine learning algorithms (EML) derived from the Santa Rita site compared to estimates of GPP from EC data with r = 078 (b)Respective function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreement between methods of estimation of GPP ECdata from the Bernal site and EML model for the Santa Rita site The horizontal axis (s) represents the magnitude of the measurement thevertical axis is the probability of agreement for the measurement and the red line is the confidence interval p lt 005

the confidence intervals (CIs) for θs the best modeling ap-proach was the EML because its CI (056ndash059) was differentfrom that of the OLS model (047ndash052) More importantlyfor the EML and OLS models the values of θs had little vari-ation across the range of GPP estimates However the CIsof the EML and OLS models were similar in their bias es-timates α and β and both models had no bias (p lt 005)Altogether the best result was the EML ensemble using en-vironmental and remotely sensed data corresponding to thesame site ie Bernal (Fig 10c) This kind of EML wouldbe useful for gap filling or the evaluation of GPP time se-ries of the site that generated the model Machine learningalgorithms can fill gaps longer than 30 d (Kang et al 2019)

The second option to estimate GPP using the samedatasets was the multiple-regression OLS model (Fig 12a)The multiple regression is straightforward and here multi-collinearity was not a problem The EML ensemble and theOLS regression have the highest values of θs (058 and 05respectively Fig 12b) Higher values for θs are desirable(gt 095) and this could be achieved by increasing the samplesize relaxing the tolerable magnitude for agreement (here itwas set at 1 gCmminus2 dminus1) or perhaps using different forcingvariables

The MODIS estimates were a third-best alternative sincethe mean θs was 024 In the present study we used a splineto fill the data to a daily time step since the MCD17 is an8 d composite product but a similar result was obtained if

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

380 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 11 (a) MOD17 GPP daily time step from the Bernal site versus estimates of GPP from EC data using the complete dataset at adaily time step derived from spline for MOD17 GPP (copy) with r = 076 or 8 d composite estimates obtained by 8 d averages of EC GPP() with r = 076 (b) Adjusted predictions of GPP from MODIS (c) Corresponding function of probability of agreement for Fig 7a using1 g Cmminus2 dminus1 as a tolerable agreement between methods of estimation of GPP EC data from the Bernal site and MODIS MOD17 Thehorizontal axis (s) represents the magnitude of the measurement The vertical axis is the probability of agreement for the measurement

Figure 12 (a) Ordinary-least-squares multiple-regression estimates of GPP for the complete dataset from the Bernal site versus estimates ofGPP from EC data with r = 082 (b) Respective function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreement betweenmethods of estimation of GPP EC data from the Bernal site and OLS multiple regression The horizontal axis (s) represents the magnitudeof the measurement The vertical axis is the probability of agreement for the measurement

the EC GPP was rescaled and compared to the original 8 dMCD17 data (Fig 11a) The GPP from MODIS was an un-derestimate of EC GPP and when the estimates were ad-justed (Eq 6 Fig 11b) the performance was not better thanthe OLS (model not shown) The MODIS land cover classi-fication represented this site as grassland (MCD12) and thiscould be another reason for the poor agreement besides theassumptions made in the MODIS algorithm regarding the εand εmax parameters and the response of vegetation to VPDAgreement of MODIS GPP is crucial because MODIS prod-ucts are frequently used in country-wide assessments of thecarbon cycle and can influence public policies

The model with least agreement resulted when the EMLensemble generated from the Santa Rita site was used to pre-dict GPP at Bernal Machine learning models can make pre-dictions but their usefulness decreases when they are usedoutside the context of where they were built while process-

based mechanistic models have this ability Although the Au-toML function in H2O is designed to protect against over-fitting using cross-validation runs (Michailidis 2018) in ourstudy the Santa Rita model could not be generalized to repre-sent the Bernal site GPP process probably because the vari-ables and features selected for Bernal or Santa Rita were dif-ferent during the AutoML workflow

The Santa Rita model was good at predicting GPP at thatsite with a deviance of the leader model of 023 while atBernal the deviance of the leader model was 063 (Table 3)indicating that the Santa Rita EML ensemble was at leastas good a model as the EML at Bernal (not shown) TheGPP time series for Santa Rita was about 4 times the sizeof the Bernal dataset and therefore the deviance was lowerHowever when the Santa Rita model was used with Bernaldata the mean θs was 016 indicating that the agreement wasinsufficient Eyeballing the predictions in Figs 9a 10b and

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 381

11a and their corresponding correlation values (078 076and 082 for adjusted EML Santa Rita adjusted MODIS andmultiple-regression OLS) it could be argued that these mod-els were comparable However their θs plots present a dif-ferent perspective

4 Discussion

41 Model agreement

The Santa Rita model had higher θs when it extrapolatedfor low GPP values for the Bernal site suggesting that theSanta Rita model had a better skill when predicting GPPclose to 0 and even negative GPP values (Fig 10b) Al-though the Bernal and Santa Rita sites had similar vegeta-tion and climate classifications they are more than 1600 kmapart and rainfall monthly distribution is different The GPPseasonal cycle at Bernal started in February and steadily in-creased to a maximum during July and August (Table 1)At Santa Rita the GPP was low from January until mid-July (lt 05 gCO2 mminus2 dminus1) and then increased sharply to amaximum level in mid-August (Joiner and Yoshida 2020)What matters eventually for machine learning methods ishow well the predictor space rather than geographic spaceis sampled (Jung et al 2020) Incorporating data from morehumid (semiarid) sites could improve the GPP predictions bya machine learning method Not only more EC sites but alsosites representing a water availability gradient would be im-portant for semiarid ecosystems representing long spells andthe influence of oceanic oscillations and monsoon rains

All the models presented here used transient data to rep-resent GPP specifically at the 1 d time step Besides theradiation-related variables two sources of rainfall were usedas forcing variables but evapotranspiration (ET) was notused The MOD16A2 version 6 is available as an 8 d gap-filled product and could be included in EML or OLS mod-els However a more general representation of the carboncycling could be achieved when including variables that rep-resent annual or seasonal timescales of soil water evapo-ration or precipitation (Scott and Biederman 2019) Scottet al (2015) suggest that real lags between precipitation andproductivity that may impart legacy effects may also be par-tially masked by using ET as ET more carefully tracks pro-ductivity when soil moisture storage is accessed The occur-rence of off-season rainfall dry spells and carryover effectscould be parameterized as windowed events of a given dura-tion A window would be a period with distinct time bound-aries the window allows the grouping of records with similarfeatures The effect of the window at a given time could berepresented as moving weights as the point in time in ques-tion is closer or farther from the window

In our study the metric to assess model agreement wasthe probability of agreement (θs) and their bias parame-ters Many other metrics can be used to evaluate model per-

formance such as the root mean square error r r2 or themodel efficiency factor (MEF) presented by Nash and Sut-cliffe (1970) In particular the MEF is a step in the Fluxnetdata processing pipeline (Pastorello et al 2020) A MEFvalue close to 1 represents a high correlation and lower bi-ases (Joiner and Yoshida 2020) Therefore the calculatedMEF for the Bernal site EML (098) would suggest very goodpredictive performance while the θs of 06 for this model in-dicates a more modest performance The θs for a particularvalue of the measurand is the probability that the differencebetween two measurements made by different systems fallswithin an interval that is deemed to be acceptable (Stevenset al 2015) In the present study we used 1 gCmminus2 dminus1 as acritical value defining an acceptable difference If this valueis smaller then the probability of agreement would decreasefor this same model In such a case it would be less likelythat the predictors agree considering that the θs takes intoaccount both the difference in the function means at a givenvalue of the measurand and the uncertainty in its estimation

42 MODIS discrepancies

Different authors have reported discrepancies betweenMCD17 and EC estimates of GPP in semiarid regions Ex-amining MODIS discrepancies in these ecosystems is im-portant because the errors induced by cloud cover are ex-pected to be minimal and other effects can be identified(Gebremichael and Barros 2006) The GPP of MOD17 didnot relate well (EC= 011+ 017MODIS r2

= 067) withestimates of EC GPP in semidesert vegetation of the Sahel(Tagesson et al 2017) With data from different types of veg-etation in the Heihe basin in China MODIS17 overestimatedthe GPP from EC (EC= 115+024MODIS r2

= 068 Cuiet al 2016) For scrub sites in Mexico the relation be-tween GPP calculated from EC and MOD17 was not good(MODIS= 38382+ 0467EC r2

= 06 Delgado-Balbuenaet al 2018) In arid and semiarid ecosystems in China op-timizing parameters of the MODIS GPP model with site-specific data improved the estimate to explain 91 of thevariation in the GPP of the data observed by EC (Wang et al2019) These same authors propose improving the land useclassification used by the MOD17 algorithm and recalibrat-ing light use efficiency parameters to solve the GPP estima-tion problem Gebremichael and Barros (2006) examined anopen shrubland site in a semiarid region of Sonora Mexicoand their analysis of the temporal evolution of the discrep-ancies with MODIS GPP suggested revisiting the light useefficiency parameterization especially the functional depen-dence on VPD and PAR and water stress or soil moistureavailability

The relationship between the GPP MODIS and the GPPEC presented in Sect 33 is an approximation because theuncertainty in the respiration component must be consid-ered The empirical relationship between nocturnal NEE andsoil temperature has been used to represent ecosystem res-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

382 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

piration (Reco) in order to separate the processes that con-tribute to daytime NEE (Richardson and Hollinger 2005Wofsy et al 1993) Nighttime NEE should be equal to therates of autotrophic and heterotrophic respiration while dur-ing daytime NEE should be equal to the combined rates ofcarboxylation and oxidation of Rubisco autotrophic respi-ration and heterotrophic respiration Then the GPP can becalculated as the difference between daytime NEE and Recoestimated through its relationship with temperature (Gouldenet al 1996) In the present study Reco was calculated basedon soil and air temperature following the procedure of Reich-stein et al (2005) implemented in REddyProc (Wutzler et al2018) Although it is possible to measure or model the parti-tion of respiration (Running et al 2004 Wang et al 2018)the presence of cacti complicates the calculation assum-ing that all nighttime flux represents ecosystem respiration(Owen et al 2016 Richardson and Hollinger 2005) Whilesoil respiration tends to be temperature-limited when soilmoisture is nonlimiting in temperate ecosystems in range-land ecosystems the controls of soil CO2 efflux were photo-synthesis soil temperature and moisture (Roby et al 2019)In our study the instrumentation did not include measure-ments of plant or soil respiration partition to validate theRecoestimates

A problem regarding data comparison from remote orbitalsensors and terrestrial observations is that different quanti-ties are fundamentally measured MODIS measures the radi-ation reflected by the earthrsquos surface in two spectral bands at250 m spatial resolution per pixel five bands at 500 m and29 bands at 1000 m The EC technique has a footprint tomeasure CO2 that varies dynamically in shape and size butis generally considered to be 1 km2 in this study the foot-print radius was 600 m To solve the scaling MODIS prod-ucts related to the carbon cycle have been validated with theEC technique and biometric measurements on several spa-tial scales using process-based ecosystem models and char-acterizing areas up to 47 km2 around the EC measuring tower(Cohen et al 2003)

At Bernal the vegetation was heterogeneous This situa-tion was represented by the heterogeneity in vegetation ac-tivity in the 4 pixels used spanning about 02 units of NDVIat the peak of the season activity (Fig 1) The standard errorof the IVI differed by 1 order of magnitude among speciesAlthough more important species had a lower standard errortheir means were similar indicating that they were equallyabundant at all sampling plots The higher importance ofsome species (Table 2) was explained by selective grazingndashbrowsing behavior and the dispersion caused by cattle by ei-ther ingesting or transporting seeds or plant parts (Belaynehand Tessema 2017) Regarding landscape heterogeneity thetower fetch was predominantly from the northeast capturingthe less heterogeneous area of the site but also the more ac-tive according to the NDVI (Fig 1)

The thorny scrub examined had two vegetation layersthe overstory layer mainly consisting of mesquite acacia

and cacti and the understory layer that included grasses andherbs Cattle preferentially graze the understory and becausethey eat using their tongue they will avoid browsing thornyspecies unlike goats or deer which use their lips Withoutgrazing management overtime the competition balance willfavor bush species resulting in encroaching and the under-story will be stressed only unpalatable species or those withtheir growing meristems very close to the ground would sur-vive Representing the structure and functioning of these twolayers using MODIS is possible (Liu et al 2017) RecentlyHill and Guerschman (2020) presented a MODIS product de-rived from MCD43A4 to estimate the fractions of photosyn-thetic and nonphotosynthetic vegetation and the remainingfraction of bare soil These developments could improve theMOD17 GPP estimates since its model represents a homoge-neous single vegetation layer All these considerations helpto understand the low θs between MOD17 estimates of GPPand EC-derived GPP

43 Forcing variables and machine learning

Forcing variables in this study were gridded meteorologi-cal data and MODIS spectral bands and products but thevariables mostly included in the ML algorithms were LAIand FPAR accounting for 364 and 549 of the vari-able importance index at Bernal and Santa Rita respectivelyin their leader board This result supported the view thatCO2 fluxes can be represented by ML algorithms exclu-sively using remotely sensed data (Tramontana et al 2016Joiner and Yoshida 2020) Using neural networks Joiner andYoshida (2020) showed that with only satellite reflectancefrom MODIS and top-of-atmosphere PAR it is possible tocapture a large fraction of GPP variability They argue thatvegetation indices may reduce the information content ofthe underlying reflectance when compressing the informa-tion from two or more bands into a single index In our studyreflectance bands were not often included in the ML modelsThis was most likely because MOD09GA and MODOCGAprovide estimates of surface reflectance uncorrected for theillumination and viewing geometry while the MCD43 usedin Joiner and Yoshida (2020) uses a surface bidirectional re-flectance distribution function (BRDF) model that improvesthe quality of surface albedo retrievals By propagating theBRDF correction in the MODIS processing pipeline the veg-etation indices would more likely relate better to the GPPFuture modeling efforts should use benchmark scenarios ac-cording to sets of forcing variables already identified as use-ful

44 Carbon flux

Although the carbon balance in ecosystems is influenced bydifferent factors such as soil type and amount of nutrientsthe relationship with soil temperature and humidity is partic-ularly strong (Anderson-Teixeira et al 2011 Hastings et al

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 383

2005) How much of the rainwater the system can retain orlose has been described as the leakiness of the system (Guer-schman et al 2009) More than immediate incident rainfallthe available soil moisture and its redistribution are impor-tant in semiarid ecosystems including steam flow prefer-ential flow paths hydraulic lift and others (Barron-Gaffordet al 2017) At Bernal when the sum of the precipitation ofthe current month and that of the previous month was con-sidered the correlation with NEE was minus07 suggesting thatcontinuous availability of soil moisture is important for theabsorption of CO2 in this environment This result is con-sistent with other studies in which the relationship betweenthe net productivity of the ecosystem (NEP) and precipita-tion is initially positive but is leveled from 1000 to 1500 mmannually (Xu et al 2014) The hydraulic redistribution ofwater from moist (deeper) to drier soils through plant rootstended to increase modeled annual ecosystem uptake of CO2this process was identified at US-SMR (Fu et al 2018 Scottet al 2003)

The leakiness is highly dependent upon vegetation frac-tional cover the proportion of the surface occupied by baresoil and vegetation photosynthetically active vegetation andnonphotosynthetically active vegetation such as litter woodand dead biomass (Guerschman et al 2009) All these vege-tation fractions have a water storage capacity and can reducethe amount of effective rainfall available to plant roots It ispossible that the canopy of the bushes completely interceptedthe rainfall in some months because the scrub can interceptup to 20 of the precipitation and its canopy storage ca-pacity is 097 mm (Mastachi-Loza et al 2010) Consideringonly the daily rain events greater than 5 mm the correlationbetween precipitation and NEE rose to minus072 In the presentstudy the interception of rain by vegetation surfaces was notcalculated but the results suggest that it would be impor-tant to explore the relationship between net precipitation andNEE

The average NEE at a global level is minus156plusmn284 gCmminus2 yrminus1 (Baldocchi 2014) The highest frequencyamong sites that measured NEE with EC occurs from minus200to minus300 gCmminus2 yrminus1 but in sites with biometric measure-ments the peak occurs at minus100 gCmminus2 yrminus1 (Xu et al2014) Using the daily averages of Table 2 the averageNEE during the measurement period was minus078 gCmminus2 dminus1

and annually would be minus2835 gCmminus2 yrminus1 This resultwas higher than the annual values of the induced grass-land and scrubland vegetation characterizing the SonoranDesert plains (138 and 130 gCmminus2 yrminus1 Hinojo-Hinojoet al 2019) In New Mexico NEE values measured withEC are between 35ndash50 gCmminus2 yrminus1 in desert grassland and344ndash355 gCmminus2 yrminus1 in mixed coniferous forest (Anderson-Teixeira et al 2011) In a dryer region the sarcocaulescentscrubland of Baja California in Mexico the NEE was minus39and minus52 gCmminus2 yrminus1 in 2002 and 2003 respectively (Hast-ings et al 2005) The NEE measured here was within therange of NEE 03plusmn 02 kgCmminus2 yrminus1 for grasslands and

shrublands in Mexico (Murray-Tortarolo et al 2016) Al-though the measurements of the present study had gaps andwere compared with annual studies we considered that thereported value of C was representative of the main season ofgrowth of this type of scrub

45 Ecosystem management

Overgrazing is an appreciation relative to the grazing produc-tive system where the forage resource is overused in a mixedshrubndashgrass ecosystem such as Bernal it usually refers tothe understory Overgrazing means that the plant regrowth isreadily grazed tillers and root reserves are lost and eventu-ally the plant may die Although the Bernal site was over-grazed the carbon fluxes indicated that the plant communitywas photosynthetically active in both dry and rainy seasonsIt is fair to assume that water in the soil was not limiting forthe deep-rooted bush species and that was the reason whyit was possible to maintain the photosynthetic function dur-ing rainless months However this primary production wouldnot have tangible benefits for ranchersrsquo production systemsince no edible biomass would be produced for the cattleFrom the point of view of carbon capture the system accu-mulated nonlabile biomass that would remain in the systemfor a longer time compared to a grassland ecosystem (al-though it would be necessary to determine the partition ofsaid shrub biomass) However the overgrazing condition af-fects the biomass of the understory roots and consequentlythe carbon pool in the soil

In the short term it can be thought that the estimated neg-ative carbon flows had a favorable effect on the environmen-tal agenda As time passes it is possible that the gaps be-tween the individual shrubs of the overstory expand and thiswould have had an effect on soil erosion It is also possiblethat the water stored in the soil profile used by the bushesgradually decreases to the point of causing drought changesin phenology and advancements of the desertification pro-cess There are many opportunities for ecology conservationand livestock-oriented management these may include con-trolled grazing or propagating native thornless shrub speciesIf ranchers do not identify a benefit in the vegetation thenthey will be tempted to remove it as it occurred at thestudy site Because of its wide coverage and availability theMODIS GPP product accuracy is important in representingthe carbon cycle raising awareness and monitoring advance-ment of environmental decisions

Although we found that EML was a good option for mod-eling the GPP of a site what is really needed to evaluate theperformance of semiarid ecosystems is a spatial representa-tion of the carbon flux This is a problem for an underrepre-sented area regarding instrumented EC towers However theEML could be designed to take into account the explanatoryvariables in a spatiotemporal continuity

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

384 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

5 Conclusions

The best modeling approach was the ensemble of ma-chine learning The second option to estimate GPP was themultiple-regression OLS and the third alternative were theMODIS estimates Machine learning was a good option topredict GPP in the local context where it was generated oth-erwise its performance was not good This was demonstratedwhen using the EML from Santa Rita and trying to predictfluxes at Bernal Nevertheless a machine learning modelwould be useful for gap filling or the evaluation of GPP atthe same site The GPP estimates of a given model can be ad-justed using the bias parameters of the probability of agree-ment to improve the relationship Further research explor-ing the usefulness of EML to predict GPP in other contextscould use EML to calibrate mechanistic model parametershybrid approaches integrating EML and biologically mean-ingful mechanistic models or EML representing the spatialvariability of the landscape

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 385

Appendix A

Table A1 Threshold of velocity friction (ulowast) above which nighttime fluxes were considered valid Estimates were obtained using the Rpackage REddyProc We used only the flux records with ulowast equal to or higher than the corresponding mean ulowast threshold for each yearBound values are a 95 confidence interval NA not applicable

Aggregation method Year Season Mean Lower bound 5 Upper bound 95

ulowast (msminus1)

Single NA NA 0193 0140 0209Year 2017 NA 0194 0128 0227Year 2018 NA 0193 0132 0225Season 2017 2016012 0194 0128 0227Season 2017 2017003 0215 0141 0297Season 2017 2017006 0194 0128 0227Season 2017 2017009 0181 0144 0208Season 2018 2017012 0149 0116 0232Season 2018 2018003 0271 0134 0297Season 2018 2018006 0193 0132 0221

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

386 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix B

Table B1 MODIS reflectance bands and products database

Product name Satellite layer Spatial Temporal Spectralresolution resolution coverage

(d) (nm)

MOD09GA Band 1 500 m 1 620ndash670Band 2 841ndash876Band 3 459ndash479Band 4 545ndash565Band 5 1230ndash1250Band 6 1628ndash1652Band 7 2105ndash2155

MODOCGA Band 8 1 km 405ndash420Band 9 438ndash448Band 10 483ndash493Band 11 526ndash536Band 12 546ndash556Band 13 662ndash672Band 14 673ndash683

MOD17A2H Gross primary production (GPP) 8

Net photosynthesis (PsnNet)

MOD15A2H Fraction of photosynthetically active radiation (Fpar) na

Leaf area index (LAI)

MOD11A2 Land surface temperature and emissivity of day (LST day) 1 km

Land surface temperature and emissivity of night (LST night)

MYD13A1 Enhanced vegetation index (EVI) 500 m 16

MYD15A2H Fraction of photosynthetically active radiation (Fpar) 8

Leaf area index (LAI)

MYD11A2 Land surface temperature and emissivity of day (LST day) 1 km na

Land surface temperature and emissivity of night (LST night)

MCD15A2H Fraction of photosynthetically active radiation (Fpar) 500 m

Leaf area index (LAI)

na not applicable

Table B2 Daily vegetation indexes computed using the MODIS reflectance bands described in Table 1

Index Formula Reference

Simple ratio (SimpleR) SimpleR= Band 2Band 1

(Hill et al 2006)Moisture stress (MoistS) MoistS= Band 6

Band 2

Disease stress index (DSI) DSI= Band 2+Band 4Band 6+Band 1

Redndashgreen ratio index (RGRI) RGRI= Band 1Band 4

Normalized difference vegetation index (NDVI) NDVI= Band 2minusBand 1Band 2+Band 1

Normalized difference water index (NDVI_w) NDVI_w =Band 2minusBand 5Band 2+Band 5

Green leaf index (GLI) GLI= 2lowastBand 11minusBand 14minusBand 92lowastBand 11+Band 14+Band

(Henrich et al 2012)Green atmospherically resistant vegetation (GARI) GARI= Band 5minus(Band 11minus(Band 9minusBand 14))

Band 5minus(Band 11+(Band 9minusBand 14))

Enhanced vegetation index (EVI) EVI= 25lowast Band 2minusBand 1Band 2+(6lowastBand 1)minus(75lowastBand 9)+1

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 387

Table B3 Daymet meteorological database

Variable Spatial resolution Temporal resolution Reference

Precipitation (Dayprc)

1 km Daily (Thornton et al 2017)Shortwave radiation (Daysrad)Maximum air temperature (DayTmax)Minimum air temperature (DayTmin)Water vapor pressure (Dayvp)

Table B4 Precipitation data

Satellite Product name Spatial resolution Temporal resolution

Global Precipitation Measurement (GPM) 3IMERGDF v006 010Daily

Tropical Rainfall Measuring Mission (TRMM) 3B42 v007 025

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

388 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix C

Table C1 Analysis of variance for GGP derived from EC data atthe Bernal site Variable details are described in Appendix B

Model term Coefficient SEa DFb SSc F p VIFd

Regression 14 178048Constant minus1318 648 minus203 004R_08_40542e

minus212 060 1 2680 minus352 0 220R_13_66267e

minus286 123 1 1161 minus232 002 155R_14_67368e 917 259 1 2706 353 0 122RGRI6

minus083 037 1 1100 minus225 003 143GARIf 201 108 1 755 187 006 237EVIMYDg 853 201 1 3903 425 0 294NDVI_wf 390 155 1 1367 251 001 271PsnNetg 0013 0 1 7530 590 0 383LstNgtMYDg 005 002 1 818 194 005 201Dayprch 014 003 1 3590 407 0 192Daysradh 001 0 1 5121 486 0 501DayTmaxh

minus024 006 1 3658 minus411 0 689DayTminh 018 005 1 3222 386 0 496TRMMi minus004 002 1 1087 minus224 003 141Error 403 87297Total 417 265345

a Standard error of coefficient b Degrees of freedom c Adjusted sum of squares d Variance inflation factore Variable name denotes the band number and spectral bandwidth of MODIS (Moderate Resolution ImagingSpectroradiometer) f Vegetation indices RGRI is redndashgreen ratio index and GARI is green atmosphericallyresistant vegetation index details of formula are described in Appendix B g Layers from MODIS productsEVIMYD is enhanced vegetation index from MYD13 PsnNet is photosynthesis product form MOD17LstNgtMYD is nighttime land surface temperature emissivity from MYD11 h Variables obtained fromDaymet daily dataset DayTmin is minimum temperature DayTmax is maximum temperature Daysrad isshortwave radiation Dayprc is precipitation i Daily rainfall rate from 3B42 TRMM (Tropical RainfallMeasuring Mission)

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 389

Code and data availability Database and programming code areavailable at httpsdoiorg105281zenodo3598595 (Guevara-Escobar et al 2020)

Author contributions AGE contributed to conceptualization for-mal analysis investigation methodology software supervisionvalidation visualization and writing of the original draft as wellas review and editing EGS was in charge of funding acquisitionproject administration and writing and review MCJ contributed toinvestigation methodology validation visualization and writing ofthe original draft as well as review editing and communicationMEQB helped with investigation validation writing and reviewHSA contributed with conceptualization writing and review ICAacutehelped with investigation writing and review VHCS contributedwith writing review and funding acquisition

Competing interests The authors declare that they have no conflictof interest

Acknowledgement We thank NASA and Ameriflux for data avail-ability Data filtering was carried out by the cluster Abacus ICentre of Applied Mathematics and High Performance ComputingABACUS-CINVESTAV

Financial support This research has been supported by CONA-CyT ndash SEMARNAT (grant no 2014-1-(249407))

Review statement This paper was edited by Paul Stoy and re-viewed by three anonymous referees

References

Aguirre-Diacuteaz G J Aguilloacuten-Robles A Tristaacuten-Gonzaacutelez MLabarthe-Hernaacutendez G Loacutepez-Martiacutenez M Bellon Hand Nieto-Obregoacuten J Geologic setting of the Pentildeade Bernal Natural Monument Quereacutetaro Meacutexico Anendogenous volcanic dome Geosphere 9 557ndash571httpsdoiorg101130GES008431 2013

Anav A Friedlingstein P Beer C Ciais P Harper AJones C Murray-Tortarolo G Papale D Parazoo N Cand Peylin P Spatiotemporal patterns of terrestrial grossprimary production A review Rev Geophys 53 785ndash818httpsdoiorg1010022015RG000483 2015

Anderson-Teixeira K J Delong J P Fox A M Brese D Aand Litvak M E Differential responses of production and res-piration to temperature and moisture drive the carbon balanceacross a climatic gradient in New Mexico Glob Change Biol17 410ndash424 httpsdoiorg101111j1365-2486201002269x2011

Baldocchi D Measuring fluxes of trace gases and energy be-tween ecosystems and the atmospherendashthe state and future of the

eddy covariance method Glob Change Biol 20 3600ndash3609httpsdoiorg101111gcb12649 2014

Barron-Gafford G A Sanchez-Cantildeete E P Minor R LHendryx S M Lee E Sutter L F Tran N Parra EColella T Murphy P C Hamerlynck E P Kumar P andScott R L Impacts of hydraulic redistribution on grassndashtreecompetition vs facilitation in a semi-arid savanna New Phytol215 1451ndash1461 httpsdoiorg101111nph14693 2017

Belayneh A and Tessema Z K Mechanisms of bush encroach-ment and its inter-connection with rangeland degradation insemi-arid African ecosystems a review J Arid Land 9 299ndash312 httpsdoiorg101007s40333-016-0023-x 2017

Bland J M and Altman D G Statistical methods for assessingagreement between two methods of clinical measurement Int JNurs Stud 47 931ndash936 2010

Boettiger C and Eddelbuettel D An introduction to rockerDocker containers 639 for r ArXiv Prepr ArXiv171003675 6402017

Bonilla-Moheno M and Aide T M Beyond deforestationLand cover transitions in Mexico Agric Syst 178 102734httpsdoiorg101016jagsy2019102734 2020

Booker K Huntsinger L Bartolome J W Sayre N Fand Stewart W What can ecological science tell us aboutopportunities for carbon sequestration on arid rangelands inthe United States Glob Environ Change 23 240ndash251httpsdoiorg101016jgloenvcha201210001 2013

Butzer K W and Butzer E K The lsquonaturalrsquovegetation ofthe Mexican Bajiacuteo archival documentation of a 16th-century savanna environment Quat Int 43 161ndash172httpsdoiorg101016S1040-6182(97)00032-3 1997

CICESE C de I C y de E S de E Base de datos climatoloacutegicanacional (CLICOM) Villa Bernal Quereacutetaro available at httpclicom-mexcicesemx (last access 28 August 2019) 2015

Cohen W B Maiersperger T K Yang Z Gower S TTurner D P Ritts W D Berterretche M and Run-ning S W Comparisons of land cover and LAI esti-mates derived from ETM+ and MODIS for four sites inNorth America A quality assessment of 20002001 provi-sional MODIS products Remote Sens Environ 88 233ndash255httpsdoiorg101016jrse200306006 2003

Cui T Wang Y Sun R Qiao C Fan W Jiang GHao L and Zhang L Estimating vegetation primaryproduction in the Heihe River Basin of China withmulti-source and multi-scale data PloS One 11 1ndash20httpsdoiorg101371journalpone0153971 2016

Curtis J T and McIntosh R P The Interrelations of Certain An-alytic and Synthetic Phytosociological Characters Ecology 31434ndash455 httpsdoiorg1023071931497 1950

Delgado-Balbuena J Yepez E A Aacutengeles-Peacuterez G Aguirre-Gutierrez C Arredondo T Ayala-Nintildeo F Bullock S HCastellanos A E Cueva A Figueroa-Espinoza B Garatuza-Payaacuten J Hinojo-Hinojo C Maya-Delgado Y Meacutendez-Barroso L Oechel W Paz-Pellat F Peacuterez-Ruiacutez E R Ro-driacuteguez J C Rojas-Robles N Saacutenchez-Mejiacutea Z M Uuh-Sonda J Vargas R Verduzco V S Vivoni E R andWatts C Flujos anuales de carbono en ecosistemas terrestres deMeacutexico in Estado Actual del Conocimiento del Ciclo del Car-bono y sus Interacciones en Meacutexico Siacutentesis a 2018 p 686Texcoco Estado de Meacutexico Meacutexico 2018

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

390 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Eshel G Dayalu A Wofsy S C Munger J W andTziperman E Listening to the Forest An ArtificialNeural Network-Based Model of Carbon Uptake at Har-vard Forest J Geophys Res-Biogeo 124 461ndash478httpsdoiorg1010292018JG004791 2019

Fu C Wang G Bible K Goulden M L Saleska S RScott R L and Cardon Z G Hydraulic redistributionaffects modeled carbon cycling via soil microbial activityand suppressed fire Glob Change Biol 24 3472ndash3485httpsdoiorg101111gcb14164 2018

Gebremichael M and Barros A P Evaluation of MODISGross Primary Productivity (GPP) in tropical mon-soon regions Remote Sens Environ 100 150ndash166httpsdoiorg101016jrse200510009 2006

Gitelson A A Peng Y Arkebauer T J and Suyker A EProductivity absorbed photosynthetically active radiation andlight use efficiency in crops Implications for remote sensingof crop primary production J Plant Physiol 177 100ndash109httpsdoiorg101016jjplph201412015 2015

Glenn E P Scott R L Nguyen U and Nagler P L Wide-area ratios of evapotranspiration to precipitation in monsoon-dependent semiarid vegetation communities J Arid Envi-ron 117 84ndash95 httpsdoiorg101016jjaridenv2015020102015

Goldstein A Turner W R Spawn S A Anderson-Teixeira K J Cook-Patton S Fargione J Gibbs H KGriscom B Hewson J H and Howard J F Protecting ir-recoverable carbon in Earthrsquos ecosystems Nat Clim Change10 1ndash9 httpsdoiorg101038s41558-020-0738-8 2020

Goulden M L Munger J W Fan S Daube B Cand Wofsy S C Measurements of carbon sequestra-tion by long-term eddy covariance Methods and a criti-cal evaluation of accuracy Glob Change Biol 2 169ndash182httpsdoiorg101111j1365-24861996tb00070x 1996

Grossiord C Buckley T N Cernusak L A Novick K A Poul-ter B Siegwolf R T Sperry J S and McDowell N GPlant responses to rising vapor pressure deficit New Phytol226 1550ndash1566 2020

Guerschman J P Hill M J Renzullo L J Barrett D JMarks A S and Botha E J Estimating fractional coverof photosynthetic vegetation non-photosynthetic vegetation andbare soil in the Australian tropical savanna region upscaling theEO-1 Hyperion and MODIS sensors Remote Sens Environ113 928ndash945 httpsdoiorg101016jrse200901006 2009

Guevara-Escobar A Gonzaacutelez-Sosa E Cervantes-Jimeacutenez MSuzaacuten-Azpiri H Queijeiro-Bolantildeos M E Carrillo-AacutengelesI and Cambroacuten-Sandoval V H Machine learning estimates ofeddy covariance carbon flux in a scrub in the Mexican highlandZenodo httpsdoiorg105281zenodo3598595 2020

Hall P Gill N Kurka M Phan W and Bartz A MachineLearning Interpretability with H2O Driverless AI edited byBartz A H2Oai Inc California US 2019

Hastings S J Oechel W C and Muhlia-Melo A Diur-nal seasonal and annual variation in the net ecosystem CO2exchange of a desert shrub community (Sarcocaulescent) inBaja California Mexico Glob Change Biol 11 927ndash939httpsdoiorg101111j1365-2486200500951x 2005

Henrich V Krauss G Goumltze C and Sandow C Entwicklungeiner Datenbank fuumlr Fernerkundungsindizes vol 45 AK Fern-erkundung Bochum 2012

Hill M Held A Leuning R Coops N Hughes D andCleugh H MODIS spectral signals at a flux tower site Rela-tionships with high-resolution data and CO2 flux and light useefficiency measurements Remote Sens Environ 103 351ndash368httpsdoiorg101016jrse200506015 2006

Hill M J and Guerschman J P The MODIS Global VegetationFractional Cover Product 2001ndash2018 Characteristics of Vegeta-tion Fractional Cover in Grasslands and Savanna Woodlands Re-mote Sens 12 406 httpsdoiorg103390rs12030406 2020

Hinojo-Hinojo C Castellanos A E Huxman T Ro-driguez J C Vargas R Romo-Leoacuten J R and Bie-derman J A Native shrubland and managed buffelgrasssavanna in drylands Implications for ecosystem carbonand water fluxes Agr Forest Meteorol 268 269ndash278httpsdoiorg101016jagrformet201901030 2019

Jaumlrvi L Havu M Ward H C Bellucco V McFad-den J P Toivonen T Heikinheimo V Kolari P Ri-ikonen A and Grimmond C S B Spatial modeling oflocal-scale biogenic and anthropogenic carbon dioxide emis-sions in Helsinki J Geophys Res-Atmos 124 8363ndash8384httpsdoiorg1010292018JD029576 2019

Joiner J and Yoshida Y Satellite-based reflectances capture largefraction of variability in global gross primary production (GPP)at weekly time scales Agr Forest Meteorol 291 108092httpsdoiorg101016jagrformet2020108092 2020

Jung M Schwalm C Migliavacca M Walther S Camps-VallsG Koirala S Anthoni P Besnard S Bodesheim P Carval-hais N Chevallier F Gans F Goll D S Haverd V KoumlhlerP Ichii K Jain A K Liu J Lombardozzi D Nabel J EM S Nelson J A OrsquoSullivan M Pallandt M Papale DPeters W Pongratz J Roumldenbeck C Sitch S TramontanaG Walker A Weber U and Reichstein M Scaling carbonfluxes from eddy covariance sites to globe synthesis and eval-uation of the FLUXCOM approach Biogeosciences 17 1343ndash1365 httpsdoiorg105194bg-17-1343-2020 2020

Kang M Ichii K Kim J Indrawati Y M Park J Moon MLim J-H and Chun J-H New Gap-Filling Strategies forLong-Period Flux Data Gaps Using a Data-Driven ApproachAtmosphere 10 568 httpsdoiorg103390atmos101005682019

Lal R Griffin M Apt J Lave L and Morgan M GManaging Soil Carbon Science 304 393ndash393httpsdoiorg101126science1093079 2004

LI-COR B Eddypro Software instruction manual Nebraska USversion 70 2019

Liu Y Liu R Pisek J and Chen J M Separating overstoryand understory leaf area indices for global needleleaf and de-ciduous broadleaf forests by fusion of MODIS and MISR dataBiogeosciences 14 1093ndash1110 httpsdoiorg105194bg-14-1093-2017 2017

Long S P Marshall-Colon A and Zhu X-G Meetingthe Global Food Demand of the Future by EngineeringCrop Photosynthesis and Yield Potential Cell 161 56ndash66httpsdoiorg101016jcell201503019 2015

Ma X Huete A Yu Q Restrepo-Coupe N Beringer J Hut-ley L B Kanniah K D Cleverly J and Eamus D Param-

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 391

eterization of an ecosystem light-use-efficiency model for pre-dicting savanna GPP using MODIS EVI Remote Sens Environ154 253ndash271 2014

Ma X Migliavacca M Wirth C Bohn F J Huth ARichter R and Mahecha M D Monitoring Plant FunctionalDiversity Using the Reflectance and Echo from Space RemoteSens 12 1248 httpsdoiorg103390rs12081248 2020

Marcolla B Roumldenbeck C and Cescatti A Patterns and con-trols of inter-annual variability in the terrestrial carbon budgetBiogeosciences 14 3815ndash3829 httpsdoiorg105194bg-14-3815-2017 2017

Mastachi-Loza C A Gonzaacutelez-Sosa E Becerril-Pintildea R andBraud I Peacuterdidas por intercepcioacuten en mezquite (Prosopis laevi-gata) y huizache (Acacia farnesiana) de la regioacuten semiaacuterida delcentro de Meacutexico Tecnol Cienc Agua 1 103ndash120 2010

Mauder M and Foken T Documentation and instruction man-ual of the eddy-covariance software package TK3 UniversitaumltBayreuth Abteilung Mikrometeorologie Bayreuth Germany2011

McClaran M A century of vegetation change on the Santa Rita Ex-perimental Range in Santa Rita Experimental Range 100 years(1903 to 2003) to accomplishments and contributions edited byMcClaran M P Ffolliott P F and Edminister C B Tech Co-ords USDA Forest service proceedings RMRS-P-30 TucsonAZ US 16ndash33 2003

McCulley R Jobbagy E Pockman W and Jackson R Nu-trient uptake as a contributing explanation for deep rootingin arid and semi-arid ecosystems Oecologia 141 620ndash628httpsdoiorg101007s00442-004-1687-z 2004

McElreath R Statistical Rethinking A Bayesian Course with Ex-amples in R and STAN CRC Press Boca Raton FL US 2020

Michailidis M How Driverless AI Prevents Overfitting and Leak-age Open Source Lead AI ML available at httpswwwh2oaiblogdriverless-ai-prevents-overfitting-leakage (last access7 July 2020) 2018

Mitchell R and Frank E Accelerating the XGBoost algo-rithm using GPU computing PeerJ Comput Sci 3 e127httpsdoiorg107717peerj-cs127 2017

Monteith J Solar radiation and productivity in tropical ecosys-tems J Appl Ecol 9 747ndash766 1972

Murray-Tortarolo G Friedlingstein P Sitch S Jaramillo V JMurguiacutea-Flores F Anav A Liu Y Arneth A Arvanitis AHarper A Jain A Kato E Koven C Poulter B Stocker BD Wiltshire A Zaehle S and Zeng N The carbon cycle inMexico past present and future of C stocks and fluxes Biogeo-sciences 13 223ndash238 httpsdoiorg105194bg-13-223-20162016

Nash J E and Sutcliffe J V River flow forecasting through con-ceptual models part I ndash A discussion of principles J Hydrol 10282ndash290 1970

Owen N A Choncubhair Oacute N Males J del Real Laborde J IRubio-Corteacutes R Griffiths H and Lanigan G Eddy covari-ance captures four-phase crassulacean acid metabolism (CAM)gas exchange signature in Agave Plant Cell Environ 39 295ndash309 httpsdoiorg101111pce12610 2016

Pasetto D Arenas-Castro S Bustamante J Casagrandi RChrysoulakis N Cord A F Dittrich A Domingo-Marimon C El Serafy G and Karnieli A Integration ofsatellite remote sensing data in ecosystem modelling at local

scales Practices and trends Methods Ecol Evol 9 1810ndash1821httpsdoiorg1011112041-210X13018 2018

Pastorello G Trotta C and Canfora E The FLUXNET2015dataset and the ONEFlux processing pipeline for eddy covari-ance data Sci Data 7 225 pp httpsdoiorg101038s41597-020-0534-3 2020

Pearcy R W Heterogeneous light environments in Handbookof Functional Plant Ecology edited by Pugnaire I F and Val-ladares F CRC Press Boca Ratoacuten FL US 270ndash314 2007

Piao S Liu Q Chen A Janssens I A Fu Y Dai J Liu LLian X Shen M and Zhu X Plant phenology and global cli-mate change Current progresses and challenges Glob ChangeBiol 25 1922ndash1940 httpsdoiorg101111gcb14619 2019

R Development Core Team R A language and environment forstatistical computing R Foundation for Statistical Computingavailable at httpswwwr-projectorg (last access 15 October2015) 2009

Reichstein M Falge E Baldocchi D Papale D Aubinet MBerbigier P Bernhofer C Buchmann N Gilmanov Tand Granier A On the separation of net ecosystem ex-change into assimilation and ecosystem respiration reviewand improved algorithm Glob Change Biol 11 1424ndash1439httpsdoiorg101111j1365-24862005001002x 2005

Richardson A D and Hollinger D Y Statistical model-ing of ecosystem respiration using eddy covariance datamaximum likelihood parameter estimation and Monte Carlosimulation of model and parameter uncertainty applied tothree simple models Agr Forest Meteorol 131 191ndash208httpsdoiorg101016jagrformet200505008 2005

Richardson A D Hollinger D Y Shoemaker J K Hughes HSavage K and Davidson E A Six years of ecosystem-atmosphere greenhouse gas fluxes measured in a sub-borealforest Sci Data 6 1ndash15 httpsdoiorg101038s41597-019-0119-1 2019

Roby M C Scott R L Barron-Gafford G A Hamerlynck E Pand Moore D J P Environmental and Vegetative Controls onSoil CO2 Efflux in Three Semiarid Ecosystems Soil Syst 3 6httpsdoiorg103390soilsystems3010006 2019

Running S W and Zhao M Daily GPP and annual NPP(MOD17A2A3) products NASA Earth Observing SystemMODIS land algorithm Univ of Mont Missoula Mont US2015

Running S W Nemani R R Heinsch F A Zhao MReeves M and Hashimoto H A continuous satellite-derived measure of global terrestrial primary productionBioscience 54 547ndash560 httpsdoiorg1016410006-3568(2004)054[0547ACSMOG]20CO2 2004

Rzedowski J Vegetacioacuten de Meacutexico Meacutexico Editor Limusa1978

Scott R L and Biederman J A Critical Zone Water Balance Over13 Years in a Semiarid Savanna Water Resour Res 55 574ndash588 httpsdoiorg1010292018WR023477 2019

Scott R L Watts C Payan J G Edwards E Goodrich D CWilliams D and James Shuttleworth W The understory andoverstory partitioning of energy and water fluxes in an opencanopy semiarid woodland Agr Forest Meteorol 114 127ndash139 httpsdoiorg101016S0168-1923(02)00197-1 2003

Scott R L Edwards E A Shuttleworth W J Huxman T EWatts C and Goodrich D C Interannual and seasonal vari-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

392 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

ation in fluxes of water and carbon dioxide from a ripar-ian woodland ecosystem Agr Forest Meteorol 122 65ndash84httpsdoiorg101016jagrformet200309001 2004

Scott R L Biederman J A Hamerlynck E P and Barron-Gafford G A The carbon balance pivot point of southwest-ern US semiarid ecosystems Insights from the 21st centurydrought J Geophys Res-Biogeo 120 2612ndash2624 2015

Segerstrom K Geology of the Bernal-Jalpan Area Estado deQueretaro Mexico Geol Surv Bull 1104 19ndash82 1961

Soper F M McCalley C K Sparks K and Sparks J P Soilcarbon dioxide emissions from the Mojave desert Isotopic evi-dence for a carbonate source Geophys Res Lett 44 245ndash251httpsdoiorg1010022016GL071198 2017

Stevens N T Steiner S H and MacKay R J Assessing agree-ment between two measurement systems An alternative to thelimits of agreement approach Stat Methods Med Res 262487ndash2504 httpsdoiorg1011770962280215601133 2015

Tagesson T Ardouml J Cappelaere B Kergoat L AbdiA Horion S and Fensholt R Modelling spatial andtemporal dynamics of gross primary production in theSahel from earth-observation-based photosynthetic capacityand quantum efficiency Biogeosciences 14 1333ndash1348httpsdoiorg105194bg-14-1333-2017 2017

Thornton P Thornton M Mayer B Wei Y Devarakonda RVose R and Cook R Daymet Daily Surface Weather Data ona 1-km Grid for North America Version 3 Oak Ridge NationalLab (ORNL) Oak Ridge TN US 2017

Tramontana G Jung M Schwalm C R Ichii K Camps-VallsG Raacuteduly B Reichstein M Arain M A Cescatti A KielyG Merbold L Serrano-Ortiz P Sickert S Wolf S andPapale D Predicting carbon dioxide and energy fluxes acrossglobal FLUXNET sites with regression algorithms Biogeo-sciences 13 4291ndash4313 httpsdoiorg105194bg-13-4291-2016 2016rdquo

Truong A Walters A Goodsitt J Hines K Bruss C B andFarivar R Towards automated machine learning Evaluationand comparison of AutoML approaches and tools in Proceed-ings of the 2019 IEEE 31st International Conference on Toolswith Artificial Intelligence (ICTAI) Portland OR US 1471ndash1479 2019

Verbist K Santibantildeez F Gabriels D and Soto G Atlas dezonas aacuteridas de Ameacuterica Latina y El Caribe Donoso MC(Coord) UNESCO-PHI and CAZALAC Montevideo Uruguay2010

Wang H Li X Ma M and Geng L Improving esti-mation of gross primary production in dryland ecosystemsby a model-data fusion approach Remote Sens 11 225httpsdoiorg103390rs11030225 2019

Wang X Ma M Li X Song Y Tan J Huang G Zhang ZZhao T Feng J and Ma Z Validation of MODIS-GPP prod-uct at 10 flux sites in northern China Int J Remote Sens 34587ndash599 httpsdoiorg101080014311612012715774 2013

Wang X Wang H Li X and Ran Y Photosynthesis (NPPNEP Respiration) Obs Meas Ecohydrol Process 1ndash30httpsdoiorg101007978-3-662-47871-4_10-1 2018

Wheeler C W Archer S R Asner G P and McMurtry C RClimaticedaphic controls on soil carbonnitrogen response toshrub encroachment in desert grassland Ecol Appl 17 1911ndash1928 httpsdoiorg10189006-15801 2007

Wilcox B P Birt A Fuhlendorf S D and Archer S R Emerg-ing frameworks for understanding and mitigating woody plantencroachment in grassy biomes Curr Opin Env Sust 32 46ndash52 httpsdoiorg101016jcosust201804005 2018

Wilson J B Cover plus ways of measuring plant canopiesand the terms used for them J Veg Sci 22 197ndash206httpsdoiorg101111j1654-1103201001238x 2011

Wofsy S Goulden M Munger J Fan S-M Bakwin PDaube B Bassow S and Bazzaz F Net exchange ofCO2 in a mid-latitude forest Science 260 1314ndash1317httpsdoiorg101126science26051121314 1993

Wu C Niu Z and Gao S Gross primary production estimationfrom MODIS data with vegetation index and photosyntheticallyactive radiation in maize J Geophys Res-Atmos 115 D12127httpsdoiorg1010292009JD013023 2010

Wutzler T Lucas-Moffat A Migliavacca M Knauer JSickel K Šigut L Menzer O and Reichstein MBasic and extensible post-processing of eddy covarianceflux data with REddyProc Biogeosciences 15 5015ndash5030httpsdoiorg105194bg-15-5015-2018 2018

Xiao H McDonald-Madden E Sabbadin R Peyrard NDee L E and Chadegraves I The value of understanding feed-backs from ecosystem functions to species for managing ecosys-tems Nat Commun 10 1ndash10 httpsdoiorg101038s41467-019-11890-7 2019

Xu B Yang Y Li P Shen H and Fang J Globalpatterns of ecosystem carbon flux in forests A biometricdata-based synthesis Global Biogeochem Cy 28 962ndash973httpsdoiorg1010022013GB004593 2014

Yao J Liu H Huang J Gao Z Wang G Li D Yu H andChen X Accelerated dryland expansion regulates future vari-ability in dryland gross primary production Nat Commun 111ndash10 httpsdoiorg101038s41467-020-15515-2 2020

Yepez E A Williams D G Scott R L and Lin GPartitioning overstory and understory evapotranspiration ina semiarid savanna woodland from the isotopic composi-tion of water vapor Agr Forest Meteorol 119 53ndash68httpsdoiorg101016S0168-1923(03)00116-3 2003

Yona L Cashore B Jackson R B Ometto J and Brad-ford M A Refining national greenhouse gas inventories Am-bio 1ndash6 httpsdoiorg101007s13280-019-01312-9 2020

Zeng X Zeng X and Barlage M Growing temperate shrubsover arid and semiarid regions in the Community Land ModelndashDynamic Global Vegetation Model Global Biogeochem Cy 22GB3003 httpsdoiorg1010292007GB003014 2008

Zhang A Jia G Epstein H E and Xia J ENSO elicits oppos-ing responses of semi-arid vegetation between Hemispheres SciRep 7 42281 httpsdoiorg101038srep42281 2017

Zhang L Xiao J Zheng Y Li S and Zhou Y Increased carbonuptake and water use efficiency in global semi-arid ecosystemsEnviron Res Lett 15 034022 httpsdoiorg1010881748-9326ab68ec 2020

Zhu X-G Long S P and Ort D R What is the max-imum efficiency with which photosynthesis can convert so-lar energy into biomass Curr Opin Biotech 19 153ndash159httpsdoiorg101016jcopbio200802004 2008

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

  • Abstract
  • Introduction
  • Materials and methods
    • Site description
    • Eddy covariance measurements
    • Flux data processing
    • Remotely sensed data
    • MODIS algorithm for GPP
    • Santa Rita site dataset
    • Modeling
    • Variable importance
    • Model agreement
      • Results
        • Eddy covariance fluxes at Bernal
        • Machine learning ensembles as predictors of eddy covariance GPP
        • MODIS as predictor of eddy covariance GPP
        • Prediction eddy covariance GPP with ordinary-least-squares multiple regression
        • Agreement
          • Discussion
            • Model agreement
            • MODIS discrepancies
            • Forcing variables and machine learning
            • Carbon flux
            • Ecosystem management
              • Conclusions
              • Appendix A
              • Appendix B
              • Appendix C
              • Code and data availability
              • Author contributions
              • Competing interests
              • Acknowledgement
              • Financial support
              • Review statement
              • References
Page 6: Machine learning estimates of eddy covariance carbon flux ... · Scrub is a xeric category of shrublands characterized by plants with small leaves; it is very thorny, and its biomass

372 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

subsets for the Bernal and Santa Rita sites includ-ing daily surface reflectance (MOD09GA006 and MOD-OCGA006) daily daytime and nighttime land surface tem-perature (LST) (MOD11A2006 and MYD11A2006) 8 dleaf area index (LAI) and fraction of photosynthetically ac-tive radiation (FPAR) (MOD15A2H006 MYD15A2H006MCD15A2H006) 16 d enhanced vegetation index (EVI)(MYD13A1006) 16 d gross primary production (GPP)and net photosynthesis (PsnNet) (MOD17A2H006) TheAppEEARS also unpacks and interprets the quality layersAppendix B presents the details of each spectral band ofMODIS Data with less-than-good quality flags were deletedMissing data were filled with splines and a database witha 1 d time step was generated This would smooth lin-ear temporalndashphenological evolution between any two suc-cessive remotely sensed data points (Eshel et al 2019)Daily accumulated rainfall was requested using the Giovanniplatform from the Goddard Space Flight Center (GSFC)DAAC two products were used the 3IMERGDF006 fromthe Global Precipitation Measurement mission (GPM) andthe 3B42007 from the Tropical Rainfall Measuring mis-sion (TRMM) Gridded weather parameters from the ORNLDAAC Daymet dataset were precipitation shortwave radia-tion maximum and minimum air temperature and water va-por pressure Daymet is a data product derived from a collec-tion of algorithms interpolating and extrapolating daily me-teorological observations (Thornton et al 2017) FollowingHenrich et al (2012) and Hill et al (2006) daily reflectancebands of MODIS were used to compute several vegetation in-dices redndashgreen ratio index (RGRI) simple ratio (SimpleR)moisture stress (MoistS) disease stress index (DSI) greenatmospherically resistant vegetation index (GARI) normal-ized difference vegetation index (NDVI) normalized differ-ence water index (NDVI_w) and enhanced vegetation in-dex (EVI) the corresponding equations are presented in Ap-pendix B

25 MODIS algorithm for GPP

Estimates of GPP are derived from data recorded by theMODIS sensor aboard the Terra and Aqua satellites The ef-ficiency (ε gCMJminus1) with which vegetation produces drymatter is defined as the amount of solar energy stored by pho-tosynthesis in a given period divided by the solar constant in-tegrated over the same period (Monteith 1972) Not all inci-dent solar radiation is available for biomass conversion onlyabout 48 is photosynthetically active (PAR MJmminus2) andnot all PAR is absorbed (Zhu et al 2008) Thus carbon ex-change is mainly controlled by the amount of PAR absorbedby green vegetation (APAR) and modified by ε (Gitelson etal 2015) The fraction of absorbed PAR (FPAR) is equalto APARPAR but can be represented by the NDVI spectralvegetation index produced by MODIS (Running 2004) Theefficiency term ε is described as the product of different fac-tors as a whole or as a part of the system (Monteith 1972) but

mostly those related to the efficiencies with which the vegeta-tion intercepts the radiation and the efficiency of convertingthe intercepted radiation into biomass (Long et al 2015)The MODIS algorithm that estimates GPP in the MOD17product is (Running et al 2004)

GPP= εtimesFPARtimesPAR (1)GPP= εtimesNDVItimesPAR (2)

The ε term in the MODIS algorithm is represented by amaximum radiation conversion efficiency (εmax kgCMJminus1)that is attenuated by suboptimal climatic conditions mainlyminimum air temperature (Tmin) and vapor pressure deficit(VPD) Two parameters for each Tmin and VPD are usedto define attenuation scalars for general biome types Theseparameters form linear functions between the scalars (Run-ning and Zhao 2015 Wang et al 2013) daily minimumtemperature at which ε = εmax and at which ε = 0 and thedaylight average VPD at which ε = εmax and at which ε =0 GPP is truncated on days when air temperature is be-low 0 C or VPD is higher than 2000 Pa (Running andZhao 2015) Stress and nutrient constraints on vegetationgrowth are quantified by the limiting relation of leaf areain NDVItimesPAR rather than constrained through ε (Runninget al 2004) However the MODIS algorithm does not con-sider stomatal sensitivity to leaf-to-air vapor pressure acrossand within species and particularly between isohydric andanisohydric plant species (Grossiord et al 2020)

ε = εmaxtimes Tmin scalartimesVPD_scalar (3)

The MOD17 userrsquos guide presents a biome-property look-up table (BPLUT) with the parameters for each biome typeand assumes that they do not vary with space or time (Run-ning and Zhao 2015) This aspect is important because εmaxhas the strongest impact on the predicted GPP of the MOD17algorithm (Wang et al 2013) The assumption is also impor-tant because the overstory and understory could be decou-pled from each other and would intercept different amountsof light and have different water sources during the grow-ing season (Scott et al 2003) Light quantity and quality asdiffuse light or sunflecks determine differences among un-derstory species in their temporal response to gaps involvingacclimation and avoidance of photoinhibition (Pearcy 2007)Another shortcoming is that few land cover classifications areincorporated into the MOD17 algorithm

26 Santa Rita site dataset

The Santa Rita Experimental Range (SRER) is located inthe western range of the Santa Rita Mountains in ArizonaUSA (lat 318214 longminus1108661 1120 masl) Climate isBSk with mean annual precipitation of 380 mm temperatureof 179 C and Ustic Torrifluvent soils Established in 1903SRER has a long history of experimental manipulations toenhance grazing potential for cattle (Glenn et al 2015) Two

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 373

Figure 4 Semidesert grassland encroached by mesquite (Prosopisvelutina) at Santa Rita Arizona (US-SRM) Image credit RussellScott 9 December 2016

Ameriflux sites are located in the SRER Santa Rita Grass-land (US-SRG) and Santa Rita Mezquite (US-SRM) Weused EC data for the years 2013ndash2019 from US-SRM whichis a mesquite grass savanna (35 mesquite canopy coverand mean canopy height above 2 m 22 grasses and 43 bare soil) although MODIS describes this site as open shrub-lands (Glenn et al 2015 Scott et al 2004) The US-SRMsite is dominated by velvet mesquite (Prosopis velutina) andhas a diversity of shrubs cacti succulents and bunch grasses(McClaran 2003) This site was chosen because the vegeta-tion and climate are similar to the Bernal site and it was theclosest EC instrumentation with data availability (Fig 4)

27 Modeling

Gross primary production estimated by EC at the Bernal sitewas modeled using OLS and EML The explanatory vari-ables were the remotely sensed data the weather parame-ters and the vegetation indices (Appendix B) The OLS isa particular case of the generalized linear model where thevariation of a single response variable is explained by severalindependent variables The OLS was fitted with the stepwiseprocedure the final model included variables with a variance

inflation factor (VIF) lower than 10 and a significance levelof 005 Predictions of the EC GPP were obtained with the fi-nal model Analysis and diagnostics were made with Minitabv 17 (Minitab LLC) Analysis of the OLS model can be usedto determine agreement between methods of measurementbut it is sensitive to the range of values in the dataset andits metrics ndash r r2 and root mean square ndash do not provideinformation on the type of association (Bland and Altman2010) In this paper we used another metric the probabil-ity of agreement to determine bias and agreement betweenmodel estimates and observed data (see below)

While OLS is a well-known algorithm machine learningalgorithms are emerging techniques that focus on the datastructure and match that data onto models The EML ap-proach considers the different realizations of machine learn-ing models and constructs an ensemble of models com-ing with the advantage of being more accurate than thepredictions from the individual ensemble members How-ever EML is computationally intensive requiring nodeswith purpose-built hardware such as multiple processors orreduced-precision accelerators The nodes could be aggre-gated in computing clusters which require storage powerand cooling redundancy

A stack of EMLs was obtained with the H2O packageof R (Hall et al 2019) This package provides several al-gorithms that can contribute to a stack of ensembles usingthe AutoML (automatic machine learning) function feedfor-ward artificial neural network (DL) general linear models(GLMs) gradient-boosting machine (GBM) extreme gra-dient boosting (XGBoost) default distributed random for-est (DRF) and extremely randomized trees (XRT) AutoMLtrains two stacked-ensemble models one ensemble containsall the models and the second ensemble contains just thebest-performing model from each algorithm class or fam-ily both ensembles should produce better models than anyindividual model from the AutoML run The term AutoMLimplies data preprocessing normalization feature engineer-ing model selection hyperparameter optimization and pre-diction analysis including procedures to identify and dealwith nonindependent and identically distributed observationsand overfitting (Michailidis 2018 Truong et al 2019)

Machine learning has two elements for supervised learn-ing training loss and regularization The task of training triesto find the best parameters for the model while minimizingthe training loss function the mean squared error for exam-ple The regularization term controls the complexity of themodel helping to reduce overfitting Overfitting becomes ap-parent when the model performs accurately during the train-ing but the accuracy is low during the testing A good modelneeds extensive parameter tuning by running the algorithmmany times to explore the effect on regularization and cross-validation accuracy (Mitchell and Frank 2017) In this inves-tigation the function of training loss was the deviance whichis a generalization of the residual sum of squares driven bythe likelihood Deviance is a measure of model fit and lower

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

374 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

or negative values indicate better model performance (McEl-reath 2020)

A stack of EML solutions was based on a random sam-ple of the dataset for training the model For the Bernal site85 was used for training for US-SRM 80 was usedThe AutoML function was run 20 times each run added ap-proximately 48 models to the leader board and ranked thebest-performing models by their deviance Each run splitsthe training data 10 times for k-fold cross-validation Theseed for an EML that is dependent on randomization waschanged in every run The stopping rule for each run wasset at 100 s and the maximum memory allocation pool forH2O was 100 GB in a single workstation with dual Xeon2680 v4 processors and 128 GB of RAM The H2O packagewas installed in a rockergeospatial docker container (imageavailable at httpshubdockercomrrockergeospatial lastaccess 9 January 2021) which is a portable scalable and re-producible environment (Boettiger and Eddelbuettel 2017)

Two sets of predictions for the GPP at the Bernal site wereobtained from the stacked ensemble The first set of predic-tions was based on the 15 of the Bernal site data reservedfor testing The second set of predictions was obtained byrefeeding the US-SRM site model with the Bernal site ex-planatory variables The first set of predictions would showthe importance of local data to predict EC-based GPP Thesecond set of predictions would represent the suitability ofoff-site data to predict EC-based GPP If the second scenariohas good agreement then an EML model could be used torepresent wider areas of the ecosystem

28 Variable importance

The variable importance within individual models was usedto answer the question of which environmental variableswere important for GPP prediction For the ldquoall-models en-semblerdquo and ldquobest-of-family ensemblerdquo generated by Au-toML it is not possible to examine the variable importance orthe contribution of the individual models to the stack (H2Oai2017) Therefore a weight (wi)was calculated using Eq (4)which is adequate for other information criteria besides theAkaike weights this weight is an estimate of the conditionalprobability that the model will make the best predictions onnew data considering the set of models (McElreath 2020)Then the importance of each variable () was multipliedby the modelrsquos weight (wi) and then added by variable tobuild the variable importance index This index would mea-sure how often a given variable was used on the leader board

wi =exp

(minus

12 dWAICi

)summj=1 exp

(minus

12 dWAICi

) (4)

dWAIC= deviancei minus deviance of top-performing model(5)

29 Model agreement

Calibration and agreement between methods of measure-ment are different procedures Calibration compares knownquantities of the true value or measurements made bya highly accurate method (a gold standard) against themeasurements of a new or a contending method When twomethods of measurement are compared neither providesan unequivocally correct measurement because both havea measurement error and the true value remains unknown(Bland and Altman 2010) Stevens et al (2015) proposes theprobability of agreement (θs) as a plot metric to representthe agreement between two measurement systems acrossa range of plausible values The θs method addressessome of the challenges of the accepted ldquolimits agreementmethodrdquo presented by Bland and Altman (2010) Besides theagreement plot agreement is based on maximum likelihoodbias parameters α and β quantifying the fixed bias andthe proportional bias If α = 0 β = 1 and σ1= σ2 thenthe two measurements are identical σj is the measurementvariation The probability-of-agreement analysis was per-formed using the ProbAgreeAnalysis (httpsuwaterloocabusiness-and-industrial-statistics-research-groupsoftwarelast access 28 November 2020) in MATLAB 94 (Math-Works Inc) An arbitrary 1 gCmminus2 dminus1 was considered tobe a tolerable magnitude to conclude that agreement is suf-ficient enough to use either estimated GPP interchangeablyThe reference measurement was the GPP obtained fromEC data at the Bernal site and tested against the MODISMOD17 model the OLS predictions or each of the two setsof EML predictions If the probability-of-agreement plotsuggested disagreement between two measurement systemsthen the predictions can be adjusted using

Adjusted predictor= (predictorminusα)β (6)

3 Results

31 Eddy covariance fluxes at Bernal

Dominant flow at the Bernal site was from the northeast(Fig 2) Energy balance closure for this site had a slopeof 072 and r2

= 092 (Fig 5) Homogeneous sites of theFluxnet network obtain closure percentages higher than72 and for the Bernal site the vegetation heterogeneitywas important (see below) Average H was always nega-tive during nighttime but during some months of the dryand rainy season λE was positive particularly after dawn(Fig 6) so as to allow nighttime evaporation from the soil orvegetation However during some months of the dry seasonrainfall was low (Fig 7) then the positive λE suggested thatcacti could have an active gas exchange at that time

Carbon dioxide absorptions had a diurnal behavior begin-ning at dawn and ending before sunset (Fig 8) Nighttimeflux was positive indicating respiration notwithstanding the

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 375

Figure 5 Closure of the surface energy balance from eddy covari-ance measurements averaged at 30 min between the turbulent fluxes(H + λE) and available fluxes (Rn-G) Data are from 30 March2017 to 22 August 2018 at the Bernal site The regression wasy = 2302+072times with adjusted r2

= 092 The diagonal line rep-resents the 1 1 relation

presence of cacti Although summer rains are characteristicof the climate at the Bernal site (Fig 7) a negative NEE fluxoccurred in all measured months The lowest CO2 flux wasrecorded in January and February 2017 and in May 2018 (Ta-ble 1) This behavior resulted from the phenology of the veg-etation since most species lost their leaves in the dry seasonand was also due to the effect of low temperature Withinthe rainy season the flux of CO2 increased compared to themonths of January to June The correlation between NEE andprecipitation was minus045 When the sum of the precipitationof the current month and that of the previous month was con-sidered the correlation with NEE was minus07 suggesting thatcontinuous availability of soil moisture is important for theabsorption of CO2 in this environment

The scrub at Bernal was heterogeneous in botanical com-position A total of 24 species of cacti and shrub wereidentified on average each sampling plot had 103 speciesThe IVI was similar between all cacti (036plusmn 004) shrublegumes (038plusmn004) and other shrubs (023plusmn006) sampledMost sampling plots were in areas of high flux frequency(Fig 2) Cylindropuntia imbricata had the largest IVI fol-lowed by Acacia farnesiana Acacia schaffneri and Prosopislaevigata The IVI of the herbaceous stratum representedby grasses was not characterized due to the state of over-grazing and the absence of reproductive structures in plantswhich made measurement of their abundances frequenciesand dominance difficult The grass genera present were Meli-nis Chloris Cynodon and Cenchrus all corresponding to in-

Table 1 Daily average values of the net ecosystem exchange(NEE) gross primary productivity (GPP) and ecosystem respira-tion (Reco) in a scrub at the Bernal site Negative values of NEEindicate photosynthetic absorption

NEE GPP RecomicromolCO2 mminus2 sminus1

2017

JANFEBMARAPR minus054 248 194MAY 005 286 291JUN 038 421 459JUL minus100 178 077AUGSEPOCT minus126 525 399NOV minus013 265 252DEC 004 180 184

2018

JAN minus005 129 124FEB 006 188 194MAR minus094 245 151APR minus058 287 229MAY minus029 377 348JUN minus252 433 181JUL minus283 823 541AUG minus193 921 728

vasive C4 tropical grasses Scrub species of higher IVI hada similar LAI (12) although the magnitude of the LAI of Plaevigata stood out (Table 2)

32 Machine learning ensembles as predictors of eddycovariance GPP

In this section we describe the modeling with EML using lo-cal remotely sensed data from the Bernal site to predict GPPat the same site and then the agreement between EML GPPpredictions and EC-derived GPP The AutoML function gen-erated 1031 models with an average deviance of 135 whilethe deviance of the leader model was 063 in the trainingdataset (Table 3) A total of 11 models of type GBM andfive XGBoost models were in the top 30 models along withthe nine best-of-family ensembles and five all-models ensem-bles The weighted variable importance on the leader boardwas higher for the LAI from the MOD15 and MCD15 prod-ucts (17 and 14 ) The PsnNet EVI (MOD17) FPAR(MCD15) green atmospherically resistant vegetation index(GARI) and MODIS reflectance band 13 had an importancehigher than 3 (59 54 42 36 and 30 ) LAI(MCD13 and MOD13) PsnNet and the FPAR (MOD15)were the more important variables (20 17 13 and10 ) in the top nonstacked model a GBM model that wasranked in fourth place

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

376 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 6 Latent heat flux daily trend at Bernal during different months emphasizing nighttime λE Panel (a) shows months with low rainfallin the previous month and predominantly negative λE during nighttime (these months had low rainfall 017 113 and 033 mm rainfall forJanuary March and December) Panel (b) shows months with low rainfall in the previous month and positive λE after sunset (67 73 and207 mm rainfall for February April and May) Panel (c) shows months during the rainy season with positive λE mainly due to soil wetnessand antecedent rainfall of 43 202 9 190 and 34 mm for June July August October and November Plot (c) is out of scale on the y axis forcompatibility with the other plots

Table 2 Importance value index (IVI) and leaf area index (LAI) of the main species present at the Bernal Quereacutetaro study site

Species Plant type IVI SEMa LAI SEM

Coryphantha cornifera Cactus 007 027Bouvardia ternifolia Herb 007 027Karwinskia humboldtiana Shrub 007 027Forestiera phillyreoides Shrub 009 027Ferocactus latispinus Cactus 009 027Cylindropuntia leptocaulis Cactus 009 027Asphodelus fistulosus Shrub 009 019Brickellia veronicifolia Shrub 010 027Dalea lutea Shrub 011 015Eysenhardtia polystachya Legume 013 015Myrtillocactus geometrizans Cactus 014 027Schinus molle Shrub 014 019Jatropha dioica Herb 015 019Mammillaria uncinata Cactus 016 012Opuntia tomentosa Cactus 017 011Opuntia robusta Cactus 023 007Opuntia hyptiacantha Cactus 026 007Mimosa monancistra Legume 028 012Mimosa depauperata Legume 031 012Zaluzania augusta Shrub 033 010Viguiera linearis Herb 036 011Acacia schaffneri Legume 041 007 113 015Prosopis laevigata Legume 041 007 148 012Acacia farnesiana Legume 056 009 112 037Cylindropuntia imbricata Cactus 074 007 113 011

a SEM standard error of the mean

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 377

Figure 7 Monthly rainfall and land surface temperature (LST) dur-ing years (a) 2017 and (b) 2018 at the Bernal site The LST valuescorrespond to the 1330 (LST day) and 0130 (LST night) MODISAqua satellite overpasses

Predictions of GPP in the testing dataset showed disper-sion in the lower range of the scale of measurement andthe correlation was 094 (Fig 9a) The final prediction ofGPP for the whole dataset had a probability of agreement(θs) of 058plusmn 001 (parameter estimate and standard error)α =minus00616plusmn 011 and β = 10133plusmn 002 suggesting agood fit with low fixed and proportional bias (Fig 9b) Theprobability of agreement decreased slightly at the lower andupper range of the scale of measurement (Fig 9c) indicatingthat the EML model would predict GPP without increasingthe bias particularly in the range from 0 to 4 gCmminus2 dminus1However the value of θs should be higher than 095 so as toconsider EC measurements and EML to be interchangeableThe correlation of 099 (r2

= 098) for the data in Fig 9bcould be misleading as it would suggest a very good fit

Using only the five of the more important variables namedabove to generate an EML resulted in an XBoost leadermodel with 273 deviance and a total of 1094 models Thetop 30 models were 16 XBoost and GBM models and the

best-of-family ensembles started to show up in 12th placeAlthough the number of runs was the same (20) the AutoMLfunction increased the number of produced models but thesmaller set of explanatory variables constrained the ability toidentify features contributing to better models Using anotherset of five randomly selected explanatory variables (one veg-etation index and four MODIS bands) resulted in a leadermodel with 252 deviance out of 1024 models but this timethe leader was the best-of-the-family ensemble Using only afew variables was considered to increase the deviance com-pared to the average deviance of 135 obtained during thetraining phase and using all available variables

An important question for modeling upscaling is the ca-pacity to extrapolate results temporally and spatially herewe explored the latter posing the following question wouldpredictions of GPP from EML for an EC site agree withEC observations from another site with ldquosimilarrdquo environ-mental conditions First an EML solution was found train-ing 80 of the Santa Rita dataset and obtaining a best-of-family ensemble with 023 deviance out of 634 trainedmodels (hereafter this model is referred to as the Santa Ritamodel) Then the environmental and remotely sensed datafrom the Bernal site were fed into the Santa Rita model thiswould be an external validation dataset However agreementwas not good the mean value of θs was 015plusmn 001 withα =minus10822plusmn 009 and β = 058127plusmn 002 The value ofθs was not constant across the range of measurement and de-creased rapidly after 2 gCmminus2 dminus1 (Fig 10b) Because thebias was important predictions were adjusted using Eq (6)showing some improvement with r = 078 (Fig 10a) Com-paring Figs 5b and 6a it is evident that an EML modelextrapolation to other conditions is noisier ie Santa Ritamodel trying to represent the ecosystem function at BernalNotwithstanding some of the most important variables wereshared by both EML ensembles Bernal and Santa Rita in thecase of Santa Rita LAI from MOD15 MYD15 and MCD15had 350 48 and 31 of variable importance and theFPAR from MOD15 was 12

33 MODIS as predictor of eddy covariance GPP

MODIS is important because it overpasses every point ofthe earth every 1 or 2 d and it implements a GPP prod-uct (MOD17) that has helped track the response of the bio-sphere to the environment since 2000 The product MOD17has been validated against many EC sites but few validationsites correspond to deserts and semideserts (Running et al2004) The GPP MOD17 underestimated the GPP derivedfrom EC data at Bernal (Fig 11a) In a similar bell-shapeddistribution of θs as in the case of the extrapolation of theSanta Rita site (Fig 10b) here the θs was not constant acrossthe range of measurement mean θs was 024plusmn 013 withα = 000047plusmn0087 and β = 048749plusmn002 (Fig 11c) Ad-justing MOD17 estimates with Eq (6) improved the relation-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

378 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 8 Net ecosystem exchange (NEE) and photosynthetic active radiation (PAR) at the Bernal site in (a) 2017 and (b) 2018 Negativevalues in the CO2 flux indicate photosynthesis The gray shadow is the standard error of the mean for each month at any given hour

Table 3 Leader board of EML models for the Bernal site using 85 of day observations as training dataset NA denotes the outcome wherethe type of model was not present in the 30 top-performing models according to deviance

Type of model Number of models Average deviance

All models Top models Leader model

Stacked ensembleAll models 20 0852 0834Best of family 20 0812 0703 0633GBM 453 1366 0796DRF 20 1108 NAXGBoost 277 1385 0778XRT 20 1077 NADeep learning 201 2072 NAGLM 20 1474 NATotal 1031 1356

ship but note that the value of r (076) was the same for theoriginal MOD17 and the adjusted MOD17 (Fig 11b)

34 Prediction eddy covariance GPP withordinary-least-squares multiple regression

OLS is a common estimation method for linear models andhere this model appeared as adequate judging by the gen-eral distribution of predictions (Fig 12a) and the probability-of-agreement plot (Fig 12b) A total of 14 variables wereincluded in the model all of them with VIF values lowerthan 70 (Appendix C) the VIF statistic quantifies the sever-ity of multicollinearity and an acceptable threshold is 10The most significant variables were the EVI from MYD13and Daymet variables precipitation shortwave radiation and

minimum and maximum temperatures (see Appendix B forvariable details) Variables with high coefficient values wereMODIS reflectance band 14 (917) the EVI from MYD13(853) and the NDVI (39) while Daymet temperatures hadsmall coefficientsminus023 for maximum temperature and 017for minimum temperature The θs decreased slightly at theend of the measurement range mean θs was 05plusmn0014 withα = 018845plusmn0137 and β = 094966plusmn0031No correctionfor this model was calculated since α was close to 0 β wasclose to 1 and the EML model had a higher θs

35 Agreement

The probability of agreement (θs) was the statistic to deter-mine if the mean responses were in agreement Comparing

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 379

Figure 9 Agreement between predictions of GPP obtained with machine learning algorithms or derived from eddy covariance measurementsat the Bernal site (a) Example of one run of predictions of GPP in the test dataset from the Bernal site using the leader model of an ensembleof machine learning algorithms (EML) the test dataset was 15 of data with r = 094 (b) Predictions for the complete dataset withr = 099 the diagonal line is the 1 1 agreement (c) Function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreementbetween methods of estimation of GPP corresponding to plot (b) The horizontal axis (s) represents the magnitude of the measurement thevertical axis is the probability of agreement for the measurement and the red line is the confidence interval p lt 005

Figure 10 (a) Adjusted predictions of GPP for the complete dataset from the Bernal site using the leader model of the final ensembleof machine learning algorithms (EML) derived from the Santa Rita site compared to estimates of GPP from EC data with r = 078 (b)Respective function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreement between methods of estimation of GPP ECdata from the Bernal site and EML model for the Santa Rita site The horizontal axis (s) represents the magnitude of the measurement thevertical axis is the probability of agreement for the measurement and the red line is the confidence interval p lt 005

the confidence intervals (CIs) for θs the best modeling ap-proach was the EML because its CI (056ndash059) was differentfrom that of the OLS model (047ndash052) More importantlyfor the EML and OLS models the values of θs had little vari-ation across the range of GPP estimates However the CIsof the EML and OLS models were similar in their bias es-timates α and β and both models had no bias (p lt 005)Altogether the best result was the EML ensemble using en-vironmental and remotely sensed data corresponding to thesame site ie Bernal (Fig 10c) This kind of EML wouldbe useful for gap filling or the evaluation of GPP time se-ries of the site that generated the model Machine learningalgorithms can fill gaps longer than 30 d (Kang et al 2019)

The second option to estimate GPP using the samedatasets was the multiple-regression OLS model (Fig 12a)The multiple regression is straightforward and here multi-collinearity was not a problem The EML ensemble and theOLS regression have the highest values of θs (058 and 05respectively Fig 12b) Higher values for θs are desirable(gt 095) and this could be achieved by increasing the samplesize relaxing the tolerable magnitude for agreement (here itwas set at 1 gCmminus2 dminus1) or perhaps using different forcingvariables

The MODIS estimates were a third-best alternative sincethe mean θs was 024 In the present study we used a splineto fill the data to a daily time step since the MCD17 is an8 d composite product but a similar result was obtained if

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

380 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 11 (a) MOD17 GPP daily time step from the Bernal site versus estimates of GPP from EC data using the complete dataset at adaily time step derived from spline for MOD17 GPP (copy) with r = 076 or 8 d composite estimates obtained by 8 d averages of EC GPP() with r = 076 (b) Adjusted predictions of GPP from MODIS (c) Corresponding function of probability of agreement for Fig 7a using1 g Cmminus2 dminus1 as a tolerable agreement between methods of estimation of GPP EC data from the Bernal site and MODIS MOD17 Thehorizontal axis (s) represents the magnitude of the measurement The vertical axis is the probability of agreement for the measurement

Figure 12 (a) Ordinary-least-squares multiple-regression estimates of GPP for the complete dataset from the Bernal site versus estimates ofGPP from EC data with r = 082 (b) Respective function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreement betweenmethods of estimation of GPP EC data from the Bernal site and OLS multiple regression The horizontal axis (s) represents the magnitudeof the measurement The vertical axis is the probability of agreement for the measurement

the EC GPP was rescaled and compared to the original 8 dMCD17 data (Fig 11a) The GPP from MODIS was an un-derestimate of EC GPP and when the estimates were ad-justed (Eq 6 Fig 11b) the performance was not better thanthe OLS (model not shown) The MODIS land cover classi-fication represented this site as grassland (MCD12) and thiscould be another reason for the poor agreement besides theassumptions made in the MODIS algorithm regarding the εand εmax parameters and the response of vegetation to VPDAgreement of MODIS GPP is crucial because MODIS prod-ucts are frequently used in country-wide assessments of thecarbon cycle and can influence public policies

The model with least agreement resulted when the EMLensemble generated from the Santa Rita site was used to pre-dict GPP at Bernal Machine learning models can make pre-dictions but their usefulness decreases when they are usedoutside the context of where they were built while process-

based mechanistic models have this ability Although the Au-toML function in H2O is designed to protect against over-fitting using cross-validation runs (Michailidis 2018) in ourstudy the Santa Rita model could not be generalized to repre-sent the Bernal site GPP process probably because the vari-ables and features selected for Bernal or Santa Rita were dif-ferent during the AutoML workflow

The Santa Rita model was good at predicting GPP at thatsite with a deviance of the leader model of 023 while atBernal the deviance of the leader model was 063 (Table 3)indicating that the Santa Rita EML ensemble was at leastas good a model as the EML at Bernal (not shown) TheGPP time series for Santa Rita was about 4 times the sizeof the Bernal dataset and therefore the deviance was lowerHowever when the Santa Rita model was used with Bernaldata the mean θs was 016 indicating that the agreement wasinsufficient Eyeballing the predictions in Figs 9a 10b and

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 381

11a and their corresponding correlation values (078 076and 082 for adjusted EML Santa Rita adjusted MODIS andmultiple-regression OLS) it could be argued that these mod-els were comparable However their θs plots present a dif-ferent perspective

4 Discussion

41 Model agreement

The Santa Rita model had higher θs when it extrapolatedfor low GPP values for the Bernal site suggesting that theSanta Rita model had a better skill when predicting GPPclose to 0 and even negative GPP values (Fig 10b) Al-though the Bernal and Santa Rita sites had similar vegeta-tion and climate classifications they are more than 1600 kmapart and rainfall monthly distribution is different The GPPseasonal cycle at Bernal started in February and steadily in-creased to a maximum during July and August (Table 1)At Santa Rita the GPP was low from January until mid-July (lt 05 gCO2 mminus2 dminus1) and then increased sharply to amaximum level in mid-August (Joiner and Yoshida 2020)What matters eventually for machine learning methods ishow well the predictor space rather than geographic spaceis sampled (Jung et al 2020) Incorporating data from morehumid (semiarid) sites could improve the GPP predictions bya machine learning method Not only more EC sites but alsosites representing a water availability gradient would be im-portant for semiarid ecosystems representing long spells andthe influence of oceanic oscillations and monsoon rains

All the models presented here used transient data to rep-resent GPP specifically at the 1 d time step Besides theradiation-related variables two sources of rainfall were usedas forcing variables but evapotranspiration (ET) was notused The MOD16A2 version 6 is available as an 8 d gap-filled product and could be included in EML or OLS mod-els However a more general representation of the carboncycling could be achieved when including variables that rep-resent annual or seasonal timescales of soil water evapo-ration or precipitation (Scott and Biederman 2019) Scottet al (2015) suggest that real lags between precipitation andproductivity that may impart legacy effects may also be par-tially masked by using ET as ET more carefully tracks pro-ductivity when soil moisture storage is accessed The occur-rence of off-season rainfall dry spells and carryover effectscould be parameterized as windowed events of a given dura-tion A window would be a period with distinct time bound-aries the window allows the grouping of records with similarfeatures The effect of the window at a given time could berepresented as moving weights as the point in time in ques-tion is closer or farther from the window

In our study the metric to assess model agreement wasthe probability of agreement (θs) and their bias parame-ters Many other metrics can be used to evaluate model per-

formance such as the root mean square error r r2 or themodel efficiency factor (MEF) presented by Nash and Sut-cliffe (1970) In particular the MEF is a step in the Fluxnetdata processing pipeline (Pastorello et al 2020) A MEFvalue close to 1 represents a high correlation and lower bi-ases (Joiner and Yoshida 2020) Therefore the calculatedMEF for the Bernal site EML (098) would suggest very goodpredictive performance while the θs of 06 for this model in-dicates a more modest performance The θs for a particularvalue of the measurand is the probability that the differencebetween two measurements made by different systems fallswithin an interval that is deemed to be acceptable (Stevenset al 2015) In the present study we used 1 gCmminus2 dminus1 as acritical value defining an acceptable difference If this valueis smaller then the probability of agreement would decreasefor this same model In such a case it would be less likelythat the predictors agree considering that the θs takes intoaccount both the difference in the function means at a givenvalue of the measurand and the uncertainty in its estimation

42 MODIS discrepancies

Different authors have reported discrepancies betweenMCD17 and EC estimates of GPP in semiarid regions Ex-amining MODIS discrepancies in these ecosystems is im-portant because the errors induced by cloud cover are ex-pected to be minimal and other effects can be identified(Gebremichael and Barros 2006) The GPP of MOD17 didnot relate well (EC= 011+ 017MODIS r2

= 067) withestimates of EC GPP in semidesert vegetation of the Sahel(Tagesson et al 2017) With data from different types of veg-etation in the Heihe basin in China MODIS17 overestimatedthe GPP from EC (EC= 115+024MODIS r2

= 068 Cuiet al 2016) For scrub sites in Mexico the relation be-tween GPP calculated from EC and MOD17 was not good(MODIS= 38382+ 0467EC r2

= 06 Delgado-Balbuenaet al 2018) In arid and semiarid ecosystems in China op-timizing parameters of the MODIS GPP model with site-specific data improved the estimate to explain 91 of thevariation in the GPP of the data observed by EC (Wang et al2019) These same authors propose improving the land useclassification used by the MOD17 algorithm and recalibrat-ing light use efficiency parameters to solve the GPP estima-tion problem Gebremichael and Barros (2006) examined anopen shrubland site in a semiarid region of Sonora Mexicoand their analysis of the temporal evolution of the discrep-ancies with MODIS GPP suggested revisiting the light useefficiency parameterization especially the functional depen-dence on VPD and PAR and water stress or soil moistureavailability

The relationship between the GPP MODIS and the GPPEC presented in Sect 33 is an approximation because theuncertainty in the respiration component must be consid-ered The empirical relationship between nocturnal NEE andsoil temperature has been used to represent ecosystem res-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

382 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

piration (Reco) in order to separate the processes that con-tribute to daytime NEE (Richardson and Hollinger 2005Wofsy et al 1993) Nighttime NEE should be equal to therates of autotrophic and heterotrophic respiration while dur-ing daytime NEE should be equal to the combined rates ofcarboxylation and oxidation of Rubisco autotrophic respi-ration and heterotrophic respiration Then the GPP can becalculated as the difference between daytime NEE and Recoestimated through its relationship with temperature (Gouldenet al 1996) In the present study Reco was calculated basedon soil and air temperature following the procedure of Reich-stein et al (2005) implemented in REddyProc (Wutzler et al2018) Although it is possible to measure or model the parti-tion of respiration (Running et al 2004 Wang et al 2018)the presence of cacti complicates the calculation assum-ing that all nighttime flux represents ecosystem respiration(Owen et al 2016 Richardson and Hollinger 2005) Whilesoil respiration tends to be temperature-limited when soilmoisture is nonlimiting in temperate ecosystems in range-land ecosystems the controls of soil CO2 efflux were photo-synthesis soil temperature and moisture (Roby et al 2019)In our study the instrumentation did not include measure-ments of plant or soil respiration partition to validate theRecoestimates

A problem regarding data comparison from remote orbitalsensors and terrestrial observations is that different quanti-ties are fundamentally measured MODIS measures the radi-ation reflected by the earthrsquos surface in two spectral bands at250 m spatial resolution per pixel five bands at 500 m and29 bands at 1000 m The EC technique has a footprint tomeasure CO2 that varies dynamically in shape and size butis generally considered to be 1 km2 in this study the foot-print radius was 600 m To solve the scaling MODIS prod-ucts related to the carbon cycle have been validated with theEC technique and biometric measurements on several spa-tial scales using process-based ecosystem models and char-acterizing areas up to 47 km2 around the EC measuring tower(Cohen et al 2003)

At Bernal the vegetation was heterogeneous This situa-tion was represented by the heterogeneity in vegetation ac-tivity in the 4 pixels used spanning about 02 units of NDVIat the peak of the season activity (Fig 1) The standard errorof the IVI differed by 1 order of magnitude among speciesAlthough more important species had a lower standard errortheir means were similar indicating that they were equallyabundant at all sampling plots The higher importance ofsome species (Table 2) was explained by selective grazingndashbrowsing behavior and the dispersion caused by cattle by ei-ther ingesting or transporting seeds or plant parts (Belaynehand Tessema 2017) Regarding landscape heterogeneity thetower fetch was predominantly from the northeast capturingthe less heterogeneous area of the site but also the more ac-tive according to the NDVI (Fig 1)

The thorny scrub examined had two vegetation layersthe overstory layer mainly consisting of mesquite acacia

and cacti and the understory layer that included grasses andherbs Cattle preferentially graze the understory and becausethey eat using their tongue they will avoid browsing thornyspecies unlike goats or deer which use their lips Withoutgrazing management overtime the competition balance willfavor bush species resulting in encroaching and the under-story will be stressed only unpalatable species or those withtheir growing meristems very close to the ground would sur-vive Representing the structure and functioning of these twolayers using MODIS is possible (Liu et al 2017) RecentlyHill and Guerschman (2020) presented a MODIS product de-rived from MCD43A4 to estimate the fractions of photosyn-thetic and nonphotosynthetic vegetation and the remainingfraction of bare soil These developments could improve theMOD17 GPP estimates since its model represents a homoge-neous single vegetation layer All these considerations helpto understand the low θs between MOD17 estimates of GPPand EC-derived GPP

43 Forcing variables and machine learning

Forcing variables in this study were gridded meteorologi-cal data and MODIS spectral bands and products but thevariables mostly included in the ML algorithms were LAIand FPAR accounting for 364 and 549 of the vari-able importance index at Bernal and Santa Rita respectivelyin their leader board This result supported the view thatCO2 fluxes can be represented by ML algorithms exclu-sively using remotely sensed data (Tramontana et al 2016Joiner and Yoshida 2020) Using neural networks Joiner andYoshida (2020) showed that with only satellite reflectancefrom MODIS and top-of-atmosphere PAR it is possible tocapture a large fraction of GPP variability They argue thatvegetation indices may reduce the information content ofthe underlying reflectance when compressing the informa-tion from two or more bands into a single index In our studyreflectance bands were not often included in the ML modelsThis was most likely because MOD09GA and MODOCGAprovide estimates of surface reflectance uncorrected for theillumination and viewing geometry while the MCD43 usedin Joiner and Yoshida (2020) uses a surface bidirectional re-flectance distribution function (BRDF) model that improvesthe quality of surface albedo retrievals By propagating theBRDF correction in the MODIS processing pipeline the veg-etation indices would more likely relate better to the GPPFuture modeling efforts should use benchmark scenarios ac-cording to sets of forcing variables already identified as use-ful

44 Carbon flux

Although the carbon balance in ecosystems is influenced bydifferent factors such as soil type and amount of nutrientsthe relationship with soil temperature and humidity is partic-ularly strong (Anderson-Teixeira et al 2011 Hastings et al

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 383

2005) How much of the rainwater the system can retain orlose has been described as the leakiness of the system (Guer-schman et al 2009) More than immediate incident rainfallthe available soil moisture and its redistribution are impor-tant in semiarid ecosystems including steam flow prefer-ential flow paths hydraulic lift and others (Barron-Gaffordet al 2017) At Bernal when the sum of the precipitation ofthe current month and that of the previous month was con-sidered the correlation with NEE was minus07 suggesting thatcontinuous availability of soil moisture is important for theabsorption of CO2 in this environment This result is con-sistent with other studies in which the relationship betweenthe net productivity of the ecosystem (NEP) and precipita-tion is initially positive but is leveled from 1000 to 1500 mmannually (Xu et al 2014) The hydraulic redistribution ofwater from moist (deeper) to drier soils through plant rootstended to increase modeled annual ecosystem uptake of CO2this process was identified at US-SMR (Fu et al 2018 Scottet al 2003)

The leakiness is highly dependent upon vegetation frac-tional cover the proportion of the surface occupied by baresoil and vegetation photosynthetically active vegetation andnonphotosynthetically active vegetation such as litter woodand dead biomass (Guerschman et al 2009) All these vege-tation fractions have a water storage capacity and can reducethe amount of effective rainfall available to plant roots It ispossible that the canopy of the bushes completely interceptedthe rainfall in some months because the scrub can interceptup to 20 of the precipitation and its canopy storage ca-pacity is 097 mm (Mastachi-Loza et al 2010) Consideringonly the daily rain events greater than 5 mm the correlationbetween precipitation and NEE rose to minus072 In the presentstudy the interception of rain by vegetation surfaces was notcalculated but the results suggest that it would be impor-tant to explore the relationship between net precipitation andNEE

The average NEE at a global level is minus156plusmn284 gCmminus2 yrminus1 (Baldocchi 2014) The highest frequencyamong sites that measured NEE with EC occurs from minus200to minus300 gCmminus2 yrminus1 but in sites with biometric measure-ments the peak occurs at minus100 gCmminus2 yrminus1 (Xu et al2014) Using the daily averages of Table 2 the averageNEE during the measurement period was minus078 gCmminus2 dminus1

and annually would be minus2835 gCmminus2 yrminus1 This resultwas higher than the annual values of the induced grass-land and scrubland vegetation characterizing the SonoranDesert plains (138 and 130 gCmminus2 yrminus1 Hinojo-Hinojoet al 2019) In New Mexico NEE values measured withEC are between 35ndash50 gCmminus2 yrminus1 in desert grassland and344ndash355 gCmminus2 yrminus1 in mixed coniferous forest (Anderson-Teixeira et al 2011) In a dryer region the sarcocaulescentscrubland of Baja California in Mexico the NEE was minus39and minus52 gCmminus2 yrminus1 in 2002 and 2003 respectively (Hast-ings et al 2005) The NEE measured here was within therange of NEE 03plusmn 02 kgCmminus2 yrminus1 for grasslands and

shrublands in Mexico (Murray-Tortarolo et al 2016) Al-though the measurements of the present study had gaps andwere compared with annual studies we considered that thereported value of C was representative of the main season ofgrowth of this type of scrub

45 Ecosystem management

Overgrazing is an appreciation relative to the grazing produc-tive system where the forage resource is overused in a mixedshrubndashgrass ecosystem such as Bernal it usually refers tothe understory Overgrazing means that the plant regrowth isreadily grazed tillers and root reserves are lost and eventu-ally the plant may die Although the Bernal site was over-grazed the carbon fluxes indicated that the plant communitywas photosynthetically active in both dry and rainy seasonsIt is fair to assume that water in the soil was not limiting forthe deep-rooted bush species and that was the reason whyit was possible to maintain the photosynthetic function dur-ing rainless months However this primary production wouldnot have tangible benefits for ranchersrsquo production systemsince no edible biomass would be produced for the cattleFrom the point of view of carbon capture the system accu-mulated nonlabile biomass that would remain in the systemfor a longer time compared to a grassland ecosystem (al-though it would be necessary to determine the partition ofsaid shrub biomass) However the overgrazing condition af-fects the biomass of the understory roots and consequentlythe carbon pool in the soil

In the short term it can be thought that the estimated neg-ative carbon flows had a favorable effect on the environmen-tal agenda As time passes it is possible that the gaps be-tween the individual shrubs of the overstory expand and thiswould have had an effect on soil erosion It is also possiblethat the water stored in the soil profile used by the bushesgradually decreases to the point of causing drought changesin phenology and advancements of the desertification pro-cess There are many opportunities for ecology conservationand livestock-oriented management these may include con-trolled grazing or propagating native thornless shrub speciesIf ranchers do not identify a benefit in the vegetation thenthey will be tempted to remove it as it occurred at thestudy site Because of its wide coverage and availability theMODIS GPP product accuracy is important in representingthe carbon cycle raising awareness and monitoring advance-ment of environmental decisions

Although we found that EML was a good option for mod-eling the GPP of a site what is really needed to evaluate theperformance of semiarid ecosystems is a spatial representa-tion of the carbon flux This is a problem for an underrepre-sented area regarding instrumented EC towers However theEML could be designed to take into account the explanatoryvariables in a spatiotemporal continuity

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

384 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

5 Conclusions

The best modeling approach was the ensemble of ma-chine learning The second option to estimate GPP was themultiple-regression OLS and the third alternative were theMODIS estimates Machine learning was a good option topredict GPP in the local context where it was generated oth-erwise its performance was not good This was demonstratedwhen using the EML from Santa Rita and trying to predictfluxes at Bernal Nevertheless a machine learning modelwould be useful for gap filling or the evaluation of GPP atthe same site The GPP estimates of a given model can be ad-justed using the bias parameters of the probability of agree-ment to improve the relationship Further research explor-ing the usefulness of EML to predict GPP in other contextscould use EML to calibrate mechanistic model parametershybrid approaches integrating EML and biologically mean-ingful mechanistic models or EML representing the spatialvariability of the landscape

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 385

Appendix A

Table A1 Threshold of velocity friction (ulowast) above which nighttime fluxes were considered valid Estimates were obtained using the Rpackage REddyProc We used only the flux records with ulowast equal to or higher than the corresponding mean ulowast threshold for each yearBound values are a 95 confidence interval NA not applicable

Aggregation method Year Season Mean Lower bound 5 Upper bound 95

ulowast (msminus1)

Single NA NA 0193 0140 0209Year 2017 NA 0194 0128 0227Year 2018 NA 0193 0132 0225Season 2017 2016012 0194 0128 0227Season 2017 2017003 0215 0141 0297Season 2017 2017006 0194 0128 0227Season 2017 2017009 0181 0144 0208Season 2018 2017012 0149 0116 0232Season 2018 2018003 0271 0134 0297Season 2018 2018006 0193 0132 0221

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

386 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix B

Table B1 MODIS reflectance bands and products database

Product name Satellite layer Spatial Temporal Spectralresolution resolution coverage

(d) (nm)

MOD09GA Band 1 500 m 1 620ndash670Band 2 841ndash876Band 3 459ndash479Band 4 545ndash565Band 5 1230ndash1250Band 6 1628ndash1652Band 7 2105ndash2155

MODOCGA Band 8 1 km 405ndash420Band 9 438ndash448Band 10 483ndash493Band 11 526ndash536Band 12 546ndash556Band 13 662ndash672Band 14 673ndash683

MOD17A2H Gross primary production (GPP) 8

Net photosynthesis (PsnNet)

MOD15A2H Fraction of photosynthetically active radiation (Fpar) na

Leaf area index (LAI)

MOD11A2 Land surface temperature and emissivity of day (LST day) 1 km

Land surface temperature and emissivity of night (LST night)

MYD13A1 Enhanced vegetation index (EVI) 500 m 16

MYD15A2H Fraction of photosynthetically active radiation (Fpar) 8

Leaf area index (LAI)

MYD11A2 Land surface temperature and emissivity of day (LST day) 1 km na

Land surface temperature and emissivity of night (LST night)

MCD15A2H Fraction of photosynthetically active radiation (Fpar) 500 m

Leaf area index (LAI)

na not applicable

Table B2 Daily vegetation indexes computed using the MODIS reflectance bands described in Table 1

Index Formula Reference

Simple ratio (SimpleR) SimpleR= Band 2Band 1

(Hill et al 2006)Moisture stress (MoistS) MoistS= Band 6

Band 2

Disease stress index (DSI) DSI= Band 2+Band 4Band 6+Band 1

Redndashgreen ratio index (RGRI) RGRI= Band 1Band 4

Normalized difference vegetation index (NDVI) NDVI= Band 2minusBand 1Band 2+Band 1

Normalized difference water index (NDVI_w) NDVI_w =Band 2minusBand 5Band 2+Band 5

Green leaf index (GLI) GLI= 2lowastBand 11minusBand 14minusBand 92lowastBand 11+Band 14+Band

(Henrich et al 2012)Green atmospherically resistant vegetation (GARI) GARI= Band 5minus(Band 11minus(Band 9minusBand 14))

Band 5minus(Band 11+(Band 9minusBand 14))

Enhanced vegetation index (EVI) EVI= 25lowast Band 2minusBand 1Band 2+(6lowastBand 1)minus(75lowastBand 9)+1

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 387

Table B3 Daymet meteorological database

Variable Spatial resolution Temporal resolution Reference

Precipitation (Dayprc)

1 km Daily (Thornton et al 2017)Shortwave radiation (Daysrad)Maximum air temperature (DayTmax)Minimum air temperature (DayTmin)Water vapor pressure (Dayvp)

Table B4 Precipitation data

Satellite Product name Spatial resolution Temporal resolution

Global Precipitation Measurement (GPM) 3IMERGDF v006 010Daily

Tropical Rainfall Measuring Mission (TRMM) 3B42 v007 025

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

388 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix C

Table C1 Analysis of variance for GGP derived from EC data atthe Bernal site Variable details are described in Appendix B

Model term Coefficient SEa DFb SSc F p VIFd

Regression 14 178048Constant minus1318 648 minus203 004R_08_40542e

minus212 060 1 2680 minus352 0 220R_13_66267e

minus286 123 1 1161 minus232 002 155R_14_67368e 917 259 1 2706 353 0 122RGRI6

minus083 037 1 1100 minus225 003 143GARIf 201 108 1 755 187 006 237EVIMYDg 853 201 1 3903 425 0 294NDVI_wf 390 155 1 1367 251 001 271PsnNetg 0013 0 1 7530 590 0 383LstNgtMYDg 005 002 1 818 194 005 201Dayprch 014 003 1 3590 407 0 192Daysradh 001 0 1 5121 486 0 501DayTmaxh

minus024 006 1 3658 minus411 0 689DayTminh 018 005 1 3222 386 0 496TRMMi minus004 002 1 1087 minus224 003 141Error 403 87297Total 417 265345

a Standard error of coefficient b Degrees of freedom c Adjusted sum of squares d Variance inflation factore Variable name denotes the band number and spectral bandwidth of MODIS (Moderate Resolution ImagingSpectroradiometer) f Vegetation indices RGRI is redndashgreen ratio index and GARI is green atmosphericallyresistant vegetation index details of formula are described in Appendix B g Layers from MODIS productsEVIMYD is enhanced vegetation index from MYD13 PsnNet is photosynthesis product form MOD17LstNgtMYD is nighttime land surface temperature emissivity from MYD11 h Variables obtained fromDaymet daily dataset DayTmin is minimum temperature DayTmax is maximum temperature Daysrad isshortwave radiation Dayprc is precipitation i Daily rainfall rate from 3B42 TRMM (Tropical RainfallMeasuring Mission)

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 389

Code and data availability Database and programming code areavailable at httpsdoiorg105281zenodo3598595 (Guevara-Escobar et al 2020)

Author contributions AGE contributed to conceptualization for-mal analysis investigation methodology software supervisionvalidation visualization and writing of the original draft as wellas review and editing EGS was in charge of funding acquisitionproject administration and writing and review MCJ contributed toinvestigation methodology validation visualization and writing ofthe original draft as well as review editing and communicationMEQB helped with investigation validation writing and reviewHSA contributed with conceptualization writing and review ICAacutehelped with investigation writing and review VHCS contributedwith writing review and funding acquisition

Competing interests The authors declare that they have no conflictof interest

Acknowledgement We thank NASA and Ameriflux for data avail-ability Data filtering was carried out by the cluster Abacus ICentre of Applied Mathematics and High Performance ComputingABACUS-CINVESTAV

Financial support This research has been supported by CONA-CyT ndash SEMARNAT (grant no 2014-1-(249407))

Review statement This paper was edited by Paul Stoy and re-viewed by three anonymous referees

References

Aguirre-Diacuteaz G J Aguilloacuten-Robles A Tristaacuten-Gonzaacutelez MLabarthe-Hernaacutendez G Loacutepez-Martiacutenez M Bellon Hand Nieto-Obregoacuten J Geologic setting of the Pentildeade Bernal Natural Monument Quereacutetaro Meacutexico Anendogenous volcanic dome Geosphere 9 557ndash571httpsdoiorg101130GES008431 2013

Anav A Friedlingstein P Beer C Ciais P Harper AJones C Murray-Tortarolo G Papale D Parazoo N Cand Peylin P Spatiotemporal patterns of terrestrial grossprimary production A review Rev Geophys 53 785ndash818httpsdoiorg1010022015RG000483 2015

Anderson-Teixeira K J Delong J P Fox A M Brese D Aand Litvak M E Differential responses of production and res-piration to temperature and moisture drive the carbon balanceacross a climatic gradient in New Mexico Glob Change Biol17 410ndash424 httpsdoiorg101111j1365-2486201002269x2011

Baldocchi D Measuring fluxes of trace gases and energy be-tween ecosystems and the atmospherendashthe state and future of the

eddy covariance method Glob Change Biol 20 3600ndash3609httpsdoiorg101111gcb12649 2014

Barron-Gafford G A Sanchez-Cantildeete E P Minor R LHendryx S M Lee E Sutter L F Tran N Parra EColella T Murphy P C Hamerlynck E P Kumar P andScott R L Impacts of hydraulic redistribution on grassndashtreecompetition vs facilitation in a semi-arid savanna New Phytol215 1451ndash1461 httpsdoiorg101111nph14693 2017

Belayneh A and Tessema Z K Mechanisms of bush encroach-ment and its inter-connection with rangeland degradation insemi-arid African ecosystems a review J Arid Land 9 299ndash312 httpsdoiorg101007s40333-016-0023-x 2017

Bland J M and Altman D G Statistical methods for assessingagreement between two methods of clinical measurement Int JNurs Stud 47 931ndash936 2010

Boettiger C and Eddelbuettel D An introduction to rockerDocker containers 639 for r ArXiv Prepr ArXiv171003675 6402017

Bonilla-Moheno M and Aide T M Beyond deforestationLand cover transitions in Mexico Agric Syst 178 102734httpsdoiorg101016jagsy2019102734 2020

Booker K Huntsinger L Bartolome J W Sayre N Fand Stewart W What can ecological science tell us aboutopportunities for carbon sequestration on arid rangelands inthe United States Glob Environ Change 23 240ndash251httpsdoiorg101016jgloenvcha201210001 2013

Butzer K W and Butzer E K The lsquonaturalrsquovegetation ofthe Mexican Bajiacuteo archival documentation of a 16th-century savanna environment Quat Int 43 161ndash172httpsdoiorg101016S1040-6182(97)00032-3 1997

CICESE C de I C y de E S de E Base de datos climatoloacutegicanacional (CLICOM) Villa Bernal Quereacutetaro available at httpclicom-mexcicesemx (last access 28 August 2019) 2015

Cohen W B Maiersperger T K Yang Z Gower S TTurner D P Ritts W D Berterretche M and Run-ning S W Comparisons of land cover and LAI esti-mates derived from ETM+ and MODIS for four sites inNorth America A quality assessment of 20002001 provi-sional MODIS products Remote Sens Environ 88 233ndash255httpsdoiorg101016jrse200306006 2003

Cui T Wang Y Sun R Qiao C Fan W Jiang GHao L and Zhang L Estimating vegetation primaryproduction in the Heihe River Basin of China withmulti-source and multi-scale data PloS One 11 1ndash20httpsdoiorg101371journalpone0153971 2016

Curtis J T and McIntosh R P The Interrelations of Certain An-alytic and Synthetic Phytosociological Characters Ecology 31434ndash455 httpsdoiorg1023071931497 1950

Delgado-Balbuena J Yepez E A Aacutengeles-Peacuterez G Aguirre-Gutierrez C Arredondo T Ayala-Nintildeo F Bullock S HCastellanos A E Cueva A Figueroa-Espinoza B Garatuza-Payaacuten J Hinojo-Hinojo C Maya-Delgado Y Meacutendez-Barroso L Oechel W Paz-Pellat F Peacuterez-Ruiacutez E R Ro-driacuteguez J C Rojas-Robles N Saacutenchez-Mejiacutea Z M Uuh-Sonda J Vargas R Verduzco V S Vivoni E R andWatts C Flujos anuales de carbono en ecosistemas terrestres deMeacutexico in Estado Actual del Conocimiento del Ciclo del Car-bono y sus Interacciones en Meacutexico Siacutentesis a 2018 p 686Texcoco Estado de Meacutexico Meacutexico 2018

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

390 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Eshel G Dayalu A Wofsy S C Munger J W andTziperman E Listening to the Forest An ArtificialNeural Network-Based Model of Carbon Uptake at Har-vard Forest J Geophys Res-Biogeo 124 461ndash478httpsdoiorg1010292018JG004791 2019

Fu C Wang G Bible K Goulden M L Saleska S RScott R L and Cardon Z G Hydraulic redistributionaffects modeled carbon cycling via soil microbial activityand suppressed fire Glob Change Biol 24 3472ndash3485httpsdoiorg101111gcb14164 2018

Gebremichael M and Barros A P Evaluation of MODISGross Primary Productivity (GPP) in tropical mon-soon regions Remote Sens Environ 100 150ndash166httpsdoiorg101016jrse200510009 2006

Gitelson A A Peng Y Arkebauer T J and Suyker A EProductivity absorbed photosynthetically active radiation andlight use efficiency in crops Implications for remote sensingof crop primary production J Plant Physiol 177 100ndash109httpsdoiorg101016jjplph201412015 2015

Glenn E P Scott R L Nguyen U and Nagler P L Wide-area ratios of evapotranspiration to precipitation in monsoon-dependent semiarid vegetation communities J Arid Envi-ron 117 84ndash95 httpsdoiorg101016jjaridenv2015020102015

Goldstein A Turner W R Spawn S A Anderson-Teixeira K J Cook-Patton S Fargione J Gibbs H KGriscom B Hewson J H and Howard J F Protecting ir-recoverable carbon in Earthrsquos ecosystems Nat Clim Change10 1ndash9 httpsdoiorg101038s41558-020-0738-8 2020

Goulden M L Munger J W Fan S Daube B Cand Wofsy S C Measurements of carbon sequestra-tion by long-term eddy covariance Methods and a criti-cal evaluation of accuracy Glob Change Biol 2 169ndash182httpsdoiorg101111j1365-24861996tb00070x 1996

Grossiord C Buckley T N Cernusak L A Novick K A Poul-ter B Siegwolf R T Sperry J S and McDowell N GPlant responses to rising vapor pressure deficit New Phytol226 1550ndash1566 2020

Guerschman J P Hill M J Renzullo L J Barrett D JMarks A S and Botha E J Estimating fractional coverof photosynthetic vegetation non-photosynthetic vegetation andbare soil in the Australian tropical savanna region upscaling theEO-1 Hyperion and MODIS sensors Remote Sens Environ113 928ndash945 httpsdoiorg101016jrse200901006 2009

Guevara-Escobar A Gonzaacutelez-Sosa E Cervantes-Jimeacutenez MSuzaacuten-Azpiri H Queijeiro-Bolantildeos M E Carrillo-AacutengelesI and Cambroacuten-Sandoval V H Machine learning estimates ofeddy covariance carbon flux in a scrub in the Mexican highlandZenodo httpsdoiorg105281zenodo3598595 2020

Hall P Gill N Kurka M Phan W and Bartz A MachineLearning Interpretability with H2O Driverless AI edited byBartz A H2Oai Inc California US 2019

Hastings S J Oechel W C and Muhlia-Melo A Diur-nal seasonal and annual variation in the net ecosystem CO2exchange of a desert shrub community (Sarcocaulescent) inBaja California Mexico Glob Change Biol 11 927ndash939httpsdoiorg101111j1365-2486200500951x 2005

Henrich V Krauss G Goumltze C and Sandow C Entwicklungeiner Datenbank fuumlr Fernerkundungsindizes vol 45 AK Fern-erkundung Bochum 2012

Hill M Held A Leuning R Coops N Hughes D andCleugh H MODIS spectral signals at a flux tower site Rela-tionships with high-resolution data and CO2 flux and light useefficiency measurements Remote Sens Environ 103 351ndash368httpsdoiorg101016jrse200506015 2006

Hill M J and Guerschman J P The MODIS Global VegetationFractional Cover Product 2001ndash2018 Characteristics of Vegeta-tion Fractional Cover in Grasslands and Savanna Woodlands Re-mote Sens 12 406 httpsdoiorg103390rs12030406 2020

Hinojo-Hinojo C Castellanos A E Huxman T Ro-driguez J C Vargas R Romo-Leoacuten J R and Bie-derman J A Native shrubland and managed buffelgrasssavanna in drylands Implications for ecosystem carbonand water fluxes Agr Forest Meteorol 268 269ndash278httpsdoiorg101016jagrformet201901030 2019

Jaumlrvi L Havu M Ward H C Bellucco V McFad-den J P Toivonen T Heikinheimo V Kolari P Ri-ikonen A and Grimmond C S B Spatial modeling oflocal-scale biogenic and anthropogenic carbon dioxide emis-sions in Helsinki J Geophys Res-Atmos 124 8363ndash8384httpsdoiorg1010292018JD029576 2019

Joiner J and Yoshida Y Satellite-based reflectances capture largefraction of variability in global gross primary production (GPP)at weekly time scales Agr Forest Meteorol 291 108092httpsdoiorg101016jagrformet2020108092 2020

Jung M Schwalm C Migliavacca M Walther S Camps-VallsG Koirala S Anthoni P Besnard S Bodesheim P Carval-hais N Chevallier F Gans F Goll D S Haverd V KoumlhlerP Ichii K Jain A K Liu J Lombardozzi D Nabel J EM S Nelson J A OrsquoSullivan M Pallandt M Papale DPeters W Pongratz J Roumldenbeck C Sitch S TramontanaG Walker A Weber U and Reichstein M Scaling carbonfluxes from eddy covariance sites to globe synthesis and eval-uation of the FLUXCOM approach Biogeosciences 17 1343ndash1365 httpsdoiorg105194bg-17-1343-2020 2020

Kang M Ichii K Kim J Indrawati Y M Park J Moon MLim J-H and Chun J-H New Gap-Filling Strategies forLong-Period Flux Data Gaps Using a Data-Driven ApproachAtmosphere 10 568 httpsdoiorg103390atmos101005682019

Lal R Griffin M Apt J Lave L and Morgan M GManaging Soil Carbon Science 304 393ndash393httpsdoiorg101126science1093079 2004

LI-COR B Eddypro Software instruction manual Nebraska USversion 70 2019

Liu Y Liu R Pisek J and Chen J M Separating overstoryand understory leaf area indices for global needleleaf and de-ciduous broadleaf forests by fusion of MODIS and MISR dataBiogeosciences 14 1093ndash1110 httpsdoiorg105194bg-14-1093-2017 2017

Long S P Marshall-Colon A and Zhu X-G Meetingthe Global Food Demand of the Future by EngineeringCrop Photosynthesis and Yield Potential Cell 161 56ndash66httpsdoiorg101016jcell201503019 2015

Ma X Huete A Yu Q Restrepo-Coupe N Beringer J Hut-ley L B Kanniah K D Cleverly J and Eamus D Param-

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 391

eterization of an ecosystem light-use-efficiency model for pre-dicting savanna GPP using MODIS EVI Remote Sens Environ154 253ndash271 2014

Ma X Migliavacca M Wirth C Bohn F J Huth ARichter R and Mahecha M D Monitoring Plant FunctionalDiversity Using the Reflectance and Echo from Space RemoteSens 12 1248 httpsdoiorg103390rs12081248 2020

Marcolla B Roumldenbeck C and Cescatti A Patterns and con-trols of inter-annual variability in the terrestrial carbon budgetBiogeosciences 14 3815ndash3829 httpsdoiorg105194bg-14-3815-2017 2017

Mastachi-Loza C A Gonzaacutelez-Sosa E Becerril-Pintildea R andBraud I Peacuterdidas por intercepcioacuten en mezquite (Prosopis laevi-gata) y huizache (Acacia farnesiana) de la regioacuten semiaacuterida delcentro de Meacutexico Tecnol Cienc Agua 1 103ndash120 2010

Mauder M and Foken T Documentation and instruction man-ual of the eddy-covariance software package TK3 UniversitaumltBayreuth Abteilung Mikrometeorologie Bayreuth Germany2011

McClaran M A century of vegetation change on the Santa Rita Ex-perimental Range in Santa Rita Experimental Range 100 years(1903 to 2003) to accomplishments and contributions edited byMcClaran M P Ffolliott P F and Edminister C B Tech Co-ords USDA Forest service proceedings RMRS-P-30 TucsonAZ US 16ndash33 2003

McCulley R Jobbagy E Pockman W and Jackson R Nu-trient uptake as a contributing explanation for deep rootingin arid and semi-arid ecosystems Oecologia 141 620ndash628httpsdoiorg101007s00442-004-1687-z 2004

McElreath R Statistical Rethinking A Bayesian Course with Ex-amples in R and STAN CRC Press Boca Raton FL US 2020

Michailidis M How Driverless AI Prevents Overfitting and Leak-age Open Source Lead AI ML available at httpswwwh2oaiblogdriverless-ai-prevents-overfitting-leakage (last access7 July 2020) 2018

Mitchell R and Frank E Accelerating the XGBoost algo-rithm using GPU computing PeerJ Comput Sci 3 e127httpsdoiorg107717peerj-cs127 2017

Monteith J Solar radiation and productivity in tropical ecosys-tems J Appl Ecol 9 747ndash766 1972

Murray-Tortarolo G Friedlingstein P Sitch S Jaramillo V JMurguiacutea-Flores F Anav A Liu Y Arneth A Arvanitis AHarper A Jain A Kato E Koven C Poulter B Stocker BD Wiltshire A Zaehle S and Zeng N The carbon cycle inMexico past present and future of C stocks and fluxes Biogeo-sciences 13 223ndash238 httpsdoiorg105194bg-13-223-20162016

Nash J E and Sutcliffe J V River flow forecasting through con-ceptual models part I ndash A discussion of principles J Hydrol 10282ndash290 1970

Owen N A Choncubhair Oacute N Males J del Real Laborde J IRubio-Corteacutes R Griffiths H and Lanigan G Eddy covari-ance captures four-phase crassulacean acid metabolism (CAM)gas exchange signature in Agave Plant Cell Environ 39 295ndash309 httpsdoiorg101111pce12610 2016

Pasetto D Arenas-Castro S Bustamante J Casagrandi RChrysoulakis N Cord A F Dittrich A Domingo-Marimon C El Serafy G and Karnieli A Integration ofsatellite remote sensing data in ecosystem modelling at local

scales Practices and trends Methods Ecol Evol 9 1810ndash1821httpsdoiorg1011112041-210X13018 2018

Pastorello G Trotta C and Canfora E The FLUXNET2015dataset and the ONEFlux processing pipeline for eddy covari-ance data Sci Data 7 225 pp httpsdoiorg101038s41597-020-0534-3 2020

Pearcy R W Heterogeneous light environments in Handbookof Functional Plant Ecology edited by Pugnaire I F and Val-ladares F CRC Press Boca Ratoacuten FL US 270ndash314 2007

Piao S Liu Q Chen A Janssens I A Fu Y Dai J Liu LLian X Shen M and Zhu X Plant phenology and global cli-mate change Current progresses and challenges Glob ChangeBiol 25 1922ndash1940 httpsdoiorg101111gcb14619 2019

R Development Core Team R A language and environment forstatistical computing R Foundation for Statistical Computingavailable at httpswwwr-projectorg (last access 15 October2015) 2009

Reichstein M Falge E Baldocchi D Papale D Aubinet MBerbigier P Bernhofer C Buchmann N Gilmanov Tand Granier A On the separation of net ecosystem ex-change into assimilation and ecosystem respiration reviewand improved algorithm Glob Change Biol 11 1424ndash1439httpsdoiorg101111j1365-24862005001002x 2005

Richardson A D and Hollinger D Y Statistical model-ing of ecosystem respiration using eddy covariance datamaximum likelihood parameter estimation and Monte Carlosimulation of model and parameter uncertainty applied tothree simple models Agr Forest Meteorol 131 191ndash208httpsdoiorg101016jagrformet200505008 2005

Richardson A D Hollinger D Y Shoemaker J K Hughes HSavage K and Davidson E A Six years of ecosystem-atmosphere greenhouse gas fluxes measured in a sub-borealforest Sci Data 6 1ndash15 httpsdoiorg101038s41597-019-0119-1 2019

Roby M C Scott R L Barron-Gafford G A Hamerlynck E Pand Moore D J P Environmental and Vegetative Controls onSoil CO2 Efflux in Three Semiarid Ecosystems Soil Syst 3 6httpsdoiorg103390soilsystems3010006 2019

Running S W and Zhao M Daily GPP and annual NPP(MOD17A2A3) products NASA Earth Observing SystemMODIS land algorithm Univ of Mont Missoula Mont US2015

Running S W Nemani R R Heinsch F A Zhao MReeves M and Hashimoto H A continuous satellite-derived measure of global terrestrial primary productionBioscience 54 547ndash560 httpsdoiorg1016410006-3568(2004)054[0547ACSMOG]20CO2 2004

Rzedowski J Vegetacioacuten de Meacutexico Meacutexico Editor Limusa1978

Scott R L and Biederman J A Critical Zone Water Balance Over13 Years in a Semiarid Savanna Water Resour Res 55 574ndash588 httpsdoiorg1010292018WR023477 2019

Scott R L Watts C Payan J G Edwards E Goodrich D CWilliams D and James Shuttleworth W The understory andoverstory partitioning of energy and water fluxes in an opencanopy semiarid woodland Agr Forest Meteorol 114 127ndash139 httpsdoiorg101016S0168-1923(02)00197-1 2003

Scott R L Edwards E A Shuttleworth W J Huxman T EWatts C and Goodrich D C Interannual and seasonal vari-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

392 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

ation in fluxes of water and carbon dioxide from a ripar-ian woodland ecosystem Agr Forest Meteorol 122 65ndash84httpsdoiorg101016jagrformet200309001 2004

Scott R L Biederman J A Hamerlynck E P and Barron-Gafford G A The carbon balance pivot point of southwest-ern US semiarid ecosystems Insights from the 21st centurydrought J Geophys Res-Biogeo 120 2612ndash2624 2015

Segerstrom K Geology of the Bernal-Jalpan Area Estado deQueretaro Mexico Geol Surv Bull 1104 19ndash82 1961

Soper F M McCalley C K Sparks K and Sparks J P Soilcarbon dioxide emissions from the Mojave desert Isotopic evi-dence for a carbonate source Geophys Res Lett 44 245ndash251httpsdoiorg1010022016GL071198 2017

Stevens N T Steiner S H and MacKay R J Assessing agree-ment between two measurement systems An alternative to thelimits of agreement approach Stat Methods Med Res 262487ndash2504 httpsdoiorg1011770962280215601133 2015

Tagesson T Ardouml J Cappelaere B Kergoat L AbdiA Horion S and Fensholt R Modelling spatial andtemporal dynamics of gross primary production in theSahel from earth-observation-based photosynthetic capacityand quantum efficiency Biogeosciences 14 1333ndash1348httpsdoiorg105194bg-14-1333-2017 2017

Thornton P Thornton M Mayer B Wei Y Devarakonda RVose R and Cook R Daymet Daily Surface Weather Data ona 1-km Grid for North America Version 3 Oak Ridge NationalLab (ORNL) Oak Ridge TN US 2017

Tramontana G Jung M Schwalm C R Ichii K Camps-VallsG Raacuteduly B Reichstein M Arain M A Cescatti A KielyG Merbold L Serrano-Ortiz P Sickert S Wolf S andPapale D Predicting carbon dioxide and energy fluxes acrossglobal FLUXNET sites with regression algorithms Biogeo-sciences 13 4291ndash4313 httpsdoiorg105194bg-13-4291-2016 2016rdquo

Truong A Walters A Goodsitt J Hines K Bruss C B andFarivar R Towards automated machine learning Evaluationand comparison of AutoML approaches and tools in Proceed-ings of the 2019 IEEE 31st International Conference on Toolswith Artificial Intelligence (ICTAI) Portland OR US 1471ndash1479 2019

Verbist K Santibantildeez F Gabriels D and Soto G Atlas dezonas aacuteridas de Ameacuterica Latina y El Caribe Donoso MC(Coord) UNESCO-PHI and CAZALAC Montevideo Uruguay2010

Wang H Li X Ma M and Geng L Improving esti-mation of gross primary production in dryland ecosystemsby a model-data fusion approach Remote Sens 11 225httpsdoiorg103390rs11030225 2019

Wang X Ma M Li X Song Y Tan J Huang G Zhang ZZhao T Feng J and Ma Z Validation of MODIS-GPP prod-uct at 10 flux sites in northern China Int J Remote Sens 34587ndash599 httpsdoiorg101080014311612012715774 2013

Wang X Wang H Li X and Ran Y Photosynthesis (NPPNEP Respiration) Obs Meas Ecohydrol Process 1ndash30httpsdoiorg101007978-3-662-47871-4_10-1 2018

Wheeler C W Archer S R Asner G P and McMurtry C RClimaticedaphic controls on soil carbonnitrogen response toshrub encroachment in desert grassland Ecol Appl 17 1911ndash1928 httpsdoiorg10189006-15801 2007

Wilcox B P Birt A Fuhlendorf S D and Archer S R Emerg-ing frameworks for understanding and mitigating woody plantencroachment in grassy biomes Curr Opin Env Sust 32 46ndash52 httpsdoiorg101016jcosust201804005 2018

Wilson J B Cover plus ways of measuring plant canopiesand the terms used for them J Veg Sci 22 197ndash206httpsdoiorg101111j1654-1103201001238x 2011

Wofsy S Goulden M Munger J Fan S-M Bakwin PDaube B Bassow S and Bazzaz F Net exchange ofCO2 in a mid-latitude forest Science 260 1314ndash1317httpsdoiorg101126science26051121314 1993

Wu C Niu Z and Gao S Gross primary production estimationfrom MODIS data with vegetation index and photosyntheticallyactive radiation in maize J Geophys Res-Atmos 115 D12127httpsdoiorg1010292009JD013023 2010

Wutzler T Lucas-Moffat A Migliavacca M Knauer JSickel K Šigut L Menzer O and Reichstein MBasic and extensible post-processing of eddy covarianceflux data with REddyProc Biogeosciences 15 5015ndash5030httpsdoiorg105194bg-15-5015-2018 2018

Xiao H McDonald-Madden E Sabbadin R Peyrard NDee L E and Chadegraves I The value of understanding feed-backs from ecosystem functions to species for managing ecosys-tems Nat Commun 10 1ndash10 httpsdoiorg101038s41467-019-11890-7 2019

Xu B Yang Y Li P Shen H and Fang J Globalpatterns of ecosystem carbon flux in forests A biometricdata-based synthesis Global Biogeochem Cy 28 962ndash973httpsdoiorg1010022013GB004593 2014

Yao J Liu H Huang J Gao Z Wang G Li D Yu H andChen X Accelerated dryland expansion regulates future vari-ability in dryland gross primary production Nat Commun 111ndash10 httpsdoiorg101038s41467-020-15515-2 2020

Yepez E A Williams D G Scott R L and Lin GPartitioning overstory and understory evapotranspiration ina semiarid savanna woodland from the isotopic composi-tion of water vapor Agr Forest Meteorol 119 53ndash68httpsdoiorg101016S0168-1923(03)00116-3 2003

Yona L Cashore B Jackson R B Ometto J and Brad-ford M A Refining national greenhouse gas inventories Am-bio 1ndash6 httpsdoiorg101007s13280-019-01312-9 2020

Zeng X Zeng X and Barlage M Growing temperate shrubsover arid and semiarid regions in the Community Land ModelndashDynamic Global Vegetation Model Global Biogeochem Cy 22GB3003 httpsdoiorg1010292007GB003014 2008

Zhang A Jia G Epstein H E and Xia J ENSO elicits oppos-ing responses of semi-arid vegetation between Hemispheres SciRep 7 42281 httpsdoiorg101038srep42281 2017

Zhang L Xiao J Zheng Y Li S and Zhou Y Increased carbonuptake and water use efficiency in global semi-arid ecosystemsEnviron Res Lett 15 034022 httpsdoiorg1010881748-9326ab68ec 2020

Zhu X-G Long S P and Ort D R What is the max-imum efficiency with which photosynthesis can convert so-lar energy into biomass Curr Opin Biotech 19 153ndash159httpsdoiorg101016jcopbio200802004 2008

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

  • Abstract
  • Introduction
  • Materials and methods
    • Site description
    • Eddy covariance measurements
    • Flux data processing
    • Remotely sensed data
    • MODIS algorithm for GPP
    • Santa Rita site dataset
    • Modeling
    • Variable importance
    • Model agreement
      • Results
        • Eddy covariance fluxes at Bernal
        • Machine learning ensembles as predictors of eddy covariance GPP
        • MODIS as predictor of eddy covariance GPP
        • Prediction eddy covariance GPP with ordinary-least-squares multiple regression
        • Agreement
          • Discussion
            • Model agreement
            • MODIS discrepancies
            • Forcing variables and machine learning
            • Carbon flux
            • Ecosystem management
              • Conclusions
              • Appendix A
              • Appendix B
              • Appendix C
              • Code and data availability
              • Author contributions
              • Competing interests
              • Acknowledgement
              • Financial support
              • Review statement
              • References
Page 7: Machine learning estimates of eddy covariance carbon flux ... · Scrub is a xeric category of shrublands characterized by plants with small leaves; it is very thorny, and its biomass

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 373

Figure 4 Semidesert grassland encroached by mesquite (Prosopisvelutina) at Santa Rita Arizona (US-SRM) Image credit RussellScott 9 December 2016

Ameriflux sites are located in the SRER Santa Rita Grass-land (US-SRG) and Santa Rita Mezquite (US-SRM) Weused EC data for the years 2013ndash2019 from US-SRM whichis a mesquite grass savanna (35 mesquite canopy coverand mean canopy height above 2 m 22 grasses and 43 bare soil) although MODIS describes this site as open shrub-lands (Glenn et al 2015 Scott et al 2004) The US-SRMsite is dominated by velvet mesquite (Prosopis velutina) andhas a diversity of shrubs cacti succulents and bunch grasses(McClaran 2003) This site was chosen because the vegeta-tion and climate are similar to the Bernal site and it was theclosest EC instrumentation with data availability (Fig 4)

27 Modeling

Gross primary production estimated by EC at the Bernal sitewas modeled using OLS and EML The explanatory vari-ables were the remotely sensed data the weather parame-ters and the vegetation indices (Appendix B) The OLS isa particular case of the generalized linear model where thevariation of a single response variable is explained by severalindependent variables The OLS was fitted with the stepwiseprocedure the final model included variables with a variance

inflation factor (VIF) lower than 10 and a significance levelof 005 Predictions of the EC GPP were obtained with the fi-nal model Analysis and diagnostics were made with Minitabv 17 (Minitab LLC) Analysis of the OLS model can be usedto determine agreement between methods of measurementbut it is sensitive to the range of values in the dataset andits metrics ndash r r2 and root mean square ndash do not provideinformation on the type of association (Bland and Altman2010) In this paper we used another metric the probabil-ity of agreement to determine bias and agreement betweenmodel estimates and observed data (see below)

While OLS is a well-known algorithm machine learningalgorithms are emerging techniques that focus on the datastructure and match that data onto models The EML ap-proach considers the different realizations of machine learn-ing models and constructs an ensemble of models com-ing with the advantage of being more accurate than thepredictions from the individual ensemble members How-ever EML is computationally intensive requiring nodeswith purpose-built hardware such as multiple processors orreduced-precision accelerators The nodes could be aggre-gated in computing clusters which require storage powerand cooling redundancy

A stack of EMLs was obtained with the H2O packageof R (Hall et al 2019) This package provides several al-gorithms that can contribute to a stack of ensembles usingthe AutoML (automatic machine learning) function feedfor-ward artificial neural network (DL) general linear models(GLMs) gradient-boosting machine (GBM) extreme gra-dient boosting (XGBoost) default distributed random for-est (DRF) and extremely randomized trees (XRT) AutoMLtrains two stacked-ensemble models one ensemble containsall the models and the second ensemble contains just thebest-performing model from each algorithm class or fam-ily both ensembles should produce better models than anyindividual model from the AutoML run The term AutoMLimplies data preprocessing normalization feature engineer-ing model selection hyperparameter optimization and pre-diction analysis including procedures to identify and dealwith nonindependent and identically distributed observationsand overfitting (Michailidis 2018 Truong et al 2019)

Machine learning has two elements for supervised learn-ing training loss and regularization The task of training triesto find the best parameters for the model while minimizingthe training loss function the mean squared error for exam-ple The regularization term controls the complexity of themodel helping to reduce overfitting Overfitting becomes ap-parent when the model performs accurately during the train-ing but the accuracy is low during the testing A good modelneeds extensive parameter tuning by running the algorithmmany times to explore the effect on regularization and cross-validation accuracy (Mitchell and Frank 2017) In this inves-tigation the function of training loss was the deviance whichis a generalization of the residual sum of squares driven bythe likelihood Deviance is a measure of model fit and lower

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

374 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

or negative values indicate better model performance (McEl-reath 2020)

A stack of EML solutions was based on a random sam-ple of the dataset for training the model For the Bernal site85 was used for training for US-SRM 80 was usedThe AutoML function was run 20 times each run added ap-proximately 48 models to the leader board and ranked thebest-performing models by their deviance Each run splitsthe training data 10 times for k-fold cross-validation Theseed for an EML that is dependent on randomization waschanged in every run The stopping rule for each run wasset at 100 s and the maximum memory allocation pool forH2O was 100 GB in a single workstation with dual Xeon2680 v4 processors and 128 GB of RAM The H2O packagewas installed in a rockergeospatial docker container (imageavailable at httpshubdockercomrrockergeospatial lastaccess 9 January 2021) which is a portable scalable and re-producible environment (Boettiger and Eddelbuettel 2017)

Two sets of predictions for the GPP at the Bernal site wereobtained from the stacked ensemble The first set of predic-tions was based on the 15 of the Bernal site data reservedfor testing The second set of predictions was obtained byrefeeding the US-SRM site model with the Bernal site ex-planatory variables The first set of predictions would showthe importance of local data to predict EC-based GPP Thesecond set of predictions would represent the suitability ofoff-site data to predict EC-based GPP If the second scenariohas good agreement then an EML model could be used torepresent wider areas of the ecosystem

28 Variable importance

The variable importance within individual models was usedto answer the question of which environmental variableswere important for GPP prediction For the ldquoall-models en-semblerdquo and ldquobest-of-family ensemblerdquo generated by Au-toML it is not possible to examine the variable importance orthe contribution of the individual models to the stack (H2Oai2017) Therefore a weight (wi)was calculated using Eq (4)which is adequate for other information criteria besides theAkaike weights this weight is an estimate of the conditionalprobability that the model will make the best predictions onnew data considering the set of models (McElreath 2020)Then the importance of each variable () was multipliedby the modelrsquos weight (wi) and then added by variable tobuild the variable importance index This index would mea-sure how often a given variable was used on the leader board

wi =exp

(minus

12 dWAICi

)summj=1 exp

(minus

12 dWAICi

) (4)

dWAIC= deviancei minus deviance of top-performing model(5)

29 Model agreement

Calibration and agreement between methods of measure-ment are different procedures Calibration compares knownquantities of the true value or measurements made bya highly accurate method (a gold standard) against themeasurements of a new or a contending method When twomethods of measurement are compared neither providesan unequivocally correct measurement because both havea measurement error and the true value remains unknown(Bland and Altman 2010) Stevens et al (2015) proposes theprobability of agreement (θs) as a plot metric to representthe agreement between two measurement systems acrossa range of plausible values The θs method addressessome of the challenges of the accepted ldquolimits agreementmethodrdquo presented by Bland and Altman (2010) Besides theagreement plot agreement is based on maximum likelihoodbias parameters α and β quantifying the fixed bias andthe proportional bias If α = 0 β = 1 and σ1= σ2 thenthe two measurements are identical σj is the measurementvariation The probability-of-agreement analysis was per-formed using the ProbAgreeAnalysis (httpsuwaterloocabusiness-and-industrial-statistics-research-groupsoftwarelast access 28 November 2020) in MATLAB 94 (Math-Works Inc) An arbitrary 1 gCmminus2 dminus1 was considered tobe a tolerable magnitude to conclude that agreement is suf-ficient enough to use either estimated GPP interchangeablyThe reference measurement was the GPP obtained fromEC data at the Bernal site and tested against the MODISMOD17 model the OLS predictions or each of the two setsof EML predictions If the probability-of-agreement plotsuggested disagreement between two measurement systemsthen the predictions can be adjusted using

Adjusted predictor= (predictorminusα)β (6)

3 Results

31 Eddy covariance fluxes at Bernal

Dominant flow at the Bernal site was from the northeast(Fig 2) Energy balance closure for this site had a slopeof 072 and r2

= 092 (Fig 5) Homogeneous sites of theFluxnet network obtain closure percentages higher than72 and for the Bernal site the vegetation heterogeneitywas important (see below) Average H was always nega-tive during nighttime but during some months of the dryand rainy season λE was positive particularly after dawn(Fig 6) so as to allow nighttime evaporation from the soil orvegetation However during some months of the dry seasonrainfall was low (Fig 7) then the positive λE suggested thatcacti could have an active gas exchange at that time

Carbon dioxide absorptions had a diurnal behavior begin-ning at dawn and ending before sunset (Fig 8) Nighttimeflux was positive indicating respiration notwithstanding the

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 375

Figure 5 Closure of the surface energy balance from eddy covari-ance measurements averaged at 30 min between the turbulent fluxes(H + λE) and available fluxes (Rn-G) Data are from 30 March2017 to 22 August 2018 at the Bernal site The regression wasy = 2302+072times with adjusted r2

= 092 The diagonal line rep-resents the 1 1 relation

presence of cacti Although summer rains are characteristicof the climate at the Bernal site (Fig 7) a negative NEE fluxoccurred in all measured months The lowest CO2 flux wasrecorded in January and February 2017 and in May 2018 (Ta-ble 1) This behavior resulted from the phenology of the veg-etation since most species lost their leaves in the dry seasonand was also due to the effect of low temperature Withinthe rainy season the flux of CO2 increased compared to themonths of January to June The correlation between NEE andprecipitation was minus045 When the sum of the precipitationof the current month and that of the previous month was con-sidered the correlation with NEE was minus07 suggesting thatcontinuous availability of soil moisture is important for theabsorption of CO2 in this environment

The scrub at Bernal was heterogeneous in botanical com-position A total of 24 species of cacti and shrub wereidentified on average each sampling plot had 103 speciesThe IVI was similar between all cacti (036plusmn 004) shrublegumes (038plusmn004) and other shrubs (023plusmn006) sampledMost sampling plots were in areas of high flux frequency(Fig 2) Cylindropuntia imbricata had the largest IVI fol-lowed by Acacia farnesiana Acacia schaffneri and Prosopislaevigata The IVI of the herbaceous stratum representedby grasses was not characterized due to the state of over-grazing and the absence of reproductive structures in plantswhich made measurement of their abundances frequenciesand dominance difficult The grass genera present were Meli-nis Chloris Cynodon and Cenchrus all corresponding to in-

Table 1 Daily average values of the net ecosystem exchange(NEE) gross primary productivity (GPP) and ecosystem respira-tion (Reco) in a scrub at the Bernal site Negative values of NEEindicate photosynthetic absorption

NEE GPP RecomicromolCO2 mminus2 sminus1

2017

JANFEBMARAPR minus054 248 194MAY 005 286 291JUN 038 421 459JUL minus100 178 077AUGSEPOCT minus126 525 399NOV minus013 265 252DEC 004 180 184

2018

JAN minus005 129 124FEB 006 188 194MAR minus094 245 151APR minus058 287 229MAY minus029 377 348JUN minus252 433 181JUL minus283 823 541AUG minus193 921 728

vasive C4 tropical grasses Scrub species of higher IVI hada similar LAI (12) although the magnitude of the LAI of Plaevigata stood out (Table 2)

32 Machine learning ensembles as predictors of eddycovariance GPP

In this section we describe the modeling with EML using lo-cal remotely sensed data from the Bernal site to predict GPPat the same site and then the agreement between EML GPPpredictions and EC-derived GPP The AutoML function gen-erated 1031 models with an average deviance of 135 whilethe deviance of the leader model was 063 in the trainingdataset (Table 3) A total of 11 models of type GBM andfive XGBoost models were in the top 30 models along withthe nine best-of-family ensembles and five all-models ensem-bles The weighted variable importance on the leader boardwas higher for the LAI from the MOD15 and MCD15 prod-ucts (17 and 14 ) The PsnNet EVI (MOD17) FPAR(MCD15) green atmospherically resistant vegetation index(GARI) and MODIS reflectance band 13 had an importancehigher than 3 (59 54 42 36 and 30 ) LAI(MCD13 and MOD13) PsnNet and the FPAR (MOD15)were the more important variables (20 17 13 and10 ) in the top nonstacked model a GBM model that wasranked in fourth place

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

376 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 6 Latent heat flux daily trend at Bernal during different months emphasizing nighttime λE Panel (a) shows months with low rainfallin the previous month and predominantly negative λE during nighttime (these months had low rainfall 017 113 and 033 mm rainfall forJanuary March and December) Panel (b) shows months with low rainfall in the previous month and positive λE after sunset (67 73 and207 mm rainfall for February April and May) Panel (c) shows months during the rainy season with positive λE mainly due to soil wetnessand antecedent rainfall of 43 202 9 190 and 34 mm for June July August October and November Plot (c) is out of scale on the y axis forcompatibility with the other plots

Table 2 Importance value index (IVI) and leaf area index (LAI) of the main species present at the Bernal Quereacutetaro study site

Species Plant type IVI SEMa LAI SEM

Coryphantha cornifera Cactus 007 027Bouvardia ternifolia Herb 007 027Karwinskia humboldtiana Shrub 007 027Forestiera phillyreoides Shrub 009 027Ferocactus latispinus Cactus 009 027Cylindropuntia leptocaulis Cactus 009 027Asphodelus fistulosus Shrub 009 019Brickellia veronicifolia Shrub 010 027Dalea lutea Shrub 011 015Eysenhardtia polystachya Legume 013 015Myrtillocactus geometrizans Cactus 014 027Schinus molle Shrub 014 019Jatropha dioica Herb 015 019Mammillaria uncinata Cactus 016 012Opuntia tomentosa Cactus 017 011Opuntia robusta Cactus 023 007Opuntia hyptiacantha Cactus 026 007Mimosa monancistra Legume 028 012Mimosa depauperata Legume 031 012Zaluzania augusta Shrub 033 010Viguiera linearis Herb 036 011Acacia schaffneri Legume 041 007 113 015Prosopis laevigata Legume 041 007 148 012Acacia farnesiana Legume 056 009 112 037Cylindropuntia imbricata Cactus 074 007 113 011

a SEM standard error of the mean

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 377

Figure 7 Monthly rainfall and land surface temperature (LST) dur-ing years (a) 2017 and (b) 2018 at the Bernal site The LST valuescorrespond to the 1330 (LST day) and 0130 (LST night) MODISAqua satellite overpasses

Predictions of GPP in the testing dataset showed disper-sion in the lower range of the scale of measurement andthe correlation was 094 (Fig 9a) The final prediction ofGPP for the whole dataset had a probability of agreement(θs) of 058plusmn 001 (parameter estimate and standard error)α =minus00616plusmn 011 and β = 10133plusmn 002 suggesting agood fit with low fixed and proportional bias (Fig 9b) Theprobability of agreement decreased slightly at the lower andupper range of the scale of measurement (Fig 9c) indicatingthat the EML model would predict GPP without increasingthe bias particularly in the range from 0 to 4 gCmminus2 dminus1However the value of θs should be higher than 095 so as toconsider EC measurements and EML to be interchangeableThe correlation of 099 (r2

= 098) for the data in Fig 9bcould be misleading as it would suggest a very good fit

Using only the five of the more important variables namedabove to generate an EML resulted in an XBoost leadermodel with 273 deviance and a total of 1094 models Thetop 30 models were 16 XBoost and GBM models and the

best-of-family ensembles started to show up in 12th placeAlthough the number of runs was the same (20) the AutoMLfunction increased the number of produced models but thesmaller set of explanatory variables constrained the ability toidentify features contributing to better models Using anotherset of five randomly selected explanatory variables (one veg-etation index and four MODIS bands) resulted in a leadermodel with 252 deviance out of 1024 models but this timethe leader was the best-of-the-family ensemble Using only afew variables was considered to increase the deviance com-pared to the average deviance of 135 obtained during thetraining phase and using all available variables

An important question for modeling upscaling is the ca-pacity to extrapolate results temporally and spatially herewe explored the latter posing the following question wouldpredictions of GPP from EML for an EC site agree withEC observations from another site with ldquosimilarrdquo environ-mental conditions First an EML solution was found train-ing 80 of the Santa Rita dataset and obtaining a best-of-family ensemble with 023 deviance out of 634 trainedmodels (hereafter this model is referred to as the Santa Ritamodel) Then the environmental and remotely sensed datafrom the Bernal site were fed into the Santa Rita model thiswould be an external validation dataset However agreementwas not good the mean value of θs was 015plusmn 001 withα =minus10822plusmn 009 and β = 058127plusmn 002 The value ofθs was not constant across the range of measurement and de-creased rapidly after 2 gCmminus2 dminus1 (Fig 10b) Because thebias was important predictions were adjusted using Eq (6)showing some improvement with r = 078 (Fig 10a) Com-paring Figs 5b and 6a it is evident that an EML modelextrapolation to other conditions is noisier ie Santa Ritamodel trying to represent the ecosystem function at BernalNotwithstanding some of the most important variables wereshared by both EML ensembles Bernal and Santa Rita in thecase of Santa Rita LAI from MOD15 MYD15 and MCD15had 350 48 and 31 of variable importance and theFPAR from MOD15 was 12

33 MODIS as predictor of eddy covariance GPP

MODIS is important because it overpasses every point ofthe earth every 1 or 2 d and it implements a GPP prod-uct (MOD17) that has helped track the response of the bio-sphere to the environment since 2000 The product MOD17has been validated against many EC sites but few validationsites correspond to deserts and semideserts (Running et al2004) The GPP MOD17 underestimated the GPP derivedfrom EC data at Bernal (Fig 11a) In a similar bell-shapeddistribution of θs as in the case of the extrapolation of theSanta Rita site (Fig 10b) here the θs was not constant acrossthe range of measurement mean θs was 024plusmn 013 withα = 000047plusmn0087 and β = 048749plusmn002 (Fig 11c) Ad-justing MOD17 estimates with Eq (6) improved the relation-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

378 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 8 Net ecosystem exchange (NEE) and photosynthetic active radiation (PAR) at the Bernal site in (a) 2017 and (b) 2018 Negativevalues in the CO2 flux indicate photosynthesis The gray shadow is the standard error of the mean for each month at any given hour

Table 3 Leader board of EML models for the Bernal site using 85 of day observations as training dataset NA denotes the outcome wherethe type of model was not present in the 30 top-performing models according to deviance

Type of model Number of models Average deviance

All models Top models Leader model

Stacked ensembleAll models 20 0852 0834Best of family 20 0812 0703 0633GBM 453 1366 0796DRF 20 1108 NAXGBoost 277 1385 0778XRT 20 1077 NADeep learning 201 2072 NAGLM 20 1474 NATotal 1031 1356

ship but note that the value of r (076) was the same for theoriginal MOD17 and the adjusted MOD17 (Fig 11b)

34 Prediction eddy covariance GPP withordinary-least-squares multiple regression

OLS is a common estimation method for linear models andhere this model appeared as adequate judging by the gen-eral distribution of predictions (Fig 12a) and the probability-of-agreement plot (Fig 12b) A total of 14 variables wereincluded in the model all of them with VIF values lowerthan 70 (Appendix C) the VIF statistic quantifies the sever-ity of multicollinearity and an acceptable threshold is 10The most significant variables were the EVI from MYD13and Daymet variables precipitation shortwave radiation and

minimum and maximum temperatures (see Appendix B forvariable details) Variables with high coefficient values wereMODIS reflectance band 14 (917) the EVI from MYD13(853) and the NDVI (39) while Daymet temperatures hadsmall coefficientsminus023 for maximum temperature and 017for minimum temperature The θs decreased slightly at theend of the measurement range mean θs was 05plusmn0014 withα = 018845plusmn0137 and β = 094966plusmn0031No correctionfor this model was calculated since α was close to 0 β wasclose to 1 and the EML model had a higher θs

35 Agreement

The probability of agreement (θs) was the statistic to deter-mine if the mean responses were in agreement Comparing

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 379

Figure 9 Agreement between predictions of GPP obtained with machine learning algorithms or derived from eddy covariance measurementsat the Bernal site (a) Example of one run of predictions of GPP in the test dataset from the Bernal site using the leader model of an ensembleof machine learning algorithms (EML) the test dataset was 15 of data with r = 094 (b) Predictions for the complete dataset withr = 099 the diagonal line is the 1 1 agreement (c) Function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreementbetween methods of estimation of GPP corresponding to plot (b) The horizontal axis (s) represents the magnitude of the measurement thevertical axis is the probability of agreement for the measurement and the red line is the confidence interval p lt 005

Figure 10 (a) Adjusted predictions of GPP for the complete dataset from the Bernal site using the leader model of the final ensembleof machine learning algorithms (EML) derived from the Santa Rita site compared to estimates of GPP from EC data with r = 078 (b)Respective function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreement between methods of estimation of GPP ECdata from the Bernal site and EML model for the Santa Rita site The horizontal axis (s) represents the magnitude of the measurement thevertical axis is the probability of agreement for the measurement and the red line is the confidence interval p lt 005

the confidence intervals (CIs) for θs the best modeling ap-proach was the EML because its CI (056ndash059) was differentfrom that of the OLS model (047ndash052) More importantlyfor the EML and OLS models the values of θs had little vari-ation across the range of GPP estimates However the CIsof the EML and OLS models were similar in their bias es-timates α and β and both models had no bias (p lt 005)Altogether the best result was the EML ensemble using en-vironmental and remotely sensed data corresponding to thesame site ie Bernal (Fig 10c) This kind of EML wouldbe useful for gap filling or the evaluation of GPP time se-ries of the site that generated the model Machine learningalgorithms can fill gaps longer than 30 d (Kang et al 2019)

The second option to estimate GPP using the samedatasets was the multiple-regression OLS model (Fig 12a)The multiple regression is straightforward and here multi-collinearity was not a problem The EML ensemble and theOLS regression have the highest values of θs (058 and 05respectively Fig 12b) Higher values for θs are desirable(gt 095) and this could be achieved by increasing the samplesize relaxing the tolerable magnitude for agreement (here itwas set at 1 gCmminus2 dminus1) or perhaps using different forcingvariables

The MODIS estimates were a third-best alternative sincethe mean θs was 024 In the present study we used a splineto fill the data to a daily time step since the MCD17 is an8 d composite product but a similar result was obtained if

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

380 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 11 (a) MOD17 GPP daily time step from the Bernal site versus estimates of GPP from EC data using the complete dataset at adaily time step derived from spline for MOD17 GPP (copy) with r = 076 or 8 d composite estimates obtained by 8 d averages of EC GPP() with r = 076 (b) Adjusted predictions of GPP from MODIS (c) Corresponding function of probability of agreement for Fig 7a using1 g Cmminus2 dminus1 as a tolerable agreement between methods of estimation of GPP EC data from the Bernal site and MODIS MOD17 Thehorizontal axis (s) represents the magnitude of the measurement The vertical axis is the probability of agreement for the measurement

Figure 12 (a) Ordinary-least-squares multiple-regression estimates of GPP for the complete dataset from the Bernal site versus estimates ofGPP from EC data with r = 082 (b) Respective function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreement betweenmethods of estimation of GPP EC data from the Bernal site and OLS multiple regression The horizontal axis (s) represents the magnitudeof the measurement The vertical axis is the probability of agreement for the measurement

the EC GPP was rescaled and compared to the original 8 dMCD17 data (Fig 11a) The GPP from MODIS was an un-derestimate of EC GPP and when the estimates were ad-justed (Eq 6 Fig 11b) the performance was not better thanthe OLS (model not shown) The MODIS land cover classi-fication represented this site as grassland (MCD12) and thiscould be another reason for the poor agreement besides theassumptions made in the MODIS algorithm regarding the εand εmax parameters and the response of vegetation to VPDAgreement of MODIS GPP is crucial because MODIS prod-ucts are frequently used in country-wide assessments of thecarbon cycle and can influence public policies

The model with least agreement resulted when the EMLensemble generated from the Santa Rita site was used to pre-dict GPP at Bernal Machine learning models can make pre-dictions but their usefulness decreases when they are usedoutside the context of where they were built while process-

based mechanistic models have this ability Although the Au-toML function in H2O is designed to protect against over-fitting using cross-validation runs (Michailidis 2018) in ourstudy the Santa Rita model could not be generalized to repre-sent the Bernal site GPP process probably because the vari-ables and features selected for Bernal or Santa Rita were dif-ferent during the AutoML workflow

The Santa Rita model was good at predicting GPP at thatsite with a deviance of the leader model of 023 while atBernal the deviance of the leader model was 063 (Table 3)indicating that the Santa Rita EML ensemble was at leastas good a model as the EML at Bernal (not shown) TheGPP time series for Santa Rita was about 4 times the sizeof the Bernal dataset and therefore the deviance was lowerHowever when the Santa Rita model was used with Bernaldata the mean θs was 016 indicating that the agreement wasinsufficient Eyeballing the predictions in Figs 9a 10b and

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 381

11a and their corresponding correlation values (078 076and 082 for adjusted EML Santa Rita adjusted MODIS andmultiple-regression OLS) it could be argued that these mod-els were comparable However their θs plots present a dif-ferent perspective

4 Discussion

41 Model agreement

The Santa Rita model had higher θs when it extrapolatedfor low GPP values for the Bernal site suggesting that theSanta Rita model had a better skill when predicting GPPclose to 0 and even negative GPP values (Fig 10b) Al-though the Bernal and Santa Rita sites had similar vegeta-tion and climate classifications they are more than 1600 kmapart and rainfall monthly distribution is different The GPPseasonal cycle at Bernal started in February and steadily in-creased to a maximum during July and August (Table 1)At Santa Rita the GPP was low from January until mid-July (lt 05 gCO2 mminus2 dminus1) and then increased sharply to amaximum level in mid-August (Joiner and Yoshida 2020)What matters eventually for machine learning methods ishow well the predictor space rather than geographic spaceis sampled (Jung et al 2020) Incorporating data from morehumid (semiarid) sites could improve the GPP predictions bya machine learning method Not only more EC sites but alsosites representing a water availability gradient would be im-portant for semiarid ecosystems representing long spells andthe influence of oceanic oscillations and monsoon rains

All the models presented here used transient data to rep-resent GPP specifically at the 1 d time step Besides theradiation-related variables two sources of rainfall were usedas forcing variables but evapotranspiration (ET) was notused The MOD16A2 version 6 is available as an 8 d gap-filled product and could be included in EML or OLS mod-els However a more general representation of the carboncycling could be achieved when including variables that rep-resent annual or seasonal timescales of soil water evapo-ration or precipitation (Scott and Biederman 2019) Scottet al (2015) suggest that real lags between precipitation andproductivity that may impart legacy effects may also be par-tially masked by using ET as ET more carefully tracks pro-ductivity when soil moisture storage is accessed The occur-rence of off-season rainfall dry spells and carryover effectscould be parameterized as windowed events of a given dura-tion A window would be a period with distinct time bound-aries the window allows the grouping of records with similarfeatures The effect of the window at a given time could berepresented as moving weights as the point in time in ques-tion is closer or farther from the window

In our study the metric to assess model agreement wasthe probability of agreement (θs) and their bias parame-ters Many other metrics can be used to evaluate model per-

formance such as the root mean square error r r2 or themodel efficiency factor (MEF) presented by Nash and Sut-cliffe (1970) In particular the MEF is a step in the Fluxnetdata processing pipeline (Pastorello et al 2020) A MEFvalue close to 1 represents a high correlation and lower bi-ases (Joiner and Yoshida 2020) Therefore the calculatedMEF for the Bernal site EML (098) would suggest very goodpredictive performance while the θs of 06 for this model in-dicates a more modest performance The θs for a particularvalue of the measurand is the probability that the differencebetween two measurements made by different systems fallswithin an interval that is deemed to be acceptable (Stevenset al 2015) In the present study we used 1 gCmminus2 dminus1 as acritical value defining an acceptable difference If this valueis smaller then the probability of agreement would decreasefor this same model In such a case it would be less likelythat the predictors agree considering that the θs takes intoaccount both the difference in the function means at a givenvalue of the measurand and the uncertainty in its estimation

42 MODIS discrepancies

Different authors have reported discrepancies betweenMCD17 and EC estimates of GPP in semiarid regions Ex-amining MODIS discrepancies in these ecosystems is im-portant because the errors induced by cloud cover are ex-pected to be minimal and other effects can be identified(Gebremichael and Barros 2006) The GPP of MOD17 didnot relate well (EC= 011+ 017MODIS r2

= 067) withestimates of EC GPP in semidesert vegetation of the Sahel(Tagesson et al 2017) With data from different types of veg-etation in the Heihe basin in China MODIS17 overestimatedthe GPP from EC (EC= 115+024MODIS r2

= 068 Cuiet al 2016) For scrub sites in Mexico the relation be-tween GPP calculated from EC and MOD17 was not good(MODIS= 38382+ 0467EC r2

= 06 Delgado-Balbuenaet al 2018) In arid and semiarid ecosystems in China op-timizing parameters of the MODIS GPP model with site-specific data improved the estimate to explain 91 of thevariation in the GPP of the data observed by EC (Wang et al2019) These same authors propose improving the land useclassification used by the MOD17 algorithm and recalibrat-ing light use efficiency parameters to solve the GPP estima-tion problem Gebremichael and Barros (2006) examined anopen shrubland site in a semiarid region of Sonora Mexicoand their analysis of the temporal evolution of the discrep-ancies with MODIS GPP suggested revisiting the light useefficiency parameterization especially the functional depen-dence on VPD and PAR and water stress or soil moistureavailability

The relationship between the GPP MODIS and the GPPEC presented in Sect 33 is an approximation because theuncertainty in the respiration component must be consid-ered The empirical relationship between nocturnal NEE andsoil temperature has been used to represent ecosystem res-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

382 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

piration (Reco) in order to separate the processes that con-tribute to daytime NEE (Richardson and Hollinger 2005Wofsy et al 1993) Nighttime NEE should be equal to therates of autotrophic and heterotrophic respiration while dur-ing daytime NEE should be equal to the combined rates ofcarboxylation and oxidation of Rubisco autotrophic respi-ration and heterotrophic respiration Then the GPP can becalculated as the difference between daytime NEE and Recoestimated through its relationship with temperature (Gouldenet al 1996) In the present study Reco was calculated basedon soil and air temperature following the procedure of Reich-stein et al (2005) implemented in REddyProc (Wutzler et al2018) Although it is possible to measure or model the parti-tion of respiration (Running et al 2004 Wang et al 2018)the presence of cacti complicates the calculation assum-ing that all nighttime flux represents ecosystem respiration(Owen et al 2016 Richardson and Hollinger 2005) Whilesoil respiration tends to be temperature-limited when soilmoisture is nonlimiting in temperate ecosystems in range-land ecosystems the controls of soil CO2 efflux were photo-synthesis soil temperature and moisture (Roby et al 2019)In our study the instrumentation did not include measure-ments of plant or soil respiration partition to validate theRecoestimates

A problem regarding data comparison from remote orbitalsensors and terrestrial observations is that different quanti-ties are fundamentally measured MODIS measures the radi-ation reflected by the earthrsquos surface in two spectral bands at250 m spatial resolution per pixel five bands at 500 m and29 bands at 1000 m The EC technique has a footprint tomeasure CO2 that varies dynamically in shape and size butis generally considered to be 1 km2 in this study the foot-print radius was 600 m To solve the scaling MODIS prod-ucts related to the carbon cycle have been validated with theEC technique and biometric measurements on several spa-tial scales using process-based ecosystem models and char-acterizing areas up to 47 km2 around the EC measuring tower(Cohen et al 2003)

At Bernal the vegetation was heterogeneous This situa-tion was represented by the heterogeneity in vegetation ac-tivity in the 4 pixels used spanning about 02 units of NDVIat the peak of the season activity (Fig 1) The standard errorof the IVI differed by 1 order of magnitude among speciesAlthough more important species had a lower standard errortheir means were similar indicating that they were equallyabundant at all sampling plots The higher importance ofsome species (Table 2) was explained by selective grazingndashbrowsing behavior and the dispersion caused by cattle by ei-ther ingesting or transporting seeds or plant parts (Belaynehand Tessema 2017) Regarding landscape heterogeneity thetower fetch was predominantly from the northeast capturingthe less heterogeneous area of the site but also the more ac-tive according to the NDVI (Fig 1)

The thorny scrub examined had two vegetation layersthe overstory layer mainly consisting of mesquite acacia

and cacti and the understory layer that included grasses andherbs Cattle preferentially graze the understory and becausethey eat using their tongue they will avoid browsing thornyspecies unlike goats or deer which use their lips Withoutgrazing management overtime the competition balance willfavor bush species resulting in encroaching and the under-story will be stressed only unpalatable species or those withtheir growing meristems very close to the ground would sur-vive Representing the structure and functioning of these twolayers using MODIS is possible (Liu et al 2017) RecentlyHill and Guerschman (2020) presented a MODIS product de-rived from MCD43A4 to estimate the fractions of photosyn-thetic and nonphotosynthetic vegetation and the remainingfraction of bare soil These developments could improve theMOD17 GPP estimates since its model represents a homoge-neous single vegetation layer All these considerations helpto understand the low θs between MOD17 estimates of GPPand EC-derived GPP

43 Forcing variables and machine learning

Forcing variables in this study were gridded meteorologi-cal data and MODIS spectral bands and products but thevariables mostly included in the ML algorithms were LAIand FPAR accounting for 364 and 549 of the vari-able importance index at Bernal and Santa Rita respectivelyin their leader board This result supported the view thatCO2 fluxes can be represented by ML algorithms exclu-sively using remotely sensed data (Tramontana et al 2016Joiner and Yoshida 2020) Using neural networks Joiner andYoshida (2020) showed that with only satellite reflectancefrom MODIS and top-of-atmosphere PAR it is possible tocapture a large fraction of GPP variability They argue thatvegetation indices may reduce the information content ofthe underlying reflectance when compressing the informa-tion from two or more bands into a single index In our studyreflectance bands were not often included in the ML modelsThis was most likely because MOD09GA and MODOCGAprovide estimates of surface reflectance uncorrected for theillumination and viewing geometry while the MCD43 usedin Joiner and Yoshida (2020) uses a surface bidirectional re-flectance distribution function (BRDF) model that improvesthe quality of surface albedo retrievals By propagating theBRDF correction in the MODIS processing pipeline the veg-etation indices would more likely relate better to the GPPFuture modeling efforts should use benchmark scenarios ac-cording to sets of forcing variables already identified as use-ful

44 Carbon flux

Although the carbon balance in ecosystems is influenced bydifferent factors such as soil type and amount of nutrientsthe relationship with soil temperature and humidity is partic-ularly strong (Anderson-Teixeira et al 2011 Hastings et al

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 383

2005) How much of the rainwater the system can retain orlose has been described as the leakiness of the system (Guer-schman et al 2009) More than immediate incident rainfallthe available soil moisture and its redistribution are impor-tant in semiarid ecosystems including steam flow prefer-ential flow paths hydraulic lift and others (Barron-Gaffordet al 2017) At Bernal when the sum of the precipitation ofthe current month and that of the previous month was con-sidered the correlation with NEE was minus07 suggesting thatcontinuous availability of soil moisture is important for theabsorption of CO2 in this environment This result is con-sistent with other studies in which the relationship betweenthe net productivity of the ecosystem (NEP) and precipita-tion is initially positive but is leveled from 1000 to 1500 mmannually (Xu et al 2014) The hydraulic redistribution ofwater from moist (deeper) to drier soils through plant rootstended to increase modeled annual ecosystem uptake of CO2this process was identified at US-SMR (Fu et al 2018 Scottet al 2003)

The leakiness is highly dependent upon vegetation frac-tional cover the proportion of the surface occupied by baresoil and vegetation photosynthetically active vegetation andnonphotosynthetically active vegetation such as litter woodand dead biomass (Guerschman et al 2009) All these vege-tation fractions have a water storage capacity and can reducethe amount of effective rainfall available to plant roots It ispossible that the canopy of the bushes completely interceptedthe rainfall in some months because the scrub can interceptup to 20 of the precipitation and its canopy storage ca-pacity is 097 mm (Mastachi-Loza et al 2010) Consideringonly the daily rain events greater than 5 mm the correlationbetween precipitation and NEE rose to minus072 In the presentstudy the interception of rain by vegetation surfaces was notcalculated but the results suggest that it would be impor-tant to explore the relationship between net precipitation andNEE

The average NEE at a global level is minus156plusmn284 gCmminus2 yrminus1 (Baldocchi 2014) The highest frequencyamong sites that measured NEE with EC occurs from minus200to minus300 gCmminus2 yrminus1 but in sites with biometric measure-ments the peak occurs at minus100 gCmminus2 yrminus1 (Xu et al2014) Using the daily averages of Table 2 the averageNEE during the measurement period was minus078 gCmminus2 dminus1

and annually would be minus2835 gCmminus2 yrminus1 This resultwas higher than the annual values of the induced grass-land and scrubland vegetation characterizing the SonoranDesert plains (138 and 130 gCmminus2 yrminus1 Hinojo-Hinojoet al 2019) In New Mexico NEE values measured withEC are between 35ndash50 gCmminus2 yrminus1 in desert grassland and344ndash355 gCmminus2 yrminus1 in mixed coniferous forest (Anderson-Teixeira et al 2011) In a dryer region the sarcocaulescentscrubland of Baja California in Mexico the NEE was minus39and minus52 gCmminus2 yrminus1 in 2002 and 2003 respectively (Hast-ings et al 2005) The NEE measured here was within therange of NEE 03plusmn 02 kgCmminus2 yrminus1 for grasslands and

shrublands in Mexico (Murray-Tortarolo et al 2016) Al-though the measurements of the present study had gaps andwere compared with annual studies we considered that thereported value of C was representative of the main season ofgrowth of this type of scrub

45 Ecosystem management

Overgrazing is an appreciation relative to the grazing produc-tive system where the forage resource is overused in a mixedshrubndashgrass ecosystem such as Bernal it usually refers tothe understory Overgrazing means that the plant regrowth isreadily grazed tillers and root reserves are lost and eventu-ally the plant may die Although the Bernal site was over-grazed the carbon fluxes indicated that the plant communitywas photosynthetically active in both dry and rainy seasonsIt is fair to assume that water in the soil was not limiting forthe deep-rooted bush species and that was the reason whyit was possible to maintain the photosynthetic function dur-ing rainless months However this primary production wouldnot have tangible benefits for ranchersrsquo production systemsince no edible biomass would be produced for the cattleFrom the point of view of carbon capture the system accu-mulated nonlabile biomass that would remain in the systemfor a longer time compared to a grassland ecosystem (al-though it would be necessary to determine the partition ofsaid shrub biomass) However the overgrazing condition af-fects the biomass of the understory roots and consequentlythe carbon pool in the soil

In the short term it can be thought that the estimated neg-ative carbon flows had a favorable effect on the environmen-tal agenda As time passes it is possible that the gaps be-tween the individual shrubs of the overstory expand and thiswould have had an effect on soil erosion It is also possiblethat the water stored in the soil profile used by the bushesgradually decreases to the point of causing drought changesin phenology and advancements of the desertification pro-cess There are many opportunities for ecology conservationand livestock-oriented management these may include con-trolled grazing or propagating native thornless shrub speciesIf ranchers do not identify a benefit in the vegetation thenthey will be tempted to remove it as it occurred at thestudy site Because of its wide coverage and availability theMODIS GPP product accuracy is important in representingthe carbon cycle raising awareness and monitoring advance-ment of environmental decisions

Although we found that EML was a good option for mod-eling the GPP of a site what is really needed to evaluate theperformance of semiarid ecosystems is a spatial representa-tion of the carbon flux This is a problem for an underrepre-sented area regarding instrumented EC towers However theEML could be designed to take into account the explanatoryvariables in a spatiotemporal continuity

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

384 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

5 Conclusions

The best modeling approach was the ensemble of ma-chine learning The second option to estimate GPP was themultiple-regression OLS and the third alternative were theMODIS estimates Machine learning was a good option topredict GPP in the local context where it was generated oth-erwise its performance was not good This was demonstratedwhen using the EML from Santa Rita and trying to predictfluxes at Bernal Nevertheless a machine learning modelwould be useful for gap filling or the evaluation of GPP atthe same site The GPP estimates of a given model can be ad-justed using the bias parameters of the probability of agree-ment to improve the relationship Further research explor-ing the usefulness of EML to predict GPP in other contextscould use EML to calibrate mechanistic model parametershybrid approaches integrating EML and biologically mean-ingful mechanistic models or EML representing the spatialvariability of the landscape

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 385

Appendix A

Table A1 Threshold of velocity friction (ulowast) above which nighttime fluxes were considered valid Estimates were obtained using the Rpackage REddyProc We used only the flux records with ulowast equal to or higher than the corresponding mean ulowast threshold for each yearBound values are a 95 confidence interval NA not applicable

Aggregation method Year Season Mean Lower bound 5 Upper bound 95

ulowast (msminus1)

Single NA NA 0193 0140 0209Year 2017 NA 0194 0128 0227Year 2018 NA 0193 0132 0225Season 2017 2016012 0194 0128 0227Season 2017 2017003 0215 0141 0297Season 2017 2017006 0194 0128 0227Season 2017 2017009 0181 0144 0208Season 2018 2017012 0149 0116 0232Season 2018 2018003 0271 0134 0297Season 2018 2018006 0193 0132 0221

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

386 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix B

Table B1 MODIS reflectance bands and products database

Product name Satellite layer Spatial Temporal Spectralresolution resolution coverage

(d) (nm)

MOD09GA Band 1 500 m 1 620ndash670Band 2 841ndash876Band 3 459ndash479Band 4 545ndash565Band 5 1230ndash1250Band 6 1628ndash1652Band 7 2105ndash2155

MODOCGA Band 8 1 km 405ndash420Band 9 438ndash448Band 10 483ndash493Band 11 526ndash536Band 12 546ndash556Band 13 662ndash672Band 14 673ndash683

MOD17A2H Gross primary production (GPP) 8

Net photosynthesis (PsnNet)

MOD15A2H Fraction of photosynthetically active radiation (Fpar) na

Leaf area index (LAI)

MOD11A2 Land surface temperature and emissivity of day (LST day) 1 km

Land surface temperature and emissivity of night (LST night)

MYD13A1 Enhanced vegetation index (EVI) 500 m 16

MYD15A2H Fraction of photosynthetically active radiation (Fpar) 8

Leaf area index (LAI)

MYD11A2 Land surface temperature and emissivity of day (LST day) 1 km na

Land surface temperature and emissivity of night (LST night)

MCD15A2H Fraction of photosynthetically active radiation (Fpar) 500 m

Leaf area index (LAI)

na not applicable

Table B2 Daily vegetation indexes computed using the MODIS reflectance bands described in Table 1

Index Formula Reference

Simple ratio (SimpleR) SimpleR= Band 2Band 1

(Hill et al 2006)Moisture stress (MoistS) MoistS= Band 6

Band 2

Disease stress index (DSI) DSI= Band 2+Band 4Band 6+Band 1

Redndashgreen ratio index (RGRI) RGRI= Band 1Band 4

Normalized difference vegetation index (NDVI) NDVI= Band 2minusBand 1Band 2+Band 1

Normalized difference water index (NDVI_w) NDVI_w =Band 2minusBand 5Band 2+Band 5

Green leaf index (GLI) GLI= 2lowastBand 11minusBand 14minusBand 92lowastBand 11+Band 14+Band

(Henrich et al 2012)Green atmospherically resistant vegetation (GARI) GARI= Band 5minus(Band 11minus(Band 9minusBand 14))

Band 5minus(Band 11+(Band 9minusBand 14))

Enhanced vegetation index (EVI) EVI= 25lowast Band 2minusBand 1Band 2+(6lowastBand 1)minus(75lowastBand 9)+1

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 387

Table B3 Daymet meteorological database

Variable Spatial resolution Temporal resolution Reference

Precipitation (Dayprc)

1 km Daily (Thornton et al 2017)Shortwave radiation (Daysrad)Maximum air temperature (DayTmax)Minimum air temperature (DayTmin)Water vapor pressure (Dayvp)

Table B4 Precipitation data

Satellite Product name Spatial resolution Temporal resolution

Global Precipitation Measurement (GPM) 3IMERGDF v006 010Daily

Tropical Rainfall Measuring Mission (TRMM) 3B42 v007 025

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

388 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix C

Table C1 Analysis of variance for GGP derived from EC data atthe Bernal site Variable details are described in Appendix B

Model term Coefficient SEa DFb SSc F p VIFd

Regression 14 178048Constant minus1318 648 minus203 004R_08_40542e

minus212 060 1 2680 minus352 0 220R_13_66267e

minus286 123 1 1161 minus232 002 155R_14_67368e 917 259 1 2706 353 0 122RGRI6

minus083 037 1 1100 minus225 003 143GARIf 201 108 1 755 187 006 237EVIMYDg 853 201 1 3903 425 0 294NDVI_wf 390 155 1 1367 251 001 271PsnNetg 0013 0 1 7530 590 0 383LstNgtMYDg 005 002 1 818 194 005 201Dayprch 014 003 1 3590 407 0 192Daysradh 001 0 1 5121 486 0 501DayTmaxh

minus024 006 1 3658 minus411 0 689DayTminh 018 005 1 3222 386 0 496TRMMi minus004 002 1 1087 minus224 003 141Error 403 87297Total 417 265345

a Standard error of coefficient b Degrees of freedom c Adjusted sum of squares d Variance inflation factore Variable name denotes the band number and spectral bandwidth of MODIS (Moderate Resolution ImagingSpectroradiometer) f Vegetation indices RGRI is redndashgreen ratio index and GARI is green atmosphericallyresistant vegetation index details of formula are described in Appendix B g Layers from MODIS productsEVIMYD is enhanced vegetation index from MYD13 PsnNet is photosynthesis product form MOD17LstNgtMYD is nighttime land surface temperature emissivity from MYD11 h Variables obtained fromDaymet daily dataset DayTmin is minimum temperature DayTmax is maximum temperature Daysrad isshortwave radiation Dayprc is precipitation i Daily rainfall rate from 3B42 TRMM (Tropical RainfallMeasuring Mission)

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 389

Code and data availability Database and programming code areavailable at httpsdoiorg105281zenodo3598595 (Guevara-Escobar et al 2020)

Author contributions AGE contributed to conceptualization for-mal analysis investigation methodology software supervisionvalidation visualization and writing of the original draft as wellas review and editing EGS was in charge of funding acquisitionproject administration and writing and review MCJ contributed toinvestigation methodology validation visualization and writing ofthe original draft as well as review editing and communicationMEQB helped with investigation validation writing and reviewHSA contributed with conceptualization writing and review ICAacutehelped with investigation writing and review VHCS contributedwith writing review and funding acquisition

Competing interests The authors declare that they have no conflictof interest

Acknowledgement We thank NASA and Ameriflux for data avail-ability Data filtering was carried out by the cluster Abacus ICentre of Applied Mathematics and High Performance ComputingABACUS-CINVESTAV

Financial support This research has been supported by CONA-CyT ndash SEMARNAT (grant no 2014-1-(249407))

Review statement This paper was edited by Paul Stoy and re-viewed by three anonymous referees

References

Aguirre-Diacuteaz G J Aguilloacuten-Robles A Tristaacuten-Gonzaacutelez MLabarthe-Hernaacutendez G Loacutepez-Martiacutenez M Bellon Hand Nieto-Obregoacuten J Geologic setting of the Pentildeade Bernal Natural Monument Quereacutetaro Meacutexico Anendogenous volcanic dome Geosphere 9 557ndash571httpsdoiorg101130GES008431 2013

Anav A Friedlingstein P Beer C Ciais P Harper AJones C Murray-Tortarolo G Papale D Parazoo N Cand Peylin P Spatiotemporal patterns of terrestrial grossprimary production A review Rev Geophys 53 785ndash818httpsdoiorg1010022015RG000483 2015

Anderson-Teixeira K J Delong J P Fox A M Brese D Aand Litvak M E Differential responses of production and res-piration to temperature and moisture drive the carbon balanceacross a climatic gradient in New Mexico Glob Change Biol17 410ndash424 httpsdoiorg101111j1365-2486201002269x2011

Baldocchi D Measuring fluxes of trace gases and energy be-tween ecosystems and the atmospherendashthe state and future of the

eddy covariance method Glob Change Biol 20 3600ndash3609httpsdoiorg101111gcb12649 2014

Barron-Gafford G A Sanchez-Cantildeete E P Minor R LHendryx S M Lee E Sutter L F Tran N Parra EColella T Murphy P C Hamerlynck E P Kumar P andScott R L Impacts of hydraulic redistribution on grassndashtreecompetition vs facilitation in a semi-arid savanna New Phytol215 1451ndash1461 httpsdoiorg101111nph14693 2017

Belayneh A and Tessema Z K Mechanisms of bush encroach-ment and its inter-connection with rangeland degradation insemi-arid African ecosystems a review J Arid Land 9 299ndash312 httpsdoiorg101007s40333-016-0023-x 2017

Bland J M and Altman D G Statistical methods for assessingagreement between two methods of clinical measurement Int JNurs Stud 47 931ndash936 2010

Boettiger C and Eddelbuettel D An introduction to rockerDocker containers 639 for r ArXiv Prepr ArXiv171003675 6402017

Bonilla-Moheno M and Aide T M Beyond deforestationLand cover transitions in Mexico Agric Syst 178 102734httpsdoiorg101016jagsy2019102734 2020

Booker K Huntsinger L Bartolome J W Sayre N Fand Stewart W What can ecological science tell us aboutopportunities for carbon sequestration on arid rangelands inthe United States Glob Environ Change 23 240ndash251httpsdoiorg101016jgloenvcha201210001 2013

Butzer K W and Butzer E K The lsquonaturalrsquovegetation ofthe Mexican Bajiacuteo archival documentation of a 16th-century savanna environment Quat Int 43 161ndash172httpsdoiorg101016S1040-6182(97)00032-3 1997

CICESE C de I C y de E S de E Base de datos climatoloacutegicanacional (CLICOM) Villa Bernal Quereacutetaro available at httpclicom-mexcicesemx (last access 28 August 2019) 2015

Cohen W B Maiersperger T K Yang Z Gower S TTurner D P Ritts W D Berterretche M and Run-ning S W Comparisons of land cover and LAI esti-mates derived from ETM+ and MODIS for four sites inNorth America A quality assessment of 20002001 provi-sional MODIS products Remote Sens Environ 88 233ndash255httpsdoiorg101016jrse200306006 2003

Cui T Wang Y Sun R Qiao C Fan W Jiang GHao L and Zhang L Estimating vegetation primaryproduction in the Heihe River Basin of China withmulti-source and multi-scale data PloS One 11 1ndash20httpsdoiorg101371journalpone0153971 2016

Curtis J T and McIntosh R P The Interrelations of Certain An-alytic and Synthetic Phytosociological Characters Ecology 31434ndash455 httpsdoiorg1023071931497 1950

Delgado-Balbuena J Yepez E A Aacutengeles-Peacuterez G Aguirre-Gutierrez C Arredondo T Ayala-Nintildeo F Bullock S HCastellanos A E Cueva A Figueroa-Espinoza B Garatuza-Payaacuten J Hinojo-Hinojo C Maya-Delgado Y Meacutendez-Barroso L Oechel W Paz-Pellat F Peacuterez-Ruiacutez E R Ro-driacuteguez J C Rojas-Robles N Saacutenchez-Mejiacutea Z M Uuh-Sonda J Vargas R Verduzco V S Vivoni E R andWatts C Flujos anuales de carbono en ecosistemas terrestres deMeacutexico in Estado Actual del Conocimiento del Ciclo del Car-bono y sus Interacciones en Meacutexico Siacutentesis a 2018 p 686Texcoco Estado de Meacutexico Meacutexico 2018

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

390 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Eshel G Dayalu A Wofsy S C Munger J W andTziperman E Listening to the Forest An ArtificialNeural Network-Based Model of Carbon Uptake at Har-vard Forest J Geophys Res-Biogeo 124 461ndash478httpsdoiorg1010292018JG004791 2019

Fu C Wang G Bible K Goulden M L Saleska S RScott R L and Cardon Z G Hydraulic redistributionaffects modeled carbon cycling via soil microbial activityand suppressed fire Glob Change Biol 24 3472ndash3485httpsdoiorg101111gcb14164 2018

Gebremichael M and Barros A P Evaluation of MODISGross Primary Productivity (GPP) in tropical mon-soon regions Remote Sens Environ 100 150ndash166httpsdoiorg101016jrse200510009 2006

Gitelson A A Peng Y Arkebauer T J and Suyker A EProductivity absorbed photosynthetically active radiation andlight use efficiency in crops Implications for remote sensingof crop primary production J Plant Physiol 177 100ndash109httpsdoiorg101016jjplph201412015 2015

Glenn E P Scott R L Nguyen U and Nagler P L Wide-area ratios of evapotranspiration to precipitation in monsoon-dependent semiarid vegetation communities J Arid Envi-ron 117 84ndash95 httpsdoiorg101016jjaridenv2015020102015

Goldstein A Turner W R Spawn S A Anderson-Teixeira K J Cook-Patton S Fargione J Gibbs H KGriscom B Hewson J H and Howard J F Protecting ir-recoverable carbon in Earthrsquos ecosystems Nat Clim Change10 1ndash9 httpsdoiorg101038s41558-020-0738-8 2020

Goulden M L Munger J W Fan S Daube B Cand Wofsy S C Measurements of carbon sequestra-tion by long-term eddy covariance Methods and a criti-cal evaluation of accuracy Glob Change Biol 2 169ndash182httpsdoiorg101111j1365-24861996tb00070x 1996

Grossiord C Buckley T N Cernusak L A Novick K A Poul-ter B Siegwolf R T Sperry J S and McDowell N GPlant responses to rising vapor pressure deficit New Phytol226 1550ndash1566 2020

Guerschman J P Hill M J Renzullo L J Barrett D JMarks A S and Botha E J Estimating fractional coverof photosynthetic vegetation non-photosynthetic vegetation andbare soil in the Australian tropical savanna region upscaling theEO-1 Hyperion and MODIS sensors Remote Sens Environ113 928ndash945 httpsdoiorg101016jrse200901006 2009

Guevara-Escobar A Gonzaacutelez-Sosa E Cervantes-Jimeacutenez MSuzaacuten-Azpiri H Queijeiro-Bolantildeos M E Carrillo-AacutengelesI and Cambroacuten-Sandoval V H Machine learning estimates ofeddy covariance carbon flux in a scrub in the Mexican highlandZenodo httpsdoiorg105281zenodo3598595 2020

Hall P Gill N Kurka M Phan W and Bartz A MachineLearning Interpretability with H2O Driverless AI edited byBartz A H2Oai Inc California US 2019

Hastings S J Oechel W C and Muhlia-Melo A Diur-nal seasonal and annual variation in the net ecosystem CO2exchange of a desert shrub community (Sarcocaulescent) inBaja California Mexico Glob Change Biol 11 927ndash939httpsdoiorg101111j1365-2486200500951x 2005

Henrich V Krauss G Goumltze C and Sandow C Entwicklungeiner Datenbank fuumlr Fernerkundungsindizes vol 45 AK Fern-erkundung Bochum 2012

Hill M Held A Leuning R Coops N Hughes D andCleugh H MODIS spectral signals at a flux tower site Rela-tionships with high-resolution data and CO2 flux and light useefficiency measurements Remote Sens Environ 103 351ndash368httpsdoiorg101016jrse200506015 2006

Hill M J and Guerschman J P The MODIS Global VegetationFractional Cover Product 2001ndash2018 Characteristics of Vegeta-tion Fractional Cover in Grasslands and Savanna Woodlands Re-mote Sens 12 406 httpsdoiorg103390rs12030406 2020

Hinojo-Hinojo C Castellanos A E Huxman T Ro-driguez J C Vargas R Romo-Leoacuten J R and Bie-derman J A Native shrubland and managed buffelgrasssavanna in drylands Implications for ecosystem carbonand water fluxes Agr Forest Meteorol 268 269ndash278httpsdoiorg101016jagrformet201901030 2019

Jaumlrvi L Havu M Ward H C Bellucco V McFad-den J P Toivonen T Heikinheimo V Kolari P Ri-ikonen A and Grimmond C S B Spatial modeling oflocal-scale biogenic and anthropogenic carbon dioxide emis-sions in Helsinki J Geophys Res-Atmos 124 8363ndash8384httpsdoiorg1010292018JD029576 2019

Joiner J and Yoshida Y Satellite-based reflectances capture largefraction of variability in global gross primary production (GPP)at weekly time scales Agr Forest Meteorol 291 108092httpsdoiorg101016jagrformet2020108092 2020

Jung M Schwalm C Migliavacca M Walther S Camps-VallsG Koirala S Anthoni P Besnard S Bodesheim P Carval-hais N Chevallier F Gans F Goll D S Haverd V KoumlhlerP Ichii K Jain A K Liu J Lombardozzi D Nabel J EM S Nelson J A OrsquoSullivan M Pallandt M Papale DPeters W Pongratz J Roumldenbeck C Sitch S TramontanaG Walker A Weber U and Reichstein M Scaling carbonfluxes from eddy covariance sites to globe synthesis and eval-uation of the FLUXCOM approach Biogeosciences 17 1343ndash1365 httpsdoiorg105194bg-17-1343-2020 2020

Kang M Ichii K Kim J Indrawati Y M Park J Moon MLim J-H and Chun J-H New Gap-Filling Strategies forLong-Period Flux Data Gaps Using a Data-Driven ApproachAtmosphere 10 568 httpsdoiorg103390atmos101005682019

Lal R Griffin M Apt J Lave L and Morgan M GManaging Soil Carbon Science 304 393ndash393httpsdoiorg101126science1093079 2004

LI-COR B Eddypro Software instruction manual Nebraska USversion 70 2019

Liu Y Liu R Pisek J and Chen J M Separating overstoryand understory leaf area indices for global needleleaf and de-ciduous broadleaf forests by fusion of MODIS and MISR dataBiogeosciences 14 1093ndash1110 httpsdoiorg105194bg-14-1093-2017 2017

Long S P Marshall-Colon A and Zhu X-G Meetingthe Global Food Demand of the Future by EngineeringCrop Photosynthesis and Yield Potential Cell 161 56ndash66httpsdoiorg101016jcell201503019 2015

Ma X Huete A Yu Q Restrepo-Coupe N Beringer J Hut-ley L B Kanniah K D Cleverly J and Eamus D Param-

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 391

eterization of an ecosystem light-use-efficiency model for pre-dicting savanna GPP using MODIS EVI Remote Sens Environ154 253ndash271 2014

Ma X Migliavacca M Wirth C Bohn F J Huth ARichter R and Mahecha M D Monitoring Plant FunctionalDiversity Using the Reflectance and Echo from Space RemoteSens 12 1248 httpsdoiorg103390rs12081248 2020

Marcolla B Roumldenbeck C and Cescatti A Patterns and con-trols of inter-annual variability in the terrestrial carbon budgetBiogeosciences 14 3815ndash3829 httpsdoiorg105194bg-14-3815-2017 2017

Mastachi-Loza C A Gonzaacutelez-Sosa E Becerril-Pintildea R andBraud I Peacuterdidas por intercepcioacuten en mezquite (Prosopis laevi-gata) y huizache (Acacia farnesiana) de la regioacuten semiaacuterida delcentro de Meacutexico Tecnol Cienc Agua 1 103ndash120 2010

Mauder M and Foken T Documentation and instruction man-ual of the eddy-covariance software package TK3 UniversitaumltBayreuth Abteilung Mikrometeorologie Bayreuth Germany2011

McClaran M A century of vegetation change on the Santa Rita Ex-perimental Range in Santa Rita Experimental Range 100 years(1903 to 2003) to accomplishments and contributions edited byMcClaran M P Ffolliott P F and Edminister C B Tech Co-ords USDA Forest service proceedings RMRS-P-30 TucsonAZ US 16ndash33 2003

McCulley R Jobbagy E Pockman W and Jackson R Nu-trient uptake as a contributing explanation for deep rootingin arid and semi-arid ecosystems Oecologia 141 620ndash628httpsdoiorg101007s00442-004-1687-z 2004

McElreath R Statistical Rethinking A Bayesian Course with Ex-amples in R and STAN CRC Press Boca Raton FL US 2020

Michailidis M How Driverless AI Prevents Overfitting and Leak-age Open Source Lead AI ML available at httpswwwh2oaiblogdriverless-ai-prevents-overfitting-leakage (last access7 July 2020) 2018

Mitchell R and Frank E Accelerating the XGBoost algo-rithm using GPU computing PeerJ Comput Sci 3 e127httpsdoiorg107717peerj-cs127 2017

Monteith J Solar radiation and productivity in tropical ecosys-tems J Appl Ecol 9 747ndash766 1972

Murray-Tortarolo G Friedlingstein P Sitch S Jaramillo V JMurguiacutea-Flores F Anav A Liu Y Arneth A Arvanitis AHarper A Jain A Kato E Koven C Poulter B Stocker BD Wiltshire A Zaehle S and Zeng N The carbon cycle inMexico past present and future of C stocks and fluxes Biogeo-sciences 13 223ndash238 httpsdoiorg105194bg-13-223-20162016

Nash J E and Sutcliffe J V River flow forecasting through con-ceptual models part I ndash A discussion of principles J Hydrol 10282ndash290 1970

Owen N A Choncubhair Oacute N Males J del Real Laborde J IRubio-Corteacutes R Griffiths H and Lanigan G Eddy covari-ance captures four-phase crassulacean acid metabolism (CAM)gas exchange signature in Agave Plant Cell Environ 39 295ndash309 httpsdoiorg101111pce12610 2016

Pasetto D Arenas-Castro S Bustamante J Casagrandi RChrysoulakis N Cord A F Dittrich A Domingo-Marimon C El Serafy G and Karnieli A Integration ofsatellite remote sensing data in ecosystem modelling at local

scales Practices and trends Methods Ecol Evol 9 1810ndash1821httpsdoiorg1011112041-210X13018 2018

Pastorello G Trotta C and Canfora E The FLUXNET2015dataset and the ONEFlux processing pipeline for eddy covari-ance data Sci Data 7 225 pp httpsdoiorg101038s41597-020-0534-3 2020

Pearcy R W Heterogeneous light environments in Handbookof Functional Plant Ecology edited by Pugnaire I F and Val-ladares F CRC Press Boca Ratoacuten FL US 270ndash314 2007

Piao S Liu Q Chen A Janssens I A Fu Y Dai J Liu LLian X Shen M and Zhu X Plant phenology and global cli-mate change Current progresses and challenges Glob ChangeBiol 25 1922ndash1940 httpsdoiorg101111gcb14619 2019

R Development Core Team R A language and environment forstatistical computing R Foundation for Statistical Computingavailable at httpswwwr-projectorg (last access 15 October2015) 2009

Reichstein M Falge E Baldocchi D Papale D Aubinet MBerbigier P Bernhofer C Buchmann N Gilmanov Tand Granier A On the separation of net ecosystem ex-change into assimilation and ecosystem respiration reviewand improved algorithm Glob Change Biol 11 1424ndash1439httpsdoiorg101111j1365-24862005001002x 2005

Richardson A D and Hollinger D Y Statistical model-ing of ecosystem respiration using eddy covariance datamaximum likelihood parameter estimation and Monte Carlosimulation of model and parameter uncertainty applied tothree simple models Agr Forest Meteorol 131 191ndash208httpsdoiorg101016jagrformet200505008 2005

Richardson A D Hollinger D Y Shoemaker J K Hughes HSavage K and Davidson E A Six years of ecosystem-atmosphere greenhouse gas fluxes measured in a sub-borealforest Sci Data 6 1ndash15 httpsdoiorg101038s41597-019-0119-1 2019

Roby M C Scott R L Barron-Gafford G A Hamerlynck E Pand Moore D J P Environmental and Vegetative Controls onSoil CO2 Efflux in Three Semiarid Ecosystems Soil Syst 3 6httpsdoiorg103390soilsystems3010006 2019

Running S W and Zhao M Daily GPP and annual NPP(MOD17A2A3) products NASA Earth Observing SystemMODIS land algorithm Univ of Mont Missoula Mont US2015

Running S W Nemani R R Heinsch F A Zhao MReeves M and Hashimoto H A continuous satellite-derived measure of global terrestrial primary productionBioscience 54 547ndash560 httpsdoiorg1016410006-3568(2004)054[0547ACSMOG]20CO2 2004

Rzedowski J Vegetacioacuten de Meacutexico Meacutexico Editor Limusa1978

Scott R L and Biederman J A Critical Zone Water Balance Over13 Years in a Semiarid Savanna Water Resour Res 55 574ndash588 httpsdoiorg1010292018WR023477 2019

Scott R L Watts C Payan J G Edwards E Goodrich D CWilliams D and James Shuttleworth W The understory andoverstory partitioning of energy and water fluxes in an opencanopy semiarid woodland Agr Forest Meteorol 114 127ndash139 httpsdoiorg101016S0168-1923(02)00197-1 2003

Scott R L Edwards E A Shuttleworth W J Huxman T EWatts C and Goodrich D C Interannual and seasonal vari-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

392 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

ation in fluxes of water and carbon dioxide from a ripar-ian woodland ecosystem Agr Forest Meteorol 122 65ndash84httpsdoiorg101016jagrformet200309001 2004

Scott R L Biederman J A Hamerlynck E P and Barron-Gafford G A The carbon balance pivot point of southwest-ern US semiarid ecosystems Insights from the 21st centurydrought J Geophys Res-Biogeo 120 2612ndash2624 2015

Segerstrom K Geology of the Bernal-Jalpan Area Estado deQueretaro Mexico Geol Surv Bull 1104 19ndash82 1961

Soper F M McCalley C K Sparks K and Sparks J P Soilcarbon dioxide emissions from the Mojave desert Isotopic evi-dence for a carbonate source Geophys Res Lett 44 245ndash251httpsdoiorg1010022016GL071198 2017

Stevens N T Steiner S H and MacKay R J Assessing agree-ment between two measurement systems An alternative to thelimits of agreement approach Stat Methods Med Res 262487ndash2504 httpsdoiorg1011770962280215601133 2015

Tagesson T Ardouml J Cappelaere B Kergoat L AbdiA Horion S and Fensholt R Modelling spatial andtemporal dynamics of gross primary production in theSahel from earth-observation-based photosynthetic capacityand quantum efficiency Biogeosciences 14 1333ndash1348httpsdoiorg105194bg-14-1333-2017 2017

Thornton P Thornton M Mayer B Wei Y Devarakonda RVose R and Cook R Daymet Daily Surface Weather Data ona 1-km Grid for North America Version 3 Oak Ridge NationalLab (ORNL) Oak Ridge TN US 2017

Tramontana G Jung M Schwalm C R Ichii K Camps-VallsG Raacuteduly B Reichstein M Arain M A Cescatti A KielyG Merbold L Serrano-Ortiz P Sickert S Wolf S andPapale D Predicting carbon dioxide and energy fluxes acrossglobal FLUXNET sites with regression algorithms Biogeo-sciences 13 4291ndash4313 httpsdoiorg105194bg-13-4291-2016 2016rdquo

Truong A Walters A Goodsitt J Hines K Bruss C B andFarivar R Towards automated machine learning Evaluationand comparison of AutoML approaches and tools in Proceed-ings of the 2019 IEEE 31st International Conference on Toolswith Artificial Intelligence (ICTAI) Portland OR US 1471ndash1479 2019

Verbist K Santibantildeez F Gabriels D and Soto G Atlas dezonas aacuteridas de Ameacuterica Latina y El Caribe Donoso MC(Coord) UNESCO-PHI and CAZALAC Montevideo Uruguay2010

Wang H Li X Ma M and Geng L Improving esti-mation of gross primary production in dryland ecosystemsby a model-data fusion approach Remote Sens 11 225httpsdoiorg103390rs11030225 2019

Wang X Ma M Li X Song Y Tan J Huang G Zhang ZZhao T Feng J and Ma Z Validation of MODIS-GPP prod-uct at 10 flux sites in northern China Int J Remote Sens 34587ndash599 httpsdoiorg101080014311612012715774 2013

Wang X Wang H Li X and Ran Y Photosynthesis (NPPNEP Respiration) Obs Meas Ecohydrol Process 1ndash30httpsdoiorg101007978-3-662-47871-4_10-1 2018

Wheeler C W Archer S R Asner G P and McMurtry C RClimaticedaphic controls on soil carbonnitrogen response toshrub encroachment in desert grassland Ecol Appl 17 1911ndash1928 httpsdoiorg10189006-15801 2007

Wilcox B P Birt A Fuhlendorf S D and Archer S R Emerg-ing frameworks for understanding and mitigating woody plantencroachment in grassy biomes Curr Opin Env Sust 32 46ndash52 httpsdoiorg101016jcosust201804005 2018

Wilson J B Cover plus ways of measuring plant canopiesand the terms used for them J Veg Sci 22 197ndash206httpsdoiorg101111j1654-1103201001238x 2011

Wofsy S Goulden M Munger J Fan S-M Bakwin PDaube B Bassow S and Bazzaz F Net exchange ofCO2 in a mid-latitude forest Science 260 1314ndash1317httpsdoiorg101126science26051121314 1993

Wu C Niu Z and Gao S Gross primary production estimationfrom MODIS data with vegetation index and photosyntheticallyactive radiation in maize J Geophys Res-Atmos 115 D12127httpsdoiorg1010292009JD013023 2010

Wutzler T Lucas-Moffat A Migliavacca M Knauer JSickel K Šigut L Menzer O and Reichstein MBasic and extensible post-processing of eddy covarianceflux data with REddyProc Biogeosciences 15 5015ndash5030httpsdoiorg105194bg-15-5015-2018 2018

Xiao H McDonald-Madden E Sabbadin R Peyrard NDee L E and Chadegraves I The value of understanding feed-backs from ecosystem functions to species for managing ecosys-tems Nat Commun 10 1ndash10 httpsdoiorg101038s41467-019-11890-7 2019

Xu B Yang Y Li P Shen H and Fang J Globalpatterns of ecosystem carbon flux in forests A biometricdata-based synthesis Global Biogeochem Cy 28 962ndash973httpsdoiorg1010022013GB004593 2014

Yao J Liu H Huang J Gao Z Wang G Li D Yu H andChen X Accelerated dryland expansion regulates future vari-ability in dryland gross primary production Nat Commun 111ndash10 httpsdoiorg101038s41467-020-15515-2 2020

Yepez E A Williams D G Scott R L and Lin GPartitioning overstory and understory evapotranspiration ina semiarid savanna woodland from the isotopic composi-tion of water vapor Agr Forest Meteorol 119 53ndash68httpsdoiorg101016S0168-1923(03)00116-3 2003

Yona L Cashore B Jackson R B Ometto J and Brad-ford M A Refining national greenhouse gas inventories Am-bio 1ndash6 httpsdoiorg101007s13280-019-01312-9 2020

Zeng X Zeng X and Barlage M Growing temperate shrubsover arid and semiarid regions in the Community Land ModelndashDynamic Global Vegetation Model Global Biogeochem Cy 22GB3003 httpsdoiorg1010292007GB003014 2008

Zhang A Jia G Epstein H E and Xia J ENSO elicits oppos-ing responses of semi-arid vegetation between Hemispheres SciRep 7 42281 httpsdoiorg101038srep42281 2017

Zhang L Xiao J Zheng Y Li S and Zhou Y Increased carbonuptake and water use efficiency in global semi-arid ecosystemsEnviron Res Lett 15 034022 httpsdoiorg1010881748-9326ab68ec 2020

Zhu X-G Long S P and Ort D R What is the max-imum efficiency with which photosynthesis can convert so-lar energy into biomass Curr Opin Biotech 19 153ndash159httpsdoiorg101016jcopbio200802004 2008

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

  • Abstract
  • Introduction
  • Materials and methods
    • Site description
    • Eddy covariance measurements
    • Flux data processing
    • Remotely sensed data
    • MODIS algorithm for GPP
    • Santa Rita site dataset
    • Modeling
    • Variable importance
    • Model agreement
      • Results
        • Eddy covariance fluxes at Bernal
        • Machine learning ensembles as predictors of eddy covariance GPP
        • MODIS as predictor of eddy covariance GPP
        • Prediction eddy covariance GPP with ordinary-least-squares multiple regression
        • Agreement
          • Discussion
            • Model agreement
            • MODIS discrepancies
            • Forcing variables and machine learning
            • Carbon flux
            • Ecosystem management
              • Conclusions
              • Appendix A
              • Appendix B
              • Appendix C
              • Code and data availability
              • Author contributions
              • Competing interests
              • Acknowledgement
              • Financial support
              • Review statement
              • References
Page 8: Machine learning estimates of eddy covariance carbon flux ... · Scrub is a xeric category of shrublands characterized by plants with small leaves; it is very thorny, and its biomass

374 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

or negative values indicate better model performance (McEl-reath 2020)

A stack of EML solutions was based on a random sam-ple of the dataset for training the model For the Bernal site85 was used for training for US-SRM 80 was usedThe AutoML function was run 20 times each run added ap-proximately 48 models to the leader board and ranked thebest-performing models by their deviance Each run splitsthe training data 10 times for k-fold cross-validation Theseed for an EML that is dependent on randomization waschanged in every run The stopping rule for each run wasset at 100 s and the maximum memory allocation pool forH2O was 100 GB in a single workstation with dual Xeon2680 v4 processors and 128 GB of RAM The H2O packagewas installed in a rockergeospatial docker container (imageavailable at httpshubdockercomrrockergeospatial lastaccess 9 January 2021) which is a portable scalable and re-producible environment (Boettiger and Eddelbuettel 2017)

Two sets of predictions for the GPP at the Bernal site wereobtained from the stacked ensemble The first set of predic-tions was based on the 15 of the Bernal site data reservedfor testing The second set of predictions was obtained byrefeeding the US-SRM site model with the Bernal site ex-planatory variables The first set of predictions would showthe importance of local data to predict EC-based GPP Thesecond set of predictions would represent the suitability ofoff-site data to predict EC-based GPP If the second scenariohas good agreement then an EML model could be used torepresent wider areas of the ecosystem

28 Variable importance

The variable importance within individual models was usedto answer the question of which environmental variableswere important for GPP prediction For the ldquoall-models en-semblerdquo and ldquobest-of-family ensemblerdquo generated by Au-toML it is not possible to examine the variable importance orthe contribution of the individual models to the stack (H2Oai2017) Therefore a weight (wi)was calculated using Eq (4)which is adequate for other information criteria besides theAkaike weights this weight is an estimate of the conditionalprobability that the model will make the best predictions onnew data considering the set of models (McElreath 2020)Then the importance of each variable () was multipliedby the modelrsquos weight (wi) and then added by variable tobuild the variable importance index This index would mea-sure how often a given variable was used on the leader board

wi =exp

(minus

12 dWAICi

)summj=1 exp

(minus

12 dWAICi

) (4)

dWAIC= deviancei minus deviance of top-performing model(5)

29 Model agreement

Calibration and agreement between methods of measure-ment are different procedures Calibration compares knownquantities of the true value or measurements made bya highly accurate method (a gold standard) against themeasurements of a new or a contending method When twomethods of measurement are compared neither providesan unequivocally correct measurement because both havea measurement error and the true value remains unknown(Bland and Altman 2010) Stevens et al (2015) proposes theprobability of agreement (θs) as a plot metric to representthe agreement between two measurement systems acrossa range of plausible values The θs method addressessome of the challenges of the accepted ldquolimits agreementmethodrdquo presented by Bland and Altman (2010) Besides theagreement plot agreement is based on maximum likelihoodbias parameters α and β quantifying the fixed bias andthe proportional bias If α = 0 β = 1 and σ1= σ2 thenthe two measurements are identical σj is the measurementvariation The probability-of-agreement analysis was per-formed using the ProbAgreeAnalysis (httpsuwaterloocabusiness-and-industrial-statistics-research-groupsoftwarelast access 28 November 2020) in MATLAB 94 (Math-Works Inc) An arbitrary 1 gCmminus2 dminus1 was considered tobe a tolerable magnitude to conclude that agreement is suf-ficient enough to use either estimated GPP interchangeablyThe reference measurement was the GPP obtained fromEC data at the Bernal site and tested against the MODISMOD17 model the OLS predictions or each of the two setsof EML predictions If the probability-of-agreement plotsuggested disagreement between two measurement systemsthen the predictions can be adjusted using

Adjusted predictor= (predictorminusα)β (6)

3 Results

31 Eddy covariance fluxes at Bernal

Dominant flow at the Bernal site was from the northeast(Fig 2) Energy balance closure for this site had a slopeof 072 and r2

= 092 (Fig 5) Homogeneous sites of theFluxnet network obtain closure percentages higher than72 and for the Bernal site the vegetation heterogeneitywas important (see below) Average H was always nega-tive during nighttime but during some months of the dryand rainy season λE was positive particularly after dawn(Fig 6) so as to allow nighttime evaporation from the soil orvegetation However during some months of the dry seasonrainfall was low (Fig 7) then the positive λE suggested thatcacti could have an active gas exchange at that time

Carbon dioxide absorptions had a diurnal behavior begin-ning at dawn and ending before sunset (Fig 8) Nighttimeflux was positive indicating respiration notwithstanding the

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 375

Figure 5 Closure of the surface energy balance from eddy covari-ance measurements averaged at 30 min between the turbulent fluxes(H + λE) and available fluxes (Rn-G) Data are from 30 March2017 to 22 August 2018 at the Bernal site The regression wasy = 2302+072times with adjusted r2

= 092 The diagonal line rep-resents the 1 1 relation

presence of cacti Although summer rains are characteristicof the climate at the Bernal site (Fig 7) a negative NEE fluxoccurred in all measured months The lowest CO2 flux wasrecorded in January and February 2017 and in May 2018 (Ta-ble 1) This behavior resulted from the phenology of the veg-etation since most species lost their leaves in the dry seasonand was also due to the effect of low temperature Withinthe rainy season the flux of CO2 increased compared to themonths of January to June The correlation between NEE andprecipitation was minus045 When the sum of the precipitationof the current month and that of the previous month was con-sidered the correlation with NEE was minus07 suggesting thatcontinuous availability of soil moisture is important for theabsorption of CO2 in this environment

The scrub at Bernal was heterogeneous in botanical com-position A total of 24 species of cacti and shrub wereidentified on average each sampling plot had 103 speciesThe IVI was similar between all cacti (036plusmn 004) shrublegumes (038plusmn004) and other shrubs (023plusmn006) sampledMost sampling plots were in areas of high flux frequency(Fig 2) Cylindropuntia imbricata had the largest IVI fol-lowed by Acacia farnesiana Acacia schaffneri and Prosopislaevigata The IVI of the herbaceous stratum representedby grasses was not characterized due to the state of over-grazing and the absence of reproductive structures in plantswhich made measurement of their abundances frequenciesand dominance difficult The grass genera present were Meli-nis Chloris Cynodon and Cenchrus all corresponding to in-

Table 1 Daily average values of the net ecosystem exchange(NEE) gross primary productivity (GPP) and ecosystem respira-tion (Reco) in a scrub at the Bernal site Negative values of NEEindicate photosynthetic absorption

NEE GPP RecomicromolCO2 mminus2 sminus1

2017

JANFEBMARAPR minus054 248 194MAY 005 286 291JUN 038 421 459JUL minus100 178 077AUGSEPOCT minus126 525 399NOV minus013 265 252DEC 004 180 184

2018

JAN minus005 129 124FEB 006 188 194MAR minus094 245 151APR minus058 287 229MAY minus029 377 348JUN minus252 433 181JUL minus283 823 541AUG minus193 921 728

vasive C4 tropical grasses Scrub species of higher IVI hada similar LAI (12) although the magnitude of the LAI of Plaevigata stood out (Table 2)

32 Machine learning ensembles as predictors of eddycovariance GPP

In this section we describe the modeling with EML using lo-cal remotely sensed data from the Bernal site to predict GPPat the same site and then the agreement between EML GPPpredictions and EC-derived GPP The AutoML function gen-erated 1031 models with an average deviance of 135 whilethe deviance of the leader model was 063 in the trainingdataset (Table 3) A total of 11 models of type GBM andfive XGBoost models were in the top 30 models along withthe nine best-of-family ensembles and five all-models ensem-bles The weighted variable importance on the leader boardwas higher for the LAI from the MOD15 and MCD15 prod-ucts (17 and 14 ) The PsnNet EVI (MOD17) FPAR(MCD15) green atmospherically resistant vegetation index(GARI) and MODIS reflectance band 13 had an importancehigher than 3 (59 54 42 36 and 30 ) LAI(MCD13 and MOD13) PsnNet and the FPAR (MOD15)were the more important variables (20 17 13 and10 ) in the top nonstacked model a GBM model that wasranked in fourth place

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

376 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 6 Latent heat flux daily trend at Bernal during different months emphasizing nighttime λE Panel (a) shows months with low rainfallin the previous month and predominantly negative λE during nighttime (these months had low rainfall 017 113 and 033 mm rainfall forJanuary March and December) Panel (b) shows months with low rainfall in the previous month and positive λE after sunset (67 73 and207 mm rainfall for February April and May) Panel (c) shows months during the rainy season with positive λE mainly due to soil wetnessand antecedent rainfall of 43 202 9 190 and 34 mm for June July August October and November Plot (c) is out of scale on the y axis forcompatibility with the other plots

Table 2 Importance value index (IVI) and leaf area index (LAI) of the main species present at the Bernal Quereacutetaro study site

Species Plant type IVI SEMa LAI SEM

Coryphantha cornifera Cactus 007 027Bouvardia ternifolia Herb 007 027Karwinskia humboldtiana Shrub 007 027Forestiera phillyreoides Shrub 009 027Ferocactus latispinus Cactus 009 027Cylindropuntia leptocaulis Cactus 009 027Asphodelus fistulosus Shrub 009 019Brickellia veronicifolia Shrub 010 027Dalea lutea Shrub 011 015Eysenhardtia polystachya Legume 013 015Myrtillocactus geometrizans Cactus 014 027Schinus molle Shrub 014 019Jatropha dioica Herb 015 019Mammillaria uncinata Cactus 016 012Opuntia tomentosa Cactus 017 011Opuntia robusta Cactus 023 007Opuntia hyptiacantha Cactus 026 007Mimosa monancistra Legume 028 012Mimosa depauperata Legume 031 012Zaluzania augusta Shrub 033 010Viguiera linearis Herb 036 011Acacia schaffneri Legume 041 007 113 015Prosopis laevigata Legume 041 007 148 012Acacia farnesiana Legume 056 009 112 037Cylindropuntia imbricata Cactus 074 007 113 011

a SEM standard error of the mean

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 377

Figure 7 Monthly rainfall and land surface temperature (LST) dur-ing years (a) 2017 and (b) 2018 at the Bernal site The LST valuescorrespond to the 1330 (LST day) and 0130 (LST night) MODISAqua satellite overpasses

Predictions of GPP in the testing dataset showed disper-sion in the lower range of the scale of measurement andthe correlation was 094 (Fig 9a) The final prediction ofGPP for the whole dataset had a probability of agreement(θs) of 058plusmn 001 (parameter estimate and standard error)α =minus00616plusmn 011 and β = 10133plusmn 002 suggesting agood fit with low fixed and proportional bias (Fig 9b) Theprobability of agreement decreased slightly at the lower andupper range of the scale of measurement (Fig 9c) indicatingthat the EML model would predict GPP without increasingthe bias particularly in the range from 0 to 4 gCmminus2 dminus1However the value of θs should be higher than 095 so as toconsider EC measurements and EML to be interchangeableThe correlation of 099 (r2

= 098) for the data in Fig 9bcould be misleading as it would suggest a very good fit

Using only the five of the more important variables namedabove to generate an EML resulted in an XBoost leadermodel with 273 deviance and a total of 1094 models Thetop 30 models were 16 XBoost and GBM models and the

best-of-family ensembles started to show up in 12th placeAlthough the number of runs was the same (20) the AutoMLfunction increased the number of produced models but thesmaller set of explanatory variables constrained the ability toidentify features contributing to better models Using anotherset of five randomly selected explanatory variables (one veg-etation index and four MODIS bands) resulted in a leadermodel with 252 deviance out of 1024 models but this timethe leader was the best-of-the-family ensemble Using only afew variables was considered to increase the deviance com-pared to the average deviance of 135 obtained during thetraining phase and using all available variables

An important question for modeling upscaling is the ca-pacity to extrapolate results temporally and spatially herewe explored the latter posing the following question wouldpredictions of GPP from EML for an EC site agree withEC observations from another site with ldquosimilarrdquo environ-mental conditions First an EML solution was found train-ing 80 of the Santa Rita dataset and obtaining a best-of-family ensemble with 023 deviance out of 634 trainedmodels (hereafter this model is referred to as the Santa Ritamodel) Then the environmental and remotely sensed datafrom the Bernal site were fed into the Santa Rita model thiswould be an external validation dataset However agreementwas not good the mean value of θs was 015plusmn 001 withα =minus10822plusmn 009 and β = 058127plusmn 002 The value ofθs was not constant across the range of measurement and de-creased rapidly after 2 gCmminus2 dminus1 (Fig 10b) Because thebias was important predictions were adjusted using Eq (6)showing some improvement with r = 078 (Fig 10a) Com-paring Figs 5b and 6a it is evident that an EML modelextrapolation to other conditions is noisier ie Santa Ritamodel trying to represent the ecosystem function at BernalNotwithstanding some of the most important variables wereshared by both EML ensembles Bernal and Santa Rita in thecase of Santa Rita LAI from MOD15 MYD15 and MCD15had 350 48 and 31 of variable importance and theFPAR from MOD15 was 12

33 MODIS as predictor of eddy covariance GPP

MODIS is important because it overpasses every point ofthe earth every 1 or 2 d and it implements a GPP prod-uct (MOD17) that has helped track the response of the bio-sphere to the environment since 2000 The product MOD17has been validated against many EC sites but few validationsites correspond to deserts and semideserts (Running et al2004) The GPP MOD17 underestimated the GPP derivedfrom EC data at Bernal (Fig 11a) In a similar bell-shapeddistribution of θs as in the case of the extrapolation of theSanta Rita site (Fig 10b) here the θs was not constant acrossthe range of measurement mean θs was 024plusmn 013 withα = 000047plusmn0087 and β = 048749plusmn002 (Fig 11c) Ad-justing MOD17 estimates with Eq (6) improved the relation-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

378 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 8 Net ecosystem exchange (NEE) and photosynthetic active radiation (PAR) at the Bernal site in (a) 2017 and (b) 2018 Negativevalues in the CO2 flux indicate photosynthesis The gray shadow is the standard error of the mean for each month at any given hour

Table 3 Leader board of EML models for the Bernal site using 85 of day observations as training dataset NA denotes the outcome wherethe type of model was not present in the 30 top-performing models according to deviance

Type of model Number of models Average deviance

All models Top models Leader model

Stacked ensembleAll models 20 0852 0834Best of family 20 0812 0703 0633GBM 453 1366 0796DRF 20 1108 NAXGBoost 277 1385 0778XRT 20 1077 NADeep learning 201 2072 NAGLM 20 1474 NATotal 1031 1356

ship but note that the value of r (076) was the same for theoriginal MOD17 and the adjusted MOD17 (Fig 11b)

34 Prediction eddy covariance GPP withordinary-least-squares multiple regression

OLS is a common estimation method for linear models andhere this model appeared as adequate judging by the gen-eral distribution of predictions (Fig 12a) and the probability-of-agreement plot (Fig 12b) A total of 14 variables wereincluded in the model all of them with VIF values lowerthan 70 (Appendix C) the VIF statistic quantifies the sever-ity of multicollinearity and an acceptable threshold is 10The most significant variables were the EVI from MYD13and Daymet variables precipitation shortwave radiation and

minimum and maximum temperatures (see Appendix B forvariable details) Variables with high coefficient values wereMODIS reflectance band 14 (917) the EVI from MYD13(853) and the NDVI (39) while Daymet temperatures hadsmall coefficientsminus023 for maximum temperature and 017for minimum temperature The θs decreased slightly at theend of the measurement range mean θs was 05plusmn0014 withα = 018845plusmn0137 and β = 094966plusmn0031No correctionfor this model was calculated since α was close to 0 β wasclose to 1 and the EML model had a higher θs

35 Agreement

The probability of agreement (θs) was the statistic to deter-mine if the mean responses were in agreement Comparing

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 379

Figure 9 Agreement between predictions of GPP obtained with machine learning algorithms or derived from eddy covariance measurementsat the Bernal site (a) Example of one run of predictions of GPP in the test dataset from the Bernal site using the leader model of an ensembleof machine learning algorithms (EML) the test dataset was 15 of data with r = 094 (b) Predictions for the complete dataset withr = 099 the diagonal line is the 1 1 agreement (c) Function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreementbetween methods of estimation of GPP corresponding to plot (b) The horizontal axis (s) represents the magnitude of the measurement thevertical axis is the probability of agreement for the measurement and the red line is the confidence interval p lt 005

Figure 10 (a) Adjusted predictions of GPP for the complete dataset from the Bernal site using the leader model of the final ensembleof machine learning algorithms (EML) derived from the Santa Rita site compared to estimates of GPP from EC data with r = 078 (b)Respective function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreement between methods of estimation of GPP ECdata from the Bernal site and EML model for the Santa Rita site The horizontal axis (s) represents the magnitude of the measurement thevertical axis is the probability of agreement for the measurement and the red line is the confidence interval p lt 005

the confidence intervals (CIs) for θs the best modeling ap-proach was the EML because its CI (056ndash059) was differentfrom that of the OLS model (047ndash052) More importantlyfor the EML and OLS models the values of θs had little vari-ation across the range of GPP estimates However the CIsof the EML and OLS models were similar in their bias es-timates α and β and both models had no bias (p lt 005)Altogether the best result was the EML ensemble using en-vironmental and remotely sensed data corresponding to thesame site ie Bernal (Fig 10c) This kind of EML wouldbe useful for gap filling or the evaluation of GPP time se-ries of the site that generated the model Machine learningalgorithms can fill gaps longer than 30 d (Kang et al 2019)

The second option to estimate GPP using the samedatasets was the multiple-regression OLS model (Fig 12a)The multiple regression is straightforward and here multi-collinearity was not a problem The EML ensemble and theOLS regression have the highest values of θs (058 and 05respectively Fig 12b) Higher values for θs are desirable(gt 095) and this could be achieved by increasing the samplesize relaxing the tolerable magnitude for agreement (here itwas set at 1 gCmminus2 dminus1) or perhaps using different forcingvariables

The MODIS estimates were a third-best alternative sincethe mean θs was 024 In the present study we used a splineto fill the data to a daily time step since the MCD17 is an8 d composite product but a similar result was obtained if

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

380 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 11 (a) MOD17 GPP daily time step from the Bernal site versus estimates of GPP from EC data using the complete dataset at adaily time step derived from spline for MOD17 GPP (copy) with r = 076 or 8 d composite estimates obtained by 8 d averages of EC GPP() with r = 076 (b) Adjusted predictions of GPP from MODIS (c) Corresponding function of probability of agreement for Fig 7a using1 g Cmminus2 dminus1 as a tolerable agreement between methods of estimation of GPP EC data from the Bernal site and MODIS MOD17 Thehorizontal axis (s) represents the magnitude of the measurement The vertical axis is the probability of agreement for the measurement

Figure 12 (a) Ordinary-least-squares multiple-regression estimates of GPP for the complete dataset from the Bernal site versus estimates ofGPP from EC data with r = 082 (b) Respective function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreement betweenmethods of estimation of GPP EC data from the Bernal site and OLS multiple regression The horizontal axis (s) represents the magnitudeof the measurement The vertical axis is the probability of agreement for the measurement

the EC GPP was rescaled and compared to the original 8 dMCD17 data (Fig 11a) The GPP from MODIS was an un-derestimate of EC GPP and when the estimates were ad-justed (Eq 6 Fig 11b) the performance was not better thanthe OLS (model not shown) The MODIS land cover classi-fication represented this site as grassland (MCD12) and thiscould be another reason for the poor agreement besides theassumptions made in the MODIS algorithm regarding the εand εmax parameters and the response of vegetation to VPDAgreement of MODIS GPP is crucial because MODIS prod-ucts are frequently used in country-wide assessments of thecarbon cycle and can influence public policies

The model with least agreement resulted when the EMLensemble generated from the Santa Rita site was used to pre-dict GPP at Bernal Machine learning models can make pre-dictions but their usefulness decreases when they are usedoutside the context of where they were built while process-

based mechanistic models have this ability Although the Au-toML function in H2O is designed to protect against over-fitting using cross-validation runs (Michailidis 2018) in ourstudy the Santa Rita model could not be generalized to repre-sent the Bernal site GPP process probably because the vari-ables and features selected for Bernal or Santa Rita were dif-ferent during the AutoML workflow

The Santa Rita model was good at predicting GPP at thatsite with a deviance of the leader model of 023 while atBernal the deviance of the leader model was 063 (Table 3)indicating that the Santa Rita EML ensemble was at leastas good a model as the EML at Bernal (not shown) TheGPP time series for Santa Rita was about 4 times the sizeof the Bernal dataset and therefore the deviance was lowerHowever when the Santa Rita model was used with Bernaldata the mean θs was 016 indicating that the agreement wasinsufficient Eyeballing the predictions in Figs 9a 10b and

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 381

11a and their corresponding correlation values (078 076and 082 for adjusted EML Santa Rita adjusted MODIS andmultiple-regression OLS) it could be argued that these mod-els were comparable However their θs plots present a dif-ferent perspective

4 Discussion

41 Model agreement

The Santa Rita model had higher θs when it extrapolatedfor low GPP values for the Bernal site suggesting that theSanta Rita model had a better skill when predicting GPPclose to 0 and even negative GPP values (Fig 10b) Al-though the Bernal and Santa Rita sites had similar vegeta-tion and climate classifications they are more than 1600 kmapart and rainfall monthly distribution is different The GPPseasonal cycle at Bernal started in February and steadily in-creased to a maximum during July and August (Table 1)At Santa Rita the GPP was low from January until mid-July (lt 05 gCO2 mminus2 dminus1) and then increased sharply to amaximum level in mid-August (Joiner and Yoshida 2020)What matters eventually for machine learning methods ishow well the predictor space rather than geographic spaceis sampled (Jung et al 2020) Incorporating data from morehumid (semiarid) sites could improve the GPP predictions bya machine learning method Not only more EC sites but alsosites representing a water availability gradient would be im-portant for semiarid ecosystems representing long spells andthe influence of oceanic oscillations and monsoon rains

All the models presented here used transient data to rep-resent GPP specifically at the 1 d time step Besides theradiation-related variables two sources of rainfall were usedas forcing variables but evapotranspiration (ET) was notused The MOD16A2 version 6 is available as an 8 d gap-filled product and could be included in EML or OLS mod-els However a more general representation of the carboncycling could be achieved when including variables that rep-resent annual or seasonal timescales of soil water evapo-ration or precipitation (Scott and Biederman 2019) Scottet al (2015) suggest that real lags between precipitation andproductivity that may impart legacy effects may also be par-tially masked by using ET as ET more carefully tracks pro-ductivity when soil moisture storage is accessed The occur-rence of off-season rainfall dry spells and carryover effectscould be parameterized as windowed events of a given dura-tion A window would be a period with distinct time bound-aries the window allows the grouping of records with similarfeatures The effect of the window at a given time could berepresented as moving weights as the point in time in ques-tion is closer or farther from the window

In our study the metric to assess model agreement wasthe probability of agreement (θs) and their bias parame-ters Many other metrics can be used to evaluate model per-

formance such as the root mean square error r r2 or themodel efficiency factor (MEF) presented by Nash and Sut-cliffe (1970) In particular the MEF is a step in the Fluxnetdata processing pipeline (Pastorello et al 2020) A MEFvalue close to 1 represents a high correlation and lower bi-ases (Joiner and Yoshida 2020) Therefore the calculatedMEF for the Bernal site EML (098) would suggest very goodpredictive performance while the θs of 06 for this model in-dicates a more modest performance The θs for a particularvalue of the measurand is the probability that the differencebetween two measurements made by different systems fallswithin an interval that is deemed to be acceptable (Stevenset al 2015) In the present study we used 1 gCmminus2 dminus1 as acritical value defining an acceptable difference If this valueis smaller then the probability of agreement would decreasefor this same model In such a case it would be less likelythat the predictors agree considering that the θs takes intoaccount both the difference in the function means at a givenvalue of the measurand and the uncertainty in its estimation

42 MODIS discrepancies

Different authors have reported discrepancies betweenMCD17 and EC estimates of GPP in semiarid regions Ex-amining MODIS discrepancies in these ecosystems is im-portant because the errors induced by cloud cover are ex-pected to be minimal and other effects can be identified(Gebremichael and Barros 2006) The GPP of MOD17 didnot relate well (EC= 011+ 017MODIS r2

= 067) withestimates of EC GPP in semidesert vegetation of the Sahel(Tagesson et al 2017) With data from different types of veg-etation in the Heihe basin in China MODIS17 overestimatedthe GPP from EC (EC= 115+024MODIS r2

= 068 Cuiet al 2016) For scrub sites in Mexico the relation be-tween GPP calculated from EC and MOD17 was not good(MODIS= 38382+ 0467EC r2

= 06 Delgado-Balbuenaet al 2018) In arid and semiarid ecosystems in China op-timizing parameters of the MODIS GPP model with site-specific data improved the estimate to explain 91 of thevariation in the GPP of the data observed by EC (Wang et al2019) These same authors propose improving the land useclassification used by the MOD17 algorithm and recalibrat-ing light use efficiency parameters to solve the GPP estima-tion problem Gebremichael and Barros (2006) examined anopen shrubland site in a semiarid region of Sonora Mexicoand their analysis of the temporal evolution of the discrep-ancies with MODIS GPP suggested revisiting the light useefficiency parameterization especially the functional depen-dence on VPD and PAR and water stress or soil moistureavailability

The relationship between the GPP MODIS and the GPPEC presented in Sect 33 is an approximation because theuncertainty in the respiration component must be consid-ered The empirical relationship between nocturnal NEE andsoil temperature has been used to represent ecosystem res-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

382 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

piration (Reco) in order to separate the processes that con-tribute to daytime NEE (Richardson and Hollinger 2005Wofsy et al 1993) Nighttime NEE should be equal to therates of autotrophic and heterotrophic respiration while dur-ing daytime NEE should be equal to the combined rates ofcarboxylation and oxidation of Rubisco autotrophic respi-ration and heterotrophic respiration Then the GPP can becalculated as the difference between daytime NEE and Recoestimated through its relationship with temperature (Gouldenet al 1996) In the present study Reco was calculated basedon soil and air temperature following the procedure of Reich-stein et al (2005) implemented in REddyProc (Wutzler et al2018) Although it is possible to measure or model the parti-tion of respiration (Running et al 2004 Wang et al 2018)the presence of cacti complicates the calculation assum-ing that all nighttime flux represents ecosystem respiration(Owen et al 2016 Richardson and Hollinger 2005) Whilesoil respiration tends to be temperature-limited when soilmoisture is nonlimiting in temperate ecosystems in range-land ecosystems the controls of soil CO2 efflux were photo-synthesis soil temperature and moisture (Roby et al 2019)In our study the instrumentation did not include measure-ments of plant or soil respiration partition to validate theRecoestimates

A problem regarding data comparison from remote orbitalsensors and terrestrial observations is that different quanti-ties are fundamentally measured MODIS measures the radi-ation reflected by the earthrsquos surface in two spectral bands at250 m spatial resolution per pixel five bands at 500 m and29 bands at 1000 m The EC technique has a footprint tomeasure CO2 that varies dynamically in shape and size butis generally considered to be 1 km2 in this study the foot-print radius was 600 m To solve the scaling MODIS prod-ucts related to the carbon cycle have been validated with theEC technique and biometric measurements on several spa-tial scales using process-based ecosystem models and char-acterizing areas up to 47 km2 around the EC measuring tower(Cohen et al 2003)

At Bernal the vegetation was heterogeneous This situa-tion was represented by the heterogeneity in vegetation ac-tivity in the 4 pixels used spanning about 02 units of NDVIat the peak of the season activity (Fig 1) The standard errorof the IVI differed by 1 order of magnitude among speciesAlthough more important species had a lower standard errortheir means were similar indicating that they were equallyabundant at all sampling plots The higher importance ofsome species (Table 2) was explained by selective grazingndashbrowsing behavior and the dispersion caused by cattle by ei-ther ingesting or transporting seeds or plant parts (Belaynehand Tessema 2017) Regarding landscape heterogeneity thetower fetch was predominantly from the northeast capturingthe less heterogeneous area of the site but also the more ac-tive according to the NDVI (Fig 1)

The thorny scrub examined had two vegetation layersthe overstory layer mainly consisting of mesquite acacia

and cacti and the understory layer that included grasses andherbs Cattle preferentially graze the understory and becausethey eat using their tongue they will avoid browsing thornyspecies unlike goats or deer which use their lips Withoutgrazing management overtime the competition balance willfavor bush species resulting in encroaching and the under-story will be stressed only unpalatable species or those withtheir growing meristems very close to the ground would sur-vive Representing the structure and functioning of these twolayers using MODIS is possible (Liu et al 2017) RecentlyHill and Guerschman (2020) presented a MODIS product de-rived from MCD43A4 to estimate the fractions of photosyn-thetic and nonphotosynthetic vegetation and the remainingfraction of bare soil These developments could improve theMOD17 GPP estimates since its model represents a homoge-neous single vegetation layer All these considerations helpto understand the low θs between MOD17 estimates of GPPand EC-derived GPP

43 Forcing variables and machine learning

Forcing variables in this study were gridded meteorologi-cal data and MODIS spectral bands and products but thevariables mostly included in the ML algorithms were LAIand FPAR accounting for 364 and 549 of the vari-able importance index at Bernal and Santa Rita respectivelyin their leader board This result supported the view thatCO2 fluxes can be represented by ML algorithms exclu-sively using remotely sensed data (Tramontana et al 2016Joiner and Yoshida 2020) Using neural networks Joiner andYoshida (2020) showed that with only satellite reflectancefrom MODIS and top-of-atmosphere PAR it is possible tocapture a large fraction of GPP variability They argue thatvegetation indices may reduce the information content ofthe underlying reflectance when compressing the informa-tion from two or more bands into a single index In our studyreflectance bands were not often included in the ML modelsThis was most likely because MOD09GA and MODOCGAprovide estimates of surface reflectance uncorrected for theillumination and viewing geometry while the MCD43 usedin Joiner and Yoshida (2020) uses a surface bidirectional re-flectance distribution function (BRDF) model that improvesthe quality of surface albedo retrievals By propagating theBRDF correction in the MODIS processing pipeline the veg-etation indices would more likely relate better to the GPPFuture modeling efforts should use benchmark scenarios ac-cording to sets of forcing variables already identified as use-ful

44 Carbon flux

Although the carbon balance in ecosystems is influenced bydifferent factors such as soil type and amount of nutrientsthe relationship with soil temperature and humidity is partic-ularly strong (Anderson-Teixeira et al 2011 Hastings et al

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 383

2005) How much of the rainwater the system can retain orlose has been described as the leakiness of the system (Guer-schman et al 2009) More than immediate incident rainfallthe available soil moisture and its redistribution are impor-tant in semiarid ecosystems including steam flow prefer-ential flow paths hydraulic lift and others (Barron-Gaffordet al 2017) At Bernal when the sum of the precipitation ofthe current month and that of the previous month was con-sidered the correlation with NEE was minus07 suggesting thatcontinuous availability of soil moisture is important for theabsorption of CO2 in this environment This result is con-sistent with other studies in which the relationship betweenthe net productivity of the ecosystem (NEP) and precipita-tion is initially positive but is leveled from 1000 to 1500 mmannually (Xu et al 2014) The hydraulic redistribution ofwater from moist (deeper) to drier soils through plant rootstended to increase modeled annual ecosystem uptake of CO2this process was identified at US-SMR (Fu et al 2018 Scottet al 2003)

The leakiness is highly dependent upon vegetation frac-tional cover the proportion of the surface occupied by baresoil and vegetation photosynthetically active vegetation andnonphotosynthetically active vegetation such as litter woodand dead biomass (Guerschman et al 2009) All these vege-tation fractions have a water storage capacity and can reducethe amount of effective rainfall available to plant roots It ispossible that the canopy of the bushes completely interceptedthe rainfall in some months because the scrub can interceptup to 20 of the precipitation and its canopy storage ca-pacity is 097 mm (Mastachi-Loza et al 2010) Consideringonly the daily rain events greater than 5 mm the correlationbetween precipitation and NEE rose to minus072 In the presentstudy the interception of rain by vegetation surfaces was notcalculated but the results suggest that it would be impor-tant to explore the relationship between net precipitation andNEE

The average NEE at a global level is minus156plusmn284 gCmminus2 yrminus1 (Baldocchi 2014) The highest frequencyamong sites that measured NEE with EC occurs from minus200to minus300 gCmminus2 yrminus1 but in sites with biometric measure-ments the peak occurs at minus100 gCmminus2 yrminus1 (Xu et al2014) Using the daily averages of Table 2 the averageNEE during the measurement period was minus078 gCmminus2 dminus1

and annually would be minus2835 gCmminus2 yrminus1 This resultwas higher than the annual values of the induced grass-land and scrubland vegetation characterizing the SonoranDesert plains (138 and 130 gCmminus2 yrminus1 Hinojo-Hinojoet al 2019) In New Mexico NEE values measured withEC are between 35ndash50 gCmminus2 yrminus1 in desert grassland and344ndash355 gCmminus2 yrminus1 in mixed coniferous forest (Anderson-Teixeira et al 2011) In a dryer region the sarcocaulescentscrubland of Baja California in Mexico the NEE was minus39and minus52 gCmminus2 yrminus1 in 2002 and 2003 respectively (Hast-ings et al 2005) The NEE measured here was within therange of NEE 03plusmn 02 kgCmminus2 yrminus1 for grasslands and

shrublands in Mexico (Murray-Tortarolo et al 2016) Al-though the measurements of the present study had gaps andwere compared with annual studies we considered that thereported value of C was representative of the main season ofgrowth of this type of scrub

45 Ecosystem management

Overgrazing is an appreciation relative to the grazing produc-tive system where the forage resource is overused in a mixedshrubndashgrass ecosystem such as Bernal it usually refers tothe understory Overgrazing means that the plant regrowth isreadily grazed tillers and root reserves are lost and eventu-ally the plant may die Although the Bernal site was over-grazed the carbon fluxes indicated that the plant communitywas photosynthetically active in both dry and rainy seasonsIt is fair to assume that water in the soil was not limiting forthe deep-rooted bush species and that was the reason whyit was possible to maintain the photosynthetic function dur-ing rainless months However this primary production wouldnot have tangible benefits for ranchersrsquo production systemsince no edible biomass would be produced for the cattleFrom the point of view of carbon capture the system accu-mulated nonlabile biomass that would remain in the systemfor a longer time compared to a grassland ecosystem (al-though it would be necessary to determine the partition ofsaid shrub biomass) However the overgrazing condition af-fects the biomass of the understory roots and consequentlythe carbon pool in the soil

In the short term it can be thought that the estimated neg-ative carbon flows had a favorable effect on the environmen-tal agenda As time passes it is possible that the gaps be-tween the individual shrubs of the overstory expand and thiswould have had an effect on soil erosion It is also possiblethat the water stored in the soil profile used by the bushesgradually decreases to the point of causing drought changesin phenology and advancements of the desertification pro-cess There are many opportunities for ecology conservationand livestock-oriented management these may include con-trolled grazing or propagating native thornless shrub speciesIf ranchers do not identify a benefit in the vegetation thenthey will be tempted to remove it as it occurred at thestudy site Because of its wide coverage and availability theMODIS GPP product accuracy is important in representingthe carbon cycle raising awareness and monitoring advance-ment of environmental decisions

Although we found that EML was a good option for mod-eling the GPP of a site what is really needed to evaluate theperformance of semiarid ecosystems is a spatial representa-tion of the carbon flux This is a problem for an underrepre-sented area regarding instrumented EC towers However theEML could be designed to take into account the explanatoryvariables in a spatiotemporal continuity

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

384 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

5 Conclusions

The best modeling approach was the ensemble of ma-chine learning The second option to estimate GPP was themultiple-regression OLS and the third alternative were theMODIS estimates Machine learning was a good option topredict GPP in the local context where it was generated oth-erwise its performance was not good This was demonstratedwhen using the EML from Santa Rita and trying to predictfluxes at Bernal Nevertheless a machine learning modelwould be useful for gap filling or the evaluation of GPP atthe same site The GPP estimates of a given model can be ad-justed using the bias parameters of the probability of agree-ment to improve the relationship Further research explor-ing the usefulness of EML to predict GPP in other contextscould use EML to calibrate mechanistic model parametershybrid approaches integrating EML and biologically mean-ingful mechanistic models or EML representing the spatialvariability of the landscape

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 385

Appendix A

Table A1 Threshold of velocity friction (ulowast) above which nighttime fluxes were considered valid Estimates were obtained using the Rpackage REddyProc We used only the flux records with ulowast equal to or higher than the corresponding mean ulowast threshold for each yearBound values are a 95 confidence interval NA not applicable

Aggregation method Year Season Mean Lower bound 5 Upper bound 95

ulowast (msminus1)

Single NA NA 0193 0140 0209Year 2017 NA 0194 0128 0227Year 2018 NA 0193 0132 0225Season 2017 2016012 0194 0128 0227Season 2017 2017003 0215 0141 0297Season 2017 2017006 0194 0128 0227Season 2017 2017009 0181 0144 0208Season 2018 2017012 0149 0116 0232Season 2018 2018003 0271 0134 0297Season 2018 2018006 0193 0132 0221

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

386 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix B

Table B1 MODIS reflectance bands and products database

Product name Satellite layer Spatial Temporal Spectralresolution resolution coverage

(d) (nm)

MOD09GA Band 1 500 m 1 620ndash670Band 2 841ndash876Band 3 459ndash479Band 4 545ndash565Band 5 1230ndash1250Band 6 1628ndash1652Band 7 2105ndash2155

MODOCGA Band 8 1 km 405ndash420Band 9 438ndash448Band 10 483ndash493Band 11 526ndash536Band 12 546ndash556Band 13 662ndash672Band 14 673ndash683

MOD17A2H Gross primary production (GPP) 8

Net photosynthesis (PsnNet)

MOD15A2H Fraction of photosynthetically active radiation (Fpar) na

Leaf area index (LAI)

MOD11A2 Land surface temperature and emissivity of day (LST day) 1 km

Land surface temperature and emissivity of night (LST night)

MYD13A1 Enhanced vegetation index (EVI) 500 m 16

MYD15A2H Fraction of photosynthetically active radiation (Fpar) 8

Leaf area index (LAI)

MYD11A2 Land surface temperature and emissivity of day (LST day) 1 km na

Land surface temperature and emissivity of night (LST night)

MCD15A2H Fraction of photosynthetically active radiation (Fpar) 500 m

Leaf area index (LAI)

na not applicable

Table B2 Daily vegetation indexes computed using the MODIS reflectance bands described in Table 1

Index Formula Reference

Simple ratio (SimpleR) SimpleR= Band 2Band 1

(Hill et al 2006)Moisture stress (MoistS) MoistS= Band 6

Band 2

Disease stress index (DSI) DSI= Band 2+Band 4Band 6+Band 1

Redndashgreen ratio index (RGRI) RGRI= Band 1Band 4

Normalized difference vegetation index (NDVI) NDVI= Band 2minusBand 1Band 2+Band 1

Normalized difference water index (NDVI_w) NDVI_w =Band 2minusBand 5Band 2+Band 5

Green leaf index (GLI) GLI= 2lowastBand 11minusBand 14minusBand 92lowastBand 11+Band 14+Band

(Henrich et al 2012)Green atmospherically resistant vegetation (GARI) GARI= Band 5minus(Band 11minus(Band 9minusBand 14))

Band 5minus(Band 11+(Band 9minusBand 14))

Enhanced vegetation index (EVI) EVI= 25lowast Band 2minusBand 1Band 2+(6lowastBand 1)minus(75lowastBand 9)+1

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 387

Table B3 Daymet meteorological database

Variable Spatial resolution Temporal resolution Reference

Precipitation (Dayprc)

1 km Daily (Thornton et al 2017)Shortwave radiation (Daysrad)Maximum air temperature (DayTmax)Minimum air temperature (DayTmin)Water vapor pressure (Dayvp)

Table B4 Precipitation data

Satellite Product name Spatial resolution Temporal resolution

Global Precipitation Measurement (GPM) 3IMERGDF v006 010Daily

Tropical Rainfall Measuring Mission (TRMM) 3B42 v007 025

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

388 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix C

Table C1 Analysis of variance for GGP derived from EC data atthe Bernal site Variable details are described in Appendix B

Model term Coefficient SEa DFb SSc F p VIFd

Regression 14 178048Constant minus1318 648 minus203 004R_08_40542e

minus212 060 1 2680 minus352 0 220R_13_66267e

minus286 123 1 1161 minus232 002 155R_14_67368e 917 259 1 2706 353 0 122RGRI6

minus083 037 1 1100 minus225 003 143GARIf 201 108 1 755 187 006 237EVIMYDg 853 201 1 3903 425 0 294NDVI_wf 390 155 1 1367 251 001 271PsnNetg 0013 0 1 7530 590 0 383LstNgtMYDg 005 002 1 818 194 005 201Dayprch 014 003 1 3590 407 0 192Daysradh 001 0 1 5121 486 0 501DayTmaxh

minus024 006 1 3658 minus411 0 689DayTminh 018 005 1 3222 386 0 496TRMMi minus004 002 1 1087 minus224 003 141Error 403 87297Total 417 265345

a Standard error of coefficient b Degrees of freedom c Adjusted sum of squares d Variance inflation factore Variable name denotes the band number and spectral bandwidth of MODIS (Moderate Resolution ImagingSpectroradiometer) f Vegetation indices RGRI is redndashgreen ratio index and GARI is green atmosphericallyresistant vegetation index details of formula are described in Appendix B g Layers from MODIS productsEVIMYD is enhanced vegetation index from MYD13 PsnNet is photosynthesis product form MOD17LstNgtMYD is nighttime land surface temperature emissivity from MYD11 h Variables obtained fromDaymet daily dataset DayTmin is minimum temperature DayTmax is maximum temperature Daysrad isshortwave radiation Dayprc is precipitation i Daily rainfall rate from 3B42 TRMM (Tropical RainfallMeasuring Mission)

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 389

Code and data availability Database and programming code areavailable at httpsdoiorg105281zenodo3598595 (Guevara-Escobar et al 2020)

Author contributions AGE contributed to conceptualization for-mal analysis investigation methodology software supervisionvalidation visualization and writing of the original draft as wellas review and editing EGS was in charge of funding acquisitionproject administration and writing and review MCJ contributed toinvestigation methodology validation visualization and writing ofthe original draft as well as review editing and communicationMEQB helped with investigation validation writing and reviewHSA contributed with conceptualization writing and review ICAacutehelped with investigation writing and review VHCS contributedwith writing review and funding acquisition

Competing interests The authors declare that they have no conflictof interest

Acknowledgement We thank NASA and Ameriflux for data avail-ability Data filtering was carried out by the cluster Abacus ICentre of Applied Mathematics and High Performance ComputingABACUS-CINVESTAV

Financial support This research has been supported by CONA-CyT ndash SEMARNAT (grant no 2014-1-(249407))

Review statement This paper was edited by Paul Stoy and re-viewed by three anonymous referees

References

Aguirre-Diacuteaz G J Aguilloacuten-Robles A Tristaacuten-Gonzaacutelez MLabarthe-Hernaacutendez G Loacutepez-Martiacutenez M Bellon Hand Nieto-Obregoacuten J Geologic setting of the Pentildeade Bernal Natural Monument Quereacutetaro Meacutexico Anendogenous volcanic dome Geosphere 9 557ndash571httpsdoiorg101130GES008431 2013

Anav A Friedlingstein P Beer C Ciais P Harper AJones C Murray-Tortarolo G Papale D Parazoo N Cand Peylin P Spatiotemporal patterns of terrestrial grossprimary production A review Rev Geophys 53 785ndash818httpsdoiorg1010022015RG000483 2015

Anderson-Teixeira K J Delong J P Fox A M Brese D Aand Litvak M E Differential responses of production and res-piration to temperature and moisture drive the carbon balanceacross a climatic gradient in New Mexico Glob Change Biol17 410ndash424 httpsdoiorg101111j1365-2486201002269x2011

Baldocchi D Measuring fluxes of trace gases and energy be-tween ecosystems and the atmospherendashthe state and future of the

eddy covariance method Glob Change Biol 20 3600ndash3609httpsdoiorg101111gcb12649 2014

Barron-Gafford G A Sanchez-Cantildeete E P Minor R LHendryx S M Lee E Sutter L F Tran N Parra EColella T Murphy P C Hamerlynck E P Kumar P andScott R L Impacts of hydraulic redistribution on grassndashtreecompetition vs facilitation in a semi-arid savanna New Phytol215 1451ndash1461 httpsdoiorg101111nph14693 2017

Belayneh A and Tessema Z K Mechanisms of bush encroach-ment and its inter-connection with rangeland degradation insemi-arid African ecosystems a review J Arid Land 9 299ndash312 httpsdoiorg101007s40333-016-0023-x 2017

Bland J M and Altman D G Statistical methods for assessingagreement between two methods of clinical measurement Int JNurs Stud 47 931ndash936 2010

Boettiger C and Eddelbuettel D An introduction to rockerDocker containers 639 for r ArXiv Prepr ArXiv171003675 6402017

Bonilla-Moheno M and Aide T M Beyond deforestationLand cover transitions in Mexico Agric Syst 178 102734httpsdoiorg101016jagsy2019102734 2020

Booker K Huntsinger L Bartolome J W Sayre N Fand Stewart W What can ecological science tell us aboutopportunities for carbon sequestration on arid rangelands inthe United States Glob Environ Change 23 240ndash251httpsdoiorg101016jgloenvcha201210001 2013

Butzer K W and Butzer E K The lsquonaturalrsquovegetation ofthe Mexican Bajiacuteo archival documentation of a 16th-century savanna environment Quat Int 43 161ndash172httpsdoiorg101016S1040-6182(97)00032-3 1997

CICESE C de I C y de E S de E Base de datos climatoloacutegicanacional (CLICOM) Villa Bernal Quereacutetaro available at httpclicom-mexcicesemx (last access 28 August 2019) 2015

Cohen W B Maiersperger T K Yang Z Gower S TTurner D P Ritts W D Berterretche M and Run-ning S W Comparisons of land cover and LAI esti-mates derived from ETM+ and MODIS for four sites inNorth America A quality assessment of 20002001 provi-sional MODIS products Remote Sens Environ 88 233ndash255httpsdoiorg101016jrse200306006 2003

Cui T Wang Y Sun R Qiao C Fan W Jiang GHao L and Zhang L Estimating vegetation primaryproduction in the Heihe River Basin of China withmulti-source and multi-scale data PloS One 11 1ndash20httpsdoiorg101371journalpone0153971 2016

Curtis J T and McIntosh R P The Interrelations of Certain An-alytic and Synthetic Phytosociological Characters Ecology 31434ndash455 httpsdoiorg1023071931497 1950

Delgado-Balbuena J Yepez E A Aacutengeles-Peacuterez G Aguirre-Gutierrez C Arredondo T Ayala-Nintildeo F Bullock S HCastellanos A E Cueva A Figueroa-Espinoza B Garatuza-Payaacuten J Hinojo-Hinojo C Maya-Delgado Y Meacutendez-Barroso L Oechel W Paz-Pellat F Peacuterez-Ruiacutez E R Ro-driacuteguez J C Rojas-Robles N Saacutenchez-Mejiacutea Z M Uuh-Sonda J Vargas R Verduzco V S Vivoni E R andWatts C Flujos anuales de carbono en ecosistemas terrestres deMeacutexico in Estado Actual del Conocimiento del Ciclo del Car-bono y sus Interacciones en Meacutexico Siacutentesis a 2018 p 686Texcoco Estado de Meacutexico Meacutexico 2018

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

390 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Eshel G Dayalu A Wofsy S C Munger J W andTziperman E Listening to the Forest An ArtificialNeural Network-Based Model of Carbon Uptake at Har-vard Forest J Geophys Res-Biogeo 124 461ndash478httpsdoiorg1010292018JG004791 2019

Fu C Wang G Bible K Goulden M L Saleska S RScott R L and Cardon Z G Hydraulic redistributionaffects modeled carbon cycling via soil microbial activityand suppressed fire Glob Change Biol 24 3472ndash3485httpsdoiorg101111gcb14164 2018

Gebremichael M and Barros A P Evaluation of MODISGross Primary Productivity (GPP) in tropical mon-soon regions Remote Sens Environ 100 150ndash166httpsdoiorg101016jrse200510009 2006

Gitelson A A Peng Y Arkebauer T J and Suyker A EProductivity absorbed photosynthetically active radiation andlight use efficiency in crops Implications for remote sensingof crop primary production J Plant Physiol 177 100ndash109httpsdoiorg101016jjplph201412015 2015

Glenn E P Scott R L Nguyen U and Nagler P L Wide-area ratios of evapotranspiration to precipitation in monsoon-dependent semiarid vegetation communities J Arid Envi-ron 117 84ndash95 httpsdoiorg101016jjaridenv2015020102015

Goldstein A Turner W R Spawn S A Anderson-Teixeira K J Cook-Patton S Fargione J Gibbs H KGriscom B Hewson J H and Howard J F Protecting ir-recoverable carbon in Earthrsquos ecosystems Nat Clim Change10 1ndash9 httpsdoiorg101038s41558-020-0738-8 2020

Goulden M L Munger J W Fan S Daube B Cand Wofsy S C Measurements of carbon sequestra-tion by long-term eddy covariance Methods and a criti-cal evaluation of accuracy Glob Change Biol 2 169ndash182httpsdoiorg101111j1365-24861996tb00070x 1996

Grossiord C Buckley T N Cernusak L A Novick K A Poul-ter B Siegwolf R T Sperry J S and McDowell N GPlant responses to rising vapor pressure deficit New Phytol226 1550ndash1566 2020

Guerschman J P Hill M J Renzullo L J Barrett D JMarks A S and Botha E J Estimating fractional coverof photosynthetic vegetation non-photosynthetic vegetation andbare soil in the Australian tropical savanna region upscaling theEO-1 Hyperion and MODIS sensors Remote Sens Environ113 928ndash945 httpsdoiorg101016jrse200901006 2009

Guevara-Escobar A Gonzaacutelez-Sosa E Cervantes-Jimeacutenez MSuzaacuten-Azpiri H Queijeiro-Bolantildeos M E Carrillo-AacutengelesI and Cambroacuten-Sandoval V H Machine learning estimates ofeddy covariance carbon flux in a scrub in the Mexican highlandZenodo httpsdoiorg105281zenodo3598595 2020

Hall P Gill N Kurka M Phan W and Bartz A MachineLearning Interpretability with H2O Driverless AI edited byBartz A H2Oai Inc California US 2019

Hastings S J Oechel W C and Muhlia-Melo A Diur-nal seasonal and annual variation in the net ecosystem CO2exchange of a desert shrub community (Sarcocaulescent) inBaja California Mexico Glob Change Biol 11 927ndash939httpsdoiorg101111j1365-2486200500951x 2005

Henrich V Krauss G Goumltze C and Sandow C Entwicklungeiner Datenbank fuumlr Fernerkundungsindizes vol 45 AK Fern-erkundung Bochum 2012

Hill M Held A Leuning R Coops N Hughes D andCleugh H MODIS spectral signals at a flux tower site Rela-tionships with high-resolution data and CO2 flux and light useefficiency measurements Remote Sens Environ 103 351ndash368httpsdoiorg101016jrse200506015 2006

Hill M J and Guerschman J P The MODIS Global VegetationFractional Cover Product 2001ndash2018 Characteristics of Vegeta-tion Fractional Cover in Grasslands and Savanna Woodlands Re-mote Sens 12 406 httpsdoiorg103390rs12030406 2020

Hinojo-Hinojo C Castellanos A E Huxman T Ro-driguez J C Vargas R Romo-Leoacuten J R and Bie-derman J A Native shrubland and managed buffelgrasssavanna in drylands Implications for ecosystem carbonand water fluxes Agr Forest Meteorol 268 269ndash278httpsdoiorg101016jagrformet201901030 2019

Jaumlrvi L Havu M Ward H C Bellucco V McFad-den J P Toivonen T Heikinheimo V Kolari P Ri-ikonen A and Grimmond C S B Spatial modeling oflocal-scale biogenic and anthropogenic carbon dioxide emis-sions in Helsinki J Geophys Res-Atmos 124 8363ndash8384httpsdoiorg1010292018JD029576 2019

Joiner J and Yoshida Y Satellite-based reflectances capture largefraction of variability in global gross primary production (GPP)at weekly time scales Agr Forest Meteorol 291 108092httpsdoiorg101016jagrformet2020108092 2020

Jung M Schwalm C Migliavacca M Walther S Camps-VallsG Koirala S Anthoni P Besnard S Bodesheim P Carval-hais N Chevallier F Gans F Goll D S Haverd V KoumlhlerP Ichii K Jain A K Liu J Lombardozzi D Nabel J EM S Nelson J A OrsquoSullivan M Pallandt M Papale DPeters W Pongratz J Roumldenbeck C Sitch S TramontanaG Walker A Weber U and Reichstein M Scaling carbonfluxes from eddy covariance sites to globe synthesis and eval-uation of the FLUXCOM approach Biogeosciences 17 1343ndash1365 httpsdoiorg105194bg-17-1343-2020 2020

Kang M Ichii K Kim J Indrawati Y M Park J Moon MLim J-H and Chun J-H New Gap-Filling Strategies forLong-Period Flux Data Gaps Using a Data-Driven ApproachAtmosphere 10 568 httpsdoiorg103390atmos101005682019

Lal R Griffin M Apt J Lave L and Morgan M GManaging Soil Carbon Science 304 393ndash393httpsdoiorg101126science1093079 2004

LI-COR B Eddypro Software instruction manual Nebraska USversion 70 2019

Liu Y Liu R Pisek J and Chen J M Separating overstoryand understory leaf area indices for global needleleaf and de-ciduous broadleaf forests by fusion of MODIS and MISR dataBiogeosciences 14 1093ndash1110 httpsdoiorg105194bg-14-1093-2017 2017

Long S P Marshall-Colon A and Zhu X-G Meetingthe Global Food Demand of the Future by EngineeringCrop Photosynthesis and Yield Potential Cell 161 56ndash66httpsdoiorg101016jcell201503019 2015

Ma X Huete A Yu Q Restrepo-Coupe N Beringer J Hut-ley L B Kanniah K D Cleverly J and Eamus D Param-

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 391

eterization of an ecosystem light-use-efficiency model for pre-dicting savanna GPP using MODIS EVI Remote Sens Environ154 253ndash271 2014

Ma X Migliavacca M Wirth C Bohn F J Huth ARichter R and Mahecha M D Monitoring Plant FunctionalDiversity Using the Reflectance and Echo from Space RemoteSens 12 1248 httpsdoiorg103390rs12081248 2020

Marcolla B Roumldenbeck C and Cescatti A Patterns and con-trols of inter-annual variability in the terrestrial carbon budgetBiogeosciences 14 3815ndash3829 httpsdoiorg105194bg-14-3815-2017 2017

Mastachi-Loza C A Gonzaacutelez-Sosa E Becerril-Pintildea R andBraud I Peacuterdidas por intercepcioacuten en mezquite (Prosopis laevi-gata) y huizache (Acacia farnesiana) de la regioacuten semiaacuterida delcentro de Meacutexico Tecnol Cienc Agua 1 103ndash120 2010

Mauder M and Foken T Documentation and instruction man-ual of the eddy-covariance software package TK3 UniversitaumltBayreuth Abteilung Mikrometeorologie Bayreuth Germany2011

McClaran M A century of vegetation change on the Santa Rita Ex-perimental Range in Santa Rita Experimental Range 100 years(1903 to 2003) to accomplishments and contributions edited byMcClaran M P Ffolliott P F and Edminister C B Tech Co-ords USDA Forest service proceedings RMRS-P-30 TucsonAZ US 16ndash33 2003

McCulley R Jobbagy E Pockman W and Jackson R Nu-trient uptake as a contributing explanation for deep rootingin arid and semi-arid ecosystems Oecologia 141 620ndash628httpsdoiorg101007s00442-004-1687-z 2004

McElreath R Statistical Rethinking A Bayesian Course with Ex-amples in R and STAN CRC Press Boca Raton FL US 2020

Michailidis M How Driverless AI Prevents Overfitting and Leak-age Open Source Lead AI ML available at httpswwwh2oaiblogdriverless-ai-prevents-overfitting-leakage (last access7 July 2020) 2018

Mitchell R and Frank E Accelerating the XGBoost algo-rithm using GPU computing PeerJ Comput Sci 3 e127httpsdoiorg107717peerj-cs127 2017

Monteith J Solar radiation and productivity in tropical ecosys-tems J Appl Ecol 9 747ndash766 1972

Murray-Tortarolo G Friedlingstein P Sitch S Jaramillo V JMurguiacutea-Flores F Anav A Liu Y Arneth A Arvanitis AHarper A Jain A Kato E Koven C Poulter B Stocker BD Wiltshire A Zaehle S and Zeng N The carbon cycle inMexico past present and future of C stocks and fluxes Biogeo-sciences 13 223ndash238 httpsdoiorg105194bg-13-223-20162016

Nash J E and Sutcliffe J V River flow forecasting through con-ceptual models part I ndash A discussion of principles J Hydrol 10282ndash290 1970

Owen N A Choncubhair Oacute N Males J del Real Laborde J IRubio-Corteacutes R Griffiths H and Lanigan G Eddy covari-ance captures four-phase crassulacean acid metabolism (CAM)gas exchange signature in Agave Plant Cell Environ 39 295ndash309 httpsdoiorg101111pce12610 2016

Pasetto D Arenas-Castro S Bustamante J Casagrandi RChrysoulakis N Cord A F Dittrich A Domingo-Marimon C El Serafy G and Karnieli A Integration ofsatellite remote sensing data in ecosystem modelling at local

scales Practices and trends Methods Ecol Evol 9 1810ndash1821httpsdoiorg1011112041-210X13018 2018

Pastorello G Trotta C and Canfora E The FLUXNET2015dataset and the ONEFlux processing pipeline for eddy covari-ance data Sci Data 7 225 pp httpsdoiorg101038s41597-020-0534-3 2020

Pearcy R W Heterogeneous light environments in Handbookof Functional Plant Ecology edited by Pugnaire I F and Val-ladares F CRC Press Boca Ratoacuten FL US 270ndash314 2007

Piao S Liu Q Chen A Janssens I A Fu Y Dai J Liu LLian X Shen M and Zhu X Plant phenology and global cli-mate change Current progresses and challenges Glob ChangeBiol 25 1922ndash1940 httpsdoiorg101111gcb14619 2019

R Development Core Team R A language and environment forstatistical computing R Foundation for Statistical Computingavailable at httpswwwr-projectorg (last access 15 October2015) 2009

Reichstein M Falge E Baldocchi D Papale D Aubinet MBerbigier P Bernhofer C Buchmann N Gilmanov Tand Granier A On the separation of net ecosystem ex-change into assimilation and ecosystem respiration reviewand improved algorithm Glob Change Biol 11 1424ndash1439httpsdoiorg101111j1365-24862005001002x 2005

Richardson A D and Hollinger D Y Statistical model-ing of ecosystem respiration using eddy covariance datamaximum likelihood parameter estimation and Monte Carlosimulation of model and parameter uncertainty applied tothree simple models Agr Forest Meteorol 131 191ndash208httpsdoiorg101016jagrformet200505008 2005

Richardson A D Hollinger D Y Shoemaker J K Hughes HSavage K and Davidson E A Six years of ecosystem-atmosphere greenhouse gas fluxes measured in a sub-borealforest Sci Data 6 1ndash15 httpsdoiorg101038s41597-019-0119-1 2019

Roby M C Scott R L Barron-Gafford G A Hamerlynck E Pand Moore D J P Environmental and Vegetative Controls onSoil CO2 Efflux in Three Semiarid Ecosystems Soil Syst 3 6httpsdoiorg103390soilsystems3010006 2019

Running S W and Zhao M Daily GPP and annual NPP(MOD17A2A3) products NASA Earth Observing SystemMODIS land algorithm Univ of Mont Missoula Mont US2015

Running S W Nemani R R Heinsch F A Zhao MReeves M and Hashimoto H A continuous satellite-derived measure of global terrestrial primary productionBioscience 54 547ndash560 httpsdoiorg1016410006-3568(2004)054[0547ACSMOG]20CO2 2004

Rzedowski J Vegetacioacuten de Meacutexico Meacutexico Editor Limusa1978

Scott R L and Biederman J A Critical Zone Water Balance Over13 Years in a Semiarid Savanna Water Resour Res 55 574ndash588 httpsdoiorg1010292018WR023477 2019

Scott R L Watts C Payan J G Edwards E Goodrich D CWilliams D and James Shuttleworth W The understory andoverstory partitioning of energy and water fluxes in an opencanopy semiarid woodland Agr Forest Meteorol 114 127ndash139 httpsdoiorg101016S0168-1923(02)00197-1 2003

Scott R L Edwards E A Shuttleworth W J Huxman T EWatts C and Goodrich D C Interannual and seasonal vari-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

392 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

ation in fluxes of water and carbon dioxide from a ripar-ian woodland ecosystem Agr Forest Meteorol 122 65ndash84httpsdoiorg101016jagrformet200309001 2004

Scott R L Biederman J A Hamerlynck E P and Barron-Gafford G A The carbon balance pivot point of southwest-ern US semiarid ecosystems Insights from the 21st centurydrought J Geophys Res-Biogeo 120 2612ndash2624 2015

Segerstrom K Geology of the Bernal-Jalpan Area Estado deQueretaro Mexico Geol Surv Bull 1104 19ndash82 1961

Soper F M McCalley C K Sparks K and Sparks J P Soilcarbon dioxide emissions from the Mojave desert Isotopic evi-dence for a carbonate source Geophys Res Lett 44 245ndash251httpsdoiorg1010022016GL071198 2017

Stevens N T Steiner S H and MacKay R J Assessing agree-ment between two measurement systems An alternative to thelimits of agreement approach Stat Methods Med Res 262487ndash2504 httpsdoiorg1011770962280215601133 2015

Tagesson T Ardouml J Cappelaere B Kergoat L AbdiA Horion S and Fensholt R Modelling spatial andtemporal dynamics of gross primary production in theSahel from earth-observation-based photosynthetic capacityand quantum efficiency Biogeosciences 14 1333ndash1348httpsdoiorg105194bg-14-1333-2017 2017

Thornton P Thornton M Mayer B Wei Y Devarakonda RVose R and Cook R Daymet Daily Surface Weather Data ona 1-km Grid for North America Version 3 Oak Ridge NationalLab (ORNL) Oak Ridge TN US 2017

Tramontana G Jung M Schwalm C R Ichii K Camps-VallsG Raacuteduly B Reichstein M Arain M A Cescatti A KielyG Merbold L Serrano-Ortiz P Sickert S Wolf S andPapale D Predicting carbon dioxide and energy fluxes acrossglobal FLUXNET sites with regression algorithms Biogeo-sciences 13 4291ndash4313 httpsdoiorg105194bg-13-4291-2016 2016rdquo

Truong A Walters A Goodsitt J Hines K Bruss C B andFarivar R Towards automated machine learning Evaluationand comparison of AutoML approaches and tools in Proceed-ings of the 2019 IEEE 31st International Conference on Toolswith Artificial Intelligence (ICTAI) Portland OR US 1471ndash1479 2019

Verbist K Santibantildeez F Gabriels D and Soto G Atlas dezonas aacuteridas de Ameacuterica Latina y El Caribe Donoso MC(Coord) UNESCO-PHI and CAZALAC Montevideo Uruguay2010

Wang H Li X Ma M and Geng L Improving esti-mation of gross primary production in dryland ecosystemsby a model-data fusion approach Remote Sens 11 225httpsdoiorg103390rs11030225 2019

Wang X Ma M Li X Song Y Tan J Huang G Zhang ZZhao T Feng J and Ma Z Validation of MODIS-GPP prod-uct at 10 flux sites in northern China Int J Remote Sens 34587ndash599 httpsdoiorg101080014311612012715774 2013

Wang X Wang H Li X and Ran Y Photosynthesis (NPPNEP Respiration) Obs Meas Ecohydrol Process 1ndash30httpsdoiorg101007978-3-662-47871-4_10-1 2018

Wheeler C W Archer S R Asner G P and McMurtry C RClimaticedaphic controls on soil carbonnitrogen response toshrub encroachment in desert grassland Ecol Appl 17 1911ndash1928 httpsdoiorg10189006-15801 2007

Wilcox B P Birt A Fuhlendorf S D and Archer S R Emerg-ing frameworks for understanding and mitigating woody plantencroachment in grassy biomes Curr Opin Env Sust 32 46ndash52 httpsdoiorg101016jcosust201804005 2018

Wilson J B Cover plus ways of measuring plant canopiesand the terms used for them J Veg Sci 22 197ndash206httpsdoiorg101111j1654-1103201001238x 2011

Wofsy S Goulden M Munger J Fan S-M Bakwin PDaube B Bassow S and Bazzaz F Net exchange ofCO2 in a mid-latitude forest Science 260 1314ndash1317httpsdoiorg101126science26051121314 1993

Wu C Niu Z and Gao S Gross primary production estimationfrom MODIS data with vegetation index and photosyntheticallyactive radiation in maize J Geophys Res-Atmos 115 D12127httpsdoiorg1010292009JD013023 2010

Wutzler T Lucas-Moffat A Migliavacca M Knauer JSickel K Šigut L Menzer O and Reichstein MBasic and extensible post-processing of eddy covarianceflux data with REddyProc Biogeosciences 15 5015ndash5030httpsdoiorg105194bg-15-5015-2018 2018

Xiao H McDonald-Madden E Sabbadin R Peyrard NDee L E and Chadegraves I The value of understanding feed-backs from ecosystem functions to species for managing ecosys-tems Nat Commun 10 1ndash10 httpsdoiorg101038s41467-019-11890-7 2019

Xu B Yang Y Li P Shen H and Fang J Globalpatterns of ecosystem carbon flux in forests A biometricdata-based synthesis Global Biogeochem Cy 28 962ndash973httpsdoiorg1010022013GB004593 2014

Yao J Liu H Huang J Gao Z Wang G Li D Yu H andChen X Accelerated dryland expansion regulates future vari-ability in dryland gross primary production Nat Commun 111ndash10 httpsdoiorg101038s41467-020-15515-2 2020

Yepez E A Williams D G Scott R L and Lin GPartitioning overstory and understory evapotranspiration ina semiarid savanna woodland from the isotopic composi-tion of water vapor Agr Forest Meteorol 119 53ndash68httpsdoiorg101016S0168-1923(03)00116-3 2003

Yona L Cashore B Jackson R B Ometto J and Brad-ford M A Refining national greenhouse gas inventories Am-bio 1ndash6 httpsdoiorg101007s13280-019-01312-9 2020

Zeng X Zeng X and Barlage M Growing temperate shrubsover arid and semiarid regions in the Community Land ModelndashDynamic Global Vegetation Model Global Biogeochem Cy 22GB3003 httpsdoiorg1010292007GB003014 2008

Zhang A Jia G Epstein H E and Xia J ENSO elicits oppos-ing responses of semi-arid vegetation between Hemispheres SciRep 7 42281 httpsdoiorg101038srep42281 2017

Zhang L Xiao J Zheng Y Li S and Zhou Y Increased carbonuptake and water use efficiency in global semi-arid ecosystemsEnviron Res Lett 15 034022 httpsdoiorg1010881748-9326ab68ec 2020

Zhu X-G Long S P and Ort D R What is the max-imum efficiency with which photosynthesis can convert so-lar energy into biomass Curr Opin Biotech 19 153ndash159httpsdoiorg101016jcopbio200802004 2008

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

  • Abstract
  • Introduction
  • Materials and methods
    • Site description
    • Eddy covariance measurements
    • Flux data processing
    • Remotely sensed data
    • MODIS algorithm for GPP
    • Santa Rita site dataset
    • Modeling
    • Variable importance
    • Model agreement
      • Results
        • Eddy covariance fluxes at Bernal
        • Machine learning ensembles as predictors of eddy covariance GPP
        • MODIS as predictor of eddy covariance GPP
        • Prediction eddy covariance GPP with ordinary-least-squares multiple regression
        • Agreement
          • Discussion
            • Model agreement
            • MODIS discrepancies
            • Forcing variables and machine learning
            • Carbon flux
            • Ecosystem management
              • Conclusions
              • Appendix A
              • Appendix B
              • Appendix C
              • Code and data availability
              • Author contributions
              • Competing interests
              • Acknowledgement
              • Financial support
              • Review statement
              • References
Page 9: Machine learning estimates of eddy covariance carbon flux ... · Scrub is a xeric category of shrublands characterized by plants with small leaves; it is very thorny, and its biomass

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 375

Figure 5 Closure of the surface energy balance from eddy covari-ance measurements averaged at 30 min between the turbulent fluxes(H + λE) and available fluxes (Rn-G) Data are from 30 March2017 to 22 August 2018 at the Bernal site The regression wasy = 2302+072times with adjusted r2

= 092 The diagonal line rep-resents the 1 1 relation

presence of cacti Although summer rains are characteristicof the climate at the Bernal site (Fig 7) a negative NEE fluxoccurred in all measured months The lowest CO2 flux wasrecorded in January and February 2017 and in May 2018 (Ta-ble 1) This behavior resulted from the phenology of the veg-etation since most species lost their leaves in the dry seasonand was also due to the effect of low temperature Withinthe rainy season the flux of CO2 increased compared to themonths of January to June The correlation between NEE andprecipitation was minus045 When the sum of the precipitationof the current month and that of the previous month was con-sidered the correlation with NEE was minus07 suggesting thatcontinuous availability of soil moisture is important for theabsorption of CO2 in this environment

The scrub at Bernal was heterogeneous in botanical com-position A total of 24 species of cacti and shrub wereidentified on average each sampling plot had 103 speciesThe IVI was similar between all cacti (036plusmn 004) shrublegumes (038plusmn004) and other shrubs (023plusmn006) sampledMost sampling plots were in areas of high flux frequency(Fig 2) Cylindropuntia imbricata had the largest IVI fol-lowed by Acacia farnesiana Acacia schaffneri and Prosopislaevigata The IVI of the herbaceous stratum representedby grasses was not characterized due to the state of over-grazing and the absence of reproductive structures in plantswhich made measurement of their abundances frequenciesand dominance difficult The grass genera present were Meli-nis Chloris Cynodon and Cenchrus all corresponding to in-

Table 1 Daily average values of the net ecosystem exchange(NEE) gross primary productivity (GPP) and ecosystem respira-tion (Reco) in a scrub at the Bernal site Negative values of NEEindicate photosynthetic absorption

NEE GPP RecomicromolCO2 mminus2 sminus1

2017

JANFEBMARAPR minus054 248 194MAY 005 286 291JUN 038 421 459JUL minus100 178 077AUGSEPOCT minus126 525 399NOV minus013 265 252DEC 004 180 184

2018

JAN minus005 129 124FEB 006 188 194MAR minus094 245 151APR minus058 287 229MAY minus029 377 348JUN minus252 433 181JUL minus283 823 541AUG minus193 921 728

vasive C4 tropical grasses Scrub species of higher IVI hada similar LAI (12) although the magnitude of the LAI of Plaevigata stood out (Table 2)

32 Machine learning ensembles as predictors of eddycovariance GPP

In this section we describe the modeling with EML using lo-cal remotely sensed data from the Bernal site to predict GPPat the same site and then the agreement between EML GPPpredictions and EC-derived GPP The AutoML function gen-erated 1031 models with an average deviance of 135 whilethe deviance of the leader model was 063 in the trainingdataset (Table 3) A total of 11 models of type GBM andfive XGBoost models were in the top 30 models along withthe nine best-of-family ensembles and five all-models ensem-bles The weighted variable importance on the leader boardwas higher for the LAI from the MOD15 and MCD15 prod-ucts (17 and 14 ) The PsnNet EVI (MOD17) FPAR(MCD15) green atmospherically resistant vegetation index(GARI) and MODIS reflectance band 13 had an importancehigher than 3 (59 54 42 36 and 30 ) LAI(MCD13 and MOD13) PsnNet and the FPAR (MOD15)were the more important variables (20 17 13 and10 ) in the top nonstacked model a GBM model that wasranked in fourth place

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

376 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 6 Latent heat flux daily trend at Bernal during different months emphasizing nighttime λE Panel (a) shows months with low rainfallin the previous month and predominantly negative λE during nighttime (these months had low rainfall 017 113 and 033 mm rainfall forJanuary March and December) Panel (b) shows months with low rainfall in the previous month and positive λE after sunset (67 73 and207 mm rainfall for February April and May) Panel (c) shows months during the rainy season with positive λE mainly due to soil wetnessand antecedent rainfall of 43 202 9 190 and 34 mm for June July August October and November Plot (c) is out of scale on the y axis forcompatibility with the other plots

Table 2 Importance value index (IVI) and leaf area index (LAI) of the main species present at the Bernal Quereacutetaro study site

Species Plant type IVI SEMa LAI SEM

Coryphantha cornifera Cactus 007 027Bouvardia ternifolia Herb 007 027Karwinskia humboldtiana Shrub 007 027Forestiera phillyreoides Shrub 009 027Ferocactus latispinus Cactus 009 027Cylindropuntia leptocaulis Cactus 009 027Asphodelus fistulosus Shrub 009 019Brickellia veronicifolia Shrub 010 027Dalea lutea Shrub 011 015Eysenhardtia polystachya Legume 013 015Myrtillocactus geometrizans Cactus 014 027Schinus molle Shrub 014 019Jatropha dioica Herb 015 019Mammillaria uncinata Cactus 016 012Opuntia tomentosa Cactus 017 011Opuntia robusta Cactus 023 007Opuntia hyptiacantha Cactus 026 007Mimosa monancistra Legume 028 012Mimosa depauperata Legume 031 012Zaluzania augusta Shrub 033 010Viguiera linearis Herb 036 011Acacia schaffneri Legume 041 007 113 015Prosopis laevigata Legume 041 007 148 012Acacia farnesiana Legume 056 009 112 037Cylindropuntia imbricata Cactus 074 007 113 011

a SEM standard error of the mean

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 377

Figure 7 Monthly rainfall and land surface temperature (LST) dur-ing years (a) 2017 and (b) 2018 at the Bernal site The LST valuescorrespond to the 1330 (LST day) and 0130 (LST night) MODISAqua satellite overpasses

Predictions of GPP in the testing dataset showed disper-sion in the lower range of the scale of measurement andthe correlation was 094 (Fig 9a) The final prediction ofGPP for the whole dataset had a probability of agreement(θs) of 058plusmn 001 (parameter estimate and standard error)α =minus00616plusmn 011 and β = 10133plusmn 002 suggesting agood fit with low fixed and proportional bias (Fig 9b) Theprobability of agreement decreased slightly at the lower andupper range of the scale of measurement (Fig 9c) indicatingthat the EML model would predict GPP without increasingthe bias particularly in the range from 0 to 4 gCmminus2 dminus1However the value of θs should be higher than 095 so as toconsider EC measurements and EML to be interchangeableThe correlation of 099 (r2

= 098) for the data in Fig 9bcould be misleading as it would suggest a very good fit

Using only the five of the more important variables namedabove to generate an EML resulted in an XBoost leadermodel with 273 deviance and a total of 1094 models Thetop 30 models were 16 XBoost and GBM models and the

best-of-family ensembles started to show up in 12th placeAlthough the number of runs was the same (20) the AutoMLfunction increased the number of produced models but thesmaller set of explanatory variables constrained the ability toidentify features contributing to better models Using anotherset of five randomly selected explanatory variables (one veg-etation index and four MODIS bands) resulted in a leadermodel with 252 deviance out of 1024 models but this timethe leader was the best-of-the-family ensemble Using only afew variables was considered to increase the deviance com-pared to the average deviance of 135 obtained during thetraining phase and using all available variables

An important question for modeling upscaling is the ca-pacity to extrapolate results temporally and spatially herewe explored the latter posing the following question wouldpredictions of GPP from EML for an EC site agree withEC observations from another site with ldquosimilarrdquo environ-mental conditions First an EML solution was found train-ing 80 of the Santa Rita dataset and obtaining a best-of-family ensemble with 023 deviance out of 634 trainedmodels (hereafter this model is referred to as the Santa Ritamodel) Then the environmental and remotely sensed datafrom the Bernal site were fed into the Santa Rita model thiswould be an external validation dataset However agreementwas not good the mean value of θs was 015plusmn 001 withα =minus10822plusmn 009 and β = 058127plusmn 002 The value ofθs was not constant across the range of measurement and de-creased rapidly after 2 gCmminus2 dminus1 (Fig 10b) Because thebias was important predictions were adjusted using Eq (6)showing some improvement with r = 078 (Fig 10a) Com-paring Figs 5b and 6a it is evident that an EML modelextrapolation to other conditions is noisier ie Santa Ritamodel trying to represent the ecosystem function at BernalNotwithstanding some of the most important variables wereshared by both EML ensembles Bernal and Santa Rita in thecase of Santa Rita LAI from MOD15 MYD15 and MCD15had 350 48 and 31 of variable importance and theFPAR from MOD15 was 12

33 MODIS as predictor of eddy covariance GPP

MODIS is important because it overpasses every point ofthe earth every 1 or 2 d and it implements a GPP prod-uct (MOD17) that has helped track the response of the bio-sphere to the environment since 2000 The product MOD17has been validated against many EC sites but few validationsites correspond to deserts and semideserts (Running et al2004) The GPP MOD17 underestimated the GPP derivedfrom EC data at Bernal (Fig 11a) In a similar bell-shapeddistribution of θs as in the case of the extrapolation of theSanta Rita site (Fig 10b) here the θs was not constant acrossthe range of measurement mean θs was 024plusmn 013 withα = 000047plusmn0087 and β = 048749plusmn002 (Fig 11c) Ad-justing MOD17 estimates with Eq (6) improved the relation-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

378 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 8 Net ecosystem exchange (NEE) and photosynthetic active radiation (PAR) at the Bernal site in (a) 2017 and (b) 2018 Negativevalues in the CO2 flux indicate photosynthesis The gray shadow is the standard error of the mean for each month at any given hour

Table 3 Leader board of EML models for the Bernal site using 85 of day observations as training dataset NA denotes the outcome wherethe type of model was not present in the 30 top-performing models according to deviance

Type of model Number of models Average deviance

All models Top models Leader model

Stacked ensembleAll models 20 0852 0834Best of family 20 0812 0703 0633GBM 453 1366 0796DRF 20 1108 NAXGBoost 277 1385 0778XRT 20 1077 NADeep learning 201 2072 NAGLM 20 1474 NATotal 1031 1356

ship but note that the value of r (076) was the same for theoriginal MOD17 and the adjusted MOD17 (Fig 11b)

34 Prediction eddy covariance GPP withordinary-least-squares multiple regression

OLS is a common estimation method for linear models andhere this model appeared as adequate judging by the gen-eral distribution of predictions (Fig 12a) and the probability-of-agreement plot (Fig 12b) A total of 14 variables wereincluded in the model all of them with VIF values lowerthan 70 (Appendix C) the VIF statistic quantifies the sever-ity of multicollinearity and an acceptable threshold is 10The most significant variables were the EVI from MYD13and Daymet variables precipitation shortwave radiation and

minimum and maximum temperatures (see Appendix B forvariable details) Variables with high coefficient values wereMODIS reflectance band 14 (917) the EVI from MYD13(853) and the NDVI (39) while Daymet temperatures hadsmall coefficientsminus023 for maximum temperature and 017for minimum temperature The θs decreased slightly at theend of the measurement range mean θs was 05plusmn0014 withα = 018845plusmn0137 and β = 094966plusmn0031No correctionfor this model was calculated since α was close to 0 β wasclose to 1 and the EML model had a higher θs

35 Agreement

The probability of agreement (θs) was the statistic to deter-mine if the mean responses were in agreement Comparing

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 379

Figure 9 Agreement between predictions of GPP obtained with machine learning algorithms or derived from eddy covariance measurementsat the Bernal site (a) Example of one run of predictions of GPP in the test dataset from the Bernal site using the leader model of an ensembleof machine learning algorithms (EML) the test dataset was 15 of data with r = 094 (b) Predictions for the complete dataset withr = 099 the diagonal line is the 1 1 agreement (c) Function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreementbetween methods of estimation of GPP corresponding to plot (b) The horizontal axis (s) represents the magnitude of the measurement thevertical axis is the probability of agreement for the measurement and the red line is the confidence interval p lt 005

Figure 10 (a) Adjusted predictions of GPP for the complete dataset from the Bernal site using the leader model of the final ensembleof machine learning algorithms (EML) derived from the Santa Rita site compared to estimates of GPP from EC data with r = 078 (b)Respective function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreement between methods of estimation of GPP ECdata from the Bernal site and EML model for the Santa Rita site The horizontal axis (s) represents the magnitude of the measurement thevertical axis is the probability of agreement for the measurement and the red line is the confidence interval p lt 005

the confidence intervals (CIs) for θs the best modeling ap-proach was the EML because its CI (056ndash059) was differentfrom that of the OLS model (047ndash052) More importantlyfor the EML and OLS models the values of θs had little vari-ation across the range of GPP estimates However the CIsof the EML and OLS models were similar in their bias es-timates α and β and both models had no bias (p lt 005)Altogether the best result was the EML ensemble using en-vironmental and remotely sensed data corresponding to thesame site ie Bernal (Fig 10c) This kind of EML wouldbe useful for gap filling or the evaluation of GPP time se-ries of the site that generated the model Machine learningalgorithms can fill gaps longer than 30 d (Kang et al 2019)

The second option to estimate GPP using the samedatasets was the multiple-regression OLS model (Fig 12a)The multiple regression is straightforward and here multi-collinearity was not a problem The EML ensemble and theOLS regression have the highest values of θs (058 and 05respectively Fig 12b) Higher values for θs are desirable(gt 095) and this could be achieved by increasing the samplesize relaxing the tolerable magnitude for agreement (here itwas set at 1 gCmminus2 dminus1) or perhaps using different forcingvariables

The MODIS estimates were a third-best alternative sincethe mean θs was 024 In the present study we used a splineto fill the data to a daily time step since the MCD17 is an8 d composite product but a similar result was obtained if

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

380 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 11 (a) MOD17 GPP daily time step from the Bernal site versus estimates of GPP from EC data using the complete dataset at adaily time step derived from spline for MOD17 GPP (copy) with r = 076 or 8 d composite estimates obtained by 8 d averages of EC GPP() with r = 076 (b) Adjusted predictions of GPP from MODIS (c) Corresponding function of probability of agreement for Fig 7a using1 g Cmminus2 dminus1 as a tolerable agreement between methods of estimation of GPP EC data from the Bernal site and MODIS MOD17 Thehorizontal axis (s) represents the magnitude of the measurement The vertical axis is the probability of agreement for the measurement

Figure 12 (a) Ordinary-least-squares multiple-regression estimates of GPP for the complete dataset from the Bernal site versus estimates ofGPP from EC data with r = 082 (b) Respective function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreement betweenmethods of estimation of GPP EC data from the Bernal site and OLS multiple regression The horizontal axis (s) represents the magnitudeof the measurement The vertical axis is the probability of agreement for the measurement

the EC GPP was rescaled and compared to the original 8 dMCD17 data (Fig 11a) The GPP from MODIS was an un-derestimate of EC GPP and when the estimates were ad-justed (Eq 6 Fig 11b) the performance was not better thanthe OLS (model not shown) The MODIS land cover classi-fication represented this site as grassland (MCD12) and thiscould be another reason for the poor agreement besides theassumptions made in the MODIS algorithm regarding the εand εmax parameters and the response of vegetation to VPDAgreement of MODIS GPP is crucial because MODIS prod-ucts are frequently used in country-wide assessments of thecarbon cycle and can influence public policies

The model with least agreement resulted when the EMLensemble generated from the Santa Rita site was used to pre-dict GPP at Bernal Machine learning models can make pre-dictions but their usefulness decreases when they are usedoutside the context of where they were built while process-

based mechanistic models have this ability Although the Au-toML function in H2O is designed to protect against over-fitting using cross-validation runs (Michailidis 2018) in ourstudy the Santa Rita model could not be generalized to repre-sent the Bernal site GPP process probably because the vari-ables and features selected for Bernal or Santa Rita were dif-ferent during the AutoML workflow

The Santa Rita model was good at predicting GPP at thatsite with a deviance of the leader model of 023 while atBernal the deviance of the leader model was 063 (Table 3)indicating that the Santa Rita EML ensemble was at leastas good a model as the EML at Bernal (not shown) TheGPP time series for Santa Rita was about 4 times the sizeof the Bernal dataset and therefore the deviance was lowerHowever when the Santa Rita model was used with Bernaldata the mean θs was 016 indicating that the agreement wasinsufficient Eyeballing the predictions in Figs 9a 10b and

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 381

11a and their corresponding correlation values (078 076and 082 for adjusted EML Santa Rita adjusted MODIS andmultiple-regression OLS) it could be argued that these mod-els were comparable However their θs plots present a dif-ferent perspective

4 Discussion

41 Model agreement

The Santa Rita model had higher θs when it extrapolatedfor low GPP values for the Bernal site suggesting that theSanta Rita model had a better skill when predicting GPPclose to 0 and even negative GPP values (Fig 10b) Al-though the Bernal and Santa Rita sites had similar vegeta-tion and climate classifications they are more than 1600 kmapart and rainfall monthly distribution is different The GPPseasonal cycle at Bernal started in February and steadily in-creased to a maximum during July and August (Table 1)At Santa Rita the GPP was low from January until mid-July (lt 05 gCO2 mminus2 dminus1) and then increased sharply to amaximum level in mid-August (Joiner and Yoshida 2020)What matters eventually for machine learning methods ishow well the predictor space rather than geographic spaceis sampled (Jung et al 2020) Incorporating data from morehumid (semiarid) sites could improve the GPP predictions bya machine learning method Not only more EC sites but alsosites representing a water availability gradient would be im-portant for semiarid ecosystems representing long spells andthe influence of oceanic oscillations and monsoon rains

All the models presented here used transient data to rep-resent GPP specifically at the 1 d time step Besides theradiation-related variables two sources of rainfall were usedas forcing variables but evapotranspiration (ET) was notused The MOD16A2 version 6 is available as an 8 d gap-filled product and could be included in EML or OLS mod-els However a more general representation of the carboncycling could be achieved when including variables that rep-resent annual or seasonal timescales of soil water evapo-ration or precipitation (Scott and Biederman 2019) Scottet al (2015) suggest that real lags between precipitation andproductivity that may impart legacy effects may also be par-tially masked by using ET as ET more carefully tracks pro-ductivity when soil moisture storage is accessed The occur-rence of off-season rainfall dry spells and carryover effectscould be parameterized as windowed events of a given dura-tion A window would be a period with distinct time bound-aries the window allows the grouping of records with similarfeatures The effect of the window at a given time could berepresented as moving weights as the point in time in ques-tion is closer or farther from the window

In our study the metric to assess model agreement wasthe probability of agreement (θs) and their bias parame-ters Many other metrics can be used to evaluate model per-

formance such as the root mean square error r r2 or themodel efficiency factor (MEF) presented by Nash and Sut-cliffe (1970) In particular the MEF is a step in the Fluxnetdata processing pipeline (Pastorello et al 2020) A MEFvalue close to 1 represents a high correlation and lower bi-ases (Joiner and Yoshida 2020) Therefore the calculatedMEF for the Bernal site EML (098) would suggest very goodpredictive performance while the θs of 06 for this model in-dicates a more modest performance The θs for a particularvalue of the measurand is the probability that the differencebetween two measurements made by different systems fallswithin an interval that is deemed to be acceptable (Stevenset al 2015) In the present study we used 1 gCmminus2 dminus1 as acritical value defining an acceptable difference If this valueis smaller then the probability of agreement would decreasefor this same model In such a case it would be less likelythat the predictors agree considering that the θs takes intoaccount both the difference in the function means at a givenvalue of the measurand and the uncertainty in its estimation

42 MODIS discrepancies

Different authors have reported discrepancies betweenMCD17 and EC estimates of GPP in semiarid regions Ex-amining MODIS discrepancies in these ecosystems is im-portant because the errors induced by cloud cover are ex-pected to be minimal and other effects can be identified(Gebremichael and Barros 2006) The GPP of MOD17 didnot relate well (EC= 011+ 017MODIS r2

= 067) withestimates of EC GPP in semidesert vegetation of the Sahel(Tagesson et al 2017) With data from different types of veg-etation in the Heihe basin in China MODIS17 overestimatedthe GPP from EC (EC= 115+024MODIS r2

= 068 Cuiet al 2016) For scrub sites in Mexico the relation be-tween GPP calculated from EC and MOD17 was not good(MODIS= 38382+ 0467EC r2

= 06 Delgado-Balbuenaet al 2018) In arid and semiarid ecosystems in China op-timizing parameters of the MODIS GPP model with site-specific data improved the estimate to explain 91 of thevariation in the GPP of the data observed by EC (Wang et al2019) These same authors propose improving the land useclassification used by the MOD17 algorithm and recalibrat-ing light use efficiency parameters to solve the GPP estima-tion problem Gebremichael and Barros (2006) examined anopen shrubland site in a semiarid region of Sonora Mexicoand their analysis of the temporal evolution of the discrep-ancies with MODIS GPP suggested revisiting the light useefficiency parameterization especially the functional depen-dence on VPD and PAR and water stress or soil moistureavailability

The relationship between the GPP MODIS and the GPPEC presented in Sect 33 is an approximation because theuncertainty in the respiration component must be consid-ered The empirical relationship between nocturnal NEE andsoil temperature has been used to represent ecosystem res-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

382 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

piration (Reco) in order to separate the processes that con-tribute to daytime NEE (Richardson and Hollinger 2005Wofsy et al 1993) Nighttime NEE should be equal to therates of autotrophic and heterotrophic respiration while dur-ing daytime NEE should be equal to the combined rates ofcarboxylation and oxidation of Rubisco autotrophic respi-ration and heterotrophic respiration Then the GPP can becalculated as the difference between daytime NEE and Recoestimated through its relationship with temperature (Gouldenet al 1996) In the present study Reco was calculated basedon soil and air temperature following the procedure of Reich-stein et al (2005) implemented in REddyProc (Wutzler et al2018) Although it is possible to measure or model the parti-tion of respiration (Running et al 2004 Wang et al 2018)the presence of cacti complicates the calculation assum-ing that all nighttime flux represents ecosystem respiration(Owen et al 2016 Richardson and Hollinger 2005) Whilesoil respiration tends to be temperature-limited when soilmoisture is nonlimiting in temperate ecosystems in range-land ecosystems the controls of soil CO2 efflux were photo-synthesis soil temperature and moisture (Roby et al 2019)In our study the instrumentation did not include measure-ments of plant or soil respiration partition to validate theRecoestimates

A problem regarding data comparison from remote orbitalsensors and terrestrial observations is that different quanti-ties are fundamentally measured MODIS measures the radi-ation reflected by the earthrsquos surface in two spectral bands at250 m spatial resolution per pixel five bands at 500 m and29 bands at 1000 m The EC technique has a footprint tomeasure CO2 that varies dynamically in shape and size butis generally considered to be 1 km2 in this study the foot-print radius was 600 m To solve the scaling MODIS prod-ucts related to the carbon cycle have been validated with theEC technique and biometric measurements on several spa-tial scales using process-based ecosystem models and char-acterizing areas up to 47 km2 around the EC measuring tower(Cohen et al 2003)

At Bernal the vegetation was heterogeneous This situa-tion was represented by the heterogeneity in vegetation ac-tivity in the 4 pixels used spanning about 02 units of NDVIat the peak of the season activity (Fig 1) The standard errorof the IVI differed by 1 order of magnitude among speciesAlthough more important species had a lower standard errortheir means were similar indicating that they were equallyabundant at all sampling plots The higher importance ofsome species (Table 2) was explained by selective grazingndashbrowsing behavior and the dispersion caused by cattle by ei-ther ingesting or transporting seeds or plant parts (Belaynehand Tessema 2017) Regarding landscape heterogeneity thetower fetch was predominantly from the northeast capturingthe less heterogeneous area of the site but also the more ac-tive according to the NDVI (Fig 1)

The thorny scrub examined had two vegetation layersthe overstory layer mainly consisting of mesquite acacia

and cacti and the understory layer that included grasses andherbs Cattle preferentially graze the understory and becausethey eat using their tongue they will avoid browsing thornyspecies unlike goats or deer which use their lips Withoutgrazing management overtime the competition balance willfavor bush species resulting in encroaching and the under-story will be stressed only unpalatable species or those withtheir growing meristems very close to the ground would sur-vive Representing the structure and functioning of these twolayers using MODIS is possible (Liu et al 2017) RecentlyHill and Guerschman (2020) presented a MODIS product de-rived from MCD43A4 to estimate the fractions of photosyn-thetic and nonphotosynthetic vegetation and the remainingfraction of bare soil These developments could improve theMOD17 GPP estimates since its model represents a homoge-neous single vegetation layer All these considerations helpto understand the low θs between MOD17 estimates of GPPand EC-derived GPP

43 Forcing variables and machine learning

Forcing variables in this study were gridded meteorologi-cal data and MODIS spectral bands and products but thevariables mostly included in the ML algorithms were LAIand FPAR accounting for 364 and 549 of the vari-able importance index at Bernal and Santa Rita respectivelyin their leader board This result supported the view thatCO2 fluxes can be represented by ML algorithms exclu-sively using remotely sensed data (Tramontana et al 2016Joiner and Yoshida 2020) Using neural networks Joiner andYoshida (2020) showed that with only satellite reflectancefrom MODIS and top-of-atmosphere PAR it is possible tocapture a large fraction of GPP variability They argue thatvegetation indices may reduce the information content ofthe underlying reflectance when compressing the informa-tion from two or more bands into a single index In our studyreflectance bands were not often included in the ML modelsThis was most likely because MOD09GA and MODOCGAprovide estimates of surface reflectance uncorrected for theillumination and viewing geometry while the MCD43 usedin Joiner and Yoshida (2020) uses a surface bidirectional re-flectance distribution function (BRDF) model that improvesthe quality of surface albedo retrievals By propagating theBRDF correction in the MODIS processing pipeline the veg-etation indices would more likely relate better to the GPPFuture modeling efforts should use benchmark scenarios ac-cording to sets of forcing variables already identified as use-ful

44 Carbon flux

Although the carbon balance in ecosystems is influenced bydifferent factors such as soil type and amount of nutrientsthe relationship with soil temperature and humidity is partic-ularly strong (Anderson-Teixeira et al 2011 Hastings et al

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 383

2005) How much of the rainwater the system can retain orlose has been described as the leakiness of the system (Guer-schman et al 2009) More than immediate incident rainfallthe available soil moisture and its redistribution are impor-tant in semiarid ecosystems including steam flow prefer-ential flow paths hydraulic lift and others (Barron-Gaffordet al 2017) At Bernal when the sum of the precipitation ofthe current month and that of the previous month was con-sidered the correlation with NEE was minus07 suggesting thatcontinuous availability of soil moisture is important for theabsorption of CO2 in this environment This result is con-sistent with other studies in which the relationship betweenthe net productivity of the ecosystem (NEP) and precipita-tion is initially positive but is leveled from 1000 to 1500 mmannually (Xu et al 2014) The hydraulic redistribution ofwater from moist (deeper) to drier soils through plant rootstended to increase modeled annual ecosystem uptake of CO2this process was identified at US-SMR (Fu et al 2018 Scottet al 2003)

The leakiness is highly dependent upon vegetation frac-tional cover the proportion of the surface occupied by baresoil and vegetation photosynthetically active vegetation andnonphotosynthetically active vegetation such as litter woodand dead biomass (Guerschman et al 2009) All these vege-tation fractions have a water storage capacity and can reducethe amount of effective rainfall available to plant roots It ispossible that the canopy of the bushes completely interceptedthe rainfall in some months because the scrub can interceptup to 20 of the precipitation and its canopy storage ca-pacity is 097 mm (Mastachi-Loza et al 2010) Consideringonly the daily rain events greater than 5 mm the correlationbetween precipitation and NEE rose to minus072 In the presentstudy the interception of rain by vegetation surfaces was notcalculated but the results suggest that it would be impor-tant to explore the relationship between net precipitation andNEE

The average NEE at a global level is minus156plusmn284 gCmminus2 yrminus1 (Baldocchi 2014) The highest frequencyamong sites that measured NEE with EC occurs from minus200to minus300 gCmminus2 yrminus1 but in sites with biometric measure-ments the peak occurs at minus100 gCmminus2 yrminus1 (Xu et al2014) Using the daily averages of Table 2 the averageNEE during the measurement period was minus078 gCmminus2 dminus1

and annually would be minus2835 gCmminus2 yrminus1 This resultwas higher than the annual values of the induced grass-land and scrubland vegetation characterizing the SonoranDesert plains (138 and 130 gCmminus2 yrminus1 Hinojo-Hinojoet al 2019) In New Mexico NEE values measured withEC are between 35ndash50 gCmminus2 yrminus1 in desert grassland and344ndash355 gCmminus2 yrminus1 in mixed coniferous forest (Anderson-Teixeira et al 2011) In a dryer region the sarcocaulescentscrubland of Baja California in Mexico the NEE was minus39and minus52 gCmminus2 yrminus1 in 2002 and 2003 respectively (Hast-ings et al 2005) The NEE measured here was within therange of NEE 03plusmn 02 kgCmminus2 yrminus1 for grasslands and

shrublands in Mexico (Murray-Tortarolo et al 2016) Al-though the measurements of the present study had gaps andwere compared with annual studies we considered that thereported value of C was representative of the main season ofgrowth of this type of scrub

45 Ecosystem management

Overgrazing is an appreciation relative to the grazing produc-tive system where the forage resource is overused in a mixedshrubndashgrass ecosystem such as Bernal it usually refers tothe understory Overgrazing means that the plant regrowth isreadily grazed tillers and root reserves are lost and eventu-ally the plant may die Although the Bernal site was over-grazed the carbon fluxes indicated that the plant communitywas photosynthetically active in both dry and rainy seasonsIt is fair to assume that water in the soil was not limiting forthe deep-rooted bush species and that was the reason whyit was possible to maintain the photosynthetic function dur-ing rainless months However this primary production wouldnot have tangible benefits for ranchersrsquo production systemsince no edible biomass would be produced for the cattleFrom the point of view of carbon capture the system accu-mulated nonlabile biomass that would remain in the systemfor a longer time compared to a grassland ecosystem (al-though it would be necessary to determine the partition ofsaid shrub biomass) However the overgrazing condition af-fects the biomass of the understory roots and consequentlythe carbon pool in the soil

In the short term it can be thought that the estimated neg-ative carbon flows had a favorable effect on the environmen-tal agenda As time passes it is possible that the gaps be-tween the individual shrubs of the overstory expand and thiswould have had an effect on soil erosion It is also possiblethat the water stored in the soil profile used by the bushesgradually decreases to the point of causing drought changesin phenology and advancements of the desertification pro-cess There are many opportunities for ecology conservationand livestock-oriented management these may include con-trolled grazing or propagating native thornless shrub speciesIf ranchers do not identify a benefit in the vegetation thenthey will be tempted to remove it as it occurred at thestudy site Because of its wide coverage and availability theMODIS GPP product accuracy is important in representingthe carbon cycle raising awareness and monitoring advance-ment of environmental decisions

Although we found that EML was a good option for mod-eling the GPP of a site what is really needed to evaluate theperformance of semiarid ecosystems is a spatial representa-tion of the carbon flux This is a problem for an underrepre-sented area regarding instrumented EC towers However theEML could be designed to take into account the explanatoryvariables in a spatiotemporal continuity

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

384 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

5 Conclusions

The best modeling approach was the ensemble of ma-chine learning The second option to estimate GPP was themultiple-regression OLS and the third alternative were theMODIS estimates Machine learning was a good option topredict GPP in the local context where it was generated oth-erwise its performance was not good This was demonstratedwhen using the EML from Santa Rita and trying to predictfluxes at Bernal Nevertheless a machine learning modelwould be useful for gap filling or the evaluation of GPP atthe same site The GPP estimates of a given model can be ad-justed using the bias parameters of the probability of agree-ment to improve the relationship Further research explor-ing the usefulness of EML to predict GPP in other contextscould use EML to calibrate mechanistic model parametershybrid approaches integrating EML and biologically mean-ingful mechanistic models or EML representing the spatialvariability of the landscape

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 385

Appendix A

Table A1 Threshold of velocity friction (ulowast) above which nighttime fluxes were considered valid Estimates were obtained using the Rpackage REddyProc We used only the flux records with ulowast equal to or higher than the corresponding mean ulowast threshold for each yearBound values are a 95 confidence interval NA not applicable

Aggregation method Year Season Mean Lower bound 5 Upper bound 95

ulowast (msminus1)

Single NA NA 0193 0140 0209Year 2017 NA 0194 0128 0227Year 2018 NA 0193 0132 0225Season 2017 2016012 0194 0128 0227Season 2017 2017003 0215 0141 0297Season 2017 2017006 0194 0128 0227Season 2017 2017009 0181 0144 0208Season 2018 2017012 0149 0116 0232Season 2018 2018003 0271 0134 0297Season 2018 2018006 0193 0132 0221

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

386 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix B

Table B1 MODIS reflectance bands and products database

Product name Satellite layer Spatial Temporal Spectralresolution resolution coverage

(d) (nm)

MOD09GA Band 1 500 m 1 620ndash670Band 2 841ndash876Band 3 459ndash479Band 4 545ndash565Band 5 1230ndash1250Band 6 1628ndash1652Band 7 2105ndash2155

MODOCGA Band 8 1 km 405ndash420Band 9 438ndash448Band 10 483ndash493Band 11 526ndash536Band 12 546ndash556Band 13 662ndash672Band 14 673ndash683

MOD17A2H Gross primary production (GPP) 8

Net photosynthesis (PsnNet)

MOD15A2H Fraction of photosynthetically active radiation (Fpar) na

Leaf area index (LAI)

MOD11A2 Land surface temperature and emissivity of day (LST day) 1 km

Land surface temperature and emissivity of night (LST night)

MYD13A1 Enhanced vegetation index (EVI) 500 m 16

MYD15A2H Fraction of photosynthetically active radiation (Fpar) 8

Leaf area index (LAI)

MYD11A2 Land surface temperature and emissivity of day (LST day) 1 km na

Land surface temperature and emissivity of night (LST night)

MCD15A2H Fraction of photosynthetically active radiation (Fpar) 500 m

Leaf area index (LAI)

na not applicable

Table B2 Daily vegetation indexes computed using the MODIS reflectance bands described in Table 1

Index Formula Reference

Simple ratio (SimpleR) SimpleR= Band 2Band 1

(Hill et al 2006)Moisture stress (MoistS) MoistS= Band 6

Band 2

Disease stress index (DSI) DSI= Band 2+Band 4Band 6+Band 1

Redndashgreen ratio index (RGRI) RGRI= Band 1Band 4

Normalized difference vegetation index (NDVI) NDVI= Band 2minusBand 1Band 2+Band 1

Normalized difference water index (NDVI_w) NDVI_w =Band 2minusBand 5Band 2+Band 5

Green leaf index (GLI) GLI= 2lowastBand 11minusBand 14minusBand 92lowastBand 11+Band 14+Band

(Henrich et al 2012)Green atmospherically resistant vegetation (GARI) GARI= Band 5minus(Band 11minus(Band 9minusBand 14))

Band 5minus(Band 11+(Band 9minusBand 14))

Enhanced vegetation index (EVI) EVI= 25lowast Band 2minusBand 1Band 2+(6lowastBand 1)minus(75lowastBand 9)+1

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 387

Table B3 Daymet meteorological database

Variable Spatial resolution Temporal resolution Reference

Precipitation (Dayprc)

1 km Daily (Thornton et al 2017)Shortwave radiation (Daysrad)Maximum air temperature (DayTmax)Minimum air temperature (DayTmin)Water vapor pressure (Dayvp)

Table B4 Precipitation data

Satellite Product name Spatial resolution Temporal resolution

Global Precipitation Measurement (GPM) 3IMERGDF v006 010Daily

Tropical Rainfall Measuring Mission (TRMM) 3B42 v007 025

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

388 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix C

Table C1 Analysis of variance for GGP derived from EC data atthe Bernal site Variable details are described in Appendix B

Model term Coefficient SEa DFb SSc F p VIFd

Regression 14 178048Constant minus1318 648 minus203 004R_08_40542e

minus212 060 1 2680 minus352 0 220R_13_66267e

minus286 123 1 1161 minus232 002 155R_14_67368e 917 259 1 2706 353 0 122RGRI6

minus083 037 1 1100 minus225 003 143GARIf 201 108 1 755 187 006 237EVIMYDg 853 201 1 3903 425 0 294NDVI_wf 390 155 1 1367 251 001 271PsnNetg 0013 0 1 7530 590 0 383LstNgtMYDg 005 002 1 818 194 005 201Dayprch 014 003 1 3590 407 0 192Daysradh 001 0 1 5121 486 0 501DayTmaxh

minus024 006 1 3658 minus411 0 689DayTminh 018 005 1 3222 386 0 496TRMMi minus004 002 1 1087 minus224 003 141Error 403 87297Total 417 265345

a Standard error of coefficient b Degrees of freedom c Adjusted sum of squares d Variance inflation factore Variable name denotes the band number and spectral bandwidth of MODIS (Moderate Resolution ImagingSpectroradiometer) f Vegetation indices RGRI is redndashgreen ratio index and GARI is green atmosphericallyresistant vegetation index details of formula are described in Appendix B g Layers from MODIS productsEVIMYD is enhanced vegetation index from MYD13 PsnNet is photosynthesis product form MOD17LstNgtMYD is nighttime land surface temperature emissivity from MYD11 h Variables obtained fromDaymet daily dataset DayTmin is minimum temperature DayTmax is maximum temperature Daysrad isshortwave radiation Dayprc is precipitation i Daily rainfall rate from 3B42 TRMM (Tropical RainfallMeasuring Mission)

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 389

Code and data availability Database and programming code areavailable at httpsdoiorg105281zenodo3598595 (Guevara-Escobar et al 2020)

Author contributions AGE contributed to conceptualization for-mal analysis investigation methodology software supervisionvalidation visualization and writing of the original draft as wellas review and editing EGS was in charge of funding acquisitionproject administration and writing and review MCJ contributed toinvestigation methodology validation visualization and writing ofthe original draft as well as review editing and communicationMEQB helped with investigation validation writing and reviewHSA contributed with conceptualization writing and review ICAacutehelped with investigation writing and review VHCS contributedwith writing review and funding acquisition

Competing interests The authors declare that they have no conflictof interest

Acknowledgement We thank NASA and Ameriflux for data avail-ability Data filtering was carried out by the cluster Abacus ICentre of Applied Mathematics and High Performance ComputingABACUS-CINVESTAV

Financial support This research has been supported by CONA-CyT ndash SEMARNAT (grant no 2014-1-(249407))

Review statement This paper was edited by Paul Stoy and re-viewed by three anonymous referees

References

Aguirre-Diacuteaz G J Aguilloacuten-Robles A Tristaacuten-Gonzaacutelez MLabarthe-Hernaacutendez G Loacutepez-Martiacutenez M Bellon Hand Nieto-Obregoacuten J Geologic setting of the Pentildeade Bernal Natural Monument Quereacutetaro Meacutexico Anendogenous volcanic dome Geosphere 9 557ndash571httpsdoiorg101130GES008431 2013

Anav A Friedlingstein P Beer C Ciais P Harper AJones C Murray-Tortarolo G Papale D Parazoo N Cand Peylin P Spatiotemporal patterns of terrestrial grossprimary production A review Rev Geophys 53 785ndash818httpsdoiorg1010022015RG000483 2015

Anderson-Teixeira K J Delong J P Fox A M Brese D Aand Litvak M E Differential responses of production and res-piration to temperature and moisture drive the carbon balanceacross a climatic gradient in New Mexico Glob Change Biol17 410ndash424 httpsdoiorg101111j1365-2486201002269x2011

Baldocchi D Measuring fluxes of trace gases and energy be-tween ecosystems and the atmospherendashthe state and future of the

eddy covariance method Glob Change Biol 20 3600ndash3609httpsdoiorg101111gcb12649 2014

Barron-Gafford G A Sanchez-Cantildeete E P Minor R LHendryx S M Lee E Sutter L F Tran N Parra EColella T Murphy P C Hamerlynck E P Kumar P andScott R L Impacts of hydraulic redistribution on grassndashtreecompetition vs facilitation in a semi-arid savanna New Phytol215 1451ndash1461 httpsdoiorg101111nph14693 2017

Belayneh A and Tessema Z K Mechanisms of bush encroach-ment and its inter-connection with rangeland degradation insemi-arid African ecosystems a review J Arid Land 9 299ndash312 httpsdoiorg101007s40333-016-0023-x 2017

Bland J M and Altman D G Statistical methods for assessingagreement between two methods of clinical measurement Int JNurs Stud 47 931ndash936 2010

Boettiger C and Eddelbuettel D An introduction to rockerDocker containers 639 for r ArXiv Prepr ArXiv171003675 6402017

Bonilla-Moheno M and Aide T M Beyond deforestationLand cover transitions in Mexico Agric Syst 178 102734httpsdoiorg101016jagsy2019102734 2020

Booker K Huntsinger L Bartolome J W Sayre N Fand Stewart W What can ecological science tell us aboutopportunities for carbon sequestration on arid rangelands inthe United States Glob Environ Change 23 240ndash251httpsdoiorg101016jgloenvcha201210001 2013

Butzer K W and Butzer E K The lsquonaturalrsquovegetation ofthe Mexican Bajiacuteo archival documentation of a 16th-century savanna environment Quat Int 43 161ndash172httpsdoiorg101016S1040-6182(97)00032-3 1997

CICESE C de I C y de E S de E Base de datos climatoloacutegicanacional (CLICOM) Villa Bernal Quereacutetaro available at httpclicom-mexcicesemx (last access 28 August 2019) 2015

Cohen W B Maiersperger T K Yang Z Gower S TTurner D P Ritts W D Berterretche M and Run-ning S W Comparisons of land cover and LAI esti-mates derived from ETM+ and MODIS for four sites inNorth America A quality assessment of 20002001 provi-sional MODIS products Remote Sens Environ 88 233ndash255httpsdoiorg101016jrse200306006 2003

Cui T Wang Y Sun R Qiao C Fan W Jiang GHao L and Zhang L Estimating vegetation primaryproduction in the Heihe River Basin of China withmulti-source and multi-scale data PloS One 11 1ndash20httpsdoiorg101371journalpone0153971 2016

Curtis J T and McIntosh R P The Interrelations of Certain An-alytic and Synthetic Phytosociological Characters Ecology 31434ndash455 httpsdoiorg1023071931497 1950

Delgado-Balbuena J Yepez E A Aacutengeles-Peacuterez G Aguirre-Gutierrez C Arredondo T Ayala-Nintildeo F Bullock S HCastellanos A E Cueva A Figueroa-Espinoza B Garatuza-Payaacuten J Hinojo-Hinojo C Maya-Delgado Y Meacutendez-Barroso L Oechel W Paz-Pellat F Peacuterez-Ruiacutez E R Ro-driacuteguez J C Rojas-Robles N Saacutenchez-Mejiacutea Z M Uuh-Sonda J Vargas R Verduzco V S Vivoni E R andWatts C Flujos anuales de carbono en ecosistemas terrestres deMeacutexico in Estado Actual del Conocimiento del Ciclo del Car-bono y sus Interacciones en Meacutexico Siacutentesis a 2018 p 686Texcoco Estado de Meacutexico Meacutexico 2018

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

390 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Eshel G Dayalu A Wofsy S C Munger J W andTziperman E Listening to the Forest An ArtificialNeural Network-Based Model of Carbon Uptake at Har-vard Forest J Geophys Res-Biogeo 124 461ndash478httpsdoiorg1010292018JG004791 2019

Fu C Wang G Bible K Goulden M L Saleska S RScott R L and Cardon Z G Hydraulic redistributionaffects modeled carbon cycling via soil microbial activityand suppressed fire Glob Change Biol 24 3472ndash3485httpsdoiorg101111gcb14164 2018

Gebremichael M and Barros A P Evaluation of MODISGross Primary Productivity (GPP) in tropical mon-soon regions Remote Sens Environ 100 150ndash166httpsdoiorg101016jrse200510009 2006

Gitelson A A Peng Y Arkebauer T J and Suyker A EProductivity absorbed photosynthetically active radiation andlight use efficiency in crops Implications for remote sensingof crop primary production J Plant Physiol 177 100ndash109httpsdoiorg101016jjplph201412015 2015

Glenn E P Scott R L Nguyen U and Nagler P L Wide-area ratios of evapotranspiration to precipitation in monsoon-dependent semiarid vegetation communities J Arid Envi-ron 117 84ndash95 httpsdoiorg101016jjaridenv2015020102015

Goldstein A Turner W R Spawn S A Anderson-Teixeira K J Cook-Patton S Fargione J Gibbs H KGriscom B Hewson J H and Howard J F Protecting ir-recoverable carbon in Earthrsquos ecosystems Nat Clim Change10 1ndash9 httpsdoiorg101038s41558-020-0738-8 2020

Goulden M L Munger J W Fan S Daube B Cand Wofsy S C Measurements of carbon sequestra-tion by long-term eddy covariance Methods and a criti-cal evaluation of accuracy Glob Change Biol 2 169ndash182httpsdoiorg101111j1365-24861996tb00070x 1996

Grossiord C Buckley T N Cernusak L A Novick K A Poul-ter B Siegwolf R T Sperry J S and McDowell N GPlant responses to rising vapor pressure deficit New Phytol226 1550ndash1566 2020

Guerschman J P Hill M J Renzullo L J Barrett D JMarks A S and Botha E J Estimating fractional coverof photosynthetic vegetation non-photosynthetic vegetation andbare soil in the Australian tropical savanna region upscaling theEO-1 Hyperion and MODIS sensors Remote Sens Environ113 928ndash945 httpsdoiorg101016jrse200901006 2009

Guevara-Escobar A Gonzaacutelez-Sosa E Cervantes-Jimeacutenez MSuzaacuten-Azpiri H Queijeiro-Bolantildeos M E Carrillo-AacutengelesI and Cambroacuten-Sandoval V H Machine learning estimates ofeddy covariance carbon flux in a scrub in the Mexican highlandZenodo httpsdoiorg105281zenodo3598595 2020

Hall P Gill N Kurka M Phan W and Bartz A MachineLearning Interpretability with H2O Driverless AI edited byBartz A H2Oai Inc California US 2019

Hastings S J Oechel W C and Muhlia-Melo A Diur-nal seasonal and annual variation in the net ecosystem CO2exchange of a desert shrub community (Sarcocaulescent) inBaja California Mexico Glob Change Biol 11 927ndash939httpsdoiorg101111j1365-2486200500951x 2005

Henrich V Krauss G Goumltze C and Sandow C Entwicklungeiner Datenbank fuumlr Fernerkundungsindizes vol 45 AK Fern-erkundung Bochum 2012

Hill M Held A Leuning R Coops N Hughes D andCleugh H MODIS spectral signals at a flux tower site Rela-tionships with high-resolution data and CO2 flux and light useefficiency measurements Remote Sens Environ 103 351ndash368httpsdoiorg101016jrse200506015 2006

Hill M J and Guerschman J P The MODIS Global VegetationFractional Cover Product 2001ndash2018 Characteristics of Vegeta-tion Fractional Cover in Grasslands and Savanna Woodlands Re-mote Sens 12 406 httpsdoiorg103390rs12030406 2020

Hinojo-Hinojo C Castellanos A E Huxman T Ro-driguez J C Vargas R Romo-Leoacuten J R and Bie-derman J A Native shrubland and managed buffelgrasssavanna in drylands Implications for ecosystem carbonand water fluxes Agr Forest Meteorol 268 269ndash278httpsdoiorg101016jagrformet201901030 2019

Jaumlrvi L Havu M Ward H C Bellucco V McFad-den J P Toivonen T Heikinheimo V Kolari P Ri-ikonen A and Grimmond C S B Spatial modeling oflocal-scale biogenic and anthropogenic carbon dioxide emis-sions in Helsinki J Geophys Res-Atmos 124 8363ndash8384httpsdoiorg1010292018JD029576 2019

Joiner J and Yoshida Y Satellite-based reflectances capture largefraction of variability in global gross primary production (GPP)at weekly time scales Agr Forest Meteorol 291 108092httpsdoiorg101016jagrformet2020108092 2020

Jung M Schwalm C Migliavacca M Walther S Camps-VallsG Koirala S Anthoni P Besnard S Bodesheim P Carval-hais N Chevallier F Gans F Goll D S Haverd V KoumlhlerP Ichii K Jain A K Liu J Lombardozzi D Nabel J EM S Nelson J A OrsquoSullivan M Pallandt M Papale DPeters W Pongratz J Roumldenbeck C Sitch S TramontanaG Walker A Weber U and Reichstein M Scaling carbonfluxes from eddy covariance sites to globe synthesis and eval-uation of the FLUXCOM approach Biogeosciences 17 1343ndash1365 httpsdoiorg105194bg-17-1343-2020 2020

Kang M Ichii K Kim J Indrawati Y M Park J Moon MLim J-H and Chun J-H New Gap-Filling Strategies forLong-Period Flux Data Gaps Using a Data-Driven ApproachAtmosphere 10 568 httpsdoiorg103390atmos101005682019

Lal R Griffin M Apt J Lave L and Morgan M GManaging Soil Carbon Science 304 393ndash393httpsdoiorg101126science1093079 2004

LI-COR B Eddypro Software instruction manual Nebraska USversion 70 2019

Liu Y Liu R Pisek J and Chen J M Separating overstoryand understory leaf area indices for global needleleaf and de-ciduous broadleaf forests by fusion of MODIS and MISR dataBiogeosciences 14 1093ndash1110 httpsdoiorg105194bg-14-1093-2017 2017

Long S P Marshall-Colon A and Zhu X-G Meetingthe Global Food Demand of the Future by EngineeringCrop Photosynthesis and Yield Potential Cell 161 56ndash66httpsdoiorg101016jcell201503019 2015

Ma X Huete A Yu Q Restrepo-Coupe N Beringer J Hut-ley L B Kanniah K D Cleverly J and Eamus D Param-

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 391

eterization of an ecosystem light-use-efficiency model for pre-dicting savanna GPP using MODIS EVI Remote Sens Environ154 253ndash271 2014

Ma X Migliavacca M Wirth C Bohn F J Huth ARichter R and Mahecha M D Monitoring Plant FunctionalDiversity Using the Reflectance and Echo from Space RemoteSens 12 1248 httpsdoiorg103390rs12081248 2020

Marcolla B Roumldenbeck C and Cescatti A Patterns and con-trols of inter-annual variability in the terrestrial carbon budgetBiogeosciences 14 3815ndash3829 httpsdoiorg105194bg-14-3815-2017 2017

Mastachi-Loza C A Gonzaacutelez-Sosa E Becerril-Pintildea R andBraud I Peacuterdidas por intercepcioacuten en mezquite (Prosopis laevi-gata) y huizache (Acacia farnesiana) de la regioacuten semiaacuterida delcentro de Meacutexico Tecnol Cienc Agua 1 103ndash120 2010

Mauder M and Foken T Documentation and instruction man-ual of the eddy-covariance software package TK3 UniversitaumltBayreuth Abteilung Mikrometeorologie Bayreuth Germany2011

McClaran M A century of vegetation change on the Santa Rita Ex-perimental Range in Santa Rita Experimental Range 100 years(1903 to 2003) to accomplishments and contributions edited byMcClaran M P Ffolliott P F and Edminister C B Tech Co-ords USDA Forest service proceedings RMRS-P-30 TucsonAZ US 16ndash33 2003

McCulley R Jobbagy E Pockman W and Jackson R Nu-trient uptake as a contributing explanation for deep rootingin arid and semi-arid ecosystems Oecologia 141 620ndash628httpsdoiorg101007s00442-004-1687-z 2004

McElreath R Statistical Rethinking A Bayesian Course with Ex-amples in R and STAN CRC Press Boca Raton FL US 2020

Michailidis M How Driverless AI Prevents Overfitting and Leak-age Open Source Lead AI ML available at httpswwwh2oaiblogdriverless-ai-prevents-overfitting-leakage (last access7 July 2020) 2018

Mitchell R and Frank E Accelerating the XGBoost algo-rithm using GPU computing PeerJ Comput Sci 3 e127httpsdoiorg107717peerj-cs127 2017

Monteith J Solar radiation and productivity in tropical ecosys-tems J Appl Ecol 9 747ndash766 1972

Murray-Tortarolo G Friedlingstein P Sitch S Jaramillo V JMurguiacutea-Flores F Anav A Liu Y Arneth A Arvanitis AHarper A Jain A Kato E Koven C Poulter B Stocker BD Wiltshire A Zaehle S and Zeng N The carbon cycle inMexico past present and future of C stocks and fluxes Biogeo-sciences 13 223ndash238 httpsdoiorg105194bg-13-223-20162016

Nash J E and Sutcliffe J V River flow forecasting through con-ceptual models part I ndash A discussion of principles J Hydrol 10282ndash290 1970

Owen N A Choncubhair Oacute N Males J del Real Laborde J IRubio-Corteacutes R Griffiths H and Lanigan G Eddy covari-ance captures four-phase crassulacean acid metabolism (CAM)gas exchange signature in Agave Plant Cell Environ 39 295ndash309 httpsdoiorg101111pce12610 2016

Pasetto D Arenas-Castro S Bustamante J Casagrandi RChrysoulakis N Cord A F Dittrich A Domingo-Marimon C El Serafy G and Karnieli A Integration ofsatellite remote sensing data in ecosystem modelling at local

scales Practices and trends Methods Ecol Evol 9 1810ndash1821httpsdoiorg1011112041-210X13018 2018

Pastorello G Trotta C and Canfora E The FLUXNET2015dataset and the ONEFlux processing pipeline for eddy covari-ance data Sci Data 7 225 pp httpsdoiorg101038s41597-020-0534-3 2020

Pearcy R W Heterogeneous light environments in Handbookof Functional Plant Ecology edited by Pugnaire I F and Val-ladares F CRC Press Boca Ratoacuten FL US 270ndash314 2007

Piao S Liu Q Chen A Janssens I A Fu Y Dai J Liu LLian X Shen M and Zhu X Plant phenology and global cli-mate change Current progresses and challenges Glob ChangeBiol 25 1922ndash1940 httpsdoiorg101111gcb14619 2019

R Development Core Team R A language and environment forstatistical computing R Foundation for Statistical Computingavailable at httpswwwr-projectorg (last access 15 October2015) 2009

Reichstein M Falge E Baldocchi D Papale D Aubinet MBerbigier P Bernhofer C Buchmann N Gilmanov Tand Granier A On the separation of net ecosystem ex-change into assimilation and ecosystem respiration reviewand improved algorithm Glob Change Biol 11 1424ndash1439httpsdoiorg101111j1365-24862005001002x 2005

Richardson A D and Hollinger D Y Statistical model-ing of ecosystem respiration using eddy covariance datamaximum likelihood parameter estimation and Monte Carlosimulation of model and parameter uncertainty applied tothree simple models Agr Forest Meteorol 131 191ndash208httpsdoiorg101016jagrformet200505008 2005

Richardson A D Hollinger D Y Shoemaker J K Hughes HSavage K and Davidson E A Six years of ecosystem-atmosphere greenhouse gas fluxes measured in a sub-borealforest Sci Data 6 1ndash15 httpsdoiorg101038s41597-019-0119-1 2019

Roby M C Scott R L Barron-Gafford G A Hamerlynck E Pand Moore D J P Environmental and Vegetative Controls onSoil CO2 Efflux in Three Semiarid Ecosystems Soil Syst 3 6httpsdoiorg103390soilsystems3010006 2019

Running S W and Zhao M Daily GPP and annual NPP(MOD17A2A3) products NASA Earth Observing SystemMODIS land algorithm Univ of Mont Missoula Mont US2015

Running S W Nemani R R Heinsch F A Zhao MReeves M and Hashimoto H A continuous satellite-derived measure of global terrestrial primary productionBioscience 54 547ndash560 httpsdoiorg1016410006-3568(2004)054[0547ACSMOG]20CO2 2004

Rzedowski J Vegetacioacuten de Meacutexico Meacutexico Editor Limusa1978

Scott R L and Biederman J A Critical Zone Water Balance Over13 Years in a Semiarid Savanna Water Resour Res 55 574ndash588 httpsdoiorg1010292018WR023477 2019

Scott R L Watts C Payan J G Edwards E Goodrich D CWilliams D and James Shuttleworth W The understory andoverstory partitioning of energy and water fluxes in an opencanopy semiarid woodland Agr Forest Meteorol 114 127ndash139 httpsdoiorg101016S0168-1923(02)00197-1 2003

Scott R L Edwards E A Shuttleworth W J Huxman T EWatts C and Goodrich D C Interannual and seasonal vari-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

392 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

ation in fluxes of water and carbon dioxide from a ripar-ian woodland ecosystem Agr Forest Meteorol 122 65ndash84httpsdoiorg101016jagrformet200309001 2004

Scott R L Biederman J A Hamerlynck E P and Barron-Gafford G A The carbon balance pivot point of southwest-ern US semiarid ecosystems Insights from the 21st centurydrought J Geophys Res-Biogeo 120 2612ndash2624 2015

Segerstrom K Geology of the Bernal-Jalpan Area Estado deQueretaro Mexico Geol Surv Bull 1104 19ndash82 1961

Soper F M McCalley C K Sparks K and Sparks J P Soilcarbon dioxide emissions from the Mojave desert Isotopic evi-dence for a carbonate source Geophys Res Lett 44 245ndash251httpsdoiorg1010022016GL071198 2017

Stevens N T Steiner S H and MacKay R J Assessing agree-ment between two measurement systems An alternative to thelimits of agreement approach Stat Methods Med Res 262487ndash2504 httpsdoiorg1011770962280215601133 2015

Tagesson T Ardouml J Cappelaere B Kergoat L AbdiA Horion S and Fensholt R Modelling spatial andtemporal dynamics of gross primary production in theSahel from earth-observation-based photosynthetic capacityand quantum efficiency Biogeosciences 14 1333ndash1348httpsdoiorg105194bg-14-1333-2017 2017

Thornton P Thornton M Mayer B Wei Y Devarakonda RVose R and Cook R Daymet Daily Surface Weather Data ona 1-km Grid for North America Version 3 Oak Ridge NationalLab (ORNL) Oak Ridge TN US 2017

Tramontana G Jung M Schwalm C R Ichii K Camps-VallsG Raacuteduly B Reichstein M Arain M A Cescatti A KielyG Merbold L Serrano-Ortiz P Sickert S Wolf S andPapale D Predicting carbon dioxide and energy fluxes acrossglobal FLUXNET sites with regression algorithms Biogeo-sciences 13 4291ndash4313 httpsdoiorg105194bg-13-4291-2016 2016rdquo

Truong A Walters A Goodsitt J Hines K Bruss C B andFarivar R Towards automated machine learning Evaluationand comparison of AutoML approaches and tools in Proceed-ings of the 2019 IEEE 31st International Conference on Toolswith Artificial Intelligence (ICTAI) Portland OR US 1471ndash1479 2019

Verbist K Santibantildeez F Gabriels D and Soto G Atlas dezonas aacuteridas de Ameacuterica Latina y El Caribe Donoso MC(Coord) UNESCO-PHI and CAZALAC Montevideo Uruguay2010

Wang H Li X Ma M and Geng L Improving esti-mation of gross primary production in dryland ecosystemsby a model-data fusion approach Remote Sens 11 225httpsdoiorg103390rs11030225 2019

Wang X Ma M Li X Song Y Tan J Huang G Zhang ZZhao T Feng J and Ma Z Validation of MODIS-GPP prod-uct at 10 flux sites in northern China Int J Remote Sens 34587ndash599 httpsdoiorg101080014311612012715774 2013

Wang X Wang H Li X and Ran Y Photosynthesis (NPPNEP Respiration) Obs Meas Ecohydrol Process 1ndash30httpsdoiorg101007978-3-662-47871-4_10-1 2018

Wheeler C W Archer S R Asner G P and McMurtry C RClimaticedaphic controls on soil carbonnitrogen response toshrub encroachment in desert grassland Ecol Appl 17 1911ndash1928 httpsdoiorg10189006-15801 2007

Wilcox B P Birt A Fuhlendorf S D and Archer S R Emerg-ing frameworks for understanding and mitigating woody plantencroachment in grassy biomes Curr Opin Env Sust 32 46ndash52 httpsdoiorg101016jcosust201804005 2018

Wilson J B Cover plus ways of measuring plant canopiesand the terms used for them J Veg Sci 22 197ndash206httpsdoiorg101111j1654-1103201001238x 2011

Wofsy S Goulden M Munger J Fan S-M Bakwin PDaube B Bassow S and Bazzaz F Net exchange ofCO2 in a mid-latitude forest Science 260 1314ndash1317httpsdoiorg101126science26051121314 1993

Wu C Niu Z and Gao S Gross primary production estimationfrom MODIS data with vegetation index and photosyntheticallyactive radiation in maize J Geophys Res-Atmos 115 D12127httpsdoiorg1010292009JD013023 2010

Wutzler T Lucas-Moffat A Migliavacca M Knauer JSickel K Šigut L Menzer O and Reichstein MBasic and extensible post-processing of eddy covarianceflux data with REddyProc Biogeosciences 15 5015ndash5030httpsdoiorg105194bg-15-5015-2018 2018

Xiao H McDonald-Madden E Sabbadin R Peyrard NDee L E and Chadegraves I The value of understanding feed-backs from ecosystem functions to species for managing ecosys-tems Nat Commun 10 1ndash10 httpsdoiorg101038s41467-019-11890-7 2019

Xu B Yang Y Li P Shen H and Fang J Globalpatterns of ecosystem carbon flux in forests A biometricdata-based synthesis Global Biogeochem Cy 28 962ndash973httpsdoiorg1010022013GB004593 2014

Yao J Liu H Huang J Gao Z Wang G Li D Yu H andChen X Accelerated dryland expansion regulates future vari-ability in dryland gross primary production Nat Commun 111ndash10 httpsdoiorg101038s41467-020-15515-2 2020

Yepez E A Williams D G Scott R L and Lin GPartitioning overstory and understory evapotranspiration ina semiarid savanna woodland from the isotopic composi-tion of water vapor Agr Forest Meteorol 119 53ndash68httpsdoiorg101016S0168-1923(03)00116-3 2003

Yona L Cashore B Jackson R B Ometto J and Brad-ford M A Refining national greenhouse gas inventories Am-bio 1ndash6 httpsdoiorg101007s13280-019-01312-9 2020

Zeng X Zeng X and Barlage M Growing temperate shrubsover arid and semiarid regions in the Community Land ModelndashDynamic Global Vegetation Model Global Biogeochem Cy 22GB3003 httpsdoiorg1010292007GB003014 2008

Zhang A Jia G Epstein H E and Xia J ENSO elicits oppos-ing responses of semi-arid vegetation between Hemispheres SciRep 7 42281 httpsdoiorg101038srep42281 2017

Zhang L Xiao J Zheng Y Li S and Zhou Y Increased carbonuptake and water use efficiency in global semi-arid ecosystemsEnviron Res Lett 15 034022 httpsdoiorg1010881748-9326ab68ec 2020

Zhu X-G Long S P and Ort D R What is the max-imum efficiency with which photosynthesis can convert so-lar energy into biomass Curr Opin Biotech 19 153ndash159httpsdoiorg101016jcopbio200802004 2008

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

  • Abstract
  • Introduction
  • Materials and methods
    • Site description
    • Eddy covariance measurements
    • Flux data processing
    • Remotely sensed data
    • MODIS algorithm for GPP
    • Santa Rita site dataset
    • Modeling
    • Variable importance
    • Model agreement
      • Results
        • Eddy covariance fluxes at Bernal
        • Machine learning ensembles as predictors of eddy covariance GPP
        • MODIS as predictor of eddy covariance GPP
        • Prediction eddy covariance GPP with ordinary-least-squares multiple regression
        • Agreement
          • Discussion
            • Model agreement
            • MODIS discrepancies
            • Forcing variables and machine learning
            • Carbon flux
            • Ecosystem management
              • Conclusions
              • Appendix A
              • Appendix B
              • Appendix C
              • Code and data availability
              • Author contributions
              • Competing interests
              • Acknowledgement
              • Financial support
              • Review statement
              • References
Page 10: Machine learning estimates of eddy covariance carbon flux ... · Scrub is a xeric category of shrublands characterized by plants with small leaves; it is very thorny, and its biomass

376 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 6 Latent heat flux daily trend at Bernal during different months emphasizing nighttime λE Panel (a) shows months with low rainfallin the previous month and predominantly negative λE during nighttime (these months had low rainfall 017 113 and 033 mm rainfall forJanuary March and December) Panel (b) shows months with low rainfall in the previous month and positive λE after sunset (67 73 and207 mm rainfall for February April and May) Panel (c) shows months during the rainy season with positive λE mainly due to soil wetnessand antecedent rainfall of 43 202 9 190 and 34 mm for June July August October and November Plot (c) is out of scale on the y axis forcompatibility with the other plots

Table 2 Importance value index (IVI) and leaf area index (LAI) of the main species present at the Bernal Quereacutetaro study site

Species Plant type IVI SEMa LAI SEM

Coryphantha cornifera Cactus 007 027Bouvardia ternifolia Herb 007 027Karwinskia humboldtiana Shrub 007 027Forestiera phillyreoides Shrub 009 027Ferocactus latispinus Cactus 009 027Cylindropuntia leptocaulis Cactus 009 027Asphodelus fistulosus Shrub 009 019Brickellia veronicifolia Shrub 010 027Dalea lutea Shrub 011 015Eysenhardtia polystachya Legume 013 015Myrtillocactus geometrizans Cactus 014 027Schinus molle Shrub 014 019Jatropha dioica Herb 015 019Mammillaria uncinata Cactus 016 012Opuntia tomentosa Cactus 017 011Opuntia robusta Cactus 023 007Opuntia hyptiacantha Cactus 026 007Mimosa monancistra Legume 028 012Mimosa depauperata Legume 031 012Zaluzania augusta Shrub 033 010Viguiera linearis Herb 036 011Acacia schaffneri Legume 041 007 113 015Prosopis laevigata Legume 041 007 148 012Acacia farnesiana Legume 056 009 112 037Cylindropuntia imbricata Cactus 074 007 113 011

a SEM standard error of the mean

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 377

Figure 7 Monthly rainfall and land surface temperature (LST) dur-ing years (a) 2017 and (b) 2018 at the Bernal site The LST valuescorrespond to the 1330 (LST day) and 0130 (LST night) MODISAqua satellite overpasses

Predictions of GPP in the testing dataset showed disper-sion in the lower range of the scale of measurement andthe correlation was 094 (Fig 9a) The final prediction ofGPP for the whole dataset had a probability of agreement(θs) of 058plusmn 001 (parameter estimate and standard error)α =minus00616plusmn 011 and β = 10133plusmn 002 suggesting agood fit with low fixed and proportional bias (Fig 9b) Theprobability of agreement decreased slightly at the lower andupper range of the scale of measurement (Fig 9c) indicatingthat the EML model would predict GPP without increasingthe bias particularly in the range from 0 to 4 gCmminus2 dminus1However the value of θs should be higher than 095 so as toconsider EC measurements and EML to be interchangeableThe correlation of 099 (r2

= 098) for the data in Fig 9bcould be misleading as it would suggest a very good fit

Using only the five of the more important variables namedabove to generate an EML resulted in an XBoost leadermodel with 273 deviance and a total of 1094 models Thetop 30 models were 16 XBoost and GBM models and the

best-of-family ensembles started to show up in 12th placeAlthough the number of runs was the same (20) the AutoMLfunction increased the number of produced models but thesmaller set of explanatory variables constrained the ability toidentify features contributing to better models Using anotherset of five randomly selected explanatory variables (one veg-etation index and four MODIS bands) resulted in a leadermodel with 252 deviance out of 1024 models but this timethe leader was the best-of-the-family ensemble Using only afew variables was considered to increase the deviance com-pared to the average deviance of 135 obtained during thetraining phase and using all available variables

An important question for modeling upscaling is the ca-pacity to extrapolate results temporally and spatially herewe explored the latter posing the following question wouldpredictions of GPP from EML for an EC site agree withEC observations from another site with ldquosimilarrdquo environ-mental conditions First an EML solution was found train-ing 80 of the Santa Rita dataset and obtaining a best-of-family ensemble with 023 deviance out of 634 trainedmodels (hereafter this model is referred to as the Santa Ritamodel) Then the environmental and remotely sensed datafrom the Bernal site were fed into the Santa Rita model thiswould be an external validation dataset However agreementwas not good the mean value of θs was 015plusmn 001 withα =minus10822plusmn 009 and β = 058127plusmn 002 The value ofθs was not constant across the range of measurement and de-creased rapidly after 2 gCmminus2 dminus1 (Fig 10b) Because thebias was important predictions were adjusted using Eq (6)showing some improvement with r = 078 (Fig 10a) Com-paring Figs 5b and 6a it is evident that an EML modelextrapolation to other conditions is noisier ie Santa Ritamodel trying to represent the ecosystem function at BernalNotwithstanding some of the most important variables wereshared by both EML ensembles Bernal and Santa Rita in thecase of Santa Rita LAI from MOD15 MYD15 and MCD15had 350 48 and 31 of variable importance and theFPAR from MOD15 was 12

33 MODIS as predictor of eddy covariance GPP

MODIS is important because it overpasses every point ofthe earth every 1 or 2 d and it implements a GPP prod-uct (MOD17) that has helped track the response of the bio-sphere to the environment since 2000 The product MOD17has been validated against many EC sites but few validationsites correspond to deserts and semideserts (Running et al2004) The GPP MOD17 underestimated the GPP derivedfrom EC data at Bernal (Fig 11a) In a similar bell-shapeddistribution of θs as in the case of the extrapolation of theSanta Rita site (Fig 10b) here the θs was not constant acrossthe range of measurement mean θs was 024plusmn 013 withα = 000047plusmn0087 and β = 048749plusmn002 (Fig 11c) Ad-justing MOD17 estimates with Eq (6) improved the relation-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

378 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 8 Net ecosystem exchange (NEE) and photosynthetic active radiation (PAR) at the Bernal site in (a) 2017 and (b) 2018 Negativevalues in the CO2 flux indicate photosynthesis The gray shadow is the standard error of the mean for each month at any given hour

Table 3 Leader board of EML models for the Bernal site using 85 of day observations as training dataset NA denotes the outcome wherethe type of model was not present in the 30 top-performing models according to deviance

Type of model Number of models Average deviance

All models Top models Leader model

Stacked ensembleAll models 20 0852 0834Best of family 20 0812 0703 0633GBM 453 1366 0796DRF 20 1108 NAXGBoost 277 1385 0778XRT 20 1077 NADeep learning 201 2072 NAGLM 20 1474 NATotal 1031 1356

ship but note that the value of r (076) was the same for theoriginal MOD17 and the adjusted MOD17 (Fig 11b)

34 Prediction eddy covariance GPP withordinary-least-squares multiple regression

OLS is a common estimation method for linear models andhere this model appeared as adequate judging by the gen-eral distribution of predictions (Fig 12a) and the probability-of-agreement plot (Fig 12b) A total of 14 variables wereincluded in the model all of them with VIF values lowerthan 70 (Appendix C) the VIF statistic quantifies the sever-ity of multicollinearity and an acceptable threshold is 10The most significant variables were the EVI from MYD13and Daymet variables precipitation shortwave radiation and

minimum and maximum temperatures (see Appendix B forvariable details) Variables with high coefficient values wereMODIS reflectance band 14 (917) the EVI from MYD13(853) and the NDVI (39) while Daymet temperatures hadsmall coefficientsminus023 for maximum temperature and 017for minimum temperature The θs decreased slightly at theend of the measurement range mean θs was 05plusmn0014 withα = 018845plusmn0137 and β = 094966plusmn0031No correctionfor this model was calculated since α was close to 0 β wasclose to 1 and the EML model had a higher θs

35 Agreement

The probability of agreement (θs) was the statistic to deter-mine if the mean responses were in agreement Comparing

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 379

Figure 9 Agreement between predictions of GPP obtained with machine learning algorithms or derived from eddy covariance measurementsat the Bernal site (a) Example of one run of predictions of GPP in the test dataset from the Bernal site using the leader model of an ensembleof machine learning algorithms (EML) the test dataset was 15 of data with r = 094 (b) Predictions for the complete dataset withr = 099 the diagonal line is the 1 1 agreement (c) Function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreementbetween methods of estimation of GPP corresponding to plot (b) The horizontal axis (s) represents the magnitude of the measurement thevertical axis is the probability of agreement for the measurement and the red line is the confidence interval p lt 005

Figure 10 (a) Adjusted predictions of GPP for the complete dataset from the Bernal site using the leader model of the final ensembleof machine learning algorithms (EML) derived from the Santa Rita site compared to estimates of GPP from EC data with r = 078 (b)Respective function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreement between methods of estimation of GPP ECdata from the Bernal site and EML model for the Santa Rita site The horizontal axis (s) represents the magnitude of the measurement thevertical axis is the probability of agreement for the measurement and the red line is the confidence interval p lt 005

the confidence intervals (CIs) for θs the best modeling ap-proach was the EML because its CI (056ndash059) was differentfrom that of the OLS model (047ndash052) More importantlyfor the EML and OLS models the values of θs had little vari-ation across the range of GPP estimates However the CIsof the EML and OLS models were similar in their bias es-timates α and β and both models had no bias (p lt 005)Altogether the best result was the EML ensemble using en-vironmental and remotely sensed data corresponding to thesame site ie Bernal (Fig 10c) This kind of EML wouldbe useful for gap filling or the evaluation of GPP time se-ries of the site that generated the model Machine learningalgorithms can fill gaps longer than 30 d (Kang et al 2019)

The second option to estimate GPP using the samedatasets was the multiple-regression OLS model (Fig 12a)The multiple regression is straightforward and here multi-collinearity was not a problem The EML ensemble and theOLS regression have the highest values of θs (058 and 05respectively Fig 12b) Higher values for θs are desirable(gt 095) and this could be achieved by increasing the samplesize relaxing the tolerable magnitude for agreement (here itwas set at 1 gCmminus2 dminus1) or perhaps using different forcingvariables

The MODIS estimates were a third-best alternative sincethe mean θs was 024 In the present study we used a splineto fill the data to a daily time step since the MCD17 is an8 d composite product but a similar result was obtained if

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

380 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 11 (a) MOD17 GPP daily time step from the Bernal site versus estimates of GPP from EC data using the complete dataset at adaily time step derived from spline for MOD17 GPP (copy) with r = 076 or 8 d composite estimates obtained by 8 d averages of EC GPP() with r = 076 (b) Adjusted predictions of GPP from MODIS (c) Corresponding function of probability of agreement for Fig 7a using1 g Cmminus2 dminus1 as a tolerable agreement between methods of estimation of GPP EC data from the Bernal site and MODIS MOD17 Thehorizontal axis (s) represents the magnitude of the measurement The vertical axis is the probability of agreement for the measurement

Figure 12 (a) Ordinary-least-squares multiple-regression estimates of GPP for the complete dataset from the Bernal site versus estimates ofGPP from EC data with r = 082 (b) Respective function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreement betweenmethods of estimation of GPP EC data from the Bernal site and OLS multiple regression The horizontal axis (s) represents the magnitudeof the measurement The vertical axis is the probability of agreement for the measurement

the EC GPP was rescaled and compared to the original 8 dMCD17 data (Fig 11a) The GPP from MODIS was an un-derestimate of EC GPP and when the estimates were ad-justed (Eq 6 Fig 11b) the performance was not better thanthe OLS (model not shown) The MODIS land cover classi-fication represented this site as grassland (MCD12) and thiscould be another reason for the poor agreement besides theassumptions made in the MODIS algorithm regarding the εand εmax parameters and the response of vegetation to VPDAgreement of MODIS GPP is crucial because MODIS prod-ucts are frequently used in country-wide assessments of thecarbon cycle and can influence public policies

The model with least agreement resulted when the EMLensemble generated from the Santa Rita site was used to pre-dict GPP at Bernal Machine learning models can make pre-dictions but their usefulness decreases when they are usedoutside the context of where they were built while process-

based mechanistic models have this ability Although the Au-toML function in H2O is designed to protect against over-fitting using cross-validation runs (Michailidis 2018) in ourstudy the Santa Rita model could not be generalized to repre-sent the Bernal site GPP process probably because the vari-ables and features selected for Bernal or Santa Rita were dif-ferent during the AutoML workflow

The Santa Rita model was good at predicting GPP at thatsite with a deviance of the leader model of 023 while atBernal the deviance of the leader model was 063 (Table 3)indicating that the Santa Rita EML ensemble was at leastas good a model as the EML at Bernal (not shown) TheGPP time series for Santa Rita was about 4 times the sizeof the Bernal dataset and therefore the deviance was lowerHowever when the Santa Rita model was used with Bernaldata the mean θs was 016 indicating that the agreement wasinsufficient Eyeballing the predictions in Figs 9a 10b and

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 381

11a and their corresponding correlation values (078 076and 082 for adjusted EML Santa Rita adjusted MODIS andmultiple-regression OLS) it could be argued that these mod-els were comparable However their θs plots present a dif-ferent perspective

4 Discussion

41 Model agreement

The Santa Rita model had higher θs when it extrapolatedfor low GPP values for the Bernal site suggesting that theSanta Rita model had a better skill when predicting GPPclose to 0 and even negative GPP values (Fig 10b) Al-though the Bernal and Santa Rita sites had similar vegeta-tion and climate classifications they are more than 1600 kmapart and rainfall monthly distribution is different The GPPseasonal cycle at Bernal started in February and steadily in-creased to a maximum during July and August (Table 1)At Santa Rita the GPP was low from January until mid-July (lt 05 gCO2 mminus2 dminus1) and then increased sharply to amaximum level in mid-August (Joiner and Yoshida 2020)What matters eventually for machine learning methods ishow well the predictor space rather than geographic spaceis sampled (Jung et al 2020) Incorporating data from morehumid (semiarid) sites could improve the GPP predictions bya machine learning method Not only more EC sites but alsosites representing a water availability gradient would be im-portant for semiarid ecosystems representing long spells andthe influence of oceanic oscillations and monsoon rains

All the models presented here used transient data to rep-resent GPP specifically at the 1 d time step Besides theradiation-related variables two sources of rainfall were usedas forcing variables but evapotranspiration (ET) was notused The MOD16A2 version 6 is available as an 8 d gap-filled product and could be included in EML or OLS mod-els However a more general representation of the carboncycling could be achieved when including variables that rep-resent annual or seasonal timescales of soil water evapo-ration or precipitation (Scott and Biederman 2019) Scottet al (2015) suggest that real lags between precipitation andproductivity that may impart legacy effects may also be par-tially masked by using ET as ET more carefully tracks pro-ductivity when soil moisture storage is accessed The occur-rence of off-season rainfall dry spells and carryover effectscould be parameterized as windowed events of a given dura-tion A window would be a period with distinct time bound-aries the window allows the grouping of records with similarfeatures The effect of the window at a given time could berepresented as moving weights as the point in time in ques-tion is closer or farther from the window

In our study the metric to assess model agreement wasthe probability of agreement (θs) and their bias parame-ters Many other metrics can be used to evaluate model per-

formance such as the root mean square error r r2 or themodel efficiency factor (MEF) presented by Nash and Sut-cliffe (1970) In particular the MEF is a step in the Fluxnetdata processing pipeline (Pastorello et al 2020) A MEFvalue close to 1 represents a high correlation and lower bi-ases (Joiner and Yoshida 2020) Therefore the calculatedMEF for the Bernal site EML (098) would suggest very goodpredictive performance while the θs of 06 for this model in-dicates a more modest performance The θs for a particularvalue of the measurand is the probability that the differencebetween two measurements made by different systems fallswithin an interval that is deemed to be acceptable (Stevenset al 2015) In the present study we used 1 gCmminus2 dminus1 as acritical value defining an acceptable difference If this valueis smaller then the probability of agreement would decreasefor this same model In such a case it would be less likelythat the predictors agree considering that the θs takes intoaccount both the difference in the function means at a givenvalue of the measurand and the uncertainty in its estimation

42 MODIS discrepancies

Different authors have reported discrepancies betweenMCD17 and EC estimates of GPP in semiarid regions Ex-amining MODIS discrepancies in these ecosystems is im-portant because the errors induced by cloud cover are ex-pected to be minimal and other effects can be identified(Gebremichael and Barros 2006) The GPP of MOD17 didnot relate well (EC= 011+ 017MODIS r2

= 067) withestimates of EC GPP in semidesert vegetation of the Sahel(Tagesson et al 2017) With data from different types of veg-etation in the Heihe basin in China MODIS17 overestimatedthe GPP from EC (EC= 115+024MODIS r2

= 068 Cuiet al 2016) For scrub sites in Mexico the relation be-tween GPP calculated from EC and MOD17 was not good(MODIS= 38382+ 0467EC r2

= 06 Delgado-Balbuenaet al 2018) In arid and semiarid ecosystems in China op-timizing parameters of the MODIS GPP model with site-specific data improved the estimate to explain 91 of thevariation in the GPP of the data observed by EC (Wang et al2019) These same authors propose improving the land useclassification used by the MOD17 algorithm and recalibrat-ing light use efficiency parameters to solve the GPP estima-tion problem Gebremichael and Barros (2006) examined anopen shrubland site in a semiarid region of Sonora Mexicoand their analysis of the temporal evolution of the discrep-ancies with MODIS GPP suggested revisiting the light useefficiency parameterization especially the functional depen-dence on VPD and PAR and water stress or soil moistureavailability

The relationship between the GPP MODIS and the GPPEC presented in Sect 33 is an approximation because theuncertainty in the respiration component must be consid-ered The empirical relationship between nocturnal NEE andsoil temperature has been used to represent ecosystem res-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

382 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

piration (Reco) in order to separate the processes that con-tribute to daytime NEE (Richardson and Hollinger 2005Wofsy et al 1993) Nighttime NEE should be equal to therates of autotrophic and heterotrophic respiration while dur-ing daytime NEE should be equal to the combined rates ofcarboxylation and oxidation of Rubisco autotrophic respi-ration and heterotrophic respiration Then the GPP can becalculated as the difference between daytime NEE and Recoestimated through its relationship with temperature (Gouldenet al 1996) In the present study Reco was calculated basedon soil and air temperature following the procedure of Reich-stein et al (2005) implemented in REddyProc (Wutzler et al2018) Although it is possible to measure or model the parti-tion of respiration (Running et al 2004 Wang et al 2018)the presence of cacti complicates the calculation assum-ing that all nighttime flux represents ecosystem respiration(Owen et al 2016 Richardson and Hollinger 2005) Whilesoil respiration tends to be temperature-limited when soilmoisture is nonlimiting in temperate ecosystems in range-land ecosystems the controls of soil CO2 efflux were photo-synthesis soil temperature and moisture (Roby et al 2019)In our study the instrumentation did not include measure-ments of plant or soil respiration partition to validate theRecoestimates

A problem regarding data comparison from remote orbitalsensors and terrestrial observations is that different quanti-ties are fundamentally measured MODIS measures the radi-ation reflected by the earthrsquos surface in two spectral bands at250 m spatial resolution per pixel five bands at 500 m and29 bands at 1000 m The EC technique has a footprint tomeasure CO2 that varies dynamically in shape and size butis generally considered to be 1 km2 in this study the foot-print radius was 600 m To solve the scaling MODIS prod-ucts related to the carbon cycle have been validated with theEC technique and biometric measurements on several spa-tial scales using process-based ecosystem models and char-acterizing areas up to 47 km2 around the EC measuring tower(Cohen et al 2003)

At Bernal the vegetation was heterogeneous This situa-tion was represented by the heterogeneity in vegetation ac-tivity in the 4 pixels used spanning about 02 units of NDVIat the peak of the season activity (Fig 1) The standard errorof the IVI differed by 1 order of magnitude among speciesAlthough more important species had a lower standard errortheir means were similar indicating that they were equallyabundant at all sampling plots The higher importance ofsome species (Table 2) was explained by selective grazingndashbrowsing behavior and the dispersion caused by cattle by ei-ther ingesting or transporting seeds or plant parts (Belaynehand Tessema 2017) Regarding landscape heterogeneity thetower fetch was predominantly from the northeast capturingthe less heterogeneous area of the site but also the more ac-tive according to the NDVI (Fig 1)

The thorny scrub examined had two vegetation layersthe overstory layer mainly consisting of mesquite acacia

and cacti and the understory layer that included grasses andherbs Cattle preferentially graze the understory and becausethey eat using their tongue they will avoid browsing thornyspecies unlike goats or deer which use their lips Withoutgrazing management overtime the competition balance willfavor bush species resulting in encroaching and the under-story will be stressed only unpalatable species or those withtheir growing meristems very close to the ground would sur-vive Representing the structure and functioning of these twolayers using MODIS is possible (Liu et al 2017) RecentlyHill and Guerschman (2020) presented a MODIS product de-rived from MCD43A4 to estimate the fractions of photosyn-thetic and nonphotosynthetic vegetation and the remainingfraction of bare soil These developments could improve theMOD17 GPP estimates since its model represents a homoge-neous single vegetation layer All these considerations helpto understand the low θs between MOD17 estimates of GPPand EC-derived GPP

43 Forcing variables and machine learning

Forcing variables in this study were gridded meteorologi-cal data and MODIS spectral bands and products but thevariables mostly included in the ML algorithms were LAIand FPAR accounting for 364 and 549 of the vari-able importance index at Bernal and Santa Rita respectivelyin their leader board This result supported the view thatCO2 fluxes can be represented by ML algorithms exclu-sively using remotely sensed data (Tramontana et al 2016Joiner and Yoshida 2020) Using neural networks Joiner andYoshida (2020) showed that with only satellite reflectancefrom MODIS and top-of-atmosphere PAR it is possible tocapture a large fraction of GPP variability They argue thatvegetation indices may reduce the information content ofthe underlying reflectance when compressing the informa-tion from two or more bands into a single index In our studyreflectance bands were not often included in the ML modelsThis was most likely because MOD09GA and MODOCGAprovide estimates of surface reflectance uncorrected for theillumination and viewing geometry while the MCD43 usedin Joiner and Yoshida (2020) uses a surface bidirectional re-flectance distribution function (BRDF) model that improvesthe quality of surface albedo retrievals By propagating theBRDF correction in the MODIS processing pipeline the veg-etation indices would more likely relate better to the GPPFuture modeling efforts should use benchmark scenarios ac-cording to sets of forcing variables already identified as use-ful

44 Carbon flux

Although the carbon balance in ecosystems is influenced bydifferent factors such as soil type and amount of nutrientsthe relationship with soil temperature and humidity is partic-ularly strong (Anderson-Teixeira et al 2011 Hastings et al

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 383

2005) How much of the rainwater the system can retain orlose has been described as the leakiness of the system (Guer-schman et al 2009) More than immediate incident rainfallthe available soil moisture and its redistribution are impor-tant in semiarid ecosystems including steam flow prefer-ential flow paths hydraulic lift and others (Barron-Gaffordet al 2017) At Bernal when the sum of the precipitation ofthe current month and that of the previous month was con-sidered the correlation with NEE was minus07 suggesting thatcontinuous availability of soil moisture is important for theabsorption of CO2 in this environment This result is con-sistent with other studies in which the relationship betweenthe net productivity of the ecosystem (NEP) and precipita-tion is initially positive but is leveled from 1000 to 1500 mmannually (Xu et al 2014) The hydraulic redistribution ofwater from moist (deeper) to drier soils through plant rootstended to increase modeled annual ecosystem uptake of CO2this process was identified at US-SMR (Fu et al 2018 Scottet al 2003)

The leakiness is highly dependent upon vegetation frac-tional cover the proportion of the surface occupied by baresoil and vegetation photosynthetically active vegetation andnonphotosynthetically active vegetation such as litter woodand dead biomass (Guerschman et al 2009) All these vege-tation fractions have a water storage capacity and can reducethe amount of effective rainfall available to plant roots It ispossible that the canopy of the bushes completely interceptedthe rainfall in some months because the scrub can interceptup to 20 of the precipitation and its canopy storage ca-pacity is 097 mm (Mastachi-Loza et al 2010) Consideringonly the daily rain events greater than 5 mm the correlationbetween precipitation and NEE rose to minus072 In the presentstudy the interception of rain by vegetation surfaces was notcalculated but the results suggest that it would be impor-tant to explore the relationship between net precipitation andNEE

The average NEE at a global level is minus156plusmn284 gCmminus2 yrminus1 (Baldocchi 2014) The highest frequencyamong sites that measured NEE with EC occurs from minus200to minus300 gCmminus2 yrminus1 but in sites with biometric measure-ments the peak occurs at minus100 gCmminus2 yrminus1 (Xu et al2014) Using the daily averages of Table 2 the averageNEE during the measurement period was minus078 gCmminus2 dminus1

and annually would be minus2835 gCmminus2 yrminus1 This resultwas higher than the annual values of the induced grass-land and scrubland vegetation characterizing the SonoranDesert plains (138 and 130 gCmminus2 yrminus1 Hinojo-Hinojoet al 2019) In New Mexico NEE values measured withEC are between 35ndash50 gCmminus2 yrminus1 in desert grassland and344ndash355 gCmminus2 yrminus1 in mixed coniferous forest (Anderson-Teixeira et al 2011) In a dryer region the sarcocaulescentscrubland of Baja California in Mexico the NEE was minus39and minus52 gCmminus2 yrminus1 in 2002 and 2003 respectively (Hast-ings et al 2005) The NEE measured here was within therange of NEE 03plusmn 02 kgCmminus2 yrminus1 for grasslands and

shrublands in Mexico (Murray-Tortarolo et al 2016) Al-though the measurements of the present study had gaps andwere compared with annual studies we considered that thereported value of C was representative of the main season ofgrowth of this type of scrub

45 Ecosystem management

Overgrazing is an appreciation relative to the grazing produc-tive system where the forage resource is overused in a mixedshrubndashgrass ecosystem such as Bernal it usually refers tothe understory Overgrazing means that the plant regrowth isreadily grazed tillers and root reserves are lost and eventu-ally the plant may die Although the Bernal site was over-grazed the carbon fluxes indicated that the plant communitywas photosynthetically active in both dry and rainy seasonsIt is fair to assume that water in the soil was not limiting forthe deep-rooted bush species and that was the reason whyit was possible to maintain the photosynthetic function dur-ing rainless months However this primary production wouldnot have tangible benefits for ranchersrsquo production systemsince no edible biomass would be produced for the cattleFrom the point of view of carbon capture the system accu-mulated nonlabile biomass that would remain in the systemfor a longer time compared to a grassland ecosystem (al-though it would be necessary to determine the partition ofsaid shrub biomass) However the overgrazing condition af-fects the biomass of the understory roots and consequentlythe carbon pool in the soil

In the short term it can be thought that the estimated neg-ative carbon flows had a favorable effect on the environmen-tal agenda As time passes it is possible that the gaps be-tween the individual shrubs of the overstory expand and thiswould have had an effect on soil erosion It is also possiblethat the water stored in the soil profile used by the bushesgradually decreases to the point of causing drought changesin phenology and advancements of the desertification pro-cess There are many opportunities for ecology conservationand livestock-oriented management these may include con-trolled grazing or propagating native thornless shrub speciesIf ranchers do not identify a benefit in the vegetation thenthey will be tempted to remove it as it occurred at thestudy site Because of its wide coverage and availability theMODIS GPP product accuracy is important in representingthe carbon cycle raising awareness and monitoring advance-ment of environmental decisions

Although we found that EML was a good option for mod-eling the GPP of a site what is really needed to evaluate theperformance of semiarid ecosystems is a spatial representa-tion of the carbon flux This is a problem for an underrepre-sented area regarding instrumented EC towers However theEML could be designed to take into account the explanatoryvariables in a spatiotemporal continuity

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

384 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

5 Conclusions

The best modeling approach was the ensemble of ma-chine learning The second option to estimate GPP was themultiple-regression OLS and the third alternative were theMODIS estimates Machine learning was a good option topredict GPP in the local context where it was generated oth-erwise its performance was not good This was demonstratedwhen using the EML from Santa Rita and trying to predictfluxes at Bernal Nevertheless a machine learning modelwould be useful for gap filling or the evaluation of GPP atthe same site The GPP estimates of a given model can be ad-justed using the bias parameters of the probability of agree-ment to improve the relationship Further research explor-ing the usefulness of EML to predict GPP in other contextscould use EML to calibrate mechanistic model parametershybrid approaches integrating EML and biologically mean-ingful mechanistic models or EML representing the spatialvariability of the landscape

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 385

Appendix A

Table A1 Threshold of velocity friction (ulowast) above which nighttime fluxes were considered valid Estimates were obtained using the Rpackage REddyProc We used only the flux records with ulowast equal to or higher than the corresponding mean ulowast threshold for each yearBound values are a 95 confidence interval NA not applicable

Aggregation method Year Season Mean Lower bound 5 Upper bound 95

ulowast (msminus1)

Single NA NA 0193 0140 0209Year 2017 NA 0194 0128 0227Year 2018 NA 0193 0132 0225Season 2017 2016012 0194 0128 0227Season 2017 2017003 0215 0141 0297Season 2017 2017006 0194 0128 0227Season 2017 2017009 0181 0144 0208Season 2018 2017012 0149 0116 0232Season 2018 2018003 0271 0134 0297Season 2018 2018006 0193 0132 0221

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

386 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix B

Table B1 MODIS reflectance bands and products database

Product name Satellite layer Spatial Temporal Spectralresolution resolution coverage

(d) (nm)

MOD09GA Band 1 500 m 1 620ndash670Band 2 841ndash876Band 3 459ndash479Band 4 545ndash565Band 5 1230ndash1250Band 6 1628ndash1652Band 7 2105ndash2155

MODOCGA Band 8 1 km 405ndash420Band 9 438ndash448Band 10 483ndash493Band 11 526ndash536Band 12 546ndash556Band 13 662ndash672Band 14 673ndash683

MOD17A2H Gross primary production (GPP) 8

Net photosynthesis (PsnNet)

MOD15A2H Fraction of photosynthetically active radiation (Fpar) na

Leaf area index (LAI)

MOD11A2 Land surface temperature and emissivity of day (LST day) 1 km

Land surface temperature and emissivity of night (LST night)

MYD13A1 Enhanced vegetation index (EVI) 500 m 16

MYD15A2H Fraction of photosynthetically active radiation (Fpar) 8

Leaf area index (LAI)

MYD11A2 Land surface temperature and emissivity of day (LST day) 1 km na

Land surface temperature and emissivity of night (LST night)

MCD15A2H Fraction of photosynthetically active radiation (Fpar) 500 m

Leaf area index (LAI)

na not applicable

Table B2 Daily vegetation indexes computed using the MODIS reflectance bands described in Table 1

Index Formula Reference

Simple ratio (SimpleR) SimpleR= Band 2Band 1

(Hill et al 2006)Moisture stress (MoistS) MoistS= Band 6

Band 2

Disease stress index (DSI) DSI= Band 2+Band 4Band 6+Band 1

Redndashgreen ratio index (RGRI) RGRI= Band 1Band 4

Normalized difference vegetation index (NDVI) NDVI= Band 2minusBand 1Band 2+Band 1

Normalized difference water index (NDVI_w) NDVI_w =Band 2minusBand 5Band 2+Band 5

Green leaf index (GLI) GLI= 2lowastBand 11minusBand 14minusBand 92lowastBand 11+Band 14+Band

(Henrich et al 2012)Green atmospherically resistant vegetation (GARI) GARI= Band 5minus(Band 11minus(Band 9minusBand 14))

Band 5minus(Band 11+(Band 9minusBand 14))

Enhanced vegetation index (EVI) EVI= 25lowast Band 2minusBand 1Band 2+(6lowastBand 1)minus(75lowastBand 9)+1

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 387

Table B3 Daymet meteorological database

Variable Spatial resolution Temporal resolution Reference

Precipitation (Dayprc)

1 km Daily (Thornton et al 2017)Shortwave radiation (Daysrad)Maximum air temperature (DayTmax)Minimum air temperature (DayTmin)Water vapor pressure (Dayvp)

Table B4 Precipitation data

Satellite Product name Spatial resolution Temporal resolution

Global Precipitation Measurement (GPM) 3IMERGDF v006 010Daily

Tropical Rainfall Measuring Mission (TRMM) 3B42 v007 025

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

388 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix C

Table C1 Analysis of variance for GGP derived from EC data atthe Bernal site Variable details are described in Appendix B

Model term Coefficient SEa DFb SSc F p VIFd

Regression 14 178048Constant minus1318 648 minus203 004R_08_40542e

minus212 060 1 2680 minus352 0 220R_13_66267e

minus286 123 1 1161 minus232 002 155R_14_67368e 917 259 1 2706 353 0 122RGRI6

minus083 037 1 1100 minus225 003 143GARIf 201 108 1 755 187 006 237EVIMYDg 853 201 1 3903 425 0 294NDVI_wf 390 155 1 1367 251 001 271PsnNetg 0013 0 1 7530 590 0 383LstNgtMYDg 005 002 1 818 194 005 201Dayprch 014 003 1 3590 407 0 192Daysradh 001 0 1 5121 486 0 501DayTmaxh

minus024 006 1 3658 minus411 0 689DayTminh 018 005 1 3222 386 0 496TRMMi minus004 002 1 1087 minus224 003 141Error 403 87297Total 417 265345

a Standard error of coefficient b Degrees of freedom c Adjusted sum of squares d Variance inflation factore Variable name denotes the band number and spectral bandwidth of MODIS (Moderate Resolution ImagingSpectroradiometer) f Vegetation indices RGRI is redndashgreen ratio index and GARI is green atmosphericallyresistant vegetation index details of formula are described in Appendix B g Layers from MODIS productsEVIMYD is enhanced vegetation index from MYD13 PsnNet is photosynthesis product form MOD17LstNgtMYD is nighttime land surface temperature emissivity from MYD11 h Variables obtained fromDaymet daily dataset DayTmin is minimum temperature DayTmax is maximum temperature Daysrad isshortwave radiation Dayprc is precipitation i Daily rainfall rate from 3B42 TRMM (Tropical RainfallMeasuring Mission)

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 389

Code and data availability Database and programming code areavailable at httpsdoiorg105281zenodo3598595 (Guevara-Escobar et al 2020)

Author contributions AGE contributed to conceptualization for-mal analysis investigation methodology software supervisionvalidation visualization and writing of the original draft as wellas review and editing EGS was in charge of funding acquisitionproject administration and writing and review MCJ contributed toinvestigation methodology validation visualization and writing ofthe original draft as well as review editing and communicationMEQB helped with investigation validation writing and reviewHSA contributed with conceptualization writing and review ICAacutehelped with investigation writing and review VHCS contributedwith writing review and funding acquisition

Competing interests The authors declare that they have no conflictof interest

Acknowledgement We thank NASA and Ameriflux for data avail-ability Data filtering was carried out by the cluster Abacus ICentre of Applied Mathematics and High Performance ComputingABACUS-CINVESTAV

Financial support This research has been supported by CONA-CyT ndash SEMARNAT (grant no 2014-1-(249407))

Review statement This paper was edited by Paul Stoy and re-viewed by three anonymous referees

References

Aguirre-Diacuteaz G J Aguilloacuten-Robles A Tristaacuten-Gonzaacutelez MLabarthe-Hernaacutendez G Loacutepez-Martiacutenez M Bellon Hand Nieto-Obregoacuten J Geologic setting of the Pentildeade Bernal Natural Monument Quereacutetaro Meacutexico Anendogenous volcanic dome Geosphere 9 557ndash571httpsdoiorg101130GES008431 2013

Anav A Friedlingstein P Beer C Ciais P Harper AJones C Murray-Tortarolo G Papale D Parazoo N Cand Peylin P Spatiotemporal patterns of terrestrial grossprimary production A review Rev Geophys 53 785ndash818httpsdoiorg1010022015RG000483 2015

Anderson-Teixeira K J Delong J P Fox A M Brese D Aand Litvak M E Differential responses of production and res-piration to temperature and moisture drive the carbon balanceacross a climatic gradient in New Mexico Glob Change Biol17 410ndash424 httpsdoiorg101111j1365-2486201002269x2011

Baldocchi D Measuring fluxes of trace gases and energy be-tween ecosystems and the atmospherendashthe state and future of the

eddy covariance method Glob Change Biol 20 3600ndash3609httpsdoiorg101111gcb12649 2014

Barron-Gafford G A Sanchez-Cantildeete E P Minor R LHendryx S M Lee E Sutter L F Tran N Parra EColella T Murphy P C Hamerlynck E P Kumar P andScott R L Impacts of hydraulic redistribution on grassndashtreecompetition vs facilitation in a semi-arid savanna New Phytol215 1451ndash1461 httpsdoiorg101111nph14693 2017

Belayneh A and Tessema Z K Mechanisms of bush encroach-ment and its inter-connection with rangeland degradation insemi-arid African ecosystems a review J Arid Land 9 299ndash312 httpsdoiorg101007s40333-016-0023-x 2017

Bland J M and Altman D G Statistical methods for assessingagreement between two methods of clinical measurement Int JNurs Stud 47 931ndash936 2010

Boettiger C and Eddelbuettel D An introduction to rockerDocker containers 639 for r ArXiv Prepr ArXiv171003675 6402017

Bonilla-Moheno M and Aide T M Beyond deforestationLand cover transitions in Mexico Agric Syst 178 102734httpsdoiorg101016jagsy2019102734 2020

Booker K Huntsinger L Bartolome J W Sayre N Fand Stewart W What can ecological science tell us aboutopportunities for carbon sequestration on arid rangelands inthe United States Glob Environ Change 23 240ndash251httpsdoiorg101016jgloenvcha201210001 2013

Butzer K W and Butzer E K The lsquonaturalrsquovegetation ofthe Mexican Bajiacuteo archival documentation of a 16th-century savanna environment Quat Int 43 161ndash172httpsdoiorg101016S1040-6182(97)00032-3 1997

CICESE C de I C y de E S de E Base de datos climatoloacutegicanacional (CLICOM) Villa Bernal Quereacutetaro available at httpclicom-mexcicesemx (last access 28 August 2019) 2015

Cohen W B Maiersperger T K Yang Z Gower S TTurner D P Ritts W D Berterretche M and Run-ning S W Comparisons of land cover and LAI esti-mates derived from ETM+ and MODIS for four sites inNorth America A quality assessment of 20002001 provi-sional MODIS products Remote Sens Environ 88 233ndash255httpsdoiorg101016jrse200306006 2003

Cui T Wang Y Sun R Qiao C Fan W Jiang GHao L and Zhang L Estimating vegetation primaryproduction in the Heihe River Basin of China withmulti-source and multi-scale data PloS One 11 1ndash20httpsdoiorg101371journalpone0153971 2016

Curtis J T and McIntosh R P The Interrelations of Certain An-alytic and Synthetic Phytosociological Characters Ecology 31434ndash455 httpsdoiorg1023071931497 1950

Delgado-Balbuena J Yepez E A Aacutengeles-Peacuterez G Aguirre-Gutierrez C Arredondo T Ayala-Nintildeo F Bullock S HCastellanos A E Cueva A Figueroa-Espinoza B Garatuza-Payaacuten J Hinojo-Hinojo C Maya-Delgado Y Meacutendez-Barroso L Oechel W Paz-Pellat F Peacuterez-Ruiacutez E R Ro-driacuteguez J C Rojas-Robles N Saacutenchez-Mejiacutea Z M Uuh-Sonda J Vargas R Verduzco V S Vivoni E R andWatts C Flujos anuales de carbono en ecosistemas terrestres deMeacutexico in Estado Actual del Conocimiento del Ciclo del Car-bono y sus Interacciones en Meacutexico Siacutentesis a 2018 p 686Texcoco Estado de Meacutexico Meacutexico 2018

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

390 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Eshel G Dayalu A Wofsy S C Munger J W andTziperman E Listening to the Forest An ArtificialNeural Network-Based Model of Carbon Uptake at Har-vard Forest J Geophys Res-Biogeo 124 461ndash478httpsdoiorg1010292018JG004791 2019

Fu C Wang G Bible K Goulden M L Saleska S RScott R L and Cardon Z G Hydraulic redistributionaffects modeled carbon cycling via soil microbial activityand suppressed fire Glob Change Biol 24 3472ndash3485httpsdoiorg101111gcb14164 2018

Gebremichael M and Barros A P Evaluation of MODISGross Primary Productivity (GPP) in tropical mon-soon regions Remote Sens Environ 100 150ndash166httpsdoiorg101016jrse200510009 2006

Gitelson A A Peng Y Arkebauer T J and Suyker A EProductivity absorbed photosynthetically active radiation andlight use efficiency in crops Implications for remote sensingof crop primary production J Plant Physiol 177 100ndash109httpsdoiorg101016jjplph201412015 2015

Glenn E P Scott R L Nguyen U and Nagler P L Wide-area ratios of evapotranspiration to precipitation in monsoon-dependent semiarid vegetation communities J Arid Envi-ron 117 84ndash95 httpsdoiorg101016jjaridenv2015020102015

Goldstein A Turner W R Spawn S A Anderson-Teixeira K J Cook-Patton S Fargione J Gibbs H KGriscom B Hewson J H and Howard J F Protecting ir-recoverable carbon in Earthrsquos ecosystems Nat Clim Change10 1ndash9 httpsdoiorg101038s41558-020-0738-8 2020

Goulden M L Munger J W Fan S Daube B Cand Wofsy S C Measurements of carbon sequestra-tion by long-term eddy covariance Methods and a criti-cal evaluation of accuracy Glob Change Biol 2 169ndash182httpsdoiorg101111j1365-24861996tb00070x 1996

Grossiord C Buckley T N Cernusak L A Novick K A Poul-ter B Siegwolf R T Sperry J S and McDowell N GPlant responses to rising vapor pressure deficit New Phytol226 1550ndash1566 2020

Guerschman J P Hill M J Renzullo L J Barrett D JMarks A S and Botha E J Estimating fractional coverof photosynthetic vegetation non-photosynthetic vegetation andbare soil in the Australian tropical savanna region upscaling theEO-1 Hyperion and MODIS sensors Remote Sens Environ113 928ndash945 httpsdoiorg101016jrse200901006 2009

Guevara-Escobar A Gonzaacutelez-Sosa E Cervantes-Jimeacutenez MSuzaacuten-Azpiri H Queijeiro-Bolantildeos M E Carrillo-AacutengelesI and Cambroacuten-Sandoval V H Machine learning estimates ofeddy covariance carbon flux in a scrub in the Mexican highlandZenodo httpsdoiorg105281zenodo3598595 2020

Hall P Gill N Kurka M Phan W and Bartz A MachineLearning Interpretability with H2O Driverless AI edited byBartz A H2Oai Inc California US 2019

Hastings S J Oechel W C and Muhlia-Melo A Diur-nal seasonal and annual variation in the net ecosystem CO2exchange of a desert shrub community (Sarcocaulescent) inBaja California Mexico Glob Change Biol 11 927ndash939httpsdoiorg101111j1365-2486200500951x 2005

Henrich V Krauss G Goumltze C and Sandow C Entwicklungeiner Datenbank fuumlr Fernerkundungsindizes vol 45 AK Fern-erkundung Bochum 2012

Hill M Held A Leuning R Coops N Hughes D andCleugh H MODIS spectral signals at a flux tower site Rela-tionships with high-resolution data and CO2 flux and light useefficiency measurements Remote Sens Environ 103 351ndash368httpsdoiorg101016jrse200506015 2006

Hill M J and Guerschman J P The MODIS Global VegetationFractional Cover Product 2001ndash2018 Characteristics of Vegeta-tion Fractional Cover in Grasslands and Savanna Woodlands Re-mote Sens 12 406 httpsdoiorg103390rs12030406 2020

Hinojo-Hinojo C Castellanos A E Huxman T Ro-driguez J C Vargas R Romo-Leoacuten J R and Bie-derman J A Native shrubland and managed buffelgrasssavanna in drylands Implications for ecosystem carbonand water fluxes Agr Forest Meteorol 268 269ndash278httpsdoiorg101016jagrformet201901030 2019

Jaumlrvi L Havu M Ward H C Bellucco V McFad-den J P Toivonen T Heikinheimo V Kolari P Ri-ikonen A and Grimmond C S B Spatial modeling oflocal-scale biogenic and anthropogenic carbon dioxide emis-sions in Helsinki J Geophys Res-Atmos 124 8363ndash8384httpsdoiorg1010292018JD029576 2019

Joiner J and Yoshida Y Satellite-based reflectances capture largefraction of variability in global gross primary production (GPP)at weekly time scales Agr Forest Meteorol 291 108092httpsdoiorg101016jagrformet2020108092 2020

Jung M Schwalm C Migliavacca M Walther S Camps-VallsG Koirala S Anthoni P Besnard S Bodesheim P Carval-hais N Chevallier F Gans F Goll D S Haverd V KoumlhlerP Ichii K Jain A K Liu J Lombardozzi D Nabel J EM S Nelson J A OrsquoSullivan M Pallandt M Papale DPeters W Pongratz J Roumldenbeck C Sitch S TramontanaG Walker A Weber U and Reichstein M Scaling carbonfluxes from eddy covariance sites to globe synthesis and eval-uation of the FLUXCOM approach Biogeosciences 17 1343ndash1365 httpsdoiorg105194bg-17-1343-2020 2020

Kang M Ichii K Kim J Indrawati Y M Park J Moon MLim J-H and Chun J-H New Gap-Filling Strategies forLong-Period Flux Data Gaps Using a Data-Driven ApproachAtmosphere 10 568 httpsdoiorg103390atmos101005682019

Lal R Griffin M Apt J Lave L and Morgan M GManaging Soil Carbon Science 304 393ndash393httpsdoiorg101126science1093079 2004

LI-COR B Eddypro Software instruction manual Nebraska USversion 70 2019

Liu Y Liu R Pisek J and Chen J M Separating overstoryand understory leaf area indices for global needleleaf and de-ciduous broadleaf forests by fusion of MODIS and MISR dataBiogeosciences 14 1093ndash1110 httpsdoiorg105194bg-14-1093-2017 2017

Long S P Marshall-Colon A and Zhu X-G Meetingthe Global Food Demand of the Future by EngineeringCrop Photosynthesis and Yield Potential Cell 161 56ndash66httpsdoiorg101016jcell201503019 2015

Ma X Huete A Yu Q Restrepo-Coupe N Beringer J Hut-ley L B Kanniah K D Cleverly J and Eamus D Param-

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 391

eterization of an ecosystem light-use-efficiency model for pre-dicting savanna GPP using MODIS EVI Remote Sens Environ154 253ndash271 2014

Ma X Migliavacca M Wirth C Bohn F J Huth ARichter R and Mahecha M D Monitoring Plant FunctionalDiversity Using the Reflectance and Echo from Space RemoteSens 12 1248 httpsdoiorg103390rs12081248 2020

Marcolla B Roumldenbeck C and Cescatti A Patterns and con-trols of inter-annual variability in the terrestrial carbon budgetBiogeosciences 14 3815ndash3829 httpsdoiorg105194bg-14-3815-2017 2017

Mastachi-Loza C A Gonzaacutelez-Sosa E Becerril-Pintildea R andBraud I Peacuterdidas por intercepcioacuten en mezquite (Prosopis laevi-gata) y huizache (Acacia farnesiana) de la regioacuten semiaacuterida delcentro de Meacutexico Tecnol Cienc Agua 1 103ndash120 2010

Mauder M and Foken T Documentation and instruction man-ual of the eddy-covariance software package TK3 UniversitaumltBayreuth Abteilung Mikrometeorologie Bayreuth Germany2011

McClaran M A century of vegetation change on the Santa Rita Ex-perimental Range in Santa Rita Experimental Range 100 years(1903 to 2003) to accomplishments and contributions edited byMcClaran M P Ffolliott P F and Edminister C B Tech Co-ords USDA Forest service proceedings RMRS-P-30 TucsonAZ US 16ndash33 2003

McCulley R Jobbagy E Pockman W and Jackson R Nu-trient uptake as a contributing explanation for deep rootingin arid and semi-arid ecosystems Oecologia 141 620ndash628httpsdoiorg101007s00442-004-1687-z 2004

McElreath R Statistical Rethinking A Bayesian Course with Ex-amples in R and STAN CRC Press Boca Raton FL US 2020

Michailidis M How Driverless AI Prevents Overfitting and Leak-age Open Source Lead AI ML available at httpswwwh2oaiblogdriverless-ai-prevents-overfitting-leakage (last access7 July 2020) 2018

Mitchell R and Frank E Accelerating the XGBoost algo-rithm using GPU computing PeerJ Comput Sci 3 e127httpsdoiorg107717peerj-cs127 2017

Monteith J Solar radiation and productivity in tropical ecosys-tems J Appl Ecol 9 747ndash766 1972

Murray-Tortarolo G Friedlingstein P Sitch S Jaramillo V JMurguiacutea-Flores F Anav A Liu Y Arneth A Arvanitis AHarper A Jain A Kato E Koven C Poulter B Stocker BD Wiltshire A Zaehle S and Zeng N The carbon cycle inMexico past present and future of C stocks and fluxes Biogeo-sciences 13 223ndash238 httpsdoiorg105194bg-13-223-20162016

Nash J E and Sutcliffe J V River flow forecasting through con-ceptual models part I ndash A discussion of principles J Hydrol 10282ndash290 1970

Owen N A Choncubhair Oacute N Males J del Real Laborde J IRubio-Corteacutes R Griffiths H and Lanigan G Eddy covari-ance captures four-phase crassulacean acid metabolism (CAM)gas exchange signature in Agave Plant Cell Environ 39 295ndash309 httpsdoiorg101111pce12610 2016

Pasetto D Arenas-Castro S Bustamante J Casagrandi RChrysoulakis N Cord A F Dittrich A Domingo-Marimon C El Serafy G and Karnieli A Integration ofsatellite remote sensing data in ecosystem modelling at local

scales Practices and trends Methods Ecol Evol 9 1810ndash1821httpsdoiorg1011112041-210X13018 2018

Pastorello G Trotta C and Canfora E The FLUXNET2015dataset and the ONEFlux processing pipeline for eddy covari-ance data Sci Data 7 225 pp httpsdoiorg101038s41597-020-0534-3 2020

Pearcy R W Heterogeneous light environments in Handbookof Functional Plant Ecology edited by Pugnaire I F and Val-ladares F CRC Press Boca Ratoacuten FL US 270ndash314 2007

Piao S Liu Q Chen A Janssens I A Fu Y Dai J Liu LLian X Shen M and Zhu X Plant phenology and global cli-mate change Current progresses and challenges Glob ChangeBiol 25 1922ndash1940 httpsdoiorg101111gcb14619 2019

R Development Core Team R A language and environment forstatistical computing R Foundation for Statistical Computingavailable at httpswwwr-projectorg (last access 15 October2015) 2009

Reichstein M Falge E Baldocchi D Papale D Aubinet MBerbigier P Bernhofer C Buchmann N Gilmanov Tand Granier A On the separation of net ecosystem ex-change into assimilation and ecosystem respiration reviewand improved algorithm Glob Change Biol 11 1424ndash1439httpsdoiorg101111j1365-24862005001002x 2005

Richardson A D and Hollinger D Y Statistical model-ing of ecosystem respiration using eddy covariance datamaximum likelihood parameter estimation and Monte Carlosimulation of model and parameter uncertainty applied tothree simple models Agr Forest Meteorol 131 191ndash208httpsdoiorg101016jagrformet200505008 2005

Richardson A D Hollinger D Y Shoemaker J K Hughes HSavage K and Davidson E A Six years of ecosystem-atmosphere greenhouse gas fluxes measured in a sub-borealforest Sci Data 6 1ndash15 httpsdoiorg101038s41597-019-0119-1 2019

Roby M C Scott R L Barron-Gafford G A Hamerlynck E Pand Moore D J P Environmental and Vegetative Controls onSoil CO2 Efflux in Three Semiarid Ecosystems Soil Syst 3 6httpsdoiorg103390soilsystems3010006 2019

Running S W and Zhao M Daily GPP and annual NPP(MOD17A2A3) products NASA Earth Observing SystemMODIS land algorithm Univ of Mont Missoula Mont US2015

Running S W Nemani R R Heinsch F A Zhao MReeves M and Hashimoto H A continuous satellite-derived measure of global terrestrial primary productionBioscience 54 547ndash560 httpsdoiorg1016410006-3568(2004)054[0547ACSMOG]20CO2 2004

Rzedowski J Vegetacioacuten de Meacutexico Meacutexico Editor Limusa1978

Scott R L and Biederman J A Critical Zone Water Balance Over13 Years in a Semiarid Savanna Water Resour Res 55 574ndash588 httpsdoiorg1010292018WR023477 2019

Scott R L Watts C Payan J G Edwards E Goodrich D CWilliams D and James Shuttleworth W The understory andoverstory partitioning of energy and water fluxes in an opencanopy semiarid woodland Agr Forest Meteorol 114 127ndash139 httpsdoiorg101016S0168-1923(02)00197-1 2003

Scott R L Edwards E A Shuttleworth W J Huxman T EWatts C and Goodrich D C Interannual and seasonal vari-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

392 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

ation in fluxes of water and carbon dioxide from a ripar-ian woodland ecosystem Agr Forest Meteorol 122 65ndash84httpsdoiorg101016jagrformet200309001 2004

Scott R L Biederman J A Hamerlynck E P and Barron-Gafford G A The carbon balance pivot point of southwest-ern US semiarid ecosystems Insights from the 21st centurydrought J Geophys Res-Biogeo 120 2612ndash2624 2015

Segerstrom K Geology of the Bernal-Jalpan Area Estado deQueretaro Mexico Geol Surv Bull 1104 19ndash82 1961

Soper F M McCalley C K Sparks K and Sparks J P Soilcarbon dioxide emissions from the Mojave desert Isotopic evi-dence for a carbonate source Geophys Res Lett 44 245ndash251httpsdoiorg1010022016GL071198 2017

Stevens N T Steiner S H and MacKay R J Assessing agree-ment between two measurement systems An alternative to thelimits of agreement approach Stat Methods Med Res 262487ndash2504 httpsdoiorg1011770962280215601133 2015

Tagesson T Ardouml J Cappelaere B Kergoat L AbdiA Horion S and Fensholt R Modelling spatial andtemporal dynamics of gross primary production in theSahel from earth-observation-based photosynthetic capacityand quantum efficiency Biogeosciences 14 1333ndash1348httpsdoiorg105194bg-14-1333-2017 2017

Thornton P Thornton M Mayer B Wei Y Devarakonda RVose R and Cook R Daymet Daily Surface Weather Data ona 1-km Grid for North America Version 3 Oak Ridge NationalLab (ORNL) Oak Ridge TN US 2017

Tramontana G Jung M Schwalm C R Ichii K Camps-VallsG Raacuteduly B Reichstein M Arain M A Cescatti A KielyG Merbold L Serrano-Ortiz P Sickert S Wolf S andPapale D Predicting carbon dioxide and energy fluxes acrossglobal FLUXNET sites with regression algorithms Biogeo-sciences 13 4291ndash4313 httpsdoiorg105194bg-13-4291-2016 2016rdquo

Truong A Walters A Goodsitt J Hines K Bruss C B andFarivar R Towards automated machine learning Evaluationand comparison of AutoML approaches and tools in Proceed-ings of the 2019 IEEE 31st International Conference on Toolswith Artificial Intelligence (ICTAI) Portland OR US 1471ndash1479 2019

Verbist K Santibantildeez F Gabriels D and Soto G Atlas dezonas aacuteridas de Ameacuterica Latina y El Caribe Donoso MC(Coord) UNESCO-PHI and CAZALAC Montevideo Uruguay2010

Wang H Li X Ma M and Geng L Improving esti-mation of gross primary production in dryland ecosystemsby a model-data fusion approach Remote Sens 11 225httpsdoiorg103390rs11030225 2019

Wang X Ma M Li X Song Y Tan J Huang G Zhang ZZhao T Feng J and Ma Z Validation of MODIS-GPP prod-uct at 10 flux sites in northern China Int J Remote Sens 34587ndash599 httpsdoiorg101080014311612012715774 2013

Wang X Wang H Li X and Ran Y Photosynthesis (NPPNEP Respiration) Obs Meas Ecohydrol Process 1ndash30httpsdoiorg101007978-3-662-47871-4_10-1 2018

Wheeler C W Archer S R Asner G P and McMurtry C RClimaticedaphic controls on soil carbonnitrogen response toshrub encroachment in desert grassland Ecol Appl 17 1911ndash1928 httpsdoiorg10189006-15801 2007

Wilcox B P Birt A Fuhlendorf S D and Archer S R Emerg-ing frameworks for understanding and mitigating woody plantencroachment in grassy biomes Curr Opin Env Sust 32 46ndash52 httpsdoiorg101016jcosust201804005 2018

Wilson J B Cover plus ways of measuring plant canopiesand the terms used for them J Veg Sci 22 197ndash206httpsdoiorg101111j1654-1103201001238x 2011

Wofsy S Goulden M Munger J Fan S-M Bakwin PDaube B Bassow S and Bazzaz F Net exchange ofCO2 in a mid-latitude forest Science 260 1314ndash1317httpsdoiorg101126science26051121314 1993

Wu C Niu Z and Gao S Gross primary production estimationfrom MODIS data with vegetation index and photosyntheticallyactive radiation in maize J Geophys Res-Atmos 115 D12127httpsdoiorg1010292009JD013023 2010

Wutzler T Lucas-Moffat A Migliavacca M Knauer JSickel K Šigut L Menzer O and Reichstein MBasic and extensible post-processing of eddy covarianceflux data with REddyProc Biogeosciences 15 5015ndash5030httpsdoiorg105194bg-15-5015-2018 2018

Xiao H McDonald-Madden E Sabbadin R Peyrard NDee L E and Chadegraves I The value of understanding feed-backs from ecosystem functions to species for managing ecosys-tems Nat Commun 10 1ndash10 httpsdoiorg101038s41467-019-11890-7 2019

Xu B Yang Y Li P Shen H and Fang J Globalpatterns of ecosystem carbon flux in forests A biometricdata-based synthesis Global Biogeochem Cy 28 962ndash973httpsdoiorg1010022013GB004593 2014

Yao J Liu H Huang J Gao Z Wang G Li D Yu H andChen X Accelerated dryland expansion regulates future vari-ability in dryland gross primary production Nat Commun 111ndash10 httpsdoiorg101038s41467-020-15515-2 2020

Yepez E A Williams D G Scott R L and Lin GPartitioning overstory and understory evapotranspiration ina semiarid savanna woodland from the isotopic composi-tion of water vapor Agr Forest Meteorol 119 53ndash68httpsdoiorg101016S0168-1923(03)00116-3 2003

Yona L Cashore B Jackson R B Ometto J and Brad-ford M A Refining national greenhouse gas inventories Am-bio 1ndash6 httpsdoiorg101007s13280-019-01312-9 2020

Zeng X Zeng X and Barlage M Growing temperate shrubsover arid and semiarid regions in the Community Land ModelndashDynamic Global Vegetation Model Global Biogeochem Cy 22GB3003 httpsdoiorg1010292007GB003014 2008

Zhang A Jia G Epstein H E and Xia J ENSO elicits oppos-ing responses of semi-arid vegetation between Hemispheres SciRep 7 42281 httpsdoiorg101038srep42281 2017

Zhang L Xiao J Zheng Y Li S and Zhou Y Increased carbonuptake and water use efficiency in global semi-arid ecosystemsEnviron Res Lett 15 034022 httpsdoiorg1010881748-9326ab68ec 2020

Zhu X-G Long S P and Ort D R What is the max-imum efficiency with which photosynthesis can convert so-lar energy into biomass Curr Opin Biotech 19 153ndash159httpsdoiorg101016jcopbio200802004 2008

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

  • Abstract
  • Introduction
  • Materials and methods
    • Site description
    • Eddy covariance measurements
    • Flux data processing
    • Remotely sensed data
    • MODIS algorithm for GPP
    • Santa Rita site dataset
    • Modeling
    • Variable importance
    • Model agreement
      • Results
        • Eddy covariance fluxes at Bernal
        • Machine learning ensembles as predictors of eddy covariance GPP
        • MODIS as predictor of eddy covariance GPP
        • Prediction eddy covariance GPP with ordinary-least-squares multiple regression
        • Agreement
          • Discussion
            • Model agreement
            • MODIS discrepancies
            • Forcing variables and machine learning
            • Carbon flux
            • Ecosystem management
              • Conclusions
              • Appendix A
              • Appendix B
              • Appendix C
              • Code and data availability
              • Author contributions
              • Competing interests
              • Acknowledgement
              • Financial support
              • Review statement
              • References
Page 11: Machine learning estimates of eddy covariance carbon flux ... · Scrub is a xeric category of shrublands characterized by plants with small leaves; it is very thorny, and its biomass

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 377

Figure 7 Monthly rainfall and land surface temperature (LST) dur-ing years (a) 2017 and (b) 2018 at the Bernal site The LST valuescorrespond to the 1330 (LST day) and 0130 (LST night) MODISAqua satellite overpasses

Predictions of GPP in the testing dataset showed disper-sion in the lower range of the scale of measurement andthe correlation was 094 (Fig 9a) The final prediction ofGPP for the whole dataset had a probability of agreement(θs) of 058plusmn 001 (parameter estimate and standard error)α =minus00616plusmn 011 and β = 10133plusmn 002 suggesting agood fit with low fixed and proportional bias (Fig 9b) Theprobability of agreement decreased slightly at the lower andupper range of the scale of measurement (Fig 9c) indicatingthat the EML model would predict GPP without increasingthe bias particularly in the range from 0 to 4 gCmminus2 dminus1However the value of θs should be higher than 095 so as toconsider EC measurements and EML to be interchangeableThe correlation of 099 (r2

= 098) for the data in Fig 9bcould be misleading as it would suggest a very good fit

Using only the five of the more important variables namedabove to generate an EML resulted in an XBoost leadermodel with 273 deviance and a total of 1094 models Thetop 30 models were 16 XBoost and GBM models and the

best-of-family ensembles started to show up in 12th placeAlthough the number of runs was the same (20) the AutoMLfunction increased the number of produced models but thesmaller set of explanatory variables constrained the ability toidentify features contributing to better models Using anotherset of five randomly selected explanatory variables (one veg-etation index and four MODIS bands) resulted in a leadermodel with 252 deviance out of 1024 models but this timethe leader was the best-of-the-family ensemble Using only afew variables was considered to increase the deviance com-pared to the average deviance of 135 obtained during thetraining phase and using all available variables

An important question for modeling upscaling is the ca-pacity to extrapolate results temporally and spatially herewe explored the latter posing the following question wouldpredictions of GPP from EML for an EC site agree withEC observations from another site with ldquosimilarrdquo environ-mental conditions First an EML solution was found train-ing 80 of the Santa Rita dataset and obtaining a best-of-family ensemble with 023 deviance out of 634 trainedmodels (hereafter this model is referred to as the Santa Ritamodel) Then the environmental and remotely sensed datafrom the Bernal site were fed into the Santa Rita model thiswould be an external validation dataset However agreementwas not good the mean value of θs was 015plusmn 001 withα =minus10822plusmn 009 and β = 058127plusmn 002 The value ofθs was not constant across the range of measurement and de-creased rapidly after 2 gCmminus2 dminus1 (Fig 10b) Because thebias was important predictions were adjusted using Eq (6)showing some improvement with r = 078 (Fig 10a) Com-paring Figs 5b and 6a it is evident that an EML modelextrapolation to other conditions is noisier ie Santa Ritamodel trying to represent the ecosystem function at BernalNotwithstanding some of the most important variables wereshared by both EML ensembles Bernal and Santa Rita in thecase of Santa Rita LAI from MOD15 MYD15 and MCD15had 350 48 and 31 of variable importance and theFPAR from MOD15 was 12

33 MODIS as predictor of eddy covariance GPP

MODIS is important because it overpasses every point ofthe earth every 1 or 2 d and it implements a GPP prod-uct (MOD17) that has helped track the response of the bio-sphere to the environment since 2000 The product MOD17has been validated against many EC sites but few validationsites correspond to deserts and semideserts (Running et al2004) The GPP MOD17 underestimated the GPP derivedfrom EC data at Bernal (Fig 11a) In a similar bell-shapeddistribution of θs as in the case of the extrapolation of theSanta Rita site (Fig 10b) here the θs was not constant acrossthe range of measurement mean θs was 024plusmn 013 withα = 000047plusmn0087 and β = 048749plusmn002 (Fig 11c) Ad-justing MOD17 estimates with Eq (6) improved the relation-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

378 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 8 Net ecosystem exchange (NEE) and photosynthetic active radiation (PAR) at the Bernal site in (a) 2017 and (b) 2018 Negativevalues in the CO2 flux indicate photosynthesis The gray shadow is the standard error of the mean for each month at any given hour

Table 3 Leader board of EML models for the Bernal site using 85 of day observations as training dataset NA denotes the outcome wherethe type of model was not present in the 30 top-performing models according to deviance

Type of model Number of models Average deviance

All models Top models Leader model

Stacked ensembleAll models 20 0852 0834Best of family 20 0812 0703 0633GBM 453 1366 0796DRF 20 1108 NAXGBoost 277 1385 0778XRT 20 1077 NADeep learning 201 2072 NAGLM 20 1474 NATotal 1031 1356

ship but note that the value of r (076) was the same for theoriginal MOD17 and the adjusted MOD17 (Fig 11b)

34 Prediction eddy covariance GPP withordinary-least-squares multiple regression

OLS is a common estimation method for linear models andhere this model appeared as adequate judging by the gen-eral distribution of predictions (Fig 12a) and the probability-of-agreement plot (Fig 12b) A total of 14 variables wereincluded in the model all of them with VIF values lowerthan 70 (Appendix C) the VIF statistic quantifies the sever-ity of multicollinearity and an acceptable threshold is 10The most significant variables were the EVI from MYD13and Daymet variables precipitation shortwave radiation and

minimum and maximum temperatures (see Appendix B forvariable details) Variables with high coefficient values wereMODIS reflectance band 14 (917) the EVI from MYD13(853) and the NDVI (39) while Daymet temperatures hadsmall coefficientsminus023 for maximum temperature and 017for minimum temperature The θs decreased slightly at theend of the measurement range mean θs was 05plusmn0014 withα = 018845plusmn0137 and β = 094966plusmn0031No correctionfor this model was calculated since α was close to 0 β wasclose to 1 and the EML model had a higher θs

35 Agreement

The probability of agreement (θs) was the statistic to deter-mine if the mean responses were in agreement Comparing

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 379

Figure 9 Agreement between predictions of GPP obtained with machine learning algorithms or derived from eddy covariance measurementsat the Bernal site (a) Example of one run of predictions of GPP in the test dataset from the Bernal site using the leader model of an ensembleof machine learning algorithms (EML) the test dataset was 15 of data with r = 094 (b) Predictions for the complete dataset withr = 099 the diagonal line is the 1 1 agreement (c) Function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreementbetween methods of estimation of GPP corresponding to plot (b) The horizontal axis (s) represents the magnitude of the measurement thevertical axis is the probability of agreement for the measurement and the red line is the confidence interval p lt 005

Figure 10 (a) Adjusted predictions of GPP for the complete dataset from the Bernal site using the leader model of the final ensembleof machine learning algorithms (EML) derived from the Santa Rita site compared to estimates of GPP from EC data with r = 078 (b)Respective function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreement between methods of estimation of GPP ECdata from the Bernal site and EML model for the Santa Rita site The horizontal axis (s) represents the magnitude of the measurement thevertical axis is the probability of agreement for the measurement and the red line is the confidence interval p lt 005

the confidence intervals (CIs) for θs the best modeling ap-proach was the EML because its CI (056ndash059) was differentfrom that of the OLS model (047ndash052) More importantlyfor the EML and OLS models the values of θs had little vari-ation across the range of GPP estimates However the CIsof the EML and OLS models were similar in their bias es-timates α and β and both models had no bias (p lt 005)Altogether the best result was the EML ensemble using en-vironmental and remotely sensed data corresponding to thesame site ie Bernal (Fig 10c) This kind of EML wouldbe useful for gap filling or the evaluation of GPP time se-ries of the site that generated the model Machine learningalgorithms can fill gaps longer than 30 d (Kang et al 2019)

The second option to estimate GPP using the samedatasets was the multiple-regression OLS model (Fig 12a)The multiple regression is straightforward and here multi-collinearity was not a problem The EML ensemble and theOLS regression have the highest values of θs (058 and 05respectively Fig 12b) Higher values for θs are desirable(gt 095) and this could be achieved by increasing the samplesize relaxing the tolerable magnitude for agreement (here itwas set at 1 gCmminus2 dminus1) or perhaps using different forcingvariables

The MODIS estimates were a third-best alternative sincethe mean θs was 024 In the present study we used a splineto fill the data to a daily time step since the MCD17 is an8 d composite product but a similar result was obtained if

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

380 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 11 (a) MOD17 GPP daily time step from the Bernal site versus estimates of GPP from EC data using the complete dataset at adaily time step derived from spline for MOD17 GPP (copy) with r = 076 or 8 d composite estimates obtained by 8 d averages of EC GPP() with r = 076 (b) Adjusted predictions of GPP from MODIS (c) Corresponding function of probability of agreement for Fig 7a using1 g Cmminus2 dminus1 as a tolerable agreement between methods of estimation of GPP EC data from the Bernal site and MODIS MOD17 Thehorizontal axis (s) represents the magnitude of the measurement The vertical axis is the probability of agreement for the measurement

Figure 12 (a) Ordinary-least-squares multiple-regression estimates of GPP for the complete dataset from the Bernal site versus estimates ofGPP from EC data with r = 082 (b) Respective function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreement betweenmethods of estimation of GPP EC data from the Bernal site and OLS multiple regression The horizontal axis (s) represents the magnitudeof the measurement The vertical axis is the probability of agreement for the measurement

the EC GPP was rescaled and compared to the original 8 dMCD17 data (Fig 11a) The GPP from MODIS was an un-derestimate of EC GPP and when the estimates were ad-justed (Eq 6 Fig 11b) the performance was not better thanthe OLS (model not shown) The MODIS land cover classi-fication represented this site as grassland (MCD12) and thiscould be another reason for the poor agreement besides theassumptions made in the MODIS algorithm regarding the εand εmax parameters and the response of vegetation to VPDAgreement of MODIS GPP is crucial because MODIS prod-ucts are frequently used in country-wide assessments of thecarbon cycle and can influence public policies

The model with least agreement resulted when the EMLensemble generated from the Santa Rita site was used to pre-dict GPP at Bernal Machine learning models can make pre-dictions but their usefulness decreases when they are usedoutside the context of where they were built while process-

based mechanistic models have this ability Although the Au-toML function in H2O is designed to protect against over-fitting using cross-validation runs (Michailidis 2018) in ourstudy the Santa Rita model could not be generalized to repre-sent the Bernal site GPP process probably because the vari-ables and features selected for Bernal or Santa Rita were dif-ferent during the AutoML workflow

The Santa Rita model was good at predicting GPP at thatsite with a deviance of the leader model of 023 while atBernal the deviance of the leader model was 063 (Table 3)indicating that the Santa Rita EML ensemble was at leastas good a model as the EML at Bernal (not shown) TheGPP time series for Santa Rita was about 4 times the sizeof the Bernal dataset and therefore the deviance was lowerHowever when the Santa Rita model was used with Bernaldata the mean θs was 016 indicating that the agreement wasinsufficient Eyeballing the predictions in Figs 9a 10b and

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 381

11a and their corresponding correlation values (078 076and 082 for adjusted EML Santa Rita adjusted MODIS andmultiple-regression OLS) it could be argued that these mod-els were comparable However their θs plots present a dif-ferent perspective

4 Discussion

41 Model agreement

The Santa Rita model had higher θs when it extrapolatedfor low GPP values for the Bernal site suggesting that theSanta Rita model had a better skill when predicting GPPclose to 0 and even negative GPP values (Fig 10b) Al-though the Bernal and Santa Rita sites had similar vegeta-tion and climate classifications they are more than 1600 kmapart and rainfall monthly distribution is different The GPPseasonal cycle at Bernal started in February and steadily in-creased to a maximum during July and August (Table 1)At Santa Rita the GPP was low from January until mid-July (lt 05 gCO2 mminus2 dminus1) and then increased sharply to amaximum level in mid-August (Joiner and Yoshida 2020)What matters eventually for machine learning methods ishow well the predictor space rather than geographic spaceis sampled (Jung et al 2020) Incorporating data from morehumid (semiarid) sites could improve the GPP predictions bya machine learning method Not only more EC sites but alsosites representing a water availability gradient would be im-portant for semiarid ecosystems representing long spells andthe influence of oceanic oscillations and monsoon rains

All the models presented here used transient data to rep-resent GPP specifically at the 1 d time step Besides theradiation-related variables two sources of rainfall were usedas forcing variables but evapotranspiration (ET) was notused The MOD16A2 version 6 is available as an 8 d gap-filled product and could be included in EML or OLS mod-els However a more general representation of the carboncycling could be achieved when including variables that rep-resent annual or seasonal timescales of soil water evapo-ration or precipitation (Scott and Biederman 2019) Scottet al (2015) suggest that real lags between precipitation andproductivity that may impart legacy effects may also be par-tially masked by using ET as ET more carefully tracks pro-ductivity when soil moisture storage is accessed The occur-rence of off-season rainfall dry spells and carryover effectscould be parameterized as windowed events of a given dura-tion A window would be a period with distinct time bound-aries the window allows the grouping of records with similarfeatures The effect of the window at a given time could berepresented as moving weights as the point in time in ques-tion is closer or farther from the window

In our study the metric to assess model agreement wasthe probability of agreement (θs) and their bias parame-ters Many other metrics can be used to evaluate model per-

formance such as the root mean square error r r2 or themodel efficiency factor (MEF) presented by Nash and Sut-cliffe (1970) In particular the MEF is a step in the Fluxnetdata processing pipeline (Pastorello et al 2020) A MEFvalue close to 1 represents a high correlation and lower bi-ases (Joiner and Yoshida 2020) Therefore the calculatedMEF for the Bernal site EML (098) would suggest very goodpredictive performance while the θs of 06 for this model in-dicates a more modest performance The θs for a particularvalue of the measurand is the probability that the differencebetween two measurements made by different systems fallswithin an interval that is deemed to be acceptable (Stevenset al 2015) In the present study we used 1 gCmminus2 dminus1 as acritical value defining an acceptable difference If this valueis smaller then the probability of agreement would decreasefor this same model In such a case it would be less likelythat the predictors agree considering that the θs takes intoaccount both the difference in the function means at a givenvalue of the measurand and the uncertainty in its estimation

42 MODIS discrepancies

Different authors have reported discrepancies betweenMCD17 and EC estimates of GPP in semiarid regions Ex-amining MODIS discrepancies in these ecosystems is im-portant because the errors induced by cloud cover are ex-pected to be minimal and other effects can be identified(Gebremichael and Barros 2006) The GPP of MOD17 didnot relate well (EC= 011+ 017MODIS r2

= 067) withestimates of EC GPP in semidesert vegetation of the Sahel(Tagesson et al 2017) With data from different types of veg-etation in the Heihe basin in China MODIS17 overestimatedthe GPP from EC (EC= 115+024MODIS r2

= 068 Cuiet al 2016) For scrub sites in Mexico the relation be-tween GPP calculated from EC and MOD17 was not good(MODIS= 38382+ 0467EC r2

= 06 Delgado-Balbuenaet al 2018) In arid and semiarid ecosystems in China op-timizing parameters of the MODIS GPP model with site-specific data improved the estimate to explain 91 of thevariation in the GPP of the data observed by EC (Wang et al2019) These same authors propose improving the land useclassification used by the MOD17 algorithm and recalibrat-ing light use efficiency parameters to solve the GPP estima-tion problem Gebremichael and Barros (2006) examined anopen shrubland site in a semiarid region of Sonora Mexicoand their analysis of the temporal evolution of the discrep-ancies with MODIS GPP suggested revisiting the light useefficiency parameterization especially the functional depen-dence on VPD and PAR and water stress or soil moistureavailability

The relationship between the GPP MODIS and the GPPEC presented in Sect 33 is an approximation because theuncertainty in the respiration component must be consid-ered The empirical relationship between nocturnal NEE andsoil temperature has been used to represent ecosystem res-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

382 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

piration (Reco) in order to separate the processes that con-tribute to daytime NEE (Richardson and Hollinger 2005Wofsy et al 1993) Nighttime NEE should be equal to therates of autotrophic and heterotrophic respiration while dur-ing daytime NEE should be equal to the combined rates ofcarboxylation and oxidation of Rubisco autotrophic respi-ration and heterotrophic respiration Then the GPP can becalculated as the difference between daytime NEE and Recoestimated through its relationship with temperature (Gouldenet al 1996) In the present study Reco was calculated basedon soil and air temperature following the procedure of Reich-stein et al (2005) implemented in REddyProc (Wutzler et al2018) Although it is possible to measure or model the parti-tion of respiration (Running et al 2004 Wang et al 2018)the presence of cacti complicates the calculation assum-ing that all nighttime flux represents ecosystem respiration(Owen et al 2016 Richardson and Hollinger 2005) Whilesoil respiration tends to be temperature-limited when soilmoisture is nonlimiting in temperate ecosystems in range-land ecosystems the controls of soil CO2 efflux were photo-synthesis soil temperature and moisture (Roby et al 2019)In our study the instrumentation did not include measure-ments of plant or soil respiration partition to validate theRecoestimates

A problem regarding data comparison from remote orbitalsensors and terrestrial observations is that different quanti-ties are fundamentally measured MODIS measures the radi-ation reflected by the earthrsquos surface in two spectral bands at250 m spatial resolution per pixel five bands at 500 m and29 bands at 1000 m The EC technique has a footprint tomeasure CO2 that varies dynamically in shape and size butis generally considered to be 1 km2 in this study the foot-print radius was 600 m To solve the scaling MODIS prod-ucts related to the carbon cycle have been validated with theEC technique and biometric measurements on several spa-tial scales using process-based ecosystem models and char-acterizing areas up to 47 km2 around the EC measuring tower(Cohen et al 2003)

At Bernal the vegetation was heterogeneous This situa-tion was represented by the heterogeneity in vegetation ac-tivity in the 4 pixels used spanning about 02 units of NDVIat the peak of the season activity (Fig 1) The standard errorof the IVI differed by 1 order of magnitude among speciesAlthough more important species had a lower standard errortheir means were similar indicating that they were equallyabundant at all sampling plots The higher importance ofsome species (Table 2) was explained by selective grazingndashbrowsing behavior and the dispersion caused by cattle by ei-ther ingesting or transporting seeds or plant parts (Belaynehand Tessema 2017) Regarding landscape heterogeneity thetower fetch was predominantly from the northeast capturingthe less heterogeneous area of the site but also the more ac-tive according to the NDVI (Fig 1)

The thorny scrub examined had two vegetation layersthe overstory layer mainly consisting of mesquite acacia

and cacti and the understory layer that included grasses andherbs Cattle preferentially graze the understory and becausethey eat using their tongue they will avoid browsing thornyspecies unlike goats or deer which use their lips Withoutgrazing management overtime the competition balance willfavor bush species resulting in encroaching and the under-story will be stressed only unpalatable species or those withtheir growing meristems very close to the ground would sur-vive Representing the structure and functioning of these twolayers using MODIS is possible (Liu et al 2017) RecentlyHill and Guerschman (2020) presented a MODIS product de-rived from MCD43A4 to estimate the fractions of photosyn-thetic and nonphotosynthetic vegetation and the remainingfraction of bare soil These developments could improve theMOD17 GPP estimates since its model represents a homoge-neous single vegetation layer All these considerations helpto understand the low θs between MOD17 estimates of GPPand EC-derived GPP

43 Forcing variables and machine learning

Forcing variables in this study were gridded meteorologi-cal data and MODIS spectral bands and products but thevariables mostly included in the ML algorithms were LAIand FPAR accounting for 364 and 549 of the vari-able importance index at Bernal and Santa Rita respectivelyin their leader board This result supported the view thatCO2 fluxes can be represented by ML algorithms exclu-sively using remotely sensed data (Tramontana et al 2016Joiner and Yoshida 2020) Using neural networks Joiner andYoshida (2020) showed that with only satellite reflectancefrom MODIS and top-of-atmosphere PAR it is possible tocapture a large fraction of GPP variability They argue thatvegetation indices may reduce the information content ofthe underlying reflectance when compressing the informa-tion from two or more bands into a single index In our studyreflectance bands were not often included in the ML modelsThis was most likely because MOD09GA and MODOCGAprovide estimates of surface reflectance uncorrected for theillumination and viewing geometry while the MCD43 usedin Joiner and Yoshida (2020) uses a surface bidirectional re-flectance distribution function (BRDF) model that improvesthe quality of surface albedo retrievals By propagating theBRDF correction in the MODIS processing pipeline the veg-etation indices would more likely relate better to the GPPFuture modeling efforts should use benchmark scenarios ac-cording to sets of forcing variables already identified as use-ful

44 Carbon flux

Although the carbon balance in ecosystems is influenced bydifferent factors such as soil type and amount of nutrientsthe relationship with soil temperature and humidity is partic-ularly strong (Anderson-Teixeira et al 2011 Hastings et al

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 383

2005) How much of the rainwater the system can retain orlose has been described as the leakiness of the system (Guer-schman et al 2009) More than immediate incident rainfallthe available soil moisture and its redistribution are impor-tant in semiarid ecosystems including steam flow prefer-ential flow paths hydraulic lift and others (Barron-Gaffordet al 2017) At Bernal when the sum of the precipitation ofthe current month and that of the previous month was con-sidered the correlation with NEE was minus07 suggesting thatcontinuous availability of soil moisture is important for theabsorption of CO2 in this environment This result is con-sistent with other studies in which the relationship betweenthe net productivity of the ecosystem (NEP) and precipita-tion is initially positive but is leveled from 1000 to 1500 mmannually (Xu et al 2014) The hydraulic redistribution ofwater from moist (deeper) to drier soils through plant rootstended to increase modeled annual ecosystem uptake of CO2this process was identified at US-SMR (Fu et al 2018 Scottet al 2003)

The leakiness is highly dependent upon vegetation frac-tional cover the proportion of the surface occupied by baresoil and vegetation photosynthetically active vegetation andnonphotosynthetically active vegetation such as litter woodand dead biomass (Guerschman et al 2009) All these vege-tation fractions have a water storage capacity and can reducethe amount of effective rainfall available to plant roots It ispossible that the canopy of the bushes completely interceptedthe rainfall in some months because the scrub can interceptup to 20 of the precipitation and its canopy storage ca-pacity is 097 mm (Mastachi-Loza et al 2010) Consideringonly the daily rain events greater than 5 mm the correlationbetween precipitation and NEE rose to minus072 In the presentstudy the interception of rain by vegetation surfaces was notcalculated but the results suggest that it would be impor-tant to explore the relationship between net precipitation andNEE

The average NEE at a global level is minus156plusmn284 gCmminus2 yrminus1 (Baldocchi 2014) The highest frequencyamong sites that measured NEE with EC occurs from minus200to minus300 gCmminus2 yrminus1 but in sites with biometric measure-ments the peak occurs at minus100 gCmminus2 yrminus1 (Xu et al2014) Using the daily averages of Table 2 the averageNEE during the measurement period was minus078 gCmminus2 dminus1

and annually would be minus2835 gCmminus2 yrminus1 This resultwas higher than the annual values of the induced grass-land and scrubland vegetation characterizing the SonoranDesert plains (138 and 130 gCmminus2 yrminus1 Hinojo-Hinojoet al 2019) In New Mexico NEE values measured withEC are between 35ndash50 gCmminus2 yrminus1 in desert grassland and344ndash355 gCmminus2 yrminus1 in mixed coniferous forest (Anderson-Teixeira et al 2011) In a dryer region the sarcocaulescentscrubland of Baja California in Mexico the NEE was minus39and minus52 gCmminus2 yrminus1 in 2002 and 2003 respectively (Hast-ings et al 2005) The NEE measured here was within therange of NEE 03plusmn 02 kgCmminus2 yrminus1 for grasslands and

shrublands in Mexico (Murray-Tortarolo et al 2016) Al-though the measurements of the present study had gaps andwere compared with annual studies we considered that thereported value of C was representative of the main season ofgrowth of this type of scrub

45 Ecosystem management

Overgrazing is an appreciation relative to the grazing produc-tive system where the forage resource is overused in a mixedshrubndashgrass ecosystem such as Bernal it usually refers tothe understory Overgrazing means that the plant regrowth isreadily grazed tillers and root reserves are lost and eventu-ally the plant may die Although the Bernal site was over-grazed the carbon fluxes indicated that the plant communitywas photosynthetically active in both dry and rainy seasonsIt is fair to assume that water in the soil was not limiting forthe deep-rooted bush species and that was the reason whyit was possible to maintain the photosynthetic function dur-ing rainless months However this primary production wouldnot have tangible benefits for ranchersrsquo production systemsince no edible biomass would be produced for the cattleFrom the point of view of carbon capture the system accu-mulated nonlabile biomass that would remain in the systemfor a longer time compared to a grassland ecosystem (al-though it would be necessary to determine the partition ofsaid shrub biomass) However the overgrazing condition af-fects the biomass of the understory roots and consequentlythe carbon pool in the soil

In the short term it can be thought that the estimated neg-ative carbon flows had a favorable effect on the environmen-tal agenda As time passes it is possible that the gaps be-tween the individual shrubs of the overstory expand and thiswould have had an effect on soil erosion It is also possiblethat the water stored in the soil profile used by the bushesgradually decreases to the point of causing drought changesin phenology and advancements of the desertification pro-cess There are many opportunities for ecology conservationand livestock-oriented management these may include con-trolled grazing or propagating native thornless shrub speciesIf ranchers do not identify a benefit in the vegetation thenthey will be tempted to remove it as it occurred at thestudy site Because of its wide coverage and availability theMODIS GPP product accuracy is important in representingthe carbon cycle raising awareness and monitoring advance-ment of environmental decisions

Although we found that EML was a good option for mod-eling the GPP of a site what is really needed to evaluate theperformance of semiarid ecosystems is a spatial representa-tion of the carbon flux This is a problem for an underrepre-sented area regarding instrumented EC towers However theEML could be designed to take into account the explanatoryvariables in a spatiotemporal continuity

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

384 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

5 Conclusions

The best modeling approach was the ensemble of ma-chine learning The second option to estimate GPP was themultiple-regression OLS and the third alternative were theMODIS estimates Machine learning was a good option topredict GPP in the local context where it was generated oth-erwise its performance was not good This was demonstratedwhen using the EML from Santa Rita and trying to predictfluxes at Bernal Nevertheless a machine learning modelwould be useful for gap filling or the evaluation of GPP atthe same site The GPP estimates of a given model can be ad-justed using the bias parameters of the probability of agree-ment to improve the relationship Further research explor-ing the usefulness of EML to predict GPP in other contextscould use EML to calibrate mechanistic model parametershybrid approaches integrating EML and biologically mean-ingful mechanistic models or EML representing the spatialvariability of the landscape

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 385

Appendix A

Table A1 Threshold of velocity friction (ulowast) above which nighttime fluxes were considered valid Estimates were obtained using the Rpackage REddyProc We used only the flux records with ulowast equal to or higher than the corresponding mean ulowast threshold for each yearBound values are a 95 confidence interval NA not applicable

Aggregation method Year Season Mean Lower bound 5 Upper bound 95

ulowast (msminus1)

Single NA NA 0193 0140 0209Year 2017 NA 0194 0128 0227Year 2018 NA 0193 0132 0225Season 2017 2016012 0194 0128 0227Season 2017 2017003 0215 0141 0297Season 2017 2017006 0194 0128 0227Season 2017 2017009 0181 0144 0208Season 2018 2017012 0149 0116 0232Season 2018 2018003 0271 0134 0297Season 2018 2018006 0193 0132 0221

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

386 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix B

Table B1 MODIS reflectance bands and products database

Product name Satellite layer Spatial Temporal Spectralresolution resolution coverage

(d) (nm)

MOD09GA Band 1 500 m 1 620ndash670Band 2 841ndash876Band 3 459ndash479Band 4 545ndash565Band 5 1230ndash1250Band 6 1628ndash1652Band 7 2105ndash2155

MODOCGA Band 8 1 km 405ndash420Band 9 438ndash448Band 10 483ndash493Band 11 526ndash536Band 12 546ndash556Band 13 662ndash672Band 14 673ndash683

MOD17A2H Gross primary production (GPP) 8

Net photosynthesis (PsnNet)

MOD15A2H Fraction of photosynthetically active radiation (Fpar) na

Leaf area index (LAI)

MOD11A2 Land surface temperature and emissivity of day (LST day) 1 km

Land surface temperature and emissivity of night (LST night)

MYD13A1 Enhanced vegetation index (EVI) 500 m 16

MYD15A2H Fraction of photosynthetically active radiation (Fpar) 8

Leaf area index (LAI)

MYD11A2 Land surface temperature and emissivity of day (LST day) 1 km na

Land surface temperature and emissivity of night (LST night)

MCD15A2H Fraction of photosynthetically active radiation (Fpar) 500 m

Leaf area index (LAI)

na not applicable

Table B2 Daily vegetation indexes computed using the MODIS reflectance bands described in Table 1

Index Formula Reference

Simple ratio (SimpleR) SimpleR= Band 2Band 1

(Hill et al 2006)Moisture stress (MoistS) MoistS= Band 6

Band 2

Disease stress index (DSI) DSI= Band 2+Band 4Band 6+Band 1

Redndashgreen ratio index (RGRI) RGRI= Band 1Band 4

Normalized difference vegetation index (NDVI) NDVI= Band 2minusBand 1Band 2+Band 1

Normalized difference water index (NDVI_w) NDVI_w =Band 2minusBand 5Band 2+Band 5

Green leaf index (GLI) GLI= 2lowastBand 11minusBand 14minusBand 92lowastBand 11+Band 14+Band

(Henrich et al 2012)Green atmospherically resistant vegetation (GARI) GARI= Band 5minus(Band 11minus(Band 9minusBand 14))

Band 5minus(Band 11+(Band 9minusBand 14))

Enhanced vegetation index (EVI) EVI= 25lowast Band 2minusBand 1Band 2+(6lowastBand 1)minus(75lowastBand 9)+1

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 387

Table B3 Daymet meteorological database

Variable Spatial resolution Temporal resolution Reference

Precipitation (Dayprc)

1 km Daily (Thornton et al 2017)Shortwave radiation (Daysrad)Maximum air temperature (DayTmax)Minimum air temperature (DayTmin)Water vapor pressure (Dayvp)

Table B4 Precipitation data

Satellite Product name Spatial resolution Temporal resolution

Global Precipitation Measurement (GPM) 3IMERGDF v006 010Daily

Tropical Rainfall Measuring Mission (TRMM) 3B42 v007 025

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

388 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix C

Table C1 Analysis of variance for GGP derived from EC data atthe Bernal site Variable details are described in Appendix B

Model term Coefficient SEa DFb SSc F p VIFd

Regression 14 178048Constant minus1318 648 minus203 004R_08_40542e

minus212 060 1 2680 minus352 0 220R_13_66267e

minus286 123 1 1161 minus232 002 155R_14_67368e 917 259 1 2706 353 0 122RGRI6

minus083 037 1 1100 minus225 003 143GARIf 201 108 1 755 187 006 237EVIMYDg 853 201 1 3903 425 0 294NDVI_wf 390 155 1 1367 251 001 271PsnNetg 0013 0 1 7530 590 0 383LstNgtMYDg 005 002 1 818 194 005 201Dayprch 014 003 1 3590 407 0 192Daysradh 001 0 1 5121 486 0 501DayTmaxh

minus024 006 1 3658 minus411 0 689DayTminh 018 005 1 3222 386 0 496TRMMi minus004 002 1 1087 minus224 003 141Error 403 87297Total 417 265345

a Standard error of coefficient b Degrees of freedom c Adjusted sum of squares d Variance inflation factore Variable name denotes the band number and spectral bandwidth of MODIS (Moderate Resolution ImagingSpectroradiometer) f Vegetation indices RGRI is redndashgreen ratio index and GARI is green atmosphericallyresistant vegetation index details of formula are described in Appendix B g Layers from MODIS productsEVIMYD is enhanced vegetation index from MYD13 PsnNet is photosynthesis product form MOD17LstNgtMYD is nighttime land surface temperature emissivity from MYD11 h Variables obtained fromDaymet daily dataset DayTmin is minimum temperature DayTmax is maximum temperature Daysrad isshortwave radiation Dayprc is precipitation i Daily rainfall rate from 3B42 TRMM (Tropical RainfallMeasuring Mission)

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 389

Code and data availability Database and programming code areavailable at httpsdoiorg105281zenodo3598595 (Guevara-Escobar et al 2020)

Author contributions AGE contributed to conceptualization for-mal analysis investigation methodology software supervisionvalidation visualization and writing of the original draft as wellas review and editing EGS was in charge of funding acquisitionproject administration and writing and review MCJ contributed toinvestigation methodology validation visualization and writing ofthe original draft as well as review editing and communicationMEQB helped with investigation validation writing and reviewHSA contributed with conceptualization writing and review ICAacutehelped with investigation writing and review VHCS contributedwith writing review and funding acquisition

Competing interests The authors declare that they have no conflictof interest

Acknowledgement We thank NASA and Ameriflux for data avail-ability Data filtering was carried out by the cluster Abacus ICentre of Applied Mathematics and High Performance ComputingABACUS-CINVESTAV

Financial support This research has been supported by CONA-CyT ndash SEMARNAT (grant no 2014-1-(249407))

Review statement This paper was edited by Paul Stoy and re-viewed by three anonymous referees

References

Aguirre-Diacuteaz G J Aguilloacuten-Robles A Tristaacuten-Gonzaacutelez MLabarthe-Hernaacutendez G Loacutepez-Martiacutenez M Bellon Hand Nieto-Obregoacuten J Geologic setting of the Pentildeade Bernal Natural Monument Quereacutetaro Meacutexico Anendogenous volcanic dome Geosphere 9 557ndash571httpsdoiorg101130GES008431 2013

Anav A Friedlingstein P Beer C Ciais P Harper AJones C Murray-Tortarolo G Papale D Parazoo N Cand Peylin P Spatiotemporal patterns of terrestrial grossprimary production A review Rev Geophys 53 785ndash818httpsdoiorg1010022015RG000483 2015

Anderson-Teixeira K J Delong J P Fox A M Brese D Aand Litvak M E Differential responses of production and res-piration to temperature and moisture drive the carbon balanceacross a climatic gradient in New Mexico Glob Change Biol17 410ndash424 httpsdoiorg101111j1365-2486201002269x2011

Baldocchi D Measuring fluxes of trace gases and energy be-tween ecosystems and the atmospherendashthe state and future of the

eddy covariance method Glob Change Biol 20 3600ndash3609httpsdoiorg101111gcb12649 2014

Barron-Gafford G A Sanchez-Cantildeete E P Minor R LHendryx S M Lee E Sutter L F Tran N Parra EColella T Murphy P C Hamerlynck E P Kumar P andScott R L Impacts of hydraulic redistribution on grassndashtreecompetition vs facilitation in a semi-arid savanna New Phytol215 1451ndash1461 httpsdoiorg101111nph14693 2017

Belayneh A and Tessema Z K Mechanisms of bush encroach-ment and its inter-connection with rangeland degradation insemi-arid African ecosystems a review J Arid Land 9 299ndash312 httpsdoiorg101007s40333-016-0023-x 2017

Bland J M and Altman D G Statistical methods for assessingagreement between two methods of clinical measurement Int JNurs Stud 47 931ndash936 2010

Boettiger C and Eddelbuettel D An introduction to rockerDocker containers 639 for r ArXiv Prepr ArXiv171003675 6402017

Bonilla-Moheno M and Aide T M Beyond deforestationLand cover transitions in Mexico Agric Syst 178 102734httpsdoiorg101016jagsy2019102734 2020

Booker K Huntsinger L Bartolome J W Sayre N Fand Stewart W What can ecological science tell us aboutopportunities for carbon sequestration on arid rangelands inthe United States Glob Environ Change 23 240ndash251httpsdoiorg101016jgloenvcha201210001 2013

Butzer K W and Butzer E K The lsquonaturalrsquovegetation ofthe Mexican Bajiacuteo archival documentation of a 16th-century savanna environment Quat Int 43 161ndash172httpsdoiorg101016S1040-6182(97)00032-3 1997

CICESE C de I C y de E S de E Base de datos climatoloacutegicanacional (CLICOM) Villa Bernal Quereacutetaro available at httpclicom-mexcicesemx (last access 28 August 2019) 2015

Cohen W B Maiersperger T K Yang Z Gower S TTurner D P Ritts W D Berterretche M and Run-ning S W Comparisons of land cover and LAI esti-mates derived from ETM+ and MODIS for four sites inNorth America A quality assessment of 20002001 provi-sional MODIS products Remote Sens Environ 88 233ndash255httpsdoiorg101016jrse200306006 2003

Cui T Wang Y Sun R Qiao C Fan W Jiang GHao L and Zhang L Estimating vegetation primaryproduction in the Heihe River Basin of China withmulti-source and multi-scale data PloS One 11 1ndash20httpsdoiorg101371journalpone0153971 2016

Curtis J T and McIntosh R P The Interrelations of Certain An-alytic and Synthetic Phytosociological Characters Ecology 31434ndash455 httpsdoiorg1023071931497 1950

Delgado-Balbuena J Yepez E A Aacutengeles-Peacuterez G Aguirre-Gutierrez C Arredondo T Ayala-Nintildeo F Bullock S HCastellanos A E Cueva A Figueroa-Espinoza B Garatuza-Payaacuten J Hinojo-Hinojo C Maya-Delgado Y Meacutendez-Barroso L Oechel W Paz-Pellat F Peacuterez-Ruiacutez E R Ro-driacuteguez J C Rojas-Robles N Saacutenchez-Mejiacutea Z M Uuh-Sonda J Vargas R Verduzco V S Vivoni E R andWatts C Flujos anuales de carbono en ecosistemas terrestres deMeacutexico in Estado Actual del Conocimiento del Ciclo del Car-bono y sus Interacciones en Meacutexico Siacutentesis a 2018 p 686Texcoco Estado de Meacutexico Meacutexico 2018

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

390 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Eshel G Dayalu A Wofsy S C Munger J W andTziperman E Listening to the Forest An ArtificialNeural Network-Based Model of Carbon Uptake at Har-vard Forest J Geophys Res-Biogeo 124 461ndash478httpsdoiorg1010292018JG004791 2019

Fu C Wang G Bible K Goulden M L Saleska S RScott R L and Cardon Z G Hydraulic redistributionaffects modeled carbon cycling via soil microbial activityand suppressed fire Glob Change Biol 24 3472ndash3485httpsdoiorg101111gcb14164 2018

Gebremichael M and Barros A P Evaluation of MODISGross Primary Productivity (GPP) in tropical mon-soon regions Remote Sens Environ 100 150ndash166httpsdoiorg101016jrse200510009 2006

Gitelson A A Peng Y Arkebauer T J and Suyker A EProductivity absorbed photosynthetically active radiation andlight use efficiency in crops Implications for remote sensingof crop primary production J Plant Physiol 177 100ndash109httpsdoiorg101016jjplph201412015 2015

Glenn E P Scott R L Nguyen U and Nagler P L Wide-area ratios of evapotranspiration to precipitation in monsoon-dependent semiarid vegetation communities J Arid Envi-ron 117 84ndash95 httpsdoiorg101016jjaridenv2015020102015

Goldstein A Turner W R Spawn S A Anderson-Teixeira K J Cook-Patton S Fargione J Gibbs H KGriscom B Hewson J H and Howard J F Protecting ir-recoverable carbon in Earthrsquos ecosystems Nat Clim Change10 1ndash9 httpsdoiorg101038s41558-020-0738-8 2020

Goulden M L Munger J W Fan S Daube B Cand Wofsy S C Measurements of carbon sequestra-tion by long-term eddy covariance Methods and a criti-cal evaluation of accuracy Glob Change Biol 2 169ndash182httpsdoiorg101111j1365-24861996tb00070x 1996

Grossiord C Buckley T N Cernusak L A Novick K A Poul-ter B Siegwolf R T Sperry J S and McDowell N GPlant responses to rising vapor pressure deficit New Phytol226 1550ndash1566 2020

Guerschman J P Hill M J Renzullo L J Barrett D JMarks A S and Botha E J Estimating fractional coverof photosynthetic vegetation non-photosynthetic vegetation andbare soil in the Australian tropical savanna region upscaling theEO-1 Hyperion and MODIS sensors Remote Sens Environ113 928ndash945 httpsdoiorg101016jrse200901006 2009

Guevara-Escobar A Gonzaacutelez-Sosa E Cervantes-Jimeacutenez MSuzaacuten-Azpiri H Queijeiro-Bolantildeos M E Carrillo-AacutengelesI and Cambroacuten-Sandoval V H Machine learning estimates ofeddy covariance carbon flux in a scrub in the Mexican highlandZenodo httpsdoiorg105281zenodo3598595 2020

Hall P Gill N Kurka M Phan W and Bartz A MachineLearning Interpretability with H2O Driverless AI edited byBartz A H2Oai Inc California US 2019

Hastings S J Oechel W C and Muhlia-Melo A Diur-nal seasonal and annual variation in the net ecosystem CO2exchange of a desert shrub community (Sarcocaulescent) inBaja California Mexico Glob Change Biol 11 927ndash939httpsdoiorg101111j1365-2486200500951x 2005

Henrich V Krauss G Goumltze C and Sandow C Entwicklungeiner Datenbank fuumlr Fernerkundungsindizes vol 45 AK Fern-erkundung Bochum 2012

Hill M Held A Leuning R Coops N Hughes D andCleugh H MODIS spectral signals at a flux tower site Rela-tionships with high-resolution data and CO2 flux and light useefficiency measurements Remote Sens Environ 103 351ndash368httpsdoiorg101016jrse200506015 2006

Hill M J and Guerschman J P The MODIS Global VegetationFractional Cover Product 2001ndash2018 Characteristics of Vegeta-tion Fractional Cover in Grasslands and Savanna Woodlands Re-mote Sens 12 406 httpsdoiorg103390rs12030406 2020

Hinojo-Hinojo C Castellanos A E Huxman T Ro-driguez J C Vargas R Romo-Leoacuten J R and Bie-derman J A Native shrubland and managed buffelgrasssavanna in drylands Implications for ecosystem carbonand water fluxes Agr Forest Meteorol 268 269ndash278httpsdoiorg101016jagrformet201901030 2019

Jaumlrvi L Havu M Ward H C Bellucco V McFad-den J P Toivonen T Heikinheimo V Kolari P Ri-ikonen A and Grimmond C S B Spatial modeling oflocal-scale biogenic and anthropogenic carbon dioxide emis-sions in Helsinki J Geophys Res-Atmos 124 8363ndash8384httpsdoiorg1010292018JD029576 2019

Joiner J and Yoshida Y Satellite-based reflectances capture largefraction of variability in global gross primary production (GPP)at weekly time scales Agr Forest Meteorol 291 108092httpsdoiorg101016jagrformet2020108092 2020

Jung M Schwalm C Migliavacca M Walther S Camps-VallsG Koirala S Anthoni P Besnard S Bodesheim P Carval-hais N Chevallier F Gans F Goll D S Haverd V KoumlhlerP Ichii K Jain A K Liu J Lombardozzi D Nabel J EM S Nelson J A OrsquoSullivan M Pallandt M Papale DPeters W Pongratz J Roumldenbeck C Sitch S TramontanaG Walker A Weber U and Reichstein M Scaling carbonfluxes from eddy covariance sites to globe synthesis and eval-uation of the FLUXCOM approach Biogeosciences 17 1343ndash1365 httpsdoiorg105194bg-17-1343-2020 2020

Kang M Ichii K Kim J Indrawati Y M Park J Moon MLim J-H and Chun J-H New Gap-Filling Strategies forLong-Period Flux Data Gaps Using a Data-Driven ApproachAtmosphere 10 568 httpsdoiorg103390atmos101005682019

Lal R Griffin M Apt J Lave L and Morgan M GManaging Soil Carbon Science 304 393ndash393httpsdoiorg101126science1093079 2004

LI-COR B Eddypro Software instruction manual Nebraska USversion 70 2019

Liu Y Liu R Pisek J and Chen J M Separating overstoryand understory leaf area indices for global needleleaf and de-ciduous broadleaf forests by fusion of MODIS and MISR dataBiogeosciences 14 1093ndash1110 httpsdoiorg105194bg-14-1093-2017 2017

Long S P Marshall-Colon A and Zhu X-G Meetingthe Global Food Demand of the Future by EngineeringCrop Photosynthesis and Yield Potential Cell 161 56ndash66httpsdoiorg101016jcell201503019 2015

Ma X Huete A Yu Q Restrepo-Coupe N Beringer J Hut-ley L B Kanniah K D Cleverly J and Eamus D Param-

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 391

eterization of an ecosystem light-use-efficiency model for pre-dicting savanna GPP using MODIS EVI Remote Sens Environ154 253ndash271 2014

Ma X Migliavacca M Wirth C Bohn F J Huth ARichter R and Mahecha M D Monitoring Plant FunctionalDiversity Using the Reflectance and Echo from Space RemoteSens 12 1248 httpsdoiorg103390rs12081248 2020

Marcolla B Roumldenbeck C and Cescatti A Patterns and con-trols of inter-annual variability in the terrestrial carbon budgetBiogeosciences 14 3815ndash3829 httpsdoiorg105194bg-14-3815-2017 2017

Mastachi-Loza C A Gonzaacutelez-Sosa E Becerril-Pintildea R andBraud I Peacuterdidas por intercepcioacuten en mezquite (Prosopis laevi-gata) y huizache (Acacia farnesiana) de la regioacuten semiaacuterida delcentro de Meacutexico Tecnol Cienc Agua 1 103ndash120 2010

Mauder M and Foken T Documentation and instruction man-ual of the eddy-covariance software package TK3 UniversitaumltBayreuth Abteilung Mikrometeorologie Bayreuth Germany2011

McClaran M A century of vegetation change on the Santa Rita Ex-perimental Range in Santa Rita Experimental Range 100 years(1903 to 2003) to accomplishments and contributions edited byMcClaran M P Ffolliott P F and Edminister C B Tech Co-ords USDA Forest service proceedings RMRS-P-30 TucsonAZ US 16ndash33 2003

McCulley R Jobbagy E Pockman W and Jackson R Nu-trient uptake as a contributing explanation for deep rootingin arid and semi-arid ecosystems Oecologia 141 620ndash628httpsdoiorg101007s00442-004-1687-z 2004

McElreath R Statistical Rethinking A Bayesian Course with Ex-amples in R and STAN CRC Press Boca Raton FL US 2020

Michailidis M How Driverless AI Prevents Overfitting and Leak-age Open Source Lead AI ML available at httpswwwh2oaiblogdriverless-ai-prevents-overfitting-leakage (last access7 July 2020) 2018

Mitchell R and Frank E Accelerating the XGBoost algo-rithm using GPU computing PeerJ Comput Sci 3 e127httpsdoiorg107717peerj-cs127 2017

Monteith J Solar radiation and productivity in tropical ecosys-tems J Appl Ecol 9 747ndash766 1972

Murray-Tortarolo G Friedlingstein P Sitch S Jaramillo V JMurguiacutea-Flores F Anav A Liu Y Arneth A Arvanitis AHarper A Jain A Kato E Koven C Poulter B Stocker BD Wiltshire A Zaehle S and Zeng N The carbon cycle inMexico past present and future of C stocks and fluxes Biogeo-sciences 13 223ndash238 httpsdoiorg105194bg-13-223-20162016

Nash J E and Sutcliffe J V River flow forecasting through con-ceptual models part I ndash A discussion of principles J Hydrol 10282ndash290 1970

Owen N A Choncubhair Oacute N Males J del Real Laborde J IRubio-Corteacutes R Griffiths H and Lanigan G Eddy covari-ance captures four-phase crassulacean acid metabolism (CAM)gas exchange signature in Agave Plant Cell Environ 39 295ndash309 httpsdoiorg101111pce12610 2016

Pasetto D Arenas-Castro S Bustamante J Casagrandi RChrysoulakis N Cord A F Dittrich A Domingo-Marimon C El Serafy G and Karnieli A Integration ofsatellite remote sensing data in ecosystem modelling at local

scales Practices and trends Methods Ecol Evol 9 1810ndash1821httpsdoiorg1011112041-210X13018 2018

Pastorello G Trotta C and Canfora E The FLUXNET2015dataset and the ONEFlux processing pipeline for eddy covari-ance data Sci Data 7 225 pp httpsdoiorg101038s41597-020-0534-3 2020

Pearcy R W Heterogeneous light environments in Handbookof Functional Plant Ecology edited by Pugnaire I F and Val-ladares F CRC Press Boca Ratoacuten FL US 270ndash314 2007

Piao S Liu Q Chen A Janssens I A Fu Y Dai J Liu LLian X Shen M and Zhu X Plant phenology and global cli-mate change Current progresses and challenges Glob ChangeBiol 25 1922ndash1940 httpsdoiorg101111gcb14619 2019

R Development Core Team R A language and environment forstatistical computing R Foundation for Statistical Computingavailable at httpswwwr-projectorg (last access 15 October2015) 2009

Reichstein M Falge E Baldocchi D Papale D Aubinet MBerbigier P Bernhofer C Buchmann N Gilmanov Tand Granier A On the separation of net ecosystem ex-change into assimilation and ecosystem respiration reviewand improved algorithm Glob Change Biol 11 1424ndash1439httpsdoiorg101111j1365-24862005001002x 2005

Richardson A D and Hollinger D Y Statistical model-ing of ecosystem respiration using eddy covariance datamaximum likelihood parameter estimation and Monte Carlosimulation of model and parameter uncertainty applied tothree simple models Agr Forest Meteorol 131 191ndash208httpsdoiorg101016jagrformet200505008 2005

Richardson A D Hollinger D Y Shoemaker J K Hughes HSavage K and Davidson E A Six years of ecosystem-atmosphere greenhouse gas fluxes measured in a sub-borealforest Sci Data 6 1ndash15 httpsdoiorg101038s41597-019-0119-1 2019

Roby M C Scott R L Barron-Gafford G A Hamerlynck E Pand Moore D J P Environmental and Vegetative Controls onSoil CO2 Efflux in Three Semiarid Ecosystems Soil Syst 3 6httpsdoiorg103390soilsystems3010006 2019

Running S W and Zhao M Daily GPP and annual NPP(MOD17A2A3) products NASA Earth Observing SystemMODIS land algorithm Univ of Mont Missoula Mont US2015

Running S W Nemani R R Heinsch F A Zhao MReeves M and Hashimoto H A continuous satellite-derived measure of global terrestrial primary productionBioscience 54 547ndash560 httpsdoiorg1016410006-3568(2004)054[0547ACSMOG]20CO2 2004

Rzedowski J Vegetacioacuten de Meacutexico Meacutexico Editor Limusa1978

Scott R L and Biederman J A Critical Zone Water Balance Over13 Years in a Semiarid Savanna Water Resour Res 55 574ndash588 httpsdoiorg1010292018WR023477 2019

Scott R L Watts C Payan J G Edwards E Goodrich D CWilliams D and James Shuttleworth W The understory andoverstory partitioning of energy and water fluxes in an opencanopy semiarid woodland Agr Forest Meteorol 114 127ndash139 httpsdoiorg101016S0168-1923(02)00197-1 2003

Scott R L Edwards E A Shuttleworth W J Huxman T EWatts C and Goodrich D C Interannual and seasonal vari-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

392 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

ation in fluxes of water and carbon dioxide from a ripar-ian woodland ecosystem Agr Forest Meteorol 122 65ndash84httpsdoiorg101016jagrformet200309001 2004

Scott R L Biederman J A Hamerlynck E P and Barron-Gafford G A The carbon balance pivot point of southwest-ern US semiarid ecosystems Insights from the 21st centurydrought J Geophys Res-Biogeo 120 2612ndash2624 2015

Segerstrom K Geology of the Bernal-Jalpan Area Estado deQueretaro Mexico Geol Surv Bull 1104 19ndash82 1961

Soper F M McCalley C K Sparks K and Sparks J P Soilcarbon dioxide emissions from the Mojave desert Isotopic evi-dence for a carbonate source Geophys Res Lett 44 245ndash251httpsdoiorg1010022016GL071198 2017

Stevens N T Steiner S H and MacKay R J Assessing agree-ment between two measurement systems An alternative to thelimits of agreement approach Stat Methods Med Res 262487ndash2504 httpsdoiorg1011770962280215601133 2015

Tagesson T Ardouml J Cappelaere B Kergoat L AbdiA Horion S and Fensholt R Modelling spatial andtemporal dynamics of gross primary production in theSahel from earth-observation-based photosynthetic capacityand quantum efficiency Biogeosciences 14 1333ndash1348httpsdoiorg105194bg-14-1333-2017 2017

Thornton P Thornton M Mayer B Wei Y Devarakonda RVose R and Cook R Daymet Daily Surface Weather Data ona 1-km Grid for North America Version 3 Oak Ridge NationalLab (ORNL) Oak Ridge TN US 2017

Tramontana G Jung M Schwalm C R Ichii K Camps-VallsG Raacuteduly B Reichstein M Arain M A Cescatti A KielyG Merbold L Serrano-Ortiz P Sickert S Wolf S andPapale D Predicting carbon dioxide and energy fluxes acrossglobal FLUXNET sites with regression algorithms Biogeo-sciences 13 4291ndash4313 httpsdoiorg105194bg-13-4291-2016 2016rdquo

Truong A Walters A Goodsitt J Hines K Bruss C B andFarivar R Towards automated machine learning Evaluationand comparison of AutoML approaches and tools in Proceed-ings of the 2019 IEEE 31st International Conference on Toolswith Artificial Intelligence (ICTAI) Portland OR US 1471ndash1479 2019

Verbist K Santibantildeez F Gabriels D and Soto G Atlas dezonas aacuteridas de Ameacuterica Latina y El Caribe Donoso MC(Coord) UNESCO-PHI and CAZALAC Montevideo Uruguay2010

Wang H Li X Ma M and Geng L Improving esti-mation of gross primary production in dryland ecosystemsby a model-data fusion approach Remote Sens 11 225httpsdoiorg103390rs11030225 2019

Wang X Ma M Li X Song Y Tan J Huang G Zhang ZZhao T Feng J and Ma Z Validation of MODIS-GPP prod-uct at 10 flux sites in northern China Int J Remote Sens 34587ndash599 httpsdoiorg101080014311612012715774 2013

Wang X Wang H Li X and Ran Y Photosynthesis (NPPNEP Respiration) Obs Meas Ecohydrol Process 1ndash30httpsdoiorg101007978-3-662-47871-4_10-1 2018

Wheeler C W Archer S R Asner G P and McMurtry C RClimaticedaphic controls on soil carbonnitrogen response toshrub encroachment in desert grassland Ecol Appl 17 1911ndash1928 httpsdoiorg10189006-15801 2007

Wilcox B P Birt A Fuhlendorf S D and Archer S R Emerg-ing frameworks for understanding and mitigating woody plantencroachment in grassy biomes Curr Opin Env Sust 32 46ndash52 httpsdoiorg101016jcosust201804005 2018

Wilson J B Cover plus ways of measuring plant canopiesand the terms used for them J Veg Sci 22 197ndash206httpsdoiorg101111j1654-1103201001238x 2011

Wofsy S Goulden M Munger J Fan S-M Bakwin PDaube B Bassow S and Bazzaz F Net exchange ofCO2 in a mid-latitude forest Science 260 1314ndash1317httpsdoiorg101126science26051121314 1993

Wu C Niu Z and Gao S Gross primary production estimationfrom MODIS data with vegetation index and photosyntheticallyactive radiation in maize J Geophys Res-Atmos 115 D12127httpsdoiorg1010292009JD013023 2010

Wutzler T Lucas-Moffat A Migliavacca M Knauer JSickel K Šigut L Menzer O and Reichstein MBasic and extensible post-processing of eddy covarianceflux data with REddyProc Biogeosciences 15 5015ndash5030httpsdoiorg105194bg-15-5015-2018 2018

Xiao H McDonald-Madden E Sabbadin R Peyrard NDee L E and Chadegraves I The value of understanding feed-backs from ecosystem functions to species for managing ecosys-tems Nat Commun 10 1ndash10 httpsdoiorg101038s41467-019-11890-7 2019

Xu B Yang Y Li P Shen H and Fang J Globalpatterns of ecosystem carbon flux in forests A biometricdata-based synthesis Global Biogeochem Cy 28 962ndash973httpsdoiorg1010022013GB004593 2014

Yao J Liu H Huang J Gao Z Wang G Li D Yu H andChen X Accelerated dryland expansion regulates future vari-ability in dryland gross primary production Nat Commun 111ndash10 httpsdoiorg101038s41467-020-15515-2 2020

Yepez E A Williams D G Scott R L and Lin GPartitioning overstory and understory evapotranspiration ina semiarid savanna woodland from the isotopic composi-tion of water vapor Agr Forest Meteorol 119 53ndash68httpsdoiorg101016S0168-1923(03)00116-3 2003

Yona L Cashore B Jackson R B Ometto J and Brad-ford M A Refining national greenhouse gas inventories Am-bio 1ndash6 httpsdoiorg101007s13280-019-01312-9 2020

Zeng X Zeng X and Barlage M Growing temperate shrubsover arid and semiarid regions in the Community Land ModelndashDynamic Global Vegetation Model Global Biogeochem Cy 22GB3003 httpsdoiorg1010292007GB003014 2008

Zhang A Jia G Epstein H E and Xia J ENSO elicits oppos-ing responses of semi-arid vegetation between Hemispheres SciRep 7 42281 httpsdoiorg101038srep42281 2017

Zhang L Xiao J Zheng Y Li S and Zhou Y Increased carbonuptake and water use efficiency in global semi-arid ecosystemsEnviron Res Lett 15 034022 httpsdoiorg1010881748-9326ab68ec 2020

Zhu X-G Long S P and Ort D R What is the max-imum efficiency with which photosynthesis can convert so-lar energy into biomass Curr Opin Biotech 19 153ndash159httpsdoiorg101016jcopbio200802004 2008

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

  • Abstract
  • Introduction
  • Materials and methods
    • Site description
    • Eddy covariance measurements
    • Flux data processing
    • Remotely sensed data
    • MODIS algorithm for GPP
    • Santa Rita site dataset
    • Modeling
    • Variable importance
    • Model agreement
      • Results
        • Eddy covariance fluxes at Bernal
        • Machine learning ensembles as predictors of eddy covariance GPP
        • MODIS as predictor of eddy covariance GPP
        • Prediction eddy covariance GPP with ordinary-least-squares multiple regression
        • Agreement
          • Discussion
            • Model agreement
            • MODIS discrepancies
            • Forcing variables and machine learning
            • Carbon flux
            • Ecosystem management
              • Conclusions
              • Appendix A
              • Appendix B
              • Appendix C
              • Code and data availability
              • Author contributions
              • Competing interests
              • Acknowledgement
              • Financial support
              • Review statement
              • References
Page 12: Machine learning estimates of eddy covariance carbon flux ... · Scrub is a xeric category of shrublands characterized by plants with small leaves; it is very thorny, and its biomass

378 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 8 Net ecosystem exchange (NEE) and photosynthetic active radiation (PAR) at the Bernal site in (a) 2017 and (b) 2018 Negativevalues in the CO2 flux indicate photosynthesis The gray shadow is the standard error of the mean for each month at any given hour

Table 3 Leader board of EML models for the Bernal site using 85 of day observations as training dataset NA denotes the outcome wherethe type of model was not present in the 30 top-performing models according to deviance

Type of model Number of models Average deviance

All models Top models Leader model

Stacked ensembleAll models 20 0852 0834Best of family 20 0812 0703 0633GBM 453 1366 0796DRF 20 1108 NAXGBoost 277 1385 0778XRT 20 1077 NADeep learning 201 2072 NAGLM 20 1474 NATotal 1031 1356

ship but note that the value of r (076) was the same for theoriginal MOD17 and the adjusted MOD17 (Fig 11b)

34 Prediction eddy covariance GPP withordinary-least-squares multiple regression

OLS is a common estimation method for linear models andhere this model appeared as adequate judging by the gen-eral distribution of predictions (Fig 12a) and the probability-of-agreement plot (Fig 12b) A total of 14 variables wereincluded in the model all of them with VIF values lowerthan 70 (Appendix C) the VIF statistic quantifies the sever-ity of multicollinearity and an acceptable threshold is 10The most significant variables were the EVI from MYD13and Daymet variables precipitation shortwave radiation and

minimum and maximum temperatures (see Appendix B forvariable details) Variables with high coefficient values wereMODIS reflectance band 14 (917) the EVI from MYD13(853) and the NDVI (39) while Daymet temperatures hadsmall coefficientsminus023 for maximum temperature and 017for minimum temperature The θs decreased slightly at theend of the measurement range mean θs was 05plusmn0014 withα = 018845plusmn0137 and β = 094966plusmn0031No correctionfor this model was calculated since α was close to 0 β wasclose to 1 and the EML model had a higher θs

35 Agreement

The probability of agreement (θs) was the statistic to deter-mine if the mean responses were in agreement Comparing

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 379

Figure 9 Agreement between predictions of GPP obtained with machine learning algorithms or derived from eddy covariance measurementsat the Bernal site (a) Example of one run of predictions of GPP in the test dataset from the Bernal site using the leader model of an ensembleof machine learning algorithms (EML) the test dataset was 15 of data with r = 094 (b) Predictions for the complete dataset withr = 099 the diagonal line is the 1 1 agreement (c) Function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreementbetween methods of estimation of GPP corresponding to plot (b) The horizontal axis (s) represents the magnitude of the measurement thevertical axis is the probability of agreement for the measurement and the red line is the confidence interval p lt 005

Figure 10 (a) Adjusted predictions of GPP for the complete dataset from the Bernal site using the leader model of the final ensembleof machine learning algorithms (EML) derived from the Santa Rita site compared to estimates of GPP from EC data with r = 078 (b)Respective function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreement between methods of estimation of GPP ECdata from the Bernal site and EML model for the Santa Rita site The horizontal axis (s) represents the magnitude of the measurement thevertical axis is the probability of agreement for the measurement and the red line is the confidence interval p lt 005

the confidence intervals (CIs) for θs the best modeling ap-proach was the EML because its CI (056ndash059) was differentfrom that of the OLS model (047ndash052) More importantlyfor the EML and OLS models the values of θs had little vari-ation across the range of GPP estimates However the CIsof the EML and OLS models were similar in their bias es-timates α and β and both models had no bias (p lt 005)Altogether the best result was the EML ensemble using en-vironmental and remotely sensed data corresponding to thesame site ie Bernal (Fig 10c) This kind of EML wouldbe useful for gap filling or the evaluation of GPP time se-ries of the site that generated the model Machine learningalgorithms can fill gaps longer than 30 d (Kang et al 2019)

The second option to estimate GPP using the samedatasets was the multiple-regression OLS model (Fig 12a)The multiple regression is straightforward and here multi-collinearity was not a problem The EML ensemble and theOLS regression have the highest values of θs (058 and 05respectively Fig 12b) Higher values for θs are desirable(gt 095) and this could be achieved by increasing the samplesize relaxing the tolerable magnitude for agreement (here itwas set at 1 gCmminus2 dminus1) or perhaps using different forcingvariables

The MODIS estimates were a third-best alternative sincethe mean θs was 024 In the present study we used a splineto fill the data to a daily time step since the MCD17 is an8 d composite product but a similar result was obtained if

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

380 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 11 (a) MOD17 GPP daily time step from the Bernal site versus estimates of GPP from EC data using the complete dataset at adaily time step derived from spline for MOD17 GPP (copy) with r = 076 or 8 d composite estimates obtained by 8 d averages of EC GPP() with r = 076 (b) Adjusted predictions of GPP from MODIS (c) Corresponding function of probability of agreement for Fig 7a using1 g Cmminus2 dminus1 as a tolerable agreement between methods of estimation of GPP EC data from the Bernal site and MODIS MOD17 Thehorizontal axis (s) represents the magnitude of the measurement The vertical axis is the probability of agreement for the measurement

Figure 12 (a) Ordinary-least-squares multiple-regression estimates of GPP for the complete dataset from the Bernal site versus estimates ofGPP from EC data with r = 082 (b) Respective function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreement betweenmethods of estimation of GPP EC data from the Bernal site and OLS multiple regression The horizontal axis (s) represents the magnitudeof the measurement The vertical axis is the probability of agreement for the measurement

the EC GPP was rescaled and compared to the original 8 dMCD17 data (Fig 11a) The GPP from MODIS was an un-derestimate of EC GPP and when the estimates were ad-justed (Eq 6 Fig 11b) the performance was not better thanthe OLS (model not shown) The MODIS land cover classi-fication represented this site as grassland (MCD12) and thiscould be another reason for the poor agreement besides theassumptions made in the MODIS algorithm regarding the εand εmax parameters and the response of vegetation to VPDAgreement of MODIS GPP is crucial because MODIS prod-ucts are frequently used in country-wide assessments of thecarbon cycle and can influence public policies

The model with least agreement resulted when the EMLensemble generated from the Santa Rita site was used to pre-dict GPP at Bernal Machine learning models can make pre-dictions but their usefulness decreases when they are usedoutside the context of where they were built while process-

based mechanistic models have this ability Although the Au-toML function in H2O is designed to protect against over-fitting using cross-validation runs (Michailidis 2018) in ourstudy the Santa Rita model could not be generalized to repre-sent the Bernal site GPP process probably because the vari-ables and features selected for Bernal or Santa Rita were dif-ferent during the AutoML workflow

The Santa Rita model was good at predicting GPP at thatsite with a deviance of the leader model of 023 while atBernal the deviance of the leader model was 063 (Table 3)indicating that the Santa Rita EML ensemble was at leastas good a model as the EML at Bernal (not shown) TheGPP time series for Santa Rita was about 4 times the sizeof the Bernal dataset and therefore the deviance was lowerHowever when the Santa Rita model was used with Bernaldata the mean θs was 016 indicating that the agreement wasinsufficient Eyeballing the predictions in Figs 9a 10b and

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 381

11a and their corresponding correlation values (078 076and 082 for adjusted EML Santa Rita adjusted MODIS andmultiple-regression OLS) it could be argued that these mod-els were comparable However their θs plots present a dif-ferent perspective

4 Discussion

41 Model agreement

The Santa Rita model had higher θs when it extrapolatedfor low GPP values for the Bernal site suggesting that theSanta Rita model had a better skill when predicting GPPclose to 0 and even negative GPP values (Fig 10b) Al-though the Bernal and Santa Rita sites had similar vegeta-tion and climate classifications they are more than 1600 kmapart and rainfall monthly distribution is different The GPPseasonal cycle at Bernal started in February and steadily in-creased to a maximum during July and August (Table 1)At Santa Rita the GPP was low from January until mid-July (lt 05 gCO2 mminus2 dminus1) and then increased sharply to amaximum level in mid-August (Joiner and Yoshida 2020)What matters eventually for machine learning methods ishow well the predictor space rather than geographic spaceis sampled (Jung et al 2020) Incorporating data from morehumid (semiarid) sites could improve the GPP predictions bya machine learning method Not only more EC sites but alsosites representing a water availability gradient would be im-portant for semiarid ecosystems representing long spells andthe influence of oceanic oscillations and monsoon rains

All the models presented here used transient data to rep-resent GPP specifically at the 1 d time step Besides theradiation-related variables two sources of rainfall were usedas forcing variables but evapotranspiration (ET) was notused The MOD16A2 version 6 is available as an 8 d gap-filled product and could be included in EML or OLS mod-els However a more general representation of the carboncycling could be achieved when including variables that rep-resent annual or seasonal timescales of soil water evapo-ration or precipitation (Scott and Biederman 2019) Scottet al (2015) suggest that real lags between precipitation andproductivity that may impart legacy effects may also be par-tially masked by using ET as ET more carefully tracks pro-ductivity when soil moisture storage is accessed The occur-rence of off-season rainfall dry spells and carryover effectscould be parameterized as windowed events of a given dura-tion A window would be a period with distinct time bound-aries the window allows the grouping of records with similarfeatures The effect of the window at a given time could berepresented as moving weights as the point in time in ques-tion is closer or farther from the window

In our study the metric to assess model agreement wasthe probability of agreement (θs) and their bias parame-ters Many other metrics can be used to evaluate model per-

formance such as the root mean square error r r2 or themodel efficiency factor (MEF) presented by Nash and Sut-cliffe (1970) In particular the MEF is a step in the Fluxnetdata processing pipeline (Pastorello et al 2020) A MEFvalue close to 1 represents a high correlation and lower bi-ases (Joiner and Yoshida 2020) Therefore the calculatedMEF for the Bernal site EML (098) would suggest very goodpredictive performance while the θs of 06 for this model in-dicates a more modest performance The θs for a particularvalue of the measurand is the probability that the differencebetween two measurements made by different systems fallswithin an interval that is deemed to be acceptable (Stevenset al 2015) In the present study we used 1 gCmminus2 dminus1 as acritical value defining an acceptable difference If this valueis smaller then the probability of agreement would decreasefor this same model In such a case it would be less likelythat the predictors agree considering that the θs takes intoaccount both the difference in the function means at a givenvalue of the measurand and the uncertainty in its estimation

42 MODIS discrepancies

Different authors have reported discrepancies betweenMCD17 and EC estimates of GPP in semiarid regions Ex-amining MODIS discrepancies in these ecosystems is im-portant because the errors induced by cloud cover are ex-pected to be minimal and other effects can be identified(Gebremichael and Barros 2006) The GPP of MOD17 didnot relate well (EC= 011+ 017MODIS r2

= 067) withestimates of EC GPP in semidesert vegetation of the Sahel(Tagesson et al 2017) With data from different types of veg-etation in the Heihe basin in China MODIS17 overestimatedthe GPP from EC (EC= 115+024MODIS r2

= 068 Cuiet al 2016) For scrub sites in Mexico the relation be-tween GPP calculated from EC and MOD17 was not good(MODIS= 38382+ 0467EC r2

= 06 Delgado-Balbuenaet al 2018) In arid and semiarid ecosystems in China op-timizing parameters of the MODIS GPP model with site-specific data improved the estimate to explain 91 of thevariation in the GPP of the data observed by EC (Wang et al2019) These same authors propose improving the land useclassification used by the MOD17 algorithm and recalibrat-ing light use efficiency parameters to solve the GPP estima-tion problem Gebremichael and Barros (2006) examined anopen shrubland site in a semiarid region of Sonora Mexicoand their analysis of the temporal evolution of the discrep-ancies with MODIS GPP suggested revisiting the light useefficiency parameterization especially the functional depen-dence on VPD and PAR and water stress or soil moistureavailability

The relationship between the GPP MODIS and the GPPEC presented in Sect 33 is an approximation because theuncertainty in the respiration component must be consid-ered The empirical relationship between nocturnal NEE andsoil temperature has been used to represent ecosystem res-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

382 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

piration (Reco) in order to separate the processes that con-tribute to daytime NEE (Richardson and Hollinger 2005Wofsy et al 1993) Nighttime NEE should be equal to therates of autotrophic and heterotrophic respiration while dur-ing daytime NEE should be equal to the combined rates ofcarboxylation and oxidation of Rubisco autotrophic respi-ration and heterotrophic respiration Then the GPP can becalculated as the difference between daytime NEE and Recoestimated through its relationship with temperature (Gouldenet al 1996) In the present study Reco was calculated basedon soil and air temperature following the procedure of Reich-stein et al (2005) implemented in REddyProc (Wutzler et al2018) Although it is possible to measure or model the parti-tion of respiration (Running et al 2004 Wang et al 2018)the presence of cacti complicates the calculation assum-ing that all nighttime flux represents ecosystem respiration(Owen et al 2016 Richardson and Hollinger 2005) Whilesoil respiration tends to be temperature-limited when soilmoisture is nonlimiting in temperate ecosystems in range-land ecosystems the controls of soil CO2 efflux were photo-synthesis soil temperature and moisture (Roby et al 2019)In our study the instrumentation did not include measure-ments of plant or soil respiration partition to validate theRecoestimates

A problem regarding data comparison from remote orbitalsensors and terrestrial observations is that different quanti-ties are fundamentally measured MODIS measures the radi-ation reflected by the earthrsquos surface in two spectral bands at250 m spatial resolution per pixel five bands at 500 m and29 bands at 1000 m The EC technique has a footprint tomeasure CO2 that varies dynamically in shape and size butis generally considered to be 1 km2 in this study the foot-print radius was 600 m To solve the scaling MODIS prod-ucts related to the carbon cycle have been validated with theEC technique and biometric measurements on several spa-tial scales using process-based ecosystem models and char-acterizing areas up to 47 km2 around the EC measuring tower(Cohen et al 2003)

At Bernal the vegetation was heterogeneous This situa-tion was represented by the heterogeneity in vegetation ac-tivity in the 4 pixels used spanning about 02 units of NDVIat the peak of the season activity (Fig 1) The standard errorof the IVI differed by 1 order of magnitude among speciesAlthough more important species had a lower standard errortheir means were similar indicating that they were equallyabundant at all sampling plots The higher importance ofsome species (Table 2) was explained by selective grazingndashbrowsing behavior and the dispersion caused by cattle by ei-ther ingesting or transporting seeds or plant parts (Belaynehand Tessema 2017) Regarding landscape heterogeneity thetower fetch was predominantly from the northeast capturingthe less heterogeneous area of the site but also the more ac-tive according to the NDVI (Fig 1)

The thorny scrub examined had two vegetation layersthe overstory layer mainly consisting of mesquite acacia

and cacti and the understory layer that included grasses andherbs Cattle preferentially graze the understory and becausethey eat using their tongue they will avoid browsing thornyspecies unlike goats or deer which use their lips Withoutgrazing management overtime the competition balance willfavor bush species resulting in encroaching and the under-story will be stressed only unpalatable species or those withtheir growing meristems very close to the ground would sur-vive Representing the structure and functioning of these twolayers using MODIS is possible (Liu et al 2017) RecentlyHill and Guerschman (2020) presented a MODIS product de-rived from MCD43A4 to estimate the fractions of photosyn-thetic and nonphotosynthetic vegetation and the remainingfraction of bare soil These developments could improve theMOD17 GPP estimates since its model represents a homoge-neous single vegetation layer All these considerations helpto understand the low θs between MOD17 estimates of GPPand EC-derived GPP

43 Forcing variables and machine learning

Forcing variables in this study were gridded meteorologi-cal data and MODIS spectral bands and products but thevariables mostly included in the ML algorithms were LAIand FPAR accounting for 364 and 549 of the vari-able importance index at Bernal and Santa Rita respectivelyin their leader board This result supported the view thatCO2 fluxes can be represented by ML algorithms exclu-sively using remotely sensed data (Tramontana et al 2016Joiner and Yoshida 2020) Using neural networks Joiner andYoshida (2020) showed that with only satellite reflectancefrom MODIS and top-of-atmosphere PAR it is possible tocapture a large fraction of GPP variability They argue thatvegetation indices may reduce the information content ofthe underlying reflectance when compressing the informa-tion from two or more bands into a single index In our studyreflectance bands were not often included in the ML modelsThis was most likely because MOD09GA and MODOCGAprovide estimates of surface reflectance uncorrected for theillumination and viewing geometry while the MCD43 usedin Joiner and Yoshida (2020) uses a surface bidirectional re-flectance distribution function (BRDF) model that improvesthe quality of surface albedo retrievals By propagating theBRDF correction in the MODIS processing pipeline the veg-etation indices would more likely relate better to the GPPFuture modeling efforts should use benchmark scenarios ac-cording to sets of forcing variables already identified as use-ful

44 Carbon flux

Although the carbon balance in ecosystems is influenced bydifferent factors such as soil type and amount of nutrientsthe relationship with soil temperature and humidity is partic-ularly strong (Anderson-Teixeira et al 2011 Hastings et al

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 383

2005) How much of the rainwater the system can retain orlose has been described as the leakiness of the system (Guer-schman et al 2009) More than immediate incident rainfallthe available soil moisture and its redistribution are impor-tant in semiarid ecosystems including steam flow prefer-ential flow paths hydraulic lift and others (Barron-Gaffordet al 2017) At Bernal when the sum of the precipitation ofthe current month and that of the previous month was con-sidered the correlation with NEE was minus07 suggesting thatcontinuous availability of soil moisture is important for theabsorption of CO2 in this environment This result is con-sistent with other studies in which the relationship betweenthe net productivity of the ecosystem (NEP) and precipita-tion is initially positive but is leveled from 1000 to 1500 mmannually (Xu et al 2014) The hydraulic redistribution ofwater from moist (deeper) to drier soils through plant rootstended to increase modeled annual ecosystem uptake of CO2this process was identified at US-SMR (Fu et al 2018 Scottet al 2003)

The leakiness is highly dependent upon vegetation frac-tional cover the proportion of the surface occupied by baresoil and vegetation photosynthetically active vegetation andnonphotosynthetically active vegetation such as litter woodand dead biomass (Guerschman et al 2009) All these vege-tation fractions have a water storage capacity and can reducethe amount of effective rainfall available to plant roots It ispossible that the canopy of the bushes completely interceptedthe rainfall in some months because the scrub can interceptup to 20 of the precipitation and its canopy storage ca-pacity is 097 mm (Mastachi-Loza et al 2010) Consideringonly the daily rain events greater than 5 mm the correlationbetween precipitation and NEE rose to minus072 In the presentstudy the interception of rain by vegetation surfaces was notcalculated but the results suggest that it would be impor-tant to explore the relationship between net precipitation andNEE

The average NEE at a global level is minus156plusmn284 gCmminus2 yrminus1 (Baldocchi 2014) The highest frequencyamong sites that measured NEE with EC occurs from minus200to minus300 gCmminus2 yrminus1 but in sites with biometric measure-ments the peak occurs at minus100 gCmminus2 yrminus1 (Xu et al2014) Using the daily averages of Table 2 the averageNEE during the measurement period was minus078 gCmminus2 dminus1

and annually would be minus2835 gCmminus2 yrminus1 This resultwas higher than the annual values of the induced grass-land and scrubland vegetation characterizing the SonoranDesert plains (138 and 130 gCmminus2 yrminus1 Hinojo-Hinojoet al 2019) In New Mexico NEE values measured withEC are between 35ndash50 gCmminus2 yrminus1 in desert grassland and344ndash355 gCmminus2 yrminus1 in mixed coniferous forest (Anderson-Teixeira et al 2011) In a dryer region the sarcocaulescentscrubland of Baja California in Mexico the NEE was minus39and minus52 gCmminus2 yrminus1 in 2002 and 2003 respectively (Hast-ings et al 2005) The NEE measured here was within therange of NEE 03plusmn 02 kgCmminus2 yrminus1 for grasslands and

shrublands in Mexico (Murray-Tortarolo et al 2016) Al-though the measurements of the present study had gaps andwere compared with annual studies we considered that thereported value of C was representative of the main season ofgrowth of this type of scrub

45 Ecosystem management

Overgrazing is an appreciation relative to the grazing produc-tive system where the forage resource is overused in a mixedshrubndashgrass ecosystem such as Bernal it usually refers tothe understory Overgrazing means that the plant regrowth isreadily grazed tillers and root reserves are lost and eventu-ally the plant may die Although the Bernal site was over-grazed the carbon fluxes indicated that the plant communitywas photosynthetically active in both dry and rainy seasonsIt is fair to assume that water in the soil was not limiting forthe deep-rooted bush species and that was the reason whyit was possible to maintain the photosynthetic function dur-ing rainless months However this primary production wouldnot have tangible benefits for ranchersrsquo production systemsince no edible biomass would be produced for the cattleFrom the point of view of carbon capture the system accu-mulated nonlabile biomass that would remain in the systemfor a longer time compared to a grassland ecosystem (al-though it would be necessary to determine the partition ofsaid shrub biomass) However the overgrazing condition af-fects the biomass of the understory roots and consequentlythe carbon pool in the soil

In the short term it can be thought that the estimated neg-ative carbon flows had a favorable effect on the environmen-tal agenda As time passes it is possible that the gaps be-tween the individual shrubs of the overstory expand and thiswould have had an effect on soil erosion It is also possiblethat the water stored in the soil profile used by the bushesgradually decreases to the point of causing drought changesin phenology and advancements of the desertification pro-cess There are many opportunities for ecology conservationand livestock-oriented management these may include con-trolled grazing or propagating native thornless shrub speciesIf ranchers do not identify a benefit in the vegetation thenthey will be tempted to remove it as it occurred at thestudy site Because of its wide coverage and availability theMODIS GPP product accuracy is important in representingthe carbon cycle raising awareness and monitoring advance-ment of environmental decisions

Although we found that EML was a good option for mod-eling the GPP of a site what is really needed to evaluate theperformance of semiarid ecosystems is a spatial representa-tion of the carbon flux This is a problem for an underrepre-sented area regarding instrumented EC towers However theEML could be designed to take into account the explanatoryvariables in a spatiotemporal continuity

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

384 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

5 Conclusions

The best modeling approach was the ensemble of ma-chine learning The second option to estimate GPP was themultiple-regression OLS and the third alternative were theMODIS estimates Machine learning was a good option topredict GPP in the local context where it was generated oth-erwise its performance was not good This was demonstratedwhen using the EML from Santa Rita and trying to predictfluxes at Bernal Nevertheless a machine learning modelwould be useful for gap filling or the evaluation of GPP atthe same site The GPP estimates of a given model can be ad-justed using the bias parameters of the probability of agree-ment to improve the relationship Further research explor-ing the usefulness of EML to predict GPP in other contextscould use EML to calibrate mechanistic model parametershybrid approaches integrating EML and biologically mean-ingful mechanistic models or EML representing the spatialvariability of the landscape

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 385

Appendix A

Table A1 Threshold of velocity friction (ulowast) above which nighttime fluxes were considered valid Estimates were obtained using the Rpackage REddyProc We used only the flux records with ulowast equal to or higher than the corresponding mean ulowast threshold for each yearBound values are a 95 confidence interval NA not applicable

Aggregation method Year Season Mean Lower bound 5 Upper bound 95

ulowast (msminus1)

Single NA NA 0193 0140 0209Year 2017 NA 0194 0128 0227Year 2018 NA 0193 0132 0225Season 2017 2016012 0194 0128 0227Season 2017 2017003 0215 0141 0297Season 2017 2017006 0194 0128 0227Season 2017 2017009 0181 0144 0208Season 2018 2017012 0149 0116 0232Season 2018 2018003 0271 0134 0297Season 2018 2018006 0193 0132 0221

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

386 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix B

Table B1 MODIS reflectance bands and products database

Product name Satellite layer Spatial Temporal Spectralresolution resolution coverage

(d) (nm)

MOD09GA Band 1 500 m 1 620ndash670Band 2 841ndash876Band 3 459ndash479Band 4 545ndash565Band 5 1230ndash1250Band 6 1628ndash1652Band 7 2105ndash2155

MODOCGA Band 8 1 km 405ndash420Band 9 438ndash448Band 10 483ndash493Band 11 526ndash536Band 12 546ndash556Band 13 662ndash672Band 14 673ndash683

MOD17A2H Gross primary production (GPP) 8

Net photosynthesis (PsnNet)

MOD15A2H Fraction of photosynthetically active radiation (Fpar) na

Leaf area index (LAI)

MOD11A2 Land surface temperature and emissivity of day (LST day) 1 km

Land surface temperature and emissivity of night (LST night)

MYD13A1 Enhanced vegetation index (EVI) 500 m 16

MYD15A2H Fraction of photosynthetically active radiation (Fpar) 8

Leaf area index (LAI)

MYD11A2 Land surface temperature and emissivity of day (LST day) 1 km na

Land surface temperature and emissivity of night (LST night)

MCD15A2H Fraction of photosynthetically active radiation (Fpar) 500 m

Leaf area index (LAI)

na not applicable

Table B2 Daily vegetation indexes computed using the MODIS reflectance bands described in Table 1

Index Formula Reference

Simple ratio (SimpleR) SimpleR= Band 2Band 1

(Hill et al 2006)Moisture stress (MoistS) MoistS= Band 6

Band 2

Disease stress index (DSI) DSI= Band 2+Band 4Band 6+Band 1

Redndashgreen ratio index (RGRI) RGRI= Band 1Band 4

Normalized difference vegetation index (NDVI) NDVI= Band 2minusBand 1Band 2+Band 1

Normalized difference water index (NDVI_w) NDVI_w =Band 2minusBand 5Band 2+Band 5

Green leaf index (GLI) GLI= 2lowastBand 11minusBand 14minusBand 92lowastBand 11+Band 14+Band

(Henrich et al 2012)Green atmospherically resistant vegetation (GARI) GARI= Band 5minus(Band 11minus(Band 9minusBand 14))

Band 5minus(Band 11+(Band 9minusBand 14))

Enhanced vegetation index (EVI) EVI= 25lowast Band 2minusBand 1Band 2+(6lowastBand 1)minus(75lowastBand 9)+1

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 387

Table B3 Daymet meteorological database

Variable Spatial resolution Temporal resolution Reference

Precipitation (Dayprc)

1 km Daily (Thornton et al 2017)Shortwave radiation (Daysrad)Maximum air temperature (DayTmax)Minimum air temperature (DayTmin)Water vapor pressure (Dayvp)

Table B4 Precipitation data

Satellite Product name Spatial resolution Temporal resolution

Global Precipitation Measurement (GPM) 3IMERGDF v006 010Daily

Tropical Rainfall Measuring Mission (TRMM) 3B42 v007 025

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

388 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix C

Table C1 Analysis of variance for GGP derived from EC data atthe Bernal site Variable details are described in Appendix B

Model term Coefficient SEa DFb SSc F p VIFd

Regression 14 178048Constant minus1318 648 minus203 004R_08_40542e

minus212 060 1 2680 minus352 0 220R_13_66267e

minus286 123 1 1161 minus232 002 155R_14_67368e 917 259 1 2706 353 0 122RGRI6

minus083 037 1 1100 minus225 003 143GARIf 201 108 1 755 187 006 237EVIMYDg 853 201 1 3903 425 0 294NDVI_wf 390 155 1 1367 251 001 271PsnNetg 0013 0 1 7530 590 0 383LstNgtMYDg 005 002 1 818 194 005 201Dayprch 014 003 1 3590 407 0 192Daysradh 001 0 1 5121 486 0 501DayTmaxh

minus024 006 1 3658 minus411 0 689DayTminh 018 005 1 3222 386 0 496TRMMi minus004 002 1 1087 minus224 003 141Error 403 87297Total 417 265345

a Standard error of coefficient b Degrees of freedom c Adjusted sum of squares d Variance inflation factore Variable name denotes the band number and spectral bandwidth of MODIS (Moderate Resolution ImagingSpectroradiometer) f Vegetation indices RGRI is redndashgreen ratio index and GARI is green atmosphericallyresistant vegetation index details of formula are described in Appendix B g Layers from MODIS productsEVIMYD is enhanced vegetation index from MYD13 PsnNet is photosynthesis product form MOD17LstNgtMYD is nighttime land surface temperature emissivity from MYD11 h Variables obtained fromDaymet daily dataset DayTmin is minimum temperature DayTmax is maximum temperature Daysrad isshortwave radiation Dayprc is precipitation i Daily rainfall rate from 3B42 TRMM (Tropical RainfallMeasuring Mission)

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 389

Code and data availability Database and programming code areavailable at httpsdoiorg105281zenodo3598595 (Guevara-Escobar et al 2020)

Author contributions AGE contributed to conceptualization for-mal analysis investigation methodology software supervisionvalidation visualization and writing of the original draft as wellas review and editing EGS was in charge of funding acquisitionproject administration and writing and review MCJ contributed toinvestigation methodology validation visualization and writing ofthe original draft as well as review editing and communicationMEQB helped with investigation validation writing and reviewHSA contributed with conceptualization writing and review ICAacutehelped with investigation writing and review VHCS contributedwith writing review and funding acquisition

Competing interests The authors declare that they have no conflictof interest

Acknowledgement We thank NASA and Ameriflux for data avail-ability Data filtering was carried out by the cluster Abacus ICentre of Applied Mathematics and High Performance ComputingABACUS-CINVESTAV

Financial support This research has been supported by CONA-CyT ndash SEMARNAT (grant no 2014-1-(249407))

Review statement This paper was edited by Paul Stoy and re-viewed by three anonymous referees

References

Aguirre-Diacuteaz G J Aguilloacuten-Robles A Tristaacuten-Gonzaacutelez MLabarthe-Hernaacutendez G Loacutepez-Martiacutenez M Bellon Hand Nieto-Obregoacuten J Geologic setting of the Pentildeade Bernal Natural Monument Quereacutetaro Meacutexico Anendogenous volcanic dome Geosphere 9 557ndash571httpsdoiorg101130GES008431 2013

Anav A Friedlingstein P Beer C Ciais P Harper AJones C Murray-Tortarolo G Papale D Parazoo N Cand Peylin P Spatiotemporal patterns of terrestrial grossprimary production A review Rev Geophys 53 785ndash818httpsdoiorg1010022015RG000483 2015

Anderson-Teixeira K J Delong J P Fox A M Brese D Aand Litvak M E Differential responses of production and res-piration to temperature and moisture drive the carbon balanceacross a climatic gradient in New Mexico Glob Change Biol17 410ndash424 httpsdoiorg101111j1365-2486201002269x2011

Baldocchi D Measuring fluxes of trace gases and energy be-tween ecosystems and the atmospherendashthe state and future of the

eddy covariance method Glob Change Biol 20 3600ndash3609httpsdoiorg101111gcb12649 2014

Barron-Gafford G A Sanchez-Cantildeete E P Minor R LHendryx S M Lee E Sutter L F Tran N Parra EColella T Murphy P C Hamerlynck E P Kumar P andScott R L Impacts of hydraulic redistribution on grassndashtreecompetition vs facilitation in a semi-arid savanna New Phytol215 1451ndash1461 httpsdoiorg101111nph14693 2017

Belayneh A and Tessema Z K Mechanisms of bush encroach-ment and its inter-connection with rangeland degradation insemi-arid African ecosystems a review J Arid Land 9 299ndash312 httpsdoiorg101007s40333-016-0023-x 2017

Bland J M and Altman D G Statistical methods for assessingagreement between two methods of clinical measurement Int JNurs Stud 47 931ndash936 2010

Boettiger C and Eddelbuettel D An introduction to rockerDocker containers 639 for r ArXiv Prepr ArXiv171003675 6402017

Bonilla-Moheno M and Aide T M Beyond deforestationLand cover transitions in Mexico Agric Syst 178 102734httpsdoiorg101016jagsy2019102734 2020

Booker K Huntsinger L Bartolome J W Sayre N Fand Stewart W What can ecological science tell us aboutopportunities for carbon sequestration on arid rangelands inthe United States Glob Environ Change 23 240ndash251httpsdoiorg101016jgloenvcha201210001 2013

Butzer K W and Butzer E K The lsquonaturalrsquovegetation ofthe Mexican Bajiacuteo archival documentation of a 16th-century savanna environment Quat Int 43 161ndash172httpsdoiorg101016S1040-6182(97)00032-3 1997

CICESE C de I C y de E S de E Base de datos climatoloacutegicanacional (CLICOM) Villa Bernal Quereacutetaro available at httpclicom-mexcicesemx (last access 28 August 2019) 2015

Cohen W B Maiersperger T K Yang Z Gower S TTurner D P Ritts W D Berterretche M and Run-ning S W Comparisons of land cover and LAI esti-mates derived from ETM+ and MODIS for four sites inNorth America A quality assessment of 20002001 provi-sional MODIS products Remote Sens Environ 88 233ndash255httpsdoiorg101016jrse200306006 2003

Cui T Wang Y Sun R Qiao C Fan W Jiang GHao L and Zhang L Estimating vegetation primaryproduction in the Heihe River Basin of China withmulti-source and multi-scale data PloS One 11 1ndash20httpsdoiorg101371journalpone0153971 2016

Curtis J T and McIntosh R P The Interrelations of Certain An-alytic and Synthetic Phytosociological Characters Ecology 31434ndash455 httpsdoiorg1023071931497 1950

Delgado-Balbuena J Yepez E A Aacutengeles-Peacuterez G Aguirre-Gutierrez C Arredondo T Ayala-Nintildeo F Bullock S HCastellanos A E Cueva A Figueroa-Espinoza B Garatuza-Payaacuten J Hinojo-Hinojo C Maya-Delgado Y Meacutendez-Barroso L Oechel W Paz-Pellat F Peacuterez-Ruiacutez E R Ro-driacuteguez J C Rojas-Robles N Saacutenchez-Mejiacutea Z M Uuh-Sonda J Vargas R Verduzco V S Vivoni E R andWatts C Flujos anuales de carbono en ecosistemas terrestres deMeacutexico in Estado Actual del Conocimiento del Ciclo del Car-bono y sus Interacciones en Meacutexico Siacutentesis a 2018 p 686Texcoco Estado de Meacutexico Meacutexico 2018

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

390 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Eshel G Dayalu A Wofsy S C Munger J W andTziperman E Listening to the Forest An ArtificialNeural Network-Based Model of Carbon Uptake at Har-vard Forest J Geophys Res-Biogeo 124 461ndash478httpsdoiorg1010292018JG004791 2019

Fu C Wang G Bible K Goulden M L Saleska S RScott R L and Cardon Z G Hydraulic redistributionaffects modeled carbon cycling via soil microbial activityand suppressed fire Glob Change Biol 24 3472ndash3485httpsdoiorg101111gcb14164 2018

Gebremichael M and Barros A P Evaluation of MODISGross Primary Productivity (GPP) in tropical mon-soon regions Remote Sens Environ 100 150ndash166httpsdoiorg101016jrse200510009 2006

Gitelson A A Peng Y Arkebauer T J and Suyker A EProductivity absorbed photosynthetically active radiation andlight use efficiency in crops Implications for remote sensingof crop primary production J Plant Physiol 177 100ndash109httpsdoiorg101016jjplph201412015 2015

Glenn E P Scott R L Nguyen U and Nagler P L Wide-area ratios of evapotranspiration to precipitation in monsoon-dependent semiarid vegetation communities J Arid Envi-ron 117 84ndash95 httpsdoiorg101016jjaridenv2015020102015

Goldstein A Turner W R Spawn S A Anderson-Teixeira K J Cook-Patton S Fargione J Gibbs H KGriscom B Hewson J H and Howard J F Protecting ir-recoverable carbon in Earthrsquos ecosystems Nat Clim Change10 1ndash9 httpsdoiorg101038s41558-020-0738-8 2020

Goulden M L Munger J W Fan S Daube B Cand Wofsy S C Measurements of carbon sequestra-tion by long-term eddy covariance Methods and a criti-cal evaluation of accuracy Glob Change Biol 2 169ndash182httpsdoiorg101111j1365-24861996tb00070x 1996

Grossiord C Buckley T N Cernusak L A Novick K A Poul-ter B Siegwolf R T Sperry J S and McDowell N GPlant responses to rising vapor pressure deficit New Phytol226 1550ndash1566 2020

Guerschman J P Hill M J Renzullo L J Barrett D JMarks A S and Botha E J Estimating fractional coverof photosynthetic vegetation non-photosynthetic vegetation andbare soil in the Australian tropical savanna region upscaling theEO-1 Hyperion and MODIS sensors Remote Sens Environ113 928ndash945 httpsdoiorg101016jrse200901006 2009

Guevara-Escobar A Gonzaacutelez-Sosa E Cervantes-Jimeacutenez MSuzaacuten-Azpiri H Queijeiro-Bolantildeos M E Carrillo-AacutengelesI and Cambroacuten-Sandoval V H Machine learning estimates ofeddy covariance carbon flux in a scrub in the Mexican highlandZenodo httpsdoiorg105281zenodo3598595 2020

Hall P Gill N Kurka M Phan W and Bartz A MachineLearning Interpretability with H2O Driverless AI edited byBartz A H2Oai Inc California US 2019

Hastings S J Oechel W C and Muhlia-Melo A Diur-nal seasonal and annual variation in the net ecosystem CO2exchange of a desert shrub community (Sarcocaulescent) inBaja California Mexico Glob Change Biol 11 927ndash939httpsdoiorg101111j1365-2486200500951x 2005

Henrich V Krauss G Goumltze C and Sandow C Entwicklungeiner Datenbank fuumlr Fernerkundungsindizes vol 45 AK Fern-erkundung Bochum 2012

Hill M Held A Leuning R Coops N Hughes D andCleugh H MODIS spectral signals at a flux tower site Rela-tionships with high-resolution data and CO2 flux and light useefficiency measurements Remote Sens Environ 103 351ndash368httpsdoiorg101016jrse200506015 2006

Hill M J and Guerschman J P The MODIS Global VegetationFractional Cover Product 2001ndash2018 Characteristics of Vegeta-tion Fractional Cover in Grasslands and Savanna Woodlands Re-mote Sens 12 406 httpsdoiorg103390rs12030406 2020

Hinojo-Hinojo C Castellanos A E Huxman T Ro-driguez J C Vargas R Romo-Leoacuten J R and Bie-derman J A Native shrubland and managed buffelgrasssavanna in drylands Implications for ecosystem carbonand water fluxes Agr Forest Meteorol 268 269ndash278httpsdoiorg101016jagrformet201901030 2019

Jaumlrvi L Havu M Ward H C Bellucco V McFad-den J P Toivonen T Heikinheimo V Kolari P Ri-ikonen A and Grimmond C S B Spatial modeling oflocal-scale biogenic and anthropogenic carbon dioxide emis-sions in Helsinki J Geophys Res-Atmos 124 8363ndash8384httpsdoiorg1010292018JD029576 2019

Joiner J and Yoshida Y Satellite-based reflectances capture largefraction of variability in global gross primary production (GPP)at weekly time scales Agr Forest Meteorol 291 108092httpsdoiorg101016jagrformet2020108092 2020

Jung M Schwalm C Migliavacca M Walther S Camps-VallsG Koirala S Anthoni P Besnard S Bodesheim P Carval-hais N Chevallier F Gans F Goll D S Haverd V KoumlhlerP Ichii K Jain A K Liu J Lombardozzi D Nabel J EM S Nelson J A OrsquoSullivan M Pallandt M Papale DPeters W Pongratz J Roumldenbeck C Sitch S TramontanaG Walker A Weber U and Reichstein M Scaling carbonfluxes from eddy covariance sites to globe synthesis and eval-uation of the FLUXCOM approach Biogeosciences 17 1343ndash1365 httpsdoiorg105194bg-17-1343-2020 2020

Kang M Ichii K Kim J Indrawati Y M Park J Moon MLim J-H and Chun J-H New Gap-Filling Strategies forLong-Period Flux Data Gaps Using a Data-Driven ApproachAtmosphere 10 568 httpsdoiorg103390atmos101005682019

Lal R Griffin M Apt J Lave L and Morgan M GManaging Soil Carbon Science 304 393ndash393httpsdoiorg101126science1093079 2004

LI-COR B Eddypro Software instruction manual Nebraska USversion 70 2019

Liu Y Liu R Pisek J and Chen J M Separating overstoryand understory leaf area indices for global needleleaf and de-ciduous broadleaf forests by fusion of MODIS and MISR dataBiogeosciences 14 1093ndash1110 httpsdoiorg105194bg-14-1093-2017 2017

Long S P Marshall-Colon A and Zhu X-G Meetingthe Global Food Demand of the Future by EngineeringCrop Photosynthesis and Yield Potential Cell 161 56ndash66httpsdoiorg101016jcell201503019 2015

Ma X Huete A Yu Q Restrepo-Coupe N Beringer J Hut-ley L B Kanniah K D Cleverly J and Eamus D Param-

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 391

eterization of an ecosystem light-use-efficiency model for pre-dicting savanna GPP using MODIS EVI Remote Sens Environ154 253ndash271 2014

Ma X Migliavacca M Wirth C Bohn F J Huth ARichter R and Mahecha M D Monitoring Plant FunctionalDiversity Using the Reflectance and Echo from Space RemoteSens 12 1248 httpsdoiorg103390rs12081248 2020

Marcolla B Roumldenbeck C and Cescatti A Patterns and con-trols of inter-annual variability in the terrestrial carbon budgetBiogeosciences 14 3815ndash3829 httpsdoiorg105194bg-14-3815-2017 2017

Mastachi-Loza C A Gonzaacutelez-Sosa E Becerril-Pintildea R andBraud I Peacuterdidas por intercepcioacuten en mezquite (Prosopis laevi-gata) y huizache (Acacia farnesiana) de la regioacuten semiaacuterida delcentro de Meacutexico Tecnol Cienc Agua 1 103ndash120 2010

Mauder M and Foken T Documentation and instruction man-ual of the eddy-covariance software package TK3 UniversitaumltBayreuth Abteilung Mikrometeorologie Bayreuth Germany2011

McClaran M A century of vegetation change on the Santa Rita Ex-perimental Range in Santa Rita Experimental Range 100 years(1903 to 2003) to accomplishments and contributions edited byMcClaran M P Ffolliott P F and Edminister C B Tech Co-ords USDA Forest service proceedings RMRS-P-30 TucsonAZ US 16ndash33 2003

McCulley R Jobbagy E Pockman W and Jackson R Nu-trient uptake as a contributing explanation for deep rootingin arid and semi-arid ecosystems Oecologia 141 620ndash628httpsdoiorg101007s00442-004-1687-z 2004

McElreath R Statistical Rethinking A Bayesian Course with Ex-amples in R and STAN CRC Press Boca Raton FL US 2020

Michailidis M How Driverless AI Prevents Overfitting and Leak-age Open Source Lead AI ML available at httpswwwh2oaiblogdriverless-ai-prevents-overfitting-leakage (last access7 July 2020) 2018

Mitchell R and Frank E Accelerating the XGBoost algo-rithm using GPU computing PeerJ Comput Sci 3 e127httpsdoiorg107717peerj-cs127 2017

Monteith J Solar radiation and productivity in tropical ecosys-tems J Appl Ecol 9 747ndash766 1972

Murray-Tortarolo G Friedlingstein P Sitch S Jaramillo V JMurguiacutea-Flores F Anav A Liu Y Arneth A Arvanitis AHarper A Jain A Kato E Koven C Poulter B Stocker BD Wiltshire A Zaehle S and Zeng N The carbon cycle inMexico past present and future of C stocks and fluxes Biogeo-sciences 13 223ndash238 httpsdoiorg105194bg-13-223-20162016

Nash J E and Sutcliffe J V River flow forecasting through con-ceptual models part I ndash A discussion of principles J Hydrol 10282ndash290 1970

Owen N A Choncubhair Oacute N Males J del Real Laborde J IRubio-Corteacutes R Griffiths H and Lanigan G Eddy covari-ance captures four-phase crassulacean acid metabolism (CAM)gas exchange signature in Agave Plant Cell Environ 39 295ndash309 httpsdoiorg101111pce12610 2016

Pasetto D Arenas-Castro S Bustamante J Casagrandi RChrysoulakis N Cord A F Dittrich A Domingo-Marimon C El Serafy G and Karnieli A Integration ofsatellite remote sensing data in ecosystem modelling at local

scales Practices and trends Methods Ecol Evol 9 1810ndash1821httpsdoiorg1011112041-210X13018 2018

Pastorello G Trotta C and Canfora E The FLUXNET2015dataset and the ONEFlux processing pipeline for eddy covari-ance data Sci Data 7 225 pp httpsdoiorg101038s41597-020-0534-3 2020

Pearcy R W Heterogeneous light environments in Handbookof Functional Plant Ecology edited by Pugnaire I F and Val-ladares F CRC Press Boca Ratoacuten FL US 270ndash314 2007

Piao S Liu Q Chen A Janssens I A Fu Y Dai J Liu LLian X Shen M and Zhu X Plant phenology and global cli-mate change Current progresses and challenges Glob ChangeBiol 25 1922ndash1940 httpsdoiorg101111gcb14619 2019

R Development Core Team R A language and environment forstatistical computing R Foundation for Statistical Computingavailable at httpswwwr-projectorg (last access 15 October2015) 2009

Reichstein M Falge E Baldocchi D Papale D Aubinet MBerbigier P Bernhofer C Buchmann N Gilmanov Tand Granier A On the separation of net ecosystem ex-change into assimilation and ecosystem respiration reviewand improved algorithm Glob Change Biol 11 1424ndash1439httpsdoiorg101111j1365-24862005001002x 2005

Richardson A D and Hollinger D Y Statistical model-ing of ecosystem respiration using eddy covariance datamaximum likelihood parameter estimation and Monte Carlosimulation of model and parameter uncertainty applied tothree simple models Agr Forest Meteorol 131 191ndash208httpsdoiorg101016jagrformet200505008 2005

Richardson A D Hollinger D Y Shoemaker J K Hughes HSavage K and Davidson E A Six years of ecosystem-atmosphere greenhouse gas fluxes measured in a sub-borealforest Sci Data 6 1ndash15 httpsdoiorg101038s41597-019-0119-1 2019

Roby M C Scott R L Barron-Gafford G A Hamerlynck E Pand Moore D J P Environmental and Vegetative Controls onSoil CO2 Efflux in Three Semiarid Ecosystems Soil Syst 3 6httpsdoiorg103390soilsystems3010006 2019

Running S W and Zhao M Daily GPP and annual NPP(MOD17A2A3) products NASA Earth Observing SystemMODIS land algorithm Univ of Mont Missoula Mont US2015

Running S W Nemani R R Heinsch F A Zhao MReeves M and Hashimoto H A continuous satellite-derived measure of global terrestrial primary productionBioscience 54 547ndash560 httpsdoiorg1016410006-3568(2004)054[0547ACSMOG]20CO2 2004

Rzedowski J Vegetacioacuten de Meacutexico Meacutexico Editor Limusa1978

Scott R L and Biederman J A Critical Zone Water Balance Over13 Years in a Semiarid Savanna Water Resour Res 55 574ndash588 httpsdoiorg1010292018WR023477 2019

Scott R L Watts C Payan J G Edwards E Goodrich D CWilliams D and James Shuttleworth W The understory andoverstory partitioning of energy and water fluxes in an opencanopy semiarid woodland Agr Forest Meteorol 114 127ndash139 httpsdoiorg101016S0168-1923(02)00197-1 2003

Scott R L Edwards E A Shuttleworth W J Huxman T EWatts C and Goodrich D C Interannual and seasonal vari-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

392 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

ation in fluxes of water and carbon dioxide from a ripar-ian woodland ecosystem Agr Forest Meteorol 122 65ndash84httpsdoiorg101016jagrformet200309001 2004

Scott R L Biederman J A Hamerlynck E P and Barron-Gafford G A The carbon balance pivot point of southwest-ern US semiarid ecosystems Insights from the 21st centurydrought J Geophys Res-Biogeo 120 2612ndash2624 2015

Segerstrom K Geology of the Bernal-Jalpan Area Estado deQueretaro Mexico Geol Surv Bull 1104 19ndash82 1961

Soper F M McCalley C K Sparks K and Sparks J P Soilcarbon dioxide emissions from the Mojave desert Isotopic evi-dence for a carbonate source Geophys Res Lett 44 245ndash251httpsdoiorg1010022016GL071198 2017

Stevens N T Steiner S H and MacKay R J Assessing agree-ment between two measurement systems An alternative to thelimits of agreement approach Stat Methods Med Res 262487ndash2504 httpsdoiorg1011770962280215601133 2015

Tagesson T Ardouml J Cappelaere B Kergoat L AbdiA Horion S and Fensholt R Modelling spatial andtemporal dynamics of gross primary production in theSahel from earth-observation-based photosynthetic capacityand quantum efficiency Biogeosciences 14 1333ndash1348httpsdoiorg105194bg-14-1333-2017 2017

Thornton P Thornton M Mayer B Wei Y Devarakonda RVose R and Cook R Daymet Daily Surface Weather Data ona 1-km Grid for North America Version 3 Oak Ridge NationalLab (ORNL) Oak Ridge TN US 2017

Tramontana G Jung M Schwalm C R Ichii K Camps-VallsG Raacuteduly B Reichstein M Arain M A Cescatti A KielyG Merbold L Serrano-Ortiz P Sickert S Wolf S andPapale D Predicting carbon dioxide and energy fluxes acrossglobal FLUXNET sites with regression algorithms Biogeo-sciences 13 4291ndash4313 httpsdoiorg105194bg-13-4291-2016 2016rdquo

Truong A Walters A Goodsitt J Hines K Bruss C B andFarivar R Towards automated machine learning Evaluationand comparison of AutoML approaches and tools in Proceed-ings of the 2019 IEEE 31st International Conference on Toolswith Artificial Intelligence (ICTAI) Portland OR US 1471ndash1479 2019

Verbist K Santibantildeez F Gabriels D and Soto G Atlas dezonas aacuteridas de Ameacuterica Latina y El Caribe Donoso MC(Coord) UNESCO-PHI and CAZALAC Montevideo Uruguay2010

Wang H Li X Ma M and Geng L Improving esti-mation of gross primary production in dryland ecosystemsby a model-data fusion approach Remote Sens 11 225httpsdoiorg103390rs11030225 2019

Wang X Ma M Li X Song Y Tan J Huang G Zhang ZZhao T Feng J and Ma Z Validation of MODIS-GPP prod-uct at 10 flux sites in northern China Int J Remote Sens 34587ndash599 httpsdoiorg101080014311612012715774 2013

Wang X Wang H Li X and Ran Y Photosynthesis (NPPNEP Respiration) Obs Meas Ecohydrol Process 1ndash30httpsdoiorg101007978-3-662-47871-4_10-1 2018

Wheeler C W Archer S R Asner G P and McMurtry C RClimaticedaphic controls on soil carbonnitrogen response toshrub encroachment in desert grassland Ecol Appl 17 1911ndash1928 httpsdoiorg10189006-15801 2007

Wilcox B P Birt A Fuhlendorf S D and Archer S R Emerg-ing frameworks for understanding and mitigating woody plantencroachment in grassy biomes Curr Opin Env Sust 32 46ndash52 httpsdoiorg101016jcosust201804005 2018

Wilson J B Cover plus ways of measuring plant canopiesand the terms used for them J Veg Sci 22 197ndash206httpsdoiorg101111j1654-1103201001238x 2011

Wofsy S Goulden M Munger J Fan S-M Bakwin PDaube B Bassow S and Bazzaz F Net exchange ofCO2 in a mid-latitude forest Science 260 1314ndash1317httpsdoiorg101126science26051121314 1993

Wu C Niu Z and Gao S Gross primary production estimationfrom MODIS data with vegetation index and photosyntheticallyactive radiation in maize J Geophys Res-Atmos 115 D12127httpsdoiorg1010292009JD013023 2010

Wutzler T Lucas-Moffat A Migliavacca M Knauer JSickel K Šigut L Menzer O and Reichstein MBasic and extensible post-processing of eddy covarianceflux data with REddyProc Biogeosciences 15 5015ndash5030httpsdoiorg105194bg-15-5015-2018 2018

Xiao H McDonald-Madden E Sabbadin R Peyrard NDee L E and Chadegraves I The value of understanding feed-backs from ecosystem functions to species for managing ecosys-tems Nat Commun 10 1ndash10 httpsdoiorg101038s41467-019-11890-7 2019

Xu B Yang Y Li P Shen H and Fang J Globalpatterns of ecosystem carbon flux in forests A biometricdata-based synthesis Global Biogeochem Cy 28 962ndash973httpsdoiorg1010022013GB004593 2014

Yao J Liu H Huang J Gao Z Wang G Li D Yu H andChen X Accelerated dryland expansion regulates future vari-ability in dryland gross primary production Nat Commun 111ndash10 httpsdoiorg101038s41467-020-15515-2 2020

Yepez E A Williams D G Scott R L and Lin GPartitioning overstory and understory evapotranspiration ina semiarid savanna woodland from the isotopic composi-tion of water vapor Agr Forest Meteorol 119 53ndash68httpsdoiorg101016S0168-1923(03)00116-3 2003

Yona L Cashore B Jackson R B Ometto J and Brad-ford M A Refining national greenhouse gas inventories Am-bio 1ndash6 httpsdoiorg101007s13280-019-01312-9 2020

Zeng X Zeng X and Barlage M Growing temperate shrubsover arid and semiarid regions in the Community Land ModelndashDynamic Global Vegetation Model Global Biogeochem Cy 22GB3003 httpsdoiorg1010292007GB003014 2008

Zhang A Jia G Epstein H E and Xia J ENSO elicits oppos-ing responses of semi-arid vegetation between Hemispheres SciRep 7 42281 httpsdoiorg101038srep42281 2017

Zhang L Xiao J Zheng Y Li S and Zhou Y Increased carbonuptake and water use efficiency in global semi-arid ecosystemsEnviron Res Lett 15 034022 httpsdoiorg1010881748-9326ab68ec 2020

Zhu X-G Long S P and Ort D R What is the max-imum efficiency with which photosynthesis can convert so-lar energy into biomass Curr Opin Biotech 19 153ndash159httpsdoiorg101016jcopbio200802004 2008

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

  • Abstract
  • Introduction
  • Materials and methods
    • Site description
    • Eddy covariance measurements
    • Flux data processing
    • Remotely sensed data
    • MODIS algorithm for GPP
    • Santa Rita site dataset
    • Modeling
    • Variable importance
    • Model agreement
      • Results
        • Eddy covariance fluxes at Bernal
        • Machine learning ensembles as predictors of eddy covariance GPP
        • MODIS as predictor of eddy covariance GPP
        • Prediction eddy covariance GPP with ordinary-least-squares multiple regression
        • Agreement
          • Discussion
            • Model agreement
            • MODIS discrepancies
            • Forcing variables and machine learning
            • Carbon flux
            • Ecosystem management
              • Conclusions
              • Appendix A
              • Appendix B
              • Appendix C
              • Code and data availability
              • Author contributions
              • Competing interests
              • Acknowledgement
              • Financial support
              • Review statement
              • References
Page 13: Machine learning estimates of eddy covariance carbon flux ... · Scrub is a xeric category of shrublands characterized by plants with small leaves; it is very thorny, and its biomass

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 379

Figure 9 Agreement between predictions of GPP obtained with machine learning algorithms or derived from eddy covariance measurementsat the Bernal site (a) Example of one run of predictions of GPP in the test dataset from the Bernal site using the leader model of an ensembleof machine learning algorithms (EML) the test dataset was 15 of data with r = 094 (b) Predictions for the complete dataset withr = 099 the diagonal line is the 1 1 agreement (c) Function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreementbetween methods of estimation of GPP corresponding to plot (b) The horizontal axis (s) represents the magnitude of the measurement thevertical axis is the probability of agreement for the measurement and the red line is the confidence interval p lt 005

Figure 10 (a) Adjusted predictions of GPP for the complete dataset from the Bernal site using the leader model of the final ensembleof machine learning algorithms (EML) derived from the Santa Rita site compared to estimates of GPP from EC data with r = 078 (b)Respective function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreement between methods of estimation of GPP ECdata from the Bernal site and EML model for the Santa Rita site The horizontal axis (s) represents the magnitude of the measurement thevertical axis is the probability of agreement for the measurement and the red line is the confidence interval p lt 005

the confidence intervals (CIs) for θs the best modeling ap-proach was the EML because its CI (056ndash059) was differentfrom that of the OLS model (047ndash052) More importantlyfor the EML and OLS models the values of θs had little vari-ation across the range of GPP estimates However the CIsof the EML and OLS models were similar in their bias es-timates α and β and both models had no bias (p lt 005)Altogether the best result was the EML ensemble using en-vironmental and remotely sensed data corresponding to thesame site ie Bernal (Fig 10c) This kind of EML wouldbe useful for gap filling or the evaluation of GPP time se-ries of the site that generated the model Machine learningalgorithms can fill gaps longer than 30 d (Kang et al 2019)

The second option to estimate GPP using the samedatasets was the multiple-regression OLS model (Fig 12a)The multiple regression is straightforward and here multi-collinearity was not a problem The EML ensemble and theOLS regression have the highest values of θs (058 and 05respectively Fig 12b) Higher values for θs are desirable(gt 095) and this could be achieved by increasing the samplesize relaxing the tolerable magnitude for agreement (here itwas set at 1 gCmminus2 dminus1) or perhaps using different forcingvariables

The MODIS estimates were a third-best alternative sincethe mean θs was 024 In the present study we used a splineto fill the data to a daily time step since the MCD17 is an8 d composite product but a similar result was obtained if

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

380 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 11 (a) MOD17 GPP daily time step from the Bernal site versus estimates of GPP from EC data using the complete dataset at adaily time step derived from spline for MOD17 GPP (copy) with r = 076 or 8 d composite estimates obtained by 8 d averages of EC GPP() with r = 076 (b) Adjusted predictions of GPP from MODIS (c) Corresponding function of probability of agreement for Fig 7a using1 g Cmminus2 dminus1 as a tolerable agreement between methods of estimation of GPP EC data from the Bernal site and MODIS MOD17 Thehorizontal axis (s) represents the magnitude of the measurement The vertical axis is the probability of agreement for the measurement

Figure 12 (a) Ordinary-least-squares multiple-regression estimates of GPP for the complete dataset from the Bernal site versus estimates ofGPP from EC data with r = 082 (b) Respective function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreement betweenmethods of estimation of GPP EC data from the Bernal site and OLS multiple regression The horizontal axis (s) represents the magnitudeof the measurement The vertical axis is the probability of agreement for the measurement

the EC GPP was rescaled and compared to the original 8 dMCD17 data (Fig 11a) The GPP from MODIS was an un-derestimate of EC GPP and when the estimates were ad-justed (Eq 6 Fig 11b) the performance was not better thanthe OLS (model not shown) The MODIS land cover classi-fication represented this site as grassland (MCD12) and thiscould be another reason for the poor agreement besides theassumptions made in the MODIS algorithm regarding the εand εmax parameters and the response of vegetation to VPDAgreement of MODIS GPP is crucial because MODIS prod-ucts are frequently used in country-wide assessments of thecarbon cycle and can influence public policies

The model with least agreement resulted when the EMLensemble generated from the Santa Rita site was used to pre-dict GPP at Bernal Machine learning models can make pre-dictions but their usefulness decreases when they are usedoutside the context of where they were built while process-

based mechanistic models have this ability Although the Au-toML function in H2O is designed to protect against over-fitting using cross-validation runs (Michailidis 2018) in ourstudy the Santa Rita model could not be generalized to repre-sent the Bernal site GPP process probably because the vari-ables and features selected for Bernal or Santa Rita were dif-ferent during the AutoML workflow

The Santa Rita model was good at predicting GPP at thatsite with a deviance of the leader model of 023 while atBernal the deviance of the leader model was 063 (Table 3)indicating that the Santa Rita EML ensemble was at leastas good a model as the EML at Bernal (not shown) TheGPP time series for Santa Rita was about 4 times the sizeof the Bernal dataset and therefore the deviance was lowerHowever when the Santa Rita model was used with Bernaldata the mean θs was 016 indicating that the agreement wasinsufficient Eyeballing the predictions in Figs 9a 10b and

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 381

11a and their corresponding correlation values (078 076and 082 for adjusted EML Santa Rita adjusted MODIS andmultiple-regression OLS) it could be argued that these mod-els were comparable However their θs plots present a dif-ferent perspective

4 Discussion

41 Model agreement

The Santa Rita model had higher θs when it extrapolatedfor low GPP values for the Bernal site suggesting that theSanta Rita model had a better skill when predicting GPPclose to 0 and even negative GPP values (Fig 10b) Al-though the Bernal and Santa Rita sites had similar vegeta-tion and climate classifications they are more than 1600 kmapart and rainfall monthly distribution is different The GPPseasonal cycle at Bernal started in February and steadily in-creased to a maximum during July and August (Table 1)At Santa Rita the GPP was low from January until mid-July (lt 05 gCO2 mminus2 dminus1) and then increased sharply to amaximum level in mid-August (Joiner and Yoshida 2020)What matters eventually for machine learning methods ishow well the predictor space rather than geographic spaceis sampled (Jung et al 2020) Incorporating data from morehumid (semiarid) sites could improve the GPP predictions bya machine learning method Not only more EC sites but alsosites representing a water availability gradient would be im-portant for semiarid ecosystems representing long spells andthe influence of oceanic oscillations and monsoon rains

All the models presented here used transient data to rep-resent GPP specifically at the 1 d time step Besides theradiation-related variables two sources of rainfall were usedas forcing variables but evapotranspiration (ET) was notused The MOD16A2 version 6 is available as an 8 d gap-filled product and could be included in EML or OLS mod-els However a more general representation of the carboncycling could be achieved when including variables that rep-resent annual or seasonal timescales of soil water evapo-ration or precipitation (Scott and Biederman 2019) Scottet al (2015) suggest that real lags between precipitation andproductivity that may impart legacy effects may also be par-tially masked by using ET as ET more carefully tracks pro-ductivity when soil moisture storage is accessed The occur-rence of off-season rainfall dry spells and carryover effectscould be parameterized as windowed events of a given dura-tion A window would be a period with distinct time bound-aries the window allows the grouping of records with similarfeatures The effect of the window at a given time could berepresented as moving weights as the point in time in ques-tion is closer or farther from the window

In our study the metric to assess model agreement wasthe probability of agreement (θs) and their bias parame-ters Many other metrics can be used to evaluate model per-

formance such as the root mean square error r r2 or themodel efficiency factor (MEF) presented by Nash and Sut-cliffe (1970) In particular the MEF is a step in the Fluxnetdata processing pipeline (Pastorello et al 2020) A MEFvalue close to 1 represents a high correlation and lower bi-ases (Joiner and Yoshida 2020) Therefore the calculatedMEF for the Bernal site EML (098) would suggest very goodpredictive performance while the θs of 06 for this model in-dicates a more modest performance The θs for a particularvalue of the measurand is the probability that the differencebetween two measurements made by different systems fallswithin an interval that is deemed to be acceptable (Stevenset al 2015) In the present study we used 1 gCmminus2 dminus1 as acritical value defining an acceptable difference If this valueis smaller then the probability of agreement would decreasefor this same model In such a case it would be less likelythat the predictors agree considering that the θs takes intoaccount both the difference in the function means at a givenvalue of the measurand and the uncertainty in its estimation

42 MODIS discrepancies

Different authors have reported discrepancies betweenMCD17 and EC estimates of GPP in semiarid regions Ex-amining MODIS discrepancies in these ecosystems is im-portant because the errors induced by cloud cover are ex-pected to be minimal and other effects can be identified(Gebremichael and Barros 2006) The GPP of MOD17 didnot relate well (EC= 011+ 017MODIS r2

= 067) withestimates of EC GPP in semidesert vegetation of the Sahel(Tagesson et al 2017) With data from different types of veg-etation in the Heihe basin in China MODIS17 overestimatedthe GPP from EC (EC= 115+024MODIS r2

= 068 Cuiet al 2016) For scrub sites in Mexico the relation be-tween GPP calculated from EC and MOD17 was not good(MODIS= 38382+ 0467EC r2

= 06 Delgado-Balbuenaet al 2018) In arid and semiarid ecosystems in China op-timizing parameters of the MODIS GPP model with site-specific data improved the estimate to explain 91 of thevariation in the GPP of the data observed by EC (Wang et al2019) These same authors propose improving the land useclassification used by the MOD17 algorithm and recalibrat-ing light use efficiency parameters to solve the GPP estima-tion problem Gebremichael and Barros (2006) examined anopen shrubland site in a semiarid region of Sonora Mexicoand their analysis of the temporal evolution of the discrep-ancies with MODIS GPP suggested revisiting the light useefficiency parameterization especially the functional depen-dence on VPD and PAR and water stress or soil moistureavailability

The relationship between the GPP MODIS and the GPPEC presented in Sect 33 is an approximation because theuncertainty in the respiration component must be consid-ered The empirical relationship between nocturnal NEE andsoil temperature has been used to represent ecosystem res-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

382 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

piration (Reco) in order to separate the processes that con-tribute to daytime NEE (Richardson and Hollinger 2005Wofsy et al 1993) Nighttime NEE should be equal to therates of autotrophic and heterotrophic respiration while dur-ing daytime NEE should be equal to the combined rates ofcarboxylation and oxidation of Rubisco autotrophic respi-ration and heterotrophic respiration Then the GPP can becalculated as the difference between daytime NEE and Recoestimated through its relationship with temperature (Gouldenet al 1996) In the present study Reco was calculated basedon soil and air temperature following the procedure of Reich-stein et al (2005) implemented in REddyProc (Wutzler et al2018) Although it is possible to measure or model the parti-tion of respiration (Running et al 2004 Wang et al 2018)the presence of cacti complicates the calculation assum-ing that all nighttime flux represents ecosystem respiration(Owen et al 2016 Richardson and Hollinger 2005) Whilesoil respiration tends to be temperature-limited when soilmoisture is nonlimiting in temperate ecosystems in range-land ecosystems the controls of soil CO2 efflux were photo-synthesis soil temperature and moisture (Roby et al 2019)In our study the instrumentation did not include measure-ments of plant or soil respiration partition to validate theRecoestimates

A problem regarding data comparison from remote orbitalsensors and terrestrial observations is that different quanti-ties are fundamentally measured MODIS measures the radi-ation reflected by the earthrsquos surface in two spectral bands at250 m spatial resolution per pixel five bands at 500 m and29 bands at 1000 m The EC technique has a footprint tomeasure CO2 that varies dynamically in shape and size butis generally considered to be 1 km2 in this study the foot-print radius was 600 m To solve the scaling MODIS prod-ucts related to the carbon cycle have been validated with theEC technique and biometric measurements on several spa-tial scales using process-based ecosystem models and char-acterizing areas up to 47 km2 around the EC measuring tower(Cohen et al 2003)

At Bernal the vegetation was heterogeneous This situa-tion was represented by the heterogeneity in vegetation ac-tivity in the 4 pixels used spanning about 02 units of NDVIat the peak of the season activity (Fig 1) The standard errorof the IVI differed by 1 order of magnitude among speciesAlthough more important species had a lower standard errortheir means were similar indicating that they were equallyabundant at all sampling plots The higher importance ofsome species (Table 2) was explained by selective grazingndashbrowsing behavior and the dispersion caused by cattle by ei-ther ingesting or transporting seeds or plant parts (Belaynehand Tessema 2017) Regarding landscape heterogeneity thetower fetch was predominantly from the northeast capturingthe less heterogeneous area of the site but also the more ac-tive according to the NDVI (Fig 1)

The thorny scrub examined had two vegetation layersthe overstory layer mainly consisting of mesquite acacia

and cacti and the understory layer that included grasses andherbs Cattle preferentially graze the understory and becausethey eat using their tongue they will avoid browsing thornyspecies unlike goats or deer which use their lips Withoutgrazing management overtime the competition balance willfavor bush species resulting in encroaching and the under-story will be stressed only unpalatable species or those withtheir growing meristems very close to the ground would sur-vive Representing the structure and functioning of these twolayers using MODIS is possible (Liu et al 2017) RecentlyHill and Guerschman (2020) presented a MODIS product de-rived from MCD43A4 to estimate the fractions of photosyn-thetic and nonphotosynthetic vegetation and the remainingfraction of bare soil These developments could improve theMOD17 GPP estimates since its model represents a homoge-neous single vegetation layer All these considerations helpto understand the low θs between MOD17 estimates of GPPand EC-derived GPP

43 Forcing variables and machine learning

Forcing variables in this study were gridded meteorologi-cal data and MODIS spectral bands and products but thevariables mostly included in the ML algorithms were LAIand FPAR accounting for 364 and 549 of the vari-able importance index at Bernal and Santa Rita respectivelyin their leader board This result supported the view thatCO2 fluxes can be represented by ML algorithms exclu-sively using remotely sensed data (Tramontana et al 2016Joiner and Yoshida 2020) Using neural networks Joiner andYoshida (2020) showed that with only satellite reflectancefrom MODIS and top-of-atmosphere PAR it is possible tocapture a large fraction of GPP variability They argue thatvegetation indices may reduce the information content ofthe underlying reflectance when compressing the informa-tion from two or more bands into a single index In our studyreflectance bands were not often included in the ML modelsThis was most likely because MOD09GA and MODOCGAprovide estimates of surface reflectance uncorrected for theillumination and viewing geometry while the MCD43 usedin Joiner and Yoshida (2020) uses a surface bidirectional re-flectance distribution function (BRDF) model that improvesthe quality of surface albedo retrievals By propagating theBRDF correction in the MODIS processing pipeline the veg-etation indices would more likely relate better to the GPPFuture modeling efforts should use benchmark scenarios ac-cording to sets of forcing variables already identified as use-ful

44 Carbon flux

Although the carbon balance in ecosystems is influenced bydifferent factors such as soil type and amount of nutrientsthe relationship with soil temperature and humidity is partic-ularly strong (Anderson-Teixeira et al 2011 Hastings et al

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 383

2005) How much of the rainwater the system can retain orlose has been described as the leakiness of the system (Guer-schman et al 2009) More than immediate incident rainfallthe available soil moisture and its redistribution are impor-tant in semiarid ecosystems including steam flow prefer-ential flow paths hydraulic lift and others (Barron-Gaffordet al 2017) At Bernal when the sum of the precipitation ofthe current month and that of the previous month was con-sidered the correlation with NEE was minus07 suggesting thatcontinuous availability of soil moisture is important for theabsorption of CO2 in this environment This result is con-sistent with other studies in which the relationship betweenthe net productivity of the ecosystem (NEP) and precipita-tion is initially positive but is leveled from 1000 to 1500 mmannually (Xu et al 2014) The hydraulic redistribution ofwater from moist (deeper) to drier soils through plant rootstended to increase modeled annual ecosystem uptake of CO2this process was identified at US-SMR (Fu et al 2018 Scottet al 2003)

The leakiness is highly dependent upon vegetation frac-tional cover the proportion of the surface occupied by baresoil and vegetation photosynthetically active vegetation andnonphotosynthetically active vegetation such as litter woodand dead biomass (Guerschman et al 2009) All these vege-tation fractions have a water storage capacity and can reducethe amount of effective rainfall available to plant roots It ispossible that the canopy of the bushes completely interceptedthe rainfall in some months because the scrub can interceptup to 20 of the precipitation and its canopy storage ca-pacity is 097 mm (Mastachi-Loza et al 2010) Consideringonly the daily rain events greater than 5 mm the correlationbetween precipitation and NEE rose to minus072 In the presentstudy the interception of rain by vegetation surfaces was notcalculated but the results suggest that it would be impor-tant to explore the relationship between net precipitation andNEE

The average NEE at a global level is minus156plusmn284 gCmminus2 yrminus1 (Baldocchi 2014) The highest frequencyamong sites that measured NEE with EC occurs from minus200to minus300 gCmminus2 yrminus1 but in sites with biometric measure-ments the peak occurs at minus100 gCmminus2 yrminus1 (Xu et al2014) Using the daily averages of Table 2 the averageNEE during the measurement period was minus078 gCmminus2 dminus1

and annually would be minus2835 gCmminus2 yrminus1 This resultwas higher than the annual values of the induced grass-land and scrubland vegetation characterizing the SonoranDesert plains (138 and 130 gCmminus2 yrminus1 Hinojo-Hinojoet al 2019) In New Mexico NEE values measured withEC are between 35ndash50 gCmminus2 yrminus1 in desert grassland and344ndash355 gCmminus2 yrminus1 in mixed coniferous forest (Anderson-Teixeira et al 2011) In a dryer region the sarcocaulescentscrubland of Baja California in Mexico the NEE was minus39and minus52 gCmminus2 yrminus1 in 2002 and 2003 respectively (Hast-ings et al 2005) The NEE measured here was within therange of NEE 03plusmn 02 kgCmminus2 yrminus1 for grasslands and

shrublands in Mexico (Murray-Tortarolo et al 2016) Al-though the measurements of the present study had gaps andwere compared with annual studies we considered that thereported value of C was representative of the main season ofgrowth of this type of scrub

45 Ecosystem management

Overgrazing is an appreciation relative to the grazing produc-tive system where the forage resource is overused in a mixedshrubndashgrass ecosystem such as Bernal it usually refers tothe understory Overgrazing means that the plant regrowth isreadily grazed tillers and root reserves are lost and eventu-ally the plant may die Although the Bernal site was over-grazed the carbon fluxes indicated that the plant communitywas photosynthetically active in both dry and rainy seasonsIt is fair to assume that water in the soil was not limiting forthe deep-rooted bush species and that was the reason whyit was possible to maintain the photosynthetic function dur-ing rainless months However this primary production wouldnot have tangible benefits for ranchersrsquo production systemsince no edible biomass would be produced for the cattleFrom the point of view of carbon capture the system accu-mulated nonlabile biomass that would remain in the systemfor a longer time compared to a grassland ecosystem (al-though it would be necessary to determine the partition ofsaid shrub biomass) However the overgrazing condition af-fects the biomass of the understory roots and consequentlythe carbon pool in the soil

In the short term it can be thought that the estimated neg-ative carbon flows had a favorable effect on the environmen-tal agenda As time passes it is possible that the gaps be-tween the individual shrubs of the overstory expand and thiswould have had an effect on soil erosion It is also possiblethat the water stored in the soil profile used by the bushesgradually decreases to the point of causing drought changesin phenology and advancements of the desertification pro-cess There are many opportunities for ecology conservationand livestock-oriented management these may include con-trolled grazing or propagating native thornless shrub speciesIf ranchers do not identify a benefit in the vegetation thenthey will be tempted to remove it as it occurred at thestudy site Because of its wide coverage and availability theMODIS GPP product accuracy is important in representingthe carbon cycle raising awareness and monitoring advance-ment of environmental decisions

Although we found that EML was a good option for mod-eling the GPP of a site what is really needed to evaluate theperformance of semiarid ecosystems is a spatial representa-tion of the carbon flux This is a problem for an underrepre-sented area regarding instrumented EC towers However theEML could be designed to take into account the explanatoryvariables in a spatiotemporal continuity

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

384 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

5 Conclusions

The best modeling approach was the ensemble of ma-chine learning The second option to estimate GPP was themultiple-regression OLS and the third alternative were theMODIS estimates Machine learning was a good option topredict GPP in the local context where it was generated oth-erwise its performance was not good This was demonstratedwhen using the EML from Santa Rita and trying to predictfluxes at Bernal Nevertheless a machine learning modelwould be useful for gap filling or the evaluation of GPP atthe same site The GPP estimates of a given model can be ad-justed using the bias parameters of the probability of agree-ment to improve the relationship Further research explor-ing the usefulness of EML to predict GPP in other contextscould use EML to calibrate mechanistic model parametershybrid approaches integrating EML and biologically mean-ingful mechanistic models or EML representing the spatialvariability of the landscape

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 385

Appendix A

Table A1 Threshold of velocity friction (ulowast) above which nighttime fluxes were considered valid Estimates were obtained using the Rpackage REddyProc We used only the flux records with ulowast equal to or higher than the corresponding mean ulowast threshold for each yearBound values are a 95 confidence interval NA not applicable

Aggregation method Year Season Mean Lower bound 5 Upper bound 95

ulowast (msminus1)

Single NA NA 0193 0140 0209Year 2017 NA 0194 0128 0227Year 2018 NA 0193 0132 0225Season 2017 2016012 0194 0128 0227Season 2017 2017003 0215 0141 0297Season 2017 2017006 0194 0128 0227Season 2017 2017009 0181 0144 0208Season 2018 2017012 0149 0116 0232Season 2018 2018003 0271 0134 0297Season 2018 2018006 0193 0132 0221

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

386 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix B

Table B1 MODIS reflectance bands and products database

Product name Satellite layer Spatial Temporal Spectralresolution resolution coverage

(d) (nm)

MOD09GA Band 1 500 m 1 620ndash670Band 2 841ndash876Band 3 459ndash479Band 4 545ndash565Band 5 1230ndash1250Band 6 1628ndash1652Band 7 2105ndash2155

MODOCGA Band 8 1 km 405ndash420Band 9 438ndash448Band 10 483ndash493Band 11 526ndash536Band 12 546ndash556Band 13 662ndash672Band 14 673ndash683

MOD17A2H Gross primary production (GPP) 8

Net photosynthesis (PsnNet)

MOD15A2H Fraction of photosynthetically active radiation (Fpar) na

Leaf area index (LAI)

MOD11A2 Land surface temperature and emissivity of day (LST day) 1 km

Land surface temperature and emissivity of night (LST night)

MYD13A1 Enhanced vegetation index (EVI) 500 m 16

MYD15A2H Fraction of photosynthetically active radiation (Fpar) 8

Leaf area index (LAI)

MYD11A2 Land surface temperature and emissivity of day (LST day) 1 km na

Land surface temperature and emissivity of night (LST night)

MCD15A2H Fraction of photosynthetically active radiation (Fpar) 500 m

Leaf area index (LAI)

na not applicable

Table B2 Daily vegetation indexes computed using the MODIS reflectance bands described in Table 1

Index Formula Reference

Simple ratio (SimpleR) SimpleR= Band 2Band 1

(Hill et al 2006)Moisture stress (MoistS) MoistS= Band 6

Band 2

Disease stress index (DSI) DSI= Band 2+Band 4Band 6+Band 1

Redndashgreen ratio index (RGRI) RGRI= Band 1Band 4

Normalized difference vegetation index (NDVI) NDVI= Band 2minusBand 1Band 2+Band 1

Normalized difference water index (NDVI_w) NDVI_w =Band 2minusBand 5Band 2+Band 5

Green leaf index (GLI) GLI= 2lowastBand 11minusBand 14minusBand 92lowastBand 11+Band 14+Band

(Henrich et al 2012)Green atmospherically resistant vegetation (GARI) GARI= Band 5minus(Band 11minus(Band 9minusBand 14))

Band 5minus(Band 11+(Band 9minusBand 14))

Enhanced vegetation index (EVI) EVI= 25lowast Band 2minusBand 1Band 2+(6lowastBand 1)minus(75lowastBand 9)+1

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 387

Table B3 Daymet meteorological database

Variable Spatial resolution Temporal resolution Reference

Precipitation (Dayprc)

1 km Daily (Thornton et al 2017)Shortwave radiation (Daysrad)Maximum air temperature (DayTmax)Minimum air temperature (DayTmin)Water vapor pressure (Dayvp)

Table B4 Precipitation data

Satellite Product name Spatial resolution Temporal resolution

Global Precipitation Measurement (GPM) 3IMERGDF v006 010Daily

Tropical Rainfall Measuring Mission (TRMM) 3B42 v007 025

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

388 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix C

Table C1 Analysis of variance for GGP derived from EC data atthe Bernal site Variable details are described in Appendix B

Model term Coefficient SEa DFb SSc F p VIFd

Regression 14 178048Constant minus1318 648 minus203 004R_08_40542e

minus212 060 1 2680 minus352 0 220R_13_66267e

minus286 123 1 1161 minus232 002 155R_14_67368e 917 259 1 2706 353 0 122RGRI6

minus083 037 1 1100 minus225 003 143GARIf 201 108 1 755 187 006 237EVIMYDg 853 201 1 3903 425 0 294NDVI_wf 390 155 1 1367 251 001 271PsnNetg 0013 0 1 7530 590 0 383LstNgtMYDg 005 002 1 818 194 005 201Dayprch 014 003 1 3590 407 0 192Daysradh 001 0 1 5121 486 0 501DayTmaxh

minus024 006 1 3658 minus411 0 689DayTminh 018 005 1 3222 386 0 496TRMMi minus004 002 1 1087 minus224 003 141Error 403 87297Total 417 265345

a Standard error of coefficient b Degrees of freedom c Adjusted sum of squares d Variance inflation factore Variable name denotes the band number and spectral bandwidth of MODIS (Moderate Resolution ImagingSpectroradiometer) f Vegetation indices RGRI is redndashgreen ratio index and GARI is green atmosphericallyresistant vegetation index details of formula are described in Appendix B g Layers from MODIS productsEVIMYD is enhanced vegetation index from MYD13 PsnNet is photosynthesis product form MOD17LstNgtMYD is nighttime land surface temperature emissivity from MYD11 h Variables obtained fromDaymet daily dataset DayTmin is minimum temperature DayTmax is maximum temperature Daysrad isshortwave radiation Dayprc is precipitation i Daily rainfall rate from 3B42 TRMM (Tropical RainfallMeasuring Mission)

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 389

Code and data availability Database and programming code areavailable at httpsdoiorg105281zenodo3598595 (Guevara-Escobar et al 2020)

Author contributions AGE contributed to conceptualization for-mal analysis investigation methodology software supervisionvalidation visualization and writing of the original draft as wellas review and editing EGS was in charge of funding acquisitionproject administration and writing and review MCJ contributed toinvestigation methodology validation visualization and writing ofthe original draft as well as review editing and communicationMEQB helped with investigation validation writing and reviewHSA contributed with conceptualization writing and review ICAacutehelped with investigation writing and review VHCS contributedwith writing review and funding acquisition

Competing interests The authors declare that they have no conflictof interest

Acknowledgement We thank NASA and Ameriflux for data avail-ability Data filtering was carried out by the cluster Abacus ICentre of Applied Mathematics and High Performance ComputingABACUS-CINVESTAV

Financial support This research has been supported by CONA-CyT ndash SEMARNAT (grant no 2014-1-(249407))

Review statement This paper was edited by Paul Stoy and re-viewed by three anonymous referees

References

Aguirre-Diacuteaz G J Aguilloacuten-Robles A Tristaacuten-Gonzaacutelez MLabarthe-Hernaacutendez G Loacutepez-Martiacutenez M Bellon Hand Nieto-Obregoacuten J Geologic setting of the Pentildeade Bernal Natural Monument Quereacutetaro Meacutexico Anendogenous volcanic dome Geosphere 9 557ndash571httpsdoiorg101130GES008431 2013

Anav A Friedlingstein P Beer C Ciais P Harper AJones C Murray-Tortarolo G Papale D Parazoo N Cand Peylin P Spatiotemporal patterns of terrestrial grossprimary production A review Rev Geophys 53 785ndash818httpsdoiorg1010022015RG000483 2015

Anderson-Teixeira K J Delong J P Fox A M Brese D Aand Litvak M E Differential responses of production and res-piration to temperature and moisture drive the carbon balanceacross a climatic gradient in New Mexico Glob Change Biol17 410ndash424 httpsdoiorg101111j1365-2486201002269x2011

Baldocchi D Measuring fluxes of trace gases and energy be-tween ecosystems and the atmospherendashthe state and future of the

eddy covariance method Glob Change Biol 20 3600ndash3609httpsdoiorg101111gcb12649 2014

Barron-Gafford G A Sanchez-Cantildeete E P Minor R LHendryx S M Lee E Sutter L F Tran N Parra EColella T Murphy P C Hamerlynck E P Kumar P andScott R L Impacts of hydraulic redistribution on grassndashtreecompetition vs facilitation in a semi-arid savanna New Phytol215 1451ndash1461 httpsdoiorg101111nph14693 2017

Belayneh A and Tessema Z K Mechanisms of bush encroach-ment and its inter-connection with rangeland degradation insemi-arid African ecosystems a review J Arid Land 9 299ndash312 httpsdoiorg101007s40333-016-0023-x 2017

Bland J M and Altman D G Statistical methods for assessingagreement between two methods of clinical measurement Int JNurs Stud 47 931ndash936 2010

Boettiger C and Eddelbuettel D An introduction to rockerDocker containers 639 for r ArXiv Prepr ArXiv171003675 6402017

Bonilla-Moheno M and Aide T M Beyond deforestationLand cover transitions in Mexico Agric Syst 178 102734httpsdoiorg101016jagsy2019102734 2020

Booker K Huntsinger L Bartolome J W Sayre N Fand Stewart W What can ecological science tell us aboutopportunities for carbon sequestration on arid rangelands inthe United States Glob Environ Change 23 240ndash251httpsdoiorg101016jgloenvcha201210001 2013

Butzer K W and Butzer E K The lsquonaturalrsquovegetation ofthe Mexican Bajiacuteo archival documentation of a 16th-century savanna environment Quat Int 43 161ndash172httpsdoiorg101016S1040-6182(97)00032-3 1997

CICESE C de I C y de E S de E Base de datos climatoloacutegicanacional (CLICOM) Villa Bernal Quereacutetaro available at httpclicom-mexcicesemx (last access 28 August 2019) 2015

Cohen W B Maiersperger T K Yang Z Gower S TTurner D P Ritts W D Berterretche M and Run-ning S W Comparisons of land cover and LAI esti-mates derived from ETM+ and MODIS for four sites inNorth America A quality assessment of 20002001 provi-sional MODIS products Remote Sens Environ 88 233ndash255httpsdoiorg101016jrse200306006 2003

Cui T Wang Y Sun R Qiao C Fan W Jiang GHao L and Zhang L Estimating vegetation primaryproduction in the Heihe River Basin of China withmulti-source and multi-scale data PloS One 11 1ndash20httpsdoiorg101371journalpone0153971 2016

Curtis J T and McIntosh R P The Interrelations of Certain An-alytic and Synthetic Phytosociological Characters Ecology 31434ndash455 httpsdoiorg1023071931497 1950

Delgado-Balbuena J Yepez E A Aacutengeles-Peacuterez G Aguirre-Gutierrez C Arredondo T Ayala-Nintildeo F Bullock S HCastellanos A E Cueva A Figueroa-Espinoza B Garatuza-Payaacuten J Hinojo-Hinojo C Maya-Delgado Y Meacutendez-Barroso L Oechel W Paz-Pellat F Peacuterez-Ruiacutez E R Ro-driacuteguez J C Rojas-Robles N Saacutenchez-Mejiacutea Z M Uuh-Sonda J Vargas R Verduzco V S Vivoni E R andWatts C Flujos anuales de carbono en ecosistemas terrestres deMeacutexico in Estado Actual del Conocimiento del Ciclo del Car-bono y sus Interacciones en Meacutexico Siacutentesis a 2018 p 686Texcoco Estado de Meacutexico Meacutexico 2018

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

390 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Eshel G Dayalu A Wofsy S C Munger J W andTziperman E Listening to the Forest An ArtificialNeural Network-Based Model of Carbon Uptake at Har-vard Forest J Geophys Res-Biogeo 124 461ndash478httpsdoiorg1010292018JG004791 2019

Fu C Wang G Bible K Goulden M L Saleska S RScott R L and Cardon Z G Hydraulic redistributionaffects modeled carbon cycling via soil microbial activityand suppressed fire Glob Change Biol 24 3472ndash3485httpsdoiorg101111gcb14164 2018

Gebremichael M and Barros A P Evaluation of MODISGross Primary Productivity (GPP) in tropical mon-soon regions Remote Sens Environ 100 150ndash166httpsdoiorg101016jrse200510009 2006

Gitelson A A Peng Y Arkebauer T J and Suyker A EProductivity absorbed photosynthetically active radiation andlight use efficiency in crops Implications for remote sensingof crop primary production J Plant Physiol 177 100ndash109httpsdoiorg101016jjplph201412015 2015

Glenn E P Scott R L Nguyen U and Nagler P L Wide-area ratios of evapotranspiration to precipitation in monsoon-dependent semiarid vegetation communities J Arid Envi-ron 117 84ndash95 httpsdoiorg101016jjaridenv2015020102015

Goldstein A Turner W R Spawn S A Anderson-Teixeira K J Cook-Patton S Fargione J Gibbs H KGriscom B Hewson J H and Howard J F Protecting ir-recoverable carbon in Earthrsquos ecosystems Nat Clim Change10 1ndash9 httpsdoiorg101038s41558-020-0738-8 2020

Goulden M L Munger J W Fan S Daube B Cand Wofsy S C Measurements of carbon sequestra-tion by long-term eddy covariance Methods and a criti-cal evaluation of accuracy Glob Change Biol 2 169ndash182httpsdoiorg101111j1365-24861996tb00070x 1996

Grossiord C Buckley T N Cernusak L A Novick K A Poul-ter B Siegwolf R T Sperry J S and McDowell N GPlant responses to rising vapor pressure deficit New Phytol226 1550ndash1566 2020

Guerschman J P Hill M J Renzullo L J Barrett D JMarks A S and Botha E J Estimating fractional coverof photosynthetic vegetation non-photosynthetic vegetation andbare soil in the Australian tropical savanna region upscaling theEO-1 Hyperion and MODIS sensors Remote Sens Environ113 928ndash945 httpsdoiorg101016jrse200901006 2009

Guevara-Escobar A Gonzaacutelez-Sosa E Cervantes-Jimeacutenez MSuzaacuten-Azpiri H Queijeiro-Bolantildeos M E Carrillo-AacutengelesI and Cambroacuten-Sandoval V H Machine learning estimates ofeddy covariance carbon flux in a scrub in the Mexican highlandZenodo httpsdoiorg105281zenodo3598595 2020

Hall P Gill N Kurka M Phan W and Bartz A MachineLearning Interpretability with H2O Driverless AI edited byBartz A H2Oai Inc California US 2019

Hastings S J Oechel W C and Muhlia-Melo A Diur-nal seasonal and annual variation in the net ecosystem CO2exchange of a desert shrub community (Sarcocaulescent) inBaja California Mexico Glob Change Biol 11 927ndash939httpsdoiorg101111j1365-2486200500951x 2005

Henrich V Krauss G Goumltze C and Sandow C Entwicklungeiner Datenbank fuumlr Fernerkundungsindizes vol 45 AK Fern-erkundung Bochum 2012

Hill M Held A Leuning R Coops N Hughes D andCleugh H MODIS spectral signals at a flux tower site Rela-tionships with high-resolution data and CO2 flux and light useefficiency measurements Remote Sens Environ 103 351ndash368httpsdoiorg101016jrse200506015 2006

Hill M J and Guerschman J P The MODIS Global VegetationFractional Cover Product 2001ndash2018 Characteristics of Vegeta-tion Fractional Cover in Grasslands and Savanna Woodlands Re-mote Sens 12 406 httpsdoiorg103390rs12030406 2020

Hinojo-Hinojo C Castellanos A E Huxman T Ro-driguez J C Vargas R Romo-Leoacuten J R and Bie-derman J A Native shrubland and managed buffelgrasssavanna in drylands Implications for ecosystem carbonand water fluxes Agr Forest Meteorol 268 269ndash278httpsdoiorg101016jagrformet201901030 2019

Jaumlrvi L Havu M Ward H C Bellucco V McFad-den J P Toivonen T Heikinheimo V Kolari P Ri-ikonen A and Grimmond C S B Spatial modeling oflocal-scale biogenic and anthropogenic carbon dioxide emis-sions in Helsinki J Geophys Res-Atmos 124 8363ndash8384httpsdoiorg1010292018JD029576 2019

Joiner J and Yoshida Y Satellite-based reflectances capture largefraction of variability in global gross primary production (GPP)at weekly time scales Agr Forest Meteorol 291 108092httpsdoiorg101016jagrformet2020108092 2020

Jung M Schwalm C Migliavacca M Walther S Camps-VallsG Koirala S Anthoni P Besnard S Bodesheim P Carval-hais N Chevallier F Gans F Goll D S Haverd V KoumlhlerP Ichii K Jain A K Liu J Lombardozzi D Nabel J EM S Nelson J A OrsquoSullivan M Pallandt M Papale DPeters W Pongratz J Roumldenbeck C Sitch S TramontanaG Walker A Weber U and Reichstein M Scaling carbonfluxes from eddy covariance sites to globe synthesis and eval-uation of the FLUXCOM approach Biogeosciences 17 1343ndash1365 httpsdoiorg105194bg-17-1343-2020 2020

Kang M Ichii K Kim J Indrawati Y M Park J Moon MLim J-H and Chun J-H New Gap-Filling Strategies forLong-Period Flux Data Gaps Using a Data-Driven ApproachAtmosphere 10 568 httpsdoiorg103390atmos101005682019

Lal R Griffin M Apt J Lave L and Morgan M GManaging Soil Carbon Science 304 393ndash393httpsdoiorg101126science1093079 2004

LI-COR B Eddypro Software instruction manual Nebraska USversion 70 2019

Liu Y Liu R Pisek J and Chen J M Separating overstoryand understory leaf area indices for global needleleaf and de-ciduous broadleaf forests by fusion of MODIS and MISR dataBiogeosciences 14 1093ndash1110 httpsdoiorg105194bg-14-1093-2017 2017

Long S P Marshall-Colon A and Zhu X-G Meetingthe Global Food Demand of the Future by EngineeringCrop Photosynthesis and Yield Potential Cell 161 56ndash66httpsdoiorg101016jcell201503019 2015

Ma X Huete A Yu Q Restrepo-Coupe N Beringer J Hut-ley L B Kanniah K D Cleverly J and Eamus D Param-

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 391

eterization of an ecosystem light-use-efficiency model for pre-dicting savanna GPP using MODIS EVI Remote Sens Environ154 253ndash271 2014

Ma X Migliavacca M Wirth C Bohn F J Huth ARichter R and Mahecha M D Monitoring Plant FunctionalDiversity Using the Reflectance and Echo from Space RemoteSens 12 1248 httpsdoiorg103390rs12081248 2020

Marcolla B Roumldenbeck C and Cescatti A Patterns and con-trols of inter-annual variability in the terrestrial carbon budgetBiogeosciences 14 3815ndash3829 httpsdoiorg105194bg-14-3815-2017 2017

Mastachi-Loza C A Gonzaacutelez-Sosa E Becerril-Pintildea R andBraud I Peacuterdidas por intercepcioacuten en mezquite (Prosopis laevi-gata) y huizache (Acacia farnesiana) de la regioacuten semiaacuterida delcentro de Meacutexico Tecnol Cienc Agua 1 103ndash120 2010

Mauder M and Foken T Documentation and instruction man-ual of the eddy-covariance software package TK3 UniversitaumltBayreuth Abteilung Mikrometeorologie Bayreuth Germany2011

McClaran M A century of vegetation change on the Santa Rita Ex-perimental Range in Santa Rita Experimental Range 100 years(1903 to 2003) to accomplishments and contributions edited byMcClaran M P Ffolliott P F and Edminister C B Tech Co-ords USDA Forest service proceedings RMRS-P-30 TucsonAZ US 16ndash33 2003

McCulley R Jobbagy E Pockman W and Jackson R Nu-trient uptake as a contributing explanation for deep rootingin arid and semi-arid ecosystems Oecologia 141 620ndash628httpsdoiorg101007s00442-004-1687-z 2004

McElreath R Statistical Rethinking A Bayesian Course with Ex-amples in R and STAN CRC Press Boca Raton FL US 2020

Michailidis M How Driverless AI Prevents Overfitting and Leak-age Open Source Lead AI ML available at httpswwwh2oaiblogdriverless-ai-prevents-overfitting-leakage (last access7 July 2020) 2018

Mitchell R and Frank E Accelerating the XGBoost algo-rithm using GPU computing PeerJ Comput Sci 3 e127httpsdoiorg107717peerj-cs127 2017

Monteith J Solar radiation and productivity in tropical ecosys-tems J Appl Ecol 9 747ndash766 1972

Murray-Tortarolo G Friedlingstein P Sitch S Jaramillo V JMurguiacutea-Flores F Anav A Liu Y Arneth A Arvanitis AHarper A Jain A Kato E Koven C Poulter B Stocker BD Wiltshire A Zaehle S and Zeng N The carbon cycle inMexico past present and future of C stocks and fluxes Biogeo-sciences 13 223ndash238 httpsdoiorg105194bg-13-223-20162016

Nash J E and Sutcliffe J V River flow forecasting through con-ceptual models part I ndash A discussion of principles J Hydrol 10282ndash290 1970

Owen N A Choncubhair Oacute N Males J del Real Laborde J IRubio-Corteacutes R Griffiths H and Lanigan G Eddy covari-ance captures four-phase crassulacean acid metabolism (CAM)gas exchange signature in Agave Plant Cell Environ 39 295ndash309 httpsdoiorg101111pce12610 2016

Pasetto D Arenas-Castro S Bustamante J Casagrandi RChrysoulakis N Cord A F Dittrich A Domingo-Marimon C El Serafy G and Karnieli A Integration ofsatellite remote sensing data in ecosystem modelling at local

scales Practices and trends Methods Ecol Evol 9 1810ndash1821httpsdoiorg1011112041-210X13018 2018

Pastorello G Trotta C and Canfora E The FLUXNET2015dataset and the ONEFlux processing pipeline for eddy covari-ance data Sci Data 7 225 pp httpsdoiorg101038s41597-020-0534-3 2020

Pearcy R W Heterogeneous light environments in Handbookof Functional Plant Ecology edited by Pugnaire I F and Val-ladares F CRC Press Boca Ratoacuten FL US 270ndash314 2007

Piao S Liu Q Chen A Janssens I A Fu Y Dai J Liu LLian X Shen M and Zhu X Plant phenology and global cli-mate change Current progresses and challenges Glob ChangeBiol 25 1922ndash1940 httpsdoiorg101111gcb14619 2019

R Development Core Team R A language and environment forstatistical computing R Foundation for Statistical Computingavailable at httpswwwr-projectorg (last access 15 October2015) 2009

Reichstein M Falge E Baldocchi D Papale D Aubinet MBerbigier P Bernhofer C Buchmann N Gilmanov Tand Granier A On the separation of net ecosystem ex-change into assimilation and ecosystem respiration reviewand improved algorithm Glob Change Biol 11 1424ndash1439httpsdoiorg101111j1365-24862005001002x 2005

Richardson A D and Hollinger D Y Statistical model-ing of ecosystem respiration using eddy covariance datamaximum likelihood parameter estimation and Monte Carlosimulation of model and parameter uncertainty applied tothree simple models Agr Forest Meteorol 131 191ndash208httpsdoiorg101016jagrformet200505008 2005

Richardson A D Hollinger D Y Shoemaker J K Hughes HSavage K and Davidson E A Six years of ecosystem-atmosphere greenhouse gas fluxes measured in a sub-borealforest Sci Data 6 1ndash15 httpsdoiorg101038s41597-019-0119-1 2019

Roby M C Scott R L Barron-Gafford G A Hamerlynck E Pand Moore D J P Environmental and Vegetative Controls onSoil CO2 Efflux in Three Semiarid Ecosystems Soil Syst 3 6httpsdoiorg103390soilsystems3010006 2019

Running S W and Zhao M Daily GPP and annual NPP(MOD17A2A3) products NASA Earth Observing SystemMODIS land algorithm Univ of Mont Missoula Mont US2015

Running S W Nemani R R Heinsch F A Zhao MReeves M and Hashimoto H A continuous satellite-derived measure of global terrestrial primary productionBioscience 54 547ndash560 httpsdoiorg1016410006-3568(2004)054[0547ACSMOG]20CO2 2004

Rzedowski J Vegetacioacuten de Meacutexico Meacutexico Editor Limusa1978

Scott R L and Biederman J A Critical Zone Water Balance Over13 Years in a Semiarid Savanna Water Resour Res 55 574ndash588 httpsdoiorg1010292018WR023477 2019

Scott R L Watts C Payan J G Edwards E Goodrich D CWilliams D and James Shuttleworth W The understory andoverstory partitioning of energy and water fluxes in an opencanopy semiarid woodland Agr Forest Meteorol 114 127ndash139 httpsdoiorg101016S0168-1923(02)00197-1 2003

Scott R L Edwards E A Shuttleworth W J Huxman T EWatts C and Goodrich D C Interannual and seasonal vari-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

392 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

ation in fluxes of water and carbon dioxide from a ripar-ian woodland ecosystem Agr Forest Meteorol 122 65ndash84httpsdoiorg101016jagrformet200309001 2004

Scott R L Biederman J A Hamerlynck E P and Barron-Gafford G A The carbon balance pivot point of southwest-ern US semiarid ecosystems Insights from the 21st centurydrought J Geophys Res-Biogeo 120 2612ndash2624 2015

Segerstrom K Geology of the Bernal-Jalpan Area Estado deQueretaro Mexico Geol Surv Bull 1104 19ndash82 1961

Soper F M McCalley C K Sparks K and Sparks J P Soilcarbon dioxide emissions from the Mojave desert Isotopic evi-dence for a carbonate source Geophys Res Lett 44 245ndash251httpsdoiorg1010022016GL071198 2017

Stevens N T Steiner S H and MacKay R J Assessing agree-ment between two measurement systems An alternative to thelimits of agreement approach Stat Methods Med Res 262487ndash2504 httpsdoiorg1011770962280215601133 2015

Tagesson T Ardouml J Cappelaere B Kergoat L AbdiA Horion S and Fensholt R Modelling spatial andtemporal dynamics of gross primary production in theSahel from earth-observation-based photosynthetic capacityand quantum efficiency Biogeosciences 14 1333ndash1348httpsdoiorg105194bg-14-1333-2017 2017

Thornton P Thornton M Mayer B Wei Y Devarakonda RVose R and Cook R Daymet Daily Surface Weather Data ona 1-km Grid for North America Version 3 Oak Ridge NationalLab (ORNL) Oak Ridge TN US 2017

Tramontana G Jung M Schwalm C R Ichii K Camps-VallsG Raacuteduly B Reichstein M Arain M A Cescatti A KielyG Merbold L Serrano-Ortiz P Sickert S Wolf S andPapale D Predicting carbon dioxide and energy fluxes acrossglobal FLUXNET sites with regression algorithms Biogeo-sciences 13 4291ndash4313 httpsdoiorg105194bg-13-4291-2016 2016rdquo

Truong A Walters A Goodsitt J Hines K Bruss C B andFarivar R Towards automated machine learning Evaluationand comparison of AutoML approaches and tools in Proceed-ings of the 2019 IEEE 31st International Conference on Toolswith Artificial Intelligence (ICTAI) Portland OR US 1471ndash1479 2019

Verbist K Santibantildeez F Gabriels D and Soto G Atlas dezonas aacuteridas de Ameacuterica Latina y El Caribe Donoso MC(Coord) UNESCO-PHI and CAZALAC Montevideo Uruguay2010

Wang H Li X Ma M and Geng L Improving esti-mation of gross primary production in dryland ecosystemsby a model-data fusion approach Remote Sens 11 225httpsdoiorg103390rs11030225 2019

Wang X Ma M Li X Song Y Tan J Huang G Zhang ZZhao T Feng J and Ma Z Validation of MODIS-GPP prod-uct at 10 flux sites in northern China Int J Remote Sens 34587ndash599 httpsdoiorg101080014311612012715774 2013

Wang X Wang H Li X and Ran Y Photosynthesis (NPPNEP Respiration) Obs Meas Ecohydrol Process 1ndash30httpsdoiorg101007978-3-662-47871-4_10-1 2018

Wheeler C W Archer S R Asner G P and McMurtry C RClimaticedaphic controls on soil carbonnitrogen response toshrub encroachment in desert grassland Ecol Appl 17 1911ndash1928 httpsdoiorg10189006-15801 2007

Wilcox B P Birt A Fuhlendorf S D and Archer S R Emerg-ing frameworks for understanding and mitigating woody plantencroachment in grassy biomes Curr Opin Env Sust 32 46ndash52 httpsdoiorg101016jcosust201804005 2018

Wilson J B Cover plus ways of measuring plant canopiesand the terms used for them J Veg Sci 22 197ndash206httpsdoiorg101111j1654-1103201001238x 2011

Wofsy S Goulden M Munger J Fan S-M Bakwin PDaube B Bassow S and Bazzaz F Net exchange ofCO2 in a mid-latitude forest Science 260 1314ndash1317httpsdoiorg101126science26051121314 1993

Wu C Niu Z and Gao S Gross primary production estimationfrom MODIS data with vegetation index and photosyntheticallyactive radiation in maize J Geophys Res-Atmos 115 D12127httpsdoiorg1010292009JD013023 2010

Wutzler T Lucas-Moffat A Migliavacca M Knauer JSickel K Šigut L Menzer O and Reichstein MBasic and extensible post-processing of eddy covarianceflux data with REddyProc Biogeosciences 15 5015ndash5030httpsdoiorg105194bg-15-5015-2018 2018

Xiao H McDonald-Madden E Sabbadin R Peyrard NDee L E and Chadegraves I The value of understanding feed-backs from ecosystem functions to species for managing ecosys-tems Nat Commun 10 1ndash10 httpsdoiorg101038s41467-019-11890-7 2019

Xu B Yang Y Li P Shen H and Fang J Globalpatterns of ecosystem carbon flux in forests A biometricdata-based synthesis Global Biogeochem Cy 28 962ndash973httpsdoiorg1010022013GB004593 2014

Yao J Liu H Huang J Gao Z Wang G Li D Yu H andChen X Accelerated dryland expansion regulates future vari-ability in dryland gross primary production Nat Commun 111ndash10 httpsdoiorg101038s41467-020-15515-2 2020

Yepez E A Williams D G Scott R L and Lin GPartitioning overstory and understory evapotranspiration ina semiarid savanna woodland from the isotopic composi-tion of water vapor Agr Forest Meteorol 119 53ndash68httpsdoiorg101016S0168-1923(03)00116-3 2003

Yona L Cashore B Jackson R B Ometto J and Brad-ford M A Refining national greenhouse gas inventories Am-bio 1ndash6 httpsdoiorg101007s13280-019-01312-9 2020

Zeng X Zeng X and Barlage M Growing temperate shrubsover arid and semiarid regions in the Community Land ModelndashDynamic Global Vegetation Model Global Biogeochem Cy 22GB3003 httpsdoiorg1010292007GB003014 2008

Zhang A Jia G Epstein H E and Xia J ENSO elicits oppos-ing responses of semi-arid vegetation between Hemispheres SciRep 7 42281 httpsdoiorg101038srep42281 2017

Zhang L Xiao J Zheng Y Li S and Zhou Y Increased carbonuptake and water use efficiency in global semi-arid ecosystemsEnviron Res Lett 15 034022 httpsdoiorg1010881748-9326ab68ec 2020

Zhu X-G Long S P and Ort D R What is the max-imum efficiency with which photosynthesis can convert so-lar energy into biomass Curr Opin Biotech 19 153ndash159httpsdoiorg101016jcopbio200802004 2008

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

  • Abstract
  • Introduction
  • Materials and methods
    • Site description
    • Eddy covariance measurements
    • Flux data processing
    • Remotely sensed data
    • MODIS algorithm for GPP
    • Santa Rita site dataset
    • Modeling
    • Variable importance
    • Model agreement
      • Results
        • Eddy covariance fluxes at Bernal
        • Machine learning ensembles as predictors of eddy covariance GPP
        • MODIS as predictor of eddy covariance GPP
        • Prediction eddy covariance GPP with ordinary-least-squares multiple regression
        • Agreement
          • Discussion
            • Model agreement
            • MODIS discrepancies
            • Forcing variables and machine learning
            • Carbon flux
            • Ecosystem management
              • Conclusions
              • Appendix A
              • Appendix B
              • Appendix C
              • Code and data availability
              • Author contributions
              • Competing interests
              • Acknowledgement
              • Financial support
              • Review statement
              • References
Page 14: Machine learning estimates of eddy covariance carbon flux ... · Scrub is a xeric category of shrublands characterized by plants with small leaves; it is very thorny, and its biomass

380 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Figure 11 (a) MOD17 GPP daily time step from the Bernal site versus estimates of GPP from EC data using the complete dataset at adaily time step derived from spline for MOD17 GPP (copy) with r = 076 or 8 d composite estimates obtained by 8 d averages of EC GPP() with r = 076 (b) Adjusted predictions of GPP from MODIS (c) Corresponding function of probability of agreement for Fig 7a using1 g Cmminus2 dminus1 as a tolerable agreement between methods of estimation of GPP EC data from the Bernal site and MODIS MOD17 Thehorizontal axis (s) represents the magnitude of the measurement The vertical axis is the probability of agreement for the measurement

Figure 12 (a) Ordinary-least-squares multiple-regression estimates of GPP for the complete dataset from the Bernal site versus estimates ofGPP from EC data with r = 082 (b) Respective function of probability of agreement using 1 gCmminus2 dminus1 as a tolerable agreement betweenmethods of estimation of GPP EC data from the Bernal site and OLS multiple regression The horizontal axis (s) represents the magnitudeof the measurement The vertical axis is the probability of agreement for the measurement

the EC GPP was rescaled and compared to the original 8 dMCD17 data (Fig 11a) The GPP from MODIS was an un-derestimate of EC GPP and when the estimates were ad-justed (Eq 6 Fig 11b) the performance was not better thanthe OLS (model not shown) The MODIS land cover classi-fication represented this site as grassland (MCD12) and thiscould be another reason for the poor agreement besides theassumptions made in the MODIS algorithm regarding the εand εmax parameters and the response of vegetation to VPDAgreement of MODIS GPP is crucial because MODIS prod-ucts are frequently used in country-wide assessments of thecarbon cycle and can influence public policies

The model with least agreement resulted when the EMLensemble generated from the Santa Rita site was used to pre-dict GPP at Bernal Machine learning models can make pre-dictions but their usefulness decreases when they are usedoutside the context of where they were built while process-

based mechanistic models have this ability Although the Au-toML function in H2O is designed to protect against over-fitting using cross-validation runs (Michailidis 2018) in ourstudy the Santa Rita model could not be generalized to repre-sent the Bernal site GPP process probably because the vari-ables and features selected for Bernal or Santa Rita were dif-ferent during the AutoML workflow

The Santa Rita model was good at predicting GPP at thatsite with a deviance of the leader model of 023 while atBernal the deviance of the leader model was 063 (Table 3)indicating that the Santa Rita EML ensemble was at leastas good a model as the EML at Bernal (not shown) TheGPP time series for Santa Rita was about 4 times the sizeof the Bernal dataset and therefore the deviance was lowerHowever when the Santa Rita model was used with Bernaldata the mean θs was 016 indicating that the agreement wasinsufficient Eyeballing the predictions in Figs 9a 10b and

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 381

11a and their corresponding correlation values (078 076and 082 for adjusted EML Santa Rita adjusted MODIS andmultiple-regression OLS) it could be argued that these mod-els were comparable However their θs plots present a dif-ferent perspective

4 Discussion

41 Model agreement

The Santa Rita model had higher θs when it extrapolatedfor low GPP values for the Bernal site suggesting that theSanta Rita model had a better skill when predicting GPPclose to 0 and even negative GPP values (Fig 10b) Al-though the Bernal and Santa Rita sites had similar vegeta-tion and climate classifications they are more than 1600 kmapart and rainfall monthly distribution is different The GPPseasonal cycle at Bernal started in February and steadily in-creased to a maximum during July and August (Table 1)At Santa Rita the GPP was low from January until mid-July (lt 05 gCO2 mminus2 dminus1) and then increased sharply to amaximum level in mid-August (Joiner and Yoshida 2020)What matters eventually for machine learning methods ishow well the predictor space rather than geographic spaceis sampled (Jung et al 2020) Incorporating data from morehumid (semiarid) sites could improve the GPP predictions bya machine learning method Not only more EC sites but alsosites representing a water availability gradient would be im-portant for semiarid ecosystems representing long spells andthe influence of oceanic oscillations and monsoon rains

All the models presented here used transient data to rep-resent GPP specifically at the 1 d time step Besides theradiation-related variables two sources of rainfall were usedas forcing variables but evapotranspiration (ET) was notused The MOD16A2 version 6 is available as an 8 d gap-filled product and could be included in EML or OLS mod-els However a more general representation of the carboncycling could be achieved when including variables that rep-resent annual or seasonal timescales of soil water evapo-ration or precipitation (Scott and Biederman 2019) Scottet al (2015) suggest that real lags between precipitation andproductivity that may impart legacy effects may also be par-tially masked by using ET as ET more carefully tracks pro-ductivity when soil moisture storage is accessed The occur-rence of off-season rainfall dry spells and carryover effectscould be parameterized as windowed events of a given dura-tion A window would be a period with distinct time bound-aries the window allows the grouping of records with similarfeatures The effect of the window at a given time could berepresented as moving weights as the point in time in ques-tion is closer or farther from the window

In our study the metric to assess model agreement wasthe probability of agreement (θs) and their bias parame-ters Many other metrics can be used to evaluate model per-

formance such as the root mean square error r r2 or themodel efficiency factor (MEF) presented by Nash and Sut-cliffe (1970) In particular the MEF is a step in the Fluxnetdata processing pipeline (Pastorello et al 2020) A MEFvalue close to 1 represents a high correlation and lower bi-ases (Joiner and Yoshida 2020) Therefore the calculatedMEF for the Bernal site EML (098) would suggest very goodpredictive performance while the θs of 06 for this model in-dicates a more modest performance The θs for a particularvalue of the measurand is the probability that the differencebetween two measurements made by different systems fallswithin an interval that is deemed to be acceptable (Stevenset al 2015) In the present study we used 1 gCmminus2 dminus1 as acritical value defining an acceptable difference If this valueis smaller then the probability of agreement would decreasefor this same model In such a case it would be less likelythat the predictors agree considering that the θs takes intoaccount both the difference in the function means at a givenvalue of the measurand and the uncertainty in its estimation

42 MODIS discrepancies

Different authors have reported discrepancies betweenMCD17 and EC estimates of GPP in semiarid regions Ex-amining MODIS discrepancies in these ecosystems is im-portant because the errors induced by cloud cover are ex-pected to be minimal and other effects can be identified(Gebremichael and Barros 2006) The GPP of MOD17 didnot relate well (EC= 011+ 017MODIS r2

= 067) withestimates of EC GPP in semidesert vegetation of the Sahel(Tagesson et al 2017) With data from different types of veg-etation in the Heihe basin in China MODIS17 overestimatedthe GPP from EC (EC= 115+024MODIS r2

= 068 Cuiet al 2016) For scrub sites in Mexico the relation be-tween GPP calculated from EC and MOD17 was not good(MODIS= 38382+ 0467EC r2

= 06 Delgado-Balbuenaet al 2018) In arid and semiarid ecosystems in China op-timizing parameters of the MODIS GPP model with site-specific data improved the estimate to explain 91 of thevariation in the GPP of the data observed by EC (Wang et al2019) These same authors propose improving the land useclassification used by the MOD17 algorithm and recalibrat-ing light use efficiency parameters to solve the GPP estima-tion problem Gebremichael and Barros (2006) examined anopen shrubland site in a semiarid region of Sonora Mexicoand their analysis of the temporal evolution of the discrep-ancies with MODIS GPP suggested revisiting the light useefficiency parameterization especially the functional depen-dence on VPD and PAR and water stress or soil moistureavailability

The relationship between the GPP MODIS and the GPPEC presented in Sect 33 is an approximation because theuncertainty in the respiration component must be consid-ered The empirical relationship between nocturnal NEE andsoil temperature has been used to represent ecosystem res-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

382 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

piration (Reco) in order to separate the processes that con-tribute to daytime NEE (Richardson and Hollinger 2005Wofsy et al 1993) Nighttime NEE should be equal to therates of autotrophic and heterotrophic respiration while dur-ing daytime NEE should be equal to the combined rates ofcarboxylation and oxidation of Rubisco autotrophic respi-ration and heterotrophic respiration Then the GPP can becalculated as the difference between daytime NEE and Recoestimated through its relationship with temperature (Gouldenet al 1996) In the present study Reco was calculated basedon soil and air temperature following the procedure of Reich-stein et al (2005) implemented in REddyProc (Wutzler et al2018) Although it is possible to measure or model the parti-tion of respiration (Running et al 2004 Wang et al 2018)the presence of cacti complicates the calculation assum-ing that all nighttime flux represents ecosystem respiration(Owen et al 2016 Richardson and Hollinger 2005) Whilesoil respiration tends to be temperature-limited when soilmoisture is nonlimiting in temperate ecosystems in range-land ecosystems the controls of soil CO2 efflux were photo-synthesis soil temperature and moisture (Roby et al 2019)In our study the instrumentation did not include measure-ments of plant or soil respiration partition to validate theRecoestimates

A problem regarding data comparison from remote orbitalsensors and terrestrial observations is that different quanti-ties are fundamentally measured MODIS measures the radi-ation reflected by the earthrsquos surface in two spectral bands at250 m spatial resolution per pixel five bands at 500 m and29 bands at 1000 m The EC technique has a footprint tomeasure CO2 that varies dynamically in shape and size butis generally considered to be 1 km2 in this study the foot-print radius was 600 m To solve the scaling MODIS prod-ucts related to the carbon cycle have been validated with theEC technique and biometric measurements on several spa-tial scales using process-based ecosystem models and char-acterizing areas up to 47 km2 around the EC measuring tower(Cohen et al 2003)

At Bernal the vegetation was heterogeneous This situa-tion was represented by the heterogeneity in vegetation ac-tivity in the 4 pixels used spanning about 02 units of NDVIat the peak of the season activity (Fig 1) The standard errorof the IVI differed by 1 order of magnitude among speciesAlthough more important species had a lower standard errortheir means were similar indicating that they were equallyabundant at all sampling plots The higher importance ofsome species (Table 2) was explained by selective grazingndashbrowsing behavior and the dispersion caused by cattle by ei-ther ingesting or transporting seeds or plant parts (Belaynehand Tessema 2017) Regarding landscape heterogeneity thetower fetch was predominantly from the northeast capturingthe less heterogeneous area of the site but also the more ac-tive according to the NDVI (Fig 1)

The thorny scrub examined had two vegetation layersthe overstory layer mainly consisting of mesquite acacia

and cacti and the understory layer that included grasses andherbs Cattle preferentially graze the understory and becausethey eat using their tongue they will avoid browsing thornyspecies unlike goats or deer which use their lips Withoutgrazing management overtime the competition balance willfavor bush species resulting in encroaching and the under-story will be stressed only unpalatable species or those withtheir growing meristems very close to the ground would sur-vive Representing the structure and functioning of these twolayers using MODIS is possible (Liu et al 2017) RecentlyHill and Guerschman (2020) presented a MODIS product de-rived from MCD43A4 to estimate the fractions of photosyn-thetic and nonphotosynthetic vegetation and the remainingfraction of bare soil These developments could improve theMOD17 GPP estimates since its model represents a homoge-neous single vegetation layer All these considerations helpto understand the low θs between MOD17 estimates of GPPand EC-derived GPP

43 Forcing variables and machine learning

Forcing variables in this study were gridded meteorologi-cal data and MODIS spectral bands and products but thevariables mostly included in the ML algorithms were LAIand FPAR accounting for 364 and 549 of the vari-able importance index at Bernal and Santa Rita respectivelyin their leader board This result supported the view thatCO2 fluxes can be represented by ML algorithms exclu-sively using remotely sensed data (Tramontana et al 2016Joiner and Yoshida 2020) Using neural networks Joiner andYoshida (2020) showed that with only satellite reflectancefrom MODIS and top-of-atmosphere PAR it is possible tocapture a large fraction of GPP variability They argue thatvegetation indices may reduce the information content ofthe underlying reflectance when compressing the informa-tion from two or more bands into a single index In our studyreflectance bands were not often included in the ML modelsThis was most likely because MOD09GA and MODOCGAprovide estimates of surface reflectance uncorrected for theillumination and viewing geometry while the MCD43 usedin Joiner and Yoshida (2020) uses a surface bidirectional re-flectance distribution function (BRDF) model that improvesthe quality of surface albedo retrievals By propagating theBRDF correction in the MODIS processing pipeline the veg-etation indices would more likely relate better to the GPPFuture modeling efforts should use benchmark scenarios ac-cording to sets of forcing variables already identified as use-ful

44 Carbon flux

Although the carbon balance in ecosystems is influenced bydifferent factors such as soil type and amount of nutrientsthe relationship with soil temperature and humidity is partic-ularly strong (Anderson-Teixeira et al 2011 Hastings et al

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 383

2005) How much of the rainwater the system can retain orlose has been described as the leakiness of the system (Guer-schman et al 2009) More than immediate incident rainfallthe available soil moisture and its redistribution are impor-tant in semiarid ecosystems including steam flow prefer-ential flow paths hydraulic lift and others (Barron-Gaffordet al 2017) At Bernal when the sum of the precipitation ofthe current month and that of the previous month was con-sidered the correlation with NEE was minus07 suggesting thatcontinuous availability of soil moisture is important for theabsorption of CO2 in this environment This result is con-sistent with other studies in which the relationship betweenthe net productivity of the ecosystem (NEP) and precipita-tion is initially positive but is leveled from 1000 to 1500 mmannually (Xu et al 2014) The hydraulic redistribution ofwater from moist (deeper) to drier soils through plant rootstended to increase modeled annual ecosystem uptake of CO2this process was identified at US-SMR (Fu et al 2018 Scottet al 2003)

The leakiness is highly dependent upon vegetation frac-tional cover the proportion of the surface occupied by baresoil and vegetation photosynthetically active vegetation andnonphotosynthetically active vegetation such as litter woodand dead biomass (Guerschman et al 2009) All these vege-tation fractions have a water storage capacity and can reducethe amount of effective rainfall available to plant roots It ispossible that the canopy of the bushes completely interceptedthe rainfall in some months because the scrub can interceptup to 20 of the precipitation and its canopy storage ca-pacity is 097 mm (Mastachi-Loza et al 2010) Consideringonly the daily rain events greater than 5 mm the correlationbetween precipitation and NEE rose to minus072 In the presentstudy the interception of rain by vegetation surfaces was notcalculated but the results suggest that it would be impor-tant to explore the relationship between net precipitation andNEE

The average NEE at a global level is minus156plusmn284 gCmminus2 yrminus1 (Baldocchi 2014) The highest frequencyamong sites that measured NEE with EC occurs from minus200to minus300 gCmminus2 yrminus1 but in sites with biometric measure-ments the peak occurs at minus100 gCmminus2 yrminus1 (Xu et al2014) Using the daily averages of Table 2 the averageNEE during the measurement period was minus078 gCmminus2 dminus1

and annually would be minus2835 gCmminus2 yrminus1 This resultwas higher than the annual values of the induced grass-land and scrubland vegetation characterizing the SonoranDesert plains (138 and 130 gCmminus2 yrminus1 Hinojo-Hinojoet al 2019) In New Mexico NEE values measured withEC are between 35ndash50 gCmminus2 yrminus1 in desert grassland and344ndash355 gCmminus2 yrminus1 in mixed coniferous forest (Anderson-Teixeira et al 2011) In a dryer region the sarcocaulescentscrubland of Baja California in Mexico the NEE was minus39and minus52 gCmminus2 yrminus1 in 2002 and 2003 respectively (Hast-ings et al 2005) The NEE measured here was within therange of NEE 03plusmn 02 kgCmminus2 yrminus1 for grasslands and

shrublands in Mexico (Murray-Tortarolo et al 2016) Al-though the measurements of the present study had gaps andwere compared with annual studies we considered that thereported value of C was representative of the main season ofgrowth of this type of scrub

45 Ecosystem management

Overgrazing is an appreciation relative to the grazing produc-tive system where the forage resource is overused in a mixedshrubndashgrass ecosystem such as Bernal it usually refers tothe understory Overgrazing means that the plant regrowth isreadily grazed tillers and root reserves are lost and eventu-ally the plant may die Although the Bernal site was over-grazed the carbon fluxes indicated that the plant communitywas photosynthetically active in both dry and rainy seasonsIt is fair to assume that water in the soil was not limiting forthe deep-rooted bush species and that was the reason whyit was possible to maintain the photosynthetic function dur-ing rainless months However this primary production wouldnot have tangible benefits for ranchersrsquo production systemsince no edible biomass would be produced for the cattleFrom the point of view of carbon capture the system accu-mulated nonlabile biomass that would remain in the systemfor a longer time compared to a grassland ecosystem (al-though it would be necessary to determine the partition ofsaid shrub biomass) However the overgrazing condition af-fects the biomass of the understory roots and consequentlythe carbon pool in the soil

In the short term it can be thought that the estimated neg-ative carbon flows had a favorable effect on the environmen-tal agenda As time passes it is possible that the gaps be-tween the individual shrubs of the overstory expand and thiswould have had an effect on soil erosion It is also possiblethat the water stored in the soil profile used by the bushesgradually decreases to the point of causing drought changesin phenology and advancements of the desertification pro-cess There are many opportunities for ecology conservationand livestock-oriented management these may include con-trolled grazing or propagating native thornless shrub speciesIf ranchers do not identify a benefit in the vegetation thenthey will be tempted to remove it as it occurred at thestudy site Because of its wide coverage and availability theMODIS GPP product accuracy is important in representingthe carbon cycle raising awareness and monitoring advance-ment of environmental decisions

Although we found that EML was a good option for mod-eling the GPP of a site what is really needed to evaluate theperformance of semiarid ecosystems is a spatial representa-tion of the carbon flux This is a problem for an underrepre-sented area regarding instrumented EC towers However theEML could be designed to take into account the explanatoryvariables in a spatiotemporal continuity

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

384 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

5 Conclusions

The best modeling approach was the ensemble of ma-chine learning The second option to estimate GPP was themultiple-regression OLS and the third alternative were theMODIS estimates Machine learning was a good option topredict GPP in the local context where it was generated oth-erwise its performance was not good This was demonstratedwhen using the EML from Santa Rita and trying to predictfluxes at Bernal Nevertheless a machine learning modelwould be useful for gap filling or the evaluation of GPP atthe same site The GPP estimates of a given model can be ad-justed using the bias parameters of the probability of agree-ment to improve the relationship Further research explor-ing the usefulness of EML to predict GPP in other contextscould use EML to calibrate mechanistic model parametershybrid approaches integrating EML and biologically mean-ingful mechanistic models or EML representing the spatialvariability of the landscape

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 385

Appendix A

Table A1 Threshold of velocity friction (ulowast) above which nighttime fluxes were considered valid Estimates were obtained using the Rpackage REddyProc We used only the flux records with ulowast equal to or higher than the corresponding mean ulowast threshold for each yearBound values are a 95 confidence interval NA not applicable

Aggregation method Year Season Mean Lower bound 5 Upper bound 95

ulowast (msminus1)

Single NA NA 0193 0140 0209Year 2017 NA 0194 0128 0227Year 2018 NA 0193 0132 0225Season 2017 2016012 0194 0128 0227Season 2017 2017003 0215 0141 0297Season 2017 2017006 0194 0128 0227Season 2017 2017009 0181 0144 0208Season 2018 2017012 0149 0116 0232Season 2018 2018003 0271 0134 0297Season 2018 2018006 0193 0132 0221

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

386 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix B

Table B1 MODIS reflectance bands and products database

Product name Satellite layer Spatial Temporal Spectralresolution resolution coverage

(d) (nm)

MOD09GA Band 1 500 m 1 620ndash670Band 2 841ndash876Band 3 459ndash479Band 4 545ndash565Band 5 1230ndash1250Band 6 1628ndash1652Band 7 2105ndash2155

MODOCGA Band 8 1 km 405ndash420Band 9 438ndash448Band 10 483ndash493Band 11 526ndash536Band 12 546ndash556Band 13 662ndash672Band 14 673ndash683

MOD17A2H Gross primary production (GPP) 8

Net photosynthesis (PsnNet)

MOD15A2H Fraction of photosynthetically active radiation (Fpar) na

Leaf area index (LAI)

MOD11A2 Land surface temperature and emissivity of day (LST day) 1 km

Land surface temperature and emissivity of night (LST night)

MYD13A1 Enhanced vegetation index (EVI) 500 m 16

MYD15A2H Fraction of photosynthetically active radiation (Fpar) 8

Leaf area index (LAI)

MYD11A2 Land surface temperature and emissivity of day (LST day) 1 km na

Land surface temperature and emissivity of night (LST night)

MCD15A2H Fraction of photosynthetically active radiation (Fpar) 500 m

Leaf area index (LAI)

na not applicable

Table B2 Daily vegetation indexes computed using the MODIS reflectance bands described in Table 1

Index Formula Reference

Simple ratio (SimpleR) SimpleR= Band 2Band 1

(Hill et al 2006)Moisture stress (MoistS) MoistS= Band 6

Band 2

Disease stress index (DSI) DSI= Band 2+Band 4Band 6+Band 1

Redndashgreen ratio index (RGRI) RGRI= Band 1Band 4

Normalized difference vegetation index (NDVI) NDVI= Band 2minusBand 1Band 2+Band 1

Normalized difference water index (NDVI_w) NDVI_w =Band 2minusBand 5Band 2+Band 5

Green leaf index (GLI) GLI= 2lowastBand 11minusBand 14minusBand 92lowastBand 11+Band 14+Band

(Henrich et al 2012)Green atmospherically resistant vegetation (GARI) GARI= Band 5minus(Band 11minus(Band 9minusBand 14))

Band 5minus(Band 11+(Band 9minusBand 14))

Enhanced vegetation index (EVI) EVI= 25lowast Band 2minusBand 1Band 2+(6lowastBand 1)minus(75lowastBand 9)+1

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 387

Table B3 Daymet meteorological database

Variable Spatial resolution Temporal resolution Reference

Precipitation (Dayprc)

1 km Daily (Thornton et al 2017)Shortwave radiation (Daysrad)Maximum air temperature (DayTmax)Minimum air temperature (DayTmin)Water vapor pressure (Dayvp)

Table B4 Precipitation data

Satellite Product name Spatial resolution Temporal resolution

Global Precipitation Measurement (GPM) 3IMERGDF v006 010Daily

Tropical Rainfall Measuring Mission (TRMM) 3B42 v007 025

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

388 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix C

Table C1 Analysis of variance for GGP derived from EC data atthe Bernal site Variable details are described in Appendix B

Model term Coefficient SEa DFb SSc F p VIFd

Regression 14 178048Constant minus1318 648 minus203 004R_08_40542e

minus212 060 1 2680 minus352 0 220R_13_66267e

minus286 123 1 1161 minus232 002 155R_14_67368e 917 259 1 2706 353 0 122RGRI6

minus083 037 1 1100 minus225 003 143GARIf 201 108 1 755 187 006 237EVIMYDg 853 201 1 3903 425 0 294NDVI_wf 390 155 1 1367 251 001 271PsnNetg 0013 0 1 7530 590 0 383LstNgtMYDg 005 002 1 818 194 005 201Dayprch 014 003 1 3590 407 0 192Daysradh 001 0 1 5121 486 0 501DayTmaxh

minus024 006 1 3658 minus411 0 689DayTminh 018 005 1 3222 386 0 496TRMMi minus004 002 1 1087 minus224 003 141Error 403 87297Total 417 265345

a Standard error of coefficient b Degrees of freedom c Adjusted sum of squares d Variance inflation factore Variable name denotes the band number and spectral bandwidth of MODIS (Moderate Resolution ImagingSpectroradiometer) f Vegetation indices RGRI is redndashgreen ratio index and GARI is green atmosphericallyresistant vegetation index details of formula are described in Appendix B g Layers from MODIS productsEVIMYD is enhanced vegetation index from MYD13 PsnNet is photosynthesis product form MOD17LstNgtMYD is nighttime land surface temperature emissivity from MYD11 h Variables obtained fromDaymet daily dataset DayTmin is minimum temperature DayTmax is maximum temperature Daysrad isshortwave radiation Dayprc is precipitation i Daily rainfall rate from 3B42 TRMM (Tropical RainfallMeasuring Mission)

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 389

Code and data availability Database and programming code areavailable at httpsdoiorg105281zenodo3598595 (Guevara-Escobar et al 2020)

Author contributions AGE contributed to conceptualization for-mal analysis investigation methodology software supervisionvalidation visualization and writing of the original draft as wellas review and editing EGS was in charge of funding acquisitionproject administration and writing and review MCJ contributed toinvestigation methodology validation visualization and writing ofthe original draft as well as review editing and communicationMEQB helped with investigation validation writing and reviewHSA contributed with conceptualization writing and review ICAacutehelped with investigation writing and review VHCS contributedwith writing review and funding acquisition

Competing interests The authors declare that they have no conflictof interest

Acknowledgement We thank NASA and Ameriflux for data avail-ability Data filtering was carried out by the cluster Abacus ICentre of Applied Mathematics and High Performance ComputingABACUS-CINVESTAV

Financial support This research has been supported by CONA-CyT ndash SEMARNAT (grant no 2014-1-(249407))

Review statement This paper was edited by Paul Stoy and re-viewed by three anonymous referees

References

Aguirre-Diacuteaz G J Aguilloacuten-Robles A Tristaacuten-Gonzaacutelez MLabarthe-Hernaacutendez G Loacutepez-Martiacutenez M Bellon Hand Nieto-Obregoacuten J Geologic setting of the Pentildeade Bernal Natural Monument Quereacutetaro Meacutexico Anendogenous volcanic dome Geosphere 9 557ndash571httpsdoiorg101130GES008431 2013

Anav A Friedlingstein P Beer C Ciais P Harper AJones C Murray-Tortarolo G Papale D Parazoo N Cand Peylin P Spatiotemporal patterns of terrestrial grossprimary production A review Rev Geophys 53 785ndash818httpsdoiorg1010022015RG000483 2015

Anderson-Teixeira K J Delong J P Fox A M Brese D Aand Litvak M E Differential responses of production and res-piration to temperature and moisture drive the carbon balanceacross a climatic gradient in New Mexico Glob Change Biol17 410ndash424 httpsdoiorg101111j1365-2486201002269x2011

Baldocchi D Measuring fluxes of trace gases and energy be-tween ecosystems and the atmospherendashthe state and future of the

eddy covariance method Glob Change Biol 20 3600ndash3609httpsdoiorg101111gcb12649 2014

Barron-Gafford G A Sanchez-Cantildeete E P Minor R LHendryx S M Lee E Sutter L F Tran N Parra EColella T Murphy P C Hamerlynck E P Kumar P andScott R L Impacts of hydraulic redistribution on grassndashtreecompetition vs facilitation in a semi-arid savanna New Phytol215 1451ndash1461 httpsdoiorg101111nph14693 2017

Belayneh A and Tessema Z K Mechanisms of bush encroach-ment and its inter-connection with rangeland degradation insemi-arid African ecosystems a review J Arid Land 9 299ndash312 httpsdoiorg101007s40333-016-0023-x 2017

Bland J M and Altman D G Statistical methods for assessingagreement between two methods of clinical measurement Int JNurs Stud 47 931ndash936 2010

Boettiger C and Eddelbuettel D An introduction to rockerDocker containers 639 for r ArXiv Prepr ArXiv171003675 6402017

Bonilla-Moheno M and Aide T M Beyond deforestationLand cover transitions in Mexico Agric Syst 178 102734httpsdoiorg101016jagsy2019102734 2020

Booker K Huntsinger L Bartolome J W Sayre N Fand Stewart W What can ecological science tell us aboutopportunities for carbon sequestration on arid rangelands inthe United States Glob Environ Change 23 240ndash251httpsdoiorg101016jgloenvcha201210001 2013

Butzer K W and Butzer E K The lsquonaturalrsquovegetation ofthe Mexican Bajiacuteo archival documentation of a 16th-century savanna environment Quat Int 43 161ndash172httpsdoiorg101016S1040-6182(97)00032-3 1997

CICESE C de I C y de E S de E Base de datos climatoloacutegicanacional (CLICOM) Villa Bernal Quereacutetaro available at httpclicom-mexcicesemx (last access 28 August 2019) 2015

Cohen W B Maiersperger T K Yang Z Gower S TTurner D P Ritts W D Berterretche M and Run-ning S W Comparisons of land cover and LAI esti-mates derived from ETM+ and MODIS for four sites inNorth America A quality assessment of 20002001 provi-sional MODIS products Remote Sens Environ 88 233ndash255httpsdoiorg101016jrse200306006 2003

Cui T Wang Y Sun R Qiao C Fan W Jiang GHao L and Zhang L Estimating vegetation primaryproduction in the Heihe River Basin of China withmulti-source and multi-scale data PloS One 11 1ndash20httpsdoiorg101371journalpone0153971 2016

Curtis J T and McIntosh R P The Interrelations of Certain An-alytic and Synthetic Phytosociological Characters Ecology 31434ndash455 httpsdoiorg1023071931497 1950

Delgado-Balbuena J Yepez E A Aacutengeles-Peacuterez G Aguirre-Gutierrez C Arredondo T Ayala-Nintildeo F Bullock S HCastellanos A E Cueva A Figueroa-Espinoza B Garatuza-Payaacuten J Hinojo-Hinojo C Maya-Delgado Y Meacutendez-Barroso L Oechel W Paz-Pellat F Peacuterez-Ruiacutez E R Ro-driacuteguez J C Rojas-Robles N Saacutenchez-Mejiacutea Z M Uuh-Sonda J Vargas R Verduzco V S Vivoni E R andWatts C Flujos anuales de carbono en ecosistemas terrestres deMeacutexico in Estado Actual del Conocimiento del Ciclo del Car-bono y sus Interacciones en Meacutexico Siacutentesis a 2018 p 686Texcoco Estado de Meacutexico Meacutexico 2018

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

390 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Eshel G Dayalu A Wofsy S C Munger J W andTziperman E Listening to the Forest An ArtificialNeural Network-Based Model of Carbon Uptake at Har-vard Forest J Geophys Res-Biogeo 124 461ndash478httpsdoiorg1010292018JG004791 2019

Fu C Wang G Bible K Goulden M L Saleska S RScott R L and Cardon Z G Hydraulic redistributionaffects modeled carbon cycling via soil microbial activityand suppressed fire Glob Change Biol 24 3472ndash3485httpsdoiorg101111gcb14164 2018

Gebremichael M and Barros A P Evaluation of MODISGross Primary Productivity (GPP) in tropical mon-soon regions Remote Sens Environ 100 150ndash166httpsdoiorg101016jrse200510009 2006

Gitelson A A Peng Y Arkebauer T J and Suyker A EProductivity absorbed photosynthetically active radiation andlight use efficiency in crops Implications for remote sensingof crop primary production J Plant Physiol 177 100ndash109httpsdoiorg101016jjplph201412015 2015

Glenn E P Scott R L Nguyen U and Nagler P L Wide-area ratios of evapotranspiration to precipitation in monsoon-dependent semiarid vegetation communities J Arid Envi-ron 117 84ndash95 httpsdoiorg101016jjaridenv2015020102015

Goldstein A Turner W R Spawn S A Anderson-Teixeira K J Cook-Patton S Fargione J Gibbs H KGriscom B Hewson J H and Howard J F Protecting ir-recoverable carbon in Earthrsquos ecosystems Nat Clim Change10 1ndash9 httpsdoiorg101038s41558-020-0738-8 2020

Goulden M L Munger J W Fan S Daube B Cand Wofsy S C Measurements of carbon sequestra-tion by long-term eddy covariance Methods and a criti-cal evaluation of accuracy Glob Change Biol 2 169ndash182httpsdoiorg101111j1365-24861996tb00070x 1996

Grossiord C Buckley T N Cernusak L A Novick K A Poul-ter B Siegwolf R T Sperry J S and McDowell N GPlant responses to rising vapor pressure deficit New Phytol226 1550ndash1566 2020

Guerschman J P Hill M J Renzullo L J Barrett D JMarks A S and Botha E J Estimating fractional coverof photosynthetic vegetation non-photosynthetic vegetation andbare soil in the Australian tropical savanna region upscaling theEO-1 Hyperion and MODIS sensors Remote Sens Environ113 928ndash945 httpsdoiorg101016jrse200901006 2009

Guevara-Escobar A Gonzaacutelez-Sosa E Cervantes-Jimeacutenez MSuzaacuten-Azpiri H Queijeiro-Bolantildeos M E Carrillo-AacutengelesI and Cambroacuten-Sandoval V H Machine learning estimates ofeddy covariance carbon flux in a scrub in the Mexican highlandZenodo httpsdoiorg105281zenodo3598595 2020

Hall P Gill N Kurka M Phan W and Bartz A MachineLearning Interpretability with H2O Driverless AI edited byBartz A H2Oai Inc California US 2019

Hastings S J Oechel W C and Muhlia-Melo A Diur-nal seasonal and annual variation in the net ecosystem CO2exchange of a desert shrub community (Sarcocaulescent) inBaja California Mexico Glob Change Biol 11 927ndash939httpsdoiorg101111j1365-2486200500951x 2005

Henrich V Krauss G Goumltze C and Sandow C Entwicklungeiner Datenbank fuumlr Fernerkundungsindizes vol 45 AK Fern-erkundung Bochum 2012

Hill M Held A Leuning R Coops N Hughes D andCleugh H MODIS spectral signals at a flux tower site Rela-tionships with high-resolution data and CO2 flux and light useefficiency measurements Remote Sens Environ 103 351ndash368httpsdoiorg101016jrse200506015 2006

Hill M J and Guerschman J P The MODIS Global VegetationFractional Cover Product 2001ndash2018 Characteristics of Vegeta-tion Fractional Cover in Grasslands and Savanna Woodlands Re-mote Sens 12 406 httpsdoiorg103390rs12030406 2020

Hinojo-Hinojo C Castellanos A E Huxman T Ro-driguez J C Vargas R Romo-Leoacuten J R and Bie-derman J A Native shrubland and managed buffelgrasssavanna in drylands Implications for ecosystem carbonand water fluxes Agr Forest Meteorol 268 269ndash278httpsdoiorg101016jagrformet201901030 2019

Jaumlrvi L Havu M Ward H C Bellucco V McFad-den J P Toivonen T Heikinheimo V Kolari P Ri-ikonen A and Grimmond C S B Spatial modeling oflocal-scale biogenic and anthropogenic carbon dioxide emis-sions in Helsinki J Geophys Res-Atmos 124 8363ndash8384httpsdoiorg1010292018JD029576 2019

Joiner J and Yoshida Y Satellite-based reflectances capture largefraction of variability in global gross primary production (GPP)at weekly time scales Agr Forest Meteorol 291 108092httpsdoiorg101016jagrformet2020108092 2020

Jung M Schwalm C Migliavacca M Walther S Camps-VallsG Koirala S Anthoni P Besnard S Bodesheim P Carval-hais N Chevallier F Gans F Goll D S Haverd V KoumlhlerP Ichii K Jain A K Liu J Lombardozzi D Nabel J EM S Nelson J A OrsquoSullivan M Pallandt M Papale DPeters W Pongratz J Roumldenbeck C Sitch S TramontanaG Walker A Weber U and Reichstein M Scaling carbonfluxes from eddy covariance sites to globe synthesis and eval-uation of the FLUXCOM approach Biogeosciences 17 1343ndash1365 httpsdoiorg105194bg-17-1343-2020 2020

Kang M Ichii K Kim J Indrawati Y M Park J Moon MLim J-H and Chun J-H New Gap-Filling Strategies forLong-Period Flux Data Gaps Using a Data-Driven ApproachAtmosphere 10 568 httpsdoiorg103390atmos101005682019

Lal R Griffin M Apt J Lave L and Morgan M GManaging Soil Carbon Science 304 393ndash393httpsdoiorg101126science1093079 2004

LI-COR B Eddypro Software instruction manual Nebraska USversion 70 2019

Liu Y Liu R Pisek J and Chen J M Separating overstoryand understory leaf area indices for global needleleaf and de-ciduous broadleaf forests by fusion of MODIS and MISR dataBiogeosciences 14 1093ndash1110 httpsdoiorg105194bg-14-1093-2017 2017

Long S P Marshall-Colon A and Zhu X-G Meetingthe Global Food Demand of the Future by EngineeringCrop Photosynthesis and Yield Potential Cell 161 56ndash66httpsdoiorg101016jcell201503019 2015

Ma X Huete A Yu Q Restrepo-Coupe N Beringer J Hut-ley L B Kanniah K D Cleverly J and Eamus D Param-

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 391

eterization of an ecosystem light-use-efficiency model for pre-dicting savanna GPP using MODIS EVI Remote Sens Environ154 253ndash271 2014

Ma X Migliavacca M Wirth C Bohn F J Huth ARichter R and Mahecha M D Monitoring Plant FunctionalDiversity Using the Reflectance and Echo from Space RemoteSens 12 1248 httpsdoiorg103390rs12081248 2020

Marcolla B Roumldenbeck C and Cescatti A Patterns and con-trols of inter-annual variability in the terrestrial carbon budgetBiogeosciences 14 3815ndash3829 httpsdoiorg105194bg-14-3815-2017 2017

Mastachi-Loza C A Gonzaacutelez-Sosa E Becerril-Pintildea R andBraud I Peacuterdidas por intercepcioacuten en mezquite (Prosopis laevi-gata) y huizache (Acacia farnesiana) de la regioacuten semiaacuterida delcentro de Meacutexico Tecnol Cienc Agua 1 103ndash120 2010

Mauder M and Foken T Documentation and instruction man-ual of the eddy-covariance software package TK3 UniversitaumltBayreuth Abteilung Mikrometeorologie Bayreuth Germany2011

McClaran M A century of vegetation change on the Santa Rita Ex-perimental Range in Santa Rita Experimental Range 100 years(1903 to 2003) to accomplishments and contributions edited byMcClaran M P Ffolliott P F and Edminister C B Tech Co-ords USDA Forest service proceedings RMRS-P-30 TucsonAZ US 16ndash33 2003

McCulley R Jobbagy E Pockman W and Jackson R Nu-trient uptake as a contributing explanation for deep rootingin arid and semi-arid ecosystems Oecologia 141 620ndash628httpsdoiorg101007s00442-004-1687-z 2004

McElreath R Statistical Rethinking A Bayesian Course with Ex-amples in R and STAN CRC Press Boca Raton FL US 2020

Michailidis M How Driverless AI Prevents Overfitting and Leak-age Open Source Lead AI ML available at httpswwwh2oaiblogdriverless-ai-prevents-overfitting-leakage (last access7 July 2020) 2018

Mitchell R and Frank E Accelerating the XGBoost algo-rithm using GPU computing PeerJ Comput Sci 3 e127httpsdoiorg107717peerj-cs127 2017

Monteith J Solar radiation and productivity in tropical ecosys-tems J Appl Ecol 9 747ndash766 1972

Murray-Tortarolo G Friedlingstein P Sitch S Jaramillo V JMurguiacutea-Flores F Anav A Liu Y Arneth A Arvanitis AHarper A Jain A Kato E Koven C Poulter B Stocker BD Wiltshire A Zaehle S and Zeng N The carbon cycle inMexico past present and future of C stocks and fluxes Biogeo-sciences 13 223ndash238 httpsdoiorg105194bg-13-223-20162016

Nash J E and Sutcliffe J V River flow forecasting through con-ceptual models part I ndash A discussion of principles J Hydrol 10282ndash290 1970

Owen N A Choncubhair Oacute N Males J del Real Laborde J IRubio-Corteacutes R Griffiths H and Lanigan G Eddy covari-ance captures four-phase crassulacean acid metabolism (CAM)gas exchange signature in Agave Plant Cell Environ 39 295ndash309 httpsdoiorg101111pce12610 2016

Pasetto D Arenas-Castro S Bustamante J Casagrandi RChrysoulakis N Cord A F Dittrich A Domingo-Marimon C El Serafy G and Karnieli A Integration ofsatellite remote sensing data in ecosystem modelling at local

scales Practices and trends Methods Ecol Evol 9 1810ndash1821httpsdoiorg1011112041-210X13018 2018

Pastorello G Trotta C and Canfora E The FLUXNET2015dataset and the ONEFlux processing pipeline for eddy covari-ance data Sci Data 7 225 pp httpsdoiorg101038s41597-020-0534-3 2020

Pearcy R W Heterogeneous light environments in Handbookof Functional Plant Ecology edited by Pugnaire I F and Val-ladares F CRC Press Boca Ratoacuten FL US 270ndash314 2007

Piao S Liu Q Chen A Janssens I A Fu Y Dai J Liu LLian X Shen M and Zhu X Plant phenology and global cli-mate change Current progresses and challenges Glob ChangeBiol 25 1922ndash1940 httpsdoiorg101111gcb14619 2019

R Development Core Team R A language and environment forstatistical computing R Foundation for Statistical Computingavailable at httpswwwr-projectorg (last access 15 October2015) 2009

Reichstein M Falge E Baldocchi D Papale D Aubinet MBerbigier P Bernhofer C Buchmann N Gilmanov Tand Granier A On the separation of net ecosystem ex-change into assimilation and ecosystem respiration reviewand improved algorithm Glob Change Biol 11 1424ndash1439httpsdoiorg101111j1365-24862005001002x 2005

Richardson A D and Hollinger D Y Statistical model-ing of ecosystem respiration using eddy covariance datamaximum likelihood parameter estimation and Monte Carlosimulation of model and parameter uncertainty applied tothree simple models Agr Forest Meteorol 131 191ndash208httpsdoiorg101016jagrformet200505008 2005

Richardson A D Hollinger D Y Shoemaker J K Hughes HSavage K and Davidson E A Six years of ecosystem-atmosphere greenhouse gas fluxes measured in a sub-borealforest Sci Data 6 1ndash15 httpsdoiorg101038s41597-019-0119-1 2019

Roby M C Scott R L Barron-Gafford G A Hamerlynck E Pand Moore D J P Environmental and Vegetative Controls onSoil CO2 Efflux in Three Semiarid Ecosystems Soil Syst 3 6httpsdoiorg103390soilsystems3010006 2019

Running S W and Zhao M Daily GPP and annual NPP(MOD17A2A3) products NASA Earth Observing SystemMODIS land algorithm Univ of Mont Missoula Mont US2015

Running S W Nemani R R Heinsch F A Zhao MReeves M and Hashimoto H A continuous satellite-derived measure of global terrestrial primary productionBioscience 54 547ndash560 httpsdoiorg1016410006-3568(2004)054[0547ACSMOG]20CO2 2004

Rzedowski J Vegetacioacuten de Meacutexico Meacutexico Editor Limusa1978

Scott R L and Biederman J A Critical Zone Water Balance Over13 Years in a Semiarid Savanna Water Resour Res 55 574ndash588 httpsdoiorg1010292018WR023477 2019

Scott R L Watts C Payan J G Edwards E Goodrich D CWilliams D and James Shuttleworth W The understory andoverstory partitioning of energy and water fluxes in an opencanopy semiarid woodland Agr Forest Meteorol 114 127ndash139 httpsdoiorg101016S0168-1923(02)00197-1 2003

Scott R L Edwards E A Shuttleworth W J Huxman T EWatts C and Goodrich D C Interannual and seasonal vari-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

392 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

ation in fluxes of water and carbon dioxide from a ripar-ian woodland ecosystem Agr Forest Meteorol 122 65ndash84httpsdoiorg101016jagrformet200309001 2004

Scott R L Biederman J A Hamerlynck E P and Barron-Gafford G A The carbon balance pivot point of southwest-ern US semiarid ecosystems Insights from the 21st centurydrought J Geophys Res-Biogeo 120 2612ndash2624 2015

Segerstrom K Geology of the Bernal-Jalpan Area Estado deQueretaro Mexico Geol Surv Bull 1104 19ndash82 1961

Soper F M McCalley C K Sparks K and Sparks J P Soilcarbon dioxide emissions from the Mojave desert Isotopic evi-dence for a carbonate source Geophys Res Lett 44 245ndash251httpsdoiorg1010022016GL071198 2017

Stevens N T Steiner S H and MacKay R J Assessing agree-ment between two measurement systems An alternative to thelimits of agreement approach Stat Methods Med Res 262487ndash2504 httpsdoiorg1011770962280215601133 2015

Tagesson T Ardouml J Cappelaere B Kergoat L AbdiA Horion S and Fensholt R Modelling spatial andtemporal dynamics of gross primary production in theSahel from earth-observation-based photosynthetic capacityand quantum efficiency Biogeosciences 14 1333ndash1348httpsdoiorg105194bg-14-1333-2017 2017

Thornton P Thornton M Mayer B Wei Y Devarakonda RVose R and Cook R Daymet Daily Surface Weather Data ona 1-km Grid for North America Version 3 Oak Ridge NationalLab (ORNL) Oak Ridge TN US 2017

Tramontana G Jung M Schwalm C R Ichii K Camps-VallsG Raacuteduly B Reichstein M Arain M A Cescatti A KielyG Merbold L Serrano-Ortiz P Sickert S Wolf S andPapale D Predicting carbon dioxide and energy fluxes acrossglobal FLUXNET sites with regression algorithms Biogeo-sciences 13 4291ndash4313 httpsdoiorg105194bg-13-4291-2016 2016rdquo

Truong A Walters A Goodsitt J Hines K Bruss C B andFarivar R Towards automated machine learning Evaluationand comparison of AutoML approaches and tools in Proceed-ings of the 2019 IEEE 31st International Conference on Toolswith Artificial Intelligence (ICTAI) Portland OR US 1471ndash1479 2019

Verbist K Santibantildeez F Gabriels D and Soto G Atlas dezonas aacuteridas de Ameacuterica Latina y El Caribe Donoso MC(Coord) UNESCO-PHI and CAZALAC Montevideo Uruguay2010

Wang H Li X Ma M and Geng L Improving esti-mation of gross primary production in dryland ecosystemsby a model-data fusion approach Remote Sens 11 225httpsdoiorg103390rs11030225 2019

Wang X Ma M Li X Song Y Tan J Huang G Zhang ZZhao T Feng J and Ma Z Validation of MODIS-GPP prod-uct at 10 flux sites in northern China Int J Remote Sens 34587ndash599 httpsdoiorg101080014311612012715774 2013

Wang X Wang H Li X and Ran Y Photosynthesis (NPPNEP Respiration) Obs Meas Ecohydrol Process 1ndash30httpsdoiorg101007978-3-662-47871-4_10-1 2018

Wheeler C W Archer S R Asner G P and McMurtry C RClimaticedaphic controls on soil carbonnitrogen response toshrub encroachment in desert grassland Ecol Appl 17 1911ndash1928 httpsdoiorg10189006-15801 2007

Wilcox B P Birt A Fuhlendorf S D and Archer S R Emerg-ing frameworks for understanding and mitigating woody plantencroachment in grassy biomes Curr Opin Env Sust 32 46ndash52 httpsdoiorg101016jcosust201804005 2018

Wilson J B Cover plus ways of measuring plant canopiesand the terms used for them J Veg Sci 22 197ndash206httpsdoiorg101111j1654-1103201001238x 2011

Wofsy S Goulden M Munger J Fan S-M Bakwin PDaube B Bassow S and Bazzaz F Net exchange ofCO2 in a mid-latitude forest Science 260 1314ndash1317httpsdoiorg101126science26051121314 1993

Wu C Niu Z and Gao S Gross primary production estimationfrom MODIS data with vegetation index and photosyntheticallyactive radiation in maize J Geophys Res-Atmos 115 D12127httpsdoiorg1010292009JD013023 2010

Wutzler T Lucas-Moffat A Migliavacca M Knauer JSickel K Šigut L Menzer O and Reichstein MBasic and extensible post-processing of eddy covarianceflux data with REddyProc Biogeosciences 15 5015ndash5030httpsdoiorg105194bg-15-5015-2018 2018

Xiao H McDonald-Madden E Sabbadin R Peyrard NDee L E and Chadegraves I The value of understanding feed-backs from ecosystem functions to species for managing ecosys-tems Nat Commun 10 1ndash10 httpsdoiorg101038s41467-019-11890-7 2019

Xu B Yang Y Li P Shen H and Fang J Globalpatterns of ecosystem carbon flux in forests A biometricdata-based synthesis Global Biogeochem Cy 28 962ndash973httpsdoiorg1010022013GB004593 2014

Yao J Liu H Huang J Gao Z Wang G Li D Yu H andChen X Accelerated dryland expansion regulates future vari-ability in dryland gross primary production Nat Commun 111ndash10 httpsdoiorg101038s41467-020-15515-2 2020

Yepez E A Williams D G Scott R L and Lin GPartitioning overstory and understory evapotranspiration ina semiarid savanna woodland from the isotopic composi-tion of water vapor Agr Forest Meteorol 119 53ndash68httpsdoiorg101016S0168-1923(03)00116-3 2003

Yona L Cashore B Jackson R B Ometto J and Brad-ford M A Refining national greenhouse gas inventories Am-bio 1ndash6 httpsdoiorg101007s13280-019-01312-9 2020

Zeng X Zeng X and Barlage M Growing temperate shrubsover arid and semiarid regions in the Community Land ModelndashDynamic Global Vegetation Model Global Biogeochem Cy 22GB3003 httpsdoiorg1010292007GB003014 2008

Zhang A Jia G Epstein H E and Xia J ENSO elicits oppos-ing responses of semi-arid vegetation between Hemispheres SciRep 7 42281 httpsdoiorg101038srep42281 2017

Zhang L Xiao J Zheng Y Li S and Zhou Y Increased carbonuptake and water use efficiency in global semi-arid ecosystemsEnviron Res Lett 15 034022 httpsdoiorg1010881748-9326ab68ec 2020

Zhu X-G Long S P and Ort D R What is the max-imum efficiency with which photosynthesis can convert so-lar energy into biomass Curr Opin Biotech 19 153ndash159httpsdoiorg101016jcopbio200802004 2008

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

  • Abstract
  • Introduction
  • Materials and methods
    • Site description
    • Eddy covariance measurements
    • Flux data processing
    • Remotely sensed data
    • MODIS algorithm for GPP
    • Santa Rita site dataset
    • Modeling
    • Variable importance
    • Model agreement
      • Results
        • Eddy covariance fluxes at Bernal
        • Machine learning ensembles as predictors of eddy covariance GPP
        • MODIS as predictor of eddy covariance GPP
        • Prediction eddy covariance GPP with ordinary-least-squares multiple regression
        • Agreement
          • Discussion
            • Model agreement
            • MODIS discrepancies
            • Forcing variables and machine learning
            • Carbon flux
            • Ecosystem management
              • Conclusions
              • Appendix A
              • Appendix B
              • Appendix C
              • Code and data availability
              • Author contributions
              • Competing interests
              • Acknowledgement
              • Financial support
              • Review statement
              • References
Page 15: Machine learning estimates of eddy covariance carbon flux ... · Scrub is a xeric category of shrublands characterized by plants with small leaves; it is very thorny, and its biomass

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 381

11a and their corresponding correlation values (078 076and 082 for adjusted EML Santa Rita adjusted MODIS andmultiple-regression OLS) it could be argued that these mod-els were comparable However their θs plots present a dif-ferent perspective

4 Discussion

41 Model agreement

The Santa Rita model had higher θs when it extrapolatedfor low GPP values for the Bernal site suggesting that theSanta Rita model had a better skill when predicting GPPclose to 0 and even negative GPP values (Fig 10b) Al-though the Bernal and Santa Rita sites had similar vegeta-tion and climate classifications they are more than 1600 kmapart and rainfall monthly distribution is different The GPPseasonal cycle at Bernal started in February and steadily in-creased to a maximum during July and August (Table 1)At Santa Rita the GPP was low from January until mid-July (lt 05 gCO2 mminus2 dminus1) and then increased sharply to amaximum level in mid-August (Joiner and Yoshida 2020)What matters eventually for machine learning methods ishow well the predictor space rather than geographic spaceis sampled (Jung et al 2020) Incorporating data from morehumid (semiarid) sites could improve the GPP predictions bya machine learning method Not only more EC sites but alsosites representing a water availability gradient would be im-portant for semiarid ecosystems representing long spells andthe influence of oceanic oscillations and monsoon rains

All the models presented here used transient data to rep-resent GPP specifically at the 1 d time step Besides theradiation-related variables two sources of rainfall were usedas forcing variables but evapotranspiration (ET) was notused The MOD16A2 version 6 is available as an 8 d gap-filled product and could be included in EML or OLS mod-els However a more general representation of the carboncycling could be achieved when including variables that rep-resent annual or seasonal timescales of soil water evapo-ration or precipitation (Scott and Biederman 2019) Scottet al (2015) suggest that real lags between precipitation andproductivity that may impart legacy effects may also be par-tially masked by using ET as ET more carefully tracks pro-ductivity when soil moisture storage is accessed The occur-rence of off-season rainfall dry spells and carryover effectscould be parameterized as windowed events of a given dura-tion A window would be a period with distinct time bound-aries the window allows the grouping of records with similarfeatures The effect of the window at a given time could berepresented as moving weights as the point in time in ques-tion is closer or farther from the window

In our study the metric to assess model agreement wasthe probability of agreement (θs) and their bias parame-ters Many other metrics can be used to evaluate model per-

formance such as the root mean square error r r2 or themodel efficiency factor (MEF) presented by Nash and Sut-cliffe (1970) In particular the MEF is a step in the Fluxnetdata processing pipeline (Pastorello et al 2020) A MEFvalue close to 1 represents a high correlation and lower bi-ases (Joiner and Yoshida 2020) Therefore the calculatedMEF for the Bernal site EML (098) would suggest very goodpredictive performance while the θs of 06 for this model in-dicates a more modest performance The θs for a particularvalue of the measurand is the probability that the differencebetween two measurements made by different systems fallswithin an interval that is deemed to be acceptable (Stevenset al 2015) In the present study we used 1 gCmminus2 dminus1 as acritical value defining an acceptable difference If this valueis smaller then the probability of agreement would decreasefor this same model In such a case it would be less likelythat the predictors agree considering that the θs takes intoaccount both the difference in the function means at a givenvalue of the measurand and the uncertainty in its estimation

42 MODIS discrepancies

Different authors have reported discrepancies betweenMCD17 and EC estimates of GPP in semiarid regions Ex-amining MODIS discrepancies in these ecosystems is im-portant because the errors induced by cloud cover are ex-pected to be minimal and other effects can be identified(Gebremichael and Barros 2006) The GPP of MOD17 didnot relate well (EC= 011+ 017MODIS r2

= 067) withestimates of EC GPP in semidesert vegetation of the Sahel(Tagesson et al 2017) With data from different types of veg-etation in the Heihe basin in China MODIS17 overestimatedthe GPP from EC (EC= 115+024MODIS r2

= 068 Cuiet al 2016) For scrub sites in Mexico the relation be-tween GPP calculated from EC and MOD17 was not good(MODIS= 38382+ 0467EC r2

= 06 Delgado-Balbuenaet al 2018) In arid and semiarid ecosystems in China op-timizing parameters of the MODIS GPP model with site-specific data improved the estimate to explain 91 of thevariation in the GPP of the data observed by EC (Wang et al2019) These same authors propose improving the land useclassification used by the MOD17 algorithm and recalibrat-ing light use efficiency parameters to solve the GPP estima-tion problem Gebremichael and Barros (2006) examined anopen shrubland site in a semiarid region of Sonora Mexicoand their analysis of the temporal evolution of the discrep-ancies with MODIS GPP suggested revisiting the light useefficiency parameterization especially the functional depen-dence on VPD and PAR and water stress or soil moistureavailability

The relationship between the GPP MODIS and the GPPEC presented in Sect 33 is an approximation because theuncertainty in the respiration component must be consid-ered The empirical relationship between nocturnal NEE andsoil temperature has been used to represent ecosystem res-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

382 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

piration (Reco) in order to separate the processes that con-tribute to daytime NEE (Richardson and Hollinger 2005Wofsy et al 1993) Nighttime NEE should be equal to therates of autotrophic and heterotrophic respiration while dur-ing daytime NEE should be equal to the combined rates ofcarboxylation and oxidation of Rubisco autotrophic respi-ration and heterotrophic respiration Then the GPP can becalculated as the difference between daytime NEE and Recoestimated through its relationship with temperature (Gouldenet al 1996) In the present study Reco was calculated basedon soil and air temperature following the procedure of Reich-stein et al (2005) implemented in REddyProc (Wutzler et al2018) Although it is possible to measure or model the parti-tion of respiration (Running et al 2004 Wang et al 2018)the presence of cacti complicates the calculation assum-ing that all nighttime flux represents ecosystem respiration(Owen et al 2016 Richardson and Hollinger 2005) Whilesoil respiration tends to be temperature-limited when soilmoisture is nonlimiting in temperate ecosystems in range-land ecosystems the controls of soil CO2 efflux were photo-synthesis soil temperature and moisture (Roby et al 2019)In our study the instrumentation did not include measure-ments of plant or soil respiration partition to validate theRecoestimates

A problem regarding data comparison from remote orbitalsensors and terrestrial observations is that different quanti-ties are fundamentally measured MODIS measures the radi-ation reflected by the earthrsquos surface in two spectral bands at250 m spatial resolution per pixel five bands at 500 m and29 bands at 1000 m The EC technique has a footprint tomeasure CO2 that varies dynamically in shape and size butis generally considered to be 1 km2 in this study the foot-print radius was 600 m To solve the scaling MODIS prod-ucts related to the carbon cycle have been validated with theEC technique and biometric measurements on several spa-tial scales using process-based ecosystem models and char-acterizing areas up to 47 km2 around the EC measuring tower(Cohen et al 2003)

At Bernal the vegetation was heterogeneous This situa-tion was represented by the heterogeneity in vegetation ac-tivity in the 4 pixels used spanning about 02 units of NDVIat the peak of the season activity (Fig 1) The standard errorof the IVI differed by 1 order of magnitude among speciesAlthough more important species had a lower standard errortheir means were similar indicating that they were equallyabundant at all sampling plots The higher importance ofsome species (Table 2) was explained by selective grazingndashbrowsing behavior and the dispersion caused by cattle by ei-ther ingesting or transporting seeds or plant parts (Belaynehand Tessema 2017) Regarding landscape heterogeneity thetower fetch was predominantly from the northeast capturingthe less heterogeneous area of the site but also the more ac-tive according to the NDVI (Fig 1)

The thorny scrub examined had two vegetation layersthe overstory layer mainly consisting of mesquite acacia

and cacti and the understory layer that included grasses andherbs Cattle preferentially graze the understory and becausethey eat using their tongue they will avoid browsing thornyspecies unlike goats or deer which use their lips Withoutgrazing management overtime the competition balance willfavor bush species resulting in encroaching and the under-story will be stressed only unpalatable species or those withtheir growing meristems very close to the ground would sur-vive Representing the structure and functioning of these twolayers using MODIS is possible (Liu et al 2017) RecentlyHill and Guerschman (2020) presented a MODIS product de-rived from MCD43A4 to estimate the fractions of photosyn-thetic and nonphotosynthetic vegetation and the remainingfraction of bare soil These developments could improve theMOD17 GPP estimates since its model represents a homoge-neous single vegetation layer All these considerations helpto understand the low θs between MOD17 estimates of GPPand EC-derived GPP

43 Forcing variables and machine learning

Forcing variables in this study were gridded meteorologi-cal data and MODIS spectral bands and products but thevariables mostly included in the ML algorithms were LAIand FPAR accounting for 364 and 549 of the vari-able importance index at Bernal and Santa Rita respectivelyin their leader board This result supported the view thatCO2 fluxes can be represented by ML algorithms exclu-sively using remotely sensed data (Tramontana et al 2016Joiner and Yoshida 2020) Using neural networks Joiner andYoshida (2020) showed that with only satellite reflectancefrom MODIS and top-of-atmosphere PAR it is possible tocapture a large fraction of GPP variability They argue thatvegetation indices may reduce the information content ofthe underlying reflectance when compressing the informa-tion from two or more bands into a single index In our studyreflectance bands were not often included in the ML modelsThis was most likely because MOD09GA and MODOCGAprovide estimates of surface reflectance uncorrected for theillumination and viewing geometry while the MCD43 usedin Joiner and Yoshida (2020) uses a surface bidirectional re-flectance distribution function (BRDF) model that improvesthe quality of surface albedo retrievals By propagating theBRDF correction in the MODIS processing pipeline the veg-etation indices would more likely relate better to the GPPFuture modeling efforts should use benchmark scenarios ac-cording to sets of forcing variables already identified as use-ful

44 Carbon flux

Although the carbon balance in ecosystems is influenced bydifferent factors such as soil type and amount of nutrientsthe relationship with soil temperature and humidity is partic-ularly strong (Anderson-Teixeira et al 2011 Hastings et al

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 383

2005) How much of the rainwater the system can retain orlose has been described as the leakiness of the system (Guer-schman et al 2009) More than immediate incident rainfallthe available soil moisture and its redistribution are impor-tant in semiarid ecosystems including steam flow prefer-ential flow paths hydraulic lift and others (Barron-Gaffordet al 2017) At Bernal when the sum of the precipitation ofthe current month and that of the previous month was con-sidered the correlation with NEE was minus07 suggesting thatcontinuous availability of soil moisture is important for theabsorption of CO2 in this environment This result is con-sistent with other studies in which the relationship betweenthe net productivity of the ecosystem (NEP) and precipita-tion is initially positive but is leveled from 1000 to 1500 mmannually (Xu et al 2014) The hydraulic redistribution ofwater from moist (deeper) to drier soils through plant rootstended to increase modeled annual ecosystem uptake of CO2this process was identified at US-SMR (Fu et al 2018 Scottet al 2003)

The leakiness is highly dependent upon vegetation frac-tional cover the proportion of the surface occupied by baresoil and vegetation photosynthetically active vegetation andnonphotosynthetically active vegetation such as litter woodand dead biomass (Guerschman et al 2009) All these vege-tation fractions have a water storage capacity and can reducethe amount of effective rainfall available to plant roots It ispossible that the canopy of the bushes completely interceptedthe rainfall in some months because the scrub can interceptup to 20 of the precipitation and its canopy storage ca-pacity is 097 mm (Mastachi-Loza et al 2010) Consideringonly the daily rain events greater than 5 mm the correlationbetween precipitation and NEE rose to minus072 In the presentstudy the interception of rain by vegetation surfaces was notcalculated but the results suggest that it would be impor-tant to explore the relationship between net precipitation andNEE

The average NEE at a global level is minus156plusmn284 gCmminus2 yrminus1 (Baldocchi 2014) The highest frequencyamong sites that measured NEE with EC occurs from minus200to minus300 gCmminus2 yrminus1 but in sites with biometric measure-ments the peak occurs at minus100 gCmminus2 yrminus1 (Xu et al2014) Using the daily averages of Table 2 the averageNEE during the measurement period was minus078 gCmminus2 dminus1

and annually would be minus2835 gCmminus2 yrminus1 This resultwas higher than the annual values of the induced grass-land and scrubland vegetation characterizing the SonoranDesert plains (138 and 130 gCmminus2 yrminus1 Hinojo-Hinojoet al 2019) In New Mexico NEE values measured withEC are between 35ndash50 gCmminus2 yrminus1 in desert grassland and344ndash355 gCmminus2 yrminus1 in mixed coniferous forest (Anderson-Teixeira et al 2011) In a dryer region the sarcocaulescentscrubland of Baja California in Mexico the NEE was minus39and minus52 gCmminus2 yrminus1 in 2002 and 2003 respectively (Hast-ings et al 2005) The NEE measured here was within therange of NEE 03plusmn 02 kgCmminus2 yrminus1 for grasslands and

shrublands in Mexico (Murray-Tortarolo et al 2016) Al-though the measurements of the present study had gaps andwere compared with annual studies we considered that thereported value of C was representative of the main season ofgrowth of this type of scrub

45 Ecosystem management

Overgrazing is an appreciation relative to the grazing produc-tive system where the forage resource is overused in a mixedshrubndashgrass ecosystem such as Bernal it usually refers tothe understory Overgrazing means that the plant regrowth isreadily grazed tillers and root reserves are lost and eventu-ally the plant may die Although the Bernal site was over-grazed the carbon fluxes indicated that the plant communitywas photosynthetically active in both dry and rainy seasonsIt is fair to assume that water in the soil was not limiting forthe deep-rooted bush species and that was the reason whyit was possible to maintain the photosynthetic function dur-ing rainless months However this primary production wouldnot have tangible benefits for ranchersrsquo production systemsince no edible biomass would be produced for the cattleFrom the point of view of carbon capture the system accu-mulated nonlabile biomass that would remain in the systemfor a longer time compared to a grassland ecosystem (al-though it would be necessary to determine the partition ofsaid shrub biomass) However the overgrazing condition af-fects the biomass of the understory roots and consequentlythe carbon pool in the soil

In the short term it can be thought that the estimated neg-ative carbon flows had a favorable effect on the environmen-tal agenda As time passes it is possible that the gaps be-tween the individual shrubs of the overstory expand and thiswould have had an effect on soil erosion It is also possiblethat the water stored in the soil profile used by the bushesgradually decreases to the point of causing drought changesin phenology and advancements of the desertification pro-cess There are many opportunities for ecology conservationand livestock-oriented management these may include con-trolled grazing or propagating native thornless shrub speciesIf ranchers do not identify a benefit in the vegetation thenthey will be tempted to remove it as it occurred at thestudy site Because of its wide coverage and availability theMODIS GPP product accuracy is important in representingthe carbon cycle raising awareness and monitoring advance-ment of environmental decisions

Although we found that EML was a good option for mod-eling the GPP of a site what is really needed to evaluate theperformance of semiarid ecosystems is a spatial representa-tion of the carbon flux This is a problem for an underrepre-sented area regarding instrumented EC towers However theEML could be designed to take into account the explanatoryvariables in a spatiotemporal continuity

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

384 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

5 Conclusions

The best modeling approach was the ensemble of ma-chine learning The second option to estimate GPP was themultiple-regression OLS and the third alternative were theMODIS estimates Machine learning was a good option topredict GPP in the local context where it was generated oth-erwise its performance was not good This was demonstratedwhen using the EML from Santa Rita and trying to predictfluxes at Bernal Nevertheless a machine learning modelwould be useful for gap filling or the evaluation of GPP atthe same site The GPP estimates of a given model can be ad-justed using the bias parameters of the probability of agree-ment to improve the relationship Further research explor-ing the usefulness of EML to predict GPP in other contextscould use EML to calibrate mechanistic model parametershybrid approaches integrating EML and biologically mean-ingful mechanistic models or EML representing the spatialvariability of the landscape

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 385

Appendix A

Table A1 Threshold of velocity friction (ulowast) above which nighttime fluxes were considered valid Estimates were obtained using the Rpackage REddyProc We used only the flux records with ulowast equal to or higher than the corresponding mean ulowast threshold for each yearBound values are a 95 confidence interval NA not applicable

Aggregation method Year Season Mean Lower bound 5 Upper bound 95

ulowast (msminus1)

Single NA NA 0193 0140 0209Year 2017 NA 0194 0128 0227Year 2018 NA 0193 0132 0225Season 2017 2016012 0194 0128 0227Season 2017 2017003 0215 0141 0297Season 2017 2017006 0194 0128 0227Season 2017 2017009 0181 0144 0208Season 2018 2017012 0149 0116 0232Season 2018 2018003 0271 0134 0297Season 2018 2018006 0193 0132 0221

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

386 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix B

Table B1 MODIS reflectance bands and products database

Product name Satellite layer Spatial Temporal Spectralresolution resolution coverage

(d) (nm)

MOD09GA Band 1 500 m 1 620ndash670Band 2 841ndash876Band 3 459ndash479Band 4 545ndash565Band 5 1230ndash1250Band 6 1628ndash1652Band 7 2105ndash2155

MODOCGA Band 8 1 km 405ndash420Band 9 438ndash448Band 10 483ndash493Band 11 526ndash536Band 12 546ndash556Band 13 662ndash672Band 14 673ndash683

MOD17A2H Gross primary production (GPP) 8

Net photosynthesis (PsnNet)

MOD15A2H Fraction of photosynthetically active radiation (Fpar) na

Leaf area index (LAI)

MOD11A2 Land surface temperature and emissivity of day (LST day) 1 km

Land surface temperature and emissivity of night (LST night)

MYD13A1 Enhanced vegetation index (EVI) 500 m 16

MYD15A2H Fraction of photosynthetically active radiation (Fpar) 8

Leaf area index (LAI)

MYD11A2 Land surface temperature and emissivity of day (LST day) 1 km na

Land surface temperature and emissivity of night (LST night)

MCD15A2H Fraction of photosynthetically active radiation (Fpar) 500 m

Leaf area index (LAI)

na not applicable

Table B2 Daily vegetation indexes computed using the MODIS reflectance bands described in Table 1

Index Formula Reference

Simple ratio (SimpleR) SimpleR= Band 2Band 1

(Hill et al 2006)Moisture stress (MoistS) MoistS= Band 6

Band 2

Disease stress index (DSI) DSI= Band 2+Band 4Band 6+Band 1

Redndashgreen ratio index (RGRI) RGRI= Band 1Band 4

Normalized difference vegetation index (NDVI) NDVI= Band 2minusBand 1Band 2+Band 1

Normalized difference water index (NDVI_w) NDVI_w =Band 2minusBand 5Band 2+Band 5

Green leaf index (GLI) GLI= 2lowastBand 11minusBand 14minusBand 92lowastBand 11+Band 14+Band

(Henrich et al 2012)Green atmospherically resistant vegetation (GARI) GARI= Band 5minus(Band 11minus(Band 9minusBand 14))

Band 5minus(Band 11+(Band 9minusBand 14))

Enhanced vegetation index (EVI) EVI= 25lowast Band 2minusBand 1Band 2+(6lowastBand 1)minus(75lowastBand 9)+1

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 387

Table B3 Daymet meteorological database

Variable Spatial resolution Temporal resolution Reference

Precipitation (Dayprc)

1 km Daily (Thornton et al 2017)Shortwave radiation (Daysrad)Maximum air temperature (DayTmax)Minimum air temperature (DayTmin)Water vapor pressure (Dayvp)

Table B4 Precipitation data

Satellite Product name Spatial resolution Temporal resolution

Global Precipitation Measurement (GPM) 3IMERGDF v006 010Daily

Tropical Rainfall Measuring Mission (TRMM) 3B42 v007 025

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

388 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix C

Table C1 Analysis of variance for GGP derived from EC data atthe Bernal site Variable details are described in Appendix B

Model term Coefficient SEa DFb SSc F p VIFd

Regression 14 178048Constant minus1318 648 minus203 004R_08_40542e

minus212 060 1 2680 minus352 0 220R_13_66267e

minus286 123 1 1161 minus232 002 155R_14_67368e 917 259 1 2706 353 0 122RGRI6

minus083 037 1 1100 minus225 003 143GARIf 201 108 1 755 187 006 237EVIMYDg 853 201 1 3903 425 0 294NDVI_wf 390 155 1 1367 251 001 271PsnNetg 0013 0 1 7530 590 0 383LstNgtMYDg 005 002 1 818 194 005 201Dayprch 014 003 1 3590 407 0 192Daysradh 001 0 1 5121 486 0 501DayTmaxh

minus024 006 1 3658 minus411 0 689DayTminh 018 005 1 3222 386 0 496TRMMi minus004 002 1 1087 minus224 003 141Error 403 87297Total 417 265345

a Standard error of coefficient b Degrees of freedom c Adjusted sum of squares d Variance inflation factore Variable name denotes the band number and spectral bandwidth of MODIS (Moderate Resolution ImagingSpectroradiometer) f Vegetation indices RGRI is redndashgreen ratio index and GARI is green atmosphericallyresistant vegetation index details of formula are described in Appendix B g Layers from MODIS productsEVIMYD is enhanced vegetation index from MYD13 PsnNet is photosynthesis product form MOD17LstNgtMYD is nighttime land surface temperature emissivity from MYD11 h Variables obtained fromDaymet daily dataset DayTmin is minimum temperature DayTmax is maximum temperature Daysrad isshortwave radiation Dayprc is precipitation i Daily rainfall rate from 3B42 TRMM (Tropical RainfallMeasuring Mission)

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 389

Code and data availability Database and programming code areavailable at httpsdoiorg105281zenodo3598595 (Guevara-Escobar et al 2020)

Author contributions AGE contributed to conceptualization for-mal analysis investigation methodology software supervisionvalidation visualization and writing of the original draft as wellas review and editing EGS was in charge of funding acquisitionproject administration and writing and review MCJ contributed toinvestigation methodology validation visualization and writing ofthe original draft as well as review editing and communicationMEQB helped with investigation validation writing and reviewHSA contributed with conceptualization writing and review ICAacutehelped with investigation writing and review VHCS contributedwith writing review and funding acquisition

Competing interests The authors declare that they have no conflictof interest

Acknowledgement We thank NASA and Ameriflux for data avail-ability Data filtering was carried out by the cluster Abacus ICentre of Applied Mathematics and High Performance ComputingABACUS-CINVESTAV

Financial support This research has been supported by CONA-CyT ndash SEMARNAT (grant no 2014-1-(249407))

Review statement This paper was edited by Paul Stoy and re-viewed by three anonymous referees

References

Aguirre-Diacuteaz G J Aguilloacuten-Robles A Tristaacuten-Gonzaacutelez MLabarthe-Hernaacutendez G Loacutepez-Martiacutenez M Bellon Hand Nieto-Obregoacuten J Geologic setting of the Pentildeade Bernal Natural Monument Quereacutetaro Meacutexico Anendogenous volcanic dome Geosphere 9 557ndash571httpsdoiorg101130GES008431 2013

Anav A Friedlingstein P Beer C Ciais P Harper AJones C Murray-Tortarolo G Papale D Parazoo N Cand Peylin P Spatiotemporal patterns of terrestrial grossprimary production A review Rev Geophys 53 785ndash818httpsdoiorg1010022015RG000483 2015

Anderson-Teixeira K J Delong J P Fox A M Brese D Aand Litvak M E Differential responses of production and res-piration to temperature and moisture drive the carbon balanceacross a climatic gradient in New Mexico Glob Change Biol17 410ndash424 httpsdoiorg101111j1365-2486201002269x2011

Baldocchi D Measuring fluxes of trace gases and energy be-tween ecosystems and the atmospherendashthe state and future of the

eddy covariance method Glob Change Biol 20 3600ndash3609httpsdoiorg101111gcb12649 2014

Barron-Gafford G A Sanchez-Cantildeete E P Minor R LHendryx S M Lee E Sutter L F Tran N Parra EColella T Murphy P C Hamerlynck E P Kumar P andScott R L Impacts of hydraulic redistribution on grassndashtreecompetition vs facilitation in a semi-arid savanna New Phytol215 1451ndash1461 httpsdoiorg101111nph14693 2017

Belayneh A and Tessema Z K Mechanisms of bush encroach-ment and its inter-connection with rangeland degradation insemi-arid African ecosystems a review J Arid Land 9 299ndash312 httpsdoiorg101007s40333-016-0023-x 2017

Bland J M and Altman D G Statistical methods for assessingagreement between two methods of clinical measurement Int JNurs Stud 47 931ndash936 2010

Boettiger C and Eddelbuettel D An introduction to rockerDocker containers 639 for r ArXiv Prepr ArXiv171003675 6402017

Bonilla-Moheno M and Aide T M Beyond deforestationLand cover transitions in Mexico Agric Syst 178 102734httpsdoiorg101016jagsy2019102734 2020

Booker K Huntsinger L Bartolome J W Sayre N Fand Stewart W What can ecological science tell us aboutopportunities for carbon sequestration on arid rangelands inthe United States Glob Environ Change 23 240ndash251httpsdoiorg101016jgloenvcha201210001 2013

Butzer K W and Butzer E K The lsquonaturalrsquovegetation ofthe Mexican Bajiacuteo archival documentation of a 16th-century savanna environment Quat Int 43 161ndash172httpsdoiorg101016S1040-6182(97)00032-3 1997

CICESE C de I C y de E S de E Base de datos climatoloacutegicanacional (CLICOM) Villa Bernal Quereacutetaro available at httpclicom-mexcicesemx (last access 28 August 2019) 2015

Cohen W B Maiersperger T K Yang Z Gower S TTurner D P Ritts W D Berterretche M and Run-ning S W Comparisons of land cover and LAI esti-mates derived from ETM+ and MODIS for four sites inNorth America A quality assessment of 20002001 provi-sional MODIS products Remote Sens Environ 88 233ndash255httpsdoiorg101016jrse200306006 2003

Cui T Wang Y Sun R Qiao C Fan W Jiang GHao L and Zhang L Estimating vegetation primaryproduction in the Heihe River Basin of China withmulti-source and multi-scale data PloS One 11 1ndash20httpsdoiorg101371journalpone0153971 2016

Curtis J T and McIntosh R P The Interrelations of Certain An-alytic and Synthetic Phytosociological Characters Ecology 31434ndash455 httpsdoiorg1023071931497 1950

Delgado-Balbuena J Yepez E A Aacutengeles-Peacuterez G Aguirre-Gutierrez C Arredondo T Ayala-Nintildeo F Bullock S HCastellanos A E Cueva A Figueroa-Espinoza B Garatuza-Payaacuten J Hinojo-Hinojo C Maya-Delgado Y Meacutendez-Barroso L Oechel W Paz-Pellat F Peacuterez-Ruiacutez E R Ro-driacuteguez J C Rojas-Robles N Saacutenchez-Mejiacutea Z M Uuh-Sonda J Vargas R Verduzco V S Vivoni E R andWatts C Flujos anuales de carbono en ecosistemas terrestres deMeacutexico in Estado Actual del Conocimiento del Ciclo del Car-bono y sus Interacciones en Meacutexico Siacutentesis a 2018 p 686Texcoco Estado de Meacutexico Meacutexico 2018

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

390 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Eshel G Dayalu A Wofsy S C Munger J W andTziperman E Listening to the Forest An ArtificialNeural Network-Based Model of Carbon Uptake at Har-vard Forest J Geophys Res-Biogeo 124 461ndash478httpsdoiorg1010292018JG004791 2019

Fu C Wang G Bible K Goulden M L Saleska S RScott R L and Cardon Z G Hydraulic redistributionaffects modeled carbon cycling via soil microbial activityand suppressed fire Glob Change Biol 24 3472ndash3485httpsdoiorg101111gcb14164 2018

Gebremichael M and Barros A P Evaluation of MODISGross Primary Productivity (GPP) in tropical mon-soon regions Remote Sens Environ 100 150ndash166httpsdoiorg101016jrse200510009 2006

Gitelson A A Peng Y Arkebauer T J and Suyker A EProductivity absorbed photosynthetically active radiation andlight use efficiency in crops Implications for remote sensingof crop primary production J Plant Physiol 177 100ndash109httpsdoiorg101016jjplph201412015 2015

Glenn E P Scott R L Nguyen U and Nagler P L Wide-area ratios of evapotranspiration to precipitation in monsoon-dependent semiarid vegetation communities J Arid Envi-ron 117 84ndash95 httpsdoiorg101016jjaridenv2015020102015

Goldstein A Turner W R Spawn S A Anderson-Teixeira K J Cook-Patton S Fargione J Gibbs H KGriscom B Hewson J H and Howard J F Protecting ir-recoverable carbon in Earthrsquos ecosystems Nat Clim Change10 1ndash9 httpsdoiorg101038s41558-020-0738-8 2020

Goulden M L Munger J W Fan S Daube B Cand Wofsy S C Measurements of carbon sequestra-tion by long-term eddy covariance Methods and a criti-cal evaluation of accuracy Glob Change Biol 2 169ndash182httpsdoiorg101111j1365-24861996tb00070x 1996

Grossiord C Buckley T N Cernusak L A Novick K A Poul-ter B Siegwolf R T Sperry J S and McDowell N GPlant responses to rising vapor pressure deficit New Phytol226 1550ndash1566 2020

Guerschman J P Hill M J Renzullo L J Barrett D JMarks A S and Botha E J Estimating fractional coverof photosynthetic vegetation non-photosynthetic vegetation andbare soil in the Australian tropical savanna region upscaling theEO-1 Hyperion and MODIS sensors Remote Sens Environ113 928ndash945 httpsdoiorg101016jrse200901006 2009

Guevara-Escobar A Gonzaacutelez-Sosa E Cervantes-Jimeacutenez MSuzaacuten-Azpiri H Queijeiro-Bolantildeos M E Carrillo-AacutengelesI and Cambroacuten-Sandoval V H Machine learning estimates ofeddy covariance carbon flux in a scrub in the Mexican highlandZenodo httpsdoiorg105281zenodo3598595 2020

Hall P Gill N Kurka M Phan W and Bartz A MachineLearning Interpretability with H2O Driverless AI edited byBartz A H2Oai Inc California US 2019

Hastings S J Oechel W C and Muhlia-Melo A Diur-nal seasonal and annual variation in the net ecosystem CO2exchange of a desert shrub community (Sarcocaulescent) inBaja California Mexico Glob Change Biol 11 927ndash939httpsdoiorg101111j1365-2486200500951x 2005

Henrich V Krauss G Goumltze C and Sandow C Entwicklungeiner Datenbank fuumlr Fernerkundungsindizes vol 45 AK Fern-erkundung Bochum 2012

Hill M Held A Leuning R Coops N Hughes D andCleugh H MODIS spectral signals at a flux tower site Rela-tionships with high-resolution data and CO2 flux and light useefficiency measurements Remote Sens Environ 103 351ndash368httpsdoiorg101016jrse200506015 2006

Hill M J and Guerschman J P The MODIS Global VegetationFractional Cover Product 2001ndash2018 Characteristics of Vegeta-tion Fractional Cover in Grasslands and Savanna Woodlands Re-mote Sens 12 406 httpsdoiorg103390rs12030406 2020

Hinojo-Hinojo C Castellanos A E Huxman T Ro-driguez J C Vargas R Romo-Leoacuten J R and Bie-derman J A Native shrubland and managed buffelgrasssavanna in drylands Implications for ecosystem carbonand water fluxes Agr Forest Meteorol 268 269ndash278httpsdoiorg101016jagrformet201901030 2019

Jaumlrvi L Havu M Ward H C Bellucco V McFad-den J P Toivonen T Heikinheimo V Kolari P Ri-ikonen A and Grimmond C S B Spatial modeling oflocal-scale biogenic and anthropogenic carbon dioxide emis-sions in Helsinki J Geophys Res-Atmos 124 8363ndash8384httpsdoiorg1010292018JD029576 2019

Joiner J and Yoshida Y Satellite-based reflectances capture largefraction of variability in global gross primary production (GPP)at weekly time scales Agr Forest Meteorol 291 108092httpsdoiorg101016jagrformet2020108092 2020

Jung M Schwalm C Migliavacca M Walther S Camps-VallsG Koirala S Anthoni P Besnard S Bodesheim P Carval-hais N Chevallier F Gans F Goll D S Haverd V KoumlhlerP Ichii K Jain A K Liu J Lombardozzi D Nabel J EM S Nelson J A OrsquoSullivan M Pallandt M Papale DPeters W Pongratz J Roumldenbeck C Sitch S TramontanaG Walker A Weber U and Reichstein M Scaling carbonfluxes from eddy covariance sites to globe synthesis and eval-uation of the FLUXCOM approach Biogeosciences 17 1343ndash1365 httpsdoiorg105194bg-17-1343-2020 2020

Kang M Ichii K Kim J Indrawati Y M Park J Moon MLim J-H and Chun J-H New Gap-Filling Strategies forLong-Period Flux Data Gaps Using a Data-Driven ApproachAtmosphere 10 568 httpsdoiorg103390atmos101005682019

Lal R Griffin M Apt J Lave L and Morgan M GManaging Soil Carbon Science 304 393ndash393httpsdoiorg101126science1093079 2004

LI-COR B Eddypro Software instruction manual Nebraska USversion 70 2019

Liu Y Liu R Pisek J and Chen J M Separating overstoryand understory leaf area indices for global needleleaf and de-ciduous broadleaf forests by fusion of MODIS and MISR dataBiogeosciences 14 1093ndash1110 httpsdoiorg105194bg-14-1093-2017 2017

Long S P Marshall-Colon A and Zhu X-G Meetingthe Global Food Demand of the Future by EngineeringCrop Photosynthesis and Yield Potential Cell 161 56ndash66httpsdoiorg101016jcell201503019 2015

Ma X Huete A Yu Q Restrepo-Coupe N Beringer J Hut-ley L B Kanniah K D Cleverly J and Eamus D Param-

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 391

eterization of an ecosystem light-use-efficiency model for pre-dicting savanna GPP using MODIS EVI Remote Sens Environ154 253ndash271 2014

Ma X Migliavacca M Wirth C Bohn F J Huth ARichter R and Mahecha M D Monitoring Plant FunctionalDiversity Using the Reflectance and Echo from Space RemoteSens 12 1248 httpsdoiorg103390rs12081248 2020

Marcolla B Roumldenbeck C and Cescatti A Patterns and con-trols of inter-annual variability in the terrestrial carbon budgetBiogeosciences 14 3815ndash3829 httpsdoiorg105194bg-14-3815-2017 2017

Mastachi-Loza C A Gonzaacutelez-Sosa E Becerril-Pintildea R andBraud I Peacuterdidas por intercepcioacuten en mezquite (Prosopis laevi-gata) y huizache (Acacia farnesiana) de la regioacuten semiaacuterida delcentro de Meacutexico Tecnol Cienc Agua 1 103ndash120 2010

Mauder M and Foken T Documentation and instruction man-ual of the eddy-covariance software package TK3 UniversitaumltBayreuth Abteilung Mikrometeorologie Bayreuth Germany2011

McClaran M A century of vegetation change on the Santa Rita Ex-perimental Range in Santa Rita Experimental Range 100 years(1903 to 2003) to accomplishments and contributions edited byMcClaran M P Ffolliott P F and Edminister C B Tech Co-ords USDA Forest service proceedings RMRS-P-30 TucsonAZ US 16ndash33 2003

McCulley R Jobbagy E Pockman W and Jackson R Nu-trient uptake as a contributing explanation for deep rootingin arid and semi-arid ecosystems Oecologia 141 620ndash628httpsdoiorg101007s00442-004-1687-z 2004

McElreath R Statistical Rethinking A Bayesian Course with Ex-amples in R and STAN CRC Press Boca Raton FL US 2020

Michailidis M How Driverless AI Prevents Overfitting and Leak-age Open Source Lead AI ML available at httpswwwh2oaiblogdriverless-ai-prevents-overfitting-leakage (last access7 July 2020) 2018

Mitchell R and Frank E Accelerating the XGBoost algo-rithm using GPU computing PeerJ Comput Sci 3 e127httpsdoiorg107717peerj-cs127 2017

Monteith J Solar radiation and productivity in tropical ecosys-tems J Appl Ecol 9 747ndash766 1972

Murray-Tortarolo G Friedlingstein P Sitch S Jaramillo V JMurguiacutea-Flores F Anav A Liu Y Arneth A Arvanitis AHarper A Jain A Kato E Koven C Poulter B Stocker BD Wiltshire A Zaehle S and Zeng N The carbon cycle inMexico past present and future of C stocks and fluxes Biogeo-sciences 13 223ndash238 httpsdoiorg105194bg-13-223-20162016

Nash J E and Sutcliffe J V River flow forecasting through con-ceptual models part I ndash A discussion of principles J Hydrol 10282ndash290 1970

Owen N A Choncubhair Oacute N Males J del Real Laborde J IRubio-Corteacutes R Griffiths H and Lanigan G Eddy covari-ance captures four-phase crassulacean acid metabolism (CAM)gas exchange signature in Agave Plant Cell Environ 39 295ndash309 httpsdoiorg101111pce12610 2016

Pasetto D Arenas-Castro S Bustamante J Casagrandi RChrysoulakis N Cord A F Dittrich A Domingo-Marimon C El Serafy G and Karnieli A Integration ofsatellite remote sensing data in ecosystem modelling at local

scales Practices and trends Methods Ecol Evol 9 1810ndash1821httpsdoiorg1011112041-210X13018 2018

Pastorello G Trotta C and Canfora E The FLUXNET2015dataset and the ONEFlux processing pipeline for eddy covari-ance data Sci Data 7 225 pp httpsdoiorg101038s41597-020-0534-3 2020

Pearcy R W Heterogeneous light environments in Handbookof Functional Plant Ecology edited by Pugnaire I F and Val-ladares F CRC Press Boca Ratoacuten FL US 270ndash314 2007

Piao S Liu Q Chen A Janssens I A Fu Y Dai J Liu LLian X Shen M and Zhu X Plant phenology and global cli-mate change Current progresses and challenges Glob ChangeBiol 25 1922ndash1940 httpsdoiorg101111gcb14619 2019

R Development Core Team R A language and environment forstatistical computing R Foundation for Statistical Computingavailable at httpswwwr-projectorg (last access 15 October2015) 2009

Reichstein M Falge E Baldocchi D Papale D Aubinet MBerbigier P Bernhofer C Buchmann N Gilmanov Tand Granier A On the separation of net ecosystem ex-change into assimilation and ecosystem respiration reviewand improved algorithm Glob Change Biol 11 1424ndash1439httpsdoiorg101111j1365-24862005001002x 2005

Richardson A D and Hollinger D Y Statistical model-ing of ecosystem respiration using eddy covariance datamaximum likelihood parameter estimation and Monte Carlosimulation of model and parameter uncertainty applied tothree simple models Agr Forest Meteorol 131 191ndash208httpsdoiorg101016jagrformet200505008 2005

Richardson A D Hollinger D Y Shoemaker J K Hughes HSavage K and Davidson E A Six years of ecosystem-atmosphere greenhouse gas fluxes measured in a sub-borealforest Sci Data 6 1ndash15 httpsdoiorg101038s41597-019-0119-1 2019

Roby M C Scott R L Barron-Gafford G A Hamerlynck E Pand Moore D J P Environmental and Vegetative Controls onSoil CO2 Efflux in Three Semiarid Ecosystems Soil Syst 3 6httpsdoiorg103390soilsystems3010006 2019

Running S W and Zhao M Daily GPP and annual NPP(MOD17A2A3) products NASA Earth Observing SystemMODIS land algorithm Univ of Mont Missoula Mont US2015

Running S W Nemani R R Heinsch F A Zhao MReeves M and Hashimoto H A continuous satellite-derived measure of global terrestrial primary productionBioscience 54 547ndash560 httpsdoiorg1016410006-3568(2004)054[0547ACSMOG]20CO2 2004

Rzedowski J Vegetacioacuten de Meacutexico Meacutexico Editor Limusa1978

Scott R L and Biederman J A Critical Zone Water Balance Over13 Years in a Semiarid Savanna Water Resour Res 55 574ndash588 httpsdoiorg1010292018WR023477 2019

Scott R L Watts C Payan J G Edwards E Goodrich D CWilliams D and James Shuttleworth W The understory andoverstory partitioning of energy and water fluxes in an opencanopy semiarid woodland Agr Forest Meteorol 114 127ndash139 httpsdoiorg101016S0168-1923(02)00197-1 2003

Scott R L Edwards E A Shuttleworth W J Huxman T EWatts C and Goodrich D C Interannual and seasonal vari-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

392 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

ation in fluxes of water and carbon dioxide from a ripar-ian woodland ecosystem Agr Forest Meteorol 122 65ndash84httpsdoiorg101016jagrformet200309001 2004

Scott R L Biederman J A Hamerlynck E P and Barron-Gafford G A The carbon balance pivot point of southwest-ern US semiarid ecosystems Insights from the 21st centurydrought J Geophys Res-Biogeo 120 2612ndash2624 2015

Segerstrom K Geology of the Bernal-Jalpan Area Estado deQueretaro Mexico Geol Surv Bull 1104 19ndash82 1961

Soper F M McCalley C K Sparks K and Sparks J P Soilcarbon dioxide emissions from the Mojave desert Isotopic evi-dence for a carbonate source Geophys Res Lett 44 245ndash251httpsdoiorg1010022016GL071198 2017

Stevens N T Steiner S H and MacKay R J Assessing agree-ment between two measurement systems An alternative to thelimits of agreement approach Stat Methods Med Res 262487ndash2504 httpsdoiorg1011770962280215601133 2015

Tagesson T Ardouml J Cappelaere B Kergoat L AbdiA Horion S and Fensholt R Modelling spatial andtemporal dynamics of gross primary production in theSahel from earth-observation-based photosynthetic capacityand quantum efficiency Biogeosciences 14 1333ndash1348httpsdoiorg105194bg-14-1333-2017 2017

Thornton P Thornton M Mayer B Wei Y Devarakonda RVose R and Cook R Daymet Daily Surface Weather Data ona 1-km Grid for North America Version 3 Oak Ridge NationalLab (ORNL) Oak Ridge TN US 2017

Tramontana G Jung M Schwalm C R Ichii K Camps-VallsG Raacuteduly B Reichstein M Arain M A Cescatti A KielyG Merbold L Serrano-Ortiz P Sickert S Wolf S andPapale D Predicting carbon dioxide and energy fluxes acrossglobal FLUXNET sites with regression algorithms Biogeo-sciences 13 4291ndash4313 httpsdoiorg105194bg-13-4291-2016 2016rdquo

Truong A Walters A Goodsitt J Hines K Bruss C B andFarivar R Towards automated machine learning Evaluationand comparison of AutoML approaches and tools in Proceed-ings of the 2019 IEEE 31st International Conference on Toolswith Artificial Intelligence (ICTAI) Portland OR US 1471ndash1479 2019

Verbist K Santibantildeez F Gabriels D and Soto G Atlas dezonas aacuteridas de Ameacuterica Latina y El Caribe Donoso MC(Coord) UNESCO-PHI and CAZALAC Montevideo Uruguay2010

Wang H Li X Ma M and Geng L Improving esti-mation of gross primary production in dryland ecosystemsby a model-data fusion approach Remote Sens 11 225httpsdoiorg103390rs11030225 2019

Wang X Ma M Li X Song Y Tan J Huang G Zhang ZZhao T Feng J and Ma Z Validation of MODIS-GPP prod-uct at 10 flux sites in northern China Int J Remote Sens 34587ndash599 httpsdoiorg101080014311612012715774 2013

Wang X Wang H Li X and Ran Y Photosynthesis (NPPNEP Respiration) Obs Meas Ecohydrol Process 1ndash30httpsdoiorg101007978-3-662-47871-4_10-1 2018

Wheeler C W Archer S R Asner G P and McMurtry C RClimaticedaphic controls on soil carbonnitrogen response toshrub encroachment in desert grassland Ecol Appl 17 1911ndash1928 httpsdoiorg10189006-15801 2007

Wilcox B P Birt A Fuhlendorf S D and Archer S R Emerg-ing frameworks for understanding and mitigating woody plantencroachment in grassy biomes Curr Opin Env Sust 32 46ndash52 httpsdoiorg101016jcosust201804005 2018

Wilson J B Cover plus ways of measuring plant canopiesand the terms used for them J Veg Sci 22 197ndash206httpsdoiorg101111j1654-1103201001238x 2011

Wofsy S Goulden M Munger J Fan S-M Bakwin PDaube B Bassow S and Bazzaz F Net exchange ofCO2 in a mid-latitude forest Science 260 1314ndash1317httpsdoiorg101126science26051121314 1993

Wu C Niu Z and Gao S Gross primary production estimationfrom MODIS data with vegetation index and photosyntheticallyactive radiation in maize J Geophys Res-Atmos 115 D12127httpsdoiorg1010292009JD013023 2010

Wutzler T Lucas-Moffat A Migliavacca M Knauer JSickel K Šigut L Menzer O and Reichstein MBasic and extensible post-processing of eddy covarianceflux data with REddyProc Biogeosciences 15 5015ndash5030httpsdoiorg105194bg-15-5015-2018 2018

Xiao H McDonald-Madden E Sabbadin R Peyrard NDee L E and Chadegraves I The value of understanding feed-backs from ecosystem functions to species for managing ecosys-tems Nat Commun 10 1ndash10 httpsdoiorg101038s41467-019-11890-7 2019

Xu B Yang Y Li P Shen H and Fang J Globalpatterns of ecosystem carbon flux in forests A biometricdata-based synthesis Global Biogeochem Cy 28 962ndash973httpsdoiorg1010022013GB004593 2014

Yao J Liu H Huang J Gao Z Wang G Li D Yu H andChen X Accelerated dryland expansion regulates future vari-ability in dryland gross primary production Nat Commun 111ndash10 httpsdoiorg101038s41467-020-15515-2 2020

Yepez E A Williams D G Scott R L and Lin GPartitioning overstory and understory evapotranspiration ina semiarid savanna woodland from the isotopic composi-tion of water vapor Agr Forest Meteorol 119 53ndash68httpsdoiorg101016S0168-1923(03)00116-3 2003

Yona L Cashore B Jackson R B Ometto J and Brad-ford M A Refining national greenhouse gas inventories Am-bio 1ndash6 httpsdoiorg101007s13280-019-01312-9 2020

Zeng X Zeng X and Barlage M Growing temperate shrubsover arid and semiarid regions in the Community Land ModelndashDynamic Global Vegetation Model Global Biogeochem Cy 22GB3003 httpsdoiorg1010292007GB003014 2008

Zhang A Jia G Epstein H E and Xia J ENSO elicits oppos-ing responses of semi-arid vegetation between Hemispheres SciRep 7 42281 httpsdoiorg101038srep42281 2017

Zhang L Xiao J Zheng Y Li S and Zhou Y Increased carbonuptake and water use efficiency in global semi-arid ecosystemsEnviron Res Lett 15 034022 httpsdoiorg1010881748-9326ab68ec 2020

Zhu X-G Long S P and Ort D R What is the max-imum efficiency with which photosynthesis can convert so-lar energy into biomass Curr Opin Biotech 19 153ndash159httpsdoiorg101016jcopbio200802004 2008

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

  • Abstract
  • Introduction
  • Materials and methods
    • Site description
    • Eddy covariance measurements
    • Flux data processing
    • Remotely sensed data
    • MODIS algorithm for GPP
    • Santa Rita site dataset
    • Modeling
    • Variable importance
    • Model agreement
      • Results
        • Eddy covariance fluxes at Bernal
        • Machine learning ensembles as predictors of eddy covariance GPP
        • MODIS as predictor of eddy covariance GPP
        • Prediction eddy covariance GPP with ordinary-least-squares multiple regression
        • Agreement
          • Discussion
            • Model agreement
            • MODIS discrepancies
            • Forcing variables and machine learning
            • Carbon flux
            • Ecosystem management
              • Conclusions
              • Appendix A
              • Appendix B
              • Appendix C
              • Code and data availability
              • Author contributions
              • Competing interests
              • Acknowledgement
              • Financial support
              • Review statement
              • References
Page 16: Machine learning estimates of eddy covariance carbon flux ... · Scrub is a xeric category of shrublands characterized by plants with small leaves; it is very thorny, and its biomass

382 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

piration (Reco) in order to separate the processes that con-tribute to daytime NEE (Richardson and Hollinger 2005Wofsy et al 1993) Nighttime NEE should be equal to therates of autotrophic and heterotrophic respiration while dur-ing daytime NEE should be equal to the combined rates ofcarboxylation and oxidation of Rubisco autotrophic respi-ration and heterotrophic respiration Then the GPP can becalculated as the difference between daytime NEE and Recoestimated through its relationship with temperature (Gouldenet al 1996) In the present study Reco was calculated basedon soil and air temperature following the procedure of Reich-stein et al (2005) implemented in REddyProc (Wutzler et al2018) Although it is possible to measure or model the parti-tion of respiration (Running et al 2004 Wang et al 2018)the presence of cacti complicates the calculation assum-ing that all nighttime flux represents ecosystem respiration(Owen et al 2016 Richardson and Hollinger 2005) Whilesoil respiration tends to be temperature-limited when soilmoisture is nonlimiting in temperate ecosystems in range-land ecosystems the controls of soil CO2 efflux were photo-synthesis soil temperature and moisture (Roby et al 2019)In our study the instrumentation did not include measure-ments of plant or soil respiration partition to validate theRecoestimates

A problem regarding data comparison from remote orbitalsensors and terrestrial observations is that different quanti-ties are fundamentally measured MODIS measures the radi-ation reflected by the earthrsquos surface in two spectral bands at250 m spatial resolution per pixel five bands at 500 m and29 bands at 1000 m The EC technique has a footprint tomeasure CO2 that varies dynamically in shape and size butis generally considered to be 1 km2 in this study the foot-print radius was 600 m To solve the scaling MODIS prod-ucts related to the carbon cycle have been validated with theEC technique and biometric measurements on several spa-tial scales using process-based ecosystem models and char-acterizing areas up to 47 km2 around the EC measuring tower(Cohen et al 2003)

At Bernal the vegetation was heterogeneous This situa-tion was represented by the heterogeneity in vegetation ac-tivity in the 4 pixels used spanning about 02 units of NDVIat the peak of the season activity (Fig 1) The standard errorof the IVI differed by 1 order of magnitude among speciesAlthough more important species had a lower standard errortheir means were similar indicating that they were equallyabundant at all sampling plots The higher importance ofsome species (Table 2) was explained by selective grazingndashbrowsing behavior and the dispersion caused by cattle by ei-ther ingesting or transporting seeds or plant parts (Belaynehand Tessema 2017) Regarding landscape heterogeneity thetower fetch was predominantly from the northeast capturingthe less heterogeneous area of the site but also the more ac-tive according to the NDVI (Fig 1)

The thorny scrub examined had two vegetation layersthe overstory layer mainly consisting of mesquite acacia

and cacti and the understory layer that included grasses andherbs Cattle preferentially graze the understory and becausethey eat using their tongue they will avoid browsing thornyspecies unlike goats or deer which use their lips Withoutgrazing management overtime the competition balance willfavor bush species resulting in encroaching and the under-story will be stressed only unpalatable species or those withtheir growing meristems very close to the ground would sur-vive Representing the structure and functioning of these twolayers using MODIS is possible (Liu et al 2017) RecentlyHill and Guerschman (2020) presented a MODIS product de-rived from MCD43A4 to estimate the fractions of photosyn-thetic and nonphotosynthetic vegetation and the remainingfraction of bare soil These developments could improve theMOD17 GPP estimates since its model represents a homoge-neous single vegetation layer All these considerations helpto understand the low θs between MOD17 estimates of GPPand EC-derived GPP

43 Forcing variables and machine learning

Forcing variables in this study were gridded meteorologi-cal data and MODIS spectral bands and products but thevariables mostly included in the ML algorithms were LAIand FPAR accounting for 364 and 549 of the vari-able importance index at Bernal and Santa Rita respectivelyin their leader board This result supported the view thatCO2 fluxes can be represented by ML algorithms exclu-sively using remotely sensed data (Tramontana et al 2016Joiner and Yoshida 2020) Using neural networks Joiner andYoshida (2020) showed that with only satellite reflectancefrom MODIS and top-of-atmosphere PAR it is possible tocapture a large fraction of GPP variability They argue thatvegetation indices may reduce the information content ofthe underlying reflectance when compressing the informa-tion from two or more bands into a single index In our studyreflectance bands were not often included in the ML modelsThis was most likely because MOD09GA and MODOCGAprovide estimates of surface reflectance uncorrected for theillumination and viewing geometry while the MCD43 usedin Joiner and Yoshida (2020) uses a surface bidirectional re-flectance distribution function (BRDF) model that improvesthe quality of surface albedo retrievals By propagating theBRDF correction in the MODIS processing pipeline the veg-etation indices would more likely relate better to the GPPFuture modeling efforts should use benchmark scenarios ac-cording to sets of forcing variables already identified as use-ful

44 Carbon flux

Although the carbon balance in ecosystems is influenced bydifferent factors such as soil type and amount of nutrientsthe relationship with soil temperature and humidity is partic-ularly strong (Anderson-Teixeira et al 2011 Hastings et al

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 383

2005) How much of the rainwater the system can retain orlose has been described as the leakiness of the system (Guer-schman et al 2009) More than immediate incident rainfallthe available soil moisture and its redistribution are impor-tant in semiarid ecosystems including steam flow prefer-ential flow paths hydraulic lift and others (Barron-Gaffordet al 2017) At Bernal when the sum of the precipitation ofthe current month and that of the previous month was con-sidered the correlation with NEE was minus07 suggesting thatcontinuous availability of soil moisture is important for theabsorption of CO2 in this environment This result is con-sistent with other studies in which the relationship betweenthe net productivity of the ecosystem (NEP) and precipita-tion is initially positive but is leveled from 1000 to 1500 mmannually (Xu et al 2014) The hydraulic redistribution ofwater from moist (deeper) to drier soils through plant rootstended to increase modeled annual ecosystem uptake of CO2this process was identified at US-SMR (Fu et al 2018 Scottet al 2003)

The leakiness is highly dependent upon vegetation frac-tional cover the proportion of the surface occupied by baresoil and vegetation photosynthetically active vegetation andnonphotosynthetically active vegetation such as litter woodand dead biomass (Guerschman et al 2009) All these vege-tation fractions have a water storage capacity and can reducethe amount of effective rainfall available to plant roots It ispossible that the canopy of the bushes completely interceptedthe rainfall in some months because the scrub can interceptup to 20 of the precipitation and its canopy storage ca-pacity is 097 mm (Mastachi-Loza et al 2010) Consideringonly the daily rain events greater than 5 mm the correlationbetween precipitation and NEE rose to minus072 In the presentstudy the interception of rain by vegetation surfaces was notcalculated but the results suggest that it would be impor-tant to explore the relationship between net precipitation andNEE

The average NEE at a global level is minus156plusmn284 gCmminus2 yrminus1 (Baldocchi 2014) The highest frequencyamong sites that measured NEE with EC occurs from minus200to minus300 gCmminus2 yrminus1 but in sites with biometric measure-ments the peak occurs at minus100 gCmminus2 yrminus1 (Xu et al2014) Using the daily averages of Table 2 the averageNEE during the measurement period was minus078 gCmminus2 dminus1

and annually would be minus2835 gCmminus2 yrminus1 This resultwas higher than the annual values of the induced grass-land and scrubland vegetation characterizing the SonoranDesert plains (138 and 130 gCmminus2 yrminus1 Hinojo-Hinojoet al 2019) In New Mexico NEE values measured withEC are between 35ndash50 gCmminus2 yrminus1 in desert grassland and344ndash355 gCmminus2 yrminus1 in mixed coniferous forest (Anderson-Teixeira et al 2011) In a dryer region the sarcocaulescentscrubland of Baja California in Mexico the NEE was minus39and minus52 gCmminus2 yrminus1 in 2002 and 2003 respectively (Hast-ings et al 2005) The NEE measured here was within therange of NEE 03plusmn 02 kgCmminus2 yrminus1 for grasslands and

shrublands in Mexico (Murray-Tortarolo et al 2016) Al-though the measurements of the present study had gaps andwere compared with annual studies we considered that thereported value of C was representative of the main season ofgrowth of this type of scrub

45 Ecosystem management

Overgrazing is an appreciation relative to the grazing produc-tive system where the forage resource is overused in a mixedshrubndashgrass ecosystem such as Bernal it usually refers tothe understory Overgrazing means that the plant regrowth isreadily grazed tillers and root reserves are lost and eventu-ally the plant may die Although the Bernal site was over-grazed the carbon fluxes indicated that the plant communitywas photosynthetically active in both dry and rainy seasonsIt is fair to assume that water in the soil was not limiting forthe deep-rooted bush species and that was the reason whyit was possible to maintain the photosynthetic function dur-ing rainless months However this primary production wouldnot have tangible benefits for ranchersrsquo production systemsince no edible biomass would be produced for the cattleFrom the point of view of carbon capture the system accu-mulated nonlabile biomass that would remain in the systemfor a longer time compared to a grassland ecosystem (al-though it would be necessary to determine the partition ofsaid shrub biomass) However the overgrazing condition af-fects the biomass of the understory roots and consequentlythe carbon pool in the soil

In the short term it can be thought that the estimated neg-ative carbon flows had a favorable effect on the environmen-tal agenda As time passes it is possible that the gaps be-tween the individual shrubs of the overstory expand and thiswould have had an effect on soil erosion It is also possiblethat the water stored in the soil profile used by the bushesgradually decreases to the point of causing drought changesin phenology and advancements of the desertification pro-cess There are many opportunities for ecology conservationand livestock-oriented management these may include con-trolled grazing or propagating native thornless shrub speciesIf ranchers do not identify a benefit in the vegetation thenthey will be tempted to remove it as it occurred at thestudy site Because of its wide coverage and availability theMODIS GPP product accuracy is important in representingthe carbon cycle raising awareness and monitoring advance-ment of environmental decisions

Although we found that EML was a good option for mod-eling the GPP of a site what is really needed to evaluate theperformance of semiarid ecosystems is a spatial representa-tion of the carbon flux This is a problem for an underrepre-sented area regarding instrumented EC towers However theEML could be designed to take into account the explanatoryvariables in a spatiotemporal continuity

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

384 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

5 Conclusions

The best modeling approach was the ensemble of ma-chine learning The second option to estimate GPP was themultiple-regression OLS and the third alternative were theMODIS estimates Machine learning was a good option topredict GPP in the local context where it was generated oth-erwise its performance was not good This was demonstratedwhen using the EML from Santa Rita and trying to predictfluxes at Bernal Nevertheless a machine learning modelwould be useful for gap filling or the evaluation of GPP atthe same site The GPP estimates of a given model can be ad-justed using the bias parameters of the probability of agree-ment to improve the relationship Further research explor-ing the usefulness of EML to predict GPP in other contextscould use EML to calibrate mechanistic model parametershybrid approaches integrating EML and biologically mean-ingful mechanistic models or EML representing the spatialvariability of the landscape

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 385

Appendix A

Table A1 Threshold of velocity friction (ulowast) above which nighttime fluxes were considered valid Estimates were obtained using the Rpackage REddyProc We used only the flux records with ulowast equal to or higher than the corresponding mean ulowast threshold for each yearBound values are a 95 confidence interval NA not applicable

Aggregation method Year Season Mean Lower bound 5 Upper bound 95

ulowast (msminus1)

Single NA NA 0193 0140 0209Year 2017 NA 0194 0128 0227Year 2018 NA 0193 0132 0225Season 2017 2016012 0194 0128 0227Season 2017 2017003 0215 0141 0297Season 2017 2017006 0194 0128 0227Season 2017 2017009 0181 0144 0208Season 2018 2017012 0149 0116 0232Season 2018 2018003 0271 0134 0297Season 2018 2018006 0193 0132 0221

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

386 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix B

Table B1 MODIS reflectance bands and products database

Product name Satellite layer Spatial Temporal Spectralresolution resolution coverage

(d) (nm)

MOD09GA Band 1 500 m 1 620ndash670Band 2 841ndash876Band 3 459ndash479Band 4 545ndash565Band 5 1230ndash1250Band 6 1628ndash1652Band 7 2105ndash2155

MODOCGA Band 8 1 km 405ndash420Band 9 438ndash448Band 10 483ndash493Band 11 526ndash536Band 12 546ndash556Band 13 662ndash672Band 14 673ndash683

MOD17A2H Gross primary production (GPP) 8

Net photosynthesis (PsnNet)

MOD15A2H Fraction of photosynthetically active radiation (Fpar) na

Leaf area index (LAI)

MOD11A2 Land surface temperature and emissivity of day (LST day) 1 km

Land surface temperature and emissivity of night (LST night)

MYD13A1 Enhanced vegetation index (EVI) 500 m 16

MYD15A2H Fraction of photosynthetically active radiation (Fpar) 8

Leaf area index (LAI)

MYD11A2 Land surface temperature and emissivity of day (LST day) 1 km na

Land surface temperature and emissivity of night (LST night)

MCD15A2H Fraction of photosynthetically active radiation (Fpar) 500 m

Leaf area index (LAI)

na not applicable

Table B2 Daily vegetation indexes computed using the MODIS reflectance bands described in Table 1

Index Formula Reference

Simple ratio (SimpleR) SimpleR= Band 2Band 1

(Hill et al 2006)Moisture stress (MoistS) MoistS= Band 6

Band 2

Disease stress index (DSI) DSI= Band 2+Band 4Band 6+Band 1

Redndashgreen ratio index (RGRI) RGRI= Band 1Band 4

Normalized difference vegetation index (NDVI) NDVI= Band 2minusBand 1Band 2+Band 1

Normalized difference water index (NDVI_w) NDVI_w =Band 2minusBand 5Band 2+Band 5

Green leaf index (GLI) GLI= 2lowastBand 11minusBand 14minusBand 92lowastBand 11+Band 14+Band

(Henrich et al 2012)Green atmospherically resistant vegetation (GARI) GARI= Band 5minus(Band 11minus(Band 9minusBand 14))

Band 5minus(Band 11+(Band 9minusBand 14))

Enhanced vegetation index (EVI) EVI= 25lowast Band 2minusBand 1Band 2+(6lowastBand 1)minus(75lowastBand 9)+1

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 387

Table B3 Daymet meteorological database

Variable Spatial resolution Temporal resolution Reference

Precipitation (Dayprc)

1 km Daily (Thornton et al 2017)Shortwave radiation (Daysrad)Maximum air temperature (DayTmax)Minimum air temperature (DayTmin)Water vapor pressure (Dayvp)

Table B4 Precipitation data

Satellite Product name Spatial resolution Temporal resolution

Global Precipitation Measurement (GPM) 3IMERGDF v006 010Daily

Tropical Rainfall Measuring Mission (TRMM) 3B42 v007 025

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

388 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix C

Table C1 Analysis of variance for GGP derived from EC data atthe Bernal site Variable details are described in Appendix B

Model term Coefficient SEa DFb SSc F p VIFd

Regression 14 178048Constant minus1318 648 minus203 004R_08_40542e

minus212 060 1 2680 minus352 0 220R_13_66267e

minus286 123 1 1161 minus232 002 155R_14_67368e 917 259 1 2706 353 0 122RGRI6

minus083 037 1 1100 minus225 003 143GARIf 201 108 1 755 187 006 237EVIMYDg 853 201 1 3903 425 0 294NDVI_wf 390 155 1 1367 251 001 271PsnNetg 0013 0 1 7530 590 0 383LstNgtMYDg 005 002 1 818 194 005 201Dayprch 014 003 1 3590 407 0 192Daysradh 001 0 1 5121 486 0 501DayTmaxh

minus024 006 1 3658 minus411 0 689DayTminh 018 005 1 3222 386 0 496TRMMi minus004 002 1 1087 minus224 003 141Error 403 87297Total 417 265345

a Standard error of coefficient b Degrees of freedom c Adjusted sum of squares d Variance inflation factore Variable name denotes the band number and spectral bandwidth of MODIS (Moderate Resolution ImagingSpectroradiometer) f Vegetation indices RGRI is redndashgreen ratio index and GARI is green atmosphericallyresistant vegetation index details of formula are described in Appendix B g Layers from MODIS productsEVIMYD is enhanced vegetation index from MYD13 PsnNet is photosynthesis product form MOD17LstNgtMYD is nighttime land surface temperature emissivity from MYD11 h Variables obtained fromDaymet daily dataset DayTmin is minimum temperature DayTmax is maximum temperature Daysrad isshortwave radiation Dayprc is precipitation i Daily rainfall rate from 3B42 TRMM (Tropical RainfallMeasuring Mission)

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 389

Code and data availability Database and programming code areavailable at httpsdoiorg105281zenodo3598595 (Guevara-Escobar et al 2020)

Author contributions AGE contributed to conceptualization for-mal analysis investigation methodology software supervisionvalidation visualization and writing of the original draft as wellas review and editing EGS was in charge of funding acquisitionproject administration and writing and review MCJ contributed toinvestigation methodology validation visualization and writing ofthe original draft as well as review editing and communicationMEQB helped with investigation validation writing and reviewHSA contributed with conceptualization writing and review ICAacutehelped with investigation writing and review VHCS contributedwith writing review and funding acquisition

Competing interests The authors declare that they have no conflictof interest

Acknowledgement We thank NASA and Ameriflux for data avail-ability Data filtering was carried out by the cluster Abacus ICentre of Applied Mathematics and High Performance ComputingABACUS-CINVESTAV

Financial support This research has been supported by CONA-CyT ndash SEMARNAT (grant no 2014-1-(249407))

Review statement This paper was edited by Paul Stoy and re-viewed by three anonymous referees

References

Aguirre-Diacuteaz G J Aguilloacuten-Robles A Tristaacuten-Gonzaacutelez MLabarthe-Hernaacutendez G Loacutepez-Martiacutenez M Bellon Hand Nieto-Obregoacuten J Geologic setting of the Pentildeade Bernal Natural Monument Quereacutetaro Meacutexico Anendogenous volcanic dome Geosphere 9 557ndash571httpsdoiorg101130GES008431 2013

Anav A Friedlingstein P Beer C Ciais P Harper AJones C Murray-Tortarolo G Papale D Parazoo N Cand Peylin P Spatiotemporal patterns of terrestrial grossprimary production A review Rev Geophys 53 785ndash818httpsdoiorg1010022015RG000483 2015

Anderson-Teixeira K J Delong J P Fox A M Brese D Aand Litvak M E Differential responses of production and res-piration to temperature and moisture drive the carbon balanceacross a climatic gradient in New Mexico Glob Change Biol17 410ndash424 httpsdoiorg101111j1365-2486201002269x2011

Baldocchi D Measuring fluxes of trace gases and energy be-tween ecosystems and the atmospherendashthe state and future of the

eddy covariance method Glob Change Biol 20 3600ndash3609httpsdoiorg101111gcb12649 2014

Barron-Gafford G A Sanchez-Cantildeete E P Minor R LHendryx S M Lee E Sutter L F Tran N Parra EColella T Murphy P C Hamerlynck E P Kumar P andScott R L Impacts of hydraulic redistribution on grassndashtreecompetition vs facilitation in a semi-arid savanna New Phytol215 1451ndash1461 httpsdoiorg101111nph14693 2017

Belayneh A and Tessema Z K Mechanisms of bush encroach-ment and its inter-connection with rangeland degradation insemi-arid African ecosystems a review J Arid Land 9 299ndash312 httpsdoiorg101007s40333-016-0023-x 2017

Bland J M and Altman D G Statistical methods for assessingagreement between two methods of clinical measurement Int JNurs Stud 47 931ndash936 2010

Boettiger C and Eddelbuettel D An introduction to rockerDocker containers 639 for r ArXiv Prepr ArXiv171003675 6402017

Bonilla-Moheno M and Aide T M Beyond deforestationLand cover transitions in Mexico Agric Syst 178 102734httpsdoiorg101016jagsy2019102734 2020

Booker K Huntsinger L Bartolome J W Sayre N Fand Stewart W What can ecological science tell us aboutopportunities for carbon sequestration on arid rangelands inthe United States Glob Environ Change 23 240ndash251httpsdoiorg101016jgloenvcha201210001 2013

Butzer K W and Butzer E K The lsquonaturalrsquovegetation ofthe Mexican Bajiacuteo archival documentation of a 16th-century savanna environment Quat Int 43 161ndash172httpsdoiorg101016S1040-6182(97)00032-3 1997

CICESE C de I C y de E S de E Base de datos climatoloacutegicanacional (CLICOM) Villa Bernal Quereacutetaro available at httpclicom-mexcicesemx (last access 28 August 2019) 2015

Cohen W B Maiersperger T K Yang Z Gower S TTurner D P Ritts W D Berterretche M and Run-ning S W Comparisons of land cover and LAI esti-mates derived from ETM+ and MODIS for four sites inNorth America A quality assessment of 20002001 provi-sional MODIS products Remote Sens Environ 88 233ndash255httpsdoiorg101016jrse200306006 2003

Cui T Wang Y Sun R Qiao C Fan W Jiang GHao L and Zhang L Estimating vegetation primaryproduction in the Heihe River Basin of China withmulti-source and multi-scale data PloS One 11 1ndash20httpsdoiorg101371journalpone0153971 2016

Curtis J T and McIntosh R P The Interrelations of Certain An-alytic and Synthetic Phytosociological Characters Ecology 31434ndash455 httpsdoiorg1023071931497 1950

Delgado-Balbuena J Yepez E A Aacutengeles-Peacuterez G Aguirre-Gutierrez C Arredondo T Ayala-Nintildeo F Bullock S HCastellanos A E Cueva A Figueroa-Espinoza B Garatuza-Payaacuten J Hinojo-Hinojo C Maya-Delgado Y Meacutendez-Barroso L Oechel W Paz-Pellat F Peacuterez-Ruiacutez E R Ro-driacuteguez J C Rojas-Robles N Saacutenchez-Mejiacutea Z M Uuh-Sonda J Vargas R Verduzco V S Vivoni E R andWatts C Flujos anuales de carbono en ecosistemas terrestres deMeacutexico in Estado Actual del Conocimiento del Ciclo del Car-bono y sus Interacciones en Meacutexico Siacutentesis a 2018 p 686Texcoco Estado de Meacutexico Meacutexico 2018

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

390 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Eshel G Dayalu A Wofsy S C Munger J W andTziperman E Listening to the Forest An ArtificialNeural Network-Based Model of Carbon Uptake at Har-vard Forest J Geophys Res-Biogeo 124 461ndash478httpsdoiorg1010292018JG004791 2019

Fu C Wang G Bible K Goulden M L Saleska S RScott R L and Cardon Z G Hydraulic redistributionaffects modeled carbon cycling via soil microbial activityand suppressed fire Glob Change Biol 24 3472ndash3485httpsdoiorg101111gcb14164 2018

Gebremichael M and Barros A P Evaluation of MODISGross Primary Productivity (GPP) in tropical mon-soon regions Remote Sens Environ 100 150ndash166httpsdoiorg101016jrse200510009 2006

Gitelson A A Peng Y Arkebauer T J and Suyker A EProductivity absorbed photosynthetically active radiation andlight use efficiency in crops Implications for remote sensingof crop primary production J Plant Physiol 177 100ndash109httpsdoiorg101016jjplph201412015 2015

Glenn E P Scott R L Nguyen U and Nagler P L Wide-area ratios of evapotranspiration to precipitation in monsoon-dependent semiarid vegetation communities J Arid Envi-ron 117 84ndash95 httpsdoiorg101016jjaridenv2015020102015

Goldstein A Turner W R Spawn S A Anderson-Teixeira K J Cook-Patton S Fargione J Gibbs H KGriscom B Hewson J H and Howard J F Protecting ir-recoverable carbon in Earthrsquos ecosystems Nat Clim Change10 1ndash9 httpsdoiorg101038s41558-020-0738-8 2020

Goulden M L Munger J W Fan S Daube B Cand Wofsy S C Measurements of carbon sequestra-tion by long-term eddy covariance Methods and a criti-cal evaluation of accuracy Glob Change Biol 2 169ndash182httpsdoiorg101111j1365-24861996tb00070x 1996

Grossiord C Buckley T N Cernusak L A Novick K A Poul-ter B Siegwolf R T Sperry J S and McDowell N GPlant responses to rising vapor pressure deficit New Phytol226 1550ndash1566 2020

Guerschman J P Hill M J Renzullo L J Barrett D JMarks A S and Botha E J Estimating fractional coverof photosynthetic vegetation non-photosynthetic vegetation andbare soil in the Australian tropical savanna region upscaling theEO-1 Hyperion and MODIS sensors Remote Sens Environ113 928ndash945 httpsdoiorg101016jrse200901006 2009

Guevara-Escobar A Gonzaacutelez-Sosa E Cervantes-Jimeacutenez MSuzaacuten-Azpiri H Queijeiro-Bolantildeos M E Carrillo-AacutengelesI and Cambroacuten-Sandoval V H Machine learning estimates ofeddy covariance carbon flux in a scrub in the Mexican highlandZenodo httpsdoiorg105281zenodo3598595 2020

Hall P Gill N Kurka M Phan W and Bartz A MachineLearning Interpretability with H2O Driverless AI edited byBartz A H2Oai Inc California US 2019

Hastings S J Oechel W C and Muhlia-Melo A Diur-nal seasonal and annual variation in the net ecosystem CO2exchange of a desert shrub community (Sarcocaulescent) inBaja California Mexico Glob Change Biol 11 927ndash939httpsdoiorg101111j1365-2486200500951x 2005

Henrich V Krauss G Goumltze C and Sandow C Entwicklungeiner Datenbank fuumlr Fernerkundungsindizes vol 45 AK Fern-erkundung Bochum 2012

Hill M Held A Leuning R Coops N Hughes D andCleugh H MODIS spectral signals at a flux tower site Rela-tionships with high-resolution data and CO2 flux and light useefficiency measurements Remote Sens Environ 103 351ndash368httpsdoiorg101016jrse200506015 2006

Hill M J and Guerschman J P The MODIS Global VegetationFractional Cover Product 2001ndash2018 Characteristics of Vegeta-tion Fractional Cover in Grasslands and Savanna Woodlands Re-mote Sens 12 406 httpsdoiorg103390rs12030406 2020

Hinojo-Hinojo C Castellanos A E Huxman T Ro-driguez J C Vargas R Romo-Leoacuten J R and Bie-derman J A Native shrubland and managed buffelgrasssavanna in drylands Implications for ecosystem carbonand water fluxes Agr Forest Meteorol 268 269ndash278httpsdoiorg101016jagrformet201901030 2019

Jaumlrvi L Havu M Ward H C Bellucco V McFad-den J P Toivonen T Heikinheimo V Kolari P Ri-ikonen A and Grimmond C S B Spatial modeling oflocal-scale biogenic and anthropogenic carbon dioxide emis-sions in Helsinki J Geophys Res-Atmos 124 8363ndash8384httpsdoiorg1010292018JD029576 2019

Joiner J and Yoshida Y Satellite-based reflectances capture largefraction of variability in global gross primary production (GPP)at weekly time scales Agr Forest Meteorol 291 108092httpsdoiorg101016jagrformet2020108092 2020

Jung M Schwalm C Migliavacca M Walther S Camps-VallsG Koirala S Anthoni P Besnard S Bodesheim P Carval-hais N Chevallier F Gans F Goll D S Haverd V KoumlhlerP Ichii K Jain A K Liu J Lombardozzi D Nabel J EM S Nelson J A OrsquoSullivan M Pallandt M Papale DPeters W Pongratz J Roumldenbeck C Sitch S TramontanaG Walker A Weber U and Reichstein M Scaling carbonfluxes from eddy covariance sites to globe synthesis and eval-uation of the FLUXCOM approach Biogeosciences 17 1343ndash1365 httpsdoiorg105194bg-17-1343-2020 2020

Kang M Ichii K Kim J Indrawati Y M Park J Moon MLim J-H and Chun J-H New Gap-Filling Strategies forLong-Period Flux Data Gaps Using a Data-Driven ApproachAtmosphere 10 568 httpsdoiorg103390atmos101005682019

Lal R Griffin M Apt J Lave L and Morgan M GManaging Soil Carbon Science 304 393ndash393httpsdoiorg101126science1093079 2004

LI-COR B Eddypro Software instruction manual Nebraska USversion 70 2019

Liu Y Liu R Pisek J and Chen J M Separating overstoryand understory leaf area indices for global needleleaf and de-ciduous broadleaf forests by fusion of MODIS and MISR dataBiogeosciences 14 1093ndash1110 httpsdoiorg105194bg-14-1093-2017 2017

Long S P Marshall-Colon A and Zhu X-G Meetingthe Global Food Demand of the Future by EngineeringCrop Photosynthesis and Yield Potential Cell 161 56ndash66httpsdoiorg101016jcell201503019 2015

Ma X Huete A Yu Q Restrepo-Coupe N Beringer J Hut-ley L B Kanniah K D Cleverly J and Eamus D Param-

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 391

eterization of an ecosystem light-use-efficiency model for pre-dicting savanna GPP using MODIS EVI Remote Sens Environ154 253ndash271 2014

Ma X Migliavacca M Wirth C Bohn F J Huth ARichter R and Mahecha M D Monitoring Plant FunctionalDiversity Using the Reflectance and Echo from Space RemoteSens 12 1248 httpsdoiorg103390rs12081248 2020

Marcolla B Roumldenbeck C and Cescatti A Patterns and con-trols of inter-annual variability in the terrestrial carbon budgetBiogeosciences 14 3815ndash3829 httpsdoiorg105194bg-14-3815-2017 2017

Mastachi-Loza C A Gonzaacutelez-Sosa E Becerril-Pintildea R andBraud I Peacuterdidas por intercepcioacuten en mezquite (Prosopis laevi-gata) y huizache (Acacia farnesiana) de la regioacuten semiaacuterida delcentro de Meacutexico Tecnol Cienc Agua 1 103ndash120 2010

Mauder M and Foken T Documentation and instruction man-ual of the eddy-covariance software package TK3 UniversitaumltBayreuth Abteilung Mikrometeorologie Bayreuth Germany2011

McClaran M A century of vegetation change on the Santa Rita Ex-perimental Range in Santa Rita Experimental Range 100 years(1903 to 2003) to accomplishments and contributions edited byMcClaran M P Ffolliott P F and Edminister C B Tech Co-ords USDA Forest service proceedings RMRS-P-30 TucsonAZ US 16ndash33 2003

McCulley R Jobbagy E Pockman W and Jackson R Nu-trient uptake as a contributing explanation for deep rootingin arid and semi-arid ecosystems Oecologia 141 620ndash628httpsdoiorg101007s00442-004-1687-z 2004

McElreath R Statistical Rethinking A Bayesian Course with Ex-amples in R and STAN CRC Press Boca Raton FL US 2020

Michailidis M How Driverless AI Prevents Overfitting and Leak-age Open Source Lead AI ML available at httpswwwh2oaiblogdriverless-ai-prevents-overfitting-leakage (last access7 July 2020) 2018

Mitchell R and Frank E Accelerating the XGBoost algo-rithm using GPU computing PeerJ Comput Sci 3 e127httpsdoiorg107717peerj-cs127 2017

Monteith J Solar radiation and productivity in tropical ecosys-tems J Appl Ecol 9 747ndash766 1972

Murray-Tortarolo G Friedlingstein P Sitch S Jaramillo V JMurguiacutea-Flores F Anav A Liu Y Arneth A Arvanitis AHarper A Jain A Kato E Koven C Poulter B Stocker BD Wiltshire A Zaehle S and Zeng N The carbon cycle inMexico past present and future of C stocks and fluxes Biogeo-sciences 13 223ndash238 httpsdoiorg105194bg-13-223-20162016

Nash J E and Sutcliffe J V River flow forecasting through con-ceptual models part I ndash A discussion of principles J Hydrol 10282ndash290 1970

Owen N A Choncubhair Oacute N Males J del Real Laborde J IRubio-Corteacutes R Griffiths H and Lanigan G Eddy covari-ance captures four-phase crassulacean acid metabolism (CAM)gas exchange signature in Agave Plant Cell Environ 39 295ndash309 httpsdoiorg101111pce12610 2016

Pasetto D Arenas-Castro S Bustamante J Casagrandi RChrysoulakis N Cord A F Dittrich A Domingo-Marimon C El Serafy G and Karnieli A Integration ofsatellite remote sensing data in ecosystem modelling at local

scales Practices and trends Methods Ecol Evol 9 1810ndash1821httpsdoiorg1011112041-210X13018 2018

Pastorello G Trotta C and Canfora E The FLUXNET2015dataset and the ONEFlux processing pipeline for eddy covari-ance data Sci Data 7 225 pp httpsdoiorg101038s41597-020-0534-3 2020

Pearcy R W Heterogeneous light environments in Handbookof Functional Plant Ecology edited by Pugnaire I F and Val-ladares F CRC Press Boca Ratoacuten FL US 270ndash314 2007

Piao S Liu Q Chen A Janssens I A Fu Y Dai J Liu LLian X Shen M and Zhu X Plant phenology and global cli-mate change Current progresses and challenges Glob ChangeBiol 25 1922ndash1940 httpsdoiorg101111gcb14619 2019

R Development Core Team R A language and environment forstatistical computing R Foundation for Statistical Computingavailable at httpswwwr-projectorg (last access 15 October2015) 2009

Reichstein M Falge E Baldocchi D Papale D Aubinet MBerbigier P Bernhofer C Buchmann N Gilmanov Tand Granier A On the separation of net ecosystem ex-change into assimilation and ecosystem respiration reviewand improved algorithm Glob Change Biol 11 1424ndash1439httpsdoiorg101111j1365-24862005001002x 2005

Richardson A D and Hollinger D Y Statistical model-ing of ecosystem respiration using eddy covariance datamaximum likelihood parameter estimation and Monte Carlosimulation of model and parameter uncertainty applied tothree simple models Agr Forest Meteorol 131 191ndash208httpsdoiorg101016jagrformet200505008 2005

Richardson A D Hollinger D Y Shoemaker J K Hughes HSavage K and Davidson E A Six years of ecosystem-atmosphere greenhouse gas fluxes measured in a sub-borealforest Sci Data 6 1ndash15 httpsdoiorg101038s41597-019-0119-1 2019

Roby M C Scott R L Barron-Gafford G A Hamerlynck E Pand Moore D J P Environmental and Vegetative Controls onSoil CO2 Efflux in Three Semiarid Ecosystems Soil Syst 3 6httpsdoiorg103390soilsystems3010006 2019

Running S W and Zhao M Daily GPP and annual NPP(MOD17A2A3) products NASA Earth Observing SystemMODIS land algorithm Univ of Mont Missoula Mont US2015

Running S W Nemani R R Heinsch F A Zhao MReeves M and Hashimoto H A continuous satellite-derived measure of global terrestrial primary productionBioscience 54 547ndash560 httpsdoiorg1016410006-3568(2004)054[0547ACSMOG]20CO2 2004

Rzedowski J Vegetacioacuten de Meacutexico Meacutexico Editor Limusa1978

Scott R L and Biederman J A Critical Zone Water Balance Over13 Years in a Semiarid Savanna Water Resour Res 55 574ndash588 httpsdoiorg1010292018WR023477 2019

Scott R L Watts C Payan J G Edwards E Goodrich D CWilliams D and James Shuttleworth W The understory andoverstory partitioning of energy and water fluxes in an opencanopy semiarid woodland Agr Forest Meteorol 114 127ndash139 httpsdoiorg101016S0168-1923(02)00197-1 2003

Scott R L Edwards E A Shuttleworth W J Huxman T EWatts C and Goodrich D C Interannual and seasonal vari-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

392 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

ation in fluxes of water and carbon dioxide from a ripar-ian woodland ecosystem Agr Forest Meteorol 122 65ndash84httpsdoiorg101016jagrformet200309001 2004

Scott R L Biederman J A Hamerlynck E P and Barron-Gafford G A The carbon balance pivot point of southwest-ern US semiarid ecosystems Insights from the 21st centurydrought J Geophys Res-Biogeo 120 2612ndash2624 2015

Segerstrom K Geology of the Bernal-Jalpan Area Estado deQueretaro Mexico Geol Surv Bull 1104 19ndash82 1961

Soper F M McCalley C K Sparks K and Sparks J P Soilcarbon dioxide emissions from the Mojave desert Isotopic evi-dence for a carbonate source Geophys Res Lett 44 245ndash251httpsdoiorg1010022016GL071198 2017

Stevens N T Steiner S H and MacKay R J Assessing agree-ment between two measurement systems An alternative to thelimits of agreement approach Stat Methods Med Res 262487ndash2504 httpsdoiorg1011770962280215601133 2015

Tagesson T Ardouml J Cappelaere B Kergoat L AbdiA Horion S and Fensholt R Modelling spatial andtemporal dynamics of gross primary production in theSahel from earth-observation-based photosynthetic capacityand quantum efficiency Biogeosciences 14 1333ndash1348httpsdoiorg105194bg-14-1333-2017 2017

Thornton P Thornton M Mayer B Wei Y Devarakonda RVose R and Cook R Daymet Daily Surface Weather Data ona 1-km Grid for North America Version 3 Oak Ridge NationalLab (ORNL) Oak Ridge TN US 2017

Tramontana G Jung M Schwalm C R Ichii K Camps-VallsG Raacuteduly B Reichstein M Arain M A Cescatti A KielyG Merbold L Serrano-Ortiz P Sickert S Wolf S andPapale D Predicting carbon dioxide and energy fluxes acrossglobal FLUXNET sites with regression algorithms Biogeo-sciences 13 4291ndash4313 httpsdoiorg105194bg-13-4291-2016 2016rdquo

Truong A Walters A Goodsitt J Hines K Bruss C B andFarivar R Towards automated machine learning Evaluationand comparison of AutoML approaches and tools in Proceed-ings of the 2019 IEEE 31st International Conference on Toolswith Artificial Intelligence (ICTAI) Portland OR US 1471ndash1479 2019

Verbist K Santibantildeez F Gabriels D and Soto G Atlas dezonas aacuteridas de Ameacuterica Latina y El Caribe Donoso MC(Coord) UNESCO-PHI and CAZALAC Montevideo Uruguay2010

Wang H Li X Ma M and Geng L Improving esti-mation of gross primary production in dryland ecosystemsby a model-data fusion approach Remote Sens 11 225httpsdoiorg103390rs11030225 2019

Wang X Ma M Li X Song Y Tan J Huang G Zhang ZZhao T Feng J and Ma Z Validation of MODIS-GPP prod-uct at 10 flux sites in northern China Int J Remote Sens 34587ndash599 httpsdoiorg101080014311612012715774 2013

Wang X Wang H Li X and Ran Y Photosynthesis (NPPNEP Respiration) Obs Meas Ecohydrol Process 1ndash30httpsdoiorg101007978-3-662-47871-4_10-1 2018

Wheeler C W Archer S R Asner G P and McMurtry C RClimaticedaphic controls on soil carbonnitrogen response toshrub encroachment in desert grassland Ecol Appl 17 1911ndash1928 httpsdoiorg10189006-15801 2007

Wilcox B P Birt A Fuhlendorf S D and Archer S R Emerg-ing frameworks for understanding and mitigating woody plantencroachment in grassy biomes Curr Opin Env Sust 32 46ndash52 httpsdoiorg101016jcosust201804005 2018

Wilson J B Cover plus ways of measuring plant canopiesand the terms used for them J Veg Sci 22 197ndash206httpsdoiorg101111j1654-1103201001238x 2011

Wofsy S Goulden M Munger J Fan S-M Bakwin PDaube B Bassow S and Bazzaz F Net exchange ofCO2 in a mid-latitude forest Science 260 1314ndash1317httpsdoiorg101126science26051121314 1993

Wu C Niu Z and Gao S Gross primary production estimationfrom MODIS data with vegetation index and photosyntheticallyactive radiation in maize J Geophys Res-Atmos 115 D12127httpsdoiorg1010292009JD013023 2010

Wutzler T Lucas-Moffat A Migliavacca M Knauer JSickel K Šigut L Menzer O and Reichstein MBasic and extensible post-processing of eddy covarianceflux data with REddyProc Biogeosciences 15 5015ndash5030httpsdoiorg105194bg-15-5015-2018 2018

Xiao H McDonald-Madden E Sabbadin R Peyrard NDee L E and Chadegraves I The value of understanding feed-backs from ecosystem functions to species for managing ecosys-tems Nat Commun 10 1ndash10 httpsdoiorg101038s41467-019-11890-7 2019

Xu B Yang Y Li P Shen H and Fang J Globalpatterns of ecosystem carbon flux in forests A biometricdata-based synthesis Global Biogeochem Cy 28 962ndash973httpsdoiorg1010022013GB004593 2014

Yao J Liu H Huang J Gao Z Wang G Li D Yu H andChen X Accelerated dryland expansion regulates future vari-ability in dryland gross primary production Nat Commun 111ndash10 httpsdoiorg101038s41467-020-15515-2 2020

Yepez E A Williams D G Scott R L and Lin GPartitioning overstory and understory evapotranspiration ina semiarid savanna woodland from the isotopic composi-tion of water vapor Agr Forest Meteorol 119 53ndash68httpsdoiorg101016S0168-1923(03)00116-3 2003

Yona L Cashore B Jackson R B Ometto J and Brad-ford M A Refining national greenhouse gas inventories Am-bio 1ndash6 httpsdoiorg101007s13280-019-01312-9 2020

Zeng X Zeng X and Barlage M Growing temperate shrubsover arid and semiarid regions in the Community Land ModelndashDynamic Global Vegetation Model Global Biogeochem Cy 22GB3003 httpsdoiorg1010292007GB003014 2008

Zhang A Jia G Epstein H E and Xia J ENSO elicits oppos-ing responses of semi-arid vegetation between Hemispheres SciRep 7 42281 httpsdoiorg101038srep42281 2017

Zhang L Xiao J Zheng Y Li S and Zhou Y Increased carbonuptake and water use efficiency in global semi-arid ecosystemsEnviron Res Lett 15 034022 httpsdoiorg1010881748-9326ab68ec 2020

Zhu X-G Long S P and Ort D R What is the max-imum efficiency with which photosynthesis can convert so-lar energy into biomass Curr Opin Biotech 19 153ndash159httpsdoiorg101016jcopbio200802004 2008

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

  • Abstract
  • Introduction
  • Materials and methods
    • Site description
    • Eddy covariance measurements
    • Flux data processing
    • Remotely sensed data
    • MODIS algorithm for GPP
    • Santa Rita site dataset
    • Modeling
    • Variable importance
    • Model agreement
      • Results
        • Eddy covariance fluxes at Bernal
        • Machine learning ensembles as predictors of eddy covariance GPP
        • MODIS as predictor of eddy covariance GPP
        • Prediction eddy covariance GPP with ordinary-least-squares multiple regression
        • Agreement
          • Discussion
            • Model agreement
            • MODIS discrepancies
            • Forcing variables and machine learning
            • Carbon flux
            • Ecosystem management
              • Conclusions
              • Appendix A
              • Appendix B
              • Appendix C
              • Code and data availability
              • Author contributions
              • Competing interests
              • Acknowledgement
              • Financial support
              • Review statement
              • References
Page 17: Machine learning estimates of eddy covariance carbon flux ... · Scrub is a xeric category of shrublands characterized by plants with small leaves; it is very thorny, and its biomass

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 383

2005) How much of the rainwater the system can retain orlose has been described as the leakiness of the system (Guer-schman et al 2009) More than immediate incident rainfallthe available soil moisture and its redistribution are impor-tant in semiarid ecosystems including steam flow prefer-ential flow paths hydraulic lift and others (Barron-Gaffordet al 2017) At Bernal when the sum of the precipitation ofthe current month and that of the previous month was con-sidered the correlation with NEE was minus07 suggesting thatcontinuous availability of soil moisture is important for theabsorption of CO2 in this environment This result is con-sistent with other studies in which the relationship betweenthe net productivity of the ecosystem (NEP) and precipita-tion is initially positive but is leveled from 1000 to 1500 mmannually (Xu et al 2014) The hydraulic redistribution ofwater from moist (deeper) to drier soils through plant rootstended to increase modeled annual ecosystem uptake of CO2this process was identified at US-SMR (Fu et al 2018 Scottet al 2003)

The leakiness is highly dependent upon vegetation frac-tional cover the proportion of the surface occupied by baresoil and vegetation photosynthetically active vegetation andnonphotosynthetically active vegetation such as litter woodand dead biomass (Guerschman et al 2009) All these vege-tation fractions have a water storage capacity and can reducethe amount of effective rainfall available to plant roots It ispossible that the canopy of the bushes completely interceptedthe rainfall in some months because the scrub can interceptup to 20 of the precipitation and its canopy storage ca-pacity is 097 mm (Mastachi-Loza et al 2010) Consideringonly the daily rain events greater than 5 mm the correlationbetween precipitation and NEE rose to minus072 In the presentstudy the interception of rain by vegetation surfaces was notcalculated but the results suggest that it would be impor-tant to explore the relationship between net precipitation andNEE

The average NEE at a global level is minus156plusmn284 gCmminus2 yrminus1 (Baldocchi 2014) The highest frequencyamong sites that measured NEE with EC occurs from minus200to minus300 gCmminus2 yrminus1 but in sites with biometric measure-ments the peak occurs at minus100 gCmminus2 yrminus1 (Xu et al2014) Using the daily averages of Table 2 the averageNEE during the measurement period was minus078 gCmminus2 dminus1

and annually would be minus2835 gCmminus2 yrminus1 This resultwas higher than the annual values of the induced grass-land and scrubland vegetation characterizing the SonoranDesert plains (138 and 130 gCmminus2 yrminus1 Hinojo-Hinojoet al 2019) In New Mexico NEE values measured withEC are between 35ndash50 gCmminus2 yrminus1 in desert grassland and344ndash355 gCmminus2 yrminus1 in mixed coniferous forest (Anderson-Teixeira et al 2011) In a dryer region the sarcocaulescentscrubland of Baja California in Mexico the NEE was minus39and minus52 gCmminus2 yrminus1 in 2002 and 2003 respectively (Hast-ings et al 2005) The NEE measured here was within therange of NEE 03plusmn 02 kgCmminus2 yrminus1 for grasslands and

shrublands in Mexico (Murray-Tortarolo et al 2016) Al-though the measurements of the present study had gaps andwere compared with annual studies we considered that thereported value of C was representative of the main season ofgrowth of this type of scrub

45 Ecosystem management

Overgrazing is an appreciation relative to the grazing produc-tive system where the forage resource is overused in a mixedshrubndashgrass ecosystem such as Bernal it usually refers tothe understory Overgrazing means that the plant regrowth isreadily grazed tillers and root reserves are lost and eventu-ally the plant may die Although the Bernal site was over-grazed the carbon fluxes indicated that the plant communitywas photosynthetically active in both dry and rainy seasonsIt is fair to assume that water in the soil was not limiting forthe deep-rooted bush species and that was the reason whyit was possible to maintain the photosynthetic function dur-ing rainless months However this primary production wouldnot have tangible benefits for ranchersrsquo production systemsince no edible biomass would be produced for the cattleFrom the point of view of carbon capture the system accu-mulated nonlabile biomass that would remain in the systemfor a longer time compared to a grassland ecosystem (al-though it would be necessary to determine the partition ofsaid shrub biomass) However the overgrazing condition af-fects the biomass of the understory roots and consequentlythe carbon pool in the soil

In the short term it can be thought that the estimated neg-ative carbon flows had a favorable effect on the environmen-tal agenda As time passes it is possible that the gaps be-tween the individual shrubs of the overstory expand and thiswould have had an effect on soil erosion It is also possiblethat the water stored in the soil profile used by the bushesgradually decreases to the point of causing drought changesin phenology and advancements of the desertification pro-cess There are many opportunities for ecology conservationand livestock-oriented management these may include con-trolled grazing or propagating native thornless shrub speciesIf ranchers do not identify a benefit in the vegetation thenthey will be tempted to remove it as it occurred at thestudy site Because of its wide coverage and availability theMODIS GPP product accuracy is important in representingthe carbon cycle raising awareness and monitoring advance-ment of environmental decisions

Although we found that EML was a good option for mod-eling the GPP of a site what is really needed to evaluate theperformance of semiarid ecosystems is a spatial representa-tion of the carbon flux This is a problem for an underrepre-sented area regarding instrumented EC towers However theEML could be designed to take into account the explanatoryvariables in a spatiotemporal continuity

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

384 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

5 Conclusions

The best modeling approach was the ensemble of ma-chine learning The second option to estimate GPP was themultiple-regression OLS and the third alternative were theMODIS estimates Machine learning was a good option topredict GPP in the local context where it was generated oth-erwise its performance was not good This was demonstratedwhen using the EML from Santa Rita and trying to predictfluxes at Bernal Nevertheless a machine learning modelwould be useful for gap filling or the evaluation of GPP atthe same site The GPP estimates of a given model can be ad-justed using the bias parameters of the probability of agree-ment to improve the relationship Further research explor-ing the usefulness of EML to predict GPP in other contextscould use EML to calibrate mechanistic model parametershybrid approaches integrating EML and biologically mean-ingful mechanistic models or EML representing the spatialvariability of the landscape

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 385

Appendix A

Table A1 Threshold of velocity friction (ulowast) above which nighttime fluxes were considered valid Estimates were obtained using the Rpackage REddyProc We used only the flux records with ulowast equal to or higher than the corresponding mean ulowast threshold for each yearBound values are a 95 confidence interval NA not applicable

Aggregation method Year Season Mean Lower bound 5 Upper bound 95

ulowast (msminus1)

Single NA NA 0193 0140 0209Year 2017 NA 0194 0128 0227Year 2018 NA 0193 0132 0225Season 2017 2016012 0194 0128 0227Season 2017 2017003 0215 0141 0297Season 2017 2017006 0194 0128 0227Season 2017 2017009 0181 0144 0208Season 2018 2017012 0149 0116 0232Season 2018 2018003 0271 0134 0297Season 2018 2018006 0193 0132 0221

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

386 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix B

Table B1 MODIS reflectance bands and products database

Product name Satellite layer Spatial Temporal Spectralresolution resolution coverage

(d) (nm)

MOD09GA Band 1 500 m 1 620ndash670Band 2 841ndash876Band 3 459ndash479Band 4 545ndash565Band 5 1230ndash1250Band 6 1628ndash1652Band 7 2105ndash2155

MODOCGA Band 8 1 km 405ndash420Band 9 438ndash448Band 10 483ndash493Band 11 526ndash536Band 12 546ndash556Band 13 662ndash672Band 14 673ndash683

MOD17A2H Gross primary production (GPP) 8

Net photosynthesis (PsnNet)

MOD15A2H Fraction of photosynthetically active radiation (Fpar) na

Leaf area index (LAI)

MOD11A2 Land surface temperature and emissivity of day (LST day) 1 km

Land surface temperature and emissivity of night (LST night)

MYD13A1 Enhanced vegetation index (EVI) 500 m 16

MYD15A2H Fraction of photosynthetically active radiation (Fpar) 8

Leaf area index (LAI)

MYD11A2 Land surface temperature and emissivity of day (LST day) 1 km na

Land surface temperature and emissivity of night (LST night)

MCD15A2H Fraction of photosynthetically active radiation (Fpar) 500 m

Leaf area index (LAI)

na not applicable

Table B2 Daily vegetation indexes computed using the MODIS reflectance bands described in Table 1

Index Formula Reference

Simple ratio (SimpleR) SimpleR= Band 2Band 1

(Hill et al 2006)Moisture stress (MoistS) MoistS= Band 6

Band 2

Disease stress index (DSI) DSI= Band 2+Band 4Band 6+Band 1

Redndashgreen ratio index (RGRI) RGRI= Band 1Band 4

Normalized difference vegetation index (NDVI) NDVI= Band 2minusBand 1Band 2+Band 1

Normalized difference water index (NDVI_w) NDVI_w =Band 2minusBand 5Band 2+Band 5

Green leaf index (GLI) GLI= 2lowastBand 11minusBand 14minusBand 92lowastBand 11+Band 14+Band

(Henrich et al 2012)Green atmospherically resistant vegetation (GARI) GARI= Band 5minus(Band 11minus(Band 9minusBand 14))

Band 5minus(Band 11+(Band 9minusBand 14))

Enhanced vegetation index (EVI) EVI= 25lowast Band 2minusBand 1Band 2+(6lowastBand 1)minus(75lowastBand 9)+1

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 387

Table B3 Daymet meteorological database

Variable Spatial resolution Temporal resolution Reference

Precipitation (Dayprc)

1 km Daily (Thornton et al 2017)Shortwave radiation (Daysrad)Maximum air temperature (DayTmax)Minimum air temperature (DayTmin)Water vapor pressure (Dayvp)

Table B4 Precipitation data

Satellite Product name Spatial resolution Temporal resolution

Global Precipitation Measurement (GPM) 3IMERGDF v006 010Daily

Tropical Rainfall Measuring Mission (TRMM) 3B42 v007 025

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

388 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix C

Table C1 Analysis of variance for GGP derived from EC data atthe Bernal site Variable details are described in Appendix B

Model term Coefficient SEa DFb SSc F p VIFd

Regression 14 178048Constant minus1318 648 minus203 004R_08_40542e

minus212 060 1 2680 minus352 0 220R_13_66267e

minus286 123 1 1161 minus232 002 155R_14_67368e 917 259 1 2706 353 0 122RGRI6

minus083 037 1 1100 minus225 003 143GARIf 201 108 1 755 187 006 237EVIMYDg 853 201 1 3903 425 0 294NDVI_wf 390 155 1 1367 251 001 271PsnNetg 0013 0 1 7530 590 0 383LstNgtMYDg 005 002 1 818 194 005 201Dayprch 014 003 1 3590 407 0 192Daysradh 001 0 1 5121 486 0 501DayTmaxh

minus024 006 1 3658 minus411 0 689DayTminh 018 005 1 3222 386 0 496TRMMi minus004 002 1 1087 minus224 003 141Error 403 87297Total 417 265345

a Standard error of coefficient b Degrees of freedom c Adjusted sum of squares d Variance inflation factore Variable name denotes the band number and spectral bandwidth of MODIS (Moderate Resolution ImagingSpectroradiometer) f Vegetation indices RGRI is redndashgreen ratio index and GARI is green atmosphericallyresistant vegetation index details of formula are described in Appendix B g Layers from MODIS productsEVIMYD is enhanced vegetation index from MYD13 PsnNet is photosynthesis product form MOD17LstNgtMYD is nighttime land surface temperature emissivity from MYD11 h Variables obtained fromDaymet daily dataset DayTmin is minimum temperature DayTmax is maximum temperature Daysrad isshortwave radiation Dayprc is precipitation i Daily rainfall rate from 3B42 TRMM (Tropical RainfallMeasuring Mission)

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 389

Code and data availability Database and programming code areavailable at httpsdoiorg105281zenodo3598595 (Guevara-Escobar et al 2020)

Author contributions AGE contributed to conceptualization for-mal analysis investigation methodology software supervisionvalidation visualization and writing of the original draft as wellas review and editing EGS was in charge of funding acquisitionproject administration and writing and review MCJ contributed toinvestigation methodology validation visualization and writing ofthe original draft as well as review editing and communicationMEQB helped with investigation validation writing and reviewHSA contributed with conceptualization writing and review ICAacutehelped with investigation writing and review VHCS contributedwith writing review and funding acquisition

Competing interests The authors declare that they have no conflictof interest

Acknowledgement We thank NASA and Ameriflux for data avail-ability Data filtering was carried out by the cluster Abacus ICentre of Applied Mathematics and High Performance ComputingABACUS-CINVESTAV

Financial support This research has been supported by CONA-CyT ndash SEMARNAT (grant no 2014-1-(249407))

Review statement This paper was edited by Paul Stoy and re-viewed by three anonymous referees

References

Aguirre-Diacuteaz G J Aguilloacuten-Robles A Tristaacuten-Gonzaacutelez MLabarthe-Hernaacutendez G Loacutepez-Martiacutenez M Bellon Hand Nieto-Obregoacuten J Geologic setting of the Pentildeade Bernal Natural Monument Quereacutetaro Meacutexico Anendogenous volcanic dome Geosphere 9 557ndash571httpsdoiorg101130GES008431 2013

Anav A Friedlingstein P Beer C Ciais P Harper AJones C Murray-Tortarolo G Papale D Parazoo N Cand Peylin P Spatiotemporal patterns of terrestrial grossprimary production A review Rev Geophys 53 785ndash818httpsdoiorg1010022015RG000483 2015

Anderson-Teixeira K J Delong J P Fox A M Brese D Aand Litvak M E Differential responses of production and res-piration to temperature and moisture drive the carbon balanceacross a climatic gradient in New Mexico Glob Change Biol17 410ndash424 httpsdoiorg101111j1365-2486201002269x2011

Baldocchi D Measuring fluxes of trace gases and energy be-tween ecosystems and the atmospherendashthe state and future of the

eddy covariance method Glob Change Biol 20 3600ndash3609httpsdoiorg101111gcb12649 2014

Barron-Gafford G A Sanchez-Cantildeete E P Minor R LHendryx S M Lee E Sutter L F Tran N Parra EColella T Murphy P C Hamerlynck E P Kumar P andScott R L Impacts of hydraulic redistribution on grassndashtreecompetition vs facilitation in a semi-arid savanna New Phytol215 1451ndash1461 httpsdoiorg101111nph14693 2017

Belayneh A and Tessema Z K Mechanisms of bush encroach-ment and its inter-connection with rangeland degradation insemi-arid African ecosystems a review J Arid Land 9 299ndash312 httpsdoiorg101007s40333-016-0023-x 2017

Bland J M and Altman D G Statistical methods for assessingagreement between two methods of clinical measurement Int JNurs Stud 47 931ndash936 2010

Boettiger C and Eddelbuettel D An introduction to rockerDocker containers 639 for r ArXiv Prepr ArXiv171003675 6402017

Bonilla-Moheno M and Aide T M Beyond deforestationLand cover transitions in Mexico Agric Syst 178 102734httpsdoiorg101016jagsy2019102734 2020

Booker K Huntsinger L Bartolome J W Sayre N Fand Stewart W What can ecological science tell us aboutopportunities for carbon sequestration on arid rangelands inthe United States Glob Environ Change 23 240ndash251httpsdoiorg101016jgloenvcha201210001 2013

Butzer K W and Butzer E K The lsquonaturalrsquovegetation ofthe Mexican Bajiacuteo archival documentation of a 16th-century savanna environment Quat Int 43 161ndash172httpsdoiorg101016S1040-6182(97)00032-3 1997

CICESE C de I C y de E S de E Base de datos climatoloacutegicanacional (CLICOM) Villa Bernal Quereacutetaro available at httpclicom-mexcicesemx (last access 28 August 2019) 2015

Cohen W B Maiersperger T K Yang Z Gower S TTurner D P Ritts W D Berterretche M and Run-ning S W Comparisons of land cover and LAI esti-mates derived from ETM+ and MODIS for four sites inNorth America A quality assessment of 20002001 provi-sional MODIS products Remote Sens Environ 88 233ndash255httpsdoiorg101016jrse200306006 2003

Cui T Wang Y Sun R Qiao C Fan W Jiang GHao L and Zhang L Estimating vegetation primaryproduction in the Heihe River Basin of China withmulti-source and multi-scale data PloS One 11 1ndash20httpsdoiorg101371journalpone0153971 2016

Curtis J T and McIntosh R P The Interrelations of Certain An-alytic and Synthetic Phytosociological Characters Ecology 31434ndash455 httpsdoiorg1023071931497 1950

Delgado-Balbuena J Yepez E A Aacutengeles-Peacuterez G Aguirre-Gutierrez C Arredondo T Ayala-Nintildeo F Bullock S HCastellanos A E Cueva A Figueroa-Espinoza B Garatuza-Payaacuten J Hinojo-Hinojo C Maya-Delgado Y Meacutendez-Barroso L Oechel W Paz-Pellat F Peacuterez-Ruiacutez E R Ro-driacuteguez J C Rojas-Robles N Saacutenchez-Mejiacutea Z M Uuh-Sonda J Vargas R Verduzco V S Vivoni E R andWatts C Flujos anuales de carbono en ecosistemas terrestres deMeacutexico in Estado Actual del Conocimiento del Ciclo del Car-bono y sus Interacciones en Meacutexico Siacutentesis a 2018 p 686Texcoco Estado de Meacutexico Meacutexico 2018

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

390 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Eshel G Dayalu A Wofsy S C Munger J W andTziperman E Listening to the Forest An ArtificialNeural Network-Based Model of Carbon Uptake at Har-vard Forest J Geophys Res-Biogeo 124 461ndash478httpsdoiorg1010292018JG004791 2019

Fu C Wang G Bible K Goulden M L Saleska S RScott R L and Cardon Z G Hydraulic redistributionaffects modeled carbon cycling via soil microbial activityand suppressed fire Glob Change Biol 24 3472ndash3485httpsdoiorg101111gcb14164 2018

Gebremichael M and Barros A P Evaluation of MODISGross Primary Productivity (GPP) in tropical mon-soon regions Remote Sens Environ 100 150ndash166httpsdoiorg101016jrse200510009 2006

Gitelson A A Peng Y Arkebauer T J and Suyker A EProductivity absorbed photosynthetically active radiation andlight use efficiency in crops Implications for remote sensingof crop primary production J Plant Physiol 177 100ndash109httpsdoiorg101016jjplph201412015 2015

Glenn E P Scott R L Nguyen U and Nagler P L Wide-area ratios of evapotranspiration to precipitation in monsoon-dependent semiarid vegetation communities J Arid Envi-ron 117 84ndash95 httpsdoiorg101016jjaridenv2015020102015

Goldstein A Turner W R Spawn S A Anderson-Teixeira K J Cook-Patton S Fargione J Gibbs H KGriscom B Hewson J H and Howard J F Protecting ir-recoverable carbon in Earthrsquos ecosystems Nat Clim Change10 1ndash9 httpsdoiorg101038s41558-020-0738-8 2020

Goulden M L Munger J W Fan S Daube B Cand Wofsy S C Measurements of carbon sequestra-tion by long-term eddy covariance Methods and a criti-cal evaluation of accuracy Glob Change Biol 2 169ndash182httpsdoiorg101111j1365-24861996tb00070x 1996

Grossiord C Buckley T N Cernusak L A Novick K A Poul-ter B Siegwolf R T Sperry J S and McDowell N GPlant responses to rising vapor pressure deficit New Phytol226 1550ndash1566 2020

Guerschman J P Hill M J Renzullo L J Barrett D JMarks A S and Botha E J Estimating fractional coverof photosynthetic vegetation non-photosynthetic vegetation andbare soil in the Australian tropical savanna region upscaling theEO-1 Hyperion and MODIS sensors Remote Sens Environ113 928ndash945 httpsdoiorg101016jrse200901006 2009

Guevara-Escobar A Gonzaacutelez-Sosa E Cervantes-Jimeacutenez MSuzaacuten-Azpiri H Queijeiro-Bolantildeos M E Carrillo-AacutengelesI and Cambroacuten-Sandoval V H Machine learning estimates ofeddy covariance carbon flux in a scrub in the Mexican highlandZenodo httpsdoiorg105281zenodo3598595 2020

Hall P Gill N Kurka M Phan W and Bartz A MachineLearning Interpretability with H2O Driverless AI edited byBartz A H2Oai Inc California US 2019

Hastings S J Oechel W C and Muhlia-Melo A Diur-nal seasonal and annual variation in the net ecosystem CO2exchange of a desert shrub community (Sarcocaulescent) inBaja California Mexico Glob Change Biol 11 927ndash939httpsdoiorg101111j1365-2486200500951x 2005

Henrich V Krauss G Goumltze C and Sandow C Entwicklungeiner Datenbank fuumlr Fernerkundungsindizes vol 45 AK Fern-erkundung Bochum 2012

Hill M Held A Leuning R Coops N Hughes D andCleugh H MODIS spectral signals at a flux tower site Rela-tionships with high-resolution data and CO2 flux and light useefficiency measurements Remote Sens Environ 103 351ndash368httpsdoiorg101016jrse200506015 2006

Hill M J and Guerschman J P The MODIS Global VegetationFractional Cover Product 2001ndash2018 Characteristics of Vegeta-tion Fractional Cover in Grasslands and Savanna Woodlands Re-mote Sens 12 406 httpsdoiorg103390rs12030406 2020

Hinojo-Hinojo C Castellanos A E Huxman T Ro-driguez J C Vargas R Romo-Leoacuten J R and Bie-derman J A Native shrubland and managed buffelgrasssavanna in drylands Implications for ecosystem carbonand water fluxes Agr Forest Meteorol 268 269ndash278httpsdoiorg101016jagrformet201901030 2019

Jaumlrvi L Havu M Ward H C Bellucco V McFad-den J P Toivonen T Heikinheimo V Kolari P Ri-ikonen A and Grimmond C S B Spatial modeling oflocal-scale biogenic and anthropogenic carbon dioxide emis-sions in Helsinki J Geophys Res-Atmos 124 8363ndash8384httpsdoiorg1010292018JD029576 2019

Joiner J and Yoshida Y Satellite-based reflectances capture largefraction of variability in global gross primary production (GPP)at weekly time scales Agr Forest Meteorol 291 108092httpsdoiorg101016jagrformet2020108092 2020

Jung M Schwalm C Migliavacca M Walther S Camps-VallsG Koirala S Anthoni P Besnard S Bodesheim P Carval-hais N Chevallier F Gans F Goll D S Haverd V KoumlhlerP Ichii K Jain A K Liu J Lombardozzi D Nabel J EM S Nelson J A OrsquoSullivan M Pallandt M Papale DPeters W Pongratz J Roumldenbeck C Sitch S TramontanaG Walker A Weber U and Reichstein M Scaling carbonfluxes from eddy covariance sites to globe synthesis and eval-uation of the FLUXCOM approach Biogeosciences 17 1343ndash1365 httpsdoiorg105194bg-17-1343-2020 2020

Kang M Ichii K Kim J Indrawati Y M Park J Moon MLim J-H and Chun J-H New Gap-Filling Strategies forLong-Period Flux Data Gaps Using a Data-Driven ApproachAtmosphere 10 568 httpsdoiorg103390atmos101005682019

Lal R Griffin M Apt J Lave L and Morgan M GManaging Soil Carbon Science 304 393ndash393httpsdoiorg101126science1093079 2004

LI-COR B Eddypro Software instruction manual Nebraska USversion 70 2019

Liu Y Liu R Pisek J and Chen J M Separating overstoryand understory leaf area indices for global needleleaf and de-ciduous broadleaf forests by fusion of MODIS and MISR dataBiogeosciences 14 1093ndash1110 httpsdoiorg105194bg-14-1093-2017 2017

Long S P Marshall-Colon A and Zhu X-G Meetingthe Global Food Demand of the Future by EngineeringCrop Photosynthesis and Yield Potential Cell 161 56ndash66httpsdoiorg101016jcell201503019 2015

Ma X Huete A Yu Q Restrepo-Coupe N Beringer J Hut-ley L B Kanniah K D Cleverly J and Eamus D Param-

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 391

eterization of an ecosystem light-use-efficiency model for pre-dicting savanna GPP using MODIS EVI Remote Sens Environ154 253ndash271 2014

Ma X Migliavacca M Wirth C Bohn F J Huth ARichter R and Mahecha M D Monitoring Plant FunctionalDiversity Using the Reflectance and Echo from Space RemoteSens 12 1248 httpsdoiorg103390rs12081248 2020

Marcolla B Roumldenbeck C and Cescatti A Patterns and con-trols of inter-annual variability in the terrestrial carbon budgetBiogeosciences 14 3815ndash3829 httpsdoiorg105194bg-14-3815-2017 2017

Mastachi-Loza C A Gonzaacutelez-Sosa E Becerril-Pintildea R andBraud I Peacuterdidas por intercepcioacuten en mezquite (Prosopis laevi-gata) y huizache (Acacia farnesiana) de la regioacuten semiaacuterida delcentro de Meacutexico Tecnol Cienc Agua 1 103ndash120 2010

Mauder M and Foken T Documentation and instruction man-ual of the eddy-covariance software package TK3 UniversitaumltBayreuth Abteilung Mikrometeorologie Bayreuth Germany2011

McClaran M A century of vegetation change on the Santa Rita Ex-perimental Range in Santa Rita Experimental Range 100 years(1903 to 2003) to accomplishments and contributions edited byMcClaran M P Ffolliott P F and Edminister C B Tech Co-ords USDA Forest service proceedings RMRS-P-30 TucsonAZ US 16ndash33 2003

McCulley R Jobbagy E Pockman W and Jackson R Nu-trient uptake as a contributing explanation for deep rootingin arid and semi-arid ecosystems Oecologia 141 620ndash628httpsdoiorg101007s00442-004-1687-z 2004

McElreath R Statistical Rethinking A Bayesian Course with Ex-amples in R and STAN CRC Press Boca Raton FL US 2020

Michailidis M How Driverless AI Prevents Overfitting and Leak-age Open Source Lead AI ML available at httpswwwh2oaiblogdriverless-ai-prevents-overfitting-leakage (last access7 July 2020) 2018

Mitchell R and Frank E Accelerating the XGBoost algo-rithm using GPU computing PeerJ Comput Sci 3 e127httpsdoiorg107717peerj-cs127 2017

Monteith J Solar radiation and productivity in tropical ecosys-tems J Appl Ecol 9 747ndash766 1972

Murray-Tortarolo G Friedlingstein P Sitch S Jaramillo V JMurguiacutea-Flores F Anav A Liu Y Arneth A Arvanitis AHarper A Jain A Kato E Koven C Poulter B Stocker BD Wiltshire A Zaehle S and Zeng N The carbon cycle inMexico past present and future of C stocks and fluxes Biogeo-sciences 13 223ndash238 httpsdoiorg105194bg-13-223-20162016

Nash J E and Sutcliffe J V River flow forecasting through con-ceptual models part I ndash A discussion of principles J Hydrol 10282ndash290 1970

Owen N A Choncubhair Oacute N Males J del Real Laborde J IRubio-Corteacutes R Griffiths H and Lanigan G Eddy covari-ance captures four-phase crassulacean acid metabolism (CAM)gas exchange signature in Agave Plant Cell Environ 39 295ndash309 httpsdoiorg101111pce12610 2016

Pasetto D Arenas-Castro S Bustamante J Casagrandi RChrysoulakis N Cord A F Dittrich A Domingo-Marimon C El Serafy G and Karnieli A Integration ofsatellite remote sensing data in ecosystem modelling at local

scales Practices and trends Methods Ecol Evol 9 1810ndash1821httpsdoiorg1011112041-210X13018 2018

Pastorello G Trotta C and Canfora E The FLUXNET2015dataset and the ONEFlux processing pipeline for eddy covari-ance data Sci Data 7 225 pp httpsdoiorg101038s41597-020-0534-3 2020

Pearcy R W Heterogeneous light environments in Handbookof Functional Plant Ecology edited by Pugnaire I F and Val-ladares F CRC Press Boca Ratoacuten FL US 270ndash314 2007

Piao S Liu Q Chen A Janssens I A Fu Y Dai J Liu LLian X Shen M and Zhu X Plant phenology and global cli-mate change Current progresses and challenges Glob ChangeBiol 25 1922ndash1940 httpsdoiorg101111gcb14619 2019

R Development Core Team R A language and environment forstatistical computing R Foundation for Statistical Computingavailable at httpswwwr-projectorg (last access 15 October2015) 2009

Reichstein M Falge E Baldocchi D Papale D Aubinet MBerbigier P Bernhofer C Buchmann N Gilmanov Tand Granier A On the separation of net ecosystem ex-change into assimilation and ecosystem respiration reviewand improved algorithm Glob Change Biol 11 1424ndash1439httpsdoiorg101111j1365-24862005001002x 2005

Richardson A D and Hollinger D Y Statistical model-ing of ecosystem respiration using eddy covariance datamaximum likelihood parameter estimation and Monte Carlosimulation of model and parameter uncertainty applied tothree simple models Agr Forest Meteorol 131 191ndash208httpsdoiorg101016jagrformet200505008 2005

Richardson A D Hollinger D Y Shoemaker J K Hughes HSavage K and Davidson E A Six years of ecosystem-atmosphere greenhouse gas fluxes measured in a sub-borealforest Sci Data 6 1ndash15 httpsdoiorg101038s41597-019-0119-1 2019

Roby M C Scott R L Barron-Gafford G A Hamerlynck E Pand Moore D J P Environmental and Vegetative Controls onSoil CO2 Efflux in Three Semiarid Ecosystems Soil Syst 3 6httpsdoiorg103390soilsystems3010006 2019

Running S W and Zhao M Daily GPP and annual NPP(MOD17A2A3) products NASA Earth Observing SystemMODIS land algorithm Univ of Mont Missoula Mont US2015

Running S W Nemani R R Heinsch F A Zhao MReeves M and Hashimoto H A continuous satellite-derived measure of global terrestrial primary productionBioscience 54 547ndash560 httpsdoiorg1016410006-3568(2004)054[0547ACSMOG]20CO2 2004

Rzedowski J Vegetacioacuten de Meacutexico Meacutexico Editor Limusa1978

Scott R L and Biederman J A Critical Zone Water Balance Over13 Years in a Semiarid Savanna Water Resour Res 55 574ndash588 httpsdoiorg1010292018WR023477 2019

Scott R L Watts C Payan J G Edwards E Goodrich D CWilliams D and James Shuttleworth W The understory andoverstory partitioning of energy and water fluxes in an opencanopy semiarid woodland Agr Forest Meteorol 114 127ndash139 httpsdoiorg101016S0168-1923(02)00197-1 2003

Scott R L Edwards E A Shuttleworth W J Huxman T EWatts C and Goodrich D C Interannual and seasonal vari-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

392 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

ation in fluxes of water and carbon dioxide from a ripar-ian woodland ecosystem Agr Forest Meteorol 122 65ndash84httpsdoiorg101016jagrformet200309001 2004

Scott R L Biederman J A Hamerlynck E P and Barron-Gafford G A The carbon balance pivot point of southwest-ern US semiarid ecosystems Insights from the 21st centurydrought J Geophys Res-Biogeo 120 2612ndash2624 2015

Segerstrom K Geology of the Bernal-Jalpan Area Estado deQueretaro Mexico Geol Surv Bull 1104 19ndash82 1961

Soper F M McCalley C K Sparks K and Sparks J P Soilcarbon dioxide emissions from the Mojave desert Isotopic evi-dence for a carbonate source Geophys Res Lett 44 245ndash251httpsdoiorg1010022016GL071198 2017

Stevens N T Steiner S H and MacKay R J Assessing agree-ment between two measurement systems An alternative to thelimits of agreement approach Stat Methods Med Res 262487ndash2504 httpsdoiorg1011770962280215601133 2015

Tagesson T Ardouml J Cappelaere B Kergoat L AbdiA Horion S and Fensholt R Modelling spatial andtemporal dynamics of gross primary production in theSahel from earth-observation-based photosynthetic capacityand quantum efficiency Biogeosciences 14 1333ndash1348httpsdoiorg105194bg-14-1333-2017 2017

Thornton P Thornton M Mayer B Wei Y Devarakonda RVose R and Cook R Daymet Daily Surface Weather Data ona 1-km Grid for North America Version 3 Oak Ridge NationalLab (ORNL) Oak Ridge TN US 2017

Tramontana G Jung M Schwalm C R Ichii K Camps-VallsG Raacuteduly B Reichstein M Arain M A Cescatti A KielyG Merbold L Serrano-Ortiz P Sickert S Wolf S andPapale D Predicting carbon dioxide and energy fluxes acrossglobal FLUXNET sites with regression algorithms Biogeo-sciences 13 4291ndash4313 httpsdoiorg105194bg-13-4291-2016 2016rdquo

Truong A Walters A Goodsitt J Hines K Bruss C B andFarivar R Towards automated machine learning Evaluationand comparison of AutoML approaches and tools in Proceed-ings of the 2019 IEEE 31st International Conference on Toolswith Artificial Intelligence (ICTAI) Portland OR US 1471ndash1479 2019

Verbist K Santibantildeez F Gabriels D and Soto G Atlas dezonas aacuteridas de Ameacuterica Latina y El Caribe Donoso MC(Coord) UNESCO-PHI and CAZALAC Montevideo Uruguay2010

Wang H Li X Ma M and Geng L Improving esti-mation of gross primary production in dryland ecosystemsby a model-data fusion approach Remote Sens 11 225httpsdoiorg103390rs11030225 2019

Wang X Ma M Li X Song Y Tan J Huang G Zhang ZZhao T Feng J and Ma Z Validation of MODIS-GPP prod-uct at 10 flux sites in northern China Int J Remote Sens 34587ndash599 httpsdoiorg101080014311612012715774 2013

Wang X Wang H Li X and Ran Y Photosynthesis (NPPNEP Respiration) Obs Meas Ecohydrol Process 1ndash30httpsdoiorg101007978-3-662-47871-4_10-1 2018

Wheeler C W Archer S R Asner G P and McMurtry C RClimaticedaphic controls on soil carbonnitrogen response toshrub encroachment in desert grassland Ecol Appl 17 1911ndash1928 httpsdoiorg10189006-15801 2007

Wilcox B P Birt A Fuhlendorf S D and Archer S R Emerg-ing frameworks for understanding and mitigating woody plantencroachment in grassy biomes Curr Opin Env Sust 32 46ndash52 httpsdoiorg101016jcosust201804005 2018

Wilson J B Cover plus ways of measuring plant canopiesand the terms used for them J Veg Sci 22 197ndash206httpsdoiorg101111j1654-1103201001238x 2011

Wofsy S Goulden M Munger J Fan S-M Bakwin PDaube B Bassow S and Bazzaz F Net exchange ofCO2 in a mid-latitude forest Science 260 1314ndash1317httpsdoiorg101126science26051121314 1993

Wu C Niu Z and Gao S Gross primary production estimationfrom MODIS data with vegetation index and photosyntheticallyactive radiation in maize J Geophys Res-Atmos 115 D12127httpsdoiorg1010292009JD013023 2010

Wutzler T Lucas-Moffat A Migliavacca M Knauer JSickel K Šigut L Menzer O and Reichstein MBasic and extensible post-processing of eddy covarianceflux data with REddyProc Biogeosciences 15 5015ndash5030httpsdoiorg105194bg-15-5015-2018 2018

Xiao H McDonald-Madden E Sabbadin R Peyrard NDee L E and Chadegraves I The value of understanding feed-backs from ecosystem functions to species for managing ecosys-tems Nat Commun 10 1ndash10 httpsdoiorg101038s41467-019-11890-7 2019

Xu B Yang Y Li P Shen H and Fang J Globalpatterns of ecosystem carbon flux in forests A biometricdata-based synthesis Global Biogeochem Cy 28 962ndash973httpsdoiorg1010022013GB004593 2014

Yao J Liu H Huang J Gao Z Wang G Li D Yu H andChen X Accelerated dryland expansion regulates future vari-ability in dryland gross primary production Nat Commun 111ndash10 httpsdoiorg101038s41467-020-15515-2 2020

Yepez E A Williams D G Scott R L and Lin GPartitioning overstory and understory evapotranspiration ina semiarid savanna woodland from the isotopic composi-tion of water vapor Agr Forest Meteorol 119 53ndash68httpsdoiorg101016S0168-1923(03)00116-3 2003

Yona L Cashore B Jackson R B Ometto J and Brad-ford M A Refining national greenhouse gas inventories Am-bio 1ndash6 httpsdoiorg101007s13280-019-01312-9 2020

Zeng X Zeng X and Barlage M Growing temperate shrubsover arid and semiarid regions in the Community Land ModelndashDynamic Global Vegetation Model Global Biogeochem Cy 22GB3003 httpsdoiorg1010292007GB003014 2008

Zhang A Jia G Epstein H E and Xia J ENSO elicits oppos-ing responses of semi-arid vegetation between Hemispheres SciRep 7 42281 httpsdoiorg101038srep42281 2017

Zhang L Xiao J Zheng Y Li S and Zhou Y Increased carbonuptake and water use efficiency in global semi-arid ecosystemsEnviron Res Lett 15 034022 httpsdoiorg1010881748-9326ab68ec 2020

Zhu X-G Long S P and Ort D R What is the max-imum efficiency with which photosynthesis can convert so-lar energy into biomass Curr Opin Biotech 19 153ndash159httpsdoiorg101016jcopbio200802004 2008

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

  • Abstract
  • Introduction
  • Materials and methods
    • Site description
    • Eddy covariance measurements
    • Flux data processing
    • Remotely sensed data
    • MODIS algorithm for GPP
    • Santa Rita site dataset
    • Modeling
    • Variable importance
    • Model agreement
      • Results
        • Eddy covariance fluxes at Bernal
        • Machine learning ensembles as predictors of eddy covariance GPP
        • MODIS as predictor of eddy covariance GPP
        • Prediction eddy covariance GPP with ordinary-least-squares multiple regression
        • Agreement
          • Discussion
            • Model agreement
            • MODIS discrepancies
            • Forcing variables and machine learning
            • Carbon flux
            • Ecosystem management
              • Conclusions
              • Appendix A
              • Appendix B
              • Appendix C
              • Code and data availability
              • Author contributions
              • Competing interests
              • Acknowledgement
              • Financial support
              • Review statement
              • References
Page 18: Machine learning estimates of eddy covariance carbon flux ... · Scrub is a xeric category of shrublands characterized by plants with small leaves; it is very thorny, and its biomass

384 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

5 Conclusions

The best modeling approach was the ensemble of ma-chine learning The second option to estimate GPP was themultiple-regression OLS and the third alternative were theMODIS estimates Machine learning was a good option topredict GPP in the local context where it was generated oth-erwise its performance was not good This was demonstratedwhen using the EML from Santa Rita and trying to predictfluxes at Bernal Nevertheless a machine learning modelwould be useful for gap filling or the evaluation of GPP atthe same site The GPP estimates of a given model can be ad-justed using the bias parameters of the probability of agree-ment to improve the relationship Further research explor-ing the usefulness of EML to predict GPP in other contextscould use EML to calibrate mechanistic model parametershybrid approaches integrating EML and biologically mean-ingful mechanistic models or EML representing the spatialvariability of the landscape

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 385

Appendix A

Table A1 Threshold of velocity friction (ulowast) above which nighttime fluxes were considered valid Estimates were obtained using the Rpackage REddyProc We used only the flux records with ulowast equal to or higher than the corresponding mean ulowast threshold for each yearBound values are a 95 confidence interval NA not applicable

Aggregation method Year Season Mean Lower bound 5 Upper bound 95

ulowast (msminus1)

Single NA NA 0193 0140 0209Year 2017 NA 0194 0128 0227Year 2018 NA 0193 0132 0225Season 2017 2016012 0194 0128 0227Season 2017 2017003 0215 0141 0297Season 2017 2017006 0194 0128 0227Season 2017 2017009 0181 0144 0208Season 2018 2017012 0149 0116 0232Season 2018 2018003 0271 0134 0297Season 2018 2018006 0193 0132 0221

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

386 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix B

Table B1 MODIS reflectance bands and products database

Product name Satellite layer Spatial Temporal Spectralresolution resolution coverage

(d) (nm)

MOD09GA Band 1 500 m 1 620ndash670Band 2 841ndash876Band 3 459ndash479Band 4 545ndash565Band 5 1230ndash1250Band 6 1628ndash1652Band 7 2105ndash2155

MODOCGA Band 8 1 km 405ndash420Band 9 438ndash448Band 10 483ndash493Band 11 526ndash536Band 12 546ndash556Band 13 662ndash672Band 14 673ndash683

MOD17A2H Gross primary production (GPP) 8

Net photosynthesis (PsnNet)

MOD15A2H Fraction of photosynthetically active radiation (Fpar) na

Leaf area index (LAI)

MOD11A2 Land surface temperature and emissivity of day (LST day) 1 km

Land surface temperature and emissivity of night (LST night)

MYD13A1 Enhanced vegetation index (EVI) 500 m 16

MYD15A2H Fraction of photosynthetically active radiation (Fpar) 8

Leaf area index (LAI)

MYD11A2 Land surface temperature and emissivity of day (LST day) 1 km na

Land surface temperature and emissivity of night (LST night)

MCD15A2H Fraction of photosynthetically active radiation (Fpar) 500 m

Leaf area index (LAI)

na not applicable

Table B2 Daily vegetation indexes computed using the MODIS reflectance bands described in Table 1

Index Formula Reference

Simple ratio (SimpleR) SimpleR= Band 2Band 1

(Hill et al 2006)Moisture stress (MoistS) MoistS= Band 6

Band 2

Disease stress index (DSI) DSI= Band 2+Band 4Band 6+Band 1

Redndashgreen ratio index (RGRI) RGRI= Band 1Band 4

Normalized difference vegetation index (NDVI) NDVI= Band 2minusBand 1Band 2+Band 1

Normalized difference water index (NDVI_w) NDVI_w =Band 2minusBand 5Band 2+Band 5

Green leaf index (GLI) GLI= 2lowastBand 11minusBand 14minusBand 92lowastBand 11+Band 14+Band

(Henrich et al 2012)Green atmospherically resistant vegetation (GARI) GARI= Band 5minus(Band 11minus(Band 9minusBand 14))

Band 5minus(Band 11+(Band 9minusBand 14))

Enhanced vegetation index (EVI) EVI= 25lowast Band 2minusBand 1Band 2+(6lowastBand 1)minus(75lowastBand 9)+1

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 387

Table B3 Daymet meteorological database

Variable Spatial resolution Temporal resolution Reference

Precipitation (Dayprc)

1 km Daily (Thornton et al 2017)Shortwave radiation (Daysrad)Maximum air temperature (DayTmax)Minimum air temperature (DayTmin)Water vapor pressure (Dayvp)

Table B4 Precipitation data

Satellite Product name Spatial resolution Temporal resolution

Global Precipitation Measurement (GPM) 3IMERGDF v006 010Daily

Tropical Rainfall Measuring Mission (TRMM) 3B42 v007 025

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

388 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix C

Table C1 Analysis of variance for GGP derived from EC data atthe Bernal site Variable details are described in Appendix B

Model term Coefficient SEa DFb SSc F p VIFd

Regression 14 178048Constant minus1318 648 minus203 004R_08_40542e

minus212 060 1 2680 minus352 0 220R_13_66267e

minus286 123 1 1161 minus232 002 155R_14_67368e 917 259 1 2706 353 0 122RGRI6

minus083 037 1 1100 minus225 003 143GARIf 201 108 1 755 187 006 237EVIMYDg 853 201 1 3903 425 0 294NDVI_wf 390 155 1 1367 251 001 271PsnNetg 0013 0 1 7530 590 0 383LstNgtMYDg 005 002 1 818 194 005 201Dayprch 014 003 1 3590 407 0 192Daysradh 001 0 1 5121 486 0 501DayTmaxh

minus024 006 1 3658 minus411 0 689DayTminh 018 005 1 3222 386 0 496TRMMi minus004 002 1 1087 minus224 003 141Error 403 87297Total 417 265345

a Standard error of coefficient b Degrees of freedom c Adjusted sum of squares d Variance inflation factore Variable name denotes the band number and spectral bandwidth of MODIS (Moderate Resolution ImagingSpectroradiometer) f Vegetation indices RGRI is redndashgreen ratio index and GARI is green atmosphericallyresistant vegetation index details of formula are described in Appendix B g Layers from MODIS productsEVIMYD is enhanced vegetation index from MYD13 PsnNet is photosynthesis product form MOD17LstNgtMYD is nighttime land surface temperature emissivity from MYD11 h Variables obtained fromDaymet daily dataset DayTmin is minimum temperature DayTmax is maximum temperature Daysrad isshortwave radiation Dayprc is precipitation i Daily rainfall rate from 3B42 TRMM (Tropical RainfallMeasuring Mission)

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 389

Code and data availability Database and programming code areavailable at httpsdoiorg105281zenodo3598595 (Guevara-Escobar et al 2020)

Author contributions AGE contributed to conceptualization for-mal analysis investigation methodology software supervisionvalidation visualization and writing of the original draft as wellas review and editing EGS was in charge of funding acquisitionproject administration and writing and review MCJ contributed toinvestigation methodology validation visualization and writing ofthe original draft as well as review editing and communicationMEQB helped with investigation validation writing and reviewHSA contributed with conceptualization writing and review ICAacutehelped with investigation writing and review VHCS contributedwith writing review and funding acquisition

Competing interests The authors declare that they have no conflictof interest

Acknowledgement We thank NASA and Ameriflux for data avail-ability Data filtering was carried out by the cluster Abacus ICentre of Applied Mathematics and High Performance ComputingABACUS-CINVESTAV

Financial support This research has been supported by CONA-CyT ndash SEMARNAT (grant no 2014-1-(249407))

Review statement This paper was edited by Paul Stoy and re-viewed by three anonymous referees

References

Aguirre-Diacuteaz G J Aguilloacuten-Robles A Tristaacuten-Gonzaacutelez MLabarthe-Hernaacutendez G Loacutepez-Martiacutenez M Bellon Hand Nieto-Obregoacuten J Geologic setting of the Pentildeade Bernal Natural Monument Quereacutetaro Meacutexico Anendogenous volcanic dome Geosphere 9 557ndash571httpsdoiorg101130GES008431 2013

Anav A Friedlingstein P Beer C Ciais P Harper AJones C Murray-Tortarolo G Papale D Parazoo N Cand Peylin P Spatiotemporal patterns of terrestrial grossprimary production A review Rev Geophys 53 785ndash818httpsdoiorg1010022015RG000483 2015

Anderson-Teixeira K J Delong J P Fox A M Brese D Aand Litvak M E Differential responses of production and res-piration to temperature and moisture drive the carbon balanceacross a climatic gradient in New Mexico Glob Change Biol17 410ndash424 httpsdoiorg101111j1365-2486201002269x2011

Baldocchi D Measuring fluxes of trace gases and energy be-tween ecosystems and the atmospherendashthe state and future of the

eddy covariance method Glob Change Biol 20 3600ndash3609httpsdoiorg101111gcb12649 2014

Barron-Gafford G A Sanchez-Cantildeete E P Minor R LHendryx S M Lee E Sutter L F Tran N Parra EColella T Murphy P C Hamerlynck E P Kumar P andScott R L Impacts of hydraulic redistribution on grassndashtreecompetition vs facilitation in a semi-arid savanna New Phytol215 1451ndash1461 httpsdoiorg101111nph14693 2017

Belayneh A and Tessema Z K Mechanisms of bush encroach-ment and its inter-connection with rangeland degradation insemi-arid African ecosystems a review J Arid Land 9 299ndash312 httpsdoiorg101007s40333-016-0023-x 2017

Bland J M and Altman D G Statistical methods for assessingagreement between two methods of clinical measurement Int JNurs Stud 47 931ndash936 2010

Boettiger C and Eddelbuettel D An introduction to rockerDocker containers 639 for r ArXiv Prepr ArXiv171003675 6402017

Bonilla-Moheno M and Aide T M Beyond deforestationLand cover transitions in Mexico Agric Syst 178 102734httpsdoiorg101016jagsy2019102734 2020

Booker K Huntsinger L Bartolome J W Sayre N Fand Stewart W What can ecological science tell us aboutopportunities for carbon sequestration on arid rangelands inthe United States Glob Environ Change 23 240ndash251httpsdoiorg101016jgloenvcha201210001 2013

Butzer K W and Butzer E K The lsquonaturalrsquovegetation ofthe Mexican Bajiacuteo archival documentation of a 16th-century savanna environment Quat Int 43 161ndash172httpsdoiorg101016S1040-6182(97)00032-3 1997

CICESE C de I C y de E S de E Base de datos climatoloacutegicanacional (CLICOM) Villa Bernal Quereacutetaro available at httpclicom-mexcicesemx (last access 28 August 2019) 2015

Cohen W B Maiersperger T K Yang Z Gower S TTurner D P Ritts W D Berterretche M and Run-ning S W Comparisons of land cover and LAI esti-mates derived from ETM+ and MODIS for four sites inNorth America A quality assessment of 20002001 provi-sional MODIS products Remote Sens Environ 88 233ndash255httpsdoiorg101016jrse200306006 2003

Cui T Wang Y Sun R Qiao C Fan W Jiang GHao L and Zhang L Estimating vegetation primaryproduction in the Heihe River Basin of China withmulti-source and multi-scale data PloS One 11 1ndash20httpsdoiorg101371journalpone0153971 2016

Curtis J T and McIntosh R P The Interrelations of Certain An-alytic and Synthetic Phytosociological Characters Ecology 31434ndash455 httpsdoiorg1023071931497 1950

Delgado-Balbuena J Yepez E A Aacutengeles-Peacuterez G Aguirre-Gutierrez C Arredondo T Ayala-Nintildeo F Bullock S HCastellanos A E Cueva A Figueroa-Espinoza B Garatuza-Payaacuten J Hinojo-Hinojo C Maya-Delgado Y Meacutendez-Barroso L Oechel W Paz-Pellat F Peacuterez-Ruiacutez E R Ro-driacuteguez J C Rojas-Robles N Saacutenchez-Mejiacutea Z M Uuh-Sonda J Vargas R Verduzco V S Vivoni E R andWatts C Flujos anuales de carbono en ecosistemas terrestres deMeacutexico in Estado Actual del Conocimiento del Ciclo del Car-bono y sus Interacciones en Meacutexico Siacutentesis a 2018 p 686Texcoco Estado de Meacutexico Meacutexico 2018

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

390 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Eshel G Dayalu A Wofsy S C Munger J W andTziperman E Listening to the Forest An ArtificialNeural Network-Based Model of Carbon Uptake at Har-vard Forest J Geophys Res-Biogeo 124 461ndash478httpsdoiorg1010292018JG004791 2019

Fu C Wang G Bible K Goulden M L Saleska S RScott R L and Cardon Z G Hydraulic redistributionaffects modeled carbon cycling via soil microbial activityand suppressed fire Glob Change Biol 24 3472ndash3485httpsdoiorg101111gcb14164 2018

Gebremichael M and Barros A P Evaluation of MODISGross Primary Productivity (GPP) in tropical mon-soon regions Remote Sens Environ 100 150ndash166httpsdoiorg101016jrse200510009 2006

Gitelson A A Peng Y Arkebauer T J and Suyker A EProductivity absorbed photosynthetically active radiation andlight use efficiency in crops Implications for remote sensingof crop primary production J Plant Physiol 177 100ndash109httpsdoiorg101016jjplph201412015 2015

Glenn E P Scott R L Nguyen U and Nagler P L Wide-area ratios of evapotranspiration to precipitation in monsoon-dependent semiarid vegetation communities J Arid Envi-ron 117 84ndash95 httpsdoiorg101016jjaridenv2015020102015

Goldstein A Turner W R Spawn S A Anderson-Teixeira K J Cook-Patton S Fargione J Gibbs H KGriscom B Hewson J H and Howard J F Protecting ir-recoverable carbon in Earthrsquos ecosystems Nat Clim Change10 1ndash9 httpsdoiorg101038s41558-020-0738-8 2020

Goulden M L Munger J W Fan S Daube B Cand Wofsy S C Measurements of carbon sequestra-tion by long-term eddy covariance Methods and a criti-cal evaluation of accuracy Glob Change Biol 2 169ndash182httpsdoiorg101111j1365-24861996tb00070x 1996

Grossiord C Buckley T N Cernusak L A Novick K A Poul-ter B Siegwolf R T Sperry J S and McDowell N GPlant responses to rising vapor pressure deficit New Phytol226 1550ndash1566 2020

Guerschman J P Hill M J Renzullo L J Barrett D JMarks A S and Botha E J Estimating fractional coverof photosynthetic vegetation non-photosynthetic vegetation andbare soil in the Australian tropical savanna region upscaling theEO-1 Hyperion and MODIS sensors Remote Sens Environ113 928ndash945 httpsdoiorg101016jrse200901006 2009

Guevara-Escobar A Gonzaacutelez-Sosa E Cervantes-Jimeacutenez MSuzaacuten-Azpiri H Queijeiro-Bolantildeos M E Carrillo-AacutengelesI and Cambroacuten-Sandoval V H Machine learning estimates ofeddy covariance carbon flux in a scrub in the Mexican highlandZenodo httpsdoiorg105281zenodo3598595 2020

Hall P Gill N Kurka M Phan W and Bartz A MachineLearning Interpretability with H2O Driverless AI edited byBartz A H2Oai Inc California US 2019

Hastings S J Oechel W C and Muhlia-Melo A Diur-nal seasonal and annual variation in the net ecosystem CO2exchange of a desert shrub community (Sarcocaulescent) inBaja California Mexico Glob Change Biol 11 927ndash939httpsdoiorg101111j1365-2486200500951x 2005

Henrich V Krauss G Goumltze C and Sandow C Entwicklungeiner Datenbank fuumlr Fernerkundungsindizes vol 45 AK Fern-erkundung Bochum 2012

Hill M Held A Leuning R Coops N Hughes D andCleugh H MODIS spectral signals at a flux tower site Rela-tionships with high-resolution data and CO2 flux and light useefficiency measurements Remote Sens Environ 103 351ndash368httpsdoiorg101016jrse200506015 2006

Hill M J and Guerschman J P The MODIS Global VegetationFractional Cover Product 2001ndash2018 Characteristics of Vegeta-tion Fractional Cover in Grasslands and Savanna Woodlands Re-mote Sens 12 406 httpsdoiorg103390rs12030406 2020

Hinojo-Hinojo C Castellanos A E Huxman T Ro-driguez J C Vargas R Romo-Leoacuten J R and Bie-derman J A Native shrubland and managed buffelgrasssavanna in drylands Implications for ecosystem carbonand water fluxes Agr Forest Meteorol 268 269ndash278httpsdoiorg101016jagrformet201901030 2019

Jaumlrvi L Havu M Ward H C Bellucco V McFad-den J P Toivonen T Heikinheimo V Kolari P Ri-ikonen A and Grimmond C S B Spatial modeling oflocal-scale biogenic and anthropogenic carbon dioxide emis-sions in Helsinki J Geophys Res-Atmos 124 8363ndash8384httpsdoiorg1010292018JD029576 2019

Joiner J and Yoshida Y Satellite-based reflectances capture largefraction of variability in global gross primary production (GPP)at weekly time scales Agr Forest Meteorol 291 108092httpsdoiorg101016jagrformet2020108092 2020

Jung M Schwalm C Migliavacca M Walther S Camps-VallsG Koirala S Anthoni P Besnard S Bodesheim P Carval-hais N Chevallier F Gans F Goll D S Haverd V KoumlhlerP Ichii K Jain A K Liu J Lombardozzi D Nabel J EM S Nelson J A OrsquoSullivan M Pallandt M Papale DPeters W Pongratz J Roumldenbeck C Sitch S TramontanaG Walker A Weber U and Reichstein M Scaling carbonfluxes from eddy covariance sites to globe synthesis and eval-uation of the FLUXCOM approach Biogeosciences 17 1343ndash1365 httpsdoiorg105194bg-17-1343-2020 2020

Kang M Ichii K Kim J Indrawati Y M Park J Moon MLim J-H and Chun J-H New Gap-Filling Strategies forLong-Period Flux Data Gaps Using a Data-Driven ApproachAtmosphere 10 568 httpsdoiorg103390atmos101005682019

Lal R Griffin M Apt J Lave L and Morgan M GManaging Soil Carbon Science 304 393ndash393httpsdoiorg101126science1093079 2004

LI-COR B Eddypro Software instruction manual Nebraska USversion 70 2019

Liu Y Liu R Pisek J and Chen J M Separating overstoryand understory leaf area indices for global needleleaf and de-ciduous broadleaf forests by fusion of MODIS and MISR dataBiogeosciences 14 1093ndash1110 httpsdoiorg105194bg-14-1093-2017 2017

Long S P Marshall-Colon A and Zhu X-G Meetingthe Global Food Demand of the Future by EngineeringCrop Photosynthesis and Yield Potential Cell 161 56ndash66httpsdoiorg101016jcell201503019 2015

Ma X Huete A Yu Q Restrepo-Coupe N Beringer J Hut-ley L B Kanniah K D Cleverly J and Eamus D Param-

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 391

eterization of an ecosystem light-use-efficiency model for pre-dicting savanna GPP using MODIS EVI Remote Sens Environ154 253ndash271 2014

Ma X Migliavacca M Wirth C Bohn F J Huth ARichter R and Mahecha M D Monitoring Plant FunctionalDiversity Using the Reflectance and Echo from Space RemoteSens 12 1248 httpsdoiorg103390rs12081248 2020

Marcolla B Roumldenbeck C and Cescatti A Patterns and con-trols of inter-annual variability in the terrestrial carbon budgetBiogeosciences 14 3815ndash3829 httpsdoiorg105194bg-14-3815-2017 2017

Mastachi-Loza C A Gonzaacutelez-Sosa E Becerril-Pintildea R andBraud I Peacuterdidas por intercepcioacuten en mezquite (Prosopis laevi-gata) y huizache (Acacia farnesiana) de la regioacuten semiaacuterida delcentro de Meacutexico Tecnol Cienc Agua 1 103ndash120 2010

Mauder M and Foken T Documentation and instruction man-ual of the eddy-covariance software package TK3 UniversitaumltBayreuth Abteilung Mikrometeorologie Bayreuth Germany2011

McClaran M A century of vegetation change on the Santa Rita Ex-perimental Range in Santa Rita Experimental Range 100 years(1903 to 2003) to accomplishments and contributions edited byMcClaran M P Ffolliott P F and Edminister C B Tech Co-ords USDA Forest service proceedings RMRS-P-30 TucsonAZ US 16ndash33 2003

McCulley R Jobbagy E Pockman W and Jackson R Nu-trient uptake as a contributing explanation for deep rootingin arid and semi-arid ecosystems Oecologia 141 620ndash628httpsdoiorg101007s00442-004-1687-z 2004

McElreath R Statistical Rethinking A Bayesian Course with Ex-amples in R and STAN CRC Press Boca Raton FL US 2020

Michailidis M How Driverless AI Prevents Overfitting and Leak-age Open Source Lead AI ML available at httpswwwh2oaiblogdriverless-ai-prevents-overfitting-leakage (last access7 July 2020) 2018

Mitchell R and Frank E Accelerating the XGBoost algo-rithm using GPU computing PeerJ Comput Sci 3 e127httpsdoiorg107717peerj-cs127 2017

Monteith J Solar radiation and productivity in tropical ecosys-tems J Appl Ecol 9 747ndash766 1972

Murray-Tortarolo G Friedlingstein P Sitch S Jaramillo V JMurguiacutea-Flores F Anav A Liu Y Arneth A Arvanitis AHarper A Jain A Kato E Koven C Poulter B Stocker BD Wiltshire A Zaehle S and Zeng N The carbon cycle inMexico past present and future of C stocks and fluxes Biogeo-sciences 13 223ndash238 httpsdoiorg105194bg-13-223-20162016

Nash J E and Sutcliffe J V River flow forecasting through con-ceptual models part I ndash A discussion of principles J Hydrol 10282ndash290 1970

Owen N A Choncubhair Oacute N Males J del Real Laborde J IRubio-Corteacutes R Griffiths H and Lanigan G Eddy covari-ance captures four-phase crassulacean acid metabolism (CAM)gas exchange signature in Agave Plant Cell Environ 39 295ndash309 httpsdoiorg101111pce12610 2016

Pasetto D Arenas-Castro S Bustamante J Casagrandi RChrysoulakis N Cord A F Dittrich A Domingo-Marimon C El Serafy G and Karnieli A Integration ofsatellite remote sensing data in ecosystem modelling at local

scales Practices and trends Methods Ecol Evol 9 1810ndash1821httpsdoiorg1011112041-210X13018 2018

Pastorello G Trotta C and Canfora E The FLUXNET2015dataset and the ONEFlux processing pipeline for eddy covari-ance data Sci Data 7 225 pp httpsdoiorg101038s41597-020-0534-3 2020

Pearcy R W Heterogeneous light environments in Handbookof Functional Plant Ecology edited by Pugnaire I F and Val-ladares F CRC Press Boca Ratoacuten FL US 270ndash314 2007

Piao S Liu Q Chen A Janssens I A Fu Y Dai J Liu LLian X Shen M and Zhu X Plant phenology and global cli-mate change Current progresses and challenges Glob ChangeBiol 25 1922ndash1940 httpsdoiorg101111gcb14619 2019

R Development Core Team R A language and environment forstatistical computing R Foundation for Statistical Computingavailable at httpswwwr-projectorg (last access 15 October2015) 2009

Reichstein M Falge E Baldocchi D Papale D Aubinet MBerbigier P Bernhofer C Buchmann N Gilmanov Tand Granier A On the separation of net ecosystem ex-change into assimilation and ecosystem respiration reviewand improved algorithm Glob Change Biol 11 1424ndash1439httpsdoiorg101111j1365-24862005001002x 2005

Richardson A D and Hollinger D Y Statistical model-ing of ecosystem respiration using eddy covariance datamaximum likelihood parameter estimation and Monte Carlosimulation of model and parameter uncertainty applied tothree simple models Agr Forest Meteorol 131 191ndash208httpsdoiorg101016jagrformet200505008 2005

Richardson A D Hollinger D Y Shoemaker J K Hughes HSavage K and Davidson E A Six years of ecosystem-atmosphere greenhouse gas fluxes measured in a sub-borealforest Sci Data 6 1ndash15 httpsdoiorg101038s41597-019-0119-1 2019

Roby M C Scott R L Barron-Gafford G A Hamerlynck E Pand Moore D J P Environmental and Vegetative Controls onSoil CO2 Efflux in Three Semiarid Ecosystems Soil Syst 3 6httpsdoiorg103390soilsystems3010006 2019

Running S W and Zhao M Daily GPP and annual NPP(MOD17A2A3) products NASA Earth Observing SystemMODIS land algorithm Univ of Mont Missoula Mont US2015

Running S W Nemani R R Heinsch F A Zhao MReeves M and Hashimoto H A continuous satellite-derived measure of global terrestrial primary productionBioscience 54 547ndash560 httpsdoiorg1016410006-3568(2004)054[0547ACSMOG]20CO2 2004

Rzedowski J Vegetacioacuten de Meacutexico Meacutexico Editor Limusa1978

Scott R L and Biederman J A Critical Zone Water Balance Over13 Years in a Semiarid Savanna Water Resour Res 55 574ndash588 httpsdoiorg1010292018WR023477 2019

Scott R L Watts C Payan J G Edwards E Goodrich D CWilliams D and James Shuttleworth W The understory andoverstory partitioning of energy and water fluxes in an opencanopy semiarid woodland Agr Forest Meteorol 114 127ndash139 httpsdoiorg101016S0168-1923(02)00197-1 2003

Scott R L Edwards E A Shuttleworth W J Huxman T EWatts C and Goodrich D C Interannual and seasonal vari-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

392 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

ation in fluxes of water and carbon dioxide from a ripar-ian woodland ecosystem Agr Forest Meteorol 122 65ndash84httpsdoiorg101016jagrformet200309001 2004

Scott R L Biederman J A Hamerlynck E P and Barron-Gafford G A The carbon balance pivot point of southwest-ern US semiarid ecosystems Insights from the 21st centurydrought J Geophys Res-Biogeo 120 2612ndash2624 2015

Segerstrom K Geology of the Bernal-Jalpan Area Estado deQueretaro Mexico Geol Surv Bull 1104 19ndash82 1961

Soper F M McCalley C K Sparks K and Sparks J P Soilcarbon dioxide emissions from the Mojave desert Isotopic evi-dence for a carbonate source Geophys Res Lett 44 245ndash251httpsdoiorg1010022016GL071198 2017

Stevens N T Steiner S H and MacKay R J Assessing agree-ment between two measurement systems An alternative to thelimits of agreement approach Stat Methods Med Res 262487ndash2504 httpsdoiorg1011770962280215601133 2015

Tagesson T Ardouml J Cappelaere B Kergoat L AbdiA Horion S and Fensholt R Modelling spatial andtemporal dynamics of gross primary production in theSahel from earth-observation-based photosynthetic capacityand quantum efficiency Biogeosciences 14 1333ndash1348httpsdoiorg105194bg-14-1333-2017 2017

Thornton P Thornton M Mayer B Wei Y Devarakonda RVose R and Cook R Daymet Daily Surface Weather Data ona 1-km Grid for North America Version 3 Oak Ridge NationalLab (ORNL) Oak Ridge TN US 2017

Tramontana G Jung M Schwalm C R Ichii K Camps-VallsG Raacuteduly B Reichstein M Arain M A Cescatti A KielyG Merbold L Serrano-Ortiz P Sickert S Wolf S andPapale D Predicting carbon dioxide and energy fluxes acrossglobal FLUXNET sites with regression algorithms Biogeo-sciences 13 4291ndash4313 httpsdoiorg105194bg-13-4291-2016 2016rdquo

Truong A Walters A Goodsitt J Hines K Bruss C B andFarivar R Towards automated machine learning Evaluationand comparison of AutoML approaches and tools in Proceed-ings of the 2019 IEEE 31st International Conference on Toolswith Artificial Intelligence (ICTAI) Portland OR US 1471ndash1479 2019

Verbist K Santibantildeez F Gabriels D and Soto G Atlas dezonas aacuteridas de Ameacuterica Latina y El Caribe Donoso MC(Coord) UNESCO-PHI and CAZALAC Montevideo Uruguay2010

Wang H Li X Ma M and Geng L Improving esti-mation of gross primary production in dryland ecosystemsby a model-data fusion approach Remote Sens 11 225httpsdoiorg103390rs11030225 2019

Wang X Ma M Li X Song Y Tan J Huang G Zhang ZZhao T Feng J and Ma Z Validation of MODIS-GPP prod-uct at 10 flux sites in northern China Int J Remote Sens 34587ndash599 httpsdoiorg101080014311612012715774 2013

Wang X Wang H Li X and Ran Y Photosynthesis (NPPNEP Respiration) Obs Meas Ecohydrol Process 1ndash30httpsdoiorg101007978-3-662-47871-4_10-1 2018

Wheeler C W Archer S R Asner G P and McMurtry C RClimaticedaphic controls on soil carbonnitrogen response toshrub encroachment in desert grassland Ecol Appl 17 1911ndash1928 httpsdoiorg10189006-15801 2007

Wilcox B P Birt A Fuhlendorf S D and Archer S R Emerg-ing frameworks for understanding and mitigating woody plantencroachment in grassy biomes Curr Opin Env Sust 32 46ndash52 httpsdoiorg101016jcosust201804005 2018

Wilson J B Cover plus ways of measuring plant canopiesand the terms used for them J Veg Sci 22 197ndash206httpsdoiorg101111j1654-1103201001238x 2011

Wofsy S Goulden M Munger J Fan S-M Bakwin PDaube B Bassow S and Bazzaz F Net exchange ofCO2 in a mid-latitude forest Science 260 1314ndash1317httpsdoiorg101126science26051121314 1993

Wu C Niu Z and Gao S Gross primary production estimationfrom MODIS data with vegetation index and photosyntheticallyactive radiation in maize J Geophys Res-Atmos 115 D12127httpsdoiorg1010292009JD013023 2010

Wutzler T Lucas-Moffat A Migliavacca M Knauer JSickel K Šigut L Menzer O and Reichstein MBasic and extensible post-processing of eddy covarianceflux data with REddyProc Biogeosciences 15 5015ndash5030httpsdoiorg105194bg-15-5015-2018 2018

Xiao H McDonald-Madden E Sabbadin R Peyrard NDee L E and Chadegraves I The value of understanding feed-backs from ecosystem functions to species for managing ecosys-tems Nat Commun 10 1ndash10 httpsdoiorg101038s41467-019-11890-7 2019

Xu B Yang Y Li P Shen H and Fang J Globalpatterns of ecosystem carbon flux in forests A biometricdata-based synthesis Global Biogeochem Cy 28 962ndash973httpsdoiorg1010022013GB004593 2014

Yao J Liu H Huang J Gao Z Wang G Li D Yu H andChen X Accelerated dryland expansion regulates future vari-ability in dryland gross primary production Nat Commun 111ndash10 httpsdoiorg101038s41467-020-15515-2 2020

Yepez E A Williams D G Scott R L and Lin GPartitioning overstory and understory evapotranspiration ina semiarid savanna woodland from the isotopic composi-tion of water vapor Agr Forest Meteorol 119 53ndash68httpsdoiorg101016S0168-1923(03)00116-3 2003

Yona L Cashore B Jackson R B Ometto J and Brad-ford M A Refining national greenhouse gas inventories Am-bio 1ndash6 httpsdoiorg101007s13280-019-01312-9 2020

Zeng X Zeng X and Barlage M Growing temperate shrubsover arid and semiarid regions in the Community Land ModelndashDynamic Global Vegetation Model Global Biogeochem Cy 22GB3003 httpsdoiorg1010292007GB003014 2008

Zhang A Jia G Epstein H E and Xia J ENSO elicits oppos-ing responses of semi-arid vegetation between Hemispheres SciRep 7 42281 httpsdoiorg101038srep42281 2017

Zhang L Xiao J Zheng Y Li S and Zhou Y Increased carbonuptake and water use efficiency in global semi-arid ecosystemsEnviron Res Lett 15 034022 httpsdoiorg1010881748-9326ab68ec 2020

Zhu X-G Long S P and Ort D R What is the max-imum efficiency with which photosynthesis can convert so-lar energy into biomass Curr Opin Biotech 19 153ndash159httpsdoiorg101016jcopbio200802004 2008

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

  • Abstract
  • Introduction
  • Materials and methods
    • Site description
    • Eddy covariance measurements
    • Flux data processing
    • Remotely sensed data
    • MODIS algorithm for GPP
    • Santa Rita site dataset
    • Modeling
    • Variable importance
    • Model agreement
      • Results
        • Eddy covariance fluxes at Bernal
        • Machine learning ensembles as predictors of eddy covariance GPP
        • MODIS as predictor of eddy covariance GPP
        • Prediction eddy covariance GPP with ordinary-least-squares multiple regression
        • Agreement
          • Discussion
            • Model agreement
            • MODIS discrepancies
            • Forcing variables and machine learning
            • Carbon flux
            • Ecosystem management
              • Conclusions
              • Appendix A
              • Appendix B
              • Appendix C
              • Code and data availability
              • Author contributions
              • Competing interests
              • Acknowledgement
              • Financial support
              • Review statement
              • References
Page 19: Machine learning estimates of eddy covariance carbon flux ... · Scrub is a xeric category of shrublands characterized by plants with small leaves; it is very thorny, and its biomass

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 385

Appendix A

Table A1 Threshold of velocity friction (ulowast) above which nighttime fluxes were considered valid Estimates were obtained using the Rpackage REddyProc We used only the flux records with ulowast equal to or higher than the corresponding mean ulowast threshold for each yearBound values are a 95 confidence interval NA not applicable

Aggregation method Year Season Mean Lower bound 5 Upper bound 95

ulowast (msminus1)

Single NA NA 0193 0140 0209Year 2017 NA 0194 0128 0227Year 2018 NA 0193 0132 0225Season 2017 2016012 0194 0128 0227Season 2017 2017003 0215 0141 0297Season 2017 2017006 0194 0128 0227Season 2017 2017009 0181 0144 0208Season 2018 2017012 0149 0116 0232Season 2018 2018003 0271 0134 0297Season 2018 2018006 0193 0132 0221

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

386 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix B

Table B1 MODIS reflectance bands and products database

Product name Satellite layer Spatial Temporal Spectralresolution resolution coverage

(d) (nm)

MOD09GA Band 1 500 m 1 620ndash670Band 2 841ndash876Band 3 459ndash479Band 4 545ndash565Band 5 1230ndash1250Band 6 1628ndash1652Band 7 2105ndash2155

MODOCGA Band 8 1 km 405ndash420Band 9 438ndash448Band 10 483ndash493Band 11 526ndash536Band 12 546ndash556Band 13 662ndash672Band 14 673ndash683

MOD17A2H Gross primary production (GPP) 8

Net photosynthesis (PsnNet)

MOD15A2H Fraction of photosynthetically active radiation (Fpar) na

Leaf area index (LAI)

MOD11A2 Land surface temperature and emissivity of day (LST day) 1 km

Land surface temperature and emissivity of night (LST night)

MYD13A1 Enhanced vegetation index (EVI) 500 m 16

MYD15A2H Fraction of photosynthetically active radiation (Fpar) 8

Leaf area index (LAI)

MYD11A2 Land surface temperature and emissivity of day (LST day) 1 km na

Land surface temperature and emissivity of night (LST night)

MCD15A2H Fraction of photosynthetically active radiation (Fpar) 500 m

Leaf area index (LAI)

na not applicable

Table B2 Daily vegetation indexes computed using the MODIS reflectance bands described in Table 1

Index Formula Reference

Simple ratio (SimpleR) SimpleR= Band 2Band 1

(Hill et al 2006)Moisture stress (MoistS) MoistS= Band 6

Band 2

Disease stress index (DSI) DSI= Band 2+Band 4Band 6+Band 1

Redndashgreen ratio index (RGRI) RGRI= Band 1Band 4

Normalized difference vegetation index (NDVI) NDVI= Band 2minusBand 1Band 2+Band 1

Normalized difference water index (NDVI_w) NDVI_w =Band 2minusBand 5Band 2+Band 5

Green leaf index (GLI) GLI= 2lowastBand 11minusBand 14minusBand 92lowastBand 11+Band 14+Band

(Henrich et al 2012)Green atmospherically resistant vegetation (GARI) GARI= Band 5minus(Band 11minus(Band 9minusBand 14))

Band 5minus(Band 11+(Band 9minusBand 14))

Enhanced vegetation index (EVI) EVI= 25lowast Band 2minusBand 1Band 2+(6lowastBand 1)minus(75lowastBand 9)+1

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 387

Table B3 Daymet meteorological database

Variable Spatial resolution Temporal resolution Reference

Precipitation (Dayprc)

1 km Daily (Thornton et al 2017)Shortwave radiation (Daysrad)Maximum air temperature (DayTmax)Minimum air temperature (DayTmin)Water vapor pressure (Dayvp)

Table B4 Precipitation data

Satellite Product name Spatial resolution Temporal resolution

Global Precipitation Measurement (GPM) 3IMERGDF v006 010Daily

Tropical Rainfall Measuring Mission (TRMM) 3B42 v007 025

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

388 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix C

Table C1 Analysis of variance for GGP derived from EC data atthe Bernal site Variable details are described in Appendix B

Model term Coefficient SEa DFb SSc F p VIFd

Regression 14 178048Constant minus1318 648 minus203 004R_08_40542e

minus212 060 1 2680 minus352 0 220R_13_66267e

minus286 123 1 1161 minus232 002 155R_14_67368e 917 259 1 2706 353 0 122RGRI6

minus083 037 1 1100 minus225 003 143GARIf 201 108 1 755 187 006 237EVIMYDg 853 201 1 3903 425 0 294NDVI_wf 390 155 1 1367 251 001 271PsnNetg 0013 0 1 7530 590 0 383LstNgtMYDg 005 002 1 818 194 005 201Dayprch 014 003 1 3590 407 0 192Daysradh 001 0 1 5121 486 0 501DayTmaxh

minus024 006 1 3658 minus411 0 689DayTminh 018 005 1 3222 386 0 496TRMMi minus004 002 1 1087 minus224 003 141Error 403 87297Total 417 265345

a Standard error of coefficient b Degrees of freedom c Adjusted sum of squares d Variance inflation factore Variable name denotes the band number and spectral bandwidth of MODIS (Moderate Resolution ImagingSpectroradiometer) f Vegetation indices RGRI is redndashgreen ratio index and GARI is green atmosphericallyresistant vegetation index details of formula are described in Appendix B g Layers from MODIS productsEVIMYD is enhanced vegetation index from MYD13 PsnNet is photosynthesis product form MOD17LstNgtMYD is nighttime land surface temperature emissivity from MYD11 h Variables obtained fromDaymet daily dataset DayTmin is minimum temperature DayTmax is maximum temperature Daysrad isshortwave radiation Dayprc is precipitation i Daily rainfall rate from 3B42 TRMM (Tropical RainfallMeasuring Mission)

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 389

Code and data availability Database and programming code areavailable at httpsdoiorg105281zenodo3598595 (Guevara-Escobar et al 2020)

Author contributions AGE contributed to conceptualization for-mal analysis investigation methodology software supervisionvalidation visualization and writing of the original draft as wellas review and editing EGS was in charge of funding acquisitionproject administration and writing and review MCJ contributed toinvestigation methodology validation visualization and writing ofthe original draft as well as review editing and communicationMEQB helped with investigation validation writing and reviewHSA contributed with conceptualization writing and review ICAacutehelped with investigation writing and review VHCS contributedwith writing review and funding acquisition

Competing interests The authors declare that they have no conflictof interest

Acknowledgement We thank NASA and Ameriflux for data avail-ability Data filtering was carried out by the cluster Abacus ICentre of Applied Mathematics and High Performance ComputingABACUS-CINVESTAV

Financial support This research has been supported by CONA-CyT ndash SEMARNAT (grant no 2014-1-(249407))

Review statement This paper was edited by Paul Stoy and re-viewed by three anonymous referees

References

Aguirre-Diacuteaz G J Aguilloacuten-Robles A Tristaacuten-Gonzaacutelez MLabarthe-Hernaacutendez G Loacutepez-Martiacutenez M Bellon Hand Nieto-Obregoacuten J Geologic setting of the Pentildeade Bernal Natural Monument Quereacutetaro Meacutexico Anendogenous volcanic dome Geosphere 9 557ndash571httpsdoiorg101130GES008431 2013

Anav A Friedlingstein P Beer C Ciais P Harper AJones C Murray-Tortarolo G Papale D Parazoo N Cand Peylin P Spatiotemporal patterns of terrestrial grossprimary production A review Rev Geophys 53 785ndash818httpsdoiorg1010022015RG000483 2015

Anderson-Teixeira K J Delong J P Fox A M Brese D Aand Litvak M E Differential responses of production and res-piration to temperature and moisture drive the carbon balanceacross a climatic gradient in New Mexico Glob Change Biol17 410ndash424 httpsdoiorg101111j1365-2486201002269x2011

Baldocchi D Measuring fluxes of trace gases and energy be-tween ecosystems and the atmospherendashthe state and future of the

eddy covariance method Glob Change Biol 20 3600ndash3609httpsdoiorg101111gcb12649 2014

Barron-Gafford G A Sanchez-Cantildeete E P Minor R LHendryx S M Lee E Sutter L F Tran N Parra EColella T Murphy P C Hamerlynck E P Kumar P andScott R L Impacts of hydraulic redistribution on grassndashtreecompetition vs facilitation in a semi-arid savanna New Phytol215 1451ndash1461 httpsdoiorg101111nph14693 2017

Belayneh A and Tessema Z K Mechanisms of bush encroach-ment and its inter-connection with rangeland degradation insemi-arid African ecosystems a review J Arid Land 9 299ndash312 httpsdoiorg101007s40333-016-0023-x 2017

Bland J M and Altman D G Statistical methods for assessingagreement between two methods of clinical measurement Int JNurs Stud 47 931ndash936 2010

Boettiger C and Eddelbuettel D An introduction to rockerDocker containers 639 for r ArXiv Prepr ArXiv171003675 6402017

Bonilla-Moheno M and Aide T M Beyond deforestationLand cover transitions in Mexico Agric Syst 178 102734httpsdoiorg101016jagsy2019102734 2020

Booker K Huntsinger L Bartolome J W Sayre N Fand Stewart W What can ecological science tell us aboutopportunities for carbon sequestration on arid rangelands inthe United States Glob Environ Change 23 240ndash251httpsdoiorg101016jgloenvcha201210001 2013

Butzer K W and Butzer E K The lsquonaturalrsquovegetation ofthe Mexican Bajiacuteo archival documentation of a 16th-century savanna environment Quat Int 43 161ndash172httpsdoiorg101016S1040-6182(97)00032-3 1997

CICESE C de I C y de E S de E Base de datos climatoloacutegicanacional (CLICOM) Villa Bernal Quereacutetaro available at httpclicom-mexcicesemx (last access 28 August 2019) 2015

Cohen W B Maiersperger T K Yang Z Gower S TTurner D P Ritts W D Berterretche M and Run-ning S W Comparisons of land cover and LAI esti-mates derived from ETM+ and MODIS for four sites inNorth America A quality assessment of 20002001 provi-sional MODIS products Remote Sens Environ 88 233ndash255httpsdoiorg101016jrse200306006 2003

Cui T Wang Y Sun R Qiao C Fan W Jiang GHao L and Zhang L Estimating vegetation primaryproduction in the Heihe River Basin of China withmulti-source and multi-scale data PloS One 11 1ndash20httpsdoiorg101371journalpone0153971 2016

Curtis J T and McIntosh R P The Interrelations of Certain An-alytic and Synthetic Phytosociological Characters Ecology 31434ndash455 httpsdoiorg1023071931497 1950

Delgado-Balbuena J Yepez E A Aacutengeles-Peacuterez G Aguirre-Gutierrez C Arredondo T Ayala-Nintildeo F Bullock S HCastellanos A E Cueva A Figueroa-Espinoza B Garatuza-Payaacuten J Hinojo-Hinojo C Maya-Delgado Y Meacutendez-Barroso L Oechel W Paz-Pellat F Peacuterez-Ruiacutez E R Ro-driacuteguez J C Rojas-Robles N Saacutenchez-Mejiacutea Z M Uuh-Sonda J Vargas R Verduzco V S Vivoni E R andWatts C Flujos anuales de carbono en ecosistemas terrestres deMeacutexico in Estado Actual del Conocimiento del Ciclo del Car-bono y sus Interacciones en Meacutexico Siacutentesis a 2018 p 686Texcoco Estado de Meacutexico Meacutexico 2018

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

390 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Eshel G Dayalu A Wofsy S C Munger J W andTziperman E Listening to the Forest An ArtificialNeural Network-Based Model of Carbon Uptake at Har-vard Forest J Geophys Res-Biogeo 124 461ndash478httpsdoiorg1010292018JG004791 2019

Fu C Wang G Bible K Goulden M L Saleska S RScott R L and Cardon Z G Hydraulic redistributionaffects modeled carbon cycling via soil microbial activityand suppressed fire Glob Change Biol 24 3472ndash3485httpsdoiorg101111gcb14164 2018

Gebremichael M and Barros A P Evaluation of MODISGross Primary Productivity (GPP) in tropical mon-soon regions Remote Sens Environ 100 150ndash166httpsdoiorg101016jrse200510009 2006

Gitelson A A Peng Y Arkebauer T J and Suyker A EProductivity absorbed photosynthetically active radiation andlight use efficiency in crops Implications for remote sensingof crop primary production J Plant Physiol 177 100ndash109httpsdoiorg101016jjplph201412015 2015

Glenn E P Scott R L Nguyen U and Nagler P L Wide-area ratios of evapotranspiration to precipitation in monsoon-dependent semiarid vegetation communities J Arid Envi-ron 117 84ndash95 httpsdoiorg101016jjaridenv2015020102015

Goldstein A Turner W R Spawn S A Anderson-Teixeira K J Cook-Patton S Fargione J Gibbs H KGriscom B Hewson J H and Howard J F Protecting ir-recoverable carbon in Earthrsquos ecosystems Nat Clim Change10 1ndash9 httpsdoiorg101038s41558-020-0738-8 2020

Goulden M L Munger J W Fan S Daube B Cand Wofsy S C Measurements of carbon sequestra-tion by long-term eddy covariance Methods and a criti-cal evaluation of accuracy Glob Change Biol 2 169ndash182httpsdoiorg101111j1365-24861996tb00070x 1996

Grossiord C Buckley T N Cernusak L A Novick K A Poul-ter B Siegwolf R T Sperry J S and McDowell N GPlant responses to rising vapor pressure deficit New Phytol226 1550ndash1566 2020

Guerschman J P Hill M J Renzullo L J Barrett D JMarks A S and Botha E J Estimating fractional coverof photosynthetic vegetation non-photosynthetic vegetation andbare soil in the Australian tropical savanna region upscaling theEO-1 Hyperion and MODIS sensors Remote Sens Environ113 928ndash945 httpsdoiorg101016jrse200901006 2009

Guevara-Escobar A Gonzaacutelez-Sosa E Cervantes-Jimeacutenez MSuzaacuten-Azpiri H Queijeiro-Bolantildeos M E Carrillo-AacutengelesI and Cambroacuten-Sandoval V H Machine learning estimates ofeddy covariance carbon flux in a scrub in the Mexican highlandZenodo httpsdoiorg105281zenodo3598595 2020

Hall P Gill N Kurka M Phan W and Bartz A MachineLearning Interpretability with H2O Driverless AI edited byBartz A H2Oai Inc California US 2019

Hastings S J Oechel W C and Muhlia-Melo A Diur-nal seasonal and annual variation in the net ecosystem CO2exchange of a desert shrub community (Sarcocaulescent) inBaja California Mexico Glob Change Biol 11 927ndash939httpsdoiorg101111j1365-2486200500951x 2005

Henrich V Krauss G Goumltze C and Sandow C Entwicklungeiner Datenbank fuumlr Fernerkundungsindizes vol 45 AK Fern-erkundung Bochum 2012

Hill M Held A Leuning R Coops N Hughes D andCleugh H MODIS spectral signals at a flux tower site Rela-tionships with high-resolution data and CO2 flux and light useefficiency measurements Remote Sens Environ 103 351ndash368httpsdoiorg101016jrse200506015 2006

Hill M J and Guerschman J P The MODIS Global VegetationFractional Cover Product 2001ndash2018 Characteristics of Vegeta-tion Fractional Cover in Grasslands and Savanna Woodlands Re-mote Sens 12 406 httpsdoiorg103390rs12030406 2020

Hinojo-Hinojo C Castellanos A E Huxman T Ro-driguez J C Vargas R Romo-Leoacuten J R and Bie-derman J A Native shrubland and managed buffelgrasssavanna in drylands Implications for ecosystem carbonand water fluxes Agr Forest Meteorol 268 269ndash278httpsdoiorg101016jagrformet201901030 2019

Jaumlrvi L Havu M Ward H C Bellucco V McFad-den J P Toivonen T Heikinheimo V Kolari P Ri-ikonen A and Grimmond C S B Spatial modeling oflocal-scale biogenic and anthropogenic carbon dioxide emis-sions in Helsinki J Geophys Res-Atmos 124 8363ndash8384httpsdoiorg1010292018JD029576 2019

Joiner J and Yoshida Y Satellite-based reflectances capture largefraction of variability in global gross primary production (GPP)at weekly time scales Agr Forest Meteorol 291 108092httpsdoiorg101016jagrformet2020108092 2020

Jung M Schwalm C Migliavacca M Walther S Camps-VallsG Koirala S Anthoni P Besnard S Bodesheim P Carval-hais N Chevallier F Gans F Goll D S Haverd V KoumlhlerP Ichii K Jain A K Liu J Lombardozzi D Nabel J EM S Nelson J A OrsquoSullivan M Pallandt M Papale DPeters W Pongratz J Roumldenbeck C Sitch S TramontanaG Walker A Weber U and Reichstein M Scaling carbonfluxes from eddy covariance sites to globe synthesis and eval-uation of the FLUXCOM approach Biogeosciences 17 1343ndash1365 httpsdoiorg105194bg-17-1343-2020 2020

Kang M Ichii K Kim J Indrawati Y M Park J Moon MLim J-H and Chun J-H New Gap-Filling Strategies forLong-Period Flux Data Gaps Using a Data-Driven ApproachAtmosphere 10 568 httpsdoiorg103390atmos101005682019

Lal R Griffin M Apt J Lave L and Morgan M GManaging Soil Carbon Science 304 393ndash393httpsdoiorg101126science1093079 2004

LI-COR B Eddypro Software instruction manual Nebraska USversion 70 2019

Liu Y Liu R Pisek J and Chen J M Separating overstoryand understory leaf area indices for global needleleaf and de-ciduous broadleaf forests by fusion of MODIS and MISR dataBiogeosciences 14 1093ndash1110 httpsdoiorg105194bg-14-1093-2017 2017

Long S P Marshall-Colon A and Zhu X-G Meetingthe Global Food Demand of the Future by EngineeringCrop Photosynthesis and Yield Potential Cell 161 56ndash66httpsdoiorg101016jcell201503019 2015

Ma X Huete A Yu Q Restrepo-Coupe N Beringer J Hut-ley L B Kanniah K D Cleverly J and Eamus D Param-

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 391

eterization of an ecosystem light-use-efficiency model for pre-dicting savanna GPP using MODIS EVI Remote Sens Environ154 253ndash271 2014

Ma X Migliavacca M Wirth C Bohn F J Huth ARichter R and Mahecha M D Monitoring Plant FunctionalDiversity Using the Reflectance and Echo from Space RemoteSens 12 1248 httpsdoiorg103390rs12081248 2020

Marcolla B Roumldenbeck C and Cescatti A Patterns and con-trols of inter-annual variability in the terrestrial carbon budgetBiogeosciences 14 3815ndash3829 httpsdoiorg105194bg-14-3815-2017 2017

Mastachi-Loza C A Gonzaacutelez-Sosa E Becerril-Pintildea R andBraud I Peacuterdidas por intercepcioacuten en mezquite (Prosopis laevi-gata) y huizache (Acacia farnesiana) de la regioacuten semiaacuterida delcentro de Meacutexico Tecnol Cienc Agua 1 103ndash120 2010

Mauder M and Foken T Documentation and instruction man-ual of the eddy-covariance software package TK3 UniversitaumltBayreuth Abteilung Mikrometeorologie Bayreuth Germany2011

McClaran M A century of vegetation change on the Santa Rita Ex-perimental Range in Santa Rita Experimental Range 100 years(1903 to 2003) to accomplishments and contributions edited byMcClaran M P Ffolliott P F and Edminister C B Tech Co-ords USDA Forest service proceedings RMRS-P-30 TucsonAZ US 16ndash33 2003

McCulley R Jobbagy E Pockman W and Jackson R Nu-trient uptake as a contributing explanation for deep rootingin arid and semi-arid ecosystems Oecologia 141 620ndash628httpsdoiorg101007s00442-004-1687-z 2004

McElreath R Statistical Rethinking A Bayesian Course with Ex-amples in R and STAN CRC Press Boca Raton FL US 2020

Michailidis M How Driverless AI Prevents Overfitting and Leak-age Open Source Lead AI ML available at httpswwwh2oaiblogdriverless-ai-prevents-overfitting-leakage (last access7 July 2020) 2018

Mitchell R and Frank E Accelerating the XGBoost algo-rithm using GPU computing PeerJ Comput Sci 3 e127httpsdoiorg107717peerj-cs127 2017

Monteith J Solar radiation and productivity in tropical ecosys-tems J Appl Ecol 9 747ndash766 1972

Murray-Tortarolo G Friedlingstein P Sitch S Jaramillo V JMurguiacutea-Flores F Anav A Liu Y Arneth A Arvanitis AHarper A Jain A Kato E Koven C Poulter B Stocker BD Wiltshire A Zaehle S and Zeng N The carbon cycle inMexico past present and future of C stocks and fluxes Biogeo-sciences 13 223ndash238 httpsdoiorg105194bg-13-223-20162016

Nash J E and Sutcliffe J V River flow forecasting through con-ceptual models part I ndash A discussion of principles J Hydrol 10282ndash290 1970

Owen N A Choncubhair Oacute N Males J del Real Laborde J IRubio-Corteacutes R Griffiths H and Lanigan G Eddy covari-ance captures four-phase crassulacean acid metabolism (CAM)gas exchange signature in Agave Plant Cell Environ 39 295ndash309 httpsdoiorg101111pce12610 2016

Pasetto D Arenas-Castro S Bustamante J Casagrandi RChrysoulakis N Cord A F Dittrich A Domingo-Marimon C El Serafy G and Karnieli A Integration ofsatellite remote sensing data in ecosystem modelling at local

scales Practices and trends Methods Ecol Evol 9 1810ndash1821httpsdoiorg1011112041-210X13018 2018

Pastorello G Trotta C and Canfora E The FLUXNET2015dataset and the ONEFlux processing pipeline for eddy covari-ance data Sci Data 7 225 pp httpsdoiorg101038s41597-020-0534-3 2020

Pearcy R W Heterogeneous light environments in Handbookof Functional Plant Ecology edited by Pugnaire I F and Val-ladares F CRC Press Boca Ratoacuten FL US 270ndash314 2007

Piao S Liu Q Chen A Janssens I A Fu Y Dai J Liu LLian X Shen M and Zhu X Plant phenology and global cli-mate change Current progresses and challenges Glob ChangeBiol 25 1922ndash1940 httpsdoiorg101111gcb14619 2019

R Development Core Team R A language and environment forstatistical computing R Foundation for Statistical Computingavailable at httpswwwr-projectorg (last access 15 October2015) 2009

Reichstein M Falge E Baldocchi D Papale D Aubinet MBerbigier P Bernhofer C Buchmann N Gilmanov Tand Granier A On the separation of net ecosystem ex-change into assimilation and ecosystem respiration reviewand improved algorithm Glob Change Biol 11 1424ndash1439httpsdoiorg101111j1365-24862005001002x 2005

Richardson A D and Hollinger D Y Statistical model-ing of ecosystem respiration using eddy covariance datamaximum likelihood parameter estimation and Monte Carlosimulation of model and parameter uncertainty applied tothree simple models Agr Forest Meteorol 131 191ndash208httpsdoiorg101016jagrformet200505008 2005

Richardson A D Hollinger D Y Shoemaker J K Hughes HSavage K and Davidson E A Six years of ecosystem-atmosphere greenhouse gas fluxes measured in a sub-borealforest Sci Data 6 1ndash15 httpsdoiorg101038s41597-019-0119-1 2019

Roby M C Scott R L Barron-Gafford G A Hamerlynck E Pand Moore D J P Environmental and Vegetative Controls onSoil CO2 Efflux in Three Semiarid Ecosystems Soil Syst 3 6httpsdoiorg103390soilsystems3010006 2019

Running S W and Zhao M Daily GPP and annual NPP(MOD17A2A3) products NASA Earth Observing SystemMODIS land algorithm Univ of Mont Missoula Mont US2015

Running S W Nemani R R Heinsch F A Zhao MReeves M and Hashimoto H A continuous satellite-derived measure of global terrestrial primary productionBioscience 54 547ndash560 httpsdoiorg1016410006-3568(2004)054[0547ACSMOG]20CO2 2004

Rzedowski J Vegetacioacuten de Meacutexico Meacutexico Editor Limusa1978

Scott R L and Biederman J A Critical Zone Water Balance Over13 Years in a Semiarid Savanna Water Resour Res 55 574ndash588 httpsdoiorg1010292018WR023477 2019

Scott R L Watts C Payan J G Edwards E Goodrich D CWilliams D and James Shuttleworth W The understory andoverstory partitioning of energy and water fluxes in an opencanopy semiarid woodland Agr Forest Meteorol 114 127ndash139 httpsdoiorg101016S0168-1923(02)00197-1 2003

Scott R L Edwards E A Shuttleworth W J Huxman T EWatts C and Goodrich D C Interannual and seasonal vari-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

392 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

ation in fluxes of water and carbon dioxide from a ripar-ian woodland ecosystem Agr Forest Meteorol 122 65ndash84httpsdoiorg101016jagrformet200309001 2004

Scott R L Biederman J A Hamerlynck E P and Barron-Gafford G A The carbon balance pivot point of southwest-ern US semiarid ecosystems Insights from the 21st centurydrought J Geophys Res-Biogeo 120 2612ndash2624 2015

Segerstrom K Geology of the Bernal-Jalpan Area Estado deQueretaro Mexico Geol Surv Bull 1104 19ndash82 1961

Soper F M McCalley C K Sparks K and Sparks J P Soilcarbon dioxide emissions from the Mojave desert Isotopic evi-dence for a carbonate source Geophys Res Lett 44 245ndash251httpsdoiorg1010022016GL071198 2017

Stevens N T Steiner S H and MacKay R J Assessing agree-ment between two measurement systems An alternative to thelimits of agreement approach Stat Methods Med Res 262487ndash2504 httpsdoiorg1011770962280215601133 2015

Tagesson T Ardouml J Cappelaere B Kergoat L AbdiA Horion S and Fensholt R Modelling spatial andtemporal dynamics of gross primary production in theSahel from earth-observation-based photosynthetic capacityand quantum efficiency Biogeosciences 14 1333ndash1348httpsdoiorg105194bg-14-1333-2017 2017

Thornton P Thornton M Mayer B Wei Y Devarakonda RVose R and Cook R Daymet Daily Surface Weather Data ona 1-km Grid for North America Version 3 Oak Ridge NationalLab (ORNL) Oak Ridge TN US 2017

Tramontana G Jung M Schwalm C R Ichii K Camps-VallsG Raacuteduly B Reichstein M Arain M A Cescatti A KielyG Merbold L Serrano-Ortiz P Sickert S Wolf S andPapale D Predicting carbon dioxide and energy fluxes acrossglobal FLUXNET sites with regression algorithms Biogeo-sciences 13 4291ndash4313 httpsdoiorg105194bg-13-4291-2016 2016rdquo

Truong A Walters A Goodsitt J Hines K Bruss C B andFarivar R Towards automated machine learning Evaluationand comparison of AutoML approaches and tools in Proceed-ings of the 2019 IEEE 31st International Conference on Toolswith Artificial Intelligence (ICTAI) Portland OR US 1471ndash1479 2019

Verbist K Santibantildeez F Gabriels D and Soto G Atlas dezonas aacuteridas de Ameacuterica Latina y El Caribe Donoso MC(Coord) UNESCO-PHI and CAZALAC Montevideo Uruguay2010

Wang H Li X Ma M and Geng L Improving esti-mation of gross primary production in dryland ecosystemsby a model-data fusion approach Remote Sens 11 225httpsdoiorg103390rs11030225 2019

Wang X Ma M Li X Song Y Tan J Huang G Zhang ZZhao T Feng J and Ma Z Validation of MODIS-GPP prod-uct at 10 flux sites in northern China Int J Remote Sens 34587ndash599 httpsdoiorg101080014311612012715774 2013

Wang X Wang H Li X and Ran Y Photosynthesis (NPPNEP Respiration) Obs Meas Ecohydrol Process 1ndash30httpsdoiorg101007978-3-662-47871-4_10-1 2018

Wheeler C W Archer S R Asner G P and McMurtry C RClimaticedaphic controls on soil carbonnitrogen response toshrub encroachment in desert grassland Ecol Appl 17 1911ndash1928 httpsdoiorg10189006-15801 2007

Wilcox B P Birt A Fuhlendorf S D and Archer S R Emerg-ing frameworks for understanding and mitigating woody plantencroachment in grassy biomes Curr Opin Env Sust 32 46ndash52 httpsdoiorg101016jcosust201804005 2018

Wilson J B Cover plus ways of measuring plant canopiesand the terms used for them J Veg Sci 22 197ndash206httpsdoiorg101111j1654-1103201001238x 2011

Wofsy S Goulden M Munger J Fan S-M Bakwin PDaube B Bassow S and Bazzaz F Net exchange ofCO2 in a mid-latitude forest Science 260 1314ndash1317httpsdoiorg101126science26051121314 1993

Wu C Niu Z and Gao S Gross primary production estimationfrom MODIS data with vegetation index and photosyntheticallyactive radiation in maize J Geophys Res-Atmos 115 D12127httpsdoiorg1010292009JD013023 2010

Wutzler T Lucas-Moffat A Migliavacca M Knauer JSickel K Šigut L Menzer O and Reichstein MBasic and extensible post-processing of eddy covarianceflux data with REddyProc Biogeosciences 15 5015ndash5030httpsdoiorg105194bg-15-5015-2018 2018

Xiao H McDonald-Madden E Sabbadin R Peyrard NDee L E and Chadegraves I The value of understanding feed-backs from ecosystem functions to species for managing ecosys-tems Nat Commun 10 1ndash10 httpsdoiorg101038s41467-019-11890-7 2019

Xu B Yang Y Li P Shen H and Fang J Globalpatterns of ecosystem carbon flux in forests A biometricdata-based synthesis Global Biogeochem Cy 28 962ndash973httpsdoiorg1010022013GB004593 2014

Yao J Liu H Huang J Gao Z Wang G Li D Yu H andChen X Accelerated dryland expansion regulates future vari-ability in dryland gross primary production Nat Commun 111ndash10 httpsdoiorg101038s41467-020-15515-2 2020

Yepez E A Williams D G Scott R L and Lin GPartitioning overstory and understory evapotranspiration ina semiarid savanna woodland from the isotopic composi-tion of water vapor Agr Forest Meteorol 119 53ndash68httpsdoiorg101016S0168-1923(03)00116-3 2003

Yona L Cashore B Jackson R B Ometto J and Brad-ford M A Refining national greenhouse gas inventories Am-bio 1ndash6 httpsdoiorg101007s13280-019-01312-9 2020

Zeng X Zeng X and Barlage M Growing temperate shrubsover arid and semiarid regions in the Community Land ModelndashDynamic Global Vegetation Model Global Biogeochem Cy 22GB3003 httpsdoiorg1010292007GB003014 2008

Zhang A Jia G Epstein H E and Xia J ENSO elicits oppos-ing responses of semi-arid vegetation between Hemispheres SciRep 7 42281 httpsdoiorg101038srep42281 2017

Zhang L Xiao J Zheng Y Li S and Zhou Y Increased carbonuptake and water use efficiency in global semi-arid ecosystemsEnviron Res Lett 15 034022 httpsdoiorg1010881748-9326ab68ec 2020

Zhu X-G Long S P and Ort D R What is the max-imum efficiency with which photosynthesis can convert so-lar energy into biomass Curr Opin Biotech 19 153ndash159httpsdoiorg101016jcopbio200802004 2008

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

  • Abstract
  • Introduction
  • Materials and methods
    • Site description
    • Eddy covariance measurements
    • Flux data processing
    • Remotely sensed data
    • MODIS algorithm for GPP
    • Santa Rita site dataset
    • Modeling
    • Variable importance
    • Model agreement
      • Results
        • Eddy covariance fluxes at Bernal
        • Machine learning ensembles as predictors of eddy covariance GPP
        • MODIS as predictor of eddy covariance GPP
        • Prediction eddy covariance GPP with ordinary-least-squares multiple regression
        • Agreement
          • Discussion
            • Model agreement
            • MODIS discrepancies
            • Forcing variables and machine learning
            • Carbon flux
            • Ecosystem management
              • Conclusions
              • Appendix A
              • Appendix B
              • Appendix C
              • Code and data availability
              • Author contributions
              • Competing interests
              • Acknowledgement
              • Financial support
              • Review statement
              • References
Page 20: Machine learning estimates of eddy covariance carbon flux ... · Scrub is a xeric category of shrublands characterized by plants with small leaves; it is very thorny, and its biomass

386 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix B

Table B1 MODIS reflectance bands and products database

Product name Satellite layer Spatial Temporal Spectralresolution resolution coverage

(d) (nm)

MOD09GA Band 1 500 m 1 620ndash670Band 2 841ndash876Band 3 459ndash479Band 4 545ndash565Band 5 1230ndash1250Band 6 1628ndash1652Band 7 2105ndash2155

MODOCGA Band 8 1 km 405ndash420Band 9 438ndash448Band 10 483ndash493Band 11 526ndash536Band 12 546ndash556Band 13 662ndash672Band 14 673ndash683

MOD17A2H Gross primary production (GPP) 8

Net photosynthesis (PsnNet)

MOD15A2H Fraction of photosynthetically active radiation (Fpar) na

Leaf area index (LAI)

MOD11A2 Land surface temperature and emissivity of day (LST day) 1 km

Land surface temperature and emissivity of night (LST night)

MYD13A1 Enhanced vegetation index (EVI) 500 m 16

MYD15A2H Fraction of photosynthetically active radiation (Fpar) 8

Leaf area index (LAI)

MYD11A2 Land surface temperature and emissivity of day (LST day) 1 km na

Land surface temperature and emissivity of night (LST night)

MCD15A2H Fraction of photosynthetically active radiation (Fpar) 500 m

Leaf area index (LAI)

na not applicable

Table B2 Daily vegetation indexes computed using the MODIS reflectance bands described in Table 1

Index Formula Reference

Simple ratio (SimpleR) SimpleR= Band 2Band 1

(Hill et al 2006)Moisture stress (MoistS) MoistS= Band 6

Band 2

Disease stress index (DSI) DSI= Band 2+Band 4Band 6+Band 1

Redndashgreen ratio index (RGRI) RGRI= Band 1Band 4

Normalized difference vegetation index (NDVI) NDVI= Band 2minusBand 1Band 2+Band 1

Normalized difference water index (NDVI_w) NDVI_w =Band 2minusBand 5Band 2+Band 5

Green leaf index (GLI) GLI= 2lowastBand 11minusBand 14minusBand 92lowastBand 11+Band 14+Band

(Henrich et al 2012)Green atmospherically resistant vegetation (GARI) GARI= Band 5minus(Band 11minus(Band 9minusBand 14))

Band 5minus(Band 11+(Band 9minusBand 14))

Enhanced vegetation index (EVI) EVI= 25lowast Band 2minusBand 1Band 2+(6lowastBand 1)minus(75lowastBand 9)+1

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 387

Table B3 Daymet meteorological database

Variable Spatial resolution Temporal resolution Reference

Precipitation (Dayprc)

1 km Daily (Thornton et al 2017)Shortwave radiation (Daysrad)Maximum air temperature (DayTmax)Minimum air temperature (DayTmin)Water vapor pressure (Dayvp)

Table B4 Precipitation data

Satellite Product name Spatial resolution Temporal resolution

Global Precipitation Measurement (GPM) 3IMERGDF v006 010Daily

Tropical Rainfall Measuring Mission (TRMM) 3B42 v007 025

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

388 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix C

Table C1 Analysis of variance for GGP derived from EC data atthe Bernal site Variable details are described in Appendix B

Model term Coefficient SEa DFb SSc F p VIFd

Regression 14 178048Constant minus1318 648 minus203 004R_08_40542e

minus212 060 1 2680 minus352 0 220R_13_66267e

minus286 123 1 1161 minus232 002 155R_14_67368e 917 259 1 2706 353 0 122RGRI6

minus083 037 1 1100 minus225 003 143GARIf 201 108 1 755 187 006 237EVIMYDg 853 201 1 3903 425 0 294NDVI_wf 390 155 1 1367 251 001 271PsnNetg 0013 0 1 7530 590 0 383LstNgtMYDg 005 002 1 818 194 005 201Dayprch 014 003 1 3590 407 0 192Daysradh 001 0 1 5121 486 0 501DayTmaxh

minus024 006 1 3658 minus411 0 689DayTminh 018 005 1 3222 386 0 496TRMMi minus004 002 1 1087 minus224 003 141Error 403 87297Total 417 265345

a Standard error of coefficient b Degrees of freedom c Adjusted sum of squares d Variance inflation factore Variable name denotes the band number and spectral bandwidth of MODIS (Moderate Resolution ImagingSpectroradiometer) f Vegetation indices RGRI is redndashgreen ratio index and GARI is green atmosphericallyresistant vegetation index details of formula are described in Appendix B g Layers from MODIS productsEVIMYD is enhanced vegetation index from MYD13 PsnNet is photosynthesis product form MOD17LstNgtMYD is nighttime land surface temperature emissivity from MYD11 h Variables obtained fromDaymet daily dataset DayTmin is minimum temperature DayTmax is maximum temperature Daysrad isshortwave radiation Dayprc is precipitation i Daily rainfall rate from 3B42 TRMM (Tropical RainfallMeasuring Mission)

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 389

Code and data availability Database and programming code areavailable at httpsdoiorg105281zenodo3598595 (Guevara-Escobar et al 2020)

Author contributions AGE contributed to conceptualization for-mal analysis investigation methodology software supervisionvalidation visualization and writing of the original draft as wellas review and editing EGS was in charge of funding acquisitionproject administration and writing and review MCJ contributed toinvestigation methodology validation visualization and writing ofthe original draft as well as review editing and communicationMEQB helped with investigation validation writing and reviewHSA contributed with conceptualization writing and review ICAacutehelped with investigation writing and review VHCS contributedwith writing review and funding acquisition

Competing interests The authors declare that they have no conflictof interest

Acknowledgement We thank NASA and Ameriflux for data avail-ability Data filtering was carried out by the cluster Abacus ICentre of Applied Mathematics and High Performance ComputingABACUS-CINVESTAV

Financial support This research has been supported by CONA-CyT ndash SEMARNAT (grant no 2014-1-(249407))

Review statement This paper was edited by Paul Stoy and re-viewed by three anonymous referees

References

Aguirre-Diacuteaz G J Aguilloacuten-Robles A Tristaacuten-Gonzaacutelez MLabarthe-Hernaacutendez G Loacutepez-Martiacutenez M Bellon Hand Nieto-Obregoacuten J Geologic setting of the Pentildeade Bernal Natural Monument Quereacutetaro Meacutexico Anendogenous volcanic dome Geosphere 9 557ndash571httpsdoiorg101130GES008431 2013

Anav A Friedlingstein P Beer C Ciais P Harper AJones C Murray-Tortarolo G Papale D Parazoo N Cand Peylin P Spatiotemporal patterns of terrestrial grossprimary production A review Rev Geophys 53 785ndash818httpsdoiorg1010022015RG000483 2015

Anderson-Teixeira K J Delong J P Fox A M Brese D Aand Litvak M E Differential responses of production and res-piration to temperature and moisture drive the carbon balanceacross a climatic gradient in New Mexico Glob Change Biol17 410ndash424 httpsdoiorg101111j1365-2486201002269x2011

Baldocchi D Measuring fluxes of trace gases and energy be-tween ecosystems and the atmospherendashthe state and future of the

eddy covariance method Glob Change Biol 20 3600ndash3609httpsdoiorg101111gcb12649 2014

Barron-Gafford G A Sanchez-Cantildeete E P Minor R LHendryx S M Lee E Sutter L F Tran N Parra EColella T Murphy P C Hamerlynck E P Kumar P andScott R L Impacts of hydraulic redistribution on grassndashtreecompetition vs facilitation in a semi-arid savanna New Phytol215 1451ndash1461 httpsdoiorg101111nph14693 2017

Belayneh A and Tessema Z K Mechanisms of bush encroach-ment and its inter-connection with rangeland degradation insemi-arid African ecosystems a review J Arid Land 9 299ndash312 httpsdoiorg101007s40333-016-0023-x 2017

Bland J M and Altman D G Statistical methods for assessingagreement between two methods of clinical measurement Int JNurs Stud 47 931ndash936 2010

Boettiger C and Eddelbuettel D An introduction to rockerDocker containers 639 for r ArXiv Prepr ArXiv171003675 6402017

Bonilla-Moheno M and Aide T M Beyond deforestationLand cover transitions in Mexico Agric Syst 178 102734httpsdoiorg101016jagsy2019102734 2020

Booker K Huntsinger L Bartolome J W Sayre N Fand Stewart W What can ecological science tell us aboutopportunities for carbon sequestration on arid rangelands inthe United States Glob Environ Change 23 240ndash251httpsdoiorg101016jgloenvcha201210001 2013

Butzer K W and Butzer E K The lsquonaturalrsquovegetation ofthe Mexican Bajiacuteo archival documentation of a 16th-century savanna environment Quat Int 43 161ndash172httpsdoiorg101016S1040-6182(97)00032-3 1997

CICESE C de I C y de E S de E Base de datos climatoloacutegicanacional (CLICOM) Villa Bernal Quereacutetaro available at httpclicom-mexcicesemx (last access 28 August 2019) 2015

Cohen W B Maiersperger T K Yang Z Gower S TTurner D P Ritts W D Berterretche M and Run-ning S W Comparisons of land cover and LAI esti-mates derived from ETM+ and MODIS for four sites inNorth America A quality assessment of 20002001 provi-sional MODIS products Remote Sens Environ 88 233ndash255httpsdoiorg101016jrse200306006 2003

Cui T Wang Y Sun R Qiao C Fan W Jiang GHao L and Zhang L Estimating vegetation primaryproduction in the Heihe River Basin of China withmulti-source and multi-scale data PloS One 11 1ndash20httpsdoiorg101371journalpone0153971 2016

Curtis J T and McIntosh R P The Interrelations of Certain An-alytic and Synthetic Phytosociological Characters Ecology 31434ndash455 httpsdoiorg1023071931497 1950

Delgado-Balbuena J Yepez E A Aacutengeles-Peacuterez G Aguirre-Gutierrez C Arredondo T Ayala-Nintildeo F Bullock S HCastellanos A E Cueva A Figueroa-Espinoza B Garatuza-Payaacuten J Hinojo-Hinojo C Maya-Delgado Y Meacutendez-Barroso L Oechel W Paz-Pellat F Peacuterez-Ruiacutez E R Ro-driacuteguez J C Rojas-Robles N Saacutenchez-Mejiacutea Z M Uuh-Sonda J Vargas R Verduzco V S Vivoni E R andWatts C Flujos anuales de carbono en ecosistemas terrestres deMeacutexico in Estado Actual del Conocimiento del Ciclo del Car-bono y sus Interacciones en Meacutexico Siacutentesis a 2018 p 686Texcoco Estado de Meacutexico Meacutexico 2018

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

390 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Eshel G Dayalu A Wofsy S C Munger J W andTziperman E Listening to the Forest An ArtificialNeural Network-Based Model of Carbon Uptake at Har-vard Forest J Geophys Res-Biogeo 124 461ndash478httpsdoiorg1010292018JG004791 2019

Fu C Wang G Bible K Goulden M L Saleska S RScott R L and Cardon Z G Hydraulic redistributionaffects modeled carbon cycling via soil microbial activityand suppressed fire Glob Change Biol 24 3472ndash3485httpsdoiorg101111gcb14164 2018

Gebremichael M and Barros A P Evaluation of MODISGross Primary Productivity (GPP) in tropical mon-soon regions Remote Sens Environ 100 150ndash166httpsdoiorg101016jrse200510009 2006

Gitelson A A Peng Y Arkebauer T J and Suyker A EProductivity absorbed photosynthetically active radiation andlight use efficiency in crops Implications for remote sensingof crop primary production J Plant Physiol 177 100ndash109httpsdoiorg101016jjplph201412015 2015

Glenn E P Scott R L Nguyen U and Nagler P L Wide-area ratios of evapotranspiration to precipitation in monsoon-dependent semiarid vegetation communities J Arid Envi-ron 117 84ndash95 httpsdoiorg101016jjaridenv2015020102015

Goldstein A Turner W R Spawn S A Anderson-Teixeira K J Cook-Patton S Fargione J Gibbs H KGriscom B Hewson J H and Howard J F Protecting ir-recoverable carbon in Earthrsquos ecosystems Nat Clim Change10 1ndash9 httpsdoiorg101038s41558-020-0738-8 2020

Goulden M L Munger J W Fan S Daube B Cand Wofsy S C Measurements of carbon sequestra-tion by long-term eddy covariance Methods and a criti-cal evaluation of accuracy Glob Change Biol 2 169ndash182httpsdoiorg101111j1365-24861996tb00070x 1996

Grossiord C Buckley T N Cernusak L A Novick K A Poul-ter B Siegwolf R T Sperry J S and McDowell N GPlant responses to rising vapor pressure deficit New Phytol226 1550ndash1566 2020

Guerschman J P Hill M J Renzullo L J Barrett D JMarks A S and Botha E J Estimating fractional coverof photosynthetic vegetation non-photosynthetic vegetation andbare soil in the Australian tropical savanna region upscaling theEO-1 Hyperion and MODIS sensors Remote Sens Environ113 928ndash945 httpsdoiorg101016jrse200901006 2009

Guevara-Escobar A Gonzaacutelez-Sosa E Cervantes-Jimeacutenez MSuzaacuten-Azpiri H Queijeiro-Bolantildeos M E Carrillo-AacutengelesI and Cambroacuten-Sandoval V H Machine learning estimates ofeddy covariance carbon flux in a scrub in the Mexican highlandZenodo httpsdoiorg105281zenodo3598595 2020

Hall P Gill N Kurka M Phan W and Bartz A MachineLearning Interpretability with H2O Driverless AI edited byBartz A H2Oai Inc California US 2019

Hastings S J Oechel W C and Muhlia-Melo A Diur-nal seasonal and annual variation in the net ecosystem CO2exchange of a desert shrub community (Sarcocaulescent) inBaja California Mexico Glob Change Biol 11 927ndash939httpsdoiorg101111j1365-2486200500951x 2005

Henrich V Krauss G Goumltze C and Sandow C Entwicklungeiner Datenbank fuumlr Fernerkundungsindizes vol 45 AK Fern-erkundung Bochum 2012

Hill M Held A Leuning R Coops N Hughes D andCleugh H MODIS spectral signals at a flux tower site Rela-tionships with high-resolution data and CO2 flux and light useefficiency measurements Remote Sens Environ 103 351ndash368httpsdoiorg101016jrse200506015 2006

Hill M J and Guerschman J P The MODIS Global VegetationFractional Cover Product 2001ndash2018 Characteristics of Vegeta-tion Fractional Cover in Grasslands and Savanna Woodlands Re-mote Sens 12 406 httpsdoiorg103390rs12030406 2020

Hinojo-Hinojo C Castellanos A E Huxman T Ro-driguez J C Vargas R Romo-Leoacuten J R and Bie-derman J A Native shrubland and managed buffelgrasssavanna in drylands Implications for ecosystem carbonand water fluxes Agr Forest Meteorol 268 269ndash278httpsdoiorg101016jagrformet201901030 2019

Jaumlrvi L Havu M Ward H C Bellucco V McFad-den J P Toivonen T Heikinheimo V Kolari P Ri-ikonen A and Grimmond C S B Spatial modeling oflocal-scale biogenic and anthropogenic carbon dioxide emis-sions in Helsinki J Geophys Res-Atmos 124 8363ndash8384httpsdoiorg1010292018JD029576 2019

Joiner J and Yoshida Y Satellite-based reflectances capture largefraction of variability in global gross primary production (GPP)at weekly time scales Agr Forest Meteorol 291 108092httpsdoiorg101016jagrformet2020108092 2020

Jung M Schwalm C Migliavacca M Walther S Camps-VallsG Koirala S Anthoni P Besnard S Bodesheim P Carval-hais N Chevallier F Gans F Goll D S Haverd V KoumlhlerP Ichii K Jain A K Liu J Lombardozzi D Nabel J EM S Nelson J A OrsquoSullivan M Pallandt M Papale DPeters W Pongratz J Roumldenbeck C Sitch S TramontanaG Walker A Weber U and Reichstein M Scaling carbonfluxes from eddy covariance sites to globe synthesis and eval-uation of the FLUXCOM approach Biogeosciences 17 1343ndash1365 httpsdoiorg105194bg-17-1343-2020 2020

Kang M Ichii K Kim J Indrawati Y M Park J Moon MLim J-H and Chun J-H New Gap-Filling Strategies forLong-Period Flux Data Gaps Using a Data-Driven ApproachAtmosphere 10 568 httpsdoiorg103390atmos101005682019

Lal R Griffin M Apt J Lave L and Morgan M GManaging Soil Carbon Science 304 393ndash393httpsdoiorg101126science1093079 2004

LI-COR B Eddypro Software instruction manual Nebraska USversion 70 2019

Liu Y Liu R Pisek J and Chen J M Separating overstoryand understory leaf area indices for global needleleaf and de-ciduous broadleaf forests by fusion of MODIS and MISR dataBiogeosciences 14 1093ndash1110 httpsdoiorg105194bg-14-1093-2017 2017

Long S P Marshall-Colon A and Zhu X-G Meetingthe Global Food Demand of the Future by EngineeringCrop Photosynthesis and Yield Potential Cell 161 56ndash66httpsdoiorg101016jcell201503019 2015

Ma X Huete A Yu Q Restrepo-Coupe N Beringer J Hut-ley L B Kanniah K D Cleverly J and Eamus D Param-

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 391

eterization of an ecosystem light-use-efficiency model for pre-dicting savanna GPP using MODIS EVI Remote Sens Environ154 253ndash271 2014

Ma X Migliavacca M Wirth C Bohn F J Huth ARichter R and Mahecha M D Monitoring Plant FunctionalDiversity Using the Reflectance and Echo from Space RemoteSens 12 1248 httpsdoiorg103390rs12081248 2020

Marcolla B Roumldenbeck C and Cescatti A Patterns and con-trols of inter-annual variability in the terrestrial carbon budgetBiogeosciences 14 3815ndash3829 httpsdoiorg105194bg-14-3815-2017 2017

Mastachi-Loza C A Gonzaacutelez-Sosa E Becerril-Pintildea R andBraud I Peacuterdidas por intercepcioacuten en mezquite (Prosopis laevi-gata) y huizache (Acacia farnesiana) de la regioacuten semiaacuterida delcentro de Meacutexico Tecnol Cienc Agua 1 103ndash120 2010

Mauder M and Foken T Documentation and instruction man-ual of the eddy-covariance software package TK3 UniversitaumltBayreuth Abteilung Mikrometeorologie Bayreuth Germany2011

McClaran M A century of vegetation change on the Santa Rita Ex-perimental Range in Santa Rita Experimental Range 100 years(1903 to 2003) to accomplishments and contributions edited byMcClaran M P Ffolliott P F and Edminister C B Tech Co-ords USDA Forest service proceedings RMRS-P-30 TucsonAZ US 16ndash33 2003

McCulley R Jobbagy E Pockman W and Jackson R Nu-trient uptake as a contributing explanation for deep rootingin arid and semi-arid ecosystems Oecologia 141 620ndash628httpsdoiorg101007s00442-004-1687-z 2004

McElreath R Statistical Rethinking A Bayesian Course with Ex-amples in R and STAN CRC Press Boca Raton FL US 2020

Michailidis M How Driverless AI Prevents Overfitting and Leak-age Open Source Lead AI ML available at httpswwwh2oaiblogdriverless-ai-prevents-overfitting-leakage (last access7 July 2020) 2018

Mitchell R and Frank E Accelerating the XGBoost algo-rithm using GPU computing PeerJ Comput Sci 3 e127httpsdoiorg107717peerj-cs127 2017

Monteith J Solar radiation and productivity in tropical ecosys-tems J Appl Ecol 9 747ndash766 1972

Murray-Tortarolo G Friedlingstein P Sitch S Jaramillo V JMurguiacutea-Flores F Anav A Liu Y Arneth A Arvanitis AHarper A Jain A Kato E Koven C Poulter B Stocker BD Wiltshire A Zaehle S and Zeng N The carbon cycle inMexico past present and future of C stocks and fluxes Biogeo-sciences 13 223ndash238 httpsdoiorg105194bg-13-223-20162016

Nash J E and Sutcliffe J V River flow forecasting through con-ceptual models part I ndash A discussion of principles J Hydrol 10282ndash290 1970

Owen N A Choncubhair Oacute N Males J del Real Laborde J IRubio-Corteacutes R Griffiths H and Lanigan G Eddy covari-ance captures four-phase crassulacean acid metabolism (CAM)gas exchange signature in Agave Plant Cell Environ 39 295ndash309 httpsdoiorg101111pce12610 2016

Pasetto D Arenas-Castro S Bustamante J Casagrandi RChrysoulakis N Cord A F Dittrich A Domingo-Marimon C El Serafy G and Karnieli A Integration ofsatellite remote sensing data in ecosystem modelling at local

scales Practices and trends Methods Ecol Evol 9 1810ndash1821httpsdoiorg1011112041-210X13018 2018

Pastorello G Trotta C and Canfora E The FLUXNET2015dataset and the ONEFlux processing pipeline for eddy covari-ance data Sci Data 7 225 pp httpsdoiorg101038s41597-020-0534-3 2020

Pearcy R W Heterogeneous light environments in Handbookof Functional Plant Ecology edited by Pugnaire I F and Val-ladares F CRC Press Boca Ratoacuten FL US 270ndash314 2007

Piao S Liu Q Chen A Janssens I A Fu Y Dai J Liu LLian X Shen M and Zhu X Plant phenology and global cli-mate change Current progresses and challenges Glob ChangeBiol 25 1922ndash1940 httpsdoiorg101111gcb14619 2019

R Development Core Team R A language and environment forstatistical computing R Foundation for Statistical Computingavailable at httpswwwr-projectorg (last access 15 October2015) 2009

Reichstein M Falge E Baldocchi D Papale D Aubinet MBerbigier P Bernhofer C Buchmann N Gilmanov Tand Granier A On the separation of net ecosystem ex-change into assimilation and ecosystem respiration reviewand improved algorithm Glob Change Biol 11 1424ndash1439httpsdoiorg101111j1365-24862005001002x 2005

Richardson A D and Hollinger D Y Statistical model-ing of ecosystem respiration using eddy covariance datamaximum likelihood parameter estimation and Monte Carlosimulation of model and parameter uncertainty applied tothree simple models Agr Forest Meteorol 131 191ndash208httpsdoiorg101016jagrformet200505008 2005

Richardson A D Hollinger D Y Shoemaker J K Hughes HSavage K and Davidson E A Six years of ecosystem-atmosphere greenhouse gas fluxes measured in a sub-borealforest Sci Data 6 1ndash15 httpsdoiorg101038s41597-019-0119-1 2019

Roby M C Scott R L Barron-Gafford G A Hamerlynck E Pand Moore D J P Environmental and Vegetative Controls onSoil CO2 Efflux in Three Semiarid Ecosystems Soil Syst 3 6httpsdoiorg103390soilsystems3010006 2019

Running S W and Zhao M Daily GPP and annual NPP(MOD17A2A3) products NASA Earth Observing SystemMODIS land algorithm Univ of Mont Missoula Mont US2015

Running S W Nemani R R Heinsch F A Zhao MReeves M and Hashimoto H A continuous satellite-derived measure of global terrestrial primary productionBioscience 54 547ndash560 httpsdoiorg1016410006-3568(2004)054[0547ACSMOG]20CO2 2004

Rzedowski J Vegetacioacuten de Meacutexico Meacutexico Editor Limusa1978

Scott R L and Biederman J A Critical Zone Water Balance Over13 Years in a Semiarid Savanna Water Resour Res 55 574ndash588 httpsdoiorg1010292018WR023477 2019

Scott R L Watts C Payan J G Edwards E Goodrich D CWilliams D and James Shuttleworth W The understory andoverstory partitioning of energy and water fluxes in an opencanopy semiarid woodland Agr Forest Meteorol 114 127ndash139 httpsdoiorg101016S0168-1923(02)00197-1 2003

Scott R L Edwards E A Shuttleworth W J Huxman T EWatts C and Goodrich D C Interannual and seasonal vari-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

392 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

ation in fluxes of water and carbon dioxide from a ripar-ian woodland ecosystem Agr Forest Meteorol 122 65ndash84httpsdoiorg101016jagrformet200309001 2004

Scott R L Biederman J A Hamerlynck E P and Barron-Gafford G A The carbon balance pivot point of southwest-ern US semiarid ecosystems Insights from the 21st centurydrought J Geophys Res-Biogeo 120 2612ndash2624 2015

Segerstrom K Geology of the Bernal-Jalpan Area Estado deQueretaro Mexico Geol Surv Bull 1104 19ndash82 1961

Soper F M McCalley C K Sparks K and Sparks J P Soilcarbon dioxide emissions from the Mojave desert Isotopic evi-dence for a carbonate source Geophys Res Lett 44 245ndash251httpsdoiorg1010022016GL071198 2017

Stevens N T Steiner S H and MacKay R J Assessing agree-ment between two measurement systems An alternative to thelimits of agreement approach Stat Methods Med Res 262487ndash2504 httpsdoiorg1011770962280215601133 2015

Tagesson T Ardouml J Cappelaere B Kergoat L AbdiA Horion S and Fensholt R Modelling spatial andtemporal dynamics of gross primary production in theSahel from earth-observation-based photosynthetic capacityand quantum efficiency Biogeosciences 14 1333ndash1348httpsdoiorg105194bg-14-1333-2017 2017

Thornton P Thornton M Mayer B Wei Y Devarakonda RVose R and Cook R Daymet Daily Surface Weather Data ona 1-km Grid for North America Version 3 Oak Ridge NationalLab (ORNL) Oak Ridge TN US 2017

Tramontana G Jung M Schwalm C R Ichii K Camps-VallsG Raacuteduly B Reichstein M Arain M A Cescatti A KielyG Merbold L Serrano-Ortiz P Sickert S Wolf S andPapale D Predicting carbon dioxide and energy fluxes acrossglobal FLUXNET sites with regression algorithms Biogeo-sciences 13 4291ndash4313 httpsdoiorg105194bg-13-4291-2016 2016rdquo

Truong A Walters A Goodsitt J Hines K Bruss C B andFarivar R Towards automated machine learning Evaluationand comparison of AutoML approaches and tools in Proceed-ings of the 2019 IEEE 31st International Conference on Toolswith Artificial Intelligence (ICTAI) Portland OR US 1471ndash1479 2019

Verbist K Santibantildeez F Gabriels D and Soto G Atlas dezonas aacuteridas de Ameacuterica Latina y El Caribe Donoso MC(Coord) UNESCO-PHI and CAZALAC Montevideo Uruguay2010

Wang H Li X Ma M and Geng L Improving esti-mation of gross primary production in dryland ecosystemsby a model-data fusion approach Remote Sens 11 225httpsdoiorg103390rs11030225 2019

Wang X Ma M Li X Song Y Tan J Huang G Zhang ZZhao T Feng J and Ma Z Validation of MODIS-GPP prod-uct at 10 flux sites in northern China Int J Remote Sens 34587ndash599 httpsdoiorg101080014311612012715774 2013

Wang X Wang H Li X and Ran Y Photosynthesis (NPPNEP Respiration) Obs Meas Ecohydrol Process 1ndash30httpsdoiorg101007978-3-662-47871-4_10-1 2018

Wheeler C W Archer S R Asner G P and McMurtry C RClimaticedaphic controls on soil carbonnitrogen response toshrub encroachment in desert grassland Ecol Appl 17 1911ndash1928 httpsdoiorg10189006-15801 2007

Wilcox B P Birt A Fuhlendorf S D and Archer S R Emerg-ing frameworks for understanding and mitigating woody plantencroachment in grassy biomes Curr Opin Env Sust 32 46ndash52 httpsdoiorg101016jcosust201804005 2018

Wilson J B Cover plus ways of measuring plant canopiesand the terms used for them J Veg Sci 22 197ndash206httpsdoiorg101111j1654-1103201001238x 2011

Wofsy S Goulden M Munger J Fan S-M Bakwin PDaube B Bassow S and Bazzaz F Net exchange ofCO2 in a mid-latitude forest Science 260 1314ndash1317httpsdoiorg101126science26051121314 1993

Wu C Niu Z and Gao S Gross primary production estimationfrom MODIS data with vegetation index and photosyntheticallyactive radiation in maize J Geophys Res-Atmos 115 D12127httpsdoiorg1010292009JD013023 2010

Wutzler T Lucas-Moffat A Migliavacca M Knauer JSickel K Šigut L Menzer O and Reichstein MBasic and extensible post-processing of eddy covarianceflux data with REddyProc Biogeosciences 15 5015ndash5030httpsdoiorg105194bg-15-5015-2018 2018

Xiao H McDonald-Madden E Sabbadin R Peyrard NDee L E and Chadegraves I The value of understanding feed-backs from ecosystem functions to species for managing ecosys-tems Nat Commun 10 1ndash10 httpsdoiorg101038s41467-019-11890-7 2019

Xu B Yang Y Li P Shen H and Fang J Globalpatterns of ecosystem carbon flux in forests A biometricdata-based synthesis Global Biogeochem Cy 28 962ndash973httpsdoiorg1010022013GB004593 2014

Yao J Liu H Huang J Gao Z Wang G Li D Yu H andChen X Accelerated dryland expansion regulates future vari-ability in dryland gross primary production Nat Commun 111ndash10 httpsdoiorg101038s41467-020-15515-2 2020

Yepez E A Williams D G Scott R L and Lin GPartitioning overstory and understory evapotranspiration ina semiarid savanna woodland from the isotopic composi-tion of water vapor Agr Forest Meteorol 119 53ndash68httpsdoiorg101016S0168-1923(03)00116-3 2003

Yona L Cashore B Jackson R B Ometto J and Brad-ford M A Refining national greenhouse gas inventories Am-bio 1ndash6 httpsdoiorg101007s13280-019-01312-9 2020

Zeng X Zeng X and Barlage M Growing temperate shrubsover arid and semiarid regions in the Community Land ModelndashDynamic Global Vegetation Model Global Biogeochem Cy 22GB3003 httpsdoiorg1010292007GB003014 2008

Zhang A Jia G Epstein H E and Xia J ENSO elicits oppos-ing responses of semi-arid vegetation between Hemispheres SciRep 7 42281 httpsdoiorg101038srep42281 2017

Zhang L Xiao J Zheng Y Li S and Zhou Y Increased carbonuptake and water use efficiency in global semi-arid ecosystemsEnviron Res Lett 15 034022 httpsdoiorg1010881748-9326ab68ec 2020

Zhu X-G Long S P and Ort D R What is the max-imum efficiency with which photosynthesis can convert so-lar energy into biomass Curr Opin Biotech 19 153ndash159httpsdoiorg101016jcopbio200802004 2008

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

  • Abstract
  • Introduction
  • Materials and methods
    • Site description
    • Eddy covariance measurements
    • Flux data processing
    • Remotely sensed data
    • MODIS algorithm for GPP
    • Santa Rita site dataset
    • Modeling
    • Variable importance
    • Model agreement
      • Results
        • Eddy covariance fluxes at Bernal
        • Machine learning ensembles as predictors of eddy covariance GPP
        • MODIS as predictor of eddy covariance GPP
        • Prediction eddy covariance GPP with ordinary-least-squares multiple regression
        • Agreement
          • Discussion
            • Model agreement
            • MODIS discrepancies
            • Forcing variables and machine learning
            • Carbon flux
            • Ecosystem management
              • Conclusions
              • Appendix A
              • Appendix B
              • Appendix C
              • Code and data availability
              • Author contributions
              • Competing interests
              • Acknowledgement
              • Financial support
              • Review statement
              • References
Page 21: Machine learning estimates of eddy covariance carbon flux ... · Scrub is a xeric category of shrublands characterized by plants with small leaves; it is very thorny, and its biomass

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 387

Table B3 Daymet meteorological database

Variable Spatial resolution Temporal resolution Reference

Precipitation (Dayprc)

1 km Daily (Thornton et al 2017)Shortwave radiation (Daysrad)Maximum air temperature (DayTmax)Minimum air temperature (DayTmin)Water vapor pressure (Dayvp)

Table B4 Precipitation data

Satellite Product name Spatial resolution Temporal resolution

Global Precipitation Measurement (GPM) 3IMERGDF v006 010Daily

Tropical Rainfall Measuring Mission (TRMM) 3B42 v007 025

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

388 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix C

Table C1 Analysis of variance for GGP derived from EC data atthe Bernal site Variable details are described in Appendix B

Model term Coefficient SEa DFb SSc F p VIFd

Regression 14 178048Constant minus1318 648 minus203 004R_08_40542e

minus212 060 1 2680 minus352 0 220R_13_66267e

minus286 123 1 1161 minus232 002 155R_14_67368e 917 259 1 2706 353 0 122RGRI6

minus083 037 1 1100 minus225 003 143GARIf 201 108 1 755 187 006 237EVIMYDg 853 201 1 3903 425 0 294NDVI_wf 390 155 1 1367 251 001 271PsnNetg 0013 0 1 7530 590 0 383LstNgtMYDg 005 002 1 818 194 005 201Dayprch 014 003 1 3590 407 0 192Daysradh 001 0 1 5121 486 0 501DayTmaxh

minus024 006 1 3658 minus411 0 689DayTminh 018 005 1 3222 386 0 496TRMMi minus004 002 1 1087 minus224 003 141Error 403 87297Total 417 265345

a Standard error of coefficient b Degrees of freedom c Adjusted sum of squares d Variance inflation factore Variable name denotes the band number and spectral bandwidth of MODIS (Moderate Resolution ImagingSpectroradiometer) f Vegetation indices RGRI is redndashgreen ratio index and GARI is green atmosphericallyresistant vegetation index details of formula are described in Appendix B g Layers from MODIS productsEVIMYD is enhanced vegetation index from MYD13 PsnNet is photosynthesis product form MOD17LstNgtMYD is nighttime land surface temperature emissivity from MYD11 h Variables obtained fromDaymet daily dataset DayTmin is minimum temperature DayTmax is maximum temperature Daysrad isshortwave radiation Dayprc is precipitation i Daily rainfall rate from 3B42 TRMM (Tropical RainfallMeasuring Mission)

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 389

Code and data availability Database and programming code areavailable at httpsdoiorg105281zenodo3598595 (Guevara-Escobar et al 2020)

Author contributions AGE contributed to conceptualization for-mal analysis investigation methodology software supervisionvalidation visualization and writing of the original draft as wellas review and editing EGS was in charge of funding acquisitionproject administration and writing and review MCJ contributed toinvestigation methodology validation visualization and writing ofthe original draft as well as review editing and communicationMEQB helped with investigation validation writing and reviewHSA contributed with conceptualization writing and review ICAacutehelped with investigation writing and review VHCS contributedwith writing review and funding acquisition

Competing interests The authors declare that they have no conflictof interest

Acknowledgement We thank NASA and Ameriflux for data avail-ability Data filtering was carried out by the cluster Abacus ICentre of Applied Mathematics and High Performance ComputingABACUS-CINVESTAV

Financial support This research has been supported by CONA-CyT ndash SEMARNAT (grant no 2014-1-(249407))

Review statement This paper was edited by Paul Stoy and re-viewed by three anonymous referees

References

Aguirre-Diacuteaz G J Aguilloacuten-Robles A Tristaacuten-Gonzaacutelez MLabarthe-Hernaacutendez G Loacutepez-Martiacutenez M Bellon Hand Nieto-Obregoacuten J Geologic setting of the Pentildeade Bernal Natural Monument Quereacutetaro Meacutexico Anendogenous volcanic dome Geosphere 9 557ndash571httpsdoiorg101130GES008431 2013

Anav A Friedlingstein P Beer C Ciais P Harper AJones C Murray-Tortarolo G Papale D Parazoo N Cand Peylin P Spatiotemporal patterns of terrestrial grossprimary production A review Rev Geophys 53 785ndash818httpsdoiorg1010022015RG000483 2015

Anderson-Teixeira K J Delong J P Fox A M Brese D Aand Litvak M E Differential responses of production and res-piration to temperature and moisture drive the carbon balanceacross a climatic gradient in New Mexico Glob Change Biol17 410ndash424 httpsdoiorg101111j1365-2486201002269x2011

Baldocchi D Measuring fluxes of trace gases and energy be-tween ecosystems and the atmospherendashthe state and future of the

eddy covariance method Glob Change Biol 20 3600ndash3609httpsdoiorg101111gcb12649 2014

Barron-Gafford G A Sanchez-Cantildeete E P Minor R LHendryx S M Lee E Sutter L F Tran N Parra EColella T Murphy P C Hamerlynck E P Kumar P andScott R L Impacts of hydraulic redistribution on grassndashtreecompetition vs facilitation in a semi-arid savanna New Phytol215 1451ndash1461 httpsdoiorg101111nph14693 2017

Belayneh A and Tessema Z K Mechanisms of bush encroach-ment and its inter-connection with rangeland degradation insemi-arid African ecosystems a review J Arid Land 9 299ndash312 httpsdoiorg101007s40333-016-0023-x 2017

Bland J M and Altman D G Statistical methods for assessingagreement between two methods of clinical measurement Int JNurs Stud 47 931ndash936 2010

Boettiger C and Eddelbuettel D An introduction to rockerDocker containers 639 for r ArXiv Prepr ArXiv171003675 6402017

Bonilla-Moheno M and Aide T M Beyond deforestationLand cover transitions in Mexico Agric Syst 178 102734httpsdoiorg101016jagsy2019102734 2020

Booker K Huntsinger L Bartolome J W Sayre N Fand Stewart W What can ecological science tell us aboutopportunities for carbon sequestration on arid rangelands inthe United States Glob Environ Change 23 240ndash251httpsdoiorg101016jgloenvcha201210001 2013

Butzer K W and Butzer E K The lsquonaturalrsquovegetation ofthe Mexican Bajiacuteo archival documentation of a 16th-century savanna environment Quat Int 43 161ndash172httpsdoiorg101016S1040-6182(97)00032-3 1997

CICESE C de I C y de E S de E Base de datos climatoloacutegicanacional (CLICOM) Villa Bernal Quereacutetaro available at httpclicom-mexcicesemx (last access 28 August 2019) 2015

Cohen W B Maiersperger T K Yang Z Gower S TTurner D P Ritts W D Berterretche M and Run-ning S W Comparisons of land cover and LAI esti-mates derived from ETM+ and MODIS for four sites inNorth America A quality assessment of 20002001 provi-sional MODIS products Remote Sens Environ 88 233ndash255httpsdoiorg101016jrse200306006 2003

Cui T Wang Y Sun R Qiao C Fan W Jiang GHao L and Zhang L Estimating vegetation primaryproduction in the Heihe River Basin of China withmulti-source and multi-scale data PloS One 11 1ndash20httpsdoiorg101371journalpone0153971 2016

Curtis J T and McIntosh R P The Interrelations of Certain An-alytic and Synthetic Phytosociological Characters Ecology 31434ndash455 httpsdoiorg1023071931497 1950

Delgado-Balbuena J Yepez E A Aacutengeles-Peacuterez G Aguirre-Gutierrez C Arredondo T Ayala-Nintildeo F Bullock S HCastellanos A E Cueva A Figueroa-Espinoza B Garatuza-Payaacuten J Hinojo-Hinojo C Maya-Delgado Y Meacutendez-Barroso L Oechel W Paz-Pellat F Peacuterez-Ruiacutez E R Ro-driacuteguez J C Rojas-Robles N Saacutenchez-Mejiacutea Z M Uuh-Sonda J Vargas R Verduzco V S Vivoni E R andWatts C Flujos anuales de carbono en ecosistemas terrestres deMeacutexico in Estado Actual del Conocimiento del Ciclo del Car-bono y sus Interacciones en Meacutexico Siacutentesis a 2018 p 686Texcoco Estado de Meacutexico Meacutexico 2018

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

390 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Eshel G Dayalu A Wofsy S C Munger J W andTziperman E Listening to the Forest An ArtificialNeural Network-Based Model of Carbon Uptake at Har-vard Forest J Geophys Res-Biogeo 124 461ndash478httpsdoiorg1010292018JG004791 2019

Fu C Wang G Bible K Goulden M L Saleska S RScott R L and Cardon Z G Hydraulic redistributionaffects modeled carbon cycling via soil microbial activityand suppressed fire Glob Change Biol 24 3472ndash3485httpsdoiorg101111gcb14164 2018

Gebremichael M and Barros A P Evaluation of MODISGross Primary Productivity (GPP) in tropical mon-soon regions Remote Sens Environ 100 150ndash166httpsdoiorg101016jrse200510009 2006

Gitelson A A Peng Y Arkebauer T J and Suyker A EProductivity absorbed photosynthetically active radiation andlight use efficiency in crops Implications for remote sensingof crop primary production J Plant Physiol 177 100ndash109httpsdoiorg101016jjplph201412015 2015

Glenn E P Scott R L Nguyen U and Nagler P L Wide-area ratios of evapotranspiration to precipitation in monsoon-dependent semiarid vegetation communities J Arid Envi-ron 117 84ndash95 httpsdoiorg101016jjaridenv2015020102015

Goldstein A Turner W R Spawn S A Anderson-Teixeira K J Cook-Patton S Fargione J Gibbs H KGriscom B Hewson J H and Howard J F Protecting ir-recoverable carbon in Earthrsquos ecosystems Nat Clim Change10 1ndash9 httpsdoiorg101038s41558-020-0738-8 2020

Goulden M L Munger J W Fan S Daube B Cand Wofsy S C Measurements of carbon sequestra-tion by long-term eddy covariance Methods and a criti-cal evaluation of accuracy Glob Change Biol 2 169ndash182httpsdoiorg101111j1365-24861996tb00070x 1996

Grossiord C Buckley T N Cernusak L A Novick K A Poul-ter B Siegwolf R T Sperry J S and McDowell N GPlant responses to rising vapor pressure deficit New Phytol226 1550ndash1566 2020

Guerschman J P Hill M J Renzullo L J Barrett D JMarks A S and Botha E J Estimating fractional coverof photosynthetic vegetation non-photosynthetic vegetation andbare soil in the Australian tropical savanna region upscaling theEO-1 Hyperion and MODIS sensors Remote Sens Environ113 928ndash945 httpsdoiorg101016jrse200901006 2009

Guevara-Escobar A Gonzaacutelez-Sosa E Cervantes-Jimeacutenez MSuzaacuten-Azpiri H Queijeiro-Bolantildeos M E Carrillo-AacutengelesI and Cambroacuten-Sandoval V H Machine learning estimates ofeddy covariance carbon flux in a scrub in the Mexican highlandZenodo httpsdoiorg105281zenodo3598595 2020

Hall P Gill N Kurka M Phan W and Bartz A MachineLearning Interpretability with H2O Driverless AI edited byBartz A H2Oai Inc California US 2019

Hastings S J Oechel W C and Muhlia-Melo A Diur-nal seasonal and annual variation in the net ecosystem CO2exchange of a desert shrub community (Sarcocaulescent) inBaja California Mexico Glob Change Biol 11 927ndash939httpsdoiorg101111j1365-2486200500951x 2005

Henrich V Krauss G Goumltze C and Sandow C Entwicklungeiner Datenbank fuumlr Fernerkundungsindizes vol 45 AK Fern-erkundung Bochum 2012

Hill M Held A Leuning R Coops N Hughes D andCleugh H MODIS spectral signals at a flux tower site Rela-tionships with high-resolution data and CO2 flux and light useefficiency measurements Remote Sens Environ 103 351ndash368httpsdoiorg101016jrse200506015 2006

Hill M J and Guerschman J P The MODIS Global VegetationFractional Cover Product 2001ndash2018 Characteristics of Vegeta-tion Fractional Cover in Grasslands and Savanna Woodlands Re-mote Sens 12 406 httpsdoiorg103390rs12030406 2020

Hinojo-Hinojo C Castellanos A E Huxman T Ro-driguez J C Vargas R Romo-Leoacuten J R and Bie-derman J A Native shrubland and managed buffelgrasssavanna in drylands Implications for ecosystem carbonand water fluxes Agr Forest Meteorol 268 269ndash278httpsdoiorg101016jagrformet201901030 2019

Jaumlrvi L Havu M Ward H C Bellucco V McFad-den J P Toivonen T Heikinheimo V Kolari P Ri-ikonen A and Grimmond C S B Spatial modeling oflocal-scale biogenic and anthropogenic carbon dioxide emis-sions in Helsinki J Geophys Res-Atmos 124 8363ndash8384httpsdoiorg1010292018JD029576 2019

Joiner J and Yoshida Y Satellite-based reflectances capture largefraction of variability in global gross primary production (GPP)at weekly time scales Agr Forest Meteorol 291 108092httpsdoiorg101016jagrformet2020108092 2020

Jung M Schwalm C Migliavacca M Walther S Camps-VallsG Koirala S Anthoni P Besnard S Bodesheim P Carval-hais N Chevallier F Gans F Goll D S Haverd V KoumlhlerP Ichii K Jain A K Liu J Lombardozzi D Nabel J EM S Nelson J A OrsquoSullivan M Pallandt M Papale DPeters W Pongratz J Roumldenbeck C Sitch S TramontanaG Walker A Weber U and Reichstein M Scaling carbonfluxes from eddy covariance sites to globe synthesis and eval-uation of the FLUXCOM approach Biogeosciences 17 1343ndash1365 httpsdoiorg105194bg-17-1343-2020 2020

Kang M Ichii K Kim J Indrawati Y M Park J Moon MLim J-H and Chun J-H New Gap-Filling Strategies forLong-Period Flux Data Gaps Using a Data-Driven ApproachAtmosphere 10 568 httpsdoiorg103390atmos101005682019

Lal R Griffin M Apt J Lave L and Morgan M GManaging Soil Carbon Science 304 393ndash393httpsdoiorg101126science1093079 2004

LI-COR B Eddypro Software instruction manual Nebraska USversion 70 2019

Liu Y Liu R Pisek J and Chen J M Separating overstoryand understory leaf area indices for global needleleaf and de-ciduous broadleaf forests by fusion of MODIS and MISR dataBiogeosciences 14 1093ndash1110 httpsdoiorg105194bg-14-1093-2017 2017

Long S P Marshall-Colon A and Zhu X-G Meetingthe Global Food Demand of the Future by EngineeringCrop Photosynthesis and Yield Potential Cell 161 56ndash66httpsdoiorg101016jcell201503019 2015

Ma X Huete A Yu Q Restrepo-Coupe N Beringer J Hut-ley L B Kanniah K D Cleverly J and Eamus D Param-

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 391

eterization of an ecosystem light-use-efficiency model for pre-dicting savanna GPP using MODIS EVI Remote Sens Environ154 253ndash271 2014

Ma X Migliavacca M Wirth C Bohn F J Huth ARichter R and Mahecha M D Monitoring Plant FunctionalDiversity Using the Reflectance and Echo from Space RemoteSens 12 1248 httpsdoiorg103390rs12081248 2020

Marcolla B Roumldenbeck C and Cescatti A Patterns and con-trols of inter-annual variability in the terrestrial carbon budgetBiogeosciences 14 3815ndash3829 httpsdoiorg105194bg-14-3815-2017 2017

Mastachi-Loza C A Gonzaacutelez-Sosa E Becerril-Pintildea R andBraud I Peacuterdidas por intercepcioacuten en mezquite (Prosopis laevi-gata) y huizache (Acacia farnesiana) de la regioacuten semiaacuterida delcentro de Meacutexico Tecnol Cienc Agua 1 103ndash120 2010

Mauder M and Foken T Documentation and instruction man-ual of the eddy-covariance software package TK3 UniversitaumltBayreuth Abteilung Mikrometeorologie Bayreuth Germany2011

McClaran M A century of vegetation change on the Santa Rita Ex-perimental Range in Santa Rita Experimental Range 100 years(1903 to 2003) to accomplishments and contributions edited byMcClaran M P Ffolliott P F and Edminister C B Tech Co-ords USDA Forest service proceedings RMRS-P-30 TucsonAZ US 16ndash33 2003

McCulley R Jobbagy E Pockman W and Jackson R Nu-trient uptake as a contributing explanation for deep rootingin arid and semi-arid ecosystems Oecologia 141 620ndash628httpsdoiorg101007s00442-004-1687-z 2004

McElreath R Statistical Rethinking A Bayesian Course with Ex-amples in R and STAN CRC Press Boca Raton FL US 2020

Michailidis M How Driverless AI Prevents Overfitting and Leak-age Open Source Lead AI ML available at httpswwwh2oaiblogdriverless-ai-prevents-overfitting-leakage (last access7 July 2020) 2018

Mitchell R and Frank E Accelerating the XGBoost algo-rithm using GPU computing PeerJ Comput Sci 3 e127httpsdoiorg107717peerj-cs127 2017

Monteith J Solar radiation and productivity in tropical ecosys-tems J Appl Ecol 9 747ndash766 1972

Murray-Tortarolo G Friedlingstein P Sitch S Jaramillo V JMurguiacutea-Flores F Anav A Liu Y Arneth A Arvanitis AHarper A Jain A Kato E Koven C Poulter B Stocker BD Wiltshire A Zaehle S and Zeng N The carbon cycle inMexico past present and future of C stocks and fluxes Biogeo-sciences 13 223ndash238 httpsdoiorg105194bg-13-223-20162016

Nash J E and Sutcliffe J V River flow forecasting through con-ceptual models part I ndash A discussion of principles J Hydrol 10282ndash290 1970

Owen N A Choncubhair Oacute N Males J del Real Laborde J IRubio-Corteacutes R Griffiths H and Lanigan G Eddy covari-ance captures four-phase crassulacean acid metabolism (CAM)gas exchange signature in Agave Plant Cell Environ 39 295ndash309 httpsdoiorg101111pce12610 2016

Pasetto D Arenas-Castro S Bustamante J Casagrandi RChrysoulakis N Cord A F Dittrich A Domingo-Marimon C El Serafy G and Karnieli A Integration ofsatellite remote sensing data in ecosystem modelling at local

scales Practices and trends Methods Ecol Evol 9 1810ndash1821httpsdoiorg1011112041-210X13018 2018

Pastorello G Trotta C and Canfora E The FLUXNET2015dataset and the ONEFlux processing pipeline for eddy covari-ance data Sci Data 7 225 pp httpsdoiorg101038s41597-020-0534-3 2020

Pearcy R W Heterogeneous light environments in Handbookof Functional Plant Ecology edited by Pugnaire I F and Val-ladares F CRC Press Boca Ratoacuten FL US 270ndash314 2007

Piao S Liu Q Chen A Janssens I A Fu Y Dai J Liu LLian X Shen M and Zhu X Plant phenology and global cli-mate change Current progresses and challenges Glob ChangeBiol 25 1922ndash1940 httpsdoiorg101111gcb14619 2019

R Development Core Team R A language and environment forstatistical computing R Foundation for Statistical Computingavailable at httpswwwr-projectorg (last access 15 October2015) 2009

Reichstein M Falge E Baldocchi D Papale D Aubinet MBerbigier P Bernhofer C Buchmann N Gilmanov Tand Granier A On the separation of net ecosystem ex-change into assimilation and ecosystem respiration reviewand improved algorithm Glob Change Biol 11 1424ndash1439httpsdoiorg101111j1365-24862005001002x 2005

Richardson A D and Hollinger D Y Statistical model-ing of ecosystem respiration using eddy covariance datamaximum likelihood parameter estimation and Monte Carlosimulation of model and parameter uncertainty applied tothree simple models Agr Forest Meteorol 131 191ndash208httpsdoiorg101016jagrformet200505008 2005

Richardson A D Hollinger D Y Shoemaker J K Hughes HSavage K and Davidson E A Six years of ecosystem-atmosphere greenhouse gas fluxes measured in a sub-borealforest Sci Data 6 1ndash15 httpsdoiorg101038s41597-019-0119-1 2019

Roby M C Scott R L Barron-Gafford G A Hamerlynck E Pand Moore D J P Environmental and Vegetative Controls onSoil CO2 Efflux in Three Semiarid Ecosystems Soil Syst 3 6httpsdoiorg103390soilsystems3010006 2019

Running S W and Zhao M Daily GPP and annual NPP(MOD17A2A3) products NASA Earth Observing SystemMODIS land algorithm Univ of Mont Missoula Mont US2015

Running S W Nemani R R Heinsch F A Zhao MReeves M and Hashimoto H A continuous satellite-derived measure of global terrestrial primary productionBioscience 54 547ndash560 httpsdoiorg1016410006-3568(2004)054[0547ACSMOG]20CO2 2004

Rzedowski J Vegetacioacuten de Meacutexico Meacutexico Editor Limusa1978

Scott R L and Biederman J A Critical Zone Water Balance Over13 Years in a Semiarid Savanna Water Resour Res 55 574ndash588 httpsdoiorg1010292018WR023477 2019

Scott R L Watts C Payan J G Edwards E Goodrich D CWilliams D and James Shuttleworth W The understory andoverstory partitioning of energy and water fluxes in an opencanopy semiarid woodland Agr Forest Meteorol 114 127ndash139 httpsdoiorg101016S0168-1923(02)00197-1 2003

Scott R L Edwards E A Shuttleworth W J Huxman T EWatts C and Goodrich D C Interannual and seasonal vari-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

392 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

ation in fluxes of water and carbon dioxide from a ripar-ian woodland ecosystem Agr Forest Meteorol 122 65ndash84httpsdoiorg101016jagrformet200309001 2004

Scott R L Biederman J A Hamerlynck E P and Barron-Gafford G A The carbon balance pivot point of southwest-ern US semiarid ecosystems Insights from the 21st centurydrought J Geophys Res-Biogeo 120 2612ndash2624 2015

Segerstrom K Geology of the Bernal-Jalpan Area Estado deQueretaro Mexico Geol Surv Bull 1104 19ndash82 1961

Soper F M McCalley C K Sparks K and Sparks J P Soilcarbon dioxide emissions from the Mojave desert Isotopic evi-dence for a carbonate source Geophys Res Lett 44 245ndash251httpsdoiorg1010022016GL071198 2017

Stevens N T Steiner S H and MacKay R J Assessing agree-ment between two measurement systems An alternative to thelimits of agreement approach Stat Methods Med Res 262487ndash2504 httpsdoiorg1011770962280215601133 2015

Tagesson T Ardouml J Cappelaere B Kergoat L AbdiA Horion S and Fensholt R Modelling spatial andtemporal dynamics of gross primary production in theSahel from earth-observation-based photosynthetic capacityand quantum efficiency Biogeosciences 14 1333ndash1348httpsdoiorg105194bg-14-1333-2017 2017

Thornton P Thornton M Mayer B Wei Y Devarakonda RVose R and Cook R Daymet Daily Surface Weather Data ona 1-km Grid for North America Version 3 Oak Ridge NationalLab (ORNL) Oak Ridge TN US 2017

Tramontana G Jung M Schwalm C R Ichii K Camps-VallsG Raacuteduly B Reichstein M Arain M A Cescatti A KielyG Merbold L Serrano-Ortiz P Sickert S Wolf S andPapale D Predicting carbon dioxide and energy fluxes acrossglobal FLUXNET sites with regression algorithms Biogeo-sciences 13 4291ndash4313 httpsdoiorg105194bg-13-4291-2016 2016rdquo

Truong A Walters A Goodsitt J Hines K Bruss C B andFarivar R Towards automated machine learning Evaluationand comparison of AutoML approaches and tools in Proceed-ings of the 2019 IEEE 31st International Conference on Toolswith Artificial Intelligence (ICTAI) Portland OR US 1471ndash1479 2019

Verbist K Santibantildeez F Gabriels D and Soto G Atlas dezonas aacuteridas de Ameacuterica Latina y El Caribe Donoso MC(Coord) UNESCO-PHI and CAZALAC Montevideo Uruguay2010

Wang H Li X Ma M and Geng L Improving esti-mation of gross primary production in dryland ecosystemsby a model-data fusion approach Remote Sens 11 225httpsdoiorg103390rs11030225 2019

Wang X Ma M Li X Song Y Tan J Huang G Zhang ZZhao T Feng J and Ma Z Validation of MODIS-GPP prod-uct at 10 flux sites in northern China Int J Remote Sens 34587ndash599 httpsdoiorg101080014311612012715774 2013

Wang X Wang H Li X and Ran Y Photosynthesis (NPPNEP Respiration) Obs Meas Ecohydrol Process 1ndash30httpsdoiorg101007978-3-662-47871-4_10-1 2018

Wheeler C W Archer S R Asner G P and McMurtry C RClimaticedaphic controls on soil carbonnitrogen response toshrub encroachment in desert grassland Ecol Appl 17 1911ndash1928 httpsdoiorg10189006-15801 2007

Wilcox B P Birt A Fuhlendorf S D and Archer S R Emerg-ing frameworks for understanding and mitigating woody plantencroachment in grassy biomes Curr Opin Env Sust 32 46ndash52 httpsdoiorg101016jcosust201804005 2018

Wilson J B Cover plus ways of measuring plant canopiesand the terms used for them J Veg Sci 22 197ndash206httpsdoiorg101111j1654-1103201001238x 2011

Wofsy S Goulden M Munger J Fan S-M Bakwin PDaube B Bassow S and Bazzaz F Net exchange ofCO2 in a mid-latitude forest Science 260 1314ndash1317httpsdoiorg101126science26051121314 1993

Wu C Niu Z and Gao S Gross primary production estimationfrom MODIS data with vegetation index and photosyntheticallyactive radiation in maize J Geophys Res-Atmos 115 D12127httpsdoiorg1010292009JD013023 2010

Wutzler T Lucas-Moffat A Migliavacca M Knauer JSickel K Šigut L Menzer O and Reichstein MBasic and extensible post-processing of eddy covarianceflux data with REddyProc Biogeosciences 15 5015ndash5030httpsdoiorg105194bg-15-5015-2018 2018

Xiao H McDonald-Madden E Sabbadin R Peyrard NDee L E and Chadegraves I The value of understanding feed-backs from ecosystem functions to species for managing ecosys-tems Nat Commun 10 1ndash10 httpsdoiorg101038s41467-019-11890-7 2019

Xu B Yang Y Li P Shen H and Fang J Globalpatterns of ecosystem carbon flux in forests A biometricdata-based synthesis Global Biogeochem Cy 28 962ndash973httpsdoiorg1010022013GB004593 2014

Yao J Liu H Huang J Gao Z Wang G Li D Yu H andChen X Accelerated dryland expansion regulates future vari-ability in dryland gross primary production Nat Commun 111ndash10 httpsdoiorg101038s41467-020-15515-2 2020

Yepez E A Williams D G Scott R L and Lin GPartitioning overstory and understory evapotranspiration ina semiarid savanna woodland from the isotopic composi-tion of water vapor Agr Forest Meteorol 119 53ndash68httpsdoiorg101016S0168-1923(03)00116-3 2003

Yona L Cashore B Jackson R B Ometto J and Brad-ford M A Refining national greenhouse gas inventories Am-bio 1ndash6 httpsdoiorg101007s13280-019-01312-9 2020

Zeng X Zeng X and Barlage M Growing temperate shrubsover arid and semiarid regions in the Community Land ModelndashDynamic Global Vegetation Model Global Biogeochem Cy 22GB3003 httpsdoiorg1010292007GB003014 2008

Zhang A Jia G Epstein H E and Xia J ENSO elicits oppos-ing responses of semi-arid vegetation between Hemispheres SciRep 7 42281 httpsdoiorg101038srep42281 2017

Zhang L Xiao J Zheng Y Li S and Zhou Y Increased carbonuptake and water use efficiency in global semi-arid ecosystemsEnviron Res Lett 15 034022 httpsdoiorg1010881748-9326ab68ec 2020

Zhu X-G Long S P and Ort D R What is the max-imum efficiency with which photosynthesis can convert so-lar energy into biomass Curr Opin Biotech 19 153ndash159httpsdoiorg101016jcopbio200802004 2008

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

  • Abstract
  • Introduction
  • Materials and methods
    • Site description
    • Eddy covariance measurements
    • Flux data processing
    • Remotely sensed data
    • MODIS algorithm for GPP
    • Santa Rita site dataset
    • Modeling
    • Variable importance
    • Model agreement
      • Results
        • Eddy covariance fluxes at Bernal
        • Machine learning ensembles as predictors of eddy covariance GPP
        • MODIS as predictor of eddy covariance GPP
        • Prediction eddy covariance GPP with ordinary-least-squares multiple regression
        • Agreement
          • Discussion
            • Model agreement
            • MODIS discrepancies
            • Forcing variables and machine learning
            • Carbon flux
            • Ecosystem management
              • Conclusions
              • Appendix A
              • Appendix B
              • Appendix C
              • Code and data availability
              • Author contributions
              • Competing interests
              • Acknowledgement
              • Financial support
              • Review statement
              • References
Page 22: Machine learning estimates of eddy covariance carbon flux ... · Scrub is a xeric category of shrublands characterized by plants with small leaves; it is very thorny, and its biomass

388 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Appendix C

Table C1 Analysis of variance for GGP derived from EC data atthe Bernal site Variable details are described in Appendix B

Model term Coefficient SEa DFb SSc F p VIFd

Regression 14 178048Constant minus1318 648 minus203 004R_08_40542e

minus212 060 1 2680 minus352 0 220R_13_66267e

minus286 123 1 1161 minus232 002 155R_14_67368e 917 259 1 2706 353 0 122RGRI6

minus083 037 1 1100 minus225 003 143GARIf 201 108 1 755 187 006 237EVIMYDg 853 201 1 3903 425 0 294NDVI_wf 390 155 1 1367 251 001 271PsnNetg 0013 0 1 7530 590 0 383LstNgtMYDg 005 002 1 818 194 005 201Dayprch 014 003 1 3590 407 0 192Daysradh 001 0 1 5121 486 0 501DayTmaxh

minus024 006 1 3658 minus411 0 689DayTminh 018 005 1 3222 386 0 496TRMMi minus004 002 1 1087 minus224 003 141Error 403 87297Total 417 265345

a Standard error of coefficient b Degrees of freedom c Adjusted sum of squares d Variance inflation factore Variable name denotes the band number and spectral bandwidth of MODIS (Moderate Resolution ImagingSpectroradiometer) f Vegetation indices RGRI is redndashgreen ratio index and GARI is green atmosphericallyresistant vegetation index details of formula are described in Appendix B g Layers from MODIS productsEVIMYD is enhanced vegetation index from MYD13 PsnNet is photosynthesis product form MOD17LstNgtMYD is nighttime land surface temperature emissivity from MYD11 h Variables obtained fromDaymet daily dataset DayTmin is minimum temperature DayTmax is maximum temperature Daysrad isshortwave radiation Dayprc is precipitation i Daily rainfall rate from 3B42 TRMM (Tropical RainfallMeasuring Mission)

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 389

Code and data availability Database and programming code areavailable at httpsdoiorg105281zenodo3598595 (Guevara-Escobar et al 2020)

Author contributions AGE contributed to conceptualization for-mal analysis investigation methodology software supervisionvalidation visualization and writing of the original draft as wellas review and editing EGS was in charge of funding acquisitionproject administration and writing and review MCJ contributed toinvestigation methodology validation visualization and writing ofthe original draft as well as review editing and communicationMEQB helped with investigation validation writing and reviewHSA contributed with conceptualization writing and review ICAacutehelped with investigation writing and review VHCS contributedwith writing review and funding acquisition

Competing interests The authors declare that they have no conflictof interest

Acknowledgement We thank NASA and Ameriflux for data avail-ability Data filtering was carried out by the cluster Abacus ICentre of Applied Mathematics and High Performance ComputingABACUS-CINVESTAV

Financial support This research has been supported by CONA-CyT ndash SEMARNAT (grant no 2014-1-(249407))

Review statement This paper was edited by Paul Stoy and re-viewed by three anonymous referees

References

Aguirre-Diacuteaz G J Aguilloacuten-Robles A Tristaacuten-Gonzaacutelez MLabarthe-Hernaacutendez G Loacutepez-Martiacutenez M Bellon Hand Nieto-Obregoacuten J Geologic setting of the Pentildeade Bernal Natural Monument Quereacutetaro Meacutexico Anendogenous volcanic dome Geosphere 9 557ndash571httpsdoiorg101130GES008431 2013

Anav A Friedlingstein P Beer C Ciais P Harper AJones C Murray-Tortarolo G Papale D Parazoo N Cand Peylin P Spatiotemporal patterns of terrestrial grossprimary production A review Rev Geophys 53 785ndash818httpsdoiorg1010022015RG000483 2015

Anderson-Teixeira K J Delong J P Fox A M Brese D Aand Litvak M E Differential responses of production and res-piration to temperature and moisture drive the carbon balanceacross a climatic gradient in New Mexico Glob Change Biol17 410ndash424 httpsdoiorg101111j1365-2486201002269x2011

Baldocchi D Measuring fluxes of trace gases and energy be-tween ecosystems and the atmospherendashthe state and future of the

eddy covariance method Glob Change Biol 20 3600ndash3609httpsdoiorg101111gcb12649 2014

Barron-Gafford G A Sanchez-Cantildeete E P Minor R LHendryx S M Lee E Sutter L F Tran N Parra EColella T Murphy P C Hamerlynck E P Kumar P andScott R L Impacts of hydraulic redistribution on grassndashtreecompetition vs facilitation in a semi-arid savanna New Phytol215 1451ndash1461 httpsdoiorg101111nph14693 2017

Belayneh A and Tessema Z K Mechanisms of bush encroach-ment and its inter-connection with rangeland degradation insemi-arid African ecosystems a review J Arid Land 9 299ndash312 httpsdoiorg101007s40333-016-0023-x 2017

Bland J M and Altman D G Statistical methods for assessingagreement between two methods of clinical measurement Int JNurs Stud 47 931ndash936 2010

Boettiger C and Eddelbuettel D An introduction to rockerDocker containers 639 for r ArXiv Prepr ArXiv171003675 6402017

Bonilla-Moheno M and Aide T M Beyond deforestationLand cover transitions in Mexico Agric Syst 178 102734httpsdoiorg101016jagsy2019102734 2020

Booker K Huntsinger L Bartolome J W Sayre N Fand Stewart W What can ecological science tell us aboutopportunities for carbon sequestration on arid rangelands inthe United States Glob Environ Change 23 240ndash251httpsdoiorg101016jgloenvcha201210001 2013

Butzer K W and Butzer E K The lsquonaturalrsquovegetation ofthe Mexican Bajiacuteo archival documentation of a 16th-century savanna environment Quat Int 43 161ndash172httpsdoiorg101016S1040-6182(97)00032-3 1997

CICESE C de I C y de E S de E Base de datos climatoloacutegicanacional (CLICOM) Villa Bernal Quereacutetaro available at httpclicom-mexcicesemx (last access 28 August 2019) 2015

Cohen W B Maiersperger T K Yang Z Gower S TTurner D P Ritts W D Berterretche M and Run-ning S W Comparisons of land cover and LAI esti-mates derived from ETM+ and MODIS for four sites inNorth America A quality assessment of 20002001 provi-sional MODIS products Remote Sens Environ 88 233ndash255httpsdoiorg101016jrse200306006 2003

Cui T Wang Y Sun R Qiao C Fan W Jiang GHao L and Zhang L Estimating vegetation primaryproduction in the Heihe River Basin of China withmulti-source and multi-scale data PloS One 11 1ndash20httpsdoiorg101371journalpone0153971 2016

Curtis J T and McIntosh R P The Interrelations of Certain An-alytic and Synthetic Phytosociological Characters Ecology 31434ndash455 httpsdoiorg1023071931497 1950

Delgado-Balbuena J Yepez E A Aacutengeles-Peacuterez G Aguirre-Gutierrez C Arredondo T Ayala-Nintildeo F Bullock S HCastellanos A E Cueva A Figueroa-Espinoza B Garatuza-Payaacuten J Hinojo-Hinojo C Maya-Delgado Y Meacutendez-Barroso L Oechel W Paz-Pellat F Peacuterez-Ruiacutez E R Ro-driacuteguez J C Rojas-Robles N Saacutenchez-Mejiacutea Z M Uuh-Sonda J Vargas R Verduzco V S Vivoni E R andWatts C Flujos anuales de carbono en ecosistemas terrestres deMeacutexico in Estado Actual del Conocimiento del Ciclo del Car-bono y sus Interacciones en Meacutexico Siacutentesis a 2018 p 686Texcoco Estado de Meacutexico Meacutexico 2018

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

390 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Eshel G Dayalu A Wofsy S C Munger J W andTziperman E Listening to the Forest An ArtificialNeural Network-Based Model of Carbon Uptake at Har-vard Forest J Geophys Res-Biogeo 124 461ndash478httpsdoiorg1010292018JG004791 2019

Fu C Wang G Bible K Goulden M L Saleska S RScott R L and Cardon Z G Hydraulic redistributionaffects modeled carbon cycling via soil microbial activityand suppressed fire Glob Change Biol 24 3472ndash3485httpsdoiorg101111gcb14164 2018

Gebremichael M and Barros A P Evaluation of MODISGross Primary Productivity (GPP) in tropical mon-soon regions Remote Sens Environ 100 150ndash166httpsdoiorg101016jrse200510009 2006

Gitelson A A Peng Y Arkebauer T J and Suyker A EProductivity absorbed photosynthetically active radiation andlight use efficiency in crops Implications for remote sensingof crop primary production J Plant Physiol 177 100ndash109httpsdoiorg101016jjplph201412015 2015

Glenn E P Scott R L Nguyen U and Nagler P L Wide-area ratios of evapotranspiration to precipitation in monsoon-dependent semiarid vegetation communities J Arid Envi-ron 117 84ndash95 httpsdoiorg101016jjaridenv2015020102015

Goldstein A Turner W R Spawn S A Anderson-Teixeira K J Cook-Patton S Fargione J Gibbs H KGriscom B Hewson J H and Howard J F Protecting ir-recoverable carbon in Earthrsquos ecosystems Nat Clim Change10 1ndash9 httpsdoiorg101038s41558-020-0738-8 2020

Goulden M L Munger J W Fan S Daube B Cand Wofsy S C Measurements of carbon sequestra-tion by long-term eddy covariance Methods and a criti-cal evaluation of accuracy Glob Change Biol 2 169ndash182httpsdoiorg101111j1365-24861996tb00070x 1996

Grossiord C Buckley T N Cernusak L A Novick K A Poul-ter B Siegwolf R T Sperry J S and McDowell N GPlant responses to rising vapor pressure deficit New Phytol226 1550ndash1566 2020

Guerschman J P Hill M J Renzullo L J Barrett D JMarks A S and Botha E J Estimating fractional coverof photosynthetic vegetation non-photosynthetic vegetation andbare soil in the Australian tropical savanna region upscaling theEO-1 Hyperion and MODIS sensors Remote Sens Environ113 928ndash945 httpsdoiorg101016jrse200901006 2009

Guevara-Escobar A Gonzaacutelez-Sosa E Cervantes-Jimeacutenez MSuzaacuten-Azpiri H Queijeiro-Bolantildeos M E Carrillo-AacutengelesI and Cambroacuten-Sandoval V H Machine learning estimates ofeddy covariance carbon flux in a scrub in the Mexican highlandZenodo httpsdoiorg105281zenodo3598595 2020

Hall P Gill N Kurka M Phan W and Bartz A MachineLearning Interpretability with H2O Driverless AI edited byBartz A H2Oai Inc California US 2019

Hastings S J Oechel W C and Muhlia-Melo A Diur-nal seasonal and annual variation in the net ecosystem CO2exchange of a desert shrub community (Sarcocaulescent) inBaja California Mexico Glob Change Biol 11 927ndash939httpsdoiorg101111j1365-2486200500951x 2005

Henrich V Krauss G Goumltze C and Sandow C Entwicklungeiner Datenbank fuumlr Fernerkundungsindizes vol 45 AK Fern-erkundung Bochum 2012

Hill M Held A Leuning R Coops N Hughes D andCleugh H MODIS spectral signals at a flux tower site Rela-tionships with high-resolution data and CO2 flux and light useefficiency measurements Remote Sens Environ 103 351ndash368httpsdoiorg101016jrse200506015 2006

Hill M J and Guerschman J P The MODIS Global VegetationFractional Cover Product 2001ndash2018 Characteristics of Vegeta-tion Fractional Cover in Grasslands and Savanna Woodlands Re-mote Sens 12 406 httpsdoiorg103390rs12030406 2020

Hinojo-Hinojo C Castellanos A E Huxman T Ro-driguez J C Vargas R Romo-Leoacuten J R and Bie-derman J A Native shrubland and managed buffelgrasssavanna in drylands Implications for ecosystem carbonand water fluxes Agr Forest Meteorol 268 269ndash278httpsdoiorg101016jagrformet201901030 2019

Jaumlrvi L Havu M Ward H C Bellucco V McFad-den J P Toivonen T Heikinheimo V Kolari P Ri-ikonen A and Grimmond C S B Spatial modeling oflocal-scale biogenic and anthropogenic carbon dioxide emis-sions in Helsinki J Geophys Res-Atmos 124 8363ndash8384httpsdoiorg1010292018JD029576 2019

Joiner J and Yoshida Y Satellite-based reflectances capture largefraction of variability in global gross primary production (GPP)at weekly time scales Agr Forest Meteorol 291 108092httpsdoiorg101016jagrformet2020108092 2020

Jung M Schwalm C Migliavacca M Walther S Camps-VallsG Koirala S Anthoni P Besnard S Bodesheim P Carval-hais N Chevallier F Gans F Goll D S Haverd V KoumlhlerP Ichii K Jain A K Liu J Lombardozzi D Nabel J EM S Nelson J A OrsquoSullivan M Pallandt M Papale DPeters W Pongratz J Roumldenbeck C Sitch S TramontanaG Walker A Weber U and Reichstein M Scaling carbonfluxes from eddy covariance sites to globe synthesis and eval-uation of the FLUXCOM approach Biogeosciences 17 1343ndash1365 httpsdoiorg105194bg-17-1343-2020 2020

Kang M Ichii K Kim J Indrawati Y M Park J Moon MLim J-H and Chun J-H New Gap-Filling Strategies forLong-Period Flux Data Gaps Using a Data-Driven ApproachAtmosphere 10 568 httpsdoiorg103390atmos101005682019

Lal R Griffin M Apt J Lave L and Morgan M GManaging Soil Carbon Science 304 393ndash393httpsdoiorg101126science1093079 2004

LI-COR B Eddypro Software instruction manual Nebraska USversion 70 2019

Liu Y Liu R Pisek J and Chen J M Separating overstoryand understory leaf area indices for global needleleaf and de-ciduous broadleaf forests by fusion of MODIS and MISR dataBiogeosciences 14 1093ndash1110 httpsdoiorg105194bg-14-1093-2017 2017

Long S P Marshall-Colon A and Zhu X-G Meetingthe Global Food Demand of the Future by EngineeringCrop Photosynthesis and Yield Potential Cell 161 56ndash66httpsdoiorg101016jcell201503019 2015

Ma X Huete A Yu Q Restrepo-Coupe N Beringer J Hut-ley L B Kanniah K D Cleverly J and Eamus D Param-

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 391

eterization of an ecosystem light-use-efficiency model for pre-dicting savanna GPP using MODIS EVI Remote Sens Environ154 253ndash271 2014

Ma X Migliavacca M Wirth C Bohn F J Huth ARichter R and Mahecha M D Monitoring Plant FunctionalDiversity Using the Reflectance and Echo from Space RemoteSens 12 1248 httpsdoiorg103390rs12081248 2020

Marcolla B Roumldenbeck C and Cescatti A Patterns and con-trols of inter-annual variability in the terrestrial carbon budgetBiogeosciences 14 3815ndash3829 httpsdoiorg105194bg-14-3815-2017 2017

Mastachi-Loza C A Gonzaacutelez-Sosa E Becerril-Pintildea R andBraud I Peacuterdidas por intercepcioacuten en mezquite (Prosopis laevi-gata) y huizache (Acacia farnesiana) de la regioacuten semiaacuterida delcentro de Meacutexico Tecnol Cienc Agua 1 103ndash120 2010

Mauder M and Foken T Documentation and instruction man-ual of the eddy-covariance software package TK3 UniversitaumltBayreuth Abteilung Mikrometeorologie Bayreuth Germany2011

McClaran M A century of vegetation change on the Santa Rita Ex-perimental Range in Santa Rita Experimental Range 100 years(1903 to 2003) to accomplishments and contributions edited byMcClaran M P Ffolliott P F and Edminister C B Tech Co-ords USDA Forest service proceedings RMRS-P-30 TucsonAZ US 16ndash33 2003

McCulley R Jobbagy E Pockman W and Jackson R Nu-trient uptake as a contributing explanation for deep rootingin arid and semi-arid ecosystems Oecologia 141 620ndash628httpsdoiorg101007s00442-004-1687-z 2004

McElreath R Statistical Rethinking A Bayesian Course with Ex-amples in R and STAN CRC Press Boca Raton FL US 2020

Michailidis M How Driverless AI Prevents Overfitting and Leak-age Open Source Lead AI ML available at httpswwwh2oaiblogdriverless-ai-prevents-overfitting-leakage (last access7 July 2020) 2018

Mitchell R and Frank E Accelerating the XGBoost algo-rithm using GPU computing PeerJ Comput Sci 3 e127httpsdoiorg107717peerj-cs127 2017

Monteith J Solar radiation and productivity in tropical ecosys-tems J Appl Ecol 9 747ndash766 1972

Murray-Tortarolo G Friedlingstein P Sitch S Jaramillo V JMurguiacutea-Flores F Anav A Liu Y Arneth A Arvanitis AHarper A Jain A Kato E Koven C Poulter B Stocker BD Wiltshire A Zaehle S and Zeng N The carbon cycle inMexico past present and future of C stocks and fluxes Biogeo-sciences 13 223ndash238 httpsdoiorg105194bg-13-223-20162016

Nash J E and Sutcliffe J V River flow forecasting through con-ceptual models part I ndash A discussion of principles J Hydrol 10282ndash290 1970

Owen N A Choncubhair Oacute N Males J del Real Laborde J IRubio-Corteacutes R Griffiths H and Lanigan G Eddy covari-ance captures four-phase crassulacean acid metabolism (CAM)gas exchange signature in Agave Plant Cell Environ 39 295ndash309 httpsdoiorg101111pce12610 2016

Pasetto D Arenas-Castro S Bustamante J Casagrandi RChrysoulakis N Cord A F Dittrich A Domingo-Marimon C El Serafy G and Karnieli A Integration ofsatellite remote sensing data in ecosystem modelling at local

scales Practices and trends Methods Ecol Evol 9 1810ndash1821httpsdoiorg1011112041-210X13018 2018

Pastorello G Trotta C and Canfora E The FLUXNET2015dataset and the ONEFlux processing pipeline for eddy covari-ance data Sci Data 7 225 pp httpsdoiorg101038s41597-020-0534-3 2020

Pearcy R W Heterogeneous light environments in Handbookof Functional Plant Ecology edited by Pugnaire I F and Val-ladares F CRC Press Boca Ratoacuten FL US 270ndash314 2007

Piao S Liu Q Chen A Janssens I A Fu Y Dai J Liu LLian X Shen M and Zhu X Plant phenology and global cli-mate change Current progresses and challenges Glob ChangeBiol 25 1922ndash1940 httpsdoiorg101111gcb14619 2019

R Development Core Team R A language and environment forstatistical computing R Foundation for Statistical Computingavailable at httpswwwr-projectorg (last access 15 October2015) 2009

Reichstein M Falge E Baldocchi D Papale D Aubinet MBerbigier P Bernhofer C Buchmann N Gilmanov Tand Granier A On the separation of net ecosystem ex-change into assimilation and ecosystem respiration reviewand improved algorithm Glob Change Biol 11 1424ndash1439httpsdoiorg101111j1365-24862005001002x 2005

Richardson A D and Hollinger D Y Statistical model-ing of ecosystem respiration using eddy covariance datamaximum likelihood parameter estimation and Monte Carlosimulation of model and parameter uncertainty applied tothree simple models Agr Forest Meteorol 131 191ndash208httpsdoiorg101016jagrformet200505008 2005

Richardson A D Hollinger D Y Shoemaker J K Hughes HSavage K and Davidson E A Six years of ecosystem-atmosphere greenhouse gas fluxes measured in a sub-borealforest Sci Data 6 1ndash15 httpsdoiorg101038s41597-019-0119-1 2019

Roby M C Scott R L Barron-Gafford G A Hamerlynck E Pand Moore D J P Environmental and Vegetative Controls onSoil CO2 Efflux in Three Semiarid Ecosystems Soil Syst 3 6httpsdoiorg103390soilsystems3010006 2019

Running S W and Zhao M Daily GPP and annual NPP(MOD17A2A3) products NASA Earth Observing SystemMODIS land algorithm Univ of Mont Missoula Mont US2015

Running S W Nemani R R Heinsch F A Zhao MReeves M and Hashimoto H A continuous satellite-derived measure of global terrestrial primary productionBioscience 54 547ndash560 httpsdoiorg1016410006-3568(2004)054[0547ACSMOG]20CO2 2004

Rzedowski J Vegetacioacuten de Meacutexico Meacutexico Editor Limusa1978

Scott R L and Biederman J A Critical Zone Water Balance Over13 Years in a Semiarid Savanna Water Resour Res 55 574ndash588 httpsdoiorg1010292018WR023477 2019

Scott R L Watts C Payan J G Edwards E Goodrich D CWilliams D and James Shuttleworth W The understory andoverstory partitioning of energy and water fluxes in an opencanopy semiarid woodland Agr Forest Meteorol 114 127ndash139 httpsdoiorg101016S0168-1923(02)00197-1 2003

Scott R L Edwards E A Shuttleworth W J Huxman T EWatts C and Goodrich D C Interannual and seasonal vari-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

392 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

ation in fluxes of water and carbon dioxide from a ripar-ian woodland ecosystem Agr Forest Meteorol 122 65ndash84httpsdoiorg101016jagrformet200309001 2004

Scott R L Biederman J A Hamerlynck E P and Barron-Gafford G A The carbon balance pivot point of southwest-ern US semiarid ecosystems Insights from the 21st centurydrought J Geophys Res-Biogeo 120 2612ndash2624 2015

Segerstrom K Geology of the Bernal-Jalpan Area Estado deQueretaro Mexico Geol Surv Bull 1104 19ndash82 1961

Soper F M McCalley C K Sparks K and Sparks J P Soilcarbon dioxide emissions from the Mojave desert Isotopic evi-dence for a carbonate source Geophys Res Lett 44 245ndash251httpsdoiorg1010022016GL071198 2017

Stevens N T Steiner S H and MacKay R J Assessing agree-ment between two measurement systems An alternative to thelimits of agreement approach Stat Methods Med Res 262487ndash2504 httpsdoiorg1011770962280215601133 2015

Tagesson T Ardouml J Cappelaere B Kergoat L AbdiA Horion S and Fensholt R Modelling spatial andtemporal dynamics of gross primary production in theSahel from earth-observation-based photosynthetic capacityand quantum efficiency Biogeosciences 14 1333ndash1348httpsdoiorg105194bg-14-1333-2017 2017

Thornton P Thornton M Mayer B Wei Y Devarakonda RVose R and Cook R Daymet Daily Surface Weather Data ona 1-km Grid for North America Version 3 Oak Ridge NationalLab (ORNL) Oak Ridge TN US 2017

Tramontana G Jung M Schwalm C R Ichii K Camps-VallsG Raacuteduly B Reichstein M Arain M A Cescatti A KielyG Merbold L Serrano-Ortiz P Sickert S Wolf S andPapale D Predicting carbon dioxide and energy fluxes acrossglobal FLUXNET sites with regression algorithms Biogeo-sciences 13 4291ndash4313 httpsdoiorg105194bg-13-4291-2016 2016rdquo

Truong A Walters A Goodsitt J Hines K Bruss C B andFarivar R Towards automated machine learning Evaluationand comparison of AutoML approaches and tools in Proceed-ings of the 2019 IEEE 31st International Conference on Toolswith Artificial Intelligence (ICTAI) Portland OR US 1471ndash1479 2019

Verbist K Santibantildeez F Gabriels D and Soto G Atlas dezonas aacuteridas de Ameacuterica Latina y El Caribe Donoso MC(Coord) UNESCO-PHI and CAZALAC Montevideo Uruguay2010

Wang H Li X Ma M and Geng L Improving esti-mation of gross primary production in dryland ecosystemsby a model-data fusion approach Remote Sens 11 225httpsdoiorg103390rs11030225 2019

Wang X Ma M Li X Song Y Tan J Huang G Zhang ZZhao T Feng J and Ma Z Validation of MODIS-GPP prod-uct at 10 flux sites in northern China Int J Remote Sens 34587ndash599 httpsdoiorg101080014311612012715774 2013

Wang X Wang H Li X and Ran Y Photosynthesis (NPPNEP Respiration) Obs Meas Ecohydrol Process 1ndash30httpsdoiorg101007978-3-662-47871-4_10-1 2018

Wheeler C W Archer S R Asner G P and McMurtry C RClimaticedaphic controls on soil carbonnitrogen response toshrub encroachment in desert grassland Ecol Appl 17 1911ndash1928 httpsdoiorg10189006-15801 2007

Wilcox B P Birt A Fuhlendorf S D and Archer S R Emerg-ing frameworks for understanding and mitigating woody plantencroachment in grassy biomes Curr Opin Env Sust 32 46ndash52 httpsdoiorg101016jcosust201804005 2018

Wilson J B Cover plus ways of measuring plant canopiesand the terms used for them J Veg Sci 22 197ndash206httpsdoiorg101111j1654-1103201001238x 2011

Wofsy S Goulden M Munger J Fan S-M Bakwin PDaube B Bassow S and Bazzaz F Net exchange ofCO2 in a mid-latitude forest Science 260 1314ndash1317httpsdoiorg101126science26051121314 1993

Wu C Niu Z and Gao S Gross primary production estimationfrom MODIS data with vegetation index and photosyntheticallyactive radiation in maize J Geophys Res-Atmos 115 D12127httpsdoiorg1010292009JD013023 2010

Wutzler T Lucas-Moffat A Migliavacca M Knauer JSickel K Šigut L Menzer O and Reichstein MBasic and extensible post-processing of eddy covarianceflux data with REddyProc Biogeosciences 15 5015ndash5030httpsdoiorg105194bg-15-5015-2018 2018

Xiao H McDonald-Madden E Sabbadin R Peyrard NDee L E and Chadegraves I The value of understanding feed-backs from ecosystem functions to species for managing ecosys-tems Nat Commun 10 1ndash10 httpsdoiorg101038s41467-019-11890-7 2019

Xu B Yang Y Li P Shen H and Fang J Globalpatterns of ecosystem carbon flux in forests A biometricdata-based synthesis Global Biogeochem Cy 28 962ndash973httpsdoiorg1010022013GB004593 2014

Yao J Liu H Huang J Gao Z Wang G Li D Yu H andChen X Accelerated dryland expansion regulates future vari-ability in dryland gross primary production Nat Commun 111ndash10 httpsdoiorg101038s41467-020-15515-2 2020

Yepez E A Williams D G Scott R L and Lin GPartitioning overstory and understory evapotranspiration ina semiarid savanna woodland from the isotopic composi-tion of water vapor Agr Forest Meteorol 119 53ndash68httpsdoiorg101016S0168-1923(03)00116-3 2003

Yona L Cashore B Jackson R B Ometto J and Brad-ford M A Refining national greenhouse gas inventories Am-bio 1ndash6 httpsdoiorg101007s13280-019-01312-9 2020

Zeng X Zeng X and Barlage M Growing temperate shrubsover arid and semiarid regions in the Community Land ModelndashDynamic Global Vegetation Model Global Biogeochem Cy 22GB3003 httpsdoiorg1010292007GB003014 2008

Zhang A Jia G Epstein H E and Xia J ENSO elicits oppos-ing responses of semi-arid vegetation between Hemispheres SciRep 7 42281 httpsdoiorg101038srep42281 2017

Zhang L Xiao J Zheng Y Li S and Zhou Y Increased carbonuptake and water use efficiency in global semi-arid ecosystemsEnviron Res Lett 15 034022 httpsdoiorg1010881748-9326ab68ec 2020

Zhu X-G Long S P and Ort D R What is the max-imum efficiency with which photosynthesis can convert so-lar energy into biomass Curr Opin Biotech 19 153ndash159httpsdoiorg101016jcopbio200802004 2008

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

  • Abstract
  • Introduction
  • Materials and methods
    • Site description
    • Eddy covariance measurements
    • Flux data processing
    • Remotely sensed data
    • MODIS algorithm for GPP
    • Santa Rita site dataset
    • Modeling
    • Variable importance
    • Model agreement
      • Results
        • Eddy covariance fluxes at Bernal
        • Machine learning ensembles as predictors of eddy covariance GPP
        • MODIS as predictor of eddy covariance GPP
        • Prediction eddy covariance GPP with ordinary-least-squares multiple regression
        • Agreement
          • Discussion
            • Model agreement
            • MODIS discrepancies
            • Forcing variables and machine learning
            • Carbon flux
            • Ecosystem management
              • Conclusions
              • Appendix A
              • Appendix B
              • Appendix C
              • Code and data availability
              • Author contributions
              • Competing interests
              • Acknowledgement
              • Financial support
              • Review statement
              • References
Page 23: Machine learning estimates of eddy covariance carbon flux ... · Scrub is a xeric category of shrublands characterized by plants with small leaves; it is very thorny, and its biomass

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 389

Code and data availability Database and programming code areavailable at httpsdoiorg105281zenodo3598595 (Guevara-Escobar et al 2020)

Author contributions AGE contributed to conceptualization for-mal analysis investigation methodology software supervisionvalidation visualization and writing of the original draft as wellas review and editing EGS was in charge of funding acquisitionproject administration and writing and review MCJ contributed toinvestigation methodology validation visualization and writing ofthe original draft as well as review editing and communicationMEQB helped with investigation validation writing and reviewHSA contributed with conceptualization writing and review ICAacutehelped with investigation writing and review VHCS contributedwith writing review and funding acquisition

Competing interests The authors declare that they have no conflictof interest

Acknowledgement We thank NASA and Ameriflux for data avail-ability Data filtering was carried out by the cluster Abacus ICentre of Applied Mathematics and High Performance ComputingABACUS-CINVESTAV

Financial support This research has been supported by CONA-CyT ndash SEMARNAT (grant no 2014-1-(249407))

Review statement This paper was edited by Paul Stoy and re-viewed by three anonymous referees

References

Aguirre-Diacuteaz G J Aguilloacuten-Robles A Tristaacuten-Gonzaacutelez MLabarthe-Hernaacutendez G Loacutepez-Martiacutenez M Bellon Hand Nieto-Obregoacuten J Geologic setting of the Pentildeade Bernal Natural Monument Quereacutetaro Meacutexico Anendogenous volcanic dome Geosphere 9 557ndash571httpsdoiorg101130GES008431 2013

Anav A Friedlingstein P Beer C Ciais P Harper AJones C Murray-Tortarolo G Papale D Parazoo N Cand Peylin P Spatiotemporal patterns of terrestrial grossprimary production A review Rev Geophys 53 785ndash818httpsdoiorg1010022015RG000483 2015

Anderson-Teixeira K J Delong J P Fox A M Brese D Aand Litvak M E Differential responses of production and res-piration to temperature and moisture drive the carbon balanceacross a climatic gradient in New Mexico Glob Change Biol17 410ndash424 httpsdoiorg101111j1365-2486201002269x2011

Baldocchi D Measuring fluxes of trace gases and energy be-tween ecosystems and the atmospherendashthe state and future of the

eddy covariance method Glob Change Biol 20 3600ndash3609httpsdoiorg101111gcb12649 2014

Barron-Gafford G A Sanchez-Cantildeete E P Minor R LHendryx S M Lee E Sutter L F Tran N Parra EColella T Murphy P C Hamerlynck E P Kumar P andScott R L Impacts of hydraulic redistribution on grassndashtreecompetition vs facilitation in a semi-arid savanna New Phytol215 1451ndash1461 httpsdoiorg101111nph14693 2017

Belayneh A and Tessema Z K Mechanisms of bush encroach-ment and its inter-connection with rangeland degradation insemi-arid African ecosystems a review J Arid Land 9 299ndash312 httpsdoiorg101007s40333-016-0023-x 2017

Bland J M and Altman D G Statistical methods for assessingagreement between two methods of clinical measurement Int JNurs Stud 47 931ndash936 2010

Boettiger C and Eddelbuettel D An introduction to rockerDocker containers 639 for r ArXiv Prepr ArXiv171003675 6402017

Bonilla-Moheno M and Aide T M Beyond deforestationLand cover transitions in Mexico Agric Syst 178 102734httpsdoiorg101016jagsy2019102734 2020

Booker K Huntsinger L Bartolome J W Sayre N Fand Stewart W What can ecological science tell us aboutopportunities for carbon sequestration on arid rangelands inthe United States Glob Environ Change 23 240ndash251httpsdoiorg101016jgloenvcha201210001 2013

Butzer K W and Butzer E K The lsquonaturalrsquovegetation ofthe Mexican Bajiacuteo archival documentation of a 16th-century savanna environment Quat Int 43 161ndash172httpsdoiorg101016S1040-6182(97)00032-3 1997

CICESE C de I C y de E S de E Base de datos climatoloacutegicanacional (CLICOM) Villa Bernal Quereacutetaro available at httpclicom-mexcicesemx (last access 28 August 2019) 2015

Cohen W B Maiersperger T K Yang Z Gower S TTurner D P Ritts W D Berterretche M and Run-ning S W Comparisons of land cover and LAI esti-mates derived from ETM+ and MODIS for four sites inNorth America A quality assessment of 20002001 provi-sional MODIS products Remote Sens Environ 88 233ndash255httpsdoiorg101016jrse200306006 2003

Cui T Wang Y Sun R Qiao C Fan W Jiang GHao L and Zhang L Estimating vegetation primaryproduction in the Heihe River Basin of China withmulti-source and multi-scale data PloS One 11 1ndash20httpsdoiorg101371journalpone0153971 2016

Curtis J T and McIntosh R P The Interrelations of Certain An-alytic and Synthetic Phytosociological Characters Ecology 31434ndash455 httpsdoiorg1023071931497 1950

Delgado-Balbuena J Yepez E A Aacutengeles-Peacuterez G Aguirre-Gutierrez C Arredondo T Ayala-Nintildeo F Bullock S HCastellanos A E Cueva A Figueroa-Espinoza B Garatuza-Payaacuten J Hinojo-Hinojo C Maya-Delgado Y Meacutendez-Barroso L Oechel W Paz-Pellat F Peacuterez-Ruiacutez E R Ro-driacuteguez J C Rojas-Robles N Saacutenchez-Mejiacutea Z M Uuh-Sonda J Vargas R Verduzco V S Vivoni E R andWatts C Flujos anuales de carbono en ecosistemas terrestres deMeacutexico in Estado Actual del Conocimiento del Ciclo del Car-bono y sus Interacciones en Meacutexico Siacutentesis a 2018 p 686Texcoco Estado de Meacutexico Meacutexico 2018

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

390 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Eshel G Dayalu A Wofsy S C Munger J W andTziperman E Listening to the Forest An ArtificialNeural Network-Based Model of Carbon Uptake at Har-vard Forest J Geophys Res-Biogeo 124 461ndash478httpsdoiorg1010292018JG004791 2019

Fu C Wang G Bible K Goulden M L Saleska S RScott R L and Cardon Z G Hydraulic redistributionaffects modeled carbon cycling via soil microbial activityand suppressed fire Glob Change Biol 24 3472ndash3485httpsdoiorg101111gcb14164 2018

Gebremichael M and Barros A P Evaluation of MODISGross Primary Productivity (GPP) in tropical mon-soon regions Remote Sens Environ 100 150ndash166httpsdoiorg101016jrse200510009 2006

Gitelson A A Peng Y Arkebauer T J and Suyker A EProductivity absorbed photosynthetically active radiation andlight use efficiency in crops Implications for remote sensingof crop primary production J Plant Physiol 177 100ndash109httpsdoiorg101016jjplph201412015 2015

Glenn E P Scott R L Nguyen U and Nagler P L Wide-area ratios of evapotranspiration to precipitation in monsoon-dependent semiarid vegetation communities J Arid Envi-ron 117 84ndash95 httpsdoiorg101016jjaridenv2015020102015

Goldstein A Turner W R Spawn S A Anderson-Teixeira K J Cook-Patton S Fargione J Gibbs H KGriscom B Hewson J H and Howard J F Protecting ir-recoverable carbon in Earthrsquos ecosystems Nat Clim Change10 1ndash9 httpsdoiorg101038s41558-020-0738-8 2020

Goulden M L Munger J W Fan S Daube B Cand Wofsy S C Measurements of carbon sequestra-tion by long-term eddy covariance Methods and a criti-cal evaluation of accuracy Glob Change Biol 2 169ndash182httpsdoiorg101111j1365-24861996tb00070x 1996

Grossiord C Buckley T N Cernusak L A Novick K A Poul-ter B Siegwolf R T Sperry J S and McDowell N GPlant responses to rising vapor pressure deficit New Phytol226 1550ndash1566 2020

Guerschman J P Hill M J Renzullo L J Barrett D JMarks A S and Botha E J Estimating fractional coverof photosynthetic vegetation non-photosynthetic vegetation andbare soil in the Australian tropical savanna region upscaling theEO-1 Hyperion and MODIS sensors Remote Sens Environ113 928ndash945 httpsdoiorg101016jrse200901006 2009

Guevara-Escobar A Gonzaacutelez-Sosa E Cervantes-Jimeacutenez MSuzaacuten-Azpiri H Queijeiro-Bolantildeos M E Carrillo-AacutengelesI and Cambroacuten-Sandoval V H Machine learning estimates ofeddy covariance carbon flux in a scrub in the Mexican highlandZenodo httpsdoiorg105281zenodo3598595 2020

Hall P Gill N Kurka M Phan W and Bartz A MachineLearning Interpretability with H2O Driverless AI edited byBartz A H2Oai Inc California US 2019

Hastings S J Oechel W C and Muhlia-Melo A Diur-nal seasonal and annual variation in the net ecosystem CO2exchange of a desert shrub community (Sarcocaulescent) inBaja California Mexico Glob Change Biol 11 927ndash939httpsdoiorg101111j1365-2486200500951x 2005

Henrich V Krauss G Goumltze C and Sandow C Entwicklungeiner Datenbank fuumlr Fernerkundungsindizes vol 45 AK Fern-erkundung Bochum 2012

Hill M Held A Leuning R Coops N Hughes D andCleugh H MODIS spectral signals at a flux tower site Rela-tionships with high-resolution data and CO2 flux and light useefficiency measurements Remote Sens Environ 103 351ndash368httpsdoiorg101016jrse200506015 2006

Hill M J and Guerschman J P The MODIS Global VegetationFractional Cover Product 2001ndash2018 Characteristics of Vegeta-tion Fractional Cover in Grasslands and Savanna Woodlands Re-mote Sens 12 406 httpsdoiorg103390rs12030406 2020

Hinojo-Hinojo C Castellanos A E Huxman T Ro-driguez J C Vargas R Romo-Leoacuten J R and Bie-derman J A Native shrubland and managed buffelgrasssavanna in drylands Implications for ecosystem carbonand water fluxes Agr Forest Meteorol 268 269ndash278httpsdoiorg101016jagrformet201901030 2019

Jaumlrvi L Havu M Ward H C Bellucco V McFad-den J P Toivonen T Heikinheimo V Kolari P Ri-ikonen A and Grimmond C S B Spatial modeling oflocal-scale biogenic and anthropogenic carbon dioxide emis-sions in Helsinki J Geophys Res-Atmos 124 8363ndash8384httpsdoiorg1010292018JD029576 2019

Joiner J and Yoshida Y Satellite-based reflectances capture largefraction of variability in global gross primary production (GPP)at weekly time scales Agr Forest Meteorol 291 108092httpsdoiorg101016jagrformet2020108092 2020

Jung M Schwalm C Migliavacca M Walther S Camps-VallsG Koirala S Anthoni P Besnard S Bodesheim P Carval-hais N Chevallier F Gans F Goll D S Haverd V KoumlhlerP Ichii K Jain A K Liu J Lombardozzi D Nabel J EM S Nelson J A OrsquoSullivan M Pallandt M Papale DPeters W Pongratz J Roumldenbeck C Sitch S TramontanaG Walker A Weber U and Reichstein M Scaling carbonfluxes from eddy covariance sites to globe synthesis and eval-uation of the FLUXCOM approach Biogeosciences 17 1343ndash1365 httpsdoiorg105194bg-17-1343-2020 2020

Kang M Ichii K Kim J Indrawati Y M Park J Moon MLim J-H and Chun J-H New Gap-Filling Strategies forLong-Period Flux Data Gaps Using a Data-Driven ApproachAtmosphere 10 568 httpsdoiorg103390atmos101005682019

Lal R Griffin M Apt J Lave L and Morgan M GManaging Soil Carbon Science 304 393ndash393httpsdoiorg101126science1093079 2004

LI-COR B Eddypro Software instruction manual Nebraska USversion 70 2019

Liu Y Liu R Pisek J and Chen J M Separating overstoryand understory leaf area indices for global needleleaf and de-ciduous broadleaf forests by fusion of MODIS and MISR dataBiogeosciences 14 1093ndash1110 httpsdoiorg105194bg-14-1093-2017 2017

Long S P Marshall-Colon A and Zhu X-G Meetingthe Global Food Demand of the Future by EngineeringCrop Photosynthesis and Yield Potential Cell 161 56ndash66httpsdoiorg101016jcell201503019 2015

Ma X Huete A Yu Q Restrepo-Coupe N Beringer J Hut-ley L B Kanniah K D Cleverly J and Eamus D Param-

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 391

eterization of an ecosystem light-use-efficiency model for pre-dicting savanna GPP using MODIS EVI Remote Sens Environ154 253ndash271 2014

Ma X Migliavacca M Wirth C Bohn F J Huth ARichter R and Mahecha M D Monitoring Plant FunctionalDiversity Using the Reflectance and Echo from Space RemoteSens 12 1248 httpsdoiorg103390rs12081248 2020

Marcolla B Roumldenbeck C and Cescatti A Patterns and con-trols of inter-annual variability in the terrestrial carbon budgetBiogeosciences 14 3815ndash3829 httpsdoiorg105194bg-14-3815-2017 2017

Mastachi-Loza C A Gonzaacutelez-Sosa E Becerril-Pintildea R andBraud I Peacuterdidas por intercepcioacuten en mezquite (Prosopis laevi-gata) y huizache (Acacia farnesiana) de la regioacuten semiaacuterida delcentro de Meacutexico Tecnol Cienc Agua 1 103ndash120 2010

Mauder M and Foken T Documentation and instruction man-ual of the eddy-covariance software package TK3 UniversitaumltBayreuth Abteilung Mikrometeorologie Bayreuth Germany2011

McClaran M A century of vegetation change on the Santa Rita Ex-perimental Range in Santa Rita Experimental Range 100 years(1903 to 2003) to accomplishments and contributions edited byMcClaran M P Ffolliott P F and Edminister C B Tech Co-ords USDA Forest service proceedings RMRS-P-30 TucsonAZ US 16ndash33 2003

McCulley R Jobbagy E Pockman W and Jackson R Nu-trient uptake as a contributing explanation for deep rootingin arid and semi-arid ecosystems Oecologia 141 620ndash628httpsdoiorg101007s00442-004-1687-z 2004

McElreath R Statistical Rethinking A Bayesian Course with Ex-amples in R and STAN CRC Press Boca Raton FL US 2020

Michailidis M How Driverless AI Prevents Overfitting and Leak-age Open Source Lead AI ML available at httpswwwh2oaiblogdriverless-ai-prevents-overfitting-leakage (last access7 July 2020) 2018

Mitchell R and Frank E Accelerating the XGBoost algo-rithm using GPU computing PeerJ Comput Sci 3 e127httpsdoiorg107717peerj-cs127 2017

Monteith J Solar radiation and productivity in tropical ecosys-tems J Appl Ecol 9 747ndash766 1972

Murray-Tortarolo G Friedlingstein P Sitch S Jaramillo V JMurguiacutea-Flores F Anav A Liu Y Arneth A Arvanitis AHarper A Jain A Kato E Koven C Poulter B Stocker BD Wiltshire A Zaehle S and Zeng N The carbon cycle inMexico past present and future of C stocks and fluxes Biogeo-sciences 13 223ndash238 httpsdoiorg105194bg-13-223-20162016

Nash J E and Sutcliffe J V River flow forecasting through con-ceptual models part I ndash A discussion of principles J Hydrol 10282ndash290 1970

Owen N A Choncubhair Oacute N Males J del Real Laborde J IRubio-Corteacutes R Griffiths H and Lanigan G Eddy covari-ance captures four-phase crassulacean acid metabolism (CAM)gas exchange signature in Agave Plant Cell Environ 39 295ndash309 httpsdoiorg101111pce12610 2016

Pasetto D Arenas-Castro S Bustamante J Casagrandi RChrysoulakis N Cord A F Dittrich A Domingo-Marimon C El Serafy G and Karnieli A Integration ofsatellite remote sensing data in ecosystem modelling at local

scales Practices and trends Methods Ecol Evol 9 1810ndash1821httpsdoiorg1011112041-210X13018 2018

Pastorello G Trotta C and Canfora E The FLUXNET2015dataset and the ONEFlux processing pipeline for eddy covari-ance data Sci Data 7 225 pp httpsdoiorg101038s41597-020-0534-3 2020

Pearcy R W Heterogeneous light environments in Handbookof Functional Plant Ecology edited by Pugnaire I F and Val-ladares F CRC Press Boca Ratoacuten FL US 270ndash314 2007

Piao S Liu Q Chen A Janssens I A Fu Y Dai J Liu LLian X Shen M and Zhu X Plant phenology and global cli-mate change Current progresses and challenges Glob ChangeBiol 25 1922ndash1940 httpsdoiorg101111gcb14619 2019

R Development Core Team R A language and environment forstatistical computing R Foundation for Statistical Computingavailable at httpswwwr-projectorg (last access 15 October2015) 2009

Reichstein M Falge E Baldocchi D Papale D Aubinet MBerbigier P Bernhofer C Buchmann N Gilmanov Tand Granier A On the separation of net ecosystem ex-change into assimilation and ecosystem respiration reviewand improved algorithm Glob Change Biol 11 1424ndash1439httpsdoiorg101111j1365-24862005001002x 2005

Richardson A D and Hollinger D Y Statistical model-ing of ecosystem respiration using eddy covariance datamaximum likelihood parameter estimation and Monte Carlosimulation of model and parameter uncertainty applied tothree simple models Agr Forest Meteorol 131 191ndash208httpsdoiorg101016jagrformet200505008 2005

Richardson A D Hollinger D Y Shoemaker J K Hughes HSavage K and Davidson E A Six years of ecosystem-atmosphere greenhouse gas fluxes measured in a sub-borealforest Sci Data 6 1ndash15 httpsdoiorg101038s41597-019-0119-1 2019

Roby M C Scott R L Barron-Gafford G A Hamerlynck E Pand Moore D J P Environmental and Vegetative Controls onSoil CO2 Efflux in Three Semiarid Ecosystems Soil Syst 3 6httpsdoiorg103390soilsystems3010006 2019

Running S W and Zhao M Daily GPP and annual NPP(MOD17A2A3) products NASA Earth Observing SystemMODIS land algorithm Univ of Mont Missoula Mont US2015

Running S W Nemani R R Heinsch F A Zhao MReeves M and Hashimoto H A continuous satellite-derived measure of global terrestrial primary productionBioscience 54 547ndash560 httpsdoiorg1016410006-3568(2004)054[0547ACSMOG]20CO2 2004

Rzedowski J Vegetacioacuten de Meacutexico Meacutexico Editor Limusa1978

Scott R L and Biederman J A Critical Zone Water Balance Over13 Years in a Semiarid Savanna Water Resour Res 55 574ndash588 httpsdoiorg1010292018WR023477 2019

Scott R L Watts C Payan J G Edwards E Goodrich D CWilliams D and James Shuttleworth W The understory andoverstory partitioning of energy and water fluxes in an opencanopy semiarid woodland Agr Forest Meteorol 114 127ndash139 httpsdoiorg101016S0168-1923(02)00197-1 2003

Scott R L Edwards E A Shuttleworth W J Huxman T EWatts C and Goodrich D C Interannual and seasonal vari-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

392 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

ation in fluxes of water and carbon dioxide from a ripar-ian woodland ecosystem Agr Forest Meteorol 122 65ndash84httpsdoiorg101016jagrformet200309001 2004

Scott R L Biederman J A Hamerlynck E P and Barron-Gafford G A The carbon balance pivot point of southwest-ern US semiarid ecosystems Insights from the 21st centurydrought J Geophys Res-Biogeo 120 2612ndash2624 2015

Segerstrom K Geology of the Bernal-Jalpan Area Estado deQueretaro Mexico Geol Surv Bull 1104 19ndash82 1961

Soper F M McCalley C K Sparks K and Sparks J P Soilcarbon dioxide emissions from the Mojave desert Isotopic evi-dence for a carbonate source Geophys Res Lett 44 245ndash251httpsdoiorg1010022016GL071198 2017

Stevens N T Steiner S H and MacKay R J Assessing agree-ment between two measurement systems An alternative to thelimits of agreement approach Stat Methods Med Res 262487ndash2504 httpsdoiorg1011770962280215601133 2015

Tagesson T Ardouml J Cappelaere B Kergoat L AbdiA Horion S and Fensholt R Modelling spatial andtemporal dynamics of gross primary production in theSahel from earth-observation-based photosynthetic capacityand quantum efficiency Biogeosciences 14 1333ndash1348httpsdoiorg105194bg-14-1333-2017 2017

Thornton P Thornton M Mayer B Wei Y Devarakonda RVose R and Cook R Daymet Daily Surface Weather Data ona 1-km Grid for North America Version 3 Oak Ridge NationalLab (ORNL) Oak Ridge TN US 2017

Tramontana G Jung M Schwalm C R Ichii K Camps-VallsG Raacuteduly B Reichstein M Arain M A Cescatti A KielyG Merbold L Serrano-Ortiz P Sickert S Wolf S andPapale D Predicting carbon dioxide and energy fluxes acrossglobal FLUXNET sites with regression algorithms Biogeo-sciences 13 4291ndash4313 httpsdoiorg105194bg-13-4291-2016 2016rdquo

Truong A Walters A Goodsitt J Hines K Bruss C B andFarivar R Towards automated machine learning Evaluationand comparison of AutoML approaches and tools in Proceed-ings of the 2019 IEEE 31st International Conference on Toolswith Artificial Intelligence (ICTAI) Portland OR US 1471ndash1479 2019

Verbist K Santibantildeez F Gabriels D and Soto G Atlas dezonas aacuteridas de Ameacuterica Latina y El Caribe Donoso MC(Coord) UNESCO-PHI and CAZALAC Montevideo Uruguay2010

Wang H Li X Ma M and Geng L Improving esti-mation of gross primary production in dryland ecosystemsby a model-data fusion approach Remote Sens 11 225httpsdoiorg103390rs11030225 2019

Wang X Ma M Li X Song Y Tan J Huang G Zhang ZZhao T Feng J and Ma Z Validation of MODIS-GPP prod-uct at 10 flux sites in northern China Int J Remote Sens 34587ndash599 httpsdoiorg101080014311612012715774 2013

Wang X Wang H Li X and Ran Y Photosynthesis (NPPNEP Respiration) Obs Meas Ecohydrol Process 1ndash30httpsdoiorg101007978-3-662-47871-4_10-1 2018

Wheeler C W Archer S R Asner G P and McMurtry C RClimaticedaphic controls on soil carbonnitrogen response toshrub encroachment in desert grassland Ecol Appl 17 1911ndash1928 httpsdoiorg10189006-15801 2007

Wilcox B P Birt A Fuhlendorf S D and Archer S R Emerg-ing frameworks for understanding and mitigating woody plantencroachment in grassy biomes Curr Opin Env Sust 32 46ndash52 httpsdoiorg101016jcosust201804005 2018

Wilson J B Cover plus ways of measuring plant canopiesand the terms used for them J Veg Sci 22 197ndash206httpsdoiorg101111j1654-1103201001238x 2011

Wofsy S Goulden M Munger J Fan S-M Bakwin PDaube B Bassow S and Bazzaz F Net exchange ofCO2 in a mid-latitude forest Science 260 1314ndash1317httpsdoiorg101126science26051121314 1993

Wu C Niu Z and Gao S Gross primary production estimationfrom MODIS data with vegetation index and photosyntheticallyactive radiation in maize J Geophys Res-Atmos 115 D12127httpsdoiorg1010292009JD013023 2010

Wutzler T Lucas-Moffat A Migliavacca M Knauer JSickel K Šigut L Menzer O and Reichstein MBasic and extensible post-processing of eddy covarianceflux data with REddyProc Biogeosciences 15 5015ndash5030httpsdoiorg105194bg-15-5015-2018 2018

Xiao H McDonald-Madden E Sabbadin R Peyrard NDee L E and Chadegraves I The value of understanding feed-backs from ecosystem functions to species for managing ecosys-tems Nat Commun 10 1ndash10 httpsdoiorg101038s41467-019-11890-7 2019

Xu B Yang Y Li P Shen H and Fang J Globalpatterns of ecosystem carbon flux in forests A biometricdata-based synthesis Global Biogeochem Cy 28 962ndash973httpsdoiorg1010022013GB004593 2014

Yao J Liu H Huang J Gao Z Wang G Li D Yu H andChen X Accelerated dryland expansion regulates future vari-ability in dryland gross primary production Nat Commun 111ndash10 httpsdoiorg101038s41467-020-15515-2 2020

Yepez E A Williams D G Scott R L and Lin GPartitioning overstory and understory evapotranspiration ina semiarid savanna woodland from the isotopic composi-tion of water vapor Agr Forest Meteorol 119 53ndash68httpsdoiorg101016S0168-1923(03)00116-3 2003

Yona L Cashore B Jackson R B Ometto J and Brad-ford M A Refining national greenhouse gas inventories Am-bio 1ndash6 httpsdoiorg101007s13280-019-01312-9 2020

Zeng X Zeng X and Barlage M Growing temperate shrubsover arid and semiarid regions in the Community Land ModelndashDynamic Global Vegetation Model Global Biogeochem Cy 22GB3003 httpsdoiorg1010292007GB003014 2008

Zhang A Jia G Epstein H E and Xia J ENSO elicits oppos-ing responses of semi-arid vegetation between Hemispheres SciRep 7 42281 httpsdoiorg101038srep42281 2017

Zhang L Xiao J Zheng Y Li S and Zhou Y Increased carbonuptake and water use efficiency in global semi-arid ecosystemsEnviron Res Lett 15 034022 httpsdoiorg1010881748-9326ab68ec 2020

Zhu X-G Long S P and Ort D R What is the max-imum efficiency with which photosynthesis can convert so-lar energy into biomass Curr Opin Biotech 19 153ndash159httpsdoiorg101016jcopbio200802004 2008

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

  • Abstract
  • Introduction
  • Materials and methods
    • Site description
    • Eddy covariance measurements
    • Flux data processing
    • Remotely sensed data
    • MODIS algorithm for GPP
    • Santa Rita site dataset
    • Modeling
    • Variable importance
    • Model agreement
      • Results
        • Eddy covariance fluxes at Bernal
        • Machine learning ensembles as predictors of eddy covariance GPP
        • MODIS as predictor of eddy covariance GPP
        • Prediction eddy covariance GPP with ordinary-least-squares multiple regression
        • Agreement
          • Discussion
            • Model agreement
            • MODIS discrepancies
            • Forcing variables and machine learning
            • Carbon flux
            • Ecosystem management
              • Conclusions
              • Appendix A
              • Appendix B
              • Appendix C
              • Code and data availability
              • Author contributions
              • Competing interests
              • Acknowledgement
              • Financial support
              • Review statement
              • References
Page 24: Machine learning estimates of eddy covariance carbon flux ... · Scrub is a xeric category of shrublands characterized by plants with small leaves; it is very thorny, and its biomass

390 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

Eshel G Dayalu A Wofsy S C Munger J W andTziperman E Listening to the Forest An ArtificialNeural Network-Based Model of Carbon Uptake at Har-vard Forest J Geophys Res-Biogeo 124 461ndash478httpsdoiorg1010292018JG004791 2019

Fu C Wang G Bible K Goulden M L Saleska S RScott R L and Cardon Z G Hydraulic redistributionaffects modeled carbon cycling via soil microbial activityand suppressed fire Glob Change Biol 24 3472ndash3485httpsdoiorg101111gcb14164 2018

Gebremichael M and Barros A P Evaluation of MODISGross Primary Productivity (GPP) in tropical mon-soon regions Remote Sens Environ 100 150ndash166httpsdoiorg101016jrse200510009 2006

Gitelson A A Peng Y Arkebauer T J and Suyker A EProductivity absorbed photosynthetically active radiation andlight use efficiency in crops Implications for remote sensingof crop primary production J Plant Physiol 177 100ndash109httpsdoiorg101016jjplph201412015 2015

Glenn E P Scott R L Nguyen U and Nagler P L Wide-area ratios of evapotranspiration to precipitation in monsoon-dependent semiarid vegetation communities J Arid Envi-ron 117 84ndash95 httpsdoiorg101016jjaridenv2015020102015

Goldstein A Turner W R Spawn S A Anderson-Teixeira K J Cook-Patton S Fargione J Gibbs H KGriscom B Hewson J H and Howard J F Protecting ir-recoverable carbon in Earthrsquos ecosystems Nat Clim Change10 1ndash9 httpsdoiorg101038s41558-020-0738-8 2020

Goulden M L Munger J W Fan S Daube B Cand Wofsy S C Measurements of carbon sequestra-tion by long-term eddy covariance Methods and a criti-cal evaluation of accuracy Glob Change Biol 2 169ndash182httpsdoiorg101111j1365-24861996tb00070x 1996

Grossiord C Buckley T N Cernusak L A Novick K A Poul-ter B Siegwolf R T Sperry J S and McDowell N GPlant responses to rising vapor pressure deficit New Phytol226 1550ndash1566 2020

Guerschman J P Hill M J Renzullo L J Barrett D JMarks A S and Botha E J Estimating fractional coverof photosynthetic vegetation non-photosynthetic vegetation andbare soil in the Australian tropical savanna region upscaling theEO-1 Hyperion and MODIS sensors Remote Sens Environ113 928ndash945 httpsdoiorg101016jrse200901006 2009

Guevara-Escobar A Gonzaacutelez-Sosa E Cervantes-Jimeacutenez MSuzaacuten-Azpiri H Queijeiro-Bolantildeos M E Carrillo-AacutengelesI and Cambroacuten-Sandoval V H Machine learning estimates ofeddy covariance carbon flux in a scrub in the Mexican highlandZenodo httpsdoiorg105281zenodo3598595 2020

Hall P Gill N Kurka M Phan W and Bartz A MachineLearning Interpretability with H2O Driverless AI edited byBartz A H2Oai Inc California US 2019

Hastings S J Oechel W C and Muhlia-Melo A Diur-nal seasonal and annual variation in the net ecosystem CO2exchange of a desert shrub community (Sarcocaulescent) inBaja California Mexico Glob Change Biol 11 927ndash939httpsdoiorg101111j1365-2486200500951x 2005

Henrich V Krauss G Goumltze C and Sandow C Entwicklungeiner Datenbank fuumlr Fernerkundungsindizes vol 45 AK Fern-erkundung Bochum 2012

Hill M Held A Leuning R Coops N Hughes D andCleugh H MODIS spectral signals at a flux tower site Rela-tionships with high-resolution data and CO2 flux and light useefficiency measurements Remote Sens Environ 103 351ndash368httpsdoiorg101016jrse200506015 2006

Hill M J and Guerschman J P The MODIS Global VegetationFractional Cover Product 2001ndash2018 Characteristics of Vegeta-tion Fractional Cover in Grasslands and Savanna Woodlands Re-mote Sens 12 406 httpsdoiorg103390rs12030406 2020

Hinojo-Hinojo C Castellanos A E Huxman T Ro-driguez J C Vargas R Romo-Leoacuten J R and Bie-derman J A Native shrubland and managed buffelgrasssavanna in drylands Implications for ecosystem carbonand water fluxes Agr Forest Meteorol 268 269ndash278httpsdoiorg101016jagrformet201901030 2019

Jaumlrvi L Havu M Ward H C Bellucco V McFad-den J P Toivonen T Heikinheimo V Kolari P Ri-ikonen A and Grimmond C S B Spatial modeling oflocal-scale biogenic and anthropogenic carbon dioxide emis-sions in Helsinki J Geophys Res-Atmos 124 8363ndash8384httpsdoiorg1010292018JD029576 2019

Joiner J and Yoshida Y Satellite-based reflectances capture largefraction of variability in global gross primary production (GPP)at weekly time scales Agr Forest Meteorol 291 108092httpsdoiorg101016jagrformet2020108092 2020

Jung M Schwalm C Migliavacca M Walther S Camps-VallsG Koirala S Anthoni P Besnard S Bodesheim P Carval-hais N Chevallier F Gans F Goll D S Haverd V KoumlhlerP Ichii K Jain A K Liu J Lombardozzi D Nabel J EM S Nelson J A OrsquoSullivan M Pallandt M Papale DPeters W Pongratz J Roumldenbeck C Sitch S TramontanaG Walker A Weber U and Reichstein M Scaling carbonfluxes from eddy covariance sites to globe synthesis and eval-uation of the FLUXCOM approach Biogeosciences 17 1343ndash1365 httpsdoiorg105194bg-17-1343-2020 2020

Kang M Ichii K Kim J Indrawati Y M Park J Moon MLim J-H and Chun J-H New Gap-Filling Strategies forLong-Period Flux Data Gaps Using a Data-Driven ApproachAtmosphere 10 568 httpsdoiorg103390atmos101005682019

Lal R Griffin M Apt J Lave L and Morgan M GManaging Soil Carbon Science 304 393ndash393httpsdoiorg101126science1093079 2004

LI-COR B Eddypro Software instruction manual Nebraska USversion 70 2019

Liu Y Liu R Pisek J and Chen J M Separating overstoryand understory leaf area indices for global needleleaf and de-ciduous broadleaf forests by fusion of MODIS and MISR dataBiogeosciences 14 1093ndash1110 httpsdoiorg105194bg-14-1093-2017 2017

Long S P Marshall-Colon A and Zhu X-G Meetingthe Global Food Demand of the Future by EngineeringCrop Photosynthesis and Yield Potential Cell 161 56ndash66httpsdoiorg101016jcell201503019 2015

Ma X Huete A Yu Q Restrepo-Coupe N Beringer J Hut-ley L B Kanniah K D Cleverly J and Eamus D Param-

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 391

eterization of an ecosystem light-use-efficiency model for pre-dicting savanna GPP using MODIS EVI Remote Sens Environ154 253ndash271 2014

Ma X Migliavacca M Wirth C Bohn F J Huth ARichter R and Mahecha M D Monitoring Plant FunctionalDiversity Using the Reflectance and Echo from Space RemoteSens 12 1248 httpsdoiorg103390rs12081248 2020

Marcolla B Roumldenbeck C and Cescatti A Patterns and con-trols of inter-annual variability in the terrestrial carbon budgetBiogeosciences 14 3815ndash3829 httpsdoiorg105194bg-14-3815-2017 2017

Mastachi-Loza C A Gonzaacutelez-Sosa E Becerril-Pintildea R andBraud I Peacuterdidas por intercepcioacuten en mezquite (Prosopis laevi-gata) y huizache (Acacia farnesiana) de la regioacuten semiaacuterida delcentro de Meacutexico Tecnol Cienc Agua 1 103ndash120 2010

Mauder M and Foken T Documentation and instruction man-ual of the eddy-covariance software package TK3 UniversitaumltBayreuth Abteilung Mikrometeorologie Bayreuth Germany2011

McClaran M A century of vegetation change on the Santa Rita Ex-perimental Range in Santa Rita Experimental Range 100 years(1903 to 2003) to accomplishments and contributions edited byMcClaran M P Ffolliott P F and Edminister C B Tech Co-ords USDA Forest service proceedings RMRS-P-30 TucsonAZ US 16ndash33 2003

McCulley R Jobbagy E Pockman W and Jackson R Nu-trient uptake as a contributing explanation for deep rootingin arid and semi-arid ecosystems Oecologia 141 620ndash628httpsdoiorg101007s00442-004-1687-z 2004

McElreath R Statistical Rethinking A Bayesian Course with Ex-amples in R and STAN CRC Press Boca Raton FL US 2020

Michailidis M How Driverless AI Prevents Overfitting and Leak-age Open Source Lead AI ML available at httpswwwh2oaiblogdriverless-ai-prevents-overfitting-leakage (last access7 July 2020) 2018

Mitchell R and Frank E Accelerating the XGBoost algo-rithm using GPU computing PeerJ Comput Sci 3 e127httpsdoiorg107717peerj-cs127 2017

Monteith J Solar radiation and productivity in tropical ecosys-tems J Appl Ecol 9 747ndash766 1972

Murray-Tortarolo G Friedlingstein P Sitch S Jaramillo V JMurguiacutea-Flores F Anav A Liu Y Arneth A Arvanitis AHarper A Jain A Kato E Koven C Poulter B Stocker BD Wiltshire A Zaehle S and Zeng N The carbon cycle inMexico past present and future of C stocks and fluxes Biogeo-sciences 13 223ndash238 httpsdoiorg105194bg-13-223-20162016

Nash J E and Sutcliffe J V River flow forecasting through con-ceptual models part I ndash A discussion of principles J Hydrol 10282ndash290 1970

Owen N A Choncubhair Oacute N Males J del Real Laborde J IRubio-Corteacutes R Griffiths H and Lanigan G Eddy covari-ance captures four-phase crassulacean acid metabolism (CAM)gas exchange signature in Agave Plant Cell Environ 39 295ndash309 httpsdoiorg101111pce12610 2016

Pasetto D Arenas-Castro S Bustamante J Casagrandi RChrysoulakis N Cord A F Dittrich A Domingo-Marimon C El Serafy G and Karnieli A Integration ofsatellite remote sensing data in ecosystem modelling at local

scales Practices and trends Methods Ecol Evol 9 1810ndash1821httpsdoiorg1011112041-210X13018 2018

Pastorello G Trotta C and Canfora E The FLUXNET2015dataset and the ONEFlux processing pipeline for eddy covari-ance data Sci Data 7 225 pp httpsdoiorg101038s41597-020-0534-3 2020

Pearcy R W Heterogeneous light environments in Handbookof Functional Plant Ecology edited by Pugnaire I F and Val-ladares F CRC Press Boca Ratoacuten FL US 270ndash314 2007

Piao S Liu Q Chen A Janssens I A Fu Y Dai J Liu LLian X Shen M and Zhu X Plant phenology and global cli-mate change Current progresses and challenges Glob ChangeBiol 25 1922ndash1940 httpsdoiorg101111gcb14619 2019

R Development Core Team R A language and environment forstatistical computing R Foundation for Statistical Computingavailable at httpswwwr-projectorg (last access 15 October2015) 2009

Reichstein M Falge E Baldocchi D Papale D Aubinet MBerbigier P Bernhofer C Buchmann N Gilmanov Tand Granier A On the separation of net ecosystem ex-change into assimilation and ecosystem respiration reviewand improved algorithm Glob Change Biol 11 1424ndash1439httpsdoiorg101111j1365-24862005001002x 2005

Richardson A D and Hollinger D Y Statistical model-ing of ecosystem respiration using eddy covariance datamaximum likelihood parameter estimation and Monte Carlosimulation of model and parameter uncertainty applied tothree simple models Agr Forest Meteorol 131 191ndash208httpsdoiorg101016jagrformet200505008 2005

Richardson A D Hollinger D Y Shoemaker J K Hughes HSavage K and Davidson E A Six years of ecosystem-atmosphere greenhouse gas fluxes measured in a sub-borealforest Sci Data 6 1ndash15 httpsdoiorg101038s41597-019-0119-1 2019

Roby M C Scott R L Barron-Gafford G A Hamerlynck E Pand Moore D J P Environmental and Vegetative Controls onSoil CO2 Efflux in Three Semiarid Ecosystems Soil Syst 3 6httpsdoiorg103390soilsystems3010006 2019

Running S W and Zhao M Daily GPP and annual NPP(MOD17A2A3) products NASA Earth Observing SystemMODIS land algorithm Univ of Mont Missoula Mont US2015

Running S W Nemani R R Heinsch F A Zhao MReeves M and Hashimoto H A continuous satellite-derived measure of global terrestrial primary productionBioscience 54 547ndash560 httpsdoiorg1016410006-3568(2004)054[0547ACSMOG]20CO2 2004

Rzedowski J Vegetacioacuten de Meacutexico Meacutexico Editor Limusa1978

Scott R L and Biederman J A Critical Zone Water Balance Over13 Years in a Semiarid Savanna Water Resour Res 55 574ndash588 httpsdoiorg1010292018WR023477 2019

Scott R L Watts C Payan J G Edwards E Goodrich D CWilliams D and James Shuttleworth W The understory andoverstory partitioning of energy and water fluxes in an opencanopy semiarid woodland Agr Forest Meteorol 114 127ndash139 httpsdoiorg101016S0168-1923(02)00197-1 2003

Scott R L Edwards E A Shuttleworth W J Huxman T EWatts C and Goodrich D C Interannual and seasonal vari-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

392 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

ation in fluxes of water and carbon dioxide from a ripar-ian woodland ecosystem Agr Forest Meteorol 122 65ndash84httpsdoiorg101016jagrformet200309001 2004

Scott R L Biederman J A Hamerlynck E P and Barron-Gafford G A The carbon balance pivot point of southwest-ern US semiarid ecosystems Insights from the 21st centurydrought J Geophys Res-Biogeo 120 2612ndash2624 2015

Segerstrom K Geology of the Bernal-Jalpan Area Estado deQueretaro Mexico Geol Surv Bull 1104 19ndash82 1961

Soper F M McCalley C K Sparks K and Sparks J P Soilcarbon dioxide emissions from the Mojave desert Isotopic evi-dence for a carbonate source Geophys Res Lett 44 245ndash251httpsdoiorg1010022016GL071198 2017

Stevens N T Steiner S H and MacKay R J Assessing agree-ment between two measurement systems An alternative to thelimits of agreement approach Stat Methods Med Res 262487ndash2504 httpsdoiorg1011770962280215601133 2015

Tagesson T Ardouml J Cappelaere B Kergoat L AbdiA Horion S and Fensholt R Modelling spatial andtemporal dynamics of gross primary production in theSahel from earth-observation-based photosynthetic capacityand quantum efficiency Biogeosciences 14 1333ndash1348httpsdoiorg105194bg-14-1333-2017 2017

Thornton P Thornton M Mayer B Wei Y Devarakonda RVose R and Cook R Daymet Daily Surface Weather Data ona 1-km Grid for North America Version 3 Oak Ridge NationalLab (ORNL) Oak Ridge TN US 2017

Tramontana G Jung M Schwalm C R Ichii K Camps-VallsG Raacuteduly B Reichstein M Arain M A Cescatti A KielyG Merbold L Serrano-Ortiz P Sickert S Wolf S andPapale D Predicting carbon dioxide and energy fluxes acrossglobal FLUXNET sites with regression algorithms Biogeo-sciences 13 4291ndash4313 httpsdoiorg105194bg-13-4291-2016 2016rdquo

Truong A Walters A Goodsitt J Hines K Bruss C B andFarivar R Towards automated machine learning Evaluationand comparison of AutoML approaches and tools in Proceed-ings of the 2019 IEEE 31st International Conference on Toolswith Artificial Intelligence (ICTAI) Portland OR US 1471ndash1479 2019

Verbist K Santibantildeez F Gabriels D and Soto G Atlas dezonas aacuteridas de Ameacuterica Latina y El Caribe Donoso MC(Coord) UNESCO-PHI and CAZALAC Montevideo Uruguay2010

Wang H Li X Ma M and Geng L Improving esti-mation of gross primary production in dryland ecosystemsby a model-data fusion approach Remote Sens 11 225httpsdoiorg103390rs11030225 2019

Wang X Ma M Li X Song Y Tan J Huang G Zhang ZZhao T Feng J and Ma Z Validation of MODIS-GPP prod-uct at 10 flux sites in northern China Int J Remote Sens 34587ndash599 httpsdoiorg101080014311612012715774 2013

Wang X Wang H Li X and Ran Y Photosynthesis (NPPNEP Respiration) Obs Meas Ecohydrol Process 1ndash30httpsdoiorg101007978-3-662-47871-4_10-1 2018

Wheeler C W Archer S R Asner G P and McMurtry C RClimaticedaphic controls on soil carbonnitrogen response toshrub encroachment in desert grassland Ecol Appl 17 1911ndash1928 httpsdoiorg10189006-15801 2007

Wilcox B P Birt A Fuhlendorf S D and Archer S R Emerg-ing frameworks for understanding and mitigating woody plantencroachment in grassy biomes Curr Opin Env Sust 32 46ndash52 httpsdoiorg101016jcosust201804005 2018

Wilson J B Cover plus ways of measuring plant canopiesand the terms used for them J Veg Sci 22 197ndash206httpsdoiorg101111j1654-1103201001238x 2011

Wofsy S Goulden M Munger J Fan S-M Bakwin PDaube B Bassow S and Bazzaz F Net exchange ofCO2 in a mid-latitude forest Science 260 1314ndash1317httpsdoiorg101126science26051121314 1993

Wu C Niu Z and Gao S Gross primary production estimationfrom MODIS data with vegetation index and photosyntheticallyactive radiation in maize J Geophys Res-Atmos 115 D12127httpsdoiorg1010292009JD013023 2010

Wutzler T Lucas-Moffat A Migliavacca M Knauer JSickel K Šigut L Menzer O and Reichstein MBasic and extensible post-processing of eddy covarianceflux data with REddyProc Biogeosciences 15 5015ndash5030httpsdoiorg105194bg-15-5015-2018 2018

Xiao H McDonald-Madden E Sabbadin R Peyrard NDee L E and Chadegraves I The value of understanding feed-backs from ecosystem functions to species for managing ecosys-tems Nat Commun 10 1ndash10 httpsdoiorg101038s41467-019-11890-7 2019

Xu B Yang Y Li P Shen H and Fang J Globalpatterns of ecosystem carbon flux in forests A biometricdata-based synthesis Global Biogeochem Cy 28 962ndash973httpsdoiorg1010022013GB004593 2014

Yao J Liu H Huang J Gao Z Wang G Li D Yu H andChen X Accelerated dryland expansion regulates future vari-ability in dryland gross primary production Nat Commun 111ndash10 httpsdoiorg101038s41467-020-15515-2 2020

Yepez E A Williams D G Scott R L and Lin GPartitioning overstory and understory evapotranspiration ina semiarid savanna woodland from the isotopic composi-tion of water vapor Agr Forest Meteorol 119 53ndash68httpsdoiorg101016S0168-1923(03)00116-3 2003

Yona L Cashore B Jackson R B Ometto J and Brad-ford M A Refining national greenhouse gas inventories Am-bio 1ndash6 httpsdoiorg101007s13280-019-01312-9 2020

Zeng X Zeng X and Barlage M Growing temperate shrubsover arid and semiarid regions in the Community Land ModelndashDynamic Global Vegetation Model Global Biogeochem Cy 22GB3003 httpsdoiorg1010292007GB003014 2008

Zhang A Jia G Epstein H E and Xia J ENSO elicits oppos-ing responses of semi-arid vegetation between Hemispheres SciRep 7 42281 httpsdoiorg101038srep42281 2017

Zhang L Xiao J Zheng Y Li S and Zhou Y Increased carbonuptake and water use efficiency in global semi-arid ecosystemsEnviron Res Lett 15 034022 httpsdoiorg1010881748-9326ab68ec 2020

Zhu X-G Long S P and Ort D R What is the max-imum efficiency with which photosynthesis can convert so-lar energy into biomass Curr Opin Biotech 19 153ndash159httpsdoiorg101016jcopbio200802004 2008

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

  • Abstract
  • Introduction
  • Materials and methods
    • Site description
    • Eddy covariance measurements
    • Flux data processing
    • Remotely sensed data
    • MODIS algorithm for GPP
    • Santa Rita site dataset
    • Modeling
    • Variable importance
    • Model agreement
      • Results
        • Eddy covariance fluxes at Bernal
        • Machine learning ensembles as predictors of eddy covariance GPP
        • MODIS as predictor of eddy covariance GPP
        • Prediction eddy covariance GPP with ordinary-least-squares multiple regression
        • Agreement
          • Discussion
            • Model agreement
            • MODIS discrepancies
            • Forcing variables and machine learning
            • Carbon flux
            • Ecosystem management
              • Conclusions
              • Appendix A
              • Appendix B
              • Appendix C
              • Code and data availability
              • Author contributions
              • Competing interests
              • Acknowledgement
              • Financial support
              • Review statement
              • References
Page 25: Machine learning estimates of eddy covariance carbon flux ... · Scrub is a xeric category of shrublands characterized by plants with small leaves; it is very thorny, and its biomass

A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux 391

eterization of an ecosystem light-use-efficiency model for pre-dicting savanna GPP using MODIS EVI Remote Sens Environ154 253ndash271 2014

Ma X Migliavacca M Wirth C Bohn F J Huth ARichter R and Mahecha M D Monitoring Plant FunctionalDiversity Using the Reflectance and Echo from Space RemoteSens 12 1248 httpsdoiorg103390rs12081248 2020

Marcolla B Roumldenbeck C and Cescatti A Patterns and con-trols of inter-annual variability in the terrestrial carbon budgetBiogeosciences 14 3815ndash3829 httpsdoiorg105194bg-14-3815-2017 2017

Mastachi-Loza C A Gonzaacutelez-Sosa E Becerril-Pintildea R andBraud I Peacuterdidas por intercepcioacuten en mezquite (Prosopis laevi-gata) y huizache (Acacia farnesiana) de la regioacuten semiaacuterida delcentro de Meacutexico Tecnol Cienc Agua 1 103ndash120 2010

Mauder M and Foken T Documentation and instruction man-ual of the eddy-covariance software package TK3 UniversitaumltBayreuth Abteilung Mikrometeorologie Bayreuth Germany2011

McClaran M A century of vegetation change on the Santa Rita Ex-perimental Range in Santa Rita Experimental Range 100 years(1903 to 2003) to accomplishments and contributions edited byMcClaran M P Ffolliott P F and Edminister C B Tech Co-ords USDA Forest service proceedings RMRS-P-30 TucsonAZ US 16ndash33 2003

McCulley R Jobbagy E Pockman W and Jackson R Nu-trient uptake as a contributing explanation for deep rootingin arid and semi-arid ecosystems Oecologia 141 620ndash628httpsdoiorg101007s00442-004-1687-z 2004

McElreath R Statistical Rethinking A Bayesian Course with Ex-amples in R and STAN CRC Press Boca Raton FL US 2020

Michailidis M How Driverless AI Prevents Overfitting and Leak-age Open Source Lead AI ML available at httpswwwh2oaiblogdriverless-ai-prevents-overfitting-leakage (last access7 July 2020) 2018

Mitchell R and Frank E Accelerating the XGBoost algo-rithm using GPU computing PeerJ Comput Sci 3 e127httpsdoiorg107717peerj-cs127 2017

Monteith J Solar radiation and productivity in tropical ecosys-tems J Appl Ecol 9 747ndash766 1972

Murray-Tortarolo G Friedlingstein P Sitch S Jaramillo V JMurguiacutea-Flores F Anav A Liu Y Arneth A Arvanitis AHarper A Jain A Kato E Koven C Poulter B Stocker BD Wiltshire A Zaehle S and Zeng N The carbon cycle inMexico past present and future of C stocks and fluxes Biogeo-sciences 13 223ndash238 httpsdoiorg105194bg-13-223-20162016

Nash J E and Sutcliffe J V River flow forecasting through con-ceptual models part I ndash A discussion of principles J Hydrol 10282ndash290 1970

Owen N A Choncubhair Oacute N Males J del Real Laborde J IRubio-Corteacutes R Griffiths H and Lanigan G Eddy covari-ance captures four-phase crassulacean acid metabolism (CAM)gas exchange signature in Agave Plant Cell Environ 39 295ndash309 httpsdoiorg101111pce12610 2016

Pasetto D Arenas-Castro S Bustamante J Casagrandi RChrysoulakis N Cord A F Dittrich A Domingo-Marimon C El Serafy G and Karnieli A Integration ofsatellite remote sensing data in ecosystem modelling at local

scales Practices and trends Methods Ecol Evol 9 1810ndash1821httpsdoiorg1011112041-210X13018 2018

Pastorello G Trotta C and Canfora E The FLUXNET2015dataset and the ONEFlux processing pipeline for eddy covari-ance data Sci Data 7 225 pp httpsdoiorg101038s41597-020-0534-3 2020

Pearcy R W Heterogeneous light environments in Handbookof Functional Plant Ecology edited by Pugnaire I F and Val-ladares F CRC Press Boca Ratoacuten FL US 270ndash314 2007

Piao S Liu Q Chen A Janssens I A Fu Y Dai J Liu LLian X Shen M and Zhu X Plant phenology and global cli-mate change Current progresses and challenges Glob ChangeBiol 25 1922ndash1940 httpsdoiorg101111gcb14619 2019

R Development Core Team R A language and environment forstatistical computing R Foundation for Statistical Computingavailable at httpswwwr-projectorg (last access 15 October2015) 2009

Reichstein M Falge E Baldocchi D Papale D Aubinet MBerbigier P Bernhofer C Buchmann N Gilmanov Tand Granier A On the separation of net ecosystem ex-change into assimilation and ecosystem respiration reviewand improved algorithm Glob Change Biol 11 1424ndash1439httpsdoiorg101111j1365-24862005001002x 2005

Richardson A D and Hollinger D Y Statistical model-ing of ecosystem respiration using eddy covariance datamaximum likelihood parameter estimation and Monte Carlosimulation of model and parameter uncertainty applied tothree simple models Agr Forest Meteorol 131 191ndash208httpsdoiorg101016jagrformet200505008 2005

Richardson A D Hollinger D Y Shoemaker J K Hughes HSavage K and Davidson E A Six years of ecosystem-atmosphere greenhouse gas fluxes measured in a sub-borealforest Sci Data 6 1ndash15 httpsdoiorg101038s41597-019-0119-1 2019

Roby M C Scott R L Barron-Gafford G A Hamerlynck E Pand Moore D J P Environmental and Vegetative Controls onSoil CO2 Efflux in Three Semiarid Ecosystems Soil Syst 3 6httpsdoiorg103390soilsystems3010006 2019

Running S W and Zhao M Daily GPP and annual NPP(MOD17A2A3) products NASA Earth Observing SystemMODIS land algorithm Univ of Mont Missoula Mont US2015

Running S W Nemani R R Heinsch F A Zhao MReeves M and Hashimoto H A continuous satellite-derived measure of global terrestrial primary productionBioscience 54 547ndash560 httpsdoiorg1016410006-3568(2004)054[0547ACSMOG]20CO2 2004

Rzedowski J Vegetacioacuten de Meacutexico Meacutexico Editor Limusa1978

Scott R L and Biederman J A Critical Zone Water Balance Over13 Years in a Semiarid Savanna Water Resour Res 55 574ndash588 httpsdoiorg1010292018WR023477 2019

Scott R L Watts C Payan J G Edwards E Goodrich D CWilliams D and James Shuttleworth W The understory andoverstory partitioning of energy and water fluxes in an opencanopy semiarid woodland Agr Forest Meteorol 114 127ndash139 httpsdoiorg101016S0168-1923(02)00197-1 2003

Scott R L Edwards E A Shuttleworth W J Huxman T EWatts C and Goodrich D C Interannual and seasonal vari-

httpsdoiorg105194bg-18-367-2021 Biogeosciences 18 367ndash392 2021

392 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

ation in fluxes of water and carbon dioxide from a ripar-ian woodland ecosystem Agr Forest Meteorol 122 65ndash84httpsdoiorg101016jagrformet200309001 2004

Scott R L Biederman J A Hamerlynck E P and Barron-Gafford G A The carbon balance pivot point of southwest-ern US semiarid ecosystems Insights from the 21st centurydrought J Geophys Res-Biogeo 120 2612ndash2624 2015

Segerstrom K Geology of the Bernal-Jalpan Area Estado deQueretaro Mexico Geol Surv Bull 1104 19ndash82 1961

Soper F M McCalley C K Sparks K and Sparks J P Soilcarbon dioxide emissions from the Mojave desert Isotopic evi-dence for a carbonate source Geophys Res Lett 44 245ndash251httpsdoiorg1010022016GL071198 2017

Stevens N T Steiner S H and MacKay R J Assessing agree-ment between two measurement systems An alternative to thelimits of agreement approach Stat Methods Med Res 262487ndash2504 httpsdoiorg1011770962280215601133 2015

Tagesson T Ardouml J Cappelaere B Kergoat L AbdiA Horion S and Fensholt R Modelling spatial andtemporal dynamics of gross primary production in theSahel from earth-observation-based photosynthetic capacityand quantum efficiency Biogeosciences 14 1333ndash1348httpsdoiorg105194bg-14-1333-2017 2017

Thornton P Thornton M Mayer B Wei Y Devarakonda RVose R and Cook R Daymet Daily Surface Weather Data ona 1-km Grid for North America Version 3 Oak Ridge NationalLab (ORNL) Oak Ridge TN US 2017

Tramontana G Jung M Schwalm C R Ichii K Camps-VallsG Raacuteduly B Reichstein M Arain M A Cescatti A KielyG Merbold L Serrano-Ortiz P Sickert S Wolf S andPapale D Predicting carbon dioxide and energy fluxes acrossglobal FLUXNET sites with regression algorithms Biogeo-sciences 13 4291ndash4313 httpsdoiorg105194bg-13-4291-2016 2016rdquo

Truong A Walters A Goodsitt J Hines K Bruss C B andFarivar R Towards automated machine learning Evaluationand comparison of AutoML approaches and tools in Proceed-ings of the 2019 IEEE 31st International Conference on Toolswith Artificial Intelligence (ICTAI) Portland OR US 1471ndash1479 2019

Verbist K Santibantildeez F Gabriels D and Soto G Atlas dezonas aacuteridas de Ameacuterica Latina y El Caribe Donoso MC(Coord) UNESCO-PHI and CAZALAC Montevideo Uruguay2010

Wang H Li X Ma M and Geng L Improving esti-mation of gross primary production in dryland ecosystemsby a model-data fusion approach Remote Sens 11 225httpsdoiorg103390rs11030225 2019

Wang X Ma M Li X Song Y Tan J Huang G Zhang ZZhao T Feng J and Ma Z Validation of MODIS-GPP prod-uct at 10 flux sites in northern China Int J Remote Sens 34587ndash599 httpsdoiorg101080014311612012715774 2013

Wang X Wang H Li X and Ran Y Photosynthesis (NPPNEP Respiration) Obs Meas Ecohydrol Process 1ndash30httpsdoiorg101007978-3-662-47871-4_10-1 2018

Wheeler C W Archer S R Asner G P and McMurtry C RClimaticedaphic controls on soil carbonnitrogen response toshrub encroachment in desert grassland Ecol Appl 17 1911ndash1928 httpsdoiorg10189006-15801 2007

Wilcox B P Birt A Fuhlendorf S D and Archer S R Emerg-ing frameworks for understanding and mitigating woody plantencroachment in grassy biomes Curr Opin Env Sust 32 46ndash52 httpsdoiorg101016jcosust201804005 2018

Wilson J B Cover plus ways of measuring plant canopiesand the terms used for them J Veg Sci 22 197ndash206httpsdoiorg101111j1654-1103201001238x 2011

Wofsy S Goulden M Munger J Fan S-M Bakwin PDaube B Bassow S and Bazzaz F Net exchange ofCO2 in a mid-latitude forest Science 260 1314ndash1317httpsdoiorg101126science26051121314 1993

Wu C Niu Z and Gao S Gross primary production estimationfrom MODIS data with vegetation index and photosyntheticallyactive radiation in maize J Geophys Res-Atmos 115 D12127httpsdoiorg1010292009JD013023 2010

Wutzler T Lucas-Moffat A Migliavacca M Knauer JSickel K Šigut L Menzer O and Reichstein MBasic and extensible post-processing of eddy covarianceflux data with REddyProc Biogeosciences 15 5015ndash5030httpsdoiorg105194bg-15-5015-2018 2018

Xiao H McDonald-Madden E Sabbadin R Peyrard NDee L E and Chadegraves I The value of understanding feed-backs from ecosystem functions to species for managing ecosys-tems Nat Commun 10 1ndash10 httpsdoiorg101038s41467-019-11890-7 2019

Xu B Yang Y Li P Shen H and Fang J Globalpatterns of ecosystem carbon flux in forests A biometricdata-based synthesis Global Biogeochem Cy 28 962ndash973httpsdoiorg1010022013GB004593 2014

Yao J Liu H Huang J Gao Z Wang G Li D Yu H andChen X Accelerated dryland expansion regulates future vari-ability in dryland gross primary production Nat Commun 111ndash10 httpsdoiorg101038s41467-020-15515-2 2020

Yepez E A Williams D G Scott R L and Lin GPartitioning overstory and understory evapotranspiration ina semiarid savanna woodland from the isotopic composi-tion of water vapor Agr Forest Meteorol 119 53ndash68httpsdoiorg101016S0168-1923(03)00116-3 2003

Yona L Cashore B Jackson R B Ometto J and Brad-ford M A Refining national greenhouse gas inventories Am-bio 1ndash6 httpsdoiorg101007s13280-019-01312-9 2020

Zeng X Zeng X and Barlage M Growing temperate shrubsover arid and semiarid regions in the Community Land ModelndashDynamic Global Vegetation Model Global Biogeochem Cy 22GB3003 httpsdoiorg1010292007GB003014 2008

Zhang A Jia G Epstein H E and Xia J ENSO elicits oppos-ing responses of semi-arid vegetation between Hemispheres SciRep 7 42281 httpsdoiorg101038srep42281 2017

Zhang L Xiao J Zheng Y Li S and Zhou Y Increased carbonuptake and water use efficiency in global semi-arid ecosystemsEnviron Res Lett 15 034022 httpsdoiorg1010881748-9326ab68ec 2020

Zhu X-G Long S P and Ort D R What is the max-imum efficiency with which photosynthesis can convert so-lar energy into biomass Curr Opin Biotech 19 153ndash159httpsdoiorg101016jcopbio200802004 2008

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

  • Abstract
  • Introduction
  • Materials and methods
    • Site description
    • Eddy covariance measurements
    • Flux data processing
    • Remotely sensed data
    • MODIS algorithm for GPP
    • Santa Rita site dataset
    • Modeling
    • Variable importance
    • Model agreement
      • Results
        • Eddy covariance fluxes at Bernal
        • Machine learning ensembles as predictors of eddy covariance GPP
        • MODIS as predictor of eddy covariance GPP
        • Prediction eddy covariance GPP with ordinary-least-squares multiple regression
        • Agreement
          • Discussion
            • Model agreement
            • MODIS discrepancies
            • Forcing variables and machine learning
            • Carbon flux
            • Ecosystem management
              • Conclusions
              • Appendix A
              • Appendix B
              • Appendix C
              • Code and data availability
              • Author contributions
              • Competing interests
              • Acknowledgement
              • Financial support
              • Review statement
              • References
Page 26: Machine learning estimates of eddy covariance carbon flux ... · Scrub is a xeric category of shrublands characterized by plants with small leaves; it is very thorny, and its biomass

392 A Guevara-Escobar et al Machine learning estimates of eddy covariance carbon flux

ation in fluxes of water and carbon dioxide from a ripar-ian woodland ecosystem Agr Forest Meteorol 122 65ndash84httpsdoiorg101016jagrformet200309001 2004

Scott R L Biederman J A Hamerlynck E P and Barron-Gafford G A The carbon balance pivot point of southwest-ern US semiarid ecosystems Insights from the 21st centurydrought J Geophys Res-Biogeo 120 2612ndash2624 2015

Segerstrom K Geology of the Bernal-Jalpan Area Estado deQueretaro Mexico Geol Surv Bull 1104 19ndash82 1961

Soper F M McCalley C K Sparks K and Sparks J P Soilcarbon dioxide emissions from the Mojave desert Isotopic evi-dence for a carbonate source Geophys Res Lett 44 245ndash251httpsdoiorg1010022016GL071198 2017

Stevens N T Steiner S H and MacKay R J Assessing agree-ment between two measurement systems An alternative to thelimits of agreement approach Stat Methods Med Res 262487ndash2504 httpsdoiorg1011770962280215601133 2015

Tagesson T Ardouml J Cappelaere B Kergoat L AbdiA Horion S and Fensholt R Modelling spatial andtemporal dynamics of gross primary production in theSahel from earth-observation-based photosynthetic capacityand quantum efficiency Biogeosciences 14 1333ndash1348httpsdoiorg105194bg-14-1333-2017 2017

Thornton P Thornton M Mayer B Wei Y Devarakonda RVose R and Cook R Daymet Daily Surface Weather Data ona 1-km Grid for North America Version 3 Oak Ridge NationalLab (ORNL) Oak Ridge TN US 2017

Tramontana G Jung M Schwalm C R Ichii K Camps-VallsG Raacuteduly B Reichstein M Arain M A Cescatti A KielyG Merbold L Serrano-Ortiz P Sickert S Wolf S andPapale D Predicting carbon dioxide and energy fluxes acrossglobal FLUXNET sites with regression algorithms Biogeo-sciences 13 4291ndash4313 httpsdoiorg105194bg-13-4291-2016 2016rdquo

Truong A Walters A Goodsitt J Hines K Bruss C B andFarivar R Towards automated machine learning Evaluationand comparison of AutoML approaches and tools in Proceed-ings of the 2019 IEEE 31st International Conference on Toolswith Artificial Intelligence (ICTAI) Portland OR US 1471ndash1479 2019

Verbist K Santibantildeez F Gabriels D and Soto G Atlas dezonas aacuteridas de Ameacuterica Latina y El Caribe Donoso MC(Coord) UNESCO-PHI and CAZALAC Montevideo Uruguay2010

Wang H Li X Ma M and Geng L Improving esti-mation of gross primary production in dryland ecosystemsby a model-data fusion approach Remote Sens 11 225httpsdoiorg103390rs11030225 2019

Wang X Ma M Li X Song Y Tan J Huang G Zhang ZZhao T Feng J and Ma Z Validation of MODIS-GPP prod-uct at 10 flux sites in northern China Int J Remote Sens 34587ndash599 httpsdoiorg101080014311612012715774 2013

Wang X Wang H Li X and Ran Y Photosynthesis (NPPNEP Respiration) Obs Meas Ecohydrol Process 1ndash30httpsdoiorg101007978-3-662-47871-4_10-1 2018

Wheeler C W Archer S R Asner G P and McMurtry C RClimaticedaphic controls on soil carbonnitrogen response toshrub encroachment in desert grassland Ecol Appl 17 1911ndash1928 httpsdoiorg10189006-15801 2007

Wilcox B P Birt A Fuhlendorf S D and Archer S R Emerg-ing frameworks for understanding and mitigating woody plantencroachment in grassy biomes Curr Opin Env Sust 32 46ndash52 httpsdoiorg101016jcosust201804005 2018

Wilson J B Cover plus ways of measuring plant canopiesand the terms used for them J Veg Sci 22 197ndash206httpsdoiorg101111j1654-1103201001238x 2011

Wofsy S Goulden M Munger J Fan S-M Bakwin PDaube B Bassow S and Bazzaz F Net exchange ofCO2 in a mid-latitude forest Science 260 1314ndash1317httpsdoiorg101126science26051121314 1993

Wu C Niu Z and Gao S Gross primary production estimationfrom MODIS data with vegetation index and photosyntheticallyactive radiation in maize J Geophys Res-Atmos 115 D12127httpsdoiorg1010292009JD013023 2010

Wutzler T Lucas-Moffat A Migliavacca M Knauer JSickel K Šigut L Menzer O and Reichstein MBasic and extensible post-processing of eddy covarianceflux data with REddyProc Biogeosciences 15 5015ndash5030httpsdoiorg105194bg-15-5015-2018 2018

Xiao H McDonald-Madden E Sabbadin R Peyrard NDee L E and Chadegraves I The value of understanding feed-backs from ecosystem functions to species for managing ecosys-tems Nat Commun 10 1ndash10 httpsdoiorg101038s41467-019-11890-7 2019

Xu B Yang Y Li P Shen H and Fang J Globalpatterns of ecosystem carbon flux in forests A biometricdata-based synthesis Global Biogeochem Cy 28 962ndash973httpsdoiorg1010022013GB004593 2014

Yao J Liu H Huang J Gao Z Wang G Li D Yu H andChen X Accelerated dryland expansion regulates future vari-ability in dryland gross primary production Nat Commun 111ndash10 httpsdoiorg101038s41467-020-15515-2 2020

Yepez E A Williams D G Scott R L and Lin GPartitioning overstory and understory evapotranspiration ina semiarid savanna woodland from the isotopic composi-tion of water vapor Agr Forest Meteorol 119 53ndash68httpsdoiorg101016S0168-1923(03)00116-3 2003

Yona L Cashore B Jackson R B Ometto J and Brad-ford M A Refining national greenhouse gas inventories Am-bio 1ndash6 httpsdoiorg101007s13280-019-01312-9 2020

Zeng X Zeng X and Barlage M Growing temperate shrubsover arid and semiarid regions in the Community Land ModelndashDynamic Global Vegetation Model Global Biogeochem Cy 22GB3003 httpsdoiorg1010292007GB003014 2008

Zhang A Jia G Epstein H E and Xia J ENSO elicits oppos-ing responses of semi-arid vegetation between Hemispheres SciRep 7 42281 httpsdoiorg101038srep42281 2017

Zhang L Xiao J Zheng Y Li S and Zhou Y Increased carbonuptake and water use efficiency in global semi-arid ecosystemsEnviron Res Lett 15 034022 httpsdoiorg1010881748-9326ab68ec 2020

Zhu X-G Long S P and Ort D R What is the max-imum efficiency with which photosynthesis can convert so-lar energy into biomass Curr Opin Biotech 19 153ndash159httpsdoiorg101016jcopbio200802004 2008

Biogeosciences 18 367ndash392 2021 httpsdoiorg105194bg-18-367-2021

  • Abstract
  • Introduction
  • Materials and methods
    • Site description
    • Eddy covariance measurements
    • Flux data processing
    • Remotely sensed data
    • MODIS algorithm for GPP
    • Santa Rita site dataset
    • Modeling
    • Variable importance
    • Model agreement
      • Results
        • Eddy covariance fluxes at Bernal
        • Machine learning ensembles as predictors of eddy covariance GPP
        • MODIS as predictor of eddy covariance GPP
        • Prediction eddy covariance GPP with ordinary-least-squares multiple regression
        • Agreement
          • Discussion
            • Model agreement
            • MODIS discrepancies
            • Forcing variables and machine learning
            • Carbon flux
            • Ecosystem management
              • Conclusions
              • Appendix A
              • Appendix B
              • Appendix C
              • Code and data availability
              • Author contributions
              • Competing interests
              • Acknowledgement
              • Financial support
              • Review statement
              • References