MIDS Capstone Project Spring 2020

Detecting Irrigated Croplands Globally

Team members

Detecting the extent of irrigation in croplands is important for food security and water management. The Food and Agriculture Organization (FAO) maintains a dataset of cropland areas worldwide equipped for irrigation. This dataset is derived from government and census records. With the advent of satellite data and large-scale machine learning, we can validate and improve upon this knowledge.

Our Time-Stationary Model

In this project, we first built a machine learning model that can classify irrigated croplands. We used the FAO dataset developed by Siebert et al. as labels to train our model; our features included satellite data, climate data and soil data. We then classified land on all continents except Antarctica at three-year intervals from 2000 to 2018, at a spatial resolution of 8 km. To run such large-scale predictions, we used the machine learning and data catalog services provided by Google Earth Engine (GEE). We made the maps available for interactive exploration as GEE applications on a website dedicated to the project. Our model has an accuracy score of 84% and a kappa value of 0.46. Our model map shows many false positives; however, in many cases these areas are indeed croplands, which suggests that government data underreport the true extent of irrigated croplands. This is particularly true of China and India.
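The gap between 84% accuracy and a kappa of 0.46 is typical when one class (non-irrigated land) dominates, because kappa discounts the agreement expected by chance. The sketch below computes both metrics from a binary confusion matrix. The function name and the counts are hypothetical, invented purely for illustration; they are not the project's confusion matrix.

```python
def accuracy_and_kappa(tp, fp, fn, tn):
    """Return (accuracy, Cohen's kappa) for a binary confusion matrix."""
    total = tp + fp + fn + tn
    observed = (tp + tn) / total  # observed agreement = accuracy

    # Agreement expected by chance: for each class, multiply the share of
    # cells predicted as that class by the share actually in that class.
    chance_pos = ((tp + fp) / total) * ((tp + fn) / total)
    chance_neg = ((tn + fn) / total) * ((tn + fp) / total)
    expected = chance_pos + chance_neg

    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical, imbalanced counts (NOT the project's results):
# 160 truly irrigated cells versus 840 non-irrigated cells.
acc, kappa = accuracy_and_kappa(tp=100, fp=100, fn=60, tn=740)
print(f"accuracy={acc:.2f}, kappa={kappa:.2f}")
```

With counts like these, most of the raw agreement comes from the abundant negative class, so kappa lands well below the accuracy figure even though the classifier is far better than chance.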

Our Time-Series Model

We were able to build highly predictive global models of irrigated areas, as confirmed by the MIRCA and GFSAD datasets for 2000 and 2010. Unfortunately, our model depends on variables that are not constant over time. For instance, we know the planet is warming, which drives increased water evaporation. In addition, irrigation is increasing over time, likely due to factors such as population growth and improved irrigation methods. It is therefore naive to assume that the relationship between satellite-derived features and global irrigation use is stable. Accordingly, in addition to our time-stationary model, we used a time-series forecasting approach. This approach used a “pooled model” trained on datasets from the Siebert analysis from 1985 to 2000 and tested against Siebert’s 2005 irrigation dataset. Features for this model include present values and historical averages to capture long-range agricultural suitability. The model performs well in predicting binary labels: it can determine with 92% accuracy whether a specific plot of land (a grid cell measuring 5 arc minutes on a side) includes at least 100 hectares of irrigated croplands. With a stratified approach that attempts to differentiate classes of irrigation (e.g., high, medium, low), the model does well in predicting zero and high amounts of irrigation but falters when differentiating the classes in the middle. Currently, the model is not designed to predict continuous labels (i.e., exactly how much irrigation exists).
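To give a sense of scale for the binary label described above, the following sketch estimates the area of a 5-arc-minute grid cell and applies the 100-hectare threshold. It assumes a spherical Earth with a mean radius of 6371 km; the function and constant names are hypothetical and are not taken from the project's code.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean radius, spherical Earth
CELL_ARCMIN = 5.0         # grid cell size stated in the text
THRESHOLD_HA = 100.0      # irrigated-area threshold stated in the text


def cell_area_ha(lat_deg, cell_arcmin=CELL_ARCMIN):
    """Approximate area (hectares) of a lat/lon cell centered at lat_deg."""
    d = math.radians(cell_arcmin / 60.0)  # cell width and height in radians
    lat = math.radians(lat_deg)
    # Area of a spherical lat/lon rectangle:
    # R^2 * dlon * (sin(north edge) - sin(south edge))
    area_km2 = EARTH_RADIUS_KM ** 2 * d * (
        math.sin(lat + d / 2) - math.sin(lat - d / 2)
    )
    return area_km2 * 100.0  # 1 km^2 = 100 ha


def binary_label(irrigated_ha, threshold_ha=THRESHOLD_HA):
    """Return 1 if the cell holds at least threshold_ha of irrigated cropland."""
    return 1 if irrigated_ha >= threshold_ha else 0


equator_ha = cell_area_ha(0.0)
print(f"cell area at the equator: about {equator_ha:.0f} ha")
print(f"threshold share of that cell: {THRESHOLD_HA / equator_ha:.1%}")
print(binary_label(250.0), binary_label(40.0))
```

Under these assumptions a cell at the equator covers several thousand hectares, so the 100-hectare threshold corresponds to only a small fraction of the cell. Cells shrink toward the poles, so the same threshold represents a larger share of the cell at high latitudes.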

Course

Data Science 210. Capstone, Spring 2020

Class Project Gallery

More Information

Web Application

Last updated: April 20, 2020