CS5131 Introduction to Artificial Intelligence

Assignment 1

BY LIEW WEI PYN AND PRANNAYA GUPTA
M22504
Done as part of CS5131: Introduction to Artificial Intelligence

Problem Definition and Description

UV Aerosol Index (UVAI)

The UV Aerosol Index (UVAI), also referred to as the Absorbing Aerosol Index (AAI), the residue, the spectral contrast anomaly, or simply the aerosol index, is a reading based on wavelength-dependent changes in molecular Rayleigh scattering, surface reflection, gaseous absorption, and aerosol and cloud scattering in the UV spectral range. It is measured for a pair of wavelengths, and the quantity compared between them is known as the spectral contrast at the two wavelengths. The UVAI results from the difference between the observed and the modelled reflectance. A positive UVAI indicates the presence of elevated aerosols that absorb ultraviolet (UV) radiation in the atmosphere, such as dust and smoke.

These absorbing aerosols are solid and liquid particles suspended in the atmosphere. They are made up of primary aerosols and secondary aerosols. Primary aerosols are emitted directly, whereas secondary aerosols are formed by gases reacting in the atmosphere. The main sources of aerosols are combustion, like burning biomass, and natural sources, like volcanoes. The UVAI can show dust and smoke particles coming from events like dust storms and outbreaks, volcanic eruptions, fires, and biomass burning around the world, and hence it is useful for tracking the evolution of episodic aerosol plumes from such events. The wavelengths used have very low ozone absorption, so unlike aerosol optical thickness measurements, UVAI can be calculated in the presence of clouds. Daily global coverage is therefore possible.

Traditionally, aerosol optical thickness measurements are made using spaceborne sensors operating in the visible and infrared ranges, where multiple scattering in the atmosphere is less important than in the ultraviolet and inversion calculations are relatively easier. In the visible and near-infrared ranges, however, the large surface albedos of many land types make retrieval of aerosols difficult over these regions. With the ongoing development of numerical radiative transfer codes and increasing computational speeds, accounting for multiple scattering is no longer a problem, which allows for new techniques of aerosol measurement in the UV. Because the surface albedos of both land and ocean are small in the UV, this wavelength range should be suitable for aerosol detection over land.

The Residue Method

A common way of measuring the UVAI is the Residue Method, in which the UVAI is represented by a wavelength-dependent residue $r_\lambda$, defined by the following expression:

$$r_\lambda = -100 \left[\log \left(\frac{I(\lambda)}{I\left(\lambda_0\right)}\right)_{measured} - \log \left(\frac{I(\lambda)}{I\left(\lambda_0\right)}\right)_{ray} \right]$$

where $I(\lambda)$ is the radiance at the top of the atmosphere (TOA) at a wavelength $\lambda$. The subscript $measured$ refers to a measured TOA radiance of a real atmosphere with aerosols. The subscript $ray$ refers to a calculated TOA radiance for an aerosol-free atmosphere with only Rayleigh scattering, absorption by molecules, and surface reflection and absorption. $\lambda_0$ is the reference wavelength used when computing the UV Aerosol Index at different wavelengths. For instance, for the Nimbus 7 Total Ozone Mapping Spectrometer (TOMS) over the years 1979-1993, the 340 nm and 380 nm radiances are used as $I(\lambda)$ and $I(\lambda_0)$ respectively, and hence the computation of $r_\lambda$, referred to as $\Delta N_\lambda$ in Herman et al., is as follows:

$\Delta N_\lambda = -100 \left(\log \left(\frac{I_{340}}{I_{380}} \right)_{meas} - \log \left(\frac{I_{340}}{I_{380}} \right)_{ray} \right)$

This is the general form of the computation. However, it is written in terms of radiances, which is not yet the most convenient form for this task. We can simplify it further using the concepts of reflectance and surface albedo.
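As a minimal sketch of the residue expression above, the function below evaluates $r_\lambda$ from four radiances. The function name, the argument names and the sample values are hypothetical, and the use of a base-10 logarithm is an assumption of this sketch.

```python
import math


def residue(i_meas, i_meas_ref, i_ray, i_ray_ref):
    """Residue r_lambda from measured and Rayleigh-modelled TOA radiances.

    i_meas, i_meas_ref: measured radiances at lambda and lambda_0.
    i_ray, i_ray_ref: modelled (aerosol-free) radiances at lambda and lambda_0.
    Assumption: the logarithm is taken to base 10.
    """
    measured = math.log10(i_meas / i_meas_ref)
    modelled = math.log10(i_ray / i_ray_ref)
    return -100.0 * (measured - modelled)


# Hypothetical radiances in arbitrary units, not real TOMS data.
# The measured radiance at lambda is lower than the modelled one.
r_absorbing = residue(i_meas=0.90, i_meas_ref=1.00, i_ray=1.00, i_ray_ref=1.00)
# Measured and modelled radiances agree, so the residue vanishes.
r_clear = residue(i_meas=1.00, i_meas_ref=1.00, i_ray=1.00, i_ray_ref=1.00)
```

With these made-up inputs the first residue is positive and the second is zero, which matches the reading that a positive index signals absorbing aerosols.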

Reflectance

We first note reflectance, defined as follows:

$R = \frac{\pi I}{E_0 \cos \left(\theta_0\right)}$

Here, $I$ is again a function of $\lambda$ and refers to the radiance at the TOA, while $E_0$ denotes the solar irradiance at the TOA, measured perpendicular to the direction of the incident sunlight. $\theta_0$ denotes the solar zenith angle, as shown below:

Source: Support to Aviation Control Service (SACS) Article on the Solar Zenith Angle

We note that the plane on which $E_0$ is measured is perpendicular to the incoming beam, which arrives at the angle $\theta_0$. Hence, $E_0$ must be scaled by $\cos \left(\theta_0\right)$ to give the solar irradiance on the horizontal TOA plane above the observer. Dividing the radiance, scaled by $\pi$, by this irradiance gives an accurate value for the reflectance. Rearranging this definition gives another representation of $I(\lambda)$, which lets us simplify the residue above once we introduce a further term known as the surface albedo.
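The reflectance definition can be sketched in a few lines. The function name, the argument names and the sample values below are hypothetical; the solar zenith angle is assumed to be given in degrees.

```python
import math


def reflectance(radiance, solar_irradiance, sza_deg):
    """R = pi * I / (E_0 * cos(theta_0)), with theta_0 given in degrees."""
    mu0 = math.cos(math.radians(sza_deg))
    if mu0 <= 0:
        # The sun is at or below the horizon, so R is undefined here.
        raise ValueError("solar zenith angle must be below 90 degrees")
    return math.pi * radiance / (solar_irradiance * mu0)


# Hypothetical values in consistent arbitrary units.
example = reflectance(radiance=0.1, solar_irradiance=1.0, sza_deg=60.0)
```

At a solar zenith angle of 60 degrees the cosine factor is 0.5, so the same radiance corresponds to twice the reflectance it would have with the sun overhead.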

Surface Albedo

The surface albedo $A_s$ for the Rayleigh atmosphere calculation is chosen so that

$R_{\lambda_0, meas} = R_{\lambda_0, ray} (A_s)$

Reduction of Residue Theorem

$r_\lambda = -100 \left(\log R_{\lambda, meas} - \log R_{\lambda, ray} (A_s) \right)$
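The reduction can be sketched from the definitions above. This assumes that the same solar irradiance $E_0$ and the same solar zenith angle $\theta_0$ are used for the measured and the Rayleigh terms. From the reflectance definition, $I(\lambda) = R_\lambda E_0(\lambda) \cos\left(\theta_0\right) / \pi$, so that

$$\frac{I(\lambda)}{I\left(\lambda_0\right)} = \frac{R_\lambda}{R_{\lambda_0}} \cdot \frac{E_0(\lambda)}{E_0\left(\lambda_0\right)}$$

The irradiance ratio appears identically in both logarithms of the residue and cancels in their difference, leaving

$$r_\lambda = -100 \left[\log \left(\frac{R_{\lambda, meas}}{R_{\lambda_0, meas}}\right) - \log \left(\frac{R_{\lambda, ray}(A_s)}{R_{\lambda_0, ray}(A_s)}\right) \right]$$

Since $A_s$ is chosen so that $R_{\lambda_0, meas} = R_{\lambda_0, ray}(A_s)$, the terms at $\lambda_0$ cancel as well:

$$r_\lambda = -100 \left[\log R_{\lambda, meas} - \log R_{\lambda, ray}(A_s) \right]$$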

This is a useful formula, since the simulated Rayleigh reflectance it requires is not problematic to compute for low-Earth-orbit satellites such as the Sentinel-5 Precursor.

Research Question

In this project, we aim to train a model to predict UVAI readings from the atmospheric concentrations of specific gases, namely Methane (CH4), Sulphur Dioxide (SO2), Ozone (O3), Nitrogen Dioxide (NO2), Formaldehyde (HCHO) and Carbon Monoxide (CO).

We train the model on data from the United States of America (USA), segmented by state. The state is a categorical variable, which we later convert to a discrete encoding. The data is acquired from the Google Earth Engine and originates from the European Space Agency's (ESA) Sentinel-5 Precursor (S5p) mission satellite, a low-Earth-orbit polar satellite system that is part of the Global Monitoring for Environment and Security (GMES/COPERNICUS) space component programme headed by the European Commission (EC) in partnership with ESA. The goal of the satellite is to provide information and services on air quality, climate, and the ozone layer. The S5p mission includes the TROPOspheric Monitoring Instrument (TROPOMI), which takes daily (mostly Near Real-Time) global observations of key atmospheric components, such as absorbing aerosols and gaseous concentrations in the environment, at a 5.5 x 3.5 kilometer (km) resolution.

Rationale - Why is this model important?

The Sentinel-5 Precursor satellite only provides readings for half of the Earth at any one time. We therefore wish to use the model to predict the UV Aerosol Index readings for the other hemisphere, which would let us reason about how the UV Aerosol Index is changing without gaps in the data.

Summary of Datasets

  1. DataHub's Natural Earth Polygons GeoJSON Dataset
  2. From Google's Earth Engine
    1. Sentinel-5P OFFL CH4: Offline Methane - Collection of High-Res Imagery of Concentrations of Methane
    2. Sentinel-5P NRTI SO2: Near Real-Time Sulphur Dioxide - Real-Time Collection of High-Res Imagery of Concentrations of Sulphur Dioxide in the Atmosphere
    3. Sentinel-5P NRTI O3: Near Real-Time Ozone - Real-Time Collection of High-Res Imagery of Concentrations of Column Ozone in the Atmosphere
    4. Sentinel-5P NRTI NO2: Near Real-Time Nitrogen Dioxide - Real-Time Collection of High-Res Imagery of Concentrations of Nitrogen Dioxide in the Atmosphere
    5. Sentinel-5P NRTI HCHO: Near Real-Time Formaldehyde - Real-Time Collection of High-Res Imagery of Concentrations of Formaldehyde in the Atmosphere
    6. Sentinel-5P NRTI CO: Near Real-Time Carbon Monoxide - Real-Time Collection of High-Res Imagery of Concentrations of Carbon Monoxide in the Atmosphere
    7. Sentinel-5P NRTI AER AI: Near Real-Time UV Aerosol Index - Real-Time Collection of High-Res Imagery of UV Aerosol Index (UVAI), also called the Absorbing Aerosol Index (AAI)

Please note that much of this data is derived from Google Earth Engine datasets, and much of the code requires the Google Earth Engine library on Google Colab. We have also saved the retrieved data to a Google Drive folder with public access (please do not share). Save these files locally in a directory data/.

Import required libraries and general setup

Important note: remember to restart the kernel after pip installing vaex, otherwise importing vaex will raise an error.

We now authenticate Earth Engine to retrieve data from the website. Note that this requires an authenticated Google Earth Engine account. Authentication is not needed unless you rerun the data collection cells, since the source data is also included; the data retrieval code is nevertheless kept for rigorous verification purposes.

DataHub's Natural Earth Polygons GeoJSON Dataset

DataHub's Natural Earth Polygons GeoJSON Dataset is a geodata data package that provides GeoJSON polygons for the largest administrative subdivisions in every country. The data comes from Natural Earth, a community effort to make visually pleasing, well-crafted maps with cartography or GIS software at small scale.

We have opened this dataset as a geopandas.GeoDataFrame object in countries_geojson, as shown below.

Now, we select the rows that belong to the United States of America, which gives 51 subdivisions (the 50 states and the District of Columbia). We keep states such as Massachusetts and California, but exclude Alaska, Rhode Island, the District of Columbia and Hawaii, since they stray from mainland USA. We save the result in the us_geojson object.

Data Acquisition

Remember the geojson information on each state from before? We now turn them all into a FeatureCollection for earth engine.

Ah yes. Code. You can still detect the residual javascript.

Initialise parameters for the Earth Engine API call.

I would not recommend running this cell. It took 2 hours. Per file.

Data loading is fairly straightforward. The files are very large, so using vaex is much faster: it computes variables lazily instead of storing them all in memory like pandas.

Opening and Merging Data

Note that all data files were stored in the data/ directory. We now iterate through all the dictionary keys and add the matching filepaths as values, found through wildcard pattern matching with glob.

We then initialise the starting df using AER_AI, and iterate through all the keys, processing their vaex dataframes using process_exported and merging them into the starting df as we go.
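The merge-as-we-go pattern can be illustrated without vaex. The sketch below uses plain lists of dictionaries; the helper merge_on_keys, the sample rows and the choice of an inner join on the datetime and name columns are all assumptions made for illustration, and it does not reproduce process_exported.

```python
def merge_on_keys(base, other):
    """Inner-join two lists of row dicts on the (datetime, name) pair."""
    lookup = {(row["datetime"], row["name"]): row for row in other}
    merged = []
    for row in base:
        key = (row["datetime"], row["name"])
        if key in lookup:
            # Combine the columns of both rows into one record.
            merged.append({**row, **lookup[key]})
    return merged


# Hypothetical rows standing in for the exported per-product tables.
aer_ai = [
    {"datetime": "2021-01-01", "name": "Texas", "UVAI": -0.5},
    {"datetime": "2021-01-01", "name": "Ohio", "UVAI": 0.2},
]
tables = {
    "CO": [{"datetime": "2021-01-01", "name": "Texas", "CO": 0.03}],
    "O3": [
        {"datetime": "2021-01-01", "name": "Texas", "O3": 0.12},
        {"datetime": "2021-01-01", "name": "Ohio", "O3": 0.13},
    ],
}

# Start from the UVAI table and fold every other table into it.
df = aer_ai
for key in tables:
    df = merge_on_keys(df, tables[key])
```

With an inner join, a row survives only if every product has a reading for that date and state, which is why the Ohio row drops out in this toy example.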

Produce Dataset Information

Data Cleaning

We now set the datetime and name columns as our indices, and sort the data by datetime and then by name. We then rename the columns to shorter, easier-to-manipulate names, based on the following one-to-one table:

| Before | After |
| --- | --- |
| absorbing_aerosol_index | UVAI |
| CH4_column_volume_mixing_ratio_dry_air | CH4 |
| CO_column_number_density | CO |
| H2O_column_number_density | H2O |
| tropospheric_HCHO_column_number_density | tropoHCHO |
| tropospheric_HCHO_column_number_density_amf | tropoHCHOamf |
| HCHO_slant_column_number_density | slantHCHO |
| NO2_column_number_density | NO2 |
| tropospheric_NO2_column_number_density | tropoNO2 |
| stratospheric_NO2_column_number_density | stratoNO2 |
| O3_column_number_density | O3 |
| O3_slant_column_number_density | slantO3 |
| O3_effective_temperature | O3Teff |
| SO2_column_number_density | SO2 |
| SO2_slant_column_number_density | slantSO2 |

Normalise values to between 0 and 1.
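A minimal sketch of min-max normalisation, which is one common way to map a column onto the range 0 to 1. Whether the original cell used exactly this scheme is an assumption, and the function name and sample values are hypothetical.

```python
def min_max_normalise(values):
    """Rescale a list of numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical column values.
scaled = min_max_normalise([2.0, 4.0, 6.0, 10.0])
```

Here the smallest value becomes 0, the largest becomes 1, and the others keep their relative spacing in between.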

Exploratory Data Analysis

Correlation Matrix

We now compute correlations to check whether any variable can simply be eliminated, and to see how the variables relate to one another. We do this by computing the correlation matrix, stored in a variable corrmat. Following this, we plot a heatmap of the correlation matrix, as shown below.
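For reference, a correlation matrix of this kind can be sketched in pure Python with the Pearson coefficient. The use of Pearson correlation is an assumption here, and the helper names and sample columns are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Nested dict of pairwise correlations for a dict of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical normalised columns, not the real dataset.
columns = {
    "UVAI": [0.1, 0.4, 0.35, 0.8],
    "CO": [0.2, 0.5, 0.4, 0.9],
    "O3": [0.9, 0.6, 0.7, 0.1],
}
corrmat = correlation_matrix(columns)
```

Entries close to 1 or -1 flag pairs of variables that carry largely the same information, which is what makes one of them a candidate for elimination.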

Heatmap
Correlation Abnormalities

We note that the correlation between the Ozone Effective Temperature (O3Teff) and the Stratospheric NO2 Column Density (stratoNO2) is incredibly low, hence we plot the two variables against each other to search for some form of relationship, as shown below:

UVAI against Categories

Strip Plot of UV Aerosol Index by State
Box Plot of UV Aerosol Index by State
Violin Plot of UV Aerosol Index by State

Methodology

Linear Algebra-ing Data

That is to say, converting the data into Tensors in Numpy.

One-Hot Encode the State Name
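A minimal sketch of one-hot encoding for a categorical column such as the state name. The function name and the sample labels are hypothetical, and ordering the categories alphabetically is a choice made for this sketch.

```python
def one_hot_encode(labels):
    """Return the sorted category list and one 0/1 row per input label."""
    categories = sorted(set(labels))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for label in labels:
        row = [0] * len(categories)
        row[index[label]] = 1  # exactly one position is set per label
        rows.append(row)
    return categories, rows


# Hypothetical state labels.
categories, encoded = one_hot_encode(["Texas", "Ohio", "Texas", "Utah"])
```

Each state gets its own column, so the model does not read any ordering into the state names.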

Splitting Data
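A minimal sketch of a shuffled train/test split. The 80/20 ratio, the fixed seed and the helper name are assumptions for illustration, not values taken from the original notebook.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle row indices with a fixed seed and hold out a fraction for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test_idx = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test


# Hypothetical rows; integers stand in for full feature records.
rows = list(range(10))
train, test = train_test_split(rows, test_fraction=0.2, seed=42)
```

Fixing the seed keeps the split reproducible, so every model is evaluated on the same held-out rows.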

Summary Functions

Compute Model Performance Statistics
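The two statistics used in the comparisons below, the coefficient of determination and the mean squared error, can be sketched as follows. These are pure-Python helpers written for illustration with hypothetical sample values; they are not the notebook's own summary functions.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between expected and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """1 - SS_res / SS_tot, where SS_tot is measured around the mean of y_true."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical expected values and three sets of predictions.
y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score(y_true, y_true)
mean_only = r2_score(y_true, [2.5, 2.5, 2.5, 2.5])
worse = r2_score(y_true, [4.0, 3.0, 2.0, 1.0])
```

A perfect fit scores 1, always predicting the mean scores 0, and a model that does worse than the mean scores below 0, which is how a negative value should be read.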

Generic Regression Algorithms

Multi-Linear Regression (MLR)

Based on the $R^2$ value of 0.3942, the MLR model explains only about 39% of the variance in UVAI, which indicates a weak relationship between the input and output variables. Plotting the predicted values, we can see that MLR does not produce satisfactory results.

Fit and Predict
Summary Values
Plot of Expected against Predicted Values of UVAI

Ridge Regression

Based on the $R^2$ value of 0.3926, there is a similarly weak relationship between the input and output values for the Ridge regression model. Plotting the predicted values, we can see that Ridge regression does not produce satisfactory results.

Fit and Predict
Summary Values
Plot of Expected against Predicted Values of UVAI

Lasso Regression

Given that the $R^2$ is negative, the Lasso regression model fits the data worse than a horizontal line at the mean of the observed UVAI would.

Fit and Predict
Summary Values
Plot of Expected against Predicted Values of UVAI

Support Vector Machines for Regression (SVRs)

Support Vector Regression normally does not scale well to datasets beyond 10,000 samples (we have around 19,106), because its training cost grows at least quadratically with the number of samples; however, we can afford the time it takes to compute. The RBF and Polynomial kernels appear to perform best, even compared to the previous models.
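To give a feel for the quadratic growth, the sketch below defines an RBF kernel and counts the pairwise kernel evaluations for a dataset of this size. The helper names and the gamma value are hypothetical, and the memory figure assumes a dense matrix of 8-byte floats, which real SVR implementations typically avoid by caching only part of the kernel.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel exp(-gamma * ||x - y||^2) between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def kernel_matrix_entries(n_samples):
    """Number of pairwise kernel values for n_samples training points."""
    return n_samples * n_samples


n = 19_106
entries = kernel_matrix_entries(n)
# Size of a dense kernel matrix in GiB, assuming 8 bytes per entry.
dense_gib = entries * 8 / 2**30

# Identical points have kernel value 1; the value decays with distance.
same = rbf_kernel([0.0, 0.0], [0.0, 0.0], gamma=0.5)
apart = rbf_kernel([0.0], [1.0], gamma=1.0)
```

Doubling the number of samples quadruples the number of kernel values, which is why SVR becomes slow well before the dataset looks large by other standards.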

Fit and Predict

Summary Values

Plot of Expected against Predicted Values of UVAI

Random Forests for Regression

Absolutely terrible results.

Fit and Predict

Summary Values

Plot of Expected against Predicted Values of UVAI

Comparison

Numerical Results

Based on these values, we note that the RBF Kernel SVR is the best model. We continue to visualise this data as shown below. Since Lasso is very clearly a bad fit for the datapoints, we exclude the predictions from Lasso Regression from comparison with other models.

Graphical Results

These plots make it all the more clear that the RBF Kernel SVR is the most accurate model, due to its higher $R^2$ and lower MSPE values.

Conclusion

In conclusion, the best performing model is the RBF Kernel SVR, since it has a higher $R^2$ and a lower mean squared error than the other models. Note, however, that the $R^2$ values are very low overall, which suggests that the predictors are only weakly related to the UVAI. Hence, gas column densities and one-hot encoded locations may not be an optimal set of predictor variables for predicting UVAI.

Because the data is of low order, all models trained relatively quickly, so training speed does not need to be considered when choosing the best model.

For future work, additional predictor variables could be considered when predicting UVAI.

References

[1] Herman, J. R., Bhartia, P. K., Torres, O., Hsu, C., Seftor, C., & Celarier, E. (1997). Global distribution of UV-absorbing aerosols from Nimbus 7/TOMS data. Journal of Geophysical Research: Atmospheres, 102(D14), 16911-16922.

[2] De Graaf, M., Stammes, P., Torres, O., & Koelemeijer, R. B. A. (2005). Absorbing Aerosol Index: Sensitivity analysis, application to GOME and comparison with TOMS. Journal of Geophysical Research: Atmospheres, 110(D1).