NetCDF is a file format used in climate science to store and exchange multidimensional scientific data. Unlike CSV or Excel files, it stores metadata alongside n-dimensional arrays, which makes large, complex datasets with many dimensions and variables easier to handle. In this tutorial we will explore:
# Loading Modules
import xarray as xr # reads and handles netcdf files with metadata
import numpy as np # module for numerical computing
import pandas as pd # module for data manipulation and analysis
# Download the NetCDF dataset we will be using
# this file will be saved on your left sidebar in /content/MERRA2_100.instM_2d_asm_Nx.198001.SUB.nc
# it will be deleted if the runtime restarts
!gdown --id 1PGKIhAHVN-LiR0FDwam7nbus3DgVBvT6
Loading NetCDF Data from Google Drive
To work with large datasets, you may need to load the data directly from your Google Drive instead of downloading it each time. Here's how it's done.
Step 01: Download the data and store it in your Google Drive. Data link: https://drive.google.com/file/d/1PGKIhAHVN-LiR0FDwam7nbus3DgVBvT6/view?usp=share_link
Step 02: Now run this code block
from google.colab import drive
drive.mount('/content/drive')
path = 'drive/MyDrive/project/python_tutorial/files/MERRA2_100.instM_2d_asm_Nx.198001.SUB.nc'
data = xr.open_dataset(path)
# Load the dataset
data = xr.open_dataset('/content/MERRA2_100.instM_2d_asm_Nx.198001.SUB.nc')
Now we will print the dataset summary to see which variables and dimensions it contains. It shows a 2 m air temperature variable (T2M) with a single time step (1980-01-01), 576 longitude values, and 361 latitude values.
# print the dataset
data
Printing the `data` object displays a summary of the Xarray dataset, including its dimensions, variables, and attributes.
Dimensions: (time: 1, lon: 576, lat: 361)
This line shows the dimensions of the dataset. In this case there are 3 dimensions: `lat`, `lon`, and `time`. The numbers in the parentheses give the size of each dimension, so the dataset has 361 latitudes, 576 longitudes, and 1 time step.
Coordinates:
time (time) datetime64[ns] 1980-01-01
lon (lon) float64 -180.0 -179.4 ... 178.8 179.4
lat (lat) float64 -90.0 -89.5 -89.0 ... 89.5 90.0
This section shows the coordinate variables that define the dimensions of the dataset. Each coordinate variable is listed with its name, the dimension it spans in parentheses, and its data type, e.g. float64 for latitude and longitude and datetime64 for time.
Data variables: T2M (time, lat, lon) float32 ...
This section shows the data variables contained in the dataset. Each variable is listed with its name, its dimensions in parentheses, and its data type, which is float32 in this case.
Indexes:
time PandasIndex
lon PandasIndex
lat PandasIndex
The Indexes section of this example lists the index objects for each of the dataset's dimensions. In this case, the dataset has three dimensions: time, lat, and lon, and each of these dimensions has a PandasIndex object associated with it.
For example, the `time` index contains the dates and times corresponding to the dataset's time dimension, while the `lat` and `lon` indexes contain the latitude and longitude values corresponding to the dataset's latitude and longitude dimensions, respectively.
This section shows any global attributes that are present in the dataset. Global attributes provide information about the whole dataset, such as its title, creator, and version. Each attribute has a name and a value. In this example, the `Conventions` attribute indicates that the dataset follows the Climate and Forecast (CF) metadata conventions, and the `title` attribute provides a brief description of the dataset.
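The pieces of the printed summary can also be read programmatically. The sketch below is only an illustration: it builds a tiny synthetic dataset (the shape, values, and attribute strings are made up and stand in for the MERRA-2 file) and pulls out the same four parts of the summary.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Small synthetic stand-in for the real file (values and sizes are assumptions)
demo = xr.Dataset(
    data_vars={
        "T2M": (("time", "lat", "lon"), np.zeros((1, 3, 4), dtype="float32")),
    },
    coords={
        "time": pd.to_datetime(["1980-01-01"]),
        "lat": [-90.0, 0.0, 90.0],
        "lon": [-180.0, -90.0, 0.0, 90.0],
    },
    attrs={"Conventions": "CF-1", "title": "demo dataset"},
)

dim_sizes = dict(demo.sizes)      # dimension name -> length
coord_names = list(demo.coords)   # coordinate variables
var_names = list(demo.data_vars)  # data variables
var_dtype = demo["T2M"].dtype     # data type of one variable
title = demo.attrs["title"]       # a global attribute

print(dim_sizes, coord_names, var_names, var_dtype, title)
```

The same attributes (`sizes`, `coords`, `data_vars`, `attrs`) are available on any dataset returned by `xr.open_dataset`.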
The netCDF data model is based on three major components:
Dimensions are used to define the shape of the data stored in a NetCDF file. Each dimension is given a name, a length, and a flag indicating whether it is an unlimited dimension, which can be extended to accommodate new data without rewriting the entire file.
Variables are used to store multidimensional arrays of data in a NetCDF file. Each variable is given a name, a data type, a shape defined by one or more dimensions, and a set of attributes that describe the data.
Attributes are used to store metadata about the data in a NetCDF file. Each attribute is given a name and a value, which can be a scalar or a one-dimensional array. Attributes can be attached to a variable or to the file as a whole (global attributes).
Let's suppose we have defined three dimensions, time, latitude, and longitude, to store data.
Now we can create data variables, e.g. `temperature`, to store temperature data with a data type and dimensions corresponding to time, lat, and lon.
temperature (time, lat, lon) float32
Here temperature depends on three dimensions, so its value can differ whenever any one of them changes. We can add more data variables if we want, e.g.
precipitation (time, lat, lon) float32
wind_speed (time, lat, lon) float32
pressure (time, lat, lon) float32
Lastly, we might include an attribute named `units` for the temperature variable to specify that the temperature values are in degrees Celsius. We could also include attributes on the time, lat, and lon coordinate variables to specify their names and units.
Together, dimensions, variables, and attributes form the basic building blocks of the NetCDF data model. By using these components, NetCDF files can store and organize complex scientific data sets in a way that is platform-independent, flexible, and efficient.
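The three building blocks can be sketched in a few lines of xarray. This is only an illustration of the data model: the sizes, coordinate values, random numbers, and the `title` string are assumptions, not data from the tutorial's files.

```python
import numpy as np
import pandas as pd
import xarray as xr

rng = np.random.default_rng(0)
shape = (2, 3, 4)  # (time, lat, lon), chosen arbitrarily for the sketch

model = xr.Dataset(
    data_vars={
        # each variable: (dimensions, data, attributes)
        "temperature": (
            ("time", "lat", "lon"),
            rng.uniform(-30, 40, shape).astype("float32"),
            {"units": "degrees Celsius"},
        ),
        "pressure": (
            ("time", "lat", "lon"),
            rng.uniform(900, 1100, shape).astype("float32"),
            {"units": "hPa"},
        ),
    },
    coords={
        "time": pd.date_range("2022-01-01", periods=2, freq="D"),
        "lat": ("lat", [-45.0, 0.0, 45.0], {"units": "degrees_north"}),
        "lon": ("lon", [0.0, 90.0, 180.0, 270.0], {"units": "degrees_east"}),
    },
    attrs={"title": "Illustrative data model"},  # global attribute
)

print(model)
```

Both variables share the same three dimensions, and the `units` attributes travel with the variables they describe.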
Different data types can be used in a NetCDF file for storing scientific data, including:
First, we have to install the NetCDF libraries and the netCDF4 Python package.
!apt-get install libnetcdf-dev netcdf-bin
!pip install netcdf4
Importing the required packages, netCDF4 and numpy.
import netCDF4 as nc
import numpy as np
Suppressing any warnings that may appear during the execution of the script.
import warnings
warnings.filterwarnings('ignore')
Creating a new NetCDF file named `temperature_pressure.nc` for writing, using the `nc.Dataset` constructor.
ncfile = nc.Dataset('temperature_pressure.nc','w')
Defining the dimensions of the data using the `createDimension` method. In this example, we define the dimensions `time`, `lat`, and `lon`. Passing `None` as the size makes `time` an unlimited dimension.
time = ncfile.createDimension('time', None)
lat = ncfile.createDimension('lat', 10)
lon = ncfile.createDimension('lon', 10)
Creating variables for the data using the `createVariable` method. In this example, we create variables for `time`, `lat`, `lon`, `temperature`, and `pressure`. We also set their attributes using the `.units` and `.long_name` properties.
times = ncfile.createVariable('time', np.float64, ('time',))
times.units = 'days since 2022-01-01 00:00:00'
times.long_name = 'Time'
lats = ncfile.createVariable('lat', np.float32, ('lat',))
lats.units = 'degrees_north'
lats.long_name = 'Latitude'
lons = ncfile.createVariable('lon', np.float32, ('lon',))
lons.units = 'degrees_east'
lons.long_name = 'Longitude'
temp = ncfile.createVariable('temperature', np.float32, ('time', 'lat', 'lon',))
temp.units = 'Celsius'
temp.long_name = 'Temperature'
pres = ncfile.createVariable('pressure', np.float32, ('time', 'lat', 'lon',))
pres.units = 'hPa'
pres.long_name = 'Pressure'
Creating some fake `temperature` and `pressure` data to write to the file. In this example, we create 5 time values, 10 latitude values, and 10 longitude values. The temperature and pressure data are random integers drawn with `np.random.randint`, whose upper bound is exclusive.
num_times = 5
num_lats = 10
num_lons = 10
times_data = np.arange(num_times)
lats_data = np.linspace(-90, 90, num_lats)
lons_data = np.linspace(-180, 180, num_lons)
temp_data = np.random.randint(-30, 40, (num_times, num_lats, num_lons))
pres_data = np.random.randint(900, 1100, (num_times, num_lats, num_lons))
Writing the data to the variables in the NetCDF file using the `[:]` notation.
times[:] = times_data
lats[:] = lats_data
lons[:] = lons_data
temp[:] = temp_data
pres[:] = pres_data
Setting metadata for the NetCDF file using the `.title`, `.history`, and `.source` properties.
ncfile.title = 'Temperature and Pressure Data'
ncfile.history = 'Created on February 15, 2023'
ncfile.source = 'Generated by Python script'
Finally, closing the NetCDF file using the `.close()` method.
ncfile.close()
# Installation of netcdf
!apt-get install libnetcdf-dev netcdf-bin
!pip install netcdf4
import netCDF4 as nc
import numpy as np
import warnings
warnings.filterwarnings('ignore')
# creating a new NetCDF file
ncfile = nc.Dataset('temperature_pressure.nc','w')
# defining the dimensions of the data
time = ncfile.createDimension('time', None)
lat = ncfile.createDimension('lat', 10)
lon = ncfile.createDimension('lon', 10)
# creating variables for the data and set their attributes
times = ncfile.createVariable('time', np.float64, ('time',))
times.units = 'days since 2022-01-01 00:00:00'
times.long_name = 'Time'
lats = ncfile.createVariable('lat', np.float32, ('lat',))
lats.units = 'degrees_north'
lats.long_name = 'Latitude'
lons = ncfile.createVariable('lon', np.float32, ('lon',))
lons.units = 'degrees_east'
lons.long_name = 'Longitude'
temp = ncfile.createVariable('temperature', np.float32, ('time', 'lat', 'lon',))
temp.units = 'Celsius'
temp.long_name = 'Temperature'
pres = ncfile.createVariable('pressure', np.float32, ('time', 'lat', 'lon',))
pres.units = 'hPa'
pres.long_name = 'Pressure'
# creating some fake temperature and pressure data to write to the file
num_times = 5
num_lats = 10
num_lons = 10
times_data = np.arange(num_times)
lats_data = np.linspace(-90, 90, num_lats)
lons_data = np.linspace(-180, 180, num_lons)
temp_data = np.random.randint(-30, 40, (num_times, num_lats, num_lons))
pres_data = np.random.randint(900, 1100, (num_times, num_lats, num_lons))
# writing the data to the variables in the NetCDF file
times[:] = times_data
lats[:] = lats_data
lons[:] = lons_data
temp[:] = temp_data
pres[:] = pres_data
# setting metadata for the NetCDF file
ncfile.title = 'Temperature and Pressure Data'
ncfile.history = 'Created on February 15, 2023'
ncfile.source = 'Generated by Python script'
# closing the NetCDF file
ncfile.close()
`ncdump` is a command-line tool. It displays the metadata and data of a NetCDF file as human-readable text; in short, it shows the contents of NetCDF files.
`ncdump` has many options:
ncdump [-c|-h] [-v ...] [-k] [-t] [-s] [-b lang] [-f lang]
[-l len] [-n name] [-p fdig[,ddig]] [-x] file.nc
[-c] Coordinate variable data and header information
[-h] Header information only, no data
[-v var1[,...]] Data for variable(s) var1,... only
[-k] Output kind of netCDF file
[-t] Output time data as ISO date-time strings
[-s] Output special (virtual) attributes
[-b [c|f]] Brief annotations for C or Fortran indices in data
[-f [c|f]] Full annotations for C or Fortran indices in data
[-l len] Line length maximum in data section (default 80)
[-n name] Name for netCDF (default derived from file name)
[-p n[,n]] Display floating-point values with less precision
[-x] Output NcML instead of CDL (netCDF-3 files only)
file.nc Name of netCDF file
**Let's try it on a sample NetCDF file published by an external organization.**
**Step-1** We need to download a NetCDF file, for example from Unidata. We are downloading this file:
https://www.unidata.ucar.edu/software/netcdf/examples/ECMWF_ERA-40_subset.nc
To download the file we use the code in the cell below.
#downloading the file
!wget -nc https://www.unidata.ucar.edu/software/netcdf/examples/ECMWF_ERA-40_subset.nc -O ECMWF_ERA-40_subset.nc
**Step-2**
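Then we can simply run `ncdump`. Here `-c` prints the header plus the coordinate variable data, and `-t` shows times as date-time strings: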
#Applying ncdump
!ncdump -ct ECMWF_ERA-40_subset.nc
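A similar header-style listing is available from Python through xarray's `Dataset.info()` method. The sketch below uses a small synthetic dataset (variable name, sizes, and attributes are assumptions) so that it runs without the downloaded file; the output is captured in a string buffer and then printed.

```python
import io

import numpy as np
import xarray as xr

# Synthetic dataset standing in for a real file (assumed names and values)
demo = xr.Dataset(
    {"msl": (("latitude", "longitude"), np.zeros((2, 3), dtype="float32"), {"units": "Pa"})},
    coords={"latitude": [10.0, 20.0], "longitude": [0.0, 1.0, 2.0]},
    attrs={"title": "demo"},
)

buffer = io.StringIO()
demo.info(buf=buffer)      # writes an ncdump-like header into the buffer
header = buffer.getvalue()
print(header)
```

Calling `demo.info()` without a buffer prints directly to standard output.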
`Xarray` is a powerful Python library for working with NetCDF files. It offers a wide range of functions similar to those in `pandas` and `numpy`, making it easy to manipulate and analyze multidimensional arrays of data. The two primary data structures in Xarray are `DataArray` and `Dataset`: a `DataArray` is a labeled n-dimensional array similar to `numpy.ndarray`, and a `Dataset` is a dictionary-like collection of `DataArray` objects that share dimensions. Xarray also supports `groupby` operations, similar to those in `pandas`, which can be used for more complex data manipulation. Additionally, `Xarray` provides simple and intuitive visualization capabilities.
First, we can use the `xr.open_dataset(file)` function to open a NetCDF file.
# Importing xarray
import xarray as xr
# Opening the netcdf file
data_set=xr.open_dataset('/content/ECMWF_ERA-40_subset.nc')
data_set
This output gives us a well-organized view of the dimensions, coordinates, data variables, and attributes.
We can inspect a single variable using the cell below.
# data_set['msl']
# or,
data_set.msl
Here we can see all the information about the variable. The long name of the `msl` variable is `Mean sea level pressure`, which we can see in the Attributes section.
msl=data_set.msl
print(msl)
data_set.variables['msl'][0:2,0:4,0:3]
This code cuts a subset of the `msl` variable from the NetCDF dataset `data_set`. The subset contains the first two time steps, the first four latitudes, and the first three longitudes. The result is a 3-dimensional xarray `Variable` with shape (2, 4, 3) holding the values of `msl` at the selected indices; its `.values` attribute gives the underlying numpy array.
data_set.msl.isel(time=slice(0,2), latitude=slice(0,4), longitude=slice(0,3))
This code also cuts a subset of the `msl` variable in the `data_set` dataset, this time using the `isel()` method to index the variable by position along the `time`, `latitude`, and `longitude` dimensions. The resulting subset includes only the selected indices along each dimension.
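The difference between plain positional indexing and `isel()` shows up when the axis order changes. The following sketch uses a synthetic array (the sizes and values are assumptions) to illustrate that `isel()` addresses dimensions by name, so the same call keeps working after a transpose.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the msl variable (sizes are arbitrary)
demo = xr.DataArray(
    np.arange(5 * 6 * 7).reshape(5, 6, 7),
    dims=("time", "latitude", "longitude"),
    name="msl",
)

# Positional indexing depends on the order of the axes
by_position = demo[0:2, 0:4, 0:3]

# isel names each dimension explicitly
by_name = demo.isel(time=slice(0, 2), latitude=slice(0, 4), longitude=slice(0, 3))

# After reordering the axes, the same isel call still selects the same data
reordered = demo.transpose("longitude", "time", "latitude")
still_same = reordered.isel(time=slice(0, 2), latitude=slice(0, 4), longitude=slice(0, 3))

print(by_position.shape, by_name.shape, still_same.shape)
```

Naming the dimensions makes the selection independent of how a particular file happens to order its axes.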
data_set.msl.dims
Here we show the dimensions of the `data_set.msl` variable using the `dims` attribute.
data_set.msl.coords
The `coords` attribute gives us a dictionary-like object containing the coordinates of the data along each dimension of the `data_set.msl` variable.
data_set.msl.attrs
`data_set.msl.attrs` returns the attributes of the `msl` variable in the `data_set` dataset.
# ds.air.attrs['who_is_awesome']='xarray'
# ds.air.attrs
data_set.msl.values
In the above cell, `data_set.msl.values` returns the data values of the `msl` variable as a numpy array.
data_set.sel(time='2002-07-03')
This code selects the data for a specific date, `2002-07-03`. It uses the `sel()` method of xarray to select by label along the time dimension; if the dataset has several time steps on that date, all of them are returned. Other dimensions can be passed to `sel()` in the same call.
data_set.sel(longitude=240.2,method='nearest')
This code gives us the data at the longitude closest to 240.2 degrees. The `sel()` method selects data along a dimension by coordinate value. In this case we select along the `longitude` dimension, and `method='nearest'` picks the grid value nearest to the requested coordinate.
data_set.sel(longitude=[240,125,234],latitude=[40.3,50.3],method='nearest')
This works like the previous cell, but with lists of values. The output contains the grid points nearest to the longitude values `[240, 125, 234]` and the latitude values `[40.3, 50.3]`.
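To see which grid points `method='nearest'` actually picks, the sketch below repeats the two selections on a synthetic 2.5-degree grid. The grid spacing and the zero-filled values are assumptions for illustration and are not taken from the ERA-40 file.

```python
import numpy as np
import xarray as xr

# Assumed regular 2.5-degree grid, latitude running north to south
lat = np.linspace(90.0, -90.0, 73)
lon = np.arange(0.0, 360.0, 2.5)
demo = xr.DataArray(
    np.zeros((lat.size, lon.size)),
    dims=("latitude", "longitude"),
    coords={"latitude": lat, "longitude": lon},
)

# A single value: 240.2 is not on the grid, so the nearest point (240.0) is used
picked = demo.sel(longitude=240.2, method="nearest")

# Lists of values: each requested value is matched independently
grid = demo.sel(longitude=[240, 125, 234], latitude=[40.3, 50.3], method="nearest")

# Without method='nearest', an off-grid value raises a KeyError
try:
    demo.sel(longitude=240.2)
    exact_match_failed = False
except KeyError:
    exact_match_failed = True

print(float(picked.longitude), grid.longitude.values, grid.latitude.values)
```

Checking the returned coordinates, as in the last line, is a simple guard against silently selecting a point farther away than intended.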
data_set.msl.isel(time=0,latitude=2,longitude=3)
Here, the `.isel()` method selects data by the index values given for each dimension in the parentheses. We extract the value of `msl` at the first time step, latitude index 2, and longitude index 3.
data_set.msl.isel(latitude=slice(10))
Here, `isel` selects data along the `latitude` dimension using the slice object `slice(10)`, which picks the indices 0 to 9. The resulting DataArray keeps every value of the `msl` variable along the `time` and `longitude` dimensions, but only the first ten positions along the `latitude` dimension.
data_set.groupby('time.season')
ds = xr.tutorial.load_dataset('air_temperature')
ds
ds.groupby('time.season')
The groupby method in xarray allows us to group data along one or more dimensions and apply an aggregation function such as `mean`, `sum`, or `max` to each group. In this case, the code `ds.groupby('time.season')` groups the data in the dataset `ds` along the time dimension using the `season` attribute. This produces four groups, one for each season (`DJF, MAM, JJA, SON`).
In the previous code cell we learned how to group data based on season. Now we will group the data in `ds` by the `season` component of the `time` coordinate and then take the mean of each group.
seasonal_mean=ds.groupby('time.season').mean()
seasonal_mean
seasonal_mean.air.plot(col='season',col_wrap=2)
We have created a plot of the seasonal mean values of the `air` variable in the `ds` dataset. The `col` parameter is set to 'season', so each season is drawn in its own panel. The `col_wrap` parameter is set to 2, so there are at most two panels per row and the remaining seasons wrap onto a new row.
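The seasonal grouping can be checked by hand on a very small series. In this sketch the twelve monthly values 1 to 12 are made up so that each season's mean is easy to verify: DJF collects January, February, and December of the same calendar year.

```python
import numpy as np
import pandas as pd
import xarray as xr

# One synthetic value per month of 2001: January = 1, ..., December = 12
months = pd.date_range("2001-01-01", periods=12, freq="MS")
series = xr.DataArray(
    np.arange(1.0, 13.0),
    dims="time",
    coords={"time": months},
    name="value",
)

seasonal_mean = series.groupby("time.season").mean()

djf = float(seasonal_mean.sel(season="DJF"))  # (1 + 2 + 12) / 3
mam = float(seasonal_mean.sel(season="MAM"))  # (3 + 4 + 5) / 3
jja = float(seasonal_mean.sel(season="JJA"))  # (6 + 7 + 8) / 3
son = float(seasonal_mean.sel(season="SON"))  # (9 + 10 + 11) / 3

print(djf, mam, jja, son)
```

Note that the DJF group mixes the December at the end of the year with the January and February at its start, which matters when a time series covers only part of a winter.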
!wget -nc https://downloads.psl.noaa.gov//Datasets/noaa.oisst.v2/sst.wkmean.1990-present.nc
data = xr.open_dataset('sst.wkmean.1990-present.nc')
print(data)
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xarray as xr
# # Read the NetCDF file
# data = xr.open_dataset('sst.wkmean.1990-present.nc')
# Extract the necessary data variables
sst = data['sst'].data
time = pd.to_datetime(data['time'].data)
lat = data['lat'].data
lon = data['lon'].data
# Calculate the mean sea surface temperature over all longitudes
sst_zm = np.mean(sst, axis=2)
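# Extract the tropical band by latitude index (70:110 are row indices, not degrees;
# assuming a 1-degree grid running from 89.5°N to 89.5°S, they span about 20°N to 20°S)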
sst_tropics = sst_zm[:, 70:110]
# Calculate the mean sea surface temperature over the tropical region
sst_final = np.mean(sst_tropics, axis=1)
# Create the plot
plt.figure(figsize=(9, 4))
plt.plot(time, sst_final, color='r', linewidth=2., linestyle='-', alpha=1., label='SST')
plt.title(r'Sea Surface Temperature ($^\circ$C) in the Tropics')  # raw string keeps the LaTeX backslash
plt.xlabel('Time')
plt.ylabel(r'SST ($^\circ$C)')
plt.legend(fontsize=10)
plt.show()
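The index-based slice in the cell above can also be written with coordinate labels, which avoids having to know which row corresponds to which latitude. The sketch below is a hedged illustration on synthetic data: the 1-degree grid, the north-to-south latitude order, and the cosine-shaped values are assumptions, and the mean is unweighted, as in the cell above.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Assumed 1-degree grid with latitude running north to south
lat = np.arange(89.5, -90.0, -1.0)   # 180 values
lon = np.arange(0.5, 360.0, 1.0)     # 360 values
time = pd.date_range("1990-01-07", periods=4, freq="7D")

# Synthetic field: warm at the equator, plus a small offset per time step
values = (
    30.0 * np.cos(np.deg2rad(lat))[None, :, None]
    + np.arange(4)[:, None, None]
    + np.zeros((4, lat.size, lon.size))
)
demo = xr.DataArray(
    values,
    dims=("time", "lat", "lon"),
    coords={"time": time, "lat": lat, "lon": lon},
    name="sst",
)

# Label-based selection: because lat decreases, the slice runs from north to south
tropics = demo.sel(lat=slice(20, -20))
tropical_mean = tropics.mean(dim=("lat", "lon"))

# The same result with plain numpy indexing, for comparison
index_based = values.mean(axis=2)[:, 70:110].mean(axis=1)

print(tropics.lat.values[[0, -1]], tropical_mean.values)
```

On this assumed grid both approaches select the same 40 rows, but the label-based version states the intended latitude range explicitly and would still be correct if the grid resolution changed.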
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr
# Load the netCDF file
data = xr.open_dataset('sst.wkmean.1990-present.nc')
# Extract the necessary variables
sst = data.sst.data
lat = data.lat.data  # needed by contourf below
lon = data.lon.data
time = data.time.data
# Create the contour plot
plt.figure(figsize=(9, 4))
plt.contourf(lon, lat,sst[0, :, :],cmap='magma')
plt.colorbar(label=r'SST ($^\circ$C)')  # raw string keeps the LaTeX backslash
plt.title('Sea Surface Temperature')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()
!wget -nc https://downloads.psl.noaa.gov/Datasets/noaa.oisst.v2/icec.mnmean.nc
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr
# Load the netCDF file
data = xr.open_dataset('icec.mnmean.nc')
# Extract the necessary variables
lon = data.lon.data
lat = data.lat.data
icec = data.icec.data
# Create a contour plot of sea ice concentration
plt.figure(figsize=(10, 6))
plt.contourf(lon, lat, icec[0, :, :])
plt.colorbar(label='Sea Ice Concentration')
plt.title('Sea Ice Concentration')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()
The `nccopy` command is a tool provided with the NetCDF library for copying and compressing NetCDF files. Its options specify the kind of output file to generate and, optionally, the compression level and the chunking of the output.
nccopy [-k output_kind] [-d level] [-s] [-c chunkspec]
[-u] [-m n] [-h n] [-e n] input output
[-k output_kind] kind of output netCDF file
omitted => same as input
'1' or 'classic' => classic file format
'2' or '64-bit-offset' => 64-bit offset format
'3' or 'netCDF-4' => netcdf-4 format
'4' or 'netCDF-4 classic model' => netCDF-4 classic model
[-d level] deflation level, from 1 (faster but lower compression)
to 9 (slower but more compression)
[-s] shuffling option, sometimes improves compression
[-c chunkspec] specify chunking for dimensions, e.g. "dim1/N1,dim2/N2,..."
[-u] convert unlimited dimensions to fixed size in output
[-m n] memory buffer size (default 5 Mbytes)
[-h n] set size in bytes of chunk_cache for chunked variables
[-e n] set number of elements that chunk_cache can hold
input name of input file or Link
output name of output file or Link
!nccopy -u -s -d6 ECMWF_ERA-40_subset.nc file_compressed.nc
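This creates a compressed copy of `ECMWF_ERA-40_subset.nc` named `file_compressed.nc`, with unlimited dimensions converted to fixed size (`-u`), shuffling enabled (`-s`), and deflation level 6 (`-d6`).
Note that the '!' at the beginning of the command is used to execute shell commands within the Colab environment.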
CDO (Climate Data Operators) is a collection of command-line tools designed for working with climate and weather data. It provides a wide range of operators to manipulate, analyze, and process climate data.
To install CDO in Colab, we can use the following snippet:
!apt-get install -y cdo
▶ To concatenate and merge files:
!cdo merge file1.nc file2.nc file3.nc merged.nc
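Here, `file1.nc`, `file2.nc`, and `file3.nc` are merged into `merged.nc`. The `merge` operator combines files that hold different variables; to join files along the time axis, CDO provides `mergetime`.
▶ To convert a GRIB file to a NetCDF file: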
!cdo -f nc copy file.grb file.nc
▶ To shift the longitudes of the data from 0:360 to -180:180:
!cdo sellonlatbox,-180,180,-90,90 input.nc output.nc
▶ To calculate the mean over all time steps:
!cdo timmean input.nc timmean.nc
▶ To apply mathematical and statistical operations:
!cdo add input1.nc input2.nc sum.nc
!cdo mul input1.nc input2.nc product.nc
▶ To perform time series analysis and temporal aggregations:
!cdo yearmean input.nc yearly_mean.nc
!cdo monmean input.nc monthly_mean.nc
Note that the '!' at the beginning of the command is used to execute shell commands within the Colab environment.
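For readers who stay in Python, a few of these operations have close counterparts in xarray. The sketch below is not CDO itself; it is an approximate equivalent on a small synthetic dataset (the variable name `tas`, the grid, and the values are assumptions) covering the time mean, the longitude shift, and adding two fields.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Tiny synthetic dataset with longitudes on a 0..360 grid
demo = xr.Dataset(
    {"tas": (("time", "lon"), np.array([[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]))},
    coords={
        "time": pd.date_range("2000-01-01", periods=2, freq="D"),
        "lon": [0.0, 90.0, 180.0, 270.0],
    },
)

# Roughly what `cdo timmean` does: average over the time dimension
time_mean = demo.mean(dim="time")

# Roughly what the sellonlatbox call does: relabel longitudes to -180..180 and reorder
shifted = demo.assign_coords(lon=((demo.lon + 180) % 360) - 180).sortby("lon")

# Roughly what `cdo add` does: element-wise sum of two fields
summed = demo + demo

print(time_mean.tas.values, shifted.lon.values, summed.tas.values[0])
```

The data values move together with their longitudes when the coordinate is sorted, so the field stays aligned with the new labels.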
NCO (netCDF Operators) is typically used when we need to perform advanced operations on NetCDF files.
Some NCO command-line operators that can be applied to NetCDF files:
ncap2 - arithmetic processor
ncatted - attribute editor
ncbo - binary operator
ncdiff - differencer
ncea - ensemble averager
ncecat - ensemble concatenator
ncflint - file interpolator
ncks - kitchen sink (extract, cut, paste, print data)
ncpdq - permute dimensions quickly
ncra - record averager
ncrcat - record concatenator
ncrename - renamer
ncwa - weighted averager
In Colab we have to install NCO using the command:
!apt-get install -y nco
▶ To multiply a variable by a constant:
!ncap2 -s 'variable = variable * 2.0' input.nc output.nc
▶ To modify an existing attribute (mode `m`; mode `o` also creates the attribute if it is missing):
!ncatted -a units,variable,m,c,'New Units' input.nc output.nc
▶ To add the variables of two files that share the same names:
!ncbo --op_typ=add input1.nc input2.nc output.nc
▶ To calculate the difference between two files:
!ncdiff -O -o diff.nc input1.nc input2.nc
Note that the '!' at the beginning of the command is used to execute shell commands within the Colab environment.