Getting Started with the International Disaster Database (EM-DAT) with Python and Pandas
Learn how to load, explore, and analyze the EM-DAT natural disaster dataset using Python and Pandas.
Over the past few weeks I’ve been exploring the International Disaster Database (EM-DAT), which provides data on over 26,000 mass disasters from 1900 to the present day.
I first came across the database after discovering Juliana Negrini de Araujo’s fantastic Jupyter Notebook which provides some high-level exploration of the dataset.
In many ways, the EM-DAT dataset is my first foray into analyzing natural disasters with Python.
I’ve spent some time building on Juliana’s work, cleaning up some code, and providing some additional detail and commentary — today I’m excited to publish the article. I hope it helps the community, and at the very least, it serves as documentation for myself as I continue exploring the EM-DAT dataset.
Overall, the goal of this article is to provide a detailed introduction to the EM-DAT natural disaster dataset, serving as a starting point for anyone else in the community who wants to study natural disasters through a data science lens.
The EM-DAT dataset catalogs over 26,000 mass disasters worldwide from 1900 to present day.
EM-DAT is maintained by the Centre for Research on the Epidemiology of Disasters (CRED), and is provided in open access to any person or organization performing non-commercial research.
It is critical to understand that EM-DAT does not catalog all natural disasters worldwide. It instead focuses on mass disasters.
According to EM-DAT and CRED, a mass disaster is a specific type of natural disaster that leads to significant human and economic loss, requiring that at least one of the following criteria hold: ten (10) or more people reported killed, one hundred (100) or more people reported affected, a declaration of a state of emergency, or a call for international assistance.
For example, if a tornado touches down in rural Texas but doesn’t kill anyone, nor does it cause significant property damage, then the tornado would not be added to the EM-DAT dataset.
An example of a mass natural disaster that would be included in the EM-DAT dataset is Hurricane Katrina, which absolutely devastated New Orleans, LA, and the surrounding areas, causing over 1,800 fatalities and estimated damages in the range of $98-145 billion USD.
Clearly, Hurricane Katrina passes multiple tests for inclusion in the EM-DAT dataset.
Below, I’ve included the data EM-DAT reports for Hurricane Katrina so you can get a feel for the data EM-DAT provides:
It’s also worth noting that EM-DAT utilizes a hierarchical classification of all disasters, as I discuss in the “Breaking down disaster types in the EM-DAT dataset” section of the article.
One particular benefit of the hierarchical structure is that you can “drill down” into natural disaster type based on the following taxonomy: Disaster Group → Disaster Subgroup → Disaster Type → Disaster Subtype → Disaster Subsubtype.
Let’s take extreme storms as an example: the Disaster Group is “Natural”, the Disaster Subgroup is “Meteorological”, and the Disaster Type is “Storm”, which is further broken down into subtypes such as tropical cyclones and convective storms.
Using Python and Pandas, we can easily filter the natural disasters we are most interested in.
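For instance, here is a minimal sketch of that kind of filter, assuming the dataset has already been loaded into a Pandas dataframe named `df` (the loading code appears later in this article); the exact subtype string is an assumption you should verify against your copy of the dataset:

```python
# filter the dataset down to storm-related disasters only (assumes EM-DAT has
# already been loaded into the dataframe `df`)
storms = df[df["Disaster Type"] == "Storm"]

# drill down one level further to a specific subtype; check the exact string
# via df["Disaster Subtype"].unique() before relying on it
tropical_cyclones = storms[storms["Disaster Subtype"] == "Tropical cyclone"]
print(storms.shape, tropical_cyclones.shape)
```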
Exploring the EM-DAT dataset (image credit)
Approximately two-thirds of disasters recorded in EM-DAT are related to natural hazards.
Here are common occurrences for natural hazards in the dataset:
As you can see, the EM-DAT dataset is quite diverse, covering a wide variety of natural disasters.
While there are certainly other datasets that provide far more data for a specific natural disaster type (for example, NOAA’s dataset on tornadoes), there are few datasets like EM-DAT that provide worldwide natural disaster data for such a diverse collection of disaster types.
The diversification factor alone makes EM-DAT worth studying in more detail (at least in my humble opinion).
The EM-DAT dataset is not without its limitations (image credit)
I want to be upfront and say that working with the EM-DAT dataset can be a bit challenging at times:
For example, when working with location data in the United States, some rows will have a mixture of:
Much of this location data is mixed and matched, making it sometimes challenging to work with.
Note: In fact, I had to develop a separate AI-based script to clean up the location data (which I’ll provide in a future article).
While EM-DAT strives to maintain high data accuracy, it relies on multiple sources for its data, including UN agencies, governmental agencies, non-governmental organizations, insurance companies, research institutes, and press agencies.
Each of these organizations has its own reporting standards which may or may not facilitate high data accuracy.
Additionally, the estimates on economic loss data can vary significantly, implying that the economic loss estimates EM-DAT reports may not measure the full extent of the natural disaster impact.
Finally, and in my opinion, most importantly, under-reporting can be a concern when utilizing EM-DAT in your work or research.
By design, EM-DAT only includes natural disasters that satisfy at least one of the inclusion criteria described earlier in this article.
By definition, these criteria may result in natural disasters that otherwise should be included being left out of the dataset due to limited media coverage, a lack of money/infrastructure to gather reliable data, etc.
The EM-DAT dataset is an invaluable resource, but it does have its limitations, so keep them in mind if you choose to utilize it in your own work.
The EM-DAT dataset includes 43 features (i.e., columns) used to represent and quantify mass natural disasters worldwide.
I’m still exploring and wrapping my head around the nuanced contextualization of each of these features, but I’ve done my best to summarize each of them below:
- Several of the flag-style features contain only `yes` or `NaN` values (e.g., OFDA Response), while others contain `yes`, `no`, or `NaN` (e.g., Appeal and Declaration).
- The Dis Mag Value and Dis Mag Scale features work together to describe a disaster’s magnitude: if Dis Mag Value is `110` and Dis Mag Scale reports `KPH`, then the natural disaster was reported as having a (presumed) wind speed of 110 KPH.

Again, I want to reiterate that these explanations are provided to the best of my knowledge given my current understanding of the dataset.
If you are already an expert in the EM-DAT dataset (or better yet, a curator of the dataset), please leave a comment below to correct any of my unfortunate (but well-intentioned) mistakes.
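To make the `Dis Mag Value`/`Dis Mag Scale` pairing more concrete, below is a small sketch of how the two columns can be read together once the dataset has been loaded into a dataframe `df` (covered in the next section). The column names come straight from EM-DAT, but the filtering logic is simply one illustrative way to use them:

```python
# grab the magnitude columns for storm events; Dis Mag Scale tells us which
# unit Dis Mag Value is expressed in (e.g., KPH for wind speed)
storm_magnitudes = df.loc[
    df["Disaster Type"] == "Storm",
    ["Dis No", "Dis Mag Value", "Dis Mag Scale"]
]

# keep only the rows where a wind speed magnitude was actually reported
# (case-insensitive match on the scale, since the exact casing may vary)
wind_speeds = storm_magnitudes[
    storm_magnitudes["Dis Mag Scale"].str.lower().eq("kph")
    & storm_magnitudes["Dis Mag Value"].notnull()
]
```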
When confronted with a new dataset that I have no prior experience with, I first like to perform an Exploratory Data Analysis (EDA) to better understand and summarize its main characteristics.
“In statistics, exploratory data analysis is an approach of analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods”
To start, I like to build a basic table that summarizes the number of samples (rows), the number of features (columns), the number of duplicate rows, the number of rows containing at least one NaN value, and the total number of NaN values in the dataset.
The following code loads the EM-DAT dataset from disk:
# import the necessary packages
from collections import namedtuple
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import os
# specify the path to the EM-DAT dataset
emdat_dataset_path = os.path.join(
"natural-disasters-data",
"em-dat",
"EMDAT_1900-2021_NatDis.csv"
)
# load the EM-DAT natural disasters dataset from disk
df = pd.read_csv(emdat_dataset_path)
df.head()
Note: For the sake of simplicity (and to ensure you can reproduce my results), I am using the version of the EM-DAT dataset hosted on Kaggle rather than the one provided by the Centre for Research on the Epidemiology of Disasters (which requires registration).
Below are the first five rows of the dataset:
Dis No | Year | Seq | Disaster Group | Disaster Subgroup | Disaster Type | Disaster Subtype | Disaster Subsubtype | Event Name | Entry Criteria | Country | ISO | Region | Continent | Location | Origin | Associated Dis | Associated Dis2 | OFDA Response | Appeal | Declaration | Aid Contribution | Dis Mag Value | Dis Mag Scale | Latitude | Longitude | Local Time | River Basin | Start Year | Start Month | Start Day | End Year | End Month | End Day | Total Deaths | No Injured | No Affected | No Homeless | Total Affected | Reconstruction Costs ('000 US$) | Insured Damages ('000 US$) | Total Damages ('000 US$) | CPI | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1900-9002-CPV | 1900 | 9002 | Natural | Climatological | Drought | Drought | NaN | NaN | NaN | Cabo Verde | CPV | Western Africa | Africa | Countrywide | NaN | Famine | NaN | NaN | No | No | NaN | NaN | Km2 | NaN | NaN | NaN | NaN | 1900 | NaN | NaN | 1900 | NaN | NaN | 11000.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3.261389 |
1 | 1900-9001-IND | 1900 | 9001 | Natural | Climatological | Drought | Drought | NaN | NaN | NaN | India | IND | Southern Asia | Asia | Bengal | NaN | NaN | NaN | NaN | No | No | NaN | NaN | Km2 | NaN | NaN | NaN | NaN | 1900 | NaN | NaN | 1900 | NaN | NaN | 1250000.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3.261389 |
2 | 1902-0012-GTM | 1902 | 12 | Natural | Geophysical | Earthquake | Ground movement | NaN | NaN | Kill | Guatemala | GTM | Central America | Americas | Quezaltenango, San Marcos | NaN | Tsunami/Tidal wave | NaN | NaN | NaN | NaN | NaN | 8.0 | Richter | 14 | -91 | 20:20 | NaN | 1902 | 4.0 | 18.0 | 1902 | 4.0 | 18.0 | 2000.0 | NaN | NaN | NaN | NaN | NaN | NaN | 25000.0 | 3.391845 |
3 | 1902-0003-GTM | 1902 | 3 | Natural | Geophysical | Volcanic activity | Ash fall | NaN | Santa Maria | Kill | Guatemala | GTM | Central America | Americas | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1902 | 4.0 | 8.0 | 1902 | 4.0 | 8.0 | 1000.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3.391845 |
4 | 1902-0010-GTM | 1902 | 10 | Natural | Geophysical | Volcanic activity | Ash fall | NaN | Santa Maria | Kill | Guatemala | GTM | Central America | Americas | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1902 | 10.0 | 24.0 | 1902 | 10.0 | 24.0 | 6000.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3.391845 |
A few noteworthy items:
Below is my basic exploratory data analysis function (which I optimized from Juliana’s original implementation):
def basic_eda(df):
# count the number of duplicated rows, then grab all NaN (i.e., null) rows
# in the dataframe
num_duplicated = df.duplicated().sum()
is_nan = df.isnull()
# count the total number of rows that contain *at least one* null value
num_null_rows = is_nan.any(axis=1).sum()
# count the total number of null values across *all* rows and *all* columns
# (i.e., a sum of a sum)
num_total_null = df.isnull().sum().sum()
# construct a named tuple to represent each row in the exploratory data
# analysis summary
EDARow = namedtuple("EDARow", ["Name", "Value", "Notes"])
# build the list of exploratory data analysis rows
rows = [
EDARow("Samples", df.shape[0], ""),
EDARow("Features", df.shape[1], ""),
EDARow("Duplicate Rows", num_duplicated, ""),
EDARow("Rows with NaN", num_null_rows, "{:.2f}% all rows".format(
(num_null_rows / df.shape[0]) * 100)),
EDARow("Total NaNs", num_total_null, "{:.2f}% feature matrix".format(
(num_total_null / (df.shape[0] * df.shape[1])) * 100)),
]
# build and return our exploratory data analysis dataframe
return pd.DataFrame(rows, columns=["Name", "Value", "Notes"])
The above code builds a summary dataframe.
An aspect of the code I want to point out is the difference between:

- the number of rows that contain at least one NaN value (`num_null_rows`)
- the total number of NaN values across the entire dataframe (`num_total_null`)

The former is accomplished simply by summing the number of rows that have at least one null column. I accomplish the latter by doing two sums: first counting the null values in each column, and then summing those per-column counts across all columns.
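As a tiny illustration of the difference between the two counts (using a made-up dataframe rather than EM-DAT itself):

```python
# a toy dataframe with three rows: two rows contain NaNs, and there are three
# NaN values in total
toy = pd.DataFrame({
    "a": [1.0, None, 3.0],
    "b": [None, None, 6.0],
})

# number of rows with at least one NaN -> 2
print(toy.isnull().any(axis=1).sum())

# total number of NaN values across all rows and columns -> 3
print(toy.isnull().sum().sum())
```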
The final dataframe provides one row for each of these summary statistics (Samples, Features, Duplicate Rows, Rows with NaN, and Total NaNs).
I can then call the `basic_eda` method on the EM-DAT dataset:
# perform a basic exploratory data analysis of the EM-DAT dataset
basic_eda(df)
Which provides the following summary table:
Name | Value | Notes | |
---|---|---|---|
0 | Samples | 15827 | |
1 | Features | 43 | |
2 | Duplicate Rows | 0 | |
3 | Rows with NaN | 15827 | 100.00% all rows |
4 | Total NaNs | 285923 | 42.01% feature matrix |
As the EDA table demonstrates, there are a total of 15,827 rows in the EM-DAT dataset.
Note: Again, I want to stress that I’m using the Kaggle version of the EM-DAT dataset, which is easily accessible and ensures you can reproduce any of my results. You can also use the CRED version of EM-DAT, which is the official version, but it will require you to register before you can access it.
There are 43 features (i.e., columns) in the EM-DAT dataset (each of which I did my best to summarize earlier in this article).
The EM-DAT dataset has zero duplicate rows, meaning that no two rows in the dataset have identical column values.
Earlier in this article I mentioned that the EM-DAT natural disaster dataset is noisy, containing missing and incomplete data — the final two rows in the EDA table provide evidence to this claim.
First, every single row in the dataset has at least one column with a NaN value, meaning that there is no row with all column/feature values provided.
Additionally, approximately 42% of the entire feature matrix (15,827 x 43 = 680,561 values) is NaN, implying that nearly half of all values in the dataset are missing.
Datasets with a large number of NaNs can be quite challenging and problematic to work with for data scientists, as we need to decide how to handle the missing values: do we drop the affected rows, drop the affected columns, or attempt to fill/impute the missing values?
In many cases, the answer is dataset (and even feature/column) dependent.
I’ve found that some of EM-DAT’s features can be reliably filled in while others cannot. I’ll likely do a future article on this type of study.
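To give a rough (and deliberately hedged) idea of what I mean, the strategy can differ from column to column; the snippet below is an illustration of possible choices, not a recommendation for how EM-DAT should actually be cleaned:

```python
# work on a copy so the original dataframe stays untouched
cleaned = df.copy()

# for a count-like feature, one *possible* (but debatable) strategy is to treat
# a missing value as zero
cleaned["Total Deaths"] = cleaned["Total Deaths"].fillna(0)

# for a categorical feature, an explicit "Unknown" label keeps the rows around
cleaned["Disaster Subtype"] = cleaned["Disaster Subtype"].fillna("Unknown")

# for analyses that truly require coordinates, dropping rows without them may
# be the only option
with_coords = cleaned.dropna(subset=["Latitude", "Longitude"])
```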
Given that 42% of all values in EM-DAT are missing, I found it worthwhile to spend a bit more time investigating exactly which features have values provided, and which ones are predominantly NaN.
Furthermore, I wanted to understand the uniqueness of the features, as unique values can have predictive power in downstream machine learning models.
To that end, I again built on Juliana’s work and defined the `summarize_data` function to further summarize the EM-DAT dataset:
def summarize_data(df):
# initialize a summary dataframe consisting of the original dataframe's
# column names and data types
summary = pd.DataFrame(df.dtypes, columns=["dtypes"])
# reset the summary index, rename the "index" column to "Name", and then
# remove the "index" column
summary = summary.reset_index()
summary["Name"] = summary["index"]
summary = summary[["Name", "dtypes"]]
# count the number of (1) null values for each column, and (2) the unique
# values in each column
summary["Missing"] = df.isnull().sum().values
summary["Uniques"] = df.nunique().values
# return the summary dataframe
return summary
The `summarize_data` method constructs a dataframe that analyzes every column of the dataset. For each column, four values are provided: the column name, its data type, the number of missing (i.e., NaN) values, and the number of unique values.
Calling `summarize_data` on the `df` dataframe is as simple as:
# summarize the EM-DAT dataframe
summarize_data(df)
Which provides the following large block of output, analyzing the total number of missing and unique values for each column:
Name | dtypes | Missing | Uniques | |
---|---|---|---|---|
0 | Dis No | object | 0 | 15827 |
1 | Year | int64 | 0 | 122 |
2 | Seq | int64 | 0 | 1266 |
3 | Disaster Group | object | 0 | 1 |
4 | Disaster Subgroup | object | 0 | 6 |
5 | Disaster Type | object | 0 | 15 |
6 | Disaster Subtype | object | 2984 | 27 |
7 | Disaster Subsubtype | object | 14782 | 12 |
8 | Event Name | object | 12024 | 1532 |
9 | Entry Criteria | object | 335 | 13 |
10 | Country | object | 0 | 227 |
11 | ISO | object | 0 | 227 |
12 | Region | object | 0 | 23 |
13 | Continent | object | 0 | 5 |
14 | Location | object | 1808 | 12453 |
15 | Origin | object | 12190 | 638 |
16 | Associated Dis | object | 12614 | 30 |
17 | Associated Dis2 | object | 15145 | 30 |
18 | OFDA Response | object | 14220 | 1 |
19 | Appeal | object | 13259 | 2 |
20 | Declaration | object | 12612 | 2 |
21 | Aid Contribution | float64 | 15150 | 556 |
22 | Dis Mag Value | float64 | 10926 | 1859 |
23 | Dis Mag Scale | object | 1171 | 5 |
24 | Latitude | object | 13111 | 2360 |
25 | Longitude | object | 13108 | 2426 |
26 | Local Time | object | 14735 | 777 |
27 | River Basin | object | 14570 | 1184 |
28 | Start Year | int64 | 0 | 122 |
29 | Start Month | float64 | 384 | 12 |
30 | Start Day | float64 | 3600 | 31 |
31 | End Year | int64 | 0 | 122 |
32 | End Month | float64 | 708 | 12 |
33 | End Day | float64 | 3532 | 31 |
34 | Total Deaths | float64 | 4591 | 840 |
35 | No Injured | float64 | 12011 | 753 |
36 | No Affected | float64 | 6844 | 3486 |
37 | No Homeless | float64 | 13422 | 906 |
38 | Total Affected | float64 | 4471 | 4982 |
39 | Reconstruction Costs ('000 US$) | float64 | 15796 | 29 |
40 | Insured Damages ('000 US$) | float64 | 14735 | 359 |
41 | Total Damages ('000 US$) | float64 | 10661 | 1558 |
42 | CPI | float64 | 424 | 111 |
However, I find looking at such a table tedious and uninformative.
A more useful exercise is to visualize the above information.
In my opinion, a better way to explore features with missing values is to count the number of NaN values per column, and then create a bar chart that plots the counts in descending order.
Doing so makes it trivially easy to identify columns with a large number of missing values with a simple visual inspection.
The following `plot_null_columns` method is more verbose, but provides a nicer, cleaner analysis:
def plot_null_columns(
df,
title,
x_label="Feature Names",
y_label="# of Null Values",
figsize=(20, 5)
):
# count the number of times a given column has a null value
null_cols = df.isnull().sum().sort_values(ascending=False)
# initialize the figure, set the tick information, and update the spines
plt.figure(figsize=figsize)
sns.set(style="ticks", font_scale=1)
plt.xticks(rotation=90, fontsize=12)
sns.despine(top=True, right=True, bottom=False, left=True)
# plot the data
ax = sns.barplot(x=null_cols.index, y=null_cols, palette="cool_r")
# set the x-label, y-label, and title
ax.set_xlabel(x_label)
ax.set_ylabel(y_label)
plt.title(title)
# loop over the patches and null column counts
for (p, count) in zip(ax.patches, null_cols):
# compute the percentage of the number of rows that have a null value
# for the current column
ax.annotate(
"{:.1f}%".format((count / df.shape[0]) * 100),
(p.get_x() + (p.get_width() / 2.0), abs(p.get_height())),
ha="center",
va="bottom",
rotation="vertical",
color="black",
xytext=(0, 10),
textcoords="offset points"
)
What I found tricky about the above code block was figuring out the computation of `null_cols` such that it could also be provided to the `barplot` method to plot the number of columns with missing values in descending order.
The trick was to first compute a column-wise count of NaNs, sort them in descending order (via `sort_values`), and then provide `null_cols.index` to the `barplot` function.
Secondly, I find it extremely helpful to also plot the percentage of null counts for each column directly above each of the individual bars; the `for` loop handles that computation.
The handy `plot_null_columns` function can then be called via:
# plot the null column counts within the dataset
plot_null_columns(df, "Features With Missing Values")
Which produces the following plot:
Clearly, it’s far easier to visually identify features with missing values using this approach.
A quick inspection of the above plot reveals that there are seven features where over 90% of the values are missing: Disaster Subsubtype, Associated Dis2, Aid Contribution, Local Time, River Basin, Reconstruction Costs ('000 US$), and Insured Damages ('000 US$).
Conversely, there are twelve features that I would consider to form the “core definition” of the EM-DAT dataset: Dis No, Year, Seq, Disaster Group, Disaster Subgroup, Disaster Type, Country, ISO, Region, Continent, Start Year, and End Year.
All twelve of these features contain values (i.e., none are missing/NaN), and while these values may still be noisy, they are at least present in the EM-DAT dataset.
In my initial studies I’ve predominantly focused on these core features.
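For completeness, here is a short sketch of how both groups of features can be identified programmatically (rather than by reading them off the plot):

```python
# compute the fraction of missing values per column, sorted in descending order
missing_frac = df.isnull().mean().sort_values(ascending=False)

# features where over 90% of the values are missing
mostly_missing = missing_frac[missing_frac > 0.90]
print(mostly_missing)

# "core" features with no missing values at all
core_features = missing_frac[missing_frac == 0].index.tolist()
print(core_features)
```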
My initial foray into the EM-DAT dataset concluded with an exploration of the natural disaster hierarchy grouping, using `Disaster Group` as the base and then becoming more fine-grained based on Disaster Subgroup, Disaster Type, Disaster Subtype, and Disaster Subsubtype.
I found the hierarchy exploration absolutely essential in understanding EM-DAT’s structure, and if you’re exploring EM-DAT for the first time, I’d suggest you spend considerable time here.
Below is the code I used:
# define the disaster type columns we are interested in
disaster_cols = [
"Disaster Subgroup",
"Disaster Type",
"Disaster Subtype",
"Disaster Subsubtype",
]
# grab the disaster data from the dataframe
disaster_df = df[disaster_cols]
# fill any null values with a placeholder string ("NA"), implying that no
# subtype or subsubtype exists for the current row
disaster_df = disaster_df.fillna(value={
"Disaster Subtype": "NA",
"Disaster Subsubtype": "NA",
})
# construct the final dataframe which displays a hierarchical overview of the
# disaster types, including the counts for each one
disaster_df = pd.DataFrame(
disaster_df.groupby(disaster_cols).size().to_frame("count")
)
disaster_df
Basically, what I’m doing here is:

1. Grabbing the `disaster_cols` from the dataframe
2. Filling any null Disaster Subtype/Disaster Subsubtype values with a placeholder value ("NA")
3. Grouping by the unique combinations of the disaster columns and counting the occurrences of each combination

The `groupby` function call is doing quite a bit, but the gist is that the code groups the dataframe by the unique combinations of `disaster_cols`, counting the occurrences of each combination. The final counts for each combination are stored in a new column named `count`.
Running the above code produces the following table which concisely depicts the hierarchical organization of natural disasters in EM-DAT:
For example, let’s investigate the “Biological” subgroup, which contains three disaster types: Epidemic, Insect infestation, and Animal accident.
Now, let’s further examine the “Insect infestation” disaster type, which is broken down into subtypes such as Grasshopper and Locust infestations.
I had to sit and study this hierarchy table for a bit to fully appreciate and understand the EM-DAT disaster categorization hierarchy.
If you’re considering working with EM-DAT, I’d highly suggest you spend at least 30-60 minutes here before you move on to more detailed data analysis.
If I hadn’t taken the time to understand the data hierarchy, I think I would have struggled considerably when I moved on to more advanced analysis.
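For reference, here is a small sketch of how the grouped `disaster_df` from the previous code block can be queried directly; since `Disaster Subgroup` is the first level of the resulting MultiIndex, a `.loc` lookup (or a cross-section for deeper levels) does the drilling down for us:

```python
# show only the "Biological" branch of the disaster hierarchy, along with the
# event counts for each (type, subtype, subsubtype) combination
biological = disaster_df.loc["Biological"]
print(biological)

# the same idea works for deeper levels via a cross-section, e.g., all rows
# whose Disaster Type is "Insect infestation"
insect_infestations = disaster_df.xs("Insect infestation", level="Disaster Type")
print(insect_infestations)
```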
Adrian Rosebrock. “Getting Started with the International Disaster Database (EM-DAT) with Python and Pandas”, NaturalDisasters.ai, 2023, https://naturaldisasters.ai/posts/getting-started-em-dat-international-disaster-database/.
@incollection{ARosebrock_GettingStartedEMDAT,
author = {Adrian Rosebrock},
title = {Getting Started with the International Disaster Database (EM-DAT) using Python and Pandas},
booktitle = {NaturalDisasters.ai},
year = {2023},
url = {https://naturaldisasters.ai/posts/getting-started-em-dat-international-disaster-database/},
}
AI generated content disclaimer: I’ve used a sprinkling of AI magic in this blog post, namely in the following sections:
Don’t fret, my human eyeballs have read and edited every word of the AI generated content, so rest assured, what you’re reading is as accurate as I possibly can make it. If there are any discrepancies or inaccuracies in the post, it’s my fault, not that of our machine assistants.