Introduction

In this project, I use Tableau and Python to perform exploratory visual analysis on the real-world World Development Indicators data set. Exploratory visual analysis uses visualizations to explore the data and form ideas about it. The process entails looking at the data set, coming up with a question or hypothesis, iteratively answering that question through visualizations, and describing the final results. Specifically, I formulated one question based on the World Development Indicators data set. To address this question, I manipulated the data and built several visualizations that revealed areas for further exploration and investigation. These displays included bar charts, scatter plots, maps, time series, and more. In the end, I created a final visualization that best addressed the question.

Data Profile

For this project, we explore the World Development Indicators (WDI) data set downloaded from the World Bank’s website. The World Bank Group is an international organization that provides global development data. Downloaded as a zip file, the WDI contains six CSV files; for this project, I worked with the WDIData.csv file. This rich multivariate data set contains 383,838 rows and 66 columns, including qualitative and quantitative data. The qualitative data includes Country Name, Country Code, Indicator Name, and Indicator Code, whereas the quantitative data consists of the numeric indicator values stored in the year columns. There are 266 unique countries and regions and 1,442 development indicators, covering health, environment, infrastructure, economic policy, education, and more.

Data Columns | Type | Unique | Description
Country Name | object | 266 | Name of country or region
Country Code | object | 266 | Abbreviated code for country or region
Indicator Name | object | 1,442 | Name of measured development indicator
Indicator Code | object | 1,442 | Abbreviated code for indicator
1960 to 2020 | float | 383,838 | Column years ranging from 1960 to 2020

First, while scanning the data, I noticed that the columns labeled 1960 to 2020 are years, and each row holds the numeric values of one indicator for one country across those years. I also noticed that the indicators can be grouped into broader topics. For example, splitting the Indicator Code column on its dot separator groups the indicators into broad topics such as Energy Production, Energy Emissions, Exports, Imports, Transportation, Government Finance, and more. Although the indicator name and code variables are complete, the numeric value for an indicator is not always available, in which case it appears as a null value. Lastly, I realized that certain country names, such as “Arab World” and “European Union,” are not countries. Instead, these values are pre-aggregated groupings of many countries.
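The dot-separator split described above can be sketched in a few lines of Python. This is only an illustration: the four codes are a small sample in the WDI dotted style rather than the full file, and the helper name topic_prefix and its depth parameter are hypothetical choices, not part of the original workflow.

```python
from collections import defaultdict

# Small illustrative sample of dotted indicator codes (not the full WDI list).
indicator_codes = [
    "EN.ATM.CO2E.KT",
    "EN.ATM.CO2E.PC",
    "NY.GDP.PCAP.PP.CD",
    "SP.URB.TOTL.IN.ZS",
]


def topic_prefix(code, depth=1):
    """Return the first `depth` dot-separated segments of an indicator code."""
    return ".".join(code.split(".")[:depth])


# Group the codes by their leading segment to get broad topic buckets.
codes_by_topic = defaultdict(list)
for code in indicator_codes:
    codes_by_topic[topic_prefix(code)].append(code)

for topic, codes in sorted(codes_by_topic.items()):
    print(topic, codes)
```

Raising depth to 2 gives finer groups (for example, "EN.ATM" instead of "EN"), which may be closer to the topic level used in the analysis.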


Question Exploration

What are the main direct drivers that influence the changes in CO2 emissions, and how do they change worldwide over time?

The question posed above comes from the information provided in the WDI data set. Greenhouse gas emissions, especially carbon dioxide (CO2), are among the leading causes of climate change. The following analysis examines the relationship between global CO2 emissions and economic growth, industrialization, urban population, technology development expenditures, foreign trade, and energy consumption from 1960 to 2018. This question stems from my interest in the environment. Before starting the analysis, I began by looking at the change in CO2 emission rates in the United States. After a quick examination, I read through academic articles on the driving factors of climate change and CO2 emissions and investigated the relationship between different indicators and emissions. For the overall analysis, my goal is to identify the main trends and direct drivers that affect the changes in CO2 emissions in the world and among the largest emitters. Based on my preliminary research, I hypothesize that energy consumption and economic growth are the two main drivers of CO2 emissions, and that both drivers increase globally over time.

Data Exploration

To prepare the data for this analysis, I started by importing WDIData.csv into Tableau Prep. Tableau Prep makes data preparation visual: by building flows, we can clean and shape the data for analysis step by step. To begin the cleaning process, I selected the relevant indicators outlined in the table below.

Indicator Name | Measure | Description
GDP per capita, PPP | current international $ | Annual percentage growth rate of the sum of gross value added by all residents in economy
Industry, value added | % of GDP | Value added in mining, manufacturing, construction, electricity, water, and gas
Urban Population | % of total population | Number of persons residing in ‘urban’ area per 100 total population
Research & Development Expenditure | % of GDP | Gross domestic expenditures on research and development (R&D), as a percent of GDP
Foreign Direct Investment, net inflows | % of GDP | Net inflows of foreign investment in an economy’s operating enterprise
Total CO2 emissions | Thousand metric tons | Carbon dioxide produced during consumption of solid, liquid, and gas fuels and gas flaring

I then pivoted the year columns (1960 to 2020) into a single variable called Year with a corresponding values column labeled Indicator Value. Next, I removed the redundant columns Country Code and Indicator Code and dropped the rows with null values in the Indicator Value column. Then, I assigned Country Name a geographic data role and filtered out the 47 pre-aggregated country groups. Lastly, I converted the rows under Indicator Name into columns, so that each indicator became its own variable.
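The cleaning itself was done in Tableau Prep, but the same reshaping steps can be sketched in Python with pandas for readers who want to reproduce them in code. The sketch below is an assumption-laden illustration, not the actual flow: the tiny table stands in for WDIData.csv, its numeric values are made up, and the aggregates set lists only the two group names mentioned earlier rather than all 47.

```python
import pandas as pd

# Tiny stand-in for WDIData.csv; the numeric values are made up for illustration.
wide = pd.DataFrame({
    "Country Name": ["Arab World", "Canada", "Canada"],
    "Country Code": ["ARB", "CAN", "CAN"],
    "Indicator Name": ["Total CO2 emissions", "Total CO2 emissions", "Urban Population"],
    "Indicator Code": ["EN.ATM.CO2E.KT", "EN.ATM.CO2E.KT", "SP.URB.TOTL.IN.ZS"],
    "1960": [1.0, 2.0, None],
    "1961": [1.5, 2.5, 70.0],
})

# Only two of the aggregated groups are named in the text; a real run would list all of them.
aggregates = {"Arab World", "European Union"}

# Step 1: pivot the year columns into Year / Indicator Value pairs.
id_cols = ["Country Name", "Country Code", "Indicator Name", "Indicator Code"]
long = wide.melt(id_vars=id_cols, var_name="Year", value_name="Indicator Value")

# Step 2: drop the redundant code columns and the rows with null values.
long = long.drop(columns=["Country Code", "Indicator Code"])
long = long.dropna(subset=["Indicator Value"])

# Step 3: filter out the pre-aggregated country groups.
long = long[~long["Country Name"].isin(aggregates)]
long["Year"] = long["Year"].astype(int)

# Step 4: turn each indicator into its own column.
tidy = long.pivot_table(
    index=["Country Name", "Year"],
    columns="Indicator Name",
    values="Indicator Value",
).reset_index()
tidy.columns.name = None

print(tidy)
```

The result has one row per country and year, with one column per indicator. Indicators that have no value for a given country and year show up as missing in that row.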

Data Visualizations

Now that the data is in a workable format, I start the visual exploratory analysis in Tableau. Exploratory visual analysis is often an iterative process, so I created and adjusted various intermediate views while working toward an answer to the question. First, I made a simple bar chart to show the total CO2 emissions by country. I filtered the data to include only the top 8 emitting countries.

Based on this bar chart, the top emitting countries are the United Kingdom, Canada, China, Germany, India, Japan, Russia, and the United States. Next, I made a simple time series chart to show the annual changes in CO2 emissions. Keeping the same filter on the top 8 emitting countries, I plotted the available data over time to observe any trends.

In this time series, we can compare the total CO2 emissions of the top emitting countries. The plot shows that total CO2 emissions are much higher in the US than in other countries. However, since the populations of the US and of nations such as India and China differ greatly, totals alone do not tell us whether emissions differ meaningfully among these countries once population is taken into account. Although this view provides a glimpse into the leading countries and trends in yearly CO2 emissions, it does not offer any insight into the driving factors of emissions. Moving forward, I looked for ways to relate CO2 emissions to other indicators, starting with economic growth. Specifically, I created a scatter plot for 2018 that plots CO2 emissions against GDP per capita, using logarithmic scales on both the horizontal and vertical axes. In addition, the size of each mark shows the total population.

This scatter plot shows that GDP per capita has an overall positive association with CO2 emissions. GDP per capita is a broad measure of economic growth, so these findings suggest that a country’s economic growth, measured by GDP per capita, goes along with increased CO2 emissions. While this view provides a glimpse into the relationship between CO2 emissions, GDP per capita, and total population, it does not give insight into other potential driving factors of emissions. Moving forward, I incorporated a more multivariate view by plotting CO2 emissions together with several indicators. The next section presents a multivariate view of the relationship between CO2 emissions and GDP growth, urban population, technology development, and foreign investments.
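As a rough numeric companion to the log-log scatter plot described above, the strength of the relationship can be summarized by fitting a straight line to the log-transformed values. The sketch below uses NumPy and four synthetic data points that are not taken from the WDI file, so the numbers it prints say nothing about the real data. The variable names are hypothetical.

```python
import numpy as np

# Synthetic values for illustration only; they are not taken from the WDI file.
gdp_per_capita = np.array([1_000.0, 5_000.0, 20_000.0, 60_000.0])
co2_kt = np.array([2_000.0, 30_000.0, 400_000.0, 5_000_000.0])

# Logarithmic scales on both axes correspond to log-transforming both variables.
log_gdp = np.log10(gdp_per_capita)
log_co2 = np.log10(co2_kt)

# Slope of the straight line on the log-log plot: the percentage change in
# emissions associated with a one percent change in GDP per capita.
slope, intercept = np.polyfit(log_gdp, log_co2, deg=1)

# Pearson correlation of the log-transformed values.
r = np.corrcoef(log_gdp, log_co2)[0, 1]

print(f"log-log slope: {slope:.2f}, correlation: {r:.2f}")
```

A positive slope and a correlation close to 1 would match a clear upward trend in the scatter plot. Neither number accounts for population or the other indicators, so this remains a descriptive summary rather than evidence of a causal effect.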

PCA Analysis

Here, we begin by visualizing all the principal components. After running principal component analysis (PCA), we display the results with a scatter plot matrix (splom) trace, where the features are the resulting principal components, ordered by how much variance they explain. The plots below show the share of variance explained by each component.

Overall, the scatter plot matrix allows us to visualize the high-dimensional principal components of our data grouped by region. Next, we visualize the first two principal components. We also visualize the loadings of the PCA, using annotations to indicate which feature each loading originally belongs to. Here, we define loadings as \({\small\textrm{loadings}} = \mathbf{v} \cdot \sqrt{\mathbf{\lambda}}\), where \(\mathbf{v}\) denotes the eigenvectors and \(\mathbf{\lambda}\) denotes the eigenvalues.

This visualization provides a multivariate view of carbon dioxide emissions, GDP growth, industrialization, urbanization, technology development, and foreign investments in the world, using PCA over the period 1970-2018. Based on these results, increased industrialization and economic growth, measured in terms of GDP per capita, appear to be the leading factors behind the rise of CO2 emissions.
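The loadings definition used in this section, eigenvectors scaled by the square roots of their eigenvalues, can be made concrete with a short NumPy sketch. It is a minimal illustration under stated assumptions: the data are random numbers standing in for the indicator table, the feature labels are hypothetical, and the PCA is computed by eigen-decomposition of the covariance matrix, which may differ from the library routine used for the plots.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the indicator table: rows are country-years, columns are indicators.
features = ["gdp_per_capita", "industry", "urban_population"]  # hypothetical labels
X = rng.normal(size=(50, len(features)))
X[:, 1] += 0.8 * X[:, 0]  # add some correlation so the first component is meaningful

# Standardize so that indicators measured on different scales contribute equally.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigen-decomposition of the covariance matrix of the standardized data.
cov = np.cov(Z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; reorder so PC1 explains the most variance.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Share of the total variance explained by each principal component.
explained_variance_ratio = eigenvalues / eigenvalues.sum()

# loadings = v * sqrt(lambda): column j holds the loadings of component j.
loadings = eigenvectors * np.sqrt(eigenvalues)

# Coordinates of each observation in principal component space.
scores = Z @ eigenvectors

for name, row in zip(features, loadings):
    print(name, np.round(row[:2], 2))
```

Because the data are standardized here, each loading can be read as the correlation between a feature and a component, which is what makes the annotated loading arrows on the two-component plot interpretable.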


Reflection

Overall, the goal of this project is to create meaningful visualizations that explore and effectively communicate this rich multivariate data set, and that raise new questions and point to possible answers. Beyond learning the skills to clean, organize, and display data, I also learned how to focus on the data that is important to me and filter out irrelevant information, which I believe is a great skill, especially when working with a large data set. When examining the data and constructing the visualizations, I used Tableau Prep, Python, and Tableau. Specifically, I used Tableau Prep and Python for manipulating data and Tableau for creating visualizations.

For my question, the goal was to study the main drivers that influence CO2 emissions. In particular, when creating the final visualization, I wanted to focus on presenting worldwide driving factors and trends of CO2 emissions, so I chose to use a multivariate graph. The final visualization successfully demonstrates data patterns for different developmental indicators, but it is relatively weak at showing changes over time. Based on my results, economic growth is the primary factor influencing CO2 emissions.

Overall, I found this project to be challenging but exciting. I learned that exploratory analysis is an iterative process. This process involved refining questions, trying new views, altering my approach, and repeating as necessary until I reached a compelling answer in the form of a visualization. Although there are certain conventions to follow, I also learned that there’s no perfect data visualization as each one has its advantages and disadvantages. However, as creators of the visualization, it’s essential to consider the mindset of our audience when creating meaningful visualizations that successfully present information from the data.


References

Abokyi, Eric, Paul Appiah-Konadu, Francis Abokyi, and Eric Fosu Oteng-Abayie. 2019. “Industrial Growth and Emissions of CO2 in Ghana: The Role of Financial Development and Fossil Fuel Consumption.” Energy Reports 5: 1339–53. https://doi.org/10.1016/j.egyr.2019.09.002.
Andrée, Bo Pieter Johannes, Andres Chamorro, Phoebe Spencer, Eric Koomen, and Harun Dogo. 2019. “Revisiting the Relation Between Economic Growth and the Environment; a Global Assessment of Deforestation, Pollution and Carbon Emission.” Renewable and Sustainable Energy Reviews 114: 109221. https://doi.org/10.1016/j.rser.2019.06.028.
Ritchie, Hannah, and Max Roser. 2020. “CO2 and Greenhouse Gas Emissions.” Our World in Data, May. https://ourworldindata.org/co2-emissions.
“World Bank Group - International Development, Poverty, & Sustainability.” n.d. World Bank. https://www.worldbank.org/en/home.