Take-home Exercise 3

Author

Lennis

Published

June 3, 2023

Modified

June 18, 2023

Exploring Vast Challenge Data

For this exercise, we will be looking at Mini-Challenge 3 of the VAST Challenge 2023 which is exploring ways to identify companies that may be engaged in illegal fishing. Specifically, this exercise aims to identify similarities among the companies and group them appropriately.

Installing and Launching R Packages

The code chunk below uses p_load() of the pacman package to check whether the relevant packages are installed on the computer; if they are, they are loaded into R. The data, in JSON format, is then read in with the jsonlite package.

  • jsonlite: Provides functions for working with JSON data.

  • tidygraph: Provides tidyverse style interface for working with graph data.

  • ggraph: Flexible and intuitive grammar of graphics approach for visualizing graphs.

  • igraph: For working and analyzing graph data.

  • visNetwork: For creating interactive network visualizations.

  • tidyverse: A collection of core packages designed for data science, used extensively for data preparation and wrangling.

  • tidytext: For text mining and analysis following tidyverse data principles.

  • ggforce: Extends capabilities of ggplot2 such as animating plots.

  • graphlayouts: Provides various layout algorithms for visualizing network graphs.

  • skimr: Provides compact summaries of datasets.

  • topicmodels: Provides tools for fitting and analyzing topic models.

  • tm: Provides tools for text mining and analysis.

  • topicdoc: Topic-specific diagnostics for topic models fitted with topicmodels.

  • textclean: For cleaning and processing text.

  • widyr: For pairwise comparisons and calculations in tidy data format.

pacman::p_load(jsonlite, tidygraph, ggraph, visNetwork, tidyverse, tidytext,
               ggforce, graphlayouts, skimr, igraph,
               topicmodels, tm, topicdoc, textclean, widyr)
MC3 <- fromJSON("data/MC3.json")

Loading Edges Data

The following code chunk extracts and transforms the edges data from the main data set, keeping only distinct edges. No missing data were found. The bar chart below shows that the data contains mainly links from companies to individuals who are either beneficial owners or company contacts.

Code
MC3_edges <- as_tibble(MC3$links) %>%
  distinct() %>%
  mutate(source = as.character(source),
         target = as.character(target),
         type = as.character(type)) %>%
  group_by(source, target, type) %>%
  summarise(weights = n()) %>%
  filter(source!=target) %>%
  ungroup()
Code
skim(MC3_edges)
Data summary
Name MC3_edges
Number of rows 24036
Number of columns 4
_______________________
Column type frequency:
character 3
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
source 0 1 6 700 0 12856 0
target 0 1 6 28 0 21265 0
type 0 1 16 16 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
weights 0 1 1 0 1 1 1 1 1 ▁▁▇▁▁
Code
ggplot(data = MC3_edges,
       aes(x = type)) +
  geom_bar()

Loading Nodes Data

The following code chunk extracts and transforms the nodes data from the data set. However, as the main objective is centered around businesses, we will focus only on companies and not individuals; the type of each node is therefore recoded to reflect companies versus non-companies. Each company is made uniquely identifiable by collapsing its multiple records into a single row, with revenue computed as the sum across instances.

Code
MC3_nodes <- as_tibble(MC3$nodes) %>%
  distinct() %>%
  mutate(country = as.character(country),
         id = as.character(id),
         product_services = as.character(product_services),
         revenue_omu = as.numeric(as.character(revenue_omu)),
         type = as.character(type)) %>%
  mutate(type = gsub("Company Contacts", "Contacts", type)) %>%
  mutate(product_services = ifelse(product_services == "character(0)" | product_services == "Unknown", '', product_services)) %>%
  group_by(id) %>%
  summarise(countries = paste(country, collapse = ", "),
            types = paste(type, collapse = ", "),
            revenue = sum(revenue_omu),
            description = paste(product_services, collapse = ";")) %>%
  mutate(types = ifelse(grepl("\\bCompany\\b", types), "Company", "Non-Company"),
         description = ifelse(grepl("\\b\\w+\\b", description), description, NA)) %>%
  select(id, countries, types, revenue, description)

From the summary below, we can see that revenue data is available for only about 23% of the records, so we will not use revenue for further investigation. Although the description field is similarly sparse, there is still merit in exploring similarities among businesses based on it, which will be discussed in subsequent sections.

Code
skim(MC3_nodes)
Data summary
Name MC3_nodes
Number of rows 22929
Number of columns 5
_______________________
Column type frequency:
character 4
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
id 0 1.00 6 64 0 22929 0
countries 0 1.00 2 120 0 211 0
types 0 1.00 7 11 0 2 0
description 19035 0.17 4 1738 0 3272 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
revenue 17664 0.23 1637722 16795453 3652.23 7826.67 16366.67 49964.69 308249623 ▇▁▁▁▁
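
As a quick cross-check of the completeness rates reported above, the proportions can be computed directly. This is a minimal sketch, assuming the MC3_nodes tibble from the chunk above; it was not part of the original analysis.

```r
# Cross-check skim()'s complete_rate values directly
MC3_nodes %>%
  summarise(pct_revenue_known = mean(!is.na(revenue)) * 100,
            pct_description_known = mean(!is.na(description)) * 100)
```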

The bar chart below reveals that the majority of the records are actually non-company records. Drilling into the breakdown, descriptions are available for roughly half of all companies, whereas the descriptions attached to non-companies carry little meaning. We will therefore investigate the company descriptions further in subsequent sections.

Code
cool_palette <- c("#78c2c4", "#C4C8C8")

MC3_nodes %>%
  group_by(types) %>%
  summarise(unknown_services = sum(is.na(description)),
            known_services = sum(!is.na(description))) %>%
  pivot_longer(cols = c(unknown_services, known_services),
               names_to = "Service_Type",
               values_to = "Count") %>%
  ggplot(aes(x = types, y = Count, fill = Service_Type)) +
  geom_bar(stat = "identity") +
  labs(x = "Types", y = "Count", fill = "Service Type") +
  scale_fill_manual(values = cool_palette)

Final Labelled Nodes Data

With the result of the topic model, we merge those labels with the original nodes data to obtain a final nodes data for further analysis in subsequent sections.

Code
MC3_nodes_labelled <- MC3_nodes %>% 
  left_join(MC3_nodes_filtered, by = c("id", "countries", "types", "description")) %>%
  mutate(topic = replace_na(topic, "Unknown")) %>%
  rename(revenue = revenue.x) %>%
  select(id, countries, types, revenue, description, topic)

Visualizing Network with Community Detection

First, we will explore the complete nodes and edges data by constructing the entire network and checking whether it is disconnected using group_components().

Code
id1 <- MC3_edges %>%
  select(source) %>%
  rename(id = source)
id2 <- MC3_edges %>%
  select(target) %>%
  rename(id = target)
MC3_nodes_comm <- rbind(id1, id2) %>%
  distinct() %>%
  left_join(MC3_nodes_labelled,
            unmatched = "drop") %>%
  replace_na(list(countries = "Unknown",
                  types = "Unknown",
                  description = "Unknown",
                  topic = "Unknown"))

MC3_graph <- tbl_graph(nodes = MC3_nodes_comm,
                       edges = MC3_edges,
                       directed = FALSE) %>%
  mutate(betweeness_centrality = centrality_betweenness(),
         closeness_centrality = centrality_closeness(),
         community = as.factor(group_components()),
         node_size = ifelse(topic == "Fishing", 0.55, 0.5))

Based on the table below, we can tell that the network comprises one large connected component along with four smaller components. For our exercise, we will focus on the largest component.

Code
temp <- MC3_graph %>%
  activate(nodes) %>%
  as_tibble()
temp %>%
  group_by(community) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  filter(count > 100)
# A tibble: 5 × 2
  community count
  <fct>     <int>
1 1          6628
2 2           293
3 3           137
4 4           135
5 5           114

Based on the network graph below, the fishing-related companies are sparsely distributed across the network, with no clear sign of them forming a single cluster. We will carry out community detection on this network to see whether we can detect more granular sub-communities within it.

Code
my_colors <- c("#3C53F2", "#CFFAB6", "#C4C8C8")
my_sizes <- c(1, 0.5, 0.5)

MC3_graph %>%
  filter(community == 1) %>%
  ggraph(layout = 'kk') +
  geom_edge_link(alpha = 0.1) +
  geom_node_point(aes(
    colour = topic,
    size = topic),
    alpha = 0.5) +
  scale_color_manual(values = my_colors) +
  scale_size_manual(values = my_sizes) +
  theme_graph()

Investigating Largest Connected Community

Using group_label_prop()

First, we explore the label propagation algorithm. As it generates a very large number of sub-communities, we will focus only on the top 5 communities with the most nodes.

Code
MC3_graph_largest <- MC3_graph %>%
  filter(community == 1) %>%
  select(-c(betweeness_centrality, closeness_centrality, community)) %>%
  mutate(betweeness_centrality = centrality_betweenness(),
         closeness_centrality = centrality_closeness(),
         sub_community = as.factor(group_label_prop()))
Code
temp <- MC3_graph_largest %>%
  activate(nodes) %>%
  as_tibble()
temp %>%
  group_by(sub_community) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  filter(count > 1)
# A tibble: 646 × 2
   sub_community count
   <fct>         <int>
 1 1               122
 2 2                96
 3 3                74
 4 4                64
 5 5                62
 6 6                54
 7 7                53
 8 8                52
 9 9                51
10 10               50
# ℹ 636 more rows

From the graph, the algorithm appears to generate reasonably cohesive communities, as their members are closely connected. The community labels can serve as a feature for further downstream investigation.

Code
my_colors <- c("#FF0000", "#00FF00", "#0000FF", "#FF00FF", "#FFFF00", "#C4C8C8")

plot_top5 <- MC3_graph_largest %>%
  mutate(sub_community = ifelse(as.numeric(as.character(sub_community)) > 5,
                                "Others", as.character(sub_community))) %>%
  ggraph(layout = 'kk') +
  geom_edge_link(alpha = 0.1) +
  geom_node_point(aes(
    colour = sub_community),
    alpha = 0.5,
    size = 0.5
  ) +
  scale_color_manual(values = my_colors) +
  theme_graph()
plot_top5

Using group_infomap()

Similarly, we will explore another algorithm, group_infomap(). The top 5 generated clusters also appear quite dense and could likewise be used for further investigation.

Code
MC3_graph_largest <- MC3_graph %>%
  filter(community == 1) %>%
  select(-c(betweeness_centrality, closeness_centrality, community)) %>%
  mutate(betweeness_centrality = centrality_betweenness(),
         closeness_centrality = centrality_closeness(),
         sub_community = as.factor(group_infomap()))
Code
temp <- MC3_graph_largest %>%
  activate(nodes) %>%
  as_tibble()
temp %>%
  group_by(sub_community) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  filter(count > 1)
# A tibble: 594 × 2
   sub_community count
   <fct>         <int>
 1 1                95
 2 2                73
 3 3                59
 4 4                54
 5 5                52
 6 6                51
 7 7                49
 8 8                47
 9 9                47
10 10               47
# ℹ 584 more rows
Code
my_colors <- c("#FF0000", "#00FF00", "#0000FF", "#FF00FF", "#FFFF00", "#C4C8C8")

plot_top5 <- MC3_graph_largest %>%
  mutate(sub_community = ifelse(as.numeric(as.character(sub_community)) > 5,
                                "Others", as.character(sub_community))) %>%
  ggraph(layout = 'kk') +
  geom_edge_link(alpha = 0.1) +
  geom_node_point(aes(
    colour = sub_community),
    alpha = 0.5,
    size = 0.5
  ) +
  scale_color_manual(values = my_colors) +
  theme_graph()
plot_top5
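
Since the two algorithms produce different partitions, it can help to quantify how much they agree. Below is a minimal sketch using igraph's clustering functions and normalised mutual information; the graph object name follows the earlier chunks, and this comparison was not part of the original analysis.

```r
# Sketch: compare label propagation and infomap partitions on the largest
# component (assumes MC3_graph from the earlier chunk)
library(igraph)
library(tidygraph)

g <- MC3_graph %>%
  filter(community == 1) %>%
  as.igraph()

lp <- cluster_label_prop(g)   # label propagation communities
im <- cluster_infomap(g)      # infomap communities

# NMI close to 1 means the two partitions largely agree
compare(membership(lp), membership(im), method = "nmi")
```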

Visualizing Network with Textual Data

Finally, we can also find similar companies based on their descriptions. Unlike topic modeling, we carry out pairwise similarity calculations across all companies to check whether they have similar businesses. Companies with similar businesses will form edges with each other, and we will explore whether the resulting network yields insights.

Code
cleaned_data <- MC3_nodes_filtered %>%
  select(id, description) %>%
  unnest_tokens(word, description) %>%
  anti_join(stop_words)  %>%
  count(id, word, sort = TRUE) %>%
  bind_tf_idf(word, id, n)

similarity_edge <- pairwise_similarity(cleaned_data, id, word, tf_idf) %>%
  filter(similarity > 0.95) %>%
  rename(source = item1) %>%
  rename(target = item2) %>%
  rename(weight = similarity)
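
To see what pairwise_similarity() computes, here is a toy illustration on invented data: it returns the cosine similarity between each pair of items based on their weighted shared features. The companies, words, and weights below are made up purely for demonstration.

```r
library(widyr)
library(tibble)

# Invented toy data: two fishing-related companies and one unrelated one
toy <- tribble(
  ~company, ~word,     ~tf_idf,
  "A",      "fish",    0.9,
  "A",      "seafood", 0.4,
  "B",      "fish",    0.8,
  "B",      "seafood", 0.5,
  "C",      "metal",   1.0
)

# Cosine similarity between companies over their word weights;
# A and B share vocabulary, so their similarity is high
pairwise_similarity(toy, company, word, tf_idf)
```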
Code
id1 <- similarity_edge %>%
  select(source) %>%
  rename(id = source)
id2 <- similarity_edge %>%
  select(target) %>%
  rename(id = target)
similarity_nodes <- rbind(id1, id2) %>%
  distinct() %>%
  left_join(MC3_nodes_labelled,
            unmatched = "drop") %>%
  replace_na(list(countries = "Unknown",
                  types = "Unknown",
                  description = "Unknown",
                  topic = "Unknown"))

MC3_graph_similarity <- tbl_graph(nodes = similarity_nodes,
                       edges = similarity_edge,
                       directed = FALSE) %>%
  mutate(betweeness_centrality = centrality_betweenness(),
         closeness_centrality = centrality_closeness(),
         community = as.factor(group_edge_betweenness()))
Code
temp <- MC3_graph_similarity %>%
  activate(nodes) %>%
  as_tibble()
temp %>%
  group_by(community) %>%
  summarise(count = n(),
            pct_Fishing = as.integer(sum(topic == "Fishing")/count*100),
            pct_Others = as.integer(sum(topic == "Others")/count*100)) %>%
  arrange(desc(count))
# A tibble: 222 × 4
   community count pct_Fishing pct_Others
   <fct>     <int>       <int>      <int>
 1 1            65         100          0
 2 2            52         100          0
 3 3            38          42         57
 4 4            35         100          0
 5 5            24          45         54
 6 6            22         100          0
 7 7            19         100          0
 8 8            19         100          0
 9 9            17         100          0
10 10           17          58         41
# ℹ 212 more rows

From the generated graph of the top 10 communities, companies in communities 1 and 2 are very densely connected to each other, suggesting very clear and distinct similarities in their businesses. The other communities seem to have some overlapping business descriptions. This may suggest an alternative method to classify the business similarities based on their descriptions for further investigation to detect anomalies.

Code
MC3_graph_similarity %>%
  filter(as.numeric(community) <= 10) %>%
  ggraph(layout = 'kk') +
  geom_edge_link(alpha = 0.1) +
  geom_node_point(aes(
    colour = community,
    shape = topic),
    alpha = 0.5,
    size = 3) +
  scale_shape_manual(values = c(16, 15)) +
  theme_graph()
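
Although visNetwork was loaded at the start, the plots above are all static. As a possible extension, the similarity network could be rendered interactively. This is an illustrative sketch reusing the nodes and edges tables built above; the grouping and layout choices are assumptions, not part of the original analysis.

```r
# Sketch: interactive view of the similarity network (assumes
# similarity_nodes and similarity_edge from the chunks above)
nodes_vis <- similarity_nodes %>%
  mutate(label = id, group = topic)    # visNetwork expects id/label/group columns

edges_vis <- similarity_edge %>%
  rename(from = source, to = target)   # visNetwork expects from/to columns

visNetwork(nodes_vis, edges_vis) %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE)
```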