Graphs help us understand the real world better by mathematical abstractions and models in order to define the entities and the relations between them in structures.
Many real-world applications use graphs, including search engines.
Search engines use graphs for different purposes, such as understanding the popularity of pages or detecting spam links.
Centrality is an important concept used for graphs. We use a quantitive measure called centrality measure to discover the importance of the nodes which are essential units forming a graph.
A very popular search engine is known to identify the popularity of nodes (webpages) by using PageRank as the measure of centrality. PageRank has been the most famous centrality measure for graphs.
There are some other measures such as:
Hits (Hyperlink-Induced Topic Search).
In this article, I will focus on “Harmonic Centrality” because it is a promising and, in theory, low-cost measure of centrality.
Harmonic centrality is a much recent measure than most of the other centrality measures and a variant of another measure that is closeness centrality.
It was first proposed by Massimo Marchiori and Vito Latora in 2000, then by Dekker (2005) and Rochat (2009).
Harmonic Centrality is the distance-based centrality measure, unlike PageRank.
In the PageRank algorithm, the importance of a node is given by the importance of the neighborhood but not the distance. The more important pages point a page, the more important that page becomes.
To understand harmonic centrality in simple terms, here’s an example.
Imagine there is a page called example.
If there are 50 pages linking directly to this page, they are called as pages at Distance 1, and you begin your count as 50.
There may be pages linking to example page but not directly so we call them Distance 2. Let’s say there are 100 such pages, this time we count as 50 and add to the 50 we have making our score 100.
There can be pages in Distance 3 with 150 links this time. So we count as 50 which makes our total score of 150.
It is much easier to understand than PageRank, isn’t it?
Best Centrality Measure
Harmonic centrality can be selected as one of the easiest centrality measures to understand and we can stop there.
However, it has also been selected as the best centrality measure in a scientific paper. The authors of the research, Paolo Boldi and Sebastiano Vigna describe the best centrality measure with these words:
“Our results suggest that centrality measures based on distances, which in the last years have been neglected in information retrieval in favour of spectral centrality measures, do provide high-quality signals; moreover, Harmonic Centrality pops up as an excellent general-purpose centrality index for arbitrary directed graphs.”
There is a compelling video titled “A Modern View of Centrality Measures” where Boldi talks about the centrality measures in general and harmonic centrality in particular.
In the video, he also presents a comparison of PageRank and harmonic centrality measurements on the Hollywood graph. In his specific example, harmonic centrality selects top nodes better than the PageRank.
Additionally, nonprofit organization Common Crawl which has been crawling the web since 2011 prefer this centrality measure.
Since 2017, Common Crawl has been using harmonic centrality in their crawling strategy for prioritization by link analysis.
When they list domain level information in their blog posts, the domains are ranked according to their harmonic centrality values, not to their PageRank values.
In this article, I will conduct data analysis to find out if its calculation cost is less than PageRank.
There is software that approximates harmonic centrality for very large graphs in order to further reduce its computation cost. I haven’t used it but it’s worth mentioning in case you want to try it out yourself.
For search engines and for SEO, harmonic centrality is a useful centrality measure because it:
Is simple to understand and explain.
Gives intriguing results on web graphs.
Is not a metric which needs iteration, thus its calculation cost can be less in theory.
Comparing Harmonic Centrality & PageRank
Comparison on Small Graphs
I created some very small web graphs and calculated the nodes(pages) PageRank and five other centrality measure values in order to compare them and give you a better understanding of Harmonic Centrality measure.
Imagine very small websites having a homepage, as well as category and product pages. Let’s use H, C1, C2, P1, P2, P3, P4 as aliases for the homepage, first, second category pages and first, second, third and fourth product pages, respectively.
H = Homepage
C1 = Category page 1
C2 = Category page 2
P1 = Product page 1
P2 = Product page 2
P3 = Product page 3
P4 = Product page 4
I calculated PageRank and five centrality measures on these very small datasets and visualized them as graphs.
For simplicity, only PageRank and Harmonic Centrality graphs are presented. On these graphs, the node sizes show the importance of their chosen centrality measure values.
Centrality measure values compute times on homemade graphs and Stanford-web graph are calculated on MacBook Pro with four cores and 16GB RAM.
Pages: 7Links: 16
PageRankCPU times: user 1.01 ms, sys: 0 ns, total: 1.01 msWall time: 1.02 µs
EigenvectorCPU times: user 392 µs, sys: 1 µs, total: 393 µsWall time: 398 µs
BetweennessCPU times: user 242 µs, sys: 1e+03 ns, total: 243 µsWall time: 247 µs
KatzCPU times: user 576 µs, sys: 13 µs, total: 589 µsWall time: 613 µs
ClosenessCPU times: user 235 µs, sys: 1e+03 ns, total: 236 µsWall time: 240 µs
HarmonicCPU times: user 665 µs, sys: 9 µs, total: 674 µsWall time: 689 µs
Correlations with PageRankEigenvector: 0.990128Betweenness: 0.995301Katz: 0.973233Closeness: 0.966720Harmonic: 0.987344
Graph 1 PageRank
Graph 1 Harmonic Centrality
Pages: 7Links: 14
PageRankCPU times: user 1.01 µs, sys: 0 ns, total: 1.01 µsWall time: 1.01 µs
EigenvectorCPU times: user 393 µs, sys: 1e+03 ns, total: 394 µsWall time: 398 µs
BetweennessCPU times: user 241 µs, sys: 0 ns, total: 241 µsWall time: 247 µs
KatzCPU times: user 499 µs, sys: 1 µs, total: 500 µsWall time: 505 µs
ClosenessCPU times: user 224 µs, sys: 1 µs, total: 225 µsWall time: 229 µs
HarmonicCPU times: user 597 µs, sys: 9 µs, total: 606 µsWall time: 625 µs
Correlations with PageRankEigenvector: 0.999605Betweenness: 0.917375Katz: 0.986605Closeness: 0.946069Harmonic: 0.975830
Graph 2 PageRank
Graph 2 Harmonic Centrality
Pages: 7Links: 10
PageRankCPU times: user 2.85 µs, sys: 148 µs, total: 2.99 µsWall time: 2.92 µs
EigenvectorCPU times: user 505 µs, sys: 1e+03 ns, total: 506 µsWall time: 510 µs
BetweennessCPU times: user 153 µs, sys: 1 µs, total: 154 µsWall time: 158 µs
KatzCPU times: user 436 µs, sys: 0 ns, total: 436 µsWall time: 442 µs
ClosenessCPU times: user 142 µs, sys: 0 ns, total: 142 µsWall time: 146 µs
HarmonicCPU times: user 283 µs, sys: 1e+03 ns, total: 284 µsWall time: 289 µs
Correlations with PageRankEigenvector: 0.930335Betweenness: 0.873700Katz: 0.996401Closeness: 0.979220Harmonic: 0.952115
Graph 3 PageRank
Graph 3 Harmonic Centrality
Pages: 7Links: 10
PageRankCPU times: user 2.76 µs, sys: 374 µs, total: 3.13 µsWall time: 2.77 µs
EigenvectorCPU times: user 730 µs, sys: 177 µs, total: 907 µsWall time: 722 µs
BetweennessCPU times: user 242 µs, sys: 0 ns, total: 242 µsWall time: 248 µs
KatzCPU times: user 330 µs, sys: 0 ns, total: 330 µsWall time: 335 µs
ClosenessCPU times: user 210 µs, sys: 0 ns, total: 210 µsWall time: 215 µs
HarmonicCPU times: user 740 µs, sys: 1 µs, total: 741 µsWall time: 745 µs
Correlations with PageRankEigenvector: 0.993543Betweenness: 0.998952Katz: 0.962346Closeness: 0.998952Harmonic: 0.997933
Graph 4 PageRank
Graph 4 Harmonic Centrality
Plot of the Correlations of 5 Centrality Measures with PageRank on 4 Homemade Graphs
Labels on X-axis
ec: Eigenvector Centrality
bc: Betweenness Centrality
kc: Katz Centrality
cc: Closeness Centrality
hc Harmonic Centrality
Mean of the Correlations of 5 Centrality Measures with PageRank on 4 Homemade Graphs
Harmonic Centrality has the second highest mean of correlations on four graphs with PageRank after Eigenvector Centrality which is a cousin of PageRank.
Comparison on Large Graphs
This web graph is available here.
Nodes represent pages from Stanford University (stanford.edu) and directed edges represent hyperlinks between them. The data was collected in 2002.Nodes: 281 903Edges : 2 312 497
PageRankCPU times: user 1min 17s, sys: 2.3 s, total: 1min 19sWall time: 1min 19s
Harmonic CentralityCPU times: user 1min 18s, sys: 1.96 s, total: 1min 20sWall time: 1min 20s
mean, min, max of pr_val = 3.547319e-06, 5.681687e-07, 6.512977e-03
mean, min, max of harmonicc_val = 31132.897898, 1.001467, 106023.735213
Correlation on 282 K pages
correlation(pr_val, harmonicc_val) = 0.013566
Correlations of Top Pages Sorted According to Harmonic Centrality Values
Top 100Correlation(PageRank.Harmonic Centrality) = 0.916071
Top 1000Correlation(PageRank.Harmonic Centrality) = 0.691981
Top 10 000Correlation(PageRank.Harmonic Centrality) = 0.682505
Top 100 000Correlation(PageRank.Harmonic Centrality) = 0.012681
These correlation scores above show us that PageRank and Harmonic Centrality algorithms agree on almost the top 10 K pages (not on the pages but the distribution of the importance of the pages) but afterwards, they don’t share the same opinion on other pages’ popularity distribution.
Number of Intersected Top Pages Between Pagerank and Harmonic Centrality
In top 100: 8In top 1000: 66In top 10 000: 299In top 100 000: 41 138
Number of Intersected Weakest Pages Between Pagerank and Harmonic Centrality
There is no intersection till 10 000 weakest pages but for 100 000 weakest pages the intersection is 23 417.
The PageRank distribution graph shows us that this distribution is highly right-skewed meaning the majority of the pages have very low PageRank.
Harmonic Centrality Distribution
The Harmonic Centrality distribution is not highly right-skewed so we can’t say that majority of the pages have low values like PageRank. This distribution is multi-modal.
Scatter Plot of PageRank and Harmonic Centrality
When we scatter plot the PageRank and Harmonic Centrality values of pages we observe what we have seen in the statistics part which is PageRank and Harmonic Centrality values of domains are not correlated.
Common Crawl Graph
On Common Crawl’s recent blog post, they shared a file (cc-main-2018-aug-sep-oct-domain-ranks.txt.gz [1.89 GB]) that provides 87 million domains’ most recent Harmonic Centrality and PageRank values.
Below are the statistics from this file about PageRank and Harmonic Centrality values of 87 million domains.
Sebastian Nagel from Common Crawl technical team kindly answered my question about computing time differences between PageRank and Harmonic Centrality algorithms calculations on Common Crawl graph and also allowed me to share his response with you:
“Computation was done on a machine with 48 CPUs and 384 GB RAM using the web graph resp. the law library which was written by Vigna and Boldi.
At a first glance, a look on the “real” wall-clock time and “user” time (used CPU-seconds) would even suggest that the harmonic centrality is faster for the larger host-level graph. However, my guess would be that the processing time is mostly influenced by:
Graph structure (e.g., amount of dangling nodes).
The number of iterations until the scores converge.
(Hyperball / harmonic centrality) size of the hyperloglog counters.”
mean, min, max of pr_val = 1.1473075614418515e-08, 4.48131407e-09, 1.72100302e-02
mean, min, max of harmonicc_val = 9421776.2697027, 0. , 24993276.
Correlation on 87 Million Domainscorrelation(pr_val, harmonicc_val) = 0.00432823
Correlation of Top Pages Sorted According to Harmonic Centrality ValuesTop 100Correlation(PageRank.Harmonic Centrality) = 0.94826372Top 1000Correlation(PageRank.Harmonic Centrality) = 0.87727728Top 10 000Correlation(PageRank.Harmonic Centrality) = 0.63744096Top 100 000Correlation(PageRank.Harmonic Centrality) = 0.31725445
The graph below presents the plot of the count of PageRank values. It shows us that the distribution of PageRank on 87 million domains is highly right-skewed meaning the majority of the domains have very low PageRank.
Harmonic Centrality Distribution
The graph below presents the plot of the count of Harmonic Centrality values of 87 million domains. It shows us that the distribution of Harmonic Centrality on 87 million domains is not highly right-skewed like the PageRank.
It is not a perfect Gaussian distribution but more Gaussian than the distributions of PageRank. This distribution is multi-modal too.
Scatter Plot of Pagerank and Harmonic Centrality
When we scatter plot the PageRank and Harmonic Centrality values of domains we observe what we have seen in the statistics part which is PageRank and Harmonic Centrality values of domains are not correlated.
However what is also interesting to point out is that we observe the beginning of the domains’ PageRank values detachment when their Harmonic Centrality value is closer to 1e7 and accelerates when it is greater than.
What we remark from web Stanford and Common Crawl, large web graphs data analysis above is that Harmonic Centrality and PageRank agree more or less on the popularity of top and middle tail nodes but not the long tail.
To begin with, Harmonic Centrality is a measure of centrality that is easy to understand and explain.
Furthermore, it gives interesting results, especially for the top nodes of the graphs.
Last but not least, some organizations are already using this measure, for instance, Common Crawl, a nonprofit organization that has been crawling the web for several years, uses this measure in its crawling strategy.
On small graphs:
PageRank and Harmonic Centrality give resembling results with PageRank.
The computation time of Harmonic Centrality is much less than PageRank.
As the size of the graph increases their computation times tend to be similar.
For the sake of clarity, computation times are calculated on small graphs and web Stanford graph with an unoptimized Harmonic Centrality algorithm on a MacBookPro with four cores and 16 GB RAM.
On larger graphs, although PageRank and Harmonic Centrality have similar considerations on the distribution of the importance of top pages, at least for top 10 K pages, their opinions on the popularity distribution of the rest of the pages change dramatically afterwards. They agree more on top and middle tail nodes but less on the long tail.
Common Crawl’s technical team calculated Harmonic Centrality and PageRank on the recent Common Crawl graph with 48 CPUs and 384 GB RAM.
Consequently, Harmonic Centrality’s compute time is less than PageRank on paper. However, there are some lurking variables which may affect the result. Therefore, this issue requires further study.
In conclusion, I highly recommend you to look further into Harmonic Centrality. Its differences with PageRank aren’t negative points, on the contrary, they are positive since it brings us new insights.
Therefore, it will be beneficial for SEO to calculate the Harmonic Centrality values for the pages of your clients’ sites.
Accordingly, this may reveal striking information on those pages which you have not discovered by the help of PageRank. Besides, it is always a good idea to listen to what other algorithms want to tell us.
In-Post Images: Created by author, January 2019
Subscribe to SEJ
Get our daily newsletter from SEJ’s Founder Loren Baker about the latest news in the industry!