Similarity Analysis: The Web Portal Mystery

"A recommendation system is like a librarian who knows exactly which magazine you'll like without having read the full content, just by looking at the words that repeat."


The Portal Challenge

Imagine you manage a dynamic portal. Your boss has challenged you: "Group these articles automatically. I don't have time to read them all."

[Figure: Portal Structure]

To solve this, we use the Jaccard Index: each text is converted into a "set" of words, and the index measures how much two sets overlap.
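As a minimal sketch of the "text to set" step, the snippet below lowercases a text, splits it into words, and keeps the unique ones. The function name `text_to_word_set` and the tiny stop-word list are assumptions made for illustration; the original script may tokenize differently.

```python
import re

# Hypothetical stop-word list for illustration only; a real corpus would need a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}

def text_to_word_set(text):
    """Lowercase the text, extract its words, and return the unique non-stop words."""
    words = re.findall(r"\w+", text.lower())
    return {w for w in words if w not in STOP_WORDS}

article = "The striker scored and the team won the match."
print(sorted(text_to_word_set(article)))
# ['match', 'scored', 'striker', 'team', 'won']
```

Repeated words collapse into a single entry, so the set records which words occur, not how often.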

[Figure: Jaccard Conceptual Diagram]


The Heart of the Algorithm

The magic happens by comparing the words two documents share against all the words they use between them.

Attribute      Explanation
Intersection   The words that appear in BOTH texts.
Union          All unique words that appear in EITHER text.
Result         A number between 0 (strangers) and 1 (soulmates).
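The three rows above map directly onto Python's set operators. The following is a small sketch, not the original script: the function name `jaccard` and the two toy word sets are assumptions, and returning 0.0 for two empty sets is a convention chosen here.

```python
def jaccard(a, b):
    """Return len(intersection) / len(union) for two sets of words."""
    union = a | b
    if not union:
        return 0.0  # convention chosen for this sketch: two empty texts score 0
    return len(a & b) / len(union)

# Hypothetical word sets for two short articles
doc_a = {"team", "won", "match", "goal"}
doc_b = {"team", "lost", "match", "penalty"}

# 2 shared words (team, match) out of 6 unique words, roughly 0.33
print(jaccard(doc_a, doc_b))
```

Identical sets score 1.0 and sets with no common word score 0.0, matching the "strangers" and "soulmates" ends of the scale.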

[Figure: Jaccard Logic]

The Mathematical Formula

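For two word sets A and B, the Jaccard index is the size of their intersection divided by the size of their union:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad 0 \le J(A, B) \le 1$$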


Real Results (Generated by Your Script)

This is where theory meets reality. When running 04_similitud_jaccard.py, the system "sees" the portal like this:

1. The Knowledge Heatmap

In this matrix, warm colors (reds) indicate high similarity. Notice how red squares form along the diagonal: articles on the same topic share vocabulary, so each square corresponds to one category (Football, Technology, or Cooking) that emerges without any manual labeling.

[Figure: Corpus Category Index]

[Figure: Real Similarity Matrix]
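A heatmap like this one is a picture of a pairwise similarity matrix. The sketch below builds such a matrix for a hypothetical four-article corpus; it is not the code of 04_similitud_jaccard.py. The plotting helper assumes seaborn and matplotlib are installed and is only defined, not called.

```python
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similarity_matrix(word_sets):
    """Return an n x n list of lists holding the Jaccard index of every pair."""
    return [[jaccard(a, b) for b in word_sets] for a in word_sets]

def plot_heatmap(matrix, labels):
    """Draw the matrix as a red heatmap (assumes seaborn and matplotlib are available)."""
    import matplotlib.pyplot as plt
    import seaborn as sns
    sns.heatmap(matrix, vmin=0, vmax=1, cmap="Reds", annot=True, fmt=".2f",
                xticklabels=labels, yticklabels=labels)
    plt.tight_layout()
    plt.show()

# Hypothetical toy corpus, listed category by category
corpus = {
    "football_1": {"team", "goal", "match", "league"},
    "football_2": {"team", "goal", "coach", "league"},
    "tech_1": {"cpu", "ram", "chip", "laptop"},
    "tech_2": {"cpu", "ram", "gpu", "laptop"},
}
labels = list(corpus)
matrix = similarity_matrix(list(corpus.values()))
for label, row in zip(labels, matrix):
    print(label, [round(value, 2) for value in row])
```

Because the articles are listed category by category, the high values sit in two 2 x 2 blocks on the diagonal, which is the "squares" pattern described above. Calling `plot_heatmap(matrix, labels)` would render it.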

2. Algorithm Validation (Clustermap)

Can the algorithm group the topics without help? The dendrogram (the tree along the side) says yes: articles with the same theme "seek" each other and cluster into common branches.

[Figure: Hierarchical Clustering]
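To make the dendrogram less of a black box, here is a pure-Python sketch of agglomerative clustering with average linkage on Jaccard distance (1 minus similarity). It is a stand-in written for illustration: the original script most likely relies on a plotting library's clustermap, and the function name `agglomerate` and the six toy documents are assumptions.

```python
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def agglomerate(word_sets, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain.

    Returns a sorted list of clusters, each a sorted list of document indices.
    """
    clusters = [[i] for i in range(len(word_sets))]

    def distance(c1, c2):
        # Average Jaccard distance over every cross-cluster pair of documents.
        pairs = [(i, j) for i in c1 for j in c2]
        return sum(1 - jaccard(word_sets[i], word_sets[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > max(n_clusters, 1):
        # Each merge corresponds to one junction in the dendrogram.
        x, y = min(
            ((p, q) for p in range(len(clusters)) for q in range(p + 1, len(clusters))),
            key=lambda pq: distance(clusters[pq[0]], clusters[pq[1]]),
        )
        merged = sorted(clusters[x] + clusters[y])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return sorted(clusters)

# Hypothetical documents, deliberately shuffled so topics are not adjacent
docs = [
    {"team", "goal", "match", "league"},    # football
    {"cpu", "ram", "chip", "laptop"},       # technology
    {"team", "goal", "coach", "league"},    # football
    {"cpu", "ram", "gpu", "laptop"},        # technology
    {"recipe", "oven", "flour", "sugar"},   # cooking
    {"recipe", "oven", "butter", "sugar"},  # cooking
]
print(agglomerate(docs, n_clusters=3))
# [[0, 2], [1, 3], [4, 5]]
```

Even though the documents are shuffled, each pair of same-topic articles ends up in the same cluster, which is what the common branches of the dendrogram express.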

3. Clustermap with Numerical Values

Another view of the same clustering, now with exact similarity values in each cell:

[Figure: Detailed Clustermap]


Real-World Applications

This is not just an academic exercise. Set-overlap measures like this one are widely used in:

  • Plagiarism Detection: Comparing student submissions to see if they share "too much" vocabulary.
  • Recommenders: "If you read about the new CPU, we recommend this article about RAM."
  • SEO and Search Engines: To understand whether two pages discuss the same topic and avoid duplicate content.
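Two of these uses can be sketched with the same similarity function: ranking the articles closest to the one being read, and flagging pairs whose overlap exceeds a cutoff. The function names, the toy corpus, and the 0.8 threshold are all assumptions made for illustration, not values from a real system.

```python
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(target, corpus, top_n=2):
    """Rank every other article by its Jaccard similarity to `target`."""
    scores = [
        (title, jaccard(corpus[target], words))
        for title, words in corpus.items()
        if title != target
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]

def flag_overlaps(corpus, threshold=0.8):
    """Return the pairs of titles whose similarity reaches the (hypothetical) threshold."""
    titles = sorted(corpus)
    return [
        (a, b)
        for i, a in enumerate(titles)
        for b in titles[i + 1:]
        if jaccard(corpus[a], corpus[b]) >= threshold
    ]

# Hypothetical word sets
corpus = {
    "new_cpu": {"cpu", "cores", "benchmark", "chip"},
    "ram_guide": {"ram", "cpu", "benchmark", "chip"},
    "pasta_recipe": {"pasta", "sauce", "tomato", "garlic"},
    "essay_a": {"climate", "policy", "carbon", "tax", "market"},
    "essay_b": {"climate", "policy", "carbon", "tax", "market", "growth"},
}
print(recommend("new_cpu", corpus, top_n=1))  # [('ram_guide', 0.6)]
print(flag_overlaps(corpus))                  # [('essay_a', 'essay_b')]
```

The cutoff is the design decision that matters here: set it too low and unrelated texts on the same topic get flagged, too high and lightly reworded copies slip through.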

[Figure: Real Jaccard Applications]


Final Reflection for the Student

Look at the charts saved in your folder:

  1. Do you see any red cell outside the diagonal squares? That would indicate that two articles from different topics share many words.
  2. What would happen if the corpus had 10,000 documents? The matrix would hold 10,000 × 10,000 cells, so individual cells in the heatmap would become unreadable, but the Clustermap's dendrogram would still summarize the overall structure.

---
Certification Hash: 4e8d9b1a5f6e7c3d2b1a0f9e8d7c6b5a4f3e2d1c0b9a8f7e6d5c4b3a2f1e0d9c
Author: Juan Marcelo Gutierrez Miranda (@TodoEconometria)