Similarity Analysis: The Web Portal Mystery¶
"A recommendation system is like a librarian who knows exactly which magazine you'll like without having read the full content, just by looking at the words that repeat."
The Portal Challenge¶
Imagine you manage a dynamic portal. Your boss has challenged you: "Group these articles automatically. I don't have time to read them all."

To solve this, we use the Jaccard Index: each text is converted into a "set" of its unique words, and the index measures how much two sets overlap.
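
As a minimal sketch of the "text to set" step, the snippet below lowercases a sentence and keeps its unique words. The helper name `to_word_set` and the regex-based tokenizer are assumptions made for illustration; the actual script may tokenize differently (for example, by removing stop words).

```python
import re

def to_word_set(text: str) -> set:
    """Lowercase the text and return its unique words as a set."""
    # \w+ matches runs of letters, digits and underscores (accented letters included)
    return set(re.findall(r"\w+", text.lower()))

article = "The team won the match, and the team celebrated."
print(sorted(to_word_set(article)))
# ['and', 'celebrated', 'match', 'team', 'the', 'won']
```

Repeated words collapse into a single entry, which is why the index only cares about which words appear, not how often.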

The Heart of the Algorithm¶
The magic happens by comparing the words two documents share against all the words they use between them.

| Concept | Meaning |
|---|---|
| Intersection | The words that appear in BOTH texts. |
| Union | All unique words from BOTH texts. |
| Result | Intersection size divided by union size: a number between 0 (strangers, no shared words) and 1 (soulmates, identical vocabulary). |
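
The three rows of the table can be sketched directly with Python's set operators. The function name `jaccard`, the example word sets, and the choice to return 0.0 for two empty sets are assumptions made here for illustration.

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:
        # Two empty sets: scored 0.0 here by convention (an assumption)
        return 0.0
    return len(a & b) / len(union)

# Hypothetical word sets for three short articles
football = {"team", "goal", "match", "league"}
derby = {"team", "goal", "stadium", "fans"}
recipe = {"oven", "flour", "sugar", "butter"}

print(jaccard(football, derby))     # 2 shared words out of 6 unique words
print(jaccard(football, recipe))    # no shared words
print(jaccard(football, football))  # identical sets
```

The first pair shares `team` and `goal`, so it scores 2/6; the second pair shares nothing and scores 0; a set compared with itself scores 1.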

The Mathematical Formula¶
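
The standard definition of the Jaccard Index, written for the word sets of two documents (the symbols A and B are chosen here for illustration), is:

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad 0 \le J(A, B) \le 1
```

For example, two articles that share 2 words and use 6 unique words between them score 2/6 ≈ 0.33.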

Real Results (Generated by Your Script)¶
This is where theory meets reality. When running 04_similitud_jaccard.py, the system "sees" the portal like this:
1. The Knowledge Heatmap¶
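In this matrix, warm colors (reds) indicate high similarity. The diagonal itself is always at the maximum, because every article is identical to itself. Notice how squares form along the diagonal: each square is a block of articles that resemble one another. Those are your Football, Technology, and Cooking categories, detected automatically!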
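
The matrix behind a heatmap like this one can be sketched as a table of pairwise scores. The mini-corpus and helper names below are hypothetical, and the documents are deliberately listed topic by topic so that the similar pairs sit next to the diagonal.

```python
import re

def to_word_set(text):
    """Lowercase the text and return its unique words."""
    return set(re.findall(r"\w+", text.lower()))

def jaccard(a, b):
    """Intersection size over union size (0.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical mini-corpus, ordered by topic
docs = [
    "the team scored a late goal",          # football
    "the team lost the match by one goal",  # football
    "the new cpu doubles cache memory",     # technology
    "fast memory helps the new cpu",        # technology
]
sets = [to_word_set(d) for d in docs]

# matrix[i][j] holds the similarity between document i and document j
matrix = [[jaccard(a, b) for b in sets] for a in sets]

for row in matrix:
    print(" ".join(f"{value:.2f}" for value in row))
```

The result is symmetric with 1.00 on the diagonal. A plotting function such as `seaborn.heatmap` could then color the cells, assuming the script uses seaborn, which the text does not state.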


2. Algorithm Validation (Clustermap)¶
Can the algorithm group the topics without help? The dendrogram (the tree drawn along the side) tells us yes. Articles with the same theme "seek" each other and cluster into common branches.
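
The way articles "seek" each other can be sketched with a small agglomerative clustering loop: start with one cluster per article and keep merging the two closest clusters. This sketch uses average linkage on Jaccard distance (1 minus similarity); the linkage choice, the function names, and the sample word sets are assumptions, since the text does not say how the script builds its dendrogram.

```python
def jaccard(a, b):
    """Intersection size over union size (0.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def average_distance(c1, c2, sets):
    """Mean Jaccard distance between two clusters of document indices."""
    pairs = [(i, j) for i in c1 for j in c2]
    return sum(1 - jaccard(sets[i], sets[j]) for i, j in pairs) / len(pairs)

def agglomerate(sets, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[i] for i in range(len(sets))]
    while len(clusters) > n_clusters:
        best = None  # (distance, index of first cluster, index of second cluster)
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_distance(clusters[x], clusters[y], sets)
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]

# Hypothetical word sets, deliberately interleaved by topic
sets = [
    {"team", "goal", "match"},    # 0: football
    {"cpu", "memory", "chip"},    # 1: technology
    {"team", "goal", "league"},   # 2: football
    {"cpu", "memory", "laptop"},  # 3: technology
]
print(agglomerate(sets, 2))  # [[0, 2], [1, 3]]
```

Even though the articles are shuffled, the two football articles end up in one cluster and the two technology articles in the other. Each merge corresponds to one junction in a dendrogram.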

3. Clustermap with Numerical Values¶
Another view of the same clustering, now with exact similarity values in each cell:

Real-World Applications¶
This is not just an academic exercise. The same idea is at work in:
- Plagiarism Detection: Comparing student submissions to see if they share "too much" vocabulary.
- Recommenders: "If you read about the new CPU, we recommend this article about RAM."
- SEO and Search Engines: To understand whether two pages discuss the same topic and avoid duplicate content.
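
Two of these uses can be sketched with the same score: recommending the closest article, and flagging a pair as suspiciously similar once it passes a cutoff. The catalog, the function names, and the 0.8 threshold are all hypothetical; a real system would tune the threshold on its own data.

```python
def jaccard(a, b):
    """Intersection size over union size (0.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def most_similar(target, candidates):
    """Return (title, score) for the candidate closest to the target word set."""
    scored = ((title, jaccard(target, words)) for title, words in candidates.items())
    return max(scored, key=lambda pair: pair[1])

def is_near_duplicate(a, b, threshold=0.8):
    """Flag two word sets whose overlap reaches the (hypothetical) threshold."""
    return jaccard(a, b) >= threshold

# Hypothetical reader history and catalog
just_read = {"new", "cpu", "benchmark", "speed"}
catalog = {
    "RAM buying guide": {"ram", "speed", "cpu", "upgrade"},
    "Derby recap": {"team", "goal", "derby", "fans"},
    "Bread basics": {"flour", "oven", "yeast", "bread"},
}

title, score = most_similar(just_read, catalog)
print(title, round(score, 2))  # RAM buying guide 0.33
print(is_near_duplicate(just_read, catalog["Derby recap"]))  # False
```

The recommender only ranks candidates against each other, so it works with low absolute scores; duplicate or plagiarism checks need an absolute cutoff instead.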

Final Reflection for the Student¶
Look at the charts saved in your folder:
- Do you see any red cell off the diagonal? That would indicate that two articles on different topics share many words.
- What would happen if the corpus had 10,000 documents? The matrix would have 10,000 × 10,000 = 100 million cells, so the heatmap would become unreadable, but the Clustermap's dendrogram would still show the overall structure.

---
Certification Hash: 4e8d9b1a5f6e7c3d2b1a0f9e8d7c6b5a4f3e2d1c0b9a8f7e6d5c4b3a2f1e0d9c

Author: Juan Marcelo Gutierrez Miranda (@TodoEconometria)