As a follow-up to my recent post about creating document families for any given Google search result, I want to publish an additional idea:
The Content Similarity Equalizer
Now what is this? Well, as mentioned in my post above, it often happens that a product is announced in a press release and then gets discussed on several blogs and news sites. Most of the blogs and news sites will basically just copy the press release, add a few words and a bit of their own spin, and that’s it. A few will take their time and do an in-depth, personal analysis of the product.
This is where the Content Similarity Equalizer comes into play
Having a Content Similarity Equalizer is like having different sieves when you pan for gold: you pick a sieve depending on the size of the nuggets you are after. If you tell your Content Similarity Equalizer (CSE) to find all documents that are 90% or more similar, each family will contain only near-identical versions of the same document (you will have more document families with fewer documents in each). If you tell your CSE to find all documents that are at least 40% similar, your document families will get bigger, and the documents that are really unique (the nuggets) will start to surface, because they are the ones left outside the big families. In my example this would be the document discussing the specific features of the iPhone CPU, versus all the other documents that basically just cover the Apple press release for the iPhone and add nothing special or of their own to their blog post or news site.
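To make the sieve idea a bit more concrete, here is a minimal Python sketch. It rests on assumptions that are mine and not part of the idea above: similarity is measured as Jaccard overlap of word shingles, and two documents land in the same family whenever a chain of pairwise similarities at or above the threshold connects them. The names `shingles`, `jaccard` and `document_families` are made up for illustration; a real CSE could use a very different similarity measure.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Text shorter than one shingle: treat all its words as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets, a value between 0.0 and 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def document_families(docs, threshold, k=3):
    """Group documents into families (lists of indices into docs).

    Assumption: any pair with similarity >= threshold is linked, and
    linked documents are merged transitively using union-find.
    """
    sets = [shingles(d, k) for d in docs]
    parent = list(range(len(docs)))

    def find(i):
        # Follow parent pointers to the family root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if jaccard(sets[i], sets[j]) >= threshold:
                parent[find(i)] = find(j)

    families = {}
    for i in range(len(docs)):
        families.setdefault(find(i), []).append(i)
    return sorted(families.values())
```

Calling `document_families(docs, 0.9)` and then `document_families(docs, 0.4)` on the same list of texts is the "changing the sieve" step: the lower threshold merges more of the press-release copies into one family, and whatever is still sitting alone in a family of one is a candidate nugget. The pairwise comparison is quadratic in the number of documents, which should be fine for a single page of search results but would need something smarter for a larger set.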