
World Wide Web

The World Wide Web, abbreviated as WWW and commonly known as The Web, is a system of interlinked hypertext documents contained on the Internet.

2009-12-03 00:27

The death of a star

It’s not often you get to experience a star dying. Up until yesterday, I was unaware that I had accidentally collected evidence of this happening. What evidence, and how, you might ask. Well, this is how it happened.

In early November I did a minor study in which I wanted to correlate the structural elements of web sites with the actual content they provided. I knew this would definitely be possible on the broader level, but I was unsure to what degree this differentiation was possible and what the similarities would look like.

Therefore I began to partition a lot of different web sites into categories. Examples of such categories are portals, blogs, and media. I also chose specialized search engines. Further to this, I knew beforehand that there existed a type of search engine that specialized in finding one thing, and one thing only, namely the infamous torrent files. Torrent files are small files that contain information pointing to the whereabouts of other files. In other words, a torrent file is quite similar to a telephone book that contains information about the whereabouts of people and businesses. In turn, these search engines know the whereabouts of lots and lots of phone books, each holding information about the whereabouts of a certain number of people and businesses.

Put simply, choosing this specialization would provide me with a near-perfect setup for a comparative analysis.

The dying star I am referring to is the web site Mininova (link). Mininova, whose name literally means little star, is one of the highly specialized search engines I mentioned above. Some weeks prior to November 26th it showed evidence of being a heavily used and visited web site. Amongst other things, it showed the top ten most popular torrent searches for each major category. Examples of such categories are movies, music, games, and other.

This is the structure of the web site in early November (see picture below). Notice the big flower-like structures emanating outwards from the center. These are the categories I mentioned earlier, and they testify to its heavy usage.

Figure 1: Structural evidence of a widely used and visited torrent search engine.

So what happened on November 26th? Well, according to Mininova, they decided to change their business model on that specific date. And they did.

Actually, it began in August 2008 when the Dutch court of Utrecht ruled that Mininova’s business model was illegal according to Dutch law. The change of business model was therefore a response to this court decision, and its purpose is to comply with Dutch law while hopefully retaining a viable business opportunity.

This is what happened to Mininova’s web site within 24 hours of launching their new business model (see picture below). Pretty remarkable, isn’t it?

Figure 2: A drastically deteriorated structure that bears witness to a failing business model. (zoomed in)

The structure is heavily deteriorated because the consumers have disappeared to other, sustainable search engines. Even Google’s hit counter shows clear evidence that Mininova is becoming a virtual black hole. Its Swedish counterpart, The Pirate Bay, which might also face changes to its business model in the near future, is doing more than twice as well. The real winners are surely the other competing search engines, mostly specialized but also generalized, and of course the plaintiff, with its new and unique legal position within the Dutch free market.

Ending on this business opportunity note, I strongly recommend that all other business areas take a very hard look at how they can achieve similar unique legal positions before they are overrun by the plaintiff’s unique area of expertise.

Notes

If you're unsure of what the pictures above show, please read my previous blog "Visualizing the web", where I explain the details. (link)


Posted by Martin Kaarup

Comments (0)   Categories:  World Wide Web    Self-organization



2009-11-19 15:08

Visualizing the web

I have successfully collected and documented all my contacts and their contacts from LinkedIn in XML format. It turns out, at the time of writing, that I have approximately 3,000 1st and 2nd degree contacts. I must admit, it was an arduous and sometimes dull undertaking, but it’s behind me now and I can turn my attention towards visualizing these contacts.

And so, I began searching the Internet for suitable source code from which to derive and build my own graph visualization component. But by virtue of serendipity, I found something remarkable, which made me suspend my LinkedIn project in a heartbeat. I promise to write the LinkedIn blog post some other time.

Web pages as graphs

It turns out that the biologist Marcel Salathé, currently associated with Stanford University, has built a graph component that parses HTML, the underlying domain-specific language of web pages, and visualizes the pages as colorful minimum spanning trees.

What is interesting about Salathé’s visualization is that it makes it so straightforward to pose questions about the structure of web pages and then look for answers. Here’s the color legend he uses to differentiate between the groups of HTML tags:

Color    Tag group     Tags
Blue     Links         A
Red      Tables        TABLE, TR, TD
Green    Container     DIV
Violet   Images        IMG
Yellow   Forms         FORM, INPUT, TEXTAREA, SELECT, OPTION
Orange   Typography    BR, P, BLOCKQUOTE
Black    The root      HTML
Gray     All other tags

Table 1: Color legend used to differentiate tags
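
To make the idea concrete, here is a minimal sketch of how a page could be turned into a colored tag graph using the legend in Table 1. It is my own illustration and not Salathé’s code: the names TagGraphBuilder, build_tag_graph, TAG_COLORS, and VOID_TAGS are hypothetical, it relies only on Python’s standard html.parser module, and it records nodes and edges without drawing anything.

```python
from html.parser import HTMLParser

# Color legend from Table 1; any tag not listed falls back to gray.
TAG_COLORS = {
    "a": "blue",
    "table": "red", "tr": "red", "td": "red",
    "div": "green",
    "img": "violet",
    "form": "yellow", "input": "yellow", "textarea": "yellow",
    "select": "yellow", "option": "yellow",
    "br": "orange", "p": "orange", "blockquote": "orange",
    "html": "black",
}

# Assumption: a small set of tags that never get a closing tag,
# so they must not stay on the stack of open tags.
VOID_TAGS = {"br", "img", "input", "meta", "link", "hr"}


class TagGraphBuilder(HTMLParser):
    """Hypothetical helper: one node per tag, one edge per parent/child pair."""

    def __init__(self):
        super().__init__()
        self.nodes = []   # list of (node_id, tag, color)
        self.edges = []   # list of (parent_id, child_id)
        self._stack = []  # node ids of the currently open tags

    def handle_starttag(self, tag, attrs):
        node_id = len(self.nodes)
        self.nodes.append((node_id, tag, TAG_COLORS.get(tag, "gray")))
        if self._stack:
            self.edges.append((self._stack[-1], node_id))
        if tag not in VOID_TAGS:
            self._stack.append(node_id)

    def handle_endtag(self, tag):
        # Pop back to the most recent open tag with this name, if any.
        for i in range(len(self._stack) - 1, -1, -1):
            if self.nodes[self._stack[i]][1] == tag:
                del self._stack[i:]
                break


def build_tag_graph(html_text):
    """Return (nodes, edges) for the tags found in html_text."""
    builder = TagGraphBuilder()
    builder.feed(html_text)
    builder.close()
    return builder.nodes, builder.edges
```

Calling build_tag_graph on a page’s markup gives a node list with one color per tag and an edge list that a graph layout library could then draw. Unclosed tags, such as a P without its end tag, are simply nested under each other here, so real-world pages may come out slightly different from what a browser would build.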

Starting with an easy example question to warm up, we could ask: “What does this blog look like?”

Figure 1: http://blog.avegagroup.se/MartinKaarup/

This is the answer: it resembles a nice little “flower bed”. Locating the black node (top right below the little gray flower), we can see it is connected to two gray dots, namely the HEAD and BODY tags. Incidentally, the BODY tag is the one that extends downward, while the HEAD tag extends upward. This is purely accidental, since a graph’s main purpose is not to depict spatial layout, but only relationships between ‘things’, in this case HTML tags.

Comparing with the actual markup (as seen through a web browser), it turns out that the right side corresponds to the static elements of the blog, while the four orange-like wild flowers to the left are the actual blog posts. As expected, they consist mainly of typographical and image tags.

Extending the question a bit, we could ask: “What does Avega Group’s blog look like?”

Figure 2: http://blog.avegagroup.se/

This answer is marvelous, I think. We can easily identify the super-centralized tree structure, which is so typical of a content-driven web site. The number of links is just breathtaking! Emanating from a single DIV tag sit numerous short stories, each demarcated by its own DIV tag and mainly consisting of typographical and image tags. Recall that the latter is just an extrapolation of what we learned from the first question. As expected, blogs consist mainly of text and pictures. We also expected that the front page would have more blog entries than each separate blog author’s site. This accounts for the greater number of tags found.
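
The “single DIV with many children” pattern can also be checked numerically rather than by eye. The following is a minimal sketch under my own assumptions: FanOutCounter and biggest_hub are hypothetical names, only Python’s standard html.parser is used, and “centralized” is approximated simply as the node with the most direct child tags.

```python
from html.parser import HTMLParser

# Assumption: tags that never get a closing tag and so cannot have children.
VOID_TAGS = {"br", "img", "input", "meta", "link", "hr"}


class FanOutCounter(HTMLParser):
    """Hypothetical helper: counts direct child tags for every tag on the page."""

    def __init__(self):
        super().__init__()
        self.tags = []       # tag name per node id
        self.children = []   # number of direct child tags per node id
        self._stack = []     # node ids of the currently open tags

    def handle_starttag(self, tag, attrs):
        node_id = len(self.tags)
        self.tags.append(tag)
        self.children.append(0)
        if self._stack:
            self.children[self._stack[-1]] += 1
        if tag not in VOID_TAGS:
            self._stack.append(node_id)

    def handle_endtag(self, tag):
        # Pop back to the most recent open tag with this name, if any.
        for i in range(len(self._stack) - 1, -1, -1):
            if self.tags[self._stack[i]] == tag:
                del self._stack[i:]
                break


def biggest_hub(html_text):
    """Return (tag, child_count) for the node with the most direct children."""
    counter = FanOutCounter()
    counter.feed(html_text)
    counter.close()
    if not counter.tags:
        return None
    best = max(range(len(counter.tags)), key=lambda i: counter.children[i])
    return counter.tags[best], counter.children[best]
```

On a front page like the one in Figure 2, I would expect the hub to be a DIV with one child per blog entry, but that is a guess from the picture and not something this sketch has been run against.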

Another related question could be “Do other known blog sites share the same overall structure?”. (I’ve picked a blog I frequently read, which explains the restriction “known” in the question.)

Figure 3: http://scienceblogs.com/goodmath/

And the answer is mostly yes. The flower bed that extends towards the bottom right corner is actually a cluster of blog abstracts, which naturally consist of the same typographical and image tags we saw earlier. However, we also see two distinctly new structures. The most visible is perhaps the yellow flower on the left side. We can convince ourselves that this must be the free-text search field and drop-down menu we observe when visiting the blog site in a regular web browser. The other prominent structure is bigger, but eludes the eye if we’re not careful. It’s the hierarchy of link tags that emanates upwards and to the right from the approximate center. This structure, in its semantic role, is actually also present in the first figure. There it was not a mature, hierarchical list of topics but a small, insignificant one, and therefore more easily overlooked.

Now, let’s turn our attention to comparisons. I have captured a handful of Avega’s individual blog authors’ sites and present the results below (I humbly apologize to the absent authors, but there are apparently limits to the quality of service in the application. It flatly refused to produce anything for the absentees’ blog sites):

Figure 4: Achouiantz’ blog

Figure 5: Granlund’s blog

Figure 6: Ahlberg’s blog

Figure 7: Hammarberg’s blog

We can see that there are repeated structures visible in each tree. One is, of course, the entire surrounding markup that provides the coherent look and feel of Avega’s blog site. Another is the overall three-way tree structure itself, which consists of the aforementioned design template, the menu positioned to the right, and the unique collection of blog entries. In other words, it’s the fork going from the HTML tag towards the approximate center of the tree (the three-way fork, which sits in the approximate middle of Achouiantz’, Granlund’s, and Ahlberg’s trees, but slightly further to the right in Hammarberg’s tree). The menu branch is probably most easily identified by elimination: it is the one that does not start at the green dot from which a flower bed of predominantly yellow and blue flowers emanates, because that is where the unique blog entries are.

From these pictures, we can also ask some other fun questions that might or might not have sensible or useful answers:

  • Who has seemingly written the most (if we count different tags as containing a nominal length of text)?
  • Who has written the smallest blog entry (or guess if the pictures are insufficiently detailed)?
  • Who is more consistent with respect to the blog content (size and colorings)?
  • Why does Granlund have a big homogeneous gray flower? (see if you can guess what it is by browsing his blog entries)

Another kind of business intelligence

Admittedly, I have examined well over a hundred trees: domestic and foreign media outlets, general-purpose and specialized search engines, various consumer sites along with their nearest competitors, and many others. The reason is partly to examine what precisely constitutes the similarities and differences between similar groups of web pages, and partly to understand how such a tool could be used commercially to collect business information that would otherwise seem incomprehensible to some people in an organization.

So for instance, we could ask the following question: “How do search engines compare in usability for the visually impaired?” For brevity, I have chosen to examine only a single performance metric. In reality, this could be any number of comparable metrics. Further to this, I have chosen to examine search engines and visually impaired people, since they have a set of properties that combine into an easily verifiable test.

The relevant properties are:

  • Accessibility. Search engines should be easily accessible to all people regardless of any handicap.
  • Markup requirements. Search engines have no valid need of table tags in their markup. Suitable container tags are a better choice – ceteris paribus.

And since we know that improper use of table tags, instead of container tags, is an unnecessary hindrance for visually impaired people, we have constructed our own sociological experiment. The only thing we need to do is examine a number of search engines for the appearance of red dots.
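
Looking for red dots by eye amounts to counting table tags, so the same test can be written down as a small script. This is only a sketch under my own assumptions: TagCounter and table_vs_container are hypothetical names, the markup is assumed to be already downloaded into a string, and a raw tag count is of course a crude proxy for real accessibility.

```python
from collections import Counter
from html.parser import HTMLParser

TABLE_TAGS = {"table", "tr", "td"}   # the red group in the color legend
CONTAINER_TAGS = {"div"}             # the green group in the color legend


class TagCounter(HTMLParser):
    """Hypothetical helper: counts how often each opening tag occurs."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def table_vs_container(html_text):
    """Return (table_tag_count, container_tag_count) for a page's markup."""
    counter = TagCounter()
    counter.feed(html_text)
    counter.close()
    tables = sum(counter.counts[t] for t in TABLE_TAGS)
    containers = sum(counter.counts[t] for t in CONTAINER_TAGS)
    return tables, containers
```

A page with a table count of zero would show no red dots in its tree, which is the property the comparison that follows is looking for.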

And here is my result sample:

Figure 8: AltaVista

Figure 9: Google

Figure 10: Yahoo

Figure 11: Metacrawler

Yahoo, whose structure actually resembles that of a portal more than that of a search engine, is the only one that doesn’t use tables. Both AltaVista and Metacrawler rely on numerous tables to hold their respective designs together, and we therefore see red dots scattered all around the tree. Google also uses table tags to define its main structure, but has less content and can therefore make do with fewer table tags. Under these circumstances we conclude that Yahoo performs the best.

It turns out that, whomever I talk to about these trees, each person has a new and interesting use for them. Some talk about fun, scientific, or non-commercial questions. Some think the flowers look like beautiful artificial art. Others think of how a company could gain a competitive advantage by posing difficult questions. And a nerdy few think of building a web parser that uses a combination of tree structure and knowledge about types of web pages to predict the next HTML tag, and thereby increase performance. The sheer variety of possibilities, I think, bears witness to Marcel Salathé’s insight into the necessity of visualizing our world in new and exciting ways. He himself is astounded by the number of users that visit his application.

Unfortunately I haven’t got the time to convey all these very interesting ideas and will instead leave you with a smorgasbord of flowers to look at (here) and the Internet address to the application (http://www.aharef.info/static/htmlgraph/).

Have fun!


Posted by Martin Kaarup

Comments (0)   Categories:  Graph Theory    Minimum Spanning Trees    World Wide Web