Minimum Spanning Trees
2009-11-19 15:08
Visualizing the web
I have successfully collected and documented all my contacts and their contacts from LinkedIn in XML format. It turns out, at the time of writing, that I have approximately 3,000 1st and 2nd degree contacts. I must admit, it was an arduous and sometimes dull undertaking, but it’s behind me now and I can turn my attention towards visualizing these contacts.
And so, I began searching the Internet for suitable source code from which to derive and build my own graph visualization component. But by pure serendipity, I found something remarkable, which made me suspend my LinkedIn project in a heartbeat. I promise to write the LinkedIn post some other time.
Web pages as graphs
It turns out that the biologist Marcel Salathé, currently associated with Stanford University, has built a graph component that parses HTML, the underlying domain-specific language of web pages, and visualizes pages as colorful minimum spanning trees.
What is interesting about Salathé’s visualization is that it makes it so straightforward to pose questions about the structure of web pages and then look for answers. Here’s the color legend he uses to differentiate between the groups of HTML tags:
| Color | Tag group |
| --- | --- |
| Blue | Links (A) |
| Red | Tables (TABLE, TR, TD) |
| Green | Container (DIV) |
| Violet | Images (IMG) |
| Yellow | Forms (FORM, INPUT, TEXTAREA, SELECT, OPTION) |
| Orange | Typography (BR, P, BLOCKQUOTE) |
| Black | The root (HTML) |
| Gray | All other tags |
Table 1: Color legend used to differentiate tags
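To make the idea concrete, here is a minimal sketch of how a page could be turned into such a colored tree. It is not Salathé’s implementation, which I haven’t seen; it simply assumes Python’s standard `html.parser` module, and the names `TagGraphBuilder`, `build_tag_graph`, and `color_of` are my own hypothetical helpers. Every opening tag becomes a node, every parent-child nesting becomes an edge, and the color comes straight from Table 1.

```python
from html.parser import HTMLParser

# Color groups copied from Table 1; any tag not listed falls back to gray.
TAG_COLORS = {
    "a": "blue",
    "table": "red", "tr": "red", "td": "red",
    "div": "green",
    "img": "violet",
    "form": "yellow", "input": "yellow", "textarea": "yellow",
    "select": "yellow", "option": "yellow",
    "br": "orange", "p": "orange", "blockquote": "orange",
    "html": "black",
}

# Void elements never get a closing tag, so they must not stay "open".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def color_of(tag):
    """Return the legend color for a tag name (case-insensitive)."""
    return TAG_COLORS.get(tag.lower(), "gray")


class TagGraphBuilder(HTMLParser):
    """Collects one node per opening tag and one edge per nesting."""

    def __init__(self):
        super().__init__()
        self.nodes = []   # node id -> tag name
        self.edges = []   # (parent id, child id)
        self._open = []   # ids of the currently open elements

    def handle_starttag(self, tag, attrs):
        node_id = len(self.nodes)
        self.nodes.append(tag)
        if self._open:
            self.edges.append((self._open[-1], node_id))
        if tag not in VOID_TAGS:
            self._open.append(node_id)

    def handle_endtag(self, tag):
        # Close back to the nearest open element with this name;
        # stray closing tags are ignored.
        for i in range(len(self._open) - 1, -1, -1):
            if self.nodes[self._open[i]] == tag:
                del self._open[i:]
                break


def build_tag_graph(html):
    """Parse an HTML string and return (nodes, edges)."""
    builder = TagGraphBuilder()
    builder.feed(html)
    builder.close()
    return builder.nodes, builder.edges
```

The resulting node and edge lists could be handed to any graph layout library for drawing. Note that this sketch does not apply a browser’s error-recovery rules, so sloppy markup (an unclosed P tag, say) will nest differently than it would in a real browser.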
Starting with an easy example question to warm up, we could ask: “What does this blog look like?”

Figure 1: http://blog.avegagroup.se/MartinKaarup/
This is the answer: it resembles a nice little “flower bed”. Locating the black node (top right below the little gray flower), we can see it is connected to two gray dots, namely the HEAD and BODY tags. Incidentally, the BODY tag is the one that extends downward, while the HEAD tag extends upward. This is purely accidental, since a graph’s main purpose is not to depict spatial positions, but only relationships between ‘things’, in this case HTML tags.
Comparing with the actual markup (as seen through a web browser), it turns out that the right side corresponds to the static elements of the blog, while the four orange-like wild flowers to the left are the actual blog posts. As expected, they consist mainly of typographical and image tags.
Extending the question a bit, we could ask: “What does Avega Group’s blog look like?”

Figure 2: http://blog.avegagroup.se/
This answer is marvelous, I think. We can easily identify the highly centralized tree structure, which is so typical of a content-driven web site. The number of links is just breathtaking! Emanating from a single DIV tag are numerous short stories, each demarcated by its own DIV tag and consisting mainly of typographical and image tags. Note that the latter is just an extrapolation of what we learned from the first question: as we expect, blogs consist mainly of text and pictures. We also expected that the front page would have more blog entries than each individual author’s site, which accounts for the larger number of tags found.
Another related question could be “Do other known blog sites share the same overall structure?”. (I’ve picked a blog I frequently read, which explains the restriction “known” in the question.)

Figure 3: http://scienceblogs.com/goodmath/
And the answer is mostly yes. The flower bed that extends towards the bottom right corner is actually a cluster of blog abstracts, which naturally consist of the same typographical and image tags we saw earlier. However, we also see two distinctly new structures. Most visible, perhaps, is the yellow flower at the left side. We can convince ourselves that this must be the free text search field and drop-down menu we can observe when visiting the blog site via a regular web browser. The other prominent structure is bigger, yet eludes the eye if we’re not careful. It’s the hierarchy of link tags that emanates upwards and to the right from the approximate center. Semantically, this structure is also present in the first figure. There, however, it was not a mature, hierarchical list of topics but a much smaller one, and therefore more easily overlooked.
Now, let’s turn our attention to comparisons. I have captured a handful of Avega’s individual blog authors’ sites and present the results below. (I humbly apologize to the absent authors, but there are apparently limits to the quality of service in the application. It flatly refused to produce anything for the absentees’ blog sites.)
We can see that the same structures recur in each tree. That is, of course, the surrounding markup that gives Avega’s blog site its coherent look and feel. There is also the overall three-way tree structure itself, which consists of the aforementioned design template, the menu positioned to the right, and the unique collection of blog entries. In other words, there is the fork going from the HTML tag towards the approximate center of the tree (the three-way fork, which sits in the approximate middle in Achouiantz’s, Granlund’s, and Ahlberg’s trees, but is positioned slightly further to the right in Hammarberg’s tree). And there is the menu fork, which is probably most easily identified by elimination: it is not the green dot from which a flower bed of predominantly yellow and blue flowers emanates, because that is the unique blog entries.
From these pictures, we can also ask some other fun questions that might or might not have sensible or useful answers:
- Who has seemingly written the most (if we count different tags as containing a nominal length of text)?
- Who has written the smallest blog entry (or guess if the pictures are insufficiently detailed)?
- Who is more consistent with respect to the blog content (size and colorings)?
- Why does Granlund have a big homogeneous gray flower? (see if you can guess what it is by browsing his blog entries)
Another kind of business intelligence
Admittedly, I have examined well over a hundred trees: domestic and foreign media outlets, general-purpose and specialized search engines, various consumer sites along with their nearest competitors, and many others. The reason for this is partly to examine what precisely constitutes the similarities and differences between similar groups of web pages, and partly to understand how such a tool could be used commercially to collect business information that would otherwise seem incomprehensible to some people in an organization.
So, for instance, we could ask the following question: “How do search engines compare in usability for the visually impaired?”. For brevity, I have chosen to examine only a single performance metric. In reality, this could be any number of comparable metrics. Furthermore, I have chosen to examine search engines and visually impaired people, since they have a set of properties that combine into an easily verifiable test.
The relevant properties are:
- Accessibility. Search engines should be easily accessible to all people regardless of any handicap.
- Markup requirements. Search engines have no valid need of table tags in their markup. Suitable container tags are a better choice – ceteris paribus.
And since we know that improper use of table tags, instead of container tags, is an unnecessary hindrance for visually impaired people, we have constructed our own sociological experiment. The only thing we need to do is examine a number of search engines for the appearance of red dots.
And here is my result sample:
Yahoo, whose structure actually resembles that of a portal more than that of a search engine, is the only one that doesn’t use tables. Both AltaVista and Metacrawler rely on numerous tables to hold their respective designs together, and we therefore see red dots scattered all around the tree. Google also uses table tags to define its main structure, but has less content and can therefore make do with fewer table tags. Under these circumstances we conclude that Yahoo performs the best.
It turns out that, whomever I talk to about these trees, each person has a new and interesting use for them. Some talk about fun, scientific, or non-commercial questions. Some think the flowers look like beautiful artificial art. Others think of how a company could gain a competitive advantage by posing difficult questions. And a nerdy few think of building a web parser that uses a combination of tree structure and knowledge about types of web pages to predict the next HTML tag, and thereby increase performance. The sheer variety of possibilities, I think, bears witness to Marcel Salathé’s insight into the necessity of visualizing our world in new and exciting ways. He himself is astounded by the number of users who visit his application.
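The red-dot check can also be approximated without the picture, simply by counting tags. The following is a rough sketch under my own assumptions: it uses Python’s standard `html.parser`, the helper name `count_table_tags` is hypothetical, and the “red” group is the TABLE, TR, and TD tags from Table 1. It says nothing about whether a given table is used for layout or for genuine tabular data, so treat the count as a crude proxy only.

```python
from collections import Counter
from html.parser import HTMLParser

# The "red" group from Table 1.
TABLE_TAGS = {"table", "tr", "td"}


class TagCounter(HTMLParser):
    """Counts how often each opening tag occurs in a page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def count_table_tags(html):
    """Return (number of table-related tags, total number of tags)."""
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    total = sum(counter.counts.values())
    red = sum(counter.counts[tag] for tag in TABLE_TAGS)
    return red, total


# Hypothetical markup, not taken from any real search engine.
sample = ("<html><body><table><tr><td>a</td><td>b</td></tr></table>"
          "<div><a href='#'>x</a></div></body></html>")
red, total = count_table_tags(sample)
print(f"{red} of {total} tags are table tags")
```

Running the same function over the saved front pages of several search engines would give one number per site to compare, which is the tree’s red dots expressed as a count.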
Unfortunately I haven’t got the time to convey all these very interesting ideas and will instead leave you with a smorgasbord of flowers to look at (here) and the Internet address to the application (http://www.aharef.info/static/htmlgraph/).
Have fun!
Posted by Martin Kaarup
Categories:
Graph Theory
Minimum Spanning Trees
World Wide Web