/Images/MainImage/Martin Kaarup.jpg finns inte.

Graph Theory

In computer science, graph theory is the study of mathematical structures used to model pairwise relations between objects from a certain collection.

2009-11-19 15:08

Visualizing the web

I have successfully collected and documented all my contacts and their contacts from LinkedIn in Xml-format. It turns out, at the time of writing, that I have approximately 3,000 1st and 2nd degree contacts. I must admit, it was an arduous and sometimes dull undertaking, but it’s behind me now and I can turn my attention towards visualizing these contacts.

And so, I began searching the Internet for suitable source code to derive and build my own graph visualization component. But by virtue of serendipity, I found something remarkable, which made me suspense my LinkedIn project in a heartbeat. I promise to write the LinkedIn blog some other time.

Web pages as graphs

It turns out that the biologist Marcel Salathé, currently associated with the Stanford University, has built a graph component that parses the underlying domain specific language of web pages, html, and visualizes these as colorful minimum spanning trees.

What is interesting with Salathé’s visualization is that it’s so straight forward to pose questions about the structure of web pages and then look for answers. Here’s the color legend he uses to differentiate between the groups of html tags:

Blue Links (A)
Red Tables (TABLE, TR, TD)
Green Container (DIV)
Violet Images (IMG)
Yellow Forms (FORM, INPUT, TEXTAREA, SELECT, OPTION)
Orange Typography (BR, P, BLOCKQUOTE)
Black The root (HTML)
Gray All other tags

Table 1: Color legend used to differentiate tags

Starting with an easy example question to warm up, we could ask; “What does this blog look like?”

Figure 1: http://www.blog.avegagroup.se/MartinKaarup/

Figure 1: http://blog.avegagroup.se/MartinKaarup/

This is the answer; it resembles a nice little “flower bed”. Locating the black node (top right below the little gray flower), we can see it is connected to two gray dots, namely the HEAD and BODY tags. Incidentally, the BODY tag is the one that extends downward, while the HEAD tag extends upward. This is pure accidental, since a graphs main purpose is not to depict topological structures, but only relationships between ‘things’ – in this case, html tags.

Comparing to the actual markup (as seen through a web browser), it turns out that the right side correspond to the static elements of the blog, while the four orange-like wild flowers to the left is the actual blog posts. As expected they consist mainly of typographical and image tags.

Extending the question a bit, we could ask; “What does Avega Group’s blog look like?”

Figure 2: http://blog.avegagroup.se/

Figure 2: http://blog.avegagroup.se/

This answer is marvelous, I think. We can easily identify the super centralized tree structure, which is so typical for a content driven web site. The number of links is just breathtaking! Emanating from a single DIV tag sits numerous short stories – each demarcated by its own DIV tag and mainly consisting of typographical and image tags. Recall, that the latter is just an extrapolation of what we learned from the first question. As we expect blogs consist mainly of text and pictures. And we also expected that the front page would have more blog entries than each separate blog author’s site. This accounts for superior number of tags found.

Another related question could be “Does other known blog sites share the same overall structure?”. (I’ve picked a blog I frequently read, which explain the restriction “known” in the question.)

Figure 3: http://scienceblogs.com/goodmath/

Figure 3: http://scienceblogs.com/goodmath/

And the answer is mostly yes. The flower bed that extends towards the bottom right corner is actually a cluster of blog abstracts, which naturally consist of the same typological and image tags we saw earlier. However, we also see two distinctly new structures. Most visibly perhaps is the yellow flower at the left side. We can convince ourselves that this must be the free text search field and drop down menu we can observe when visiting the blog site via a regular web browser. The other prominent structure is bigger and eludes the eyes, if we’re not careful. It’s the hierarchy of link tags that emanates upwards and to the right from the approximate center. This structure, in its semantic presentation, is actually also present in the first figure. In the first figure it was not a mature and hierarchical list of topics, but a more insignificantly small list of topics and therefore more easily overlooked.

Now, let’s turn our attention to comparisons. I have captured a handful of Avega’s distinct blog author’s sites, and presented the result below (I humbly apologize to the absent authors, but there are apparently limits to the quality of service in the application. It flatly refused to produce anything for the absentees’ blog sites):

Figure 4: Achouiantz’ blog

Figure 5: Granlund’s blog

Figure 6: Ahlberg’s blog

Figure 7: Hammarberg’s blog

We can see that there are repetitious structures visible in each tree. That is, of course, the entire surrounding markup that provides the coherent look and feel of Avega’s blog site. And it’s the overall tertiary tree structure itself, which consists of; the aforementioned design template, the menu positioned to the right, and the unique collection of blog entries. In other words, it’s the fork going from the HTML tag towards the approximate center of the tree (the tertiary fork, which sits in the approximate middle in Achouiantz, Granlund, and Ahlbergs tree, but is positioned slightly more right in Hammarberg’s tree. And it’s the menu fork, which is probably easiest identified, by not being the green dot from where a flower bed of predominantly yellow and blue flower emanates – because that’s the unique blog entries.

From these pictures, we can also ask some other fun questions that might or might not have sensible or useful answers:

  • Who has seemingly written the most (if we count different tags as containing a nominal length of text)?
  • Who has written the smallest blog entry (or guess if the pictures are insufficiently detailed)?
  • Who is more consistent with respect to the blog content (size and colorings)?
  • Why does Granlund have a big homogeneous gray flower? (see if you can guess what it is by browsing his blog entries)

Another kind of business intelligence

Admittedly, I have examined well over hundred trees; domestic and foreign media outlets, general purpose and specific search engines, various consumer sites along with their nearest competitors, and many others. The reason for this is partly to examine what precisely constitutes the similarities and differences between similar groups of web pages, and partly to understand how such a tool could be utilized commercially to collect business information that otherwise would seem incomprehensible for some people in an organization.

So for instance, we could ask the following question: “How does search engine’s usability for the visual impaired compare?”. For brevity, I have chosen only to examine a single performance metric. In reality, this could be any number of comparable metrics. Further to this, I have chosen to examine search engines and visual impaired people, since they have set of properties that coincide into an easily verifiable test.

The relevant properties are:

  • Accessibility. Search engines should be easily accessible to all people regardless of any handicap.
  • Markup requirements. Search engines have no valid need of table tags in their markup. Suitable container tags are a better choice – ceteris paribus.

And since we know that improper use of table tags, instead of container tags, is an unnecessary hinder for visual impaired people, we have constructed our own sociological experiment. The only thing we need is to examine a number of search engines for the appearance of red dots.

And here is my result sample:

Figure 8: AltaVista

Figure 9: Google

Figure 10: Yahoo

Figure 11: Metacrawler

Yahoo, whose structure actually resembles more that of a portal, than that of a search engine, is the only one that doesn’t use tables. Both AltaVista and Metacrawler rely on numerous tables to hold its respective designs together and we therefore see red dots scattered all around the tree. Google also uses table tags to define its main structure, but has less content and therefore can settle with less table tags. Under these circumstances we conclude that Yahoo performs the best.

It turns out that, whomever I talk to about these trees, each person has a new and interesting utilization. Some talk about fun, scientific, or non-commercial questions. Some think the flowers are look like beautiful artificial art. Others think of how a company could gain a competitive advantage by posing difficult questions. And a nerdy few think of building a web parser that uses a combination of tree structure and knowledge about types of web pages to predict the next html tag, and thereby increase performance. The sheer variety of possibilities, I think, bears witness to Marcel Salathé’s insight into the necessity for visualizing our world in new and exciting ways. He himself is astounded by the number of users that visits his application.

Unfortunately I haven’t got the time to convey all these very interesting ideas and will instead leave you with a smorgasbord of flowers to look at (here) and the Internet address to the application (http://www.aharef.info/static/htmlgraph/).

Have fun!


Postad av Martin Kaarup

Kommentarer (0)   Kategorier:  Graph Theory    Minimum Spanning Trees    World Wide Web



2009-10-24 23:42

Solving the Good Will Hunting problem

I've seen the movie Good Will Hunting from 1997, starring Matt Damon in the role as a mathematical gifted janitor, twice. The first time I hadn’t taken a course in graph theory, the second time I had. Regarding the mathematical aspect of the movie, it makes all the difference. The second time I could actually understand the problem the professor posed when he threw down the gauntlet. In the movie the MIT professor stated a, supposedly, very tough problem that they had worked on intensely for almost two years before they were able to solve it. It reads:

"Draw all the homeomorphically irreducible trees having 10 vertices, such that no vertex has degree 2."

In plain words, connect ten dots together with lines such that all dots are connected to at least one other dot. Further, there must only be one path from any dot to any other dot, which means that circles are not allowed. Lastly, all dots must have 1, 3, or more lines connecting it to other dots, but not 2. Now draw all the different figures that satisfies the requirement.

That's what the gifted janitor did on the white board in the hallway.

Sounds simple? Well, it is – surprisingly simple actually.

In reality, the problem is no harder than any other high school problem. The real problem is quite different, namely that we accepted it was hard because Hollywood said so. A corollary to this claim could be to check the Internet and realize that many people already have solved the problem and some even before the movie. Another corollary could be to spend the next 10 minutes or so to solve it yourself, tell someone you know that has seen the movie, and then watch their reaction.

So, the only thing left now, is to challenge you to solve the problem and to present the solution.

This week I challenged Erik Forsberg, Avega Öresund’s .Net consultant. He solved the problem before he zipped his morning coffee a third time. Can you?

I've hidden the solution from plain view here.


Postad av Martin Kaarup

Kommentarer (1)   Kategorier:  Graph Theory



2009-10-16 09:14

My scale-free contacts

One drawback of returning from Avega’s 2009 conference in Karpathos, Greece, is having a head full of ideas and recharged batteries. It manifested itself a couple of days ago. Around bed time, I felt a sudden urge to map all my contacts, and their connections as well, on the professional network site LinkedIn. Supposedly, I should be able to see the scale-free invariance property of a social network.

That night I went to bed at around 4 a.m. and I was still miles from completing my data gathering. I didn’t finish the next night either – or the next. The fourth night it dawned on me. The very raison d’être for my investigation, the scale-free invariance, was the problem.

Since LinkedIn is a typical social network site, it follows the power law function which, simply put, states that the vast majority of people would have few contacts, but also that there exist a small handful of people that would have lots and lots of contacts.

And so, on the fifth evening, I haven’t finished gathering the data yet. I expect to be done in a week or so, and then I can finally begin automate a solution to connect the dots, so to speak. On the positive side, I have finished connecting my 1st degree contacts, and if they are ordered by number of connections, it’s possible to observe the scale-free invariance pattern – albeit with some minor variance, mainly because the sample set is small. It’s truly remarkable.

As can be seen, I’m missing a person; preferably, one with a really huge number of contacts. And more than 500 of those would be nice. This would fix the vertical asymptote that breaks off at around 325 instead of going straight up. I’m not worried though, because the nature of the power law itself also states that, in time, such a power law-conformant person will emerge.

The center of the universe is… me

In my small universe on LinkedIn, I’m the only one that knows everyone else. While some of my contacts know each other, along with some people outside my network, it’s not the case that everyone knows everyone else in my network, i.e. my network is not a complete graph.

In case you're wondering, I’m the dot labeled “29” that sits in the middle and connects the two big clusters of people to the left and to the right. I should mention, that the labels on the dots indicate the number of people that each person knows in total. Also, I have sized the dots relatively and accordingly.

There really isn’t much interesting to say about this small universe, except from the fact that if a person from the left side connects to a person from the right side, I will be rendered utterly worthless as a broker between small worlds .

As I mentioned earlier, I expect that in a few weeks time I have gathered, automated, and analyzed enough data from my contacts and their connections to write a follow-up. Then we can begin to talk about all kinds of interesting things, like for instance the best way to vaccinate an entire population against the hyped swine-flu, with the fewest possible vaccines.

References

  • Albert-Laszlo Barabasi, Linked: How Everything Is Connected to Everything Else and What It Means, Plume, 2003
  • Brian Uzzi & Shannon Dunlap, Managing Yourself: How to Build Your Network, Harvard Business Review, December 2005
  • Kurt Mehlhorn & Peter Sanders, Algorithms and Data Structures: The Basic Toolbox, Springer-Verlag, 2008
  • Malcolm Gladwell, The Tipping Point: How Little Things Can Make a Big Difference, Back Bay Books, 2002
  • Rob Cross, Jeanne Liedtka & Leigh Weiss, A Practical Guide to Social Networks, Harvard Business Review, March 2005


Postad av Martin Kaarup

Kommentarer (3)   Kategorier:  Graph Theory    Scale-free Networks    Self-organization