So, in my mission to see how we can automatically detect “core” teams, I need a measure for how closely people work together. Those of you with strong memories will remember I once coined the term “cohesion” for this measure. I introduced it in a paper at the International Conference on Software Maintenance three years ago and blogged about it around that time.
This measurement is based on some basic graph theory that I have been over before. But for the sake of completeness here is a quick recap. Let’s start by taking a look at a graph which represents one month of KDELIBS development, in this case, April 2009 (click to enlarge):

Each node here represents someone who has committed to KDELIBS in the month. The edges represent resource sharing: two nodes are connected if the committers both commit to the same file in the month. These edges have a weight (not shown) which is the number of shared files between the nodes.
Using the Floyd-Warshall algorithm it is possible to find the shortest paths between all pairs of nodes in the graph. This, in turn, allows us to find the mean shortest path length and this is what I call “community cohesion” (which should not be confused with graph structural cohesion). Now, this number is not really comparable between communities; their differing working practices really disallow this. However, within a community, we can certainly trend this metric and see how it varies over time. Perhaps, for example, certain events (such as release deadlines) cause the metric to increase? An increase in this metric shows the community is working together more tightly (higher edge weights, contributors sharing more resources).
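To make the recap concrete, here is a minimal sketch of the calculation in Python. It is not the script behind the plots in this post. It assumes the month's log has already been reduced to (committer, file) pairs, it uses the shared-file count directly as the edge length (an assumption made so that higher weights give a higher score, as described above), and it skips pairs of committers with no path between them. The names `build_graph` and `mean_shortest_path` are made up for this example.

```python
from itertools import combinations
from math import inf


def build_graph(commits):
    """Build the committer graph from (committer, file) pairs for one month.

    Returns the sorted list of committers and a dict mapping each connected
    pair (a, b) to the number of files both of them committed to.
    """
    files_by_committer = {}
    for committer, path in commits:
        files_by_committer.setdefault(committer, set()).add(path)

    weights = {}
    for a, b in combinations(sorted(files_by_committer), 2):
        shared = len(files_by_committer[a] & files_by_committer[b])
        if shared:
            weights[(a, b)] = shared
    return sorted(files_by_committer), weights


def mean_shortest_path(nodes, weights):
    """Floyd-Warshall over an undirected graph, then the mean path length.

    Assumption: the edge weight is used as the edge length, and pairs with
    no connecting path are left out of the mean.
    """
    dist = {u: {v: (0 if u == v else inf) for v in nodes} for u in nodes}
    for (a, b), w in weights.items():
        dist[a][b] = dist[b][a] = w

    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]

    lengths = [dist[a][b] for a, b in combinations(nodes, 2) if dist[a][b] < inf]
    return sum(lengths) / len(lengths) if lengths else 0.0


# Hypothetical month: alice and bob share one file, bob and carol share another.
example = [("alice", "a.cpp"), ("bob", "a.cpp"), ("bob", "b.cpp"), ("carol", "b.cpp")]
nodes, weights = build_graph(example)
cohesion = mean_shortest_path(nodes, weights)
```

The triple loop is cubic in the number of committers, which is fine for one month of one project but would need rethinking for much larger graphs.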
The next step, of course, is then to actually measure this and see how the trend looks for different projects. So, I have picked KDEPIM and KDELIBS to look at; below are their cohesion trends for the 120 months from 2001 to 2010 (click to enlarge):

Now, I admit these projects are both part of the greater KDE community and so share a few contributors and a release cycle. Other than that, however, they are distinct projects. So I was surprised to see the two trends above. Why? You do not have to look too closely to see that there is a certain degree of correlation between them.
If we use Pearson’s method, we find the correlation is 0.33. To jog your memory: a score of -1 shows perfect negative correlation, 1 perfect positive correlation and 0 shows no linear correlation whatsoever. So our score of 0.33 is hardly a strong correlation, but it is enough to suggest that the shared release cycle, the shared contributors, or both have some impact.
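For anyone who wants to repeat the calculation on their own trends, here is a small sketch of Pearson's coefficient in plain Python. It assumes the two trends are equal-length lists with one cohesion value per month. The function name `pearson` is mine, and the toy numbers are not the KDE data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of the same length (at least 2 points)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Toy series only: one rises with the other, one falls as the other rises.
rising = pearson([1, 2, 3, 4], [2, 4, 6, 8])
falling = pearson([1, 2, 3, 4], [8, 6, 4, 2])
```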
Interesting, no?
At a later time I will rerun this with more randomly selected KDE projects to see if similar results are found.
So this is about the time I usually do my annual review of activity in KDE SVN. Of course I have now stopped my analysis of KDE SVN and moved on to git. Instead of analysing every repo in KDE git, I will focus on what happened in KDEPIM in 2011 (KDEPIM exclusively, no PIMLIBS or PIMRUNTIME).
OK, to kick off, the green blobs (click to enlarge):

The first thing I noticed here is that there is no account which has committed in every week of 2011. Notice, also, Laurent; he is not the most regular contributor to this repo (he committed in 67% of the weeks) and yet he is one of the most regular contributors. If we look at commits per committer, we get the following top 10 for last year:
58 Bjoern Ricks
89 Christophe Giboudeauw
109 Torgny Nyblom
142 Volker Krause
162 Script Kiddy
195 Tobias Koenig
196 Allen Winter
273 David Jarvie
273 Sergio Martins
1198 Montel Laurent
The second thing I noticed about the green blobs is how “white” that image is towards the bottom; that is, developers whose first commit for KDEPIM in 2011 was after the first week tended not to stay around too long. To me, this suggests that the people towards the top are most likely part of an existing “core” team.
My “Oracle of Ervin” tool reveals Laurent to be the most highly-connected developer in this repo; this comes as no surprise. If we visualise the community we can see him along with others in the “core” of the community (click to enlarge):

Remember, this visualisation is not concerned so much with how much work an individual does, so much as how much they work with others (a better measure of a “team” I believe). So, Laurent appears in the “core” of the KDEPIM team along with many others from the top of the green blobs image. Based upon team structure it appears that Allen “The Kaiser Soze of KDE” Winter is at the centre of the community. You can expect Sergio, Till and Volker to appear in a line-up with him any day now.
How much did the KDEPIM community evolve over 2011? Let’s start by looking at the trend of all-time contributors:

At the start of the year there had been 536 contributors to KDEPIM through its history and this had grown to 568 by the end of the year. However, we can see from the green blobs that many new developers are only hanging around for one week and the project continues to be dominated by a “core” that is not evolving all that much (this is neither surprising nor worrying). If we show the daily commits split by length of contribution, we get the following:

As you can see, this chart is dominated by the purple of committers with more than 2 years of contribution to KDEPIM. Whilst a mature community is an excellent thing to have, I’d definitely still like to see some young blood get retained in KDEPIM in 2012.
So there you have it; KDEPIM 2011 in pretty pictures. I have improved the green blobs so that: 1. the font used is more legible 2. it supports UTF-8; thanks go to Chusslove Illich (Часлав Илић) for contributing to KDEPIM and exposing this. UTF-8 support was also added to the community graphing tool. The “downside” to this is that I am now using real names in these visualisations (I also toyed with the idea of using email addresses, but these create other issues). Ever since I switched from analysis of SVN to git I have struggled to find a suitable alternative to SVN account names. Using %an or %ae in git log really are the best replacements.
[This is slightly off topic from my usual Free Software analysis.]
So the Collatz Conjecture came to mind. I took a look at the Wikipedia article and was struck by a couple of things: I liked the stopping time (the number of steps you have to take to get from the given starting number to 1) plot and the graph showing the paths from certain starting numbers to 1.
Both also disappointed me for not showing enough data; this had clearly been done for clarity. Fair enough, but sometimes if you throw enough data in a visualisation it just “looks” right. Right? (OK, this is far from true). So, since it had been a while since I had last dusted off my Python and Graphviz skills, I thought I would try to replicate these visualisations, just with more data.
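For reference, the stopping time itself takes only a few lines of Python. This is a minimal sketch rather than the exact script behind the images; `stopping_time` and `collatz_path` are names chosen for this example.

```python
def collatz_path(n):
    """Return the sequence from n down to 1 (halve if even, 3n + 1 if odd)."""
    if n < 1:
        raise ValueError("starting number must be a positive integer")
    path = [n]
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        path.append(n)
    return path


def stopping_time(n):
    """Number of steps needed to get from n to 1."""
    return len(collatz_path(n)) - 1


# Stopping times for the first few starting numbers, ready to be plotted.
times = {n: stopping_time(n) for n in range(1, 11)}
```

Consecutive pairs in each `collatz_path` result are exactly the edges needed for the graph of paths back to 1.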
So let’s start with the stopping time plot (click to enlarge):

Nice pattern. Hardly exciting.
What is a little more fun is the graph showing the paths from given starting numbers back to 1 (click to see the full image, 36 MB):

[If you are not familiar with the English idiom "When push comes to shove" you can read more here.]
For some time I have been hesitant to start publishing data about usage of Git. You see, when a community changes a tool as fundamental as the SCM it will need to change its processes (to some degree). Of course, this is often the reason why the SCM has been switched. It is also the first reason why it is difficult to compare SVN data from “before” to Git data from “after”. Reason 2 is that the two systems work in very different ways. A commit in a DVCS is very different from a commit in a centralised system. It is probably the “push” that is more comparable. Right? Right??
Let’s take a look at the daily commits for KDEPIM:

KDEPIM switched to Git on 28 January 2011 (or thereabouts). Before this date the average was 16 commits per day (14 in the month prior); after it, the average drops to 11. I’m sure the KDEPIM community is not crying into its collective beer tonight. Here’s why:
- Human factor: The initial large drop in commit rate could easily be caused by people needing to learn how to use Git properly.
- Process factor: Git allows the user to squash multiple commits into one.
The change of tool will always have human and process impacts. Here I have suggested just one of each; there are many more. But these factors plausibly explain my concern with coming forward with Git data… It is up to me to make it absolutely clear why (or potentially why) the figures change in the way that they do. Whilst the need for education and commit squashing are two factors that might apply to any project, the factors that actually apply can only really be revealed by those directly involved.
So what can we conclude? Two things:
- The impact of the switch to Git can be shown in the measurement of something as simple as daily commits;
- Watching the new trends develop over time is going to be fascinating.
OK, now KDE is 15 years old, it is time for my work to grow up and start looking at git. One of the questions I get asked from time to time is how much code rewriting I will need to do in order to work with git. Thankfully… none.
All of my scripts parse SVN logs and it is easy enough to get git to give back logs in SVN format. Just like this:
git log --reverse --format="<logentry revision=\"%H\">%n <author>%ae</author>%n <date>%ci</date>%n</logentry>"
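As an illustration of why this is enough, here is a hedged sketch of how output in that shape could be read back in Python. It is not one of my actual scripts. It assumes the output is wrapped in a single root element before parsing (the command emits a bare sequence of `logentry` elements) and that no author address contains characters that need XML escaping. The helper name `parse_log` is made up.

```python
import xml.etree.ElementTree as ET


def parse_log(text):
    """Parse a run of <logentry> elements into (revision, author, date) tuples."""
    # Wrap the entries in a root element so the text is well-formed XML.
    root = ET.fromstring("<log>" + text + "</log>")
    return [
        (entry.get("revision"), entry.findtext("author"), entry.findtext("date"))
        for entry in root.iter("logentry")
    ]


# Hypothetical output for a single commit, in the format requested above.
sample = (
    '<logentry revision="abc123">\n'
    " <author>dev@example.org</author>\n"
    " <date>2011-01-28 10:00:00 +0100</date>\n"
    "</logentry>\n"
)
entries = parse_log(sample)
```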
So as a first brief experiment with git, here is the result of generating the green blobs for KDEPIM (click to enlarge):

So, I thought I would take a quick look at what the KDE community “looks” like after 15 years under development. So here I will briefly show off three visualisations with no particular comment. I will just leave them here for your amusement.
So let’s start with the now-infamous green blobs (click to enlarge):

For the uninitiated, a quick lesson: Each column in this visualisation represents the commit history of one person who has committed to KDE SVN. Each row represents a week, with the most recent weeks being at the top. If the contributor committed during that week, they get a green blob, otherwise it is left empty. For each column the committer, the date of their first commit and the % of weeks in which they committed (of those they /could/) is given.
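The numbers behind each column can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the green blobs tool itself: weeks are taken to start on Mondays, the weeks a committer "could" have committed in run from the week of their first commit to the week containing `end`, and every commit date is assumed to be on or before `end`. The name `blob_summary` is hypothetical.

```python
from datetime import date, timedelta


def week_start(day):
    """Monday of the week containing the given date (assumed week boundary)."""
    return day - timedelta(days=day.weekday())


def blob_summary(commits, end):
    """Map each committer to (first commit date, % of possible weeks active).

    commits: iterable of (committer, date) pairs, all dated on or before end.
    """
    active_weeks = {}
    first_commit = {}
    for who, day in commits:
        active_weeks.setdefault(who, set()).add(week_start(day))
        if who not in first_commit or day < first_commit[who]:
            first_commit[who] = day

    last_week = week_start(end)
    summary = {}
    for who, weeks in active_weeks.items():
        possible = (last_week - week_start(first_commit[who])).days // 7 + 1
        summary[who] = (first_commit[who], round(100 * len(weeks) / possible))
    return summary


# Hypothetical log: two active weeks out of a possible three.
log = [("alice", date(2011, 1, 4)), ("alice", date(2011, 1, 5)), ("alice", date(2011, 1, 17))]
summary = blob_summary(log, end=date(2011, 1, 23))
```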
You might remember from my last blog post that I charted the growth in the number of accounts in KDE SVN. With such a steady growth in contributors, should we expect something similar in the daily commits and committer trends? Of course we should…
- Daily Commits (click to enlarge):

I will admit that I have doctored this data ever-so-slightly in order to filter out the days on which a script went crazy and created 1000s of commits by itself.
- Daily Committers (click to enlarge):

So there you have it, 15 years of KDE development reduced to just three pictures! Of course, I could try and do 1000 more visualisations of the work in KDE SVN and still get nowhere near to telling the whole story. As the commits and committers plots show, KDE git really is the place to be. It is incredible how quickly contributions to KDE SVN have dropped to circa 2001 levels.
So, a big “congratulations” to my chums in the KDE community. Happy birthday and all the best for the next 15 years!
So, in the not too distant future KDE will turn 15 years old. This is normally a time when I will go back and reflect on lessons that can be learned from past activities in the SCM. This year is no different.
After my last blog post I was asked about the history of how many people had committed to KDE. So, for your viewing pleasure:
 Number of KDE committers (accumulative) since project foundation (click to enlarge).
This plot shows, for each day, the number of accounts that had committed to the SCM up to and including that day. As you can see, lately the growth rate is starting to tail off. Again, the most sensible hypothesis is that new contributors are increasingly using git rather than SVN.
So, in my previous blog post, I talked a little about how we can show if it is the newcomers or the “oldies” that are the most active contributors to KDE SVN. Let’s jog our memories by taking another look at the 2010 data I previously posted:
 Daily commits in 2010 (click to enlarge)
What are we looking at here? This shows, for each day in 2010, the number of daily commits in KDE SVN. Each day is colour coded to show the commits made by those contributors who have been “around” fewer than 6 months since their first commits, less than one year, less than two years and more than two years.
It is plain to see that the contributors who have been involved for more than two years are contributing the most (commits per day). Now the question I left us with last time was: Is this because there are a larger number of committers in this category or is the commit rate just higher?
Let’s take a look, starting with commit rate:
 Average daily commits, in 2010, per committer in each category (click to enlarge)
This plot shows, per day in 2010, the average number of commits made by the people in each involvement category. It is mostly a mass of dots, right? But what it does tell us is that the people who have been around the longest do not have a massively increased commit rate above the others. In fact, the data behind the plots shows that on an average day, the average commit rates are as follows:
- < 6 months: 3
- < 1 year: 3
- < 2 years: 4
- > 2 years: 5
This would lead us to believe it must simply be that there are more active people in the “> 2 years” category. So let’s take a look at the number of daily committers per category per day in 2010:
 Daily committers, in 2010, per committer category (click to enlarge)
So there you have it! Clearly, there are simply more contributors fitting into the “> 2 years” category. On an average day in 2010, the number of committers in each category was:
- < 6 months: 10
- < 1 year: 8
- < 2 years: 14
- > 2 years: 54
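As a rough sanity check, the two sets of averages can be multiplied together to estimate each category's share of a typical day's commits. This is back-of-the-envelope arithmetic only: the product of two averages is not the same thing as the average of the daily products, so treat the result as indicative.

```python
# Average daily commits per committer in 2010, as listed earlier in this post.
rate = {"< 6 months": 3, "< 1 year": 3, "< 2 years": 4, "> 2 years": 5}
# Average number of daily committers in 2010, from the list just above.
committers = {"< 6 months": 10, "< 1 year": 8, "< 2 years": 14, "> 2 years": 54}

# Rough estimate of daily commits per category and each category's share.
estimate = {category: rate[category] * committers[category] for category in rate}
total = sum(estimate.values())
share = {category: estimate[category] / total for category in estimate}

for category in rate:
    print(f"{category}: ~{estimate[category]} commits/day ({share[category]:.0%})")
```

The commit rate of the “> 2 years” group is less than twice that of the newest group, while its headcount is more than five times larger, so the headcount accounts for most of the difference.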
So, here’s a question for you all: Does it feel a little odd to you that the committers who have been around for fewer than 6 months have similar commit rates to those who have been around for more than 2 years?
So Lydia asked me about having slightly more fine-grained information about daily commits. She pointed me to this video which at the 15-minute mark has a visualisation for people contributing to Wikipedia. This visualisation reveals information about how long people have been contributing to the community.
So, as a distraction from my work on detecting core team members, I coded a tool that does the following:
- Scans the entire log for a project’s history and finds the date upon which an SVN account commits for the first time;
- Per day finds the number of commits by people who have been “around” for less than 6 months, less than 12 months (but more than 6), less than 2 years (but more than 1) and those around for more than 2 years;
- Plots this data for arbitrary time periods using gnuplot.
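The first two steps above can be sketched roughly as follows in Python. This is an illustration rather than the tool itself: it assumes the log has already been reduced to (account, date) pairs, the day thresholds standing in for 6, 12 and 24 months are my own approximations, and the gnuplot step is left out. The names `bucket` and `daily_commits_by_age` are made up.

```python
from collections import Counter, defaultdict
from datetime import date

BUCKETS = ("< 6 months", "< 12 months", "< 2 years", "> 2 years")


def bucket(age_days):
    """Map days since an account's first commit to a category.

    The thresholds (183, 365 and 730 days) are assumed approximations.
    """
    if age_days < 183:
        return BUCKETS[0]
    if age_days < 365:
        return BUCKETS[1]
    if age_days < 730:
        return BUCKETS[2]
    return BUCKETS[3]


def daily_commits_by_age(commits):
    """Count commits per day, split by how long each account has been around.

    commits: iterable of (account, date) pairs covering the project's history.
    """
    first_seen = {}
    per_day = defaultdict(Counter)
    # Process in date order so the first time we meet an account is its first commit.
    for account, day in sorted(commits, key=lambda commit: commit[1]):
        first_seen.setdefault(account, day)
        per_day[day][bucket((day - first_seen[account]).days)] += 1
    return per_day


# Hypothetical history: one long-standing account and one newcomer.
history = [
    ("oldie", date(2008, 1, 1)),
    ("newbie", date(2010, 5, 1)),
    ("oldie", date(2010, 6, 1)),
    ("newbie", date(2010, 6, 1)),
]
counts = daily_commits_by_age(history)
```

Each day's `Counter` can then be written out as one row of a data file for plotting.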
So let’s start by taking a look at KDE SVN for 2010:
 Daily commits KDE SVN 2010 (click to enlarge)
There are a few things to note here:
- The overall downward trend in daily commits (probably caused by people switching to git);
- The vast majority of work is being conducted by people who have been in KDE for over 2 years.
There is probably a good reason why this group is responsible for the majority of the commits; the most obvious reason being that it is the largest group. It is also possible to posit that people in this group, on average, have a higher commit rate (I would argue this is less likely though). I will do some more on this point later…
Having shared this plot with someone earlier they came up with an interesting suggestion… Perhaps the people who have been around longest are the most resistant to switching to git. This is an interesting thought and easy-enough to test. Let’s look at the same plot for 2009:
 Daily commits KDE SVN 2009 (click to enlarge)
Just by looking you can see that, overall, the KDE SVN commit rate has dropped between 2009 and 2010; almost certainly caused by migration to git. But are the golden oldies really the ones holding on to SVN the tightest? Actually, no.
Between 2009 and 2010 the change in average daily commit rate, per “age” category was as follows (roughly):
- < 6 months: -28%
- < 12 months: -17%
- < 24 months: -7%
- > 24 months: -13%
There is nothing particularly significant in this (statistically or otherwise). There goes that theory.
One last thing that I think is worth mentioning. Take a look at the < 6 months commits for 2010. Notice a growth pattern around the Summer? I think you need to look really carefully to see the same in 2009. Still, I think there is some indication here of the impact of Google Summer of Code and Season of KDE.
Paul is blogging? He must be delayed in an airport lounge again.
So, my previous work on Oracles was a starting point on a long journey. The destination? Being able to automatically identify who the “core team” of SCM contributors to a project are. Not particularly easy. Before I expand on the idea of how we get at the core team, I want to throw more data into the mix. Analysing more of the same type of data, after a certain point, will rarely get you to a better answer; sometimes we need something new. Different data. In this case, I want to expand how representative my graphs are of what goes on inside the SCM by also including the artefacts… files.
Those of you with a good memory will remember that I have briefly looked into this before. For those who do not remember, let’s take a look at a visualisation in action. In this graph the nodes are either artefacts (blue) or committers (orange):
Immediately there are a few things we can note:
- It looks like a firework display! (Perhaps not all that important.)
- Disconnected graph (Some committers do not appear fully-integrated in this log. We’ll save this for another day.)
- Clustering of artefacts (This is the interesting bit, for now!)
When we look at this particular example, we see there is very little sharing of artefacts. Maybe everyone works in their own branch? Maybe everyone hates each other? Maybe the codebase has been lovingly modularised and the community with it? (This is highly likely and is caused by something called “Conway’s Law”).
Here is where things get a little messy when it comes to identifying the core team… We cannot make assumptions about these clusters, but their presence is important. Allen Winter (the Kaiser Soze of KDE PIM) creates a branch to work on a particularly fiddly task; Thomas McGuire (the boy-wonder of KDE PIM) creates a release branch. To my scripts these look the same (i.e. both Allen and Thomas appear to work on a lot of artefacts alone). Along comes David Faure (of course); he does not create a branch at all, he actually just has a module of code that he works on and no-one else touches (because only David has the brains for this)… Again, we see an orange node with a cluster of blue nodes around.
Long story short, it is non-trivial to automatically assess what is causing the clustering. Either way, it is always the result of something going on in the SCM related to either personal or project process. So I will continue to treat them equally when I do the maths and visualisations.
Now what happens when this type of representation is run through the Oracle of Ervin? Well, I don’t know (there are far too many nodes here, it would probably take days to process this). But there is certainly something we can predict, just by thinking about this… The central node, as revealed by the Oracle of Ervin, might actually be an artefact and not a committer.
Now there’s a thought! More on that another day.
However, I will leave you with a question. Look again at the image above. I have deliberately not told you a couple of things:
- The community being visualised here (not particularly relevant).
- How long a period of time the visualisation represents (very relevant).
Imagine the log is one day, or a month, or a year, or one release cycle… Does your interpretation of the clustering change?
About me
Trained as a software engineer and specialising in process management, Dr. Paul J. Adams has worked in both academia and industry as a researcher and project manager, covering a variety of Free Software-related topics. Today, he is Chief Operating Officer for Kolab Systems AG.
In 2009 he worked for Zea Partners conducting research on behalf of the commercial community involved in Zope and Plone development and services. Prior to this he worked as a research and project manager for Sirius Corporation in the UK. Paul graduated in 2004 as a Software Engineer from the University of Durham, UK. His subsequent doctorate was conducted between 2005 and 2009 at the University of Lincoln.
Paul was awarded Chartered IT Professional status in 2008 and is a full professional member of the British Computer Society (for whom he is co-founder and former chairman of the Open Source Specialist Group), the IEEE, as well as of KDE e.V. and the Fellowship of the FSFE.