[If you are not familiar with the English idiom "When push comes to shove" you can read more here.]
For some time I have been hesitant to start publishing data about usage of Git. You see, when a community changes a tool as fundamental as the SCM it will need to change its processes (to some degree). Of course, this is often the reason why the SCM has been switched. It is also the first reason why it is difficult to compare SVN data from “before” to Git data from “after”. Reason 2 is that the two systems work in very different ways. A commit in a DVCS is very different from a commit in a centralised system. It is probably the “push” that is more comparable. Right? Right??
Let’s take a look at the daily commits for KDEPIM:
KDEPIM switched to Git on 28 January, 2011 (or thereabouts). Before this date the average number of daily commits was 16 (14 in the month prior); after it, the average drops to 11. I’m sure the KDEPIM community is not crying into its collective beer tonight. Here’s why:
- Human factor: The initial large drop in commit rate could easily be caused by people needing to learn how to use Git properly.
- Process factor: Git allows the user to squash multiple commits into one.
The change of tool will always have human and process impacts. Here I have suggested just one of each; there are many more. But these factors plausibly explain my concern with coming forward with Git data… It is up to me to make it absolutely clear why (or potentially why) the figures change in the way that they do. Whilst the need for education and commit squashing are two factors that might apply to any project, the factors that actually apply can only really be revealed by those directly involved.
So what can we conclude? Two things:
- The impact of the switch to Git can be shown in the measurement of something as simple as daily commits;
- Watching the new trends develop over time is going to be fascinating.
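For anyone wanting to try the before/after comparison on their own project, here is a minimal sketch of how an "average daily commits" figure could be computed either side of a switch date. The function name and the sample dates are made up for illustration; this is not the script behind the numbers above, and it assumes that days with no commits count as zero.

```python
from datetime import date

def average_daily_commits(commit_dates, start, end):
    """Mean commits per calendar day over [start, end); empty days count as zero."""
    days = (end - start).days
    if days <= 0:
        raise ValueError("end must be after start")
    in_range = sum(1 for d in commit_dates if start <= d < end)
    return in_range / days

# Hypothetical sample data: one date per commit.
switch = date(2011, 1, 28)
commits = (
    [date(2011, 1, 26)] * 3 + [date(2011, 1, 27)] * 5    # before the switch
    + [date(2011, 1, 28)] * 2 + [date(2011, 1, 29)] * 4  # after the switch
)
before = average_daily_commits(commits, date(2011, 1, 26), switch)
after = average_daily_commits(commits, switch, date(2011, 1, 30))
print(before, after)
```

Whether a squashed Git commit should count the same as an SVN commit is exactly the open question; the code only counts what the log reports.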
OK, now that KDE is 15 years old, it is time for my work to grow up and start looking at git. One of the questions I get asked from time to time is how much code rewriting I will need to do in order to work with git. Thankfully… none.
All of my scripts parse SVN logs and it is easy enough to get git to give back logs in SVN format. Just like this:
git log --reverse --format="<logentry revision=\"%H\">%n <author>%ae</author>%n <date>%ci</date>%n</logentry>"
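To give a flavour of what happens to that output next, here is a small sketch in Python that parses entries in this shape. It is an illustration only, not one of my actual scripts: the function names are hypothetical, it assumes git is installed, and it assumes author e-mail addresses contain no characters that would need XML escaping (such as `&`). Note that the git output has no root element, so the sketch wraps it in one before parsing.

```python
import subprocess
import xml.etree.ElementTree as ET

GIT_FORMAT = (
    '<logentry revision="%H">%n <author>%ae</author>%n'
    ' <date>%ci</date>%n</logentry>'
)

def parse_log_entries(raw):
    """Wrap the entries in a root element; return (revision, author, date) tuples."""
    root = ET.fromstring("<log>" + raw + "</log>")
    return [
        (e.get("revision"), e.findtext("author"), e.findtext("date"))
        for e in root.iter("logentry")
    ]

def read_git_log(repo_path):
    """Run git log inside repo_path and parse the result."""
    raw = subprocess.run(
        ["git", "log", "--reverse", "--format=" + GIT_FORMAT],
        cwd=repo_path, check=True, capture_output=True, text=True,
    ).stdout
    return parse_log_entries(raw)

# Hypothetical sample in the same shape as the git output.
sample = (
    '<logentry revision="abc123">\n <author>dev@example.org</author>\n'
    ' <date>2011-01-28 10:00:00 +0000</date>\n</logentry>\n'
)
entries = parse_log_entries(sample)
print(entries)
```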
So as a first brief experiment with git, here is the result of generating the green blobs for KDEPIM (click to enlarge):
So, I thought I would take a quick look at what KDE community “looks” like after 15 years under development. So here I will briefly show off three visualisations with no particular comment. I will just leave them here for your amusement.
So let’s start with the now-infamous green blobs (click to enlarge):
For the uninitiated, a quick lesson: Each column in this visualisation represents the commit history of everyone who has committed to KDE SVN. Each row represents a week, with the most recent weeks being at the top. If the contributor committed during that week, they get a green blob, otherwise it is left empty. For each column the committer, the date of their first commit and the % of weeks in which they committed (of those they /could/) is given.
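For the curious, the numbers behind each column could be derived roughly as follows. This is a sketch under my own assumptions rather than the real plotting code: weeks are taken to start on Monday, the names are hypothetical, and every commit is assumed to fall on or before `last_day`.

```python
from collections import defaultdict
from datetime import date, timedelta

def week_start(d):
    """Monday of the week containing d."""
    return d - timedelta(days=d.weekday())

def weekly_activity(commits, last_day):
    """commits: iterable of (committer, date).

    Returns {committer: (first_commit, percent_of_possible_weeks_active)}.
    """
    weeks = defaultdict(set)
    first = {}
    for who, d in commits:
        weeks[who].add(week_start(d))  # one "green blob" per active week
        if who not in first or d < first[who]:
            first[who] = d
    result = {}
    for who, active in weeks.items():
        # Weeks in which this person could have committed, first week included.
        possible = (week_start(last_day) - week_start(first[who])).days // 7 + 1
        result[who] = (first[who], 100.0 * len(active) / possible)
    return result

# Hypothetical sample data.
sample = [
    ("alice", date(2010, 1, 4)), ("alice", date(2010, 1, 5)),
    ("alice", date(2010, 1, 18)), ("bob", date(2010, 1, 27)),
]
stats = weekly_activity(sample, last_day=date(2010, 1, 31))
print(stats)
```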
You might remember from my last blog post that I charted the growth in the number of accounts in KDE SVN. With such a steady growth in contributors, should we expect something similar in the daily commits and committer trends? Of course we should…
- Daily Commits (click to enlarge):
I will admit that I have doctored this data ever-so-slightly in order to filter out the days on which a script went crazy and created 1000s of commits by itself.
- Daily Committers (click to enlarge):
So there you have it, 15 years of KDE development reduced to just three pictures! Of course, I could try and do 1000 more visualisations of the work in KDE SVN and still get nowhere near to telling the whole story. As the commits and committers plots show, KDE git really is the place to be. It is incredible how quickly contributions to KDE SVN have dropped to circa 2001 levels.
So, a big “congratulations” to my chums in the KDE community. Happy birthday and all the best for the next 15 years!
So, in the not too distant future KDE will turn 15 years old. This is normally a time when I will go back and reflect on lessons that can be learned from past activities in the SCM. This year is no different.
After my last blog post I was asked about the history of how many people had committed to KDE. So, for your viewing pleasure:
Number of KDE committers (cumulative) since project foundation (click to enlarge).
This plot shows, for each day, the number of accounts that had committed to the SCM up to and including that day. As you can see, lately the growth rate is starting to tail off. Again, the most sensible hypothesis is that new contributors are increasingly using git rather than SVN.
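A cumulative series like this is simple to build once you know each account's first commit date. The sketch below shows one way of doing it; the function name and the sample data are invented for the example, and it is not the code that produced the plot.

```python
from datetime import date, timedelta

def cumulative_committers(commits):
    """commits: iterable of (committer, date).

    Returns [(day, accounts_seen_up_to_and_including_day)] for every day
    between the first and the last commit.
    """
    commits = list(commits)
    first = {}
    for who, d in commits:
        if who not in first or d < first[who]:
            first[who] = d
    if not first:
        return []
    new_per_day = {}
    for d in first.values():
        new_per_day[d] = new_per_day.get(d, 0) + 1
    day = min(first.values())
    end = max(d for _, d in commits)
    series, total = [], 0
    while day <= end:
        total += new_per_day.get(day, 0)  # only first-time committers add to the total
        series.append((day, total))
        day += timedelta(days=1)
    return series

# Hypothetical sample data.
sample = [
    ("a", date(2010, 1, 1)), ("b", date(2010, 1, 1)),
    ("a", date(2010, 1, 2)), ("c", date(2010, 1, 3)),
]
series = cumulative_committers(sample)
print(series)
```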
So, in my previous blog post, I talked a little about how we can show if it is the newcomers or the “oldies” that are the most active contributors to KDE SVN. Let’s jog our memories by taking another look at the 2010 data I previously posted:
Daily commits in 2010 (click to enlarge)
What are we looking at here? This shows, for each day in 2010, the number of daily commits in KDE SVN. Each day is colour coded to show the commits made by those contributors who have been “around” for less than 6 months since their first commit, less than one year, less than two years and more than two years.
It is plain to see that the contributors who have been involved for more than two years are contributing the most (commits per day). Now the question I left us with last time was: Is this because there are a larger number of committers in this category or is the commit rate just higher?
Let’s take a look, starting with commit rate:
Average daily commits, in 2010, per committer in each category (click to enlarge)
This plot shows, per day in 2010, the average number of commits made by the people in each involvement category. It is mostly a mass of dots, right? But what it does tell us is that the people who have been around the longest do not have a massively increased commit rate above the others. In fact, the data behind the plots shows that on an average day, the average commit rates are as follows:
- < 6 months: 3
- < 1 year: 3
- < 2 years: 4
- > 2 years: 5
This would lead us to believe it must simply be that there are more active people in the “> 2 years” category. So let’s take a look at the number of daily committers per category per day in 2010:
Daily committers, in 2010, per committer category (click to enlarge)
So there you have it! Clearly, there are simply more contributors fitting into the “> 2 years” category. On an average day in 2010, the number of committers in each category was:
- < 6 months: 10
- < 1 year: 8
- < 2 years: 14
- > 2 years: 54
So, here’s a question for you all: Does it feel a little odd to you that the committers who have been around for less than 6 months have similar commit rates to those who have been around for more than 2 years?
So Lydia asked me about having slightly more fine-grained information about daily commits. She pointed me to this video which, at the 15-minute mark, has a visualisation of people contributing to Wikipedia. This visualisation reveals how long people have been contributing to the community.
So, as a distraction from my work on detecting core team members, I coded a tool that does the following:
- Scans the entire log for a project’s history and finds the date upon which an SVN account commits for the first time;
- Per day finds the number of commits by people who have been “around” for less than 6 months, less than 12 months (but more than 6), less than 2 years (but more than 1) and those around for more than 2 years;
- Plots this data for arbitrary time periods using gnuplot.
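The first two steps can be sketched in a few lines of Python. Treat this as an approximation of the idea rather than the tool itself: the day counts used for the 6, 12 and 24 month boundaries are my own assumption here, the names are hypothetical, and the gnuplot step is left out.

```python
from collections import defaultdict
from datetime import date

# Category boundaries in days; 183/365/730 are assumed stand-ins for 6/12/24 months.
CATEGORIES = [(183, "< 6 months"), (365, "< 1 year"), (730, "< 2 years")]

def category(age_days):
    """Map the time since a committer's first commit onto an involvement category."""
    for limit, label in CATEGORIES:
        if age_days < limit:
            return label
    return "> 2 years"

def commits_by_age(commits):
    """commits: list of (committer, date). Returns {day: {category: commit_count}}."""
    first = {}
    for who, d in commits:  # pass 1: first commit date per account
        if who not in first or d < first[who]:
            first[who] = d
    table = defaultdict(lambda: defaultdict(int))
    for who, d in commits:  # pass 2: bucket each commit by the committer's "age"
        table[d][category((d - first[who]).days)] += 1
    return table

# Hypothetical sample data.
sample = [
    ("oldie", date(2007, 1, 1)),
    ("oldie", date(2010, 6, 1)), ("oldie", date(2010, 6, 1)),
    ("newbie", date(2010, 5, 1)), ("newbie", date(2010, 6, 1)),
]
table = commits_by_age(sample)
print(dict(table[date(2010, 6, 1)]))
```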
So let’s start by taking a look at KDE SVN for 2010:
Daily commits KDE SVN 2010 (click to enlarge)
There are a few things to note here:
- The overall downward trend in daily commits (probably caused by people switching to git);
- The vast majority of work is being conducted by people who have been in KDE for over 2 years.
There is probably a good reason why this group is responsible for the majority of the commits; the most obvious reason being that it is the largest group. It is also possible to posit that people in this group, on average, have a higher commit rate (I would argue this is less likely though). I will do some more on this point later…
Having shared this plot with someone earlier, they came up with an interesting suggestion… Perhaps the people who have been around longest are the most resistant to switching to git. This is an interesting thought and easy enough to test. Let’s look at the same plot for 2009:
Daily commits KDE SVN 2009 (click to enlarge)
Just by looking you can see that, overall, the KDE SVN commit rate has dropped between 2009 and 2010; almost certainly caused by migration to git. But are the golden oldies really the ones holding on to SVN the tightest? Actually, no.
Between 2009 and 2010 the change in average daily commit rate, per “age” category was as follows (roughly):
- < 6 months: -28%
- < 12 months: -17%
- < 24 months: -7%
- > 24 months: -13%
There is nothing particularly significant in this (statistically or otherwise). There goes that theory.
One last thing that I think is worth mentioning. Take a look at the < 6 months commits for 2010. Notice a growth pattern around the Summer? I think you need to look really carefully to see the same in 2009. Still, I think there is some indication here of the impact of Google Summer of Code and Season of KDE.
Paul is blogging? He must be delayed in an airport lounge again.
So, my previous work on Oracles was a starting point on a long journey. The destination? Being able to automatically identify who the “core team” of SCM contributors to a project are. Not particularly easy. Before I expand on the idea of how we get at the core team, I want to throw more data into the mix. Analysing more of the same type of data, after a certain point, will rarely get you to a better answer; sometimes we need something new. Different data. In this case, I want to expand how representative my graphs are of what goes on inside the SCM by also including the artefacts… files.
Those of you with a good memory will remember that I have briefly looked into this before. For those who do not remember, let’s take a look at a visualisation in action. In this graph the nodes are either artefacts (blue) or committers (orange):
Immediately there are a few things we can note:
- It looks like a firework display! (Perhaps not all that important.)
- Disconnected graph (Some committers do not appear fully-integrated in this log. We’ll save this for another day.)
- Clustering of artefacts (This is the interesting bit, for now!)
When we look at this particular example, we see there is very little sharing of artefacts. Maybe everyone works in their own branch? Maybe everyone hates each other? Maybe the codebase has been lovingly modularised and the community with it? (This is highly likely and is caused by something called “Conway’s Law”.)
Here is where things get a little messy when it comes to identifying the core team… We cannot make assumptions about these clusters, but their presence is important. Allen Winter (the Keyser Söze of KDE PIM) creates a branch to work on a particularly fiddly task; Thomas McGuire (the boy-wonder of KDE PIM) creates a release branch. To my scripts these look the same (i.e. both Allen and Thomas appear to work on a lot of artefacts alone). Along comes David Faure (of course); he does not create a branch at all, he actually just has a module of code that he works on and no-one else touches (because only David has the brains for this)… Again, we see an orange node with a cluster of blue nodes around.
Long story short, it is non-trivial to automatically assess what is causing the clustering. Either way, it is always the result of something going on in the SCM related to either personal or project process. So I will continue to treat them equally when I do the maths and visualisations.
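To make the committer/artefact representation a little more concrete, here is a rough sketch of how such a two-sided graph could be built from (committer, path) pairs, and how the artefacts that only one person touched might be picked out. The function names and sample paths are hypothetical. As discussed above, the code cannot tell you why a cluster exists; it only shows that it does.

```python
from collections import defaultdict

def bipartite_graph(changes):
    """changes: iterable of (committer, path).

    Returns (files_by_committer, committers_by_file), i.e. both sides of the graph.
    """
    files_by_committer = defaultdict(set)
    committers_by_file = defaultdict(set)
    for who, path in changes:
        files_by_committer[who].add(path)
        committers_by_file[path].add(who)
    return files_by_committer, committers_by_file

def solo_artefacts(changes):
    """For each committer, the artefacts nobody else touched (their 'cluster')."""
    files_by_committer, committers_by_file = bipartite_graph(changes)
    return {
        who: {p for p in paths if len(committers_by_file[p]) == 1}
        for who, paths in files_by_committer.items()
    }

# Hypothetical sample data: two private clusters and one shared file.
sample = [
    ("dev_a", "branches/work/a.cpp"), ("dev_a", "branches/work/b.cpp"),
    ("dev_b", "branches/release/c.cpp"),
    ("dev_a", "trunk/shared.cpp"), ("dev_b", "trunk/shared.cpp"),
]
clusters = solo_artefacts(sample)
print(clusters)
```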
Now what happens when this type of representation is run through the Oracle of Ervin? Well, I don’t know (there are far too many nodes here; it would probably take days to process this). But there is certainly something we can predict, just by thinking about this… The central node, as revealed by the Oracle of Ervin, might actually be an artefact and not a committer.
Now there’s a thought! More on that another day.
However, I will leave you with a question. Look again at the image above. I have deliberately not told you a couple of things:
- The community being visualised here (not particularly relevant).
- How long a period of time the visualisation represents (very relevant).
Imagine the log is one day, or a month, or a year, or one release cycle… Does your interpretation of the clustering change?
So I am sitting in the lounge at Newark Liberty airport and I am not entirely happy with my last blog post… which I wrote in the lounge at Chicago. Not happy because, whilst I addressed his point, I did not go the extra mile and show Rolf how the Oracle tool works in the context of KGPG. After all, the reason I do what I do is because I want to help contributors understand their communities better and Rolf had made some assumptions that might turn out to be incorrect.
So, first of all, let’s take a look at the community for KGPG in 2010 (I gathered the SVN log for the entire year):
For the purpose of this exercise we will ignore dfaure (sorry, dude) and scripty and focus on the connected developer community. If we run this through the Oracle script we find the following:
- dakon 1.0
- yurchor 1.5
- mlaurent 1.125
- coles 1.375
- kossebau 1.5
- cfeck 1.375
- woebbe 1.625
So, as it turns out, Rolf is bang in the middle of the KGPG community with a score of 1; we might have guessed this. Now the links between Rolf and everyone else might not necessarily indicate real-time collaboration, but they do show that he worked on the same files as other people in the KGPG community within the space of a year (just 2 releases).
Now perhaps none of this is news to Rolf. But, if nothing else, this little exercise has just gone to prove that dfaure really does get everywhere in KDE.
So, Rolf asks “Who needs oracles?” Sadly his blog does not appear to allow me to respond in comments, so I will post my thoughts here… It is a perfectly valid question after all.
So let me respond to a few things that Rolf raises and then come back to answering the question:
First, most of the things you would get from this is at least wrong for some KDE modules.
Not true. I am producing mathematical models of the development community using graphs. Not an uncommon approach to issues of social networking. It is equally possible to further model aspects of these models using equations. This is the beauty of maths. But this is also the downside… Models are, of course, abstractions. So information is lost. Think about the London Underground map. How do you get from South Wimbledon to Wimbledon? The map is great for its purpose (telling people how to get around the London Underground) but it is missing crucial information for other purposes (to get from South Wimbledon to Wimbledon you should either walk or take the bus). As an aside… If you are a fan of the London Underground map, you should check out the London Connections map which does include other forms of public transport.
What is important here is that we (those creating the models) explain how the model works and why it is fit for the purpose…. More on the purpose later.
When you build that oracle for KGpg you will find some connections. But looking back at the last two years or so I only remember one recent commit from Burkhard that was the result of collaboration.
Perfect! The model works! It will tell me that those collaborators are collectively the Oracle. Yes, there is a risk that there is one contributor who only ever works alone. This is a small risk inherent in the model (the model assumes that Free Software development communities are based on open collaboration).
This sort of “oracle” has actually some practical meaning
Some, sure. But mine is an entirely different network with an entirely different purpose. That by no means indicates that my model is either without purpose or not fit for that purpose.
The idea is to show if there are possible paths from you to other keys.
Very useful. But is this any more useful than knowing how the development community is structured to, say, those interested in the management of the community?
Rolf (and everyone else for that matter), if you are confused about the structure of my work and its purpose, please go back and re-read my previous blog post. Most importantly note that I explicitly state what the purpose of this work is:
Why am I bothering with this? Well, if we wanted to somehow automatically find the “core team” of a Free Software community, the most connected contributor makes a good starting point.
Do I say that my approach is perfect and finds the one true core team? No. It is just a good starting point.
So, in my previous blog post I told you about the Oracle of ERVIN. This is a Python script that I have written and which determines who is the most connected person within a Free Software community. As per usual this is all achieved by parsing SCM logs.
So, to recap:
- The tool creates a community graph. Nodes are committers and an edge between two nodes shows that both committers have committed to at least one file in common;
- The “most connected” person is the contributor who has the shortest average path to all other committers. So, to be clear, for each developer we find out how many steps it takes to get to each other committer and then we take an average of all of those.
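The recap above can be sketched in a few lines. This is a simplified illustration and not the Oracle of ERVIN script itself: the function names are made up, the path lengths are found with a plain breadth-first search, and the average is taken only over committers that can actually be reached (my assumption for handling disconnected parts of the graph, which means a committer in a tiny isolated group can look better connected than they are).

```python
from collections import defaultdict, deque
from itertools import combinations

def community_graph(changes):
    """changes: iterable of (committer, path). Edge = at least one file in common."""
    by_file = defaultdict(set)
    for who, path in changes:
        by_file[path].add(who)
    graph = {}
    for committers in by_file.values():
        for who in committers:
            graph.setdefault(who, set())  # keep lone committers as nodes too
        for a, b in combinations(sorted(committers), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def average_distance(graph, source):
    """Mean number of steps from source to every other reachable committer."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    others = [d for n, d in dist.items() if n != source]
    return sum(others) / len(others) if others else float("inf")

def oracle(graph):
    """Return (most connected committer, all scores); the lowest score wins."""
    scores = {who: average_distance(graph, who) for who in graph}
    return min(scores, key=scores.get), scores

# Hypothetical sample data: "b" shares a file with everyone else.
sample = [
    ("a", "f1"), ("b", "f1"),
    ("b", "f2"), ("c", "f2"),
    ("b", "f3"), ("d", "f3"),
]
best, scores = oracle(community_graph(sample))
print(best, scores)
```

In this toy example "b" scores 1.0 because every other committer is one step away, which is the same shape as a score of 1 in a small, tightly knit module.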
The tool was called the Oracle of ERVIN as a show of respect to the Oracle of Bacon. ERVIN was specifically chosen because when I ran the script against a log for KDELIBS for 2010, the most connected committer was ERVIN.
Now, I left you all with a question: Who was the most connected KDE contributor (the entire SVN-based community) in 2010? Let’s start with ERVIN, he was the most connected KDELIBS dude after all…
- ERVIN – 1.78926174497 (just in KDELIBS he was 1.6137931034482758)
So, the average number of steps ERVIN needs to link to any other contributor is only ~1.8. Impressive.
But not impressive enough for you! You had some other names in mind:
- ASEIGO – 1.75167785235
- DFAURE – 1.56510067114
- PINO – 1.54228187919
- MLAURENT – 1.5288590604
- SCRIPTY – 1.50201342282
Kudos goes to Sune who correctly identified that Albert was actually the most connected person inside KDE:
Sorry, Sune, no prize. Whilst you correctly identified Albert, you put your money on Pino (who was also a good shout).
So what is interesting here?… Well, you tell me. Here is what I find interesting:
- Forget 6 degrees of separation, within KDE the average is actually 2.201671. Although this is probably a function of the size of the community… (Note how ERVIN’s connectedness was better within the smaller KDELIBS community?)
- AACID has a better connectedness within the entire KDE community than ERVIN had inside just KDELIBS (given that both he and ERVIN have good overall connectedness we should assume that either AACID is a networking madman, or the KDELIBS community needs a little “tightening up”).
- Members of the community had a good “feeling” for who would be some of the best-connected contributors.
Finally, and because it was specifically asked for, here is the distribution of the KDE community’s “connectedness”:
Why am I bothering with this? Well, if we wanted to somehow automatically find the “core team” of a Free Software community, the most connected contributor makes a good starting point. This is by no means perfect: it is possible to “game” the system if you know this measurement is being made and, even if you are not aware that the measurement is being made, you can “accidentally” place yourself at the center of a community just by committing into certain files.