I recently learned that Mike Kuketz (a popular German privacy blogger) takes the interesting stance that Threema is better at protecting your metadata than XMPP, so I felt the need to write down some thoughts on metadata in instant messaging and XMPP specifically.
With end-to-end encryption becoming popular in instant messaging, and since we learnt that intelligence agencies kill people based on metadata, it has become more important to look into the privacy of metadata. For the rest of this post, I am assuming that our instant messaging system uses proper end-to-end encryption for the message content and that the only thing left is metadata.
Transient metadata in instant messaging
Transient metadata is metadata that exists temporarily but does not need to be persisted. It is important to note that if the system is centralized or federated, the user needs to trust third-party nodes in the network not to persist this transient metadata. Operators of larger centralized systems are typically ordered by a government agency or court to store certain data and not to inform the public about it (especially if those services are run by companies or individuals headquartered in countries that don’t value privacy much, like the United States, Russia or China).
When sending a message, the following metadata inevitably exists somewhere:
- Network identity (IP address, hostname) of sender, recipient and middle nodes, if any
- Recipient user identity (such that the system knows where to deliver the message)
- Date and time the message was sent and received
On top of that, the sender’s user or cryptographic identity is often revealed or deducible. That is not strictly necessary, but it is very useful for spam protection. Most notably, Signal makes an effort to hide the sender’s identity, but it can still be deduced from the remaining metadata.
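To make this more tangible, here is a minimal Python sketch of what a relaying server can observe for a single end-to-end encrypted message. The class, function and field names are made up for illustration and do not correspond to any particular protocol.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ObservedEnvelope:
    """What a relaying server can see; all names are hypothetical."""
    sender_ip: str                   # network identity of the sender
    recipient_id: str                # needed to route the message
    received_at: datetime            # when the server accepted the message
    ciphertext: bytes                # opaque thanks to end-to-end encryption
    sender_id: Optional[str] = None  # not strictly needed, but often present


def visible_metadata(envelope: ObservedEnvelope) -> dict:
    """Return everything the server learns besides the encrypted content."""
    return {
        "sender_ip": envelope.sender_ip,
        "recipient_id": envelope.recipient_id,
        "received_at": envelope.received_at.isoformat(),
        "sender_id": envelope.sender_id,
    }
```

Whether any of these values outlive the delivery of the message is purely a decision of whoever runs the server.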
Many instant messaging systems allow you to see if your chat partner is currently online or actively using the chat client. Such a feature requires that the network identity and user identity are linked with each other for as long as the user is online using a certain network identity. Additionally, if the online status should be visible not to everyone but only to a certain list of users (like friends), the server also needs that list (or at least needs to be able to deduce it).
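A rough sketch of the state such an online-status feature needs on the server side could look like this. It is an assumption-laden toy, not how any specific messenger implements it; `PresenceTracker` and its methods are hypothetical names.

```python
from typing import Dict, Optional, Set


class PresenceTracker:
    """Hypothetical in-memory presence state of a chat server."""

    def __init__(self) -> None:
        self._online: Dict[str, str] = {}        # user id -> current IP address
        self._allowed: Dict[str, Set[str]] = {}  # user id -> who may see the status

    def allow(self, user: str, watcher: str) -> None:
        """Record that watcher may see the online status of user."""
        self._allowed.setdefault(user, set()).add(watcher)

    def connect(self, user: str, ip: str) -> None:
        # The link between user identity and network identity starts here...
        self._online[user] = ip

    def disconnect(self, user: str) -> None:
        # ...and ends here. Nothing forces the server to keep it any longer.
        self._online.pop(user, None)

    def is_online(self, user: str, asked_by: str) -> Optional[bool]:
        """Return None if asked_by is not allowed to see the status."""
        if asked_by not in self._allowed.get(user, set()):
            return None
        return user in self._online
```

Both dictionaries only have to exist in memory while the user is connected, which is what makes this metadata transient rather than persistent.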
In his blog post, Mike lists a lot of things XMPP servers can log. Yet he forgets to mention that pretty much the same can also be logged on the servers of any other messaging service.
Persistent metadata in instant messaging
Most modern instant messaging systems allow others to send you messages while you are offline; those are delivered as soon as you get back online. For this to work, some entity has to persist the message for as long as you are offline. Usually that’s the centralized server or your selected server in a federated system. In systems that support multiple devices, messages are usually persisted until all devices of a user have fetched them, which may result in them being stored indefinitely (or until a certain timeout is reached) if you stop using one of your client devices.
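The following Python sketch illustrates this behavior under simplifying assumptions: a single in-memory store, device names as plain strings and an optional timeout. `PendingStore` and everything in it is hypothetical.

```python
import time
from typing import List, Optional, Set


class PendingStore:
    """Hypothetical server-side store for messages not yet fetched everywhere."""

    def __init__(self, max_age_seconds: Optional[float] = None) -> None:
        self.max_age_seconds = max_age_seconds  # None means "keep indefinitely"
        self._messages: List[dict] = []

    def store(self, ciphertext: bytes, devices: Set[str],
              now: Optional[float] = None) -> None:
        self._messages.append({
            "ciphertext": ciphertext,
            "waiting": set(devices),  # devices that still need the message
            "stored_at": time.time() if now is None else now,
        })

    def fetch(self, device: str) -> List[bytes]:
        """Hand out all messages this device has not fetched yet."""
        delivered = []
        for message in self._messages:
            if device in message["waiting"]:
                message["waiting"].discard(device)
                delivered.append(message["ciphertext"])
        # A message can be dropped once no device is waiting for it anymore.
        self._messages = [m for m in self._messages if m["waiting"]]
        return delivered

    def expire(self, now: Optional[float] = None) -> None:
        """Drop messages older than the timeout, if one is configured."""
        if self.max_age_seconds is None:
            return
        now = time.time() if now is None else now
        self._messages = [
            m for m in self._messages
            if now - m["stored_at"] <= self.max_age_seconds
        ]

    def __len__(self) -> int:
        return len(self._messages)
```

If a laptop never calls `fetch` again and no timeout is configured, the message stays in the store forever, which is exactly the situation described above.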
XMPP is one of those protocols supporting multiple devices, and it comes with two kinds of message stores. The first one, known as “offline messages”, stores messages for the first device to come online if no device is online. The second one, known as “Message Archive Management” (MAM), stores messages for a predefined time (sometimes unlimited), so that any device that doesn’t have them yet can fetch them when coming back online. If users don’t need it, they can turn MAM off.
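The difference between the two stores can be sketched in a few lines of Python. This is a strongly simplified model and not taken from any real XMPP server: the class and method names are hypothetical, archive expiry is left out, and real archive queries work with message IDs or timestamps rather than list indices.

```python
from typing import List, Set


class AccountStorage:
    """Hypothetical per-account storage with an offline queue and an archive."""

    def __init__(self, archive_enabled: bool = True) -> None:
        self.archive_enabled = archive_enabled
        self.online_devices: Set[str] = set()
        self.offline_queue: List[bytes] = []  # "offline messages"
        self.archive: List[bytes] = []        # "Message Archive Management"

    def receive(self, ciphertext: bytes) -> None:
        if self.archive_enabled:
            self.archive.append(ciphertext)
        if not self.online_devices:
            # Only used while no device is online at all.
            self.offline_queue.append(ciphertext)

    def come_online(self, device: str) -> List[bytes]:
        """The first device to come online drains the offline queue."""
        self.online_devices.add(device)
        drained, self.offline_queue = self.offline_queue, []
        return drained

    def query_archive(self, after_index: int = 0) -> List[bytes]:
        """Any device can catch up on what it has not seen yet."""
        return self.archive[after_index:]
```

With `archive_enabled=False`, the only persisted messages are those waiting in the offline queue, and they are gone as soon as one device has picked them up.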
In his blog post, Mike mentions that XMPP has a message archive, probably referring to MAM, but forgets to mention that it can easily be turned on or off in Conversations, the XMPP client he presents to his readers in the same post. Some servers even have MAM disabled by default.
Obvious and not related to messaging: in centralized and federated systems, some user data has to be stored on servers. This includes login credentials, but potentially also phone number, email address or similar PII.
To let you know which of your contacts are already using your service, many modern instant messaging services allow you to upload your contact list to a centralized server, which then tells you which of your contacts are using the messaging service. This of course requires storing identifiers like phone numbers or email addresses. Some messengers only do this contact discovery with your consent and remove all data after matching, but others store your contact list indefinitely and let you know once one of your contacts starts using their service.
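The two retention policies differ by a single line of code, as this hypothetical Python sketch shows. It is a toy model with made-up names that works on plain identifiers; real services may transform or protect the uploaded data in ways not covered here.

```python
from typing import Dict, Iterable, Set


class ContactDiscovery:
    """Hypothetical contact discovery service with two retention policies."""

    def __init__(self, registered: Iterable[str], keep_uploads: bool) -> None:
        self.registered: Set[str] = set(registered)
        self.keep_uploads = keep_uploads
        self.uploads: Dict[str, Set[str]] = {}  # uploader -> uploaded identifiers

    def match(self, uploader: str, contacts: Iterable[str]) -> Set[str]:
        """Return which of the uploaded contacts already use the service."""
        contacts = set(contacts)
        if self.keep_uploads:
            # Persisting the full list allows notifications later on,
            # but also keeps the whole address book on the server.
            self.uploads[uploader] = contacts
        return contacts & self.registered

    def register(self, identifier: str) -> Set[str]:
        """Register a new user and return who would be notified about it."""
        self.registered.add(identifier)
        return {u for u, c in self.uploads.items() if identifier in c}
```

A service created with `keep_uploads=False` can never notify anyone about a newly registered contact, because it no longer knows who uploaded that identifier.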
Some instant messaging services feature a friend or contact list. People on this friend list are handled differently by the server, e.g. they can send you push notifications, call you, see your current online status or use other features. Some messengers populate the friend list automatically (via the data retrieved from contact discovery), others don’t have one at all.
In XMPP, the contact list (called roster) is used for two purposes: it’s used to synchronize contacts and contact names across devices, and it’s used to manage presence subscriptions, which in turn decide who is going to see whether you are online or not. Of course, populating the contact list is completely optional.
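As a small illustration, a roster can be modeled as a mapping from contact address to a display name and a subscription state. The state names below follow RFC 6121, where “from” and “both” mean that the contact receives your presence; the Python types and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class RosterItem:
    name: str          # display name, synchronized across devices
    subscription: str  # "none", "to", "from" or "both", as in RFC 6121


def presence_recipients(roster: Dict[str, RosterItem]) -> Set[str]:
    """Contacts that are allowed to see whether the roster owner is online."""
    return {
        jid for jid, item in roster.items()
        if item.subscription in ("from", "both")
    }
```

An empty roster simply means that nobody receives your online status.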
In his blog post, Mike mentions that XMPP has a list of saved contacts and that server admins can access it, but forgets to mention that using it is completely optional. The Conversations client is able to handle contacts that are only stored in your local phone book and not synced to the server.
Threema vs. XMPP
As mentioned in the introduction, there seems to be an opinion out there that Threema is better at protecting metadata than XMPP. However, looking at what transient and persistent metadata exists, the two look pretty much the same, with XMPP just having optional additional features that make sense for multi-device support (which Threema doesn’t have). The only difference is that Threema claims not to log any data (which cannot be verified independently), whereas XMPP servers could log data, but you can pick the one you trust most not to do so.
In my opinion, having the option to choose the server I like (including the possibility to set up my own) is worth way more than having only a single server choice that claims to not log any data.